Exploratory data analysis is the “real world observation” of data science. In proper rigorous statistical analysis, you are supposed to have a hypothesis before you test your data, but that’s based on the idea that you’ve already observed some phenomena in the real world, and developed a hypothesis based on that. With data science, the data set and the “real world” are sometimes the same—the data you’ll eventually analyze is the thing you also have to observe to come up with questions in the first place.
So you do some exploratory analysis. Make a few simple visualizations, summarize the data, poke around, looking for fun facts. Often you learn things that wouldn’t have occurred to you without exploration, and the information gleaned can help you avoid mistakes moving forward.
This particular example was supposed to just be a self-driven exercise to practice a few core functions for basic recoding and visualizations in R. I fished around on the data.cityoftacoma.org portal, and downloaded a data set of all open business licenses issued by the city.
(For all the following analysis, I’ve attached an R Markdown report at the bottom to show the actual code used.)
I recoded the data a little to make it more usable in R: converting the BUSINESS.OPEN.DATE field to a format R recognizes as a date; turning NAICS.CODE.DESCRIPTION (basically the business type) into a factor, which makes it easier to organize businesses by type later; and eliminating a few entries with unusable dates. With that done, we can visualize by date. There isn’t much numerical data to be had, but year opened seems like it might tell us something interesting about the overall economic growth of the city. Organized by year, the plot looks like this:
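For reference, here is a minimal sketch of what that recoding could look like. The rows are a made-up stand-in for the real download, the month/day/year date format is an assumption about the raw file, and the `YEAR` column is a helper name I'm introducing here; the attached report has the code actually used.

```r
# Toy stand-in for the downloaded license data; these rows are made up.
licenses <- data.frame(
  BUSINESS.OPEN.DATE = c("01/15/2003", "03/02/2004", "07/19/2004", "", "not a date"),
  NAICS.CODE.DESCRIPTION = c(
    "Full-Service Restaurants",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Full-Service Restaurants",
    "Lessors of Other Real Estate Property"
  ),
  stringsAsFactors = FALSE
)

# Assumed format (month/day/year); values that don't parse become NA.
licenses$BUSINESS.OPEN.DATE <- as.Date(licenses$BUSINESS.OPEN.DATE, format = "%m/%d/%Y")

# Business type as a factor, so it can be summarized by level later.
licenses$NAICS.CODE.DESCRIPTION <- factor(licenses$NAICS.CODE.DESCRIPTION)

# Drop the entries whose dates were unusable.
licenses <- licenses[!is.na(licenses$BUSINESS.OPEN.DATE), ]

# Hypothetical helper column: the year each license was opened.
licenses$YEAR <- as.integer(format(licenses$BUSINESS.OPEN.DATE, "%Y"))

# Count licenses per year and plot the counts.
per_year <- table(licenses$YEAR)
barplot(per_year, xlab = "Year opened", ylab = "Open licenses")
```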
For the most part this looks about right. Given the scale of the more recent years (approaching 4000), early years barely show up, but growth really starts to pick up in the 60s and appears to be exponential.
I’m sure you can see what’s weird here, though. There’s one BIG spike in the early 2000s that stands out from the years on either side of it. Let’s zoom in a little, and make it really obvious.
The spike is 2004, and it’s the highest point of any year but 2015 (which we’d expect to be the highest, as the most recent complete year). We can also see that after 2004 values are higher than we might have expected if growth had continued as normal from 2003.
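Zooming in is just a matter of subsetting the yearly counts before plotting. A rough sketch, using made-up counts and an arbitrary window of years (neither comes from the real data):

```r
# Made-up years, only so the sketch runs on its own.
years <- c(rep(2002, 3), rep(2003, 4), rep(2004, 9), rep(2005, 6))
per_year <- table(years)

# Keep a window of years around the spike; the bounds are an arbitrary choice.
yr <- as.integer(names(per_year))
zoomed <- per_year[yr >= 2003 & yr <= 2005]
barplot(zoomed, xlab = "Year opened", ylab = "Open licenses")

# Ranking the years by count makes the outlier easy to spot without a plot.
ranked <- sort(per_year, decreasing = TRUE)
head(ranked, 3)
```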
So, what’s the deal with 2004? Why is it weird? Since most of the rest of the fields are unique (business name, business owner, etc.), they won’t give us much summary information, so the next place to look is NAICS.CODE.DESCRIPTION. Because we converted it to a factor, we can get a quick summary of how many licenses of each business type were given out. I’ll just post the first few entries…
| Business type (NAICS.CODE.DESCRIPTION) | Licenses |
|---|---|
| Lessors of Residential Buildings and Dwellings | 2084 |
| Lessors of Nonresidential Bildngs | 199 |
| Commercial and Institutional Building Construction | 16 |
| Lessors of Other Real Estate Property | 15 |
…because it’s immediately clear who the culprit is: “Lessors of Residential Buildings,” a.k.a. landlords. Looking at the data for this category, the vast majority are clearly just individuals—the business name is the same as the owner name—so it seems like these are probably just private landlords renting out houses, rather than property management companies.
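A summary like that can be pulled with a couple of lines. This sketch uses made-up rows and assumes a `YEAR` column has already been derived from the open date; sorting with `sort(table(...))` is my choice here and not necessarily how the attached report does it.

```r
# Made-up rows; YEAR is a hypothetical helper column derived from the open date.
licenses <- data.frame(
  YEAR = c(2003, 2004, 2004, 2004, 2004, 2005),
  NAICS.CODE.DESCRIPTION = factor(c(
    "Full-Service Restaurants",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Full-Service Restaurants",
    "Lessors of Residential Buildings and Dwellings"
  ))
)

# Licenses opened in the suspicious year only.
in_2004 <- licenses[licenses$YEAR == 2004, ]

# Count per business type, largest first; droplevels() hides types with no rows.
type_counts <- sort(table(droplevels(in_2004$NAICS.CODE.DESCRIPTION)), decreasing = TRUE)
head(type_counts)
```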
Analytically, here we can compare percentages in the surrounding years. In 2002, 7% of issued licenses were for this kind of business. In 2003, 12%. In 2004, the number is suddenly 72%, and in the next couple years, 37% and 35%. This fits with everything we saw on the initial plot.
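One way to get those per-year percentages is to take the mean of a TRUE/FALSE flag within each year. Again a sketch on made-up rows, with `YEAR` as an assumed helper column, so the numbers it prints are not the real ones:

```r
# Made-up rows; the real shares quoted above come from the full data set.
licenses <- data.frame(
  YEAR = c(2003, 2003, 2004, 2004, 2004, 2004),
  NAICS.CODE.DESCRIPTION = factor(c(
    "Full-Service Restaurants",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Full-Service Restaurants"
  ))
)

landlord <- "Lessors of Residential Buildings and Dwellings"
is_landlord <- licenses$NAICS.CODE.DESCRIPTION == landlord

# The mean of a logical vector is the fraction of TRUE values, here per year.
share <- tapply(is_landlord, licenses$YEAR, mean)
pct <- round(100 * share)
pct
```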
In real life, I skipped this step, because my wife said something like, “maybe the laws changed and now individual landlords need business licenses?” A quick trip to Google says: yep! In January of 2004 the city of Tacoma instituted a new “Rental Business License”, and all the responsible landlords in the city rushed out to get legal.
So, with a little manipulation, a little exploration, and a little googling, mystery found and then solved! Sure seems like a lot of work for a little piece of information. But imagine if we were planning to perform some serious analysis on a data set like this. Here are some things we now know that could otherwise have made for seriously confusing results:
- 2004 has a huge spike, so any attempt to map trends needs to account for this.
- Every year from 2004 on includes a new chunk of businesses that significantly changes the makeup of the data.
- Private landlords, which our future analysis of local businesses may not care about, are a part of the data set, but mostly from 2004 onward.
- It may be worthwhile to search for other changes to Tacoma business license laws, since they clearly can impact spikes and trends.
All of this information (or information like it) can prove extremely useful in filtering down to the data you actually want to analyze, and that’s what exploratory data analysis is for.
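As a small illustration of that kind of filtering, here is a sketch that drops the landlord category and recounts licenses per year. The rows are made up, `YEAR` is an assumed helper column, and whether landlords should actually be excluded depends on the question being asked.

```r
# Made-up rows standing in for the recoded license data.
licenses <- data.frame(
  YEAR = c(2003, 2003, 2004, 2004, 2004, 2004),
  NAICS.CODE.DESCRIPTION = factor(c(
    "Full-Service Restaurants",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Lessors of Residential Buildings and Dwellings",
    "Full-Service Restaurants"
  ))
)

landlord <- "Lessors of Residential Buildings and Dwellings"

# Keep everything except the landlord category, then drop the unused level.
keep <- licenses$NAICS.CODE.DESCRIPTION != landlord
no_landlords <- droplevels(licenses[keep, ])

# Yearly counts before and after filtering, for comparison.
counts_all <- table(licenses$YEAR)
counts_filtered <- table(no_landlords$YEAR)
counts_all
counts_filtered
```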
R Markdown Report: landlord.analysis