As a still-learning data scientist (as if there were any other kind), I’m always on the lookout for interesting data to fiddle with, in my copious free time. Recently, my fiddling has turned toward the USDA’s Food Access Research Atlas, a resource for the study of food deserts in the US. In the coming months (here’s hopin’), I’ll be posting some of the work I’ve done, but first it’s worth diving into the concept a little.
A side project I am working on (more on that later if I can manage to blog more than once a year) has me needing to interact with several datasets that are broken down by census tract. This means that, among other things, I’m going to be outputting a lot of Tacoma and Pierce County map visualizations, and need to be able to outline and mark these tracts in a variety of ways.
At the end of What’s Weird About This Data, and Why?, I mentioned that the knowledge gained by analyzing one outlier could tell us where to look for others. When I first looked at the big graph, growth looked vaguely exponential. But knowing that 2004 saw a whole new business type (Individual Landlords) added, things start to seem a bit more linear in the before and after. And if that’s the case then 2015 looks like another unusual spike.
I’d originally guessed that the most recent full year would always be the biggest, because it includes businesses that might fail to materialize and let their licenses expire. But since I knew from 2004 that a new business type could lead to unusual behavior, I did a similar summary of the NAICS code descriptions (again, only showing the first few here).
| NAICS description | Licenses |
|---|---|
| Lessors of Residential Buildings | 526 |
| Lessors of Nonresidential Buildings | 86 |
Hoo boy, that is a lot of taxi services. If you’ve been paying attention, both to me and to the state of personal transport trends, you can probably guess the story: Tacoma requires all Uber drivers to have individual business licenses, just like traditional cab drivers.
It seems reasonable to expect this kind of thing to show up more and more, as middleman services like this enable individuals to easily sign up as independent contractors in a variety of industries.
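For anyone who wants to try the same kind of check, here is a minimal sketch of the per-year summary described above: filter the licenses to one year, then count how often each NAICS description appears. It is written in Python rather than the R I actually used, and everything in it is an assumption for illustration: the `licenses` rows, the `top_descriptions` helper, and the "Taxi Service" label are made up and are not the real Tacoma figures.

```python
from collections import Counter

# Hypothetical rows standing in for the license data: (year issued, NAICS description).
# These values are invented for illustration; they are NOT the Tacoma numbers.
licenses = [
    (2014, "Lessors of Residential Buildings"),
    (2015, "Taxi Service"),
    (2015, "Taxi Service"),
    (2015, "Lessors of Residential Buildings"),
    (2015, "Taxi Service"),
    (2015, "Lessors of Nonresidential Buildings"),
]


def top_descriptions(rows, year, n=3):
    """Return the n most common NAICS descriptions for one year, with counts."""
    counts = Counter(desc for yr, desc in rows if yr == year)
    return counts.most_common(n)


# Print the most common descriptions for the year that looked like a spike.
for desc, count in top_descriptions(licenses, 2015):
    print(f"{desc}: {count}")
```

If one description dominates the list for the spike year but barely shows up in the years around it, that is a decent hint that a new business type, rather than general growth, is driving the jump.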
Exploratory data analysis is the “real world observation” of data science. In proper rigorous statistical analysis, you are supposed to have a hypothesis before you test your data, but that’s based on the idea that you’ve already observed some phenomena in the real world, and developed a hypothesis based on that. With data science, the data set and the “real world” are sometimes the same—the data you’ll eventually analyze is the thing you also have to observe to come up with questions in the first place.
So you do some exploratory analysis. Make a few simple visualizations, summarize the data, poke around, looking for fun facts. Often you learn things that wouldn’t have occurred to you without exploration, and the information gleaned can help you avoid mistakes moving forward.
This particular example was supposed to just be a self-driven exercise to practice a few core functions for basic recoding and visualizations in R. I fished around on the data.cityoftacoma.org portal, and downloaded a data set of all open business licenses issued by the city.
I lived in Spokane for a couple years, and I spent a lot of time biking on distraction-free trails. During that time I marathoned every available episode of Radiolab, and since then I’ve become a bit of a podcast junkie. I don’t have much time to watch TV or read recreationally these days, so instead I listen to podcasts whenever I don’t need my brain for something—on walks, doing the dishes, etc.
Diving headfirst into data science meant, among other things, adding a few new podcasts to the list. So here’s what I’ve been listening to. (I promise there will be some technical stuff one of these days.)
The Obama administration has been pretty big for data science. They appointed the country’s first official Chief Data Scientist, and launched data.gov, a clearinghouse for government-generated public data sets. President Obama also signed an executive order declaring open and machine-readable formats as the new default for all forthcoming government information resources, guaranteeing the continuing availability of interesting data.
Beyond the practical use of generating actionable conclusions from public data, any repository of up-to-date, varied data sets is invaluable to beginners looking to hone their skills and build a portfolio of projects.
But self-directed amateur research can also present a significant roadblock. Exploratory analysis—essentially poking around in the data and learning interesting things—is important, but the real skill of an employable data scientist is the ability to frame and answer a well-defined question.
I’m Joe. I’m the Senior Systems Administrator at a small web design company, in Tacoma, WA, which basically means I live torn between keeping the servers that host our websites running, and finding the time to make them better. I also run the databases, and try to be the resident expert whenever complex queries are needed to extract useful reports.
I’m also a newly minted graduate student in the Masters of Information & Data Science (MIDS) program at UC Berkeley. I’ll get more into broader questions of what exactly data science “is” later, but basically it means I’ll be spending the next 20 (19 now I suppose, got a late start on the blog) months learning how to find interesting data, learn from it, analyze it, and help people make good decisions based on it.
Along the way, I know I will need more practice than just the coursework will give me, and that’s what this blog is for. I want a place to gather my thoughts as I learn, and a place to talk about side projects or little problems I’m working on.
I’ll say right here that I’m not an expert. I don’t have a rich background in data analytics, like some of my classmates. I intend to be an expert, but for now these are the ramblings of a learning amateur. So if you are a data science professional looking for sophisticated analyses, or advanced tips and tricks… well, come back in a couple years.
So what will be here? Mostly I’ll try to keep things short (full time job + school + toddler = pretty full days): code snippets, quick thoughts on this or that article, concepts I want to write down to make sure I understand them, datasets I’ve found that might be worth further investigation. Once I get better with ggplot2, interesting visualizations I’ve made. Occasionally, I’ll try to do some longer case studies of data explorations and side projects.
I’ll probably start talking about baseball statistics a lot. You can ignore that if you think it’s boring. I’ve also had a long-standing, unconsummated interest in digital signal processing, so I’m sure there will be some of that once I get into time series analysis, and a little fun with data sonification.
This and that. Things and stuff. Big data, little data, what begins with data? This blog, that’s what. Enjoy!