On Content Moderation and Human Eyeballs

In the weeks and months following the Charlottesville terror attack, one of the after-effects has been rising pressure on online services to police hate speech and violent rhetoric on their sites. More qualified folks than I have sifted (and will continue to sift) through the moral quandaries of balancing a desire to stifle hate with a vision of an open and free internet, but I’d like to dig a little into the technical considerations of automated content moderation.

There’s a vision of machine learning as a kind of technological magic. Powerful computers and complicated algorithms sift through large swaths of data, and learn to recommend a movie, or identify your second cousin in a photograph. Even the more tech-savvy users, who would never actually use the word magic, still consider this the work of a dispassionate, anonymous machine.

But there are a lot more human eyeballs involved than you might think. Most computers “learn” using a simple paradigm: look at a large swath of data, yes, but a large swath of data where someone also provides the answers. You don’t show a computer ten million pictures and ask it to tell you where the dog is in picture ten million and one. You show it ten million pictures and point at every dog. The aforementioned algorithm says, “Joe says this right here is a dog, and so is this, and so is this, and so is this. What set of mathematical features do they have in common?”
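
To make that concrete, here is a minimal sketch in Python of learning from labeled examples. Everything in it is invented for illustration: the two numeric features, the toy data, and the helper names train_centroids and predict. It uses a nearest-centroid rule, which is far simpler than anything a real image model does, but the shape is the same: a human supplies the answers, and the algorithm summarizes what the labeled examples have in common.

```python
from collections import defaultdict
from math import dist


def train_centroids(examples):
    """Average the feature vectors of every example sharing a human-given label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: tuple(total / counts[label] for total in totals)
        for label, totals in sums.items()
    }


def predict(centroids, features):
    """Return the label whose average example is closest to the new one."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical data: each picture is reduced to two made-up numbers,
# and a person has already said which pictures contain a dog.
labeled = [
    ((0.9, 0.8), "dog"),
    ((0.8, 0.9), "dog"),
    ((0.2, 0.1), "not dog"),
    ((0.1, 0.3), "not dog"),
]

centroids = train_centroids(labeled)
print(predict(centroids, (0.85, 0.7)))
```

Note that the code never defines what a dog is. It only knows what the labeler pointed at.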

To teach a computer to do a thing, a human needs to have already done that thing, many, many times. Some use cases are fortunate enough to have pre-existing data: methods of detecting positivity and negativity in language, for example, can look at movie reviews, which come with a helpful star rating or thumb label. Others rely on users: every thumbs up or down you give Netflix helps their systems build a picture of what B someone will like if they also liked A.

Content moderation is not so simple. The end goal for companies is the automatic detection of language that violates their terms of service. So, how to teach this algorithm? Past data? We could leverage the content of known hate-group forums and news sites. But these forums are likely to have a large amount of off-topic, benign content as well. User labeling? Social media platforms have buttons to report or flag inappropriate content, but these are vulnerable to concentrated abuse, or the whims of a given user’s morality.

At some point, websites need eyeballs. And they have them: armies of content-moderating “contractors” performing the menial labor of the data economy, getting paid 4¢ a post to say “this is fine, this is fine, this is a violation, this is fine.” And if you think that job sounds like fun, you may not have it quite right.

There are (at least) three implications worth considering here, before we decide this is the way to go. (1) These people are in a hurry, (under)paid by the click, and clawing their way toward minimum wage. (2) These people are performing a variety of different tasks in rapid succession. One site might pay them to moderate 1000 flagged posts, the next might be that guy above, asking them to tell his computer where the dog is in 1000 pictures. (3) These people are, well, people. Even the most precise terms of service leave some wiggle room for human judgement; and most are instead intentionally vague, to allow for a lot of wiggle in either direction. Everyone’s got some bias in their tank, and the line for what is racist and what is not can vary drastically.

This is not a recipe for reasoned, consistent decisions. This is not “here is definitely a dog.” And it’s remarkable how easily even the most sophisticated technology can learn subtle bias, if the examples it learns from are biased in the first place. All it takes is a few people who keep pointing at the cat and insisting that it’s a dog. “Garbage in, garbage out” is as true in machine learning as it is in cooking.
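
Here is a small sketch of that effect in Python. The single numeric feature, the toy numbers, and the helper names fit and classify are all made up for illustration, and the model is a deliberately simple nearest-average rule rather than anything a real moderation system would use. The only difference between the two training sets is that two cats have been labeled as dogs.

```python
from statistics import mean


def fit(examples):
    """Learn one average feature value per label from (value, label) pairs."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def classify(model, value):
    """Pick the label whose learned average is nearest to the new value."""
    return min(model, key=lambda label: abs(model[label] - value))


# Hypothetical one-number description of each animal, with honest labels.
clean = [(2, "cat"), (3, "cat"), (4, "cat"), (8, "dog"), (9, "dog"), (10, "dog")]

# Same animals, but labelers insisted that two of the cats were dogs.
noisy = [(2, "cat"), (3, "dog"), (4, "dog"), (8, "dog"), (9, "dog"), (10, "dog")]

clean_model = fit(clean)
noisy_model = fit(noisy)

# The same borderline animal gets a different answer from each model.
print(classify(clean_model, 5))
print(classify(noisy_model, 5))
```

With the clean labels the borderline animal comes out as a cat; with the two bad labels it comes out as a dog. The algorithm did nothing wrong in either case. It faithfully learned what it was shown.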

Does this mean that humans can’t do the job of teaching machines to detect hate? No. But it’s essential that the companies relying on these methods think long and hard about how to make their labeling process as effective as possible. Maybe pay the workers a little more to encourage a slower, more thoughtful response. Maybe recruit a demographically diverse group of workers to stymie bias in any direction. Most importantly, just think. Think as hard about the process you use to train your model as you’d think before handing any other critical decision over to an algorithm.

Recommended Listening:

In addition to all the links above, NPR’s Note to Self has a fascinating interview with a contract content moderator on the ins and outs of the job.

Drawing Pierce County: Census Maps in ggplot2

A side project I am working on (more on that later if I can manage to blog more than once a year) has me needing to interact with several datasets that are broken down by census tract. This means that, among other things, I’m going to be outputting a lot of Tacoma and Pierce County map visualizations, and need to be able to outline and mark these tracts in a variety of ways.

So! Here is a rundown of exactly how I am pulling the census shapes, and squooshing them into ggplot2 in R. First, we’re going to need a library or two. Or six. Probably six.

Weird Data: Addendum

At the end of What’s Weird About This Data, and Why?, I mentioned that the knowledge gained by analyzing one outlier could tell us where to look for others. When I first looked at the big graph, growth looked vaguely exponential. But knowing that 2004 saw a whole new business type (Individual Landlords) added, things start to seem a bit more linear in the before and after. And if that’s the case then 2015 looks like another unusual spike.


I’d originally guessed that the most recent full year would always be the biggest, because it includes businesses that might fail to materialize and let their licenses expire. But since I knew from 2004 that a new business type could lead to unusual behavior, I did a similar summary of the NAICS code descriptions (again, only showing the first few here).

Code Description                     Count
Taxi Service                         1019
Lessors of Residential Buildings     526
Lessors of Nonresidential Buildings  86
Residential Remodelers               86

Hoo boy, that is a lot of taxi services. If you’ve been paying attention, both to me and to the state of personal transport trends, you can probably guess the story: Tacoma requires all Uber drivers to have individual business licenses, just like traditional cab drivers.
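
The summary behind a table like the one above is just a frequency count over one column, filtered to a single year. Here is a minimal sketch in Python (my actual analysis was in R). The records, the field names year and naics_description, and the helper top_descriptions are invented for illustration, and the rows are not the real Tacoma data.

```python
from collections import Counter


def top_descriptions(rows, year, n):
    """Count license descriptions for one year and return the n most common."""
    counts = Counter(
        row["naics_description"] for row in rows if row["year"] == year
    )
    return counts.most_common(n)


# Hypothetical license records standing in for the city's data set.
licenses = [
    {"year": 2015, "naics_description": "Taxi Service"},
    {"year": 2015, "naics_description": "Taxi Service"},
    {"year": 2015, "naics_description": "Taxi Service"},
    {"year": 2015, "naics_description": "Lessors of Residential Buildings"},
    {"year": 2015, "naics_description": "Lessors of Residential Buildings"},
    {"year": 2015, "naics_description": "Residential Remodelers"},
    {"year": 2014, "naics_description": "Taxi Service"},
]

summary = top_descriptions(licenses, 2015, 2)
for description, count in summary:
    print(f"{description:<35} {count}")
```

Sorting by count is what makes an outlier like the taxi spike jump out immediately.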

It seems reasonable to expect this kind of thing will show up more and more, as middleman services like this enable individuals to easily sign up as independent contractors in a variety of industries.

What’s Weird About This Data, and Why?

Exploratory data analysis is the “real world observation” of data science. In proper rigorous statistical analysis, you are supposed to have a hypothesis before you test your data, but that’s based on the idea that you’ve already observed some phenomenon in the real world, and developed a hypothesis based on that. With data science, the data set and the “real world” are sometimes the same: the data you’ll eventually analyze is the thing you also have to observe to come up with questions in the first place.

So you do some exploratory analysis. Make a few simple visualizations, summarize the data, poke around, looking for fun facts. Often you learn things that wouldn’t have occurred to you without exploration, and the information gleaned can help you avoid mistakes moving forward.

This particular example was supposed to just be a self-driven exercise to practice a few core functions for basic recoding and visualizations in R. I fished around on the data.cityoftacoma.org portal, and downloaded a data set of all open business licenses issued by the city.

Rattling My Headbones


I lived in Spokane for a couple years, and I spent a lot of time biking on distraction-free trails. During that time I marathoned every available episode of Radiolab, and since then I’ve become a bit of a podcast junkie. I don’t have much time to watch TV or read recreationally these days, so instead I listen to podcasts whenever I don’t need my brain for something—on walks, doing the dishes, etc.

Diving headfirst into data science meant, among other things, adding a few new podcasts to the list. So here’s what I’ve been listening to. (I promise there will be some technical stuff one of these days.)


[obligatory welcome post]

Oh, hi.


I’m Joe. I’m the Senior Systems Administrator at a small web design company, in Tacoma, WA, which basically means I live torn between keeping the servers that host our websites running, and finding the time to make them better. I also run the databases, and try to be the resident expert whenever complex queries are needed to extract useful reports.

I’m also a newly minted graduate student in the Master of Information & Data Science (MIDS) program at UC Berkeley. I’ll get more into broader questions of what exactly data science “is” later, but basically it means I’ll be spending the next 20 (19 now I suppose, got a late start on the blog) months learning how to find interesting data, learn from it, analyze it, and help people make good decisions based on it.

Along the way, I know I will need more practice than just the coursework will give me, and that’s what this blog is for. I want a place to gather my thoughts as I learn, and a place to talk about side projects or little problems I’m working on.

I’ll say right here that I’m not an expert. I don’t have a rich background in data analytics, like some of my classmates. I intend to be an expert, but for now these are the ramblings of a learning amateur. So if you are a data science professional looking for sophisticated analyses, or advanced tips and tricks… well, come back in a couple years.

So what will be here? Mostly I’ll try to keep things short (full time job + school + toddler = pretty full days): code snippets, quick thoughts on this or that article, concepts I want to write down to make sure I understand them, datasets I’ve found that might be worth further investigation. Once I get better with ggplot2, interesting visualizations I’ve made. Occasionally, I’ll try to do some longer case studies of data explorations and side projects.

I’ll probably start talking about baseball statistics a lot. You can ignore that if you think it’s boring. I’ve also had a long-standing, unconsummated interest in digital signal processing, so I’m sure there will be some of that once I get into time series analysis, and a little fun with data sonification.

This and that. Things and stuff. Big data, little data, what begins with data? This blog, that’s what. Enjoy!