Every Row in Your Dataset Is a Human Being


(This story was originally shared at DataKind’s 2020 Global Summit.)

So what is data? Now let’s be clear that I’m not referring to the obvious: we all know that they’re records of observations that you can run calculations upon.

Folks have variously compared data to oil, clay, gold, and — well — even bacon for their value in building machine learning applications. But in my experience, these metaphors often just don’t cut it. And I want to take a moment to talk about how I came to this realization…

So far, I’ve had the privilege of working directly with homeless folks for data projects on two occasions. The first was in 2014, during an annual data collection drive led by San Diego’s Regional Task Force for the Homeless. The second was in 2018 through a DataKind SF project with the nonprofit Lava Mae, where we hit the streets of San Francisco to learn how homeless data was collected.

During these occasions, I met some people that I still remember to this day. I recall that there was an elderly lady named Mildred living on the streets of San Diego’s Old Town. We told her that her name had a certain old-fashioned sophistication to it; that it reminded us of the movie Mildred Pierce (which is a very old movie with Joan Crawford). Mildred was obviously delighted that we liked her name.

In San Francisco’s Mission District, I met a clean-cut man who didn’t look homeless at all. He told me his weekly struggle was finding an electrical outlet for his beard trimmer. It’s because to him, not having a home shouldn’t have any bearing on his looks. Taking care of his appearance was his way of taking control of his life.

On that day, I also met another gentleman who was in his 50s. At night, he took classes at the local adult school. He told me that it’s because he wanted to write a book to tell stories about his life on the streets.

So when I look back upon these encounters, I realize something very fundamental. The farther up you go in the data collection lifecycle, the more you realize that comparing data to gold or oil or all these other things just isn’t quite right. Oftentimes, they’re people’s life stories in a compressed form. Said differently: every row in your dataset is a human being.

And because I’m secretly a liberal arts nerd despite my tech job, I happen to know this: the ancient Greeks used to compare each mortal life to a strand of thread. If that’s the case, the table or dataframe that you’re working with is really a tapestry of interwoven life stories. And it’s my belief that when we handle this tapestry, it’s our responsibility to pause for a moment, to appreciate it, and then to treat it gently.