long . lines and ripples

Bring Your Own Data

I suspect that many intelligent people do not pay attention to data analysis, or to even basic newspaper-level statistics about, say, the pandemic, because they lack experience working with data themselves.

There are several reasons for this. First, take the most basic, entry-level tool for working with data: a spreadsheet. That can still be too much technology. Today, spreadsheets are used for all sorts of organizational tasks that do not involve the actual analysis of data, for example to track invitations to an event.[1] Along with slide-making and word processing programs, the spreadsheet is probably the most common tool for holding information.[2] Millions of people who have worked a desk job in the U.S. have encountered one, and you needn’t pay for Excel to get access: Google Sheets and others offer a surprisingly powerful option for free. But in an era where Excel is everywhere, it is easy to forget that the spreadsheet was a much more specialized tool even two decades ago. Then, it was common among accountants, finance types, engineers, and others who needed to run calculations for professional work, but still fairly intimidating and arcane to the general population.[3] The skills gap has narrowed among younger people, but it has not disappeared.

A second, more overlooked reason that data turns some people off is that decent data is hard to come by. Most people would struggle to find an accessible dataset, an interesting dataset, or even a good dataset to work with. But bad data, the kind that teaches you nothing, is cheap and ubiquitous. Most apps or software programs can export some kind of bad data into a .csv format, but it is likely to be selective, illogically organized, illegible, and generally not that useful. If you want to learn something from your data, then it has to be in a format that you can do something with, which means that someone will have had to put some thought into making it useful.

That being said, I think that if you are nerdy enough to have even a mild interest in working with data or doing some kind of analysis in your free time, coming up with a trove of your own datasets can be a valuable experience. In the last few years, I have come to think of amassing my own datasets as more of a lifelong exercise, which I do primarily out of interest in working with the data itself, but which can also store information and tell me things about my environment that I never would have paid attention to otherwise.

Building up your own carefully designed data sources can be a useful hobby for both amateurs and professionals. You can ask better analytical questions about data if you have more knowledge of what it describes (i.e., “domain” expertise), and there are things about us that could be data-fied, such as our personal health, about which each of us already has some knowledge.

Any good dataset will also have to be cleaned, refined and/or transformed based on what you want to accomplish with it. If you are doing this as a hobby, the hope is that your data isn’t too bad to begin with. But if it is, just accept the work of cleaning it up as the price of getting to know your information in its “data-fied” form.[4]
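To make the cleanup step a little more concrete, here is a minimal Python sketch of what it can look like. Everything in it is an assumption for illustration: the `RAW_EXPORT` sample, the column names, the two date formats (with slashes read as month first), and the `parse_date` and `clean_rows` helpers are all hypothetical, not taken from any particular app's export.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: mixed date formats, stray whitespace,
# a unit baked into one value, and one missing value.
RAW_EXPORT = """Date , Weight
2023-01-05, 81.2 kg
01/06/2023,80.9
2023-01-07,
"""


def parse_date(text):
    """Try each assumed date format in turn; return None if none match."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):  # assumption: slashes mean month/day/year
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean_rows(raw_text):
    """Return a list of (date, weight_kg) tuples, skipping unusable rows."""
    reader = csv.reader(io.StringIO(raw_text))
    next(reader)  # skip the header row
    cleaned = []
    for row in reader:
        if len(row) < 2:
            continue  # blank or malformed line
        day = parse_date(row[0].strip())
        value = row[1].replace("kg", "").strip()  # drop the unit suffix
        if day is None or not value:
            continue  # unparseable date or missing measurement
        cleaned.append((day, float(value)))
    return cleaned


for day, weight in clean_rows(RAW_EXPORT):
    print(day.isoformat(), weight)
```

The sketch silently drops rows it cannot use. For a hobby dataset that may be fine, but you might prefer to collect the rejected rows somewhere so you can see what your source is getting wrong.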

An important related question: if you want to find personal datasets to toy around with, should you prefer data that is manually recorded, or something generated from an automated source like a sensor or internet platform? No doubt there are advantages to both, but if you are going the automated route, you will need to spend more time ensuring that what your source generates does not cause you a lot of manual transformation work. Minimize both the initial startup difficulty and the ongoing hassle.
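For the manual route, one way to keep both the startup difficulty and the ongoing hassle low is to fix the shape of your log once and then only ever append to it. The following is a rough sketch under my own assumptions: the `FIELDS` layout (one row per observation), the `log_entry` helper, and the file name in the usage note are hypothetical choices, not a recommendation of any specific tool.

```python
import csv
import os
from datetime import date

# Hypothetical column layout: one row per observation.
FIELDS = ["date", "metric", "value"]


def log_entry(path, metric, value, day=None):
    """Append one observation to a CSV log, writing the header if the file is new."""
    day = day or date.today()
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {"date": day.isoformat(), "metric": metric, "value": value}
        )
```

Calling something like `log_entry("my_log.csv", "sleep_hours", 7.5)` once a day produces a file that opens cleanly in any spreadsheet program, with ISO-formatted dates that sort correctly as plain text.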

Here are a few ideas for data sources:

Notes

  1. For general information tracking purposes, the spreadsheet is, in effect, the average person’s first, and likely only, introduction to databases.
  2. PowerPoint may be just as ubiquitous as Excel, but is much more controversial. A lot of people think PowerPoint is bad for business culture and communication in general. By contrast, I would conjecture that most people agree that spreadsheets are important tools.
  3. For example, in his well-known 2001 book on personal investing, The Intelligent Asset Allocator, William Bernstein has to inform his readers, who are likely to be more data-savvy than average due to the subject matter, that “If you are familiar with spreadsheets…all spreadsheet packages have extensive financial calculation capability.” Otherwise, he advises them to use a Texas Instruments BA-35 hand calculator (p. 5).
  4. And in the long run, figure out how to make your data less bad by default.