
Making Data from Scratch

The other day, I bought a coffee at a new shop near me, and I was surprised to check my email a few minutes later and see an invoice from Square for the purchase. Now, once my conscious brain had a moment to work, I wasn’t actually surprised. I know that Square is a common payment processor for businesses like this, and that my credit card has long been part of their trove of data. I even know that I can take a moment to unlink my credit card from their systems if I want.

But moments of sudden unease at being tracked still hit me occasionally. Not long before, I found an entry on my calendar that I didn't recognize. It resembled a real event but was both wrong and oddly worded, like it had been typed by someone else. The cause was an automated entry from a calendar that I didn't know existed: "Found in Natural Language." I disabled the feature, but like the emails from Square, I never asked for it. I knew this was possible, but I still don't quite know how it happened: Whose natural language? Where? In a text? Phone call?

In the last year or two, the national news–the NYTimes's "Privacy Project" in particular–has made an effort to show us that new, data-intensive monitoring of ordinary life goes much deeper than commonly known.

Sometimes this happens by putting together a lot of mundane data (e.g., browser fingerprinting), sometimes through unprecedented wide-scale implementations of technology that was formerly more niche (e.g., public facial recognition). Sometimes it is through a new technical breakthrough, like the compilation and statistical analysis of genomic data to solve old murder cases. The lesson of the last few years is that we should anticipate being tracked through our digital footprint in ways we didn't even realize were possible. It might take sci-fi imagination–coupled with relentless paranoia–to correctly guess at all the ways the average person is monitored in the U.S.
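To make the browser-fingerprinting example concrete, here is a minimal sketch of the idea. The attribute values are invented, and real fingerprinting libraries gather far more signals, but the mechanics are roughly this: several individually mundane attributes get folded into one nearly unique identifier.

```python
import hashlib

# Hypothetical attributes a site can read without asking. Each one is mundane
# on its own; the combination is close to unique.
browser_attributes = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "accept_language": "en-US,en;q=0.9",
    "timezone": "America/Chicago",
    "screen_resolution": "2560x1440",
    "installed_fonts": "Arial, Helvetica, Georgia, Futura",
    "canvas_hash": "a9f0e61a137d86aa",  # rendering quirks of this GPU/driver
}

# Concatenate and hash: the result is a stable "fingerprint" that can follow
# a visitor across sites with no cookie and no login.
fingerprint = hashlib.sha256(
    "|".join(f"{k}={v}" for k, v in sorted(browser_attributes.items())).encode()
).hexdigest()

print(fingerprint[:16])  # a short identifier to key a tracking database
```

No single attribute in that dictionary identifies anyone; the combination does, which is exactly why it feels like nothing is being given away.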

The email from Square made me think a little more about how the public understanding of digital surveillance informs public opinion about what it is acceptable to track. The acceptability question underlies the trust between user and site (app, service, etc.) that is built into the open internet. A lot of public outrage about the power of Big Tech comes down to the perception that they are doing things without the consent of their customers. But I suspect that what is really going on here, especially with new technology, is that people are just surprised. You can't answer the question of "what is acceptable" without studying how the public makes sense of new and emerging forms of digital surveillance.

I break this problem down into three categories:

Category 1: Activity that is “born digital”

Activity that generates data because it is already data to begin with.

Think credit card transactions, social network activity, television viewing habits, etc. Many of us know that this stuff is being pegged to our identities (or a proxy for unique identity) and analyzed. At a minimum, it is stored in databases forever in anticipation of future use.

Many of these transactions generated a lot of data even before the modern internet, so to some extent they are grandfathered into the current situation: very few people object to Mastercard storing their purchase history (with them) indefinitely. Nielsen is such a valuable company in part because it sits on top of a unique long-term dataset about our television viewing habits.

One could say that this category is the least controversial form of data collection out there. It has become normalized, except that this normal data can still become quite invasive when it grows comprehensive enough. See the recent concern over Google's record of purchase history through Gmail, or the failure of the credit rating bureaus to maintain adequate security standards.

Violations of trust in Category #1 occur when one group gains too much power to access, compile and generate normal data.

Think Facebook learning so much about its users' online activity, on other websites, that it has created the most effective micro-targeting platform for advertising, ever. Each of the FAANG companies reaches so deeply into all of our lives that it collects, or aspires to collect, a disturbing amount of normal data.

Category 2: Aspects of ordinary life which are not inherently digital, but which are known to be receptive to digitization

I have in mind the reports that many states in both the East and West are building formidable, permanent facial recognition networks in public places.

Or consider the use of big-data analytical techniques on what were formerly consumer-recreational databases of genetic data. Millions of people freely gave their most personal biological data to a company. They knew that the company was mining their data, but they thought that the analysis would come to an end–that it was for determining ancestry and nothing more.

[Image: a 23andMe endcap display at Target]
Who knew that 23andMe needed a whole endcap display at Target? "These tests are for entertainment purposes only!" Until further notice.

Now 23andMe has entered a partnership with a pharmaceutical company for drug research purposes. Maybe that was the plan all along. And some genetic databases, like GEDmatch, have been repurposed to catch murderers (e.g., the "Golden State Killer"). The hobbyists are now criminal suspects, too.

Once an activity becomes digital, inertia is on the side of using that information for new purposes.

Violations of trust in Category #2 happen when an authority actualizes its latent capability to collect and use that data.

In the ongoing Hong Kong protests, facial recognition causes worldwide alarm because the Chinese authorities are using it to identify and punish the protestors. In turn, the protestors recognize the danger posed by facial recognition–and shield their faces.

I should be clear that in matters of violating the public trust, I am talking about the public perception of the surveillance that is (1) possible and (2) actively in use. Facial tracking had been on the market for a while. Consumers knew they were showing their faces in endless ordinary-life settings (banks, courthouses, etc.). What is actually being tracked and measured is only part of the issue; what is understood to be tracked is the other part. Regardless of how widely facial recognition is in use, many of us won't be looking at cameras the same way again. But at least we know it is possible. Now for the last category:

Category 3: Surveillance of activity which is not widely understood to have a digital component, or even understood to be capable of digitization

This is what facial recognition used to be. Faces were just things people learned to recognize, draw, represent, etc. At best the police organized them into crude sketches of people who were wanted, using a system of boilerplate features that were combined by trained sketch artists. Then the human face gained a systematic analogue in something called “facial recognition.” Consumers got an introduction to this technology with smartphones that let you unlock your device using your face.

Faces are now things with a function: identifiers. Once they are data, you can collect them, hoard them, protect them, debate them. It is easy to forget just how recently they were just another important but useless thing. They were not (yet) data.
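One way to picture what it means for a face to function as an identifier: modern systems reduce a photo to a vector of numbers (an "embedding") and declare two photos the same person when their vectors are close enough. The sketch below uses made-up four-dimensional vectors and a made-up threshold; real systems use hundreds of dimensions and tuned cutoffs, but the logic is the same.

```python
import math

def cosine_similarity(a, b):
    """Compare two face embeddings; values near 1.0 mean 'probably the same face'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: invented numbers standing in for real model output.
enrolled_face = [0.21, -0.53, 0.88, 0.10]   # stored when you "register" your face
camera_frame  = [0.19, -0.50, 0.91, 0.12]   # computed from a new photo

THRESHOLD = 0.95  # made-up cutoff; vendors tune this
if cosine_similarity(enrolled_face, camera_frame) > THRESHOLD:
    print("match: same identifier")  # the face has just functioned as an ID
```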

[Image: a computer-generated face]
Not a real face. But made possible by real faces turned into data. Credit: thispersondoesnotexist.com

At this point, we would be foolish to rule out any type of activity for data collection purposes. As the Harvard Business School psychologist Shoshana Zuboff argues in her book, The Age of Surveillance Capitalism:

…all aspects of human experience are claimed as raw-material supplies and targeted for rendering into behavioral data…As competition intensifies, surveillance capitalists learn that extracting human experience is not enough. The most-predictive raw-material supplies come from intervening in our experience to shape our behavior in ways that favor surveillance capitalists’ commercial outcomes.

Shoshana Zuboff, The Age of Surveillance Capitalism, p. 19.

It might not matter who we are, and what the entirety of our activity actually says about us. The most important aspect of “us” might be whatever the data says we are: what we buy online, where we go when we have our phones along, what we write that a smartphone uploads to the cloud. This could affect how we are categorized for the purpose of vital benefits and services (healthcare, government assistance), where we can be employed, and how we are matched with others in the myriad social algorithms that increasingly influence with whom and how we socialize:

“The reality business renders all people, things and processes as computational objects in an endless queue of equivalence without equality. Now, as the reality business intensifies, the pursuit of totality necessarily leads to the annexation of “society,” “social relations” and key societal processes as a fresh terrain for rendition, calculation, modification and prediction.”

The Age of Surveillance Capitalism, p. 399

What is “real” is whatever the data says, because it is only activity with a data component that is useful (profitable, actionable, controllable) for someone else. The rest disappears from significance because it does not yet have this purpose.

New forms of data arise every day. I recently learned about the “Personality Insights” tool on IBM’s AI platform, Watson. As the company describes it on their website, it “applies linguistic analytics and personality theory to infer attributes from a person’s unstructured text.” Anything a person writes, online or not, is now fodder to be fed into this tool. Indeed, the more ordinary the language, the more innocuous, the more unguarded, the better. As the instructions say, “You need text written by the person whose personality you’re interested in. It should contain words about every day experiences, thoughts, and responses.” Text, even text written on a piece of paper before this tool existed, can now be used to generate a set of conclusions about the individual personality. Already, organizations are using this data to make decisions about the individuals they describe.
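To make that concrete, here is a rough sketch of what feeding everyday writing into such a tool might look like. The service URL, version date, authentication scheme, and response fields below reflect my understanding of the Personality Insights REST interface and should be treated as illustrative assumptions rather than exact documentation; the input file name is made up.

```python
import requests

# Illustrative values only: substitute the service URL and API key from an
# actual IBM Cloud account, and note the endpoint details are assumptions.
SERVICE_URL = "https://api.us-south.personality-insights.watson.cloud.ibm.com"
API_KEY = "YOUR_API_KEY"

# Any unguarded, ordinary writing will do; the more mundane, the better.
everyday_text = open("old_letters_and_emails.txt").read()

response = requests.post(
    f"{SERVICE_URL}/v3/profile",
    params={"version": "2017-10-13"},
    headers={"Content-Type": "text/plain"},
    auth=("apikey", API_KEY),
    data=everyday_text.encode("utf-8"),
)

profile = response.json()
# The response scores the writer on traits such as openness, conscientiousness,
# extraversion, agreeableness, and emotional range.
for trait in profile.get("personality", []):
    print(trait["name"], trait["percentile"])
```

The point is not the particular fields in the output; it is that a pile of ordinary sentences goes in and a set of conclusions about a person comes out.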

Violations of trust in Category #3 happen only when the mere existence of data is understood to be a danger. We need a better public understanding that the act of generating data is itself a threat to the public good.

It is not enough for information about new data collection practice to be “made available” to the public. Legal fine print, end-user license agreements, even investigative journalism after the fact–none of it is enough. Large numbers of people must understand that anything that becomes data can be used against them. The next frontier of data collection will always come as a surprise, but we needn’t be technical experts in burgeoning technical fields–machine vision, statistical genomics, etc.–to start protecting ourselves. What we need is to understand that data about us is always dangerous, simply because it exists.