How much information does it take to single out one person among billions?
The Arithmetic of Uniqueness
When I first heard about Latanya Sweeney’s demonstration that gender, zip code, and birth date are enough to identify many Americans, I found the result surprising, but the arithmetic is straightforward. For a back-of-the-envelope calculation, assume there are 300 million people in the United States, half male and half female, and that they are evenly distributed over 30,000 zip codes and 36,500 possible birth dates (100 years of 365 days; I am ignoring leap years and centenarians). Each zip code then has 10,000 residents, 5,000 male and 5,000 female. The question then becomes: If each of 5,000 people has a birth date chosen at random from 36,500 possibilities, how many will wind up with a date not shared by any other member of the group? A given person's date goes unshared only if all 4,999 others miss it, which happens with probability (1 − 1/36,500)^4,999, or about 0.872. The mathematically expected number is therefore 4,360, or 87 percent.
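The arithmetic above can be checked in a few lines of Python. This is only a sketch of the uniform model described in the text; the function name `expected_unique` and the variable names are my own labels, not anything taken from Sweeney's or Golle's work.

```python
def expected_unique(group_size: int, num_dates: int) -> float:
    """Expected number of people whose birth date is shared by no one else."""
    # A given person is unique if each of the other (group_size - 1) people
    # misses that person's date; each miss has probability 1 - 1/num_dates.
    p_unique = (1 - 1 / num_dates) ** (group_size - 1)
    return group_size * p_unique


# Figures from the back-of-the-envelope model in the text.
population = 300_000_000
zip_codes = 30_000
birth_dates = 36_500  # 100 years x 365 days

group = population // zip_codes // 2  # one gender within one zip code
count = expected_unique(group, birth_dates)
print(f"{count:.0f} of {group} ({count / group:.0%})")  # 4360 of 5000 (87%)
```

The expectation is just the group size times the probability that one person's date is unshared, so no simulation is needed.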
The foregoing calculation is only a crude approximation. The real U.S. population is not distributed uniformly by either age or zip code. People in larger cohorts and more populous areas can more easily hide in the crowd. Philippe Golle of the Palo Alto Research Center has published an estimate of identifiability based on census data. He finds that the proportion of people with a unique combination of gender, zip code, and date of birth is a little over 60 percent.
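To see why crowding matters, the same uniform-birth-date model can be run for groups of different sizes. The group sizes below are hypothetical round numbers chosen for illustration, not census figures, and the sketch still assumes birth dates are spread evenly, so it does not reproduce Golle's estimate.

```python
def unique_fraction(group_size: int, num_dates: int = 36_500) -> float:
    """Expected fraction of a group whose birth date is shared by no one else."""
    return (1 - 1 / num_dates) ** (group_size - 1)


# Hypothetical same-gender populations for zip codes of various sizes.
for group_size in (500, 5_000, 25_000, 50_000):
    print(f"{group_size:>6} people: {unique_fraction(group_size):.0%} unique")
```

The fraction falls steadily as the group grows, which is the sense in which residents of populous zip codes are harder to single out.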
Sweeney began her work on “re-identification” in the 1990s, when she was a graduate student at MIT. Her particular concern was the privacy of medical data. In 1997 she examined a batch of hospital documents released for statistical purposes and was able to identify the records of William Weld, a former governor of Massachusetts. The anonymized data listed each patient’s gender, five-digit zip code, and date of birth, which Sweeney cross-linked with voter registration rolls. (Weld confirmed that the records were his.)
Partly in reaction to this incident, the privacy rule issued under the Health Insurance Portability and Accountability Act (HIPAA), which took effect in 2003, established guidelines for guarding patient confidentiality. In general, aggregated medical data must not reveal exact dates of birth or precise locations.