Josh-D. S. Davis

Xaminmo / Omnimax / Max Omni / Mad Scientist / Midnight Shadow / Radiation Master

Personally Identifiable Information
Mr. X lives in ZIP code 02138 and was born July 31, 1945.

These facts about him were included in an anonymized medical record released to the public. Sounds like Mr. X is pretty anonymous, right?

Not if you're Latanya Sweeney, a Carnegie Mellon University computer science professor who showed in 1997 that this information was enough to pin down Mr. X's more familiar identity -- William Weld, the governor of Massachusetts throughout the 1990s.

Gender, ZIP code, and birth date feel anonymous, but Prof. Sweeney was able to identify Governor Weld through them for two reasons. First, each of these facts about an individual (or other kinds of facts we might not usually think of as identifying) independently narrows down the population, so much so that the combination of (gender, ZIP code, birth date) was unique for about 87% of the U.S. population. Second, outside records that carry names, such as voter rolls, list the same facts, so a unique combination can be matched back to a person.

In summary, every little bit of your information is "partially identifying" of who you are. The effect is strong enough that full birth date, ZIP code, and gender together uniquely identify about 87% of the U.S. population. Deanonymizing someone from a few of these facts is fairly easy for anyone with access to databases that link them to names.
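To make the "unique combination" idea concrete, here is a minimal Python sketch that counts how many records in a dataset are the only one with their particular combination of fields. The records are made-up toy data and the function names `unique_fraction` and `unique_fraction_on` are my own, not anything from Sweeney's study; it only illustrates the counting, not her method.

```python
from collections import Counter

# Hypothetical toy records: (gender, ZIP code, birth date). Not real data.
records = [
    ("M", "02138", "1945-07-31"),
    ("F", "02138", "1945-07-31"),
    ("M", "02138", "1962-03-14"),
    ("M", "02138", "1962-03-14"),  # shares all three fields with the row above
    ("F", "02139", "1980-11-02"),
]

def unique_fraction(rows):
    """Return the share of rows whose full field combination occurs exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

def unique_fraction_on(rows, indexes):
    """Same as unique_fraction, but compares only the fields at the given positions."""
    projected = [tuple(row[i] for i in indexes) for row in rows]
    return unique_fraction(projected)

print(unique_fraction(records))          # all three fields combined
print(unique_fraction_on(records, [1]))  # ZIP code alone
```

In this toy set, three of the five rows are unique on all three fields, while only one row is unique on ZIP code alone. Each field is weak by itself; the combination is what singles people out.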

Other things that are partially identifying include your search terms and habits, the structure of your friend network, your preferences in books and movies, etc. This information isn't stored only in your browser cookies; it can also live in Flash "Local Shared Objects" (Flash was originally a Macromedia product, now Adobe). Aside from that, external sites can track your history by noting that "this IP goes here regularly," which might not be unique, but might be enough to narrow you down to a small handful of people.

More info is at https://www.eff.org/deeplinks/2010/01/primer-information-theory-and-privacy, which summarizes how much partial information might be required to identify someone.

In these examples, you can consider "entropy" to mean "information". "Because there are around 7 billion humans on the planet, the identity of a random, unknown person contains just under 33 bits of entropy (two to the power of 33 is 8 billion)."
"Birthday: ΔS = -log2 Pr(DOB=2nd of January) = -log2 (1/365) = 8.51 bits of information"
"Note that if you combine several facts together, you might not learn anything new; for instance, telling me someone's starsign doesn't tell me anything new if I already knew their birthday."
"Knowing my ZIP code is 90210: ΔS = -log2 (21,733/6,625,000,000) = 18.21 bits"
"Knowing my ZIP code is 40203: ΔS = -log2 (452/6,625,000,000) = 23.81 bits"
"Knowing that I live in Moscow: ΔS = -log2 (10,524,400/6,625,000,000) = 9.30 bits"
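The arithmetic in these quotes is easy to check. Below is a small Python sketch of it, using the same ΔS = -log2(probability) formula and the population figures from the quotes. The function names `surprisal_bits` and `expected_matches` are mine, and adding bits together assumes the facts are independent of each other (the starsign/birthday caveat above is exactly the case where that fails).

```python
import math

def surprisal_bits(probability):
    """Bits of information learned from a fact that holds with this probability."""
    return -math.log2(probability)

def expected_matches(population, bits):
    """Rough number of people sharing the known facts, assuming independent facts."""
    return population / 2 ** bits

QUOTED_POPULATION = 6_625_000_000  # population figure used in the quoted ZIP examples

birthday = surprisal_bits(1 / 365)
zip_40203 = surprisal_bits(452 / QUOTED_POPULATION)
gender = surprisal_bits(1 / 2)       # assumption: a 50/50 split, worth exactly 1 bit
needed = math.log2(7_000_000_000)    # bits needed to single out one of ~7 billion people

print(f"birthday:         {birthday:.2f} bits")
print(f"ZIP 40203:        {zip_40203:.2f} bits")
print(f"birthday + ZIP:   {birthday + zip_40203:.2f} bits")
print(f"... plus gender:  {birthday + zip_40203 + gender:.2f} bits")
print(f"needed:           {needed:.2f} bits")
print(f"people left after birthday + ZIP: {expected_matches(7_000_000_000, birthday + zip_40203):.1f}")
```

Birthday plus ZIP code lands a little short of the bits needed for 7 billion people, leaving slightly more than one expected match; adding one more bit for gender pushes the total past the threshold.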
"So for instance, if we know someone's birthday, and we know their ZIP code is 40203, we have 8.51 + 23.81 = 32.32 bits; that's almost, but perhaps not quite, enough to know who they are: there might be a couple of people who share those characteristics. Add in their gender, that's 33.32 bits, and we can probably say exactly who the person is."
"on average, User Agent strings contain about 10.5 bits of identifying information, meaning that if you pick a random person's browser, only one in 1,500 other Internet users will share their User Agent string."

"On its own, that isn't enough to recreate cookies and track people perfectly, but in combination with another detail like geolocation to a particular ZIP code or having an uncommon browser plugin installed, the User Agent string becomes a real privacy problem."