Further to my post mentioning Facebook's harvesting of user data after logout, a friend (who will remain anonymous) pointed me at some holes in Microsoft's announced deidentification procedures that could create similar vulnerabilities. Beyond the technical problems in design, this is concerning because both Microsoft and its partner sites have commercial incentives to allow such vulnerabilities.
The architecture of Microsoft's ad network deidentification system was recently described in a white paper (PDF) that was apparently created as an input to a recent Federal Trade Commission 'Town Hall' meeting on online advertising. Given the potential audience, it's possible the architectural description has been simplified, so I'll discuss some potential variants as well. Any flaws in analysis are my own.
A user who has an MSN / Live account and logs on has a cookie created on his or her browser that contains a 'LiveID'. That ID is not anonymous, because it is used by Microsoft as a database key to (among other things) e-mail accounts, real name and volunteered and inferred demographic and behavioral information. At the same time, an anonymous ID (ANID) is created and stored by cookie on the browser. The ANID is derived from the LiveID by a cryptographically strong hash function. The ANID is associated with non-identifying demographic and behavioral information, both volunteered and inferred. During the period when the user is logged in and visits either one of Microsoft's own sites, or a partner site on which it is serving ads, both the LiveID and ANID cookies are accessible. Note that in this scheme, behavioral and demographic information about the user created during logged-in sessions is also copied into the record attached to the ANID.
When the user logs out of MSN / Live, the LiveID cookie is deleted from the browser. The immediate association with individually identifying information is therefore broken, as it is (as Microsoft asserts) computationally intractable to deduce the LiveID from knowledge of the ANID alone. Such a brute-force reversal attack is hard to begin with, and will be even more problematic if Microsoft makes it difficult to deduce the range of LiveIDs that it might be generating. With the ANID cookie intact, Microsoft or an advertising partner site will still have a stable, pseudonymous identifier to use for aggregation of click-stream and other behavioral data across sites and visits, but shouldn't be able to tie that information back to the actual person associated with the pseudonym.
But saying something is anonymous doesn't make it so. As we'll see, the anonymity of the 'ANID' depends entirely on Microsoft and its partners 'playing nice' - there are information leaks inherent in the scheme described. These vulnerabilities are not cryptographic in nature, so the security of the one-way hash that generates the ANID is not relevant.
The most obvious vulnerability is to Microsoft itself. According to the white paper, the ANID is derived from a hash of only the LiveID. So the same LiveID (which is user identifying) will always yield the same ANID. Even if Microsoft discards the LiveID/ANID pair when the user logs out, at the next login the same LiveID/ANID pair will be generated. Any behavior or other data that was recorded during the logged-out period, by either Microsoft or partner sites, and was associated with the ANID can now be retrospectively joined to the actual LiveID and the corresponding identifying information. The strength of the hashing function is irrelevant, as it is being invoked in the 'forward' sense.
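To make the 'forward' use of the hash concrete, here is a minimal sketch. It is not Microsoft's implementation: SHA-256 stands in for the unspecified hash, and the function names, the in-memory event log and the example address are all hypothetical.

```python
import hashlib

def derive_anid(live_id: str) -> str:
    # Assumption: SHA-256 stands in for the white paper's unspecified
    # "cryptographically strong hash" of the LiveID alone.
    return hashlib.sha256(live_id.encode("utf-8")).hexdigest()

# Hypothetical store of events recorded while the user was logged out,
# keyed only by the supposedly anonymous ANID.
logged_out_events = {}

def record_logged_out(anid: str, event: str) -> None:
    logged_out_events.setdefault(anid, []).append(event)

def on_login(live_id: str) -> list:
    # The hash is only ever run forward: the same LiveID yields the same
    # ANID, so the logged-out history can be joined to the identity.
    return logged_out_events.get(derive_anid(live_id), [])

# The ANID cookie left on the browser after an earlier logout.
stale_anid = derive_anid("alice@example.com")
record_logged_out(stale_anid, "viewed /loans")
record_logged_out(stale_anid, "clicked ad 17")

# At the next login the 'anonymous' history is recovered without
# ever inverting the hash.
rejoined = on_login("alice@example.com")
```

Nothing here attacks the hash. Discarding the pair at logout does not help either, because the pair is simply regenerated at the next login.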
Next I turn to vulnerabilities associated with Microsoft partner sites. Note that the protocol as described has both the LiveID and ANID cookies valid when the user is in the logged-in state. (See Figure 3 in the Microsoft white paper.) Any partner site visited in this state may observe both of these cookies, and compile a look-up dictionary of pairs of identifying and 'anonymous' identifiers. Later visits by the user in a logged-out state can still be associated with the original LiveID by looking up the ANID. If the site keeps a database of behaviors associated with a given ANID, even when the LiveID is not set, that individual may still become identified retrospectively if he or she later visits in a logged-in state with both the LiveID and ANID set. Again, note that this vulnerability does not rest at all on attacking the one-way hash, but on using Microsoft's cookie protocol to compile a dictionary of valid pairings. The larger and the more active the partner site, or set of cooperating sites, the greater the potential vulnerability.
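A partner site's dictionary attack can be sketched in a few lines. The class, the cookie names used as dictionary keys, and the sample values below are illustrative assumptions rather than anything taken from the white paper. The only thing the sketch relies on is that both cookies are readable in the logged-in state.

```python
class PartnerSite:
    """Hypothetical partner site that logs whatever cookies it can read."""

    def __init__(self):
        self.anid_to_live_id = {}  # look-up dictionary of observed pairs
        self.pages_by_anid = {}    # click-stream keyed by the pseudonym

    def handle_request(self, cookies: dict, page: str):
        anid = cookies.get("ANID")
        if anid is None:
            return None
        # Behavior is always recorded against the ANID.
        self.pages_by_anid.setdefault(anid, []).append(page)
        live_id = cookies.get("LiveID")
        if live_id is not None:
            # Logged-in visit: both cookies are visible, so store the pair.
            self.anid_to_live_id[anid] = live_id
        # Return the identity this ANID resolves to, if any is known yet.
        return self.anid_to_live_id.get(anid)


site = PartnerSite()
# Logged-out visit: no identity is known yet.
before = site.handle_request({"ANID": "a1f3"}, "/mortgages")
# Logged-in visit: the pairing is captured.
during = site.handle_request({"ANID": "a1f3", "LiveID": "alice@example.com"}, "/home")
# Later logged-out visit: resolved through the dictionary.
after = site.handle_request({"ANID": "a1f3"}, "/health")
```

After the single logged-in visit, the earlier '/mortgages' request is retrospectively tied to the same LiveID, since it is stored under the same ANID.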
Some may find it objectionable that behavioral records derived during the logged-in state are copied into the 'anonymous' records at all. Even passing over direct privacy problems with this, having parallel records associated with LiveID and ANID itself creates a problem with potential reidentification after logout. Depending on the site involved, and the information that was recorded for the user, it could in fact be uniquely identifying with respect to that site's users. This would allow a probable connection to be made between the 'anonymous' and identified states, whether or not the LiveID/ANID pair had been collected as above. Note that while Microsoft could employ statistical disclosure control methods to limit the reidentification risk posed by the 'anonymous' records, it cannot totally plug this hole, as it will never know the characteristics of the user population on any partner site.
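The reidentification risk from parallel records can be illustrated with a toy matching routine. Everything here is assumed for illustration: the attribute names, the records, and the rule that a combination of attributes held by exactly one identified user counts as a match.

```python
from collections import Counter

def unique_matches(identified: dict, anonymous: dict, keys: list) -> dict:
    """Map ANIDs to identities whose attribute combination is unique on this site."""
    def signature(record):
        return tuple(record[k] for k in keys)

    # Count how many identified users share each attribute combination.
    counts = Counter(signature(r) for r in identified.values())
    # Keep only the combinations held by exactly one identified user.
    unique = {signature(r): who for who, r in identified.items()
              if counts[signature(r)] == 1}
    return {anid: unique[signature(r)] for anid, r in anonymous.items()
            if signature(r) in unique}

# Hypothetical records attached to LiveIDs on one partner site.
identified = {
    "alice": {"zip": "02139", "age_band": "30-39", "interest": "sailing"},
    "bob":   {"zip": "02139", "age_band": "30-39", "interest": "golf"},
    "carol": {"zip": "02139", "age_band": "30-39", "interest": "golf"},
}
# Hypothetical copies of the same attributes attached to ANIDs.
anonymous = {
    "anid-1": {"zip": "02139", "age_band": "30-39", "interest": "sailing"},
    "anid-2": {"zip": "02139", "age_band": "30-39", "interest": "golf"},
}

links = unique_matches(identified, anonymous, ["zip", "age_band", "interest"])
```

In this toy population only the sailing record is unique, so only 'anid-1' is linked. Whether a combination is unique depends on the site's own user base, which is exactly what Microsoft cannot know in advance.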
In summary, the Microsoft protocol as described protects the user's anonymity after logout only if Microsoft and its partner sites do not employ any of the methods described above. The protocol itself is not protective, and the strength of the hash function employed is irrelevant, as it is easily circumvented by non-cryptographic means.
Note there's a rather obvious - again non-cryptographic - means for eliminating some of the vulnerability: Never make the LiveID and ANID available at the same time to a partner site. Only place the ANID cookie on the browser after the LiveID cookie is deleted, and vice versa. No site could then collect its own correspondence table of LiveID and ANID, and should therefore only be able to identify the user through his or her own actions (or statistical means). Again, leaving this hole means every Microsoft user has to trust all MSN / Live partner sites to play nice and not compile a LiveID/ANID dictionary. Fixing this vulnerability, however, does not eliminate the privacy risk from Microsoft itself: It will still be generating the LiveID/ANID pairs, and there will still be a one-to-one correspondence to enable retrospective capture of data from the logged-out, 'anonymous' period. Nonetheless, this change is so trivial on a technical basis that it's puzzling it is not part of the architecture from the start.
Vulnerability could be further reduced by applying well-known augmentations to the hashing scheme. It's common to concatenate either a systematically generated key, or a random number (a salt), to an input string before hashing. This breaks the one-to-one correspondence between the input (LiveID in this case) and the output (ANID). The ANID generated at login would then vary whenever Microsoft changed its hash key, or with every login in the case of the random salt. The vulnerability of the user to accumulation of pairs of real identity and pseudonym by partner sites would be reduced, as each pairing would have a limited period of validity. In the case of added keys alone, the vulnerability to reidentification by Microsoft is not eliminated, as it will have a record of the last login time and the valid hash key for that period. If a truly random salt is used and then discarded, Microsoft itself should not be able to reidentify, unless it were surreptitiously storing the LiveID/ANID pairs.
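The two augmentations can be sketched as follows. This is one possible reading, not a description of any deployed system. The keyed variant uses HMAC-SHA-256 in place of plain concatenation, which is the more conventional way to key a hash. The salted variant concatenates a fresh random value and discards it. The function names and key values are hypothetical.

```python
import hashlib
import hmac
import secrets

def anid_keyed(live_id: str, period_key: bytes) -> str:
    # Keyed variant (assumption: HMAC-SHA-256 substituted for simple
    # concatenation). The ANID is stable only while period_key is in use;
    # anyone holding the key can still recompute it from the LiveID.
    return hmac.new(period_key, live_id.encode("utf-8"), hashlib.sha256).hexdigest()

def anid_salted(live_id: str) -> str:
    # Salted variant: a fresh random salt per login, discarded on return.
    # Without the salt, the ANID cannot be regenerated from the LiveID.
    salt = secrets.token_bytes(16)
    return hashlib.sha256(salt + live_id.encode("utf-8")).hexdigest()

# Hypothetical keys for two rotation periods.
key_january = b"hypothetical-key-2008-01"
key_february = b"hypothetical-key-2008-02"

same_period = anid_keyed("alice@example.com", key_january)
same_period_again = anid_keyed("alice@example.com", key_january)
next_period = anid_keyed("alice@example.com", key_february)

login_one = anid_salted("alice@example.com")
login_two = anid_salted("alice@example.com")
```

With the key, the pseudonym changes only at rotation and Microsoft can always recompute it. With a discarded salt, each login produces an unrelated ANID, at the cost of losing any cross-session aggregation.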
What makes this more than an interesting excursion into the land of data privacy and de/reidentification are the incentives for Microsoft and its partners to not improve the described scheme. Leaving the ANID as a stable pseudonym for a real individual for an indefinite time allows long-term accumulation of behavioral data, across sessions and potentially across Microsoft federated sites. Not only does that data become more valuable over time, but any subsequent transaction that reveals personally identifying information associated with the ANID will allow that accumulated data to then become linked to a real person, whether or not the LiveID is ever revealed through these vulnerabilities.
Any partner site has a direct incentive for this accumulation to occur over as long a period as possible, rather than having the ANID become a more-or-less ephemeral pseudonym. As an advertising network trying to come from behind in a viciously competitive marketplace, Microsoft has an indirect incentive to generate stable pseudonyms for its users when they are nominally logged out, even at risk to their ultimate privacy. The subtitle on the MS whitepaper says that it "does not personally and directly identify individual users". As given, however, it does indirectly identify individual users through the use of rather trivial attacks by partner sites, and is completely vulnerable to Microsoft itself.
If there's anyone out there who can perform a message analysis on the Microsoft protocols similar to what was done with Facebook, it would be very interesting to see whether it corresponds to the description given in the FTC submittal.