Over at Infoworld, Jon Udell has continued his string of posts on database translucency. This time he takes on a (self-admittedly) stretched example: hiding the texts in a database of term papers and essays while still being able to search for matches that indicate plagiarism (or sloppy quoting). He suggests that a cryptographic hash function over the sentences in each text might do the job.
So it could. But here I'm going to deliver a one-line sermon that those who've worked for me in R&D mode have probably heard all too often: Do Your Prior Art! Seems the notion of using hashes to represent texts for searching was invented over 35 years ago. It's a specific application of a data structure called a Bloom filter. And, as the Wikipedia article notes:
"...it it is not possible to extract the list of correct words from it – at best, one can extract a list containing the correct words plus a significant number of false positives..."So it does indeed have some translucency attributes. As commenters to Udell's original post point out, there are potential issues with things like deliberate punctuation errors, trivial word changes, and other 'chaffing' by those trying to hide their plagiarism. But, there is also a three decade repertoire of techniques such as tokenizing, stemming, and sliding windows over the text to combat them.
I'm picking on Jon a little bit to make a larger point: We've been this way before, at least in part. Due to the growing databases of personally identified (or identifiable) information, we've now all got the problem of a larger, more comprehensive, and largely invisible trail of data following us through life, for better or worse. As I've noted before, most of us are willing to tolerate some use of that data, so long as we see a compensating benefit. But a large number of us are ready to run up the black flag (or at least hassle companies and our elected representatives and write nasty blog posts) if we feel that data has compromised our financial or medical prospects, or allowed outsiders to transgress our own notions of how our lives should be compartmented.
But the problem, and the political impact, isn't new. Institutions have been piling up data ever since it became feasible and affordable to do so, and it's governments that were the early adopters. Enter the Census Bureau and all those other busy fact compilers in service of some good work or another. The worries of a panopticon government are evergreen, and so those compilers have been hedged round with - and internalized - requirements on privacy preservation as they worked. Any prior art trail on translucency starts there, and it turns out there's quite a lot to be learned, including some basic concepts and terminology we could do far worse than steal, er, borrow.
In 1994, a panel of agency statisticians pulled together a compendium of statistical disclosure control methods and a review of then-current practices at the agencies collecting data. It's notable that the information releases of concern were largely tabular summaries, and many of the techniques related to making sure that thinly populated cells in tables did not become a means of identifying particular individuals or enterprises, hence 'individually identifiable data'. At the time, the release of so-called 'microdata', subsets or extracts of the raw data, was a relatively new practice, taking advantage of CD-ROMs and other pre-Web distribution means.
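For a feel of what protecting thinly populated cells means in practice, here's a toy sketch of primary cell suppression. The counts and the threshold of 5 are invented for this example, not values drawn from the 1994 compendium or any agency's actual rules.

```python
THRESHOLD = 5

# Respondent counts by (county, condition) -- hypothetical data.
table = {
    ("Adams", "diabetes"): 112,
    ("Adams", "rare syndrome"): 2,    # thinly populated: could point to a person
    ("Baker", "diabetes"): 87,
    ("Baker", "rare syndrome"): 14,
}

# Publish the table with thin cells replaced by a suppression marker.
published = {
    cell: (count if count >= THRESHOLD else "D")   # 'D' for disclosure-suppressed
    for cell, count in table.items()
}

for cell, value in published.items():
    print(cell, value)

# Real practice goes further: complementary cells are also suppressed so the
# hidden value can't simply be recovered from the row and column totals.
```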
This particular study, and other accumulated experience from government surveys, influenced the outcome of the HIPAA legislation and subsequent regulation. Since the regulation had the force of law, the translucency-related terminology used there has become a standard of practice. See, for instance:
Protected health information under HIPAA is individually identifiable health information. Identifiable refers not only to data that is explicitly linked to a particular individual (that's identified information); it also includes health information with data items that could reasonably be expected to allow individual identification.
When you remove the identifiers and quasi-identifiers (PDF at link), what you've got is deidentified data. All is well, and privacy is preserved, right? Not so fast! Enter reidentification, the art of taking the fields that remain and comparing them with other data sets in order to deduce the identity of the originals, in part or in whole. See a nicely written MIT student paper here (MS Word doc) for an overview and an example based on Chicago murder statistics.
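The core move is simple enough to sketch: link the surviving quasi-identifiers against some other data set that still carries names. The records below are invented; the choice of quasi-identifiers (ZIP, birth date, sex) is the classic one from the literature.

```python
# A sketch of reidentification by linkage: match a 'deidentified' extract
# against a public list on the quasi-identifiers (ZIP, birth date, sex).
# All records here are invented for illustration.

deidentified = [
    {"zip": "02138", "dob": "1945-07-21", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-03-02", "sex": "M", "diagnosis": "asthma"},
]

public_list = [
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-21", "sex": "F"},
    {"name": "R. Roe", "zip": "02144", "dob": "1971-11-30", "sex": "M"},
]


def quasi_id(record):
    return (record["zip"], record["dob"], record["sex"])


names_by_qid = {quasi_id(p): p["name"] for p in public_list}

for row in deidentified:
    name = names_by_qid.get(quasi_id(row))
    if name:
        print(f"{name} is probably the {row['diagnosis']} record")
```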
That's interesting for a bunch of academics, but who would bother? Well, there's the common data warehousing (and marketing) practice of householding, which works to assemble all the data held by or available to the enterprise that pertains to a living group, whether or not you've (for instance) played cute tricks with variant names or just omitted identifiers. Or check out this paper (PDF) on reidentification based on Web browsing. And this should all be sounding familiar, because what a number of bloggers and journalists did with the AOL search data was reidentification. It might have been deidentified as far as the hapless researcher was concerned, but it was certainly not anonymous.
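Householding amounts to the same kind of linkage turned inward. Here's a rough sketch that groups records by a normalized address key so that name variants still land in the same household; the normalization rules and sample records are invented, and real householding logic is far more elaborate.

```python
import re


def address_key(street, zip_code):
    # Collapse case, punctuation, common abbreviations, and apartment numbers
    # into a single grouping key.
    s = re.sub(r"[^a-z0-9 ]", "", street.lower())
    s = re.sub(r"\bstreet\b", "st", s)
    s = re.sub(r"\b(apt|apartment|unit)\b\s*\w*", "", s)
    return (re.sub(r"\s+", " ", s).strip(), zip_code)


records = [
    {"name": "Jonathan Q. Public", "street": "12 Elm Street, Apt 3", "zip": "60614"},
    {"name": "J. Public",          "street": "12 Elm St.",           "zip": "60614"},
    {"name": "Jane Public",        "street": "12 elm street",        "zip": "60614"},
]

households = {}
for r in records:
    households.setdefault(address_key(r["street"], r["zip"]), []).append(r["name"])

for key, members in households.items():
    print(key, members)   # all three name variants land in one household
```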
You'd think this area might be a hotbed of research, given HIPAA and the concerns about privacy on the net. In fact, there's one significant research group at CMU, but this is one domain where the Europeans are taking the lead. The EU has long had tougher data privacy regulations, and that's made it a relevant and fundable topic. There's a center for statistical disclosure control in the Netherlands, and a lot of work comes from a set of cooperating researchers in Catalonia of all places. I may dig into some of the themes of these groups' current research in a following post.