Metadata, Semiotics, and the Tower of Babel
Speaking of bubblets, looks like we're seeing the beginning of one around metadata. Joi Ito thinks (worries?) that metadata is so important it may be a major weapon for Microsoft. It's going to be so huge that we'll soon devote most of our storage space to metadata. In fact, it's going to be a semantic gold mine.
Rubbish. Unless you can repeal human nature.
As you can guess, a rant follows. A few clarifications before plunging in. First, I am not talking about simply using a standardized data structure to express information that can be collected as a side effect of tool use, e.g., story divisions, permalinks and other inherent structural information in a blog. These are very sensible features, long present in boring tools like Word, and aren't in any danger of being a Big New Thing. I am talking about things like GeoURL and FOAF that require incremental effort, and even more about grand taxonomic metadata schemes that will overturn Google and/or result in the second coming in the form of the Semantic Web. Second, the canonical plain English rant on this topic has already been written by Cory Doctorow, two years ago. I recommend it; in fact I will cite it as a gloss to this rant. But I have my own hobby horse to ride here. Third, for those who spied a certain word in the title and are considering drastic action: Take. Your. Finger. Off. The. Weapons. Release. Button. Now. I am not about to justify transcultural moral relativism or other silly-ass over-interpretations by French Persons.
That word is semiotics, of course. If you want the full French treatment of the topic (in English), you might start here. But I'll go with an engineer's gross simplification: Semiotics is the observation that words and other symbols are interpreted differently by different people, and that the disparity of interpretation is affected by the abstractness of the notion symbolized, and by cultural and other differences between the people. A matter fairly easily demonstrated: There is likely a decent global consensus, at least among engineers, on what is meant by '1 GHz Pentium III'. You can go into any hardware store in America, ask for a 6-32 x 2" machine screw, and get one without further discussion. But don't try it in a metric country. Moving to the abstract, here we spy part of a rational discussion between two experienced engineers regarding the word 'friend'. And here we have a 'friendly' cross-cultural dialog among statesmen regarding the notion 'allies.' QED. If you want more nuance, go off to the formal treatment, start off with 'signifier/signified', and surf away. Be careful. I'm informed that too much of the stuff can cause you to believe that you can't communicate with anyone about anything, that everything you think is predetermined by your culture and/or class, or cause you to put scare quotes around every other word. Try occasionally kicking a large rock, hard, while reading. It might help.
In spite of the risks, it's been shown possible to get productive use from this theory. For instance, Dina Mehta outlines its applications in design, and among other useful links, points to a more cut-down overview of semiotics, possibly of use to that profession.
Having spent some years working with unstructured text and hypertext databases, I'm willing to suggest that the core notion of semiotics is in fact a useful engineering maxim, a True Theory of how humans behave in the context of symbolic systems. Like the laws of thermodynamics in energy systems, semiotics proposes a hard limit to the efficiency of any situation involving externalized representations of human thought. You can process character strings or other computational representations as long as you want, but just as the map is not the territory, the symbol is not the thought of its author, nor the thought elicited in an eventual reader. Even if all the ambiguity inherent in messy languages like English were eliminated, this would remain.
Anyone who has built or been a heavy user of text databases or similar systems has run into this problem repeatedly. Fancy lexical or statistical processing does help, as does the integration of information like link patterns, but in the end there is always significant and irreducible noise, which means one is either passing a certain amount of garbage to the output (from the view of the user), or dropping some amount of useful information. Nor is there a way out in using an artificial set of symbols, e.g., 'controlled terms', taxonomies and the like. In the large, with a heterogeneous set of users, these perform no better than grinding up plain text. Worse, they create a secondary problem of inconsistency among the indexers employing the artificial symbol set, letting the ambiguity of language in the original documents back in through the rear door (Cory 2.4). And far from being neutral, any such artificial attempt to expunge the complexity of natural language does so by embodying a particular theory of importance, an intrinsic point of view, that gains efficiency in one constrained setting at the price of being useless in others - with no way to tell the difference. (Cory 2.5)
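The inter-indexer inconsistency problem is easy to sketch. Below is a minimal, entirely invented example: two hypothetical indexers tag the same three documents against a shared controlled vocabulary, and we score their agreement with a simple Jaccard overlap. All document names and terms are made up for illustration; no real indexing study is implied.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two term sets: 1.0 = identical, 0.0 = disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Same three documents, indexed independently (invented data).
indexer_1 = {
    "doc1": {"networking", "security"},
    "doc2": {"metadata", "taxonomy"},
    "doc3": {"semiotics"},
}
indexer_2 = {
    "doc1": {"networking", "protocols"},
    "doc2": {"taxonomy", "ontology"},
    "doc3": {"linguistics"},
}

for doc in indexer_1:
    score = jaccard(indexer_1[doc], indexer_2[doc])
    print(f"{doc}: agreement = {score:.2f}")
```

Even with a fixed vocabulary and diligent indexers, agreement on anything abstract tends to be partial - which is exactly the natural-language ambiguity sneaking back in.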
Now why should we suspect that taking character strings and wrapping them in XML or RDF is going to change any of this? The syntactic sugar is all wonderful, and indeed a better mousetrap from the POV of systems integration, but the real basis for the blue-sky claims that we're approaching Semantic Web nirvana is bound up in the signifiers, the symbols, that are to be wrapped in that sugar. Is there some magic in angle brackets, not found in LISP parentheses, that will repeal human nature and semiotics? I think not. Call it a taxonomy, a controlled vocabulary, a metadata dictionary - it's all the same thing: yet another language, either small and brittle or large and ambiguous. Either way, just another layer on the Tower of Babel.
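The angle-brackets-versus-parentheses point can be made concrete. Here is a toy sketch - the vocabulary term 'friendOf' and both serializations are invented, and deliberately minimal - rendering the same (subject, predicate, object) triple two ways. The symbols are identical in either wrapper; only the punctuation differs.

```python
def as_xmlish(s: str, p: str, o: str) -> str:
    """Wrap a triple in toy angle-bracket markup."""
    return f"<{p}><subject>{s}</subject><object>{o}</object></{p}>"

def as_sexpr(s: str, p: str, o: str) -> str:
    """Wrap the same triple in toy LISP-style parentheses."""
    return f"({p} {s} {o})"

triple = ("alice", "friendOf", "bob")
print(as_xmlish(*triple))
print(as_sexpr(*triple))  # (friendOf alice bob)
```

Either way, whether 'friendOf' means drinking buddy, business contact, or lifelong confidant is decided by the humans at each end of the wire, not by the wrapper.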
Coming soon: One place where the French and the Chicago school agree: economic reasons why the Semantic Web is a crock.