Language Representation
K: The lang attribute does this, in the head of the document you are reading, and as an attribute on the link.
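A rough sketch of one plausible reading of both placements, in ordinary HTML; the URL and titles are invented for illustration:

    <html lang="en">
      <head>
        <!-- The document declares its own language at the top. -->
        <title>Race Report</title>
      </head>
      <body>
        <!-- A link to the same material in another language carries hreflang. -->
        <a href="http://example.org/rapport-fr.html" hreflang="fr" lang="fr">Rapport de course</a>
      </body>
    </html>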
Translation Descriptions
K:
These belong as metadata in the translated document, probably as explicit human readable text. The XOXO definition list model might be useful here (a rough sketch follows the list below).
- Translation type: Manual, Automatic. More flavors?
- Translation authority: Who or what did it. What existing designators can be coopted?
- Translation time and date stamp, and perhaps an MD5 hash of the original. This is a placeholder for the whole versioning can o' worms. If the original is edited or updated, we have a state consistency problem...
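One way the definition-list idea might carry these fields; every field name and value here is invented for illustration, not settled vocabulary:

    <!-- Embedded in the translated document's HTML. -->
    <dl class="translation-metadata">
      <dt>translation-type</dt>      <dd>manual</dd>
      <dt>translation-authority</dt> <dd><a href="http://example.org/people/jdoe">Jane Doe</a></dd>
      <dt>translation-date</dt>      <dd>2005-02-14T09:30:00Z</dd>
      <dt>original-md5</dt>          <dd>9e107d9d372bb6826bd81d3542a419d6</dd> <!-- placeholder hash -->
    </dl>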
T: Kevin is essentially saying that all of these, and optionally the language description, should be factored onto the documents themselves. I can see that as being best for the minimal definition, but it also leaves one at the mercy of the standards & practices (or lack thereof) applied in each document. The nominal and robust version should provide the capability for external description, which allows for after-the-fact markup of existing translations, e.g., our examples in Pt. 1 of already translated blogs and news sites.
L: With regard to your bullet, “Translation type: Manual, Automatic.” This raises the question: What is a translating entity? I think perhaps a more illuminating way of asking the question is this: “who/what is taking responsibility for this translation?”
I can see only two sorts of answers to this question: 1) a human entity, either individual or collective, or 2) an MT system. Therefore, in defining the set of values which a “translator-type=” attribute may take on, a closed set of values, {human, machine}, would appear sufficient. Within each of these two categories of translator (human, machine), there is room for specific annotations.
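In attribute form, that closed set might look like the following; the attribute name comes from the discussion above, but the surrounding markup is only a guess:

    <!-- Hypothetical placement: the attribute sits on the element naming the
         translating entity, and may take only these two values. -->
    <dd translator-type="human">...</dd>
    <dd translator-type="machine">...</dd>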
Authority
L: The primary identifier for the human translating entities must be globally unique, without requiring a universal registry. It appears that email addresses should work fine, but there is the problem of spam email address harvester bots. A URL may also be used. More generally, a URI, likely limited to a subset of the allowed schemes, looks to be sufficient. (N.B.: I had to remind myself what the whole URL/URN/URI mess was about; I found this helpful.)
Of course, some sort of alternate human readable label (e.g., a “name”!) should probably also be present, analogous to the usage of the dc:creator tag. This would be for display in “readers” (UI clients). I would suggest optional support for (digitally) signing the translation. More on that below.
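A sketch of the identifier-plus-label idea, by analogy with dc:creator; the field name and URL are invented:

    <dt>translation-authority</dt>
    <dd>
      <!-- A globally unique URI as the primary identifier,
           with a human readable name for display in readers. -->
      <a href="http://example.org/people/jdoe">Jane Doe</a>
    </dd>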
T: This seems like a good place for recommending one or two identity schemes, and letting well enough alone. The whole identity infrastructure problem is well beyond our scope.
L: Other possible attributes for human translators: “accredited-by”, “reviewed-by”. Once again, these would be identified by URIs and also possibly (cryptographically) signed.
In the case of a “pure” machine translation, ancillary annotations could include the version information, and any other information which could enable the reproduction of the results (for example, corpus based MT might have separate versions for the underlying software, and the actual corpus used in the training set). An instance of an MT system identifier can be used in two places: as the “pure” translating entity, or as an attribute to the human translator identifier.
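Putting those pieces together as a hedged sketch, with every attribute name, version string, and URL invented for illustration:

    <!-- Human translator, with accreditation and review annotations. -->
    <dd translator-type="human"
        accredited-by="http://example.org/translators-guild"
        reviewed-by="http://example.org/people/editor">
      <a href="http://example.org/people/jdoe">Jane Doe</a>
    </dd>

    <!-- "Pure" machine translation: enough detail to reproduce the result. -->
    <dd translator-type="machine"
        mt-system="http://example.org/mt/engine"
        mt-version="2.3"
        mt-corpus="http://example.org/corpora/news-2004">
      ExampleMT
    </dd>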
Versioning, etc.
L: “Versioning can of worms”. There are two different versioning problems from the point of view of Rosetta bots. If the original article is revised, this is only an issue if the Rosetta bot does not harvest the original in time. In this case, the hash is a good validator. If the translation is updated (and the original article is unchanged), presumably it has been improved, and so should invalidate the previous copy, in order to improve the quality of the corpus.
T: From the bot perspective, undoubtedly. But from the human perspective, seeing the partially obsolete translation is probably better than nothing.
L: For all translating entities, the date of the translation should be included.
Other Issues: PKI, rights, ontologies
NB: If you just care about currently active issues, you might want to stop right here. I believe we've agreed that the following are important, but orthogonal to the needs of the minimal translation effort, and certainly nothing to be put on critical path. But just in case you're curious or want to take up the cudgels...
L: If the translation is signed, all the attributes mentioned above should be included in the cryptographic hash which is signed, to prevent spoofing of the attributes (a sketch follows the two points below). Why worry about that, since the original itself could be spoofed? There are at least a couple of issues:
1) Human readers would like to know that the entity claiming to have translated a given posting really is who they claim to be.
2) Rosetta ‘bots may want to validate signatures and certificates to help guard against pollution of the corpus.
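A rough sketch of how a signature might sit alongside the other metadata; the field name and the choice of PGP armor are placeholders, not a settled format:

    <dt>translation-signature</dt>
    <dd>
      <!-- Hypothetical: the signed digest covers the translated text plus
           every metadata field above, so none can be swapped out later. -->
      <pre>
      -----BEGIN PGP SIGNATURE-----
      ...
      -----END PGP SIGNATURE-----
      </pre>
    </dd>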
With respect to digital signatures, the ugly PKI issue has to be faced. I looked at this in the context of another little side project and concluded the following:
1) There’s no perfect solution and no likelihood of me creating one and getting it adopted.
2) People and organizations have to some extent adapted to the solutions which are out there.
3) X.509 certificates need to be accommodated, due to their use in the corporate world and their increasing use via S/MIME.
4) PGP keys should be accommodated for those who don’t need/want X.509-based solutions, and/or already have such keys and trust webs.
5) I’m not aware of any other standards, de-facto or formal, which really need accommodation.
There is arguably some danger in using PGP keys (or any signature scheme for that matter) since the actual guarantees such signatures make are commonly misunderstood. Further, revocation of certificates is commonly ignored. I don’t think this should prevent an effort to specify the option of digitally signing the translation, for the people who care enough to do it right. Further, from the perspective of creating large corpuses of translated material, commercial companies may be interested in validating those translations to some degree to prevent spoofing. Finally, keys, certificates, and signatures will in some cases be too “heavy” for the RSS feeds themselves, and probably need to comprise part of the content, i.e. be findable in some way in the HTML produced by blogging tools. Since digitally signed posts are an interesting topic in its own right, maybe this can be separated out.
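One way to keep the heavy material out of the feed itself, sketched here with invented rel values, would be to point from the post's HTML to a detached signature and certificate:

    <!-- In the post's HTML, not in the RSS item. The rel values are made up;
         the MIME types are conventional but shown only as an example. -->
    <link rel="translation-signature" type="application/pgp-signature"
          href="http://example.org/posts/race-report.fr.html.asc" />
    <link rel="translation-certificate" type="application/x-x509-user-cert"
          href="http://example.org/keys/jdoe.crt" />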
T: I love playing with the crypto stuff - have been involved with a couple security companies and the early Cypherpunks. However, this should not be on any critical path for adoption in this project. The PK stuff in particular has wandered between the social and technical SPOF represented by root cert authorities, and the failure of trust networks to be transitive in the case of PGP. If we could, as a side effect, create a little more pull for pragmatic solutions, wonderful. But I sure don't want to depend on it for deployment.
L: With respect to the crypto stuff – yes, I agree it should not be on the critical path to adoption. I started thinking about it mainly from the point of view of gathering “squeaky clean, commercial grade” MT training sets in the future, and making sure that the initial steps didn’t preclude digital signatures and authentication down the line through lack of clean extension points, etc. So while I agree it isn’t worth requiring now or even formalizing now, it is worth thinking about now.
L: Copyright / License issues: Some sort of tagging with respect to license would seem to be in order. Some people might object to people using their translations for commercial benefit, either direct (e.g. harvesting the translations to sell) or indirect (serving as part of a corpus for an MT system). Further, what about the copyright issues inherent in the original work? To revisit my canonical bikeracing example (which I like because they’re the only translations I’ll likely be doing myself), if I publish translations of bikeracing news from Yahoo France, I’m probably violating someone’s copyright. Given contemporary practice, I’m not too worried about this, but it could be a concern for those who would create a commercial MT system trained on translated material which was copyrighted in the original language. (RDF/Dublin Core has a “rights” tag defined, which I see the “RSS 2.0” feed of Instapundit uses.)
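The dc:rights element is standard Dublin Core; the license choice and URLs below are only an example of how an RSS 2.0 item might carry it (required channel elements are trimmed for brevity):

    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <channel>
        <item>
          <title>Yahoo France bike racing roundup (English translation)</title>
          <link>http://example.org/translations/2005-02-14-bikeracing.html</link>
          <!-- A license URL, e.g. a Creative Commons license, as the rights statement. -->
          <dc:rights>http://creativecommons.org/licenses/by-nc/2.0/</dc:rights>
        </item>
      </channel>
    </rss>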
T: This seems to be a job for Creative Commons or someone who knows what they are doing in this domain. IANAL, but there are some interesting issues re how far the rights of the original document owner go. Most likely they adhere to a word-for-word translation of a large amount of material, as with a translated book. How about an automatic, draft quality MT output of a portion of a doc? I don't know. The situation is even less clear for an MT engine that is trained by running over many translations. The only precedent I can think of is a directory, and I think it's generally true that pointing to an original work is no infringement. I'd say punt this for now - it's more an issue for Rosetta bot operators than for this spec.
T: Re the hierarchy/ontology issue, I've never been a fan of them, due to my background in statistical text processing (primarily retrieval). (And the thought of trying to create consensus ontologies multi-lingually / multi-culturally just makes my head hurt :) I suggest this is a good issue to punt, in the form of just assuming that this spec and Rosetta bots could ride on top of anything that develops out of the RDF/OWL 'folksonomy' movement - to be specific, that any categorization information would adhere to the RSS entity, rather than be expressed in the translation tags. I think that realistically anyone using a Rosetta bot to train an MT engine is going to run their own categorizer anyway, whether it's manual or statistical. There are a lot of available statistical methodologies for topical clustering, and the NIST/DARPA guys continue to fund this area under the name of 'Topic Detection and Tracking', so we can assume some continued progress.
L: I saw your post on public ontologies – yes, universal ontology isn’t really possible, and statistical clustering as opposed to human categorization will be the way to go. Still, topic clustering and categorization of some kind has so much potential to improve MT (in my humble, naïve, conjecturing opinion) that it’s worth pursuing. But not at the level of impacting the RSS tag definition – it seems orthogonal to me.