This post is based on a comment by the pseudonymous Lewy14 over on Winds of Change, suggesting a small and pragmatic first step towards the inter-language blogosphere I've been writing about recently. Specifically, the idea is an RSS tag, or something of the sort, that would mark posts saying the same thing in different tongues and serve as bait for aggregators and crawlers interested in that information. What follows are some initial findings on existing relevant work (steal early and often) and notes on proposed requirements. Comments and further contributors welcome.
(Note to regular readers: There will be more VC posting to come, but my inner geek is emerging for a while. And I have a history of playing around with hypertext standards.)
Existing possibly relevant bits
- ISO 639-2 language codes
- RFC 1766 tags using language codes
- Useful compendium of RSS specs, some emphasis on character set and language issues
- WorldWideLexicon project. As the name indicates, more oriented to lexicography than translation of documents. Description speaks of organizing a human network of translators, but nothing seems to have been done. Has created web services wrappers for existing online MT systems. Click here to try out the Google gateway or here to try out the Babelfish gateway. Copy and hack the URLs to translate other words, and guess other language codes. Both Google and Babelfish run the same Systran MT engine, which doesn't have Arabic - yet. More about that in another post.
- Here's another Web Services MT wrapper and the SOAP interface. Project description here and followup here. Bottom line: It's also calling Google, so it's Systran as well. (Hat tip: Rob Chartier)
Quickie Requirements
- Need both TRANSLATES and TRANSLATED-BY flavors. Since the former can be spoofed, the latter form embedded in the original doc will have more credibility.
- Need Source and Target URLs. Should be able to point at whole docs or tagged spans (posts) within docs. Arbitrary linkage problematic due to limits of good ol' HTML.
- Source and Target languages, in ISO 639-2
- Translation type: Manual, Automatic. More flavors?
- Translation authority: Who or what did it. What existing designators can be coopted?
- Translation time and date stamp, and perhaps an MD5 hash of the original. This is a placeholder for the whole versioning can o' worms. If the original is edited or updated, we have a state consistency problem...
- Should do something useful in contemporary browsers; shouldn't rely on RSS readers/aggregators being available in all target languages
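To make the list above a bit more concrete, here is a rough Python sketch of what one translation link record might carry, plus the MD5 check for the "original was edited" problem. Everything named here (TranslationLink, make_link, is_stale, the field names, the example.org URLs) is my own invention for illustration, not an existing spec or library. The language check only tests that a code has the three-letter ISO 639-2 shape; it does not look anything up in the registry.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical vocabularies; the requirements leave "more flavors?" open.
RELATIONS = {"TRANSLATES", "TRANSLATED-BY"}
KINDS = {"manual", "automatic"}
# Shape check only: ISO 639-2 codes are three lowercase letters.
ISO_639_2_SHAPE = re.compile(r"[a-z]{3}")


@dataclass(frozen=True)
class TranslationLink:
    """One assertion that target_url is a translation of source_url."""
    relation: str            # "TRANSLATES" or "TRANSLATED-BY"
    source_url: str          # whole doc, or doc#fragment for a tagged span
    target_url: str
    source_lang: str         # ISO 639-2, e.g. "eng"
    target_lang: str         # ISO 639-2, e.g. "fra"
    kind: str                # "manual" or "automatic"
    authority: str           # who or what did the translation
    translated_at: datetime  # time stamp of the translation
    source_md5: str          # hex MD5 of the original text at that time


def md5_hex(text: str) -> str:
    """Hex MD5 digest of the UTF-8 encoding of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def make_link(relation, source_url, target_url, source_lang, target_lang,
              kind, authority, source_text):
    """Validate the fields and stamp the record with time and hash."""
    if relation not in RELATIONS:
        raise ValueError(f"unknown relation: {relation!r}")
    if kind not in KINDS:
        raise ValueError(f"unknown translation type: {kind!r}")
    for code in (source_lang, target_lang):
        if not ISO_639_2_SHAPE.fullmatch(code):
            raise ValueError(f"not a three-letter language code: {code!r}")
    return TranslationLink(
        relation=relation,
        source_url=source_url,
        target_url=target_url,
        source_lang=source_lang,
        target_lang=target_lang,
        kind=kind,
        authority=authority,
        translated_at=datetime.now(timezone.utc),
        source_md5=md5_hex(source_text),
    )


def is_stale(link: TranslationLink, current_source_text: str) -> bool:
    """True if the original no longer matches the hash taken at translation."""
    return md5_hex(current_source_text) != link.source_md5


# Usage with made-up URLs and a made-up authority name.
link = make_link(
    "TRANSLATED-BY",
    "http://example.org/blog/post-1",
    "http://example.org/blog/post-1-fr",
    "eng", "fra", "manual", "some-translator",
    "Hello, world.",
)
print(is_stale(link, "Hello, world."))  # False: original unchanged
print(is_stale(link, "Hello, world!"))  # True: original was edited
```

The hash only detects that the original changed, not what to do about it, so the versioning can o' worms stays open. It also says nothing about who gets to assert the link, which is why the TRANSLATED-BY form embedded in the original remains the more credible of the two.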