
January 07, 2005



Re: Kevin Marks’ comments:

A Rosetta bot would seem to have a couple of basic needs:

0) sufficient metadata in blog posts to identify and describe a translation
1) discovery of translations without spidering the entire blogosphere

Perhaps the only real “requirement” is 0), since a complete poll of the blogosphere is theoretically possible. But my understanding is that this situation is exactly what syndication in general (and RSS in particular) is designed to avoid.

For requirement 0), some metadata may reside in the syndication feed, but I’m becoming convinced that more, perhaps most, is appropriate to add to the blog post itself. The guiding principle may be stated: “extend cleanly and don’t frell up existing specs and implementations”. This is where I think the XHTML usage practice outlined by Kevin really shines.

So to be clear, my reading of Kevin’s suggestion is that existing XHTML practice is to insert within the HEAD element of a hypothetical document “free_eq_bro.html” something like the following to indicate the existence of a French translation.

<link href="http://www.free.fr/lib_eg_frat.html" rel="alternate" lang="fr"/>

Let’s now consider that the French document is the original, and the English document is the translation. Further, let’s assume a profile which defines the additional LinkType values “original” and “translation”.

Therefore we might specify within the HEAD element of “free_eq_bro.html”:

<link href="http://www.free.fr/lib_eg_frat.html" rel="alternate original" lang="fr"/>

We may also find within the HEAD element of lib_eg_frat.html:

<link href="http://www.free.us/free_eq_bro.html" rel="alternate translation" lang="en-US"/>

If we find the former, it is equivalent to Tim’s TRANSLATED-BY tag; if the latter, to his TRANSLATES tag; and if both (a “bidirectional link”), we have a situation in which the translation may be taken to be even more securely “authenticated”.
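A bot checking the “bidirectional link” case might do something like the following. This is only a sketch, assuming the hypothetical “original”/“translation” rel values defined above, with the two example documents' HEAD contents stubbed in as strings:

```python
import re

# HEAD contents of the two hypothetical documents from the examples above.
pages = {
    "http://www.free.us/free_eq_bro.html":
        '<link href="http://www.free.fr/lib_eg_frat.html" rel="alternate original" lang="fr"/>',
    "http://www.free.fr/lib_eg_frat.html":
        '<link href="http://www.free.us/free_eq_bro.html" rel="alternate translation" lang="en-US"/>',
}

def links(head):
    """Extract (href, rel) pairs from LINK elements in a HEAD fragment."""
    return re.findall(r'<link href="([^"]+)" rel="([^"]+)"', head)

def bidirectional(url_a, url_b):
    """True if each document's LINK points back at the other."""
    a_points_b = any(href == url_b for href, _rel in links(pages[url_a]))
    b_points_a = any(href == url_a for href, _rel in links(pages[url_b]))
    return a_points_b and b_points_a

print(bidirectional("http://www.free.us/free_eq_bro.html",
                    "http://www.free.fr/lib_eg_frat.html"))  # True
```

A real bot would of course fetch and parse live pages rather than a dictionary of strings, but the reciprocity check is the same.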

With respect to the additional metadata for the translation authority, my reading of Kevin’s suggestion to follow the path of sematic XHTML is more or less the following: DO think about the data structure of the metadata in formal, semantic terms, DO NOT invent your own schema for basic data structures, DO use the XHTML data structures which have already been demonstrated, DO make the metadata human readable on the web page. Good suggestions all – I might play around with some example encodings to get a better idea of how this will work in practice.

For requirement 1), some sort of syndication format extension seems mandatory. Here again, the guiding principle should be, “extend cleanly and don’t frell up existing specs and implementations”. The relevance of XHTML to these formats is not direct, since, e.g., links are specified completely differently in RSS vs. HTML – the target URI is a text child of the “link” element in the former, and an “href” attribute value of the “a” or “link” element in the latter. So adopting XHTML conventions to extend RSS or Atom is as yet insufficiently motivated for me.
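The difference can be seen in how a bot would extract the target URI from each format. A quick sketch (the RSS item snippet is invented for illustration):

```python
import re
import xml.etree.ElementTree as ET

# In RSS, the target URI is the text child of the <link> element...
rss_item = """<item>
  <title>Example post</title>
  <link>http://www.free.fr/lib_eg_frat.html</link>
</item>"""
rss_uri = ET.fromstring(rss_item).findtext("link")

# ...while in HTML it is the value of an href attribute.
html_head = '<link href="http://www.free.fr/lib_eg_frat.html" rel="alternate" lang="fr"/>'
html_uri = re.search(r'href="([^"]+)"', html_head).group(1)

print(rss_uri == html_uri)  # same URI, two very different encodings
```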

For both 0) and 1), the real enabling / gatekeeper entities are the providers of blogging tools, since they will be generating the syndication feeds, the XHTML data, and (perhaps trickiest) the means by which bloggers and translators may specify the appropriate metadata. Kevin’s suggestion that the semantic XHTML approach allows the proper metadata for 0) to be embedded more or less “today” is encouraging, and perhaps we should concentrate our efforts here.



The size overhead of encoding Arabic in UTF-8 should be largely mitigated by GZIP encoding and transfer of the blog page or RSS feed.

With regards to blogging, this is where the transient nature of RSS becomes a real shame, as there is often metadata captured there that is lost to the HTML document. If the concept of “here’s a pointer to the metadata” for this posting were around, you could easily work in pure XML there and HTML back on the text side of things.

The HTML LINK tag is only valid in the HEAD section, which will obviously cause problems in the blog world.

And BTW - blog reading in Arabic:



With respect to the LINK element – yup. As I was drifting off last night I realized the limitation of the LINK element – it works well enough where the context of the HTML “document” is exactly one blog post, which is the case with many systems (MT, Scoop, Slash, and the newer Blogger blogs). However there are some systems where the permalink to a post points to an anchor in a larger page of HTML selected from the archives of the blog – the older Blogger blogs work this way, and some prominent bloggers (e.g. Andrew Sullivan) still use this system. The LINK element being restricted to the HEAD element will clearly fail here.

With respect to UTF-8, I agree the additional bytes shouldn’t create a huge obstacle in practice, especially considering compression, as you point out. If I were starting an Arabic language blog project, I’d use UTF-8, as the SoA-sponsored project does. However, I think it would be too limiting to restrict a Rosetta bot to scavenging UTF-8 texts. The blog posts themselves should be free to encode in whatever charset works for them, per existing practice. Extensions to the RSS structure should probably be restricted to UTF-8, again per existing practice.
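A bot that scavenges posts in whatever charset the blog declares would simply normalize everything to Unicode internally. A sketch (windows-1256 is a common legacy Arabic charset; the “fetched posts” are simulated here):

```python
# The same Arabic word as it might arrive from two different servers,
# each with its declared charset.
word = "\u0645\u062f\u0648\u0646\u0629"  # "مدونة"

fetched = [
    (word.encode("windows-1256"), "windows-1256"),  # legacy Arabic charset
    (word.encode("utf-8"), "utf-8"),
]

# Normalize: decode each post with its declared charset.
decoded = {raw.decode(charset) for raw, charset in fetched}
print(len(decoded))  # 1 -- both normalize to the same Unicode string
```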

On metadata, you write: “With regards to blogging, this is where the transient nature of RSS becomes a real shame as there is often metadata captured there that is lost to HTML document.” Yes, it would seem that any metadata included in the RSS should also be present in the original HTML – translating entity, original/translated text pointers, etc. You also write: “If the concept of ‘here’s a pointer to a metadata’ for this posting was around, you can easily work in the pure XML there and HTML back in the text side of things.” I’m not quite sure I understand what you mean here.


With regards to the last point:

The HTML and XML/RSS (or whatever) are views onto some underlying data. For example, on this blog the underlying data is stored in a MySQL database on some server.

Now, let's assume that we put all the metadata, including new stuff that Tim wants to define, in the XML file. That's what XML is for and it's a good place to put it. Now all we have to do is define some sort of linkage from an entry in the HTML file to the corresponding XML file.

Since we know multiple blog posts can exist within a single HTML file, we’ll have to be creative and “bend” HTML a tiny bit. For example (I’m using square brackets for angle brackets here), here’s a blog posting:

... entry 012345

Now, the vast bulk of the world that doesn’t care about the metadata will work exactly the same as it always has; tools that understand the meta:link tag will know where to look in the XML file based on the guid.
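David’s actual markup didn’t survive the comment form, but the mechanism he describes might look something like this. Everything here is a hypothetical reconstruction: the meta:link element, the guid value, the file name, and the metadata schema are all invented for illustration:

```python
import re
import xml.etree.ElementTree as ET

# A post in the HTML page, carrying a (hypothetical) meta:link element
# that points at the XML file holding its metadata, keyed by guid.
html_entry = """<div class="entry">
<meta:link guid="012345" href="metadata.xml"/>
... entry 012345 ...
</div>"""

# The XML side: full metadata for each entry, including translation info.
metadata_xml = """<entries>
  <entry guid="012345">
    <original href="http://www.free.fr/lib_eg_frat.html" lang="fr"/>
  </entry>
</entries>"""

# A metadata-aware tool extracts the guid from the HTML...
guid = re.search(r'<meta:link guid="([^"]+)"', html_entry).group(1)

# ...and looks up the corresponding entry in the XML file.
entry = ET.fromstring(metadata_xml).find(f"entry[@guid='{guid}']")
print(entry.find("original").get("href"))
```

Browsers ignore the unknown element, so the page renders as before; only tools that know about meta:link follow the pointer.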


> There are a small number of blogs that already publish bilingually

Mine is trilingual, Basque, Spanish, English. However it is NOT a translated blog: I post different things in different languages. Bilingual content does not necessarily mean parallel content. However, I see the point in this proposal of yours. Good luck.

Tim Oren

I think the only way out of this is to begin with a minimal/minimal case that will work for a human reader from the start, but must presume some sophistication on the part of Rosetta bots and their HTML scrapers. This is true even in the case where a LINK specifying a translation is anchored at the whole-document level and the docs are word-for-word translations. For a trivial example, if you follow the self-link to this very post, you don't just get its original content, you also get the sidebar gorp specified in my Typepad template, and all of the appended comments, which of course I have just modified. Using a normal embedded A link, one could also have the convention that it may point to an anchor, which will be the title or beginning of a post that is translated in the following text. Finding the end of each span would be a job for the bot, looking for clues like other headers and anchor points. Butt ugly from a data formalism point of view, but nothing worse than what the crawlers for Google, Technorati and others cope with already. And perfectly obvious to a human reader.
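The scraping convention Tim describes might be sketched as follows. The markup and anchor names are invented, and a real bot would need far more robustness (malformed HTML, other kinds of boundary clues) than this toy heuristic:

```python
import re

# An archive page with two posts; the first is a translation, announced by
# a named anchor marking where the translated span begins.
page = """<a name="post-1"></a><h3>Titre traduit</h3>
<p>Premier paragraphe traduit.</p>
<p>Deuxieme paragraphe traduit.</p>
<a name="post-2"></a><h3>Unrelated next post</h3>
<p>Other content.</p>"""

def span_after_anchor(html, anchor):
    """Text from the named anchor up to the next anchor -- the bot's
    heuristic for finding the end of a translated span."""
    start = html.index(f'<a name="{anchor}">')
    rest = html[start + 1:]
    nxt = re.search(r'<a name="', rest)
    end = start + 1 + nxt.start() if nxt else len(html)
    return html[start:end]

span = span_after_anchor(page, "post-1")
print("post-2" in span)  # False -- the heuristic stops at the next anchor
```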

David's suggestion looks like the next step beyond the trivial. I believe it would only require that the various bloggers and blog platforms mod their post templates to generate what amounts to span anchors (and verify that the resulting behavior is innocuous in browsers). My experience is limited to Radio and Typepad. Doing this would be fairly easy in Radio; I haven't tried hand-coding Typepad templates, so I don't know there.

Re RSS as such, since we started the whole discussion there: My guess is that RSS/syndication in general is going to evolve in the direction of being a 'signaling channel' - though that terminology may make some twitch. The evolutionary pressure will come from a combination of need for readers/aggregators to help users cut through the clutter, and a need for some authors to syndicate out samples while selling advertising around the original. Both point towards RSS-like feeds evolving towards being a combination of metadata and samples/summaries of underlying media. If that's right, then our richer metadata concepts would more logically fit into that stream, rather than trying to get too elaborate within the framework of HTML.



A moderate amount of upgrading would have to be done to Movable Type configurations to accomplish what I'd like. All the code is in MT, but the upgrade is hard.

Convincing people to do it will be hard -- it's the chicken and egg problem. Still, maybe I'll hack away at my templates and see what can be produced.


One further comment -- if you're actually getting some traction on your ideas above, consider setting up a Yahoo mailing list.

Luke Razzell

Interesting stuff... Microschemas would certainly seem to be a potentially fruitful approach to the distributed translation challenge. I guess what I wanted to communicate in my post was the way in which translation on blogs is also just an extension of what we do already in our blogging. So particular bloggers might get known for providing a translation window into a particular topic area or areas, and for the quality (or otherwise) of their translation. Topics and quality are subjective, not objective, attributes, which is why I feel the fluid, neural-network-like behaviour of the blogosphere and free-tagging services like del.icio.us can nicely complement the kind of pre-structured metadata you are discussing in facilitating distributed translation across blogs.

Tim Oren

Luke - I at least am trying to be agnostic about the human systems that will facilitate blog translation, and also want to have enough hooks for machine facilitation.

One of the fine dances in doing standards definition - and that's what we are doing, in a tiny way - is putting in enough capability and mechanism to make a difference in the tasks you are trying to facilitate, but without incorporating more 'policy' than absolutely required to get traction. 'Policy' is choices that embody assumptions about predominant usage patterns, availability of supporting technical, economic, or social infrastructure, etc. The more policy you buy into, the better the chance your effort will end up being a 'dead standard' - an irrelevant thing of beauty, because you guessed wrong about adoption. Having been part of early hypertext formalism efforts, and having tracked standardization efforts of the time (late 80s - early 90s), I'm acutely aware of this: there were some things of elegance created, all rendered irrelevant - except as idea mines - by this hack called HTML. In an open standards/source world, minimalism and incrementalism are the winning patterns.

Now, that said, I suspect you are right about the human systems. If you look at the blogs that already appear in translation - some are cited here and on other threads - they are from bloggers who have set themselves up as cultural bridges of a sort - to France, Iraq, even Mauritius. That strikes me as a specialization that's sustainable in a human (and maybe economic) sense, whereas the idea of "let's get a bunch of bilingual volunteers and have them translate whatever's hot" seems to me likely to fail: it's just a recipe for a grind, the kind of thing people want to be paid to endure. So, if we're both right, the (human) translated blogs are going to end up being embedded in the overall blogospheric web, and will need to participate in whatever tagging systems evolve there. They can't be an island, or they will have failed by definition. Hence the urge to borrow as much existing bloggy infrastructure as possible, and add as little policy as possible, perhaps to the extent that the output is merely a set of recommended usages of existing specs.

Luke Razzell

Glad I checked back in for your comment, which makes much sense. Wish it could auto-appear on my blog post's comments (but that's another discussion... Kevin? : ).

Mary Anne Martin

This blog posting was of great use in learning new information and also in exchanging our views. Thank you.


