It's amazing what deceased French monarchs find to do with their time. 'Lewy14' and I have been talking about the Minimal Blog Translation project in e-mail, and with his permission I've edited our discussion into a post, nominally organized by topic. He's L, I'm T. I'm also interspersing a few counterpoints excerpted from Kevin Marks' response to my original - keep in mind he wasn't party to Lewy's and my exchange. He's K, and I've invited him to comment further.
Rosetta Bots
L: I think it would be useful to state some kind of overall requirement to the effect that the extensions should enable the programmatic harvesting of original/translation pairs, for the purpose of compiling a corpus used to train MT systems. (I’ll be calling these harvesters “Rosetta bots”; other suggestions welcome). Some requirements, especially regarding tricky issues like “tagged spans”, can reference back to this requirement.
T: I love 'Rosetta bots' as terminology. Motion made, seconded, carried. One other thing to keep in mind, other than spoon feeding MT training sessions, is that having a bot-built repository be searchable would be an excellent bootstrap for a human translation memory/assistant system and network - see TRADOS for an example of what I mean. This indicates a need for a little more thought about permissions for archiving, though the Archive.org guys seem to handwave it and survive. (NB: More about permissions later).
Existing Bilingual Blogs and Sites
L: There are a small number of blogs that already publish bilingually. “Merde in France” used to. Dissident Frogman still does. Interestingly, Iraqi blog Hammurabi does as well, although the correspondence between posts is not obvious as the posts are not side by side. FWIW, the latter is encoded in Arabic-Windows 1256, and displays just fine in Firefox and IE. (Wow, CNN publishes in Arabic as well. If only the translations were marked somehow...)
T: There are also translations at Sarmad's Road of a Nation Forums, though that's been sporadic since he got himself hired by a multinational engineering company. Faiza of the (in)famous Jarrar family also publishes bilingually, though some of the Arabic sections don't seem to render in my (Mac Safari) browser.
Existing Arabic Character Set Practices
Some of the blogs I’ve looked at which are in “two byte” languages are still basically published using one byte characters, e.g. Arabic: windows-1256. There is also Arabic ISO-8859-6, MacArabic, PC Arabic, and two different kinds of IBM Arabic (IBM, IBM Windows). All of these can be read by Java. See more here.
Thinking about it, publishing an Arabic blog in UTF-8 would nearly double the bandwidth, because Arabic characters (Unicode code points U+0600 to U+06FF) are each encoded as two bytes in UTF-8.
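The size claim is easy to check. Below is a minimal Python sketch, assuming an arbitrary sample phrase (written with escapes so it survives any editor) and Python's built-in `cp1256` and `utf-8` codecs. Note that spaces and HTML markup stay at one byte in UTF-8, so a real page grows by less than a factor of two.

```python
# Compare the encoded size of a short Arabic sample in a one-byte
# charset (windows-1256) and in UTF-8. The sample is arbitrary:
# "marhaban bil-alam" ("hello world"), 12 Arabic letters and 1 space.
sample = "\u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645"

cp1256_bytes = sample.encode("cp1256")  # one byte per character
utf8_bytes = sample.encode("utf-8")     # two bytes per Arabic letter, one per space

print(len(sample), len(cp1256_bytes), len(utf8_bytes))  # prints: 13 13 25
```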
On the other hand, perusing the Unicode standard for the Arabic code points, it’s apparent even to a completely naïve reader (me) that the one byte encodings (e.g. ISO-8859-6) are a bit cramped and crippled compared to what a true Unicode based system could do, especially for Arabic as written by folks who live some distance from the actual peninsula. See this PDF. For example, although Farsi is nominally written in Arabic script, there exist two flavors of “Farsi” one byte charsets, both labeled as IBM charsets (Cp1097 and Cp1098). Pulling up the BBC Persian site the encoding is UTF-8. So is the blog by “hoder”. N.B. Farsi support is the least of their problems.
The RSS 2.0 spec is silent on the topic of legal character encodings. RSS 0.91 specifies a fairly short list which includes UTF-8 but does not include, e.g., windows-1256. Therefore an RSS syndication of an Arab blog would likely need to use UTF-8, even if the main blog used a compact (1 byte) Arabic charset, because RSS syndicators / readers might balk at “nonstandard” charsets.
T: The Spirit of America Arabic blogging tool will be using UTF-8. According to a contact at an Arabic MT company, most existing Arabic content is in either UTF-8 or cp1256. (Hat tip: Janice)
Standards Base
L: As to actually extending RSS, RSS 2.0 can be extended “legally” using XML namespaces. See here. But, wow, there seems to be a lot of smoke and fire in syndicationland.
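As a rough illustration of the namespace route, here is a Python sketch of a feed item carrying one extension element, and of a Rosetta bot reading it. The namespace URI, the `xl:translates` element name, and the example.org URLs are all hypothetical, invented for this sketch; nothing like them has been agreed in this discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and element name, for illustration only.
XL_NS = "http://example.org/ns/translation"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # the standard xml: prefix

feed = """<rss version="2.0" xmlns:xl="http://example.org/ns/translation">
  <channel>
    <title>Example blog</title>
    <link>http://example.org/</link>
    <description>Demo feed</description>
    <item>
      <title>Free, equal, brothers</title>
      <link>http://example.org/free_eq_bro.html</link>
      <xl:translates xml:lang="fr">http://example.org/lib_eg_frat.html</xl:translates>
    </item>
  </channel>
</rss>"""

def find_translations(feed_xml):
    """Return (post URL, original URL, original language) for each marked item."""
    pairs = []
    for item in ET.fromstring(feed_xml).iter("item"):
        marker = item.find(f"{{{XL_NS}}}translates")
        if marker is not None:
            pairs.append((item.findtext("link"), marker.text,
                          marker.get(f"{{{XML_NS}}}lang")))
    return pairs

print(find_translations(feed))
```

A reader that does not know the namespace simply ignores the extra element, which is the point of extending this way.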
K: This set off my semantic XHTML radar. Surely we can express this with a rel attribute on a link? A quick rummage finds me existing specification text at W3C: Alternate: Designates substitute versions for the document in which the link occurs. When used together with the lang attribute, it implies a translated version of the document.
L: What about having the full text of the post in the syndicated XML? This would arguably make the XPaths more robust. Of course, solutions can’t rely on full text in the RSS feed as it may not be provided by the original post at all.
T: We probably already need to 'tier' the requirements statements. Rough notion:
- Minimal: Amenable to hand editing. No code or MT support required. Useful with contemporary HTML browsers and static page stores. Can be crawled by Rosetta bots, but bot is responsible for all meta-tasks such as authority, integrity, etc. Should make sense when scraped out of the Internet Archive a few years hence.
- Nominal: Meets basic Rosetta bot needs for authority, integrity, versioning, but doesn't require a tie into cryptographic and identity infrastructures
- Robust: Includes ties to cryptographic or other robust infrastructures for identity, integrity, archive.... (NB: This is a forward reference to later topics)
TRANSLATES/TRANSLATED-BY and Trust
L: TRANSLATES vs TRANSLATED-BY tags: In your post, you state: “Since the former can be spoofed, the latter form embedded in the original doc will have more credibility.” I’m not sure this holds universally. If I author a document in English, and I don’t know Farsi, my faith in a translation into Farsi is dependent on my trust in the translator. Perhaps the author is in a better position to judge, perhaps not. In any case, the TRANSLATED-BY translation would have to be made available at the time the RSS syndication is published, or else the RSS syndication would have to be updated.
T: I believe it's a strong requirement that we have some degree of robustness against 'translation spam' at the outset, presuming nothing stronger than MD5s, and without a corresponding strong identity or PK infrastructure. That's why I'm saying that TRANSLATED-BY is stronger: if you don't trust the place from which you're pulling the original feed, you've got problems we're not going to solve.
L: This issue of authentication of translations is a bit problematic given the lack of authentication of most blog posts themselves. Why worry about the translation when the post itself could have been spoofed? Here there are at least a couple issues:
1) Human readers would like to know that the entity claiming to have translated a given posting really is who they claim to be.
2) Rosetta ‘bots may want to validate signatures and certificates to help guard against pollution of the corpus.
K: Bidirectional links can affirm an authoritative translation, as in XFN's me attribute. We could perhaps add original and translation values for rel if we define a new profile.
Source and Target URLs
K: The rel does this, with an implicit reference to the document you are reading. If you want subsections a <blockquote cite="..."> could be used.
L: The whole tagged range thing in particular, and the HTML scraping aspect in general, has me thinking ahead a bit. On my previous encounter with the HTML scraping task, I got the idea to send the HTML through this program called Tidy, which turned it into compliant XML. Then I could fish around for invariant XPaths which would delimit what I wanted to scrape out, and use XSLT or some programmatic implementation of XPath to extract the data. From examining the HTML generated by a number of Blogger and Movable Type blogs, I’m betting that a decent scraper could be constructed from XPaths which use navigation relative to some of the key div elements. I’m also noticing that all syndication formats are not created equal. Worst case, it may be that there are few enough blogging tools which cover enough of the blogging universe so that a Rosetta bot could just hack its way through, by “knowing” how to scrape the main text out of posts. My conjecture is that XPaths could be used reliably in this way to describe and delimit excerpts of posts, when only fragments of posts are translated, at least down to the paragraph level.
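A minimal Python sketch of the XPath idea follows, assuming the page has already been cleaned into well-formed XML (by Tidy or similar). The `entry` and `entry-body` class names and the post ids are hypothetical stand-ins for whatever a given blogging tool's template actually emits. The standard library's ElementTree supports only a subset of XPath, which is enough here.

```python
import xml.etree.ElementTree as ET

# Assumed input: markup already tidied into well-formed XML.
# Class names and ids below are hypothetical template markers.
page = """<html><body>
<div class="sidebar"><p>Blogroll and other gorp</p></div>
<div class="entry" id="p001">
  <h3>First post</h3>
  <div class="entry-body"><p>Paragraph one.</p><p>Paragraph two.</p></div>
</div>
<div class="entry" id="p002">
  <h3>Second post</h3>
  <div class="entry-body"><p>Only paragraph.</p></div>
</div>
</body></html>"""

def scrape_paragraphs(xhtml, entry_id):
    """Return the paragraph texts of one post, located relative to its key div."""
    root = ET.fromstring(xhtml)
    path = f".//div[@id='{entry_id}']/div[@class='entry-body']/p"
    return ["".join(p.itertext()).strip() for p in root.findall(path)]

print(scrape_paragraphs(page, "p001"))  # prints: ['Paragraph one.', 'Paragraph two.']
```

Because each paragraph comes back as a separate item, the same path plus an index could delimit an excerpt down to the paragraph level.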
T: One further comment on requirements in the Rosetta bot case: Training corpus based MT engines requires parallel texts that are aligned. It's not clear to me what granularity of alignment is required - more research is needed - but at any rate it would be a shame to lose the alignment that's implicit in the usually 'chunky' form of blog posts, and it might be necessary to support finer alignment in the robust case.
To Be Continued...
Long enough already. Topics to follow: Language specification. Factoring of the remaining requirements - link or document? Translation type and translator identity. Translation and post integrity and versioning. Orthogonality to PKI, ontologies, and other infrastructure.
Re: Kevin Marks’ comments:
A Rosetta bot would seem to have a couple of basic needs.
0) sufficient metadata in blog posts to identify and describe a translation
1) discovery of translations without spidering the entire blogosphere
Perhaps the only real “requirement” is 0), since a complete poll of the blogosphere is theoretically possible. But my understanding is that this situation is exactly what syndication in general (and RSS in particular) is designed to avoid.
For requirement 0), some metadata may reside in the syndication feed, but I’m becoming convinced that more, perhaps most, is appropriate to add to the blog post itself. Here, the guiding principle may be stated: “extend cleanly and don’t frell up existing specs and implementations”. Here I think the XHTML usage practice outlined by Kevin really shines.
So to be clear, my reading of Kevin’s suggestion is that existing XHTML practice is to insert within the HEAD element of a hypothetical document “free_eq_bro.html” something like the following to indicate the existence of a French translation.
<link href="http://www.free.fr/lib_eg_frat.html" rel="alternate" lang="fr"/>
Let’s now consider that the French document is the original, and the English document is the translation. Further, let’s assume a profile which defines the additional LinkType values “original” and “translation”.
Therefore we might specify within the HEAD element of “free_eq_bro.html”:
<link href="http://www.free.fr/lib_eg_frat.html" rel="alternate original" lang="fr"/>
We may also find within the HEAD element of “lib_eg_frat.html”:
<link href="http://www.free.us/free_eq_bro.html" rel="alternate translation" lang="en-US"/>
If we find the former, this is equivalent to Tim’s TRANSLATES tag; if the latter, equivalent to Tim’s TRANSLATED-BY tag, and if both (“bidirectional link”), we have a situation in which the translation may be taken to be even more securely “authenticated”.
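Here is a sketch of how a Rosetta bot might test for the one-way and bidirectional cases, using Python's standard `html.parser`. It assumes the “original” and “translation” LinkType values from the hypothetical profile above; the function names and the result labels are invented for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (rel tokens, href, lang) for every LINK element seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <link .../> elements.
        if tag == "link":
            a = dict(attrs)
            rels = set((a.get("rel") or "").lower().split())
            self.links.append((rels, a.get("href"), a.get("lang")))

def links_with_rel(html, rel):
    """Return the hrefs of all LINK elements whose rel list contains rel."""
    parser = LinkCollector()
    parser.feed(html)
    return {href for rels, href, _ in parser.links if rel in rels}

def classify(translation_url, translation_html, original_url, original_html):
    """Report which direction(s) of the translation claim are present."""
    claimed = original_url in links_with_rel(translation_html, "original")
    acknowledged = translation_url in links_with_rel(original_html, "translation")
    if claimed and acknowledged:
        return "bidirectional"
    if acknowledged:
        return "acknowledged by original only"
    if claimed:
        return "claimed by translation only"
    return "unlinked"

english = ('<html><head><link href="http://www.free.fr/lib_eg_frat.html" '
           'rel="alternate original" lang="fr"/></head></html>')
french = ('<html><head><link href="http://www.free.us/free_eq_bro.html" '
          'rel="alternate translation" lang="en-US"/></head></html>')

print(classify("http://www.free.us/free_eq_bro.html", english,
               "http://www.free.fr/lib_eg_frat.html", french))  # prints: bidirectional
```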
With respect to the additional metadata for the translation authority, my reading of Kevin’s suggestion to follow the path of semantic XHTML is more or less the following: DO think about the data structure of the metadata in formal, semantic terms, DO NOT invent your own schema for basic data structures, DO use the XHTML data structures which have already been demonstrated, DO make the metadata human readable on the web page. Good suggestions all – I might play around with some example encodings to get a better idea of how this will work in practice.
For requirement 1), some sort of syndication format extension seems mandatory. Here again, the guiding principle should be, “extend cleanly and don’t frell up existing specs and implementations”. The relevance of XHTML to these formats is not direct, since, e.g., links are specified completely differently in RSS vs HTML – the target URI is a text child of the “link” element in the former, and an “href” attribute value of the “a” or “link” element in the latter. So adopting XHTML conventions to extend RSS or Atom is as yet insufficiently motivated for me.
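The difference in link conventions is visible in a couple of lines of Python. This sketch uses two made-up fragments with hypothetical example.org URLs; the XHTML fragment is assumed to be well-formed so that an XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Made-up fragments with hypothetical URLs, for illustration only.
rss_item = "<item><title>Post</title><link>http://example.org/post.html</link></item>"
xhtml_head = ('<head><title>Post</title>'
              '<link rel="alternate" href="http://example.org/post.fr.html" lang="fr"/></head>')

# RSS: the target URI is the text content of the link element.
rss_target = ET.fromstring(rss_item).findtext("link")

# XHTML: the target URI is the value of the href attribute.
html_target = ET.fromstring(xhtml_head).find("link").get("href")

print(rss_target)
print(html_target)
```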
For both 0) and 1), the real enabling / gatekeeper entities are the providers of blogging tools, since they will be generating the syndication feeds, the XHTML data, and (perhaps trickiest) the means by which bloggers and translators may specify the appropriate metadata. Kevin’s suggestion that the semantic XHTML approach allows the proper metadata for 0) to be embedded more or less “today” is encouraging and perhaps we should concentrate our efforts here.
--“Lewy”
Posted by: lewy14 | January 07, 2005 at 18:48
The size overhead of UTF-8 Arabic should be largely mitigated by GZIP encoding and transfer of the blog or RSS feed.
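A quick way to measure this is sketched below in Python. The sample is an arbitrary phrase repeated 200 times, which is an assumption made to keep the example short; such repetitive text compresses far better than real prose, so the numbers only show the direction of the effect and should be re-measured on actual posts.

```python
import gzip

# Arbitrary sample: one short Arabic phrase repeated to simulate a page.
# Real posts are far less repetitive and will compress less well.
line = "\u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645\n"
text = line * 200

raw_1256 = text.encode("cp1256")
raw_utf8 = text.encode("utf-8")
gz_1256 = gzip.compress(raw_1256)
gz_utf8 = gzip.compress(raw_utf8)

print("raw bytes:    ", len(raw_1256), len(raw_utf8))
print("gzipped bytes:", len(gz_1256), len(gz_utf8))
```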
With regards to blogging, this is where the transient nature of RSS becomes a real shame as there is often metadata captured there that is lost to HTML document. If the concept of "here's a pointer to a metadata" for this posting was around, you can easily work in the pure XML there and HTML back in the text side of things.
The HTML LINK tag is only valid in the HEAD section, which obviously will cause problems in the blog world:
http://www.w3.org/TR/html4/struct/links.html#edef-LINK
And BTW - blog reading in Arabic:
http://jaeger.blogmatrix.com/weblog/archives/2004_12.shtml#003127
Posted by: David | January 08, 2005 at 05:55
David,
With respect to the LINK element – yup. As I was drifting off last night I realized the limitation of the LINK element – it works well enough where the context of the HTML “document” is exactly one blog post, which is the case with many systems (MT, Scoop, Slash, and the newer Blogger blogs). However there are some systems where the permalink to a post points to an anchor in a larger page of HTML selected from the archives of the blog – the older Blogger blogs work this way, and some prominent bloggers (e.g. Andrew Sullivan) still use this system. The LINK element being restricted to the HEAD element will clearly fail here.
With respect to UTF-8, I agree the additional bytes shouldn’t create a huge obstacle in practice, especially considering compression, as you point out. If I were starting an Arabic language blog project, I’d use UTF-8, as the SoA sponsored project is. However, I think it would be too limiting to restrict a Rosetta bot to scavenging UTF-8 texts. The blog posts themselves should be free to encode in whatever charset works for them, per existing practice. Extensions to the RSS structure should probably be restricted to UTF-8, again per existing practice.
On metadata: you write: With regards to blogging, this is where the transient nature of RSS becomes a real shame as there is often metadata captured there that is lost to HTML document. Yes, it would seem that any metadata included in the RSS should also be present in the original HTML – translating entity, original/translated text pointers, etc. You also write: If the concept of "here's a pointer to a metadata" for this posting was around, you can easily work in the pure XML there and HTML back in the text side of things. I’m not quite sure I understand what you mean here.
Posted by: lewy14 | January 08, 2005 at 14:27
With regards to the last point:
The HTML and XML/RSS (or whatever) are views on to some underlying data. For example, on this blog the underlying data is stored in an MySQL database on some server.
Now, let's assume that we put all the metadata, including new stuff that Tim wants to define, in the XML file. That's what XML is for and it's a good place to put it. Now all we have to do is define some sort of linkage from an entry in the HTML file to the corresponding XML file.
Since we know multiple blog posts can exist within a single HTML file, we'll have to be creative and "bend" HTML a tiny bit. For example (I'm using square brackets for angle brackets here), here's a blog posting
[div
id="entry"
meta:link="http://.../something.xml"
meta:guid="guidcorrespondingtoxmlfile"
]
... entry 012345
[/div]
Now, the vast bulk of the world that doesn't care about the metadata will work exactly the same as it always has; tools that understand the meta:link attribute will know where to look in the XML file based on the guid.
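A sketch of the consuming side, in Python with the standard `html.parser`, might look like the following. The sample page, its example.org URL, and the guid values are hypothetical. The sample uses class="entry" rather than id="entry" because ids must be unique within a page and the whole point is to handle several posts per page.

```python
from html.parser import HTMLParser

class EntryMetaFinder(HTMLParser):
    """Collect (guid, metadata URL) pairs from DIVs carrying meta:link."""
    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div" and "meta:link" in a:
            self.entries.append((a.get("meta:guid"), a["meta:link"]))

# Hypothetical page holding two posts; URL and guids are made up.
page = """
<div class="entry" meta:link="http://example.org/meta/posts.xml" meta:guid="post-0001">
  ... entry one ...
</div>
<div class="entry" meta:link="http://example.org/meta/posts.xml" meta:guid="post-0002">
  ... entry two ...
</div>
<div class="sidebar">no metadata here</div>
"""

finder = EntryMetaFinder()
finder.feed(page)
print(finder.entries)
```

A tool would then fetch each metadata URL and look up the matching guid; anything that does not know the attributes sees ordinary DIVs.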
Posted by: David | January 09, 2005 at 16:08
> There are a small number of blogs that already publish bilingually
Mine is trilingual, Basque, Spanish, English. However it is NOT a translated blog: I post different things in different languages. Bilingual content does not necessarily mean parallel content. However, I see the point in this proposal of yours. Good luck.
Posted by: Luistxo | January 10, 2005 at 00:05
I think the only way out of this is to begin with a minimal/minimal case that will work for a human reader from the start, but must presume some sophistication on the part of Rosetta bots and their HTML scrapers. This is true even in the case where a LINK specifying a translation is anchored at the whole document level and the docs are word-for-word translations. For a trivial example, if you follow this self-link to this very post, you don't just get its original content, you also get the sidebar gorp specified in my Typepad template, and all of the appended comments, which of course I have just modified. Using a normal embedded A link, one could also have the convention that it may point to an anchor, which will be the title or beginning of a post that is translated in following text. Finding the end of each span would be a job for the bot, looking for clues like other headers and anchor points. Butt ugly from a data formalism point of view, but nothing worse than the crawlers for Google, Technorati and others cope with already. And perfectly obvious to a human reader.
David's suggestion looks like the next step beyond the trivial. I believe it would only require that the various bloggers and blog platforms mod their post templates to generate what amounts to span anchors (and verify that the resulting behavior is innocuous in browsers). My experience is limited to Radio and Typepad. Doing this would be fairly easy in Radio, I haven't tried handcoding Typepad templates so don't know there.
Re RSS as such, since we started the whole discussion there: My guess is that RSS/syndication in general is going to evolve in the direction of being a 'signaling channel' - though that terminology may make some twitch. The evolutionary pressure will come from a combination of need for readers/aggregators to help users cut through the clutter, and a need for some authors to syndicate out samples while selling advertising around the original. Both point towards RSS-like feeds evolving towards being a combination of metadata and samples/summaries of underlying media. If that's right, then our richer metadata concepts would more logically fit into that stream, rather than trying to get too elaborate within the framework of HTML.
Posted by: Tim Oren | January 10, 2005 at 11:11
Tim,
A moderate amount of upgrading would have to be done to Movable Type configurations to accomplish what I'd like. All the code is in MT, but the upgrade is hard.
Convincing people to do it will be hard -- it's the chicken and egg problem. Still, maybe I'll hack away at my templates and see what can be produced.
Posted by: David | January 11, 2005 at 03:02
One further comment -- if you're actually getting some traction on your ideas above, consider setting up a Yahoo mailing list.
Posted by: David | January 11, 2005 at 03:03
Interesting stuff... Microschemas would certainly seem to be a potentially fruitful approach to the distributed translation challenge. I guess what I wanted to communicate in my post was the way in which translation on blogs is also just an extension of what we do already in our blogging. So particular bloggers might get known for providing a translation window into a particular topic area or areas, and for the quality (or otherwise) of their translation. Topics and quality are subjective, not objective attributes, which is why I feel the fluid, neural-network like behaviour of the blogosphere and free-tagging services like delicious can nicely complement the kind of pre-structured meta-data you are discussing in facilitating distributed translation across blogs.
Posted by: Luke Razzell | January 12, 2005 at 02:02
Luke - I at least am trying to be agnostic about the human systems that will facilitate blog translation, and also want to have enough hooks for machine facilitation.
One of the fine dances in doing standards definition - and that's what we are doing, in a tiny way - is putting in enough capability and mechanism to make a difference in the tasks you are trying to facilitate, but without incorporating more 'policy' than absolutely required to get traction. 'Policy' is choices that embody assumptions about predominant usage patterns, availability of supporting technical, economic, or social infrastructure, etc. The more policy you buy into, the better the chance your effort will end up being a 'dead standard' - an irrelevant thing of beauty, because you guessed wrong about adoption. Having been part of early hypertext formalisms efforts, and tracking standardization efforts of the time (late 80s - early 90s) I'm acutely aware of this: There were some things of elegance created, all rendered irrelevant - except as idea mines - by this hack called HTML. In an open standards/source world, minimalism and incrementalism are the winning patterns.
Now, that said, I suspect you are right about the human systems. If you look at the blogs that already appear in translation - some are cited here and on other threads - they are from bloggers who have set themselves up as cultural bridges of a sort - to France, Iraq, even Mauritius. That strikes me as a specialization that's sustainable in a human (and maybe economic) sense, whereas the idea of "let's get a bunch of bilingual volunteers and have them translate whatever's hot" seems to me likely to fail: It's just a recipe for a grind, the kind of thing people want to be paid to endure. So, if we're both right, the (human) translated blogs are going to end up being embedded in the overall blogospheric web, and will need to participate in whatever tagging systems evolve there. They can't be an island, or they will have failed by definition. Hence the urge to borrow as much existing bloggy infrastructure, and add as little policy as possible, perhaps to the extent that the output is merely a set of recommended usages of existing specs.
Posted by: Tim Oren | January 12, 2005 at 11:17
Glad I checked back in for your comment, which makes much sense. Wish it could auto-appear on my blog post's comments (but that's another discussion... Kevin? : ).
Posted by: Luke Razzell | January 13, 2005 at 06:10
This blog posting was of great use in learning new information and also in exchanging our views. Thank you.
Mary Anne Martin
http://www.rosettaaperture.com
Posted by: Mary Anne Martin | May 05, 2006 at 08:02