
January 24, 2005


Tim Oren

Yes, I know there's some funky markup in the post title and body. If it screwed up your browser or RSS reader, please leave a comment with system and symptoms, so I don't try that again.



I checked your markup using the W3C Validator.

The use of a link in the title of the post appears to be somewhat problematic in Movable Type, because the prev and next links are messed up. See for example this post at Winds of Change.

What’s strange is that compared to that Winds of Change post, your post appears to have a spurious "<a" before the <a> tag – what’s up with that? Note this gives the post its funny URL (it ends in the first several characters of the markup for the linked URL: a_a_hrefhttplog.html). Needless to say, the validator is not amused. (Also, you made one small error: the correct attribute is rel="alternate", not ref="alternate".)

The good news is that the rosettabot-specific markup in the body of the post checks out. The bad news is that while it renders fine in Firefox, the <object> tag appears as a tiny opaque box in Explorer on Win XP. My own test post does likewise. Sigh – need to go back and re-examine why the validator wants to see the <object> tag there; it really shouldn’t be needed (the XHTML 1.0 Strict DTD says I should be able to put %lists in the body of the document), and a <div class="urn:rosettabot"> or some such should serve just fine as a (styleable!) container for the rosettabot markup. When I use "div" as a container it looks fine, but the validator complains.

As for the RDF flavor of RSS which your site syndicates, it appears everything makes it through, but wrapped up in the CDATA section, which means it would have to be "double parsed" if the RDF generator doesn’t recognize it and break it out - but then all the syndication mechanisms will likely require some kind of patching; in the meantime it doesn't break anything (a first order requirement).
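To make the "double parsed" point concrete, here is a minimal sketch in Python. The feed item below is made up for illustration (it is not the actual feed from this site), and the function name is hypothetical. It only shows that markup wrapped in CDATA arrives as plain text after the first parse and needs a second parse before a bot can see its structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed item: the post body travels as character data inside CDATA.
FEED_ITEM = """<item>
  <title>Translation test</title>
  <description><![CDATA[<div class="urn:rosettabot"><p>Some translated text.</p></div>]]></description>
</item>"""


def extract_embedded_markup(item_xml):
    """Parse the feed item, then parse the CDATA payload a second time."""
    item = ET.fromstring(item_xml)           # first parse: the feed itself
    payload = item.findtext("description")   # CDATA content is plain text here
    return ET.fromstring(payload)            # second parse: the embedded XHTML


item = ET.fromstring(FEED_ITEM)
# After the first parse the description has no child elements, only text.
print(len(item.find("description")))

embedded = extract_embedded_markup(FEED_ITEM)
print(embedded.tag, embedded.get("class"))
```

A generator that recognized the markup could break it out as real child elements instead, which would remove the second parse.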


Quick update - got to the bottom of the problem with the div element and the validator.

Bottom line is that a div element is OK anywhere a list (ul, ol, dl) is OK, and if a list isn't OK in the body of a post, then, well, that's lame and useless. So there, that's my justification for using div and lists to create semantic XHTML used within post bodies.

See http://www.w3.org/TR/html401/struct/global.html#h-7.5.4

My problem is that my Blogger template has the $BlogItemBody embedded in a <p> element which in an XHTML STRICT sense precludes using both lists and div elements in the body of the post, which is of course lame, and useless.
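A rough way to spot this template problem without a full DTD validator is sketched below in Python. It is only a sketch under assumptions: the list of block-level tags is a partial subset, only direct children of <p> are checked, and the sample fragments (including the <div> wrapper in the second one) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Partial list of block-level elements that XHTML 1.0 Strict does not
# allow as children of <p> (assumption: not the complete set).
BLOCK_TAGS = {"div", "ul", "ol", "dl", "p", "table", "blockquote", "pre"}


def block_children_of_p(fragment):
    """Return the tags of block-level elements found directly inside a <p>."""
    root = ET.fromstring(fragment)
    offenders = []
    for p in root.iter("p"):
        for child in p:
            if child.tag in BLOCK_TAGS:
                offenders.append(child.tag)
    return offenders


# Template that wraps the post body in <p>: a list in the post breaks validity.
WRAPPED = "<body><p><ul><li>one</li></ul></p></body>"
# Hypothetical template that wraps the post body in <div> instead.
UNWRAPPED = "<body><div><ul><li>one</li></ul></div></body>"

print(block_children_of_p(WRAPPED))    # ['ul']
print(block_children_of_p(UNWRAPPED))  # []
```

Note that both fragments are well-formed XML; the first one only fails against the DTD, which is why a plain parser does not complain about it.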

All manner of folks use lists all over in their post bodies. This is common practice and won't go away. The fact that Blogger posts won't strictly validate doesn't appear to concern anyone, particularly Blogger, since they happily generate URLs with embedded ampersands (&), which violate basic XML and generate a dozen validation errors right there. So the DOCTYPE declaration of XHTML STRICT on the canned templates is what you might call a lie...
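The ampersand problem is easy to reproduce with any XML parser. A minimal sketch in Python follows; the example.com URLs are placeholders, not actual Blogger output.

```python
import xml.etree.ElementTree as ET

# A bare "&" in an attribute value is not well-formed XML.
RAW = '<p><a href="http://example.com/post?a=1&b=2">link</a></p>'
# The same link with the ampersand escaped as "&amp;".
ESCAPED = '<p><a href="http://example.com/post?a=1&amp;b=2">link</a></p>'


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(RAW))      # False
print(is_well_formed(ESCAPED))  # True
```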

...not that two wrongs make a right. But the usage of the Semantic XHTML constructs is still OK, I submit, because like I said, if you can't use lists in a blog post, that's lame, and I think that position is defensible.

As for the <object> tag, I was definitely misusing it; I've now learned better how to misuse it correctly, if this is desired... 8) See update back at my test post, which now renders OK in Explorer. The <object> hack is only necessary to get "invisible" name/value parameters, and is arguably ugly. Wonder how the Semantic XHTML folks feel about it.


Final repeated caveat: my purpose here is to establish a minimal spanning bag of tricks for encoding the translation metadata, which are defensible and won't cause trouble down the road. Once these are established we can formalize the schema, if the overall approach is agreed upon.

Tim Oren

Changed bogus ref to rel, and deleted the spurious "<a".

Tim Oren

Lewy - one suggestion on additional params for the object, however expressed: Optionally, start and end points, expressed as anchor ids, in either or both of the original and translated doc. The portion of the document between the start and end anchors will contain the translated matter, though other text and markup may also occur.

This is at least a strong hint to bots re where the texts to be aligned reside in the respective docs. Anchors are automatically produced by many blogging tools, so it may be possible to find start/end points in already existing blogs and posts, though their span may include other matter such as titles, headers, footers. Manually inserting anchors in a new (translated) post is trivial, and doing so will make it easier to include commentary about the translated matter outside of the anchor-span, with the translation itself a sort-of-block-quote. This is also the only way I've found to insert ids without breaking various rules on div and p nesting.
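As a sketch of what a bot would have to do with start and end anchors, the Python below collects the text between two anchor ids. The ids rb:start and rb:end and the sample post are placeholders (the params are not defined yet), and the code assumes both anchors are siblings under the same parent element, which real posts may not guarantee.

```python
import xml.etree.ElementTree as ET

# Hypothetical post: commentary outside the anchors, translation between them.
POST = ('<div><h3>Commentary</h3>'
        '<a id="rb:start" />Some translated text with a '
        '<a href="http://example.com/">link</a>.<a id="rb:end" />'
        '<p>More commentary.</p></div>')


def text_between_anchors(fragment, start_id="rb:start", end_id="rb:end"):
    """Collect the text between two sibling anchors, in document order."""
    root = ET.fromstring(fragment)
    for parent in root.iter():
        children = list(parent)
        ids = [child.get("id") for child in children]
        if start_id in ids and end_id in ids:
            i, j = ids.index(start_id), ids.index(end_id)
            if i >= j:
                continue  # end anchor precedes start anchor; ignore
            # Text right after the start anchor lives in its "tail".
            parts = [children[i].tail or ""]
            for child in children[i + 1:j]:
                parts.append("".join(child.itertext()))
                parts.append(child.tail or "")
            return "".join(parts)
    return None  # anchors missing or not siblings


print(text_between_anchors(POST))  # Some translated text with a link.
```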

Over to you to define the params, assuming you agree.



Here’s my argument for <div> vs <a>: forgive me if this is tedious but I want to lay it all out.

First let me state explicitly a requirement which I think we can all agree on and which is driving my formulation of the translation metadata profile:

Requirement: translation metadata embedded within an HTML document formatted in accordance with the profile shall not preclude validation of the embedding HTML document against the XHTML 1.0 Strict DTD.

Now, with regard to the question of denoting sections of text which are translated: there would appear to be two fundamental approaches:

  • Container: enclosing the translated text within an element start tag and end tag. E.g.: <div class="rb:start-x">Some translated text.</div>
  • Delimiter: using empty anchors to signal the start and end of the translated text. E.g.: <a id="rb:start" />Some translated text.<a id="rb:end" />

Now allow me to make an assertion: containing is preferable to delimiting. Containment is a first order concept in XML; if the whole document is well formed XML then the contained fragment is also well formed. This is not necessarily true with a delimited fragment. Further, the specification of the bounds of translated excerpts is easier using XPath when containment is used. (In fact, when you write "it may be possible to find start/end points in already existing blogs and posts, though their span may include other matter such as titles, headers, footers", you are correct, and it is possible to write simple XPath expressions which extract these sections. These XPath expressions are valid for roughly every blog with a template which uses the same naming conventions. These XPath expressions can best be formulated using elements like <div> as landmarks.) While these may not be overwhelming advantages, I maintain they are valid preferences, and that they can be had without any cost.
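For illustration, here is a minimal Python sketch of the container case, using the limited XPath support in the standard library. The class name rb:start-x is taken from the example above; the surrounding post markup and the function name are invented. One path expression locates the excerpt, and the result is itself a well-formed element.

```python
import xml.etree.ElementTree as ET

# Hypothetical post: the translated excerpt sits inside a container <div>.
POST = ('<div class="post"><h3>Commentary</h3>'
        '<div class="rb:start-x">Some translated text with a '
        '<a href="http://example.com/">link</a>.</div>'
        '<p>More commentary.</p></div>')


def contained_translation(fragment, cls="rb:start-x"):
    """Return the container element for the translated excerpt, or None."""
    root = ET.fromstring(fragment)
    # A single XPath-style expression selects the container by class.
    return root.find(f".//div[@class='{cls}']")


node = contained_translation(POST)
print("".join(node.itertext()))  # Some translated text with a link.
print(node.find("a").get("href"))
```

Because the excerpt is an element in its own right, links and other structure inside it stay addressable with further path expressions.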

What elements could be used as containers? The <a> cannot. According to the XHTML element prohibitions, the <a> element must not include other <a> elements. Clearly the translated texts must be allowed to contain links. Therefore the use of <a> elements as fragment identifiers to signal translated text excerpts must be done by delimiting, as opposed to containing, the translated text excerpt.
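The prohibition can be checked mechanically. The Python sketch below flags any <a> that has another <a> somewhere inside it; the two sample fragments are made up to contrast the container and delimiter uses of <a>.

```python
import xml.etree.ElementTree as ET

# <a> used as a container: the link inside it becomes a nested <a>.
AS_CONTAINER = ('<div><a id="rb:start">Text with a '
                '<a href="http://example.com/">link</a>.</a></div>')
# <a> used as empty delimiters: the link is a sibling, not a descendant.
AS_DELIMITER = ('<div><a id="rb:start" />Text with a '
                '<a href="http://example.com/">link</a>.<a id="rb:end" /></div>')


def has_nested_anchor(fragment):
    """Return True if any <a> element contains another <a> at any depth."""
    root = ET.fromstring(fragment)
    for a in root.iter("a"):
        # iter() yields the element itself first, so exclude it.
        if any(inner is not a for inner in a.iter("a")):
            return True
    return False


print(has_nested_anchor(AS_CONTAINER))  # True
print(has_nested_anchor(AS_DELIMITER))  # False
```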

There are two elements whose explicit goal in life is to add pure structure to HTML: the <div> element and the <span> element. I maintain that it ought to be permissible to embed a <div> in the body of a post – if doing so precludes validation to XHTML Strict, then the problem is a bug in the template! I claim that this is so because <div> can be used anywhere where lists can be used, and the use of lists is commonly accepted (see, e.g., the inclusion of lists in the javascript based RTF editor in Blogger – this despite that the alleged XHTML Strict Blogger templates have a bug which precludes validation of posts containing lists!). I fixed the bug in my own Blogger template (replaced the <p> enclosing the post body with a pair of <br />s).

Therefore I maintain that specifying the use of the <div> element in the translation metadata profile satisfies the requirement given above – it does not preclude validation of the embedding HTML document as XHTML 1.0 Strict. Anyone who ACTUALLY CARES about such validation must provide a template which permits this, but this requirement adds nothing to the requirements already on the template to permit lists, and meeting these requirements is trivially accomplished.

Therefore the use of <div> both as a container of the translation metadata and as a container for the translation excerpt would appear to be a good choice.

Finally, the <span> element could also be used as a container. The set of elements which are permitted within a <span> is restricted compared to the set permitted within <div>, but it should be possible to encode a minimum profile within <span> elements.

Whew. Thanks for reading this far. Am I missing something here?

Tim Oren

The only thing I'd say is missing is actually making the trials to determine if embedding div's into the posts of default templates in various blogging tools causes either the tool or common browsers to barf. One might regard such failures as bugs, but if they are predominant, then the spec becomes moot through lack of usage.

Given a good result for the large majority of installed base, I'd certainly say that the div approach should be preferred. I think I'd still hold out that using delimiters be a (deprecated) alternative, because I can definitely envision cases where it will be necessary to work around post template structure issues.


Agreed - if [div] messes up tools or browsers using default templates, that's bad. I've tested [div] embeds in Explorer and Firefox under Windows; no problem. I've only tested Blogger, which has no problem.

For some reason, many default templates seem to want to embed the entire post body in a [p] element. This seems whack to me; it prohibits all use of [p], lists, etc, in the post according to the DTD. The use of these elements in posts is ubiquitous, and blogging tools and browsers have no problem with them. Since [div] can be used wherever [p] or [ul] etc can be used, I have good reason to believe that using [div] will be OK. This reasoning is NOT the same as testing, as you point out.

Tim Oren

A new version of the markup is being tried out here, so please continue comments and critique below that post. I am closing comments here.
