The State of Machine Translation
Letting computers substitute for human translators in alleviating mankind's Tower of Babel is one of the most frequently hyped and recycled 'futures'. But after 50 years of plugging away, what's been accomplished is rather modest. Market analysts predict that in 2007, machine translation (MT) will remain an insignificant one percent of a translation marketplace worth over $10 billion. What's gone wrong, and could it change?
The commercial state of MT practice is generally founded on 'rule-based systems', a technology that has been on the market for over a decade. Modestly successful companies such as Systran of France are built on such technology, which is the basis for most free Internet translators such as Babelfish. However, this generation of MT has a relatively high error rate, particularly when faced with fragmentary or colloquial inputs. This has limited its acceptance to market segments - such as 'free' services - that are willing to accept the modest quality of output that is euphemistically referred to as 'gisting'. The same error rate has also kept MT out of most human translation firms: the burden of proofing and fixing the numerous output errors is similar to that of doing a clean translation. Could this level of acceptance change?
There are two incipient trends that may indicate movement. In the first, large (human) translation bureaus are at least hedging their bets with MT. In the past few years, the business of language translation has consolidated, with a few large firms such as SDL and Bowne emerging as the survivors and with service as their primary revenue model. More recently, these companies have begun to at least explore the MT portion of the market:
The sale of the Transcend technology to SDL in February was a bellwether of things to come. The Barcelona system was acquired by Bowne Global Solutions, and Lionbridge has partnered with Sail Labs to deploy, develop, and comarket NLP technologies and services. SDL and Bowne Global Solutions appear to have plans for continued marketing of systems to external users, although not necessarily in the retail "shrink-wrap" market.
(This excerpt is from a report generally skewed toward Systran. Check here for another overview from one of the same authors.)
At least one of these acquiring companies, SDL, has continued to sell its acquired MT technology. The degree to which SDL and other service-model translation businesses have attempted to integrate MT into their human workflow is undisclosed.
A second change is a shift in the research front in MT. A decade ago, much work was going into interlingua approaches to MT, in which any human language would first be converted into an abstract representation, then output into any other natural language. As the limits of rule-based translation became apparent, and computer power continued to increase, efforts shifted toward 'corpus-based' approaches to MT (and other natural language tasks). In this approach, a very large collection of parallel, aligned texts in the two languages under consideration is analyzed and used to develop a statistical model, which is then applied to the translation of previously unseen texts. Many of these approaches are based on hidden Markov models similar to those that have shown some success in speech recognition, and similarly, they require a large stock of training data.
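To make the corpus-based idea concrete, here is a minimal sketch of how word translation probabilities can be estimated from aligned sentence pairs. It uses IBM Model 1, a simpler word-alignment model than the hidden Markov variants mentioned above, trained with expectation-maximization. The three-sentence corpus and the function names are invented for illustration; a real system would need vastly more data and a full decoder on top.

```python
def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) from aligned sentence pairs via EM."""
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    tgt_vocab = {w for _, tgt in corpus for w in tgt}

    # Start from a uniform distribution over every co-occurring word pair.
    uniform = 1.0 / len(tgt_vocab)
    t = {(tw, sw): uniform for src, tgt in corpus for tw in tgt for sw in src}

    for _ in range(iterations):
        count = {}  # expected count of (target, source) alignments
        total = {}  # expected count of each source word being aligned
        # E-step: share each target word among the source words in its sentence.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] = count.get((tw, sw), 0.0) + frac
                    total[sw] = total.get(sw, 0.0) + frac
        # M-step: renormalize the expected counts into probabilities.
        t = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    return t


def best_translation(t, source_word):
    """Return the most probable target word for source_word, or None if unseen."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical toy corpus (German -> English), far too small for real use.
TOY_CORPUS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]

model = train_ibm_model1(TOY_CORPUS)
for word in ("das", "haus", "buch", "ein"):
    print(word, "->", best_translation(model, word))
```

Nothing in the sketch knows any German or English grammar: the word correspondences emerge purely from which words co-occur across aligned sentences, which is why this family of methods is so hungry for training data.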
This research front is now sufficiently advanced that DARPA can hold a shoot-off among more than a dozen academic and commercial research groups. Small startup companies, such as Language Weaver (a USC/ISI spinoff) and Meaningful Machines, have emerged in the area. With much of the applied research backed by military and intelligence sponsors, there's not as yet much commercial experience with the new generation of technology. Can it break MT out of market niches satisfied with 'gisting', and get MT into the human translation workflow? The jury is still out. One thing's for sure: if an organization is already paying for human translation of techdocs, websites, or anything else, it should now be keeping a database of these parallel texts. They could be quite useful in a few years.
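As a rough idea of what "keeping a database of parallel texts" might look like in practice, the sketch below stores aligned source/target segments in SQLite using Python's standard library. The table layout, column names, and function names are assumptions made for illustration, not an established format; it also assumes the segments have already been aligned one-to-one.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store of aligned segment pairs. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS segment_pairs (
               doc_id   TEXT    NOT NULL,
               seq      INTEGER NOT NULL,
               src_lang TEXT    NOT NULL,
               tgt_lang TEXT    NOT NULL,
               src_text TEXT    NOT NULL,
               tgt_text TEXT    NOT NULL,
               PRIMARY KEY (doc_id, seq, src_lang, tgt_lang)
           )"""
    )
    return conn


def add_document(conn, doc_id, src_lang, tgt_lang, src_segments, tgt_segments):
    """Store one translated document as position-aligned segment pairs."""
    if len(src_segments) != len(tgt_segments):
        raise ValueError("source and target must have the same number of segments")
    rows = [
        (doc_id, i, src_lang, tgt_lang, s, t)
        for i, (s, t) in enumerate(zip(src_segments, tgt_segments))
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO segment_pairs "
            "(doc_id, seq, src_lang, tgt_lang, src_text, tgt_text) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)


def export_pairs(conn, src_lang, tgt_lang):
    """Return (source, target) tuples for one language pair, in document order."""
    cur = conn.execute(
        "SELECT src_text, tgt_text FROM segment_pairs "
        "WHERE src_lang = ? AND tgt_lang = ? ORDER BY doc_id, seq",
        (src_lang, tgt_lang),
    )
    return cur.fetchall()
```

A call such as `add_document(conn, "manual-v1", "de", "en", german_sentences, english_sentences)` would record one translated manual, and `export_pairs(conn, "de", "en")` would hand back the sentence pairs in the shape a statistical trainer typically consumes. The important design point is keeping each source segment next to its human translation, rather than archiving the two documents separately.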