Last week I received a “Factory Tour” invite from Google but didn’t give it much thought. I wish I had because I missed a preview of the company’s ambitious machine translation (MT) efforts.
Thankfully, Philipp Lenssen has posted a great recap of the Webcast on his site, Google Blogoscoped. It’s worth a read.
Apparently Google is taking massive libraries of source and target text and dumping them into a database where the relationships between source and target text are analyzed and memorized. This database is then leveraged to translate new source text. Philipp explains it better than I…
This is the Rosetta Stone approach of translation. Let’s take a simple example: if a book is titled “Thus Spoke Zarathustra” in English, and the German title is “Also sprach Zarathustra”, the system can begin to understand that “thus spoke” can be translated with “also sprach”. (This approach would even work for metaphors; surely, Google researchers will take the longest available phrase which has high statistical matches across different works.) All it needs is someone to feed the system the two books and to teach it the two are translations from language A to language B, and the translator can create what Franz Och called a “language model.” I suspect it’s crucial that the body of text is immensely large, or else the system in its task of translating would stumble upon too many unlearned phrases. Google used the United Nations Documents to train their machine, and all in all fed 200 billion words. This is brute force AI, if you want: it works on statistical learning theory only and has not much real “understanding” of anything but patterns.
This sure is brute force MT. I’ll be very interested to know just how long a string of text Google can effectively translate. More important, how will Google handle the flood of brand names, oddball terms, and local slang?
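To make the idea in Philipp's recap a bit more concrete, here is a toy sketch in Python. It is not Google's system, and everything in it is an assumption for illustration: the function names `build_phrase_table` and `translate`, the `max_len` parameter, and the handful of invented phrase pairs. It simply counts how often a source phrase was seen next to a target phrase, then translates by grabbing the longest phrase it has seen before. Anything it has never seen (a brand name, say) passes through untouched.

```python
from collections import Counter, defaultdict


def build_phrase_table(aligned_pairs):
    """Count how often each source phrase was seen with each target phrase."""
    table = defaultdict(Counter)
    for source_phrase, target_phrase in aligned_pairs:
        key = tuple(source_phrase.lower().split())
        table[key][target_phrase] += 1
    return table


def translate(sentence, table, max_len=4):
    """Greedy longest-match lookup; unknown words pass through unchanged."""
    words = sentence.split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in table:
                # Pick the target phrase seen most often for this source phrase.
                output.append(table[key].most_common(1)[0][0])
                i += n
                break
        else:
            # Never seen this word: leave it as-is (e.g. a name).
            output.append(words[i])
            i += 1
    return " ".join(output)


# Invented toy data, standing in for phrase pairs mined from parallel texts.
pairs = [
    ("thus spoke", "also sprach"),
    ("thus spoke", "also sprach"),
    ("thus", "so"),
    ("spoke", "sprach"),
]
table = build_phrase_table(pairs)
print(translate("Thus spoke Zarathustra", table))
```

With only four phrase pairs the table knows almost nothing, which is exactly why the size of the training text matters so much: the fewer phrases the system has seen, the more words fall through untranslated.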
But let’s just assume that Google does make this ambitious project a success; how will this affect the translation industry in general and Web globalization in particular?
Assuming this all does work moderately well, companies will have an incentive to pull all text out of graphics to make the most of this free translation service. After all, if Google is providing users in Vietnam a free translation of your company’s Web site, why not do what you can to make everything translatable?
This would also be yet another blow to Macromedia Flash, as if the emergence of AJAX weren’t doing enough damage already.
But what about the impact on translation vendors? I don’t think they have much to worry about, yet. The need for high-quality, human-edited translation isn’t going away anytime soon. Long term, however, all bets are off. Google should be on every translation vendor’s radar; this company has lots of money, lots of smarts, and lots of incentive to provide the world’s text in all the world’s languages.