Last week I received a “Factory Tour” invite from Google but didn’t give it much thought. I wish I had because I missed a preview of the company’s ambitious machine translation (MT) efforts.
Thankfully, Philipp Lenssen provides a great recap of the Webcast at his site, Google Blogoscoped. It’s worth a read.
Apparently Google is taking massive libraries of source and target text and dumping them into a database where the relationships between source and target text are analyzed and memorized. This database is then leveraged to translate new source text. Philipp explains it better than I…
This is the Rosetta Stone approach to translation. Let’s take a simple example: if a book is titled “Thus Spoke Zarathustra” in English, and the German title is “Also sprach Zarathustra”, the system can begin to understand that “thus spoke” can be translated as “also sprach”. (This approach would even work for metaphors; surely, Google’s researchers will take the longest available phrase that has high statistical matches across different works.) All it needs is someone to feed the system the two books and to teach it that the two are translations from language A to language B, and the translator can create what Franz Och called a “language model.” I suspect it’s crucial that the body of text be immensely large, or else the system would stumble upon too many unlearned phrases in its task of translating. Google used United Nations documents to train its machine, and all in all fed it 200 billion words. This is brute-force AI, if you want: it works on statistical learning theory only and has not much real “understanding” of anything but patterns.
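The alignment idea Philipp describes can be caricatured in a few lines of Python: count how often words co-occur across aligned sentence pairs and treat the most frequent pairing as a candidate translation. This is only a toy sketch of the statistical idea, not Google’s actual system; the corpus and function names here are invented for illustration.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus of (English, German) sentence pairs.
parallel = [
    ("thus spoke zarathustra", "also sprach zarathustra"),
    ("thus spoke the master", "also sprach der meister"),
    ("he spoke softly", "er sprach leise"),
]

# Count how often each (source word, target word) pair appears in the
# same aligned sentence; frequent pairs are translation candidates.
cooc = Counter()
for en, de in parallel:
    for e, d in product(en.split(), de.split()):
        cooc[(e, d)] += 1

def best_translation(word):
    """Return the target word most often seen alongside `word`."""
    candidates = [(d, n) for (e, d), n in cooc.items() if e == word]
    return max(candidates, key=lambda c: c[1])[0]

print(best_translation("spoke"))  # "sprach" co-occurs with it in all three pairs
```

With 200 billion words instead of three sentences, the same counting idea, extended from single words to whole phrases and weighted properly, is what makes the approach plausible.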
This sure is brute-force MT. I’ll be very interested to know just how long a string of text Google can effectively translate. More important, how will Google handle the flood of brand names, oddball terms, and local slang?
But let’s just assume that Google does make this ambitious project a success; how will this affect the translation industry in general and Web globalization in particular?
Assuming this all works moderately well, companies will be incentivized to pull all text out of graphics to make the most of this free translation service. After all, if Google is providing users in Vietnam a free translation of your company’s Web site, why not do what you can to make everything translatable?
This would also be yet another blow to Macromedia Flash, not that the emergence of AJAX isn’t doing enough damage.
But what about the impact on translation vendors? I don’t think they have much to worry about, yet. The need for high-quality, human-edited translation isn’t going away anytime soon. Long term, however, all bets are off. Google should be on every translation vendor’s radar; this company has lots of money, lots of smarts, and lots of incentive to provide the world’s text in all the world’s languages.
8 thoughts on “Will Google Kill the Translation Industry?”
As for brand names, I suppose they just leave them as they are, because they are also “translated” as such (I suppose “Coca-Cola” would stay “Coca-Cola” in many languages). But oddball terms and slang are a bit of a problem, I guess…
There’s a name for people who think that computers will one day be able to translate. They’re called monolingual. Anyone who is fluent in several languages knows that languages aren’t merely the same thing with different vocabulary. A machine can indeed convert an Arabic text into English, but a Washington politician simply won’t understand it because it’s not in his social context. MT won’t kill the translation industry; on the contrary, as we globalize, countries and companies will discover they must work in multiple languages. There will be far more translation.
I can’t see why this would be a “blow to Macromedia Flash”. It is a straightforward matter to keep the text of Flash content or a Flash app separate in XML, and I can’t see any reason one couldn’t use a service like this as a Flash developer.
Machine translation is still machine translation, and it’s bound to stay just that for a long while still. Furthermore, the farther apart the languages are, the worse the outcome of machine translation attempts will be. I have had more than one chance to see this firsthand.
Keep in mind that when you’re talking about statistical machine translation, or brute force translation, as you put it, the translations have to exist in the first place. These systems are trained on content created by authors and translators.
It’s quite possible that Google will be able to produce pretty good machine translations for languages in which large translated corpora exist. After all, they have access to more content than probably any organization in history. I suspect that their system will produce much better results than, say, Systran.
But that doesn’t change the fact that for many pairs of languages, such corpora simply don’t exist. Google isn’t in the business of producing content; they’re in the business of extracting information from content. So don’t expect any good Japanese-to-Spanish or German-to-Chinese translation any time soon, to say nothing of languages with fewer speakers.
Of course, Google has a habit of doing things on an unprecedented scale. So we’ll see how far this goes. There will certainly be repercussions for the translation industry here, but I don’t think it will be a blanket effect.
I’m less confident than other commenters that usable MT will forever remain “just a few more years away.”
Still, I’m not extremely worried even though I do believe that this kind of corpus-based approach is probably going to yield the promised accuracy, if only because computing power and storage are now cheap enough to make a large enough corpus possible.
Two issues stand out, just for starters: even if the corpus becomes large enough (and the software sophisticated enough) that the system can decipher any metaphor, it then needs to be able to select a culturally appropriate target-language metaphor or even a literal rendering. That’s the kind of decision that requires human intervention. (Let’s not even mention a device like sarcasm: in an age when many humans can’t detect it, I won’t hold my breath waiting for a machine with the ability.)
The second, related, issue is style. Brute force will surely improve the accuracy of MT, but I don’t see any likelihood of it ever producing stylistically interesting copy — and that’s something that many clients value very highly.
At the very least, human editors will remain a critical part of the process for many years to come.
Many here do not even grasp the concept and the philosophy behind this. This is NOT “machine translation”. It is not translating anything from scratch. It is RETRIEVING human-translated snippets, putting them together with clever algos, and creating a translation that will be as good as the corpora they are using will allow. The laws of statistics (and Google’s algos) will push “bad” translations underneath and good translations to the surface.
They have already indexed ALL of the United Nations and European Union documents. We are speaking of BILLIONS of documents.
While the new European Union languages may not have the necessary “critical mass” yet (so no Latvian, no Czech, and no Slovak), all the “older” languages have it: so yes German, yes Italian, yes Dutch, yes Portuguese, yes Greek, yes Finnish, yes Swedish, yes Danish… and fantastic English, French, and Spanish, because they will draw BOTH from zillions of UN documents and EU documents… HUMAN TRANSLATED. Hence you just need some stupid, quick algos: alignments OK? Keep the segments. Alignments screwed? Throw the segments away; who cares, we have enough.
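The “alignments OK? keep; alignments screwed? throw away” step this commenter describes can be sketched as a crude length-ratio filter. Real sentence aligners (Gale-Church, for example) use a proper statistical model; the threshold and sentences below are assumed purely for illustration.

```python
def keep_pair(src: str, tgt: str, max_ratio: float = 1.6) -> bool:
    """Heuristic: lengths of genuine translations are strongly correlated."""
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio

# Candidate segment pairs pulled from a parallel corpus; the second
# pairing is misaligned, so its length ratio gives it away.
candidates = [
    ("The committee adopted the resolution.",
     "Der Ausschuss nahm die Resolution an."),
    ("Yes.",
     "Der Ausschuss nahm die Resolution ohne Abstimmung an."),
]
kept = [p for p in candidates if keep_pair(*p)]
print(len(kept))  # only the plausibly aligned pair survives
```

With billions of documents on hand, discarding every doubtful pair costs nothing, which is exactly the commenter’s “who cares, we have enough” point.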
Search Google, limiting the search to the europa.eu.int server or to the United Nations one. See? All the documents, even the most useless ones, are already there (often also cached, by the way).
I bet it all started with a request for a quick translator into English from Arabic and Chinese, and someone snapped his fingers and said, well, why not the UN documents? They are online and they will deliver just that.
Da cosa nasce cosa… one thing leads to another…
And first they found Spanish and French (and Russian) as a collateral advantage; now they have added, and are adding, all the EU documents to make it into the universal translator.
Don’t make me laugh. A billion documents (and a couple of easy-to-program algos) provide quality “automagically,” duh.
My opinion is that statistical translation has a bright future, because it has a great opportunity to learn how information collocates through brute-force word, term, and snippet mining in context, as long as it is able to perform the necessary pruning so that acceptable output is rendered when translating a sentence.
Please note that one of the most important breakthroughs in translation quality, and in dramatically shortening the learning curve for new translators, has been “concordance” (i.e., human-driven word, term, and snippet mining of a translation memory).
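Concordance, as described here, is essentially retrieval: show every stored segment pair containing a term so the translator can see how it was rendered in context. A minimal sketch over a toy translation memory (all sentences and names below are invented for illustration):

```python
# Toy translation memory: (source, target) segment pairs.
tm = [
    ("The motion was carried unanimously.",
     "La motion a été adoptée à l'unanimité."),
    ("He carried the box upstairs.",
     "Il a porté la boîte à l'étage."),
]

def concordance(term: str):
    """Return every TM pair whose source side contains `term`."""
    needle = term.lower()
    return [(s, t) for s, t in tm if needle in s.lower()]

# Both senses of "carried" surface with their different renderings,
# so the translator can judge which one fits the current context.
for src, tgt in concordance("carried"):
    print(src, "->", tgt)
```

The point mirrors the commenter’s: the machine retrieves; the human still decides which retrieved rendering applies.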
If it worked very well for human beings, I’m not sure why it should not also work very well for machines. If Google is around the corner with that technology, MT may become mainstream earlier than we think, as there will be other players trying to catch up.
In any case, it will still remain true that “nobody can translate what he does not understand,” and understanding is something that MT cannot deliver.
It is not clear what the future holds for translators who currently translate content they do not understand, relying on TM concordance for guidance. These translators could eventually be replaced by MT and pushed professionally toward the types of content they do understand, where they can be competitive in terms of quality, though possibly as editors instead of translators.