When machine translation and volunteer translators collide: A YouTube/TED case study

Google recently announced a rather nifty feature in YouTube: Auto-translation of auto-generated video captions.

So not only is Google automatically transcribing the text of its videos, it’s also providing translations — via machine translation. Now I just need a “machine reader” so I can process all of this new content — as I’m running out of hours in a day.

Google’s blog post notes:

In the next few months we expect over 150,000 YouTube channels to implement auto-captioning with translation. This is just the beginning and we hope that all YouTube content will soon be enjoyed by all YouTube users, regardless of what language they speak.

One of the examples cited is a TED talk by author Elizabeth Gilbert, shown here:

Here’s how you enable the auto-translation — hover your mouse over the Closed Caption icon and click the Translate Captions link.

I found the language-selection overlay (shown below) challenging to scroll through. But I suspect this feature will be automated eventually, similar to how Google’s Chrome browser has automated translation based on your language setting.

What I find interesting about the Gilbert talk is that TED has recruited its own army of translators — human translators — to do the same thing but in higher quality.

Here is the TED-translated version of the same talk:

I think it’s safe to assume that the volunteers are going to offer a much higher-quality translation of the video. But TED does not (yet) support the breadth of languages that Google supports. So while TED has the advantage in quality, Google has the advantage in languages.

But the larger question is to what extent Google will make the TED-translated video as easy to find as its own YouTube version.

I did a Google search today and both videos emerged at the top of the results:

I believe this scenario raises a few interesting issues that will need to be addressed in the years ahead:

  • How to easily differentiate between content that has been machine translated vs. human translated
  • How to quickly discover which content is available in which languages
  • Whether the crowd will remain as enthusiastic about translating content by hand when Google provides the same service, albeit at lower quality, for free

Is Google the best machine translation engine? It depends…

Two weeks ago, I introduced Ethan Shen and his project to analyze the three major free machine translation (MT) engines — Google, Microsoft, and Yahoo! Babelfish — by relying on translator reviews.

Ethan has provided me with a mid-point summary of results, which I’ve included below. I was surprised to find that Microsoft and Babelfish are beating Google on some language pairs, as well as on shorter text strings. Although Google is emerging as the overall winner — and receiving some much-deserved attention from the media — it’s nice to see some healthy competition.

That said, quality is only one piece of the puzzle. The other piece — perhaps much more important — is usability. Now that Google has embedded its MT engine into Gmail, Reader, and now its Chrome browser, I find I’m using Google exclusively as my MT engine.

Here are Ethan’s findings so far (emphasis mine):

At the highest level, it appears that survey participants prefer Google Translate’s results across the board.

In a few languages (Arabic, Polish, Dutch) the preference is overwhelming, with votes for Google doubling those of its nearest competitor.

However, once you remove voters who self-identified their fluency in the source or target language as “limited,” the contest becomes closer in some of the most heavily trafficked language pairs. For example:

  • Microsoft Bing Translator leads in German
  • Yahoo! Babelfish leads in Chinese
  • Google maintains its lead in Spanish, Japanese, and French

Observing only the self-identified “limited fluency” voters reveals a strong brand bias. If your fluency in the target translation language is limited, it would stand to reason that your ability to assess the quality of the translation is very limited. And yet…

  • Limited-fluency voters chose Google over Bing by 2 to 1
  • They also chose Google over Yahoo! Babelfish by 5 to 1

As I had guessed, Yahoo! and Microsoft’s hybrid rules-based MT models performed better on shorter text passages.

For phrases below 50 characters, Google’s lead in Spanish, Japanese, and French disappears, while Microsoft’s lead in German widens.

Beyond 50 characters, Google’s relative performance seems to improve across the board.

For passages that are only one sentence, the same effect is seen, though to a lesser extent than under 50 characters.

On March 4th, we made a few changes to our survey — hiding the brands and randomizing the positions of the text results before voting. Since then, we have not yet collected enough data to draw conclusions, but Babelfish seems to be receiving the biggest boost, perhaps showing the effects of the recent neglect of that tool.

Clearly, Ethan needs more data to arrive at more concrete conclusions. If you’re a translator and you want to lend a hand, here is the voting site.

PS: Here’s an interview with Google’s MT guru Franz Josef Och.

The best global web sites of 2010

I’m pleased to announce the publication of the 2010 Web Globalization Report Card.

Here are the top 25 web sites overall:

  1. Google
  2. Facebook
  3. Cisco Systems
  4. Philips
  5. Samsung
  6. Wikipedia
  7. 3M
  8. NIVEA
  9. Symantec
  10. Lenovo
  11. Xbox
  12. Autodesk
  13. Gmail
  14. Microsoft
  15. Nokia
  16. Intel
  17. Caterpillar
  18. Panasonic
  19. HP
  20. Deloitte Touche Tohmatsu
  21. LG
  22. Volvo Group
  23. Hotels.com
  24. SAP
  25. Kodak

Google has emerged on top again, but just barely.
The big story this year is that Facebook and Google finished in a numerical tie. But because Google supports more languages (for now), it edged out Facebook for the top spot.

Moving down the list, there are a number of familiar faces — companies like Cisco, Philips, Panasonic, and NIVEA. But there are some new faces as well. Samsung jumped up in the rankings due to improvements to global navigation and localization. Kodak, Symantec, and Autodesk are also new to the top 25.

Although these sites represent a wide range of industries, they all share a high degree of global consistency and impressive support for languages. They average 50 languages — which is more than twice the average for all 225 sites reviewed.

20+ languages is the new baseline
Looking across all 225 web sites, the number of languages continues to increase. Although the rate of language growth slowed over the past two years — due in large part to the global recession — growth continues. This year, the average number of languages increased to 22, up from 20 languages in 2008.

It wasn’t that long ago that any web site that supported 10 languages would have qualified as “global.” The new baseline is 20 or more languages, and climbing.

I will be posting additional findings in the days and weeks ahead. If you want to learn more, we’ve posted a brochure here.