Ian Lamont

Language Weaver's translation tools move from spy vs. spy to Web content

Ian Lamont, 10.09.2008

Automated, effective machine translation has been a holy grail of computer scientists for decades. It's not just a practical challenge -- it's a technology that has the potential to revolutionize communication and lead to new classes of software applications.

The arrival of a global communications medium -- the Internet -- has increased the need for high-quality translation software. The Industry Standard spoke with Mark Tapling, the president and CEO of Language Weaver, about his company's approach to machine translation. The company's products are centered around a statistical-based translation engine that has been used for years by U.S. intelligence agencies. It supports 50 language pairs. Language Weaver has also launched a high-volume Internet-based application that can be optimized for internal company documents or websites.

One of the company's major selling points is cost: Tapling says human translation averages 21 cents per word, which is too expensive for most types of online content. In the interview (see below) he stated that Language Weaver is a far cheaper alternative.

The company has already attracted several customers who want to produce international versions of their websites. One of the largest is Trip Advisor, which uses the software to translate user-generated reviews of hotels and other places into various European languages and Japanese. However, price is not the only consideration for using Language Weaver -- output quality matters greatly to readers, and the software needs to be "trained" to improve the quality of translated Web content.

Our interview with Tapling follows:

The Industry Standard: How does Language Weaver compare with human translators and other machine translation services in terms of cost?

Mark Tapling: Language Weaver translations are at least 2,000 times more cost effective than human translators. Our products are priced by application versus tools. In this case, we are looking to establish leadership as both the quality and price performer. Pricing for alternate solutions has a wide variety of entry points, but we know when considering volume, speed, and accuracy that our products are viewed as very compelling on the pricing front.
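To put the two quoted figures side by side, here is a rough back-of-the-envelope sketch. It assumes that "2,000 times more cost effective" can be read as a straight 2,000-to-1 ratio in cost per word, which Tapling does not state explicitly, and it uses a hypothetical one-million-word job. The function names are illustrative only and do not reflect Language Weaver's actual pricing.

```python
# Figures quoted in the article; everything else is an assumption.
HUMAN_CENTS_PER_WORD = 21   # Tapling's average for human translation
ASSUMED_COST_RATIO = 2000   # assumption: "2,000 times" read as a per-word cost ratio


def human_cost_dollars(words: int) -> float:
    """Cost of human translation at the quoted 21 cents per word."""
    return words * HUMAN_CENTS_PER_WORD / 100


def implied_machine_cost_dollars(words: int) -> float:
    """Machine cost implied by the assumed 2,000-to-1 ratio (not a quoted price)."""
    return human_cost_dollars(words) / ASSUMED_COST_RATIO


if __name__ == "__main__":
    words = 1_000_000  # hypothetical job size
    print(f"human:   ${human_cost_dollars(words):,.2f}")
    print(f"machine: ${implied_machine_cost_dollars(words):,.2f}")
```

Under that reading, a job that would cost $210,000 with human translators would come to roughly $105 by machine. Actual pricing is per application, as Tapling notes, so the figure is only an order-of-magnitude illustration.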

TIS: What is the breakdown of clients who use the hosted service, compared to the standalone CD-based product?

Tapling: We just launched our SaaS offering 10 days ago. We have two clients up in production, and likely another three coming in the first two weeks, so we are encouraged. The historical business of the company has been geared toward government intelligence applications, and virtually all of those clients are on premise-based systems.

TIS: Which are the top three language pairs among your customers?

Tapling: Arabic and Spanish are at the top, and then it becomes a race between German and French.

TIS: Who are your competitors?

Tapling: Our biggest competitor is "No Decision, Inc." This is the scenario where clients choose not to pursue translating because the human cost is viewed as too high. If the customer advances past that obstacle, we see Systran, and regional players with limited language coverage.

TIS: Can you describe what "domain training" involves?

Tapling: Our prospective customers deliver to us a combination of mono-lingual destination content, and bi-lingual source/target content. In some cases, we capture this data for them. We then have a series of tools that cleans and aligns data, and runs it against a backbone network containing billions of words. The process updates the statistical algorithms, so that future translation requests understand the lexicon and syntax of a customer's business.

TIS: I observed in several tests of website content (English to simplified Chinese and vice versa, and French to English) that the syntax was often corrupted in the target language. What are the factors that impact translation quality?

Tapling: The key factor is understanding the customer's communication style, and desired outputs. When we can match pairs of input and desired outputs, the translators respond very favorably.

TIS: User-generated content would seem


I think this article neglects to mention several new competitors that are also able to produce high quality statistical machine translation systems and are described below. Readers may also be interested to know that there is a well developed open source movement in the SMT technology area that is likely to produce many new interesting companies that address many content translation problems in a unique and very focused way.

I think it is important to point out that no MT system available today produces human-quality translation output. Your interview suggests that MT produces translation at a fraction of the cost of human translation. However, in most cases the output is significantly lower in quality than human translation, and that lower quality is the reason it is cheaper. Many today are pointing to much more intensive human-machine collaboration that can and will produce increasing quality within very specific domains. Many are saying this is where the breakthroughs will come from.

It has been understood for some time now that as the quality of MT gets better, the market will expand, extending the reach of MT into the enterprise and making all kinds of previously monolingual high-value content multilingual. Unfortunately this has been a very slow process.

The most recent NIST (National Institute of Standards & Technology) MT competition results show that MSR, Google, IBM and BBN produced the best (as in highest quality measured by BLEU) generic Arabic and Chinese systems. But they were all pretty close, and none produced really huge improvements over last year’s results. Technology initiatives like syntax and hybrid approaches make small differences but something else is needed to really accelerate the rate of improvements. The quality that these generic systems produce today is not likely to get many major enterprises to step up and pay big money for the right of use, even though they are good enough to get millions of Internet users, who will use it as long as it is free.
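For readers unfamiliar with BLEU, the following is a minimal sketch of the idea behind the metric: clipped n-gram precision between a candidate translation and a reference, combined by geometric mean and scaled by a brevity penalty. It is a simplified, single-sentence, single-reference version with whitespace tokenization and no smoothing. The function name `simple_bleu` and the example sentences are hypothetical, and this is not the scoring script used in the NIST evaluations.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU against one reference (no smoothing)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a correct word does not inflate the score.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))

    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(simple_bleu("the cat sat on the mat", reference))
    print(simple_bleu("the cat", reference, max_n=2))
```

Because the score only measures n-gram overlap with reference translations, small differences between systems on such a metric do not necessarily translate into differences a reader would notice, which is consistent with the point that the top systems were "all pretty close."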

We have learnt that focusing on a domain (especially a technical domain) makes better systems, and raises the accuracy of raw MT to a level that is much higher than what we see in these NIST competition focused systems. Microsoft has shown that its raw MT translations of knowledge base content are much higher in quality than the output of the generic systems used at NIST and Google. This knowledge base content is heavily used by millions of users in Microsoft's global customer base. Microsoft has disclosed that the satisfaction levels of customers who use raw domain focused MT output can sometimes actually exceed the satisfaction levels of people using the same material in the English source. To my mind this is the most successful use of MT in the world today.

Two recent SMT based startups are good examples of this. In Denmark, a company called Languagelens is having very good luck with translating patent information after training its engines on previous patents. Alfabetics is focusing on developing a very high quality (human assisted) SMT translation engine for blog content.

At Asia Online we are seeing that the technical domain focused SMT systems we have built with clean data can produce some pretty compelling raw MT output. We expect that this will be a growth area for MT technology providers in the short term as technology focused enterprises make more and more content available in multilingual formats using MT as an accelerator.

However, it is also clear that none of the MT technology out there today can really replace human beings. Language is too complex, and too filled with variations and exceptions, to be completely reduced to algorithmic resolution. I think it is becoming clear that it is important to engage human beings to help raise the quality of the raw MT to a level where it becomes more usable, useful and compelling. With SMT, this continuous human feedback can help to drive the quality and capability of these systems to a level where we can start to approach human draft quality.

MT coupled with large scale human feedback can enable systems to improve at a rate that we have not seen yet. Since MT can produce large amounts of content filled with linguistic errors, it is possible to clean this up if a crowd of capable, competent humans can be motivated to help. The popular term for this phenomenon is “crowdsourcing”. We have already seen this at work on a small scale at Facebook, and at Asia Online we are embarking on a 3 million page translation of the English Wikipedia, initially into Thai and then into several other Asian languages. This approach will be used to translate tens of millions of pages and gradually raise this content to human quality levels with assistance from the broad student community that would find the content most useful.
http://news.tourthailand.org/business-news/online-proofreaders-sought.ht...

http://www.bangkokpost.com/100908_Database/10Sep2008_data62.php

MT together with web based massive online collaboration is emerging as a model that can take on huge translation tasks, and we now see several initiatives around the world beginning to explore this model. What is special about these efforts is that we are actually seeing a social phenomenon coming into focus around a collaborative technological platform involving machine translation. Even in the world of internet video content, we are seeing a company called dotSub that makes a crowdsourced subtitling capability available and allows a lot of video content in English to become viable in other parts of the world, or vice versa.

Alain Desilets of the NRC of Canada recently said, “Two technologies which will drastically change the way we translate content: massive online collaboration a la Wikipedia, and Machine Translation. Shared language data repositories are central to both the collaborative and MT innovations. A year ago, I would have said that MT was still too imperfect to impact the translation industry in any significant way. But recently, progress has been incredibly rapid, even more rapid than its most optimistic proponents ever dreamt of.”
http://www.wiki-translation.com/tiki-index.php?page=Processes+and+tools+...

Brian McConnell of the Worldwide Lexicon Blog makes a prediction in an interesting article on this site:
“The language barrier, as we know it, will be gone by 2010. Computer scientists have been chasing a Holy Grail of machine intelligence for decades, but the breakthrough that will eliminate the language barrier is social, not technical. Language, like music or art, demands people to comprehend it.”

He goes on to say, “The language barrier will be broken down in a series of simple steps. The first phase of this transition will be driven by publishers with large or highly motivated audiences. These early adopters will recognize the value of making their content visible in many languages, and their readers will be happy to contribute. Each website will develop its own translation community from its audience. At this stage of the transition, the system will be driven by a few publishers, and probably a few thousand dedicated translators.

As these projects grow, and as multilingual publishing tools become more sophisticated, aggregators will emerge. These sites will create large translation communities that decide what to translate based on their interests, whether or not a particular publisher is aware of this activity. Roaming mobs of amateur translators will translate whatever they think is interesting. Commercial services that complement volunteer based systems will also appear.”

The full article can be seen at http://blog.dermundo.com/original/2356.html

