How close are we to the reality of a Star Trek-style universal translator? The U.S. Government is making a serious effort to get us there. BY HANY FARAG
In the December 2007 issue of Translorial, Paula Dieli concluded in her article on Machine Translation (MT) that MT is no longer the comical substitution of words in one language for words in another: according to Google, MT is now based on a data-driven approach that combines linguistic typology, phrase recognition, translation of idioms, and isolation of anomalies. However, MT does not have to be limited to one form of implementation centered on documents and word files. If it is combined with a system that converts natural speech in one language into text, feeds that text into MT for translation into another language, and then converts the translated text back into speech, we will have an automated speech translation system. This is, as I prefer to call it, an Interpreter Machine (IM).
Automated speech translation, as seen in films like Star Wars, is intriguing but fictitious. So why do linguists need to know about the Interpreter Machine? The answer is: because the Defense Advanced Research Projects Agency (DARPA), a part of the U.S. Defense Department, is working on it quite seriously. DARPA is responsible for the development of new technology for the military, and while its focus since its creation in 1958 has been on military applications, many R&D by-products have found their way into ordinary life, resulting in significant changes in how we live. It is hard to imagine how the Internet could have been invented without DARPA's work on packet-switched networking. Historically, many technology breakthroughs first emerged from R&D for space and defense. As for the IM, R&D is moving along quickly, and the results are periodically evaluated by the National Institute of Standards and Technology (NIST).
First efforts after 9/11
Following the events of September 11, 2001 and the military offensives in Afghanistan and Iraq, a new strategic language demand materialized. To help American forces communicate with the local population, DARPA initiated the TRANSTAC program (Spoken Language Communication and Translation System for Tactical Use). By late 2006, two models of Interpreter Machines had become available for use. IBM developed MASTOR, a Multilingual Automatic Speech Translator, while the Stanford Research Institute (SRI) produced IraqComm. Both machines are currently being utilized in Iraq.

The Interpreter Machine achieves its objective by integrating three technological functions: Automatic Speech Recognition (ASR), followed by Machine Translation (MT), and finally Text-to-Speech Synthesis (TTS). Each function has existed independently for some time, but the IM is credited with combining these three technologies into one cohesive whole. It is likely that the synergy resulting from cooperation in ASR, MT, and TTS will positively impact each individual technology and its applications.
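For the technically curious, the three-stage flow can be sketched in a few lines of Python. The function bodies here are toy stand-ins, not the actual MASTOR or IraqComm code; a real system replaces each with a full recognizer, translation engine, and synthesizer.

```python
def recognize_speech(audio):
    # Toy ASR: a real recognizer decodes an acoustic signal into text.
    # Here the "audio" is already a string, so we just normalize it.
    return audio.strip().lower()

def translate(text, lexicon):
    # Toy word-for-word MT using a small bilingual lexicon;
    # real MT is data-driven and phrase-based, not word-for-word.
    return " ".join(lexicon.get(word, word) for word in text.split())

def synthesize(text):
    # Toy TTS: a real synthesizer would emit an audible waveform.
    return f"<speech:{text}>"

def interpret(audio, lexicon):
    # The Interpreter Machine: ASR, then MT, then TTS, in sequence.
    return synthesize(translate(recognize_speech(audio), lexicon))

# Illustrative two-word English-to-Arabic lexicon (invented for the demo).
EN_AR = {"hello": "marhaba", "friend": "sadiq"}
print(interpret("Hello friend", EN_AR))  # <speech:marhaba sadiq>
```

The chained structure also makes the system's central weakness visible: any error in the first stage flows straight into the next two.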
What is ASR?
Automatic Speech Recognition (ASR) is the process of converting a speech signal to a sequence of words by means of an algorithm executed in a computer program. This ASR algorithm is based on a statistical model called the Hidden Markov model (HMM), which is helpful in analyzing pattern recognition in speech, as well as in handwriting, biometric data, and other applications. The performance of speech recognition systems is evaluated in terms of accuracy and speed. Nowadays, ASR, as used in commercial dictation applications, exceeds 95% accuracy. In field-oriented applications, ASR performance is enhanced through speaker adaptation, microphone adaptation, end-of-speech detection, distributed speech recognition, and noise suppression.
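To give a flavor of how an HMM supports recognition, the sketch below runs the classic Viterbi algorithm on a tiny two-state model. The states, observations, and probabilities are invented for illustration; a real recognizer uses thousands of states modeling phonemes, with probabilities learned from speech data.

```python
import math

# Invented two-state HMM: states emit symbols "a" or "b".
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

def viterbi(obs):
    """Return the most likely hidden state sequence for the observations."""
    # Log-probabilities avoid numeric underflow on long sequences.
    v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            col[s] = v[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Trace the best path backwards through the stored pointers.
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "b", "b"]))  # ['s1', 's2', 's2']
```

In recognition, the hidden states correspond to phonemes and the observations to acoustic features; the decoded path is what becomes the output text.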
In the last few years, ASR applications have included voice dialing, call routing, simple data entry, preparation of structured documents, and content-based audio search. As we all know, large-scale customer service is based on interactive voice response, which is quite successful in a controlled environment. In the IM, the speech recognition function is performed in a realistic and uncontrolled environment. Accuracy must be higher, since errors will be propagated through the consecutive functions in MT and TTS.
After ASR produces text, the MT function begins and, in its usual routine, translates the text from the source into the target language. However, there can be a marked difference between written and spoken language and, in some cases, as in Arabic, the differences can be significant.
Once the translation is completed, the output in the target language is fed to the Text-to-Speech synthesizer. TTS has a long history: more than two centuries ago, an acoustic mechanical speech machine was built to model the human vocal tract. Computers were later used to synthesize speech. Successful synthesizers produce an output that is both intelligible and closely resembles human speech.
There are two fundamental approaches in TTS technology: concatenative synthesis and formant synthesis. In simple terms, concatenative synthesis draws on a large database of stored speech samples, while formant synthesis generates an artificial speech waveform. Many websites offer demos allowing people to synthesize a short volume of text into natural-sounding speech. People can choose the gender as well as the accent of the speaker, for example producing English with a German, French, or even a "cowboy" accent. Commercial software programs synthesize speech to user requirements adequate for video clips, PowerPoint presentations, games, and entertainment. The Interpreter Machine model built by SRI includes a TTS function by Cepstral, a company known for providing spoken delivery of information. IBM's MASTOR is a software application that performs real-time bidirectional translation. The software may be embedded in various platforms and operating systems, including PDAs, PCs, and laptops.
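A crude flavor of the formant approach can be given in a few lines: sum sinusoids at a vowel's formant frequencies to get a waveform. Real formant synthesizers shape a source signal with resonant filters rather than adding pure tones, and the frequencies below are only textbook ballpark figures for the vowel /a/.

```python
import math

def formant_tone(f1, f2, duration=0.1, rate=8000):
    """Toy formant-style synthesis: sum two sinusoids at the given
    formant frequencies (Hz), sampled at `rate` Hz for `duration` s."""
    n = int(duration * rate)
    return [0.5 * math.sin(2 * math.pi * f1 * t / rate)
            + 0.5 * math.sin(2 * math.pi * f2 * t / rate)
            for t in range(n)]

# Roughly 700 Hz and 1200 Hz for the first two formants of /a/.
samples = formant_tone(700, 1200)
print(len(samples))  # 800 samples = 0.1 s at 8 kHz
```

Concatenative systems, by contrast, skip the modeling and stitch together recorded snippets of human speech, which is why they tend to sound more natural.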
Another language-oriented program from DARPA is called GALE (Global Autonomous Language Exploitation). This program attempts to address the lack of qualified (U.S.) linguists and analysts who know strategically important languages like Mandarin and Arabic. GALE’s objective is to develop a software program by 2010 that will be able to translate Arabic and Mandarin with 90 to 95% accuracy. The focus of GALE is to translate, transcribe, distill, and extract actionable information, filtering through huge volumes of foreign media to define what actually needs to be translated by humans.
Before GALE, MT accuracy was about 55% on structured text and 35% on structured speech. One year later, accuracy on structured Arabic text had climbed to 75% on 90% of the documents tested, and speech accuracy had increased to 65% on 80% of the segments tested. The ultimate goal is 95% accuracy on 95% of documents and 90% accuracy on speech segments.
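Accuracy figures like these are commonly derived from the word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the system's output into a reference translation, divided by the reference length. A minimal sketch, with an invented sentence pair for illustration:

```python
def word_error_rate(reference, hypothesis):
    """WER via word-level edit distance; accuracy is often 1 - WER."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# Two substitutions in five reference words: WER 0.4, accuracy 0.6.
wer = word_error_rate("the convoy leaves at dawn", "the convoy leave at noon")
print(round(1 - wer, 2))  # 0.6
```

NIST's evaluations use more elaborate metrics, but the principle is the same: score the system's output against human references.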
The Machine in Iraq
The Interpreter Machine delivered to Iraq enables the process of having the spoken English of service personnel translated automatically into Iraqi Arabic, and to have the spoken Iraqi responses translated into English.
In the summer of 2007, six research teams comprising university labs and vendors participated in evaluations of the IM. These were conducted in a lab-controlled environment, in the field, and in offline mode, with pre-recorded audio files fed into the IM. Fifteen Marines and 10 Arabic speakers took part in the evaluation session, which centered on determining how many pieces of information a Marine would be able to retrieve from foreign-language speakers while interacting only through the Interpreter Machine.
According to the military, the evaluation was a huge success, with 160 scenarios being run over the course of one week. The test and evaluation conducted by NIST provided DARPA with statistically significant data that identified TRANSTAC systems improvements over time. The evaluation measured system capability in speech recognition, machine translation, noise robustness, user interface and efficient performance on a limited hardware platform. Outdoor evaluation included background noises.
The results of this effort go far beyond just the Arabic language. Once the technology is fully developed, DARPA hopes to be able to develop an automatic translator system in a new language within 90 days of receiving a request for that language.
In theory, based on public information about the IM, the following is envisioned:
- When a human interpreter encounters a word or phrase he or she does not completely understand, the interpreter's response is to ask for clarification or repetition. The IM can minimize errors by establishing a confidence threshold and applying the same rule: don't guess, just ask.
- The IM should provide a level of speaker recognition to reflect input gender and number of speakers in corresponding unique output channels.
- Sensitivity should be measured by change in output (translated speech) compared with a change in input (spoken language), even if the message content is the same.
- The benchmark of IM functionality and performance should be in comparison to a human interpreter whose response is spontaneous and reflexive. It is possible to count how many times the interpreter will intervene to correct the IM when they are working as a team.
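The first of these points, the "don't guess, just ask" rule, amounts to simple confidence gating. A minimal sketch, assuming the system can attach some confidence score to each translation (real systems would derive it from recognizer and decoder probabilities; the threshold and phrasing below are invented):

```python
# Illustrative cutoff; a deployed system would tune this empirically.
CONFIDENCE_THRESHOLD = 0.8

def respond(translation, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Emit the translation only when confident; otherwise ask again."""
    if confidence >= threshold:
        return translation
    return "Could you please repeat that?"

print(respond("ayna al-mustashfa?", 0.93))  # confident: emit translation
print(respond("???", 0.41))                 # uncertain: ask, don't guess
```

The same gate could drive the human-machine teaming idea above: every time the machine asks instead of guessing, or a human interpreter steps in to correct it, is a measurable event.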
Will the interpreter machine meet expectations in the battlefield? Can the machine be used in other fields? Can the IM application complement instead of compete with a human interpreter? What about European efforts such as the TC-Star project? Or Beijing’s offer of automated speech translation in the 2008 Olympics?
Good questions all, but this is another story.