Introduction to Machine Translation
Copyright Thomas D. Hedden, 1992-2005
"During the TMI-92 conference in Montreal, Jaime Carbonell gave some details of the contract signed in May 1992 between Caterpillar, the world's largest manufacturer of earth-moving equipment, and the Center for Machine Translation at Carnegie-Mellon University for the development of a fully automatic translation system. The five-year multimillion dollar contract had been concluded after an extensive evaluation by Caterpillar since the 'proof-of-concept' demonstration by the CMU team in June 1991..." (MT News International 3:11 [September 1992]).
News articles such as this show that whether machine translation is already here, coming soon, coming in the distant future, or not coming at all, it is winning big contracts. Therefore, whatever we may think of it, whether we find it laughable, impractical, or a hoax, it is an issue which must be addressed.
Machine translation (MT) means translation using computers. In its broadest sense MT can be understood to include such computer applications as compilers and compression programs, which convert a file in one computer language into a file in another computer language. However, what we are interested in here is natural language processing (NLP). One thing which MT does not mean, but which is sometimes confused with MT, is automatic speech recognition.
There are four basic types of translation, three of which are types of machine translation or machine-assisted (-aided) translation:
Human translation. A human translator performs all the steps in the translation process, using a computer only as a word processor, if at all.
Machine-assisted (-aided) human translation (MAHT). The translation is performed by a human translator, but he/she uses the computer as a tool to improve or speed up the translation process. This is called computer-assisted (-aided) translation (CAT) by people in the field of translation as opposed to the field of MT.
Human-assisted (-aided) machine translation (HAMT). The source language (SL) text is modified by a human translator either before, during, or after it is translated by the computer.
Fully automatic (automated) machine translation (FAMT). The SL text is fed into the computer as a file, and the computer produces a translation automatically without any human intervention. This is sometimes referred to as batch mode. There are two types of fully automatic machine translation: fully automatic high-quality machine translation (FAHQMT) and low-quality machine translation.
The type of MT which people think of when they hear the term machine translation is usually the last type (fully automatic MT). The term MT will be used loosely in this report to mean MAHT, HAMT, or FAMT. Obviously the ultimate goal of MT is FAMT, although its results have also made it the subject of much ridicule, sometimes with good reason. Note that the distinction between HAMT and FAMT is partly conventional, since a FAMT system would be considered a HAMT system if the output is post-edited, and any translation can be checked.
MT is a part of the field of knowledge called artificial intelligence (AI). There are many different definitions of artificial intelligence, but a simple way of thinking of it is as the attempt to emulate human patterns of thinking and behavior using computer models. Today AI is used in robotics, pattern recognition, etc., as well as in MT.
Apparently the first suggestions concerning MT were made by the Russian Smirnov-Troyansky and the Frenchman Georges Artsrouni in the 1930s. However, the first serious discussions were begun in 1946 by the mathematician Warren Weaver. He and many others were inspired by the success of the Allied wartime efforts to break German military ciphers using machines such as the British Colossus computer, and by the obvious similarity between the task of decoding an encoded message and the task of translating one language into another. By 1954 there was an MT project at Georgetown University which succeeded in correctly translating several sentences from Russian into English. Soon there were MT projects at MIT, Harvard, and the University of Pennsylvania. In 1964, after more than $20,000,000 had been invested by the Federal Government in MT, the National Academy of Sciences commissioned the Automatic Language Processing Advisory Committee (ALPAC) to write a study of the status of MT. The committee, headed by John R. Pierce, wrote a now-famous report in which it expressed doubt that a fully automatic MT system could ever be produced. That report sounded the death-knell for funding of MT research, and MT was neglected for many years afterwards. The reasons for this failure have been described many times, and come down to the fact that the analysis by humans of messages in natural language relies to some extent on information which is not present in the words which make up the message. This led the linguist Yehoshua Bar-Hillel to declare that MT was impossible. The example which he provided has since become a classic, and is now called the Bar-Hillel paradox:
The pen is in the box. [i.e. the writing instrument is in the container]
The box is in the pen. [i.e. the container is in the playpen or the pigpen]
There are two possible ways that a person could correctly infer the meaning of these sentences. First, if there is a context preceding these sentences, it could make clear which meaning of pen is being used in which sentence. That is, the meaning of the words and information about the context is carried over from one sentence to the next. There is now an entire branch of linguistics, called discourse analysis, devoted to the study of how context affects the meaning of words and sentences. In order to infer in this way the correct meaning of an ambiguous sentence, computers will have to learn how to "remember" a context and make use of it to interpret the correct meaning of words and sentences within that context.
However, in the examples given above, most humans can understand the meaning correctly without any context. In order for a fully automatic MT system to translate these sentences correctly, the following information would have to be available to the computer:
pens [writing instruments] are smaller than boxes
boxes are bigger than pens [writing instruments], but smaller than pens [playpens, pigpens, etc.]
it is impossible for a bigger object to be inside a smaller object
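The three facts above can be written down as a very small knowledge base. The following Python sketch is only an illustration, not a description of any real MT system: the sense labels, the sizes in metres, and the function name resolve_in are all assumptions made up for this example.

```python
# Hypothetical sense inventory: each sense of a noun is given a rough typical
# size in metres. The numbers are illustrative guesses, not measurements.
SENSES = {
    "pen": {"writing instrument": 0.15, "enclosure": 2.0},
    "box": {"container": 0.5},
}


def resolve_in(inner: str, outer: str) -> tuple[str, str]:
    """Choose senses for a sentence of the form 'The <inner> is in the <outer>'.

    Applies the rule that a bigger object cannot be inside a smaller one and
    returns the first (inner sense, outer sense) pair that satisfies it.
    """
    for inner_sense, inner_size in SENSES[inner].items():
        for outer_sense, outer_size in SENSES[outer].items():
            if inner_size < outer_size:
                return inner_sense, outer_sense
    raise ValueError(f"no consistent reading for {inner!r} in {outer!r}")


if __name__ == "__main__":
    print(resolve_in("pen", "box"))  # ('writing instrument', 'container')
    print(resolve_in("box", "pen"))  # ('container', 'enclosure')
```

Even this toy version shows where the difficulty lies: the size table is information that appears nowhere in either sentence, and a real system would need comparable facts for every ambiguous word it might meet.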
Thus, one way or the other, whether the correct meaning of the sentences is inferred based on the context or in isolation, it is necessary for the computer to have information at its disposal which is not included in the message itself. During the early days of MT this realization was enough to make MT seem an impossible task. Interest in MT revived in the 1980's, following dramatic advances in computer hardware (storage capacity, speed, etc.) and software (LISP, etc.). The need to store and process tremendous amounts of real-world knowledge in order to analyze a single word in the message ceased to be an impediment to design and use of MT systems.
The theory of MT is very complicated, and I will not go into it in detail. However, there are several basic theoretical approaches to MT.
The most primitive is called the direct MT strategy. This approach always works between specific pairs of languages, and is based on good glossaries and morphological analysis.
The next most advanced system is called the transfer MT strategy. This is still being used today, although it has a competitor (see below). First, the SL text is parsed into an abstract internal representation. Thereafter, a 'transfer' is made into the corresponding structures in the target language (TL). Then a translation is generated. This approach is more advanced theoretically, but also translates between specific pairs of languages.
Both the direct MT strategy and the transfer MT strategy can take advantage of similarities between languages. The direct MT strategy has been criticized as theoretically inelegant, although it probably comes closest to modelling how human translators work. Both strategies have been criticized because they require that separate translation software be written for every language combination. (Actually, from the point of view of someone with experience in the translation industry, these concerns seem trivial, since an overwhelming majority of translation work is done in fewer than a dozen language combinations: ten or so packages could be written for these combinations, and other combinations could continue to be translated manually.)
The most advanced system is called the interlingua MT strategy. The idea behind this approach is to create an artificial language, known as the interlingua, which shares all the features and makes all the distinctions of all languages. To translate between two different languages, an analyzer is used to put the SL into the interlingua, and a generator converts the interlingua into the TL. The proponents of this system argue that it reduces the number of analyzers and generators which are required, since only one generator and one analyzer are required for each language, no matter how many other languages there are. While this is true, I think that the proponents of the interlingua MT strategy probably underestimate how complex an interlingua would have to be in order to work as an intermediary among many or even a few unrelated languages, as opposed to among the few related European languages on which most work has been done. If the interlingua is extremely complex, this also means that the analyzers and generators will have to be extremely complex. It is hard to avoid thinking that this approach was inspired by the idea of a universal language such as Esperanto, which has excited certain linguists for centuries.
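The argument about the number of analyzers and generators is simple arithmetic, and a short Python sketch makes it concrete. It assumes that each translation direction counts as a separate package under the direct and transfer strategies; the function names are hypothetical and chosen only for this illustration.

```python
def pairwise_modules(n_languages: int) -> int:
    """Packages needed by the direct or transfer strategy.

    Assumption: one package per ordered (source, target) pair, so n
    languages need n * (n - 1) packages.
    """
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages: int) -> int:
    """Modules needed by the interlingua strategy.

    One analyzer and one generator per language, so 2 * n modules.
    """
    return 2 * n_languages


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        print(n, pairwise_modules(n), interlingua_modules(n))
```

Under these assumptions the two approaches need the same number of modules at three languages (six each), and the interlingua only pulls ahead beyond that. The count says nothing about how hard each module is to write, which is the objection raised above.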
Other strategies
There is a fairly new approach called knowledge-based machine translation (KBMT), which is similar to the interlingua approach, in that the SL is converted into an intermediate form independent of any specific language. It differs in that the intermediate form is of a semantic nature rather than a syntactic nature. Writers of MT systems have also explored the relative frequency of the various meanings of words with multiple meanings, and have attempted to include a great deal of real-world information in glossaries. For a more detailed discussion of the theory of MT see Nirenburg (1987).
Human intervention can also mean post-editing to check the translation and fix mistakes made by the computer. It should be noted that the pre-editing and glossary compilation required for HAMT typically require a person who is a trained linguist who can parse the syntax of the sentence, not simply a translator who understands the foreign language and can express it in his/her own language. Obviously the most primitive is the system which requires pre-editing, since the computer cannot handle the text unless a human converts NL into a semi-artificial language which is easier for the computer to understand. The ideal is when the automatic translation is so good that all that is necessary is to check the translation and change a few details. Interactive intervention can be anywhere in between.
Although there are FAMT systems, and although they may suit the needs of people who have to search through mountains of information and only need to get a very general idea of the contents of a document (a good example is provided by the low-quality needs of the military and the intelligence agencies), high-quality translation of truly natural language which is really fully automatic (automated) hardly exists. FAHQMT systems either require the compilation of extensive glossaries, or are restricted to specific subworlds or sublanguages, or both.
Survey of MT systems.
This part of this report is very incomplete. Hopefully it will be expanded later.
HAMT Systems
ALPS
Logos system
  New Word Search
  Noun Phrase Search
  Revisions Processor
  ALEX (Automatic Lexicographer)
  SEMANTHA
  Never EOS
  Merge and Restore Utilities
  Print Facilities
  Translation Facility
The Logos system produces quite respectable results, but a large amount of time must be invested in building specialized glossaries. The Logos system can be a successful approach for large or continuing projects.
Intergraph's DP/Translator
Canadian Environmental Department's TAUM-MTO
FAMT Systems
Common "pocket calculator" translators
FinalSoft ($139)
Globalink ($998 per language direction)
MicroTac Spanish Assistant, etc. ($79.95 each)
PC-Translator ($985 per language direction)
Systran
Toltran ($498 per language direction)
Translate ($495 Eng-Span only, Span-Eng available soon)
Translator ($69.95 Eng-Span and Span-Eng only)
Experimental MT Systems
Ariane
Eurotra
Candide
DLT (Distributed Language Translation)
METAL
SUSY (Saarbrücker Übersetzungssystem)
TRANSLATOR
for the form of the glossaries are much more elaborate than are the requirements for glossaries which are good enough for human translators. As was mentioned above, the requirements for glossary compilation are such that they cannot be met by a typical translator. The time spent compiling glossaries has to be weighed against the time saved by using the MT system. Obviously, it is not cost-effective to compile a glossary in order to use an MT system to translate a two-page personnel policy into Spanish. But just as obviously, it could pay off in the translation of a whole series of manuals on the same subject. The process of glossary creation can now be simplified by routines which help identify those terms which are not in the glossary, and by tools for building new glossaries on the basis of existing glossaries.
Reduced standardization and review of terminology. In the long run, glossaries compiled for use in-house cannot be of the same high quality as published dictionaries, since they are not widely distributed and exposed to criticism from thousands of users, the way published dictionaries are.
Free-lancers. A problem for translation agencies is that they rely heavily on free-lance translators, who may not have the software necessary to work on the project. They may not be able to afford it. They may not be willing to learn it. Being free-lancers, they work with various agencies, and if every agency requires the free-lancer to learn a different package, the translator may balk. Even if translators are given the software and even if they are willing to learn it, they may not have the appropriate hardware. Obviously, such systems can be used to their full potential only in-house. Using MT systems in-house requires in-house translators. Having in-house translators requires a predictable flow of work which can keep a translator busy. Note that this always has been a problem, and continues to be a problem even for translation agencies, not to mention end-users of translations. IBM's and AT&T's in-house translation departments began accepting work from outside sources for that very reason.
Poorly-formed SL text. If the SL original is badly written, as most manuals are, then the system will have difficulty translating it correctly, whereas a human translator can understand the intent of the writer, and produce a translation which is better than the SL original. In order to avoid this problem, some large users of MT, such as Caterpillar, have set up requirements for the style of the SL text. If the original writer did not adhere to these requirements, it is necessary for the text to be edited. However, it can take almost as long for an editor to put the source language text into "standard" form as it would for a good human translator to translate it.
Reluctance of translators to use the system. A good case study of how implementing an in-house HAMT system can backfire is provided by a well-known technical translation company in Southern California which adopted such a system. The system adopted was an extremely tedious interactive system. In the end the best translators quit. (This company was later purchased by one of the largest translation agencies, but some of the original players have formed another company under a similar name.)
Size of translation assignment. Many assignments are simply too short to justify going to the trouble of using an MT system.
No file available. For many assignments, no file of the SL text is available. Although OCR can solve this problem to some extent, this is an additional step which requires additional manpower, additional software, and additional hardware, and must be factored into any calculation of the benefits of MT vs. using human translators.
Secrecy. The competitive advantage offered by a successful system and the enormous investment required to attain that advantage mean that successful MT systems and their glossaries may be jealously guarded rather than released and widely shared. Contrast this state of affairs with that of human translators, the best of whom often have a financial incentive to compile and publish glossaries, thus improving the overall level of knowledge in the industry.
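The routines mentioned above for identifying terms which are not yet in the glossary can be very simple in principle. The following Python sketch is a hypothetical illustration and not the method of any product named in this report: the function name missing_terms and the sample glossary are assumptions, and it looks only at single words, with no morphological analysis and no multi-word terms.

```python
import re


def missing_terms(text: str, glossary: dict[str, str]) -> list[str]:
    """Return the distinct words in `text` that have no glossary entry.

    Words are lower-cased and reported in order of first appearance.
    Assumption: the glossary maps single SL words to TL words.
    """
    known = {term.lower() for term in glossary}
    seen = set()
    missing = []
    # [^\W\d_]+ matches runs of letters, including accented letters.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in known and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


if __name__ == "__main__":
    glossary = {"engine": "motor", "oil": "aceite"}  # made-up sample entries
    print(missing_terms("Check the engine oil. Check the oil filter.", glossary))
    # ['check', 'the', 'filter']
```

A list like this tells the glossary compiler where the gaps are before a job starts, but deciding on the correct entry for each missing term is still the expensive part.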
Other considerations.
Speed. An HAMT or FAMT system which already has high-quality glossaries and is up and running can translate 75-100 pages per hour. This is obviously worth thinking about when poorly organized clients need multiple manuals translated with ridiculously short turn-around times.
Consistency. One respect in which MT is superior to human translators is consistency. The computer may not choose the correct translation, but it will use the same word everywhere, as opposed to human translators who sometimes try to make the translation more interesting by using first one translation, then another. Consistency is extremely important in translations of software and of software documentation, and also makes it easier to make changes later, should it be decided to change one term to another.
PR value (showoff value). Since MT has a high-tech sound and look to it, it offers a lot of potential to impress gullible potential clients. Unfortunately there are a lot of huge clients who are easily tricked into believing the claims of MT companies. Of course, MT also has a very bad reputation among many people who have experience with it or who have heard horror stories about the poor quality, delays, etc., which can result from relying on MT. There was a horror story printed in NOTIS News in 1992. Thus, boasting about MT to potential clients is not always a good idea. Some agencies advertise that they offer MT, but that they also offer human translations when quality is of paramount importance.
Low-quality market. There is a real market for junk translations. That is not to say that clients want poor quality, but rather that they are satisfied with low quality, and if they can get a low-quality translation done quickly for a low price, that is what they will choose. Agencies which offer low-quality MT can satisfy these clients.
File format compatibility. What are the requirements for compatibility with existing operating systems and software, and are any limitations placed on which applications can be used before/after/during the translation? Do files have to be converted? How much work will this involve? Can existing glossaries be imported into the system's glossaries? How much work will be necessary to convert these glossaries?
Hardware. Some systems, such as Logos, require sophisticated equipment: a fairly powerful UNIX workstation (ca. $20,000). Some less sophisticated systems will work on PCs or Macs; one requires Windows NT.
Compatibility. Can the system be used with a network? Can the system interact with and take advantage of powerful glossary tools such as Termium, the glossary produced by the Canadian government? This is especially important since in-house glossaries will never be a match for resources such as Termium. Will the system be able to read and take advantage of existing CD-ROM and on-line glossaries, etc., or will such resources have to be used separately?
Industry standard. Which system will be most widely adopted? (The best system to adopt is the one which will become standard.)
Learning curve. Especially as long as it remains uncertain which package will become the industry standard, it is important that the system not require too great an investment of time to learn.
Format. Are formatting/style codes preserved? (This is especially important if the file has to be converted.) Can the system interface with common interchange formats (RTF, FrameMaker's MIF, etc.)? Will it be necessary for someone to reformat translations produced by a given system?
Free-lance access to MT. AT&T tried to get free-lancers to turn over their rush projects to it; AT&T would have them machine-translated and then return them to the translator for post-editing. If free-lance translators begin using MT themselves, then there is no point in subcontracting to human translators. However, the fact that almost no good translators actually do this provides fairly good evidence that human translation is actually cheaper than MT (otherwise translators could subcontract to MT vendors, and then pass off the work as their own and reap a profit by charging a higher rate).
Separation between research and reality. In the United States there is almost no connection between the world of theoretical research in MT and the real world of application (the only major exception being the work at Carnegie Mellon for Caterpillar). Theoreticians are mostly interested in discussions about methodology, syntactic parsers, etc., rather than the requirements of real-world applications and how to satisfy them. Research money and the energy of the best researchers are channelled into theory rather than into practical systems.
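The trade-off between the time invested in glossaries and the time saved by a fast system can be estimated with a rough break-even calculation. The Python sketch below is only an illustration under stated assumptions: the function name and all figures other than the 75-100 pages per hour quoted under "Speed" are hypothetical, and it ignores cost, quality, and file conversion entirely.

```python
from typing import Optional


def break_even_pages(
    glossary_hours: float,
    human_pages_per_hour: float,
    mt_pages_per_hour: float,
    post_edit_hours_per_page: float = 0.0,
) -> Optional[float]:
    """Pages needed before the time saved by MT repays the glossary work.

    Returns None when MT plus post-editing is no faster per page than
    human translation, in which case the glossary never pays off.
    """
    human_hours_per_page = 1.0 / human_pages_per_hour
    mt_hours_per_page = 1.0 / mt_pages_per_hour + post_edit_hours_per_page
    saved_per_page = human_hours_per_page - mt_hours_per_page
    if saved_per_page <= 0:
        return None
    return glossary_hours / saved_per_page


if __name__ == "__main__":
    # Hypothetical inputs: 40 hours of glossary work, a human translator at
    # 1 page per hour, MT at 75 pages per hour, 30 minutes of post-editing
    # per page.
    print(break_even_pages(40, 1.0, 75, post_edit_hours_per_page=0.5))
```

With these made-up inputs the break-even point is a little over 80 pages, which is consistent with the observation that a two-page document can never justify the glossary work while a series of manuals might.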
Association for Computational Linguistics. 1974- . Computational Linguistics. Cambridge, MA: MIT Press Journals.
Association for Computational Linguistics. 1974- . The FINITE STRING. Cambridge, MA: MIT Press Journals.
Association for Machine Translation in the Americas. (forthcoming). MT Yellow Pages.
Barr, Avron, and E.A. Feigenbaum, eds. 1981. The Handbook of Artificial Intelligence, vol. 1. Reading, MA: Addison-Wesley Publishing Company. [Stanford $27.95]
Carbonell, Jaime, et al. 1992. JTEC Panel Report on Machine Translation in Japan. Baltimore, MD: Loyola College in Maryland. [available from NTIS]
Crystal, D. 1987. The Cambridge Encyclopedia of Language, pp. 350-351. Cambridge, etc.: Cambridge University Press.
Firebaugh, M.W. 1988. Artificial Intelligence: a knowledge-based approach, pp. 262-272. Boston: Boyd & Fraser Publishing Co.
Gazdar, G., A. Franz, K. Osborne, and R. Evans. 1987. Natural Language Processing in the 1980s. A bibliography. (CSLI Lecture Notes, 12.) Stanford-Palo Alto-Menlo Park: Center for the Study of Language and Information.
Hutchins, W.J., and H.L. Somers. 1992. An Introduction to Machine Translation. London, etc.: Academic Press (Harcourt, Brace Jovanovich Publishers). [Stanford $42.50]
International Association for Machine Translation. 1992- . MT News International. Newsletter of the International Association for Machine Translation. Washington, D.C.:
Lieberman, E.J. 1992. Language Futures. Esperantic Studies 1992:Summer.3.1-2.
Mendoza, Rick. 1991. Translator's little helpers. Hispanic Business 1991:October.32-33.
Nirenburg, S. 1987. Machine Translation: theoretical and methodological issues. Cambridge, etc.: Cambridge UP.
Resnick, Rosalind. 1991. Language liberators. International Business 1991:December.61-62.
Spark Jones, K., and M. Kay. 1973. Linguistics and Information Science. Academic Press.
Tresman, Ian. 1991. Multilingual PC Directory: A guide to multilingual and foreign language products for IBM PCs and compatibles. Borehamwood, U.K.: Herts.
Winograd, T. 1984. Computer Software for Working with Language. Scientific American. Reprinted in W. S-Y. Wang, Language, Writing and the Computer, 61-72. (New York: W.H. Freeman and Company, 1986.)
American Society for Information Science (ASIS). 19??- . Address: 8720 Georgia Avenue, Suite 501, Silver Spring, MD 20910-3602.
Asia-Pacific Association for Machine Translation (formerly Japan Association for Machine Translation). 1992- . Address: 3F, Shibakoen Sanada Bldg, 3-5-12 Shibakoen, Minato-ku, Tokyo 105-0011 Japan, e-mail [email protected].
Association for Information Management (ASLIB). 19??- . Address: Information House, 20-24 Old Street, London EC1V 9AP England tel. +44/71/253 4488; fax +44/71/430 0514
Association for Computational Linguistics (ACL). 1962- . PO Box 6090, Somerset, NJ 08875, USA. Tel. +1 (908) 873-3898, fax +1 (908) 873-0014. Email [email protected]
Association for Machine Translation in the Americas. 19??- . Address: PMB 300, 1201 Pennsylvania Avenue, NW, Suite 300, Washington, DC 20004, USA, e-mail [email protected]
Center for Machine Translation. Carnegie Mellon University.
Center for the Study of Language and Information (CSLI). CSLI/SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025; CSLI/Stanford, Ventura Hall, Stanford, CA 94305; CSLI/Xerox PARC, 3333 Coyote Road, Palo Alto, CA 94304.
European Association for Machine Translation. 19??- . EAMT Secretariat, c/o TIM/ISSCO, Université de Genève, Ecole de Traduction et d'Interpretation, 40 blvd du Pont d'Arve, CH-1211 Geneva 4, Switzerland. Email: [email protected] or [email protected]
International Association for Machine Translation. 19??- . Address: c/o AMTA.
Japan Electronic Dictionary Research Institute, Ltd. 19??- . Address: Mita Kokusai Building, 4-28 Mita 1-chome, Minato-ku, Tokyo 108, Japan.
Linguistic Data Consortium. 1992?- . University of Pennsylvania.
Microelectronics and Computer Technology Corporation (a consortium). 19??- . Austin, TX.
Microsoft NLP Research Group. 19??- . Address: Microsoft Corporation, One Microsoft Way, Redmond, WA 98052.
6034 West Courtyard Drive, Suite 305, Austin, TX 78730. tel 800/324-2150; 512/338-2150. fax 512/338-2151; 512/338-2152
INK Tools. INK International, Prins Hendriklaan 52, 1075 BD Amsterdam, Netherlands. INK International V.B., Gebouw "De Amiraal", Baarjesweg 224 AA, Amsterdam, Netherlands. tel. +31/20/164 591, fax +31/20/163 851
Intergraph Corporation, Natural Language Group, Huntsville, AL 35894-0003
Kurt Lücken, Vertrieb für EDV-Literatur und EDV-Zubehör, Emmerichshohl 6, 6380 Bad Homburg 6, Deutschland. 06172/4 73 87, 06172/4 59 76 3
Language Engineering Corporation, 385 Concord Avenue, Belmont, MA 02178 USA. Tel: +1 (617) 489-4000, Fax: +1 (617) 489-3850, E-mail: [email protected]
Language Weaver, 4640 Admiralty Way, Suite 423, Marina Del Rey, CA 90292 USA. Tel: +1 (310) 437-7300, Fax: +1 (310) 437-7307, E-mail: [email protected]
Logos Corporation, One Dedham Place, Suite 4, Dedham, MA 02026. tel 617/326-1595, fax 617/326-9341
TRADOS Corporation, 113 South Columbus Street, Suite 400, Alexandria, VA 22314. Tel +1 (703) 683-6900, Fax +1 (703) 683-9457, E-mail: [email protected]
MicroTac Software, Inc. (Spanish Assistant, French Asst., etc.), 4655 Cass Street, Suite 214, San Diego, CA 92109. tel 1-800-423-3556; 619/272-5700
Linguistic Products (PC-Translator), The Woodlands, TX
Polygon Industries, Inc. (Translator), New Orleans, LA
SYSTRAN. Headquarters: SYSTRAN S.A., 1, rue du Cimetière, 95230 Soisy-sous-Montmorency, France. Tel: +33 (1) 39 34 97 97, Fax: +33 (1) 39 89 49 34. North America: SYSTRAN Software, Inc., 9333 Genesee Avenue, Plaza Level, Suite PL1, San Diego, CA 92121-2112 USA. Tel.: +1 (858) 457-1900, Fax: +1 (858) 457-0648, E-mail: [email protected]
Toin America Corporation, Atlanta Financial Center, Suite 1120, 3353 Peachtree Road N.E., Atlanta, GA 30326. tel 404/240-4110, fax 404/240-4111
Toin Corporation, 3-9-1 Meguro, Meguro-ku, Tokyo 153, Japan. tel 03/5721-3016, fax 03/5721-3261
Toltran Ltd., Barrington, IL
Trados GmbH, Stuttgart
[Note: The first MT conference took place at MIT in 1952.]
Annual Meeting of the Association for Computational Linguistics. Annual.
ATA Conference. Annual.
COLING. International Conference on Computational Linguistics.
Conference on Applied Natural Language Processing.
European Conference on Artificial Intelligence.
International Conference on Current Issues in Computational Linguistics.
International Conference on Theoretical and Methodological Issues in Machine Translation. Bi-annual.
International Workshop on Natural Language Generation.
MT Summit. Annual.
MT World.
Translation and the European Communities Conference.
American government.
Defense Advanced Research Projects Agency (DARPA).
Dutch government (10M to BSO)
Japanese government.
Japan Key Technology Center
Distributed Language Translation (DLT)
"Fifth Generation Project"
Electronic Dictionary Project (will cost > $100,000,000)
Cyc (pron. "psyche", from encyclopedia)
Change History
Minor updates.
8 July 2000: updated contact information for Trados Corporation; fixed closing "heading 2" code in section "Makers/Vendors of Commercial Products".
12 July 2000: added missing word to sentence "This is obviously worth thinking [about] when ..."; added link to ACL home page.
18 July 2000: put FAMT in separate paragraph; changed "interpret correctly the meaning" to "interpret the correct meaning".
3 August 2000: added sentence and link to page with list of MAHT tools; added link to Esperanto home page.
10 November 2000: updated links to the ACL, AMTA, APAMT, and EAMT; added link to EAMT as a source of more information; added link to the Compendium of Translation Software.
10 October 2003: updated header and copyright statement.
26 October 2003: made XHTML 1.0 compliant; added link to Termium website.
25 November 2003: added AppTek, Basis Technology, Language Engineering, and Language Weaver to list of commercial MT systems; updated contact info for SYSTRAN; fixed typo in name of PAHO; added link to MT Summit web site.
29 January 2004: added link to home page and "viewable with any browser" statement.