Lecture course

 

Lecture 1. History of machine translation


The history of machine translation generally starts in the 1950s, although work can be found from earlier periods. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The experiment was a great success and ushered in an era of significant funding for machine translation research. The authors claimed that within three or five years, machine translation would be a solved problem [1].

However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding was dramatically reduced. Starting in the late 1980s, as computational power increased and became less expensive, more interest began to be shown in statistical models for machine translation.

Today there is still no system that provides the holy grail of "fully automatic high quality translation" (FAHQT). However, many programs are now available that are capable of providing useful output within strict constraints; several of them are available online, such as Google Translate and the SYSTRAN system that powers AltaVista's Babel Fish.

The beginning

The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine.

The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian, was more detailed. It included both a bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto. The system was split into three stages: in the first, a native-speaking editor in the source language organised the words into their logical forms and syntactic functions; in the second, the machine "translated" these forms into the target language; and in the third, a native-speaking editor in the target language normalised the output. Troyanskii's scheme remained unknown until the late 1950s, by which time computers were well known.

The early years

The first proposals for machine translation using computers were put forward by Warren Weaver, a researcher at the Rockefeller Foundation, in his March 1949 memorandum (Weaver, 1949). These proposals were based on information theory, the successes of code breaking during the Second World War, and speculation about universal underlying principles of natural language.

A few years after these proposals, research began in earnest at many universities in the United States. On 7 January 1954, the Georgetown-IBM experiment, the first public demonstration of an MT system, was held in New York at the head office of IBM. The demonstration was widely reported in the newspapers and attracted much public interest. The system itself, however, was no more than what today would be called a "toy" system: it had just 250 words and translated only 49 carefully selected Russian sentences into English, mainly in the field of chemistry. Nevertheless, it encouraged the view that machine translation was imminent, and in particular stimulated the financing of research, not just in the US but worldwide [1].

Early systems used large bilingual dictionaries and hand-coded rules for fixing the word order in the final output. This eventually proved too restrictive, and developments in linguistics at the time, such as generative linguistics and transformational grammar, were proposed as ways to improve the quality of translations.
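To make the idea of these early "direct" systems concrete, the following minimal Python sketch performs word-for-word dictionary lookup followed by a single hand-coded reordering rule. The toy dictionary, the transliterated Russian words and the rule are invented for this illustration and are not drawn from any actual 1950s system.

# A minimal, hypothetical sketch of "direct" machine translation:
# word-for-word dictionary lookup followed by a hand-coded reordering rule.

BILINGUAL_DICT = {                      # toy Russian -> English dictionary (transliterated)
    "bolshaya": ("ADJ", "big"),
    "molekula": ("NOUN", "molecule"),
    "reagiruet": ("VERB", "reacts"),
    "bystro": ("ADV", "quickly"),
}

def translate(sentence):
    # Stage 1: look up each source word and a rough part-of-speech tag.
    tagged = [BILINGUAL_DICT.get(word, ("UNK", word)) for word in sentence.split()]

    # Stage 2: apply a hand-coded word-order rule for the target language,
    # e.g. an adverb that precedes the verb is moved to follow it.
    output = []
    i = 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][0] == "ADV" and tagged[i + 1][0] == "VERB":
            output.extend([tagged[i + 1][1], tagged[i][1]])
            i += 2
        else:
            output.append(tagged[i][1])
            i += 1
    return " ".join(output)

print(translate("bolshaya molekula bystro reagiruet"))   # -> big molecule reacts quickly

Even this tiny example suggests why the approach was found too restrictive: every new word, word sense or word-order difference requires another hand-written entry or rule.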

During this time, operational systems were installed. The United States Air Force used a system produced by IBM and Washington University, while the Atomic Energy Commission in the United States and Euratom in Italy used a system developed at Georgetown University. While the quality of the output was poor, it nevertheless met many of the customers' needs, chiefly in terms of speed.

At the end of the 1950s, Yehoshua Bar-Hillel, a researcher asked by the US government to look into machine translation, argued against the possibility of "Fully Automatic High Quality Translation" by machines. His argument concerned semantic ambiguity, or double meaning. Consider the following sentence:

Little John was looking for his toy box. Finally he found it. The box was in the pen.

The word "pen" here may have two meanings: the first, an instrument used for writing; the second, an enclosure or container of some kind. To a human the intended meaning is obvious, but Bar-Hillel claimed that without a "universal encyclopedia" a machine would never be able to deal with this problem. Today, this type of semantic ambiguity can be addressed by writing source texts for machine translation in a controlled language, one whose vocabulary assigns each word exactly one meaning.
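As a simplified illustration of the controlled-language idea, the short Python sketch below contrasts an unrestricted lexicon, in which "pen" has two candidate senses, with a controlled vocabulary in which every word carries exactly one sense. Both sense inventories are invented for this example.

# Toy illustration of lexical ambiguity versus a controlled vocabulary.

GENERAL_LEXICON = {
    "pen": ["writing instrument", "enclosure for children or animals"],
}

CONTROLLED_LEXICON = {                  # controlled language: one meaning per word
    "pen": ["writing instrument"],
    "playpen": ["enclosure for children"],
}

def senses(word, lexicon):
    # Return the candidate senses a translation system would have to choose between.
    return lexicon.get(word.lower(), [])

print(senses("pen", GENERAL_LEXICON))      # two senses -> the machine must disambiguate
print(senses("pen", CONTROLLED_LEXICON))   # one sense  -> no disambiguation needed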

The 1960s, the ALPAC report and the seventies

Research in the 1960s in both the Soviet Union and the United States concentrated mainly on the Russian-English language pair. The texts translated were chiefly scientific and technical documents, such as articles from scientific journals. The rough translations produced were sufficient to get a basic understanding of an article. If an article discussed a subject deemed to be of security interest, it was sent to a human translator for a complete translation; if not, it was discarded.

A great blow came to machine translation research in 1966 with the publication of the ALPAC report. The report was commissioned by the US government and produced by the Automatic Language Processing Advisory Committee (ALPAC), a group of seven scientists convened in 1964. The US government was concerned about the lack of progress being made despite significant expenditure. The report concluded that machine translation was more expensive, less accurate and slower than human translation, and that despite the expense, machine translation was unlikely to reach the quality of a human translator in the near future.

The report, however, recommended that tools be developed to aid translators — automatic dictionaries, for example — and that some research in computational linguistics should continue to be supported.

The publication of the report had a profound impact on research into machine translation in the United States, and to a lesser extent in the Soviet Union and the United Kingdom. Research, at least in the US, was almost completely abandoned for over a decade. In Canada, France and Germany, however, research continued; in 1970 the Systran system was installed for the United States Air Force, and in 1976 it was adopted by the Commission of the European Communities. The METEO system, developed at the Université de Montréal, was installed in Canada in 1977 to translate weather forecasts from English to French, and was translating close to 80,000 words a day, or 30 million words a year, until it was replaced by a competitor's system on 30 September 2001.

While research in the 1960s concentrated on limited language pairs and input, demand in the 1970s was for low-cost systems that could translate a range of technical and commercial documents. This demand was spurred by increasing globalisation and the growing need for translation in Canada, Europe, and Japan.

The 1980s and early 1990s

By the 1980s, both the diversity and the number of installed systems for machine translation had increased. A number of systems relying on mainframe technology were in use, such as Systran and Logos.

As a result of the improved availability of microcomputers, there was a market for lower-end machine translation systems. Many companies took advantage of this in Europe, Japan, and the USA. Systems were also brought onto the market in China, Eastern Europe, Korea, and the Soviet Union.

During the 1980s there was a great deal of MT activity in Japan in particular. With the Fifth Generation Computer project, Japan intended to leapfrog its competitors in computer hardware and software, and one project in which many large Japanese electronics firms became involved was the creation of software for translating to and from English (Fujitsu, Toshiba, NTT, Brother, Catena, Matsushita, Mitsubishi, Sharp, Sanyo, Hitachi, NEC, Panasonic, Kodensha, Nova, Oki).

Research during the 1980s typically relied on translation through some variety of intermediary linguistic representation involving morphological, syntactic and semantic analysis.

At the end of the 1980s there was a large surge in a number of novel methods for machine translation. One system was developed at IBM that was based on statistical methods. Other groups used methods based on large numbers of example translations, a technique which is now termed example-based machine translation. A defining feature of both of these approaches was the lack of syntactic and semantic rules and reliance instead on the manipulation of large text corpora.
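The corpus-driven idea behind the IBM work can be illustrated with the toy Python sketch below, which estimates word-translation probabilities from a three-sentence parallel corpus using a few iterations of expectation-maximisation, in the spirit of IBM's Model 1. The miniature corpus and the number of iterations are arbitrary choices for this illustration; real systems were trained on very large collections of sentence pairs.

# Toy sketch of the statistical idea: estimate word-translation probabilities
# t(e | f) from a tiny parallel corpus with EM, in the spirit of IBM Model 1.
from collections import defaultdict
from itertools import product

corpus = [                                  # (foreign sentence, English sentence) pairs
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
    (["une", "maison"], ["a", "house"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# Start from uniform translation probabilities.
t = {(e, f): 1.0 / len(e_vocab) for e, f in product(e_vocab, f_vocab)}

for _ in range(10):                         # EM iterations
    count = defaultdict(float)              # expected co-occurrence counts of (e, f)
    total = defaultdict(float)              # expected counts of f
    for fs, es in corpus:
        for e in es:
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                count[(e, f)] += t[(e, f)] / norm
                total[f] += t[(e, f)] / norm
    t = {(e, f): count[(e, f)] / total[f] for e, f in product(e_vocab, f_vocab)}

print(round(t[("house", "maison")], 3))     # rises towards 1.0 as EM iterates
print(round(t[("flower", "maison")], 3))    # 0.0: the two words never co-occur

No syntactic or semantic rules are written anywhere in this sketch; the regularities emerge purely from counting over the parallel text, which is exactly the defining feature of the statistical and example-based approaches described above.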

During the 1990s, encouraged by successes in speech recognition and speech synthesis, research began into speech translation.

There was significant growth in the use of machine translation as a result of the advent of low-cost, more powerful computers. It was in the early 1990s that machine translation began to make the transition away from large mainframe computers toward personal computers and workstations. Two companies that led the PC market for a time were Globalink and MicroTac; the two companies merged in December 1994. Intergraph and Systran also began to offer PC versions around this time. Machine translation also became available on the internet, through sites such as AltaVista's Babel Fish (using Systran technology) and Google Language Tools (also initially using Systran technology exclusively).

Recent research

The field of machine translation has seen major changes in the last few years. Currently a large amount of research is being done into statistical machine translation and example-based machine translation. Today only a few companies use statistical machine translation commercially, e.g. Language Weaver (which sells translation products and services), Google (which uses its proprietary statistical MT system for some language combinations in Google's language tools) and Microsoft (which uses its proprietary statistical MT system to translate knowledge base articles). There has been renewed interest in hybridisation, with researchers combining syntactic and morphological (i.e., linguistic) knowledge with statistical systems, as well as combining statistics with existing rule-based systems.

 

References:

1. Hutchins, J. (2005). "The History of Machine Translation in a Nutshell".
2. Melby, Alan K. (1995). The Possibility of Language. Amsterdam: J. Benjamins. pp. 27–41.
3. Van Slype, G. (1983). Better Translation for Better Communications. Paris: Pergamon Press.

Lecture 2. Electronic dictionaries

 

An electronic dictionary is a dictionary whose data exists in digital form and can be accessed through a number of different media. Electronic dictionaries can be found in several forms, including:

· as dedicated handheld devices

· as apps on smartphones and tablet computers or computer software

· as a function built into an E-reader

· as CD-ROMs and DVD-ROMs, typically packaged with a printed dictionary, to be installed on the user’s own computer

· as free or paid-for online products

Overview

Most types of dictionary are available in electronic form. These include general-purpose monolingual and bilingual dictionaries, historical dictionaries such as the Oxford English Dictionary, monolingual learner's dictionaries, and specialized dictionaries of every type, such as medical or legal dictionaries, thesauruses (books of synonyms, often including related and contrasting words and antonyms, or of the specialized vocabulary of a particular field, such as medicine or music), travel dictionaries, dictionaries of idioms, and pronunciation guides.

Most of the early electronic dictionaries were, in effect, print dictionaries made available in digital form: the content was identical, but the electronic editions provided users with more powerful search functions. But soon the opportunities offered by digital media began to be exploited. Two obvious advantages are that limitations of space (and the need to optimize its use) become less pressing, so additional content can be provided; and the possibility arises of including multimedia content, such as audio pronunciations and video clips.

Electronic dictionary databases, especially those included with software dictionaries, are often extensive and can contain up to 500,000 headwords and definitions, verb conjugation tables, and a grammar reference section. Bilingual electronic dictionaries and monolingual dictionaries of inflected languages often include an interactive verb conjugator and are capable of word stemming.
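As a minimal sketch of what word stemming means in this context, the following Python fragment maps an inflected form back to a headword by trying a few hand-listed suffix rules. The two entries and the suffix list are invented for illustration and are far simpler than the morphological analysis real dictionary software performs.

# Toy sketch of stemming-based headword lookup in an electronic dictionary.

HEADWORDS = {
    "translate": "to express the sense of words in another language",
    "dictionary": "a reference work listing words with their meanings",
}

SUFFIX_RULES = [                 # (suffix to strip, string to append)
    ("ies", "y"),
    ("ing", "e"), ("ing", ""),
    ("ed", "e"), ("ed", ""),
    ("es", ""), ("s", ""),
]

def lookup(word):
    word = word.lower()
    if word in HEADWORDS:                          # exact headword match
        return word, HEADWORDS[word]
    for suffix, replacement in SUFFIX_RULES:       # try stripping common suffixes
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in HEADWORDS:
                return candidate, HEADWORDS[candidate]
    return None, "not found"

print(lookup("translating"))     # -> ('translate', 'to express the sense of words ...')
print(lookup("dictionaries"))    # -> ('dictionary', 'a reference work listing words ...')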

Publishers and developers of electronic dictionaries may offer native content from their own lexicographers, licensed data from print publications, or both, as in the case of Babylon, which offers premium content from Merriam-Webster; Ultralingua, which offers additional premium content from Collins, Masson, and Simon & Schuster; and Paragon Software, which offers original content from Duden, Britannica, Harrap, Merriam-Webster and Oxford.

Writing systems

As well as Latin script, electronic dictionaries are also available in logographic and right-to-left scripts, including Arabic, Persian, Chinese, Devanagari (the alphabet used for Sanskrit, Hindi, and other Indian languages), Greek, Hebrew, Japanese, Korean, Cyrillic, and Thai.

Dictionary software

Dictionary software generally far exceeds the scope of hand-held dictionaries. Many publishers of traditional printed dictionaries, such as Langenscheidt, Collins-Reverso, the Oxford English Dictionary (OED), Duden, American Heritage, and Hachette, offer their resources for use on desktop and laptop computers. These programs can either be downloaded or purchased on CD-ROM and installed. Other dictionary software is available from specialised electronic dictionary publishers such as iFinger, Abbyy Lingvo, Collins-Ultralingua, Mobile Systems and Paragon Software. Some electronic dictionaries provide an online discussion forum moderated by the software developers and lexicographers.




