REFERENCES TO WORDS AND WORD COMBINATIONS




References from a specific word give access to the set of words semantically related to it, or to the words that can form combinations with it in a text. This is a very important application. Nowadays it is performed with linguistic tools of two different kinds: autonomous on-line dictionaries and built-in dictionaries of synonyms.

Within typical text processors, synonymy dictionaries are usually called thesauri. Later we will see that this name fits synonymy dictionaries poorly, since genuine thesauri usually include much more information, for example, references to generic words, i.e., names of superclasses, and to specific words, i.e., names of subclasses.

References to various words or word combinations of a given natural language are intended to help the author create more correct, flexible, and idiomatic texts. Indeed, only an insignificant part of all conceivable word combinations is actually permitted in a language, so that knowledge of the permitted and common combinations is a very important part of the linguistic competence of any author. For example, a foreigner might want to know all the verbs commonly used with the Spanish noun ayuda, such as prestar or pedir, or with the noun atención, such as dedicar or prestar, in order to avoid combinations like pagar atención, a word-by-word translation of the English to pay attention. Special language-dependent dictionaries are necessary for this purpose (see, for example, Figure III.2).

FIGURE III.2. CrossLexica, a dictionary of word combinations.


Within such systems, various complex operations are needed, such as automated reduction of the entered words to their dictionary forms, search for relevant words in the corresponding linguistic database, and display of all of them in a form convenient for a non-linguist user. These operations are versatile and include both morphologic and syntactic issues [37].

Another example of a dictionary that provides a number of semantic relations between different lexemes is EuroWordNet [55], a huge lexical resource reflecting diverse semantic links between lexemes of several European languages.

The conceptual basis of EuroWordNet is the English dictionary WordNet [41]. English nouns, verbs, adjectives, and adverbs were divided into synonymy groups, or synsets. Several semantic relations were established between synsets: antonymy (reference to the opposite meaning), hyponymy (references to the subclasses), hyperonymy (reference to the superclass), meronymy (references to the parts), holonymy (reference to the whole), etc. Semantic links were also established between synsets of different parts of speech.

The classification hierarchy for nouns is especially well developed within WordNet. The number of hierarchical levels is on average 6 to 7, sometimes reaching 15. The upper levels of the hierarchy form the ontology, i.e., a presupposed scheme of human knowledge.
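
As an illustration, synsets and their relations can be modeled as a small labeled graph. The following Python sketch uses invented toy data (real WordNet contains tens of thousands of synsets); walking the hyperonymy links upward reproduces a path to the top of the ontology:

    # Toy illustration of synsets and semantic relations between them.
    # All data and names here are invented; real WordNet is vastly larger.
    synsets = {
        "s_dog":    {"words": ["dog", "domestic dog"],
                     "hypernym": "s_canine",
                     "meronyms": ["s_paw", "s_tail"]},   # references to the parts
        "s_canine": {"words": ["canine"], "hypernym": "s_animal"},
        "s_animal": {"words": ["animal"], "hypernym": None},
    }

    def hypernym_chain(synset_id):
        """Walk the hyperonymy links up to the top of the ontology."""
        while synset_id is not None:
            yield synsets[synset_id]["words"][0]
            synset_id = synsets[synset_id]["hypernym"]

    print(list(hypernym_chain("s_dog")))   # ['dog', 'canine', 'animal']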

In essence, EuroWordNet is a transfer of the WordNet hierarchy to several other European languages, in particular to Spanish. The upper levels of the ontology were obtained by direct translation from English, while for the other levels additional lexicographic research turned out to be necessary. In this way, links were determined not only between synsets within each involved language, but also between synsets of a number of different languages.

The efforts invested in WordNet and EuroWordNet were tremendous. Approximately 25,000 words were elaborated in several languages.

INFORMATION RETRIEVAL

Information retrieval systems (IRS) are designed to search for relevant information in large documentary databases. This information can be of various kinds, with the queries ranging from Find all the documents containing the word conjugar to Find information on the conjugation of Spanish verbs. Accordingly, various systems use different methods of search.

The earliest IRSs were developed to search for scientific articles on a specific topic. Usually, the scientists supply their papers with a set of keywords, i.e., the terms they consider most important and relevant for the topic of the paper. For example, español, verbos, subjuntivo might be the keyword set of the article On means of expressing unreal conditions in a Spanish scientific journal.

These sets of keywords are attached to the document in the bibliographic database of the IRS, being kept physically together with the corresponding documents or separately from them. In the simplest case, the query should explicitly contain one or more of such keywords as the condition on which a document can be found and retrieved from the database. Here is an example of a query: Find the documents on verbos and español. In a more elaborate system, a query can be a longer logical expression with the operators and, or, not, e.g.: Find the documents on (sustantivos or adjetivos) and (not inglés).
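
Such a Boolean query can be evaluated directly against the keyword sets. Below is a minimal Python sketch; the document collection and all names are ours, purely for illustration:

    # Sketch of Boolean keyword search; the collection and all names are
    # invented for illustration.
    documents = {
        "doc1": {"verbos", "español", "subjuntivo"},
        "doc2": {"sustantivos", "inglés"},
        "doc3": {"adjetivos", "español"},
    }

    # Find the documents on (sustantivos or adjetivos) and (not inglés):
    query = lambda kw: ("sustantivos" in kw or "adjetivos" in kw) \
                       and "inglés" not in kw
    print([doc for doc, kw in documents.items() if query(kw)])   # ['doc3']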

Nowadays, a simple but powerful approach to the format of the query is becoming popular in IRSs for non-professional users: the query is still a set of words; the system first tries to find the documents containing all of these words, then all but one, etc., and finally those containing only one of the words. Thus, the set of keywords is considered in a step-by-step transition from conjunction to disjunction of their occurrences. The results are ordered by degree of relevance, which can be measured by the number of relevant keywords found in the document. The documents containing more keywords are presented to the user first.

In some systems the user can manually set a threshold for the number of the keywords present in the documents, i.e., to search for at least m of n keywords. With m = n, often too few documents, if any, are retrieved and many relevant documents are not found; with m = 1, too many unrelated ones are retrieved because of a high rate of false alarms.
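
The step-by-step weakening from conjunction to disjunction, together with the m-of-n threshold, can be sketched as follows (again with invented data; a real IRS would index the keywords rather than scan all documents):

    # Sketch of "at least m of n keywords" retrieval with relevance ranking;
    # data and names are invented.
    documents = {
        "doc1": {"verbos", "español", "subjuntivo"},
        "doc2": {"sustantivos", "inglés"},
        "doc3": {"adjetivos", "español"},
    }

    def retrieve(documents, query_keywords, m=1):
        ranked = []
        for doc_id, doc_keywords in documents.items():
            hits = len(query_keywords & doc_keywords)
            if hits >= m:                     # the user-selected threshold
                ranked.append((hits, doc_id))
        ranked.sort(reverse=True)             # more keywords found => shown first
        return [doc_id for hits, doc_id in ranked]

    print(retrieve(documents, {"verbos", "español"}, m=1))   # ['doc1', 'doc3']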

Usually, recall and precision are considered the main characteristics of IRSs. Recall is the ratio of the number of relevant documents found to the total number of relevant documents in the database. Precision is the ratio of the number of relevant documents found to the total number of documents retrieved.

It is easy to see that these characteristics are in conflict in the general case, i.e., the greater one of them, the lesser the other, so that it is necessary to keep a proper balance between them.
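
In code, both measures are simple ratios; the sketch below assumes the sets of retrieved and truly relevant documents are known:

    # Recall and precision for one query; 'found' and 'relevant' are the
    # sets of retrieved and truly relevant documents, respectively.
    def recall_precision(found, relevant):
        relevant_found = len(found & relevant)
        recall = relevant_found / len(relevant)     # found share of all relevant
        precision = relevant_found / len(found)     # relevant share of all found
        return recall, precision

    # 3 of the 4 relevant documents found, among 6 retrieved in total:
    print(recall_precision({"d1", "d2", "d3", "d7", "d8", "d9"},
                           {"d1", "d2", "d3", "d4"}))            # (0.75, 0.5)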

In a specialized IRS, there usually exists an automated indexing subsystem, which works before the searches are executed. Given a set of keywords, it adds, using the or operator, other related keywords, based on a hierarchical system of scientific, technical, or business terms. Such a hierarchical system is usually called a thesaurus in the literature on IRSs, and it can be an integral part of the IRS. For instance, given the query Find the documents on conjugación, such a system could add the word morfología to both the query and the set of keywords in the example above, and hence find the requested article in this way.

Thus, a sufficiently sophisticated IRS first enriches the set of keywords given in the query, and then compares this set with the previously enriched sets of keywords attached to each document in the database. Such a comparison is performed according to the criteria mentioned above. After the enrichment, the average recall of the IRS usually increases.
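
A minimal sketch of such enrichment, with a toy thesaurus standing in for the real hierarchical term system:

    # Sketch of thesaurus-based query enrichment; this toy thesaurus stands
    # in for a real hierarchical system of scientific and technical terms.
    thesaurus = {
        "conjugación": {"morfología", "verbos"},
        "sustantivos": {"morfología"},
    }

    def enrich(keywords, thesaurus):
        enriched = set(keywords)
        for word in keywords:
            enriched |= thesaurus.get(word, set())   # added with the "or" operator
        return enriched

    print(enrich({"conjugación"}, thesaurus))
    # e.g. {'conjugación', 'morfología', 'verbos'}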

Recently, systems have been created that can automatically build sets of keywords given just the full text of the document. Such systems do not require the authors of the documents to specifically provide the keywords. Some of the modern Internet search engines are essentially based on this idea.

Three decades ago, the problem of automatic extraction of keywords was called automatic abstracting. The problem is not simple, even when it is solved by purely statistical methods. Indeed, the most frequent words in any business, scientific or technical texts are purely auxiliary, like prepositions or auxiliary verbs. They do not reflect the essence of the text and are not usually taken for abstracting. However, the border between auxiliary and meaningful words cannot be strictly defined. Moreover, there exist many term-forming words like system, device, etc., which can seldom be used for information retrieval because their meaning is too general. Therefore, they are not useful for abstracts.
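
A purely statistical extractor of the kind described can be sketched in a few lines; the stop-word and "too general" lists here are tiny illustrative stand-ins for the real ones:

    # A purely statistical keyword extractor: the most frequent words, with
    # stop-words and overly general terms filtered out. Both word lists here
    # are tiny illustrative stand-ins.
    from collections import Counter
    import re

    STOP_WORDS = {"the", "of", "a", "is", "in", "and", "to", "are", "their"}
    TOO_GENERAL = {"system", "device"}

    def extract_keywords(text, n=5):
        words = re.findall(r"[a-záéíóúñü]+", text.lower())
        counts = Counter(w for w in words
                         if w not in STOP_WORDS and w not in TOO_GENERAL)
        return [w for w, _ in counts.most_common(n)]

    print(extract_keywords(
        "The conjugation system of Spanish verbs: the verbs and their forms"))
    # ['verbs', 'conjugation', 'spanish', 'forms']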

IRSs are now considered an important class of applied software and, specifically, of applied linguistic systems. The period when they used only individual words as keys has passed. Developers now try to use word combinations and phrases, as well as more complicated search strategies. The limiting factor for the more sophisticated techniques turned out to be the same as for grammar and style checkers: the absence of complete grammatical and semantic analysis of the text of the documents. The methods used even in the most sophisticated Internet search engines are not sufficient for accurate information retrieval. This leads to a high level of information noise, i.e., the delivery of irrelevant documents, as well as to frequent missing of relevant ones.

The results of retrieval operations directly depend on the quality and performance of the indexing and comparing subsystems, on the content of the terminological system or thesaurus, and on the other data and knowledge used by the system. Obviously, the main tools and data sets used by an IRS are linguistic in nature.

TOPICAL SUMMARIZATION

In many cases, it is necessary to automatically determine what a given document is about. This information is used to classify documents by their main topics, to deliver documents on a specific subject to users over the Internet, to automatically index the documents in an IRS, to quickly orient people in a large set of documents, and for other purposes.

Such a task can be viewed as a special kind of summarization: to convey the contents of the document in a shorter form. While normal summarization conveys the main ideas of the document, here we consider only the topics mentioned in the document, hence the term topical summarization.

FIGURE III.3. Classifier program determines the main topics of a document.


As an example, let us consider the system Clasitex that automatically determines the main topics of a document. A variant of its implementation, Classifier, was developed in the Center of Computing Research, National Polytechnic Institute at Mexico City [46] (see Figure III.3). It uses two kinds of linguistic information:

First, it neutralizes morphologic variations in order to reduce any word found in the text to its standard (i.e., dictionary) form, e.g., oraciones → oración, regímenes → régimen, lingüísticas → lingüístico, propuesto → proponer.

Second, it puts into action a large dictionary of thesaurus type, which gives, for each word in its standard form, its corresponding position in a pre-defined hierarchy of topics. For example, the word oración belongs to the topic lingüística, which belongs in turn to the topic ciencias sociales, which in its turn belongs to the topic ciencia.

Then the program counts how many times each one of these topics occurred in the document. Roughly speaking, the topic mentioned most frequently is considered the main topic of the document. Actually, the topics in the dictionary have different weights of importance [43, 45], so that the main topic is the one with the greatest total weight in the document.
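
The core of this procedure, reduced to toy data, might look as follows; the real Clasitex dictionary and topic weights are, of course, much larger and more refined:

    # Sketch of Clasitex-style topic counting on toy data; the real system
    # uses a large dictionary and per-topic importance weights [43, 45].
    from collections import Counter

    LEMMAS = {"oraciones": "oración", "lingüísticas": "lingüístico"}
    TOPIC_OF = {"oración": "lingüística", "lingüístico": "lingüística",
                "verbo": "lingüística"}
    PARENT = {"lingüística": "ciencias sociales",
              "ciencias sociales": "ciencia"}

    def main_topics(words):
        counts = Counter()
        for w in words:
            topic = TOPIC_OF.get(LEMMAS.get(w, w))
            while topic is not None:       # an occurrence counts for the topic
                counts[topic] += 1         # and for all of its ancestors
                topic = PARENT.get(topic)
        return counts.most_common()

    print(main_topics(["oraciones", "lingüísticas", "oración"]))
    # [('lingüística', 3), ('ciencias sociales', 3), ('ciencia', 3)]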

Applied linguistics can improve this method in many possible ways. For example, in its current version, Clasitex does not count any pronouns found in the text, since it is not obvious what object a personal pronoun such as él can refer to.

What is more, many Spanish sentences contain zero subjects, i.e., implicit references to some nouns. This becomes obvious in English translation: Hay un libro. Es muy interesante ⇒ There is a book. It is very interesting ⇒ El libro es muy interesante. Thus, each Spanish sentence without an explicit subject implicitly contains an occurrence of the corresponding noun, which is not taken into account by Clasitex, so that the gathered statistics are not completely correct.

Another system, TextAnalyst™, for determining the main topics of a document and the relationships between words in it, was developed by MicroSystems, in Russia (see Figure III.4). This system is not dictionary-based, though it does have a small dictionary of stop-words (prepositions, articles, etc., which should not be processed as meaningful words).

This system reveals the relationships between words. Words are considered related to each other if they co-occurred closely enough in the text, e.g., in the same sentence. The program builds a network of the relationships between words. Figure III.4 shows the most important words found by TextAnalyst in the early draft of this book, and the network of their relationships.

As in Clasitex, the degree of importance of a word, or its weight, is determined in terms of its frequency, and the relationships between words are used to mutually increase the weights. The words closely related to many of the important words in the text are also considered important.
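
The following sketch shows one possible reconstruction of this idea: base weights are word frequencies, and several rounds of propagation let words connected to important words gain weight. The exact algorithm of TextAnalyst is different (a dynamic neural network, as noted below); this is only an illustration of the principle:

    # One possible reconstruction of the weighting idea: word frequency as a
    # base weight, plus several rounds in which words connected to important
    # words gain weight. Illustration only, not the actual TextAnalyst code.
    from collections import Counter, defaultdict
    from itertools import combinations

    def word_weights(sentences, rounds=3, alpha=0.1):
        freq = Counter(w for s in sentences for w in s)
        neighbors = defaultdict(set)
        for s in sentences:
            for a, b in combinations(set(s), 2):   # co-occurrence in a sentence
                neighbors[a].add(b)
                neighbors[b].add(a)
        weights = dict(freq)
        for _ in range(rounds):
            weights = {w: freq[w] + alpha * sum(weights[v] for v in neighbors[w])
                       for w in freq}
        return weights

    sents = [["grammar", "checker"], ["grammar", "parser"], ["checker", "style"]]
    print(word_weights(sents))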

FIGURE III.4. TextAnalyst program reveals the relationships between words.


In TextAnalyst, the list of the important words is used for the following tasks:

Compression of text, by eliminating the sentences or paragraphs that contain the minimal number of important words, until the size of the text reaches the threshold selected by the user (see the sketch after this list),

Building hypertext by constructing mutual references between the most important words and from the important words to others to which they are supposedly related.
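
A sketch of the compression task from the first item of the list, with invented sentences and an invented set of important words:

    # Sketch of compression: drop the sentences containing the fewest
    # important words until the requested size is reached. The sentences
    # and the important-word set are invented.
    def compress(sentences, important, max_sentences):
        scored = sorted(sentences,
                        key=lambda s: sum(w.strip(".,") in important
                                          for w in s.split()),
                        reverse=True)
        keep = set(scored[:max_sentences])
        return [s for s in sentences if s in keep]   # keep the original order

    text = ["Grammar checkers detect errors.",
            "They are popular.",
            "Style checkers use dictionaries."]
    print(compress(text, {"checkers", "dictionaries", "errors"}, 2))
    # ['Grammar checkers detect errors.', 'Style checkers use dictionaries.']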

The TextAnalyst technology is based on a special type of dynamic neural network algorithm. Since the Clasitex program is based on a large dictionary, it is a knowledge-based program, whereas TextAnalyst is not.

AUTOMATIC TRANSLATION

Translation from one natural language to another is a very important task. The amount of business and scientific texts in the world is growing rapidly, and many countries are very productive in scientific and business domains, publishing numerous books and articles in their own languages. With the growth of international contacts and collaboration, the need for translation of legal contracts, technical documentation, instructions, advertisements, and other texts used in the everyday life of millions of people has become a matter of vital importance.

The first programs for automatic, or machine, translation were developed more than 40 years ago. At first, there existed a hope that texts could be translated word by word, so that the only problem would be to create a dictionary of pairs of words: a word in one language and its equivalent in the other. However, that hope died just after the very first experiments.

Then the ambitious goal was formulated to create programs which could understand deeply the meaning of an arbitrary text in the source language, record it in some universal intermediate language, and then reformulate this meaning in the target language with the greatest possible accuracy. It was supposed that neither manual pre-editing of the source text nor manual post-editing of the target text would be necessary. This goal proved to be tremendously difficult to achieve, and has still not been satisfactorily accomplished in any but the narrowest special cases.

At present there is a lot of translation software, ranging from very large international projects being developed by several institutes or even several corporations in close cooperation, to simple automatic dictionaries, and from laboratory experiments to commercial products. However, the quality of the translations, even for large systems developed by the best scientists, is usually conspicuously lower than the quality of manual human translation.

As for commercial translation software, the quality of translation it generates is still rather low. A commercial translator can be used to allow people quite unfamiliar with the original language of the document to understand its main idea. Such programs can help in manual translation of texts. However, post-editing of the results, to bring them to the degree of quality sufficient for publication, often takes more time than just manual translation made by a person who knows both languages well enough.[4] Commercial translators are quite good for the texts of very specific, narrow genres, such as weather reports. They are also acceptable for translation of legal contracts, at least for their formal parts, but the paragraphs specifying the very subject of the contract may be somewhat distorted.

To give the reader an impression of what kind of errors a translation program can make, it is enough to mention a well-known example of a mistranslation performed by one of the earliest systems in the 1960s. It translated the Biblical text The spirit is willing, but the flesh is weak (Matt. 26:41) into Russian and then back into English. The resulting English sentence was The vodka is strong, but the meat is rotten [34]. Even today, audiences at lectures on automatic translation are entertained by similar examples from modern translation systems.

Two other examples are from our own experience with the popular commercial translation package PowerTranslator by Globalink, one of the best on the market. The header of an English document, Plans, is translated into Spanish as the verb Planifica, while the correct translation is the Spanish noun Planes (see Figure III.5). The Spanish phrase el papel de Francia en la guerra is translated as the paper of France in the war, while the correct translation is the role of France in the war. There are thousands of such examples; nearly any automatically translated document is full of them and has to be re-edited.

FIGURE III.5. One of the commercial translators.


Actually, the quality of translation made by a given program is not the same in the two directions, say, from English to Spanish and from Spanish to English. Since automatic analysis of a text is usually a more difficult task than generation of a text, translation from a language that has been studied and described better is generally of higher quality than translation into that language. Thus, the elaboration of Spanish grammars and dictionaries can improve the quality of translation from Spanish into English.

One difficult problem in automatic translation is word sense disambiguation. In any bilingual dictionary, for many source words dozens of words in the target language are listed as translations, e.g., for the simple Spanish word gato: cat, moneybag, jack, sneak thief, trigger, outdoor market, hot-water bottle, blunder, etc. Which one should the program choose in each specific case? This problem has proven extremely difficult to solve. Deep linguistic analysis of the given text is necessary to make the correct choice, on the basis of the meaning of the surrounding words, the text as a whole, and perhaps some extralinguistic information [42].
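
A naive illustration of context-based sense choice: pick the translation whose typical "clue" words overlap most with the words surrounding the occurrence. The clue lists below are invented; real disambiguation requires far deeper analysis:

    # Naive sketch of context-based choice among translations of "gato";
    # the clue lists are invented for illustration.
    CLUES = {
        "cat":  {"maúlla", "ratón", "pelo", "animal"},
        "jack": {"coche", "rueda", "levantar", "neumático"},
    }

    def choose_translation(context_words, clues):
        scores = {t: len(c & set(context_words)) for t, c in clues.items()}
        return max(scores, key=scores.get)

    # "... levantar el coche con el gato ..."
    print(choose_translation(["levantar", "el", "coche", "con", "el"], CLUES))
    # 'jack'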

Another, often more difficult problem of automatic translation is restoring the information that is contained in the source text implicitly, but which must be expressed explicitly in the target text. For example, given the Spanish text José le dio a María un libro. Es interesante, which translation of the second sentence is correct: He is interesting, or She is interesting, or It is interesting, or This is interesting? Given the English phrase computer shop, which Spanish translation is correct: tienda de computadora or tienda de computadoras? Compare this with computer memory. Is they are beautiful translated as son hermosos or son hermosas? Is as you wish translated as como quiere, como quieres, como quieren, or como queréis?[5] Again, deep linguistic analysis and knowledge, rather than simple word-by-word translation, is necessary to solve such problems.

Great effort is devoted worldwide to improving the quality of translation. As an example of successful research, the results of the Translation group of the Information Sciences Institute at the University of Southern California can be mentioned [53]. This research is based on the use of statistical techniques for lexical ambiguity resolution.

Another successful team working on automatic translation is that headed by Yu. Apresian in Russia [34]. Their research is conducted in the framework of the Meaning ⇔ Text model.




