PART 1
Applicative and Scientific Context
1
Leveraging Comparable Corpora for Computer-assisted Translation
1.1. Introduction
This chapter starts with a historical overview of computer-assisted translation (section 1.2): we retrace the beginnings of machine translation and explain how computer-assisted translation has developed so far, up to the recent emergence of comparable-corpus leveraging. Section 1.3 presents the current techniques for extracting bilingual lexicons from comparable corpora; we give an overview of their typical performance and discuss their limitations. Section 1.4 describes the prototyping of a computer-assisted translation (CAT) tool designed for comparable corpora and based on the techniques described in section 1.3.
1.2. From the beginnings of machine translation to comparable corpora processing
1.2.1. The dawn of machine translation
From its beginnings, scientific research in computer science has sought to use machines to accelerate or replace human translation. According to [HUT 05], the first research in machine translation was carried out in the United States between 1949 and 1966. Here, machine translation (MT) refers to the translation of a text by a machine without any human intervention. By 1966, several research groups had been created, and two types of approach could be identified:
– On the one hand, there were pragmatic approaches, combining statistical information with trial-and-error development methods, whose goal was to create an operational system as quickly as possible (University of Washington, RAND Corporation and Georgetown University). This research applied the direct translation method and gave rise to the first generation of machine translation systems.
– On the other hand, theoretical approaches emerged, grounded in fundamental linguistics and oriented toward long-term research (MIT, Cambridge Language Research Unit). These projects produced the first versions of interlingual systems.
In 1966, a report from the Automatic Language Processing Advisory Committee [ALP 66], which assessed machine translation purely on the basis of the American government’s needs – i.e. the translation of Russian scientific documents – announced that, after several years of research, it was not possible to obtain a translation entirely carried out by a computer and of human quality. Only post-editing would make it possible to reach a good quality of translation. Yet the value of post-editing is not self-evident. A study mentioned in the report’s appendix points out that “most translators found postediting tedious and even frustrating”, but many found “the output served as an aid... particularly with regard to technical terms” [HUT 96].
Although the study does not allow us to come to a conclusion on the value of post-editing compared to fully manual translation (out of 22 translators, eight found post-editing easier, eight found it harder and six were undecided), the report mostly highlights the negative aspects, quoting one of the translators:
I found that I spent at least as much time in editing as if I had carried out the entire translation from the start. Even at that, I doubt if the edited translation reads as smoothly as one which I would have started from scratch. [HUT 96]
The report quotes remarks made by V. Yngve – the head of the machine translation research project at MIT – who claimed that MT “serves no useful purpose without postediting, and that with postediting the over-all process is slow and probably uneconomical” [HUT 96].
The report concluded that machine translation research was essential from the point of view of scientific progress, but of limited interest from an economic point of view. Funding was consequently cut in the United States. Research nevertheless carried on in Europe (the EUROTRA research project) and in Canada; it gave rise, for example, to the TAUM system (translation of weather reports from English to French) and to the SYSTRAN translation software.
1.2.2. The development of computer-assisted translation
While it signaled the end of public funding for machine translation research in the United States, the ALPAC report encouraged the pursuit of a more realistic goal: computer-assisted translation. The report praised the glossaries produced by the German army’s translation agency as well as the terminology base of the European Coal and Steel Community – a resource which prefigured EURODICAUTOM and IATE – and came to the conclusion that these resources were a real help to translation. The final recommendations clearly encouraged the development of CAT, especially the leveraging of glossaries initially created for machine translation.
At that point, a whole range of tools intended to help translators rather than replace them started to be developed. The first terminology management programs appeared in the 1960s [HUT 05] and evolved into multilingual terminology databases such as TERMIUM or UNTERM. Bilingual concordancers are also of invaluable help: they allow the translator to access a word’s or term’s contexts and to compare the translations of these contexts in the target language. According to [SOM 05], the rise of computer-assisted translation happened in the 1970s with the creation of translation memory software, which allows the translator to recycle past translations: when a new sentence has to be translated, the software scans the memory for similar previously translated sentences and, when it finds any, suggests the previous translation as a model. The time saved is all the greater when the texts are repetitive, which is often the case for certain specialized documents such as technical manuals.
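The memory lookup just described can be sketched as a fuzzy match over stored source sentences. The following is a minimal illustration only, not a description of any commercial product: the `TranslationMemory` class, the example sentence pairs and the 0.6 similarity threshold are all assumptions made for this sketch, and the similarity measure (difflib’s character-based ratio) stands in for the more elaborate fuzzy-match scores real systems use.

```python
from difflib import SequenceMatcher

class TranslationMemory:
    """Toy translation memory: stores (source, target) sentence pairs
    and suggests the target of the most similar stored source."""

    def __init__(self, threshold=0.6):
        self.pairs = []          # list of (source, target) sentences
        self.threshold = threshold

    def add(self, source, target):
        self.pairs.append((source, target))

    def suggest(self, sentence):
        """Return (best_target, score) for the closest stored source
        sentence, or None if no match reaches the threshold."""
        best, best_score = None, 0.0
        for source, target in self.pairs:
            score = SequenceMatcher(None, sentence.lower(),
                                    source.lower()).ratio()
            if score > best_score:
                best, best_score = target, score
        if best_score >= self.threshold:
            return best, best_score
        return None

# Hypothetical English-French memory entries for illustration.
tm = TranslationMemory()
tm.add("Press the power button to start the device.",
       "Appuyez sur le bouton d'alimentation pour démarrer l'appareil.")
tm.add("Remove the battery before cleaning.",
       "Retirez la batterie avant le nettoyage.")

# A near-duplicate of the first stored sentence retrieves its translation.
match = tm.suggest("Press the power button to restart the device.")
```

Because the query differs from a stored sentence by a single word, the match score is high and the stored translation is proposed; this is why the gain is greatest on repetitive texts such as technical manuals.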
These sets of translated documents make up what we call parallel corpora [VER 00], and their leveraging intensified in the 1980s, allowing for a resurgence of machine translation. While rule-based translation systems had dominated the field until then, access to large databases of translation examples furthered the development of data-driven systems. The two paradigms arising from this turnaround are example-based machine translation [NAG 84] and statistical machine translation [BRO 90], the latter remaining the dominant trend today. The quality of machine translation keeps improving: today it generates usable results in specialized fields, in which vocabulary and structures are rather repetitive. General-language texts remain the last stronghold: there, machine translation offers, at best, an aid to understanding.
During the 1990s, CAT benefited from the combined contributions of machine translation and computational terminology [BOU 94, DAI 94a, ENG 95, JAC 96]. It was at that point that term alignment algorithms based on parallel corpora appeared [DAI 94b, MEL 99, GAU 00]. The bilingual terminology lists they generate are particularly useful for specialized translation.
Automatic terminology extraction and management, bilingual concordance services, pre-translation and translation memories, understanding aids: today, the translator’s workstation is a complex and highly digital environment. The language technology industry has proliferated, producing many pieces of CAT software: TRADOS, WORDFAST, DÉJÀ VU and SIMILIS, to name just a few. The general public is also catered for: on the one hand, Google has widened access to immediate translation for anyone with its GOOGLE TRANSLATE tool; on the other hand, open-access bilingual concordance services have recently appeared on the Internet (BAB.LA, LINGUEE) and quickly become popular – for example, LINGUEE reached 600,000 requests a day for its English–German version in 2008, a year after its creation [PER 10].
1.2.3. Drawbacks of parallel corpora and advantages of comparable corpora
While they are useful, these technologies have a major drawback: they require the existence of a translation history. What about under-resourced languages or emerging specialty fields? A possible solution is then to use what we refer to as comparable corpora.
There exist several definitions of comparable corpora. At one end of the spectrum lies the very narrow definition given by [MCE 07] within the framework of translation studies research. According to these authors, a comparable corpus contains texts in two or more languages that have been gathered according to the same genre, field and sampling-period criteria. Moreover, the corpora must be balanced: “comparable corpus can be defined as a corpus containing components that are collected using the same sampling frame and similar balance and representativeness (...