PART 1
Applicative and Scientific Context
1
Leveraging Comparable Corpora for Computer-assisted Translation
1.1. Introduction
This chapter starts with a historical overview of computer-assisted translation (section 1.2): we retrace the beginnings of machine translation and explain how computer-assisted translation has developed so far, up to the recent emergence of comparable-corpus leveraging. Section 1.3 presents the current techniques for extracting bilingual lexicons from comparable corpora; we give an overview of their typical performance and discuss their limitations. Section 1.4 describes the prototyping of a computer-assisted translation (CAT) tool designed for comparable corpora and based on the techniques described in section 1.3.
1.2. From the beginnings of machine translation to comparable corpora processing
1.2.1. The dawn of machine translation
From its beginnings, scientific research in computer science has sought to use machines to accelerate or replace human translation. According to [HUT 05], the first research in machine translation was carried out in the United States between 1949 and 1966. Here, machine translation (MT) refers to the translation of a text by a machine without any human intervention. By 1966, several research groups had been created, and two types of approach could be identified:
– On the one hand, there were pragmatic approaches, combining statistical information with trial-and-error development methods, whose goal was to create an operational system as quickly as possible (University of Washington, RAND Corporation and Georgetown University). This research applied the direct translation method and gave rise to the first generation of machine translation systems.
– On the other hand, theoretical approaches emerged, grounded in fundamental linguistics and oriented toward long-term research (MIT, Cambridge Language Research Unit). These projects produced the first versions of interlingual systems.
In 1966, a report from the Automatic Language Processing Advisory Committee [ALP 66], which assessed machine translation purely on the basis of the American government’s needs – i.e. the translation of Russian scientific documents – announced that, after several years of research, it was not possible to obtain a translation entirely carried out by a computer and of human quality. Only post-editing would make it possible to reach a good quality of translation. Yet the value of post-editing is not self-evident. A study mentioned in the report’s appendix points out that “most translators found postediting tedious and even frustrating”, but many found “the output served as an aid... particularly with regard to technical terms” [HUT 96].
Although the study does not allow us to come to a conclusion on the value of post-editing compared to fully manual translation (out of 22 translators, eight found post-editing easier, eight found it harder and six were undecided), the report mostly highlights the negative aspects, quoting one of the translators:
I found that I spent at least as much time in editing as if I had carried out the entire translation from the start. Even at that, I doubt if the edited translation reads as smoothly as one which I would have started from scratch. [HUT 96]
The report quotes remarks made by V. Yngve – the head of the machine translation research project at MIT – who claimed that MT “serves no useful purpose without postediting, and that with postediting the over-all process is slow and probably uneconomical” [HUT 96].
The report concluded that machine translation research was essential from the point of view of scientific progress, but of limited interest from an economic point of view. Funding was consequently cut in the United States. Research nevertheless carried on in Europe (the EUROTRA research project) and in Canada; it gave rise, for example, to the TAUM system (translation of weather reports from English to French) and to the SYSTRAN translation software.
1.2.2. The development of computer-assisted translation
While it signaled the end of public funding for machine translation research in the United States, the ALPAC report encouraged the pursuit of a more realistic goal: computer-assisted translation. The report praised the glossaries produced by the German army’s translation agency as well as the terminology base of the European Coal and Steel Community – a resource which prefigured EURODICAUTOM and IATE – and came to the conclusion that these resources were a real help to translation. The final recommendations clearly encouraged the development of CAT, especially the leveraging of glossaries initially created for machine translation.
At that point, a whole range of tools intended to help translators rather than replace them started to be developed. The first terminology management programs appeared in the 1960s [HUT 05] and evolved into multilingual terminology databases such as TERMIUM or UNTERM. Bilingual concordancers are also of invaluable help: they allow the translator to access a word’s or term’s contexts and to compare the translations of these contexts in the target language. According to [SOM 05], the rise of computer-assisted translation happened in the 1970s with the creation of translation memory software, which allows the translator to recycle past translations: when a new sentence has to be translated, the software scans the memory for similar previously translated sentences and, when it finds any, suggests the previous translation as a model. The time saved is all the greater when the texts are repetitive, which is often the case for certain specialized documents such as technical manuals.
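The memory lookup just described can be sketched as a fuzzy match over stored source sentences. The following is a minimal illustration only, not a description of any commercial product: the `TranslationMemory` class, the example sentence pairs and the 0.6 similarity threshold are all assumptions made for this sketch, and the similarity measure (difflib’s character-based ratio) stands in for the more elaborate fuzzy-match scores real systems use.

```python
from difflib import SequenceMatcher

class TranslationMemory:
    """Toy translation memory: stores (source, target) sentence pairs
    and suggests the target of the most similar stored source."""

    def __init__(self, threshold=0.6):
        self.pairs = []          # list of (source, target) sentences
        self.threshold = threshold

    def add(self, source, target):
        self.pairs.append((source, target))

    def suggest(self, sentence):
        """Return (best_target, score) for the closest stored source
        sentence, or None if no match reaches the threshold."""
        best, best_score = None, 0.0
        for source, target in self.pairs:
            score = SequenceMatcher(None, sentence.lower(),
                                    source.lower()).ratio()
            if score > best_score:
                best, best_score = target, score
        if best_score >= self.threshold:
            return best, best_score
        return None

# Hypothetical English-French memory entries for illustration.
tm = TranslationMemory()
tm.add("Press the power button to start the device.",
       "Appuyez sur le bouton d'alimentation pour démarrer l'appareil.")
tm.add("Remove the battery before cleaning.",
       "Retirez la batterie avant le nettoyage.")

# A near-duplicate of the first stored sentence retrieves its translation.
match = tm.suggest("Press the power button to restart the device.")
```

Because the query differs from a stored sentence by a single word, the match score is high and the stored translation is proposed; this is why the gain is greatest on repetitive texts such as technical manuals.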
These sets of translated documents make up what we call parallel corpora [VER 00], and their leveraging intensified in the 1980s, allowing for a resurgence of machine translation. While rule-based translation systems had dominated the field until then, access to large databases of translation examples furthered the development of data-driven systems. The two paradigms arising from this turnaround are example-based machine translation [NAG 84] and statistical machine translation [BRO 90], the latter remaining the dominant trend today. The quality of machine translation keeps improving: today it generates usable results in specialized fields, in which vocabulary and structures are rather repetitive. General-language texts remain the last stronghold: there, machine translation offers, at best, an aid to understanding.
During the 1990s, CAT benefited from the combined contributions of machine translation and computational terminology [BOU 94, DAI 94a, ENG 95, JAC 96]. It was at that point that term alignment algorithms based on parallel corpora appeared [DAI 94b, MEL 99, GAU 00]. The bilingual terminology lists they generate are particularly useful for specialized translation.
Automatic terminology extraction and management, bilingual concordance services, pre-translation and translation memories, understanding aids: today, the translator’s workstation is a complex and highly digital environment. The language technology industry has proliferated, producing many pieces of CAT software: TRADOS, WORDFAST, DÉJÀ VU and SIMILIS, to name just a few. The general public is also catered for: on the one hand, Google has widened access to immediate translation for anyone with its GOOGLE TRANSLATE tool; on the other hand, open-access bilingual concordance services have recently appeared on the Internet (BAB.LA, LINGUEE) and quickly become popular – for example, LINGUEE reached 600,000 requests a day for its English–German version in 2008, a year after its creation [PER 10].
1.2.3. Drawbacks of parallel corpora and advantages of comparable corpora
While they are useful, these technologies have a major drawback: they require the existence of a translation history. What about under-resourced languages or emerging specialty fields? A possible solution is then to use what we refer to as comparable corpora.
There exist several definitions of comparable corpora. At one end of the spectrum lies the very narrow definition given by [MCE 07] within the framework of translation studies research. According to these authors, a comparable corpus contains texts in two or more languages that have been gathered according to the same genre, field and sampling-period criteria. Moreover, the corpora must be balanced: “comparable corpus can be defined as a corpus containing components that are collected using the same sampling frame and similar balance and representativeness (...