eBook - ePub

Machine Learning in Translation Corpora Processing

Name: Machine Learning in Translation Corpora Processing
Author: Krzysztof Wolk

264 pages
English

About This Book

This book reviews ways to improve statistical machine speech translation between Polish and English. Research has been conducted mostly on dictionary-based, rule-based, and syntax-based machine translation techniques. Most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation, and both parallel and monolingual language resources are scarce. The main objective of this volume is to develop an automatic and robust Polish-to-English translation system to meet specific translation requirements and to develop bilingual textual resources by mining comparable corpora.

Information

Publisher: CRC Press
Year: 2019
ISBN: 9780429588839
Topic: Computer Science
Subtopic: Computer Science General

Introduction

The aim of this monograph is to develop, implement, and adapt methods of statistical machine translation (SMT) [85] to Polish-English speech translation requirements. During a conversation, real-time speech translation would allow the utterances to be immediately translated and read aloud in another language, provided that the system is connected with text-to-speech (TTS) [151] and automatic speech recognition (ASR) [151] systems. Speech translation enables speakers of different languages to communicate freely. It also has great importance in the field of science, as well as in intercultural and global data exchange. From a business point of view, such speech translation systems could be applied as an aid to simultaneous interpreting, as well as in the field of respeaking.

Another aspect of the study is the preparation of parallel and comparable corpora and language models enhanced by pre- and post-processing of the data using morphosyntactic analysis. Such analysis enables the changing of words into their basic forms (to reduce the vocabulary) and the standardizing of the natural order of a sentence (especially to Subject Verb Object [SVO] word order [150], in which the subject is placed before the predicate and the object is located at the end). Training a factored translation model enriched with part of speech (POS) tags [85] was also addressed.
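The two preprocessing ideas above, reducing words to base forms and attaching POS tags as factors, can be illustrated with a minimal sketch. The lexicon, its tags, and the function names below are hypothetical stand-ins for a real morphosyntactic analyzer; the `surface|lemma|POS` layout follows the pipe-separated convention that Moses uses for factored input.

```python
# Toy lexicon: surface form -> (lemma, POS tag).
# All entries and tag names are hypothetical; a real system would obtain
# them from a morphosyntactic analyzer rather than a hand-written table.
LEXICON = {
    "kupiłem": ("kupić", "VERB"),
    "sobie": ("siebie", "PRON"),
    "nowy": ("nowy", "ADJ"),
    "samochód": ("samochód", "NOUN"),
}


def lookup(token, lexicon):
    """Return (lemma, POS) for a token, falling back to the lowercased token."""
    return lexicon.get(token.lower(), (token.lower(), "UNK"))


def lemmatize(sentence, lexicon=LEXICON):
    """Replace every token with its base form to shrink the vocabulary."""
    return " ".join(lookup(tok, lexicon)[0] for tok in sentence.split())


def to_factored(sentence, lexicon=LEXICON):
    """Rewrite each token as surface|lemma|POS (pipe-separated factors)."""
    factored = []
    for tok in sentence.split():
        lemma, pos = lookup(tok, lexicon)
        factored.append(f"{tok}|{lemma}|{pos}")
    return " ".join(factored)


if __name__ == "__main__":
    print(lemmatize("Nowy samochód sobie kupiłem"))
    print(to_factored("Kupiłem sobie nowy samochód"))
```

Different word orderings of the same Polish sentence map to the same small set of lemmas, which is the vocabulary reduction the paragraph describes.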

This study improves SMT quality through the processing and filtering of parallel corpora and through the extraction of additional data from the resulting comparable corpora. In order to enrich the language resources of SMT systems, various adaptation and interpolation techniques were applied to the prepared data. Experiments were conducted using spoken data from a specific domain (European Parliament proceedings and written medical texts [35, 36]), from a wide domain (TED lectures on various topics [15]) and from human speech (on the basis of movie and TV series dialogs [14]).

Evaluation of SMT systems was performed on random samples of parallel data using automated algorithms to evaluate the quality [149] and potential usability of the SMT systems’ output.
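As an illustration of what such an automated metric computes, the sketch below implements a simplified corpus-level BLEU with a single reference per sentence. It is an assumption that BLEU stands in for the metrics meant here, and the function names are illustrative; published scores (such as the BLEU of 41 cited later in this chapter) come from standard tooling and are reported on a 0-100 scale, whereas this sketch returns a value between 0 and 1.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified BLEU: clipped n-gram precision plus a brevity penalty.

    Both arguments are lists of token lists, one reference per hypothesis.
    No smoothing is applied, so any n-gram order without a match gives 0.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    ref = "he is going on vacation tomorrow".split()
    hyp = "he is going on vacation".split()
    print(round(corpus_bleu([hyp], [ref]), 4))
```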

The experiments used the Moses Statistical Machine Translation Toolkit [8], together with related tools and custom implementations of processing scripts for the Polish language.

Moreover, the multi-threaded implementation of the GIZA++ [148] tool was employed in order to train models on parallel data and to perform symmetrization at the phrase level. The SMT system was tuned using the Minimum Error Rate Training (MERT) tool [45], which, through parallel data, specifies the optimum weights for the trained models, improving the resulting translations. The statistical language models from single-language data were trained and smoothed using the SRI Language Modeling (SRILM) toolkit [9]. In addition, data from outside the thematic domain was adapted. In the case of parallel models, Modified Moore-Lewis filtering (MML) [60] was used, while single-language models were linearly interpolated. In order to enrich the training data, morphosyntactic processing tools were employed. (The morphosyntactic toolchain created by The WrocUT Language Technology Group and the PSI-Toolkit created by Adam Mickiewicz University [59] were used.)
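The idea behind Modified Moore-Lewis filtering can be sketched as follows: each sentence pair from a general pool is scored by the cross-entropy difference between an in-domain and a general language model, summed over the source and target sides, and the lowest-scoring pairs are kept. This is a minimal sketch under stated assumptions: add-one smoothed unigram models replace the n-gram models a toolkit such as SRILM would provide, and all sentences, names, and the `keep` cut-off are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model; a toy stand-in for a real n-gram LM."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def cross_entropy(self, sentence):
        """Per-token cross-entropy of the sentence, in bits."""
        tokens = sentence.split()
        log_prob = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return -log_prob / max(len(tokens), 1)


def build_lms(in_domain, general):
    """Train an (in-domain, general) model pair over a shared vocabulary."""
    vocab = {tok for s in in_domain + general for tok in s.split()}
    return UnigramLM(in_domain, vocab), UnigramLM(general, vocab)


def mml_score(pair, src_lms, tgt_lms):
    """Bilingual cross-entropy difference; lower means closer to the domain."""
    src, tgt = pair
    (src_in, src_out), (tgt_in, tgt_out) = src_lms, tgt_lms
    return (src_in.cross_entropy(src) - src_out.cross_entropy(src)
            + tgt_in.cross_entropy(tgt) - tgt_out.cross_entropy(tgt))


def select_pairs(pairs, src_lms, tgt_lms, keep):
    """Keep the `keep` sentence pairs that look most in-domain."""
    return sorted(pairs, key=lambda p: mml_score(p, src_lms, tgt_lms))[:keep]


# Hypothetical toy corpora: a small medical domain and a general pool.
IN_PL = ["pacjent bierze tabletkę", "lekarz bada pacjenta"]
GEN_PL = ["film był bardzo długi", "pacjent czeka", "obejrzeliśmy film"]
IN_EN = ["the patient takes the tablet", "the doctor examines the patient"]
GEN_EN = ["the film was very long", "the patient waits", "we watched the film"]

SRC_LMS = build_lms(IN_PL, GEN_PL)
TGT_LMS = build_lms(IN_EN, GEN_EN)
PAIRS = [
    ("obejrzeliśmy film", "we watched the film"),
    ("pacjent bierze tabletkę", "the patient takes the tablet"),
]

if __name__ == "__main__":
    print(select_pairs(PAIRS, SRC_LMS, TGT_LMS, keep=1))
```

Scoring both language sides, rather than the source side alone, is what distinguishes the modified variant from the original monolingual Moore-Lewis criterion.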

Lastly, the author developed and implemented a method inspired by Yalign [19], a parallel data-mining tool. Its speed was increased by reimplementing Yalign’s algorithms in a multi-threaded manner and by employing graphics processing unit (GPU) computing power for the calculations. The mining quality was improved by using the Needleman-Wunsch algorithm [20] for sequence comparison and by developing a tuning script that adjusts mining parameters to specific domain requirements.
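For readers unfamiliar with Needleman-Wunsch, the following is a minimal sketch of the textbook global alignment algorithm applied to token sequences. It is not the author's implementation: the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, and how the algorithm is wired into the mining pipeline is not shown here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (score, aligned pairs).

    A gap is represented by None on the side that has no counterpart.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per item.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the dynamic-programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[-1][-1], pairs


if __name__ == "__main__":
    left = "he is going on vacation tomorrow".split()
    right = "he is going on holiday".split()
    total, alignment = needleman_wunsch(left, right)
    print(total)
    for pair in alignment:
        print(pair)
```

Because the alignment is global, every element of both sequences appears in the result, either paired with a counterpart or paired with a gap.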

1.1 Background and context

Polish is one of the more complex West-Slavic languages, in part because its grammar has complicated rules and elements, but also because of its large vocabulary, which is much larger than that of English. This complexity greatly affects the data and data structures required for statistical models of translation. The lack of available and appropriate resources required as input to SMT systems presents another problem. SMT systems work best in specific, reasonably narrow text domains and do not perform well in general-purpose use. High-quality parallel data, especially in a required domain, is scarce. All these differences, and the fact that Polish and the West-Slavic group have been, to some extent, neglected in this field of research, make PL-EN translation an interesting topic of research as far as translation and additional resourcing are concerned. PL-EN results should also be repeatable for other languages in the West-Slavic group.

In general, Polish and English also differ in syntax and grammar. English is a positional language, which means that the syntactic order (the order of words in a sentence) plays a very important role, particularly due to the limited inflection of words (e.g., lack of declension endings). Sometimes, the position of a word in a sentence is the only indicator of the sentence’s meaning. In a Polish sentence, a thought can be expressed using several different word orderings, which is not possible in English. For example, the sentence “I bought myself a new car” can be written in Polish as “Kupiłem sobie nowy samochód”, or “Nowy samochód sobie kupiłem”, or “Sobie kupiłem nowy samochód”, or “Samochód nowy sobie kupiłem.” The only exception is when the subject and the object are in the same clause and the context is the only indication of which is the object and which is the subject. For example, “Pies liże kość (A dog is licking a bone)” and “Kość liże pies (A bone is licking a dog).”

Differences in potential sentence word order make the translation process more complex, especially when using a phrase-model with no additional lexical information [18]. Furthermore, in Polish it is not necessary to use an explicit subject pronoun (the operator), because the Polish form of a verb always contains information about the subject of a sentence. For example, the sentence “On jutro jedzie na wakacje” is equivalent to the Polish “Jutro jedzie na wakacje” and would be translated as “He is going on vacation tomorrow” [160].

In the Polish language, the plural is not formed by adding the letter “s” as a suffix to a word; rather, each word has its own plural variant (e.g., “pies-psy”, “artysta-artyści”, etc.). Additionally, articles that precede nouns in English, such as “a”, “an”, and “the”, do not exist in Polish (e.g., “a cat-kot”, “an apple-jabłko”, etc.) [160].

The Polish language has only three tenses (present, past, and future). However, it must be noted that the only indication of whether an action has ended is aspect. For example, “Robiłem pranie” would be translated as “I have been doing laundry”, but “Zrobiłem pranie” as “I have done laundry”, or “płakać-wypłakać” as “cry-cry out” [160].

The gender of a noun in English does not have any effect on the form of a verb, but it does in Polish. For example, “Zrobił to. – He has done it”, “Zrobiła to. – She has done it”, “lekarz/lekarka – doctor”, “uczeń/uczennica – student”, etc. [160].

As a result of this complexity, progress in the development of SMT systems for Polish has been substantially slower than for other languages. On the other hand, excellent translation systems have been developed for many popular languages. Because of the similar language structure, the Czech language is, to some extent, comparable to Polish. In [161], the authors present a phrase-based machine translation system that outperformed any previous systems for Wall Street Journal text translation. The authors were able to achieve BLEU scores as high as 41 by applying various techniques for improving machine translation quality. The authors adapted an alignment symmetrization method to the needs of the Czech language and used the lemmatized forms of words. They proposed a post-processing step for correct translation of numbers and enhanced parallel corpora quality and quantity.

In [162], the author prepared a factored, phrase-based translation system from Czech to English and an experimental system using tree-based transfer at a deep syntactic layer. In [163], the authors presented another factored translation model that uses multiple factors taken from annotation of input and output tokens. Experiments were conducted on news texts that were previously lemmatized and tagged. A similar system, in terms of text domain and main ideas, is also presented for Russian [164].

Some interesting translation systems between Czech and languages other than English (German, French, Spanish) were recently presented in [165]. The specific properties of these languages were described and exploited in order to avoid language-specific errors. In addition, an experiment was conducted on translation using English as a pivot language, because of the great difference in parallel data.

A comparison of how the relatedness of two languages influences the performance of statistical machine translation is described in [166]. The comparison was made between the Czech, English, and Russian languages. The authors proved that translation between related languages provides better translations. They also concluded that, when dealing with closely-related languages, machine translation is improved if systems are enriched with morphological tags, especially for morphologically-rich languages [166].

Such systems are urgently required for many purposes, including web, medical text and international translation services, for example, for the error-free, real-time translation of European Parliament Proceedings (EUP) [36].

1.1.1 The concept of cohesion

Cohesion refers to non-structural resources, such as grammatical and lexical relationships in discourse. Cohesion is increased by the ties that link a text together and make it meaningful, for which the purpose and quality characteristics are not usually known or seen as relevant.

It is difficult to avoid using human subjects in evaluations of assimilative translation [95], in which the goal is to provide a translation that is good enough to enable a user with little knowledge of the source language to gain a correct understanding of the contents of the source text. In [95], Weiss and Ahrenberg present two methods to deal with aspect assignment in a prototype of a knowledge-based English-Polish machine translation (MT) system. Although there is no agreement among linguists as to its precise definition (e.g., Dowty [96]), aspect is a result of the complex interplay of semantics, tense, mood and pragmatics. It strongly affects the overall understanding of the text. In English, aspect is usually not explicitly indicated on a verb. On the other hand, in Polish it is overtly manifested and incorporated into verb morphology. This difference between the two languages makes English-to-Polish translation particularly difficult, as it requires contextual and semantic analysis of the English input to derive an aspect value for the Polish output [97].

1.2 Machine translation (MT)

Next, the history, approach, applications, and research trends of MT systems will be discussed.

1.2.1 History of statistical machine translation (SMT)

In recent years, SMT has been adopted as the main method employed for machine translation between languages. The main reasons for the use of SMT are its accuracy and the fact that it does not require manually-constructed rules [152]. A number of translation engines have been developed for general translation purposes, with perhaps the best known being Google Translate, which is available on the Internet for non-commercial purposes [133].

The enormous demand for translation services for different languages in science and technology has nearly always exceeded the capacity of translation professionals, i.e., actual humans. This work was driven, in part, by the universal availability of the Internet, where a user can access content in virtually any language. It was impossible for the huge demand for instant translation to be met by human translators on such a massive scale. Thus, much effort was expended in order to develop MT software products specifically for translating web pages, textbooks, European Parliament proceedings, and something as mundane, but still very useful, as email messages.

During the 1980s, the forerunner of immediate online translation services was the rule-based Systran system, which originated in France and was limited to the Minitel network [134]. By the mid-1990s, many MT providers were offering Internet-based, on-the-spot translation services. In subsequent years, machine translation technology has advanced immeasurably, with Google Translate supporting fifty-seven languages [135]. However, the overall translation quality of online MT services is frequently poor when it comes to languages other than English, French, German and Spanish. On the other hand, improvements are continuously being made. Google and other developers offer an automatic translation browsing tool, which can translate a website into a language of your choice [134]. While these services provide acceptable, immediate, “rough” translations of content into the user’s own language, a particular challenge for MT is the online translation of impure language that is colloquial, incoherent, not grammatically correct, full o...

Cover
Title Page
Copyright Page
Table of Contents
Acknowledgements
Preface
Abbreviations and Definitions
Overview
1. Introduction
2. Statistical Machine Translation and Comparable Corpora
3. State of the Art
4. Author’s Solutions to PL-EN Corpora Processing Problems
5. Results and Conclusions
6. Final Conclusions
References
Index