Corpus Linguistics for Translation and Contrastive Studies

Name: Corpus Linguistics for Translation and Contrastive Studies
ISBN: 9781317229384

A guide for research

Mikhail Mikhailov, Robert Cooper

234 pages
English

About this book

Corpus Linguistics for Translation and Contrastive Studies provides a clear and practical introduction to using corpora in these fields. Giving special attention to parallel corpora, which are collections of texts in two or more languages, and demonstrating the potential benefits for multilingual corpus linguistics research to both translators and researchers, this book:

explores the different types of parallel corpora available, and shows how to use basic and advanced search procedures to analyse them;
explains how to compile a parallel corpus, and discusses the uses of such corpora for translation purposes and for researching linguistic phenomena across languages;
demonstrates the use of corpus extracts across a wide range of texts, including dictionaries, novels by authors including Jane Austen and Mikhail Bulgakov, and newspapers such as The Sunday Times;
is illustrated with case studies from a range of languages including Finnish, Russian, English and French.

Written by two experienced researchers and practitioners, Corpus Linguistics for Translation and Contrastive Studies is essential reading for postgraduate students and researchers working within the area of translation and contrastive studies.

Publisher

Routledge

Year

2016

Topic

Languages & Linguistics

eBook ISBN

9781317229384

Subtopic

Linguistics

Chapter 1
Parallel text corpora

A general overview

Nowadays, most linguistic research is based on electronic data. Whether in the field of theoretical linguistic research or in the compilation of grammars and dictionaries, corpora have become a standard tool for studying the structure of different languages, their morphology, syntax and lexis. Indeed, electronic text corpora of all kinds – collections of whole texts, text samples, transcripts of recorded speech, etc. – are becoming so common that research that does not use corpus data arouses suspicion. For many languages so-called ‘national corpora’ are being compiled. The trend was started with the British National Corpus, which in turn was followed by the National Corpus of Polish, the Czech National Corpus, the (Open) American National Corpus, the Russian National Corpus, etc.¹ Megacorpora and collections of megacorpora, such as COCA,² Sketch Engine³ and Aranea⁴, include billions of running words collected by web crawlers from the internet. Indeed, for those who do not have access to suitable text corpora, or do not want to compile a corpus of their own, the internet itself can be used as a corpus. Thus, although the problem of corpus availability is still far from being resolved, monolingual corpus linguistics is progressing rapidly.

Research using multilingual corpora is less encouraging. Multilingual language resources are much more limited and more modest in size. This, in many ways, is rather surprising, because parallel corpora have so many potential uses and applications. The most obvious of these are in the field of translation. Parallel corpora are an invaluable aid to translators in their day-to-day work, and such corpora can obviously be used, therefore, in the training of translators. They are also important for studying the translation process itself: the strategies used by translators, the problem of ‘free’ vs. ‘literal’ translation, the question of style, etc. But parallel corpora are also crucial in more technical applications, especially in the field of machine translation – the development and testing of automatic translation programs. Another major area where parallel corpora are needed is the more theoretical discipline known as contrastive linguistics. This explores the morphological, syntactical and lexical similarities/differences between languages, with a view to compiling contrastive grammars and dictionaries. It is also concerned with the study of language universals, those features which different languages have in common. By extension, the results of contrastive research using parallel corpora will have a bearing on the methods and course materials used in language teaching. Indeed, parallel corpora can even be used in the classroom, both by teachers and the language learners themselves.

Why, then, has the development of parallel corpora lagged behind that of monolingual corpora? The reason, quite simply, is that it is far easier to obtain a large number of texts in one language than to find texts with corresponding versions in several different languages. There is also the problem of text alignment, i.e. linking corresponding sentences in the different languages (see section 2.3 below). Compiling parallel corpora, therefore, is a time-consuming undertaking and this explains why their development has not kept pace with that of monolingual corpora (see also Salkie 2008).

As was mentioned above, multilingual data is needed when writing in a foreign language or when translating. It may be necessary to check terminology, find suitable idiomatic phrasing, locate the standard (or different existing) translations of a well-known quotation, or find out what a quotation was in the original. However, most existing parallel and comparable corpora cannot be used for these purposes because of insufficient size, or because they are compiled from samples, not from whole texts. In theory, many of these tasks can be carried out with conventional internet searches (by using Google or other commercial search engines) or by consulting multilingual resources like Wikipedia, but multilingual internet searches of this kind clearly require much more ingenuity on the part of the user than when searching in one language only.

Similarly, when used in academic research, in the study of the structures of two or more languages, or in the compilation of bilingual dictionaries, parallel corpora need to be large enough to provide the researcher with enough data to draw reliable conclusions. But they must also include a wide variety of text types, to ensure that the languages being studied are covered adequately. Finding such texts in two or more languages is far more difficult than when working with a single language.

Considerations such as these all explain why parallel corpora are far less common than monolingual corpora, and also why the benefits of parallel corpora have not been fully recognized. It is our aim in the present book to help remedy this by presenting the reader with a comprehensive overview of multilingual corpora and thereby reveal their great potential.

1.1 Different types of text corpora

Corpora can be classified according to many different parameters. Some of these are relevant to any corpus, whether multilingual or monolingual, while some only apply to certain types of corpus. In this section we present some of the most important features of text corpora, but especially those that are relevant for multilingual corpora.

1.1.1 Important features of text corpora

Text corpora can consist of extracts or of whole texts. The very first text corpora, the best-known being the Brown University Standard Corpus of Present-Day American English, were of limited size. The Brown Corpus consisted of only 1 million words, and was made up of text extracts or samples, the length of each sample being about 2,000 words (Francis 1992). This was the only reasonable solution in the case of a small-size corpus (a million words is not a lot today, of course!). Nowadays, many corpora consist of whole texts. Whole-text corpora are faster to compile and they can be used for research both in linguistics and in literary and cultural studies. Their weakness is the possible problem of representativeness and statistical reliability; if a whole-text corpus is relatively small, it will not give a good cross-section of the language generally. A possible workaround is to compile a corpus of samples but with longer text extracts, as in the case of the English-Norwegian Parallel Corpus (ENPC), which has a sample size of 10,000–15,000 running words (Johansson 2002).
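The difference between a whole-text corpus and a corpus of samples can be illustrated with a minimal Python sketch. It assumes that running words can be approximated by splitting on whitespace; the function names `take_sample` and `build_sample_corpus` and the toy texts are hypothetical and are not taken from the book.

```python
# A minimal sketch of building a corpus of fixed-size samples.
# Assumption: "running words" are approximated by whitespace-separated tokens.

def take_sample(text, sample_size=2000, start=0):
    """Return up to `sample_size` running words, beginning at word index `start`."""
    words = text.split()
    return words[start:start + sample_size]


def build_sample_corpus(texts, sample_size=2000):
    """Map each text name to one extract of at most `sample_size` words."""
    return {name: take_sample(text, sample_size) for name, text in texts.items()}


# Toy input: one long text and one short text (invented placeholders).
texts = {
    "long_text": "word " * 5000,
    "short_text": "word " * 100,
}

corpus = build_sample_corpus(texts, sample_size=2000)
for name, sample in corpus.items():
    print(name, len(sample))
```

A text shorter than the sample size is simply included whole, which is one reason why the compilers of a small corpus need to watch that texts of very different lengths do not skew the collection.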

However, a small corpus can easily be somewhat artificial, because the texts or extracts that are included will depend on the choices of the compilers. When compiling a small corpus of a million or so running words, therefore, it is important to use texts of approximately the same size, whether whole texts or samples, and to ensure that they come from a variety of sources; otherwise the corpus will easily become biased in one direction or another. With a corpus of several hundred million running words, on the other hand, the irregularities that might be caused by size and choice of texts become insignificant: unusual words and structures will only occur rarely, specialist terms will have low frequency, and the stylistic peculiarities of a particular writer will not be misinterpreted as being typical.

To make searches more effective, corpus texts are often marked up, or annotated, i.e. abstract features of words and sentences are marked with special tags. The most common kind of markup is lemmatization, i.e. annotation that indicates the base form of each word (TAKE for the forms take, takes, took, taken). Lemmatization is usually combined with part-of-speech tagging (NOUN, ADJECTIVE, VERB, etc.), and for highly inflected languages it is also desirable to include morphological information (ACCUSATIVE, GENITIVE; CONDITIONAL, PERFECTIVE, etc.). Corpora with syntactic markup (SUBJECT, OBJECT, ADVERBIAL), which are sometimes called ‘treebanks’, are less common, and semantic markup (ARTEFACT, COLOUR, PLACE-NAME, etc.) has so far only been introduced in a few corpora on an experimental basis.
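As a rough illustration, an annotated token can be thought of as a record that carries the word form together with its tags. The sketch below is only one possible representation: the `Token` class, the `find_by_lemma` helper, the example sentence and the morphological labels PAST and PLURAL are assumptions made for the example, not the format of any particular corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str        # the word as it appears in the text
    lemma: str       # base form, e.g. TAKE for take, takes, took, taken
    pos: str         # part-of-speech tag, e.g. NOUN, VERB
    morph: str = ""  # optional morphological information (illustrative labels)


# An invented example sentence with hand-assigned annotation.
sentence = [
    Token("She", "SHE", "PRONOUN"),
    Token("took", "TAKE", "VERB", "PAST"),
    Token("the", "THE", "DETERMINER"),
    Token("books", "BOOK", "NOUN", "PLURAL"),
]


def find_by_lemma(tokens, lemma):
    """Return every word form in `tokens` whose lemma matches `lemma`."""
    return [t.form for t in tokens if t.lemma == lemma]


print(find_by_lemma(sentence, "TAKE"))
```

With this kind of record, a single query on the lemma retrieves all inflected forms, which a plain string search for "take" would miss.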

Many corpora, especially in the early phases of their development, consist of collections of unannotated texts. However, corpora without any annotation may sometimes be limited in their usefulness. The absence of annotation does not produce serious problems when searching for basic examples of language usage, although even there, searches are limited to simple string matching. If a corpus is lemmatized, on the other hand, it becomes easier to produce frequency lists, and with a morphologically annotated corpus, it is possible to compile statistics on the use and occurrence of different grammatical forms.
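The practical effect of lemmatization on frequency lists can be sketched in a few lines of Python. The list of (form, lemma) pairs below is invented for the example; in a real corpus the lemmas would come from the annotation.

```python
from collections import Counter

# Invented (word form, lemma) pairs standing in for a lemmatized corpus.
tokens = [
    ("take", "TAKE"), ("takes", "TAKE"), ("took", "TAKE"),
    ("taken", "TAKE"), ("took", "TAKE"),
    ("book", "BOOK"), ("books", "BOOK"),
]

# Without annotation, only the surface strings can be counted.
form_freq = Counter(form for form, _ in tokens)

# With lemmatization, all inflected forms are counted under one entry.
lemma_freq = Counter(lemma for _, lemma in tokens)

print(form_freq.most_common())
print(lemma_freq.most_common())
```

In the form-based list the occurrences of TAKE are scattered over four separate entries, whereas the lemma-based list gathers them under one.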

Nowadays, most types of annotation are performed automatically, but the results require manual checking, even when sophisticated context-sensitive software is used. With very large corpora, however, manual checking is impossible, and so researchers have to be content with automated annotation, even if there is the possibility of errors. Still, this is better than no annotation at all.

Sometimes, however, there is a need for large collections of unannotated raw data, e.g. for testing software for machine translation (MT). Researchers in the field of information technology and computer science work with huge raw text archives. These researchers hold regular conferences on text processing, e.g. CLEF in Europe, TREC in the USA, ROMIP in Russia, etc.⁵

1.1.2 Text archives and text corpora

Sometimes texts are collected for regular use as a source of information. News agencies, newspapers and magazines assemble huge archives of their published material, which can later be accessed online by the general public. Similarly, government departments, banks, universities and other institutions have archives of publicly available documents, reports, regulations and the like. These are typically produced in one language only, but legislative and judicial documents are sometimes available in several languages (e.g. documents of the United Nations on the UN website, EU legislation at Eur-Lex, etc.). There are even newspapers which are published online in two or more languages, not to mention the day-to-day reports of international news agencies like Reuters. Text archives of this kind are a valuable source of multilingual language data, but they are of limited use in linguistic research. This is because the corresponding texts are all stored separately. To access any given text in two or more different language versions it would be necessary to search first one version, then the other, and then align the corresponding segments (paragraphs, sentences). This would clearly be extremely tedious.

Text archives, whether monolingual or multilingual, are designed to help retrieve information. They are not designed for studying languages or for doing language research. Text corpora, on the other hand, are created to enable linguists to study particular linguistic phenomena. They have search engines that are designed specifically to find such phenomena. Text corpora are typically monolingual, but with a multilingual parallel corpus, researchers have ready access to linguistic data in two or more languages. This is because the texts in the corpus are aligned, i.e. the corresponding segments (paragraphs or sentences) of the texts in different languages are linked together and output simultaneously. Such corpora are of little use to a person who requires information, but are invaluable when investigating linguistic phenomena, and in particular, the similarities and differences between different languages.
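What it means for aligned segments to be 'linked together and output simultaneously' can be shown with a small Python sketch. It assumes the simplest possible storage, a list of sentence pairs; the English-French pairs are invented for the example and the function name `search_parallel` is hypothetical. Real parallel corpora use more elaborate formats and search engines.

```python
import re

# Invented English-French sentence pairs, already aligned one-to-one.
aligned = [
    ("The cat is sleeping.", "Le chat dort."),
    ("I am reading a book.", "Je lis un livre."),
    ("The book is on the table.", "Le livre est sur la table."),
]


def search_parallel(pairs, pattern):
    """Search the source side and return each hit with its aligned translation."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if rx.search(src)]


for src, tgt in search_parallel(aligned, r"\bbook\b"):
    print(src, "|", tgt)
```

Because each source segment is stored with its counterpart, one query returns both language versions at once, which is the step that would have to be done by hand with a text archive.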

1.1.3 Monolingual vs. bilingual vs. multilingual corpora

As has already been mentioned, most corpora are monolingual. These also include comparable corpora of different varieties of the same language, e.g. the International Corpus of English (ICE).⁶ As regards parallel text corpora, the commonest type includes only two languages, but there do exist parallel corpora with several languages. However, because it is often difficult to find corresponding texts for a corpus consisting of many different languages, compiling such a corpus can be time-consuming and costly. Inevitably, therefore, multilingual corpora will always be smaller and less comprehensive than bilingual corpora. Nonetheless, in some kinds of research (e.g. studies in language typology) multilingual text collections, however small, can be very useful.

Multilingual data can consist of original texts (i.e. texts originally written in a given language), and/or translations from other languages. The possible combinations are as follows:

(a) original texts in language A vs. (different) authentic texts in language B
(b) original texts in language A vs. their translations in language B
(c) original texts in langua...

Cover
Title
Copyright
Contents
List of figures
List of tables
List of boxes
Preface
Acknowledgements
List of abbreviations
List of sources
1 Parallel text corpora: a general overview
2 Designing and compiling a parallel corpus
3 Using parallel corpora: basic search procedures
4 Processing search results
5 Using parallel corpora: more advanced search procedures
6 Applications of parallel corpora
7 A survey of available parallel corpora
Final remarks
Glossary
Appendix 1: Corpus-based M.A. theses at the University of Tampere
Appendix 2: Sample programs
Index