Corpora can be classified according to many different parameters. Some of these are relevant to any corpus, whether multilingual or monolingual, while some only apply to certain types of corpus. In this section we present some of the most important features of text corpora, with particular attention to those that are relevant for multilingual corpora.
1.1.1 Important features of text corpora
Text corpora can consist of extracts or of whole texts. The very first text corpora, the best-known being the Brown University Standard Corpus of Present-Day American English, were of limited size. The Brown Corpus consisted of only 1 million words, and was made up of text extracts or samples, the length of each sample being about 2,000 words (Francis 1992). This was the only reasonable solution in the case of a small corpus (a million words is not a lot today, of course!). Nowadays, many corpora consist of whole texts. Whole-text corpora are faster to compile and they can be used for research both in linguistics and in literary and cultural studies. Their weakness is the possible problem of representativeness and statistical reliability; if a whole-text corpus is relatively small, it will not give a good cross-section of the language generally. A possible workaround is to compile a corpus of samples, but with longer text extracts, as in the case of the English-Norwegian Parallel Corpus (ENPC), which has a sample size of 10,000–15,000 running words (Johansson 2002).
However, a small corpus can easily be somewhat artificial, because the texts or extracts that are included will depend on the choices of the compilers. When compiling a small corpus of a million or so running words, therefore, it is important to use texts of approximately the same size, whether whole texts or samples, and to ensure that they come from a variety of sources; otherwise the corpus will easily become biased in one direction or another. With a corpus of several hundred million running words, on the other hand, the irregularities that might be caused by size and choice of texts become insignificant: unusual words and structures will only occur rarely, specialist terms will have low frequency, and the stylistic peculiarities of a particular writer will not be misinterpreted as being typical.
To make searches more effective, corpus texts are often marked up, or annotated, i.e. abstract features of words and sentences are marked with special tags. The most common kind of markup is lemmatization, i.e. annotation that indicates the base form of each word (TAKE for the forms take, takes, took, taken). Lemmatization is usually combined with part-of-speech tagging (NOUN, ADJECTIVE, VERB, etc.), and for highly inflected languages it is also desirable to include morphological information (ACCUSATIVE, GENITIVE; CONDITIONAL, PERFECTIVE, etc.). Corpora with syntactic markup (SUBJECT, OBJECT, ADVERBIAL), which are sometimes called ‘treebanks’, are less common, and semantic markup (ARTEFACT, COLOUR, PLACE-NAME, etc.) has so far only been introduced in a few corpora on an experimental basis.
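As a rough illustration of what such markup might look like in practice, the following Python sketch stores each word together with its lemma, part-of-speech tag and morphological information, and prints the result with one token per line. The Token class, the tag names, the use of "_" for a missing value and the tab-separated layout are all assumptions made for this example; they do not reproduce the format of any particular corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word (hypothetical layout, for illustration only)."""
    form: str         # the word as it appears in the text
    lemma: str        # base form
    pos: str          # part-of-speech tag
    morph: str = "_"  # morphological information; "_" means none given


# A hand-annotated example sentence; the tags are illustrative.
sentence = [
    Token("She", "she", "PRONOUN", "NOMINATIVE"),
    Token("took", "take", "VERB", "PAST"),
    Token("the", "the", "DETERMINER"),
    Token("books", "book", "NOUN", "PLURAL"),
]


def to_vertical(tokens):
    """Render tokens one per line, with tab-separated annotation columns."""
    return "\n".join(
        f"{t.form}\t{t.lemma}\t{t.pos}\t{t.morph}" for t in tokens
    )


print(to_vertical(sentence))
```

Keeping the annotation in separate columns means that a search can target any one layer (for instance, every token whose lemma is take) without having to list the individual word forms.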
Many corpora, especially in the early phases of their development, consist of collections of unannotated texts. However, corpora without any annotation may sometimes be limited in their usefulness. The absence of annotation does not produce serious problems when searching for basic examples of language usage, although even there, searches are limited to simple string matching. If a corpus is lemmatized, on the other hand, it becomes easier to produce frequency lists, and with a morphologically annotated corpus, it is possible to compile statistics on the use and occurrence of different grammatical forms.
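The difference between these levels of annotation can be sketched in a few lines of Python. The token list below is invented for the purpose of the example, and the tag names are assumptions; the point is only that the same data yield different counts depending on whether word forms, lemmas or morphological tags are counted.

```python
from collections import Counter

# (form, lemma, morphological tag) triples; hypothetical hand-annotated data.
tokens = [
    ("takes", "take", "PRESENT"),
    ("took", "take", "PAST"),
    ("taken", "take", "PARTICIPLE"),
    ("take", "take", "PRESENT"),
    ("gave", "give", "PAST"),
    ("took", "take", "PAST"),
]

# Simple string matching can only count identical word forms.
form_freq = Counter(form for form, _, _ in tokens)

# With lemmas, all inflected forms are grouped under one entry.
lemma_freq = Counter(lemma for _, lemma, _ in tokens)

# With morphological tags, grammatical forms can be counted directly.
morph_freq = Counter(tag for _, _, tag in tokens)

print(form_freq.most_common())
print(lemma_freq.most_common())
print(morph_freq.most_common())
```

In this toy sample the string take is found only once, whereas the lemma take accounts for five of the six tokens, and the morphological count shows three past-tense forms.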
Nowadays, most types of annotation are performed automatically, but the results require manual checking, even when sophisticated context-sensitive software is used. With very large corpora, however, manual checking is impossible, and so researchers have to be content with automated annotation, even if there is the possibility of errors. Still, this is better than no annotation at all.
Sometimes, however, there is a need for large collections of unannotated raw data, e.g. for testing software for machine translation (MT). Researchers in the field of information technology and computer science work with huge raw text archives. These researchers hold regular conferences on text processing, such as CLEF in Europe, TREC in the USA and ROMIP in Russia.5
1.1.2 Text archives and text corpora
Sometimes texts are collected for regular use as a source of information. News agencies, newspapers and magazines assemble huge archives of their published material, which can later be accessed online by the general public. Similarly, government departments, banks, universities and other institutions have archives of publicly available documents, reports, regulations and the like. These are typically produced in one language only, but legislative and judicial documents are sometimes available in several languages (e.g. documents of the United Nations on the UN website, EU legislation at Eur-Lex, etc.). There are even newspapers which are published online in two or more languages, not to mention the day-to-day reports of international news agencies like Reuters. Text archives of this kind are a valuable source of multilingual language data, but they are of limited use in linguistic research. This is because the corresponding texts are all stored separately. To access any given text in two or more different language versions it would be necessary to search first one version, then the other, and then align the corresponding segments (paragraphs, sentences). This would clearly be extremely tedious.
Text archives, whether monolingual or multilingual, are designed to help retrieve information. They are not designed for studying languages or for doing language research. Text corpora, on the other hand, are created to enable linguists to study particular linguistic phenomena. They have search engines that are designed specifically to find such phenomena. Text corpora are typically monolingual, but with a multilingual parallel corpus, researchers have ready access to linguistic data in two or more languages. This is because the texts in the corpus are aligned, i.e. the corresponding segments (paragraphs or sentences) of the texts in different languages are linked together and output simultaneously. Such corpora are of little use to a person who requires information, but are invaluable when investigating linguistic phenomena, and in particular, the similarities and differences between different languages.
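The idea of aligned segments that are output together can be sketched as follows. This is a minimal illustration only: the sentence pairs are invented, the alignment is assumed to have been done already, and the search function is a hypothetical helper rather than the interface of any existing corpus tool.

```python
import re

# Hypothetical aligned segments: each tuple links an English sentence
# to its Norwegian counterpart (invented examples).
aligned = [
    ("She took the book.", "Hun tok boken."),
    ("The house is old.", "Huset er gammelt."),
    ("He took a walk.", "Han gikk en tur."),
]


def search(pairs, pattern):
    """Return every aligned pair whose source segment matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if regex.search(src)]


# Each hit is shown together with its linked segment in the other language.
for src, tgt in search(aligned, r"\btook\b"):
    print(f"{src}\t{tgt}")
```

Because the link between segments is stored in the corpus itself, a single query on one language returns the corresponding segments in the other, so that (in this toy example) the two different Norwegian renderings of took can be compared directly.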
1.1.3 Monolingual vs. bilingual vs. multilingual corpora
As has already been mentioned, most corpora are monolingual. These also include comparable corpora of different varieties of the same language, e.g. the International Corpus of English (ICE).6 As regards parallel text corpora, the commonest type includes only two languages, but there do exist parallel corpora with several languages. However, because it is often difficult to find corresponding texts for a corpus consisting of many different languages, compiling such a corpus can be time-consuming and costly. Inevitably, therefore, multilingual corpora will always be smaller and less comprehensive than bilingual corpora. Nonetheless, in some kinds of research (e.g. studies in language typology) multilingual text collections, however small, can be very useful.
Multilingual data can consist of original texts (i.e. texts originally written in a given language), and/or translations from other languages. The possible combinations are as follows:
- (a) original texts in language A vs. (different) authentic texts in language B
- (b) original texts in language A vs. their translations in language B
- (c) original texts in langua...