Corpus Linguistics and Linguistically Annotated Corpora
eBook - ePub

  1. 288 pages
  2. English

About This Book

Linguistically annotated corpora are becoming a central part of the corpus linguistics field. One of their main strengths is the level of searchability they offer, but with the annotation come problems of the initial complexity of queries and query tools. This book gives a full, pedagogic account of this burgeoning field. Beginning with an overview of corpus linguistics, its prerequisites and goals, the book then introduces linguistically annotated corpora. It explores the different levels of linguistic annotation, including morphological, parts of speech, syntactic, semantic and discourse-level, as well as advantages and challenges for such annotations. It covers the main annotated corpora for English, the Penn Treebank, the International Corpus of English, and OntoNotes, as well as a wide range of corpora for other languages. In its third part, search strategies required for different types of data are explored. All chapters are accompanied by exercises and by sections on further reading.

Authors: Sandra Kuebler, Heike Zinsmeister

Information

Year
2014
ISBN
9781441119803
Edition
1
PART I
INTRODUCTION
CHAPTER 1
CORPUS LINGUISTICS
1.1 Motivation
Corpus linguistics has a long tradition, especially in subdisciplines of linguistics that work with data for which it is hard or even impossible to gather native speakers’ intuitions, such as historical linguistics, language acquisition, or phonetics. But the last two decades have also witnessed a turn towards empiricism in other linguistic subdisciplines, such as formal syntax. For many years, these subdisciplines had a strong intuitionistic bias and were traditionally based on introspective methods. Linguists would use invented examples rather than attested language use. Such examples have the advantage that they concentrate on the phenomenon in question and abstract away from other types of complexity. For instance, if a linguist wants to study fronting, sentences like the ones listed in (1) clearly show which constituents can be fronted and which cannot. The sentence in (2) is an attested example¹ that shows the same type of fronting as the example in (1-a), but the sentence is more complicated and thus more difficult to analyze.
(1) a. In the morning, he read about linguistics.
    b. *The morning, he read about linguistics in.
(2) In the 1990s, spurred by rising labor costs and the strong yen, these companies will increasingly turn themselves into multinationals with plants around the world.
Nowadays, linguists of all schools consult linguistic corpora or use the world wide web as a corpus, not only for collecting natural-sounding examples, but also for testing their linguistic hypotheses against quantitative data of attested language use.
The number of linguistically analyzed and publicly available corpora has also increased. Many of them were originally created for computational linguistic purposes, to provide data that could be used for testing or developing automatic tools for analyzing language and for other applications. In addition to serving their original purpose, many of these corpora have been made accessible, for example through online search interfaces, so that the linguistic community can readily use and explore them. But even if the resources are available, it is not always straightforward to determine how to use and interpret the available data. We can compare this to arriving in a foreign city. You can wander around on your own. But it is tremendously helpful to have a guide who shows you how to get around and explains how to profit from the characteristics of that particular city. And if you do not speak the local language, you need a translator or, even better, a guide who introduces you to it.
This book is intended to guide the reader in a similar way. It shows readers how to find their way in the data by using appropriate query and visualization tools. It also introduces them to interpreting annotation by explaining linguistic analyses and their encodings. The first part of the book gives an introduction on the general level; the second part deepens the understanding of these issues by presenting examples of major corpora and their linguistic annotations. The third part covers more practical issues, and the fourth part introduces search tools in more detail. The goal of the book is to make readers truly ‘corpus-literate’ by providing them with the specific knowledge that one needs to work with annotated corpora in a productive way.
This chapter will motivate corpus linguistics per se and introduce important terminology. It will discuss introductory questions such as: What is a corpus and what makes a corpus different from an electronic collection of texts (section 1.2)? What kinds of corpora can be distinguished (section 1.3)? Is corpus linguistics a theory or a tool (section 1.4)? How does corpus linguistics differ from an intuitionistic approach to linguistics (section 1.5)? The chapter will end with an explanation of the structure of the book and a short synopsis of the following chapters (section 1.6). Finally, this chapter, like all chapters, will be complemented by a list of further reading (section 1.7).
1.2 Definition of Corpus
A modern linguistic corpus is an electronically available collection of texts or transcripts of audio recordings which is sampled to represent a certain language, language variety, or other linguistic domain. It is optionally enriched with levels of linguistic analysis, which we will call linguistic annotation. The origin of the text samples and other information regarding the sampling criteria are described in the metadata of the corpus.
The remainder of this section will motivate and explain different issues arising from this definition of corpus. For beginners in the field, we want to point out that the term corpus has its origin in Latin, meaning ‘body’. For this reason, the plural of corpus is formed according to Latin morphology: one corpus, two corpora.
As indicated above, nowadays the term corpus is almost synonymous with electronically available corpus, but this is not necessarily so. Some linguistic subdisciplines have a long-standing tradition of working with corpora that goes back to the pre-computer era, in particular historical linguistics, phonetics, and language acquisition. Pre-electronic corpora used in lexicography and grammar development often consisted of samples of short text snippets that illustrate the use of a particular word or grammatical construction. But there were also some comprehensive quantitative evaluations of large bodies of text. To showcase the characteristic properties of modern corpora, we will look back in time and consider an extreme example of quantitative evaluation in the pre-computer era, in which the relevant processing steps had to be performed manually.
At the end of the nineteenth century, before the invention of tape recorders, there was a strong interest in shorthand for documenting spoken language. Shorthand was intended as a system of symbols representing letters, words, or even phrases that allows writers to optimize their speed of writing. The stenographer Friedrich Wilhelm Kaeding saw an opportunity for improving German shorthand by basing the system on solid statistics of word, syllable, and character distributions. In an optimal shorthand system, words and phrases that occur very frequently should be represented by a short, simple symbol, while less frequent words can be represented by longer symbols. To achieve such a system, Kaeding carried out a large-scale project in which hundreds of volunteers counted the frequencies of more than 250,000 words and their syllables in a text collection of almost 11 million words. This was an enormous endeavor, which took more than five years to complete.
To make the task of counting words and syllables feasible, it had to be split into different subtasks. The first, preparatory task was performed by 665 volunteers who simply copied all relevant word forms that occurred in the texts on index cards in a systematic way, including information about the source text. Subsequently, all index cards were sorted in alphabetical order for counting the frequencies of re-occurring words. Using one card for each instance made the counting replicable in the sense that other persons could also take the stack of cards, count the index cards themselves, and compare their findings with the original results. As we will see later, replicability is an important aspect of corpus linguistics.
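The index-card procedure described above can be sketched in a few lines of Python. This is only an illustration of the three steps (one card per word instance with its source, alphabetical sorting, counting), not a reconstruction of Kaeding's actual working rules. The names `IndexCard`, `make_cards`, and `count_cards`, the two sample texts, and the decisions to lowercase words and strip punctuation are all assumptions made for this example.

```python
from collections import Counter
from typing import NamedTuple


class IndexCard(NamedTuple):
    """One card per word instance, noting where the word was found."""
    word_form: str
    source: str


def make_cards(texts: dict[str, str]) -> list[IndexCard]:
    """Step 1: copy every word form onto its own card, with its source."""
    cards = []
    for source, text in texts.items():
        for token in text.split():
            # Assumption: punctuation is stripped and case is ignored.
            word = token.strip(".,;:!?\"'()").lower()
            if word:
                cards.append(IndexCard(word, source))
    return cards


def count_cards(cards: list[IndexCard]) -> Counter:
    """Steps 2 and 3: sort the cards alphabetically, then count each word."""
    ordered = sorted(cards)  # sorts by word form first, then by source
    return Counter(card.word_form for card in ordered)


# Hypothetical two-text "collection" used only for illustration.
texts = {
    "text_a": "In the morning, he read about linguistics.",
    "text_b": "He read about corpora in the evening.",
}
cards = make_cards(texts)
frequencies = count_cards(cards)

for word, count in frequencies.items():
    print(word, count)
```

Because the stack of cards is kept separate from the counts, anyone can rerun `count_cards` on the same stack and compare the results, which mirrors the replicability that the one-card-per-instance method provided.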
The enormous manual effort described above points to a crucial property of modern linguistic corpora that we tend to take for granted as naïve corpus users: A corpus provides texts in the form of linguistically meaningful and retrievable units in a reusable way.
Kaeding’s helpers invested an enormous amount of time in identifying words, sorting, and counting them manually. The great merit of computers is that they perform exactly such tasks for us automatically, much more quickly, and more reliably: They can perform search, retrieval, sorting, calculations, and even visualization of linguistic information mechanically. But it is a necessary prerequisite that the relevant units are encoded as identifiable entities in the data representation. Kaeding, for example, needed to define what a word is. This is a non-trivial decision, even in English, if we consider expressions such as don’t² or in spite of. How this is done will be introduced in the following section and, in more detail, in Chapter 3.
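To see why defining a word is a non-trivial decision, the following sketch counts the words of one sentence under three different definitions. It is a minimal illustration, not the method used in any particular corpus: the function names, the sample sentence, and the small `MULTIWORD` table are hypothetical, and splitting don't into do + n't is just one possible convention (it is the one used in Penn Treebank tokenization). The code uses a straight apostrophe.

```python
PUNCTUATION = ".,;:!?"

# Hypothetical table of expressions to be treated as a single unit.
MULTIWORD = {("in", "spite", "of"): "in_spite_of"}


def tokenize_whitespace(text: str) -> list[str]:
    """Definition A: a word is whatever stands between two spaces."""
    stripped = (t.strip(PUNCTUATION) for t in text.lower().split())
    return [t for t in stripped if t]


def split_clitics(tokens: list[str]) -> list[str]:
    """Definition B: split the negation clitic off (don't -> do + n't)."""
    out = []
    for tok in tokens:
        if tok.endswith("n't") and len(tok) > 3:
            out.extend([tok[:-3], "n't"])
        else:
            out.append(tok)
    return out


def merge_multiword(tokens: list[str]) -> list[str]:
    """Definition C: treat listed multi-word expressions as one word."""
    out = []
    i = 0
    while i < len(tokens):
        for expr, merged in MULTIWORD.items():
            if tuple(tokens[i:i + len(expr)]) == expr:
                out.append(merged)
                i += len(expr)
                break
        else:
            # No multi-word expression starts here: keep the token as is.
            out.append(tokens[i])
            i += 1
    return out


sentence = "They don't read in spite of the rain."
a = tokenize_whitespace(sentence)
b = split_clitics(a)
c = merge_multiword(a)

print(len(a), a)  # 8 words
print(len(b), b)  # 9 words
print(len(c), c)  # 6 words
```

The same sentence contains eight, nine, or six words depending on the definition, so any frequency count is only interpretable once the tokenization decisions behind it are documented.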
1.2.1 Electronic Processing
Making a corpus available electronically goes beyond putting a text file on a web page. At the very least, there are several technical steps involved in the creation of a corpus. The first step concerns making the text accessible in a corpus. If we already have our text in electronic form, it is generally a PDF file or an MS Word document, to name just the most common formats. Such files can only be opened by specific, mostly proprietary applications, and searching in them is restricted to the search options that the application provides. Thus, we can search for individual words in PDF files, but we cannot go beyond that. When creating a corpus, we need more flexibility. This means that we need to extract the text and only the text from these formatted files. If our original text is not available electronically, we need to use a scanner to create an electronic image of the text and then use Optical Character Recognition (OCR) software to translate this image into text. Figure 1.1 shows a scanned image of the b...

Table of contents

  1. FC
  2. Half Title
  3. Title Page
  4. Toc
  5. Preface
  6. Part I: Introduction
  7. Part II: Linguistic Annotation
  8. Part III: Using Linguistic Annotation in Corpus Linguistics
  9. Part IV: Querying Linguistically Annotated Corpora
  10. Appendix A. Penn Treebank POS Tagset
  11. Appendix B. ICE POS Tagset
  12. Notes
  13. Bibliography
  14. Index
  15. Copyright Page