eBook - ePub

Corpus Linguistics and Linguistically Annotated Corpora

Name: Corpus Linguistics and Linguistically Annotated Corpora
ISBN: 9781441119803

Sandra Kuebler,

Heike Zinsmeister,

288 pages
English

About this book

Linguistically annotated corpora are becoming a central part of the corpus linguistics field. One of their main strengths is the level of searchability they offer, but with the annotation come problems of the initial complexity of queries and query tools. This book gives a full, pedagogic account of this burgeoning field. Beginning with an overview of corpus linguistics, its prerequisites and goals, the book then introduces linguistically annotated corpora. It explores the different levels of linguistic annotation, including morphological, parts of speech, syntactic, semantic and discourse-level, as well as advantages and challenges for such annotations. It covers the main annotated corpora for English, the Penn Treebank, the International Corpus of English, and OntoNotes, as well as a wide range of corpora for other languages. In its third part, search strategies required for different types of data are explored. All chapters are accompanied by exercises and by sections on further reading.

Topic: Languages & Linguistics
Subtopic: Linguistics

PART I

INTRODUCTION

CHAPTER 1

CORPUS LINGUISTICS

1.1 Motivation

Corpus linguistics has a long tradition, especially in subdisciplines of linguistics that work with data for which it is hard or even impossible to gather native speakers’ intuitions, such as historical linguistics, language acquisition, or phonetics. But the last two decades have witnessed a turn towards empiricism in other linguistic subdisciplines, such as formal syntax. For many years, these subdisciplines had a strong intuitionistic bias and were traditionally based on introspective methods. Thus, linguists would use invented examples rather than attested language use. Such examples have the advantage that they concentrate on the phenomenon in question and abstract away from other types of complexity. Thus, if a linguist wants to study fronting, sentences like the ones listed in (1) clearly show which constituents can be fronted and which cannot. The sentence in (2) is an attested example¹ that shows the same type of fronting as the example in (1-a), but the sentence is more complicated and thus more difficult to analyze.

(1)	a.	In the morning, he read about linguistics.
	b.	*The morning, he read about linguistics in.
(2)	In the 1990s, spurred by rising labor costs and the strong yen, these companies will increasingly turn themselves into multinationals with plants around the world.

Nowadays, linguists of all schools consult linguistic corpora or use the World Wide Web as a corpus, not only for collecting natural-sounding examples but also for testing their linguistic hypotheses against quantitative data of attested language use.

The number of linguistically analyzed and publicly available corpora has also increased. Many of them were originally created for computational linguistic purposes, to provide data that could be used for testing or developing automatic tools for analyzing language and other applications. In addition to their original purpose, many of the corpora have been made accessible, for example, by means of online search interfaces, so that they can be readily used and explored by the linguistic community. But even if the resources are available, it is not always straightforward to determine how to use and interpret the available data. We can compare this to arriving in a foreign city. You can wander around on your own. But it is tremendously helpful to have a guide who shows you how to get around and explains how to profit from the characteristics of that particular city. And if you do not speak the local language, you need a translator or, even better, a guide who introduces you to it.

This book is intended to guide the reader in a similar way. It shows readers how to find their way in the data by using appropriate query and visualization tools. It also introduces them to the interpretation of annotation by explaining linguistic analyses and their encodings. The first part of the book gives an introduction on a general level; the second part deepens the understanding of these issues by presenting examples of major corpora and their linguistic annotations. The third part covers more practical issues, and the fourth part introduces search tools in more detail. The goal of the book is to make readers truly ‘corpus-literate’ by providing them with the specific knowledge that one needs to work with annotated corpora in a productive way.

The current chapter will motivate corpus linguistics per se and introduce important terminology. It will discuss introductory questions such as: What is a corpus and what makes a corpus different from an electronic collection of texts (section 1.2)? What kinds of corpora can be distinguished (section 1.3)? Is corpus linguistics a theory or a tool (section 1.4)? How does corpus linguistics differ from an intuitionistic approach to linguistics (section 1.5)? The chapter will end with an explanation of the structure of the book and a short synopsis of the following chapters (section 1.6). Finally, this chapter, like all chapters, will be complemented by a list of further reading (section 1.7).

1.2 Definition of Corpus

A modern linguistic corpus is an electronically available collection of texts or transcripts of audio recordings which is sampled to represent a certain language, language variety, or other linguistic domain. It is optionally enriched with levels of linguistic analysis, which we will call linguistic annotation. The origin of the text samples and other information regarding the sampling criteria are described in the metadata of the corpus.
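The three ingredients of this definition (sampled texts, optional annotation, and metadata) can be pictured as a small data structure. The following is only a minimal sketch for illustration: the class names, field names, and the toy corpus are assumptions made here and do not correspond to any actual corpus format. The two part-of-speech labels follow the Penn Treebank tagset.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str                                       # the word form as it occurs in the text
    annotation: dict = field(default_factory=dict)  # optional layers, e.g. {"pos": "NN"}


@dataclass
class Text:
    tokens: list    # list of Token objects
    metadata: dict  # origin of the sample, e.g. source, year, genre


@dataclass
class Corpus:
    name: str
    sampling_note: str  # what the collection is meant to represent
    texts: list = field(default_factory=list)

    def size(self):
        """Number of tokens in the whole corpus."""
        return sum(len(text.tokens) for text in self.texts)

    def annotated_tokens(self, layer):
        """All tokens that carry the given annotation layer."""
        return [tok for text in self.texts for tok in text.tokens
                if layer in tok.annotation]


# Hypothetical toy corpus built from example (1-a).
words = "In the morning , he read about linguistics .".split()
tokens = [Token(w) for w in words]
tokens[1].annotation["pos"] = "DT"  # determiner
tokens[2].annotation["pos"] = "NN"  # singular noun
toy = Corpus("toy", "hypothetical single-sentence sample",
             [Text(tokens, {"source": "example (1-a)"})])

print(toy.size())
print([t.form for t in toy.annotated_tokens("pos")])
```

Keeping the annotation in an optional dictionary per token mirrors the point that annotation enriches a corpus without being required for it to count as one.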

The remainder of this section will motivate and explain different issues arising from this definition of corpus. For beginners in the field, we want to point out that the term corpus has its origin in Latin, meaning ‘body’. For this reason, the plural of corpus is formed according to Latin morphology: one corpus, two corpora.

As indicated above, nowadays the term corpus is almost synonymous with electronically available corpus, but this is not necessarily so. Some linguistic subdisciplines have a long-standing tradition of working with corpora dating back to the pre-computer era, in particular historical linguistics, phonetics, and language acquisition. Pre-electronic corpora used in lexicography and grammar development often consisted of samples of short text snippets that illustrate the use of a particular word or grammatical construction. But there were also some comprehensive quantitative evaluations of large bodies of text. To showcase the characteristic properties of modern corpora, we will look back in time and consider an extreme example of quantitative evaluation in the pre-computer era in which the relevant processing steps had to be performed manually.

At the end of the nineteenth century, before the invention of tape recorders, there was a strong interest in shorthand writing for documenting spoken language. Shorthand was intended as a system of symbols representing letters, words, or even phrases that allows writers to optimize their speed of writing. The stenographer Friedrich Wilhelm Kaeding saw an opportunity for improving German shorthand by basing the system on solid statistics of word, syllable, and character distributions. In order to create an optimal shorthand system, words and phrases that occur very frequently should be represented by a short, simple symbol, while less frequent words can be represented by longer symbols. To achieve such a system, Kaeding carried out a large-scale project in which hundreds of volunteers counted the frequencies of more than 250,000 words and their syllables in a text collection of almost 11 million words. This was obviously an enormous endeavor, which took more than five years to complete.
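The principle behind such a system, short symbols for frequent words, can be illustrated with a small sketch. This is not Kaeding's actual method or a real shorthand system: the two-letter symbol alphabet, the word list, and the frequencies below are all invented for illustration.

```python
from itertools import product


def symbols(alphabet="ab"):
    """Yield symbols in order of increasing length: a, b, aa, ab, ba, bb, ..."""
    length = 1
    while True:
        for combo in product(alphabet, repeat=length):
            yield "".join(combo)
        length += 1


def assign_symbols(frequencies, most_frequent_first=True):
    """Give the shortest symbols to the most (or, for contrast, least) frequent words."""
    ranked = sorted(frequencies, key=lambda w: (-frequencies[w], w))
    if not most_frequent_first:
        ranked.reverse()
    return dict(zip(ranked, symbols()))


def writing_cost(frequencies, code):
    """Total number of symbol characters needed to write every occurrence."""
    return sum(freq * len(code[word]) for word, freq in frequencies.items())


# Invented frequencies for four German words (not Kaeding's counts).
freqs = {"und": 50, "die": 40, "Haus": 5, "selten": 1}

good = assign_symbols(freqs)
bad = assign_symbols(freqs, most_frequent_first=False)
print(good, writing_cost(freqs, good))
print(bad, writing_cost(freqs, bad))
```

With these toy numbers, giving the shortest symbols to the most frequent words needs far fewer characters in total than the reverse assignment, which is why reliable frequency counts were a prerequisite for the design.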

To make the task of counting words and syllables feasible, it had to be split into different subtasks. The first, preparatory task was performed by 665 volunteers who simply copied all relevant word forms that occurred in the texts on index cards in a systematic way, including information about the source text. Subsequently, all index cards were sorted in alphabetical order for counting the frequencies of re-occurring words. Using one card for each instance made the counting replicable in the sense that other persons could also take the stack of cards, count the index cards themselves, and compare their findings with the original results. As we will see later, replicability is an important aspect of corpus linguistics.
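The index-card procedure maps directly onto a few lines of code. The sketch below assumes a toy collection of two invented texts and a deliberately naive notion of a word (whitespace-separated, lowercased); it is meant to show the steps of copying, sorting, and counting, not a realistic corpus pipeline.

```python
from collections import Counter


def make_cards(texts):
    """Copy every word instance onto its own 'index card': (word form, source text)."""
    cards = []
    for source, text in texts.items():
        for word in text.split():
            cards.append((word.lower(), source))
    return cards


def count_cards(cards):
    """Sort the cards alphabetically, then count how often each word form recurs."""
    ordered = sorted(cards)
    return Counter(form for form, _source in ordered)


# Two invented source texts standing in for the text collection.
texts = {
    "text_A": "the cat saw the dog",
    "text_B": "the dog slept",
}

cards = make_cards(texts)
counts = count_cards(cards)
print(counts.most_common())

# Replicability: anyone can take the same stack of cards, in any order,
# count again, and compare the result with the original counts.
recount = count_cards(list(reversed(cards)))
print(recount == counts)
```

Because each card records its source, the counts can also be traced back to the texts in which a word occurred.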

The enormous manual effort described above points to a crucial property of modern linguistic corpora that we tend to take for granted as naïve corpus users: A corpus provides texts in the form of linguistically meaningful and retrievable units in a reusable way.

Kaeding’s helpers invested an enormous amount of time in identifying words, sorting, and counting them manually. The great merit of computers is that they perform exactly such tasks for us automatically, much more quickly, and more reliably: They can perform search, retrieval, sorting, calculations, and even visualization of linguistic information in a mechanical way. But it is a necessary prerequisite that relevant units are encoded as identifiable entities in the data representation. In Kaeding’s approach, for example, he needed to define what a word is. This is a non-trivial decision, even in English, if we consider expressions such as don’t² or in spite of. How this is done will be introduced in the following section and, in more detail, in Chapter 3.
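How much depends on the definition of a word can be seen by applying two different definitions to the same sentence. The sketch below is an illustration under stated assumptions: the example sentence, the function names, and the one-entry multiword list are invented here. Splitting the clitic n't follows the convention used in the Penn Treebank, while joining in spite of into one unit is just one possible decision.

```python
import re

# Hypothetical list of expressions to be treated as a single unit.
MULTIWORD = [("in", "spite", "of")]


def whitespace_tokens(text):
    """Definition 1: a word is whatever stands between two spaces."""
    return text.split()


def alternative_tokens(text):
    """Definition 2: split off the clitic n't and join listed multiword expressions."""
    tokens = []
    for tok in text.split():
        # Accept both the plain and the typographic apostrophe.
        match = re.fullmatch(r"(\w+)(n['’]t)", tok)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(tok)

    merged = []
    i = 0
    while i < len(tokens):
        for mwe in MULTIWORD:
            n = len(mwe)
            if tuple(t.lower() for t in tokens[i:i + n]) == mwe:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword expression starts here: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "They don't stay in spite of the rain"
print(whitespace_tokens(sentence))
print(alternative_tokens(sentence))
```

The two definitions yield different numbers of words for the same sentence, so any frequency count depends on which one is encoded in the corpus.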

1.2.1 Electronic Processing

Making a corpus available electronically goes beyond putting a text file on a web page. At the very least, there are several technical steps involved in the creation of a corpus. The first step concerns making the text accessible in a corpus. If we already have our text in electronic form, this generally means that the file is in PDF format or is an MS Word document, to name just the most common formats. As a consequence, such files can only be opened by specific, mostly proprietary applications, and searching in such files is restricted to the search options that the application provides. Thus, we can search for individual words in PDF files, but we cannot go beyond that. When creating a corpus, we need more flexibility. This means that we need to extract the text and only the text from these formatted files. If our original text is not available electronically, we need to use a scanner to create an electronic image of the text and then use Optical Character Recognition (OCR) software that translates such an image to text. Figure 1.1 shows a scanned image of the b...

FC
Half Title
Title Page
Toc
Preface
Part I: Introduction
Part II: Linguistic Annotation
Part III: Using Linguistic Annotation in Corpus Linguistics
Part IV: Querying Linguistically Annotated Corpora
Appendix A. Penn Treebank POS Tagset
Appendix B. ICE POS Tagset
Notes
Bibliography
Index
Copyright Page
