eBook - ePub

Understanding Corpus Linguistics

Name: Understanding Corpus Linguistics
ISBN: 9781000466751

Danielle Barth, Stefan Schnell

238 pages
English

About this book

This textbook introduces the fundamental concepts and methods of corpus linguistics for students approaching this topic for the first time, putting specific emphasis on the enormous linguistic diversity represented by approximately 7,000 human languages and broadening the scope of current concerns in general corpus linguistics.

Including a basic toolkit to help the reader investigate language in different usage contexts, this book:

Shows the relevance of corpora to a range of linguistic areas from phonology to sociolinguistics and discourse
Covers recent developments in the application of corpus linguistics to the study of understudied languages and linguistic typology
Features exercises, short problems, and questions
Includes examples from real studies in over 15 languages plus multilingual corpora

Providing the necessary corpus linguistics skills to critically evaluate and replicate studies, this book is essential reading for anyone studying corpus linguistics.

Information

Publisher: Routledge
Year: 2021
eBook ISBN: 9781000466751
Topic: Languages & Linguistics
Subtopic: Linguistics

1 Introduction

DOI: 10.4324/9780429269035-1

This book is an introduction to corpus linguistics in its modern form for undergraduate and graduate students, and for advanced scholars who want to know more about corpus-based research. While some aspects of corpus linguistics can be of interest to scholars from scientific disciplines other than linguistics, the intended audience of this book is linguists. Our overall goal is to show how corpora can make relevant contributions to a better understanding of the many human languages in the world. As such, we believe that this introduction may be of particular interest not only to general corpus linguists but also to linguists who work on lesser-described languages and on comparative and typological linguistics.

1.1 What is corpus linguistics?

1.1.1 The basic idea of corpus linguistics

Corpus linguistics is essentially a specific way of studying language and languages by systematically investigating how language is used in context. A major concern for corpus linguists is that language use is massively variable. As language users, we are often aware of at least some forms of variation: we know that we use a language differently when we talk or sign to a friend face-to-face, or to our boss or colleagues during a work meeting, or when we write a text message to our partner or an email to a government agency. And we also expect to receive language with variable structure, for example, during a phone conversation or when reading a newspaper article. However, a major focus in corpus linguistics is on those forms of variation that speakers are typically not aware of, for instance, where the form and choice of expressions can be influenced in subtle ways by the structural contexts they occur in. Corpus linguists will ask questions concerning the choice of words or morphosyntactic construction, the reduction of some words (e.g. going to vs. gonna) or other variation in the sound shape of words, and so forth, depending on their context of use. The answers to these questions establish new facts about language and thus further our understanding of how human languages are used.

The corpus linguistic approach and the kind of insights it bears contrast with structuralist approaches that focus entirely on languages as abstract systems of linguistic knowledge. It is thus closely linked with the traditions of functionalist and cognitive linguistics, which have stressed that abstract representation – whatever its exact nature – is more strongly intertwined with usage than was assumed in the structuralist tradition, and that the langue-parole division is much less clear cut (cf. e.g. Bybee 2006; Diessel 2019). Scholars in other usage-oriented areas of linguistics, in particular, anthropological linguistics, sociolinguistics, and psycholinguistics, have stressed the importance of knowledge about language use and variation therein. Corpus linguistics ties in with these latter approaches: it is not just concerned with what expressions exist in any given language and can possibly be produced, but with what specific expressions language users are most likely to produce on any given occasion, like the ones mentioned above. Like other usage-oriented researchers, corpus linguists see language production as the result of a multi-layered decision-making process (cf. Diessel 2019:24–25) during which language users choose between different ways of expressing approximately the same thing. The systematic study of usage data as represented in corpora aims at discovering the rules that govern these decisions and at understanding what ramifications usage patterns have on linguistic systems (e.g. Bybee 2006).

1.1.2 Corpus linguistics in contrast to other approaches

This distinguishes corpus linguistics from a number of other approaches to human languages, for example, those delimiting the range of possible structures through acceptability judgements (as done in much theoretical and descriptive work on grammar), those comparing languages on the basis of grammars (as done in classic typology), or those investigating how language users react during comprehension when exposed to language in experimental setups (as done in psycho- and neurolinguistics), and many more.

The contrast to judgement-based linguistic research is particularly prominent in the literature. The focus of judgement-based research is on determining the system of all possible structures in a language, as one would find described in the grammar, for instance. A major concern here is that contrary to judgements – where user-judges can reject a structure as impossible – corpora can never provide this kind of so-called negative evidence, which calls corpora into question as a reliable empirical basis for grammar writing and other descriptive and analytic statements about abstract representations. Despite their focus on usage and the variation therein, corpus linguists have developed criteria in order to evaluate the relationship between coverage of possible structures in a corpus and the range of possible structures in a language system. We will discuss these in Chapter 3. Roughly, we believe that if a structure is not attested in a sufficiently large and varied corpus, it is not possible. Yet, as we will see, we can never be sure what constitutes a sufficiently rich corpus in order to cover all possible structures. While this essentially leaves real uncertainty, it should nonetheless be noted that alternatives like judgement tasks are not in a much better position: it has by now frequently been pointed out that judgements are by no means generally reliable (Gibson & Fedorenko 2013), and people can reject structures in a judgement elicitation session that they would produce and/or encounter themselves and apparently have no difficulty processing and interpreting. There are two major reasons for this: first, user-judges may often be led by what they assume to be ‘correct’ language as a kind of ideal, which may or may not in fact coincide with a prescriptive standard. Second, the acceptability of a given structure often depends on the specific context, and someone may simply fail to come up with that context during such a session, hence rejecting the structure.
Conversely, it is quite possible that user-judges will accept structures proposed by a researcher-linguist that they would never produce, for example, because they think that the expert must be right or that one should not correct a community outsider. Hence, judgements are not necessarily more reliable than corpus data. Finally, it needs to be stressed here that different people in a community may provide different, divergent judgements (in the same way that they produce different structures), which means that judgements would need to be based on as representative as possible a sample of language users. This is typically not done in judgement-based research, as is criticised by Schütze (2016).
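The uncertainty about what counts as a sufficiently large corpus can be made concrete with a small calculation. The following is only a rough sketch, not a method proposed in this book: it assumes, unrealistically, that every token is an independent opportunity for a structure to occur at a fixed rate, and the function names are hypothetical.

```python
import math


def prob_unattested(rate: float, corpus_size: int) -> float:
    """Chance that a structure never shows up in the corpus.

    Simplifying assumption (not from the text): each of the
    `corpus_size` tokens is an independent opportunity for the
    structure to occur with probability `rate`.
    """
    return (1.0 - rate) ** corpus_size


def tokens_for_confidence(rate: float, confidence: float = 0.95) -> int:
    """Smallest corpus size at which the structure is attested at
    least once with probability `confidence`, under the same
    independence assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


if __name__ == "__main__":
    # A structure used once per million tokens, in a one-million-token corpus.
    print(round(prob_unattested(1e-6, 1_000_000), 3))
    print(tokens_for_confidence(1e-6))
```

Under these assumptions, a perfectly possible structure that occurs once per million tokens has a better than one-in-three chance of being absent from a one-million-token corpus, so absence alone is weak evidence of impossibility. Real texts violate the independence assumption, since structures cluster by genre, topic, and speaker, which is one reason corpus variety matters as much as size.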

A similar type of argument among linguists working on lesser-studied languages relates to the distinction between ‘corpus data and elicitations’. An example is Evans’ (2008) criticism that targeted elicitations of specific (often rare) structures are excluded from mainstream documentary linguistics. Evans (2008) stresses the importance of elicited data to capture rare, yet possible structures in a given language (the Australian language Dalabon in his example) that may be unlikely to crop up in more authentic text data (texts that are more common in the community) that would be part of a corpus (his examples are very complex noun phrases). Documenters should, therefore, not only record the verbal behaviour characteristic of a speech community, but also collect elicited data. As pointed out by Himmelmann (2012), however, the reliability of elicitations depends to a large degree on the experience of speakers with the structures in question. Rare structures are for this reason problematic to elicit, since speakers lack routines of producing and interpreting them. Moreover, we have no reason to accept the construct of an ideal language user representing a homogeneous language community, so that again targeted elicitations would need to be conducted with a representative sample of speakers to attain some degree of reliability. We should point out here that we agree with Evans’ (2008) view that elicitations are a useful source of information, and we likewise reject a view that corpora should only include ‘real life language use’ as McEnery and Wilson (2001) call it in their textbook. Elicitations can form part of corpus data, then, but they are not a fast-and-easy alternative to other procedures of data collection to fill in the gaps. And like corpus data, elicitations are subject to the same considerations of representativeness and saturation that we will discuss in Chapter 3.

We end this section with an anecdote that underscores the particular value of corpus linguistics even for system-oriented descriptive linguistics. It is reported again by Nick Evans in Meakins et al. (2018:13–16). Due to a request by the community, Evans had been engaged in a Bible translation project in the community of Nen speakers in southern Papua New Guinea, and while he saw this project more as a sideline of his fieldwork on the language in a spirit of ‘giving back’, the Nen Bible text revealed expressions corresponding to so-called ‘free-selection’ pronouns in English, like ‘anyone’, ‘whoever’, etc. These had – despite the Nen speakers’ overall profound proficiency in English – been virtually impossible to elicit, but in the Bible texts, they were there, all of a sudden. This shows that language use can reveal structures in specific contexts that linguists may have a hard time imagining, and corpus linguistics also has this kind of explorative data-driven facet. In regard to targeted elicitations, it underscores how difficult it can be for speakers to imagine usages of some structure out of context – as is the case for out-of-context elicitations and judgements – but that the relevant forms may come up promptly once the relevant context has been brought up. It is in this way that the corpus linguistic approach bears great potential not only for the study of language use but also for the demarcation of possible structures.

1.1.3 Corpus linguistics and usage-oriented linguistics

The core concern of corpus linguistics is with patterns of language use and their variation. Language use involves numerous decision-taking processes whereby users choose between alternative ways of expressing the same thing during text production and recipients choose between different ways of interpreting the structures they perceive. The more specific concern of corpus linguistics is to account for these decisions by systematically investigating related variants and conditions on their choice. For instance, whether a copular verb like is is realised in its full form or appears as a clitic ’s is subject to numerous factors, and corpus linguists seek to identify these and relate them to one another in modelling the variation at hand (cf. Barth 2015 for an in-depth study of such reductions in spoken English texts). In other words, what is of particular interest to corpus linguistics is not only the presence or availability of a given structure in a given language but especially the factors that govern its choice in actual language use.
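The idea of relating variants to conditions on their choice can be sketched with a toy tally of full versus contracted copula forms. This is a minimal illustration only, not the procedure used in Barth (2015): the utterances are invented, the pronoun list is a hypothetical stand-in for proper annotation, and every 's is naively treated as a contracted is, although in real data it could also be a possessive or a reduced has.

```python
import re
from collections import Counter

# Invented toy utterances, for illustration only (not real corpus data).
UTTERANCES = [
    "she's going home",
    "the teacher is going home",
    "it's late",
    "he is late",
    "the old dog is asleep",
    "that's fine",
]

# Hypothetical, deliberately small list of pronoun hosts.
PRONOUNS = {"he", "she", "it", "that", "this", "who", "what", "there"}


def copula_variants(utterances):
    """Count full 'is' vs. contracted 's by type of preceding word.

    Naive assumption: every token ending in 's is a contracted copula.
    """
    counts = Counter()
    for utterance in utterances:
        tokens = utterance.lower().split()
        for i, token in enumerate(tokens):
            contracted = re.fullmatch(r"(\w+)'s", token)
            if contracted:
                host = contracted.group(1)
                variant = "contracted"
            elif token == "is" and i > 0:
                host = tokens[i - 1]
                variant = "full"
            else:
                continue
            context = "pronoun" if host in PRONOUNS else "other"
            counts[(context, variant)] += 1
    return counts


if __name__ == "__main__":
    for (context, variant), n in sorted(copula_variants(UTTERANCES).items()):
        print(f"{context:8s} {variant:11s} {n}")
```

A tally like this covers only a single conditioning factor. A real study would draw on an annotated corpus and model several factors jointly, such as the following word and speech rate.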

A major concern with language use is shared by a range of sub-disciplines in linguistics. One of these is sociolinguistics. Sociolinguistics is concerned with the variability of language use and seeks to correlate this variability with the social features of language users and their interlocutors. For instance, the choice between the two variants of the copula, is and ’s, in spoken discourse is related to the preceding and the following words, speech rate, and other aspects of discourse context. But it is also influenced by demographic characteristics of speakers and their audiences, the social and physical setting, and other general aspects of the communicative situation. We will turn to the role of corpus linguistics in sociolinguistics in Chapter 9.

Other areas of linguistics where details of language use are of central concern are psycho- and neurolinguistics. These fields are interested in how language is processed, for example, how language users encode and decode discourse and what structures pose particular problems, reflected in processing delays. For the most part, these fields of linguistics target processing during perception and deploy various methods of measuring aspects of processing, for example, neurological EEG measures of processing delays (N400, P600) (cf. Brown & Hagoort 1993; Gouvea et al. 2010 inter alia). However, there are also strands – in particular in recent years – that pay attention to discourse production. Corpus linguistic approaches are relevant here: in one type of production-oriented research, one will examine discourse production within a controlled experiment with various stimuli intended to control for various aspects of processing. From a corpus linguistic perspective, these will simply be one set of many factors that influence the choice of structures during discourse production, in addition to other factors. In more recent work (Barth 2019a; Bell et al. 2009; Jaeger 2010; Jurafsky 2003; McDonald & Shillcock 2003; Seyfarth 2014), even free text production is investigated from a psycholinguistic perspective. A general idea here is that more frequent structures in similar contexts ...

Cover
Half Title
Series Page
Title Page
Copyright Page
Table of Contents
Acknowledgements
1 Introduction
2 Basic concepts in corpus linguistics
3 Corpus composition and corpus types
4 Levels of linguistic representation in corpus-linguistic research
5 Corpus queries
6 Corpus building
7 Corpus annotation
8 Statistical description and analysis
9 Corpora in sociolinguistics
10 Corpus linguistics and language documentation
11 Corpus-based typology
References
Index