Understanding Corpus Linguistics
eBook - ePub

  1. 238 pages
  2. English
  3. ePUB (mobile friendly)

About This Book

This textbook introduces the fundamental concepts and methods of corpus linguistics for students approaching this topic for the first time, putting specific emphasis on the enormous linguistic diversity represented by approximately 7,000 human languages and broadening the scope of current concerns in general corpus linguistics.

Including a basic toolkit to help the reader investigate language in different usage contexts, this book:

  • Shows the relevance of corpora to a range of linguistic areas from phonology to sociolinguistics and discourse
  • Covers recent developments in the application of corpus linguistics to the study of understudied languages and linguistic typology
  • Features exercises, short problems, and questions
  • Includes examples from real studies in over 15 languages plus multilingual corpora

Providing the necessary corpus linguistics skills to critically evaluate and replicate studies, this book is essential reading for anyone studying corpus linguistics.

Understanding Corpus Linguistics by Danielle Barth and Stefan Schnell is available in PDF and/or ePUB format and is listed under Languages & Linguistics.

Information

Publisher
Routledge
Year
2021
ISBN
9781000466751
Edition
1

1 Introduction

DOI: 10.4324/9780429269035-1
This book is an introduction to corpus linguistics in its modern form for undergraduate and graduate students, and for advanced scholars who want to know more about corpus-based research. While some aspects of corpus linguistics can be of interest to scholars from scientific disciplines other than linguistics, the intended audience of this book is linguists. Our overall goal is to show how corpora can make relevant contributions to a better understanding of the many human languages in the world. As such, we believe that this introduction may be of particular interest not only to general corpus linguists but also to linguists who work on lesser-described languages and on comparative and typological linguistics.

1.1 What is corpus linguistics?

1.1.1 The basic idea of corpus linguistics

Corpus linguistics is essentially a specific way of studying language and languages by systematically investigating how language is used in context. A major concern for corpus linguists is that language use is massively variable. As language users, we are often aware of at least some forms of variation: we know that we use a language differently when we talk or sign to a friend face-to-face, or to our boss or colleagues during a work meeting, or when we write a text message to our partner or an email to a government agency. And we also expect to receive language with variable structure, for example, during a phone conversation or when reading a newspaper article. However, a major focus in corpus linguistics is on those forms of variation that speakers are typically not aware of, for instance, where the form and choice of expressions can be influenced in subtle ways by the structural contexts they occur in. Corpus linguists will ask questions concerning the choice of words or morphosyntactic construction, the reduction of some words (e.g. going to vs. gonna) or other variation in the sound shape of words, and so forth, depending on their context of use. The answers to these questions establish new facts about language and thus further our understanding of how human languages are used.
The corpus linguistic approach and the kinds of insight it yields contrast with structuralist approaches that focus entirely on languages as abstract systems of linguistic knowledge. It is thus closely linked with the traditions of functionalist and cognitive linguistics, which have stressed that abstract representation – whatever its exact nature – is more strongly intertwined with usage than was assumed in the structuralist tradition, and that the langue-parole division is much less clear-cut (cf. e.g. Bybee 2006; Diessel 2019). Scholars in other usage-oriented areas of linguistics, in particular, anthropological linguistics, sociolinguistics, and psycholinguistics, have stressed the importance of knowledge about language use and variation therein. Corpus linguistics ties in with these latter approaches: it is concerned not just with what expressions exist in any given language and can be produced, but with what specific expressions language users are most likely to produce on any given occasion, like the ones mentioned above. Like other usage-oriented researchers, corpus linguists see language production as the result of a multi-layered decision-making process (cf. Diessel 2019:24–25) during which language users choose between different ways of expressing approximately the same thing. The systematic study of usage data as represented in corpora aims at discovering the rules that govern these decisions and at understanding what ramifications usage patterns have for linguistic systems (e.g. Bybee 2006).

1.1.2 Corpus linguistics in contrast to other approaches

This distinguishes corpus linguistics from a number of other approaches to human languages, for example, those delimiting the range of possible structures through acceptability judgements (as done in much theoretical and descriptive work on grammar), those comparing languages on the basis of grammars (as done in classic typology), or those investigating how language users react during comprehension when exposed to language in experimental setups (as done in psycho- and neurolinguistics), and many more.
The contrast to judgement-based linguistic research is particularly prominent in the literature. The focus of such research is on determining the system of all possible structures in a language, as one would find described in a grammar, for instance. A major concern here is that, contrary to judgements – where user-judges can reject a structure as impossible – corpora can never provide this kind of so-called negative evidence, which calls corpora into question as a reliable empirical basis for grammar writing and other descriptive and analytic statements about abstract representations. Despite their focus on usage and the variation therein, corpus linguists have developed criteria for evaluating the relationship between the coverage of structures in a corpus and the range of possible structures in a language system. We will discuss these in Chapter 3. Roughly, we believe that if a structure is not attested in a sufficiently large and varied corpus, it is not possible. Yet, as we will see, we can never be sure what constitutes a sufficiently rich corpus to cover all possible structures. While this leaves real uncertainty, it should nonetheless be noted that alternatives like judgement tasks are not in a much better position: it has by now frequently been pointed out that judgements are by no means generally reliable (Gibson & Fedorenko 2013), and people can reject structures in a judgement elicitation session that they would produce and/or encounter themselves and apparently have no difficulty processing and interpreting. There are two major reasons for this. First, user-judges may often be led by what they assume to be ‘correct’ language as a kind of ideal, which may or may not in fact coincide with a prescriptive standard. Second, the acceptability of a given structure often depends on the specific context, and someone may simply fail to come up with that context during such a session, hence rejecting the structure.
Conversely, it is quite possible that user-judges will accept structures proposed by a researcher-linguist that they would never produce, for example, because they think that the expert must be right or that one should not correct a community outsider. Hence, judgements are not necessarily more reliable than corpus data. Finally, it needs to be stressed here that different people in a community may provide divergent judgements (in the same way that they produce different structures), which means that judgements would need to be based on as representative a sample of language users as possible. This is typically not done in judgement-based research, as is criticised by Schütze (2016).
A similar type of argument among linguists working on lesser-studied languages relates to the distinction between ‘corpus data and elicitations’. An example is Evans’ (2008) criticism that targeted elicitations of specific (often rare) structures are excluded from mainstream documentary linguistics. Evans (2008) stresses the importance of elicited data to capture rare, yet possible structures in a given language (the Australian language Dalabon in his example) that may be unlikely to crop up in more authentic text data (texts that are more common in the community) that would be part of a corpus (his examples are very complex noun phrases). Documenters should, therefore, not only record the verbal behaviour characteristic of a speech community, but also collect elicited data. As pointed out by Himmelmann (2012), however, the reliability of elicitations depends to a large degree on the experience of speakers with the structures in question. Rare structures are for this reason problematic to elicit, since speakers lack routines for producing and interpreting them. Moreover, we have no reason to accept the construct of an ideal language user representing a homogeneous language community, so that again targeted elicitations would need to be conducted with a representative sample of speakers to attain some degree of reliability. We should point out here that we agree with Evans’ (2008) view that elicitations are a useful source of information, and we likewise reject the view that corpora should only include ‘real life language use’, as McEnery and Wilson (2001) call it in their textbook. Elicitations can form part of corpus data, then, but they are not a fast-and-easy alternative to other procedures of data collection to fill in the gaps. And like corpus data, elicitations are subject to the same considerations of representativeness and saturation that we will discuss in Chapter 3.
We end this section with an anecdote that underscores the particular value of corpus linguistics even for system-oriented descriptive linguistics. It is reported again by Nick Evans in Meakins et al. (2018:13–16). At the request of the community, Evans had been engaged in a Bible translation project in the community of Nen speakers in southern Papua New Guinea, and while he saw this project more as a sideline of his fieldwork on the language in a spirit of ‘giving back’, the Nen Bible text revealed expressions corresponding to so-called ‘free-selection’ pronouns in English, like ‘anyone’, ‘whoever’, etc. These had – despite the Nen speakers’ overall profound proficiency in English – been virtually impossible to elicit, but in the Bible texts, they were there, all of a sudden. This shows that language use can reveal structures in specific contexts that linguists may have a hard time imagining, and corpus linguistics also has this kind of explorative data-driven facet. In regard to targeted elicitations, it underscores how difficult it can be for speakers to imagine usages of some structure out of context – as is the case for out-of-context elicitations and judgements – but that the relevant forms may come up promptly once the relevant context has been brought up. It is in this way that the corpus linguistic approach holds great potential not only for the study of language use but also for the demarcation of possible structures.
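The negative-evidence problem discussed in this section can be made concrete with a small calculation. The following is only a minimal sketch under strong simplifying assumptions that are not part of the original text: a structure is taken to occur independently in each sentence at a fixed rate, and the function name and the example rate are hypothetical. Under those assumptions, the chance that a possible structure is never attested in a corpus of n sentences is (1 - p) to the power n.

```python
def prob_unattested(rate_per_sentence: float, corpus_sentences: int) -> float:
    """Probability that a structure never occurs in the corpus.

    Assumes (unrealistically) independent sentences and a fixed
    per-sentence rate; real usage is clustered by genre and context.
    """
    return (1.0 - rate_per_sentence) ** corpus_sentences


# Hypothetical rare structure: one occurrence per 10,000 sentences on average.
for n in (1_000, 10_000, 100_000):
    print(n, prob_unattested(1e-4, n))
```

With these illustrative numbers, a small corpus will very probably miss the structure altogether, while a much larger one almost certainly will not. Since the true rate of a rare structure is unknown in practice, the sketch also shows why it is hard to say in advance when a corpus is large enough.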

1.1.3 Corpus linguistics and usage-oriented linguistics

The core concern of corpus linguistics is with patterns of language use and their variation. Language use involves numerous decision-making processes whereby users choose between alternative ways of expressing the same thing during text production and recipients choose between different ways of interpreting the structures they perceive. The more specific concern of corpus linguistics is to account for these decisions by systematically investigating related variants and conditions on their choice. For instance, whether a copular verb like is is realised in its full form or appears as a clitic ’s is subject to numerous factors, and corpus linguists seek to identify these and relate them to one another in modelling the variation at hand (cf. Barth 2015 for an in-depth study of such reductions in spoken English texts). In other words, what is of particular interest to corpus linguistics is not only the presence or availability of a given structure in a given language but especially the factors that govern its choice in actual language use.
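A first step towards relating variants to conditions on their choice is a simple cross-tabulation. The sketch below is illustrative only: the observations are invented, the single factor (word class of the preceding word) is an assumption chosen for the example, and the function name is hypothetical. It is not the method or the data of Barth (2015).

```python
from collections import Counter, defaultdict

# Hypothetical toy observations: (preceding word class, copula variant).
# These rows are invented for illustration, not taken from any study.
observations = [
    ("pronoun", "'s"),
    ("pronoun", "'s"),
    ("pronoun", "is"),
    ("noun", "is"),
    ("noun", "is"),
    ("noun", "'s"),
    ("pronoun", "'s"),
    ("noun", "is"),
]


def variant_rates(rows):
    """Return, for each context, the proportion of each variant."""
    counts = defaultdict(Counter)
    for context, variant in rows:
        counts[context][variant] += 1
    rates = {}
    for context, variant_counts in counts.items():
        total = sum(variant_counts.values())
        rates[context] = {v: n / total for v, n in variant_counts.items()}
    return rates


rates = variant_rates(observations)
for context in sorted(rates):
    for variant in sorted(rates[context]):
        print(f"{context:8s} {variant:3s} {rates[context][variant]:.2f}")
```

A table like this covers one factor at a time. Relating several factors to one another, as described above, would call for a statistical model rather than raw proportions.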
A major concern with language use is shared by a range of sub-disciplines in linguistics. One of these is sociolinguistics. Sociolinguistics is concerned with the variability of language use and seeks to correlate this variability with the social features of language users and their interlocutors. For instance, the choice between the two variants of the copula, is and ’s, in spoken discourse is related to the preceding and the following words, speech rate, and other aspects of discourse context. But it is also influenced by demographic characteristics of speakers and their audiences, the social and physical setting, and other general aspects of the communicative situation. We will turn to the role of corpus linguistics in sociolinguistics in Chapter 9.
Other areas of linguistics where details of language use are of central concern are psycho- and neurolinguistics. These fields are interested in how language is processed, for example, how language users encode and decode discourse and what structures pose particular problems, reflected in processing delays. For the most part, these fields of linguistics target processing during perception and deploy various methods of measuring aspects of processing, for example, neurological EEG measures of processing delays (N400, P600) (cf. Brown & Hagoort 1993; Gouvea et al. 2010 inter alia). However, there are also strands – in particular in recent years – that pay attention to discourse production. Corpus linguistic approaches are relevant here: in one type of production-oriented research, one will examine discourse production within a controlled experiment with various stimuli intended to control for various aspects of processing. From a corpus linguistic perspective, these will simply be one set of many factors that influence the choice of structures during discourse production, in addition to other factors. In more recent work (Barth 2019a; Bell et al. 2009; Jaeger 2010; Jurafsky 2003; McDonald & Shillcock 2003; Seyfarth 2014), even free text production is investigated from a psycholinguistic perspective. A general idea here is that more frequent structures in similar contexts ...

Table of contents

  1. Cover
  2. Half Title
  3. Series Page
  4. Title Page
  5. Copyright Page
  6. Table of Contents
  7. Acknowledgements
  8. 1 Introduction
  9. 2 Basic concepts in corpus linguistics
  10. 3 Corpus composition and corpus types
  11. 4 Levels of linguistic representation in corpus-linguistic research
  12. 5 Corpus queries
  13. 6 Corpus building
  14. 7 Corpus annotation
  15. 8 Statistical description and analysis
  16. 9 Corpora in sociolinguistics
  17. 10 Corpus linguistics and language documentation
  18. 11 Corpus-based typology
  19. References
  20. Index