eBook - ePub

Corpus Linguistics for Grammar

Name: Corpus Linguistics for Grammar
Author: Christian Jones, Daniel Waller

A guide for research

Christian Jones, Daniel Waller

202 pages
English
ePUB (mobile friendly)

About This Book

Corpus Linguistics for Grammar provides an accessible and practical introduction to the use of corpus linguistics to analyse grammar, demonstrating the wider application of corpus data and providing readers with all the skills and information they need to carry out their own corpus-based research.

This book:

explores the kinds of corpora available and the tools which can be used to analyse them;
looks at specific ways in which features of grammar can be explored using a corpus through analysis of areas such as frequency and colligation;
contains exercises, worked examples and suggestions for further practice with each chapter;
provides three illustrative examples of potential research projects in the areas of English Literature, TESOL and English Language.

Corpus Linguistics for Grammar is essential reading for students undertaking corpus-based research into grammar, or studying within the areas of English Language, Literature, Applied Linguistics and TESOL.

Information

Publisher: Routledge
Year: 2015
ISBN: 9781317499008
Topic: Languages & Linguistics
Subtopic: Linguistics

Part 1 Defining grammar and using corpora

DOI: 10.4324/9781315713779-2

Chapter 1 What is a corpus? What can a corpus tell us?

DOI: 10.4324/9781315713779-3

1.1 Introduction

Suppose you had an argument with a friend as to whether Sherlock Holmes ever said ‘elementary, my dear Watson’ (he didn’t!) and you wanted to prove your case; how would you go about doing it? One option would be to read all of the novels and short stories, but presumably you would have to get your friend to do the same to verify the truth of what you find. The other would be to turn to an electronic database that contained all of the Holmes stories and then search for the phrase. Essentially, this is how a corpus can help you.
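The idea of searching an electronic collection for a phrase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_phrase` and the two short texts in `toy_corpus` are invented placeholders, not quotations from the Holmes stories, and a real corpus tool would offer far more than a plain substring search.

```python
def find_phrase(corpus, phrase):
    """Return (title, position) pairs for each case-insensitive match."""
    needle = phrase.lower()
    hits = []
    for title, text in corpus.items():
        haystack = text.lower()
        start = haystack.find(needle)
        while start != -1:
            hits.append((title, start))
            start = haystack.find(needle, start + 1)
    return hits


# Placeholder texts invented for illustration; they are not quotations.
toy_corpus = {
    "story_a": "It is elementary, said the detective. Quite elementary.",
    "story_b": "My dear Watson, the answer was in the letter.",
}

# The full phrase never occurs, even though its parts do.
print(find_phrase(toy_corpus, "elementary, my dear Watson"))
print(find_phrase(toy_corpus, "elementary"))
```

An empty result for the full phrase, alongside hits for its individual parts, is the kind of evidence that would settle the argument without anyone rereading every story.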

This chapter will explain what a corpus is and why we may wish to consult one when trying to analyse grammatical and lexico-grammatical patterns. We will demonstrate what different types of corpora exist, including examples of various spoken and written corpora with different designs. We will then move on to an explanation of what information a corpus can provide us with and why we might want to use one to analyse areas such as frequency or grammatical patterns, to provide robust evidence of language in use. We will also examine how corpora have been used within the development of corpus-informed dictionaries and grammars. All the samples we use will be taken from open-access corpora (corpora on the internet that are free to access). By using resources that anyone can access, we aim to encourage the reader to look at these corpora for themselves.

1.2 What is a corpus?

A corpus is simply an electronically stored, searchable collection of texts. These texts may be written or spoken and may vary in length, but generally they will be longer than a single speaking turn or a single written sentence. They are normally measured in terms of the number of words they contain or, to use a word common in most corpora, the number of tokens. Consider an analysis of the sentence above:

They are normally measured in terms of the number of words they contain or, to use a word common in most corpora, the number of tokens.

This sentence has a total of twenty-six tokens in it.

We can also measure a corpus by the number of different word types it contains, i.e. how many distinct words occur in it: how many different adjectives, how many different verbs, etc. If we group the words in the sentence above by word class, we can count how many different types it contains.

Pronouns: they × 2

Verbs: are, measured, contain, use

Nouns: terms, number × 2, words, word, corpora, tokens

Adjectives: common

Adverbs: normally

Determiners: the × 2, a, most

Prepositions: in × 2, of × 3, to

Conjunction: or

Therefore there are twenty different types in the text.

Types and tokens can also be compared by dividing the number of types by the number of tokens, giving us a type-token ratio. In this case that is 20 divided by 26, multiplied by 100, which gives a type-token ratio of approximately 77%. Obviously, in this example we have used only one sentence, which is a sample size that most researchers would not use. When looking at a corpus, the type-token ratio simply allows a researcher to see how varied a collection of texts may or may not be; in general, the more types there are in comparison to the number of tokens, the more lexically varied the text.
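The counts above can be reproduced with a short script. This is a minimal sketch, assuming a simplified tokenizer that lowercases the text and treats each run of letters (optionally joined by a hyphen or apostrophe) as one token; the names `tokenize` and `type_token_ratio` are illustrative, and real corpus tools may tokenize differently and so give slightly different figures.

```python
import re

SENTENCE = (
    "They are normally measured in terms of the number of words they "
    "contain or to use a word common in most corpora, the number of tokens."
)


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simplified rule)."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def type_token_ratio(tokens):
    """Return distinct types divided by total tokens, as a percentage."""
    return len(set(tokens)) / len(tokens) * 100


tokens = tokenize(SENTENCE)   # every running word, repeats included
types = set(tokens)           # each distinct word form counted once
ratio = type_token_ratio(tokens)

print(len(tokens), len(types), round(ratio, 1))
```

Lowercasing means that "They" and "they" count as one type, which matches the manual count of twenty types in twenty-six tokens.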

Corpora vary enormously in size: there is no minimum number of tokens they should contain, nor indeed any set maximum size. In general, written corpora tend to be larger due to the relative ease of locating and storing electronic texts and the time-consuming nature of transcribing spoken data. It is also fair to say that a small corpus can be just as effective as a large one, depending on the purpose for which it is used and the principles behind its construction, a point we shall go on to discuss in 1.3. However, at this stage, it is instructive to compare the size of many of the corpora we will use in this book, alongside some others that are commonly used by publishers. These details are shown in Table 1.1 and there is more information given on the open-access corpora in Chapter 2.

1.3 Different types of corpora and good corpus design

Corpora can be mono-modal (through one medium, typically text) or multi-modal (through more than one medium, typically text and video), as described by Adolphs and Carter (2013). Due to costs, most corpora are mono-modal, although increasingly multi-modal corpora are being developed (see Adolphs and Carter, 2013 for examples). According to Sinclair (1991), a corpus should consist of a principled collection of texts. This means that a corpus should contain texts that can answer the questions we wish to ask of it.

Table 1.1 Examples of corpora
Corpus name	Spoken/written or both	Number of tokens	Text types	Availability	Dates
Brigham Young University-British National Corpus (BYU-BNC) (Davies, 2004)	Both	100 million	Newspapers, fiction, journals, academic books, published and unpublished letters, school and university essays, unscripted conversation, meetings, radio phone-ins and shows	Open-access (registration needed)	1980s–1993
Corpus of Contemporary American English (COCA) (Davies, 2008)	Both	450 million	Fiction, newspapers, magazines, academic texts, unscripted conversations	Open-access (registration needed)	1990–2012
Corpus of Global Web-Based English (GloWbE) (Davies, 2013)	Written	1.9 billion	Web pages from 20 English-speaking countries	Open-access (registration needed)	2013
Vienna-Oxford International Corpus of English (VOICE) (Seidlhofer et al., 2013)	Spoken (English used as a Lingua Franca)	1 million words	Interviews, press conferences, service encounters, seminar discussions, working group discussions, workshop discussions, meetings, panels, question-answer sessions, conversations	Open-access (registration needed)	2008–2011
Cambridge English Corpus (CEC)	Spoken and written	Multi-billion words	Learner English, business English, academic English, unscripted conversations	No general access	No dates given
The Cambridge English Profile Corpus (CEPC)	Spoken and written (learner data)	10 million words	Spoken and written texts from English language tests	Access to the English vocabulary profile available. Once complete, parts of the CEPC will be open-access	2005–present

By way of example, if we wished to analyse the performance of learners in a set of English language tests, we would need samples of their written and spoken work from the tests to be able to make realistic statements about the language in use. We would also need to make decisions about whether to include students who pass or fail tests with a particular mark. Other variables we would need to acknowledge and control for are the age and nationalities of the candidates. If the test is taken by a range of nationalities, for example, we would need a set of tests that is representative of those nationalities. We would also need to make a decision about how many words (or tokens) to include. This should be based upon two aspects: what we intend to use the corpus for and, practically, how many texts we can collect in the time available to us.
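One way to keep a sample representative of the candidates' nationalities is to draw the same proportion of scripts from each group. The following is a rough sketch of that idea, not a method described in the book: the function `stratified_sample`, the nationality labels "A", "B" and "C", and the figures of 60, 30 and 10 scripts are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(scripts, key, fraction, seed=0):
    """Sample the same fraction from each group so proportions are kept."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    groups = defaultdict(list)
    for script in scripts:
        groups[script[key]].append(script)
    sample = []
    for members in groups.values():
        # Take at least one script from every group, however small.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical test scripts: 60, 30 and 10 from three nationality groups.
scripts = (
    [{"id": i, "nationality": "A"} for i in range(60)]
    + [{"id": i, "nationality": "B"} for i in range(60, 90)]
    + [{"id": i, "nationality": "C"} for i in range(90, 100)]
)

sample = stratified_sample(scripts, "nationality", fraction=0.1)
counts = Counter(script["nationality"] for script in sample)
print(dict(counts))
```

The same approach could be extended to other variables mentioned above, such as age or test result, by grouping on a different key.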

In the hypothetical example of the corpus of tests, should we wish to make statements about how a grammatical pattern is used across different levels, then clearly we would need a lot more words than if we wished to investigate how a particular pattern was used only in a written test at one particular level. Finally, we would need to decide upon the type of corpus we need. For example, a mono-modal corpus of texts would give us information about candidates’ writing and speech but, in the case of speech, we would be unable to comment upon their use of body language and how this acts to reinforce their message.

Try it yourself 1.1

Imagine you wish to construct a corpus to represent the following types of English and purposes. What types of texts would you need and approximately how large would each corpus need to be? A suggested answer is available at the back of the book.

A corpus of British spoken academic English. Purpose: to discover the most frequent words used by lecturers.
A corpus of Dickens’ fiction. Purpose: to discover the way lexical and grammatical patterns are used to reinforce themes.
A corpus of written requests made by colleagues in a UK university. Purpo...
