Corpus Linguistics for Grammar
eBook - ePub

A guide for research

  1. 202 pages
  2. English

About This Book

Corpus Linguistics for Grammar provides an accessible and practical introduction to the use of corpus linguistics to analyse grammar, demonstrating the wider application of corpus data and providing readers with all the skills and information they need to carry out their own corpus-based research.

This book:

  • explores the kinds of corpora available and the tools which can be used to analyse them;
  • looks at specific ways in which features of grammar can be explored using a corpus through analysis of areas such as frequency and colligation;
  • contains exercises, worked examples and suggestions for further practice with each chapter;
  • provides three illustrative examples of potential research projects in the areas of English Literature, TESOL and English Language.

Corpus Linguistics for Grammar is essential reading for students undertaking corpus-based research into grammar, or studying within the areas of English Language, Literature, Applied Linguistics and TESOL.

By Christian Jones and Daniel Waller.

Information

Publisher
Routledge
Year
2015
ISBN
9781317499008
Edition
1

Part 1 Defining grammar and using corpora

DOI: 10.4324/9781315713779-2

Chapter 1 What is a corpus? What can a corpus tell us?

DOI: 10.4324/9781315713779-3

1.1 Introduction

Suppose you had an argument with a friend as to whether Sherlock Holmes ever said ‘elementary, my dear Watson’ (he didn’t!) and you wanted to prove your case; how would you go about doing it? One option would be to read all of the novels and short stories, but presumably you would have to get your friend to do the same to verify the truth of what you find. The other would be to turn to an electronic database that contained all of the Holmes stories and then search for the phrase. Essentially, this is how a corpus can help you.
This chapter will explain what a corpus is and why we may wish to consult one when trying to analyse grammatical and lexico-grammatical patterns. We will demonstrate what different types of corpora exist, including examples of various spoken and written corpora with different designs. We will then move on to an explanation of what information a corpus can provide us with and why we might want to use one to analyse areas such as frequency or grammatical patterns, to provide robust evidence of language in use. We will also examine how corpora have been used within the development of corpus-informed dictionaries and grammars. All the samples we use will be taken from open-access corpora (corpora on the internet that are free to access). By using resources that anyone can access, we aim to encourage the reader to look at these corpora for themselves.
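The database search described in the Sherlock Holmes example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `find_phrase`, the `width` parameter and the two placeholder texts are assumptions made for the example, and the texts are invented stand-ins rather than quotations from the stories.

```python
import re

def find_phrase(texts, phrase, width=30):
    """Return (title, left context, match, right context) for each hit.

    `texts` maps a title to the full text of that document.
    The search ignores case and treats the phrase literally.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for title, text in texts.items():
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((title, left, m.group(0), right))
    return hits

# Invented placeholder texts (NOT real quotations), standing in for a corpus.
texts = {
    "story_a": "Placeholder text. It is elementary, said the detective.",
    "story_b": "Another placeholder text with no matching phrase.",
}

full_phrase_hits = find_phrase(texts, "elementary, my dear Watson")
single_word_hits = find_phrase(texts, "elementary")
print(len(full_phrase_hits), len(single_word_hits))
```

An empty result for the full phrase across every text in the collection is the kind of evidence that would settle the argument, provided the collection really does contain all of the stories.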

1.2 What is a corpus?

A corpus is simply an electronically stored, searchable collection of texts. These texts may be written or spoken and may vary in length but generally they will be longer than a single speaking turn or single written sentence. They are normally measured in terms of the number of words they contain or, to use a word common in most corpora, the number of tokens. Consider an analysis of the sentence above:
They are normally measured in terms of the number of words they contain or, to use a word common in most corpora, the number of tokens.
This sentence has a total of twenty-six tokens in it.
We can also measure a corpus by the number of different word types it may contain, i.e. how many adjectives, how many verbs, etc. If we look at the sentence above, we can see how many different types there are in the sentence.
Pronouns: they × 2
Verbs: are, measured, contain, use
Nouns: terms, number × 2, words, word, corpora, tokens
Adjectives: common
Adverbs: normally
Determiners: the × 2, a, most
Prepositions: in × 2, of × 3, to
Conjunction: or
Therefore there are twenty different types in the text.
Types and tokens can also be compared by dividing the number of types by the number of tokens, giving us a type-token ratio. In this case that is 20 divided by 26, multiplied by 100, which gives a type-token ratio of approximately 77%. Obviously, in this example we have used only one sentence, which is a sample size that most researchers would not use. When looking at a corpus, the type-token ratio simply allows a researcher to see how varied a collection of texts may or may not be; in general, the more types there are in comparison to the number of tokens, the more lexically varied the text.
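The calculation above can be checked with a short Python sketch. It rests on a simplifying assumption: a type is counted here as a distinct lower-cased word form, whereas the analysis above groups words by part of speech. For this sentence the two approaches happen to give the same count. The function name `type_token_ratio` is chosen for the example only.

```python
import re

def type_token_ratio(text):
    """Return (number of tokens, number of types, ratio as a percentage).

    Tokens are runs of letters (and apostrophes); types are the
    distinct lower-cased tokens.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    return len(tokens), len(types), 100 * len(types) / len(tokens)

sentence = ("They are normally measured in terms of the number of words "
            "they contain or to use a word common in most corpora, "
            "the number of tokens.")

n_tokens, n_types, ttr = type_token_ratio(sentence)
print(n_tokens, n_types, round(ttr, 1))  # 26 20 76.9
```

On longer texts the ratio tends to fall as common words repeat, so it is best compared across samples of similar length.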
Corpora vary enormously in size and there is no minimum limit on how many tokens they should contain or indeed no set maximum size. In general, written corpora tend to be larger due to the relative ease of locating and storing electronic texts and the time-consuming nature of transcribing spoken data. It is also fair to say that a small corpus can be just as effective as a large one, depending on the purpose for which it is used and the principles behind its construction, a point we shall go on to discuss in 1.3. However, at this stage, it is instructive to compare the size of many of the corpora we will use in this book, alongside some others that are commonly used by publishers. These details are shown in Table 1.1 and there is more information given on the open-access corpora in Chapter 2.

1.3 Different types of corpora and good corpus design

Corpora can be mono-modal (through one medium, typically text) or multi-modal (through more than one medium, typically text and video), as described by Adolphs and Carter (2013). Due to costs, most corpora are mono-modal, although increasingly multi-modal corpora are being developed (see Adolphs and Carter, 2013 for examples). According to Sinclair (1991), a corpus should consist of a principled collection of texts. This means that a corpus should contain texts selected so that they can answer the questions we want to ask of it.
Table 1.1 Examples of corpora

| Corpus name | Spoken/written or both | Number of tokens | Text types | Availability | Dates |
|---|---|---|---|---|---|
| Brigham Young University-British National Corpus (BYU-BNC) (Davies, 2004) | Both | 100 million | Newspapers, fiction, journals, academic books, published and unpublished letters, school and university essays, unscripted conversation, meetings, radio phone-ins and shows | Open-access (registration needed) | 1980s–1993 |
| Corpus of Contemporary American English (COCA) (Davies, 2008) | Both | 450 million | Fiction, newspapers, magazines, academic texts, unscripted conversations | Open-access (registration needed) | 1990–2012 |
| Corpus of Global Web-Based English (GloWbE) (Davies, 2013) | Written | 1.9 billion | Web pages from 20 English-speaking countries | Open-access (registration needed) | 2013 |
| Vienna-Oxford International Corpus of English (VOICE) (Seidlhofer et al., 2013) | Spoken (English used as a Lingua Franca) | 1 million words | Interviews, press conferences, service encounters, seminar discussions, working group discussions, workshop discussions, meetings, panels, question-answer sessions, conversations | Open-access (registration needed) | 2008–2011 |
| Cambridge English Corpus (CEC) | Spoken and written | Multi-billion words | Learner English, business English, academic English, unscripted conversations | No general access | No dates given |
| The Cambridge English Profile Corpus (CEPC) | Spoken and written (learner data) | 10 million words | Spoken and written texts from English language tests | Access to the English Vocabulary Profile available. Once complete, parts of the CEPC will be open-access | 2005–present |

By way of example, if we wished to analyse the performance of learners in a set of English language tests, we would need samples of their written and spoken work from the tests to be able to make realistic statements about the language in use. We would also need to make decisions about whether to include students who pass or fail tests with a particular mark. Other variables we would need to acknowledge and control for are the age and nationalities of the candidates. If the test is taken by a range of nationalities, for example, we would need a sample of tests that give a representative sample of those nationalities. We would also need to make a decision about how many words (or tokens) to include. This should be based upon two aspects: what we intend to use the corpus for and, practically, how many texts we can collect in the time available to us.
In the hypothetical example of the corpus of tests, should we wish to make statements about how a grammatical pattern is used across different levels, then clearly we would need a lot more words than if we wished to investigate how a particular pattern was used only in a written test at one particular level. Finally, we would need to decide upon the type of corpus we need. For example, a mono-modal corpus of texts would give us information about candidates’ writing and speech but in the case of speech, we would be unable to comment upon their use of body language and how this acts to reinforce their message.
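The point made above about collecting a representative sample of nationalities can be illustrated with a small Python sketch of proportional (stratified) sampling. Everything in it is hypothetical: the function name `stratified_sample`, the 10% sampling fraction, and the invented pool of 100 test scripts from three placeholder nationalities labelled A, B and C. It is one possible way of drawing such a sample, not a procedure prescribed by the text.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(scripts, key, fraction, seed=0):
    """Draw the same fraction of scripts from each group defined by `key`.

    Each group contributes at least one script, so small groups
    are not lost from the sample.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for script in scripts:
        groups[script[key]].append(script)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Invented pool: 60 scripts from nationality A, 30 from B, 10 from C.
scripts = (
    [{"id": i, "nationality": "A"} for i in range(60)]
    + [{"id": i, "nationality": "B"} for i in range(60, 90)]
    + [{"id": i, "nationality": "C"} for i in range(90, 100)]
)

sample = stratified_sample(scripts, "nationality", 0.1)
counts = Counter(s["nationality"] for s in sample)
print(dict(counts))
```

The sample keeps the 6:3:1 proportions of the full pool, so no nationality is over- or under-represented. The same approach could be extended to other variables mentioned above, such as age or test level.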

Try it yourself 1.1

Imagine you wish to construct a corpus to represent the following types of English and purposes. What types of texts would you need and approximately how large would each corpus need to be? A suggested answer is available at the back of the book.
  1. A corpus of British spoken academic English. Purpose: to discover the most frequent words used by lecturers.
  2. A corpus of Dickens’ fiction. Purpose: to discover the way lexical and grammatical patterns are used to reinforce themes.
  3. A corpus of written requests made by colleagues in a UK university. Purpo...

Table of contents

  1. Cover Page
  2. Half-Title Page
  3. Title Page
  4. Copyright Page
  5. Table of Contents
  6. Figures
  7. Tables
  8. Acknowledgements
  9. List of abbreviations
  10. Introduction
  11. Part 1 Corpus Linguistics for Grammar
  12. Part 2 Corpus Linguistics for Grammar
  13. Part 3 Corpus Linguistics for Grammar
  14. Suggested answers
  15. Corpus Linguistics for Grammar
  16. Index