English Corpus Linguistics

Edited by Karin Aijmer and Bengt Altenberg

About This Book

This collection of articles forms a tribute to Jan Svartvik and his pioneering work in the field. It covers corpus studies, probabilistic grammar, intuition-based and observation-based grammars, and the design and development of spoken and written text corpora in different varieties of English.


Information

Publisher: Routledge
Year: 2014
ISBN: 9781317899235
Edition: 1
1 Introduction
Karin Aijmer and Bengt Altenberg
The orientation of much linguistic research is undergoing change. I think this is a good development for humanistic subjects: it calls for more academic cross-fertilization and fresh approaches to old problems which, hopefully, will lead to a better understanding of the complexities of natural language and the marvel of human language processing. There is, in this field, a real need for people who have experience from working with ‘real’ language data.

(Svartvik 1990: 85–6)
Corpus linguistics can be described as the study of language on the basis of text corpora. Although the use of authentic examples from selected texts has a long tradition in English studies, there has been a rapid expansion of corpus linguistics in the last three decades. This development stems from two important events which took place around 1960. One was Randolph Quirk’s launching of the Survey of English Usage (SEU) with the aim of collecting a large and stylistically varied corpus as the basis for a systematic description of spoken and written English. The other was the advent of computers which made it possible to store, scan and classify large masses of material. The first machine-readable corpus was compiled by Nelson Francis and Henry Kučera at Brown University in the early 1960s. It was soon followed by others, notably the Lancaster-Oslo/Bergen (LOB) Corpus, which utilized the same format as the Brown Corpus and made it possible to compare different varieties of English.
In 1975 Jan Svartvik and his colleagues at Lund University undertook the task of making the spoken part of the SEU Corpus available in machine-readable form. The resulting London-Lund Corpus of Spoken English has greatly stimulated studies of spoken English and inspired a number of research projects (see e.g. Svartvik et al. 1982, Tottie and Bäcklund 1986, Svartvik 1990, Altenberg and Eeg-Olofsson 1990).
There now exists a large number of computerized corpora varying in size, design and research purpose, and others are under development (see the Appendix at the end of this volume). The great research potential offered by these corpora has given rise to a dramatic expansion of corpus-based research that few could have foreseen thirty years ago.
Computerized corpora have proved to be excellent resources for a wide range of research tasks. In the first place, they have provided a more realistic foundation for the study of language than earlier types of material, a fact which has given new impetus to descriptive studies of English lexis, syntax, discourse and prosody. Secondly, they have become a particularly fruitful basis for comparing different varieties of English, and for exploring the quantitative and probabilistic aspects of the language.
In all these respects, the availability and use of computerized corpora have expanded the domain of linguistic inquiry in significant ways. At the same time, this expansion has led to the development of more sophisticated research methodologies and new linguistic models. Many tasks which previously had to be done by hand can now be achieved automatically or semi-automatically by means of computer programs and other kinds of software. The most fruitful efforts in corpus linguistics have concerned automatic grammatical analysis of texts (tagging and parsing), but recently there have also been attempts to develop automatic programs for the analysis and generation of speech, and for interpreting the meaning and coherence of texts. Such programs require precise rules which simulate the knowledge that is part of the native speaker’s linguistic competence. In the development of such programs computerized corpora have served both as a source for the creation of probabilistic models of language and as a testbed for theoretically motivated language models.
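To make the notion of a corpus-derived probabilistic model a little more concrete, the following sketch (ours, not the book’s; the tiny tagged sample and the function name are invented for illustration) builds the simplest possible statistical tagger: for each word form it records the tags observed in an annotated sample and then predicts the most frequent one.

```python
from collections import Counter, defaultdict

# Invented toy sample of (word, tag) pairs; in practice the counts would
# come from a large annotated corpus such as Brown or LOB.
tagged_corpus = [
    ("the", "DET"), ("survey", "NOUN"), ("of", "PREP"), ("english", "ADJ"),
    ("usage", "NOUN"), ("records", "VERB"), ("spoken", "ADJ"),
    ("and", "CONJ"), ("written", "ADJ"), ("language", "NOUN"),
]

# For every word form, count how often it occurs with each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1

def most_likely_tag(word):
    """Return the tag most often observed with this word in the sample."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NOUN"  # crude fallback for words not seen in the sample

print(most_likely_tag("usage"))    # -> NOUN
print(most_likely_tag("corpus"))   # -> NOUN (unseen word, fallback)
```

Real tagging programs of the kind mentioned above use much richer contextual statistics, but the underlying idea is the same: frequencies observed in a corpus are turned into probabilistic predictions.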
The benefits of using machine-readable text corpora, especially grammatically ‘annotated’ ones, are now so widely recognized that it is probably true to say that most text-based research makes use of a computerized corpus in one way or another. This growing dependence on machine-readable material and on the computer as a research tool has forced ‘traditional’ linguists to cooperate with computational linguists and computer scientists to an increasing extent. In this sense, corpus linguistics has more and more become an interdisciplinary field, where team-work has prospered and different approaches have met and fertilized each other. But the central goal has remained the same: to reach a better understanding of the workings of human language. Even when corpus-based research has had a predominantly practical aim, such as providing data for language teaching, lexicography or man–machine communication, the linguistic goals have been in the forefront. Thus corpus linguistics has developed into an important framework where description, model-building and practical application prosper side by side.
Only some of these aspects can be reflected in this volume. The computational side of corpus linguistics has had to be left out almost completely. Instead, the focus has been placed on theoretical and methodological questions and the description of particular linguistic phenomena in different varieties of English.
The book is arranged in four main sections focusing on different aspects of corpus linguistics. Part 1 describes the place of corpus studies in linguistic research and discusses the goals and methods of corpus work. From its humble beginnings in the late 1950s corpus linguistics has now emerged as a recognized research paradigm with its own particular methodologies and specific goals. The landmarks in this development and the present state of corpus linguistics are outlined by Geoffrey Leech (Chapter 2), who shows how the creation of new text corpora has led to an upsurge of research on many different fronts.
Like Leech, M. A. K. Halliday emphasizes the importance of corpus studies as a source of insight into the nature of language (Chapter 3). Viewing language as inherently probabilistic, he stresses the need to investigate frequencies in texts to establish probabilities in the grammatical system – not for the purpose of tagging and parsing, but for discovering the interaction between different subsystems and for a better understanding of historical and developmental change and the variation of language across registers.
An important problem in corpus linguistics concerns the status of corpus data. Any corpus is likely to contain constructions which, although they belong to language use, should not be part of a theoretical grammar describing the speaker’s competence. Jan Aarts’ article (Chapter 4) deals with the problem of deciding which types of phenomena should be included in the grammar and which should be excluded.
The articles in Part 2 deal with the design and development of new computer corpora. These are problematic tasks which require careful consideration of such matters as corpus and sample size, stylistic composition and coverage, systems of transcription, encoding principles, etc.
The lack of a corpus of spoken American English has long been felt among researchers interested in comparing speech and writing in different regional varieties. In Chapter 5, Wallace Chafe, John Du Bois and Sandra Thompson describe their projected new corpus of spoken American English at Santa Barbara. Another important corpus in the making is the Corpus of International English. This is going to consist of both spoken and written material and will include countries where English is spoken as a second or foreign language. The organization and design of the corpus are described by Sidney Greenbaum, coordinator of the project (Chapter 6).
Methodological and descriptive problems are the subject of Part 3, which deals with the exploration of corpora.
Collocations and ‘prepatterned’ language represent the intersection of lexicon and grammar, an area which can be fruitfully studied in corpora. Graeme Kennedy (Chapter 7) uses the LOB Corpus to show that the prepositions between and through, although they are partly similar and are often confused by learners, differ considerably in their major functions and linguistic ‘ecology’. Göran Kjellmer (Chapter 8) uses an inventory of collocations in the Brown Corpus to demonstrate the prepatterned nature of a text taken from Jan Svartvik’s The Evans Statements. He also discusses the role collocations play in the native speaker’s mental lexicon and their importance for language learning and teaching. On the basis of the large Birmingham Corpus, Antoinette Renouf and John Sinclair demonstrate that grammatical ‘frameworks’ (like a … of) are statistically important and form an interesting basis for studying collocation (Chapter 9).
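As a purely illustrative aside (not part of Renouf and Sinclair’s chapter), a framework such as a … of can be extracted mechanically from any machine-readable text by counting the words that fill its middle slot; the short sample text below is invented.

```python
import re
from collections import Counter

# Invented sample text; in practice the input would be a large corpus
# such as the one used by Renouf and Sinclair.
text = """A number of studies have appeared. A couple of corpora were compared,
and a lot of attention was paid to a number of spoken texts."""

# Crude tokenization into lowercase word forms.
tokens = re.findall(r"[a-z]+", text.lower())

# Collect every word occurring in the frame 'a _ of'.
frame_fillers = Counter(
    tokens[i + 1]
    for i in range(len(tokens) - 2)
    if tokens[i] == "a" and tokens[i + 2] == "of"
)

print(frame_fillers.most_common())
# -> [('number', 2), ('couple', 1), ('lot', 1)]
```

Run over a large corpus, such counts quickly single out the handful of recurrent fillers that make a framework statistically interesting.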
Corpora are particularly useful for comparing regional and stylistic varieties of English. This may cause methodological problems if the corpora are not assembled and designed in the same way. How such problems can be solved is shown in Peter Collins’ article (Chapter 10) in which a comparison is made of the modals of necessity and obligation in three corpora representing British, American and Australian English. Variation is also the subject of Charles Meyer’s article (Chapter 11) comparing the use of appositions in the Brown, LOB and London-Lund corpora. The questions raised are how and why appositions vary between speech and writing and between different textual categories. Dieter Mindt (Chapter 12) is concerned with the relationship between syntax and semantics. Using three English corpora, he reveals systematic correspondences between semantic distinctions and their syntactic correlates in three areas: the expression of futurity, intentionality and the functions of any.
The material needed for a particular purpose can be quite small and need not be computerized. Gabriele Stein and Randolph Quirk demonstrate in their article (Chapter 13) that even a small corpus of fictional works can provide new insights into verbal–nominal phrases such as have a drink.
One area where computerized corpora have proved particularly useful is the study of linguistic variation and the stylistic properties of texts and genres. Drawing on their own large-scale explorations of computerized corpora, Douglas Biber and Edward Finegan (Chapter 14) discuss the methodological questions connected with corpus-based studies in general and multi-feature/multi-dimensional approaches in particular, such as the design of text corpora, the representativeness of text samples, the nature of text types and genres, and the form–function correspondences of linguistic features. The stylistic distinctiveness of texts is also the concern of David Crystal (Chapter 15), who considers the possibilities of extending to stylistics an analytical approach used in clinical linguistics. The procedure he explores – ‘stylistic profiling’ – is reminiscent of Biber and Finegan’s approach, but while theirs depends on multivariate statistical analysis of tagged texts, Crystal’s does not. Hence, the two approaches supplement each other in interesting ways.
While the use of ‘genuine’ examples from written sources has a long tradition in the study of English, research on natural speech data is a fairly recent development. Spoken corpora can be used for many different purposes, such as the investigation of prosodic phenomena or the functions of particular discourse items. Anna-Brita Stenström (Chapter 16) uses the London-Lund Corpus to study the repertoire and function of expletives employed by adult educated English speakers. She demonstrates that the women in the material resort to expletives more often than the men, but that they use different items in functionally different ways. Gunnel Tottie (Chapter 17) compares the use of ‘backchannels’ in British and American English conversation, showing that such a comparison is possible and can yield interesting results, although the British and American spoken corpora she uses are designed and transcribed in somewhat different ways.
Two other areas where computerized corpora have only recently begun to demonstrate their usefulness are historical linguistics and dialectology. Matti Rissanen’s diachronic study of that and zero as English noun clause links (Chapter 18) traces the gradual spread of zero, at the same time as it illustrates the methodological problem of how one can search for grammatical phenomena that have no overt lexical realization in the corpus. Ossi Ihalainen (Chapter 19) shows how machine-readable transcriptions of dialect material can provide evidence about syntactic variation across a dialect continuum in cases where questionnaires are of little use.
In the final chapter, Stig Johansson takes on the difficult task of looking into the future of corpus linguistics. Stressing the many problems that the explosive expansion of corpus studies has left unsolved, he nevertheless predicts a bright future for corpus linguistics, characterized by continued technical advances, new types of material, and exciting research possibilities.
Part 1
Goals and methods
2 The state of the art in corpus linguistics
Geoffrey Leech
2.1 Historical background
When did modern corpus linguistics begin? Should we trace it back to the era of post-Bloomfieldian structural linguistics in the USA? This was when linguists (such as Harris and Hill in the 1950s) were under the influence of a positivist and behaviourist view of the science, and regarded the ‘corpus’ as the primary explicandum of linguistics.1 For such linguists, the corpus – a sufficiently large body of naturally occurring data of the language to be investigated – was both necessary and sufficient for the task in hand, and intuitive evidence was a poor second, sometimes rejected altogether. But there is virtually a discontinuity between the corpus linguists of that era and the later variety of corpus linguists with whose work this book is concerned.
The discontinuity can be located fairly precisely in the later 1950s. Chomsky had, effectively, put to flight the corpus linguistics of the earlier generation. His view on the inadequacy of corpora, and the adequacy of intuition, became the orthodoxy of a succeeding generation of theoretical linguists:
Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.
(Chomsky, University of Texas, 1962, p. 159)
In the following year or two, the founders (as is now clear in hindsight) of a new school of corpus linguistics began their work, little noticed by the mainstream. In 1959 Randolph Quirk announced his plan for a corpus of both spoken and written British English – the Survey of English Usage (SEU) Corpus, as it came to be known. Very shortly afterwards, Nelson Francis and Henry Kucera assembled a gro...

Table of contents

  1. Cover Page
  2. Half Title page
  3. Dedication
  4. Title Page
  5. Copyright Page
  6. Contents
  7. List of Contributors
  8. Jan Svartvik
  9. Books by Jan Svartvik published by Longman
  10. Acknowledgements
  11. 1 Introduction
  12. Part 1 Goals and methods
  13. Part 2 Corpus design and development
  14. Part 3 Exploration of corpora
  15. Part 4 Prospects for the future
  16. Appendix Some computerized English text corpora
  17. References
  18. Index