Contemporary Corpus Linguistics

  1. 368 pages
  2. English

About This Book

Corpus linguistics uses large electronic databases of language to examine hypotheses about language use. These can be tested scientifically with computerised analytical tools, without the researcher's preconceptions influencing their conclusions. For this reason, corpus linguistics is a popular and expanding area of study. Contemporary Corpus Linguistics presents a comprehensive survey of the ways in which corpus linguistics is being used by researchers. Written by internationally renowned linguists, this volume of seventeen introductory chapters aims to provide a snapshot of the field of corpus linguistics. The contributors present accessible, yet detailed, analyses of recent methods and theory in Corpus Linguistics, ways of analysing corpora, and recent applications in translation, stylistics, discourse analysis and language teaching. The book represents the best of current practice in Corpus Linguistics, and as a one-volume reference will be invaluable to students and researchers looking for an overview of the field.


Information

Publisher: Continuum
Year: 2012
ISBN: 9781441109460
Edition: 1

CHAPTER 1
Introduction
Paul Baker
The chapters in this book cover new research by corpus linguists, computational linguists and linguists who use corpora. While all three groups are growing in number, I suspect that the boundaries between them are becoming more blurred than they used to be, and also that it is the last group which is experiencing the most significant increase. As an illustration, in 1995, my university had a large Linguistics and English Language department which encompassed a broad range of fields and research methodologies. There were two corpus linguistics lecturers, but not a great deal of overlap between their work and the other research going on in the department. Now, in the same department, the situation has changed remarkably, with corpora and corpus techniques being used by the majority of the academics to various degrees. Additionally, I regularly receive requests for information and help from researchers in other departments who have heard about corpus-based analysis and think it would be helpful to them. This is in contrast to the response I received ten years ago when I gave a workshop on corpus linguistics to a very resistant group of social scientists. ‘Words are beautiful things, like flowers’, complained one participant. ‘We should not put them inside computers.’
Perhaps the enthusiasm for corpus linguistics at my university is more an example of what is possible, rather than what is typical, yet a look at any online book store reveals numerous examples of published work that is not just about corpus linguistics but the corpus approach as it relates to some other aspect of linguistics (phonetics, language teaching, language acquisition, translation studies, discourse analysis, stylistics, metaphor, functional linguistics, world Englishes etc.).
One aim of this book is to address some of the more recent ways that corpus-based approaches have started to be incorporated in a range of linguistic research. A second aim is to address some of the current trends and themes that are influencing the manner in which corpus research is developing, as well as noting some of the concerns that people working closely with corpora are currently facing. Each chapter in this book follows (to a greater or lesser extent) the format of reviewing key and current work in a particular field of linguistics (e.g. stylistics, language teaching, critical discourse analysis), or aspect of corpus linguistics (e.g. software design, corpus design, annotation schemes) and then providing a recent example or case study of the author’s own research in that area. Many of the chapters have multiple foci; for example, David Oakey considers corpus design as well as the analysis of fixed collocational patterns, while Randi Reppen’s chapter looks at both the American National Corpus and language teaching. Because of this, it is difficult to divide the chapters in this book into neat subsections such as ‘corpus building’, ‘corpus software’ and ‘corpus applications’, although I have tried to order them in a way where it is possible to note relationships or similarities between those that are closer together. In the remainder of this introduction, I provide a short summary of each chapter, and end with a brief discussion of some of the themes which emerge across the book as a whole.
The book begins with a look at metaphor from a corpus-based perspective. Alice Deignan’s chapter reviews how linguists have attempted to identify metaphors in corpora, from applying sampling techniques and concordances to methods which use more automatic means, for example deriving lists of strong collocations that are semantically unrelated, which are likely to suggest metaphorical uses of language. She also considers how corpus approaches have helped metaphor theory by providing more detailed and accurate classifications of non-literal language and how corpus-based analysis has helped to challenge existing theories of metaphor. For example, with her analysis of metaphors around the word speed, she shows that ‘chunking’ often occurs, which counters the idea that linguistic metaphors are the product of underlying conceptual metaphorical networks.
A set of related methods is used by Gerlinde Mautner, who shows how corpus approaches can aid critical discourse analysis (CDA), a field which could be criticized for over-reliance on small-scale qualitative analyses, whose results may not be usefully applied to wider contexts. Mautner shows how corpus techniques such as concordancing and collocation can help to reveal semantic prosodies; for example, she finds that in general corpora the expression the elderly tends to strongly collocate with negative terms like infirm and frail. A concordance analysis shows that the exception-negating lexical bundle elderly but tends to be followed by positive adjectives (charismatic, sharp-minded), which indicates how the term elderly is regularly constructed negatively in general language use. While Mautner warns that high frequency is not necessarily indicative of popular attitudes, her chapter shows that corpus techniques offer CDA researchers another way of carrying out their analysis, which is likely to make their findings more reliable and valid.
Similarly, in Chapter 4, Michaela Mahlberg reviews approaches in the growing field of corpus stylistics, while providing a case study which focuses on how corpus methods can be used to draw conclusions about language use in fiction. While literary critics may argue that a word or phrase is used to evoke a particular emotion or meaning, Mahlberg shows how concordance analyses of reference corpora and corpora based on an author’s complete works can help to provide evidence that a particular use of language has occurred in numerous other, similar contexts. For example, she shows that Charles Dickens uses the cluster put down his knife and fork in nine of his novels as a way of contextualizing when a character is shocked by an event. Corpus techniques therefore introduce systematicity to stylistics, allowing meaningful patterns to be identified and quantified.
Jonathan Culpeper’s study of metalanguage (in this case language about the language of impoliteness) uses a new piece of corpus analysis software, the web-based Sketch Engine developed by Adam Kilgarriff and David Tugwell. Using the Oxford English Corpus (approximately two billion words in size), Culpeper shows how a large corpus can yield hundreds or even thousands of citations of relatively infrequent words. This data can therefore be used in conjunction with Sketch Engine in order to derive ‘word sketches’. A particularly impressive aspect of Sketch Engine is the way that it gives detailed collocational information based on lexico-grammatical relationships. For example, in Culpeper’s examination of the terms impolite and rude, the word sketch is able to distinguish between collocates that are modifiers (downright, plain), those which are infinitival complements (stare, ask) and those that are adjectival subjects (doormen, waiter). Culpeper’s study both points to more sophisticated studies of collocational relationships and illustrates the analytical potential of the next generation of large corpora.
Staying with research that uses new analytical tools, in Chapter 6, Laurence Anthony describes the software AntConc, an increasingly popular (and free to download via the internet) multi-platform corpus toolkit which supports the Unicode Standard. Being highly functional, AntConc allows users to generate KWIC concordance lines and concordance distribution plots. It also has tools to analyse word clusters (lexical bundles), n-grams, collocates, word frequencies and keywords. Anthony discusses how tool design is often overlooked by corpus linguists, who instead have tended to focus on corpus-building procedures. However, he argues that it is only with the right tools that corpora can be adequately exploited. Anthony reports how AntConc was designed with input from corpus users and has a simple user interface that can be used by novices, for example in classroom situations.
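To make the terms KWIC concordance and n-gram concrete, the following is a minimal Python sketch. It is not AntConc's implementation and does not reflect its interface; the tokenizer, the function names (tokenize, kwic, ngrams) and the sample sentence are all invented here for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def ngrams(tokens, n=2):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample text, used only to show the output shape.
sample = "I think it is late. I think we should go, but I think he disagrees."
tokens = tokenize(sample)

# Print one aligned concordance line per hit of the node word.
for left, node, right in kwic(tokens, "think", width=2):
    print(f"{left:>15} | {node} | {right}")

# Show the most frequent two-word sequence in the sample.
print(ngrams(tokens, 2).most_common(1))
```

A dedicated toolkit adds sorting, distribution plots and proper Unicode-aware tokenization on top of this basic idea; the sketch only shows how the node word is aligned with its left and right context.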
In Chapter 7, Adam Meyers considers issues surrounding best practice in corpus annotation. With reference to syntactic treebanks, he describes a number of different annotation schemes that are in existence and examines procedures that are used to convert one scheme to another. Meyers gives a description of GLARF (Grammatical and Logical Argument Representation Framework), a scheme which allows different annotation systems to be merged in various ways. The author argues that the utility and accuracy of annotation will be improved if there is a greater degree of coordination among annotation research groups, and that multiple annotations should be carried out on corpora that are made freely available and shared, in order to facilitate annotation merging systems.
Continuing the theme of annotation, in Chapter 8, Irina Dahlmann and Svenja Adolphs discuss a number of issues in relationship to the annotation and analysis of spoken corpora. Focusing on the concept of the multi-word expression, they carry out two separate analyses of the two-word expression I think, the first using a corpus based on a mono-modal transcript (where short pauses within speech are not marked), the second where pauses have been fully annotated. While both forms of analysis result in interesting findings with respect to the patterns of I think produced by speakers, the authors argue that with the orthographically annotated corpus a fuller and more complex picture emerges, showing that I think is virtually never interrupted by pauses and therefore fits the criteria of a multi-word expression. The authors reason that only by using fully annotated multi-modal corpora will analysts be able to develop a more comprehensive understanding of speech.
David Oakey addresses corpus design, with regard to the analysis of fixed collocation patterns, a concept similar to the multi-word expressions used by Irina Dahlmann and Svenja Adolphs in the previous chapter. In order to analyse collocations across different genres, Oakey problematizes the fact that individual texts within different genres may be of different sizes (e.g. in the British National Corpus social science texts tend to be longer than texts from the pure sciences). Should comparisons between these genres therefore be isolexical (where the sub-corpora all contain the same number of words) or isotextual (where the sub-corpora all contain the same number of texts)? Oakey’s findings, based on his own analysis of frequent fixed collocations in eight language genres, have implications for corpus builders who want to carry out studies of language variation.
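The difference between the two sampling designs can be sketched in a few lines of Python. This is only an illustration of the definitions given above, not Oakey's procedure: the function names, the choice to take texts in their stored order, the truncation of the last text and the toy genre data are all assumptions made here.

```python
def isotextual(genres, n_texts):
    """Keep the first n_texts texts of each genre: equal text counts, unequal word counts."""
    return {genre: texts[:n_texts] for genre, texts in genres.items()}

def isolexical(genres, n_words):
    """Keep texts until n_words tokens are collected, truncating the last text if needed."""
    sample = {}
    for genre, texts in genres.items():
        kept, remaining = [], n_words
        for text in texts:
            if remaining <= 0:
                break
            kept.append(text[:remaining])   # take at most the tokens still needed
            remaining -= len(kept[-1])
        sample[genre] = kept
    return sample

def word_count(texts):
    """Total number of tokens across a list of tokenized texts."""
    return sum(len(text) for text in texts)

# Invented toy data: each text is a list of placeholder tokens.
genres = {
    "social_science": [["w"] * 8, ["w"] * 6],
    "pure_science": [["w"] * 3, ["w"] * 4, ["w"] * 2],
}

by_text = isotextual(genres, n_texts=2)
by_word = isolexical(genres, n_words=9)

# Report how many texts and words each design keeps per genre.
for name, sample in (("isotextual", by_text), ("isolexical", by_word)):
    for genre, texts in sample.items():
        print(name, genre, len(texts), "texts,", word_count(texts), "words")
```

With the toy data, the isotextual sample holds two texts per genre but twice as many words for the genre with longer texts, while the isolexical sample holds nine words per genre drawn from different numbers of texts. That trade-off is the design question the chapter raises.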
In Chapter 10, Michael Oakes continues the themes that were raised in the previous chapters, also carrying out comparative analyses of a number of different genres of writing, but this time using the well-known Brown family of corpora. He shows how a range of statistical techniques can be gainfully employed in genre analysis (e.g. fiction vs news), synchronic analysis of language varieties (e.g. American vs British English) and diachronic analysis (e.g. 1960s English vs 1990s English). Starting with two-way chi-squared comparisons, Oakes moves on to more sophisticated techniques which involve comparisons of multiple genres. As well as multifactorial analysis, Oakes also considers techniques that produce visual renditions of similarity, such as dendrograms and biplots. Additionally, he examines developments in computational stylometry as well as showing how a support vector machine is being used to classify web-based genres. Finally, Oakes critically addresses concerns regarding corpus design, particularly with respect to balance and representativeness.
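As an illustration of what a two-way chi-squared comparison involves, the sketch below tests whether one word is distributed differently across two corpora, using the standard 2x2 Pearson formula without continuity correction. The function name and all counts are invented for illustration; they are not figures from the Brown family of corpora or from Oakes' chapter.

```python
def chi_squared_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-squared (1 degree of freedom, no continuity correction)
    for one word's frequency in corpus A versus corpus B."""
    a, b = freq_a, size_a - freq_a      # corpus A: word count, all other tokens
    c, d = freq_b, size_b - freq_b      # corpus B: word count, all other tokens
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Hypothetical counts: the word occurs 30 times in 1,000 tokens of corpus A
# and 10 times in 1,000 tokens of corpus B.
score = chi_squared_2x2(30, 1000, 10, 1000)
print(round(score, 2))

# 3.84 is the critical value for p < 0.05 with one degree of freedom.
print("significant at p < 0.05" if score > 3.84 else "not significant at p < 0.05")
```

The formula assumes that expected frequencies are not too small; for rare words other measures are often preferred, and comparisons across more than two corpora need the more sophisticated techniques the chapter goes on to describe.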
Moving on, Yukio Tono examines language acquisition from the perspective of learner output, arguing that carefully encoded learner corpora can facilitate the emergence of theories of learner development, based on probabilistic analyses of multiple factors; an approach which echoes Oakes’ multifactorial analysis in the previous chapter. Additionally, using statistical models that include Bayesian network theory and Data Oriented Parsing, Tono shows how over-, under- or mis-use of linguistic phenomena in essays produced by learners can be explained (or predicted) by interactions between factors such as ability level, first language interference or frequency of a particular linguistic item in textbooks.
Randi Reppen’s chapter also focuses on language learning, but from the view of the creation of teaching materials (in this case using the American National Corpus). Drawing on findings from earlier corpus studies on register variation, Reppen shows how corpus analysis can enable teachers to encourage students to focus on linguistic features that are known to be typical of, and frequent in, particular registers, in order to raise awareness about register variation among learners of English. Corpora therefore not only help to provide information about the sorts of salient linguistic features that are worth teaching to students, but they also provide an enormous amount of naturalistic data that teachers can draw on in order to create classroom-based exercises.
Similarly, Patrick Hanks reviews the contribution that corpus linguists have made towards dictionary creation in the last twenty years or so. While Hanks demonstrates that corpora afford dictionary creators the potential to add more words and more word meanings as well as accounts of typical and non-typical usage based on frequency data, he warns that a distinction needs to be made between dictionaries intended for language learners, and those for advanced users. Indeed, with the former, frequency information from corpora should be utilized in order to provide cut-off points (what to leave out), rather than offering blanket coverage of every word in a language. Additionally, Hanks discusses how corpus approaches can be of assistance in providing illustrative examples of word uses, raising a note of caution that authentic examples are not necessarily good examples, and that corpus techniques which identify normal usage will be most helpful for dictionary users.
An area related to dictionary creation is translation, which is considered in Chapter 14 by Richard Xiao and Ming Yue. After clearing up the confusion around terms like parallel, comparable, comparative, multilingual and bilingual corpora, the authors review contributions that corpus linguists have made to the various fields and sub-fields of translation studies. Then, moving away from studies that have compared closely related European languages, the authors focus on a case study which compares a corpus of Chinese fiction with a corpus of Chinese translations of English fiction, in order to examine the extent to which the hypothesized ‘translation universals’ found so far in similar language pairs are also present in a language pair that is genetically distinct.
Continuing the focus on corpora of non-Latin writing systems, Chapter 15, by Andrew Hardie, considers developments in the emerging field of South Asian corpus linguistics, where languages spoken in India, Pakistan, Bangladesh, Sri Lanka and Nepal are beginning to be examined by corpus linguists. Hardie discusses the rendering and encoding problems that were originally encountered (and now largely solved due to the Unicode Standard) when building corpora of South Asian languages, as well as describing work on their annotation and mark-up. Finally, he outlines a case study which considers the extent to which Hindi and Urdu are dialects of the same language, by examining vocabulary differences in a range of multilingual corpora. While such work is still in its early stages, it aptly demonstrates the potential that corpus linguistics has for all the world’s languages.
While Hardie describes how much of the corpus data of Indic languages he examines was derived from web-based sources, in the following chapter Robert Lew explores the concept of ‘web as corpus’, discussing the advantages and disadvantages to corpus linguists of considering the whole web as a source of corpus data. While the web clearly offers access to a much larger number of citations of rare terms and phrases, which is likely to be beneficial in terms of producing collocational analyses, Lew examines the extent to which the web can be considered to be a representative or balanced corpus, as well as looking at the types of interference which are specific to web-based texts: noise, spam and typos. Additionally, he discusses functionality and access mechanisms, concluding that the web may help to resolve language learners’ immediate lexical problems, as well as helping linguists in some contexts, but it should not replace traditionally built corpora.
Staying with texts derived from the internet, the final chapter by Brian King examines the feasibility of building, annotating and carrying out a comparative analysis of a corpus of chat-room data. The area of chat-room corpus analysis is still in its infancy, with researchers needing to quickly find solutions to new problems before research can be carried out. For example, King notes how the semi-public nature of chat-rooms raises ethical issues concerned with obtaining consent and retaining anonymity (particularly in this case, where the participants are classed as being from a ‘vulnerable’ group). Additionally, problems such as defining turn-taking, ensuring that a balanced sample is taken and categorization of linguistic phenomena are addressed.
Although the chapters in this book were chosen in order to represent a wide range of approaches that are currently being adopted within corpus-based research, covering corpus design, annotation and analysis, it is possible to identify a number of themes and trends which have organically emerged, being noted in multiple chapters. As I pointed out at the start of this chapter, it is clear that corpus-based approaches are increasingly being seen as useful to a range of linguistic disciplines, enabling new theories to be developed and older ones to be systematically tested. Fields such as translation studies (Xiao and Yue), metaphor analysis (Deignan), critical discourse analysis (Mautner), stylistics (Mahlberg), conversation analysis (Dahlmann and Adolphs) and metalanguage (Culpeper) are all benefiting from corpus approaches. The widening of corpus methods to a greater range of applications suggests that the field of linguistics (in both its applied and ‘pure’ senses) would benefit enormously if all researchers working with languages were afforded an understanding and appreciation of the ways in which corpus methods can be utilized as an effective means of linguistic enquiry.
Otherwise, as Tono points out, there is a situation where a typical corpus linguist, whose specialisms involve corpus building and annotation along with using corpus software, will attempt to apply such techniques to a field, such as stylistics or critical discourse analysis, perhaps without being fully engaged with existing theory or techniques of analysis. For example, most CDA practitioners are aware of how nominalizations can be used to obscure agency, although this might not be something that a corpus linguist with little experience of CDA would be aware of – thus, nominalizations may either be overlooked or misinterpreted even if the corpus analysis highlights them as salient or frequent.
Conversely, an applied linguist may attempt to use corpus-based methods, but may ...

Table of contents

  1. Cover
  2. Half-Title
  3. Chapter  1  Introduction – Paul Baker
  4. Chapter  2  Searching for Metaphorical Patterns in Corpora – Alice Deignan
  5. Chapter  3  Corpora and Critical Discourse Analysis – Gerlinde Mautner
  6. Chapter  4  Corpus Stylistics and the Pickwickian watering-pot – Michaela Mahlberg
  7. Chapter  5  The Metalanguage of IMPOLITENESS: Using Sketch Engine to Explore the Oxford English Corpus – Jonathan Culpeper
  8. Chapter  6  Issues in the Design and Development of Software Tools for Corpus Studies: The Case for Collaboration – Laurence Anthony
  9. Chapter  7  Compatibility Between Corpus Annotation Efforts and its Effect on Computational Linguistics – Adam Meyers
  10. Chapter  8  Spoken Corpus Analysis: Multimodal Approaches to Language Description – Irina Dahlmann and Svenja Adolphs
  11. Chapter  9  Fixed Collocational Patterns in Isolexical and Isotextual Versions of a Corpus – David Oakey
  12. Chapter 10  Corpus Linguistics and Language Variation – Michael P. Oakes
  13. Chapter 11  Integrating Learner Corpus Analysis into a Probabilistic Model of Second Language Acquisition – Yukio Tono
  14. Chapter 12  English Language Teaching and Corpus Linguistics: Lessons from the American National Corpus – Randi Reppen
  15. Chapter 13  The Impact of Corpora on Dictionaries – Patrick Hanks
  16. Chapter 14  Using Corpora in Translation Studies: The State of the Art – Richard Xiao and Ming Yue
  17. Chapter 15  Corpus Linguistics and the Languages of South Asia: Some Current Research Directions – Andrew Hardie
  18. Chapter 16  The Web as Corpus Versus Traditional Corpora: Their Relative Utility for Linguists and Language Learners – Robert Lew
  19. Chapter 17  Building and Analysing Corpora of Computer-Mediated Communication – Brian King
  20. Bibliography
  21. Index