eBook - ePub
Corpus Linguistics for Vocabulary
A Guide for Research
Paweł Szudarski
- 228 pages
- English
About this book
Corpus Linguistics for Vocabulary provides a practical introduction to using corpus linguistics in vocabulary studies. Using freely available corpus tools, the author provides a step-by-step guide on how corpora can be used to explore key vocabulary-related research questions and topics such as:
- The frequency of English words and how to choose which ones should be taught to learners;
- How spoken vocabulary differs from written vocabulary, and how academic vocabulary differs from general vocabulary;
- How vocabulary contributes to the structure of discourse, and the pragmatic functions it fulfils.
Featuring case studies and tasks throughout, Corpus Linguistics for Vocabulary provides a clear and accessible guide and is essential reading for students and teachers wanting to understand, appreciate and conduct corpus-based research in vocabulary studies.
Chapter 1
What is corpus linguistics?
1.1 What is a corpus and corpus-based analysis?
In simple terms, corpus linguistics can be defined as the study of “the compilation and analysis of corpora” (Cheng 2012: 6), which are large collections of “naturally occurring language texts chosen to characterize a state or a variety of language” (Sinclair 1991: 171). According to Hunston (2002), even though corpus linguistics is a relatively new field, it has revolutionized language studies by providing new ways of analyzing and describing language use. Hunston emphasizes that corpora consist of texts stored in an electronic format, which enables researchers to use special software (called concordancers) to conduct automatic searches and gain insights into the structure and regularity of naturally occurring language. Other important features of corpus-based analysis are listed by Biber et al. (1998: 4), who characterize it in the following way:
- it is empirical, analyzing the actual patterns of use in natural texts;
- it utilizes a large and principled collection of natural texts, known as a corpus, as the basis for analysis;
- it makes extensive use of computers for analysis, using both automatic and interactive techniques;
- it depends on both quantitative and qualitative analytical techniques.
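To make the idea of a concordancer concrete, here is a minimal sketch of a keyword-in-context (KWIC) search in Python. It is an illustration only, not the algorithm used by any particular concordancing tool; the function name `kwic`, the `width` parameter and the sample sentence are all invented for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    # \b restricts matches to whole words; IGNORECASE also finds "Make".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, used only to demonstrate the output format.
sample = ("We make a plan every week. They make progress slowly, "
          "but we never make excuses.")

for line in kwic(sample, "make", width=20):
    print(line)
```

Reading down the aligned column of hits is the interactive, qualitative side of corpus analysis, while counting the hits is the automatic, quantitative side.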
At the same time, as in any discipline, corpus linguistics encompasses different approaches to how the actual analysis should be conducted and how its results should be interpreted. In a useful typology, Tognini-Bonelli (2001) distinguishes between corpus-based and corpus-driven approaches. In the former, corpus linguistics is perceived as a methodology (e.g. McEnery et al. 2003) in which corpus data are used to verify existing theories of language. In contrast, corpus-driven approaches tend to view corpus linguistics as a theory which offers a new way of looking at the creation of meaning in a narrow sense and at different aspects of language use in a broader sense (e.g. Stubbs 1993 or Teubert 2005). This demonstrates that the field of corpus linguistics is far from homogeneous, with some authors regarding it as a theoretical approach which “may refine and define a range of theories of language” (McEnery and Hardie 2012: 1), while others (probably the majority of corpus linguists) use it as a methodology that enhances research into language use and variation (Biber and Reppen 2015). However, irrespective of which approach we adhere to, it needs to be acknowledged that corpus linguistics has the potential to change one’s perspective on the study of language as a whole and, by providing powerful tools of analysis, it opens up new avenues for linguistic research.
1.2 Corpus design
If we wish corpora to be principled collections of texts whose aim is to represent a specific kind of language, it is important to realize that they should be designed and created according to specific criteria. The following paragraphs aim to present the key criteria of corpus design and explain why corpus linguists treat them as guidelines for the development of corpora. This is particularly important if you consider compiling your own small-scale corpus, for all the methodological decisions related to the structure of your corpus will have an impact on the validity of your findings.
The first author to be discussed in relation to corpus design is John Sinclair, who is often referred to as the father of corpus linguistics. In 2005, Sinclair proposed a set of principles that should be considered with regard to the process of developing a corpus:
1 The contents of a corpus should be selected without regard for the language they contain, but according to their communicative function in the community in which they arise.
2 Corpus builders should strive to make their corpus as representative as possible of the language from which it is chosen.
3 Only those components of corpora which have been designed to be independently contrastive should be contrasted.
4 Criteria for determining the structure of a corpus should be small in number, clearly separate from each other and efficient as a group in delineating a corpus that is representative of the language variety under examination.
5 Any information about a text other than the alphanumeric string of its words and punctuation should be stored separately from the plain text and merged when required in applications.
6 Samples of language for a corpus should, wherever possible, consist of entire documents or transcriptions of complete speech events, or should get as close to this target as possible. This means that samples will differ substantially in size.
7 The design and composition of a corpus should be documented fully with information about the contents and arguments in justification of the decisions taken.
8 The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.
9 Any control of subject matter in a corpus should be imposed by the use of external, and not internal, criteria.
10 A corpus should aim for homogeneity in its components while maintaining adequate coverage, and rogue texts should be avoided.
These points are valuable not only as practical advice on how to build a corpus, but also because they point to the importance of the theoretical considerations that underpin this process. Naturally, there is no such thing as an ideal corpus and consequently any attempt to create a corpus is “a compromise between the hoped for and the achievable” (Nelson 2010: 60). However, if you are planning to compile your own corpus, it is essential that you take the above principles into account and consider the impact of the structure and shape of your corpus on the quality of the information it will provide. The key decisions you are likely to make concern the size, representativeness and balance of the data within your corpus, and much of the corpus literature highlights the interrelatedness of these factors (see Hunston 2002 or McEnery and Hardie 2012 for details).
The appropriate size of a corpus depends largely on the research question pursued. If you are interested in checking the occurrence of difficult, low-frequency words or phrases (e.g. “foray”), then you need a large corpus so that you are able to find enough examples of how they are used in authentic texts. In turn, if your research focuses on frequent words such as “make” or “give”, even small corpora should provide sufficient empirical evidence for your analysis. Thus, there is no single right size for a corpus. Individual researchers and teachers working in the field of corpus linguistics make use of both multimillion-word corpora (e.g. the British National Corpus, BNC) and specialized, do-it-yourself collections of texts that are much smaller in size but suit the purposes of the local contexts in which they are created (see Chapter 8 for a discussion of general vs. specialized corpora). In addition, written corpora predominate in the field of corpus linguistics and are much larger than spoken corpora, which results from the difficulty of collecting and organizing spoken data (more details to follow).
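The link between word frequency and corpus size can be sketched with simple arithmetic. The Python snippet below is a rough illustration: the two rates (1 and 800 occurrences per million words) and the two corpus sizes are invented for the example and are not measurements of “foray”, “make” or any real corpus.

```python
def per_million(raw_count, corpus_size_words):
    """Normalize a raw count to a frequency per million words."""
    return raw_count * 1_000_000 / corpus_size_words

def expected_hits(freq_per_million, corpus_size_words):
    """Estimate how many examples a corpus of the given size would yield."""
    return freq_per_million * corpus_size_words / 1_000_000

# Hypothetical rates, chosen only to contrast a rare and a frequent word.
RARE_WORD_RATE = 1        # occurrences per million words (assumed)
FREQUENT_WORD_RATE = 800  # occurrences per million words (assumed)

# Hypothetical corpus sizes: a small do-it-yourself corpus vs. a large one.
for size in (100_000, 100_000_000):
    rare = expected_hits(RARE_WORD_RATE, size)
    frequent = expected_hits(FREQUENT_WORD_RATE, size)
    print(f"{size:>11,} words: rare ~{rare:g} hits, frequent ~{frequent:g} hits")
```

Under these assumed rates, the small corpus would be expected to contain far fewer than one example of the rare word but dozens of the frequent one, which is why the research question should drive the choice of corpus size.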
Two other criteria that play a central role in corpus design are the representativeness and balance of corpus data. Biber (1993: 243) defines representativeness as “the extent to which a sample includes the full range of variability in a population”. In other words, representativeness concerns how well a corpus represents the language or variety under study. Balance is a related notion that refers to the structure and type of data used to build a corpus. As explained by Hunston (2002), a well-balanced corpus should consist of several subsections that represent different types (registers) of language use. Importantly, all of the sections ought to contain a roughly equal number of words. A good illustration of a well-balanced corpus is the Corpus of Contemporary American English or COCA (Davies 2011). It is a corpus of general English which consists of 520 million words divided into five sections. Each of the sections contains roughly 100 million words and represents a different register of use (spoken language, fiction, popular magazines, newspapers and academic language). Its size and design make it one of the biggest and best-developed corpora of contemporary English.
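If you compile your own corpus, a quick check of how evenly its words are spread across sections can be sketched as follows. This Python snippet is a minimal illustration: the section names echo the registers mentioned above, but the word counts and the 5% tolerance are invented for the example and do not come from COCA or any other real corpus.

```python
def section_shares(word_counts):
    """Return each section's share of the total word count."""
    total = sum(word_counts.values())
    return {name: count / total for name, count in word_counts.items()}

def is_roughly_balanced(word_counts, tolerance=0.05):
    """True if every section's share is within `tolerance` of an equal split."""
    target = 1 / len(word_counts)
    shares = section_shares(word_counts)
    return all(abs(share - target) <= tolerance for share in shares.values())

# Hypothetical word counts for a small five-section corpus.
toy_corpus = {
    "spoken": 98_000,
    "fiction": 102_000,
    "magazines": 101_000,
    "newspapers": 99_000,
    "academic": 100_000,
}

for name, share in section_shares(toy_corpus).items():
    print(f"{name:<11}{share:.1%}")
print("roughly balanced:", is_roughly_balanced(toy_corpus))
```

Equal section sizes are only one reading of balance; whether the sections themselves are representative of their registers is a separate question that no word count can answer.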
Another important aspect that needs to be discussed is annotation. This is an umbrella term that refers to procedures such as tagging and parsing which are carried out to add linguistic information to a corpus (Hunston 2002: 18). As Cheng (2012: 85) explains, the aim of annotation is to “enhance the corpus contents” in terms of the linguistic description of the data it contains. In their discussion of annotation, McEnery and Hardie (2012: 29) distinguish between three types of information that can accompany a corpus: metadata (details about a given text such as the name of the author), textual markup (information about the formatting of the text such as where italics start and end or when a given speaker starts speaking) and linguistic annotation (assigning grammatical categories or tags to all the words within a corpus).
Crucially, the type and amount of information added to a corpus depend on the kind of analysis envisioned by its compilers. Referring to this issue, Cheng (2012) enumerates different levels or layers of annotation, the most important of which include: part-of-speech (PoS) tagging, syntactic (grammatical) parsing, error annotation, semantic annotation and phonetic annotation. Which of these layers is applied to a given corpus depends on the type of data collected and, even more importantly, on the research purposes it will serve. For instance, if you wish to analyze the number of errors in a learner corpus composed of essays written by your students, it is essential that the corpus is error-tagged; that is, all the texts in the corpus need to be read to identify and flag up all examples of erroneous use. To use another example, phonetic annotation will be applied only to spoken data. However, irrespective of the kind of annotation applied, it is vital that this additional linguistic information is supplied in a removable form (i.e. in separate files) so that it does not corrupt the original corpus data (Leech 2005).
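One way to keep annotation removable is to store it apart from the plain text as character offsets and merge the two only when needed. The Python sketch below illustrates that idea under stated assumptions: the sentence, the `error` layer, the `verb-tense` tag and the angle-bracket output format are all invented for the example and do not reproduce any particular annotation scheme. It also assumes that annotation spans do not overlap.

```python
# The plain text is stored on its own and is never modified.
text = "She make a mistake yesterday."

# Annotations live separately and point into the text by character offsets.
# The layer name and tag below are hypothetical.
annotations = [
    {"start": 4, "end": 8, "layer": "error", "tag": "verb-tense"},
]

def merge(text, annotations):
    """Return a marked-up copy of `text`; assumes non-overlapping spans."""
    pieces = []
    position = 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        layer, tag = ann["layer"], ann["tag"]
        span = text[ann["start"]:ann["end"]]
        pieces.append(text[position:ann["start"]])  # untouched text before the span
        pieces.append(f'<{layer} tag="{tag}">{span}</{layer}>')
        position = ann["end"]
    pieces.append(text[position:])  # remainder after the last span
    return "".join(pieces)

print(merge(text, annotations))
print(text)  # the original text is unchanged
```

Because the original string is never edited, dropping the annotation list returns the corpus to its raw form, and several annotation layers can be kept side by side without interfering with one another.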
Lastly, the process of annotating a corpus can be conducted in a number of ways. As explained by van Rooy (2015), annotation can be manual, computer-assisted (i.e. the output provided by a computer is subsequently edited by humans) or fully automatic. Automatic systems are the most efficient method and are often used for adding PoS tags (although they are not error-free). A good example of an automatic tagger is CLAWS, which was developed at Lancaster University (Garside and Smith 1997). Both the BNC and COCA have been annotated by means of this system...
Table of contents
- Cover Page
- Corpus Linguistics for Vocabulary
- Routledge Corpus Linguistics Guides
- Title
- Copyright
- Contents
- List of Figures
- List of Tables
- Acknowledgments
- Introduction and Aims of the Book
- 1 What is Corpus Linguistics?
- 2 Corpus Analysis: Tools and Statistics
- 3 What is Vocabulary? Terminology, Conceptualizations and Research Issues
- 4 Frequency and Vocabulary
- 5 Corpora, Phraseology and Formulaic Language
- 6 Corpora and Teaching Vocabulary
- 7 Corpora and Learner Vocabulary
- 8 Specialized Corpora and Vocabulary
- 9 Discourse, Pragmatics and Vocabulary
- 10 Summary and Research Projects
- Glossary
- Commentary on Tasks
- Index
Citation styles for Corpus Linguistics for Vocabulary
APA 6 Citation
Szudarski, P. (2017). Corpus Linguistics for Vocabulary (1st ed.). Taylor and Francis. Retrieved from https://www.perlego.com/book/1506077/corpus-linguistics-for-vocabulary-a-guide-for-research-pdf (Original work published 2017)
Chicago Citation
Szudarski, Paweł. (2017) 2017. Corpus Linguistics for Vocabulary. 1st ed. Taylor and Francis. https://www.perlego.com/book/1506077/corpus-linguistics-for-vocabulary-a-guide-for-research-pdf.
Harvard Citation
Szudarski, P. (2017) Corpus Linguistics for Vocabulary. 1st edn. Taylor and Francis. Available at: https://www.perlego.com/book/1506077/corpus-linguistics-for-vocabulary-a-guide-for-research-pdf (Accessed: 14 October 2022).
MLA 7 Citation
Szudarski, Paweł. Corpus Linguistics for Vocabulary. 1st ed. Taylor and Francis, 2017. Web. 14 Oct. 2022.