eBook - ePub

Corpus Linguistics for Vocabulary

Name: Corpus Linguistics for Vocabulary
Author: Paweł Szudarski

A Guide for Research

Paweł Szudarski

228 pages
English
ePUB (mobile friendly)
Available on iOS and Android

About this book

Corpus Linguistics for Vocabulary provides a practical introduction to using corpus linguistics in vocabulary studies. Using freely available corpus tools, the author provides a step-by-step guide on how corpora can be used to explore key vocabulary-related research questions and topics such as:

The frequency of English words and how to choose which ones should be taught to learners;
How spoken vocabulary differs from written vocabulary, and how academic vocabulary differs from general vocabulary;
How vocabulary contributes to the structure of discourse, and the pragmatic functions it fulfils.

Featuring case studies and tasks throughout, Corpus Linguistics for Vocabulary provides a clear and accessible guide and is essential reading for students and teachers wanting to understand, appreciate and conduct corpus-based research in vocabulary studies.

Information

Publisher

Routledge

Year

2017

ISBN

9781351608046

Subject

Philology

Subtopic

Linguistics

Chapter 1

What is corpus linguistics?

1.1 What is a corpus and corpus-based analysis?

In simple terms, corpus linguistics can be defined as the study of “the compilation and analysis of corpora” (Cheng 2012: 6), which are large collections of “naturally occurring language texts chosen to characterize a state or a variety of language” (Sinclair 1991: 171). According to Hunston (2002), even though corpus linguistics is a relatively new field, it has revolutionized language studies because it has provided new ways of analyzing and describing the use of language. The author emphasizes the fact that corpora consist of texts stored in an electronic format, which enables researchers to use special software (called concordancers) to conduct automatic searches and gain insights into the structure and regularity of naturally occurring language. Other important features of corpus-based analysis can be found in Biber et al. (1998: 4), who characterize it in the following way:

• it is empirical, analyzing the actual patterns of use in natural texts;

• it utilizes a large and principled collection of natural texts, known as a corpus, as the basis for analysis;

• it makes extensive use of computers for analysis, using both automatic and interactive techniques;

• it depends on both quantitative and qualitative analytical techniques.
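To make the idea of a concordancer more concrete, the following is a minimal sketch of a keyword-in-context (KWIC) search written in Python. It is not the implementation of any particular corpus tool: the function name `concordance`, the `width` parameter and the sample sentences are assumptions made purely for illustration.

```python
import re

def concordance(text, node, width=30):
    """Return keyword-in-context (KWIC) lines for every occurrence of `node`."""
    lines = []
    # Whole-word, case-insensitive match of the search term.
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text standing in for a corpus.
sample = (
    "Researchers make use of corpora every day. "
    "Teachers make decisions about vocabulary, and learners make progress."
)
for line in concordance(sample, "make"):
    print(line)
```

Aligning the search word in a central column is what allows recurring patterns to its left and right to be read off quickly, which is the interactive side of the analysis described above.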

At the same time, as in any discipline, corpus linguistics encompasses different approaches to how the analysis should be conducted and how its results should be interpreted. In a useful typology, Tognini-Bonelli (2001) distinguishes between corpus-based and corpus-driven approaches. In the former, corpus linguistics is perceived as a methodology (e.g. McEnery et al. 2003) in which corpus data are used to verify existing theories of language. In contrast, corpus-driven approaches tend to view corpus linguistics as a theory which offers a new way of looking at the creation of meaning in a narrow sense and at different aspects of the use of language in a broader sense (e.g. Stubbs 1993 or Teubert 2005). This demonstrates that the field of corpus linguistics is far from homogeneous, with some authors regarding it as a theoretical approach which “may refine and define a range of theories of language” (McEnery and Hardie 2012: 1), while others (probably the majority of corpus linguists) use it as a methodology that enhances research into language use and variation (Biber and Reppen 2015). However, irrespective of which approach we adhere to, corpus linguistics has the potential to change one’s perspective on the study of language as a whole and, by providing powerful tools of analysis, it opens up new avenues for linguistic research.

1.2 Corpus design

If we wish corpora to be principled collections of texts whose aim is to represent a specific kind of language, it is important to realize that they should be designed and created according to specific criteria. The following paragraphs aim to present the key criteria of corpus design and explain why corpus linguists treat them as guidelines for the development of corpora. This is particularly important if you consider compiling your own small-scale corpus, for all the methodological decisions related to the structure of your corpus will have an impact on the validity of your findings.

The first author to be discussed in relation to corpus design is John Sinclair, who is often referred to as the father of corpus linguistics. In 2005, Sinclair proposed a set of principles that should be considered with regard to the process of developing a corpus:

1 The contents of a corpus should be selected without regard for the language they contain, but according to their communicative function in the community in which they arise.

2 Corpus builders should strive to make their corpus as representative as possible of the language from which it is chosen.

3 Only those components of corpora which have been designed to be independently contrastive should be contrasted.

4 Criteria for determining the structure of a corpus should be small in number, clearly separate from each other and efficient as a group in delineating a corpus that is representative of the language variety under examination.

5 Any information about a text other than the alphanumeric string of its words and punctuation should be stored separately from the plain text and merged when required in applications.

6 Samples of language for a corpus should, wherever possible, consist of entire documents or transcriptions of complete speech events, or should get as close to this target as possible. This means that samples will differ substantially in size.

7 The design and composition of a corpus should be documented fully with information about the contents and arguments in justification of the decisions taken.

8 The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.

9 Any control of subject matter in a corpus should be imposed by the use of external, and not internal, criteria.

10 A corpus should aim for homogeneity in its components while maintaining adequate coverage, and rogue texts should be avoided.

These points are valuable not only as practical advice on how to build a corpus, but also because they point to the importance of the theoretical considerations that underpin this process. Naturally, there is no such thing as an ideal corpus and consequently any attempt to create a corpus is “a compromise between the hoped for and the achievable” (Nelson 2010: 60). However, if you are planning to compile your own corpus, it is essential that you take the above principles into account and consider the impact of the structure and shape of your corpus on the quality of the information it will provide. The key decisions you are likely to make concern the size, representativeness and balance of the data within your corpus, and much of the corpus literature highlights the interrelatedness of these factors (see Hunston 2002 or McEnery and Hardie 2012 for details).

The size of a corpus depends largely on the research question pursued. If you are interested in checking the occurrence of difficult, low-frequency words or phrases (e.g. ‘foray’), then you need a large corpus so that you are able to find enough examples of how they are used in authentic texts. In turn, if your research focuses on frequent words such as ‘make’ or ‘give’, even small corpora should provide sufficient empirical evidence for your analysis. Thus, establishing the right size for your corpus is an open question. It is fair to say that individual researchers and teachers working in the field of corpus linguistics make use of both multimillion-word corpora (e.g. the British National Corpus, BNC) and specialized, do-it-yourself collections of texts that are much smaller in size but suit the purposes of the local contexts in which they are created (see Chapter 8 for a discussion of general vs. specialized corpora). In addition, written corpora predominate in the field of corpus linguistics and are much larger than spoken corpora, which results from the difficulty of collecting and organizing spoken data (more details to follow).
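The link between word frequency and corpus size can be illustrated with a short calculation. The sketch below, in Python, counts words in a toy text and then estimates how many occurrences of a word one might expect in corpora of different sizes. The helper names (`tokenize`, `per_million`, `expected_hits`) and the rates of 1,000 and 1 occurrences per million words are hypothetical values chosen only to show the arithmetic; they are not measurements for ‘make’ or ‘foray’.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000

def expected_hits(rate_per_million, corpus_size):
    """Expected number of occurrences in a corpus of the given size."""
    return rate_per_million * corpus_size / 1_000_000

# Toy text standing in for a corpus.
tokens = tokenize("They make plans. We make lists. A brief foray into town.")
counts = Counter(tokens)
print(counts["make"], counts["foray"])

# Hypothetical rates, chosen only to illustrate the size argument:
# a frequent word (1,000 per million) versus a rare one (1 per million).
for size in (100_000, 100_000_000):
    print(size, expected_hits(1000, size), expected_hits(1, size))
```

Under these assumed rates, a 100,000-word corpus would be expected to contain about a hundred examples of the frequent word but, on average, less than one example of the rare word, whereas a 100-million-word corpus would yield around a hundred examples of the rare word.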

Two other criteria that play a central role in corpus design are the representativeness and balance of corpus data. Biber (1993: 243) defines representativeness as “the extent to which a sample includes the full range of variability in a population”. In other words, representativeness concerns the issue of how well a corpus represents a given language or variety that is under study. Balance is a related notion that refers to the structure and type of data used to build a corpus. As explained by Hunston (2002), a well-balanced corpus should consist of several subsections that represent different types (registers) of language use. Importantly, all of the sections ought to contain a roughly equal number of words. A good illustration of a well-balanced corpus is the Corpus of Contemporary American English or COCA (Davies 2011). It is a corpus of general English which consists of 520 million words divided into five sections. Each of the sections contains around 110 million words and represents a different register of use (spoken language, fiction, popular magazines, newspapers and academic language). This design makes the corpus one of the biggest and best-developed corpora of contemporary English.
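If you compile your own corpus, one simple way to inspect its balance is to compare each section's share of the total word count with an equal split. The following Python sketch does this under stated assumptions: the function names, the 5 per cent tolerance and the section sizes are all hypothetical, and the figures are not taken from COCA or any other existing corpus.

```python
def section_shares(section_sizes):
    """Return each section's proportion of the total word count."""
    total = sum(section_sizes.values())
    return {name: size / total for name, size in section_sizes.items()}

def is_roughly_balanced(section_sizes, tolerance=0.05):
    """True if every section's share is within `tolerance` of an equal split."""
    equal_share = 1 / len(section_sizes)
    shares = section_shares(section_sizes)
    return all(abs(share - equal_share) <= tolerance for share in shares.values())

# Hypothetical word counts for a small do-it-yourself corpus.
balanced = {
    "spoken": 48_000,
    "fiction": 52_000,
    "magazines": 50_000,
    "newspapers": 51_000,
    "academic": 49_000,
}
# Hypothetical corpus dominated by written data.
skewed = {"written": 900_000, "spoken": 100_000}

print(section_shares(balanced))
print(is_roughly_balanced(balanced), is_roughly_balanced(skewed))
```

A check of this kind only addresses the equal-size aspect of balance; whether the chosen sections are the right ones for the language variety under study remains a question of representativeness.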

Another important aspect that needs to be discussed is annotation. This is an umbrella term that refers to procedures such as tagging and parsing which are carried out to add linguistic information to a corpus (Hunston 2002: 18). As Cheng (2012: 85) explains, the aim of annotation is to “enhance the corpus contents” in terms of the linguistic description of the data it contains. In their discussion of annotation, McEnery and Hardie (2012: 29) distinguish between three types of information that can accompany a corpus: metadata (details about a given text such as the name of the author), textual markup (information about the formatting of the text such as where italics start and end or when a given speaker starts speaking) and linguistic annotation (assigning grammatical categories or tags to all the words within a corpus).

Crucially, the type and amount of information added to a corpus depend on the kind of analysis envisioned by its compilers. Referring to this issue, Cheng (2012) enumerates different levels or layers of annotation, the most important of which include: part-of-speech (PoS) tagging, syntactic (grammatical) parsing, error annotation, semantic annotation and phonetic annotation. The level of annotation applied to a given corpus depends on the type of data collected and, even more importantly, on the research purposes it will serve. For instance, if you wish to analyze the number of errors in a learner corpus composed of essays written by your students, it is essential that the corpus is error-tagged; that is, all the texts in the corpus need to be read to identify and flag up all examples of erroneous use. To take another example, phonetic annotation will be applied only to spoken data. However, irrespective of the kind of annotation applied, it is vital that this additional linguistic information is supplied in a removable form (i.e. separate files) so that it does not corrupt the original corpus data (Leech 2005).
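The principle of keeping annotation removable can be sketched in a few lines of Python. In this illustration the plain text is left untouched, while the PoS tags are stored as a separate list of records that point to character positions in the text and are merged only when an annotated view is needed. The record layout (`start`, `end`, `pos`), the tag labels and the `merge` function are assumptions made for the example; they do not reproduce the format of CLAWS or of any existing corpus.

```python
import json

# The original corpus text, which is never modified.
plain_text = "Corpora reveal patterns"

# Annotation kept apart from the text: each record points at a character span.
# The tag labels are illustrative, not an established tagset.
annotations = [
    {"start": 0, "end": 7, "pos": "NOUN"},
    {"start": 8, "end": 14, "pos": "VERB"},
    {"start": 15, "end": 23, "pos": "NOUN"},
]

def merge(text, spans):
    """Build a word_TAG view on demand without modifying the original text."""
    return " ".join(f"{text[s['start']:s['end']]}_{s['pos']}" for s in spans)

# The annotation layer could be saved to its own file, for example as JSON.
serialized = json.dumps(annotations)
restored = json.loads(serialized)

print(merge(plain_text, restored))
print(plain_text)
```

Because the tags live in their own structure, a faulty annotation layer can be corrected or discarded without any risk to the corpus text itself.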

Lastly, the process of annotating a corpus can be conducted in a number of ways. As explained by van Rooy (2015), annotation can be manual, computer-assisted (i.e. the output provided by a computer is subsequently edited by humans) or fully automatic. Automatic systems are the most efficient method and are often used for adding PoS tags (although they are not error-free). A good example of an automatic tagger is CLAWS, which was developed at Lancaster University (Garside and Smith 1997). Both the BNC and COCA have been annotated by means of this system...

Table of contents

Cover Page
Corpus Linguistics for Vocabulary
Routledge Corpus Linguistics Guides
Title
Copyright
Contents
List of Figures
List of Tables
Acknowledgments
Introduction and Aims of the Book
1 What is Corpus Linguistics?
2 Corpus Analysis: Tools and Statistics
3 What is Vocabulary? Terminology, Conceptualizations and Research Issues
4 Frequency and Vocabulary
5 Corpora, Phraseology and Formulaic Language
6 Corpora and Teaching Vocabulary
7 Corpora and Learner Vocabulary
8 Specialized Corpora and Vocabulary
9 Discourse, Pragmatics and Vocabulary
10 Summary and Research Projects
Glossary
Commentary on Tasks
Index

Citation styles for Corpus Linguistics for Vocabulary

APA 6 Citation

Szudarski, P. (2017). Corpus Linguistics for Vocabulary (1st ed.). Taylor and Francis. Retrieved from https://www.perlego.com/book/1506077/corpus-linguistics-for-vocabulary-a-guide-for-research-pdf (Original work published 2017)

Chicago Citation

Szudarski, Paweł. (2017) 2017. Corpus Linguistics for Vocabulary. 1st ed. Taylor and Francis. https://www.perlego.com/book/1506077/corpus-linguistics-for-vocabulary-a-guide-for-research-pdf.

Harvard Citation

Szudarski, P. (2017) Corpus Linguistics for Vocabulary. 1st edn. Taylor and Francis. Available at: https://www.perlego.com/book/1506077/corpus-linguistics-for-vocabulary-a-guide-for-research-pdf (Accessed: 14 October 2022).

MLA 7 Citation

Szudarski, Paweł. Corpus Linguistics for Vocabulary. 1st ed. Taylor and Francis, 2017. Web. 14 Oct. 2022.