Languages & Linguistics

Corpus Linguistics

Corpus linguistics is a branch of linguistics that involves the analysis of large collections of written or spoken texts (corpora) to study language patterns and usage. It uses computational tools and statistical methods to identify linguistic patterns, frequencies, and relationships within a language. Corpus linguistics provides valuable insights into language structure, usage, and variation.

Written by Perlego with AI-assistance

8 Key excerpts on "Corpus Linguistics"

Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.
  • Corpus-Based Analysis of Ideological Bias
    eBook - ePub

    Migration in the British Press

    • Anna Islentyeva (Author)
    • 2020 (Publication Date)
    • Routledge (Publisher)
    It is, however, important to understand that Corpus Linguistics is not a “type” of linguistics in the same sense as cognitive, generative, or functional linguistics are; neither is it an aspect of language like grammar, lexis, or syntax. McEnery & Wilson (2001: 2; emphasis added) state that Corpus Linguistics is “a methodology that may be used in almost any area of linguistics, but it does not truly delimit an area of linguistics itself”. Importantly, a corpus-based approach can even challenge some existing linguistic theories. McEnery & Gabrielatos (2006: 33) argue that the analysis of corpora has not only reinforced the findings of descriptive linguistics, but has also enhanced theoretically oriented linguistic research. Contemporary Corpus Linguistics represents a heterogeneous set of methods that can be combined with other methods of analysis. Crucially, Corpus Linguistics represents an empirical approach to the study of language and involves the observation of naturally occurring or authentic language data that is collected, annotated, and put together to form a corpus. Broadly speaking, a language corpus is a collection of texts in electronic form that are selected according to specific criteria relevant to a particular line of linguistic enquiry. A classic definition from Sinclair (2004), one of the pioneers of Corpus Linguistics, states: “A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.” However, not all researchers see Corpus Linguistics as a mere methodological tool. There are at least two distinct methodological approaches to conducting Corpus Linguistics, known as corpus-based and corpus-driven approaches to the study of language (Hardie & McEnery 2010; McEnery & Hardie 2012: 6, 147)
  • An Introduction to Corpus Linguistics
    • Graeme Kennedy (Author)
    • 2014 (Publication Date)
    • Routledge (Publisher)
    A third group of researchers consists of descriptive linguists whose main concern has been to make use of computerized corpora to describe reliably the lexicon and grammar of languages, both of the linguistic systems we use and our likely use of those systems. It is the probabilistic aspect of corpus-based descriptive linguistic studies which especially distinguishes them from conventional descriptive fieldwork in linguistics or lexicography. That is, corpus-based descriptive linguistics is concerned not only with what is said or written, where, when and by whom, but how often particular forms are used. The measurement of the distribution of words and grammar has encouraged new ways of studying the linguistic basis of variation in text types, language change and regional and other varieties of language. The corpus provides contexts for the study of meaning in use and, by making available techniques for extracting linguistic information from texts on a scale previously undreamed of, it facilitates linguistic investigations where empiricism is text based.
    A fourth area of activity, which has been among the most innovative outcomes of the corpus revolution, has been the exploitation of corpus-based linguistic description for use in a variety of applications such as language learning and teaching, and natural language processing by machine, including speech recognition and translation.
    At the present time in Corpus Linguistics, some researchers tend to focus on issues in corpus design, others on methods for text analysis and processing, and still others, probably the majority, on corpus-based linguistic description and the application of such descriptions. These various concerns are discussed in Chapters 2–5 of this book.
    Although the scope of Corpus Linguistics may be defined in terms of what people do with corpora, it would be a mistake to assume that Corpus Linguistics is simply a faster way of describing how a language works, or is about the nature of linguistic evidence. Analysis of a corpus by means of standard corpus linguistic research software can and frequently does reveal facts about a language which we might never previously have thought of seeking. Altenberg’s (1991a) study of amplifier collocations in English, for example, raised questions about semantic classes of maximizers and boosters such as perfectly or awfully which probably would not have been asked without the evidence of a corpus. He found, for example, that frequent maximizers such as quite tend to collocate with non-scalar words (quite obviously) while absolutely has a greater tendency than other maximizers to collocate with negatives (absolutely not
  • Understanding Corpus Linguistics
    • Danielle Barth, Stefan Schnell (Authors)
    • 2021 (Publication Date)
    • Routledge (Publisher)
    Duranti 1997).
    The main concern of Corpus Linguistics, however, is the regularities of language use. Corpus linguists seek to identify patterns of variation in language use and relate these to relevant factors of their context. Instead of asking what is possible to say, sign, or write in a given language – given relevant abstract rules – we are interested in what people have said, signed, or written in specific contexts, as observed in recorded texts contained in a corpus, and what they are therefore most likely to say, sign, or write given the same contextual circumstances.
    Our definition of corpus and characterisation of Corpus Linguistics does not specify any properties of the texts included. Some corpus linguists may restrict definitions of corpora to ‘authentic’ texts, those produced in non-academic contexts (McEnery & Wilson 2001; Stefanowitsch 2020: 23–25). But we would include in our definition also texts that come from specific experimental designs for linguistic or other academic purposes. Corpora containing such experimentally elicited texts are labelled ‘artificial corpora’ (cf. Section 3.3.1). There are specific research contexts – in particular, language documentation to be discussed in Chapter 10 – where corpus compilation is driven by a variety of considerations, some of which necessitate the inclusion of non-authentic texts and text types. The only necessary condition for the inclusion of texts in a corpus is that the expressions in it are used in the sense of constituting some social action, if only in reaction to a stimulus. This makes corpus texts different from the constructed examples abundant in many other strands of linguistics, like Sapir’s ‘the farmer killed the ducklings’ (1921: 94) as an example of a typical simple sentence. A crucial feature of such examples is that they do not represent any language use, but instead merely mention a possible structure. Likewise irrelevant for Corpus Linguistics is any kind of evaluation of language use, for the simple reason that such evaluations do not represent language production at all, but judgements thereof. This applies to grammaticality judgements either by linguists themselves (intuitions) or by informants (Jackendoff 1994: 48–49; Schütze 2016; Stefanowitsch 2020: 8–17), as well as elicited evaluations of language use, as are common in perceptual dialectology (Preston 1989
  • Corpus Linguistics for Online Communication
    eBook - ePub
    • Luke Collins (Author)
    • 2019 (Publication Date)
    • Routledge (Publisher)
    Chapter 1 What is Corpus Linguistics? 1.1 Features of corpus analysis
    Corpus Linguistics is concerned with understanding how people use language in various contexts and incorporates computational tools to identify recurring patterns in ‘natural’ or authentic language use. However, as McEnery and Hardie (2012: 1) observe, “Corpus Linguistics is not a monolithic, consensually agreed set of methods and procedures for the exploration of language” and what they propose to call ‘corpus methods in linguistics’ offers a critical resource that can be implemented across and beyond the field of linguistics.
    Corpus Linguistics is founded on a frequency-based view of language that determines significance from the recurring patterns evidenced in observations of ‘real’ language data. As Tognini Bonelli (2010: 19) remarks, “The significant elements in a corpus become the patterns of repetition and patterns of co-selection. In other words, in Corpus Linguistics it is the frequency of occurrence that takes pride of place”. This measure of frequency can be in relation to the occurrence of particular features, or it might be the co-occurrence of one feature with another. Furthermore, this might be a measure of a distributional frequency, i.e. regular occurrence across a number of texts within a corpus. The underlying assumption is that the regularity of a formal pattern (i.e. the regular use of a particular form) reflects some functional difference, that language users are regularly opting for one form of expression over another and that this indicates something of its meaning.
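    As a rough illustration (not part of the excerpted book), the three kinds of frequency mentioned above can be sketched in a few lines of Python. The toy texts, the simple tokenizer, and the function names `raw_frequency`, `text_range`, and `cooccurrence` are all assumptions made for this sketch; real corpus tools handle tokenization and windowing with far more care.

```python
from collections import Counter
import re

# Toy corpus invented for illustration: three short "texts".
TEXTS = [
    "you can save money but you must act now",
    "the offer is good but the price is high",
    "you will love the price",
]

def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def raw_frequency(texts):
    """Occurrence: count every token across the whole corpus."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def text_range(texts):
    """Distributional frequency: in how many texts does each word occur?"""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))
    return counts

def cooccurrence(texts, node, window=2):
    """Co-occurrence: count words within `window` tokens of `node`."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

freq = raw_frequency(TEXTS)
rng = text_range(TEXTS)
print("you:", freq["you"], "occurrences in", rng["you"], "texts")
print("but:", freq["but"], "occurrences in", rng["but"], "texts")
print("'the' near 'price':", cooccurrence(TEXTS, "price")["the"])
```

    Keeping occurrence and range as separate counts reflects the distinction drawn above: a word can be frequent overall while being concentrated in only a few texts.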
    We have looked at the role of the computer in our analysis, not least of all in being able to process much more data than a team of manual researchers would be able to – and considerably faster. It is the capacity to process large amounts of data that helps to identify patterns that may be beyond our intuition and which may escape our reading, even if we were given the time to analyse the same amount of data. As manual readers, we are susceptible to giving disproportionate value to some words over others. For example, politically loaded words like ‘anti-Semitism’ or ‘terrorism’ may stand out to us more than an unusually frequent use of functional words, such as ‘but’ or ‘you’. Nevertheless, the frequent use of ‘you’ might attest to a deliberate form of direct address in a piece of marketing discourse; the regular use of ‘but’ might reflect the counterclaims that typify a debate. The computer does not hold the same biases as we do and will count each word ‘equally’ (so long as we program it to do so), which allows us to construct our reading on the basis of what is there in the text, though of course we can also conduct searches for the features we are particularly interested in. Corpus analysis can also support us in validating observations that we have made based on smaller datasets; as Baker and McEnery (2015: 10) assert: “Being able to draw conclusions based on extremely large samples of data adds validity to claims, even if they confirm what we suspected, while providing a quantitative summary gives substance to what may have been a suspicion”. The corpus, then, is a resource for evidence-building, that can support our claims but equally, challenge assumptions made about language use that are not reflected in the data.
  • Introduction to Corpus Linguistics
    • Sandrine Zufferey (Author)
    • 2020 (Publication Date)
    • Wiley-ISTE (Publisher)
    For example, it is useful to know what the word “knowledge” means, but it is just as important to know that this word is frequently used in phrases such as “acquire knowledge” or “having good knowledge of”, etc. Corpus Linguistics is a particularly effective method for establishing the frequent contexts in which a word or an expression is used. But Corpus Linguistics is also used for conducting research in fundamental areas of linguistics such as the study of syntax, since it makes it possible to identify the types of syntactic structures used in different languages. For example, by making a corpus study, it is possible to determine in which textual genres the passive voice is most commonly used. Finally, thanks to the existence of a corpus of oral data, Corpus Linguistics also makes it possible to answer questions related to phonology and sociolinguistics. For instance, it makes it possible to establish the area of geographical distribution of certain pronunciation traits, such as differentiating the short /a/ form in the French word “patte” (paw), from the long /ɑ/ form in the word “pâte” (pastry). Answering these different questions requires the use of different types of corpora, as well as having available data regarding their contents. For example, in order to determine the geographical area of diffusion of a certain pronunciation trait, it is necessary to know where each speaker having contributed to the corpus came from. This type of information is called corpus metadata. We will review the main types of existing corpora at the end of this chapter, and discuss the issue of metadata in Chapter 6. To sum up, in this section, we have defined Corpus Linguistics as an empirical discipline, which observes and analyzes quantitative language samples gathered in a computerized format
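    To make the link between word contexts and metadata concrete, here is a minimal sketch (not taken from the excerpted book) of a keyword-in-context search restricted by a metadata field. The toy records, the `genre` field, and the `kwic` function are hypothetical and chosen only for illustration.

```python
import re

# Toy records invented for illustration: each text carries metadata.
CORPUS = [
    {"genre": "academic", "text": "Students acquire knowledge through practice."},
    {"genre": "news", "text": "Officials had no knowledge of the plan."},
    {"genre": "academic", "text": "Having good knowledge of syntax helps."},
]

def kwic(records, keyword, width=2, **metadata):
    """Return (left context, keyword, right context) for each hit,
    keeping only records whose metadata match the given filters."""
    lines = []
    for rec in records:
        # Skip records that fail any metadata filter, e.g. genre="academic".
        if any(rec.get(key) != value for key, value in metadata.items()):
            continue
        tokens = re.findall(r"\w+", rec["text"].lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, tok, right))
    return lines

academic_hits = kwic(CORPUS, "knowledge", genre="academic")
for left, kw, right in academic_hits:
    print(f"{left:>20} [{kw}] {right}")
```

    The same filtering idea would apply to speaker origin in a spoken corpus: without the metadata field, the hits could not be grouped by region or genre at all.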
  • Discourse in English Language Education
    • John Flowerdew (Author)
    • 2012 (Publication Date)
    • Routledge (Publisher)
    There are also parallel corpora, which consist of two or more corpora that have been sampled in the same way for different languages, usually of texts that have been translated. In addition to these corpora, the worldwide web can also be used as a corpus, either by using a search engine such as Google or Yahoo! or via specialised interfaces, for example: http://www.webcorp.org.uk/live/. To put the size of these corpora into perspective, as Gavioli and Aston (2001: 238) have noted, even the very large corpora consist of less language than will be encountered by average humans in their daily life. In addition, the composition of these corpora is different to what the individual experiences in real life, many, if not most of them consisting of written language. Furthermore, in real life, certain texts may be experienced more than once, while they will only occur once in a corpus. While some corpora are kept in a ‘raw’ state, others are annotated, or tagged, for parts of speech, or other information such as who is speaking or when the speaker has changed, a process which can be done automatically.

    9.2 What is Corpus Linguistics?

    Corpus Linguistics is the application of computational tools to the analysis of corpora, in order to reveal language patterns which systematically occur in them. The rationale for such an analysis is that, on the one hand, large amounts of text can be analysed automatically (much more than would be humanly possible manually) and that, on the other hand, patterns may be revealed by the computational tools which may not be obvious to the naked eye
  • Corpus Linguistics for Education
    eBook - ePub
    • Pascual Pérez-Paredes (Author)
    • 2020 (Publication Date)
    • Routledge (Publisher)
    Chapter 3

    Corpus Linguistics approaches to understanding language use

    3.1 Understanding and researching language use: discovering patterns

    Frequency in a corpus or in a text is ‘observable evidence of probability in the system’, therefore ‘unique events can be described only against the background of what is normal and expected’ (Stubbs 2007: 130). Gablasova, Brezina & McEnery (2017) have highlighted that, when looking at second language learners, information about the frequency of occurrence or co-occurrence of a unit or sets of units can help us uncover language patterns that point to underlying factors in second language learning, that is, the patterns of use in a dataset can reveal aspects of the researched phenomena. Durrant & Brenchley (2019) have studied children’s use of vocabulary in English schools and have suggested that, given the high repetition of high frequency verbs and adjectives in lower forms, lexical sophistication is conceptually inseparable from lexical diversity. McEnery, Brezina, Gablasova & Banerjee (2019) have noted that word associations (i.e. collocations) are crucial to understand discourse. Two examples are metaphorical connections between words and social evaluation promotion. However, the unwavering role of frequency in CL is not necessarily well understood in other areas of language education research, let alone other disciplines not remotely connected with language or linguistics.
    McEnery & Hardie (2012: 1) have noted that ‘Corpus Linguistics is not a monolithic, consensually agreed set of methods and procedures for the exploration of language […] Corpus Linguistics is a heterogeneous field’. This is a relevant assertion in the context of a young discipline that is subject to a process of critical inquiry and witnessing a rich debate in terms of methodological foundations. In the following paragraphs, we will try to explore some of the principles behind the exploration of language use and what they imply for the deployment of CL methods in education research.
  • Corpus Linguistics for Vocabulary
    eBook - ePub
    • Paweł Szudarski (Author)
    • 2017 (Publication Date)
    • Routledge (Publisher)
    Chapter 5 Corpora, phraseology and formulaic language 5.1 Corpus Linguistics, phraseology and lexical priming
    In the previous chapters, we have discussed the usefulness of corpus tools for analyzing the occurrence of individual words. However, it is vital to state that corpora have also been instrumental in demonstrating that a large part of language consists of units longer than single words. More specifically, once we start exploring large amounts of naturally occurring data, we quickly discover that words have a tendency to cluster with one another and form lexical combinations. As Sinclair (1991: 108) notes, “most everyday words do not have an independent meaning, or meanings, but are components of a rich repertoire of multi-word patterns that make up a text”. Similarly, Tognini-Bonelli (2010) highlights the fact that the patterns of lexical repetition and co-selection are an important aspect of language use and Stubbs (2002: 59) notes “the pervasive occurrence of phrase-like units of idiomatic language”. Analyzing the structure and occurrence of such multiword units is the domain of phraseology and it constitutes a major line of research within Corpus Linguistics.
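    The tendency of words to cluster can be made visible with a simple n-gram count. The following is a minimal sketch, not the method of the excerpted book: the sample sentence is invented, and the helper `ngrams` and the threshold of two occurrences are assumptions for illustration.

```python
from collections import Counter
import re

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample text; a real study would use a large corpus.
TEXT = "at the end of the day we met at the end of the road"
tokens = re.findall(r"[a-z]+", TEXT.lower())

# Count three-word sequences and keep those that recur.
trigram_counts = Counter(ngrams(tokens, 3))
recurrent = [gram for gram, count in trigram_counts.items() if count >= 2]

for gram in recurrent:
    print(" ".join(gram), trigram_counts[gram])
```

    Raw recurrence is only a starting point; studies of phraseology usually combine it with association measures and manual inspection before calling a sequence a multi-word unit.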
    An influential manifestation of empirical work in this area is Hunston and Francis’s (2000) pattern grammar. This is an innovative type of lexical grammar built around the most commonly occurring patterns defined as “all the words and structures which are regularly associated with the word and which contribute to its meaning” (Hunston and Francis 2000: 37). An example of a simple pattern is a head noun followed by a to-infinitive which complements it (‘a decision to do something’ or ‘an intention to buy something’). According to Hunston and Francis (2000), there are important associations between patterns and meanings and they can be observed at two levels: different meanings of a word tend to be associated with different patterns but also different words with the same patterns often share some aspects of the same meaning. As McEnery and Hardie (2012: 81) explain, pattern grammar is a model “where language is built up of a series of linked sequences of fuzzy structures” which influence both structural coherence and meaning.
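    A pattern such as a head noun followed by a to-infinitive can be approximated with a plain-text search. This is a crude sketch under stated assumptions, not Hunston and Francis's procedure: the noun list `HEAD_NOUNS` and the example sentences are hypothetical, and without part-of-speech tagging the regular expression cannot tell an infinitive marker from the preposition 'to'.

```python
import re
from collections import Counter

# Hypothetical list of head nouns to search for.
HEAD_NOUNS = ["decision", "intention", "attempt"]
PATTERN = re.compile(
    r"\b(" + "|".join(HEAD_NOUNS) + r")\s+to\s+([a-z]+)",
    re.IGNORECASE,
)

# Invented example sentences.
SENTENCES = [
    "Her decision to leave surprised everyone.",
    "He announced his intention to buy the house.",
    "The decision was made quickly.",
    "They went to London.",
]

def find_noun_to_inf(sentences):
    """Return (noun, following word) pairs matching 'NOUN to WORD'."""
    hits = []
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            hits.append((match.group(1).lower(), match.group(2).lower()))
    return hits

hits = find_noun_to_inf(SENTENCES)
by_noun = Counter(noun for noun, _ in hits)
print(hits)
print(by_noun)
```

    On a tagged corpus the same search would be expressed over part-of-speech tags rather than a fixed word list, which is closer to how such patterns are identified in practice.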