1 Introduction
1.1 Why Another Introduction to Corpus Linguistics?
In some sense at least, this book is an introduction to corpus linguistics. If you are a little familiar with the field, this probably immediately triggers the question "Why yet another introduction to corpus linguistics?" This is a valid question because, given the upsurge of studies using corpus data in linguistics, there are already quite a few very good introductions available. Do we really need another one? Predictably, I think the answer is still "yes" and "yes, even a second edition," and the reason is that this introduction is radically different from every other introduction to corpus linguistics out there. For example, there are many topics that are regularly dealt with at length in introductions to corpus linguistics that I will not talk about much:
• the history of corpus linguistics: Kaeding, Fries, the early 1m-word corpora, up to the contemporary giga-corpora and the still lively web-as-corpus discussion;
• how to compile corpora: size, sampling, balancedness, representativity;
• how to create corpus markup and annotation: lemmatization, tagging, parsing;
• kinds and examples of corpora: synchronic vs. diachronic, annotated vs. unannotated;
• what kinds of corpus-linguistic research have been done.
That is to say, rather than telling you about the discipline of corpus linguistics (its history, its place in linguistics, its contributions to different fields, etc.; see McEnery and Hardie 2011 for an excellent recent introduction along those lines), this book will "only" teach you how to do corpus-linguistic data processing with the programming language R. In other words, this book presupposes that you know what you would like to explore, but it gives you tools for doing so that go beyond what most commonly used tools can offer and that will, I hope, also open your mind to new ways of approaching your corpus-linguistic questions. This is important because, to me, corpus linguistics is a method of analysis, so talking about how to do things should enjoy high priority (see Gries 2010 and the rest of that special issue, as well as Gries 2011, for my subjective takes on this matter). Therefore, I will mostly be concerned with:
• aspects of how exactly data are retrieved from corpora for use in linguistically informed analyses, specifically how to obtain frequency lists, dispersion information, collocation displays, concordances, etc. from corpora (see Chapter 2 for explanation and exemplification of these terms);
• aspects of data manipulation and evaluation: how to process and convert corpus data; how to save various kinds of results; how to import them into a spreadsheet program for further annotation; how to analyze results statistically; how to represent the results graphically; and how to report your results.
A second important characteristic of this book is that it only uses freely available software:
• R, the corpus linguist's all-purpose tool (cf. R Core Team 2016): software that is a calculator, a statistics program, a (statistical) graphics program, and a programming language all at the same time (a first small taste of these four roles follows right after this list). The versions used in this book are R (www.r-project.org) and the freely available Microsoft R Open 3.3.1 (https://mran.revolutionanalytics.com/open), in the versions for Ubuntu 16.04 LTS (or Mint 18) and Microsoft Windows 10;
ā¢ RStudio 0.99.1294 (www.rstudio.com);
ā¢ LibreOffice 5.2.0.4 (www.libreoffice.org).
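To give a first, trivially small taste of the four roles of R mentioned above, here are four lines you could type at the R console; the numbers are of course just invented for illustration:

2 + 3 * 4                 # R as a calculator: returns 14
mean(c(2, 5, 9))          # R as a statistics program: the mean of three numbers
plot(rnorm(100))          # R as a graphics program: a scatterplot of 100 random numbers
for (i in 1:3) print(i)   # R as a programming language: a small loop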
The choice of these software tools, especially the decision to use R, has a number of important implications, which should be mentioned early on. As I just mentioned, R is a full-fledged multi-purpose programming language and, thus, a very powerful tool. However, this degree of power does come at a cost: In the beginning, it is undoubtedly more difficult to do things with R than with ready-made (free or commercial) concordancing software that has been written specifically for corpus-linguistic applications. For example, if you want to generate a frequency list of a corpus or a concordance of a word in a corpus with R, you must write a small script or a little bit of code in a programming language, which is the technical way of saying that you write lines of text that are instructions to R. If you do not need pretty output, this script may consist of just a few lines (a first sketch of such a script follows this paragraph), but it will often also be longer than that. On the other hand, if you have a ready-made concordancer, you click a few buttons (and enter a search term) to get the job done. One may therefore ask: Why go through the trouble of learning R? There are a variety of very good reasons for this, some of them related to corpus linguistics, some more general.
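To make the notion of "just a few lines" concrete, here is a minimal sketch of such a frequency-list script; the file name corpus.txt is a made-up placeholder, and the definition of a word, namely that words are separated by anything that is not a letter, is deliberately crude:

# read a (hypothetical) plain-text corpus file into a character vector of lines
corpus <- readLines("corpus.txt")
# convert everything to lower case so that "The" and "the" are counted together
corpus <- tolower(corpus)
# split the lines at every sequence of non-letters (a deliberately crude word definition)
words <- unlist(strsplit(corpus, "[^a-z]+"))
# discard empty strings that the splitting may have produced
words <- words[nzchar(words)]
# tabulate the words and sort the frequencies in descending order
sort(table(words), decreasing = TRUE)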
First, let me address this very argument, which is often made against using R (or other programming languages): Why invest a lot of time and effort in learning a programming language if you can get results from ready-made software within minutes? With regard to the time that goes into learning R, yes, there is a learning curve. However, it may not be as steep as you think: Many participants in my bootcamps and other workshops develop a first good understanding of R that allows them to begin to proceed on their own within just a few days. Plus, being able to program is an extremely useful skill, for academic purposes but also for jobs outside of academia; I would go so far as to say that learning to program develops, or hones, an analytical and rigorous way of thinking that is useful quite generally. With regard to the time that goes into writing a script, much of that work usually needs to be undertaken only once. As you will see below, once you have written your first few scripts while going through this book, you can usually reuse (parts of) them for many different tasks and corpora, and the amount of time required to perform a particular task becomes very similar to that of using a ready-made program. In fact, nearly all corpus-linguistic tasks in my own research are done with (somewhat adjusted) scripts or small snippets of code from this book. In addition, once you have explored how to write your own functions (see Section 3.10, and the small example right after this paragraph), you can easily create versatile or specialized functions of your own; I will make several of those available in subsequent chapters. This way, the actual effort of generating a frequency list, a collocate display, a dispersion plot, etc. often reduces to about the time you need with a concordance program. In fact, R may even be faster than competing applications: For example, some concordance programs read in the corpus files once before they can be processed and then again when performing the actual task, whereas R requires only one pass and may therefore outperform some competitors in terms of processing time.
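To preview what reusing a script as a function means in practice, the frequency-list sketch above could be wrapped up once and then applied to any corpus file in a single line; note that the function name freq.list is my own invention here, not a built-in:

# wrap the steps from the sketch above into a reusable function
freq.list <- function(file) {
   words <- unlist(strsplit(tolower(readLines(file)), "[^a-z]+"))
   sort(table(words[nzchar(words)]), decreasing = TRUE)
}
# from now on, every corpus file is a one-liner, e.g. freq.list("corpus.txt")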
Another point related to the notion that programming knowledge is useful: The knowledge you will acquire by working through this book is quite general, and I mean that in a good way. This is because you will not be restricted to one particular software application (or even one version of one particular software application) and its restricted set of features. Rather, you will acquire knowledge of a programming language and regular expressions, which will allow you to use many different utilities and to understand scripts in other programming languages, such as Perl or Python. (At the same time, I think R is simpler than Perl or Python, but it can also interface with them via RSPerl and RSPython, respectively; see www.omegahat.org.) For example, if you ever come across scripts by other people or decide to turn to these languages yourself, you will benefit from knowing R in a way that no ready-made concordancing software would allow for. If you are already a bit familiar with corpus-linguistic work, you may now think "but why turn to R and not use Perl or Python (especially since you say Perl and Python are similar anyway and many people already use one of these languages)?" This is a good question, and I myself used Perl for corpus processing before I turned to R. However, I think I also have a good answer for why to use R instead. First, the issue of speed is much less of a problem than one may think: R is fast enough and stable enough for most applications (especially if you heed some of the advice given in Sections 3.6.3 and 3.10). Thus, if a script takes a bit of time, you can simply run it over lunch, while you are in class, or even overnight, and collect the results afterwards. Second, R has other advantages. The main one is probably that, in addition to its text-processing capabilities, R offers a large number of ready-made functions for the statistical evaluation and graphical representation of data, which allows you to perform just about all corpus-linguistic tasks within a single programming environment. You can do your data processing, data retrieval, annotation, statistical evaluation, graphical representation . . . everything within just one environment, whereas if you wanted to do all these things in Perl or Python, you would require a huge amount of separate programming. Consider a very simple example: R has a function called table that generates a frequency table. To achieve the same in Perl, you would either have to write a small loop that counts the elements of an array and incrementally stores their frequencies in a hash or, later and more cleverly, program a subroutine which you would then always call upon. While this is no problem for a one-dimensional frequency list, it is much harder for multidimensional frequency tables: Perl's arrays of arrays or hashes of arrays etc. are not for the faint-hearted, whereas R's table is easy to handle, and related functions (xtabs, ftable, etc.) allow you to handle such tables very easily (a tiny illustration follows this paragraph). I believe learning one environment can be sufficiently hard for beginners, and I therefore recommend using the more comprehensive environment with the greater number of simpler functions, which to me clearly is R.
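A tiny invented illustration of how little work such tables require in R:

# a small invented sample of ten word tokens
x <- c("the", "of", "the", "a", "the", "of", "a", "the", "in", "of")
table(x)                  # a one-dimensional frequency table in a single call
# a second invented variable: which of two corpus parts each token came from
part <- c("A", "A", "B", "B", "A", "B", "A", "B", "A", "B")
table(x, part)            # a two-dimensional cross-tabulation, just as easily
ftable(table(x, part))    # a 'flat' display of the same table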
And, once you have mastered the fundamentals of R and face situations in which you need maximal computational power, switching to Perl or Python in those (few) cases will be easier for you anyway, especially since the languages' syntaxes are similar in many respects and the regular expressions used in this book are all Perl-compatible. (Let me tell you, though, that in all my years of using R, there were a mere two instances where I had to switch to Perl, and that was only because I didn't yet know how to solve a particular problem in R.)
Second, by learning to do your analyses with a programming language, you usually have more control over what you are actually doing: Different concordance programs have different settings or different ways of handling searches that are not always obvious to the (inexperienced) user. For instance, ready-made concordance tools often have slightly different settings that specify what "a word" is, which means you can get different results if you have different programs perform the same search on the same corpus. Yes, those settings can usually be tweaked, but that means that such a ready-made application actually requires the same attention to detail as R, and with a programming language all of your methodological choices are right there in the code for everyone to see and replicate.
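For instance, in the following made-up example, two different operational definitions of "a word" are applied to the same string; both definitions are fully explicit in the code, and they yield different results:

x <- "it's a state-of-the-art, easy-to-use tool"
# definition 1: a word is an uninterrupted sequence of letters
unlist(strsplit(x, "[^a-zA-Z]+"))    # splits it's and state-of-the-art apart
# definition 2: words may also contain apostrophes and hyphens
unlist(strsplit(x, "[^a-zA-Z'-]+"))  # keeps it's and state-of-the-art intact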
Third, if you use a particular concordancing software, you are at the mercy of its developers. If they change its behavior, its output, or its default settings, you can only hope that this is documented well and/or does not affect your results. There have been cases where even silent over-the-internet updates have changed the output of such software from one day to the next. Worse, developers might discontinue the development of a tool altogether; and let us not even consider how sorry the state of the discipline of corpus linguistics would be if a majority of its practitioners were dependent on not even a handful of ready-made corpus tools and websites that allow you to search a corpus online. Somewhat polemically speaking, being able to enter a URL and type in a search word shouldn't make you a corpus linguist.
The fourth and maybe most important reason for learning a programming language such as R is that a programming language is a much more versatile tool than any ready-made software application. For instance, many ready-made corpus tools can only offer their functionality for corpora in particular formats, and then provide only a small number of kinds of output. R, as a programming language, can handle pretty much any input and can generate pretty much any output you want; in fact, in my bootcamps, I tell participants on day 1 that I don't want to hear any questions that begin with "Can R . . . ?" because the answer is "yes." For instance, with R you can readily use the CELEX database, CHAT files from language acquisition corpora, the hierarchically layered annotation of XML corpora, previously generated frequency lists for corpora you no longer have access to, literature files from Project Gutenberg or similar sites, tabular corpus files such as those from the Corpus of Contemporary American English (http://corpus.byu.edu/coca) or the Corpus of Historical American English (http://corpus.byu.edu/coha), and so on and so forth. You can use files of whatever encoding, meaning that data from any language or writing system can be processed straightforwardly, and R's general data-processing capabilities are mostly limited only by your computer's working memory and your own abilities (rather than, for instance, by the number of rows your spreadsheet software can handle). With very few exceptions, R works identically on all three major operating systems: Linux/Unix, Windows, and Mac OS X. Once you have mastered the basic mechanisms, there is basically no limit to what you can do with it, both in terms of linguistic processing and statistical evaluation.
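As a small illustration of the point about encodings: assuming a hypothetical UTF-8-encoded corpus file, declaring the encoding when reading it in is a matter of a single argument:

# read a (hypothetical) UTF-8-encoded corpus file, declaring its encoding explicitly
corpus <- scan("corpus_utf8.txt", what = character(), sep = "\n",
   fileEncoding = "UTF-8", quote = "")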
But there are also additional important advantages to the fact that R is an open-source tool/programming language. For instance, there is a large number of functions and packages contributed by users all over the world. These often allow effective shortcuts that are not, or hardly, possible with ready-made applications, which you cannot tweak as you wish. Also, in contrast to commercial concordance software, bug-fixes are usually available very quickly. A final, obvious, and very down-to-earth advantage of using open-source software is of course that it comes free of charge: Any student and any department's computer lab can afford it, without expensive, temporally limited, or functionally restricted licenses, and without irritating ads and nag screens. All this makes a strong case for the choice of software made here.
1.2 Outline of the Book
This book has changed quite a bit from the first edition; it is now structured as follows. Chapter 2 defines the notion of a corpus and provides a brief overview of what I consider to be the most central corpus-linguistic methods, namely frequency lists, dispersion, collocations, and concordances; in addition, I ...