1 Introduction
Corpus linguistics involves the study of natural language on the basis of authentic written or spoken data stored electronically; that is, in machine-readable form. The advent of computers in the 1950s laid the foundation for this modern view of corpora, and a corpus as we know it today may accordingly be defined as a principled collection of naturally occurring texts “stored and processed on computer for the purposes of linguistic research” (Renouf 1987: 1). In essence, though, the text-based nature of using corpora in linguistics remains largely unchanged. What has changed dramatically since the arrival of the computer is the efficiency it offers to researchers for using and managing data. When, in the late 1920s, the Chinese scholar Heqin Chen (1928) compiled his corpus of 554,478 characters in order to generate a frequency list of Chinese vocabulary for primary education, he did so with the help of nine assistants (Huang and Li 2002: 68). It took them nearly three years of manual work to complete the list, which includes 4261 Chinese characters in use in the six text categories compiled for the corpus. Today, such a list could be generated in a matter of seconds on a personal computer, which can accommodate a much larger corpus, built and processed by a single researcher, with much higher levels of accuracy.
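To make the contrast with manual compilation concrete, a character frequency list of the kind described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name character_frequency_list and the two sample sentences are hypothetical, the sample is not drawn from Chen's corpus, and restricting the count to the basic CJK Unified Ideographs block is a simplification rather than a description of Chen's own procedure.

```python
from collections import Counter


def character_frequency_list(texts):
    """Return (character, count) pairs, most frequent first.

    Assumption: only characters in the basic CJK Unified Ideographs
    block (U+4E00 to U+9FFF) are counted; punctuation, digits and
    Latin letters are ignored.
    """
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    return counts.most_common()


# Hypothetical two-sentence sample, not taken from any real corpus.
sample = ["我們學習語言。", "語言學研究語言。"]
freq = character_frequency_list(sample)

for character, count in freq:
    print(character, count)
```

The same function could be applied to a corpus of any size by passing it the texts one file or one line at a time; the result is a ranked list of character types with their token counts.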
The use of authentic data is an important principle for both modern corpus linguists and researchers who used corpora prior to the arrival of the computer. Post-Bloomfieldian structural linguists such as Harris (1951) regarded a corpus as the primary source of linguistic insights. Their empirical use of large collections of recorded writings and utterances is seen as “early corpus linguistics” in some of the literature (McEnery and Wilson 2001). Likewise, then, we could place within this broader tradition of corpus linguistics the work of a scholar such as Chen (1928), who wrote his corpus-based 語體文應用字彙 [Applied Lexis of Vernacular Chinese] with a pedagogical purpose in mind, asking such questions as whether elementary school pupils actually used the vocabulary they learnt.
Reusability of data (Huang 1997) is one of the major advantages of modern corpus linguistics. When corpora are stored on the computer and are (commercially) available in the form of, for example, CD-ROMs, as was the case with the British National Corpus World Edition released in 2000, they can be accessed by a much wider research community and facilitate various types of study based on the same data. Electronic corpora are also less prone to loss of data and to damage, a predicament that befell Chen when his second hard-copy corpus and accompanying results were ruined by fires during a Chinese civil war in the early twentieth century. In addition to data dissemination and security, reusability of corpora can also be measured in terms of their content. Corpora that were manually compiled prior to the computer era were naturally limited in terms of the data sampled and the amount of text on paper that could be indexed at a given time. With progressively more linguistic resources becoming available today in both hard-copy and electronic versions, corpus compilers have the unprecedented advantage of being able to design very large corpora that address the limitation of data variety faced by linguists before the dawn of computer technology (Huang 1997; Leech 1991).
Drawing on the techniques and methodologies of corpus linguistics, an increasing number of scholars in corpus-based translation studies (CTS) are compiling their own databases or using existing databases to study translations from an empirical and descriptive point of view. In turn, CTS contributes to the study of language by building corpora of authentic translated texts that have almost never been represented in corpus linguistics (Baker 1993) during the last four decades. The now readily available corpus resources in Chinese and the relevant corpus methodology, coupled with the flow of mainstream corpus-based translation studies from the West, have largely facilitated corpus-based research on Chinese translation in the various Chinese communities. Song et al. (2013) observe that 97 articles involving corpora and translation studies were published in major Chinese journals between 1993 and 2012. Although many of these articles reported on the basics of Western corpus-based translation studies (i.e. rationale and methodology) rather than on actual empirical studies, they nevertheless successfully introduced CTS to Chinese translation researchers and practitioners. In the next section, I provide some background information on how corpus linguistics informed the study of translation (Baker 1993, 1995, 1996), and how the resulting corpus-based translation studies inspire new research agendas and the construction of translational corpora in Chinese academia. Such studies, indeed, “provide new tools, expand research scope and open up new paths” for Chinese translation studies (Luo et al. 2005: 56).
2 CTS and Chinese Translation Studies
Methodologically informed by corpus linguistics and first proposed in Baker (1...