1.1 Why Language Technology for the Humanities?
In the last two decades, the humanities have seen an unprecedented change that opens up new directions for inquiry into human cultures and their histories: the availability of digitized humanistic texts, whose potential has not yet been fully explored. Thanks to the mass digitization of analogue resources preserved in libraries and archives, large textual collections, such as Google Books, Early English Books Online, and Project Gutenberg, have become available on the World Wide Web. The rise of digital humanities as a new academic field has contributed to the proliferation of research infrastructures and centres dedicated to the study and distribution of textual resources in the humanities. The mission of digital humanities projects such as the CLARIN European Research Infrastructure, DARIAH, and the ESRC Centre for Corpus Approaches to Social Science is to make textual resources not only available but also investigable for scholars. Digital humanists have proposed the method of distant reading or macro analysis for learning from large textual resources (Jockers 2013; Moretti 2015). Alongside a growing interest in large textual resources, there is an increasing demand from (digital) humanities researchers for quantitative and computational skills. The current offering in this space is rich, with a range of training options (including dedicated summer schools like the digital humanities training events at Oxford,1 DHSI at Victoria,2 or the European Summer School in Digital Humanities in Leipzig3) and publications (examples include Bird et al. 2009; Gries 2009; Hockey 2000; Jockers 2014; Piotrowski 2012). Nonetheless, textual resources in the humanities and beyond raise a key challenge: they are too large to be read in full by the scholars who want to analyse them. The potential of exploring large textual collections has not yet been fully realized; realizing it remains a key task for the current and next generations of humanities scholars.
To explore tens of thousands of books or millions of historical documents, humanities scholars inevitably need the power of computing technologies. Among these technologies, there is one that has had and will definitely continue to have a pivotal role in the exploration of big textual resources. Language technology, which can help unlock and investigate large amounts of textual data, is a truly interdisciplinary enterprise. It is not an academic field per se; it is rather a collection of methods that deal with textual data. Language technology sits at the crossroads between corpus and computational linguistics, natural language processing and text mining, data science and data visualization. As we will demonstrate throughout this book, language technology can be used to address a great variety of research problems involved in the investigation of textual data in the humanities and beyond.
1.2 Structure of the Book
This book examines research problems that are relevant to the humanities and can be addressed with the help of language technology. The first chapter demonstrates how language technology can help structure raw textual data and represent them as a resource meaningful for both humans and computers. For instance, the lyrics of thousands of popular songs are now available in plain text on the World Wide Web. But lyrics in plain text format do not distinguish the title and the refrain of a song. This is an example of unstructured data, because the various components of a song are not marked in a way that allows computers to extract them automatically. Language technology can help detect structural components within a text, such as the refrain of a song; it can also help represent a song in digital form so that different structural components are distinguished and readily available for further computational investigations. Language technology also supports word-level investigations of textuality. The lyrics of a song consist not only of structural units, but also of different types of words such as nouns, verbs, and names of people. In plain text format, word-level information about lyrics is not readily usable by computing tools; for instance, it is not possible to automatically extract all proper names from a collection of lyrics in plain text. As Chapter 2 explains, language technology helps attach different types of information to each word of a text; it also offers ways to record this information in well-established data formats.
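To make the contrast between unstructured and structured data more tangible, the song example above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a procedure taken from the later chapters: the lyrics are invented, the first line is assumed to be the title, stanzas are assumed to be separated by blank lines, and any stanza that occurs more than once is treated as the refrain. The function name `structure_lyrics` is hypothetical.

```python
from collections import Counter

# Invented plain-text lyrics. Assumptions of this sketch: the first line is
# the title and stanzas are separated by blank lines.
RAW_LYRICS = """Song of the Sea

The tide comes in at morning
The gulls cry overhead

Row, row, the boat back home
Row until the dawn

The nets are full by evening
The harbour lights are near

Row, row, the boat back home
Row until the dawn
"""


def structure_lyrics(raw: str) -> dict:
    """Split raw lyrics into a title, a refrain, and a list of verses."""
    blocks = [b.strip() for b in raw.strip().split("\n\n") if b.strip()]
    title, stanzas = blocks[0], blocks[1:]
    counts = Counter(stanzas)
    # Heuristic (an assumption): a stanza repeated verbatim is the refrain.
    repeated = [stanza for stanza, n in counts.items() if n > 1]
    refrain = repeated[0] if repeated else None
    verses = [stanza for stanza in stanzas if stanza != refrain]
    return {"title": title, "refrain": refrain, "verses": verses}


song = structure_lyrics(RAW_LYRICS)
print(song["title"])
print(song["refrain"])
```

Once the song is held in this form, a program can ask for the refrain or the title directly, which is not possible with the raw text. Real lyrics are rarely this regular, so a heuristic of this kind would need careful checking against the sources.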
Language technology also facilitates the bottom-up exploration of textual resources and textuality. For instance, finding terms that are significant elements of a text is an important component of bottom-up explorations. We will discuss how the investigation of word frequency can support this in Chapter 3. Language technology methods can map terms closely related to a given concept in thousands of texts. This form of bottom-up exploration is discussed in Chapter 4. Language technology methods can also help in bottom-up studies of word meaning. For instance, the meaning of a concept can be investigated by drawing on a dictionary definition, but it can also be inferred from the way authors used that concept in their works. Chapter 5 examines how language technology enables this type of exploration of meaning. Finally, language technology has tools to detect patterns recurring over thousands of texts. As the proverb says, there is nothing new under the sun. Similar themes and ideas recur in texts from different historical times. However, detecting them in large textual resources is a tedious (or sometimes impossible) task for human readers. As Chapter 6 illustrates, language technology supports humans in their efforts to detect recurrence and similarity in texts.
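As a rough illustration of the word frequency investigation mentioned above, the following Python sketch counts how often each word occurs in a small set of texts. It is a simplified example under stated assumptions rather than the procedure developed in Chapter 3: the two sentences are invented stand-ins for historical sources, words are identified with a simple regular expression that only recognizes the letters a to z, and the function name `term_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Invented stand-ins for historical source texts (an assumption of the sketch).
SOURCES = [
    "The liberty of the press is the liberty of the people.",
    "Order and liberty were the words of the day.",
]


def term_frequencies(texts):
    """Count lower-cased word occurrences across all given texts."""
    counts = Counter()
    for text in texts:
        # Simplistic tokenization: runs of the letters a-z count as words.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(tokens)
    return counts


freq = term_frequencies(SOURCES)
total = sum(freq.values())

# Absolute and relative frequency of one term of interest.
liberty_count = freq["liberty"]
liberty_share = liberty_count / total
print(liberty_count, total, round(liberty_share, 2))
```

The relative frequency is often more informative than the raw count because it allows collections of different sizes to be compared. A count of this kind is only a starting point: it says how often a term occurs, not what role the term played in the texts.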
To realize the rich potential that language technology offers, humanists need to bridge two interrelated gaps. The first is the conceptual gap between humanities research problems and language technology methods. As a simple example, language technology can detect how many times a given term is used in a given set of historical sources. In more technical terms, with language technology we can study word frequency. But historians rarely ask how many times a term occurs in their source texts. Rather, they inquire about the prevailing social concepts of a given historical time. There is a conceptual gap between word frequency and the prevailing social concepts. This simple example also sheds light on the second gap, which lies between qualitative and quantitative approaches. The insights that language technology can deliver are very often quantitative and difficult to interpret within a qualitative framework. Bridging these gaps is a daunting task for scholars, and this publication seeks to assist them in this task. We believe that the potential of language technology can be realized if there is a clear understanding of the logic underlying it. The overall goal of this book is therefore to apply the logic of language technology to the resolution of humanistic research problems. We will attempt to convey this logic by following a didactic approach with three pillars.
First, we guide you through various research procedures involved in the application of language technology. The first chapter looks at the design of language resources, the first step in the application of language technology. The following chapters study specific humanities-related research problems and show how to design quantitative research procedures to address them. We believe that an understanding of how to design a research process in language technology is one of the key steps to understanding its overall logic. We do not, however, explain the technical implementation of the research procedures discussed throughout the book.4 Thanks to the development of computing tools in popular programming languages, such as Python and R, many of the technological procedures presented here have been (at least partially) automated, and their implementation can be learnt by following excellent on-line tutorials and manuals. But what is difficult to learn from on-line resources is ...