1 Introduction
Searching the World Wide Web, in its varied guises, is a way of acquiring information that is useful to many people. As the Internet continues to grow rapidly, with ever more resources such as hypertexts, documents or multimedia files being added every second, the challenge of providing users with the specific content they need becomes more and more important. Current search engines are far from perfect, as they hardly ever return completely satisfying results. One reason for this is that search engines, or more specifically text retrieval systems, usually do not consider the semantics of a text but merely conduct a statistical analysis. Search engines, for example, will tend to rank a text in which university occurs ten times above a text containing four occurrences of that item. This approach, however, cannot adequately represent the text’s content, nor is a simple “bag-of-words” approach, where each text is merely seen as containing “an unordered set of words” (Baeza-Yates & Castillo 2006: 527), sufficient (cf. Jurafsky & Martin 2009: 801). Accordingly, the results returned by Web retrieval systems are commonly not subjected to any closer examination of the text’s topics, let alone a linguistic analysis. Thus, search systems tend to use purely quantitative rather than qualitative linguistic methods.
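The purely quantitative ranking described above can be illustrated with a minimal sketch. It does not reproduce the method of any particular search engine; the function names `bag_of_words` and `rank_by_term`, the simple tokenisation and the sample texts are assumptions made for illustration only.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a text to an unordered multiset of lower-cased word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def rank_by_term(texts, term):
    """Order text names by how often `term` occurs, most frequent first."""
    counts = {name: bag_of_words(body)[term.lower()] for name, body in texts.items()}
    return sorted(counts, key=lambda name: counts[name], reverse=True)


# Hypothetical sample texts: one with four, one with ten occurrences.
texts = {
    "four_mentions": "university " * 4,
    "ten_mentions": "university " * 10,
}
ranking = rank_by_term(texts, "university")
print(ranking)

# Word order is lost: both sentences yield the same bag of words.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")
print(same_bag)
```

In this sketch the text with ten occurrences is ranked first solely because of the count, and two sentences with different meanings become indistinguishable once word order is discarded.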
One approach to analysing a text linguistically is to investigate its cohesion, and here, more specifically, to pay attention to anaphors. The aim of this book is to outline the anaphors of English and to examine to what extent they are worth considering in text retrieval systems. Although anaphors and their resolution are a highly debated issue in present research, there are few studies that explore anaphors in the context of text retrieval. Even research in the field of anaphora resolution that is not intended for text retrieval shows a number of deficiencies.
To start with, a comprehensive classification of anaphor types based on linguistic description and also with regard to text retrieval systems is missing. Text retrieval systems would profit from a thorough examination because more precise rules for resolving anaphors could be formulated. The standard work for computational anaphora resolution is Mitkov’s book Anaphora Resolution (2002). However, Mitkov’s classification is not satisfactory from a linguistic point of view because it does not take into account the many different types of anaphors and their features.
A further weakness is that in the discussion of anaphor types, no approach pays attention to non-finite clause anaphors. Not only are non-finite clauses disregarded as one type of anaphor in existing text retrieval systems, but they are even frequently ignored as one type of anaphor in linguistics, e.g. in Stirling & Huddleston (2010). Although Quirk et al. (2012: 910) mention non-finite clauses as a special type of ellipsis, they do not discuss whether non-finite clauses are a type of anaphor or not.
An additional shortcoming lies in the scarcity of annotated corpora. The few corpora that are annotated are mostly not freely accessible. Furthermore, current corpus annotations cover only a few anaphor types, which is why these corpora are not adequate for the examination undertaken in this book.
All in all, there are many desiderata in the field of anaphora resolution. This book will contribute to their investigation from a linguistic and computational point of view. It draws particularly on syntactic, text linguistic and corpus linguistic methods as well as on methods from text retrieval and natural language processing. This book will first examine the question of what anaphor types the English language shows. For that purpose, a linguistic definition of anaphors is needed (cf. chapter 2), before a classification of anaphor types is presented (cf. chapter 3). Second, the frequency of each type of anaphor in hypertexts will be analysed (cf. chapter 4). From these insights, research questions for computational anaphora resolution can then be formulated (cf. chapter 4.5).
In more detail, the book is structured as follows. The second chapter will define anaphors and discuss related concepts. It will conclude with six conditions or characteristics of anaphors, all of which have to apply to items in order for them to be regarded as anaphors. In the third chapter, the twelve types of anaphors will be described in detail. The grammatical features of each anaphor type will be explained in depth, which is subsequently also of importance for computational anaphora resolution. The fourth chapter will examine the frequency of anaphors in hypertexts. Here, a corpus including different types of hypertexts will be introduced and statistically investigated with regard to anaphor types. The fifth chapter will describe text retrieval systems, both in general and specifically those for retrieving hypertexts from the Internet, as well as the types of natural language processing methods these systems use. The sixth chapter will then present computational anaphora resolution, i.e. current approaches and applications, and the structure and evaluation of anaphora resolution systems. In the last chapter, non-finite clause anaphors will be analysed with respect to computational anaphora resolution, applying the insights of chapter four about the frequency of anaphors. Rules for identifying anaphors as well as for assigning antecedents will be established. Finally, the results will be discussed and perspectives for future research will be offered.
2 Linguistic fundamentals of anaphors and anaphora
2.1 Basic definitions
The word anaphora originates from Greek ana- (‘back’) and pherein (‘to bear’) and entered English via Latin transmission (cf. “Anaphora” 2010). In English, it is documented for the first time in 1589 (cf. Simpson & Weiner 1989: 436-437):
Anaphora, or the Figure of Report. Repetition in the firſt degree we call the figure of Report according to the Greeke originall, and is when we make one word begin, and as they are wont to ſay, lead the daunce to many verſes in ſute, as thus.
To thinke on death it is a miſerie,
To think on life it is a vanitie:
To thinke on the world verily it is,
To thinke that heare man hath no perfit bliſſe.
(Puttenham 1589: 165)
“Anaphora” here denotes the rhetorical figure of repetition. The first written evidence of a use in grammar is not found until 1933, when the term appeared in Bloomfield’s work Language:
[W]hen we say Ask that policeman, and he will tell you, the substitute he means, among other things, that the singular male substantive expression which is replaced by he, has been recently uttered. A substitute which implies this, is an anaphoric or dependent substitute, and the recently-uttered replaced form is the antecedent. (Bloomfield 1984: 249)
Later he gives another example:
The word one […] replaces a with anaphora of the noun […] when no other modifier is present (Here are some apples; take one); […] it is the anaphoric substitute for nouns after an adjective, and in this use forms a plural, ones (the big box and the small one, these boxes and the ones in the kitchen […]). (ibid.: 265-266)
As for derivations, the adjective anaphoric and the adverb anaphorically are first mentioned in 1914 (cf. Bloomfield 1984: 249-251; Simpson & Weiner 1989: 436-437). According to the Oxford Dictionary of English (2005, 2nd rev. ed.), the noun anaphor has its origin in a backformation of anaphora, which dates back to the 1970s (cf. Soanes & Stevenson 2005: 55).
In current dictionaries, the word anaphora is often divided into different senses, depending on its use in various contexts. First, the term denotes a part of the mass in liturgics. Second, “anaphora” describes the “repetition of a word or phrase at the beginning of successive clauses, lines of verse, etc.” (Agnes et al. 2007: 51) in rhetoric (cf. Wilpert 2001: 27). It is thus still used in the sense in which it was first attested (cf. Puttenham 1589, above). Third, “anaphora” is used in music for the repetition of a voice, usually the bass (cf. Bartel 2007: 90-95).
The fourth (grammatical) definition is of importance here: “anaphora” is “[t]he use of a word which refers to, or is a substitute for, a preceding word or group of words” (Simpson & Weiner 1989: 436). According to Valentin (1996: 179), this meaning has developed from the use of anaphora in rhetoric. The following example illustrates what an anaphor is in the grammatical sense of the word:
- (1) Susan plays the piano. She likes music.
In example (1), the word she is an anaphor and refers back to a preceding expression, in this case Susan. As can be seen in this example, an anaphor is an item that commonly points backwards. Anaphors derive their interpretation from the expressions they refer to because their own meaning is often rather general (cf. Finch 2005: 199-200; Trask & Stockwell 2007: 16-17; Huddleston 2010a: 68; Quirk et al. 2012: 335, 862). This becomes obvious if the second sentence in example (1), She likes music, appears on its own. In such circumstances, it is not possible to find out the person meant by she. We can only state that it is most likely a female person. But if both sentences are present, she is undoubtedly used in place of Susan.
The linguistic element or elements to which an anaphor refers are called the “antecedent”. The antecedent in the preceding example is the expression Susan. The relationship between anaphor and antecedent is termed “anaphora” (cf. Huddleston 2010a: 68-69). “Anaphora resolution” or “anaphor resolution” is the process of finding the correct antecedent of an anaphor (cf. Kübler n.d.: 5; Mitkov 2004a: 269; Crystal 2009: 25). In addition, so-called “anaphoric chains” can arise if anaphors are themselves antecedents. In example (2), the anaphor she refers to the antecedent Ann, and she is also the antecedent of herself (cf. Halliday & Hasan 2008: 15, 52; Stirling & Huddleston 2010: 1457).
- (2) Ann knew that she had written the letter herself.
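An anaphoric chain of this kind can be pictured as a series of links, each leading from an anaphor to its antecedent. The following sketch is an illustration only: the names `antecedent_of` and `resolve_chain` are hypothetical, the links for example (2) are entered by hand, and no resolution algorithm is implied.

```python
def resolve_chain(item, antecedent_of):
    """Follow antecedent links from `item` back to the first antecedent.

    Returns the chain as a list that starts with `item` itself.
    """
    chain = [item]
    seen = {item}
    while chain[-1] in antecedent_of:
        nxt = antecedent_of[chain[-1]]
        if nxt in seen:  # guard against circular links in the input
            raise ValueError("circular anaphoric link")
        chain.append(nxt)
        seen.add(nxt)
    return chain


# Hand-entered links for example (2): herself -> she -> Ann
antecedent_of = {"herself": "she", "she": "Ann"}
chain = resolve_chain("herself", antecedent_of)
print(chain)  # ['herself', 'she', 'Ann']
```

Because she appears both as a value and as a key in the mapping, it functions as an anaphor and as an antecedent at the same time, which is exactly what makes the chain possible.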
Another central aspect of anaphors is that they can vary with regard to the importance of the antecedent for determining reference. Anaphoric noun phrases with a definite article are a case in point. They can, for instance, have antecedents that are not needed for determining the referent of the anaphor, as is shown here:
- (3) I went to an amusing show recently where I met two friends. As they were sitting next to me during the show [1] I was able to ask them about the presenter. However, they could not tell me anything about the show [2].
In this example, the second anaphor the show [2] has the antecedent the show [1]. At first sight, the second anaphor [2] does not seem to gain new information through this relation to the antecedent [1]. But as the antecedent [1] itself is an anaphor and refers to an amusing show, the second anaphor [2] also gains information through these links. In consequence, it makes sense that the second anaphor [2] is interpreted in relation to its identical antecedent [1] (cf. Quirk et al. 2012: 1464-1465).
Recognising anaphors whose antecedents are literally identical with themselves is also important for computational anaphora resolution systems because anaphoric chains can be established through that process. Additionally, when anaphoric chains are detected, the distance between an anaphor and its antecedent does not become unnaturally large. Stirling & Huddleston (2010) argue:
There can be a very large distance between the first antecedent in a chain and the final anaphor, greater than would typically be permitted for a direct link: it is the intermediate links that keep the referent salient in the context of discourse so that reference to it can be made by means of a personal pronoun or other anaphor with little intrinsic content. (ibid.: 1457)
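The effect of intermediate links on distance can be made concrete with a small sketch based on example (3). The sentence indices are assigned by hand, and both the function name `link_distances` and the choice of sentences as the unit of distance are assumptions made for illustration.

```python
def link_distances(mentions):
    """Distances between consecutive mentions of one chain, in sentences."""
    return [later - earlier for earlier, later in zip(mentions, mentions[1:])]


# Sentence indices of the chain in example (3), assigned by hand:
# "an amusing show" (1), "the show [1]" (2), "the show [2]" (3)
show_chain = [1, 2, 3]

span = show_chain[-1] - show_chain[0]           # first antecedent to final anaphor
longest_link = max(link_distances(show_chain))  # largest single step in the chain
print(span, longest_link)  # 2 1
```

The distance from the first antecedent to the final anaphor grows with the length of the chain, whereas each individual link stays short as long as the intermediate mentions are recognised.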
With regard to ...