Theoretical Issues in Natural Language Processing
eBook - ePub

Yorick Wilks (Editor)

  1. 264 pages
  2. English

About This Book

Accompanying the continued industrial production and sale of artificial intelligence and expert systems is the risk that difficult and resistant theoretical problems and issues will be ignored. The participants at the Third TINLAP Workshop, whose contributions are contained in Theoretical Issues in Natural Language Processing, remove that risk. They discuss and promote theoretical research on natural language processing, examine solutions to current problems, develop new theories, and survey the published literature on the subject. Discussions among these theoreticians in artificial intelligence, logic, psychology, philosophy, and linguistics draw a comprehensive, up-to-date picture of the natural language processing field.


Information

Year: 2018
ISBN: 9781317717553

Chapter 1
Words & World Representations

Chapter 1.1
The World of Words

Donald E. Walker
In the Beginning Was the Word! It is only appropriate that a section on "Words and World Representations" should start with this reminder. It is clear that word and world are closely coupled concepts. While their similarity in form inspired me to try to trace them back to some common root, none of the dictionaries available to me sustains that conceit, although the Oxford English Dictionary does show word as one of the forms of world in some English-speaking regions during the period 1300-1600.
Selecting words allows us to distinguish different aspects of the world, but the relationship is not a simple one. It would be gratifying to believe that words reflect the form of language and that the world represents their content. However, the history of philosophy reveals that people differ in what these terms mean and which should be taken to be primary, so no simple solution is available to us. It is clear, though, that the shared concern among the participants in this section is to explore the nature of the relationships between the lexicon and knowledge representation.
There have been recent intensifications of interest in the lexicon and in knowledge representation, motivated by both theoretical considerations and application demands. From a theoretical perspective, the lexicon is beginning to assume a new and increasingly important role within linguistics. Correspondingly, knowledge representation has become the dominant concern for artificial intelligence. From the standpoint of applications, the movement toward production systems (for computational linguistics, the development of natural language interfaces; for artificial intelligence, the emergence of expert systems) is requiring practical ways of realizing the underlying concepts. As a result, there is now an interplay between science and engineering in our fields that I view as gratifying. The history of science clearly demonstrates that the physical and biological sciences took off exponentially when the interaction between theory and applications began. I am hoping for similar progress for the language and knowledge sciences.
The authors in this section are addressing a number of critical issues: What are the appropriate units for words and worlds, both theoretically and practically? Will current systems scale up as we add more words and facts? How can we acquire lexical and knowledge elements in a form that our systems can use? Can material assembled for one application be easily adapted to another? Can we take advantage of existing resources in building the capabilities required?
The section contains a computational linguist, Bran Boguraev from Cambridge University; a computational lexicologist, Bob Amsler from Bellcore; an artificial intelligence specialist, Jerry Hobbs; and a linguist, Judy Kegl. My own remarks are made from a position at the intersection of computational linguistics, artificial intelligence, and information science.
Boguraev talks about "The Definitional Power of Words," reflecting his work on building frames for knowledge representation from dictionary definitions. He and his colleagues have demonstrated that it is possible to establish lexical primitives through a taxonomic analysis of the core vocabulary in a controlled-vocabulary 'learner's' dictionary. One particularly exciting achievement is their use of the grammatical information, in the form provided by the Longman Dictionary of Contemporary English, as the lexical base for the PATR-II parser developed at SRI International.
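To make the idea of a taxonomic analysis of defining vocabulary concrete, here is a minimal sketch, assuming invented toy definitions and a deliberately naive genus-finding heuristic; it illustrates the general technique of chaining genus terms into a hierarchy, not the actual procedure Boguraev and his colleagues applied to the Longman dictionary.

```python
# Toy sketch of the taxonomic idea behind analysing dictionary definitions:
# treat the first noun after a leading determiner as the genus term of a
# definition and chain those links into a hypernym hierarchy. The defining
# texts and the extraction heuristic are invented for illustration only.

DEFINITIONS = {
    "spaniel": "a dog with long ears and a silky coat",
    "dog": "an animal commonly kept as a pet or for work",
    "animal": "a creature that is alive and can move about",
}

def genus(definition):
    """Heuristic: take the word following an initial determiner as the genus term."""
    words = definition.split()
    return words[1] if words[0] in ("a", "an", "the") else words[0]

def hypernym_chain(word):
    """Follow genus links upward until a word has no definition or a cycle appears."""
    chain = [word]
    while word in DEFINITIONS:
        word = genus(DEFINITIONS[word])
        if word in chain:          # guard against circular definitions
            break
        chain.append(word)
    return chain

if __name__ == "__main__":
    print(hypernym_chain("spaniel"))   # ['spaniel', 'dog', 'animal', 'creature']
```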
Amsler in "Words and Worlds" describes his work on the analysis of large text files to determine the lexicons needed to understand them. Existing machine-readable dictionaries do not have sufficient coverage for the task. Two special problems are being addressed. The variety of forms in which proper nouns can be expressed suggests that grammars need to be written for them and parsers developed that can apply those rules. Another important area is phrase collocations: the identification of sequences of words that function as lexical units and should be treated as entries in a system's lexicon.
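One generic way to surface candidate collocations, sketched below, is to score adjacent word pairs by pointwise mutual information so that pairs co-occurring more often than chance rise to the top; the scoring choice and the toy corpus are illustrative assumptions, not Amsler's specific method.

```python
# Minimal sketch of identifying candidate phrase collocations: score adjacent
# word pairs with pointwise mutual information (PMI). A standard illustration
# of the idea, not a reconstruction of any particular system.
import math
from collections import Counter

def collocation_candidates(tokens, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / (n - 1)
        p_independent = (unigrams[w1] / n) * (unigrams[w2] / n)
        scored.append(((w1, w2), math.log2(p_pair / p_independent)))
    return sorted(scored, key=lambda item: -item[1])

if __name__ == "__main__":
    text = ("the machine readable dictionary and the machine readable text "
            "corpus both require machine readable entries").split()
    for pair, score in collocation_candidates(text):
        print(pair, round(score, 2))
```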
Hobbs considers "World Knowledge and Word Meaning." He believes that it is necessary to explicate the commonsense knowledge people have about the world in order to understand the words they use. His group has been axiomatizing a set of core theories that underlie the way people communicate about objects and events. They have been analyzing texts about mechanical devices and their failures and are developing ways of codifying notions like time, space, causality, shape, and force.
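As a toy illustration of what codified causal knowledge about a mechanical device might look like in program form, the sketch below chains invented cause-and-effect rules forward to explain a failure; Hobbs' core theories are expressed as first-order axioms, so this is only a loose analogue under assumed rules and device names.

```python
# Toy sketch: commonsense causal rules about a hypothetical pump, chained
# forward to explain how an initial fault propagates to a failure. The rules
# are invented; they are not drawn from Hobbs' core theories.

CAUSAL_RULES = [
    ("seal_worn", "fluid_leak"),
    ("fluid_leak", "pressure_drop"),
    ("pressure_drop", "pump_failure"),
]

def downstream_effects(fact):
    """Return every event causally downstream of the given fact."""
    frontier, effects = [fact], []
    while frontier:
        current = frontier.pop()
        for cause, effect in CAUSAL_RULES:
            if cause == current and effect not in effects:
                effects.append(effect)
                frontier.append(effect)
    return effects

if __name__ == "__main__":
    print(downstream_effects("seal_worn"))
    # ['fluid_leak', 'pressure_drop', 'pump_failure']
```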
Kegl addresses "The Boundary between Word Knowledge and World Knowledge." Her work entails applying linguistic knowledge to world knowledge to yield word knowledge. She and her colleagues are deriving semantic primitives by determining how the languages of the world encode expressions of objects and events. Their goal is to develop ways of representing lexical items that are neutral with respect to a particular theory, so that lexicons can be shared by different projects.
My own remarks about "The World of Words" reflect a concern for organizing the groups working on the problems that the other authors have considered. I discuss some of the critical issues, point out the broad range of activities that are currently underway, and propose a mechanism for coordinating them.
There are a large number of protagonists involved in the topic area "Words and World Representations": lexicographers, lexicologists, computational linguists, artificial intelligence specialists, linguists, cognitive scientists, philosophers, computer scientists, information scientists, publishers, lexical software marketers, translators, industry representatives, funding agency representatives, and professional society representatives. These groups have different backgrounds, motivations, and criteria for evaluation. Yet all have to be convinced that they must work together on shared problems toward a common set of goals.
One of the shared problems is the establishment of multifunctional, polytheoretical databases. There are increasing numbers of dictionaries in machine-readable form, and, as a byproduct of the electronic photocomposition of newspapers, books, magazines, and office documentation, it is possible to get online access to millions of words of text. However, we need to structure these materials so they can be used for many different purposes by people who have different theoretical orientations.
Equally important is the development of computational tools for manipulating these data for the range of purposes described by the panelists, as well as to satisfy the requirements of the other protagonists identified above.
To support people who want to use these tools to process the varieties of data available, we have to develop workstation environments that do not require computer sophistication. More important, those environments must be tailored to the special objectives of the users. Although there are similarities between the requirements of lexicographers and translators, for example, the differences are important to identify so that we can model workstations for lexicographers and translators that satisfy their distinctive needs.
A number of exciting activities have been taking place recently, bringing people together to examine issues involving the lexicon. In May 1986, Antonio Zampolli, Nicoletta Calzolari, Juan Sager, Loll Rolling, and I organized a workshop on "Automating the Lexicon: Research and Practice in a Multilingual Environment." Held in Marina di Grosseto, Italy, the workshop explored research efforts, current practice, and potential developments in work on the lexicon, machine-readable dictionaries, and lexical knowledge bases, with special consideration of the problems created by working with different languages. We wanted both to identify the current state of affairs and to recommend directions for future activities. We solicited papers that would examine in depth a set of research areas, core problems, application areas, and developing lexical knowledge bases. The papers prepared for the meeting surveyed the role of the lexicon in linguistics, semantics, parsing, generation, lexicography, translation, teaching, psycholinguistics, information retrieval, office automation, and dictionaries for the mass market. They also examined the problems of developing data and knowledge bases that would support the expansion and more effective use of lexical information. As a result we were able to establish a baseline for future work in the field. Of particular value was our bringing together a distinguished group of participants representing the range of protagonists mentioned above.
This workshop was followed immediately by another one on "The Lexical Entry," which was held in conjunction with the Summer Linguistic Institute at the City University of New York in July 1986. The focus there was to determine how different theoretical frameworks and system implementations influence the format for a lexical entry. We were interested in characterizing a general representation or "metaformat" that would subsume the specific ones. The expectation was not that everyone would agree to share a single model. Rather, we hoped to identify the range of parameters that are used and from which different approaches would make a selection.
That workshop, which included primarily research specialists, led to the establishment of a much smaller and more sharply focused "Pisa Working Group" that attempted to create lexical entries that would actually satisfy the requirements of different theoretical approaches. Several meetings were held, and the results served to motivate another large workshop on "The Lexicon in Theoretical and Computational Perspective," at the Stanford Linguistic Institute in July 1987. It had two objectives: (1) the establishment of a structure for lexical entries that will be neutral with respect to theoretical differences so that the information they contain can be shared; (2) the development of community-wide resources that will be widely accessible. The workshop was organized around working groups on syntax, semantics, morphology, and data and knowledge base design.
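To give a concrete, if hypothetical, sense of what a theory-neutral entry might look like, the sketch below stores a lexical entry as a bundle of named attributes from which different frameworks project only the fields they need; the field names are illustrative assumptions rather than anything the working groups actually adopted.

```python
# Hypothetical sketch of a theory-neutral lexical entry: a bundle of named
# attributes from which each framework projects only the fields it uses.
# All field names are illustrative assumptions, not an agreed metaformat.

ENTRY = {
    "headword": "give",
    "category": "verb",
    "morphology": {"past": "gave", "past_participle": "given"},
    "argument_structure": ["agent", "theme", "recipient"],
    "subcategorization": [["NP", "NP"], ["NP", "PP-to"]],
}

def project(entry, fields):
    """Return just the attributes a particular theory or application needs."""
    return {field: entry[field] for field in fields if field in entry}

if __name__ == "__main__":
    # A phrase-structure parser might need category and subcategorization,
    # while a semantic component might need only the argument structure.
    print(project(ENTRY, ["headword", "category", "subcategorization"]))
    print(project(ENTRY, ["headword", "argument_structure"]))
```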
Collateral developments during this period have been the establishment of an ad hoc working group on "Computational Lexicology and Lexicography" by the European Science Foundation and the establishment of a specialist working group on "Dictionaries and the Computer" by EURALEX, the European Association for Lexicography. A panel on "The Lexicon in a Multilingual Environment" was held at COLING '86 in Bonn in August 1986. A conference on "Standardization in Lexicography" took place in Saarbruecken in October 1986, and one on "Advances in Lexicology" was organized by the Centre for the New Oxford English Dictionary in Waterloo, Ontario, in November 1986.
Two other related activities were a special double issue on the Lexicon in "Computational Linguistics" during 1987 and a summer school on "Computational Lexicography and Lexicology" in Summer 1988, organized by the European Science Foundation.
This broad range of activities, coupled with the convening of this panel, testifies to the vitality of work on the lexicon. However, the complexities of the issues entailed in accomplishing the goals under consideration demand a more coordinated effort. Bob Amsler likened the organization required to similar efforts in government or industrial circles, which can be characterized as involving a series of directorates, offices, and projects.
Procurement is one major issue. Organized efforts are needed to acquire data, to catalog relevant literature, and to identify people with the necessary skills. A variety of massive databases are essential; they need to include texts, spoken language, graphics, and images. Legal and copyright issues constitute a major set of problems to be considered systematically.
Capturing lexical knowledge is another concern. Establishing a common, shareable lexicon is critical, as is embedding it in the context of a 'universal' grammar. We need to model lexicographers and lexicologists to better understand how they work. Multilingual studies are essential to ensure generality.
Communication and collaboration are central to this effort. Consequently, we need to be concerned with standards and terminology, with effective electronic communication networks, and with procedures for data dissemination. Coordination is essential to avoid duplicating research projects and to ensure that they complement each other. Conferences and workshops need to be organized to bring people together. Curriculum development and training aids will further the spread of this technology.
Hardware and software support are required. We need to be concerned with workstation design, the development of more effective programming languages that can handle the string and structural features of language, the establishment of database management procedures for massive text files, and the creation of distributed storage systems with fast access times.
These brief statements summarize a complex organizational problem. The Grosseto Workshop, referred to above, actually identified more than 30 major projects that would benefit research and practice in work on the lexicon. The "Introduction" to the Proceedings of that Workshop will provide a more detailed analysis with recommendations for further activities (Walker, forthcoming).

Chapter 1.2
The Definitional Power of Words

Branimir K. Boguraev
I am deliberately introducing ambiguity into the title. Part of my thesis in this brief note is going to be that there is a wealth of information relevant to a range of natural language processing functions available and extractable from the definitions of words found in obvious places like dictionaries. This is hardly surprising, given that what is to be found in a dictionary is essentially the result of a substantial amount of work analyzing and collating data about real language, eliciting the collocational and distributional properties of words, and applying certain common principles for defining their meaning. Furthermore, I am going to argue that a carefully exploited interplay between the notions of "words" and "primitives" can add substantial leverage to the functionality and coverage of a natural language processing system.
Several factors and related phenomena underlie the current interest in words, and consequently in word resources. Over the last decade there has been the emergence of theories of grammar and grammatical frameworks (e.g., LFG, GPSG, PATR-II, FUG, Lexicon Grammar, Word Grammar) placing heavy emphasis on the lexicon, where elaborate information about the grammatical and logical idiosyncrasies of words is stored and used to drive various parsing systems. More relevant to this panel, however, is the progress in both the practical aspects of natural language processing (various techniques for, e.g., performing text analysis or building and customizing natural language interfaces) and the theoretical issues of knowledge representation and access.
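A minimal sketch of what it means for lexical information to drive parsing, assuming invented entries and frame labels rather than any particular framework's notation: each verb entry lists the complement sequences it licenses, and the parser consults the lexicon instead of hard-coding those facts in its rules.

```python
# Minimal sketch of a lexicon-driven check: each verb entry carries its
# subcategorization frames, and a parser consults them to decide whether a
# candidate complement sequence is licensed. Entries and frame labels are
# invented for illustration, not drawn from any particular framework.

LEXICON = {
    "sleep":  [[]],                            # no complements
    "devour": [["NP"]],                        # requires an object NP
    "put":    [["NP", "PP"]],                  # object NP plus a locative PP
    "give":   [["NP", "NP"], ["NP", "PP"]],    # double object or NP + PP
}

def licenses(verb, complements):
    """True if the verb's lexical entry licenses the given complement sequence."""
    return list(complements) in LEXICON.get(verb, [])

if __name__ == "__main__":
    print(licenses("devour", ["NP"]))       # True
    print(licenses("put", ["NP"]))          # False: 'put' also needs a PP
    print(licenses("give", ["NP", "PP"]))   # True
```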
It would not be too provocative to state that the current understanding of how to go about building practical systems is sufficient to make such a task tractable. However, realistic natural language processing programs fall in the general class of knowledge-based systems in AI, and they all require significant amounts of structured knowledge about the real world, as well as about a particular domain of discourse. There are typically two problems here, one related to scaling a prototype up by expanding its knowledge base, and the other related to the activity of transporting or customizing an existing system. In both cases the real culprit is the knowledge acquisition bottleneck. Given the online availability of suitable machine-readable resources, namely dictionaries and encyclopaedias, there is strong hope that some model of the common world may be localized and extracted from such sources. Even if individual applications may require additional elaboration of their knowledge bases and the introduction of specialized terms and concepts, these will still have to be related to the common world knowledge. A growing mass of work at present is focussed on making some use of the definitional component of a dictionary entry, where the dictionary itself is regarded as a knowledge base, albeit presented in a loose and not very structured fashion. Starting with the assumption that dictionary definitions both employ and imply a taxonomy of genus terms, the ultimate goal is to relate natural language words to this underlying structure which relates together the defining concepts in the dictionary. There are many problems here, ranging from the arbitrariness of dictionary definitions, to the distribution of a particular piece of data over a number of separate entries, to the fundament...
