Part 1 Basics
Welcome to the beautiful and exciting world of natural language processing (NLP)! NLP is a subfield of artificial intelligence (AI) that concerns computational approaches to processing, understanding, and generating human languages. NLP is used in many technologies you interact with in your daily life—spam filtering, conversational assistants, search engines, and machine translation. This first part of the book is intended to give you a gentle introduction to the field and bring you up to speed with how to build practical NLP applications.
In chapter 1, we’ll begin by introducing the “what” and “why” of NLP—what is NLP, what is not NLP, how NLP technologies are used, and how it’s related to other fields of AI.
In chapter 2, you’ll build a complete, working NLP application—a sentiment analyzer—within an hour with the help of a powerful NLP framework, AllenNLP. You’ll also learn to use basic machine learning (ML) concepts, including word embeddings and recurrent neural networks (RNNs). Don’t worry if this sounds intimidating—we’ll introduce you to the concepts gradually and provide an intuitive explanation.
Chapter 3 provides a deep dive into one of the most important concepts for deep learning approaches to NLP: word and sentence embeddings. The chapter demonstrates how to use them and even how to train them using your own data.
Chapters 4 and 5 cover fundamental NLP tasks, sentence classification and sequence labeling. Though simple, these tasks have a wide range of applications, including sentiment analysis, part-of-speech tagging, and named entity recognition.
This part familiarizes you with some basic concepts of modern NLP, and we’ll build useful NLP applications along the way.
1 Introduction to natural language processing
This chapter covers
- What natural language processing (NLP) is, what it is not, and why it’s such an interesting, yet challenging, field
- How NLP relates to other fields, including artificial intelligence (AI) and machine learning (ML)
- What typical NLP applications and tasks are
- How a typical NLP application is developed and structured
This is not an introductory book to machine learning or deep learning. You won’t learn how to write neural networks in mathematical terms or how to compute gradients, for example. But don’t worry even if you have no idea what these are. I’ll explain those concepts as needed, not mathematically but conceptually. In fact, this book contains no mathematical formulae—not a single one. Also, thanks to modern deep learning libraries, you don’t really need to understand the math to build practical NLP applications. If you are interested in learning the theories and the math behind machine learning and deep learning, you can find a number of great resources out there.
But you do need to be comfortable writing Python and at least familiar with its ecosystem. However, you don’t need to be an expert in software engineering topics. In fact, this book’s purpose is to introduce software engineering best practices for developing NLP applications. You also don’t need to know NLP in advance. Again, this book is designed to be a gentle introduction to the field.
You need Python version 3.6.1 or higher and AllenNLP 2.5.0 or higher to run the code examples in this book. Note that we do not support Python 2, mainly because AllenNLP (https://allennlp.org/), the deep natural language processing framework I’m going to heavily use in this book, supports only Python 3. If you haven’t done so, I strongly recommend upgrading to Python 3 and familiarizing yourself with the latest language features such as type hints and new string-formatting syntax. This will be helpful, even if you are developing non-NLP applications.
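As a quick illustration of the Python 3 features just mentioned, here is a minimal sketch that uses type hints and f-strings. The function and its data are made up purely for illustration; they are not part of AllenNLP or any library used later in the book:

```python
from typing import List

def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens (a naive whitespace tokenizer)."""
    return text.lower().split()

tokens = tokenize("NLP is fun")
# f-strings (Python 3.6+) embed expressions directly inside string literals
print(f"Found {len(tokens)} tokens: {tokens}")
```

Type hints like `text: str` and `-> List[str]` don’t change how the code runs, but they document intent and let tools catch mistakes early, which pays off in larger NLP codebases.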
Don’t worry if you don’t have a Python development environment ready. Most of the examples in this book can be run via the Google Colab platform (https://colab.research.google.com). You need only a web browser to build and experiment with NLP models!
This book will use PyTorch (https://pytorch.org/) as its main choice of deep learning framework. This was a difficult decision for me, because several deep learning frameworks are equally great choices for building NLP applications, namely, TensorFlow, Keras, and Chainer. A few factors make PyTorch stand out among those frameworks—it’s a flexible and dynamic framework that makes it easier to prototype and debug NLP models; it’s becoming increasingly popular within the research community, so it’s easy to find open source implementations of major models; and the deep NLP framework AllenNLP mentioned earlier is built on top of PyTorch.
1.1 What is natural language processing (NLP)?
NLP is a principled approach to processing human language. Formally, it is a subfield of artificial intelligence (AI) that refers to computational approaches to process, understand, and generate human language. It is part of AI because language processing is considered a core part of human intelligence. The use of language is arguably the most salient skill that separates humans from other animals.
1.1.1 What is NLP?
NLP includes a range of algorithms, tasks, and problems that take human-produced text as an input and produce some useful information, such as labels, semantic representations, and so on, as an output. Other tasks, such as translation, summarization, and text generation, directly produce text as output. In any case, the focus is on producing some output that is useful per se (e.g., a translation) or as input to other downstream tasks (e.g., parsing). I’ll touch upon some popular NLP applications and tasks in section 1.3.
You might wonder why NLP explicitly has “natural” in its name. What does it mean for a language to be natural? Are there any unnatural languages? Is English natural? Which is more natural: Spanish or French?
The word “natural” here is used to contrast natural languages with formal languages. In this sense, all the languages humans speak are natural. Many experts believe that language emerged naturally tens of thousands of years ago and has evolved organically ever since. Formal languages, on the other hand, are languages invented by humans, with strictly and explicitly defined syntax (i.e., what is grammatical) and semantics (i.e., what expressions mean).
Programming languages such as C and Python are good examples of formal languages. These languages are defined in such a strict way that it is always clear what is grammatical and ungrammatical. When you run a compiler or an interpreter on the code you write in those languages, you either get a syntax error or not. The compiler won’t say something like, “Hmm, this code is maybe 50% grammatical.” Also, the behavior of your program is always the same if it’s run on the same code, assuming external factors such as the random seed and the system states remain constant. Your interpreter won’t show one result 50% of the time and another the other 50% of the time.
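You can see this strictness for yourself by asking Python whether a string of code is grammatical. The following is a small illustrative sketch using the built-in compile() function; the helper name is made up for this example:

```python
def is_grammatical(code: str) -> bool:
    """Return True if the string parses as valid Python, False otherwise."""
    try:
        compile(code, "<string>", "exec")
        return True
    except SyntaxError:
        return False

print(is_grammatical("x = 1 + 2"))  # True: valid Python
print(is_grammatical("x = + = 2"))  # False: a syntax error, with no middle ground
```

The answer is always a clean yes or no. As the next paragraph shows, human languages offer no such guarantee.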
This is not the case for human languages. You can write a sentence that is maybe grammatical. For example, do you consider the phrase “The person I spoke to” ungrammatical? There are some grammar topics where even experts disagree with each other. This is what makes human languages interesting but also challenging, and why the entire field of NLP even exists. Human languages are ambiguous, meaning that their interpretation is often not unique. Both structure (how sentences are formed) and semantics (what sentences mean) can be ambiguous in human language. As an example, let’s take a close look at the next sentence:
He saw a girl with a telescope.
When you read this sentence, who do you think has a telescope? Is it “he,” who’s using a telescope to see a girl (from somewhere far away), or the girl, who has a telescope and is seen by him? There seem to be at least two interpretations of this sentence, as shown in figure 1.1.
Figure 1.1 Two interpretations of “He saw a girl with a telescope.”
The reason you are confused upon reading this sentence is because you don’t know what the phrase “with a telescope” is about. More technically, you don’t know what this prepositional phrase (PP) modifies. This is called a PP-attachment problem and is a classic example of syntactic ambiguity. A syntactically ambiguous sentence has more than one interpretation of how the sentence is structured. You can interpret the sentence in multiple ways, depending on which structure you assume.
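The two readings can be made concrete by counting parses with a toy grammar. The following sketch uses CYK-style chart parsing over a hypothetical mini-grammar written just for this one sentence; real parsers use far larger grammars and statistical models, but the counting idea is the same:

```python
from collections import defaultdict

# A toy grammar in Chomsky normal form, hand-written just for this sentence.
lexical = {  # word -> possible categories
    "he": {"NP"}, "saw": {"V"}, "a": {"Det"},
    "girl": {"N"}, "telescope": {"N"}, "with": {"P"},
}
binary = [  # A -> B C rules
    ("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
    ("NP", "Det", "N"), ("NP", "NP", "PP"), ("PP", "P", "NP"),
]

def count_parses(words):
    """Count distinct parse trees for the sentence via CYK chart parsing."""
    n = len(words)
    # chart[i][j][A] = number of ways to derive words[i:j] from category A
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for cat in lexical[w]:
            chart[i][i + 1][cat] = 1
    for span in range(2, n + 1):          # build longer spans from shorter ones
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # try every split point
                for a, b, c in binary:
                    chart[i][j][a] += chart[i][k][b] * chart[k][j][c]
    return chart[0][n]["S"]

sentence = "he saw a girl with a telescope".split()
print(count_parses(sentence))  # 2: one parse per attachment of "with a telescope"
```

One parse attaches the PP to “a girl” (she has the telescope); the other attaches it to the verb phrase (he used the telescope). The chart counts both, which is exactly the ambiguity described above.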
Another type of ambiguity that may arise in natural language is semantic ambiguity. This is when the meaning of a word or a sentence, not its structure, is ambiguous. For example, let’s look at the following sentence:
I saw a bat.
There is no question how this sentence is structured. The subject of the sentence is “I” and the object is “a bat,” connected by the verb “saw.” In other words, there is no syntactical ambiguity in it. But how about its meaning? “Saw” has at least two meanings. One is the past tense of the verb...