Natural Language Processing with Java
eBook - ePub

Natural Language Processing with Java

Techniques for building machine learning and neural network models for NLP, 2nd Edition

Richard M. Reese, AshishSingh Bhatia

  1. 318 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android
About this book

Explore various approaches to organize and extract useful text from unstructured data using Java

Key Features

  • Use deep learning and NLP techniques in Java to discover hidden insights in text
  • Work with popular Java libraries such as CoreNLP, OpenNLP, and Mallet
  • Explore machine translation, identifying parts of speech, and topic modeling

Book Description

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes.

You'll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you'll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You'll then start performing NLP on different inputs and tasks, such as tokenization, model training, parts-of-speech and parsing trees. You'll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more.

By the end of this book, you'll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

What you will learn

  • Understand basic NLP tasks and how they relate to one another
  • Discover and use the available tokenization engines
  • Apply search techniques to find people, as well as things, within a document
  • Construct solutions to identify parts of speech within sentences
  • Use parsers to extract relationships between elements of a document
  • Identify topics in a set of documents
  • Explore topic modeling from a document

Who this book is for

Natural Language Processing with Java is for you if you are a data analyst, data scientist, or machine learning engineer who wants to extract information from a language using Java. Knowledge of Java programming is needed, while a basic understanding of statistics will be useful but not mandatory.


Information

Finding Parts of Text

Finding parts of text is concerned with breaking text down into individual units, called tokens, and optionally performing additional processing on those tokens. This additional processing can include stemming, lemmatization, stopword removal, synonym expansion, and converting text to lowercase.
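As a rough illustration of this kind of post-tokenization processing, the following sketch splits text on whitespace, converts each token to lowercase, and drops stopwords. The class name SimpleNormalizer and its four-word stopword list are hypothetical and chosen only for this example; they are not taken from any NLP library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SimpleNormalizer {

    // Hypothetical, deliberately tiny stopword list; real lists are much longer.
    private static final Set<String> STOPWORDS = Set.of("a", "and", "she", "the");

    /** Splits on whitespace, lowercases each token, and drops stopwords. */
    public static List<String> normalize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // Blank input yields a single empty string, so skip empty tokens.
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(normalize("She bought a Book and the Pen"));
        // prints [bought, book, pen]
    }
}
```

Stemming, lemmatization, and synonym expansion need language resources and are not attempted in this sketch.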
We will demonstrate several tokenization techniques found in the standard Java distribution. They are included because they are sometimes all you need to do the job, with no need to import an NLP library. However, these techniques are limited. We then discuss specific tokenizers and tokenization approaches supported by NLP APIs. These examples provide a reference for how the tokenizers are used and the type of output they produce. A simple comparison of the approaches follows.
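To give a feel for what the standard Java distribution offers, here is a minimal sketch that tokenizes the same sentence with String.split, StringTokenizer, and Scanner. The class name, method names, and sample sentence are assumptions made for this illustration. These three are not necessarily the techniques the chapter goes on to cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

public class StandardJavaTokenizers {

    /** String.split with a regular expression matching runs of whitespace. */
    public static List<String> withSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** StringTokenizer; default delimiters are space, tab, newline, carriage return, form feed. */
    public static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    /** Scanner; its default delimiter is whitespace. */
    public static List<String> withScanner(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        System.out.println(withSplit(text));
        System.out.println(withStringTokenizer(text));
        System.out.println(withScanner(text));
        // Each call prints [Let's, pause,, and, then, reflect.]
    }
}
```

All three approaches leave punctuation attached to the neighboring word ("pause," and "reflect."), which is one example of the limitation mentioned above.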
There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers for various languages and specialized documents. The WikipediaTokenizer class is a tokenizer that handles Wikipedia-specific documents, and the ArabicAnalyzer class handles Arabic text. It is not possible to illustrate all of these varying approaches here.
We will also examine how certain tokenizers can be trained to handle specialized text. This can be useful when a different form of text is encountered. It can often eliminate the need to write a new and specialized tokenizer.
Next, we will illustrate how some of these tokenizers can be used to support specific operations, such as stemming, lemmatization, and stopword removal. Part-of-speech (POS) tagging can also be considered a special instance of finding parts of text. However, this topic is investigated in Chapter 5, Detecting Parts of Speech.
Therefore, we will be covering the following topics in this chapter:
  • What is tokenization?
  • Uses of tokenizers
  • NLP tokenizer APIs
  • Understanding normalization

Understanding the parts of text

There are a number of ways to categorize parts of text. For example, we may be concerned with character-level issues, such as punctuation, with a possible need to ignore or expand contractions. At the word level, we may need to perform different operations, such as the following:
  • Identifying morphemes using stemming and/or lemmatization
  • Expanding abbreviations and acronyms
  • Isolating number units
We cannot always split words at punctuation, because the punctuation is sometimes part of the word, as in can't. We may also be concerned with grouping multiple words to form meaningful phrases. Sentence detection can also be a factor, since we do not necessarily want to group words that cross sentence boundaries.
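The contraction problem can be seen with a small sketch. It compares a naive split on every non-word character with a regular expression that keeps an apostrophe inside a word. The class name and the pattern are assumptions made for this illustration, and the pattern is far from a complete treatment of English punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContractionAwareTokenizer {

    // Letters or digits, optionally followed by apostrophe-plus-letters groups (can't, o'clock).
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'\\p{L}+)*");

    /** Naive approach: treat every non-word character as a delimiter. */
    public static List<String> naive(String text) {
        return Arrays.asList(text.split("\\W+"));
    }

    /** Keeps an apostrophe when it sits between letters of the same word. */
    public static List<String> contractionAware(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "We can't stop now.";
        System.out.println(naive(text));            // prints [We, can, t, stop, now]
        System.out.println(contractionAware(text)); // prints [We, can't, stop, now]
    }
}
```

The naive version breaks can't into can and t, while the second version keeps it as a single token.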
In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques, such as stemming. We will not attempt to show how they are used in other NLP tasks. Those efforts are reserved for later chapters.

What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method, and the characters it accepts are listed in the following table. At times, however, a different set of delimiters is needed. For example, different delimiters can be useful when whitespace delimiters obscure text breaks, such as paragraph boundaries, and detecting these text breaks is important:
Character                  Meaning
Unicode space character    space_separator, line_separator, or paragraph_separator
\t                         U+0009 horizontal tabulation
\n                         U+000A line feed
\u000B                     U+000B vertical tabulation
\f                         U+000C form feed
\r                         U+000D carriage return
\u001C                     U+001C file separator
\u001D                     U+001D group separator
\u001E                     U+001E record separator
\u001F                     U+001F unit separator
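The whitespace definition above can be checked directly against Character.isWhitespace. The sketch below is a minimal illustration in which the class name WhitespaceTokenizer and the sample strings are assumptions. It builds tokens by treating every character that isWhitespace accepts as a delimiter. Note that, according to the Java API documentation, non-breaking spaces such as U+00A0 are excluded even though they are Unicode space characters.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenizer {

    /** Splits text wherever Character.isWhitespace reports a delimiter. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isWhitespace(codePoint)) {
                // A delimiter ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tab, vertical tabulation, file separator, and line feed all act as delimiters.
        System.out.println(tokenize("one\ttwo\u000Bthree\u001Cfour\nfive"));
        // prints [one, two, three, four, five]

        // The non-breaking space U+00A0 is not whitespace for isWhitespace.
        System.out.println(Character.isWhitespace('\u00A0')); // prints false
    }
}
```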
The tokenization process is complicated by a large number of factors, such as the following:
  • Language: Different languages present unique challenges. Whitespace is a commonly-used delimiter, but it will not be sufficient if we need to work with Chinese, where it is not used.
  • Text format: Text is often stored or presented using different formats. How simple text is processed versus HTML or other markup techniques will complicate the tokenization process.
  • Stopwords: Commonly-used words might not be important for some NLP tasks, such as general searches. These common words are called stopwords. Stopwords are sometimes removed when they do not contribute to the NLP task at hand. These can include words such as a, and, and she.
  • Text-expansion: For acronyms and abbreviations, it is sometimes desirable to expand them so that postprocesses can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
  • Case: The case of a word (upper or lower) may be significant in some situations. For example, the case of a word can help identify proper nouns. When identifying the par...

Table of Contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. Packt Upsell
  5. Contributors
  6. Preface
  7. Introduction to NLP
  8. Finding Parts of Text
  9. Finding Sentences
  10. Finding People and Things
  11. Detecting Part of Speech
  12. Representing Text with Features
  13. Information Retrieval
  14. Classifying Texts and Documents
  15. Topic Modeling
  16. Using Parsers to Extract Relationships
  17. Combined Pipeline
  18. Creating a Chatbot
  19. Other Books You May Enjoy
Citation styles for Natural Language Processing with Java

APA 6 Citation

Reese, R., & Bhatia, A. (2018). Natural Language Processing with Java (2nd ed.). Packt Publishing. Retrieved from https://www.perlego.com/book/778157/natural-language-processing-with-java-techniques-for-building-machine-learning-and-neural-network-models-for-nlp-2nd-edition-pdf (Original work published 2018)

Chicago Citation

Reese, Richard, and AshishSingh Bhatia. (2018) 2018. Natural Language Processing with Java. 2nd ed. Packt Publishing. https://www.perlego.com/book/778157/natural-language-processing-with-java-techniques-for-building-machine-learning-and-neural-network-models-for-nlp-2nd-edition-pdf.

Harvard Citation

Reese, R. and Bhatia, A. (2018) Natural Language Processing with Java. 2nd edn. Packt Publishing. Available at: https://www.perlego.com/book/778157/natural-language-processing-with-java-techniques-for-building-machine-learning-and-neural-network-models-for-nlp-2nd-edition-pdf (Accessed: 14 October 2022).

MLA 7 Citation

Reese, Richard, and AshishSingh Bhatia. Natural Language Processing with Java. 2nd ed. Packt Publishing, 2018. Web. 14 Oct. 2022.