Text Mining with Machine Learning
Principles and Techniques
- 352 pages
- English
About This Book
This book provides a perspective on the application of machine-learning-based methods to knowledge discovery from natural-language texts. Analysing various data sets reveals conclusions that are not normally evident and that can be used for various purposes and applications. The book explains the principles of time-proven machine learning algorithms applied in text mining and gives step-by-step demonstrations of how to reveal the semantic content of real-world datasets using the popular R language and its implementations of machine learning algorithms. The book is aimed not only at IT specialists but also at a wider audience that needs to process large sets of text documents and has basic knowledge of the subject, e.g. e-mail service providers, online shoppers, librarians, etc.
The book starts with an introduction to text-based natural language data processing, its goals, and its problems. It focuses on machine learning, presenting various algorithms with their uses and possibilities, and reviews their advantages and disadvantages. Beginning with the initial data pre-processing, a reader can follow the steps provided in the R language, including the incorporation of various available plug-ins into the resulting software tool. A big advantage is that R also contains many libraries implementing machine learning algorithms, so a reader can concentrate on the principal target without needing to implement the details of the algorithms themselves. To make sense of the results, the book also explains the algorithms, which supports the final evaluation and interpretation of the results. The examples are demonstrated using real-world data from commonly accessible Internet sources.
Chapter 1
Introduction to Text Mining with Machine Learning
1.1 Introduction
1.2 Relation of Text Mining to Data Mining
- categorization of documents – assigning a document to one or more predefined categories (e.g., assigning a newspaper article to one or more categories, labeling e-mails as spam or ham);
- clustering – grouping documents according to their similarity, for example, in order to identify documents sharing a common topic;
- summarization – finding the most important parts in one or more documents and creating a text that is significantly shorter than the original;
- information retrieval – retrieving, from a large collection, the documents that match a query representing an information need;
- extracting the meaning of documents or their parts – identifying hidden topics, analyzing sentiment, opinion, or emotions;
- information extraction – extracting structured information like entities, events, or relations from unstructured texts;
- association mining – finding associations between concepts or terms in texts;
- trend analysis – looking at how concepts contained in documents change over time;
- machine translation – converting a text written in one language to a text in another language.
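Several of the tasks listed above, notably clustering and information retrieval, rest on measuring how similar two texts are. The following is a minimal sketch of that idea, not the book's own method: the book works in R, whereas this illustration uses plain Python. It assumes a simple bag-of-words representation with cosine similarity. The function names (`bag_of_words`, `cosine_similarity`, `rank_documents`) and the three sample documents are hypothetical and were chosen only for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (index, score) pairs, most similar document first."""
    q = bag_of_words(query)
    scores = [
        (i, cosine_similarity(q, bag_of_words(d)))
        for i, d in enumerate(documents)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini-collection, invented for this illustration.
documents = [
    "The team won the football match",
    "Stock prices fell on the market today",
    "The football season starts next week",
]
ranking = rank_documents("football match results", documents)
```

Here `ranking` orders the documents by their similarity to the query, which is the core of a very simple retrieval step. The same pairwise similarity could feed a clustering procedure. Real systems would typically add term weighting and more careful tokenization.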
- categorical (nominal) – the domain is a discrete set of values where ordering does not make sense;
- binary – a special type of categorical attribute with only two possible values;
- ordinal – the domain is a discrete set of values that can be ordered;
- numerical – the value of a feature is a number, either integer or continuous.
| Age | Gender | Education | Date       | Price |
|-----|--------|-----------|------------|-------|
| 35  | female | primary   | 2019-02-10 | 20.0  |
| 40  | male   | tertiary  | 2019-02-14 | 28.4  |
| 21  | male   | secondary | 2019-01-30 | 15.1  |
| 63  | female | secondary | 2019-03-01 | 11.9  |
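As an illustration of how the attribute types above might be turned into numbers for a learning algorithm, here is a small Python sketch (the book itself uses R). It reads the table's columns under assumptions that the excerpt does not state explicitly: Age and Price as numerical, Gender as binary, Education as ordinal, and Date as a day offset from an arbitrary reference date. The mappings, `REFERENCE_DATE`, and `encode_record` are hypothetical names introduced only for this example.

```python
from datetime import date

# Assumed ordinal scale: the order of the levels carries meaning.
EDUCATION_RANK = {"primary": 0, "secondary": 1, "tertiary": 2}

# Assumed binary coding: only two values occur in the table.
GENDER_CODE = {"female": 0, "male": 1}

# Hypothetical origin used to turn a calendar date into a day count.
REFERENCE_DATE = date(2019, 1, 1)


def encode_record(age, gender, education, day, price):
    """Map one table row to a list of floats, one value per attribute."""
    return [
        float(age),                                              # numerical
        float(GENDER_CODE[gender]),                              # binary
        float(EDUCATION_RANK[education]),                        # ordinal
        float((date.fromisoformat(day) - REFERENCE_DATE).days),  # days since origin
        float(price),                                            # numerical
    ]


# The four rows of the table.
records = [
    (35, "female", "primary", "2019-02-10", 20.0),
    (40, "male", "tertiary", "2019-02-14", 28.4),
    (21, "male", "secondary", "2019-01-30", 15.1),
    (63, "female", "secondary", "2019-03-01", 11.9),
]
encoded = [encode_record(*row) for row in records]
```

A categorical attribute with more than two unordered values would not fit this integer coding, because the integers would imply an order that does not exist. A one-hot encoding is a common alternative in that case.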
- Independently of the complexity of features, the input space for text mining problems is large. It is not uncommon that the dictionaries of various natural languages contain hundreds of thousands of words. When considering not only words as features but also, for example, combinations, mutual positions, or grammatical relationships between words, the complexity further increases. Of course, not all words of a language will appear in most d...
Table of contents
- Cover
- Title Page
- Copyright Page
- Dedication Page
- Preface
- Table of Contents
- Authors’ Biographies
- 1. Introduction to Text Mining with Machine Learning
- 2. Introduction to R
- 3. Structured Text Representations
- 4. Classification
- 5. Bayes Classifier
- 6. Nearest Neighbors
- 7. Decision Trees
- 8. Random Forest
- 9. Adaboost
- 10. Support Vector Machines
- 11. Deep Learning
- 12. Clustering
- 13. Word Embeddings
- 14. Feature Selection
- References
- Index