eBook - ePub

fastText Quick Start Guide

Name: fastText Quick Start Guide
Author: Joydeep Bhattacharjee

Get started with Facebook's library for text representation and classification

Joydeep Bhattacharjee

194 pages
English

About the book

Perform efficient fast text representation and classification with Facebook's fastText library

Key Features

Introduction to Facebook's fastText library for NLP
Perform efficient word representation, sentence classification, and vector representation
Build better, more scalable solutions for text representation and classification

Book Description

Facebook's fastText library handles text representation and classification, two core tasks in Natural Language Processing (NLP). Most organizations have to deal with enormous amounts of text data on a daily basis, and gaining efficient data insights requires powerful NLP tools such as fastText.

This book is your ideal introduction to fastText. You will learn how to create fastText models from the command line, without the need for complicated code. You will explore the algorithms that fastText is built on and how to use them for word representation and text classification.

Next, you will use fastText in conjunction with other popular libraries and frameworks such as Keras, TensorFlow, and PyTorch.

Finally, you will deploy fastText models to mobile devices. By the end of this book, you will have all the required knowledge to use fastText in your own applications at work or in projects.

What you will learn

Create models using the default command line options in fastText
Understand the algorithms used in fastText to create word vectors
Combine command line text transformation capabilities and the fastText library to implement a training, validation, and prediction pipeline
Explore word representation and sentence classification using fastText
Use Gensim and spaCy to load the vectors, transform, lemmatize, and perform other NLP tasks efficiently
Develop a fastText NLP classifier using popular frameworks, such as Keras, TensorFlow, and PyTorch

Who this book is for

This book is for data analysts, data scientists, and machine learning developers who want to perform efficient word representation and sentence classification using Facebook's fastText library. Basic knowledge of Python programming is required.

Information

Publisher: Packt Publishing
Year: 2018
ISBN: 9781789136715
Subject: Computer Science
Category: Natural Language Processing

Creating Models Using FastText Command Line

FastText has a powerful command line. In fact, you can call fastText a command-line-first library. Many developers and researchers are not comfortable with the command line, so if you are one of them, I would ask you to go through the examples in this chapter with extra attention. My hope is that by the end of this chapter, you will have some confidence in command-line file manipulations. The advantages of using the command line are as follows:

Commands such as cat, grep, sed, and awk are quite old and their behavior is well-documented on the internet. Chances are high that, for any use case that you might have, you will easily get snippets on Stack Overflow/Google (or your colleague next door will know it).
Since they are generally implemented in the C language, they are very fast.
The commands are very crisp and concise, which means there is not a lot of code to write and maintain.

We will take a look at how classification and word vector generation work in fastText, and how to carry out both from the command line. This chapter covers the following topics:

Text classification using fastText
FastText word vectors
Creating word vectors
Facebook word vectors
Using pretrained word vectors

Text classification using fastText

To access the command line, open the Terminal on your Linux or macOS machine, or the Command Prompt on Windows (press Windows + R, type cmd, and hit Enter), and then type fasttext. You should see some output. If you see nothing, or get an error saying that the command was not found, please take a look at the previous chapter on how to install fastText on your computer. If you do see output, it is a basic description of all the options. A description of the command line options for fastText can be found in the Appendix of this book.
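
If it is unclear whether the binary is installed, a quick check from Python can tell you whether it is on your PATH. This is only a minimal sketch: it assumes the executable is named fasttext (the name produced by the standard build), and the helper name find_fasttext is hypothetical.

```python
import shutil


def find_fasttext(binary_name="fasttext"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


path = find_fasttext()
if path is None:
    print("fasttext was not found on PATH; revisit the installation steps.")
else:
    print(f"fasttext found at {path}")
```

If you built fastText from source and did not copy the binary to a directory on your PATH, the check will report that it is missing even though the build succeeded; in that case, run the binary from the build directory instead.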

All the methods and command line statements mentioned in this chapter will work on Linux and Mac machines. If you are a Windows user, focus on the description and the logic of each step rather than on the exact commands. A helpful guide on command line differences between Windows and Linux is included in the Appendix.

In fastText, there are two primary use cases for the command line. These are the following:

Text classification
Text representation

One of the core areas of focus for fastText is text classification. Text classification is a technique in which we learn which of a set of categories the input text belongs to. This is basically a supervised machine learning problem, so first and foremost, you will need a dataset that contains text and the corresponding labels.
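
As a rough illustration of what such a dataset can look like, the following sketch turns a few text-label pairs into lines in which the label carries the __label__ prefix, which is the default label prefix that fastText looks for in its training files. The sample texts, the label names, and the helper to_fasttext_line are made up for illustration.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    # Collapse newlines and repeated whitespace: one example per line.
    clean_text = " ".join(text.split())
    return f"__label__{label} {clean_text}"


# Hypothetical labeled examples as (text, label) pairs.
examples = [
    ("The battery lasts all day", "positive"),
    ("Stopped working after\none week", "negative"),
]

lines = [to_fasttext_line(label, text) for text, label in examples]
for line in lines:
    print(line)
```

Keeping each example on one line matters because line-oriented tools such as cat, grep, sed, and awk treat every line as one record.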

Roughly speaking, machine learning algorithms solve some kind of optimization problem on a set of matrices and vectors. They do not really understand "raw text," which means that you will need to set up a pipeline to convert the raw text into numbers. Here are the steps that can be followed to do that:

First, you need the data. For text classification, this means a series of texts or documents, each of which is labeled. You convert them into a series of text-label pairs.
The next step is called tokenization. Tokenization is the process of dividing the text into individual pieces or tokens. Tokenization is primarily done by understanding the word boundaries in the given text. Many languages in the world are space delimited. Examples of these are English and French. In some other cases, the word boundaries may not be clear, such as in the case of Mandarin, Tamil, and Urdu.
Once the tokenization is done, you may end up with a "bag of words," which is essentially a vector for each document or sentence telling you whether a specific word is present and how many times it occurs. Stacking these vectors gives a matrix in which each row is a document and each column is a word from the set of all words present, which is called the dictionary; each entry is the count of that word in that document. This is called the bag-of-words approach.
Convert the bag of words into a TF-IDF matrix to reduce the weight of the common terms. TF-IDF is used so that terms that occur in most documents do not have too much impact on the resultant matrix.
Now that you have the matrix, you can pass it as input to a classification algorithm, which will train a model on it. Popular choices at this stage include logistic regression, XGBoost, and random forest.
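
The first steps of this pipeline can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy corpus is invented, tokenization is a simple whitespace split, and the TF-IDF weighting shown (term frequency times the logarithm of the inverse document frequency, without smoothing) is just one common variant. The classifier step is left out.

```python
import math
from collections import Counter

# Hypothetical toy corpus; in practice each entry would be one labeled document.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Tokenization: split on whitespace, which works for space-delimited languages.
tokenized = [doc.lower().split() for doc in documents]

# Dictionary: sorted list of every distinct token in the corpus.
dictionary = sorted({token for tokens in tokenized for token in tokens})

# Bag of words: one row per document, one column per dictionary word.
counts = [Counter(tokens) for tokens in tokenized]
bow = [[c[word] for word in dictionary] for c in counts]

# Document frequency: number of documents containing each word.
n_docs = len(documents)
doc_freq = {word: sum(1 for c in counts if word in c) for word in dictionary}


def tf_idf_row(counter, n_tokens):
    """TF-IDF for one document: (count / length) * log(N / document frequency)."""
    return [
        (counter[word] / n_tokens) * math.log(n_docs / doc_freq[word])
        for word in dictionary
    ]


tfidf = [tf_idf_row(c, len(t)) for c, t in zip(counts, tokenized)]
```

In the first document, "the" has the highest raw count, yet its TF-IDF weight ends up below that of "cat", because "the" appears in two of the three documents while "cat" appears in only one.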

Some of the additional steps that may need to be taken are the following:

Removal of stop words.
Stemming, or a heuristic removal of the ends of words. This process works mostly in English and related languages due to the prevalence of derivational affixes.
Addition of n-grams to the model.
Synonym sets.
Part-of-speech tagging.
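
Three of these steps (stop word removal, stemming, and n-grams) are simple enough to sketch directly. The stop word list and the suffix list below are tiny and invented for illustration, and crude_stem is a deliberately naive suffix stripper rather than a real stemming algorithm such as Porter's.

```python
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude heuristic, not a real stemmer


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n=2):
    """Return all contiguous n-token sequences, each joined by a space."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the dogs are jumping on the beds".split()
content = remove_stop_words(tokens)
stems = [crude_stem(t) for t in content]
bigrams = ngrams(stems, 2)
print(stems)
print(bigrams)
```

A suffix stripper like this one makes mistakes on many ordinary words, which is one reason established stemmers use longer rule sets.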

Text preprocessing

Depending on the dataset, you may need to do some or all of these steps:

Tokenize the text.
Convert the text into lowercase. This is only required for lang...