Learning Objectives
By the end of this chapter, you will be able to:
- Present data for use in machine learning models
- Explain how to preprocess data for a machine learning model
- Build a logistic regression model with scikit-learn
- Use regularization in machine learning models
- Evaluate model performance with model evaluation metrics
In this chapter, we will learn how to preprocess data for machine learning models. We will learn how to develop logistic regression models with scikit-learn. Lastly, we will evaluate model performance with model evaluation metrics.
Introduction
Machine learning is the science of utilizing machines to emulate human tasks and to have the machines improve their performance on those tasks over time. By feeding machines data in the form of observations of real-world events, we enable them to learn patterns and relationships that optimize an objective function, such as the accuracy of a binary classification task or the error in a regression task. In general, the usefulness of machine learning lies in the ability to learn highly complex and non-linear relationships in large datasets and to replicate the results of that learning many times.
Take, for example, the classification of a dataset of pictures of either dogs or cats into classes of their respective type. For a human, this is trivial, and the accuracy would likely be very high. However, it may take around a second to categorize each picture, and scaling the task can only be achieved by increasing the number of humans, which may be infeasible. While it may be difficult, though certainly not impossible, for machines to reach the same level of accuracy as humans for this task, machines can classify many images per second, and scaling can be easily done by increasing the processing power of a single machine or by making the algorithm more efficient.
Figure 1.1: A trivial classification task for humans, but quite difficult for machines
While classifying dogs and cats may be trivial for us humans, the same principles that are used to create a machine learning model to classify dogs and cats can be applied to other classification tasks that humans may struggle with. An example of this is identifying tumors in Magnetic Resonance Images (MRIs). For humans, this task requires a medical professional with years of experience, whereas a machine may only need a dataset of labeled images.
Figure 1.2: A non-trivial classification task for humans. Are you able to spot the tumors?
Data Representation
We build models so that we can learn something about the data we are training on and about the relationships between the features of the dataset. This learning can inform us when we encounter new observations. However, we must realize that the observations we interact with in the real world and the format of data needed to train machine learning models are very different. Working with text data is a prime example of this. When we read text, we are able to understand each word and apply context given each word in relation to the surrounding words, which is not a trivial task. However, machines are unable to interpret this contextual information. Unless it is specifically encoded, they have no idea how to convert text into something that can serve as a numerical input. Therefore, we must represent the data appropriately, often by converting non-numerical data types, such as text, dates, and categorical variables, into numerical ones.
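As a minimal sketch of this kind of conversion, the snippet below uses pandas to turn a categorical column and a date column into numerical columns. The small DataFrame and its column names (job, contact_date) are made up for illustration and are not taken from the dataset used in this chapter; one-hot encoding is only one of several possible encodings.

```python
import pandas as pd

# Hypothetical raw observations: one categorical column and one date column
raw = pd.DataFrame({
    "job": ["admin", "technician", "admin"],
    "contact_date": ["2019-05-01", "2019-05-03", "2019-06-15"],
})

# One-hot encode the categorical column: one indicator column per category
encoded = pd.get_dummies(raw["job"], prefix="job").astype(int)

# Parse the date strings, then extract numerical components
dates = pd.to_datetime(raw["contact_date"])
encoded["month"] = dates.dt.month
encoded["day_of_week"] = dates.dt.dayofweek  # Monday is 0

print(encoded)
```

Every column of the resulting table is numerical, so it can be passed to a model that expects numerical input.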
Tables of Data
Much of the data fed into machine learning problems is two-dimensional and can be represented as rows and columns. Images, by contrast, are a good example of a dataset that may be three- or even four-dimensional. The shape of each image will be two-dimensional (a height and a width), the number of images together will add a third dimension, and a color channel (red, green, blue) will add a fourth.
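The dimensions described above can be sketched with a NumPy array. The batch size, image size, and the ordering of the axes (images, height, width, channels) are assumptions chosen for illustration; other libraries may order the axes differently.

```python
import numpy as np

# Hypothetical batch: 10 color images, each 32 pixels high and 32 pixels wide
n_images, height, width, channels = 10, 32, 32, 3
images = np.zeros((n_images, height, width, channels), dtype=np.uint8)

print(images.ndim)               # number of dimensions of the whole batch
print(images[0].shape)           # one color image: height, width, channels
print(images[0, :, :, 0].shape)  # the red channel of one image is 2-D

# Flattening each image gives a two-dimensional table: one row per image
table = images.reshape(n_images, -1)
print(table.shape)
```

Reshaping in this way is one simple route from image data back to the rows-and-columns layout that many models expect.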
Figure 1.3: A color image and its representation as red, green, and blue images
Note
We have used datasets from this repository: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
The following figure shows a few rows from a marketing dataset taken from the UCI repository. The dataset presents marketing campaign results of a Portuguese banking institution. The columns of the table show various details about each customer, while the final column, y, shows whether or not the customer subscribed to the product that was featured in the marketing campaign.
One objective of analyzing the dataset could be to use the information given to predict whether a given customer subscribed to the product (that is, to predict what is in column y for each row). We can then check whether we were correct by comparing our predictions to column y. The longer-term benefit of this is that we could then use our model to predict whether new customers will subscribe to the product, or whether existing customers will subscribe to another product after a different campaign.
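Comparing predictions to column y can be sketched as follows. The labels and predictions here are invented for illustration and do not come from the marketing dataset or from any trained model.

```python
import pandas as pd

# Hypothetical true labels (column y) and model predictions for five customers
results = pd.DataFrame({
    "y": ["no", "yes", "no", "no", "yes"],
    "predicted": ["no", "yes", "yes", "no", "no"],
})

# Accuracy: the fraction of rows where the prediction matches column y
accuracy = (results["y"] == results["predicted"]).mean()
print(accuracy)
```

Three of the five predictions match column y, so the accuracy in this toy example is 0.6.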
Figure 1.4: An image showing the first 20 instances of the marketing dataset
Loading Data
Data can be in different forms and can be available in many places. Datasets for beginners are often given in a flat format, which means that they are two-dimensional, with rows and columns. Other common forms of data may include images, JSON objects, and text documents. Each type of data format has to be loaded in specific ways. For example, numerical data can be loaded into memory using the NumPy library, which is an efficient library for working with matrices in Python. However, we would not be able to load our marketing data .csv into memory as a numerical array using NumPy's loadtxt function because the dataset contains string values. For our dataset, we will use the pandas library because of its ability to easily work with various data types, such as strings, integers, floats, and binary values. In fact, pandas is dependent on NumPy for operations on numerical data types. pandas is also able to read JSON, Excel documents, and databases using SQL queries, which makes the library common amongst practitioners for loading and manipulating data in Python.
Here is an example of how to load a CSV file using the NumPy library. We use the skiprows argument in case there is a header, which usually contains column names:
import numpy as np
# filename is the path to a CSV file that contains only numerical values
data = np.loadtxt(filename, delimiter=",", skiprows=1)
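To see why string values are a problem for this function, here is a small self-contained sketch. The CSV contents are invented for illustration, and io.StringIO stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical CSV content; io.StringIO stands in for a file on disk
numeric_csv = "age,balance\n30,1787\n33,4789\n"
data = np.loadtxt(io.StringIO(numeric_csv), delimiter=",", skiprows=1)
print(data.shape)  # two rows, two columns

# A column of strings cannot be converted to floats
mixed_csv = "age,job\n30,unemployed\n33,services\n"
try:
    np.loadtxt(io.StringIO(mixed_csv), delimiter=",", skiprows=1)
    loaded_mixed = True
except ValueError as err:
    loaded_mixed = False
    print("NumPy could not parse the file:", err)
```

The purely numerical file loads as a floating-point array, while the file containing a string column raises a ValueError.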
Here's an example of loading data using the pandas library:
import pandas as pd
data = pd.read_csv(filename, delimiter=",")
Here we are loading in a CSV file. The default delimiter is a comma, so passing this as an argument is not necessary, but it is useful to see. The pandas library can also handle non-numeric data types and other file formats, such as JSON, which makes the library more flexible:
import pandas as pd
data = pd.read_json(filename)
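The pandas library will parse the JSON and return a DataFrame. Note that nested fields are not flattened automatically; the pandas json_normalize function can be used to expand them into separate columns.
The library can even connect to a database. A query can be passed directly to the function, and the result will be loaded as a pandas DataFrame: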
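As a minimal sketch, the snippet below reads a small JSON document into a DataFrame and then expands a nested record with pandas.json_normalize. The JSON contents and field names are invented for illustration, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical JSON records; io.StringIO stands in for a file on disk
json_text = '[{"age": 30, "job": "unemployed"}, {"age": 33, "job": "services"}]'
data = pd.read_json(io.StringIO(json_text))
print(data)

# Nested records: json_normalize expands nested keys into dotted column names
records = [{"id": 1, "contact": {"type": "cellular", "month": "may"}}]
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Each top-level key becomes a column, and with json_normalize each nested key becomes its own column named after the path to it.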
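import pandas as pd
# The query is the first argument and the connection is the second;
# table_name is a placeholder for the name of an existing table
data = pd.read_sql("SELECT * FROM table_name", con)
We have to pass a database connection to the function in order for this to work. There are many ways to create one, depending on the database flavor.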
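One way to create such a connection, shown here only as a sketch, is Python's built-in sqlite3 module with an in-memory database. The customers table and its contents are hypothetical; a real project would connect to its own database instead.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used purely for illustration
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (age INTEGER, job TEXT)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(30, "unemployed"), (33, "services")],
)
con.commit()

# pandas.read_sql takes the query first and the connection second
data = pd.read_sql("SELECT * FROM customers", con)
print(data)

con.close()
```

The result is an ordinary DataFrame with one column per column of the query result.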
Other forms of data that are common in deep learning, such as images and text, can also be loaded in and will be discussed later in the book.
Note
You can find all the documentation for pandas at the following link: https://pandas.pydata.org/pandas-docs/stable/. The documentation for NumPy can be found at the following link: https://docs.sci...