This first chapter covers all the required components for running a custom end-to-end machine learning (ML) pipeline in Azure. Some sections might be a recap of your existing knowledge with useful practical tips, step-by-step guidelines, and pointers to using Azure services to perform ML at scale. You can see it as an overview of the book, where we will dive into each section in great detail with many practical examples and a lot of code during the remaining chapters of the book.
First, we will look at data experimentation techniques as a step-by-step process for analyzing common insights, such as missing values, data distribution, feature importance, and two-dimensional embedding techniques to estimate the expected model performance of a classification task. In the second section, we will use these insights about the data to perform data preprocessing and feature engineering, such as normalization, the encoding of categorical and temporal variables, and transforming text columns into meaningful features using Natural Language Processing (NLP).
In the subsequent sections, we will recap the analytical process of training an ML model by selecting a model, an error metric, and a train-test split, and performing cross-validation. Then, we will learn about techniques that help to improve the prediction performance of a single model through hyperparameter tuning, model stacking, and automated machine learning. Finally, we will cover the most common techniques for model deployments, such as online real-time scoring and batch scoring.
The following topics will be covered in this chapter:
- Performing descriptive data exploration
- Common techniques for data preparation
- Choosing the right ML model to train data
- Optimization techniques
- Deploying and operating models
Performing descriptive data exploration
Descriptive data exploration is, without a doubt, one of the most important steps in an ML project. If you want to clean data and build derived features or select an ML algorithm to predict a target variable in your dataset, then you need to understand your data first. Your data will define many of the necessary cleaning and preprocessing steps; it will define which algorithms you can choose and it will ultimately define the performance of your predictive model.
Hence, data exploration should be considered an important analytical step for understanding whether your data is informative enough to build an ML model in the first place. By analytical step, we mean that the exploration should be done as a structured analytical process rather than a set of experimental tasks. Therefore, we will go through a checklist of data exploration tasks that you can perform as an initial step in every ML project, before starting any data cleaning, preprocessing, feature engineering, or model selection.
Once the data is provided, we will work through the following data exploration checklist and try to get as many insights as possible about the data and its relation to the target variable:
- Analyze the data distribution and check for the following:
  - Data types (continuous, ordinal, nominal, or text)
  - Mean, median, and percentiles
  - Data skew
  - Outliers and minimum and maximum values
  - Null and missing values
  - Most common values
  - The number of unique values (in categorical features)
  - Correlations (in continuous features)
- Analyze how the target variable is influenced by the features and check for the following:
  - The regression coefficient (in regression)
  - Feature importance (in classification)
  - Categorical values with high error rates (in binary classification)
- Analyze the difficulty of your prediction task.
By applying these steps, you will be able to understand the data and gain knowledge about the required preprocessing tasks for your data—features and target variables. Along with that, it will give you a good estimate of what difficulties you can expect in your prediction task, which is essential for judging required algorithms and validation strategies. You will also gain an insight into what possible feature engineering methods could apply to your dataset and have a better understanding of how to select a good error metric.
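The first group of checks in the checklist can be sketched with pandas, one of the frameworks this chapter relies on. This is a minimal sketch rather than a definitive implementation: the toy DataFrame and its column names (`age`, `income`, `city`) are hypothetical and only serve to make the example runnable.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 58, 90],
    "income": [2100.0, 3500.0, None, 4100.0, 5200.0, 99000.0],
    "city": ["Vienna", "Graz", "Vienna", None, "Linz", "Vienna"],
})

numeric = df.select_dtypes(include="number")
categorical = df.select_dtypes(exclude="number")

# Data types of each column
dtypes = df.dtypes

# Mean, median (the 50% row), percentiles, and minimum and maximum values
summary = numeric.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])

# Data skew of each continuous feature
skew = numeric.skew()

# Null and missing values per column
missing = df.isna().sum()

# Most common value per column (the first mode if there are ties)
most_common = df.mode().iloc[0]

# Number of unique values in categorical features
unique_counts = categorical.nunique()

# Pairwise Pearson correlations between continuous features
correlations = numeric.corr()

print(summary)
print(missing)
```

Each result is a small table or series that can be inspected directly in a notebook. The same calls work unchanged on a representative sample of a larger dataset.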
Note
You can use a representative subset of the data and extrapolate your hypotheses and insights to the whole dataset.
Moving data to the cloud
Before we can start exploring the data, we need to make it available in our cloud environment. While this seems like a trivial task, efficiently accessing data from a new environment inside a corporate environment is not always easy. Also, uploading, copying, and distributing the same data to many Virtual Machines (VMs) and data science environments is not sustainable and doesn't scale well. For data exploration, we only need a significant subset of the data that can easily be connected to all other environments—rather than live access to a production database or data warehouse.
There is nothing wrong with uploading Comma-Separated Values (CSV) or Tab-Separated Values (TSV) files to your experimentation environment or accessing data via Java Database Connectivity (JDBC) from the source system. However, there are a few easy tricks to optimize your workflow.
First, we will choose a data format optimized for data exploration. In the exploration phase, we need to glance at the source data multiple times and explore the values, feature dimensions, and target variables. Hence, using a human-readable text format is usually very practical. A delimiter-separated file, such as CSV, is strongly recommended: it can be parsed efficiently, and you can open and browse it using any text editor.
Another small tweak that will bring you a significant performance improvement is compressing the file using Gzip before uploading it to the cloud. This will make uploads, loading, and downloads of this file much faster, while the compute resources spent on decompression are minimal. Thanks to the nature of the tabular data, the compression ratio will be very high. Most analytical frameworks for data processing, such as pandas and Spark, can read and parse Gzipped files natively, which requires minimal-to-no code changes. In addition, this only adds a small extra step for reading and analyzing the file manually with an editor.
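As a minimal sketch of this tweak, the following example writes a CSV file, compresses it with Python's standard `gzip` module, and reads the compressed file back with pandas. The DataFrame contents and the file names `data.csv` and `data.csv.gz` are hypothetical. The actual compression ratio depends on how repetitive your data is.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical tabular data with the repetitive structure typical of CSV exports.
df = pd.DataFrame({
    "id": range(10_000),
    "country": ["AT", "DE", "CH", "US"] * 2_500,
    "status": ["active"] * 10_000,
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    gz_path = os.path.join(tmp, "data.csv.gz")

    df.to_csv(csv_path, index=False)

    # Compress the plain CSV file with the standard-library gzip module.
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())

    csv_size = os.path.getsize(csv_path)
    gz_size = os.path.getsize(gz_path)

    # pandas infers Gzip compression from the .gz file extension.
    restored = pd.read_csv(gz_path)

print(f"CSV: {csv_size} bytes, Gzip: {gz_size} bytes")
```

Note that the call to `pd.read_csv` is the same one you would use for an uncompressed file, which is why switching to Gzipped files usually requires no code changes.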
Once your training data is compressed, it's recommended to upload the Gzipped CSV file to an Azure Storage container; a good choice would be Azure Blob storage. When the data is stored in Blob storage, it can be conveniently accessed from any other services within Azure, as well as from your local machine. This means if you scale your experimentation environment from an Azure notebook to a compute cluster, your code for accessing and reading the data will stay the same.
A fantastic cross-platform GUI tool to interact with many different Azure Storage services is Azure Storage Explorer. Using this tool, it is very easy to efficiently upload small and large files to Blob storage. It also allows you to generate direct links to your files with an embedded access key. This technique is simple yet also super effective when uploading hundreds of terabytes (TBs) from your local machine to the cloud. We will discuss this in much more detail in Chapter 4, ETL, data preparation, and feature extraction.
Understanding missing values
Once the data is uploaded to the cloud (for example, using Azure Storage Explorer and Azure Blob storage for your files), we can bring up a Notebook environment and start exploring the data. The goal is to thoroughly explore your data in an analytical process to understand the distribution of each dimension of your data. This is essential for choosing appropriate data preprocessing, feature engineering, and ML algorithms for your use case.
Note
Please keep in mind that not only the feature dimensions but also the target variable needs to be preprocessed and thoroughly analyzed.
Analyzing each dimension of a dataset with more than 100 feature dimensions is an extremely time-consuming task. However, instead of randomly exploring feature dimensions, you can analyze the dimensions ordered by feature importance and hence significantly reduce your time working through the data. As in many other areas of computer science, it is good to apply the 80/20 principle to the initial data exploration and use only 20% of the features to achieve 80% of the performance. This sets you up for a great start and you can always come back later to add more dimensions if needed.
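The ordering step can be sketched as follows. This paragraph does not prescribe how to compute feature importance, so the sketch swaps in a deliberately crude stand-in: the absolute Pearson correlation of each numeric feature with the target. The dataset, the feature names, and the `target` column are hypothetical.

```python
import pandas as pd

# Hypothetical numeric dataset; "target" and the feature names are illustrative.
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "feature_b": [5, 3, 6, 2, 7, 1, 8, 4],
    "feature_c": [2, 2, 2, 3, 2, 2, 3, 2],
    "target": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1],
})

# Assumption: absolute Pearson correlation with the target serves as a
# crude proxy for feature importance. It only captures linear relationships.
ranking = (
    df.drop(columns="target")
    .corrwith(df["target"])
    .abs()
    .sort_values(ascending=False)
)

# Explore the top 20% of the features first (at least one feature).
top_n = max(1, round(0.2 * len(ranking)))
explore_first = ranking.index[:top_n].tolist()

print(ranking)
print(explore_first)
```

A model-based importance measure would typically give a more reliable ordering than this proxy, especially for non-linear relationships and categorical features.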
The first thing to look for in a new dataset is missing values in each feature dimension. This will help you to gain a deeper understanding of the dataset and of what actions could be taken to handle them. It's not uncommon to remove missing values or impute them with zeros at the beginning of a project. However, this approach bears the risk of not properly analyzing missing values in the first place.
Note
Missing values can be disguised as valid numeric or categorical values. Typical examples are minimum or maximum values, -1, 0, or NaN. Hence, if you find the values 32,767 (= 2^15 - 1) or 65,535 (= 2^16 - 1) appearing multiple times in an integer data column, they might well be missing values disguised as the maximum signed or unsigned 16-bit integer representation. Always assume that your data contains missing values and outliers in different shapes and representations. Your task is to uncover, find, and clean them.
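A minimal sketch of how such disguised values could be flagged with pandas is shown here. The sensor-style dataset is hypothetical, and the list of candidate sentinel values is an assumption taken from the examples in the note. A repeated value is only a hint, so confirm it against domain knowledge before replacing anything.

```python
import pandas as pd

# Hypothetical readings; 32767 and -1 stand in for "no reading" here.
df = pd.DataFrame({
    "temperature": [21, 22, 32767, 23, 32767, 20],
    "humidity": [40, -1, 45, -1, -1, 50],
})

# Candidate sentinel values named in the note; extend this list as needed.
SENTINELS = [-1, 0, 2**15 - 1, 2**16 - 1]

# Collect every (column, sentinel) pair that appears more than once.
suspicious = {
    (col, value): int(count)
    for col in df.columns
    for value in SENTINELS
    if (count := (df[col] == value).sum()) > 1
}
print(suspicious)

# After confirming that these values really mean "missing",
# replace them with NaN so that they are counted as null values.
cleaned = df.copy()
for col, value in suspicious:
    cleaned[col] = cleaned[col].mask(cleaned[col] == value)

print(cleaned.isna().sum())
```

Replacing the sentinels with NaN converts the integer columns to floating point, which is the usual trade-off in pandas when missing values enter an integer column.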
Any prior knowledge about the data or domain will give you a competitive advantage when working with the data. The reason for this is that you will be able to understand missing values, outliers, and extremes in relation to the data and domain—which will help you to perform better imputation, cleansing, or transformation. As the next step, you should look for these outliers in your data, specifically for the following values:
- The absolute number (or percentage) of the null values (look for Null, "Null", "", NaN, and so on)
- The absolute number (or percentage) of minimum and maximum values
- The absolute number (or percentage) of the most common value (MODE)
- The absolute number (or percentage) of value 0
- The absolute number (or percentage) of unique values
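The counts in this list can be gathered per column with a small helper. The following is a sketch under stated assumptions: `profile_column` is a hypothetical helper name, the `NULL_TOKENS` set is an assumed list of textual spellings of missing values, and the example dataset is made up. It reports percentages of all rows, and the minimum, maximum, and zero checks are applied only to numeric columns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Assumed textual spellings of "missing"; adjust them to your data source.
NULL_TOKENS = {"", "null", "nan", "none", "n/a"}


def profile_column(series: pd.Series) -> dict:
    """Return the checklist counts for one column as percentages of all rows."""
    n = len(series)
    text = series.astype(str).str.strip().str.lower()
    is_null = series.isna() | text.isin(NULL_TOKENS)
    valid = series[~is_null]

    def pct(count):
        return round(100 * int(count) / n, 1)

    profile = {
        "null_pct": pct(is_null.sum()),
        "mode_pct": pct(valid.value_counts().iloc[0]) if len(valid) else 0.0,
        "unique_pct": pct(valid.nunique()),
    }
    # Minimum, maximum, and zero counts only make sense for numeric columns.
    if is_numeric_dtype(valid):
        profile["min_pct"] = pct((valid == valid.min()).sum())
        profile["max_pct"] = pct((valid == valid.max()).sum())
        profile["zero_pct"] = pct((valid == 0).sum())
    return profile


# Hypothetical dataset with real and disguised null values.
df = pd.DataFrame({
    "visits": [0, 0, 3, 7, None, 0, 12, 3, 0, 12],
    "segment": ["a", "Null", "b", "", "a", "a", None, "b", "a", "NaN"],
})

report = pd.DataFrame({col: profile_column(df[col]) for col in df.columns}).T
print(report)
```

Columns where a single value (the mode, the minimum, the maximum, or 0) covers a large share of the rows are good candidates for a closer look, since that value may be a disguised missing value.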
Once you have identified these values, we can use different preprocessing techniques to impute missing values and normalize or exclude dimensions with outliers. You will find many of th...