eBook - ePub

Mastering Azure Machine Learning

Name: Mastering Azure Machine Learning
Author: Christoph Körner, Kaijisse Waaijer

Perform large-scale end-to-end advanced machine learning on the cloud with Microsoft Azure ML

Christoph Körner,

Kaijisse Waaijer,

394 pages
English
ePUB (mobile friendly)
Available on iOS & Android

About This Book

Master expert techniques for building automated and highly scalable end-to-end machine learning models and pipelines in Azure using TensorFlow, Spark, and Kubernetes

Key Features

Make sense of data on the cloud by implementing advanced analytics
Train and optimize advanced deep learning models efficiently on Spark using Azure Databricks
Deploy machine learning models for batch and real-time scoring with Azure Kubernetes Service (AKS)

Book Description

Today's growth in data volume requires distributed systems, powerful algorithms, and scalable cloud infrastructure to compute insights and to train and deploy machine learning (ML) models. This book will help you improve your knowledge of building ML models and end-to-end ML pipelines on the cloud using Azure.

The book starts with an overview of an end-to-end ML project and a guide on how to choose the right Azure service for different ML tasks. It then focuses on Azure Machine Learning and takes you through the process of data experimentation, data preparation, and feature engineering using Azure Machine Learning and Python. You'll learn advanced feature extraction techniques using natural language processing (NLP), classical ML techniques, and the secrets of both a great recommendation engine and a performant computer vision model using deep learning methods. You'll also explore how to train, optimize, and tune models using Azure Automated Machine Learning and HyperDrive, and perform distributed training on Azure. Then, you'll learn different deployment and monitoring techniques using Azure Kubernetes Service with Azure Machine Learning, along with the basics of MLOps (DevOps for ML) to automate your ML process as a CI/CD pipeline.

By the end of this book, you'll have mastered Azure Machine Learning and be able to confidently design, build and operate scalable ML pipelines in Azure.

What you will learn

Set up your Azure Machine Learning workspace for data experimentation and visualization
Perform ETL, data preparation, and feature extraction using Azure best practices
Implement advanced feature extraction using NLP and word embeddings
Train gradient boosted tree-ensembles, recommendation engines and deep neural networks on Azure Machine Learning
Use hyperparameter tuning and Azure Automated Machine Learning to optimize your ML models
Employ distributed ML on GPU clusters using Horovod in Azure Machine Learning
Deploy, operate and manage your ML models at scale
Automate your end-to-end ML process as CI/CD pipelines for MLOps

Who this book is for

This machine learning book is for data professionals, data analysts, data engineers, data scientists, or machine learning developers who want to master scalable cloud-based machine learning architectures in Azure. This book will help you use advanced Azure services to build intelligent machine learning applications. A basic understanding of Python and working knowledge of machine learning are mandatory.

Information

Publisher

Packt Publishing

Year

2020

ISBN

9781789801521

Topic

Computer Science

Subtopic

Information Technology

Index

Computer Science

Section 1: Azure Machine Learning

In the first part of the book, the reader will come to understand the steps and requirements of an end-to-end machine learning pipeline and will be introduced to the different Azure services for machine learning. The reader will learn how to choose a machine learning service for a specific machine learning task.

This section comprises the following chapters:

Chapter 1, Building an end-to-end machine learning pipeline in Azure
Chapter 2, Choosing a machine learning service in Azure

1. Building an end-to-end machine learning pipeline in Azure

This first chapter covers all the required components for running a custom end-to-end machine learning (ML) pipeline in Azure. Some sections might be a recap of your existing knowledge with useful practical tips, step-by-step guidelines, and pointers to using Azure services to perform ML at scale. You can see it as an overview of the book, where we will dive into each section in great detail with many practical examples and a lot of code during the remaining chapters of the book.

First, we will look at data experimentation techniques as a step-by-step process for analyzing common insights, such as missing values, data distribution, feature importance, and two-dimensional embedding techniques to estimate the expected model performance of a classification task. In the second section, we will use these insights about the data to perform data preprocessing and feature engineering, such as normalization, the encoding of categorical and temporal variables, and transforming text columns into meaningful features using Natural Language Processing (NLP).

In the subsequent sections, we will recap the analytical process of training an ML model by selecting a model, an error metric, and a train-test split, and performing cross-validation. Then, we will learn about techniques that help to improve the prediction performance of a single model through hyperparameter tuning, model stacking, and automated machine learning. Finally, we will cover the most common techniques for model deployments, such as online real-time scoring and batch scoring.

The following topics will be covered in this chapter:

Performing descriptive data exploration
Common techniques for data preparation
Choosing the right ML model to train data
Optimization techniques
Deploying and operating models

Performing descriptive data exploration

Descriptive data exploration is, without a doubt, one of the most important steps in an ML project. If you want to clean data and build derived features or select an ML algorithm to predict a target variable in your dataset, then you need to understand your data first. Your data will define many of the necessary cleaning and preprocessing steps; it will define which algorithms you can choose and it will ultimately define the performance of your predictive model.

Hence, data exploration should be considered an important analytical step to understanding whether your data is informative to build an ML model in the first place. By analytical step, we mean that the exploration should be done as a structured analytical process rather than a set of experimental tasks. Therefore, we will go through a checklist of data exploration tasks that you can perform as an initial step in every ML project—before starting any data cleaning, preprocessing, feature engineering, or model selection.

Once the data is provided, we will work through the following data exploration checklist and try to get as many insights as possible about the data and its relation to the target variable:

Analyze the data distribution and check for the following:
- Data types (continuous, ordinal, nominal, or text)
- Mean, median, and percentiles
- Data skew
- Outliers and minimum and maximum values
- Null and missing values
- Most common values
- The number of unique values (in categorical features)
- Correlations (in continuous features)
Analyze how the target variable is influenced by the features and check for the following:
- The regression coefficient (in regression)
- Feature importance (in classification)
- Categorical values with high error rates (in binary classification)
Analyze the difficulty of your prediction task.

By applying these steps, you will be able to understand the data and gain knowledge about the required preprocessing tasks for your data—features and target variables. Along with that, it will give you a good estimate of what difficulties you can expect in your prediction task, which is essential for judging required algorithms and validation strategies. You will also gain an insight into what possible feature engineering methods could apply to your dataset and have a better understanding of how to select a good error metric.
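The distribution part of the checklist can be sketched with pandas. This is a minimal illustration, not the book's own code: the function name `explore_distribution` and the sample columns (`age`, `income`, `city`) are hypothetical, and the chosen statistics are one reasonable reading of the checklist items.

```python
import pandas as pd


def explore_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of summary statistics per column of df."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
        "most_common": df.apply(
            lambda col: col.mode(dropna=True).iloc[0] if col.notna().any() else None
        ),
    })
    # Numeric-only statistics; non-numeric columns are left as NaN.
    summary["mean"] = numeric.mean()
    summary["median"] = numeric.median()
    summary["p25"] = numeric.quantile(0.25)
    summary["p75"] = numeric.quantile(0.75)
    summary["skew"] = numeric.skew()
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    return summary


# Hypothetical sample data with a missing value in two columns and one outlier.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, None, 90],
    "income": [30_000, 52_000, 48_000, 61_000, 58_000, 400_000],
    "city": ["Vienna", "Dublin", "Vienna", None, "Dublin", "Vienna"],
})

summary = explore_distribution(df)
# Pairwise correlations between the continuous features.
correlations = df.select_dtypes(include="number").corr()
print(summary)
print(correlations)
```

A large gap between mean and median, or a high skew value, is a hint to look at outliers before choosing a normalization method.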

Note

You can use a representative subset of the data and extrapolate your hypotheses and insights to the whole dataset.

Moving data to the cloud

Before we can start exploring the data, we need to make it available in our cloud environment. While this seems like a trivial task, efficiently accessing data from a new environment inside a corporate environment is not always easy. Also, uploading, copying, and distributing the same data to many Virtual Machines (VMs) and data science environments is not sustainable and doesn't scale well. For data exploration, we only need a significant subset of the data that can easily be connected to all other environments—rather than live access to a production database or data warehouse.

There is nothing wrong with uploading Comma-Separated Values (CSV) or Tab-Separated Values (TSV) files to your experimentation environment or accessing data via Java Database Connectivity (JDBC) from the source system. However, there are a few easy tricks to optimize your workflow.

First, we will choose a data format optimized for data exploration. In the exploration phase, we need to glance at the source data multiple times and explore the values, feature dimensions, and target variables. Hence, using a human-readable text format is usually very practical. In order to parse it efficiently, a delimiter-separated file, such as CSV, is strongly recommended. CSV can be parsed efficiently and you can open and browse it using any text editor.

Another small tweak that will bring you a significant performance improvement is compressing the file using Gzip before uploading it to the cloud. This will make uploads, loading, and downloads of this file much faster, while the compute resources spent on decompression are minimal. Thanks to the nature of the tabular data, the compression ratio will be very high. Most analytical frameworks for data processing, such as pandas and Spark, can read and parse Gzipped files natively, which requires minimal-to-no code changes. In addition, this only adds a small extra step for reading and analyzing the file manually with an editor.
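As a rough illustration of the compression step, the sketch below writes a small synthetic CSV file, compresses it with Gzip, and reads it back using only the Python standard library. The file name `train.csv` and the generated rows are made up for the example. With pandas, `pd.read_csv("train.csv.gz")` would read the same file directly, because compression is inferred from the `.gz` extension by default.

```python
import csv
import gzip
import os
import shutil
import tempfile

# Synthetic tabular data, used only for illustration.
rows = [["id", "feature", "target"]] + [[i, i % 7, i % 2] for i in range(10_000)]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    gz_path = csv_path + ".gz"

    # Write the plain CSV file.
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Compress it before uploading.
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    raw_size = os.path.getsize(csv_path)
    gz_size = os.path.getsize(gz_path)

    # Read the compressed file back in text mode, without unpacking it first.
    with gzip.open(gz_path, "rt", newline="") as f:
        loaded = list(csv.reader(f))

print(f"raw: {raw_size} bytes, gzipped: {gz_size} bytes")
print(f"rows read back: {len(loaded)}")
```

The size ratio depends heavily on the data; repetitive tabular values such as the ones generated here compress far better than high-entropy columns.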

Once your training data is compressed, it's recommended to upload the Gzipped CSV file to an Azure Storage container; a good choice would be Azure Blob storage. When the data is stored in Blob storage, it can be conveniently accessed from any other services within Azure, as well as from your local machine. This means if you scale your experimentation environment from an Azure notebook to a compute cluster, your code for accessing and reading the data will stay the same.

A fantastic cross-platform GUI tool to interact with many different Azure Storage services is Azure Storage Explorer. Using this tool, it is very easy to efficiently upload small and large files to Blob storage. It also allows you to generate direct links to your files with an embedded access key. This technique is simple yet also super effective when uploading hundreds of terabytes (TBs) from your local machine to the cloud. We will discuss this in much more detail in Chapter 4, ETL, data preparation, and feature extraction.

Understanding missing values

Once the data is uploaded to the cloud (for example, using Azure Storage Explorer and Azure Blob storage for your files), we can bring up a notebook environment and start exploring the data. The goal is to thoroughly explore your data in an analytical process to understand the distribution of each dimension of your data. This is essential for choosing appropriate data preprocessing, feature engineering, and ML algorithms for your use case.

Note

Please keep in mind that not only the feature dimensions but also the target variable needs to be preprocessed and thoroughly analyzed.

Analyzing each dimension of a dataset with more than 100 feature dimensions is an extremely time-consuming task. However, instead of randomly exploring feature dimensions, you can analyze the dimensions ordered by feature importance and hence significantly reduce the time you spend working through the data. As in many other areas of computer science, it is good to apply an 80/20 principle to the initial data exploration and use only 20% of the features to achieve 80% of the performance. This sets you up for a great start and you can always come back later to add more dimensions if needed.
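One simple way to order the dimensions is sketched below. It uses the absolute Pearson correlation with the target as a crude stand-in for feature importance, which is an assumption of this example and not a method prescribed by the text; it only captures linear relationships between numeric columns. The helper names and the sample data are hypothetical.

```python
import math

import pandas as pd


def rank_features(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by absolute correlation with the target."""
    numeric = df.select_dtypes(include="number")
    scores = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return scores.sort_values(ascending=False)


def top_share(ranked: pd.Series, share: float = 0.2) -> pd.Series:
    """Keep the top `share` of the ranked features, but at least one."""
    return ranked.head(max(1, math.ceil(len(ranked) * share)))


# Hypothetical data: "strong" is almost linearly related to the target.
df = pd.DataFrame({
    "strong": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "weak": [5, 1, 4, 2, 6, 3, 7, 2, 5, 4],
    "noise": [3, 3, 1, 4, 2, 4, 1, 3, 2, 3],
    "target": [2, 4, 6, 8, 10, 12, 14, 16, 18, 21],
})

ranked = rank_features(df, "target")
first_pass = top_share(ranked, share=0.2)
print(ranked)
print(list(first_pass.index))
```

Starting the exploration with the columns in `first_pass` and widening the share later follows the 80/20 idea described above.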

The first thing to look for in a new dataset is missing values for each feature dimension. This will help you to gain a deeper understanding of the dataset and what actions could be taken to resolve them. It's not uncommon to remove missing values or impute them with zeros at the beginning of a project. However, this approach bears the risk of not properly analyzing missing values in the first place.

Note

Missing values can be disguised as valid numeric or categorical values. Typical examples are minimum or maximum values, -1, 0, or NaN. Hence, if you find the values 32,767 (= 2^15 - 1) or 65,535 (= 2^16 - 1) appearing multiple times in an integer data column, they might well be missing values disguised as the maximum signed or unsigned 16-bit integer representation. Always assume that your data contains missing values and outliers in different shapes and representations. Your task is to uncover, find, and clean them.
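A small check for such disguised values might look like the following sketch in plain Python. The candidate list, the `min_share` threshold, and the sample readings are assumptions chosen for illustration; in practice the candidates should come from your knowledge of the source system.

```python
from collections import Counter

# Values that are often used as placeholders for "missing" (an assumed list).
SENTINEL_CANDIDATES = {
    -1: "common error code",
    0: "common default",
    2**15 - 1: "maximum signed 16-bit integer",
    2**16 - 1: "maximum unsigned 16-bit integer",
}


def suspicious_values(values, min_share=0.05):
    """Return {value: count} for candidates that occur unusually often.

    Candidates are the known sentinels plus the observed minimum and
    maximum of the column.
    """
    if not values:
        return {}
    counts = Counter(values)
    candidates = set(SENTINEL_CANDIDATES) | {min(values), max(values)}
    return {
        value: counts[value]
        for value in sorted(candidates)
        if counts[value] / len(values) >= min_share
    }


# Hypothetical sensor readings where 32767 stands in for "no measurement".
readings = [120, 135, 32767, 128, 32767, 140, 131, 32767, 126, 133]
flagged = suspicious_values(readings, min_share=0.2)
print(flagged)
```

A flagged value is only a hint: a genuine 0 or a legitimate maximum can also appear often, so each hit still needs to be checked against the domain.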

Any prior knowledge about the data or domain will give you a competitive advantage when working with the data. The reason for this is that you will be able to understand missing values, outliers, and extremes in relation to the data and domain—which will help you to perform better imputation, cleansing, or transformation. As the next step, you should look for these outliers in your data, specifically for the following values:

The absolute number (or percentage) of the null values (look for Null, "Null", "", NaN, and so on)
The absolute number (or percentage) of minimum and maximum values
The absolute number (or percentage) of the most common value (mode)
The absolute number (or percentage) of value 0
The absolute number (or percentage) of unique values
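The counts in the list above can be collected per column with pandas, as in this sketch. The function name `value_profile`, the set of textual null tokens, and the sample data are hypothetical; adjust the tokens to whatever your source system uses to encode missing values.

```python
import pandas as pd

NULL_TOKENS = ("", "Null", "NULL", "null", "NaN")  # assumed textual null markers


def value_profile(col: pd.Series) -> dict:
    """Count nulls, extremes, the mode, zeros, and unique values in a column."""
    is_numeric = pd.api.types.is_numeric_dtype(col)
    is_null = col.isna()
    if not is_numeric:
        # Text columns may also carry nulls as literal strings.
        is_null = is_null | col.isin(list(NULL_TOKENS))
    valid = col[~is_null]
    return {
        "null": int(is_null.sum()),
        "min": int((valid == valid.min()).sum()) if not valid.empty else 0,
        "max": int((valid == valid.max()).sum()) if not valid.empty else 0,
        "mode": int(valid.value_counts().iloc[0]) if not valid.empty else 0,
        "zero": int((valid == 0).sum()) if is_numeric else 0,
        "unique": int(valid.nunique()),
    }


# Hypothetical data with real and disguised nulls.
df = pd.DataFrame({
    "sensor": [0, 0, 12, 15, 15, 15, None, 40],
    "label": ["a", "", "b", "Null", "a", "a", None, "b"],
})

profile = pd.DataFrame({name: value_profile(df[name]) for name in df.columns}).T
percentages = profile / len(df) * 100
print(profile)
print(percentages.round(1))
```

Each cell of `profile` is an absolute count of rows; `percentages` expresses the same numbers relative to the total number of rows.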

Once you have identified these values, we can use different preprocessing techniques to impute missing values and normalize or exclude dimensions with outliers. You will find many of th...

Preface
Section 1: Azure Machine Learning
1. Building an end-to-end machine learning pipeline in Azure
2. Choosing a machine learning service in Azure
Section 2: Experimentation and Data Preparation
3. Data experimentation and visualization using Azure
4. ETL, data preparation, and feature extraction
5. Azure Machine Learning pipelines
6. Advanced feature extraction with NLP
Section 3: Training Machine Learning Models
7. Building ML models using Azure Machine Learning
8. Training deep neural networks on Azure
9. Hyperparameter tuning and Automated Machine Learning
10. Distributed machine learning on Azure
11. Building a recommendation engine in Azure
Section 4: Optimization and Deployment of Machine Learning Models
12. Deploying and operating machine learning models
13. MLOps—DevOps for machine learning
14. What's next?
Index