Applied Data Science with Python and Jupyter
eBook - ePub

Applied Data Science with Python and Jupyter

Use powerful industry-standard tools to unlock new, actionable insights from your data

  • 192 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android


About This Book

Become the master player of data exploration by creating reproducible data processing pipelines, visualizations, and prediction models for your applications.

Key Features

  • Get up and running with the Jupyter ecosystem and some example datasets
  • Learn about key machine learning concepts such as SVM, KNN classifiers, and Random Forests
  • Discover how you can use web scraping to gather and parse your own bespoke datasets

Book Description

Getting started with data science doesn't have to be an uphill battle. Applied Data Science with Python and Jupyter is a step-by-step guide ideal for beginners who know a little Python and are looking for a quick, fast-paced introduction to these concepts. In this book, you'll learn every aspect of the standard data workflow process, including collecting, cleaning, investigating, visualizing, and modeling data. You'll start with the basics of Jupyter, which will be the backbone of the book. After familiarizing yourself with its standard features, you'll see it in practice with a first analysis. In the next lesson, you'll dive right into predictive analytics, where multiple classification algorithms are implemented. Finally, the book ends by looking at data collection techniques. You'll see how web data can be acquired with scraping techniques and via APIs, and then briefly explore interactive visualizations.

What you will learn

  • Get up and running with the Jupyter ecosystem
  • Identify potential areas of investigation and perform exploratory data analysis
  • Plan a machine learning classification strategy and train classification models
  • Use validation curves and dimensionality reduction to tune and enhance your models
  • Scrape tabular data from web pages and transform it into Pandas DataFrames
  • Create interactive, web-friendly visualizations to clearly communicate your findings

Who this book is for

Applied Data Science with Python and Jupyter is ideal for professionals with a variety of job descriptions across a large range of industries, given the rising popularity and accessibility of data science. You'll need some prior experience with Python, and any prior work with libraries such as Pandas and Matplotlib will give you a useful head start.

Applied Data Science with Python and Jupyter by Alex Galea is available in PDF and ePUB format.

Information

Year
2018
ISBN
9781789951929

Data Cleaning and Advanced Machine Learning

Learning Objectives

By the end of this chapter, you will be able to:
  • Plan a machine learning classification strategy
  • Preprocess data to prepare it for machine learning
  • Train classification models
  • Use validation curves to tune model parameters
  • Use dimensionality reduction to enhance model performance
In this chapter, you will learn data preprocessing and machine learning by completing several practical exercises.

Introduction

Consider a small food-delivery business that is looking to optimize their product. An analyst might look at the appropriate data and determine what type of food people are enjoying most. Perhaps they find a large number of people are ordering the spiciest food options, indicating the business might be losing out on customers who desire something even more spicy. This is quite basic, or as some might say, "vanilla" analytics.
In a separate task, the analyst could employ predictive analytics by modeling the order volumes over time. With enough data, they could predict the future order volumes and therefore guide the restaurant as to how many staff are required each day. This model could take factors such as the weather into account to make the best predictions. For instance, a heavy rainstorm could be an indicator to staff more delivery personnel to make up for slow travel times. With historical weather data, that type of signal could be encoded into the model. This prediction model would save a business the time of having to consider these problems manually, and money by keeping customers happy and thereby increasing customer retention.
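As a rough sketch of how a weather signal could be encoded into such a model (the column names and values here are hypothetical; the book does not provide code for this scenario), historical weather records can simply be joined onto the order data as extra features:

```python
import pandas as pd

# Hypothetical daily order volumes and weather records
orders = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2018-06-02", "2018-06-03"]),
    "order_volume": [120, 95, 160],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2018-06-02", "2018-06-03"]),
    "rainfall_mm": [0.0, 12.5, 2.1],
})

# Join the weather signal onto the orders so it can be used as a feature
df = orders.merge(weather, on="date")

# Encode "heavy rainstorm" as a binary indicator the model can learn from
df["heavy_rain"] = (df["rainfall_mm"] > 10).astype(int)
print(df[["order_volume", "rainfall_mm", "heavy_rain"]])
```

A forecasting model trained on a table like this could then associate heavy-rain days with slower delivery times and higher staffing needs.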
The goal of data analytics in general is to uncover actionable insights that result in positive business outcomes. In the case of predictive analytics, the aim is to do this by determining the most likely future outcome of a target, based on previous trends and patterns.
The benefits of predictive analytics are not restricted to big technology companies. Any business can find ways to benefit from machine learning, given the right data.
Companies all around the world are collecting massive amounts of data and using predictive analytics to cut costs and increase profits. Some of the most prevalent examples of this are from the technology giants Google, Facebook, and Amazon, who utilize big data on a huge scale. For example, Google and Facebook serve you personalized ads based on predictive algorithms that guess what you are most likely to click on. Similarly, Amazon recommends personalized products that you are most likely to buy, given your previous purchases.
Modern predictive analytics is done with machine learning, where computer models are trained to learn patterns from data. As we saw briefly in the previous chapter, software such as scikit-learn can be used with Jupyter Notebooks to efficiently build and test machine learning models. As we will continue to see, Jupyter Notebooks are an ideal environment for doing this type of work, as we can perform ad-hoc testing and analysis, and easily save the results for reference later.
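To illustrate the kind of workflow this enables, here is a minimal scikit-learn example of training and testing a classifier in a notebook cell. It uses a built-in toy dataset as a stand-in for real business data; the specific model and settings are illustrative, not the book's own code:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset to stand in for real business data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train a classifier, then evaluate it on held-out data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```

Because each step lives in its own cell, the notebook records the whole experiment alongside its output for later reference.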
In this chapter, we will again take a hands-on approach by running through various examples and activities in a Jupyter Notebook. Whereas the previous chapter offered only a couple of examples of machine learning, here we'll take a much slower and more thoughtful approach. Using an employee retention problem as our overarching example for the chapter, we will discuss how to approach predictive analytics, what to consider when preparing the data for modeling, and how to implement and compare a variety of models using Jupyter Notebooks.

Preparing to Train a Predictive Model

Here, we will cover the preparation required to train a predictive model. Although not as technically glamorous as training the models themselves, this step should not be taken lightly. It's very important to ensure you have a good plan before proceeding with the details of building and training a reliable model. Furthermore, once you've decided on the right plan, there are technical steps in preparing the data for modeling that should not be overlooked.

Note

We must be careful not to go so deep into the weeds of technical tasks that we lose sight of the goal. Technical tasks include things that require programming skills, for example, constructing visualizations, querying databases, and validating predictive models. It's easy to spend hours trying to implement a specific feature or get the plots looking just right. Doing this sort of thing is certainly beneficial to our programming skills, but we should not forget to ask ourselves if it's really worth our time with respect to the current project.
Also, keep in mind that Jupyter Notebooks are particularly well-suited for this step, as we can use them to document our plan, for example, by writing rough notes about the data or a list of models we are interested in training. Before starting to train models, it's good practice to take this a step further and write out a well-structured plan to follow. Not only will this help you stay on track as you build and test the models, but it will allow others to understand what you're doing when they see your work.
After discussing the preparation, we will also cover another step in preparing to train the predictive model, which is cleaning the dataset. This is another thing that Jupyter Notebooks are well-suited for, as they offer an ideal testing ground for performing dataset transformations and keeping track of the exact changes. The data transformations required for cleaning raw data can quickly become intricate and convoluted; therefore, it's important to keep track of your work. As discussed in the first chapter, tools other than Jupyter Notebooks just don't offer very good options for doing this efficiently.
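As a small, self-contained sketch of the kind of cleaning transformations meant here (the table and its values are made up for illustration), a notebook cell might normalize text categories, parse numbers stored as strings, and fill missing values, keeping every decision visible in code:

```python
import pandas as pd

# A small messy table standing in for raw data (values are made up)
raw = pd.DataFrame({
    "salary": ["52,000", "61,500", None, "48,250"],
    "department": ["sales", "Sales", "engineering", None],
})

clean = raw.copy()
# Normalize inconsistent text categories
clean["department"] = clean["department"].str.lower()
# Parse numbers that were stored as strings with thousands separators
clean["salary"] = clean["salary"].str.replace(",", "").astype(float)
# Fill missing values with simple defaults, recording the decision in code
clean["salary"] = clean["salary"].fillna(clean["salary"].median())
clean["department"] = clean["department"].fillna("unknown")
print(clean)
```

Because each transformation is a separate, rerunnable step, the notebook doubles as a log of exactly how the raw data was altered.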
Before we progress to the next section, let's pause and think about these ideas in the context of a real-life example.
Consider the following situation:
You are hired by an online video game marketplace that wants to increase the conversion rate of visitors to its site. The company asks you to use predictive analytics to determine which genre of game a user will like, so that it can display specialized content that encourages the user to make a purchase. It wants to do this without having to ask customers their preferred game genre.
Is this a problem that can be solved? What type of data would be required? What would be the business implications?
To address this challenge, we could consider making the prediction based on users' browsing cookies. For example, if they have a cookie from previously visiting a World of Warcraft website, this would act as an indicator that they like role-playing games.
Another valuable piece of data would be a history of the games that user has previously bought in the marketplace. This could be the target variable in a machine learning algorithm, for example, a model that could predict which games the user would be interested in, based on the type of cookies in their browsing session. An alternate target variable could be constructed by setting up a survey in the marketplace to collect data on user preferences.
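A toy sketch of this setup might look as follows. All feature names, genres, and values here are invented for illustration; in practice the features would come from real cookie data and the target from purchase history or a survey:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical browsing-cookie indicators per user (1 = cookie present)
features = pd.DataFrame({
    "visited_wow_site":  [1, 0, 1, 0, 1, 0],
    "visited_fps_forum": [0, 1, 0, 1, 0, 1],
})
# Target: each user's preferred genre, e.g. from past purchases
target = ["rpg", "shooter", "rpg", "shooter", "rpg", "shooter"]

# Fit a simple classifier mapping cookie indicators to genre
model = LogisticRegression()
model.fit(features, target)

# Predict the genre for a new visitor with a World of Warcraft cookie
new_visitor = pd.DataFrame({"visited_wow_site": [1],
                            "visited_fps_forum": [0]})
print(model.predict(new_visitor))
```

The marketplace could then use such predictions to choose which genre's content to display to each visitor.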
In terms of the business implications, being able to accurately predict the genre of game is very important to the success of the campaign. In fact, getting the prediction wrong is doubly problematic: not only do we miss out on the opportunity to target users, but we may end up showing users content that would be negatively perceived. This could lead to more people leaving the site and fewer sales.

Determining a Plan for Predictive Analytics

When formulating a plan for doing predictive modeling, one should start by considering stakeholder needs. A perfect model will be useless if it doesn't solve a relevant problem. Planning a strategy around business needs ensures that a successful model will lead to actionable insights.
Although it may be possible in principle to solve many business problems, the ability to deliver the solution will always depend on the availability of the necessary data. Therefore, it's important to consider the business needs in the context of the available data sources. When data is plentiful, this will have little effect, but as the amount of available data becomes smaller, so too does the scope of problems that can be solved.
These ideas can be formed in...

Table of contents

  1. Preface
  2. Jupyter Fundamentals
  3. Data Cleaning and Advanced Machine Learning
  4. Web Scraping and Interactive Visualizations
  5. Appendix A