The Applied Data Science Workshop
eBook - ePub

The Applied Data Science Workshop

Get started with the applications of data science and techniques to explore and assess data effectively, 2nd Edition

  1. 352 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Designed with beginners in mind, this workshop helps you make the most of Python libraries and the Jupyter Notebook's functionality to understand how data science can be applied to solve real-world data problems.

Key Features

  • Gain useful insights into data science and machine learning
  • Explore the different functionalities and features of a Jupyter Notebook
  • Discover how Python libraries are used with Jupyter for data analysis

Book Description

From banking and manufacturing through to education and entertainment, using data science for business has revolutionized almost every sector in the modern world. It has an important role to play in everything from app development to network security.

Taking an interactive approach to learning the fundamentals, this book is ideal for beginners. You'll learn all the best practices and techniques for applying data science in the context of real-world scenarios and examples.

After an introduction to data science and machine learning, you'll get to grips with Jupyter functionality and features. You'll use Python libraries like scikit-learn, pandas, Matplotlib, and Seaborn to perform data analysis and data preprocessing on real-world datasets from within your own Jupyter environment. Progressing through the chapters, you'll train classification models using scikit-learn, and assess model performance using advanced validation techniques. Towards the end, you'll use Jupyter Notebooks to document your research, build stakeholder reports, and even analyze web performance data.

By the end of The Applied Data Science Workshop, you'll be ready to move beyond the beginner level and confidently apply data science techniques and tools to real-world projects.

What you will learn

  • Understand the key opportunities and challenges in data science
  • Use Jupyter for data science tasks such as data analysis and modeling
  • Run exploratory data analysis within a Jupyter Notebook
  • Visualize data with pairwise scatter plots and segmented distributions
  • Assess model performance with advanced validation techniques
  • Parse HTML responses and analyze HTTP requests

Who this book is for

If you are an aspiring data scientist who wants to build a career in data science, or a developer who wants to explore the applications of data science from scratch and analyze data in Jupyter using Python libraries, then this book is for you. A basic understanding of Python programming and machine learning is recommended to help you grasp the topics covered in the book more quickly, but it is not mandatory.

The Applied Data Science Workshop by Alex Galea is available in PDF and/or ePUB format.

Information

Year: 2020
ISBN: 9781800207004
Edition: 2

1. Introduction to Jupyter Notebooks

Overview
This chapter describes Jupyter Notebooks and their use in data analysis. It also explains the features of Jupyter Notebooks, which allow for additional functionality beyond running Python code. You will learn and implement the fundamental features of Jupyter Notebooks by completing several hands-on exercises. By the end of this chapter, you will be able to use some important features of Jupyter Notebooks and some key libraries available in Python.

Introduction

Our approach to learning in this book is highly applied since hands-on learning is the quickest way to understand abstract concepts. With this in mind, the focus of this chapter is to introduce Jupyter Notebooks, the data science tool that we will be using throughout this book.
Since Jupyter Notebooks have gained mainstream popularity, they have been one of the most important tools for data scientists who use Python. This is because they offer a great environment for a variety of tasks, such as performing quick and dirty analysis, researching model selection, and creating reproducible pipelines. They allow for data to be loaded, transformed, and modeled inside a single file, where it's quick and easy to test out code and explore ideas along the way. Furthermore, all of this can be documented inline using formatted text, which means you can make notes or even produce a structured report.
Other comparable platforms, such as RStudio or Spyder, offer multiple panels to work between. Frequently, one of these panels will be a Read-Eval-Print Loop (REPL), where code is run in a Terminal session that keeps variables in memory between commands. Code written here may end up being copied and pasted into a different panel within the main codebase, and there may also be additional panels to see visualizations or other files. Such development environments are prone to efficiency issues and can promote bad practices for reproducibility if you're not careful.
Jupyter Notebooks work differently. Instead of having multiple panels for different components of your project, they offer the same functionality in a single component (that is, the Notebook), where the text is displayed along with code snippets, and code outputs are displayed inline. This lets you code efficiently and allows you to look back at previous work for reference, or even make alterations.
We'll start this chapter by explaining exactly what Jupyter Notebooks are and why they are so popular among data scientists. Then, we'll access a Notebook together and go through some exercises to learn how the platform is used.

Basic Functionality and Features of Jupyter Notebooks

In this section, we will briefly demonstrate the usefulness of Jupyter Notebooks with examples. Then, we'll walk through the basics of how they work and how to run them within the Jupyter platform. For those who have used Jupyter Notebooks before, this will be a good refresher, and you are likely to uncover new things as well.

What Is a Jupyter Notebook and Why Is It Useful?

Jupyter Notebooks are locally run web applications that contain live code, equations, figures, interactive apps, and Markdown text, and in which the default programming language is Python. In other words, a Notebook will assume you are writing Python unless you tell it otherwise. We'll see examples of this when we work through our first workbook, later in this chapter.
Note
Jupyter Notebooks support many programming languages through the use of kernels, which act as bridges between the Notebook and the language. These include R, C++, and JavaScript, among many others. A list of available kernels can be found here: https://packt.live/2Y0jKJ0.
The following is an example of a Jupyter Notebook:
Figure 1.1: Jupyter Notebook sample workbook
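As a rough illustration of what sits behind a Notebook like the one in the figure, the following sketch builds a heavily simplified Notebook document as a Python dictionary and serializes it to JSON, which is the format used by .ipynb files. The cell contents and the summarize helper are made up for this example, and real Notebooks usually carry more metadata than shown here. In practice you would create Notebooks through the Jupyter interface rather than by hand.

```python
import json

# A simplified Notebook document (assumption: nbformat version 4.4 layout).
# The cell contents below are invented for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        # The kernelspec records which language kernel runs the code cells.
        "kernelspec": {
            "name": "python3",
            "display_name": "Python 3",
            "language": "python",
        }
    },
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My analysis\n", "Some *formatted* notes."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # the cell has not been run yet
            "outputs": [],            # outputs are stored inline with the cell
            "source": ["print('hello')"],
        },
    ],
}


def summarize(nb):
    """Return (cell_type, source_text) pairs for each cell (hypothetical helper)."""
    return [(cell["cell_type"], "".join(cell["source"])) for cell in nb["cells"]]


# Round-trip through JSON text, as would happen when saving and reopening a file.
loaded = json.loads(json.dumps(notebook, indent=1))

for cell_type, source in summarize(loaded):
    print(f"[{cell_type}] {source!r}")
```

The point of the sketch is that text, code, and outputs all live together as a list of independent cells in one file, which is what allows a Notebook to show them inline.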
Besides executing Python code, you can write in Markdown to quickly render formatted text, such as titles, lists, or bold font. This can be done in combination with code using the concept of independent cells in the Notebook, as seen in Figure 1.2. Markdown is not specific to Jupyter; it is a simple, general-purpose language used for styling text and creating basic documents. For example, most GitHub repositories have a README.md file that is written in Markdown format. It's comparable to HTML but offers much less customization in exchange for simplicity.
Commonly used symbols in Markdown include hashes (#) to make text into a heading, square ([]) and round brackets (()) to insert hyperlinks, and asterisks (*) to create italicized or bold text:
Figure 1.2: Sample Markdown document
In addition, Markdown can be used to render images and add hyperlinks in your document, both of which are supported in Jupyter Notebooks.
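To make the comparison with HTML concrete, here is a minimal sketch that converts the Markdown symbols described above (hashes, brackets, and asterisks) into HTML tags. The function toy_markdown_to_html and the sample text are hypothetical and written only for this illustration; this is not how Jupyter renders Markdown, and it handles only a tiny subset of the syntax.

```python
import re

# Sample Markdown using hashes, asterisks, and brackets (content is made up).
SAMPLE = """# Analysis notes
This is *italic* and this is **bold**.
See the [Jupyter site](https://jupyter.org) for more."""


def toy_markdown_to_html(text):
    """Convert a tiny subset of Markdown to HTML, line by line (illustration only)."""
    html_lines = []
    for line in text.splitlines():
        # Bold is handled before italic because ** also contains *.
        line = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", line)
        line = re.sub(r"\*(.+?)\*", r"<i>\1</i>", line)
        # [text](url) becomes a hyperlink.
        line = re.sub(r"\[(.+?)\]\((.+?)\)", r'<a href="\2">\1</a>', line)
        # Leading hashes become a heading; the number of hashes sets the level.
        match = re.match(r"(#{1,6})\s+(.*)", line)
        if match:
            level = len(match.group(1))
            line = f"<h{level}>{match.group(2)}</h{level}>"
        else:
            line = f"<p>{line}</p>"
        html_lines.append(line)
    return "\n".join(html_lines)


print(toy_markdown_to_html(SAMPLE))
```

Each Markdown symbol maps to a longer HTML tag, which shows the trade-off mentioned earlier: Markdown is quicker to type but offers far fewer options than HTML.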
Jupyter Notebooks were not the first tool to use Markdown alongside code. This was already the design of R Markdown, a hybrid language where R code can be written and executed inline with Markdown text. Jupyter Notebooks essentially offer the equivalent functionality for Python code. However, as we will see, they function quite differently from R Markdown documents. For example, R Markdown assumes you are writing Markdown unless otherwise specified, whereas Jupyter Notebooks assume you are inputting code. This and other features (as we will explore throughout) make it more appealing to use Jupyter Notebooks for rapid development in data science research.
While Jupyter Notebooks offer a blank canvas for a general range of applications, the types of Notebooks commonly seen in real-world data science can be categorized as either lab-style or deliverable.
Lab-style Notebooks serve as the programming analog of research journals. These should contain all the work you've done to load, process, analyze, and model the data. The idea here is to document everything you've done for future reference. For this reason, it's usually not advisable to delete or alter previous lab-style Notebooks. It's also a good idea to accumulate multiple date-stamped versions of the Notebook as you progress through the analysis, in case you want to look back at previous states.
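One possible way to accumulate date-stamped versions is sketched below. The helper name snapshot_notebook, the snapshots folder, and the date-prefix naming scheme are all assumptions made for this example rather than a convention from the book; version control tools would be another reasonable option.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot_notebook(notebook_path, snapshot_dir="snapshots", today=None):
    """Copy a Notebook into snapshot_dir with the date prepended to its name.

    Hypothetical helper: `today` can be passed explicitly (a datetime.date),
    otherwise the current date is used.
    """
    source = Path(notebook_path)
    stamp = (today or date.today()).isoformat()  # formatted as YYYY-MM-DD
    target_dir = Path(snapshot_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{stamp}_{source.name}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

For instance, calling snapshot_notebook("analysis.ipynb") on a hypothetical file named analysis.ipynb would leave the original untouched and place a dated copy in the snapshots folder, so earlier states of the analysis remain available.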
Deliverable Notebooks are intended to be presentable and should contain only select parts of the lab-style Notebooks. For example, this could be an interesting discovery to share with your colleagues, an in-depth report of your analysis for a manager, ...

Table of contents

  1. The Applied Data Science Workshop
  2. Preface
  3. 1. Introduction to Jupyter Notebooks
  4. 2. Data Exploration with Jupyter
  5. 3. Preparing Data for Predictive Modeling
  6. 4. Training Classification Models
  7. 5. Model Validation and Optimization
  8. 6. Web Scraping with Jupyter Notebooks
  9. Appendix