eBook - ePub

Python Data Science Essentials

Name: Python Data Science Essentials
Author: Alberto Boschetti, Luca Massaron

A practitioner's guide covering essential data science principles, tools, and techniques, 3rd Edition

Alberto Boschetti, Luca Massaron

472 pages
English

About This Book

Gain useful insights from your data using popular data science tools

Key Features

A one-stop guide to Python libraries such as pandas and NumPy
Comprehensive coverage of data science operations such as data cleaning and data manipulation
Choose scalable learning algorithms for your data science tasks

Book Description

Fully expanded and upgraded, the latest edition of Python Data Science Essentials will help you succeed in data science operations using the most common Python libraries. This book offers up-to-date insight into the core of Python, including the latest versions of the Jupyter Notebook, NumPy, pandas, and scikit-learn.

The book covers detailed examples and large hybrid datasets to help you grasp essential statistical techniques for data collection, data munging and analysis, visualization, and reporting activities. You will also gain an understanding of advanced data science topics such as machine learning algorithms, distributed computing, tuning predictive models, and natural language processing. Furthermore, you'll be introduced to deep learning and gradient boosting solutions such as XGBoost, LightGBM, and CatBoost.

By the end of the book, you will have gained a complete overview of the principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.

What you will learn

Set up your data science toolbox on Windows, Mac, and Linux
Use the core machine learning methods offered by the scikit-learn library
Manipulate, fix, and explore data to solve data science problems
Learn advanced explorative and manipulative techniques to solve data operations
Tune your machine learning models for better performance
Explore and cluster graphs, taking advantage of interconnections and links in your data

Who this book is for

If you're a data science entrant, data analyst, or data engineer, this book will help you get ready to tackle real-world data science problems without wasting any time. Basic knowledge of probability/statistics and Python coding experience will assist you in understanding the concepts covered in this book.

Information

Publisher

Packt Publishing

Year

2018

ISBN

9781789531893

Edition

Topic

Computer Science

Subtopic

Data Processing

Data Munging

We are just getting into the action with data! In this chapter, you'll learn how to munge data. What does data munging mean?

The term mung is a technical term that was coined about half a century ago by students at the Massachusetts Institute of Technology (MIT). Munging means to change, in a series of well-specified and reversible steps, a piece of original data into a completely different (and hopefully more useful) one. Deep-rooted in hacker culture, munging is often described in the data science pipeline using other, almost synonymous, terms such as data wrangling or data preparation.

Given such premises, in this chapter, the following topics will be covered:

The data science process (so that you'll know what is going on and what's next)
Uploading data from a file
Selecting the data you need
Cleaning up any missing or wrong data
Adding, inserting, and deleting data
Grouping and transforming data to obtain new and meaningful information
Managing to obtain a dataset matrix or an array to feed into the data science pipeline
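
To make the topics above more concrete, here is a minimal sketch of how they might look with pandas. The tiny in-memory CSV, the column names, and the choice to treat a missing count as zero are all assumptions made for illustration; they are not taken from the book's own examples.

```python
import io

import pandas as pd

# Hypothetical data: a tiny CSV held in memory instead of a file on disk.
raw_csv = io.StringIO(
    "city,product,units,price\n"
    "Rome,apple,10,0.5\n"
    "Rome,pear,,0.8\n"
    "Milan,apple,7,0.6\n"
    "Milan,pear,3,0.9\n"
)

# Load the data from a file-like object.
df = pd.read_csv(raw_csv)

# Select only the columns you need (copy to work on an independent table).
df = df[["city", "units", "price"]].copy()

# Clean up missing data: a missing count is treated as zero (an assumption).
df["units"] = df["units"].fillna(0)

# Add a new column derived from the existing ones.
df["revenue"] = df["units"] * df["price"]

# Group the rows to obtain new, aggregated information.
revenue_by_city = df.groupby("city")["revenue"].sum()

# Obtain a purely numeric array to feed into later steps.
matrix = df[["units", "price", "revenue"]].to_numpy()

print(revenue_by_city)
print(matrix.shape)
```

Each statement maps to one topic in the list, so you can read the sketch as a compressed preview rather than a recipe to follow literally.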

The data science process

Although every data science project is different, for our illustrative purposes, we can partition an ideal data science project into a series of reduced and simplified phases.

The process starts by obtaining data (a phase known as data ingestion). Data ingestion implies a series of possible alternatives, from simply uploading data to assembling it from RDBMS or NoSQL repositories, or from synthetically generating it to scraping it from web APIs or HTML pages.

Especially when faced with novel challenges, uploading data can turn out to be a critical part of a data scientist's work. Your data can arrive from multiple sources: databases, CSV or Excel files, raw HTML, images, sound recordings, APIs providing JavaScript Object Notation (JSON) files, and so on. (If you are unsure what an API is, you can read a good tutorial about APIs with Python here: https://www.dataquest.io/blog/python-api-tutorial/.) Given the wide range of alternatives, we will just briefly touch upon this aspect by offering the basic tools to get your data (even if it is too big) into your computer memory by using either a textual file that's present on your hard disk or the web, or tables in a relational database management system (RDBMS).
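
As a rough illustration of these two routes, the following sketch reads a textual source in small chunks (one way of coping with data that is too big to hold at once) and then queries a relational table. The CSV text, the `measurements` table, and the in-memory SQLite database are hypothetical stand-ins for a real file and a real RDBMS.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical textual source; in practice this would be a path or a URL.
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

# Read two rows at a time so the whole file never has to sit in memory.
total = 0
n_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["value"].sum())
    n_rows += len(chunk)

# Hypothetical relational source: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30)],
)
conn.commit()

# Let the database do the filtering, then load the result as a DataFrame.
from_db = pd.read_sql_query(
    "SELECT id, value FROM measurements WHERE value >= 20", conn
)
conn.close()

print(n_rows, total)
print(from_db)
```

Pushing the `WHERE` clause into the query means only the rows you need ever reach your computer's memory.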

After successfully uploading your data comes the data munging phase. Although now available in memory, your data will almost inevitably be in a form that's unsuitable for analysis and experimentation. Data in the real world is complex, messy, and sometimes even erroneous or missing. Yet, thanks to a handful of basic Python data structures and commands, you'll address all the problematic data and feed it into the next phases of the project, appropriately transformed into a typical dataset that has observations in rows and variables in columns. A dataset is a basic requirement for any statistical and machine learning analysis, and you may hear it referred to as a flat file (when it is the result of joining together multiple relational tables from a database) or a data matrix (when columns and rows are unlabeled and the values it contains are just numeric).
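
The difference between a flat file and a data matrix can be sketched in a few lines. The two small tables below (`customers` and `orders`) and their columns are invented for illustration only.

```python
import pandas as pd

# Two hypothetical relational tables sharing the key customer_id.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 27]})
orders = pd.DataFrame(
    {"customer_id": [1, 1, 2, 3], "amount": [20.0, 35.5, 12.0, 8.25]}
)

# Flat file: one row per observation (an order), one column per variable.
flat = orders.merge(customers, on="customer_id", how="left")

# Data matrix: the same observations as unlabeled, purely numeric values.
data_matrix = flat[["age", "amount"]].to_numpy(dtype=float)

print(flat)
print(data_matrix)
```

The flat table still carries column names and the join key, whereas the matrix keeps only the numbers, which is the form many numerical routines expect.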

Though less rewarding than other intellectually stimulating phases (such as the application of algorithms or machine learning), data munging creates the foundations for every complex and sophisticated value-added analysis that you may have in mind to obtain. The success of your project heavily relies on it.

Having completely defined the dataset that you'll be working on, a new phase opens up. At this time, you'll start observing your data; then, you will proceed to develop and test your hypothesis in a recurring loop. For instance, you'll explore your variables graphically. With the help of descriptive stats, you'll figure out how to create new variables by putting your domain knowledge into action. You'll address redundant and unexpected information (outliers, first of all) and select the most meaningful variables and effective parameters to be tested by a selection of machine learning algorithms.
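
A minimal sketch of this exploratory loop might look as follows. The measurements are made up, and the 1.5 × IQR rule used to flag outliers is just one common convention that we assume here, not a method prescribed by this chapter.

```python
import pandas as pd

# Hypothetical measurements; the last height is an obvious data-entry error.
df = pd.DataFrame(
    {
        "height_cm": [170, 165, 180, 175, 172, 168, 300],
        "weight_kg": [68, 59, 85, 77, 70, 62, 75],
    }
)

# Descriptive statistics give a first look at each variable.
summary = df.describe()

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1 = df["height_cm"].quantile(0.25)
q3 = df["height_cm"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["height_cm"] < q1 - 1.5 * iqr) | (
    df["height_cm"] > q3 + 1.5 * iqr
)

# Drop the flagged rows and create a new variable from domain knowledge.
clean = df[~is_outlier].copy()
clean["bmi"] = clean["weight_kg"] / (clean["height_cm"] / 100) ** 2

print(summary)
print(clean)
```

Here the body mass index plays the role of a variable that exists only because you know something about the domain.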

This phase is structured as a pipeline, where your data is processed according to a series of steps. After that, a model is finally created, but you may realize that you have to reiterate and start again from data munging or somewhere in the data pipeline, supplying corrections or trying different experiments, until you have reached a meaningful result.
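
The iterate-and-compare idea can be sketched with plain NumPy. Everything here is an assumption chosen for brevity: the synthetic, noise-free quadratic data, the even/odd split into training and validation points, and the use of `np.polyfit` as a stand-in for whatever model you would really test.

```python
import numpy as np

# Hypothetical data following an exact quadratic relationship.
x = np.linspace(-3, 3, 20)
y = x ** 2

# Hold out every other point to validate each experiment.
x_train, y_train = x[::2], y[::2]
x_valid, y_valid = x[1::2], y[1::2]

# Try several candidate settings and record the validation error of each.
results = {}
for degree in (1, 2):
    coefficients = np.polyfit(x_train, y_train, degree)
    predictions = np.polyval(coefficients, x_valid)
    results[degree] = float(np.mean((predictions - y_valid) ** 2))

# Keep the setting with the lowest validation error.
best_degree = min(results, key=results.get)

print(results)
print(best_degree)
```

In a real project, each pass through the loop may also send you back to change the munging steps, not just the model settings.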

From our experience in the field, we can assure you that no matter how promising your plans were when starting to analyze the data, in the end, your solution will be much different from any first envisioned idea. The experimental results you obtain will dictate the kind of data munging, optimizations, models, and the overall number of iterations you have to go through before reaching a satisfactory end to your project. That is why, if you want to be a successful data scientist, it won't suffice to provide theoretically sound solutions. You need to be able to prototype a large number of possible solutions quickly in order to ascertain which is the best path to take. It is our purpose to help you accelerate as much as possible by using the code snippets provided by this book in your data science process.

A result from your project is represented by an error or optimization measure (that you have chosen carefully in order to represent your business targets). Besides an error measurement, your achievement can also be communicated by an interpretable insight that has to be verbally or visually described to your data science project's sponsors or other data scientists. At this point, being able to visualize results and insights appropriately using tables, charts, and plots is indeed essential.
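
As an example of such a measure, the sketch below computes two common error metrics, the mean absolute error (MAE) and the root mean squared error (RMSE), on a handful of invented predictions. Which metric best represents your business targets is a choice only you can make; these two are shown purely for illustration.

```python
import numpy as np

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Signed errors of each prediction.
errors = y_pred - y_true

# MAE: the average size of the errors, in the units of the target.
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but penalizing large errors more heavily.
rmse = np.sqrt(np.mean(errors ** 2))

print(mae, rmse)
```

Because RMSE squares each error before averaging, a single large miss weighs more than several small ones, which may or may not match how your sponsors judge a mistake.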

This process can also be described using the acronym OSEMN (Obtain, Scrub, Explore, Model, iNterpret), as introduced by Hilary Mason and Chris Wiggins in a famous post on the blog dataists (http://www.dataists.com/2010/09/a-taxonomy-of-data-science/), describing a data science taxonomy. OSEMN is also quite memorable since it rhymes with the words possum and awesome.

We won't ever get tired of remarking how everything starts with munging your data and that munging can easily require up to 80% of your efforts in a data project. Since even the longest journey starts with a single step, let's immediately step into this chapter and learn the building blocks of a successful munging phase!

Data loading and preprocessing with pandas

In the previous chapter, we discussed where to find useful datasets and examined the basic import commands of Python packages. In this section, with your toolbox ready, you are about to learn how to load, manipulate, process, and polish data in a structured way using pandas and NumPy.

Fast and easy data loading

Let's start with a CSV fi...

Title Page
Copyright and Credits
Packt Upsell
Contributors
Preface
First Steps
Data Munging
The Data Pipeline
Machine Learning
Visualization, Insights, and Results
Social Network Analysis
Deep Learning Beyond the Basics
Spark for Big Data
Strengthen Your Python Foundations
Other Books You May Enjoy
