eBook - ePub

Python Data Science Essentials

Name: Python Data Science Essentials
Author: Alberto Boschetti, Luca Massaron

A practitioner's guide covering essential data science principles, tools, and techniques, 3rd Edition

472 pages
English
ePUB (mobile friendly)
Available on iOS and Android

Book information

Gain useful insights from your data using popular data science tools

Key Features

A one-stop guide to Python libraries such as pandas and NumPy
Comprehensive coverage of data science operations such as data cleaning and data manipulation
Choose scalable learning algorithms for your data science tasks

Book Description

Fully expanded and upgraded, the latest edition of Python Data Science Essentials will help you succeed in data science operations using the most common Python libraries. This book offers up-to-date insight into the core of Python, including the latest versions of the Jupyter Notebook, NumPy, pandas, and scikit-learn.

The book covers detailed examples and large hybrid datasets to help you grasp essential statistical techniques for data collection, data munging and analysis, visualization, and reporting activities. You will also gain an understanding of advanced data science topics such as machine learning algorithms, distributed computing, tuning predictive models, and natural language processing. You'll also be introduced to deep learning and gradient boosting solutions such as XGBoost, LightGBM, and CatBoost.

By the end of the book, you will have gained a complete overview of the principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.

What you will learn

Set up your data science toolbox on Windows, Mac, and Linux
Use the core machine learning methods offered by the scikit-learn library
Manipulate, fix, and explore data to solve data science problems
Learn advanced explorative and manipulative techniques to solve data operations
Tune your machine learning models for better performance
Explore and cluster graphs, taking advantage of interconnections and links in your data

Who this book is for

If you're a data science entrant, data analyst, or data engineer, this book will help you get ready to tackle real-world data science problems without wasting any time. Basic knowledge of probability/statistics and Python coding experience will assist you in understanding the concepts covered in this book.

Information

Publisher

Packt Publishing

Year

2018

ISBN

9781789531893

Edition

3rd

Category

Computer Science

Category

Data Processing

Data Munging

We are just getting into the action with data! In this chapter, you'll learn how to munge data. What does data munging mean?

Mung is a technical term that was coined about half a century ago by students at the Massachusetts Institute of Technology (MIT). Munging means to change, in a series of well-specified and reversible steps, a piece of original data into a completely different (and hopefully more useful) one. Deep-rooted in hacker culture, munging is often described in the data science pipeline using other, almost synonymous, terms such as data wrangling or data preparation.

Given such premises, in this chapter, the following topics will be covered:

The data science process (so that you'll know what is going on and what's next)
Uploading data from a file
Selecting the data you need
Cleaning up any missing or wrong data
Adding, inserting, and deleting data
Grouping and transforming data to obtain new and meaningful information
Managing to obtain a dataset matrix or an array to feed into the data science pipeline

The data science process

Although every data science project is different, for our illustrative purposes, we can partition an ideal data science project into a series of reduced and simplified phases.

The process starts by obtaining data (a phase known as data ingestion). Data ingestion implies a series of possible alternatives, from simply uploading data to assembling it from RDBMS or NoSQL repositories, or from synthetically generating it to scraping it from web APIs or HTML pages.

Especially when you face novel challenges, uploading data can prove to be a critical part of a data scientist's work. Your data can arrive from multiple sources: databases, CSV or Excel files, raw HTML, images, sound recordings, APIs providing JavaScript Object Notation (JSON) files, and so on. (If you are unfamiliar with APIs, a good tutorial about using them with Python is available at https://www.dataquest.io/blog/python-api-tutorial/.) Given the wide range of alternatives, we will only briefly touch upon this aspect by offering the basic tools to get your data (even if it is too big) into your computer's memory, using either a text file on your hard disk or the web, or tables in a relational database management system (RDBMS).
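As a minimal sketch of the two loading routes mentioned above (a text file and an RDBMS table), the following uses pandas with an in-memory string and an in-memory SQLite database. The CSV content, the table name people, and the column names are hypothetical stand-ins chosen for illustration; they are not taken from the book.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a text file on disk or on the web.
csv_text = "id,height,weight\n1,1.80,75\n2,1.65,60\n3,1.72,\n"

# pd.read_csv accepts a local path, a URL, or any file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))

# When a file is too big for memory, chunksize makes read_csv yield
# smaller DataFrames one at a time instead of a single large one.
chunk_sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

# An in-memory SQLite database stands in for a full RDBMS here.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("people", conn, index=False)
from_sql = pd.read_sql_query(
    "SELECT id, height FROM people WHERE height > 1.7 ORDER BY id", conn
)
conn.close()

print(from_csv)
print(chunk_sizes)
print(from_sql)
```

Note that the empty weight in the third row is read as a missing value, which is exactly the kind of problem the munging phase has to deal with.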

After you have successfully uploaded your data, the data munging phase begins. Although it is now available in memory, your data will almost certainly be in a form that's unsuitable for analysis and experimentation. Data in the real world is complex, messy, and sometimes even erroneous or missing. Yet, thanks to a handful of basic Python data structures and commands, you'll address all the problematic data and feed it into the next phases of the project, appropriately transformed into a typical dataset that has observations in rows and variables in columns. A dataset is a basic requirement for any statistical and machine learning analysis, and you may hear it referred to as a flat file (when it is the result of joining together multiple relational tables from a database) or a data matrix (when columns and rows are unlabeled and the values it contains are just numeric).
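The distinction between a flat file and a data matrix can be illustrated with a short sketch. The two tables below (customers and orders) and their columns are invented for this example; the sketch only assumes pandas and NumPy are installed.

```python
import pandas as pd

# Two hypothetical relational tables.
customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
orders = pd.DataFrame(
    {
        "order_id": [10, 11, 12],
        "customer_id": [1, 1, 2],
        "amount": [20.0, 35.5, 12.0],
    }
)

# Flat file: one row per observation (an order), one column per variable,
# obtained by joining the relational tables on their shared key.
flat_file = orders.merge(customers, on="customer_id", how="left")

# Data matrix: the numeric values only, with row and column labels dropped.
data_matrix = flat_file[["amount", "age"]].to_numpy()

print(flat_file)
print(data_matrix.shape)
```

The flat file keeps its labels and is convenient for exploration, while the unlabeled NumPy array is the form that most machine learning routines expect as input.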

Though less rewarding than other intellectually stimulating phases (such as the application of algorithms or machine learning), data munging creates the foundations for every complex and sophisticated value-added analysis that you may have in mind. The success of your project heavily relies on it.

Having completely defined the dataset that you'll be working on, a new phase opens up. At this time, you'll start observing your data; then, you will proceed to develop and test your hypotheses in a recurring loop. For instance, you'll explore your variables graphically. With the help of descriptive stats, you'll figure out how to create new variables by putting your domain knowledge into action. You'll address redundant and unexpected information (outliers, first of all) and select the most meaningful variables and effective parameters to be tested by a selection of machine learning algorithms.
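A rough sketch of this exploration step might look like the following. The data is made up, the derived bmi variable is just one example of applying domain knowledge, and the 1.5 × IQR rule is a common convention assumed here for flagging outliers, not necessarily the method the book uses.

```python
import pandas as pd

# Hypothetical observations; the last weight is deliberately implausible.
df = pd.DataFrame(
    {
        "height_m": [1.60, 1.75, 1.82, 1.68, 1.71],
        "weight_kg": [55.0, 72.0, 80.0, 63.0, 250.0],
    }
)

# Descriptive statistics (count, mean, std, quartiles, ...) for each variable.
summary = df.describe()

# A new variable created from domain knowledge: body mass index.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = df["weight_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["weight_kg"] < q1 - 1.5 * iqr) | (df["weight_kg"] > q3 + 1.5 * iqr)

print(summary)
print(df[is_outlier])
```

Whether a flagged value should be corrected, removed, or kept depends on the problem at hand, which is why this step is repeated as hypotheses change.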

This phase is structured as a pipeline, where your data is processed according to a series of steps. After that, a model is finally created, but you may realize that you have to reiterate and start again from data munging or somewhere in the data pipeline, supplying corrections or trying different experiments, until you have reached a meaningful result.

From our experience in the field, we can assure you that no matter how promising your plans were when you started analyzing the data, in the end, your solution will be very different from any idea you first envisioned. The experimental results you obtain dictate the kind of data munging, optimizations, models, and the overall number of iterations you have to go through before reaching a satisfactory end to your project. That is why, if you want to be a successful data scientist, it won't suffice just to provide theoretically sound solutions. You need to be able to prototype a large number of possible solutions quickly in order to ascertain which is the best path to take. It is our purpose to help you accelerate your data science process as much as possible by using the code snippets provided in this book.

A result from your project is represented by an error or optimization measure (that you have chosen carefully in order to represent your business targets). Besides an error measurement, your achievement can also be communicated by an interpretable insight that has to be verbally or visually described to your data science project's sponsors or other data scientists. At this point, being able to visualize results and insights appropriately using tables, charts, and plots is indeed essential.

This process can also be described using the acronym OSEMN (Obtain, Scrub, Explore, Model, iNterpret), as introduced by Hilary Mason and Chris Wiggins in a famous post on the blog dataists (http://www.dataists.com/2010/09/a-taxonomy-of-data-science/), describing a data science taxonomy. OSEMN is also quite memorable since it rhymes with the words possum and awesome.

We won't ever get tired of remarking how everything starts with munging your data and that munging can easily require up to 80% of your efforts in a data project. Since even the longest journey starts with a single step, let's immediately step into this chapter and learn the building blocks of a successful munging phase!

Data loading and preprocessing with pandas

In the previous chapter, we discussed where to find useful datasets and examined the basic import commands of Python packages. In this section, with your toolbox ready, you are about to learn how to structurally load, manipulate, process, and polish data using pandas and NumPy.

Fast and easy data loading

Let's start with a CSV fi...

Table of contents

Title Page
Copyright and Credits
Packt Upsell
Contributors
Preface
First Steps
Data Munging
The Data Pipeline
Machine Learning
Visualization, Insights, and Results
Social Network Analysis
Deep Learning Beyond the Basics
Spark for Big Data
Strengthen Your Python Foundations
Other Books You May Enjoy

Citation styles for Python Data Science Essentials

APA 6 Citation

Boschetti, A., & Massaron, L. (2018). Python Data Science Essentials (3rd ed.). Packt Publishing. Retrieved from https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf (Original work published 2018)

Chicago Citation

Boschetti, Alberto, and Luca Massaron. (2018) 2018. Python Data Science Essentials. 3rd ed. Packt Publishing. https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf.

Harvard Citation

Boschetti, A. and Massaron, L. (2018) Python Data Science Essentials. 3rd edn. Packt Publishing. Available at: https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf (Accessed: 14 October 2022).

MLA 7 Citation

Boschetti, Alberto, and Luca Massaron. Python Data Science Essentials. 3rd ed. Packt Publishing, 2018. Web. 14 Oct. 2022.