Python Data Science Essentials
eBook - ePub

Python Data Science Essentials

A practitioner's guide covering essential data science principles, tools, and techniques, 3rd Edition

Alberto Boschetti, Luca Massaron

  1. 472 pages
  2. English
  3. ePUB (available on the app)
  4. Available on iOS and Android

About the book

Gain useful insights from your data using popular data science tools

Key Features

  • A one-stop guide to Python libraries such as pandas and NumPy
  • Comprehensive coverage of data science operations such as data cleaning and data manipulation
  • Choose scalable learning algorithms for your data science tasks

Book Description

Fully expanded and upgraded, the latest edition of Python Data Science Essentials will help you succeed in data science operations using the most common Python libraries. This book offers up-to-date insight into the core of Python, including the latest versions of the Jupyter Notebook, NumPy, pandas, and scikit-learn.

The book covers detailed examples and large hybrid datasets to help you grasp essential statistical techniques for data collection, data munging and analysis, visualization, and reporting activities. You will also gain an understanding of advanced data science topics such as machine learning algorithms, distributed computing, tuning predictive models, and natural language processing. Furthermore, you'll be introduced to deep learning and gradient boosting solutions such as XGBoost, LightGBM, and CatBoost.

By the end of the book, you will have gained a complete overview of the principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.

What you will learn

  • Set up your data science toolbox on Windows, Mac, and Linux
  • Use the core machine learning methods offered by the scikit-learn library
  • Manipulate, fix, and explore data to solve data science problems
  • Learn advanced explorative and manipulative techniques to solve data operations
  • Optimize your machine learning models for better performance
  • Explore and cluster graphs, taking advantage of interconnections and links in your data

Who this book is for

If you're a data science entrant, data analyst, or data engineer, this book will help you get ready to tackle real-world data science problems without wasting any time. Basic knowledge of probability/statistics and Python coding experience will assist you in understanding the concepts covered in this book.

Information

Year
2018
ISBN
9781789531893
Edition
3
Topic
Computer Science

Data Munging

We are just getting into the action with data! In this chapter, you'll learn how to munge data. What does data munging mean?
The term mung is a technical term that was coined about half a century ago by students at the Massachusetts Institute of Technology (MIT). Munging means changing, in a series of well-specified and reversible steps, a piece of original data into a completely different (and hopefully more useful) one. Deep-rooted in hacker culture, munging is often described in the data science pipeline using other, almost synonymous, terms such as data wrangling or data preparation.
Given such premises, in this chapter, the following topics will be covered:
  • The data science process (so that you'll know what is going on and what's next)
  • Uploading data from a file
  • Selecting the data you need
  • Cleaning up any missing or wrong data
  • Adding, inserting, and deleting data
  • Grouping and transforming data to obtain new and meaningful information
  • Managing to obtain a dataset matrix or an array to feed into the data science pipeline

The data science process

Although every data science project is different, for our illustrative purposes, we can partition an ideal data science project into a series of reduced and simplified phases.
The process starts by obtaining data (a phase known as data ingestion). Data ingestion implies a series of possible alternatives, from simply uploading data to assembling it from RDBMS or NoSQL repositories, or from synthetically generating it to scraping it from web APIs or HTML pages.
Especially when faced with novel challenges, uploading data can turn out to be a critical part of a data scientist's work. Your data can arrive from multiple sources: databases, CSV or Excel files, raw HTML, images, sound recordings, APIs providing JavaScript Object Notation (JSON) files, and so on (if you are unsure what an API is, you can read a good tutorial about APIs with Python here: https://www.dataquest.io/blog/python-api-tutorial/). Given the wide range of alternatives, we will just briefly touch upon this aspect by offering the basic tools to get your data (even if it is too big) into your computer memory by using either a textual file that's present on your hard disk or the web, or tables in a relational database management system (RDBMS).
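As a rough illustration of the ingestion alternatives just listed, the following sketch reads the same kind of record from CSV text, from a JSON payload, and from a relational table. It is only a minimal example under stated assumptions: it relies on the Python standard library alone, the sample data, the readings table, and the rows_from_* helper names are all hypothetical, and an in-memory SQLite database stands in for a real RDBMS.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data; in practice these would be a file on disk,
# an API response body, and an existing database.
CSV_TEXT = "id,city,temp_c\n1,Rome,21.5\n2,Milan,18.0\n"
JSON_TEXT = '[{"id": 3, "city": "Turin", "temp_c": 16.2}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts (all values arrive as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of objects, as many web APIs return."""
    return json.loads(text)


def rows_from_sqlite(connection, query):
    """Run a query and return each row as a dict keyed by column name."""
    connection.row_factory = sqlite3.Row
    return [dict(row) for row in connection.execute(query)]


# An in-memory SQLite database stands in for a real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, city TEXT, temp_c REAL)")
conn.execute("INSERT INTO readings VALUES (4, 'Naples', 24.1)")

csv_rows = rows_from_csv(CSV_TEXT)
json_rows = rows_from_json(JSON_TEXT)
db_rows = rows_from_sqlite(conn, "SELECT id, city, temp_c FROM readings")
conn.close()

all_rows = csv_rows + json_rows + db_rows
print(len(all_rows), "rows ingested")
```

Note that the CSV values come back as strings while the JSON and SQLite values keep their numeric types; reconciling such differences is exactly the kind of work the munging phase deals with.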
After successfully uploading your data comes the data munging phase. Although now available in memory, your data will almost certainly be in a form that's unsuitable for analysis and experimentation. Data in the real world is complex, messy, and sometimes even erroneous or missing. Yet, thanks to a handful of basic Python data structures and commands, you'll address all the problematic data and feed it into the next phases of the project, appropriately transformed into a typical dataset that has observations in rows and variables in columns. A dataset is a basic requirement for any statistical and machine learning analysis, and you may hear it referred to as a flat file (when it is the result of joining together multiple relational tables from a database) or a data matrix (when columns and rows are unlabeled and the values it contains are just numeric).
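The difference between a flat file and a data matrix can be sketched with pandas, one of the libraries this chapter relies on. The two tables, their column names, and their values below are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical relational tables.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 27]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 3],
    "amount": [20.0, 35.5, 12.0, 50.0],
})

# Flat file: one row per order (the observation), with the customer
# attributes joined in as additional columns (the variables).
flat = orders.merge(customers, on="customer_id", how="left")

# Data matrix: the numeric values only, without row or column labels.
matrix = flat[["amount", "age"]].to_numpy(dtype=float)

print(flat)
print(matrix.shape)
```

Here flat keeps its labels and mixed column types, whereas matrix is a plain two-dimensional NumPy array of floats.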
Though less rewarding than other intellectually stimulating phases (such as the application of algorithms or machine learning), data munging creates the foundations for every complex and sophisticated value-added analysis that you may have in mind to obtain. The success of your project heavily relies on it.
Having completely defined the dataset that you'll be working on, a new phase opens up. At this time, you'll start observing your data; then, you will proceed to develop and test your hypothesis in a recurring loop. For instance, you'll explore your variables graphically. With the help of descriptive stats, you'll figure out how to create new variables by putting your domain knowledge into action. You'll address redundant and unexpected information (outliers, first of all) and select the most meaningful variables and effective parameters to be tested by a selection of machine learning algorithms.
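A small sketch of this exploratory step follows. It uses only the standard library's statistics module to stay self-contained, the delivery_days values and the 14-day target are hypothetical, and the 1.5 * IQR fences are just one common rule of thumb for flagging outliers, not a method prescribed by the text.

```python
import statistics

# Hypothetical measurements; the last value is deliberately suspicious.
delivery_days = [12.0, 13.5, 11.8, 12.9, 13.1, 12.4, 45.0]

# Descriptive statistics: the mean is dragged upward by the extreme value,
# while the median is not.
mean = statistics.mean(delivery_days)
median = statistics.median(delivery_days)

# Quartiles and Tukey fences (1.5 * IQR), one common rule of thumb.
q1, _, q3 = statistics.quantiles(delivery_days, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in delivery_days if x < lower or x > upper]

# A new variable derived from domain knowledge (hypothetical 14-day target).
is_late = [x > 14.0 for x in delivery_days]

print(f"mean={mean:.2f} median={median:.2f} outliers={outliers}")
```

Comparing the mean with the median is a quick way to spot that something unusual is pulling the distribution, before deciding whether the flagged value is an error or a genuine observation.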
This phase is structured as a pipeline, where your data is processed according to a series of steps. After that, a model is finally created, but you may realize that you have to reiterate and start again from data munging or somewhere in the data pipeline, supplying corrections or trying different experiments, until you have reached a meaningful result.
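The idea of a pipeline as an ordered series of steps can be sketched in plain Python. The three steps and the run_pipeline helper below are hypothetical names chosen for illustration, and the scaling step assumes the data contains at least two distinct values.

```python
from functools import reduce


def drop_missing(values):
    """Remove missing entries (represented here as None)."""
    return [v for v in values if v is not None]


def clip_upper(values, upper=10.0):
    """Cap extreme values at a hypothetical upper bound."""
    return [min(v, upper) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range.

    Assumes at least two distinct values, otherwise the range is zero.
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, data)


raw = [2.0, None, 4.0, 100.0, 6.0]
steps = [drop_missing, clip_upper, min_max_scale]
prepared = run_pipeline(raw, steps)
print(prepared)
```

Because the steps are just a list, reiterating with a correction or a different experiment amounts to editing that list and running the pipeline again.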
From our experience in the field, we can assure you that no matter how promising your plans were when you started analyzing the data, in the end, your solution will be very different from any idea you first envisioned. The experimental results you obtain dictate the kind of data munging, optimizations, models, and the overall number of iterations you have to go through before reaching a satisfactory end to your project. That is why, if you want to be a successful data scientist, it won't suffice just to provide theoretically sound solutions. You need to be able to prototype a large number of possible solutions quickly in order to ascertain which is the best path to take. Our purpose is to help you accelerate as much as possible by using the code snippets provided in this book in your data science process.
A result from your project is represented by an error or optimization measure (that you have chosen carefully in order to represent your business targets). Besides an error measurement, your achievement can also be communicated by an interpretable insight that has to be verbally or visually described to your data science project's sponsors or other data scientists. At this point, being able to visualize results and insights appropriately using tables, charts, and plots is indeed essential.
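To make the notion of an error measure concrete, here is a minimal sketch that computes two widely used ones, the mean absolute error (MAE) and the root mean squared error (RMSE). The observed and predicted values are hypothetical, and which measure best represents a business target depends on the project.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average size of the errors, in the same unit as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Like MAE, but large errors are penalized more heavily."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical observed values and model predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```

On this data, RMSE comes out larger than MAE because the single error of 2 weighs more once squared.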
This process can also be described using the acronym OSEMN (Obtain, Scrub, Explore, Model, iNterpret), as introduced by Hilary Mason and Chris Wiggins in a famous post on the blog dataists (http://www.dataists.com/2010/09/a-taxonomy-of-data-science/), describing a data science taxonomy. OSEMN is also quite memorable since it rhymes with the words possum and awesome.
We won't ever get tired of remarking how everything starts with munging your data and that munging can easily require up to 80% of your efforts in a data project. Since even the longest journey starts with a single step, let's immediately step into this chapter and learn the building blocks of a successful munging phase!

Data loading and preprocessing with pandas

In the previous chapter, we discussed where to find useful datasets and examined the basic import commands of Python packages. In this section, having kept your toolbox ready, you are about to learn how to structurally load, manipulate, process, and polish data using pandas and NumPy.

Fast and easy data loading

Let's start with a CSV fi...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Packt Upsell
  4. Contributors
  5. Preface
  6. First Steps
  7. Data Munging
  8. The Data Pipeline
  9. Machine Learning
  10. Visualization, Insights, and Results
  11. Social Network Analysis
  12. Deep Learning Beyond the Basics
  13. Spark for Big Data
  14. Strengthen Your Python Foundations
  15. Other Books You May Enjoy
Citation styles for Python Data Science Essentials

APA 6 Citation

Boschetti, A., & Massaron, L. (2018). Python Data Science Essentials (3rd ed.). Packt Publishing. Retrieved from https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf (Original work published 2018)

Chicago Citation

Boschetti, Alberto, and Luca Massaron. (2018) 2018. Python Data Science Essentials. 3rd ed. Packt Publishing. https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf.

Harvard Citation

Boschetti, A. and Massaron, L. (2018) Python Data Science Essentials. 3rd edn. Packt Publishing. Available at: https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf (Accessed: 14 October 2022).

MLA 7 Citation

Boschetti, Alberto, and Luca Massaron. Python Data Science Essentials. 3rd ed. Packt Publishing, 2018. Web. 14 Oct. 2022.