Python Data Science Essentials
eBook - ePub

Python Data Science Essentials

A practitioner's guide covering essential data science principles, tools, and techniques, 3rd Edition

Alberto Boschetti, Luca Massaron

  1. 472 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android

About this book

Gain useful insights from your data using popular data science tools

Key Features

  • A one-stop guide to Python libraries such as pandas and NumPy
  • Comprehensive coverage of data science operations such as data cleaning and data manipulation
  • Choose scalable learning algorithms for your data science tasks

Book Description

Fully expanded and upgraded, the latest edition of Python Data Science Essentials will help you succeed in data science operations using the most common Python libraries. This book offers up-to-date insight into the core of Python, including the latest versions of the Jupyter Notebook, NumPy, pandas, and scikit-learn.

The book covers detailed examples and large hybrid datasets to help you grasp essential statistical techniques for data collection, data munging and analysis, visualization, and reporting activities. You will also gain an understanding of advanced data science topics such as machine learning algorithms, distributed computing, tuning predictive models, and natural language processing. Furthermore, you'll be introduced to deep learning and gradient boosting solutions such as XGBoost, LightGBM, and CatBoost.

By the end of the book, you will have gained a complete overview of the principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.

What you will learn

  • Set up your data science toolbox on Windows, Mac, and Linux
  • Use the core machine learning methods offered by the scikit-learn library
  • Manipulate, fix, and explore data to solve data science problems
  • Learn advanced exploratory and manipulation techniques to solve data operations
  • Tune your machine learning models for optimal performance
  • Explore and cluster graphs, taking advantage of interconnections and links in your data

Who this book is for

If you're a data science entrant, data analyst, or data engineer, this book will help you get ready to tackle real-world data science problems without wasting any time. Basic knowledge of probability/statistics and Python coding experience will assist you in understanding the concepts covered in this book.

Information

Year
2018
ISBN
9781789531893

Data Munging

We are just getting into the action with data! In this chapter, you'll learn how to munge data. What does data munging mean?
The term mung is a technical term that was coined about half a century ago by students at the Massachusetts Institute of Technology (MIT). Munging means to change, in a series of well-specified and reversible steps, a piece of original data into a completely different (and hopefully more useful) one. Deep-rooted in hacker culture, munging is often described in the data science pipeline using other, almost synonymous, terms such as data wrangling or data preparation.
Given such premises, in this chapter, the following topics will be covered:
  • The data science process (so that you'll know what is going on and what's next)
  • Uploading data from a file
  • Selecting the data you need
  • Cleaning up any missing or wrong data
  • Adding, inserting, and deleting data
  • Grouping and transforming data to obtain new and meaningful information
  • Managing to obtain a dataset matrix or an array to feed into the data science pipeline

The data science process

Although every data science project is different, for our illustrative purposes, we can partition an ideal data science project into a series of reduced and simplified phases.
The process starts by obtaining data (a phase known as data ingestion). Data ingestion covers a range of alternatives, from simply uploading data to assembling it from relational database management system (RDBMS) or NoSQL repositories, and from generating it synthetically to scraping it from web APIs or HTML pages.
Especially when faced with novel challenges, uploading data can prove to be a critical part of a data scientist's work. Your data can arrive from multiple sources: databases, CSV or Excel files, raw HTML, images, sound recordings, APIs providing JavaScript Object Notation (JSON) files, and so on (if you are unsure what an API is, you can read a good tutorial about APIs with Python here: https://www.dataquest.io/blog/python-api-tutorial/). Given the wide range of alternatives, we will only briefly touch upon this aspect by offering the basic tools to get your data (even if it is too big) into your computer memory by using either a text file that's present on your hard disk or the web, or tables in an RDBMS.
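As a rough illustration of ingesting data from a text file and from a relational database, here is a minimal sketch that uses only Python's standard library (the chapter itself goes on to use pandas for this). The CSV content, the flowers table, and the read_in_chunks helper are all hypothetical names made up for this example; an in-memory SQLite database stands in for a real RDBMS. Reading in small chunks is one way to handle a file that is too big to load in one go.

```python
import csv
import io
import sqlite3
from itertools import islice

# Hypothetical CSV content standing in for a file on disk or on the web.
CSV_TEXT = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,3.0,setosa
6.3,3.3,virginica
5.8,2.7,virginica
7.0,3.2,versicolor
"""


def read_in_chunks(text_stream, chunk_size):
    """Yield lists of at most chunk_size rows (as dicts) from a CSV stream."""
    reader = csv.DictReader(text_stream)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# Only one chunk is held in memory at a time.
chunk_sizes = [len(chunk) for chunk in read_in_chunks(io.StringIO(CSV_TEXT), 2)]

# An in-memory SQLite database stands in for a real RDBMS (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flowers (sepal_length REAL, sepal_width REAL, species TEXT)"
)
for chunk in read_in_chunks(io.StringIO(CSV_TEXT), 2):
    conn.executemany(
        "INSERT INTO flowers VALUES (:sepal_length, :sepal_width, :species)",
        chunk,
    )
conn.commit()

# Pull an aggregated view of the table back into Python.
rows = conn.execute(
    "SELECT species, COUNT(*) FROM flowers GROUP BY species ORDER BY species"
).fetchall()
conn.close()
```

With a real file, io.StringIO(CSV_TEXT) would be replaced by an open file handle, and the connection would point at your actual database.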
After successfully uploading your data comes the data munging phase. Although now available in memory, your data will almost certainly be in a form that's unsuitable for any analysis and experimentation. Data in the real world is complex, messy, and sometimes even erroneous or missing. Yet, thanks to a handful of basic Python data structures and commands, you'll address all the problematic data and feed it into the next phases of the project, appropriately transformed into a typical dataset that has observations in rows and variables in columns. A dataset is a basic requirement for any statistical and machine learning analysis, and you may hear it referred to as a flat file (when it is the result of joining together multiple relational tables from a database) or a data matrix (when columns and rows are unlabeled and the values it contains are just numeric).
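The difference between a flat file and a data matrix can be sketched with pandas, assuming it is installed. The two tables and their column names (customer_id, age, total_spent) are invented for illustration: joining them yields a flat file with one observation per row, and dropping the labels yields a purely numeric data matrix.

```python
import pandas as pd

# Two hypothetical relational tables sharing a key column.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 27]})
orders = pd.DataFrame(
    {"customer_id": [1, 2, 3], "total_spent": [120.5, 80.0, 42.25]}
)

# Flat file: one row per observation, one column per variable.
flat = customers.merge(orders, on="customer_id", how="inner")

# Data matrix: unlabeled, purely numeric values as a NumPy array.
matrix = flat[["age", "total_spent"]].to_numpy(dtype=float)
```

Here flat keeps its column labels, which is convenient for exploration, while matrix is the kind of array that most learning algorithms expect as input.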
Though less rewarding than other intellectually stimulating phases (such as the application of algorithms or machine learning), data munging lays the foundations for every complex and sophisticated value-added analysis that you may have in mind. The success of your project relies heavily on it.
Once you have completely defined the dataset that you'll be working on, a new phase opens up. At this time, you'll start observing your data; then, you will proceed to develop and test your hypotheses in a recurring loop. For instance, you'll explore your variables graphically. With the help of descriptive stats, you'll figure out how to create new variables by putting your domain knowledge into action. You'll address redundant and unexpected information (outliers, first of all) and select the most meaningful variables and effective parameters to be tested by a selection of machine learning algorithms.
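To make the idea of descriptive stats and outliers concrete, here is a small sketch using Python's standard statistics module. The measurements are made up, and the 1.5 × IQR rule used to flag outliers is one common convention chosen for this example, not necessarily the approach the book takes later.

```python
import statistics

# Hypothetical measurements of a single variable; one value looks suspicious.
values = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 25.0, 12.2]

# Basic descriptive statistics.
mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles and interquartile range (IQR).
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: flag anything beyond 1.5 * IQR from the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
```

Notice that the mean is pulled well above the median by the single extreme value, which is itself a hint that something unexpected is in the data.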
This phase is structured as a pipeline, where your data is processed according to a series of steps. After that, a model is finally created, but you may realize that you have to reiterate and start again from data munging or somewhere in the data pipeline, supplying corrections or trying different experiments, until you have reached a meaningful result.
From our experience in the field, we can assure you that no matter how promising your plans were when starting to analyze the data, in the end, your solution will be much different from any first envisioned idea. The experimental results you obtain will dictate the kind of data munging, optimizations, models, and the overall number of iterations you have to go through before reaching a satisfactory end to your project. That is why, if you want to be a successful data scientist, it won't suffice just to provide theoretically sound solutions. It is necessary to be able to prototype a large number of possible solutions in the shortest time in order to ascertain which is the best path to take. It is our purpose to help you accelerate as much as possible by using the code snippets provided in this book in your data science process.
The result of your project is expressed by an error or optimization measure (which you have chosen carefully in order to represent your business targets). Besides an error measurement, your achievement can also be communicated by an interpretable insight that has to be verbally or visually described to your data science project's sponsors or other data scientists. At this point, being able to visualize results and insights appropriately using tables, charts, and plots is indeed essential.
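As a worked example of an error measure, the sketch below computes two widely used ones, the mean absolute error (MAE) and the root mean squared error (RMSE), on made-up targets and predictions. Which measure best represents your business targets is a project-specific choice; these two are picked here only for illustration.

```python
import math

# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Per-observation errors (prediction minus truth).
errors = [p - t for p, t in zip(y_pred, y_true)]

# Mean absolute error: average size of the errors.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalizes large errors more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
```

Because squaring emphasizes large deviations, the RMSE is never smaller than the MAE on the same data; a wide gap between the two suggests a few large errors dominate.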
This process can also be described using the acronym OSEMN (Obtain, Scrub, Explore, Model, iNterpret), as introduced by Hilary Mason and Chris Wiggins in a famous post on the blog dataists (http://www.dataists.com/2010/09/a-taxonomy-of-data-science/), describing a data science taxonomy. OSEMN is also quite memorable since it rhymes with the words possum and awesome.
We won't ever get tired of remarking how everything starts with munging your data and that munging can easily require up to 80% of your efforts in a data project. Since even the longest journey starts with a single step, let's immediately step into this chapter and learn the building blocks of a successful munging phase!

Data loading and preprocessing with pandas

In the previous chapter, we discussed where to find useful datasets and examined the basic import commands of Python packages. In this section, with your toolbox ready, you are about to learn how to load, manipulate, process, and polish data in a structured way using pandas and NumPy.

Fast and easy data loading

Let's start with a CSV fi...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Packt Upsell
  4. Contributors
  5. Preface
  6. First Steps
  7. Data Munging
  8. The Data Pipeline
  9. Machine Learning
  10. Visualization, Insights, and Results
  11. Social Network Analysis
  12. Deep Learning Beyond the Basics
  13. Spark for Big Data
  14. Strengthen Your Python Foundations
  15. Other Books You May Enjoy
Citation styles for Python Data Science Essentials

APA 6 Citation

Boschetti, A., & Massaron, L. (2018). Python Data Science Essentials (3rd ed.). Packt Publishing. Retrieved from https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf (Original work published 2018)

Chicago Citation

Boschetti, Alberto, and Luca Massaron. (2018) 2018. Python Data Science Essentials. 3rd ed. Packt Publishing. https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf.

Harvard Citation

Boschetti, A. and Massaron, L. (2018) Python Data Science Essentials. 3rd edn. Packt Publishing. Available at: https://www.perlego.com/book/825756/python-data-science-essentials-a-practitioners-guide-covering-essential-data-science-principles-tools-and-techniques-3rd-edition-pdf (Accessed: 14 October 2022).

MLA 7 Citation

Boschetti, Alberto, and Luca Massaron. Python Data Science Essentials. 3rd ed. Packt Publishing, 2018. Web. 14 Oct. 2022.