Reproducible Data Science with Pachyderm
by Svetlana Karslioglu
eBook - ePub

  1. 364 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
About This Book

Create scalable and reliable data pipelines easily with Pachyderm.

Key Features
  • Learn how to build an enterprise-level reproducible data science platform with Pachyderm
  • Deploy Pachyderm on cloud platforms such as AWS EKS, Google Kubernetes Engine, and Microsoft Azure Kubernetes Service
  • Integrate Pachyderm with other data science tools, such as Pachyderm Notebooks

Book Description
Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale.
You'll begin your journey by exploring the importance of data reproducibility and comparing different data science platforms. Next, you'll explore how Pachyderm fits into the picture and its significance, followed by learning how to install Pachyderm locally on your computer or on a cloud platform of your choice. You'll then discover the architectural components and Pachyderm's main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and advances to cover common operations involving data, such as uploading data to and from Pachyderm, to create more complex pipelines. Based on what you've learned, you'll develop an end-to-end ML workflow, before trying out the hyperparameter tuning technique and the different supported Pachyderm language clients. Finally, you'll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks.
By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and managing them on a day-to-day basis.

What you will learn
  • Understand the importance of reproducible data science for enterprise
  • Explore the basics of Pachyderm, such as commits and branches
  • Upload data to and from Pachyderm
  • Implement common pipeline operations in Pachyderm
  • Create a real-life example of hyperparameter tuning in Pachyderm
  • Combine Pachyderm with Pachyderm language clients in Python and Go

Who this book is for
This book is for new as well as experienced data scientists and machine learning engineers who want to build scalable infrastructures for their data science projects. Basic knowledge of Python programming and Kubernetes will be beneficial. Familiarity with Golang will be helpful.


Section 1: Introduction to Pachyderm and Reproducible Data Science

This section introduces the basics of Pachyderm and describes the importance of data reproducibility for an enterprise-level data science platform. You will learn what the main pillars of the Pachyderm solution are, including repositories, datums, jobs, and the most important of them all – the pipeline (a minimal pipeline specification sketch follows the chapter list below). This section also briefly discusses the ethics of AI in terms of reproducibility.
This section comprises the following chapters:
  • Chapter 1, The Problem of Data Reproducibility
  • Chapter 2, Pachyderm Basics
  • Chapter 3, Pachyderm Pipeline Specification
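For a flavor of what is ahead, here is a minimal sketch of a Pachyderm pipeline specification, the JSON document covered in depth in Chapter 3. The repository name, container image, and command below are illustrative placeholders, not an example from the book:

  {
    "pipeline": { "name": "edges" },
    "transform": {
      "image": "pachyderm/opencv",
      "cmd": ["python3", "/edges.py"]
    },
    "input": {
      "pfs": { "repo": "images", "glob": "/*" }
    }
  }

A spec like this is submitted with pachctl create pipeline -f edges.json; Pachyderm then runs the transform container over every datum matched by the glob pattern and versions the output in a repository named after the pipeline.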

Chapter 1: The Problem of Data Reproducibility

Today, machine learning algorithms are used everywhere. They are integrated into our day-to-day lives, and we use them without noticing. While we are rushing to work, planning a vacation, or visiting a doctor's office, the models are at work, at times making important decisions about us. If we are unsure what the model is doing and how it makes decisions, how can we be sure that its decisions are fair and just? Pachyderm profoundly cares about the reproducibility of data science experiments and puts data lineage, reproducibility, and version control at its core. But before we proceed, let's discuss why reproducibility is so important.
This chapter explains the concepts of reproducibility, ethical Artificial Intelligence (AI), and Machine Learning Operations (MLOps), as well as providing an overview of the existing data science platforms and how they compare to Pachyderm.
In this chapter, we're going to cover the following main topics:
  • Why is reproducibility important?
  • The reproducibility crisis in science
  • Demystifying MLOps
  • Types of data science platforms
  • Explaining ethical AI

Why is reproducibility important?

First of all, let's define AI, ML, and data science.
Data science is a field of study that involves collecting and preparing large amounts of data to extract knowledge and produce insights.
AI is more of an umbrella term for technology that enables machines to mimic the behavior of human beings. Machine learning is a subset of AI that is based on the idea that an algorithm can learn based on past experiences.
Now, let's define reproducibility. A data science experiment is considered reproducible if other data scientists can repeat it with a comparable outcome on a similar dataset and problem. Although reproducibility has been a pillar of scientific research for decades, it has only recently become an important topic in data science.
Not only is a reproducible experiment more likely to be free of errors, but it also allows others to build on top of it, contributing to knowledge transfer and speeding up future discoveries.
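To make this concrete, a common first step toward a repeatable experiment is pinning every source of randomness. The following minimal Python sketch is our illustration, not an example from the book; the dataset and model are arbitrary stand-ins:

  import random

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  SEED = 42
  random.seed(SEED)     # Python's built-in RNG
  np.random.seed(SEED)  # NumPy's global RNG

  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=SEED  # deterministic split
  )

  model = RandomForestClassifier(n_estimators=100, random_state=SEED)
  model.fit(X_train, y_train)
  print(f"Accuracy: {model.score(X_test, y_test):.3f}")  # identical on every run

Pinning seeds addresses only the code side of reproducibility; reproducing the exact data an experiment saw is the harder problem, and it is the one Pachyderm's version control is designed to solve.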
It's not a secret that data science has become one of the hottest topics of the last 10 years. Many big tech companies have opened dozens of high-paying data scientist, data engineering, and data analyst positions, and the demand to join the profession has been rising exponentially. According to the AI Index 2019 Annual Report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the number of AI papers has grown threefold in the last 20 years. You can read more about this report on the Stanford University HAI website: https://hai.stanford.edu/blog/introducing-ai-index-2019-report.
Figure 1.1 – AI publications trend, from the AI Index 2019 Annual Report (p. 5)
Almost every learning platform and university now offers a data science or AI program, and these programs never lack students. Thousands of people of all backgrounds, from software developers to CEOs, take ML classes to keep up with the rapidly growing industry.
The number of AI conferences has been steadily growing as well. Even during the pandemic, when in-person events became impossible, the AI community continued to meet in a virtual format. Flagship conferences such as Neural Information Processing Systems (NeurIPS) and the International Conference on Machine Learning (ICML), which typically attract more than 10,000 visitors, took place online with significant attendance.
According to some predictions, the AI market size will increase to more than $350 billion by 2025. The market grew from $12 billion to $58 billion from 2020 to 2021 alone. The Silicon Valley tech giants are fiercely battling to achieve dominance in the space, while smaller players emerge to get their share of the market. The number of AI start-ups worldwide is steadily growing, with billions being invested in them each year.
The following graph shows the growth of AI-related start-ups in recent years:
Figure 1.2 – Total private investment in AI-related start-ups worldwide, from the AI Index 2019 Annual Report (p. 88)
The total private investment in AI start-ups grew by more than 30 times in the last 10 years.
And another interesting metric from the same source is the number of AI patents published between 2015 and 2018:
Figure 1.3 – Total number of AI patents (2015-2018), from the AI Index 2019 Annual Report (p. 32)
The United States leads all other countries in the number of published AI patents.
These trends boost the economy and industry, but they inevitably affect the quality of submitted AI papers, processes, practices, and experiments. That's why a proper process is needed to validate data science models, and the replication of experiments is an important part of that quality control.
Next, let's learn what a model is.

What is a model?

A data science or AI model is a simplified representation of a process that also suggests possible results. Whether it is a weather-prediction algorithm or a website attendance calculator, a model provides the most probable outcome and helps us make informed decisions. When data scientists create a model, they must decide which critical parameters to include, because they cannot include everything; a model is therefore a simplified version of a process. This is where sacrifices are made, based on the data scientist's or organization's definition of success.
The following diagram demonstrates a data model:
Figure 1.4 – Data science model
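To see how much gets simplified away, consider a toy illustration (ours, not the book's): a linear regression collapses a whole process into just two parameters, a slope and an intercept.

  import numpy as np
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(0)
  hours_studied = rng.uniform(0, 10, size=(50, 1))  # hypothetical input
  # The "real" process: a base score, a per-hour gain, and noise
  exam_score = 50 + 4 * hours_studied + rng.normal(0, 5, size=(50, 1))

  model = LinearRegression().fit(hours_studied, exam_score)
  # The entire process is now summarized by two numbers:
  print(model.coef_[0][0], model.intercept_[0])  # roughly 4 and 50

Everything else about the students – sleep, prior knowledge, test anxiety – has been sacrificed to keep the model tractable, which is exactly the kind of trade-off described above.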
Every model needs a continuous data flow to improve and perform correctly. Consider Amazon Go stores, where shoppers' behavior is analyzed by multiple cameras inside the store. The models that ensure safety in the store are trained continuously on real-life customer behavior. These models had to learn that a shopper might pick up an item and then change their mind and put it back, or drop an item on the floor and damage the product, and so on. The Amazon Go store model is likely good because it has access to a lot of real data and improves over time. However, not all models have access to real data, and that's when a synthetic dataset can be used.
A synthetic dataset is a dataset that was generated artificially by a computer. The problem with synthetic data is that it is only as good as the algorithm that generated it. Often, such data misrepresents the real world. In some cases, such as when users' privacy prevents data scientists from using real data, usage of a synthetic dataset is justified; in other cases, it can lead to negative results.
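As a minimal sketch of what "generated artificially by a computer" can mean in practice, scikit-learn can fabricate a labeled dataset from scratch; every parameter below is illustrative, not from the book:

  from sklearn.datasets import make_classification

  X, y = make_classification(
      n_samples=1_000,   # number of artificial records
      n_features=20,     # columns per record
      n_informative=5,   # features that actually carry signal
      n_redundant=2,     # linear combinations of informative features
      random_state=42,   # reproducible generation
  )
  print(X.shape, y.shape)  # (1000, 20) (1000,)

The generator's assumptions – here, clean clusters with Gaussian structure – are baked into every record, which is why a model trained only on such data can misrepresent the real world.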
IBM's Watson was an ambitious project that promised to revolutionize healthcare by diagnosing patients from a provided list of symptoms in a matter of seconds. Such an invention could greatly speed up the diagnosis process, and in parts of the world where people have no access to healthcare, a system like that could save many lives. Unfortunately, despite the original promise to replace doctors, Watson ended up as a recommendation system that can assist in diagnosing, but nothing more than that. One of the reasons is that Watson was trained on a synthetic dataset and not on real data.
There are cases where detecting issues in a trained model can be especially difficult. Take the example of an image recognition algorithm developed at the University of Washington that was built to identify whether an image portrayed a husky or a wolf. The model was seemingly working really well, predicting the corre...

Table of contents

  1. Reproducible Data Science with Pachyderm
  2. Contributors
  3. Preface
  4. Section 1: Introduction to Pachyderm and Reproducible Data Science
  5. Chapter 1: The Problem of Data Reproducibility
  6. Chapter 2: Pachyderm Basics
  7. Chapter 3: Pachyderm Pipeline Specification
  8. Section 2: Getting Started with Pachyderm
  9. Chapter 4: Installing Pachyderm Locally
  10. Chapter 5: Installing Pachyderm on a Cloud Platform
  11. Chapter 6: Creating Your First Pipeline
  12. Chapter 7: Pachyderm Operations
  13. Chapter 8: Creating an End-to-End Machine Learning Workflow
  14. Chapter 9: Distributed Hyperparameter Tuning with Pachyderm
  15. Section 3: Pachyderm Clients and Tools
  16. Chapter 10: Pachyderm Language Clients
  17. Chapter 11: Using Pachyderm Notebooks
  18. Other Books You May Enjoy