Section 1:
Introduction to Pachyderm and Reproducible Data Science
This section introduces the basics of Pachyderm and describes why data reproducibility matters for an enterprise-level data science platform. You will learn about the main pillars of the Pachyderm solution, including repositories, datums, jobs, and the most important of them all: the pipeline. It also briefly touches on the ethics of AI in relation to reproducibility.
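To give a taste of how these pillars fit together before Chapter 3 covers the specification in detail, here is a hypothetical Pachyderm pipeline spec; the repository, image, and command names are invented for illustration only:

```json
{
  "pipeline": {
    "name": "word-count"
  },
  "input": {
    "pfs": {
      "repo": "raw-text",
      "glob": "/*"
    }
  },
  "transform": {
    "image": "example/word-count:1.0",
    "cmd": ["python3", "/count.py", "/pfs/raw-text", "/pfs/out"]
  }
}
```

Each file in the `raw-text` repository that matches the glob pattern becomes a datum; Pachyderm runs the transform against the datums as a job and versions the output, which is what makes the pipeline's results reproducible.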
This section comprises the following chapters:
- Chapter 1, The Problem of Data Reproducibility
- Chapter 2, Pachyderm Basics
- Chapter 3, Pachyderm Pipeline Specification
Chapter 1: The Problem of Data Reproducibility
Today, machine learning algorithms are used everywhere. They are integrated into our day-to-day lives, and we use them without noticing. While we are rushing to work, planning a vacation, or visiting a doctor's office, the models are at work, at times making important decisions about us. If we are unsure what the model is doing and how it makes decisions, how can we be sure that its decisions are fair and just? Pachyderm profoundly cares about the reproducibility of data science experiments and puts data lineage, reproducibility, and version control at its core. But before we proceed, let's discuss why reproducibility is so important.
This chapter explains the concepts of reproducibility, ethical Artificial Intelligence (AI), and Machine Learning Operations (MLOps), as well as providing an overview of the existing data science platforms and how they compare to Pachyderm.
In this chapter, we're going to cover the following main topics:
- Why is reproducibility important?
- The reproducibility crisis in science
- Demystifying MLOps
- Types of data science platforms
- Explaining ethical AI
Why is reproducibility important?
First of all, let's define data science, AI, and ML.
Data science is a field of study that involves collecting and preparing large amounts of data to extract knowledge and produce insights.
AI is more of an umbrella term for technology that enables machines to mimic the behavior of human beings. Machine Learning (ML) is a subset of AI based on the idea that an algorithm can learn from past experience.
Now, let's define reproducibility. A data science experiment is considered reproducible if other data scientists can repeat it with a comparable outcome on a similar dataset and problem. And although reproducibility has been a pillar of scientific research for decades, it has only recently become an important topic in the data science scope.
Not only is a reproducible experiment more likely to be free of errors, but it also takes the experiment further and allows others to build on top of it, contributing to knowledge transfer and speeding up future discoveries.
It's no secret that data science has become one of the hottest fields of the last 10 years. Many big tech companies have opened dozens of high-paying data scientist, data engineer, and data analyst positions. With that, demand to join the profession has risen sharply. According to the AI Index 2019 Annual Report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the number of AI papers has grown threefold in the last 20 years. You can read more about this report on the Stanford University HAI website: https://hai.stanford.edu/blog/introducing-ai-index-2019-report.
Figure 1.1 – AI publications trend, from the AI Index 2019 Annual Report (p. 5)
Almost every learning platform and university now offers a data science or AI program, and these programs never lack students. Thousands of people of all backgrounds, from software developers to CEOs, take ML classes to keep up with the rapidly growing industry.
The number of AI conferences has been steadily growing as well. Even in the pandemic world, where in-person events have become impossible, the AI community has continued to meet in a virtual format. Flagship conferences such as Neural Information Processing Systems (NeurIPS) and the International Conference on Machine Learning (ICML), which typically attract more than 10,000 visitors each, took place online with significant attendance.
According to some predictions, the AI market size will increase to more than $350 billion by 2025. The market grew from $12 billion to $58 billion from 2020 to 2021 alone. The Silicon Valley tech giants are fiercely battling to achieve dominance in the space, while smaller players emerge to get their share of the market. The number of AI start-ups worldwide is steadily growing, with billions being invested in them each year.
The following graph shows the growth of AI-related start-ups in recent years:
Figure 1.2 – Total private investment in AI-related start-ups worldwide, from the AI Index 2019 Annual Report (p. 88)
The total private investment in AI start-ups grew by more than 30 times in the last 10 years.
And another interesting metric from the same source is the number of AI patents published between 2015 and 2018:
Figure 1.3 – Total number of AI patents (2015-2018), from the AI Index 2019 Annual Report (p. 32)
The United States is leading in the number of published patents among other countries.
These trends boost the economy and industry but inevitably affect the quality of submitted AI papers, processes, practices, and experiments. That's why a proper process is needed to ensure the validation of data science models. The replication of experiments is an important part of a data science model's quality control.
Next, let's learn what a model is.
What is a model?
A data science or AI model is a simplified representation of a process that also suggests probable outcomes. Whether it is a weather-prediction algorithm or a website-attendance calculator, a model provides the most probable outcome and helps us make informed decisions. When data scientists create a model, they must decide which critical parameters to include, because they cannot include everything; a model is therefore a simplified version of a process. This is where trade-offs are made, based on the data scientist's or organization's definition of success.
The following diagram demonstrates a data model:
Figure 1.4 – Data science model
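The idea of a model as a process reduced to a few chosen parameters can be sketched in a few lines of Python. The website-attendance numbers below are invented for illustration; the "model" here is just a least-squares line, everything about attendance compressed into two parameters:

```python
import numpy as np

# Hypothetical daily website-attendance counts (illustrative data only).
days = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
visits = np.array([120, 135, 148, 160, 171, 185, 198], dtype=float)

# Fit a straight line: the entire process is simplified to two parameters.
slope, intercept = np.polyfit(days, visits, deg=1)

def predict(day: float) -> float:
    """The 'model': returns the most probable attendance for a given day."""
    return slope * day + intercept

# Project attendance for day 8 based on past observations.
print(round(predict(8)))
```

Everything the line cannot capture, such as weekend dips or seasonal spikes, is a sacrifice made when choosing which parameters to include.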
Every model needs a continuous data flow to improve and perform correctly. Consider the Amazon Go stores, where shoppers' behavior is analyzed by multiple cameras inside the store. The models that keep the store running are trained continuously on real-life customer behavior. These models had to learn that a shopper might pick up an item and then change their mind and put it back, or drop an item on the floor, damaging the product, and so on. The Amazon Go store model likely performs well because it has access to a lot of real data and improves over time. However, not all models have access to real data, and that's when a synthetic dataset can be used.
A synthetic dataset is a dataset that was generated artificially by a computer. The problem with synthetic data is that it is only as good as the algorithm that generated it. Often, such data misrepresents the real world. In some cases, such as when users' privacy prevents data scientists from using real data, usage of a synthetic dataset is justified; in other cases, it can lead to negative results.
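Generating synthetic data usually means sampling from a distribution that the data scientist *assumes* matches reality, which is exactly why the result is only as good as the generator. The sketch below is a minimal, assumed example (the log-normal choice and its parameters are invented for illustration):

```python
import numpy as np

# Seeded generator so the synthetic dataset itself is reproducible.
rng = np.random.default_rng(seed=42)

# Assumption baked into the generator: real page-load times follow a
# log-normal distribution. If this assumption is wrong, every model
# trained on the synthetic data inherits the error.
synthetic_load_times = rng.lognormal(mean=0.5, sigma=0.3, size=1_000)

print(synthetic_load_times.shape)      # 1,000 synthetic samples
print(bool(synthetic_load_times.min() > 0))  # log-normal values are strictly positive
```

Note that the generating assumptions, not just the code, must be recorded for the experiment to be reproducible.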
IBM's Watson was an ambitious project that promised to revolutionize healthcare by diagnosing patients from a provided list of symptoms in a matter of seconds. Such an invention could greatly speed up the diagnosis process, and in parts of the world where people have little access to healthcare, a system like that could save many lives. Unfortunately, Watson fell short of the original promise to replace doctors: it is a recommendation system that can assist in diagnosis, but nothing more. One of the reasons is that Watson was trained on a synthetic dataset rather than on real data.
There are cases when detecting issues in a trained model can be especially difficult. Take the example of an image recognition algorithm developed by the University of Washington that was built to identify whether an image had a husky portrayed in it or a wolf. The model was seemingly working really well, predicting the corre...