IN THIS CHAPTER
- Clarifying the concept of data science
- Understanding the fundamentals of a data-driven organization
- Putting machine learning in context of data science
- Clarifying the components of an effective data science strategy
In this chapter, I aim to sort out the basics of what data science is all about, but I have to warn you that data science is a term that escapes any single complete definition, which, of course, makes data science difficult to understand and apply in an organization. Many articles and publications use the term quite freely, with the assumption that it’s universally understood. Yet data science, including its methods, goals, and applications, evolves with time and technology and is now far different from what it might have been 25 years ago.
Despite all that, I'm willing to put forward a tentative definition: Data science is the study of where data comes from, what it represents, and how it can be turned into a valuable resource in the creation of business strategies. Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights from data in various forms, both structured and unstructured. It involves mining large amounts of structured and unstructured data to identify patterns and deviations that can help an organization rein in costs, increase efficiencies, recognize new market opportunities, and increase its competitive advantage.
Data science is a concept that can be used to unify statistics, analytics, machine learning, and their related methods and techniques in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.
Behind that type of definition, though, lies the question of how data science is approached and performed. And because the ambition of this part of the book is to frame data science strategy, I first need to frame this multidisciplinary area of data science and its life cycle more precisely.
Establishing the Data Science Narrative
It never hurts to have an image when explaining a complicated process, so do take a look at Figure 1-1, where you can see the main steps or phases in the data science life cycle. Keep in mind, however, that the model visualized in Figure 1-1 assumes that you've already identified a high-level business problem or business opportunity as a starting point. This early ambition is usually derived from a business perspective, but it needs to be analyzed and framed in detail together with the data science team. This dialogue is vital for understanding which data is available and what is possible to do with that data so that you can set the focus of the work going forward. It isn’t a good idea to just start capturing any and all data that looks interesting enough to analyze. Therefore, the first stage of the data science life cycle, capture, is to frame the data you need by translating the business need into a concrete and well-defined problem or business opportunity.
The initial business problem and/or opportunity isn’t static and will change over time as your data-driven understanding matures. Staying flexible in terms of which data is captured, as well as which problem and/or opportunity is most important at any given point in time, is therefore vital in order to achieve your business objectives.
The model shown in Figure 1-1 aims to represent a view of the different stages of the data science life cycle, from capturing the business and data need through preparing, exploring, and analyzing the data to reaching insights and acting on them.
Each full cycle produces new data that captures the result of that cycle. This output includes not only new data or results, which you can use to optimize your model, but can also surface new business needs, problems, or even a new understanding of what the business priority should be.
These stages of the data science life cycle can be seen not only as steps describing the scope of data science but also as layers in an architecture. More on that later; let me start by explaining the different stages.
Capture
The first stage of the life cycle has two parts, because capture refers to both capturing the business need and extracting and acquiring the data. This stage is vital to the rest of the process. I'll start by explaining what it means to capture the business need.
The starting point for detailing the business need is a high-level business request or business problem expressed by management or similar entities. Detailing that need should include tasks such as
- Translating ambiguous business requests into concrete, well-defined problems or opportunities
- Deep-diving into the context of the requests to better understand what a potential solution could look like, including which data will be needed
- Outlining (if possible) strategic business priorities set by the company that might impact the data science work
Now that I've made clear the importance of capturing and understanding the business requests and initial scoping of data needed, I want to move on to describing aspects of the data capture process itself. It’s the main interface to the data source that you need to tap into and includes areas such as
- Managing data ownership and securing legal rights to data capture and usage
- Handling of personal information and securing data privacy through different anonymization techniques
- Using hardware and software for acquiring the data through batch uploads or the real-time streaming of data
- Determining how frequently data will need to be acquired, because the frequency usually varies between data types and categories
- Mandating that the preprocessing of data occurs at the point of collection, or even before collection (at the edge of an IoT device, for example). This includes basic processing, like cleaning and aggregating data, but it can also include more advanced activities, such as anonymizing the data to remove sensitive information. (Anonymizing refers to removing sensitive information such as a person's name, phone number, address, and so on from a data set.)
In most cases, data must be anonymized before being transferred from the data source. Usually a procedure is also in place to validate data sets in terms of completeness. If the data isn’t complete, the collection may need to be repeated several times to achieve the desired data scope. Performing this type of validation early on has a positive impact on both process speed and cost.
- Managing the data transfer process to the needed storage point (local and/or global). As part of the data transfer, you may have to transform the data — aggregating it to make it smaller, for example. You may need to do this if you’re facing limits on the bandwidth capacity of the transfer links you use.
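To make the preprocessing steps in the list above more concrete, here is a minimal Python sketch of what could happen at the point of collection: sensitive fields are removed, incomplete records are flagged so that collection can be repeated, and the remaining readings are aggregated to shrink the payload before transfer. The field names (name, phone, address, device_id, hour, reading) and the function names are hypothetical and chosen only for illustration. Dropping fields is only the simplest form of anonymization; a real pipeline would likely need stronger techniques.

```python
from collections import defaultdict

# Hypothetical field names, used only for illustration.
SENSITIVE_FIELDS = {"name", "phone", "address"}
REQUIRED_FIELDS = {"device_id", "hour", "reading"}


def anonymize(record):
    """Drop directly identifying fields from a single record."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def missing_fields(record):
    """Return the required fields that are absent or None in a record."""
    return {f for f in REQUIRED_FIELDS if record.get(f) is None}


def aggregate_by_hour(records):
    """Average readings per (device_id, hour) to reduce the data volume."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["device_id"], r["hour"])].append(r["reading"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def prepare_for_transfer(raw_records):
    """Anonymize, validate, and aggregate records before they leave the source.

    Returns the aggregated data plus the list of incomplete records,
    which signals that the collection may need to be repeated.
    """
    cleaned = [anonymize(r) for r in raw_records]
    incomplete = [r for r in cleaned if missing_fields(r)]
    complete = [r for r in cleaned if not missing_fields(r)]
    return aggregate_by_hour(complete), incomplete
```

Returning the incomplete records separately, rather than silently dropping them, keeps the early completeness check visible to whoever decides whether the collection has to be rerun.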
Maintain
Data maintenance activities include both storing and maintaining the data. Note that data is usually processed in many different steps throughout its life cycle.
The need to protect data integrity during the life cycle of a data element is especially important during data processing activities. It’s easy to accidentally corrupt a data set through human error when manually processing data, rendering the data set useless for analysis in the next step. The best way to protect data integrity is to automate as many steps as possible of the data management activities leading up to the point of data analysis.
Keeping business trust in the data foundation is vital in order for business users to trust and make use of the derived insights.
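One simple way to automate an integrity check is to record a fingerprint of a data set when it's stored and verify it again before analysis. The following Python sketch assumes a SHA-256 checksum is an acceptable fingerprint; this technique and the function names are my own illustration, not a prescribed method.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the exact content."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    """Check that the data still matches the fingerprint recorded earlier."""
    return fingerprint(data) == expected
```

A check like this detects accidental changes between steps, but it doesn't tell you what changed or whether the original data was correct to begin with.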
When it comes to maintaining data, two important aspects are
- Data storage: Think of this as everything associated with what's happening in the data lake. Data storage activities include managing the different retention periods for different types of data, as well as cataloging data properly to ensure that data is easy to access and use.
- Data preparation: In the context of maintaining data, data preparation includes basic processing tasks such as second-level data cleansing, data staging, and data aggregation, all of which usually involve applying a filter directly when the data is put into storage. You don't want to put poor-quality data into your data lake.
Data retention periods can be different for the same data type, depending on its level of aggregation. For example, raw data might be interesting to save for only a short time because it’s usually very large in volume and therefore costly to store. Aggregated data, on the other hand, is often smaller in size, cheaper and easier to store, and can therefore be saved for longer periods, depending on the targeted use cases.
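The idea of different retention periods per aggregation level can be sketched as a small lookup in Python. The level names and the number of days shown here are hypothetical; real values depend on storage cost and the targeted use cases.

```python
from datetime import date, timedelta

# Hypothetical retention periods per aggregation level, in days.
RETENTION_DAYS = {"raw": 30, "hourly": 365, "daily": 5 * 365}


def is_expired(level: str, stored_on: date, today: date) -> bool:
    """Return True if data at this aggregation level has outlived its retention period."""
    return today - stored_on > timedelta(days=RETENTION_DAYS[level])
```

With these example values, raw data stored two months ago would be due for deletion, whereas a daily aggregate of the same data stored on the same day would be kept.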
Process
Processing is the main layer focused on preparing data for analysis, and it relies on more advanced data engineering methodologies, such as
- Data classification: This refers to the process of organizing data into categories for even more effective and efficient...