eBook - ePub

Data Science Strategy For Dummies

Name: Data Science Strategy For Dummies
Author: Ulrika Jägare

English

About This Book

All the answers to your data science questions

Over half of all businesses are using data science to generate insights and value from big data. How are they doing it? Data Science Strategy For Dummies answers all your questions about how to build a data science capability from scratch, starting with the "what" and the "why" of data science and covering what it takes to lead and nurture a top-notch team of data scientists.

With this book, you'll learn how to incorporate data science as a strategic function into any business, large or small. Find solutions to your real-life challenges as you uncover the stories and value hidden within data.

Learn exactly what data science is and why it's important
Adopt a data-driven mindset as the foundation to success
Understand the processes and common roadblocks behind data science
Keep your data science program focused on generating business value
Nurture a top-quality data science team

In non-technical language, Data Science Strategy For Dummies outlines new perspectives and strategies to effectively lead analytics and data science functions to create real value.

Information

Publisher

For Dummies

Year

2019

ISBN

9781119566274

Topic

Computer Science

Subtopic

Databases

Index

Computer Science

Part 1

Optimizing Your Data Science Investment

IN THIS PART …

Defining a data science strategy

Grasping the complexity in data science

Tackling major challenges in the field of data science

Addressing change in a data-driven organization

Chapter 1

Framing Data Science Strategy

IN THIS CHAPTER

Clarifying the concept of data science

Understanding the fundamentals of a data-driven organization

Putting machine learning in the context of data science

Clarifying the components of an effective data science strategy

In this chapter, I aim to sort out the basics of what data science is all about, but I have to warn you that data science is a term that escapes any single complete definition — which, of course, makes data science difficult to understand and apply in an organization. Many articles and publications use the term quite freely, with the assumption that it’s universally understood. Yet, data science — including its methods, goals, and applications — evolves with time and technology and is now far different from what it might have been 25 years ago.

Despite all that, I'm willing to put forward a tentative definition: Data science is the study of where data comes from, what it represents, and how it can be turned into a valuable resource in the creation of business strategies. Data science can be said to be a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights from data in various forms, both structured and unstructured. It involves mining large amounts of structured and unstructured data to identify patterns and deviations that can help an organization rein in costs, increase efficiencies, recognize new market opportunities, and increase its competitive advantage.

Data science is a concept that can be used to unify statistics, analytics, machine learning, and their related methods and techniques in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.

Behind that type of definition, though, lies the definition of how data science is approached and performed. And because the ambition of this part of the book is to frame data science strategy, I first need to frame this multidisciplinary area of data science and its life cycle more properly.

Establishing the Data Science Narrative

It never hurts to have an image when explaining a complicated process, so do take a look at Figure 1-1, where you can see the main steps or phases in the data science life cycle. Keep in mind, however, that the model visualized in Figure 1-1 assumes that you've already identified a high-level business problem or business opportunity as a starting point. This early ambition is usually derived from a business perspective, but it needs to be analyzed and framed in detail together with the data science team. This dialogue is vital in terms of understanding which data is available and what can be done with that data so you can set the focus of the work going forward. It isn’t a good idea to just start capturing any and all data that looks interesting enough to analyze. Therefore, the first stage of the data science life cycle, capture, is to frame the data you need by translating the business need into a concrete and well-defined problem or business opportunity.

Flow diagram depicting Communicate, Analyze, Process, Maintain, and Capture connected at the top to Actuate and connected at the bottom with arrows. — FIGURE 1-1: The different stages of the data science life cycle.

The initial business problem and/or opportunity isn’t static and will change over time as your data-driven understanding matures. Staying flexible about which data is captured, as well as which problem and/or opportunity is most important at any given point in time, is therefore vital to achieving your business objectives.

The model shown in Figure 1-1 aims to represent a view of the different stages of the data science life cycle, from capturing the business and data need through preparing, exploring, and analyzing the data to reaching insights and acting on them.

Each full cycle produces new data that carries the result of that cycle into the next one. The output includes not only new data or results, which you can use to optimize your model, but also new business needs, new problems, or even a new understanding of what the business priority should be.

These stages of the data science life cycle can be seen not only as steps describing the scope of data science but also as layers in an architecture. More on that later; let me start by explaining the different stages.

Capture

The first stage in the life cycle has two different parts, because capture refers to both the capture of the business need and the extraction and acquisition of data. This stage is vital to the rest of the process. I'll start by explaining what it means to capture the business need.

The starting point for detailing the business need is a high-level business request or business problem expressed by management or similar entities. Detailing that need should include tasks such as

Translating ambiguous business requests into concrete, well-defined problems or opportunities
Deep-diving into the context of the requests to better understand what a potential solution could look like, including which data will be needed
Outlining (if possible) strategic business priorities set by the company that might impact the data science work

Now that I've made clear the importance of capturing and understanding the business requests and initial scoping of data needed, I want to move on to describing aspects of the data capture process itself. It’s the main interface to the data source that you need to tap into and includes areas such as

Managing data ownership and securing legal rights to data capture and usage
Handling of personal information and securing data privacy through different anonymization techniques
Using hardware and software for acquiring the data through batch uploads or the real-time streaming of data
Determining how frequently data will need to be acquired, because the frequency usually varies between data types and categories
Mandating that the preprocessing of data occurs at the point of collection, or even before collection (at the edge of an IoT device, for example). This includes basic processing, like cleaning and aggregating data, but it can also include more advanced activities, such as anonymizing the data to remove sensitive information. (Anonymizing refers to removing sensitive information such as a person's name, phone number, address and so on from a data set.)

In most cases, data must be anonymized before being transferred from the data source. Usually a procedure is also in place to validate data sets in terms of completeness. If the data isn’t complete, the collection may need to be repeated several times to achieve the desired data scope. Performing this type of validation early on has a positive impact on both process speed and cost.
Managing the data transfer process to the needed storage point (local and/or global). As part of the data transfer, you may have to transform the data — aggregating it to make it smaller, for example. You may need to do this if you’re facing limits on the bandwidth capacity of the transfer links you use.
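
To make a few of the data capture tasks above more concrete, here is a minimal Python sketch of anonymizing records, validating their completeness, and aggregating them before transfer. The field names (such as customer_id, region, and amount), the sample records, and the helper functions are all hypothetical and exist only for this illustration. A real capture pipeline would follow your own data model, and dropping a few fields is only the simplest form of anonymization.

```python
from collections import defaultdict

# Hypothetical field names, chosen only for this illustration.
SENSITIVE_FIELDS = {"name", "phone", "address"}
REQUIRED_FIELDS = {"customer_id", "region", "amount"}


def anonymize(record):
    """Return a copy of the record without the sensitive fields."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def is_complete(record):
    """A record counts as complete when every required field has a value."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def aggregate_by_region(records):
    """Sum amounts per region so that less data has to be transferred."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


# Made-up sample data; the third record is missing its amount.
raw_records = [
    {"customer_id": 1, "name": "Ann", "phone": "555-0100", "region": "north", "amount": 10.0},
    {"customer_id": 2, "name": "Bob", "phone": "555-0101", "region": "north", "amount": 5.0},
    {"customer_id": 3, "name": "Cy", "phone": "555-0102", "region": "south", "amount": None},
]

anonymized = [anonymize(r) for r in raw_records]       # strip sensitive fields first
complete = [r for r in anonymized if is_complete(r)]   # keep only complete records
incomplete_count = len(anonymized) - len(complete)     # may signal a need to re-collect
region_totals = aggregate_by_region(complete)          # smaller payload for transfer
```

In this sketch, the number of incomplete records is tracked separately so that you can decide whether the collection needs to be repeated, and only the per-region totals would be sent over a bandwidth-limited link.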

Maintain

Data maintenance activities include both storing and maintaining the data. Note that data is usually processed in many different steps throughout its life cycle.

The need to protect data integrity during the life cycle of a data element is especially important during data processing activities. It’s easy to accidentally corrupt a data set through human error when manually processing data, leaving the data set useless for analysis in the next step. The best way to protect data integrity is to automate as many steps as possible of the data management activities leading up to the point of data analysis.
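
One common way to automate an integrity check is to record a checksum when a data set is stored and to verify it again before analysis. The following Python sketch shows that idea under stated assumptions: the checksum technique is my illustration rather than a method prescribed here, and the function names and sample rows are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the rows in a canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify(rows, expected_digest):
    """Return True when the rows still match the digest recorded earlier."""
    return fingerprint(rows) == expected_digest


# Made-up data set; the digest is recorded at the moment the data is stored.
stored_rows = [{"id": 1, "value": 3.5}, {"id": 2, "value": 4.0}]
recorded_digest = fingerprint(stored_rows)

# Simulate an accidental manual edit on a copy of the data.
edited_rows = [dict(row) for row in stored_rows]
edited_rows[1]["value"] = 40.0

unchanged_ok = verify(stored_rows, recorded_digest)   # untouched data passes
edited_ok = verify(edited_rows, recorded_digest)      # edited data fails the check
```

A check like this doesn't prevent mistakes, but it lets an automated pipeline stop before a corrupted data set reaches the analysis step.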

Maintaining trust in the data foundation is vital if business users are to rely on and make use of the derived insights.

When it comes to maintaining data, two important aspects are

Data storage: Think of this as everything associated with what's happening in the data lake. Data storage activities include managing the different retention periods for different types of data, as well as cataloging data properly to ensure that data is easy to access and use.
Data preparation: In the context of maintaining data, data preparation includes basic processing tasks such as second-level data cleansing, data staging, and data aggregation, all of which usually involve applying a filter directly when the data is put into storage. You don't want to put data with poor quality into your data lake.

Data retention periods can be different for the same data type, depending on its level of aggregation. For example, raw data might be interesting to save for only a short time because it’s usually very large in volume and therefore costly to store. Aggregated data, on the other hand, is often smaller in size and cheaper and easier to store and can therefore be saved for longer periods, depending on the targeted use cases.
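
As a rough illustration of retention periods that depend on the aggregation level, the Python sketch below flags data sets that have outlived their period. The retention values (30 days for raw data, 365 days for aggregated data), the catalog entries, and the function name are assumptions made up for this example. Real retention periods depend on your use cases and legal requirements.

```python
from datetime import date, timedelta

# Assumed retention periods in days; real values depend on the use case.
RETENTION_DAYS = {"raw": 30, "aggregated": 365}


def is_expired(stored_on, level, today):
    """Return True when a data set has outlived the retention period for its level."""
    return today - stored_on > timedelta(days=RETENTION_DAYS[level])


# Hypothetical catalog: the same data type stored at two aggregation levels.
today = date(2024, 6, 1)
catalog = [
    {"name": "clicks_raw", "level": "raw", "stored_on": date(2024, 4, 1)},
    {"name": "clicks_daily", "level": "aggregated", "stored_on": date(2024, 4, 1)},
]

# Both entries were stored on the same day, but only the raw one is due for deletion.
to_delete = [d["name"] for d in catalog if is_expired(d["stored_on"], d["level"], today)]
```

Keeping the periods in one small lookup table makes it easy to review and adjust them as the targeted use cases change.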

Process

The process stage is the main data processing layer, focused on preparing data for analysis. It relies on more advanced data engineering methodologies, such as

Data classification: This refers to the process of organizing data into categories for even more effective and efficient...

Cover
Table of Contents
Foreword
Introduction
Part 1: Optimizing Your Data Science Investment
Part 2: Making Strategic Choices for Your Data
Part 3: Building a Successful Data Science Organization
Part 4: Investing in the Right Infrastructure
Part 5: Data as a Business
Part 6: The Part of Tens
Index
About the Author
Connect with Dummies
End User License Agreement