eBook - ePub

Developing Analytic Talent

Name: Developing Analytic Talent
ISBN: 9781118810095

Becoming a Data Scientist

Vincent Granville

English

About this book

Learn what it takes to succeed in the most in-demand tech job

Harvard Business Review calls it the sexiest tech job of the 21st century. Data scientists are in demand, and this unique book shows you exactly what employers want and the skill set that separates the quality data scientist from other talented IT professionals. Data science involves extracting, creating, and processing data to turn it into business value. With over 15 years of big data, predictive modeling, and business analytics experience, author Vincent Granville is no stranger to data science. In this one-of-a-kind guide, he provides insight into the essential data science skills, such as statistics and visualization techniques, and covers everything from analytical recipes and data science tricks to common job interview questions, sample resumes, and source code.

The applications are endless and varied: automatically detecting spam and plagiarism, optimizing bid prices in keyword advertising, identifying new molecules to fight cancer, assessing the risk of meteorite impact. Complete with case studies, this book is a must, whether you're looking to become a data scientist or to hire one.

Explains the finer points of data science, the required skills, and how to acquire them, including analytical recipes, standard rules, source code, and a dictionary of terms
Shows what companies are looking for and how the growing importance of big data has increased the demand for data scientists
Features job interview questions, sample resumes, salary surveys, and examples of job ads
Includes case studies that explore how data science is used on Wall Street, in botnet detection, for online advertising, and in many other business-critical situations

Developing Analytic Talent: Becoming a Data Scientist is essential reading for those aspiring to this hot career choice and for employers seeking the best candidates.

Information

Topic: Computer Science
Subtopic: Databases

Chapter 1 What Is Data Science?

Sometimes, understanding what something is includes having a clear picture of what it is not. Understanding data science is no exception. Thus, this chapter begins by investigating what data science is not, because the term has been much abused and a lot of hype surrounds big data and data science. You will first consider the difference between true data science and fake data science. Next, you will learn how new data science training has evolved from traditional university degree programs. Then you will review several examples of how modern data science can be used in real-world scenarios.

Finally, you will review the history of data science and its evolution from computer science, business optimization, and statistics into modern data science and its trends. At the end of the chapter, you will find a Q&A section from recent discussions I’ve had that illustrate the conflicts between data scientists, data architects, and business analysts.

This chapter asks more questions than it answers, but you will find the answers discussed in more detail in subsequent chapters. The purpose of this approach is for you to become familiar with how data scientists think, what is important in the big data industry today, what is becoming obsolete, and what people interested in a data science career don’t need to learn. For instance, you need to know statistics, computer science, and machine learning, but not everything from these domains. You don’t need to know the details about complexity of sorting algorithms (just the general results), and you don’t need to know how to compute a generalized inverse matrix, nor even know what a generalized inverse matrix is (a core topic of statistical theory), unless you specialize in the numerical aspects of data science.

Technical Note

This chapter can be read by anyone with minimal mathematical or technical knowledge. More advanced information is presented in “Technical Notes” like this one, which may be skipped by non-mathematicians.

CROSS-REFERENCE You will find definitions of most terms used in this book in Chapter 8.

Real Versus Fake Data Science

Books, certificates, and graduate degrees in data science are spreading like mushrooms after the rain. Unfortunately, many are just a mirage: people taking advantage of the new paradigm to quickly repackage old material (such as statistics and R programming) with the new label “data science.”

Expanding on the R programming example of fake data science, note that R is an open source statistical programming language and environment that is at least 20 years old, and is the successor of the commercial product S+. R was and still is limited to in-memory data processing and has been very popular in the statistical community, sometimes appreciated for the great visualizations that it produces. Modern environments have extended R's capabilities (overcoming its in-memory limitations) by creating libraries or integrating R in a distributed architecture, such as RHadoop (R + Hadoop). Of course other languages exist, such as SAS, but they haven’t gained as much popularity as R. In the case of SAS, this is because of its high price and the fact that it was more popular in government organizations and brick-and-mortar companies than in the fields that experienced rapid growth over the last 10 years, such as digital data (search engine, social, mobile data, collaborative filtering). Finally, R is not unlike the C, Perl, or Python programming languages in terms of syntax (they all share the same syntax roots), and thus it is easy for a wide range of programmers to learn. It also comes with many libraries and a nice user interface. SAS, on the other hand, is more difficult to learn.

To add to the confusion, executives and decision makers building a new team of data scientists sometimes don’t know exactly what they are looking for, and they end up hiring pure tech geeks, computer scientists, or people lacking proper big data experience. The problem is compounded by Human Resources (HR) staff who do not know any better and thus produce job ads that repeat the same keywords: Java, Python, MapReduce, R, Hadoop, and NoSQL. But is data science really a mix of these skills?

Sure, MapReduce is just a generic framework to handle big data by reducing data into subsets and processing them separately on different machines, then putting all the pieces back together. So it’s the distributed architecture aspect of processing big data, and these farms of servers and machines are called the cloud.
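The split, process, and recombine idea described above can be illustrated with a toy word count. This is only a minimal single-machine sketch in Python, not Hadoop or any real framework: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are assumptions made up for illustration, and the separate machines are simulated by a plain loop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # here the "machines" are simulated by iterating over the chunks.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Hypothetical input, already split into two subsets.
chunks = ["big data is big", "data science is more than big data"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently of the others, the map step is the part that can be spread across machines; only the grouping and the final reduction need to see results from more than one chunk.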

Hadoop is an implementation of MapReduce, just as C++ is an implementation (still used in finance) of object-oriented programming. NoSQL means “Not Only SQL” and is used to describe database or data management systems that support new, more efficient ways to access data (for instance, MapReduce), sometimes as a layer hidden below SQL (the standard database querying language).

CROSS-REFERENCE See Chapter 2 for more information on what MapReduce can’t do.

There are other frameworks besides MapReduce — for instance, graph databases and environments that rely on the concepts of nodes and edges to manage and access data, typically spatial data. These concepts are not necessarily new. Distributed architecture has been used in the context of search technology since before Google existed. I wrote Perl scripts that perform hash joins (a type of NoSQL join, where a join is the operation of joining or merging two tables in a database) more than 15 years ago. Today some database vendors offer hash joins as a fast alternative to SQL joins. Hash joins are discussed later in this book. They use hash tables and rely on name-value pairs. The conclusion is that MapReduce, NoSQL, Hadoop, and Python (a scripting programming language great at handling text and unstructured data) are sometimes presented as Perl’s successors and have their roots in systems and techniques that started to be developed decades ago and have matured over the last 10 years. But data science is more than that.
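To make the hash join mentioned above concrete, here is a minimal sketch in Python rather than Perl. It is not the author's script: the hash_join function, the table contents, and the column names are hypothetical, and it assumes both tables fit in memory as lists of dictionaries. One table is loaded into a hash table keyed on the join column (the name-value pairs), and the other table is then scanned once, looking up each key.

```python
def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key` using a hash table."""
    # Build phase: index the (ideally smaller) left table by key value.
    index = {}
    for row in left_rows:
        index.setdefault(row[key], []).append(row)

    # Probe phase: look up each right row's key in the hash table.
    joined = []
    for row in right_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical tables for illustration only.
users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
clicks = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 1, "page": "/pricing"},
    {"user_id": 3, "page": "/home"},
]
result = hash_join(users, clicks, "user_id")
print(result)
```

Each table is read only once, so the work grows roughly with the combined number of rows plus the number of matches, provided the hash table for the indexed table fits in memory.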

Indeed, you can be a real data scientist and have none of these skills. NoSQL and MapReduce are not new concepts — many people embraced them long before these keywords were created. But to be a data scientist, you also need the following:

Business acumen
Real big data expertise (for example, you can easily process a 50 million-row data set in a couple of hours)
Ability to sense the data
A distrust of models
Knowledge of the curse of big data
Ability to communicate and understand which problems management is trying to solve
Ability to correctly assess lift — or ROI — on the salary paid to you
Ability to quickly identify a simple, robust, scalable solution to a problem
Ability to convince and drive management in the right direction, sometimes against its will, for the benefit of the company, its users, and shareholders
A real passion for analytics
Real applied experience with success stories
Data architecture knowledge
Data gathering and cleaning skills
Computational complexity basics — how to develop robust, efficient, scalable, and portable architectures
Good knowledge of algorithms

A data scientist is also a generalist in business analysis, statistics, and computer science, with expertise in fields such as robustness, design of experiments, algorithm complexity, dashboards, and data visualization, to name a few. Some data scientists are also data strategists — they can develop a data collection strategy and leverage data to develop actionable insights that make business impact. This requires creativity to develop analytics solutions based on business constraints and limitations.

The basic mathematics needed to understand data science are as follows:

Algebra, including, if possible, basic matrix theory.
A first course in calculus. Theory can be limited to understanding computational complexity and the O notation. Special functions include the logarithm, exponential, and power functions. Differential equations, integrals, and complex numbers are not necessary.
A first course in statistics and probability, including a familiarity with the concept of random variables, probability, mean, variance, percentiles, experimental design, cross-validation, goodness of fit, and robust statistics (not the technical details, but a general understanding as presented in this book).

From a technical point of view, important skills and knowledge include R, Python (or Perl), Excel, SQL, graphics (visualization), FTP, basic UNIX commands (sort, grep, head, tail, the pipe and redirect operators, cat, cron jobs, and so on), as well as a basic understanding of how databases are designed and accessed. Also important is understanding how distributed systems work and where bottlenecks are found (data transfers between hard disk and memory, or over the Internet). Finally, a basic knowledge of web crawlers helps to access unstructured data found on the Internet.

Two Examples of Fake Data Science

Here are two examples of fake data science that demonstrate why data scientists need a standard and best practices for their work. The two examples discussed here are not bad products — they indeed have a lot of intrinsic value — but they are not data science. The problem is two-fold:

First, statisticians have not been involved in the big data revolution. Some have written books about applied data science, but it’s just a repackaging of old statistics courses.
Second, methodologies that work for big data sets — as big data was defined back in 2005 when 20 million rows would qualify as big data — fail on post-2010 big data that is in terabytes.

As a result, people think that data science is statistics with a new name; they confuse data science and fake data science, and big data 2005 with big data 2013. Modern data is also very different and has been described by three Vs: velocity (real time, fast flowing), variety (structured, unstructured such as tweets), and volume. I would add veracity and value as well. For details, read the discussion on when data is flowing faster than it can be processed in Chapter 2.

CROSS-REFERENCE See Chapter 4 for more detail on statisticians versus data scientists.

Example 1: Introduction to Data Science e-Book

In a 2012 data science training manual from a well-known university, most of the book is about old statistical theory. Throughout the book, R is used to illustrate the various concepts. But logistic regression in the context of processing a mere 10,000 rows of data is not big data science; it is fake data science. The entire book is about small data, with the exception of the last few chapters, where you learn a bit of SQL (embedded in R code) and how to use an R package to extract tweets from Twitter, and create what the author calls a word cloud (it has nothing to do with cloud computing).

Even the Twitter project is about small data, and there’s no distributed architecture (for example, MapReduce) in it. Indeed, the book never talks about data architecture. Its level is elementary. Each chapter starts with a short introduction in simple English (suitable for high school students) about big data/data science, but these little data science excursions are out of context and independent from the projects and technical presentations.

Perhaps the author added these short paragraphs so that he could rename his “Statistics with R” e-book as “Introduction to Data Science.” But it’s free and it’s a nice, well-written book to get high-school students interested in statistics and programming. It’s just that it has nothing to do with data science.

Example 2: Data Science Certificate

Consider a data science certificate offered by a respected public university in the United States. The advisory board is mostly senior technical guys, most having academic positions. The data scientist is presented as “a new type of data analyst.” I disagree. Data analysts include number crunchers and others who, on average, command lower salaries when you check job ads, mostly because these are less senior positions. Data scientist is not a junior-level position.

This university program has a strong data architecture and computer science flair, and the computer science content is of great quality. That’s an important part of data science, but it covers only one-third of data science. It also has a bit of old statistics and some nice lessons on robustness and other statistical topics, but nothing about several topics that are useful for data scientists (for example, Six Sigma, approximate solutions, the 80/20 rule, cross-validation, design of experiments, modern pattern recognition, lift metrics, third-party data, Monte Carlo simulations, or the life cycle of data science projects). The program does require knowledge of Java and Python for admission. It is also expensive: several thousand dollars.

So what comprises the remaining two-thirds of data science? Domain expertise (in one or two areas) counts for one-third. The final third is a blend of applied statistics, business acumen, and the ability to communicate with decision makers or to make decisions, as well as vision and leadership. You don’t need to know everything about six sigma, statistics, or operations resea...

Cover
Contents
Chapter 1: What Is Data Science?
Chapter 2: Big Data Is Different
Chapter 3: Becoming a Data Scientist
Chapter 4: Data Science Craftsmanship, Part I
Chapter 5: Data Science Craftsmanship, Part II
Chapter 6: Data Science Application Case Studies
Chapter 7: Launching Your New Data Science Career
Chapter 8: Data Science Resources
Introduction
