eBook - ePub

Hands-On Data Science and Python Machine Learning

Name: Hands-On Data Science and Python Machine Learning
Author: Frank Kane

420 pages
English

About This Book

This book covers the fundamentals of machine learning with Python in a concise and dynamic manner. It covers data mining and large-scale machine learning using Apache Spark.

• Take your first steps in the world of data science by understanding the tools and techniques of data analysis
• Train efficient machine learning models in Python using supervised and unsupervised learning methods
• Learn how to use Apache Spark for processing Big Data efficiently

Who This Book Is For

If you are a budding data scientist or a data analyst who wants to analyze and gain actionable insights from data using Python, this book is for you. Programmers with some experience in Python who want to enter the lucrative world of Data Science will also find this book to be very useful, but you don't need to be an expert Python coder or mathematician to get the most from this book.

What You Will Learn

• Learn how to clean your data and ready it for analysis
• Implement the popular clustering and regression methods in Python
• Train efficient machine learning models using decision trees and random forests
• Visualize the results of your analysis using Python's Matplotlib library
• Use Apache Spark's MLlib package to perform machine learning on large datasets

In Detail

Join Frank Kane, who worked on Amazon and IMDb's machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand.

Based on Frank's successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis.

Style and approach

This comprehensive book is a perfect blend of theory and hands-on code examples in Python which can be used for your reference at any time.

Information

Publisher: Packt Publishing
Year: 2017
ISBN: 9781787280229
Topic: Computer Science
Subtopic: Data Processing

Apache Spark - Machine Learning on Big Data

So far in this book we've talked about a lot of general data mining and machine learning techniques that you can use in your data science career, but they've all been running on your desktop. As such, you can only process as much data as a single machine can handle using technologies such as Python and scikit-learn.

Now, everyone talks about big data, and odds are you might be working for a company that does in fact have big data to process. Big data means that you can't actually wrangle it all on just one system; you need to compute it using the resources of an entire cloud, a cluster of computing resources. And that's where Apache Spark comes in. Apache Spark is a very powerful tool for managing big data and doing machine learning on large datasets. By the end of the chapter, you will have an in-depth knowledge of the following topics:

Installing and working with Spark
Resilient Distributed Datasets (RDDs)
The MLlib (Machine Learning Library)
Decision Trees in Spark
K-Means Clustering in Spark

Installing Spark

In this section, I'm going to get you set up using Apache Spark, and show you some examples of actually using Apache Spark to solve some of the same problems that we solved on a single computer earlier in this book. The first thing we need to do is get Spark set up on your computer, so we're going to walk you through how to do that in the next couple of sections. It's pretty straightforward stuff, but there are a few gotchas. So, don't just skip these sections; there are a few things you need to pay special attention to in order to get Spark running successfully, especially on a Windows system. Let's get Apache Spark set up on your system, so you can actually dive in and start playing around with it.

We're going to be running this just on your own desktop for now. But, the same programs that we're going to write in this chapter could be run on an actual Hadoop cluster. So, you can take these scripts that we're writing and running locally on your desktop in Spark's local mode, and actually run them from the master node of an actual Hadoop cluster, then let them scale up to the entire power of that cluster and process massive datasets that way. Even though we're going to set things up to run locally on your own computer, keep in mind that these same concepts will scale up to running on a cluster as well.
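
To make the desktop-versus-cluster point a bit more concrete, here is a minimal sketch of how the same script might be launched either way. It only builds the command line rather than running anything. The spark-submit command and its --master option are real, but the helper name build_submit_command, the script name my_script.py, and the choice of yarn as the cluster master are assumptions made purely for illustration.

```python
def build_submit_command(script, master="local[*]"):
    """Return the spark-submit argument list for running `script`.

    master="local[*]" runs on this machine using all available cores;
    master="yarn" hands the job to a Hadoop cluster's resource manager.
    """
    return ["spark-submit", "--master", master, script]


# Hypothetical script name, used only for illustration.
local_cmd = build_submit_command("my_script.py")
cluster_cmd = build_submit_command("my_script.py", master="yarn")

print(" ".join(local_cmd))
print(" ".join(cluster_cmd))
```

The script itself does not change between the two commands; only the master setting does, which is why what we write on the desktop can later be pointed at a cluster.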

Installing Spark on Windows

Getting Spark installed on Windows involves several steps that we'll walk you through here. I'm just going to assume that you're on Windows because most people use this book at home. We'll talk a little bit about dealing with other operating systems in a moment. If you're already familiar with installing stuff and dealing with environment variables on your computer, then you can just take the following little cheat sheet and go off and do it. If you're not so familiar with Windows internals, I will walk you through it one step at a time in the upcoming sections. Here are the quick steps for those Windows pros:

Install a JDK: You need to first install a JDK, that's a Java Development Kit. You can just go to Oracle's website and download that and install it if you need to. We need the JDK because, even though we're going to be developing in Python during this book, Spark itself is developed natively in Scala, and our Python code drives that Scala engine under the hood. And, Scala, in turn, runs on top of the Java Virtual Machine (JVM). So, in order to run Python code on Spark, you need the Scala side of things, which will be installed by default as part of Spark. Also, we need Java, or more specifically the JVM, to actually run that Scala code. It's like a technology layer cake.
Install Python: Obviously you're going to need Python, but if you've gotten to this point in the book, you should already have a Python environment set up, hopefully with Enthought Canopy. So, we can skip this step.
Install a prebuilt version of Spark for Hadoop: Fortunately, the Apache website makes available prebuilt versions of Spark that will just run out of the box that are precompiled for the latest Hadoop version. You don't have to build anything, you can just download that to your computer and stick it in the right place and be good to go for the most part.
Create a conf/log4j.properties file: We have a few configuration things to take care of. One thing we want to do is adjust our logging level so we don't get a bunch of log spam when we run our jobs. We'll walk through how to do that. Basically, you need to rename one of the properties files, and then adjust the log level setting within it.
Add a SPARK_HOME environment variable: Next, we need to set up some environment variables to make sure that you can actually run Spark from any path that you might have. We're going to add a SPARK_HOME environment variable pointing to where you installed Spark, and then we will add %SPARK_HOME%\bin to your system path, so that when you run spark-submit, pyspark, or whatever Spark command you need, Windows will know where to find it.
Set a HADOOP_HOME variable: On Windows there's one more thing we need to do, we need to set a HADOOP_HOME variable as well because it's going to expect to find one little bit of Hadoop, even if you're not using Hadoop on your standalone system.
Install winutils.exe: Finally, we need to install a file called winutils.exe. There's a link to winutils.exe within the resources for this book, so you can get that there.

If you want to walk through the steps in more detail, you can refer to the upcoming sections.
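
As a rough illustration of the logging step in the cheat sheet above, the sketch below changes the root logger's level in the text of a log4j properties file. It assumes the file contains a line of the form log4j.rootCategory=INFO, console, which is what Spark 2.x templates typically ship with; check your own copy, since that line and the helper name set_root_log_level are assumptions rather than something spelled out here.

```python
import re


def set_root_log_level(properties_text, level="ERROR"):
    """Return properties_text with the root logger's level replaced by `level`."""
    # Matches e.g. "log4j.rootCategory=INFO, console" and keeps the appender list.
    pattern = re.compile(r"^(log4j\.rootCategory=)\w+", flags=re.MULTILINE)
    return pattern.sub(lambda match: match.group(1) + level, properties_text)


# Assumed sample content; a real file will contain many more lines.
sample = (
    "# Set everything to be logged to the console\n"
    "log4j.rootCategory=INFO, console\n"
)
quieter = set_root_log_level(sample)
print(quieter)
```

In practice you would read the properties file from Spark's conf folder, pass its text through a function like this, and save the result as conf/log4j.properties. Editing the file by hand in a text editor works just as well.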

Installing Spark on other operating systems

A quick note on installing Spark on other operating systems: the same steps will basically apply on them too. The main difference is going to be in how you set environment variables on your system, in such a way that they will automatically be applied whenever you log in. That's going to vary from OS to OS. macOS does it differently from various flavors of Linux, so you're going to have to be at least a little bit familiar with using a Unix terminal command prompt, and how to manipulate your environment to do that. But most macOS or Linux users who are doing development already have those fundamentals under their belt. And of course, you're not going to need winutils.exe if you're not on Windows. So, those are the main differences for installing on different OSes.
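
Whichever operating system you're on, it can help to confirm that the environment variables actually took effect before you try to run anything. Here is a minimal sketch of such a check, using only the variables named in the steps above. The function name check_spark_env is hypothetical, and the PATH comparison is deliberately simple (an exact string match), so treat it as a starting point rather than a complete validator.

```python
import os


def check_spark_env(env, is_windows):
    """Return a list of human-readable problems found in `env` (a dict of variables)."""
    problems = []
    required = ["SPARK_HOME"]
    if is_windows:
        required.append("HADOOP_HOME")  # only expected on Windows
    for name in required:
        if not env.get(name):
            problems.append(name + " is not set")

    spark_home = env.get("SPARK_HOME")
    if spark_home:
        # Simplification: exact string match, so a trailing slash or a
        # difference in letter case will be reported as missing.
        spark_bin = os.path.join(spark_home, "bin")
        path_entries = env.get("PATH", "").split(os.pathsep)
        if spark_bin not in path_entries:
            problems.append(spark_bin + " is not on PATH")
    return problems


# Check the environment this script is running in.
for problem in check_spark_env(dict(os.environ), is_windows=(os.name == "nt")):
    print("WARNING:", problem)
```

If the script prints nothing, the variables it knows about are in place. Remember that a terminal window opened before you changed the variables usually won't see the new values until you reopen it.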

Installing the Java Development Kit

For installing the Java Development Kit, go back to the browser, open a new tab, and just search for jdk (short for Java Development Kit). This will bring you to the Oracle site, from where you can download Java:

On the Oracle website, click on JDK DOWNLOAD. Now, click on Accept License Agreement and then you can select the download option for your operating system:

For me, that's going to be Windows 64-bit, and a wait for 198 MB of goodness to download:

Once the download is finished, locate the installer and start it running. Note that we can't just accept the default settings in the installer on Windows here. This is a Windows-specific workaround: as of the writing of this book, the current version of Spark is 2.1.1, and it turns out there's an issue with Spark 2.1.1 and Java on Windows. If you've installed Java to a path that has a space in it, it doesn't work, so we need to make sure that Java is installed to a path that does not have a space in it. This means that you can't skip this step even if you have Java installed already, so let me show you how to do that. On the installer, click on Next, and you will see, as in the following screen, that it wants to install by default to the C:\Program Files\Java\jdk path, whatever the version is:

The space in the Program Files path is going to cause trouble, so let's click on the Change... button and install to C:\jdk, a nice simple path, easy to remember, and with no spaces in it:

Now, it also wants to install the Java Runtime Environment (JRE), so just to be safe, I'm also going to install that to a path with no spaces.

At the second step of the JDK installation, we should have this showing on our screen:

I will change that destination folder as well, and we will make a new folder called C:\jre for that:

Alright, successfully installed. Woohoo!

Now, you'll need to remember the path that we installed the JDK into, which in our case was C:\jdk. We still have a few more steps to go here. Next, we need to install Spark itself.
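
If you want to double-check an install location for the space problem described above, a few lines of Python will do it. This is just a sketch; the helper name parts_with_spaces is made up for illustration, and the paths are the ones mentioned in this section.

```python
from pathlib import PureWindowsPath


def parts_with_spaces(path):
    """Return the components of a Windows path that contain a space."""
    return [part for part in PureWindowsPath(path).parts if " " in part]


for candidate in [r"C:\Program Files\Java\jdk", r"C:\jdk", r"C:\jre"]:
    bad = parts_with_spaces(candidate)
    status = "ok" if not bad else "contains spaces in: " + ", ".join(bad)
    print(candidate, "->", status)
```

PureWindowsPath is used so the check behaves the same way no matter which operating system you run it on.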

Installing Spark

Let's get back to a new browser tab here, head to spark.apache.org, and click on the Download Spark button:

Now, we have used Spark 2.1.1 in this book, but anything beyond 2.0 should work just fine.

Make sure you get a prebuilt version, and select the Direct Download option; all the other defaults are perfectly fine. Go ahead and click on the link next to instruction number 4 to download that package.

Now, it downloads a TGZ file (a tar archive compressed with gzip), which you might not be familiar with. Windows is kind of an afterthought with Spark, quite honestly, because on Windows you're not going to have a built-in utility for decompressing TGZ files. This means that you might need to install one, if you don't have one already. The one I use is called WinRAR, and you can pick that up from www.rarlab.com. Go to the Downloads page if you need it, and download the installer for WinRAR 32-bit or 64-bit, depending on your operating system. Install WinRAR as normal, and that will allo...
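
As an aside, since you already have Python installed, a TGZ file can also be unpacked with Python's standard tarfile module instead of a separate utility. The sketch below is an alternative to the approach described above, not a replacement for it. The helper name extract_tgz is hypothetical, and the demo builds a small throwaway archive so that it runs anywhere; for the real thing you would pass the path of the package you downloaded.

```python
import os
import tarfile
import tempfile


def extract_tgz(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Demo on a throwaway archive so the sketch is self-contained.
work_dir = tempfile.mkdtemp()
sample_file = os.path.join(work_dir, "README.txt")
with open(sample_file, "w") as handle:
    handle.write("placeholder contents\n")

archive_path = os.path.join(work_dir, "demo.tgz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(sample_file, arcname="demo/README.txt")

dest_dir = os.path.join(work_dir, "extracted")
extracted = extract_tgz(archive_path, dest_dir)
print(extracted)
```

Only extract archives from sources you trust, since a tar file can contain entries that try to write outside the destination folder.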

Table of contents

Title Page
Copyright
Credits
About the Author
www.PacktPub.com
Customer Feedback
Preface
Getting Started
Statistics and Probability Refresher, and Python Practice
Matplotlib and Advanced Probability Concepts
Predictive Models
Machine Learning with Python
Recommender Systems
More Data Mining and Machine Learning Techniques
Dealing with Real-World Data
Apache Spark - Machine Learning on Big Data
Testing and Experimental Design