eBook - ePub

Mastering Apache Spark 2.x - Second Edition

Name: Mastering Apache Spark 2.x - Second Edition
ISBN: 9781785285226

Author: Romeo Kienzler

332 pages
English

About this book

Advanced analytics on your Big Data with the latest Apache Spark 2.x

About This Book

• An advanced guide that combines instructions and practical examples to extend the most up-to-date Spark functionalities.
• Extend your data processing capabilities to process huge chunks of data in minimum time using advanced concepts in Spark.
• Master the art of real-time processing with the help of Apache Spark 2.x.

Who This Book Is For

If you are a developer with some experience with Spark and want to strengthen your knowledge of how to get around in the world of Spark, then this book is ideal for you. Basic knowledge of Linux, Hadoop, and Spark is assumed. Reasonable knowledge of Scala is expected.

What You Will Learn

• Examine advanced Machine Learning and Deep Learning with MLlib, SparkML, SystemML, H2O, and DeepLearning4J
• Study highly optimised unified batch and real-time data processing using SparkSQL and Structured Streaming
• Evaluate large-scale graph processing and analysis using GraphX and GraphFrames
• Apply Apache Spark in elastic deployments using Jupyter and Zeppelin Notebooks, Docker, Kubernetes, and the IBM Cloud
• Understand internal details of the cost-based optimizers used in Catalyst, SystemML, and GraphFrames
• Learn how specific parameter settings affect the overall performance of an Apache Spark cluster
• Leverage Scala, R, and Python for your data science projects

In Detail

Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionalities such as graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to expand Spark's functionality and implement your data flows and machine/deep learning programs on top of the platform.

The book commences with an overview of the Spark ecosystem. It will introduce you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x.

You will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed things up dramatically. The book extends to show how to incorporate H2O, SystemML, and Deeplearning4j for machine learning, and Jupyter Notebooks and Kubernetes/Docker for cloud-based Spark. During the course of the book, you will learn about the latest enhancements to Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.

You will also learn about the updates on the APIs and how DataFrames and Datasets affect SQL, machine learning, graph processing, and streaming. You will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.

Style and approach

This book is an extensive guide to Apache Spark modules and tools and shows, with worked examples, how Spark's functionality can be extended for real-time processing and storage.

Information

Publisher: Packt Publishing
Year: 2017
eBook ISBN: 9781785285226
Topic: Computer Science
Subtopic: Data Modelling & Design

Deep Learning on Apache Spark with DeepLearning4j and H2O

This chapter introduces Deep Learning and shows how you can apply it using third-party machine learning libraries on top of Apache Spark. Deep Learning is outperforming a variety of other state-of-the-art machine learning algorithms, and it is a very active area of research, so there is more to come soon. It is therefore important to know how Deep Learning works and how it can be applied in a parallel data processing environment such as Apache Spark.

This chapter will cover the following topics in detail:

Introduction to the installation and usage of the H2O framework
Introduction to Deeplearning4j with an IoT anomaly detection example

H2O

H2O is an open source machine learning system developed in Java by http://h2o.ai/. It offers a rich set of machine learning algorithms and a web-based data processing user interface, and it supports development in a range of languages: Java, Scala, Python, and R.

It can also interface with Spark, HDFS, SQL, and NoSQL databases. This chapter concentrates on H2O's integration with Apache Spark through the Sparkling Water component of H2O. A simple Scala example, based on real data, will be used to create a deep learning model.

The next step will be to provide an overview of the H2O functionality and the Sparkling Water architecture that will be used in this chapter.

Overview

Since it is only possible to examine and use a small amount of H2O's functionality in this chapter, we thought that it would be useful to provide a list of all of the functional areas that it covers. This list is taken from the http://h2o.ai/ website at http://h2o.ai/product/algorithms/ and is based upon wrangling data, modeling using the data, and scoring the resulting models:

Process: data profiling; summary statistics; aggregate, filter, bin, and derive columns; slice, log transform, and anonymize; variable creation; training and validation sampling plan
Model: generalized linear models (GLM); decision trees; gradient boosting machine (GBM); K-means; anomaly detection; deep learning; Naive Bayes; grid search
The score tool: predict; confusion matrix; AUC; hit ratio; PCA/PCA score; multimodel scoring

The following section will explain the environment used for the Spark and H2O examples in this chapter and some of the problems encountered.

For completeness, we will show you how we downloaded, installed, and used H2O. Although we finally settled on version 0.2.12-95, we first downloaded and used 0.2.12-92. This section is based on the earlier install, but the approach used to source the software is the same. The download link changes over time, so follow the Sparkling Water download option at http://h2o.ai/download/.

This will source the zipped Sparkling Water release, as shown in the file listing here:

 [hadoop@hc2r1m2 h2o]$ pwd ; ls -l
 /home/hadoop/h2o
 total 15892
 -rw-r--r-- 1 hadoop hadoop 16272364 Apr 11 12:37 sparkling-water-0.2.12-92.zip

This zipped release file is unpacked using the Linux unzip command, and it results in a Sparkling Water release file tree:

 [hadoop@hc2r1m2 h2o]$ unzip sparkling-water-0.2.12-92.zip
 
 [hadoop@hc2r1m2 h2o]$ ls -d sparkling-water*
 sparkling-water-0.2.12-92 sparkling-water-0.2.12-92.zip
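
The unpack step can be wrapped so that it is safe to repeat. The following is a minimal sketch, not part of Sparkling Water: the function name unpack_release is hypothetical, and it assumes, as the listing above shows, that the archive unpacks into a directory named after the zip file.

```shell
#!/bin/sh
# Sketch: unpack a Sparkling Water zip next to itself, unless that was already done.
# unpack_release is a hypothetical helper name used for illustration only.
unpack_release() {
    archive=$1                              # e.g. /home/hadoop/h2o/sparkling-water-0.2.12-92.zip
    release=$(basename "$archive" .zip)     # assumed name of the unpacked directory
    parent=$(dirname "$archive")

    if [ -d "$parent/$release" ]; then
        echo "already unpacked: $parent/$release"
        return 0
    fi

    # -q keeps unzip quiet, -d selects the extraction directory
    unzip -q "$archive" -d "$parent"
}
```

For example, `unpack_release /home/hadoop/h2o/sparkling-water-0.2.12-92.zip` would unpack the release once and report that it is already present on later runs.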

We have moved the release tree to the /usr/local/ area using the root account and created a simple symbolic link to the release called h2o. This means that our H2O-based build can refer to this link, and it doesn't need to change as new versions of Sparkling Water are sourced. We have also made sure, using the Linux chown command, that our development account, hadoop, owns the release and so has access to it:

[hadoop@hc2r1m2 h2o]$ su -
[root@hc2r1m2 ~]# cd /home/hadoop/h2o
[root@hc2r1m2 h2o]# mv sparkling-water-0.2.12-92 /usr/local
[root@hc2r1m2 h2o]# cd /usr/local
 
[root@hc2r1m2 local]# chown -R hadoop:hadoop sparkling-water-0.2.12-92
[root@hc2r1m2 local]# ln -s sparkling-water-0.2.12-92 h2o
 
[root@hc2r1m2 local]# ls -lrt | grep sparkling
total 52
drwxr-xr-x 6 hadoop hadoop 4096 Mar 28 02:27 sparkling-water-0.2.12-92
lrwxrwxrwx 1 root root 25 Apr 11 12:43 h2o -> sparkling-water-0.2.12-92
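
The version-independent link described above can be sketched as a small shell function. This is only an illustration: link_release is a hypothetical helper name, and the -n option of ln is assumed to be available (it is in GNU and BSD ln, but it is not required by POSIX).

```shell
#!/bin/sh
# Sketch: point a stable link (h2o by default) at a versioned release tree.
# link_release is a hypothetical helper, not part of Sparkling Water.
link_release() {
    base_dir=$1            # directory holding the release tree, e.g. /usr/local
    release=$2             # release directory name, e.g. sparkling-water-0.2.12-92
    link_name=${3:-h2o}    # name of the stable link

    if [ ! -d "$base_dir/$release" ]; then
        echo "release tree not found: $base_dir/$release" >&2
        return 1
    fi

    # -s creates a symbolic link, -f replaces an existing link,
    # -n stops ln from descending into the directory the old link points to
    ( cd "$base_dir" && ln -sfn "$release" "$link_name" )
}
```

When a newer release is installed, running `link_release /usr/local sparkling-water-0.2.12-95` as root would repoint the link, and the build configuration that refers to /usr/local/h2o would not need to change.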

The release has been installed on all the nodes of our Hadoop clusters.

The build environment

From past examples, you know that we favor sbt as a build tool for developing Scala source examples.

We have created a development environment on the Linux server called hc2r1m2 using the hadoop development account. The development directory is called h2o_spark_1_2:

[hadoop@hc2r1m2 h2o_spark_1_2]$ pwd
 /home/hadoop/spark/h2o_spark_1_2

Our SBT build configuration file named h2o.sbt is located here; it contains the following:

 [hadoop@hc2r1m2 h2o_spark_1_2]$ more h2o.sbt
 
 name := "H2O"
 
 version := "1.0"
 
 scalaVersion := "2.10.4"
 
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"
 
 libraryDependencies += "org.apache.spark" % "spark-core" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"
 
 libraryDependencies += "org.apache.spark" % "mllib" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"
 
 libraryDependencies += "org.apache.spark" % "sql" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"
 
 libraryDependencies += "org.apache.spark" % "h2o" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"
 
 libraryDependencies += "hex.deeplearning" % "DeepLearningModel" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-...
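
Because each dependency in this file is pinned to a local jar through a from "file://..." URL, a wrong path only shows up when sbt resolves the build. The following is a minimal pre-flight sketch under stated assumptions: check_sbt_jars is a hypothetical helper, it relies on grep -o (available in GNU and BSD grep), and it assumes each file:// URL sits inside double quotes on a single line.

```shell
#!/bin/sh
# Sketch: report jars referenced by file:// URLs in an sbt file that do not exist.
# check_sbt_jars is a hypothetical helper name used for illustration only.
check_sbt_jars() {
    sbt_file=$1

    # Extract every file:// URL, strip the scheme to get an absolute path,
    # then test each path. The subshell's exit status becomes the result.
    grep -o 'file://[^"]*' "$sbt_file" | sed 's|^file://||' | (
        missing=0
        while IFS= read -r jar; do
            if [ ! -f "$jar" ]; then
                echo "missing jar: $jar" >&2
                missing=1
            fi
        done
        exit "$missing"
    )
}
```

Running `check_sbt_jars h2o.sbt` in the development directory would return a non-zero status and list any jar that is not where the build file expects it, for example after the h2o link has been moved to a release with a different version number.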

Table of contents

Title Page
Copyright
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
A First Taste and What’s New in Apache Spark V2
Apache Spark SQL
The Catalyst Optimizer
Project Tungsten
Apache Spark Streaming
Structured Streaming
Apache Spark MLlib
Apache SparkML
Apache SystemML
Deep Learning on Apache Spark with DeepLearning4j and H2O
Apache Spark GraphX
Apache Spark GraphFrames
Apache Spark with Jupyter Notebooks on IBM DataScience Experience
Apache Spark on Kubernetes
