Hands-On Deep Learning with Apache Spark
eBook - ePub

Hands-On Deep Learning with Apache Spark

Build and deploy distributed deep learning applications on Apache Spark

  1. 322 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
eBook - ePub

Hands-On Deep Learning with Apache Spark

Build and deploy distributed deep learning applications on Apache Spark

Book details
Book preview
Table of contents
Citations

About This Book

Speed up the design and implementation of deep learning solutions using Apache Spark

Key Features

  • Explore the world of distributed deep learning with Apache Spark
  • Train neural networks with deep learning libraries such as BigDL and TensorFlow
  • Develop Spark deep learning applications to intelligently handle large and complex datasets

Book Description

Deep learning is a subset of machine learning where datasets with several layers of complexity can be processed. Hands-On Deep Learning with Apache Spark addresses the sheer complexity of technical and analytical parts and the speed at which deep learning solutions can be implemented on Apache Spark.

The book starts with the fundamentals of Apache Spark and deep learning. You will set up Spark for deep learning, learn principles of distributed modeling, and understand different types of neural nets. You will then implement deep learning models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) on Spark.

As you progress through the book, you will gain hands-on experience of what it takes to understand the complex datasets you are dealing with. During the course of this book, you will use popular deep learning frameworks, such as TensorFlow, Deeplearning4j, and Keras to train your distributed models.

By the end of this book, you'll have gained experience with the implementation of your models on a variety of use cases.

What you will learn

  • Understand the basics of deep learning
  • Set up Apache Spark for deep learning
  • Understand the principles of distribution modeling and different types of neural networks
  • Obtain an understanding of deep learning algorithms
  • Discover textual analysis and deep learning with Spark
  • Use popular deep learning frameworks, such as Deeplearning4j, TensorFlow, and Keras
  • Explore popular deep learning algorithms

Who this book is for

If you are a Scala developer, data scientist, or data analyst who wants to learn how to use Spark for implementing efficient deep learning models, Hands-On Deep Learning with Apache Spark is for you. Knowledge of the core machine learning concepts and some exposure to Spark will be helpful.

Frequently asked questions

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Yes, you can access Hands-On Deep Learning with Apache Spark by Guglielmo Iozzia in PDF and/or ePUB format, as well as other popular books in Computer Science & Data Mining. We have over one million books available in our catalogue for you to explore.

Information

Year
2019
ISBN
9781788999700
Edition
1

The Apache Spark Ecosystem

Apache Spark (http://spark.apache.org/) is an open source, fast cluster-computing platform. It was originally created by AMPLab at the University of California, Berkeley. Its source code was later donated to the Apache Software Foundation (https://www.apache.org/). Spark comes with a very fast computation speed because data is loaded into distributed memory (RAM) across a cluster of machines. Not only can data be quickly transformed, but also cached on demand for a variety of use cases. Compared to Hadoop MapReduce, it runs programs up to 100 times faster when the data fits in memory, or 10 times faster on disk. Spark provides support for four programming languages: Java, Scala, Python, and R. This book covers the Spark APIs (and deep learning frameworks) for Scala (https://www.scala-lang.org/) and Python (https://www.python.org/) only.
This chapter will cover the following topics:
  • Apache Spark fundamentals
  • Getting Spark
  • Resilient Distributed Dataset (RDD) programming
  • Spark SQL, Datasets, and DataFrames
  • Spark Streaming
  • Cluster mode using a different manager

Apache Spark fundamentals

This section covers the Apache Spark fundamentals. It is important to become very familiar with the concepts that are presented here before moving on to the next chapters, where we'll be exploring the available APIs.
As mentioned in the introduction to this chapter, the Spark engine processes data in distributed memory across the nodes of a cluster. The following diagram shows the logical structure of how a typical Spark job processes information:
Figure 1.1
Spark executes a job in the following way:
Figure 1.2
The Master controls how data is partitioned and takes advantage of data locality while keeping track of all the distributed data computation on the Slave machines. If a certain Slave machine becomes unavailable, the data on that machine is reconstructed on another available machine(s). In standalone mode, the Master is a single point of failure. This chapter's Cluster mode using different managers section covers the possible running modes and explains fault tolerance in Spark.
Spark comes with five major components:
Figure 1.3
These components are as follows:
  • The core engine.
  • Spark SQL: A module for structured data processing.
  • Spark Streaming: This extends the core Spark API. It allows live data stream processing. Its strengths include scalability, high throughput, and fault tolerance.
  • MLib: The Spark machine learning library.
  • GraphX: Graphs and graph-parallel computation algorithms.
Spark can access data that's stored in different systems, such as HDFS, Cassandra, MongoDB, relational databases, and also cloud storage services such as Amazon S3 and Azure Data Lake Storage.

Getting Spark

Now, let's get hands-on with Spark so that we can go deeper into the core APIs and libraries. In all of the chapters of this book, I will be referring to the 2.2.1 release of Spark, however, several examples that are presented here should work with the 2.0 release or later. I will put a note when an example is specifically for 2.2+ releases only.
First of all, you need to download Spark from its official website (https://spark.apache.org/downloads.html). The download page should look like this:
Figure 1.4
You need to have JDK 1.8+ and Python 2.7+ or 3.4+ (only if you need to develop using this language). Spark 2.2.1 supports Scala 2.11. The JDK needs to be present on your user path system variable, though, alternatively, you could have your user JAVA_HOME environment variable pointing to a JDK installation.
Extract the content of the downloaded archive to any local directory. Move to the $SPARK_HOME/bin directory. There, among the other executables, you will find the interactive Spark shells for Scala and Python. They are the best way to get familiar with this framework. In this chapter, I am going to present examples that you can run through these shells.
You can run a Scala shell using the following command:
$SPARK_HOME/bin/spark-shell.sh
If you don't specify an argument, Spark assumes that you're running locally in standalone mode. Here's the expected output to the console:
Spark context Web UI available at http://10.72.0.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1518131682342).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.1
/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
The web UI is available at the following URL: http://<host>:4040.
It will give you the following output:
Figure 1.5
From there, you can check the status of your jobs and executors.
From the output of the console startup, you will notice that two built-in variables, sc and spark, are available. sc represents the SparkContext (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext), which in Spark < 2.0 was the entry point for each application. Through the Spark context (and its specializations), you can get input data from data sources, create and manipulate RDDs (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD), and attain the Spark primary abstraction before 2.0. The RDD programming section will cover this topic and other operations in more detail. Starting from release 2.0, a new entry point, SparkSession (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession), and a new main data abstraction, the Dataset (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset), were introduced. More details on them are presented in the following sections. The SparkContext is still part of the Spark API so that compatibility with existing frameworks not supporting Spark sessions is ensured, but the direction the project has taken is to move development to use the SparkSession.
Here's an example of how to read and manipulate a text file and put it into a Dataset using the Spark shell (the file used in this example is part of the resources for the examples that are bundled with the Spark distribution):
scala> spark.read.textFile("/usr/spark-2.2.1/examples/src/main/resources/people.txt")
res5: org.apache.spark.sql.Dataset[String] = [value: string]
The result is a Dataset instance that contains the file lines. You can then make several operations on this Dataset, such as counting the number of lines:
scala> res5.count()
res6: Long = 3
You can also get the first line of the Dataset:
scala> res5.first()
res7: String = Michael, 29
In this example, we used a path on the local fi...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. About Packt
  4. Contributors
  5. Preface
  6. The Apache Spark Ecosystem
  7. Deep Learning Basics
  8. Extract, Transform, Load
  9. Streaming
  10. Convolutional Neural Networks
  11. Recurrent Neural Networks
  12. Training Neural Networks with Spark
  13. Monitoring and Debugging Neural Network Training
  14. Interpreting Neural Network Output
  15. Deploying on a Distributed System
  16. NLP Basics
  17. Textual Analysis and Deep Learning
  18. Convolution
  19. Image Classification
  20. What's Next for Deep Learning?
  21. Appendix A: Functional Programming in Scala
  22. Appendix B: Image Data Preparation for Spark
  23. Other Books You May Enjoy