The way Spark scales data analysis problems is that it runs on top of a cluster manager, so your actual Spark scripts are just everyday scripts written in Python, Java, or Scala; they behave just like any other script. Your "driver program" is what we call it, and it runs on your desktop or on one master node of your cluster. However, under the hood, when you run it, Spark knows how to take the work and farm it out to different computers on your cluster, or even to different CPUs on the same machine. Spark can run on top of different cluster managers. It has its own built-in cluster manager that you can use by default, but if you have access to a Hadoop cluster, there's a component called YARN that Spark can also run on top of to distribute work among a huge Hadoop cluster, if you have one available. For example, you can use Amazon's Elastic MapReduce service to get cheap and easy access to a Hadoop cluster, which we'll do later on in this course. As illustrated in the following diagram, the cluster manager splits up the work and coordinates it among various executors. Spark will create multiple executors per machine; ideally, you want one per CPU core. It does all the coordination using the cluster manager, and also your driver program itself, to farm out work, distribute it to different nodes, and give you fault tolerance. So, if one of your executors goes down, it can recover without stopping your entire job and making you start it all over again.
The beauty of it is that it scales out to an entire cluster of computers, giving you horizontal partitioning and horizontal scalability; basically, the sky is the limit. However, from a user's standpoint and from a developer's standpoint, it's all just one simple little program running on one computer, and it feels a lot like writing any other script. This is a really nice aspect of Spark.
Why do people use Spark? Well, it has a lot in common with MapReduce; it really solves the same sort of problems, so why are people using Spark instead of MapReduce? After all, MapReduce has been around a lot longer, and the ecosystem and tools surrounding it are more mature at this point. Well, one of the main reasons is that Spark is really fast. On the Apache website, they claim that Spark can run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk. Now, to be honest, that's a little bit of hyperbole; that figure comes from a very contrived example. In my own experiments, if you compare some of the tasks that we run on Spark with the same tasks run using MapReduce, it's not 100 times faster; it's definitely faster, but maybe two to three times faster. Still, Spark definitely has performance going for it.
The way it achieves that performance is with what it calls a directed acyclic graph (DAG) engine. The clever thing about Spark is that it doesn't actually do anything until you ask it to deliver results; at that point, it creates a graph of all the different steps it needs to put together to achieve the results you want. Because it can wait until it knows exactly what you're asking for, it can figure out the optimal path to answering your question.
It's also very hot. If you go to this web page, you'll find a list of some of the organizations known to be using Spark: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
A lot of big companies are kind of secretive about what they're doing inside, so I'm sure there are even more people using it than are listed here, but we know a lot of the big players are already using Spark: Amazon, eBay, NASA's Jet Propulsion Laboratory, Yahoo, and many others. The list isn't as long as MapReduce's because Spark hasn't been around as long, but it's definitely seeing big adoption quickly.
The beautiful thing about Spark is it's not that hard. It allows you to code in Python, Java, or Scala, so if you're already familiar with Python, you don't have to learn a new language. From a developer's standpoint, it's built around one main concept, the Resilient Distributed Dataset (RDD). There is one main kind of object that you'll encounter in Spark over and over again: the RDD. Various operations on an RDD let you slice, dice, and carve up that data and do what you want with it. So, what would take many lines of code and different functions in a MapReduce job can often be done in just one line, much more quickly and efficiently, using Spark.
If you took my MapReduce course, you'll see that we're going to be solving a lot of the exact same problems in this Spark course. You might find it interesting how much simpler and easier those same problems are to solve in Spark compared to MapReduce in many cases.
Spark is made up of many components. We're going to be focusing a lot on Spark Core, which means looking at what you can do with RDD objects and how you can use that to distribute the processing of data and the mapping and reducing of large datasets. But in addition to Spark Core, Spark also includes several libraries that run on top of it, as shown in the following diagram:
We have Spark Streaming, which gives you the ability to analyze real-time data streams: for instance, a set of web logs coming in from a fleet of web servers that need to be continually analyzed as they arrive. We'll talk about Spark Streaming later in the book. We have Spark SQL, which lets you run Spark on top of a Hive context, deal with structured data within Spark, and actually run SQL queries on top of it. So, if you are familiar with SQL and you want to treat Spark as sort of a data warehouse, you can do that too. Spark also includes something called MLlib, which is a series of machine learning algorithms; if you're going to be doing any sort of machine learning or data mining with Spark, it contains many useful tools that simplify a lot of common operations. For example, if you need to compute a Pearson correlation or get statistical properties of your dataset, MLlib makes that very easy to do. Finally, we have GraphX. That's not the kind of graphs you draw on graph paper; it's about managing graphs of information, network theory sort of stuff. So, if you have a graph, for example, a social graph of people who are friends with each other, or citations between scholarly articles, or other things of that nature, GraphX can help you make sense of those networks and give you high-level information about their properties. All these libraries run on top of Spark Core. Spark has a lot to offer, and it's expanding all the time.
Now, people who are already familiar with Spark might ask why I'm using Python in this book. Well, I'm using Python as the scripting language that I'm going to be working with, and there are some pretty good reasons, so I want to defend my choice here a little bit. For one thing, it's a lot easier to get up and running with Python under Spark. You don't have to deal with dependencies, running Maven, figuring out how to get JAR files and where they need to be, and whatnot. With Python, you just type your code; you don't have to compile it, it just runs, and that makes life a lot simpler.
In this book, I really want to focus on the concepts behind Spark, how to deal with RDDs, what you can do with them, and how you can put these different things together; I don't really want to be focusing on things such as the mechanics of compiling and distributing JAR files and dealing with the Java and Scala stuff. Also, there's probably a better chance you are already familiar with Python as opposed to Scala, which is a newer language, and like I said, Java is just a little bit more complicated.
However, I will s...