In Lecture 7 of our Big Data in 30 hours class, we discussed Apache Spark and did some hands-on programming. The purpose of this memo is to summarize the terms and ideas presented.
Apache Spark is currently one of the most popular platforms for parallel execution of computing jobs in a distributed environment. The idea is not new. Starting in the late 1980s, the HPC (high-performance computing) community executed jobs in parallel over clusters, supercomputers and compute farms. Technologies of the time, broadly related to job scheduling, included PVM, MPI, PBS, Platform LSF, Sun Grid Engine, Globus, Moab, and many more. In the first decade of the 2000s, cluster computing went mainstream with the advent of high-level APIs, cloud environments and the Hadoop MapReduce model (discussed in the previous lecture).
Hadoop (original credits to Doug Cutting and Mike Cafarella; later taken over by Apache) became so popular that for a while it was the de facto standard in the field. However, the MapReduce model has some deficiencies:
- focus on batch operations, while the market demand drifted toward real-time online processing
- limited, inflexible API: only some types of computations can be expressed in this model
- missing abstractions for advanced workflows: streamed data, interactive data, DAG workflows, heterogeneous tasks
Apache Spark (original credits to Matei Zaharia at UC Berkeley’s AMPLab) came to light in the early 2010s as a response to these deficiencies. Spark is a distributed processing engine:
- written in Scala, with programming interface in Scala, Python, R and SQL
- Focused on in-memory processing
- 10–100× faster than Hadoop MapReduce
- allows for a wide range of workflows; flexible and easy to program
- many distributed operations happen implicitly; the programmer only implies them in the source code
- leverages a lot of existing cluster infrastructure underneath: Mesos, YARN, HDFS
Spark today is much more than just a scheduler. It has Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph) libraries. Altogether, it is very robust and versatile. Here we will only focus on an introduction to Spark Core, but keep in mind there’s more to it.
The architectural components involved in Spark distributed task execution are the driver program with its SparkContext object, a cluster manager, and executors placed on worker nodes. This is described nicely on Jacek Laskowski’s pages (thanks Jacek, great job!) or here.
Installation
- for standalone mode, consider the Mesosphere Spark docker image. With docker, installing Spark boils down to one command: docker pull mesosphere/spark
- another option is the SequenceIQ Spark docker image. This is an older Spark version (1.6), however it works well and saves disk space, since it reuses the Hadoop image we used in the previous lecture
- for a cluster-aware installation, to replicate the cluster Spark configuration on the 8 Ubuntu machines used in our lab, you may use Diwakar’s instructions here (thanks Diwakar!)
Spark concepts
- RDD (resilient distributed dataset) is the basic data structure.
- HDFS is the default storage, with implications in the programming model: RDDs are immutable, so you do not modify an existing RDD, you create a new one.
- RDDs are sharded into partitions. One RDD can be split into several slices placed on different partitions of the storage
- lazy evaluation is common: often, things will not be evaluated (executed) until they are really needed
- a Spark program is a series of transformations (operations that transform an RDD into a new RDD) and actions (operations that return a result or write output rather than creating a new RDD).
- the workflow of your program is referred to as a DAG (directed acyclic graph). You do not need to define the DAG; the DAG scheduler constructs it by looking at your program. For instance, whenever the scheduler sees the need to transport data between partitions, it will create a new stage
- use WebUI (typically, host:4040, but look at pyspark console to get the actual address) to inspect DAG and its execution
Spark programming: first steps
To play with the material here, download the Lecture 7 slide deck so you can copy the examples. The notes below assume a Python programming environment.
- use pyspark (or spark-shell, if you prefer Scala) to start interactive work with Spark. Important objects are sc (an instantiation of SparkContext) and spark (an instantiation of SparkSession).
- use spark-submit for launching batch scripts over the Spark cluster
- to create an RDD from data in a text file, use sc.textFile(). To create an RDD from a Python list, use sc.parallelize()
- start with these basic RDD operations: cache() forces data placement (remember lazy execution), count() returns the number of records, first() returns the first record
- print an RDD using print(rdd.take(number-of-records)) for a large data set, or rdd.collect() for a small one
- operations we learned: filter() and map() (transformations), foreach() and reduce() (actions). Consult the slide deck for some basic source code.
- newer Spark versions introduce Datasets (DataFrames) alongside RDDs. Create a DataFrame with spark.read.text(). Some operations, like filtering, have a different syntax for DataFrames.
- create shared variables useful in distributed environments: sc.accumulator() and sc.broadcast()
An RDD keeps its data split into slices (partitions). A few starting points for working with partitions:
- you can tell Spark how many partitions you want for a data set. For instance, to create an RDD with 10 partitions, type sc.parallelize(data, 10) or sc.textFile(name, 10)
- inspect an RDD for partitioning: rdd.getNumPartitions(), rdd.partitioner, rdd.glom().collect()
- repartition the data using coalesce() and repartition()
- shuffling is moving data between partitions. Shuffling happens implicitly and should be avoided for performance reasons
- the DAG scheduler defines the distributed processing workflow, divided into stages. The process of defining stages is closely linked with the way data is partitioned
- when the DAG scheduler sees a transformation that requires shuffling, it will create a new stage of tasks. This is called a wide transformation; example: reduceByKey(). In contrast, a narrow transformation (one that does not involve shuffling, e.g. map(), filter()) will not cause a new processing stage
- stages and tasks are passed to the cluster manager, e.g. YARN or Mesos, which schedules them onto worker nodes
- consult the word count example in the slide deck to review a simple DAG
Where to go from here
- pyspark API reference. Start reading with RDD, DataFrame, SparkContext, and SparkSession
- online examples: Pi, wordcount, ML
- Local spark-submit examples: in your install directory, check the examples directory, for instance examples/src/main/python/pi.py
- Many more examples on Github
- Jacek Laskowski’s Mastering Apache Spark gitbook
To try things out, first consult the slide deck if you haven’t already, and try the examples presented there. Once you’ve done that, here are some suggested homework exercises to better understand the basics of Spark programming:
- Where exactly is the data of your partitioned RDD? What data is where? Find out how to check this. Provide an example where you store data, check that it really is there, shuffle it elsewhere, and check again
- Which node exactly is performing the parallel RDD execution? Which component of the Spark system makes this decision? How can we influence this? When and why should we?
- What are the performance considerations regarding the distributed RDD organization and execution? Provide an example where we can improve performance by better organization of the data and source code
- Provide a practical example with Broadcast and Accumulator variables. What happens if we mistakenly use local variables instead?
- implement equivalents of some common database queries on CSV data, making intelligent use of RDD features:
SELECT COUNT(*) FROM x GROUP BY y
SELECT a,b FROM x INNER/OUTER JOIN y ON x.t=y.t
- implement some common NLP (natural language processing) task, in its basic form, making intelligent use of RDD parallel execution.
- how is Spark better than historic Hadoop? Read about MapReduce. Demonstrate how to implement MapReduce in a simple Spark job. Augment that job with activities that you couldn’t (easily) have implemented in raw Hadoop MapReduce. Discuss.
- Take your favorite ML code and port it to Spark. Explain where it starts making sense.