Summary of Apache Spark
Original Papers:
1. “Spark: Cluster Computing with Working Sets” by Matei Zaharia, et al.
Hotcloud 2010.
2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
Cluster Computing” by Matei Zaharia, et al. NSDI 2012.
Motivation
• MapReduce is great, but,
• There are applications where you iterate on the same set of data, e.g.,
for many iterations {
old_data = new_data;
new_data = func(old_data);
}
• Problem: The body of each iteration can be described as a MapReduce task whose
inputs and final outputs are GFS files. Storing the output data to GFS and then
reading it back in the next iteration is redundant work.
• Idea: We can provide a mode that caches the final outputs in memory if possible.
o Challenge: but if a machine crashes, we lose the outputs. Can we recover?
o Solution: store the lineage of the data so that it can be reconstructed as needed (e.g., if
it is lost or memory is insufficient).
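In Spark, the cached-iteration idea above can be written as a short loop (a minimal sketch, assuming a running SparkContext `sc`; `step` is a placeholder for the real per-iteration update):

```scala
// Iterate on the same dataset without writing it back to stable storage.
// `step` is a hypothetical per-iteration update function.
val step = (x: Double) => x * 0.9 + 1.0
var data = sc.parallelize(1 to 1000).map(_.toDouble).cache()
for (i <- 1 to 10) {
  data = data.map(step).cache() // result stays in memory across iterations
}
data.count() // action forces evaluation of the whole chain
```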
Motivation
• Spark’s goal was to generalize MapReduce to support new apps
within same engine
o MapReduce problems can be expressed in Spark too.
o Where Spark shines and MapReduce does not: applications that need to
reuse a working set of data across multiple parallel operations
• Two reasonably small additions allowed the previous specialized
models to be expressed within Spark:
o fast data sharing
o general DAGs
Motivation
Some key points about Spark:
• handles batch, interactive, and real-time within a single framework
• native integration with Java, Python, and Scala
o Has APIs written in these languages
• programming at a higher level of abstraction
• more general: map/reduce is just one set of supported constructs
Use Example
• We’ll run Spark’s interactive shell…
./bin/spark-shell
• Then from the “scala>” REPL prompt, let’s create some data…
val data = 1 to 10000
• Create an RDD based on that data…
val distData = sc.parallelize(data)
• Then use a filter to select values less than 10…
distData.filter(_ < 10).collect()
Resilient Distributed Datasets (RDD)
• Represents a read-only collection of objects partitioned across a set
of machines that can be rebuilt if a partition is lost.
• RDDs can only be created through deterministic operations (aka
transformations) on:
o Either data in stable storage:
 Any file stored in HDFS
 Or other storage systems supported by Hadoop
o Or other RDDs
• A program cannot reference an RDD that it cannot reconstruct after a
failure
Resilient Distributed Datasets (RDD)
• Two types of operations on RDDs: transformations and actions
o Programmers start by defining one or more RDDs through transformations on data in stable
storage (e.g., map and filter). Transformations create a new dataset from an existing one
 transformations are lazy (not computed immediately)
 instead they remember the transformations applied to some base dataset
 The transformed RDD gets recomputed when an action is run on it (default)
o They can then use these RDDs in actions, which are operations that return a value to the
application or export data to a storage system.
• However, an RDD can be persisted in memory or on disk
o Each node stores in memory any slices of it that it computes and reuses them in other
actions on that dataset – often making future actions more than 10x faster
o The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created it
o Note that by default, the base RDD (the original data in stable storage) is not loaded into
RAM because the useful data after transformation might be only a small fraction (small
enough to fit into memory).
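Laziness and persistence can be seen in a short shell session (a sketch, assuming the usual `sc` from spark-shell):

```scala
val nums  = sc.parallelize(1 to 1000000)  // nothing is computed yet
val evens = nums.filter(_ % 2 == 0)       // still nothing: only lineage is recorded
evens.persist()                           // mark the RDD for in-memory caching
evens.count()                             // first action: computes and caches partitions
evens.take(5)                             // reuses the cached partitions
```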
Study Notes: Apache Spark
RDD Implementation
Common interface of each RDD:
• A set of partitions: atomic pieces of the dataset
• A set of dependencies on parent RDDs: one RDD can have multiple parents
o narrow dependencies: each partition of the parent RDD is used by at most one
partition of the child RDD.
o wide dependencies: multiple child partitions may depend on one parent
partition; these require a shuffle operation.
• A function for computing the dataset based on its parents
• Metadata about its partitioning scheme
• Metadata about its data placement, e.g., preferredLocations(p) returns a
list of nodes where partition p can be accessed faster due to data locality
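The interface above can be summarized as a Scala sketch (names are illustrative; Spark's internal API differs in detail):

```scala
// Illustrative stubs, not Spark's real types.
trait Partition { def index: Int }
trait Dependency[T]
trait Partitioner

abstract class SketchRDD[T] {
  def partitions: Seq[Partition]                     // atomic pieces of the dataset
  def dependencies: Seq[Dependency[_]]               // parent RDDs (narrow or wide)
  def compute(p: Partition): Iterator[T]             // derive a partition from parents
  def partitioner: Option[Partitioner]               // partitioning scheme metadata
  def preferredLocations(p: Partition): Seq[String]  // nodes where p is local
}
```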
Narrow vs Wide Dependencies
• Narrow dependencies allow for pipelined
execution on one cluster node, which can
compute all the parent partitions. Wide
dependencies require data from all parent
partitions to be available and to be
shuffled across the nodes using a
MapReduce-like operation.
• Recovery after a node failure is more
efficient with a narrow dependency, as
only the lost parent partitions need be
recomputed, and re-computation can be
done in parallel
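For instance (a sketch, assuming `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Narrow: each output partition depends on exactly one parent partition.
val narrow = pairs.mapValues(_ + 1)
// Wide: output partitions need data from all parent partitions, so a shuffle is required.
val wide = pairs.groupByKey()
```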
Job Scheduling on RDDs
• Similar to Dryad, but takes data locality into account
• When a user runs an action, the scheduler builds a DAG of
stages to execute. Each stage contains as many
pipelined transformations with narrow dependencies
as possible.
• Stage boundaries are the shuffle operations required
for wide dependencies
• Scheduler launches tasks to compute missing
partitions from each stage. Tasks are assigned to
machines based on data locality using delay
scheduling.
• For wide dependencies, intermediate records are
materialized on the nodes holding parent partitions
(similar to the intermediate map outputs of
MapReduce) to simplify fault recovery.
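The stage structure can be inspected from the lineage; in `toDebugString` output, a shuffle boundary appears as a new indentation level (sketch, assuming `sc`):

```scala
val words  = sc.textFile("hdfs://...").flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)  // reduceByKey introduces a shuffle
println(counts.toDebugString)                      // prints the DAG and its stage boundary
```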
Checkpointing
• Although lineage can always be used to recover RDDs after a failure,
checkpointing can be helpful for RDDs with long lineage chains
containing wide dependencies.
• For RDDs with narrow dependencies on data in stable storage,
checkpointing is not worthwhile. Reconstruction can be done in
parallel for these RDDs, at a fraction of the cost of replicating the
whole RDD.
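A checkpointing sketch (assuming `sc`; the checkpoint directory path is illustrative):

```scala
sc.setCheckpointDir("hdfs://.../checkpoints")  // durable location for checkpoint data
var rdd = sc.parallelize(1 to 1000)
for (i <- 1 to 100) rdd = rdd.map(_ + 1)       // builds a long lineage chain
rdd.checkpoint()                               // request truncation of the lineage here
rdd.count()                                    // the action materializes the checkpoint
```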
RDD vs Distributed Shared Memory (DSM)
• Previous frameworks that support data reuse (e.g., Pregel and Piccolo):
o Perform data sharing implicitly for these patterns
o Specialized frameworks; do not provide abstractions for more general reuse
o Programming interface supports fine-grained updates (reads and writes to each
memory location): fault-tolerance requires expensive replication of data across
machines or logging of updates across machines
• RDD:
o Only coarse-grained transformations (e.g., map, filter, and join): apply the same
operation to many data items. Note that reads on RDDs can still be fine-grained.
o Fault-tolerance only requires logging the transformation used to build a dataset
instead of the actual data
o RDDs are not suitable for applications that make asynchronous fine-grained updates
to shared state.
RDD vs Distributed Shared Memory (DSM)
• Other benefits of RDDs:
o RDDs are immutable. A system can mitigate slow nodes (stragglers) by
running backup copies of slow tasks, as in MapReduce.
o In bulk operations, a runtime can schedule tasks based on data locality to
improve performance.
o RDDs degrade gracefully when there is not enough memory to store them. An
LRU eviction policy is used at the level of RDDs: a partition from the least
recently accessed RDD is evicted to make room for a newly computed RDD
partition. This is user-configurable via the “persistence priority” for each RDD.
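Storage behavior is controlled through persistence levels; for example, evicted partitions can spill to disk instead of being recomputed (sketch, assuming `sc`):

```scala
import org.apache.spark.storage.StorageLevel

val big = sc.textFile("hdfs://...")
big.persist(StorageLevel.MEMORY_AND_DISK)  // evicted partitions spill to disk
big.count()                                // action computes and stores partitions
```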
Debugging RDDs
• One can reconstruct the RDDs later from the lineage and let the user
query them interactively
• One can re-run any task from the job in a single-process debugger by
recomputing the RDD partitions it depends on.
• Similar to the replay debuggers but without the capturing/recording
overhead.
RDD Example
// load error messages from a log into memory; then interactively search for
// various patterns
// base RDD
val lines = sc.textFile("hdfs://...")
// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()
// action 1
messages.filter(_.contains("mysql")).count()
// action 2
messages.filter(_.contains("php")).count()
(Diagram: each transformation (filter, map) yields a new RDD; the count() actions return a value to the driver.)
Shared Variables
• Broadcast variables let the programmer keep a read-only variable cached
on each machine rather than shipping a copy of it with every task
o For example, to give every node a copy of a large input dataset efficiently
• Spark also attempts to distribute broadcast variables using efficient
broadcast algorithms to reduce communication cost
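A broadcast sketch (assuming `sc`; the lookup table is made up for illustration):

```scala
// Ship the lookup table to each node once, not once per task.
val lookup = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))
val logs   = sc.parallelize(Seq("ERROR", "INFO", "ERROR"))
val levels = logs.map(l => lookup.value.getOrElse(l, 0))  // workers read the local copy
levels.sum()
```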
Shared Variables
• Accumulators are variables that can only be “added” to through an
associative operation.
o Used to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and
standard mutable collections, and programmers can extend support to new
types
• Only the driver program can read an accumulator’s value, not the
workers
o Each accumulator is given a unique ID upon creation
o Each worker creates a separate copy of the accumulator
o Each worker sends a message to the driver about its updates to the accumulator
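An accumulator sketch using the `longAccumulator` API of newer Spark versions (older releases used `sc.accumulator`); the CSV layout is made up for illustration:

```scala
val badLines = sc.longAccumulator("badLines")  // counter readable only by the driver
val parsed = sc.textFile("hdfs://...").flatMap { line =>
  val cols = line.split(",")
  if (cols.length == 2) Some(cols) else { badLines.add(1); None }  // count malformed rows
}
parsed.count()          // the action runs the job and applies the updates
println(badLines.value) // driver reads the aggregated total
```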
Ad

More Related Content

What's hot (19)

Hadoop - Introduction to mapreduce
Hadoop -  Introduction to mapreduceHadoop -  Introduction to mapreduce
Hadoop - Introduction to mapreduce
Vibrant Technologies & Computers
 
Cppt
CpptCppt
Cppt
chunkypandey12
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through Mesos
Datio Big Data
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
JigsawAcademy2014
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
Bhavesh Padharia
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
 
Hadoop2
Hadoop2Hadoop2
Hadoop2
Gagan Agrawal
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
Vigen Sahakyan
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
Jeff Hammerbacher
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
Sharjeel Imtiaz
 
Cassandra no sql ecosystem
Cassandra no sql ecosystemCassandra no sql ecosystem
Cassandra no sql ecosystem
Sandeep Sharma IIMK Smart City,IoT,Bigdata,Cloud,BI,DW
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
Stefanie Zhao
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
EasyMedico.com
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
datasalt
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
Abhishek Mukherjee
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
Hafizur Rahman
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
Mohamed Elsaka
 
Hadoop
HadoopHadoop
Hadoop
Himanshu Soni
 

Viewers also liked (12)

Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
TheMilestoneBrand
 
Visual book summaries
Visual book summariesVisual book summaries
Visual book summaries
chrisvdberge
 
ProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technologyProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technology
Scottish Library & Information Council (SLIC), CILIP in Scotland (CILIPS)
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
Demet Aksoy
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Julius Caesar - Summary
Julius Caesar - SummaryJulius Caesar - Summary
Julius Caesar - Summary
Maximoff
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Great Executive Summaries
Great Executive SummariesGreat Executive Summaries
Great Executive Summaries
Andy Forbes
 
The Lean Startup - Visual Summary
The Lean Startup - Visual SummaryThe Lean Startup - Visual Summary
The Lean Startup - Visual Summary
Brett Suddreth
 
Inside Apple
Inside AppleInside Apple
Inside Apple
Business Book Summaries
 
American Wheels Chinese Roads
American Wheels Chinese RoadsAmerican Wheels Chinese Roads
American Wheels Chinese Roads
Business Book Summaries
 
Fantastic
FantasticFantastic
Fantastic
Business Book Summaries
 
Ad

Similar to Study Notes: Apache Spark (20)

Apache Spark
Apache SparkApache Spark
Apache Spark
SugumarSarDurai
 
Spark
SparkSpark
Spark
Heena Madan
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
Atif Akhtar
 
Spark Deep Dive
Spark Deep DiveSpark Deep Dive
Spark Deep Dive
Corey Nolet
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
SaiSriMadhuriYatam
 
RDD
RDDRDD
RDD
Tien-Yang (Aiden) Wu
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
NouhaElhaji1
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
samthemonad
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
Taposh Roy
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
Atif Akhtar
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
samthemonad
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
Taposh Roy
 
Ad

Recently uploaded (19)

5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx
andani26
 
project_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptxproject_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptx
redzuriel13
 
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 SupportReliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
steve198109
 
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry SweetserAPNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC
 
Determining Glass is mechanical textile
Determining  Glass is mechanical textileDetermining  Glass is mechanical textile
Determining Glass is mechanical textile
Azizul Hakim
 
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
DataProvider1
 
White and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptxWhite and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptx
canumatown
 
OSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description fOSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description f
cbr49917
 
IT Services Workflow From Request to Resolution
IT Services Workflow From Request to ResolutionIT Services Workflow From Request to Resolution
IT Services Workflow From Request to Resolution
mzmziiskd
 
Best web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you businessBest web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you business
steve198109
 
Understanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep WebUnderstanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep Web
nabilajabin35
 
(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security
aluacharya169
 
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHostingTop Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
steve198109
 
DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)
APNIC
 
highend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptxhighend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptx
elhadjcheikhdiop
 
Perguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolhaPerguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolha
socaslev
 
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC
 
Computers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers NetworksComputers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers Networks
Tito208863
 
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation TemplateSmart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
yojeari421237
 
5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx
andani26
 
project_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptxproject_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptx
redzuriel13
 
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 SupportReliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
steve198109
 
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry SweetserAPNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC
 
Determining Glass is mechanical textile
Determining  Glass is mechanical textileDetermining  Glass is mechanical textile
Determining Glass is mechanical textile
Azizul Hakim
 
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
DataProvider1
 
White and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptxWhite and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptx
canumatown
 
OSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description fOSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description f
cbr49917
 
IT Services Workflow From Request to Resolution
IT Services Workflow From Request to ResolutionIT Services Workflow From Request to Resolution
IT Services Workflow From Request to Resolution
mzmziiskd
 
Best web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you businessBest web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you business
steve198109
 
Understanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep WebUnderstanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep Web
nabilajabin35
 
(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security
aluacharya169
 
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHostingTop Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
steve198109
 
DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)
APNIC
 
highend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptxhighend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptx
elhadjcheikhdiop
 
Perguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolhaPerguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolha
socaslev
 
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC
 
Computers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers NetworksComputers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers Networks
Tito208863
 
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation TemplateSmart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
yojeari421237
 

Study Notes: Apache Spark

  • 1. Summary of Apache Spark Original Papers: 1. “Spark: Cluster Computing with Working Sets” by Matei Zaharia, et al. Hotcloud 2010. 2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” by Matei Zaharia, et al. NSDI 2012.
  • 2. Motivation • MapReduce is great, but, • There are applications where you iterate on the same set of data, e.g., for many iterations { old_data = new_data; new_data = func(old_data); } • Problem: The body of each iteration can be described as a MapReduce task, where the inputs and final outputs are GFS files. There is redundant work storing the output data to GFS and then reading it out again in the next iteration. • Idea: We can provide a mode to cache the final outputs in memory if possible. o Challenge: but if the machine crashes, we lose the outputs, can we recover? o Solution: store the lineage of the data so that they can be reconstructed as needed (e.g., if they get lost or insufficient memory).
  • 3. Motivation • Spark’s goal was to generalize MapReduce to support new apps within same engine oMapReduce problems can be expressed in Spark too. oWhere Spark shines and MapReduce does not: applications that need to reuse a working set of data across multiple parallel operations • Two reasonably small additions allowed the previous specialized models to be expressed within Spark: ofast data sharing ogeneral DAGs
  • 4. Motivation Some key points about Spark: • handles batch, interactive, and real-time within a single framework • native integration with Java, Python, Scala. oHas APIs written in these languages • programming at a higher level of abstraction • more general: map/reduce is just one set of supported constructs
  • 5. Use Example • We’ll run Spark’s interactive shell… ./bin/spark-shell • Then from the “scala>” REPL prompt, let’s create some data… val data = 1 to 10000 • Create an RDD based on that data… val distData = sc.parallelize(data) • Then use a filter to select values less than 10… distData.filter(_ < 10).collect()
  • 6. Resilient Distributed Datasets (RDD) • Represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. • RDDs can only be created through deterministic operations (aka transformations) on: oEither data in stable storage:  Any file stored in HDFS  Or other storage systems supported by Hadoop oOr other RDDs • A program cannot reference an RDD that it cannot reconstruct after a failure
  • 7. Resilient Distributed Datasets (RDD) • Two types of operations on RDDs: transformations and actions o Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). Transformations create a new dataset from an existing one  transformations are lazy (not computed immediately)  instead they remember the transformations applied to some base dataset  The transformed RDD gets recomputed when an action is run on it (default) o They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. • However, an RDD can be persisted into storage in memory or disk o Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster o The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it o Note that by default, the base RDD (the original data in stable storage) is not loaded into RAM because the useful data after transformation might be only a small fraction (small enough to fit into memory).
9. RDD Implementation
Common interface of each RDD:
• A set of partitions: atomic pieces of the dataset
• A set of dependencies on parent RDDs: one RDD can have multiple parents
o narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD.
o wide dependencies: multiple child partitions may depend on one parent partition. Requires the shuffle operation.
• A function for computing the dataset based on its parents
• Metadata about its partitioning scheme
• Metadata about its data placement, e.g., preferredLocations(p) returns a list of nodes where partition p can be accessed faster due to data locality
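The common interface can be sketched as a Scala trait; the names below (ToyRDD, CollectionRDD) are simplified illustrations, not Spark's actual internals.

```scala
// Illustrative sketch of the RDD interface listed above (names are invented).
case class Partition(index: Int)

sealed trait Dependency
case class NarrowDependency(parent: ToyRDD[_]) extends Dependency
case class WideDependency(parent: ToyRDD[_]) extends Dependency  // requires a shuffle

trait ToyRDD[T] {
  def partitions: Seq[Partition]                           // atomic pieces of the dataset
  def dependencies: Seq[Dependency]                        // parent RDDs
  def compute(p: Partition): Iterator[T]                   // derive a partition from parents
  def preferredLocations(p: Partition): Seq[String] = Nil  // data-locality hints
}

// A leaf RDD backed by an in-memory collection, split into two partitions.
class CollectionRDD(data: Seq[Int]) extends ToyRDD[Int] {
  val partitions = Seq(Partition(0), Partition(1))
  val dependencies = Nil
  def compute(p: Partition) = {
    val half = (data.length + 1) / 2
    if (p.index == 0) data.take(half).iterator else data.drop(half).iterator
  }
}

val rdd = new CollectionRDD(1 to 10)
val part0 = rdd.compute(Partition(0)).toList
```

A derived RDD would implement `compute` in terms of its parents' partitions, which is exactly the information lineage-based recovery replays.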
10. Narrow vs Wide Dependencies
• Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
• Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and the re-computation can be done in parallel.
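The recovery-cost difference can be made concrete with a toy model: given the dependency type of a lost child partition, which parent partitions must be recomputed? (A simplification: the narrow case assumes a one-to-one partition mapping.)

```scala
// Toy model of recomputation cost after losing one child partition.
sealed trait Dep
case object Narrow extends Dep
case object Wide extends Dep

def parentsToRecompute(dep: Dep, lostChild: Int, numParentPartitions: Int): Set[Int] =
  dep match {
    case Narrow => Set(lostChild)                       // 1-to-1: only the matching parent
    case Wide   => (0 until numParentPartitions).toSet  // shuffle: every parent partition
  }

val narrowCost = parentsToRecompute(Narrow, lostChild = 3, numParentPartitions = 8)
val wideCost   = parentsToRecompute(Wide,   lostChild = 3, numParentPartitions = 8)
```

With a narrow dependency one lost partition costs one parent partition; with a wide dependency the same loss touches all eight, which is why long chains of wide dependencies motivate checkpointing.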
11. Job Scheduling on RDDs
• Similar to Dryad, but takes data locality into account
• When the user runs an action, the scheduler builds a DAG of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible.
• Stage boundaries are the shuffle operations required for wide dependencies
• The scheduler launches tasks to compute missing partitions from each stage. Tasks are assigned to machines based on data locality using delay scheduling.
• For wide dependencies, intermediate records are materialized on the nodes holding parent partitions (similar to the intermediate map outputs of MapReduce) to simplify fault recovery.
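The stage-building rule can be sketched for the simple case of a linear chain of operations: narrow dependencies are pipelined into the current stage, and each wide dependency starts a new one. (A simplification of the real DAG scheduler, which handles arbitrary graphs.)

```scala
// Sketch: split a linear chain of ops into stages, cutting at wide (shuffle) deps.
// Each op is (name, isWide).
def stages(ops: List[(String, Boolean)]): List[List[String]] =
  ops.foldLeft(List(List.empty[String])) { case (acc, (name, isWide)) =>
    if (isWide) (name :: Nil) :: acc          // wide dep: start a new stage
    else (name :: acc.head) :: acc.tail       // narrow dep: pipeline into current stage
  }.map(_.reverse).reverse.filter(_.nonEmpty)

val plan = stages(List(
  ("map", false), ("filter", false),          // pipelined together
  ("groupByKey", true),                       // shuffle: stage boundary
  ("mapValues", false)                        // pipelined after the shuffle
))
```

Here `map` and `filter` fuse into one stage, and the shuffle required by `groupByKey` opens a second stage that also absorbs the narrow `mapValues`.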
12. Checkpointing
• Although lineage can always be used to recover RDDs after a failure, checkpointing can be helpful for RDDs with long lineage chains containing wide dependencies.
• For RDDs with narrow dependencies on data in stable storage, checkpointing is not worthwhile: reconstruction can be done in parallel for these RDDs, at a fraction of the cost of replicating the whole RDD.
13. RDD vs Distributed Shared Memory (DSM)
• Previous frameworks that support data reuse, e.g., Pregel and Piccolo:
o Perform data sharing implicitly for these patterns
o Specialized frameworks; do not provide abstractions for more general reuse
o Programming interface supports fine-grained updates (reads and writes to each memory location): fault tolerance requires expensive replication of data across machines or logging of updates across machines
• RDD:
o Only coarse-grained transformations (e.g., map, filter and join): apply the same operation to many data items. Note that reads on RDDs can still be fine-grained.
o Fault tolerance only requires logging the transformation used to build a dataset instead of the actual data
o RDDs are not suitable for applications that make asynchronous fine-grained updates to shared state.
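The "log the transformation, not the data" idea can be sketched in a few lines: recovery replays a small log of coarse-grained functions on the base data, instead of restoring replicated bytes.

```scala
// Sketch of lineage-based fault tolerance: we log only the coarse-grained
// transformations; a lost result is rebuilt by replaying them on the base data.
val baseData = (1 to 1000).toList                 // stands in for data in stable storage

// The "lineage": each entry transforms the whole dataset.
val lineage: List[List[Int] => List[Int]] = List(
  _.map(_ * 2),
  _.filter(_ % 3 == 0)
)

def materialize(base: List[Int], log: List[List[Int] => List[Int]]): List[Int] =
  log.foldLeft(base)((data, step) => step(data))

val original = materialize(baseData, lineage)
// ... suppose `original` is lost in a crash; replaying the log rebuilds it ...
val recovered = materialize(baseData, lineage)
```

The log is tiny (two closures) regardless of how large the dataset is, which is why this is so much cheaper than DSM-style replication of the data itself.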
14. RDD vs Distributed Shared Memory (DSM)
• Other benefits of RDDs:
oRDDs are immutable. A system can mitigate slow nodes (stragglers) by running backup copies of slow tasks, as in MapReduce.
oIn bulk operations, a runtime can schedule tasks based on data locality to improve performance.
oRDDs degrade gracefully when there is not enough memory to store them. An LRU eviction policy is used at the level of RDDs: a partition from the least recently accessed RDD is evicted to make room for a newly computed RDD partition. This is user-configurable via the “persistence priority” for each RDD.
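The RDD-level LRU policy can be sketched with a simple access-timestamp model (an assumption for illustration, not Spark's actual memory manager): the victim partition comes from the least recently accessed RDD, never from the RDD being computed right now.

```scala
// Sketch of RDD-level LRU eviction (a toy model, not Spark's BlockManager).
import scala.collection.mutable

val cached = mutable.Map[String, Int]()       // rddName -> number of cached partitions
val lastAccess = mutable.Map[String, Long]()  // rddName -> logical access time
var clock = 0L

def access(rdd: String): Unit = { clock += 1; lastAccess(rdd) = clock }

// Evict one partition from the least recently accessed RDD other than the
// RDD requesting space, so the working set is not evicted to serve itself.
def evictOnePartition(requestingRdd: String): Option[String] = {
  val victims = cached.keys.filter(r => r != requestingRdd && cached(r) > 0)
  val victim = victims.toSeq.sortBy(lastAccess.getOrElse(_, 0L)).headOption
  victim.foreach { r => cached(r) -= 1 }
  victim
}

cached ++= Map("logsRDD" -> 4, "featuresRDD" -> 4)
access("logsRDD"); access("featuresRDD")      // featuresRDD is now the most recent
val evictedFrom = evictOnePartition(requestingRdd = "featuresRDD")
```

Because eviction is per-RDD rather than per-partition, an RDD whose partitions are being cycled through does not evict its own still-needed partitions.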
15. Debugging RDDs
• One can reconstruct the RDDs later from the lineage and let the user query them interactively.
• One can re-run any task from the job in a single-process debugger by recomputing the RDD partitions it depends on.
• Similar to replay debuggers, but without the capturing/recording overhead.
16. RDD Example
// load error messages from a log into memory; then interactively search for
// various patterns

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs!
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()

• lines, errors, and messages are all RDDs; filter and map are transformations; each count() is an action that returns a value to the driver.
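The same pipeline can be run on a plain Scala collection to see what each step produces (a local sketch, no SparkContext or cluster involved):

```scala
// Local version of the log-mining example, using ordinary Scala collections.
val logLines = List(
  "INFO\tstartup",
  "ERROR\tmysql connection refused",
  "ERROR\tphp timeout",
  "ERROR\tmysql deadlock"
)

val errs = logLines.filter(_.startsWith("ERROR"))   // keep only error lines
val msgs = errs.map(_.split("\t")).map(r => r(1))   // extract the message field

val mysqlCount = msgs.count(_.contains("mysql"))    // "action 1"
val phpCount   = msgs.count(_.contains("php"))      // "action 2"
```

In Spark the same chain would be lazy and distributed, and `messages.cache()` would keep the extracted messages in memory across the two counts.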
20. Shared Variables
• Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks.
oFor example, to give every node a copy of a large input dataset efficiently
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
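The saving can be sketched with a toy cost model: without broadcast, the read-only dataset ships once per task; with broadcast, once per node. (The numbers below are illustrative.)

```scala
// Toy cost model for broadcast variables.
case class Cluster(nodes: Int, tasksPerNode: Int)

def copiesShipped(c: Cluster, broadcast: Boolean): Int =
  if (broadcast) c.nodes                  // cached once per machine
  else c.nodes * c.tasksPerNode           // shipped with every task

val c = Cluster(nodes = 10, tasksPerNode = 16)
val without = copiesShipped(c, broadcast = false)
val withBc  = copiesShipped(c, broadcast = true)
```

In Spark itself this corresponds to wrapping the dataset with `sc.broadcast(...)` and reading it inside tasks via `.value`.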
21. Shared Variables
• Accumulators are variables that can only be “added” to through an associative operation.
oUsed to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend support to new types.
• Only the driver program can read an accumulator’s value, not the workers.
oEach accumulator is given a unique ID upon creation
oEach worker creates a separate copy of the accumulator
oEach worker sends a message to the driver about its updates to the accumulator
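The protocol above can be sketched in plain Scala: each "worker" accumulates into its own local copy, sends one update back, and the driver merges the updates with an associative add. (ToyAccumulator is an invented name for illustration.)

```scala
// Sketch of the accumulator protocol: local copies on workers, merged at the driver.
class ToyAccumulator(val id: Int) {
  private var total = 0L                      // only the driver reads this
  def merge(workerLocal: Long): Unit = total += workerLocal
  def value: Long = total                     // driver-side read
}

val acc = new ToyAccumulator(id = 1)

// Each "worker" counts the records in its partition into a local copy...
val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))
val workerLocals = partitions.map(p => p.size.toLong)

// ...and sends a single update message back to the driver, which merges them.
workerLocals.foreach(acc.merge)
```

Because addition is associative, the updates can arrive in any order and the driver-side total is still correct, which is what makes accumulators cheap to implement in parallel.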

Editor's Notes

  • #9: Lineage: a pointer to the parent RDD, plus information about the transformation performed on the parent