Apache Spark
Systems Programming
Lecture 9 Apache Spark
What is Apache Spark?
• A single workflow for sophisticated analytics, integrating streaming data, machine learning, map/reduce operations, and queries
• Spark is compatible with all other forms of Hadoop data
– But Hadoop is not aimed at many important applications!
• Lazy Evaluation: “Execution will not start until an action is triggered”, e.g., data is not loaded until an action requires it
• An active community around the globe working with Spark is a very important feature
Executing Spark Programs
• Iterative and interactive processes become more efficient because of efficient data reuse
– Iterative examples include PageRank, K-means clustering and logistic regression
– An interactive example is interactive data mining, where several queries are run on the same dataset
– Spark handles such data reuse better than previous database and MapReduce technologies
• Previous frameworks required that data be written to a stable file storage system (e.g., a distributed file system)
– This process creates overhead that can dominate execution times (a sketch of the in-memory alternative follows this list)
• Data replication
• Disk I/O
• Serialization
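As a minimal illustration of why in-memory reuse matters, here is a Scala sketch (assuming a local SparkSession; the data and the iterative update are made up): the parsed RDD is cached once, and every iteration reuses it from memory instead of re-reading it from stable storage.

import org.apache.spark.sql.SparkSession

object IterativeReuse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeReuse")
      .master("local[*]")            // local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // In a real job this would come from a distributed file system;
    // parallelize() keeps the sketch self-contained.
    val points = sc.parallelize(1 to 100000)
      .map(_.toDouble)
      .cache()                       // keep the data in memory across iterations

    // Each iteration reuses the cached RDD instead of re-reading it from
    // stable storage, avoiding the replication / disk I/O / serialization
    // overhead listed above.
    var estimate = 0.0
    for (_ <- 1 to 10)
      estimate = points.map(p => (p + estimate) / 2).mean()
    println(s"estimate = $estimate")

    spark.stop()
  }
}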
Features of Spark RDD
• Lazy Evaluation
– In Spark, lazy evaluation means that results are not computed immediately; execution starts only when an action is triggered (see the sketch after this list)
– When a transformation occurs, Spark does not evaluate it
• Spark merely maintains a record of the operation
• Fault Tolerance
– RDDs track data lineage and rebuild data upon failures
– RDDs do this by remembering how they were created from other datasets
• RDDs are created by transformations (map, join …)
• Immutability
– Data can be safely shared across processes
– Data can be created and retrieved at any time
• Ensures easy caching, sharing and replication
– Allows computations to be consistent
• Partitioning
– Fundamental unit of parallelism within Spark RDD
• Each partition is an immutable logical division of data
– Partitions are created by performing a transformation of existing partitions
• Persistence
– RDDs can be persisted (cached) in memory or on disk for reuse (see RDD Persistence below)
• Coarse-grained Operations
– Transformations such as map or filter are applied to the whole dataset, not to individual elements
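To make lazy evaluation, lineage and fault tolerance concrete, here is a small Scala sketch (local mode; names are illustrative). The transformations only record operations, toDebugString prints the lineage graph Spark would replay after a failure, and nothing actually runs until count() is called.

import org.apache.spark.sql.SparkSession

object LazyLineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("LazyLineageDemo")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    val nums = sc.parallelize(1 to 1000, numSlices = 4)   // 4 partitions

    // Transformations: Spark only records the operations, nothing runs yet.
    val evens   = nums.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // The lineage graph used to rebuild lost partitions on failure:
    println(squares.toDebugString)

    // The action finally triggers the computation.
    println(squares.count())                              // 500
  }
}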
Supported RDD Operations
Transformation examples
• Many are classic database operations
• map() - the supplied function is applied to each element of the RDD
• flatMap() - similar to map, but each input element can be mapped to zero or more output elements
• filter() - creates a new RDD containing only the elements of the original RDD that satisfy a predicate
• mapPartitions() - works on whole RDD partitions rather than individual elements; useful when per-partition setup is needed in distributed computation
• union(dataset) - performs a union of two RDDs (putting their elements together in a single RDD)
• intersection() - finds the elements common to both RDDs and creates a new RDD from them
• distinct() - finds the distinct elements of an RDD and creates a new RDD from them
• join() - performs an inner or outer join on pair (key-value) datasets
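A short Scala sketch of these transformations (local mode; the tiny in-memory datasets are made up for illustration):

import org.apache.spark.sql.SparkSession

object TransformationExamples {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("TransformationExamples")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    val lines = sc.parallelize(Seq("to be", "or not"))
    val b = sc.parallelize(Seq(1, 2, 2, 3))
    val c = sc.parallelize(Seq(2, 3, 4))

    val words   = lines.flatMap(_.split(" "))   // one line -> many words
    val lengths = words.map(_.length)           // each word -> its length
    val shorts  = words.filter(_.length <= 2)   // keep only short words
    val all     = b.union(c)                    // 1, 2, 2, 3, 2, 3, 4
    val common  = b.intersection(c)             // 2, 3
    val uniques = b.distinct()                  // 1, 2, 3

    // join() works on pair (key-value) RDDs, matching elements by key:
    val x = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val y = sc.parallelize(Seq(("a", 9), ("c", 7)))
    println(x.join(y).collect().mkString(", ")) // (a,(1,9))

    println(uniques.collect().sorted.mkString(", "))
  }
}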
Supported RDD Operations
Actions
• When you need to work with the actual dataset, an action must be performed: transformations simply create new RDDs from other RDDs, whereas actions actually produce values
– Values are produced by actions by executing the RDD lineage graph
• An action in Spark is an RDD operator that will produce a non-RDD value
– These values are stored either in the driver or in an external storage system
• Actions are what bring the "laziness" of RDDs to the spotlight
– Results are only computed when an action is called
– This is why synchronization is key: no results exist until an action has run
• Actions are one way of sending data from the executors to the driver; the other is accumulators
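A minimal sketch of the accumulator path (Scala, local mode; counting bad records is a made-up use case): executors add to the accumulator while the job runs, and the driver reads its value once an action has executed.

import org.apache.spark.sql.SparkSession

object AccumulatorDemo {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("AccumulatorDemo")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    val badRecords = sc.longAccumulator("badRecords")

    val nums = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
      try Some(s.toLong)
      catch { case _: NumberFormatException =>
        badRecords.add(1)   // updated on the executors
        None
      }
    }

    // sum() is an action: it runs the job and returns a value to the driver.
    println(s"sum = ${nums.sum()}, bad records = ${badRecords.value}")
  }
}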
Supported RDD Operations
Action examples
• count()
• collect()
• take(n)
• top(n)
• countByValue()
• reduce()
• fold()
• aggregate()
• foreach()
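The same actions in a small Scala sketch (local mode, made-up data; the comments show the expected results):

import org.apache.spark.sql.SparkSession

object ActionExamples {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("ActionExamples")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    val nums = sc.parallelize(Seq(3, 1, 4, 1, 5))

    println(nums.count())              // 5
    println(nums.collect().toList)     // all elements, sent to the driver
    println(nums.take(2).toList)       // first 2 elements
    println(nums.top(2).toList)        // 2 largest: List(5, 4)
    println(nums.countByValue())       // Map(1 -> 2, 3 -> 1, 4 -> 1, 5 -> 1)
    println(nums.reduce(_ + _))        // 14
    println(nums.fold(0)(_ + _))       // like reduce, but with a zero value

    // aggregate() can return a different type; here (sum, count) -> mean:
    val (sum, cnt) = nums.aggregate((0, 0))(
      (acc, n) => (acc._1 + n, acc._2 + 1),    // fold an element into the acc
      (a, b)   => (a._1 + b._1, a._2 + b._2))  // merge per-partition accs
    println(sum.toDouble / cnt)                // 2.8

    nums.foreach(n => println(n))      // note: runs on the executors
  }
}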
Transformations vs. Actions
• 1. Transformations: return RDDs
– e.g., map, filter, join, etc.
• 2. Actions: return values
– e.g., count, sum, collect, etc.
RDD Persistence
• An RDD can be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.
• Fault-tolerant – if any partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created it.
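A minimal persistence sketch in Scala (local mode; the map step stands in for an expensive computation): the first action computes and caches the RDD, later actions are served from memory, and lost partitions would be rebuilt from lineage.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistenceDemo {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("PersistenceDemo")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    val parsed = sc.parallelize(1 to 1000000)
      .map(n => n * 2L)                        // stand-in for an expensive step

    parsed.persist(StorageLevel.MEMORY_ONLY)   // equivalent to cache()

    println(parsed.count())   // first action: computes and caches the RDD
    println(parsed.sum())     // later actions are served from memory; if a
                              // partition is lost, the lineage recomputes it
    parsed.unpersist()        // release the cached partitions
  }
}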