Apache Spark

Apache Spark is a unified analytics engine designed for large-scale data processing, offering features like in-memory processing, fault tolerance, and support for multiple programming languages. Developed at UC Berkeley in 2009, it addresses limitations of Hadoop MapReduce by enabling faster data processing and real-time analytics. Key components include Spark SQL for structured data, GraphX for graph processing, Spark Streaming for real-time data, and MLlib for machine learning.

5CS022 Distributed and Cloud Systems Programming
Lecture 9: Apache Spark
What is Apache Spark?
• "… a unified analytics engine for large-scale data processing."
• "… provides an interface for programming entire clusters with implicit data parallelism and fault tolerance"
• "… is a unified computing engine and a set of libraries for parallel data processing on computer clusters"
Motivation and History of Spark
• Created in 2009 at UC Berkeley's AMPLab by Matei Zaharia to address limitations of the Hadoop MapReduce cluster computing paradigm. Hadoop had 3 problems:
– Uses disk-based processing (slow compared to in-memory processing)
– Applications could only be written in Java (security concerns)
– No stream processing support, only batch processing
• The original UC Berkeley user group used Spark to monitor and predict Bay Area traffic patterns
• In 2010 Spark became open source under the BSD license
• In 2013 Spark became an Apache Software Foundation project
– Currently one of the largest projects of the Apache Foundation
• In 2014 Spark 1.0.0 was released, the largest release to that date, starting the 1.x line
• In 2016 Spark 2.0.0 was released
– Spark release at the time of this lecture: 2.2.1 (12/2017)
Key Features
• Spark decreases the number of reads and writes to disk, which significantly increases processing speed
• The 80 high-level operators help Spark overcome the Hadoop MapReduce restriction to Java
– Possible to develop parallel applications with Scala, Python and R
– Other languages could be supported in the same way with more work from the community
• In-memory processing reduces disk reads and writes
– As data sizes continue to increase, reading and writing TBs and PBs from disk isn't viable
– Storing data in the servers' RAM makes access significantly faster
• DAG (directed acyclic graph) execution engine
• Fault tolerant
– The Spark RDD abstraction handles failures of any worker nodes within the cluster, so data loss is negligible
• Spark Streaming allows for the processing of real-time data streams
– Not as powerful as Apache Storm/Heron/Flink
• A single workflow can express sophisticated analytics by integrating streaming data, machine learning, map and reduce operations, and queries
• Spark is compatible with all other forms of Hadoop data
– But Hadoop is not aimed at many important applications!
• Lazy evaluation: "Execution will not start until an action is triggered", e.g. data is not loaded before execution
• The active community around the globe working with Spark is a very important feature
Executing Spark Programs
• Spark programs can be executed standalone locally
• Interactively on a cluster
• Or submitted to a Spark cluster
Local Mode
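As an illustrative sketch (not from the slides), a minimal PySpark program run in local mode might look like this; the application name and sample data are assumptions:

```python
# local_wordcount.py - hypothetical example, assuming PySpark 2.x is installed
from pyspark.sql import SparkSession

# "local[*]" runs Spark locally, using one worker thread per CPU core
spark = (SparkSession.builder
         .master("local[*]")
         .appName("LocalWordCount")
         .getOrCreate())
sc = spark.sparkContext

# Build a small RDD in memory instead of reading a real file
lines = sc.parallelize(["spark runs locally", "spark can also run on a cluster"])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```

The same script could also be handed to spark-submit to run against a cluster instead of local threads.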
Interactive Client Mode
Cluster Mode
Spark Cluster Applications
Cluster Resource Manager
Resilient Distributed Dataset (RDD)
• A cacheable, distributed collection of data
• The fundamental data structure of Spark; it naturally supports in-memory processing
– The state of the memory is stored as an object and the object is shared among jobs
– Sharing data as objects in memory is 1-2 orders of magnitude faster than sharing via the network or disk
• Iterative and interactive processes become more efficient because of efficient data reuse
– Iterative examples include PageRank, K-means clustering and logistic regression
– An interactive example is interactive data mining, where several queries are run on the same dataset
– RDDs serve these workloads better than previous technologies
• Previous frameworks required that data be written to a stable file storage system (a distributed file system) between jobs
– This process creates overhead that can dominate execution times:
• Data replication
• Disk I/O
• Serialization
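A brief illustrative PySpark sketch of the data-reuse idea (the dataset and computation are assumptions, not from the slides):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("RDDReuse").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory collection (it could also come from sc.textFile(...))
numbers = sc.parallelize(range(1000000))

# cache() keeps the RDD in executor memory after its first computation,
# so the two actions below reuse it instead of rebuilding it from scratch
squares = numbers.map(lambda x: x * x).cache()

print(squares.count())   # first action: computes and caches the RDD
print(squares.take(5))   # second action: served from the in-memory cache
```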
Features of Spark RDD
• Lazy Evaluation
– In Spark, lazy evaluation means that results are not computed immediately; execution starts only when an action is triggered
– When a transformation occurs, no evaluation happens
• Spark simply maintains a record of the operation
• Fault Tolerance
– RDDs track data lineage and rebuild data upon failure
– RDDs do this by remembering how they were created from other datasets
• RDDs are created by transformations (map, join, …)
• Immutability
– Safely share data across processes
– Create and retrieve data at any time
• Ensures easy caching, sharing and replication
– Allows computations to be consistent
• Partitioning
– The partition is the fundamental unit of parallelism within a Spark RDD
• Each partition is a logical division of the data
– Partitions are created by performing a transformation on existing partitions
• Persistence
• Coarse-grained Operations
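A small illustrative sketch of lazy evaluation and partitioning (the partition count and data are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("RDDFeatures").getOrCreate()
sc = spark.sparkContext

# Ask for 4 partitions explicitly; each partition can be processed in parallel
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())   # -> 4

# Transformations are lazy: nothing is computed here, Spark only records the lineage
evens = rdd.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# The action finally triggers execution of the recorded transformations
print(doubled.count())          # -> 50
```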
Supported RDD Operations
Transformations examples
• Many are classic database operations
• map() - the supplied function is applied to each element in the RDD
• flatMap() - similar to map, but each input element can produce zero or more output elements
• filter() - creates a new RDD containing only the elements that pass the supplied predicate
• mapPartitions() - works on whole RDD partitions; useful when distributed computation is done per partition
• union(dataset) - performs a union operation on two RDDs (putting them together in a single RDD)
• intersection() - finds the intersecting elements of two RDDs, creating a new RDD
• distinct() - finds the distinct elements in an RDD, creating a new RDD
• join() - performs an inner or outer join on a dataset of key-value pairs
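An illustrative sketch of a few of these transformations in PySpark (the data values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Transformations").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not", "to be"])
nums1 = sc.parallelize([1, 2, 2, 3])
nums2 = sc.parallelize([3, 4])

words  = lines.flatMap(lambda line: line.split(" "))   # flatMap: one line -> many words
upper  = words.map(lambda w: w.upper())                # map: applied to every element
short  = words.filter(lambda w: len(w) <= 2)           # filter: keep matching elements
both   = nums1.union(nums2)                            # union: combine two RDDs
common = nums1.intersection(nums2)                     # intersection: shared elements
unique = nums1.distinct()                              # distinct: remove duplicates

# join works on key-value pair RDDs
ages   = sc.parallelize([("alice", 30), ("bob", 25)])
cities = sc.parallelize([("alice", "London"), ("bob", "Leeds")])
joined = ages.join(cities)                             # inner join on the key

# These are all transformations, so nothing has executed yet;
# collect() below is the action that triggers the computation
print(joined.collect())
```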
Supported RDD Operations
Actions
• When you need to work with the actual dataset, an action must be performed: transformations simply create new RDDs from other RDDs, whereas actions actually produce values
– Values are produced by actions by executing the RDD lineage graph
• An action in Spark is an RDD operator that produces a non-RDD value
– These values are returned to the driver or written to an external storage system
• Actions are what bring the "laziness" of RDDs into the spotlight
– Results are only computed when an action is called
– The importance of synchronization is key
• Actions are one way of sending data from the executors to the driver; the other is accumulators
Supported RDD Operations
Actions Examples
• count()
• collect()
• take()
• top()
• countByValue()
• reduce()
• fold()
• aggregate()
• foreach()
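A short illustrative sketch of these actions (the input values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Actions").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([5, 1, 4, 1, 3])

print(nums.count())                      # 5 - number of elements
print(nums.collect())                    # [5, 1, 4, 1, 3] - fetch all elements to the driver
print(nums.take(2))                      # [5, 1] - first two elements
print(nums.top(2))                       # [5, 4] - two largest elements
print(nums.countByValue())               # how many times each value occurs
print(nums.reduce(lambda a, b: a + b))   # 14 - sum via pairwise reduction
print(nums.fold(0, lambda a, b: a + b))  # 14 - like reduce, with a zero value
print(nums.aggregate(0, lambda acc, x: acc + x,
                        lambda a, b: a + b))   # 14 - per-partition, then combined
nums.foreach(lambda x: None)             # run a side-effecting function on each element
```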
Transformations vs. Actions
• Transformations generate a new, modified RDD based off the parent RDD
• Actions return values from the computation performed on the RDD
RDD Operations
• 1. Transformations: define new RDDs based on the current one (RDD → new RDD)
– e.g., filter, map, flatMap, reduceByKey, etc.
• 2. Actions: return values (RDD → value)
– e.g., count, sum, collect, etc.
RDD Persistence
• An RDD can be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.
• Fault-tolerant – if any partition of a persisted RDD is lost, it will automatically be recomputed using the transformations that originally created it.

Storage Level – Meaning
• MEMORY_ONLY – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
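A brief illustrative sketch of persisting an RDD with an explicit storage level (the dataset contents are assumptions):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Persistence").getOrCreate()
sc = spark.sparkContext

logs = sc.parallelize(["INFO ok", "ERROR disk", "ERROR net", "INFO ok"])
errors = logs.filter(lambda line: line.startswith("ERROR"))

# cache() is equivalent to persist() with the default MEMORY_ONLY level;
# MEMORY_AND_DISK spills partitions to disk if they do not fit in memory
errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())      # first action: computes and persists the filtered RDD
print(errors.collect())    # second action: reads the persisted partitions
errors.unpersist()         # release the storage when it is no longer needed
```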
Directed Acyclic Graph (DAG) Execution Engine
• A DAG is a programming style for distributed systems
– In a Directed Acyclic Graph, edges have a direction, represented by arrows
• Basically, Spark uses vertices and edges to represent RDDs and the operations to be applied to those RDDs, respectively
– Nodes of the graph are RDD instances or results
– Links of the DAG are transformations, actions and maps
• The DAG is modelled in Spark using a DAG scheduler, which further splits the graph into stages of tasks
• The DAG scheduler splits the original task into smaller subtasks and uses RDD transformations to perform the subtask operations
• In Spark the DAG can have any number of stages
– This prevents having to split a single job up into several jobs if more than two stages are required
– Allows simple jobs to complete after one stage
• In MapReduce the DAG has two predefined stages
– Map
– Reduce
• https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
DAG In Action
• For the DAG to work, Spark uses the Scala or Java layer as the first interpreter, to get the information on the task that has to be performed.
• Once a job or a set of tasks is submitted to Spark, it creates an Operator Graph.
• When an Action is called on a Spark RDD, Spark submits the operator graph to the DAG scheduler with the information on what to do and when to do it.
• The DAG Scheduler defines stages depending on the operations submitted to it. In one stage data partitioning can take place, and in another stage data gathering can take place after computation has been performed on the separate partitions.
• The stages are passed to the Task Scheduler, which manages all the subtasks generated for a particular job.
• The cluster manager is the tool which launches the tasks; the Task Scheduler submits those requests to the cluster manager.
• The cluster manager submits these tasks to the different machines, or workers.
• Each of the workers executes its tasks (the master-worker parallel computing paradigm).
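An illustrative way to peek at the lineage graph that the scheduler turns into stages is the RDD's toDebugString() method (the pipeline below is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("LineageDemo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a b", "b c", "c a"]).flatMap(lambda s: s.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() prints the RDD lineage; the shuffle introduced by reduceByKey
# is where the DAG scheduler inserts a stage boundary
print(counts.toDebugString().decode("utf-8"))

print(counts.collect())   # the action that actually triggers the DAG execution
```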
Execution Graph
Apache Spark Components and Relationships
Cluster Management
• A cluster manager is used to acquire cluster resources for executing jobs.
• Spark core runs over diverse cluster managers, including Hadoop YARN, Apache Mesos, Amazon EC2 and Spark's built-in standalone cluster manager.
• The cluster manager handles resource sharing between Spark applications.
• Independently of the cluster manager, Spark can access data in HDFS, Cassandra, HBase, Hive, Alluxio, and any Hadoop data source.
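As a hedged sketch, the choice of cluster manager typically shows up in application code only as the master URL; the host names, ports and file path below are placeholders, not from the slides:

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager; the rest of the program stays the same.
#   "local[*]"                  - run locally with one thread per core
#   "spark://master-host:7077"  - Spark's built-in standalone cluster manager
#   "yarn"                      - Hadoop YARN (cluster settings come from the Hadoop config)
#   "mesos://mesos-host:5050"   - Apache Mesos
spark = (SparkSession.builder
         .master("spark://master-host:7077")   # placeholder standalone master
         .appName("ClusterManagedApp")
         .getOrCreate())

# Data access is independent of the cluster manager, e.g. a file in HDFS:
lines = spark.sparkContext.textFile("hdfs://namenode:9000/data/input.txt")
print(lines.count())
```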
Spark SQL
• Spark SQL is a new module in Spark which integrates
relational processing with Spark’s functional programming
API.
• It supports querying data either via SQL or via the Hive
Query Language.
• The DataFrame and Dataset APIs of Spark SQL provide a
higher level of abstraction for structured data.
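An illustrative DataFrame/SQL sketch (the column names and values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("SparkSQLDemo").getOrCreate()

# Build a small DataFrame; in practice this might come from spark.read.json(...) etc.
people = spark.createDataFrame(
    [("alice", 34), ("bob", 23), ("carol", 45)],
    ["name", "age"])

# DataFrame API
people.filter(people.age > 30).select("name").show()

# Equivalent SQL query against a temporary view
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```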
GraphX
• GraphX is the Spark API for graphs and graph-parallel computation.
• Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.
Spark Streaming
• Spark Streaming is the component of Spark which is used to process real-time streaming data.
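An illustrative sketch of the classic streaming word count (the host, port and batch interval are assumptions):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")   # at least 2 threads: receiver + processing
ssc = StreamingContext(sc, batchDuration=5)           # process the stream in 5-second micro-batches

# Read lines from a TCP socket (e.g. fed by `nc -lk 9999` on localhost)
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                        # print each batch's counts

ssc.start()
ssc.awaitTermination()
```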
MLlib
• MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
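A brief illustrative sketch using MLlib's DataFrame-based API (the toy feature vectors and cluster count are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("MLlibDemo").getOrCreate()

# Toy dataset: two obvious clusters of 2-D points
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.1]),)],
    ["features"])

kmeans = KMeans(k=2, seed=1)        # fit two clusters
model = kmeans.fit(data)
print(model.clusterCenters())       # centres of the two learned clusters
```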
Questions?
