What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Training | Edureka

EDUREKA SPARK CERTIFICATION TRAINING: www.edureka.co/apache-spark-scala-training

What to expect?
1. Why Apache Spark?
2. Spark Features
3. Spark Ecosystem
4. Hands-On Examples
5. Use Case

Data Generated Every Minute!

Big Data Analytics
➢ Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information
➢ Big Data Analytics is of two types:
1. Batch Analytics
2. Real-Time Analytics

Spark For Real Time Analysis
Use Cases For Real Time Analytics: Banking, Government, Healthcare, Telecommunications, Stock Market
Our Requirements:
➢ Process data in real time
➢ Handle input from multiple sources
➢ Easy to use
➢ Faster processing

What Is Spark?
➢ Apache Spark is an open-source cluster-computing framework for large-scale data processing, including near-real-time workloads. It originated at UC Berkeley's AMPLab and is now maintained by the Apache Software Foundation
➢ Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
➢ It extends the MapReduce model to efficiently support more types of computations. It can run alongside Hadoop, but it is not built on top of Hadoop MapReduce
Figure: Data Parallelism In Spark (parallel versus serial execution, showing the reduction in time)
Figure: Real Time Processing In Spark

Why Spark?
Features:
➢ Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing
➢ Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
➢ Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
➢ Polyglot: can be programmed in Scala, Java, Python and R

Spark Success Story
➢ Twitter Sentiment Analysis With Spark: trending topics can be used to create campaigns and attract a larger audience; sentiment helps in crisis management, service adjusting and target marketing
➢ NYSE: Real Time Analysis of Stock Market Data
➢ Banking: Credit Card Fraud Detection
➢ Genomic Sequencing

Using Hadoop Through Spark

Spark And Hadoop
➢ Spark can run on top of HDFS to leverage the distributed replicated storage
➢ Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework
➢ MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing
➢ Spark applications can also be run on YARN (Hadoop NextGen)

Spark Features
Speed
Multiple Languages
Advanced Analytics
Real Time
Hadoop Integration
Machine Learning

Spark Features
➢ Speed: Spark runs up to 100x faster than MapReduce
➢ Supports multiple data sources
➢ Lazy Evaluation: delays evaluation until it is needed
➢ Real-time computation and low latency because of in-memory computation
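A minimal sketch of what lazy evaluation means, using a plain Scala Iterator instead of Spark itself (an assumption made so the snippet runs without a cluster). The names LazyEvaluationSketch and run are hypothetical. Spark transformations such as map behave analogously: nothing is computed until an action such as collect asks for a result.

```scala
// Plain Scala model of lazy evaluation; no Spark dependency is assumed.
object LazyEvaluationSketch {
  // Returns (evaluations before the result is requested, the result,
  // evaluations after the result is requested).
  def run(): (Int, List[Int], Int) = {
    var evaluated = 0
    // Iterator.map is lazy: this line only records what to do.
    val doubled = Iterator(1, 2, 3, 4).map { x => evaluated += 1; x * 2 }
    val before = evaluated
    // Materialising the iterator plays the role of a Spark action.
    val result = doubled.toList
    (before, result, evaluated)
  }
}
```

Calling LazyEvaluationSketch.run() shows that the counter is still zero after map is declared and only advances once toList forces the computation.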

Spark Features
➢ Hadoop Integration
➢ Machine Learning for iterative tasks

Spark Components
➢ Spark Core Engine: the core engine for the entire Spark framework. Provides utilities and architecture for other components
➢ Spark SQL (SQL): used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment
➢ Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
➢ MLlib (Machine Learning): machine learning libraries built on top of Spark
➢ GraphX (Graph Computation): graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts
➢ SparkR (R on Spark): package for the R language that enables R users to leverage Spark power from the R shell

Spark Components
➢ DataFrames: tabular data abstraction introduced by Spark SQL
➢ ML Pipelines: make it easier to combine multiple algorithms or workflows

Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing
It is responsible for:
➢ Memory management and fault recovery
➢ Scheduling, distributing and monitoring jobs on a cluster
➢ Interacting with storage systems
Figure: Spark Core Job Cluster

Spark Architecture
Figure: Components of a Spark cluster. The Driver Program (holding the Spark Context) talks to the Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor with a Cache and several Tasks

Spark Streaming
➢ Spark Streaming is used for processing real-time streaming data
➢ It is a useful addition to the core Spark API
➢ Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams
➢ The fundamental stream unit is the DStream, which is basically a series of RDDs used to process real-time data
Figure: Streams In Spark Streaming

Spark Streaming
Figure: Overview Of Spark Streaming. Streaming data sources and static data sources feed Spark Streaming, which works together with Spark SQL (SQL + DataFrames) and MLlib (Machine Learning) and writes to data storage systems

Spark Streaming
Figure: Data from a variety of sources to various storage systems (sources: Kafka, Flume, HDFS/S3, Kinesis, Twitter; sinks: HDFS, Databases, Dashboards)
Figure: Incoming streams of data divided into batches (labels: Input Data Stream, Streaming Engine, Batches Of Input Data, Batches Of Processed Data)
Figure: Input data stream divided into discrete chunks of data (a DStream shown as RDD @ Time 1, RDD @ Time 2, RDD @ Time 3 and RDD @ Time 4, each holding the data from one time interval)
Figure: Extracting words from an InputStream (a flatMap operation turns each batch of the lines DStream into the matching batch of the words DStream)
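The flatMap step in the last figure can be sketched with plain Scala collections. This is a toy model under stated assumptions, not the Spark Streaming API: a "DStream" is modelled as a sequence of batches, one per time interval, and the names DStreamSketch, Batches and flatMapBatches are hypothetical.

```scala
// Toy model of a DStream as a sequence of batches; no Spark dependency is assumed.
object DStreamSketch {
  // One inner Seq per time interval, standing in for one RDD per interval.
  type Batches[A] = Seq[Seq[A]]

  // Apply flatMap independently to every batch, as the figure suggests.
  def flatMapBatches[A, B](stream: Batches[A])(f: A => Seq[B]): Batches[B] =
    stream.map(batch => batch.flatMap(f))

  // Hypothetical input: two intervals of text lines.
  val lines: Batches[String] = Seq(
    Seq("spark streaming", "hello"),
    Seq("hello spark")
  )

  // Lines DStream to words DStream: split every line on spaces.
  val words: Batches[String] = flatMapBatches(lines)(line => line.split(" ").toSeq)
}
```

The batch boundaries are preserved: words from the first interval stay in the first batch, which mirrors how each RDD of the input DStream maps to one RDD of the output DStream.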

Spark SQL Features
1. Spark SQL integrates relational processing with Spark's functional programming.
2. Spark SQL is used for structured/semi-structured data analysis in Spark.

Spark SQL Features
3. Support for various data formats
4. SQL queries can be converted into RDDs for transformations
Figure: RDD 1 is transformed into RDD 2 across a shuffle (drop split point). Invoking RDD 2 computes all partitions of RDD 1

Spark SQL Overview
5. Performance and scalability

Spark SQL Features
6. Standard JDBC/ODBC connectivity
7. User Defined Functions let users define new column-based functions to extend the Spark vocabulary

Spark SQL Flow Diagram
➢ Spark SQL has the following libraries:
1. Data Source API
2. DataFrame API
3. Interpreter & Optimizer
4. SQL Service
➢ The flow diagram represents a Spark SQL process using all four libraries in sequence
Figure: Data Source API, DataFrame API (Named Columns), Interpreter & Optimizer (Resilient Distributed Dataset), Spark SQL Service

MLlib
Machine Learning may be broken down into two classes of algorithms:
➢ Supervised algorithms use labelled data in which both the input and output are provided to the algorithm
➢ Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without labels
Supervised
• Classification
- Naïve Bayes
- SVM
• Regression
- Linear
- Logistic
Unsupervised
• Clustering
- K Means
• Dimensionality Reduction
- Principal Component Analysis
- SVD

MLlib - Techniques
There are 3 common techniques for Machine Learning:
1. Classification: It is a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes
Some common use cases for classification include:
i) Credit card fraud detection
ii) Email spam detection
2. Clustering: In clustering, an algorithm groups objects into categories by analyzing similarities between input examples

MLlib - Techniques
3. Collaborative Filtering: Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part)

GraphX
Graph Concepts
A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that
connect them. The vertices are the objects and the edges are the relationships between them.
A directed graph is a graph where the edges have a direction associated with them. E.g. User Sam follows John on Twitter.
Figure: Two vertices (Sam, John) joined by an edge (Relationship: Friends), and the directed version of the same graph (Sam follows John)

GraphX – Triplet View
GraphX has a Graph class that contains members to access edges and vertices
Triplet View
The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class
Figure: Vertices (A, B), Edges (A to B) and Triplets (A to B, carrying the properties of both vertices and of the edge)
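The join behind the triplet view can be sketched with plain Scala collections. This is an illustrative model under stated assumptions, not GraphX code: the Edge and EdgeTriplet case classes below are simplified stand-ins defined locally, and the object name TripletSketch and the sample data are hypothetical.

```scala
// Plain Scala model of the triplet view; no Spark dependency is assumed.
object TripletSketch {
  // Simplified local stand-ins for the GraphX classes of the same name.
  case class Edge[ED](srcId: Long, dstId: Long, attr: ED)
  case class EdgeTriplet[VD, ED](srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

  // Join every edge with the properties of its source and destination vertex.
  def triplets[VD, ED](vertices: Map[Long, VD], edges: Seq[Edge[ED]]): Seq[EdgeTriplet[VD, ED]] =
    edges.map(e => EdgeTriplet(e.srcId, vertices(e.srcId), e.dstId, vertices(e.dstId), e.attr))

  // Hypothetical data matching the A and B vertices of the figure.
  val vertices: Map[Long, String] = Map(1L -> "A", 2L -> "B")
  val edges: Seq[Edge[String]] = Seq(Edge(1L, 2L, "follows"))

  val described: Seq[String] =
    triplets(vertices, edges).map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
}
```

Each triplet carries the edge attribute plus both vertex attributes, which is why the later examples can print names on both ends of a relationship from a single record.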

GraphX – Property Graph
GraphX is the Spark API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a Resilient Distributed
Property Graph.
The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user defined
properties associated with it. The parallel edges allow multiple relationships between the same vertices.
Figure: Property Graph with vertices LAX and SJC, showing a vertex property and an edge property

GraphX – Example
To understand GraphX, let us consider the below graph.
➢ The vertices have names and ages of people.
➢ The edges represent whether one person likes another, and the edge weight is a measure of the likeability.
Figure: Graph of six people (vertex IDs 1 to 6): Alice (Age: 28), Bob (Age: 27), Charlie (Age: 65), David (Age: 42), Ed (Age: 55), Fran (Age: 50), connected by weighted "likes" edges

GraphX – Example
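Display names and ages
// Assumes vertexArray: Array[(Long, (String, Int))] and edgeArray: Array[Edge[Int]]
// were built from the graph above, and that sc is the Spark shell's SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
// Keep only the people older than 30 and print their names and ages
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach { case (id, (name, age)) => println(s"$name is $age") }
Output
David is 42
Fran is 50
Ed is 55
Charlie is 65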

GraphX – Example
Display Relations
for (triplet <- graph.triplets.collect) {
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
Output
Bob likes Alice
Bob likes David
Charlie likes Bob
Charlie likes Fran
David likes Alice
Ed likes Bob
Ed likes Charlie
Ed likes Fran

Use Case: Analyze Flight Data Using Spark GraphX

Use Case: Problem Statement
To analyse real-time flight data using Spark GraphX, provide near real-time computation results and visualize the results using Google Data Studio
Computations to be done:
➢ Compute the total number of flight routes
➢ Compute and sort the longest flight routes
➢ Display the airport with the highest vertex degree
➢ List the most important airports according to PageRank
➢ List the routes with the lowest flight costs
We will use Spark GraphX for the above computations and visualize the results using Google Data Studio

Use Case: Flight Dataset
The attributes of each row are as follows:
1. Day Of Month
2. Day Of Week
3. Carrier Code
4. Unique ID- Tail Number
5. Flight Number
6. Origin Airport ID
7. Origin Airport Code
8. Destination Airport ID
9. Destination Airport Code
10. Scheduled Departure Time
11. Actual Departure Time
12. Departure Delay In Minutes
13. Scheduled Arrival Time
14. Actual Arrival Time
15. Arrival Delay Minutes
16. Elapsed Time
17. Distance
Figure: USA Airport Flight Data

Use Case: Flow Diagram
1. Huge amount of flight data
2. Database storing real-time flight data
3. Creating graph using GraphX
4. USA flight mapping queries: calculate the top busiest airports, compute the longest flight routes, calculate the routes with the lowest flight costs
5. Visualizing using Google Data Studio

Use Case: Starting Spark Shell
// Importing the necessary classes
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.IntParam
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators
// Creating a case class 'Flight'
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: Long, origin: String, dest_id: Long, dest: String, crsdeptime: Double,
  deptime: Double, depdelaymins: Double, crsarrtime: Double, arrtime: Double,
  arrdelay: Double, crselapsedtime: Double, dist: Int)
// Defining a 'parseFlight' function to parse one CSV line into the 'Flight' class
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5).toLong, line(6),
    line(7).toLong, line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}
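As a hedged variation on the parser above, the sketch below wraps the same field conversions in scala.util.Try so that a malformed line yields None instead of throwing. It is not part of the original walkthrough: the names ParseFlightSketch and parseFlightOpt are hypothetical, and the sample row is invented for illustration rather than taken from the dataset.

```scala
import scala.util.Try

// Defensive variant of parseFlight in plain Scala; no Spark dependency is assumed.
object ParseFlightSketch {
  case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
    flnum: Int, org_id: Long, origin: String, dest_id: Long, dest: String,
    crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
    arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

  // Returns None when a field is missing or is not a valid number.
  def parseFlightOpt(str: String): Option[Flight] = Try {
    val f = str.split(",")
    Flight(f(0), f(1), f(2), f(3), f(4).toInt, f(5).toLong, f(6),
      f(7).toLong, f(8), f(9).toDouble, f(10).toDouble, f(11).toDouble,
      f(12).toDouble, f(13).toDouble, f(14).toDouble, f(15).toDouble,
      f(16).toInt)
  }.toOption

  // Hypothetical row with the 17 attributes listed in the dataset slide.
  val sample: String = "1,3,XX,N000XX,101,1,AAA,2,BBB,900,855,0,1225,1237,12,385,2475"
}
```

With an RDD of lines, such a function could be used with flatMap instead of map so that bad rows are dropped; whether silently dropping rows is acceptable depends on the analysis.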

Use Case: Creating Edges For Graph Mapping
// Load the data into an RDD 'textRDD'
val textRDD = sc.textFile("/home/edureka/Downloads/AirportDataset.csv")
// Parse the RDD of CSV lines into an RDD of Flight objects
val flightsRDD = textRDD.map(parseFlight).cache()
// Create airports RDD with ID and name
val airports = flightsRDD.map(flight => (flight.org_id, flight.origin)).distinct
airports.take(1)
// Define a default vertex called 'nowhere' and build an airport ID to name map for printing
val nowhere = "nowhere"
val airportMap = airports.map { case (org_id, name) => (org_id -> name) }.collect.toList.toMap
// Create routes RDD with sourceID, destinationID and distance
val routes = flightsRDD.map(flight => ((flight.org_id, flight.dest_id), flight.dist)).distinct
routes.take(2)
// Create edges RDD with sourceID, destinationID and distance
val edges = routes.map { case ((org_id, dest_id), distance) => Edge(org_id.toLong, dest_id.toLong, distance) }
edges.take(1)

Use Case: Routes & Edge Triplets
// Define the graph and display some vertices and edges
val graph = Graph(airports, edges, nowhere)
graph.vertices.take(2)
graph.edges.take(2)
// Find the number of airports
val numairports = graph.numVertices
// Calculate the total number of routes
val numroutes = graph.numEdges
// Find routes with distances of more than 1000 miles
graph.edges.filter { case Edge(org_id, dest_id, distance) => distance > 1000 }.take(3)
// Display some edge triplets
graph.triplets.take(3).foreach(println)
// Sort and print the longest routes
graph.triplets.sortBy(_.attr, ascending = false).map(triplet => "Distance " + triplet.attr.toString + " from " + triplet.srcAttr + " to " + triplet.dstAttr + ".").take(10).foreach(println)

Use Case: Vertex Degree Computation
// Define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
// Find the highest degree vertices for incoming flights, outgoing flights and both combined
val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = graph.degrees.reduce(max)
// Get the airport names for IDs 10397 and 12478
airportMap(10397)
airportMap(12478)
// Find the airports with the most incoming flights
val maxIncoming = graph.inDegrees.collect.sortWith(_._2 > _._2).map(x => (airportMap(x._1), x._2)).take(3)
maxIncoming.foreach(println)
// Find the airports with the most outgoing flights
val maxout = graph.outDegrees.join(airports).sortBy(_._2._1, ascending = false).take(3)
maxout.foreach(println)
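What inDegrees, outDegrees and the max reduction compute can be sketched on a small edge list with plain Scala collections. This is a model of the idea, not GraphX itself; the object name DegreeSketch and the five toy routes are hypothetical.

```scala
// Plain Scala model of degree computation; no Spark dependency is assumed.
object DegreeSketch {
  // Directed routes as (source airport ID, destination airport ID); toy data.
  val edges: Seq[(Long, Long)] = Seq((1L, 3L), (2L, 3L), (4L, 3L), (3L, 1L), (2L, 1L))

  // In-degree: number of edges arriving at each vertex.
  def inDegrees(es: Seq[(Long, Long)]): Map[Long, Int] =
    es.groupBy(_._2).map { case (v, incoming) => (v, incoming.size) }

  // Out-degree: number of edges leaving each vertex.
  def outDegrees(es: Seq[(Long, Long)]): Map[Long, Int] =
    es.groupBy(_._1).map { case (v, outgoing) => (v, outgoing.size) }

  // Same reduction as in the walkthrough: keep the pair with the larger degree.
  def max(a: (Long, Int), b: (Long, Int)): (Long, Int) = if (a._2 > b._2) a else b

  val maxIn: (Long, Int) = inDegrees(edges).reduce((a, b) => max(a, b))
  val maxOut: (Long, Int) = outDegrees(edges).reduce((a, b) => max(a, b))
}
```

In this toy list vertex 3 receives three routes and vertex 2 originates two, so they come out as the highest in-degree and out-degree vertices respectively.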

Use Case: Graph PageRank
// Find the most important airports according to PageRank
val ranks = graph.pageRank(0.1).vertices
val temp = ranks.join(airports)
temp.take(1)
// Sort the airports by ranking, highest first
val temp2 = temp.sortBy(_._2._1, false)
temp2.take(2)
// Display the most important airports
val impAirports = temp2.map(_._2._2)
impAirports.take(4)
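The idea behind PageRank can be sketched as a fixed number of rank-propagation rounds over plain Scala collections. This is a simplified illustration under stated assumptions, not the GraphX implementation: it runs for a set number of iterations instead of stopping at a tolerance such as the 0.1 passed above, it ignores the rank of vertices without outgoing edges, and the names PageRankSketch and resetProb are chosen here for illustration.

```scala
// Simplified PageRank in plain Scala; no Spark dependency is assumed.
object PageRankSketch {
  def pageRank(edges: Seq[(Long, Long)], iterations: Int, resetProb: Double = 0.15): Map[Long, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => (v, es.size) }
    // Every vertex starts with the same rank.
    var ranks: Map[Long, Double] = vertices.map(v => (v, 1.0)).toMap
    for (_ <- 1 to iterations) {
      // Each vertex sends rank / outDegree along every outgoing edge.
      val contribs = edges
        .map { case (s, d) => (d, ranks(s) / outDegree(s)) }
        .groupBy(_._1)
        .map { case (v, cs) => (v, cs.map(_._2).sum) }
      // New rank: a constant reset share plus the damped incoming contributions.
      ranks = vertices.map(v => (v, resetProb + (1.0 - resetProb) * contribs.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

For a toy route list such as Seq((1L, 3L), (2L, 3L), (4L, 3L), (3L, 1L)), the airport that many routes point to (vertex 3) ends up ranked above the others, which is the sense in which PageRank measures importance here.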

Use Case: Pregel Computation
// Implementing Pregel: single-source lowest cost from one airport
val sourceId: VertexId = 13024
// Flight cost per route: a fixed 50 plus distance / 20
val gg = graph.mapEdges(e => 50.toDouble + e.attr.toDouble / 20)
// The source starts at cost 0, every other airport at infinity
val initialGraph = gg.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the cheaper of the current and the received cost
  (id, dist, newDist) => math.min(dist, newDist),
  // Send message: offer a cheaper cost to the destination if one exists
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  // Merge messages: keep the minimum offer
  (a, b) => math.min(a, b)
)
// Find the routes with the lowest flight costs
println(sssp.edges.take(4).mkString("\n"))
// Find airports and their lowest flight costs
println(sssp.vertices.take(4).mkString("\n"))
// Display airport codes along with sorted lowest flight costs
sssp.vertices.collect.map(x => (airportMap(x._1), x._2)).sortWith(_._2 < _._2)
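The three Pregel functions above (vertex program, send message, merge messages) can be mimicked with a simple loop over plain Scala collections. This is a sketch of the message-passing idea under stated assumptions, not the GraphX Pregel API: the object name SsspSketch, its method names and the toy routes in the usage note are hypothetical, and the cost formula is copied from the mapEdges call above.

```scala
// Plain Scala model of the Pregel-style lowest-cost search; no Spark dependency is assumed.
object SsspSketch {
  // Same cost formula as in the walkthrough: 50 plus distance / 20.
  def cost(distanceMiles: Int): Double = 50.0 + distanceMiles.toDouble / 20

  // edges are (source, destination, cost) triples.
  def shortestCosts(edges: Seq[(Long, Long, Double)], sourceId: Long): Map[Long, Double] = {
    val vertices = edges.flatMap { case (s, d, _) => Seq(s, d) }.distinct
    var dist: Map[Long, Double] =
      vertices.map(v => (v, if (v == sourceId) 0.0 else Double.PositiveInfinity)).toMap
    var changed = true
    while (changed) {
      // Send message: offer a cheaper cost to the destination if one exists.
      val messages = edges.collect {
        case (s, d, c) if dist(s) + c < dist(d) => (d, dist(s) + c)
      }
      // Merge messages: keep the minimum offer per vertex.
      val merged = messages.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2).min) }
      // Vertex program: keep the cheaper of the current and the received cost.
      dist = dist ++ merged.map { case (v, c) => (v, math.min(dist(v), c)) }
      changed = merged.nonEmpty
    }
    dist
  }
}
```

For example, with routes 1 to 2 (1000 miles), 2 to 3 (400 miles) and 1 to 3 (3000 miles), the direct route to airport 3 costs more than the connection through airport 2, so the loop settles on the connecting cost; airports that cannot be reached from the source keep an infinite cost.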

Use Case – Visualizing Results
➢ We will be using Google Data Studio to visualize our analysis
➢ Google Data Studio is a product under the Google Analytics 360 Suite
➢ The image shows a sample marketing website summary using Geo Map, Time Series and Bar Chart
➢ We will use the Geo Map service to map the airports to their respective locations on the USA map and display the metric quantities
Figure: Google Data Studio

Use Case - Visualizing Results
1. Display the total number of flights per Airport
Figure: Total Number of Flights from New York

2. Display the metric sum of Destination routes from every Airport
Figure: Measure of the total outgoing traffic from Los Angeles

3. Display the total delay of all flights per Airport
Figure: Total Delay of all Flights at Atlanta

Conclusion
Congrats!
We have hence demonstrated the power of Spark in Real Time Data Analytics.
The hands-on examples will give you the required confidence to work on any
future projects you encounter in Apache Spark.

Thank You …
Questions/Queries/Feedback
