EDUREKA SPARK CERTIFICATION TRAINING: www.edureka.co/apache-spark-scala-training
What is Apache Spark? Apache Spark Tutorial For Beginners
What to expect?
1. Why Apache Spark?
2. Spark Features
3. Spark Ecosystem
4. Hands-On Examples
5. Use Case
Big Data Analytics
Data Generated Every Minute!
➢ Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information
➢ Big Data Analytics is of two types:
1. Batch Analytics
2. Real-Time Analytics
Spark For Real Time Analysis
Use Cases For Real Time Analytics: Banking, Government, Healthcare, Telecommunications, Stock Market
Our Requirements:
➢ Process data in real-time
➢ Handle input from multiple sources
➢ Easy to use
➢ Faster processing
What Is Spark?
➢ Apache Spark is an open-source cluster-computing framework for real-time processing, originally developed at UC Berkeley's AMPLab and now maintained by the Apache Software Foundation
➢ Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
➢ It extends the MapReduce model to efficiently support more types of computations; it is a separate engine rather than a layer on top of Hadoop MapReduce
Figure: Data Parallelism In Spark (serial vs. parallel execution, showing the reduction in time)
Figure: Real Time Processing In Spark
Why Spark?
Features:
➢ Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing
➢ Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
➢ Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
➢ Polyglot: can be programmed in Scala, Java, Python and R
Spark Success Story
➢ Twitter Sentiment Analysis With Spark
- Trending topics can be used to create campaigns and attract a larger audience
- Sentiment helps in crisis management, service adjustment and targeted marketing
➢ NYSE: Real-Time Analysis of Stock Market Data
➢ Banking: Credit Card Fraud Detection
➢ Genomic Sequencing
Using Hadoop Through Spark
Spark And Hadoop
➢ Spark can run on top of HDFS to leverage the distributed replicated storage
➢ Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework
➢ Spark applications can also be run on YARN (Hadoop NextGen)
➢ MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing
Spark Features
Speed
Multiple Languages
Advanced Analytics
Real Time
Hadoop Integration
Machine Learning
➢ Speed: Spark runs up to 100x faster than MapReduce
➢ Supports multiple data sources
➢ Lazy Evaluation: delays evaluation until it is needed
➢ Real-time computation and low latency because of in-memory computation
➢ Hadoop Integration
➢ Machine Learning for iterative tasks
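Lazy evaluation, listed among the features above, can be illustrated without a cluster. The following is a minimal sketch in plain Scala, not Spark code: a collection view stands in for an RDD, and the names LazyEvalSketch, expensiveSquare and firstTwo are made up for illustration. Building the view plays the role of a transformation and runs nothing; forcing a result plays the role of an action.

```scala
// Plain Scala sketch of lazy evaluation; no Spark dependency.
// LazyEvalSketch and its member names are hypothetical.
object LazyEvalSketch {
  // Counts how many times the mapping function actually runs.
  var evaluations = 0

  def expensiveSquare(x: Int): Int = {
    evaluations += 1
    x * x
  }

  val numbers: List[Int] = List(1, 2, 3, 4, 5)

  // "Transformation": building the view records the work but runs nothing yet.
  val squares = numbers.view.map(expensiveSquare)

  // "Action": forcing the first two elements triggers only the work that is needed.
  def firstTwo(): List[Int] = squares.take(2).toList
}
```

Before firstTwo() is called the counter stays at zero, and afterwards it reflects only the elements that were actually requested. Spark applies the same idea to a whole lineage of transformations, which lets it plan the work before running it.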
Spark Components
➢ Spark Core Engine: the core engine for the entire Spark framework. Provides utilities and architecture for the other components
➢ Spark SQL (SQL): used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment
➢ Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
➢ MLlib (Machine Learning): machine learning libraries built on top of Spark
➢ GraphX (Graph Computation): graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts
➢ SparkR (R on Spark): package for the R language that lets R users leverage Spark from the R shell
➢ DataFrames: tabular data abstraction introduced by Spark SQL
➢ ML Pipelines: make it easier to combine multiple algorithms or workflows
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing.
It is responsible for:
➢ Memory management and fault recovery
➢ Scheduling, distributing and monitoring jobs on a cluster
➢ Interacting with storage systems
Figure: Spark Core Job Cluster (rows of a table processed into a result)
Spark Architecture
Figure: Components of a Spark cluster. The Driver Program (holding the SparkContext) talks to the Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor with a Cache and several Tasks.
Spark Streaming
➢ Spark Streaming is used for processing real-time streaming data
➢ It is a useful addition to the core Spark API
➢ Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams
➢ The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data
Figure: Streams In Spark Streaming
Figure: Overview Of Spark Streaming. Streaming data sources and static data sources feed Spark Streaming, which works alongside MLlib (Machine Learning) and Spark SQL (SQL + DataFrames) and writes to data storage systems.
Figure: Data from a variety of sources to various storage systems. Sources: Kafka, Flume, HDFS/S3, Kinesis, Twitter. Destinations: HDFS, Databases, Dashboards.
Figure: Incoming streams of data divided into batches. The input data stream is split into batches of input data, which the Streaming Engine turns into batches of processed data.
Figure: Input data stream divided into discrete chunks of data. A DStream is a sequence of RDDs (RDD @ Time 1 to RDD @ Time 4), each holding the data from one time interval.
Figure: Extracting words from an InputStream. A flatMap operation turns each RDD of the input DStream into the matching RDD of the Words DStream.
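The batching and flatMap steps shown in the figures can be sketched in plain Scala. This is an illustration under stated assumptions, not the Spark Streaming API: ordinary collections stand in for the RDDs of a DStream, and MicroBatchSketch, Event, toBatches and toWords are hypothetical names with made-up sample data.

```scala
// Plain Scala sketch of micro-batching and flatMap; no Spark dependency.
// MicroBatchSketch, Event and the sample data are hypothetical.
object MicroBatchSketch {
  // One incoming line of text with its arrival time in seconds.
  final case class Event(timeSec: Int, line: String)

  // Group events into fixed-size time intervals, like the RDDs that make up a DStream.
  // Intervals that received no events are simply absent from the result.
  def toBatches(events: Seq[Event], intervalSec: Int): Seq[Seq[String]] =
    events
      .groupBy(_.timeSec / intervalSec)
      .toSeq
      .sortBy(_._1)
      .map { case (_, batch) => batch.sortBy(_.timeSec).map(_.line) }

  // Apply flatMap to every batch to turn lines into words, batch by batch.
  def toWords(batches: Seq[Seq[String]]): Seq[Seq[String]] =
    batches.map(_.flatMap(_.split(" ")))

  val events: Seq[Event] = Seq(
    Event(0, "spark streaming"),
    Event(1, "real time data"),
    Event(2, "hello spark")
  )

  val words: Seq[Seq[String]] = toWords(toBatches(events, intervalSec = 2))
}
```

With a two-second interval the first two events land in one batch and the third in the next, so the word stream keeps the same batch boundaries as the line stream.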
Spark SQL
Spark SQL Features
1. Spark SQL integrates relational processing with Spark's functional programming.
2. Spark SQL is used for structured and semi-structured data analysis in Spark.
3. Support for various data formats.
4. SQL queries can be converted into RDDs for transformations.
Figure: RDD 1 is transformed into RDD 2 through a shuffle (drop split point); invoking RDD 2 computes all partitions of RDD 1.
5. Performance and scalability.
6. Standard JDBC/ODBC connectivity.
7. User Defined Functions let users define new Column-based functions to extend the Spark vocabulary.
Spark SQL Flow Diagram
➢ Spark SQL has the following libraries:
1. Data Source API
2. DataFrame API
3. Interpreter & Optimizer
4. SQL Service
➢ The flow diagram represents a Spark SQL process using all four libraries in sequence
Figure: Data Source API, DataFrame API (Named Columns), Interpreter & Optimizer (Resilient Distributed Dataset), Spark SQL Service
MLlib
Machine Learning may be broken down into two classes of algorithms:
➢ Supervised algorithms use labelled data in which both the input and output are provided to the algorithm
➢ Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without labels
Supervised
• Classification
- Naïve Bayes
- SVM
• Regression
- Linear
- Logistic
Unsupervised
• Clustering
- K Means
• Dimensionality Reduction
- Principal Component Analysis
- SVD
MLlib - Techniques
There are 3 common techniques for Machine Learning:
1. Classification: a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes
Some common use cases for classification include:
i) Credit card fraud detection
ii) Email spam detection
2. Clustering: an algorithm groups objects into categories by analyzing similarities between input examples
3. Collaborative Filtering: collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part)
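The idea behind collaborative filtering can be sketched in a few lines of plain Scala. This is a deliberately naive user-based illustration, not MLlib's approach (MLlib implements collaborative filtering with alternating least squares); the object name, the users and their preferences are all made up.

```scala
// Plain Scala sketch of user-based collaborative filtering; no Spark or MLlib dependency.
// CollaborativeFilteringSketch and the sample preferences are hypothetical.
object CollaborativeFilteringSketch {
  // Items each user liked.
  val likes: Map[String, Set[String]] = Map(
    "ann"  -> Set("spark", "scala", "kafka"),
    "bob"  -> Set("spark", "scala"),
    "cara" -> Set("excel")
  )

  // Recommend items liked by users who share at least one liked item with `user`,
  // ranked by how many shared items back each candidate.
  def recommend(user: String): List[String] = {
    val own = likes.getOrElse(user, Set.empty[String])
    val scored = for {
      (other, items) <- likes.toList
      if other != user
      overlap = (items & own).size
      if overlap > 0
      candidate <- items.diff(own).toList
    } yield (candidate, overlap)

    scored
      .groupBy(_._1)
      .map { case (item, hits) => (item, hits.map(_._2).sum) }
      .toList
      .sortBy { case (item, score) => (-score, item) }
      .map(_._1)
  }
}
```

Here bob shares two liked items with ann, so ann's remaining item is recommended to him, while cara has no overlap with anyone and gets no recommendation.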
GraphX
Graph Concepts
A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that
connect them. The vertices are the objects and the edges are the relationships between them.
A directed graph is a graph where the edges have a direction associated with them. E.g. User Sam follows John on Twitter.
Figure: Two vertices, Sam and John, joined by an edge (Relationship: Friends); in the directed version the edge points from Sam to John (Follows).
GraphX – Triplet View
GraphX has a Graph class that contains members to access edges and vertices.
The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class.
Figure: Vertices (A, B), Edges (A to B) and Triplets (A to B, carrying the properties of both vertices and of the edge)
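The join performed by the triplet view can be sketched with ordinary collections. The code below is an illustration only: Edge and Triplet are hypothetical stand-ins for GraphX's Edge and EdgeTriplet classes, and the vertices A and B with a "follows" edge are sample data.

```scala
// Plain Scala sketch of the triplet view; no Spark dependency.
// TripletSketch and its types are hypothetical stand-ins for GraphX's Edge and EdgeTriplet.
object TripletSketch {
  final case class Edge[ED](srcId: Long, dstId: Long, attr: ED)
  final case class Triplet[VD, ED](srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

  // Join each edge with the attributes of its source and destination vertices.
  // Edges whose endpoints are missing from the vertex map are skipped.
  def triplets[VD, ED](vertices: Map[Long, VD], edges: Seq[Edge[ED]]): Seq[Triplet[VD, ED]] =
    for {
      e   <- edges
      src <- vertices.get(e.srcId).toList
      dst <- vertices.get(e.dstId).toList
    } yield Triplet(e.srcId, src, e.dstId, dst, e.attr)

  val vertices: Map[Long, String] = Map(1L -> "A", 2L -> "B")
  val edges: Seq[Edge[String]] = Seq(Edge(1L, 2L, "follows"))

  val described: Seq[String] =
    triplets(vertices, edges).map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
}
```

Each triplet carries everything needed to describe a relationship in one record, which is why the later examples read names straight from srcAttr and dstAttr.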
GraphX – Property Graph
GraphX is the Spark API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a Resilient Distributed
Property Graph.
The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user defined
properties associated with it. The parallel edges allow multiple relationships between the same vertices.
Figure: Property Graph with airport vertices LAX and SJC, showing a vertex property and an edge property
GraphX – Example
To understand GraphX, let us consider the graph below.
➢ The vertices hold the names and ages of people.
➢ The edges represent whether one person likes another, and each edge weight is a measure of how much.
Figure: A graph with six vertices, numbered 1 to 6: Alice (age 28), Bob (27), Charlie (65), David (42), Ed (55) and Fran (50), joined by weighted edges (weights shown: 7, 4, 1, 3, 3, 2)
GraphX – Example
Display names and ages
// Assumes a spark-shell session (sc is the SparkContext) and that
// vertexArray: Array[(Long, (String, Int))] and edgeArray: Array[Edge[Int]]
// have already been built from the graph above
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Print the name and age of every person older than 30
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
  case (id, (name, age)) => println(s"$name is $age")
}
Output
David is 42
Fran is 50
Ed is 55
Charlie is 65
Display Relations
// Read the source and destination names from each triplet and print who likes whom
for (triplet <- graph.triplets.collect) {
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
Output
Bob likes Alice
Bob likes David
Charlie likes Bob
Charlie likes Fran
David likes Alice
Ed likes Bob
Ed likes Charlie
Ed likes Fran
Use Case: Analyze Flight Data
Using Spark GraphX
Use Case: Problem Statement
To analyse real-time flight data using Spark GraphX, provide near real-time computation results and visualize the results using Google Data Studio.
Computations to be done:
➢ Compute the total number of flight routes
➢ Compute and sort the longest flight routes
➢ Display the airport with the highest vertex degree
➢ List the most important airports according to PageRank
➢ List the routes with the lowest flight costs
We will use Spark GraphX for the above computations and visualize the results using Google Data Studio.
Use Case: Flight Dataset
The attributes of each row are as follows:
1. Day Of Month
2. Day Of Week
3. Carrier Code
4. Unique ID- Tail Number
5. Flight Number
6. Origin Airport ID
7. Origin Airport Code
8. Destination Airport ID
9. Destination Airport Code
10. Scheduled Departure Time
11. Actual Departure Time
12. Departure Delay In Minutes
13. Scheduled Arrival Time
14. Actual Arrival Time
15. Arrival Delay Minutes
16. Elapsed Time
17. Distance
Figure: USA Airport Flight Data
Use Case: Flow Diagram
1. Huge amount of flight data
2. Database storing real-time flight data
3. Creating a graph using GraphX (USA flight mapping)
4. Queries on the graph:
➢ Calculate the top busiest airports
➢ Compute the longest flight routes
➢ Calculate the routes with the lowest flight costs
5. Visualizing using Google Data Studio
Use Case: Starting Spark Shell
//Importing the necessary classes
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

//Creating a case class 'Flight' with one field per column of the dataset
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: Long, origin: String, dest_id: Long, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

//Defining a 'parseFlight' function to parse one comma-separated line into a 'Flight'
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5).toLong, line(6),
    line(7).toLong, line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}
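The parser above throws on a short or malformed line, which would fail the whole job. One possible hardening, shown here as a sketch rather than as part of the original use case, is to return an Option and drop bad rows. To stay short, the example keeps only three of the 17 columns (indices 6, 8 and 16, matching the attribute list); FlightParseSketch, Route, parseRoute and both sample lines are invented for illustration.

```scala
import scala.util.Try

// Plain Scala sketch of defensive CSV parsing; no Spark dependency.
// FlightParseSketch, Route and the sample lines are hypothetical.
object FlightParseSketch {
  // A trimmed-down record holding three of the 17 columns.
  final case class Route(origin: String, dest: String, dist: Int)

  // Returns None instead of throwing when a line has too few columns
  // or a distance that is not an integer.
  def parseRoute(str: String): Option[Route] = {
    val cols = str.split(",", -1).map(_.trim)
    if (cols.length < 17) None
    else Try(Route(cols(6), cols(8), cols(16).toInt)).toOption
  }

  // Made-up sample input: one complete line and one truncated line.
  val good = "1,3,AA,N338AA,1,12478,JFK,12892,LAX,900,914,14,1225,1238,13,385,2475"
  val bad  = "1,3,AA,N338AA,1,12478,JFK"

  // flatMap keeps the rows that parsed and silently drops the rest.
  val parsed: Seq[Route] = Seq(good, bad).flatMap(parseRoute(_))
}
```

The same flatMap pattern could be applied to an RDD of lines so that a handful of malformed rows does not abort the analysis; whether dropping rows silently is acceptable depends on the dataset.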
Use Case: Creating Edges For Graph Mapping
//Load the data into an RDD 'textRDD'
val textRDD = sc.textFile("/home/edureka/Downloads/AirportDataset.csv")

//Parse the RDD of CSV lines into an RDD of Flight objects and cache it for reuse
val flightsRDD = textRDD.map(parseFlight).cache()

//Create airports RDD with ID and name
val airports = flightsRDD.map(flight => (flight.org_id, flight.origin)).distinct
airports.take(1)

//Define a default vertex called 'nowhere' and build an airport ID -> name map for printing
val nowhere = "nowhere"
val airportMap = airports.map { case (org_id, name) => (org_id -> name) }.collect.toList.toMap

//Create routes RDD with sourceID, destinationID and distance
val routes = flightsRDD.map(flight => ((flight.org_id, flight.dest_id), flight.dist)).distinct
routes.take(2)

//Create edges RDD with sourceID, destinationID and distance
val edges = routes.map { case ((org_id, dest_id), distance) => Edge(org_id, dest_id, distance) }
edges.take(1)
Use Case: Routes & Edge Triplets
//Define the graph and display some vertices and edges
val graph = Graph(airports, edges, nowhere)
graph.vertices.take(2)
graph.edges.take(2)

//Find the number of airports
val numairports = graph.numVertices

//Calculate the total number of routes
val numroutes = graph.numEdges

//Find routes with distances of more than 1000 miles
graph.edges.filter { case Edge(org_id, dest_id, distance) => distance > 1000 }.take(3)

//Print the first three edge triplets
graph.triplets.take(3).foreach(println)

//Sort and print the ten longest routes
graph.triplets.sortBy(_.attr, ascending = false).map(triplet =>
  "Distance " + triplet.attr.toString + " from " + triplet.srcAttr + " to " + triplet.dstAttr + "."
).take(10).foreach(println)
Use Case: Vertex Degree Computation
//Define a reduce operation that keeps the vertex with the higher degree
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

//Compute the highest-degree vertices for incoming, outgoing and all routes
val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = graph.degrees.reduce(max)

//Get the airport names for IDs 10397 and 12478
airportMap(10397)
airportMap(12478)

//Find the three airports with the most incoming routes
val maxIncoming = graph.inDegrees.collect.sortWith(_._2 > _._2).map(x => (airportMap(x._1), x._2)).take(3)
maxIncoming.foreach(println)

//Find the three airports with the most outgoing routes
val maxout = graph.outDegrees.join(airports).sortBy(_._2._1, ascending = false).take(3)
maxout.foreach(println)
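What the degree computation does can be checked on a handful of routes without Spark. The sketch below uses plain Scala collections and made-up routes; DegreeSketch is a hypothetical name, and its max function mirrors the shape of the reduce function above but works on airport codes instead of vertex IDs.

```scala
// Plain Scala sketch of degree counting and the max reduce; no Spark dependency.
// DegreeSketch and the sample routes are hypothetical.
object DegreeSketch {
  // (source airport, destination airport)
  val routes: Seq[(String, String)] = Seq(
    ("ATL", "JFK"), ("ATL", "LAX"), ("ATL", "SFO"),
    ("JFK", "LAX"), ("SFO", "LAX"), ("LAX", "ATL")
  )

  // Out-degree: number of edges leaving each vertex.
  val outDegrees: Map[String, Int] =
    routes.groupBy(_._1).map { case (airport, rs) => airport -> rs.size }

  // In-degree: number of edges arriving at each vertex.
  val inDegrees: Map[String, Int] =
    routes.groupBy(_._2).map { case (airport, rs) => airport -> rs.size }

  // Keep the pair with the larger degree.
  def max(a: (String, Int), b: (String, Int)): (String, Int) =
    if (a._2 > b._2) a else b

  val maxOutDegree: (String, Int) = outDegrees.toSeq.reduce(max)
  val maxInDegree: (String, Int) = inDegrees.toSeq.reduce(max)
}
```

Because the edges are distinct routes, a degree here counts connected routes rather than individual flights. When two airports tie, the winner of the reduce depends on evaluation order, so the sample data avoids ties.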
Use Case: Graph PageRank
//Find the most important airports according to PageRank (0.1 is the convergence tolerance)
val ranks = graph.pageRank(0.1).vertices

//Join the ranks with the airport names: (id, (rank, name))
val temp = ranks.join(airports)
temp.take(1)

//Sort the airports by rank, highest first
val temp2 = temp.sortBy(_._2._1, false)
temp2.take(2)

//Display the most important airports
val impAirports = temp2.map(_._2._2)
impAirports.take(4)
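To see why a well-connected airport ends up with a high rank, the PageRank update can be run by hand on a tiny graph. The following is a simplified sketch in plain Scala and not GraphX's implementation: it runs a fixed number of rounds instead of stopping at a tolerance, and PageRankSketch, the three routes and the choice of 100 iterations are assumptions made for illustration. The reset probability of 0.15 matches GraphX's default.

```scala
// Plain Scala sketch of iterative PageRank; no Spark dependency.
// PageRankSketch and the sample routes are hypothetical, and this is not GraphX's implementation.
object PageRankSketch {
  val edges: Seq[(String, String)] = Seq(("JFK", "ATL"), ("LAX", "ATL"), ("ATL", "JFK"))

  val vertices: Set[String] = edges.flatMap { case (s, d) => Seq(s, d) }.toSet
  val outDegree: Map[String, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  // One round: every vertex shares its rank equally along its outgoing edges,
  // then each vertex combines the reset probability with what it received.
  def step(ranks: Map[String, Double], resetProb: Double): Map[String, Double] = {
    val received: Map[String, Double] = edges
      .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
      .groupBy(_._1)
      .map { case (v, contribs) => v -> contribs.map(_._2).sum }
    vertices.map(v => v -> (resetProb + (1.0 - resetProb) * received.getOrElse(v, 0.0))).toMap
  }

  def pageRank(iterations: Int, resetProb: Double = 0.15): Map[String, Double] = {
    val initial: Map[String, Double] = vertices.map(v => v -> 1.0).toMap
    (1 to iterations).foldLeft(initial)((ranks, _) => step(ranks, resetProb))
  }

  val ranks: Map[String, Double] = pageRank(iterations = 100)
  val mostImportant: String = ranks.maxBy(_._2)._1
}
```

In this sample ATL receives rank from two airports and comes out on top, while LAX has no incoming routes and settles at the reset probability.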
Use Case: Pregel Computation
//Implementing Pregel: single-source shortest paths (lowest flight cost) from airport 13024
val sourceId: VertexId = 13024
//Replace each edge distance with a flight cost: 50 + distance / 20
val gg = graph.mapEdges(e => 50.toDouble + e.attr.toDouble / 20)
//Start with cost 0 at the source and infinity everywhere else
val initialGraph = gg.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  //Vertex program: keep the lower of the current and the received cost
  (id, dist, newDist) => math.min(dist, newDist),
  //Send message: offer a cheaper cost to the destination when one exists
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  //Merge messages: keep the cheapest offer
  (a, b) => math.min(a, b)
)
//Find the routes with the lowest flight costs
println(sssp.edges.take(4).mkString("\n"))
//Find airports and their lowest flight costs
println(sssp.vertices.take(4).mkString("\n"))
//Display airport codes along with sorted lowest flight costs
sssp.vertices.collect.map(x => (airportMap(x._1), x._2)).sortWith(_._2 < _._2)
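The Pregel call above amounts to relaxing edges repeatedly until no vertex can be reached more cheaply. The sketch below reproduces that idea in plain Scala on made-up routes, using the same cost model of 50 plus distance divided by 20. It is not the Pregel API: ShortestPathSketch, lowestCosts and the four sample routes are hypothetical, and each pass of the loop stands in for one superstep.

```scala
// Plain Scala sketch of single-source shortest paths by repeated relaxation; no Spark dependency.
// ShortestPathSketch and the sample routes are hypothetical.
object ShortestPathSketch {
  // (source, destination, distance in miles)
  val routes: Seq[(String, String, Int)] = Seq(
    ("SFO", "DEN", 1000), ("DEN", "JFK", 1600), ("SFO", "JFK", 2600), ("DEN", "PHX", 600)
  )

  // Same cost model as the mapEdges call: a flat 50 plus distance / 20.
  def cost(distance: Int): Double = 50.0 + distance.toDouble / 20

  def lowestCosts(source: String): Map[String, Double] = {
    val vertices = routes.flatMap { case (s, d, _) => Seq(s, d) }.distinct
    val initial: Map[String, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap

    // Each round plays the role of one superstep: offer candidate costs along edges,
    // keep the minimum per vertex, and stop when nothing improves.
    @annotation.tailrec
    def loop(dist: Map[String, Double]): Map[String, Double] = {
      val next = routes.foldLeft(dist) { case (acc, (src, dst, miles)) =>
        val candidate = dist(src) + cost(miles)
        if (candidate < acc(dst)) acc.updated(dst, candidate) else acc
      }
      if (next == dist) dist else loop(next)
    }
    loop(initial)
  }

  val fromSfo: Map[String, Double] = lowestCosts("SFO")
}
```

With this cost model every leg adds a flat 50, so in the sample the direct SFO to JFK route is cheaper than connecting through DEN even though the connecting legs cover the same total distance.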
Use Case: Visualizing Results
➢ We will be using Google Data Studio to visualize our analysis
➢ Google Data Studio is a product under the Google Analytics 360 Suite
➢ The image shows a sample marketing website summary using Geo Map, Time Series and Bar Chart
➢ We will use the Geo Map service to map the airports to their respective locations on the USA map and display the metric values
1. Display the total number of flights per airport
Figure: Total Number of Flights from New York
2. Display the metric sum of destination routes from every airport
Figure: Measure of the total outgoing traffic from Los Angeles
3. Display the total delay of all flights per airport
Figure: Total Delay of all Flights at Atlanta
Conclusion
Congrats!
We have demonstrated the power of Spark in real-time data analytics.
The hands-on examples should give you the confidence to work on future Apache Spark projects.
Thank You …
Questions/Queries/Feedback

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Training | Edureka