Apache Spark
Shima jafari
Overview
● Introduction
● What is Apache Spark
● Spark stack
● RDD
● Operation
● Sample
● Architecture
● Spark Streaming
● Kafka Streaming
Map-Reduce
● It is a two-step process: map, then reduce
● Once data has been processed through map and reduce, the result has to be written back to storage before the next job can use it
● This makes it inefficient for iterative and interactive computing jobs
Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like
support for in-memory storage and efficient fault recovery.
Apache Spark
● Speed
● Ease Of Use
What is Apache Spark
Apache Spark is a cluster computing platform designed to be fast and general-purpose.
Its main feature is in-memory cluster computing, which increases the processing
speed of an application.
Who uses Spark, and for what?
● Data science tasks
○ Analyze and model data
● Data processing applications
○ Parallelize applications across a cluster
The Spark Stack
Resilient Distributed Dataset (RDD)
● In-memory computation
● Lazy Evaluation
● Fault Tolerance
● Immutability
● Persistence
● Partitioning
● Parallel
Spark Operation
● Transformation
○ Creates a new dataset from an existing one.
● Action
○ Returns a value to the driver program after running a computation on the dataset.
Spark Operation
Transformation            | Action
Map / MapPartitions       | Reduce
FlatMap                   | Count / CountByKey
Filter                    | Foreach
SortByKey                 | Save as...
GroupByKey / ReduceByKey  | First / Take
Union / Join              | Collect
Cartesian                 | ...
...                       |
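To make the two columns concrete, here is a minimal sketch in plain Python (not the Spark API) that mimics a few of the listed operations on an ordinary list. The helper `reduce_by_key` and the sample `lines` are assumptions made for illustration; in Spark the same steps would run in parallel across a cluster.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

lines = ["spark is fast", "spark is general"]

# Transformations: each step derives a new collection from the previous one.
words = list(chain.from_iterable(line.split() for line in lines))  # like flatMap
pairs = [(w, 1) for w in words]                                    # like map
long_pairs = [(w, n) for (w, n) in pairs if len(w) > 2]            # like filter


def reduce_by_key(key_value_pairs, func):
    """Hypothetical helper: combine the values of each key, like reduceByKey."""
    grouped = defaultdict(list)
    for key, value in key_value_pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


counts = reduce_by_key(long_pairs, lambda a, b: a + b)

# Actions: each one hands a plain value back to the caller (the "driver").
total = reduce(lambda a, b: a + b, (n for _, n in long_pairs))     # like reduce
n_pairs = len(long_pairs)                                          # like count
first_two = long_pairs[:2]                                         # like take(2)
```

The transformations only describe new datasets, while the actions are the points where a concrete result (a number, a list) comes back to the program.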
Spark Transformation
● Narrow
○ Map / MapPartitions
○ FlatMap
○ Filter
○ Sample
○ Union
● Wide
○ Join
○ Intersection
○ Distinct
○ ReduceByKey / GroupByKey
○ Cartesian
○ Repartition
○ Coalesce (only when a shuffle is requested; by default it avoids one)
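The difference between the two groups can be sketched in plain Python, assuming a dataset is just a list of partitions. The function names and the toy partitioner below are hypothetical and are not Spark code: a narrow step touches each partition on its own, while a wide step has to move records between partitions (a shuffle) so that equal keys end up together.

```python
# Two partitions of (key, value) records, as they might sit on two executors.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]


def narrow_map_values(parts, func):
    """Narrow: every output partition depends on exactly one input partition."""
    return [[(k, func(v)) for k, v in part] for part in parts]


def shuffle_by_key(parts, num_partitions):
    """Wide: records are redistributed so that equal keys share a partition."""
    out = [[] for _ in range(num_partitions)]
    for part in parts:
        for k, v in part:
            # Assumption: a deterministic toy partitioner based on the first character.
            out[ord(k[0]) % num_partitions].append((k, v))
    return out


def reduce_within_partitions(parts, func):
    """After the shuffle, each key can be reduced locally inside its partition."""
    result = []
    for part in parts:
        acc = {}
        for k, v in part:
            acc[k] = func(acc[k], v) if k in acc else v
        result.append(sorted(acc.items()))
    return result


doubled = narrow_map_values(partitions, lambda v: v * 2)         # no data movement
shuffled = shuffle_by_key(doubled, num_partitions=2)             # data movement
summed = reduce_within_partitions(shuffled, lambda a, b: a + b)  # per-key totals
```

The shuffle in the middle is the expensive part, because in a real cluster it moves data between machines.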
Lazy evaluation
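Lazy evaluation means that transformations are only recorded, and nothing is computed until an action asks for a result. The following is a toy sketch in plain Python; `LazyDataset` is a hypothetical stand-in and not Spark's RDD class. It keeps a list of pending steps and runs them only inside `collect()` or `count()`.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records steps and runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset with one more recorded step.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Replays every recorded step for each source item.
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a plain value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # side effect used only to observe when the function runs
    return x * x


squares = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still zero: nothing has executed yet
result = squares.collect()        # the action runs the recorded steps
calls_after_action = len(calls)
```

Because each action replays the recorded steps, calling a second action would run `square` again. That is the situation in which caching an intermediate dataset pays off.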
Sample
Movies similarity:
nameDict = loadMovieNames()
data = sc.textFile("/SparkCourse/ml-100k/u.data")
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split()).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
joinedRatings = ratings.join(ratings)
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs)
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) => (rating1, rating2), (rating1, rating2) …
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).cache()
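The sample calls `filterDuplicates`, `makePairs`, and `computeCosineSimilarity` without showing them. Below is one possible sketch of these helpers in plain Python, written to match the record shapes described in the comments above. The bodies, and in particular the `(score, numPairs)` return value, are assumptions and not necessarily the original implementation; `loadMovieNames` is not covered because its input file is not shown.

```python
from math import sqrt


def filterDuplicates(userRatings):
    """Keep each movie pair once: drop (A, A) and keep only one of (A, B) / (B, A)."""
    _, ((movie1, _), (movie2, _)) = userRatings
    return movie1 < movie2


def makePairs(userRatings):
    """Re-key userID => ((movie1, rating1), (movie2, rating2)) by the movie pair."""
    _, ((movie1, rating1), (movie2, rating2)) = userRatings
    return ((movie1, movie2), (rating1, rating2))


def computeCosineSimilarity(ratingPairs):
    """Cosine similarity of the two rating vectors built from (ratingX, ratingY) pairs."""
    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0.0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = sum_xy / denominator if denominator else 0.0
    return (score, numPairs)


# Local check on a handful of hand-written records (no Spark needed).
joined = [
    (1, ((10, 4.0), (10, 4.0))),  # same movie twice: dropped
    (1, ((10, 4.0), (20, 2.0))),  # kept
    (1, ((20, 2.0), (10, 4.0))),  # mirror of the previous record: dropped
]
kept = [makePairs(r) for r in joined if filterDuplicates(r)]
score, numPairs = computeCosineSimilarity([(4.0, 2.0), (2.0, 1.0)])
```

Returning the number of co-ratings next to the score makes it possible to discard pairs that only a few users rated together, since such scores are unreliable.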
Architecture
Terms
● Driver Program
● Cluster manager
● Executor
● Job
● Task
● Stage
Spark Streaming
Streaming Flow:
Streaming Program Structure:
After a context is defined, you have to do the following.
1. Define the input sources by creating input DStreams.
2. Define the streaming computations by applying transformation and output operations to
DStreams.
3. Start receiving data and processing it using streamingContext.start().
4. Wait for the processing to be stopped (manually or due to any error) using
streamingContext.awaitTermination().
5. The processing can be manually stopped using streamingContext.stop().
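The five steps above can be sketched with the PySpark DStream API as a socket word count. The function name, host, port, batch interval, and timeout are assumptions chosen for illustration. PySpark is imported inside the function, so it must be installed and a text server (for example `nc -lk 9999`) must be listening before the function is actually called.

```python
def run_streaming_word_count(host="localhost", port=9999,
                             batch_seconds=1, timeout_seconds=None):
    """Hypothetical example: count words arriving on a TCP socket in micro-batches."""
    # Imported here so this module can be loaded where PySpark is not installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Define the context: at least two local threads, one of them for the receiver.
    sc = SparkContext("local[2]", "StreamingWordCount")
    ssc = StreamingContext(sc, batch_seconds)

    # 1. Define the input source by creating an input DStream.
    lines = ssc.socketTextStream(host, port)

    # 2. Define the computation: transformations followed by an output operation.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    # 3. Start receiving and processing data.
    ssc.start()
    try:
        # 4. Wait until processing stops (or until the optional timeout expires).
        if timeout_seconds is None:
            ssc.awaitTermination()
        else:
            ssc.awaitTerminationOrTimeout(timeout_seconds)
    finally:
        # 5. Stop the streaming context (and, by default, the SparkContext too).
        ssc.stop()
```

Calling `run_streaming_word_count(timeout_seconds=30)` would print the word counts of each one-second batch for about thirty seconds and then shut down.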
Discretized Stream (DStream)
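A DStream represents a continuous stream as a sequence of RDDs, each holding the data that arrived during one batch interval. The idea can be sketched in plain Python, assuming events are `(timestamp, value)` tuples with non-negative timestamps; `discretize` is a hypothetical helper and not part of Spark.

```python
def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    if not events:
        return []
    last_index = max(int(t // batch_seconds) for t, _ in events)
    batches = [[] for _ in range(last_index + 1)]
    for t, value in events:
        batches[int(t // batch_seconds)].append(value)
    return batches


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_seconds=1)

# A transformation on the stream is applied to every batch independently.
upper_batches = [[value.upper() for value in batch] for batch in batches]
```

An interval with no events still produces an (empty) batch, which mirrors how a streaming job keeps ticking at a fixed batch interval.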
Source:
● https://www.kdnuggets.com/2018/07/introduction-apache-spark.html
● https://stackoverflow.com/questions/32621990/what-are-workers-executors-cores-in-spark-standalone-cluster
● https://spark.apache.org/docs/latest/cluster-overview.html
● https://spark.apache.org/docs/latest/streaming-programming-guide.html
● https://dzone.com/articles/spark-streaming-vs-kafka-stream-1
● https://www.edureka.co/blog/spark-architecture/
