Spark & Spark Streaming Internals 
Akhil Das 
akhil@sigmoidanalytics.com
Apache Spark 
Spark Stack
Spark Internals
Resilient Distributed Dataset (RDD) 
Restricted form of distributed shared memory 
➔ Immutable 
➔ Can only be built through deterministic transformations (textFile, map, filter, join, …)
Efficient fault recovery using lineage 
➔ Recompute lost partitions on failure 
➔ No cost if nothing fails
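The lineage idea above can be sketched in plain Scala without Spark. This is only a toy model under stated assumptions: `MiniRDD`, `Source`, `Mapped`, `Filtered`, and `PartitionCache` are hypothetical names invented for illustration and are not Spark classes. Each dataset stores how to compute a partition from its parent rather than the data itself, so a lost partition can be rebuilt by replaying the deterministic transformations.

```scala
object LineageSketch {
  // A toy stand-in for an RDD: it holds no results, only how to compute each partition.
  sealed trait MiniRDD[A] {
    def numPartitions: Int
    def compute(partition: Int): Vector[A]
    def map[B](f: A => B): MiniRDD[B] = Mapped(this, f)
    def filter(p: A => Boolean): MiniRDD[A] = Filtered(this, p)
  }

  // Base dataset: partitions come straight from a deterministic source.
  final case class Source[A](parts: Vector[Vector[A]]) extends MiniRDD[A] {
    def numPartitions: Int = parts.length
    def compute(partition: Int): Vector[A] = parts(partition)
  }

  // Lineage node: remembers its parent and the function applied to it.
  final case class Mapped[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
  }

  final case class Filtered[A](parent: MiniRDD[A], p: A => Boolean) extends MiniRDD[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
  }

  // In-memory cache of computed partitions that can "lose" entries, as a failed node would.
  final class PartitionCache[A](rdd: MiniRDD[A]) {
    private val cached = scala.collection.mutable.Map.empty[Int, Vector[A]]
    var recomputations = 0

    // Return the cached partition, or rebuild it from lineage if it is missing.
    def get(partition: Int): Vector[A] =
      cached.getOrElseUpdate(partition, { recomputations += 1; rdd.compute(partition) })

    // Simulate a failure that drops one partition from memory.
    def lose(partition: Int): Unit = cached.remove(partition)
  }
}
```

In this sketch only the lost partition is recomputed, and nothing extra is done while no partition is lost, which mirrors the "no cost if nothing fails" point.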
RDD Operations 
Transformations 
➔ map/flatMap
➔ filter 
➔ union/join/groupBy 
➔ cache 
……. 
Actions 
➔ collect/count 
➔ save 
➔ take 
…….
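In Spark, transformations are lazy and only describe a computation, while an action triggers it. The split can be illustrated with a plain Scala `Iterator` as a stand-in for an RDD. This is a sketch of the idea only: `LazySketch`, `plan`, and `touched` are hypothetical names, and an `Iterator` is single-use, unlike an RDD.

```scala
object LazySketch {
  // Counts how many input elements have actually been processed.
  var touched = 0

  // "Transformations": map and filter on an Iterator only record the pipeline;
  // no element is processed until something consumes the result.
  def plan(data: Vector[Int]): Iterator[Int] =
    data.iterator
      .map { x => touched += 1; x * 2 }
      .filter(_ % 3 == 0)

  // "Action": materializing the result forces the whole pipeline to run.
  def collect(pipeline: Iterator[Int]): Vector[Int] = pipeline.toVector
}
```

Calling `plan` leaves `touched` at zero; it only increases once `collect` consumes the iterator.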
Log Mining Example 
Load error messages from a log into memory, then interactively search
for various patterns:
val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))         // third tab-separated field
messages.persist()                                  // mark for in-memory reuse
messages.filter(_.contains("foo")).count            // action
messages.filter(_.contains("bar")).count            // action, reuses the persisted messages
What is Spark Streaming? 
Framework for large-scale stream processing
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.
Overview 
Run a streaming computation as a series of very small, deterministic batch jobs
[Diagram labels: Spark Streaming, Spark]
- Spark Streaming chops up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
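The batching steps above can be sketched with ordinary Scala collections. This is a simplified model, not Spark Streaming code: `Event`, `toBatches`, and `process` are hypothetical names, the whole "stream" is assumed to be available up front, and intervals with no data are simply skipped here.

```scala
object MicroBatchSketch {
  // A stream record tagged with its arrival time in milliseconds.
  final case class Event(timeMs: Long, value: String)

  // Chop the stream into consecutive batches of `batchMs` milliseconds.
  // Each batch is keyed by its index (arrival time divided by batch length).
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (idx, es) => (idx, es.sortBy(_.timeMs).map(_.value)) }

  // Run the same deterministic batch job on every batch; one result per batch.
  def process[R](batches: Seq[(Long, Seq[String])])(job: Seq[String] => R): Seq[(Long, R)] =
    batches.map { case (idx, data) => (idx, job(data)) }
}
```

For example, with one-second batches, events arriving at 100 ms and 900 ms land in batch 0 and an event at 1200 ms lands in batch 1; passing `_.size` as the job then yields a per-batch record count.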
Key Concepts 
➔ DStream – sequence of RDDs representing a stream of data
- Input sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
➔ Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
➔ Output Operations – send data to an external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
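A windowed, stateful operation such as countByValueAndWindow can be pictured as a count over several consecutive batches. The sketch below models a DStream as a plain `Seq` of batches and is not the Spark implementation: `windowedCounts` is a hypothetical helper, and the window length and slide are expressed in numbers of batches rather than durations.

```scala
object WindowSketch {
  // Each inner Seq is one batch of the stream (one RDD in a DStream).
  // For every window of `windowLen` consecutive batches, advancing by `slide`
  // batches, count how often each value occurs inside that window.
  def windowedCounts[A](batches: Seq[Seq[A]], windowLen: Int, slide: Int): Vector[Map[A, Int]] =
    batches
      .sliding(windowLen, slide)
      .map { window =>
        window.flatten.groupBy(identity).map { case (value, hits) => value -> hits.size }
      }
      .toVector
}
```

With a window of two batches and a slide of one, each batch contributes to two consecutive windows, which is why such operations need to keep data (or partial counts) from earlier batches around.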
Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))   // transformation
hashTags.saveAsHadoopFiles("hdfs://...")   // output operation
Sample hashtags: #Ebola, #India, #Mars ...
Thank You