© 2015 MapR Technologies
Implementing the Lambda Architecture
efficiently with Apache Spark
Polyglot Processing
• Combination of different processing engines over DFS/NoSQL stores
• Lambda and Kappa architectures are two prominent examples
datadventures.ghost.io/2014/07/06/polyglot-processing/
Fault tolerance
hardware
software
developer
?
Let’s talk about developers…
xkcd.com/327/
Human fault tolerance
When things go wrong …
• 2010: unfortunate handling of error condition
– http://allfacebook.com/the-real-reason-facebook-went-down-yesterday-its-complicated_b19366
• 2012: cascaded bug
– http://money.cnn.com/2012/06/21/technology/twitter-down/index.htm
• 2012: upgrade of batch processing
– http://www.v3.co.uk/v3-uk/news/2196577/rbs-takes-gbp125m-hit-over-it-outage
• 2014: bug/bad config
– http://www.androidcentral.com/google-explains-reasons-behind-today-s-30-minute-service-outage
Lambda Architecture to the rescue!
Let’s step back a bit …
• Nathan Marz (Backtype, Twitter, stealth startup)
• Creator of …
– Storm
– Cascalog
– ElephantDB
manning.com/marz/
Lambda Architecture—Requirements
• Fault-tolerant against both hardware failures and human errors
• Support a variety of use cases that include low-latency querying as well as updates
• Linear scale-out capabilities
• Extensible, so that the system is manageable and can accommodate newer features easily
Lambda Architecture—Concept
• Latency: the time it takes to run a query
• Timeliness: how up to date the query results are (consistency)
• Accuracy: tradeoff between performance and scalability (approximations)
query = function(all data)
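The equation above can be read literally: a query is a pure function applied to the complete dataset. A minimal plain-Python sketch of that reading follows; the event list and the `count_by_airport` query are hypothetical names chosen for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical "all data": every event ever recorded, only appended to.
ALL_DATA = [
    {"airport": "DUB", "flight": "EI123", "action": "take-off"},
    {"airport": "HEL", "flight": "SAS45", "action": "take-off"},
    {"airport": "DUB", "flight": "EI124", "action": "landing"},
]


def count_by_airport(all_data):
    """A query written as a pure function of all data: no hidden state, no side effects."""
    return Counter(event["airport"] for event in all_data)


result = count_by_airport(ALL_DATA)
print(result["DUB"])  # 2
```

Because the function depends only on its input, re-running it over the same data always gives the same answer, which is what makes recomputation a usable recovery strategy after a bug.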
Lambda Architecture
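[Diagram: the new data stream feeds both the batch layer and the speed layer. The batch layer holds the immutable master data and precomputes the batch views (View 1 to View N) by batch recompute. The speed layer processes the stream and increments the real-time views (View 1 to View N). The serving layer merges batch views and real-time views to answer a query.]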
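The interplay of the three layers in the diagram can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not a Spark program: the function names and the per-airport counting view are hypothetical, and the "views" are in-memory counters.

```python
from collections import Counter


def batch_recompute(master_data):
    """Batch layer: rebuild the view from scratch over the whole master dataset."""
    return Counter(event["airport"] for event in master_data)


def realtime_increment(realtime_view, event):
    """Speed layer: update the real-time view incrementally, one event at a time."""
    realtime_view[event["airport"]] += 1


def merged_query(batch_view, realtime_view, airport):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view[airport] + realtime_view[airport]


# Data already absorbed by the last batch run.
master_data = [{"airport": "DUB"}, {"airport": "HEL"}, {"airport": "DUB"}]
batch_view = batch_recompute(master_data)

# Events that arrived after the last batch run started.
realtime_view = Counter()
for event in [{"airport": "DUB"}, {"airport": "AMS"}]:
    realtime_increment(realtime_view, event)

print(merged_query(batch_view, realtime_view, "DUB"))  # 3
```

In this sketch the real-time view only has to cover the events the batch view has not seen yet; once a new batch run absorbs them, the corresponding real-time entries can be discarded.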
Lambda Architecture—Compensate Batch
[Figure: timeline ending at "now"; the most recent stretch of data is marked "not absorbed".]
Lambda Architecture—Immutable Data + Views
openflights.org
Lambda Architecture—Immutable Data + Views
immutable master dataset:

timestamp            airport  flight  action
2014-01-01T10:00:00  DUB      EI123   take-off
2014-01-01T10:05:00  HEL      SAS45   take-off
2014-01-01T10:07:00  AMS      BA99    take-off
2014-01-01T10:09:00  LHR      LH17    landing
2014-01-01T10:10:00  CDG      AF03    landing
2014-01-01T10:10:00  FCO      AZ501   take-off

views:

airport load:
airport  planes
AMS      69
CDG      44
DUB      31
FCO      10
HEL      17
LHR      101

air-borne per airline:
airline  planes
AF       59
AZ       23
BA       167
EI       19
LH       201
SAS      28

air-borne: 2307
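How views are derived from the immutable master dataset can be sketched in plain Python using only the six events shown on the slide. The figures in the slide's views come from a larger dataset, so the numbers here will not match them. The function names are hypothetical, taking the airline code from the leading letters of the flight number is an assumption, and "movements per airport" is a simplified stand-in for the airport load view.

```python
import re
from collections import Counter

# The six events from the slide; rows are only ever appended, never changed.
MASTER = [
    ("2014-01-01T10:00:00", "DUB", "EI123", "take-off"),
    ("2014-01-01T10:05:00", "HEL", "SAS45", "take-off"),
    ("2014-01-01T10:07:00", "AMS", "BA99", "take-off"),
    ("2014-01-01T10:09:00", "LHR", "LH17", "landing"),
    ("2014-01-01T10:10:00", "CDG", "AF03", "landing"),
    ("2014-01-01T10:10:00", "FCO", "AZ501", "take-off"),
]


def airline_of(flight):
    """Assumption: the airline code is the leading run of letters in the flight number."""
    return re.match(r"[A-Z]+", flight).group(0)


def movements_per_airport(events):
    """Simplified stand-in for the airport load view: events seen per airport."""
    return Counter(airport for _, airport, _, _ in events)


def airborne_flights(events):
    """Replay events in time order: a take-off adds a flight, a landing removes it."""
    airborne = set()
    for _, _, flight, action in sorted(events):
        if action == "take-off":
            airborne.add(flight)
        else:
            airborne.discard(flight)  # landing of a flight whose take-off is not in the data
    return airborne


def airborne_per_airline(events):
    return Counter(airline_of(flight) for flight in airborne_flights(events))


view_airborne = airborne_per_airline(MASTER)
print(sum(view_airborne.values()))  # 4
```

Each view is a function of the master dataset alone, so a view that turns out to be wrong can be dropped and recomputed without touching the recorded events.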
Implementing the Lambda Architecture
How about an integrated approach?
• Twitter Summingbird
• Lambdoop
• Apache Spark
Apache Spark
Apache Spark—a unified platform …
spark.apache.org
[Diagram: the Spark stack]
• Libraries: Spark SQL (SQL/HQL), Spark Streaming (stream processing), MLlib (machine learning), GraphX (graph processing)
• Spark (core execution engine: RDDs)
• Cluster managers: Mesos, YARN, Standalone
• Storage: file system (local, MapR-FS, HDFS, S3) or data store (HBase, Elasticsearch, etc.)
Continued innovation bringing new functionality, such as:
• Tachyon (Shared RDDs, off-heap solution)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
Easy and fast Big Data
• Easy to Develop
– Rich APIs available through Java, Scala, Python
– Interactive shell
• Fast to Run
– Advanced data storage model (automated optimization between memory and disk)
– General execution graphs
2-5× less code; up to 10× faster than Hadoop MapReduce on disk, 100× in memory
https://amplab.cs.berkeley.edu/benchmark/
… across multiple datasources
• Local Files
• Object Stores (Amazon S3)
• HDFS
– text files, sequence files, any other Hadoop InputFormat
• Key-Value datastores (HBase, C*)
• Elasticsearch
Easy: expressive API
map reduce
Easy: expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
spark.apache.org/docs/latest/programming-guide.html#transformations
… and scale as you go (mentally and physically)
YARN
Standalone
Resilient Distributed Datasets (RDD)
• RDDs are the core of the Spark execution engine
• Collections of elements that can be operated on in parallel
• Persistent in memory between operations
www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Operations
• Lazy evaluation is key to Spark
• Transformations
– Creation of a new dataset from an existing one:
map, filter, distinct, union, sample, groupByKey, join, etc.
• Actions
– Return a value after running a computation:
collect, count, first, takeSample, foreach, etc.
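The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. This is not the Spark API: `LazyDataset` and its methods are hypothetical names that only mimic the behaviour described above, namely that transformations are recorded and nothing is computed until an action runs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, ops=()):
        self._source = source   # underlying data (a re-readable iterable)
        self._ops = tuple(ops)  # recorded transformations, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps as iterators over the source.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # record every invocation to make laziness visible
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = squares.collect()         # the action triggers the whole chain
print(result)  # [9, 16]
```

In this toy version every action re-runs the whole chain from the source; keeping an intermediate result around between actions is a separate, explicit step.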
RDD persistence
spark.apache.org/docs/latest/scala-programming-guide.html
Spark Streaming
• High-level language operators for streaming data
• Fault-tolerant semantics
• Support for merging streaming data with historical data
spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming
Run a streaming computation as a series of small, deterministic batch jobs.
• Chop up live stream into batches of X seconds (DStream)
• Spark treats each batch of data as RDDs and processes them using RDD ops
• Finally, processed results of the RDD operations are returned in batches
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
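The micro-batching idea can be sketched without Spark: cut a timestamped stream into fixed intervals and run the same deterministic batch job on each slice. The helper names and the `(timestamp_seconds, value)` record shape below are assumptions made for illustration.

```python
from collections import defaultdict


def chop_into_batches(records, batch_seconds):
    """Group (timestamp_seconds, value) records into consecutive batches of X seconds.

    Intervals without any records are skipped in this sketch.
    """
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // batch_seconds)].append(value)
    return [batches[key] for key in sorted(batches)]


def process_batch(batch):
    """The same deterministic batch job is applied to every micro-batch."""
    return sum(batch)


stream = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5)]
results = [process_batch(b) for b in chop_into_batches(stream, batch_seconds=1)]
print(results)  # [3, 3, 9]
```

Because each batch job is deterministic, a failed batch can simply be re-run on the same slice of input and will produce the same result.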
Spark Streaming
Run a streaming computation as a series of small, deterministic batch jobs.
• Batch sizes as low as ½ second, latency of about 1 second
• Potential for combining batch processing and streaming processing in the same system
Spark Streaming: Execution
Spark Streaming: transformations
• Stateless transformations
• Stateful transformations
– checkpointing
– windowed transformations
• window duration
• sliding duration
spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
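How window duration and sliding duration interact can be shown with a small plain-Python sketch. It is not the DStream API: `windowed_sums` is a hypothetical helper, and both durations are expressed as a number of batch intervals, which is an assumption made to keep the example short.

```python
def windowed_sums(batches, window_len, slide_len):
    """Sum over the last `window_len` batches, emitted once every `slide_len` batches."""
    results = []
    for end in range(slide_len, len(batches) + 1, slide_len):
        # The window covers the batches ending at `end`; early windows may be shorter.
        window = batches[max(0, end - window_len):end]
        results.append(sum(sum(batch) for batch in window))
    return results


# Six micro-batches, one value each.
batches = [[1], [2], [3], [4], [5], [6]]
print(windowed_sums(batches, window_len=3, slide_len=2))  # [3, 9, 15]
```

With a window of three batches and a slide of two, consecutive windows overlap by one batch, so a single input value can contribute to more than one result. This is why windowed operations count as stateful: data from earlier batches has to be kept around.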
Spark Streaming comparison
• Spark Streaming: 670k records/sec/node
• Storm: 115k records/sec/node
• Commercial systems: 100-500k records/sec/node
[Figure: throughput per node (MB/s) against record size (100 and 1000 bytes) for Spark and Storm, in two panels: WordCount and Grep.]
Where to go from here
The book: Learning Spark
shop.oreilly.com/product/0636920028512.do
Apache Spark developer certificate program
oreilly.com/go/sparkcert
http://lambda-architecture.net
http://spark-stack.org
MapR Sandbox with Spark
mapr.com/blog/getting-started-spark-mapr-sandbox
Conclusion
• Let’s scale systems and humans
• How? Lambda Architecture!
• Apache Spark is an efficient way to implement the Lambda Architecture
$50M in Free Training
Q&A
@mhausenblas maprtech
mhausenblas@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
Hortonworks
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
Vince Gonzalez
 
Ad

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Implementing the Lambda Architecture efficiently with Apache Spark

  • 1. © 2015 MapR Technologies. Implementing the Lambda Architecture efficiently with Apache Spark
  • 2. Polyglot Processing • Combination of different processing engines over DFS/NoSQL stores • Lambda and Kappa architectures are two prominent examples datadventures.ghost.io/2014/07/06/polyglot-processing/
  • 4. Let’s talk about developers…
  • 8. Let’s talk about developers… Human fault tolerance
  • 10. When things go wrong … 2010: unfortunate handling of error condition http://allfacebook.com/the-real-reason-facebook-went-down-yesterday-its-complicated_b19366
  • 11. When things go wrong … 2012: cascaded bug http://money.cnn.com/2012/06/21/technology/twitter-down/index.htm
  • 12. When things go wrong … 2012: upgrade of batch processing http://www.v3.co.uk/v3-uk/news/2196577/rbs-takes-gbp125m-hit-over-it-outage
  • 13. When things go wrong … 2014: bug/bad config http://www.androidcentral.com/google-explains-reasons-behind-today-s-30-minute-service-outage
  • 14. Lambda Architecture to the rescue!
  • 15. Let’s step back a bit … • Nathan Marz (Backtype, Twitter, stealth startup) • Creator of … – Storm – Cascalog – ElephantDB manning.com/marz/
  • 16. Lambda Architecture: Requirements • Fault-tolerant against both hardware failures and human errors • Support a variety of use cases that include low-latency querying as well as updates • Linear scale-out capabilities • Extensible, so that the system is manageable and can accommodate newer features easily
  • 17. Lambda Architecture: Concept • Latency: the time it takes to run a query • Timeliness: how up to date the query results are (consistency) • Accuracy: tradeoff between performance and scalability (approximations) • query = function(all data)
  • 18. Lambda Architecture [diagram: the new data stream feeds both the batch layer (immutable master data, batch recompute, precompute views) and the speed layer (process stream, real-time increment, increment views); the serving layer merges the batch views (View 1 … View N) with the real-time views (View 1 … View N) to answer a query]
  • 20. Lambda Architecture—Immutable Data + Views openflights.org
  • 21. Lambda Architecture: Immutable Data + Views. Immutable master dataset, built up one event at a time:
        timestamp            airport  flight  action
        2014-01-01T10:00:00  DUB      EI123   take-off
        2014-01-01T10:05:00  HEL      SAS45   take-off
        2014-01-01T10:07:00  AMS      BA99    take-off
        2014-01-01T10:09:00  LHR      LH17    landing
        2014-01-01T10:10:00  CDG      AF03    landing
        2014-01-01T10:10:00  FCO      AZ501   take-off
  • 22. Lambda Architecture: Immutable Data + Views. The immutable master dataset (the same six events as on slide 21) and the views derived from it:
        airport load (airport, planes): AMS 69, CDG 44, DUB 31, FCO 10, HEL 17, LHR 101
        air-borne per airline (airline, planes): AF 59, AZ 23, BA 167, EI 19, LH 201, SAS 28
        air-borne: 2307
  • 23. Implementing the Lambda Architecture
  • 26. How about an integrated approach? • Twitter Summingbird • Lambdoop • Apache Spark
  • 28. Apache Spark: a unified platform … spark.apache.org • Libraries: Spark SQL (SQL/HQL), Spark Streaming (stream processing), MLlib (machine learning), GraphX (graph processing) • Spark (core execution engine, RDDs) • Cluster managers: Standalone, YARN, Mesos • Storage: file system (local, MapR-FS, HDFS, S3) or data store (HBase, Elasticsearch, etc.) • Continued innovation bringing new functionality, such as: – Tachyon (shared RDDs, off-heap solution) – BlinkDB (approximate queries) – SparkR (R wrapper for Spark)
  • 29. Easy and fast Big Data • Easy to Develop – Rich APIs available through Java, Scala, Python – Interactive shell • Fast to Run – Advanced data storage model (automated optimization between memory and disk) – General execution graphs • 2-5× less code; up to 10× faster on disk, 100× in memory https://amplab.cs.berkeley.edu/benchmark/
  • 30. … across multiple datasources • Local Files • Object Stores (Amazon S3) • HDFS – text files, sequence files, any other Hadoop InputFormat • Key-Value datastores (HBase, C*) • Elasticsearch
  • 33. … and scale as you go (mentally and physically) YARN Standalone
  • 34. Resilient Distributed Datasets (RDD) • RDDs are the core of the Spark execution engine • Collections of elements that can be operated on in parallel • Persistent in memory between operations www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 35. RDD Operations • Lazy evaluation is key to Spark • Transformations – Create a new dataset from an existing one: map, filter, distinct, union, sample, groupByKey, join, etc. • Actions – Return a value after running a computation: collect, count, first, takeSample, foreach, etc.
  • 37. Spark Streaming • High-level language operators for streaming data • Fault-tolerant semantics • Support for merging streaming data with historical data spark.apache.org/docs/latest/streaming-programming-guide.html
  • 38. Spark Streaming. Run a streaming computation as a series of small, deterministic batch jobs. • Chop up the live stream into batches of X seconds (DStream) • Spark treats each batch of data as RDDs and processes them using RDD ops • Finally, processed results of the RDD operations are returned in batches [diagram: live data stream, Spark Streaming, batches of X seconds, Spark, processed results]
  • 39. Spark Streaming. Run a streaming computation as a series of small, deterministic batch jobs. • Batch sizes as low as ½ second, latency of about 1 second • Potential for combining batch processing and streaming processing in the same system
  • 41. Spark Streaming: transformations • Stateless transformations • Stateful transformations – checkpointing – windowed transformations • window duration • sliding duration spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
  • 42. Spark Streaming comparison • Spark Streaming: 670k records/sec/node • Storm: 115k records/sec/node • Commercial systems: 100-500k records/sec/node [charts: throughput per node (MB/s) against record size (100 and 1000 bytes) for WordCount and Grep, Spark vs. Storm]
  • 43. Where to go from here
  • 44. The book: Learning Spark shop.oreilly.com/product/0636920028512.do
  • 45. Apache Spark developer certificate program oreilly.com/go/sparkcert
  • 48. MapR Sandbox with Spark mapr.com/blog/getting-started-spark-mapr-sandbox
  • 49. Conclusion • Let’s scale systems and humans • How? Lambda Architecture! • Apache Spark is an efficient way to implement Lambda Architecture
  • 51. Q&A • @mhausenblas • Engage with us! MapR, maprtech, mapr-technologies
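
The deck's central idea, query = function(all data), answered by merging a batch view with a real-time view (slides 17, 18 and 22), can be sketched in a few lines of plain Python. This is a minimal illustration, not Spark code and not the presenter's implementation. The split of the slide 21 events into `MASTER` and `RECENT`, the helper names, and the choice to count the air-borne figure as a net change starting from zero are all assumptions made for the example.

```python
from collections import Counter

# Immutable master dataset: append-only events (first three rows of slide 21).
MASTER = [
    ("2014-01-01T10:00:00", "DUB", "EI123", "take-off"),
    ("2014-01-01T10:05:00", "HEL", "SAS45", "take-off"),
    ("2014-01-01T10:07:00", "AMS", "BA99", "take-off"),
]

# Assumption: these events arrived after the last batch run,
# so only the speed layer has seen them.
RECENT = [
    ("2014-01-01T10:09:00", "LHR", "LH17", "landing"),
    ("2014-01-01T10:10:00", "CDG", "AF03", "landing"),
    ("2014-01-01T10:10:00", "FCO", "AZ501", "take-off"),
]


def airborne_delta(event):
    """Return +1 for a take-off and -1 for a landing."""
    return 1 if event[3] == "take-off" else -1


def batch_view(events):
    """Batch layer: recompute the view from scratch over all given events."""
    view = Counter()
    for event in events:
        view["air-borne"] += airborne_delta(event)
    return view


def increment_view(view, event):
    """Speed layer: update the real-time view with a single new event."""
    view["air-borne"] += airborne_delta(event)
    return view


def query(batch, realtime):
    """Serving layer: merge the batch view and the real-time view."""
    return batch["air-borne"] + realtime["air-borne"]


batch = batch_view(MASTER)
realtime = Counter()
for event in RECENT:
    increment_view(realtime, event)

print(query(batch, realtime))
```

Because the master data is never modified, a later batch run over `MASTER + RECENT` yields the same answer as the merged query, and the real-time view can then be discarded. This recomputation is what gives the architecture its tolerance to human error: a buggy view is fixed by correcting the code and recomputing.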

Editor's Notes

  • #38: Fault-tolerant: can enforce “exactly once” operations on incoming data streams. The same Spark code works with streaming data sets and with static DFS data sets. You can easily write applications that test incoming data streams against historical data.
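
The micro-batch model from slide 38, together with the note above about testing incoming streams against historical data, can be illustrated without Spark. The sketch below is plain Python that only mimics the idea of chopping a stream into batches of X seconds and processing each batch as a small deterministic job. It does not use Spark Streaming's DStream API, and `micro_batches`, `unseen`, `HISTORICAL` and `STREAM` are hypothetical names and sample values.

```python
from collections import defaultdict


def micro_batches(records, batch_seconds):
    """Group (timestamp_seconds, value) records into fixed-length batches.

    Returns a list of (batch_start, values) pairs ordered by batch start.
    """
    batches = defaultdict(list)
    for ts, value in records:
        start = int(ts // batch_seconds) * batch_seconds
        batches[start].append(value)
    return sorted(batches.items())


def unseen(batch, historical):
    """Per-batch job: keep the values that do not occur in the historical data."""
    return [value for value in batch if value not in historical]


# Assumed sample data: flights already known from historical (batch) data,
# and a live stream of (arrival time in seconds, flight) records.
HISTORICAL = {"EI123", "SAS45"}
STREAM = [(0.2, "EI123"), (0.7, "BA99"), (1.1, "SAS45"), (2.4, "LH17"), (2.9, "EI123")]

results = [
    (start, unseen(batch, HISTORICAL))
    for start, batch in micro_batches(STREAM, 1)
]
print(results)
```

Each batch is processed by the same function that could be run over a static data set, which is the property the note highlights: one piece of logic serves both the streaming and the batch path.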