Implementing the Lambda Architecture efficiently with Apache Spark

© 2015 MapR Technologies 1© 2015 MapR Technologies
Implementing the Lambda Architecture
efficiently with Apache Spark

Polyglot Processing
• Combination of different
processing engines over
DFS/NoSQL stores
• Lambda and Kappa
architectures are two
prominent examples
datadventures.ghost.io/2014/07/06/polyglot-processing/

Fault tolerance
hardware
software
developer
?

Let’s talk about developers…

human fault toleranceLet’s talk about developers…

When things go wrong …
http://allfacebook.com/the-real-reason-facebook-went-down-yesterday-its-complicated_b19366
2010
unfortunate
handling of
error condition

2012
cascaded bug
http://money.cnn.com/2012/06/21/technology/twitter-down/index.htm

http://www.v3.co.uk/v3-uk/news/2196577/rbs-takes-gbp125m-hit-over-it-outage
2012
upgrade of batch
processing

http://www.androidcentral.com/google-explains-reasons-behind-today-s-30-minute-service-outage
2014
bug/bad config

Lambda Architecture to the rescue!

Let’s step back a bit …
• Nathan Marz (Backtype, Twitter, stealth startup)
• Creator of …
– Storm
– Cascalog
– ElephantDB
manning.com/marz/

Lambda Architecture—Requirements
• Fault-tolerant against both hardware failures and human errors
• Support variety of use cases that include low latency querying as
well as updates
• Linear scale-out capabilities
• Extensible, so that the system is manageable and can
accommodate newer features easily

Lambda Architecture—Concept
• Latency—the time it takes to run a query
• Timeliness—how up to date the query results are
( consistency)
• Accuracy—tradeoff between performance and scalability
( approximations)
query = function(all data)

Lambda Architecture
NEW DATA
STREAM QUERY
BATCH VIEWS
√View 1 View 2 View N
REAL-TIME VIEWS
BATCH LAYER
SERVINGLAYER
SPEED LAYER
MERGE
IMMUTABLE
MASTER DATA
PRECOMPUTE
VIEWSBATCH
RECOMPUTE
PROCESS
STREAM
INCREMENT
VIEWSREAL-TIME
INCREMENT
View 1 View 2 View N

Lambda Architecture—Compensate Batch
time
not absorbed
now

Lambda Architecture—Immutable Data + Views
openflights.org

timestamp airport flight actiontimestamp airport flight action
2014-01-01T10:00:00 DUB EI123 take-off
timestamp airport flight action
2014-01-01T10:00:00 DUB EI123 take-off
2014-01-01T10:05:00 HEL SAS45 take-off
2014-01-01T10:00:00 DUB EI123 take-off
2014-01-01T10:05:00 HEL SAS45 take-off
2014-01-01T10:07:00 AMS BA99 take-off
2014-01-01T10:00:00 DUB EI123 take-off
2014-01-01T10:05:00 HEL SAS45 take-off
2014-01-01T10:07:00 AMS BA99 take-off
2014-01-01T10:09:00 LHR LH17 landing
2014-01-01T10:00:00 DUB EI123 take-off
2014-01-01T10:05:00 HEL SAS45 take-off
2014-01-01T10:07:00 AMS BA99 take-off
2014-01-01T10:09:00 LHR LH17 landing
2014-01-01T10:10:00 CDG AF03 landing
2014-01-01T10:00:00 DUB EI123 take-off
2014-01-01T10:05:00 HEL SAS45 take-off
2014-01-01T10:07:00 AMS BA99 take-off
2014-01-01T10:09:00 LHR LH17 landing
2014-01-01T10:10:00 CDG AF03 landing
2014-01-01T10:10:00 FCO AZ501 take-off
immutable master dataset

2014-01-01T10:00:00 DUB EI123 take-off
2014-01-01T10:05:00 HEL SAS45 take-off
2014-01-01T10:07:00 AMS BA99 take-off
2014-01-01T10:09:00 LHR LH17 landing
2014-01-01T10:10:00 CDG AF03 landing
2014-01-01T10:10:00 FCO AZ501 take-off
immutable master dataset
views
airport planes
AMS 69
CDG 44
DUB 31
FCO 10
HEL 17
LHR 101
airport load: airline planes
AF 59
AZ 23
BA 167
EI 19
LH 201
SAS 28
air-borne
per airline:
air-borne: 2307

Implementing the Lambda Architecture

How about an integrated approach?
• Twitter Summingbird
• Lambdoop
• Apache Spark

Apache Spark—a unified platform …
spark.apache.org
Continued innovation bringing new functionality, such as:
• Tachyon (Shared RDDs, off-heap solution)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
Spark SQL
(SQL/HQL)
Spark Streaming
(stream processing)
MLlib
(machine learning)
Spark (core execution engine—RDDs)
GraphX
(graph processing)
Mesos
file system (local, MapR-FS, HDFS, S3) or data store (HBase, Elasticsearch, etc.)
YARNStandalone

Easy and fast Big Data
• Easy to Develop
– Rich APIs available through
Java, Scala, Python
– Interactive shell
• Fast to Run
– Advanced data storage model
(automated optimization
between memory and disk)
– General execution graphs
2-5× less code up to 10× faster on disk,
100× in memory
https://amplab.cs.berkeley.edu/benchmark/

… across multiple datasources
• Local Files
• Object Stores (Amazon S3)
• HDFS
– text files, sequence files, any other Hadoop InputFormat
• Key-Value datastores (HBase, C*)
• Elasticsearch

Easy: expressive API
map reduce

Easy: expressive API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
spark.apache.org/docs/latest/programming-guide.html#transformations

… and scale as you go (mentally and physically)
YARN
Standalone

Resilient Distributed Datasets (RDD)
• RDDs are the core of the Spark execution engine
• Collections of elements that can be operated on in parallel
• Persistent in memory between operations
www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

RDD Operations
• Lazy evaluation is key to Spark
• Transformations
– Creation of a new dataset from an existing:
map, filter, distinct, union, sample, groupByKey, join, etc.
• Actions
– Return a value after running a computation:
collect, count, first, takeSample, foreach, etc.

RDD persistence
spark.apache.org/docs/latest/scala-programming-guide.html

Spark Streaming
• High-level language operators for streaming data
• Fault-tolerant semantics
• Support for merging streaming data with historical data
spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming
Run a streaming computation as a series of small, deterministic
batch jobs.
• Chop up live stream into batches of X
seconds (DStream)
• Spark treats each batch of data as RDDs
and processes them using RDD ops
• Finally, processed results of the RDD
operations are returned in batches
Spark
Spark
Streaming
batches of X seconds
live data stream
processed results

Spark Streaming
Run a streaming computation as a series of small, deterministic
batch jobs.
• Batch sizes as low as ½ second,
latency of about 1 second
• Potential for combining batch processing
and streaming processing in the same
system
Spark
Spark
Streaming
batches of X seconds
live data stream
processed results

Spark Streaming: transformations
• Stateless transformations
• Stateful transformations
– checkpointing
– windowed transformations
• window duration
• sliding duration
spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams

Spark Streaming comparison
• Spark Streaming: 670k records/sec/node
• Storm: 115k records/sec/node
• Commercial systems: 100-500k records/sec/node
0
10
20
30
100 1000
Throughputpernode
(MB/s)
Record Size (bytes)
WordCount
Spark
Storm
0
20
40
60
100 1000
Throughputpernode
(MB/s)
Record Size (bytes)
Grep
Spark
Storm

The book: Learning Spark
shop.oreilly.com/product/0636920028512.do

Apache Spark developer certificate program
oreilly.com/go/sparkcert

http://lambda-architecture.net

http://spark-stack.org

MapR Sandbox with Spark
mapr.com/blog/getting-started-spark-mapr-sandbox

Conclusion
• Let’s scale systems and humans
• How? Lambda Architecture!
• Apache Spark is an efficient way to implement Lambda Architecture

Q&A
@mhausenblas maprtech
mhausenblas@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Implementing the Lambda Architecture efficiently with Apache Spark

Recommended

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Implementing the Lambda Architecture efficiently with Apache Spark (20)

More from DataWorks Summit (20)

Implementing the Lambda Architecture efficiently with Apache Spark

Editor's Notes