What no one tells you about writing a streaming app (hadooparchbook)
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging because the underlying frameworks were not originally designed for always-on workloads. Options include using cluster mode so jobs continue if clients disconnect, and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source: file- and receiver-based sources benefit from checkpointing and write-ahead logs, while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well suited to windowing, aggregations, and machine learning, but may not be needed for every streaming use case.
4. Achieving exactly-once semantics requires techniques such as tracking offsets and making operations idempotent.
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
Top 5 mistakes when writing Streaming applications (hadooparchbook)
This document discusses 5 common mistakes when writing streaming applications and provides solutions. It covers: 1) Not shutting down apps gracefully; use shutdown hooks or external markers to stop processing after a batch finishes. 2) Assuming exactly-once semantics; things can fail at multiple points, so track offsets and keep operations idempotent. 3) Using streaming for everything, when batch processing is better for some goals. 4) Not preventing data loss; enable checkpointing and write-ahead logs. 5) Not monitoring jobs; use tools like the Spark Streaming UI and Graphite, and YARN cluster mode for automatic restarts.
Application architectures with Hadoop – Big Data TechCon 2014 (hadooparchbook)
Building applications using Apache Hadoop with a use-case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston in April 2014
Architecting applications with Hadoop - using clickstream analytics as an example (hadooparchbook)
Delivered by Mark Grover at Northern CO Hadoop User Group:
http://www.meetup.com/Northern-Colorado-Big-Data-Meetup/events/224717963/
Architecting applications with Hadoop - Fraud Detection (hadooparchbook)
This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Top 5 mistakes when writing Spark applications (markgrover)
This document discusses 5 common mistakes people make when writing Spark applications.
The first mistake is improperly sizing Spark executors by not considering factors like the number of cores, amount of memory, and overhead needed. The second mistake is running into the 2GB limit on Spark shuffle blocks, which can cause jobs to fail. The third mistake is not addressing data skew during joins and shuffles, which can cause some tasks to be much slower than others. The fourth mistake is poorly managing the DAG by overusing shuffles, not using techniques like ReduceByKey instead of GroupByKey, and not using complex data types. The fifth mistake is classpath conflicts between the versions of libraries used by Spark and those added by the user.
The document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis of web logs. It discusses challenges of Hadoop implementation and various architectural considerations for data storage, modeling, ingestion, processing and what specific processing needs to happen for the case study. These include sessionization, filtering, and business intelligence/discovery. Storage options, file formats, schema design, and processing engines like MapReduce, Spark and Impala are also covered.
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
This document discusses a presentation on fraud detection application architectures using Hadoop. It provides an overview of different fraud use cases and challenges in implementing Hadoop-based solutions. Requirements for the applications include handling high volumes, velocities and varieties of data, generating real-time alerts with low latency, and performing both stream and batch processing. A high-level architecture is proposed using Hadoop, HBase, HDFS, Kafka and Spark to meet the requirements. Storage layer choices and considerations are also discussed.
Architectural considerations for Hadoop Applications (hadooparchbook)
The document discusses architectural considerations for Hadoop applications using a case study on clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it considers HDFS vs HBase, file formats, and compression formats. SequenceFiles are identified as a good choice for raw data storage as they allow for splittable compression.
Application Architectures with Hadoop - UK Hadoop User Group (hadooparchbook)
This document discusses architectural considerations for analyzing clickstream data using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data formats like Avro and Parquet, partitioning strategies, and data ingestion using tools like Flume and Kafka. It also discusses processing engines like MapReduce, Spark and Impala and how they can be used to sessionize data and perform other analytics.
The NameNode was experiencing high load and instability after being restarted. Graphs showed unknown high load between checkpoints on the NameNode. DataNode logs showed repeated 60000 millisecond timeouts in communication with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.
I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, query processing logic, and competitive information.
Architecting a Fraud Detection Application with Hadoop (DataWorks Summit)
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture that focuses first on near real-time processing using technologies like Kafka and Spark Streaming for initial event processing before completing the picture with micro-batching, ingestion, and batch processing.
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark (Evan Chan)
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2L6bZbn
This CloudxLab Introduction to Spark Streaming & Apache Kafka tutorial helps you to understand Spark Streaming and Kafka in detail. Below are the topics covered in this tutorial:
1) Spark Streaming - Workflow
2) Use Cases - E-commerce, Real-time Sentiment Analysis & Real-time Fraud Detection
3) Spark Streaming - DStream
4) Word Count Hands-on using Spark Streaming
5) Spark Streaming - Running Locally Vs Running on Cluster
6) Introduction to Apache Kafka
7) Apache Kafka Hands-on on CloudxLab
8) Integrating Spark Streaming & Kafka
9) Spark Streaming & Kafka Hands-on
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A... (Helena Edelson)
Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies and How easy it is to implement. Example application: https://github.com/killrweather/killrweather
Productionizing Spark and the REST Job Server - Evan Chan (Spark Summit)
The document discusses productionizing Apache Spark and using the Spark REST Job Server. It provides an overview of Spark deployment options like YARN, Mesos, and Spark Standalone mode. It also covers Spark configuration topics like jars management, classpath configuration, and tuning garbage collection. The document then discusses running Spark applications in a cluster using tools like spark-submit and the Spark Job Server. It highlights features of the Spark Job Server like enabling low-latency Spark queries and sharing cached RDDs across jobs. Finally, it provides examples of using the Spark Job Server in production environments.
Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark After Dark is a mock dating site that uses the latest Spark libraries, AWS Kinesis, Lambda Architecture, and Probabilistic Data Structures to generate dating recommendations.
There will be 5+ demos covering everything from basic data ETL to advanced data processing including Alternating Least Squares Machine Learning/Collaborative Filtering and PageRank Graph Processing.
There is heavy emphasis on Spark Streaming and AWS Kinesis.
Watch the video here
https://www.youtube.com/watch?v=g0i_d8YT-Bs
Hadoop Application Architectures tutorial at Big Data Service 2015 (hadooparchbook)
This document outlines a presentation on architectural considerations for Hadoop applications. It introduces the presenters who are experts from Cloudera and contributors to Apache Hadoop projects. It then discusses a case study on clickstream analysis, how this was challenging before Hadoop due to data storage limitations, and how Hadoop provides a better solution by enabling active archiving of large volumes and varieties of data at scale. Finally, it covers some of the challenges in implementing Hadoop, such as choices around storage managers, data modeling and file formats, data movement workflows, metadata management, and data access and processing frameworks.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of (Charles Givre)
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool that can help address this. Drill works with many different forms of “self-describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive or other SQL-on-Hadoop tools, Drill is not a wrapper for MapReduce and can scale to clusters of up to 10k nodes.
Spark Streaming & Kafka - The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates with Kafka natively with no data loss, and even how to do exactly-once processing!
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
This document provides an overview of SolrCloud on Hadoop. It discusses how SolrCloud allows for distributed, highly scalable search capabilities on Hadoop clusters. Key components that work with SolrCloud are also summarized, including HDFS for storage, MapReduce for processing, and ZooKeeper for coordination services. The document demonstrates how SolrCloud can index and query large datasets stored in Hadoop.
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres..., confluent)
Apache Kafka has become the central point for a fast and scalable streaming platform. Now, thanks to the open source explosion over the last decade, there are numerous data stores available as sinks for Kafka-brokered data, from search to document stores, columnar DBs, time series DBs and more. While many claim to be the Swiss Army knife, in reality each is designed for specific types of data and analytics approaches. In this talk, we will cover the taxonomy of various data sinks and delve into each category's pros, cons and ideal use cases, so you can select the right ones and tie them together with Kafka into a well-considered architecture.
Architecting next generation big data platform (hadooparchbook)
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 (Michael Noll)
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka this talk will provide an update on the growth and status of the Kafka project community. Rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31 (Timothy Spann)
Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31
An overview for Big Data Engineers on how one could use Apache projects to run deep learning workflows with Apache NiFi, YARN, Spark, Kafka and many other Apache projects.
Emerging technologies/frameworks in Big Data (Rahul Jain)
A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
In my talk I will discuss and show examples of using Apache Hadoop, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. This is the follow-up to last year's Apache Deep Learning 101, which was done at DataWorks Summit and ApacheCon.
As part of my talk I will walk through using Apache MXNet pre-built models, MXNet's new Model Server with Apache NiFi, executing MXNet with Apache NiFi, and running Apache MXNet on edge nodes utilizing Python and Apache MiNiFi.
This talk is geared towards data engineers interested in the basics of deep learning with open source Apache tools in a Big Data environment. I will walk through source code examples available in GitHub and run the code live on an Apache Hadoop / YARN / Apache Spark cluster.
This will be an introduction to executing deep learning pipelines in an Apache Big Data environment.
My talk at DataWorks Summit Sydney was listed in the top 7: https://hortonworks.com/blog/7-sessions-dataworks-summit-sydney-see/
I have also spoken at and run Future of Data Princeton, and spoken at Oracle Code NYC.
https://www.slideshare.net/oom65/hadoop-security-architecture?next_slideshow=1
https://community.hortonworks.com/articles/83100/deep-learning-iot-workflows-with-raspberry-pi-mqtt.html
https://community.hortonworks.com/articles/146704/edge-analytics-with-nvidia-jetson-tx1-running-apac.html
https://dzone.com/refcardz/introduction-to-tensorflow
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... (Michael Rys)
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Intro to Machine Learning with H2O and AWS (Sri Ambati)
Navdeep Gill @ Galvanize Seattle- May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive (confluent)
1. A streaming platform like Kafka can provide the benefits of Hadoop for batch processing but in a faster, real-time way by processing data as it arrives rather than storing all data.
2. Virtual reality applications require stream processing to power features like VR mirroring and capture in real-time. Kafka's stream processing capabilities address challenges like this for VR.
3. The document discusses how AltspaceVR uses Kafka stream processing for applications like VR mirroring and capture, presence tracking, scheduled tasks, and more to power their real-time VR experiences.
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation cum workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing and lets you write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web clickstream example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
This presentation will be useful to those who would like to get acquainted with Apache Spark architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Was presented at Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Talk given by Max Feldman, Product Security Engineer at Salesforce, at AppSec USA.
One challenging aspect of achieving software security is the struggle to catch up with the speed of development and deployment. We built Providence with the goal of preventing obvious bugs from ever being deployed into production.
Providence is a lightweight and scalable tool which finds bugs and anti-patterns of varying complexity from code commits, and we’ve used it to prevent vulnerabilities ranging from XSS, to access control issues, to XXE. It works by continuously monitoring and pulling commits from version control systems and scanning them for bugs with rules defined in plugins. Additional plugins are easy to create and deploy, which has allowed for quick reaction to new bugs or problems as they are discovered.
Providence is easily integrated with SDLC workflows or bug-tracking tools, and we will discuss how we have integrated it in-house in an unobtrusive manner. This model of addressing issues also provides relative immediacy of resolution; on average, potential problems found by Providence are resolved more quickly than other vulnerabilities because developers are presented the issues right after they commit the code, instead of weeks to months later.
We are currently in the process of open-sourcing Providence in order to share the tool with the DevOps/security community (or any interested parties). This talk will cover the internals of Providence, its engine and plugin architecture (including examples of plugins and their ease of creation), as well as its integration with our SDLC and the faster and more efficient responses we’ve achieved as a result. We’re continuing to build new plugins and features, and we’re excited to see what ideas others may have in mind!
Architecting a next-generation data platform (hadooparchbook)
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
This document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis. It covers challenges of Hadoop implementation and various architectural considerations for data storage and modeling, data ingestion, and data processing. For data processing, it discusses different processing engines like MapReduce, Pig, Hive, Spark and Impala. It also discusses what specific processing needs to be done for the clickstream data like sessionization and filtering.
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
2. 2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Questions? tiny.cloudera.com/streaming-singapore-questions
3. 3
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years
– Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner
Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
Questions? tiny.cloudera.com/streaming-singapore-questions
7. 7
When to stream, and when not to
• We are looking for an SLA sweet spot
• Multi milliseconds to seconds
• Not minutes
• Not constant low milliseconds or under
• Doesn’t come for free
Questions? tiny.cloudera.com/streaming-singapore-questions
12. 12
But there are multiple sources
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Destination system]
13. 13
But..
• Sources, sinks, ingestion channels may go down
• Sources and sinks may be producing/consuming at different rates
• Regular maintenance windows may need to be scheduled
• We need a resilient message broker
Questions? tiny.cloudera.com/streaming-singapore-questions
14. 14
Need for a message broker
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Destination system]
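For concreteness, here is a minimal sketch, not from the deck, of a source publishing into the broker with the Kafka producer API; the broker address and topic name are assumptions for illustration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventPublisher {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The source only knows the broker and a topic, not the downstream sinks
    producer.send(new ProducerRecord[String, String]("events", "key-1", "hello"))
    producer.close()
  }
}

This decoupling is the point of the broker: producers and consumers can go down, or run at different rates, without knowing about each other.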
16. 16
But ‘queue’ doesn’t ‘push’
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process -> Push -> Storage system]
17. 17
Streaming data ingestion process
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system]
18. 18
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system]
20. 20
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system]
21. 21
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system; the streaming ingestion process can be used to do simple transformations]
22. 22
Two types of transformations
Atomic
• Need to work with one event at a time
• Example – mask a credit card number
With context
• Need to refer to external context
• Example – convert zip code to state, by looking up a cache
Questions? tiny.cloudera.com/streaming-singapore-questions
23. 23
Atomic transformations
• Require no context
• Can be simply done within Flume interceptors, Kafka Connect, or Spark Streaming
Questions? tiny.cloudera.com/streaming-singapore-questions
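To make the atomic case concrete, here is a minimal sketch, not from the deck, of masking credit card numbers in Spark Streaming with a per-event map; the socket source, regex, and mask format are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MaskCards {
  // Hypothetical 16-digit card pattern; real masking rules would be stricter
  private val CardPattern = "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b".r

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MaskCards")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Atomic: each event is masked on its own, with no external lookup
    val masked = lines.map(line => CardPattern.replaceAllIn(line, "XXXX-XXXX-XXXX-XXXX"))
    masked.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

A "with context" version of the same job would differ only in that the map would consult an external cache (e.g. zip code to state).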
24. 24
Flume Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
Questions? tiny.cloudera.com/streaming-singapore-questions
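As a hedged sketch of the interceptor shape (the class and header names here are made up, not from the deck), a Flume interceptor that extracts a field into an event header implements org.apache.flume.interceptor.Interceptor plus a Builder:

import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.flume.{Context, Event}
import org.apache.flume.interceptor.Interceptor

// Hypothetical interceptor: copies the first whitespace-separated field of the
// event body into a header, so downstream components can route or filter on it
class FirstFieldInterceptor extends Interceptor {
  override def initialize(): Unit = {}

  override def intercept(event: Event): Event = {
    val body = new String(event.getBody, "UTF-8")
    body.split("\\s+").headOption.foreach(f => event.getHeaders.put("first_field", f))
    event
  }

  override def intercept(events: JList[Event]): JList[Event] =
    events.asScala.map(e => intercept(e)).asJava

  override def close(): Unit = {}
}

object FirstFieldInterceptor {
  class Builder extends Interceptor.Builder {
    override def build(): Interceptor = new FirstFieldInterceptor
    override def configure(context: Context): Unit = {} // no settings needed
  }
}

The Builder is what the Flume agent configuration points at; masking, validating, and filtering interceptors all follow the same shape.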
25. 25
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system; the streaming ingestion process can be used to do simple transformations]
28. 28
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Copycat, i.e. Kafka Connect, or Apache Flume) -> Push -> Storage system; the streaming ingestion process can be used to do simple transformations]
29. 29
Semantic types
• At most once
– Not good for many cases
– Only where performance/SLA is more important than accuracy
• Exactly once
– Expensive to achieve but desirable
• At least once
– Easiest to achieve
Questions? tiny.cloudera.com/streaming-singapore-questions
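As an illustrative sketch, not from the deck, at-least-once consumption with Kafka falls out of committing offsets only after processing; the broker address, topic, and group ID are assumptions:

import java.util.{Arrays, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object AtLeastOnceConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "example-group")
    props.put("enable.auto.commit", "false") // commit manually, after processing
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("events"))
    while (true) {
      val records = consumer.poll(1000L)
      records.asScala.foreach(r => println(s"processing ${r.value}"))
      consumer.commitSync() // a crash before this line means records are re-delivered
    }
  }
}

Re-delivery on failure is exactly why the next slides pair at-least-once delivery with idempotent writes to get exactly-once results.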
30. 30
Categories of storage systems
“Puts” based
• Can be re-inserted without side effects, since a re-inserted record will have a duplicate key
“Appends” based
• Cannot be re-inserted
Questions? tiny.cloudera.com/streaming-singapore-questions
31. 31
How to achieve exactly once?
• For “puts” based storage systems
– At least once is enough (keys have to be unique, i.e. a primary key)
– Re-inserted records will have duplicate keys
– Will simply overwrite the existing record with the same value
• For “appends” based storage systems (e.g. HDFS)
– Still easiest to do at least once
– Need to de-duplicate before processing
Questions? tiny.cloudera.com/streaming-singapore-questions
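A minimal sketch of the "puts-based" idea, assuming an HBase table named events keyed by a unique event ID (the table and column names are hypothetical): because the row key is the event ID, a replayed event overwrites the existing cell with the same value, so at-least-once delivery yields exactly-once results.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object IdempotentWriter {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("events")) // hypothetical table

    // Row key = unique event ID, so re-inserting the same event is a harmless overwrite
    def writeEvent(eventId: String, payload: String): Unit = {
      val put = new Put(Bytes.toBytes(eventId))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(payload))
      table.put(put)
    }

    writeEvent("evt-0001", "{\"zip\":\"94105\"}")
    table.close()
    connection.close()
  }
}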
37. 37
Why did Streaming Suck
• Increments with Cassandra
– Double increments
– No strong consistency
• Storm without Kafka
– Not only once (no exactly-once guarantee)
– Not at least once
– Batch would have to re-process EVERY record to remove duplicates
Questions? tiny.cloudera.com/streaming-singapore-questions
38. 38
We have come a long way
• We don’t have to use increments any more, and we can have consistency
– HBase
• We can have state in our streaming platform
– Spark Streaming
• We don’t lose data
– Spark Streaming
– Kafka
– Other options
• Full universe of de-duping
– Again, HBase with versions
Questions? tiny.cloudera.com/streaming-singapore-questions
41. 41
Advanced Streaming
• Ad-hoc analysis will identify value
• Ad-hoc will become batch
• The value will demand lower latency than batch
• Batch will become streaming
Questions? tiny.cloudera.com/streaming-singapore-questions
42. 42
Advanced Streaming
• Requirements for ideal batch-to-streaming frameworks
– Something that can span both paradigms
– Something that can use the tools of ad-hoc analysis (SQL, MLlib, R, Scala, Java)
– Development through a common IDE
– Debugging
– Unit testing
– Common deployment model
Questions? tiny.cloudera.com/streaming-singapore-questions
43. 43
Spark Streaming Example
Questions? tiny.cloudera.com/streaming-singapore-questions
// Batch version, shown for comparison with the streaming version on the next slide
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount") // an app name is required
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2) // path: location of an input text file
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println) // RDDs have no print(); collect and print instead
44. 44
Spark Streaming Example
Questions? tiny.cloudera.com/streaming-singapore-questions
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start() // was SSC.start(); Scala is case-sensitive
ssc.awaitTermination() // keep the streaming job running
46. 46
Advanced Streaming
• In Spark Streaming, a DStream is a sequence of RDDs, one per micro-batch interval
• If we can access RDDs in Spark Streaming:
– We can convert to Vectors (KMeans, principal component analysis)
– We can convert to LabeledPoints (NaiveBayes, Random Forest, Linear Support Vector Machines)
– We can convert to DataFrames (SQL, R)
Questions? tiny.cloudera.com/streaming-singapore-questions
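To illustrate the DStream-to-RDD point above, here is a minimal sketch (Spark 1.x-era APIs; the socket source and table name are assumptions) that reaches the underlying RDDs with foreachRDD and registers each micro-batch as a DataFrame for SQL:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountsAsSql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountsAsSql")
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // A DStream is a sequence of RDDs, one per micro-batch; foreachRDD
    // hands each of those RDDs to ordinary (batch) Spark code
    words.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.toDF("word").registerTempTable("words")
      sqlContext.sql("SELECT word, COUNT(*) AS c FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The same foreachRDD hook is where the MLlib conversions above (Vectors, LabeledPoints) would happen, which is why a batch-and-streaming-spanning framework is so valuable.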