Introduction to Flink
Streaming
Framework for modern streaming
applications
https://github.com/phatak-dev/flink-examples
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consults in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Stream abstraction vs streaming applications
● Stream as an abstraction
● Challenges with modern streaming applications
● Why not Spark streaming?
● Introduction to Flink
● Introduction to Flink streaming
● Flink Streaming API
● References
Use of stream in applications
● Streams are used both inside and outside big data
to support two major use cases
○ Stream as an abstraction layer
○ Stream as unbounded data to support real time
analysis
● Abstraction and real time analysis place different needs
and expectations on streams
● Different platforms use the term stream with different meanings
Stream as the abstraction
● A stream is a sequence of data elements made
available over time.
● A stream can be thought of as items on a conveyor belt
being processed one at a time rather than in large
batches.
● Streams can be unbounded (message queues) or
bounded (files)
● Streams are becoming the new abstraction for building data
pipelines.
Streams as abstraction outside big data
● Streams have been used as an abstraction outside big data for the
last few years
● Some examples are
○ Reactive streams like akka-streams, akka-http
○ Java 8 streams
○ RxJava etc
● These uses of streams do not care about real time
analysis
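The element-at-a-time style these libraries share can be sketched with plain Python generators (an illustrative analogy, not any of the APIs named above):

```python
# A minimal sketch of "stream as abstraction": elements flow through the
# pipeline one at a time, and the same code works for bounded and
# (conceptually) unbounded sequences.
from itertools import count, islice

bounded = iter([1, 2, 3, 4, 5])        # like a file
unbounded = count(1)                   # like a message queue

# One element moves through map and filter at a time (conveyor belt);
# no intermediate batch is ever materialized.
doubled = (x * 2 for x in unbounded)
evens = (x for x in doubled if x % 4 == 0)
first_three = list(islice(evens, 3))   # [4, 8, 12]
```

Note that the pipeline over the unbounded source still terminates, because demand (`islice`) flows backwards through the lazy chain.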
Streams for real time analysis
● In this use case, a stream is viewed as unbounded data
which has to be processed with low latency, as soon as it
arrives in the system
● The stream can be processed using a non-stream abstraction
at runtime
● So the focus in these scenarios is only on modeling the API
around streams, not the implementation
● Ex : Spark streaming
Stream abstraction in big data
● Stream is the new abstraction layer people are
exploring in big data
● With the right implementation, streams can support both
streaming and batch applications much more effectively
than existing abstractions.
● Batch on streaming is a new way of looking at processing,
rather than treating streaming as a special case of
batch
● Batch can be faster on streaming than on dedicated batch
processing
Frameworks with stream as abstraction
Apache Flink
● Flink’s core is a streaming dataflow engine that
provides data distribution, communication, and fault
tolerance for distributed computations over data
streams.
● Flink provides
○ Dataset API - for bounded streams
○ Datastream API - for unbounded streams
● Flink embraces the stream as an abstraction to implement
its dataflow.
Flink stack
Flink history
● The Stratosphere project started at the Technical University of
Berlin in 2009
● Entered Apache incubation in March 2014
● Became a top level project in Dec 2014
● Started as a stream engine for batch processing
● Started to support streaming a few versions ago
● Data Artisans is a company founded by the core Flink team
Flink streaming
● Flink Streaming is an extension of the core Flink API for
high-throughput, low-latency data stream processing
● Supports many data sources like Flume, Twitter and
ZeroMQ, and also any user defined data source
● Data streams can be transformed and modified using
high-level functions similar to the ones provided by the
batch processing API
● Sounds much like what Spark streaming promises!
Streaming is not fast batch processing
● Most streaming frameworks focus too much on
latency when they develop streaming extensions
● Both Storm and Spark streaming view streaming as a low
latency batch processing system
● Though latency plays an important role in real time
applications, the needs and challenges go beyond it
● Addressing the complex needs of modern streaming
systems needs a fresh view of streaming APIs
Streaming in Lambda architecture
● Streaming is viewed as a limited, approximate, low
latency computing system compared to a batch system
in the lambda architecture
● So we usually run a streaming system to get low latency
approximate results and run a batch system to get high
latency but accurate results
● All these limitations of streaming stem from
conventional thinking and implementations
● The new idea is: why not make streaming a low latency, accurate
system itself?
Google dataflow
● Google articulated the first modern streaming
framework, supporting low latency, exactly-once, accurate
stream applications, in their Dataflow paper
● It describes a single system which can replace the need for
separate streaming and batch processing systems
● This is known as the Kappa architecture
● Modern stream frameworks embrace this over the lambda
architecture
● Google Dataflow is open sourced under the name
Apache Beam
Google dataflow and Flink streaming
● Flink adopted the Dataflow ideas for its streaming API
● The Flink streaming API went through a big overhaul in the 1.0
version to embrace these ideas
● It was relatively easy to adopt the ideas, as both Google
Dataflow and Flink use streaming as the abstraction
● Spark 2.0 may add some of these ideas in its
structured stream processing effort
Needs of modern real time applications
● Ability to handle out-of-order events in unbounded data
● Ability to correlate events with different dimensions
of time
● Ability to correlate events using custom application
based characteristics like sessions
● Ability to do both micro-batch and event-at-a-time processing
on the same framework
● Support for complex stream processing libraries
Mandatory wordcount
● Streams are represented using DataStream in Flink
streaming
● DataStream supports both RDD-like and Dataset-like APIs for
manipulation
● In this example,
○ Read from a socket to create a DataStream
○ Use map, keyBy and sum operations for aggregation
● com.madhukaraphatak.flink.streaming.examples.StreamingWordCount
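The shape of that pipeline can be sketched in plain Python (the deck's example is Scala on Flink's DataStream API; the code below only models the semantics, it is not Flink API):

```python
# map / keyBy / sum over a finite stand-in for the socket stream.
lines = ["to be or not", "to be"]      # stands in for the socket source

# map stage: emit a (word, 1) pair per incoming word
pairs = [(w, 1) for line in lines for w in line.split()]

# keyBy + sum stage: per-key running state, updated one event at a time
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The important point is that the per-key state is updated as each event arrives, rather than recomputed per batch.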
Flink streaming vs Spark streaming
● Spark Streaming: streams are represented using DStreams;
Flink Streaming: streams are represented using DataStreams
● Spark: the stream is discretized into mini batches;
Flink: the stream is not discretized
● Spark: supports an RDD DSL; Flink: supports a Dataset-like DSL
● Spark: stateless by default; Flink: stateful at the operator level by default
● Spark: runs a mini batch for each interval;
Flink: runs pipelined operators for each event that comes in
● Spark: near real time; Flink: real time
Discretizing the stream
● Flink by default does not need any discretization of the stream
to work
● But using the window API, we can create a discretized stream
similar to Spark
● The window state is discarded as and when each batch
is computed
● This way you can mimic Spark micro batches in Flink
● com.madhukaraphatak.flink.streaming.examples.WindowedStreamingWordCount
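The discretization that example mimics can be sketched as tumbling-window assignment by timestamp (illustrative plain Python, not Flink API; event shapes are assumptions):

```python
# Assign events to fixed tumbling windows, count per window, and forget
# each window's state once it is computed.
events = [("a", 1), ("b", 3), ("a", 6), ("a", 7)]   # (word, timestamp)
window_size = 5

per_window = {}
for word, ts in events:
    pane = ts // window_size            # each event lands in exactly one pane
    per_window.setdefault(pane, {})
    per_window[pane][word] = per_window[pane].get(word, 0) + 1

# Emitting a pane and then discarding it is exactly a micro batch:
# per_window == {0: {"a": 1, "b": 1}, 1: {"a": 2}}
```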
Understanding the dataflow of Flink
● All programs in Flink, both batch and streaming, are
represented using a dataflow
● This dataflow signifies the stream abstraction provided
by the Flink runtime
● This dataflow treats all data as streams and
processes it using a long running operator model
● This is quite different from the RDD model of Spark
● The Flink UI allows us to understand the dataflow of a given Flink
program
Running in local mode
● bin/start-local.sh
● bin/flink run -c com.madhukaraphatak.flink.streaming.examples.StreamingWordCount /home/madhu/Dev/mybuild/flink-examples/target/scala-2.10/flink-examples_2.10-1.0.jar
Dataflow for wordcount example
Operator fusing
● The Flink optimiser fuses operators for efficiency
● All fused operators run in the same thread, which
saves the serialization and deserialization cost between
the operators
● For all fused operators, Flink generates a nested
function which comprises all the code from the operators
● This is much more efficient than RDD optimization
● Spark's Dataset is planning to support this functionality
● You can disable this with env.disableOperatorChaining()
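The fusing idea can be sketched as function composition (a conceptual sketch, not Flink's generated code; operator names are made up):

```python
# Fused operators run as one nested function in a single thread, so a
# record is never serialized and handed off between operator threads.
parse = lambda s: int(s.strip())
double = lambda n: n * 2
fmt = lambda n: f"value={n}"

def fuse(*ops):
    """Build the single nested function a fusing optimiser would generate."""
    def fused(record):
        for op in ops:
            record = op(record)        # one call per record, no hand-off
        return record
    return fused

pipeline = fuse(parse, double, fmt)
# pipeline(" 21 ") == "value=42"
```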
Dataflow without operator fusing
Flink streaming vs Spark streaming
● Spark Streaming: uses the RDD distribution model for processing;
Flink Streaming: uses a pipelined stream processing paradigm
● Spark: parallelism is done at the batch level;
Flink: parallelism is controlled at the operator level
● Spark: uses RDD immutability for fault recovery;
Flink: uses asynchronous barriers for fault recovery
● Spark: RDD-level optimization for stream optimization;
Flink: operator fusing for stream optimization
Window API
● Powerful API to track and do custom state analysis
● Types of windows
○ Time window
■ Tumbling window
■ Sliding window
○ Non time based window
■ Count window
● Ex : WindowExample.scala
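The two time-window shapes can be sketched as assignment functions (plain Python, not Flink's window-assigner API; the size/slide values are arbitrary):

```python
# A tumbling window assigns a timestamp to exactly one pane; a sliding
# window of size 4 sliding by 2 can assign it to overlapping panes.
SIZE, SLIDE = 4, 2

def tumbling(ts):
    return (ts // SIZE) * SIZE                  # the single pane start

def sliding(ts):
    # all pane starts s (multiples of SLIDE) with s <= ts < s + SIZE
    starts = range(((ts - SIZE) // SLIDE + 1) * SLIDE, ts + 1, SLIDE)
    return [s for s in starts if s >= 0]

# tumbling(5) == 4; sliding(5) == [2, 4] -- timestamp 5 belongs to the
# overlapping panes [2, 6) and [4, 8)
```

A count window, by contrast, ignores timestamps entirely and closes a pane after a fixed number of elements.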
Anatomy of Window API
● The window API is made of 3 different components
● The three components of a window are
○ Window assigner
○ Trigger
○ Evictor
● These three components make up the whole window API in
Flink
Window Assigner
● A function which determines, for a given element, which
window it should belong to
● Responsible for the creation of windows and assigning
elements to a window
● Two types of window assigners
○ Time based window assigner
○ GlobalWindow assigner
● Users can write their own custom window assigners too
Trigger
● A trigger is a function responsible for determining when a
given window is triggered
● In a time based window, this function waits till the time is
up to trigger
● But in a non time based window, it can use custom logic
to determine when to evaluate a given window
● In our example, the number of records in a
given window is used to determine whether to trigger or not.
● WindowAnatomy.scala
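A count-based trigger of the kind the example describes can be sketched like this (plain Python; the buffer/fire loop is an illustration, not Flink's Trigger interface):

```python
# The window fires when it has collected a given number of records,
# independent of time.
MAX_COUNT = 3

buffer, fired = [], []
for event in range(1, 8):
    buffer.append(event)
    if len(buffer) >= MAX_COUNT:   # the trigger decides when to evaluate
        fired.append(buffer)       # evaluate the window pane
        buffer = []                # the evictor clears the pane

# fired == [[1, 2, 3], [4, 5, 6]]; buffer == [7], still waiting
```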
Building custom session window
● We want to track the session of a user
● Each session is identified using a sessionID
● We get an event when the session starts
● Evaluate the session when we get the end-of-session
event
● For this, we implement our own custom window
trigger which tracks the end of the session
● Ex : SessionWindowExample.scala
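The session-window logic can be sketched as follows (plain Python; the event shape and names are assumptions for illustration, not the SessionWindowExample API):

```python
# Buffer events per sessionId and evaluate a session only when its end
# event arrives.
open_sessions, closed_sessions = {}, {}

def on_event(session_id, payload, is_end):
    events = open_sessions.setdefault(session_id, [])
    events.append(payload)
    if is_end:                                  # end-of-session trigger
        closed_sessions[session_id] = open_sessions.pop(session_id)

for e in [("s1", "login", False), ("s2", "login", False),
          ("s1", "click", False), ("s1", "logout", True)]:
    on_event(*e)

# closed_sessions == {"s1": ["login", "click", "logout"]}; "s2" stays open
```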
Concept of Time in Flink streaming
● Time in a streaming application plays an important role
● So the ability to express time in a flexible way is a very
important feature of a modern streaming application
● Flink supports three kinds of time
○ Processing time
○ Event time
○ Ingestion time
● Event time is one of the important features of Flink which
complements the custom window API
Understanding event time
● Time in Flink needs to address the following two questions
○ When did the event occur?
○ How much time has passed since the event?
● The first question is answered by assigning timestamps
● The second question is answered by understanding the
concept of watermarks
● Ex : EventTimeExample.scala
Watermarks in Event Time
● A watermark is a special signal which signifies the flow of
time in Flink
● In the above diagram, w(20) signifies that 20 units of time
have passed at the source
● Watermarks allow Flink to support different time
abstractions
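One common watermark strategy is bounded out-of-orderness (an assumption here, not necessarily what EventTimeExample uses): the watermark trails the largest timestamp seen so far, asserting "no event older than this is still expected".

```python
# Watermark = max event timestamp seen, minus an allowed lateness bound.
MAX_OUT_OF_ORDERNESS = 5
max_ts = float("-inf")

def on_timestamp(ts):
    global max_ts
    max_ts = max(max_ts, ts)
    return max_ts - MAX_OUT_OF_ORDERNESS   # current watermark

watermarks = [on_timestamp(ts) for ts in [10, 12, 9, 20]]
# watermarks == [5, 7, 7, 15]: the watermark never goes backwards, and
# the late event (9) does not pull it back
```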
References
● http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● http://blog.madhukaraphatak.com/categories/flink-streaming/
● https://www.youtube.com/watch?v=y7f6wksGM6c
● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
● https://www.youtube.com/watch?v=v_exWHj1vmo
● http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink