Introduction to Flink
Streaming
Framework for modern streaming
applications
https://github.com/phatak-dev/flink-examples
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consults in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Stream abstraction vs streaming applications
● Stream as an abstraction
● Challenges with modern streaming applications
● Why not Spark streaming?
● Introduction to Flink
● Introduction to Flink streaming
● Flink Streaming API
● References
Use of stream in applications
● Streams are used both inside and outside big data
to support two major use cases
○ Stream as an abstraction layer
○ Stream as unbounded data to support real time
analysis
● Abstraction and real time analysis place different needs
and expectations on streams
● Different platforms use the term stream with different meanings
Stream as the abstraction
● A stream is a sequence of data elements made
available over time.
● A stream can be thought of as items on a conveyor belt
being processed one at a time rather than in large
batches.
● Streams can be unbounded (message queues) or
bounded (files)
● Streams are becoming the new abstraction for building data
pipelines.
Streams as abstraction outside big data
● Streams have been used as an abstraction outside big data for the
last few years
● Some examples are
○ Reactive streams like akka-streams, akka-http
○ Java 8 streams
○ RxJava etc
● These uses of streams do not care about real time
analysis
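The element-at-a-time style these libraries share can be sketched with plain Python generators (an illustrative analogy, not any of the APIs named above):

```python
# A minimal sketch of "stream as abstraction": elements flow through the
# pipeline one at a time, and the same code works for bounded and
# (conceptually) unbounded sequences.
from itertools import count, islice

bounded = iter([1, 2, 3, 4, 5])        # like a file
unbounded = count(1)                   # like a message queue

# One element moves through map and filter at a time (conveyor belt);
# no intermediate batch is ever materialized.
doubled = (x * 2 for x in unbounded)
evens = (x for x in doubled if x % 4 == 0)
first_three = list(islice(evens, 3))   # [4, 8, 12]
```

Note that the pipeline over the unbounded source still terminates, because demand (`islice`) flows backwards through the lazy chain.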
Streams for real time analysis
● In this use case, a stream is viewed as unbounded data
which has to be processed with low latency, as soon as it
arrives in the system
● The stream can be processed using a non-stream abstraction
at runtime
● So the focus in these scenarios is only on modeling the API
around streams, not the implementation
● Ex : Spark streaming
Stream abstraction in big data
● Stream is the new abstraction layer people are
exploring in big data
● With the right implementation, streams can support both
streaming and batch applications much more effectively
than existing abstractions.
● Batch on streaming is a new way of looking at processing,
rather than treating streaming as a special case of
batch
● Batch can be faster on streaming than on dedicated batch
processing
Frameworks with stream as abstraction
Apache Flink
● Flink’s core is a streaming dataflow engine that
provides data distribution, communication, and fault
tolerance for distributed computations over data
streams.
● Flink provides
○ Dataset API - for bounded streams
○ Datastream API - for unbounded streams
● Flink embraces the stream as an abstraction to implement
its dataflow.
Flink stack
Flink history
● The Stratosphere project started at the Technical University of
Berlin in 2009
● Entered Apache incubation in March 2014
● Became a top level project in Dec 2014
● Started as a stream engine for batch processing
● Started to support streaming a few versions ago
● Data Artisans is a company founded by the core Flink team
Flink streaming
● Flink Streaming is an extension of the core Flink API for
high-throughput, low-latency data stream processing
● Supports many data sources like Flume, Twitter and
ZeroMQ, and also any user defined data source
● Data streams can be transformed and modified using
high-level functions similar to the ones provided by the
batch processing API
● Sounds much like what Spark streaming promises!
Streaming is not fast batch processing
● Most streaming frameworks focus too much on
latency when they develop streaming extensions
● Both Storm and Spark streaming view streaming as a low
latency batch processing system
● Though latency plays an important role in real time
applications, the needs and challenges go beyond it
● Addressing the complex needs of modern streaming
systems needs a fresh view of streaming APIs
Streaming in Lambda architecture
● Streaming is viewed as a limited, approximate, low
latency computing system compared to a batch system
in the lambda architecture
● So we usually run a streaming system to get low latency
approximate results and run a batch system to get high
latency but accurate results
● All these limitations of streaming stem from
conventional thinking and implementations
● The new idea is: why not make streaming a low latency, accurate
system itself?
Google dataflow
● Google articulated the first modern streaming
framework, supporting low latency, exactly-once, accurate
stream applications, in their Dataflow paper
● It describes a single system which can replace the need for
separate streaming and batch processing systems
● This is known as the Kappa architecture
● Modern stream frameworks embrace this over the lambda
architecture
● Google Dataflow is open sourced under the name
Apache Beam
Google dataflow and Flink streaming
● Flink adopted the Dataflow ideas for its streaming API
● The Flink streaming API went through a big overhaul in the 1.0
version to embrace these ideas
● It was relatively easy to adopt the ideas, as both Google
Dataflow and Flink use streaming as the abstraction
● Spark 2.0 may add some of these ideas in its
structured stream processing effort
Needs of modern real time applications
● Ability to handle out-of-order events in unbounded data
● Ability to correlate events with different dimensions
of time
● Ability to correlate events using custom application
based characteristics like sessions
● Ability to do both micro-batch and event-at-a-time processing
on the same framework
● Support for complex stream processing libraries
Mandatory wordcount
● Streams are represented using DataStream in Flink
streaming
● DataStream supports both RDD-like and Dataset-like APIs for
manipulation
● In this example,
○ Read from a socket to create a DataStream
○ Use map, keyBy and sum operations for aggregation
● com.madhukaraphatak.flink.streaming.examples.StreamingWordCount
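The shape of that pipeline can be sketched in plain Python (the deck's example is Scala on Flink's DataStream API; the code below only models the semantics, it is not Flink API):

```python
# map / keyBy / sum over a finite stand-in for the socket stream.
lines = ["to be or not", "to be"]      # stands in for the socket source

# map stage: emit a (word, 1) pair per incoming word
pairs = [(w, 1) for line in lines for w in line.split()]

# keyBy + sum stage: per-key running state, updated one event at a time
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The important point is that the per-key state is updated as each event arrives, rather than recomputed per batch.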
Flink streaming vs Spark streaming
● Spark Streaming: streams are represented using DStreams;
Flink Streaming: streams are represented using DataStreams
● Spark: the stream is discretized into mini batches;
Flink: the stream is not discretized
● Spark: supports an RDD DSL; Flink: supports a Dataset-like DSL
● Spark: stateless by default; Flink: stateful at the operator level by default
● Spark: runs a mini batch for each interval;
Flink: runs pipelined operators for each event that comes in
● Spark: near real time; Flink: real time
Discretizing the stream
● Flink by default does not need any discretization of the stream
to work
● But using the window API, we can create a discretized stream
similar to Spark
● The window state is discarded as and when each batch
is computed
● This way you can mimic Spark micro batches in Flink
● com.madhukaraphatak.flink.streaming.examples.WindowedStreamingWordCount
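The discretization that example mimics can be sketched as tumbling-window assignment by timestamp (illustrative plain Python, not Flink API; event shapes are assumptions):

```python
# Assign events to fixed tumbling windows, count per window, and forget
# each window's state once it is computed.
events = [("a", 1), ("b", 3), ("a", 6), ("a", 7)]   # (word, timestamp)
window_size = 5

per_window = {}
for word, ts in events:
    pane = ts // window_size            # each event lands in exactly one pane
    per_window.setdefault(pane, {})
    per_window[pane][word] = per_window[pane].get(word, 0) + 1

# Emitting a pane and then discarding it is exactly a micro batch:
# per_window == {0: {"a": 1, "b": 1}, 1: {"a": 2}}
```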
Understanding the dataflow of Flink
● All programs in Flink, both batch and streaming, are
represented using a dataflow
● This dataflow signifies the stream abstraction provided
by the Flink runtime
● This dataflow treats all data as streams and
processes it using a long running operator model
● This is quite different from the RDD model of Spark
● The Flink UI allows us to understand the dataflow of a given Flink
program
Running in local mode
● bin/start-local.sh
● bin/flink run -c com.madhukaraphatak.flink.streaming.examples.StreamingWordCount /home/madhu/Dev/mybuild/flink-examples/target/scala-2.10/flink-examples_2.10-1.0.jar
Dataflow for wordcount example
Operator fusing
● The Flink optimiser fuses operators for efficiency
● All fused operators run in the same thread, which
saves the serialization and deserialization cost between
the operators
● For all fused operators, Flink generates a nested
function which comprises all the code from the operators
● This is much more efficient than RDD optimization
● Spark's Dataset is planning to support this functionality
● You can disable this with env.disableOperatorChaining()
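The fusing idea can be sketched as function composition (a conceptual sketch, not Flink's generated code; operator names are made up):

```python
# Fused operators run as one nested function in a single thread, so a
# record is never serialized and handed off between operator threads.
parse = lambda s: int(s.strip())
double = lambda n: n * 2
fmt = lambda n: f"value={n}"

def fuse(*ops):
    """Build the single nested function a fusing optimiser would generate."""
    def fused(record):
        for op in ops:
            record = op(record)        # one call per record, no hand-off
        return record
    return fused

pipeline = fuse(parse, double, fmt)
# pipeline(" 21 ") == "value=42"
```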
Dataflow without operator fusing
Flink streaming vs Spark streaming
● Spark Streaming: uses the RDD distribution model for processing;
Flink Streaming: uses a pipelined stream processing paradigm
● Spark: parallelism is done at the batch level;
Flink: parallelism is controlled at the operator level
● Spark: uses RDD immutability for fault recovery;
Flink: uses asynchronous barriers for fault recovery
● Spark: RDD-level optimization for stream optimization;
Flink: operator fusing for stream optimization
Window API
● Powerful API to track and do custom state analysis
● Types of windows
○ Time window
■ Tumbling window
■ Sliding window
○ Non time based window
■ Count window
● Ex : WindowExample.scala
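The two time-window shapes can be sketched as assignment functions (plain Python, not Flink's window-assigner API; the size/slide values are arbitrary):

```python
# A tumbling window assigns a timestamp to exactly one pane; a sliding
# window of size 4 sliding by 2 can assign it to overlapping panes.
SIZE, SLIDE = 4, 2

def tumbling(ts):
    return (ts // SIZE) * SIZE                  # the single pane start

def sliding(ts):
    # all pane starts s (multiples of SLIDE) with s <= ts < s + SIZE
    starts = range(((ts - SIZE) // SLIDE + 1) * SLIDE, ts + 1, SLIDE)
    return [s for s in starts if s >= 0]

# tumbling(5) == 4; sliding(5) == [2, 4] -- timestamp 5 belongs to the
# overlapping panes [2, 6) and [4, 8)
```

A count window, by contrast, ignores timestamps entirely and closes a pane after a fixed number of elements.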
Anatomy of Window API
● The window API is made of 3 different components
● The three components of a window are
○ Window assigner
○ Trigger
○ Evictor
● These three components make up the whole window API in
Flink
Window Assigner
● A function which determines, for a given element, which
window it should belong to
● Responsible for the creation of windows and assigning
elements to a window
● Two types of window assigners
○ Time based window assigner
○ GlobalWindow assigner
● Users can write their own custom window assigners too
Trigger
● A trigger is a function responsible for determining when a
given window is triggered
● In a time based window, this function waits till the time is
up to trigger
● But in a non time based window, it can use custom logic
to determine when to evaluate a given window
● In our example, the number of records in a
given window is used to determine whether to trigger or not.
● WindowAnatomy.scala
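A count-based trigger of the kind the example describes can be sketched like this (plain Python; the buffer/fire loop is an illustration, not Flink's Trigger interface):

```python
# The window fires when it has collected a given number of records,
# independent of time.
MAX_COUNT = 3

buffer, fired = [], []
for event in range(1, 8):
    buffer.append(event)
    if len(buffer) >= MAX_COUNT:   # the trigger decides when to evaluate
        fired.append(buffer)       # evaluate the window pane
        buffer = []                # the evictor clears the pane

# fired == [[1, 2, 3], [4, 5, 6]]; buffer == [7], still waiting
```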
Building custom session window
● We want to track the session of a user
● Each session is identified using a sessionID
● We get an event when the session starts
● Evaluate the session when we get the end-of-session
event
● For this, we implement our own custom window
trigger which tracks the end of the session
● Ex : SessionWindowExample.scala
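The session-window logic can be sketched as follows (plain Python; the event shape and names are assumptions for illustration, not the SessionWindowExample API):

```python
# Buffer events per sessionId and evaluate a session only when its end
# event arrives.
open_sessions, closed_sessions = {}, {}

def on_event(session_id, payload, is_end):
    events = open_sessions.setdefault(session_id, [])
    events.append(payload)
    if is_end:                                  # end-of-session trigger
        closed_sessions[session_id] = open_sessions.pop(session_id)

for e in [("s1", "login", False), ("s2", "login", False),
          ("s1", "click", False), ("s1", "logout", True)]:
    on_event(*e)

# closed_sessions == {"s1": ["login", "click", "logout"]}; "s2" stays open
```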
Concept of Time in Flink streaming
● Time in a streaming application plays an important role
● So the ability to express time in a flexible way is a very
important feature of a modern streaming application
● Flink supports three kinds of time
○ Processing time
○ Event time
○ Ingestion time
● Event time is one of the important features of Flink which
complements the custom window API
Understanding event time
● Time in Flink needs to address the following two questions
○ When did the event occur?
○ How much time has passed since the event?
● The first question is answered by assigning timestamps
● The second question is answered by understanding the
concept of watermarks
● Ex : EventTimeExample.scala
Watermarks in Event Time
● A watermark is a special signal which signifies the flow of
time in Flink
● In the above diagram, w(20) signifies that 20 units of time
have passed at the source
● Watermarks allow Flink to support different time
abstractions
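One common watermark strategy is bounded out-of-orderness (an assumption here, not necessarily what EventTimeExample uses): the watermark trails the largest timestamp seen so far, asserting "no event older than this is still expected".

```python
# Watermark = max event timestamp seen, minus an allowed lateness bound.
MAX_OUT_OF_ORDERNESS = 5
max_ts = float("-inf")

def on_timestamp(ts):
    global max_ts
    max_ts = max(max_ts, ts)
    return max_ts - MAX_OUT_OF_ORDERNESS   # current watermark

watermarks = [on_timestamp(ts) for ts in [10, 12, 9, 20]]
# watermarks == [5, 7, 7, 15]: the watermark never goes backwards, and
# the late event (9) does not pull it back
```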
References
● http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● http://blog.madhukaraphatak.com/categories/flink-streaming/
● https://www.youtube.com/watch?v=y7f6wksGM6c
● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
● https://www.youtube.com/watch?v=v_exWHj1vmo
● http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink