Introduction to Apache Flink
A native Streaming Data Flow Engine
Stefan Papp
21.10.2015 – The day of Marty McFly's Arrival
2 #Streaming
Streaming is the biggest thing since Hadoop
3
Streaming
Process data immediately at event
time.
Consequences
• Data is processed immediately
• The act of processing data is
more repetitive
Batch Processing
Process collected data at
scheduled time or when a
sufficient amount of data has
been accumulated.
Consequences
• More transactions processed at
one time in a single process
• Higher processing time
#Stream vs Batch
Batch Processing vs. Streaming
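To make the contrast concrete, here is a minimal sketch in plain Scala (no Flink dependency). The `Event` type and both function names are assumptions made for illustration, not part of the slides. The streaming variant emits an updated result for every incoming event, while the batch variant waits for the collected data and processes it in a single pass.

```scala
object BatchVsStreaming {
  // Hypothetical event type, introduced only for this illustration.
  final case class Event(timestamp: Long, value: Int)

  // Streaming style: emit an updated running sum as each event arrives.
  def streamingSums(events: Iterator[Event]): Iterator[Int] =
    events.scanLeft(0)(_ + _.value).drop(1)

  // Batch style: process the collected events in one pass.
  def batchSum(collected: Seq[Event]): Int =
    collected.map(_.value).sum
}
```

The streaming function produces one result per event (more repetitive work, immediate output); the batch function produces one result per run.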
4
Past
• Insufficient technologies for streaming,
focus on batch
• Some technologies were not real streaming,
but only microbatches
• Either batch or streaming, but no engine that can do both
Now
• Technologies have matured
• Streaming is in high demand in business
Streaming – The challenges in the past
#Stream
5
Streaming Solutions
6 #Streaming
The Focus moves from Storage to Processing
7
Technology Stack
Storage Layer
General Purpose Processing Engine
SQL Engine | Abstraction Engine | ML | Graph | Streaming
8
Technology Stack with Technologies
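Storage Layer: Hadoop, S3, ...
General Purpose Processing Engine: Flink, Spark
SQL Engine: Hive, SparkSQL
Abstraction Engine: Cascading, Pig
ML: FlinkML, MLlib
Graph: Gelly, GraphX
Streaming: Flink, Spark Streaming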
9
Old Style Batch Processing: MapReduce
Step Step Step Step Step
Client
for (int i = 0; i < maxIterations; i++) {
// Execute MapReduce job
}
10
Optimized Execution
case class Path (from: Long, to: Long)

val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) =>
          Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}
[Figure: Program → Dataflow Graph (independent of batch or streaming job).
Pre-flight (Client): type extraction stack, optimizer.
Master: task scheduling, dataflow metadata; deploys operators and tracks intermediate results.
Worker: runs the operators.
Example plan: DataSource orders.tbl → Filter → Map, and DataSource lineitem.tbl, feed a Hybrid Hash Join (build HT / probe, hash-part [0] on both inputs), followed by GroupRed (sort, forward).]
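The transitive-closure program on this slide can be mirrored on ordinary Scala collections to show what each iteration computes. This is only a sketch under stated assumptions: it uses in-memory `Set`s instead of Flink's `DataSet`, and the names `step` and `closure` are made up for illustration. It says nothing about how Flink's optimizer actually executes the job.

```scala
object TransitiveClosureSketch {
  final case class Path(from: Long, to: Long)

  // One iteration step, mirroring the slide: join paths with edges where
  // path.to == edge.from, union with the existing paths, de-duplicate
  // (a Set removes duplicates implicitly).
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val extended = for {
      path <- paths
      edge <- edges
      if path.to == edge.from
    } yield Path(path.from, edge.to)
    extended ++ paths
  }

  // Apply the step a fixed number of times, like iterate(10) on the slide.
  def closure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))
}
```

In the MapReduce style of the previous slide, a loop like the one in `closure` lives in the client and submits one job per step; on this slide the whole loop is part of a single dataflow program.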
11
Old Style Streaming (Micro Batches)
stream
discretizer
Job Job Job Job
while (true) {
// get next few records
// issue batch job
}
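The micro-batch loop above can be sketched in plain Scala as follows. Two assumptions to note: the discretizer here cuts the stream by record count, whereas real micro-batch engines typically cut by time interval, and the "batch job" is reduced to a simple sum. The function name is hypothetical.

```scala
object MicroBatchSketch {
  // Discretize a record stream into fixed-size micro-batches and run one
  // small "batch job" (here: a sum) per micro-batch.
  def microBatchSums(records: Iterator[Int], batchSize: Int): Iterator[Int] =
    records.grouped(batchSize).map(batch => batch.sum)
}
```

No result for a record is available until its whole micro-batch has been collected, which is the latency cost of this approach compared with native streaming.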
12
Streaming Topology
#Streaming
Data Source 1
Data Source 2
Spout
Spout
Bolt
Bolt
Bolt
Bolt
Target
Topology
13 #Streaming
Apache Flink is a Native Streaming GPPE
14
The Flink Ecosystem in a Nutshell
Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP)
APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Remote, Yarn, Tez, Embedded
15
Native workload support
#workload
Flink
Streaming
topologies
Long batch
pipelines
Machine Learning at scale
Graph Analysis
16
1. Execute everything as streams
2. Allow some iterative (cyclic) dataflows
3. Allow some mutable state
4. Operate on managed memory
Flink Engine – Core Features
17
3 Parts of a Streaming Infrastructure
#Streaming
Gathering Broker Analysis
Sensors
Transaction
logs …
Server Logs
18
• Batch programs are a special kind of streaming program
Batch on Streaming
Infinite Streams Finite Streams
Stream
Windows
Global View
Pipelined
Data Exchange
Pipelined or
Blocking Exchange
Streaming Programs Batch Programs
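The idea that a batch program is a streaming program over a finite stream can be illustrated with a single operator definition in plain Scala. This is a sketch, not Flink code: `runningCounts` is a hypothetical name, and Scala's `Iterator` stands in for both finite and infinite streams.

```scala
object BatchOnStreamingSketch {
  // One operator definition: a running count per word, emitted per record.
  // The mutable map plays the role of operator state.
  def runningCounts(words: Iterator[String]): Iterator[(String, Int)] = {
    val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
    words.map { w =>
      counts(w) += 1
      (w, counts(w))
    }
  }

  // Batch use: the stream is finite, so it can be consumed to the end and
  // the last count per word gives the global view.
  def batchCounts(words: Seq[String]): Map[String, Int] =
    runningCounts(words.iterator).toMap
}
```

On an infinite input such as `Iterator.continually("x")`, the same operator keeps emitting updates and only a bounded portion (for example a window) can ever be inspected.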
19
Expressive APIs
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
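To show what a sliding window such as "5 seconds, evaluated every 1 second" computes, here is a plain-Scala sketch of a windowed word count. It does not use the Flink API. The input format (timestamp in seconds plus line), the function names, and the convention that windows start at non-negative multiples of the slide are all assumptions made for this illustration.

```scala
object SlidingWindowSketch {
  // Start times of all windows [start, start + size) that contain ts.
  // Assumes ts >= 0 and that windows start at non-negative multiples of slide.
  def windowStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size && start >= 0)
      .toSeq
  }

  // Word count per (window start, word) over (timestampSeconds, line) pairs.
  def windowedCounts(lines: Seq[(Long, String)], size: Long, slide: Long): Map[(Long, String), Int] = {
    val keyed = for {
      (ts, line) <- lines
      word <- line.split(" ").toSeq
      if word.nonEmpty
      start <- windowStarts(ts, size, slide)
    } yield ((start, word), 1)
    keyed.groupBy(_._1).map { case (key, ones) => key -> ones.map(_._2).sum }
  }
}
```

Because the slide is shorter than the window size, each record is counted in several overlapping windows, which is why the streaming variant produces more output than the batch variant for the same words.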
20
Table API
val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = AUTOMOBILE")

val orders = env.readCsvFile(…)
  .filter( o => dateFormat.parse(o.orderDate).before(date) )
  .as("orderId, custId, orderDate, shipPrio")

val items = orders
  .join(customers).where("custId = id")
  .join(lineitems).where("orderId = id")
  .select("orderId, orderDate, shipPrio, " +
    "extdPrice * (Literal(1.0f) - discount) as revenue")

val result = items
  .groupBy("orderId, orderDate, shipPrio")
  .select("orderId, revenue.sum, orderDate, shipPrio")
21
Data Source – Processing – Data Sink
© 2014 Teradata

Introduction to Apache Flink at Vienna Meet Up


Editor's Notes

  • #11: Toy program: native transitive closure. Type extraction: the types that go in and out of each operator.