DON'T CROSS THE STREAMS!
STREAMING AND APACHE FLINK
Senior Data Consultant
Dublin
JOHN GORMAN
amberhand
WHAT WE WILL COVER
It's all about pain!
Streaming and Related Terminology
Stream Processing Engines
Apache Flink
Don't Cross The Streams  - Data Streaming And Apache Flink
It started with a pain...a software pain
Things were big, slow & shaky....and getting worse!
The calm before the storm
Batch Processing (High Latency, inability to reason about
time)
Coupled systems prevented fast delivery of single change
requirements
Processing large distributed data
Messaging incorporated business logic (Service Bus)
Customers demanded immediate insight/action
Event Ordering/Timing, Consistency, Data Lineage
Lack of Fault Tolerant Systems
Someone noticed the need to change some time back...
Oh! The other Michael Hammer...
Ref: Michael Hammer - Harvard Business Review 1990
“We cannot achieve breakthroughs in
performance by cutting fat or automating
existing processes. Rather, we must
challenge old assumptions and shed the
old rules that made the business
underperform in the first place.”
Ref: Michael Hammer - Harvard Business Review 1990
“These rules of work design are based on
assumptions about technology, people,
and organisational goals that no longer
hold”
So...Software Legends set out to fix it...
THE PERFECT STORM
Elements of the "Perfect Storm"
Elements of the "Perfect Storm" contd.
Can something save us?
Streams!
flowing from a to a
Any event that happens internal or external to your
company is fair game for inclusion in a stream!
WHAT ARE STREAMS?
Unbounded Events Producer Consumer
Streaming obliterates old working habits rather than
automating them
When did you last drop a DVD back to your video store?
Convenience of streaming films won out
Anyone using Dublin Bus still carry a timetable?
Realtime with Context is needed...
SOME OTHER COMMON STREAM EXAMPLES
Log files
User website clicks
Finance stocks
Social media streams
Ideal Stream Characteristics
Low Latency (Time required to produce some result)
High Throughput (Number of results produced in time)
Persisted for reuse
Fault Tolerant
Scalable Event Production (i.e. Partitioning)
Scalable Event Consumption (i.e. Consumer Groups)
Consumer manages state (offsets)
Handle Back Pressure
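A minimal in-memory sketch of partitioning, consumer groups and consumer-managed offsets, as listed above. It is not the Kafka API: `PartitionedLog`, `ConsumerGroup` and the sample keys are hypothetical names chosen for illustration, and a real broker also handles persistence, replication and group rebalancing.

```python
from collections import defaultdict
import zlib


class PartitionedLog:
    """An append-only log split into partitions (hypothetical, in-memory)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash sends the same key to the same partition,
        # which preserves ordering per key.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)


class ConsumerGroup:
    """Keeps one read position (offset) per partition for a group."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition, max_records=10):
        start = self.offsets[partition]
        records = self.log.partitions[partition][start:start + max_records]
        # The consumer, not the log, advances and stores its own position.
        self.offsets[partition] = start + len(records)
        return records


log = PartitionedLog(num_partitions=3)
for i in range(6):
    log.append(f"user-{i % 2}", f"click-{i}")

group = ConsumerGroup(log)
p0 = log.partition_for("user-0")
first = group.poll(p0, max_records=2)  # read two records
rest = group.poll(p0)                  # resume where the group left off
```

Because the offset lives with the consumer group, a second group could read the same partition from offset 0 without affecting the first.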
Benefits of streams
Ability to augment and enrich data streams
Duality of Streams and Tables (Only Streams Work)
Replay from a defined offset
Stream outputs can become stream inputs (unix pipes!)
Data first - Processing Later (Fast feature creation)
Stream your monitoring (Logs, Ops Metrics, Business KPI
etc.)
Benefits of streams contd.
Location in Time Testing (Bugs In Code)
Replication for Scale
Cross/Join previously unrelated sources (i.e. Time, Context -
Analytics)
Point of Record Stream (produce suitable Materialized
Views)
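The stream/table duality, replay and materialized-view points above can be sketched in a few lines. This is an illustration only: the `materialize` function, the account keys and the convention that a `None` value means "delete" are assumptions of the sketch, not something taken from the slides.

```python
def materialize(changelog, upto=None):
    """Fold a changelog of (key, value) events into a table (a dict).

    Replaying the whole log rebuilds the current table; replaying only
    the first `upto` events rebuilds the table as it was at that point.
    A value of None is treated as a delete (assumption of this sketch).
    """
    table = {}
    for key, value in changelog[:upto]:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


changelog = [
    ("acct-1", 100),
    ("acct-2", 50),
    ("acct-1", 80),    # update overwrites the earlier value
    ("acct-2", None),  # delete marker removes the row
    ("acct-3", 10),
]

current_view = materialize(changelog)
view_at_offset_2 = materialize(changelog, upto=2)
```

Replaying to an earlier offset is what makes "location in time" testing possible: the state that triggered a bug can be rebuilt from the same events.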
MOST POPULAR STREAMING TOOLS
Apache Kafka
Amazon Kinesis - Based on Kafka Ideas
MapR Streams - Uses Kafka API (adds resilience features)
Can these Streams handle the load?
Apache Kafka Data Handling at LinkedIn
LinkedIn Engineering Blog March 20, 2015
We have the stream! Now what?
Enter the Stream Processing Engine
What is a Stream Processing Engine?
8 Requirements of a Real-Time Stream Processing Engine
(Michael Stonebraker)
1. Keep the data moving
2. Query using SQL on Stream
3. Handle Stream Imperfections (Delayed, Missing,
Out-Of-Order Data)
4. Generate Predictable Outcomes
5. Integrate Stored and Streaming Data
6. Guarantee Data Safety and Availability
7. Partition and Scale Applications Automatically
8. Process and Respond Instantaneously
OK - Engines on... What can we do with it?
Stream Processing Engine - Use Cases
Lineage, Auditing, History (Immutable)
Internet of Things (Sensor data)
Realtime Monitoring (Failure Prevention)
Autonomous Cars
Fraud/Anomaly Detection
Health devices (Fitbit, cardiac pacemakers etc.)
For System of record (Infinite persistence)
Digital Marketing
Network monitoring
Realtime pricing / analytics
Stream Processing Engine - Use Cases Contd...
Intelligence and Surveillance
Risk management (Realtime Asset Coverage)
E-commerce (Realtime customer retention)
Fraud detection (Card, Insurance)
Smart order routing
Transaction cost analysis
Pricing and analytics
Market data management
Algorithmic trading
Data warehouse augmentation
Streaming does not mandate BigData
Streaming does not mandate RealTime processing
...but many application types may mandate either or both
Ok great - Let's dig into an engine...
APACHE FLINK
Apache Flink Components
Apache Flink Architecture
Source: DataArtisans (BerlinBuzzwords 2016)
Job Manager UI - (For Job Submission & Monitoring)
Job Manager UI - (Plan and Scheduling)
WAIT! Let's clear a few things up...
Pipelining & Backpressure
Time Semantics (Event, Ingestion, Processing etc.)
Windows (count, rolling, session, custom)
Watermarks, Triggers (Inserted into stream)
Checkpoints (Async Recovery - Choice of state store
backend)
"Exactly Once" semantics (no need to ask whether a failure
happened on send, process or return)
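To make event time, windows and watermarks concrete, here is a library-free sketch. It is not Flink code: the function name, the bounded-out-of-orderness watermark rule and the policy of dropping late events are assumptions chosen to keep the example short.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size, max_out_of_orderness):
    """Sum values per event-time window, firing a window once the
    watermark passes its end.

    events: iterable of (event_time, value) in arrival order.
    Returns (results, late): results is a list of (window_start, total)
    in firing order, late holds events whose window had already fired.
    """
    open_windows = defaultdict(int)  # window_start -> running sum
    results, late = [], []
    watermark = float("-inf")

    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            late.append((event_time, value))  # window already fired
        else:
            open_windows[window_start] += value

        # Watermark: "no event older than this is expected any more".
        watermark = max(watermark, event_time - max_out_of_orderness)

        # Fire every open window whose end is at or before the watermark.
        ready = sorted(w for w in open_windows if w + window_size <= watermark)
        for start in ready:
            results.append((start, open_windows.pop(start)))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        results.append((start, open_windows.pop(start)))
    return results, late


# Arrival order differs from event-time order; the event at time 9
# arrives after its window [0, 10) has fired.
events = [(1, 1), (4, 1), (3, 1), (12, 1), (8, 1), (25, 1), (9, 1)]
results, late = tumbling_window_sums(events, window_size=10,
                                     max_out_of_orderness=5)
```

The event at time 8 arrives after the one at time 12 but is still counted, because the watermark had not yet passed the end of its window. A larger `max_out_of_orderness` tolerates more disorder at the cost of later results.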
Apache Flink - Features out of the box!
Support for Event Time and Out-of-Order Events
Exactly-once Semantics for Stateful Computations
Highly flexible Streaming Windows & CEP
Continuous Streaming Model with Backpressure (Buffers)
Fault-tolerance via Lightweight Distributed Snapshots
One Runtime for Streaming and Batch Processing
Memory Management & Custom Serialization
Iterations and Delta Iterations
Program Optimizer
SQL (Batch and Streams) due soon in 1.1
But I'm only here for the Machine Learning and Graph
Processing!!...
Machine Learning in Flink with FlinkML
* Apache Samoa Project - Streaming Machine Learning that works on top of Flink
** Apache Mahout - Batch based Machine Learning that works on top of Flink
Graph Processing in Flink?
"Gelly" is Apache Flink's Graph Analysis API
Iterative Graph processing abstractions on top of Flink
1. Vertex-Centric Iterations (like Pregel, Giraph)
2. Scatter-Gather Iterations
3. Gather-Sum-Apply (like PowerGraph)
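The vertex-centric model can be sketched without any graph library. The following is not the Gelly API: it is a plain-Python illustration of supersteps and message passing for single-source shortest paths, with a hypothetical function name and sample graph, and it assumes non-negative edge weights.

```python
import math


def vertex_centric_sssp(edges, source):
    """Single-source shortest paths computed in supersteps.

    edges: dict mapping vertex -> list of (neighbour, weight).
    In each superstep, every vertex that received messages takes the
    smallest candidate distance; if that improves its current value it
    updates itself and messages its neighbours. The computation stops
    when no messages are in flight. Assumes non-negative weights.
    """
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    distance = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # the initial message activates the source
    supersteps = 0

    while inbox:
        supersteps += 1
        outbox = {}
        for vertex, messages in inbox.items():
            candidate = min(messages)
            if candidate < distance[vertex]:
                distance[vertex] = candidate
                for neighbour, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbour, []).append(candidate + weight)
        inbox = outbox
    return distance, supersteps


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 1)],
}
distance, supersteps = vertex_centric_sssp(graph, "a")
```

Only vertices that receive a message do any work in a superstep, which is the property that lets this style of iteration scale to large graphs.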
GELLY SUPPORTS
1. Graph Properties (numberOfVertices etc...)
2. Transformations (map, difference, join...)
3. Mutations (Add/Remove vertices/edges...)
4. Batch and Streams - Java, Scala
* External "Gradoop" Project adds further features on top of Flink
Graph Processing with Gelly - Algorithms
PageRank
Single Source Shortest Path
Label Propagation
Weakly Connected Components
Community Detection
Planned Algorithms
Triangle Count
HITS
Affinity Propagation
Graph Summarization
Planned Algorithms - Attribution: Vasia Kalavri
Ecosystem Integration
Data Source/Sinks via Connectors (Kafka, JDBC, S3, etc)
Storm and Cascading & MapReduce support
Machine Learning - Apache Samoa (Streaming ML),
Apache Mahout (Batch)
Graph - Gradoop
Python API, Scala REPL, Apache Zeppelin Support
DataFlow Model - Apache Beam (API Abstraction + Flink
"Runner")
Apache Beam - Data Flow Model Support in Flink
Supported Distributions / Deployment Options
HortonWorks - Ambari Service (Confirmed full support on
the way)
Cloudera - Not Supported to my knowledge (Discussion
forums ref BigTop)
MapR - Not part of their MapR converged data platform
Amazon EMR (Yarn - Single Instance, Session)
Google Compute Engine (Yarn Support & Hosted
Competitor -> Cloud Dataflow)
Via Apache Myriad on Mesos (Native support coming in
1.2)
Some DataStream API Code (Setup)
* Code courtesy of DataArtisans on GitHub
Some DataStream Code (Destination Sink & Running)
Sometimes, crossing the streams is the solution you need...
Crossing the streams with DataStream API
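As a rough, library-free illustration of what joining two streams on a key within a window means (this is not the Flink DataStream API, and the function name, taxi identifiers and fares are hypothetical):

```python
def window_join(left, right, window_size):
    """Join two streams of (event_time, key, value) on key, pairing only
    events that fall into the same tumbling event-time window.

    This is a batch-style sketch of the semantics; a streaming engine
    would build the same buckets incrementally and fire them on watermarks.
    """
    buckets = {}  # (window_start, key) -> (left values, right values)
    for event_time, key, value in left:
        window_start = (event_time // window_size) * window_size
        buckets.setdefault((window_start, key), ([], []))[0].append(value)
    for event_time, key, value in right:
        window_start = (event_time // window_size) * window_size
        buckets.setdefault((window_start, key), ([], []))[1].append(value)

    joined = []
    for (window_start, key), (lefts, rights) in sorted(buckets.items()):
        for left_value in lefts:
            for right_value in rights:
                joined.append((window_start, key, left_value, right_value))
    return joined


rides = [
    (1, "taxi-7", "ride started"),
    (12, "taxi-7", "ride ended"),
    (3, "taxi-9", "ride started"),
]
fares = [
    (4, "taxi-7", 12.5),
    (25, "taxi-9", 8.0),
]
joined = window_join(rides, fares, window_size=10)
```

Only `taxi-7` produces a joined record here: its ride and fare events share the window starting at 0, while the `taxi-9` events fall into different windows.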
Crossing the streams with CEP Library
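Complex event processing looks for patterns across events rather than transforming them one at a time. The sketch below is not the Flink CEP library: it hand-codes a single "A followed by B within a time limit, per key" pattern, and the function name, rack identifiers and event kinds are hypothetical.

```python
def detect_followed_by(events, first, second, within):
    """Find `first` followed by `second` within `within` time units, per key.

    events: list of (event_time, key, kind), assumed sorted by event_time.
    Returns a list of (key, first_time, second_time) matches.
    """
    pending = {}  # key -> time of the most recent unmatched `first` event
    matches = []
    for event_time, key, kind in events:
        if kind == first:
            pending[key] = event_time
        elif kind == second and key in pending:
            started = pending.pop(key)
            if event_time - started <= within:
                matches.append((key, started, event_time))
    return matches


events = [
    (0, "rack-1", "warning"),
    (3, "rack-2", "warning"),
    (4, "rack-1", "overheat"),   # within 10 of the rack-1 warning: match
    (20, "rack-2", "overheat"),  # too long after the rack-2 warning
    (21, "rack-1", "overheat"),  # no preceding unmatched warning
]
alerts = detect_followed_by(events, "warning", "overheat", within=10)
```

A CEP library generalises this idea to declarative patterns with several steps, conditions and time constraints, so the state handling above does not have to be written by hand.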
Proposed Flink 1.1 SQL API
* Code courtesy of DataArtisans on GitHub
Flink Furthering Yahoo Benchmarks
Apache Flink Adoption
What's Next For Flink?
Queryable State (Database inversion! Kafka log, RocksDB)
Release of 1.1+
Dynamic Scaling, Resource Elasticity (i.e. for catchup)
Production Hardening (1,000-node cluster at Alibaba)
Stream SQL (Apache Calcite)
CEP Enhancements (large sized async state snapshotting)
Mesos Support
More Connectors
API enhancements (joins, slowly changing inputs)
Security (data encryption, Kerberos with Kafka)
Email: john.gorman@amberhand.ie
LinkedIn: johnpgorman
THANK YOU
ACKNOWLEDGEMENTS
Bank Of Ireland - Event and Venue
Hadoop User Group Ireland - Community Building
Data Artisans - Images, Code and Community Support
Anne Ebeling - Dublin Artwork
RESOURCES
APACHE FLINK TRAINING
APACHE FLINK TAXI STREAM EXAMPLE
BACK PRESSURE IN FLINK
CEP MONITORING SAMPLE
RUNNING FLINK ON YARN
STREAMING 101 BY TYLER AKIDAU
STREAMING 102 BY TYLER AKIDAU
MAPR FREE EBOOK ON STREAMING ARCHITECTURE
