Data Analysis With Apache Flink
Aljoscha Krettek / Till Rohrmann
Flink committers
Co-founders @ data Artisans
aljoscha@apache.org / trohrmann@apache.org
What is Apache Flink?
Functional API, Relational API, Graph API, Machine Learning, and more, all built on an iterative dataflow engine.
Apache Flink Stack
(Stack diagram; components as shown on the slide*)
Libraries: Python, Gelly, Table, FlinkML, SAMOA, Hadoop M/R
APIs: DataSet (Java/Scala) with Batch Optimizer, DataStream (Java/Scala) with Stream Builder
Distributed Runtime
Deployment: Local, Remote, YARN, Tez, Embedded, Dataflow
*current Flink master + a few PRs
Example Use Case: Log Analysis

What Seems to be the Problem?
• Collect clicks from a webserver log
• Find interesting URLs
• Combine with user data
(Diagram: web server log and user database; extract the clicks, massage them, and combine them with the user data to get the interesting user data.)
The Execution Environment
• Entry point for all Flink programs
• Creates DataSets from data sources

ExecutionEnvironment env =
    ExecutionEnvironment.getExecutionEnvironment();
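Besides readTextFile (used on the next slide), the environment offers further source methods. A minimal sketch, not from the slides, with placeholder paths; readCsvFile and fromElements are part of the Java DataSet API:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class Sources {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One String per line of the file
        DataSet<String> lines = env.readTextFile("hdfs:///log");

        // CSV file parsed into (String, Integer) tuples
        DataSet<Tuple2<String, Integer>> users =
            env.readCsvFile("hdfs:///users.csv").types(String.class, Integer.class);

        // Small in-memory DataSet, handy for tests
        DataSet<Integer> numbers = env.fromElements(1, 2, 3);

        // A sink (e.g. writeAsText) plus env.execute() would be needed to actually run a job.
    }
}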
Getting at Those Clicks

DataSet<String> log = env.readTextFile("hdfs:///log");

DataSet<Tuple2<String, Integer>> clicks = log.flatMap(
    (String line, Collector<Tuple2<String, Integer>> out) -> {
        String[] parts = line.split("*magic regex*");
        if (isClick(parts)) {
            out.collect(new Tuple2<>(parts[1], Integer.parseInt(parts[2])));
        }
    });

Example log lines:
post /foo/bar… 313
get /data/pic.jpg 128
post /bar/baz… 128
post /hello/there… 42
The Table Environment
• Environment for dealing with Tables
• Converts between DataSet and Table

TableEnvironment tableEnv = new TableEnvironment();
Counting those Clicks

Table clicksTable = tableEnv.toTable(clicks, "url, userId");

Table urlClickCounts = clicksTable
    .groupBy("url, userId")
    .select("url, userId, url.count as count");
Getting the User Information

Table userInfo = tableEnv.toTable(…, "name, id, …");

Table resultTable = urlClickCounts.join(userInfo)
    .where("userId = id && count > 10")
    .select("url, count, name, …");
The Final Step

class Result {
    public String url;
    public int count;
    public String name;
    …
}

DataSet<Result> set =
    tableEnv.toSet(resultTable, Result.class);
DataSet<Result> result =
    set.groupBy("url").reduceGroup(new ComplexOperation());

result.writeAsText("hdfs:///result");
env.execute();
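A hedged side note, not from the slides: for the Table-to-DataSet conversion above to map columns back onto fields by name, Result has to be usable as a Flink POJO, roughly meaning a public class with a public no-argument constructor and public fields (or getters/setters):

// Sketch of Result as a Flink-compatible POJO (field names match the Table columns)
public class Result {
    public String url;
    public int count;
    public String name;

    public Result() {}  // no-arg constructor required for POJO types
}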
API in a Nutshell
• Element-wise: map, flatMap, filter
• Group-wise: groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct
• Binary: join, coGroup, union, cross
• Iterations: iterate, iterateDelta
• Physical re-organization: rebalance, partitionByHash, sortPartition
• Streaming: window, windowMap, coMap, ...
A small sketch combining a few of these operators follows below.
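A minimal, self-contained sketch, not from the slides, exercising a few of the operator groups above on the Java DataSet API; the data and field layout are made up for illustration:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class OperatorSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (userId, url) click pairs and (userId, name) user records, in memory for the sketch
        DataSet<Tuple2<Integer, String>> clicks = env.fromElements(
            new Tuple2<>(1, "/foo"), new Tuple2<>(1, "/foo"), new Tuple2<>(2, "/bar"));
        DataSet<Tuple2<Integer, String>> users = env.fromElements(
            new Tuple2<>(1, "alice"), new Tuple2<>(2, "bob"));

        // Element-wise filter, then group-wise de-duplication over all tuple fields
        DataSet<Tuple2<Integer, String>> distinctClicks = clicks
            .filter(c -> c.f1.startsWith("/"))
            .distinct();

        // Binary join on the userId field (position 0 on both sides),
        // projecting the url from the left and the name from the right input
        distinctClicks.join(users)
            .where(0).equalTo(0)
            .projectFirst(1)
            .projectSecond(1)
            .print();   // prints the joined (url, name) pairs; depending on the Flink
                        // version this runs eagerly or needs a following env.execute()
    }
}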
What happens under the hood?

From Program to Dataflow
(Diagram: Flink Program → Dataflow Plan → Optimized Plan)

Distributed Execution
(Diagram: a Master handling orchestration and recovery; Workers handling memory management, serialization, streaming, and network transfer.)
Advanced Analysis: Website Recommendation

Going Further
• Log analysis result: which user visited which website how often
• Which other websites might they like?
• Recommendation by collaborative filtering
Collaborative Filtering
• Recommend items based on users with similar preferences
• Latent factor models capture underlying characteristics of items and preferences of users
• Predicted preference: $\hat{r}_{u,i} = x_u^{\mathsf{T}} y_i$
Matrix Factorization

$$\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^{\mathsf{T}} y_i \right)^2 + \lambda \left( \sum_u n_u \| x_u \|^2 + \sum_i n_i \| y_i \|^2 \right)$$

$$R \approx X^{\mathsf{T}} Y$$
Alternating least squares
• Iterative approximation:
  1. Fix X and optimize Y
  2. Fix Y and optimize X
• Communication and computation intensive
(Diagram: $R \approx X^{\mathsf{T}} Y$ with one factor matrix fixed while the other is solved for.)
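For reference, one standard closed form of the two alternating updates, added here as a hedged sketch since the slide only names the steps; it follows the weighted-lambda regularization implied by the $n_u$, $n_i$ factors in the objective above. $Y_u$ denotes the columns of $Y$ for items rated by user $u$ and $r_u$ the corresponding ratings (analogously $X_i$, $r_i$):

$$x_u \leftarrow \left( Y_u Y_u^{\mathsf{T}} + \lambda n_u I \right)^{-1} Y_u r_u \qquad
  y_i \leftarrow \left( X_i X_i^{\mathsf{T}} + \lambda n_i I \right)^{-1} X_i r_i$$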
Matrix Factorization Pipeline

val featureExtractor = HashingFT()
val factorizer = ALS()

val pipeline = featureExtractor.chain(factorizer)

val clickstreamDS =
  env.readCsvFile[(String, String, Int)](clickStreamData)

val parameters = ParameterMap()
  .add(HashingFT.NumFeatures, 1000000)
  .add(ALS.Iterations, 10)
  .add(ALS.NumFactors, 50)
  .add(ALS.Lambda, 1.5)

val factorization = pipeline.fit(clickstreamDS, parameters)

(Pipeline diagram: Clickstream Data → Hashing Feature Extractor → ALS → Matrix factorization)
Does it Scale?
• 40-node GCE cluster, highmem-8
• 10 ALS iterations with 50 latent factors
• Based on Spark MLlib’s implementation
• Scale of Netflix or Spotify
What Else Can You Do?
• Classification using SVMs: conversion goal prediction
• Clustering: visitor segmentation
• Multiple linear regression: visitor prediction
Closing

What Have You Seen?
• Flink is a general-purpose analytics system
• Highly expressive Table API
• Advanced analysis with Flink’s machine learning library
• Jobs are executed on a powerful distributed dataflow engine
Flink Roadmap for 2015
• Additions to Machine Learning library
• Streaming Machine Learning
• Support for interactive programs
• Optimization for Table API queries
• SQL on top of Table API

flink.apache.org
@ApacheFlink
Backup Slides

WordCount in DataSet API

case class Word(word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment()
val lines = env.readTextFile(...)

lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

env.execute()

Java and Scala APIs offer the same functionality.
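To illustrate that claim, a hedged sketch, not from the slides, of the same WordCount written against the Java DataSet API; the input and output paths are placeholders:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        env.readTextFile("hdfs:///input")
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split(" ")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .groupBy(0)   // group by the word
            .sum(1)       // sum up the frequencies
            .writeAsText("hdfs:///wordcounts");

        env.execute();
    }
}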
Log Analysis Code

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
TableEnvironment tableEnv = new TableEnvironment();

DataSet<String> log = env.readTextFile("hdfs:///log");

DataSet<Tuple2<String, Integer>> clicks = log.flatMap(
    new FlatMapFunction<String, Tuple2<String, Integer>>() {
        public void flatMap(String in, Collector<Tuple2<String, Integer>> out) {
            String[] parts = in.split("*magic regex*");
            if (parts[0].equals("click")) {
                out.collect(new Tuple2<>(parts[1], Integer.parseInt(parts[4])));
            }
        }
    });

Table clicksTable = tableEnv.toTable(clicks, "url, userId");

Table urlClickCounts = clicksTable
    .groupBy("url, userId")
    .select("url, userId, url.count as count");

Table userInfo = tableEnv.toTable(…, "name, id, …");

Table resultTable = urlClickCounts.join(userInfo)
    .where("userId = id && count > 10")
    .select("url, count, name, …");

DataSet<Result> result = tableEnv.toSet(resultTable, Result.class);
result.writeAsText("hdfs:///result");
env.execute();
Log Analysis Dataflow Graph
(Dataflow diagrams: the logical plan reads the Log, Maps it, aggregates the Users, Joins, Groups, and writes the Result; the optimized physical plan adds combine, partition, sort, and merge steps.)
Pipelined Execution
• Only 1 stage (depending on join strategy)
• Data transfer in-memory, and to disk if needed
• Note: intermediate DataSets are not necessarily “created”!
Editor's Notes
• #3: Engine is Batch or Streaming
• #7: Works also with the Scala API
• #15: Visualization of program to plan to optimized plan to JobGraph. What you see is not what you get.
• #16: Pipelined Execution
• #27: Algorithms: decision trees and random forests, PCA, CCA. More transformers: scaler, centering, whitening, feature extractor, count vectorizer, outlier detector. Support for cross validation. Improved pipeline support: automatic pre- and post-processing pipeline. SAMOA support: pending PR which will be merged with the upcoming milestone release. Integration with Zeppelin, an IPython-Notebook-like web interface for explorative data analysis.
• #34: Visualization of JobGraph to ExecutionGraph