Vasia Kalavri
Flink committer & PhD student @KTH
vasia@apache.org
@vkalavri
Large-Scale Graph Processing
with Apache Flink
GraphDevroom
FOSDEM ‘15
Overview
● What is Apache Flink?
● Why Graph Processing with Flink:
○ user perspective
○ system perspective
● Gelly: the upcoming Flink Graph API
● Example: Music Profiles
Apache Flink
quick intro
What is Apache Flink?
● Large-scale data processing engine
● Java and Scala APIs
● Batch and Streaming Analytics
● Runs locally, on your cluster, on YARN
● Performs well even when memory runs out
The growing Flink stack
● APIs: Scala API (batch and streaming), Java API (batch and streaming), Python API (upcoming), Graph API, Apache MRQL
● Common API
● Flink Optimizer, Flink Stream Builder
● Flink Local Runtime
● Execution environments:
○ Embedded environment (Java collections)
○ Local Environment (for debugging)
○ Remote environment (Regular cluster execution)
○ Apache Tez
● Deployment: single node execution, standalone or YARN cluster
● Data storage: HDFS, Files, S3, JDBC, Redis, RabbitMQ, Kafka, Azure tables, …
Available Transformations
● map, flatMap
● filter
● reduce, reduceGroup
● join
● coGroup
● aggregate
● cross
● project
● distinct
● union
● iterate
● iterateDelta
● ...
Word Count
Java
DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
    // split each line on non-word characters and emit (token, 1)
    .flatMap((str, out) -> {
        for (String token : str.split("\\W+")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)          // group by the token
    .aggregate(SUM, 1);  // sum the counts
Scala
val text = env.readTextFile(input)
val words = text flatMap { line => line.split("\\W+") } map { word => (word, 1) }
val counts = words groupBy(0) sum(1)
Why Graph Processing
with Flink?
user perspective
Typical graph data analysis pipeline
[Figure: pipeline diagram with the stages load, clean, transform, create graph, analyze graph and result, drawn across three input branches]
A more realistic pipeline
[Figure: pipeline diagram with the stages load, clean, transform, create graph, analyze graph and result, drawn across three input branches]
Often, it’s not easy to get the graph properties and the analysis algorithm right the first time!
A more user-friendly pipeline
[Figure: simplified pipeline diagram showing only three load stages and the result]
General-purpose or specialized?
general-purpose
- fast application development and deployment
- easier maintenance
- non-intuitive APIs
specialized
- time-consuming
- use, configure and integrate different systems
- hard to maintain
- rich APIs and features
what about performance?
Why Graph Processing
with Flink?
system perspective
Efficient Iterations
● Flink supports iterations natively
○ the runtime is aware of the iterative execution
○ no scheduling overhead between iterations
○ caching and state maintenance are handled automatically
Flink Iteration Operators
● Iterate: Input → Iterative Update Function → Result; the result replaces the input of the next iteration (Replace)
● IterateDelta: Workset → Iterative Update Function → Result; the state is kept in the Solution Set
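The difference between the two operators can be sketched outside Flink. The following is a plain-Java illustration of the workset / solution-set idea behind IterateDelta, not the Flink API: the class and method names are hypothetical, and the min-label computation on an undirected graph is chosen only as a small example. The solution set holds the state for all vertices, while the workset holds only the vertices that changed in the previous step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of a delta iteration; hypothetical names, not the Flink API. */
final class DeltaIterationSketch {

    /** Labels every vertex with the smallest vertex id in its (undirected) component. */
    static Map<Integer, Integer> minLabels(List<int[]> edges) {
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        for (int[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        // Solution set: the state that survives across iterations (vertex -> label).
        Map<Integer, Integer> solutionSet = new TreeMap<>();
        // Workset: only the vertices whose label changed in the previous step.
        Deque<Integer> workset = new ArrayDeque<>();
        for (Integer v : neighbors.keySet()) {
            solutionSet.put(v, v);
            workset.add(v);
        }
        while (!workset.isEmpty()) {
            Deque<Integer> nextWorkset = new ArrayDeque<>();
            for (Integer v : workset) {
                int label = solutionSet.get(v);
                for (Integer n : neighbors.get(v)) {
                    if (label < solutionSet.get(n)) {
                        solutionSet.put(n, label); // update the state in place
                        nextWorkset.add(n);        // only changed vertices are revisited
                    }
                }
            }
            workset = nextWorkset;
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        List<int[]> edges = List.of(new int[] {1, 2}, new int[] {2, 3}, new int[] {4, 5});
        System.out.println(minLabels(edges)); // prints {1=1, 2=1, 3=1, 4=4, 5=4}
    }
}
```

A bulk iteration (Iterate) would instead recompute the label of every vertex in every step, which is why the delta variant tends to do less work once most of the graph has converged.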
Flink Optimizer
● The optimizer selects an execution plan for a program
● Think of an AI system manipulating your program for you
Optimization of Iterative algorithms
● Caching loop-invariant data
● Pushing work “out of the loop”
● Maintaining state as index
Performance
● in-memory data streaming
● memory management
● serialization framework
Scalability
http://data-artisans.com/computing-recommendations-with-flink.html
Gelly
the upcoming Flink Graph API
Meet Gelly
● Java Graph API on top of Flink
● Initial version coming with Flink 0.9
● Can be seamlessly mixed with the standard Flink API
● Easily implement applications that use both record-based and graph-based analysis
Hello, Gelly!
In Gelly, a Graph is simply represented by a DataSet of Vertices and a DataSet of Edges:
// create a Graph from a DataSet of vertices and a DataSet of edges
Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env);
// alternatively, create it from edges only and initialize every vertex value with a mapper
Graph<String, Long, NullValue> graph = Graph.fromCollection(edges,
    new MapFunction<String, Long>() {
        public Long map(String value) {
            return 1L;
        }
    }, env);
Available Methods
● Graph Properties
○ getVertexIds
○ getEdgeIds
○ numberOfVertices
○ numberOfEdges
○ getDegrees
○ isWeaklyConnected
○ ...
● Transformations
○ map, filter, join
○ subgraph, union
○ reverse, undirected
○ ...
● Mutations
○ add vertex/edge
○ remove vertex/edge
Neighborhood Methods
- Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel
graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT);
[Figure: example graph shown twice with two sets of vertex values, (3, 4, 7, 4, 4) and (3, 9, 7, 4, 5)]
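To make the neighborhood reduction concrete, here is a minimal plain-Java sketch of what a minimum over the out-neighbors computes. It does not use Gelly; the class name, method name and sample graph are hypothetical and were made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of a min-reduce over out-neighbors; not the Gelly API. */
final class NeighborhoodSketch {

    /**
     * For every vertex that has at least one outgoing edge, returns the minimum
     * value found among the targets of those edges. Each edge is {source, target}.
     */
    static Map<Integer, Long> minOfOutNeighbors(Map<Integer, Long> vertexValues, List<int[]> edges) {
        Map<Integer, Long> result = new TreeMap<>();
        for (int[] e : edges) {
            long neighborValue = vertexValues.get(e[1]);
            // combine with the running minimum for the source vertex
            result.merge(e[0], neighborValue, Long::min);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> values = Map.of(1, 3L, 2, 9L, 3, 7L);
        List<int[]> edges = List.of(new int[] {1, 2}, new int[] {1, 3}, new int[] {2, 3});
        System.out.println(minOfOutNeighbors(values, edges)); // prints {1=7, 2=7}
    }
}
```

Vertex 3 has no outgoing edges in this sample, so it does not appear in the result.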
Graph Validation
● Validate a Graph according to given criteria
○ do the edge ids correspond to vertex ids?
○ are there duplicates?
○ is the graph bipartite?
edges = { (1, 2), (3, 4), (1, 5), (2, 3), (6, 5) }
vertices = { 1, 2, 3, 4, 5 }
graph = Graph.fromCollection(vertices, edges);
graph.validate(new InvalidVertexIdsValidator()); // false: edge (6, 5) refers to vertex 6, which is not in the vertex set
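The first criterion (do the edge ids correspond to vertex ids?) amounts to a membership check. The sketch below expresses it in plain Java using the edge and vertex sets from the slide; the class and method names are hypothetical and this is not how Gelly implements its validator.

```java
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of a vertex-id validity check; not the Gelly API. */
final class VertexIdValidationSketch {

    /** Returns true only if both endpoints of every edge are known vertex ids. */
    static boolean edgeIdsAreValid(Set<Integer> vertexIds, List<int[]> edges) {
        for (int[] e : edges) {
            if (!vertexIds.contains(e[0]) || !vertexIds.contains(e[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<Integer> vertices = Set.of(1, 2, 3, 4, 5);
        List<int[]> edges = List.of(new int[] {1, 2}, new int[] {3, 4}, new int[] {1, 5},
                new int[] {2, 3}, new int[] {6, 5});
        System.out.println(edgeIdsAreValid(vertices, edges)); // prints false
    }
}
```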
Vertex-centric Iterations
● Wraps the Flink Spargel (Pregel-like) API
● The user only implements two functions
○ VertexUpdateFunction
○ MessagingFunction
● Internally creates a delta iteration
Vertex-centric SSSP
// DistanceUpdater: VertexUpdateFunction
updateVertex(K key, Double value, MessageIterator msgs) {
    // find the smallest candidate distance among the incoming messages
    Double minDist = Double.MAX_VALUE;
    for (double msg : msgs) {
        if (msg < minDist)
            minDist = msg;
    }
    // update the vertex value only if a shorter distance was found
    if (value > minDist)
        setNewVertexValue(minDist);
}
// DistanceMessenger: MessagingFunction
sendMessages(K key, Double newDist) {
    // propagate the new distance along every outgoing edge
    for (Edge edge : getOutgoingEdges()) {
        sendMessageTo(edge.getTarget(), newDist + edge.getValue());
    }
}
shortestPaths = graph.runVertexCentricIteration(
    new DistanceUpdater(), new DistanceMessenger()).getVertices();
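The interplay of the two functions can be simulated in plain Java. The sketch below runs supersteps in which only vertices updated in the previous step send messages, mirroring the update and messaging functions above. It is an illustration under assumptions, not the Gelly or Spargel API: all names are hypothetical, and edge weights are assumed to be non-negative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Plain-Java simulation of vertex-centric SSSP; hypothetical names, not the Gelly API. */
final class SsspSketch {

    static final class WeightedEdge {
        final int source;
        final int target;
        final double weight;

        WeightedEdge(int source, int target, double weight) {
            this.source = source;
            this.target = target;
            this.weight = weight;
        }
    }

    /** Assumes non-negative weights; unreachable vertices keep an infinite distance. */
    static Map<Integer, Double> shortestPaths(List<WeightedEdge> edges, int sourceVertex) {
        Map<Integer, Double> distance = new TreeMap<>();
        for (WeightedEdge e : edges) {
            distance.putIfAbsent(e.source, Double.POSITIVE_INFINITY);
            distance.putIfAbsent(e.target, Double.POSITIVE_INFINITY);
        }
        distance.put(sourceVertex, 0.0);
        Set<Integer> updated = new HashSet<>();
        updated.add(sourceVertex);

        while (!updated.isEmpty()) {
            // Messaging phase: updated vertices send (distance + weight) to their targets.
            Map<Integer, Double> minMessage = new HashMap<>();
            for (WeightedEdge e : edges) {
                if (updated.contains(e.source)) {
                    minMessage.merge(e.target, distance.get(e.source) + e.weight, Double::min);
                }
            }
            // Update phase: a vertex changes only if a message beats its current distance.
            Set<Integer> next = new HashSet<>();
            for (Map.Entry<Integer, Double> m : minMessage.entrySet()) {
                if (distance.get(m.getKey()) > m.getValue()) {
                    distance.put(m.getKey(), m.getValue());
                    next.add(m.getKey());
                }
            }
            updated = next;
        }
        return distance;
    }

    public static void main(String[] args) {
        List<WeightedEdge> edges = List.of(new WeightedEdge(1, 2, 1.0),
                new WeightedEdge(2, 3, 2.0), new WeightedEdge(1, 3, 5.0));
        System.out.println(shortestPaths(edges, 1)); // prints {1=0.0, 2=1.0, 3=3.0}
    }
}
```

The loop stops when no vertex changed in a superstep, which corresponds to an empty workset in the underlying delta iteration.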
Library of Algorithms
● PageRank
● Single Source Shortest Paths
● Label Propagation
● Weakly Connected Components
Example
User Music Profiles
Music Profiles
Problem Description
Input:
● <userId, songId, playCount> triplets
● a set of bad records (not to be trusted)
Tasks:
1. filter out bad records
2. compute the top song per user (most listened to)
3. create a user-user similarity graph based on common songs
4. detect communities on the similarity graph
1. Filter out bad records
/** Read <userID>\t<songID>\t<playcount> triplets */
DataSet<Tuple3> triplets = getTriplets();
/** Read the bad records songIDs */
DataSet<Tuple1> mismatches = getMismatches();
/** Filter out the mismatches from the triplets dataset */
DataSet<Tuple3> validTriplets = triplets.coGroup(mismatches).where(1).equalTo(0)
    .with(new CoGroupFunction() {
        public void coGroup(Iterable triplets, Iterable invalidSongs, Collector out) {
            // keep a song's triplets only if the song has no match among the bad records
            if (!invalidSongs.iterator().hasNext())
                for (Tuple3 triplet : triplets) // this is a valid triplet
                    out.collect(triplet);
        }
    });
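The coGroup above behaves like an anti-join on the song id: a triplet survives only if its song has no entry among the bad records. A minimal plain-Java sketch of the same logic, with hypothetical class names and made-up sample ids (no Flink involved), could look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of the anti-join used to drop bad records; not the Flink API. */
final class FilterBadRecordsSketch {

    static final class Triplet {
        final String userId;
        final String songId;
        final int playCount;

        Triplet(String userId, String songId, int playCount) {
            this.userId = userId;
            this.songId = songId;
            this.playCount = playCount;
        }
    }

    /** Keeps only the triplets whose song id is not listed among the bad records. */
    static List<Triplet> validTriplets(List<Triplet> triplets, Set<String> badSongIds) {
        List<Triplet> valid = new ArrayList<>();
        for (Triplet t : triplets) {
            if (!badSongIds.contains(t.songId)) {
                valid.add(t);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<Triplet> triplets = List.of(new Triplet("u1", "s1", 3),
                new Triplet("u1", "s2", 10), new Triplet("u2", "s1", 1));
        System.out.println(validTriplets(triplets, Set.of("s2")).size()); // prints 2
    }
}
```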
2a. Compute top song per user
/** Create a user -> song weighted bipartite graph where the edge weights correspond to play counts */
Graph userSongGraph = Graph.fromTupleDataSet(validTriplets, env);
/** Get the top track (most listened) for each user */
DataSet<Tuple2> usersWithTopTrack = userSongGraph
    .reduceOnEdges(new GetTopSongPerUser(), EdgeDirection.OUT);
[Figure: user Tom connected to the songs “I like birds”, “elephant woman” and “red morning”; edge labels: 323 plays, 18 plays, 42 plays]
2b. Compute top song per user
class GetTopSongPerUser implements EdgesFunctionWithVertexValue {
    // returns (user id, id of the song on the outgoing edge with the highest play count)
    public Tuple2 iterateEdges(Vertex vertex, Iterable<Edge> edges) {
        int maxPlaycount = 0;
        String topSong = "";
        for (Edge edge : edges) {
            if (edge.getValue() > maxPlaycount) {
                maxPlaycount = edge.getValue();
                topSong = edge.getTarget();
            }
        }
        return new Tuple2(vertex.getId(), topSong);
    }
}
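Stripped of the graph machinery, the reduction keeps, for each user, the song with the largest play count. The following plain-Java sketch shows that logic on made-up sample data; the class names and ids are hypothetical and the code does not use Gelly.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of "top song per user"; not the Gelly API. */
final class TopSongSketch {

    static final class Play {
        final String userId;
        final String songId;
        final int playCount;

        Play(String userId, String songId, int playCount) {
            this.userId = userId;
            this.songId = songId;
            this.playCount = playCount;
        }
    }

    /** Maps each user to the song with the highest play count (first one wins on ties). */
    static Map<String, String> topSongPerUser(List<Play> plays) {
        Map<String, Integer> bestCount = new HashMap<>();
        Map<String, String> bestSong = new TreeMap<>();
        for (Play p : plays) {
            Integer current = bestCount.get(p.userId);
            if (current == null || p.playCount > current) {
                bestCount.put(p.userId, p.playCount);
                bestSong.put(p.userId, p.songId);
            }
        }
        return bestSong;
    }

    public static void main(String[] args) {
        List<Play> plays = List.of(new Play("u1", "s1", 3),
                new Play("u1", "s2", 10), new Play("u2", "s1", 1));
        System.out.println(topSongPerUser(plays)); // prints {u1=s2, u2=s1}
    }
}
```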
user-song to user-user graph
[Figure: the bipartite user-song graph (users Tom, Steve, Wendy, Emily; songs “red morning”, “I like birds”, “get lucky”, “disorder”, “elephant woman”) next to the user-user graph derived from it]
3. Create a user-user similarity graph
/** Create a user-user similarity graph:
two users that listen to the same song are connected */
DataSet<Edge> similarUsers = userSongGraph.getEdges().groupBy(1)
    .reduceGroup(new GroupReduceFunction() {
        public void reduce(Iterable<Edge> edges, Collector<Edge> out) {
            // collect all users that listened to this song
            List users = new ArrayList();
            for (Edge edge : edges)
                users.add(edge.getSource());
            // emit one edge for every pair of users in the group
            for (int i = 0; i < users.size() - 1; i++)
                for (int j = i + 1; j < users.size(); j++)
                    out.collect(new Edge(users.get(i), users.get(j)));
        }
    }).distinct();
Graph similarUsersGraph = Graph.fromDataSet(similarUsers, env).getUndirected();
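The pair generation is the part that is easiest to get wrong: the inner index has to run up to the last user in the group, otherwise pairs involving that user are lost. The plain-Java sketch below shows the grouping and pairing on made-up data. The names are hypothetical, and encoding each pair as an "a|b" string is an illustrative choice only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java illustration of building user-user pairs per song; not the Gelly API. */
final class SimilarUsersSketch {

    /**
     * Takes a map from song id to the users who played it and returns every pair of
     * users sharing at least one song, encoded as "a|b" with a before b alphabetically.
     */
    static Set<String> similarUserPairs(Map<String, List<String>> listenersBySong) {
        Set<String> pairs = new TreeSet<>(); // a set, so duplicate pairs collapse
        for (List<String> listeners : listenersBySong.values()) {
            List<String> users = new ArrayList<>(new TreeSet<>(listeners));
            for (int i = 0; i < users.size() - 1; i++) {
                for (int j = i + 1; j < users.size(); j++) { // note: up to the last user
                    pairs.add(users.get(i) + "|" + users.get(j));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> listeners = Map.of(
                "s1", List.of("tom", "steve", "wendy"),
                "s2", List.of("emily"));
        System.out.println(similarUserPairs(listeners)); // prints [steve|tom, steve|wendy, tom|wendy]
    }
}
```

Three users on the same song yield three pairs; a song with a single listener contributes none.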
4. Cluster similar users
/** Detect user communities using label propagation */
// Initialize each vertex with a unique numeric label
DataSet<Tuple2> idsWithLabels = similarUsersGraph
    .getVertices().reduceGroup(new AssignInitialLabel());
// update the vertex values and run the label propagation algorithm
DataSet<Vertex> verticesWithCommunity = similarUsersGraph
    .joinWithVertices(idsWithLabels, new MapFunction() {
        public Long map(Tuple2 idWithLabel) {
            return (Long) idWithLabel.f1;
        }
    }).run(new LabelPropagation(numIterations)).getVertices();
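As a rough idea of what label propagation does, the sketch below lets every vertex repeatedly adopt the most frequent label among its neighbors. It is a plain-Java illustration under assumptions and not the Gelly library implementation: the names are hypothetical, updates are synchronous, ties are broken in favor of the smallest label, and the adjacency map is assumed to list every vertex as a key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of label propagation; hypothetical names, not the Gelly API. */
final class LabelPropagationSketch {

    /**
     * Starts with label = vertex id and, in each round, gives every vertex the most
     * frequent label among its neighbors (smallest label on ties). Stops early when
     * a round changes nothing.
     */
    static Map<Integer, Integer> propagate(Map<Integer, List<Integer>> neighbors, int maxIterations) {
        Map<Integer, Integer> labels = new TreeMap<>();
        for (Integer v : neighbors.keySet()) {
            labels.put(v, v);
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            Map<Integer, Integer> next = new TreeMap<>(labels);
            for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
                // count how often each label occurs among the neighbors
                Map<Integer, Integer> counts = new TreeMap<>();
                for (Integer n : entry.getValue()) {
                    counts.merge(labels.get(n), 1, Integer::sum);
                }
                int best = labels.get(entry.getKey());
                int bestCount = 0;
                // ascending label order plus a strict comparison: smallest label wins ties
                for (Map.Entry<Integer, Integer> c : counts.entrySet()) {
                    if (c.getValue() > bestCount) {
                        best = c.getKey();
                        bestCount = c.getValue();
                    }
                }
                next.put(entry.getKey(), best);
            }
            if (next.equals(labels)) {
                break;
            }
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        // two disconnected triangles: {1, 2, 3} and {4, 5, 6}
        Map<Integer, List<Integer>> neighbors = Map.of(
                1, List.of(2, 3), 2, List.of(1, 3), 3, List.of(1, 2),
                4, List.of(5, 6), 5, List.of(4, 6), 6, List.of(4, 5));
        System.out.println(propagate(neighbors, 5)); // prints {1=1, 2=1, 3=1, 4=4, 5=4, 6=4}
    }
}
```

Vertices that end up with the same label form one community; the iteration cap matters because synchronous updates are not guaranteed to settle on every graph.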
Music Profiles Recap
● Filter out bad records : record API
● Create user-song graph : record API
● Top song per user : Gelly
● Create user-user graph : record API
● Cluster users : Gelly
What’s next, Gelly?
● Gather-Sum-Apply
● Scala API
● More library methods
○ Clustering Coefficient
○ Minimum Spanning Tree
● Integration with the Flink Streaming API
● Specialized Operators for Skewed Graphs
Keep in touch!
● Gelly development repository
http://github.com/project-flink/flink-graph
● Apache Flink mailing lists
http://flink.apache.org/community.html#mailing-lists
● Follow @ApacheFlink
