Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15

Vasia Kalavri
Flink committer & PhD student @KTH
vasia@apache.org
@vkalavri

Overview
● What is Apache Flink?
● Why Graph Processing with Flink:
○ user perspective
○ system perspective
● Gelly: the upcoming Flink Graph API
● Example: Music Profiles

What is Apache Flink?
● Large-scale data processing engine
● Java and Scala APIs
● Batch and Streaming Analytics
● Runs locally, on your cluster, on YARN
● Performs well even when memory runs out

The growing Flink stack
[Figure: layers of the Flink stack]
● Scala API (batch and streaming), Java API (batch and streaming), Python API (upcoming), Graph API, Apache MRQL
● Common API, Flink Optimizer, Flink Stream Builder
● Embedded environment (Java collections), Local environment (for debugging), Remote environment (regular cluster execution), Apache Tez
● Flink Local Runtime; single-node execution; standalone or YARN cluster
● Data storage: HDFS, Files, S3, JDBC, Redis, RabbitMQ, Kafka, Azure tables, …

Available Transformations
● map, flatMap
● filter
● reduce, reduceGroup
● join
● coGroup
● aggregate
● cross
● project
● distinct
● union
● iterate
● iterateDelta
● ...

Word Count

Java
DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        // split each line on runs of non-word characters
        for (String token : str.split("\\W+")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)
    .aggregate(SUM, 1);

Scala
val text = env.readTextFile(input)
val words = text.flatMap { line => line.split("\\W+") }
                .map { word => (word, 1) }
val counts = words.groupBy(0).sum(1)

Why Graph Processing
with Flink?
user perspective

Typical graph data analysis pipeline
[Figure: load → clean → create graph → analyze graph → result; additional inputs each go through load → clean → transform]

A more realistic pipeline
[Figure: load → clean → create graph → analyze graph, with an additional load → clean → transform input]
Often, it’s not easy to get the graph properties and the analysis algorithm right the first time!

A more user-friendly pipeline
[Figure: several load steps feeding a single result]

General-purpose or specialized?
● general-purpose
○ fast application development and deployment
○ easier maintenance
○ non-intuitive APIs
● specialized
○ time-consuming
○ use, configure and integrate different systems
○ hard to maintain
○ rich APIs and features
What about performance?

Why Graph Processing
with Flink?
system perspective

Efficient Iterations
● Flink supports iterations natively
○ the runtime is aware of the iterative execution
○ no scheduling overhead between iterations
○ caching and state maintenance are handled automatically

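Flink Iteration Operators
[Figure: dataflow of the two iteration operators]
● Iterate: Input → Iterative Update Function → Result; the result replaces the input of the next iteration
● IterateDelta: Workset → Iterative Update Function → Result; state is kept in the Solution Set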
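The workset / solution set idea behind IterateDelta can be sketched in plain Java, without any Flink classes. The class and method names below (DeltaIterationSketch, minLabels, sampleGraph) are hypothetical and only illustrate the pattern: the solution set holds the state, and each superstep only processes the vertices whose value changed in the previous one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch of a delta iteration (hypothetical names, no Flink API). */
class DeltaIterationSketch {

    /** Small undirected example graph with two components: {1, 2, 3} and {4, 5}. */
    static Map<Integer, List<Integer>> sampleGraph() {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(1, Arrays.asList(2));
        g.put(2, Arrays.asList(1, 3));
        g.put(3, Arrays.asList(2));
        g.put(4, Arrays.asList(5));
        g.put(5, Arrays.asList(4));
        return g;
    }

    /** Propagates the smallest vertex id through each connected component. */
    static Map<Integer, Integer> minLabels(Map<Integer, List<Integer>> neighbors) {
        // Solution set: state kept across supersteps (vertex -> current label).
        Map<Integer, Integer> solutionSet = new HashMap<>();
        for (Integer v : neighbors.keySet()) {
            solutionSet.put(v, v);
        }
        // Workset: vertices that changed and still have to notify their neighbors.
        Set<Integer> workset = new HashSet<>(neighbors.keySet());
        while (!workset.isEmpty()) {
            Set<Integer> nextWorkset = new HashSet<>();
            for (int v : workset) {
                int label = solutionSet.get(v);
                for (int n : neighbors.get(v)) {
                    if (label < solutionSet.get(n)) {
                        solutionSet.put(n, label); // update the state in place
                        nextWorkset.add(n);        // only changed vertices stay active
                    }
                }
            }
            workset = nextWorkset; // the iteration ends when nothing changed
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        System.out.println(minLabels(sampleGraph()));
    }
}
```

A bulk Iterate would instead recompute every vertex in every pass; the delta variant touches less data as the computation converges.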

Flink Optimizer
● The optimizer selects an execution plan for a program
● Think of an AI system manipulating your program for you

Optimization of Iterative algorithms
● Caching loop-invariant data
● Pushing work “out of the loop”
● Maintaining state as an index

Performance
● in-memory data streaming
● memory management
● serialization framework

Scalability
http://data-artisans.com/computing-recommendations-with-flink.html

Gelly
the upcoming Flink Graph API

Meet Gelly
● Java Graph API on top of Flink
● Initial version coming with Flink 0.9
● Can be seamlessly mixed with the standard Flink API
● Easily implement applications that use both record-based and graph-based analysis

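Hello, Gelly!
In Gelly, a Graph is simply represented by a DataSet of Vertices and a DataSet of Edges:

// create a Graph from a DataSet of vertices and a DataSet of edges
Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env);

// create a Graph from a collection of edges;
// the MapFunction maps each vertex id to its initial vertex value
Graph<String, Long, NullValue> graph = Graph.fromCollection(edges,
    new MapFunction<String, Long>() {
        public Long map(String value) {
            return 1L;
        }
    }, env);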

Available Methods
● Graph Properties
○ getVertexIds
○ getEdgeIds
○ numberOfVertices
○ numberOfEdges
○ getDegrees
○ isWeaklyConnected
○ ...
● Transformations
○ map, filter, join
○ subgraph, union
○ reverse, undirected
○ ...
● Mutations
○ add vertex/edge
○ remove vertex/edge

Neighborhood Methods
● Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel
graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT);
[Figure: example graph showing vertex values (3, 9, 7, 4, 5) and (3, 4, 7, 4, 4) around the reduceOnNeighbors call]
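To make the neighborhood reduce concrete, here is a plain-Java sketch that mimics the idea without using Gelly. The names NeighborReduceSketch and reduceOnOutNeighbors are hypothetical, and the sample graph in main is made up for illustration: for every vertex, the values of its out-neighbors are combined with a reduce function (here, the minimum).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Plain-Java sketch of a neighborhood reduce (hypothetical names, no Gelly API). */
class NeighborReduceSketch {

    /**
     * For each vertex with at least one outgoing edge, combines the values of
     * its out-neighbors with the given reducer. Each edge is {source, target}.
     */
    static Map<Integer, Long> reduceOnOutNeighbors(
            Map<Integer, Long> values, int[][] edges, BinaryOperator<Long> reducer) {
        Map<Integer, Long> result = new HashMap<>();
        for (int[] e : edges) {
            long neighborValue = values.get(e[1]);      // value of the edge target
            result.merge(e[0], neighborValue, reducer); // fold it into the source's result
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> values = new HashMap<>();
        values.put(1, 3L);
        values.put(2, 9L);
        values.put(3, 7L);
        values.put(4, 4L);
        values.put(5, 5L);
        int[][] edges = {{1, 2}, {1, 3}, {2, 4}, {5, 4}};
        // Minimum over out-neighbors, in the spirit of MinValue on the slide.
        System.out.println(reduceOnOutNeighbors(values, edges, Long::min));
    }
}
```

Vertices without outgoing edges do not appear in the result of this sketch; whether Gelly reports them is not stated on the slide.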

Graph Validation
● Validate a Graph according to given criteria
○ do the edge ids correspond to vertex ids?
○ are there duplicates?
○ is the graph bipartite?
edges = { (1, 2), (3, 4), (1, 5), (2, 3), (6, 5) }
vertices = { 1, 2, 3, 4, 5 }
graph = Graph.fromCollection(vertices, edges);
// edge (6, 5) refers to vertex 6, which is not in the vertex set
graph.validate(new InvalidVertexIdsValidator()); // false
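The first criterion (do the edge ids correspond to vertex ids?) can be sketched in plain Java as follows. VertexIdValidationSketch and edgeIdsAreValid are hypothetical names; this is not how InvalidVertexIdsValidator is implemented, only an illustration of the check on the slide's data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Plain-Java sketch of an edge-id validity check (hypothetical names). */
class VertexIdValidationSketch {

    /** Returns true only if both endpoints of every edge are known vertex ids. */
    static boolean edgeIdsAreValid(Set<Integer> vertexIds, int[][] edges) {
        for (int[] e : edges) {
            if (!vertexIds.contains(e[0]) || !vertexIds.contains(e[1])) {
                return false; // an endpoint is missing from the vertex set
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<Integer> vertices = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        int[][] edges = {{1, 2}, {3, 4}, {1, 5}, {2, 3}, {6, 5}};
        // Vertex 6 is not in the vertex set, so the check fails.
        System.out.println(edgeIdsAreValid(vertices, edges));
    }
}
```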

Vertex-centric Iterations
● Wraps the Flink Spargel (Pregel-like) API
● The user only implements two functions
○ VertexUpdateFunction
○ MessagingFunction
● Internally creates a delta iteration

Vertex-centric SSSP

// DistanceUpdater: VertexUpdateFunction
// keep the smallest distance received, if it improves the current value
updateVertex(K key, Double value, MessageIterator msgs) {
    Double minDist = Double.MAX_VALUE;
    for (double msg : msgs) {
        if (msg < minDist)
            minDist = msg;
    }
    if (value > minDist)
        setNewVertexValue(minDist);
}

// DistanceMessenger: MessagingFunction
// propagate the new distance along all outgoing edges
sendMessages(K key, Double newDist) {
    for (Edge edge : getOutgoingEdges()) {
        sendMessageTo(edge.getTarget(), newDist + edge.getValue());
    }
}

shortestPaths = graph.runVertexCentricIteration(
    new DistanceUpdater(), new DistanceMessenger()).getVertices();
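The two functions above can be simulated superstep by superstep in plain Java. This is only a sketch under stated assumptions: SsspSketch and shortestPaths are hypothetical names, vertices are numbered 0..n-1, each edge is an array {source, target, weight}, and weights are non-negative so the loop terminates. It does not use the Spargel or Gelly API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java simulation of vertex-centric SSSP (hypothetical names). */
class SsspSketch {

    /** edges[i] = {source, target, weight}; returns the distance of each vertex 0..n-1. */
    static double[] shortestPaths(int n, double[][] edges, int source) {
        double[] dist = new double[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[source] = 0.0;
        Set<Integer> updated = new HashSet<>();
        updated.add(source);
        while (!updated.isEmpty()) {
            // Messaging phase: vertices updated in the previous superstep send
            // (their distance + edge weight) along their outgoing edges.
            Map<Integer, List<Double>> inbox = new HashMap<>();
            for (double[] e : edges) {
                int src = (int) e[0];
                int target = (int) e[1];
                if (updated.contains(src)) {
                    inbox.computeIfAbsent(target, k -> new ArrayList<>()).add(dist[src] + e[2]);
                }
            }
            // Update phase: each receiver keeps the smallest message if it improves its value.
            Set<Integer> next = new HashSet<>();
            for (Map.Entry<Integer, List<Double>> entry : inbox.entrySet()) {
                double minDist = Double.MAX_VALUE;
                for (double msg : entry.getValue()) {
                    if (msg < minDist) {
                        minDist = msg;
                    }
                }
                int v = entry.getKey();
                if (dist[v] > minDist) {
                    dist[v] = minDist;
                    next.add(v);
                }
            }
            updated = next; // vertices that did not change stay silent
        }
        return dist;
    }

    public static void main(String[] args) {
        double[][] edges = {{0, 1, 1.0}, {1, 2, 2.0}, {0, 2, 5.0}, {2, 3, 1.0}};
        System.out.println(Arrays.toString(shortestPaths(4, edges, 0)));
    }
}
```

Only vertices whose distance changed send messages in the next superstep, which matches the delta-iteration behaviour described for vertex-centric iterations.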

Library of Algorithms
● PageRank
● Single Source Shortest Paths
● Label Propagation
● Weakly Connected Components

Problem Description
Input:
● <userId, songId, playCount> triplets
● a set of bad records (not to be trusted)
Tasks:
1. filter out bad records
2. compute the top song per user (most listened to)
3. create a user-user similarity graph based on common songs
4. detect communities on the similarity graph

1. Filter out bad records
/** Read <userID>\t<songID>\t<playcount> triplets */
DataSet<Tuple3> triplets = getTriplets();
/** Read the bad records songIDs */
DataSet<Tuple1> mismatches = getMismatches();
/** Filter out the mismatches from the triplets dataset */
DataSet<Tuple3> validTriplets = triplets.coGroup(mismatches).where(1).equalTo(0)
    .with(new CoGroupFunction() {
        void coGroup(Iterable<Tuple3> triplets, Iterable<Tuple1> invalidSongs, Collector out) {
            // keep the triplets only if no bad record matches their songID
            if (!invalidSongs.iterator().hasNext())
                for (Tuple3 triplet : triplets) // this is a valid triplet
                    out.collect(triplet);
        }
    });
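The coGroup above acts as an anti-join on the song id: a triplet survives only if no bad record shares its songID. A plain-Java sketch of the same filter, with hypothetical names (FilterBadRecordsSketch, validTriplets) and made-up sample ids, could look like this:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Plain-Java sketch of the bad-record filter (hypothetical names, no Flink API). */
class FilterBadRecordsSketch {

    /** Each triplet is {userId, songId, playCount}; keeps those whose songId is not flagged. */
    static List<String[]> validTriplets(List<String[]> triplets, Set<String> badSongIds) {
        return triplets.stream()
                .filter(t -> !badSongIds.contains(t[1])) // t[1] is the songId
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> triplets = Arrays.asList(
                new String[] {"u1", "s1", "3"},
                new String[] {"u2", "s2", "5"},
                new String[] {"u1", "s2", "1"});
        Set<String> bad = Collections.singleton("s2");
        for (String[] t : validTriplets(triplets, bad)) {
            System.out.println(Arrays.toString(t));
        }
    }
}
```

The in-memory set lookup is a simplification; in the distributed setting the coGroup lets both datasets be partitioned by song id instead of broadcasting the bad records.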

2a. Compute top song per user
/** Create a user -> song weighted bipartite graph where the edge weights correspond
to play counts */
Graph userSongGraph = Graph.fromTupleDataSet(validTriplets, env);
/** Get the top track (most listened) for each user */
DataSet<Tuple2> usersWithTopTrack = userSongGraph
    .reduceOnEdges(new GetTopSongPerUser(), EdgeDirection.OUT);
[Figure: user Tom connected to the songs “I like birds”, “elephant woman” and “red morning”, with edges weighted by play counts (323, 18 and 42 plays)]
2b. Compute top song per user
class GetTopSongPerUser implements EdgesFunctionWithVertexValue {
    // returns a (user, top song) pair for the given user vertex
    Tuple2 iterateEdges(Vertex vertex, Iterable<Edge> edges) {
        int maxPlaycount = 0;
        String topSong = "";
        for (Edge edge : edges) {
            // the edge value is the play count, the edge target is the song
            if (edge.getValue() > maxPlaycount) {
                maxPlaycount = edge.getValue();
                topSong = edge.getTarget();
            }
        }
        return new Tuple2(vertex.getId(), topSong);
    }
}

user-song to user-user graph
[Figure: the bipartite user-song graph (users Tom, Steve, Wendy, Emily; songs “red morning”, “I like birds”, “get lucky”, “disorder”, “elephant woman”) turned into a user-user graph over Tom, Steve, Wendy and Emily]

3. Create a user-user similarity graph
/** Create a user-user similarity graph:
two users that listen to the same song are connected */
DataSet<Edge> similarUsers = userSongGraph.getEdges().groupBy(1)
    .reduceGroup(new GroupReduceFunction() {
        void reduce(Iterable<Edge> edges, Collector<Edge> out) {
            // collect all users that listened to this song
            List users = new ArrayList();
            for (Edge edge : edges)
                users.add(edge.getSource());
            // emit one edge per pair of users; j must reach the last user
            for (int i = 0; i < users.size() - 1; i++)
                for (int j = i + 1; j < users.size(); j++)
                    out.collect(new Edge(users.get(i), users.get(j)));
        }
    }).distinct();
Graph similarUsersGraph = Graph.fromDataSet(similarUsers).getUndirected();
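The nested loop in the reduce function enumerates every unordered pair of users who listened to the same song, so n listeners yield n(n-1)/2 edges. A small plain-Java sketch of just that pairing step, with hypothetical names (UserPairsSketch, pairs), shows the loop bounds in isolation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of the user pairing step (hypothetical names). */
class UserPairsSketch {

    /** Returns every unordered pair {users[i], users[j]} with i < j. */
    static List<String[]> pairs(List<String> users) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < users.size() - 1; i++) {
            // j runs up to and including the last user
            for (int j = i + 1; j < users.size(); j++) {
                out.add(new String[] {users.get(i), users.get(j)});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] pair : pairs(Arrays.asList("Tom", "Steve", "Wendy"))) {
            System.out.println(pair[0] + " - " + pair[1]);
        }
    }
}
```

For very popular songs the quadratic number of pairs can become large, which is one reason the deck lists skew handling as future work.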

4. Cluster similar users
/** Detect user communities using label propagation */
// Initialize each vertex with a unique numeric label
DataSet<Tuple2> idsWithLabels = similarUsersGraph
    .getVertices().reduceGroup(new AssignInitialLabel());
// Update the vertex values and run the label propagation algorithm
DataSet<Vertex> verticesWithCommunity = similarUsersGraph
    .joinWithVertices(idsWithLabels, new MapFunction() {
        public Long map(Tuple2 idWithLabel) {
            return idWithLabel.f1;
        }
    }).run(new LabelPropagation(numIterations)).getVertices();

Music Profiles Recap
● Filter out bad records : record API
● Create user-song graph : record API
● Top song per user : Gelly
● Create user-user graph : record API
● Cluster users : Gelly

What’s next, Gelly?
● Gather-Sum-Apply
● Scala API
● More library methods
○ Clustering Coefficient
○ Minimum Spanning Tree
● Integration with the Flink Streaming API
● Specialized Operators for Skewed Graphs

Keep in touch!
● Gelly development repository
http://github.com/project-flink/flink-graph
● Apache Flink mailing lists
http://flink.apache.org/community.html#mailing-lists
● Follow @ApacheFlink
