GRADOOP: Scalable Graph
Analytics with Apache Flink
Martin Junghanns
University of Leipzig
About the speaker and the team
• 2011 Bachelor of Engineering
  - Thesis: Partitioning of Dynamic Graphs
• 2014 Master of Science
  - Thesis: Graph Database Systems for Business Intelligence
• Now: PhD Student, Database Group, University of Leipzig
  - Distributed Systems
  - Distributed Graph Data Management
  - Graph Theory & Algorithms
• Professional Experience: sones GraphDB, SAP
Team:
• André, PhD Student
• Martin, PhD Student
• Kevin, M.Sc. Student
• Niklas, M.Sc. Student
Motivation
“Graphs are everywhere”
• Graph = (Vertices, Edges)
• Graph = (Users, Friendships)
  [Figure: social network graph with the vertices Alice, Bob, Eve, Dave, Carol, Mallory, Peggy]
• Graph = (Users, Followers)
  [Figure: follower graph with the vertices Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent]
• Graph = (Cities, Connections)
  [Figure: graph of German cities with population as a vertex property: Leipzig (544K), Dresden (536K), Berlin (3.5M), Hamburg (1.7M), Munich (1.4M), Chemnitz (243K), Nuremberg (500K), Cologne (1M)]
“Graphs are large”
• World Wide Web: ca. 1 billion websites
• Facebook: ca. 1.49 billion active users, ca. 340 friends per user
End-to-End Graph Analytics
Data Integration → Graph Analytics → Representation
• Data Integration: integrate data from one or more sources into a dedicated graph storage with a common graph data model
• Graph Analytics: define analytical workflows using an operator algebra
• Representation: present results in a meaningful way
Graph Data Management
                 | Graph Database Systems | Graph Processing Systems | Distributed Workflow Systems
Examples         | Neo4j, OrientDB        | Pregel, Giraph           | Flink Gelly, Spark GraphX
Data Model       | Rich Graph Models      | Generic Graph Models     | Generic Graph Models
Focus            | Local ACID Operations  | Global Graph Operations  | Global Data and Graph Operations
Query Language   | Yes                    | No                       | No
Persistency      | Yes                    | No                       | No
Scalability      | Vertical               | Horizontal               | Horizontal
Workflows        | No                     | No                       | Yes
Data Integration | No                     | No                       | No
Graph Analytics  | No                     | Yes                      | Yes
Representation   | Yes                    | No                       | No
What's missing?
An end-to-end framework and research platform
for efficient, distributed and domain-independent
graph data management and analytics.
Gradoop Architecture & Data Model
High Level Architecture
[Figure: layered architecture diagram. Components: HDFS/YARN Cluster; HBase Distributed Graph Store; Extended Property Graph Model; Flink Operator Implementations; Workflow Execution (Flink Operator Execution); Workflow Declaration (Visual, GrALa DSL); Data Integration; Graph Analytics; Representation. Arrows distinguish data flow from control flow.]
Extended Property Graph Model
Graph Operators
Type   | Operator         | GrALa notation
Binary | Combination      | graph.combine(otherGraph) : Graph
Binary | Overlap          | graph.overlap(otherGraph) : Graph
Binary | Exclusion        | graph.exclude(otherGraph) : Graph
Binary | Isomorphism      | graph.isIsomorphicTo(otherGraph) : Boolean
Unary  | Pattern Matching | graph.match(patternGraph, predicate) : Collection
Unary  | Aggregation      | graph.aggregate(propertyKey, aggregateFunction) : Graph
Unary  | Projection       | graph.project(vertexFunction, edgeFunction) : Graph
Unary  | Summarization    | graph.summarize(vertexGroupKeys, vertexAggregateFunction, edgeGroupKeys, edgeAggregateFunction) : Graph
Combination
personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
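The binary operators can be illustrated with a small Python sketch. It is not Gradoop code: the `Graph` class, the `graph` helper and the sample data are hypothetical, and the set semantics (union for combine, intersection for overlap, difference for exclude) are an assumption based on the operator names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """A logical graph reduced to vertex ids and (source, target) edges."""
    vertices: frozenset
    edges: frozenset

    def combine(self, other):
        # Assumed semantics: union of the vertex sets and of the edge sets.
        return Graph(self.vertices | other.vertices, self.edges | other.edges)

    def overlap(self, other):
        # Assumed semantics: vertices and edges contained in both graphs.
        return Graph(self.vertices & other.vertices, self.edges & other.edges)

    def exclude(self, other):
        # Assumed semantics: vertices found only in this graph, plus the
        # edges of this graph whose two endpoints both survive.
        kept = self.vertices - other.vertices
        kept_edges = frozenset(
            (s, t) for (s, t) in self.edges if s in kept and t in kept
        )
        return Graph(kept, kept_edges)


def graph(vertices, edges):
    """Hypothetical helper that builds a Graph from plain Python sets."""
    return Graph(frozenset(vertices), frozenset(edges))


# Three small sample graphs standing in for db.G[0], db.G[1], db.G[2].
g0 = graph({"Alice", "Bob"}, {("Alice", "Bob")})
g1 = graph({"Bob", "Carol"}, {("Bob", "Carol")})
g2 = graph({"Carol", "Dave"}, {("Carol", "Dave")})

person_graph = g0.combine(g1).combine(g2)
```

Because every operator returns a new `Graph`, calls chain in the same way as in the GrALa expression.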
Summarization
personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
// group vertices by type and city, edges by type
vertexGroupingKeys = {:type, "city"}
edgeGroupingKeys = {:type}
// store the group size as "count" on each summarized vertex and edge
vertexAggFunc = (Vertex vSum, Set vertices => vSum["count"] = |vertices|)
edgeAggFunc = (Edge eSum, Set edges => eSum["count"] = |edges|)
sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc,
                                 edgeGroupingKeys, edgeAggFunc)
Graph Collection Operators
Type       | Operator     | GrALa notation
Collection | Selection    | collection.select(predicate) : Collection
Collection | Distinct     | collection.distinct() : Collection
Collection | Sort by      | collection.sortBy(key, [:asc|:desc]) : Collection
Collection | Top          | collection.top(limit) : Collection
Collection | Union        | collection.union(otherCollection) : Collection
Collection | Intersection | collection.intersect(otherCollection) : Collection
Collection | Difference   | collection.difference(otherCollection) : Collection
Auxiliary  | Apply        | collection.apply(unaryGraphOperator) : Collection
Auxiliary  | Reduce       | collection.reduce(binaryGraphOperator) : Graph
Auxiliary  | Call         | [graph|collection].callFor[Graph|Collection](algorithm, parameters) : [Graph|Collection]
Selection
collection = <db.G[0],db.G[1],db.G[2]>
predicate = (Graph g => |g.V| > 3)
result = collection.select(predicate)
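A rough Python analogue of the selection example, together with the auxiliary operators apply and reduce. The dictionary layout of a graph, the function names and the sample data are hypothetical and only mirror the GrALa signatures; the union semantics of `combine` is an assumption.

```python
from functools import reduce

# Each graph is a dict with a vertex set "V" and an edge set "E"
# (hypothetical layout, chosen only for this sketch).
g0 = {"V": {1, 2, 3, 4}, "E": {(1, 2), (2, 3), (3, 4)}}
g1 = {"V": {4, 5}, "E": {(4, 5)}}
g2 = {"V": {5, 6, 7, 8, 9}, "E": {(5, 6), (6, 7)}}
collection = [g0, g1, g2]


def select(graphs, predicate):
    """collection.select(predicate): keep graphs that satisfy the predicate."""
    return [g for g in graphs if predicate(g)]


def apply(graphs, unary_op):
    """collection.apply(unaryGraphOperator): run the operator on every graph."""
    return [unary_op(g) for g in graphs]


def reduce_collection(graphs, binary_op):
    """collection.reduce(binaryGraphOperator): fold the collection into one graph."""
    return reduce(binary_op, graphs)


def combine(a, b):
    # Assumed semantics of combination: union of vertices and of edges.
    return {"V": a["V"] | b["V"], "E": a["E"] | b["E"]}


# Mirrors: predicate = (Graph g => |g.V| > 3)
result = select(collection, lambda g: len(g["V"]) > 3)

# Attach the vertex count to every graph, then merge all graphs into one.
counted = apply(collection, lambda g: {**g, "vertexCount": len(g["V"])})
merged = reduce_collection(collection, combine)
```

With this sample data, `result` keeps the first and third graph, since only those have more than three vertices.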
Extended Property Graph Model in Flink
Element data:
• VertexData: ID, Label, Properties, Graphs
• EdgeData: ID, Label, Properties, Source Vertex, Target Vertex, Graphs
• GraphData: ID, Label, Properties
POJO Representation (Gelly):
• 𝒱: DataSet<Vertex<ID,VertexData>>
• ℰ: DataSet<Edge<ID,EdgeData>>
• 𝒢: DataSet<Subgraph<ID,GraphData>>
Tuple Representation:
• 𝒱: DataSet<VertexData>
• ℰ: DataSet<EdgeData>
• 𝒢: DataSet<GraphData>
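The three element types can be sketched as plain Python dataclasses. This only mirrors the field lists on the slide; the class layout, field names and the `vertices_of` helper are hypothetical and do not reflect Gradoop's actual POJO or tuple classes.

```python
from dataclasses import dataclass, field


@dataclass
class GraphData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class VertexData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)
    # Ids of the logical graphs that contain this vertex.
    graphs: set = field(default_factory=set)


@dataclass
class EdgeData:
    id: int
    label: str
    source_vertex: int
    target_vertex: int
    properties: dict = field(default_factory=dict)
    # Ids of the logical graphs that contain this edge.
    graphs: set = field(default_factory=set)


def vertices_of(graph_id, vertices):
    """Hypothetical helper: vertices that belong to the given logical graph."""
    return [v for v in vertices if graph_id in v.graphs]


community = GraphData(0, "Community", {"topic": "Graphs"})
alice = VertexData(0, "Person", {"name": "Alice"}, {0, 1})
bob = VertexData(1, "Person", {"name": "Bob"}, {0})
knows = EdgeData(0, "knows", alice.id, bob.id, {}, {0})
```

Storing graph membership on each vertex and edge lets one element belong to several logical graphs without being copied.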
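Summarization in Flink
Input vertices 𝒱:
VID | City
0   | L
1   | L
2   | D
3   | D
4   | D
5   | B
Vertex summarization:
• groupBy(City): L [0,1], D [2,3,4], B [5]
• reduceGroup + filter + map produces the summarized vertices 𝒱′ and a mapping from each vertex to its group representative (Rep)
Summarized vertices 𝒱′:
VID | City | Count
0   | L    | 2
2   | D    | 3
5   | B    | 1
Vertex-to-representative mapping:
VID | Rep
0   | 0
1   | 0
2   | 2
3   | 2
4   | 2
5   | 5
Edge summarization: the source and target of every edge in ℰ are replaced by their representatives.
EID | S T (input ℰ) | S T after join(VID==S) | S T after join(VID==T)
0   | 0 1           | 0 1                    | 0 0
1   | 1 0           | 0 0                    | 0 0
2   | 1 2           | 0 2                    | 0 2
3   | 2 1           | 2 1                    | 2 0
4   | 2 3           | 2 3                    | 2 2
5   | 3 2           | 2 2                    | 2 2
6   | 4 0           | 2 0                    | 2 0
7   | 4 1           | 2 1                    | 2 0
8   | 5 2           | 5 2                    | 5 2
9   | 5 3           | 5 3                    | 5 2
• groupBy(S,T): 0,0 [0,1]; 0,2 [2]; 2,0 [3,6,7]; 2,2 [4,5]; 5,2 [8,9]
• reduceGroup + filter + map produces the summarized edges ℰ′
Summarized edges ℰ′:
EID | S | T | Count
0   | 0 | 0 | 2
2   | 0 | 2 | 1
3   | 2 | 0 | 3
4   | 2 | 2 | 2
8   | 5 | 2 | 2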
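The dataflow can be traced on the slide's sample data with a single-machine Python sketch. It uses plain dictionaries in place of Flink's groupBy, reduceGroup and join, so it shows the logic rather than the Flink implementation; the function name `summarize` and the choice of the smallest id as group representative are assumptions that happen to match the ids shown on the slide.

```python
from collections import defaultdict

# Sample data from the slide: (VID, City) and (EID, S, T).
vertices = [(0, "L"), (1, "L"), (2, "D"), (3, "D"), (4, "D"), (5, "B")]
edges = [
    (0, 0, 1), (1, 1, 0), (2, 1, 2), (3, 2, 1), (4, 2, 3),
    (5, 3, 2), (6, 4, 0), (7, 4, 1), (8, 5, 2), (9, 5, 3),
]


def summarize(vertices, edges):
    # groupBy(City): collect the vertex ids of every city.
    groups = defaultdict(list)
    for vid, city in vertices:
        groups[city].append(vid)

    # One summarized vertex per group, plus a vertex -> representative map.
    summarized_vertices = []
    representative = {}
    for city, members in groups.items():
        rep = min(members)  # assumption: the smallest id represents the group
        summarized_vertices.append((rep, city, len(members)))
        for vid in members:
            representative[vid] = rep

    # join(VID==S) and join(VID==T): replace both endpoints of every edge.
    rewritten = [(eid, representative[s], representative[t]) for eid, s, t in edges]

    # groupBy(S,T): edges between the same pair of representatives collapse.
    edge_groups = defaultdict(list)
    for eid, s, t in rewritten:
        edge_groups[(s, t)].append(eid)
    summarized_edges = [
        (min(eids), s, t, len(eids)) for (s, t), eids in edge_groups.items()
    ]
    return sorted(summarized_vertices), sorted(summarized_edges)


sum_vertices, sum_edges = summarize(vertices, edges)
```

Each summarized vertex and edge carries a count of the original elements it stands for, which corresponds to the Count columns.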
Use Case: Graph Business Intelligence
• Business intelligence is usually based on relational data warehouses
  - Enterprise data is integrated within a dimensional schema
  - Analysis is limited to predefined relationships
  - No support for relationship-oriented data mining
• Graph-based approach
  - Integrate data sources within an instance graph by preserving original relationships between data objects (transactional and master data)
  - Determine subgraphs (business transaction graphs) related to business activities
  - Analyze subgraphs or entire graphs with aggregation queries, mining relationship patterns, etc.
[Figure: dimensional schema with a fact table (Facts) and three dimensions (Dim 1, Dim 2, Dim 3)]
Prerequisites: Data Integration
Business Transaction Graphs
[Figure: example instance graph integrating data from two source systems, CIT and ERP. Vertices: Employee (Name: Dave), Employee (Name: Alice), Employee (Name: Bob), Employee (Name: Carol), Ticket (Expense: 500), SalesQuotation, SalesOrder, two PurchaseOrder vertices, SalesRevenue (Revenue: 5,000), PurchaseInvoice (Expense: 2,000), PurchaseInvoice (Expense: 1,500). Edge labels: sentBy, createdBy, processedBy, openedFor, basedOn, serves, bills.]
(1) BTG Extraction
[Figure: collection of extracted business transaction graphs BTG 1, BTG 2, BTG 3, BTG 4, BTG 5, …, BTG n]
// generate base collection
btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )
(2) Profit Aggregation
// generate base collection
btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )
// define profit aggregate function
aggFunc = ( Graph g =>
  g.V.values("Revenue").sum() - g.V.values("Expense").sum()
)
// apply aggregate function and store result at new property
btgs = btgs.apply( Graph g =>
  g.aggregate( "Profit" , aggFunc )
)
BTG   | ∑ Revenue | ∑ Expenses | Net Profit
BTG 1 | 5,000     | -3,000     | 2,000
BTG 2 | 9,000     | -3,000     | 6,000
BTG 3 | 2,000     | -1,500     | 500
BTG 4 | 5,000     | -7,000     | -2,000
BTG 5 | 10,000    | -15,000    | -5,000
…     | …         | …          | …
BTG n | 8,000     | -4,000     | 4,000
(3) BTG Clustering
// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)
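Steps (2) and (3) can be traced with a short Python sketch. The representation of a BTG as a list of vertex property dictionaries is hypothetical; the values of "BTG 1" are taken from the example graph (one revenue of 5,000 and expenses of 500, 2,000 and 1,500), while "BTG 2" is invented sample data.

```python
# Each BTG is a list of vertex property dicts (hypothetical layout).
btgs = {
    "BTG 1": [{"Revenue": 5000}, {"Expense": 500}, {"Expense": 2000}, {"Expense": 1500}],
    "BTG 2": [{"Revenue": 2000}, {"Expense": 3500}],  # invented sample data
}


def profit(vertices):
    """Mirrors aggFunc: sum of all Revenue values minus sum of all Expense values."""
    revenue = sum(v.get("Revenue", 0) for v in vertices)
    expense = sum(v.get("Expense", 0) for v in vertices)
    return revenue - expense


# Mirrors btgs.apply(g => g.aggregate("Profit", aggFunc)).
profits = {name: profit(vertices) for name, vertices in btgs.items()}

# Mirrors select(Profit >= 0) followed by difference.
profit_btgs = {name for name, value in profits.items() if value >= 0}
loss_btgs = set(profits) - profit_btgs
```

Computing the loss cluster as a set difference keeps the two clusters disjoint and guarantees that every BTG lands in exactly one of them.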
(4) Cluster Characteristic Patterns
[Figure: BTG collection with ∑ Revenue, ∑ Expenses and Net Profit per BTG, next to two example patterns built from the elements Ticket, processedBy, Alice and Bob, createdBy, PurchaseOrder]
// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)
// apply magic
profitFreqPats = profitBtgs.callForCollection(
  :FrequentSubgraphs , {"Threshold":0.7}
)
lossFreqPats = lossBtgs.callForCollection(
  :FrequentSubgraphs , {"Threshold":0.7}
)
// determine cluster characteristic patterns
trivialPats = profitFreqPats.intersect(lossFreqPats)
profitCharPatterns = profitFreqPats.difference(trivialPats)
lossCharPatterns = lossFreqPats.difference(trivialPats)
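The set logic of this step can be sketched in Python. Real frequent subgraph mining enumerates subgraphs; here each BTG is simply a set of pre-extracted pattern strings, which is a strong simplification. The function `frequent_patterns`, the pattern names and the sample clusters are all hypothetical.

```python
def frequent_patterns(graphs, threshold):
    """Patterns that occur in at least `threshold` (a fraction) of the graphs."""
    counts = {}
    for patterns in graphs:
        for pattern in patterns:
            counts[pattern] = counts.get(pattern, 0) + 1
    return {p for p, c in counts.items() if c / len(graphs) >= threshold}


# Hypothetical clusters: every BTG is reduced to the patterns it contains.
profit_btgs = [
    {"basedOn", "processedBy:Alice"},
    {"basedOn", "processedBy:Alice"},
    {"basedOn", "processedBy:Alice"},
    {"basedOn"},
]
loss_btgs = [
    {"basedOn", "createdBy:Bob"},
    {"basedOn", "createdBy:Bob"},
    {"basedOn", "createdBy:Bob", "processedBy:Alice"},
]

profit_freq = frequent_patterns(profit_btgs, 0.7)
loss_freq = frequent_patterns(loss_btgs, 0.7)

# Patterns frequent in both clusters say nothing about profit or loss.
trivial = profit_freq & loss_freq
profit_char_patterns = profit_freq - trivial
loss_char_patterns = loss_freq - trivial
```

Removing the patterns that are frequent in both clusters leaves only those that separate profitable from loss-making transactions.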
Current State & Future Work
Current State
• 0.0.1 First Prototype (May 2015)
  - Hadoop MapReduce and Giraph for operator implementations
  - Too much complexity
  - Performance loss through serialization in HDFS/HBase
• 0.0.2 Using Flink as execution layer (June 2015)
  - Basic operators
• Currently 0.0.3-SNAPSHOT
  - Performance improvements
  - More operator implementations
Operator implementations (0.0.3-SNAPSHOT)
• Unary: Pattern Matching, Aggregation, Projection, Summarization
• Binary: Combination, Overlap, Exclusion, Isomorphism
• Collection: Selection, Distinct, Sort by, Top, Union, Intersection, Difference
• Auxiliary: Apply, Reduce, Call
• Algorithms: LabelPropagation, BTG Extraction, FSM
Future Work
• Operator integration into Gelly
  - Summarization (FLINK-2411)
  - Graph Sampling
  - …
• Graph Operations on streams (Flink)
• Graph Partitioning (maybe together with the Gelly people)
• Graph Versioning (Storage)
• Benchmarking
• GrALa Interpreter / Web UI
Benchmarks Sneak Preview
[Figure: Summarization (Vertex and Edge Labels); x-axis: # Worker (1, 2, 4, 8, 16); y-axis: Time [s] (0 to 1400)]
• 16x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz (12 Cores), 48 GB RAM
• Hadoop 2.5.2, Flink 0.9.0
  - slots (per node): 12
  - jobmanager.heap.mb: 2048
  - taskmanager.heap.mb: 40960
• Foodbroker Graph (https://github.com/dbs-leipzig/foodbroker)
  - Generates BI process data
  - 858,624,267 Vertices, 4,406,445,007 Edges, 663 GB Payload
Web UI Sneak Preview
Contributions welcome
• Code
  - Operator implementations
  - Performance Tuning
  - Storage layout
• Data! and Use Cases
  - We are researchers, we assume ...
  - Getting real data (especially BI data) is nearly impossible
• People
  - Bachelor / Master / PhD Thesis
Thank you for building Flink!
www.gradoop.com
https://github.com/dbs-leipzig/gradoop
http://dbs.uni-leipzig.de/file/GradoopTR.pdf
http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf

More Related Content

What's hot (20)

PPTX
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Stephan Ewen
 
PDF
Fabian Hueske – Juggling with Bits and Bytes
Flink Forward
 
PDF
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Flink Forward
 
PPTX
Google cloud Dataflow & Apache Flink
Iván Fernández Perea
 
PDF
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Flink Forward
 
PDF
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
PDF
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Vasia Kalavri
 
PDF
Building Robust ETL Pipelines with Apache Spark
Databricks
 
PDF
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Vasia Kalavri
 
PDF
Understanding Query Plans and Spark UIs
Databricks
 
PDF
Flink Gelly - Karlsruhe - June 2015
Andra Lungu
 
PPTX
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
Fabian Hueske
 
PDF
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
PPTX
Robust and Scalable ETL over Cloud Storage with Apache Spark
Databricks
 
PDF
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Gyula Fóra
 
PDF
Scaling Up AI Research to Production with PyTorch and MLFlow
Databricks
 
PDF
Designing Distributed Machine Learning on Apache Spark
Databricks
 
PPTX
Intro to Spark development
Spark Summit
 
PDF
Apache Spark vs Apache Flink
AKASH SIHAG
 
PDF
Exceptions are the Norm: Dealing with Bad Actors in ETL
Databricks
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Stephan Ewen
 
Fabian Hueske – Juggling with Bits and Bytes
Flink Forward
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Flink Forward
 
Google cloud Dataflow & Apache Flink
Iván Fernández Perea
 
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Flink Forward
 
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Vasia Kalavri
 
Building Robust ETL Pipelines with Apache Spark
Databricks
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Vasia Kalavri
 
Understanding Query Plans and Spark UIs
Databricks
 
Flink Gelly - Karlsruhe - June 2015
Andra Lungu
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
Fabian Hueske
 
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Databricks
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Gyula Fóra
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Databricks
 
Designing Distributed Machine Learning on Apache Spark
Databricks
 
Intro to Spark development
Spark Summit
 
Apache Spark vs Apache Flink
AKASH SIHAG
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Databricks
 

Viewers also liked (20)

PDF
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Flink Forward
 
PDF
Ufuc Celebi – Stream & Batch Processing in one System
Flink Forward
 
PDF
Vasia Kalavri – Training: Gelly School
Flink Forward
 
PPTX
Apache Flink Training: System Overview
Flink Forward
 
PDF
Marton Balassi – Stateful Stream Processing
Flink Forward
 
PDF
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Flink Forward
 
PPTX
Slim Baltagi – Flink vs. Spark
Flink Forward
 
PPTX
Flink Case Study: Bouygues Telecom
Flink Forward
 
PDF
Mikio Braun – Data flow vs. procedural programming
Flink Forward
 
PDF
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Flink Forward
 
PPTX
Apache Flink Training: DataStream API Part 1 Basic
Flink Forward
 
PDF
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Flink Forward
 
PPTX
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
PDF
Matthias J. Sax – A Tale of Squirrels and Storms
Flink Forward
 
PDF
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
PPTX
Apache Flink Training: DataStream API Part 2 Advanced
Flink Forward
 
PDF
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Flink Forward
 
PPTX
Fabian Hueske – Cascading on Flink
Flink Forward
 
PPTX
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Flink Forward
 
PPTX
Assaf Araki – Real Time Analytics at Scale
Flink Forward
 
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Flink Forward
 
Ufuc Celebi – Stream & Batch Processing in one System
Flink Forward
 
Vasia Kalavri – Training: Gelly School
Flink Forward
 
Apache Flink Training: System Overview
Flink Forward
 
Marton Balassi – Stateful Stream Processing
Flink Forward
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Flink Forward
 
Slim Baltagi – Flink vs. Spark
Flink Forward
 
Flink Case Study: Bouygues Telecom
Flink Forward
 
Mikio Braun – Data flow vs. procedural programming
Flink Forward
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Flink Forward
 
Apache Flink Training: DataStream API Part 1 Basic
Flink Forward
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Flink Forward
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
Matthias J. Sax – A Tale of Squirrels and Storms
Flink Forward
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
Apache Flink Training: DataStream API Part 2 Advanced
Flink Forward
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Flink Forward
 
Fabian Hueske – Cascading on Flink
Flink Forward
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Flink Forward
 
Assaf Araki – Real Time Analytics at Scale
Flink Forward
 
Ad

Similar to Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink (20)

PDF
Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with A...
Martin Junghanns
 
PDF
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...
Martin Junghanns
 
PPTX
Graphs in data structures are non-linear data structures made up of a finite ...
bhargavi804095
 
PDF
Big dataintegration rahm-part3Scalable and privacy-preserving data integratio...
ErhardRahm
 
PPTX
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Databricks
 
PDF
Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016
Martin Junghanns
 
PDF
Data Summer Conf 2018, “Analysing Billion Node Graphs (ENG)” — Giorgi Jvaridz...
Provectus
 
PDF
Web-Scale Graph Analytics with Apache® Spark™
Databricks
 
PPTX
Big data week 2018 - Graph Analytics on Big Data
Christos Hadjinikolis
 
PDF
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
csandit
 
PDF
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
cscpconf
 
PDF
Benchmarking tool for graph algorithms
Yash Khandelwal
 
PDF
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Dippy Aggarwal
 
PDF
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
PDF
Graph Analytics in Spark
Paco Nathan
 
PDF
An excursion into Graph Analytics with Apache Spark GraphX
Krishna Sankar
 
PDF
GraphTech Ecosystem - part 2: Graph Analytics
Linkurious
 
PDF
Distributed Graph Analytics with Gradoop
Martin Junghanns
 
PDF
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit
 
PDF
Distributed graph processing
Bartosz Konieczny
 
Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with A...
Martin Junghanns
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...
Martin Junghanns
 
Graphs in data structures are non-linear data structures made up of a finite ...
bhargavi804095
 
Big dataintegration rahm-part3Scalable and privacy-preserving data integratio...
ErhardRahm
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Databricks
 
Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016
Martin Junghanns
 
Data Summer Conf 2018, “Analysing Billion Node Graphs (ENG)” — Giorgi Jvaridz...
Provectus
 
Web-Scale Graph Analytics with Apache® Spark™
Databricks
 
Big data week 2018 - Graph Analytics on Big Data
Christos Hadjinikolis
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
csandit
 
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
cscpconf
 
Benchmarking tool for graph algorithms
Yash Khandelwal
 
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Dippy Aggarwal
 
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Graph Analytics in Spark
Paco Nathan
 
An excursion into Graph Analytics with Apache Spark GraphX
Krishna Sankar
 
GraphTech Ecosystem - part 2: Graph Analytics
Linkurious
 
Distributed Graph Analytics with Gradoop
Martin Junghanns
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit
 
Distributed graph processing
Bartosz Konieczny
 
Ad

More from Flink Forward (20)

PDF
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
PPTX
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
PPTX
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
Flink Forward
 
PDF
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
PDF
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
PPTX
Autoscaling Flink with Reactive Mode
Flink Forward
 
PDF
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
PPTX
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
PPTX
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
PDF
Flink powered stream processing platform at Pinterest
Flink Forward
 
PPTX
Apache Flink in the Cloud-Native Era
Flink Forward
 
PPTX
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
 
PPTX
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
PPTX
The Current State of Table API in 2022
Flink Forward
 
PDF
Flink SQL on Pulsar made easy
Flink Forward
 
PPTX
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 
PPTX
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
PPTX
Processing Semantically-Ordered Streams in Financial Services
Flink Forward
 
PDF
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
PDF
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
Autoscaling Flink with Reactive Mode
Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
Flink powered stream processing platform at Pinterest
Flink Forward
 
Apache Flink in the Cloud-Native Era
Flink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
The Current State of Table API in 2022
Flink Forward
 
Flink SQL on Pulsar made easy
Flink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Flink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 


Martin Junghanns – Gradoop: Scalable Graph Analytics with Apache Flink

  • 1. GRADOOP: Scalable Graph Analytics with Apache Flink Martin Junghanns University of Leipzig
  • 2. About the speaker and the team. 2011 Bachelor of Engineering (thesis: Partitioning of Dynamic Graphs). 2014 Master of Science (thesis: Graph Database Systems for Business Intelligence). Now: PhD student, Database Group, University of Leipzig (distributed systems, distributed graph data management, graph theory & algorithms). Professional experience: sones GraphDB, SAP. Team: André (PhD student), Martin (PhD student), Kevin (M.Sc. student), Niklas (M.Sc. student).
  • 4. Graph = (Vertices, Edges). “Graphs are everywhere”
  • 5. Graph = (Users, Friendships). “Graphs are everywhere”. Alice, Bob, Eve, Dave, Carol, Mallory, Peggy
  • 9. Graph = (Users, Followers). “Graphs are everywhere”. Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent
  • 11. Graph = (Cities, Connections). “Graphs are everywhere”. Leipzig (pop: 544K), Dresden (pop: 536K), Berlin (pop: 3.5M), Hamburg (pop: 1.7M), Munich (pop: 1.4M), Chemnitz (pop: 243K), Nuremberg (pop: 500K), Cologne (pop: 1M)
  • 12. “Graphs are large”. World Wide Web: ca. 1 billion websites. Facebook: ca. 1.49 billion active users, ca. 340 friends per user.
  • 16. End-to-End Graph Analytics: Data Integration, Graph Analytics, Representation. Data Integration: integrate data from one or more sources into a dedicated graph storage with a common graph data model. Graph Analytics: definition of analytical workflows from an operator algebra. Representation: result representation in a meaningful way.
  • 17. Graph Data Management
      | | Graph Database Systems (Neo4j, OrientDB) | Graph Processing Systems (Pregel, Giraph) | Distributed Workflow Systems (Flink Gelly, Spark GraphX) |
      |---|---|---|---|
      | Data Model | Rich Graph Models | Generic Graph Models | Generic Graph Models |
      | Focus | Local ACID Operations | Global Graph Operations | Global Data and Graph Operations |
      | Query Language | Yes | No | No |
      | Persistency | Yes | No | No |
      | Scalability | Vertical | Horizontal | Horizontal |
      | Workflows | No | No | Yes |
      | Data Integration | No | No | No |
      | Graph Analytics | No | Yes | Yes |
      | Representation | Yes | No | No |
  • 21. What's missing? An end-to-end framework and research platform for efficient, distributed and domain-independent graph data management and analytics.
  • 24. High Level Architecture (diagram). Components: Workflow Declaration (Visual, GrALa DSL); Workflow Execution (Flink Operator Execution); Flink Operator Implementations for Data Integration, Graph Analytics and Representation; Extended Property Graph Model; HBase Distributed Graph Store; HDFS/YARN Cluster. Arrows distinguish data flow and control flow.
  • 29. Graph Operators
      | Type | Operator | GrALa notation |
      |---|---|---|
      | Binary | Combination | graph.combine(otherGraph) : Graph |
      | Binary | Overlap | graph.overlap(otherGraph) : Graph |
      | Binary | Exclusion | graph.exclude(otherGraph) : Graph |
      | Binary | Isomorphism | graph.isIsomorphicTo(otherGraph) : Boolean |
      | Unary | Pattern Matching | graph.match(patternGraph,predicate) : Collection |
      | Unary | Aggregation | graph.aggregate(propertyKey,aggregateFunction) : Graph |
      | Unary | Projection | graph.project(vertexFunction,edgeFunction) : Graph |
      | Unary | Summarization | graph.summarize(vertexGroupKeys,vertexAggregateFunction,edgeGroupKeys,edgeAggregateFunction) : Graph |
  • 30. Combination
      personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
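The binary operators can be read as set operations on vertex and edge sets. The following is a minimal Python sketch under that reading, not Gradoop's implementation: the `Graph` class, its fields and the sample edges are hypothetical, and the exclusion rule (drop edges that lose an endpoint) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Hypothetical stand-in for a logical graph: vertex ids plus (source, target) edges."""
    vertices: frozenset
    edges: frozenset

    def combine(self, other):
        # Combination: union of both vertex sets and both edge sets.
        return Graph(self.vertices | other.vertices, self.edges | other.edges)

    def overlap(self, other):
        # Overlap: only vertices and edges contained in both graphs.
        return Graph(self.vertices & other.vertices, self.edges & other.edges)

    def exclude(self, other):
        # Exclusion (assumed semantics): keep vertices that are not in the other
        # graph, and only those edges whose endpoints both survive.
        vertices = self.vertices - other.vertices
        edges = frozenset((s, t) for (s, t) in self.edges
                          if s in vertices and t in vertices)
        return Graph(vertices, edges)


# Three small sample graphs; the edges are invented for illustration.
g0 = Graph(frozenset({"Alice", "Bob"}), frozenset({("Alice", "Bob")}))
g1 = Graph(frozenset({"Bob", "Carol"}), frozenset({("Bob", "Carol")}))
g2 = Graph(frozenset({"Carol", "Dave"}), frozenset({("Carol", "Dave")}))

# Mirrors db.G[0].combine(db.G[1]).combine(db.G[2]) from the slide.
person_graph = g0.combine(g1).combine(g2)
```

Because every operator returns a new `Graph`, calls can be chained in the same way as in the GrALa expression.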
  • 33. Summarization
      personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
      vertexGroupingKeys = {:type, "city"}
      edgeGroupingKeys = {:type}
      vertexAggFunc = (Vertex vSum, Set vertices => vSum["count"] = |vertices|)
      edgeAggFunc = (Edge eSum, Set edges => eSum["count"] = |edges|)
      sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)
  • 35. Graph Collection Operators
      | Type | Operator | GrALa notation |
      |---|---|---|
      | Collection | Selection | collection.select(predicate) : Collection |
      | Collection | Distinct | collection.distinct() : Collection |
      | Collection | Sort by | collection.sortBy(key, [:asc|:desc]) : Collection |
      | Collection | Top | collection.top(limit) : Collection |
      | Collection | Union | collection.union(otherCollection) : Collection |
      | Collection | Intersection | collection.intersect(otherCollection) : Collection |
      | Collection | Difference | collection.difference(otherCollection) : Collection |
      | Auxiliary | Apply | collection.apply(unaryGraphOperator) : Collection |
      | Auxiliary | Reduce | collection.reduce(binaryGraphOperator) : Graph |
      | Auxiliary | Call | [graph|collection].callFor[Graph|Collection](algorithm,parameters) : [Graph|Collection] |
  • 36. Selection
      collection = <db.G[0],db.G[1],db.G[2]>
      predicate = (Graph g => |g.V| > 3)
      result = collection.select(predicate)
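Selection filters a graph collection with a predicate over whole graphs. A minimal Python sketch of the example above, assuming each graph is a plain dict with a vertex set under the key `"V"` (the dict layout, the `select` helper and the sample memberships are illustrative only):

```python
def select(collection, predicate):
    """Keep the graphs of a collection for which the predicate holds."""
    return [graph for graph in collection if predicate(graph)]


# Hypothetical collection of three logical graphs.
collection = [
    {"id": 0, "V": {"Alice", "Bob", "Eve"}},
    {"id": 1, "V": {"Alice", "Bob", "Carol", "Dave"}},
    {"id": 2, "V": {"Mallory", "Peggy", "Trent", "Eve", "Dave"}},
]

# Same predicate as in the GrALa example: |g.V| > 3.
result = select(collection, lambda g: len(g["V"]) > 3)
result_ids = [g["id"] for g in result]
```

Only the graphs with more than three vertices (ids 1 and 2 in this sample) remain in `result`.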
  • 40. Extended Property Graph Model in Flink
      Fields: VertexData (ID, Label, Properties, Graphs); EdgeData (ID, Label, Properties, Source Vertex, Target Vertex, Graphs); GraphData (ID, Label, Properties)
      POJO representation (Gelly): VertexData, EdgeData and GraphData are POJOs; 𝒱 = DataSet<Vertex<ID,VertexData>>, ℰ = DataSet<Edge<ID,EdgeData>>, 𝒢 = DataSet<Subgraph<ID,GraphData>>
      Tuple representation: VertexData, EdgeData and GraphData are tuples; 𝒱 = DataSet<VertexData>, ℰ = DataSet<EdgeData>, 𝒢 = DataSet<GraphData>
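The field lists above can be sketched as plain data classes. This is an illustrative Python rendering, not the Java POJOs used by Gradoop; the attribute names, the `vertices_of` helper and the sample labels ("Community", "Person", "knows") are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class GraphData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class VertexData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)  # ids of the logical graphs containing this vertex


@dataclass
class EdgeData:
    id: int
    label: str
    source_vertex: int
    target_vertex: int
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)  # ids of the logical graphs containing this edge


def vertices_of(graph_id, vertices):
    """Return the vertices that belong to the logical graph with the given id."""
    return [v for v in vertices if graph_id in v.graphs]


community = GraphData(id=0, label="Community")
alice = VertexData(id=0, label="Person", properties={"name": "Alice"}, graphs={0})
bob = VertexData(id=1, label="Person", properties={"name": "Bob"}, graphs={0, 1})
eve = VertexData(id=2, label="Person", properties={"name": "Eve"}, graphs={1})
knows = EdgeData(id=0, label="knows", source_vertex=0, target_vertex=1, graphs={0})

members = [v.properties["name"] for v in vertices_of(community.id, [alice, bob, eve])]
```

Storing graph membership on each vertex and edge lets one element belong to several logical graphs, which is what the Graphs field in the slide expresses.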
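  • 41. Summarization in Flink (example: vertices grouped by City, edges grouped by source and target)
      𝒱 (VID, City): (0, L), (1, L), (2, D), (3, D), (4, D), (5, B)
      ℰ (EID, S, T): (0, 0, 1), (1, 1, 0), (2, 1, 2), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 4, 0), (7, 4, 1), (8, 5, 2), (9, 5, 3)
      groupBy(City) on 𝒱: L [0,1], D [2,3,4], B [5]
      reduceGroup + filter + map: 𝒱′ (VID, City, Count): (0, L, 2), (2, D, 3), (5, B, 1); vertex-to-representative mapping (VID, Rep): (0, 0), (1, 0), (2, 2), (3, 2), (4, 2), (5, 5)
      join(VID==S) on ℰ (ID, S, T): (0, 0, 1), (1, 0, 0), (2, 0, 2), (3, 2, 1), (4, 2, 3), (5, 2, 2), (6, 2, 0), (7, 2, 1), (8, 5, 2), (9, 5, 3)
      join(VID==T) (ID, S, T): (0, 0, 0), (1, 0, 0), (2, 0, 2), (3, 2, 0), (4, 2, 2), (5, 2, 2), (6, 2, 0), (7, 2, 0), (8, 5, 2), (9, 5, 2)
      groupBy(S,T): 0,0 [0,1]; 0,2 [2]; 2,0 [3,6,7]; 2,2 [4,5]; 5,2 [8,9]
      reduceGroup + filter + map: ℰ′ (EID, S, T, Count): (0, 0, 0, 2), (2, 0, 2, 1), (3, 2, 0, 3), (4, 2, 2, 2), (8, 5, 2, 2)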
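The dataflow of this slide can be traced on a single machine. The sketch below replays the slide's example in plain Python with dictionaries in place of Flink DataSets; it is an illustration of the grouping logic, not the Flink implementation, and the choice of the smallest id as group representative is an assumption read off the example values.

```python
from collections import defaultdict

# Input data from the slide: VID -> City and (EID, S, T).
vertices = {0: "L", 1: "L", 2: "D", 3: "D", 4: "D", 5: "B"}
edges = [(0, 0, 1), (1, 1, 0), (2, 1, 2), (3, 2, 1), (4, 2, 3),
         (5, 3, 2), (6, 4, 0), (7, 4, 1), (8, 5, 2), (9, 5, 3)]

# groupBy(City): collect the vertex ids of each city.
groups = defaultdict(list)
for vid, city in vertices.items():
    groups[city].append(vid)

# reduceGroup: one summarized vertex per group (representative id, city, count)
# plus a mapping from every vertex to its group representative.
summarized_vertices = []
representative = {}
for city, members in groups.items():
    rep = min(members)  # assumption: smallest id represents the group
    summarized_vertices.append((rep, city, len(members)))
    for vid in members:
        representative[vid] = rep

# join(VID==S) and join(VID==T): replace both endpoints by their representatives.
mapped_edges = [(eid, representative[s], representative[t]) for eid, s, t in edges]

# groupBy(S,T) + reduceGroup: one summarized edge per (source, target) pair.
edge_groups = defaultdict(list)
for eid, s, t in mapped_edges:
    edge_groups[(s, t)].append(eid)
summarized_edges = sorted(
    (min(eids), s, t, len(eids)) for (s, t), eids in edge_groups.items()
)
```

In Flink the two lookups in `representative` are distributed joins, which is why the slide shows them as separate steps.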
  • 44. Use Case: Graph Business Intelligence. Business intelligence is usually based on relational data warehouses: enterprise data is integrated within a dimensional schema; analysis is limited to predefined relationships; there is no support for relationship-oriented data mining. Graph-based approach: integrate data sources within an instance graph by preserving original relationships between data objects (transactional and master data); determine subgraphs (business transaction graphs) related to business activities; analyze subgraphs or entire graphs with aggregation queries, mining of relationship patterns, etc. (Figure: dimensional schema with Facts, Dim 1, Dim 2, Dim 3.)
  • 46. Business Transaction Graphs (example figure). Source systems: CIT, ERP. Vertices: Employee (Name: Dave), Employee (Name: Alice), Employee (Name: Bob), Employee (Name: Carol), Ticket (Expense: 500), SalesQuotation, SalesOrder, PurchaseOrder (two vertices), SalesRevenue (Revenue: 5,000), PurchaseInvoice (Expense: 2,000), PurchaseInvoice (Expense: 1,500). Edge labels: sentBy, createdBy, processedBy, openedFor, basedOn, serves, bills.
  • 52. (1) BTG Extraction: BTG 1, BTG 2, BTG 3, BTG 4, BTG 5, …, BTG n
  • 53. (1) BTG Extraction
      // generate base collection
      btgs = iig.callForCollection(:BusinessTransactionGraphs, {})
  • 56. (2) Profit Aggregation
      | BTG | ∑ Revenue | ∑ Expenses | Net Profit |
      |---|---|---|---|
      | BTG 1 | 5,000 | -3,000 | 2,000 |
      | BTG 2 | 9,000 | -3,000 | 6,000 |
      | BTG 3 | 2,000 | -1,500 | 500 |
      | BTG 4 | 5,000 | -7,000 | -2,000 |
      | BTG 5 | 10,000 | -15,000 | -5,000 |
      | … | … | … | … |
      | BTG n | 8,000 | -4,000 | 4,000 |
  • 57. (2) Profit Aggregation
      // generate base collection
      btgs = iig.callForCollection(:BusinessTransactionGraphs, {})
      // define profit aggregate function
      aggFunc = (Graph g => g.V.values("Revenue").sum() - g.V.values("Expense").sum())
      // apply aggregate function and store result at new property
      btgs = btgs.apply(Graph g => g.aggregate("Profit", aggFunc))
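The profit aggregation can be traced on the example business transaction graph shown earlier. A minimal Python sketch, assuming a graph is a dict with a vertex list under `"V"` and each vertex is a dict of properties; the helpers `profit`, `aggregate` and `apply_to_all` are hypothetical stand-ins for the GrALa operators, not Gradoop code:

```python
def profit(graph):
    """Sum of all "Revenue" values minus the sum of all "Expense" values."""
    revenue = sum(v.get("Revenue", 0) for v in graph["V"])
    expense = sum(v.get("Expense", 0) for v in graph["V"])
    return revenue - expense


def aggregate(graph, property_key, agg_func):
    """Return a copy of the graph with the aggregate stored as a graph property."""
    return {**graph, property_key: agg_func(graph)}


def apply_to_all(collection, unary_operator):
    """Apply a unary graph operator to every graph of a collection."""
    return [unary_operator(g) for g in collection]


# Vertices carrying Revenue or Expense values in the example figure.
example_btg = {"V": [
    {"label": "Ticket", "Expense": 500},
    {"label": "SalesRevenue", "Revenue": 5000},
    {"label": "PurchaseInvoice", "Expense": 2000},
    {"label": "PurchaseInvoice", "Expense": 1500},
    {"label": "SalesOrder"},  # no monetary property, contributes 0
]}

btgs = apply_to_all([example_btg], lambda g: aggregate(g, "Profit", profit))
```

For this graph the revenue of 5,000 is reduced by expenses of 500 + 2,000 + 1,500, so the stored `Profit` is 1,000.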
  • 59. (3) BTG Clustering
      // select profit and loss clusters
      profitBtgs = btgs.select(Graph g => g["Profit"] >= 0)
      lossBtgs = btgs.difference(profitBtgs)
  • 62. (4) Cluster Characteristic Patterns (figure: the BTG profit table alongside an example pattern with vertices Ticket, Alice, Bob, PurchaseOrder and edge labels processedBy, createdBy)
  • 63. (4) Cluster Characteristic Patterns
      // select profit and loss clusters
      profitBtgs = btgs.select(Graph g => g["Profit"] >= 0)
      lossBtgs = btgs.difference(profitBtgs)
      // apply magic
      profitFreqPats = profitBtgs.callForCollection(:FrequentSubgraphs, {"Threshold":0.7})
      lossFreqPats = lossBtgs.callForCollection(:FrequentSubgraphs, {"Threshold":0.7})
      // determine cluster characteristic patterns
      trivialPats = profitFreqPats.intersect(lossFreqPats)
      profitCharPatterns = profitFreqPats.difference(trivialPats)
      lossCharPatterns = lossFreqPats.difference(trivialPats)
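The clustering and set steps of this workflow can be sketched in Python. Frequent subgraph mining itself is replaced here by a deliberately simplified stand-in: each graph carries a precomputed set of (source label, edge label, target label) triples, and a triple counts as frequent if it occurs in at least the threshold fraction of the graphs. The profit values come from the table above; the triples and all helper names are invented for illustration.

```python
from collections import Counter


def frequent_patterns(graphs, threshold):
    """Simplified stand-in for :FrequentSubgraphs over single-edge patterns."""
    if not graphs:
        return set()
    counts = Counter(p for g in graphs for p in g["patterns"])
    return {p for p, c in counts.items() if c / len(graphs) >= threshold}


# Hypothetical single-edge patterns.
COMMON = ("SalesQuotation", "sentBy", "Employee")
PROFIT_ONLY = ("SalesOrder", "processedBy", "Alice")
LOSS_ONLY = ("Ticket", "processedBy", "Bob")
RARE = ("PurchaseOrder", "createdBy", "Carol")

btgs = [
    {"id": 1, "Profit": 2000, "patterns": {COMMON, PROFIT_ONLY, RARE}},
    {"id": 2, "Profit": 6000, "patterns": {COMMON, PROFIT_ONLY}},
    {"id": 3, "Profit": 500, "patterns": {COMMON, PROFIT_ONLY}},
    {"id": 4, "Profit": -2000, "patterns": {COMMON, LOSS_ONLY}},
    {"id": 5, "Profit": -5000, "patterns": {COMMON, LOSS_ONLY}},
]

# select profit and loss clusters
profit_btgs = [g for g in btgs if g["Profit"] >= 0]
loss_btgs = [g for g in btgs if g not in profit_btgs]

# mine frequent patterns per cluster
profit_freq_pats = frequent_patterns(profit_btgs, 0.7)
loss_freq_pats = frequent_patterns(loss_btgs, 0.7)

# determine cluster characteristic patterns
trivial_pats = profit_freq_pats & loss_freq_pats
profit_char_patterns = profit_freq_pats - trivial_pats
loss_char_patterns = loss_freq_pats - trivial_pats
```

Patterns that are frequent in both clusters say nothing about profit or loss, which is why they are removed from both result sets.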
  • 64. Current State & Future Work
  • 65. Current State. 0.0.1 First Prototype (May 2015): Hadoop MapReduce and Giraph for operator implementations; too much complexity; performance loss through serialization in HDFS/HBase. 0.0.2 Using Flink as execution layer (June 2015): basic operators. Currently 0.0.3-SNAPSHOT: performance improvements; more operator implementations.
  • 66. Operator implementations (0.0.3-SNAPSHOT). Unary: Pattern Matching, Aggregation, Projection, Summarization. Binary: Combination, Overlap, Exclusion, Isomorphism. Collection: Selection, Distinct, Sort by, Top, Union, Intersection, Difference. Auxiliary: Apply, Reduce, Call. Algorithms: LabelPropagation, BTG Extraction, FSM.
  • 67. Future Work. Operator integration into Gelly: Summarization (FLINK-2411), Graph Sampling, … Graph operations on streams (Flink). Graph partitioning (maybe together with the Gelly people). Graph versioning (storage). Benchmarking. GrALa interpreter / Web UI.
  • 68. Benchmarks Sneak Preview. Chart: Summarization (Vertex and Edge Labels), time [s] (axis 0 to 1400) over number of workers (1, 2, 4, 8, 16). Setup: 16x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz (12 cores), 48 GB RAM; Hadoop 2.5.2, Flink 0.9.0; slots (per node): 12; jobmanager.heap.mb: 2048; taskmanager.heap.mb: 40960. Dataset: Foodbroker graph (https://github.com/dbs-leipzig/foodbroker), which generates BI process data: 858,624,267 vertices, 4,406,445,007 edges, 663 GB payload.
  • 69. Web UI Sneak Preview
  • 70. Contributions welcome. Code: operator implementations, performance tuning, storage layout. Data! and use cases: we are researchers, we assume ...; getting real data (especially BI data) is nearly impossible. People: Bachelor / Master / PhD thesis.
  • 71. Thank you for building Flink! www.gradoop.com, https://github.com/dbs-leipzig/gradoop, http://dbs.uni-leipzig.de/file/GradoopTR.pdf, http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf