Tegra
Time-evolving Graph Processing on
Commodity Clusters
Anand Iyer, Joseph Gonzalez, Qifan Pu, Ion Stoica
Spark Summit East
8 February 2017
About Me
• PhD Candidate at AMP/RISE Lab at UC Berkeley
• Thesis on time-evolving graph processing
• Previous work:
• Collaborative energy diagnosis for smartphones
(carat.cs.berkeley.edu)
• Approximate query processing (BlinkDB)
• Cellular Network Analytics
• Fundamental trade-offs in applying ML to real-time datasets
Graphs are everywhere…
[Figures: social networks; Gnutella network subgraph; metabolic network of a single-cell organism; Tuberculosis]
Plenty of interest in processing them
• Graph DBMS: projected use in 25% of all enterprises by end of 2017 [1]
• Many open-source and research prototypes of distributed graph
processing frameworks: Giraph, Pregel, GraphLab, GraphX, …
[1] Forrester Research
Real-world Graphs are Dynamic
[Figure: earthquake occurrence density]
Processing Time-evolving Graphs
Many interesting business and research insights are
possible by processing such dynamic graphs…
… but there is little or no work on supporting such workloads in
existing graph-processing frameworks
Challenge #1: Storage
[Figure: snapshots over time: G1 = {A, B, C}, G2 = {A, B, C, D}, G3 = {A, B, C, D, E}]
Redundant storage of graph entities over time
Challenge #2: Computation
[Figure: snapshots G1, G2, G3 over time, each with its own result R1, R2, R3]
Wasted computation across snapshots
Challenge #3: Communication
[Figure: snapshots G1, G2, G3 over time; vertices A, B, C appear in every snapshot]
Duplicate messages sent over the network
How do we process time-evolving,
dynamically changing graphs
efficiently?
Share
Storage
Communication
Computation
Tegra
Sharing Storage
[Figure: snapshots G1, G2, G3 over time, stored as deltas δg1 = {A, B, C}, δg2 = {A, D}, δg3 = {C, D, E}]
Storing deltas results in the most compact storage, but creating a
snapshot from deltas can be expensive!
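The trade-off can be illustrated with a minimal sketch. It is not Tegra's storage layer: the `DeltaStore` object, the `Delta` type and the concrete edges are assumptions made up for illustration (the slides only show which vertices each delta touches). It shows that deltas store each edge once, while rebuilding snapshot n means replaying n deltas.

```scala
// Hypothetical sketch: edge-set snapshots stored as a chain of deltas.
object DeltaStore {
  type Edge = (String, String)

  // One delta records the edges added and removed relative to the previous snapshot.
  final case class Delta(added: Set[Edge], removed: Set[Edge])

  // Rebuild snapshot n (1-based) by replaying the first n deltas in order.
  // The work grows with the number of deltas replayed.
  def snapshot(deltas: Vector[Delta], n: Int): Set[Edge] =
    deltas.take(n).foldLeft(Set.empty[Edge]) { (edges, d) =>
      (edges -- d.removed) ++ d.added
    }

  // Number of edges physically stored across all deltas.
  def storedEdges(deltas: Vector[Delta]): Int =
    deltas.map(d => d.added.size + d.removed.size).sum

  // Assumed example edges: G1 has A-B and A-C, G2 adds A-D, G3 adds C-E and D-E.
  val deltas: Vector[Delta] = Vector(
    Delta(Set("A" -> "B", "A" -> "C"), Set.empty),
    Delta(Set("A" -> "D"), Set.empty),
    Delta(Set("C" -> "E", "D" -> "E"), Set.empty)
  )
}
```

In this example the deltas hold 5 edges in total, whereas three full snapshots would hold 2 + 3 + 5 = 10, but reading G3 requires folding over all three deltas.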
A Better Storage Solution
[Figure: Snapshot 1 at t1 and Snapshot 2 at t2]
Use a persistent data structure
Store snapshots in Persistent Adaptive Radix Trees (PART)
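What "persistent" buys can be sketched with a much simpler structure than PART. The code below is a stand-in, not an adaptive radix tree: a plain binary search tree with path copying, where all names (`PersistentTree`, `insert`, `lookup`) are hypothetical. An update returns a new root and copies only the nodes on the path to the key, so the old snapshot stays readable and untouched subtrees are shared.

```scala
// Hypothetical stand-in for PART: a persistent binary search tree with path copying.
object PersistentTree {
  sealed trait Tree
  case object Leaf extends Tree
  final case class Node(left: Tree, key: Int, value: String, right: Tree) extends Tree

  // Returns a new root. Only nodes on the path to `key` are copied;
  // every other subtree is shared with the previous version.
  def insert(t: Tree, key: Int, value: String): Tree = t match {
    case Leaf => Node(Leaf, key, value, Leaf)
    case n @ Node(l, k, _, r) =>
      if (key < k) n.copy(left = insert(l, key, value))
      else if (key > k) n.copy(right = insert(r, key, value))
      else n.copy(value = value)
  }

  def lookup(t: Tree, key: Int): Option[String] = t match {
    case Leaf => None
    case Node(l, k, v, r) =>
      if (key < k) lookup(l, key)
      else if (key > k) lookup(r, key)
      else Some(v)
  }

  // Snapshot 1 at t1, then snapshot 2 at t2 derived from it.
  val snapshot1: Tree = List(5 -> "A", 2 -> "B", 8 -> "C").foldLeft(Leaf: Tree) {
    case (t, (k, v)) => insert(t, k, v)
  }
  val snapshot2: Tree = insert(snapshot1, 9, "D")
}
```

Here `snapshot2` contains key 9 while `snapshot1` does not, and the left subtree of both roots is the same object in memory.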
Graph Snapshot Index
[Figure: within a partition, a vertex tree and an edge tree each hold Snapshot 1 (t1) and Snapshot 2 (t2), with snapshot ID management on top]
Shares structure between snapshots, and enables efficient operations
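One way to picture a snapshot index is a map from snapshot ID to immutable vertex and edge collections. The sketch below is an assumption for illustration only: `SnapshotIndex`, `Index` and `update` are hypothetical names, and Scala's immutable `Map` stands in for the PART trees (it shares structure between versions only once it grows into a hash trie).

```scala
// Hypothetical sketch of a snapshot index keyed by snapshot ID.
object SnapshotIndex {
  type Id = String

  // One snapshot: immutable vertex and edge maps.
  final case class Snapshot(vertices: Map[Id, String], edges: Map[(Id, Id), String])

  // The index maps snapshot IDs to snapshots and is itself immutable.
  final case class Index(snapshots: Map[Int, Snapshot] = Map.empty) {
    // Derive snapshot `newSid` from `baseSid` by adding vertices and edges.
    def update(baseSid: Int, newSid: Int,
               addV: Map[Id, String],
               addE: Map[(Id, Id), String]): Index = {
      val base = snapshots.getOrElse(baseSid, Snapshot(Map.empty, Map.empty))
      val next = Snapshot(base.vertices ++ addV, base.edges ++ addE)
      copy(snapshots = snapshots.updated(newSid, next))
    }

    def vertices(sid: Int): Map[Id, String] = snapshots(sid).vertices
    def edges(sid: Int): Map[(Id, Id), String] = snapshots(sid).edges
  }
}
```

Older snapshot IDs stay queryable after every update, which is the property the per-snapshot operators rely on.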
Graph Parallel Abstraction - GAS
Gather: Accumulate information from neighborhood
Apply: Apply the accumulated value
Scatter: Update adjacent edges & vertices with updated value
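The three phases can be sketched as one synchronous round over an in-memory edge list. This is a simplified illustration, not the GraphX or Tegra implementation: `GasSketch`, `round` and `components` are hypothetical names, and scatter is reduced to reporting which vertices changed. The example algorithm is connected components by minimum-label propagation.

```scala
// Hypothetical single-machine sketch of gather-apply-scatter.
object GasSketch {
  type Id = String

  // One synchronous round over an undirected edge list.
  def round[A](edges: Seq[(Id, Id)], values: Map[Id, A])
              (sum: (A, A) => A, applyFn: (A, A) => A): (Map[Id, A], Set[Id]) = {
    // Gather: every vertex collects its neighbors' values and combines them with `sum`.
    val msgs: Seq[(Id, A)] =
      edges.flatMap { case (s, d) => Seq(d -> values(s), s -> values(d)) }
    val gathered: Map[Id, A] =
      msgs.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(sum) }
    // Apply: merge the gathered value into the vertex's own value.
    val updated: Map[Id, A] =
      values.map { case (id, v) => id -> gathered.get(id).map(g => applyFn(v, g)).getOrElse(v) }
    // Scatter (simplified): report the vertices whose value changed.
    val changed: Set[Id] =
      updated.collect { case (id, v) if v != values(id) => id }.toSet
    (updated, changed)
  }

  val minLabel: (Id, Id) => Id = (a, b) => if (a.compareTo(b) <= 0) a else b

  // Connected components: repeat rounds until no vertex changes.
  def components(edges: Seq[(Id, Id)], vertices: Set[Id]): Map[Id, Id] = {
    var values: Map[Id, Id] = vertices.map(v => v -> v).toMap
    var changed: Set[Id] = vertices
    while (changed.nonEmpty) {
      val (next, delta) = round(edges, values)(minLabel, minLabel)
      values = next
      changed = delta
    }
    values
  }
}
```

For edges A-B, B-C and D-E, `components` labels A, B, C with "A" and D, E with "D".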
Processing Multiple Snapshots
for (snapshot in snapshots) {
  for (stage in graph-parallel-computation) {…}
}
[Figure: snapshots G1, G2, G3 over time, processed one after another]
Reducing Redundant Messages
[Figure: snapshots G1, G2, G3 over time, processed together]
for (step in graph-parallel-computation) {
  for (snapshot in snapshots) {…}
}
Can potentially avoid a large number of redundant messages
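Why swapping the loops helps can be shown by counting messages. The sketch below is an illustration under strong assumptions, not Tegra's messaging layer: `MessageSharing` and its functions are hypothetical, the edges are made up, and every vertex is assumed to hold the same value in all snapshots at this step (which need not hold in later iterations).

```scala
// Hypothetical sketch: counting messages with and without sharing across snapshots.
object MessageSharing {
  type Id = String
  type Edge = (Id, Id)

  // A message carries the sender's current value along one edge.
  final case class Msg(src: Id, dst: Id, value: Int)

  // Messages one snapshot sends in a step: one per edge.
  def messages(edges: Set[Edge], values: Map[Id, Int]): Set[Msg] =
    edges.map { case (s, d) => Msg(s, d, values(s)) }

  // Snapshot-outer order: each snapshot sends all of its messages on its own.
  def separate(snapshots: Seq[Set[Edge]], values: Map[Id, Int]): Int =
    snapshots.map(es => messages(es, values).size).sum

  // Step-outer order: all snapshots produce messages in the same step, so an
  // identical message is sent once and tagged with the snapshot IDs it belongs to.
  def shared(snapshots: Seq[Set[Edge]], values: Map[Id, Int]): Map[Msg, Seq[Int]] =
    snapshots.zipWithIndex
      .flatMap { case (es, sid) => messages(es, values).toSeq.map(_ -> sid) }
      .groupBy(_._1)
      .map { case (m, tagged) => m -> tagged.map(_._2) }
}
```

With three nested snapshots of 2, 3 and 5 edges, the separate order sends 10 messages while the shared order sends 5, each tagged with the snapshots that need it.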
Updating Results
• If results from a previous snapshot are available, how can we reuse
them?
• Three approaches in the past:
  • Restart the algorithm
    • Redundant computations
  • Memoization (GraphInc [1])
    • Too much state
  • Operator-wise state (Naiad [2, 3])
    • Too much overhead
    • Fault tolerance
[1] Facilitating real-time graph mining, CloudDB ’12
[2] Naiad: A timely dataflow system, SOSP ’13
[3] Differential dataflow, CIDR ’13
Key Idea
• Leverage how the GAS model executes computation
  • Each iteration in GAS modifies the graph by a little
  • Can be seen as another time-evolving graph!
• Upon change to a graph:
  • Mark parts of the graph that changed
  • Expand the marked parts to include the regions needing recomputation in every
    iteration
  • Borrow results from parts not changed
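The mark, expand and borrow steps can be sketched for one simple algorithm. This is an illustrative assumption, not Tegra's mechanism: `IncrementalSketch.propagate` is a hypothetical function that runs minimum-label propagation restricted to an active set, and it only handles additions (deleting edges would need more care). Labels from the previous snapshot are borrowed as the starting state, and only marked vertices and the neighbors of changed vertices are recomputed.

```scala
// Hypothetical sketch: incremental minimum-label propagation after edge additions.
object IncrementalSketch {
  type Id = String

  // Returns the converged labels and the number of vertex recomputations performed.
  def propagate(edges: Set[(Id, Id)], init: Map[Id, Id], marked: Set[Id]): (Map[Id, Id], Int) = {
    // Undirected adjacency sets.
    val nbrs: Map[Id, Set[Id]] =
      edges.toSeq.flatMap { case (s, d) => Seq(s -> d, d -> s) }
        .groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toSet }
    var labels = init
    var active = marked
    var work = 0
    while (active.nonEmpty) {
      work += active.size
      // Recompute only the active vertices from their neighbors' current labels.
      val updates = active.flatMap { v =>
        val best = (nbrs.getOrElse(v, Set.empty[Id]) + v).map(labels).min
        if (best != labels(v)) Some(v -> best) else None
      }
      labels = labels ++ updates
      // Expand: neighbors of changed vertices are re-examined in the next iteration.
      active = updates.flatMap { case (v, _) => nbrs.getOrElse(v, Set.empty[Id]) }
    }
    (labels, work)
  }
}
```

Starting from the previous labels with only A and D marked after adding edge A-D, this example performs 3 recomputations; starting from scratch with all four vertices marked it performs 5 and reaches the same labels.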
Incremental Computation
[Figure: iteration states G1_0, G1_1, G1_2 of snapshot G1 and G2_0, G2_1, G2_2 of snapshot G2, laid out along an iterations axis and a time axis]
Larger graphs and more iterations can yield significant improvements
API
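import org.graphframes.GraphFrame

// Vertex DataFrame: one row per vertex, keyed by "id"
val v = sqlContext.createDataFrame(List(
  ("a", "Alice"),
  ("b", "Bob"),
  ("c", "Charlie")
)).toDF("id", "name")
// Edge DataFrame: "src" and "dst" refer to vertex ids
val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow")
)).toDF("src", "dst", "relationship")
val g = GraphFrame(v, e).indexed()
// update() and indexed() are the Tegra extensions presented here;
// v1 and e1 hold the vertex and edge changes for the next snapshot
val g1 = g.update(v1, e1).indexed()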
API: Incremental Computations
val g = GraphFrame(v, e)
val result = g.triangleCount.run()
val g1 = g.update(v1, e1)
// Pass the previous result so it can be reused on the updated graph
val result1 = g1.triangleCount.run(result)
API: Computations on Multiple Graphs
val g = GraphFrame(v, e)
val g1 = g.update(v1,e1)
val g2 = g1.update(v2,e2)
val g3 = g1.update(v3,e3)
val results =
g3.triangleCount.runOnSnapshots(start, end)
API
[Figure: transition between two graphs with per-vertex values. After 11 iterations on graph 2, both converge to 3-digit precision.]
abstract class Graph[V, E] {
  // Collection views of one snapshot, selected by snapshot ID (sid)
  def vertices(sid: Int): Collection[(Id, V)]
  def edges(sid: Int): Collection[(Id, Id, E)]
  def triplets(sid: Int): Collection[Triplet]
  // Graph-parallel computation over the snapshots listed in sids
  def mrTriplets[M](f: (Triplet) => M,
                    sum: (M, M) => M,
                    sids: Array[Int]): Collection[(Int, Id, M)]
  // Convenience functions
  def mapV(f: (Id, V) => V,
           sids: Array[Int]): Graph[V, E]
  def mapE(f: (Id, Id, E) => E,
           sids: Array[Int]): Graph[V, E]
  def leftJoinV(v: Collection[(Id, V)],
                f: (Id, V, V) => V,
                sids: Array[Int]): Graph[V, E]
  def leftJoinE(e: Collection[(Id, Id, E)],
                f: (Id, Id, E, E) => E,
                sids: Array[Int]): Graph[V, E]
  def subgraph(vPred: (Id, V) => Boolean,
               ePred: (Triplet) => Boolean,
               sids: Array[Int]): Graph[V, E]
  def reverse(sids: Array[Int]): Graph[V, E]
}
Listing 3: GraphX [24] operators modified to support Tegra’s
Implementation & Evaluation
• Implemented on Spark 2.0
• Extended dataframes with versioning information and iterate
operator
• Extended GraphX API to allow computation on multiple
snapshots
• Preliminary evaluation on two real-world graphs
• Twitter: 41,652,230 vertices, 1,468,365,182 edges
• uk-2007: 105,896,555 vertices, 3,738,733,648 edges
Benefits of Storage Sharing
[Chart: storage reduction (0 to 4.5) vs. number of snapshots (1 to 20), annotated "Datastructure overheads"]
Significant improvements with more snapshots
Benefits of Sharing Communication
[Chart: time in seconds (0 to 5000) vs. number of snapshots (1 to 20), GraphX compared with Tegra]
Benefits of Incremental Computing
[Chart: computation time in seconds (0 to 250) vs. snapshot ID (0 to 20), incremental compared with full computation]
Only 5% of the graph modified in every snapshot
50x reduction by processing only the modified part
Ongoing/Future Work
• Tight(er) integration with Catalyst
• Tungsten improvements
• Code release
• Incremental pattern matching
• Approximate graph analytics
• Geo-distributed graph analytics
Summary
• Processing time-evolving graphs efficiently can be useful
• Sharing storage, computation and communication is key to efficient
time-evolving graph analysis
• We proposed Tegra, which implements our ideas
Please talk to us about your interesting use-cases!
api@cs.berkeley.edu
www.cs.berkeley.edu/~api
