Web-Scale Graph Analytics
with Apache Spark
Joseph K Bradley
NYC Data Science Meetup
June 28, 2017
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. Carnegie Mellon in Machine Learning
About Databricks
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Apache Spark Engine
Spark Core
Standard libraries: Spark Streaming, Spark SQL, MLlib, GraphX, …
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, & R APIs
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames: DataFrame-based graphs
• Bisecting K-Means: now part of MLlib
• Stanford CoreNLP integration: UDFs for NLP
spark-packages.org
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
Graphs
Example: airports & flights between them
(figure: graph with vertices JFK, IAD, LAX, SFO, SEA, DFW)
vertex:
id     City        State
"JFK"  "New York"  NY
edge:
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923
Apache Spark’s GraphX library
Overview
•  General-purpose graph processing library
•  Optimized for fast distributed computing
•  Library of algorithms: PageRank, Connected Components, etc.
Challenges
•  No Java, Python APIs
•  Lower-level RDD-based API (vs. DataFrames)
•  Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
The GraphFrames Spark Package
Goal: DataFrame-based graphs on Apache Spark
•  Simplify interactive queries
•  Support motif-finding for structural pattern search
•  Benefit from DataFrame optimizations
Collaboration between Databricks, UC Berkeley & MIT
+ Now with community contributors & committers!
GraphFrames
vertices DataFrame:
id     City        State
"JFK"  "New York"  NY
"SEA"  "Seattle"   WA
edges DataFrame:
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923
"DFW"  "SFO"  -7     4100224
Graph analysis with GraphFrames
Simple queries
Motif finding
Graph algorithms
Simple queries
SQL queries on vertices & edges
Simple graph queries (e.g., vertex degrees)
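As a rough illustration of these two kinds of simple query, the following pure-Python sketch filters an edge table and counts vertex in-degrees. It does not use Spark: the first two edge rows are the example flights from the slides, the other two are made up, and all variable names are illustrative.

```python
from collections import Counter

# Edge rows: (src, dst, delay, tripID). The first two rows come from the
# slides; the last two are made up for illustration.
edges = [
    ("JFK", "SEA", 45, 1058923),
    ("DFW", "SFO", -7, 4100224),
    ("SFO", "SEA", 12, 1),
    ("JFK", "LAX", 30, 2),
]

# SQL-style query on the edge table: flights delayed by more than 20.
delayed = [e for e in edges if e[2] > 20]

# Simple graph query: in-degree of every vertex that has incoming edges.
in_degrees = Counter(dst for _, dst, _, _ in edges)
```

In GraphFrames the same questions are asked of DataFrames, for example by filtering g.edges or reading g.inDegrees.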
Motif finding
Search for structural patterns within a graph.
(figure: example graph with vertices JFK, IAD, LAX, SFO, SEA, DFW)
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
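To make the meaning of the motif concrete, here is a minimal pure-Python sketch of the same pattern: two consecutive edges a→b→c with no edge from c back to a, followed by a filter on the first edge's delay. The edge data and the helper name find_paths are made up for illustration; this is not how GraphFrames evaluates motifs internally.

```python
# Directed edges as (src, dst, delay). Made-up data for illustration.
edges = [
    ("JFK", "SEA", 45),
    ("SEA", "SFO", 10),
    ("SFO", "JFK", 5),
    ("SEA", "LAX", 25),
    ("IAD", "JFK", 15),
]


def find_paths(edges):
    """Return (e1, e2) pairs matching (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)."""
    edge_set = {(s, d) for s, d, _ in edges}
    matches = []
    for e1 in edges:
        a, b, _ = e1
        for e2 in edges:
            if e2[0] != b:
                continue  # e2 must start where e1 ends
            c = e2[1]
            if (c, a) not in edge_set:  # negated term: no edge c -> a
                matches.append((e1, e2))
    return matches


paths = find_paths(edges)

# Equivalent of paths.filter("e1.delay > 20").
long_delay = [(e1, e2) for e1, e2 in paths if e1[2] > 20]
```

The nested loop is quadratic in the number of edges; it is meant only to show what the pattern selects.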
Graph algorithms
Find important vertices
•  PageRank
Find paths between sets of vertices
•  Breadth-first search (BFS)
•  Shortest paths
Find groups of vertices (components, communities)
•  Connected components
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
Other
•  Triangle counting
•  SVDPlusPlus
Saving & loading graphs
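Save & load the DataFrames.
from graphframes import GraphFrame

# Load the vertex and edge DataFrames, then build the graph.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)

# Save the graph by writing out its two DataFrames.
g.vertices.write.parquet(...)
g.edges.write.parquet(...)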
GraphFrames vs. GraphX
                        GraphFrames                        GraphX
Built on                DataFrames                         RDDs
Languages               Scala, Java, Python                Scala
Use cases               Queries & algorithms               Algorithms
Vertex IDs              Any type (in Catalyst)             Long
Vertex/edge attributes  Any number of DataFrame columns    Any type (VD, ED)
2 types of graph libraries
Graph algorithms
•  Standard & custom algorithms
•  Optimized for batch processing
Graph queries
•  Motif finding
•  Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)
Algorithm implementations
Mostly wrappers for GraphX
•  PageRank
•  Shortest paths
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
•  SVDPlusPlus
Some algorithms implemented using DataFrames
•  Breadth-first search
•  Connected components
•  Triangle counting
•  Motif finding
Moving implementations to DataFrames
DataFrames are optimized for a huge number of small records.
•  columnar storage
•  code generation (“Project Tungsten”)
•  query optimization (“Project Catalyst”)
Pros of integer vertex IDs
GraphFrames take arbitrary vertex IDs.
→ convenient for users
Algorithms prefer integer vertex IDs.
→ optimize in-memory storage
→ reduce communication
Our task: Map unique vertex IDs to unique (long) integers.
The hashing trick?
• Possible solution: hash vertex ID to long integer
• What is the chance of collision?
•  1 - (N-1)/N * (N-2)/N * … * (N-k+1)/N, for k vertices hashed into a range of size N
•  seems unlikely with long range N = 2^64
•  with 1 billion nodes, the chance is ~5.4%
• Problem: collisions change graph topology.
Name      Hash
Tim       84088
Joseph    -2070372689
Xiangrui  264245405
Felix     67762524
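The collision chance can be estimated with the standard birthday approximation, 1 - exp(-k(k-1)/(2N)), where k is the number of vertices and N the size of the hash range. The sketch below assumes uniformly distributed hashes; the function name is illustrative. For k = 10^9 and N = 2^64 this approximation gives roughly 2.7%, the same order of magnitude as the figure quoted on the slide, so collisions are far from negligible at this scale.

```python
import math


def collision_probability(k, n):
    """Approximate chance that k uniform hashes into n slots collide.

    Uses the birthday approximation 1 - exp(-k * (k - 1) / (2 * n)).
    expm1 keeps precision when the probability is small.
    """
    return -math.expm1(-k * (k - 1) / (2 * n))


# One billion vertices hashed into the 64-bit integer range.
p = collision_probability(10**9, 2**64)
```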
Generating unique IDs
Spark has built-in methods to generate unique IDs.
•  RDD: zipWithUniqueId(), zipWithIndex()
•  DataFrame: monotonically_increasing_id()
Possible solution: just use these methods
How it works
Partition 1
Vertex    ID
Tim       0
Joseph    1
Partition 2
Vertex    ID
Xiangrui  100 + 0
Felix     100 + 1
Partition 3
Vertex    ID
…         200 + 0
…         200 + 1
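The per-partition offset scheme can be sketched in plain Python as follows. The slide uses an offset of 100 per partition for readability; Spark's documentation for monotonically_increasing_id() describes putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits, which the sketch imitates. The function name and the list-of-lists representation of partitions are assumptions made for illustration.

```python
def assign_ids(partitions, bits_for_row=33):
    """Give every record an ID that is unique across partitions.

    The partition index goes in the high bits and the position of the
    record inside its partition in the low bits, so partitions never
    need to coordinate with each other.
    """
    ids = {}
    for p, records in enumerate(partitions):
        for i, record in enumerate(records):
            ids[record] = (p << bits_for_row) + i
    return ids


partitions = [["Tim", "Joseph"], ["Xiangrui", "Felix"]]
ids = assign_ids(partitions)
```

The IDs are unique and increasing but not consecutive, and they depend on the order of records inside each partition.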
… but not always
• DataFrames/RDDs are immutable and reproducible by design.
• However, records do not always have stable orderings.
•  distinct
•  repartition
• cache() does not help.
Partition 1 (first computation)
Vertex  ID
Tim     0
Joseph  1
Partition 1 (after re-compute)
Vertex  ID
Joseph  0
Tim     1
Our implementation
We implemented (v0.5.0) an expensive but correct version:
1.  (hash) re-partition + distinct vertex IDs
2.  sort vertex IDs within each partition
3.  generate unique integer IDs
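The three steps above can be sketched in plain Python. This is an illustration of the idea, not the GraphFrames code: the function name, the number of partitions, and the use of crc32 as the partitioning hash are all assumptions. Because every step is deterministic, the mapping does not depend on the order in which records arrive.

```python
import zlib


def stable_vertex_ids(raw_ids, num_partitions=4, bits_for_row=33):
    """Map string vertex IDs to integers, independent of input order."""
    # 1. Hash-partition and de-duplicate. crc32 is used because Python's
    #    built-in hash() for strings changes between processes.
    partitions = [set() for _ in range(num_partitions)]
    for v in raw_ids:
        partitions[zlib.crc32(v.encode("utf-8")) % num_partitions].add(v)

    mapping = {}
    for p, members in enumerate(partitions):
        # 2. Sort within the partition so the order is reproducible.
        for i, v in enumerate(sorted(members)):
            # 3. Unique integer ID: partition index in the high bits.
            mapping[v] = (p << bits_for_row) + i
    return mapping


a = stable_vertex_ids(["Tim", "Joseph", "Xiangrui", "Felix", "Tim"])
b = stable_vertex_ids(["Felix", "Xiangrui", "Tim", "Joseph"])
```

Here a and b are the same mapping even though the inputs are ordered differently and one contains a duplicate. The price is the extra shuffle and sort.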
Connected Components
Assign each vertex a component ID such that vertices receive the same component ID iff they are connected.
Applications:
•  fraud detection
• Spark Summit 2016 keynote from Capital One
•  clustering
•  entity resolution
Naive implementation (GraphX)
1.  Assign each vertex a unique component ID.
2.  Iterate until convergence:
•  For each vertex v, update:
   component ID of v ← smallest component ID in neighborhood of v
Pro: easy to implement
Con: slow convergence on large-diameter graphs
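A minimal single-machine sketch of this min-label propagation, written in plain Python rather than GraphX, shows why the diameter matters. The function name and return shape are illustrative.

```python
from collections import defaultdict


def naive_connected_components(vertices, edges):
    """Min-label propagation. Returns (labels, number_of_rounds)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    labels = {v: v for v in vertices}  # step 1: unique component IDs
    rounds = 0
    while True:
        # step 2: every vertex takes the smallest label in its neighborhood
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in nbrs[v]])
            for v in vertices
        }
        if new_labels == labels:
            return labels, rounds
        labels = new_labels
        rounds += 1


# A path 0 - 1 - ... - 9: the smallest label moves one hop per round.
path_vertices = list(range(10))
path_edges = [(i, i + 1) for i in range(9)]
path_labels, path_rounds = naive_connected_components(path_vertices, path_edges)
```

On a path of n vertices the smallest label needs n - 1 rounds to reach the far end, which is the slow convergence on large-diameter graphs noted above.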
Small-/large-star algorithm
Kiveris et al. "Connected Components in MapReduce and Beyond."
1.  Assign each vertex a unique ID.
2.  Iterate until convergence:
• (small-star) for each vertex, connect smaller neighbors to smallest neighbor
• (big-star) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
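The two operations can be sketched on a single machine as set transformations over an edge list. This is a plain-Python illustration based on the description in Kiveris et al., not the DataFrame implementation: edges are stored as (larger, smaller) pairs, the function names are made up, and the loop applies large-star then small-star until neither changes the edge set. Isolated vertices never appear in the edge set and would need to be added separately.

```python
from collections import defaultdict


def large_star(edges):
    """For each vertex u, connect its larger neighbors to min(u and neighbors)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    out = set()
    for u, ns in nbrs.items():
        m = min(ns | {u})
        out.update((v, m) for v in ns if v > u)
    return out


def small_star(edges):
    """For each vertex u, connect u and its smaller neighbors to the smallest one."""
    smaller = defaultdict(set)
    for u, v in edges:
        smaller[max(u, v)].add(min(u, v))
    out = set()
    for u, ns in smaller.items():
        m = min(ns)
        out.update((v, m) for v in ns | {u} if v != m)
    return out


def star_connected_components(edges):
    """Return {vertex: component ID} once every component is a star."""
    edges = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    while True:
        after_large = large_star(edges)
        after_small = small_star(after_large)
        if after_large == edges and after_small == after_large:
            break
        edges = after_small
    labels = {m: m for _, m in edges}         # star centers
    labels.update({v: m for v, m in edges})   # leaves point at their center
    return labels


labels = star_connected_components([(1, 5), (5, 7), (7, 8), (8, 9), (2, 4)])
```

On this input the path 1-5-7-8-9 collapses into a star centered on vertex 1, and vertices 2 and 4 form a second component labeled 2.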
Small-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
Big-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
Another interpretation
(figure: adjacency matrix over vertices 1, 5, 7, 8, 9, with one edge marked in each of rows 1, 5, 7, and 8)
Small-star operation
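(figure: two adjacency matrices over vertices 1, 5, 7, 8, 9, related by a step labeled "rotate & lift"; one has a single edge in each of rows 1, 5, 7, and 8, the other has three edges in row 1 and one in row 8)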
Big-star operation
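(figure: two adjacency matrices over vertices 1, 5, 7, 8, 9, related by a step labeled "lift"; one has a single edge in each of rows 1, 5, 7, and 8, the other has two edges in row 1 and one each in rows 5 and 7)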
Convergence
(figure: adjacency matrix over vertices 1, 5, 7, 8, 9 in which every marked entry lies in row 1)
Properties of the algorithm
• Small-/big-star operations do not change graph connectivity.
• Extra edges are pruned during iterations.
• Each connected component converges to a star graph.
• Converges in log2(#nodes) iterations
Implementation
Iterate:
• filter
• self-join
Challenge: handle these operations at scale.
Skewed joins
Real-world graphs contain big components.
→ data skew during connected components iterations
edges:
src  dst
0    1
0    2
0    3
0    4
…    …
0    2,000,000
1    3
2    5
join with:
src  Component id  neighbors
0    0             2,000,000
1    0             10
2    3             5
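Skewed joins
Split the edges by src degree, join each part with the component table, then union the results.
high-degree src (#nbrs > 1,000,000): broadcast join
src  dst
0    1
0    2
0    3
0    4
…    …
0    2,000,000
remaining edges: hash join
src  dst
1    3
2    5
component table (other side of both joins):
src  Component id  neighbors
0    0             2,000,000
1    0             10
2    3             5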
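The split between a broadcast join for high-degree vertices and a hash join for the rest can be mimicked in plain Python. This is only a sketch of the control flow: dictionaries stand in for both join strategies, the function name and the threshold argument are illustrative, and the example uses a tiny threshold in place of 1,000,000.

```python
from collections import Counter


def skew_aware_join(edges, components, threshold):
    """Join edges (src, dst) with components {src: component ID}.

    Edges whose src has more than `threshold` outgoing edges are joined
    against a small lookup table (standing in for a broadcast join); the
    remaining edges are bucketed by key first (standing in for a hash join).
    """
    out_degree = Counter(src for src, _ in edges)
    heavy = {v for v, d in out_degree.items() if d > threshold}

    # "Broadcast" side: only the component rows of heavy vertices are copied.
    broadcast_table = {v: components[v] for v in heavy}
    heavy_part = [(s, d, broadcast_table[s]) for s, d in edges if s in heavy]

    # "Hash join" side: bucket the light edges by key, then join per bucket.
    buckets = {}
    for s, d in edges:
        if s not in heavy:
            buckets.setdefault(s, []).append(d)
    light_part = [(s, d, components[s]) for s, ds in buckets.items() for d in ds]

    return heavy_part + light_part  # union of the two join results


edges = [(0, i) for i in range(1, 8)] + [(1, 3), (2, 5)]
components = {0: 0, 1: 0, 2: 3}
joined = skew_aware_join(edges, components, threshold=5)
```

The result contains the same rows as a plain join; the point of the split is that the rows of the single huge key no longer have to land in one hash partition.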
Checkpointing
We checkpoint every 2 iterations to avoid:
•  query plan explosion (exponential growth)
•  optimizer slowdown
•  running out of disk space for shuffle data
•  unexpected node failures
Experiments
twitter-2010 from WebGraph datasets (small diameter)
•  42 million vertices, 1.5 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 4 minutes
•  GraphFrames: 6 minutes
–  algorithm difference, checkpointing, checking skewness
Experiments
uk-2007-05 from WebGraph datasets
•  105 million vertices, 3.7 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 25 minutes
–  slow convergence
•  GraphFrames: 4.5 minutes
Experiments
regular grid 32,000 x 32,000 (large diameter)
•  1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1 hour
Experiments
regular grid 50,000 x 50,000 (large diameter)
•  2.5 billion nodes, 10 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1.6 hours
Future improvements
GraphFrames
•  update inefficient code (due to Spark 1.6 compatibility)
•  better graph partitioning
•  letting Spark SQL handle skewed joins and iterations
•  graph compression
Connected Components
•  local iterations
•  node pruning and better stopping criteria
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
DATABRICKS RUNTIME 3.0
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com
Thank you!
Get started with GraphFrames
Docs, downloads & tutorials
http://graphframes.github.io
https://docs.databricks.com
Dev community
Github issues & PRs
Twitter: @jkbatcmu → I’ll share my slides.
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Cryptocurrency Exchange Script like Binance.pptx
Cryptocurrency Exchange Script like Binance.pptxCryptocurrency Exchange Script like Binance.pptx
Cryptocurrency Exchange Script like Binance.pptx
riyageorge2024
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Innovative Approaches to Software Dev no good at all
Innovative Approaches to Software Dev no good at allInnovative Approaches to Software Dev no good at all
Innovative Approaches to Software Dev no good at all
ayeshakanwal75
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
AI in Business Software: Smarter Systems or Hidden Risks?
AI in Business Software: Smarter Systems or Hidden Risks?AI in Business Software: Smarter Systems or Hidden Risks?
AI in Business Software: Smarter Systems or Hidden Risks?
Amara Nielson
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Download Canva Pro 2025 PC Crack Latest Version .
Download Canva Pro 2025 PC Crack Latest Version .Download Canva Pro 2025 PC Crack Latest Version .
Download Canva Pro 2025 PC Crack Latest Version .
sadiyabibi60507
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Tools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google CertificateTools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google Certificate
VICTOR MAESTRE RAMIREZ
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]
PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]
PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]
saimabibi60507
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Cryptocurrency Exchange Script like Binance.pptx
Cryptocurrency Exchange Script like Binance.pptxCryptocurrency Exchange Script like Binance.pptx
Cryptocurrency Exchange Script like Binance.pptx
riyageorge2024
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Innovative Approaches to Software Dev no good at all
Innovative Approaches to Software Dev no good at allInnovative Approaches to Software Dev no good at all
Innovative Approaches to Software Dev no good at all
ayeshakanwal75
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
AI in Business Software: Smarter Systems or Hidden Risks?
AI in Business Software: Smarter Systems or Hidden Risks?AI in Business Software: Smarter Systems or Hidden Risks?
AI in Business Software: Smarter Systems or Hidden Risks?
Amara Nielson
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Download Canva Pro 2025 PC Crack Latest Version .
Download Canva Pro 2025 PC Crack Latest Version .Download Canva Pro 2025 PC Crack Latest Version .
Download Canva Pro 2025 PC Crack Latest Version .
sadiyabibi60507
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Tools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google CertificateTools of the Trade: Linux and SQL - Google Certificate
Tools of the Trade: Linux and SQL - Google Certificate
VICTOR MAESTRE RAMIREZ
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]
PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]
PRTG Network Monitor Crack Latest Version & Serial Key 2025 [100% Working]
saimabibi60507
 

Web-Scale Graph Analytics with Apache® Spark™

  • 1. Web-Scale Graph Analytics with Apache Spark Joseph K Bradley NYC Data Science Meetup June 28, 2017
  • 2. About me: Software engineer at Databricks; Apache Spark committer & PMC member; Ph.D. in Machine Learning from Carnegie Mellon.
  • 3. About Databricks. TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009. PRODUCT: Unified Analytics Platform. MISSION: Making Big Data Simple.
  • 4. Apache Spark Engine: Spark Core with Spark Streaming, Spark SQL, MLlib, GraphX, … Unified engine across diverse workloads & environments. Scale out, fault tolerant. Python, Java, Scala, & R APIs. Standard libraries.
  • 5.
  • 6. Spark Packages (spark-packages.org): 340+ packages written for Spark; 80+ packages for ML and Graphs. E.g.: GraphFrames (DataFrame-based graphs); Bisecting K-Means (now part of MLlib); Stanford CoreNLP integration (UDFs for NLP).
  • 7. Outline: Intro to GraphFrames. Moving implementations to DataFrames (vertex indexing; scaling Connected Components; other challenges: skewed joins and checkpoints). Future of GraphFrames.
  • 8. Graphs (vertices and edges). Example: airports & flights between them (JFK, IAD, LAX, SFO, SEA, DFW). Vertex columns (id, City, State), e.g. ("JFK", "New York", NY). Edge columns (src, dst, delay, tripID), e.g. ("JFK", "SEA", 45, 1058923).
  • 9. Apache Spark’s GraphX library. Overview: general-purpose graph processing library; optimized for fast distributed computing; library of algorithms (PageRank, Connected Components, etc.). Challenges: no Java or Python APIs; lower-level RDD-based API (vs. DataFrames); cannot use recent Spark optimizations (Catalyst query optimizer, Tungsten memory management).
  • 10. The GraphFrames Spark Package. Goal: DataFrame-based graphs on Apache Spark. Simplify interactive queries; support motif-finding for structural pattern search; benefit from DataFrame optimizations. Collaboration between Databricks, UC Berkeley & MIT, now with community contributors & committers!
  • 11. Graphs (vertices and edges) over JFK, IAD, LAX, SFO, SEA, DFW. Vertex columns (id, City, State), e.g. ("JFK", "New York", NY). Edge columns (src, dst, delay, tripID), e.g. ("JFK", "SEA", 45, 1058923).
  • 12. GraphFrames. vertices DataFrame (id, City, State): ("JFK", "New York", NY), ("SEA", "Seattle", WA). edges DataFrame (src, dst, delay, tripID): ("JFK", "SEA", 45, 1058923), ("DFW", "SFO", -7, 4100224).
  • 13. Graph analysis with GraphFrames: simple queries; motif finding; graph algorithms.
  • 14. Simple queries: SQL queries on vertices & edges; simple graph queries (e.g., vertex degrees).
  • 15. Motif finding (figure: graph over JFK, IAD, LAX, SFO, SEA, DFW). Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 16. Motif finding (figure highlights (a) and (b)). Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 17. Motif finding (figure highlights (a), (b) and (c)). Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 18. Motif finding (figure highlights (a), (b) and (c)). Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  • 19. Motif finding (figure highlights (a), (b) and (c)). Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)") Then filter using vertex & edge data. paths.filter("e1.delay > 20")
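To make the motif concrete, the following is a minimal pure-Python sketch of what the pattern "(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)" selects, followed by the delay filter. It is an illustration only, not how GraphFrames evaluates motifs (which uses DataFrame joins); the helper name `find_two_hop_no_return` and the sample flights are assumptions made up for this example.

```python
def find_two_hop_no_return(edges):
    """Return (e1, e2) pairs forming a -> b -> c where no edge c -> a exists.

    edges: list of dicts with keys "src", "dst", "delay" (hypothetical sample schema).
    """
    # Index outgoing edges by source vertex so the second hop is a lookup.
    by_src = {}
    for e in edges:
        by_src.setdefault(e["src"], []).append(e)
    # Set of (src, dst) pairs for the negated term !(c)-[]->(a).
    pairs = {(e["src"], e["dst"]) for e in edges}

    matches = []
    for e1 in edges:
        for e2 in by_src.get(e1["dst"], []):
            a, c = e1["src"], e2["dst"]
            if (c, a) not in pairs:
                matches.append((e1, e2))
    return matches


# Made-up flights for illustration.
flights = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "SEA", "dst": "SFO", "delay": 10},
    {"src": "SFO", "dst": "JFK", "delay": 5},
    {"src": "SEA", "dst": "DFW", "delay": 30},
]

paths = find_two_hop_no_return(flights)
# Equivalent of paths.filter("e1.delay > 20")
delayed = [(e1, e2) for e1, e2 in paths if e1["delay"] > 20]
```

Here the only match is JFK → SEA → DFW: every other two-hop path is closed by a return edge from (c) to (a).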
  • 20. Graph algorithms. Find important vertices: PageRank. Find paths between sets of vertices: breadth-first search (BFS), shortest paths. Find groups of vertices (components, communities): connected components, strongly connected components, Label Propagation Algorithm (LPA). Other: triangle counting, SVDPlusPlus.
  • 21. Saving & loading graphs: save & load the DataFrames.
      vertices = sqlContext.read.parquet(...)
      edges = sqlContext.read.parquet(...)
      g = GraphFrame(vertices, edges)
      g.vertices.write.parquet(...)
      g.edges.write.parquet(...)
  • 22. GraphFrames vs. GraphX:
      Built on: DataFrames (GraphFrames) vs. RDDs (GraphX)
      Languages: Scala, Java, Python (GraphFrames) vs. Scala (GraphX)
      Use cases: queries & algorithms (GraphFrames) vs. algorithms (GraphX)
      Vertex IDs: any type (in Catalyst) (GraphFrames) vs. Long (GraphX)
      Vertex/edge attributes: any number of DataFrame columns (GraphFrames) vs. any type (VD, ED) (GraphX)
  • 23. 2 types of graph libraries. Graph algorithms: standard & custom algorithms; optimized for batch processing. Graph queries: motif finding; point queries & updates. GraphFrames: both algorithms & queries (but not point updates).
  • 24. Outline: Intro to GraphFrames. Moving implementations to DataFrames (vertex indexing; scaling Connected Components; other challenges: skewed joins and checkpoints). Future of GraphFrames.
  • 25. Algorithm implementations. Mostly wrappers for GraphX: PageRank, shortest paths, strongly connected components, Label Propagation Algorithm (LPA), SVDPlusPlus. Some algorithms implemented using DataFrames: breadth-first search, connected components, triangle counting, motif finding.
  • 26. Moving implementations to DataFrames. DataFrames are optimized for a huge number of small records: columnar storage; code generation (“Project Tungsten”); query optimization (“Project Catalyst”).
  • 27. Outline: Intro to GraphFrames. Moving implementations to DataFrames (vertex indexing; scaling Connected Components; other challenges: skewed joins and checkpoints). Future of GraphFrames.
  • 28. Pros of integer vertex IDs. GraphFrames take arbitrary vertex IDs → convenient for users. Algorithms prefer integer vertex IDs → optimize in-memory storage, reduce communication. Our task: map unique vertex IDs to unique (long) integers.
  • 29. The hashing trick? Possible solution: hash vertex ID to long integer. What is the chance of collision? For k vertex IDs hashed into a range of size N it is 1 - (N-1)/N * (N-2)/N * … * (N-k+1)/N; this seems unlikely with the long range N = 2^64, yet with 1 billion nodes the chance is ~5.4%. Problem: collisions change graph topology. Example hashes (Name → Hash): Tim → 84088; Joseph → -2070372689; Xiangrui → 264245405; Felix → 67762524.
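The collision chance on the hashing slide can be checked with the standard birthday-bound approximation, P ≈ 1 - exp(-k(k-1)/(2N)). The sketch below is an illustration, not code from the talk; the function name `collision_probability` is made up. With k = 1 billion it gives roughly 2.7% for N = 2^64 and roughly 5.3% for N = 2^63, so the slide's ~5.4% figure seems closer to a 63-bit range (or to the looser bound k²/N); that reading is an inference, not something the slides state.

```python
import math


def collision_probability(k, n):
    """Approximate chance that k uniformly hashed IDs collide in a range of size n.

    Uses the birthday-bound approximation 1 - exp(-k(k-1) / (2n));
    expm1 keeps precision when the probability is small.
    """
    return -math.expm1(-k * (k - 1) / (2 * n))


k = 10**9  # 1 billion vertex IDs
p64 = collision_probability(k, 2**64)
p63 = collision_probability(k, 2**63)
print(f"N = 2^64: {p64:.2%}, N = 2^63: {p63:.2%}")
```

Either way the chance is far from negligible at this scale, which is the slide's point: a single collision merges two vertices and changes the graph topology.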
  • 30. Generating unique IDs. Spark has built-in methods to generate unique IDs. RDD: zipWithUniqueId(), zipWithIndex(). DataFrame: monotonically_increasing_id(). Possible solution: just use these methods.
  • 31. How it works. Partition 1 (Vertex → ID): Tim → 0, Joseph → 1. Partition 2: Xiangrui → 100 + 0, Felix → 100 + 1. Partition 3: … → 200 + 0, … → 200 + 1.
  • 32. … but not always. DataFrames/RDDs are immutable and reproducible by design. However, records do not always have stable orderings (e.g., after distinct or repartition). cache() does not help. Example: Partition 1 assigns Tim → 0, Joseph → 1; after a re-compute it assigns Joseph → 0, Tim → 1.
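The instability described on these two slides can be reproduced with a small simulation. This is a sketch under simplifying assumptions: `positional_ids` is a made-up helper, and the stride of 100 mirrors the slide's illustration rather than Spark's real layout (Spark documents that monotonically_increasing_id() puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits).

```python
def positional_ids(partitions, stride=100):
    """Assign each record the ID partition_index * stride + position_in_partition."""
    ids = {}
    for p, records in enumerate(partitions):
        for position, name in enumerate(records):
            ids[name] = p * stride + position
    return ids


# First evaluation of the (simulated) DataFrame.
first_run = positional_ids([["Tim", "Joseph"], ["Xiangrui", "Felix"]])

# A re-computation in which partition 1 yields its records in a different order.
recomputed = positional_ids([["Joseph", "Tim"], ["Xiangrui", "Felix"]])

# The same vertex now carries a different ID.
changed = {name for name in first_run if first_run[name] != recomputed[name]}
```

Any edge that was translated with the first mapping would point at the wrong vertex once the vertex table is re-computed with the second.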
  • 33. Our implementation. We implemented (v0.5.0) an expensive but correct version: 1. (hash) re-partition + distinct vertex IDs; 2. sort vertex IDs within each partition; 3. generate unique integer IDs.
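The three steps can be sketched in plain Python to show why they give reproducible IDs. This is not the GraphFrames code: `stable_ids`, the CRC32 hash, the partition count and the stride are all assumptions chosen for the illustration (a real implementation would need a stride larger than any partition).

```python
import zlib


def stable_ids(vertex_ids, num_partitions=3, stride=100):
    """Deterministic vertex indexing: hash-partition, deduplicate, sort, then number."""
    # Step 1: hash re-partition + distinct. CRC32 is used because it is
    # deterministic across runs, unlike Python's built-in hash() for strings.
    partitions = [set() for _ in range(num_partitions)]
    for vid in vertex_ids:
        partitions[zlib.crc32(vid.encode("utf-8")) % num_partitions].add(vid)

    ids = {}
    for p, members in enumerate(partitions):
        # Step 2: sort within each partition so positions no longer depend on arrival order.
        # Step 3: generate the integer ID from partition index and sorted position.
        for position, vid in enumerate(sorted(members)):
            ids[vid] = p * stride + position
    return ids


names = ["Tim", "Joseph", "Xiangrui", "Felix", "Tim"]
ids_a = stable_ids(names)
ids_b = stable_ids(list(reversed(names)))  # same IDs despite a different input order
```

The cost is the extra shuffle and sort, which is why the slide calls this version expensive but correct.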
  • 34. Outline: Intro to GraphFrames. Moving implementations to DataFrames (vertex indexing; scaling Connected Components; other challenges: skewed joins and checkpoints). Future of GraphFrames.
  • 35. Connected Components. Assign each vertex a component ID such that vertices receive the same component ID iff they are connected. Applications: fraud detection (Spark Summit 2016 keynote from Capital One); clustering; entity resolution.
  • 36. Naive implementation (GraphX). 1. Assign each vertex a unique component ID. 2. Iterate until convergence: for each vertex v, update: component ID of v ← smallest component ID in neighborhood of v. Pro: easy to implement. Con: slow convergence on large-diameter graphs.
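A minimal single-machine sketch of the naive scheme shows where the slow convergence comes from. It is not the GraphX code; the function name and the path-graph example are made up for illustration.

```python
def naive_connected_components(vertices, edges):
    """Min-label propagation: each round, every vertex takes the smallest
    component ID seen in its neighborhood during the previous round."""
    component = {v: v for v in vertices}  # step 1: unique component ID per vertex
    rounds = 0
    while True:
        updated = dict(component)
        for u, v in edges:
            smallest = min(component[u], component[v])
            updated[u] = min(updated[u], smallest)
            updated[v] = min(updated[v], smallest)
        if updated == component:  # converged: no label changed
            return component, rounds
        component = updated
        rounds += 1


# A path 0 - 1 - ... - 9: the label 0 travels one hop per round.
path_vertices = list(range(10))
path_edges = [(i, i + 1) for i in range(9)]
labels, rounds = naive_connected_components(path_vertices, path_edges)
```

On this path the smallest label needs as many rounds as the path has edges, so the number of rounds grows with the graph diameter.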
  • 37. Small-/large-star algorithm (Kiveris et al., "Connected Components in MapReduce and Beyond"). 1. Assign each vertex a unique ID. 2. Iterate until convergence: (small-star) for each vertex, connect smaller neighbors to smallest neighbor; (big-star) for each vertex, connect bigger neighbors to smallest neighbor (or itself).
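The two operations can be sketched on an in-memory edge set as follows. This is one reading of the algorithm for illustration, not the GraphFrames or MapReduce implementation; in particular, the small-star step here also links the vertex itself to its smallest neighbor so that connectivity is preserved. Function names and the sample graph are assumptions.

```python
def _neighbors(edges):
    """Build an undirected adjacency map from a set of (u, v) pairs."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def large_star(edges):
    """For each vertex u, connect its bigger neighbors to min(neighbors of u, u)."""
    out = set()
    for u, ns in _neighbors(edges).items():
        m = min(ns | {u})
        out.update((v, m) for v in ns if v > u)
    return out


def small_star(edges):
    """For each vertex u, connect its smaller neighbors (and u) to the smallest neighbor."""
    out = set()
    for u, ns in _neighbors(edges).items():
        smaller = {v for v in ns if v < u}
        if smaller:
            m = min(smaller)
            out.update((v, m) for v in smaller | {u} if v != m)
    return out


def star_connected_components(vertices, edges):
    """Alternate large-star and small-star until the edge set stops changing."""
    current = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    while True:
        updated = small_star(large_star(current))
        if updated == current:
            break
        current = updated
    # After convergence each component is a star centered on its smallest vertex.
    nbrs = _neighbors(current)
    return {v: min(nbrs.get(v, set()) | {v}) for v in vertices}


components = star_connected_components(
    [1, 5, 7, 8, 9, 20, 21, 30],
    [(1, 5), (5, 7), (7, 8), (8, 9), (20, 21)],
)
```

On the path 1 - 5 - 7 - 8 - 9 the edge set becomes a star around vertex 1 after two rounds, whereas label propagation would need four rounds for the label 1 to reach vertex 9.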
  • 38. Small-star operation (figure from Kiveris et al., "Connected Components in MapReduce and Beyond").
  • 39. Big-star operation (figure from Kiveris et al., "Connected Components in MapReduce and Beyond").
  • 40. Another interpretation: adjacency matrix over vertices 1, 5, 7, 8, 9 (figure).
  • 41. Small-star operation: rotate & lift (figure: adjacency matrix over vertices 1, 5, 7, 8, 9, before and after).
  • 42. Big-star operation: lift (figure: adjacency matrix over vertices 1, 5, 7, 8, 9, before and after).
  • 43. Convergence (figure: adjacency matrix over vertices 1, 5, 7, 8, 9 in which every entry lies in the row of vertex 1).
  • 44. Properties of the algorithm: Small-/big-star operations do not change graph connectivity. Extra edges are pruned during iterations. Each connected component converges to a star graph. Converges in log2(#nodes) iterations.
  • 46. Outline: Intro to GraphFrames. Moving implementations to DataFrames (vertex indexing; scaling Connected Components; other challenges: skewed joins and checkpoints). Future of GraphFrames.
  • 47. Skewed joins. Real-world graphs contain big components → data skew during connected components iterations. Edges (src, dst): (0, 1), (0, 2), (0, 3), (0, 4), …, (0, 2,000,000), (1, 3), (2, 5), joined with vertices (src, component id, neighbors): (0, 0, 2,000,000), (1, 0, 10), (2, 3, 5).
  • 48. Skewed joins. The same edges and vertices, split by degree: rows of vertices with #nbrs > 1,000,000 go through a broadcast join, the remaining rows go through a hash join, and the two results are combined with a union.
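The split-and-union idea can be sketched without Spark. In this single-machine illustration both sides are dictionary lookups, so only the routing is shown: in a cluster the few high-degree ("hub") vertex rows would be copied to every worker instead of shuffling their millions of edges to one place. The function name, the row layout and the tiny edge list are assumptions for the example, not the GraphFrames implementation.

```python
def skewed_join(edges, vertices, threshold):
    """Join edges (src, dst) with vertices {src: (component, num_neighbors)}.

    Edges whose source has more than `threshold` neighbors take the
    "broadcast" path; all other edges take the "hash join" path.
    """
    hubs = {v: info for v, info in vertices.items() if info[1] > threshold}

    # Broadcast path: the hub table is tiny, so every worker could hold a copy.
    hub_rows = [(src, dst, hubs[src][0]) for src, dst in edges if src in hubs]

    # Hash-join path: the remaining edges are matched to vertex rows by key.
    other_rows = [
        (src, dst, vertices[src][0])
        for src, dst in edges
        if src not in hubs and src in vertices
    ]

    # Union of the two partial results.
    return hub_rows + other_rows


vertex_rows = {0: (0, 2_000_000), 1: (0, 10), 2: (3, 5)}
edge_rows = [(0, 1), (0, 2), (1, 3), (2, 5)]  # vertex 0 really has ~2,000,000 edges
joined = skewed_join(edge_rows, vertex_rows, threshold=1_000_000)
```

The result is the same as a plain join; only the physical strategy per row differs.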
  • 49. Checkpointing. We checkpoint every 2 iterations to avoid: query plan explosion (exponential growth); optimizer slowdown; disk out of shuffle space; unexpected node failures.
  • 50. Experiments: twitter-2010 from WebGraph datasets (small diameter); 42 million vertices, 1.5 billion edges; 16 r3.4xlarge workers on Databricks. GraphX: 4 minutes. GraphFrames: 6 minutes (algorithm difference, checkpointing, checking skewness).
  • 51. Experiments: uk-2007-05 from WebGraph datasets; 105 million vertices, 3.7 billion edges; 16 r3.4xlarge workers on Databricks. GraphX: 25 minutes (slow convergence). GraphFrames: 4.5 minutes.
  • 52. Experiments: regular grid 32,000 x 32,000 (large diameter); 1 billion nodes, 4 billion edges; 32 r3.8xlarge workers on Databricks. GraphX: failed. GraphFrames: 1 hour.
  • 53. Experiments: regular grid 50,000 x 50,000 (large diameter); 2.5 billion nodes, 10 billion edges; 32 r3.8xlarge workers on Databricks. GraphX: failed. GraphFrames: 1.6 hours.
  • 54. Future improvements. GraphFrames: update inefficient code (due to Spark 1.6 compatibility); better graph partitioning; letting Spark SQL handle skewed joins and iterations; graph compression. Connected Components: local iterations; node pruning and better stopping criteria.
  • 55. UNIFIED ANALYTICS PLATFORM. Try Apache Spark in Databricks! Collaborative cloud environment; free version (community edition). DATABRICKS RUNTIME 3.0: Apache Spark, optimized for the cloud; caching and optimization layer (DBIO); enterprise security (DBES). Try for free today: databricks.com
  • 56. Thank you! Get started with GraphFrames. Docs, downloads & tutorials: http://graphframes.github.io, https://docs.databricks.com. Dev community: GitHub issues & PRs. Twitter: @jkbatcmu → I’ll share my slides.