09-hetero

The document discusses the course CS224W: Machine Learning with Graphs, focusing on handling heterogeneous graphs with multiple node and edge types. It introduces concepts such as relational GCNs and heterogeneous graph transformers, emphasizing the importance of relation types in capturing interactions between nodes and edges. The document also highlights the challenges and benefits of using heterogeneous graphs in machine learning applications.


CS224W: Machine Learning with Graphs

Jure Leskovec, Stanford University


http://cs224w.stanford.edu
ANNOUNCEMENTS
• Project Proposal due today

 So far we have only handled graphs with a single edge type
 How do we handle graphs with multiple node or edge types (a.k.a. heterogeneous graphs)?
 Goal: Learning with heterogeneous graphs
▪ Relational GCNs
▪ Heterogeneous Graph Transformer
▪ Design space for heterogeneous GNNs
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
2 types of nodes:
 Node type A: Paper nodes
 Node type B: Author nodes

2 types of edges:
 Edge type A: Cite
 Edge type B: Like

A graph can have multiple types of nodes and edges: here, 2 types of nodes + 2 types of edges.
8 possible relation types!

(Paper, Cite, Paper)    (Author, Cite, Author)
(Paper, Like, Paper)    (Author, Like, Author)
(Paper, Cite, Author)   (Author, Cite, Paper)
(Paper, Like, Author)   (Author, Like, Paper)

Relation types: (node_start, edge, node_end)

 We use a relation type to describe an edge (as opposed to just an edge type)
 A relation type better captures the interaction between nodes and edges
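As a quick illustration (not from the slides), the 8 relation types above can be enumerated mechanically; the node and edge type names are just the ones from this example:

from itertools import product

node_types = ["Paper", "Author"]
edge_types = ["Cite", "Like"]

# A relation type is a (node_start, edge, node_end) tuple.
relations = list(product(node_types, edge_types, node_types))
print(len(relations))  # 8, e.g., ('Paper', 'Cite', 'Author')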
 A heterogeneous graph is defined as

  G = (V, E, τ, φ)

▪ Nodes with node types: v ∈ V
▪ Node type for node v: τ(v)
▪ Edges with edge types: (u, v) ∈ E (an edge is described as a pair of nodes)
▪ Edge type for edge (u, v): φ(u, v)
▪ Relation type for edge (u, v) is a tuple: r(u, v) = (τ(u), φ(u, v), τ(v))
 There are other definitions for heterogeneous graphs as well – they all describe graphs with node & edge types
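A minimal sketch of this definition as code (a hypothetical container, not a library API): τ and φ are stored as lookup tables, and relation types are derived from them:

from dataclasses import dataclass, field

@dataclass
class HeteroGraph:
    nodes: set = field(default_factory=set)        # V
    edges: set = field(default_factory=set)        # E: pairs (u, v)
    node_type: dict = field(default_factory=dict)  # tau: v -> node type
    edge_type: dict = field(default_factory=dict)  # phi: (u, v) -> edge type

    def relation_type(self, u, v):
        # r(u, v) = (tau(u), phi(u, v), tau(v))
        return (self.node_type[u], self.edge_type[(u, v)], self.node_type[v])

g = HeteroGraph()
g.nodes |= {"a1", "p1"}
g.edges.add(("a1", "p1"))
g.node_type.update({"a1": "Author", "p1": "Paper"})
g.edge_type[("a1", "p1")] = "Write"
print(g.relation_type("a1", "p1"))  # ('Author', 'Write', 'Paper')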
Biomedical Knowledge Graphs
▪ Example node: Migraine
▪ Example relation: (fulvestrant, Treats, Breast Neoplasms)
▪ Example node type: Protein
▪ Example edge type: Causes

Event Graphs
▪ Example node: SFO
▪ Example relation: (UA689, Origin, LAX)
▪ Example node type: Flight
▪ Example edge type: Destination
 Example: E-Commerce Graph
▪ Node types: User, Item, Query, Location, ...
▪ Edge types: Purchase, Visit, Guide, Search, ...
▪ Different node types can have different feature spaces!
 Example: Academic Graph
▪ Node types: Author, Paper, Venue, Field, ...
▪ Edge types: Publish, Cite, …
▪ Benchmark dataset: Microsoft Academic Graph

 Observation: We can also treat types of nodes and edges as features
▪ Example: Add a one-hot type indicator to nodes and edges
▪ Append feature [1, 0] to each “author node”; append feature [0, 1] to each “paper node”
▪ Similarly, we can assign type-indicator features to edges of different types
▪ Then, a heterogeneous graph reduces to a standard graph (see the sketch below)
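A sketch of this reduction in PyTorch (toy features; the dimensions are made up for illustration):

import torch

author_x = torch.randn(3, 4)  # 3 author nodes, 4-dim features
paper_x = torch.randn(2, 4)   # 2 paper nodes, 4-dim features

# Append the one-hot type indicator: [1, 0] for authors, [0, 1] for papers.
author_x = torch.cat([author_x, torch.tensor([[1., 0.]]).repeat(3, 1)], dim=1)
paper_x = torch.cat([paper_x, torch.tensor([[0., 1.]]).repeat(2, 1)], dim=1)

# Stack into one feature matrix for a standard (homogeneous) GNN.
x = torch.cat([author_x, paper_x], dim=0)  # shape (5, 6)

Note that this reduction only works when all node types share a feature dimension; when they differ (Case 1 below), it breaks down.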
 When do we need a heterogeneous graph?
▪ Case 1: Different node/edge types have different shapes of features
▪ E.g., an “author node” has a 4-dim feature, while a “paper node” has a 5-dim feature
▪ Case 2: We know that different relation types represent different types of interactions
▪ E.g., (English, translate, French) and (English, translate, Chinese) require different models
 Ultimately, a heterogeneous graph is a more expressive graph representation
▪ Captures different types of interactions between entities
 But it also comes with costs
▪ More expensive (computation, storage)
▪ More complex implementation
 There are many ways to convert a heterogeneous graph to a standard graph (that is, a homogeneous graph)
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017

 (1) Graph Convolutional Networks (GCN):

  h_v^{(l)} = σ( W^{(l)} Σ_{u ∈ N(v)} h_u^{(l-1)} / |N(v)| )

 How to write this as Message + Aggregation?
▪ (1) Message: m_u^{(l)} = (1 / |N(v)|) W^{(l)} h_u^{(l-1)}
▪ (2) Aggregation: h_v^{(l)} = σ( Sum({m_u^{(l)}, u ∈ N(v)}) )
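A compact sketch of this message/aggregation view (plain PyTorch; dense adjacency for clarity rather than efficiency):

import torch
import torch.nn as nn

def gcn_layer(H, A, lin):
    # Message: m_u = h_u / |N(v)|; Aggregation: sum over neighbors, then sigma.
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)| per target node
    agg = (A @ H) / deg                            # summed, degree-normalized messages
    return torch.relu(lin(agg))                    # apply W^{(l)} and nonlinearity

H = torch.randn(6, 8)                   # 6 nodes, 8-dim embeddings
A = (torch.rand(6, 6) > 0.5).float()    # toy adjacency matrix
H_next = gcn_layer(H, A, nn.Linear(8, 8, bias=False))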
 We will extend GCN to handle heterogeneous graphs with multiple edge/relation types
 We start with a directed graph with one relation type
▪ How do we run GCN and update the representation of the target node A on this graph?
▪ Only pass messages along the direction of edges

(Figure: input graph with target node A and its neighbors B–F, and the resulting message-passing computation graph for A)
 What if the graph has multiple relation types?
 Use different neural network weights for different relation types!
▪ Weights W_{r1} for r1, W_{r2} for r2, W_{r3} for r3

(Figure: input graph whose edges carry relation types r1, r2, r3; messages on each relation type are transformed by that relation's own weight matrix before aggregation at the target node A)
 Introduce a set of neural networks for each relation type!
▪ One weight for rel_1, ..., one weight for rel_N
▪ Plus a weight W_0 for the self-loop
 Relational GCN (RGCN):

  h_v^{(l+1)} = σ( Σ_{r ∈ R} Σ_{u ∈ N_v^r} (1 / c_{v,r}) W_r^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)} )

 How to write this as Message + Aggregation?

 Message:
▪ Each neighbor of a given relation, normalized by the node degree of the relation, c_{v,r} = |N_v^r|:

  m_{u,r}^{(l)} = (1 / c_{v,r}) W_r^{(l)} h_u^{(l)}

▪ Self-loop:

  m_v^{(l)} = W_0^{(l)} h_v^{(l)}

 Aggregation:
▪ Sum over messages from neighbors and self-loop, then apply activation:

  h_v^{(l+1)} = σ( Sum({m_{u,r}^{(l)}, u ∈ N(v)} ∪ {m_v^{(l)}}) )
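A minimal RGCN layer following the message/aggregation view above (a plain-PyTorch sketch, not the official DGL/PyG implementation; edges are grouped per relation type):

import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, num_relations, d_in, d_out):
        super().__init__()
        # One weight matrix per relation type, plus W_0 for the self-loop.
        self.W_r = nn.ModuleList(
            [nn.Linear(d_in, d_out, bias=False) for _ in range(num_relations)])
        self.W_0 = nn.Linear(d_in, d_out, bias=False)

    def forward(self, H, edges_by_rel):
        # H: (num_nodes, d_in); edges_by_rel[r]: list of (u, v) edges of relation r.
        out = self.W_0(H)  # self-loop message m_v = W_0 h_v
        for r, edges in enumerate(edges_by_rel):
            if not edges:
                continue
            src = torch.tensor([u for u, _ in edges])
            dst = torch.tensor([v for _, v in edges])
            msg = self.W_r[r](H[src])  # W_r h_u for every edge (u, v) of relation r
            deg = torch.zeros(H.size(0)).index_add_(  # c_{v,r} = |N_v^r|
                0, dst, torch.ones(len(edges)))
            out.index_add_(0, dst, msg / deg[dst].clamp(min=1).unsqueeze(1))
        return torch.relu(out)

layer = RGCNLayer(num_relations=3, d_in=8, d_out=8)
H = torch.randn(6, 8)
edges_by_rel = [[(1, 0), (2, 0)], [(3, 0)], [(4, 5)]]  # toy graph
H_next = layer(H, edges_by_rel)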
 Each relation has L matrices: W_r^{(1)}, W_r^{(2)}, ..., W_r^{(L)}
 The size of each W_r^{(l)} is d^{(l+1)} × d^{(l)}, where d^{(l)} is the hidden dimension in layer l
 Rapid growth of the number of parameters w.r.t. the number of relations!
▪ Overfitting becomes an issue
 Two methods to regularize the weights W_r^{(l)}:
▪ (1) Use block diagonal matrices
▪ (2) Basis/Dictionary learning
 Key insight: make the weights sparse!
 Use block diagonal matrices for W_r: place B low-dimensional blocks along the diagonal of W_r and force all off-block entries to zero
▪ Limitation: only nearby neurons/dimensions can interact through W
 If we use B low-dimensional matrices, then the number of parameters reduces from d^{(l+1)} × d^{(l)} to B × (d^{(l+1)}/B) × (d^{(l)}/B)
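A sketch of the block-diagonal idea (assuming PyTorch; torch.block_diag assembles the full W_r from B small blocks, so only the blocks are free parameters):

import torch

d_in, d_out, B = 12, 12, 3
blocks = [torch.randn(d_out // B, d_in // B) for _ in range(B)]
W_r = torch.block_diag(*blocks)  # (12, 12), but only 3 * 4 * 4 = 48
                                 # free parameters instead of 12 * 12 = 144
print(W_r.shape, sum(b.numel() for b in blocks))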
 Key insight: Share weights across different relations!
 Represent the matrix of each relation as a linear combination of basis transformations:

  W_r = Σ_{b=1}^{B} a_{rb} · V_b , where V_b is shared across all relations

▪ V_b are the basis matrices
▪ a_{rb} is the importance weight of matrix V_b
 Now each relation only needs to learn {a_{rb}}_{b=1}^{B}, which is B scalars
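The basis decomposition as a sketch (V_b shared across all relations; each relation keeps only its B coefficients a_rb):

import torch
import torch.nn as nn

class BasisWeights(nn.Module):
    def __init__(self, num_relations, B, d_in, d_out):
        super().__init__()
        self.V = nn.Parameter(torch.randn(B, d_out, d_in))    # shared bases V_b
        self.a = nn.Parameter(torch.randn(num_relations, B))  # per-relation a_rb

    def weight(self, r):
        # W_r = sum_b a_rb * V_b -- each relation learns only B scalars.
        return torch.einsum("b,boi->oi", self.a[r], self.V)

bw = BasisWeights(num_relations=100, B=4, d_in=64, d_out=64)
W_5 = bw.weight(5)  # (64, 64) weight matrix for relation 5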
 Goal: Predict the label of a given node
 RGCN uses the representation of the final layer:
▪ If we predict the class of node A from k classes
▪ Take the final layer as the prediction head: h_A^{(L)} ∈ ℝ^k, where each entry of h_A^{(L)} represents the probability of that class

(Figure: input graph with target node A; edges carry relation types r1, r2, r3)
 Link prediction split: Every edge also has a relation type; this is independent of the 4 split categories (training message / training supervision / validation / test edges)
 In a heterogeneous graph, the homogeneous graph formed by every single relation also has the 4 splits:
▪ Training message, training supervision, validation, and test edges for r1
▪ ...
▪ Training message, training supervision, validation, and test edges for rn

(Figure: the original graph is split into the 4 categories of edges, per relation type)
 Assume (E, r3, A) is the training supervision edge, and all the other edges are training message edges
 Use RGCN to score (E, r3, A)!
▪ Take the final-layer embeddings of E and A: h_E^{(L)} and h_A^{(L)} ∈ ℝ^d
▪ Relation-specific score function f_r: ℝ^d × ℝ^d → ℝ
▪ One example: f_{r1}(h_E, h_A) = h_E^T W_{r1} h_A, with W_{r1} ∈ ℝ^{d×d}

(Figure: input graph; the dashed edge (E, r3, A) is the training supervision edge)
 Training:
1. Use RGCN to score the training supervision edge (E, r3, A)
2. Create a negative edge by perturbing the supervision edge
   • Corrupt the tail of (E, r3, A), e.g., (E, r3, B), (E, r3, D)
   • Note: the negative edges should NOT belong to the training message edges or training supervision edges! E.g., (E, r3, C) is NOT a valid negative edge
3. Use the GNN model to score the negative edge
4. Optimize a standard cross-entropy loss (as discussed in Lecture 6)
   • Maximize the score of the training supervision edge
   • Minimize the score of the negative edge

  ℓ = −log σ(f_{r3}(h_E, h_A)) − log(1 − σ(f_{r3}(h_E, h_B))) ,  where σ is the sigmoid function

(1) Use training message edges to predict training supervision edges.
(Figure: input graph; training supervision edge: (E, r3, A); training message edges: all the remaining existing edges, drawn as solid lines)
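Steps 1–4 as a sketch (the embeddings h_E, h_A, h_B here are random stand-ins for RGCN outputs, and f_r is the bilinear scorer from the previous slide):

import torch

d = 16
W_r3 = torch.randn(d, d, requires_grad=True)   # relation-specific scorer weights
h_E, h_A, h_B = torch.randn(3, d).unbind(0)    # stand-in final-layer embeddings

def f_r(h_u, h_v, W):
    return h_u @ W @ h_v  # f_r(h_u, h_v) = h_u^T W_r h_v

pos = f_r(h_E, h_A, W_r3)  # 1. score the supervision edge (E, r3, A)
neg = f_r(h_E, h_B, W_r3)  # 2.-3. score the negative edge (E, r3, B)
loss = -torch.log(torch.sigmoid(pos)) - torch.log(1 - torch.sigmoid(neg))
loss.backward()            # 4. optimize the cross-entropy loss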
 Evaluation:
▪ Validation time as an example; the same applies at test time
 Evaluate how well the model can predict the validation edges with their relation types
 Let's predict the validation edge (E, r3, D)
▪ Intuition: the score of (E, r3, D) should be higher than the score of all (E, r3, v) where (E, r3, v) is NOT in the training message edges or training supervision edges, e.g., (E, r3, B)

(2) At validation time: use training message edges & training supervision edges to predict validation edges.

1. Calculate the score of (E, r3, D)
2. Calculate the score of all the negative edges: {(E, r3, v) | v ∈ {B, F}}, since (E, r3, A) and (E, r3, C) belong to the training message edges & training supervision edges
3. Obtain the ranking RK of (E, r3, D)
4. Calculate metrics:
   1. Hits@k: 1[RK ≤ k]. Higher is better
   2. Reciprocal Rank: 1/RK. Higher is better

(Figure: input graph; the dashed edge (E, r3, D) is the validation edge to be scored)
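The ranking metrics as a sketch (the scores are illustrative numbers, not model outputs):

def hits_at_k(rank, k):
    return 1.0 if rank <= k else 0.0  # 1[RK <= k]

def reciprocal_rank(rank):
    return 1.0 / rank  # 1 / RK

pos_score = 0.7           # score of the validation edge (E, r3, D)
neg_scores = [0.9, 0.4]   # scores of the negative edges (E, r3, B), (E, r3, F)
rank = 1 + sum(s >= pos_score for s in neg_scores)  # RK = 2 here
print(hits_at_k(rank, 1), reciprocal_rank(rank))    # 0.0 0.5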
Wang et al. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 2020.

 Benchmark dataset
▪ ogbn-mag from Microsoft Academic Graph (MAG)
 Four (4) types of entities
▪ Papers: 736k nodes
▪ Authors: 1.1m nodes
▪ Institutions: 9k nodes
▪ Fields of study: 60k nodes

 Four (4) directed relations
▪ An author is "affiliated with" an institution
▪ An author "writes" a paper
▪ A paper "cites" a paper
▪ A paper "has a topic of" a field of study

 Prediction task
▪ Each paper has a 128-dimensional word2vec feature vector
▪ Given the content, references, authors, and author affiliations from ogbn-mag, predict the venue of each paper
▪ A 349-class classification problem, since 349 venues are considered
 Time-based dataset splitting
▪ Training set: papers published before 2018
▪ Test set: papers published after 2018
 Benchmark results:
▪ SOTA method: SeHGNN
▪ ComplEx (next lecture) + Simplified GCN (Lecture 17)

(Figure: ogbn-mag leaderboard accuracy; the SOTA method SeHGNN outperforms R-GCN)
 Relational GCN: a graph neural network for heterogeneous graphs
▪ Can perform entity classification as well as link prediction tasks
▪ The ideas can easily be extended to other relational GNNs (RGraphSAGE, RGAT, etc.)
▪ Benchmark: ogbn-mag from the Microsoft Academic Graph, to predict paper venues
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
 Graph Attention Networks (GAT):

  h_v^{(l)} = σ( Σ_{u ∈ N(v)} α_{vu} W^{(l)} h_u^{(l-1)} ) ,  where α_{vu} are the attention weights

 Not all of a node's neighbors are equally important
▪ Attention is inspired by cognitive attention
▪ The attention α_{vu} focuses on the important parts of the input data and fades out the rest
▪ Idea: the NN should devote more computing power to the small but important part of the data
 Can we adapt GAT for heterogeneous graphs?
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

 Motivation: GAT is unable to represent different node & edge types
 Introducing a set of neural networks (a weight for rel_1, ..., a weight for rel_N) for each relation type is too expensive for attention
▪ Recall: a relation describes (node_s, edge, node_e), so the number of relation types multiplies quickly
 HGT uses Scaled Dot-Product Attention (proposed in the Transformer):

  Attention(Q, K, V) = softmax(Q K^T / √d) V

 Query Q, Key K, Value V
▪ Q, K, V have shape (batch_size, dim)
 How do we obtain Q, K, V? Apply a Linear layer to the input:
▪ Q = Q_Linear(X)
▪ K = K_Linear(X)
▪ V = V_Linear(X)
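A sketch of scaled dot-product attention with linear Q/K/V projections (plain PyTorch; shapes follow the slide):

import torch
import torch.nn as nn

dim = 32
Q_Linear, K_Linear, V_Linear = (nn.Linear(dim, dim) for _ in range(3))

X = torch.randn(8, dim)  # (batch_size, dim)
Q, K, V = Q_Linear(X), K_Linear(X), V_Linear(X)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / dim ** 0.5            # (8, 8) attention logits
out = torch.softmax(scores, dim=-1) @ V  # (8, dim)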
 Recall: applying GAT to a homogeneous graph, where H^{(l)} is the l-th layer representation
 How do we take the relation type (node_s, edge, node_e) into the attention computation?
 Innovation: Decompose heterogeneous attention into node- and edge-type dependent attention mechanisms
▪ E.g., 3 node weight matrices + 2 edge weight matrices
▪ Without decomposition: 3*2*3 = 18 relation types → 18 weight matrices (supposing all relation types exist)

(Figure: Paper and Author node embeddings pass through node-type-specific Q-Linear/K-Linear layers, while Write and Cite edges contribute edge-type-specific weight matrices)
 Heterogeneous Mutual Attention:
 Each relation (T(s), R(e), T(t)) has a distinct set of projection weights
▪ T(s): type of node s; R(e): type of edge e
▪ T(s) & T(t) parameterize K_Linear_{T(s)} & Q_Linear_{T(t)}, which in turn return the Key and Query vectors K(s) & Q(t)
▪ The edge type R(e) directly parameterizes a weight matrix W_{R(e)}
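A single-head sketch of the type-dependent projections (a simplification of HGT: the real model adds multi-head attention, a relation prior, and softmax normalization over neighbors):

import torch
import torch.nn as nn

dim = 32
node_types, edge_types = ["Paper", "Author"], ["Write", "Cite"]

# K/Q projections are indexed by node type and W by edge type, so the parameter
# count grows with (#node types + #edge types) rather than #relation types.
K_Linear = nn.ModuleDict({t: nn.Linear(dim, dim) for t in node_types})
Q_Linear = nn.ModuleDict({t: nn.Linear(dim, dim) for t in node_types})
W_edge = nn.ParameterDict({e: nn.Parameter(torch.randn(dim, dim)) for e in edge_types})

def attention_score(h_s, T_s, h_t, T_t, R_e):
    K_s = K_Linear[T_s](h_s)  # Key from the source node's type
    Q_t = Q_Linear[T_t](h_t)  # Query from the target node's type
    return (K_s @ W_edge[R_e] @ Q_t) / dim ** 0.5

h_author, h_paper = torch.randn(dim), torch.randn(dim)
score = attention_score(h_author, "Author", h_paper, "Paper", "Write")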
 A full HGT layer
▪ We have just computed the attention
 Similarly, HGT decomposes the weights in the message computation by node & edge types: one set of weights for each node type and one for each edge type
 Benchmark: ogbn-mag from the Microsoft Academic Graph, to predict paper venues
 HGT uses far fewer parameters than R-GCN, even though its attention computation is more expensive, while performing better
▪ Thanks to the weight decomposition over node & edge types
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020

 How do we extend the general GNN design space to heterogeneous graphs?

(Figure: GNN design space — (1) Message, (2) Aggregation, (3) Layer connectivity, (4) Graph augmentation, (5) Learning objective)
 (1) Message computation
▪ Message function: m_u^{(l)} = MSG^{(l)}( h_u^{(l-1)} )
▪ Intuition: Each node creates a message, which will be sent to other nodes later
▪ Example: A Linear layer, m_u^{(l)} = W^{(l)} h_u^{(l-1)}
 (1) Heterogeneous message computation
▪ Message function: m_u^{(l)} = MSG_r^{(l)}( h_u^{(l-1)} )
▪ Observation: A node could receive multiple types of messages; the number of message types equals the number of relation types
▪ Idea: Create a different message function for each relation type r = (u, e, v), where u is the node that sends the message, e is the edge type, and v is the node that receives the message
▪ Example: A Linear layer, m_u^{(l)} = W_r^{(l)} h_u^{(l-1)}
 (2) Aggregation
▪ Intuition: Each node aggregates the messages from its neighbors:

  h_v^{(l)} = AGG^{(l)}({ m_u^{(l)}, u ∈ N(v) })

▪ Example: Sum(·), Mean(·) or Max(·) aggregator, e.g., h_v^{(l)} = Sum({m_u^{(l)}, u ∈ N(v)})
 (2) Heterogeneous Aggregation
▪ Observation: Each node could receive multiple types of messages from its neighbors, and multiple neighbors may belong to each message type
▪ Idea: We can define a 2-stage message passing (see the sketch below):

  h_v^{(l)} = AGG_all^{(l)}( AGG_r^{(l)}({ m_u^{(l)}, u ∈ N_r(v) }) )

▪ Given all the messages sent to a node: within each message type, aggregate the messages that belong to that relation type with AGG_r^{(l)}; then aggregate across the relation types with AGG_all^{(l)}
▪ Example: h_v^{(l)} = Concat( Sum({m_u^{(l)}, u ∈ N_r(v)}) over relation types r )
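The 2-stage aggregation as a sketch (Sum within each relation type, Concat across types):

import torch

def hetero_aggregate(msgs_by_rel):
    # Stage 1: within each relation type, Sum the messages (AGG_r).
    per_rel = [torch.stack(msgs).sum(dim=0) for msgs in msgs_by_rel]
    # Stage 2: across relation types, Concat (AGG_all).
    return torch.cat(per_rel, dim=-1)

d = 8
msgs_by_rel = [
    [torch.randn(d), torch.randn(d)],  # messages from relation r1 neighbors
    [torch.randn(d)],                  # messages from relation r2 neighbors
]
h_v = hetero_aggregate(msgs_by_rel)    # shape (2 * d,)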
 (3) Layer connectivity
▪ Add skip connections, pre/post-process layers
 Pre-processing layers: important when encoding node features is necessary, e.g., when nodes represent images/text
 Post-processing layers: important when reasoning / transformation over node embeddings is needed, e.g., graph classification, knowledge graphs
 In practice, adding these layers works great!
 Heterogeneous pre/post-process layers:
▪ MLP layers with respect to each node type, since the output of a GNN is a set of node embeddings (see the sketch below):

  h_v^{(l)} = MLP_{T(v)}( h_v^{(l)} ) , where T(v) is the type of node v

 Other successful GNN designs are also encouraged for heterogeneous GNNs: skip connections, batch/layer normalization, ...
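A sketch of node-type-specific post-processing (one MLP per node type, selected by T(v)):

import torch
import torch.nn as nn

d = 16
post_mlp = nn.ModuleDict({
    "Paper": nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d)),
    "Author": nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d)),
})

def post_process(h_v, node_type):
    return post_mlp[node_type](h_v)  # h_v <- MLP_{T(v)}(h_v)

h = post_process(torch.randn(d), "Paper")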
 Graph Feature manipulation
▪ The input graph lacks features → feature
augmentation
 Graph Structure manipulation
▪ The graph is too sparse → Add virtual nodes / edges
▪ The graph is too dense → Sample neighbors when
doing message passing
▪ The graph is too large → Sample subgraphs to
compute embeddings
▪ Will cover later in lecture: Scaling up GNNs

 Graph Feature manipulation
▪ 2 Common options: compute graph statistics (e.g.,
node degree) within each relation type, or across the
full graph (ignoring the relation types)
 Graph Structure manipulation
▪ Neighbor and subgraph sampling are also common
for heterogeneous graphs.
▪ 2 Common options: sampling within each relation
type (ensure neighbors from each type are covered),
or sample across the full graph

 Node-level prediction: ŷ_v = Linear( h_v^{(L)} )
 Edge-level prediction: ŷ_{uv} = Linear( Concat( h_u^{(L)}, h_v^{(L)} ) )
 Graph-level prediction: ŷ_G = AGG({ h_v^{(L)}, ∀v ∈ G })
 Heterogeneous node-level prediction: ŷ_v = Linear_{T(v)}( h_v^{(L)} )
 Heterogeneous edge-level prediction: ŷ_{uv} = Linear_r( Concat( h_u^{(L)}, h_v^{(L)} ) )
 Heterogeneous graph-level prediction: ŷ_G = AGG_all( AGG_i({ h_v^{(L)} ∈ ℝ^d, ∀T(v) = i }) )
 Heterogeneous GNNs extend GNNs by separately modeling node/relation types, plus an additional aggregation stage across types

(Figure: GNN design space — (1) Message, (2) Aggregation, (3) Layer connectivity, (4) Graph augmentation, (5) Learning objective)
 Heterogeneous graphs: graphs with multiple node or edge types
▪ Key concept: relation type (node_s, edge, node_e)
▪ Be aware that we don't always need heterogeneous graphs
 Learning with heterogeneous graphs
▪ Key idea: separately model each relation type
▪ Relational GCNs
▪ Heterogeneous Graph Transformer
▪ Design space for heterogeneous GNNs
