09-hetero

The document discusses the course CS224W: Machine Learning with Graphs, focusing on handling heterogeneous graphs with multiple node and edge types. It introduces concepts such as relational GCNs and heterogeneous graph transformers, emphasizing the importance of relation types in capturing interactions between nodes and edges. The document also highlights the challenges and benefits of using heterogeneous graphs in machine learning applications.


CS224W: Machine Learning with Graphs

Jure Leskovec, Stanford University


http://cs224w.stanford.edu
ANNOUNCEMENTS
• Project Proposal due today

 So far we have only handled graphs with a single edge type
 How do we handle graphs with multiple node or edge types (a.k.a. heterogeneous graphs)?
 Goal: Learning with heterogeneous graphs
▪ Relational GCNs
▪ Heterogeneous Graph Transformer
▪ Design space for heterogeneous GNNs
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
2 types of nodes:
 Node type A: Paper nodes
 Node type B: Author nodes

2 types of edges:
 Edge type A: Cite
 Edge type B: Like

A graph can have multiple types of nodes and edges: here, 2 types of nodes + 2 types of edges.
8 possible relation types!

(Paper, Cite, Paper)    (Author, Cite, Author)
(Paper, Like, Paper)    (Author, Like, Author)
(Paper, Cite, Author)   (Author, Cite, Paper)
(Paper, Like, Author)   (Author, Like, Paper)

Relation types: (node_start, edge, node_end)

 We use a relation type to describe an edge (as opposed to just an edge type)
 A relation type better captures the interaction between nodes and edges
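As a quick illustration (not from the slides), the 8 relation types above can be enumerated mechanically; the node and edge type names are just the ones from this example:

from itertools import product

node_types = ["Paper", "Author"]
edge_types = ["Cite", "Like"]

# A relation type is a (node_start, edge, node_end) tuple.
relations = list(product(node_types, edge_types, node_types))
print(len(relations))  # 8, e.g., ('Paper', 'Cite', 'Author')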
 A heterogeneous graph is defined as

  G = (V, E, τ, φ)

▪ Nodes with node types: v ∈ V
▪ Node type for node v: τ(v)
▪ Edges with edge types: (u, v) ∈ E (an edge is described as a pair of nodes)
▪ Edge type for edge (u, v): φ(u, v)
▪ Relation type for edge (u, v) is a tuple: r(u, v) = (τ(u), φ(u, v), τ(v))
 There are other definitions for heterogeneous graphs as well – they all describe graphs with node & edge types
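A minimal sketch of this definition as code (a hypothetical container, not a library API): τ and φ are stored as lookup tables, and relation types are derived from them:

from dataclasses import dataclass, field

@dataclass
class HeteroGraph:
    nodes: set = field(default_factory=set)        # V
    edges: set = field(default_factory=set)        # E: pairs (u, v)
    node_type: dict = field(default_factory=dict)  # tau: v -> node type
    edge_type: dict = field(default_factory=dict)  # phi: (u, v) -> edge type

    def relation_type(self, u, v):
        # r(u, v) = (tau(u), phi(u, v), tau(v))
        return (self.node_type[u], self.edge_type[(u, v)], self.node_type[v])

g = HeteroGraph()
g.nodes |= {"a1", "p1"}
g.edges.add(("a1", "p1"))
g.node_type.update({"a1": "Author", "p1": "Paper"})
g.edge_type[("a1", "p1")] = "Write"
print(g.relation_type("a1", "p1"))  # ('Author', 'Write', 'Paper')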
Biomedical Knowledge Graphs
▪ Example node: Migraine
▪ Example relation: (fulvestrant, Treats, Breast Neoplasms)
▪ Example node type: Protein
▪ Example edge type: Causes

Event Graphs
▪ Example node: SFO
▪ Example relation: (UA689, Origin, LAX)
▪ Example node type: Flight
▪ Example edge type: Destination
 Example: E-Commerce Graph
▪ Node types: User, Item, Query, Location, ...
▪ Edge types: Purchase, Visit, Guide, Search, ...
▪ Different node types can have different feature spaces!
 Example: Academic Graph
▪ Node types: Author, Paper, Venue, Field, ...
▪ Edge types: Publish, Cite, …
▪ Benchmark dataset: Microsoft Academic Graph

 Observation: We can also treat types of nodes and edges as features
▪ Example: Add a one-hot type indicator to nodes and edges
▪ Append feature [1, 0] to each “author node”; append feature [0, 1] to each “paper node”
▪ Similarly, we can assign type-indicator features to edges of different types
▪ Then, a heterogeneous graph reduces to a standard graph (see the sketch below)
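A sketch of this reduction in PyTorch (toy features; the dimensions are made up for illustration):

import torch

author_x = torch.randn(3, 4)  # 3 author nodes, 4-dim features
paper_x = torch.randn(2, 4)   # 2 paper nodes, 4-dim features

# Append the one-hot type indicator: [1, 0] for authors, [0, 1] for papers.
author_x = torch.cat([author_x, torch.tensor([[1., 0.]]).repeat(3, 1)], dim=1)
paper_x = torch.cat([paper_x, torch.tensor([[0., 1.]]).repeat(2, 1)], dim=1)

# Stack into one feature matrix for a standard (homogeneous) GNN.
x = torch.cat([author_x, paper_x], dim=0)  # shape (5, 6)

Note that this reduction only works when all node types share a feature dimension; when they differ (Case 1 below), it breaks down.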
 When do we need a heterogeneous graph?
▪ Case 1: Different node/edge types have different shapes of features
▪ E.g., an “author node” has a 4-dim feature, while a “paper node” has a 5-dim feature
▪ Case 2: We know that different relation types represent different types of interactions
▪ E.g., (English, translate, French) and (English, translate, Chinese) require different models
 Ultimately, a heterogeneous graph is a more expressive graph representation
▪ Captures different types of interactions between entities
 But it also comes with costs
▪ More expensive (computation, storage)
▪ More complex implementation
 There are many ways to convert a heterogeneous graph to a standard graph (that is, a homogeneous graph)
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017

 (1) Graph Convolutional Networks (GCN):

  h_v^{(l)} = σ( W^{(l)} Σ_{u ∈ N(v)} h_u^{(l-1)} / |N(v)| )

 How to write this as Message + Aggregation?
▪ (1) Message: m_u^{(l)} = (1 / |N(v)|) W^{(l)} h_u^{(l-1)}
▪ (2) Aggregation: h_v^{(l)} = σ( Sum({m_u^{(l)}, u ∈ N(v)}) )
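A compact sketch of this message/aggregation view (plain PyTorch; dense adjacency for clarity rather than efficiency):

import torch
import torch.nn as nn

def gcn_layer(H, A, lin):
    # Message: m_u = h_u / |N(v)|; Aggregation: sum over neighbors, then sigma.
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)| per target node
    agg = (A @ H) / deg                            # summed, degree-normalized messages
    return torch.relu(lin(agg))                    # apply W^{(l)} and nonlinearity

H = torch.randn(6, 8)                   # 6 nodes, 8-dim embeddings
A = (torch.rand(6, 6) > 0.5).float()    # toy adjacency matrix
H_next = gcn_layer(H, A, nn.Linear(8, 8, bias=False))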
 We will extend GCN to handle heterogeneous graphs with multiple edge/relation types
 We start with a directed graph with one relation type
▪ How do we run GCN and update the representation of the target node A on this graph?
▪ Only pass messages along the direction of edges

(Figure: input graph with target node A and its neighbors B–F, and the resulting message-passing computation graph for A)
 What if the graph has multiple relation types?
 Use different neural network weights for different relation types!
▪ Weights W_{r1} for r1, W_{r2} for r2, W_{r3} for r3

(Figure: input graph whose edges carry relation types r1, r2, r3; messages on each relation type are transformed by that relation's own weight matrix before aggregation at the target node A)
 Introduce a set of neural networks for each relation type!
▪ One weight for rel_1, ..., one weight for rel_N
▪ Plus a weight W_0 for the self-loop
 Relational GCN (RGCN):

  h_v^{(l+1)} = σ( Σ_{r ∈ R} Σ_{u ∈ N_v^r} (1 / c_{v,r}) W_r^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)} )

 How to write this as Message + Aggregation?

 Message:
▪ Each neighbor of a given relation, normalized by the node degree of the relation, c_{v,r} = |N_v^r|:

  m_{u,r}^{(l)} = (1 / c_{v,r}) W_r^{(l)} h_u^{(l)}

▪ Self-loop:

  m_v^{(l)} = W_0^{(l)} h_v^{(l)}

 Aggregation:
▪ Sum over messages from neighbors and self-loop, then apply activation:

  h_v^{(l+1)} = σ( Sum({m_{u,r}^{(l)}, u ∈ N(v)} ∪ {m_v^{(l)}}) )
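A minimal RGCN layer following the message/aggregation view above (a plain-PyTorch sketch, not the official DGL/PyG implementation; edges are grouped per relation type):

import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, num_relations, d_in, d_out):
        super().__init__()
        # One weight matrix per relation type, plus W_0 for the self-loop.
        self.W_r = nn.ModuleList(
            [nn.Linear(d_in, d_out, bias=False) for _ in range(num_relations)])
        self.W_0 = nn.Linear(d_in, d_out, bias=False)

    def forward(self, H, edges_by_rel):
        # H: (num_nodes, d_in); edges_by_rel[r]: list of (u, v) edges of relation r.
        out = self.W_0(H)  # self-loop message m_v = W_0 h_v
        for r, edges in enumerate(edges_by_rel):
            if not edges:
                continue
            src = torch.tensor([u for u, _ in edges])
            dst = torch.tensor([v for _, v in edges])
            msg = self.W_r[r](H[src])  # W_r h_u for every edge (u, v) of relation r
            deg = torch.zeros(H.size(0)).index_add_(  # c_{v,r} = |N_v^r|
                0, dst, torch.ones(len(edges)))
            out.index_add_(0, dst, msg / deg[dst].clamp(min=1).unsqueeze(1))
        return torch.relu(out)

layer = RGCNLayer(num_relations=3, d_in=8, d_out=8)
H = torch.randn(6, 8)
edges_by_rel = [[(1, 0), (2, 0)], [(3, 0)], [(4, 5)]]  # toy graph
H_next = layer(H, edges_by_rel)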
 Each relation has L matrices: W_r^{(1)}, W_r^{(2)}, ..., W_r^{(L)}
 The size of each W_r^{(l)} is d^{(l+1)} × d^{(l)}, where d^{(l)} is the hidden dimension in layer l
 Rapid growth of the number of parameters w.r.t. the number of relations!
▪ Overfitting becomes an issue
 Two methods to regularize the weights W_r^{(l)}:
▪ (1) Use block diagonal matrices
▪ (2) Basis/Dictionary learning
 Key insight: make the weights sparse!
 Use block diagonal matrices for W_r: place B low-dimensional blocks along the diagonal of W_r and force all off-block entries to zero
▪ Limitation: only nearby neurons/dimensions can interact through W
 If we use B low-dimensional matrices, then the number of parameters reduces from d^{(l+1)} × d^{(l)} to B × (d^{(l+1)}/B) × (d^{(l)}/B)
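A sketch of the block-diagonal idea (assuming PyTorch; torch.block_diag assembles the full W_r from B small blocks, so only the blocks are free parameters):

import torch

d_in, d_out, B = 12, 12, 3
blocks = [torch.randn(d_out // B, d_in // B) for _ in range(B)]
W_r = torch.block_diag(*blocks)  # (12, 12), but only 3 * 4 * 4 = 48
                                 # free parameters instead of 12 * 12 = 144
print(W_r.shape, sum(b.numel() for b in blocks))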
 Key insight: Share weights across different relations!
 Represent the matrix of each relation as a linear combination of basis transformations:

  W_r = Σ_{b=1}^{B} a_{rb} · V_b , where V_b is shared across all relations

▪ V_b are the basis matrices
▪ a_{rb} is the importance weight of matrix V_b
 Now each relation only needs to learn {a_{rb}}_{b=1}^{B}, which is B scalars
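The basis decomposition as a sketch (V_b shared across all relations; each relation keeps only its B coefficients a_rb):

import torch
import torch.nn as nn

class BasisWeights(nn.Module):
    def __init__(self, num_relations, B, d_in, d_out):
        super().__init__()
        self.V = nn.Parameter(torch.randn(B, d_out, d_in))    # shared bases V_b
        self.a = nn.Parameter(torch.randn(num_relations, B))  # per-relation a_rb

    def weight(self, r):
        # W_r = sum_b a_rb * V_b -- each relation learns only B scalars.
        return torch.einsum("b,boi->oi", self.a[r], self.V)

bw = BasisWeights(num_relations=100, B=4, d_in=64, d_out=64)
W_5 = bw.weight(5)  # (64, 64) weight matrix for relation 5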
 Goal: Predict the label of a given node
 RGCN uses the representation of the final layer:
▪ If we predict the class of node A from k classes
▪ Take the final layer as the prediction head: h_A^{(L)} ∈ ℝ^k, where each entry of h_A^{(L)} represents the probability of that class

(Figure: input graph with target node A; edges carry relation types r1, r2, r3)
 Link prediction split: Every edge also has a relation type; this is independent of the 4 split categories (training message / training supervision / validation / test edges)
 In a heterogeneous graph, the homogeneous graph formed by every single relation also has the 4 splits:
▪ Training message, training supervision, validation, and test edges for r1
▪ ...
▪ Training message, training supervision, validation, and test edges for rn

(Figure: the original graph is split into the 4 categories of edges, per relation type)
 Assume (E, r3, A) is the training supervision edge, and all the other edges are training message edges
 Use RGCN to score (E, r3, A)!
▪ Take the final-layer embeddings of E and A: h_E^{(L)} and h_A^{(L)} ∈ ℝ^d
▪ Relation-specific score function f_r: ℝ^d × ℝ^d → ℝ
▪ One example: f_{r1}(h_E, h_A) = h_E^T W_{r1} h_A, with W_{r1} ∈ ℝ^{d×d}

(Figure: input graph; the dashed edge (E, r3, A) is the training supervision edge)
 Training:
1. Use RGCN to score the training supervision edge (E, r3, A)
2. Create a negative edge by perturbing the supervision edge
   • Corrupt the tail of (E, r3, A), e.g., (E, r3, B), (E, r3, D)
   • Note: the negative edges should NOT belong to the training message edges or training supervision edges! E.g., (E, r3, C) is NOT a valid negative edge
3. Use the GNN model to score the negative edge
4. Optimize a standard cross-entropy loss (as discussed in Lecture 6)
   • Maximize the score of the training supervision edge
   • Minimize the score of the negative edge

  ℓ = −log σ(f_{r3}(h_E, h_A)) − log(1 − σ(f_{r3}(h_E, h_B))) ,  where σ is the sigmoid function

(1) Use training message edges to predict training supervision edges.
(Figure: input graph; training supervision edge: (E, r3, A); training message edges: all the remaining existing edges, drawn as solid lines)
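Steps 1–4 as a sketch (the embeddings h_E, h_A, h_B here are random stand-ins for RGCN outputs, and f_r is the bilinear scorer from the previous slide):

import torch

d = 16
W_r3 = torch.randn(d, d, requires_grad=True)   # relation-specific scorer weights
h_E, h_A, h_B = torch.randn(3, d).unbind(0)    # stand-in final-layer embeddings

def f_r(h_u, h_v, W):
    return h_u @ W @ h_v  # f_r(h_u, h_v) = h_u^T W_r h_v

pos = f_r(h_E, h_A, W_r3)  # 1. score the supervision edge (E, r3, A)
neg = f_r(h_E, h_B, W_r3)  # 2.-3. score the negative edge (E, r3, B)
loss = -torch.log(torch.sigmoid(pos)) - torch.log(1 - torch.sigmoid(neg))
loss.backward()            # 4. optimize the cross-entropy loss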
 Evaluation:
▪ Validation time as an example; the same applies at test time
 Evaluate how well the model can predict the validation edges with their relation types
 Let's predict the validation edge (E, r3, D)
▪ Intuition: the score of (E, r3, D) should be higher than the score of all (E, r3, v) where (E, r3, v) is NOT in the training message edges or training supervision edges, e.g., (E, r3, B)

(2) At validation time: use training message edges & training supervision edges to predict validation edges.

1. Calculate the score of (E, r3, D)
2. Calculate the score of all the negative edges: {(E, r3, v) | v ∈ {B, F}}, since (E, r3, A) and (E, r3, C) belong to the training message edges & training supervision edges
3. Obtain the ranking RK of (E, r3, D)
4. Calculate metrics:
   1. Hits@k: 1[RK ≤ k]. Higher is better
   2. Reciprocal Rank: 1/RK. Higher is better

(Figure: input graph; the dashed edge (E, r3, D) is the validation edge to be scored)
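The ranking metrics as a sketch (the scores are illustrative numbers, not model outputs):

def hits_at_k(rank, k):
    return 1.0 if rank <= k else 0.0  # 1[RK <= k]

def reciprocal_rank(rank):
    return 1.0 / rank  # 1 / RK

pos_score = 0.7           # score of the validation edge (E, r3, D)
neg_scores = [0.9, 0.4]   # scores of the negative edges (E, r3, B), (E, r3, F)
rank = 1 + sum(s >= pos_score for s in neg_scores)  # RK = 2 here
print(hits_at_k(rank, 1), reciprocal_rank(rank))    # 0.0 0.5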
Wang et al. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 2020.

 Benchmark dataset
▪ ogbn-mag from Microsoft Academic Graph (MAG)
 Four (4) types of entities
▪ Papers: 736k nodes
▪ Authors: 1.1m nodes
▪ Institutions: 9k nodes
▪ Fields of study: 60k nodes

 Four (4) directed relations
▪ An author is "affiliated with" an institution
▪ An author "writes" a paper
▪ A paper "cites" a paper
▪ A paper "has a topic of" a field of study

 Prediction task
▪ Each paper has a 128-dimensional word2vec feature vector
▪ Given the content, references, authors, and author affiliations from ogbn-mag, predict the venue of each paper
▪ A 349-class classification problem, since 349 venues are considered
 Time-based dataset splitting
▪ Training set: papers published before 2018
▪ Test set: papers published after 2018
 Benchmark results:
▪ SOTA method: SeHGNN
▪ ComplEx (next lecture) + Simplified GCN (Lecture 17)

(Figure: ogbn-mag leaderboard accuracy; the SOTA method SeHGNN outperforms R-GCN)
 Relational GCN: a graph neural network for heterogeneous graphs
▪ Can perform entity classification as well as link prediction tasks
▪ The ideas can easily be extended to other relational GNNs (RGraphSAGE, RGAT, etc.)
▪ Benchmark: ogbn-mag from the Microsoft Academic Graph, to predict paper venues
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
 Graph Attention Networks (GAT):

  h_v^{(l)} = σ( Σ_{u ∈ N(v)} α_{vu} W^{(l)} h_u^{(l-1)} ) ,  where α_{vu} are the attention weights

 Not all of a node's neighbors are equally important
▪ Attention is inspired by cognitive attention
▪ The attention α_{vu} focuses on the important parts of the input data and fades out the rest
▪ Idea: the NN should devote more computing power to the small but important part of the data
 Can we adapt GAT for heterogeneous graphs?
Hu et al. Heterogeneous Graph Transformer. WWW 2020.

 Motivation: GAT is unable to represent different node & edge types
 Introducing a set of neural networks (a weight for rel_1, ..., a weight for rel_N) for each relation type is too expensive for attention
▪ Recall: a relation describes (node_s, edge, node_e), so the number of relation types multiplies quickly
 HGT uses Scaled Dot-Product Attention (proposed in the Transformer):

  Attention(Q, K, V) = softmax(Q K^T / √d) V

 Query Q, Key K, Value V
▪ Q, K, V have shape (batch_size, dim)
 How do we obtain Q, K, V? Apply a Linear layer to the input:
▪ Q = Q_Linear(X)
▪ K = K_Linear(X)
▪ V = V_Linear(X)
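A sketch of scaled dot-product attention with linear Q/K/V projections (plain PyTorch; shapes follow the slide):

import torch
import torch.nn as nn

dim = 32
Q_Linear, K_Linear, V_Linear = (nn.Linear(dim, dim) for _ in range(3))

X = torch.randn(8, dim)  # (batch_size, dim)
Q, K, V = Q_Linear(X), K_Linear(X), V_Linear(X)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / dim ** 0.5            # (8, 8) attention logits
out = torch.softmax(scores, dim=-1) @ V  # (8, dim)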
 Recall: applying GAT to a homogeneous graph, where H^{(l)} is the l-th layer representation
 How do we take the relation type (node_s, edge, node_e) into the attention computation?
 Innovation: Decompose heterogeneous attention into node- and edge-type dependent attention mechanisms
▪ E.g., 3 node weight matrices + 2 edge weight matrices
▪ Without decomposition: 3*2*3 = 18 relation types → 18 weight matrices (supposing all relation types exist)

(Figure: Paper and Author node embeddings pass through node-type-specific Q-Linear/K-Linear layers, while Write and Cite edges contribute edge-type-specific weight matrices)
 Heterogeneous Mutual Attention:
 Each relation (T(s), R(e), T(t)) has a distinct set of projection weights
▪ T(s): type of node s; R(e): type of edge e
▪ T(s) & T(t) parameterize K_Linear_{T(s)} & Q_Linear_{T(t)}, which in turn return the Key and Query vectors K(s) & Q(t)
▪ The edge type R(e) directly parameterizes a weight matrix W_{R(e)}
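A single-head sketch of the type-dependent projections (a simplification of HGT: the real model adds multi-head attention, a relation prior, and softmax normalization over neighbors):

import torch
import torch.nn as nn

dim = 32
node_types, edge_types = ["Paper", "Author"], ["Write", "Cite"]

# K/Q projections are indexed by node type and W by edge type, so the parameter
# count grows with (#node types + #edge types) rather than #relation types.
K_Linear = nn.ModuleDict({t: nn.Linear(dim, dim) for t in node_types})
Q_Linear = nn.ModuleDict({t: nn.Linear(dim, dim) for t in node_types})
W_edge = nn.ParameterDict({e: nn.Parameter(torch.randn(dim, dim)) for e in edge_types})

def attention_score(h_s, T_s, h_t, T_t, R_e):
    K_s = K_Linear[T_s](h_s)  # Key from the source node's type
    Q_t = Q_Linear[T_t](h_t)  # Query from the target node's type
    return (K_s @ W_edge[R_e] @ Q_t) / dim ** 0.5

h_author, h_paper = torch.randn(dim), torch.randn(dim)
score = attention_score(h_author, "Author", h_paper, "Paper", "Write")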
 A full HGT layer
▪ We have just computed the attention
 Similarly, HGT decomposes the weights in the message computation by node & edge types: one set of weights for each node type and one for each edge type
 Benchmark: ogbn-mag from the Microsoft Academic Graph, to predict paper venues
 HGT uses far fewer parameters than R-GCN, even though its attention computation is more expensive, while performing better
▪ Thanks to the weight decomposition over node & edge types
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020

 How do we extend the general GNN design space to heterogeneous graphs?

(Figure: GNN design space — (1) Message, (2) Aggregation, (3) Layer connectivity, (4) Graph augmentation, (5) Learning objective)
 (1) Message computation
▪ Message function: m_u^{(l)} = MSG^{(l)}( h_u^{(l-1)} )
▪ Intuition: Each node creates a message, which will be sent to other nodes later
▪ Example: A Linear layer, m_u^{(l)} = W^{(l)} h_u^{(l-1)}
 (1) Heterogeneous message computation
▪ Message function: m_u^{(l)} = MSG_r^{(l)}( h_u^{(l-1)} )
▪ Observation: A node could receive multiple types of messages; the number of message types equals the number of relation types
▪ Idea: Create a different message function for each relation type r = (u, e, v), where u is the node that sends the message, e is the edge type, and v is the node that receives the message
▪ Example: A Linear layer, m_u^{(l)} = W_r^{(l)} h_u^{(l-1)}
 (2) Aggregation
▪ Intuition: Each node aggregates the messages from its neighbors:

  h_v^{(l)} = AGG^{(l)}({ m_u^{(l)}, u ∈ N(v) })

▪ Example: Sum(·), Mean(·) or Max(·) aggregator, e.g., h_v^{(l)} = Sum({m_u^{(l)}, u ∈ N(v)})
 (2) Heterogeneous Aggregation
▪ Observation: Each node could receive multiple types of messages from its neighbors, and multiple neighbors may belong to each message type
▪ Idea: We can define a 2-stage message passing (see the sketch below):

  h_v^{(l)} = AGG_all^{(l)}( AGG_r^{(l)}({ m_u^{(l)}, u ∈ N_r(v) }) )

▪ Given all the messages sent to a node: within each message type, aggregate the messages that belong to that relation type with AGG_r^{(l)}; then aggregate across the relation types with AGG_all^{(l)}
▪ Example: h_v^{(l)} = Concat( Sum({m_u^{(l)}, u ∈ N_r(v)}) over relation types r )
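The 2-stage aggregation as a sketch (Sum within each relation type, Concat across types):

import torch

def hetero_aggregate(msgs_by_rel):
    # Stage 1: within each relation type, Sum the messages (AGG_r).
    per_rel = [torch.stack(msgs).sum(dim=0) for msgs in msgs_by_rel]
    # Stage 2: across relation types, Concat (AGG_all).
    return torch.cat(per_rel, dim=-1)

d = 8
msgs_by_rel = [
    [torch.randn(d), torch.randn(d)],  # messages from relation r1 neighbors
    [torch.randn(d)],                  # messages from relation r2 neighbors
]
h_v = hetero_aggregate(msgs_by_rel)    # shape (2 * d,)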
 (3) Layer connectivity
▪ Add skip connections, pre/post-process layers
 Pre-processing layers: important when encoding node features is necessary, e.g., when nodes represent images/text
 Post-processing layers: important when reasoning / transformation over node embeddings is needed, e.g., graph classification, knowledge graphs
 In practice, adding these layers works great!
 Heterogeneous pre/post-process layers:
▪ MLP layers with respect to each node type, since the output of a GNN is a set of node embeddings (see the sketch below):

  h_v^{(l)} = MLP_{T(v)}( h_v^{(l)} ) , where T(v) is the type of node v

 Other successful GNN designs are also encouraged for heterogeneous GNNs: skip connections, batch/layer normalization, ...
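A sketch of node-type-specific post-processing (one MLP per node type, selected by T(v)):

import torch
import torch.nn as nn

d = 16
post_mlp = nn.ModuleDict({
    "Paper": nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d)),
    "Author": nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d)),
})

def post_process(h_v, node_type):
    return post_mlp[node_type](h_v)  # h_v <- MLP_{T(v)}(h_v)

h = post_process(torch.randn(d), "Paper")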
 Graph Feature manipulation
▪ The input graph lacks features → feature
augmentation
 Graph Structure manipulation
▪ The graph is too sparse → Add virtual nodes / edges
▪ The graph is too dense → Sample neighbors when
doing message passing
▪ The graph is too large → Sample subgraphs to
compute embeddings
▪ Will cover later in lecture: Scaling up GNNs

 Graph Feature manipulation
▪ 2 Common options: compute graph statistics (e.g.,
node degree) within each relation type, or across the
full graph (ignoring the relation types)
 Graph Structure manipulation
▪ Neighbor and subgraph sampling are also common
for heterogeneous graphs.
▪ 2 Common options: sampling within each relation
type (ensure neighbors from each type are covered),
or sample across the full graph

 Node-level prediction: ŷ_v = Linear( h_v^{(L)} )
 Edge-level prediction: ŷ_{uv} = Linear( Concat( h_u^{(L)}, h_v^{(L)} ) )
 Graph-level prediction: ŷ_G = AGG({ h_v^{(L)}, ∀v ∈ G })
 Heterogeneous node-level prediction: ŷ_v = Linear_{T(v)}( h_v^{(L)} )
 Heterogeneous edge-level prediction: ŷ_{uv} = Linear_r( Concat( h_u^{(L)}, h_v^{(L)} ) )
 Heterogeneous graph-level prediction: ŷ_G = AGG_all( AGG_i({ h_v^{(L)} ∈ ℝ^d, ∀T(v) = i }) )
 Heterogeneous GNNs extend GNNs by separately modeling node/relation types, plus an additional aggregation stage across types

(Figure: GNN design space — (1) Message, (2) Aggregation, (3) Layer connectivity, (4) Graph augmentation, (5) Learning objective)
 Heterogeneous graphs: graphs with multiple node or edge types
▪ Key concept: relation type (node_s, edge, node_e)
▪ Be aware that we don't always need heterogeneous graphs
 Learning with heterogeneous graphs
▪ Key idea: separately model each relation type
▪ Relational GCNs
▪ Heterogeneous Graph Transformer
▪ Design space for heterogeneous GNNs
