
Graph Neural Nets

Tasks of GNNs
① Node-level prediction: the structure and position of a node.
② Edge-level prediction: e.g. drug side effects, or graph-based recommender systems.
③ Graph-level prediction: prediction over an entire graph or over subgraphs, e.g. drug discovery, physics simulations.

* the adjacency matrix is extremely sparse.

Connectivity
An undirected connected graph is one that has at least one path between each two nodes.
In directed graphs we have:
- Strongly connected: there is a path from each node to every other node and vice versa.
- Weakly connected: connected if we ignore the direction of the edges.
- Strongly Connected Component (SCC)

AlphaFold's key idea: a spatial graph
- nodes: amino acids
- edges: proximity between amino acids

Bipartite Graph
A graph whose nodes can be divided into two independent sets U and V, such that the nodes of each set only interact with nodes of the other set.
Traditional ML Pipeline
Train an ML model → feed a new graph/node → obtain its features → predict.

Node-level Features
Given G = (V, E), learn a function f: V → ℝ characterizing the properties of a node.

Degree
Downside: makes nodes of the same degree indistinguishable. Degree only counts the neighboring nodes, while centralities also take their importance into account.

A Note on Eigenvector Centrality (why eigenvalues?)
Eigenvector centrality is the weighted sum of the neighbors' centralities. This is a recursive definition, and it is also similar to the definition of eigenvalues:
A v = λ v,  so for each component of v:  λ v_i = Σ_j A_ij v_j
This means that v_i is the weighted sum of the components of its neighbors j, weighted by A_ij, which is similar in definition to the centrality we are looking for.
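A minimal numpy sketch of this recursive definition (assuming an undirected graph given as an adjacency matrix; the function name and toy graph are made up for illustration). Power iteration converges to the leading eigenvector, i.e. the eigenvector centrality:

```python
import numpy as np

def eigenvector_centrality(A, iters=100):
    """Power iteration on the adjacency matrix A (n x n, symmetric).
    Converges to the leading eigenvector c, i.e. lambda * c = A c."""
    n = A.shape[0]
    c = np.ones(n) / n
    for _ in range(iters):
        c = A @ c                      # each node sums its neighbors' centralities
        c = c / np.linalg.norm(c)      # renormalize so the values stay bounded
    return c

# toy 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(eigenvector_centrality(A))       # the middle nodes get higher centrality
```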

Eigenvector Centrality
The more important the neighbors are, the higher the importance of the node. This can be encoded as
λ c = A c
where λ is some constant, A is the adjacency matrix and c is the centrality vector.

Graphlets
Small subgraphs that describe a node's neighborhood. This can be encoded as the vector of the node's degrees in each graphlet, known as the Graphlet Degree Vector (GDV).

Betweenness Centrality
High value if the node lies on many shortest paths between other nodes (a transit hub).

Closeness Centrality
High if a node has a small shortest-path length to all other nodes.

Clustering Coefficient
Measures how connected a node's neighbors are.
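All of these measures are available in networkx (assuming the library is installed); a quick sketch on a built-in toy graph:

```python
import networkx as nx

G = nx.karate_club_graph()              # small built-in social network

eig  = nx.eigenvector_centrality(G)     # importance-weighted neighbors
btw  = nx.betweenness_centrality(G)     # lies on many shortest paths
clo  = nx.closeness_centrality(G)       # short paths to all other nodes
clus = nx.clustering(G)                 # how connected each node's neighbors are

for v in list(G.nodes)[:5]:
    print(v, round(eig[v], 3), round(btw[v], 3), round(clo[v], 3), round(clus[v], 3))
```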
Edge (link-level) Features
Based on existing links, rank all node pairs and select the top K pairs (e.g. in a social network). Two formulations:
1. Links missing at random.
2. Links over time.
The ranking depends on our scoring system:

Local Neighborhood Overlap
Number of common neighbors of the two nodes.
Jaccard's coefficient: |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
If two nodes have no neighbors in common, this would always be 0.

Graph-level Features
A kernel K(G, G') measures the similarity between data points. The kernel must be positive semidefinite. One type of kernel is the Graphlet Kernel, and there are other types as well.
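A plain-Python sketch of the local neighborhood overlap scores above, assuming the graph is stored as a dict mapping each node to its set of neighbors:

```python
def common_neighbors(adj, u, v):
    """Number of shared neighbors of u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """|N(u) ∩ N(v)| / |N(u) ∪ N(v)|."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(common_neighbors(adj, 0, 1))   # 1 (node 2)
print(jaccard(adj, 0, 1))            # 1/3
```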

Global Neighborhood Overlap
Theorem: the number of paths of length l between two nodes is given by A^l, i.e. powers of the adjacency matrix.
Here we use the Katz index: it sums over paths of length 1 up to ∞,
S = Σ_{l=1..∞} β^l · A^l
where β is a discount factor that gives lower importance to longer paths.

Summary
So in traditional ML for graphs, the pipeline looks like this:
input graph → structured features → learning algorithm → prediction
i.e. feature engineering (node, edge, or graph level). Can we do this automatically?
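A numpy sketch of the Katz index using the closed form S = (I − βA)^{-1} − I, equivalent to the infinite sum above (β must be smaller than 1 over the largest eigenvalue of A for the series to converge; the toy graph is made up):

```python
import numpy as np

def katz_index(A, beta=0.1):
    """S = sum_{l>=1} beta^l A^l = (I - beta*A)^{-1} - I.
    S[u, v] is the discounted count of paths of every length between u and v."""
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(np.round(katz_index(A), 3))
```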


Node Embeddings
Similarity in the embedding space must approximate similarity in the graph; this is our goal. The similarity is a pair-wise decoder between two nodes (e.g. the dot product z_u^T z_v).
The simplest approach is to use a shallow encoder, where Z is a look-up table of embeddings for each node.
How do we define node similarity? This is mostly what separates these algorithms.

node2vec
Similar to Random Walk but with a flexible notion of network neighborhood. Two classic strategies of defining a neighborhood:
- BFS gives a local, microscopic view.
- DFS gives a global, macroscopic view.
node2vec can interpolate between the two. It has two hyperparameters:
1. Return (p): how likely we are to return to the previous node.
2. In-out (q): the ratio of BFS vs DFS.
Random Walk
Given a graph and a starting point, we visit a number of nodes by repeatedly stepping to a random neighbor (normalize the transition weights, then "flip a coin").
node2vec performs well on node classification, and random walks are pretty efficient as well.

goal: learn f: u → ℝ^d with f(u) = z_u
objective: max_f Σ_u log P(N_R(u) | z_u)
where N_R(u) is the neighborhood of u obtained by strategy R. We want to learn representations that predict which nodes are found in the random walks.
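A minimal sketch of one node2vec-style biased walk, assuming a dict-of-neighbor-sets graph; p and q are the return and in-out parameters above (unnormalized sampling weights for illustration, not the full node2vec implementation):

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0):
    """One 2nd-order random walk: weight 1/p to return to the previous node,
    1 to stay at distance 1 from it (BFS-like), 1/q to move further away (DFS-like)."""
    walk = [start, random.choice(list(adj[start]))]
    while len(walk) < length:
        prev, cur = walk[-2], walk[-1]
        nbrs = list(adj[cur])
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)      # return to where we came from
            elif x in adj[prev]:
                weights.append(1.0)          # stays close to prev (BFS-like)
            else:
                weights.append(1.0 / q)      # moves outward (DFS-like)
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return walk

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(biased_walk(adj, 0, length=8, p=0.5, q=2.0))
```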

Graph Embeddings
A simple idea is to just average over the embeddings of the nodes in a graph.
Another idea is to sample anonymous walks, embed them, and concatenate their embeddings to get the embedding of the graph. There are more advanced methods (later).

PageRank is solved using power iteration:
1. Initialize r^(0) with some value.
2. Iterate r^(t+1) = M · r^(t).
3. Stop when |r^(t+1) − r^(t)| < ε.

A limitation of node embeddings via random walks (or PageRank) is that we can't obtain embeddings for nodes not seen in the training set.
Another limitation is that these embeddings cannot capture structural similarity: nodes with the same local structure sitting in different parts of the graph do not get similar embeddings.

Graph as a Matrix

PageRank
The idea behind Google: in the old days the web was a collection of static webpages connected with links. "Can we rank webpages?"
webpage → node, link → edge

Solution: treat links as votes; the more incoming links a page has, the more important it is.
This is captured by the stochastic adjacency matrix M: let page j have d_j out-links; if j → i, then M_ij = 1/d_j, so the columns of M sum up to 1.
We define the rank vector r, where r_i is the importance score of page i and Σ_i r_i = 1. Then we have r = M · r.
M
ML with Graphs
An example is classifying nodes in a graph using the labels of some of the nodes (semi-supervised).
Similar to PageRank, the labels of a node are influenced by its neighbors:
- homophily: individual characteristics → social connections
- influence: social connections → individual characteristics
One way is collective classification, where labels are influenced by the 1st-degree neighbors (Markov assumption).
The solution is to use a Probabilistic Relational Classifier: take the weighted average of the labels in the neighborhood.
Each node in a path listens, updates, and passes a message to its neighbors. Each node collects info from its neighbors and considers the prior belief of what label it must have.

Deep Learning in Graphs
The basic message passing can be thought of as an MLP.
A naive approach is to feed the adjacency matrix A to a feed-forward network; the problem is that this is not permutation invariant. Message passing instead uses a permutation-invariant aggregation method.
Two steps: (1) message, (2) aggregation.
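A minimal sketch of the probabilistic relational classifier idea, assuming binary labels, a dict-of-neighbor-sets graph, a few labeled seed nodes, and unit edge weights (so the "weighted average" reduces to a plain mean of the neighbors' beliefs):

```python
def relational_classifier(adj, labels, iters=10):
    """labels: dict node -> 0/1 for the labeled nodes. Returns P(label = 1) per node."""
    belief = {v: labels.get(v, 0.5) for v in adj}     # prior 0.5 for unlabeled nodes
    for _ in range(iters):
        for v in adj:
            if v in labels:                           # keep the ground-truth labels fixed
                continue
            if adj[v]:
                belief[v] = sum(belief[u] for u in adj[v]) / len(adj[v])
    return belief

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(relational_classifier(adj, labels={0: 1, 3: 0}))
```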

Graph Convolutional Network (GCN)
Different algorithms differ in how they aggregate messages and in what messages they pass between nodes.
The idea of a single GNN layer is to compress a set of message vectors into a single vector:
GNN Layer = Message + Aggregation
In GCN the message transformation is linear and the aggregation is a sum.

Graph Attention
We keep weights for the relationship between node v and each of its neighbors u.

1) Message computation
m_u = MSG(h_u); an example could be a linear transformation of the embedding of the node: m_u = W · h_u.

2) Aggregation
Must be order-invariant, as the order of the input nodes should not matter. Examples: Sum(), Mean(), Max().

But this way, when computing h_v, the information we already have about node v itself gets lost! Solution?
Compute a message from node v as well, using a different computation, and for the aggregation concat the two together:
h_v = CONCAT( AGG({m_u : u ∈ N(v)}), m_v )

So, in contrast with other domains, adding more layers to a GNN does not yield better results. We should first analyse how big the receptive field needs to be.
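A minimal numpy sketch of the layer just described: a linear message from every neighbor, an order-invariant mean aggregation, and a concat with the node's own (separately transformed) message. The shapes and weight names are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def gnn_layer(A, H, W_nbr, W_self):
    """One message-passing layer.
    A: (n, n) adjacency matrix, H: (n, d) node embeddings,
    W_nbr, W_self: (d, d_out) message transformations."""
    msgs = H @ W_nbr                                   # m_u = W_nbr · h_u for every node
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    agg = (A @ msgs) / deg                             # mean over neighbors (order-invariant)
    self_msg = H @ W_self                              # separate message from the node itself
    return np.concatenate([agg, self_msg], axis=1)     # h_v = CONCAT(AGG(...), m_v)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
out = gnn_layer(A, H, rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(out.shape)   # (3, 16)
```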

An option would be to use layers in the GNN that don't pass messages (like MLP layers).
The other option is to use skip connections.

A GNN layer could look like this:
Linear transformation → BatchNorm → Dropout → Activation → Attention → Aggregation

Graph Augmentation
A standard way of building a GNN from a layer is to simply stack them on top of each other. In many cases, though, the simple input graph may not be suitable for computation (too sparse / too dense graphs).
The problem of over-smoothing occurs if many layers are stacked: the embeddings of the nodes converge.

Receptive Field
The set of nodes that determine the embedding of a node. As the number of GNN layers increases, the receptive field of a node covers more of the graph.

Feature Augmentation
Sometimes the input graph is just an adjacency matrix and the nodes lack any features.
Solution: add unique one-hot IDs to the nodes, or assign a constant feature to all nodes.

Structural Augmentation
Add virtual nodes/edges (like 2-hop neighbor edges) to a sparse graph.
If the graph is dense, sample a neighborhood during training.
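A tiny sketch of the two feature-augmentation options, assuming a featureless graph with n nodes:

```python
import numpy as np

n = 5   # number of nodes in a featureless graph

# option 1: the same constant feature for every node (cheap, works for unseen nodes)
X_const = np.ones((n, 1))

# option 2: a unique one-hot ID per node (more expressive, but does not generalize
# to unseen nodes and costs O(n) feature dimensions)
X_onehot = np.eye(n)

print(X_const.shape, X_onehot.shape)   # (5, 1) (5, 5)
```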
A key factor in the expressiveness of GNNs is the aggregation function being able to distinguish different variations of neighbors.
The neighbor aggregation function is a function over a multi-set. The most expressive GNN is the one which uses an injective aggregation function. This is also a good way of embedding entire graphs.
Other things, such as loss functions, evaluation metrics etc., are similar to machine learning in general.
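A small illustration of the injectivity point: mean and max cannot tell these two neighbor multi-sets apart, while sum (closer to injective over multi-sets) can:

```python
import numpy as np

# two different multi-sets of (1-d) neighbor features
nbrs_a = np.array([1.0, 1.0, 2.0, 2.0])
nbrs_b = np.array([1.0, 2.0])

print(nbrs_a.mean(), nbrs_b.mean())   # 1.5 1.5  -> indistinguishable
print(nbrs_a.max(),  nbrs_b.max())    # 2.0 2.0  -> indistinguishable
print(nbrs_a.sum(),  nbrs_b.sum())    # 6.0 3.0  -> distinguishable
```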

Train/Test Split in Graphs
Something special about graphs is that there is information leakage, meaning that since nodes are connected, the data split into train and test is not totally independent.

How Expressive are GNNs?
If a graph neural net has no node features, then some of the nodes will be structurally indistinguishable: without features, the computation graphs of node 1 and node 2 are the same.
Relational GCN (RGCN)
A graph where every node and relation is labeled by a type (biomedical or event graphs).
The aggregation is similar to (max) pooling, but now in RGCN we have different weights for each relation type.
Each relation has matrices for each layer of the GNN: W_r^(1), ..., W_r^(L). This means rapid growth of the number of parameters (overfitting). 2 solutions for this:
1. Block diagonal matrices: make the weight matrices sparse.
2. Basis / weight sharing: represent W_r as a linear combination of basis transformations which are shared across all relations:
   W_r = Σ_b a_rb · V_b
   where V_b is a basis matrix and a_rb are the importance weights.

Knowledge Graphs
KGs are represented as triples (h, r, t), where the head (h) has relation (r) with the tail (t). The goal is to make emb(h, r) close to emb(t).

TransE
h + r ≈ t if the fact is true. Scoring function: f_r(h, t) = −‖h + r − t‖.
It is not able to capture symmetric relations.

TransR
Uses a projection matrix M_r to transform from entity space to relation space, and then performs TransE.
Does NOT support composition relations ("my mother's husband is my father").

DistMult
Similar to TransR but uses a different method to score the embeddings; it can be thought of as a cosine similarity between h ∘ r and t.
Cannot model inverse relations, as f_{r2}(h, t) = f_{r1}(t, h) would force r2 = r1.
Also can't model composition relations.

ComplEx
Based on DistMult but embeds the entities into a complex vector space: the score is the real part of Σ_i h_i · r_i · t̄_i, where t̄ is the complex conjugate.
Also can't model composition relations.

** I didn't make notes for KG reasoning (next lesson).
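A numpy sketch of the two scoring functions discussed above, with random toy vectors standing in for learned embeddings:

```python
import numpy as np

def transe_score(h, r, t):
    """f_r(h, t) = -||h + r - t||: close to 0 when h + r lands near t."""
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    """f_r(h, t) = sum_i h_i * r_i * t_i (a bilinear, cosine-like score).
    Note it is symmetric in h and t, which is behind the inverse/antisymmetry
    limitations noted above."""
    return float(np.sum(h * r * t))

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 8))
print(transe_score(h, r, t), distmult_score(h, r, t))
```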


Generative Models for Graphs
Some properties of real-world graphs:
1) Degree distribution P(k): the probability that a randomly chosen node has degree k.
2) Clustering coefficient: how connected the neighbors of a node are (the number of edges among a node's neighbors).
3) Path length (diameter): the maximum, or average, shortest path between any two nodes.

Small World Model
Tries to have high clustering with a low shortest path length (the two oppose each other).
1) Create a small lattice.
2) Rewire and add randomness.

Kronecker Model
The Kronecker product takes every cell of M1 and multiplies it by the whole M2 matrix.
The model iteratively applies the initiator K1: K1 ⊗ K1 ⊗ K1 ⊗ K1.
Kronecker graphs and real graphs are similar; all you need to do is choose the right initiator matrix. (See the small sketch at the end of this section.)

Deep Generative Models for Graphs
task 1: generate graphs similar to a given set of graphs
task 2: generate graphs that optimize given objectives, such as drug discovery
Given p_data(G), learn p_model(G) and sample new graphs from it so that p_model is close to p_data. So we follow Maximum Likelihood:
θ* = argmax_θ E_x [ log p_model(x; θ) ]
Idea: Chain Rule. We want to model a complex distribution over graphs; we don't know how, so we break it down into smaller conditional distributions, which we can also use to generate graphs step by step.
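The small sketch referenced in the Kronecker paragraph above: np.kron is the Kronecker product, so repeatedly applying it to an initiator matrix grows the edge-probability matrix (the initiator values here are made up for illustration):

```python
import numpy as np

K1 = np.array([[0.9, 0.5],
               [0.5, 0.3]])     # hypothetical 2x2 initiator matrix

K = K1
for _ in range(3):              # K1 ⊗ K1 ⊗ K1 ⊗ K1 -> 16x16 edge-probability matrix
    K = np.kron(K, K1)          # every cell of K is multiplied by the whole K1

print(K.shape)                  # (16, 16)
# sample an adjacency matrix by flipping a biased coin per entry (stochastic Kronecker graph)
A = (np.random.random(K.shape) < K).astype(int)
```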

Graph RNN
Generates a graph sequentially by adding nodes and edges. This turns generation into a sequence problem → RNN.
As the generation goes on, a new node can connect to any/all previous nodes, so the model must remember all the nodes and edges → scale problem!
So we use BFS ordering!
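A small sketch of obtaining a BFS node ordering with networkx (assuming a connected graph); this is the ordering that bounds how far back a newly added node can connect:

```python
import networkx as nx

G = nx.karate_club_graph()
start = 0
# BFS order: the start node followed by nodes in the order BFS discovers them
order = [start] + [v for _, v in nx.bfs_edges(G, start)]
print(order[:10])
```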


Scaling up GNNs
If we use mini-batch SGD for training a GNN and we select nodes that are independent of each other, we cannot effectively train the GNN: it takes too long and too much computation.
Also, the computation graph of a single node grows exponentially with depth, especially if we hit a hub node.
Neighbor sampling: at each hop, pick at most H neighbors.
GraphSAGE Neighborhood Sampling
RECALL: GNNs generate node embeddings via neighborhood aggregation.
If the number of layers or the neighborhood is small, we can ignore much of the network; then we can compute gradients in a reliable way and use SGD.
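A plain-Python sketch of per-hop neighbor sampling, assuming a dict-of-neighbor-sets graph; keeping at most H neighbors per node at each hop stops the computation graph from blowing up at hub nodes:

```python
import random

def sample_computation_graph(adj, target, hops=2, H=3):
    """Return {node: sampled neighbors} for the hops feeding into `target`."""
    frontier = {target}
    sampled = {}
    for _ in range(hops):
        next_frontier = set()
        for v in frontier:
            nbrs = list(adj[v])
            keep = random.sample(nbrs, min(H, len(nbrs)))   # at most H neighbors
            sampled[v] = keep
            next_frontier.update(keep)
        frontier = next_frontier
    return sampled

adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}}
print(sample_computation_graph(adj, target=0, hops=2, H=2))
```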
