2019 - Medium - Tutorial On Graph Neural Networks For Computer Vision and Beyond - by Boris Knyazev
https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 1/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium
A figure from (Bruna et al., ICLR, 2014) depicting an MNIST image on the 3D sphere. While it’s hard to adapt
Convolutional Networks to classify spherical data, Graph Networks can naturally handle it. This is a toy
example, but similar tasks arise in many real applications.
To answer them, I’ll provide motivating examples, papers and Python code making
it a tutorial on Graph Neural Networks (GNNs). Some basic knowledge of machine
learning and computer vision is expected, however, I’ll provide some background
and intuitive explanation as we go.
First of all, let’s briefly recall what a graph is. A graph G is a set of nodes (vertices)
connected by directed/undirected edges. Nodes and edges typically come from
some expert knowledge or intuition about the problem. So, nodes can be atoms in
molecules, users in a social network, cities in a transportation system, players in
a team sport, neurons in the brain, interacting objects in a dynamic physical system,
pixels, bounding boxes or segmentation masks in images. In other words, in many
practical cases, it is actually you who gets to decide what the nodes and edges in
a graph are.
Two undirected graphs with 5 and 6 nodes. The order of nodes is arbitrary.
1. We can become closer to solving important problems that previously were too
challenging, such as: drug discovery for cancer (Veselkov et al., Nature, 2019);
better understanding of the human brain connectome (Diez & Sepulcre, Nature
Communications, 2019); materials discovery for energy and environmental
challenges (Xie et al., Nature Communications, 2019).
2. In most CV/ML applications, data can actually be viewed as graphs even though
you are used to representing them with another data structure. Representing your data as
graph(s) gives you a lot of flexibility and can give you a very different and
interesting perspective on your problem. For instance, instead of learning from
image pixels you can learn from “superpixels” as in (Liang et al., ECCV, 2016)
and in our forthcoming BMVC paper. Graphs also let you impose a relational
inductive bias in data — some prior knowledge you have about the problem. For
instance, if you want to reason about a human pose, your relational bias can be
a graph of skeleton joints of a human body (Yan et al., AAAI, 2018); or if you
want to reason about videos, your relational bias can be a graph of moving
bounding boxes (Wang & Gupta, ECCV, 2018). Another example can be
representing facial landmarks as a graph (Antonakos et al., CVPR, 2015) to
reason about facial attributes and identity.
3. Your favourite neural network itself can be viewed as a graph, where nodes are
neurons and edges are weights, or where nodes are layers and edges denote
flow of forward/backward pass (in which case we are talking about a
computational graph used in TensorFlow, PyTorch and other DL frameworks).
An application can be optimization of a computational graph, neural
architecture search, analyzing training behavior, etc.
4. Finally, you can solve many problems, where data can be more naturally
represented as graphs, more effectively. This includes, but is not limited to,
molecule and social network classification (Knyazev et al., NeurIPS-W, 2018) and
generation (Simonovsky & Komodakis, ICANN, 2018), 3D Mesh classification
and correspondence (Fey et al., CVPR, 2018) and generation (Wang et al., ECCV,
2018), modeling behavior of dynamic interacting objects (Kipf et al., ICML,
2018), visual scene graph modeling (see the upcoming ICCV Workshop) and
question answering (Narasimhan, NeurIPS, 2018), program synthesis (Allamanis
et al., ICLR, 2018), different reinforcement learning tasks (Bapst et al., ICML,
2019) and many other exciting problems.
A figure from (Antonakos et al., CVPR, 2015) showing representation of a face as a graph of landmarks. This
is an interesting approach, but it is not a sufficient facial representation in many cases, since a lot can be told
from the face texture captured well by convolutional networks. In contrast, reasoning over 3D meshes of a
face looks like a more sensible approach compared to 2D landmarks (Ranjan et al., ECCV, 2018).
2. Locality — nearby pixels are closely related and often represent some semantic
concept, such as a wheel or a window. This is exploited by using relatively large
filters, which can capture image features in a local spatial neighborhood.
All these nice properties make ConvNets less prone to overfitting (high accuracy on
the training set and low accuracy on the validation/test set), more accurate in
different visual tasks, and easily scalable to large images and datasets. So, when we
want to solve important tasks where input data are graph-structured, it is appealing
to transfer all these properties to graph neural networks (GNNs) to regularize their
flexibility and make them scalable. Ideally, our goal is to develop a model that is as
flexible as GNNs and can digest and learn from any data, but at the same time we
want to control (regularize) factors of this flexibility by turning on/off certain priors.
This can open research in many interesting directions. However, controlling this
trade-off is challenging.
An image from the MNIST dataset on the left and an example of its graph representation on the right. Darker
and larger nodes on the right correspond to higher pixel intensities. The figure on the right is inspired by
Figure 5 in (Fey et al., CVPR, 2018)
networkx.grid_graph([4, 4]) .
Examples of regular 2D and 3D grids. Images are defined on 2D grids and videos are on 3D grids.
Given this 4×4 regular grid, let’s briefly look at how 2D convolution works to
understand why it’s difficult to transfer this operator to graphs. A filter on a regular
grid has the same order of nodes, but modern convolutional nets typically have
small filters, such as 3×3 in the example below. This filter has 9 values: W₁,W₂,…,
W₉, which is what we are updating during training using backprop to minimize the
loss and solve the downstream task. In our example below, we just heuristically
initialize this filter to be an edge detector (see other possible filters here):
Example of a 3×3 filter on a regular 2D grid with arbitrary weights w on the left and an edge detector on the
right.
When we perform convolution, we slide this filter in both directions: to the right
and to the bottom, but nothing prevents us from starting in the bottom corner — the
important thing is to slide over all possible locations. At each location, we compute
the dot product between the values on the grid (let’s denote them as X) and the
values of filters, W: X₁W₁+X₂W₂+…+X₉W₉, and store the result in the output image.
In our visualization, we change the color of nodes during sliding to match the colors
of nodes in the grid. In a regular grid, we always can match a node of the filter with
a node of the grid. Unfortunately, this is not true for graphs as I’ll explain later
below.
2 steps of 2D convolution on a regular grid. If we don’t apply padding, there will be 4 steps in total, so the
result will be a 2×2 image. To make the resulting image larger, we need to apply padding. See a
comprehensive guide to convolution in deep learning here.
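The sliding dot product described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the post's own code: the helper name `conv2d_valid`, the toy 4×4 input and the particular edge-detector weights are assumptions chosen for the example. As in most deep learning frameworks, the filter is not flipped, so strictly speaking this is cross-correlation.

```python
import numpy as np

def conv2d_valid(X, W):
    """Slide filter W over grid X without padding; store one dot product per location."""
    k_h, k_w = W.shape
    out = np.zeros((X.shape[0] - k_h + 1, X.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = X[i:i + k_h, j:j + k_w]   # grid values under the filter
            out[i, j] = np.sum(patch * W)     # X1*W1 + X2*W2 + ... + X9*W9
    return out

# Toy 4x4 "image" whose values grow from left to right (illustrative assumption)
X = np.arange(16, dtype=float).reshape(4, 4)

# A simple vertical-edge detector (illustrative, not necessarily the post's filter)
W = np.array([[-1., 0., 1.],
              [-1., 0., 1.],
              [-1., 0., 1.]])

out = conv2d_valid(X, W)
print(out.shape)  # 4 filter locations without padding -> a 2x2 output
```

Because every node of the filter lines up with exactly one node of the grid at each location, the indexing `X[i:i + k_h, j:j + k_w]` is all that is needed; on a general graph there is no such slicing.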
The dot product used above is one of the so-called “aggregator operators”. Broadly
speaking, the goal of an aggregator operator is to summarize data in a reduced
form. In our example above, the dot product summarizes a 3×3 matrix to a single
value. Another example is pooling in ConvNets. Keep in mind that methods such as
max or sum pooling are permutation-invariant, i.e. they will pool the same value
from a spatial region even if you randomly shuffle all pixels inside that region. To
be clear, the dot product is not permutation-invariant simply because, in
general, X₁W₁+X₂W₂ ≠ X₂W₁+X₁W₂.
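A quick numerical check of this claim, as a minimal sketch: the nine values, the weights and the particular shuffle below are arbitrary assumptions chosen only to make the comparison concrete.

```python
import numpy as np

# Values inside a 3x3 region (flattened) and filter weights W1..W9 (arbitrary choices)
X = np.array([3., 1., 4., 1., 5., 9., 2., 6., 8.])
W = np.arange(1., 10.)

# A fixed shuffle of the pixels inside the region
perm = np.roll(np.arange(9), 1)
X_shuffled = X[perm]

# Sum and max pooling give the same value before and after shuffling
sum_same = bool(np.isclose(X.sum(), X_shuffled.sum()))
max_same = bool(X.max() == X_shuffled.max())

# The dot product with fixed weights generally does not
dot_same = bool(np.isclose(np.dot(X, W), np.dot(X_shuffled, W)))

print(sum_same, max_same, dot_same)
```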
Now let’s use our MNIST image and illustrate the meaning of a regular grid, a filter
and convolution. Keeping in mind our graph terminology, this regular 28×28 grid
will be our graph G, so that every cell in this grid is a node, and node features are an
actual image X, i.e. every node will have just a single feature — pixel intensity from
0 (black) to 1 (white).
Next, we define a filter and let it be a famous Gabor filter with some (almost)
arbitrary parameters. Once we have an image and a filter, we can perform
convolution by sliding the filter over that image (of digit 7 in our case) and putting
the result of the dot product to the output matrix after each step.
A 28×28 filter (left) and the result of 2D convolution of this filter with the image of digit 7 (right).
This is all cool, but as I mentioned before, it becomes tricky when you try to
generalize convolution to graphs.
Illustration of “convolution on graphs” of node features X with filter W centered at node 1 (dark blue).
For example, for the graph above on the left, the output of the summation
aggregator for node 1 will be X₁=(X₁+X₂+X₃+X₄)W₁, for node 2: X₂=(X₁+X₂+X₃+X₅)W₁
and so forth for nodes 3, 4 and 5, i.e. we need to apply this aggregator for all nodes.
As a result, we will have a graph with the same structure, but node features will now
contain features of neighbors. We can process the graph on the right using the same
idea.
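The summation aggregator can be written as a single matrix product once the graph is stored as an adjacency matrix. The sketch below is an assumption-laden illustration: only the neighborhoods of nodes 1 and 2 are taken from the text, the remaining edges are simply whatever those two neighborhoods imply, and the feature values and the weight `W1` are made up.

```python
import numpy as np

N = 5
# Edges in 1-based numbering; implied by node 1 ~ {2, 3, 4} and node 2 ~ {1, 3, 5}.
# The real figure may contain more edges (assumption).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 5)]

# Start from the identity so that each node also includes its own feature in the sum
A = np.eye(N)
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0  # undirected edge

X = np.array([1., 2., 3., 4., 5.])  # one scalar feature per node (made-up values)
W1 = 0.5                            # a single shared weight (made-up value)

# Row i of A selects node i and its neighbors, so A @ X sums their features
X_new = (A @ X) * W1
print(X_new)
```

Here the first entry is (X₁+X₂+X₃+X₄)W₁ and the second is (X₁+X₂+X₃+X₅)W₁, matching the expressions above; the graph structure `A` itself is unchanged.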
Fully-connected layer with learnable weights W. “Fully-connected” means that each output value in X⁽ˡ⁺¹⁾
depends on, or “connected to”, all inputs X⁽ˡ⁾. Typically, although not always, we add a bias term to the output.
The signal in MNIST is so strong that you can get an accuracy of 91% by just using
the formula above and the Cross Entropy loss without any nonlinearities and other
tricks (I used a slightly modified PyTorch example to do that). Such a model is called
multinomial (or multiclass, since we have 10 classes of digits) logistic regression.
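For readers who want to see what such a model computes, here is a minimal NumPy sketch of the forward pass of multinomial logistic regression: one linear layer, a softmax and the cross-entropy loss. The batch is random data standing in for flattened 28×28 images, and all names, sizes and the initialization scale are assumptions for illustration, not the post's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, K = 4, 28 * 28, 10             # batch size, flattened pixels, digit classes

X = rng.random((B, D))               # stand-in for a batch of flattened images
y = np.array([7, 2, 1, 0])           # stand-in labels
W = rng.normal(scale=0.01, size=(D, K))  # learnable weights
b = np.zeros(K)                      # optional bias term

logits = X @ W + b                   # the fully-connected layer, no nonlinearity
logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

# Cross-entropy: negative log-probability of the correct class, averaged over the batch
loss = -np.log(probs[np.arange(B), y]).mean()
print(probs.shape, float(loss))
```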
Now, how do we transform our vanilla neural network to a graph neural network?
As you already know, the core idea behind GNNs is aggregation over “neighbors”.
Here, it is important to understand that in many cases, it is actually you who
specifies “neighbors”.
Let’s consider a simple case first, when you are given some graph. For example, this
can be a fragment (subgraph) of a social network with 5 people, where an edge
between a pair of nodes denotes that two people are friends (or at least one of them
thinks so). An adjacency matrix (usually denoted as A) in the figure below on the
right is a way to represent these edges in matrix form, convenient for our deep
learning frameworks. Yellow cells in the matrix represent an edge and blue cells the
absence of an edge.
Example of a graph and its adjacency matrix. The order of nodes we defined in both cases is random, while
the graph is still the same.
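To make the caption concrete, the sketch below builds an adjacency matrix from an edge list and then renumbers the nodes. The five-node friendship list is hypothetical (the figure's exact edges are not reproduced here), as is the chosen reordering.

```python
import numpy as np

N = 5
# Hypothetical friendships between 5 people, 0-based node indices
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

A = np.zeros((N, N), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1  # undirected: friendship goes both ways

# Renumber the nodes: new position k holds old node perm[k]
perm = np.array([2, 0, 4, 1, 3])
A_perm = A[perm][:, perm]  # reorder rows and columns consistently

is_symmetric = bool(np.array_equal(A, A.T))
same_degrees = bool(np.array_equal(np.sort(A.sum(axis=1)), np.sort(A_perm.sum(axis=1))))
print(is_symmetric, same_degrees)
```

The two matrices look different cell by cell, yet they describe the same graph: the number of edges and the multiset of node degrees are unchanged.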
Now, let’s create an adjacency matrix A for our MNIST example based on
coordinates of pixels (complete code is provided at the end of the post):
import numpy as np
from scipy.spatial.distance import cdist
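Continuing from these imports, one way such a matrix could be computed is sketched below. This is a hedged illustration rather than the post's exact code: the variable names `img_size`, `coord` and `sigma`, the value of `sigma`, and the use of the squared distance (which makes the kernel a true Gaussian) are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

img_size = 28  # MNIST image width and height

# (x, y) coordinates of all 784 pixels, scaled to the range [0, 1)
col, row = np.meshgrid(np.arange(img_size), np.arange(img_size))
coord = np.stack((col, row), axis=2).reshape(-1, 2) / img_size

# Pairwise Euclidean distances between all pixels: shape (784, 784)
dist = cdist(coord, coord)

sigma = 0.2 * np.pi  # width of the Gaussian (assumed value; worth tuning)

# Closeness: 1 on the diagonal, decaying towards 0 for remote pixels
A = np.exp(-dist ** 2 / sigma ** 2)
print(A.shape, A.min(), A.max())
```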
This is a typical, but not the only, way to define an adjacency matrix for visual tasks
(Defferrard et al., NIPS, 2016; Bronstein et al., 2016). This adjacency matrix is our
prior, or our inductive bias, that we impose on the model based on our intuition that
nearby pixels should be connected and remote pixels shouldn’t or should have a very
thin edge (an edge of a small value). This is motivated by observations that in natural
images nearby pixels often correspond to the same object or objects that interact
frequently (the locality principle we mentioned in Section 2.1.), so it makes a lot of
sense to connect such pixels.
Adjacency matrix (NxN) in the form of distances (left) and closeness (middle) between all pairs of nodes.
(right) A subgraph with 16 neighboring pixels corresponding to the adjacency matrix in the middle. Since it’s
a complete subgraph, it’s also called a “clique”.
So, now instead of having just features X we have some fancy matrix A with values
in the range [0,1]. It’s important to note that once we know that our input is a graph,
we assume that there is no canonical order of nodes that will be consistent across
all other graphs in the dataset. In terms of images, it means that pixels are assumed to
be randomly shuffled. Finding the canonical order of nodes is combinatorially
unsolvable in practice. Even though for MNIST we technically can cheat by knowing
this order (because data are originally from a regular grid), it’s not going to work on
actual graph datasets.
Remember that our matrix of features X has N rows and C columns. So, in terms of
graphs, each row corresponds to one node and C is the dimensionality of node
features. But now the problem is that we don’t know the order of nodes, so we don’t
know in which row to put features of a particular node. If we just pretend to ignore
this problem and feed X directly to an MLP as we did before, the effect will be the
same as feeding images with randomly shuffled pixels with independent (yet the
same for each epoch) shuffling for each image! Surprisingly, a neural network can
in principle still fit such random data (Zhang et al., ICLR, 2017), however test
performance will be close to random prediction. One of the solutions is to simply
use the adjacency matrix A that we created before in the following way:
Graph neural layer with adjacency matrix A, input/output features X and learnable weights W.
We just need to make sure that row i in A corresponds to the features of the node in row i of
X. Here, I’m using 𝓐 instead of plain A, because often you want to normalize A. If
𝓐=A, the matrix multiplication 𝓐X⁽ˡ⁾ will be equivalent to summing features of
neighbors, which turned out to be useful in many tasks (Xu et al., ICLR, 2019). Most
commonly, you normalize it so that 𝓐X⁽ˡ⁾ averages features of neighbors, i.e. 𝓐=A/ΣᵢAᵢ
(each row of A is divided by its sum). A better way to normalize matrix A can be found in
(Kipf & Welling, ICLR, 2017).
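The difference between summing and averaging is easy to see on a tiny example. The 3-node graph and feature values below are made up for illustration; the row normalization is the simple averaging variant described above, not the Kipf & Welling scheme.

```python
import numpy as np

# 3 nodes in a chain, with self-loops on the diagonal (made-up graph)
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])

# N x C feature matrix: one row per node, C = 2 features (made-up values)
X = np.array([[1., 10.],
              [2., 20.],
              [6., 60.]])

# Unnormalized: each row of A @ X is the SUM of features over the node's neighborhood
summed = A @ X

# Row-normalized: divide each row of A by its sum, so A_norm @ X is the AVERAGE
A_norm = A / A.sum(axis=1, keepdims=True)
averaged = A_norm @ X

print(summed)
print(averaged)
```

With averaging, nodes with many neighbors no longer produce larger outputs just because of their degree.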
import torch
import torch.nn as nn

C = 2  # Input feature dimensionality
F = 8  # Output feature dimensionality
W = nn.Linear(in_features=C, out_features=F)  # Trainable weights

# Fully connected layer
X = torch.randn(1, C)  # Input features
Z = W(X)  # Output features: torch.Size([1, 8])

# Graph Neural Network layer
N = 6  # Number of nodes in a graph
X = torch.randn(N, C)  # Input features
A = torch.rand(N, N)  # Adjacency matrix (edges of a graph)
Z = W(torch.mm(A, X))  # Output features: torch.Size([6, 8])
And HERE is the full PyTorch code to train the two models above: python mnist_fc.py
--model fc to train the NN case; python mnist_fc.py --model graph to train the GNN
case. As an exercise, try to randomly shuffle pixels in code in the --model graph
case (don’t forget to shuffle A in the same way) and make sure that it will not affect
the result. Is it going to be true for the --model fc case?
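The property behind this exercise can be checked numerically without training anything. The sketch below uses random features and a random adjacency matrix (both assumptions for illustration) and verifies that shuffling the nodes and shuffling A in the same way simply shuffles the aggregated output.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 6, 2
X = rng.normal(size=(N, C))   # node features (random stand-in)
A = rng.random((N, N))        # adjacency matrix (random stand-in)

perm = np.array([3, 0, 5, 1, 4, 2])  # a fixed shuffle of the nodes

AX = A @ X                           # aggregation in the original node order

# Shuffle the features AND both the rows and columns of A in the same way
AX_perm = A[perm][:, perm] @ X[perm]

# The result is the original output with its rows shuffled
consistent = bool(np.allclose(AX_perm, AX[perm]))

# Shuffling the features but forgetting to shuffle A breaks the correspondence
inconsistent = bool(np.allclose(A @ X[perm], AX[perm]))

print(consistent, inconsistent)
```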
2D visualization of a filter used in a graph neural network and its effect on the image.
So, our graph neural network turned out to be equivalent to a convolutional neural
network with a single Gaussian filter, that we never update during training,
followed by the fully-connected layer. This filter basically blurs/smooths the image,
which is not a particularly useful thing to do (see the image above on the right).
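One way to see the filter hiding in the adjacency matrix is to take a single row of A and reshape it back to the image grid. The sketch below rebuilds a Gaussian adjacency matrix from pixel coordinates with plain NumPy; the value of `sigma`, the squared distance and the choice of the central pixel are assumptions for illustration.

```python
import numpy as np

img_size = 28
col, row = np.meshgrid(np.arange(img_size), np.arange(img_size))
coord = np.stack((col, row), axis=2).reshape(-1, 2) / img_size  # (784, 2)

# Squared Euclidean distances between all pairs of pixels via broadcasting
diff = coord[:, None, :] - coord[None, :, :]
dist2 = (diff ** 2).sum(axis=2)

sigma = 0.2 * np.pi                 # assumed width of the Gaussian
A = np.exp(-dist2 / sigma ** 2)     # (784, 784) adjacency matrix

# Row of A for the pixel at row 14, column 14, viewed as a 28x28 image
center = 14 * img_size + 14
filt = A[center].reshape(img_size, img_size)

# The "filter" is a Gaussian bump centered at that pixel
peak = np.unravel_index(filt.argmax(), filt.shape)
print(peak)
```

Multiplying A by a flattened image therefore replaces each pixel with a Gaussian-weighted sum of all pixels, which is a blur.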
However, this is the simplest variant of a graph neural network, which nevertheless
works great on graph-structured data. To make GNNs work better on regular graphs,
like images, we need to apply a bunch of tricks. For example, instead of using a
predefined Gaussian filter, we can learn to predict an edge between any pair of
pixels by using a differentiable function like this:
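One possible realization of such a differentiable edge function is a small network that takes the coordinates of two pixels and outputs the weight of the edge between them. The sketch below is an assumption, not the post's exact formula or code: the architecture (`Linear`-`ReLU`-`Linear`-`Tanh`), the hidden size, the name `pred_edge` and the small 8×8 grid (used to keep the example light) are all illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

img_size = 8                 # small grid for illustration; MNIST would use 28
N = img_size * img_size

# (x, y) coordinates of every pixel, scaled to [0, 1)
idx = torch.arange(N)
coord = torch.stack((idx % img_size, idx // img_size), dim=1).float() / img_size

# Hypothetical edge predictor: coordinates of a pixel pair -> one edge weight
pred_edge = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Tanh(),
)

# All N x N pairs of coordinates: entry (i, j) holds [coord_i, coord_j]
pairs = torch.cat((coord.unsqueeze(1).expand(N, N, 2),
                   coord.unsqueeze(0).expand(N, N, 2)), dim=2)

A = pred_edge(pairs).squeeze(-1)   # learned adjacency matrix, shape (N, N)

X = torch.rand(N, 1)               # one feature (pixel intensity) per node
Z = torch.mm(A, X)                 # aggregate features with the predicted edges
print(A.shape, Z.shape)
```

Because A is produced by differentiable layers, gradients of the classification loss flow back into `pred_edge`, so the edges are trained jointly with the rest of the model.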
This idea is similar to Dynamic Filter Networks (Brabander et al., NIPS, 2016), Edge-
conditioned Graph Networks (ECC, Simonovsky & Komodakis, CVPR, 2017) and
(Knyazev et al., NeurIPS-W, 2018). To try it using my code, you just need to add the
--pred_edge flag. Below I show the animation of the predefined Gaussian and learned
filters. You may notice that the filter we just learned (in the middle) looks weird.
That’s because the task is quite complicated since we optimize two models at the
same time: the model that predicts edges and the model that predicts a digit class.
To learn better filters (like the one on the right), we need to apply some other tricks
from our BMVC paper, which is beyond the scope of this part of the tutorial.
2D filter of a graph neural network centered in the red point. Averaging (left, accuracy 92.24%), learned
based on coordinates (middle, accuracy 91.05%), learned based on coordinates with some tricks (right,
accuracy 92.39%).
In the next part of the tutorial, I’ll tell you about more advanced graph layers that
can lead to better filters on graphs.
Update:
Throughout this blog post and in the code, the dist variable should have been squared
to make it a Gaussian. Thanks Alfredo Canziani for spotting that. All figures and
results were generated without squaring it. If you observe very different results
after squaring it, I suggest tuning sigma.
Conclusion
Graph Neural Networks are a very flexible and interesting family of neural networks
that can be applied to really complex data. As always, such flexibility must come at
a certain cost. In the case of GNNs, it is the difficulty of regularizing the model by
defining such operators as convolution. Research in that direction is advancing
quite fast, so that GNNs will see application in increasingly wider areas of machine
learning and computer vision.