Nguyen Duy

Some Mathematical Perspectives of
Graph Neural Networks

by
Duy Nguyen
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Mathematics
in
Applied Mathematics
Waterloo, Ontario, Canada, 2022

© Duy Nguyen 2022
Author’s Declaration
I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis,
including any required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
ii
Abstract
Many real-world entities can be modelled as graphs, such as molecular structures, social
networks, or images. Despite coming with such a great expressive power, the complex struc-
ture of graphs poses significant challenges to traditional deep learning methods, which have
been extremely successful in many machine learning tasks on other input data structures,
such as texts and images data. Recently, there have been many attempts in developing
neural network architectures on graphical data, namely graph neural networks (GNNs).
In this thesis, we first introduce some mathematical notations for graphs and different
aspects of training a feedforward neural network. We then discuss several notable GNN
architectures including Graph Convolutional Neural Networks, Graph Attention Networks,
GraphSAGE, and PinSAGE. Some special aspects of GNN training are also presented. Fi-
nally, we investigate a neighborhood sampling approach on PinSAGE to a product-user
recommendation problem.
iii
Acknowledgements
I would like to thank all the little people who made this thesis possible.
iv
Table of Contents
List of Figures viii
List of Tables x
1 Introduction 1
2 Background 3
2.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Machine Learning and Neural Networks Basics . . . . . . . . . . . . . . . . 5
2.2.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Performance Measure . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.4 Parameter Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.5 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.6 Overfitting and Regularization . . . . . . . . . . . . . . . . . . . . . 12
2.3 Fourier Transform and Convolution Theorem . . . . . . . . . . . . . . . . . 13
3 Graph Neural Networks Architecture 15

3.1 Graph Convolutional Network (GCN) . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Spectral Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Spatial Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
v
3.1.3 Efficient Implementation . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 GraphSAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Graph Attention Network (GAT) . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 PinSAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 GNN Training 27
4.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Backpropagation on GCNs . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.2 Skip Connections in GNNs . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Special Training Methods on Graphs . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Mini-Batching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.3 Graph Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.4 Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.5 Regularization on Graphs . . . . . . . . . . . . . . . . . . . . . . . 35
5 Numerical Experiments 37
5.1 Dataset and Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1.1 Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1.2 Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2.1 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2.2 Experiment Environment . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 General Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3.1 Dataset Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3.3 Constructing Samplers . . . . . . . . . . . . . . . . . . . . . . . . . 41
vi
5.3.4 Constructing a Training Batch . . . . . . . . . . . . . . . . . . . . . 41
5.3.5 Scoring and Loss Functions . . . . . . . . . . . . . . . . . . . . . . 42
5.3.6 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4.1 Hit@K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4.2 Evaluation Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Results and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5.1 Baseline Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5.2 Adaptive Restart Probability . . . . . . . . . . . . . . . . . . . . . 44
5.5.3 Variable Number of Random Walks . . . . . . . . . . . . . . . . . . 44
5.6 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Conclusions 51
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
References 53
vii
List of Figures
2.1 A fully-connected neural network with two hidden layers. . . . . . . . . . . 11
3.1 GCN as a spatial architecture - information is passed from a node to its

direct neighbors. Information of node 5 at ℓth layer is passed to nodes 1 and
4 at pℓ ` 1qth layer. The hidden representation of node 2 at pℓ ` 1qth layer is
calculated based on the hidden representation of nodes 1 and 3 at ℓth layer. 19
3.2 Visualization of GraphSAGE. Left: Sampled neighborhood of node 0, N0 “
t1, 3, 4, 5u. Right: Node 0 aggregates information from its neighborhood. 20
3.3 GAT: From hidden representation to attention weights (See Equations (3.22)
- (3.24)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 An illustration of multihead attention (with K = 3 heads) by node 1 on its
neighborhood. Different colors represent different heads. The aggregated
features from each head are concatenated or averaged to obtain h11 , which is
the representation of node 1 at the next layer. The figure is taken from [20]. 23
3.5 Visualization of PinSAGE’s importance-based neighborhood sampling. Left:
Original graph Center: Simulation of N “ 5 random walks from node 0,
each with length L “ 2, top T “ 3 is chosen. Right: Node 0 aggregates
information from its sampled neighborhood. . . . . . . . . . . . . . . . . . 25
4.1 Effect of adding skip connections in graph neural networks: adding a skip
connection (blue arrow) only helps mitigate the vanishing gradient risk from
the node itself. The risk of vanishing gradient from message passing through
its neighbors still remains. Figure is taken from [15] . . . . . . . . . . . . . 29
viii
4.2 DiffPool [23] for graph classifcation problems. At each hierarchical layer, a
GNN is run to obtain embeddings of nodes. The learn learned embeddings
are used to cluster nodes together. Then another GNN layer is run on this
coarsened graph. This whole process is repeated for L layers and the final
output representation is used to classify the graph. . . . . . . . . . . . . . 34
5.1 Numerical result for Experiment 1 (Original PinSAGE), K = 5 . . . . . . . 46

5.2 Numerical result for Experiment 2 (Original PinSAGE), K = 50 . . . . . . 46
5.3 Numerical result for Experiment 3 (Original PinSAGE), K = 200 . . . . . 47
5.4 Experiment 4: Original PinSAGE vs Adaptive Restart Probability method 47
5.5 Experiment 5: Original PinSAGE vs Variable Number of Random Walks
method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
ix
List of Tables
1 Graph Neural Networks Notation . . . . . . . . . . . . . . . . . . . . . . . xi
5.1 User Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2 Movie Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Rating Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 List of Python libraries used and their versions . . . . . . . . . . . . . . . . 39
5.5 Machine configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.6 Hyperparameters for the original PinSAGE (baseline) . . . . . . . . . . . . 48
5.7 Hyperparameters for the Adaptive Restart Probability (ARP) method . . . 49
5.8 Hyperparameters for the Variable Random Walk Lengths (VRWL) method 49
5.9 Random seeds set for libraries/packages that involve randomness . . . . . . 50
x
Table 1: Graph Neural Networks Notation
Notation Description
IN P RN ˆN Identity matrix of dimension N

G “ pV, Eq Graph G with vertex set V and edge set E
N Number of nodes in the graph
u, v Vertices (nodes) in the graph
αuv P R Importance of node v to node u
L Number of layers in the graph neural network
rLs The set t1, .., Lu
npℓq P R Number of hidden units at layer ℓ
N pvq Set of neighboring nodes of v
Ns pvq Set of sampled neighboring nodes of v
L P RN ˆN Laplacian matrix of graph G
A P RN ˆN Adjacency matrix of graph G
p P RN ˆN
A Adjacency matrix with self-loop added
q P RN ˆN
A Normalized adjacency matrix with self-loop added
D P RN ˆN Node degree matrix associated with A
p P RN ˆN .
D Node degree matrix associated with A p
X P RN ˆd Input feature matrix
pLq
Z P RN ˆn Prediction head matrix
pℓq
Hpℓq P RN ˆn Hidden state matrix (output) at layer ℓ
σp¨q A nonlinear activation function
pℓ`1q ˆnpℓq
Wpℓq P Rn Trainable weight matrix at layer ℓ
pℓq pℓq
hv P Rn Hidden state vector of node v at layer ℓ, can also be
referred as embedding of node v at layer ℓ
pℓ`1q
aℓ P R2n Attention mechanism at layer ℓ
d Element-wise multiplication
|| Concatenation operator
˚ Convolution operator
xi
Chapter 1
Introduction
Many real-world entities can be represented as graphs, such as social networks, molecu-
lar structures, or images. For example, a molecule can be viewed as a graph of many
atoms bonded together, and their properties needed to be predicted for drug discovery.
In e-commerce, users and products can be represented as nodes in a graph whose edges
model interaction among them, and the goal is to exploit the complex structure of these
interactions to serve accurate recommendations to users. The first motivation of Graph
Neural Networks (GNNs) comes from the nineties, but they have just been gaining mas-
sive attraction recently, following the success of convolutional neural networks (CNNs) and
recurrent neural networks (RNNs). Many operations on graphs are generalized from their
successful counterparts in CNNs and RNNs, such as graph convolution and graph attention.
Thanks to recent advancement in computing powers, many generalizations, definitions and
extensions have been proposed and experimented, which leaves a spacious room of ex-
ploration and improvements for researchers and practitioners. However, partly because
of that advancement, results in the field of GNNs are almost always of a computational
nature, somewhat lacking rigorous mathematical foundation. There is also discrepancy
in the mathematical representations among GNN architectures, results in confusion from
readers.
In this thesis, we seek to present a systematic way to write down popular GNN architec-
tures, and the mathematical proof/foundation that is lacking in some original works. Chap-
ter 2 introduces useful notations oh graphs, machine learning tasks in general, and different
aspects of training a neural network. This chapter is followed by an overview of several most
notable GNN architectures, including Graph Convolutional Network (GCN), GraphSAGE,
Graph Attention Network (GAT), and PinSAGE. Motivation, strengths, weaknesses and
use cases of each architecture are also discussed. Some specific aspects of training a GNN
1
are presented and discussed in Chapter 4. Finally, we investigate a neighborhood sampling
approach based on PinSAGE to a product recommendation problem, report our findings,
discuss the numerical results, and propose some improvements.
2
Chapter 2
Background
This section introduces some useful definitions and properties of graphs [9] and neural
networks that will be used for subsequent chapters. Firstly, we define graphs and intro-
duce several graph terminologies, including adjacency matrices, node degree matrices, and
Laplacian matrices. Then we provide an overview of basic principles of machine learning
and neural networks, including tasks, performance measure, and training aspects. Finally,
we recall the Fourier transform and convolution theorem, which serve as inspirations for
the graph convolutional operator.
2.1 Graphs
Definition 1 (Graph). A graph G “ pV, Eq is defined as a collection of two sets V and
E, where V is a finite set of nodes and E Ď V ˆ V is the set of edges. Here we say there
is an edge between node u P V and node v P V if pu, vq P E.
The graph G is called a simple graph if there is at most one edge between any pair of
nodes.
The graph G is called an undirected graph if the edges are undirected, i.e. pu, vq P E
if and only if pv, uq P E.
Let v P V be a node of a graph G “ pV, Eq. We define the set of neighboring nodes
of v, N pvq, as follows:
N pvq “ tu P V |pu, vq P Eu.
3
Definition 2 (Adjacency matrix). Let G “ pV, Eq be a simple undirected graph, where
V “ tv1 , . . . , vN u. The adjacency matrix A “ paij q P RN ˆN of the graph G is defined
to be the matrix whose entries are:
#
1, if pvi , vj q P E
aij “
0, otherwise,
for i, j P rN s.
Definition 3 (Node degree matrix). Given a simple undirected graph G “ pV, Eq, we
define its node degree matrix D “ pdij q P RN ˆN as a diagonal matrix whose diagonal
entries are:
N
ÿ
dii “ aij ě 0, i P rN s
j“1
where paij q is the adjacency matrix of G.

Definition 4 (Adjacency matrix with self-connections). Let G “ pV, Eq be a simple undi-
rected graph, where V “ tv1 , . . . , vN u. We define its adjacency matrix with self-
connections A p “ AÌN P RN ˆN , where A is the adjacency matrix of graph G. Explicitly,
aij of A,
the entry p p where i, j P rN s, is
#
1, if pvi , vj q P E or i “ j,
aij “
p
0, otherwise.
Definition 5 (Node degree matrix with self-connections). Let G “ pV, Eq be a simple

undirected graph, where V “ tv1 , . . . , vN u. We define its node degree matrix with
self-connections D p “ pdpij q P RN ˆN as a diagonal matrix whose diagonal entries are:
N
ÿ
dpii “ aij ě 0,
p i P rN s,
j“1
where A aij q is the adjacency matrix with self-connections of the graph G.

p “ pp
Definition 6 (Normalized adjacency matrix with self-connections). Let G “ pV, Eq be a

simple undirected graph, where V “ tv1 , . . . , vN u. We define its normalized adjacency
matrix A q “D p ´ 21 A
pDp ´ 21 P RN ˆN with the convention that 0´ 12 “ 0. Here D
p and A
p are the
node degree matrix with self-connections and the adjacency matrix with self-connections of
the graph G, respectively.
4
Definition 7 (Laplacian matrix). Given a simple undirected graph G “ pV, Eq, the Lapla-
cian matrix L and the symmetrically normalized Laplacian matrix Lsym of graph G are
defined by:
L :“ D ´ A P RN ˆN ,
1 1 1 1
Lsym :“ D´ 2 LD´ 2 “ IN ´ D´ 2 AD´ 2 P RN ˆN ,
where D, A are the node degree matrix and adjacency matrix of the graph G, respectively,
1
with the convention that 0´ 2 “ 0.
Notice that since the graph G is undirected, its adjacency matrix A is symmetric.
Therefore, the Laplacian matrices L and Lsym are also symmetric. Moreover, since L is
1
diagonally dominant, L is positive semidefinite. Since Lsym “ D´1{2 LD´1{2 and both D´ 2
and L are positive semidefinite, Lsym is also positive semidefinite. Using Lsym over L for
messaging passing has several advantages. Firstly, it makes the influence of neighboring
nodes with large degrees of a node u more equal to that of other neighboring nodes.
Moreover, while L and Lsym share the same set of eigenvectors, Lsym has a bounded
spectrum, which ensures several numerical stability properties [9]. More precisely, it can
be proved that 0 “ λ1 ď ... ď λN ď 2 where tλi uNi“1 be the set of eigenvalues of L
sym
[5].
Definition 8 (Laplacian matrix with self-connections). Given a simple undirected graph
G “ pV, Eq, we define the Laplacian matrix with self-connections L
p and the symmetrically
sym
normalized Laplacian matrix with self-connection L
p of graph G as follows:
L
p :“ D
p Á p P RN ˆN ,
p sym :“ D
L p ´1{2 L
pD p ´ 12 A
p ´1{2 “ IN ´ D pDp ´ 12 P RN ˆN
where Dp and Ap are the node degree matrix with self-connections and adjacency matrix with
self-connections, respectively.
For the remaining of the thesis, we will mainly use the symmetrically normalized
Laplacian matrices with self-connections L p sym . The advantages of using matrices with
self-connections will be discussed in Section 3.1.
2.2 Machine Learning and Neural Networks Basics

In this section, we will provide an overview of basic principles of machine learning, deep
learning, and neural networks, based on [7].
5
2.2.1 Tasks
We can categorize machine learning tasks as supervised, unsupervised, or semi- su-
pervised learning.
In supervised learning, we aim to learn an underlying function that maps the input
(feature vectors) to the corresponding output (labels) from a given set of input-output
pairs and make predictions for unseen examples. Two of the most common tasks for
supervised learning algorithms are classification and regression. In classification tasks,
the algorithm aims to predict which of C given categories that an input belongs to. To
make such prediction, the algorithm learns a function f : Rn Ñ t1, . . . , Cu, where n is the
dimension of each input data. For example, the spam filtering problem is a classification
task, where the algorithm may be given a dataset of content of the emails, name of the
sender, etc. and is required to predict whether that email is spam or not. In regression
tasks, the output can be numerical values. To make such prediction, the algorithm try to
learn a function f : Rn Ñ Rm , where n and m are the dimension of the input and output
data, respectively. For example, the house price prediction problem is a regression task,
where each input data contains the properties of a house (size, location, etc.) and the
corresponding output data is the sold price of the house. The regression algorithm will
learn a function (possibly nonlinear) that maps the properties of the house to its sold price
and predict the prices of new houses.
In unsupervised learning, we try to find underlying patterns from a given dataset.
Some unsupervised learning algorithms want to learn the probability distribution that
generated the dataset, while some others aim to divide the dataset into clusters of similar
inputs (clustering). One example is the StackOverflow questions clustering problem in
which we would like to group similar questions into the same cluster to recommend them
to users.
In semi-supervised learning tasks, we are given a dataset of features, where a sub-
set of the dataset is labelled and the rest are unlabeled. For graph inputs, we are also
given a graph that contains nodes and edges, and a semi-supervised task can be done in
transductive setting or inductive setting. In transductive setting, the model is required
to predict the labels of the given unlabeled nodes. One example is the citation network
problem [11], where we are given a graph of research papers as nodes and citation relations
as edges, along with labels (topics) for some research papers. The goal is to predict the
topics of the remaining unlabeled papers. In inductive setting, the model is required to
predict the labels for new nodes from the same distribution [24]. For example, we may train
a graph neural network on a subgraph of the citation network then test its performance
on another subgraph [9].
6
2.2.2 Performance Measure
A performance measure is a metric that can be used to evaluate the abilities of a machine
learning algorithm. The list of choices for performance measures usually depends on the
given task [7].
One of the most common machine learning tasks on graphs is classification. For clas-
sification tasks, there are many metrics that can be used to measure an algorithm’s
performance. One of the most common metric is accuracy. Accuracy is the proportion
of correct predictions, over the total number of predictions. Equivalently, we can also
use error rate, which is the proportion of incorrect predictions, over the total number
of predictions. We can also use precision, recall (sensitivity), specificity, where the
definitions are given below.
The precision of a model is defined as follows:
TP
P recision “ .
TP ` FP
The recall of a model is defined as follows:

TP
Recall “ .
TP ` FN
The specificity of a model is defined as follows:

TN
Specif icity “ .
TN ` FP
Suppose we have a recommendation model where it predict whether a user will buy a
given product or not. A true positive (TP) case is a case where we predict yes, and the
user actually buys it. A true negative (TN) case is a case where we predict no, and the
user actually does not buy it, as expected. A false positive (FP) case is a case where the
user does not buy the product although we predict they do. A false negative (FN) case
is a case where the user buys a product that we predict they do not.
Since there is usually a trade-off among precision, recall and specificity, we may want
to choose to optimize different metrics in different scenarios. We want precision to be high
when we cannot tolerate a false positive case. We want to optimize for recall if we do not
want to miss a positive case. For instance, it is dangerous to miss a virus-infected person
in case of a highly contagious disease. We want to optimize for specificity if we want to
7
avoid false alarms. For example, we want a cancer diagnosis to have high specificity as it
costs a lot of time, money, and even patient’s wellness to undergo cancer treatments.
On the other hand, it can be argued that we should only consider the performance
of the machine learning algorithms on data that it has not seen before. For that reason,
we can divide the given dataset into training set and test set. Training set is used by
the algorithm to learn, while test set is reserved for evaluating the performance of the
machine learning algorithm. In practice, the size of training dataset is usually bigger than
the size of test dataset.
2.2.3 Loss Function

Most machine learning algorithms include solving optimization problems [7]. For example,
in supervised learning settings, we usually want to minimize the loss function that mea-
sures the difference between the truth outputs and the outputs of the learning algorithm.
Intuitively, the loss will be high if the algorithm makes poor predictions, and it will be low
if the algorithm is doing well [13].
For example, in a classification task, we use a neural network which maps the feature
vector xi P Rd to the output zi “ rz1 , .., zC sT P RC as the probability of the ith example
falling onto C different classes, where i P rN s. Denote yi P RC as the one-hot vector
indicating the ground-truth class of the ith example. For instance, if the true class of the
ith example is k P rCs then yi “ ryi1 , . . . , yiC sT P RC where:
#
1, if j “ k
yij “
0, otherwise.
For such classification tasks, minimizing the cross-entropy loss, which is equivalent to
maximize the log-likelihood estimation of the data given the model, is one of the most
popular choices. The cross-entropy loss is calculated as follows:
N ÿ
ÿ C
L“´ yij ln zij . (2.1)
i“1 j“1
For regression tasks, Root Mean Square Error (RMSE) is one of the most commonly
used loss. It measures how far the model’s predictions are from the actual values. Let yi P R
be the actual value of the ith example, ŷi is the predicted value, then the RMSE loss is
8
given by: g
N
f
f1 ÿ
L“ e pyi ´ ŷi q2 . (2.2)
N i“1
2.2.4 Parameter Updates

In this section, we discuss popular parameter updating methods in neural networks using
first-order optimization algorithms. The list includes vanila gradient descent, gradient
descent with momentum, Adagrad, RMSprop, and Adam. First, we introduce some useful
notations.
Let x “ px1 , . . . , xN qT P RN and L : RN Ñ R. Then, the gradient of L with respect to
x is: „ ȷT
BL BL
∇x Lpxq “ pxq, . . . , pxq P RN .
Bx1 BxN
We define ∇x Lpxq d ∇x Lpxq as an element-wise multiplication operation:
«ˆ ˙2 ˆ ˙2 ffT
BL BL
∇x Lpxq d ∇x Lpxq “ pxq ,..., pxq P RN .
Bx1 BxN
Denote t (where t “ 1, 2, . . .) the iteration count and xptq P RN the vector we need to
update. Then the vector xptq can be updated as follows.
• Vanilla gradient descent update:
xptq “ xpt´1q ´ λ∇x Lpxpt´1q q, (2.3)
where λ ą 0 is the learning rate.
• Gradient descent with momentum: A momentum term v P RN is added and

initialized at zero, meanwhile β P r0, 1s is the decay factor that determines the con-
tribution of the current gradient and previous gradients (momentum) to the weight
change. Then the updates are given by:
vptq “ βvpt´1q ` p1 ´ βq∇x Lpxpt´1q q, (2.4)

xptq “ xpt´1q ` vptq . (2.5)
9
• Adagrad: The main idea of this optimization method is to have per-parameter learn-
ing rates. That is, the component that receive high gradients will have their learning
rate reduced over time. Here, cptq P RN is the accumulated magnitude of gradients
over time, and ϵ ą 0 is used to avoid division by zero. For each backprogagation
step, we update
cptq “ cpt´1q ` ∇x Lpxpt´1q q d ∇x Lpxpt´1q q (2.6)

∇x Lpxpt´1q q d ∇x Lpxpt´1q q
xptq “ xpt´1q ´ λ ? , (2.7)
cptq ` ϵ
where the division here is element-wise division. Explicitly, we have
ptq pt´1q pt´1q pt´1q
ci “ ci ` ∇x Lpxi q d ∇x Lpxi q (2.8)
pt´1q
ptq pt´1q ∇x Lpxi q
xi “ xi ´λ b , (2.9)
ptq
ci ` ϵ
´ ¯T ´ ¯T
ptq ptq ptq ptq
where i P rN s, cptq “ c1 , . . . , cN , and xptq “ x1 , . . . , xN .
• RMSprop: Similar to Adagrad, RMSProp also has per-parameter learning rates.

The difference is that RMSprop uses a moving average of squared gradients instead
of the instantaneous squared gradient. The updates are given by
cptq “ βcpt´1q ` p1 ´ βq∇x Lpxpt´1q q d ∇x Lpxpt´1q q (2.10)

pt´1q
∇x Lpx q
xptq “ xpt´1q ´ λ ? , (2.11)
c `ϵ
ptq
where division and square root are both performed element-wise, β P r0, 1s is the
decay factor, and λ is the learning rate. In practice, β is usually set close to 1.
• Adam: Similar to RMSProp, Adam’s method also uses a moving average for squared
gradients. The main difference is that both the moving averages of the gradients
mptq P RN and squared gradients vptq P RN are further smoothed out over time by
a factor of β1 P r0, 1s and β2 P r0, 1s. Usually, β1 is set to 0.9 and β2 is set to 0.99.
10
Both v and m are initialized at zero. The updates are given by
mptq “ β1 mpt´1q ` p1 ´ β1 q∇x Lpxpt´1q q (2.12)

ptq
m
p ptq “
m (2.13)
1 ´ β1t
vptq “ β2 vpt´1q ` p1 ´ β2 q∇x Lpxpt´1q q d ∇x Lpxpt´1q q (2.14)
ptq
v
pptq “
v (2.15)
1 ´ β2t
mp ptq
xptq “ xpt´1q ´ α ? (2.16)
vpptq ` ϵ
where the division and square-root are both element-wise division. In practice, Adam
works really well, and currently it is the recommended algorithm for most problems.
2.2.5 Neural Networks
Figure 2.1: A fully-connected neural network with two hidden layers.
A neural network consists of neurons that are connected in an acyclic graph, where the
outputs of some neurons can become inputs to other neurons. Usually, a neural network
11
is organized into distinct layers of neurons. The input data is fed into the first layer, and
the output (usually the prediction we need to make from the input) is calculated based on
output of the last layer. A neural network is typically initiated with random weights on the
edges between nodes, these weights are then updated by the backpropagation algorithm
repeatedly until the model performs well enough [14].
For example, let x P R3 be an input data and y P R2 be the corresponding output of
a two-hidden layer neural network associated with Figure 2.1. The output of each layer is
given by:
hp1q “ σpW1 x ` b1 q P R4
hp2q “ σpW2 hp1q ` b2 q P R5 (2.17)
y “ W3 hp2q ` b3 P R2 ,
where W1 P R4ˆ3 , W2 P R5ˆ4 , W3 P R2ˆ5 , b1 P R4 , b2 P R5 , b3 P R2 , and σ is a nonlinear
activation function.
In general, a feedforward L-hidden-layer neural network is a function of the form:
´ ` ˘ ¯
f px; θq “ WL σ WL´1 σ . . . σpW1 x ` b1 q . . . ` bL´1 ` bL , (2.18)
pℓq ˆnpℓ`1q pℓq
where θ “ tpWℓ , bℓ quLℓ“1 , Wℓ P Rn , bℓ P Rn , nplq P Z` , and ℓ P rLs.
2.2.6 Overfitting and Regularization

According to [7], the performance of a machine learning algorithms is measured by how
good their outputs for examples in the test set are. It can be decomposed into two factors:
how big the training error is, and how big the gap between training error and test error is.
Underfitting is the phenomenal when the model cannot perform well on training set. One
of the way to combat underfitting is to increase the number of trainable parameters. On
the other hand, overfitting is when there is a huge discrepancy between training error and
test error. Regularization is a set of techniques that can be used to prevent overfitting.
For the remaining of this section, we discuss regularization methods in neural network
settings. To make a neural network better fit a training set, people can increase the number
of learnable parameters or expand the search range of them, thus make the model more
capable of ”memorizing” the training set. On the contrary side, to prevent overfitting, we
can decrease the number of learnable parameters, or restrict the ranges of these parameters.
The idea is inspired from the Occam’s razor principle. This principle states that among
different hypotheses that explain an observation equally well, we should choose the simplest
12
one. Translating to machine learning settings, the idea is to give more preference to simple
or sparse set of parameters, and penalize overcomplex models. Here are several examples:
Suppose we are using cross-entropy loss as in (2.1). We want to regularize all the weight
matrices Wpℓq where ℓ P rLs. If we prefer the weight matrices to be sparse, we can use
L1 regularization. Because of this property, L1 regularization can also be viewed as
a feature selection mechanism [7]. The loss function with incorporated L1 regularization
term is given by:
ÿN ÿC ÿL ÿÿ
pℓq
L“´ yui ln zui ` |Wij |. (2.19)
u“1 i“1 ℓ“1 i j
Another popular choice of regularization is to use the Frobenius norm on the weight ma-
trices, which is also called the weight decay (L2 regularization):
N ÿ
C L
ÿ 1ÿ
L“´ yij ln zij ` }Wpℓq }2F . (2.20)
i“1 j“1
2 ℓ“1
Another regularization technique is to enforce an upper bound on the norm of the learn-
able parameters, called max norm constranint. For example, after a normal gradient
update, we can normalize the weights as follows:
Wpℓq
Wpℓq ÐÝ c , (2.21)
||Wpℓq ||
where division is the element-wise division, and the norm || ¨ || could be the L1 norm, the
spectral norm, or the Frobenius norm.
If we are training a neural network, dropout can also be used. That is, during training,
we randomly and temporarily deactivate neurons with some probability p P r0, 1s. Let’s
take the neural network in Figure 2.1 for example. If we deactivate the first neuron of the
1st hidden layer, then the formula for the 2nd hidden layer will be as follows:
p p1q ` b2 q P R5 ,
hp2q “ σpW2 h (2.22)
p p1q “ rhp1q , 0, hp1q , hp1q sT .
where h 1 3 4
2.3 Fourier Transform and Convolution Theorem

In the next chapter, we will talk about graph convolution operators and Graph Convo-
lutional Networks (GCN). Graph convolution can be considered as an extension of the
13
convolution operator on graph domain, so we start off by introducing Fourier transform
and convolution theorem.
Definition 9 (Fourier and Inverse Fourier Transform). The Fourier transform of a function
f on Rd is given by:
ż
ˆ
Fpf qpξq “ f pξq “ f pxqe´2πix¨ξ dx, ξ P Rd . (2.23)
Rd
Its attached inversion is given by:

ż
F ´1
pfˆqpxq “ fˆpξqe2πix¨ξ dξ, x P Rd . (2.24)
Rd
Theorem 1 (Convolution Theorem). Let f1 and f2 P L1 pRd q. Then the Fourier transform
of the convolution of two functions is the pointwise product of their Fourier transform:
Fpf1 ˚ f2 q “ Fpf1 q d Fpf2 q (2.25)
Therefore, the convolution operator ˚ can be defined as:
f1 ˚ f2 “ F ´1 pFpf1 q d Fpf2 qq. (2.26)
14
Chapter 3
Graph Neural Networks Architecture
In this chapter, we discuss three popular graph neural network architectures, including
Graph Convolutional Network (GCN), GraphSAGE, and Graph Attention Network (GAT).
Here, we consider only simple undirected graphs. We also present below the list of common
used notation.
3.1 Graph Convolutional Network (GCN)

Many GNN architectures rely on convolution to propagate information. The use of con-
volution operators in GNN is inspired by the success of Convolutional Neural Networks
(CNNs) on image and signal processing, such as ImageNet [8], or YOLOv3 [16]. There are
two main approaches in designing convolution operators: spectral approach and spatial ap-
proach. Spectral approach defines the convolution operator in the spectral domain, such as
ChebNet [6] while spatial approach defines convolution based only on graph topology, such
as GraphSAGE [10]. GCN can be viewed as both a spectral and spatial GNN architecture.
Suppose we are given a simple undirected graph G “ pV, Eq (where N “ |V | is the
number of nodes, V “ tv1 , .., vN u), and a node features matrix X P RN ˆd , where d is the
number of features. Let A P RN ˆN be the adjacency matrix of the graph G. It’s assumed
in [11] that self-connections and edges to neighboring nodes are of equal importance, so we
introduce A p “ AÌN , which can be regarded as the adjacency matrix with self-connections.
The use of A p in place of A, which is called ”renormalization trick” by the author of GCN in
[11], was later confirmed to be effective, both theoretically ([21]) and empirically ([12], [21]).
In [21], the authors prove that the renormalization trick shrinks the value of the largest
15
eigenvalue of the corresponding Laplacian matrix L p sym “ IN ´ Dp ´ 12 A
pDp ´ 12 . According to
the author, this acts as a low-pass filter which attenuate signals of higher frequency, thus
produces smooth features over the graph. The result is that neighboring nodes are more
likely to share similar representations and predictions.
3.1.1 Spectral Motivation

Before discussing GCN’s spectral motivation, we recall some useful definitions.
p sym P RN ˆN be the Laplacian matrix
Definition 10 (Graph Fourier Transform). [5, 4] Let L
sym T p sym . A
of a simple undirected graph G, and L
p “ U ΛU be a spectral decomposition of L
graph Fourier transform associated with the graph G and U is F : RN Ñ RN defined
as follows:
Fpxq :“ FU pxq “ UT x, @ x P RN , (3.1)
and the graph inverse Fourier transform associated with the graph G is F ´1 : RN Ñ
RN given by
F ´1 pxq :“ FU
´1
pxq “ Ux, @x P RN . (3.2)
Definition 11 (Graph Convolution). Let w, h P RN . Given graph G and its fixed and
chosen eigenmatrix U [18], we define the graph convolution operator of w and h as
follows:
w ˚G h :“ F ´1 pFpwq d Fphqq (3.3)
It should be noted that there may not be a unique choice for U. However, we still
assume that U is already chosen and fixed [18].
Lemma 1. Let w, h P RN . Then
w ˚G h “ UdiagpUT wqUT h “ wp
p Lp sym qh
where wp
p Lp sym q is a matrix that depends on w, the Laplacian matrix L
p sym , and its orthog-
onal matrix U.
Proof. It follows directly from the Graph Convolution definition.
Different choices of w lead to different spectral GNN architectures, including GCN,

ChebNet, etc. We will show how GCN can be viewed as a spectral method. It should be
noted that there are some differences between our approach and the original approach that
is shown in [11].
16
Lemma 2. Under the settings above, if w “ cUp1N ˆ1 ´ λq where λ “ rλ1 , . . . , λN sT and
c P R, then
w ˚G h “ cDp ´ 21 A
pDp ´ 12 h. (3.4)
p ´ 12 A
p sym “ I ´ D
Proof. Recall that the Laplacian matrix is defined as L pDp ´ 12 (normalized),
which has a fixed set of eigenvectors U.
Consider w “ cUp1 ´ λq where c P R. Then UT w “ cUT Up1 ´ λq “ cp1 ´ λq “ I ´ Λ.
With w defined as such, we can easily verify that UT w “ cp1´λq, and diagp1´λq “ I´Λ,
where Λ “ diagpλq. Thus,
CG ph; wq “ CG ph; cq “ w ˚G h “ UpUT w d UT hq
“ cUpI ´ ΛqUT h
p sym qh
“ cpI ´ L
p ´ 12 A
“ cD pDp ´ 12 h P RN .
Denote A p ´ 12 A
q “D pDp ´ 12 . Then Cph; wq “ cAx.
q
The above equation is for the simplest case where we have only one filter and the
number of channel is also one. We can generalize to more complex cases, where we may
have many filters, and each of them are multi-channel.
Let’s consider a scenario where our input consists of N nodes and d “ np0q channels.
p0q
Instead of x P RN and c P R, now we have an input matrix of np0q features X P RN ˆn
p0q
and learnable filter parameters c “ rc1 , . . . , cnp0q sT P Rn . Let us define the multi-channel
graph convolutional layer operator CG on X as belows:
p ´ 21 A
CG pX; cq “ D pDp ´ 12 Xc P RN (3.5)
p0q
Now, let’s say we have np1q such filters. We stack the filter parameters cpjq P Rn where
p0q p1q
j P rnp1q s horizontally to form a matrix Wp0q P Rn ˆn to obtain:
p ´ 21 A
CG pX; Wp0q q “ D pDp ´ 21 XWp0q P RN ˆnpℓ`1q (3.6)
Note that Equation (3.5) and (3.6) are generalizations of the graph convolution operator
between two vectors. By adding a nonlinear activation σ, we obtain the formula for the
first hidden layer of GCN:
´ ¯ pℓ`1q
0q
p1q
H “ σpCG pX; W qq “ σ AH Wq p0q p0q
P RN ˆn (3.7)
17
p1q p0q
where Hp1q P RN ˆn , Hp0q “ X P RN ˆn are the output and input of the first hidden
layer, respectively. It can be generalized to any layer l P r0, L ´ 1s, by letting Hpℓ`1q be
pℓq pℓ`1q
input to the pℓ ` 1qth layer, Wpℓq P Rn ˆn and stacking L such layers altogether. The
single-layer formula of GCN is given by:
´ ¯
Hpℓ`1q “ σ AH
q pℓq Wpℓq where ℓ “ 0, .., L ´ 2. (3.8)
except for the output layer, where there is no nonlinear activation:
HpLq “ AH
q pL´1q WpL´1q . (3.9)
In summary, the final output of a GCN with L layers is:

´ ` ˘ pL´2q ¯ pL´1q
pLq p0q
H “ Aσ Aσ . . . σpAXW q . . . W
q q q W , (3.10)
where A p ´ 12 A
q “D pDp ´ 12 .
3.1.2 Spatial Motivation

GCN can also be viewed as a spatial architecture. That is, at any particular layer, a node
passes and receives information to all of its neighbors. The details are given below.
pℓq
Lemma 3. Let hvi be the ith row of Hpℓq , i P rN s, where Hpℓq is defined by Equation (3.8).
Then we have: ´ ÿ ¯ pl`1q
hpℓ`1q
vi “ σ W pℓq,T pℓq
h vk P Rn , i P rN s. (3.11)
vk PN pvi q
That is, a node’s hidden state fcor the next layer is an aggregation of information from all
its neighbors at the current layer.
q pℓq Wpℓq and hpℓq the j th entry of hvpℓqk . It is sufficient to prove that
Proof. Denote B “ AH
˜ ¸ kj
ř pℓq
Bij “ Wpℓq,T hvk , for i P rN s and j P rnpl`1q s.
vk PN pvi q
j
aik “ 0 if vk R N pvi q, for i P rN s, j P rnpℓ`1q s, we have:

aik “ 1 if vk P N pvi q and q
Since q
N
´ÿ ¯ ´ ÿ ¯
q plq qij “ pℓq pℓq
pAH aik hkj
q “ hkj .
k“1 vk PN pvi q
18
Figure 3.1: GCN as a spatial architecture - information is passed from a node to its
direct neighbors. Information of node 5 at ℓth layer is passed to nodes 1 and 4 at pℓ ` 1qth
layer. The hidden representation of node 2 at pℓ ` 1qth layer is calculated based on the
hidden representation of nodes 1 and 3 at ℓth layer.
Therefore,
N
ÿ N ´
ÿ ÿ ¯
q plq Wpℓq qij “
Bij “ pAH q pℓq qip wpℓq “
pAH
pℓq
hkp
pℓq
wpj . (3.12)
pj
p“1 p“1 vk PN pvi q
On the other hand,
ÿ ÿ N
”ÿ N
ÿ ıT
pℓq,T pℓq pℓq pℓq pℓq
W hpℓq
vk “ hkp wp1 ... hkp wpN . (3.13)
vk PN pvi q vk PN pvi q p“1 p“1
Hence the j th element of Equation (3.13) is

´ ÿ ¯ ÿ N
ÿ N ´
ÿ ÿ ¯
pℓqT pℓq pℓq pℓq pℓq pℓq
W hk “ hkp wpj “ hkp wpj ,
j
vk PN pvi q vk PN pvi q p“1 p“1 vk PN pvi q
which completes the proof.
19
3.1.3 Efficient Implementation
In previous sections, we see that GCN can be implemented both in graph level as in
(3.8) and node level as in (3.11). In this section, we discuss the pros and cos of each
implementation.
The graph-level implementation as in (3.8) should be used when our priority is to
minimize the number of mathematical operations. It is because we only compute the
pℓq
embedding hvi exactly once [9]. However, this implementation requires the full adjacency
matrix and node features matrix simultaneously. Therefore, it should not be used if memory
is our top concern. Moreover, using graph-level implementation means that we cannot take
advantage of mini-batching.
On the node-level implementation should be considered when we are limited in memory,
or when the input graph is big. We can also use mini-batching to further control the
memory footprint of the GNN.
3.2 GraphSAGE
Figure 3.2: Visualization of GraphSAGE. Left: Sampled neighborhood of node 0,

N0 “ t1, 3, 4, 5u. Right: Node 0 aggregates information from its neighborhood.
20
One drawback of GCN is that its training heavily depends on the adjacency matrix
A. In transductive settings where all nodes are given at first, GCN can work fairly well.
However, in inductive settings where a new node or edge is introduced to the graph, we
must retrain the GCN from scratch. On the other hand, GraphSAGE is an architecture
that does not require the adjacency matrix A, so it works well in both transductive and
inductive settings [10].
Moreover, GraphSAGE presents a novel sampling approach that is useful for processing
very large graphs. In such large graphs, the number of nodes and the size of the neighbor-
hoods are big, thus it is difficult to store and process all the neighborhood information. To
deal with that, GraphSAGE selects only a subset of each node’s neighborhood, ensuring
the size of the neighborhood is reasonble for message passing. [24] [10].
Suppose we are given a graph G of N vertices, and a matrix of node features X P RN ˆd .
pℓq pℓq
Let hu P Rn be the representation of node u at layer ℓ, where Ns pvq is the sampled
pℓqT pℓqT
neighborhood of node v. We stack them vertically to obtain Hpℓq “ rh1 , .., hN s P
pℓq
RN ˆn , which is the output of the ℓ´th layer (ℓ P rLs). Let Gℓ be a differentiable aggregate
pℓq pℓ´1q
function, at layer ℓ. Let’s also denote hNs pvq P Rn as the aggregated message from
pℓq
neighborhood of node v, at layer ℓ. Then the formula for hNs pvq is as follows:
´ ¯
pℓq
hNs pvq “ Gℓ thupℓ´1q , @u P Ns pvqu (3.14)
There can be many choices for G, such as mean, pooling, or long-short term memory
aggregators [10]. For example, if G mean aggregator:
´ ¯
plq pℓ´1q
hNs pvq “ σ WMphu , @u P Ns pvq Y tvu , (3.15)
, M is the mean operator, Mpa1 , .., ak q “ k1 ki“1 ai P Rp for ai P Rp .

pℓq pℓ´1q ř
where W P Rn ˆn
If Ns pvq is further assumed as the set of actual neighboring nodes of v, which is N pvq, then
this is equivalent to Equation (3.11).
The operator G can also be a pooling aggregator:
plq
hNs pvq “ maxtσpWhupℓ´1q ` bq, @u P Ns pvqu, (3.16)
pℓq ˆnpℓ´1q
where W P Rn and max is an element-wise operation:
maxpa1 , .., ak q “ rmaxpa11 , .., ak1 q, .., maxpa1p , .., akp qs P Rp where a1 , .., ak P Rp .
21
pℓq
After choosing an appropriate aggregator for hNs pvq , a single-layer formula of Graph-
SAGE is defined as follow:
´ ¯
pℓq
hpℓq
v “σ W pℓq
rhvpℓ´1q ||hNs pvq s (3.17)
pℓq ˆ2npℓ´1q
where Wpℓq P Rn .
In summary, the output embedding for a node v P V of a GraphSAGE architecture
with two layers is given by:
´ ¯
p1q
hNs pvq “ G1 thup0q , @u P Ns pvqu (3.18)
´ ¯
p1q
hp1q
v “ σ W p1q
rh p0q
v ||hNs pvq s (3.19)
´ ¯
p2q
hNs pvq “ G2 thup1q , @u P Ns pvqu (3.20)
p2q
hp2q p2q p1q
v “ W rhv ||hNs pvq s (3.21)
p0q p2q
where v P V , hu is the input features of node u, and hv is the final embedding output of
node v.
When looking throughout the above equations, we can see that: given a new node v,
we can induce its embedding with only the embeddings of its neighbors. The entire graph
structure is not required, so we do not need to retrain for a new node. This is not possible
if we use Equation (3.8) to implement GCN.
3.3 Graph Attention Network (GAT)

From Equation (3.11), we can see that GCN treats information from all nodes in neigh-
borhood equally. More precisely, embeddings of all nodes vj in neighborhood of vi are
multiplied by the same matrix Wpℓq . Graph Attention Network (GAT) [20] takes a dif-
ferent approach; it adapts attention mechanism to assign and learn different weights for
different neighbors (see Equation (3.25)).
pℓq pℓq
For each u P V “ tv1 , . . . , vN u, ℓ P rLs, let hu P Rn be the representation of node u at
pℓq pℓq pℓ`1q ˆnpℓq
layer ℓ and Hpℓq P RN ˆn be the matrix whose rows are hi , i P rN s. Let Wpℓq P Rn
22
Figure 3.3: GAT: From hidden representation to attention weights (See Equations
(3.22) - (3.24))
Figure 3.4: An illustration of multihead attention (with K = 3 heads) by node 1 on its

neighborhood. Different colors represent different heads. The aggregated features from
each head are concatenated or averaged to obtain h11 , which is the representation of node
1 at the next layer. The figure is taken from [20].
23
be the shared weight matrix that is applied for every node, at the ℓ-th layer. First of all,
the embedding of node u at layer ℓ is transformed linearly:
n pℓ`1q
zpℓq pℓq pℓq
u “ W hu P R . (3.22)
pℓ`1q
Let aℓ P R2n , the attention mechanism, be a learnable sing-layer feedforward neural
pℓq
network. The unnormalized attention score of node u to node v, denoted as euv , is defined
as:
epℓq
uv “ LeakyReLUpa rzu ||zpℓq
pℓq,T pℓq
v sq P R. (3.23)
The normalized attention of node u to node v (or the importance of node v to node u)
pℓq
at layer l, denoted as αuv P R, is defined as:
pℓq
pℓq pℓq exppeuv q
αuv “ softmaxpeuv q“ ř pℓq
P R. (3.24)
exppeuw q
wPN puq
Lastly, the embedding of node u at layer ℓ ` 1 is:

´ ÿ ¯ pℓ`1q
pℓ`1q
hu “σ αuv zv P Rn
pℓq pℓq
. (3.25)
vPN puq
In practice, multiple such attention weights are used simultaneously to stabilize the
learning process [20]. This technique is called multi-head attention, where (3.25) are ap-
plied K times using K different learnable weight matrices. These K outputs are then
concatenated or averaged as follows:
´ ÿ ¯ pℓ`1q
K pℓq pℓq
hpℓ`1q
u “ ||k“1 σ α pℓq,k
uv W k h v P Rn , (3.26)
vPN puq
npℓ`1q
pℓq ˆnpℓq
where K is the number of attention heads, Wk P R K is the weight matrix for the
k th attention head, at ℓth layer, k P rKs.
3.4 PinSAGE
Similar to GraphSAGE, PinSAGE also utilizes the sampling technique for reducing the
size of the neighborhoods. The difference is that while GraphSAGE uses random sampling,
24
Figure 3.5: Visualization of PinSAGE’s importance-based neighborhood sampling.
Left: Original graph Center: Simulation of N “ 5 random walks from node 0, each with
length L “ 2, top T “ 3 is chosen. Right: Node 0 aggregates information from its
sampled neighborhood.
PinSAGE proposes an importance-based sampling method. For a given target node u, it

performs random walks starting from u, count the occurrences of the nodes appearing on
each walk, then finally choose the top T nodes with the highest visit counts. Having a
fixed size of neighborhoods helps us control the memory footprint for our algorithm.
Suppose we are given a graph G of N vertices, and a matrix of node features X P RN ˆd .
pℓq pℓq
Let hu P Rn be the representation of node u at layer ℓ. Let Gℓ be a differentiable
aggregate function, at layer ℓ. For example, Gℓ can be mean/LSTM/pooling. Let’s also
pℓq pℓq
denote hN pvq P Rn as the aggregated message from neighborhood of node v, at layer ℓ.
Let T “ |N pv1 q| “ .. “ |N pvi q| “ ... “ |N pvN q| @i P rN s be the fixed size of neighbor-
hoods for every node. N puq of size T and αpuq P RT for are the set of neighboring nodes,
pℓq pℓq
and set of neighbor weights of node u, respectively. We also use γ : Rn ˆT Ñ Rn to
denote a symmetric vector function. For example, γ can be element-wise mean or weighted
pℓq pℓq pℓ`1q ˆ2npℓq pℓq pℓq
sum. Qpℓq P Rn ˆn , Wpℓq P Rn be learnable weight matrices, q P Rn , w P Rn
pℓq
be learnable bias vectors. The aggregated message from neighborhood of node u, hN puq is
computed as follows:
´ ¯
pℓq pℓq pℓq
hN puq “ γ tσpQ hv ` qq|v P N puqu, αpuq (3.27)
25
The unnormalized vector representation of node u at layer ℓ ` 1 is as follows:
˜ « ff ¸
pℓq
hu
hpℓ`1q
u,un “ σ W
pℓq
¨ pℓq `w (3.28)
hN puq
The final, normalized vector representation of node u at layer ℓ is as follows:

pℓ`1q
hu,un
hupℓ`1q “ pℓ`1q
(3.29)
||hu,un ||2
By looking at the above equations, we can see PinSAGE: 1) controls the memory footprint
of the algorithm during training, since we chose a fixed size T for a node’s neighborhood,
and 2) takes into account the importance of neighbors when aggregating their representa-
tions. By performing random walks, not only we select top T nodes as neighboring nodes
of u, we are able to find their importance (the corresponding element of αpuq) to u as well.
26
Chapter 4
GNN Training
Graph Neural Network aims to extend existing neural networks to process graph input data.
Beside what are similar to a general neural network training framework, in this chapter,
we will go into specific details of several aspects of training a Graph Neural Network.
4.1 Backpropagation
Backpropagation is an algorithm based on gradient descend to optimize the parameters in a
model. Backpropagation in a neural network consists of two steps: forward calculation and
backward propagation. Given a set of parameters and input, forward calculation computes
the values at each neuron in forward order. On the other hand, backward propagation
computes the error at each variable, and update the parameters with the corresponding
partial derivatives in backward order.
4.1.1 Backpropagation on GCNs

In this section, we demonstrate how backpropagation works for GCN. Consider a graph
p2q
G “ pV, Eq, where V “ pv1 , . . . , vN q. Denote Z “ Hp2q P RN ˆn the output of the GCN
p2q
whose ith row zvi “ rzi,1 , . . . , zi,np2q sT P Rn is the output vector of node vi . Similarly, let
pℓq pℓq pℓq pℓq
hvi “ rhi,1 , . . . , hi,npℓq sT P Rn be the embedding vector at layer ℓ of node vi . We start
from Equation (3.8):
Z “ Hp2q “ AH q p1q Wp1q P RN ˆnp2q , (4.1)
27
´ ¯
q p0q Wp0q P RN ˆnp1q ,
Hp1q “ σ AH (4.2)
where Aq P RN ˆN is the normalized adjacent matrix with self-connections of the graph G,

p0q p1q p1q p2q
while Wp0q P Rn ˆn and Wp1q P Rn ˆn are the trainable parameter matrices. Let’s
consider the entry at row i P rN s, column k P rnp2q s of Z:
p2q
ÿ p1q p1q
zik “ hik “ aij hjp wpk , (4.3)
jPrN s, pPrnp0q s
p1q
We recall from Equation (3.11) that the formula for hjp where j P rN s, p P rnp1q s is given
by:
p0q
´ ÿ nÿ ¯
p1q p0q p0q
hjp “ σ hqr wrp , (4.4)
vq PN pvj q r“1
p1q
From Equation (4.3), the partial derivative of zik with respect to wab is as follows:
Bzik ÿ p1q
p1q
“ δbk hja . (4.5)
Bwab jPN pvi q
p0q
To compute partial derivative of zik with respect to wab , we use chain rule as follows:
p1q
Bzik ÿ Bzik Bhjp
p0q
“ p1q p0q
, (4.6)
Bwab j,p Bhjp Bwab
p1q
where the partial derivative of of zik with respect to hjp is:
#
p1q
Bzik wpk , if j P N piq
p1q
“
Bhjp 0, otherwise,
p1q p0q
while the partial derivative of hjp with respect to wab depends on the choice of σ. We
discuss two popular choices, including ReLU and tanh. If σ is ReLU then:
# ř ´ř řnp0q p0q p0q ¯
p1q p0q
Bhjp δbp qPN pjq hqa , if qPN pjq r“1 hqr wrp ą0
p0q
“
Bwab 0, otherwise.
Thus #ř
p1q ř p0q
Bzik w
jPN piq bk qPN pjq,Bpj,bqą0 hqa
p0q
“
Bwab 0, otherwise,
28
Figure 4.1: Effect of adding skip connections in graph neural networks: adding a skip
connection (blue arrow) only helps mitigate the vanishing gradient risk from the node
itself. The risk of vanishing gradient from message passing through its neighbors still
remains. Figure is taken from [15]
where # ř p0q p0q p0q

1, if qPN pjq nr“1 hqr Wrb ą 0
ř
Bpj, bq :“
0, otherwise.
If σ is tanh then:
p1q
Bhjp p1q2
ÿ
p0q
“ δbp p1 ´ hjp q hp0q
qa .
BWab qPN pjq
Thus
Bzik ÿ p1q2 p1q
ÿ
p0q
“ δkb p1 ´ hjb qWbk hp0q
qa . (4.7)
BWab jPN piq qPN pjq
4.1.2 Skip Connections in GNNs

Similar to a feedforward neural network, a graph neural network can also be prone to
the vanishing gradient problem. There are some approaches to mitigate the effect of these
problems, among those adding skip connections is one of the most popular choice. However,
for graph neural networks, it is shown in [15] that while skip-connections can improve depth-
wise backpropagation between the representations of same nod in successive layers, they
29
do not improve the breadth-wise backpropagation between representations of neighboring
nodes. Moreover, it is also shown in the paper that adding skip connections in GNNs
may hurt performance when the task requires learning patterns containing distant nodes.
Therefore, one must be careful when considering adding skip connections to a graph neural
network.
4.2 Loss Functions

There are two main classification problems on graphs, which are node classification and
graph classification. In node classification, we are interested in assigning a class for
each unlabelled node in the graph based on the graph information and the relations of the
target node with its neighbor. For example, in a citation network, we are given a graph
of research papers which are linked to each other via citationships. We need to categorize
the unlabelled papers into different groups. On the other hand, in the graph-focused
classification, we treat each graph as a single data. An example of graph classification is
to classify images into different classes where we consider each image as a graph itself. For
both two classification tasks, cross-entropy loss is frequently used.
Node classification and graph classification are similar in the sense that we seek to
find an embedding vector representation for the example that we want to classify. In
node classification, an example corresponds to one node, while in graph classification, an
example corresponds to a graph.
We consider a general classification task, where we have N training examples E “
tE1 , . . . , EN u. Let zi “ tzi1 , . . . , ziC u be the output embedding vector of the ith example,
and yi “ ryi1 , . . . , yiC s be the one-hot vector indicating the ground-truth class of ith exam-
ple, where C is the number of classes. For example, if the true class of the ith example is
k then yi “ ryi1 , . . . yiC s P RC where:
#
1 if j “ k
yuj “
0 otherwise.
Then the cross-entropy loss for this classfication problem is calculated as follows:
N ÿ
ÿ C
L“´ yij ln zij (4.8)
i“1 j“1
30
In particular, in a node classification problem, an example Ei corresponds to node
ui P V . The output embedding vectors for all nodes are given by an embedding matrix:
Z “ softmaxpHpLq q “ rz1 , .., zN sT P RN ˆC
Then cross-entropy loss for the node classification problem is given by:
N ÿ
ÿ C
L“´ yui ln zui (4.9)
u“1 i“1
In a graph classification problem, an example Ei now corresponds to a graph Gi P T ,

where T “ tG1 , . . . , Gn u is the set of labeled training graphs, and zGi P Rd , yGi P RC
are embedding output and ground-truth class of the ith graph, correspondingly. The cross
entropy loss for the graph classification task is given by:
N ÿ
ÿ C
L“´ yGi ln zGi . (4.10)
j“1 i“1
Another popular learning problem on graphs is link prediction. For example, graph-
based recommender systems considers a graph where users and products are to types of
nodes. By utilizing the similarity among different users and different products, and the
relation between users and products, the recommender make a prediction on which items a
user will likely buy next, thus recommend these items to the user. There are two popular
choices for the associated loss function, which are the cross-entropy loss and the max-
margin loss. Both are often used with negative sampling [9].
pLq
Suppose we are given a simple undirected graph G “ pV, Eq. Denote zu P Rn as the
embedding at the final layer of node u. We define the decoder function D : V ˆ V Ñ R as
follows:
Dpu, vq “ zuT zv . (4.11)
The cross entropy loss with negative sampling for the link prediction problem is given by
ÿ
L“ ´ logpσpDpzu , zv qqq ´ γEvn „Pn,v pVq rlogpσp´Dpzu , zvn qqqs (4.12)
pu,vqPE
where σ is the logistic function, Pn,u pVq denotes a ”negative sampling” distribution over
the set of nodes V which might depend on u, and γ ą 0 is a hyperparameter.
The above formula can be decomposed into two components. The first component,
logpσpDpzu , zv qqq, equals the log-likelihood that we predict ”true” for an actual edge.
31
The second component, Evn „Pn,v pVq rlogpσp´Dpzu , zvn qqqs, equals the expected log-likelihood
that we correctly predict ”false” for an false edge (edge that does not exist in graph). In
practice, the expectation is evaluated using a Monte Carlo approximation and the most
popular form of this loss is
ÿ ÿ
L“ ´ logpσpDpzu , zv qqq ´ rlogpσp´Dpzu , zvn qqqs, (4.13)
pu,vqPE vn PPn,u
where Pn,u is a small set of nodes sampled from Pn,v pVq.

The max-margin loss (sometimes called hinge loss) is calculated as follows:
ÿ ÿ
L“ maxp0, ´Dpzu , zv q ` Dpzu , zvn q ` ∆q, (4.14)
u,vPE vn PPn,u
where ∆ ą 0. Max-margin loss compares the direct output of the decoders. If the score
for the true pair is bigger than the false pair, we have small loss. ∆ is called the margin,
and the loss for a pair will equal 0 if the difference in scores is at least ∆.
4.3 Special Training Methods on Graphs

4.3.1 Mini-Batching
In order to control the memory footprint of a GNN, we can use a technique called mini-
batching. The idea is to run the node-level message passing equations for a subset of
nodes in the graph in each batch. Redundant computations can be avoided through careful
pℓq
engineering to ensure that we only compute the embedding hu for each node u in the batch
at most once.
However, this technique has one significant limitation. That is, we cannot simply run
message passing on a subgraph withotu losing information. Every time we remove a node,
we also remove its edges. Therefore, it is possible that two nodes u, v in the subgraph are
no longer connected, even if they are connected in the full graph. Several strategies have
been proposed to overcome this limitation, including subsampling node neighborhoods of
each node.
4.3.2 Sampling
Simply using the full graph for training is expensive, especially for dense graphs. Moreover,
since each graph is different, adding unseen nodes to a graph would require starting training
32
from scratch. To address these issues, several GNN architecture use different sampling
methods to sample a subset of nodes for message passing. There are many different ways
to perform sampling on graphs, most of them can be put into one of three categories: node
sampling, layer sampling, and subgraph sampling.
Node sampling is to sample a fixed small number of neighbors for each node. For
example, as we can see from (3.20), GraphSAGE performs neighborhood sampling/ and
aggregation of embeddings on the sampled nodes. This keeps the computational and
memory complexity constant with respect to the size of the graph.
PinSAGE is an extension of GraphSAGE on large graphs. Instead of random sam-
pling, it uses an importance-based sampling approach. Importance of a node v to node u
is proportion to the number of times v appears in a random walk from u. Finally, we may
simply select top T nodes with highest importance score to u, or sampling T nodes with
probability proportional to the visit count of each node. This importance-based approach
has a interesting property: a node v which is not a direct neighbor of u can be more impor-
tant to u than some of its direct neighbors. Thus, the sampled neighborhood here is not
just a subset of the ”traditional” neighborhood of u which we can deduce from adjacency
matrix A.
While node sampling methods sample neighbors for each node, layer sampling meth-
ods chooses a small set of nodes for aggregation in each layer to control the expansion
factor. That means the number of neighbors for message aggregation of each node can
be different. FastGCN (Chen et al., 2018b) directly samples the receptive field for
each layer. It uses importance sampling, where the important nodes are more likely to be
sampled. In contrast to fixed sampling methods above, Huang et al. (2018) introduce a
parameterized and trainable sampler to perform layer-wise sampling conditioned on the
former layer. Furthermore, this adaptive sampler could optimize the sampling importance
and reduce variance simultaneously. LADIES (Zou et al., 2019) intends to alleviate the
sparsity issue in layer-wise sampling by generating samples from the union of neighbors of
the nodes.
Meanwhile, subgraph sampling methods sample multigraphs first, then perform
neighborhood search within the sampled subgraphs. ClusterGCN (Chiang et al., 2019)
samples subgraphs by graph clustering algorithms, while GraphSAINT (Zeng et al., 2020)
directly samples nodes or edges to generate a subgraph.
33
Figure 4.2: DiffPool [23] for graph classifcation problems. At each hierarchical layer, a
GNN is run to obtain embeddings of nodes. The learn learned embeddings are used to
cluster nodes together. Then another GNN layer is run on this coarsened graph. This
whole process is repeated for L layers and the final output representation is used to
classify the graph.
4.3.3 Graph Pooling

In convolutional neural networks (CNNs), there are usually pooling layers following a
convolution layers. The idea of pooling layers is to get more general features. Due to the
great success of CNNs in dealing with images, there has been a lot of effort on extending
pooling modules to graph structures.
As a direct extension from CNNs, the node-wise aggregation operators, such as max,
mean, sum, are still some of the most popular choices for designing graph pooling modules.
Apart from that, there are hierarchical pooling modules which utilizes the hierarchical
property of the graph structure, including DiffPool [23]. The idea of DiffPool is to train
two GNNs (A, B) in parallel, at each level. GNN A computes node embeddings for each
node, GNN B computes the cluster that a node belongs to. A uses clustering assignments
from B to aggregate node embeddings for each cluster. These node embeddings are then
used as input for the next pooling layer.
4.3.4 Augmentation
Augmentation is the process of adding features, nodes, or edges to enrich the graph. When
the input graph does not have features, we can assign unique IDs (one-hot vectors) to
nodes. Sometimes there are extra properties that may be useful such as node degrees,
34
cycle counts, or clustering coefficients, we can add them manually as well. On the other
hand, graph augmentation is to add nodes or edges to graphs. When a graph is sparse the
message passing process may be difficult. In that case, we can simply add more virtual
edges. For example, we can use A ` A2 as adjacency matrix. Another option is to add a
virtual node that connect to all nodes to the graph.
4.3.5 Regularization on Graphs

Many of the standard regularization techniques can be applied on graphs, for example: L2
regularization, dropout. Moreover, there are regularization strategies that are specific to
GNNs.
Parameter Sharing Across Layers: For a network that have many layers, specific
set of parameters across layers can be shared. For example, (Li et al., 2015) uses the same
set of weight matrices in Gated Recurrent Units (GRUs) of different layers. The reason
is that the number of parameters grows linearly with the number of layers, thus make
training difficult for massive graphs. Moreover, this parameter sharing technique is also
widely used in multi-relational GNNs. It is because the number of parameters in these
GNNs also scales linearly with the number of relation types, which further overblows the
number of learnable parameters [9]. However, when using this strategy, one needs to be
extremely careful of the risk of vanishing or exploding gradients [15]. By looking at the
end-to-end formula for a GCN:
´ ` ˘ pL´2q ¯ pL´1q
pLq p0q
H “ Aσ Aσ . . . σpAXW q . . . W
q q q W ,
we can see that in case W “ Wp0q “ Wp1q “ . . . “ WpLq , the scale of our output will be
blown up or shrunken by an approximated factor of ||W||L .
Edge Dropout: The idea is to randomly mask edges during training, with the hope
that this will make the network more robust to noise in the adjacency matrix. The neigh-
borhood sampling technique used by GAT can also be considered as a special case of
edge dropout. For example, given a graph G, DropEdge (Yu., 2020) first computes a new
adjacency matrix Ad for each layer,
pℓq 1
Ad “ A ´ A pℓq , (4.15)
1
where A pℓq is a matrix obtained from a subset of edges of G. Then renormalization tricks
are applied to obtain Ã:
1 1
pℓq
Ãrd “ D p pℓq´ 2 ,
p pℓq´ 2 pApℓq ` IqD (4.16)
d
35
p pℓq is the degree matrix associated with pApℓq ` Iq. Then a single-layer formula of
where D d
GCN with edge dropout is given by
pℓ`1q
Hpℓ`1q “ σpÃrd Hpℓq Wpℓq q. (4.17)
36
Chapter 5
Numerical Experiments
In this chapter, we describes several implementations of PinSAGE architecture for a user-

product recommendation problem.
5.1 Dataset and Task

The experiments are conducted on MovieLens-1M dataset, a dataset contains 1 million
ratings from 6000 users on 4000 movies [2].
5.1.1 Data Format

Information about users, movies, and ratings are provided in 3 files: users.dat, movies.dat
and ratings.dat, respectively. User information is in the following format:
U serID :: Gender :: Age :: Occupation :: ZipCode (5.1)
All demographic information is provided voluntarily by the users and is not checked for
accuracy. Only users who have provided some demographic information are included in
this data set. Movie information is in the following format:
M ovieID :: T itle :: Genre (5.2)
Only movies with at least 20 ratings are included. Ratings information is in the following
format:
U serID :: M ovieID :: Rating :: T imestamp (5.3)
37
Table 5.1: User Features
Feature Description
UserID ID of the user, integers in range [1, 6040]

Gender Gender of the user (’M’ for male and ’F’ for female)
Age Age of the user, in 1 of 6 groups (see Appendix)
ZipCode Postal code of the user’s location
Table 5.2: Movie Features
Feature Description
MovieID ID of the movie

Title String of words, taken from IMDB
Genre Genre (one or many, selected from 18 different genres)
(see Appendix)
Table 5.3: Rating Features
Feature Description
UserID ID of the user

MovieID ID of the movie
Rating Integer in [1, 5]
Timestamp Timestamp of the rating, number of seconds passed
from epoch
38
Table 5.4: List of Python libraries used and their versions
Libraries/Packages Version
Python 3.9.7
PyTorch (torch) 1.11.0+cu113
torchtext 0.5.0
numpy 1.20.3
pandas 1.3.4
scipy 1.7.1
dgl 0.6.1
dask 2021.10.0
tqdm 4.62.3
5.1.2 Task
Given a training set including a list of users, a list of movies, and the lists of movies each
user has watched, we would like to predict the next movie a user will watch.
5.2 Environment
5.2.1 Libraries
The end-to-end model pipeline, including preprocessing data, building models, training
models, and performance evaluation are written in Python. The full list of Python libraries
we used to produce experimental results is listed in the table below. Among them, PyTorch
is a Python-based scientific computing package serving as a replacement for NumPy to use
the power of GPUs and other accelerators, and an automatic differentiation library that is
useful to implement neural networks [3]. Deep Graph Library (DGL) is a Python package
built for easy implementation of graph neural network model family [1], on top of PyTorch.
5.2.2 Experiment Environment

All the experiments are conducted using a machine with the following configurations:
39
Table 5.5: Machine configuration
Specification Configuration
Processor Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz,

2.496Mhz, 4 Cores, 4 Logical Processors
Memory 8GB
System type 64-bit operating system, x64-based processor
GPU NVIDIA GeForce GTX 1050
Operation System Windows 10 Pro, build 18363.1556
To accelerate the training, we utilized the power of GPU for parallel computing. In order
to do that, CUDA Toolkit (version 11.6) was downloaded and installed from NVIDIA’s
website.
5.3 General Approach

A bipartite graph G “ pV, Eq is constructed from the dataset, where V “ VU Y VM , VU
is the set of nodes representing users, VE is the set of nodes representing movies; and an
edge pu, vq between user u P VU and item v P VE means that user u watches movie v.
5.3.1 Dataset Split

Dataset is split into 3 disjoint subsets: train, validation, and test. For each user u P VU , let
M puq “ tv1 , . . . , vnu u where nu “ |M puq| be the list of movies that user u watches, sorted
by ascending order of timestamp. For any u such that |M puq| ě 2, the training portion
corresponding to user u consists of all the edges from u, except the last two:
Eutrain “ tpu, v1 q, . . . , pu, vnu ´2 qu (5.4)
The validation portion corresponding to user u consists of the second last movie watched
by user u:
Euval “ tpu, vnu ´1 qu, (5.5)
and the test portion corresponding to user u consists of the last movie watched by user u:
Eutest “ tpu, vnu qu. (5.6)
40
Ť
The training graph is constructed as GT “ pVT , ET q where VT “ V and ET “ Eutrain .
uPVU
For u P VU such that |M puq| “ 0, we do not have validation and test portions. For u P VU
such that |M puq| “ 1, we do not have validation portion.
5.3.2 Preprocessing
We keep all the features from Users and Movies tables. Regarding data types, there are two
types of features in the dataset: numeric features and text features. Numeric features are
already ready for training. Text features (movie titles) are embeded into numeric vectors
using a single-layer neural network.
For Ratings table, we omit the ratings. Timestamps are only used to split the dataset
into train/val/test. That means our graph G is an unweighted graph.
5.3.3 Constructing Samplers

Designing a good neighborhood sampler is a crucial part for the experiments. PinSAGE
uses importance-based neighborhood sampling to choose a fixed-sized neighborhood for a
graph node. For a given target node u, it performs random walks starting from u, then
counts the occurrences of the nodes appearing on each walk, then finally choose the top T
nodes with the highest visit counts.
A neighborhood sampling procedure consists of Nw random walks with restarts, all
start from the source node u. The length for each walk is fixed and denoted as Nl . A
walker is also equipped a restart capability, with probabillity pr P r0, 1s. When a restart
occurs, the walker is forced come back to the source node, no matter where it was currently
standing. The idea of introducing pr is to give more preference to the nodes which are
closer to the source node.
5.3.4 Constructing a Training Batch

Since the input graph is to big, we train the model using minibatches. Each batch B
consists of a positive graph PB “ pVBP , EBP q and a negative graph NB “ pVBN , EBN q where
VBP , VBN Ď V and EBP , EBN Ď E.
The graphs are constructed from NB triplets, each triplet tr “ ph, p, tq where h, p, t P V .
These triplets are chosen as follows. First, NB movies MH “ tm1 , .., mNB u Ď V are
41
drawn randomly and without replacement. These movies can be referred as the heads
of the triplets. Next, NB movies MP “ tp1 , .., pNB u Ď V are chosen randomly such that
dpmi , pi q “ 2 where dp¨, ¨q is the shortest distance between two nodes in the graph. In words,
we are chosen pi such that pi is co-interacted with mi by some users. These movies are
referred as positive tails of the triplets. Finally, other NB movies MN “ tn1 , .., nNB u Ď V
are chosen randomly such that dpmi , ni q ą 2, or in words, ni are chosen such that ni is
not co-interacted with mi by any user. The movies are referred as the negative tails of the
triplet.
The set of nodes of the positive graph PB is the union of the heads and positive tails,
VBP “ MH Y MP . The corresponding set of edges is EBP “ tpu, vq|u P MH , v P MP u.
Similarly, the set of nodes of the positive graph NB is the union of the heads and positive
tails, VBN “ MH Y NP . The corresponding set of edges is EBN “ tpu, vq|u P MH , v P MN u.
We choose Adam as the training optimizer.
5.3.5 Scoring and Loss Functions

Let u, v P VM . The score of an item pair is calculated as:
spu, vq “ hTu hv P R (5.7)
where hu , hv P RpLq are the output are the output embedding vectors of node u and node v
The score of a graph R “ pVR , ER q (can be positive or negative) is calculated as follow:
ÿ
sG pRq “ spu, vq P R (5.8)
pu,vqPVR
The loss over a training batch is given by the difference of its negative graph’s score
and its positive graph’s score:
LpBq “ sG pNB q ´ sG pPB q P R (5.9)
5.3.6 Model Architecture

We use a PinSAGE architecture [22] of two layers, including one hidden layer and the
output layer. The dimension of the hidden layer is 16.
42
5.4 Evaluation
For this task, we aim to predict the next movie a user will watch. For each user u P VU ,
let lu P VM be the last movie that user u watched. The model returns K movies that are
nearest neighbors of lu (in terms of Euclidean distance) and serve them as recommendations
to user u.
5.4.1 Hit@K
For a given user u P VU , let au P VM be the next movie that u will watch, and Ru be the
set of recommended movie for u, where |Ru | “ K. The score of our model corresponding
to user u is given as: #
1, au P Ru
sU puq “
0, otherwise.
The Hit@K (hit at K) metric of the model is calculated as:
ÿ
Hit@K “ sU puq, (5.10)
uPVU
or in words, the number of users whose relevant items are in the recommended list of size
K.
5.4.2 Evaluation Settings

After each epoch, the model is evaluated against the validation set and the test set. Hit@K
values are collected and charted.
5.5 Results and Extensions
5.5.1 Baseline Model

Following the PinSAGE architecture outlined in [22], a PinSAGE-based recommendation
model was implemented and evaluated. The hyperparameters which are used for the model
can be found in Table 5.6. The evaluation results are reported in Figure 5.1, 5.2, 5.3.
43
In the original PinSAGE model, some hyperparameters are fixed throughout the train-
ing phase, including restart probability, and number of random walks. It can be argued
that there is no perfect hyperparameter value for all minibatch graphs: the optimal values
are different and should depend on the size, or density of the graph.
In next sections, we propose two possible improvements to the original model. The
ideas are very simple, they are just there to test the hypotheses that we propose. These
improvements are deterministic and involve no randomness. Furthermore, for all the ex-
periments, we keep the random seed of all the libraries/packages the same and equal to 1,
so the result can easily be reproducible if necessary. More details can be found in Table
5.9.
5.5.2 Adaptive Restart Probability

Hypothesis: The optimal choice for restart probability of the random walks depends on
the density of the graph.
In this experiment, we make the random walk restart probability depends on the density
of the graph. Let p0 be the initial probability value, |Em | be the number of edges of the
minibatch graph, and |E| be the number of edges in the original graph, then the new
restart probability for random walks performing on that graph is calculated as:
|Em |
p “ p0 ´ (5.11)
|E|
Under our training settings, with different hyperparameters, |Em | can range anywhere from
1,000 to 100,000. For the MovieLens dataset, |E| « 1, 000, 000. The set of hyperparameters
ued for this model is reported in Table 5.7. The result of our experiment is shown in Figure
5.4.
5.5.3 Variable Number of Random Walks

Hypothesis: The optimal choice for number of performed random walks depends on the
density of the graph.
In this experiment, we make the number of random walks depends on the density
of the graph. The idea is that a more dense graph needs more random walks to identify
44
appropriate neighborhoods for the nodes. Let |Em | be the number of edges of the minibatch
graph. Then the number of random walks that we perform is given by:
#
10 if |Em | ă 50, 000
Nr “
30 otherwise.
We compare this new model with the original model where Nr is fixed at 20. The results
are shown in Figure 5.5, and reported in Table xx. The set of hyperparameters ued for
this model is reported in Table 5.8. The new model does not quite outperform the original
one. It is expected since once we perform a reasonably enough number of random walks,
it does not matter whether do it several more or less times.
5.6 Observations
We recall that the training set consists of all but last 2 movies (ordered by timestamp) for
each user. The validation set consists of the second last movie, and the test set consists of
the last movie that each user watches. Therefore, it is expected that in most of the cases,
a model’s performance on validation set is better than test set. Our experiments confirms
that empirically, for different values of K. From these experiments, we can conclude that
in a recommender system task, time sensitiveness of data is very important. Predicting
the next movie is much more easier than predicting the movie after that.
It can also be seen from the numerical experiments that as K getting bigger, the Hit@K
metric on the validation set and test set are getting much more closer. This suggests us that
our model is effective: most of our correct predictions are on top of the recommendation
lists.
It can be seen from Experiment that the ARP model outperforms the original model
with a fixed restart probability, even though the modification made is very simple and does
not come with additional parameters. This result suggests that the optimal choice restart
probability of the random walk algorithm should depend on the density of the edges. A
more dense graph should have lower restart probability so we can more effectively take
advantages of information from all the neighboring nodes.
From the Experiment, we can see that the VNRW model is somewhat better than the
original model with a fixed number of random walks, but the result is not so convincing.
There are multiple times during training that the VNRW is behind, and the variance in
performance gain/loss among training iterations is large. One possible explanation is that
once we perform a reasonably enough number of random walks, it does not matter whether
do it several more or less times.
45
Figure 5.1: Numerical result for Experiment 1 (Original PinSAGE), K = 5
46
Figure 5.4: Experiment 4: Original PinSAGE vs Adaptive Restart Probability method
47
Figure 5.5: Experiment 5: Original PinSAGE vs Variable Number of Random Walks
method
Table 5.6: Hyperparameters for the original PinSAGE (baseline)
Hyperparameter Value
Random walk length 2

Random walk restart probability 0.5
Number of random walks 20
Neighborhood size 3
Number of layers 2
Hidden dimensions 16
Batch size 32
Number of epochs 100
Number of batches per epoch 1000
Learning rate 5E-5
K 5
48
Table 5.7: Hyperparameters for the Adaptive Restart Probability (ARP) method

Number of random walks 20
Neighborhood size 3
Number of layers 2
Batch size 32
Learning rate 5E-5
K 5
Table 5.8: Hyperparameters for the Variable Random Walk Lengths (VRWL) method

Number of random walks (low) 10
Number of random walks (high) 30
Neighborhood size 3
Number of layers 2
Batch size 32
Learning rate 5E-5
K 5
49
Table 5.9: Random seeds set for libraries/packages that involve randomness
Package/Library Command
DGL dgl.seed(1)
DGL Random dgl.random.seed(1)
PyTorch torch.manual seed(1)
PyTorch CUDA torch.cuda.manual seed(1)
PyTorch CUDA (multiple GPUs) torch.cuda.manual seed all(1)
PyTorch Backend torch.backends.cudnn.deterministic=True
NumPy Random np.random.seed(1)
Python Random random.seed(1)
50
Chapter 6
Conclusions
6.1 Summary
The contribution of this thesis is two fold. First, we present a systematic way to write
down popular GNN architectures, and the mathematical proof/foundation that is lacking
in some original works. We also discuss the pros and cons and possible use cases of each
architecture. Second, we implement a PinSAGE architecture to build a recommendation
model for a user-product recommendation problem, propose additional improvements and
run additional experiments to improve the result of the baseline model. Our experiements
show that PinSAGE can be well adapted to a large-scale recommendation problem, and
there are still some areas for improvements.
6.2 Future Work

It is already shown in [22] that PinSAGE can be used in web-scale recommendation system.
However, in this work, due to limitation in computational power, we can only run the
experiments on MovieLens-1M dataset. We believe that having an opportunity to run the
experiments on bigger datasets, such as MovieLens-10M, MovieLens-25M, MovieLens-1B
would open the door for more insights and widen the room for more improvement ideas.
In this work, I proposed hypotheses that the restart probability and number of random
walks should be better adaptive to the sampled graph. I believe the same thing can be
said for other hyperparameter of a random walk strategy. Planned future work will be
51
directed in this direction, aside from exploring other innovative ways to sample a node’s
neighborhood.
52
References
[1] Dgl.ai official website. https://docs.dgl.ai/. Accessed: 2022-05-01.
[2] Movielens 1m dataset. https://grouplens.org/datasets/movielens/1m/. Ac-

cessed: 2022-04-18.
[3] Pytorch official website. https://pytorch.org/tutorials/beginner/deep_

learning_60min_blitz.html. Accessed: 2022-05-01.
[4] X. Allan Chen. Understanding spectral graph neural network. 12 2020.
[5] Fan R. K. Chung. Spectral Graph Theory. 1997.
[6] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural
networks on graphs with fast localized spectral filtering. Advances in Neural Informa-
tion Processing Systems 29, 8:175–191, 2016.
[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016. http://www.deeplearningbook.org.
[8] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning
on large graphs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 30. Curran Associates, Inc., 2017.
[9] William L. Hamilton. Graph representation learning. Synthesis Lectures on Artificial

Intelligence and Machine Learning, 14(3):1–159.
[10] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning
on large graphs. CoRR, abs/1706.02216, 2017.
53
[11] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convo-
lutional networks. ICLR 2017.
[12] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Personalized

embedding propagation: Combining neural networks on graphs with personalized
pagerank. CoRR, abs/1810.05997, 2018.
[13] Fei-Fei Li. Lecture notes in cs231n: Convolutional neural networks for visual recogni-
tion.
[14] Zhiyuan Liu and Jie Zhou. Introduction to Graph Neural Networks. Synthesis Lectures
on Artificial Intelligence and Machine Learning. 2020.
[15] Denis Lukovnikov and Asja Fischer. Improving breadth-wise backpropagation in graph
neural networks helps learning long-range dependencies. In Marina Meila and Tong
Zhang, editors, Proceedings of the 38th International Conference on Machine Learning,
volume 139 of Proceedings of Machine Learning Research, pages 7180–7191. PMLR,
18–24 Jul 2021.
[16] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement, 2018.
[17] Luis G. Serrano. Grokking Machine Learning. Manning Publications, November 2021.
[18] David I Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Van-
dergheynst. The emerging field of signal processing on graphs: Extending high-
dimensional data analysis to networks and other irregular domains. IEEE Signal
Processing Magazine, 30(3):83–98, 2013.
[19] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Proceedings
of the 18th Annual Conference on Learning Theory, COLT’05, page 545–560, Berlin,
Heidelberg, 2005. Springer-Verlag.
[20] Petar Velivcković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò,
and Yoshua Bengio. Graph attention networks. 2017.
[21] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Wein-
berger. Simplifying graph convolutional networks. In Kamalika Chaudhuri and Ruslan
Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine
Learning, volume 97 of Proceedings of Machine Learning Research, pages 6861–6871.
PMLR, 09–15 Jun 2019.
54
[22] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton,
and Jure Leskovec. Graph convolutional neural networks for web-scale recommender
systems. CoRR, abs/1806.01973, 2018.
[23] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and
Jure Leskovec. Hierarchical graph representation learning with differentiable pooling.
CoRR, abs/1806.08804, 2018.
[24] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu,
Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review
of methods and applications. AI Open, 1:57–81, 2020.
55

Nguyen Duy

Uploaded by

Copyright:

Available Formats

Nguyen Duy

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Nguyen Duy

Uploaded by

Copyright:

Available Formats

Some Mathematical Perspectives of

Graph Neural Networks

Waterloo, Ontario, Canada, 2022

I understand that my thesis may be made electronically available to the public.

List of Figures viii

3 Graph Neural Networks Architecture 15

2.1 A fully-connected neural network with two hidden layers. . . . . . . . . . . 11

3.1 GCN as a spatial architecture - information is passed from a node to its

5.1 Numerical result for Experiment 1 (Original PinSAGE), K = 5 . . . . . . . 46

1 Graph Neural Networks Notation . . . . . . . . . . . . . . . . . . . . . . . xi

5.1 User Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

IN P RN ˆN Identity matrix of dimension N

where paij q is the adjacency matrix of G.

Definition 5 (Node degree matrix with self-connections). Let G “ pV, Eq be a simple

where A aij q is the adjacency matrix with self-connections of the graph G.

Definition 6 (Normalized adjacency matrix with self-connections). Let G “ pV, Eq be a

2.2 Machine Learning and Neural Networks Basics

The recall of a model is defined as follows:

The specificity of a model is defined as follows:

2.2.3 Loss Function

2.2.4 Parameter Updates

• Vanilla gradient descent update:

xptq “ xpt´1q ´ λ∇x Lpxpt´1q q, (2.3)

where λ ą 0 is the learning rate.

• Gradient descent with momentum: A momentum term v P RN is added and

vptq “ βvpt´1q ` p1 ´ βq∇x Lpxpt´1q q, (2.4)

cptq “ cpt´1q ` ∇x Lpxpt´1q q d ∇x Lpxpt´1q q (2.6)

• RMSprop: Similar to Adagrad, RMSProp also has per-parameter learning rates.

cptq “ βcpt´1q ` p1 ´ βq∇x Lpxpt´1q q d ∇x Lpxpt´1q q (2.10)

mptq “ β1 mpt´1q ` p1 ´ β1 q∇x Lpxpt´1q q (2.12)

2.2.5 Neural Networks

Figure 2.1: A fully-connected neural network with two hidden layers.

2.2.6 Overfitting and Regularization

2.3 Fourier Transform and Convolution Theorem

Its attached inversion is given by:

Fpf1 ˚ f2 q “ Fpf1 q d Fpf2 q (2.25)

Therefore, the convolution operator ˚ can be defined as:

f1 ˚ f2 “ F ´1 pFpf1 q d Fpf2 qq. (2.26)

Graph Neural Networks Architecture

3.1 Graph Convolutional Network (GCN)

3.1.1 Spectral Motivation

Proof. It follows directly from the Graph Convolution definition.

Different choices of w lead to different spectral GNN architectures, including GCN,

except for the output layer, where there is no nonlinear activation:

In summary, the final output of a GCN with L layers is:

3.1.2 Spatial Motivation

aik “ 0 if vk R N pvi q, for i P rN s, j P rnpℓ`1q s, we have:

On the other hand,

Hence the j th element of Equation (3.13) is

which completes the proof.

Figure 3.2: Visualization of GraphSAGE. Left: Sampled neighborhood of node 0,

, M is the mean operator, Mpa1 , .., ak q “ k1 ki“1 ai P Rp for ai P Rp .

3.3 Graph Attention Network (GAT)

Figure 3.4: An illustration of multihead attention (with K = 3 heads) by node 1 on its

Lastly, the embedding of node u at layer ℓ ` 1 is:

PinSAGE proposes an importance-based sampling method. For a given target node u, it

The final, normalized vector representation of node u at layer ℓ is as follows:

4.1.1 Backpropagation on GCNs

where Aq P RN ˆN is the normalized adjacent matrix with self-connections of the graph G,

where # ř p0q p0q p0q