Duy Nguyen
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Mathematics
in
Applied Mathematics
I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis,
including any required final revisions, as accepted by my examiners.
Abstract
Many real-world entities can be modelled as graphs, such as molecular structures, social
networks, or images. Despite this great expressive power, the complex structure of graphs
poses significant challenges to traditional deep learning methods, which have been extremely
successful in many machine learning tasks on other input data structures, such as text and
image data. Recently, there have been many attempts at developing neural network
architectures for graph-structured data, namely graph neural networks (GNNs).
In this thesis, we first introduce some mathematical notations for graphs and different
aspects of training a feedforward neural network. We then discuss several notable GNN
architectures including Graph Convolutional Neural Networks, Graph Attention Networks,
GraphSAGE, and PinSAGE. Some special aspects of GNN training are also presented. Finally,
we investigate a neighborhood sampling approach based on PinSAGE applied to a user-product
recommendation problem.
Acknowledgements
I would like to thank all the little people who made this thesis possible.
Table of Contents

List of Tables

1 Introduction

2 Background
  2.1 Graphs
  2.2 Machine Learning and Neural Networks Basics
    2.2.1 Tasks
    2.2.2 Performance Measure
    2.2.3 Loss Function
    2.2.4 Parameter Updates
    2.2.5 Neural Networks
    2.2.6 Overfitting and Regularization
  2.3 Fourier Transform and Convolution Theorem

3 Graph Neural Network Architectures
    3.1.3 Efficient Implementation
  3.2 GraphSAGE
  3.3 Graph Attention Network (GAT)
  3.4 PinSAGE

4 GNN Training
  4.1 Backpropagation
    4.1.1 Backpropagation on GCNs
    4.1.2 Skip Connections in GNNs
  4.2 Loss Functions
  4.3 Special Training Methods on Graphs
    4.3.1 Mini-Batching
    4.3.2 Sampling
    4.3.3 Graph Pooling
    4.3.4 Augmentation
    4.3.5 Regularization on Graphs

5 Numerical Experiments
  5.1 Dataset and Task
    5.1.1 Data Format
    5.1.2 Task
  5.2 Environment
    5.2.1 Libraries
    5.2.2 Experiment Environment
  5.3 General Approach
    5.3.1 Dataset Split
    5.3.2 Preprocessing
    5.3.3 Constructing Samplers
    5.3.4 Constructing a Training Batch
    5.3.5 Scoring and Loss Functions
    5.3.6 Model Architecture
  5.4 Evaluation
    5.4.1 Hit@K
    5.4.2 Evaluation Settings
  5.5 Results and Extensions
    5.5.1 Baseline Model
    5.5.2 Adaptive Restart Probability
    5.5.3 Variable Number of Random Walks
  5.6 Observations

6 Conclusions
  6.1 Summary
  6.2 Future Work

References
List of Figures

4.1 Effect of adding skip connections in graph neural networks
4.2 DiffPool [23] for graph classification problems
List of Tables

Table 1: Graph Neural Networks Notation
Chapter 1
Introduction
Many real-world entities can be represented as graphs, such as social networks, molecular
structures, or images. For example, a molecule can be viewed as a graph of many atoms
bonded together, whose properties need to be predicted for drug discovery. In e-commerce,
users and products can be represented as nodes in a graph whose edges model the interactions
among them, and the goal is to exploit the complex structure of these interactions to serve
accurate recommendations to users. The first motivation for Graph Neural Networks (GNNs)
dates back to the nineties, but they have only recently been gaining massive traction, following
the success of convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Many operations on graphs are generalized from their successful counterparts in CNNs and
RNNs, such as graph convolution and graph attention. Thanks to recent advances in computing
power, many generalizations, definitions, and extensions have been proposed and experimented
with, which leaves ample room for exploration and improvement for researchers and
practitioners. However, partly because of that advancement, results in the field of GNNs are
almost always of a computational nature, somewhat lacking rigorous mathematical foundations.
There are also discrepancies in the mathematical notation used across GNN architectures,
which can confuse readers.
In this thesis, we seek to present a systematic way to write down popular GNN architectures,
together with the mathematical proofs and foundations that are lacking in some original works.
Chapter 2 introduces useful notation on graphs, machine learning tasks in general, and
different aspects of training a neural network. This chapter is followed by an overview of
several of the most notable GNN architectures, including Graph Convolutional Network (GCN),
GraphSAGE, Graph Attention Network (GAT), and PinSAGE. The motivation, strengths,
weaknesses, and use cases of each architecture are also discussed. Some specific aspects of
training a GNN are presented and discussed in Chapter 4. Finally, we investigate a neighborhood
sampling approach based on PinSAGE applied to a product recommendation problem, report our
findings, discuss the numerical results, and propose some improvements.
Chapter 2
Background
This chapter introduces some useful definitions and properties of graphs [9] and neural
networks that will be used in subsequent chapters. Firstly, we define graphs and introduce
several graph terminologies, including adjacency matrices, node degree matrices, and
Laplacian matrices. Then we provide an overview of the basic principles of machine learning
and neural networks, including tasks, performance measures, and training aspects. Finally,
we recall the Fourier transform and the convolution theorem, which serve as inspiration for
the graph convolution operator.
2.1 Graphs
Definition 1 (Graph). A graph $G = (V, E)$ is defined as a collection of two sets $V$ and
$E$, where $V$ is a finite set of nodes and $E \subseteq V \times V$ is the set of edges. Here we say there
is an edge between node $u \in V$ and node $v \in V$ if $(u, v) \in E$.

The graph $G$ is called a simple graph if there is at most one edge between any pair of nodes.

The graph $G$ is called an undirected graph if the edges are undirected, i.e. $(u, v) \in E$
if and only if $(v, u) \in E$.

Let $v \in V$ be a node of a graph $G = (V, E)$. We define the set of neighboring nodes
of $v$, $\mathcal{N}(v)$, as follows:
\[
\mathcal{N}(v) = \{ u \in V \mid (u, v) \in E \}.
\]
Definition 2 (Adjacency matrix). Let $G = (V, E)$ be a simple undirected graph, where
$V = \{v_1, \ldots, v_N\}$. The adjacency matrix $A = (a_{ij}) \in \mathbb{R}^{N \times N}$ of the graph $G$ is defined
to be the matrix whose entries are
\[
a_{ij} =
\begin{cases}
1, & \text{if } (v_i, v_j) \in E \\
0, & \text{otherwise},
\end{cases}
\]
for $i, j \in [N]$.

Definition 3 (Node degree matrix). Given a simple undirected graph $G = (V, E)$, we
define its node degree matrix $D = (d_{ij}) \in \mathbb{R}^{N \times N}$ as a diagonal matrix whose diagonal
entries are
\[
d_{ii} = \sum_{j=1}^{N} a_{ij} \ge 0, \quad i \in [N].
\]
Analogously, the node degree matrix with self-connections $\hat{D}$ has diagonal entries
\[
\hat{d}_{ii} = \sum_{j=1}^{N} \hat{a}_{ij} \ge 0, \quad i \in [N],
\]
where $\hat{A} = (\hat{a}_{ij}) = A + I_N$ is the adjacency matrix with self-connections.
Definition 7 (Laplacian matrix). Given a simple undirected graph $G = (V, E)$, the Laplacian
matrix $L$ and the symmetrically normalized Laplacian matrix $L^{\mathrm{sym}}$ of the graph $G$ are
defined by:
\[
L := D - A \in \mathbb{R}^{N \times N},
\]
\[
L^{\mathrm{sym}} := D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \in \mathbb{R}^{N \times N},
\]
where $D$ and $A$ are the node degree matrix and the adjacency matrix of the graph $G$, respectively,
with the convention that $0^{-\frac{1}{2}} = 0$.
Notice that since the graph $G$ is undirected, its adjacency matrix $A$ is symmetric.
Therefore, the Laplacian matrices $L$ and $L^{\mathrm{sym}}$ are also symmetric. Moreover, since $L$ is
diagonally dominant, $L$ is positive semidefinite. Since $L^{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$ and both $D^{-1/2}$
and $L$ are positive semidefinite, $L^{\mathrm{sym}}$ is also positive semidefinite. Using $L^{\mathrm{sym}}$ over $L$ for
message passing has several advantages. Firstly, it normalizes the influence of high-degree
neighbors of a node $u$, making it comparable to that of the other neighboring nodes.
Moreover, while $L^{\mathrm{sym}}$ retains the spectral structure of $L$, it has a bounded
spectrum, which ensures several numerical stability properties [9]. More precisely, it can
be proved that $0 = \lambda_1 \le \ldots \le \lambda_N \le 2$, where $\{\lambda_i\}_{i=1}^{N}$ is the set of eigenvalues of $L^{\mathrm{sym}}$
[5].
Definition 8 (Laplacian matrix with self-connections). Given a simple undirected graph
$G = (V, E)$, we define the Laplacian matrix with self-connections $\hat{L}$ and the symmetrically
normalized Laplacian matrix with self-connections $\hat{L}^{\mathrm{sym}}$ of the graph $G$ as follows:
\[
\hat{L} := \hat{D} - \hat{A} \in \mathbb{R}^{N \times N},
\]
\[
\hat{L}^{\mathrm{sym}} := \hat{D}^{-\frac{1}{2}} \hat{L} \hat{D}^{-\frac{1}{2}} = I_N - \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} \in \mathbb{R}^{N \times N},
\]
where $\hat{D}$ and $\hat{A}$ are the node degree matrix with self-connections and the adjacency matrix with
self-connections, respectively.
For the remainder of the thesis, we will mainly use the symmetrically normalized
Laplacian matrix with self-connections $\hat{L}^{\mathrm{sym}}$. The advantages of using matrices with
self-connections will be discussed in Section 3.1.
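To make these definitions concrete, the following minimal NumPy sketch builds $A$, $D$, $L^{\mathrm{sym}}$, and the self-connection variants for a small example graph; the toy graph and variable names are illustrative and not taken from the thesis.

```python
import numpy as np

# Toy undirected graph on 4 nodes: edges (0,1), (1,2), (2,3), (3,0)
N = 4
A = np.zeros((N, N))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    A[u, v] = A[v, u] = 1.0

deg = A.sum(axis=1)
D = np.diag(deg)                                  # node degree matrix
L = D - A                                         # Laplacian matrix

# Symmetrically normalized Laplacian, with the convention 0^(-1/2) = 0
d_inv_sqrt = np.zeros(N)
d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
D_inv_sqrt = np.diag(d_inv_sqrt)
L_sym = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt

# Variants with self-connections (A_hat = A + I), as used later by GCN
A_hat = A + np.eye(N)
D_hat_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
L_hat_sym = np.eye(N) - D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt

# Eigenvalues of L_sym lie in [0, 2]
print(np.round(np.linalg.eigvalsh(L_sym), 3))
```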
2.2.1 Tasks
We can categorize machine learning tasks as supervised, unsupervised, or semi-supervised learning.
In supervised learning, we aim to learn an underlying function that maps the input
(feature vectors) to the corresponding output (labels) from a given set of input-output
pairs and make predictions for unseen examples. Two of the most common tasks for
supervised learning algorithms are classification and regression. In classification tasks,
the algorithm aims to predict which of $C$ given categories an input belongs to. To make such
a prediction, the algorithm learns a function $f: \mathbb{R}^n \to \{1, \ldots, C\}$, where $n$ is the
dimension of each input data point. For example, the spam filtering problem is a classification
task, where the algorithm may be given a dataset of email contents, sender names, etc., and is
required to predict whether each email is spam or not. In regression tasks, the output can be
numerical values. To make such a prediction, the algorithm tries to learn a function
$f: \mathbb{R}^n \to \mathbb{R}^m$, where $n$ and $m$ are the dimensions of the input and output data,
respectively. For example, the house price prediction problem is a regression task, where each
input data point contains the properties of a house (size, location, etc.) and the corresponding
output is the sold price of the house. The regression algorithm will learn a function (possibly
nonlinear) that maps the properties of a house to its sold price and predict the prices of new houses.
In unsupervised learning, we try to find underlying patterns from a given dataset.
Some unsupervised learning algorithms want to learn the probability distribution that
generated the dataset, while some others aim to divide the dataset into clusters of similar
inputs (clustering). One example is the StackOverflow questions clustering problem in
which we would like to group similar questions into the same cluster to recommend them
to users.
In semi-supervised learning tasks, we are given a dataset of features, where a subset
of the dataset is labelled and the rest is unlabeled. For graph inputs, we are also
given a graph that contains nodes and edges, and a semi-supervised task can be carried out in
a transductive or an inductive setting. In the transductive setting, the model is required
to predict the labels of the given unlabeled nodes. One example is the citation network
problem [11], where we are given a graph with research papers as nodes and citation relations
as edges, along with labels (topics) for some research papers. The goal is to predict the
topics of the remaining unlabeled papers. In the inductive setting, the model is required to
predict the labels for new nodes from the same distribution [24]. For example, we may train
a graph neural network on a subgraph of the citation network and then test its performance
on another subgraph [9].
2.2.2 Performance Measure
A performance measure is a metric that can be used to evaluate the abilities of a machine
learning algorithm. The list of choices for performance measures usually depends on the
given task [7].
One of the most common machine learning tasks on graphs is classification. For classification
tasks, there are many metrics that can be used to measure an algorithm's performance. One of the
most common metrics is accuracy, the proportion of correct predictions over the total number of
predictions. Equivalently, we can also use the error rate, which is the proportion of incorrect
predictions over the total number of predictions. We can also use precision, recall
(sensitivity), and specificity, whose definitions are given below.

The precision of a model is defined as follows:
\[
\mathrm{Precision} = \frac{TP}{TP + FP},
\]
and, analogously, recall and specificity are defined as
\[
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}.
\]
Suppose we have a recommendation model that predicts whether a user will buy a given product
or not. A true positive (TP) case is one where we predict yes, and the user actually buys the
product. A true negative (TN) case is one where we predict no, and the user indeed does not buy
it. A false positive (FP) case is one where the user does not buy the product although we
predict they will. A false negative (FN) case is one where the user buys a product that we
predict they will not.
Since there is usually a trade-off among precision, recall and specificity, we may want
to choose to optimize different metrics in different scenarios. We want precision to be high
when we cannot tolerate a false positive case. We want to optimize for recall if we do not
want to miss a positive case. For instance, it is dangerous to miss a virus-infected person
in case of a highly contagious disease. We want to optimize for specificity if we want to
avoid false alarms. For example, we want a cancer diagnosis to have high specificity as it
costs a lot of time, money, and even patient’s wellness to undergo cancer treatments.
On the other hand, it can be argued that we should only consider the performance of a machine
learning algorithm on data that it has not seen before. For that reason, we can divide the given
dataset into a training set and a test set. The training set is used by the algorithm to learn,
while the test set is reserved for evaluating the performance of the machine learning algorithm.
In practice, the training set is usually considerably larger than the test set.
For such classification tasks, minimizing the cross-entropy loss, which is equivalent to
maximizing the log-likelihood of the data given the model, is one of the most popular choices.
The cross-entropy loss is calculated as follows:
\[
L = - \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \ln z_{ij}. \tag{2.1}
\]
For regression tasks, the Root Mean Square Error (RMSE) is one of the most commonly used
losses. It measures how far the model's predictions are from the actual values. Let $y_i \in \mathbb{R}$
be the actual value of the $i$-th example and $\hat{y}_i$ the predicted value; then the RMSE loss is
given by:
\[
L = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}. \tag{2.2}
\]
Denote by $t$ (where $t = 1, 2, \ldots$) the iteration count and by $x^{(t)} \in \mathbb{R}^N$ the vector we need to
update. Then the vector $x^{(t)}$ can be updated as follows.

• Adagrad: The main idea of this optimization method is to have per-parameter learning rates.
That is, the components that receive high gradients will have their learning rate reduced over
time. Here, $c^{(t)} \in \mathbb{R}^N$ is the accumulated magnitude of gradients over time, and $\epsilon > 0$ is used
to avoid division by zero. For each backpropagation step, letting $g^{(t)}$ denote the gradient of
the loss with respect to $x^{(t)}$, we update
\[
c^{(t)} = c^{(t-1)} + \big(g^{(t)}\big)^2, \qquad x^{(t+1)} = x^{(t)} - \lambda \, \frac{g^{(t)}}{\sqrt{c^{(t)}} + \epsilon},
\]
where the square, division, and square root are performed element-wise and $\lambda$ is the learning rate.

• RMSProp: Instead of accumulating all past squared gradients, RMSProp keeps an exponentially
decaying moving average of the squared gradients:
\[
c^{(t)} = \beta \, c^{(t-1)} + (1 - \beta)\big(g^{(t)}\big)^2, \qquad x^{(t+1)} = x^{(t)} - \lambda \, \frac{g^{(t)}}{\sqrt{c^{(t)}} + \epsilon},
\]
where division and square root are both performed element-wise, $\beta \in [0, 1]$ is the
decay factor, and $\lambda$ is the learning rate. In practice, $\beta$ is usually set close to 1.
• Adam: Similar to RMSProp, Adam also uses a moving average of the squared gradients. The main
difference is that both the moving average of the gradients $m^{(t)} \in \mathbb{R}^N$ and that of the squared
gradients $v^{(t)} \in \mathbb{R}^N$ are further smoothed over time by factors $\beta_1 \in [0, 1]$ and $\beta_2 \in [0, 1]$.
Usually, $\beta_1$ is set to 0.9 and $\beta_2$ is set to 0.99. Both $v$ and $m$ are initialized at zero.
The updates are given by
\[
m^{(t)} = \beta_1 m^{(t-1)} + (1 - \beta_1) g^{(t)}, \quad
v^{(t)} = \beta_2 v^{(t-1)} + (1 - \beta_2) \big(g^{(t)}\big)^2, \quad
x^{(t+1)} = x^{(t)} - \lambda \, \frac{m^{(t)}}{\sqrt{v^{(t)}} + \epsilon}.
\]
(In practice, bias-corrected estimates $\hat{m}^{(t)} = m^{(t)}/(1 - \beta_1^t)$ and
$\hat{v}^{(t)} = v^{(t)}/(1 - \beta_2^t)$ are often used in place of $m^{(t)}$ and $v^{(t)}$.)
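The following is a minimal NumPy sketch of the Adam update described above, applied to a simple quadratic objective; the hyperparameter names and default values follow common conventions and are not taken from the thesis.

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update; m and v are moving averages of the gradient and
    squared gradient, t is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v being initialized at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Minimize f(x) = ||x||^2 / 2, whose gradient is simply x
x = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 2001):
    x, m, v = adam_step(x, x.copy(), m, v, t, lr=1e-2)
print(np.round(x, 4))  # close to the minimizer at the origin
```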
A neural network consists of neurons that are connected in an acyclic graph, where the outputs
of some neurons become inputs to other neurons. Usually, a neural network is organized into
distinct layers of neurons. The input data is fed into the first layer, and the output (usually
the prediction we need to make from the input) is computed from the output of the last layer.
A neural network is typically initialized with random weights on the edges between nodes; these
weights are then updated repeatedly by the backpropagation algorithm until the model performs
well enough [14].
For example, let $x \in \mathbb{R}^3$ be an input data point and $y \in \mathbb{R}^2$ be the corresponding output of
the two-hidden-layer neural network associated with Figure 2.1. The output of each layer is
given by:
\[
h^{(1)} = \sigma(W_1 x + b_1) \in \mathbb{R}^4, \qquad
h^{(2)} = \sigma(W_2 h^{(1)} + b_2) \in \mathbb{R}^5, \qquad
y = W_3 h^{(2)} + b_3 \in \mathbb{R}^2, \tag{2.17}
\]
where $W_1 \in \mathbb{R}^{4 \times 3}$, $W_2 \in \mathbb{R}^{5 \times 4}$, $W_3 \in \mathbb{R}^{2 \times 5}$,
$b_1 \in \mathbb{R}^4$, $b_2 \in \mathbb{R}^5$, $b_3 \in \mathbb{R}^2$, and $\sigma$ is a nonlinear
activation function.
In general, a feedforward $L$-hidden-layer neural network is a function of the form:
\[
f(x; \theta) = W_L \, \sigma\Big( W_{L-1} \, \sigma\big( \ldots \sigma(W_1 x + b_1) \ldots \big) + b_{L-1} \Big) + b_L, \tag{2.18}
\]
where $\theta = \{(W_\ell, b_\ell)\}_{\ell=1}^{L}$, $W_\ell \in \mathbb{R}^{n^{(\ell)} \times n^{(\ell-1)}}$,
$b_\ell \in \mathbb{R}^{n^{(\ell)}}$, $n^{(\ell)} \in \mathbb{Z}_+$ for $\ell \in [L]$, and $n^{(0)}$ is the input dimension.
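As a concrete instance of Equations (2.17)-(2.18), the sketch below builds the two-hidden-layer network of Figure 2.1 in PyTorch; the layer sizes (3, 4, 5, 2) come from the example above, while the choice of ReLU for $\sigma$ is only illustrative.

```python
import torch
import torch.nn as nn

# Two-hidden-layer feedforward network matching Equation (2.17):
# h1 = sigma(W1 x + b1), h2 = sigma(W2 h1 + b2), y = W3 h2 + b3
model = nn.Sequential(
    nn.Linear(3, 4),   # W1 in R^{4x3}, b1 in R^4
    nn.ReLU(),         # sigma (illustrative choice)
    nn.Linear(4, 5),   # W2 in R^{5x4}, b2 in R^5
    nn.ReLU(),
    nn.Linear(5, 2),   # W3 in R^{2x5}, b3 in R^2
)

x = torch.randn(3)     # a single input vector in R^3
y = model(x)           # output in R^2
print(y.shape)         # torch.Size([2])
```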
Regularization techniques give preference to simple or sparse sets of parameters and penalize
overly complex models. Here are several examples.
Suppose we are using the cross-entropy loss as in (2.1), and we want to regularize all the
weight matrices $W^{(\ell)}$, $\ell \in [L]$. If we prefer the weight matrices to be sparse, we can use
$L_1$ regularization. Because of this property, $L_1$ regularization can also be viewed as a
feature selection mechanism [7]. The loss function with the $L_1$ regularization term is given by:
\[
L = - \sum_{u=1}^{N} \sum_{i=1}^{C} y_{ui} \ln z_{ui} + \sum_{\ell=1}^{L} \sum_{i} \sum_{j} \big| W^{(\ell)}_{ij} \big|. \tag{2.19}
\]
Another popular choice of regularization is to use the Frobenius norm on the weight matrices,
which is also called weight decay ($L_2$ regularization):
\[
L = - \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \ln z_{ij} + \frac{1}{2} \sum_{\ell=1}^{L} \| W^{(\ell)} \|_F^2. \tag{2.20}
\]
Another regularization technique is to enforce an upper bound on the norm of the learnable
parameters, called the max norm constraint. For example, after a normal gradient update, we can
normalize the weights as follows:
\[
W^{(\ell)} \longleftarrow c \, \frac{W^{(\ell)}}{\| W^{(\ell)} \|}, \tag{2.21}
\]
where the norm $\| \cdot \|$ could be the $L_1$ norm, the spectral norm, or the Frobenius norm.
If we are training a neural network, dropout can also be used. That is, during training, we
randomly and temporarily deactivate neurons with some probability $p \in [0, 1]$. Let us take the
neural network in Figure 2.1 as an example. If we deactivate the first neuron of the first
hidden layer, then the formula for the second hidden layer becomes:
\[
h^{(2)} = \sigma\big( W_2 \hat{h}^{(1)} + b_2 \big) \in \mathbb{R}^5, \tag{2.22}
\]
where $\hat{h}^{(1)} = [0, h^{(1)}_2, h^{(1)}_3, h^{(1)}_4]^T$.
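The following short sketch applies (inverted) dropout to the hidden vector $h^{(1)}$ of the example above; the rescaling by $1/(1-p)$ is the convention used by standard libraries and is an implementation detail not spelled out in the text.

```python
import torch

def dropout(h, p=0.5, training=True):
    """Randomly zero each entry of h with probability p during training,
    scaling the survivors by 1/(1-p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return h
    mask = (torch.rand_like(h) >= p).float()
    return h * mask / (1.0 - p)

h1 = torch.randn(4)          # hidden vector h^(1) of the example network
print(dropout(h1, p=0.5))    # some entries are zeroed at random
```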
The graph convolution operator is inspired by the classical convolution operator, so we start
off by introducing the Fourier transform and the convolution theorem.
Definition 9 (Fourier and Inverse Fourier Transform). The Fourier transform of a function
$f$ on $\mathbb{R}^d$ is given by:
\[
\mathcal{F}(f)(\xi) = \hat{f}(\xi) = \int_{\mathbb{R}^d} f(x) \, e^{-2\pi i x \cdot \xi} \, dx, \qquad \xi \in \mathbb{R}^d. \tag{2.23}
\]
The inverse Fourier transform is given by:
\[
\mathcal{F}^{-1}(\hat{f})(x) = \int_{\mathbb{R}^d} \hat{f}(\xi) \, e^{2\pi i x \cdot \xi} \, d\xi, \qquad x \in \mathbb{R}^d.
\]
Theorem 1 (Convolution Theorem). Let $f_1, f_2 \in L^1(\mathbb{R}^d)$. Then the Fourier transform
of the convolution of the two functions is the pointwise product of their Fourier transforms:
\[
\mathcal{F}(f_1 * f_2) = \mathcal{F}(f_1) \cdot \mathcal{F}(f_2).
\]
Chapter 3
Graph Neural Network Architectures
In this chapter, we discuss several popular graph neural network architectures: Graph
Convolutional Network (GCN), GraphSAGE, Graph Attention Network (GAT), and PinSAGE.
Here, we consider only simple undirected graphs. We also summarize the commonly used notation
in Table 1.
eigenvalue of the corresponding Laplacian matrix $\hat{L}^{\mathrm{sym}} = I_N - \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$. According to
the authors, this acts as a low-pass filter which attenuates signals of higher frequency, thus
producing smooth features over the graph. The result is that neighboring nodes are more likely
to share similar representations and predictions.

It should be noted that there may not be a unique choice for $U$. However, we still assume that
$U$ is already chosen and fixed [18].
Lemma 1. Let $w, h \in \mathbb{R}^N$. Then
\[
w *_G h = U \operatorname{diag}(U^T w) U^T h = \hat{w}(\hat{L}^{\mathrm{sym}}) \, h,
\]
where $\hat{w}(\hat{L}^{\mathrm{sym}})$ is a matrix that depends on $w$, the Laplacian matrix $\hat{L}^{\mathrm{sym}}$, and its
orthogonal matrix $U$.
Lemma 2. Under the settings above, if $w = c \, U(\mathbf{1}_{N \times 1} - \lambda)$, where
$\lambda = [\lambda_1, \ldots, \lambda_N]^T$ and $c \in \mathbb{R}$, then
\[
w *_G h = c \, \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} h. \tag{3.4}
\]
Proof. Recall that the normalized Laplacian matrix is defined as
$\hat{L}^{\mathrm{sym}} = I_N - \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$, which has a fixed set of orthonormal eigenvectors $U$
with corresponding eigenvalues $\lambda = [\lambda_1, \ldots, \lambda_N]^T$; write $\Lambda = \operatorname{diag}(\lambda)$.

Consider $w = c \, U(\mathbf{1} - \lambda)$ where $c \in \mathbb{R}$. Then
$U^T w = c \, U^T U (\mathbf{1} - \lambda) = c(\mathbf{1} - \lambda)$ and
$\operatorname{diag}(U^T w) = c \operatorname{diag}(\mathbf{1} - \lambda) = c\,(I_N - \Lambda)$. Thus,
\[
C_G(h; w) = w *_G h = U \big( \operatorname{diag}(U^T w) \, U^T h \big)
= c \, U (I_N - \Lambda) U^T h
= c \, (I_N - \hat{L}^{\mathrm{sym}}) h
= c \, \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} h \in \mathbb{R}^N.
\]
Denoting $\tilde{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$, we obtain $C_G(h; w) = c \tilde{A} h$.
The above equation covers the simplest case, where we have only one filter and the number of
channels is one. We can generalize to more complex cases, where we may have many filters, each
of which is multi-channel.
Let us consider a scenario where our input consists of $N$ nodes and $d = n^{(0)}$ channels.
Instead of $x \in \mathbb{R}^N$ and $c \in \mathbb{R}$, we now have an input matrix of $n^{(0)}$ features
$X \in \mathbb{R}^{N \times n^{(0)}}$ and learnable filter parameters
$c = [c_1, \ldots, c_{n^{(0)}}]^T \in \mathbb{R}^{n^{(0)}}$. Let us define the multi-channel graph
convolutional operator $C_G$ on $X$ as follows:
\[
C_G(X; c) = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X c \in \mathbb{R}^N. \tag{3.5}
\]
Now, suppose we have $n^{(1)}$ such filters. We stack the filter parameters
$c^{(j)} \in \mathbb{R}^{n^{(0)}}$, $j \in [n^{(1)}]$, horizontally to form a matrix
$W^{(0)} \in \mathbb{R}^{n^{(0)} \times n^{(1)}}$, obtaining:
\[
C_G(X; W^{(0)}) = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X W^{(0)} \in \mathbb{R}^{N \times n^{(1)}}. \tag{3.6}
\]
Note that Equations (3.5) and (3.6) are generalizations of the graph convolution operator
between two vectors. By adding a nonlinear activation $\sigma$, we obtain the formula for the
first hidden layer of GCN:
\[
H^{(1)} = \sigma\big( C_G(X; W^{(0)}) \big) = \sigma\big( \tilde{A} H^{(0)} W^{(0)} \big) \in \mathbb{R}^{N \times n^{(1)}}, \tag{3.7}
\]
where $H^{(1)} \in \mathbb{R}^{N \times n^{(1)}}$ and $H^{(0)} = X \in \mathbb{R}^{N \times n^{(0)}}$ are the output and
input of the first hidden layer, respectively. This can be generalized to any layer
$\ell \in [0, L-1]$ by letting $H^{(\ell)}$ be the input to the $(\ell+1)$-th layer, with
$W^{(\ell)} \in \mathbb{R}^{n^{(\ell)} \times n^{(\ell+1)}}$, and stacking $L$ such layers altogether. The
single-layer formula of GCN is given by:
\[
H^{(\ell+1)} = \sigma\big( \tilde{A} H^{(\ell)} W^{(\ell)} \big), \qquad \ell = 0, \ldots, L-2, \tag{3.8}
\]
\[
H^{(L)} = \tilde{A} H^{(L-1)} W^{(L-1)}, \tag{3.9}
\]
where $\tilde{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$.
That is, a node's hidden state for the next layer is an aggregation of information from all
its neighbors at the current layer.
Proof. Denote $B = \tilde{A} H^{(\ell)} W^{(\ell)}$ and let $h^{(\ell)}_{kj}$ be the $j$-th entry of
$h^{(\ell)}_{v_k}$. It is sufficient to prove that
\[
B_{ij} = \Big( \sum_{v_k \in \mathcal{N}(v_i)} W^{(\ell),T} h^{(\ell)}_{v_k} \Big)_j, \qquad \text{for } i \in [N] \text{ and } j \in [n^{(\ell+1)}].
\]
Figure 3.1: GCN as a spatial architecture - information is passed from a node to its
direct neighbors. Information of node 5 at ℓth layer is passed to nodes 1 and 4 at pℓ ` 1qth
layer. The hidden representation of node 2 at pℓ ` 1qth layer is calculated based on the
hidden representation of nodes 1 and 3 at ℓth layer.
Therefore,
\[
B_{ij} = \big( \tilde{A} H^{(\ell)} W^{(\ell)} \big)_{ij}
= \sum_{p=1}^{N} \big( \tilde{A} H^{(\ell)} \big)_{ip} \, w^{(\ell)}_{pj}
= \sum_{p=1}^{N} \Big( \sum_{v_k \in \mathcal{N}(v_i)} h^{(\ell)}_{kp} \Big) w^{(\ell)}_{pj}. \tag{3.12}
\]
\[
\sum_{v_k \in \mathcal{N}(v_i)} W^{(\ell),T} h^{(\ell)}_{v_k}
= \Big[ \sum_{v_k \in \mathcal{N}(v_i)} \sum_{p=1}^{N} h^{(\ell)}_{kp} w^{(\ell)}_{p1}
\;\; \ldots \;\;
\sum_{v_k \in \mathcal{N}(v_i)} \sum_{p=1}^{N} h^{(\ell)}_{kp} w^{(\ell)}_{p n^{(\ell+1)}} \Big]^T. \tag{3.13}
\]
3.1.3 Efficient Implementation
In the previous sections, we saw that GCN can be implemented both at the graph level, as in
(3.8), and at the node level, as in (3.11). In this section, we discuss the pros and cons of
each implementation.

The graph-level implementation as in (3.8) should be used when our priority is to minimize the
number of mathematical operations, because each embedding $h^{(\ell)}_{v_i}$ is computed exactly once
[9]. However, this implementation requires the full adjacency matrix and node feature matrix
simultaneously. Therefore, it should not be used if memory is our top concern. Moreover, using
the graph-level implementation means that we cannot take advantage of mini-batching.

On the other hand, the node-level implementation should be considered when we are limited in
memory, or when the input graph is big. We can also use mini-batching to further control the
memory footprint of the GNN.
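A minimal dense-matrix sketch of the graph-level GCN update of Equations (3.7)-(3.9) is given below; it uses plain PyTorch tensors, whereas a practical implementation would rely on sparse operations, and the toy graph and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-level GCN layer: H' = sigma(A_tilde H W), Equation (3.8)."""
    def __init__(self, in_dim, out_dim, activation=torch.relu):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)
        self.activation = activation

    def forward(self, A_tilde, H):
        return self.activation(A_tilde @ self.linear(H))

def normalize_adjacency(A):
    """A_tilde = D_hat^{-1/2} (A + I) D_hat^{-1/2}."""
    A_hat = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

# Toy example: 5 nodes, 8-dimensional input features, 2 output classes
A = torch.zeros(5, 5)
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    A[u, v] = A[v, u] = 1.0
A_tilde = normalize_adjacency(A)
X = torch.randn(5, 8)

layer1 = GCNLayer(8, 16)
layer2 = GCNLayer(16, 2, activation=lambda z: z)  # final layer, no sigma (3.9)
out = layer2(A_tilde, layer1(A_tilde, X))
print(out.shape)  # torch.Size([5, 2])
```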
3.2 GraphSAGE
One drawback of GCN is that its training heavily depends on the adjacency matrix
A. In transductive settings where all nodes are given at first, GCN can work fairly well.
However, in inductive settings where a new node or edge is introduced to the graph, we
must retrain the GCN from scratch. On the other hand, GraphSAGE is an architecture
that does not require the adjacency matrix A, so it works well in both transductive and
inductive settings [10].
Moreover, GraphSAGE presents a novel sampling approach that is useful for processing very
large graphs. In such graphs, the number of nodes and the sizes of the neighborhoods are large,
so it is difficult to store and process all the neighborhood information. To deal with this,
GraphSAGE selects only a subset of each node's neighborhood, ensuring that the size of the
neighborhood is reasonable for message passing [24, 10].
Suppose we are given a graph $G$ with $N$ vertices and a matrix of node features
$X \in \mathbb{R}^{N \times d}$. Let $h^{(\ell)}_u \in \mathbb{R}^{n^{(\ell)}}$ be the representation of node $u$ at
layer $\ell$, and let $\mathcal{N}_s(v)$ be the sampled neighborhood of node $v$. We stack the
representations vertically to obtain $H^{(\ell)} = [h^{(\ell)T}_1, \ldots, h^{(\ell)T}_N] \in \mathbb{R}^{N \times n^{(\ell)}}$,
the output of the $\ell$-th layer ($\ell \in [L]$). Let $G_\ell$ be a differentiable aggregation
function at layer $\ell$, and denote by $h^{(\ell)}_{\mathcal{N}_s(v)} \in \mathbb{R}^{n^{(\ell-1)}}$ the aggregated
message from the neighborhood of node $v$ at layer $\ell$. Then the formula for
$h^{(\ell)}_{\mathcal{N}_s(v)}$ is as follows:
\[
h^{(\ell)}_{\mathcal{N}_s(v)} = G_\ell\big( \{ h^{(\ell-1)}_u, \; \forall u \in \mathcal{N}_s(v) \} \big). \tag{3.14}
\]
There can be many choices for $G_\ell$, such as mean, pooling, or long short-term memory
aggregators [10]. For example, with the mean aggregator $M$ (the element-wise mean):
\[
h^{(\ell)}_{\mathcal{N}_s(v)} = \sigma\Big( W \, M\big( \{ h^{(\ell-1)}_u, \; \forall u \in \mathcal{N}_s(v) \cup \{v\} \} \big) \Big), \tag{3.15}
\]
where $W$ is a learnable weight matrix.
After choosing an appropriate aggregator for $h^{(\ell)}_{\mathcal{N}_s(v)}$, a single-layer formula of
GraphSAGE is defined as follows:
\[
h^{(\ell)}_v = \sigma\Big( W^{(\ell)} \big[ h^{(\ell-1)}_v \,\|\, h^{(\ell)}_{\mathcal{N}_s(v)} \big] \Big), \tag{3.17}
\]
where $W^{(\ell)} \in \mathbb{R}^{n^{(\ell)} \times 2 n^{(\ell-1)}}$ and $\|$ denotes concatenation.

In summary, the output embedding for a node $v \in V$ of a GraphSAGE architecture with two
layers is given by:
\[
h^{(1)}_{\mathcal{N}_s(v)} = G_1\big( \{ h^{(0)}_u, \; \forall u \in \mathcal{N}_s(v) \} \big), \tag{3.18}
\]
\[
h^{(1)}_v = \sigma\Big( W^{(1)} \big[ h^{(0)}_v \,\|\, h^{(1)}_{\mathcal{N}_s(v)} \big] \Big), \tag{3.19}
\]
\[
h^{(2)}_{\mathcal{N}_s(v)} = G_2\big( \{ h^{(1)}_u, \; \forall u \in \mathcal{N}_s(v) \} \big), \tag{3.20}
\]
\[
h^{(2)}_v = W^{(2)} \big[ h^{(1)}_v \,\|\, h^{(2)}_{\mathcal{N}_s(v)} \big], \tag{3.21}
\]
where $h^{(0)}_u$ is the input feature vector of node $u$ and $h^{(2)}_v$ is the final output embedding
of node $v$.

Looking at the above equations, we can see that, given a new node $v$, we can compute its
embedding using only the embeddings of its (sampled) neighbors. The entire graph structure is
not required, so we do not need to retrain the model for a new node. This is not possible if we
implement GCN using Equation (3.8).
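Below is a minimal sketch of a GraphSAGE layer with a mean aggregator and uniform neighborhood sampling, following Equations (3.14)-(3.21); the sample size, the adjacency-list representation, and the use of ReLU in both layers are illustrative choices rather than details from the original work.

```python
import random
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    """h_v = sigma(W [h_v || mean({h_u : u in N_s(v)})]), cf. (3.15)-(3.17)."""
    def __init__(self, in_dim, out_dim, num_samples=3):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)  # W^(l) on the concatenation
        self.num_samples = num_samples

    def forward(self, neighbors, H):
        out = []
        for v in range(H.shape[0]):
            nbrs = neighbors[v]
            sampled = random.sample(nbrs, min(self.num_samples, len(nbrs)))
            msg = H[sampled].mean(dim=0)              # aggregated neighborhood message
            out.append(torch.relu(self.linear(torch.cat([H[v], msg]))))
        return torch.stack(out)

# Toy graph as an adjacency list, 4 nodes with 8-dimensional features
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
H0 = torch.randn(4, 8)
layer1 = SAGELayer(8, 16)
layer2 = SAGELayer(16, 4)
H2 = layer2(neighbors, layer1(neighbors, H0))   # two-layer GraphSAGE, cf. (3.18)-(3.21)
print(H2.shape)  # torch.Size([4, 4])
```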
Figure 3.3: GAT: From hidden representation to attention weights (See Equations
(3.22) - (3.24))
Let $W^{(\ell)} \in \mathbb{R}^{n^{(\ell+1)} \times n^{(\ell)}}$ be the shared weight matrix that is applied to every
node at the $\ell$-th layer. First of all, the embedding of node $u$ at layer $\ell$ is transformed
linearly:
\[
z^{(\ell)}_u = W^{(\ell)} h^{(\ell)}_u \in \mathbb{R}^{n^{(\ell+1)}}. \tag{3.22}
\]
Let $a_\ell \in \mathbb{R}^{2 n^{(\ell+1)}}$, the attention mechanism, be a learnable single-layer feedforward
neural network. The unnormalized attention score of node $u$ to node $v$, denoted $e^{(\ell)}_{uv}$, is
defined as:
\[
e^{(\ell)}_{uv} = \mathrm{LeakyReLU}\big( a_\ell^T \big[ z^{(\ell)}_u \,\|\, z^{(\ell)}_v \big] \big) \in \mathbb{R}. \tag{3.23}
\]
The normalized attention of node $u$ to node $v$ (or the importance of node $v$ to node $u$) at
layer $\ell$, denoted $\alpha^{(\ell)}_{uv} \in \mathbb{R}$, is defined as:
\[
\alpha^{(\ell)}_{uv} = \mathrm{softmax}\big( e^{(\ell)}_{uv} \big)
= \frac{\exp\big( e^{(\ell)}_{uv} \big)}{\sum_{w \in \mathcal{N}(u)} \exp\big( e^{(\ell)}_{uw} \big)} \in \mathbb{R}. \tag{3.24}
\]
The embedding of node $u$ at the next layer is then a weighted aggregation of its neighbors'
transformed embeddings:
\[
h^{(\ell+1)}_u = \sigma\Big( \sum_{v \in \mathcal{N}(u)} \alpha^{(\ell)}_{uv} \, W^{(\ell)} h^{(\ell)}_v \Big) \in \mathbb{R}^{n^{(\ell+1)}}. \tag{3.25}
\]
In practice, multiple such attention weights are used simultaneously to stabilize the learning
process [20]. This technique is called multi-head attention, where (3.25) is applied $K$ times
using $K$ different learnable weight matrices. These $K$ outputs are then concatenated or
averaged, for example as follows:
\[
h^{(\ell+1)}_u = \big\|_{k=1}^{K} \, \sigma\Big( \sum_{v \in \mathcal{N}(u)} \alpha^{(\ell),k}_{uv} \, W^{(\ell)}_k h^{(\ell)}_v \Big) \in \mathbb{R}^{n^{(\ell+1)}}, \tag{3.26}
\]
where $K$ is the number of attention heads and $W^{(\ell)}_k \in \mathbb{R}^{\frac{n^{(\ell+1)}}{K} \times n^{(\ell)}}$ is
the weight matrix for the $k$-th attention head at the $\ell$-th layer, $k \in [K]$.
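The sketch below computes single-head GAT attention coefficients and the aggregation of Equations (3.22)-(3.25) for one node using dense tensors; a real implementation (for example in DGL) would vectorize this over all edges, and the toy dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
in_dim, out_dim = 8, 16
W = nn.Linear(in_dim, out_dim, bias=False)      # shared weight matrix W^(l)
a = nn.Parameter(torch.randn(2 * out_dim))      # attention vector a_l

H = torch.randn(5, in_dim)                      # features of 5 nodes
Z = W(H)                                        # z_u = W h_u, Equation (3.22)

u, neighbors_u = 0, [1, 2, 4]                   # node u and its neighborhood N(u)

# Unnormalized scores e_uv = LeakyReLU(a^T [z_u || z_v]), Equation (3.23)
e = torch.stack([F.leaky_relu(torch.dot(a, torch.cat([Z[u], Z[v]])))
                 for v in neighbors_u])

alpha = torch.softmax(e, dim=0)                 # normalized attention, Equation (3.24)

# Single-head aggregation over the neighborhood, cf. Equation (3.25)
h_u_next = torch.relu(sum(alpha[i] * Z[v] for i, v in enumerate(neighbors_u)))
print(alpha, h_u_next.shape)
```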
3.4 PinSAGE
Similar to GraphSAGE, PinSAGE also utilizes a sampling technique to reduce the size of the
neighborhoods. The difference is that while GraphSAGE uses random sampling, PinSAGE uses an
importance-based sampling approach built on random walks, illustrated in Figure 3.5.

Figure 3.5: Visualization of PinSAGE's importance-based neighborhood sampling.
Left: original graph. Center: simulation of $N = 5$ random walks from node 0, each with length
$L = 2$; the top $T = 3$ most visited nodes are chosen. Right: node 0 aggregates information from
its sampled neighborhood.
The unnormalized vector representation of node $u$ at layer $\ell + 1$ is as follows:
\[
h^{(\ell+1)}_{u,\mathrm{un}} = \sigma\Big( W^{(\ell)} \cdot \big[ h^{(\ell)}_u \,\|\, h^{(\ell)}_{\mathcal{N}(u)} \big] + w \Big). \tag{3.28}
\]
By looking at the above equations, we can see that PinSAGE: 1) controls the memory footprint of
the algorithm during training, since we choose a fixed size $T$ for a node's neighborhood, and
2) takes into account the importance of neighbors when aggregating their representations. By
performing random walks, we not only select the top $T$ nodes as the neighborhood of $u$, but
also obtain their importance to $u$ (the corresponding entries of $\alpha(u)$).
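A minimal sketch of the importance-based neighborhood selection illustrated in Figure 3.5 follows: simulate short random walks with restarts from a node, count visits, and keep the top-$T$ most visited nodes together with normalized importance weights. The restart handling and parameter values are assumptions for illustration, not the exact PinSAGE procedure.

```python
import random
from collections import Counter

def importance_neighborhood(neighbors, u, num_walks=5, walk_length=2,
                            restart_prob=0.5, top_t=3):
    """Top-T visited nodes from random walks with restart started at u,
    plus their normalized visit counts (the importance weights alpha(u))."""
    visits = Counter()
    for _ in range(num_walks):
        cur = u
        for _ in range(walk_length):
            if random.random() < restart_prob:
                cur = u                      # restart at the seed node
            else:
                cur = random.choice(neighbors[cur])
            if cur != u:
                visits[cur] += 1
    top = visits.most_common(top_t)
    total = sum(count for _, count in top) or 1
    return [(node, count / total) for node, count in top]

# Toy graph as an adjacency list
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
random.seed(1)
print(importance_neighborhood(neighbors, u=0))  # list of (node, weight) pairs
```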
Chapter 4
GNN Training
Graph neural networks extend existing neural networks to graph-structured input data. Much of
the training framework is shared with general neural networks; in this chapter, we focus on
several aspects of training that are specific to graph neural networks.
4.1 Backpropagation
Backpropagation is an algorithm based on gradient descent for optimizing the parameters of a
model. Backpropagation in a neural network consists of two steps: forward calculation and
backward propagation. Given a set of parameters and an input, forward calculation computes the
values at each neuron in forward order. Backward propagation then computes the error at each
variable and updates the parameters with the corresponding partial derivatives in backward
order.
\[
H^{(1)} = \sigma\big( \tilde{A} H^{(0)} W^{(0)} \big) \in \mathbb{R}^{N \times n^{(1)}}. \tag{4.2}
\]
We recall from Equation (3.11) that the formula for $h^{(1)}_{jp}$, where $j \in [N]$ and
$p \in [n^{(1)}]$, is given by:
\[
h^{(1)}_{jp} = \sigma\Big( \sum_{v_q \in \mathcal{N}(v_j)} \sum_{r=1}^{n^{(0)}} h^{(0)}_{qr} w^{(0)}_{rp} \Big). \tag{4.4}
\]
From Equation (4.3), the partial derivative of $z_{ik}$ with respect to $w^{(1)}_{ab}$ is as follows:
\[
\frac{\partial z_{ik}}{\partial w^{(1)}_{ab}} = \delta_{bk} \sum_{j \in \mathcal{N}(v_i)} h^{(1)}_{ja}. \tag{4.5}
\]
To compute the partial derivative of $z_{ik}$ with respect to $w^{(0)}_{ab}$, we use the chain rule as
follows:
\[
\frac{\partial z_{ik}}{\partial w^{(0)}_{ab}} = \sum_{j,p} \frac{\partial z_{ik}}{\partial h^{(1)}_{jp}} \, \frac{\partial h^{(1)}_{jp}}{\partial w^{(0)}_{ab}}, \tag{4.6}
\]
where the partial derivative of $z_{ik}$ with respect to $h^{(1)}_{jp}$ is:
\[
\frac{\partial z_{ik}}{\partial h^{(1)}_{jp}} =
\begin{cases}
w^{(1)}_{pk}, & \text{if } j \in \mathcal{N}(i) \\
0, & \text{otherwise},
\end{cases}
\]
while the partial derivative of $h^{(1)}_{jp}$ with respect to $w^{(0)}_{ab}$ depends on the choice of
$\sigma$. We discuss two popular choices, ReLU and tanh. If $\sigma$ is ReLU, then:
\[
\frac{\partial h^{(1)}_{jp}}{\partial w^{(0)}_{ab}} =
\begin{cases}
\delta_{bp} \sum_{q \in \mathcal{N}(j)} h^{(0)}_{qa}, & \text{if } \sum_{q \in \mathcal{N}(j)} \sum_{r=1}^{n^{(0)}} h^{(0)}_{qr} w^{(0)}_{rp} > 0 \\
0, & \text{otherwise}.
\end{cases}
\]
Thus, writing $B(j, b) = \sum_{q \in \mathcal{N}(j)} \sum_{r=1}^{n^{(0)}} h^{(0)}_{qr} w^{(0)}_{rb}$ for the
pre-activation,
\[
\frac{\partial z_{ik}}{\partial w^{(0)}_{ab}} = \sum_{\substack{j \in \mathcal{N}(i) \\ B(j,b) > 0}} w^{(1)}_{bk} \sum_{q \in \mathcal{N}(j)} h^{(0)}_{qa},
\]
with the terms for which $B(j, b) \le 0$ contributing zero.
Figure 4.1: Effect of adding skip connections in graph neural networks: adding a skip
connection (blue arrow) only helps mitigate the vanishing gradient risk from the node
itself. The risk of vanishing gradient from message passing through its neighbors still
remains. Figure is taken from [15]
If $\sigma$ is tanh, a similar computation gives
\[
\frac{\partial z_{ik}}{\partial w^{(0)}_{ab}} = \sum_{j \in \mathcal{N}(i)} \big( 1 - (h^{(1)}_{jb})^2 \big) \, w^{(1)}_{bk} \sum_{q \in \mathcal{N}(j)} h^{(0)}_{qa}. \tag{4.7}
\]
Skip connections, however, do not improve the breadth-wise backpropagation between
representations of neighboring nodes. Moreover, it is also shown in the paper that adding skip
connections in GNNs may hurt performance when the task requires learning patterns involving
distant nodes. Therefore, one must be careful when considering adding skip connections to a
graph neural network.
Then the cross-entropy loss for this classification problem is calculated as follows:
\[
L = - \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \ln z_{ij}. \tag{4.8}
\]
In particular, in a node classification problem, an example $E_i$ corresponds to a node
$u_i \in V$, and the output embedding vectors of all nodes are collected into an embedding matrix.
The cross-entropy loss for the node classification problem is then given by:
\[
L = - \sum_{u=1}^{N} \sum_{i=1}^{C} y_{ui} \ln z_{ui}. \tag{4.9}
\]
Another popular learning problem on graphs is link prediction. For example, graph-based
recommender systems consider a graph in which users and products are two types of nodes. By
utilizing the similarity among different users and different products, and the relations
between users and products, the recommender predicts which items a user is likely to buy next
and recommends these items to the user. There are two popular choices for the associated loss
function: the cross-entropy loss and the max-margin loss. Both are often used with negative
sampling [9].
Suppose we are given a simple undirected graph $G = (V, E)$. Denote by $z_u \in \mathbb{R}^{n^{(L)}}$ the
embedding of node $u$ at the final layer. We define the decoder function $D: V \times V \to \mathbb{R}$
as follows:
\[
D(u, v) = z_u^T z_v. \tag{4.11}
\]
The cross-entropy loss with negative sampling for the link prediction problem is given by
\[
L = \sum_{(u,v) \in E} \Big( - \log\big( \sigma( D(z_u, z_v) ) \big)
- \gamma \, \mathbb{E}_{v_n \sim P_{n,u}(V)}\big[ \log\big( \sigma( -D(z_u, z_{v_n}) ) \big) \big] \Big), \tag{4.12}
\]
where $\sigma$ is the logistic function, $P_{n,u}(V)$ denotes a "negative sampling" distribution over
the set of nodes $V$ which might depend on $u$, and $\gamma > 0$ is a hyperparameter.

The above formula can be decomposed into two components. The first component,
$\log(\sigma(D(z_u, z_v)))$, equals the log-likelihood that we correctly predict "true" for an
actual edge.
The second component, $\mathbb{E}_{v_n \sim P_{n,u}(V)}[\log(\sigma(-D(z_u, z_{v_n})))]$, equals the expected
log-likelihood that we correctly predict "false" for a false edge (an edge that does not exist
in the graph). In practice, the expectation is evaluated using a Monte Carlo approximation, and
the most popular form of this loss is
\[
L = \sum_{(u,v) \in E} \Big( - \log\big( \sigma( D(z_u, z_v) ) \big)
- \sum_{v_n \in P_{n,u}} \log\big( \sigma( -D(z_u, z_{v_n}) ) \big) \Big), \tag{4.13}
\]
where $P_{n,u}$ now denotes the set of sampled negative nodes for $u$.
The max-margin loss is given by
\[
L = \sum_{(u,v) \in E} \sum_{v_n \in P_{n,u}} \max\big( 0, \, \Delta - D(z_u, z_v) + D(z_u, z_{v_n}) \big), \tag{4.14}
\]
where $\Delta > 0$. The max-margin loss compares the direct outputs of the decoder: if the score
for the true pair is sufficiently larger than that of the false pair, the loss is small.
$\Delta$ is called the margin, and the loss for a pair equals 0 if the difference in scores is at
least $\Delta$.
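A short sketch of the max-margin loss described above for one batch of positive and negative pairs is given below; the tensor values and the margin are illustrative.

```python
import torch

def max_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - D(z_u, z_v) + D(z_u, z_vn)) over the batch."""
    return torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()

# Scores D(z_u, z_v) for true edges and D(z_u, z_vn) for sampled negatives
pos_scores = torch.tensor([2.3, 1.1, 0.7])
neg_scores = torch.tensor([0.4, 1.5, 0.9])
print(max_margin_loss(pos_scores, neg_scores))  # tensor(0.8667)
```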
4.3.2 Sampling
Simply using the full graph for training is expensive, especially for dense graphs. Moreover,
since each graph is different, adding unseen nodes to a graph would require starting training
from scratch. To address these issues, several GNN architectures use different sampling methods
to select a subset of nodes for message passing. There are many different ways to perform
sampling on graphs; most of them fall into one of three categories: node sampling, layer
sampling, and subgraph sampling.
Node sampling draws a fixed, small number of neighbors for each node. For example, as we can
see from (3.20), GraphSAGE performs neighborhood sampling and aggregation of embeddings over
the sampled nodes. This keeps the computational and memory complexity constant with respect to
the size of the graph.
PinSAGE is an extension of GraphSAGE to large graphs. Instead of random sampling, it uses an
importance-based sampling approach: the importance of a node $v$ to a node $u$ is proportional
to the number of times $v$ appears in random walks from $u$. We may then simply select the top
$T$ nodes with the highest importance scores to $u$, or sample $T$ nodes with probability
proportional to their visit counts. This importance-based approach has an interesting property:
a node $v$ which is not a direct neighbor of $u$ can be more important to $u$ than some of its
direct neighbors. Thus, the sampled neighborhood is not just a subset of the "traditional"
neighborhood of $u$ that can be deduced from the adjacency matrix $A$.
While node sampling methods sample neighbors for each node, layer sampling methods choose a
small set of nodes for aggregation in each layer to control the expansion factor. That means
the number of neighbors used for message aggregation can differ from node to node. FastGCN
(Chen et al., 2018b) directly samples the receptive field for each layer. It uses importance
sampling, where important nodes are more likely to be sampled. In contrast to the fixed
sampling methods above, Huang et al. (2018) introduce a parameterized and trainable sampler to
perform layer-wise sampling conditioned on the former layer. Furthermore, this adaptive sampler
can optimize the sampling importance and reduce variance simultaneously. LADIES (Zou et al.,
2019) aims to alleviate the sparsity issue in layer-wise sampling by generating samples from
the union of the neighbors of the nodes.
Meanwhile, subgraph sampling methods first sample subgraphs and then perform neighborhood
search within the sampled subgraphs. ClusterGCN (Chiang et al., 2019) samples subgraphs using
graph clustering algorithms, while GraphSAINT (Zeng et al., 2020) directly samples nodes or
edges to generate a subgraph.
Figure 4.2: DiffPool [23] for graph classification problems. At each hierarchical layer, a GNN
is run to obtain node embeddings. The learned embeddings are used to cluster nodes together,
and another GNN layer is then run on this coarsened graph. The whole process is repeated for
$L$ layers, and the final output representation is used to classify the graph.
4.3.4 Augmentation
Augmentation is the process of adding features, nodes, or edges to enrich the graph. When the
input graph does not have node features, we can assign unique IDs (one-hot vectors) to the
nodes. Sometimes there are extra properties that may be useful, such as node degrees, cycle
counts, or clustering coefficients; we can add these manually as well. Structure augmentation,
on the other hand, adds nodes or edges to the graph. When a graph is sparse, message passing
may be difficult; in that case, we can simply add more virtual edges, for example by using
$A + A^2$ as the adjacency matrix. Another option is to add a virtual node that connects to all
nodes in the graph.
we can see that in the case $W = W^{(0)} = W^{(1)} = \ldots = W^{(L)}$, the scale of the output will
be blown up or shrunk by a factor of approximately $\|W\|^L$.
Edge Dropout: The idea is to randomly mask edges during training, with the hope that this makes
the network more robust to noise in the adjacency matrix. The neighborhood sampling technique
used by GAT can also be considered a special case of edge dropout. For example, given a graph
$G$, DropEdge (Rong et al., 2020) first computes a new adjacency matrix $A^{(\ell)}_d$ for each layer,
\[
A^{(\ell)}_d = A - A'^{(\ell)}, \tag{4.15}
\]
where $A'^{(\ell)}$ is a matrix obtained from a random subset of the edges of $G$. Then the
renormalization trick is applied to obtain $\tilde{A}^{(\ell)}_{rd}$:
\[
\tilde{A}^{(\ell)}_{rd} = \hat{D}^{(\ell)-\frac{1}{2}} \big( A^{(\ell)}_d + I \big) \hat{D}^{(\ell)-\frac{1}{2}}, \tag{4.16}
\]
where $\hat{D}^{(\ell)}$ is the degree matrix associated with $A^{(\ell)}_d + I$. A single-layer formula of
GCN with edge dropout is then given by
\[
H^{(\ell+1)} = \sigma\big( \tilde{A}^{(\ell)}_{rd} H^{(\ell)} W^{(\ell)} \big). \tag{4.17}
\]
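A minimal sketch of the edge-dropout step of Equations (4.15)-(4.17) follows: drop a random subset of edges, renormalize, and apply one GCN layer. The drop rate and the symmetric masking of an undirected graph are illustrative choices.

```python
import torch

def drop_edges(A, drop_rate=0.2):
    """Randomly remove a fraction of the (undirected) edges from A."""
    upper = torch.triu(A, diagonal=1)
    keep = (torch.rand_like(upper) >= drop_rate).float()
    A_d = upper * keep
    return A_d + A_d.T                        # keep the matrix symmetric

def renormalize(A_d):
    """A_tilde_rd = D_hat^{-1/2} (A_d + I) D_hat^{-1/2}, Equation (4.16)."""
    A_hat = A_d + torch.eye(A_d.shape[0])
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

A = torch.zeros(5, 5)
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    A[u, v] = A[v, u] = 1.0

H, W = torch.randn(5, 8), torch.randn(8, 16)
A_tilde_rd = renormalize(drop_edges(A))
H_next = torch.relu(A_tilde_rd @ H @ W)       # Equation (4.17)
print(H_next.shape)  # torch.Size([5, 16])
```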
Chapter 5
Numerical Experiments
Table 5.1: User Features
Table 5.4: List of Python libraries used and their versions
Libraries/Packages Version
Python 3.9.7
PyTorch (torch) 1.11.0+cu113
torchtext 0.5.0
numpy 1.20.3
pandas 1.3.4
scipy 1.7.1
dgl 0.6.1
dask 2021.10.0
tqdm 4.62.3
5.1.2 Task
Given a training set including a list of users, a list of movies, and the lists of movies each
user has watched, we would like to predict the next movie a user will watch.
5.2 Environment
5.2.1 Libraries
The end-to-end model pipeline, including data preprocessing, model building, model training,
and performance evaluation, is written in Python. The full list of Python libraries used to
produce the experimental results is given in Table 5.4. Among them, PyTorch is a Python-based
scientific computing package serving as a replacement for NumPy that uses the power of GPUs and
other accelerators, together with an automatic differentiation library that is useful for
implementing neural networks [3]. Deep Graph Library (DGL) is a Python package built on top of
PyTorch for easy implementation of the graph neural network model family [1].
Table 5.5: Machine configuration
To accelerate the training, we utilized the power of GPU for parallel computing. In order
to do that, CUDA Toolkit (version 11.6) was downloaded and installed from NVIDIA’s
website.
The validation portion corresponding to user $u$ consists of the second-to-last movie watched
by user $u$:
\[
E^{\mathrm{val}}_u = \{ (u, v_{n_u - 1}) \}, \tag{5.5}
\]
and the test portion corresponding to user $u$ consists of the last movie watched by user $u$:
\[
E^{\mathrm{test}}_u = \{ (u, v_{n_u}) \}. \tag{5.6}
\]
The training graph is constructed as $G_T = (V_T, E_T)$, where $V_T = V$ and
$E_T = \bigcup_{u \in V_U} E^{\mathrm{train}}_u$. For $u \in V_U$ such that $|M(u)| = 0$, we do not have
validation and test portions. For $u \in V_U$ such that $|M(u)| = 1$, we do not have a validation
portion.
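The per-user leave-last-two-out split described above can be sketched as follows, using plain Python lists of (user, movie, timestamp) interactions; the data layout is an assumption for illustration.

```python
def split_interactions(interactions):
    """Per user: all but the last two watched movies go to training, the
    second-to-last to validation, and the last to test."""
    by_user = {}
    for user, movie, ts in interactions:
        by_user.setdefault(user, []).append((ts, movie))

    train, val, test = [], [], []
    for user, watched in by_user.items():
        movies = [m for _, m in sorted(watched)]      # order by timestamp
        train += [(user, m) for m in movies[:-2]]
        if len(movies) >= 2:
            val.append((user, movies[-2]))
        if len(movies) >= 1:
            test.append((user, movies[-1]))
    return train, val, test

interactions = [("u1", "m1", 10), ("u1", "m2", 20), ("u1", "m3", 30),
                ("u2", "m2", 5), ("u2", "m4", 15)]
print(split_interactions(interactions))
```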
5.3.2 Preprocessing
We keep all the features from the Users and Movies tables. Regarding data types, there are two
types of features in the dataset: numeric features and text features. Numeric features are
already ready for training, while text features (movie titles) are embedded into numeric
vectors using a single-layer neural network.

For the Ratings table, we omit the rating values; timestamps are only used to split the dataset
into train/validation/test portions. This means our graph $G$ is an unweighted graph.
A training batch is built from $N_B$ triplets of movies. First, $N_B$ movies
$M_H = \{m_1, \ldots, m_{N_B}\} \subseteq V$ are drawn randomly and without replacement; these movies
are referred to as the heads of the triplets. Next, $N_B$ movies
$M_P = \{p_1, \ldots, p_{N_B}\} \subseteq V$ are chosen randomly such that $d(m_i, p_i) = 2$, where
$d(\cdot, \cdot)$ is the shortest-path distance between two nodes in the graph; in words, $p_i$ is
chosen such that $p_i$ and $m_i$ were co-interacted with by some user. These movies are referred
to as the positive tails of the triplets. Finally, another $N_B$ movies
$M_N = \{n_1, \ldots, n_{N_B}\} \subseteq V$ are chosen randomly such that $d(m_i, n_i) > 2$, i.e. $n_i$
is chosen such that $n_i$ and $m_i$ are not co-interacted with by any user. These movies are
referred to as the negative tails of the triplets.
The set of nodes of the positive graph $P_B$ is the union of the heads and the positive tails,
$V^P_B = M_H \cup M_P$, and the corresponding set of edges is
$E^P_B = \{ (u, v) \mid u \in M_H, v \in M_P \}$. Similarly, the set of nodes of the negative graph
$N_B$ is the union of the heads and the negative tails, $V^N_B = M_H \cup M_N$, and the
corresponding set of edges is $E^N_B = \{ (u, v) \mid u \in M_H, v \in M_N \}$.
We choose Adam as the training optimizer.
where $h_u, h_v \in \mathbb{R}^{n^{(L)}}$ are the output embedding vectors of node $u$ and node $v$.
The score of a graph $R = (V_R, E_R)$ (which can be positive or negative) is calculated as
follows:
\[
s_G(R) = \sum_{(u,v) \in E_R} s(u, v) \in \mathbb{R}. \tag{5.8}
\]
The loss over a training batch is given by the difference between the negative graph's score
and the positive graph's score.
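A hedged sketch of this batch loss is given below; since the original equation is not reproduced here, the dot-product score $s(u, v) = h_u^T h_v$ is an assumption consistent with the decoder of Chapter 4, and the edge lists are illustrative.

```python
import torch

def graph_score(edges, H):
    """s_G(R) = sum of s(u, v) over the edges of R, with s(u, v) = h_u^T h_v."""
    return sum((H[u] * H[v]).sum() for u, v in edges)

def batch_loss(pos_edges, neg_edges, H):
    """Score of the negative graph minus the score of the positive graph."""
    return graph_score(neg_edges, H) - graph_score(pos_edges, H)

H = torch.randn(6, 8)                         # output embeddings of 6 movies
pos_edges = [(0, 1), (2, 3)]                  # head -> positive tail
neg_edges = [(0, 4), (2, 5)]                  # head -> negative tail
print(batch_loss(pos_edges, neg_edges, H))
```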
5.4 Evaluation
For this task, we aim to predict the next movie a user will watch. For each user $u \in V_U$,
let $l_u \in V_M$ be the last movie that user $u$ watched. The model returns the $K$ movies that
are nearest neighbors of $l_u$ (in terms of Euclidean distance) and serves them as
recommendations to user $u$.
5.4.1 Hit@K
For a given user $u \in V_U$, let $a_u \in V_M$ be the next movie that $u$ will watch, and let
$R_u$ be the set of recommended movies for $u$, where $|R_u| = K$. The score of our model
corresponding to user $u$ is given by:
\[
s_U(u) =
\begin{cases}
1, & a_u \in R_u \\
0, & \text{otherwise}.
\end{cases}
\]
The Hit@K (hit at $K$) metric of the model is calculated as:
\[
\mathrm{Hit@}K = \sum_{u \in V_U} s_U(u), \tag{5.10}
\]
or, in words, the number of users whose relevant item is in their recommended list of size $K$.
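A small sketch of the Hit@K computation of Equation (5.10) follows; representing the recommendation lists as Python sets is an illustrative choice.

```python
def hit_at_k(actual_next, recommendations):
    """Number of users whose actual next movie appears in their top-K list."""
    return sum(1 for user, movie in actual_next.items()
               if movie in recommendations.get(user, set()))

actual_next = {"u1": "m3", "u2": "m7", "u3": "m1"}
recommendations = {"u1": {"m2", "m3", "m9"},     # |R_u| = K = 3
                   "u2": {"m4", "m5", "m6"},
                   "u3": {"m1", "m8", "m2"}}
print(hit_at_k(actual_next, recommendations))    # 2
```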
In the original PinSAGE model, some hyperparameters are fixed throughout the training phase,
including the restart probability and the number of random walks. It can be argued that there
is no single best hyperparameter value for all minibatch graphs: the optimal values differ and
should depend on the size or density of the graph.

In the next sections, we propose two possible improvements to the original model. The ideas are
deliberately simple; they serve to test the hypotheses we propose. These improvements are
deterministic and involve no additional randomness. Furthermore, for all the experiments, we
keep the random seed of all the libraries/packages the same and equal to 1, so the results can
easily be reproduced if necessary. More details can be found in Table 5.9.
\[
p = p_0 - \frac{|E_m|}{|E|}. \tag{5.11}
\]
Under our training settings, with different hyperparameters, $|E_m|$ can range anywhere from
1,000 to 100,000, while for the MovieLens dataset $|E| \approx 1{,}000{,}000$. The set of
hyperparameters used for this model is reported in Table 5.7. The result of our experiment is
shown in Figure 5.4.
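A one-line sketch of the adaptive restart probability of Equation (5.11) is given below; the clamping to $[0, 1]$ is an added safeguard, not something stated in the text.

```python
def adaptive_restart_prob(p0, num_batch_edges, num_graph_edges):
    """p = p0 - |E_m| / |E|, kept within [0, 1] as a safeguard."""
    p = p0 - num_batch_edges / num_graph_edges
    return min(max(p, 0.0), 1.0)

# |E_m| ranges roughly from 1,000 to 100,000 and |E| is about 1,000,000
print(adaptive_restart_prob(0.5, 50_000, 1_000_000))   # 0.45
```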
appropriate neighborhoods for the nodes. Let $|E_m|$ be the number of edges of the minibatch
graph. Then the number of random walks that we perform is given by:
\[
N_r =
\begin{cases}
10, & \text{if } |E_m| < 50{,}000 \\
30, & \text{otherwise}.
\end{cases}
\]
We compare this new model with the original model, where $N_r$ is fixed at 20. The results are
shown in Figure 5.5 and reported in Table xx. The set of hyperparameters used for this model is
reported in Table 5.8. The new model does not quite outperform the original one. This is
expected: once we perform a reasonably large number of random walks, performing a few more or
fewer makes little difference.
5.6 Observations
We recall that the training set consists of all but the last two movies (ordered by timestamp)
for each user. The validation set consists of the second-to-last movie, and the test set
consists of the last movie that each user watched. Therefore, it is expected that in most cases
a model's performance on the validation set is better than on the test set. Our experiments
confirm this empirically, for different values of $K$. From these experiments, we can conclude
that in a recommender system task, the time sensitivity of the data is very important:
predicting the next movie is much easier than predicting the movie after that.

It can also be seen from the numerical experiments that as $K$ gets bigger, the Hit@K metrics on
the validation set and test set get much closer. This suggests that our model is effective:
most of our correct predictions are near the top of the recommendation lists.
It can be seen from the ARP experiment that the ARP model outperforms the original model with a
fixed restart probability, even though the modification is very simple and does not introduce
additional parameters. This result suggests that the optimal choice of the restart probability
of the random walk algorithm should depend on the density of the edges: a denser graph should
have a lower restart probability, so that we can more effectively take advantage of information
from all the neighboring nodes.

From the VNRW experiment, we can see that the VNRW model is somewhat better than the original
model with a fixed number of random walks, but the result is not as convincing. There are
multiple times during training when the VNRW model is behind, and the variance in performance
gain/loss across training iterations is large. One possible explanation is that once we perform
a reasonably large number of random walks, performing a few more or fewer makes little
difference.
Figure 5.1: Numerical result for Experiment 1 (Original PinSAGE), K = 5
Figure 5.3: Numerical result for Experiment 3 (Original PinSAGE), K = 200
Figure 5.5: Experiment 5: Original PinSAGE vs Variable Number of Random Walks
method
Table 5.7: Hyperparameters for the Adaptive Restart Probability (ARP) method
Table 5.8: Hyperparameters for the Variable Random Walk Lengths (VRWL) method
Table 5.9: Random seeds set for libraries/packages that involve randomness
Package/Library Command
DGL dgl.seed(1)
DGL Random dgl.random.seed(1)
PyTorch torch.manual_seed(1)
PyTorch CUDA torch.cuda.manual_seed(1)
PyTorch CUDA (multiple GPUs) torch.cuda.manual_seed_all(1)
PyTorch Backend torch.backends.cudnn.deterministic=True
NumPy Random np.random.seed(1)
Python Random random.seed(1)
Chapter 6
Conclusions
6.1 Summary
The contribution of this thesis is twofold. First, we present a systematic way to write down
popular GNN architectures, together with the mathematical proofs and foundations that are
lacking in some original works; we also discuss the pros, cons, and possible use cases of each
architecture. Second, we implement a PinSAGE architecture to build a recommendation model for a
user-product recommendation problem, propose additional improvements, and run additional
experiments to improve the result of the baseline model. Our experiments show that PinSAGE can
be well adapted to a large-scale recommendation problem, and that there are still some areas
for improvement.
Future research efforts could be directed in this direction, aside from exploring other
innovative ways to sample a node's neighborhood.
References
[6] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural
networks on graphs with fast localized spectral filtering. Advances in Neural Informa-
tion Processing Systems 29, 8:175–191, 2016.
[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016. http://www.deeplearningbook.org.
[8] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning
on large graphs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 30. Curran Associates, Inc., 2017.
[10] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning
on large graphs. CoRR, abs/1706.02216, 2017.
[11] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convo-
lutional networks. ICLR 2017.
[13] Fei-Fei Li. Lecture notes in cs231n: Convolutional neural networks for visual recogni-
tion.
[14] Zhiyuan Liu and Jie Zhou. Introduction to Graph Neural Networks. Synthesis Lectures
on Artificial Intelligence and Machine Learning. 2020.
[15] Denis Lukovnikov and Asja Fischer. Improving breadth-wise backpropagation in graph
neural networks helps learning long-range dependencies. In Marina Meila and Tong
Zhang, editors, Proceedings of the 38th International Conference on Machine Learning,
volume 139 of Proceedings of Machine Learning Research, pages 7180–7191. PMLR,
18–24 Jul 2021.
[16] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement, 2018.
[17] Luis G. Serrano. Grokking Machine Learning. Manning Publications, November 2021.
[18] David I Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Van-
dergheynst. The emerging field of signal processing on graphs: Extending high-
dimensional data analysis to networks and other irregular domains. IEEE Signal
Processing Magazine, 30(3):83–98, 2013.
[19] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Proceedings
of the 18th Annual Conference on Learning Theory, COLT’05, page 545–560, Berlin,
Heidelberg, 2005. Springer-Verlag.
[20] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò,
and Yoshua Bengio. Graph attention networks. 2017.
[21] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Wein-
berger. Simplifying graph convolutional networks. In Kamalika Chaudhuri and Ruslan
Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine
Learning, volume 97 of Proceedings of Machine Learning Research, pages 6861–6871.
PMLR, 09–15 Jun 2019.
[22] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton,
and Jure Leskovec. Graph convolutional neural networks for web-scale recommender
systems. CoRR, abs/1806.01973, 2018.
[23] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and
Jure Leskovec. Hierarchical graph representation learning with differentiable pooling.
CoRR, abs/1806.08804, 2018.
[24] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu,
Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review
of methods and applications. AI Open, 1:57–81, 2020.