Simplifying Graph Convolutional Networks
Felix Wu* 1, Tianyi Zhang* 1, Amauri Holanda de Souza Jr.* 1 2, Christopher Fifty 1, Tao Yu 1, Kilian Q. Weinberger 1
* Equal contribution. 1 Cornell University, 2 Federal Institute of Ceará (Brazil). Correspondence to: Felix Wu <fw245@cornell.edu>, Tianyi Zhang <tz58@cornell.edu>. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

In contrast to its nonlinear counterparts, the SGC is intuitively interpretable, and we provide a theoretical analysis from the graph convolution perspective. Notably, feature extraction in SGC corresponds to a single fixed filter applied to each feature dimension. Kipf & Welling (2017) empirically observe that the "renormalization trick", i.e., adding self-loops to the graph, improves accuracy, and we demonstrate that this method effectively shrinks the graph spectral domain, resulting in a low-pass-type filter when applied to SGC. Crucially, this filtering operation gives rise to locally smooth features across the graph (Bruna et al., 2014).

Through an empirical assessment on node classification benchmark datasets for citation and social networks, we show that SGC achieves performance comparable to GCN and other state-of-the-art graph neural networks. However, it is significantly faster, and even outperforms FastGCN (Chen et al., 2018) by up to two orders of magnitude on the largest dataset (Reddit) in our evaluation. Finally, we demonstrate that SGC extends its effectiveness to a wide range of downstream tasks. In particular, SGC rivals, if not surpasses, GCN-based approaches on text classification, user geolocation, relation extraction, and zero-shot image classification tasks. The code is available at https://github.com/Tiiiger/SGC.
[Figure 1: an input graph with features H^(0) = X = [x1, . . . , xn]^⊤ is processed either by a K-layer GCN, producing ŶGCN = softmax(S H^(K−1) Θ^(K)), or by the SGC pipeline, which precomputes X̄ = S^K X and applies logistic regression, ŶSGC = softmax(X̄ Θ).]
Figure 1. Schematic layout of a GCN vs. an SGC. Top row: the GCN transforms the feature vectors repeatedly throughout K layers and then applies a linear classifier on the final representation. Bottom row: the SGC reduces the entire procedure to a simple feature propagation step followed by standard logistic regression.
2. Simple Graph Convolution

We follow Kipf & Welling (2017) to introduce GCNs (and subsequently SGC) in the context of node classification. Here, GCNs take a graph with some labeled nodes as input and generate label predictions for all graph nodes. Let us formally define such a graph as G = (V, A), where V represents the vertex set consisting of nodes {v1, . . . , vn}, and A ∈ R^{n×n} is a symmetric (typically sparse) adjacency matrix where aij denotes the edge weight between nodes vi and vj. A missing edge is represented through aij = 0. We define the degree matrix D = diag(d1, . . . , dn) as a diagonal matrix where each entry on the diagonal is equal to the row-sum of the adjacency matrix, di = Σj aij.

Each node vi in the graph has a corresponding d-dimensional feature vector xi ∈ R^d. The entire feature matrix X ∈ R^{n×d} stacks the n feature vectors on top of one another, X = [x1, . . . , xn]^⊤. Each node belongs to one out of C classes and can be labeled with a C-dimensional one-hot vector yi ∈ {0, 1}^C. We only know the labels of a subset of the nodes and want to predict the unknown labels.
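To make the notation concrete, the objects defined above can be instantiated for a small toy graph in a few lines of Python; the graph, features, and labels below are illustrative values chosen for this sketch, not data from the paper.

    import numpy as np

    # Toy undirected graph with n = 4 nodes on a cycle: edges (0,1), (1,2), (2,3), (3,0).
    A = np.array([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])        # symmetric adjacency matrix A
    d = A.sum(axis=1)                       # degrees d_i = sum_j a_ij
    D = np.diag(d)                          # degree matrix D = diag(d_1, ..., d_n)

    X = np.random.randn(4, 2)               # feature matrix X in R^{n x d}, row i is x_i
    y = np.array([0, 1, 0, 1])              # node labels, one of C = 2 classes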
2.1. Graph Convolutional Networks

Similar to CNNs or MLPs, GCNs learn a new feature representation for the feature xi of each node over multiple layers, which is subsequently used as input to a linear classifier. For the k-th graph convolution layer, we denote the input node representations of all nodes by the matrix H^(k−1) and the output node representations by H^(k). Naturally, the initial node representations are just the original input features,

    H^(0) = X,   (1)

which serve as input to the first GCN layer.

A K-layer GCN is identical to applying a K-layer MLP to the feature vector xi of each node in the graph, except that the hidden representation of each node is averaged with those of its neighbors at the beginning of each layer. In each graph convolution layer, node representations are updated in three stages: feature propagation, linear transformation, and a pointwise nonlinear activation (see Figure 1). For the sake of clarity, we describe each step in detail.

Feature propagation is what distinguishes a GCN from an MLP. At the beginning of each layer, the features hi of each node vi are averaged with the feature vectors in its local neighborhood,

    h̄i^(k) ← (1/(di + 1)) hi^(k−1) + Σ_{j=1}^{n} ( aij / √((di + 1)(dj + 1)) ) hj^(k−1).   (2)

More compactly, we can express this update over the entire graph as a simple matrix operation. Let S denote the "normalized" adjacency matrix with added self-loops,

    S = D̃^{-1/2} Ã D̃^{-1/2},   (3)

where Ã = A + I and D̃ is the degree matrix of Ã. The simultaneous update in Equation 2 for all nodes then becomes a simple sparse matrix multiplication,

    H̄^(k) ← S H^(k−1).   (4)

Intuitively, this step smoothes the hidden representations locally along the edges of the graph and ultimately encourages similar predictions among locally connected nodes.
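As an illustrative sketch (not the released implementation), the propagation operator of Equations 3 and 4 can be built with sparse matrices as follows; scipy and the helper name normalized_adjacency are assumptions of this sketch.

    import numpy as np
    import scipy.sparse as sp

    def normalized_adjacency(A: sp.spmatrix) -> sp.spmatrix:
        """S = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}  (Equation 3)."""
        A_tilde = A + sp.eye(A.shape[0], format="csr")       # add self-loops
        d_tilde = np.asarray(A_tilde.sum(axis=1)).ravel()     # degrees of A_tilde
        d_inv_sqrt = sp.diags(1.0 / np.sqrt(d_tilde))
        return d_inv_sqrt @ A_tilde @ d_inv_sqrt

    # Example: one propagation step H_bar = S H (Equation 4) on a random toy graph.
    A = sp.random(100, 100, density=0.05, format="csr")
    A = A + A.T                                               # symmetrize
    H = np.random.randn(100, 16)
    S = normalized_adjacency(A)
    H_bar = S @ H                                             # local feature smoothing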
Feature transformation and nonlinear transition. After the local smoothing, a GCN layer is identical to a standard MLP. Each layer is associated with a learned weight matrix Θ^(k), and the smoothed hidden feature representations are transformed linearly. Finally, a nonlinear activation function such as ReLU is applied pointwise before outputting the feature representation H^(k). In summary, the representation updating rule of the k-th layer is

    H^(k) ← ReLU( H̄^(k) Θ^(k) ).   (5)

The pointwise nonlinear transformation of the k-th layer is followed by the feature propagation of the (k+1)-th layer.

Classifier. For node classification, and similar to a standard MLP, the last layer of a GCN predicts the labels using a softmax classifier. Denote the class predictions for the n nodes as Ŷ ∈ R^{n×C}, where ŷic denotes the probability that node i belongs to class c. The class prediction Ŷ of a K-layer GCN can be written as

    ŶGCN = softmax( S H^(K−1) Θ^(K) ),   (6)

where softmax(x) = exp(x) / Σ_{c=1}^{C} exp(xc) acts as a normalizer across all classes.
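For concreteness, Equations 4 to 6 can be sketched as a minimal dense PyTorch module; the layer sizes, the dense S, and all names below are simplifications made for this illustration rather than the reference implementation.

    import torch
    import torch.nn as nn

    class GCN(nn.Module):
        """Minimal K-layer GCN: smoothing (S H), linear transform, ReLU, softmax classifier."""
        def __init__(self, in_dim: int, hidden: int, n_classes: int, K: int = 2):
            super().__init__()
            dims = [in_dim] + [hidden] * (K - 1) + [n_classes]
            self.layers = nn.ModuleList(
                [nn.Linear(dims[k], dims[k + 1], bias=False) for k in range(K)])

        def forward(self, S: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
            H = X
            for k, layer in enumerate(self.layers):
                H = layer(S @ H)                      # H_bar = S H, then H_bar Theta^(k)
                if k < len(self.layers) - 1:
                    H = torch.relu(H)                 # pointwise nonlinearity (Equation 5)
            return torch.log_softmax(H, dim=1)        # class (log-)probabilities (Equation 6)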
2.2. Simple Graph Convolution

In a traditional MLP, deeper layers increase the expressivity because they allow the creation of feature hierarchies, e.g., features in the second layer build on top of the features of the first layer. In GCNs, the layers have a second important function: in each layer the hidden representations are averaged among neighbors that are one hop away. This implies that after k layers a node obtains feature information from all nodes that are k hops away in the graph. This effect is similar to convolutional neural networks, where depth increases the receptive field of internal features (Hariharan et al., 2015). Although convolutional networks can benefit substantially from increased depth (Huang et al., 2016), MLPs typically obtain little benefit beyond 3 or 4 layers.

Linearization. We hypothesize that the nonlinearity between GCN layers is not critical, but that the majority of the benefit arises from the local averaging. We therefore remove the nonlinear transition functions between each layer and only keep the final softmax (in order to obtain probabilistic outputs). The resulting model is linear, but still has the same increased "receptive field" of a K-layer GCN,

    Ŷ = softmax( S . . . S S X Θ^(1) Θ^(2) . . . Θ^(K) ).   (7)

To simplify notation, we can collapse the repeated multiplication with the normalized adjacency matrix S into a single matrix by raising S to the K-th power, S^K. Further, we can reparameterize our weights into a single matrix Θ = Θ^(1) Θ^(2) . . . Θ^(K). The resulting classifier becomes

    ŶSGC = softmax( S^K X Θ ),   (8)

which we refer to as Simple Graph Convolution (SGC).

Logistic regression. Equation 8 gives rise to a natural and intuitive interpretation of SGC: by distinguishing between feature extraction and classifier, SGC consists of a fixed (i.e., parameter-free) feature extraction/smoothing component X̄ = S^K X followed by a linear logistic regression classifier Ŷ = softmax(X̄ Θ). Since the computation of X̄ requires no weights, it is essentially equivalent to a feature pre-processing step, and the entire training of the model reduces to straightforward multi-class logistic regression on the pre-processed features X̄.

Optimization details. The training of logistic regression is a well-studied convex optimization problem and can be performed with any efficient second-order method or stochastic gradient descent (Bottou, 2010). Provided the graph connectivity pattern is sufficiently sparse, SGD naturally scales to very large graph sizes, and the training of SGC is drastically faster than that of GCNs.
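In code, Equation 8 amounts to a one-off pre-processing step followed by a plain linear classifier; the sketch below is an illustration under that reading, not the released implementation.

    import torch
    import torch.nn as nn

    def sgc_precompute(S: torch.Tensor, X: torch.Tensor, K: int) -> torch.Tensor:
        """X_bar = S^K X: parameter-free feature smoothing, computed once before training."""
        for _ in range(K):
            X = S @ X
        return X

    class SGC(nn.Module):
        """Multi-class logistic regression on the smoothed features: softmax(X_bar Theta)."""
        def __init__(self, n_features: int, n_classes: int):
            super().__init__()
            self.W = nn.Linear(n_features, n_classes)

        def forward(self, X_bar: torch.Tensor) -> torch.Tensor:
            return self.W(X_bar)   # logits; the softmax is folded into the cross-entropy loss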
3. Spectral Analysis

Let ∆ denote a graph Laplacian with eigendecomposition ∆ = U Λ U^⊤. Graph convolutions act as spectral filters, g ∗ x = U Ĝ U^⊤ x, where Ĝ = diag(ĝ1, . . . , ĝn) denotes a diagonal matrix in which the diagonal corresponds to spectral filter coefficients.

Graph convolutions can be approximated by k-th order polynomials of Laplacians,

    U Ĝ U^⊤ x ≈ ( Σ_{i=0}^{k} θi ∆^i ) x = U ( Σ_{i=0}^{k} θi Λ^i ) U^⊤ x,   (10)

where θi denotes coefficients. In this case, filter coefficients correspond to polynomials of the Laplacian eigenvalues, i.e., Ĝ = Σi θi Λ^i or, equivalently, ĝ(λj) = Σi θi λj^i.

Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017) employ an affine approximation (k = 1) of Equation 10 with coefficients θ0 = 2θ and θ1 = −θ, from which we attain the basic GCN convolution operation

    g ∗ x = θ( I + D^{-1/2} A D^{-1/2} ) x.   (11)
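To illustrate Equation 10, the sketch below filters a toy graph signal with a polynomial of the normalized Laplacian; the graph and the coefficients theta are arbitrary illustrative choices.

    import numpy as np

    # Toy symmetric adjacency and its normalized Laplacian Delta_sym = I - D^{-1/2} A D^{-1/2}.
    A = np.array([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(4) - d_inv_sqrt @ A @ d_inv_sqrt

    lam, U = np.linalg.eigh(L)                 # Delta_sym = U Lambda U^T
    theta = np.array([0.5, -0.3, 0.1])         # arbitrary polynomial coefficients theta_0..theta_2
    x = np.random.randn(4)                     # a graph signal

    g_lam = sum(t * lam**i for i, t in enumerate(theta))  # g_hat(lambda_j) = sum_i theta_i lambda_j^i
    filtered = U @ (g_lam * (U.T @ x))                    # U g_hat(Lambda) U^T x  (Equation 10)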
In their final design, Kipf & Welling (2017) replace the matrix I + D^{-1/2} A D^{-1/2} by a normalized version D̃^{-1/2} Ã D̃^{-1/2}, where Ã = A + I and consequently D̃ = D + I, dubbed the renormalization trick.

We now analyze the spectrum of ∆̃sym and show that adding self-loops to graphs shrinks the spectrum (eigenvalues) of the corresponding normalized Laplacian.

Theorem 1. Let A be the adjacency matrix of an undirected, weighted, simple graph G without isolated nodes and with corresponding degree matrix D. Let Ã = A + γI, such that γ > 0, be the augmented adjacency matrix with corresponding degree matrix D̃. Also, let λ1 and λn denote the smallest and largest eigenvalues of ∆sym = I − D^{-1/2} A D^{-1/2}; similarly, let λ̃1 and λ̃n be the smallest and largest eigenvalues of ∆̃sym = I − D̃^{-1/2} Ã D̃^{-1/2}. We have that

    0 = λ1 = λ̃1 < λ̃n < λn.   (12)

Theorem 1 shows that the largest eigenvalue of the normalized graph Laplacian becomes smaller after adding self-loops γ > 0 (see the supplementary material for the proof). Figure 2 depicts the filtering operations associated with the normalized adjacency Sadj = D^{-1/2} A D^{-1/2} and its augmented variant S̃adj = D̃^{-1/2} Ã D̃^{-1/2} on the Cora dataset (Sen et al., 2008). Feature propagation with Sadj corresponds to filters g(λi) = (1 − λi)^K in the spectral range [0, 2]; therefore odd powers of Sadj yield negative filter coefficients at frequencies λi > 1. Adding self-loops (S̃adj) shrinks the largest eigenvalue from 2 to approximately 1.5, which eliminates the effect of negative coefficients. Moreover, this scaled spectrum allows the filter defined by taking powers K > 1 of S̃adj to act as a low-pass-type filter. In the supplementary material, we empirically evaluate different choices for the propagation matrix.
[Figure 2: spectral coefficient vs. eigenvalue λ over [0, 2], shown in three panels for different propagation matrices.]
Figure 2. Feature (red) and filters (blue) spectral coefficients for different propagation matrices on Cora dataset (3rd feature).
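The qualitative effect can be reproduced on a small toy graph; the sketch below is an illustration and does not use Cora. Adding self-loops shrinks the largest eigenvalue of the normalized Laplacian, which in turn dampens the negative filter coefficients of (1 − λ)^K.

    import numpy as np

    def normalized_laplacian_spectrum(A: np.ndarray, gamma: float = 0.0) -> np.ndarray:
        """Eigenvalues of I - D_tilde^{-1/2} (A + gamma I) D_tilde^{-1/2} (gamma = 0 gives Delta_sym)."""
        A_t = A + gamma * np.eye(A.shape[0])
        d_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
        return np.linalg.eigvalsh(np.eye(A.shape[0]) - d_inv_sqrt @ A_t @ d_inv_sqrt)

    # Toy graph with no isolated nodes (a path with one extra chord).
    A = np.array([[0., 1., 1., 0., 0.],
                  [1., 0., 1., 0., 0.],
                  [1., 1., 0., 1., 0.],
                  [0., 0., 1., 0., 1.],
                  [0., 0., 0., 1., 0.]])

    lam = normalized_laplacian_spectrum(A)             # spectrum of Delta_sym, contained in [0, 2]
    lam_tilde = normalized_laplacian_spectrum(A, 1.0)  # spectrum of Delta_tilde_sym: smaller maximum

    K = 3
    coeff = (1 - lam) ** K               # S_adj^K filter: negative at lambda > 1 for odd K
    coeff_tilde = (1 - lam_tilde) ** K   # augmented filter: shrunken spectrum dampens the sign flips
    print(lam.max(), lam_tilde.max())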
4. Related Works

4.1. Graph Neural Networks

Bruna et al. (2014) first propose a spectral graph-based extension of convolutional networks to graphs. In a follow-up work, ChebyNets (Defferrard et al., 2016) define graph convolutions using Chebyshev polynomials to remove the computationally expensive Laplacian eigendecomposition. GCNs (Kipf & Welling, 2017) further simplify graph convolutions by stacking layers of first-order Chebyshev polynomial filters with a redefined propagation matrix S. Chen et al. (2018) propose an efficient variant of GCN based on importance sampling, and Hamilton et al. (2017) propose a framework based on sampling and aggregation. Atwood & Towsley (2016), Abu-El-Haija et al. (2018), and Liao et al. (2019) exploit multi-scale information by raising S to higher orders. Xu et al. (2019) study the expressiveness of graph neural networks in terms of their ability to distinguish any two graphs and introduce the Graph Isomorphism Network, which is proved to be as powerful as the Weisfeiler-Lehman test for graph isomorphism. Klicpera et al. (2019) separate the non-linear transformation from propagation by using a neural network followed by a personalized random walk. There are many other graph neural models (Monti et al., 2017; Duran & Niepert, 2017; Li et al., 2018); we refer to Zhou et al. (2018); Battaglia et al. (2018); Wu et al. (2019) for a more comprehensive review.

Previous publications have pointed out that simpler, sometimes linear, models can be effective for node/graph classification tasks. Thekumparampil et al. (2018) empirically show that a linear version of GCN can perform competitively and propose an attention-based GCN variant. Cai & Wang (2018) propose an effective linear baseline for graph classification using node degree statistics. Eliav & Cohen (2018) show that models which use linear feature/label propagation steps can benefit from self-training strategies. Li et al. (2019) propose a generalized version of label propagation and provide a similar spectral analysis of the renormalization trick.

Graph Attentional Models learn to assign different edge weights at each layer based on node features and have achieved state-of-the-art results on several graph learning tasks (Veličković et al., 2018; Thekumparampil et al., 2018; Zhang et al., 2018a; Kampffmeyer et al., 2018). However, the attention mechanism usually adds significant overhead in computation and memory usage. We refer the reader to Lee et al. (2018) for further comparison.

4.2. Other Works on Graphs

Graph methodologies can roughly be categorized into two approaches: graph embedding methods and graph Laplacian regularization methods. Graph embedding methods (Weston et al., 2008; Perozzi et al., 2014; Yang et al., 2016; Veličković et al., 2019) represent nodes as high-dimensional feature vectors. Among them, DeepWalk (Perozzi et al., 2014) and Deep Graph Infomax (DGI) (Veličković et al., 2019) use unsupervised strategies to learn graph embeddings. DeepWalk relies on truncated random walks and uses a skip-gram model to generate embeddings, whereas DGI trains a graph convolutional encoder through maximizing mutual information. Graph Laplacian regularization methods (Zhu et al., 2003; Zhou et al., 2004; Belkin & Niyogi, 2004; Belkin et al., 2006) introduce a regularization term based on the graph structure which forces nodes to have labels similar to their neighbors. Label Propagation (Zhu et al., 2003) makes predictions by spreading label information from labeled nodes to their neighbors until convergence.
[Figure 3: Test Acc (%) on Pubmed and Test F1 (%) on Reddit plotted against relative training time (log scale) for SGC, GCN, GAT, DGI, FastGCN, GIN, LNet, AdaLNet, GaAN, and the GraphSAGE variants, with per-model training-time multipliers relative to SGC (1x).]
Figure 3. Performance over training time on Pubmed and Reddit. SGC is the fastest while achieving competitive performance. We are not
able to benchmark the training time of GaAN and DGI on Reddit because the implementations are not released.
Table 1. Dataset statistics of the citation networks and Reddit.

    Dataset    # Nodes   # Edges   Train/Dev/Test Nodes
    Cora         2,708     5,429   140/500/1,000
    Citeseer     3,327     4,732   120/500/1,000
    Pubmed      19,717    44,338   60/500/1,000
    Reddit        233K     11.6M   152K/24K/55K

Table 2. Test accuracy (%) averaged over 10 runs on citation networks. † We remove the outliers (accuracy < 75/65/75%) when calculating their statistics due to high variance.

                     Cora          Citeseer      Pubmed
    Numbers from literature:
    GCN              81.5          70.3          79.0
    GAT              83.0 ± 0.7    72.5 ± 0.7    79.0 ± 0.3
    GLN              81.2 ± 0.1    70.9 ± 0.1    78.9 ± 0.1
    AGNN             83.1 ± 0.1    71.7 ± 0.1    79.9 ± 0.1
    LNet             79.5 ± 1.8    66.2 ± 1.9    78.3 ± 0.3
    AdaLNet          80.4 ± 1.1    68.7 ± 1.0    78.1 ± 0.4
    DeepWalk         70.7 ± 0.6    51.4 ± 0.5    76.8 ± 0.6
    DGI              82.3 ± 0.6    71.8 ± 0.7    76.8 ± 0.6
    Our experiments:
    GCN              81.4 ± 0.4    70.9 ± 0.5    79.0 ± 0.4
    GAT              83.3 ± 0.7    72.6 ± 0.6    78.5 ± 0.3
    FastGCN          79.8 ± 0.3    68.8 ± 0.6    77.4 ± 0.3
    GIN              77.6 ± 1.1    66.1 ± 0.9    77.0 ± 1.2
    LNet             80.2 ± 3.0†   67.3 ± 0.5    78.3 ± 0.6†
    AdaLNet          81.9 ± 1.9†   70.6 ± 0.8†   77.8 ± 0.7†
    DGI              82.5 ± 0.7    71.6 ± 0.7    78.4 ± 0.7
    SGC              81.0 ± 0.0    71.9 ± 0.1    78.9 ± 0.0

Table 3. Test micro F1 score (%) averaged over 10 runs on Reddit. Performances of models are cited from their original papers. OOM: out of memory.

    Setting        Model             Test F1
    Supervised     GaAN              96.4
                   SAGE-mean         95.0
                   SAGE-LSTM         95.4
                   SAGE-GCN          93.0
                   FastGCN           93.7
                   GCN               OOM
    Unsupervised   SAGE-mean         89.7
                   SAGE-LSTM         90.7
                   SAGE-GCN          90.8
                   DGI               94.0
    No Learning    Random-Init DGI   93.3
                   SGC               94.9

5. Experiments and Discussion

We first evaluate SGC on citation networks and social networks, and then extend our empirical analysis to a wide range of downstream tasks.

5.1. Citation Networks & Social Networks

We evaluate the semi-supervised node classification performance of SGC on the Cora, Citeseer, and Pubmed citation network datasets (Table 2) (Sen et al., 2008). We supplement our citation network analysis by using SGC to inductively predict community structure on Reddit (Table 3), which consists of a much larger graph. Dataset statistics are summarized in Table 1.

Datasets and experimental setup. On the citation networks, we train SGC for 100 epochs using Adam (Kingma & Ba, 2015) with learning rate 0.2. In addition, we use weight decay and tune this hyperparameter on each dataset using hyperopt (Bergstra et al., 2015) for 60 iterations on the public split validation set. Experiments on citation networks are conducted transductively. On the Reddit dataset, we train SGC with L-BFGS (Liu & Nocedal, 1989) using no regularization, and remarkably, training converges in 2 steps. We evaluate SGC inductively by following Chen et al. (2018): we train SGC on a subgraph comprising only training nodes and test with the original graph. On all datasets, we tune the number of epochs based on both convergence behavior and validation accuracy.
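A sketch of this transductive setup, assuming the precomputed features X_bar = S^K X from Section 2, could look as follows; the weight-decay value is a placeholder to be tuned with hyperopt, and the helper names are ours.

    import torch
    import torch.nn.functional as F

    def train_sgc(X_bar, y, train_idx, val_idx, n_classes, weight_decay=1e-5):
        """Train the linear classifier with Adam (lr = 0.2) for 100 epochs, as described above."""
        model = torch.nn.Linear(X_bar.size(1), n_classes)
        opt = torch.optim.Adam(model.parameters(), lr=0.2, weight_decay=weight_decay)
        for _ in range(100):
            opt.zero_grad()
            loss = F.cross_entropy(model(X_bar[train_idx]), y[train_idx])
            loss.backward()
            opt.step()
        with torch.no_grad():
            val_acc = (model(X_bar[val_idx]).argmax(dim=1) == y[val_idx]).float().mean()
        return model, val_acc.item()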
References

Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9:1735–1780, 1997.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661. Springer, 2016.

Kampffmeyer, M., Chen, Y., Liang, X., Wang, H., Zhang, Y., and Xing, E. P. Rethinking knowledge graph propagation for zero-shot learning. arXiv preprint arXiv:1805.11724, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR'2015), 2015.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR'2017), 2017.

Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR'2019), 2019.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Lee, J. B., Rossi, R. A., Kim, S., Ahmed, N. K., and Koh, E. Attention models in graphs: A survey. arXiv e-prints, 2018.

Li, Q., Han, Z., and Wu, X. Deeper insights into graph convolutional networks for semi-supervised learning. CoRR, abs/1801.07606, 2018.

Li, Q., Wu, X.-M., Liu, H., Zhang, X., and Guan, Z. Label efficient semi-supervised learning via graph filtering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Liao, R., Zhao, Z., Urtasun, R., and Zemel, R. LanczosNet: Multi-scale deep graph convolutional networks. In International Conference on Learning Representations (ICLR'2019), 2019.

Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Miller, G. A. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model CNNs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5425–5434, 2017.

Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'14, pp. 701–710. ACM, 2014.

Rahimi, S., Cohn, T., and Baldwin, T. Semi-supervised user geolocation via graph convolutional networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2009–2019. Association for Computational Linguistics, 2018.

Ramakrishnan, R., Hartmann, M., Tapavicza, E., and von Lilienfeld, O. A. Electronic spectra from TDDFT and machine learning in chemical space. The Journal of Chemical Physics, 143(8):084111, 2015.

Roller, S., Speriosu, M., Rallapalli, S., Wing, B., and Baldridge, J. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1500–1510. Association for Computational Linguistics, 2012.

Rosenblatt, F. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

Rosenblatt, F. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Technical report, Cornell Aeronautical Lab Inc, 1961.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93, 2008.

Sobel, I. and Feldman, G. A 3x3 isotropic gradient operator for image processing. A talk at the Stanford Artificial Project, pp. 271–272, 1968.

Thekumparampil, K. K., Wang, C., Oh, S., and Li, L. Attention-based graph neural network for semi-supervised learning. arXiv e-prints, 2018.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph Attention Networks. In International Conference on Learning Representations (ICLR'2018), 2018.

Veličković, P., Fedus, W., Hamilton, W. L., Liò, P., Bengio, Y., and Hjelm, R. D. Deep Graph Infomax. In International Conference on Learning Representations (ICLR'2019), 2019.

Waibel, A., Hanazawa, T., and Hinton, G. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 1989.

Wang, X., Ye, Y., and Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In Computer Vision and Pattern Recognition (CVPR), pp. 6857–6866. IEEE Computer Society, 2018.

Weston, J., Ratle, F., and Collobert, R. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning, ICML'08, pp. 1168–1175, 2008.

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR'2019), 2019.

Yanardag, P. and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. ACM, 2015.

Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pp. 40–48, 2016.

Yao, L., Mao, C., and Luo, Y. Graph convolutional networks for text classification. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI'19), 2019.

Zhang, J., Shi, X., Xie, J., Ma, H., King, I., and Yeung, D. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI'2018), pp. 339–349. AUAI Press, 2018a.

Zhang, M., Cui, Z., Neumann, M., and Chen, Y. An end-to-end deep learning architecture for graph classification. 2018b.

Zhang, Y., Zhong, V., Chen, D., Angeli, G., and Manning, C. D. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 35–45. Association for Computational Linguistics, 2017.

Zhang, Y., Qi, P., and Manning, C. D. Graph convolution over pruned dependency trees improves relation extraction. In Empirical Methods in Natural Language Processing (EMNLP), 2018c.

Zhou, D., Bousquet, O., Thomas, N. L., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Thrun, S., Saul, L. K., and Schölkopf, B. (eds.), Advances in Neural Information Processing Systems 16, pp. 321–328. MIT Press, 2004.

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., and Sun, M. Graph Neural Networks: A Review of Methods and Applications. arXiv e-prints, 2018.

Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML'03, pp. 912–919. AAAI Press, 2003.
Simplifying Graph Convolutional Networks (Supplementary Material)

A. The spectrum of ∆̃sym

The normalized Laplacian defined on graphs with self-loops, ∆̃sym, is an instance of a generalized graph Laplacian and holds the interpretation of a difference operator, i.e., for any signal x ∈ R^n it satisfies

    (∆̃sym x)i = Σj ( ãij / √(di + γ) ) ( xi / √(di + γ) − xj / √(dj + γ) ).

Proof. The quadratic form associated with ∆̃sym is

    x^⊤ ∆̃sym x = Σi xi² − Σi Σj ãij xi xj / √((di + γ)(dj + γ))
               = (1/2) ( Σi xi² + Σj xj² − Σi Σj 2 ãij xi xj / √((di + γ)(dj + γ)) )
               = (1/2) ( Σi Σj ãij xi² / (di + γ) + Σj Σi ãij xj² / (dj + γ) − Σi Σj 2 ãij xi xj / √((di + γ)(dj + γ)) )
               = (1/2) Σi Σj ãij ( xi / √(di + γ) − xj / √(dj + γ) )² ≥ 0.   (13)

Lemma 2. 0 is an eigenvalue of both ∆sym and ∆̃sym.

Proof. First, note that v = [1, . . . , 1]^⊤ is an eigenvector of ∆ associated with eigenvalue 0, i.e., ∆v = (D − A)v = 0. Also, we have that ∆̃sym = D̃^{-1/2}(D̃ − Ã)D̃^{-1/2} = D̃^{-1/2} ∆ D̃^{-1/2}. Denote v1 = D̃^{1/2} v; then

    ∆̃sym v1 = D̃^{-1/2} ∆ D̃^{-1/2} D̃^{1/2} v = D̃^{-1/2} ∆ v = 0.

Therefore, v1 = D̃^{1/2} v is an eigenvector of ∆̃sym associated with eigenvalue 0, which is then the smallest eigenvalue by the non-negativity of ∆̃sym. Likewise, 0 can be proved to be the smallest eigenvalue of ∆sym.

Lemma 3. Let β1 ≤ β2 ≤ · · · ≤ βn denote the eigenvalues of D^{-1/2} A D^{-1/2} and α1 ≤ α2 ≤ · · · ≤ αn be the eigenvalues of D̃^{-1/2} A D̃^{-1/2}. Then,

    α1 ≥ ( maxi di / (γ + maxi di) ) β1,    αn ≤ mini di / (γ + mini di).   (14)

Proof. By choosing x such that ||x|| = 1 and y = D^{1/2} D̃^{-1/2} x, we have that ||y||² = Σi ( di / (di + γ) ) xi² and mini di / (γ + mini di) ≤ ||y||² ≤ maxi di / (γ + maxi di). Hence, we use the Rayleigh quotient to provide a lower bound to α1:

    α1 = min_{||x||=1} x^⊤ D̃^{-1/2} A D̃^{-1/2} x
       = min_{||x||=1} y^⊤ D^{-1/2} A D^{-1/2} y                               (by change of variable)
       = min_{||x||=1} ( y^⊤ D^{-1/2} A D^{-1/2} y / ||y||² ) ||y||²
       ≥ min_{||x||=1} ( y^⊤ D^{-1/2} A D^{-1/2} y / ||y||² ) · max_{||x||=1} ||y||²
         (∵ min(AB) ≥ min(A) max(B) if min(A) < 0, ∀B > 0, and min_{||x||=1} y^⊤ D^{-1/2} A D^{-1/2} y / ||y||² = β1 < 0)
       = β1 max_{||x||=1} ||y||²
       ≥ ( maxi di / (γ + maxi di) ) β1.

One may employ similar steps to prove the second inequality in Equation 14.

Proof of Theorem 1. Note that ∆̃sym = I − γ D̃^{-1} − D̃^{-1/2} A D̃^{-1/2}. Using the results in Lemma 3, we show that the largest eigenvalue λ̃n of ∆̃sym satisfies

    λ̃n = max_{||x||=1} x^⊤ ( I − γ D̃^{-1} − D̃^{-1/2} A D̃^{-1/2} ) x
        ≤ 1 − min_{||x||=1} γ x^⊤ D̃^{-1} x − min_{||x||=1} x^⊤ D̃^{-1/2} A D̃^{-1/2} x
        = 1 − γ / (γ + maxi di) − α1
        ≤ 1 − γ / (γ + maxi di) − ( maxi di / (γ + maxi di) ) β1
        < 1 − ( maxi di / (γ + maxi di) ) β1        (γ > 0 and β1 < 0)
        < 1 − β1 = λn.   (15)
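The statements above can be sanity-checked numerically on a random toy graph; the check below is an illustration of Theorem 1 and the first inequality of Lemma 3, not part of the proof.

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma = 30, 1.0

    # Random undirected graph; path edges are added so that no node is isolated.
    A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
    A = A + A.T
    A[np.arange(n - 1), np.arange(1, n)] = 1.0
    A[np.arange(1, n), np.arange(n - 1)] = 1.0

    def norm_lap_eigs(A, g=0.0):
        At = A + g * np.eye(len(A))
        dinv = np.diag(1.0 / np.sqrt(At.sum(axis=1)))
        return np.linalg.eigvalsh(np.eye(len(A)) - dinv @ At @ dinv)

    lam, lam_t = norm_lap_eigs(A), norm_lap_eigs(A, gamma)
    assert abs(lam[0]) < 1e-8 and abs(lam_t[0]) < 1e-8   # 0 = lambda_1 = lambda_tilde_1
    assert lam_t[-1] < lam[-1]                           # lambda_tilde_n < lambda_n (Theorem 1)

    d = A.sum(axis=1)
    alpha = np.linalg.eigvalsh(np.diag(1 / np.sqrt(d + gamma)) @ A @ np.diag(1 / np.sqrt(d + gamma)))
    beta = np.linalg.eigvalsh(np.diag(1 / np.sqrt(d)) @ A @ np.diag(1 / np.sqrt(d)))
    assert alpha[0] >= d.max() / (gamma + d.max()) * beta[0] - 1e-8   # Lemma 3, first inequality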
B. Experiment Details

Node Classification. We empirically find that for SGC on the Reddit dataset, it is crucial to normalize the features to zero mean and unit variance.

Training Time Benchmarking. We hereby describe the experiment setup of Figure 3. Chen et al. (2018) benchmark the training time of FastGCN on CPU, and as a result, it is difficult to compare numerical values across reports. Moreover, we found the performance of FastGCN improved with a smaller early stopping window (10 epochs); therefore, we could decrease the model's training time. We provide the data underpinning Figure 3 in Table 8 and Table 9.

Table 8. Training time (seconds) of graph neural networks on citation networks. Numbers are averaged over 10 runs.

    Models     Cora    Citeseer   Pubmed
    GCN        0.49      0.59       8.31
    GAT       63.10    118.10     121.74
    FastGCN    2.47      3.96       1.77
    GIN        2.09      4.47      26.15
    LNet      15.02     49.16     266.47
    AdaLNet   10.15     31.80     222.21
    DGI       21.24     21.06      76.20
    SGC        0.13      0.14       0.29

Table 9. Training time (seconds) on the Reddit dataset.

    Model        Time (s) ↓
    SAGE-mean        78.54
    SAGE-LSTM       486.53
    SAGE-GCN         86.86
    FastGCN         270.45
    SGC               2.70

Text Classification. Yao et al. (2019) use one-hot features for the word and document nodes. In training SGC, we normalize the features to be between 0 and 1 after propagation and train with L-BFGS for 3 steps. We tune the only hyperparameter, weight decay, using hyperopt (Bergstra et al., 2015) for 60 iterations. Note that we cannot apply this feature normalization for TextGCN because the propagation cannot be precomputed.
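The feature scaling and the short L-BFGS run described above can be sketched as follows; this is an illustration with assumed variable names, not the released code.

    import torch
    import torch.nn.functional as F

    def column_minmax(X: torch.Tensor) -> torch.Tensor:
        """Scale each feature to [0, 1] after propagation (used for text classification)."""
        mn, mx = X.min(dim=0).values, X.max(dim=0).values
        return (X - mn) / (mx - mn + 1e-12)

    def train_lbfgs(X_bar, y, train_idx, n_classes, steps=3, weight_decay=0.0):
        """Fit the linear classifier with L-BFGS for a handful of steps."""
        model = torch.nn.Linear(X_bar.size(1), n_classes)
        opt = torch.optim.LBFGS(model.parameters(), lr=1.0)

        def closure():
            opt.zero_grad()
            loss = F.cross_entropy(model(X_bar[train_idx]), y[train_idx])
            loss = loss + weight_decay * model.weight.pow(2).sum()
            loss.backward()
            return loss

        for _ in range(steps):
            opt.step(closure)
        return model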
Semi-supervised User Geolocation. We replace the 4-layer, highway-connection GCN with an SGC using a 3rd-degree propagation matrix (K = 3) and use the same set of hyperparameters as Rahimi et al. (2018). All experiments on the GEOTEXT dataset are conducted on a single Nvidia GTX-1080Ti GPU, while the ones on the TWITTER-NA and TWITTER-WORLD datasets are executed with 10 cores of an Intel(R) Xeon(R) Silver 4114 CPU (2.20GHz). Instead of collapsing all linear transformations, we keep two of them, which we find to perform slightly better. Despite this subtle variation, the model is still linear.

Relation Extraction. We replace the 2-layer GCN with an SGC using a 2nd-degree propagation matrix (K = 2) and remove the intermediate dropout. We keep other hyperparameters unchanged, including learning rate and regularization. Similar to Zhang et al. (2018c), we report the best validation accuracy with early stopping.

Zero-shot Image Classification. We replace the 6-layer GCN (hidden sizes: 2048, 2048, 1024, 1024, 512, 2048) baseline with a 6-layer MLP (hidden sizes: 512, 512, 512, 1024, 1024, 2048) followed by an SGC with K = 6. Following Wang et al. (2018), we only apply dropout to the output of SGC. Due to the slow evaluation of this task, we do not tune the dropout rate or other hyperparameters. Rather, we follow the GCNZ code and use a learning rate of 0.001, weight decay of 0.0005, and a dropout rate of 0.5. We also train the models with Adam (Kingma & Ba, 2015) for 300 epochs.

C. Additional Experiments

Random Splits for Citation Networks. Possibly due to their limited size, the citation networks are known to be unstable. Accordingly, we conduct an additional 10 experiments on random splits of the training set while maintaining the same validation and test sets.

Propagation choice. We conduct an ablation study with different choices of propagation matrix, namely:

    Normalized Adjacency:    Sadj = D^{-1/2} A D^{-1/2}
    Random Walk Adjacency:   Srw = D^{-1} A
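These variants (and their self-loop-augmented counterparts from Section 3) can be constructed as below; the function and option names are ours, and scipy is assumed.

    import numpy as np
    import scipy.sparse as sp

    def propagation_matrix(A: sp.spmatrix, variant: str = "aug_norm_adj") -> sp.spmatrix:
        """'norm_adj': S_adj = D^{-1/2} A D^{-1/2}; 'rw_adj': S_rw = D^{-1} A;
        prefix 'aug_' adds self-loops (A + I) before normalizing, as in Section 2."""
        if variant.startswith("aug_"):
            A = A + sp.eye(A.shape[0], format="csr")
            variant = variant[len("aug_"):]
        d = np.asarray(A.sum(axis=1)).ravel()
        if variant == "norm_adj":
            d_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
            return d_inv_sqrt @ A @ d_inv_sqrt
        if variant == "rw_adj":
            return sp.diags(1.0 / d) @ A
        raise ValueError(f"unknown propagation variant: {variant}")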