Complex Networks & Their Applications XII: Hocine Cherifi Luis M. Rocha Chantal Cherifi Murat Donduran Editors
Hocine Cherifi
Luis M. Rocha
Chantal Cherifi
Murat Donduran Editors
Complex
Networks & Their
Applications XII
Proceedings of The Twelfth
International Conference on Complex
Networks and their Applications:
COMPLEX NETWORKS 2023 Volume 1
Studies in Computational Intelligence 1141
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments
and advances in the various areas of computational intelligence—quickly and with a high
quality. The intent is to cover the theory, applications, and design methods of computa-
tional intelligence, as embedded in the fields of engineering, computer science, physics
and life sciences, as well as the methodologies behind them. The series contains mono-
graphs, lecture notes and edited volumes in computational intelligence spanning the areas
of neural networks, connectionist systems, genetic algorithms, evolutionary computa-
tion, artificial intelligence, cellular automata, self-organizing systems, soft computing,
fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors
and the readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
Hocine Cherifi · Luis M. Rocha ·
Chantal Cherifi · Murat Donduran
Editors
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
that the papers in these proceedings have undergone rigorous evaluation, resulting in
high-quality contributions.
We encourage you to explore the rich tapestry of knowledge and ideas as we dive
into these four proceedings volumes. The papers presented here represent not only the
diverse areas of research but also the collaborative and interdisciplinary spirit that defines
the complex networks community.
In closing, we extend our heartfelt thanks to the organizing committees and volunteers
who have worked tirelessly to make this conference a reality. We hope these proceed-
ings inspire future research, innovation, and collaboration, ultimately helping us better
understand the world’s networks and their profound impacts on science, technology, and
society.
We hope that the pleasure you take in reading these papers matches our enthusiasm
for organizing the conference and assembling this collection of articles.
Hocine Cherifi
Luis M. Rocha
Chantal Cherifi
Murat Donduran
Organization and Committees
General Chairs
Advisory Board
Program Chairs
Lightning Chairs
Poster Chairs
Publicity Chairs
Tutorial Chairs
Sponsor Chairs
Sustainability Chair
Publication Chair
Submission Chair
Web Chairs
Program Committee
Network Embedding
1 Introduction
Learning over graphs has become paramount in machine learning applications
where the data possesses a connective structure, such as social networks [7],
Distribution Statement A. Approved for public release. Distribution is unlimited.
This material is based upon work supported by the Under Secretary of Defense for
Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opin-
ions, findings, conclusions or recommendations expressed in this material are those
of the author(s) and do not necessarily reflect the views of the Under Secretary of
Defense for Research and Engineering. © 2023 Massachusetts Institute of Technology.
Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part
252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Govern-
ment rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014
as detailed above. Use of this work other than as specifically authorized by the U.S.
Government may violate any copyrights that exist in this work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2023, SCI 1141, pp. 3–15, 2024.
https://doi.org/10.1007/978-3-031-53468-3_1
chemistry [8], and finance [25]. Fortunately, the field of graph mining has pro-
vided methods to extract useful information from graphs, albeit often need-
ing heavy domain guidance [18]. The advent of graph neural networks (GNNs),
a neural network generalized to learn over graph structured data, has helped
alleviate some of these requirements by learning representations that synthe-
size both node and structure information [8,9,13]. Complementary to inference,
recent work has proposed methods that edit and design network structures using
gradients from a trained GNN [11,17,19], enabling the efficient optimization of
downstream learning tasks [31] in cyber security [5,15], urban planning [4], drug
discovery [12], and more [3,14,16]. However, as gradient-based editing is applied
more broadly, scrutinizing the conditions that allow for successful editing is crit-
ical. For instance, the factors that influence gradient computation remain poorly
understood, making it unclear when proposed edits can be trusted. In addition, it is
unknown whether gradient quality depends on graph structure and GNN architecture,
raising further concern for practical applications.
Focusing strictly on gradient-based edit quality, we analyze the common mask
learning paradigm [11,19,20,29], where a continuous scoring mask is learned over
the edges in a graph. Specifically, we elucidate how structural factors, such as
degree, neighborhood label composition, and edge-to-node distance (i.e., how far
an edge is from a node) can influence the mask through the gradient. When these
factors are not beneficial to the learning task, e.g., edge-to-node distance in a
de-noising task where noise is uniformly distributed across the graph, the learned
mask can lead to erroneous edits. We additionally highlight how editing methods
that rely on thresholding are more susceptible to such structural biases due to
smoothing of the ground truth signal at the extreme values of the distribution.
To improve editing, we propose a more fine-tuned sequential editing process,
ORE, with two steps: (1) We Order the edge scores and edit the top-k edges
to prioritize high quality edges, and (2) we Re-embed the modified graph after
the top-k edges have been Edited. These steps help avoid choosing edges whose scores
lie near the expected mask value, and are thus more likely to be driven by irrelevant
structural properties, and they encourage edits that account for the influence of
other removed edges with higher scores. We highlight the practical benefit of
ORE by designing a systematic study that probes editing quality across a variety
of common GNN tasks, graph structures, and architectures, demonstrating up
to a 50% performance improvement for ORE over previous editing methods.
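To make the procedure concrete, the following sketch shows the Order/Edit/Re-embed loop in Python; the `score_edges` and `remove_edges` helpers are hypothetical stand-ins for mask learning over a trained GNN and for graph mutation, and are not part of any released implementation.

```python
def ore_edit(graph, budget_b, step_size_s, score_edges, remove_edges):
    """Illustrative ORE loop: Order scores, Edit the top-k, Re-embed, repeat."""
    edited = 0
    while edited < budget_b:
        scores = score_edges(graph)                        # re-learn the mask on the current graph
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_edges = ranked[:min(step_size_s, budget_b - edited)]
        if not top_edges:
            break
        graph = remove_edges(graph, top_edges)             # commit only the highest-scoring edits
        edited += len(top_edges)                           # the next pass re-embeds the edited graph
    return graph
```

Because each pass re-scores the already-edited graph, later edits can account for the edges removed before them, which is the behavior that single-shot thresholding lacks.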
2 Related Work
Early network design solutions chose edits based on fixed heuristics, such as
centrality scores [16] or triangle closing properties [14]. However, fixed heuris-
tics generally require significant domain guidance and may not generalize to
broader classes of networks and tasks. Reinforcement learning (RL) has enabled
the ability to learn more flexible heuristics, such as in chemistry [30] and social
networks [23]; however, RL can be prohibitively expensive due to data and com-
putation requirements. To fulfill the need for efficient and flexible editing meth-
ods, gradient-based optimization has subsequently been applied to edge editing,
facilitated through trained GNNs. Although computing gradients for edges can
be infeasible given the discrete nature of the input network, previous methods
have adopted a continuous relaxation of the edge set, operating on a soft edge
scoring mask that can be binarized to recover the hard edge set [11,19,20,24,29].
In its simplest form, the gradient of an edge is approximated as the gradient of
the score associated with that edge, with respect to a loss objective [29]. As
this is dependent on the initialization of the scoring mask, GNNExplainer pro-
poses to leverage multiple rounds of gradient descent over the mask to arrive
at a final score, rather than use the gradient directly [29]. CF-GNNExplainer
extends GNNExplainer by generating counterfactual instances and measuring
the change in the downstream objective [19]. Both of these methods convert
the soft mask to a hard mask through fixed thresholding, which, when incor-
rectly chosen, can introduce noisy edits. Moreover, as mask learning is usually
used to support broader objectives, such as robustness or explainability, studies
fail to consider what conditions can inhibit the mask learning sub-component,
instead focusing simply on the downstream objective. Our work provides a direct
analysis of mask quality through a systematic study across a wide array of tasks,
GNNs, and topologies. We highlight that current mask-based editing methods can
become susceptible to bias within the mask scores, prompting the development of
ORE as a means of improving gradient-based edge editing.
3 Notation
Let G = (V, E, X, Y) be a simple graph with nodes V, edges E, feature matrix
X ∈ R^{|V|×d} with d node features, and label matrix Y, where Y ∈ {0,1}^{|V|×c} with c
classes for node classification, Y ∈ R^{|V|} for node regression, and Y ∈ {0,1}^c for
graph classification. A ∈ {0,1}^{|V|×|V|} is the adjacency matrix of G, where A_{i,j} =
1 denotes an edge between nodes i and j in G, otherwise A_{i,j} = 0. While E and
A represent similar information, E is used when discussing edge sets and A is used for
matrix computations. Additionally, a k-hop neighborhood of a node i ∈ V, N_k(i),
denotes the nodes and edges that are reachable within k steps of i. For simplicity,
k is dropped when referring to the 1-hop neighborhood. Additionally, we denote
||B||_1 as the L_1-norm of a matrix B, G − e_i as the removal of an edge from G, and
G − i as the removal of a node from G. For a k-layer GNN, learning is facilitated
through message passing over k-hop neighborhoods of a graph [8]. A node i's
representations are updated by iteratively aggregating the features of nodes in
i's 1-hop neighborhood, denoted AGGR, and embedding the aggregated features
with i's features, usually through a non-linear transformation parameterized by
a weight matrix W, denoted ENC. The update for node i is expressed as
r_i^{(l)} = ENC(r_i^{(l−1)}, AGGR(r_u^{(l−1)}, u ∈ N(i))) for l ∈ {1, 2, ..., k}, where r_i^{(0)} = x_i. The
update function is applied k times, resulting in node representations that can be
used to compute predictions. For graph-level tasks, a readout function aggregates
the final representations of all nodes into a single graph-level representation.
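As a concrete reading of this update rule, the sketch below implements one message-passing layer with a mean AGGR and a concatenate-then-linear ENC; these particular choices of AGGR and ENC are illustrative assumptions rather than the specific architectures studied later.

```python
import numpy as np

def message_passing_layer(adj, R_prev, W):
    """One layer: r_i^{(l)} = ENC(r_i^{(l-1)}, AGGR(r_u^{(l-1)}, u in N(i))).

    adj    : (n, n) binary adjacency matrix A
    R_prev : (n, d) node representations from layer l-1
    W      : (2d, d') encoder weight matrix
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    aggregated = (adj @ R_prev) / deg                   # AGGR: mean over the 1-hop neighborhood
    combined = np.concatenate([R_prev, aggregated], axis=1)
    return np.maximum(combined @ W, 0.0)                # ENC: linear map followed by ReLU
```

Stacking k such layers gives each node a receptive field of its k-hop neighborhood; for graph-level tasks the rows of the final matrix are pooled by the readout function.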
\[
\min_{A^*} \|A - A^*\|_1 \quad \text{s.t.} \quad f(X, A^*) - f(X, A) \ge 0. \tag{1}
\]
As A is discrete and f introduces non-linear and non-convex constraints, it is
difficult to find an exact solution. Thus, we soften the constraints and focus on
increasing f while maintaining the size of A, as shown in Eq. 2,
\[
\min_{A^*} \; -f(X, A^*) + \lambda \|A - A^*\|_1. \tag{2}
\]
where λ trades off the objective and the size of the remaining edge set. The
negative term incentivizes the optimizer to improve f . As the optimization is
still over a discrete adjacency matrix, we re-parameterize A, as done in [10,29],
and introduce a continuous mask M ∈ R^{n×n}. M is introduced into a GNN's
aggregation function as AGGR(m_{u,v} · r_u^{(l−1)}, u ∈ N(v)), where m_{u,v} is the mask
value on the edge that connects nodes u and v. By introducing M into AGGR,
it is possible to directly compute partial derivatives over M, enabling gradient-
based optimization over the mask values. As the aggregation function is model-
agnostic, we can easily inject the mask into any model that follows this paradigm.
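A minimal sketch of this mask-based optimization with automatic differentiation is shown below; the uniform mask initialization, the plain sum aggregation, and the linear stand-in for a frozen GNN are all illustrative assumptions.

```python
import torch

def masked_aggregate(A, X, M):
    # AGGR(m_{u,v} * r_u, u in N(v)): scale each edge's message by its mask value.
    return (A * M) @ X                                  # entries with A = 0 contribute nothing

n, d = 50, 16
A = (torch.rand(n, n) < 0.1).float()                    # toy adjacency matrix
X = torch.randn(n, d)                                   # toy node features
W = torch.randn(d, 1)                                   # stand-in for a frozen, trained GNN head
M = torch.nn.Parameter(torch.rand(n, n))                # continuous edge scoring mask

opt = torch.optim.Adam([M], lr=0.1)
for _ in range(100):
    f = (masked_aggregate(A, X, M) @ W).sum()           # downstream objective f(X, A*)
    loss = -f + 0.05 * (A * (1 - M)).abs().sum()        # Eq. 2: raise f, penalize deviation from A
    opt.zero_grad()
    loss.backward()                                     # partial derivatives flow directly to M
    opt.step()
```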
\[
r_i^{(2)} = x_i + \sum_{j \in N(i)} M_{i,j} x_j + \sum_{j \in N(i)} M_{i,j} \Big( x_j + \sum_{k \in N(j)} M_{j,k} x_k \Big). \tag{3}
\]
Then, the class prediction for i is \arg\max z_i, where z_i = r_i^{(2)} W. As M is com-
monly learned through gradient ascent, and only r_i^{(2)} depends on M, we focus
on the partial derivative of r_i^{(2)} with respect to a mask value M_{u,v}, where u, v
are nodes in G. As the GNN has two layers, the edges must be within two hops
of i to have a non-zero partial derivative. The partial derivatives for the one- and
two-hop scenarios are the first and second cases of Eq. 4, respectively,
\[
\frac{\partial r_i^{(2)}}{\partial M_{u,v}} =
\begin{cases}
2\big(y_j + M_{i,j} y_i + (M_{i,j} + 1)\,\mathcal{N}(\mu, \Sigma)\big) + \displaystyle\sum_{k \in N(j) - i} M_{j,k}\big(y_k + \mathcal{N}(\mu, \Sigma)\big), & u = i,\ v = j \in N(i) \\[4pt]
M_{i,j}\big(y_k + \mathcal{N}(\mu, \Sigma)\big), & u = j \in N(i),\ v = k \in N(j)
\end{cases} \tag{4}
\]
\[
\Delta \partial r_{i,0}^{(2)} =
\begin{cases}
(M_{i,j} + 2)\,\mathcal{N}(\mu + 1, \Sigma), & y_j = 0,\ y_k = 0 \\
M_{i,j} + (M_{i,j} + 2)\,\mathcal{N}(\mu, \Sigma), & y_j = 1,\ y_k = 0 \\
2(M_{i,j} + 1) + (M_{i,j} + 2)\,\mathcal{N}(\mu, \Sigma), & y_j = 0,\ y_k = 1 \\
2M_{i,j} + (M_{i,j} + 2)\,\mathcal{N}(\mu, \Sigma), & y_j = 1,\ y_k = 1
\end{cases}
+ \sum_{\substack{k \in N(j) - i \\ y_k = y_j}} M_{j,k}\,\mathcal{N}(\mu + 1, \Sigma) + \sum_{\substack{k \in N(j) - i \\ y_k \neq y_j}} M_{j,k}\,\mathcal{N}(\mu, \Sigma). \tag{5}
\]
First, all cases in Eq. 5 tend to be greater than 0, leading to higher scores for
edges closer to i. Additionally, if elements of M ∼ U(−1, 1) as in [19,29], the last
two summation terms in Eq. 5 scale as h_j(d_j − 1) and (1 − h_j)(d_j − 1), respectively,
where h_j and d_j represent the homophily and degree properties of the node
j. Thus, high degree and high homophily can additionally bias edge selection,
similar to the heuristic designed by [26], where h_j d_j is used to optimize network
navigation. Each of the above structural factors can either coincide with the true
edge importance, or negatively influence edits when such structural properties
are uninformative to the network design task.
5 Experimental Setup
5.1 Network Editing Process
We study four GNN architectures: GCN [13], GraphSage [9], GCN-II [22], and
Hyperbolic GCN [2]. As attention weights have been shown to be unreliable for
edge scoring [29], we leave them out of this study. After training, each model’s
weights are frozen and the edge mask variables are optimized to modify the
output prediction. We train three independent models on different train-val-test
(50-25-25) splits for each task and the validation set is used to choose the best
hyperparameters over a grid search. Then, editing is performed over 50 random
data points sampled from the test set. For regression tasks, we directly optimize
the output of the GNN, and for classification tasks, we optimize the cross entropy
loss between the prediction and class label. For ORE, s = b so that one edge
is edited per step. Additionally, b is set such that roughly 10% (or less) of the
edges of a graph (or computational neighborhood) are edited. The exact budget
is specified for each task. All hyperparameters and implementation details for
both the GNN training and mask learning are outlined in an anonymous repo1 .
Editing Baselines: We utilize two fixed heuristics for editing: iterative edge
removal through random sampling and edge centrality scores [1]. We also study
CF-GNNExplainer [19], though we extend the algorithm to allow for learning
objectives outside of counterfactuals and variable thresholds that cause b edits
to fairly compare across methods. These changes do not hurt performance and
are simple generalizations. Note that while we focus on CF-GNNExplainer, as
they are the only previous mask learning work to consider editing, their mask
generation is highly similar to other previous non-editing methods, allowing us
to indirectly compare to thresholding-based methods in general [20,24,29].
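To make the contrast with thresholding concrete, the sketch below shows the two ways a learned soft mask can be turned into discrete edits; the 0.5 cutoff is an arbitrary illustrative value, not the threshold used by CF-GNNExplainer.

```python
def hard_mask_by_threshold(soft_scores, tau=0.5):
    # Fixed-threshold binarization: every edge scoring above tau is edited at once.
    return {edge for edge, score in soft_scores.items() if score > tau}

def hard_mask_top_k(soft_scores, k):
    # Ordered selection as in ORE: only the k highest-scoring edges are edited,
    # avoiding edges whose scores sit near the bulk of the mask distribution.
    ranked = sorted(soft_scores, key=soft_scores.get, reverse=True)
    return set(ranked[:k])
```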
1 https://anonymous.4open.science/r/ORE-93CC/GNN details.md.
In this section we detail the proposed tasks. For each, the generation process,
parameters, and resultant dataset stats are provided in an anonymous repo2 .
Improving Motif Detection: We begin with node classification tasks similar
to [19,20,29] with a goal of differentiating nodes from two different generative
models. Tree-grid and tree-cycle are generated by attaching either a 3 × 3 grid or
a 6 node cycle motif to random nodes in a 8-level balanced binary tree. We train
the GNNs using cross entropy, and then train the mask to maximize a node’s
class prediction. As the generation process is known, we extrinsically verify if
an edit was correct by determining if it corresponds to an edge inside or outside
of the motifs. The editing budget is set to the size of the motifs, i.e. b = 6 for
tree-cycle and b = 12 for tree-grid. Each model is trained to an accuracy of 85%.
Increasing Shortest Paths (SP): The proposed task is to delete edges to
increase the SP between two nodes in a graph. This task has roots in adversarial
attacks [21] and network interdiction [27], with the goal of forcing specific traffic
routes. The task is performed on three synthetic graphs: Barabási-Albert (BA),
Stochastic Block Model (SBM), and Erdős-Rényi (ER). The parameters are set
to ensure that each graph has an average SP length of 8. The GNN is trained through
the MSE of SP lengths, where the SP is estimated by learning an embedding for each
node and then computing the L2 distance between the embeddings of node pairs in
the training set. The GNN is then used to increase the SP for pairs of
nodes in the test set, which is externally verified through NetworkX. The editing
budget b = 30 given the larger graphs. Each model is trained to an RMSE of 2.
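An external check of this kind can be performed directly with NetworkX, as in the sketch below; the graph, node pair, and edits are placeholders rather than outputs of the editing pipeline.

```python
import networkx as nx

G = nx.barabasi_albert_graph(200, 2, seed=0)            # placeholder BA graph
source, target = 0, 150                                 # placeholder test pair

before = nx.shortest_path_length(G, source, target)
G_edited = G.copy()
path = nx.shortest_path(G, source, target)
G_edited.remove_edges_from(list(zip(path, path[1:]))[:2])   # stand-in for proposed edits
after = (nx.shortest_path_length(G_edited, source, target)
         if nx.has_path(G_edited, source, target) else float("inf"))
print(f"SP length: {before} -> {after}")                # success if the edited SP is longer
```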
Decreasing the Number of Triangles: The proposed task is to delete edges
to decrease the number of triangles in a graph. Since triangles are often associ-
ated with influence, this task can support applications that control the spread of
a process in a network, such as disease or misinformation [6]. We consider the same
graphs as in the SP task, BA, SBM, and ER, but instead generate 100000 differ-
ent graphs each with 100 nodes. Each generation method produces graphs that,
on average, have between 20 and 25 triangles, as computed by NetworkX’s tri-
angle counter. The GNNs are trained using MSE and then used to minimize the
number of triangles in the graph, which is externally verified through NetworkX.
The editing budget b = 20. Each GNN is trained to an RMSE of 6.
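The external triangle count can likewise be read off with NetworkX, as sketched below with a placeholder ER graph and arbitrary edits.

```python
import networkx as nx

def total_triangles(G):
    # nx.triangles counts, per node, the triangles containing that node; every
    # triangle is therefore counted three times, once per vertex.
    return sum(nx.triangles(G).values()) // 3

G = nx.erdos_renyi_graph(100, 0.07, seed=1)             # placeholder graph of 100 nodes
print("triangles before editing:", total_triangles(G))
G.remove_edges_from(list(G.edges())[:20])               # stand-in for the b = 20 proposed edits
print("triangles after editing:", total_triangles(G))
```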
Improving Graph-Level Predictions: MUTAG is a common dataset of
molecular graphs used to evaluate graph classification algorithms. The proposed
task is to turn mutagenic molecules into non-mutagenic molecules by deleting
mutagenic functional groups [20,29]. We first train the GNN models to suf-
ficiently predict whether a molecule is mutagenic, then edit the molecules to
reduce the probability of mutagenicity. We only edit mutagenic molecules that
possess mutagenic functional groups, as in [20]. The editing budget b = 5. Each
GNN is trained to an accuracy above 75%. To focus on edit quality, we do
2 https://anonymous.4open.science/r/ORE-93CC/Dataset details stats.md.
6 Results
We present the empirical results for each task, beginning with an in-depth anal-
ysis on motif detection. Then, we collectively analyze the shortest path, triangle
counting, and mutag tasks, noting trends in editing method and GNN design.
Fig. 1. Performance on tree-grid and tree-cycle across GNNs (shapes) and editing
methods (colors). The axes show the percent change in edges outside and inside the
motifs. Error bars indicate standard deviation in experiments. Performance improves
towards the bottom right, as the goal is to remove edges outside the motif and retain
edges inside the motif, as shown by the gray Pareto front.
How do Editing Methods Vary Across GNNs? In Fig. 1, ORE with GCNII
yields the best performance; however, nearly every ORE and GNN combination
outperforms the CF-GNNExplainer variant with the same GNN, demonstrat-
ing the intrinsic benefit of ORE, as well as the dependence on GNN model. To
Fig. 2. Mask score distribution stratified by distance to ego-node for GCN and
GCNII. Yellow denotes Tree-Grid, green denotes Tree-Cycle. For GCN, the closer
an edge is to the ego-node, the higher the scores, leading to bias within the editing.
GCNII minimizes bias for this unrelated property, improving editing.
In Table 1, we outline the performance metrics for the SP, triangle counting,
and mutag tasks. For each task, we measure the average percent change in their
associated metric. In the SP experiments, all GNNs improve over the baselines,
demonstrating that the learned mask values extracted from the GNNs can outper-
form crafted heuristics, such as centrality, which leverages shortest path informa-
tion in its computation. Given that ORE with GCN performs well on this task,
it is possible that the structural biases identified previously, such as reliance on
degree, could coincide with the SP task and improve mask scores. In the triangle
counting task, edge centrality is a strong baseline for BA graphs, likely due to
centrality directly editing the hub nodes that close a large number of triangles.
Across the ER and SBM graphs, which do not possess a hub structure, we find
that ORE with a GCNII backbone performs significantly better than both the
baselines and other GNN models. Mutag reinforces these findings where GCNII
removes nearly all of the mutagenic bonds for the mutagenic molecules. Notably,
the Hyperbolic GCN performs poorly across experiments, possibly explained by
most tasks possessing Euclidean geometry, e.g. 82% of the molecules in the muta-
genic dataset are roughly Euclidean as computed by the Gromov hyperbolicity
metric [28]. Comparing editing methods, ORE with GCN and GCNII signifi-
cantly outperforms CF-GNNExplainer with GCN across all three downstream
tasks, highlighting the value of refined and iteratively optimized edge masks.
Fig. 3. Analysis on GCNII and Tree-Grid. (a) Histograms where the axes denote the
percent change in edges inside and outside of the motif, boxes capture the counts. ORE
outperforms CF-GNNExplainer, as shown by the darker boxes in the bottom right. (b)
Performance across edit iterations. Blue denotes ORE, red denotes CF-GNNExplainer,
dashed lines denote out-of-motif change, and solid lines denote in-motif change. ORE
rapidly removes edges outside the motifs while maintaining edges inside the motif,
improving upon CF-GNNExplainer.
Table 1. Results for SP, triangle counting, and mutag tasks. CF-GNNExplainer lever-
ages a GCN, often one of the better performers in motif analysis. All metrics are
average percent change, where higher is better. Error is the standard deviation across
each model. The highlighted boxes indicate the best performers.
7 Conclusion
References
1. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25, 163–
177 (2001)
2. Chami, I., Ying, Z., Ré, C., Leskovec, J.: Hyperbolic graph convolutional neural
networks. In: NeurIPS, vol. 32 (2019)
3. Chan, H., Akoglu, L.: Optimizing network robustness by edge rewiring: a general
framework. Data Min. Knowl. Discov. 30(5), 1395–1425 (2016)
4. Domingo, M., Thibaud, R., Claramunt, C.: A graph-based approach for the struc-
tural analysis of road and building layouts. Geo-spatial Inf. Sci. 22(1), 59–72 (2019)
5. Enoch, S., Mendonça, J., Hong, J., Ge, M., Kim, D.S.: An integrated security hard-
ening optimization for dynamic networks using security and availability modeling
with multi-objective algorithm. Comp. Netw. 208, 108864 (2022)
6. Erd, F., Vignatti, A., da Silva, M.V.G.: The generalized influence blocking maxi-
mization problem. Soc. Netw. Anal. Mining (2021)
7. Fan, W., et al.: Graph neural networks for social recommendation. In: WWW, pp.
417–426 (2019)
8. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message
passing for quantum chemistry. CoRR (2017)
9. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large
graphs. CoRR (2017)
10. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax
(2017)
11. Jin, W., Ma, Y., Liu, X., Tang, X., Wang, S., Tang, J.: Graph structure learning
for robust graph neural networks. In: SIGKDD (2020)
12. Jin, W., Barzilay, R., Jaakkola, T.: Junction tree variational autoencoder for molec-
ular graph generation. In: ICML, PMLR (2018)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
14. Kun, J., Caceres, R.S., Carter, K.M.: Locally boosted graph aggregation for com-
munity detection. arXiv preprint arXiv:1405.3210 (2014)
15. Laishram, R., Sariyüce, A., Eliassi-Rad, T., Pinar, A., Soundarajan, S.: Measuring
and improving the core resilience of networks (2018)
16. Li, D., Eliassi-Rad, T., Zhang, H.R.: Optimal intervention on weighted networks
via edge centrality. In: 5th International Workshop on Epidemiology Meets Data
Mining and Knowledge Discovery at KDD (2022)
17. Li, G., Duda, M., Zhang, X., Koutra, D., Yan, Y.: Interpretable sparsification
of brain graphs: better practices and effective designs for graph neural networks.
arXiv preprint arXiv:2306.14375 (2023)
18. Liu, Y., Safavi, T., Dighe, A., Koutra, D.: Graph summarization methods and
applications: a survey. ACM Comput. Surv. 51(3), 1–34 (2018)
19. Lucic, A., ter Hoeve, M., Tolomei, G., de Rijke, M., Silvestri, F.: CF-GNNExplainer:
counterfactual explanations for graph neural networks. CoRR (2021)
20. Luo, D., et al.: Parameterized explainer for graph neural network. In: NeurIPS
(2020)
21. Miller, B.A., Shafi, Z., Ruml, W., Vorobeychik, Y., Eliassi-Rad, T., Alfeld, S.:
Pathattack: attacking shortest paths in complex networks. In: ECML-PKDD, pp.
532–547 (2021)
22. Wei, Z., Chen, M., Ding, B., Huang, Z., Li, Y.: Simple and deep graph convolutional
networks. In: ICML (2020)
23. Morales, P., Caceres, R., Eliassi-Rad, T.: Selective network discovery via deep
reinforcement learning on embedded spaces. Appl. Netw. Sci. (2021)
24. Schlichtkrull, M.S., Cao, N.D., Titov, I.: Interpreting graph neural networks for
NLP with differentiable edge masking. In: ICLR (2021)
25. Sharma, S., Sharma, R.: Forecasting transactional amount in bitcoin network using
temporal GNN approach. In: ASONAM (2020)
26. Şimşek, Ö., Jensen, D.: Navigating networks by using homophily and degree. Proc.
Natl. Acad. Sci. 105(35), 12758–12762 (2008)
27. Smith, J.C., Prince, M., Geunes, J.: Modern network interdiction problems and
algorithms. In: Pardalos, P.M., Du, D.-Z., Graham, R.L. (eds.) Handbook of Com-
binatorial Optimization, pp. 1949–1987. Springer, New York (2013). https://doi.
org/10.1007/978-1-4419-7997-1_61
28. Väisälä, J.: Gromov hyperbolic spaces. Exposition. Math. 23(3), 187–231 (2005).
https://doi.org/10.1016/j.exmath.2005.01.010
29. Ying, Z., Bourgeois, D., You, J., Zitnik, M., Leskovec, J.: GNNExplainer: generating
explanations for graph neural networks. In: NeurIPS (2019)
30. Zhou, Z., Kearnes, S., Li, L., Zare, R.N., Riley, P.: Optimization of molecules via
deep reinforcement learning. Sci. Rep. (2019)
31. Zhu, H., Gupta, V., Ahuja, S.S., Tian, Y., Zhang, Y., Jin, X.: Network planning
with deep reinforcement learning (2021)
Sparse Graph Neural Networks
with Scikit-Network
1 Introduction
Graph Neural Networks (GNNs) are an extension of traditional deep learning
(DL) methods for relational data structured as graphs [4]. In the past few years,
GNN-based methods have gained increasing attention thanks to their impressive
performance on a wide range of machine learning tasks, such as node classifica-
tion, graph classification, or link prediction [16,19,25,28]. GNNs derive internal
graph element representations using entity relationships and associated features,
making them highly valuable for real-world data, where information frequently
spans across multiple dimensions.
Real-world graphs, exemplified by large social networks or web collections
with millions or billions of elements, each potentially having numerous attributes,
pose substantial challenges for GNNs training [12,13]. Tremendous efforts have
been made to tackle this challenge and scale up GNNs [6–8,16,29,30]. Several
existing approaches rely on parallelisation or approximation techniques such as
sampling or batch training, to reduce memory consumption. Few take the sparse
nature of real-world graphs into account: in a graph with n nodes, the number
of edges m is much lower than the number of possible edges, of order n². This
sparsity can be exploited in data representation and algorithms to reduce the
memory footprint and the computation times.
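As an illustration of the kind of computation this enables, the following sketch performs one GCN-style propagation step using only SciPy sparse matrices; it is a simplified stand-in written for this setting, not the actual layer shipped in Scikit-network.

```python
import numpy as np
from scipy import sparse

def sparse_gcn_step(adjacency, features, weights):
    """One propagation step D~^{-1/2} (A + I) D~^{-1/2} X W with CSR matrices.

    adjacency : scipy.sparse CSR matrix of shape (n, n)
    features  : scipy.sparse CSR matrix of shape (n, d), kept sparse throughout
    weights   : dense (d, d') weight matrix
    """
    n = adjacency.shape[0]
    a_tilde = adjacency + sparse.identity(n, format="csr")       # add self-loops
    deg = np.asarray(a_tilde.sum(axis=1)).ravel()
    d_inv_sqrt = sparse.diags(1.0 / np.sqrt(deg))
    a_norm = d_inv_sqrt @ a_tilde @ d_inv_sqrt                   # symmetric normalization
    return (a_norm @ features) @ weights                         # densifies only at the final product
```

Because both the adjacency and the features stay in compressed sparse formats, memory grows with the number of non-zero entries rather than with n².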
Numerous Python packages already provide GNN implementations, includ-
ing PyTorch Geometric [11], Deep Graph Library [27], Spektral [14], Stellar
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2023, SCI 1141, pp. 16–24, 2024.
https://doi.org/10.1007/978-3-031-53468-3_2
Graph [10] or Dive Into Graphs library [20]. But to allow natural integration
with existing DL frameworks, such as PyTorch [23] or TensorFlow [1], and bene-
fit from differentiable operators, these libraries rely upon a dense tensor-centred
paradigm which does not align with graph sparsity. This dramatically hinders
the use of such libraries on large real-world graphs, by requiring access to servers
with large quantities of RAM.
To address this gap, we propose a GNNs module implementation relying
on sparse matrices for both graph adjacency and features. Our implementation
aligns seamlessly with Scikit-network1 [3], a Python package inspired by Scikit-
learn [5] for graph analysis. Scikit-network leverages the sparse formats provided
by SciPy [26] for encoding graphs. It already provides various state-of-the-art
graph algorithms, including ranking, clustering, and embedding algorithms, with
a high computational efficiency, comparable to that of other tools like graph-
tool [24] or IGraph [9], and generally much higher than that of the popular
NetworkX library [15]. By only relying on NumPy [17] and SciPy [26], the devel-
opment of a GNN module in Scikit-network stays true to the core principles of
the package: performance and ease of use.
To summarise, our contributions encompass three key aspects. Firstly, we
introduce an efficient GNN module within Scikit-network, harnessing the power
of sparse matrices for both graph and features to address the characteristics of
real-world graphs. Secondly, our package bridges the gap between traditional
graph analysis methods and GNNs, offering a unified platform that operates
within the same sparse graph representation. Lastly, we prioritise simplicity by
designing our package to rely solely on foundational Python libraries, specifically
NumPy and SciPy, sparing users the complexity associated with larger tensor-
based DL frameworks.
The rest of the paper is organised as follows. We start by reviewing the
existing related work in Sect. 2. Then, we formulate the computation of the
GNNs message passing scheme using sparse matrices in Sect. 3. In Sect. 4 we
briefly describe our GNNs module design and we show its performance compared
to other GNNs libraries in Sect. 5.
2 Related Work
Several Python packages already exist to help with the development and usage
of GNNs. Pytorch Geometric (PyG) [11] proposes a general message passing
interface with all recent GNN-based aggregation schemes. In order to reduce the
computation time when running on large complex networks, several processing
tools such as sampling or batch training are proposed. Spektral [14] is built
upon the same gather-scatter paradigm as PyG, but implements GNNs on top
of the user-friendly API Keras. Stellar Graph [10] is also based on Keras but uses
its own graph representation. Deep Graph Library [27] involves a combination
of user-configurable message passing functions and sparse-dense matrix multi-
plications to provide the user with a GNNs framework. However, in all these
1 https://github.com/sknetwork-team/scikit-network.
5 Performance
To evaluate Scikit-network’s GNN implementation, we compare (i) the achieved
accuracy level, which should be similar to that of other implementations, and
(ii) the training time required for the model to learn representations. In both
cases, we use a GCN [19] model for a node classification task and compare with
the two most widely used libraries: PyG and DGL2 . The computer used to run
all the experiments is a Mac with OS 12.6.8, equipped with a M1 Pro processor
and 16 GB of RAM. For fair comparison, all experiments are run on a CPU
device.
2 We rely on the number of forks associated with each package on GitHub as a metric
to gauge library usage. Please note that this metric provides only a partial view of
actual project usage.
Fig. 1. Graph methods using Scikit-network. On the left, node classification using a
GNN model. On the right, Louvain clustering [2] and PageRank scoring [22]. Traditional
graph algorithms and deep learning based models use the same API.
5.1 Datasets
We use three real-world datasets of varying sizes considering their structure and
attributes. The datasets include Wikipedia-based networks3 , Wikivitals and
Wikivitals+. Nodes in these datasets represent Wikipedia articles, and there is
a link between two nodes if the corresponding articles are referencing each other
(in either direction) through hypertext links on Wikipedia. Additionally, each
article comes with a feature vector, corresponding to the number of occurrences
of each word (or token) in its summary. OGBN-arxiv [18] models a citation
network among Computer Science papers from arXiv. For this dataset, we use
node connections as the feature matrix. We detail the dataset characteristics in
Table 1.
Table 1. Dataset characteristics.

Dataset      |V|          |E|          δA            d            m             δX
Wikivitals   1.00 × 10^4  8.24 × 10^5  1.64 × 10^-2  3.78 × 10^4  1.363 × 10^6  3.59 × 10^-3
Wikivitals+  4.51 × 10^4  3.94 × 10^6  3.86 × 10^-3  8.55 × 10^4  4.78 × 10^6   1.24 × 10^-3
OGBN-arxiv   1.69 × 10^5  1.66 × 10^6  8.13 × 10^-5  1.69 × 10^5  1.66 × 10^6   8.13 × 10^-5
3 https://netset.telecom-paris.fr/.
Tables 2 and 3 respectively display the running time of the training process and
the corresponding accuracy scores for the different GNNs implementations. For
DGL and PyG, the dense feature matrix format hinders training models on
large graphs without additional tricks, e.g., sampling or batch training (which
we did not use for fair comparison). Therefore, these implementations trigger an
out-of-memory (OOM) error when used on the OGBN-arxiv dataset. In contrast,
Scikit-network does not require these extra steps to achieve good performance on
this dataset within a reasonable computation time. Furthermore, we can observe
the benefits of using Scikit-network regarding the characteristics of the graph:
the sparser the graph, the more efficient the computation.
Table 2. Average computation times and standard deviations (3 runs) for 100 epochs
model training on 3 real-world datasets.
Table 3. Average accuracy on test set and standard deviations (3 runs) for 100 epochs
model training on 3 real-world datasets.
Fig. 2. Computation time for several GCN [19] implementations, according to adja-
cency and feature matrix densities. Notice the log-scale on both axes.
Fig. 3. Computation time for several GCN [19] implementations, according to the
number of nodes in the graph. Notice the log-scale on both axes.
6 Conclusion
and running time, by making use of sparse encoding and operations in the learn-
ing process. Moreover, our design relies solely on foundational Python libraries,
NumPy and SciPy, and does not require the use of tensor-centred traditional
Deep Learning frameworks. With this module, Scikit-network offers a unified
platform gathering traditional graph analysis algorithms and deep learning-based
models. In the future, we plan to extend the current module by adding additional
state-of-the-art GNN models and layers, as well as optimizers.
Acknowledgements. The authors would like to thank Tiphaine Viard for the numer-
ous discussions, as well as her insightful comments and suggestions about the paper.
References
1. Abadi, M., et al.: Tensorflow: a system for large-scale machine learning. In: 12th
USENIX Symposium on Operating Systems Design and Implementation (OSDI
16), pp. 265–283 (2016)
2. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. J. Statist. Mech. Theory Exp. 2008(10), P10008
(2008). https://doi.org/10.1088/1742-5468/2008/10/p10008
3. Bonald, T., De Lara, N., Lutz, Q., Charpentier, B.: Scikit-network: graph analysis
in python. J. Mach. Learn. Res. 21(1), 7543–7548 (2020)
4. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric
deep learning: going beyond Euclidean data. IEEE Signal Process. Mag. 34(4),
18–42 (2017). https://doi.org/10.1109/MSP.2017.2693418
5. Buitinck, L., et al.: API design for machine learning software: experiences from the
scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and
Machine Learning, pp. 108–122 (2013)
6. Chen, J., Ma, T., Xiao, C.: Fastgcn: fast learning with graph convolutional networks
via importance sampling. arXiv preprint arXiv:1801.10247 (2018)
7. Chen, J., Zhu, J., Song, L.: Stochastic training of graph convolutional networks
with variance reduction. arXiv preprint arXiv:1710.10568 (2017)
8. Chiang, W.L., Liu, X., Si, S., Li, Y., Bengio, S., Hsieh, C.J.: Cluster-gcn: an efficient
algorithm for training deep and large graph convolutional networks. In: Proceedings
of the 25th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp. 257–266 (2019)
9. Csardi, G., Nepusz, T., et al.: The igraph software package for complex network
research. Int. J. Complex Syst. 1695(5), 1–9 (2006)
10. Data61, C.: Stellargraph machine learning library (2018). https://github.com/
stellargraph/stellargraph
11. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric.
In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
12. Fey, M., Lenssen, J.E., Weichert, F., Leskovec, J.: Gnnautoscale: scalable and
expressive graph neural networks via historical embeddings. In: International Con-
ference on Machine Learning, pp. 3294–3304. PMLR (2021)
13. Frasca, F., Rossi, E., Eynard, D., Chamberlain, B., Bronstein, M., Monti, F.: Sign:
scalable inception graph neural networks. arXiv preprint arXiv:2004.11198 (2020)
14. Grattarola, D., Alippi, C.: Graph neural networks in tensorflow and keras with
spektral [application notes]. IEEE Comput. Intell. Mag. 16(1), 99–106 (2021)
15. Hagberg, A., Swart, P., Chult, D.S.: Exploring network structure, dynamics, and
function using networkx. Tech. rep., Los Alamos National Lab. (LANL), Los
Alamos (2008)
16. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large
graphs. Adv. Neural Inf. Process. Syst. 30 (2017)
17. Harris, C.R., et al.: Array programming with numpy. Nature 585(7825), 357–362
(2020)
18. Hu, W., et al.: Open graph benchmark: datasets for machine learning on graphs.
arXiv preprint arXiv:2005.00687 (2020)
19. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
20. Liu, M., et al.: Dig: a turnkey library for diving into graph deep learning research.
J. Mach. Learn. Res. 22(1), 10873–10881 (2021)
21. Lutz, Q.: Graph-based contributions to machine-learning. Theses, Institut Poly-
technique de Paris (2022). https://theses.hal.science/tel-03634148
22. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
bringing order to the web. Tech. rep., Stanford University (1998)
23. Paszke, A.,et al.: Pytorch: an imperative style, high-performance deep learning
library. Adv. Neural Inf. Process. Syst. 32 (2019)
24. Peixoto, T.P.: The graph-tool python library. Figshare (2014). https://doi.org/10.
6084/m9.figshare.1164194
25. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph
attention networks. arXiv preprint arXiv:1710.10903 (2017)
26. Virtanen, P., et al.: Scipy 1.0: fundamental algorithms for scientific computing in
python. Nat. Methods 17(3), 261–272 (2020)
27. Wang, M., et al.: Deep graph library: a graph-centric, highly-performant package
for graph neural networks. arXiv preprint arXiv:1909.01315 (2019)
28. Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks?
arXiv preprint arXiv:1810.00826 (2018)
29. Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W.L., Leskovec, J.: Graph
convolutional neural networks for web-scale recommender systems. In: Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp. 974–983 (2018)
30. Zeng, H., Zhou, H., Srivastava, A., Kannan, R., Prasanna, V.: Graphsaint: graph
sampling based inductive learning method. arXiv preprint arXiv:1907.04931 (2019)
Enhancing Time Series Analysis
with GNN Graph Classification Models
Alex Romanova
1 Introduction
In 2012, a significant breakthrough occurred in the fields of deep learning and
knowledge graphs. Convolutional Neural Network (CNN) image classification
was introduced through AlexNet [1], showcasing its superiority over previous
machine learning techniques in various domains [2]. Concurrently, Google intro-
duced knowledge graphs, enabling machines to understand relationships between
entities and revolutionizing data integration and management, enhancing prod-
ucts with intelligent and ‘magical’ capabilities [3].
The growth of deep learning and knowledge graphs occurred simultaneously
for years, with CNN excelling at grid-structured data tasks but struggling with
graph-structured ones. Conversely, graph techniques thrived on graph structured
data but lacked deep learning’s capability. In the late 2010s, Graph Neural Net-
works (GNN) emerged, combining deep learning and graph processing, and rev-
olutionizing how we handle graph-structured data [4].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2023, SCI 1141, pp. 25–36, 2024.
https://doi.org/10.1007/978-3-031-53468-3_3
2 Related Work
In 2012, the introduction of the AlexNet model and Knowledge Graphs marked
a breakthrough year for deep learning and knowledge graphs. The AlexNet model [1],
along with the success of Convolutional Neural Networks in image classifica-
tion, outperformed previous state-of-the-art machine learning techniques across
various domains [2]. Concurrently, Google’s introduction of Knowledge Graphs
[8] enabled machines to understand entity relationships, driving a new era in
data integration and management that enhance product intelligence and user
experience [3]. In the late 2010s, Graph Neural Networks emerged as a fusion
of deep learning and knowledge graphs. GNNs enable complex data analysis and
predictions by effectively capturing relationships between graph nodes [9,10].
While GNNs have been employed across a myriad of disciplines, our study
zeroes in on the niche area of GNN Graph Classification. This methodology
proves invaluable in sectors like chemistry, medicine, and biology for delving into
intricate relationships within molecular structures, proteins, and biomolecules.
Utilizing graphs to represent these entities allows researchers to uncover novel
insights, potentially revolutionizing therapeutic or treatment avenues [11–13].
In the vast landscape of GNNs, our focus is sharply tuned to the potential
of GNN Graph Classification in time series data analysis. One salient feature of
these models is their nuanced sensitivity to graph topology. While this can com-
plicate handling noise and adapting to spatial-temporal variations, it emerges
as a strength when detecting outliers and anomalies. Upcoming sections will
further explore this intricate balance, showcasing moments where the model’s
discernment comes to the fore. Interestingly, our literature review suggests that
the application of GNN Graph Classification to time series data remains an
underexplored avenue. This observation underscores the novelty and value of
our current research direction.
With the evolution of technology, methods to analyze time series data have
also matured. While traditional approaches may falter at the intricacies and
spatial-temporal challenges inherent to this data, a comprehensive review by
[14] underscored GNNs’ promise. Notably absent, however, was a focus on GNN
graph classification within the time series context. Our study endeavors to fill
this void, illuminating the potential of GNN Graph Classification for in-depth
time series insights.
In this study, we examine the potential of GNN Graph Classification models
in the context of time series data, with a specific focus on healthcare and environ-
mental sectors. By transforming time series, prevalent across various domains,
into graphs, we aim to tap into their inherent relationships. Our primary goal
centers on classifying EEG signals and climate data, confronting the challenges
of graph-centric time series classification [13].
Recent research has leveraged CNN deep learning methods for atmospheric
imaging, particularly in the estimation of tropical cyclones [15]. However, surveys
suggest that the application of deep learning to climate data mining is still in
its early stages and continues to evolve [16,17]. In our study, we extend this
exploration by introducing the application of GNN Graph Classification models
to climate data, aiming to uncover nuanced patterns and insights from this
complex dataset.
Deep learning has reshaped EEG signal interpretation, especially in neural
engineering and biomedical fields [18–20]. In a prior study, we utilized CNN
image classification combined with graph mining to differentiate EEG patterns
between alcoholic and control subjects [21]. Here, we further explore the efficacy
of GNN Graph Classification models on EEG datasets.
3 Methods
The input for GNN Graph Classification models consists of a collection of small
labeled graphs representing objects in the dataset. These graphs are composed of
nodes and edges, with associated features that describe attributes of the entities
and relationships between them.
The data flow diagram in Fig. 1 contrasts and compares the processes applied
to climate data and EEG data. In both cases, graph edges are established based
on pairs of vectors if their cosine similarity surpasses a set threshold. To ensure
connectivity within our graphs, we integrated a virtual node, linking it with
all existing nodes. This approach, as showcased in Fig. 2, effectively transforms
isolated graph segments into unified single connected components.
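A compact sketch of this graph-construction step is given below, assuming each object arrives as a matrix whose rows are node feature vectors; the 0.85 threshold and the use of NetworkX are illustrative choices.

```python
import networkx as nx
import numpy as np

def build_graph(node_features, threshold=0.85):
    """Connect node pairs whose cosine similarity exceeds `threshold`, then add a
    virtual node linked to every node so the graph is a single connected component."""
    norms = np.linalg.norm(node_features, axis=1, keepdims=True)
    normed = node_features / np.maximum(norms, 1e-12)
    cosine = normed @ normed.T                           # pairwise cosine similarity matrix
    n = node_features.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if cosine[i, j] > threshold:
                G.add_edge(i, j)
    virtual = n                                          # extra node bridging isolated segments
    G.add_edges_from((virtual, i) for i in range(n))
    return G
```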
For both scenarios, we employ a Graph Convolutional Network Convolution
(GCNConv) model from the PyTorch Geometric Library (PyG) to perform GNN
Graph Classification [22].
Fig. 1. Data flow diagrams for EEG data and Climate data scenarios: common steps
and different steps.
Fig. 2. (a) A highly connected graph with high degree of connectivity representing sta-
ble climate patterns in Malaga, Spain. (b) A sparsely connected graph with low degree
of connectivity indicating unstable and unpredictable climate patterns in Orenburg,
Russia.
We employed the GCNConv model from the PyTorch Geometric Library (PyG)
[22] for our GNN Graph Classification tasks. This model harnesses convolutional
operations to extract graph features, leveraging edges, node attributes, and graph
labels during the process. For those interested in the data conversion to the PyG
format, details can be found in our technical blogs [23,24].
In our trials involving EEG and climate data, the model quickly demon-
strated outstanding performance, achieving a remarkable 100% accuracy in just
a few training epochs. Such early success prompted us to question its plausibility.
Delving deeper, we attributed this to the model's acute sensitivity to
nuances in graph topology. The implications and details of this heightened sen-
sitivity are further explored in the experiments section.
4 Experiments
4.1 EEG Data Graph Classification
EEG Data Source. We utilized the ‘EEG-Alcohol’ Kaggle dataset [6], which
consists of EEG correlates from a study on genetic predisposition to alcoholism.
It includes data from 8 subjects, each exposed to different stimuli while their
brain electrical activity was recorded using 64 electrodes at a sampling rate of
256 Hz for 1 s. The total number of person-trial pairs was 61. Our data prepara-
tion process involved using some code from Ruslan Klymentiev’s Kaggle note-
book [25] and developing our own code to transform EEG channel data into time
series.
Prepare Input Data for Graph Classification Model. For the EEG data,
separate graphs were created for each person-trial, with labels indicating the
alcohol or control group. Electrode positions were used as nodes and EEG chan-
nel signals as node features. Cosine similarity matrices were calculated for each
graph to select node pairs with cosine similarities above a certain threshold, and
virtual nodes were added to all graphs to transform them into single connected
components. The challenge of a small training dataset (61 person-trial graphs)
was addressed by randomly varying threshold values within the range (0.75,
0.95), augmenting the input dataset to 1037 graphs and enhancing the GNN
Graph Classification model’s performance.
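The augmentation step can be sketched as follows: each person-trial is expanded into several graphs, one per randomly drawn threshold, re-using a construction helper such as the `build_graph` sketch above; the choice of 17 copies simply reflects that 61 × 17 = 1037 graphs.

```python
import numpy as np

def augment_by_threshold(trial_features, labels, n_copies=17, seed=0):
    """Expand each labeled trial into n_copies graphs built with random thresholds."""
    rng = np.random.default_rng(seed)
    graphs, graph_labels = [], []
    for features, label in zip(trial_features, labels):
        for _ in range(n_copies):
            tau = rng.uniform(0.75, 0.95)                # random threshold in (0.75, 0.95)
            graphs.append(build_graph(features, threshold=tau))
            graph_labels.append(label)
    return graphs, graph_labels
```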
Train the Model. The study employed the GCNConv model from the PyG
library [22] for classifying EEG data into alcoholic and control groups. The
input dataset was randomly split into 15% testing data (155 graphs) and 85%
training data (882 graphs). The model achieved an accuracy of about 98.4%
on the training data and 98.1% on the testing data. The slight fluctuations in
accuracy can be attributed to the relatively small testing dataset.
Figure 3, adapted from our earlier work, underscores the difficulty in distinguish-
ing between the groups based solely on these trials, highlighting their limited
utility in precise group categorization.
Fig. 3. Graph visualization from our previous study [21] illustrates that reactions on
‘single stimulus’ trials are similar for persons from Alcohol and Control groups.
Climate Data Source. We utilized average daily temperature data from year
1980 to year 2019 for 1000 most populous cities worldwide from a Kaggle dataset
[7]. For the GNN Graph Classification model, we created separate graphs for all
cities, using city-year combinations as nodes, daily temperature vectors of cor-
responding years as node features, and cosine similarities higher than certain
thresholds as graph edges. Graph labels, indicating stable or unstable long-term
climate trends, were created by calculating average cosines between daily tem-
perature vectors of consecutive years.
Table 1. Very high average cosine similarities indicate stable climate with less variance
in daily temperature patterns.
Table 2. A decrease in the average cosine similarity between consecutive years can
indicate an increase in the variance or difference in daily temperature patterns, which
could be a sign of climate change.
Fig. 4. Results of our previous studies [26, 27]: cities located near the Mediterranean
Sea have very stable and consistent temperature patterns.
5 Conclusions
In our pioneering work, we repurposed GNN Graph Classification models, tra-
ditionally used in fields like biology and chemistry, for time series analysis on
EEG and climate data. By introducing virtual nodes, we bridged fragmented
input graphs, deepening our data representation. For climate data, we uniquely
labeled graphs based on average cosine values between consecutive years, pro-
viding a fresh perspective on climate trends.
Our findings revealed a pronounced sensitivity in the model’s interpretation
of graph topology. While initially viewed as a possible shortcoming, this sensitiv-
ity proved to be a valuable strength. In the EEG data, anomalies were notably
centered around one individual from the control group. Additionally, the “single
stimulus” patterns consistently emerged as indistinguishable between the Alco-
holic and Control groups, echoing observations from our prior research.
In our climate analysis, cities such as Monaco, Nice, and Marseille in the
Mediterranean region defied expectations with their stable temperature patterns.
These observations align with our prior research, where we introduced an innova-
tive symmetry metric for time series analysis via CNN models, offering a distinct
perspective compared to the conventional cosine similarity measures. The con-
cordant findings from both the symmetry metrics and GNN Graph Classification
models underscore their collective strength in detecting and emphasizing outliers
within the dataset.
In closing, our study reinforces the versatility and promise of GNN Graph
Classification models, emphasizing their ability to discern intricate patterns,
anomalies, and relationships within data. The harmony between our current
findings and those from previous research speaks to the consistent strength of
our methodologies. We believe these insights pave the way for further exploration
and broader applications of GNN Graph Classification models in forthcoming
analytical pursuits.
References
1. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep con-
volutional neural networks. Adv. Neural Inf. Process. Syst. (2012)
2. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444
(2015)
3. Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale
knowledge graphs: lessons and challenges. acmqueue (2019)
4. Bronstein, M., Bruna, J., Cohen, T., Veličković, P.: Geometric deep learning:
grids, groups, graphs, geodesics, and gauges (2021). https://doi.org/10.48550/
arXiv.2104.13478
5. Romanova, A.: GNN graph classification method to discover climate change pat-
terns. In: Artificial Neural Networks and Machine Learning (ICANN). Springer,
Cham (2023)
6. kaggle.com. EEG-Alcohol Data Set (2017)
7. kaggle.com. Temperature History of 1000 Cities 1980 to 2020 (2020)
8. Bradley, A.: Semantics Conference, 2017 (2017)
9. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.: A Comprehensive Survey
on Graph Neural Networks (2019)
10. Wang, M., Qiu, L., Wang, X.: A Survey on Knowledge Graph Embeddings for Link
Prediction. Symmetry (2021)
11. Adamczyk, J.: Application of Graph Neural Networks and graph descriptors for
graph classification (2022)
12. Hu, W., et al.: Strategies for Pre-training Graph Neural Networks (2020)
13. He, H., Queen, O., Koker, T., Cuevas, C., Tsiligkaridis, T., Zitnik, M.: Domain
Adaptation for Time Series Under Feature and Label Shifts (2023)
14. Jin, M., et al.: A Survey on Graph Neural Networks for Time Series: Forecasting,
Classification, Imputation, and Anomaly Detection (2023)
15. Ardabili, S., Mosavi, A., Dehghani, M., Varkonyi-Koczy, A.: Application of
Deep Convolutional Neural Networks for Detecting Extreme Weather in Climate
Datasets (2019)
16. Liu, Y., Racah, E.: Deep Learning and Machine Learning in Hydrological Processes,
Climate Change and Earth (2019)
17. Liu, Y., et al.: Application of Deep Convolutional Neural Networks for Detecting
Extreme Weather in Climate Datasets (2016)
18. Craik, A., He, Y., Contreras-Vidal, J.: Deep Learning for Electroencephalogram
(EEG) Classification Tasks: A Review (2019)
19. Gemein, L.A.W.: A Machine-Learning-Based Diagnostics of EEG Pathology (2020)
20. Roy, Y., Banville, H., Albuquerque, I., Fauber, J.: Deep Learning-Based Electroen-
cephalography Analysis: A Systematic Review (2019)
21. Romanova, A.: Time series pattern discovery by deep learning and graph mining.
In: Database and Expert Systems Applications (DEXA) (2021)
22. PyG. Pytorch Geometric Library: Graph Classification with Graph Neural Net-
works (2023)
23. GNN Graph Classification for Climate Change Patterns. Graph Neural Network
(GNN) Graph Classification - A Novel Method for Analyzing Time Series Data
(2023). http://sparklingdataocean.com/2023/02/11/cityTempGNNgraphs/
24. GNN Graph Classification for EEG Pattern Analysis. Graph Neural Net-
work for Time-Series Analysis (2023). http://sparklingdataocean.com/2023/05/
08/classGraphEeg/
When Do We Need Graph Neural Networks for Node Classification?

1 Introduction
In the past decade, deep Neural Networks (NNs) [11] have revolutionized many machine learning areas, and one of their major strengths is their capacity for learning effective latent representations from Euclidean data. Recently, the focus has shifted to their application to non-Euclidean data, e.g., relational data or graphs. By combining graph signal processing with convolutional neural networks [12], numerous Graph Neural Networks (GNNs) have been proposed [7,8,10,21,27] that empirically outperform traditional neural networks on graph-based machine learning tasks, e.g., node classification, graph classification, link prediction, and graph generation.
Nevertheless, growing evidence shows that GNNs do not always gain advantages over traditional NNs on relational data [14,16,17,20,23,30]. In some cases, even a simple Multi-Layer Perceptron (MLP) can outperform GNNs by a large margin; e.g., as shown in Table 1, MLP outperforms baseline GNNs on Cornell, Wisconsin, Texas, and Film, and performs almost the same as baseline GNNs on PubMed, Coauthor CS, and Coauthor Phy. This makes us wonder when it is appropriate to use GNNs. In this work, we explore an explanation and propose
two proper measures to determine when to use GNNs for a node classification
task.
A common way to leverage graph structure is to apply graph filters in each
hidden layer of NNs to help feature extraction. Most existing graph filters can be
viewed as operators that aggregate node information from its direct neighbors.
Different graph filters yield different spectral or spatial GNNs. Among them, the
most commonly used is the renormalized affinity matrix [10], which corresponds to a low-pass (LP) filter [24] mainly capturing the low-frequency components of the input, i.e., the locally smooth features across the whole graph [28].
The use of LP graph filters relies on the assumption that nodes tend to share
attributes with their neighbors, a tendency called homophily [9,25] that is widely
exploited in node classification tasks. GNNs that are built on the homophily
assumption learn to assign similar labels to nodes that are closely connected
[29], which corresponds to an assumption of intrinsic smoothness on latent label
distribution. We call this kind of relational inductive bias [2] the edge bias. We
believe it is a key factor leading to GNNs’ superior performance over NNs’ in
many tasks.
However, the existing homophily metrics are not appropriate for revealing the edge bias; e.g., as shown in Table 1, MLP does not necessarily outperform baseline GNNs on some low-homophily datasets (Chameleon and Squirrel) and does not significantly underperform baseline GNNs on some high-homophily datasets
2 Preliminaries
Having stated our motivation, in this section we introduce the notation used and formalize the idea. We use bold fonts for vectors (e.g., v). Suppose we have an undirected connected graph $G = (V, E, A)$ without bipartite component, where $V$ is the node set with $|V| = N$; $E$ is the edge set without self-loops; $A \in \mathbb{R}^{N \times N}$ is the symmetric adjacency matrix with $A_{ij} = 1$ if and only if $e_{ij} \in E$, and $A_{ij} = 0$ otherwise; $D$ is the diagonal degree matrix, i.e., $D_{ii} = \sum_j A_{ij}$; and $N_i = \{j : e_{ij} \in E\}$ is the neighborhood set of node $i$. A graph signal is a vector $x \in \mathbb{R}^N$ defined on $V$, where $x_i$ is defined on node $i$. We also have a feature matrix $X \in \mathbb{R}^{N \times F}$ whose columns are graph signals; each node $i$ has a corresponding feature vector $X_{i:}$ with dimension $F$, which is the $i$-th row of $X$. We denote $Z \in \mathbb{R}^{N \times C}$ as the label encoding matrix, where $Z_{i:}$ is the one-hot encoding of the label of node $i$.
where $W_0 \in \mathbb{R}^{F \times F_1}$ and $W_1 \in \mathbb{R}^{F_1 \times O}$ are parameter matrices. GCN can learn by minimizing the following cross-entropy loss
The random walk renormalized matrix $\hat{A}_{rw} = \tilde{D}^{-1}\tilde{A}$ can also be applied to GCN, and $\hat{A}_{rw}$ shares the same eigenvalues as $\hat{A}_{sym}$. The corresponding Laplacian is defined as $\hat{L}_{rw} = I - \hat{A}_{rw}$. Specifically, the nature of the random walk matrix makes $\hat{A}_{rw}$ behave as a mean aggregator, $(\hat{A}_{rw} x)_i = \sum_{j \in \{N_i \cup i\}} x_j / (D_{ii} + 1)$, which is applied in [8] and is important to bridge the gap between spatial- and spectral-based graph convolution methods.
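As a quick illustration of these operators (not from the paper), the following numpy sketch builds $\hat{A}_{sym}$ and $\hat{A}_{rw}$ for a toy graph and checks that $\hat{A}_{rw}$ averages each node's closed neighborhood and shares its eigenvalues with $\hat{A}_{sym}$; the graph and signal are illustrative.

```python
import numpy as np

# Toy undirected graph: 4 nodes, edges (0-1), (1-2), (2-3), (0-2).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_tilde = A + np.eye(4)                     # add self-loops: Ã = A + I
d_tilde = A_tilde.sum(axis=1)               # D̃_ii = sum_j Ã_ij
D_inv = np.diag(1.0 / d_tilde)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))

A_sym = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # Â_sym = D̃^{-1/2} Ã D̃^{-1/2}
A_rw = D_inv @ A_tilde                      # Â_rw  = D̃^{-1} Ã

x = np.array([1.0, 2.0, 3.0, 4.0])          # a graph signal

# Â_rw averages each node's closed neighbourhood {N_i ∪ i}:
print(A_rw @ x)                             # [2.0, 2.0, 2.5, 3.5]

# Â_rw and Â_sym are similar matrices, hence share eigenvalues.
print(np.sort(np.linalg.eigvals(A_rw).real))
print(np.sort(np.linalg.eigvalsh(A_sym)))
```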
NSV. Even when the features of the node resemble its aggregated neighbor-
hood, it does not necessarily mean that the average pairwise attribute distance
of connected nodes is smaller than that of unconnected nodes. Based on this
argument, we define Normalized Smoothness Value (NSV) as a measure of the
effect of the edge bias.
The total pairwise attribute distance of connected nodes is equivalent to the
Dirichlet energy of X on G as follows,
$$E_D^G(X) = \sum_{i \leftrightarrow j} \|X_{i:} - X_{j:}\|_2^2 = \sum_{i \leftrightarrow j} (e_i - e_j)^T X X^T (e_i - e_j) = \mathrm{tr}\Big(\sum_{i \leftrightarrow j} (e_i - e_j)(e_i - e_j)^T X X^T\Big) = \mathrm{trace}(X^T L X).$$
The total pairwise distance of unconnected nodes can be derived from the Laplacian $L^C$ of the complementary graph $G^C$. To get $L^C$, we introduce the adjacency matrix of $G^C$ as $A^C = (11^T - I) - A$, its degree matrix $D^C = (N-1)I - D$, and $L^C = D^C - A^C = NI - 11^T - L$. Then, the total pairwise attribute distance of unconnected nodes (Dirichlet energy of $X$ on $G^C$) is
$$E_D^{G^C}(X) = \mathrm{trace}\big(X^T L^C X\big) = \mathrm{trace}\big(X^T (NI - 11^T) X\big) - E_D^G(X).$$
$E_D^G(X)$ and $E_D^{G^C}(X)$ are non-negative and are closely related to the sample covariance matrix (see Appendix A for details) as follows,
$$E_D^G(X) + E_D^{G^C}(X) = \mathrm{trace}\big(X^T (NI - 11^T) X\big) = N(N-1) \cdot \mathrm{trace}\big(\mathrm{Cov}(X)\big).$$
Since $\mathrm{trace}(\mathrm{Cov}(X))$ is the total variation in $X$, we can say that the total sample variation can be decomposed in a certain way onto $G$ and $G^C$ as $E_D^G(X)$ and $E_D^{G^C}(X)$. Then, the average pairwise distance (variation) of connected nodes and unconnected nodes can be calculated by normalizing $E_D^G(X)$ and $E_D^{G^C}(X)$,
$$E_N^G(X) = \frac{E_D^G(X)}{2|E|}, \qquad E_N^{G^C}(X) = \frac{E_D^{G^C}(X)}{N(N-1) - 2|E|}. \quad (4)$$
We can see that $0 \le \mathrm{NSV}^G(X) \le 1$ and it can be used to interpret the edge bias: (1) for labels $Z$, $\mathrm{NSV}^G(Z) \ge 0.5$ means that the proportion of connected nodes that share different labels is larger than that of unconnected nodes, which implies that the edge bias is harmful for $Z$ and the homophily assumption is invalid; (2) for features $X$, $\mathrm{NSV}^G(X) \ge 0.5$ means that the average pairwise feature distance of connected nodes is greater than that of unconnected nodes, which suggests that the features are non-smooth. On the contrary, small $\mathrm{NSV}^G(Z)$ and $\mathrm{NSV}^G(X)$ indicate that the homophily assumption holds and the edge bias is potentially beneficial.
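The following sketch computes NSV for a toy graph. It assumes the closed form $\mathrm{NSV}^G(X) = E_N^G(X) / (E_N^G(X) + E_N^{G^C}(X))$, i.e., the share of normalized Dirichlet energy carried by connected pairs; the exact definition of NSV falls outside this excerpt, so treat this as an illustrative reading consistent with the $0 \le \mathrm{NSV}^G(X) \le 1$ range and the 0.5 interpretation above.

```python
import numpy as np

def nsv(A, X):
    """Normalized Smoothness Value of features X on a graph with adjacency A.

    Assumes NSV^G(X) = E_N^G(X) / (E_N^G(X) + E_N^{G^C}(X)) — an illustrative
    reading, since the closed-form definition is not shown in this excerpt.
    """
    N = A.shape[0]
    D = np.diag(A.sum(axis=1))
    L = D - A                                         # graph Laplacian
    E_G = np.trace(X.T @ L @ X)                       # Dirichlet energy on G
    total = np.trace(X.T @ (N * np.eye(N) - np.ones((N, N))) @ X)
    E_GC = total - E_G                                # energy on complement G^C
    n_edges = int(A.sum() / 2)                        # |E|
    EN_G = E_G / (2 * n_edges)                        # Eq. (4)
    EN_GC = E_GC / (N * (N - 1) - 2 * n_edges)
    return EN_G / (EN_G + EN_GC)

# Toy example: two feature clusters; compare a homophilic and a heterophilic graph.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
A_homo = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], float)
A_hetero = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]], float)
print(nsv(A_homo, X))    # well below 0.5: connected nodes are similar
print(nsv(A_hetero, X))  # above 0.5: connected nodes differ more than unconnected
```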
The above analysis raises another question: how much must NSV deviate from 0.5, i.e., what NSV value indicates that the edge bias is statistically beneficial or harmful? In the following section, we study this problem from a statistical hypothesis testing perspective and provide thresholds via p-values.
For features X:
– $D_1 = \{\|X_{i:} - X_{j:}\|_2^2 \mid e_{ij} \in E\}$ = distribution of pairwise feature distances of connected nodes;
– $D_2 = \{\|X_{i:} - X_{j:}\|_2^2 \mid e_{ij} \notin E\}$ = distribution of pairwise feature distances of unconnected nodes.
Suppose $P_1$, $P_2$, $D_1$, $D_2$ follow:
To conduct the hypothesis tests, we use Welch's t-test for features and the $\chi^2$ test for labels. $E_N^G(Z)$ and $E_N^{G^C}(Z)$ are sample estimates of the means $p_1$ and $p_2$ for label $Z$; $E_N^G(X)$ and $E_N^{G^C}(X)$ are sample estimates of the means $d_1$ and $d_2$ for $X$. Thus, the p-values of the hypothesis tests can suggest whether NSV statistically deviates from 0.5. The smoothness of labels and features can be indicated as follows.
For feature X:
– p-value($H_0^F$ vs $H_1^F$): > 0.05, $H_0^F$ holds, feature is non-smooth; ≤ 0.05, to be determined.
– p-value($H_0^F$ vs $H_2^F$): ≤ 0.05, feature is statistically significantly non-smooth.
– p-value($H_0^F$ vs $H_3^F$): ≤ 0.05, feature is statistically significantly smooth.
For label Z:
– p-value($H_0^L$ vs $H_1^L$): > 0.05, $H_0^L$ holds, label is non-smooth; ≤ 0.05, to be determined.
– p-value($H_0^L$ vs $H_2^L$): ≤ 0.05, label is statistically significantly non-smooth.
– p-value($H_0^L$ vs $H_3^L$): ≤ 0.05, label is statistically significantly smooth.
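A possible way to run these tests with scipy is sketched below: Welch's t-test compares the connected and unconnected pairwise feature distances ($D_1$ vs. $D_2$), and a $\chi^2$ test on a 2×2 contingency table compares label disagreement among connected and unconnected pairs. The graph, features, and labels are random stand-ins, and the one-sided variants ($H_2$, $H_3$) are omitted for brevity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def pairwise_sq_dists(X, pairs):
    return np.array([np.sum((X[i] - X[j]) ** 2) for i, j in pairs])

# Toy graph: random features/labels, random edge set (illustration only).
N, F = 60, 8
X = rng.normal(size=(N, F))
Z = rng.integers(0, 3, size=N)                     # labels
all_pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
edge_mask = rng.random(len(all_pairs)) < 0.1
E = [p for p, m in zip(all_pairs, edge_mask) if m]
E_c = [p for p, m in zip(all_pairs, edge_mask) if not m]

# Features: Welch's t-test on D1 (connected) vs D2 (unconnected) distances.
D1 = pairwise_sq_dists(X, E)
D2 = pairwise_sq_dists(X, E_c)
t_stat, p_feat = stats.ttest_ind(D1, D2, equal_var=False)   # Welch: unequal variances

# Labels: chi-square test on the 2x2 table (connected/unconnected x same/different label).
diff1 = np.array([Z[i] != Z[j] for i, j in E]).sum()
diff2 = np.array([Z[i] != Z[j] for i, j in E_c]).sum()
table = np.array([[diff1, len(E) - diff1],
                  [diff2, len(E_c) - diff2]])
chi2, p_label, _, _ = stats.chi2_contingency(table)

print(f"feature test p-value: {p_feat:.3f}, label test p-value: {p_label:.3f}")
```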
Results of the hypothesis testing are summarized in Table 2. We can see that for the datasets where baseline GNNs underperform MLP, Cornell, Texas, and Wisconsin have statistically significantly non-smooth labels and Film has non-smooth labels. On these datasets, the edge bias provides harmful information regardless of whether the features are smooth. The other datasets have statistically significantly smooth labels, which means the edge bias can statistically provide benefits to the baseline GNNs and lead them to superior performance over MLP.
NTV. When the NTVs of node features X and labels Z are small, it implies
Table 2. NTV, NSV, p-values, and the performance comparison of baseline models on 14 real-world datasets.
Cornell 0.33 0.48 0.00 1.00 0.00 0.33 0.53 0.0003 0.00 1.00 −17.48
Texas 0.33 0.48 0.00 1.00 0.00 0.42 0.60 0.00 0.00 1.00 −16.66
Wisconsin 0.38 0.51 0.72 0.36 0.64 0.40 0.55 0.00 0.00 1.00 −16.47
Film 0.39 0.50 0.19 0.90 0.10 0.37 0.50 0.05 0.97 0.03 −4.09
Coauthor CS 0.36 0.36 0.00 1.00 0.00 0.19 0.18 0.00 1.00 0.00 0.18
Pubmed 0.33 0.44 0.00 1.00 0.00 0.25 0.24 0.00 1.00 0.00 0.35
Coauthor Phy 0.35 0.36 0.00 1.00 0.00 0.16 0.09 0.00 1.00 0.00 0.81
AMZ Photo 0.41 0.39 0.00 1.00 0.00 0.23 0.17 0.00 1.00 0.00 1.03
AMZ Comp 0.41 0.38 0.00 1.00 0.00 0.25 0.22 0.00 1.00 0.00 2.93
Citeseer 0.35 0.45 0.00 1.00 0.00 0.22 0.24 0.00 1.00 0.00 3.28
DBLP 0.37 0.46 0.00 1.00 0.00 0.21 0.20 0.00 1.00 0.00 6.93
Squirrel 0.47 0.54 0.00 0.00 1.00 0.44 0.49 0.00 1.00 0.00 9.24
Chameleon 0.45 0.45 0.00 1.00 0.00 0.45 0.49 0.00 1.00 0.00 10.66
Cora 0.38 0.47 0.00 1.00 0.00 0.20 0.19 0.00 1.00 0.00 12.31
This suggests that GNNs work more effectively than graph-agnostic methods when $\mathrm{NTV}^G(Z)$ is small. However, when labels are non-smooth on $G$, a projection onto the column space of $\hat{A}$ will hurt the expressive power of the model. In a nutshell, GNNs potentially have stronger expressive power than NNs when $\mathrm{NTV}^G(Z)$ is small.
where $Y = \hat{A}XW$, $1 \in \mathbb{R}^{C \times 1}$, and $C$ is the output dimension. The loss function (2) can be written as
$$L = -\mathrm{trace}\big(Z^T \hat{A}XW\big) + \mathrm{trace}\big(1^T \log(\exp(Y)\, 1)\big). \quad (10)$$
We denote $\tilde{X} = XW$ and consider $-\mathrm{trace}(Z^T \hat{A}XW)$, which plays the main role in the above optimization problem,
$$-\mathrm{trace}\big(Z^T \hat{A}XW\big) = -\mathrm{trace}\big(Z^T \hat{A}\tilde{X}\big) = -\sum_{i \leftrightarrow j} \hat{A}_{ij} Z_{i:} \tilde{X}_{j:}^T. \quad (11)$$
To minimize $L$, if $\hat{A}_{ij} \neq 0$, then $\tilde{X}_{j:}$ will learn to get closer to $Z_{i:}$, and this means: (1) if $Z_{i:} = Z_{j:}$, $\tilde{X}_{j:}$ will learn to approach the unseen ground-truth label $Z_{j:}$, which is beneficial; (2) if $Z_{i:} \neq Z_{j:}$, $\tilde{X}_{j:}$ tends to learn a wrong label, in which case the edge bias becomes harmful. Conventional NNs can be treated as a special case with only $\hat{A}_{ii} = 1$ and all other entries 0, so the edge bias has no effect on conventional NNs.
To evaluate the effectiveness of the edge bias, NSV compares whether the current edges in $E$ have a significantly lower probability of connecting nodes with different labels than the remaining (unconnected) node pairs. If NSV together with the p-value suggests that the edge bias is statistically beneficial, we can say that GNNs will obtain a performance gain from the edge bias; otherwise, the edge bias will have a negative effect on GNNs. The NTV, NSV, p-values, and the performance comparison of baseline models on 14 real-world datasets shown in Table 2 are consistent with our analysis.
4 Related Works
Smoothness (Homophily). The idea of node homophily and its measures are
mentioned in [26] and defined as follows,
$$H_{node}(G) = \frac{1}{|V|} \sum_{v \in V} \frac{\big|\{u \mid u \in N_v, Z_{u,:} = Z_{v,:}\}\big|}{d_v}$$
where $[a]_+ = \max(a, 0)$ and $h_k$ is the class-wise homophily metric. The above measures only consider the label consistency of connected nodes but ignore unconnected nodes. Stronger label consistency can potentially occur among unconnected nodes, in which case the edge bias is not necessarily beneficial for GNNs. Aggregation homophily [18,19] tries to capture the post-aggregation node similarity and has been shown to be better than the above homophily measures. However, it is not able to give a clear threshold value to determine when GNNs can outperform graph-agnostic NNs.
5 Conclusion
References
1. Ahmed, H.B., Dare, D., Boudraa, A.-O.: Graph signals classification using total
variation and graph energy informations. In: 2017 IEEE Global Conference on
Signal and Information Processing (GlobalSIP), pp. 667–671. IEEE (2017)
2. Battaglia, P.W., et al.: Relational inductive biases, deep learning, and graph net-
works. arXiv preprint arXiv:1806.01261 (2018)
3. Chen, S., Sandryhaila, A., Moura, J.M., Kovacevic, J.: Signal recovery on graphs:
variation minimization. IEEE Trans. Signal Process. 63(17), 4609–4624 (2015)
4. Chung, F.R.: Spectral Graph Theory, vol. 92. American Mathematical Soc. (1997)
5. Cong, W., Ramezani, M., Mahdavi, M.: On provable benefits of depth in train-
ing graph convolutional networks. Adv. Neural. Inf. Process. Syst. 34, 9936–9949
(2021)
6. Daković, M., Stanković, L., Sejdić, E.: Local smoothness of graph signals. Math.
Probl. Eng. 2019, 1–14 (2019)
7. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on
graphs with fast localized spectral filtering. Adv. Neural Inf. Process. Syst. 29
(2016)
8. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large
graphs. Adv. Neural Inf. Process. Syst. 30 (2017)
9. Hamilton, W.L.: Graph representation learning. Synth. Lect. Artif. Intell. Mach.
Learn. 14(3), 1–159 (2020)
10. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. In: International Conference on Learning Representations (2016)
11. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
12. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning
applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
13. Li, Q., Han, Z., Wu, X.-M.: Deeper insights into graph convolutional networks for
semi-supervised learning. Proc. AAAI Conf. Artif. Intell. 32 (2018)
14. Lim, D., et al.: Large scale learning on non-homophilous graphs: new benchmarks
and strong simple methods. Adv. Neural. Inf. Process. Syst. 34, 20887–20902
(2021)
15. Lim, D., Li, X., Hohne, F., Lim, S.-N.: New benchmarks for learning on non-
homophilous graphs. arXiv preprint arXiv:2104.01404 (2021)
16. Liu, M., Wang, Z., Ji, S.: Non-local graph neural networks. arXiv preprint
arXiv:2005.14612 (2020)
17. Luan, S.: On addressing the limitations of graph neural networks. arXiv preprint
arXiv:2306.12640 (2023)
18. Luan, S., et al.: Is heterophily a real nightmare for graph neural networks to do
node classification? arXiv preprint arXiv:2109.05641 (2021)
19. Luan, S., et al.: Revisiting heterophily for graph neural networks. Adv. Neural. Inf.
Process. Syst. 35, 1362–1375 (2022)
20. Luan, S., et al.: When do graph neural networks help with node classification:
investigating the homophily principle on node distinguishability. Adv. Neural Inf.
Process. Syst. 36 (2023)
21. Luan, S., Zhao, M., Chang, X.-W., Precup, D.: Break the ceiling: stronger multi-
scale deep graph convolutional networks. Adv. Neural Inf. Process. Syst. 32 (2019)
22. Luan, S., Zhao, M., Chang, X.-W., Precup, D.: Training matters: unlock-
ing potentials of deeper graph convolutional neural networks. arXiv preprint
arXiv:2008.08838 (2020)
23. Luan, S., Zhao, M., Hua, C., Chang, X.-W., Precup, D.: Complete the missing
half: augmenting aggregation filtering with diversification for graph convolutional
networks. In: NeurIPS 2022 Workshop: New Frontiers in Graph Learning (2022)
24. Maehara, T.: Revisiting graph neural networks: all we have is low-pass filters. arXiv
preprint arXiv:1905.09550 (2019)
25. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in
social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
26. Pei, H, Wei, B., Chang, K.C.-C., Lei, Y., Yang, B.: Geom-gcn: geometric graph
convolutional networks. In: International Conference on Learning Representations
(2020)
27. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph
attention networks. In: International Conference on Learning Representations
(2018)
28. Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., Weinberger, K.: Simplifying graph
convolutional networks. In: International Conference on Machine Learning, pp.
6861–6871. PMLR (2019)
29. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local
and global consistency. In: Advances in Neural Information Processing Systems,
pp. 321–328 (2004)
30. Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., Koutra, D.: Generalizing
graph neural networks beyond homophily. arXiv preprint arXiv:2006.11468 (2020)
Training Matters: Unlocking Potentials
of Deeper Graph Convolutional Neural
Networks
1 Introduction
2 Preliminaries
We use bold fonts for vectors (e.g., v), block vectors (e.g., V), and matrix blocks (e.g., $V_i$). Suppose we have an undirected connected graph $G = (V, E)$ without a bipartite component, where $V$ is the node set with $|V| = N$ and $E$ is the edge set with $|E| = E$. Let $A \in \mathbb{R}^{N \times N}$ be the adjacency matrix of $G$, i.e., $A_{ij} = 1$ for $e_{ij} \in E$ and $A_{ij} = 0$ otherwise. The graph Laplacian is defined as $L = D - A$, where $D$ is a diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. The symmetric normalized Laplacian is defined as $L_{sym} = I - D^{-1/2} A D^{-1/2}$ with eigenvalues $\lambda(L_{sym}) \in [0, 2)$, and its renormalized version is defined as
$$\tilde{L}_{sym} = I - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \quad \tilde{A} = A + I, \quad \tilde{D} = \mathrm{diag}(\tilde{D}_{ii}), \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \quad (1)$$
with $\lambda(\hat{A}) = 1 - \lambda(\tilde{L}_{sym}) \in (-1, 1]$, and it is used in GCN [16] as follows
The energy-preserving property means the operator does not change the energy intensity in the frequency domain after being applied to graph signals. The strict inequality holds for any $x$ that is independent of $[\tilde{D}_{11}^{1/2}, \ldots, \tilde{D}_{NN}^{1/2}]^T$,
Through the forward analysis of the energy flow in a deep GCN, we can see that the energy of the column features should reduce in the top layers. But from Fig. 1(a)–(c), obtained from a numerical test with a 10-layer GCN, we can see that the energy of the column features in the top layers (Fig. 1(c)) does not change significantly, while in the bottom layers (Fig. 1(a), (b)) the energy of the features shrinks during training. The cause of this contradiction is that forward analyses either neglect [18] or place too strong assumptions [26,29] on the parameter matrices, while ignoring how the parameter matrices change during backpropagation. In the following, we perform a gradient analysis from the backward view and explain the energy loss in the bottom layers of deep GCNs.
where $1_{\mathbb{R}^+}(\cdot)$ and $\log(\cdot)$ are pointwise indicator and log functions; $\odot$ is the Hadamard product; $Z \in \mathbb{R}^{N \times C}$ is the ground-truth matrix with a one-hot label vector $Z_{i,:}$ in each row; $C$ is the number of classes; and $l$ is the scalar loss. Then the gradient propagates in the following way,
$$\text{Output layer:}\quad \frac{\partial l}{\partial Y} = \mathrm{softmax}(Y) - Z,\quad \frac{\partial l}{\partial W_n} = Y_n^T \hat{A}\,\frac{\partial l}{\partial Y},\quad \frac{\partial l}{\partial Y_n} = \hat{A}\,\frac{\partial l}{\partial Y}\, W_n^T$$
$$\text{Hidden layers:}\quad \frac{\partial l}{\partial W_{i-1}} = Y_{i-1}^T \hat{A}\left(\frac{\partial l}{\partial Y_i} \odot 1_{\mathbb{R}^+}(Y_i)\right),\quad \frac{\partial l}{\partial Y_{i-1}} = \hat{A}\left(\frac{\partial l}{\partial Y_i} \odot 1_{\mathbb{R}^+}(Y_i)\right) W_{i-1}^T \quad (4)$$
The gradient propagation of GCN differs from that of a multi-layer perceptron (MLP) by an extra multiplication by $\hat{A}$ when the gradient signal flows through $Y_i$. Since $\lambda_i(\hat{A}) \le 1$, this multiplication causes an energy loss of the gradient signal (see Fig. 2(c)). In addition, oversmoothing does not only happen in the feed-forward process, but also exists in backpropagation when we view $\frac{\partial l}{\partial Y_{i-1}} = \hat{A}\big(\frac{\partial l}{\partial Y_i} \odot 1_{\mathbb{R}^+}(Y_i)\big) W_{i-1}^T$ as a backward counterpart of the hidden layers in (3). In the forward view, the parameter matrix $W_i$ is fixed and we update $Y_i$; in the backward view, $Y_i$ is fixed and we update $W_i$. The difference is that in the forward view the input $X$ is a fixed feature matrix, while in the backward view the scale of the input $\frac{\partial l}{\partial Y} = \mathrm{softmax}(Y) - Z$ (the prediction error) gets smaller during training. Thus, the energy loss is more significant in $W_i$ (see Fig. 2(a)) from the backward view and is more serious in the bottom layers rather than in the top layers.
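The following toy sketch (not from the paper) illustrates the mechanism numerically: because $\lambda(\hat{A}) \in (-1, 1]$, each extra multiplication by $\hat{A}$ during backpropagation shrinks the energy of the gradient signal, except for its component along the $\lambda = 1$ direction. The random graph and the stand-in gradient vector are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random undirected toy graph with self-loops (renormalized adjacency).
N = 50
A = (rng.random((N, N)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T
A_tilde = A + np.eye(N)
d = A_tilde.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)   # λ(Â) ∈ (−1, 1]

# A signal standing in for the backpropagated error ∂l/∂Y.
g = rng.normal(size=N)

# Each extra layer multiplies the gradient by Â once more, so the energy that
# reaches bottom-layer weights shrinks with depth (except the λ = 1 component).
for depth in range(0, 11, 2):
    energy = np.linalg.norm(np.linalg.matrix_power(A_hat, depth) @ g) ** 2
    print(f"after {depth:2d} multiplications by Â: energy = {energy:8.3f}")
```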
This energy-losing phenomenon is not an expressive power problem but a training issue. However, this does not mean the training issue is the root cause of the performance limit problem, on which we will draw conclusions later. In the following section, we propose methods to alleviate the energy loss.
Fig. 2. Comparison of weight and gradient norm in hidden layers of GCN and TR-GCN (r = 1): the pairs have the same x- and y-ranges.
Note that the components of $x$, $\hat{A}x$, and $\hat{A}_r x$ in the direction $\hat{u}_i$ are $\hat{u}_i^T x$, $\hat{\lambda}_i \hat{u}_i^T x$, and $(\hat{\lambda}_i + r)\hat{u}_i^T x$, respectively. Thus, applying the operator $\hat{A}$ to $x$ scales the component of $x$ in the direction of $\hat{u}_i$ by $\hat{\lambda}_i$ for each $i$. Tuning the resolution parameter $r$ rescales those components such that global information (high smoothness) is amplified with a positive $r$, and local information (low smoothness) is enhanced with a negative $r$. The GCN with $\hat{A}_r$ is called topology-rescaling GCN (TR-GCN).
Note that $\lambda_i(\hat{A}) \in (-1, 1]$. A shift that makes $\max_i \lambda_i(\hat{A}) + r \ge 1$ is considered risky because it can cause gradient explosion and numerical instability during training, as stated in [16]. However, through our analysis, TR-GCN will not only overcome the difficulty of training a deep architecture (see Fig. 1(d)–(f) and Fig. 2(b), (d)) but also will not lose expressive power (see Table 1) when a proper $r$ is set (depending on the task and the size of the network).
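A minimal sketch of a TR-GCN layer is given below, assuming the rescaled operator is $\hat{A}_r = \hat{A} + rI$ (consistent with the eigenvalue shift $\hat{\lambda}_i + r$ described above); the layer sizes and the choice $r = 1$ are illustrative.

```python
import numpy as np

def tr_gcn_layer(A_hat, H, W, r=1.0):
    """One topology-rescaling GCN layer: ReLU((Â + r·I) H W).

    Assumes the rescaled operator is Â_r = Â + r·I, which shifts every
    eigenvalue λ̂_i to λ̂_i + r as described in the text (illustrative sketch).
    """
    N = A_hat.shape[0]
    A_r = A_hat + r * np.eye(N)
    return np.maximum(A_r @ H @ W, 0.0)

# Toy usage: 5 nodes, 4 input features, 3 hidden units.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(5)
d = A_tilde.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)

X = rng.normal(size=(5, 4))
W0 = rng.normal(size=(4, 3))
print(tr_gcn_layer(A_hat, X, W0, r=1.0).shape)   # (5, 3)
```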
The gradient propagation depends not only on $\hat{A}$ but also on the scale of $W_i$. An initialization with a proper scale would let $W_i$ receive an undiminished gradient from the start of training and move in the correct direction with a clearer signal [21]. Thus, we adjust the scale of each element in $W_i$, initialized as in [12], with a tunable constant $\lambda_{init}$ as follows,
$$\lambda_{init} \times U\!\left(-\frac{1}{F_{i+1}}, \frac{1}{F_{i+1}}\right) \quad \text{or} \quad \lambda_{init} \times N\!\left(0, \frac{2}{F_i + F_{i+1}}\right) \quad (6)$$
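A sketch of this rescaled initialization is shown below; the uniform range $1/F_{i+1}$ and the normal variance $2/(F_i + F_{i+1})$ follow the terms visible in Eq. (6), and the helper name and fan conventions are assumptions of this sketch.

```python
import numpy as np

def scaled_init(fan_in, fan_out, lam_init=1.0, mode="uniform", rng=None):
    """Glorot-style initialization rescaled by λ_init, following Eq. (6).

    The uniform range 1/fan_out and the normal variance 2/(fan_in + fan_out)
    mirror the terms visible in Eq. (6); treat them as an illustrative reading.
    """
    rng = rng or np.random.default_rng()
    if mode == "uniform":
        limit = 1.0 / fan_out
        W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    else:
        std = np.sqrt(2.0 / (fan_in + fan_out))
        W = rng.normal(0.0, std, size=(fan_in, fan_out))
    return lam_init * W

W = scaled_init(64, 16, lam_init=2.0, mode="normal")
print(W.shape, W.std())
```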
4.3 Normalization
Skip (residual) connections [15] are a widely used technique in training deep neural networks and have achieved success in feature extraction. They help with gradient propagation without introducing additional parameters. Skip connections, if adapted to GCNs, have the following general form:
where $\sigma$ is the activation function, $Y_i$ is the input of the $i$-th layer, and $Y_{i+1}$ is the output of the $i$-th layer as well as the input of the $(i+1)$-th layer.
It has been shown that existing GCN models are difficult to train when they are scaled to more than 7 layers deep. This is possibly due to the increase of the effective context size of each node and overfitting, as stated in [16]. One existing method, ResGCN [17], also seeks to address this problem via residual connections, but it actually uses a concatenation of the intermediate outputs of the hidden layers, introducing excessive parameters. The effectiveness shown in its experiments is therefore the result not only of the skip connections but also of the expressive power of the additional parameters. In our experiments, however, we show that residual connections alone can accomplish the task.
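Since the general form referenced above is not reproduced in this excerpt, the sketch below uses one common parameter-free instantiation, $Y_{i+1} = \sigma(\hat{A} Y_i W_i) + Y_i$, to show how a residual connection can be added to a GCN layer without extra parameters.

```python
import numpy as np

def gcn_residual_layer(A_hat, Y_i, W_i):
    """Y_{i+1} = ReLU(Â Y_i W_i) + Y_i — a parameter-free skip connection.

    Requires W_i to be square so the transformed output and the identity
    branch have matching shapes (an assumption of this sketch).
    """
    return np.maximum(A_hat @ Y_i @ W_i, 0.0) + Y_i

# Toy usage with the same width in and out.
rng = np.random.default_rng(1)
N, width = 6, 8
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(N)
d = A.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)

Y = rng.normal(size=(N, width))
W = rng.normal(size=(width, width)) * 0.1
print(gcn_residual_layer(A_hat, Y, W).shape)   # (6, 8)
```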
5 Experiments
This section is crucial to the paper's main hypothesis: can we boost the performance of GNNs by just training them better? For this purpose, we patch the most popular baseline, GCN, with the ideas from the previous section to form a set of detailed comparative tests, and we fix the architecture to be 10 layers deep throughout the entire section.¹ In particular, we select the node classification tasks on Cora, CiteSeer, and PubMed, the three most popular datasets. We use the classic training setting, identical to the one suggested in [35]. The section features two sets of experiments: the first validates the effectiveness of the proposed methods in lowering the training difficulty, while the second demonstrates the potential performance boost when the patched methods are fine-tuned. For all experiments, we used the Adam optimizer and ReLU as the activation function (PyTorch implementation).
Rather than demonstrating how good the performance of the patched methods could possibly be, the first set of experiments focuses on validating the effectiveness of the proposed ideas in lowering the training difficulty through a detailed ablation study. We also investigate the potential loss of generalization ability, i.e., whether these ideas lead to overfitting.
For a fair comparison, we use the same base architecture for the baseline and all patched methods: 10 GCN layers, each with width 16. We also use the same set of basic hyperparameters: a learning rate of 0.001, a weight decay of 5 × 10−4, and no dropout. We train all methods to the same extent by using the same training procedure: each method in each run is trained until the validation loss has not improved for 200 epochs.
With this setup, we run each method on the Cora dataset with the public split (20 training samples per class) for 20 independent runs and obtain the final reported results.
¹ The source code will be submitted within the supplementary materials for blind review and open-sourced afterwards.
Columns of the results table: Train Loss, Train Acc, Test Loss, Test Acc (each reported as mean and std), followed by the applied changes (resolution, skip, weight norm, energy norm, weight init, weight const) grouped under Change L, Change All, and Change W, b. Each row represents a method. The first four columns are featured with color indicators: the greener the better result, the redder the worse. The changes applied to the baseline are highlighted in the later columns; the first row has no colored changes and is therefore the baseline. Different highlight colors indicate the change on the operators: blue for the changes on the graph operator L, red for the changes on W and b, and purple (blue + red) for the changes applied on all of L, W, and b.
From the results on the training set, we observe a significantly smaller training loss (more than 50% lower) and a significantly higher training accuracy (more than 6 times higher) for the patched methods compared with the original baseline. Considering that all of the compared methods have exactly the same parameter composition, we can safely say that the proposed methods are indeed effective in lowering the training difficulty. However, we cannot conclude from the results which single idea contributes most to alleviating the training difficulty.
Comparing the results on the test set, we can see that the test error and accuracy rule out the overfitting argument: in general, all losses and accuracies on the test set improve significantly. With all the observations in this set of experiments, the validation of the hypothesis is complete: we can make GNNs perform better by training them better.
In the second set of experiments, we fine-tune each method (including the baseline) and compare their best reported performance. This shows how much potential can be unlocked by better training procedures. The fine-tuning is conducted with Bayesian optimization [31] to the same extent.² Each result reported in Table 2 is averaged over 20 independent runs and reported together with the standard deviation.³
From the results in the table, we observe that the patched methods obtain a statistically significant performance boost. Therefore, together with the observations from the previous set of experiments, we conclude that the proposed methods can indeed alleviate the performance limit problem by lowering the training difficulty.
6 Conclusion
In this paper, we verify the hypothesis that the cause of the performance limit problem of deep GCNs is more likely the training difficulty rather than insufficient capability. Based on the analyses of signal energy, we address the problem by proposing several methodologies that seek to ease the training process. The contribution enables lightweight GCN architectures to gain better performance when stacked deeper.
Though the proposed methods show effectiveness in lowering the training loss and improving the performance in practice, they introduce additional hyperparameters that require tuning. In future work, we will investigate the
hyperparameters that require tuning. In future works, we would investigate the
possibilities of a learnable resolution (self-loop) in the graph operator that is
optimized end-to-end together with the system, essentially turning meta-learning
2
All methods are fixed 10-layer deep. Methods share the same search range for the
base hyperparameters (learning rate in [10−6 , 10−1 ], weight decay in [10−5 , 10−1 ],
width in {100, 200, . . . , 5000}, dropout in (0, 1)). The hyperparameters unique to the
patched methods are also fixed for each patched method (resolution in [−1, 5], weight
constant in [0.1, 5], weight normalization coefficient in [1, 15], energy normalization
coefficient in [25, 2500]). The search stops if the performance is not improved for 64
candidates.
3
GCN is reproduced and performed fine-tuning upon.
Unlocking Potentials of Deeper Graph Convolutional Neural Networks 59
the self-loop that guides the representation learning on graphs. Also, we would
like to seek other possible theoretically-inspired approaches to alleviate training
difficulties.
References
1. Arenas, A., Fernandez, A., Gomez, S.: Analysis of the structure of complex net-
works at different resolution levels. New J. Phys. 10(5), 053039 (2008)
2. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geomet-
ric deep learning: going beyond euclidean data. arXiv preprint arXiv:1611.08097
(2016)
3. Chen, J., Ma, T., Xiao, C.: Fastgcn: fast learning with graph convolutional networks
via importance sampling. arXiv preprint arXiv:1801.10247 (2018)
4. Chen, J, Zhu, J., Song, L.: Stochastic training of graph convolutional networks
with variance reduction. arXiv preprint arXiv:1710.10568 (2017)
5. Chung, F.R., Graham, F.C.: Spectral Graph Theory. vol. 92. American Mathemat-
ical Society (1997)
6. Daković, M., Stanković, L., Sejdić, E.: Local smoothness of graph signals. Math.
Probl. Eng. 2019 (2019)
7. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on
graphs with fast localized spectral filtering. arXiv preprint arXiv:1606.09375 (2016)
8. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc.
Natl. Acad. Sci. 104(1), 36–41 (2007)
9. Gavili, A., Zhang, X.-P.: On the shift operator, graph frequency, and optimal fil-
tering in graph signal processing. IEEE Trans. Signal Process. 65(23), 6303–6318
(2017)
10. Gilmer, J, Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message
passing for quantum chemistry. In: Proceedings of the 34th International Confer-
ence on Machine Learning, vol. 70, pp. 1263–1272 (2017). JMLR. org
11. Girault, B., Gonçalves, P., Fleury, É.: Translation on graphs: an isometric shift
operator. IEEE Signal Process. Lett. 22(12), 2416–2420 (2015)
12. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: Proceedings of the Thirteenth International Conference on
Artificial Intelligence and Statistics, pp. 249–256 (2010)
13. Good, B.H., De Montjoye, Y.-A., Clauset, A.: Performance of modularity maxi-
mization in practical contexts. Phys. Rev. E 81(4), 046106 (2010)
14. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large
graphs. arXiv preprint arXiv:1706.02216 (2017)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
16. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
17. Li, G, Müller, M., Thabet, A., Ghanem, B.: Can GCNs go as deep as CNNs? arXiv
preprint arXiv:1904.03751 (2019)
18. Li, Q., Han, Z., Wu, X.: Deeper Insights Into Graph Convolutional Networks for
Semi-Supervised Learning. arXiv preprint arXiv:1801.07606 (2018)
19. Liao, R., Zhao, Z., Urtasun, R., Zemel, R.S.: Lanczosnet: multi-scale deep graph
convolutional networks. arXiv preprint arXiv:1901.01484 (2019)
20. Lim, D., et al.: Large scale learning on non-homophilous graphs: new benchmarks
and strong simple methods. Adv. Neural. Inf. Process. Syst. 34, 20887–20902
(2021)
21. Luan, S.: On addressing the limitations of graph neural networks. arXiv preprint
arXiv:2306.12640 (2023)
22. Luan, S., Hua, C., Lu, Q., Zhu, J., Chang, X.-W., Precup, D.: When do we need
GNN for node classification? arXiv preprint arXiv:2210.16979 (2022)
23. Luan, S., et al.: Is heterophily a real nightmare for graph neural networks to do
node classification? arXiv preprint arXiv:2109.05641 (2021)
24. Luan, S., et al.: Revisiting heterophily for graph neural networks. Adv. Neural. Inf.
Process. Syst. 35, 1362–1375 (2022)
25. Luan, S., et al.: When do graph neural networks help with node classification:
Investigating the homophily principle on node distinguishability. Adv. Neural Inf.
Process. Syst. 36 (2023)
26. Luan, S., Zhao, M., Chang, X.-W., Precup, D.: Break the ceiling: stronger multi-
scale deep graph convolutional networks. Adv. Neural Inf. Process. Syst. 32 (2019)
27. Luan, S., Zhao, M., Hua, C., Chang, X.-W., Precup, D.: Complete the missing
half: augmenting aggregation filtering with diversification for graph convolutional
networks. In: NeurIPS 2022 Workshop: New Frontiers in Graph Learning (2022)
28. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geo-
metric deep learning on graphs and manifolds using mixture model CNNs. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp. 5115–5124 (2017)
29. Oono, K., Suzuki, T.: Graph neural networks exponentially lose expressive power
for node classification. arXiv preprint arXiv:1905.10947 (2019)
30. Salimans, T., Kingma, D.P.: Weight normalization: a simple reparameterization
to accelerate training of deep neural networks. Adv. Neural Inf. Process. Syst.
901–909 (2016)
31. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N.: Taking the
human out of the loop: a review of bayesian optimization. Proc. IEEE 104(1),
148–175 (2016)
32. Shuman, D.I., Narang, S.K., Frossard, P., Ortega, A., Vandergheynst, P.: The
emerging field of signal processing on graphs: extending high-dimensional data
analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053
(2012)
33. Stanković, L., Sejdić, E., Daković, M.: Reduced interference vertex-frequency dis-
tributions. IEEE Signal Process. Lett. 25(9), 1393–1397 (2018)
34. Xiang, J., et al.: Multi-resolution community detection based on generalized self-
loop rescaling strategy. Physica A 432, 127–139 (2015)
35. Yang, Z., Cohen, W.W., Salakhutdinov, R.: Revisiting semi-supervised learning
with graph embeddings. arXiv preprint arXiv:1603.08861 (2016)
36. Zhang, X.-S., et al.: Modularity optimization in community detection of complex
networks. EPL (Europhys. Lett.) 87(3), 38002 (2009)
E-MIGAN: Tackling Cold-Start
Challenges in Recommender Systems
1 Introduction
Hybrid models that combine multiple recommendation techniques have gained
popularity in recent years [1–7]. For instance, a hybrid model can combine collab-
orative and content-based filtering to address the cold-start problem. Addition-
ally, it can be extended to incorporate graph-based techniques such as GNNs.
Graph neural networks (GNNs) have become a new state-of-the-art approach for recommender systems. The central concept behind GNNs is an information propagation mechanism, i.e., iteratively aggregating feature information from neighbors in graphs. The neighborhood aggregation mechanism enables GNNs to model
the correlation among users, items, and related features [8]. Graph Convolu-
tional Matrix Completion (GCMC) [9] and Neural Graph Collaborative Filter-
ing (NGCF) [10] already address several issues in the world of recommendation.
However, they struggle with modeling higher-order feature interactions, relying
heavily on past user interactions for predictions. To address this limitation, we
developed a Mutual Interaction Graph Attention Network recommender model
[11] that allows mapping the original data to higher-order feature interactions.
It models the mutual influence relationship between users and items. Furthermore, MIGAN is designed to handle large-scale datasets efficiently. Despite this, it also has limitations, which motivates the currently proposed approach.
This work proposes an Enhanced Mutual Interaction Graph Attention Network (E-MIGAN) recommender model for the item-wise cold-start problem. Compared to related works, the main contributions of the E-MIGAN model can be described as follows:
2 Related Work
Otunba et al. [12] stack a Generalized Matrix Factorization (GMF) and MLP
ensemble to propagate the prediction from constituent models to the final out-
put. Bao et al. [13] combine component recommendation engines, which are considered wrappers for a set of smaller, concrete pre-trained models. The wrapper then follows a well-defined strategy to aggregate its predictions into a final one. Da Costa et al. [14] proposed three ensemble approaches based on multimodal
interactions. Unlike previous works with stacked recommenders, the stacking
content-based recommender that we develop in this work creates a profile model
for each user. Its main advantage is the ability of the embedding representation
to integrate the side information in the hybrid architecture.
Many works on GNN-based recommendation systems have been proposed in
the last few years. The most obvious explanation is that GNN techniques are
effective at learning representations for graph data in various domains [15,16],
and most of the data in recommendation has a graph structure. Graph Convolutional Networks (GCNs) are among the most popular GNN models. They operate through a series of message-passing steps between nodes in the graph. At each step, each node aggregates information from its neighbors, applies a neural network layer to the aggregated information, and then sends the transformed data to its neighbors. This process is repeated for a fixed number of steps or until convergence is achieved [17]. The Graph Attention Network (GAT) [18] uses a function called attention to selectively aggregate information from neighboring nodes in the graph. Unlike GCN, GAT can learn different weights for each neighboring node, allowing it to capture complex patterns in the graph structure. This makes GAT particularly useful for tasks where the relationships between nodes are highly non-linear and require a more fine-grained approach to modeling.
Fig. 1. The Architecture of the Content-Enhanced MIGAN. The content stacked rec-
ommender captures the side information to create a profile model for each user while
optimizing the stack’s learners’ objective function. The collaborative filtering MIGAN
recommender presents higher-order feature interactions.
The proposed framework’s first component is a graph neural network design for
modeling complex interactions in graph-structured data. MIGAN representation
is based on the Bipartite Graph Neural Networks (BGNN) [21] to model the
dependencies between the nodes on a large scale. This representation is fed to a co-attention neural network recommender. Here, the developed co-attention layer puts emphasis not only on learning the complex relationship between the target users (or items) and their neighbors but also on learning the most relevant weights that represent the users' mutual influence on the item. This idea is illustrated in Fig. 2.
The first recommender consists of two main operations: an attention network
module and a mutual interaction module. The attention network module learns
attention weights for each node and its neighbors. The mutual interaction module
computes a mutual interaction matrix for each node and its neighbors on each
item, which encodes the pairwise interactions between them. Then, MIGAN uses
the mutual interaction matrix to compute a weighted sum of the node features.
Applying a co-attention mechanism in the context of collaborative filter-
ing recommendation allows for discriminating the items that are interesting for
users even those with no previous interaction, through deducing higher attention
weights. The first embedding layers $e_u$ and $e_i$ capture latent features of users $p_u$ and items $q_i$. Long Short-Term Memory (LSTM) layers follow them to enable long-range learning. Each LSTM state takes two inputs: the current feature vector and the output vector $h_{t-1}$ from the previous state; its output vector is $h_t$. Each node embedding layer is chained with an LSTM layer, which allows propagating without modification, updating, or resetting states using simple learned gating functions. The LSTM representation is expressed as follows:
The interactive attention model uses a tangent function to model the mutual
interactions between users and items. Afterward, we compute the probability
Fig. 2. Interaction between users and items with particular characteristics can reveal
the possibility that an item is interesting for similar users. In this example, we can see
that both user 1 and user 2 have watched the same movies, Avengers and Aquaman,
which means that they have similar tastes. So if user 2 watched another movie, for
example, John Wick, then there is a high probability that user 1 will like the same
movie, so the MIGAN recommender recommends it to him. It is a first-order inter-
action. Moreover, one can deduce a mutual influence based on interactive attention
weights at more than a first-order interaction level. For example, user 3 influences user
1, as user 1 shares similar preferences with user 2, generating a recommendation of the
Titanic movie based on user 3 preferences.
distribution over the embedding space. The softmax function is used to generate
the attention weights:
$$\alpha_u = \mathrm{Softmax}(f(\alpha_p^*)) \quad (4)$$
$$\alpha_i = \mathrm{Softmax}(f(\alpha_q^*)) \quad (5)$$
where $f$ is a multi-layer neural network.
Then, the high-order interaction latent space of users and items is given by:
where $f$ is a dense layer using a sigmoid activation function. Finally, we train the model to minimize the loss function, which is the Mean Absolute Error (MAE):
$$L(R_{ui}, \hat{R}_{ui}) = \frac{1}{|C|} \sum_{(u,i) \in C} \big| R_{ui} - \hat{R}_{ui} \big| \quad (8)$$
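A minimal numpy sketch of Eqs. (4), (5), and (8) follows: a small dense map stands in for the multi-layer network $f$, softmax produces the attention weights, and the MAE is averaged over the observed (user, item) pairs in $C$. All tensors and shapes are illustrative stand-ins rather than the actual MIGAN modules.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Toy mutual-interaction scores for users and items (stand-ins for α_p*, α_q*).
alpha_p = rng.normal(size=(4, 16))   # 4 users
alpha_q = rng.normal(size=(5, 16))   # 5 items
W_f = rng.normal(size=(16, 16))      # f: reduced to a single dense map for brevity

alpha_u = softmax(alpha_p @ W_f)     # Eq. (4)
alpha_i = softmax(alpha_q @ W_f)     # Eq. (5)

# MAE loss over the set C of observed (user, item) ratings, Eq. (8).
def mae_loss(R, R_hat, C):
    return sum(abs(R[u, i] - R_hat[u, i]) for u, i in C) / len(C)

R = rng.integers(1, 6, size=(4, 5)).astype(float)
R_hat = R + rng.normal(scale=0.5, size=R.shape)
C = [(0, 1), (1, 2), (2, 0), (3, 4)]
print(round(mae_loss(R, R_hat, C), 3))
```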
Each s ∈ S is trained separately with the same training dataset. Each model
provides predictions for the outcomes (R), which are then cast into a meta-
learner (blender). In other words, the S predictions of each regressor become
features for the blender. The latter can be any model, such as linear regression, SVR, a decision tree, etc.
$$f_{blender}(x) = f_{STK}(S_1(x), S_2(x), \ldots, S_s(x)) \quad (9)$$
where the meta-learner learns the weight vector $w$. A blender model can then be defined and tuned with its hyperparameters $\theta_{blender}$. It is then trained on the outputs of the stack $S$, learning the mapping between the outcome of the stacked predictors and the final ground-truth ratings. The expression of the final prediction is as follows:
$$\hat{R}_{BL} = \phi(f_{blender}(x), f_{STK}(S_1(x), S_2(x), \ldots, S_s(x))) \quad (10)$$
Once the two recommenders needed for the task at hand are trained, we apply
an aggregation function to merge their outputs into a single utility matrix. The
Enhanced-MIGAN framework is summarized in Algorithm 2. It uses the simple
unweighted average aggregation function followed by a fully connected layer.
The final predicted utility matrix $\hat{P}_{ui}$ is as follows:
$$\hat{P}_{ui} = f_{agg}(\hat{R}_{ui}, \hat{R}_{BL}) \quad (11)$$
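The stacking-and-aggregation pipeline of Eqs. (9)–(11) can be sketched as follows with scikit-learn regressors; the base learners, the linear blender, and the simulated collaborative scores $R_{ui}$ are illustrative, and the fully connected layer that follows the unweighted average in the full framework is omitted here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy (user, item) feature rows x with observed ratings r.
X = rng.normal(size=(200, 10))
r = X[:, 0] * 2.0 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Stack S of base regressors, each trained on the same training data (Eq. 9).
stack = [SVR(), DecisionTreeRegressor(max_depth=4), LinearRegression()]
for s in stack:
    s.fit(X, r)
stack_preds = np.column_stack([s.predict(X) for s in stack])

# Blender (meta-learner) trained on the stacked predictions (Eq. 10).
blender = LinearRegression().fit(stack_preds, r)
R_BL = blender.predict(stack_preds)

# Final utility scores: unweighted average of the collaborative scores R_ui
# (simulated here) and the blended content-based scores (Eq. 11).
R_ui = r + rng.normal(scale=0.2, size=200)
P_ui = (R_ui + R_BL) / 2.0
print(P_ui[:5])
```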
Table 2. The best hyperparameters for the MIGAN recommender system. Results are evaluated based on MAP and NDCG metrics.
5 Conclusion
References
1. Thorat, P.B., Goudar, R.M., Barve, S.: Survey on collaborative filtering, content-
based filtering and hybrid recommendation system. Int. J. Comput. Appl. 110(4),
31–36 (2015)
2. Lucas, J.P., Luz, N., Moreno, M.N., Anacleto, R., Almeida Figueiredo, A., Martins,
C.: A hybrid recommendation approach for a tourism system. Expert Syst. Appl.
40(9), 3532–3550 (2013)
3. Nguyen, L.V., Nguyen, T.-H., Jung, J.J., Camacho, D.: Extending collaborative fil-
tering recommendation using word embedding: a hybrid approach. Concurr. Com-
put. Pract. Exp. 35(16), e6232 (2023)
4. Drif, A., Guembour, S., Cherifi, H.: A sentiment enhanced deep collaborative fil-
tering recommender system. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E.,
Rocha, L.M., Sales-Pardo, M. (eds.) COMPLEX NETWORKS 2020 2020. SCI,
vol. 944, pp. 66–78. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-
65351-4 6
5. Lasfar, A., Mouline, S., Aboutajdine, D., Cherifi, H.: Content-based retrieval in
fractal coded image databases. In: Proceedings 15th International Conference on
Pattern Recognition, ICPR-2000, vol. 1, pp. 1031–1034. IEEE (2000)
6. Drif, A., Eddine Zerrad, H., Cherifi, H.: Context-awareness in ensemble recom-
mender system framework. In: 2021 International Conference on Electrical, Com-
munication, and Computer Engineering (ICECCE), pp. 1–6. IEEE (2021)
7. Ahlem, D.R.I.F., Saadeddine, S., Hocine, C.: An interactive attention network with
stacked ensemble machine learning models for recommendations. In: Optimization
and Machine Learning: Optimization for Machine Learning and Machine Learning
for Optimization, pp. 119–150 (2022)
8. Mai, P., Pang, Y.: Vertical federated graph neural network for recommender sys-
tem. arXiv preprint arXiv:2303.05786 (2023)
9. Wu, Y., Liu, H., Yang, Y.: Graph convolutional matrix completion for bipartite
edge prediction. In: KDIR, pp. 49–58 (2018)
10. Wang, X., He, X., Wang, M., Feng, F., Chua, T.-S.:. Neural graph collaborative
filtering. In: Proceedings of the 42nd International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 165–174 (2019)
11. Drif, A., Cherifi, H.: MIGAN: mutual-interaction graph attention network for col-
laborative filtering. Entropy 24(8), 1084 (2022)
12. Otunba, R., Rufai, R.A., Lin, J.: Deep stacked ensemble recommender. In: Pro-
ceedings of the 31st International Conference on Scientific and Statistical Database
Management, pp. 197–201 (2019)
13. Bao, X., Bergman, L., Thompson, R.: Stacking recommendation engines with addi-
tional meta-features. In: Proceedings of the Third ACM Conference on Recom-
mender Systems, pp. 109–116 (2009)
14. Da Costa, A.F., Manzato, M.G.: Exploiting multimodal interactions in recom-
mender systems with ensemble algorithms. Inf. Syst. 56, 120–132 (2016)
15. Guo, Q., Qiu, X., Xue, X., Zhang, Z.: Syntax-guided text generation via graph
neural network. Sci. China Inf. Sci. 64, 1–10 (2021)
16. Zhou, J., et al.: Graph neural networks: a review of methods and applications. AI
Open 1, 57–81 (2020)
17. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
18. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph
attention networks. arXiv preprint arXiv:1710.10903 (2017)
19. Wang, X., He, X., Cao, Y., Liu, M., Chua, T.-S.: KGAT: knowledge graph attention
network for recommendation. In: Proceedings of the 25th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery & Data Mining, pp. 950–958 (2019)
20. He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., Wang, M.: LightGCN: simplifying
and powering graph convolution network for recommendation. In: Proceedings of
the 43rd International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 639–648 (2020)
21. He, C., et al.: Cascade-BGNN: toward efficient self-supervised representation learn-
ing on large-scale bipartite graphs. arXiv preprint arXiv:1906.11994 (2019)
22. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word repre-
sentation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pp. 1532–1543 (2014)
23. Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM
Trans. Interact. Intel. Syst. (TIIS) 5(4), 1–19 (2015)
24. Drif, A., Zerrad, H.E., Cherifi, H.: EnsVAE: ensemble variational autoencoders for
recommendations. IEEE Access 8, 188335–188351 (2020)
25. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.-S.: Neural collaborative
filtering. In: Proceedings of the 26th International Conference on World Wide
Web, pp. 173–182 (2017)
Heterophily-Based Graph Neural
Network for Imbalanced Classification
Zirui Liang¹, Yuntao Li¹, Tianjin Huang¹, Akrati Saxena², Yulong Pei¹(B), and Mykola Pechenizkiy¹
¹ Eindhoven University of Technology, Eindhoven, The Netherlands
{y.pei.1,m.pechenizkiy}@tue.nl
² Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
a.saxena@liacs.leidenuniv.nl
1 Introduction
GNNs have gained popularity for their accuracy in handling graph data. How-
ever, their accuracy, like other deep learning models, is highly dependent on data
quality. One major challenge is class imbalance, where some classes have far fewer
examples than others. This can lead to biased classification results, favoring the
majority class while neglecting the minority classes [5]. The issue of imbalanced
datasets commonly arises in classification and recognition tasks, where accurate
classification of minority classes is critical. Graph imbalance classification has
real-world applications, like identifying spammers in social networks [18] and
detecting fraud in financial networks [8]. In these cases, abnormal nodes are
rare, making graph imbalance classification very challenging. Finding effective
solutions to this problem is valuable for both research and practical applications.
The class-imbalanced problem has been extensively studied in machine learning and deep learning, as evidenced by prior research [6]. However, these methods
may not effectively handle imbalanced graph data due to the interconnected
nature of nodes within graphs. Graph nodes are characterized not only by their
own properties but also by the properties of their neighboring nodes, introducing
non-i.i.d. (independent and identically distributed) characteristics. Recent stud-
ies on graph imbalance classification have focused on data augmentation tech-
niques, such as GraphSMOTE [19] and GraphENS [12]. However, our observa-
tions indicate that class imbalance in graphs is often accompanied by heterophilic
connections of minority nodes, where minority nodes have more connections with
nodes of diverse labels than the majority class nodes. This finding suggests that
traditional techniques may be insufficient in the presence of heterophily.
To address this challenge, we propose incorporating a graph heterophily han-
dling strategy into graph imbalanced classification. Our approach builds upon
the bi-kernel design of GBK-GNN [2] to capture both homophily and heterophily
within the graph. Additionally, we introduce a class imbalance-aware loss func-
tion, such as logit adjusted loss, to appropriately reweight minority and majority
nodes. The complexity of GBK-GNN makes training computationally challeng-
ing. To overcome this, we propose an efficient version of the GBK-GNN that
achieves both efficacy and efficiency in training.
Our main contributions are as follows: (1) We provide comprehensive insights
into the imbalance classification problem in graphs from the perspective of graph
heterophily and investigate the relationship between class imbalance and het-
erophily. (2) We present a novel framework that integrates graph heterophily
and class-imbalance handling based on the insights and its fast implementation
that significantly reduces training time. (3) We conduct extensive experiments
on various real-world graphs to validate the effectiveness and efficiency of our
proposed framework in addressing imbalanced classification on graphs.
2 Related Work
Imbalanced Classification. Efforts to counter class imbalance in classification
entail developing unbiased classifiers that account for label distribution in train-
ing data. Existing strategies fall into three categories: loss modification, post-hoc
correction, and re-sampling techniques. Loss modification adjusts the objective
function by assigning greater weights [5] to minority classes. Post-hoc correction
methods [11] adapt logits during inference to rectify underrepresented minority
class predictions. Re-sampling employs techniques, such as sampling strategies
[13] or data generation [1], to augment minority class data. The widely uti-
lized Synthetic Minority Over-sampling Technique (SMOTE) [1] generates new
instances by merging minority class data with nearest neighbors.
To tackle class imbalance in graph-based classification, diverse approaches
harness graph structural information to mitigate the challenge. GraphSMOTE
[19] synthesizes minor nodes by interpolating existing minority nodes, with con-
nectivity guided by a pretrained edge predictor. The Topology-Aware Margin
(TAM) loss [16] considers each node’s local topology by comparing its connec-
tivity pattern to the class-averaged counterpart. When nearby nodes in the tar-
get class are denser, the margin for that class decreases. This change enhances
learning adaptability and effectiveness through comparison. GraphENS [12] is
another technique that generates an entire ego network for the minor class by
amalgamating distinct ego networks based on similarity. These methods effec-
tively combat class imbalance in graph-based classification, leveraging graph
structures and introducing inventive augmentation techniques.
3 Motivation
Node classification on graphs, such as one performed by the Graph Convolutional
Network (GCN), differs fundamentally from non-graph tasks due to the intercon-
nectivity of nodes. In imbalanced class distributions, minority nodes may have a
higher proportion of heterophilic edges in their local neighborhoods, which can
negatively impact classification performance.
To investigate the relationship between homophily and different classes, espe-
cially minorities, we conducted a small analysis on four datasets: Cora, CiteSeer,
Wiki, and Coauthor CS (details about datasets can be found in Sect. 5.1). Our
analysis involves computing the average homophily ratios and calculating node
numbers across different categories. In particular, the average homophily ratio
for nodes with label y is defined as:
$$h(y, G_V) = \frac{1}{|V_y|} \sum_{i \in V_y} \frac{|N_i^s|}{|N_i|} \quad (1)$$
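A sketch of this per-class homophily computation is given below; it assumes $N_i^s$ denotes the neighbors of node $i$ sharing $i$'s label (the formal definition is not shown in this excerpt), and the toy graph is illustrative.

```python
import numpy as np

def class_homophily_ratios(A, y):
    """Average homophily ratio h(y, G_V) per class, following Eq. (1).

    Assumes N_i^s is the set of neighbors of node i that share i's label
    (an assumption of this sketch).
    """
    ratios = {}
    for label in np.unique(y):
        nodes = np.where(y == label)[0]
        per_node = []
        for i in nodes:
            neigh = np.where(A[i] > 0)[0]
            if len(neigh) == 0:
                continue
            per_node.append(np.mean(y[neigh] == y[i]))   # |N_i^s| / |N_i|
        ratios[int(label)] = float(np.mean(per_node)) if per_node else float("nan")
    return ratios

# Toy graph: the minority class (label 1) is wired mostly to majority nodes,
# so its average homophily ratio comes out lower than the majority's.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y = np.array([0, 0, 0, 1, 1])
print(class_homophily_ratios(A, y))
```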
Fig. 1. Category distributions (left) and average homophily ratios (right) of Cora.
4 Methodology
In this section, we present our solution to address class imbalance that incorpo-
rates heterophily handling and imbalance handling components (Sect. 4.1). We
also propose a fast version that effectively reduces training time (Sect. 4.2). The
main objective of our model is to minimize the loss of minority classes while
ensuring accurate information exchange during the message-passing process.
4.1 Im-GBK
Heterophily Handling. We build our model on GBK-GNN [2], which performs well on classification over graphs but cannot handle class imbalance. GBK-GNN is designed to address the lack of distinguishability
in GNN, which stems primarily from the incapability to adjust weights adap-
tively for various node types based on their distinct homophily properties. As
a consequence, a bi-kernel feature transformation method is employed to cap-
ture either homophily or heterophily information. In this work, we, therefore,
introduce a learnable kernel-based selection gate that aims to distinguish if a
pair of nodes are similar or not and then selectively choose appropriate ker-
nels, i.e., homophily or heterophily kernel. The formal expression for the input
transformation is presented below.
$$h_i^{(l)} = \sigma\Big( W_f\, h_i^{(l-1)} + \frac{1}{|N(v_i)|} \sum_{v_j \in N(v_i)} \big[ \alpha_{ij}\, W_s\, h_j^{(l-1)} + (1 - \alpha_{ij})\, W_d\, h_j^{(l-1)} \big] \Big) \quad (2)$$
$$\alpha_{ij} = \mathrm{Sigmoid}\big( W_g \big[ h_i^{(l-1)}, h_j^{(l-1)} \big] \big) \quad (3)$$
$$\mathcal{L} = \mathcal{L}_0 + \lambda \sum_{l}^{L} \mathcal{L}_g^{(l)} \quad (4)$$
where $W_s$ and $W_d$ are the kernels for homophilic and heterophilic edges, respectively. The value of $\alpha_{ij}$ is determined by $W_g$ and the embeddings of nodes $i$ and $j$. The loss function consists of two parts: $\mathcal{L}_0$, a cross-entropy loss for node classification, and $\mathcal{L}_g^{(l)}$, a label consistency-based cross-entropy that discriminates, at each layer $l$, whether the labels of a pair of nodes are consistent, thereby guiding the training of the selection gate. A hyper-parameter $\lambda$ balances the two losses. The original GBK-GNN does not explicitly address class imbalance, which leaves the model biased toward the majority class. Our method adds class-imbalance awareness to the GBK-GNN design and thereby mitigates this bias.
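To make the gated bi-kernel aggregation of Eqs. (2)-(3) concrete, the following is a minimal sketch assuming a dense adjacency matrix; the layer names, the choice of ReLU for the outer nonlinearity and the toy inputs are illustrative, and the original GBK-GNN implementation [2] may differ in details.

```python
import torch
import torch.nn as nn

class BiKernelLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_f = nn.Linear(d_in, d_out, bias=False)   # self transform W_f
        self.w_s = nn.Linear(d_in, d_out, bias=False)   # homophily kernel W_s
        self.w_d = nn.Linear(d_in, d_out, bias=False)   # heterophily kernel W_d
        self.gate = nn.Linear(2 * d_in, 1)              # selection gate W_g

    def forward(self, h, adj):
        n = h.size(0)
        # alpha_ij = Sigmoid(W_g [h_i, h_j]) for every node pair, masked by edges
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        alpha = torch.sigmoid(self.gate(pair)).squeeze(-1) * adj
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        # Eq. (2): gated mix of the two kernels, averaged over the neighbourhood
        msg = alpha @ self.w_s(h) + (adj - alpha) @ self.w_d(h)
        return torch.relu(self.w_f(h) + msg / deg), alpha

h = torch.randn(5, 8)                       # toy node features
adj = (torch.rand(5, 5) > 0.5).float()      # toy adjacency
out, alpha = BiKernelLayer(8, 16)(h, adj)
print(out.shape, alpha.shape)               # torch.Size([5, 16]) torch.Size([5, 5])
```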
Class-Imbalance Handling. Under an imbalanced label distribution, the model tends to optimize for overall accuracy by prioritizing the majority classes while performing poorly on the minority classes. This issue can be addressed by adjusting the logits (i.e., the inputs to the softmax function) for each class to be inversely proportional to their frequencies in the training data, which effectively reduces the weight of the majority classes and increases the weight of the minority classes [11]. In this study, we calculate the logit-adjusted loss as follows:
where πy is the estimate of the class prior. In this approach, a label-dependent off-
set is added to each logit, which differs from the standard softmax cross-entropy
approach. Additionally, the class prior offset is enforced during the learning of
the logits rather than being applied post-hoc, as in other methods.
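The sketch below illustrates the logit adjustment described above, in the spirit of [11]: a label-dependent offset derived from the class prior is added to every logit before the softmax cross-entropy. The offset scale tau and the prior estimated from label counts are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, class_counts, tau=1.0):
    prior = class_counts / class_counts.sum()   # estimate of the class prior pi_y
    adjusted = logits + tau * prior.log()       # label-dependent offset on logits
    return F.cross_entropy(adjusted, targets)

logits = torch.randn(6, 3)
targets = torch.tensor([0, 0, 0, 0, 1, 2])      # imbalanced toy batch
counts = torch.tensor([100.0, 10.0, 5.0])       # training label counts (toy)
print(logit_adjusted_loss(logits, targets, counts))
```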
Loss Function. We design our loss function to combine these two components
that handle heterophily and class imbalance in the proposed Im-GBK model.
The learning objective of our model consists of (i) reducing the weight of the
majority classes and increasing the weight of the minority classes in the training
data, and (ii) improving the model’s ability to select the ideal gate. To achieve
this objective, we incorporate two loss components into the loss function:
$$\mathcal{L} = \mathcal{L}_{im} + \lambda \sum_{l}^{L} \mathcal{L}_g^{(l)} \quad (7)$$
where $\mathcal{L}_g^{(l)}$ guides the selection gate in choosing between homophily and heterophily. The hyper-parameter $\lambda$ balances the two losses in the overall loss function.
where the hyper-parameter serves as the minimum similarity threshold and $\mathcal{L}_{im}$ is the aforementioned Class-Imbalance Handling loss.
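Putting the two components together, the following is a sketch of the combined objective of Eq. (7); the gate loss shown is our reading of the label consistency-based cross-entropy (a binary cross-entropy between gate values and label agreement on edges), and the exact per-layer formulation in the original may differ.

```python
import torch
import torch.nn.functional as F

def gate_consistency_loss(alpha, labels, adj):
    """BCE between gate values alpha_ij and label agreement, on edges only."""
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    mask = adj > 0
    return F.binary_cross_entropy(alpha[mask], same[mask])

def im_gbk_loss(l_im, alphas, labels, adj, lam=0.5):
    """Eq. (7): class-imbalance loss plus lambda * sum of per-layer gate losses."""
    l_g = sum(gate_consistency_loss(a, labels, adj) for a in alphas)
    return l_im + lam * l_g

adj = torch.tensor([[0, 1, 1, 0], [1, 0, 0, 1],
                    [1, 0, 0, 1], [0, 1, 1, 0]], dtype=torch.float)
labels = torch.tensor([0, 0, 1, 1])
alphas = [torch.rand(4, 4)]                  # toy gate outputs of a single layer
print(im_gbk_loss(torch.tensor(0.9), alphas, labels, adj))
```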
5 Experiments
We address four questions to enhance our understanding of the model:
– The experiment demonstrates that Im-GBK models, which use logit-adjusted loss and balanced softmax (denoted as Im-GBK (LogitAdj) and Im-GBK (BLSM), respectively), are effective in handling imbalanced node classification.
Overall, the experiment demonstrates that the proposed methods are effective
in imbalanced node classification and offer better or comparable performance to
state-of-the-art. In extreme cases, our approaches outperform and show a better
capability in differentiating minority classes.
Subsection 5.3 showed that the proposed method exhibits clear advantages in
performance compared to other baselines in differentiating minority classes for
extremely imbalanced graphs. To further investigate the fundamental factors
underlying the performance improvements of our proposed method in Im-GBK,
we conduct ablation analyses using one of the extreme cases, CiteSeer Extreme.
We show the effectiveness of the model in handling class-imbalanced classification by ablating its Class-Imbalance Handler and Heterophily Handler, respectively. In Table 4, ‘Class-Imbalance Handling Loss’ represents the two Class-
Imbalance Handling losses introduced in Sect. 4.1. ‘Heterophily handling’ refers
to the method introduced in Sect. 4.1 to capture graph heterophily, and ‘×’
means this part is ablated. Considering all strategies, it can be observed from
Table 4 that either dropping ‘Class-Imbalance Handling Loss’ or ‘Heterophily
handling’ components will result in a decrease in performance.
6 Conclusion
In this paper, we studied the problem of imbalanced classification on graphs from the perspective of graph heterophily. We observed that if a model cannot handle heterophilic neighborhoods in graphs, its ability to address imbalanced classification will be impaired. To address the graph imbalance problem effectively, we proposed a novel framework, Im-GBK, and its faster version, Fast Im-GBK, that simultaneously tackle heterophily and class imbalance. Our framework overcomes the limitations of previous techniques by achieving higher efficiency while maintaining comparable performance. Extensive experiments are conducted on various real-world datasets, demonstrating the effectiveness and efficiency of the proposed framework.
Table 5. Average Execution Time (s) per epoch on CS.
                    Time
GCN                 0.0166
GAT                 0.1419
GraphSage           0.0135
GraphSMOTE          5.309
Fast Im-GBK         0.5897
GBK-GNN             12.320
Im-GBK (LogitAdj)   11.594
Im-GBK (BLSM)       11.271
References
1. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
2. Du, L., et al.: GBK-GNN: gated bi-kernel graph neural networks for modeling both
homophily and heterophily. In: Proceedings of the ACM Web Conference 2022, pp.
1550–1558 (2022)
3. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch geometric.
arXiv preprint arXiv:1903.02428 (2019)
4. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large
graphs (2018)
5. Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell.
Data Anal. 6(5), 429–449 (2002)
6. Johnson, J.M., Khoshgoftaar, T.M.: Survey on deep learning with class imbalance.
J. Big Data 6(1), 1–54 (2019)
7. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks (2017)
8. Liu, Y., et al.: Pick and choose: a GNN-based imbalanced learning approach for
fraud detection. In: Proceedings of the Web Conference 2021, pp. 3168–3177 (2021)
9. Liu, Y., Zheng, Y., Zhang, D., Chen, H., Peng, H., Pan, S.: Towards unsupervised
deep graph structure learning. In: Proceedings of the ACM Web Conference 2022,
pp. 1392–1403 (2022)
10. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in
social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
11. Menon, A.K., Jayasumana, S., Rawat, A.S., Jain, H., Veit, A., Kumar, S.: Long-tail
learning via logit adjustment (2021)
12. Park, J., Song, J., Yang, E.: GraphENS: neighbor-aware ego network synthesis
for class-imbalanced node classification. In: International Conference on Learning
Representations (2021)
13. Ren, J., et al.: Balanced meta-softmax for long-tailed visual recognition (2020)
14. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collec-
tive classification in network data. AI Mag. 29(3), 93–93 (2008)
15. Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural
network evaluation. arXiv preprint arXiv:1811.05868 (2018)
16. Song, J., Park, J., Yang, E.: TAM: topology-aware margin loss for class-imbalanced
node classification. In: International Conference on Machine Learning, pp. 20,369–
20,383. PMLR (2022)
17. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph
attention networks (2018)
18. Wu, Y., Lian, D., Xu, Y., Wu, L., Chen, E.: Graph convolutional networks with
Markov random field reasoning for social spammer detection. In: Proceedings of
the AAAI Conference on Artificial Intelligence, vol. 34, pp. 1054–1061 (2020)
19. Zhao, T., Zhang, X., Wang, S.: GraphSMOTE: imbalanced node classification on
graphs with graph neural networks. In: Proceedings of the 14th ACM International
Conference on Web Search and Data Mining, pp. 833–841 (2021)
20. Zheng, X., Liu, Y., Pan, S., Zhang, M., Jin, D., Yu, P.S.: Graph neural networks
for graphs with heterophily: a survey. arXiv preprint arXiv:2202.07082 (2022)
21. Zhu, J., et al.: Graph neural networks with heterophily. In: Proceedings of the
AAAI Conference on Artificial Intelligence, vol. 35, pp. 11,168–11,176 (2021)
22. Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., Koutra, D.: Beyond homophily
in graph neural networks: current limitations and effective designs. Adv. Neural.
Inf. Process. Syst. 33, 7793–7804 (2020)
TimeGNN: Temporal Dynamic Graph
Learning for Time Series Forecasting
1 Introduction
From financial investment and market analysis [6] to traffic [21], electricity man-
agement, healthcare [4], and climate science, accurately predicting the future real
values of series based on available historical records forms a coveted task over
time in various scientific and industrial fields. There are a wide variety of meth-
ods employed for time series forecasting, ranging from statistical [2] to recent deep
learning approaches [22]. However, there are several major challenges present.
Real-world time series data are often subject to noisy and irregular observations,
missing values, repeated patterns of variable periodicities and very long-term
dependencies. While the time series are supposed to represent continuous phenom-
ena, the data is usually collected using sensors. Thus, observations are determined
by a sampling rate with potential information loss. On the other hand, standard
sequential neural networks, such as recurrent (RNNs) [27] and convolutional net-
works (CNNs) [20], are discrete and assume regular spacing between observations.
Several continuous analogues of such architectures that implicitly handle the time
information have been proposed to address irregularly sampled missing data [26].
The variable periodicities and long-term dependencies present in the data make
models prone to shape and temporal distortions, overfitting and poor local min-
ima while training with standard loss functions (e. g., MSE). Variants of DTW
and MSE have been proposed to mitigate these phenomena and can increase the
forecasting quality of deep neural networks [16,19].
A novel perspective for boosting the robustness of neural networks for com-
plex time series is to extract representative embeddings for patterns after trans-
forming them to another representation domain, such as the spectral one. Spec-
tral approaches have seen much use in the text domain. Graph-based text mining
(i. e., Graph-of-Words) [25] can be used for capturing the relationships between
the terms and building document-level representations. It is natural, then, that
such approaches might be suitable for more general sequence modeling. Capital-
izing on the recent success of graph neural networks (GNNs) on graph structured
data, a new family of algorithms jointly learns a correlation graph between inter-
related time series while simultaneously performing forecasting [3,29,32]. The
nodes in the learnable graph structure represent each individual time series and
the links between them express their temporal similarities. However, since such
methods rely on series-to-series correlations, they do not explicitly represent the
inter-series temporal dynamics evolution. Some preliminary studies have pro-
posed simple computational methods for mapping time series to temporal graphs
where each node corresponds to a time step, such as the visibility graph [17] and
the recurrence network [7].
In this paper, we propose a novel neural network, TimeGNN, that extends
these previous approaches by jointly learning dynamic temporal graphs for time
series forecasting on raw data. TimeGNN (i) extracts temporal embeddings
from sliding windows of the input series using dilated convolutions of differ-
ent receptive sizes, (ii) constructs a learnable graph structure, which is forward
and directed, based on the similarity of the embedding vectors in each window
in a differentiable way, (iii) applies standard GNN architectures to learn embed-
dings for each node and produces forecasts based on the representation vector of
the last time step. We evaluate the proposed architecture on various real-world
datasets and compare it against several deep learning benchmarks, including
graph-based approaches. Our results indicate that TimeGNN is significantly less
costly in both inference and training while achieving comparable forecasting per-
formance. The code implementation for this paper is available at https://github.
com/xun468/Time-GNN.
2 Related Work
Time Series Forecasting Models. Time series forecasting has been a long-
studied challenge in several application domains. In terms of statistical methods,
linear models including the autoregressive integrated moving average (ARIMA)
[2] and its multivariate extension, the vector autoregressive model (VAR) [10]
constitute the most dominant approaches. The need for capturing non-linear
patterns and overcoming the strong assumptions for statistical methods, e. g.,
the stationarity assumption, has led to the application of deep neural networks.
Graph Neural Networks. Over the past few years, graph neural networks
(GNNs) have been applied with great success to machine learning problems on
graphs in various fields, including chemistry for drug screening [14] and biology
for predicting the functions of proteins modeled as graphs [9]. The field of GNNs
has been largely dominated by the so-called message passing neural networks
(MPNNs) [8], where each node updates its feature vector by aggregating the
feature vectors of its neighbors. In the case of time series data on arbitrary
known graphs, e. g., in traffic forecasting, several architectures that combine
sequential models with GNNs have been proposed [21,28,33,34].
Fig. 1. The proposed TimeGNN framework for graph learning from raw time series and forecasting based on embeddings learned on the parameterized graph structures.
3 Method
Let $\{X_{i,1:T}\}_{i=1}^{m}$ be a multivariate time series that consists of $m$ channels and has a length equal to $T$. Then, $X_t \in \mathbb{R}^m$ represents the observed values at time step $t$. Let also $\mathcal{G}$ denote the set of temporal dynamic graph structures that we want to infer.
Given the observed values of $\tau$ previous time steps of the time series, i.e., $X_{t-\tau}, \ldots, X_{t-1}$, the goal is to forecast the next $h$ time steps (e.g., $h = 1$ for 1-step forecasting), i.e., $\hat{X}_t, \hat{X}_{t+1}, \ldots, \hat{X}_{t+h-1}$. These values can be obtained by the forecasting model $F$ with parameters $\Phi$ and the graphs $\mathcal{G}$ as follows:
where $*$ denotes the convolution operator, $C_0^{1,1}, C_1^{1,1}, C_2^{1,1}$ convolutional kernels of size 1 and dilation rate 1, $C_2^{3,3}$ a convolutional kernel of size 3 and dilation rate 3, $C_2^{5,5}$ a convolutional kernel of size 5 and dilation rate 5, and $b_0^{1}, b_1^{1}, b_2^{1}, b_2^{3}, b_2^{5}$ the corresponding bias terms.
The final representations per window $k$ are obtained using a fully connected layer on the concatenated features $f_0^k, f_1^k, f_2^k$, i.e., $z^k = \mathrm{FC}(f_0^k \,\|\, f_1^k \,\|\, f_2^k)$, such that $z^k \in \mathbb{R}^{\tau \times d}$. In the next sections, we refer to each time step of the hidden representation of the feature extraction module in each window $k$ as $z_i^k$, $\forall i \in \{1, \ldots, \tau\}$.
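A minimal sketch of the feature extraction module described above is given below: parallel 1-D convolutions with kernel size/dilation pairs (1,1), (3,3) and (5,5), whose outputs are concatenated and mapped to d features per time step. The exact number of branches, channel widths and padding are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WindowFeatureExtractor(nn.Module):
    def __init__(self, m, d):
        super().__init__()
        # padding is chosen so every branch preserves the window length tau
        self.branch1 = nn.Conv1d(m, d, kernel_size=1, dilation=1)
        self.branch3 = nn.Conv1d(m, d, kernel_size=3, dilation=3, padding=3)
        self.branch5 = nn.Conv1d(m, d, kernel_size=5, dilation=5, padding=10)
        self.fc = nn.Linear(3 * d, d)

    def forward(self, x):                 # x: (batch, m channels, tau steps)
        f = torch.cat([self.branch1(x),
                       self.branch3(x),
                       self.branch5(x)], dim=1)          # concatenated features
        return self.fc(f.transpose(1, 2))                # (batch, tau, d) = z^k

x = torch.randn(2, 7, 24)                 # 2 windows, m=7 series, tau=24 steps
print(WindowFeatureExtractor(7, 16)(x).shape)            # torch.Size([2, 24, 16])
```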
$$A_{ij}^{k} = \sigma\Big(\big(\log\big(\theta_{ij}^{k} / (1 - \theta_{ij}^{k})\big) + (g_{i,j}^{1} - g_{i,j}^{2})\big) / s\Big), \qquad g_{i,j}^{1}, g_{i,j}^{2} \sim \mathrm{Gumbel}(0, 1), \ \forall i, j \quad (3)$$
where $g_{i,j}^{1}, g_{i,j}^{2}$ are vectors of i.i.d. samples drawn from the Gumbel distribution, $\sigma$ is the sigmoid activation and $s$ is a parameter that controls the smoothness of samples, so that the distribution converges to categorical values as $s \to 0$.
The link predictor takes each pair of extracted features $(z_i^k, z_j^k)$ of window $k$ and maps their similarity to a $\theta_{ij}^{k} \in [0, 1]$ by applying fully connected layers. Then the Gumbel reparameterization trick is used to approximate a sigmoid activation function while retaining differentiability:
$$\theta_{ij}^{k} = \sigma\big(\mathrm{FC}(\mathrm{FC}(z_i^k \,\|\, z_j^k))\big) \quad (4)$$
In order to obtain directed and forward (i. e., no look-back in previous time steps
in the history) graph structures G we only learn the upper triangular part of the
adjacency matrices.
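The following sketch combines the link predictor of Eq. (4) with the Gumbel reparameterisation of Eq. (3) and keeps only the upper-triangular part of the sampled adjacency, as described above; the MLP sizes and smoothness value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GumbelLinkPredictor(nn.Module):
    def __init__(self, d, s=0.3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))
        self.s = s                                    # smoothness parameter

    def forward(self, z):                             # z: (tau, d) for one window
        tau = z.size(0)
        pair = torch.cat([z.unsqueeze(1).expand(tau, tau, -1),
                          z.unsqueeze(0).expand(tau, tau, -1)], dim=-1)
        theta = torch.sigmoid(self.mlp(pair)).squeeze(-1).clamp(1e-6, 1 - 1e-6)
        g1, g2 = (-torch.empty(2, tau, tau).exponential_().log())  # Gumbel(0,1)
        logits = torch.log(theta / (1 - theta)) + (g1 - g2)        # Eq. (3)
        adj = torch.sigmoid(logits / self.s)
        return torch.triu(adj, diagonal=1)            # forward, directed graph

z = torch.randn(24, 16)                               # embeddings of one window
print(GumbelLinkPredictor(16)(z).shape)               # torch.Size([24, 24])
```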
Once the collection G of learnable graph structures per sliding window k are
sampled, standard GNN architectures can be applied for capturing the node-to-
node relations, i. e., the temporal graph dynamics. GraphSAGE [11] was chosen
as the basic building GNN block of the node embedding learning architecture
as it can effectively generalize across different graphs with the same attributes.
GraphSAGE is an inductive framework that exploits node feature information
and generates node embeddings (i. e., hu for node u) via a learnable function,
by sampling and aggregating features from a node’s local neighborhood (i. e.,
N (u)).
Let $(V^k, E^k)$ correspond to the set of nodes and edges of the learnable graph structure for each $\mathcal{G}^k$. The node embedding update process, for each of the $p \in \{1, \ldots, P\}$ aggregation steps, employs the mean-based aggregator, namely convolutional, by calculating the element-wise mean of the vectors in $\{h_{u'}^{p-1}, \forall u' \in N(u)\}$, such that:
$$h_u^{p} \leftarrow \sigma\big(W \cdot \mathrm{MEAN}(\{h_u^{p-1}\} \cup \{h_{u'}^{p-1}, \forall u' \in N(u)\})\big) \quad (5)$$
where $W$ denotes trainable weights. The final normalized representation (i.e., $\tilde{h}_u^{p}$) of the last node (i.e., time step) in each forward and directed graph, denoted as $z_{u_T} = \tilde{h}_{u_T}^{p}$, is passed to the output module. The output module consists of two fully connected layers which reduce the vector into the final output dimension, so as to correspond to the forecasts $\hat{X}_t, \hat{X}_{t+1}, \ldots, \hat{X}_{t+h-1}$. Figure 1 demonstrates the feature extraction, graph learning, GNN and output modules of the proposed TimeGNN architecture.
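As a hedged illustration of Eq. (5), the sketch below applies the mean-based aggregator on a dense adjacency; the actual model uses GraphSAGE [11] as its GNN block, so this is illustrative only.

```python
import torch
import torch.nn as nn

class MeanSageLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = nn.Linear(d_in, d_out)

    def forward(self, h, adj):
        # element-wise mean over {h_u} union {h_u' : u' in N(u)}
        self_and_nbrs = adj + torch.eye(adj.size(0))
        mean = (self_and_nbrs @ h) / self_and_nbrs.sum(dim=1, keepdim=True)
        return torch.sigmoid(self.w(mean))

h = torch.randn(24, 16)                        # one node per time step
adj = torch.triu(torch.rand(24, 24), diagonal=1).round()   # toy forward graph
print(MeanSageLayer(16, 32)(h, adj).shape)     # torch.Size([24, 32])
```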
To train the parameters of Eq. (1) for the time series point forecasting task,
we use the mean absolute error loss (MAE). Let X̂(i) , i ∈ {1, ..., K} denote the
predicted vector values for K samples, then the MAE loss is defined as:
$$\mathcal{L} = \frac{1}{K} \sum_{i=1}^{K} \big| \hat{X}^{(i)} - X^{(i)} \big|$$
The optimized weights for the feature extraction, graph structure learning,
GNN and output modules are selected based on the minimum validation loss dur-
ing training, which is evaluated as described in the experimental setup (Sect. 4.3).
4 Experimental Evaluation
We next describe the experimental setup, including the datasets and baselines
used for comparisons. We also demonstrate and analyze the results obtained by
the proposed TimeGNN architecture and the baseline models.
4.1 Datasets
This work was evaluated on the following multivariate time series datasets:
Fig. 2. Computation costs of TimeGNN, TimeMTGNN and baseline models. (a) The
inference and epoch training time per epoch between datasets. (b) The inference and
epoch times with varying window sizes on the weather dataset
4.2 Baselines
We consider five baseline models for comparison with our TimeGNN proposed
architecture. We chose two graph-based methods, MTGNN [32] and GTS [29],
and three non graph-based methods, LSTNet [18], LSTM [12], and TCN [1].
Also, we evaluate the performance of TimeMTGNN, a variant of MTGNN that
includes our proposed graph learning module. LSTM and TCN follow the size
of the hidden dimension and number of layers of TimeGNN. Those were fixed
to three layers with hidden dimensions of 32, 64 for the Exchange-Rate and
Weather datasets and 128 for Electricity, Solar-Energy and Traffic. In the case of MTGNN, GTS, and LSTNet, parameters were kept as close as possible to the ones mentioned in their experimental setups.
1 https://www.ncei.noaa.gov/data/local-climatological-data/.
2 https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014.
3 http://www.nrel.gov/grid/solar-power-data.html.
4 http://pems.dot.ca.gov.
4.4 Results
Scalability. We compare the inference and training times of the graph-based
models TimeGNN, MTGNN, GTS in Fig. 2. These figures also include record-
ings from the ablation study of the TimeMTGNN variant, which is described in
the relevant paragraph below. Figure 2(a) shows the computational costs on each
dataset. Among the baseline models, GTS is the most costly in both inference
and training time due to the use of the entire training dataset for graph con-
struction. In contrast, MTGNN learns static node features and is subsequently
more efficient. As the number of variables increases, there is a noticeable increase in inference time for MTGNN and GTS because their graph sizes also grow. TimeGNN's graph does not grow with the number of variables and, consequently, its inference time scales well across datasets. The training epoch times follow the same pattern as the inference times.
Since the size of the graphs used by TimeGNN is based on window size, the
cost of increasing the window size on the weather dataset is shown in Fig. 2(b).
As the window size increases, so does the cost of inference and training for all
models. As the graph learning modules for MTGNN and GTS do not inter-
act with the window size, the increase in cost can primarily be attributed to
their forecasting modules. MTGNN’s inference times do not increase as dramat-
ically as GTS’s, implying a more robust forecasting module. As the window size
increases, TimeGNN’s inference and training cost growth is slower than the other
methods and remains the fastest of the GNN methods. The time-based graph
learning module does not become overly cumbersome as window sizes increase.
Table 1. Forecasting performance for all multivariate datasets and baselines for dif-
ferent horizons h - best in bold, second best underlined.
Exchange-Rate
Metric LSTM TCN LSTN GTS MTGNN TimeGNN TimeMTGNN
h = 1 mse 0.328 ± 0.007 0.094 ± 0.118 0.004 ± 0.000 0.005 ± 0.001 0.006 ± 0.002 0.129 ± 0.012 0.004 ± 0.001
mae 0.475 ± 0.033 0.191 ± 0.163 0.033 ± 0.000 0.041 ± 0.004 0.048 ± 0.011 0.294 ± 0.029 0.034 ± 0.005
h = 3 mse 0.611 ± 0.001 0.063 ± 0.035 0.013 ± 0.003 0.009 ± 0.000 0.012 ± 0.000 0.368 ± 0.059 0.008 ± 0.001
mae 0.631 ± 0.031 0.190 ± 0.041 0.078 ± 0.012 0.063 ± 0.000 0.078 ± 0.000 0.501 ± 0.045 0.061 ± 0.003
h = 6 mse 0.877 ± 0.105 0.189 ± 0.221 0.033 ± 0.005 0.014 ± 0.001 0.024 ± 0.001 0.354 ± 0.031 0.019 ± 0.004
mae 0.775 ± 0.032 0.290 ± 0.214 0.139 ± 0.008 0.081 ± 0.005 0.111 ± 0.000 0.453 ± 0.052 0.099 ± 0.016
h = 9 mse 0.823 ± 0.118 0.123 ± 0.030 0.030 ± 0.006 0.020 ± 0.001 0.035 ± 0.003 0.453 ± 0.149 0.034 ± 0.002
mae 0.743 ± 0.080 0.277 ± 0.037 0.124 ± 0.011 0.096 ± 0.001 0.140 ± 0.008 0.543 ± 0.084 0.139 ± 0.010
Weather
Metric LSTM TCN LSTN GTS MTGNN TimeGNN TimeMTGNN
h = 1 mse 0.162 ± 0.001 0.176 ± 0.006 0.193 ± 0.001 0.209 ± 0.003 0.232 ± 0.008 0.178 ± 0.001 0.182 ± 0.003
mae 0.202 ± 0.003 0.220 ± 0.011 0.236 ± 0.002 0.213 ± 0.004 0.230 ± 0.002 0.185 ± 0.000 0.186 ± 0.000
h = 3 mse 0.221 ± 0.000 0.232 ± 0.003 0.233 ± 0.001 0.320 ± 0.005 0.263 ± 0.003 0.234 ± 0.001 0.234 ± 0.002
mae 0.265 ± 0.000 0.275 ± 0.000 0.285 ± 0.000 0.320 ± 0.001 0.273 ± 0.000 0.249 ± 0.001 0.251 ± 0.001
h = 6 mse 0.268 ± 0.004 0.274 ± 0.002 0.266 ± 0.001 0.374 ± 0.003 0.301 ± 0.003 0.287 ± 0.002 0.282 ± 0.007
mae 0.320 ± 0.004 0.323 ± 0.001 0.321 ± 0.000 0.388 ± 0.002 0.311 ± 0.002 0.297 ± 0.001 0.300 ± 0.003
h = 9 mse 0.292 ± 0.007 0.307 ± 0.009 0.288 ± 0.000 0.399 ± 0.002 0.329 ± 0.001 0.316 ± 0.001 0.311 ± 0.002
mae 0.342 ± 0.003 0.350 ± 0.005 0.345 ± 0.003 0.420 ± 0.004 0.339 ± 0.004 0.331 ± 0.001 0.331 ± 0.001
Electricity-Load
Metric LSTM TCN LSTN GTS MTGNN TimeGNN TimeMTGNN
h = 1 mse 0.226 ± 0.002 0.267 ± 0.001 0.064 ± 0.001 0.135 ± 0.002 0.046 ± 0.000 0.211 ± 0.003 0.047 ± 0.000
mae 0.323 ± 0.000 0.375 ± 0.002 0.167 ± 0.001 0.246 ± 0.001 0.131 ± 0.000 0.309 ± 0.001 0.135 ± 0.000
h = 3 mse 0.255 ± 0.001 0.329 ± 0.015 0.065 ± 0.001 0.303 ± 0.019 0.079 ± 0.001 0.179 ± 0.003 0.077 ± 0.000
mae 0.339 ± 0.000 0.406 ± 0.013 0.163 ± 0.002 0.388 ± 0.019 0.171 ± 0.000 0.320 ± 0.002 0.173 ± 0.000
h = 6 mse 0.253 ± 0.005 0.331 ± 0.010 0.125 ± 0.006 0.334 ± 0.000 0.097 ± 0.000 0.246 ± 0.004 0.104 ± 0.015
mae 0.340 ± 0.006 0.408 ± 0.009 0.238 ± 0.005 0.413 ± 0.000 0.189 ± 0.001 0.332 ± 0.004 0.200 ± 0.016
h = 9 mse 0.271 ± 0.009 0.349 ± 0.022 0.144 ± 0.013 0.289 ± 0.021 0.108 ± 0.002 0.258 ± 0.010 0.104 ± 0.001
mae 0.351 ± 0.003 0.410 ± 0.019 0.251 ± 0.013 0.368 ± 0.020 0.198 ± 0.002 0.344 ± 0.007 0.196 ± 0.001
Solar-Energy
Metric LSTM TCN LSTN GTS MTGNN TimeGNN TimeMTGNN
h = 1 mse 0.019 ± 0.000 0.012 ± 0.000 0.007 ± 0.000 0.012 ± 0.001 0.006 ± 0.000 0.022 ± 0.000 0.006 ± 0.000
mae 0.064 ± 0.000 0.055 ± 0.001 0.035 ± 0.000 0.046 ± 0.003 0.026 ± 0.000 0.059 ± 0.000 0.026 ± 0.000
h = 3 mse 0.031 ± 0.000 0.030 ± 0.001 0.026 ± 0.000 0.044 ± 0.001 0.022 ± 0.002 0.030 ± 0.000 0.022 ± 0.000
mae 0.086 ± 0.002 0.087 ± 0.004 0.080 ± 0.000 0.098 ± 0.003 0.058 ± 0.002 0.071 ± 0.000 0.058 ± 0.000
h = 6 mse 0.046 ± 0.001 0.050 ± 0.000 0.049 ± 0.004 0.103 ± 0.001 0.042 ± 0.000 0.044 ± 0.000 0.043 ± 0.002
mae 0.108 ± 0.005 0.121 ± 0.005 0.125 ± 0.013 0.163 ± 0.001 0.086 ± 0.001 0.090 ± 0.000 0.088 ± 0.004
h = 9 mse 0.067 ± 0.003 0.073 ± 0.001 0.068 ± 0.000 0.167 ± 0.003 0.055 ± 0.001 0.060 ± 0.002 0.060 ± 0.000
mae 0.138 ± 0.009 0.150 ± 0.005 0.154 ± 0.004 0.218 ± 0.006 0.101 ± 0.001 0.109 ± 0.001 0.110 ± 0.000
Traffic
Metric LSTM TCN LSTN GTS MTGNN TimeGNN TimeMTGNN
h = 1 mse 0.558 ± 0.007 0.594 ± 0.091 0.246 ± 0.002 0.520 ± 0.010 0.233 ± 0.003 0.567 ± 0.002 0.293 ± 0.026
mae 0.296 ± 0.005 0.352 ± 0.025 0.203 ± 0.002 0.319 ± 0.013 0.157 ± 0.002 0.281 ± 0.000 0.162 ± 0.001
h = 3 mse 0.595 ± 0.014 0.615 ± 0.002 0.447 ± 0.010 0.970 ± 0.027 0.438 ± 0.001 0.622 ± 0.006 0.465 ± 0.012
mae 0.318 ± 0.007 0.363 ± 0.003 0.286 ± 0.009 0.456 ± 0.010 0.205 ± 0.000 0.306 ± 0.002 0.218 ± 0.007
h = 6 mse 0.603 ± 0.001 0.680 ± 0.021 0.465 ± 0.005 0.938 ± 0.048 0.450 ± 0.009 0.623 ± 0.004 0.495 ± 0.012
mae 0.321 ± 0.003 0.403 ± 0.013 0.288 ± 0.002 0.461 ± 0.023 0.213 ± 0.003 0.311 ± 0.007 0.239 ± 0.001
h = 9 mse 0.614 ± 0.011 0.655 ± 0.017 0.467 ± 0.010 0.909 ± 0.024 0.471 ± 0.000 0.622 ± 0.002 0.494 ± 0.000
mae 0.329 ± 0.010 0.382 ± 0.014 0.290 ± 0.006 0.453 ± 0.016 0.220 ± 0.002 0.313 ± 0.002 0.236 ± 0.005
may give GTS an advantage over the other methods on this dataset. TimeGNN
however shows signs of overfitting during training and is unable to match the
other two GNNs. On the Weather dataset, the purely recurrent methods per-
form the best in MSE score across all horizons. TimeGNN is competitive with
the recurrent methods on these metrics and surpasses the recurrent models on
MAE. This suggests TimeGNN is producing more significant outlier predictions
than the recurrent methods and TimeGNN is the best performing GNN method.
On the larger Electricity-Load, Solar-Energy, and Traffic datasets, in general,
MTGNN is the top performer with LSTNet close behind. However, for larger
horizons, TimeGNN performs better than GTS and competitively with LSTNet
and the other recurrent models. This shows that time-domain graphs can suc-
cessfully capture long-term dependencies within a dataset although TimeGNN
struggles more with short-term predictions. This could also be attributed to the
simplicity of TimeGNN’s forecasting module compared to the other graph-based
approaches.
5 Conclusion
capture and learn the underlying properties of time series. Additionally, it is far
faster and more scalable than existing graph methods as both the number of
variables and the window size increase.
References
1. Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional
and recurrent networks for sequence modeling. arXiv:1803.01271 (2018)
2. Box, G.E., Jenkins, G.M., Reinsel, G.C., Ljung, G.M.: Time Series Analysis: Fore-
casting and Control. Wiley, New York (2015)
3. Cao, D., et al.: Spectral temporal graph neural network for multivariate time-series
forecasting. Adv. Neural. Inf. Process. Syst. 33, 17766–17778 (2020)
4. Chauhan, S., Vig, L.: Anomaly detection in ECG time signals via deep long short-
term memory networks. In: 2015 IEEE International Conference on Data Science
and Advanced Analytics (DSAA), pp. 1–7. IEEE (2015)
5. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
6. Ding, X., Zhang, Y., Liu, T., Duan, J.: Deep learning for event-driven stock pre-
diction. In: Twenty-Fourth International Joint Conference on Artificial Intelligence
(2015)
7. Donner, R.V., Zou, Y., Donges, J.F., Marwan, N., Kurths, J.: Recurrence networks-
a novel paradigm for nonlinear time series analysis. New J. Phys. 12(3), 033025
(2010)
8. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message
passing for quantum chemistry. In: International Conference on Machine Learning,
pp. 1263–1272. PMLR (2017)
9. Gligorijević, V., et al.: Structure-based protein function prediction using graph
convolutional networks. Nat. Commun. 12(1), 3168 (2021)
10. Hamilton, J.D.: Time Series Analysis. Princeton University Press, Princeton (2020)
11. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large
graphs. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8),
1735–1780 (1997)
13. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax.
arXiv preprint arXiv:1611.01144 (2016)
14. Kearnes, S., McCloskey, K., Berndl, M., Pande, V., Riley, P.: Molecular graph
convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des. 30, 595–
608 (2016)
15. Kipf, T., Fetaya, E., Wang, K.C., Welling, M., Zemel, R.: Neural relational infer-
ence for interacting systems. In: International Conference on Machine Learning,
pp. 2688–2697. PMLR (2018)
16. Kosma, C., Nikolentzos, G., Xu, N., Vazirgiannis, M.: Time series forecasting mod-
els copy the past: How to mitigate. In: Pimenidis, E., Angelov, P., Jayne, C., Papa-
leonidas, A., Aydin, M. (eds.) Artificial Neural Networks and Machine Learning–
ICANN 2022: 31st International Conference on Artificial Neural Networks, Bristol,
UK, 6–9 September 2022, Proceedings, Part I, vol. 13529, pp. 366–378. Springer,
Cham (2022). https://doi.org/10.1007/978-3-031-15919-0_31
17. Lacasa, L., Luque, B., Ballesteros, F., Luque, J., Nuno, J.C.: From time series to
complex networks: the visibility graph. Proc. Natl. Acad. Sci. 105(13), 4972–4975
(2008)
18. Lai, G., Chang, W.C., Yang, Y., Liu, H.: Modeling long-and short-term temporal
patterns with deep neural networks. In: The 41st International ACM SIGIR Con-
ference on Research & Development in Information Retrieval, pp. 95–104 (2018)
19. Le Guen, V., Thome, N.: Deep time series forecasting with shape and temporal
criteria. IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 342–355 (2022)
20. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
21. Li, Y., Yu, R., Shahabi, C., Liu, Y.: Diffusion convolutional recurrent neural net-
work: data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017)
22. Lim, B., Zohren, S.: Time-series forecasting with deep learning: a survey. Phil.
Trans. R. Soc. A 379(2194), 20200209 (2021)
23. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400
(2013)
24. Oreshkin, B.N., Carpov, D., Chapados, N., Bengio, Y.: N-beats: neural basis
expansion analysis for interpretable time series forecasting. arXiv preprint
arXiv:1905.10437 (2019)
25. Rousseau, F., Vazirgiannis, M.: Graph-of-word and TW-IDF: new approach to Ad
Hoc IR. In: Proceedings of the 22nd ACM International Conference on Information
& Knowledge Management, pp. 59–68 (2013)
26. Rubanova, Y., Chen, R.T., Duvenaud, D.K.: Latent ordinary differential equations
for irregularly-sampled time series. In: Advances in Neural Information Processing
Systems, vol. 32 (2019)
27. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323(6088), 533–536 (1986)
28. Seo, Y., Defferrard, M., Vandergheynst, P., Bresson, X.: Structured sequence mod-
eling with graph convolutional recurrent networks. In: Cheng, L., Leung, A.,
Ozawa, S. (eds.) Neural Information Processing: 25th International Conference,
ICONIP 2018, Siem Reap, Cambodia, 13–16 December 2018, Proceedings, Part
I 25, pp. 362–373. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-
04167-0_33
29. Shang, C., Chen, J., Bi, J.: Discrete graph structure learning for forecasting mul-
tiple time series. arXiv preprint arXiv:2101.06861 (2021)
30. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information
Processing Systems, vol. 30 (2017)
31. Wu, H., Xu, J., Wang, J., Long, M.: Autoformer: decomposition transformers with
auto-correlation for long-term series forecasting. Adv. Neural. Inf. Process. Syst.
34, 22419–22430 (2021)
32. Wu, Z., Pan, S., Long, G., Jiang, J., Chang, X., Zhang, C.: Connecting the dots:
multivariate time series forecasting with graph neural networks. In: Proceedings
of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp. 753–763 (2020)
33. Yu, B., Yin, H., Zhu, Z.: Spatio-temporal graph convolutional networks: a deep
learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017)
34. Zhao, L., et al.: T-GCN: a temporal graph convolutional network for traffic pre-
diction. IEEE Trans. Intell. Transp. Syst. 21(9), 3848–3858 (2019)
35. Zhou, H., et al.: Informer: beyond efficient transformer for long sequence time-
series forecasting. In: Proceedings of the AAAI Conference on Artificial Intelli-
gence, vol. 35, pp. 11106–11115 (2021)
UnboundAttack: Generating Unbounded
Adversarial Attacks to Graph Neural
Networks
1 Introduction
In recent years, Graph Neural Networks (GNNs) emerged as an effective app-
roach to learning powerful graph representations. These neural network-based
models, for instance Graph Convolution Networks (GCNs) [11], have shown to be
highly effective in a number of graph-based applications such as drug design [10].
However, recent literature has shown that these architectures can be attacked
by injecting small perturbations into the input [2,22]. These attacks, referred to
as adversarial attacks in the literature, are highly critical, and this vulnerability
has raised tremendous concerns about applying them in safety-critical applica-
tions such as financial and healthcare applications. For example, a malicious user
could exploit these limitations by adding some inaccurate information to social
networks. As a result, several studies focus on developing methods to mitigate
the possible perturbation effects in parallel to these attacks. The proposed meth-
ods include adversarial training [5], enhancing the robustness of an input GNN
through edge pruning [25], and recently proposing robustness certificates [15].
The currently available attacks are mainly based on applying small perturbations on either the structure or the node features of the graph [23,26]. Given
that most of the proposed defense strategies enhance the robustness of the classi-
fiers to small perturbations [9], they have shown some success in detecting these
attacks and in limiting their effect. Moreover, most existing approaches formu-
late the problem of generating adversarial attacks as a search or constrained
optimization problem. While the available constrained optimization tools are
easily applicable in continuous input domains (i. e., images), adapting them to
discrete domains such as graphs represents a significant challenge. Furthermore,
in contrast to images, changing the graph structure by adding/deleting an edge
may be infeasible and easily detectable in many settings. For instance, given a
molecular graph where the edges represent chemical bonds, by deleting/adding
an edge, the emerging graph may not represent a realistic molecule anymore.
To tackle the aforementioned limitations, in this paper, we introduce Unboun-
dAttack, a more general and realistic attack mechanism which creates new adver-
sarial examples from scratch instead of just applying perturbations to an input
graph. The approach capitalizes on recent advancements in the field of Gener-
ative Adversarial Networks (GANs) to generate a set of legitimate graphs that
share similar properties with the input graphs. These properties include degree
distribution, diameter and subgraph structures among others. This approach of
producing artificially generated graphs that do not emerge directly from input
samples and which can mislead a targeted victim model is known as unbounded
adversarial attacks. The term “unbounded” in this setting refers to the idea that
these attacks are not directly linked to a specific existing graph but rather to a
more general view of the dataset to be attacked. We validate in an experimental
setting that these attacks can actually mislead the victim classifier but not some
oracle function, thus presenting a major threat for real-world applications. The
proposed framework is general and can operate on top of any GNN. Our main
contributions are summarized as follows:
– We propose UnboundAttack, a generative framework for crafting adversarial attacks on pretrained GNNs from scratch. The proposed framework assumes no
knowledge about the underlying architecture of the attacked model and may
be applied to an ensemble of available models.
– We designed a realistic experimental setting using molecular data in which
our model is evaluated and we show its effectiveness and ability to generate
realistic and relevant attacks.
2 Related Work
Given the discrete nature of graphs, applying attack methods from other domains
is very challenging. Similarly to the image domain attacks, most available meth-
ods formulate the task as a search problem. The objective of the task is to find
the closest adversarial perturbation to a given input data point. This approach
has led to several proposed attack strategies. For example, Nettack [26] intro-
duced a targeted attack on both the graph structure and nodes features based
3 Preliminaries
Before continuing with our contribution, we begin by introducing the graph
classification problem and some key notation.
to generate new representations for the nodes. Specifically, GNNs update nodes' feature vectors by aggregating local neighborhood information. Suppose we have a GNN model that contains $T$ neighborhood aggregation layers. Let also $h_v^{(0)}$ denote the initial feature vector of node $v$, i.e., the row of matrix $X$ that corresponds to node $v$. At each iteration ($t > 0$), the hidden state $h_v^{(t)}$ of a node $v$ is updated as follows:
$$a_v^{(t)} = \mathrm{AGGREGATE}^{(t)}\big(\big\{ h_u^{(t-1)} : u \in N(v) \big\}\big); \qquad h_v^{(t)} = \mathrm{COMBINE}^{(t)}\big( h_v^{(t-1)}, a_v^{(t)} \big) \quad (1)$$
where AGGREGATE is a permutation invariant function that maps the feature vectors of the neighbors of a node $v$ to an aggregated vector. This aggregated vector is passed along with the previous representation of $v$ (i.e., $h_v^{(t-1)}$) to the COMBINE function which combines those two vectors and produces the new representation of $v$. After $T$ iterations of neighborhood aggregation, to produce a graph-level representation, GNNs apply a permutation invariant readout function, e.g., the sum or mean operator, to nodes' feature vectors as follows:
$$h_G = \mathrm{READOUT}\big(\big\{ h_v^{(T)} : v \in V \big\}\big) \quad (2)$$
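A minimal sketch of the message-passing scheme of Eqs. (1)-(2) is shown below, using sum aggregation, a linear COMBINE and mean READOUT purely for illustration; concrete GNN architectures instantiate these functions differently.

```python
import torch
import torch.nn as nn

class SimpleMPNN(nn.Module):
    def __init__(self, d, T=2):
        super().__init__()
        self.combine = nn.ModuleList([nn.Linear(2 * d, d) for _ in range(T)])

    def forward(self, x, adj):
        h = x
        for layer in self.combine:
            a = adj @ h                                       # AGGREGATE: sum over N(v)
            h = torch.relu(layer(torch.cat([h, a], dim=-1)))  # COMBINE
        return h.mean(dim=0)                                  # READOUT: graph embedding

x = torch.randn(6, 8)                                         # toy node features
adj = (torch.rand(6, 6) > 0.5).float()
print(SimpleMPNN(8)(x, adj).shape)                            # torch.Size([8])
```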
similar properties to the graphs of the training set (e.g., similar degree distribution, same motifs, etc.), and which can fool the classifier but not the oracle. We next define unbounded adversarial attacks.
$$\min_{g_\theta} \max_{d_\phi} \ \mathbb{E}_{x \sim p_{data}(x)}\big[\log d_\phi(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - d_\phi(g_\theta(z))\big)\big] \quad (3)$$
where $z$ refers to the $i$-th vector sampled from the normal distribution and given to the generator. Additionally, $\max_{i \neq c} (f(g_\theta(z)))_i$ refers to the maximum component (different from the $c$-th one) of the vector of predicted probabilities. At each training step, we evaluate all the graphs produced by the generator given the vectors sampled from the normal distribution $\mathcal{N}(0, I)$. By minimizing this quantity, we maximize the probabilities of the other classes (different from $c$), therefore reaching our adversarial target.
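Since the adversarial objective itself is not reproduced above, the following is only a hedged sketch of the quantity described in the text, assuming the generator term is the negative log of the largest non-target-class probability, so that minimising it pushes probability mass away from class c; in practice this term would be combined with the GAN value function of Eq. (3), and the paper's exact formulation may differ.

```python
import torch

def adversarial_term(victim_logits, c):
    """Negative log of max_{i != c} f(g_theta(z))_i over a batch of generated graphs."""
    probs = torch.softmax(victim_logits, dim=-1)            # f(g_theta(z))
    mask = torch.ones_like(probs, dtype=torch.bool)
    mask[:, c] = False                                       # drop the target class c
    runner_up = probs[mask].view(probs.size(0), -1).max(dim=-1).values
    return -torch.log(runner_up + 1e-12).mean()

victim_logits = torch.randn(4, 5)    # toy victim predictions on 4 generated graphs
print(adversarial_term(victim_logits, c=2))
```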
5 Experimental Evaluation
In this section, we investigate the ability of the UnboundAttack framework to
produce adversarial examples in a realistic experimental setting. We first describe
the experimental setup, and then report the results and provide examples of
generated graphs. More specifically, we address two main points: (Q1) Validity
of attacks and (Q2) Adversarial aspect of these attacks.
Table 2. Results from the MMD and other metrics (± standard deviation) of the
generated samples on the QM9 dataset for LogP metric.
Fig. 2. Examples of graphs from the QM9 dataset (left). Examples of generated attacks (right). These examples have succeeded in misleading the classifier (i.e., $o(G) \neq f(G)$).
6 Conclusion
This work explores a new perspective on adversarial attacks on GNNs. Instead
of performing perturbations on a graph by adding/removing edges or editing
the nodes’ feature vectors, we propose to learn a new graph from scratch using
graph generative models. The produced graph has similar semantics to those of
the graphs of the training set, and hence may be an effective tool to mislead
a victim model. The proposed approach does not assume any knowledge about
the architecture of the targeted model. Experiments show that the method per-
forms better or comparable to other methods in degrading the performance of
the victim model. This work can be extended to other graph settings such as
node classification and edge classification. Furthermore, we anticipate that the
proposed architecture may support the development of new defense strategies
that could limit the potential negative impact of adversarial attacks, enhancing
the ability to deploy GNNs in real practical settings.
References
1. Bhattad, A., Chong, M.J., Liang, K., Li, B., Forsyth, D.A.: Unrestricted adversar-
ial examples via semantic manipulation (2019). https://doi.org/10.48550/ARXIV.
1904.06347, https://arxiv.org/abs/1904.06347
2. Dai, H., et al.: Adversarial attack on graph structured data (2018). https://doi.
org/10.48550/ARXIV.1806.02371, https://arxiv.org/abs/1806.02371
3. Dai, H., et al.: Adversarial attack on graph structured data. In: Proceedings of the
35th International Conference on Machine Learning, pp. 1115–1124 (2018)
4. De Cao, N., Kipf, T.: MolGAN: an implicit generative model for small molec-
ular graphs (2018). https://doi.org/10.48550/ARXIV.1805.11973, https://arxiv.
org/abs/1805.11973
5. Feng, F., He, X., Tang, J., Chua, T.S.: Graph adversarial training: dynamically
regularizing based on graph structure (2019). https://doi.org/10.48550/ARXIV.
1902.08226, https://arxiv.org/abs/1902.08226
6. Goodfellow, I.J., et al.: Generative adversarial networks (2014). https://doi.org/
10.48550/ARXIV.1406.2661, https://arxiv.org/abs/1406.2661
7. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved
training of Wasserstein GANs (2017). https://doi.org/10.48550/ARXIV.1704.
00028, https://arxiv.org/abs/1704.00028
8. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax
(2016). https://doi.org/10.48550/ARXIV.1611.01144, https://arxiv.org/abs/1611.
01144
9. Jin, W., et al.: Adversarial attacks and defenses on graphs: a review, a tool and
empirical studies (2020). https://doi.org/10.48550/ARXIV.2003.00653, https://
arxiv.org/abs/2003.00653
10. Kearnes, S., McCloskey, K., Berndl, M., Pande, V., Riley, P.: Molecular graph
convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des. 30(8), 595–
608 (2016)
11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks (2016). https://doi.org/10.48550/ARXIV.1609.02907, https://arxiv.org/
abs/1609.02907
12. Kipf, T.N., Welling, M.: Variational graph auto-encoders (2016). https://doi.org/
10.48550/ARXIV.1611.07308, https://arxiv.org/abs/1611.07308
13. Mu, J., Wang, B., Li, Q., Sun, K., Xu, M., Liu, Z.: A hard label black-box adversar-
ial attack against graph neural networks (2021). https://doi.org/10.48550/ARXIV.
2108.09513, https://arxiv.org/abs/2108.09513
14. Ramakrishnan, R., Dral, P.O., Rupp, M., von Lilienfeld, O.A.: Quantum chemistry
structures and properties of 134 kilo molecules. Sci. Data 1, 140,022 (2014)
15. Schuchardt, J., Bojchevski, A., Gasteiger, J., Günnemann, S.: Collective robustness
certificates: exploiting interdependence in graph neural networks. In: International
Conference on Learning Representations (2021). https://openreview.net/forum?
id=ULQdiUTHe3y
16. Song, Y., Shu, R., Kushman, N., Ermon, S.: Constructing unrestricted adversarial
examples with generative models (2018). https://doi.org/10.48550/ARXIV.1805.
07894, https://arxiv.org/abs/1805.07894
17. Sun, Y., Wang, S., Tang, X., Hsieh, T.Y., Honavar, V.: Adversarial attacks on
graph neural networks via node injections: a hierarchical reinforcement learning
approach. In: Proceedings of The Web Conference 2020, WWW 2020, pp. 673–
683. Association for Computing Machinery, New York, NY, USA (2020). https://
doi.org/10.1145/3366423.3380149
18. Wan, X., Kenlay, H., Ru, B., Blaas, A., Osborne, M.A., Dong, X.: Adversarial
attacks on graph classification via Bayesian optimisation (2021). https://doi.org/
10.48550/ARXIV.2111.02842, https://arxiv.org/abs/2111.02842
19. Xu, K., et al.: Topology attack and defense for graph neural networks: an optimiza-
tion perspective (2019). https://doi.org/10.48550/ARXIV.1906.04214, https://
arxiv.org/abs/1906.04214
20. Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks?
In: 7th International Conference on Learning Representations (2019)
21. You, J., Ying, R., Ren, X., Hamilton, W.L., Leskovec, J.: GraphRNN: generating
realistic graphs with deep auto-regressive models (2018). https://doi.org/10.48550/
ARXIV.1802.08773, https://arxiv.org/abs/1802.08773
22. Zügner, D., Akbarnejad, A., Günnemann, S.: Adversarial attacks on neural net-
works for graph data. In: Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM (2018)
23. Zhan, H., Pei, X.: Black-box gradient attack on graph neural networks: deeper
insights in graph-based attack and defense. arXiv preprint arXiv:2104.15061 (2021)
24. Zhang, H., et al.: Projective ranking: a transferable evasion attack method on graph
neural networks. In: Proceedings of the 30th ACM International Conference on
Information & Knowledge Management, CIKM 2021, pp. 3617–3621. Association
for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/
3459637.3482161
25. Zhang, X., Zitnik, M.: GNNGuard: defending graph neural networks against adver-
sarial attacks (2020). https://doi.org/10.48550/ARXIV.2006.08149, https://arxiv.
org/abs/2006.08149
26. Zügner, D., Akbarnejad, A., Günnemann, S.: Adversarial attacks on neural net-
works for graph data. In: Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pp. 2847–2856 (2018)
27. Zügner, D., Günnemann, S.: Adversarial attacks on graph neural networks via meta
learning. In: 7th International Conference on Learning Representations (2019)
Uncertainty in GNN Learning
Evaluations: The Importance
of a Consistent Benchmark
for Community Detection
1 Introduction
GNNs are a popular neural network based approach for processing graph-
structured data due to their ability to combine two sources of information by
propagating and aggregating node feature encodings along the network’s con-
nectivity [14]. Nodes in a network can be grouped into communities based on
similarity in associated features and/or edge density [28]. Analysing the struc-
ture to find clusters, or communities, of nodes provides useful information for
real world problems such as misinformation detection [21], genomic feature dis-
covery [3], social network or research recommendation [38]. As an unsupervised
task, clusters of nodes are identified based on the latent patterns within the
dataset, rather than “ground-truth” labels. Assessing performance at the dis-
covery of unknown information is useful to applications where label access is
prohibited. Some applications of graphs deal with millions of nodes and there is
a low labelling rate with datasets that mimic realistic scenarios [11]. In addition,
clustering is particularly relevant for new applications where there is not yet associated ground truth.
However, there is no widely accepted or followed way of evaluating algorithms
that is done consistently across the field, despite benchmarks being widely con-
sidered as important. Biased benchmarks, or evaluation procedures, can mis-
lead the narrative of research which distorts understanding of the research field.
Inconclusive results that may be valuable for understanding or building upon
go unpublished, which wastes resources, money, time and energy that is spent
on training models. In fields where research findings inform policy decisions or
medical practices, publication bias can lead to decisions based on incomplete or
biased evidence, potentially causing harm or inefficiency. To accurately reflect the real-world capabilities of research, it would be beneficial to use a common framework for evaluating proposed methods.
The framework detailed herein is a motivation and justification for this posi-
tion. To demonstrate the need for this, we measure the difference between using
the default parameters given by the original implementations to those optimised
for under this framework. A metric is proposed for evaluating consistency of
algorithm rankings over different random seeds which quantifies the robustness
of results. This work will help guide practitioners to better model selection and
evaluation within the field of GNN community detection.
2 Related Work
There are various frameworks for assessing performance and the procedure
used for evaluation changes the performance of all algorithms [42]. Under con-
sistent conditions, it has been shown that simple models can perform better with
a thorough exploration of the hyperparameter space [29]. This can be because
performance is subject to random initialisations [6]. Lifting results from papers
without carrying out the same hyperparameter optimisation over all models is
not consistent and is a misleading benchmark. Biased selection of random seeds
that skew performance is not fair. Not training over the same number of epochs
or not implementing model selection based on the validation set results in unfair
comparisons with inaccurate conclusions about model effectiveness. Hence, until now there has been no sufficient empirical evaluation of GNN methods for community detection, which motivates the framework presented in this work.
3 Methodology
This section details the procedure for evaluation: the problem that it aims to solve; the hyperparameter optimisation and the resources allocated to this
investigation; the algorithms that are being tested; the metrics of performance
and datasets used. At the highest level, the framework coefficient calculation is
summarised by Algorithm 1.
as $G = (A, X)$, with the relational information of nodes modelled by the adjacency matrix $A \in \mathbb{R}^{N \times N}$. Given a set of nodes $V$ and a set of edges $E$, let $e_{i,j} = (v_i, v_j) \in E$ denote the edge that points from $v_j$ to $v_i$. The graph is considered weighted, so the adjacency matrix satisfies $0 < A_{i,j} \leq 1$ if $e_{i,j} \in E$ and $A_{i,j} = 0$ if $e_{i,j} \notin E$. Also given is a set of node features $X \in \mathbb{R}^{N \times d}$, where $d$ represents the number of different node attributes (or feature dimensions). The objective is to partition the graph $G$ into $k$ clusters such that nodes in each partition, or cluster, generally have similar structure and feature values. The only information typically given to the algorithms at training time is the number of clusters $k$ to partition the graph into. Hard clustering is assumed, where each community detection algorithm must assign each node a single community to which it belongs, such that $P \in \mathbb{R}^{N}$, and we evaluate the clusters associated with each node using the labels given with each dataset, such that $L \in \mathbb{R}^{N}$.
There are sweet spots of architecture combinations that are best for each dataset
[1] and the effects of not selecting hyperparameters (HPs) have been well doc-
umented. Choosing too wide of a HP interval or including uninformative HPs
in the search space can have an adverse effect on tuning outcomes in the given
budget [39]. Thus, a HPO is performed under feasible constraints in order to val-
idate the hypothesis that HPO affects the comparison of methods. It has been
shown that grid search is not suited for searching for HPs on a new dataset and
that Bayesian approaches perform better than random [1]. There are a variety
of Bayesian methods that can be used for hyperparameter selection. One such is
the Tree Parzen-Estimator (TPE) [2] that can retain the conditionality of vari-
ables [39] and has been shown to be a good estimator given limited resources
[40]. The multi-objective version of the TPE [23] is used to explore the multiple
Table 1. Resources allocated to an investigation; those detailed here are shared across all investigations. Algorithms that are designed to benefit from a small number of HPs should perform better, as they can search more of the space within the given budget. All models are trained with a single 2080 Ti GPU on a server with 12 GB of RAM and a 16-core Xeon CPU.
2 https://github.com/yueliu1999/Awesome-Deep-Graph-Clustering.
the following is a brief summary: Cora [19], CiteSeer [8], DBLP [30] are graphs
of academic publications from various sources with the features coming from
words in publications and connectivity from citations. AMAC and AMAP are
extracted from the Amazon co-purchase graph [10]. Texas, Wisc and Cornell are
extracted from web pages from computer science departments of various univer-
sities [4]. UAT, EAT and BAT contain airport activity data collected from the
National Civil Aviation Agency, Statistical Office of the European Union and
Bureau of Transportation Statistics [17].
Table 2. The datasets and associated statistics.
Datasets Nodes Edges Features Classes Average Clustering Coefficient Mean Closeness Centrality
amac [10] 13752 13752 80062 10 0.157 0.264
amap [10] 7650 7650 119081 8 0.404 0.242
bat [17] 131 131 1038 4 0.636 0.469
citeseer [8] 3327 3327 4552 6 0.141 0.045
cora [19] 2708 2708 5278 7 0.241 0.137
dblp [30] 4057 4057 3528 4 0.177 0.026
eat [17] 399 399 5994 4 0.539 0.441
uat [17] 1190 1190 13599 4 0.501 0.332
texas [4] 183 183 162 5 0.198 0.344
wisc [4] 251 251 257 5 0.208 0.32
cornell [4] 183 183 149 5 0.167 0.326
3.4 Models
a general graph diffusion [9]. Bootstrapped Graph Latents (BGRL) [31] uses
a self-supervised bootstrap procedure by maintaining two graph encoders; the
online one learns to predict the representations of the target encoder, which in
itself is updated by an exponential moving average of the online encoder. Self-
GNN [13] also uses this principle but applies augmentations of the feature space to train the network. Towards Unsupervised Deep Graph Structure Learning (SUBLIME) [18] uses an encoder with the bootstrapping principle applied over the feature space, as well as a contrastive scheme between the nearest neighbours.
Variational Graph AutoEncoder Reconstruction (VGAER) [26] reconstructs a
modularity distribution using a cross entropy based decoder from the encoding
of a VGAE [15].
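The bootstrapping idea shared by BGRL and SelfGNN can be illustrated with the exponential-moving-average (EMA) update of the target encoder. The sketch below uses plain PyTorch linear layers as stand-ins for the actual GNN encoders and omits the augmentations and predictor head.

```python
# A minimal sketch of the bootstrapped target-encoder update used by
# BGRL-style methods: the target encoder is an exponential moving average
# (EMA) of the online encoder. The encoders here are stand-ins (plain linear
# layers); the real methods use GNN encoders over (augmented) graphs.
import copy
import torch

online_encoder = torch.nn.Linear(32, 16)
target_encoder = copy.deepcopy(online_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False  # the target is never updated by gradients

@torch.no_grad()
def ema_update(online, target, tau=0.99):
    # target <- tau * target + (1 - tau) * online, parameter by parameter
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_((1.0 - tau) * p_online)

# After each optimisation step on the online encoder:
ema_update(online_encoder, target_encoder)
```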
The Framework Comparison Rank is the average rank when comparing perfor-
mance of the parameters found through hyperparameter optimisation versus the
default values. From Table 3 it can be seen that the Framework Comparison Rank indicates that the optimised hyperparameters perform better on average.
This validates the hypothesis that the hyperparameter optimisation significantly
impacts the evaluation of GNN based approaches to community detection. The
W Randomness Coefficient quantifies the consistency of rankings over the dif-
ferent random seeds tested on, averaged over the suite of tests in the framework.
With less deviation of prediction in the presence of randomness, an evaluation provides a more confident assessment of the best algorithm. A higher W value under the optimised hyperparameters indicates that the default parameters are marginally more consistent under randomness; however, this deviates more across all tests. By using and optimising for the W Randomness Coefficient in future extensions to this framework, we can reduce the impact of biased evaluation procedures. With this coefficient, researchers can quantify how trustworthy their results are, and therefore their usability in real-world applications.
It is likely that there is little difference in the coefficients in this scenario as the
default parameters have been evaluated with a consistent approach to model
selection and constant resource allocation to training time. This sets the base-
line for consistency in evaluation procedure and allows better understanding of
relative method performance.
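For intuition on how ranking consistency across seeds can be quantified, the sketch below computes Kendall's coefficient of concordance W [7] for a single test. Note that Kendall's W equals 1 for perfectly consistent rankings, whereas the W Randomness Coefficient reported in Table 3 is oriented so that lower values are preferred and is aggregated over the whole suite of tests, so the mapping between the two is an assumption here.

```python
# A hedged sketch of a Kendall's-W-style consistency score across random
# seeds for a single test (the coefficient in the text aggregates many such
# tests). Ties are handled by average ranks; no tie correction is applied.
import numpy as np
from scipy.stats import rankdata

def kendalls_w(scores):
    """scores: (n_seeds, n_algorithms) metric values, higher is better."""
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape                                    # m judges (seeds), n items
    ranks = np.vstack([rankdata(-row) for row in scores])  # rank 1 = best per seed
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))              # 1 = identical rankings

# Three seeds ranking four algorithms identically -> W = 1.
print(kendalls_w([[0.71, 0.55, 0.42, 0.30],
                  [0.69, 0.57, 0.40, 0.33],
                  [0.72, 0.52, 0.45, 0.29]]))
```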
Table 3. Quantification of intra-framework consistency using the W Randomness Coefficient and inter-framework disparity using the Framework Comparison Rank. Lower values of both the Framework Comparison Rank and the W Randomness Coefficient are preferred.
Fig. 1. The average performance and standard deviation of each metric, averaged over every seed tested, for all methods on all datasets. The hyperparameter investigation under our framework is shown in colour, compared with the default hyperparameters in dashed boxes. Lower values of Conductance are better. Out Of Memory (OOM) occurrences happened on the amac dataset with the following algorithms during HPO: daegc, sublime, dgi, cagc, vgaer; and with cagc under the default HPs.
5 Conclusion
In this work we demonstrate flaws with how GNN based community detection
methods are currently evaluated, leading to potentially misleading and confusing
conclusions. To address this, an evaluation framework is detailed for evaluating
GNNs at community detection that provides a more consistent and fair evalua-
tion, and can be easily extended. We provide further insight that consistent HPO is key for this task by quantifying the difference in performance between HPO and reported default values. Finally, a metric is proposed for assessing the consistency of rankings, which empirically quantifies the trust researchers can have in the robustness of results.
References
1. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J.
Mach. Learn. Res. 13(2), 281–305 (2012)
2. Bergstra, J., Yamins, D., Cox, D.: Making a science of model search: hyperparam-
eter optimization in hundreds of dimensions for vision architectures. In: Interna-
tional Conference on Machine Learning, pp. 115–123. PMLR (2013)
3. Cabreros, I., Abbe, E., Tsirigos, A.: Detecting community structures in Hi-C
genomic data. In: 2016 Annual Conference on Information Science and Systems
(CISS), pp. 584–589. IEEE (2016)
4. Craven, M., McCallum, A., PiPasquo, D., Mitchell, T., Freitag, D.: Learning to
extract symbolic knowledge from the world wide web. Technical report, Carnegie-
Mellon University Pittsburgh, PA, School of Computer Science (1998)
5. Dwivedi, V.P., Joshi, C.K., Laurent, T., Bengio, Y., Bresson, X.: Benchmarking
graph neural networks. arXiv preprint arXiv:2003.00982 (2020)
6. Errica, F., Podda, M., Bacciu, D., Micheli, A.: A fair comparison of graph neural
networks for graph classification. arXiv preprint arXiv:1912.09893 (2019)
7. Field, A.P.: Kendall’s coefficient of concordance. Encycl. Stat. Behav. Sci. 2, 1010–
1011 (2005)
8. Giles, C.L., Bollacker, K.D., Lawrence, S.: CiteSeer: an automatic citation indexing
system. In: Proceedings of the Third ACM Conference on Digital Libraries, pp. 89–
98 (1998)
9. Hassani, K., Khasahmadi, A.H.: Contrastive multi-view representation learning on
graphs. In: International Conference on Machine Learning, pp. 4116–4126. PMLR
(2020)
10. He, R., McAuley, J.: Ups and downs: modeling the visual evolution of fashion trends
with one-class collaborative filtering. In: Proceedings of the 25th International
Conference on World Wide Web, pp. 507–517 (2016)
11. Hu, W., Fey, M., Ren, H., Nakata, M., Dong, Y., Leskovec, J.: OGB-LSC: a large-
scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430
(2021)
12. Jin, D., et al.: A survey of community detection approaches: from statistical mod-
eling to deep learning. IEEE Trans. Knowl. Data Eng. 35(2), 1149–1170 (2021)
13. Kefato, Z.T., Girdzijauskas, S.: Self-supervised graph neural networks without
explicit negative sampling. arXiv preprint arXiv:2103.14958 (2021)
14. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
15. Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv preprint
arXiv:1611.07308 (2016)
16. Liu, F., et al.: Deep learning for community detection: progress, challenges and
opportunities. arXiv preprint arXiv:2005.08225 (2020)
17. Liu, Y., et al.: A survey of deep graph clustering: taxonomy, challenge, and appli-
cation. arXiv preprint arXiv:2211.12875 (2022)
18. Liu, Y., Zheng, Y., Zhang, D., Chen, H., Peng, H., Pan, S.: Towards unsupervised
deep graph structure learning. In: Proceedings of the ACM Web Conference 2022,
pp. 1392–1403 (2022)
19. McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the construction
of internet portals with machine learning. Inf. Retrieval 3(2), 127–163 (2000)
20. McConville, R., Santos-Rodriguez, R., Piechocki, R.J., Craddock, I.: N2D: (Not
Too) deep clustering via clustering the local manifold of an autoencoded embed-
ding. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp.
5145–5152. IEEE (2021)
21. Monti, F., Frasca, F., Eynard, D., Mannion, D., Bronstein, M.M.: Fake
news detection on social media using geometric deep learning. arXiv preprint
arXiv:1902.06673 (2019)
22. Morris, C., Kriege, N.M., Bause, F., Kersting, K., Mutzel, P., Neumann, M.:
TUDataset: a collection of benchmark datasets for learning with graphs. arXiv
preprint arXiv:2007.08663 (2020)
23. Ozaki, Y., Tanigaki, Y., Watanabe, S., Onishi, M.: Multiobjective tree-structured
parzen estimator for computationally expensive optimization problems. In: Pro-
ceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 533–
541 (2020)
24. Palowitch, J., Tsitsulin, A., Mayer, B., Perozzi, B.: GraphWorld: fake graphs bring
real insights for GNNs. arXiv preprint arXiv:2203.00112 (2022)
25. Pineau, J.: Improving reproducibility in machine learning research (a report from
the NeurIPS 2019 reproducibility program). J. Mach. Learn. Res. 22(1), 7459–7478
(2021)
26. Qiu, C., Huang, Z., Xu, W., Li, H.: VGAER: graph neural network reconstruction
based community detection. In: AAAI: DLG-AAAI 2022 (2022)
27. Salzberg, S.L.: On comparing classifiers: pitfalls to avoid and a recommended app-
roach. Data Min. Knowl. Disc. 1, 317–328 (1997)
28. Schaeffer, S.E.: Graph clustering. Comput. Sci. Rev. 1(1), 27–64 (2007)
29. Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural
network evaluation. arXiv preprint arXiv:1811.05868 (2018)
30. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: ArnetMiner: extraction and
mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 990–998
(2008)
31. Thakoor, S., Tallec, C., Azar, M.G., Munos, R., Veličković, P., Valko, M.: Boot-
strapped representation learning on graphs. In: ICLR 2021 Workshop on Geomet-
rical and Topological Representation Learning (2021)
32. Tsitsulin, A., Palowitch, J., Perozzi, B., Müller, E.: Graph clustering with graph
neural networks. arXiv preprint arXiv:2006.16904 (2020)
33. Velickovic, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm, R.D.: Deep
graph infomax. ICLR (Poster) 2(3), 4 (2019)
34. Wang, C., Pan, S., Hu, R., Long, G., Jiang, J., Zhang, C.: Attributed graph clus-
tering: a deep attentional embedding approach. arXiv preprint arXiv:1906.06532
(2019)
35. Wang, T., Yang, G., He, Q., Zhang, Z., Wu, J.: NCAGC: a neighborhood con-
trast framework for attributed graph clustering (2022). https://doi.org/10.48550/
ARXIV.2206.07897, https://arxiv.org/abs/2206.07897
36. Wasserman, S., Faust, K., et al.: Social Network Analysis: Methods and Applica-
tions. Cambridge University Press, Cambridge (1994)
37. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature
393(6684), 440–442 (1998)
38. Yang, J., McAuley, J., Leskovec, J.: Community detection in networks with node
attributes. In: 2013 IEEE 13th International Conference on Data Mining, pp. 1151–
1156. IEEE (2013)
39. Yang, L., Shami, A.: On hyperparameter optimization of machine learning algo-
rithms: theory and practice. Neurocomputing 415, 295–316 (2020)
40. Yuan, Y., Wang, W., Pang, W.: A systematic comparison study on hyperparameter
optimisation of graph neural networks for molecular property prediction. In: Pro-
ceedings of the Genetic and Evolutionary Computation Conference, pp. 386–394
(2021)
41. Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., Wang, L.: Deep graph contrastive repre-
sentation learning. arXiv preprint arXiv:2006.04131 (2020)
42. Zöller, M.A., Huber, M.F.: Benchmark and survey of automated machine learning
frameworks. J. Artif. Intell. Res. 70, 409–472 (2021)
Link Analysis and Ranking
Stochastic Degree Sequence Model
with Edge Constraints (SDSM-EC)
for Backbone Extraction
1 Introduction
It is common to use the projection of a bipartite network to measure a unipar-
tite network of interest. For example, scientific collaboration networks are often
measured using a co-authorship network, which is the projection of a bipartite
author-paper network [12]. Similarly, corporate networks are often measured
using a board co-membership or ‘interlocking directorate’ network, which is the
projection of a bipartite executive-board network [1]. The edges in a bipartite
projection are weighted (e.g., number of co-authored papers, number of shared
boards), but these weights do not provide an unbiased indicator of the strength of
the connection between vertices [5,9]. To overcome this bias, backbone extrac-
tion identifies the edges that are stronger than expected under a relevant null
model, retaining only these edges to yield a simpler unweighted network (i.e.,
the backbone) that is more suitable for visualization and analysis.
Many null models exist for extracting the backbone of bipartite networks,
with each model specifying different constraints on the random networks against
1.1 Preliminaries
A bipartite network’s vertices can be partitioned into two sets such that edges
exist between, but not within, sets. In this work, we focus on a special case
of a bipartite network – a two-mode network – where the two sets of vertices
represent distinctly different entities that we call ‘agents’ and ‘artifacts’ (e.g.
authors and papers, or executives and corporate boards).
To facilitate notation, we represent networks as matrices. First, we represent
a bipartite network containing r ‘agents’ and c ‘artifacts’ as an r × c binary
incidence matrix B, where Bik = 1 if agent i is connected to artifact k (e.g.,
author i wrote paper k), and otherwise is 0. The row sums R = r1 … rr of B contain the degree sequence of the agents (e.g., the number of papers written by each author), while the column sums C = c1 … cc of B contain the degree sequence of the artifacts (e.g., the number of authors on each paper). A prohibited edge
in a bipartite network is represented by constraining a cell to equal zero, and
therefore is sometimes called a ‘structural zero’ [13]. Second, we represent the
projection of a bipartite network as an r × r weighted adjacency matrix P =
BBT , where BT represents the transpose of B. In P, Pij equals the number of
artifacts k that are adjacent to both agent i and agent j (e.g., the number of
papers co-authored by authors i and j). Finally, we represent the backbone of a projection P as an r × r binary adjacency matrix P′, where P′ij = 1 if agent i is connected to agent j in the backbone, and otherwise is 0.
Let ℬ be an ensemble of r × c binary incidence matrices, which can be constrained to have certain features present in B. Let P∗ij be a random variable equal to (B∗B∗T)ij for B∗ ∈ ℬ. Decisions about which edges appear in a backbone extracted at the statistical significance level α are made by comparing Pij to P∗ij:

P′ij = 1 if Pr(P∗ij ≥ Pij) < α/2, and P′ij = 0 otherwise.
This test includes edge (i, j) in the backbone if its weight in the observed projection Pij is uncommonly large compared to its weights P∗ij in projections of members of the ensemble.
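The projection and the edge-inclusion rule can be sketched as follows. The ensemble used here is only a crude stand-in (independent row-wise shuffles of B), not the FDSM or SDSM ensembles discussed below; it merely illustrates how Pr(P∗ij ≥ Pij) is compared against α/2.

```python
# A minimal sketch of the quantities defined above: the weighted projection
# P = B B^T and the rule that keeps edge (i, j) when Pr(P*_ij >= P_ij) < alpha/2.
# Each B* below is obtained by shuffling the entries within every row of B,
# which preserves row sums but not column sums (a stand-in null model only).
import numpy as np

B = np.array([[1, 1, 0, 0],   # observed agent-by-artifact incidence matrix
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 0, 1]])
P = B @ B.T                   # P_ij = number of artifacts shared by agents i and j

rng = np.random.default_rng(0)
ensemble = [rng.permuted(B, axis=1) for _ in range(2000)]
P_star = np.stack([Bs @ Bs.T for Bs in ensemble])   # shape (samples, r, r)

alpha = 0.10
p_upper = (P_star >= P).mean(axis=0)                # elementwise Pr(P*_ij >= P_ij)
backbone = (p_upper < alpha / 2).astype(int)        # P'_ij
np.fill_diagonal(backbone, 0)                       # self-loops are not meaningful
print(backbone)
```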
2 Backbone Models
2.1 The Stochastic Degree Sequence Model (SDSM)
Models for extracting the backbone of bipartite projections differ in the constraints they impose on ℬ. The most stringent model – the Fixed Degree Sequence Model (FDSM) [17] – relies on a microcanonical ensemble that constrains each member of ℬ to have exactly the same row and column sums as B.
Computing Pij∗ under the FDSM is slow because it requires approximation via
computationally intensive Monte Carlo simulation. Despite recent advances in
the efficiency of these simulations [2], it is often more practical to use the less
stringent Stochastic Degree Sequence Model (SDSM) [9]. The SDSM relies on a
canonical ensemble that constrains each member of ℬ to have the same row and
column sums as B on average. SDSM is fast and exact, and comparisons with
FDSM reveal that it yields similar backbones [11].
Under the SDSM, P∗ij follows a Poisson-binomial distribution whose parameters can be computed from the entries of a probability matrix Q, where Qik = Pr(B∗ik = 1) for B∗ in a microcanonical ℬ. That is, Qik is the probability that B∗ik contains a 1 in the space of all matrices with given row and column sums.
Most implementations of SDSM approximate Q using the fast and precise Bipar-
tite Configuration Model (BiCM) [14,15]. However, it can also be computed with
minimal loss of speed and precision [11] using a logistic regression [9], which offers
more flexibility. This method estimates the β coefficients in

Bik = β0 + β1 ri + β2 ck + ε

using maximum likelihood, then defines Qik as the predicted probability that Bik = 1.
The constraints that SDSM imposes on ℬ are determined by the way that Q is defined. In the conventional SDSM, Q is defined such that Qik is the probability that B∗ik contains a 1 in the space of all matrices with given row and column sums, which only imposes constraints on the row and column sums of members of ℬ. To accommodate edge constraints, we define Q′ such that Q′ik is the probability that B∗ik contains a 1 in the space of all matrices with given row and column sums and no 1s in prohibited cells.
The BiCM method cannot be used to approximate Q′; however, the logistic regression method can be adapted to approximate it. If Bik is a prohibited edge, then Q′ik = 0 by definition. If Bik is not a prohibited edge, then Q′ik is the predicted probability that Bik = 1 based on a fitted logistic regression. Importantly, however, whereas the logistic regression used to estimate Q is fitted over all Bik, the logistic regression used to estimate Q′ is fitted only over Bik that are not prohibited edges.
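A minimal sketch of the logistic-regression estimation of Q and Q′ is given below. It assumes scikit-learn and a feature set consisting only of the row and column sums (ri, ck); the exact regression specification may differ from the one used here, and prohibited cells are simply excluded from fitting and then fixed to zero.

```python
# A hedged sketch of estimating Q (SDSM) and Q' (SDSM-EC) with a logistic
# regression on row and column sums; the exact specification may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_Q(B, prohibited=None):
    """B: r x c binary incidence matrix; prohibited: same-shape boolean mask."""
    r_sums, c_sums = B.sum(axis=1), B.sum(axis=0)
    # One feature row per cell (i, k): its row sum and column sum.
    X = np.array([[r_sums[i], c_sums[k]]
                  for i in range(B.shape[0]) for k in range(B.shape[1])])
    y = B.ravel()
    keep = np.ones(len(y), dtype=bool)
    if prohibited is not None:
        keep = ~prohibited.ravel()            # fit only over non-prohibited cells
    model = LogisticRegression().fit(X[keep], y[keep])
    Q = np.zeros(B.size)
    Q[keep] = model.predict_proba(X[keep])[:, 1]
    return Q.reshape(B.shape)                 # prohibited cells stay at Q'_ik = 0

B = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 1], [0, 0, 0, 1]])
forbidden = np.zeros_like(B, dtype=bool); forbidden[0, 3] = True
print(estimate_Q(B))                          # conventional SDSM Q
print(estimate_Q(B, prohibited=forbidden))    # SDSM-EC Q'
```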
3 Results
3.1 Estimating Q
In general the true values of Qik are unknown. However, for small matrices they
can be computed from a complete enumeration of the space. To evaluate the
precision of Qik estimated using the SDSM-EC method described above, we
first enumerated all 4 × 4 incidence matrices with row sums {1, 1, 2, 2} and
column sums {1, 1, 2, 2}; there are 211. Next, we constrained this space to
matrices in which a randomly selected one or two cells always contain a zero
(i.e. bipartite networks with one or two prohibited edges). Finally, we computed
the true value of each Qik for all cells and all spaces, estimated each Qik using
the logistic regression method, and computed the absolute deviation between
the two.
Figure 1A illustrates that, compared to the cardinality of the unconstrained space (|ℬ| = 211), the cardinalities of the spaces constrained by one or two prohibited edges are much lower (|ℬ| = 2–29, gray bars). That is, while the
SDSM evaluates whether a given edge’s weight is significant by comparing its
value to a large number of possible worlds, the SDSM-EC compares its value to
a much smaller number of possible worlds. Figure 1B illustrates the deviations
between the true value of Qik and the value estimated using the logistic regres-
sion method. It demonstrates that although SDSM-EC requires approximating
Qik , these approximations tend to be close to the true values.
Fig. 1. (A) The cardinality of the space of matrices with row sums {1, 1, 2, 2} and
column sums {1, 1, 2, 2} and one or two cells constrained to zero is small compared to
the cardinality of the space without constrained cells. (B) The deviation between the
true and estimated Qik for all such constrained spaces tends to be small.
filled squares), such that agents are only connected to artifacts of the same type.
Such a network might arise in the context of university students joining clubs.
For example, suppose Harvard students (open circles) only join Harvard clubs
(open squares), while Yale students (filled circles) only join Yale clubs (filled
squares).
Fig. 2. (A) A bipartite network containing two groups of agents and two groups of
artifacts, such that agents are connected only to their own group’s artifacts. (B) The
SDSM backbone of a projection of this bipartite graph, which assumes that an agent
could be connected to another group’s artifact, suggests within-group cohesion among
agents. (C) The SDSM-EC projection, which assumes that an agent could not be
connected to another group’s artifact, suggests none of the edges in the projection are
significant.
group share many artifacts, they also share many artifacts under the null model.
The absence of connections in the SDSM-EC backbone reflects the fact that
it is uninteresting that pairs of Harvard students, or pairs of Yale students,
are members of many of the same clubs because they could not have chosen
otherwise.
Fig. 3. (A) Backbone extracted using SDSM and (B) SDSM-EC from 1829 observations
of 53 preschool children's play groups. Vertex shape represents age-based classrooms:
circles = 3 year old classroom, squares = 4 year old classroom. Vertex color represents
attendance status: black = full day, gray = AM only, white = PM only.
does consider these edge constraints. There are some similarities between the
SDSM and SDSM-EC backbones that reflect characteristics of the setting: 3-
year-olds (circles) are never connected to 4-year-olds (squares), and AM children
(gray) are never connected to PM children (white), because it was not possible to
observe such children together. However, there are also differences that highlight
the impact of incorporating edge constraints using SDSM-EC. The SDSM-EC
backbone contains many fewer edges (E = 85) than the SDSM backbone (E =
153). This occurs for similar reasons to the loss of edges in the toy example above, although it is less extreme.
A hypothetical example serves to illustrate why the SDSM-EC backbone con-
tains fewer edges in this context. Consider the case of an AM child and a Full
Day child in the 3-year-old classroom who were observed to play together a few
times. The SDSM compares this observed co-occurrence to the expected number
of co-occurrences if these two children had played with other AM or Full Day
children and with others in the 3-year-old classroom (which is possible), but also
if they had played with PM children and children in the 4-year-old classroom
(which is not possible). Under such a broad null model that includes some impos-
sible play configurations, observing these two children playing together even just
a few times seems noteworthy, and therefore an edge between them is included in
the backbone. In contrast, the SDSM-EC compares this observed co-occurrence
to the expected number of co-occurrences if these two children had played with
other AM or Full Day children and with others in the 3-year-old classroom
only, recognizing that it was not possible for the AM child to play with PM
children or for either to play with children in the 4-year-old classroom. Under
this more constrained null model that excludes impossible play configurations,
observing these two children playing together just a few times is not particularly
noteworthy, and therefore an edge between them is omitted from the backbone.
As this example illustrates, the SDSM-EC contains fewer edges because it cor-
rectly omits edges that might seem significantly strong when evaluated against a
null model that includes impossible configurations, but that are not significantly
strong when evaluated against a properly constrained null model that excludes
impossible configurations.
4 Conclusion
the potential to benefit not only the SDSM-EC, but all variants of the SDSM.
Second, while a broad class of bipartite null models exist [16] and now include
edge constraints, future work should investigate the importance and feasibility
of incorporating other types of constraints.
Acknowledgements. We thank Emily Durbin for her assistance collecting the empir-
ical data.
Data Availability Statement. The data and code necessary to reproduce the results
reported above are available at https://osf.io/7z4gu.
References
1. Burris, V.: Interlocking directorates and political cohesion among corporate elites.
Am. J. Sociol. 111(1), 249–283 (2005). https://doi.org/10.1086/428817
2. Godard, K., Neal, Z.P.: fastball: a fast algorithm to randomly sample bipartite
graphs with fixed degree sequences. J. Complex Netw. 10(6), cnac049 (2022).
https://doi.org/10.1093/comnet/cnac049
3. Gornik, A.E., Neal, J.W., Lo, S.L., Durbin, C.E.: Connections between preschool-
ers’ temperament traits and social behaviors as observed in a preschool setting.
Soc. Dev. 27(2), 335–350 (2018). https://doi.org/10.1111/sode.12271
4. Hanish, L.D., Martin, C.L., Fabes, R.A., Leonard, S., Herzog, M.: Exposure to
externalizing peers in early childhood: homophily and peer contagion processes. J.
Abnorm. Child Psychol. 33, 267–281 (2005). https://doi.org/10.1007/s10802-005-
3564-6
5. Latapy, M., Magnien, C., Del Vecchio, N.: Basic notions for the analysis of large
two-mode networks. Soc. Netw. 30(1), 31–48 (2008). https://doi.org/10.1016/j.
socnet.2007.04.006
6. Neal, J.W., Brutzman, B., Durbin, C.E.: The role of full-and half-day preschool
attendance in the formation of children’s social networks. Early Childhood Res. Q.
60, 394–402 (2022). https://doi.org/10.1016/j.ecresq.2022.04.003
7. Neal, J.W., Durbin, C.E., Gornik, A.E., Lo, S.L.: Codevelopment of preschoolers’
temperament traits and social play networks over an entire school year. J. Pers.
Soc. Psychol. 113(4), 627 (2017). https://doi.org/10.1037/pspp0000135
8. Neal, J.W., Neal, Z.P., Durbin, C.E.: Inferring signed networks from preschoolers’
observed parallel and social play. Soc. Netw. 71, 80–86 (2022). https://doi.org/10.
1016/j.socnet.2022.07.002
9. Neal, Z.P.: The backbone of bipartite projections: inferring relationships from co-
authorship, co-sponsorship, co-attendance and other co-behaviors. Soc. Netw. 39,
84–97 (2014). https://doi.org/10.1016/j.socnet.2014.06.001
10. Neal, Z.P.: backbone: an R package to extract network backbones. PLOS ONE
17(5), e0269,137 (2022). https://doi.org/10.1371/journal.pone.0269137
11. Neal, Z.P., Domagalski, R., Sagan, B.: Comparing alternatives to the fixed degree
sequence model for extracting the backbone of bipartite projections. Sci. Rep.
11(1), 1–13 (2021). https://doi.org/10.1038/s41598-021-03238-3
12. Newman, M.E.: Scientific collaboration networks. I. Network construction and fun-
damental results. Phys. Rev. E 64(1), 016,131 (2001). https://doi.org/10.1103/
PhysRevE.64.016131
13. Ripley, R.M., Snijders, T.A.B., Boda, Z., Voros, A., Preciado, P.: Manual for Siena
version 4.0. Technical report. Department of Statistics, Nuffield College, University
of Oxford, Oxford (2023). R package version 1.3.14.4. https://www.cran.r-project.
org/web/packages/RSiena/
14. Saracco, F., Di Clemente, R., Gabrielli, A., Squartini, T.: Randomizing bipartite
networks: the case of the world trade web. Sci. Rep. 5(1), 1–18 (2015). https://
doi.org/10.1038/srep10595
15. Saracco, F., Straka, M.J., Di Clemente, R., Gabrielli, A., Caldarelli, G., Squartini,
T.: Inferring monopartite projections of bipartite networks: an entropy-based app-
roach. New J. Phys. 19(5), 053,022 (2017). https://doi.org/10.1088/1367-2630/
aa6b38
16. Strona, G., Ulrich, W., Gotelli, N.J.: Bi-dimensional null model analysis of
presence-absence binary matrices. Ecology 99(1), 103–115 (2018). https://doi.org/
10.1002/ecy.2043
17. Zweig, K.A., Kaufmann, M.: A systematic approach to the one-mode projection of
bipartite graphs. Soc. Netw. Anal. Min. 1(3), 187–218 (2011). https://doi.org/10.
1007/s13278-011-0021-0
Minority Representation and Relative
Ranking in Sampling Attributed
Networks
1 Introduction
Given that most real networks can only be observed indirectly, network sam-
pling, and its impact on the representation/learning of the true network, is an
active area of research across multiple communities (see e.g. [2,3] and the ref-
erences therein). In this context, there has been significant interest in attributed
network sampling where there is a small minority of nodes with a certain attribute. Here, we explore two related questions in this area, namely, (a) settings where Page-rank and other exploration-based sampling schemes favor sampling
small minorities, (b) effects of homophily and out-degrees on the relative rank-
ing of minorities compared to majorities in degree-based sampling. To this end,
we shall use an attributed network model that incorporates homophily [1]. We
employ the asymptotic theory developed in [1] to gain insight through data
studies of the various network sampling schemes and attribute representation
in concrete applications. The findings will also be assessed with real-world net-
works. More concretely, we investigate the following research problems:
(a) We consider the case where there is a particular small minority which
has higher propensity to connect within itself as opposed to majority nodes;
for substantial recent applications and impact of such questions, see [10,11,13].
In such a setting, devising schemes where one gets a non-trivial representation of
minorities is challenging if the sample size is much smaller than the network
size. In this case, uniform sampling will clearly not be fair as the sampled nodes
will tend to be more often from the majority attribute. Additionally, uniform
sampling does not give preference to “more popular” minority nodes, i.e., higher
degree/Page-rank nodes. Therefore, it is desirable to explore the network locally
around the initial (uniformly sampled) random node and try to travel towards
the “centre”, thereby traversing edges along their natural direction. However, to
avoid high sampling costs, the explored set of nodes should not be too large. We
compare through a data study several sampling schemes derived from centrality
measures like degree and Page-rank and show that they increase the probability
of sampling a minority node and its “popularity”. This is investigated in several
network model configurations and in a real network dataset.
(b) We consider two degree-based sampling schemes and explore the effects
of homophily and out-degrees of the model parameters on the relative ranking of
minority compared to majority (in terms of proportion) in the samples. As in (a),
we again study minority representation, but focus on degree-based sampling and
are interested in dependence on structural network properties. The conditions
in an asymptotic regime (when the number of nodes goes to infinity) are known
for the minority nodes to rank higher (i.e. have larger proportions) than the
majority nodes (based on the tail distribution and sum of the degrees) [1]. For
three scenarios - heterophily, homogeneous homophily (homogenous mixing) and
asymmetric homophily - the results are numerically investigated for the minority
nodes to rank higher. The last two scenarios were briefly considered heuristically
in [7,9] using fluid limits. We show that the results for two real networks with
degree power-law distributions agree with those for the synthetic model.
The paper is organized as follows. A synthetic model with homophily is given
in Sec. 2. Network sampling in the presence of a small minority is studied in Sec.
rank and fixed length walk sampling. The next sections investigate how these
results hold in a non-asymptotic regime in (sub-)linear, (non-)tree networks, as
well as in a real network.
Fig. 1. Synthetic networks with 500 nodes: (l.h.s.) linear, tree network (a = 0.003,
D = 1), (m.h.s.) sub-linear, tree network (α = 0.25, a = 0.02, D = 1), (r.h.s.) linear,
non-tree network (m1 = 1, m2 = 2, a = 0.02, D = 1). The red (green) circles represent
the minority (majority) nodes with sizes proportional to the degrees.
of the network which is 18 (in the tree case, it is O(log |V |)). These drawbacks
explain partly the good performance of fixed length walk sampling which also
has the higher rank of the minority sampled nodes. This sampling scheme gives
preference to nodes with a higher Page-rank as well.
We next consider the sub-linear, tree network with α = 0.25 and a = 0.02
(D = 1) which gives π1 ≈ 0.124. The characteristics of the generated network
are given in Table 1 (Syn. 2). An illustration of a small network with these char-
acteristics is shown in Fig. 1 (m.h.s.). We estimate the probability of sampling
a minority and its importance for each sampling scheme using 10^4 runs – see
Table 3. The qualitative comparison of the performance of the sampling schemes
is the same as in the linear case. However, the number of steps for sampling pro-
portional to Page-rank and fixed length walk sampling is larger. The diameter
of the generated network is 25.
Finally, we consider a linear, non-tree network with m1 = 1 and m2 = 2
and a = 0.02 (D = 1). The number of nodes is 25,000 which resulted in a
network diameter of 16. The network properties are shown in Table 2 (Syn. 3) –
see also Fig. 1 (r.h.s.) for a network generated with a smaller number of nodes.
As seen from the results (averaged over 10^4 runs) in Table 4, the probability of
sampling a minority node with fixed length walk sampling decreases compared
to the sub-linear case due to the non-tree network structure (however, it is still approximately double that of uniform sampling).
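For illustration only, a fixed-length walk sampler along out-edges can be sketched as below; this is an assumed, simplified variant, not necessarily the exact scheme evaluated here, and the attribute labelling in the usage example is hypothetical.

```python
# An illustrative sketch of fixed-length walk sampling on a directed graph
# with networkx: start at a uniformly chosen node, follow out-edges for a
# fixed number of steps, and restart at a random node on dead ends.
import random
import networkx as nx

def fixed_length_walk_sample(G, walk_length=10, rng=random):
    node = rng.choice(list(G.nodes()))
    for _ in range(walk_length):
        successors = list(G.successors(node))
        node = rng.choice(successors) if successors else rng.choice(list(G.nodes()))
    return node  # the sampled node

# Toy usage: estimate the probability that the sampled node is a minority node.
G = nx.scale_free_graph(500, seed=1)
minority = {v for v in G if v % 10 == 0}   # hypothetical attribute labelling
hits = sum(fixed_length_walk_sample(G) in minority for _ in range(1000))
print(hits / 1000)
```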
network where nodes denote users, and edges represent retweets among them.
Users in the dataset are classified as either “hateful” (attribute 1) or “normal”
(attribute 2) depending on the sentiment of their tweets [7]. “Hateful” users rep-
resent the minority. We consider the largest connected component of the network
and remove loops and multiple edges for a comparison with the synthetic net-
works. Table 5 shows the key characteristics of interest of the directed network
(with diameter 24). The results (averaged over 10^4 runs) in Table 6 are in line with the synthetic model, where fixed length walk sampling shows a higher probability of sampling a minority node, in addition to a higher rank, compared to uniform sampling. The smaller differences are due to the characteristics of the network, where the proportion of edges from “normal” to “hateful” users is only slightly higher than in the opposite direction. This can also be seen from the homophily measures H21 and H12.
γ | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.1 | 0.15 | 0.20 | 0.3 | 0.4 | 0.5
Scheme A: m1 = 1, m2 = 1 | 0.868 | 0.810 | 0.733 | 0.673 | 0.626 | 0.514 | 0.467 | 0.417 | 0.407 | 0.355 | 0.325
Scheme A: m1 = 5, m2 = 1 | 0.372 | 0.458 | 0.515 | 0.560 | 0.576 | 0.660 | 0.679 | 0.701 | 0.732 | 0.754 | 0.513
Scheme B: m1 = 1, m2 = 1 | 0.442 | 0.432 | 0.433 | 0.422 | 0.422 | 0.397 | 0.384 | 0.371 | 0.355 | 0.343 | 0.334
Scheme B: m1 = 5, m2 = 1 | 0.535 | 0.5424 | 0.546 | 0.545 | 0.550 | 0.550 | 0.550 | 0.544 | 0.533 | 0.513 | 0.488
(Fig. 2: per-attribute degree distributions, p.m.f. versus degree, for minority and majority nodes in the heterophilic networks.)
(m1 , m2 ) (the out-degree vector). We shall gain insight through the following
results and the quantities involved. From the analysis of the linear model [1], we
have: as n, k → ∞,
η_a^{m,n} := (Σ_{v ∈ V : a(v) = a} deg(v, n)) / (2(n + n0)) → η_a^m,   p_{n,a}^m(k) ∼ k^{−(1 + 2/φ_a^m)},   a = 1, 2,   (2)

where η_a^m represents the limit of the normalized sum η_a^{m,n} of degrees of attribute type a, and p_{n,a}^m(k) represents the proportion of nodes of type a with degree k, which follows a power law with exponent Φ_a^m := 2/φ_a^m in the limit. The quantities η_a^m and φ_a^m are related to the relative ranking of minorities under the two sampling schemes A and B above and can be precisely computed: (η_1^m, η_2^m) is the minimizer of a suitable function [1, Eq. (4.1)], and

φ_a^m = 2 − m_a π_a / η_a^m.   (3)

If φ_1^m > φ_2^m, the tail of the minority degree distribution is heavier (see Eq. (2)) and hence minorities are ranked higher in scheme A. On the other hand, if η_1^m > η_2^m, the probability of sampling a minority node is higher in each draw and hence the same conclusion holds in scheme B. We consider three different network configurations as follows (for the proofs of the results (4)–(9) below, see [1]).
Heterophilic Network. We first consider the scenario of a strongly heterophilic
network, such that κ11 = κ22 = 1 and κ12 = κ21 = K is large. In this case, node
pairs with different attributes are more likely to be connected than node pairs
with concordant attributes. As K increases, φ_1^m and φ_2^m behave as

φ_1^m ≈ 2(1 − m1π1/(m1π1 + m2π2)),   φ_2^m ≈ 2(1 − m2π2/(m1π1 + m2π2)).   (4)
γ | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.1 | 0.15 | 0.20 | 0.3 | 0.4 | 0.5
Scheme A: m1 = 5, m2 = 1 | 0.888 | 0.886 | 0.887 | 0.881 | 0.890 | 0.868 | 0.859 | 0.857 | 0.86 | 0.751 | 0.601
Scheme A: m1 = 5, m2 = 2 | 0.700 | 0.718 | 0.697 | 0.693 | 0.694 | 0.692 | 0.698 | 0.658 | 0.673 | 0.6556 | 0.614
Scheme A: m1 = 2, m2 = 1 | 0.552 | 0.598 | 0.601 | 0.594 | 0.593 | 0.589 | 0.589 | 0.576 | 0.586 | 0.547 | 0.580
Scheme B: m1 = 5, m2 = 1 | 0.684 | 0.690 | 0.682 | 0.676 | 0.677 | 0.659 | 0.6441 | 0.629 | 0.596 | 0.558 | 0.516
Scheme B: m1 = 5, m2 = 2 | 0.527 | 0.522 | 0.518 | 0.522 | 0.516 | 0.505 | 0.497 | 0.493 | 0.473 | 0.455 | 0.436
Scheme B: m1 = 2, m2 = 1 | 0.5056 | 0.505 | 0.503 | 0.507 | 0.502 | 0.499 | 0.493 | 0.487 | 0.477 | 0.465 | 0.451
(Fig. 3: per-attribute degree distributions, p.m.f. versus degree, for the homogeneous homophily and homogeneous mixing networks.)
Thus, the rank of minority nodes under scheme A depends on the relation
between m1 π1 and m2 π2 . Table 7 shows the results for two linear networks with
25,000 nodes, K = 10 and π1 = 0.3. The out-degree vectors m are (1, 1) and
(5, 1). For m1 = 1, we have φ_1^m ≈ 1.373 and φ_2^m ≈ 0.659 (using (3)), which are close, respectively, to 1.4 and 0.6 given by the approximations in (4). In this case m1π1 < m2π2, and the minority nodes rank higher under scheme A due to the fact that majority nodes tend to connect to minority nodes, increasing their ranks. This holds for any tree network. For m1 = 5, we have φ_1^m ≈ 0.688 and φ_2^m ≈ 1.377, which are close, respectively, to 0.636 and 1.364 given by (4). In this setting m1π1 > m2π2, and the minority nodes increase the ranks of majority nodes for small values of γ by connecting to the majority with more outgoing edges. (Note that when γ = 1 the relative ranking is given by the proportion of minority nodes in the network.) Figure 2 shows the degree distribution for each attribute, where on the l.h.s. the minority has a heavier tail (φ_1^m > φ_2^m) and on the r.h.s. it is the majority (φ_1^m < φ_2^m).
As K gets larger, η_1^m and η_2^m approach the same limit value

η_1^m ≈ η_2^m ≈ (m1π1 + m2π2)/2,   (5)

which implies that the differences in the relative rankings of the two attributes are smaller for scheme B. Table 7 shows the relative ranking of the minority for the networks described above under this scheme (the results were averaged over a large number of runs). For m1 = 1, we have η_1^m ≈ 0.478 and η_2^m ≈ 0.522, which are close to the 0.5 given by the approximation in (5).
γ | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.1 | 0.15 | 0.20 | 0.3 | 0.4 | 0.5
Scheme A: m1 = 1, m2 = 1 | 0.712 | 0.632 | 0.589 | 0.570 | 0.526 | 0.468 | 0.4565 | 0.397 | 0.384 | 0.345 | 0.323
Scheme A: m1 = 2, m2 = 1 | 0.980 | 0.938 | 0.911 | 0.890 | 0.866 | 0.769 | 0.725 | 0.702 | 0.597 | 0.601 | 0.596
Scheme B: m1 = 1, m2 = 1 | 0.423 | 0.411 | 0.396 | 0.392 | 0.390 | 0.370 | 0.363 | 0.357 | 0.342 | 0.334 | 0.327
Scheme B: m1 = 2, m2 = 1 | 0.591 | 0.585 | 0.568 | 0.567 | 0.559 | 0.539 | 0.520 | 0.501 | 0.4766 | 0.449 | 0.425
(Fig. 4: per-attribute degree distributions, p.m.f. versus degree, for the asymmetric homophily networks.)
For m1 = 5, we have η_1^m ≈ 1.144 and η_2^m ≈ 1.056, which approach the 1.1 given by (5). However, the higher value of η_1^m makes the minority slightly more dominant for scheme B.
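The ranking criteria for the strongly heterophilic case can be sketched directly from the approximations (4) and (5); the exact η_1^m, η_2^m would instead require the minimisation of [1, Eq. (4.1)].

```python
# A small sketch of the ranking criteria above, using the large-K heterophilic
# approximations (4)-(5) as a stand-in for the exact (eta_1, eta_2).
def ranking_summary(m1, m2, pi1):
    pi2 = 1.0 - pi1
    total = m1 * pi1 + m2 * pi2
    # Approximations (4): tail parameters phi_a^m for a strongly heterophilic network.
    phi1 = 2.0 * (1.0 - m1 * pi1 / total)
    phi2 = 2.0 * (1.0 - m2 * pi2 / total)
    # Approximation (5): normalised degree sums eta_a^m (equal in the large-K limit).
    eta1 = eta2 = total / 2.0
    return {
        "phi": (phi1, phi2),
        "minority ranks higher (scheme A)": phi1 > phi2,   # heavier minority tail
        "eta": (eta1, eta2),
        "minority ranks higher (scheme B)": eta1 > eta2,
    }

print(ranking_summary(m1=1, m2=1, pi1=0.3))  # phi ~ (1.4, 0.6): minority higher in A
print(ranking_summary(m1=5, m2=1, pi1=0.3))  # phi ~ (0.64, 1.36): majority higher in A
```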
Homogeneous Homophily and Homogeneous Mixing. We consider the cases of a strong homogeneous homophily with κ12 = κ21 = 1 and κ11 = κ22 = K large, and homogeneous mixing with all the elements of the matrix κ equal to 1. As K goes to infinity, the exponents of the tail degree distribution per attribute are equal and behave as

φ_1^m = φ_2^m ≈ 1,   (6)

which also holds in the case of homogeneous mixing. However, we will see that the relative ranking of the minority under scheme A will depend on the ratio m1/m2. Table 8 depicts two homogeneous homophily networks with 25,000 nodes, K = 10, π1 = 0.3, and m vectors (5, 1) and (5, 2), which result in φ_1^m ≈ 1.022, φ_2^m ≈ 0.948 and φ_1^m ≈ 1.003, φ_2^m ≈ 0.997, respectively. A homogeneous mixing network with 25,000 nodes, π1 = 0.35 and m = (2, 1) is also considered. Figure 3 shows the degree distributions per attribute. Although the degree tail exponents look similar in the plots, if m1 is larger than m2, the degrees of minority nodes get a high initial boost. Additionally, from the works [4,8], for multiple attributes there is a “persistence phenomenon”, i.e., the maximal-degree nodes of any attribute type emerge, with high probability, from the oldest nodes of that type added to the network. Therefore, the results in Table 8 show that minority nodes have a higher ranking under scheme A.
On the other hand, as K goes to infinity (homogeneous homophily) and also for homogeneous mixing,

η_a^m ≈ m_a π_a,   a = 1, 2.   (7)
γ | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.1 | 0.15 | 0.20 | 0.3 | 0.4 | 0.5
Scheme A: Hate | 0.778 | 0.778 | 0.716 | 0.704 | 0.689 | 0.637 | 0.588 | 0.533 | 0.430 | 0.356 | 0.304
Scheme A: APS | 0.154 | 0.115 | 0.132 | 0.157 | 0.203 | 0.211 | 0.250 | 0.266 | 0.297 | 0.289 | 0.300
Scheme B: Hate | 0.472 | 0.460 | 0.460 | 0.460 | 0.451 | 0.427 | 0.413 | 0.404 | 0.386 | 0.311 | 0.282
Scheme B: APS | 0.269 | 0.294 | 0.277 | 0.278 | 0.302 | 0.291 | 0.284 | 0.283 | 0.300 | 0.298 | 0.305
For the networks considered with m = (5, 1) and m = (5, 2), the exact values (resp. the approximations in (7)) are η_1^m ≈ 1.534, η_2^m ≈ 0.666 (resp. 1.5 and 0.7), and η_1^m ≈ 1.504, η_2^m ≈ 1.396 (resp. 1.5 and 1.4). For m = (2, 1), the true value and approximation match, with η_1^m ≈ 0.7 and η_2^m ≈ 0.65. Thus, if m1π1 > m2π2, the minority nodes rank higher under scheme B – see Table 8.
In both types of networks, minority nodes can increase their popularity via schemes A and B through a higher ratio m1/m2. In the context of social networks, this corresponds to minorities increasing their social interaction.
Asymmetric Homophily. The last scenario is the case of a strong asymmetric
homophily network (slightly different from Sec. 3), where κ11 = K is large, and
κ22 = κ12 = κ21 = 1. As K tends to infinity,
φ_1^m ≈ (2m1π1 + 3m2π2) / (2m1π1 + 2m2π2),   φ_2^m ≈ m2π2 / (m1π1 + m2π2),   (8)

and

η_1^m ≈ 2m1π1(m1π1 + m2π2) / (2m1π1 + m2π2),   η_2^m ≈ m2π2(m1π1 + m2π2) / (2m1π1 + m2π2).   (9)
Two networks are considered with 25,000 nodes, K = 10 and m vectors (1, 1) and (2, 1) in Table 9. In both networks, φ_1^m > φ_2^m and the minorities rank higher under scheme A. The exact values are φ_1^m ≈ 1.31, φ_2^m ≈ 0.761 (m1 = 1) and φ_1^m ≈ 1.247, φ_2^m ≈ 0.609 (m1 = 2), which are close to the approximations in (8). This also agrees with the degree tail exponents in Fig. 4, with the degree distribution of the minority being more heavy-tailed (higher φ_1^m).
Under scheme B, minorities rank higher with m1 = 2, since 2m1π1 > m2π2 in (9) (also η_1^m ≈ 0.821, η_2^m ≈ 0.479). This means that in a social network, if the arriving majority nodes have an almost neutral attribute preference attachment (κ12 = κ22), the minorities can increase their popularity through the number of outgoing edges that connect to other minority nodes.
A, the minorities rank higher. APS is a scientific network from the American
Physical Society where nodes represent articles from two subfields and edges
represent citations with homogeneous homophily. Some network statistics are: 1281 nodes and 3064 edges. The minority rank is lower in both schemes; the exponents and normalized sums of the degrees are 3.947 and 3.292, and 1.332 and 3.452, respectively, for subfields 1 (minority) and 2 (majority). For these two
real networks, the results on relative ranking of the minority are in line with
those for the synthetic networks.
This paper explored settings where Page-rank and walk-based network sampling
schemes favor small minority attribute nodes compared to uniform sampling.
We also investigated the conditions for the minority nodes to rank higher in
degree-based sampling. To this end, we used an attributed network model with
homophily under several network configurations which provided insight into real-
world networks.
In follow-up work, we plan to compare and contrast the performance of var-
ious centrality measures, including degree and Page-rank centrality, for ranking
and attribute reconstruction tasks in the semi-supervised setting, where one has
partial information on the attributes and wants to reconstruct it for the rest of
the network. For dynamic and evolving networks, in contrast to static networks, preliminary results in [1] seem to suggest starkly different behavior between degree and Page-rank centrality.
References
1. Antunes, N., Banerjee, S., Bhamidi, S., Pipiras, V.: Attribute network mod-
els, stochastic approximation, and network sampling and ranking. arXiv preprint
arXiv:2304.08565v1 (2023)
2. Antunes, N., Bhamidi, S., Guo, T., Pipiras, V., Wang, B.: Sampling based estima-
tion of in-degree distribution for directed complex networks. J. Comput. Graph.
Stat. 30(4), 863–876 (2021)
3. Antunes, N., Guo, T., Pipiras, V.: Sampling methods and estimation of triangle
count distributions in large networks. Netw. Sci. 9(S1), S134–S156 (2021)
4. Banerjee, S., Bhamidi, S.: Persistence of hubs in growing random networks. Probab.
Theor. Relat. Fields 180(3–4), 891–953 (2021)
5. Chebolu, P., Melsted, P.: PageRank and the random surfer model. In: SODA 2008,
pp. 1010–1018 (2008)
6. Crawford, F.W., Aronow, P.M., Zeng, L., Li, J.: Identification of homophily and
preferential recruitment in respondent-driven sampling. Am. J. Epidemiol. 187(1),
153–160 (2018)
7. Espín-Noboa, L., Wagner, C., Strohmaier, M., Karimi, F.: Inequality and inequity
in network-based ranking and recommendation algorithms. Sci. Rep. 12(1), 1–14
(2022)
8. Galashin, P.: Existence of a persistent hub in the convex preferential attachment
model. arXiv preprint arXiv:1310.7513 (2013)
9. Karimi, F., Génois, M., Wagner, C., Singer, P., Strohmaier, M.: Homophily influ-
ences ranking of minorities in social networks. Sci. Rep. 8(1), 1–12 (2018)
10. Merli, M.G., Verdery, A., Mouw, T., Li, J.: Sampling migrants from their social
networks: the demography and social organization of Chinese migrants in Dar es
Salaam, Tanzania. Migr. Stud. 4(2), 182–214 (2016)
11. Mouw, T., Verdery, A.M.: Network sampling with memory: a proposal for more
efficient sampling from social networks. Sociol. Methodol. 42(1), 206–256 (2012)
12. Park, J., Barabási, A.-L.: Distribution of node characteristics in complex networks.
Proc. Natl. Acad. Sci. 104(46), 17916–17920 (2007)
13. Stolte, A., Nagy, G.A., Zhan, C., Mouw, T., Merli, M.G.: The impact of two types
of COVID-19-related discrimination and contemporaneous stressors on Chinese
immigrants in the US South. SSM Ment. Health 2, 100159 (2022)
A Framework for Empirically Evaluating
Pretrained Link Prediction Models
1 Introduction
In recent years, researchers have studied complex networks to understand and
analyze the intricate relationships that underlie various real-world systems. Com-
plex networks, characterized by their non-trivial topological structures, have
applications in diverse fields such as the social sciences, biology, transportation,
and information technology [6]. Understanding the dynamics of these different
types of networks and predicting the formation of new or missing connections,
also known as “link prediction”, is a well-known and well-studied problem in the
field [17]. Link prediction aims to uncover hidden or potential interactions in a
network; for example, to predict who might connect to whom in a social network
or which proteins are likely to interact in a biological network. Furthermore,
link prediction is also used in various other applications and tasks, including
recommender systems, anomaly detection, privacy control, network routing, and
understanding the underlying mechanisms that govern network evolution [9–
11,14].
In the literature, different types of methods for link prediction have been pro-
posed. Initial methods focused on node pair similarities, such as the Jaccard coef-
ficient, Adamic-Adar index, and resource-allocation index [1,17]. Node pair simi-
larity relies on the notion that if a given pair of nodes has a similarity score higher
than some threshold, then this pair is more likely to be connected [11]. Later,
researchers proposed other types of methods, including (i) maximum likelihood-
based methods that work on maximizing the likelihood of the observed structure
so that any missing link can be calculated using the identified rules and param-
eter [22], (ii) probabilistic models based methods that focus on modeling the
underlying network structure and then use the learned model to predict the miss-
ing links [24], (iii) machine learning-based methods that train a machine learning
model based on node pair features for existing and non-existing links [2,5], and
(iv) network embedding-based methods that create a low dimensional represen-
tation of the network using word2vec models or matrix-factorization, and then
train a machine learning model using these vector representation of nodes to
predict missing links [8,16,23]. In the literature, it has been shown that the third category, machine learning-based methods, outperforms the other types of methods and has lately become the focus of link prediction research [12]. An additional
advantage is that the use of topological features of the node pairs ensures the
interpretability and explainability of resulting models through the analysis of
feature importance. However, one limitation of these methods is that a link pre-
diction model must be trained for each new network dataset.
To address this limitation, researchers have used transfer learning, i.e., a machine
learning technique where a model developed for a particular task is reused or
adapted as the starting point for a model on a second task [20]. Instead of training
a new model from scratch, transfer learning leverages the knowledge gained from
solving one problem and applies it to a different but related problem. By using a
pretrained model as a starting point, one can save time and resources compared
to training a new model from the ground up. In this work, we investigate the
feasibility of transfer learning for link prediction in real-world complex networks.
In the remainder of this work, we analyze the characteristics and topology
of 49 networks to understand how they affect the ability to train and predict
links across networks. Specifically, we first propose a framework to perform cross-
validation across multiple datasets, to efficiently test and compare the transfer
learning performance of pretrained models for link prediction. Working towards
automated pretrained model selection, we subsequently investigate what kind of
topological network properties are important for selecting a well-performing pre-
trained model. Finally, we analyze what topological network similarities between
training and testing networks, yield good transfer learning performance. In doing
so, we aim to understand to what extent transfer learning can be applied to pre-
dict unseen links in real-world networks by employing pretrained models.
The structure of the remainder of this paper is as follows. In Sect. 2, we
discuss the approach followed to train our link prediction model, as well as the
framework to test transfer learning. Then, Sect. 3 describes the data, evaluation
criteria used, and the experimental setup developed, as well as the experimental
results. Finally, we draw conclusions and propose future directions of research
in Sect. 4.
2 Methodology
In this section, we first discuss the network features used to train predictive
models for link prediction. Then, we give an overview of machine learning algo-
rithms used to predict missing links and explain how we split the datasets for
training and testing.
2.1 Features
Working towards a machine learning model that takes node pairs as input, and
outputs whether this node pair is likely to be connected in the future, features
that describe these node pairs are required.
In this work, we employ features commonly used in link prediction models,
focusing on the work presented by Bors [3], to design a good link prediction
model and test transfer learning. The chosen features balance simplicity, speed,
and performance. We note that this study aims not to design the best link
prediction model with the most comprehensive set of features, but instead aims
to assess the feasibility of transfer learning in link prediction.
The selected set of features used throughout our experiments is as follows: (i) total neighbors, i.e., the size of the union of the neighbor sets of the source and target nodes; (ii) common neighbors, i.e., the number of nodes connected to both the source and target nodes; (iii) Jaccard coefficient [17], i.e., the ratio between the common and total neighbors; (iv) Adamic-Adar [1], which is used to compute the closeness of nodes based on their shared neighbors; (v) preferential attachment [17,19], i.e., the product of the numbers of neighbors of the source and target nodes; (vi) degree of the source node; (vii) degree of the target node; (viii) ratio of the degrees of the source and target nodes; (ix) triangle count for the source node; and (x) triangle count for the target node, i.e., the number of triangles each is involved in.
3 Experiments
This section covers the experimental setup and results. First, in Sect. 3.1, we
discuss the datasets and metrics used. Then, in Sect. 3.2, we determine the overall
feasibility of using transfer learning for link prediction by studying the AUC
(loss) matrix and distributions. Next, in Sect. 3.3, we discuss what the most
important topological features are that affect the performance of a pretrained
model. Finally, in Sect. 3.4, we determine which structural network similarities
yield good transfer learning performance.
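The cross-network evaluation itself amounts to filling an AUC matrix whose rows are training networks and whose columns are test networks. The sketch below assumes scikit-learn, stubs out the node-pair feature extraction, and uses a random forest purely as an example classifier; the dataset names and the build_pair_dataset helper are hypothetical.

```python
# A hedged sketch of the cross-network evaluation loop: train a classifier on
# one network's node-pair features and score it on another, filling an AUC
# matrix. The exact splits, classifier and settings in our experiments differ.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def build_pair_dataset(network):
    # Placeholder: return (X, y), rows = node-pair feature vectors,
    # y = 1 for held-out edges and 0 for sampled non-edges.
    rng = np.random.default_rng(abs(hash(network)) % 2**32)
    return rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)

networks = ["net_A", "net_B", "net_C"]           # hypothetical dataset names
auc = np.zeros((len(networks), len(networks)))
for i, source in enumerate(networks):
    X_train, y_train = build_pair_dataset(source)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    for j, target in enumerate(networks):
        X_test, y_test = build_pair_dataset(target)
        auc[i, j] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(np.round(auc, 3))                          # rows: training net, cols: test net
```

Assuming the AUC loss is defined relative to within-network performance, it can then be read off by comparing each off-diagonal entry with the corresponding diagonal entry of this matrix.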
Fig. 1. Cross-validation training assignments. For each split, we generate a training set
with 75% of the data and a validation set with the remaining 25%. (Letters indicate
the dataset, numbers indicate the split.)
Table 1. Datasets sourced from [13], along with the number of nodes and edges.
Fig. 2. Distributions of the AUC scores and AUC loss across all networks. Alongside
the mean and median, the quartiles are depicted with light grey lines. The blue line
approximates the trend.
distance and clustering coefficients tend to result in higher AUC scores. On the
contrary, low-to-middle transitivity with higher clustering coefficients and high
transitivity with small mean distances or small clustering coefficients result in
low AUC scores. As such, a high transitivity or clustering coefficient alone is not universally good or bad for the transfer learning link prediction performance of
a pretrained model. However, note that clustering coefficient and transitivity
are usually correlated, so a network with a high clustering coefficient will likely
not have low transitivity. Interestingly, our previous observation from Fig. 4b
indicates that high AUC scores are obtained when these topological features are
indeed correlated for a network, and low AUC scores are obtained when they
are not.
In short, we find that some of the most important topological features influ-
encing the performance of pretrained models are the number of triangles (per
link), the transitivity, and the clustering coefficient. Notably, the level of corre-
lation between these features can be especially indicative of the resulting high
or low AUC scores of a pretrained model.
4 Conclusion
In this work, we studied the feasibility of using pretrained link prediction mod-
els in complex networks. Moreover, we studied the network characteristics that
impact model training, and how these can be used for selecting a well-performing
pretrained model. We conducted experimental analysis on a large corpus of struc-
turally diverse networks, including co-authorship, citation, friendship, human
interaction, biological, and transportation networks. Through our experiments,
we observed that transfer learning for link prediction is a feasible way to move forward, and that some network categories perform better as sources for training while others are better suited as targets for predicting missing links. Furthermore, we found that network features
based on local connectivity, such as clustering coefficient, number of triangles,
or transitivity, are important indicators when picking a network for training a
predictive model. Specifically, we found that when two networks show very dis-
similar topologies in terms of clustering coefficient, but also in terms of degree
assortativity, Gini coefficient, and transitivity, it is likely that the performance
of transfer learning is hindered.
This work demonstrates the feasibility of using pretrained models in link pre-
diction. Future work could focus on designing better transfer learning methods
to achieve higher accuracy using topological properties of an unseen network and
the network used for pre-training. Additionally, this work opens an avenue to
use transfer learning for complex network problems, such as node classification,
role identification, and influence maximization.
References
1. Adamic, L.A., Adar, E.: Friends and neighbors on the web. Soc. Netw. 25(3),
211–230 (2003)
2. Al Hasan, M., Chaoji, V., Salem, S., Zaki, M.: Link prediction using supervised
learning. In: Workshop on Link Analysis, Counter-Terrorism and Security, SDM
2006, vol. 30, pp. 798–805 (2006)
3. Bors, P.P.: Topology-aware network feature selection in link prediction (2022)
4. de Bruin, G.J., Veenman, C.J., van den Herik, H.J., Takes, F.W.: Supervised tem-
poral link prediction in large-scale real-world networks. Soc. Netw. Anal. Min.
11(1), 1–16 (2021)
5. van Engelen, J.E., Boekhout, H.D., Takes, F.W.: Explainable and efficient link
prediction in real-world network data. In: Boström, H., Knobbe, A., Soares, C.,
Papapetrou, P. (eds.) IDA 2016. LNCS, vol. 9897, pp. 295–307. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-46349-0_26
6. Estrada, E.: The Structure of Complex Networks: Theory and Applications. Oxford
University Press, USA (2012)
7. Ghasemian, A., Hosseinmardi, H., Galstyan, A., Airoldi, E.M., Clauset, A.: Stack-
ing models for nearly optimal link prediction in complex networks. Proc. Natl.
Acad. Sci. 117(38), 23393–23400 (2020)
8. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, pp. 855–864 (2016)
9. Guimerà, R., Sales-Pardo, M.: Missing and spurious interactions and the recon-
struction of complex networks. Proc. Natl. Acad. Sci. 106(52), 22073–22078 (2009)
10. Huang, Z., Zeng, D.D.: A link prediction approach to anomalous email detection.
In: 2006 IEEE International Conference on Systems, Man and Cybernetics, vol. 2,
pp. 1131–1136. IEEE (2006)
11. Kumar, A., Singh, S.S., Singh, K., Biswas, B.: Link prediction techniques, applica-
tions, and performance: a survey. Phys. A Stat. Mech. Appl. 553, 124289 (2020)
12. Kumari, A., Behera, R.K., Sahoo, K.S., Nayyar, A., Kumar Luhach, A., Prakash
Sahoo, S.: Supervised link prediction using structured-based feature extraction in
social network. Concurrency Comput. Pract. Exp. 34(13), e5839 (2022)
13. Kunegis, J., Staab, S., Dünker, D.: KONECT – the Koblenz network collection.
In: Proceedings of the International School and Conference on Network Science
(2012)
14. Li, J., Zhang, L., Meng, F., Li, F.: Recommendation algorithm based on link pre-
diction and domain knowledge in retail transactions. Procedia Comput. Sci. 31,
875–881 (2014)
15. Li, Y., Liu, X., Wang, C.: Research on link prediction under the structural features
of attention stream network. In: 2021 IEEE Asia-Pacific Conference on Image
Processing, Electronics and Computers (IPEC), pp. 148–154. IEEE (2021)
16. Li, Y., Wang, Y., Zhang, T., Zhang, J., Chang, Y.: Learning network embedding
with community structural information. In: IJCAI, pp. 2937–2943 (2019)
17. Liben-Nowell, D., Kleinberg, J.: The link prediction problem for social networks.
In: Proceedings of the 12th International Conference on Information and knowledge
management, pp. 556–559 (2003)
18. Liu, Z., Zhang, Q.M., Lü, L., Zhou, T.: Link prediction in complex networks: a
local Naïve Bayes model. Europhys. Lett. 96(4), 48007 (2011)
19. Newman, M.E.: Clustering and preferential attachment in growing networks. Phys.
Rev. E 64(2), 025102 (2001)
20. Niu, S., Liu, Y., Wang, J., Song, H.: A decade survey of transfer learning (2010–
2020). IEEE Trans. Artif. Intell. 1(2), 151–166 (2020)
21. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
22. Redner, S.: Teasing out the missing links. Nature 453(7191), 47–48 (2008)
23. Saxena, A., Fletcher, G., Pechenizkiy, M.: NodeSim: node similarity based network
embedding for diverse link prediction. EPJ Data Sci. 11(1), 24 (2022)
24. Wang, C., Satuluri, V., Parthasarathy, S.: Local probabilistic models for link pre-
diction. In: Seventh IEEE International Conference on Data Mining, ICDM 2007,
pp. 322–331. IEEE (2007)
Masking Language Model Mechanism
with Event-Driven Knowledge Graphs
for Temporal Relations Extraction from Clinical
Narratives
Abstract. For many natural language processing systems, the extraction of tem-
poral links and associations from clinical narratives has been a critical challenge.
To understand such processes, we must be aware of the occurrences of events and
their time or temporal aspect by constructing a chronology for the sequence of
events. The primary objective of temporal relation extraction is to identify rela-
tionships and correlations between entities, events, and expressions. We propose a
novel architecture that leverages a Transformer-based graph neural network, combining
textual data with event graph embeddings to predict temporal links across events,
entities, document creation times, and expressions. We demonstrate our preliminary
findings on the i2b2 temporal relations corpus for predicting BEFORE, AFTER, and
OVERLAP links, using an event graph containing the correct set of relations. Various
biomedical BERT embedding types were benchmarked within our methodology, with the
best performance obtained by PubMedBERT combined with a language model masking (LMM)
mechanism. This illustrates the effectiveness of our proposed strategy.
1 Introduction
It is crucial to extract temporal information from clinical narratives about events, expressions, and their occurrences to better comprehend the past and, to the extent possible, predict the future. A clinical event is anything that is pertinent to the clinical timeline,
such as clinical concepts, entities, etc., and especially in the medical domain, there are a
vast number of texts almost ready to be exploited. The foundation for performing tem-
poral relationship tasks in NLP has traditionally been temporal events and expressions.
Temporal information, such as dates, time expressions, durations, and intervals, allows
for tracking disease progression, treatment timelines, and event sequencing. Events that
occur within a clinical context, such as diagnoses, treatments, procedures, or laboratory
tests, have both temporal and spatial aspects. Clinical reports often contain explicit tem-
poral expressions such as dates, durations, time intervals and extracting these expressions
is the first step in identifying temporal information. For instance, identifying phrases like
“two weeks ago” or “since last year” as temporal expressions. Once temporal expres-
sions are identified, the next step is to establish relationships between different events or
findings mentioned in the clinical reports. This involves determining the order of events,
durations, or time intervals between them. For instance, determining whether a specific
treatment occurred before or after a diagnosis or the duration between two laboratory
test results.
2 Related Work
3 Methodology
To enhance the performance of temporal relation extraction, which we frame as a classification problem, additional shared information is included and represented as an event graph of already discovered relationships between events. An event graph is a graph structure in which events are represented as nodes and temporal links as directed edges. The events in the text can thus be connected and represented as a graph, in which the existing relations serve as additional features, since they record which relationships are typical among different event types. A pretrained language model and a knowledge graph are leveraged to derive two sets of relations based on the event embeddings of the input text. Thus, the model may learn rules that apply to adjacent relations, such as transitivity. Following that, the model builds a classifier on each pair of event embeddings and then combines both predictions to create a single relation prediction. We utilize EntityBERT [16] for text encoding because it incorporates domain knowledge into the learning process by masking entities as a whole and shows superior results on downstream clinical extraction tasks, such as negation detection, document time relation classification, and temporal relation extraction. Our temporal relation extraction model was constructed following the Deep Graph Library (DGL) guidelines and is approached as a link prediction problem, classifying whether or not two nodes are connected by an edge. We utilize a Relational Graph Convolutional Network (R-GCN) [22] for temporal relation prediction, as it allows the representation of multiple relations along edges.
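As a rough illustration of this setup (our own sketch, not the authors' implementation), a two-layer relational graph convolution encoder can be written in DGL with a PyTorch backend as follows; the class name and layer sizes are illustrative.

```python
# Illustrative two-layer relational graph convolution encoder (DGL, PyTorch backend).
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import RelGraphConv

class TRGCNEncoder(nn.Module):
    def __init__(self, in_feats, hidden_feats, out_feats, num_rels):
        super().__init__()
        # First layer projects the input node features (e.g. BERT embeddings) into hidden space.
        self.conv1 = RelGraphConv(in_feats, hidden_feats, num_rels)
        # Second layer produces the node representations used to score candidate edges.
        self.conv2 = RelGraphConv(hidden_feats, out_feats, num_rels)

    def forward(self, g, feats, etypes):
        h = F.relu(self.conv1(g, feats, etypes))
        return self.conv2(g, h, etypes)
```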
In our study, temporal relations are predicted as links using a parameterized score
function to reconstruct an edge using an autoencoder architecture. Our Temporal Rela-
tional Graph Convolutional Network (TR-GCN) aggregates incoming messages and
generates new node representations for each node, calculating outgoing messages for
each node using the node representation and an edge type-specific weight matrix. A
two-layered Temporal Relation Graph Convolutional Network (TR-GCN) allows the
representation of multiple relations along edges by encoding the graph input that is opti-
mized for distinguishing temporal links between events. The first TR-GCN layer served
as the input layer followed by projected features (BERT embeddings) into hidden space.
Negative sampling methodology is used to compare the scores of nodes connected by
edges to the scores of any two random pairs of nodes under the assumption that nodes
connected by edges will receive a higher score than nodes that are not connected. To achieve this, a negative graph is created during the training loop, containing ‘n’ negative examples for each positive edge; a pairwise dot product predictor then computes the dot product between node embeddings to calculate the relevance score of each edge. Nodes connected by an edge should have a higher score than nodes
that are not connected. In Fig. 2, we display the proposed event-driven knowledge graph
(KG) based TR-GCN architecture for temporal relationship extraction. A significant
benefit is that we were able to classify using just one type of input: for instance, with either text alone or the knowledge graph alone, our architecture can still capture and categorize temporal relations.
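A hedged sketch of the pairwise dot-product predictor described above, assuming node embeddings produced by the graph encoder; the function name is ours.

```python
# Score each edge as the dot product of its endpoint embeddings (illustrative sketch).
import dgl.function as fn

def score_edges(g, node_embeddings):
    """Return a relevance score for every edge in graph g."""
    with g.local_scope():
        g.ndata["h"] = node_embeddings
        g.apply_edges(fn.u_dot_v("h", "h", "score"))
        return g.edata["score"].squeeze(-1)
```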
(TIMEXs), and SECTIMEs (the patient’s arrival and departure times) are present in the
i2b2 dataset. We use the BEFORE, AFTER, and OVERLAP links from 310 discharge
summaries annotated with temporal information.
We used the Python library Deep Graph Library (DGL) [26] to convert each dis-
charge summary into a graph, where EVENTs, TIMEXs, and SECTIMEs serve as
the nodes and the BEFORE, AFTER, and OVERLAP links serve as the edges. The
SAME_SENTENCE relationship connected all the nodes that were in the same sen-
tence in the raw clinical narratives/discharge summaries. This fourth type of link, which
was not included in the i2b2, was added because we thought it would improve the model’s
prediction accuracy. By doing this, we made sure that when making predictions, we had
a relationship that could be automatically added to fresh graphs that still lacked the
BEFORE, AFTER and OVERLAP relationships. Graph design is streamlined by giving each node a single type (an “entity”) and using node features to store information about the actual types (EVENT, TIMEX, and SECTIME) in a one-hot-encoded vector.
Because our graph had three different types of edges in addition to only having one type
of node, it was heterogeneous. Then we added BERT contextual embeddings from each
token of the plain text reports to our nodes to preserve the contextual information from
the raw clinical discharge summary. For multi-token entities, the context is preserved by taking the mean of the embeddings of the entity's component tokens. Eventually, our node attributes had 789 dimensions: 786 dimensions from the BERT embedding vector and 3 dimensions from the one-hot-encoded vector representing the entity type.
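A minimal sketch of assembling such a node attribute vector, assuming the PubMedBERT checkpoint listed in the footnotes. For brevity the entity phrase is encoded in isolation here, whereas in the paper the token embeddings are taken from the full report context.

```python
# Illustrative node feature: mean of the entity's BERT token embeddings + one-hot entity type.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
ENTITY_TYPES = ["EVENT", "TIMEX", "SECTIME"]
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def node_feature(entity_text: str, entity_type: str) -> torch.Tensor:
    inputs = tokenizer(entity_text, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**inputs).last_hidden_state[0]  # (num_tokens, hidden_size)
    pooled = token_embeddings.mean(dim=0)                          # mean over the entity's tokens
    one_hot = torch.zeros(len(ENTITY_TYPES))
    one_hot[ENTITY_TYPES.index(entity_type)] = 1.0
    return torch.cat([pooled, one_hot])                            # hidden_size + 3 dimensions

print(node_feature("two weeks ago", "TIMEX").shape)
```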
We wanted to see whether the quality of the BERT embeddings used as attributes changed after performing Language Model Masking (LMM) on the original models, in addition to testing their effectiveness by contrasting the embeddings of six models. Four models, namely ClinicalBERT1, PubMedBERT2, BlueBERT3 and SciBERT4, were obtained directly from the Hugging Face Models Hub5; they were initially trained on PubMed articles, clinical narratives, MIMIC notes, and electronic health records. Unlike the PubMedBERT (PMB) model, all other models use embeddings with a dimension of 1024 rather than 789. The effectiveness of our strategy is demonstrated by conducting an LMM
on the PubMedBERT model. This led us to run the PubMedBERT_LMM on Google
Colaboratory Notebook (PMB-LMM-GC) and on our servers (PMB-LMM-Serv). We
masked 20% of the tokens in the i2b2 dataset for the LMM task and employed the
AdamW optimizer with a batch size of 64 and a learning rate of 5e−5. All models other than PubMedBERT showed decreased performance on these metrics after 2 epochs; hence, we continued the LMM training only for PubMedBERT.
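The LMM step can be sketched with the Hugging Face Trainer using the hyperparameters stated above (20% masking, AdamW, batch size 64, learning rate 5e−5, 2 epochs); this is an illustrative reconstruction, and the placeholder `i2b2_texts` stands in for the de-identified i2b2 narratives, which cannot be reproduced here.

```python
# Minimal masked-language-model fine-tuning sketch (not the authors' training script).
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

i2b2_texts = ["Patient admitted two weeks ago with chest pain ..."]  # placeholder narratives
dataset = Dataset.from_dict({"text": i2b2_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)
args = TrainingArguments(output_dir="pubmedbert_lmm", per_device_train_batch_size=64,
                         learning_rate=5e-5, num_train_epochs=2)  # AdamW is the Trainer default

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```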
We divided our 310 graphs from i2b2 2012 into a train set (80%), a test set (10%), and a validation set comprising the remaining 10% of the data before training our model. The model's two primary components are built to predict and categorize the temporal relations from the text and graph inputs, respectively. We first concatenate the two prediction vectors, after which we run them through two graph convolution layers that compute the combined prediction, where the model learns to trust one portion of the model over the other. The model is trained in three stages. First, by simply using one portion of the network to identify temporal connections, we independently train the text and graph components of the model. This enables us to precisely adjust the training settings for each component separately. Once both network segments have been trained, we continue training using the complete model. Batches of 20 graphs are used to prevent memory issues during the training phase, and each input graph is treated as a separate and distinct component of the batched graph in DGL.
The negative graph is built iteratively during the training loop, and the margin loss is calculated. We also changed the original negative graph function so that each subgraph only receives negative examples from its own subgraph, because the DGL guidelines do not use batched graphs. DGL allows us to predict one type of relationship at a time; hence, we separately trained our model to predict BEFORE, AFTER, and OVERLAP links. Our goal is to compare 6 different types of embeddings, and we set some hyperparameters to establish the baseline comparison. Five negative samples are generated for each positive edge, and we used hidden and output dimensions of 1280 for the BlueBERT, SciBERT, and ClinicalBERT models, and 1024 for the PubMedBERT model. We trained the model in a Google Colaboratory notebook for 50 epochs with a training batch size of 20 graphs and later optimized some of the hyperparameters for the selected model; the captured metrics are presented in Table 3.
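The per-subgraph negative sampling and the margin loss can be sketched as follows; this is an illustrative reconstruction under the assumptions stated in the text (five negatives per positive edge, negatives drawn within each subgraph of the batch), not the authors' code.

```python
# Illustrative per-subgraph negative sampling and margin loss for the batched DGL graphs.
import dgl
import torch

def negative_graph_per_subgraph(batched_graph, k):
    """Sample k negative edges per positive edge, restricted to each subgraph of the batch."""
    neg_graphs = []
    for g in dgl.unbatch(batched_graph):
        src, _ = g.edges()
        neg_src = src.repeat_interleave(k)
        neg_dst = torch.randint(0, g.num_nodes(), (len(src) * k,))
        neg_graphs.append(dgl.graph((neg_src, neg_dst), num_nodes=g.num_nodes()))
    return dgl.batch(neg_graphs)

def margin_loss(pos_score, neg_score, k, margin=1.0):
    """Each positive edge should score at least `margin` above its k negative counterparts."""
    neg_score = neg_score.view(-1, k)
    return (margin - pos_score.view(-1, 1) + neg_score).clamp(min=0).mean()
```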
1 emilyalsentzer/Bio_ClinicalBERT.
2 microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext.
3 bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16.
4 allenai/scibert_scivocab_uncased.
5 Models - Hugging Face
Table 1. Metrics for different input variants on our model for temporal relation extraction on i2b2
2012 test set
increased in the test set. For the BEFORE, AFTER, and OVERLAP relations, AUC reached what we consider remarkable levels after 50 epochs. When comparing the two original models, as shown in Table 3, all BERT language models' loss decreased slightly while achieving a slightly higher AUC than PubMedBERT; we believe this is because their embeddings are larger in size. Additionally, when comparing the performance of the original PubMedBERT model with that of our LMM strategy, the latter two variants improved both the training loss and the evaluation AUC of the former. Notably, out of the six models, the one trained on our server (Nvidia RTX 3090 24 GB) achieved the best loss and AUC values across the relationship types. Furthermore, only this final model exceeded 97%, 96%, and 87% eval AUC for the BEFORE, AFTER and OVERLAP links, respectively.
Table 2. Comparison of Baseline vs our model’s performance on i2b2 Clinical Narratives for
Temporal Relation Extraction
Models F1-Score
Ours
Text input-based temporal relation prediction model 81.2%
Graph input-based temporal relation prediction model 82.1%
Text + Graph based combined prediction model 85.7%
Baseline models
SOTA [7] 73.7%
ELMo [7] 71.2%
BERT (base) [7] 76.4%
BERT (large) [7] 73.9%
BioALBERT [7] 76.86%
BERT-Linear layer classifier [8] 78.6%
BERT-Linear layer with soft logic regularizer [8] 80.2%
We believe that the findings presented in the preceding section demonstrate both of
our hypotheses. First and foremost, it has been demonstrated that Graph Neural Networks
and BERT embeddings work well together and produce impressive results when it comes
to the prediction of temporal relationships in the clinical domain, which opens numerous
avenues for further research in this area. Second, after performing a Masked Language
Modelling on the original models, the performance of BERT embeddings improved.
The OVERLAP relationship continues to present one of the greatest challenges in predicting temporal relationships today; it is much harder to predict than the BEFORE and AFTER relationships, which are more closely tied to the linear nature of time in the text and therefore easier to predict. This seems to explain the significant difference between the results for BEFORE, AFTER, and OVERLAP.
The main issue with our analysis is that, for two main reasons, we are unable to
draw a direct comparison with earlier work on the prediction of temporal relationships
from the i2b2 dataset. First, since it was a preliminary study, we only looked at three of
the eight relationships identified in the dataset because we thought they were the most
fundamental and a good place to start. Second, because the majority of edges in graph data
are negative, metrics used in earlier studies of temporal link prediction (accuracy, recall,
and F-score) may include noise when predicting links, so DGL suggests using AUC to
assess these models. Despite this, we believe that the model’s performance, both during
training and evaluation, is exceptional. We chose three of the eight relationship types for our research due to their fundamental significance and manageable scope. Additionally, the
use of AUC as an evaluation metric aligns with the dataset’s predominant negative edges.
We are optimistic about the possibilities for the future because the model’s contin-
uous improvement during the training loop and the impressive results in the evaluation
graphs provide a great area for further research. This model uses embeddings from Pub-
MedBERT after performing an LMM on the i2b2 dataset on our server. We intend to select the most effective model, which currently appears to be the PubMedBERT_LMM_Serv one, continue optimising it, and combine it with our own medical NER system as a first step towards creating a timeline from any given clinical record.
5 Conclusion
We introduce a novel architecture that extracts temporal relationships from text, utilises the nature of graphs, and further leverages the power of BERT embeddings adapted to the clinical domain, which offer great potential when working with temporal relations. Our preliminary experiments show that the proposed architecture greatly outperforms the baseline models, which we attribute to the combined use of information present in the text and information
about other relations captured in a knowledge graph. The current limitation of the pro-
posed approach is that it relies on a knowledge graph to contain correct relations between
events. In real-world scenarios, the relations would likely contain errors, as they would
come from previously extracted information. When addressing longevity, this early examination of the temporality of clinical narratives can be incredibly helpful. Our future scope is to improve the real-world usability of the proposed architecture and assess it in more scenarios as part of our ongoing study. A patient's medical record can be used to create a timeline of their history, which can be used to reason about both their future and their past. When considering multiple patients, this benefit becomes more apparent: a large collection of clinical texts with their corresponding illnesses, cures, and side effects, all temporally ordered, can aid in forecasting and, consequently, improve the survival prospects of new patients.
Acknowledgement. The authors acknowledge the AIDAVA project financed by Horizon Europe:
EU HORIZON-HLTH-2021-TOOL-06-03 and ANTIDOTE project financed by CHIST-ERA and
FWO.
Availability of data: https://www.i2b2.org/NLP/DataSets/Main.php.
References
1. Alfattni, G., Peek, N., Nenadic, G.: Extraction of temporal relations from clinical free text:
a systematic review of current approaches. J. Biomed. Inform. 108(2020), 103488 (2020).
https://doi.org/10.1016/j.jbi.2020.103488
2. Alfattni, G., Peek, N., Nenadic, G.: Attention-based bidirectional long short-term memory
networks for extracting temporal relationships from clinical discharge summaries. J. Biomed.
Inform. 123(2021), 103915 (2021). https://doi.org/10.1016/j.jbi.2021.103915
3. Galvan, D., Okazaki, N., Matsuda, K., Inui, K.: Investigating the challenges of temporal
relation extraction from clinical text. In: Lavelli, A., Minard, A.-L., Rinaldi, F. (eds.) Pro-
ceedings of the Ninth International Workshop on Health Text Mining and Information Anal-
ysis, Louhi@EMNLP 2018, Brussels, Belgium, 31 October 2018, pp. 55–64. Association for
Computational Linguistics (2018). https://doi.org/10.18653/v1/w18-5607
4. Guan, H., Li, J., Xu, H., Devarakonda, M.V.: Robustly pre-trained neural model for direct
temporal relation extraction. In: 9th IEEE International Conference on Healthcare Informatics,
ICHI 2021, Victoria, BC, Canada, 9–12 August 2021, pp. 501–502. IEEE (2021). https://doi.
org/10.1109/ICHI52183.2021.00090
5. Gumiel, Y.B., et al.: Temporal relation extraction in clinical texts: a systematic review. ACM
Comput. Surv. 54(7), 144:1–144:36 (2022). https://doi.org/10.1145/3462475
6. Han, R., Hsu, I.-H., Yang, M., Galstyan, A., Weischedel, R.M., Peng, N.: Deep structured
neural network for event temporal relation extraction. In: Bansal, M., Villavicencio, A. (eds.)
Proceedings of the 23rd Conference on Computational Natural Language Learning, CoNLL
2019, Hong Kong, China, 3–4 November 2019, pp. 666–106. Association for Computational
Linguistics (2019). https://doi.org/10.18653/v1/K19-1062
7. Han, R., Ning, Q., Peng, N.: Joint event and temporal relation extraction with shared represen-
tations and structured prediction. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of
the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong
Kong, China, 3–7 November 2019, pp. 434–444. Association for Computational Linguistics
(2019). https://doi.org/10.18653/v1/D19-1041
8. Ul Haq, H., Kocaman, V., Talby, D.: Deeper clinical document understanding using relation
extraction. CoRR abs/2112.13259 (2021). arXiv:2112.13259
9. Lee, H.-J., Zhang, Y., Jiang, M., Xu, J., Tao, C., Xu, H.: Identifying direct temporal relations
between time and events from clinical notes. BMC Med. Inform. Decis. Mak. 18(S-2), 23–34
(2018). https://doi.org/10.1186/s12911-018-0627-5
10. Leeuwenberg, A., Moens, M.-F.: Structured learning for temporal relation extraction from
clinical records. In: Lapata, M., Blunsom, P., Koller, A. (eds.) Proceedings of the 15th Con-
ference of the European Chapter of the Association for Computational Linguistics, EACL
2017, Volume 1: Long Papers, Valencia, Spain, 3–7 April 2017, pp. 1150–1158. Association
for Computational Linguistics (2017). https://doi.org/10.18653/v1/e17-1108
11. Leeuwenberg, A., Moens, M.-F.: Temporal information extraction by predicting relative time-
lines. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Con-
ference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 Octo-
ber–4 November 2018, pp. 1237–1246. Association for Computational Linguistics (2018).
https://doi.org/10.18653/v1/d18-1155
12. Lin, C., Miller, T., Dligach, D., Bethard, S., Savova, G.: A BERT-based universal model for
both within-and cross-sentence clinical temporal relation extraction. In: Proceedings of the
2nd Clinical Natural Language Processing Workshop, pp. 65–71 (2019)
13. Lin, C., Miller, T., Dligach, D., Sadeque, F., Bethard, S., Savova, G.: A BERT-based one-pass
multi-task model for clinical temporal relation extraction (2020)
14. Lin, C., Miller, T.A., Dligach, D., Amiri, H., Bethard, S., Savova, G.: Self-training improves
recurrent neural networks performance for temporal relation extraction. In: Lavelli, A.,
Minard, A.-L., Rinaldi, F. (eds.) Proceedings of the Ninth International Workshop on Health
Text Mining and Information Analysis, Louhi@EMNLP 2018, Brussels, Belgium, 31 Octo-
ber 2018, pp. 165–176. Association for Computational Linguistics (2018). https://doi.org/10.
18653/v1/w18-5619
15. Lin, C., Miller, T.A., Dligach, D., Bethard, S., Savova, G.: Representations of time expressions
for temporal relation extraction with convolutional neural networks. In: Cohen, K.B., Demner-
Fushman, D., Ananiadou, S., Tsujii, J. (eds.) BioNLP 2017, Vancouver, Canada, 4 August
2017, pp. 322–327. Association for Computational Linguistics (2017). https://doi.org/10.
18653/v1/W17-2341
16. Lin, C., Miller, T.A., Dligach, D., Bethard, S., Savova, G.: EntityBERT: entity-centric masking
strategy for model pretraining for the clinical domain. In: Demner-Fushman, D., Cohen, K.B.,
Ananiadou, S., Tsujii, J. (eds.) Proceedings of the 20th Workshop on Biomedical Language
Processing, BioNLP@NAACL-HLT 2021, Online, 11 June 2021, pp. 191–201. Association
for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.bionlp-1.21
17. Man, H., Ngo, N.T., Van, L.N., Nguyen, T.H.: Selecting optimal context sentences for event-
event relation extraction. In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI
2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI
2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI
2022, Virtual Event, 22 February–1 March 2022, pp. 11058–11066. AAAI Press (2022).
https://ojs.aaai.org/index.php/AAAI/article/view/21354
18. Mathur, P., Jain, R., Dernoncourt, F., Morariu, V.I., Tran, Q.H., Manocha, D.: TIMERS:
document-level temporal relation extraction. In: Zong, C., Xia, F., Li, W., Navigli, R. (eds.)
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP
2021, Volume 2: Short Papers, Virtual Event, 1–6 August 2021, pp. 524–533. Association for
Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-short.67
19. Ning, Q., Feng, Z., Roth, D.: A structured learning approach to temporal relation extraction.
In: Palmer, M., Hwa, R., Riedel, S. (eds.) Proceedings of the 2017 Conference on Empiri-
cal Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, 9–11
September 2017, pp. 1027–1037. Association for Computational Linguistics (2017). https://
doi.org/10.18653/v1/d17-1108
20. Peng, Y., Yan, S., Lu, Z.: Transfer learning in biomedical natural language processing: an
evaluation of BERT and ELMo on ten benchmarking datasets. In: Demner-Fushman, D.,
Cohen, K.B., Ananiadou, S., Tsujii, J. (eds.) Proceedings of the 18th BioNLP Workshop and
Shared Task, BioNLP@ACL 2019, Florence, Italy, 1 August 2019, pp. 58–65. Association
for Computational Linguistics (2019). https://doi.org/10.18653/v1/w19-5006
21. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation.
In: Moschitti, A., Pang, B., Daelemans, W., (eds.) Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing, EMNLP 2014, A meeting of SIGDAT, a
Special Interest Group of the ACL, 25–29 October 2014, Doha, Qatar, pp. 1532–1543. ACL
(2014). https://doi.org/10.3115/v1/d14-1162
22. Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling
relational data with graph convolutional networks. In: Gangemi, A., et al. (eds.) ESWC 2018.
LNCS, vol. 10843, pp. 593–607. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-
93417-4_38
23. Sun, W., Rumshisky, A., Uzuner, O.: Evaluating temporal relations in clinical text: 2012 i2b2
challenge. J. Am. Med. Inf. Assoc. 20(5), 806–813 (2013). https://doi.org/10.1136/amiajnl-
2013-001628
24. Tourille, J., Ferret, O., Névéol, A., Tannier, X.: Neural architecture for temporal relation
extraction: a Bi-LSTM approach for detecting narrative containers. In: Barzilay, R., Kan,
M.-Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics, ACL 2017, Volume 2: Short Papers, Vancouver, Canada, 30 July–4 August,
pp. 224–230. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/
P17-2035
25. Wang, L., Li, P., Xu, S.: DCT-centered temporal relation extraction. In: Calzolari, N., et al.
(eds.) Proceedings of the 29th International Conference on Computational Linguistics, COL-
ING 2022, Gyeongju, Republic of Korea, 12–17 October 2022, pp. 2087–2097. International
Committee on Computational Linguistics (2022). https://aclanthology.org/2022.coling-1.182
26. Wang, M., et al.: Deep graph library: a graph-centric, highly-performant package for graph
neural networks. arXiv preprint arXiv:1909.01315 (2019)
27. Zhang, S., Ning, Q., Huang, L.: Extracting temporal event relation with syntax-guided graph
transformer. In: Carpuat, M., de Marneffe, M.-C., Ruíz, I.V.M. (eds.) Findings of the Associ-
ation for Computational Linguistics, NAACL 2022, Seattle, WA, United States, 10–15 July
2022, pp. 379–390. Association for Computational Linguistics (2022). https://doi.org/10.
18653/v1/2022.findings-naacl.29
28. Zhao, X., Lin, S.-T., Durrett, G.: Effective distant supervision for temporal relation extraction.
CoRR abs/2010.12755. arXiv arXiv:2010.12755 (2020)
29. Zhou, Y., et al.: Clinical temporal relation extraction with probabilistic soft logic regularization
and global inference. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI
2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI
2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI
2021, Virtual Event, 2–9 February 2021, pp. 14647–14655. AAAI Press (2021). https://ojs.aaai.org/index.php/AAAI/article/view/17721
Machine Learning and Networks
Efficient Approach for Patient Monitoring:
ML-Enabled Framework with Smart
Connected Systems
G. Dheepak(B)
1 Introduction
Patients are subjected to adverse conditions due to insufficient/inefficient observation
methods in hospitals, which may cause their treatment to deteriorate or even endanger
their lives. Medical negligence is one of the most alarming causes of patient death. A
survey by Nursing Times found that one in five nurses “rarely” or “never” monitored their
wards [1, 2]. Human errors lead to about 5.2 million deaths in India every year. In the US,
about 44,000 to 98,000 people are affected. This is not due to the lack of medical skill
or knowledge of doctors, but rather the lack of team coordination, observation strategies
and communication [3]. Based on this problem statement, we identify three cases and
propose a solution. The solution is motivated by observing the multiple challenges faced
by patients. First, due to inadequate observation by doctors, nurses or workers, there is
a delay in checking each patient, which poses a serious risk to their lives in hospitals. In
domestic scenarios, e.g. at home, elderly people and infants need regular supervision.
Second, in recent years, many patients have disappeared from hospitals while under treatment due to a misunderstanding of their disease or condition. In addition, special and constant care is required for mentally unstable patients, as they tend to escape from their wards [4]. Hospitals also witness infant abductions, and this trend is persistent [5]. Moreover, studies show that there is a significant impact due to the lack of pre-analysis of health by patients and doctors. Therefore, this
paper proposes an efficient way to overcome the limitations of the current approach by
developing a connected smart system for children, senior citizens, especially differently-
abled individuals, or anyone who needs supervision, such that the concerned person is
notified regularly without any service interruption and the patient data is processed and
real-time prediction is performed. A unique feature of this proposal is the range detection
technique, where the patient is given a movement limit beyond which the system raises an alert. This is achieved by a Bluetooth master-slave system that warns the concerned person if the person under observation crosses a specified boundary. The
solution is based on the field of IoT, Data Analytics, and Machine Learning and the scope
is feasible in the domain of smart home automation, security and data analytics. This
paper is structured as follows. The introduction is followed by a brief overview of related research in Sect. 2. Section 3 presents the categorical approach for classifying cases and the design methodology. Section 4 describes the complete hardware implementation, followed by the software implementation in Sect. 5. Section 6 discusses the mechanism for the ML model and data analysis, covering the various models used for training and prediction. Section 7 illustrates the entire workflow of the proposal, followed by system deployment and testing in Sect. 8 and the conclusion in Sect. 9.
2 Related Work
The field of remote patient monitoring is rapidly evolving, with various applications
and methods emerging in domains such as healthcare [6], education [7], agriculture
[8, 9], wearable industry [10], etc. Several studies focus on the analysis of chronic
diseases [11, 12], which requires continuous observation. Zanaj, E. et al. [13] suggested
a method that uses Wireless Sensor Networks (WSNs) to transfer various biometric data
such as heart rate, body temperature, SpO2, respiration, ECG for distant monitoring
and classification. Wang, P. [14] developed a real-time monitoring system for cardiac in-patients using ZigBee as the data acquisition device; the system sends the collected data to a database and evaluates the patient by fuzzy reasoning, and the proposal is a distance-bound mechanism. Siddik, A.B. et al. [15] demonstrated the use of cloud computing
with visualization of the collected data and incorporated a GSM module for notifying the
relevant person. Mansfield, S et al. [16] proposed an IoT based system for autonomous
patient monitoring focused on pressure injury monitoring and prevention. Kavak, A. and İnner, A.B. [17] proposed end-to-end remote patient monitoring using a framework for data collection and visualization focused on diabetic patients with a doctor-centric decision support mechanism. Feng, M. et al. [18] proposed an integrated and intelligent
system, iSyNCC, to monitor patients and facilitate clinical decision making in Neuro
Intensive/Critical Care Units (NICUs). Anifah, L. [19] designed a framework that stores data from the hardware to a backend through the MQTT protocol, where the data is published and subscribed to. Aditya, T.R., et al. [20] proposed a model that uses image processing techniques to predict the status of a patient remotely, where the system compares, captures, and generates alert messages using a GSM module while monitoring the body temperature for any anomaly. Sharma, G.K. and Mahesh [21] provided an analysis of an ESP32-based IoT system for medical monitoring [23] that integrates software and hardware, uses the internet for data transmission, and further presented an analysis of the percentage of error.
the user, and no alert is needed. If a received-sum mismatch occurs, it indicates that data was lost in communication, so a negative acknowledgment is returned to the user, and the master then initiates the next data transmission in a loop.
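The sum-check acknowledgment can be illustrated with a small sketch; the packet format and modulus below are assumptions, since the text does not specify them.

```python
# Hypothetical illustration of the received-sum acknowledgment described above.
def verify_packet(payload: bytes, received_sum: int) -> str:
    """Return an ACK when the recomputed sum matches, otherwise a NACK (triggering a resend)."""
    return "ACK" if sum(payload) % 256 == received_sum else "NACK"

payload = bytes([72, 98, 36])                      # e.g. heart rate, SpO2, temperature readings
print(verify_packet(payload, sum(payload) % 256))  # "ACK"  -> data received correctly, no alert
print(verify_packet(payload, 0))                   # "NACK" -> master retransmits in the next loop
```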
The design in Fig. 2 shows the prototype (α) that is attached to the wrist of the patient/person under supervision; it records the essential health data for collection and observation. This device serves as the master and is linked with the slave device, i.e., the prototype placed with the supervisor (β) shown in Fig. 3.
The schematic and hardware implementation of the prototype placed with the doctor (β) is shown in Fig. 3. This device gathers the data for ambience and movement prediction and sends confirmations to the master. Both parts are connected to the internet so that the data is stored for monitoring, alerting, and processing.
The IoT backend is hosted on Firebase. As shown in Fig. 5, the data from the sensors are sent to the Realtime Database, which receives and transmits data and updates to the application and to the ML training and deployment pipeline.
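A hedged sketch of pushing sensor readings to the Firebase Realtime Database with the firebase-admin SDK; the credential file, database URL, and node layout are placeholders, not the system's actual configuration.

```python
# Illustrative write path from the device-facing backend to the Firebase Realtime Database.
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("serviceAccountKey.json")  # placeholder service-account file
firebase_admin.initialize_app(cred, {"databaseURL": "https://example-project.firebaseio.com"})

def push_reading(patient_id: str, reading: dict) -> None:
    """Append one sensor sample under the patient's node in the Realtime Database."""
    db.reference(f"patients/{patient_id}/readings").push(reading)

push_reading("patient_01", {"heart_rate": 76, "temperature": 36.8, "spo2": 98})
```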
Fig. 6. Post processing scheme (A) ML Model Deployment (B) Node Structure
Linear regression (LR) is used in this approach for prediction, as models that depend linearly on their unknown parameters are extensively used in practical applications and are simpler to fit than models that are non-linearly related to their parameters. In addition, the statistical properties of the resulting estimators are easier to determine. The relationship between the dependent variable y and the vector of regressors x is assumed to be linear, so the model can be written as
y = Xβ + ε,

where

\[
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\qquad
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^{T} \\ \mathbf{x}_2^{T} \\ \vdots \\ \mathbf{x}_n^{T} \end{bmatrix}
= \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix},\qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix},\qquad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},
\]
where T denotes the transpose, so that \(\mathbf{x}_i^{T}\boldsymbol{\beta}\) is the inner product between the vectors \(\mathbf{x}_i\) and \(\boldsymbol{\beta}\). Generally, this relationship is modeled with a disturbance term or error variable ε, an unobserved random variable that adds “noise” to the linear relationship between the dependent variable and the regressors.
With a training accuracy of 88% and a test accuracy of 81%, this provides a preliminary level of understanding of the patient's condition. The graphical analysis of real-time and predicted data using linear regression, displayed in the front-end UI of the web application, is shown in Fig. 7.
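A hedged sketch of the linear regression step with scikit-learn; the synthetic readings stand in for the collected sensor data, and the 4:1 split mirrors the workflow described later.

```python
# Illustrative linear regression fit on placeholder sensor data (not the deployed model).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                  # e.g. temperature, heart rate, SpO2
y = X @ np.array([0.5, 1.2, -0.7]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train), "test R^2:", model.score(X_test, y_test))
```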
Figure 9 depicts the option for the trained K-NN model deployed on the web application, which, when applied to the collected data, generates a trendline that tracks the subsequent status of a patient. With a training accuracy of 95% and a test accuracy of 89%, this approach helps satisfy the objective to an optimum level. Overall, this expedites the preventive actions to be undertaken by doctors. In the future, models developed from the collected data can be used to forecast similar conditions/ailments and the health status during treatment, and the survival rate can be estimated, which could be used for further research and data modelling.
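A hedged sketch of the K-NN option with scikit-learn; the synthetic features and the binary status labels are placeholders for the collected data, not the deployed model.

```python
# Illustrative K-NN classifier on placeholder sensor data (0 = normal, 1 = needs attention).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                        # e.g. temperature, heart rate, SpO2
y = (X[:, 0] + X[:, 1] > 0).astype(int)              # placeholder status labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("train acc:", knn.score(X_train, y_train), "test acc:", knn.score(X_test, y_test))
```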
7 Work Flow
Figure 10 shows the overall workflow of the system. First, data is captured by the hardware devices, which use Bluetooth to measure distance, with one device communicating with the database. The data is then accessed by two clients: the ML framework and the mobile and web application. The application monitors the data and sends alerts. The ML framework pre-processes the data for training. The user can also manually examine the data. The automated mechanism selects the train and test data in a 4:1 ratio for each sample set.
The user can choose different algorithms to analyze the data depending on the situation.
Doctors are provided with accuracy scores to help them make informed decisions. The
graphical data with accuracy is then displayed on the mobile and web application, which
offers ML-Integrated data analysis and Real-Time data monitoring.
The prototype was deployed in a domestic environment to evaluate the system's functionality. The prototype performed remarkably well during the deployment and testing phase,
obtaining real-time data reliably and transmitting data smoothly with alerts for range
monitoring. Moreover, the ML framework supported the overall effort with its accurate
prediction, which was demonstrated to be a useful aid in the observation process.
References
1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6316509/
2. https://www.nursingtimes.net/news/hospital/poor-observation-skills-are-risking-patients-
lives-13-10-2009/
3. https://timesofindia.indiatimes.com/life-style/health-fitness/health-news/medical-neglig
ence-70-of-deaths-are-a-result-of-miscommunication/articleshow/51235466.cms
4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8311958/
5. https://www.missingkids.org/content/dam/missingkids/pdfs/ncmec-analysis/Infant%20A
bduction%20Trends_10_10_22.pdf
6. Hu, F., Xie, D., Shen, S.: On the application of the internet of things in the field of medical
and health care. In: Green Computing and Communications (GreenCom), 2013 IEEE and
Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber,
Physical and Social Computing, pp. 2053–2058. IEEE (2013)
7. Gómez, J., Huete, J.F., Hoyos, O., Perez, L., Grigori, D.: Interaction system based on internet
of things as support for education. Procedia Comput. Sci. 21, 132–139 (2013)
8. Sridharani, J., Chowdary, S., Nikhil, K.: Smart farming: the IoT based future agriculture. In
2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT),
January 2022, pp. 150–155. IEEE (2022)
9. Hamdi, M., Rehman, A., Alghamdi, A., Nizamani, M.A., Missen, M.M.S., Memon, M.A.:
Internet of Things (IoT) based water irrigation system. Int. J. Online Biomed. Eng. 17(5), 69
(2021)
10. Swan, M.: Sensor Mania! The Internet of Things, wearable computing, objective metrics, and
the quantified self-2.0. J. Sensor Actuator Netw. 1(3), 217–253 (2012)
11. Strollo, S.E., Caserotti, P., Ward, R.E., Glynn, N.W., Goodpaster, B.H., Strotmeyer, E.S.: A
review of the relationship between leg power and selected chronic disease in older adults. J.
Nutr. Health Aging 19(2), 240–248 (2015)
12. Diet, nutrition and the prevention of chronic diseases: report of a joint WHO/FAO expert
consultation. In: WHO Technical Report Series, 916(i-viii) (2003)
13. Zanaj, E., Basha, G., Biberaj, A., Balliu, L.: An intelligent wireless monitoring system in
telemedicine using IOT technology. In: 2023 10th International Conference on Modern Power
Systems (MPS), June 2023, pp. 1–5. IEEE (2023)
14. Wang, P.: The real-time monitoring system for in-patient based on Zigbee. In: 2008 Sec-
ond International Symposium on Intelligent Information Technology Application, December
2008, vol. 1, pp. 587–590. IEEE (2008)
15. Siddik, A.B., et al.: Real-time patient monitoring system to reduce medical error with the
help of a database system. In: 2022 4th ICECTE, December 2022, pp. 1–4. IEEE (2022)
16. Mansfield, S., Vin, E., Obraczka, K.: An IoT-based system for autonomous, continuous, real-
time patient monitoring and its application to pressure injury management. In: 2021 17th
International Conference on Distributed Computing in Sensor Systems (DCOSS), July 2021,
pp. 66–68. IEEE (2021)
17. Kavak, A., İnner, A.B.: ALTHIS: design of an end to end integrated remote patient monitor-
ing system and a case study for diabetic patients. In: 2018 Medical Technologies National
Congress (TIPTEKNO) , November 2018, pp. 1–4. IEEE (2018)
18. Feng, M., et al.: iSyNCC: an intelligent system for patient monitoring & clinical decision sup-
port in neuro-critical-care. In: 2011 Annual International Conference of the IEEE Engineering
in Medicine and Biology Society, August 2011, pp. 6426–6429. IEEE (2011)
19. Anifah, L.: Smart integrated patient monitoring system based on Internet of Things. In: 2022 6th
ICITISEE, December 2022, pp. 69–74. IEEE (2022)
20. Aditya, T.R., et al.: Real time patient activity monitoring and alert system. In: 2020 Interna-
tional Conference on Electronics and Sustainable Communication Systems (ICESC). IEEE
(2020)
21. Sharma, G.K., Mahesh, T.R.: A deep analysis of medical monitoring system based on ESP32
IoT system. In: 2023 3rd International Conference on Advance Computing and Innovative
Technologies in Engineering (ICACITE), May 2023, pp. 1848–1852. IEEE (2023)
22. Zhou, P., Ling, X.: HCI-based bluetooth master-slave monitoring system design. In: 2010
International Conference on Computational Problem-Solving, ICCP 2010 (2010)
23. Dhasarathan, C., Shanmugam, M., Kumar, M., Tripathi, D., Khapre, S., Shankar, A.:
Anomadic multi-agent based privacy metrics for e-health care: a deep learning approach.
Multimedia Tools Appl. 83, 7249–7272 (2023)
Economic and Health Burdens of HIV
and COVID-19: Insights from a Survey
of Underserved Communities
in Semi-Urban and Rural Illinois
Abstract. This paper presents findings from the “Estimating the Eco-
nomic and Health Burdens of HIV in Semi-Urban and Rural Illinois”
survey conducted in downstate Illinois, USA. The survey targeted hid-
den and hard-to-reach communities of HIV-positive individuals and
their partners. The study utilizes network science techniques, includ-
ing community detection and visualization, to analyze the social, med-
ical, and economic forces influencing three underserved communities:
African Americans, HIV-positive individuals, and those facing worsened
economic situations due to COVID-19. The analysis reveals disparities
in healthcare access, discrimination, and economic challenges faced by
these communities. The paper highlights the value of network analysis
in interpreting smaller datasets and calls for further collaborations and
research using the freely available survey data and analysis materials.
1 Introduction
The “Estimating the Economic and Health Burdens of HIV in Semi-Urban and
Rural Illinois” survey (BOH) encompassed over 200 questions and targeted hid-
den and hard-to-reach communities in the St. Louis Metro East area of downstate
Illinois, USA. While many studies of disease response are conducted in urban
areas, the St. Louis Metro East offers an opportunity to examine individual
responses to HIV and COVID-19 in a semi-urban and rural context. The sample
consisted of 22 respondents, primarily comprising HIV-positive MSM African
Americans, but also including black cisgender women, Hispanics, and transgen-
der participants. The survey encompassed a range of age groups from 18 to 60,
providing valuable insights into the health, economic, and community environ-
ments experienced by underserved populations in a post-COVID-19 world.
While statistical analysis poses challenges with small datasets, network anal-
ysis has demonstrated successful applications even with limited samples [1]. In
this study, we employ network science techniques, including community detection
and visualization, to perform a preliminary analysis of how social, medical, and
economic forces have influenced three underserved and hard-to-reach commu-
nities: African Americans, HIV-positive individuals, and those facing worsened
economic situations due to the COVID-19 epidemic.
The survey and this study were conducted under the supervision of the
Southern Illinois University Edwardsville Institutional Review Board. All sur-
vey results and analysis materials are freely available, and we encourage oth-
ers to collaborate using this data. Code used in the study can be accessed at
https://github.com/SIUEComplexNetworksLab/BOHComplexNetworks. Data
are available at https://www.openicpsr.org/openicpsr/project/192186/version/
V1/view.
2 Related Work
Network science has a rich history of application in studying various aspects
related to HIV, starting with the Potterat et al. [17] HIV transmission network
of Colorado Springs, CO, dating back to the early stages of the epidemic in
1985. Recent studies have explored HIV from different perspectives, including
investigating the interconnectivity between syphilis and HIV transmission net-
works [3], identifying intersections among different HIV-adjacent communities to
determine optimal locations for intervention efforts [4,7], and combining social
and genetic data to infer transmission networks more accurately [21].
Network analysis in the context of HIV is not limited to transmission net-
works. Adams and Light [2] used bibliographic coupling networks to study inter-
disciplinary research gaps in HIV/AIDS. Online social networks have also been
analyzed to explore support systems for people living with HIV [6] and to identify
undiagnosed communities for targeted outreach [10].
3 Methods
3.1 The Burden of HIV Dataset
The BOH survey was conducted from late 2021 to April 2023 by the Applied Research Consultants group at Southern Illinois University Carbondale and supervised by
Drs. Sinha and Matta. Survey questions covered domains including perceptions
of discrimination, and the impact of the COVID-19 pandemic on sexual practices,
living conditions, employment, and economic well-being. A dedicated section of
the survey addressed mental health and self-esteem, assessing factors such as
partner relationships, frequency of contact with family and friends, pride in gay
identity, and level of participation in community organizations.
In the context of this study, responses to individual questions are referred to
as “attributes,” “features,” or “variables.” Questions that serve as the primary
focus of analysis are referred to as “target” variables. The target variables con-
sidered in this study are African American race, HIV positive status, and those
with economic situations worsened by COVID-19.
To ensure data integrity, variables with fewer than 19 responses were removed
from the dataset. Multi-valued variables were transformed into binary choices
using one-hot encoding. For instance, a variable such as “What sources of
transportation discrimination have you experienced,” with responses including
“Discrimination based on race,” and “Discrimination based on sexual orientation,” would be transformed into two separate variables using one-hot encoding,
denoted as “TransportDiscrim:Race,” and “TransportDiscrim:SO.”
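A minimal sketch of this transformation with pandas; the “;” separator and the renamed column labels are illustrative assumptions about how the multi-select responses are stored.

```python
# Illustrative one-hot encoding of a multi-valued survey question.
import pandas as pd

responses = pd.DataFrame({"TransportDiscrim": [
    "Discrimination based on race;Discrimination based on sexual orientation",
    "Discrimination based on race",
    "",
]})

one_hot = (responses["TransportDiscrim"].str.get_dummies(sep=";")
           .rename(columns={"Discrimination based on race": "TransportDiscrim:Race",
                            "Discrimination based on sexual orientation": "TransportDiscrim:SO"}))
print(one_hot)
```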
After data curation, the final survey dataset consisted of 274 variables. Par-
ticipants who did not finish the interview were removed, resulting in 19 sam-
ples. The Recursive Feature Elimination (RFE) class from Python's scikit-learn library was utilized for feature selection. RFE identifies a fixed number
of attributes or attribute combinations that contribute the most to predicting
the target attribute. Based on previous studies demonstrating improved cluster-
ing results with a reduced number of attributes [11,14,18], feature selection was
performed to identify the most important sets of 15 and 30 attributes for each
target variable.
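A hedged sketch of this feature-selection step with scikit-learn's RFE; the `survey_df` DataFrame, the target column name, and the logistic-regression base estimator are assumptions, as the paper does not name the underlying estimator.

```python
# Illustrative RFE feature selection for a given target variable.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def select_features(survey_df: pd.DataFrame, target: str, n_features: int) -> list:
    """Return the n_features attributes that contribute most to predicting the target."""
    X = survey_df.drop(columns=[target])
    y = survey_df[target]
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_features)
    selector.fit(X, y)
    return list(X.columns[selector.support_])

# Example (hypothetical): select_features(survey_df, "HIV:Positive", 30)
```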
3.4 Clustering
Community detection was performed using the Leiden algorithm [19]. This algo-
rithm optimizes modularity, a widely used measure for quantifying the quality of
communities. The Leiden algorithm allows control over the number of detected communities through a resolution parameter.
Table 1. Modularity quality scores computed for graphs created from either 15 or 30 features, for three target variables.

Target variable      | 15 Features             | 30 Features
                     | 2 Clusters | 3 Clusters | 2 Clusters | 3 Clusters
Race:Black           | 0.4059     | 0.1548     | 0.4185     | 0.3596
CovidFinance:Worse   | 0.4134     | 0.2076     | 0.4380     | 0.3380
HIV:Positive         | 0.4387     | 0.238      | 0.5116     | 0.1622
For each target variable, we constructed two graphs using the 15 and 30 most
important variables identified through feature selection. To choose which to ana-
lyze, we did a preliminary clustering, increasing the Leiden resolution parameter
until 2 and 3 clusters were produced. The modularity scores corresponding to
each clustering are presented in Table 1. In all but one case, the clusterings based
on 30 features exhibited higher scores. Therefore, we conduct in-depth analysis of
those clusterings. We did a third clustering on the 30-feature graphs, increasing
the resolution parameter until 4 clusters were obtained. This approach allowed
us to identify patterns that exist at varying levels of granularity.
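A minimal sketch of this procedure using python-igraph and leidenalg, raising the resolution parameter until the desired number of communities appears; the Zachary karate-club graph is only a placeholder for the graphs built from the selected survey features.

```python
# Illustrative Leiden clustering with an increasing resolution parameter.
import igraph as ig
import leidenalg as la

def leiden_clusters(g: ig.Graph, target_clusters: int, step: float = 0.05):
    resolution = step
    while True:
        partition = la.find_partition(g, la.RBConfigurationVertexPartition,
                                      resolution_parameter=resolution)
        if len(partition) >= target_clusters or resolution > 5.0:
            return partition, resolution
        resolution += step

g = ig.Graph.Famous("Zachary")
partition, res = leiden_clusters(g, 3)
print(len(partition), "clusters at resolution", round(res, 2),
      "| modularity:", round(g.modularity(partition.membership), 4))
```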
To provide insight into the attribute values within the clusterings, Table 2
presents the percentages of each cluster that answered true for 40 separate vari-
ables. Due to space limitations, we only display results for the 3-cluster groups.
The clusterings are visually represented in Figs. 1, 2, and 3. The color at the top
of each column in Table 2 corresponds to the node color in the network visualiza-
tion. In selecting the variables for inclusion in Table 2, we prioritized attributes
that exhibited “interesting” characteristics, based on variations between clusters.
The intention of this analysis was to qualitatively identify attributes of interest
and to observe changes as the number of clusters increased.
4 Results
and coherence. For example, in the 4-cluster graph shown in Fig. 1c, the clusters
are clearly defined, with the blue and green clusters forming cliques, and the red
cluster displaying a high level of cohesion.
In the initial 2-cluster partition displayed in Fig. 1a, the red cluster primarily
comprises participants who identify as gay (71%) and male (86%), whereas the
blue cluster consists of 50% women and 67% heterosexuals. The red cluster faces
significant economic and social challenges, with 100% of its members growing up
in financially difficult circumstances and 57% reporting an annual income below
$10,000. In contrast, 50% of the blue cluster grew up financially comfortable.
Discrimination is more prevalent among the red cluster, with 71% experiencing
healthcare discrimination based on race compared to 33% in blue, and 86%
experiencing employment discrimination based on race compared to 50% in blue.
The red cluster also exhibits more strained family relationships, as 43% of its
members communicate with their family only rarely, whereas all members of the
blue cluster maintain monthly contact with their families.
Partitioning into 3 clusters (shown in Fig. 1b) reveals additional noteworthy attributes. The blue cluster becomes entirely heterosexual and has a low preva-
lence of smoking and alcohol use. Additionally, the blue cluster reports lower
levels of name-calling and family abuse, with none experiencing it frequently,
unlike the “mostly” or “sometimes” reported cases in other clusters. The newly
formed green cluster has 100% exposure to some form of education and health-
care discrimination, in contrast to reduced exposure in the other clusters.
Upon repartitioning into 4 clusters (Fig. 1c), the additional cluster consists
of a single node. This individual reports experiencing various forms of discrim-
ination, including educational, transportation, housing (based on credit score),
healthcare, job, social and community. Furthermore, this person is unemployed
and relies on disability benefits, exacerbating their challenging circumstances.
When considering 3 clusters, as depicted in Fig. 2b, the new green cluster
is notable for its absence of straight members (67% gay and 33% bisexual).
All members of this cluster prioritize their health, and they are all employed.
Interestingly, this group reported the highest number of sexual encounters in the
past year, with an average of 46 partners.
Introducing a fourth cluster, shown in Fig. 2c, reveals that all individuals
in this cluster are gay (100%), and 67% of them have no insurance, while all
members of other clusters possess some form of insurance. This new cluster
faces significant challenges related to lack of support, as 100% of its members
communicate with their families on a yearly or infrequent basis, none feel a sense
of belonging to the local LGBT+ community, and 100% experience verbal abuse
from their families at least occasionally.
The third group is respondents whose financial situations were adversely affected
by COVID-19, as depicted in Fig. 3. In the 2-cluster partition illustrated in
Fig. 3a, the red cluster demonstrates reduced engagement with healthcare.
Notably, red individuals exhibited lower vaccination rates for COVID-19, with
55% remaining unvaccinated, while 83% of the blue individuals were partially
or fully vaccinated. Moreover, the red cluster reports lingering brain fog and
fatigue (22% vs. 0%), increased interference of COVID-19 with job retention
(44% vs. 0%), and higher rates of job loss due to the pandemic (89% vs. 66%).
Respondents in the red cluster also reported higher expenditures on alcohol and
tobacco.
The three-cluster partition is shown in Fig. 3b. Compared to the new red
cluster, green cluster members were more likely to have had a checkup in the
past year (100% vs. 20%), sought medical care related to COVID-19 (100% vs.
0%), and have full vaccination for COVID-19 (75% vs. 0%). Green individuals
also displayed higher employment rates compared to red (100% vs. 60%). Inter-
estingly, respondents in the green cluster were more likely to receive some form
of income support than those in the red cluster (100% vs. 40%).
The 4-cluster partition is depicted in Fig. 3c. The orange cluster and the
new blue cluster have both lost work due to COVID-19 (67% in both clusters).
These clusters also exhibit notable differences. The orange cluster experienced
less discrimination than blue, with no orange members reporting education or
transportation discrimination, while 100% of the blue members reported experi-
encing both forms of discrimination. Individuals in the orange cluster were less
likely to have experienced worsened living situations due to COVID-19 (33% vs.
100%). Respondents in the blue cluster were more likely to seek medical care
related to COVID-19 (100% vs. 0%).
Fig. 3. Clustering Progression for COVID-19 Finance Worse Cohort from 2 to 4 Clusters
Acknowledgments. This research was conducted through a grant obtained from the
Illinois Innovation Network, Sustaining Illinois Seed Grant Program. The authors wish
to thank Larry Mayhew and Larry McCulley at SIHF Healthcare for giving support
and access to their staff of social workers. Also, thanks to Dr. William Summers and
Tawnya Brown at Vivent Health for their support.
Untangling Emotional Threads:
Hallucination Networks of Large
Language Models
1 Introduction
In the realm of artificial intelligence, generative models have emerged as remark-
able tools capable of producing intricate and realistic data that mimics human
creativity and understanding. At the threshold of an impending AI-driven trans-
formation, it is essential to appreciate the ingenuity of generative AI models while
subjecting their outputs to meticulous examination.
Generative AI models have been used to generate training data for machine
learning models, such as those used for global warming prediction [9], in medical
responses [18], and in education [14]. These models have achieved high accuracy,
comparable to that of human annotators in various domains [17]. They also auto-
mate data labeling in customer service chatbots, enhancing response precision
[15]. However, some studies have found varying outcomes in producing data,
particularly in AI-generated figurative speech [33,37].
In this study, we delve into the classification prowess of generative AI, exam-
ining the extent of the variations between their outputs through the lens of
network analysis.
Amidst these issues, in this paper, we ponder the extent to which emotions
manifest in the outputs of generative AI models. Is there a more feasible approach
for untangling hallucinations from reality by examining emotional variations and
shared components through network analysis? In pursuit of this objective, we
present the hallucination networks derived from tweets collected between March
and April 2023 for the September 16–29, 2022 time frame. Our focus centers on
the Zhina Mahsa Amini case in Iran, a context rich in written emotions. The
labels for these emotions are provided by generative AI models, namely GPT3.5
and RoBERTa, and are subsequently transformed into a network of emotions
incorporating diverse vocabulary perspectives.
2 Methodology
2.1 Dataset
In this paper, we used snscrape [1] to gather data from X [13], formerly Twit-
ter, just before API changes [7]. For our case study, we focused on Zhina Mahsa
Amini’s case, a 22-year-old Kurdish woman who died in police custody in Tehran,
Iran, on September 16, 2022. This incident prompted protests and activism for
women’s rights in Iran, especially on X against the government, encouraging
many to speak out. We searched using popular event hashtags [4] and filtered out
tweets under 120 characters, as well as those containing memes, images, or videos. From over 6 mil-
lion tweets in all languages, we randomly selected 5,000 due to the high cost
associated with running GPT-3.5.
2.2 Models
The paper “Attention is All You Need,” by Vaswani et al. in 2017 [38], intro-
duced the transformative “Transformer” neural network architecture, which has
become the basis for state-of-the-art NLP models like GPT and BERT due to
its parallelization, scalability, and effective self-attention mechanisms. This study
employs two prominent large language models, RoBERTa-2023 and GPT-3.5, for
emotion identification.
The RoBERTa model is built on BERT's [2] encoder-only architecture [36] and
trained on 154M tweets [22]. Compared to other post-BERT models, it signif-
icantly improves downstream Natural Language Processing (NLP) tasks [2]. The
Tweet classification version of RoBERTa fine-tuned on the TweetEval bench-
mark, initially included four major emotions: Anger, Joy, Optimism, and Sad-
ness in its training data. This RoBERTa-base model was updated in June 2023
[11] with TweetEval benchmark to include 11 emotions. In our analysis, we refer
to it as RoBERTa’23. As RoBERTa’23 is purpose-built for Tweet emotion clas-
sification, it serves as the baseline model in this paper.
The GPT-3.5 model represents a refined iteration of the GPT-3 (Generative
Pre-Trained Transformer) model. Introduced in January 2022, GPT-3.5 offers
three distinct variants, each characterized by parameter counts of 1.3 billion, 6
billion, and 175 billion [26]. A notable feature of GPT-3.5 is its enhanced ability
to mitigate the generation of noxious content to a certain extent. In contrast to
RoBERTa’23, our approach to eliciting emotions from GPT-3.5 involved employ-
ing the zero-shot learning method with prompt engineering techniques [36] and
Parrot’s emotion binning to obtain ten emotion labels [27]. We have 3691 tweets
after processing GPT-3.5 outputs for reasonable consistency.
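As an illustration of this labeling step, the sketch below shows how a tweet could be assigned a single emotion via zero-shot prompting of GPT-3.5 with the openai Python client (v1 interface). The prompt wording and the emotion list are hypothetical stand-ins, not the exact prompts or Parrott-based bins used in this study.

```python
# Hypothetical zero-shot emotion labeling with GPT-3.5; prompt text and emotion
# list are illustrative placeholders, not the study's actual prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise", "disgust",
            "optimism", "love", "anticipation", "none"]

def label_emotion(tweet: str) -> str:
    prompt = (f"Classify the dominant emotion of the following tweet. "
              f"Answer with exactly one word from this list: {', '.join(EMOTIONS)}.\n\n"
              f"Tweet: {tweet}")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(label_emotion("Say her name. #MahsaAmini"))
```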
Please note that GPT’s zero-shot learning requires sample prompts for accu-
rate results. Prompt quality is crucial, and we have designed specific prompts
to enhance GPT’s emotion detection. Prompt design complexity yields diverse
outcomes, often amidst the haystack of possibilities. Hence, our study focuses
on LLMs’ emotion inference, not prompt quality or frameworks, while existing
research explores effective, prompt design [31].
3 Experimental Results
3.1 Consistency Analysis of Models
Our initial experiment explored generative AI models’ ability to identify emo-
tions when presented with nearly identical or closely resembling inputs. To do
this, we selected a subset of tweets from our dataset containing duplicates and
substantial similarity. We applied the cosine similarity metric to retain tweets
with similarity levels surpassing 90%. Notably, RoBERTa consistently assigned
the same emotion label to identical or similar tweets, while GPT-3.5 yielded up
to three different emotions. We removed the duplicates from our corpus for the
subsequent analysis after accounting for hashtags.
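A minimal sketch of this near-duplicate filtering, assuming a TF-IDF representation of the tweets (the vectorization actually used for the similarity computation is not restated here), might look as follows:

```python
# Keep one tweet per group of near-duplicates whose cosine similarity exceeds 0.9.
# TF-IDF vectors are an assumption; any tweet embedding could be substituted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(tweets, threshold=0.9):
    vectors = TfidfVectorizer().fit_transform(tweets)
    sims = cosine_similarity(vectors)
    keep, dropped = [], set()
    for i in range(len(tweets)):
        if i in dropped:
            continue
        keep.append(i)
        for j in range(i + 1, len(tweets)):
            if sims[i, j] > threshold:
                dropped.add(j)  # j is a near-duplicate of an already kept tweet
    return [tweets[i] for i in keep]

print(deduplicate(["woman life freedom", "woman life freedom!", "protests in Iran"]))
```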
RoBERTa’23 vs GPT-3.5
Euclidean 834.31
Manhattan 40,130.00
Canberra 0.382
In this section, we investigated the patterns and grouping of tweets by the word
embeddings. To this end, we first created word embeddings, dense vector rep-
resentations of words of the tweets. Subsequently, we performed lemmatization
and stemming techniques [8], which aim to reduce words to their base form
and truncate words to their root form, respectively. Then, t-SNE (t-distributed
Stochastic Neighbor Embedding) was employed to decrease the dimensionality
of the data to facilitate visualization. Please note that t-SNE is often used to
visualize high-dimensional data in 2D or 3D space while preserving the pairwise
similarities between data points. Finally, we implemented k-means clustering,
an unsupervised machine learning algorithm, to cluster data points into groups
of clusters based on their similarities and labels, e.g., the emotions generated by
GPT-3.5 and RoBERTa’23.
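A condensed sketch of this pipeline is given below; random vectors stand in for the actual tweet embeddings, and k = 5 clusters is assumed to match the clusters discussed next.

```python
# Clustering pipeline sketch: embeddings -> t-SNE (2D) -> k-means (k = 5).
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tweet_embeddings = rng.normal(size=(200, 300))   # placeholder for real tweet vectors

# t-SNE reduces the vectors to 2D while preserving pairwise similarities locally.
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(tweet_embeddings)

# k-means groups the 2D points into five clusters (clstr-1 ... clstr-5).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords_2d)
print(np.bincount(labels))
```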
In Fig. 1, we mark coordinates to represent single-tweet word embeddings.
Closer points suggest similar word embeddings. Despite showing distinct clusters
(clstr-1, 2, 3, 4, and 5), mixed emotions appear with various colors. This
validates prior research in psychology and psychiatry, indicating shared com-
ponents among emotions, namely vocabulary. However, from a Generative AI
perspective, tweets in the figure display different colors and emotions. Emotions
mix across clusters and even within them. Considering RoBERTa’23 as our base-
line model, we observe that the labels generated by GPT-3.5 are significantly
different, revealing its distinct hallucination patterns. It is essential to highlight
that out of the 3,691 tweets in this clustering analysis, only 1,266 tweets have
been assigned identical emotional labels by the GPT-3.5 and RoBERTa'23 models.
Fig. 2. Distribution of RoBERTa’23 and GPT-3.5 scaled scores for the same tweet in
different emotions
What is evident is that the same tweets can be categorized with distinct emotions.
Our interest is in assessing the hallucination that occurs, specifically in GPT-3.5
and RoBERTa'23, when ascribing emotional labels and confidence scores to tweets.
To this end, in Fig. 2, we present the tweets classified with the same emotion
by the two models. To be able to compare the model scores, we normalized the
scores of both models so that they sum to 100% for each tweet. For instance, in the Anger figure (upper-left),
there was a total of 459 tweets; both RoBERTa’23 and GPT-3.5 classified them
as Anger. Yet, RoBERTa'23 was markedly more confident in the tweets' emotion being
Anger, with an average of 70%, whereas GPT-3.5 was about 30% confident in
tweets’ emotions being Anger. Similarly, in Fear, Joy, Optimism, and Sadness
figures, RoBERTa’23 was confident with averages of 63%, 63%, 58%, and 63%,
respectively, whereas GPT-3.5 averaged 37%, 37%, 42%, and 37%, respectively.
Interestingly, it was close to 50% for both RoBERTa’23 (54%) and GPT-3.5
(46%) for Disgust. Overall, RoBERTa’23, being fine-tuned for emotion classi-
fication, exhibits higher confidence, while GPT-3.5 struggles. This underscores the
need for high-quality, task-specific fine-tuning of foundation models like GPT-
3.5, which otherwise hallucinate and report confidence scores far from reality.
In the corresponding figure, links are colored by model, with GPT-3.5 in orange. Please note that we thoroughly cleaned word embed-
dings during the preprocessing step. This process involved lemmatization and
stemming to extract the most valuable vocabulary for each tweet. As illustrated
in the figure, an increase in k results in a higher percentage of GPT-3.5 links.
This indicates that GPT-3.5 categorizes words into more classes than our base-
line model, RoBERTa'23. This observation is consistent with GPT-3.5's tendency to gen-
erate erroneous inferences based on written context and prompts. Additionally,
the largest observable k-core network within the vocabulary consists of thirteen
links. The thirteen-core sub-graph (not shown in the figure) reveals that the 230
words connected to all emotions have a higher weighted connection based on
frequency to GPT-3.5 (53.78%) than RoBERTa’23 (46.22%).
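The k-core computation itself is straightforward with NetworkX; in the sketch below the toy edges are placeholders for the weighted vocabulary-emotion links described above.

```python
# Extract the deepest k-core of a small stand-in vocabulary network.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("woman", "anger", 5), ("woman", "sadness", 3), ("freedom", "anger", 4),
    ("freedom", "optimism", 2), ("woman", "freedom", 6), ("anger", "sadness", 1),
])

core_numbers = nx.core_number(G)
k_max = max(core_numbers.values())
deepest_core = nx.k_core(G, k=k_max)   # sub-graph where every node has degree >= k_max
print(k_max, list(deepest_core.nodes()))
```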
Fig. 5. The vocabulary networks annotated with emotions detected by GPT-3.5 (on
the left) and RoBERTa’23 (on the right), colored by Modularity
There are instances in the GPT-3.5 network where the Generative AI model could
not assign emotions to tweets and vocabulary items, denoted as “none-classified”
(NC) in the figure.
It is important to emphasize that this figure warrants further investigation
into the potential clustering of emotions into smaller communities and the dis-
tinctions in the vocabulary associated with Anticipation, Surprise, and None-
Classified. However, this aspect is reserved for future research, potentially involv-
ing collaboration with domain experts.
Regarding emotion detection, it is noteworthy that distinctions are observed
in the classification of emotions, particularly in the differentiation between Anger
and Fear. On the other hand, similarities exist in the classification of Joy and
Optimism, as they appear to share commonalities in their categorization. This
suggests that the models may exhibit variability in their interpretation and clas-
sification of emotions, with some emotions being more closely aligned than others
in the results.
Both vocabulary networks exhibit low assortativity, indicative of disassortative mixing. This means that words within the networks, particularly
those with dissimilar linguistic or contextual characteristics, tend to form connec-
tions. In contrast to assortative networks where similar words cluster together,
a disassortative network suggests that antonyms, words with differing gram-
matical roles, or those used in dissimilar contexts are more likely to connect.
This behavior highlights language diversity and contrasting aspects within the
analyzed corpus. Additional evidence indicates that emotion detection models,
including the baseline model RoBERTa'23, still face significant challenges. It is
noteworthy that although RoBERTa'23 achieves a higher classification rate, it
also demonstrates the same disassortative mixing behavior, suggesting improve-
ments in emotion detection models are warranted.
Interestingly, while the assortativity metrics of networks are very low, the
average clustering coefficients for GPT-3.5 and RoBERTa’23 are 0.584 and 0.603,
respectively. The high clustering coefficients, along with a small average shortest
path, indicate that words within the network tend to form tight-knit clusters (yet
another small-world network), suggesting strong linguistic or contextual associ-
ations among words. Overall, the networks exhibit cohesive word clusters and
connections between words with differing characteristics, reflecting a diverse and
intricate structure encompassing semantic similarities and contrasts in language
usage.
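These summary statistics can be reproduced with NetworkX, as in the sketch below, where a built-in graph stands in for the vocabulary networks.

```python
# Network statistics discussed above: assortativity, clustering, path length.
import networkx as nx

G = nx.les_miserables_graph()   # stand-in for the vocabulary co-occurrence networks

assortativity = nx.degree_assortativity_coefficient(G)  # low/negative => disassortative
clustering = nx.average_clustering(G)                    # tight-knit triangles
path_length = nx.average_shortest_path_length(G)         # short paths => small-world

print(f"assortativity={assortativity:.3f}, clustering={clustering:.3f}, "
      f"avg shortest path={path_length:.2f}")
```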
4 Conclusion
The development of Large Language Models signifies a notable stride toward
achieving human-like intelligence. Their capacity to grasp and classify figurative
speech is a testament to their growing capabilities. Moreover, as these models
continue to enhance their reliability levels, they are poised to play a pivotal role
in facilitating the accurate detection of emotions within textual content. This
not only expands our understanding of language but also opens up a wealth of
possibilities for applications across various domains, underlining the promising
future of LLMs in the realm of natural language processing and artificial intel-
ligence. Our research is a step in that direction to point out the issues from a network analysis perspective.
References
1. snscrape. https://github.com/JustAnotherArchivist/snscrape
2. Acheampong, F.A., Nunoo-Mensah, H., Chen, W.: Transformer models for text-
based emotion detection: a review of BERT-based approaches (2021)
3. Alkaissi, H., McFarlane, S.I.: Artificial hallucinations in chatgpt: implications in
scientific writing. Cureus (2023)
4. Amidi, F.: Hashtags, a viral song and memes empower iran’s protesters (2022)
5. Athaluri, S.A., Manthena, S.V., Kesapragada, V.S.R., Yarlagadda, V., Dave, T.,
Duddumpudi, R.T.S.: Exploring the boundaries of reality: investigating the phe-
nomenon of artificial intelligence hallucination in scientific writing through chatgpt
references (2023)
6. Bar-Kalifa, E., Sened, H.: Using network analysis for examining interpersonal emo-
tion dynamics. Multivariate Behav. Res. (2020)
7. Barnes, J.: Twitter ends its free API: Here’s who will be affected
8. Bird, S., Klein, E., Loper, E.: Natural language processing with Python. O’Reilly
(2009)
9. Biswas, S.S.: Potential use of chat gpt in global warming. Annals Biomed. Eng.
(2023)
10. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks (2008)
11. Camacho-Collados, J., et al.: TweetNLP: Cutting-Edge Natural Language Process-
ing for Social Media (2022)
12. Cowen, A.S., Keltner, D.: Self-report captures 27 distinct categories of emotion
bridged by continuous gradients. Proc. Nat. Acad. Sci. (2017)
13. Davis, W.: Twitter is being rebranded as x - the verge (2023)
14. Fuchs, K.: Exploring the opportunities and challenges of nlp models in higher
education: is chat gpt a blessing or a curse? (2023)
15. Gilardi, F., et al.: Chatgpt outperforms crowd-workers for text-annotation tasks.
arXiv: 2303.15056 (2023)
16. Hasmi, L., et al.: Network approach to understanding emotion dynamics in rela-
tion to childhood trauma and genetic liability to psychopathology: replication of a
prospective experience sampling analysis (2017)
17. Huang, F., Kwak, H., An, J.: Is chatgpt better than human annotators? potential
and limitations of chatgpt in explaining implicit hate speech. arXiv:2302.07736
(2023)
18. Johnson, D., Goodman, R., Patrinely, J., Stone, C., Zimmerman, E., et al.: Assess-
ing the accuracy and reliability of ai-generated medical responses: an evaluation of
the chat-gpt model (2023)
19. Kinnison, J., Padmala, S., Choi, J-M., Pessoa, L.: Network analysis reveals
increased integration during emotional and motivational processing. J. Neurosci.
(2012)
20. Lange, J., Zickfeld, J.H.: Emotions as overlapping causal networks of emotion com-
ponents: implications and methodological approaches (2021)
21. Liang, S., et al.: The relations between emotion regulation, depression and anxiety
among medical staff during the late stage of covid-19 pandemic: a network analysis.
Psych. Res. (2022)
22. Loureiro, D., et al.: Tweet insights: a visualization platform to extract temporal
insights from Twitter. arXiv:2308.02142 (2023)
23. Martı́n-Brufau, R., Suso-Ribera, C., Corbalán, J.: Emotion network analysis during
covid-19 quarantine-a longitudinal study. Front. Psychol. (2020)
24. Matsumoto, D., Keltner, D., Shiota, M.N., O’Sullivan, M., Frank, M.: Facial expres-
sions of emotion (2008)
25. Mauss, I.B., Levenson, R.W., McCarter, L., Wilhelm, F.H., Gross, J.J.: The tie
that binds? coherence among emotion experience, behavior, and physiology (2005)
26. Ouyang, L., et al.: Training language models to follow instructions with human
feedback (2022)
27. Parrott, W.G.: Emotions in social psychology: Essential readings. Psychology Press
(2001)
28. Pessoa, L.: Understanding emotion with brain networks. Current Opinion Behav.
Sci. (2018)
29. Rudolph, J., Tan, S., Tan, S.: Chatgpt: bullshit spewer or the end of traditional
assessments in higher education? J. Appl. Learn. Teach. (2023)
30. Sailunaz, K., Alhajj, R.: Emotion and sentiment analysis from twitter text. J.
Comput. Sci. (2019)
31. Si, C., et al.: Prompting GPT-3 to be reliable (2023)
32. Siegel, E.H., et al.: Emotion fingerprints or emotion populations? a meta-analytic
investigation of autonomic features of emotion categories (2018)
33. Sohail, S.S., et al.: Decoding chatgpt: A taxonomy of existing research, current
challenges, and possible future directions (2023)
34. Tantardini, M., Ieva, F., Tajoli, L., Piccardi, C.: Comparing methods for comparing
networks. 9, 17557 (2019)
35. Trampe, D., Quoidbach, J., Taquet, M.: Emotions in everyday life. PloS one (2015)
36. Tunstall, L., von Werra, L., Wolf, T.: Natural language processing with transform-
ers: building language applications with Hugging Face (2022)
37. Vaira, L.A., et al.: Accuracy of chatgpt-generated information on head and neck
and oromaxillofacial surgery: a multicenter collaborative analysis (2023)
2022 French Election Trendy Hashtags Analysis
1 Introduction
In recent years, social media data has been increasingly used to predict real-
world outcomes. Data from platforms like Twitter, Reddit, and Facebook has
been shown to be valuable in predicting public sentiment or response towards
many different topics. This information has been used across many different
fields like predicting stock market price changes or movie popularity [3,15].
Social media platforms have continued to get more popular over time. Due to
this, the size of social media datasets continues to increase. As the size of these
datasets gets bigger and bigger, computational time complexity of the algorithms
being used becomes a significant issue. Some of the most popular features used in
social media predictions like sentiment analysis become prohibitively expensive
when working with larger datasets [1,14].
In this paper we propose a method to generate features that can be used in
social media predictions on big datasets. We create a weighted semantic network
between Twitter hashtags from a corpus of 3.7 million tweets related to the 2022
French presidential election. A bipartite graph is formed where hashtags are
nodes and weighted edges connect the hashtags reflecting the number of Twitter
users that interacted with both hashtags. The graph is then transformed into a
maximum-spanning tree with the most popular hashtag designated as its root
node to construct a hierarchy amongst the hashtags. We then provide a vector
feature for each user where the columns represent each of the 1037 hashtags in
the filtered dataset and the value for each column is the normalized count of
interactions for the user with that hashtag and any children of the hashtag in
the tree.
To validate the usefulness of our semantic feature, we performed a regres-
sion experiment to predict each user's response rate for six emotions such as
anger, enjoyment, and disgust. The emotion data was manually annotated by a
DARPA team created for the INCAS Program. We provide a simple baseline
feature that counts the number of times a user interacts with each
of the 1037 hashtags. Both the baseline and our semantic feature perform well
in the regression, with most emotions yielding R2 above 0.5. The semantic fea-
ture statistically significantly outperforms the baseline feature on five out of six
emotions using an F-test with a p value of 0.05.
The rest of the paper is organized as follows. Section 2 details related
works. In Sect. 3, we present the dataset used for experimentation. Then, Sect. 4
describes the methodology used in our paper. The design of experiments and
their results are presented in Sect. 5, and the conclusions are discussed in Sect. 6.
2 Related Works
Analyzing Twitter using semantic networks has been done in the past with var-
ious methods to determine relationships between hashtags and their trends. For
example, [18] considered two hashtags to be semantically related if an individ-
ual tweet contained both hashtags in the text. Similarly, [9] created a semantic
network based on word co-occurrence within tweets. However, [17] presented an
approach using a bipartite network between users and hashtags where an edge
between a user node and a hashtag node was added if the user tweeted the hash-
tag at least once. This bipartite network was then projected into a monopartite
network of hashtags. This approach is more applicable to our purposes because it
captures the latent social network of the dataset. In addition, in [17] the authors
focus on a Twitter dataset taken from the 2018 Italian elections which is similar
to our 2022 French election dataset.
Many studies have shown that semantic network features can be rich enough
to use for regressions in a multitude of situations. In the field of psychology,
semantic networks can be used to analyze a person’s vocabulary to gain insight
on cognitive states [4,7]. In terms of social media semantic networks, [10] used
semantic networks generated from sentences as features for a time series regres-
sion to capture the volatility of the stock market. In general, these approaches
involve creating a semantic network for each person or object in the study. The
alternative approach is to create large-scale, singular semantic networks that can
be used to describe all users. For example, [12] demonstrated a recommender sys-
tem which used a large-scale word co-occurrence semantic network created from
social media posts to recommend related social media posts to users. Such an
approach might be better for analyzing users since it can take advantage of the
nuanced relationships between different social media communities, which cannot
be done with an approach that only generates an individual semantic network
for each user.
3 Data
We applied these enrichments to a dataset provided by the DARPA INCAS
program team that comprised 3.7 million French language tweets from 2022.
This dataset was collected such that each tweet is relevant to the discussions that
arose during the 2022 French presidential election. After pruning, this dataset
contains 1037 hashtags and 389,187 users.
4 Methods
4.1 Semantic Network Generation
We performed several steps to prepare the Twitter data and create a semantic
network.
Preprocessing. The corpus of Tweets was first cleaned by removing URLs with
regular expression and French stop words using the NLTK Python library [5].
Each Tweet was tokenized by converting all words to lowercase, removing digit-
only words, and removing punctuation, except for hashtags. After extracting
a set of hashtags and corresponding occurrence counts, any hashtags with an
occurrence count below the mean were removed from the set to focus on trendy
hashtags.
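A sketch of this preprocessing is given below, using NLTK's French stop-word list; the exact cleaning rules beyond those described above (e.g., the tokenization regex) are assumptions.

```python
# Preprocessing sketch: strip URLs, lowercase, drop French stop words and
# digit-only tokens, then keep hashtags whose count is at least the corpus mean.
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
french_stops = set(stopwords.words("french"))

def tokenize(tweet):
    tweet = re.sub(r"https?://\S+", "", tweet.lower())   # remove URLs
    tokens = re.findall(r"#?\w+", tweet)                  # keep words and hashtags
    return [t for t in tokens if t not in french_stops and not t.isdigit()]

tweets = ["Macron gagne #presidentielle2022 https://t.co/x",
          "#presidentielle2022 debat ce soir"]
hashtag_counts = Counter(t for tw in tweets for t in tokenize(tw) if t.startswith("#"))
mean_count = sum(hashtag_counts.values()) / len(hashtag_counts)
trendy = {h for h, c in hashtag_counts.items() if c >= mean_count}
print(trendy)
```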
Bipartite Graph Construction. An edge is created between a user and a
trendy hashtag if the user retweets, quote retweets, comments under, or posts
a tweet that contains the hashtag word with or without the hashtag symbol.
We chose this relaxed approach because we consider situations such as “france”
versus “#france” to be semantically identical. The resulting bipartite graph was
projected along the hashtags as a weighted semantic network where each node
represents a trendy Twitter hashtag, and the weighted edges represent the shared
audience of users between two trendy hashtags.
Edge Pruning. Next, the projected semantic network was converted into a maximum
spanning tree (MST) to only consider the most important links between trendy
hashtags. We conducted multiple experiments with and without edge prun-
ing and concluded that some form of edge pruning is essential for removing noisy
edges. We tested a flat cutoff approach for excluding edges with a weight below
a set cutoff, and the MST approach, achieving the best and most robust results
with the MST. All graph operations were performed with the NetworkX Python
library [11]. The implementation negates the weights of the edges and then fol-
lows Kruskal’s [13] algorithm to build a minimum spanning tree. A visualization
of the resulting MST can be seen in Fig. 1.
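The construction maps directly onto NetworkX primitives. The sketch below builds a toy user-hashtag bipartite graph, projects it onto hashtags with shared-audience weights, and prunes the projection to a maximum spanning tree; the data are illustrative only.

```python
# User-hashtag bipartite graph -> weighted hashtag projection -> maximum spanning tree.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
hashtags = ["#macron", "#ukraine", "#paris"]
B.add_nodes_from(["u1", "u2", "u3"], bipartite=0)
B.add_nodes_from(hashtags, bipartite=1)
B.add_edges_from([("u1", "#macron"), ("u1", "#ukraine"),
                  ("u2", "#macron"), ("u2", "#paris"), ("u3", "#ukraine")])

# Projection: edge weight = number of users who interacted with both hashtags.
H = bipartite.weighted_projected_graph(B, hashtags)

# The MST keeps only the strongest links between trendy hashtags.
mst = nx.maximum_spanning_tree(H, weight="weight")
print(list(mst.edges(data=True)))
```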
Each tweet is annotated with an array covering six
distinct emotions and an "other" value (representing fear, anger, enjoyment, sad-
ness, disgust, surprise, and a "none of the above" tag), where the sum of each array
equals 1. We split the data into a training and testing period where the testing
period is the final 2 weeks of the data and the training period covers the first
10 weeks. For every user’s tweets, for both the training and testing periods we
summed the emotion arrays, then divided the resulting array by its 1-norm, so
that each array element lies in [0, 1] and represents the probability of that user
interacting with each emotion. Each user array was split into a set of emotion
target variables, and each one was paired with the corresponding user enrich-
ment method as the input variable. Only users with ≥ 10 tweets in the training
period were included, resulting in 49,360 entries of input/target pairs for each
emotion. Since this experiment is only meant to compare the different methods
relative to each other with often minimal differences, we used the Scikit-learn
implementation of linear regression [16]. To measure the performance of each
regression we compute the R2 value between the predicted emotion of all the
users and the actual emotion of all the users.
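A compact sketch of this regression setup with synthetic data is shown below; the real inputs are the 1037-dimensional hashtag vectors and the annotated per-tweet emotion arrays, which are not reproduced here.

```python
# Per-user emotion targets (1-norm normalized) regressed on hashtag-count features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_users, n_hashtags, n_emotions = 1000, 1037, 7

X = rng.poisson(0.05, size=(n_users, n_hashtags)).astype(float)  # interaction counts
raw = rng.random(size=(n_users, n_emotions))
y = raw / raw.sum(axis=1, keepdims=True)   # each user's emotion array sums to 1

split = int(0.8 * n_users)                 # stand-in for the 10-week / 2-week split
model = LinearRegression().fit(X[:split], y[:split, 0])   # one emotion at a time
print("R2:", r2_score(y[split:, 0], model.predict(X[split:])))
```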
5 Results
5.1 Experiment Results
Table 1. Regression experiment results. (*: semantic > baseline; p < 0.05)
The R2 values for surprise, however, were close to zero. Intuitively, people would have various positive or negative views
towards certain political trendy hashtags, which would correlate with most of
the emotions. However, surprise cannot easily be categorized as on the positive
or negative binary spectrum, which could explain why the linear regression per-
formed poorly in those cases. Previous research on sentiment analysis has also
shown notably lower performance when predicting surprise [6,19]. A change to
our method to distinguish hashtags written in all uppercase or with exclamation
points at the end may improve performance on surprise.
Both enrichments had the best performance when being used to predict the
enjoyment emotion with R2 values of 0.634 and 0.648. Social media users are
likely to frequently repeatedly mention the topics that give them enjoyment,
so this result is not very surprising. Similar logic would apply to anger and
disgust emotions and those regressions also performed well with R2 values greater
than 0.5.
To analyze which trendy hashtags are most associated with improvement with
the semantic enrichment, we decided to filter for the top 10% of users that saw
the most improvement in prediction accuracy between the baseline regression
and the semantic regression. This was determined by the mean absolute error in
emotion predictions versus the ground truth. Then we compared trendy hashtag
occurrence rates for the top 10% users with the hashtag occurrence rates for the
rest of the users. All hashtag occurrence rates were calculated based on direct
interactions, like the baseline enrichment. We then selected the top 10 trendy
hashtags that saw the largest increase in occurrence rate between the top 10%
of users and the rest of the users.
Since the presence of these trendy hashtags in Table 2 results in more accu-
rate emotion predictions with the semantic enrichment, this suggests that the
users engaging with these trendy hashtags tend to engage with other emotionally
salient trendy hashtags, which would be more useful when predicting emotion
levels. Given the severity of war, it would make sense that “Ukrainians” would
be strongly connected to other highly emotionally salient trendy hashtags. A
previous English language Twitter study about the Ukrainian war found that
“Ukrainians” is a significant buzzword, so it is not surprising that the word reap-
pears in French. In addition, that study identified the YouTube Twitter account
that was frequently mentioned in relation to the war, which would explain why
it invokes strongly emotional trendy hashtags [20]. “Paris” on its own may not
seem like it would be emotionally charged. However, looking at the children of
this hashtag in the tree, “hidalgodemission”, “conseilparis”, and “saccageparis”,
they are each related to an emotionally-charged movement to remove the mayor
of Paris and rebuild some of the crumbling architecture in the city.
Table 2. Trendy hashtags with the largest increase in occurrence rate between the
top 10% of users and the rest of users. Occurrence rate is the proportion of users that
directly interacted with that trendy hashtag. The top 10% of users is computed based
on the improvement in prediction accuracy between the baseline regression and the
semantic regression for those users.
One limitation of our approach is that edges were pruned by using a maximum spanning tree to retain only the strongest edges. This has
the disadvantage of removing connections between different communities within
the graph. It is possible that using a more sophisticated pruning approach could
improve the quality of using semantic networks in this manner. Future work can
take inspiration from knowledge graph edge pruning methods, which can account
for domain information of the trendy hashtags [8].
Acknowledgements. This work was partially supported by the DARPA INCAS Pro-
gram under Agreement No. HR001121C0165 and by the NSF Grant No. BSE-2214216.
References
1. Almuayqil, S.N., Humayun, M., Jhanjhi, N.Z., Almufareh, M.F., Khan, N.A.:
Enhancing sentiment analysis via random majority under-sampling with reduced
time complexity for classifying tweet reviews. Electronics 11(21) (2022). https://
doi.org/10.3390/electronics11213624
2. Asadi, M., Ghadiri, N., Nikbakht, M.A.: A scalable method for one-mode projection
of bipartite networks based on hadoop platform. In: 2018 8th International Con-
ference on Computer and Knowledge Engineering (ICCKE), pp. 237–242 (2018).
https://doi.org/10.1109/ICCKE.2018.8566259
3. Asur, S., Huberman, B.A.: Predicting the future with social media. In: 2010
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent
Agent Technology, vol. 1, pp. 492–499 (2010). https://doi.org/10.1109/WI-IAT.
2010.63
4. Beckage, N., Smith, L., Hills, T.: Semantic network connectivity is related to vocab-
ulary growth rate in children. In: Proceedings of the Annual Meeting of the Cog-
nitive Science Society, vol. 32 (2010)
5. Bird, S., Klein, E., Loper, E.: Natural language processing with Python: analyzing
text with the natural language toolkit. O’Reilly Media, Inc.’ (2009)
6. Buechel, S., Hahn, U.: Emotion analysis as a regression problem–dimensional mod-
els and their implications on emotion representation and metrical evaluation. In:
ECAI 2016, pp. 1114–1122. IOS Press (2016)
7. Chan, A.S., Salmon, D.P., Butters, N., Johnson, S.A.: Semantic network abnor-
mality predicts rate of cognitive decline in patients with probable Alzheimer’s
disease. J. Int. Neuropsychol. Soc. 1(3), 297–303 (1995). https://doi.org/10.1017/
S1355617700000291
8. Faralli, S., Finocchi, I., Ponzetto, S.P., Velardi, P.: Efficient pruning of large knowl-
edge graphs. In: Proceedings of the 27th International Joint Conference on Artifi-
cial Intelligence, IJCAI 2018, pp. 4055-4063. AAAI Press (2018)
9. Featherstone, J.D., Ruiz, J.B., Barnett, G.A., Millam, B.J.: Exploring childhood
vaccination themes and public opinions on Twitter: A semantic network analysis.
Telematics Inform. 54, 101,474 (2020). https://doi.org/10.1016/j.tele.2020.101474.
https://www.sciencedirect.com/science/article/pii/S0736585320301337
10. Fronzetti Colladon, A., Grassi, S., Ravazzolo, F., Violante, F.: Forecasting financial
markets with semantic network analysis in the COVID-19 crisis. J. Forecasting
(2020)
11. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring network structure, dynamics,
and function using NetworkX. In: Varoquaux, G., Vaught, T., Millman, J. (eds.)
Proceedings of the 7th Python in Science Conference, Pasadena, CA USA, pp. 11
– 15 (2008)
12. He, Y., Tan, J.: Study on SINA micro-blog personalized recommenda-
tion based on semantic network. Expert Syst. Appli. 42(10), 4797–4804
(2015). https://doi.org/10.1016/j.eswa.2015.01.045, https://www.sciencedirect.
com/science/article/pii/S0957417415000603
13. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling sales-
man problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)
14. Kumari, S.: Impact of big data and social media on society. Global J. Res. Anal.
5, 437–438 (2016)
15. Pagolu, V.S., Challa, K.N.R., Panda, G., Majhi, B.: Sentiment analysis of Twitter
data for predicting stock market movements. CoRR abs/ arXiv: 1610.09225 (2016)
16. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
17. Radicioni, T., Saracco, F., Pavan, E., Squartini, T.: Analysing Twitter semantic
networks: the case of 2018 Italian elections. Sci. Rep. 11(1), 1–22 (2021)
18. Shi, W., Fu, H., Wang, P., Chen, C., Xiong, J.: #climatechange vs. #glob-
alwarming: Characterizing two competing climate discourses on Twitter with
Semantic Network and temporal analyses. Inter. J. Environ. Res. Public Health
17(3) (2020). https://doi.org/10.3390/ijerph17031062, https://www.mdpi.com/
1660-4601/17/3/1062
19. Tsakalidis, A., et al.: Building and evaluating resources for sentiment analysis in
the Greek language. Lang. Resour. Eval. 52, 1021–1044 (2018)
20. Vyas, P., Vyas, G., Dhiman, G.: RUemo-the classification framework for Russia-
Ukraine war-related societal emotions on Twitter through Machine Learning. Algo-
rithms 16(2) (2023). https://doi.org/10.3390/a16020069
Rewiring Networks for Graph Neural
Network Training Using Discrete
Geometry
1 Introduction
Data captured by structures beyond vectors living in Euclidean space are becom-
ing increasingly abundant; thus, it is increasingly important to develop
methods to analyze them. When such data lack a rigorous metric structure,
notions of shape and size of the data become useful to incorporate in their
analysis—this is the premise of geometric deep learning [3]. Networks are an
important example of non-Euclidean spaces lacking a natural metric structure,
which are the focus of this work.
In this paper, we study the problem of information over-squashing, associ-
ated with training graph neural networks (GNNs) [1,21], which occurs when
information does not flow efficiently between distant nodes on a graph. This
problem tends to occur when there is heavy traffic passing through particular
edges of a graph, known as the bottleneck. Graph rewiring is a common mitigating
approach, which adds or suppresses edges on the input network to alleviate bot-
tlenecks and improve information flow over a network. Recent pioneering work
models network information flow using a new notion of discrete curvature—the
balanced Forman curvature (BFC)—and uses it to rewire graphs prior to train-
ing GNNs, yielding the current state-of-the-art for GNN training in the presence
of bottlenecks [21].
The main difference between traditional deep neural networks and GNNs has
to do with the message passing algorithm [7]. In message passing, at each layer
and for each node, features from the neighboring nodes are aggregated before
updating the features of the target node. The principal concern of over-squashing
is that the influence of important node features may be too small and eventually
have minimal or no impact on features of distant nodes on the network when
message passing over the GNN. When propagating information from a node in
a source component to a node in the target component, over-squashing is likely
to happen as the information is crowded or “squashed” together with all other
node features from the source component, which happens on the edge connecting
the two components called a bottleneck.
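For concreteness, a minimal mean-aggregation message-passing step is sketched below in plain NumPy; this is a generic GCN-style update for illustration, not the specific architecture evaluated in [21].

```python
# One message-passing step: aggregate neighbour features (mean), then linear + ReLU.
import numpy as np

def message_passing_step(adj, features, weight):
    # adj: (n, n) adjacency with self-loops; features: (n, d); weight: (d, d_out)
    deg = adj.sum(axis=1, keepdims=True)
    aggregated = (adj @ features) / np.clip(deg, 1, None)   # mean over neighbours
    return np.maximum(aggregated @ weight, 0)                # ReLU update

rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)  # 3-node path + self-loops
features = rng.normal(size=(3, 4))
print(message_passing_step(adj, features, rng.normal(size=(4, 2))))
```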
Bottlenecks may be alleviated with graph rewiring, which better supports
the bottleneck and provides alternative access routes between components to
reduce the risk that features become crowded out (over-squashed); see Fig. 1.
Edges that have little impact on information flow in the graph can be deleted
to control the size of the graph.
The Ricci curvature quantifies how much a Riemannian manifold locally differs
from a Euclidean space. It determines whether two geodesics shot in parallel
from two nearby points on a given manifold converge, remain parallel, or diverge
along the manifold. The curvature is positive if the geodesics converge to a single
point; zero, if the geodesics remain parallel; and negative, if the geodesics diverge.
Fig. 1. Graph rewiring reduces over-squashing. (a): A graph with a bottleneck (blue
edge). (b): A rewiring that alleviates the bottleneck.
The quicker the convergence or divergence, the larger the Ricci curvature. Ricci
curvature can be used to smooth a manifold via the Ricci flow, described by a
partial differential equation (PDE) [4].
In the discrete setting of meshes or networks, the PDE describing Ricci flow
becomes an ordinary differential equation, thus the flow is reversible.
1D Forman Curvature. Perhaps the most basic and one of the first notions of
discrete curvature is the one proposed by Forman [6,19].
Definition 1. For two nodes v1 , v2 in a graph and an edge e between them, the
general 1D Forman curvature of e is
$$
F_{\mathrm{full}}(e) = w_e \left( \frac{w_{v_1}}{w_e} + \frac{w_{v_2}}{w_e} - \sum_{e_{v_1} \sim e,\; e_{v_2} \sim e} \left[ \frac{w_{v_1}}{\sqrt{w_e\, w_{e_{v_1}}}} + \frac{w_{v_2}}{\sqrt{w_e\, w_{e_{v_2}}}} \right] \right), \qquad (1)
$$
where $e_{v_1} \sim e$ and $e_{v_2} \sim e$ denote the edges other than $e$ that are adjacent to
nodes $v_1$ and $v_2$ respectively; $w_e$, $w_{e_{v_1}}$, and $w_{e_{v_2}}$ denote the weights of edges $e$,
$e_{v_1}$ and $e_{v_2}$ respectively; and $w_{v_1}$ and $w_{v_2}$ denote the weights of the nodes $v_1$ and
$v_2$ respectively.
For unweighted graphs, the weights of all nodes and edges are set to 1 and
(1) becomes simply
$$
F(e) = 4 - \deg(v_1) - \deg(v_2), \qquad (2)
$$
where deg(x) is the degree of node x. In our work, we compute the 1D Forman
curvature using (2), which is a very simple expression that is extremely fast to
compute and depends only on the degrees of the endpoints of the edge
under consideration.
The drawback of this simplicity is that it is not always very descriptive, since
under combinatorial weights, the 1D Forman curvature gives information only
about the number of edges directly connected to the edge under consideration. It
generally assigns lower curvature to clique-like components of the graph rather
than tree-like components, since an edge in a clique is connected directly to all of
the other edges in the clique, while in a binary tree it is only directly connected
to 3 other edges.
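Under combinatorial weights, (2) can be evaluated for every edge in a few lines; the sketch below uses NetworkX and a standard example graph.

```python
# Combinatorial 1D Forman curvature, Eq. (2): F(e) = 4 - deg(v1) - deg(v2).
import networkx as nx

def forman_1d(G):
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

G = nx.karate_club_graph()
curv = forman_1d(G)
print(min(curv.values()), max(curv.values()))   # most edges are negatively curved
```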
Augmented Forman Curvature. Taking 2-dimensional faces (cycles) into account yields the augmented Forman curvature
$$
F^{\#}_{\mathrm{full}}(e) = w_e \left[ \left( \sum_{e < f} \frac{w_e}{w_f} + \sum_{v < e} \frac{w_v}{w_e} \right) - \sum_{\hat{e} \parallel e} \left| \sum_{\hat{e}, e < f} \frac{\sqrt{w_e \cdot w_{\hat{e}}}}{w_f} - \sum_{v < e,\, v < \hat{e}} \frac{w_v}{\sqrt{w_e \cdot w_{\hat{e}}}} \right| \right], \qquad (3)
$$
where $a \parallel b$ denotes that $a$ is parallel to $b$, i.e., $a$ and $b$ have a common higher
or lower dimensional graph face; $a < b$ denotes that $a$ is a graph face of $b$; and
the rest of the notation is as in Definition 1 (here the faces are also weighted).
Following [15], we may consider solely 3-cycles and again, under combinato-
rial weights, (3) reduces to
$$
F^{\#}(e) = 4 - \deg(v_1) - \deg(v_2) + 3t, \qquad (4)
$$
where t is the number of 3-cycles (triangles) containing the edge e.
Haantjes Curvature. Less common than the Forman curvatures, the Haantjes
curvature [8] has the simplest and most intuitive definition.
Definition 3. Consider a graph where all weights are set to 1 (i.e., the combi-
natorial case). For two nodes v1 , v2 in a graph and an edge e between them, the
Haantjes curvature is given by κ2H (e) = t, where t is as in (4).
The Haantjes curvature is a metric curvature, thus in the network case it
takes into account solely edge weights. Definition 3 is commonly used in graphics
settings and simply counts the triangles adjacent to a given edge. The Haantjes
curvature is typically higher for clique-like components of a graph than for tree-
like components. It is trivially nonnegative, which is also in contrast with 1D
Forman, where the majority of edges usually have negative curvature. The aug-
mented Forman curvature can be thought of as a balance between 1D Forman
and Haantjes curvatures.
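Under combinatorial weights, both triangle-based curvatures reduce to simple neighbourhood counts. The sketch below computes the Haantjes curvature of Definition 3 and the reduced augmented Forman curvature of (4) for every edge of an example graph.

```python
# Haantjes curvature = number of triangles t containing an edge (Definition 3);
# augmented Forman curvature = 4 - deg(v1) - deg(v2) + 3t, as in Eq. (4).
import networkx as nx

def triangles(G, u, v):
    return len(set(G[u]) & set(G[v]))   # common neighbours close 3-cycles with (u, v)

def haantjes(G):
    return {(u, v): triangles(G, u, v) for u, v in G.edges()}

def augmented_forman(G):
    return {(u, v): 4 - G.degree(u) - G.degree(v) + 3 * triangles(G, u, v)
            for u, v in G.edges()}

G = nx.karate_club_graph()
print(max(haantjes(G).values()), max(augmented_forman(G).values()))
```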
Balanced Forman Curvature. The BFC proposed by [21] aims to balance between
computational complexity and the richness of structural information associated
with neighboring edges. It takes into account 3- and 4-cycles, as well as “loose”
neighboring edges, i.e., those that do not create 3- or 4-cycles.
experimental work. It operates in the same spirit as Ricci flow, where regions of
negative or low curvature are identified and compensated by an opposite effect
depending on the negativity in order to smooth the manifold. Additionally, it
incorporates a mechanism to prevent a blow-up on the graph size. The algorithm
intakes a graph and produces another graph where the regions of the most neg-
atively curved edges of the input graph are augmented with additional edges to
increase the curvature at those regions.
At each iteration, the algorithm chooses the edge with the lowest curvature;
candidate edges to add to support the lowest curvature edge; and the edge to
add from candidates with softmax probability (regulated with a temperature
parameter τ ) to increase curvature, where this latter value is calculated as the
difference between curvature of the lowest curvature edge before and after adding
the support edge. The algorithm then chooses the edge with the highest curvature
and, if this curvature value surpasses a certain threshold, removes this edge from
the graph, ensuring a bound on the size of the graph. The process repeats until
either the convergence is reached (no additional candidates and no edges to
remove) or the maximum number of iterations is reached.
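A heavily simplified sketch of this loop is given below, with the combinatorial augmented Forman curvature plugged in as the curvature notion; the candidate set, temperature tau, and removal threshold are illustrative choices, and this is not the reference SDRF implementation of [21].

```python
# Simplified curvature-guided rewiring loop in the spirit of SDRF.
import math
import random
import networkx as nx

def aug_forman(G, u, v):
    t = len(set(G[u]) & set(G[v]))            # triangles containing edge (u, v)
    return 4 - G.degree(u) - G.degree(v) + 3 * t

def rewire(G, iterations=50, tau=1.0, upper_bound=6, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        curv = {e: aug_forman(G, *e) for e in G.edges()}
        u, v = min(curv, key=curv.get)        # most negatively curved (bottleneck) edge
        ends_u, ends_v = set(G[u]) | {u}, set(G[v]) | {v}
        candidates = [(a, b) for a in ends_u for b in ends_v
                      if a != b and not G.has_edge(a, b)]
        if candidates:
            base = curv[(u, v)]
            gains = []
            for a, b in candidates:           # curvature improvement if (a, b) were added
                G.add_edge(a, b)
                gains.append(aug_forman(G, u, v) - base)
                G.remove_edge(a, b)
            weights = [math.exp(g / tau) for g in gains]
            G.add_edge(*rng.choices(candidates, weights=weights, k=1)[0])
        p, q = max(curv, key=curv.get)        # prune an overly positively curved edge
        if curv[(p, q)] > upper_bound and G.has_edge(p, q):
            G.remove_edge(p, q)
    return G

print(rewire(nx.karate_club_graph()).number_of_edges())
```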
3.2 Datasets
We studied the following 12 benchmarking datasets in our experimental study in
a supervised learning task of node classification, whose details are summarized
in Table 1: Cora [11] and Citeseer [16], which are large citations datasets con-
taining information about the presence of specific words in publications; Pubmed
[13], a large citations dataset containing information about diabetes of patients
classified into one of three classes; Cornell, Texas, and Wisconsin [5], which
are small datasets containing information about webpages collected from com-
puter science departments of corresponding universities; Chameleon, Squirrel
[14], and Actor [20], which are large datasets based on the Wikipedia networks;
Computers and Photo [17], which are large e-commerce (Amazon) datasets; and
finally, Coauthor CS [10], which is a large citation dataset with papers in com-
puter science. The last 3 datasets were not evaluated by [21].
4.1 Accuracy
Each experiment was run for 100 seeds; we report 95% confidence intervals of
mean accuracies using a z-score of 1.96. For reference and performance compari-
son, the 95% confidence intervals for the SDRF-rewiring using BFC reported by
[21] are also given for those relevant datasets.
The best two results are highlighted for each dataset in each accuracy table:
the best one in red bold, the second best in black bold (excluding the reported
BFC results from [21] for reference). The None curvature row represents results
without any rewiring. OOM indicates that an out-of-memory error occurred.
N/A in the reference BFC row for Computers, Photo, and Coauthor CS datasets
indicates that there are no reference results for these datasets as these datasets
were not studied by [21].
We see that SDRF rewiring generally improves training performance. In par-
ticular, we note that performance for the classical curvatures is generally better
than the performance without any rewiring, and often better than performance
of BFC. For some results in Table 2, the simplest form of curvature—the 1D For-
man curvature—tends to give the best results. This indicates that edges with
large sums of degrees are the graph bottlenecks and suffer from over-squashing.
The results for Haantjes curvature are the best for some of the other datasets,
which suggests that association with many 3-cycles helps an edge reduce over-
squashing. The augmented Forman curvature also yields best results for certain
experiments (with less frequency), which suggests that maintaining the balance
between the two metrics may reduce over-squashing most effectively.
Note, however, that the experiments upon rerunning yielded results that dif-
fer quite significantly, especially for small datasets (Cornell, Texas, Wisconsin).
For example, Table 2 shows that the Haantjes curvature seems to generally bring
the best results in the first run, while the augmented Forman curvature performs
best in the second run. More importantly, it is often the case that the correspond-
ing results (dataset–curvature pairs) for different rewirings for the two runs are
often not within the respective 95% confidence intervals, indicating a lack of
robustness of the results. One explanation could be overfitting of the average
accuracy to one instance of the SDRF rewiring, which can significantly impact
the average performance. This is especially true for the small datasets, for which
the rewiring of multiple edges can have a greater impact on the graph structure
than for larger datasets. The results for these datasets also differ significantly
between each curvature type, and compared to no rewiring. Moreover, the BFC
results differ more significantly for these datasets than for others with respect
to the reference BFC results.
To further investigate the intuition that adding or deleting edges on smaller
graphs impact the overall graph structure more significantly, we re-ran exper-
iments for Cora, Citeseer, Cornell, Texas and Wisconsin datasets with
rewiring for each seed. These datasets were selected given that rewiring was the
fastest (as will be discussed further on in considering computational runtime).
Table 2. 95% confidence intervals of mean accuracies for given datasets and curvature
types given in percentages of two experimental runs (first two tables for the first run,
last two for the second).
Table 3 presents results from two runs with rewiring for every seed, which are
shown to be significantly more robust. The sizes of the 95% confidence intervals
are comparable to those reported previously in Table 2, but only two pairs of
corresponding runs are not contained in the 95% confidence intervals of one
another (Cornell–Haantjes and Texas–BFC). As there are 5 · 5 = 25 dataset–
curvature pairs for which the experiments were run, the mean results are indeed
robust and it is reasonable to consider the results as close to being independent
and identically distributed (i.i.d.): the probability that two or more out of 25
means of i.i.d. random variables are not within the corresponding 95% confidence
intervals is 1 − 25 · 0.05 · 0.95^24 ≈ 0.635 = 63.5%, which is high.
Furthermore, the results of these additional experiments are significantly
worse than the reference BFC results. This is likely due to the accuracies for
differently rewired graphs having been averaged out, as opposed to using the
rewiring with the best validation accuracy for the benchmarking. In contrast,
the results in Table 2 are slightly better for some dataset–curvature pairs than
the reference BFC results, and sometimes slightly worse. When actually using
the framework in practice, for the best results, the training can be performed
for several different seeds and the model with the best validation accuracy can
be chosen with the most effective rewired graph structure.
Table 3. 95% confidence intervals for selected datasets with graph rewiring performed for each
seed, given in percentages, for two runs.
We summarize the test results for rewiring instances and model parameters
pairs that achieved the best validation accuracy in the experiments reported
in the second run from Table 3 in Table 4. Only the second run is considered,
but this does not have a significantly negative impact on the robustness of the
results, since, as previously justified, the results in Table 3 are robust.
The main conclusion from these experiments is that there is no clear cur-
vature that has overall better mean performance across the multiple datasets,
but it is reasonable to deduce that using the classical curvatures for SDRF-based
rewiring can lead to significant performance improvement, often achieving better
Table 4. Percentage accuracy for the best rewiring cases from Table 3.
results than the BFC. For every dataset, performing the SDRF-based rewiring
almost always yields the best test accuracy when using one of the three classical
curvature types, compared to BFC, although no rewiring may also yield the best
results. Often, the best test accuracies were achieved using classical curvatures.
We measure the runtime for one rewiring process per curvature type and per
dataset; the measurements are given in Table 5. The runtimes here are reported
for only one instance for each dataset and each curvature type, in order to avoid
influences of spurious computational issues such as CPU and GPU occupancy
with other processes, which would become much more significant with repeated
iterations. Here, the interest lies rather in comparing the longer computation
times, which reveal the differences in computational complexity at scale.
From these results, we see that all of the classical discrete curvatures studied
have a significantly shorter computation time than the BFC. The slowest among
the three classical curvatures is the augmented Forman curvature, at scale. This
is expected, as it essentially needs to do the same calculations as 1D Forman
and Haantjes curvatures combined (computation of degrees of endpoints and
adjacent triangles for each edge).
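For concreteness, the following minimal sketch (not the authors' implementation) computes the two per-edge quantities mentioned above with networkx: the degree term used by the 1D Forman curvature of an unweighted edge, taken here as 4 − deg(u) − deg(v), and the number of triangles adjacent to the edge, which enters the Haantjes and augmented Forman curvatures. Exact formulas and normalizations vary across the literature.

```python
# Sketch (not the authors' code): per-edge ingredients of the classical curvatures
# discussed above. 1D Forman curvature of an unweighted edge (u, v) is taken as
# 4 - deg(u) - deg(v); the triangle count per edge is the quantity shared by the
# Haantjes and augmented Forman curvatures.
import networkx as nx

def edge_curvature_ingredients(G: nx.Graph):
    """Return {edge: (forman_1d, n_triangles)} for every edge of G."""
    out = {}
    for u, v in G.edges():
        forman_1d = 4 - G.degree(u) - G.degree(v)     # degree term only
        n_triangles = len(set(G[u]) & set(G[v]))      # common neighbours of u and v
        out[(u, v)] = (forman_1d, n_triangles)
    return out

if __name__ == "__main__":
    G = nx.karate_club_graph()
    print(list(edge_curvature_ingredients(G).items())[:3])
```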
For the Computers and Photo datasets, however, the computation of the
augmented Forman curvature took longer than the computation of BFC. This
suggests that for some types of graphs, possibly larger or denser graphs
(notice from Table 1 that the edge-to-node ratio is very high for these two
datasets), the BFC computation can outperform the augmented Forman curva-
ture computation in terms of computation time. Nevertheless, the 1D Forman
and Haantjes curvatures are still quicker to compute.
5 Discussion
We systematically and comprehensively studied various classical and novel dis-
crete curvatures in mitigating the over-squashing problem in training GNNs.
Following [21], we adapted discretizations of Ricci curvature and Ricci flow to
model information flow and bottleneckness of a network, respectively. In [21],
the BFC was proposed as a discretization of Ricci curvature, while the SDRF
algorithm was proposed as a discretization of Ricci flow. We tested a wide range
of classical discrete curvatures against the BFC in implementing the SDRF algo-
rithm. We found that the more classical curvatures were able to achieve performance
of the same order as the BFC in training accuracy and, at times, outperformed
the BFC. Moreover, they far outperformed it in computational runtime. We thus
found that the impact of the contribution by [21] lies in the SDRF algorithm,
rather than the BFC. We conclude that almost any of the more classical discrete
curvatures may be used over the BFC together with the SDRF algorithm for
more efficient computational runtimes.
Building on our study, future work may take into account directedness of the
graphs in the SDRF and other rewiring methods. Also, alternative non-rewiring,
discrete geometric approaches to mitigating over-squashing may be explored,
such as CGNNs [10]. Here, other computational geometric notions for networks
may be investigated, such as those arising from topological data analysis, where
persistent homology concurrently captures the topology of data as well as its
integral geometry. Such an approach would be particularly interesting when the
goal is to preserve the topology of a graph, as the CGNN does.
References
1. Alon, U., Yahav, E.: On the bottleneck of graph neural networks and its practical
implications. arXiv preprint arXiv:2006.05205 (2020)
2. Barkanass, V., Jost, J., Saucan, E.: Geometric sampling of networks. J. Complex
Netw. 10(4), cnac014 (2022)
3. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric
deep learning: going beyond euclidean data. IEEE Signal Process. Mag. 34(4),
18–42 (2017)
4. Chow, B., Knopf, D.: The Ricci Flow: An Introduction: An Introduction, vol. 1.
American Mathematical Soc. (2004)
5. Craven, M., McCallum, A., PiPasquo, D., Mitchell, T., Freitag, D.: Learning to
extract symbolic knowledge from the world wide web. Technical report, Carnegie-
Mellon Univ Pittsburgh PA School of Computer Science (1998)
6. Forman, R.: Bochner’s method for cell complexes and combinatorial ricci curvature.
Discret. Comput. Geom. 29(3), 323–374 (2003)
7. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing
for quantum chemistry. In: Precup, D., Teh, Y.W. (eds.), Proceedings of the 34th
International Conference on Machine Learning, vol. 70. Proceedings of Machine
Learning Research, 06–11 Aug, pages 1263–1272. PMLR (2017)
8. Haantjes, J.: Distance geometry. curvature in abstract metric spaces. Proc. Kon.
Ned. Akad. v. Wetensch., Amsterdam 50, 302–314 (1947)
9. Klicpera, J., Weißenberger, S., Günnemann., S.: Diffusion improves graph learning.
arXiv preprint arXiv:1911.05485 (2019)
10. Li, H., Cao, J., Zhu, J., Liu, Y., Zhu, Q., Wu, G.: Curvature graph neural network.
arXiv preprint arXiv:2106.15762 (2021)
11. McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the construction
of internet portals with machine learning. Inf. Retrieval 3(2), 127–163 (2000)
12. Najman, L., Romon, P. (eds.): Modern Approaches to Discrete Curvature. LNM,
vol. 2184. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58002-9
13. Namata, G., London, B., Getoor, L., Huang, B., Edu, U.: Query-driven active
surveying for collective classification. In: 10th International Workshop on Mining
and Learning with Graphs, vol. 8, p. 1 (2012)
14. Rozemberczki, B., Allen, C., Sarkar, R.: Multi-scale attributed node embedding.
J. Complex Netw. 9(2), cnab014 (2021)
15. Samal, A., Sreejith, R., Gu, J., Liu, S., Saucan, E.: Comparative analysis of two
discretizations of ricci curvature for complex networks. Sci. Rep. 8, 8650 (2018)
16. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collec-
tive classification in network data. AI Mag. 29(3), 93–93 (2008)
17. Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural
network evaluation. arXiv preprint arXiv:1811.05868 (2018)
18. Sigbeku, J., Saucan, E., Monod, A.: Curved markov chain monte carlo for network
learning. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-
Pardo, M. (eds.) Complex Networks & Their Applications X, pp. 461–473.
Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-
030-93413-2 39
19. Sreejith, R., Mohanraj, K., Jost, J., Saucan, E., Samal, A.: Forman curvature for
complex networks. J. Stat. Mech: Theory Exp. 2016(6), 063206 (2016)
20. Tang, J., Sun, J., Wang, C., Yang, Z.: Social influence analysis in large-scale net-
works. In: Proceedings of the 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 807–816 (2009)
21. Topping, J., Di Giovanni, F., Chamberlain, B.P., Dong, X., Bronstein, M.M.:
Understanding over-squashing and bottlenecks on graphs via curvature. arXiv
preprint arXiv:2111.14522 (2021)
Rigid Clusters, Flexible Networks
1 Introduction
Predictive models hold significance across behavioral, operational, financial,
and various other domains. Machine Learning algorithms construct a predic-
tive model by utilizing a training dataset to identify meaningful relationships
between predicting features and a predicted label.
To illustrate, consider the realm of decision-making. Human choices are
influenced by various factors, including the decision-maker’s personality and
the expected emotional response to potential decision results. A conventional
approach to solving this decision prediction problem entails uncovering distinct
causal relationships and elucidating how personality leads to specific emotional
expectations, which subsequently influence a decision. Nevertheless, such an app-
roach tends to obscure the inquiry into how personality anticipates patterns of
emotional fluctuations and the resulting array of potential decisions.
Consider, for example, a decision-maker who confronts two job options: one
guarantees a fixed salary, while the other offers higher yet uncertain earnings.
The decision involves a mental analysis of how the decision-maker will feel
when earning a low salary, a high salary, or no salary. The expected emotions
influence their decision. It is challenging to predict decisions that are emotion-
related because the decision-maker expects different emotions to result from his
decision. Moreover, expected emotions are often mixed, and their precise effect
Algorithm section outlines the proposed R&F algorithm for solving the clustering
into network models problem. In the Results section, we showcase the efficacy of
our algorithm in predicting decisions. We summarize our findings and conclude
in the Summary and Conclusions section.
2 Literature Review
The Clustering Problem. Our proposed methodology employs the R&F algo-
rithm, which commences by utilizing the Machine Learning algorithm Kmeans
[22] to cluster projects into states. This is followed by an application of the
Agglomerating Hierarchical algorithm [23] to aggregate the states into clusters.
Both algorithms address the clustering problem, which has deep roots in the
Machine Learning literature (for surveys see [31,35]). The primary objective of
clustering is to identify groups of objects, where objects within the same cluster
exhibit greater similarity to one another compared to those in different clusters,
based on specific similarity measures. In line with this definition, the solution
involves a function that maps each object to a cluster. Clustering techniques may
use various approaches, including the partitioning of networks [36] or graphs [13],
and hierarchical clustering [29] (refer to surveys such as [16,33]), each possess-
ing unique advantages and limitations [3]. These techniques find applications in
areas such as pattern recognition [16] and insight extraction [16], among others.
Symmetric Uncertainty. The intended application of the results (a model of
clusters of networks) is to predict a label, and the model’s performance depends
on how informative these networks are regarding the label value. To quantify
the information content of a network, we borrow a measure from the literature
of information theory [8,24]. Specifically, we compute the distribution of objects
among the states of the network and the distribution of their choices. We then
use these distributions to compute Symmetric Uncertainty [10,27], a measure
of how much information is shared between two features relative to the entire
information content. The Symmetric Uncertainty of two random variables mea-
sures the degree to which knowing the value of either random variable reduces
uncertainty regarding the value of the other [26]. It is derived by normalizing the
information gain to the entropies of the random variables (see [30] for definitions
of these measures). Normalization induces values in the range [0,1] and assures
symmetry. A value of zero means that the two variables are independent. For
a formal definition of Symmetric Uncertainty, see [34]. In our analysis, a higher
(vs. lower) level of Symmetric Uncertainty represents that knowing the state of
an object is more (vs. less) informative for predicting its expected label.
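As an illustration of the measure just described, the sketch below computes Symmetric Uncertainty of two discrete variables as 2·I(X;Y)/(H(X)+H(Y)) from empirical counts; the variable names and toy data are our own and are not drawn from the study.

```python
# Illustrative sketch (our reading of the cited definition): Symmetric Uncertainty
# SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), computed from empirical joint counts.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    """x: state assignment per object, y: label per object (1-D arrays)."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)          # empirical joint counts
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    hx, hy = entropy(px), entropy(py)
    mi = hx + hy - entropy(joint.ravel())        # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 0.0 if hx + hy == 0 else 2 * mi / (hx + hy)

# e.g. states of participants in one cluster vs. their DCE / non-DCE label (toy data)
print(symmetric_uncertainty(np.array([2, 2, 4, 4, 4]),
                            np.array(["DCE", "nonDCE", "DCE", "DCE", "DCE"])))
```

The normalization keeps the value in [0,1], with zero indicating independent variables, as noted above.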
Network of Networks. In the current article, we map each object to a cluster,
wherein we identify a network structure. Partitioning into clusters is designed to
achieve a set of network structures that are effective in predicting a predefined
label. Our ’clustering into networks’ approach draws parallels with techniques
found in the scientific literature concerning the development of ’networks of
networks,’ as evident from various reviews such as [14], and [15]. Our approach
3 Methods
We propose an algorithm that produces rigid clusters of flexible networks. The
algorithm is called ’Rigid & Flexible’ (R&F). It assumes the availability of
training data in which each data point includes ’stable parameters’ and ’flex-
ible parameters’ and uses Machine Learning techniques to identify a clustering
model that ensures that the objects in each cluster are not merely similar but
are grouped for the efficiency of modeling as a flexible model of states. Specifi-
cally, it uses Kmeans [22] to cluster participants into states, and a hierarchical
algorithm [23] to agglomerate the states into clusters.
We demonstrate the implementation of the R&F algorithm on the Allais
paradox [2], to show how producing rigid personality clusters of flexible emotion
Fig. 1. Allais Experiment. In the first game, the choice was between a specific outcome
and a gamble, while in the second game, it was between two gambles with different
risks and outcomes.
It is only the second type of behavior that is paradoxical for being inconsistent
with the expected utility theory (by which the participant should choose either
1A and 2A or 1B and 2B). This behavior is associated with the certainty effect
and leads to the following definition:
Definition
We say that a decision-maker is ’demonstrating a certainty effect’ and denote
it by DCE, if and only if he is ’choosing Gamble A in GAME1’ and ’choosing
Gamble B in GAME2.’
4 Algorithm
We present an algorithm that identifies ’type clusters,’ along with the ’network
of states’ in each cluster.
4.1 Experimentation
Phase I (compute M, G, and D∗): Identifying the 'states'.
1. M = Kmeans(D): M is a clustering model, built with the Kmeans algorithm using data D.
2. Mnum: Mnum is the number of clusters in model M, each representing a 'state'.
3. G : D → [0, 1, · · · , Mnum − 1]: G is a mapping function from each data point (in D) to a state (cluster in model M).
4. D∗ = centroids(M): D∗ is a data set with the Mnum centroids (cluster centers) of M.
Phase II (compute N): Identifying the list of 'state networks' by agglomerating states into networks.
Phase III (compute G∗∗): Identifying the clustering model of 'type clusters'.
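A minimal sketch of these phases, assuming scikit-learn's KMeans and AgglomerativeClustering as the clustering and agglomeration steps, is given below; the actual R&F implementation, in particular how rigid versus flexible parameters enter each phase, may differ.

```python
# Minimal sketch of Phases I-III (our interpretation, not the original R&F code).
# Assumed input: D is a (participants x features) array of flexible parameters.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def rf_sketch(D: np.ndarray, n_states: int = 10, n_clusters: int = 2):
    # Phase I: cluster data points into states with KMeans
    km = KMeans(n_clusters=n_states, n_init=10, random_state=0).fit(D)
    G = km.labels_                      # G: data point -> state
    D_star = km.cluster_centers_        # D*: state centroids
    # Phases II-III: agglomerate the state centroids into 'type clusters'
    agg = AgglomerativeClustering(n_clusters=n_clusters).fit(D_star)
    state_to_cluster = agg.labels_      # state -> cluster
    G_star_star = state_to_cluster[G]   # data point -> cluster
    return G, G_star_star

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.normal(size=(120, 24))      # 24 flexible attributes (illustrative)
    states, clusters = rf_sketch(D)
    print(states[:10], clusters[:10])
```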
4.2 Analysis
As a prior step, we standardized the data as a common practice for the follow-
ing analysis. Data analysis involves two phases: The first phase investigates the
application of the R&F algorithm to the personality-emotions data. For each
participant, the result is an assignment to a state model and to a state within
this model. The second phase examines the application of the results, utilizing
the set of state models to gain information on decisions. For each state model,
we consider the participants assigned to the cluster of this model and compute two
random variables. The first random variable represents the distribution among
the states, and the second random variable represents the distribution of the
decision property of either demonstrating a certainty effect or not. These two
random variables are used to compute Symmetric Uncertainty, which measures
how much information is shared between them. In our analysis, a higher (vs.
lower) level of Symmetric Uncertainty represents that knowing a participant’s
mapping onto a state (in the flexible emotion network) is more (vs. less) infor-
mative for knowing its label (demonstrating a certainty effect or not).
5 Results
The findings are divided into two sections, corresponding to the two phases of
the data analysis process discussed in Sect. 4.2.
In this section, we present the results of the application of the R&F algorithm
(as referenced in Sect. 4) to the data-set gathered in our extended Allais Paradox
experiment (refer to Sect. 4.1). For each participant, the outcomes offer a dual
assignment: one to a cluster representing their personality type and the other to
a specific emotion state within the state model that is associated with the cluster.
The assignment to the Rigid Cluster is determined by the six rigid properties
representing the participant’s personality. On the other hand, the assignment to
the state is delineated by the 24 flexible attributes that capture the participant’s
Fig. 2. Rigid Clusters (of personality types): Application of the R&F algorithm to
the Allais experiment data identified 10 states and aggregated them into two clusters.
This figure demonstrates the average properties (personality traits) of the participants
assigned to each cluster (Cluster-A and Cluster-B).
            State-2   State-4
nonDCE          4         5
DCE             7        25
Fig. 3. Flexible States (of emotions) within Cluster-B: the R&F algorithm
implementation on the Allais experiment data, identified 10 states and aggregated
them into two clusters. This figure considers Cluster-B and demonstrates the aver-
age properties of the participants that are assigned to each of its states (State-2 and
State-4) in terms of their reported emotions.
Kmeans algorithm, which are based on rigid parameters (personality traits), flex-
ible parameters (reported emotions), or both. Additionally, we include a random
assignment of participants into two clusters. Similar to our model, we calcu-
late Symmetric Uncertainty for these benchmark models. 2 The results of this
analysis highlight the efficacy of our methodology in uncovering patterns rich
in informational value. Notably, our R&F algorithm efficiently identifies groups
Our ‘Rigid & Flexible’ algorithm harnesses the power of both Machine Learning
and Network Analysis to carve out an effective model of flexible states within
rigid clusters. The Allais paradox serves as our experimental ground, highlighting
the efficacy of our approach for understanding how an individual’s characteris-
tics are associated with their inclination to exhibit the intriguing certainty effect.
Unlike traditional methods that mostly focus on either rigid or flexible character-
istics, we chose to decode the decision-making complexity by segmenting objects
into flexible states and then reconstructing them into rigid clusters. This shift in
perspective offers remarkable insights, linking rigid personality traits to
the valuable predictive patterns of flexible emotional states. Looking beyond, our
methodology finds applications in diverse domains. Imagine classifying machines
into distinct functional categories, each operating within its unique state space.
This approach could also be used to categorize products based on satisfaction-
related or weather-related states, segment employees according to task-related
states, and more. By pioneering a clustering approach that truly captures shared
state spaces, we are opening doors to a new realm of predictive insights.
References
1. Allais, M.L.: comportement de l’homme rationnel devant le risque: Critique des
postulats et axiomes de l’école américaine. Econometrica 21, 503–546 (1953)
2. Andersen, P.K., Keiding, N.: Multi-state models for event history analysis. Stat.
Methods Med. Res. 11(2), 91–115 (2002)
3. Arabie, P., et al.: Hierarchical classification. In: Clustering and Classification, pp. 65–121
(1996)
4. Aren, S., Hamamci, H.N.: Relationship between risk aversion, risky investment
intention, investment choices: impact of personality traits and emotion. Kyber-
netes 49(11), 2651–2682 (2020)
5. Chandrasekaran, B.: Survey of network traffic models. Washington University in
St. Louis CSE 567 (2009)
6. Lai, R.: A survey of communication protocol testing. J. Syst. Softw. 62(1), 21–46
(2002)
7. McCrae, R.R., Costa, P.T., Jr.: Personality trait structure as a human universal.
Am. Psychol. 52, 509–516 (1997)
8. Dieck, R.H.: Measurement uncertainty: methods and applications. ISA (2007)
9. Dougherty, L.R., Guillette, L.M.: Linking personality and cognition: a meta-
analysis. Philos. Trans. Royal Soc. B: Biol. Sci. 373(1756), 20170282 (2018)
10. Edwards, W.: Methods for computing uncertainties. Am. J. Psychol. 67(1), 164–
170 (1954)
11. Finucane, M.L., Alhakami, A., Slovic, P., Johnson, S.M.: The affect heuristic in
judgments of risks and benefits. J. Behav. Decis. Mak. 2000(13), 1–17 (2000)
12. Moors, A., Fischer, M.: Demystifying the role of emotion in behaviour: toward a
goal-directed account. Cogn. Emot. 33(1), 94–100 (2019)
13. Fjällström, P.: Algorithms for graph partitioning: a survey. Linköping University
Electronic Press (1998)
14. Gao, J., Li, D., Havlin, S.: From a single network to a network of networks. Natl.
Sci. Rev. 1(3), 346–356 (2014)
15. Gu, S., et al.: Modeling multi-scale data via a network of networks. Bioinformatics
38(9), 2544–2553 (2022)
16. Han, J., Pei, J., Tong, H.: Data mining: concepts and techniques. Morgan kaufmann
(2022)
17. Kassarjian, H.H.: Personality and consumer behavior: a review. J. Mark. Res. 8(4),
409–418 (1971)
18. John, O.P., Srivastava, S.: The Big five trait taxonomy: history, measurement, and
theoretical perspectives. Handbook Personal. Theory Res. 2, 102–138 (1999)
19. Lerner, J.S., Keltner, D.: Beyond valence: toward a model of emotion-specific influ-
ences on judgement and choice. Cogn. Emot. 14, 473–493 (2000)
20. Samar, S.M., Walton, K.E., McDermut, W.: Personality traits predict irrational
beliefs. J. Rational-Emotive Cognit.-Behav. Therapy 31, 231–242 (2013)
21. Sarwar, M.G., et al.: Machine learning at the network edge: a survey. ACM Com-
put. Surv. (CSUR) 54(8), 1–37 (2021)
22. Ahmed, M., Seraj, R., Islam, S.M.S.: The Kmeans algorithm: a comprehensive
survey and performance evaluation. Electronics 9(8), 1295 (2020)
23. Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview.
Wiley Interdisciplinary Rev. Data Mining Knowl. Dis. 2(1), 86–97 (2012)
24. Namdari, A., Li, Z.: A review of entropy measures for uncertainty quantifi-
cation of stochastic processes. Adv. Mech. Eng. 11(6) (2019)
25. Palan, S., Schitter, C.: Prolific.ac - a subject pool for online experiments. J. Behav.
Experim. Finance 17, 22–27 (2018)
26. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes
in C. Cambridge University Press (1988)
27. Piao, M., Piao, Y., Lee, J.Y.: Symmetrical uncertainty-based feature subset genera-
tion and ensemble learning for electricity customer classification. Symmetry 11(4),
498 (2019)
28. Kort, B., Reilly, R., Picard, R.W.: An affective model of interplay between emotions
and learning: Reengineering educational pedagogy-building a learning companion.
In: Proceedings IEEE International Conference on Advanced Learning Technolo-
gies. IEEE (2001)
29. Reddy, C.K., Vinzamuri, B.: A survey of partitional and hierarchical clustering
algorithms. Data Clustering: Alg. Appli. 87 (2013)
30. Renyi, A.: On measures of entropy and information. In: Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Con-
tributions to the Theory of Statistics, pp. 547–561. University of California Press
(1961)
31. Rokach, L.: A survey of clustering algorithms. Data Mining Knowl. Dis. Handbook,
269–298 (2010)
32. Rotter, J.B., Mulry, R.C.: Internal versus external control of reinforcement and
decision time. J. Pers. Soc. Psychol. 2, 598–604 (1965)
33. Sisodia, D., et al.: Clustering techniques: a brief survey of different clustering algo-
rithms. Inter. J. Latest Trends Eng. Technol. (IJLTET) 1(3), 82–87 (2012)
34. Song, Q., Ni, J., Wang, G.: A fast clustering-based feature subset selection algo-
rithm for high-dimensional data. IEEE Trans. Knowl. Data Eng. 25(1), 1–14 (2013)
35. Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Annals Data
Sci. 2, 165–193 (2015)
36. Xu, X., et al.: Scan: a structural clustering algorithm for networks. In: Proceedings
of the 13th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (2007)
Beyond Following: Augmenting Bot
Detection with the Integration
of Behavioral Patterns
1 Introduction
Social networks like Twitter — currently in the process of rebranding to X —
have become an integral part of our social lives. They revolutionized the way
we communicate online, shape public discourse, and provide access to the latest
news and opinions. One major issue within social networks is the prevalence of
bot accounts, which have been known to influence public opinion, especially in
critical areas like politics or financial markets [2]. It is notoriously hard to esti-
mate the true extent of the presense of bots on social media platforms, and plat-
forms may be incentivized to misrepresent them, as it could negatively impact
revenue1. In 2017, Varol et al. estimated that bots may account for up to 15% of all
Twitter accounts [13]. In another study, Cresci et al. analyzed Twitter discussions
concerning the US stock market, and concluded that up to 71% of the engaged
users might be bots [4].
Furthermore, bots seem to become more sophisticated over time [2,6], a phe-
nomenon often referred to as bot evolution. This term describes the adversarial
cycle in which newer bots evade increasingly more sophisticated bot detection
measures, by becoming progressively indistinguishable from real humans. An
illustrative example of this effect is given by the results reported in early 2017 by Cresci
et al. [3]. In this experiment, users were tasked with telling bots apart from legiti-
mate users and were only able to correctly identify newer bots with 24% accuracy,
compared to 91% for older bots. Cresci [2] points out that bot detection methods
must be able to distinguish between genuine users and bots, who disguise as gen-
uine users through stolen profile pictures and neutral messages. This complexity
has been further intensified by the advancement of artificial intelligence, partic-
ularly generative AI, which makes it more difficult to separate individual bot
accounts from genuine users. The increasing difficulty in distinguishing between
human-written and AI-generated text underscores the complexity of the issue.
This is highlighted by OpenAI’s decision to disable their AI classifier as of July
2023 due to its low accuracy in distinguishing between AI-generated and
human-generated content.2
In response to these challenges with feature-based methods, graph-based
methods are emerging as an alternative, due to their proven effectiveness in rec-
ognizing coordinated, synchronized activities [6]. By leveraging these techniques
it is not only possible to study how users interact with content, but also how they
interact with other users. The rationale behind these approaches stems from the
assumption that human-guided and authentic activities typically display more
variability than their automated, inauthentic counterparts. This emphasizes the
need to move beyond analyzing individual accounts to focusing on patterns of
suspicious coordination within groups.
However, research by Elmas et al. [5] on retweet bots, utilizing data from
services previously purchased on black market sites, discovered discrepancies
in common assumptions about bot characteristics. This included, but was not
limited to, areas of volume of activity, diversity, following and followers and
temporality. They illustrated that bots may emerge from compromised accounts,
acting as bots only for a certain period of time, and did not find a single case of one
bot following another. Such insights should prompt researchers to critically
assess whether the metrics used to evaluate the performance of bot detection
methods are in fact contributing to improving downstream applications. Hays et
al. [8] argued that this is currently not the case for Twitter bot detection tools,
attributing high performance to simplistic collection and labeling practices of the
datasets employed. Separately, Martini et al. [10] observed that different methods
1 https://storage.courtlistener.com/recap/gov.uscourts.cand.330648/gov.uscourts.cand.330648.257.0.pdf.
2 https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.
yield remarkably different results in comparison. This implies that current tools
may not be ready for downstream usage and may result in the misclassification
of many users [11].
With the heightened difficulty in identifying individual bot accounts, we focus
our efforts on group activities and their coordinated behavior patterns. Our work
is in line with trends in recent research that focuses more on actions and behavior
of groups of accounts rather than on the classification of individual accounts [1].
We investigate the potential of new sets of relations that are challenging to
circumvent; any attempts to do so could drastically limit the functionality of
organized automated actions by restricting their common operational patterns.
The goal of our research is to determine the feasibility of utilizing coordination
patterns for the purpose of bot detection, with due consideration to both the
inherent complexities and data restrictions. By recognizing these challenges, we
contrast first-order behavior-based relations, such as retweets (a user sharing a
tweet), with higher-order relations like co-retweet (two users retweet the same
tweet) and co-hashtag (two users tweet the same hashtag more than a certain
number of times). The former highlights direct user behavior, while the latter
reveals shared interests or subjects, uncovering subtler collective actions. This
approach is set against the current conventional method of utilizing follow rela-
tions, which are more static. Utilizing the same dataset and graph neural network
architecture across our experiments, we conduct a comparative study between
the conventional follow relations and those centered around behavioral patterns
to assess their impact on bot detection, avoiding the introduction of new uncer-
tainties through algorithmic changes or dataset variations. Though our results
did not surpass the conventional approach, they remain competitive in terms of
accuracy and F1-score, demonstrating the viability of this approach. To the best
of our knowledge, this is the first work that integrates higher-order relations in
a behavior-based approach for bot detection.
2 Methodology
2.1 Dataset
We utilize the TwiBot-22 dataset for our experiments. Compared to previous
datasets, TwiBot-22 includes a broader and more diverse range of relations.
For an in-depth exploration of the dataset’s conceptual framework, we refer
the readers to the work of Feng et al. [6] that introduced TwiBot-22. Previous
bot detection methods were constrained to rely only on follower/following
relationships between user entities and an implicit relation between users and
their tweets. The TwiBot-22 dataset encompasses an extensive set of 14 different kinds of
relations. In this work we leverage the follower (user a is followed by user b),
following (user a follows user b), retweet (tweet a retweets tweet b), post
(user a posts tweet b), and discuss (tweet a discusses hashtag b) relations.
We believe that this range of relations offers a lot of potential for future devel-
opment of more sophisticated and accurate bot detection methods. The acces-
sibility of these diverse relations not only enhances our analytical capabilities
Table 1. Statistics (left) and in-depth analysis (right) of human and bot characteristics
in TwiBot-22. ∗users with at least 1 tweet. † with at least 1 follower / following.
3 Somewhat counter-intuitively, the total following and follower counts do not match.
This is due to specifics of data collection; see [6] for insights into the process.
2.2 BotRGCN
BotRGCN (Bot detection with Relational Graph Convolutional Networks) [7] is a
graph-based method for Twitter bot detection. The model first creates a multi-
modal encoding by jointly encoding multiple numerical and categorical user
properties, as well as encoding user tweets and descriptions using a pre-trained
RoBERTa model. These encodings serve to represent individual users, captur-
ing diverse aspects of their behavior and characteristics. A heterogeneous graph
is constructed by defining multiple relational neighborhoods for each Twitter
user. BotRGCN applies relational graph convolutional networks (RGCN), which
support a variable number of relations, allowing the model to capture complex
patterns of interactions between users. We chose to work with BotRGCN due to
its modular and well-designed architecture that allows for easy modification and
experimentation. The model was used with the initialization of hyperparameters
as found in the original implementation, available at the corresponding Github
repository.4 Adjustments were made to accommodate the specific number of cat-
egorical and numerical properties in TwiBot-22. The architecture and specific
components of BotRGCN are further detailed in Table 2.
Table 2. BotRGCN architecture.

Input Layers    Description Embedding               Linear (R^(Ds × D/4)) + LeakyReLU
                Tweet Embedding                     Linear (R^(Ts × D/4)) + LeakyReLU
                Numerical Properties Embedding      Linear (R^(Ns × D/4)) + LeakyReLU
                Categorical Properties Embedding    Linear (R^(Cs × D/4)) + LeakyReLU
Hidden Layers   Input Transformation                Linear (R^(D × D)) + LeakyReLU
                RGCN 1st Layer                      RGCN Convolution (R^(D × D))
                RGCN 2nd Layer                      RGCN Convolution (R^(D × D))
                Hidden Transformation               Linear (R^(D × D)) + LeakyReLU
Output Layer    Final Output                        Linear (R^(D × 2))
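The following schematic sketch mirrors the layer stack of Table 2 using PyTorch Geometric's RGCNConv; the feature encoders are collapsed into pre-computed input tensors and all dimensions are illustrative. It is not the original BotRGCN code, which is available in the repository cited above.

```python
# Schematic sketch of the Table 2 layer stack (illustrative, not the cited repository).
import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv

class BotRGCNSketch(nn.Module):
    def __init__(self, d_des, d_tweet, d_num, d_cat, D=128, num_relations=5):
        super().__init__()
        quarter = D // 4
        # one D/4-dimensional embedding per input modality
        self.enc_des = nn.Sequential(nn.Linear(d_des, quarter), nn.LeakyReLU())
        self.enc_tweet = nn.Sequential(nn.Linear(d_tweet, quarter), nn.LeakyReLU())
        self.enc_num = nn.Sequential(nn.Linear(d_num, quarter), nn.LeakyReLU())
        self.enc_cat = nn.Sequential(nn.Linear(d_cat, quarter), nn.LeakyReLU())
        self.inp = nn.Sequential(nn.Linear(D, D), nn.LeakyReLU())
        self.conv1 = RGCNConv(D, D, num_relations=num_relations)
        self.conv2 = RGCNConv(D, D, num_relations=num_relations)
        self.hid = nn.Sequential(nn.Linear(D, D), nn.LeakyReLU())
        self.out = nn.Linear(D, 2)

    def forward(self, des, tweet, num, cat, edge_index, edge_type):
        x = torch.cat([self.enc_des(des), self.enc_tweet(tweet),
                       self.enc_num(num), self.enc_cat(cat)], dim=1)
        x = self.inp(x)
        x = self.conv1(x, edge_index, edge_type)
        x = self.conv2(x, edge_index, edge_type)
        return self.out(self.hid(x))    # per-user bot / human logits
```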
the majority of bot accounts in their dataset had more followers than accounts
they were following, and no two bots followed each other.
Moreover, the authors also observed different retweet behaviour for bots, both
temporal as well as quantitative. This insight, coupled with the observation of
bot evolution, led us to investigate the potential offered by new sets of relations.
Inspired by work from Vargas et al. [12], which builds upon coordination
patterns from [9] we introduce the following relations:
– Retweet: a user retweeted the tweet of another user.
– Co-Retweet: two users retweeted the same tweet.
– Co-Hashtag: two users tweet the same hashtag above a certain threshold.
These relations are behavior-based, which makes them harder to manipulate
than, e.g., follower and following relations. We believe that this approach
has the potential to reveal additional patterns of coordinated behavior among
users. However, none of these are readily usable for us out-of-the-box and require
some data transformation steps.
Retweet: Our analysis showed that bots tend to retweet disproportionately.
In order to take advantage of this, we first need to transform the existing
retweet relation from tweet→tweet to user→user. By cross-referencing the
given retweet relation with the post relation (user→tweet), we are able to
associate a user for each tweet and subsequently derive the retweet relation
in the form of user→user. This process is illustrated in Fig. 1.
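A minimal pandas sketch of this transformation is given below; the column names (user, tweet, src_tweet, dst_tweet) are our own placeholders rather than the TwiBot-22 schema.

```python
# Sketch: map tweet->tweet retweet edges to user->user edges by joining twice
# against the post (user->tweet) relation. Column names are assumptions.
import pandas as pd

post = pd.DataFrame({"user": ["u1", "u2", "u3"],
                     "tweet": ["t1", "t2", "t3"]})
retweet = pd.DataFrame({"src_tweet": ["t2", "t3"],   # t2 retweets t1, t3 retweets t1
                        "dst_tweet": ["t1", "t1"]})

user_retweet = (retweet
    .merge(post, left_on="src_tweet", right_on="tweet").rename(columns={"user": "src_user"})
    .merge(post, left_on="dst_tweet", right_on="tweet").rename(columns={"user": "dst_user"})
    [["src_user", "dst_user"]])
print(user_retweet)   # user->user retweet relation (u2 retweeted u1, u3 retweeted u1)
```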
Co-Retweet: We introduce this relation to emphasize instances where two
users retweeted the same tweet. To achieve this, we map a user to each tweet
that retweets another tweet, similar to the process laid out in retweet above.
Then, we group these users by their retweeted target tweet. From these groups,
we create all possible combinations of users (excluding pairs with the same user
twice) and export them as our new co-retweet relation.
Co-Hashtag: Using a similar grouping and pairing approach as with the Co-
Retweet relation, we focus on the discuss relation (tweet→hashtag). Prior to
the pairing step, we filter out hashtags with an unusually large number of users to
decrease computational demands and filter out those hashtags that do not offer
any reasonable insight. After this step, we create pairs of users who tweeted
the same hashtag a minimum of n times. The choice of n can be regarded as
a hyperparameter itself and is detailed further in the subsequent experiments
section and Table 3.
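The sketch below illustrates one possible implementation of this pairing step, counting the number of distinct hashtags shared by each unordered user pair and keeping pairs with at least n shared hashtags; the column names are assumptions, and the hashtag pre-filtering step is omitted. The co-retweet relation can be derived analogously by grouping on the retweeted target tweet instead of the hashtag.

```python
# Sketch of the co-hashtag derivation (assumed column names, pre-filtering omitted).
from collections import Counter
from itertools import combinations
import pandas as pd

def co_hashtag(post: pd.DataFrame, discuss: pd.DataFrame, n: int) -> pd.DataFrame:
    # join post (user->tweet) with discuss (tweet->hashtag) to get user->hashtag
    user_hashtag = post.merge(discuss, on="tweet")[["user", "hashtag"]].drop_duplicates()
    pair_counts = Counter()
    for _, group in user_hashtag.groupby("hashtag"):
        users = sorted(group["user"].unique())
        for a, b in combinations(users, 2):          # all unordered user pairs
            pair_counts[(a, b)] += 1
    pairs = [(a, b, c) for (a, b), c in pair_counts.items() if c >= n]
    return pd.DataFrame(pairs, columns=["user_a", "user_b", "shared_hashtags"])

post = pd.DataFrame({"user": ["u1", "u2", "u1"], "tweet": ["t1", "t2", "t3"]})
discuss = pd.DataFrame({"tweet": ["t1", "t2", "t3"], "hashtag": ["#a", "#a", "#b"]})
print(co_hashtag(post, discuss, n=1))
```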
3 Experiments
To determine the feasibility of utilizing coordination patterns for bot detection
we conducted sensitivity and ablation studies. We kept hyperparameters con-
stant across all experiments. The model is initialized with the same parameters
as mentioned in Subsect. 2.2. We further fixed the dropout rate at 0.3, the learn-
ing rate at 0.001, and weight decay at 0.005. Furthermore, we standardized the
Fig. 1. Visualization of the process of deriving the new Co-Hashtag (co hashtag)
relation. Initially, the edge file is split into individual relations (not depicted). We
then join the post and discuss relation to associate user-ids with each hashtag in
the discuss table. In this example we assume a threshold amount value of 100, below
which co hashtag occurrences are discarded. We then create pairs of users with the
respective count of how often they share a hashtag. Lastly, we keep only those with at
least n shared hashtags and discard the amount column to get the expected format.
number of training epochs to 200 across all experimental runs. We reused the
train/test split that comes with TwiBot-22, for comparability with prior work.
First, we defined a threshold for the Co-Hashtag relation. The threshold
was set to three standard deviations above the mean, with values provided in
Table 3. Since the differences between the thresholds were minor, we chose the
one that achieved the highest F1-score, indicating the most reliable predictions.
Additional experimentation with the sets and quantities of relations can be ref-
erenced in Table 4. Notably, the follower relation yielded the best results, as
opposed to the common follower+following combination. This matches the
intuition that this relation can be a strong indicator. Our main interest, however,
was on the newly derived behavioral relations, with follow relationships serving
as a baseline for comparison.
Table 3. Sensitivity study of the co-hashtag edge creation threshold. The Amount
column corresponds to the parameter n, representing the minimum number of times
pairs of users tweeted the same hashtag. We run each experiment five times and report
the average value as well as the standard deviation in parentheses.
relation, there is evident improvement when using three or five relations instead
of two. Our concerns regarding these biases are outlined in Subsect. 2.1 dedicated
to the dataset. This highlights the potential of a multi-relational approach, but it is
essential to note that inherent characteristics of the used dataset might influence
these observations. Such results are particularly significant, as bot developers
may find it challenging to avoid behavior-based detection without substantially
constraining their capabilities. Building on the findings from Feng et al. [7],
where it was confirmed that the optimal performance is achieved with 2 layers
of RGCN, we have carried out an ablation study of BotRGCN, utilizing the same
layer configuration. Our experiments, as detailed in Table 5, show that the inte-
gration of all available modalities remains essential for robust bot detectors. The
challenge requires a multi-faceted approach, integrating various modalities. This
approach must then model the aggregation of these signals, aiming to ensure a
clear distinction between accounts involved in automated coordinated efforts and
those demonstrating authentic behavior, which may stem from social initiatives.
Table 5. Ablation Study of BotRGCN under different relation types using 2 layers of
RGCN. Abbreviations used: T = User Tweets; N = User Numerical Properties; C = User
Categorical Properties; D = User Descriptions. We run each experiment five times and
report the average value as well as the standard deviation in parentheses.
4 Conclusion
The complexity of bots continues to evolve, making the task of bot detection
a critical challenge. Our investigation into alternative higher-order, behavioral-
based relations emphasizes a different approach in detecting automated coordi-
nated group activities. Although not surpassing the conventional approach, the
competitiveness of our results suggests a reliable method without falling into sus-
pected biases of traditional techniques. Bot developers seeking to avoid detection
may find it increasingly difficult without limiting their capacities. TwiBot-22,
the dataset used in this study, has been instrumental in establishing these new
relations. Yet, as we look into further research, the incorporation of temporal
patterns into these newly established relations seems promising. This direction,
however, necessitates datasets that support it, a limitation we currently face.
We are optimistic that pursuits in this direction can foster the development of
more robust and reliable detection methods.
Acknowledgements. The authors thank Ali Alhosseini for his guidance during the
early conceptual phase and Lukas Drews for his collaboration in the initial experiments.
References
1. Cinelli, M., Cresci, S., Quattrociocchi, W., Tesconi, M., Zola, P.: Coordinated
inauthentic behavior and information spreading on twitter. Decision Support Syst.
160, 113819 (2022)
2. Cresci, S.: A decade of social bot detection. Commun. ACM 63(10), 72–83 (2020)
3. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., Tesconi, M.: The paradigm-
shift of social spambots: evidence, theories, and tools for the arms race. In: Pro-
ceedings of the 26th International Conference on World Wide Web Companion,
pp. 963–972 (2017)
4. Cresci, S., Lillo, F., Regoli, D., Tardelli, S., Tesconi, M.: Cashtag piggybacking:
uncovering spam and bot activity in stock microblogs on twitter. ACM Trans.
Web (TWEB) 13(2), 1–27 (2019)
5. Elmas, T., Overdorf, R., Aberer, K.: Characterizing retweet bots: The case of black
market accounts. In: Proceedings of the International AAAI Conference on Web
and Social Media, vol. 16, pp. 171–182 (2022)
6. Feng, S., Tan, Z., Wan, H., Wang, N., Chen, Z., Zhang, B., Zheng, Q., Zhang,
W., Lei, Z., Yang, S., et al.: Twibot-22: towards graph-based twitter bot detection.
Adv. Neural. Inf. Process. Syst. 35, 35254–35269 (2022)
7. Feng, S., Wan, H., Wang, N., Luo, M.: Botrgcn: Twitter bot detection with rela-
tional graph convolutional networks. In: Proceedings of the 2021 IEEE/ACM Inter-
national Conference on Advances in Social Networks Analysis and Mining, pp.
236–239 (2021)
8. Hays, C., Schutzman, Z., Raghavan, M., Walk, E., Zimmer, P.: Simplistic collec-
tion and labeling practices limit the utility of benchmark datasets for twitter bot
detection. In: Proceedings of the ACM Web Conference 2023, pp. 3660–3669 (2023)
9. Keller, F.B., Schoch, D., Stier, S., Yang, J.: Political astroturfing on twitter: how
to coordinate a disinformation campaign. Polit. Commun. 37(2), 256–280 (2020)
10. Martini, F., Samula, P., Keller, T.R., Klinger, U.: Bot, or not? comparing three
methods for detecting social bots in five political discourses. Big Data & Society
8(2), 20539517211033566 (2021)
11. Rauchfleisch, A., Kaiser, J.: The false positive problem of automatic bot detection
in social science research. PLoS ONE 15(10), e0241045 (2020)
12. Vargas, L., Emami, P., Traynor, P.: On the detection of disinformation campaign
activity with network analysis. In: Proceedings of the 2020 ACM SIGSAC Confer-
ence on Cloud Computing Security Workshop, pp. 133–146 (2020)
13. Varol, O., Ferrara, E., Davis, C., Menczer, F., Flammini, A.: Online human-bot
interactions: Detection, estimation, and characterization. In: Proceedings of the
International AAAI Conference on Web and social media, vol. 11, pp. 280–289
(2017)
Graph Completion Through Local
Pattern Generalization
Zhang Zhang1,2 , Ruyi Tao1,2 , Yongzai Tao3 , Mingze Qi4 , and Jiang Zhang1,2(B)
1
School of Systems Science, Beijing Normal University, Beijing, China
zhangjiang@bnu.edu.cn
2
Swarma Research, Beijing, China
3
College of Computer Science and Technology, Zhejiang University,
Hangzhou, China
4
College of Science, National University of Defense Technology, Changsha, China
1 Introduction
Networks form the underlying structures of numerous systems and hold signifi-
cant implications in both scientific research and everyday life [1,2]. One approach
to understanding these complex systems involves studying the properties of the
networks that underlie them. However, obtaining a complete network structure
is often infeasible due to factors such as measurement errors, privacy concerns,
and other limitations [3–5]. For instance, while online social network data can
be readily collected, there are ’offline’ nodes that may exert significant influ-
ence at certain times but remain difficult to capture due to the unavailability
of offline data. Consequently, there is a pressing need for methodologies capa-
ble of inferring missing information in incomplete networks. The methodologies
R. Tao and Y. Tao—These authors contributed equally.
2 Results
2.1 The Network Completion Problem
Problem Definition. Suppose we have an undirected network G(V, E) with
an adjacency matrix A. This network cannot be fully observed, as information
about some nodes and their corresponding edges is missing. Instead, we can
only observe a sub-graph Go of G, which contains some observed vertices Vo and
edges Eo between them. Assume we know the number of missing nodes Nm . Our
objective is to infer the missing part of the network, denoted as Gm = G − Go ,
which includes the missing nodes Vm and the missing edges Em . We can reorder
the nodes such that the observable nodes are placed at the beginning of the
sequence. This allows us to divide the adjacency matrix into two sub-matrices:
one for the observable nodes (Ao ) and another for the connections related to
unobserved nodes (Am ). Thus, the task is to reconstruct the whole adjacency
matrix A = Ao + Am based on Ao , where Am is an inverted L-shaped matrix
describing the connections between Vo and Vm , as well as between Vm themselves,
as shown in Fig. 1.
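A small numpy sketch of this bookkeeping, using our own notation, is shown below: after reordering so that the observed nodes come first, Ao is the upper-left No × No block, and a boolean mask marks the inverted-L region to be inferred.

```python
# Sketch (our notation): split an adjacency matrix into the observed block A_o
# and a boolean mask covering the inverted-L region of unknown entries.
import numpy as np

def split_adjacency(A: np.ndarray, N_o: int):
    A_o = A[:N_o, :N_o]                       # observed-observed block
    mask_unknown = np.ones_like(A, dtype=bool)
    mask_unknown[:N_o, :N_o] = False          # True on the inverted-L region
    return A_o, mask_unknown

A = np.random.default_rng(0).integers(0, 2, size=(6, 6))
A = np.triu(A, 1); A = A + A.T                # symmetric, zero diagonal (toy graph)
A_o, mask = split_adjacency(A, N_o=4)
print(A_o.shape, mask.sum(), "entries to infer")
```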
Overview of the C-GIN Framework. The C-GIN model assumes that a net-
work’s different areas share common connection patterns. Using a Graph Auto-
Encoder, C-GIN captures these patterns from the partially observed subgraph to
fill in the missing network portions. The Graph Neural Network (GNN) within
the Auto-Encoder’s Encoder learns the formation of local structures around
nodes in two steps: first, by generating initial node embeddings from a linear
layer, and then by obtaining final embeddings through message-passing layers.
We selected the Graph Isomorphism Network (GIN) [25] for its superior expres-
sive power among various GNN models like GCN [13], GAT [26], and Graph-
SAGE [27].
To extend learned patterns to the unobserved area, the Graph Auto-Encoder
encodes the network structures into node embeddings. The observed part is
used to compute the loss function, thereby compelling the Graph Neural Network
Fig. 2. The Overview of our C-GIN model: Our model employs GIN to learn the
local connection patterns of the network’s known portion and iteratively complete the
missing elements. Each iteration comprises an encoding stage and a decoding stage.
During the encoding stage, initial node features represented by one-hot vectors, along
with the inferred complete network from the previous iteration, are fed into GIN.
This results in updated node feature vectors as output. In the decoding stage, these
updated feature vectors are used to generate a matrix that represents the probabilities
of connections between each node pair. The observable portion of this probability
matrix is utilized to calculate the loss. The remainder of the matrix is rescaled by
multiplying it with a scaling factor γ. Finally, the adjacency matrix for the next epoch
is sampled according to these adjusted probabilities.
Next, we delve into the specific execution flow of the algorithm. According
to the problem definition, we have partial information about how the observed
nodes are connected, represented by Ao , which forms the upper-left quadrant
of the adjacency matrix. We also assume knowledge of the number of missing
nodes, Nm . In the initial stage, we populate the adjacency matrix of the unknown
part with zeros to obtain an N × N matrix, denoted as Â. We use one-hot
vectors as the initial features, X = I, for all nodes, allowing the neural network’s
linear layers to independently determine each node’s embedding vector. This
initialized matrix Â and feature vector X are then fed into the GIN encoder,
which is known to excel in learning local connection patterns [25], to update
the node feature vector H. The initial vector X = I remains unchanged during
the training process. The GIN gradually learns how to map from one-hot vectors
to appropriate node embedding vectors to represent the network structure. This
operation can be expressed by Eq. 1:
H = GINθ (Â, X). (1)
In this context, θ denotes the parameters of the GIN encoder. Once we have
the encoded node features, they are then fed into the decoder as described by
Eq. 2. This results in a decoded probability matrix PN ×N . Each element Pi,j
represents the probability that nodes vi and vj are connected.
Pi,j = 1 / (1 + exp(−⟨Hi, Hj⟩)).    (2)
Note that the probability matrix P is divided into two sections. The upper-
left subsection, a No × No matrix, represents the connection probabilities for
the observed part of the network. This subsection is used to calculate the loss
function. The loss function, in turn, helps to optimize the parameters θ within
the GIN encoder. The specific definition of the loss function is provided in Eq. 3.
L(θ) = − Σ_{i=0}^{No−2} Σ_{j=i+1}^{No} [ Ai,j log(Pi,j) + (1 − Ai,j) log(1 − Pi,j) ].    (3)
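For illustration, the following PyTorch sketch implements the decoder of Eq. 2 and the observed-part loss of Eq. 3 on toy tensors; it is a reading of the equations above, not the authors' implementation.

```python
# Sketch of the inner-product decoder (Eq. 2) and the observed-part BCE loss (Eq. 3).
import torch

def decode(H: torch.Tensor) -> torch.Tensor:
    # P_ij = sigmoid(<H_i, H_j>)
    return torch.sigmoid(H @ H.t())

def observed_loss(P: torch.Tensor, A_o: torch.Tensor, N_o: int) -> torch.Tensor:
    # binary cross-entropy over the upper triangle of the observed block only
    iu = torch.triu_indices(N_o, N_o, offset=1)
    p = P[:N_o, :N_o][iu[0], iu[1]].clamp(1e-7, 1 - 1e-7)
    a = A_o[iu[0], iu[1]].float()
    return -(a * torch.log(p) + (1 - a) * torch.log(1 - p)).sum()

H = torch.randn(8, 16, requires_grad=True)    # node embeddings (e.g. GIN output)
A_o = (torch.rand(5, 5) > 0.5).int()
A_o = torch.triu(A_o, 1); A_o = A_o + A_o.t() # toy observed adjacency block
loss = observed_loss(decode(H), A_o, N_o=5)
loss.backward()                               # gradients flow back into H
```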
2.2 Experiments
Baseline Models. The baselines were chosen from three different types of
network completion algorithms for comparison:
2.3 Metrics
We evaluate the effectiveness of our model and the baseline methods in net-
work completion tasks by treating the completion of the inverted L-shaped
region of the adjacency matrix as a binary classification problem. Performance
is assessed using the area under the Receiver Operating Characteristic (ROC)
curve (AUC). To ensure a balanced evaluation, we follow previous literature [21]
in randomly sampling an equal number of positive and negative edges. Specifi-
cally, we introduce two sets of these metrics, namely AUC_Observed−Unobserved and
AUC_Unobserved−Unobserved, to rigorously evaluate the accuracy of inferred connec-
tions between both observed and unobserved nodes, and exclusively unobserved
nodes, respectively.
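A sketch of this balanced AUC evaluation, under our own naming and not the authors' evaluation code, could look as follows, where region_mask selects either the observed-unobserved or the unobserved-unobserved entries.

```python
# Sketch: balanced AUC over a region of the adjacency matrix, sampling as many
# negative (non-edge) entries as there are positive (edge) entries.
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_auc(P: np.ndarray, A_true: np.ndarray, region_mask: np.ndarray, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.argwhere(region_mask & (A_true == 1))
    neg = np.argwhere(region_mask & (A_true == 0))
    neg = neg[rng.choice(len(neg), size=min(len(pos), len(neg)), replace=False)]
    idx = np.vstack([pos, neg])
    y_true = A_true[idx[:, 0], idx[:, 1]]
    y_score = P[idx[:, 0], idx[:, 1]]
    return roc_auc_score(y_true, y_score)
```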
To calculate AUC, it is necessary to compare the predicted connection prob-
ability matrix with a ground-truth adjacency matrix. However, a direct com-
parison is not feasible due to the alignment requirement for the inferred unob-
served nodes with the actual ones. Specifically, a permutation matrix is needed to
reorder the rows and columns corresponding to the unobserved nodes in the prob-
ability matrix generated by the network completion algorithm. This reordering
aims to make the rearranged probability matrix resemble the ground-truth adja-
cency matrix as closely as possible. Considering there are Nm ! possible permuta-
tions, we opt for the best-performing permutation in the performance comparison
for fairness. This problem is known as sub-graph matching [28]. We employ the
Seeded Graph Matching [29] algorithm to address this challenge. According to
existing literature, this algorithm achieves a matching accuracy exceeding 90%
when the similarity between the two matrices in question is greater than 90%
and the number of nodes to be matched exceeds 15. Details can be referred
to [29].
Note that we did not use the Seeded Graph Matching method to reorder the
probability matrix returned by KronEM, because this algorithm outputs not only
the learned matrix Am but also the node alignment.
matrix. This indicates that the C-GIN model excels at modeling connections between
unobserved nodes. In the 'observed-unobserved' section, the C-GIN model also
attained the best results for two biological networks. However, G-GCN achieved
superior performance on the citation network (Cora). This discrepancy may
stem from the inherent aptitude of G-GCN for modeling growing networks such
as citation networks. To further investigate which types of networks the C-GIN model
is most effective for, we will examine the relationship between model performance
and structural features in the following section.
Fig. 3. C-GIN Performance on W-S and Empirical Networks: The left two panels
show the experiments on the W-S network. The x-axis is the reconnecting probability p
of the WS network. This figure has two y-axes: one is CC and the other is the AUC
difference, in which we replace the output of the GIN encoder with a randomly generated
matrix of the same shape. The right panel shows the relation between model performance
and Reachable CC on several empirical networks.
While the Clustering Coefficient (CC) offers some insight into local connec-
tivity, it falls short in capturing more complex structures. For instance, in a
2D grid network, first-order neighbors are not interconnected, maintaining a CC
of zero despite complex local connections involving second-order neighbors. To
A^(n) = sgn( Σ_{i=1}^{n} A^i ) − Σ_{i=1}^{n−1} A^(i),    (5)

A_n = Σ_{i=1}^{n} A^(i) ∗ (1 − λ)^(i−1),    (6)
where A^(1) = A. In Eq. 5, we get a matrix A^(n) that represents the n-order
connection between nodes; specifically, if there is a path of length n between
node i and node j, then A^(n)_{i,j} = 1, otherwise A^(n)_{i,j} = 0. In Eq. 6, we get A_n by
a weighted summation of the A^(i), where λ refers to the decay index, which can also
be understood as the reaching cost corresponding to the path length. If λ is 0,
it means that node i can reach node j at no cost. If λ is 1, it means that node
i cannot reach any second-order and above neighbors, and then An degenerates
into the original adjacency matrix A. We calculate the clustering coefficient on
the newly obtained An , resulting in Reachable CC.
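The following sketch computes Reachable CC under our reading of Eqs. 5 and 6: the matrices A^(i) are built from powers of A, combined into A_n with weights (1 − λ)^(i−1), and a weighted clustering coefficient is then taken on A_n. The exact weighted-clustering convention used by the authors may differ.

```python
# Sketch of Reachable CC (our reading of Eqs. 5-6), using networkx's weighted
# average clustering on the combined matrix A_n.
import numpy as np
import networkx as nx

def reachable_cc(A: np.ndarray, n: int = 3, lam: float = 0.5) -> float:
    A = A.astype(float)
    reached = np.zeros_like(A)                # union of A^(1) ... A^(i-1)
    A_n = np.zeros_like(A)
    power = np.eye(len(A))
    for i in range(1, n + 1):
        power = power @ A                     # i-th matrix power of A
        A_i = np.sign(power) * (1 - np.sign(reached))   # pairs first reached at order i
        np.fill_diagonal(A_i, 0)
        A_n += A_i * (1 - lam) ** (i - 1)     # decay-weighted combination (Eq. 6)
        reached = np.sign(reached + A_i)
    G = nx.from_numpy_array(A_n)
    return nx.average_clustering(G, weight="weight")

A = nx.to_numpy_array(nx.grid_2d_graph(4, 4))
print(reachable_cc(A))                        # non-zero despite CC(A) being zero
```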
In Fig. 3’s right panel, we demonstrate a positive correlation between edge
completion performance and Reachable CC across various networks, including
real, Grid, and Circulant networks. This trend can be understood through the
model’s mechanism: it learns local structure patterns from observed nodes and
extrapolates to unobserved nodes. With a high Reachable CC, higher-order
neighbors are more interconnected, making it easier for the GIN-based encoder
to learn the network’s structural pattern. As a result, C-GIN performs better in
networks with higher Reachable CC.
2.4 Discussion
In this study, we introduce the C-GIN model to address the network completion
problem when certain nodes and their edges are unobserved. Utilizing a Graph
Auto-Encoder, C-GIN learns the local connectivity patterns within the observed
portions of the network and generalizes these patterns to unknown regions of the
adjacency matrix. Experimental results confirm that C-GIN outperforms bench-
mark models on both synthetic and real-world networks, particularly excelling
at completing edges between unobserved nodes. To delve deeper into the model’s
nature, we introduce the Reachable CC metric, which gauges the likelihood of
connecting edges between higher-order neighbors within a network. C-GIN per-
forms notably better on networks with higher Reachable CC values.
Despite its contributions, the model has limitations. For instance, it presup-
poses knowledge of the number of unobserved nodes, a parameter often unknown in real-world
scenarios. Future research could focus on estimating this number. Moreover, cer-
tain networks, like Cora, offer node-specific features, while others, such as those
A Consistent Diffusion-Based Algorithm
for Semi-Supervised Graph Learning
1 Introduction
The principle of heat diffusion has proved instrumental in graph mining [5].
It has been applied for many different tasks, including pattern matching [10],
ranking [7], embedding [4], clustering [11], classification [2,6,13,14] and feature
propagation [9]. In this paper, we focus on the task of semi-supervised node
classification: given labels known for a few nodes of the graph, referred to as the
seeds, how to infer the labels of the other nodes? A popular approach consists
in using diffusion in the graph, under boundary constraints, a problem known
in physics as the Dirichlet problem [14]. Specifically, one Dirichlet problem is
solved per label, setting at 1 the temperature of the seeds with this label and at
0 the temperature of the other seeds. Each node is then assigned the label with
the highest temperature over the different Dirichlet problems. In this paper, we
prove using a simple block model that this algorithm is actually not consistent,
unless the temperatures are centered before label assignment. This step of tem-
perature centering not only makes the algorithm consistent but also brings substantial performance gains on real datasets. This is a crucial observation given the popularity of the algorithm¹.
The rest of this paper is organized as follows. In Sect. 2, we introduce the
Dirichlet problem on graphs. Section 3 describes our algorithm for node classifi-
cation. The analysis showing the consistency of our algorithm on a simple block
model is presented in Sect. 4. Section 5 presents some experimental results and
Sect. 6 concludes the paper.
¹ The number of citations of the paper [14] exceeds 4,000 in 2023, according to Google Scholar.
The Laplacian matrix of the graph is L = D − A, where A denotes the adjacency matrix and D the diagonal matrix of node degrees.
Now let S be some strict subset of {1, . . . , n} and assume that each node i ∈ S is assigned some fixed temperature T_i. We are interested in the evolution of the temperatures of the other nodes, which we refer to as the free nodes. We assume that heat exchanges occur through each edge of the graph proportionally to the temperature difference between the corresponding nodes, so that:

$$\forall i \notin S, \quad \frac{dT_i}{dt} = \sum_{j=1}^{n} A_{ij}\,(T_j - T_i),$$
that is,

$$\forall i \notin S, \quad \frac{dT_i}{dt} = -(LT)_i,$$
where T is the vector of temperatures, of dimension n. This is the heat equation in discrete space. At equilibrium, the vector T satisfies Laplace's equation:

$$\forall i \notin S, \quad (LT)_i = 0. \qquad (1)$$
With the boundary constraint giving the temperature T_i for each node i ∈ S, this defines a Dirichlet problem. Observe that Laplace's equation (1) can be written equivalently:

$$\forall i \notin S, \quad T_i = (PT)_i, \qquad (2)$$
where P = D^{-1}A is the transition matrix of the random walk in the graph. We now characterize the solution to the Dirichlet problem (1). Without any loss of generality, we assume that free nodes (i.e., not in S) are indexed from 1 to n − s, so that the vector of temperatures can be written

$$T = \begin{pmatrix} X \\ Y \end{pmatrix},$$
Fig. 1. Binary classification of the Karate Club graph with 2 seeds (indicated with a
black circle). Blue nodes have label 0, red nodes have label 1. (Color figure online)
In the general case with K labels, we use a one-against-all strategy: the seeds
of each label alternately serve as hot sources (temperature 1) while all the other
seeds serve as cold sources (temperature 0). After centering the temperatures
(so that the mean temperature of each diffusion is equal to 0), each node is
assigned the label that maximizes its temperature. This algorithm, which we refer to as the Dirichlet classifier, is parameter-free.
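A minimal sketch of the Dirichlet classifier follows, assuming the graph is given as a NetworkX graph and the seeds as a node-to-label dictionary; the linear solve of Laplace's equation (1) restricted to the free nodes and the centering step follow the description above, while variable names and the solver choice are ours.

```python
import numpy as np
import networkx as nx

def dirichlet_classifier(G, seeds):
    """seeds: dict {node: label}. Returns a dict {node: predicted label}.
    One Dirichlet problem is solved per label (seeds of that label at
    temperature 1, other seeds at 0); temperatures are centered before
    taking the argmax over labels."""
    nodes = list(G.nodes)
    index = {u: i for i, u in enumerate(nodes)}
    L = nx.laplacian_matrix(G, nodelist=nodes).toarray().astype(float)
    seed_idx = np.array([index[u] for u in seeds])
    free_idx = np.array([i for i in range(len(nodes)) if i not in set(seed_idx)])
    labels = sorted(set(seeds.values()))

    temps = np.zeros((len(nodes), len(labels)))
    for k, label in enumerate(labels):
        T = np.zeros(len(nodes))
        T[seed_idx] = [1.0 if seeds[nodes[i]] == label else 0.0 for i in seed_idx]
        # Laplace's equation on the free nodes: L_ff T_f = -L_fs T_s
        A = L[np.ix_(free_idx, free_idx)]
        b = -L[np.ix_(free_idx, seed_idx)] @ T[seed_idx]
        T[free_idx] = np.linalg.solve(A, b)
        temps[:, k] = T - T.mean()            # temperature centering

    return {u: labels[int(np.argmax(temps[index[u]]))] for u in nodes}

G = nx.karate_club_graph()
print(dirichlet_classifier(G, {0: 0, 33: 1}))
```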
4 Analysis
In this section, we prove the consistency of Algorithm 1 on a simple block model.
In particular, we highlight the importance of temperature centering (line 9 of
the algorithm) for the consistency of the algorithm.
Consider the Dirichlet problem when the temperature of the s1 seeds of block 1
is set to 1 and the temperature of the other seeds is set to 0. We have an explicit
solution to this Dirichlet problem, given by Lemma 1. All proofs are deferred to
the appendix.
4.3 Classification
We now state the main result of the paper: the Dirichlet classifier is a consistent
algorithm for the block model, in the sense that all nodes are correctly classified
whenever p > q.
Theorem 1. If p > q, then the predicted label of each free node i of block k is
ŷi = k, for any n1 , . . . , nK (label distribution) and s1 , . . . , sK (seed distribution).
their temperature is the highest in the Dirichlet problem associated with label 1. In view of Lemma 1, this means that for all k = 2, . . . , K,

$$s_1 q\,\frac{n_1(p-q)+nq}{s_1(p-q)+nq} \;+\; s_1(p-q)\left(1-\sum_{j=1}^{K}\frac{(n_j-s_j)\,q}{s_j(p-q)+nq}\right) \;>\; s_k q\,\frac{n_k(p-q)+nq}{s_k(p-q)+nq}.$$
This condition might be violated even if p > q, depending on the parameters n_1, . . . , n_K and s_1, . . . , s_K. In the simplest case of K = 2 blocks, with p = 10^{-1} and q = 10^{-2} for instance, the classification is incorrect in the following two asymmetric cases:
– Seed asymmetry (blocks of same size but different number of seeds): n_1 = n_2 = 100; s_1 = 10, s_2 = 5,
– Label asymmetry (blocks with the same number of seeds but different sizes): n_1 = 100, n_2 = 10; s_1 = s_2 = 5.
This sensitivity of the algorithm to both forms of asymmetry will be confirmed
by the experiments. The step of temperature centering is crucial for consistency.
5 Experiments
In this section, we show the impact of temperature centering on the quality of
classification using both synthetic and real data. The Python code is available as a Jupyter notebook in Python², making the experiments fully reproducible.
We now focus on real datasets available from the SNAP³ collection and the NetSet⁴ collection, restricting to graphs having ground-truth labels. All graphs are considered undirected.
³ https://snap.stanford.edu/.
⁴ https://netset.telecom-paris.fr/.
Table 2. Macro-F1 scores (mean ± standard deviation) without and with temperature centering; (a) 5% of seeds. Columns: Dataset, No centering, Centering, Variation.
6 Conclusion
We have proposed a novel approach to node classification based on heat diffusion.
Specifically, our technique consists in centering the temperatures of each solution
to the Dirichlet problem before classification. We have proved the consistency
of this algorithm on a simple block model and shown that the temperature
centering brings significant performance gains on real datasets. This is a crucial
observation given the popularity of the algorithm.
The question of the consistency of the algorithm when the mean temperature
is computed over free nodes (instead of all nodes) remains open. Another inter-
esting research perspective is to extend our proof of consistency of the algorithm
to stochastic block models, where edges are drawn at random [1].
Appendix
A Proof of Lemma 1
Proof. In view of (2), we have:

$$(n_1(p-q)+nq)\,T_1 = s_1 p + (n_1-s_1)p\,T_1 + \sum_{j\neq 1}(n_j-s_j)\,q\,T_j,$$

$$(n_k(p-q)+nq)\,T_k = s_1 q + (n_k-s_k)p\,T_k + \sum_{j\neq k}(n_j-s_j)\,q\,T_j,$$

for k = 2, . . . , K. We deduce:

$$(s_1(p-q)+nq)\,T_1 = s_1 p + Uq,$$

$$(s_k(p-q)+nq)\,T_k = s_1 q + Uq \quad \forall k = 2, \dots, K,$$

with

$$U = \sum_{j=1}^{K}(n_j-s_j)\,T_j.$$
B Proof of Theorem 1
Proof. Let Δ_k^{(1)} = T_k − T̄ be the deviation of temperature of the non-seed nodes of block k for the Dirichlet problem associated with label 1. In view of Lemma 1, we have:

$$(s_1(p-q)+nq)\,\Delta_1^{(1)} = s_1(p-q)\,(1-\bar T),$$

$$(s_k(p-q)+nq)\,\Delta_k^{(1)} = -s_k(p-q)\,\bar T, \qquad k = 2, \dots, K,$$
For p > q, using the fact that T̄ ∈ (0, 1), we get Δ_1^{(1)} > 0 and Δ_k^{(1)} < 0 for all k = 2, . . . , K. By symmetry, for each label l = 1, . . . , K, Δ_l^{(l)} > 0 and Δ_k^{(l)} < 0 for all k ≠ l. We deduce that for each block k, ŷ_i = arg max_l Δ_k^{(l)} = k for each free node i of block k.
References
1. Airoldi, E.M., Blei, D.M., Fienberg, S.E., Xing, E.P.: Mixed membership stochastic
blockmodels. J. Mach. Learn. Res. (2008)
2. Berberidis, D., Nikolakopoulos, A.N., Giannakis, G.B.: Adadif: Adaptive diffusions
for efficient semi-supervised learning over graphs. In: International Conference on
Big Data. IEEE (2018)
3. Chung, F.R.: Spectral graph theory. American Mathematical Soc. (1997)
4. Donnat, C., Zitnik, M., Hallac, D., Leskovec, J.: Learning structural node embed-
dings via diffusion wavelets. In: International Conference on Knowledge Discovery
& Data Mining. ACM (2018)
5. Kondor, R.I., Lafferty, J.: Diffusion kernels on graphs and other discrete structures.
In: Proceedings of the 19th international conference on machine learning (2002)
6. Li, Q., An, S., Li, L., Liu, W.: Semi-supervised learning on graph with an alter-
nating diffusion process. CoRR (2019)
7. Ma, H., King, I., Lyu, M.R.: Mining web graphs for recommendations. IEEE Trans-
actions on Knowledge and Data Engineering (2011)
8. Newman, M.E.J., Girvan, M.: Mixing patterns and community structure in
networks. In: Pastor-Satorras, R., Rubi, M., Diaz-Guilera, A. (eds.) Statistical
Mechanics of Complex Networks, pp. 66–87. Springer Berlin Heidelberg, Berlin,
Heidelberg (2003). https://doi.org/10.1007/978-3-540-44943-0 5
9. Rossi, E., Kenlay, H., Gorinova, M.I., Chamberlain, B.P., Dong, X., Bronstein,
M.M.: On the unreasonable effectiveness of feature propagation in learning on
graphs with missing node features. In: Proceedings of Machine Learning Research
(2022)
10. Thanou, D., Dong, X., Kressner, D., Frossard, P.: Learning heat diffusion graphs.
IEEE Transactions on Signal and Information Processing over Networks (2017)
11. Tremblay, N., Borgnat, P.: Graph wavelets for multiscale community mining. IEEE
Transactions on Signal Processing (2014)
12. Zachary, W.W.: An information flow model for conflict and fission in small groups.
J. Anthropol. Res. (1977)
13. Zhu, X.: Semi-supervised learning with graphs. Ph.D. thesis, Carnegie Mellon Uni-
versity (2005)
14. Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-supervised learning using gaussian
fields and harmonic functions. In: Proceedings of the 20th International conference
on Machine learning (2003)
Leveraging the Power of Signatures
for the Construction of Topological
Complexes for the Analysis
of Multivariate Complex Dynamics
1 Introduction
Topological Data Analysis (TDA) is a new field with a wide range of applications in fields such as finance, neuroscience, and medicine. TDA addresses the prob-
lem of accounting for groupwise interactions in the data and therefore opens
very promising prospects to better apprehend complex phenomena than models
relying on pairwise interactions only. Several tools from algebraic topology, such
as homology groups, homotopy groups, Betti numbers, etc. can be put to work
in building a set of relevant features that can capture the intricate nature of
dependencies.
Some compelling examples of the benefits of using topological features appear
regularly in the literature. In [17] the homological features of brain functional
networks are shown to take different values in two states depending on the
absorption of some drug. More generally, it is shown in [21] that homological cycles in structural brain networks find connections between regions of early and late evolutionary origin. TDA can also be used efficiently in dynamical settings. One very intriguing example is change detection, as illustrated in the study of functional brain networks conditioned on different tasks [22]. In [12], it is investigated how the connectivity of speech-related brain regions changes in different scenarios of speech perception.
In the present paper, we focus on the analysis of dynamical high dimensional
phenomena and on the problem of constructing associated relevant topological
structures, with the aim of proposing new computational tools for deepening
our understanding of the higher order structures hidden in time series data. Our
main contribution is a new approach to building statistically informed simplicial
complexes and possibly more general structures.
Our two main tools will be the basic objects of TDA and Signature theory.
Signatures were recently proposed as a very powerful feature map for time series
and dynamical systems in [5,7,14]. The introduction of Signatures for building
topological structures for high dimensional dynamical phenomena is new and
appears as a key and very natural ingredient that can accurately account for
orientations of the various simplices in the complex at hand while capturing the
main shape features from the dynamics. Using Signature in a statistical/machine
learning context is an approach which is adopted in a growing number of appli-
cations nowadays [7] and our work is also intended to illustrate the relevance of
Signature theory combined with statistics/machine learning for building a higher
order interaction modelling framework.
In mathematical terms, our proposal is based on the assumption that k-
simplices are simply sets of k nodes with their time series attached to them,
with an orientation prescribed by the ordering of the nodes in computing their
associated k-Signature. Recall that the orientation encoded in the computation of
the associated signature carries potentially very interesting interpretation about
the causal dependencies of the times series [11]. In the next step, the relevance
of incorporating a simplex into our simplicial complex is assessed using a purely
statistical procedure: each oriented k-simplex is associated with a corresponding
k-Signature that is included into a set of multivariate features that is used to
predict the Signatures of all the other potential simplices. More precisely, our
construction is a generalisation of the approach developed by Meinshausen and
Bühlmann in [16] for Gaussian Graphical Models. Simplices selected from the set of potential simplices whose Signature can confidently regress or predict¹ the Signature associated with a target simplex are included as candidates for being considered adjacent to this target. Using this procedure, we obtain a construction of a sim-
plicial complex that accurately incorporates the statistical relationships between
all the simplices in terms of regression or prediction, while keeping track of the
inherent orientations of the simplices.
¹ For time-dependent Signatures.
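The selection step can be sketched in the spirit of [16] as follows: each candidate simplex is represented by its (flattened) Signature, a target simplex is regressed on the others with a LASSO, and simplices receiving a non-zero coefficient are retained as neighbours of the target. The feature layout, the regularisation strength and the thresholding rule below are illustrative choices rather than the exact procedure of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_neighbours(signatures, target, alpha=0.05):
    """signatures: dict {simplex (tuple of node ids): 1-D signature vector},
    all vectors having the same length (e.g. a fixed truncation level).
    Returns the simplices whose signatures help predict the target's."""
    others = [s for s in signatures if s != target]
    X = np.column_stack([signatures[s] for s in others])  # one column per candidate
    y = signatures[target]
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    return [s for s, c in zip(others, coef) if abs(c) > 1e-8]

# Toy usage with random "signatures" of three 1-simplices and one 2-simplex.
rng = np.random.default_rng(0)
sigs = {(0, 1): rng.normal(size=20), (1, 2): rng.normal(size=20),
        (0, 2): rng.normal(size=20), (0, 1, 2): rng.normal(size=20)}
print(select_neighbours(sigs, target=(0, 1, 2)))
```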
The plan of the paper is as follows. In Sect. 2, we recall the necessary back-
ground on topological data analysis and Signature theory. In Sect. 3, we present
our method for constructing the simplicial complex using the Signatures of the
simplices and the LASSO algorithm. In Sect. 4, we present our numerical exper-
iments on real datasets. A conclusion section completes the paper.
which lies in $\mathbb{R}^{d \times d \times \cdots \times d}$ ($j$ factors).
Chen’s identity is a very useful result that allows to compute the Signature
recursively based on linear interpolation of observed values of a trajectory.
Theorem 1 (Chen’s identity). Let X : [a, b] → Rd and Y : [b, c] → Rd .
Consider the concatenation of X and Y (noted X ∗ Y ) defined by:
(X ∗ Y ) : [a, c] → Rd
X(t) , t ∈ [a, b]
t →
X(b) − Y (b) + Y (t) , t ∈ [b, c].
2
The k-truncated version of the signature is S (1) (X) ⊕ S (2) (X) ⊕ · · · ⊕ S (k) (X).
Then: $S(X * Y) = S(X) \otimes S(Y)$.
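A numerical illustration of Chen's identity at truncation level 2, using only NumPy; the piecewise-linear paths and the segment-by-segment recursion are illustrative (for a single linear segment with increment Δ, S^(1) = Δ and S^(2) = Δ ⊗ Δ / 2).

```python
import numpy as np

def signature_level2(path):
    """Level-1 and level-2 signature of a piecewise-linear path
    (rows = sample points), built segment by segment with Chen's identity:
    S1(X*Y) = S1(X) + S1(Y),  S2(X*Y) = S2(X) + S2(Y) + S1(X) (x) S1(Y)."""
    d = path.shape[1]
    S1, S2 = np.zeros(d), np.zeros((d, d))
    for a, b in zip(path[:-1], path[1:]):
        inc = b - a                      # linear segment: S1 = inc, S2 = outer(inc, inc) / 2
        S2 = S2 + np.outer(inc, inc) / 2 + np.outer(S1, inc)
        S1 = S1 + inc
    return S1, S2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)).cumsum(axis=0)          # a 3-dimensional sample path
Y = X[-1] + rng.normal(size=(30, 3)).cumsum(axis=0)  # second piece, starting after X
concat = np.vstack([X, Y])
Ypath = np.vstack([X[-1:], Y])                       # include X's endpoint as Y's start

S1X, S2X = signature_level2(X)
S1Y, S2Y = signature_level2(Ypath)
S1, S2 = signature_level2(concat)
# Chen's identity: levels 1 and 2 of the concatenated path match the products.
print(np.allclose(S1, S1X + S1Y), np.allclose(S2, S2X + S2Y + np.outer(S1X, S1Y)))
```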
We now turn to some useful definitions from topology. Consider a set of n vertices
V = {v1 , . . . , vn }.
Definition 2. For k < n, a k-simplex σ_k of V is the collection of a subset V_k of size k + 1 and all its subsets. The geometric realization of a k-simplex is the convex hull C of k + 1 points, such that dim(C) = k. A face (of dimension l) of σ_k is a collection of sets in σ_k that forms an l-simplex (l ≤ k).
Remarks
Let us now address some technical questions that arise from the proposed construction.
Time dependency: The algorithm is applied to evolving time series, for which the computation of the Signatures is updated incrementally and prediction is performed using these updated Signatures as time increases.
Orientation: This algorithm gives a natural orientation on every simplex, as S(X_i, X_j) ≠ S(X_j, X_i), which is often of great potential use for interpretability.
https://github.com/ben2022lo/conf-complex-network
Fig. 1. Histograms of the observed duration for the discovered 1 and 2-simplices
Fig. 2. Simplicial complex constructed with persistent simplices. 1-simplices are blue, 2-simplices are defined by their orange 1-simplex faces. The gray signifies the coincidence of a 1-simplex and one face of a 2-simplex. The selected ROIs and numbered 0-simplices are matched by the following dictionary: 0 - LH Vis 9, 1 - LH SomMot 4, 2 - LH DorsAttn Post 4, 3 - Cont Par 1, 4 - LH Cont pCun 1, 5 - LH Default pCunPCC 1, 6 - LH Default pCunPCC 2, 7 - RH Cont Par 1, 8 - RH Cont PFCl 1, 9 - RH Cont pCun 1, 10 - RH Default Par 1, 11 - RH Default PFCdPFCm 1, 12 - RH Default PFCdPFCm 2, 13 - RH Default PFCdPFCm 3, 14 - RH Default pCunPCC 2.
References
1. Borkar, K., Chaturvedi, A., Vinod, P.K., Bapi, R.S.: Ayu-characterization of
healthy aging from neuroimaging data with deep learning and rsfmri. Front. Com-
put. Neurosci. 16, 940922 (2022)
2. Broyd, S.J., Demanuele, C., Debener, S., Helps, S.K., James, C.J., Sonuga-Barke,
E.J.: Default-mode brain dysfunction in mental disorders: a systematic review.
Neurosci. Biobehav. Rev. 33, 279–96 (Oct 2008)
3. Chazal, F., Michel, B.: An introduction to topological data analysis: fundamental
and practical aspects for data scientists. Front. Artif. Intell. 4, 108 (2021)
4. Chen, K.-T.: Integration of paths, geometric invariants and a generalized baker-
hausdorff formula. Ann. Math. 65(1), 163–178 (1957)
5. Chevyrev, I., Kormilitzin, A.: A primer on the signature method in machine learn-
ing. arXiv preprint arXiv:1603.03788 (2016)
6. Eckmann, J.P., Genève, U.: Martin Hairer got the fields medal for his study of the
KPZ equation
7. Fermanian, A.: Embedding and learning with signatures. Comput. Stat. Data Anal.
157, 107148 (2021)
8. Fermanian, A., Marion, P., Vert, J.-P., Biau, G.: Framing RNN as a kernel method:
A neural ode approach. Adv. Neural. Inf. Process. Syst. 34, 3121–3134 (2021)
9. Friz, P.K., Hairer, M.: A Course on Rough Paths: With an Introduction to Regu-
larity Structures. Springer International Publishing, Cham (2020)
10. Friz, P.K., Victoir, N.B.: Multidimensional stochastic processes as rough paths:
theory and applications, vol. 120 Cambridge University Press (2010)
11. Giusti, C., Lee, D.: Iterated integrals and population time series analysis. In: Baas,
N.A., Carlsson, G.E., Quick, G., Szymik, M., Thaule, M. (eds.) Topological Data
Analysis: The Abel Symposium 2018, pp. 219–246. Springer International Publish-
ing, Cham (2020). https://doi.org/10.1007/978-3-030-43408-3 9
12. Kim, H., Hahm, J., Lee, H., Kang, E., Kang, H., Lee, D.S.: Brain networks
engaged in audiovisual integration during speech perception revealed by persis-
tent homology-based network filtration. Brain Connectivity 5(4), 245–258 (2015)
13. Kormilitzin, A., Vaci, N., Liu, Q., Ni, H., Nenadic, G., Nevado-Holgado, A.: An
efficient representation of chronological events in medical texts. arXiv preprint
arXiv:2010.08433 (2020)
14. Lyons, T., McLeod, A.D.: Signature methods in machine learning. arXiv preprint
arXiv:2206.14674 (2022)
15. Lyons, T., Qian, Z.: System control and rough paths. Oxford University Press
(2002)
16. Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection
with the lasso (2006)
17. Petri, G., et al.: Homological scaffolds of brain functional networks. J. Royal Society
Interface. 11(101), 20140873 (2014)
18. Posner, M.I., Petersen, S.E.: The attention system of the human brain. Annual
Rev. Neurosci. 13, 25–42 (Feb 1990)
19. Santoro, A., Battiston, F., Petri, G., Amico, E.: Higher-order organization of mul-
tivariate time series. Nat. Phys. 19(2), 221–229 (2023)
20. Schaefer, A., et al.: Local-global parcellation of the human cerebral cortex from
intrinsic functional connectivity MRI. (July 2017)
21. Sizemore, A.E., Giusti, C., Kahn, A., Vettel, J.M., Betzel, R.F., Bassett, D.S.:
Cliques and cavities in the human connectome. J. Comput. Neurosci. 44(1), 115–
145 (2018). https://doi.org/10.1007/s10827-017-0672-6
22. Stolz, B.J., Harrington, H.A., Porter, M.A.: Persistent homology of time-dependent
functional networks constructed from coupled time series. Chaos: An Interdiscip.
J. Nonlinear Sci. 27(4) (2017)
23. Yeo, B.T., et al.: The organization of the human cerebral cortex estimated by
functional correlation. J. Neurophysiol. 106, 1125–1165 (June 2011)
Empirical Study of Graph Spectra
and Their Limitations
1 Introduction
Graphs have several matrix representations. The adjacency and the several
Laplacian matrices are instances of these representations. Spectral decompo-
sition of these matrices is used for many tasks in the study of graphs, especially
those with community structure (complex networks) [5]. For example, it is used
for vertex clustering [4,7,10,11,19,21,22] and in the “eigengap heuristic” [4,19].
In this short article, we identify the limitations of spectral decomposition of these
graph matrix representations. In doing so, we also compare results obtained by
decomposing the two most commonly used matrix representations, the adjacency
and the symmetric normalized Laplacian matrices.
Our empirical investigations reveal that normalized Laplacian eigenvalues
(eigenvectors) are extremely sensitive to noise and scale. This noise manifests
itself in the form of edge randomness. This randomness is a consequence of high inter-block edge probability or small block size. We find that increases in noise or graph
size lead to eigenvalue uniformity, which renders spectral methods of limited
use. Indeed, spectral techniques rely on eigenvalue differentiability. Obviously,
2 Previous Work
As mentioned earlier, spectral techniques are widely used in the study of graphs.
The foundations of this area of study were laid by Chung [6] and later by Spiel-
man [24]. In this short work, we focus on three areas of the spectral graph analysis
literature. First, our experiments motivate us to examine past inquiries into the
convergence of eigenvalues, under the stochastic block model (SBM) [23]. Then,
in order to gain a better understanding of this convergence, we also survey stud-
ies in which authors have established a link between Laplacian eigenvalues and
vertex degree [5,27]. Finally, we are also motivated by authors who have high-
lighted the differences in the conclusions of analyses based on adjacency and
Laplacian matrix representations [18,21].
Asymptotic convergence of eigenvectors (and consequently eigenvalues) under
the SBM was identified by Rohe et al. [23]. These authors posit the existence of a
“population Laplacian” towards which the empirical Laplacian converges, as the
graph grows. While our results do not agree with these authors’ conclusions, we
also document convergence with increased graph size and noise in connectivity.
Under random graph models, like the SBM or the Erdös-Rényi-Gilbert
(ERG) model [9,13], edge probabilities are independent of each other and
only depend on the nodes they are connecting. These models have often been
described as too simplistic to represent real world networks [1,2,15,20,23]. In
particular, the degree uniformity yielded by these generative models, i.e., the lack of skewness or a heavy right tail, has been identified as a weakness when they are used as models of real-world networks.
Nevertheless, random graph models have been found to be adequate in many
empirical cases [10,11,16,20,22]. For example, Newman et al. [20] state that
“in some cases random graphs with appropriate distributions of vertex degree
predict with surprising accuracy the behavior of the real world”. In closing, while
a detailed examination of this debate on realistic models of real world networks is
beyond the scope of our work, we do note that some authors claim that networks
with power-law degree distributions are rare (e.g., [3]). We also note that the
SBM is still used as a model of real world networks in the recent literature
(e.g., [12]).
3 Mathematical Background
As discussed earlier, there are several matrix representations of graphs. The two
most commonly used are the adjacency matrix (A) and the symmetric normal-
ized Laplacian (L). Because of its symmetry and specific properties, we use the
symmetric normalized Laplacian, instead of the unnormalized or random walk
Laplacians, in this work.
In order to study the link between graph characteristics and spectra, we gen-
erate several synthetic random graphs with known structure. Indeed, by modi-
fying the parameters of random graph generative models, we are able to isolate
and unambiguously observe the sensitivities of the spectra. We use the Planted
Partition Model (PPM), a special case of the SBM, for its clarity. Using the
PPM also allows us to compare our conclusions to those reported in the litera-
ture (e.g., [23]). We use the Python NetworkX library [14,25,26] to generate our
graphs.
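For illustration, a PPM graph of the kind used in the experiments below can be generated and its spectra computed in a few lines; the parameter values mirror one configuration of Sect. 4, while the choice of eigengap index is our own illustrative reading of the "eigengap heuristic".

```python
import networkx as nx
import numpy as np

# Planted Partition Model: K blocks of n nodes, intra/inter edge probabilities.
K, n, p_in, p_out = 50, 50, 0.9, 0.1
G = nx.planted_partition_graph(K, n, p_in, p_out, seed=42)

# Spectra of the two usual matrix representations.
A = nx.to_numpy_array(G)
L = nx.normalized_laplacian_matrix(G).toarray()
adj_eigs = np.sort(np.linalg.eigvalsh(A))
lap_eigs = np.sort(np.linalg.eigvalsh(L))

print("largest adjacency eigenvalues:", adj_eigs[-3:])
print("non-zero Laplacian eigenvalue range:", lap_eigs[1], "-", lap_eigs[-1])
# Gap between the K-th and (K+1)-th smallest Laplacian eigenvalues,
# i.e. at the theoretical location of the K planted blocks.
print("eigengap at K:", lap_eigs[K] - lap_eigs[K - 1])
```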
For the remainder of this article, we will use the following naming conven-
tions:
– N = |V | is the total number of vertices,
– nk is the number of vertices in block k,
– di is the degree of the i−th vertex and
– λ0 = 0 < λ1 ≤ . . . ≤ λN are the eigenvalues of the (N × N ) normalized
symmetric matrix L,
– Because we only consider graphs with one connected component, only the
first eigenvalue is equal to 0, all others are strictly positive.
Throughout this article, the matrix A is symmetric and all matrix elements (edge
weights wij ) are binary (i.e., graphs are unweighted & undirected). Also, because
we only consider simple graphs, all diagonal elements are equal to zero (i.e., no
self-loops ⇔ Aii = wii = 0).
Here, Pk denotes the probability that two arbitrary nodes in block k are con-
nected by an edge. Similarly, Pkm denotes the probability that an arbitrary
node in block k is connected to an arbitrary node in block m (typically, P_k > P_{km}).
Vertex Degree Under PPM. For the remainder of this document, we will
use the following variable naming conventions to describe the PPM graphs:
– d_i = d_i^{in} + d_i^{out} is the degree of node i.
– d_i is the sum of connections to nodes within the same block (d_i^{in}) and to nodes in other blocks (d_i^{out}).
– Pin , Pout are the within/between-block edge probabilities,
– N is the total number of vertices,
– n is the number of vertices within any given block and, finally,
– K = N/n is the total number of blocks or partitions (under the PPM, all
blocks have the same number of nodes).
In Eqs. 1 and 2, we clearly see how the (expected) degree of a vertex can be
partitioned. Degree can be understood as the cardinality of the union of a set
of connections to nodes within the same block and to nodes on the remaining
graph. We use this partition to examine the sensitivity of graph spectra to overall
degree, but also specific generative model parameters. Specifically, we examine
the relationships between spectra and graph size, block size and inter-block edge
probability. The analysis of these relationships is a useful tool in understanding
the applicability and limitations of spectral graph techniques. While we use the
PPM for its transparency, our conclusions reveal critical information about the
relationship between graph structures and spectra. This relationship transcends generative models, because what is examined is the link between graph structure (especially degree) and spectra.
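Although Eqs. 1 and 2 themselves are not reproduced here, under the PPM the expected within- and between-block degrees take the standard form below (our reconstruction from the stated conventions, not a quotation of the original equations):

$$\mathbb{E}\big[d_i^{in}\big] = P_{in}\,(n-1), \qquad \mathbb{E}\big[d_i^{out}\big] = P_{out}\,(N-n),$$

so that $\mathbb{E}[d_i] = P_{in}(n-1) + P_{out}(N-n)$.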
4 Empirical Tests
To examine the sensitivity of graph spectra to graph size, block size and to
noise from increased inter-block edge probability, we conduct four sets of exper-
iments using the Planted Partition Model. In all experiments, we begin with a
graph with small block size, number of blocks or inter-block probability. We then
gradually increase these parameters and observe the effect on spectra. We stop
our sensitivity tests, when a pattern appears (or disappears, e.g., eigengap) in
the spectra. For completeness, we examine the spectra of both the symmetric
normalized Laplacian and adjacency matrices. While most studies of graph spec-
tra examine the normalized symmetric Laplacian (e.g., [6,19,24]), the adjacency
matrix remains the most basic matrix representation of a graph.
In our experiments, growth in size occurs in two different ways. In the first
case, growth occurs in the sizes of blocks, while the number of blocks remains
constant. We start with a graph containing 50 blocks of five nodes each and
then expand to graphs with 50 blocks of 50 and 500 nodes. In the second case,
growth occurs in the numbers of blocks, while sizes of blocks remains constant.
We begin with a graph of five blocks of 50 nodes and expand to 50 blocks and
500 blocks of 50 nodes. These results are presented in Sect. 4.1. We also isolate
the effect of block size, within a fixed size graph. The goal of these numerical
experiments is to identify the (in)ability of spectra to detect the presence of
densely connected blocks of varying sizes, within a sparser graph with identical
characteristics (edge probability, number of blocks and overall size). These results
are presented in Sect. 4.2. In all the above-mentioned experiments, intra-/inter–
block edge probabilities are held constant (Pin = 0.9, Pout = 0.1).
Our last batch of experiments is an examination of sensitivity to edge probability. In these experiments, we generate PPM graphs with intra-block edge
probability of 0.9 but with varying inter-block edge probabilities. Here again, we
isolate the effect of block size. We begin by generating graphs with a relatively
large block size (n = 500) and relatively small number of blocks (K = 50).
We then repeat the same experiments with graphs containing a relatively large
number (K = 500) of relatively small (n = 50) blocks.
In this first set of experiments, we vary graph size by increasing block size (n),
while keeping the number of blocks constant (K = 50). We compute the eigen-
values for the adjacency and normalized Laplacian matrices for graphs with:
– n = 5 ⇒ N = 5 × 50 = 250,
– n = 50 ⇒ N = 50 × 50 = 2, 500,
– n = 500 ⇒ N = 500 × 50 = 25, 000,
– Pin = 0.9, Pout = 0.1.
Results are shown in Fig. 1. The blue curve shows the (N − 1) non-zero eigen-
values, sorted in ascending order, for the graph with 250 nodes. The orange
curve shows the same, for the graph with 2, 500 nodes. Finally, the green curve
shows the eigenvalues of the graph with 25, 000 nodes. In order to focus on the
theoretical location of the eigengap, we adjust the x-axis accordingly.
In a separate set of experiments, we generate graphs with the same char-
acteristics as those in Fig. 1. We then record the range of eigenvalues and the
eigengap of the normalized Laplacian as the graph grows in size. Results are
reported in Table 1.
In our second set of experiments, we vary graph size by increasing the number
of blocks (K), while keeping the block size constant (n = 50). It has been argued
that this growth model is more realistic and consistent with real world networks
[17,23]. We compute the eigenvalues for the adjacency and normalized Laplacian
matrices for graphs with:
Fig. 1. Varying N , number of blocks is constant (K = 50) (blue 250 nodes, orange
2, 500 nodes, green 25, 000 nodes)
– K = 5 ⇒ N = 5 × 50 = 250,
– K = 50 ⇒ N = 50 × 50 = 2, 500,
– K = 500 ⇒ N = 500 × 50 = 25, 000,
– Pin = 0.9, Pout = 0.1.
Results are shown in Fig. 2. Here too, in order to focus on the theoretical location
of the eigengap, we adjust the x-axis accordingly.
Once more, we also generate a new set of graphs with the same characteristics
as those in Fig. 2. We record the range of eigenvalues and the eigengap of the
normalized Laplacian as the graph grows in size. Results are reported in Table 2.
Results from all four experiments in this section highlight the relationships
between Laplacian eigenvalues and graph and block sizes. Our observations are
consistent with and extend prior work that has linked vertex degree and spectra
[5,27]. In particular, we isolate the effect of increases in graph and block sizes on
(Figure panels: (a) adjacency matrix, 250 nodes; (b) adjacency matrix, 2,500 nodes; (c) adjacency matrix, 25,000 nodes; (d) Laplacian matrix, 250 nodes; (e) Laplacian matrix, 2,500 nodes; (f) Laplacian matrix, 25,000 nodes.)
To further isolate the effect of block size, we keep the number of nodes constant
(N = 500), but vary block size (n ∈ {5, 10, 20}). Once again, we adjust the
x-axis to focus on the eigengap. Results are shown in Fig. 3.
Once again, we observe that smaller block sizes lead to increased unifor-
mity in eigenvalues. In fact, the eigengap is non-existent, except in the very last
experiment (n = 20).
Fig. 4. Increasing inter-block edge probability (N = 25, 000, K = 50, n = 500) (blue
Pout = 0.1, orange Pout = 0.2, green Pout = 0.3)
Fig. 5. Increasing inter-block edge probability (N = 25, 000, K = 500, n = 50) (blue
Pout = 0.1, orange Pout = 0.2, green Pout = 0.3)
References
1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod.
Phys. 74, 47–97 (2002). https://doi.org/10.1103/RevModPhys.74.47
2. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286,
509–512 (1999)
3. Broido, A.D., Clauset, A.: Scale-free networks are rare. Nature Commun. 10(1),
1017 (2019)
4. Bruneau, P., Parisot, O., Otjacques, B.: A heuristic for the automatic parametriza-
tion of the spectral clustering algorithm. In: 2014 22nd International Conference on
Pattern Recognition, pp. 1313–1318 (2014). https://doi.org/10.1109/ICPR.2014.
235
5. Chen, J., Lu, J., Zhan, C., Chen, G.: Laplacian Spectra and Synchronization
Processes on Complex Networks, pp. 81–113. Springer US, Boston, MA (2012).
https://doi.org/10.1007/978-1-4614-0754-6_4
6. Chung, F.R.K.: Spectral graph theory. American Mathematical Soc. (1997)
7. Coja-Oghlan, A., Goerdt, A., Lanka, A.: Spectral partitioning of random graphs
with given expected degrees. In: Navarro, G., Bertossi, L., Kohayakawa, Y. (eds.)
Fourth IFIP International Conference on Theoretical Computer Science- TCS 2006,
pp. 271–282. Springer, US, Boston, MA (2006)
8. Condon, A., Karp, R.: Algorithms for graph partitioning on the planted parti-
tion model. Random Struct. Algorithms 18(2), 116–140 (2001). https://doi.org/
10.1002/1098-2418(200103)18:2116::AID-RSA10013.0.CO;2-2
9. Erdös, P., Rényi, A.: On random graphs I. Publicationes Mathematicae Debrecen
6, 290–297 (1959)
10. Fortunato, S.: Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
https://doi.org/10.1016/j.physrep.2009.11.002
11. Fortunato, S., Hric, D.: Community detection in networks: A user guide. arXiv
(2016)
12. Gan, L., Wan, X., Ma, Y., Lev, B.: Efficiency evaluation for urban industrial
metabolism through the methodologies of emergy analysis and dynamic network
stochastic block model. Sustainable Cities and Society, p. 104396 (2023)
13. Gilbert, E.: Random graphs. Ann. Math. Statist. 30(4), 1141–1144 (1959). https://
doi.org/10.1214/aoms/1177706098.
14. Hagberg, A., Schult, D., Swart, P.: Exploring Network Structure, Dynamics, and
Function using NetworkX. In: G. Varoquaux, T. Vaught, J. Millman (eds.) Pro-
ceedings of the 7th Python in Science Conference, pp. 11–15. Pasadena, CA USA
(2008)
15. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing com-
munity detection algorithms. arXiv 78(4), 046110 (2008). https://doi.org/10.1103/
PhysRevE.78.046110
16. Lee, C., Wilkinson, D.J.: A review of stochastic block models and extensions for
graph clustering. Appl. Netw. Sci. 4(1), 1–50 (2019)
17. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure
in large networks: natural cluster sizes and the absence of large well-defined clus-
ters. Internet Mathematics 6(1), 29–123 (2009). https://doi.org/10.1080/15427951.
2009.10129177
18. Lutzeyer, J.F., Walden, A.T.: Comparing Graph Spectra of Adjacency and Lapla-
cian Matrices. arXiv e-prints arXiv:1712.03769 (2017). https://doi.org/10.48550/
arXiv.1712.03769
19. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17, 395–416
(2007)
20. Newman, M.E.J., Strogatz, S., Watts, D.J.: Random graphs with arbitrary degree
distributions and their applications. Phys. Rev. E 64, 026,118 (2001). https://doi.
org/10.1103/PhysRevE.64.026118.
21. Priebe, C.E., et al.: On a two-truths phenomenon in spectral graph clustering.
Proc. Natl. Acad. Sci. 116(13), 5995–6000 (2019)
22. Rao Nadakuditi, R., Newman, M.E.J.: Graph spectra and the detectability of com-
munity structure in networks. arXiv e-prints arXiv:1205.1813 (2012). https://doi.
org/10.48550/arXiv.1205.1813
23. Rohe, K., Chatterjee, S., Yu, B.: Spectral clustering and the high-dimensional
stochastic blockmodel. Ann. Stat. 39(4), 1878–1915 (2011). https://doi.org/10.
1214/11-AOS887.
24. Spielman, D.A.: Spectral graph theory and its applications. In: 48th Annual IEEE
Symposium on Foundations of Computer Science (FOCS’07), pp. 29–38 (2007).
https://doi.org/10.1109/FOCS.2007.56
25. NetworkX online documentation (author unknown): Planted partition model. https://networkx.org/documentation/stable/reference/generated/networkx.generators.community.planted_partition_graph.html
26. NetworkX online documentation (author unknown): Stochastic block model. https://networkx.org/documentation/stable/reference/generated/networkx.generators.community.stochastic_block_model.html
27. Zhan, C., Chen, G., Yeung, L.F.: On the distributions of Laplacian eigenvalues
versus node degrees in complex networks. Physica A: Statistical Mechanics and its
Applications 389(8), 1779–1788 (2010). https://doi.org/10.1016/j.physa.2009.12.
005. URL https://www.sciencedirect.com/science/article/pii/S0378437109010012
FakEDAMR: Fake News Detection Using
Abstract Meaning Representation
Network
1 Introduction
Social media has become essential for communication and information sharing. How-
ever, news shared over social media platforms lacks cross-referencing, allowing
the spread of misinformation. Interestingly, it appears that the rate at which fake
news is shared on Twitter exceeds that of genuine news [18]. Figure 1 presents
some examples of fake news that spread through various media platforms, includ-
ing Twitter. Many ML/DL methods have been proposed to identify fake news from social media [7]. These existing methods focused on syntactic features and did
not investigate how semantic features of news content affect ML models. How-
ever, complex semantic features are seen to improve the performance of different
NLP tasks such as event detection [6], abstractive summarization [14], and ques-
tion answering [13] in machine learning. Considering this, one may ask: "Does
incorporating complex semantic features of sentences enhance the performance
of fake news detection models too?”
Fig. 1. Examples of false information related to topics ‘Nupur Sharma’ and ‘Agniveer’
controversy showcased in the images (Courtesy: Boomlive). The images depict various
misleading claims, including a) Russia, Netherlands, France and 34 other countries
are supporting India and Nupur Sharma. b) Nupur Sharma is arrested and in jail. c)
Oppressors are damaging the railway line in protest of the Agniveer scheme.
The present study proposes a fake news detection model, FakEDAMR, that classifies tweets as genuine or fake by introducing graph-based semantic features alongside syntactic and lexical features of the sentences. The main
contribution of our work is to use deep semantic representation from the features
of the Abstract Meaning Representation (AMR) graph. AMR helps to better
extract the relationships between entities far apart in the text with minimum
cost. This approach reduces the emphasis on syntactic features and collapses
certain elements of word category, such as verbs and nouns, as well as word
order and morphological variations. To the best of our knowledge, unlike other studies focused on identifying fake news, this research rigorously investigates the semantic features of Abstract Meaning Representation (AMR) graphs.
We curated a fake news dataset, namely, FauxNSA, related to the well-known
controversies Nupur Sharma and Agniveer in India. Tweets with a list of curated
hashtags (Table 1) on said topics are collected from the Twitter platform, in two different languages, Hindi and English. We extracted AMR graphs from each text document using the STOG model [24]. We encoded AMR graphs using graph
embedding and combined them with the syntactic features of the text used
in state-of-the-art model [22]. Finally, the resulting embedding vector, which
includes both semantic and syntactic features, is fed into a deep-learning model
to predict the probability of the tweet being fake or real. We have experimented with our model
on two publicly available datasets (Covid19-FND[19], KFN[12]) and FauxNSA.
Our experiments demonstrated an improvement in accuracy of 2–3% over all
the datasets when the AMR graph features were included with existing textual
features in the model.
The rest of the paper is organized as follows: Sect. 2 reports the related
work. Section 3 and 4 describe the working methodology and experimental setup.
Section 5 reports the results with comparative analysis. Ablation study is pre-
sented in Sect. 6. Finally, Sect. 7 concludes the research outcome.
2 Related Work
Fake news detection has been extensively studied recently using Natural Lan-
guage Processing. Oshikawa et al. [16] clarify the distinction between detect-
ing fake news and related concepts, including rumor detection, and provide an
overview of current data sets, features, and models. As mentioned in the intro-
duction, Castillo et al. [3] created a set of 68 features in the identification of
false information. They used a propagation tree over the feature set to identify
whether the news is false or not. An extension to the lexical-based analysis model
is used in [15] by incorporating speaker profile details into an attention-based
long short-term memory (LSTM) model. Zervopoulos et al. [23] created a set
of 37 handcrafted features that includes morphological (e.g., part of speech),
vocabulary (e.g., type-to-token ratio), semantic (e.g., text and emoji sentiment),
and lexical features (e.g., number of pronouns) to predict false news using traditional ML algorithms. Further, in 2022 [22], they extended the research to run different feature sets with complex deep learning models.
AMR is a graph-based representation of natural language that accurately
captures the complex semantics of a sentence in a way that is both language-
independent and computationally tractable. A growing number of researchers
are investigating how to use the information stored in the AMR graphs and its
representations to assist in the resolution of other NLP problems. AMR has been
successfully applied to more advanced semantic tasks such as entity linking [17],
question answering [13], and machine translation [10]. Garg et al. [5] were the first
to employ AMR representation for extracting interactions from biomedical text,
utilizing graph kernel methods to determine if a given AMR subgraph expresses
an interaction. Aguilar et al. [1] and Huang et al. [8] conducted research indicating that the semantic structures of sentences, such as AMR introduced in
[2], encompass extensive and varied semantic and structural information concern-
ing events. The use of AMR graphs in fake news detection has been relatively unexplored; however, considering their capability to determine trigger words by extracting complex semantic information, AMR graphs have the potential to improve the
efficiency of existing fake news detection methodologies. Recently, Zhang et al.
[25] extracted fact queries based on AMR to verify the factual information in
multimodality. Some evidence-based GNN fake news detection models [4,9,21]
are also proposed for misinformation detection.
3 Methodology
The methodology in this study is divided into two parts: 1) curation of the
proposed data set FauxNSA, and 2) fake news detection model FakEDAMR.
Figure 2 shows the methodology, and the description of each step is provided in
the following sections.
Table 1. List of curated hashtags used to scrape tweets from the Twitter platform.
Nupur Sharma Controversy: #NupurSharmaControversy, #Jamamasjid, #Nupur Sharma, #NupurSharma, #HinduRashtra, #HindusUnderAttack, #SarTanSeJuda, #KanahiyaLal, #NupurSharmaBJP, #IsupportNupurSharma
Agniveer Controversy: #AgnipathRecruitmentScheme, #Agnipath, #Agniveer, #AgnipathProtests
Fake news dataset The data set was gathered from the Twitter platform between May and September 2022 using the Twitter Academic API's full-archive search over the political controversies 'Nupur Sharma' and 'Agniveer'. These controversies involve data related to religious, political, and terrorism issues. The methodology to collect the tweets can be broken down as follows. First, a list of curated hashtags related to the 'Nupur Sharma' and 'Agniveer' controversies, given in Table 1, is manually constructed. Tweets containing at least one hashtag from the list were captured through the Twitter API. In total, 31,889 tweets were collected, each with 31 features such as account information (display name, # of followers), tweet information (text, hashtags, URLs), and network information (quotes, likes, replies). We used popular fact-checking websites such as BoomLive¹, NewsChecker², AltNews³, etc. to annotate the data for fake news; the items were then manually searched over various social media platforms and carefully annotated by two human annotators. We have also collected tweets from 42 verified fact-checker Twitter accounts (PIBFactCheck, ABCFactCheck, etc.) on the Twitter platform. Subsequently, this comprehensive approach ensures the
¹ https://www.boomlive.in/.
² https://newschecker.in/.
³ https://www.altnews.in/.
reliability and credibility of the data used in our study. After performing all the filtering steps, we have 4,632 tweets that are fake news.
Real news dataset In the collection of real news, we have used the same approach discussed in [22]. That is, we have considered journalists and news agencies as trustworthy sources and collected tweets related to the 'Nupur Sharma' and 'Agniveer' controversies. Overall, the accounts of 34 news agencies⁴ and 82 journalists⁵ with a global outreach are identified and gathered. In total, 4,657 tweets are collected and verified with the human annotators to build the data set of real news. Figure 3 shows the word clouds of the collected true and fake data for the Hindi and English languages. We can observe that all the keywords are related to the 'Nupur Sharma' and 'Agniveer' controversies only. After gathering tweets from news agencies, journalists, and fake news sources, a comparison of their characteristics was made to determine their similarities. Specifically, the average number of hashtags per tweet was found to be 2.95 in the fake news data set, 3.12 for tweets posted by journalists, and 2.7 for those posted by news agencies. Additionally, the mean number of URLs per tweet was found to be 0.42 in the fake news data set, 0.55 for tweets posted by journalists, and 1.18 for those posted by news agencies. These statistics show that the data collected from all sources have similar properties. Finally, the data set consists of 9,289 tweets, with 4,632 fake and 4,657 real tweets.
Fig. 3. Frequency word clouds of a) fake and b) true tweets collected from Twitter over the 'Nupur Sharma' and 'Agniveer' controversies.
Text Encoder. Research in the field of Natural Language Processing (NLP) has
long focused on effectively representing sequential data. In line with previous
studies, we have employed two different approaches to encode the sequence of
⁴ https://www.similarweb.com/top-websites/india/news-and-media/.
⁵ https://www.listofpopular.com/tv/top-journalist-of-india/.
tokens. The first approach involves the use of handcrafted features, consisting of
37 specific features outlined in [22]. The second approach utilizes GloVe embed-
ding [20], which is pre-trained using a Twitter-based corpus comprising 27 billion
tokens. This embedding maps each word to a d-dimensional vector. Mathemat-
ically, the cost function of the GloVe embedding for a word in a word sequence
(represented as w = <w_1, . . . , w_k>) can be expressed as follows:

$$J = \sum_{p,q=1}^{V} f(N_{pq})\left(w_p^{T}\tilde w_q + b_p + \tilde b_q - \log N_{pq}\right)^{2} \qquad (1)$$
Here, f(N_{pq}) is a weighting function, w_p is a word vector, w̃_q is a context word vector, and b_p, b̃_q are bias terms. In Eq. 1, the bias terms are learned along with the weight vectors. Finally, we get the text embedding vector t = ([t_i]_{i=1}^{d}; t_i ∈ R^{1×m}), where d is the fixed dimension and m is the maximum number of tokens.
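In practice, the token sequence can be turned into the fixed-size matrix described above roughly as follows; the GloVe file name, the whitespace tokenizer, and the padding length are illustrative assumptions rather than the exact pipeline used in the paper.

```python
import numpy as np

MAX_TOKENS, DIM = 100, 100

def load_glove(path="glove.twitter.27B.100d.txt"):
    """Read pre-trained GloVe vectors (word -> 100-d vector) from a text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_tweet(text, vectors):
    """Map a tweet to a (MAX_TOKENS x DIM) matrix with post-padding."""
    tokens = text.lower().split()[:MAX_TOKENS]
    mat = np.zeros((MAX_TOKENS, DIM), dtype=np.float32)
    for i, tok in enumerate(tokens):
        if tok in vectors:
            mat[i] = vectors[tok]
    return mat

# glove = load_glove()
# t = embed_tweet("nupur sharma row sparks protests", glove)   # shape (100, 100)
```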
The AMR graph conversion process of each text in the document utilizes
STOG model [24]. STOG model breaks down the sequence-to-graph task into
two main components: node prediction and edge prediction.
In node prediction, the model takes an input sequence w = <w_1, . . . , w_k>, where each word w_a is part of the sentence. It sequentially decodes a list of nodes v = <v_1, . . . , v_k> and assigns their indices i = <i_1, . . . , i_k> deterministically using the equation:

$$P(v) = \prod_{a=1}^{k} P(v_a \mid v_{<a}, i_{<a}, w) \qquad (2)$$
For edge prediction, given an input sequence w, a list of nodes v, and indices
i, the model searches for the highest scoring parse tree y within the space Y of
valid trees over v, while adhering to the constraint of i. A parse tree y represents
a collection of directed head-modifier edges, depicted as:
After obtaining the parse tree, the model proceeds with a merging operation
to reconstruct the standard Abstract Meaning Representation (AMR) graph
by combining nodes that share identical indices. Once we have the AMR tree
denoted as y, we extract the RDF triplets from it. These triplets are represented
as t = {(v1 , r1 , v2 ), . . . , (vk−1 , rj , vk )}. Each triplet consists of a subject va , a
concept rk , and an object vb .
Using the extracted RDF triplets, we construct the final graph denoted as
g = (v, e, r). Here, v represents the set of vertices, specifically v = {v1 , . . . , vk },
r corresponds to the set of concepts obtained from the RDF triplets, i.e., r =
{r1 , . . . , rj }. Lastly, e represents the set of edges in the graph, which is defined
as e = {(va , rj , vb )|∃ va , vb ∈ v and rj ∈ r}. In other words, the edges in e connect
the vertices va and vb using the relation rj . This process of extracting RDF
triplets and constructing the final graph enables the representation and analysis
of the AMR graph, capturing the semantic relationships between entities and
facilitating further processing and interpretation.
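A minimal sketch of this graph construction, assuming the triplets have already been extracted (NetworkX is used for illustration; the text does not specify a graph library, and the example triplets are hypothetical):

```python
import networkx as nx

def graph_from_triplets(triplets):
    """Build a directed graph g = (v, e, r) from AMR-derived RDF triplets
    (subject, relation, object); the relation is stored as an edge label."""
    g = nx.DiGraph()
    for subj, rel, obj in triplets:
        g.add_edge(subj, obj, relation=rel)
    return g

# Toy triplets for the sentence "Nupur Sharma is arrested" (hypothetical AMR output).
triplets = [("arrest-01", ":ARG1", "person"), ("person", ":name", "Nupur Sharma")]
g = graph_from_triplets(triplets)
print(g.edges(data=True))
```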
Afterward, a list of graph G, where each graph g ∈ G represents one text,
is passed as input to the Graph2Vec model, specifically the skip-gram model, to
obtain the final embedding. The Graph2Vec model processes the AMR graph and
generates embeddings by considering the graph structure and the relationships
between its elements. The resulting embedding is obtained from the last hidden
layer of the model, capturing the learned representation of the AMR graph in a
i=d
vector form. Finally, we get the graph embedding vector u = ([ui ]i=1 ; ui ∈ Rn×1 ),
where d is the fixed dimension and n is number of sentence in the document.
FakEDAMR 315
Classification Layer. After getting the text embedding t(m×d) and graph embed-
ding u(n×d) , we get final embedding x by Eq. 5, where | represents concatenation:
4 Experimental Setup
Dataset: Other than our dataset, we have also used two publicly available
datasets, Covid-FND [19] and KFN [12], for our experiments. The Covid-FND dataset consists of social media posts and articles related to COVID-19, and the KFN dataset includes 20,387 news items spanning the fields of politics, commerce, and technology, with an evenly distributed mix of real and fake news pieces.
Statistics of the data used in our experiments is given in the Table 2.
We used basic preprocessing, such as removing URLs, stopwords, etc., on each text document of the dataset. We have incorporated AMR graph features into the feature sets proposed by [22]. They used two feature sets: Feature-set 1 adopts a feature engineering approach, where the chosen features are hand-crafted, including various categories such as morphological, vocabulary, and lexical features. Feature-set 2 employs tokenization of each tweet's text and conversion into word embeddings. GloVe embedding [20], pre-trained on a Twitter-based corpus of 27 billion tokens, is used to map each word to a 100-dimensional vector. Despite each word being mapped to a fixed-size vector, tweet length still varies; to address this issue, post-padding (i.e., padding at the end of a tweet) is used to match the longest tweet (approximately 100 tokens). Therefore, a tweet in Feature-set 2 is represented as a 100 × 100 matrix. Although the Graph2Vec embedding size can vary with the size of the AMR graph, we fixed its dimension to 100, considering that tweet length is bounded on the Twitter platform. We evaluated Feature-set 1 on Naive Bayes, SVM, C4.5, and random forests, and Feature-set 2 on CNN, C-LSTM, and BiLSTM. To train each model on the datasets, we carried out three distinct trials with different seed values; the performance metrics were then computed on the test set for the best-performing trial. Four performance metrics, namely Precision, Recall, F1-score, and Accuracy, are considered for the comparative study. The model configuration, such as the hyper-parameters and the number of layers used in each model, is kept the same as in [22].
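The following sketch illustrates how Feature-set 2 could be prepared, assuming a dictionary glove mapping words to 100-dimensional vectors (e.g., the Twitter-trained GloVe vectors); the tokenizer details are illustrative rather than the authors' exact pipeline.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN, EMB_DIM = 100, 100   # approximate longest tweet and GloVe dimension

def build_feature_set_2(texts, glove):
    """Sketch of Feature-set 2: tokenize each tweet, post-pad to 100 tokens,
    and map every token to its 100-dimensional GloVe vector, so that each
    tweet becomes a 100 x 100 matrix. 'glove' is a word -> vector dictionary."""
    tok = Tokenizer()
    tok.fit_on_texts(texts)
    seqs = pad_sequences(tok.texts_to_sequences(texts), maxlen=MAX_LEN, padding="post")
    index_to_word = {i: w for w, i in tok.word_index.items()}
    X = np.zeros((len(texts), MAX_LEN, EMB_DIM), dtype="float32")
    for r, seq in enumerate(seqs):
        for c, idx in enumerate(seq):
            if idx and index_to_word[idx] in glove:   # index 0 is padding
                X[r, c] = glove[index_to_word[idx]]
    return X
```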
5 Results
It is evident that incorporating AMR semantic features into the feature sets sig-
nificantly improves the performance of the models. Among the models evaluated
using Feature-set 1, Random Forest with AMR-encoded feature sets achieves the
highest accuracy of 88.90% and an F1-score of 85.92% on the FauxNSA (pro-
posed) dataset. Furthermore, it also achieves the highest accuracy of 89.48% and
87.09%, along with F1-scores of 88.69% and 86.70%, on the publicly available
datasets Covid19-FND and KFN, respectively.
BiLSTM with AMR-encoded features outperforms other models in the case
of Feature-set 2. The model achieved an accuracy of 93.96% and an F1-score
of 91.96% on our data set. Similar performance is observed on the other two
publicly available data sets as well, where the accuracy and F1-scores of 93.26%
and 93.20% on Covid19-FND, and 93.52% and 93.52% on KFN, respectively, are
achieved.
Fig. 5. Comparative analysis of AMR and text features in three datasets: Covid19-FND, KFN, and FauxNSA. The graph is plotted for three correctly predicted samples by the model, where the x-axis represents the feature index and the y-axis its corresponding value.
7 Conclusion
In this paper, we show that detecting fake news requires a more sophisticated
understanding of the semantic relationships between trigger words and entities
in the text. We demonstrated how the Abstract Meaning Representation (AMR) graph improves fake news detection models, and we concluded that semantic features are just as important as linguistic and syntactic features for identifying fake news in posts. In the future, we plan to explore ways to embed AMR graphs with pre-trained transformer-based models such as BERT, XLM-RoBERTa, ELECTRA, etc. We are also interested in exploring more ways to encode AMR knowledge in order to increase the performance of existing fake news models.
References
1. Aguilar, J., Beller, C., McNamee, P., Van Durme, B., Strassel, S., Song, Z.,
Ellis, J.: A comparison of the events and relations across ACE, ERE, TAC-KBP,
and FrameNet annotation standards. In: Proceedings of the Second Workshop on
EVENTS: Definition, Detection, Coreference, and Representation, pp. 45–53. Asso-
ciation for Computational Linguistics, Baltimore (2014)
2. Banarescu, L., et al.: Abstract meaning representation for sembanking. In: Pro-
ceedings of the 7th Linguistic Annotation Workshop and Interoperability with
Discourse, Sofia, Bulgaria, pp. 178–186 (2013)
3. Castillo, C., Mendoza, M., Poblete, B.: Predicting information credibility in time-
sensitive social media. Internet Res. Electron. Network. Appl. Policy 23, 560–588
(2013)
4. Dun, Y., Tu, K., Chen, C., Hou, C., Yuan, X.: Kan: knowledge-aware attention net-
work for fake news detection. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 35, no. 1, pp. 81–89 (2021)
5. Garg, S., Galstyan, A., Hermjakob, U., Marcu, D.: Extracting biomolecular inter-
actions using semantic parsing of biomedical text. In: AAAI, AAAI 2016, pp. 2718–
2726. AAAI Press (2016)
6. Gupta, S., Kundu, S.: Interaction graph, topical communities, and efficient local
event detection from social streams. Expert Syst. Appl. 232, 120890 (2023)
7. Hu, L., Wei, S., Zhao, Z., Wu, B.: Deep learning for fake news detection: a com-
prehensive survey. AI Open 3, 133–155 (2022)
8. Huang, L., Cassidy, T., Feng, X., Ji, H., Voss, C.R., Han, J., Sil, A.: Liberal event
extraction and event schema induction. In: ACL, vol. 1: Long Papers, pp. 258–268.
ACL, Berlin (2016)
9. Jin, Y., et al.: Towards fine-grained reasoning for fake news detection. In: AAAI,
vol. 36, pp. 5746–5754 (2022)
10. Jones, B., Andreas, J., Bauer, D., Hermann, K.M., Knight, K.: Semantics-based
machine translation with hyperedge replacement grammars. In: Proceedings of
COLING 2012, pp. 1359–1376 (2012)
11. Kiperwasser, E., Goldberg, Y.: Simple and accurate dependency parsing using
bidirectional LSTM feature representations. Trans. Assoc. Comput. Linguist. 4,
313–327 (2016)
12. Lifferth, W.: Fake news (2018). https://kaggle.com/competitions/fake-news
13. Lim, J., Oh, D., Jang, Y., Yang, K., Lim, H.: I know what you asked: graph
path learning using AMR for commonsense reasoning. In: ICCL, pp. 2459–2471.
International Committee on Computational Linguistics, Barcelona (2020)
14. Liu, F., Flanigan, J., Thomson, S., Sadeh, N., Smith, N.A.: Toward abstractive
summarization using semantic representations. In: Proceedings of the 2015 Con-
ference of the North American Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, pp. 1077–1086. Association for Compu-
tational Linguistics, Denver (2015)
15. Long, Y., Lu, Q., Xiang, R., Li, M., Huang, C.R.: Fake news detection through
multi-perspective speaker profiles. In: Proceedings of the Eighth International Joint
Conference on Natural Language Processing, vol. 2: Short Papers, pp. 252–256.
Asian Federation of Natural Language Processing, Taipei (2017)
16. Oshikawa, R., Qian, J., Wang, W.Y.: A survey on natural language processing for
fake news detection. CoRR arxiv:1811.00770 (2018)
17. Pan, X., Cassidy, T., Hermjakob, U., Ji, H., Knight, K.: Unsupervised entity linking
with abstract meaning representation. In: NAACL: Human Language Technologies,
pp. 1130–1139 (2015)
18. Parmelee, J.H., Bichard, S.L.: Politics and the Twitter revolution: how tweets
influence the relationship between political leaders and the public. Lexington books
(2011)
19. Patwa, P., et al.: Fighting an infodemic: Covid-19 fake news dataset. In:
Chakraborty, T., Shu, K., Bernard, H.R., Liu, H., Akhtar, M.S. (eds.) CONSTRAINT 2021, vol. 1402, pp. 21–29. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73696-5_3
20. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word repre-
sentation. In: EMNLP, pp. 1532–1543. Association for Computational Linguistics,
Doha (2014)
21. Xu, W., Wu, J., Liu, Q., Wu, S., Wang, L.: Evidence-aware fake news detection
with graph neural networks. In: WWW, WWW 2022, pp. 2501-2510. ACM, New
York (2022)
22. Zervopoulos, A., Alvanou, A., Bezas, K., Papamichail, A., Maragoudakis, M., Ker-
manidis, K.: Deep learning for fake news detection on twitter regarding the 2019
Hong Kong protests. Neural Comput. Appl. 34, 1–14 (2022)
23. Zervopoulos, A., Alvanou, A.G., Bezas, K., Papamichail, A., Maragoudakis, M.,
Kermanidis, K.: Hong Kong protests: using natural language processing for fake
news detection on twitter. In: Maglogiannis, I., Iliadis, L., Pimenidis, E. (eds.)
AIAI 2020, vol. 584, pp. 408–419. Springer, Cham (2020)
24. Zhang, S., Ma, X., Duh, K., Van Durme, B.: AMR parsing as sequence-to-graph
transduction. In: ACL, pp. 80–94. ACL, Florence (2019)
25. Zhang, Y., Trinh, L., Cao, D., Cui, Z., Liu, Y.: Detecting out-of-context multimodal
misinformation with interpretable neural-symbolic model (2023)
Visual Mesh Quality Assessment Using
Weighted Network Representation
Abstract. This paper addresses the critical task of evaluating the visual
quality of triangular mesh models. We introduce an innovative approach
that leverages weighted graphs for this purpose. Motivated by the grow-
ing need for accurate quality assessment in various fields, including com-
puter graphics and 3D modeling, our methodology begins by generating
saliency maps for each distorted mesh model. These models are sub-
sequently transformed into a network representation, where mesh ver-
tices are nodes and mesh edges are edges in the graph. The determina-
tion of vertex weights relies on the salience values. We then extract a
wide range of topological properties and compute statistical measures
to create a signature vector. To predict the quality score, we rigorously
evaluate the performance of three regression algorithms. Experiments
span four publicly available databases designed for mesh model qual-
ity assessment. Results demonstrate that the proposed approach excels
in this task, showcasing remarkable correlations with subjective evalu-
ations. This preliminary analysis paves the way for further research to
address potential limitations and explore additional applications of mesh
network representation.
1 Introduction
Recently, the utilization of 3D models has expanded significantly across various
application domains, including virtual and mixed reality, computer-aided diag-
nosis, architecture, and cultural heritage preservation. However, processing these
3D models through operations like simplification and compression introduces the
potential for various distortions that can adversely affect the visual quality [1,2].
Addressing this issue, there is a growing demand for the development of robust
methods to assess perceived quality.
Traditionally, assessing distortion levels in 3D models has relied on human
observers, a time-consuming and resource-intensive endeavor. To streamline this
process, objective methods have emerged as a practical solution. These meth-
ods involve the implementation of automated metrics that aim to replicate the
judgments of an ideal human observer [3]. These metrics generally fall into three
categories: full reference [4–7], reduced reference [8–10], and blind methods [11–
14]. Among these, blind methods, which do not rely on reference models, have
gained particular significance, especially in real-world applications.
3D meshes are very complex structures consisting of vertices, edges, and faces
that collectively shape a 3D model. Selecting an appropriate data structure to
represent this extensive volume of data and relationships is paramount. It’s worth
noting that the effectiveness of any quality assessment method greatly relies on
the chosen data structure.
Graphs stand out as remarkably versatile data structures, capable of intu-
itively representing 3D triangular meshes. The degree of connectivity between a
vertex and its neighboring vertices provides valuable insights into the perceptual
characteristics of a mesh. Additionally, the geometric structure of a mesh can be
effectively described in terms of the network’s topological properties. The appli-
cation of graph representations has yielded considerable success in addressing
various challenges in computer vision, including tasks like image segmentation
[15–17], classification [18–20] and denoising [1,21].
Nevertheless, it is noteworthy that the literature contains relatively few
studies that have delved into assessing 3D mesh quality using graph-based
approaches. Lin et al. proposed a novel method that relies on learning the graph
spectrum’s entropy and the mesh’s spatial characteristics, as described in [22].
Similarly, Abouelaziz et al. introduced the concept of convolutional graph net-
works to estimate mesh quality, as proposed in [23]. These two methods, although
distinct in their approaches, share a common limitation in that they separately
learn the geometric and perceptual attributes without integrating them into the
weighted graph construction process.
In this context, we present a novel approach in this paper to assess the
visual quality of 3D meshes. Our proposed approach relies on a weighted graph
constructed from the geometric coordinates and the saliency values of the mesh
vertices. Subsequently, the graph’s characteristics are trained with a machine
learning-based regression method to forecast the quality score.
The remainder of this paper is organized as follows. We present in Sect. 2
a description of the proposed method. Section 3 is devoted to the experimental
results. Finally, we present some concluding remarks and perspectives in Sect. 4.
2 Proposed Method
of the graph. Based on these findings, we derive signatures for each mesh by
computing statistical parameters related to these properties. These resultant
signatures serve as inputs for the regression module, enabling us to predict the
quality score. In this study, we opted for three regression methods: Random
Forest, Support Vector Regression, and Generalized Regression Neural Networks
due to their proven utility and appropriateness for learning-based applications.
Degree: The most fundamental and basic metric of a graph is the vertex degree. In a weighted graph, the degree of an individual vertex, denoted as d(v), is defined as the sum of the weights of all edges connected to it:

d(v) = \sum_{u \in V \setminus \{v\}} w(u, v)    (4)

This means that the higher the degree centrality of a node, the more edges are connected to that node and thus the more neighbor nodes it has.
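A minimal sketch of Eq. 4 with networkx is given below; the edge weights stand in for the saliency-derived weights, and the toy values are illustrative.

```python
import networkx as nx

# Toy mesh graph: edge weights stand in for the saliency-derived weights.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.8), (0, 2, 0.3), (1, 2, 0.5)])

# d(v) = sum of the weights of all edges incident to v (Eq. 4)
weighted_degree = dict(G.degree(weight="weight"))
print(weighted_degree)   # {0: 1.1, 1: 1.3, 2: 0.8}
```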
3 Experimental Results
This section deals with our experimental methodology, including the studied
databases, the evaluation criteria, and a brief overview of the results obtained
through comparison with the latest state-of-the-art methods.
3.1 Datasets
Fig. 2. The reference models from the LIRIS masking database (a), the general-purpose database (b), the UWB compression database (c), and the IEETA simplification database (d).
SRCC = 1 - \frac{6 \sum_{i=1}^{n} \left( \mathrm{rank}(MOS_i) - \mathrm{rank}(Qs_i) \right)^2}{n(n^2 - 1)}    (10)

where n denotes the number of distortions in a given database, MOS_i are the mean opinion scores in the database, and Qs_i is the objective quality score obtained by a given method. MOS and Qs are the mean values of MOS_i and Qs_i, respectively.
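For illustration, the SRCC of Eq. 10 and the Pearson correlation (PLCC) reported in Tables 1 and 2 can be computed with scipy; the MOS and predicted scores below are illustrative values, not data from the databases.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

mos = np.array([3.2, 4.1, 1.8, 2.6, 4.7])   # MOS_i (illustrative)
qs = np.array([3.0, 4.3, 2.1, 2.2, 4.5])    # Qs_i predicted by the regression model

srcc, _ = spearmanr(mos, qs)   # rank correlation of Eq. 10
plcc, _ = pearsonr(mos, qs)    # Pearson linear correlation
print(f"SRCC = {srcc:.3f}, PLCC = {plcc:.3f}")
```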
relying on unweighted graphs across all mesh types and entire datasets. Regarding regression methods, we observe substantial disparities between RF on the one hand and SVR and GRNN on the other. In summary, the most effective combination is the one employing the RF regression method within the framework of weighted graphs.
Table 1. Correlation coefficients SRCC (%) and PLCC (%) with weighted and
unweighted graphs on the four databases.
Table 2. Correlation coefficients SRCC (%) and PLCC (%) of different objective
methods on LIRIS masking database, LIRIS/EPFL general-purpose database and the
UWB compression database.
4 Conclusion
In this paper, we introduced a novel approach for evaluating the visual quality of
meshes using graph feature learning methodology. Our method focused on ana-
lyzing 3D meshes, treating them as weighted graphs while considering topological
features and statistical characteristics. The construction of these graphs relied on
the saliency of vertices and their geometric coordinates. We compared the perfor-
mance of three distinct machine learning methods and found that the Random
Forest regression model excels in predicting quality scores, primarily attributed
to its suitability for handling graph-structured data. Compared to existing meth-
ods for assessing mesh visual quality, including full reference, reduced reference,
and no-reference methods, our proposed method exhibits robust correlations
with human visual perception. Remarkably, this high accuracy level is achieved
by using a straightforward machine learning algorithm. This pioneering exploration into the application of graph representation for mesh quality assessment holds significant promise and paves the way for numerous future research directions.
References
1. Pastrana-Vidal, R.R., Gicquel, J.-C., Colomes, C., Cherifi, H.: Frame dropping
effects on user quality perception. In: Proceedings of 5th International WIAMIS
(2004)
2. Pastrana-Vidal, R.R., Gicquel, J.C., Blin, J.L., Cherifi, H.: Predicting subjective
video quality from separated spatial and temporal assessment. SPIE Human Vision
Electron. Imaging XI 6057, 276–286 (2006)
3. Corsini, M., Larabi, M.-C., Lavoué, G., Petrik, O., Vasa, L., Wang, K.: Perceptual
metrics for static and dynamic triangle meshes. In: Computer Graphics Forum,
vol. 32, no. 1, pp. 101–125. Wiley Online Library (2013)
4. Cignoni, P., Rocchini, C., Scopigno, R.: Metro: measuring error on simplified sur-
faces. In: Computer Graphics Forum, vol. 17, pp. 167–174. Wiley Online Library
(1998)
5. Aspert, N., Santa-Cruz, D., Ebrahimi, T.: Mesh: measuring errors between surfaces using the Hausdorff distance. In: IEEE International Conference on Multimedia and Expo, vol. 1, pp. 705–708 (2002)
6. Lavoué, G.: A multiscale metric for 3d mesh visual quality assessment. In: Com-
puter Graphics Forum, vol. 30, pp. 1427–1437. Wiley Online Library (2011)
7. Torkhani, F., Wang, K., Chassery, J.M.: A curvature tensor distance for mesh
visual quality assessment. In: Computer Vision and Graphics, pp. 253–263 (2012)
8. Corsini, M., Gelasca, E.D., Ebrahimi, T., Barni, M.: Watermarked 3-d mesh quality
assessment. IEEE Trans. Multimedia 9(2), 247–256 (2007)
9. Wang, K., Torkhani, F., Montanvert, A.: A fast roughness-based approach to the
assessment of 3D mesh visual quality. Comput. Graph. 36(7), 808–818 (2012)
10. Váša, L., Rus, J.: Dihedral angle mesh error: a fast perception correlated distortion
measure for fixed connectivity triangle meshes. In: Computer Graphics Forum, vol.
31, no. 5, pp. 1715–1724. Blackwell Publishing Ltd., Oxford (2012)
11. Abouelaziz, I., El Hassouni, M., Cherifi, H.: A convolutional neural network frame-
work for blind mesh visual quality assessment. In: IEEE International Conference
on Image Processing (ICIP), pp. 755–759 (2017)
12. Abouelaziz, I., Chetouani, A., El Hassouni, M., Latecki, L.J., Cherifi, H.: Convo-
lutional neural network for blind mesh visual quality assessment using 3d visual
saliency. In: 25th IEEE International Conference on Image Processing (ICIP), pp.
3533–3537 (2018)
13. Hamidi, M., Chetouani, A., El Haziti, M., El Hassouni, M., Cherifi, H.: Blind
robust 3D mesh watermarking based on mesh saliency and wavelet transform for
copyright protection. Information 10(2), 67 (2019)
14. Abouelaziz, I., Chetouani, A., El Hassouni, M., Latecki, L.J., Cherifi, H.: No-
reference mesh visual quality assessment via ensemble of convolutional neural net-
works and compact multi-linear pooling. Pattern Recogn. 100, 107174 (2020)
15. Mourchid, Y., El Hassouni, M., Cherifi, H.: A general framework for complex
network-based image segmentation. Multimedia Tools Appl. 78, 20191–20216
(2019)
16. Rital, S., Bretto, A., Cherifi, H., Aboutajdine, D.: A combinatorial edge detection
algorithm on noisy images. In: International IEEE Symposium on VIPromCom
Video/Image Processing and Multimedia Communications, pp. 351–355 (2002)
17. Rital, S., Cherifi, H., Miguet, S.: Weighted adaptive neighborhood hypergraph
partitioning for image segmentation. In: Singh, S., Singh, M., Apte, C., Perner,
P. (eds.) Pattern Recognition and Image Analysis: Third International Conference
on Advances in Pattern Recognition, ICAPR 2005, Bath, UK, 22–25 August 2005,
Proceedings, Part II, vol. 3, pp. 522–531. Springer, Heidelberg (2005). https://doi.org/10.1007/11552499_58
18. Ribas, L.C., Riad, R., Jennane, R., Bruno, O.M.: A complex network-based app-
roach for knee Osteoarthritis detection: data from the Osteoarthritis initiative.
Biomed. Signal Process. Control. 71, 103133 (2022)
19. Lasfar, A., Mouline, S., Aboutajdine, D., Cherifi, H.: Content-based retrieval in
fractal coded image databases. In: Proceedings 15th International Conference on
Pattern Recognition, vol. 1, pp. 1031–1034. IEEE ICPR (2000)
20. Demirkesen, C., Cherifi, H.: A comparison of multiclass SVM methods for real-
world natural scenes. In: Blanc-Talon, J., Bourennane, S., Philips, W., Popescu,
D., Scheunders, P. (eds.) ACIVS 2008, vol. 5259. Springer, Heidelberg (2008)
21. Hassouni, M.E., Cherifi, H., Aboutajdine, D.: HOS-based image sequence noise
removal. IEEE Trans. Image Process. 15(3), 572–581 (2006)
22. Lin, Y., Yu, M., Chen, K., Jiang, G., Chen, F., Peng, Z.: Blind mesh assessment
based on graph spectral entropy and spatial features. Entropy 22(2), 190 (2020)
23. Abouelaziz, I., Chetouani, A., Hassouni, M.E., Cherifi, H., Latecki, L.J.: Learning
graph convolutional network for blind mesh visual quality assessment. IEEE Access
9, 108200–108211 (2021)
24. Lee, C.H., Varshney, A., Jacobs, D.W.: Mesh saliency. In: ACM SIGGRAPH 2005
Papers, pp. 659–666 (2005)
25. Lavoué, G., Gelasca, E.D., Dupont, F., Baskurt, A., Ebrahimi, T.: Perceptually
driven 3D distance metrics with application to watermarking. In: Applications of
Digital Image Processing XXIX, vol. 6312, p. 63120L. International Society for
Optics and Photonics (2006)
26. Lavoué, G., Larabi, M.C., Váša, L.: On the efficiency of image metrics for evaluating
the visual quality of 3D models. IEEE Trans. Visualization Comput. Graph. 22(8),
1987–1999 (2015)
27. Silva, S., Santos, B.S., Ferreira, C., Madeira, J.: A perceptual data repository for
polygonal meshes. In: 2009 Second International Conference in Visualisation, pp.
207–212. IEEE (2009)
28. Abouelaziz, I., El Hassouni, M., Cherifi, H.: No-reference 3d mesh quality assess-
ment based on dihedral angles model and support vector regression. In: Mansouri,
A., Nouboud, F., Chalifour, A., Mammass, D., Meunier, J., Elmoataz, A. (eds.)
International Conference on Image and Signal Processing, pp. 369–377. Springer,
Cham (2016). https://doi.org/10.1007/978-3-319-33618-3 37
29. Abouelaziz, I., El Hassouni, M., Cherifi, H.: A curvature based method for blind
mesh visual quality assessment using a general regression neural network. In: 2016
12th International Conference on Signal-Image Technology & Internet-Based Sys-
tems (SITIS), pp. 793–797. IEEE (2016)
Multi-class Classification Performance
Improvements Through High Sparsity
Strategies
1 Introduction
Although the first embedded system was born back in the 1960s [1] for NASA's Apollo missions, there is nowadays an exponentially growing interest in its application to the IoT (i.e., Internet of Things) [2], which, thanks to the easier and cheaper availability of small-sized sensors, allows a smoother integration of smart devices into our society. The applications of such technologies are most diverse: mobility, grids, domotics, environmental monitoring, industrial processing, healthcare, and security, to cite a few [3].
This latter aspect was the application domain of the present work. Indeed, in the context of security issues, it became crucial to develop smarter IoT devices for cheap embedded lightweight CCTVs, and a few steps have already been taken in this direction in the state of the art [4]. Lightweight cameras are a perfect example of how Edge AI can revolutionise an established system architecture. By moving the inference to the edge of the system's network (i.e., the deployed devices), it is possible to avoid data breaches and obtain an overall more secure architecture [5].
the impact that such an outcome will have on the future design of smaller embed-
ded ML in addressing security issues on lightweight devices.
2.2 Dataset
To perform our experiments, we used the CelebA dataset² [9]. This is a public large-scale face attributes dataset with more than 200,000 celebrity images,
¹ https://github.com/dcmocanu/sparse-evolutionary-artificial-neural-networks/tree/master/SET-MLP-Keras-Weights-Mask
² https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
2.3 Methodology
This section describes the methodology followed after the pre-processing step of the input images described in the previous section.
For the implementation, we employed Keras and Tensorflow APIs to create
a training pipeline that allows the generation of ML models at different sparsity
levels.
We adopted the sparse training approach, which induces sparsity in the net-
work while training the model itself.
Our analysis is a trade-off among varying architectures, image resolutions, and sparsity levels. In detail, we used two CNN architectures inspired by Residual Neural Networks (ResNet) [10] and AlexNet [11]. We selected them because our previous analysis [7], which was applied to disease detection, showed that these two are the most promising architectures for achieving fast and reliable results among those evaluated. Similarly to [7], we applied a binary weight mask to induce sparsity at each epoch of the training process, as sketched below.
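The following is a minimal sketch of this masking scheme, assuming a Keras callback that re-applies fixed random binary masks after every epoch; the mask-selection strategy here is a simple random choice and only stands in for the strategy of [7].

```python
import numpy as np
import tensorflow as tf

def make_masks(model, sparsity=0.7):
    """One fixed binary mask per kernel: keep a (1 - sparsity) fraction of the
    weights and zero the rest. Random selection is used here as a stand-in
    for the strategy of [7]; biases are left unmasked."""
    masks = []
    for w in model.trainable_weights:
        if len(w.shape) > 1:
            keep = (np.random.rand(*w.shape) > sparsity).astype("float32")
            masks.append(tf.constant(keep))
        else:
            masks.append(None)
    return masks

class ApplyMasks(tf.keras.callbacks.Callback):
    """Re-apply the binary masks at the end of every epoch so that the
    pruned connections stay at zero throughout training."""
    def __init__(self, masks):
        super().__init__()
        self.masks = masks

    def on_epoch_end(self, epoch, logs=None):
        for w, m in zip(self.model.trainable_weights, self.masks):
            if m is not None:
                w.assign(w * m)
```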
To detect the lowest resolution that still yields accurate results, we considered three different levels, namely 96 × 96, 48 × 48, and 32 × 32, which are higher than in our previous work because of the different nature of the input images. Indeed, in the present study, a higher resolution was required to distinguish facial features.
We recall that CNNs are a type of feed-forward neural network that is able
to extract features from input data with convolution structures (i.e., mathe-
matical operations that allow the merging of two sets of information) to filter
the information and generate a feature map. They are particularly suitable for
the use case as architectures make the implicit assumption that the input is
image-like [12].
The focus of our work is the sparsity-level investigation. We varied the network sparsity from the benchmark (i.e., a dense network with 0% sparsity) up to 90%. For the sake of brevity, we report only the three most relevant sparsity levels, namely 50%, 70%, and 90%.
Table 1. The table shows the accuracy and size outcomes of the simulations conducted on the AlexNet-inspired and ResNet CNN architectures as the image resolution (i.e., 32 × 32, 48 × 48, and 96 × 96) and the sparsity level (i.e., 50%, 70%, and 90%, plus 0%, which is the dense network used as benchmark) vary.
Fig. 1. The figure shows the accuracy percentage trend at the variation of the sparsity
level of the two architectures under scrutiny (i.e., AlexNet in light orange and ResNet
in dark orange). (Color figure online)
Table 1 also highlights that the resolution of the input image does not affect
the size of ResNet models. This feature allows the analysis of more detailed
frames and achieves better accuracy overall. Contrary to our previous work on
binary classifications of blood cells [7], in which we varied the resolution image
from 8 × 8 up to 32 × 32, we set herein higher resolutions as it was necessary to
identify the higher number of features that characterise a human face.
Hence, moving from higher (i.e., 96 × 96) to medium resolutions (i.e., 48 × 48) with a sparsity of 70%, a significant drop in accuracy occurs (from 85.8% to 78.3%). However, it is worth noting that, when high resolutions are not an option, such as when using cheap devices with similar hardware constraints, there is no significant performance degradation from medium to low resolutions (from 78.3% to 76.1%).
Since the ResNet models maintained the same size, we can consider them a better solution than the AlexNet-inspired CNN models for this classification task.
The better performance of ResNet compared with the AlexNet-inspired CNNs is also visible in Fig. 2. Indeed, the configurations having both lower size and higher accuracy are all ResNets.
Fig. 2. The figure shows the trade-off between the accuracy percentage and the size of the TensorFlow Lite files (in KB) for the two architectures under scrutiny (i.e., AlexNet in squares and ResNet in circles).
References
1. Brady, C.D.: Apollo guidance and navigation electronics. IEEE Trans. Aerospace
(2), 354–362 (1965)
2. Samie, F., Bauer, L., Henkel, J.: Iot technologies for embedded computing: a survey.
In: Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on
Hardware/Software Codesign and System Synthesis, CODES ’16, New York, NY,
USA. Association for Computing Machinery (2016)
3. Khanna, A., Kaur, S.: Internet of things (IoT), applications and challenges: a
comprehensive review. Wirel. Pers. Commun. 114, 1687–1762 (2020)
4. Rohadi, E., et al. Internet of things: CCTV monitoring by using raspberry pi. In:
2018 International Conference on Applied Science and Technology (iCAST), pp.
454–457. IEEE (2018)
5. Singh, R., Gill, S.S.: Edge AI: a survey. Internet Things Cyber-Phys. Syst. 3, 71–92 (2023)
6. Mocanu, D., Mocanu, E., Stone, P., Nguyen, P., Gibescu, M., Liotta, A.: Scalable
training of artificial neural networks with adaptive sparse connectivity inspired by
network science. Nat. Commun. 9, 06 (2018)
7. Cavallaro, L., Serafin, T., Liotta, A.: Miniaturisation of binary classifiers through
sparse neural networks. In: Numerical Computations: Theory and Algorithms.
Springer, Heidelberg (2023)
8. Erdös, P., Rényi, A.: On random graphs i. Publicationes Mathematicae Debrecen
6, 290 (1959)
9. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In:
Proceedings of International Conference on Computer Vision (ICCV) (2015)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)
12. Teuwen, J., Moriakov, N.: Chapter 20 - convolutional neural networks. In: Zhou,
S.K., Rueckert, D., Fichtinger, G. (eds.) Handbook of Medical Image Comput-
ing and Computer Assisted Intervention, The Elsevier and MICCAI Society Book
Series, pp. 481–501. Academic Press (2020)
Learned Approximate Distance Labels
for Graphs
1 Introduction
2 Theoretical Analysis
2.1 Problem Definition
Consider an arbitrary undirected graph G_i of length i and any two nodes u, v in G_i. Denoting the distance between u and v in G_i by d_{G_i}(u, v), the distance labelling problem aims to find a labeling scheme ℓ(·) and a function f such that
Cycles. Denote the set of cycles in which the maximum length of cycle is n by
Cn . The goal is to assign the minimum number of labels to the vertices of all
cycles according to the following rules:
Thus, we found Ω(n^4) triples which must be labelled differently, and hence χ = Ω(n^{4/3}).
Note that the upper bound uses log_2(n^{3/2}) + 1 = (3/2)·log_2(n) + 1 bits, and the lower bound shows that at least log_2(n^{4/3}) = (4/3)·log_2(n) bits are required for exact distances. Our results for cycles are approximate distances, and the number of bits we use approximately matches the lower bound.
Trees. We consider distance labeling schemes for trees: given a tree with n
nodes, label the nodes with binary strings such that, given the labels of any two
nodes, one can determine, by looking only at the labels, the distance in the tree
between the two nodes. Alstrup et al. showed in [7] that (1/4)·log^2(n) bits are needed for exact distances and that (1/2)·log^2(n) bits are sufficient. They also give a (1 + ε)-stretch labeling scheme using Θ(log n) bits for constant ε > 0. This result was extended by Freedman et al. in [12], who showed that the recent labeling scheme of [7] can be easily modified to obtain an O(log(1/ε) · log n) upper bound; they also proved a matching lower bound. Our method approximately achieves the theoretical bounds for (1 + ε)-approximate distances on trees with ε = 0.4.
Proof. We prove that all the labels in a cycle must be different, and thus one must use log_2(n) bits to represent the labels of the vertices in the largest cycle, whose length is n. Assume by contradiction that x, y are vertices in the same cycle that are assigned the same label, and let z be a neighbor of y such that d(x, z) ≥ 2 (since n ≥ 5, it is easy to verify that at least one of the neighbors of y has this property). Then d(z, y) = 1 (as y and z are neighbors) and d(z, x) ≥ 2. As x and y have the same label, the labeling scheme outputs the same distance D for d(z, x) and d(z, y). As the labeling scheme is (1+ε)-approximate, it must hold that D ≤ 1+ε so that D is a (1+ε)-approximation of d(z, y), and it must also hold that D ≥ 2/(1+ε) so that D is a (1+ε)-approximation of d(z, x). However, these two inequalities cannot hold simultaneously when ε < √2 − 1, which is a contradiction (Fig. 1).
3 Approach
In this paper, we focus on developing a method that calculates the shortest dis-
tance between any two nodes in a graph quickly while simultaneously minimiz-
ing the size of the labels employed for distance computation. This dual objective
enhances the efficiency of our approach and significantly reduces storage require-
ments. Our model incorporates an embedding layer and a two-layer feed-forward
neural network for this purpose.
We utilized synthetic datasets for cycles and trees, as well as real-world graphs
obtained from [16] for our experiments.
For cycles, we generated a dataset D_N = {(u, v, d) : u, v ∈ C_i, 3 ≤ i ≤ L} for the experiments. Here, G = ∪_i C_i represents a union of disjoint cycles, where C_i is a cycle. V_{C_i} is the set of all nodes that belong to C_i, and V = ∪_i V_{C_i} is the set of nodes that exist in any cycle within G.
For trees, we followed a similar procedure to generate a dataset D_N = {(u, v, d) : u, v ∈ T_i, 3 ≤ i ≤ L} for the experiments. Let G = ∪_i T_i be a union of disjoint random trees, let V_{T_i} be the set of all nodes that belong to T_i, and let V = ∪_i V_{T_i} be the set of nodes that exist in any tree that belongs to G. In both the tree and cycle datasets, each node is assigned a unique integer id.
The real-world graph dataset comprises enzyme graphs obtained from [16], which provides the graph edge lists. We processed the data to ensure that every node in the entire set has a unique id and that there are no gaps between the
ids of any two nodes. The dataset DN = {(u, v, d) : u, v ∈ Gi , 3 ≤ i ≤ L} was
then created from this collection of real-world graphs, where Gi is a graph in the
collection. We specifically use the chem-ENZYMES-g1 and chem-ENZYMES-
g118 graphs.
3.2 Training
The training phase of our model utilizes a two-layer feed-forward neural network.
Our objective is to predict the distance between any pair of nodes from the same
cycle, tree, or graph. We employ the embedding layer to generate a vector of size i
(the number of bits) as the label for each node. These vectors are then quantized
to binary form by taking the sign of each individual element within the vector.
We initialize the feature of the node pairs as the concatenation of their respec-
tive quantized labels. After processing this feature through two layers of our
model, an output prediction Ŷ of the distance between two nodes is produced.
Each layer is followed by an activation function. Let n denote the number of
where W_1 ∈ R^{2×d} and W_2 ∈ R^{d×1} are the weight matrices to be trained. The final
activation function σ is modified so that the prediction will be re-ranged to fit
the dataset’s distance range.
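The following PyTorch sketch illustrates this setup under stated assumptions: an embedding layer produces the i-bit label per node, labels are quantized with the sign function, and a two-layer feed-forward network maps the concatenated pair of labels to a distance. The sizes, the re-ranging constant, and the handling of the non-differentiable sign (which in practice needs a straight-through estimator or post-training quantization) are illustrative.

```python
import torch
import torch.nn as nn

class DistanceLabelNet(nn.Module):
    """Embedding layer plus two-layer feed-forward distance predictor.
    The learned per-node vectors are quantized to +/-1 with the sign function;
    max_dist re-ranges the final sigmoid output to the dataset's distance range."""
    def __init__(self, num_nodes, bits=8, hidden=64, max_dist=32.0):
        super().__init__()
        self.embed = nn.Embedding(num_nodes, bits)
        self.fc1 = nn.Linear(2 * bits, hidden)
        self.fc2 = nn.Linear(hidden, 1)
        self.max_dist = max_dist

    def label(self, nodes):
        return torch.sign(self.embed(nodes))      # i-bit label per node

    def forward(self, u, v):
        x = torch.cat([self.label(u), self.label(v)], dim=-1)
        x = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x)).squeeze(-1) * self.max_dist
```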
4 Experiments
4.1 Dataset
We use synthetic datasets with N = 2^6 (i.e., 64). This results in a total of (64 − 3) ∗ (3 + 63)/2 = 2079 nodes each for both the cycle and tree datasets. To enhance the performance of the model on node pairs in close proximity, we duplicate pairs of cycle nodes with a distance of less than 5. We use the entire dataset for training purposes, while reserving 10% of the data as a validation set.
The cycle dataset consists of cycles of length 3 to 2^6.
MRE(Ŷ, Y) = \frac{1}{N} \sum_{i=1}^{N} \frac{|ŷ_i − y_i|}{|y_i|}    (2)
We define the parameter α as a float ranging from 0.0 to 1.0; the final loss l is computed as a combination of MSE and MRE:
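A sketch of such a combined loss is shown below; since the exact form of the combination is not given in the text, the convex weighting l = α·MSE + (1 − α)·MRE is an assumption.

```python
import torch

def combined_loss(y_pred, y_true, alpha=0.5):
    """l = alpha * MSE + (1 - alpha) * MRE (Eq. 2); the convex weighting is an
    assumption, as the exact combination is not shown in the text."""
    mse = torch.mean((y_pred - y_true) ** 2)
    mre = torch.mean(torch.abs(y_pred - y_true) / torch.abs(y_true))
    return alpha * mse + (1.0 - alpha) * mre
```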
5 Results
5.1 Cycles and Alpha Values
We evaluate the performance of our model, specifically trained on cycle struc-
tures, across various alpha values. This assessment is conducted using the post-
training quantization approach.
We report the results in Table 1. By comparing the whole table with previous
theoretical analysis, we have the following observations:
(a) MSE using different numbers of bits. (b) MRE using different numbers of bits.
We also conducted a study on the effect of different values of α. The results are shown in Fig. 2. From the figure, we observe that:
– The value of α significantly affects the number of bits needed. With two bits, the model's performance for α = 0.1 and α = 0.7 is clearly better than for α = 0.3 and α = 0.9. In other words, under stricter space requirements, adjusting the value of α can significantly improve the performance.
– The value of α has less effect than the number of bits. As the number of bits increases, the models' performance converges; the results are quite close when more than six bits are used.
To sum up, our approach demonstrates its effectiveness in learning labels for distance approximation. By changing α, we can obtain decent approximations even when using very few bits.
Based on this result, we use an α value of 0.5 for the rest of our training.
5.2 Trees
Training Quantization Method. In the case of random tree graphs, our
model tends to yield better results when the training quantization strategy is uti-
lized, compared to the post-training quantization method. However, our model’s
efficacy does not extend as seamlessly to random trees as it does to cycles and
general graphs. When applied to random trees, despite increasing the number of
bits to 30, the model records a relatively higher Mean Relative Error (MRE) loss
of approximately 0.41. This indicates a need for further optimization or a differ-
ent approach when dealing with trees. The detailed performance visualizations
can be found in Fig. 3.
Fig. 3. Model's performance on trees using different numbers of bits with quantization during training: (a) MSE; (b) MRE.
Adding bits does trend the error downwards with the training quantization method; however, the effect on the error is less dramatic than the results on cycles. Trees require a significantly higher number of bits to achieve an MRE of around 0.45.
After fine-tuning our model on both tree and graph datasets, we proceed to
evaluate its performance on general graphs. Notably, employing the post-training
quantization methodology and setting α to 0.5 yields results superior to those
Fig. 4. Model's performance on trees using different numbers of bits with post-training quantization: (a) MSE; (b) MRE.
(a) MSE using different numbers of bits. (b) MRE using different numbers of bits.
achieved on the cycle dataset. The detailed performance outcomes are depicted
in Fig. 5 below.
These results on general, real world graphs are the most promising and exper-
imentally demonstrate that this method has viability for real world applications
with large graphs.
6 Limitations
Despite the promising results, our model is not without its limitations. First and
foremost, our model exhibits an asymptotic limit in its training loss, implying
that past a certain point, increasing the number of bits used for the embedding
does not yield further improvements in accuracy. This bottleneck could poten-
tially be addressed by expanding the model size, thereby enabling it to better
capture the intricacies of the data structure.
7 Computation Specifications
9 Conclusion
References
1. Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on
large networks by pruned landmark labeling. In: Proceedings of the 2013 ACM
SIGMOD International Conference on Management of Data, SIGMOD 2013, pp.
349–360 (2013)
2. Alstrup, S., Bille, P., Rauhe, T.: Labeling schemes for small distances in trees.
SIAM J. Disc. Math. 19(2), 448–462 (2005)
3. Alstrup, S., Dahlgaard, S., Knudsen, M.B.T., Porat, E.: Sublinear distance label-
ing. In: 24th Annual European Symposium on Algorithms (ESA 2016), vol. 57, pp.
5:1–5:15 (2016)
4. Alstrup, S., Gørtz, I.L., Halvorsen, E.B., Porat, E.: Distance labeling schemes for
trees. CoRR arxiv:1507.04046 (2015)
5. Alstrup, S., Gørtz, I.L., Halvorsen, E.B., Porat, E.: Distance labeling schemes for
trees. In: 43rd International Colloquium on Automata, Languages, and Program-
ming, ICALP 2016, LIPIcs, vol. 55, pp. 132:1–132:16 (2016)
6. Alstrup, S., Kaplan, H., Thorup, M., Zwick, U.: Adjacency labeling schemes and
induced-universal graphs. In: Proceedings of the Forty-Seventh Annual ACM Sym-
posium on Theory of Computing, STOC 2015, pp. 625–634 (2015)
7. Alstrup, S., Gørtz, I.L., Halvorsen, E.B., Porat, E.: Distance labeling schemes for
trees. In: 43rd International Colloquium on Automata, Languages, and Program-
ming, ICALP 2016, Rome, Italy, 11–15 July 2016, LIPIcs, vol. 55, pp. 132:1–132:16
(2016). https://doi.org/10.4230/LIPIcs.ICALP.2016.132
8. Brunner, D.: Distance preserving graph embedding (2021)
9. Chang, L., Yu, J., Qin, L., Cheng, H., Qiao, M.: The exact distance to destination
in undirected world. VLDB J. 21 (2012)
10. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math.
1(1), 269–271 (1959)
11. Freedman, O., Gawrychowski, P., Nicholson, P.K., Weimann, O.: Optimal distance
labeling schemes for trees. In: Proceedings of the ACM Symposium on Principles
of Distributed Computing, PODC 2017, pp. 185–194 (2017)
12. Freedman, O., Gawrychowski, P., Nicholson, P.K., Weimann, O.: Optimal distance
labeling schemes for trees. In: Proceedings of the ACM Symposium on Principles
of Distributed Computing, PODC 2017, Washington, DC, USA, 25–27 July 2017,
pp. 185–194. ACM (2017)
13. Gavoille, C., Peleg, D., Pérennes, S., Raz, R.: Distance labeling in graphs. J. Algor.
53(1), 85–112 (2004)
14. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determina-
tion of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)
15. Neff, J.: Neural distance oracle for road graphs (2021)
16. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph
analytics and visualization. In: AAAI (2015). https://networkrepository.com
Investigating Bias in YouTube
Recommendations: Emotion, Morality,
and Network Dynamics in China-Uyghur
Content
1 Introduction
Recommendation algorithms are frequently associated with biases such as selec-
tion bias [14], position bias [7,12], and popularity bias [15,16]. Popular recom-
mendation platforms have, in the past, been associated with patterns that lead
users to highly homogeneous content, resulting in certain phenomena such as
filter bubbles and echo chambers [11,30]. In such scenarios, users are isolated
from diverse content and are instead exposed to a narrower band of information.
This can pose the risk of reinforcing specific viewpoints [17,18].
This study examines the emotion and morality bias present in YouTube’s
recommendation algorithm. By analyzing the evolution of emotion and morality
across recommended YouTube videos related to the China-Uyghur crisis narra-
tive, we aim to determine if YouTube’s recommendation algorithm favors videos
with certain emotions over others and to explore the distribution of moral con-
tent across YouTube's recommendation algorithm. This research also carries out a network analysis and aims to determine whether echo chambers and topic shifting appear in the narratives by analyzing the eigenvector centrality of videos and tracing the communities.
2 Methodology
For this research, we introduce a drift analysis methodology that allowed us to
monitor changes in video characteristics and explore the patterns of the recom-
mendation algorithm. We apply the resulting approach to our dataset, which
consists of a collection of videos recommended through various methodologies.
Ax = λx (2)
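As a minimal illustration of Eq. 2, eigenvector centrality can be computed on a small toy recommendation network with networkx; the graph below is illustrative (and undirected for simplicity), not part of the collected dataset.

```python
import networkx as nx

# Toy recommendation network: nodes are videos, edges mean one video was
# recommended from another (values illustrative).
G = nx.Graph()
G.add_edges_from([("seed", "v1"), ("seed", "v2"), ("v1", "v2"), ("v1", "v3"), ("v2", "v3")])

centrality = nx.eigenvector_centrality(G)          # iteratively solves Ax = lambda x
most_influential = max(centrality, key=centrality.get)
```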
3 Results
3.1 Emotion Analysis
The goal of emotional drift analysis is to determine how emotions across the
seven categories (anger, surprise, fear, joy, neutral, disgust, and sadness) evolve
or drift across recommendation depths. To determine emotional drift, we com-
puted and visualized the predominant emotions at each depth of recommen-
dation, from seed to depth 4, on a line graph. In our analysis, we considered
video text data at two different places: video titles and video descriptions. This
process allowed us to effectively apply emotion analysis and visualize emotional
drift across different levels of video details. The neutral emotion effectively iso-
lates text data with no identifiable emotion embedded. As a result, the neutral
emotion in the line graph was not considered in our result analysis. From the
graph below in Fig. 1(a), we observe that on emotional drift analysis of the video
titles, there was a significant presence of fear, anger, and disgust emotions at the
seed level (depth 0), and a reduced presence of joy and surprise emotions. As we
moved from seed videos to depth 4 through the recommendations made by the
algorithm, we observed an increase in the proportion of joy and surprise emo-
tions. This trend was accompanied by a decrease in the previously heightened
fear, anger, and disgust emotions as we approached depth 4. This trend of the
emergence of positive emotions and decline of negative emotions across recom-
mendations was also seen in the emotional drift analysis of video descriptions in
Fig. 1(b) but with a clear level of distinction. On analyzing video descriptions,
we saw a more distinct pattern of emergence and decline; this is most likely due
to the higher level of content and information in video descriptions as compared
to video titles.
also evident in the video descriptions analysis depicted in Fig. 2(b). However, the
decline appears to occur at a slower pace compared to that in the video titles.
Fig. 2. Line graph showing the distribution of vice morality across recommendations
of videos using (a) video titles (b) video descriptions
content, like the “Learn Hebrew Alphabet” video, suggests initial diversification
in recommendations. In depths 2 and 3, recommendations drift notably from the
China-Uyghur topic. The focus shifts to financial transactions and the figure
“Ida Dayek”. This aligns with the research’s observation about changing themes
in deeper recommendations. In depth 4, the absence of China-Uyghur-related
content continues, indicating further content diversification with topics like space
exploration and politics.
To sum up, the persistence of the “Strange transaction at Ministry of
Finance” video across depths is intriguing and might signal how recommen-
dation algorithms may persist on certain topics, potentially diverting users away
from the original search or interest. The Table 1 acts as empirical evidence for the
research’s primary claim: YouTube’s recommendation system potentially drifts
from morally complex and emotionally charged subjects, leading users down
diverging paths and potentially shaping their perceptions and beliefs in the pro-
cess. This makes us realize just how tricky and subtle these recommendation
algorithms can be, especially when we’re diving into sensitive or emotionally
intense topics. Knowing about these shifts helps us all be smarter and more
mindful viewers.
Table 1. Influential videos in each depth of the China-Uyghur narrative and their
respective topics.
from the China-Uyghur seed videos, we see an increase in positive emotions and
a decrease in moral vices.
Adding to this, our network analysis, utilizing eigenvector centrality, high-
lighted how influential videos, over time, changed direction from the China
Uyghur topic. This suggests the recommendation system might not just be
responding to content but possibly to other factors like popularity. This com-
bined drift, in emotion and content, offers insight into the workings of YouTube’s
recommendation system, illustrating its tendency to shift users towards broader
or more prevalent content themes.
References
1. Shivhare, S.N., Khethawat, S.: Emotion detection from text (2012). https://doi.
org/10.48550/arXiv.1205.4944
2. Liu, Q., Huang, H., Feng, C.: Micro-blog post topic drift detection based on LDA
model. In: Behavior and Social Computing, Cham, pp. 106–118 (2013)
3. O’Hare, N., et al.: Topic-dependent sentiment analysis of financial blogs. In: Pro-
ceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis
for Mass Opinion, New York, NY, USA, pp. 9–16 (2009). https://doi.org/10.1145/
1651461.1651464.
4. Suhasini, M., Badugu, S.: Two step approach for emotion detection on twitter data.
Int. J. Comput. Appl. 179, 12–19 (2018). https://doi.org/10.5120/ijca2018917350
5. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)
6. Hopp, F.R., Fisher, J.T., Cornell, D., Huskey, R., Weber, R.: The extended Moral
Foundations Dictionary (eMFD): development and applications of a crowd-sourced
approach to extracting moral intuitions from text. Behav. Res. 53(1), 232–246
(2021). https://doi.org/10.3758/s13428-020-01433-0
7. Agarwal, A., Zaitsev, I., Wang, X., Li, C., Najork, M., Joachims, T.: Estimating
position bias without intrusive interventions. In: Proceedings of the Twelfth ACM
International Conference on Web Search and Data Mining, pp. 474–482 (2019).
https://doi.org/10.1145/3289600.3291017.
22. Xu, D., Tian, Z., Lai, R., Kong, X., Tan, Z., Shi, W.: Deep learning based emotion
analysis of microblog texts. Inf. Fusion 64, 1–11 (2020). https://doi.org/10.1016/
j.inffus.2020.06.002
23. Jamdar, A., Abraham, J., Khanna, K., Dubey, R.: Emotion analysis of songs based
on lyrical and audio features. IJAIA 6(3), 35–50 (2015). https://doi.org/10.5121/
ijaia.2015.6304
24. GitHub - medianeuroscience/emfd: The Extended Moral Foundations Dictionary
(E-MFD). https://github.com/medianeuroscience/emfd. Accessed 04 June 2023
25. Okeke, O.I., Cakmak, M.C., Spann, B., Agarwal, N.: Examining content and emo-
tion bias in youtube’s recommendation algorithm. In the Ninth International Con-
ference on Human and Social Analytics, Barcelona, Spain (2023)
26. Banjo, D. S., Trimmingham, C., Yousefi, N., Agarwal, N.: Multimodal characteri-
zation of emotion within multimedia space (2022)
27. Shaik, M., Hussain, M., Stine, Z., Agarwal, N.: Developing situational awareness
from blogosphere: an Australian case study (2021)
28. DiCicco, K., Noor, N. B., Yousefi, N., Maleki, M., Spann, B., Agarwal, N.: Toxicity
and Networks of COVID-19 discourse communities: a tale of two social media
platforms. In: Proceedings (2020). http://ceur-ws.org. ISSN, 1613, 0073
29. Maharani, W., Gozali, A.A.: Degree centrality and eigenvector centrality in twitter.
In: 2014 8th International Conference on Telecommunication Systems Services and
Applications (TSSA), pp. 1–5. IEEE (2014)
30. Kirdemir, B., Agarwal, N.: Exploring bias and information bubbles in youtube’s
video recommendation networks. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro,
E., Rocha, L.M., Sales-Pardo, M. (eds.) COMPLEX NETWORKS 2021, vol.
1073, pp. 166–177. Springer, Heidelberg (2022). https://doi.org/10.1007/978-3-
030-93413-2 15
31. Kirdemir, B., Kready, J., Mead, E., Hussain, M.N., Agarwal, N.: Examining video
recommendation bias on YouTube. In: Boratto, L., Faralli, S., Marras, M., Stilo, G.
(eds.) BIAS 2021, pp. 106–116. Springer, Cham (2021). https://doi.org/10.1007/
978-3-030-78818-6 10
32. Kirdemir, B., Kready, J., Mead, E., Hussain, M.N., Agarwal, N., Adjeroh, D.:
Assessing bias in YouTube’s video recommendation algorithm in a cross-lingual and
cross-topical context. In Social, Cultural, and Behavioral Modeling: 14th Interna-
tional Conference, SBP-BRiMS 2021, Virtual Event, 6–9 July 2021, Proceedings 14,
pp. 71–80. Springer, Heidelberg (2021). https://doi.org/10.1007/978-3-030-78818-
6 10
Improving Low-Latency Mono-Channel
Speech Enhancement by Compensation
Windows in STFT Analysis
1 Introduction
Speech enhancement (SE) algorithms recover clean speech from a mixture of
speech and noise. It is a key component in audio communication technology
stacks, serving as a crucial pre-processing step for down-stream tasks such as
automatic speech recognition (ASR) [6,8], acoustic echo cancellation [16], etc.
Besides speech quality, low latency is among the most important desirable prop-
erties of a SE system, especially in real-time applications such as voice-video
teleconferencing or hearing aids [4,5].
(Work performed while Minh N. Bui was a research intern at Microsoft.)
Traditional SE algorithms include filter-bank and statistics-based models such as filtering, spectral subtraction, and the optimally modified log-spectral amplitude estimator [14]. These methods, however, struggle in complex noise environments and are often inadequate for the current demands of SE. Recent advances in deep
learning [10,12,19] have made significant progress in the field and have been
preferable to classical SE methods.
Deep learning-based SE methods can be categorized into temporal-based and
spectral-based methods. The former operates directly on waveforms whereas
the latter takes a transformation of the original waveform (typically short-time
Fourier Transform - STFT) as input. Spectral-based methods require signifi-
cantly more time samples to obtain a balanced trade-off in time-frequency tiling.
As a result, spectral-based approaches produce better prediction accuracy at the
cost of higher latency compared to temporal-based approaches. Recent spectral-
based methods have latency in the regime of 40 ms [7,20,21], much higher than
the typical value on temporal methods. Our work focuses on improving the
latency of spectral-based methods while preserving their superior predictive per-
formance.
In spectral-based methods, algorithmic latency relies heavily on the STFT window design (see Sect. 2.1). To modify these windows, certain perfect reconstruction (PR) constraints [1] must be satisfied. Recent works have successfully
Speech Enhancement with Compensation STFT Window 365
2 Proposed Method
2.1 Spectral Methods: Pipeline and Algorithmic Latency
Fig. 2. Example of algorithmic latency for STFT with a window size of 4 ms and a hop size of 1 ms, and illustration of the main and compensation windows. (a) At time t, the information for the beginning of the 4th block is available (gray signal), but to produce the corresponding output wave (black signal), the algorithm needs to wait until the 7th block is completely received for overlap-add. (b) The compensation window (green) looks further into the past compared to the main window (red); both are depicted with their Hann windowing functions.
where {Xm , Fm , wm } and {Xc , Fc , wc } are the STFT frame, the window size
and the windowing function of the main window and the compensation window,
respectively. Also, Δ = Fc −F
2
m
. Intuitively, Xc is the transformation of the right-
shifted version, by an amount of Δ, of x. The compensation hop size is always
set equal to the main hop size. The set (Xm , Xc ) is used as the input of the
enhancement algorithm.
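A minimal numpy sketch of this analysis step is given below; the framing, Hann windows, and signal lengths are illustrative assumptions, and only the shapes and the Δ shift follow the description above.

```python
import numpy as np

def stft(x, frame_len, hop, window):
    """Naive STFT: frame the signal, window each frame, take an rFFT."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=-1)

fs, Fm, Fc, hop = 48000, 512, 2048, 256        # main/compensation sizes, shared hop
x = np.random.randn(fs)                         # 1 s of dummy audio
delta = (Fc - Fm) // 2                          # Δ = (Fc - Fm) / 2

# Main-window frames, plus compensation-window frames computed on the
# right-shifted (delayed by Δ) signal so that it looks further into the past.
Xm = stft(x, Fm, hop, np.hanning(Fm))
x_shifted = np.concatenate([np.zeros(delta), x])[: len(x)]
Xc = stft(x_shifted, Fc, hop, np.hanning(Fc))
print(Xm.shape, Xc.shape)                       # (frames_m, Fm//2+1), (frames_c, Fc//2+1)
```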
The benefits of our proposal are threefold. First, as the main analysis-synthesis window pair is kept intact, PR is always guaranteed without the complicated window design schemes of previous works [15,18]. Second, the compensation window can thus ignore PR constraints, allowing downstream feature learning to focus on learning useful representations for the prediction task. Finally, the transformation for the compensation window is versatile: one can replace the STFT with equivalent transformations.
2.4 Multi Encoder Deep Neural Network for Low Latency Deep
Noise Suppression
We use a UNet-based model with causal convolutions; the network directly regresses the enhanced spectrogram given two input spectrograms: one from the main window and another from the compensation window. The DNN has three main components: encoder, enhancer, and decoder. The encoder comprises two different encoder heads corresponding to the main window and the compensation window. The outputs of these heads are fused by a 1 × 1 convolution before being fed into several shared encoder blocks. The enhancer includes 4 enhancement blocks, each comprising 4 sequential ResNet blocks. The decoder consists of two separate branches, both of which process the same output of the enhancer to produce the lower and higher frequency parts of the output, respectively. The outputs of the decoders are concatenated to form an enhanced spectrogram. Each encoder-decoder block includes a convolution layer, a LeakyReLU layer with negative slope 0.2, and a batch normalization layer. This architecture is depicted in Fig. 1.
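A minimal PyTorch sketch of one encoder block and the two-head fusion is shown below; the kernel sizes, strides, channel widths, and dummy input shapes are our own assumptions (and time-causal padding is omitted for brevity):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder/decoder block as described above:
    convolution -> LeakyReLU(0.2) -> batch normalization.
    Kernel size, stride, and channel widths are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 2)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, padding=(1, 1))
        self.act = nn.LeakyReLU(negative_slope=0.2)
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):              # x: (batch, channels, time, freq)
        return self.norm(self.act(self.conv(x)))

# Two encoder heads (main and compensation spectrograms) fused by a 1x1 conv.
head_main = EncoderBlock(2, 32)        # 2 input channels: real and imaginary parts
head_comp = EncoderBlock(2, 32)
fuse = nn.Conv2d(64, 32, kernel_size=1)

xm = torch.randn(1, 2, 100, 128)       # dummy main-window spectrogram
xc = torch.randn(1, 2, 100, 128)       # dummy compensation-window spectrogram
z = fuse(torch.cat([head_main(xm), head_comp(xc)], dim=1))
```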
Let φ(·) : C^{T×(F/2+1)} → C^{T×(F/2+1)} be a preprocessing function defined as φ(X) = (|X|^{1/3}/|X|) X. Let Xc ∈ R^{T×Fc/4×2} and Xm ∈ R^{T×Fm/4×2} be the processed spectrograms of the compensation and main windows after applying φ, removing the upper half of the frequency bins, and concatenating the real and imaginary parts to form a third dimension. The network produces a spectrogram X̂ ∈ R^{T×Fm/2×2}.
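The compression φ keeps the phase and replaces the magnitude with its cube root; a small numpy sketch (with an epsilon added to avoid division by zero, which the formula leaves implicit):

```python
import numpy as np

def compress(X, eps=1e-8):
    """phi(X) = (|X|^(1/3) / |X|) * X, i.e. keep the phase and
    compress the magnitude to its cube root."""
    mag = np.abs(X) + eps
    return (mag ** (1.0 / 3.0) / mag) * X

X = np.fft.rfft(np.random.randn(512))            # dummy complex spectrum
Xc = compress(X)
assert np.allclose(np.angle(Xc), np.angle(X))                       # phase preserved
assert np.allclose(np.abs(Xc), np.abs(X) ** (1 / 3), atol=1e-4)     # magnitude compressed
```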
For the loss function L, we employ the consistency projection loss as in [17] and [2]. Let X and X̂ be the ground-truth and enhanced spectrograms (unprocessed by φ), and let γw,l(·) be the STFT transformation parameterized by window size w and hop size l.
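As a rough, hedged sketch of a consistency-projection-style loss in the spirit of [17] and [2], the enhanced spectrogram can be resynthesized and re-analyzed before comparison with the target; the window size, hop size, and squared-magnitude error below are our own assumptions, not the paper's exact formulation:

```python
import torch

def consistency_projection_loss(X_hat, X, win=512, hop=256):
    """Sketch of a consistency-projection loss: resynthesize the enhanced
    spectrogram with an iSTFT, re-analyze it with an STFT (the 'projection'),
    and compare the result against the ground-truth spectrogram.
    X_hat, X: complex STFTs of shape (win//2 + 1, frames)."""
    window = torch.hann_window(win)
    x_hat = torch.istft(X_hat, n_fft=win, hop_length=hop, window=window)
    X_proj = torch.stft(x_hat, n_fft=win, hop_length=hop, window=window,
                        return_complex=True)
    T = min(X_proj.shape[-1], X.shape[-1])
    return (X_proj[..., :T] - X[..., :T]).abs().pow(2).mean()
```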
3 Experiments
3.1 Datasets and Metrics
We train and test the networks using two datasets: VCD and a Microsoft dataset derived from the DNS challenge dataset [4]. Both are full-band (48 kHz) speech datasets that are often used for benchmarking noise suppression systems.
The VCD dataset includes 11572 noisy-clean pairs for training and 824 pairs
for testing. We pre-process the audio samples by zero-padding at the beginning
of the signals so that the minimum duration is 4 s.
The Microsoft dataset is derived from [4] by data augmentation. This dataset includes 1000 h of audio, i.e. 360,000 noisy-clean pairs, for training. Each noisy or clean sample is 10 s long. The test set includes 560 multilingual noisy-clean pairs, each lasting 17–21 s.
For evaluation, we ignore the first 3 s of each audio clip (since the convergence time is 2.2 s) and evaluate the rest. Since the PESQ score only operates on signals sampled at up to 16 kHz [9], we downsample the enhanced outputs to 16 kHz and compute the PESQ score for each output given the corresponding target signal. We also employ STOI [11] and DNSMOS [3], which includes background (BAK), signal (SIG), and overall (OVL) scores.
results support our aforementioned insights and claims. First and most importantly, even though shrinking the window size reduces the latency and sacrifices speech enhancement quality (e.g., line 6 compared to line 1), the use of the compensation window keeps the speech quality comparable to that produced by high-latency settings (e.g., lines 5, 8, and 11 compared to line 1). Second, using the compensation window alone significantly outperforms the baseline that uses only the short main window. Third, utilizing an asymmetric Hann window for compensation yields better predictions than the regular Hann window. Finally, combining both the main window and the compensation window produces the best prediction quality (lines 5 and 8). We also observe that in the 5 ms case, using both the main window and the compensation window does not give the best results on all metrics (lines 10 and 11); we believe the reason is that when the main window becomes too small, it may no longer offer significant useful information, hence a better design is needed for sub-5 ms systems.
Table 1. Performance on the VCD test set. W, H, M, C, Lat, and (R) denote the main window size, hop size, main window used, compensation window used, latency, and regular (symmetric) Hann window, respectively.
4 Ablation Study
We conduct an ablation study to better understand the impact of the compensation window. Table 2 shows the SE quality for several compensation window sizes. We keep the baseline main window size at 512, vary the compensation window size from 512 to 4096, and observe that the 2048 window size achieves the best performance. Note that an overly long window (4096) degrades the signal quality, as it includes unrelated information from the past and also increases network complexity.
Table 3 shows how increasing the number of compensation windows in addi-
tion to the main window influences the enhancement performance. For the two
#    Window size  Hop size  Main window  Compensation window  PESQ    Lat (ms)
0    2048         512       ✓            –                    2.4132  42
1    2048         1024      ✓            –                    2.3539  42
2    1024         512       ✓            –                    2.3242  21
3    1024         512       –            ✓                    2.3555  21
4    1024         512       ✓            ✓                    2.3570  21
5    512          256       ✓            –                    2.1906  10
6    512          256       –            ✓                    2.2153  10
7    512          256       ✓            ✓                    2.2250  10
8    256          128       ✓            –                    2.1222  5
9    256          128       –            ✓                    2.1614  5
10   256          128       ✓            ✓                    2.1676  5
5 Conclusions
In this work, we propose a simple approach to improve the enhancement quality of a low-latency SE system. Through various experiments on two different datasets, we observe that simply adding a compensation window alongside the main analysis window improves the speech quality while lowering the latency, possibly pushing it down to 5 ms. Future possible
directions include exploring specific window/filter designs for this compensation
window and taking advantage of other forms of feature representation in this
compensation window.
References
1. Allen, J.: Short term spectral analysis, synthesis, and modification by discrete
Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 25(3), 235–238
(1977). https://doi.org/10.1109/TASSP.1977.1162950
2. Braun, S., Gamper, H., Reddy, C.K.A., Tashev, I.: Towards efficient models for
real-time deep noise suppression (2021). https://doi.org/10.48550/ARXIV.2101.
09249, https://arxiv.org/abs/2101.09249
3. Dubey, H., et al.: Deep speech enhancement challenge at ICASSP 2023. In: ICASSP
(2023)
4. Dubey, H., et al.: ICASSP 2022 deep noise suppression challenge. In: ICASSP
(2022)
5. Graetzer, S., et al.: Clarity-2021 challenges: machine learning challenges for advanc-
ing hearing aid processing. In: Interspeech (2021)
6. Li, C.Y., Vu, N.T.: Improving speech recognition on noisy speech via speech
enhancement with multi-discriminators CycleGAN. In: 2021 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU), pp. 830–836 (2021).
https://api.semanticscholar.org/CorpusID:245123920
7. Li, Q., Gao, F., Guan, H., Ma, K.: Real-time monaural speech enhancement with
short-time discrete cosine transform (2021). https://doi.org/10.48550/ARXIV.
2102.04629, https://arxiv.org/abs/2102.04629
8. Pandey, A., Liu, C., Wang, Y., Saraf, Y.: Dual application of speech enhancement for automatic speech recognition. In: IEEE Spoken Language Technology Workshop, SLT 2021, Shenzhen, China, 19–22 January 2021, pp. 223–228. IEEE (2021). https://doi.org/10.1109/SLT48900.2021.9383624
9. Rix, A.W., Beerends, J.G., Hollier, M., Hekstra, A.P.: Perceptual evaluation of
speech quality (PESQ)-a new method for speech quality assessment of telephone
networks and codecs. In: 2001 IEEE International Conference on Acoustics, Speech,
and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 2, pp. 749–752
(2001). https://api.semanticscholar.org/CorpusID:5325454
10. Schröter, H., Escalante, A.N., Rosenkranz, T., Maier, A.K.: DeepFilternet: a low
complexity speech enhancement framework for full-band audio based on deep
filtering. In: ICASSP 2022 - 2022 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), pp. 7407–7411 (2021). https://api.
semanticscholar.org/CorpusID:238634774
11. Taal, C., Hendriks, R., Heusdens, R., Jensen, J.: A short-time objective intelligi-
bility measure for time-frequency weighted noisy speech, pp. 4214–4217 (2010).
https://doi.org/10.1109/ICASSP.2010.5495701
12. Taherian, H., Eskimez, S.E., Yoshioka, T., Wang, H., Chen, Z., Huang, X.: One
model to enhance them all: array geometry agnostic multi-channel personalized
speech enhancement. In: ICASSP 2022 - 2022 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 271–275 (2021). https://
api.semanticscholar.org/CorpusID:239049883
13. Valentini-Botinhao, C.: Noisy speech database for training speech enhancement
algorithms and TTS models (2017)
14. Vihari, S., Murthy, A., Soni, P., Naik, D.: Comparison of speech enhancement
algorithms. Procedia Comput. Sci. 89, 666–676 (2016). https://doi.org/10.1016/j.
procs.2016.06.032
15. Wang, Z.Q., Wichern, G., Watanabe, S., Roux, J.L.: STFT-domain neural speech
enhancement with very low algorithmic latency. IEEE/ACM Trans. Audio Speech
Lang. Process. 31, 397–410 (2022). https://api.semanticscholar.org/CorpusID:
248300088
16. Westhausen, N.L., Meyer, B.T.: Acoustic Echo Cancellation with the Dual-Signal
Transformation LSTM Network. In: ICASSP 2021 - 2021 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7138–7142
(2021). https://doi.org/10.1109/ICASSP39728.2021.9413510
17. Wisdom, S., Hershey, J.R., Wilson, K.W., Thorpe, J., Chinen, M., Patton, B.,
Saurous, R.A.: Differentiable consistency constraints for improved deep speech
enhancement. CoRR abs/1811.08521 (2018). http://arxiv.org/abs/1811.08521
18. Wood, S.U.N., Rouat, J.: Unsupervised low latency speech enhancement with RT-
GCC-NMF. IEEE Journal of Selected Topics in Signal Processing 13(2), 332–346
(2019). https://doi.org/10.1109/jstsp.2019.2909193
19. Zhang, G., Yu, L., Wang, C., Wei, J.: Multi-scale temporal frequency convolutional
network with axial attention for speech enhancement. In: ICASSP 2022 - 2022 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 9122–9126 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746610
20. Zhang, Z., Zhang, L., Zhuang, X., Qian, Y., Li, H., Wang, M.: FB-MSTCN: a
full-band single-channel speech enhancement method based on multi-scale tem-
poral convolutional network (2022). https://doi.org/10.48550/ARXIV.2203.07684,
https://arxiv.org/abs/2203.07684
21. Zhao, S., Ma, B., Watcharasupat, K.N., Gan, W.S.: FRCRN: boosting feature rep-
resentation using frequency recurrence for monaural speech enhancement (2022).
https://doi.org/10.48550/ARXIV.2206.07293, https://arxiv.org/abs/2206.07293
Network Embedding
Filtering Communities in Word
Co-Occurrence Networks to Foster
the Emergence of Meaning
1 Introduction
In the field of Natural Language Processing (NLP), one of the main challenges
is to represent the meaning of words into vectors, these vectors then being used
as input to classification systems in order to solve various tasks such as part-of-
speech tagging, named entity recognition, machine translation, etc. Vectors that
represent words are commonly designated as word embeddings: the meaning of
words is embedded in a small latent space with dense vectors. Approaches to
train such vectors are based on the distributional hypothesis. Harris defines this
hypothesis, writing that “linguistic items with similar distributions have similar
meanings”. To train word embeddings, one thus needs to estimate these distributions using word co-occurrences from large corpora. The seminal approaches to train word embeddings are actually based on word co-occurrence matrix factor-
ization [16,19,20]. Other popular approaches such as Word2vec [18] use neural
networks to build lexical representations, thus approximating matrix factoriza-
tion methods [15]. Finally, transformer-based approaches [17] and large language
Fig. 1. SINr: words are represented based on the communities they are linked to. (a) Text corpus. Before preprocessing: “My cat eats mice. Your owl eats mice. This owl flies. A crow flies slower than an eagle.” After preprocessing: “cat eats mouse. owl eats mouse. owl flies. crow flies slower eagle.” (b) A graph G = (V, E, W) partitioned in two communities. (c) Bipartite projection of G into a graph G' = (⊤, ⊥, E, W) along the communities; the weight on the edges is based on NR, the proportion of the weighted degree of each node related to the community. (d) Adjacency matrix of G'; each row is a SINr embedding.
Fig. 2. Distribution of community sizes on BNC (left) and UkWac (right) corpora. The
ordinate axis is in logarithmic scale.
Fig. 3. Distribution of the number of activations per dimension on BNC (left) and UkWac
(right) corpora. The ordinate axis is in logarithmic scale.
not consistent topically. We thus propose to remove these dimensions from the model, and we will see in Sect. 4 that this improves model performance.
We can also see that many dimensions are actually activated by only a few words. While these dimensions may be useful for very specific topics, they may not be useful for most of the vocabulary. They may also be noisy dimensions that penalize performance. We also propose to remove these dimensions from the models, and we will demonstrate in Sect. 4 that this notably reduces the memory footprint of the model (dividing the number of dimensions by 5) while preserving performance.
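A small numpy sketch of the two filters discussed above; we assume the embedding matrix is the (words × communities) matrix of Fig. 1d, that a dimension is "activated" by a word when its entry is non-zero, and the toy sizes and thresholds are illustrative:

```python
import numpy as np

def filter_dimensions(emb, min_activations, max_activations):
    """Drop dimensions (communities) activated by too many words
    (noisy, topically inconsistent) or by too few words (overly specific)."""
    activations = np.count_nonzero(emb, axis=0)       # words activating each dimension
    keep = (activations >= min_activations) & (activations <= max_activations)
    return emb[:, keep], keep

vocab_size, n_dims = 10000, 800
rng = np.random.default_rng(0)
emb = (rng.random((vocab_size, n_dims)) < 0.05).astype(np.float32)   # toy sparse matrix
filtered, kept = filter_dimensions(emb, min_activations=400, max_activations=600)
print(filtered.shape, kept.sum())
```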
are at 3.55. Furthermore, SimLex does not rely on frequency information from
a reference corpus to select its word pairs, it thus includes rarer words than
WS353, MEN or SCWS. This dataset is of particular interest due to its difficulty,
and its split with regard to parts of speech: SimLex999 being the whole dataset
with noun-noun, adjective-adjective and verb-verb pairs, and SimLex665 being
the noun subset. This split allows us to determine the validity of our model-
ing on different word categories, which probably follow different distributions of
contexts.
Similarities in embedding spaces using cosine similarity are supposed to be
correlated with human similarities, as shown in Fig. 4. Correlation is computed
with Spearman’s definition: the closer to 1, the better. In order to assess the
performance of our model, we also consider Word2vec, one of the most popular approaches to train word embeddings. We do not consider more recent state-of-the-art approaches that achieve better performance, because our approach focuses on interpretability and low compute. Eventually, it may be used in more complex architectures, such as transformers.
Fig. 4. Example of word similarity rating from the MEN dataset and cosine similarity
between vectors.
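A sketch of this standard evaluation protocol (cosine similarities compared with human ratings via Spearman correlation); the tiny vocabulary and ratings below are invented for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embeddings, pairs_with_gold):
    """Spearman correlation between cosine similarities of word vectors and
    human similarity ratings; pairs missing from the vocabulary are skipped."""
    system, gold = [], []
    for w1, w2, rating in pairs_with_gold:
        if w1 in embeddings and w2 in embeddings:
            u, v = embeddings[w1], embeddings[w2]
            system.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
            gold.append(rating)
    rho, _ = spearmanr(system, gold)
    return rho

# Toy example with made-up vectors and MEN-style ratings in [0, 50].
emb = {w: np.random.rand(64) for w in ["sun", "sunlight", "automobile", "car"]}
pairs = [("sun", "sunlight", 50.0), ("automobile", "car", 50.0), ("sun", "car", 10.0)]
print(evaluate_similarity(emb, pairs))
```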
Preprocessing and Model Parameters. The text corpora are preprocessed with
spaCy to improve the quality of cooccurrence information and reduce the vocab-
ulary to be covered by the models. The text is tokenized and lemmatized,
named entities are chunked, words shorter than three characters, punctuation
and numerical characters are deleted. The minimum frequency to represent a
type is set at 20 for BNC and 50 for UkWac. All models use a cooccurrence win-
dow of 5 words to the left and to the right of a target within sentence boundaries.
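A minimal sketch of the co-occurrence counting implied by these settings, with a symmetric window of 5 tokens that never crosses sentence boundaries (tokenization, lemmatization, and frequency filtering with spaCy are abstracted away):

```python
from collections import Counter

def cooccurrence_counts(sentences, window=5):
    """Count co-occurrences of (unordered) word pairs within +/- `window`
    positions of each other, never crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                counts[tuple(sorted((w, tokens[j])))] += 1
    return counts

sentences = [["cat", "eat", "mouse"], ["owl", "eat", "mouse"], ["owl", "fly"]]
print(cooccurrence_counts(sentences))
```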
4.2 Results
Table 1. Summary of the results of competing models and of SINr and its filtered
version, SINr-filtered, introduced in this paper.
Filtering the Distribution’s Head. As one can see in Fig. 5, model performance varies considerably with the filter threshold. On the right, when the threshold is set to 12 000, no filtering is applied. Filtering more and more, i.e. moving the threshold to the left until the best threshold of 4000, gradually increases the performance of the models for both UkWac and BNC. Between 4000 and 2000, results plateau. Below 2000, significant information is removed, leading to a decrease in performance. Using the 4000 threshold allows SINr to catch up with the performance of Word2vec, our reference. Indeed, the 4000 filter yields a gain of 5 points in performance for the MEN dataset (from 0.67 to 0.72 for BNC and from 0.70 to 0.75 for UkWac), and a slight gain of 2 points for the WS353 dataset (from 0.63 to 0.65 for BNC, from 0.68 to 0.70 on UkWac) and the SCWS dataset (from 0.56 to 0.58 for BNC, from 0.56 to 0.59 for UkWac). Such gains are statistically significant, and they are particularly interesting because they result from a simplification of the model, even if only 95 (resp. 90) dimensions are removed on average from the BNC (resp. UkWac) model using this filter. The SimLex dataset is much harder than the three others, for SINr but also for the reference model Word2vec. However, as one can see, filtering also yields significant gains for SimLex, especially for SimLex999 on BNC (from 0.20 to 0.25), and the results are better between the 2000 and 4000 thresholds, as for the other datasets.
Fig. 5. Similarity on BNC (left) and UkWac (right) corpora. Dimensions often activated are removed according to the threshold in abscissa.
Fig. 6. Similarity on BNC (left) and UkWac (right) corpora. Dimensions scarcely acti-
vated are removed according to the threshold in abscissa.
Filtering the Distribution’s Long Tail. The effect of filtering the long tail of the distribution of activations is quite different, as one can see in Fig. 6. On the left, no filter is applied, and increasing the filter does not lead to any gain in performance. Still, it is interesting to notice that filtering dimensions with fewer than 500 activations does not lead to any significant loss of information on the five similarity datasets used for evaluation. Indeed, it divides the number of dimensions of the model by 5, from roughly 6600 to 1200 on average for BNC, and from 5700 to 1100 for UkWac, thus drastically reducing its memory footprint.
Fig. 7. Filtering effects on the community sizes distribution for BNC (left) and UkWac
(right).
Singling out the Subset of Distinctive Contexts. One may wonder if from one
run of SINr to another, by filtering dimensions with more than 500 and less
than 4000 activations, the vocabulary that forms communities that are kept is
the same. It is surprising to notice that it is mostly the case: roughly 80% of the
vocabulary kept is actually the same over ten runs when considering BNC and
UkWac separately. However, this set is not the same from one corpus to another,
only 35% of the vocabulary kept is actually common to BNC and UkWac. It seems
to mean that these respective subsets of the vocabulary are essential to describe
the meaning of words in these respective corpora. Those results, combined with the similarity improvements, suggest that our filtering approach singles out the subset of dimensions that best describes a given corpus. However, the evaluation results only give insight into the subset of the lexicon covered by the similarity datasets, a subset that is heavily biased toward nouns, and especially frequent concrete nouns.
5 Conclusion
SINr is a graph-based approach to train word embeddings which requires low compute and whose results are interpretable. In this paper, we show that we can significantly improve the model's performance and reduce its memory footprint by filtering its dimensions. Indeed, filtering the most activated dimensions gains a few points on the similarity task for each dataset considered, showing that these dimensions actually introduce noise into the model. This gain allows SINr to perform on par with Word2vec. Furthermore, filtering out the least activated dimensions divides the number of dimensions by 5 while preserving performance. We show that these filters relying on the activations of the dimensions are somewhat, but not completely, correlated with community sizes, showing their relevance. Finally, we demonstrate that the vocabulary of the communities corresponding to the dimensions that are not filtered remains largely the same from one run of SINr to another. We plan to experiment on other corpora but also on downstream tasks to confirm that these results generalize to a variety of contexts. Furthermore, it would be particularly interesting to test the ability of these filtered embeddings to model the meaning of very specialized vocabulary, to evaluate whether removing dimensions affects the representation of these words.
Acknowledgments. The work has been funded by the ANR project DIGING (ANR-
21-CE23-0010).
References
1. Adamic, L.: Unzipping Zipf’s law. Nature 474(7350), 164–165 (2011)
2. Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M.: A study on similarity and relatedness using distributional and WordNet-based approaches. In: NAACL, pp. 19–27 (2009)
22. Prouteau, T., Dugué, N., Camelin, N., Meignier, S.: Are embedding spaces inter-
pretable? Results of an intrusion detection evaluation on a large French corpus. In:
LREC (2022)
23. Subramanian, A., Pruthi, D., Jhamtani, H., Berg-Kirkpatrick, T., Hovy, E.: Spine:
SParse interpretable neural embeddings. In: AAAI (2018)
DFI-DGCF: A Graph-Based
Recommendation Approach
for Drug-Food Interactions
Sofia Bourhim(B)
1 Introduction
Drug-food interaction (DFI) refers to the alteration of a drug’s pharmacological
effect by a food component. DFIs can occur through a variety of mechanisms,
including changes in the absorption, distribution, metabolism, and excretion of
the drug. DFIs can exert a substantial influence on the safety and effectiveness of
drug therapy. In certain instances, DFIs may even result in severe adverse drug
reactions. It can occur through a variety of mechanisms that include changes in
absorption where food components can interfere with the absorption of drugs
from the gut, leading to lower blood levels of the drug. For instance, grapefruit
juice can inhibit the absorption of the cholesterol-lowering drug atorvastatin. It
also includes the change in distribution where the food components can bind to
drugs in the bloodstream, preventing them from reaching their target tissues. For
example, calcium can bind to the antibiotic tetracycline, making it less effective
against bacteria.
Over the past few years, there has been a surge in interest towards developing
predictive models for DFIs. These models serve the purpose of detecting potential
DFIs, evaluating the associated risks, and devising effective strategies to mitigate
such risks. There are a variety of models for predicting DFIs, including in silico models, which use computer simulations to predict the interactions between drugs and food components, statistical models, and deep learning models.
Lately, deep learning models applied to DFIs have shown interesting results; they can learn complex relationships between drug molecules and food components, which allows them to predict new drug-food interactions that would not be detected by other methods. Traditionally, the identification of DFIs has been
a difficult task due to the limited availability of data and the complexity of
the interactions between drugs and food. However, recent advances in machine
learning have made it possible to develop more effective methods for identifying
DFIs.
Graphs are a powerful data structure for modeling biomedical systems. They
can be used to represent the relationships between different entities, such as
drugs, food compounds, and proteins. This makes them well-suited for modeling
DFIs, which are interactions between drugs and food compounds. In recent years,
there has been a growing interest in using graph embedding methods to predict
DFIs. Graph embedding methods learn a low-dimensional representation of each
node in a graph while maximally preserving the structural information of the
graph. This allows the relationships between nodes to be preserved, even when
the nodes are represented in a low-dimensional space. One of the most popular
graph embedding methods for DFI prediction is DeepDDI [3]. DeepDDI uses a
deep neural network to learn the representations of drugs and food compounds.
The model is trained on a dataset of known DFIs, and it can then be used to
predict new DFIs.
Paper Organization. The paper starts with Sect. 2, which introduces the novel workflow for recommending novel drug-food interactions. The results of applying a graph-based community approach are shown and discussed in Sect. 3. Finally, we review possible directions for future investigation.
2 Methodology
We propose a workflow for inferring novel DFIs, set as a link prediction problem.
The workflow consists of four main steps: (1) Data Preparation, (2) DFI Network Construction, (3) DGCF Model Adaptation, and (4) DFI Prediction.
The row vector of community affiliation F for node u and node v are denoted
as Fu and Fv , respectively. The set of edges that connect nodes in the graph is
represented by E.
In order to optimize the F matrix, we update the parameters of the neural
network architecture by minimizing the negative log-likelihood. Our encoding
layer employs a 2-layer GCN with a hidden size of 128, and the final layer
outputs the number of communities to be detected. To prevent overfitting, we
incorporate batch normalization and dropout with a ratio of 0.5. Furthermore,
we leverage the unique relationships conveyed by the two graphs by merging the
outputs of the CE and EB-GCN layers, which correspond to the drug-food and
drug-drug graphs, respectively.
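A sketch of such a community-affiliation encoder using PyTorch Geometric; the layer sizes follow the description above (2-layer GCN, hidden size 128, dropout 0.5), while the ReLU on the output, the toy graph, and all other details are our assumptions:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class CommunityEncoder(nn.Module):
    """2-layer GCN producing a non-negative community-affiliation matrix F."""
    def __init__(self, in_dim, n_communities, hidden=128, dropout=0.5):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.bn = nn.BatchNorm1d(hidden)
        self.dropout = nn.Dropout(dropout)
        self.conv2 = GCNConv(hidden, n_communities)

    def forward(self, x, edge_index):
        h = torch.relu(self.bn(self.conv1(x, edge_index)))
        h = self.dropout(h)
        return torch.relu(self.conv2(h, edge_index))   # F: nodes x communities

# Toy usage: 6 nodes, 8-dimensional features, 3 communities.
x = torch.randn(6, 8)
edge_index = torch.tensor([[0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 0]])
F_affil = CommunityEncoder(8, 3)(x, edge_index)
```

The negative log-likelihood objective used to fit F is not shown here; only the encoder's shape follows the text.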
the embedding size. In this encoding layer, we input the drug-food bipartite
graph, which consists of nodes belonging to drugs and nodes belonging to food.
The main objective is to capture collaborative signals from various types of
interactions in the network and learn the final representations for both drugs and
food. To achieve this, we leverage the power of GNN algorithms applied to the
bipartite graph. The EB-GCN layer exploits the high-order connectivity present
in drug-food interactions (DFIs). It utilizes the message-passing mechanism of
GNNs to encode drug and food nodes by iteratively aggregating information
from neighboring drugs. The high-order propagation is achieved through stacking
multiple embedding layers. Each layer involves the construction and aggregation
of messages. The construction of the message for a drug-food pair (u, i) is defined
as mu←i :
Here, N denotes the set of pairwise training data with observed and unob-
served interactions. The EB-GCN layer generates two embeddings, one for drugs
and one for food.
CP = (F · EU ) · EI (9)
To select the most relevant communities, we choose the top two affiliations
for each drug. The fusion formula captures the similarity between the drug u
and the food constituent i, taking into account the profile of the communities to
which the drug belongs.
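A small numpy sketch of Eq. (9) combined with the top-two community selection; all matrix shapes here (F as drugs × communities, EU as communities × d, EI as d × foods) are our own assumptions about how the symbols fit together, not the paper's exact definitions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_drugs, n_foods, n_comms, d = 100, 50, 8, 64

F  = rng.random((n_drugs, n_comms))    # drug-community affiliations (assumed shape)
EU = rng.random((n_comms, d))          # per-community profiles (assumed shape)
EI = rng.random((d, n_foods))          # food embeddings (assumed shape)

# Keep only the top-two community affiliations per drug, as described above.
top2 = np.argsort(F, axis=1)[:, -2:]
F_top2 = np.zeros_like(F)
np.put_along_axis(F_top2, top2, np.take_along_axis(F, top2, axis=1), axis=1)

CP = (F_top2 @ EU) @ EI                # Eq. (9): drug-food community-profile scores
print(CP.shape)                        # (100, 50)
```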
Parameter Settings. Our DGCF model was developed using PyTorch. The model has an embedding size of 64 and was trained using the Adam optimizer with default parameters. The Xavier initializer was employed to initialize the model parameters. The learning rate, L2 regularization coefficient, and dropout ratio were set to 10^-3, 10^-5, and 0.5, respectively. Additionally, an early stopping mechanism was implemented to prevent overfitting of the model.
Comparison with Baselines. The DGCF model performed better than the older approaches on both recall and precision. It performed on par with DFinder on recall (87.40% for DGCF versus 87.33% for DFinder), meaning the two models are very similar in their ability to correctly classify positive instances. However, DGCF reaches a precision of 88.01%, higher than DFinder's 87.11%, so its positive predictions are more often correct. This is an important property for drug discovery, where it is important to avoid false positives: DGCF is better at distinguishing between positive and negative instances, and is less likely to incorrectly classify a negative instance as positive.
Overall, the DGCF model is a promising new approach for drug discovery.
It is more accurate than the older approaches, and it is particularly good at
avoiding false positives. This makes it a valuable tool for identifying potential
drug-food interactions (Table 1).
takes a step further by integrating information from the EB-GCN and CE lay-
ers. While NGCF primarily captures similarity signals based on user behavior
towards items, DGCF captures a broader range of signals, particularly the com-
munity behavioral signal, which reflects the influence of sub-communities. The
results presented in Table 2 demonstrate how the inclusion of the community
detection step enhances overall performance and validates our hypothesis regard-
ing the advantages of incorporating contextual and topological information.
Table 2. The overall comparison of quality metrics in detecting communities for the
DFI dataset.
4 Conclusion
In this paper, we proposed a novel workflow that uses the community profile concept to infer and identify novel drug-food interactions. The workflow was evaluated on the DrugBank dataset, and we showed that the DGCF model was able to predict new drug-food interactions better than the baseline models. We first prepare the DFI dataset, then we create the DFI networks, and then apply the DGCF model to predict the novel drug-food interactions.
We believe that the proposed workflow is a promising approach for identifying
novel drug-food interactions, and we plan to enhance the workflow in future work
by adding more features for the drugs and the food components. We also plan
to evaluate the workflow on a larger dataset of drug-food interactions.
References
1. Wishart, D., et al.: DrugBank 5.0: a major update to the DrugBank database for
2018. Nucleic Acids Research. 46, D1074–D1082 (2018). https://doi.org/10.1093/
nar/gkx1037
2. Wishart, D., et al.: FoodDB: a comprehensive food database for dietary studies,
research, and education. Nucleic Acids Res. 37, D618–D623 (2009). https://doi.
org/10.1093/nar/gkn815
3. Ryu, J., Kim, H., Lee, S.: Deep learning improves prediction of drug-drug and
drug-food interactions. Proc. Natl. Acad. Sci. 115, E4304–E4311 (2018)
4. Bourhim, S., Benhiba, L., Idrissi, M.: A community-driven deep collaborative app-
roach for recommender systems. IEEE Access. 10, 131144–131152 (2022)
5. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (2012)
6. Cao, S., Lu, W., Xu, Q.: GraRep: learning graph representations with global structural information. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 891–900 (2015)
7. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)
8. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234 (2016)
9. Kipf, T., Welling, M.: Variational graph auto-encoders. In: International Conference on Learning Representations (2017)
10. Ou, M., Cui, P., Pei, J., Zhang, Z., Zhu, W.: Asymmetric transitivity preserving graph embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1105–1114 (2016)
11. Wang, T., et al.: DFinder: a novel end-to-end graph embedding-based method to identify drug-food interactions. Bioinformatics 39, btac837 (2023)
12. Bourhim, S., Benhiba, L., Idrissi, M.: Towards a novel graph-based collaborative filtering approach for recommendation systems. In: Proceedings of the 12th International Conference on Intelligent Systems: Theories and Applications, pp. 1–6 (2018)
13. Bourhim, S., Benhiba, L., Idrissi, M.: Investigating algorithmic variations of an RS graph-based collaborative filtering approach. In: Proceedings of the ArabWIC 6th Annual International Conference Research Track, pp. 1–6 (2019)
L2G2G: A Scalable Local-to-Global
Network Embedding with Graph
Autoencoders
1 Introduction
Graph representation learning has been a core component in graph-based real-world applications; for an introduction see [13]. As graphs have become ubiquitous in a wide array of applications, low-dimensional representations are needed to tackle the curse of dimensionality inherent in the graph structure. In prac-
tice, low-dimensional node embeddings are used as efficient representations to
address tasks such as graph clustering [25], node classification [2], and link pre-
diction [27], or to protect private data in federated learning settings [14,20].
Fig. 1. L2G2G pipeline for two patches. The two patches are in blue and yellow, the
overlapping nodes between them in green. Separate node embeddings for each patch are
obtained via a single GCN. The decoder aligns the embeddings using the Local2Global
synchronisation algorithm to yield a global embedding and then uses a standard sigmoid
function. The GCN is then iteratively optimised using the training loss.
2 Preliminaries
Notations: An undirected attributed graph G = (V, E, X) consists of a set of
nodes V of size N , a set of unweighted, undirected edges E of size M , and a
N × F matrix X of real-valued node attributes (features). The edge set is also
represented by the N × N adjacency matrix A. Moreover, based on the L2G
framework, we define a patch P to be a subgraph of G which is induced by a
subset of the node set V; hence a patch Pi with the feature matrix corresponding to its nodes is denoted as (V^(i), E^(i), X^(i)). Node embeddings are denoted as an N × e matrix Z, where e is the embedding size, and σ denotes the sigmoid function.
each pair of overlapping patches (Pi, Pj) ∈ Ep by Rij = Mij (Mij^T Mij)^{-1/2}. Next we build R̃ij = wij Rij / Σ_j |V(Pi) ∩ V(Pj)| to approximately solve the eigenproblem S = R̃S, obtaining Ŝ = [Ŝ1, ..., Ŝk]. We also find a translation matrix T̂ = [T̂1, ..., T̂k] by solving T̂ = argmin_{T ∈ R^{k×F}} ||BT − C||_2^2, where B ∈ {−1, 1}^{|Ep|×k} is the incidence matrix of the patch graph with entries B_{(Pi,Pj),t} = δ_{it} − δ_{jt}, δ is the Kronecker delta, and C ∈ R^{|Ep|×F} has entries C_{(Pi,Pj)} = Σ_{t ∈ Pi ∩ Pj} (Ẑ_t^(i) − Ẑ_t^(j)) / |Pi ∩ Pj|. This solution yields the estimated coordinates of all the nodes up to a global rigid transformation. Next, we apply the appropriate rotation transform to each patch individually, Ẑ^(j) = Z^(j) Ŝ_j^T, then apply the corresponding translation to each patch (hence performing translation synchronisation), and finally average to obtain the final aligned node embedding Z̄_i = Σ_j (Ẑ_i^(j) + T̂_j) / |{j : i ∈ Pj}|.
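A minimal numpy sketch of the last two steps, i.e. applying the estimated per-patch rotations Ŝj and translations T̂j and then averaging overlapping nodes; the eigenvector and least-squares synchronisation steps that estimate Ŝ and T̂ are omitted, and the toy patches are invented:

```python
import numpy as np

def align_patches(patch_nodes, patch_embs, rotations, translations, n_nodes, dim):
    """Apply per-patch rotation S_j and translation T_j, then average each
    node's embedding over all patches containing it."""
    out = np.zeros((n_nodes, dim))
    counts = np.zeros(n_nodes)
    for nodes, Z, S, T in zip(patch_nodes, patch_embs, rotations, translations):
        aligned = Z @ S.T + T              # Ẑ^(j) = Z^(j) S_j^T, then translate
        out[nodes] += aligned
        counts[nodes] += 1
    return out / counts[:, None]

# Toy example: 5 nodes, 2 overlapping patches, 2-dimensional embeddings.
patch_nodes = [np.array([0, 1, 2, 3]), np.array([2, 3, 4])]
patch_embs = [np.random.randn(4, 2), np.random.randn(3, 2)]
rotations = [np.eye(2), np.array([[0.0, -1.0], [1.0, 0.0]])]
translations = [np.zeros(2), np.array([1.0, 0.5])]
Z_global = align_patches(patch_nodes, patch_embs, rotations, translations, 5, 2)
```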
3 Methodology
dimensions of the hidden layers in the GCN are all F. Then, the complexity of an L-layer GCN scales like O(LNF^2 + LMF) and that of the inner product decoder scales like O(N^2 F). Thus, as shown in [7], for T epochs the time complexity of the decoder and the encoder of a GAE scales like O(T(LNF^2 + LMF + N^2 F)). In contrast, as stated in [22], the per-epoch complexity of FastGAE with a √N down-sampling size is O(LNF^2 + LMF + NF), and hence for T epochs the FastGAE complexity scales like O(T(LNF^2 + LMF + NF)).
To simplify the complexity analysis of both Local2Global approaches we assume that the overlap size of two overlapping patches in the patch graph is fixed to d ~ F. Following [15], finding the rotation matrix S scales like O(|Ep| d F^2) = O(|Ep| F^3). The translation problem can be solved by a t-iteration solver with a complexity per iteration of O(|Ep| F), where t is fixed. To align the local embeddings, one has to perform matrix multiplications, which requires O(Nj F^2) computations, where Nj is the number of nodes in the j-th patch. The complexity of finding the rotation matrix (O(|Ep| F^3)) dominates the complexity of computing the translation (O(|Ep| F)). Thus, the complexity of the L2G algorithm with k patches is O(|Ep| F^3 + F^2 Σ_{j=1}^k Nj).
The GAE+L2G algorithm uses a GAE for every patch, and for the j-th patch, for T training epochs the GAE scales like O(T(L Nj F^2 + L Mj F + Nj^2 F)), with Mj the number of edges in the j-th patch. Summing over all patches and ignoring the overlap between patches as a lower-order term, so that Σ_j Nj = O(N), Σ_j Nj^2 ≈ N^2/k, and Σ_j Mj ≈ M, the GAE+L2G algorithm scales like O(TF(LNF + LM + N^2/k) + kF^3). For the complexity of L2G2G, as L2G2G aligns the local embeddings in each epoch rather than after training, we replace kF^3 + NF^2 with T(kF^3 + NF^2), and thus the algorithm scales like O(T(LNF^2 + LMF + (N^2/k)F + kF^3)). In the PyTorch implementation of FastGAE, the reconstruction error is approximated by creating the induced subgraph from sampling √N nodes proportional to degree, with an expected number of at least O(M/N) edges between them. Then, the computation of the decoder is (at least) O(M/N) instead of O(N^2). Table 1 summarises the complexity results.
Thus, in the standard case, increasing the number of patches k reduces the complexity of the computation of the GAE decoders. In the PyTorch implementation, if k scales linearly with N, the expression is linear in N. In contrast, when the number of nodes N is not very large, the number of features F becomes more prominent, so that the training speed may not necessarily increase with an increasing number of patches. Table 1 shows that L2G2G sacrifices O(TkF^3) training time to obtain better performance; with an increase in the number of patches, the training speed gap between L2G2G and GAE+L2G increases linearly.
4 Experimental Evaluation
Datasets. To measure the performance of our method, we compare the ability of L2G2G to learn node embeddings for graph reconstruction on the following benchmark datasets: Cora ML, Cora [3], Reddit [26], and Yelp [26].
In addition, we tested the performance of L2G2G on four synthetic data sets,
generated using a Stochastic Block Model (SBM) which assigns nodes to blocks;
edges are placed independently between nodes in a block with probability pin
and between blocks with probability pout [17]. We encode the block membership
as node features; with L blocks, v being in block l is encoded as unit vector
el ∈ {0, 1}L . To test the performance across multiple scales we fix the number
of blocks at 100, and vary the block size, pin and pout , as follows:
1. ‘SBM-Small’ with block sizes 102 and (pin , pout ) = (0.02, 10−4 ),
2. ‘SBM-Large-Sparse’ with block sizes 103 and (pin , pout ) = (10−3 , 10−4 ),
3. ‘SBM-Large’ with blocks of sizes 103 and (pin , pout ) = (0.02, 10−4 ),
4. ‘SBM-Large-Dense’ with block sizes 103 and (pin , pout ) = (0.1, 0.002).
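A sketch of generating one such SBM graph with networkx, with the block membership encoded as one-hot node features as described above; the sizes in the example call are scaled down from the paper's settings:

```python
import numpy as np
import networkx as nx

def make_sbm(n_blocks=100, block_size=100, p_in=0.02, p_out=1e-4, seed=0):
    """Stochastic block model with one-hot block membership as node features."""
    sizes = [block_size] * n_blocks
    probs = np.full((n_blocks, n_blocks), p_out)
    np.fill_diagonal(probs, p_in)
    g = nx.stochastic_block_model(sizes, probs.tolist(), seed=seed)
    features = np.zeros((n_blocks * block_size, n_blocks))
    for node, data in g.nodes(data=True):
        features[node, data["block"]] = 1.0       # unit vector e_l for block l
    return g, features

g, X = make_sbm(n_blocks=10, block_size=100)      # scaled-down 'SBM-Small'-like graph
print(g.number_of_nodes(), g.number_of_edges(), X.shape)
```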
Table 2 gives some summary statistics of these real and synthetic data sets.
Table 2. Network data statistics: N = no. nodes, M = no. edges, F = no. features
Table 3. Experiments on different data sets with patch size 10. Bold: the best among
the fast methods, underlined: the model outperforms the GAE.
that FastGAE while achieving much better performance. In almost all cases,
L2G2G is faster than the standard GAE, except for the two smaller datasets. Its
training time is around an order of magnitude smaller per epoch for the larger
models. As an aside, GAEs suffer from memory issues as they need to store
very large matrices during the decoding step.
Fig. 2. Training time of the baseline models (GAE, FastGAE and GAE+L2G) and
L2G2G on benchmark data sets (excluding partitioning time). Note that the y-axis is
on a log-scale, and thus the faster methods are at least an order of magnitude faster.
Fig. 3. Lineplots of the ROC score and accuracy of L2G2G and GAE+L2G, trained on
each dataset, with different patch sizes. For each subplot, the blue lines represent the
metrics for L2G2G, while the orange ones represent those for GAE+L2G. The shadows
in each subplot indicate the standard deviations of each metric.
Ablation Study. Here we vary the number of patches, ranging from 2 to 10. Figure 3 shows how the performance changes with different numbers of patches for each model on each data set. When the number of patches increases, the performance of L2G2G decreases less than that of GAE+L2G. This shows that updating the node embeddings dynamically during training and keeping the local information with the agglomerating loss actually brings stability to L2G2G.
Moreover, we have explored the behaviour of training time for L2G2G when
patch size increases from 2 to 30, on both a small (Cora) and a large (Yelp)
dataset. Figure 4 shows that on the small-scale data set Cora, the gap in train-
ing speed between L2G2G and GAE+L2G remains almost unchanged, while on
Yelp, the gap between L2G2G and GAE+L2G becomes smaller. However, the
construction of the overlapping patches in the Local2Global library can create
patches that are much larger than N/k, potentially resulting in a large number
of nodes in each patch. Hence, the training time in our tests increases with the
number of patches.
Fig. 4. (Panels: CPU and GPU training time for Cora and Yelp.)
Since all the computations in the Local2Global library built by [15] are carried out on the CPU, the GPU training can be slowed down by the memory swap between CPU and GPU. Thus, to further explore the behaviour of our algorithm when the number of patches increases, we ran the test on both CPU and GPU. The results are shown in Fig. 4. This plot illustrates that the GPU training
time of L2G2G increases moderately with increasing patch size, mimicking the
behaviour of GAE+L2G. In contrast, the CPU training time for the smaller data
set (Cora) decreases with increasing patch size. The larger but much sparser Yelp
data set may not lend itself naturally to a partition into overlapping patches.
Summarising, L2G2G performs better than the baseline models across most
settings, while sacrificing a tolerable amount of training speed.
5 Conclusion
In this paper, we have introduced L2G2G, a fast yet accurate method for obtaining node embeddings for large-scale networks. In our experiments, L2G2G outperforms FastGAE and GAE+L2G, while the amount of training speed sacrificed is tolerable. We also find that L2G2G is not as sensitive to changes in the patch size as GAE+L2G.
Future work will investigate embedding the synchronization step in the net-
work instead of performing the Local2Global algorithm to align the local embed-
dings. This change would potentially avoid matrix inversion, speeding up the
calculations. We shall also investigate the performance on stochastic block mod-
els with more heterogeneity. To improve accuracy, one could add a small num-
ber of between–patch losses into the L2G2G loss function, to account for edges
which do not fall within a patch. The additional complexity of this change would
be relatively limited when restricting the number of between–patches included.
Additionally, the Local2Global library from [16] is implemented on CPU, losing
speed due to moving memory between the CPU and the GPU.
References
1. Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. In: Guyon,
I., Dror, G., Lemaire, V., Taylor, G., Silver, D., (eds.) Proceedings of ICML Work-
shop on Unsupervised and Transfer Learning, Proceedings of Machine Learning
Research, vol. 27, pp. 37–49. PMLR, Bellevue, Washington, USA (2012)
2. Bayer, A., Chowdhury, A., Segarra, S.: Label propagation across graphs: node clas-
sification using graph neural tangent kernels. In: ICASSP 2022-2022 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.
5483–5487. IEEE (2022)
3. Bojchevski, A., Günnemann, S.: Deep gaussian embedding of graphs: unsupervised
inductive learning via ranking. In: International Conference on Learning Represen-
tations (2018). https://openreview.net/forum?id=r1ZdKJ-0W
4. Bojchevski, A., et al.: Scaling graph neural networks with approximate PageRank.
In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. ACM (2020)
5. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally
connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
6. Chen, J., Ma, T., Xiao, C.: FastGCN: fast learning with graph convolutional net-
works via importance sampling. arXiv preprint arXiv:1801.10247 (2018)
7. Chen, M., Wei, Z., Ding, B., Li, Y., Yuan, Y., Du, X., Wen, J.: Scalable graph
neural networks via bidirectional propagation. CoRR abs/2010.15421 (2020).
https://arxiv.org/abs/2010.15421
8. Chen, M., Wei, Z., Huang, Z., Ding, B., Li, Y.: Simple and deep graph convolu-
tional networks. In: International Conference on Machine Learning, pp. 1725–1735.
PMLR (2020)
9. Chiang, W.L., Liu, X., Si, S., Li, Y., Bengio, S., Hsieh, C.J.: Cluster-GCN. In:
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. ACM (2019)
10. Cucuringu, M., Lipman, Y., Singer, A.: Sensor network localization by eigenvector
synchronization over the Euclidean group. ACM Trans. Sen. Netw. 8(3), 1–42
(2012)
11. Cucuringu, M., Singer, A., Cowburn, D.: Eigenvector synchronization, graph rigid-
ity and the molecule problem. Inf. Infer. 1(1), 21–67 (2012)
12. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large
graphs. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
13. Hamilton, W.L.: Graph Representation Learning. Morgan & Claypool Publishers
(2020)
14. He, C., et al.: FedGraphNN: a federated learning system and benchmark for
graph neural networks. CoRR abs/2104.07145 (2021). https://arxiv.org/abs/
2104.07145
15. Jeub, L.G., Colavizza, G., Dong, X., Bazzi, M., Cucuringu, M.: Local2Global: a
distributed approach for scaling representation learning on graphs. Mach. Learn.
112(5), 1663–1692 (2023)
16. Jeub, L.G.S.: Local2Global github package. Github (2021). https://github.com/
LJeub/Local2Global
17. Karrer, B., Newman, M.E.J.: Stochastic blockmodels and community structure in
networks. Phys. Rev. E 83(1), 016107 (2011)
18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR
(Poster) (2015)
19. Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv preprint
arXiv:1611.07308 (2016)
20. Pan, Q., Zhu, Y.: FedWalk: communication efficient federated unsupervised node
embedding with differential privacy. arXiv preprint arXiv:2205.15896 (2022)
21. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk. In: Proceedings of the 20th ACM
SIGKDD international conference on Knowledge discovery and data mining. ACM
(2014)
22. Salha, G., Hennequin, R., Remy, J.B., Moussallam, M., Vazirgiannis, M.: FastGAE:
scalable graph autoencoders with stochastic subgraph decoding. Neural Netw. 142,
1–19 (2021)
23. Simonovsky, M., Komodakis, N.: GraphVAE: towards generation of small graphs
using variational autoencoders. In: Kůrková, V., Manolopoulos, Y., Hammer, B.,
Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018. LNCS, vol. 11139, pp. 412–422.
Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01418-6 41
24. Tang, L., Liu, H.: Leveraging social media networks for classification. Data Min.
Knowl. Discov. 23(3), 447–478 (2011)
25. Tsitsulin, A., Palowitch, J., Perozzi, B., Müller, E.: Graph clustering with graph
neural networks. arXiv preprint arXiv:2006.16904 (2020)
26. Zeng, H., Zhou, H., Srivastava, A., Kannan, R., Prasanna, V.: GraphSAINT: graph
sampling based inductive learning method. In: International Conference on Learn-
ing Representations (2020)
27. Zhang, M., Chen, Y.: Link prediction based on graph neural networks. In: Advances
in Neural Information Processing Systems, vol. 31 (2018)
28. Zhang, S., Tong, H., Xu, J., Maciejewski, R.: Graph convolutional networks: a
comprehensive review. Comput. Soc. Netw. 6(1), 1–23 (2019)
29. Zou, D., Hu, Z., Wang, Y., Jiang, S., Sun, Y., Gu, Q.: Layer-dependent importance
sampling for training deep and large graph convolutional networks. In: Advances
in neural information processing systems, vol. 32 (2019)
A Comparative Study of Knowledge
Graph-to-Text Generation Architectures
in the Context of Conversational Agents
Abstract. This work delves into the dynamic landscape of Knowledge Graph-to-
text generation, where structured knowledge graphs are transformed into coher-
ent natural language text. Three key architectural paradigms are explored: Graph
Neural Networks (GNNs), Graph Transformers (GTs), and linearization with
sequence-to-sequence models. We discuss the advantages and limitations of these architectures and report experiments with each of them. Performance
evaluations on WebNLG V.2 demonstrate the superiority of sequence-to-sequence
Transformer-based models, especially when enriched with structural information
from the graph. Despite being unsupervised, the CycleGT model also outperforms
GNNs and GTs. However, practical constraints, such as computational efficiency
and model validity, make sequence-to-sequence models the preferred choice for
real-time conversational agents. Future research directions include enhancing the
efficiency of GNNs and GTs, addressing scalability issues, handling multimodal
knowledge graphs, improving interpretability, and devising data labeling strate-
gies for domain-specific models. Cross-lingual and multilingual extensions can
further broaden the applicability of these models in diverse linguistic contexts. In
conclusion, the choice of architecture should align with specific task requirements
and application constraints, and the field offers promising prospects for continued
innovation and refinement.
1 Introduction
architectural options. These conversational agents, rooted in KG, leverage the organized
information within a knowledge graph to craft responses that closely mimic human lan-
guage. These agents can access and employ structured information during conversations
by utilizing the richly interconnected representation of entities and their relationships
within the knowledge graph, resulting in more accurate and comprehensive user interac-
tions. Incorporating knowledge graphs in conversational agents significantly augments
their capabilities, making interactions more informative and beneficial.
In natural language processing and knowledge representation, the survey on graph-
to-text generation architectures assumes a vital role. This review guides researchers,
practitioners, and decision-makers through this rapidly evolving landscape by present-
ing a comprehensive overview of current techniques. Its insights into emerging trends,
approach strengths, and limitations facilitate the design of effective systems for trans-
forming structured data into coherent human-readable narratives. Furthermore, the sur-
vey encourages collaboration, knowledge sharing, and innovation within the field, thus
advancing graph-to-text generation and bridging the gap between structured knowledge
and natural language expression.
Recently, neural approaches have demonstrated remarkable performance, surpass-
ing traditional methods in achieving linguistic coherence. However, challenges persist in
maintaining semantic consistency, particularly with long texts [41]. The inherent com-
plexity of neural approaches also poses limitations, as they need more parameterization
and control over the structure of the generated text. Consequently, while current neural
approaches tend to lag template-based methods regarding semantic consistency [42],
they outperform them significantly in terms of linguistic coherence. This distinction can
be attributed to the ability of large language models (LLM) to capture specific syntactic
and semantic properties of the language. Despite the advantages neural approaches offer
regarding linguistic consistency, their performance in maintaining semantic consistency
is still a work in progress.
Graph-to-text (G2T) generation is a natural language processing (NLP) task that
involves transforming structured data from a graph format into human-readable text.
This task converts a knowledge graph (structured data representation where entities are
nodes and relationships between entities are edges) into coherent sentences or paragraphs
in a natural language. Generating text from graphs necessitates sophisticated methods in
graph processing for extracting pertinent information and in natural language generation
to produce coherent and contextually fitting text. This is a demanding yet valuable
endeavor within the larger framework of content creation and communication driven by
data.
In natural language generation (NLG), two criteria [13] are used to assess the
quality of the produced answers. The first criterion is semantic consistency (Semantic
Fidelity), which quantifies the fidelity of the data produced against the input data. The
most common indicators are 1/ Hallucination: It is manifested by the presence of infor-
mation (facts) in the generated text that is not present in the input data; 2/ Omission: It is
exemplified by the omission of one of the pieces of information (facts) in the generated
text; 3/ Redundancy: This is manifested by the repetition of information in the generated
text; 4/ Accuracy: The lack of accuracy is manifested by the modification of information
such as the inversion of the subject and the direct object complement in the generated
A Comparative Study of Knowledge Graph-to-Text Generation Architectures 415
text; 5/ Ordering: It occurs when the sequence of information is different from the input
data. The second criterion is linguistic coherence (Output Fluency) to evaluate the flu-
idity of the text and the linguistic constructions of the generated text, the segmentation
of the text into different sentences, the use of anaphoric pronouns to reference entities
and to have linguistically correct Sentences.
Our objective is to delve into the intricacies of deep neural network architectures
that harness the power of graphs, aiming to gain a comprehensive understanding of their
inherent strengths and limitations for optimal utilization in the context of conversational
agents.
This paper follows a structured progression to delve into the knowledge graph-
to-text (KG2T) generation landscape. Beginning with an in-depth review of advanced
KG2T approaches in Sect. 2, the paper examines the architectures and innovations in
Sect. 3. Section 4 explores the empirical aspects, encompassing model performance
assessment, datasets, metrics, and experiments. Section 5 then encapsulates the findings
and discussions, presenting the culmination of results. Finally, the paper concludes in
Sect. 6 by critically evaluating the implications of the discussed techniques within the
context of conversational agents.
2 Background
The goal of KG-to-text generation is to create comprehensible sentences in natural lan-
guage based on knowledge graphs (KGs), all while upholding semantic coherence with
the KG triplets (Fig. 1). The term “knowledge graph” has been in existence since 1972,
but its current definition can be attributed to Google’s introduction of its Knowledge
Graph in 2012 [5]. This marked the beginning of a trend, with numerous companies
like Airbnb, Amazon, eBay, Facebook, IBM, LinkedIn, Microsoft, and Uber also mak-
ing similar announcements. This collective push has led to the wide adoption of knowledge graphs across various industries [21]. Consequently, academic research in this domain
has experienced a notable upsurge in recent years, resulting in many scholarly publi-
cations focused on knowledge graphs [21]. These graphs employ a data model based
on graphs to efficiently manage, integrate, and extract valuable insights from large and
diverse datasets [38].
The problem is formulated as follows. Given an input KG G composed of {(h1, r1, t1), . . . , (hn, rn, tn) | h∗, t∗ ∈ E, r∗ ∈ R}, where E denotes the entity set and R the relation set, the objective of the KG-to-text generation task is to produce a coherent and logically sound sequence of text T = <t1, t2, . . . , tn>, tk ∈ V, where V denotes the vocabulary from which the n output tokens are drawn.
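The sketch below illustrates this formulation on a toy instance; the triples and the reference verbalization are hypothetical and are not taken from the paper's datasets.

```python
# Hypothetical example of the KG-to-text setting: the input graph G is a set of
# (head, relation, tail) triples over entities E and relations R, and the target
# is a fluent text sequence T that verbalizes all of the facts.
kg_triples = [
    ("Alan_Turing", "birthPlace", "London"),
    ("Alan_Turing", "field", "Computer_Science"),
    ("London", "country", "United_Kingdom"),
]

# One acceptable reference text covering the three facts.
reference_text = (
    "Alan Turing, who worked in computer science, "
    "was born in London, a city in the United Kingdom."
)

entities = {e for h, _, t in kg_triples for e in (h, t)}
relations = {r for _, r, _ in kg_triples}
print(f"|E| = {len(entities)}, |R| = {len(relations)}, triples = {len(kg_triples)}")
```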
Unlike the conventional text generation task (Seq2Seq), generating text from a
knowledge graph adds the extra challenge of ensuring the accuracy of the words within
the generated sentences. The current methods can be classified into three distinct categories, as illustrated in Fig. 1; we delve into these categories in more detail in Sect. 3:
1. Linearization with Sequence-to-Sequence (Seq2Seq): This involves converting the
graph G into a sequence Glinear = <g1 , g2 , …, gm > consisting of m input tokens for
the sequence-to-sequence model.
2. Graph Neural Networks (GNNs) [47]: These models encode the topological struc-
tures of a graph and learn entity representations by aggregating the features of entities
and their neighbors. GNNs are not standalone; they require a decoder to complete the
encoder-decoder architecture.
3. Graph Transformer (GT): This is an enhanced version of the original transformer
[52] model adapted to handle graph data.
Graph-to-Text (G2T) leverages graph embedding techniques and Pre-trained Lan-
guage Models (PLMs). Graph embeddings and PLMs are essential for distinct reasons,
playing crucial roles in G2T tasks. Graph embeddings allow us to capture subtle relation-
ships between entities and properties in a numerical format, facilitating the manipulation
of this data and the creation of generative models. Generating text and establishing align-
ments between source entities/relationships and target tokens is challenging for standard
language models due to limited parallel graph-text data availability. The following two
sections deal with these topics.
Fig. 1. The architecture of KG-to-text generation with the three categories of representation: a)
Linearization + Seq2Seq, b) GNNs with decoder (e.g., LSTM), and c) Graph Transformer (GT)
Linearization converts the graph into a sequence suitable for algorithms, tasks, or models requiring linear input, such as many methods in NLP. One approach involves linearizing the knowledge graph (KG) [15, 45] and using PLMs like GPT, BART, or T5 for seq2seq generation. PLMs can generalize to downstream NLG tasks [27], but most were trained on plain text [26, 43] rather than structured input.
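A minimal sketch of this linearization step is shown below. The <H>/<R>/<T> marker scheme is one common convention and not necessarily the exact format used by the cited systems; the resulting string would then be tokenized and fed to a seq2seq PLM such as T5 or BART.

```python
# Flatten a knowledge graph into a token sequence G_linear = <g_1, ..., g_m>
# that a sequence-to-sequence PLM can consume.
def linearize(triples):
    parts = []
    for head, relation, tail in triples:
        parts += ["<H>", head.replace("_", " "),
                  "<R>", relation,
                  "<T>", tail.replace("_", " ")]
    return " ".join(parts)

kg_triples = [
    ("Alan_Turing", "birthPlace", "London"),
    ("London", "country", "United_Kingdom"),
]
print(linearize(kg_triples))
# <H> Alan Turing <R> birthPlace <T> London <H> London <R> country <T> United Kingdom
# The fine-tuned PLM is then trained to map this string to the reference text.
```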
Heuristic search algorithms like breadth-first search (BFS) or predefined rules
are standard for graph linearization. However, these methods often pay insufficient attention to structural information during KG encoding, as they do not explicitly consider relationships between input entities.
Researchers such as [19, 36, 59] introduce different neural planners to determine the order of the input triples before linearization. [36] employ data-driven scoring for precise plans,
considering recommendations and user rules. Others use GCN-based neural planners
[59] (Graph Convolution Network), reordering graph nodes for sequential content plans
encoded by LSTM-based sequential encoders. Plan-and-pretrain techniques introduced
by [19] leverage text planners based on relational graph convolutional networks (R-
GCN) [59] and pretrained T5 Seq2Seq models. [23] proposed a joint graph-text learning
framework called JointGT.
To address the issue when the input is a sequence of RDF triplets, [10] introduces an
encoder model called GTR-LSTM. This model maintains the structure of RDF triplets
within a small knowledge graph, enabling it to capture relationships both within individ-
ual triplets (intra-triple relations) and between interconnected triplets (inter-triple rela-
tions). This approach improves sentence generation accuracy. Unlike TreeLSTM [51],
which lacks cycle handling, GTR-LSTM handles cycles using a combination of topolog-
ical sorting and breadth-first traversal. An attention model is employed to gather compre-
hensive global information from the knowledge graph. Additionally, unlike GraphLSTM
[29], which only supports specific entity relations, all relations are incorporated into the
calculation of hidden states [10].
Although seq2seq methods strive to retain as much of the graph's topology as possible, Transformer-based seq2seq models come with significant costs, particularly during the pretraining phase. Additionally, the computational expense of linearization can become substantial when dealing with extensive knowledge graphs.
Therefore, to more effectively maintain the graph’s topology, the introduction of Graph
Neural Networks (GNNs) has been proposed, and their details will be explored in the
following section.
Various approaches employ different versions of Graph Neural Network (GNN) archi-
tectures for processing graph-structured data. GNNs, including Graph Convolutional
Networks (GCNs) [24], extended forms like Syn-GCNs [34] and DCGCNs [20], Graph
Attention Networks (GATs) [54], and Gated Graph Neural Networks (GGNNs) [6, 7,
28], are well-suited for modeling entity relationships within knowledge graphs to gen-
erate text. GNNs have demonstrated promise in knowledge graph-to-text generation by
effectively representing relationships between entities in a knowledge graph.
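The sketch below illustrates the basic aggregation idea shared by these GNN variants; it is a generic, self-contained example and not the architecture of any of the cited systems, which differ mainly in how neighbor features are aggregated (convolution, attention, gating).

```python
import torch
import torch.nn as nn

# Minimal GCN-style entity encoder: each entity embedding is updated by averaging
# its neighbors' features and applying a learned linear map. A decoder (e.g. an
# LSTM or Transformer) would then attend over these entity states to generate text.
class SimpleGraphEncoder(nn.Module):
    def __init__(self, num_entities, dim=64, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(num_entities, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, adj):
        # adj: (num_entities, num_entities) adjacency matrix with self-loops.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = self.embed.weight
        for layer in self.layers:
            h = torch.relu(layer(adj @ h / deg))  # mean-aggregate neighbors, then transform
        return h                                  # one contextualized vector per entity

adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])                # tiny 3-entity graph with self-loops
encoder = SimpleGraphEncoder(num_entities=3)
print(encoder(adj).shape)                         # torch.Size([3, 64])
```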
4.2 Experiments
We empirically assessed several approaches executed on our local machines and eval-
uated them using the same test dataset. Our evaluation encompassed various types of
systems, including graph linearization [19, 23, 45], a model employing Graph Neural
Networks (GNNs) [46], and another using Graph Transformers (GTs) [48]. Addition-
ally, we implemented an unsupervised cycling model [17]. While preserving most of the
Table 2 shows that all models using seq2seq Transformer-based architectures obtain better results than the GNN and GT models, and by a considerable margin when the models include additional processing before linearization to retain information about the graph structure, as in JointGT [23] and P2 [19]. Even CycleGT, an unsupervised model that relies on a linearization phase over the graph, obtains better results than the GNN and GT models.
As mentioned before, we evaluate our models on WebNLG V.2 [49], which has 1600 test instances (source KGs), except for P2 and CycleGT, which are evaluated on the enriched version of WebNLG V.2 [49] containing 1860 test instances.
In this paper, we have explored and compared three distinct architectures for Knowledge
Graph-to-text generation: Graph Neural Networks (GNNs), Graph Transformers (GTs),
and linearization with seq2seq models, primarily Pre-trained Language Models (PLMs).
Each of these architectures has its unique advantages and limitations, making them
suitable for different scenarios and use cases.
GNNs offer a flexible and scalable approach to modeling graph structures and relationships, but they may struggle with efficiency when handling large and complex knowledge graphs. GTs, on the other hand, provide a specialized solution for graph-based
tasks, offering a more direct and efficient way to process graph structures. However,
they may require extensive training data and computational resources.
Linearization with seq2seq models, especially those based on PLMs, simplifies
the process by converting knowledge graphs into linear sequences and generating text
from them. Despite its simplicity, this approach can lose some structural information
during linearization. However, as our experiments show, seq2seq Transformer-based
models consistently outperformed GNNs and GTs, especially when models incorporate
additional processes to retain graph structure information like JointGT and P2.
In the context of conversational agents, where factors like response time and correct-
ness are critical, the inference time and resource requirements of GNNs and GTs can
be limiting. Hence, for practical deployment, seq2seq Transformer-based models stand
out as a more feasible choice, given their superior performance and efficiency.
Looking ahead, the field of Knowledge Graph-to-text generation presents several
avenues for advancement. Firstly, there is a pressing need to enhance the computational
efficiency of Graph Neural Networks (GNNs) and Graph Transformers (GTs) to make
them more suitable for real-time applications. This involves optimizing their architec-
tures, parallelizing computations, and harnessing hardware accelerators. Additionally,
as the scale and complexity of knowledge graphs continue to grow, developing strate-
gies to effectively handle large graphs while maintaining performance is a significant
challenge that warrants further exploration.
Moreover, extending these approaches to accommodate multimodal knowledge graphs, which integrate textual, visual, and other data types, could open up new horizons for comprehensive information retrieval and generation, especially with the use of multimodal frameworks such as Meta-Transformer [58]. Furthermore, ensuring the interpretability of GNNs and GTs is crucial for building trust in generated text, particularly in domains
like healthcare and law. Moreover, addressing the scarcity of labeled data for specific
knowledge domains through innovative data labeling and augmentation techniques can
enhance the training of domain-specific models. Lastly, the advancement of these mod-
els to handle multiple languages and facilitate cross-lingual knowledge transfer holds
promise for their broader applicability in diverse linguistic contexts.
In conclusion, the choice of architecture for Knowledge Graph-to-text generation
should be guided by the specific requirements of the task and the constraints of the
application. While GNNs and GTs offer valuable approaches for particular scenarios,
the efficiency and performance of seq2seq Transformer-based models make them a
compelling choice for many real-world applications. Future research should address the
challenges and opportunities presented by these diverse architectures to advance the field
further.
Acknowledgement. The authors thank the French company DAVI (Davi The Humanizers,
Puteaux, France) for their support and the French government for the plan France Relance funding.
References
1. Balažević, I., Allen, C., Hospedales, T.M.: TuckER: tensor factorization for knowledge graph
completion. arXiv preprint arXiv:1901.09590 (2019)
2. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved
correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and
Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005)
3. Bordes, A., et al.: Translating embeddings for modeling multi-relational data. In: Advances
in Neural Information Processing Systems, vol. 26 (2013)
4. Cai, D., Lam, W.: Graph transformer for graph-to-sequence learning. In: Proceedings of the
AAAI Conference on Artificial Intelligence, vol. 34, no. 05 (2020)
5. Chaudhri, V., et al.: Knowledge graphs: introduction, history and perspectives. AI Mag. 43(1),
17–29 (2022)
6. Chen, Y., Wu, L., Zaki, M.J.: Reinforcement learning based graph-to-sequence model for
natural question generation. arXiv preprint arXiv:1908.04942 (2019)
7. Chen, Y., Wu, L., Zaki, M.J.: Toward subgraph-guided knowledge graph question generation
with graph neural networks. IEEE Trans. Neural Netw. Learn. Syst., 1–12 (2023)
8. Chen, W., et al.: KGPT: knowledge-grounded pre-training for data-to-text generation. arXiv
preprint arXiv:2010.02307 (2020)
9. Dai, Y., et al.: A survey on knowledge graph embedding: approaches, applications and
benchmarks. Electronics 9(5), 750 (2020)
10. Distiawan, B., et al.: GTR-LSTM: a triple encoder for sentence generation from RDF data.
In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers) (2018)
11. Duvenaud, D.K., et al.: Convolutional networks on graphs for learning molecular fingerprints.
In: Advances in Neural Information Processing Systems, vol. 28 (2015)
12. Ferreira, T.C., et al.: Enriching the WebNLG corpus. In: Proceedings of the 11th International
Conference on Natural Language Generation (2018)
13. Ferreira, T.C., et al.: Neural data-to-text generation: a comparison between pipeline and
end-to-end architectures. arXiv preprint arXiv:1908.09022 (2019)
14. Fu, Z., et al.: Partially-aligned data-to-text generation with distant supervision. arXiv preprint
arXiv:2010.01268 (2020)
15. Gardent, C., et al.: The WebNLG challenge: generating text from RDF data. In: Proceedings
of the 10th International Conference on Natural Language Generation (2017)
16. Gilmer, J., et al.: Neural message passing for quantum chemistry. In: International Conference
on Machine Learning. PMLR (2017)
17. Guo, Q., et al.: CycleGT: unsupervised graph-to-text and text-to-graph generation via cycle
training. arXiv preprint arXiv:2006.04702 (2020)
18. Guo, Q., et al.: Fork or fail: cycle-consistent training with many-to-one mappings. In:
International Conference on Artificial Intelligence and Statistics. PMLR (2021)
19. Guo, Q., et al.: P2: a plan-and-pretrain approach for knowledge graph-to-text generation:
a plan-and-pretrain approach for knowledge graph-to-text generation. In: Proceedings of
the 3rd International Workshop on Natural Language Generation from the Semantic Web
(WebNLG+) (2020)
20. Guo, Z., et al.: Densely connected graph convolutional networks for graph-to-sequence
learning. Trans. Assoc. Comput. Linguist. 7, 297–312 (2019)
21. Hogan, A., et al.: Knowledge graphs. ACM Comput. Surv. (Csur) 54(4), 1–37 (2021)
22. Joshi, C.: Transformers are graph neural networks. The Gradient 7 (2020)
23. Ke, P., et al.: JointGT: graph-text joint representation learning for text generation from
knowledge graphs. arXiv preprint arXiv:2106.10502 (2021)
24. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks.
arXiv preprint arXiv:1609.02907 (2016)
25. Koncel-Kedziorski, R., et al.: Text generation from knowledge graphs with graph transform-
ers. arXiv preprint arXiv:1904.02342 (2019)
26. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)
27. Li, J., et al.: Pretrained language models for text generation: A survey. arXiv preprint arXiv:
2201.05273 (2022)
28. Li, Y., et al.: Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015)
29. Liang, X., Shen, X., Feng, J., Lin, L., Yan, S.: Semantic object parsing with graph LSTM.
In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part I. LNCS, vol. 9905,
pp. 125–143. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_8
30. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. Text summarization
branches out (2004)
31. Liu, L., et al.: How to train your agent to read and write. In: Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 35, no. 15 (2021)
32. Liu, W., et al.: K-BERT: enabling language representation with knowledge graph. In:
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 03 (2020)
33. Mager, M., et al.: GPT-too: a language-model-first approach for AMR-to-text generation.
arXiv preprint arXiv:2005.09123 (2020)
34. Marcheggiani, D., Frolov, A., Titov, I.: A simple and accurate syntax-agnostic neural model
for dependency-based semantic role labeling. arXiv preprint arXiv:1701.02593 (2017)
35. Moon, S., et al.: OpenDialKG: explainable conversational reasoning with attention-based
walks over knowledge graphs. In: Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics (2019)
36. Moryossef, A., Goldberg, Y., Dagan, I.: Step-by-step: separating planning from realization in
neural data-to-text generation. arXiv preprint arXiv:1904.03396 (2019)
37. Nan, L., et al.: DART: Open-domain structured data record to text generation. arXiv preprint
arXiv:2007.02871 (2020)
38. Noy, N., et al.: Industry-scale Knowledge Graphs: Lessons and Challenges: Five diverse
technology companies show how it’s done. Queue 17(2), 48–75 (2019)
39. Papineni, K., et al.: BLEU: a method for automatic evaluation of machine translation. In:
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
(2002)
40. Peters, M.E., et al.: Knowledge enhanced contextual word representations. arXiv preprint
arXiv:1909.04164 (2019)
41. Puduppully, R., Dong, L., Lapata, M.: Data-to-text generation with content selection and
planning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33. no. 01
(2019)
42. Puzikov, Y., Gurevych, I.: E2E NLG challenge: neural models vs. templates. In: Proceedings
of the 11th International Conference on Natural Language Generation (2018)
43. Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8),
9 (2019)
44. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer.
J. Mach. Learn. Res. 21(1), 5485–5551 (2020)
45. Ribeiro, L.F.R., et al.: Investigating pretrained language models for graph-to-text generation.
arXiv preprint arXiv:2007.08426 (2020)
46. Ribeiro, L.F.R., et al.: Modeling global and local node contexts for text generation from
knowledge graphs. Trans. Assoc. Comput. Linguist. 8, 589–604 (2020)
47. Scarselli, F., et al.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80
(2008)
48. Schmitt, M., et al.: Modeling graph structure via relative position for text generation from
knowledge graphs. arXiv preprint arXiv:2006.09242 (2020)
49. Shimorina, A., Gardent, C.: Handling rare items in data-to-text generation. In: Proceedings
of the 11th International Conference on Natural Language Generation (2018)
50. Song, L., et al.: A graph-to-sequence model for AMR-to-text generation. arXiv preprint arXiv:
1805.02473 (2018)
51. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured
long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015)
52. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing
Systems, vol. 30 (2017)
53. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image descrip-
tion evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (2015)
54. Veličković, P., et al.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
55. Wahde, M., Virgolin, M.: Conversational agents: theory and applications. In: Handbook
on Computer Learning and Intelligence: Volume 2: Deep Learning, Intelligent Control and
Evolutionary Computation, pp. 497–544 (2022)
56. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on
hyperplanes. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence,
pp. 1112–1119. AAAI Press (2014)
57. Zhang, Z., et al.: ERNIE: enhanced language representation with informative entities. arXiv
preprint arXiv:1905.07129 (2019)
58. Zhang, Y., et al.: Meta-transformer: a unified framework for multimodal learning. arXiv
preprint arXiv:2307.10802 (2023)
59. Zhao, C., Walker, M., Chaturvedi, S.: Bridging the structural gap between encoding and
decoding for data-to-text generation. In: Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics (2020)
Network Embedding Based on DepDist
Contraction
1 Introduction
Second, the learned embedding space can effectively support network inference,
such as predicting unseen links or identifying important nodes [2].
In this paper, we present a new network embedding method based on calcu-
lating the distance between pairs of network nodes based on their structurally
non-symmetric relationship. We describe an iterative procedure that, using the
distance defined in this way, quickly reveals the community structure. We per-
form experiments with four well-known small networks and show the results of
the application of our method to dimension 2.
2 Related Work
Different models can be used to transform networks from the original network
space to the embedding one, working with different types of information or
addressing different goals. Commonly used models include matrix factorization,
where, e.g., Singular Value Decomposition is used due to its optimality for the
low-rank approximation of the adjacency matrix [15] and non-negative matrix
factorization is often used due to its advantages as an additive model [19].
Furthermore, random walk models, analogous to Word2Vector, are used to
generate random paths through the network. If a node is considered as a word,
the random path can be considered as a sentence, and the neighborhood of
the node can be identified using a measure of cooccurrence as in the case of
Word2Vector [12]. For example, the Node2Vec embedding method [6], similar
in design principle to the DeepWalk method [14], can be considered. However,
Node2Vec improves the random walk generation in DeepWalk and mirrors the
depth and breadth sampling properties to enhance the network embedding effect.
Finally, let us mention deep neural networks and their variants because they
are a suitable choice if we are looking for an efficient model for learning nonlinear
functions. Representative methods include SDNE [17] or, e.g., SiNE [18].
One of the important applications for network embedding is visualization of
a network in two-dimensional space. We can find a comparison of visualization
results with different embedding approaches in [10]. Classes of graph drawing
algorithms, including multi-level and dimensionality reduction-based techniques,
are described in detail in a review [5]. In network analysis fields, interpretation
and understanding of network structure may be based on calculating local or
global measures. Visual representation of network structure can help detect,
understand, and identify unexpected patterns or outliers in networks.
The layout and arrangement of nodes affect how the user perceives relation-
ships in the network. There is no one best way; the layout of a network depends
on which network features are important to us. These may be, for example, spe-
cific measures of centrality or important properties of nodes or edges. Criteria
for evaluating a network layout include the algorithm’s computational complex-
ity, the network’s size, the algorithm’s ability to follow certain layout rules or
aesthetics, clustering, etc.
Many of the methods used are based on the force-directed network paradigm,
a paradigm of modeling the network as a physical object system where nodes
attract and repel according to some force. Other network drawing algorithms
are methods using multilevel and dimensionality reduction-based techniques.
There are two approaches to force-directed layouts: those based on spring
embedding and those that solve optimization problems. A frequently used method of the first type is that of Fruchterman and Reingold [4], in which connected nodes attract each other while all other nodes, modeled as electrical charges, repel each other.
The second approach treats the layout problem as an optimization problem that minimizes an energy function designed with respect to the properties of the network being visualized. Important energy-based techniques are Noack's LinLog
[13] and ForceAtlas [8] layouts. Noack’s edge repulsion model removes the bias
of the node model towards attraction by ensuring that nodes that are strongly
attracting are also strongly repelling, similarly for nodes with weak attraction.
Therefore, nodes with a high degree are less likely to be clustered in the center
of the network, and it is able to show any underlying clustering structure in the
network. ForceAtlas is strongly associated with Noack’s LinLog. Its advantage is
that all nodes are subject to at least some repulsive force, and poorly connected
nodes are thus approximated by well-connected nodes, reducing visual clutter.
The forces in the algorithm vary between Noack’s edge repulsion model and the
Fruchterman and Reingold distributions.
Multilevel algorithms are one of the options that can be used to streamline
force-directed techniques. Their idea is to find a sequence of coarser representa-
tions of the network, optimize the drawing in the coarsest representation, and
propagate this distribution back to the original network. The coarser representa-
tions are created by composing connected nodes whose edges become the union
of the edges of all the nodes [7].
Other options for drawing networks are dimension reduction techniques,
including multidimensional scaling, linear dimension reduction [1], or spectral
graph drawing approaches. The challenge is to preserve the information in a
high-dimensional space and capture it in a lower-dimensional representation.
Most dimension reduction techniques used for network layout use the graph-
theoretical distance between nodes, [3], as the information to be preserved.
As mentioned above, the most common use cases of node embedding are visu-
alization, clustering, and link prediction. The problem of visualizing networks in
2D, with its long history, and network drawing algorithms are probably the most
well-known embedding techniques commonly used to visualize networks in 2D
space. Data-driven network layouts, such as spring embedding, are unsupervised
methods of arranging nodes based on their connectivity and are de facto dimen-
sionality reduction techniques. Despite their great potential, layouts are rarely the basis of systematic network visualization.
Therefore, node embedding offers a powerful new paradigm for network visu-
alization: because nodes are mapped to real-valued vectors, researchers can easily
leverage general techniques for visualizing high-dimensional data. For example,
node embedding can be combined with well-known techniques such as t-SNE
[11] to create 2D network visualizations [16] that can be useful for revealing
communities and other hidden structures.
where Γ(A, B) is the set of common neighbors of nodes A, B and N (A) is the
neighborhood (set of all neighbors) of node A.
If there is an edge between nodes A, B, then w(A, B) is the weight of this edge;
otherwise, w(A, B) = 0. Thus the dependency is non-zero if and only if the nodes A, B have an edge or at least one common neighbor. A dependency defined in this way is non-symmetric, so D(A, B) = D(B, A) does not hold in general. While the value of the numerator is the same in both directions of the dependency, the value of the denominator may differ. Therefore, the dependencies between nodes A and B may be substantially different.
Informally speaking, dependency is high if a node is significantly connected to
its neighbor through common neighbors compared to the rest of its neighbors.
This property provides information about the network’s community structure
since nodes in a community should have stronger dependencies with each other
than with nodes outside the community.
The unanswered question is what distance two network nodes should be at if we want to start from the exact dependencies. Two situations can arise: (1) nodes have zero dependencies, and thus have neither an edge nor a common neighbor, and (2) nodes have non-zero, potentially non-symmetric dependencies. In
the first case, we have no straightforward information to determine the distance.
In the second case, we have to convert the dependencies into the Euclidean space
that is, by definition, symmetric. For further considerations, let us start with a
simple interpretation of dependency, which can be described as a relation that
attracts two nodes together. To express this relation, let us define the mutual
dependency coefficient qS (A, B) as the product of the partial dependencies of
the nodes A, B with their arithmetic mean, i.e.:
qS(A, B) = D(A, B) · D(B, A) · (D(A, B) + D(B, A))/2        (3)
The coefficient q takes into account both dependencies and, thanks to the
average, information about their balance. The coefficient qS can be further used
to determine the symmetric distance between nodes A, B. An alternative is to
work with the non-symmetric distance and leave the determination of the sym-
metric distance to the iterative procedure described in Sect. 4. In this case, the
assumed distance between the nodes may be non-symmetric at the input. For
this case, let us define the coefficient qN (A, B):
4 DepDist Contraction
As mentioned above, the essence of network embedding is to find a network rep-
resentation in low-dimensional space in which the relationships between network
nodes are highly preserved. We next present an iterative procedure based on a
4.1 Algorithm
The first step of the algorithm to find the representation of the network in n-
dimensional space is to randomly distribute the points representing each network
node into a cube of dimension n with edge length a. Next, we set the value of
maxDepDist to be much smaller than the edge length a (so there is enough
space for contraction). We then iterate so that in one iteration, each node A moves in space toward some node B (we will return to the selection of node B later). The step length of the move depends on the distance between A and B and on their coefficient q(A, B). The iteration terminates, as we show later, either after a fixed number of steps or after the contraction stabilizes.
Let us consider a node A, a node B selected for it, their coefficient q(A, B), and
their expected DepDist(A, B). By one iteration step, we mean moving node A
to node B so that their distance approaches DepDist(A, B). Let û be a unit
vector in the direction of the vector B − A. The new position A′ of node A is:
1 Non-parallel Python implementation used for the experiments in this paper is at https://github.com/emanueldopater/DepDistContraction/tree/conference.
Based on this threshold distance for acceleration, we then define the acceleration coefficient accCoef to be equal to one for accDist(A, B) = ‖B − A‖:

accCoef(A, B) = 0.5 + 0.5 · ‖B − A‖ / accDist(A, B).        (8)

The acc function is then defined as:

acc(A, B) = q(A, B)^(1/accCoef(A, B)).        (9)
Thus, in general, pairs of nodes that are weakly dependent on each other are
farther apart than strongly dependent nodes, slowly converging to the expected
distances DepDist(A, B) and DepDist(B, A), respectively. Thus, for example,
two high-degree nodes that share a common edge but have very few common
neighbors compared to their other neighbors will hardly change position during
an iteration. This contrasts with, for example, nodes that are part of a large and almost disjoint clique, which have strong dependencies, and thus small distances to their neighbors, toward which they move very quickly.
More interesting is the situation when node A is strongly dependent on
node B, but the reverse is not true, i.e., when, for example, node B is a hub and
node A has degree 1. Using the non-symmetric alternative qN (A, B), node A will
approach node B very fast, and node B will slowly move towards node A.
p(A) = 1 − 1/(1 + k(A)),        (10)
where k(A) is the degree of node A; with complementary probability 1 − p(A),
a neighbor of the neighbors of A is then chosen at random. Thus, if a node has
a very high degree, its neighbor is chosen with near certainty, and conversely, if
a node has degree k(A) = 1, then its neighbor is chosen with probability 0.5.
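The following sketch implements the coefficients reconstructed above (Eqs. (3) and (8)–(10)); it assumes the dependency values D(A, B) and the threshold accDist(A, B) are supplied by the caller, and it does not reproduce the dependency formula itself or the exact move update.

```python
import numpy as np

def q_s(d_ab, d_ba):
    """Mutual dependency coefficient q_S(A, B), Eq. (3)."""
    return d_ab * d_ba * (d_ab + d_ba) / 2.0

def acc_coef(pos_a, pos_b, acc_dist):
    """Acceleration coefficient, Eq. (8); equals 1 when ||B - A|| == accDist(A, B)."""
    return 0.5 + 0.5 * np.linalg.norm(pos_b - pos_a) / acc_dist

def acc(q, pos_a, pos_b, acc_dist):
    """Accelerated coefficient acc(A, B) = q(A, B)^(1 / accCoef(A, B)), Eq. (9)."""
    return q ** (1.0 / acc_coef(pos_a, pos_b, acc_dist))

def neighbor_probability(degree_a):
    """Probability p(A) of picking a direct neighbor of A as node B, Eq. (10)."""
    return 1.0 - 1.0 / (1.0 + degree_a)

# Toy usage: dependencies D(A, B) = 0.8 and D(B, A) = 0.1 are assumed values.
pos_a, pos_b = np.array([0.20, 0.30]), np.array([0.25, 0.31])
q = q_s(0.8, 0.1)
print(acc(q, pos_a, pos_b, acc_dist=0.01))
print(neighbor_probability(1), neighbor_probability(20))   # 0.5 vs. ~0.95
```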
derived, and (3) the maximum distance maxAccDist for the acceleration of the move, from which the distance above which the move accelerates and below which it slows down is derived for each pair of nodes. Thus, from the perspective of the whole network, it is a contraction that results in a distribution of
nodes in a small part of the input n-dimensional cube in which the nodes almost
stop moving. Our experiments show that, regardless of the size of the network,
after 20 − 50 iterations, the community structure emerges (strongly dependent
node groups), and after 200 − 500 iterations, the distribution changes very little;
groups of strongly dependent nodes move (relatively) away from each other, and
the distribution stabilizes. Thus, the number of iterations needed is not much
affected by the network size because the algorithm efficiently separates locally
strongly connected sub-structures from the rest of the network. The strength of
the DepDist Contraction algorithm is, therefore, most evident when applied to
networks with significant community structure, and its discovery in embedding
is only a side effect of the DepDist.
Figure 1 visualizes the distribution of nodes after 50 iterations of the four
networks we used in our experiments; it is a 2D embedding, which is comple-
mented by the edges between the nodes and the sizes of the nodes corresponding
to their degree for better clarity. As can be seen, even after a relatively small
number of iterations, the community structure of the networks is obvious.
4.5 Scalability
Calculating the dependency of one node on another is similar to calculating
the clustering coefficient and has time complexity O(k 2 ), where k is the aver-
age degree of the network (we can calculate the dependencies in both directions
simultaneously). However, the computations for each pair of nodes are inde-
pendent and can be computed in parallel as needed. Within a single iteration,
storing the current node positions at the beginning and computing the depen-
dencies including moving the nodes to their new positions in parallel is possible.
When the iteration ends, the current positions are swapped with the new ones.
Thus, during the algorithm, we work with two states of the network (current
and new node positions); therefore, the spatial complexity is O(N ), where N is
the number of nodes in the network.
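A minimal sketch of this two-buffer iteration scheme is shown below; move_node() is a placeholder for the DepDist step, and the per-node loop is what could be parallelized.

```python
import numpy as np

def iterate(positions, move_node, num_iterations):
    # Positions are read from a frozen "current" buffer and written to a "new"
    # buffer, so all node moves within one iteration are independent; the two
    # O(N) buffers are swapped at the end of each iteration.
    current = positions.copy()
    new = np.empty_like(current)
    for _ in range(num_iterations):
        for a in range(current.shape[0]):      # independent per node -> parallelizable
            new[a] = move_node(a, current)     # reads only the frozen current state
        current, new = new, current
    return current

# Toy usage with a placeholder move that slightly contracts all positions.
positions = np.random.rand(5, 2)
result = iterate(positions, lambda a, cur: cur[a] * 0.99, num_iterations=10)
print(result.shape)
```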
To estimate the time complexity, we assume a sparse network for which
the relationship between the number of edges and nodes is O(N ). If we want
to optimize the computational complexity, we must continuously compute the
dependencies during the iterative procedure (i.e., only when necessary) and store
them for reuse. Thus, the estimate of the time complexity of computing all
dependencies is based on the dependency computation complexity and the total
number of node pairs for which the dependency must be computed. Given that
we compute dependencies for neighbors and neighbors of neighbors based on
random selection, we can estimate the time complexity of computing all required
dependencies to be O(N k 4 ); however, this is the worst case, where we assume
computing dependencies for all neighbors and neighbors of neighbors for all nodes
in the network. Moreover, for sparse networks in the case of stored dependencies,
Fig. 1. 2D embedding for karate, lesmis, football, netscience (giant component) net-
works after 50 iterations.
the spatial complexity changes to O(N k 2 ). Here again, this is the worst case that
does not occur in practice since dependencies with neighbors of neighbors are
computed only rarely for nodes with a higher degree (see Sect. 4.3). In general,
for sparse networks, we can expect a time and space complexity of O(N k 3 ) and
O(N k), respectively.
Random selection around the selected node depends on the representation of
the network. If we use an adjacency list, then neighbor selection has complexity
O(1). Therefore, for the total complexity, we only need to consider the number
of iterations r; the estimate of the total time complexity is then O(rN + N k 3 )
for sparse networks.
5 Experiment
To present the effectiveness of the DepDist Contraction algorithm, we used four small networks for which the quality of the embedding can be visually assessed in the form of a network layout. We chose well-known
networks from Mark E.J. Newman (http://www-personal.umich.edu/∼mejn/netdata/): Zachary’s karate club (karate), Les Miser-
ables (lesmis), American College football (football), giant component of Coau-
thorships in network science (netscience). In Table 1, we can see that each
network has different properties (number of nodes and edges, average, minimum
and maximum degree, average clustering coefficient, Louvain modularity).
5.1 Results
In the experiment, we used dimension n = 2 for all four networks, the side
size of the square for the random initial node distribution a = 1, i.e., a
square with a diagonal [0, 0], [1, 1], the maximum expected dependency distance
maxDepDist = 0.002, and the maximum acceleration distance maxAccDist =
0.01. The result of applying the DepDist Contraction algorithm is shown for 50
and 500 iterations in Figs. 1 and 2; the difference between 200 and 500 iterations
is visually negligible, and there is virtually no further movement. Figure 3 shows
the changes in the positions of the network nodes expressed in terms of mean
squared error (MSE) between two consecutive iterations.
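As a side note, the quantity plotted in Fig. 3 can be computed directly from two consecutive position matrices; the short sketch below shows this convergence measure.

```python
import numpy as np

def iteration_mse(prev_positions, curr_positions):
    # Mean squared error between node positions of two consecutive iterations;
    # the contraction is considered stable once this value stops decreasing.
    return float(np.mean((curr_positions - prev_positions) ** 2))
```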
Even though our goal is embedding (i.e., in this case, transforming the net-
work to a vector representation of dimension 2), the result is comparable to
layout-oriented algorithms, which usually use balancing based on attractive and
repulsive forces between pairs of nodes. However, compared to the force-directed
layout in Fig. 4, one significant difference can be seen in the karate layout.
Namely, the dependency is much more related to the connectivity of the nodes
to the neighborhood than to the edge weights. Therefore, the distance between
pairs of nodes is small only when both dependencies are high. On the other hand,
if at least one dependency decreases to zero, the distance increases, regardless
of the edge weights. In Fig. 2, this property in the karate network highlights (1)
the separation of the three groups of nodes in the middle and at the boundaries
and (2) the relatively large distances between nodes weakly connected to their
neighborhood.
The figures show how embedding is affected by other properties of the indi-
vidual networks. Zachary’s karate club contains no larger cliques except triangles;
Les Miserables, on the other hand, contains well-separated cliques, near-clique
Fig. 2. 2D embedding for karate, lesmis, football, netscience (giant component) net-
works after 500 iterations.
Fig. 3. Evolution of MSE between two consecutive iterations for karate, lesmis, football, netscience (giant component) networks.
References
1. Çivril, A., Magdon-Ismail, M., Bocek-Rivele, E.: SSDE: fast graph drawing using
sampled spectral distance embedding. In: Kaufmann, M., Wagner, D. (eds.) Graph
Drawing, pp. 30–41. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-
540-70904-6 5
2. Cui, P., Wang, X., Pei, J., Zhu, W.: A survey on network embedding. IEEE Trans.
Knowl. Data Eng. 31(5), 833–852 (2018)
3. Freeman, L.C.: Graphic techniques for exploring social network data. Models Meth-
ods Soc. Netw. Anal. 28, 248–269 (2005)
4. Fruchterman, T.M., Reingold, E.M.: Graph drawing by force-directed placement.
Softw. Pract. Exp. 21(11), 1129–1164 (1991)
5. Gibson, H., Faith, J., Vickers, P.: A survey of two-dimensional graph layout tech-
niques for information visualisation. Inf. Vis. 12(3–4), 324–357 (2013)
6. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, pp. 855–864 (2016)
7. Hu, Y.: Efficient, high-quality force-directed graph drawing. Math. J. 10(1), 37–71
(2005)
8. Jacomy, M., Venturini, T., Heymann, S., Bastian, M.: Forceatlas2, a continuous
graph layout algorithm for handy network visualization designed for the Gephi
software. PloS One 9(6), e98679 (2014)
9. Kudelka, M., Ochodkova, E., Zehnalova, S., Plesnik, J.: Ego-zones: non-symmetric
dependencies reveal network groups with large and dense overlaps. Appl. Netw.
Sci. 4(1), 1–49 (2019)
10. Liao, L., He, X., Zhang, H., Chua, T.S.: Attributed social network embedding.
IEEE Trans. Knowl. Data Eng. 30(12), 2257–2270 (2018)
11. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res.
9(11) (2008)
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed represen-
tations of words and phrases and their compositionality. Adv. Neural Inf. Process.
Syst. 26 (2013)
13. Noack, A.: Energy models for graph clustering. J. Graph Algorithms Appl. 11(2),
453–480 (2007)
14. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social represen-
tations. In: Proceedings of the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 701–710 (2014)
15. Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., Tang, J.: Network embedding as
matrix factorization: unifying DeepWalk, LINE, PTE, and node2vec. In: Proceedings
of the Eleventh ACM International Conference on Web Search and Data Mining,
pp. 459–467 (2018)
16. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale infor-
mation network embedding. In: Proceedings of the 24th International Conference
on World Wide Web, pp. 1067–1077 (2015)
17. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 1225–1234 (2016)
18. Wang, S., Tang, J., Aggarwal, C., Chang, Y., Liu, H.: Signed network embedding
in social media. In: Proceedings of the 2017 SIAM International Conference on
Data Mining, pp. 327–335. SIAM (2017)
19. Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W., Yang, S.: Community preserving
network embedding. In: Proceedings of the AAAI Conference on Artificial Intelli-
gence, vol. 31 (2017)
Evaluating Network Embeddings Through
the Lens of Community Structure
1 Introduction
Networks often exhibit a modular structure, where nodes cluster into communi-
ties with shared characteristics or functions [1]. Understanding these community
structures is crucial for various applications, from recommendation systems to
the optimal spread of information and disease control [2–10]. With network sizes continually increasing, generating lower-dimensional representations, known as network
embedding, has gained significant attention in recent years [11]. This technique
transforms networks into low-dimensional vector representations.
While certain techniques are designed to explicitly maintain or enhance the
community structure through the embedding process, others may not consider
community structure preservation a primary objective. Nonetheless, one of the
fundamental goals of all network embedding techniques is to project the simi-
larity of the nodes of the original network onto the lower-dimensional space.
Further, network embedding techniques are commonly evaluated through
classification metrics [11]. Nonetheless, these metrics are agnostic about the
community structure: they do not indicate whether it is well preserved after
the embedding process. In other words, they offer information about the over-
all quality of results but do not reveal the fine-grained details of community
structure within a network.
Consequently, there is a need for a comprehensive comparative analysis of
network embedding algorithms from a modular perspective. This paper ana-
lyzes the performance of the most prominent network embedding algorithms on
controlled synthetic networks.
The rest of the paper is organized as follows. Section 2 overviews the funda-
mental concepts of network embedding and introduces the mesoscopic evaluation
metrics. Section 3 presents the synthetic modular network generation. It details
the experimental setup and evaluation metrics for comparing these algorithms
and the voting model used in ranking these algorithms. Section 4 presents the
results of our comparative analysis and discusses the findings and their implica-
tions. Finally, Sect. 5 concludes the paper.
2 Background
The landscape of network embedding algorithms is notably diverse. To ensure
that we encompass a spectrum of approaches, we include ten widely recog-
nized methods that span random walks, matrix factorization, and deep learning
[12]. This diversity should allow us to understand the challenges and opportu-
nities of different network embedding strategies. We briefly describe these algo-
rithms, highlighting their specificities. It is worth noting that the DeepWalk, Node2Vec, Walklets, M-NMF, and M-GAE algorithms explicitly aim to maintain the community structure as they acquire node representations.
The embedding techniques are commonly assessed with classification metrics used by the machine learning community, which include the adjusted mutual
information score (AMI), normalized mutual information score (NMI), adjusted Rand score (ARI), Micro-F1 score, and Macro-F1 score, which are not neces-
sarily community-aware evaluators [23]. Here we propose a complementary set
of metrics to assess the quality of these algorithms. These below-described meso-
scopic metrics are used to evaluate the quality of the embedding techniques in
preserving the community structure. The latter, denoted by C, is defined as fol-
lows: for an undirected unweighted graph G(V, E), where V is the set of nodes
and E ⊆ V × V is the set of edges, C = {c1, c2, ..., cq, ..., c|C|}, where cq is the q-th community, mcq and ncq are the total number of links and nodes inside com-
munity cq , respectively, and | C | is the total number of communities. A node i
in G has a total degree of ki^tot = ki^intra + ki^inter, where ki^intra denotes its intra-community links and ki^inter denotes its inter-community links. The mesoscopic
metrics are calculated for each community in the network and then averaged over
all the communities. We denote each evaluation metric for each community cq as
f (cq ). In the context of this study, we employ a set of nine mesoscopic metrics,
where seven are defined within our work. The internal degree and community
size distributions are inherently self-explanatory through their nomenclature:
– Maximum-out degree fraction (Max-ODF): based on the node of community cq with the largest fraction of inter-community links:

  f(cq) = max_{i∈cq} ki^inter / ki^tot

– Average-out degree fraction (Average-ODF): is based on the inter-community links of all the nodes in the community cq they belong to:

  f(cq) = (1/ncq) Σ_{i∈cq} ki^inter / ki^tot

– Flake-out degree fraction (Flake-ODF): the fraction of nodes fi of community cq having fewer intra-community than inter-community links:

  f(cq) = |fi| / ncq

– Embeddedness: the average fraction of intra-community links over the total degree of the nodes of cq:

  f(cq) = (1/ncq) Σ_{i∈cq} ki^intra / ki^tot

– Hub dominance: based on the node of cq with the largest number of intra-community links:

  f(cq) = max_{i∈cq} ki^intra / (ncq(ncq − 1))
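To make the per-community averaging concrete, the sketch below computes two of these metrics (Average-ODF and embeddedness) with networkx; the graph and partition are illustrative only and are not the benchmark networks used in the paper.

```python
import networkx as nx

def community_metrics(G, communities):
    # `communities` maps each node to a community label. Each metric is computed
    # per community f(c_q) and then averaged over all communities.
    by_label = {}
    for node, label in communities.items():
        by_label.setdefault(label, set()).add(node)

    avg_odf, embeddedness = [], []
    for members in by_label.values():
        odf_vals, emb_vals = [], []
        for i in members:
            k_tot = G.degree(i)
            if k_tot == 0:
                continue
            k_intra = sum(1 for j in G.neighbors(i) if j in members)
            odf_vals.append((k_tot - k_intra) / k_tot)  # k_i^inter / k_i^tot
            emb_vals.append(k_intra / k_tot)            # k_i^intra / k_i^tot
        if odf_vals:
            avg_odf.append(sum(odf_vals) / len(odf_vals))
            embeddedness.append(sum(emb_vals) / len(emb_vals))
    return sum(avg_odf) / len(avg_odf), sum(embeddedness) / len(embeddedness)

G = nx.karate_club_graph()
labels = {n: G.nodes[n]["club"] for n in G}             # two ground-truth groups
print(community_metrics(G, labels))
```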
Fig. 1. Flowchart of the experimental setup to evaluate the performance of the network
embedding algorithms. Mesoscopic metrics are calculated individually for each of the
network’s communities.
Fig. 2. KL-divergence between the ground truth distributions of the mesoscopic metrics
and those recovered by the different algorithms.
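A minimal sketch of the comparison underlying Fig. 2 is given below: the distribution of a mesoscopic metric over the ground-truth communities is compared, via KL-divergence [25], with the distribution recovered after embedding; the placeholder data, binning, and smoothing constant are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import entropy

def kl_between_metric_distributions(ground_truth_values, recovered_values, bins=20):
    # Bin both sets of per-community metric values over a common range, then
    # compute D_KL(P || Q) with a small smoothing constant to avoid empty bins.
    lo = min(np.min(ground_truth_values), np.min(recovered_values))
    hi = max(np.max(ground_truth_values), np.max(recovered_values))
    p, _ = np.histogram(ground_truth_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(recovered_values, bins=bins, range=(lo, hi))
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return entropy(p, q)                                  # scipy computes KL for two inputs

# Placeholder metric values standing in for ground-truth vs. recovered communities.
print(kl_between_metric_distributions(np.random.beta(2, 5, 200), np.random.beta(2, 4, 200)))
```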
Concerning Max-ODF, embeddedness and internal degree, over the full range
of μ, BoostNE is outperformed by the other algorithms as shown in Figs. 2c, 2d,
and 2g. Interestingly, it is joined by Walklets after the μ = 0.4 value. Hub
dominance in Fig. 2e follows this trend. However, in this case, the performance
of Walklets decreases considerably more than the value of BoostNE.
For internal density in Fig. 2f, BoostNE performs the worst up to μ = 0.4, after which it becomes similar to the rest of the algorithms. At that point, we also see a large decrease in the performance of RandNE.
As for Flake-ODF, all the algorithms seem to be following the same trend
where the value of the KL is high for low values of μ, then after the value of
μ = 0.4, we see a sharp decline.
In the case of community size distribution, all algorithms’ performance
decreases around μ = 0.4 as shown in Fig. 2h, except for NetMF which increases.
The results of the rank comparison are rendered as heatmaps describing the correlation between the rankings of the classification metrics and the mesoscopic ones, shown in Figs. 3 and 4 for μ = 0.1 and 0.7,
respectively. We note that the performance of the different algorithms changes
Fig. 3. Correlation between the ranking of the algorithms based on the classification
metrics and the mesoscopic metrics for µ = 0.1.
with the mixing parameter μ. Particularly, beyond the value of μ = 0.4, some
algorithms exhibit a significant drop in performance, which we report in Table 1.
In the latter table, we also report the ranking of the algorithms using the clas-
sification metrics.
Fig. 4. Correlation between the ranking of the algorithms based on the classification
metrics and the mesoscopic metrics for µ = 0.7.
Based on the mesoscopic metrics, denoted as Meso in the table, LEM ranks
first while M-GAE ranks second for μ ≤ 0.4, which is in total agreement with
Table 1. The ranking of the embedding algorithms based on Schulze’s method for
mesoscopic and classification metrics.
It is also worth noting that RandNE, LEM, and NetMF are matrix factor-
ization methods that rely on a distance optimization scheme, while M-GAE is
a deep learning-based algorithm. This seems to suggest that the metrics have
inherent biases. More precisely, LEM, which relies on evaluating the Laplacian,
is expected to rank first for most mesoscopic measures. The reason lies in the Laplacian's properties: being the difference between the degree and adjacency matrices, it retains information about each node's degree and neighborhood. More-
over, the number of degenerate eigenvectors corresponding to the eigenvalue 0
of the Laplacian represents the number of communities, equivalent to a distance
optimization problem [27]. In contrast, the mutual information measures tend
to rank M-GAE first. That could be explained by the fact that M-GAE optimizes modularity, which is related to the mutual information between two nodes; in that sense, the metric effectively rewards what the model already optimizes for.
In accordance with the outcomes presented here, upon crossing μ = 0.4, M-
GAE exhibits superior performance while it is second when μ ≤ 0.4 in both clas-
sification and mesoscopic metrics. Additionally, LEM ranks first when μ ≤ 0.4 but drops to fifth when μ > 0.4; NetMF, which ranks fourth and fifth with mesoscopic and classification metrics, respectively, rises to second when μ > 0.4.
To summarize, LEM demonstrates outstanding performance within a robust
community structure, excelling in community-aware and classification met-
rics. However, as the community structure strength diminishes, its effective-
ness prominently declines with classification metrics. The opposite behavior is
seen with NetMF. In contrast, M-GAE maintains outperformance across both
community-aware and classification metrics regardless of the community struc-
ture strength, by ranking either first or second.
5 Conclusion
Preserving network community structure is crucial, and network embedding techniques offer significant potential. However, the evaluation metrics commonly
used in the literature fail to capture this preservation effectively. This study
highlights the need for a comprehensive comparison of network embedding algo-
rithms from a modular perspective. Our work is limited to evaluating the effect
of the mixing parameter on the embedding quality. Our study specifically aims
to determine the adequacy of classification metrics employed in the literature
to comprehend the effectiveness of network embeddings. Results reveal that
these metrics do not comprehensively reflect the network’s community struc-
ture, exhibiting a low correlation with community-aware metrics. Furthermore,
the efficacy of certain embedding techniques, such as LEM, M-GAE, and NetMF,
is influenced by the strength of the community structure. These findings under-
score the need for a more attentive approach in evaluating embedding techniques
tailored to the specific application.
Acknowledgment. S.N and J.B would like to acknowledge support from the Center
for Advanced Mathematical Science (CAMS) at the American University of Beirut
(AUB).
References
1. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. PNAS 99(12), 7821–7826 (2002)
2. Cherifi, H., Palla, G., Szymanski, B.K., Lu, X.: On community structure in complex
networks: challenges and opportunities. Appl. Netw. Sci. 4(1), 1–35 (2019)
3. Orman, K., Labatut, V., Cherifi, H.: An empirical study of the relation between
community structure and transitivity. In: Menezes, R., Evsukoff, A., González,
M. (eds.) Complex Networks. Studies in Computational Intelligence, vol. 424, pp.
99–110. Springer, Cham (2013). https://doi.org/10.1007/978-3-642-30287-9 11
4. Gupta, N., Singh, A., Cherifi, H.: Community-based immunization strategies for epi-
demic control. In: 2015 7th International Conference on Communication Systems
and Networks (COMSNETS), pp. 1–6. IEEE (2015)
5. Chakraborty, D., Singh, A., Cherifi, H.: Immunization strategies based on the
overlapping nodes in networks with community structure. In: Nguyen, H., Snasel,
V. (eds.) Computational Social Networks: 5th International Conference, CSoNet
2016, Ho Chi Minh City, Vietnam, 2–4 August 2016, Proceedings 5, pp. 62–73.
Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42345-6 6
6. Kumar, M., Singh, A., Cherifi, H.: An efficient immunization strategy using over-
lapping nodes and its neighborhoods. In: Companion Proceedings of the Web Con-
ference 2018, pp. 1269–1275 (2018)
7. Ghalmane, Z., Cherifi, C., Cherifi, H., El Hassouni, M.: Extracting backbones in
weighted modular complex networks. Sci. Rep. 10(1), 15539 (2020)
8. Rajeh, S., Savonnet, M., Leclercq, E., Cherifi, H.: Interplay between hierarchy and
centrality in complex networks. IEEE Access 8, 129717–129742 (2020)
9. Rajeh, S., Savonnet, M., Leclercq, E., Cherifi, H.: Characterizing the interactions
between classical and community-aware centrality measures in complex networks.
Sci. Rep. 11(1), 10088 (2021)
10. Rajeh, S., Savonnet, M., Leclercq, E., Cherifi, H.: Comparative evaluation of
community-aware centrality measures. Qual. Quant. 57(2), 1273–1302 (2023)
11. Hou, M., Ren, J., Zhang, D., Kong, X., Zhang, D., Xia, F.: Network embedding:
taxonomies, frameworks and applications. Comput. Sci. Rev. 38, 100296 (2020)
12. Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and perfor-
mance: a survey. Knowl.-Based Syst. 151, 78–94 (2018)
13. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social represen-
tations. In: Proceedings of the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 701–710 (2014)
14. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, pp. 855–864 (2016)
15. Rozemberczki, B., Sarkar, R.: Fast sequence-based embedding with diffusion
graphs. In: Cornelius, S., Coronges, K., Goncalves, B., Sinatra, R., Vespignani,
A. (eds.) Complex Networks IX: Proceedings of the 9th Conference on Complex
Networks CompleNet 2018 9, pp. 99–107. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73198-8_9
16. Perozzi, B., Kulkarni, V., Chen, H., Skiena, S.: Don’t walk, skip! Online learning
of multi-scale network embeddings. In: Proceedings of the 2017 IEEE/ACM Inter-
national Conference on Advances in Social Networks Analysis and Mining 2017,
pp. 258–265 (2017)
17. Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W., Yang, S.: Community preserving
network embedding. In: Proceedings of the AAAI Conference on Artificial Intelli-
gence, vol. 31 (2017)
18. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding
and clustering. In: Advances in Neural Information Processing Systems, vol. 14
(2001)
19. Zhang, Z., Cui, P., Li, H., Wang, X., Zhu, W.: Billion-scale network embedding
with iterative random projection. In: ICDM, pp. 787–796. IEEE (2018)
20. Li, J., Wu, L., Guo, R., Liu, C., Liu, H.: Multi-level network embedding with
boosted low-rank matrix approximation. In: Proceedings of the 2019 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining,
pp. 49–56 (2019)
21. Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., Tang, J.: Network embedding as matrix
factorization: unifying DeepWalk, LINE, PTE, and node2vec. In: Proceedings of
the Eleventh ACM International Conference on Web Search and Data Mining, pp.
459–467 (2018)
22. Salha-Galvan, G., Lutzeyer, J.F., Dasoulas, G., Hennequin, R., Vazirgiannis, M.:
Modularity-aware graph autoencoders for joint community detection and link pre-
diction. Neural Networks 153, 474–495 (2022)
23. Xuan Vinh, N., Epps, J., Bailey, J.: Information theoretic measures for cluster-
ings comparison: is a correction for chance necessary? In: Proceedings of the 26th
Annual International Conference on Machine Learning, pp. 1073–1080 (2009)
24. Kamiński, B., Pralat, P., Théberge, F.: Artificial benchmark for community detec-
tion (ABCD)-fast random graph model with community structure. Netw. Sci. 9(2),
153–178 (2021)
25. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1),
79–86 (1951)
26. Schulze, M.: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Soc. Choice Welfare 36, 267–303 (2011)
27. Strang, G.: Linear algebra and learning from data. SIAM (2019)
Deep Distance Sensitivity Oracles
1 Introduction
The shortest path problem is frequently encountered in the real world. In road
networks, users want to know how long it will take to get from one place to
another [17]. In biological networks, consisting of genes and their products, the
shortest paths are used to find clusters and identify core pathways [23]. In social
networks, the number of connections between users can be used for friend rec-
ommendation [26]. In web search, relevant web pages can be ranked by their
distances from queried terms [27].
For real-world graphs, often consisting of millions of nodes, special data structures called Distance Oracles (DO) are used to store distance information about an input graph G = (V, E) with n vertices and m edges. Without storing the entire graph, they can quickly retrieve the distance information needed to answer shortest path queries. Distance oracles shift the computational burden to a preprocessing step, so that queries can be answered quickly.
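To make this trade-off concrete, the following toy oracle (our own illustration, not a construction from the literature; it assumes an unweighted networkx graph) spends all its effort in preprocessing, storing an O(n^2) table so that each query is a constant-time lookup:

import networkx as nx

class SimpleDistanceOracle:
    # Toy distance oracle: precompute all-pairs shortest path lengths once
    # (expensive), then answer each query with a dictionary lookup.
    # Practical oracles trade accuracy or structure for far less than
    # O(n^2) space; this sketch only illustrates the preprocessing/query split.
    def __init__(self, G):
        self.dist = dict(nx.all_pairs_shortest_path_length(G))

    def query(self, s, t):
        return self.dist[s].get(t, float("inf"))

oracle = SimpleDistanceOracle(nx.karate_club_graph())
print(oracle.query(0, 33))  # shortest path length between nodes 0 and 33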
However, in addition to being large, real-world networks are also frequently susceptible to failures. For example, in road networks, construction, a traffic accident, or an event might temporarily block nodes. In social networks, users might temporarily deactivate their accounts, resulting in a null node. And on the internet, web servers may be temporarily down due to mechanical failures or malicious attacks [4]. In these instances, we desire a method that can continue answering shortest path queries without stalling or recomputing shortest paths on the entire graph.
Distance Sensitivity Oracles (DSO) are a type of DO that can respond to queries of the form (s, t, f), requesting the shortest path between nodes s and t when a vertex f fails and is thus unavailable. Desirable DSOs should provide reasonable trade-offs among space consumption, query time, and mean relative error (MRE, i.e., the quality of the estimated distance). In this paper, we consider the simplest case, in which there is only one failed node.
1.1 Contributions
– In our theoretical analysis, we first present a simple proof for the existence
of an underlying combinatorial structure for replacement paths: specifically,
the existence of pivot nodes.
– We observe that one can use deep learning to find pivot nodes in distance
sensitivity oracles. In fact, to the best of our knowledge, we are the first to
use deep learning to build a distance sensitivity oracle.
– We empirically evaluate our method and compare it with related works to
demonstrate near-exact accuracy across a diverse range of real-world net-
works.
Given that we are the first to propose a deep learning approach to DSOs, we describe previous work on both DSOs and deep learning below.
Õ(mn)¹. Note that the All-Pairs Shortest Paths (APSP) problem, which only asks for the distances between each pair of vertices u, v, is conjectured to require mn^{1−o(1)} time to solve [20]. Since we can solve the APSP problem by using a DSO, querying it with (s, t, ∅) for every pair s, t, a preprocessing time of Õ(mn) is in this sense asymptotically optimal, up to a polylogarithmic factor (note that, in practice, such polylogarithmic factors may be very large). Several additional results improved upon the theoretical preprocessing time by using fast matrix multiplication [8,16,22].
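The reduction from APSP to DSO queries mentioned above is simply a double loop over vertex pairs; dso.query below is a hypothetical interface standing in for any concrete DSO, with failed=None meaning that no vertex has failed:

def apsp_via_dso(nodes, dso):
    # Answer every pair (s, t) with no failed vertex; this recovers APSP,
    # which is why DSO preprocessing cannot be much faster than APSP itself.
    return {(s, t): dso.query(s, t, failed=None) for s in nodes for t in nodes}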
With respect to the size of the oracle, Duan and Zhang [14] improved the space complexity of [13] to O(n²), which is, from a theoretical perspective, asymptotically optimal for dense graphs (i.e., m = Θ(n²)). To do so, Duan and Zhang store multiple data structures, which is reasonable for a theoretical work; from a practical perspective, however, the hidden constant is large. Therefore, it may also be interesting to consider DSOs with smaller space, at the cost of an approximate answer.
Several DSOs provide trade-offs between the size of the DSO and the stretch (the reported length divided by the actual length):
– The DSO described in [2], for every parameter ε > 0 and integer k ≥ 1, has stretch (2k − 1)(1 + ε) and size O(k^5 n^{1+1/k} log^3 n / ε^4).
– The DSO described in [9], for every integer parameter k ≥ 1, has stretch (16k − 4) and size O(k n^{1+1/k} log n).
Note that even though the size of the above two DSOs for k ≥ 2 is asymptotically smaller than O(n²), the stretch guarantee is at least 3 in [2] and at least 28 in [9], which is far from optimal and may not be practical in many applications.
In this work, we construct the first DSO that is built using deep learning. Our method uses deep learning to find pivot nodes (as described in Sect. 3), utilizing a combinatorial structural property we observe in Sect. 2, and computes near-optimal paths as shown in Sect. 5.
¹ For a non-negative function f = f(n), we use Õ(f) to denote O(f · polylog(n)).
Among the first to apply graph embeddings to the shortest paths problem was Orion [29]. Inspired by the success of virtual coordinate systems, it employed a landmark labelling approach, where the positions of all nodes were chosen based on their relative distances to a fixed number of landmarks. Using the Simplex Downhill algorithm, representations were found in a Euclidean coordinate space, allowing constant-time distance calculations and producing a mean relative error (MRE) between 15% and 20% [29]. Other existing coordinate systems have also been used. Building on network routing schemes in hyperbolic spaces, Rigel used a hyperbolic graph coordinate system to reduce the MRE to 9% and found that the hyperbolic space performed empirically better across distortion metrics than Euclidean and spherical coordinate systems [12,31]. In road networks, geographical coordinates have been combined with a multi-layer perceptron to predict distances between locations with 9% MRE [18].
In addition to these coordinate systems, general graph embedding techniques have recently been employed to handle shortest path queries with great success. In 2018, researchers from the University of Passau proposed node2vec-Sg [24]. To find the shortest path between nodes s and t, their Node2vec and Poincaré embeddings were combined through various binary operations and fed into a feed-forward neural network, which was trained only on the distances between l landmark nodes (l ≪ n) and the rest of the graph. The model that took concatenated Node2vec embeddings performed the best, with an MRE between 3% and 7%.
Researchers have also demonstrated the accuracy of graph embeddings learned jointly with distance predictors, producing representations more specific to the shortest path task. Vdist2vec directly learned vertex embeddings by passing the gradient from the distance predictor back to an N × k matrix, achieving an MRE between 1% and 7% [21]. Huang et al. computed shortest path distances on road networks using a hierarchical embedding model and achieved an MRE of 0.7% [17]. Most recently, ndist2vec built upon the landmark labelling, graph embedding, and neural network aspects of all of these approaches, reporting an MRE of 3.4% with a dataset of size on the order of O(n).
Current works for estimating the shortest path length between two nodes are limited by the representations they learn. They rely on training sets that, even when using schemes like landmark labelling or hierarchical embeddings, are proportional in size to n, the number of nodes in the network [10,17]. This presents a significant bottleneck for larger graphs. Taking these lessons to the deep learning DSO task, we present a model in Sect. 3 that extracts signal more efficiently, thus requiring fewer training samples without sacrificing accuracy.
2 Theoretical Analysis
Lemma 1. [1] After k edge failures in an unweighted graph, each new shortest
path is the concatenation of at most k + 1 original shortest paths.
In other words, the replacement path can be defined using so-called pivot nodes, which specify at which nodes in the graph the original shortest paths may be stitched together. In this work, we are interested in the failure of a single node, which is equivalent to the failure of all its incident edges. The number of concatenations (and with that the number of corresponding pivots) required to obtain the replacement path then depends on the degree of the failed node. While in real-world networks the average degree is often rather small, finding suitable pivots remains a hard task. To overcome this problem, we consider an approximate setting, where we allow for some slack in the quality of the obtained paths (they may be longer than a shortest replacement path) but use only one pivot node. From a theoretical perspective, the following lemma is a special case of Lemma 1 for a single edge failure.
Lemma 2. After an edge failure in an unweighted undirected graph, each new
shortest path is the concatenation of at most two original shortest paths.
Given (s, t, f), let P(s, t, f) be a shortest path from s to t in G − {f}. According to Lemma 2, P(s, t, f) is a concatenation of two original shortest paths; in other words, there exists a pivot vertex v such that P(s, t, f) is the concatenation of the two shortest paths P(s, v) and P(v, t), where P(s, v) is a shortest path from s to v in G and P(v, t) is a shortest path from v to t in G. Motivated by this lemma, we will assume that it is sufficient to approximate the replacement path of a node failure using a single pivot node. As mentioned above, in the remainder of this paper we show that it is possible to use deep learning to find such pivot nodes.
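To make the role of the pivot concrete, the sketch below (our own illustration, assuming an unweighted networkx graph and a brute-force search over candidate pivots, which the learned selector of Sect. 3 is meant to replace) estimates the replacement distance by stitching two original shortest paths at a pivot whose paths avoid the failed node f:

import networkx as nx

def pivot_replacement_distance(G, s, t, f, pivots=None):
    # Shortest paths in the original graph G (in practice precomputed once).
    paths_from_s = nx.single_source_shortest_path(G, s)
    paths_from_t = nx.single_source_shortest_path(G, t)
    best = float("inf")
    candidates = pivots if pivots is not None else G.nodes
    for v in candidates:
        if v == f or v not in paths_from_s or v not in paths_from_t:
            continue
        p1, p2 = paths_from_s[v], paths_from_t[v]  # s -> v and t -> v in G
        if f in p1 or f in p2:                     # both halves must avoid f
            continue
        best = min(best, (len(p1) - 1) + (len(p2) - 1))
    return best

The value returned upper-bounds the true distance in G − {f}; with a well-chosen pivot it is close to optimal, which is what the learned selector aims to achieve with a single candidate instead of an exhaustive search.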
3 Method
By the above argumentation, we reduce the problem of finding the short-
est replacement path to finding pivot candidates. In the following section, we
describe how we use a graph convolutional network to encode relevant graph
information and a multi-layer perceptron to select pivot nodes.
Fig. 1. The overall neural network architecture. During back-propagation, the gradient is passed through the MLP and to the relevant GAT parameters, so that the learned representations encode relevant task-specific attributes.
Our final output is an n-sized vector, representing the log likelihood of each node in the network being a pivot node.
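Since the intermediate architectural details are abridged here, the sketch below only illustrates the general shape of such a pipeline in PyTorch Geometric; the layer sizes, the use of two GAT layers, and the way the query (s, t, f) is injected into the MLP are our own illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class PivotSelector(nn.Module):
    # GAT encoder followed by an MLP that scores every node as a pivot
    # candidate for a query (s, t, f). Hyperparameters are illustrative.
    def __init__(self, in_dim, hid_dim=128):
        super().__init__()
        self.gat1 = GATConv(in_dim, hid_dim, heads=4, concat=False)
        self.gat2 = GATConv(hid_dim, hid_dim, heads=4, concat=False)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, 1),
        )

    def forward(self, x, edge_index, s, t, f):
        h = F.relu(self.gat1(x, edge_index))
        h = self.gat2(h, edge_index)                   # [n, hid_dim]
        query = torch.cat([h[s], h[t], h[f]], dim=-1)  # embeddings of s, t, f
        h_q = torch.cat([h, query.expand(h.size(0), -1)], dim=-1)
        scores = self.mlp(h_q).squeeze(-1)             # one score per node
        return F.log_softmax(scores, dim=0)            # log likelihoods over nodes

At query time, the argmax of this vector would presumably serve as the pivot that is stitched with precomputed shortest paths, as in Sect. 2.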
3.4 Summary
4 Experiments
As mentioned previously, to the best of our knowledge, we are the first to pro-
pose a deep learning approach to the DSO problem. Thus, we will evaluate our
proposed method against the state-of-the-art deep learning approaches to the
shortest paths problem: namely, ndist2vec and node2vec-Sg [10,24].
We ran the authors’ implementations for all comparison models with their
recommended hyperparameter settings. All embeddings were 128-dimensional,
in line with previous shortest paths works [15,21,24].
All experiments were implemented in Python and run on a Quadro RTX
8000 and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz.
4.1 Datasets
4.2 Evaluation
In line with previous works computing shortest paths with deep learning, we evaluate our method using the Mean Relative Error (MRE) metric. Let d̂_{i,j} denote the predicted distance and d_{i,j} denote the actual distance. The Relative Error is then given by

RE = |d̂_{i,j} − d_{i,j}| / d_{i,j}.

We note that, for evaluations of DSOs, d̂_{i,j} and d_{i,j} denote the predicted and actual distances on the graph after a node failure.
We also report the representation factor, the ratio of the MRE obtained with random pivots to the MRE obtained using our method, as a metric for evaluating the quality of our representations.
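Both quantities are straightforward to compute; the helpers below (our own, assuming NumPy arrays of predicted and true post-failure distances) mirror the definitions above:

import numpy as np

def mean_relative_error(d_pred, d_true):
    # MRE: average of |d_hat - d| / d over all evaluated queries.
    d_pred, d_true = np.asarray(d_pred, float), np.asarray(d_true, float)
    return float(np.mean(np.abs(d_pred - d_true) / d_true))

def representation_factor(d_random, d_model, d_true):
    # Ratio of the MRE with random pivots to the MRE with learned pivots.
    return mean_relative_error(d_random, d_true) / mean_relative_error(d_model, d_true)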
5 Results
We aimed to evaluate our deep learning approach across the following ques-
tions:
1. How much longer than the optimal paths are our replacement paths?
2. How does our deep learning model compare with previous state-of-the-art
shortest paths works?
3. Is the resulting performance an achievement of our approach or merely an
artifact of the structure of the input graph?
average degree. For instance, our model obtained an MRE of less than 1% on soc-gplus and ia-email-EU but struggled with the comparably smaller bio-grid-yeast network at 6.25%. These insights suggest that other factors related to the networks' structure affected our model's performance.
Given the lack of DSOs using deep learning, the second question asks how our model performs compared to methods that find shortest paths without node failures. Across all networks, we were able to match or outperform the state-of-the-art shortest paths works, often by several orders of magnitude (Table 2). We did so with fewer training cases (numerically and proportionally) and no special selection process. In doing so, we demonstrated that deep learning can be used to effectively find shortest replacement paths as well.
We note that we used the authors' implementation of ndist2vec and confirmed its performance on the road network datasets presented in the original paper [10]. Nonetheless, the model performed significantly worse on our real-world networks and did not finish after a week of computation for the two largest networks. We have two potential explanations. First, the authors suggest that their landmark-labelling approach will not scale well to sparse, large networks, many of which were used in our experiments. Second, the model often became stuck in local optima during training, producing an MRE value of 100% corresponding to a constant output of 0, demonstrating a reliance on initial values. For a fair comparison, we report the best MRE values after two runs in Table 2.
Finally, we aimed to determine whether our model performed well because
of the model or because of the inherent structure of the networks. For example,
consider a graph that is almost a clique (almost all vertices are pairwise con-
nected). Then, all paths are short and after a failure (having almost no impact
on the graph structure) most replacement paths are short as well. In this setting
almost any node can serve as a suitable pivot, yielding a replacement path with
a small stretch and one would expect that even randomly chosen pivots would
yield good results. While the networks considered in our experiments are not
as dense (see Table 2), other graph properties like a small diameter may make
finding good pivots easier.
Table 2 lists the MRE values obtained when using random pivots, which we computed by replacing the output of our pipeline with random noise. As can be clearly seen, the MRE is much larger in this setting. For most networks the MRE exceeds 200%, meaning the found paths are more than three times longer than the shortest replacement paths. To compare our method with the random approach, Table 2 also lists the representation factor. Except for bio-grid-yeast, ia-wiki-Talk, and tech-RL-caida, this factor is always larger than 200, meaning that on most networks our approach is over 200 times better than the random method. This clearly indicates that the close-to-optimal performance is due to the quality of our approach and not an artifact of the properties of the considered inputs.
6 Conclusion
We have shown that distance sensitivity oracles with close to optimal perfor-
mance can be obtained by utilizing the power of deep learning. Our method
builds on a combinatorial property that allows for finding replacement paths
based on pivot vertices. On a variety of real-world networks in the presence of
failures, we can reliably find suitable pivots where the lengths of the correspond-
ing replacement paths are very close to those of optimal paths. Moreover, our
experiments suggest that these results are not artifacts of the inherent structure
of the inputs, but are instead based on the fact that the different building blocks
of our pipeline successfully capture the relevant structural information about the
input graph.
As a consequence, it would be interesting to apply this method to related
tasks where similar structural information needs to be captured. One such exam-
ple is local routing, where the goal is to find short paths in a graph without the
use of a central data structure by greedily routing to nearby embeddings. Prior
work has shown that close to optimal greedy routing can be performed when
embedding networks into hyperbolic space [5]. However, the resulting embed-
dings were susceptible to numerical inaccuracies, and network failures substantially degraded routing performance. It would thus be interesting to see whether our approach can be extended to the greedy routing setting as well, in order to overcome
the previously observed issues.
Additionally, our approach has not yet been tested on larger networks containing millions of nodes. By calculating the APSP information using a distance oracle and by using an improved node2vec implementation, we plan to test the scalability of our approach in the future.
References
1. Afek, Y., Bremler-Barr, A., Kaplan, H., Cohen, E., Merritt, M.: Restoration by
path concatenation: fast recovery of MPLS paths. Distrib. Comput. 15(4), 273–283
(2002). https://doi.org/10.1007/s00446-002-0080-6
2. Baswana, S., Khanna, N.: Approximate shortest paths avoiding a failed vertex:
near optimal data structures for undirected unweighted graphs. Algorithmica 66,
18–50 (2013). https://doi.org/10.1007/s00453-012-9621-y
3. Bernstein, A., Karger, D.R.: A nearly optimal oracle for avoiding failed vertices
and edges. In: Proceedings of the 41st Annual ACM Symposium on Theory of
Computing, STOC 2009, Bethesda, MD, USA, 31 May - 2 June 2009, pp. 101–110.
ACM (2009). https://doi.org/10.1145/1536414.1536431
4. Billand, P., Bravard, C., Iyengar, S.S., Kumar, R., Sarangi, S.: Network connec-
tivity under node failure. Econ. Lett. 149, 164–167 (2016)
5. Bläsius, T., Friedrich, T., Katzmann, M., Krohmer, A.: Hyperbolic embeddings for
near-optimal greedy routing. ACM J. Exp. Algorithmics 25 (2020). https://doi.org/10.1145/3381751
6. Cai, H., Zheng, V.W., Chang, K.C.C.: A comprehensive survey of graph embed-
ding: problems, techniques, and applications. IEEE Trans. Knowl. Data Eng. 30(9),
1616–1637 (2018)
7. Cai, T., Luo, S., Xu, K., He, D., Liu, T.Y., Wang, L.: GraphNorm: a principled app-
roach to accelerating graph neural network training. In: International Conference
on Machine Learning, pp. 1204–1215. PMLR (2021)
8. Chechik, S., Cohen, S.: Distance sensitivity oracles with subcubic preprocessing
time and fast query time. In: Proceedings of the 52nd Annual ACM SIGACT
Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, 22-26 June
2020, pp. 1375–1388. ACM (2020). https://doi.org/10.1145/3357713.3384253
9. Chechik, S., Langberg, M., Peleg, D., Roditty, L.: f -sensitivity distance oracles
and routing schemes. Algorithmica 63, 861–882 (2012). https://doi.org/10.1007/
s00453-011-9543-0
10. Chen, X., et al.: Ndist2vec: node with landmark and new distance to vector method
for predicting shortest path distance along road networks. ISPRS Int. J. Geo Inf.
11(10), 514 (2022)
11. Crichton, G., Guo, Y., Pyysalo, S., Korhonen, A.: Neural networks for link pre-
diction in realistic biomedical graphs: a multi-dimensional evaluation of graph
embedding-based approaches. BMC Bioinform. 19(1), 1–11 (2018)
12. Cvetkovski, A., Crovella, M.: Hyperbolic embedding and routing for dynamic
graphs. In: IEEE INFOCOM 2009, pp. 1647–1655. IEEE (2009)
13. Demetrescu, C., Thorup, M., Chowdhury, R.A., Ramachandran, V.: Oracles for
distances avoiding a failed node or link. SIAM J. Comput. 37(5), 1299–1318 (2008).
https://doi.org/10.1137/S0097539705429847
14. Duan, R., Zhang, T.: Improved distance sensitivity oracles via tree partitioning.
In: WADS 2017. LNCS, vol. 10389, pp. 349–360. Springer, Cham (2017). https://
doi.org/10.1007/978-3-319-62127-2 30
15. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, pp. 855–864 (2016)
16. Gu, Y., Ren, H.: Constructing a Distance Sensitivity Oracle in O(n^{2.5794} M)
Time. In: Bansal, N., Merelli, E., Worrell, J. (eds.) 48th International Col-
loquium on Automata, Languages, and Programming (ICALP 2021), Leib-
niz International Proceedings in Informatics (LIPIcs), vol. 198, pp. 76:1–76:20.
Author Index

A
Abioye, Ikeoluwa 339
Agarwal, Nitin 351
Alkhatib, Amr 100
Antunes, Nelson 137

B
Banerjee, Sayan 137
Barbour, Jason 440
Béranger, Anna 377
Bhakta, Arnav 452
Bhamidi, Shankar 137
Bober, Jakub 225
Boekhout, Hanjo D. 150
Bonald, Thomas 16, 272
Boström, Henrik 100
Bourhim, Sofia 389
Bravo, Cristián 295
Bui, Minh N. 363

C
Caceres, Rajmonda 3
Cakmak, Mert Can 351
Canbaz, M. Abdullah 202
Cavallaro, Lucia 331
Chang, Xiao-Wen 37, 49
Cherifi, Hocine 61, 320, 440
Chin, Peter 339, 363, 452
Chrétien, Stéphane 283
Cohen, Sarel 250, 339, 452
Cruz, Christophe 413
Cucuringu, Mihai 400

D
De Lara, Nathan 272
Delarue, Simon 16
Dheepak, G. 177
Dopater, Emanuel 427
Drif, Ahlem 61
Dugué, Nicolas 377

E
El Hassouni, Mohammed 320
Elliott, Andrew 400
Ennadir, Sofiane 100

F
Francis, Sumam 162
Freedman, Gail Gilboa 237
Friedrich, Tobias 250, 452

G
Gao, Ben 283
Ghanem, Hussam 413
Goodarzi, Mahsa 202
Guillot, Simon 377
Guiochon, Astrid Thébault 283
Gunby-Mann, Allison 339, 452
Gupta, Shubham 308

H
Hua, Chenqing 37
Huang, Tianjin 74

J
Jeong, Davin 452

K
Katzmann, Maximilian 452
Koishida, Kazuhito 363
Kosma, Chrysoula 87
Kudelka, Milos 427
Kundu, Suman 308

L
Lawryshyn, Yuri 295
Leeney, William 112
Li, Yuntao 74
Liang, Zirui 74
Limnios, Stratis 400
Liotta, Antonio 331
Loveland, Donald 3
Lu, Qincheng 37
Luan, Sitao 37, 49

M
Mandviwalla, Aamir 215
Matta, John 189
McConville, Ryan 112
Miasnikof, Pierre 295
Moens, Marie-Francine 162
Monod, Anthea 225

N
Najem, Sara 440
Neal, Jennifer Watling 127
Neal, Zachary P. 127
Nikolentzos, Giannis 100

O
Ochodkova, Eliska 427
Okeke, Obianuju 351
Olivares, Emilio Sánchez 150
Onyepunuka, Ugochukwu 351
Ouyang, Ruikang 400

P
Pechenizkiy, Mykola 74
Pei, Yulong 74
Pham, Chau 452
Philbrick, John 189
Pipiras, Vladas 137
Precup, Doina 37, 49
Prouteau, Thibault 377

Q
Qi, Mingze 260

R
Rajeh, Stephany 440
Reiche, Sebastian 250
Reinert, Gesine 400
Romanova, Alex 25

S
Sankepally, Sainathreddy 308
Sappington, Zachary 189
Saucan, Emil 225
Saxena, Akrati 74, 150
Serafin, Tommaso 331
Shestopaloff, Alexander Y. 295
Simonov, Kirill 250
Sinha, Koushik 189
Spann, Billy 351
Szymanski, Boleslaw K. 215

T
Takes, Frank W. 150
Tao, Ruyi 260
Tao, Yongzai 260
Tran, Dung N. 363
Tran, Trac D. 363

U
Uma, Kanimozhi 162

V
Vaucher, Rémi 283
Vazirgiannis, Michalis 87, 100
Venkatakrishnan, Radhakrishnan 202

W
Wang, Xu 339
Webster, Kevin N. 225
Woodard, Cameron 189

X
Xu, Nancy 87

Y
Yadav, Narendra 308
Yin, Lake 215

Z
Zhang, Jiang 260
Zhang, Zhang 260
Zhao, Mingde 49
Zhu, Jiaqi 37