Structural - Social Networks: Twitter Analysis


Social Network Analysis and Mining (2023) 13:65

https://doi.org/10.1007/s13278-023-01063-2

ORIGINAL ARTICLE

Social network analysis of Twitter interactions: a directed multilayer network approach
Austin P. Logan1 · Phillip M. LaCasse2 · Brian J. Lunday2

Received: 12 January 2023 / Revised: 4 March 2023 / Accepted: 4 March 2023 / Published online: 7 April 2023
This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply 2023

Abstract
Effective employment of social media for any social influence outcome requires a detailed understanding of the target audience. Social media offers a rich repository of self-reported information that provides insight regarding the sentiments and implied priorities of an online population. Using Social Network Analysis, this research models user interactions on Twitter as a weighted, directed network. Topic modeling through Latent Dirichlet Allocation identifies the topics of discussion in Tweets, which this study uses to induce a directed multilayer network wherein users (in one layer) are connected to the conversations and topics (in a second layer) in which they have participated, with inter-layer connections representing user participation in conversations. Analysis of the resulting network identifies both influential users and highly connected groups of individuals, informing an understanding of group dynamics and individual connectivity. The results demonstrate that the generation of a topically focused social network to represent conversations yields more robust findings regarding influential users, particularly when analysts collect Tweets from a variety of discussions through more general search queries. Within the analysis, PageRank performed best among four measures used to rank individual influence within this problem context. In contrast, the results of applying both the Greedy Modularity Algorithm and the Leiden Algorithm to identify communities were mixed; each method yielded valuable insights, but neither technique was uniformly superior. The demonstrated four-step process is readily replicable, and an interested user can automate the process with relatively low effort or expense.

Keywords Social network analysis · Networks · Multilayer networks · Natural language processing

Mathematics Subject Classification 62H30 · 68T50 · 90B10 · 90B90 · 91B08 · 91C20 · 91D30

JEL classification C31 · C38 · C44 · D85

Austin P. Logan, Phillip M. LaCasse and Brian J. Lunday have contributed equally to this work.

* Phillip M. LaCasse, phillip.lacasse@afit.edu
Austin P. Logan, austin.logan.2@us.af.mil
Brian J. Lunday, brian.lunday@afit.edu

1 Directorate of Plans, Programs, and Requirements, Air Combat Command, 129 Andrews Street, Langley Air Force Base, VA 23665, USA
2 Department of Operational Sciences, Air Force Institute of Technology, 2950 Hobson Way, Wright-Patterson Air Force Base, OH 45433, USA

1 Introduction

Social media provides an open environment for individual users to send and receive information within the greater online community, simultaneously influencing and being influenced by it. This influence may take the form of mass content dissemination to a wide audience. It could also consist of targeted, tailored advertising to individuals based on their online behaviors. A less benign example is the use of propaganda or misinformation to shape public opinion or sow discord.

Twitter is a compelling source for social network analysis because its content is heavily text-based compared to other common, image-based platforms, such as Instagram, Snapchat, or TikTok. Moreover, the unique features of Tweet
sharing and hashtag labeling distinguish Twitter from sites such as Facebook where post sharing is less emphasized.

For the most part, the capability for widespread social influence is restricted to entities with the resources and technical skills necessary to perform large-scale analysis of open-source data. This research sets forth and demonstrates a set of analyses that are readily usable by smaller entities without a large capital investment.

This research makes four contributions to the larger discipline of social network analysis.

1. We propose and demonstrate a framework to analyze Twitter data consisting of the following four steps:
   • Topic discovery
   • Construction of the multilayer network
   • Identification of influential users and topics
   • Detection of communities within the network
2. We propose and demonstrate a multilayer network structure consisting of a user layer and a topics layer. This structure leverages relationships between and among the layers to provide meaningful insight into the network and its potential influence. As far as we can determine, the proposed structure is unique to this research.
3. We propose and demonstrate two new SNA arc weighting techniques: one between users based on the inverse of the long-term proportion of interactions and the other between topics based on cosine similarity.
4. We evaluate four alternative methods to identify influential users and two alternative algorithms to discover communities.

The remainder of the manuscript is organized as follows. Section 2 reviews the technical literature related to topic modeling and social network analysis (SNA), including methods previously used in these fields that can answer key research questions. Section 3 explains the methodology to conduct the social network analysis, to include creation of the multilayer network and the metrics to evaluate it. Section 4 presents the results of applying the proposed SNA process to a population of Tweets and discusses both the algorithmic results and instance-specific implications. Finally, Sect. 5 concludes the work and recommends how one may best apply the process and leverage the study’s insights.

2 Literature review

This study leverages existing research related to social network analysis, influencer identification, sentiment analysis, community detection, and both directional relationship and multilayer modeling between entities in SNA.

On some level, social ties and connections have been worthy of study dating back to antiquity, as evidenced by the presence of genealogies in ancient texts such as the Bible or Greco-Roman poems and histories. Freeman (2004) traces the development of social network analysis from early sociometric studies at Sing Sing prison (Moreno 1932) and the Hudson School for Girls (Moreno 1933) through its formal establishment as a rigorous discipline in the 20th century. In particular, four features characterize modern social network analysis: structural intuition based on ties linking social actors, grounding in systematic empirical data, employment of graphic imagery, and quantification via rigorous mathematical modeling (Freeman 2004).

Scott and Carrington (2011) define social network analysis as the specific logic behind the relationships that people choose to form and maintain, resulting in a social configuration that can be represented graphically. This SNA approach leverages connections between entities to construct a graphical representation of a network comprised of nodes and arcs, wherein the arcs convey the relative strength of connections (Legradi 2009), as determined by mathematical analysis. Within this context, Allard (1990) outlines two main goals of SNA: understanding the factors that affect relationships and their correlations, and ascertaining the effects of these relationships, including the possible identification of an informal leader.

An important aspect of social media culture research related to this work is the phenomenon of influencers. Zhang and Vos (2015) examine social media culture in depth and conclude that the most effective way to spread a message on social media is through highly influential users known as influencers. Influencers have acquired the reputation of being compelling and reliable sources of information and are connected to large numbers of users who follow, comment on, and share their messages.

Given a social network representation of people and their interactions, it is therefore important to identify these disproportionately influential individuals. Several studies (e.g., Bakshy et al. (2011); Erlandsson et al. (2016); Dewi et al. (2017); Bhavnani et al. (2021)) research methods to characterize the influence of nodes within networks. Such methods range in approach from direct observations such as counting the average number of interactions by others for a user’s Tweets (Erlandsson et al. 2016), to indirect inferences such as representing physical interactions among people as a network and modeling how quickly a virus would spread from an individual (Doerr et al. 2013). Of note, whereas many of the most influential users in a social network have a large number of followers, follower count alone is not a strong enough metric for quantifying influence (Erlandsson et al. 2016). Additionally, Pudjajana et al. (2018) identify several centrality metrics useful to compare the influence of nodes, which Jin (2020) demonstrates well. Also related to
influential node identification, Sheth et al. (2022) and Venkatesan and Prabhavathy (2019) study methods to discover anomalous users within social networks.

However, it is not sufficient simply to identify influencers; one must also characterize the messages that they share with followers. Sentiment analysis is a technique to label a message as either positive or negative, and it can effectively monitor users’ emotions towards a topic over time. Tsugawa and Ohsaki (2015) and Salehi et al. (2018) outline many of the methods to perform sentiment analysis. Featherstone and Barnett (2020) employ self-reported attitude scores to validate sentiment scores obtained from a comprehensive study on public opinion towards genome editing. Results are promising, although the strength of the relationship between attitude score and sentiment did vary between the subgroups sampled (Featherstone and Barnett 2020).

Moreover, sentiments can affect consumer behavior. As examples, both Gazdaggyori (2021) and Hamraoui and Boubaker (2022) apply sentiment analysis to Tweets related to financial stock performance. Although the growth of the studied stocks was inconsistent and relatively short-lived, both studies demonstrate that the sentiment of social media users reflects changes in consumer attitude that influence investor behaviors. The same phenomenon is observed in the pro-vaccine and anti-vaccine communities; each community contains its stable of influencers whose messages and sentiments produce the predictable effects on vaccination coverage in children (Featherstone et al. 2020).

Other aspects of SNA of interest to this research are topic modeling and community discovery. Topic modeling determines the most frequent topics of discussion in a collection of Tweets. The two primary topic modeling methods are Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). Kalepalli et al. (2020) directly compare LDA and LSA, and both Rahmadan et al. (2020) and Yang et al. (2021) demonstrate the efficacy of LDA for topic modeling using Twitter data. In a related work, Jiwanggi and Adriani (2016) detail methods for extracting a summary of topics from a collection of Tweets. Topic modeling is frequently useful to model the evolution of public discourse on topics of interest such as vaccination (Featherstone et al. 2020) or gene editing (Ji et al. 2022). Community discovery methods seek to identify the closely connected groups of users in a network, whether by individual communications or communications related to a common topic such as vaccine hesitancy (Ruiz et al. 2021). Various user features can help detect communities of connected users, as Pacheco et al. (2021) recently study.

Of interest to this research is the modeling of social networks with directional representation of interactions. Aiello et al. (2010) characterize representation decisions for constructing social networks and the links between users. Relationships and communication are often asymmetric, and directed social networks better represent these interactions. However, directed social networks are not a modeling panacea, and both Malliaros and Vazirgiannis (2013) and Tsopze and Domgue (2021) describe the challenges associated with their modeling and analysis.

For SNA, a multilayer network can represent the complexities of interactions better than a single-layer network model. Multilayer networks can include entity-specific layers and include edges or directed edges (i.e., arcs) to represent different interactions and the relative strength thereof. Figure 1 depicts an example of a multilayer network with a user layer, a topic layer, and inter-nodal arcs both within and between layers.

Fig. 1  Multilayer Social Network Representation of Entity-specific Layers and Directed-arc Representation of Interactions

Some SNA techniques are unique to multilayer networks, directed networks, or the combination of both. Kolda et al. (2005) define methods to quantify node influence in multilayer networks. Tang and Liu (2011) propose methods to detect communities in directed networks. This research leverages each of these contributions.

3 Methodology

Section 3.1 describes the analyzed datasets. Sections 3.2 and 3.3 contain the approaches for creating the user and topic layers in the multilayer network. Finally, Sects. 3.4 and 3.5 present the methods used to discover influential users and communities.

3.1 Description of datasets

This research employs four traditional datasets studied within the SNA literature, listed in Table 1, along with customized datasets created by sampling Tweets from each of the four named sets. The Tweets are almost exclusively
written in the English language, and the information associated with each Tweet consists of username, number of followers, date of account creation, and verification status. The topics of conversation are sports, entertainment, politics, and health, respectively.

Table 1  Summary of datasets and key features

Dataset name        # Tweets    Retweets    Replies
2018 World Cup      530,000     Yes         No
Game of Thrones     760,660     No          Yes
2016 US Election    42,013      Yes         Yes
COVID-19            179,108     No          Yes

The first step in this analysis is to demonstrate the analytical techniques on the well-understood datasets in Table 1. Such analysis can reveal differences in results pertaining both to the identification of influential users and to general network metrics. It can also provide insight into Tweet query practices and their effect on the resulting social network representation.

Next, we apply the techniques to customized datasets created by sampling a uniform number of Tweets from each of the named datasets. These customized datasets include more diverse topics and conversations, and they better represent the topical diversity of the Twittersphere. Moreover, such datasets help validate community identification and topic modeling methods because it is reasonable to expect topics and communities identified in the composite dataset to map to those in the four original datasets.

As mentioned in Sect. 1, the intended beneficiaries of this framework are entities without the budget required for more expensive APIs that collect large amounts of data about individual Tweets. As such, analytic techniques herein consider only the Tweet data available from a low-cost accessible API: Tweet text, username, and verification status. As is typical with most analyses, some minor data cleaning was necessary to ensure that subsequent analysis only considered Tweets with an associated username.

3.2 User network (layer) creation

Preliminary to the four-step analytic process, it is first necessary to generate a user network (or a network layer, in the case of a multilayer network) for a dataset of Tweets. Within such a layer, nodes represent users and edges represent relationships between users. As Sect. 2 discusses, this research adopts a directed network with arcs to model Twitter relationships because it better represents the directional relationships between users. The directed network models actions taken by users to show their relationships with others as an outbound arc and Tweets about a user or in response to a user’s Tweet using an inbound arc. In doing so, the directed network can distinguish a celebrity who does not Tweet frequently but has many followers from a bot or spammer that Tweets frequently about other users.

Edges or arcs within a social network have associated weights to convey the strength of the connection implied by interactions between users. All of the datasets within Table 1 include three types of user interactions: a user mentioning another user, a user Retweeting another user’s Tweet, and a user replying to another user’s Tweet. Although no formal direct measure of user relationships exists, these interactions can inform a proxy metric that represents the relative, implied strength of relationships via arc-specific weights.

Some interactions imply a closer relationship between users. This is evident from the ratio of likes to Retweets for nearly every Tweet in the Twittersphere. Tweets consistently have far more likes than Retweets (e.g., Perdana and Pinandito (2018)), implying that a Retweet conveys a stronger engagement with a Tweet than a like. Moreover, Tweets typically have fewer mentions in new Tweets than Retweets, and fewer replies than mentions, indicating increasing degrees of engagement.

For this reason, the inverse of the frequency with which replies, mentions, and Retweets occur can provide a suitable proxy for the strength of connection implied by an interaction. For example, if the distribution of replies, mentions, and Retweets was uniform, then any action would contribute the same weight (i.e., $1/0.\overline{3} = 3$) to an arc from one user to the author of the original Tweet. If the distribution were 14.3%, 28.6%, and 57.1%, a reply would contribute twice as much weight to the arc (i.e., 7) as a mention (i.e., 3.5) and four times as much as a Retweet (i.e., 1.75). If a user interacts with another user several times within a dataset, the net contributions of the interactions to the arc weight are additive. Among the responses to Tweets in the Table 1 datasets, the distribution of interactions consisted of 6.97% replies, 39.49% mentions, and 53.54% Retweets. Thus, each reply, mention, and Retweet contributes 14.35, 2.53, and 1.87 to an arc weight, respectively, for the generation of the user networks or user network layers.

3.3 Topic modeling and integration as a network layer

As Sect. 1 outlines, the first two of the four steps in the process are the discovery of topics and the construction of a topic-focused, directed multilayer network. The goal of implementing topic modeling alongside social network analysis is two-fold. First, topic modeling provides a general overview of the discussions contained within a dataset. Second, topic modeling can be used in conjunction with the existing network to connect users through the conversations which they are having, a feature that cannot be extracted directly from Twitter data. Thus, the inclusion of a topical layer with
the user network layer, combined with arcs indicating user participation in the topics, induces a multilayer network to more accurately represent both the direct and indirect connections among users participating in the discourse on the social media platform.

For the first step, this research conducts topic modeling using Latent Dirichlet Allocation (LDA). Pritchard et al. (2000) set forth LDA as an unsupervised clustering method to assign individual creatures (e.g., birds, people) to populations (e.g., species, tribes) based on genotype similarities. Blei et al. (2003) first applied LDA to topic modeling for text-based documents. The authors describe LDA as “a generative probabilistic model of a corpus”; it synthesizes a user-defined number of topics and populates them with the words that have the highest probabilities of belonging to them. The three key elements of an LDA model are the topics, documents, and corpus. Within this research, topics are the clusters to which statistical analysis will assign words from the Tweets. Documents are the individual Tweets appearing in the data, and the corpus is the complete collection of documents. Given $k$ topics and $V$ unique words in the corpus, LDA creates a $k \times V$ probability matrix $\beta$, where $\beta_{ij}$ represents the probability that topic $i$ includes word $j$.

Preprocessing of Tweets removes stop words, tokenizes the remaining text, and lemmatizes the individual words (i.e., tokens) to ensure only relevant text remains for topic discovery via LDA. Stop word removal deletes common words that provide no contextual meaning, such as articles, conjunctions, prepositions, and pronouns. Tokenization partitions the remaining text into words. Lemmatization replaces different forms of a word (e.g., runner, running, runs) with a common root word (e.g., run).

Only a user-defined number of topics $k$ is necessary to apply LDA to preprocessed Twitter data. Although the optimal number of topics depends on the data, a coherence metric can assess the effectiveness of a $k$-topic LDA model by assigning a score to the set of highest probability words in each topic based on their similarity and interpretability by a human. Thus, a line search on $k$ can identify the best number of topics to maximize coherence. Although several researchers (e.g., see Bouma (2009); Newman et al. (2010); Mimno et al. (2011)) have developed alternative coherence metrics, Röder et al. (2015) conducted an extensive, comparative study of such metrics and introduced two new metrics, identifying a superlative coherence metric the authors denoted as $C_V$. This research uses their recommended metric.

Once LDA is complete, the second step generates the directed multilayer network representation. An analyst names the topics after inspecting the highest probability words associated with each topic. Such a task is not arduous, given familiarity with the language-of-origin for the Tweet. Although this manual naming process is not strictly necessary because LDA identifies the topics, it does provide useful context for analysis. The new topic layer consists of a node for each topic, and the LDA results (i.e., $\beta$) inform arc creation between the user layer and the topic layer.

The two aspects of arc creation are which arcs to generate and how to weight those arcs. This research generates user-to-topic arcs for only the strongest Tweet-to-topic relationship for each of the user’s Tweets, as measured by a similarity index of words within a Tweet to each of the $k$ topics. For a given Tweet and a vector $s$ of length $V$, wherein $s_v$ is the number of times token $v$ appears within the Tweet, the Tweet-to-topic similarity index for each topic $i$ equals $\beta_i \cdot s$. As an aside, although it is possible for a single Tweet to relate to multiple topics rather than only its most relevant topic, such an alternative is perhaps a compelling sequel to this work, albeit a more computationally burdensome endeavor.

Additionally, arc generation only creates Tweet-to-topic arcs if the similarity index was in the top 25% of all such maximal indices for the corpus of Tweets. Doing so avoids establishing weak connections between users and topics. Although no formal research exists to determine such a threshold, future work could utilize labeled training data with a machine learning approach to explore better decisions in this space.

For this directed multilayer network, this research creates a pair of user-to-topic and topic-to-user arcs as determined by the similarity index. Inducing the opposite-direction topic-to-user arc represents scenarios wherein users scroll through a topic of conversation on Twitter and find another user via their topic-specific Tweets. The weight for each of the arcs in the generated pair is equal to the value of the similarity index.

After establishing connections between the user layer and the topic layer of the multilayer network, the final step to complete the multilayer network model is to represent the connections between the topics in the topic layer. For a given topic, the word probability vector is a vector of length $V$ that indicates in each entry the likelihood that a word is in that topic. From this definition, the cosine similarity between two topics is the cosine of the angle between their word probability vectors; Equation (1) calculates that angle for two vectors $A$ and $B$, with the cosine similarity as the argument of the inverse cosine.

\[
\cos^{-1}\left( \frac{A^{T} B}{\|A\| \, \|B\|} \right) \tag{1}
\]

Although the theoretical domain of Equation (1) is [−1, 1], only a range of [0, 1] is feasible for cosine similarities between topics; each vector is in the non-negative orthant because every element is a non-negative probability.

The interpretation of these values is as follows. A similarity of 0 means the vectors are orthogonal, so no inter-topic relationship exists; a value close to 1 results from nearly parallel vectors, indicating similar relative distributions of
word-to-topic probabilities for the two topics. A pair of directed arcs is generated between two topics if their cosine similarity exceeds 0.5. Differing from the user-to-topic connections, arcs may connect a single topic to multiple other topics.

The resulting directed multilayer network includes respective user and topic layers; weighted, directed arcs connecting users based on replies, mentions, and Retweets; pairs of weighted, directed arcs connecting users to topics based on the vocabulary of a user’s Tweets, with at most one user-to-topic connection formed by a single Tweet; and pairs of weighted, directed arcs connecting topics based on the cosine similarities of their respective word probability distributions.

Complementing the first two topic-focused steps of the process is the summarization of Tweets linked to each topic. This research creates extractive summaries rather than abstractive summaries, favoring the former for its simplicity. Moreover, the ability of an abstractive summary to generate unique thoughts is mitigated by the methods by which Twitter creates trending topics; they often present the most relevant Tweets within a conversation, an outcome similar to an extractive summary (Rudrapal et al. 2018). This research uses the TextRank algorithm to create extractive summaries. Created by Mihalcea and Tarau (2004), it applies a graph-based ranking technique that induces a graph wherein nodes represent sentences (i.e., Tweets) and edges are weighted by a user-defined sentence similarity metric. This work utilizes the better-performing metric (i.e., BM25) proposed by Barrios et al. (2015) in lieu of the alternatives originally set forth by Mihalcea and Tarau (2004). In comparison, BM25 considers the inverse frequency of words within a document to increase the relative similarity metric for documents containing words that are rare in the topic-specific corpus.

For the TextRank generated graph, PageRank subsequently identifies the most important sentences for inclusion in the extractive summary of each topic. Page et al. (1999) propose the PageRank algorithm, used here to calculate the most important sentences for extractive summaries. The algorithm repeatedly applies an extended random walk on a graph to determine the long-term probabilities of residing at each node. In a random walk, a simulated entity sequentially travels from one node to an adjacent node with a probability equal to the arc weight, relative to the total weights of arcs emanating from the current node. The authors modify the adjacent step probabilities of the random walk to create small, nonzero probabilities of traversing from a given node to any other (i.e., non-adjacent) node in the network to mitigate the effect of disconnected network components on long-term probability calculations. The application of multiple random walks from initial entity locations determined via a uniform distribution over the nodes more notably mitigates that effect. Augmenting the list of the highest probability words for each topic identifiable via $\beta$, this extractive summarization of the most relevant topic-specific Tweets provides additional context regarding topics and reduces the creative, cognitive labor required to analyze Twitter data scrapes.

3.4 Influential user identification

For the third step in the process, it is worth noting that there are myriad methods to identify influential nodes within a network. Although many such methods produce reasonable results for smaller, highly connected networks, the relative performance of the methods depends notably on the network. Given that this research examines larger networks expected to be relatively disconnected, it is relevant to evaluate alternative methods to identify influential nodes. Testing within Sect. 4 compares rankings via the PageRank algorithm, the Hyperlink-Induced Topic Search (HITS) algorithm, betweenness centrality, and eigenvector centrality. For each of these techniques, a higher computed value indicates greater influence.

As described in Sect. 3.3, the PageRank algorithm uses long-term node visit probabilities for a random walk to rank order the users and infer a relative degree of influence.

Kleinberg (1999) modifies the PageRank algorithm to create the HITS algorithm to identify influential nodes. The author conjectures a conceptual shortcoming of the PageRank algorithm for directed network analysis; whereas PageRank readily identifies authority nodes having many inbound arcs, it can underestimate the influence of hub nodes having many outbound arcs. It is arguably influential to direct connections, not just to be the directed target of connections. Accounting for both authority and hub behaviors of nodes, the HITS algorithm identifies a root set of nodes via a targeted search query and augments it with all nodes adjacent via outgoing arcs from the root set. For this larger subgraph, the algorithm iteratively updates each node’s authority and hub scores to be equal to the sum of the hub and authority scores of nodes respectively connected to or from the node, until convergence. The HITS algorithm yields two metrics, one each for authority and hub rankings. In practice, researchers often average these scores to enable a direct comparison with other influential node identification methods, and this research does likewise.

As a third method to identify influential nodes, betweenness centrality (Freeman 1977) computes for a given node $v$ the frequency with which it is on one of the shortest paths between a pair of nodes $(s, t)$, considered over all node pairs $s, t \in V$, as per Equation (2).

\[
c_B(v) = \sum_{s,t \in V} \frac{\sigma(s, t \mid v)}{\sigma(s, t)} \tag{2}
\]

For social network analysis, betweenness centrality computations use the inverse of edge weights as edge distances
because larger weights indicate strong connections that would conceptually correspond to a shorter distance (i.e., an edge more likely to be traversed). A notable downside to this method is that it requires calculating the shortest paths between all pairs of nodes. Although either a repeated application of Dijkstra’s Algorithm or the Floyd–Warshall algorithm can run in O(n^3) time (Ahuja et al. 1993), such effort remains computationally expensive for larger networks, and both algorithms require modification to identify alternative optima for shortest (s, t)-paths.
Finally, eigenvector centrality (Landau 1895) provides another alternative to identify influential nodes. This metric leverages the idea that nodes of high influence are adjacently connected to other nodes of high importance. Given an N × N node adjacency matrix A, wherein A_ij is equal to the weight of the connection between nodes i and j, solve the eigenvector equation Ax = λx. Designating λ as the largest eigenvalue, the corresponding vector x indicates the respective influence scores for each of the nodes. This metric is conceptually simple and easy to calculate.

3.5 Community detection

For the fourth step in the proposed SNA process, multiple methods for community detection exist in the literature. Among them, this research tests and compares the Greedy Modularity Algorithm (GMA) and the Leiden algorithm (Traag et al. 2019).
Before discussing these methods, it is important to note the characteristics of data that inform such choices. In this research, communities are identified using only the information contained within the directed multilayer network’s nodes and arcs. Given the nature of social media data, especially data gathered via broad queries of Tweets among many unique users, the resulting social network structures tend to be fragmented. Even when querying data by common keywords, the likelihood of capturing a back-and-forth conversation via Tweets between two or more users is exceedingly small, considering the millions of Tweets posted daily. Accordingly, the idealized version of a social network community as a clique subgraph having k nodes and k(k − 1)/2 edges (or k(k − 1) directed arcs) is elusive. Rather, community detection methods must consider the implicit networks of users who are not in direct conversation with each other, but who share the same topics of conversation or common connections with other users. This low level of direct connectivity motivates the use of agglomerative community detection methods, wherein every node begins as a sole member of its own community, and an algorithm iteratively conjoins smaller communities to improve the collective strength of the respective communities, as measured via a customized metric.
The first agglomerative technique this research uses is the Greedy Modularity Algorithm (GMA). The GMA is a modification of the Clauset–Newman–Moore (CNM) algorithm set forth by Clauset et al. (2004). Like CNM, GMA is a heuristic approach to maximize a modularity metric that measures the strength of community classification. Whereas Clauset et al. (2004) designed the CNM algorithm for undirected networks, the GMA seeks to maximize the modularity metric in Equation (3), adapted for directed networks when implemented via the NetworkX library (Hagberg et al. 2008) for the Python programming language.

$$Q = \sum_{c=1}^{n} \left( \frac{L_c}{m} - \gamma \left( \frac{k_c^{in} k_c^{out}}{2m} \right)^2 \right) \tag{3}$$

Therein, L_c is the number of arcs within community c; m is the total number of edges in the graph; k_c^in and k_c^out are the sums of the respective in-degree and out-degree weights in community c; and γ is a positive, user-defined resolution parameter to balance the importance of edges within a community and edges connecting communities. Smaller γ-values yield fewer, larger communities, and larger γ-values yield more, smaller communities (Newman 2016). At initialization, there are n = N communities, and Q ≤ 0 because L_c = 0 for c = 1, ..., n. Within an iteration, the CNM algorithm calculates the net change to network modularity that would result from conjoining pairs of communities via an edge connecting them. If the maximal such change to modularity is positive, the edge is added and the algorithm proceeds to the next iteration; otherwise, GMA terminates with identified community structures.
The Leiden algorithm is an agglomerative community detection algorithm for undirected, fully connected networks. Fortunately, Malliaros and Vazirgiannis (2013) discussed transformations one can apply to a directed network to enable the application of the Leiden algorithm. First, edges replace pairs of equal-weight, opposite direction arcs between nodes. Second, edges replace singular arcs between nodes, with the same total edge weight. This transformation implies two-way connections that do not exist in the data, but it allows for the exploration of a larger number of community detection methods for which the results can be validated in comparison to the original social network structure. Third, a minimal number of low-weight edges augment the social network to ensure all nodes are fully connected. These artificial connections minimally modify the network representation in a manner that should be negligible but for which the results of any community discovery algorithm should be validated against the original network.
For an undirected multilayer representation of the directed multilayer network via the aforementioned steps, the Leiden algorithm (Traag et al. 2019) can detect


communities via an agglomerative, modularity-focused approach. It also begins by assigning each node to its own community. Each iteration consists of three steps: moving nodes locally, refining a partition of the network, and aggregating nodes within the network. The first step reassigns individual nodes to the community that yields the largest increase in network modularity, partitioning the network into larger, potential communities. The second step refines each of the partitions by re-agglomerating its nodes via stochastic, metric-improving assignments. The third step aggregates nodes within each component of the refined partition. The iteration terminates by assigning the aggregate nodes to their aligned component in the unrefined partition from the first step.
Modularity is not the only metric-of-interest to assess community detection algorithms. Other useful metrics include partition coverage and partition performance. Partition coverage is the ratio of the number of intra-community edges to the number of edges in the graph. The partition coverage metric favors community partitions with few edges connecting communities. The partition performance metric is the ratio of the combined number of intra-community edges and possible inter-community non-edges to the total possible edges in the graph. For the networks in this research, both partition coverage and partition performance scores should be high because social networks representing Twitter data tend to be highly disconnected; most graph partitions would detect communities which isolate many of the fragmented components. Moreover, such networks are not often dense, corresponding to a much larger number of potential edges than actual edges.

4 Testing, results, and analysis

In presenting and discussing the results of applying the SNA process Sect. 3.2 proposes, Sect. 4.1 initially presents visualizations for selected named datasets from Table 1 to derive high-level insights regarding the user layers. Section 4.2 details the results of applying LDA with respect to the number of topics and the corresponding LDA coherence, and it subsequently illustrates a topic layer for an aggregated dataset. Section 4.3 compares the four methods for identifying influential users (i.e., PageRank, HITS, betweenness centrality, eigenvector centrality). Section 4.4 compares GMA and Leiden for detecting communities, both for a single-layer user network and the multi-layer network this research proposes for SNA. Section 4.5 concludes with an examination of query size on the identification of influential users via this process, highlighting the practical implications thereof.

4.1 User network layer creation

A sampling of 15,000 Tweets from each of the named datasets in Table 1 yields the information necessary to create user network layers. Table 2 presents the summary statistics for the user network layers, wherein arcs correspond to replies, mentions, and Retweets.
As a first observation, it is possible to have more nodes and arcs than sampled Tweets if Tweets convey more than one relationship between users (e.g., if a reply to one user’s Tweet mentions another user). Such is the case with the 2018 World Cup dataset and nearly the case with the 2016 US Election dataset. By contrast, both the Game of Thrones and COVID-19 datasets yield fewer users and arcs for the same number of Tweets, implying a difference in the nature of communications. Visual depictions of the user network layer can help garner insight in this regard.
While network visualizations can be misleading since node placement is stochastic and semi-arbitrary, they can help identify users with many strong relationships with others. Figure 2 depicts the user network layer for a 5,000 Tweet sample from the 2016 US Election dataset. The graphical depiction results from the Fruchterman-Reingold force-directed algorithm, which represents nodes connected by an arc closer together, reducing arc crossover. Red nodes depict the five most influential users as per the PageRank metric, yellow nodes represent the five users with the highest number of incoming arcs (i.e., authority nodes), green nodes are the users with the highest number of outgoing arcs (i.e., hub nodes), and blue nodes represent all other users.
Although Sects. 4.3 and 4.4 will formally identify influential users and detect user communities, the user network layer depiction does provide preliminary insights. Large clusters of users in these graphs reliably imply influence and community membership. Visible within Fig. 2, clusters of users surround the influential nodes, indicating the strength of their connections and implying a community their communications can affect rapidly. Many nodes also surround the authority nodes, although the implied community is less dense. In contrast, the hub nodes are all in the relative center of the graphical depiction; this graphical depiction understates the potential influence of hub nodes that the HITS algorithm seeks to characterize.

Table 2 User network layer characteristics for 15,000 sampled tweets

Sampled dataset      # Nodes    # Arcs
2018 World Cup       15,073     16,522
Game of Thrones       6,416      4,432
2016 US Election     12,565     13,603
COVID-19              8,457      6,565
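As an illustration of why a sample can yield more nodes or arcs than Tweets, the following minimal sketch accumulates arcs from replies, mentions, and Retweets. The record fields (`author`, `reply_to`, `retweet_of`, `mentions`) are hypothetical names chosen for this example, and raw interaction counts stand in for arc weights; the weighting scheme this research applies is not restated here.

```python
from collections import Counter


def build_user_arcs(tweets):
    """Map each (source_user, target_user) pair to its interaction count.

    Arcs point from the Tweet's author to every user the Tweet replies to,
    Retweets, or mentions (field names are assumptions for illustration).
    """
    arcs = Counter()
    for tweet in tweets:
        author = tweet["author"]
        targets = []
        if tweet.get("reply_to"):
            targets.append(tweet["reply_to"])
        if tweet.get("retweet_of"):
            targets.append(tweet["retweet_of"])
        targets.extend(tweet.get("mentions", []))
        for target in targets:
            if target != author:  # skip self-references
                arcs[(author, target)] += 1
    return arcs


# Three toy Tweets; the first conveys two relationships at once.
sample = [
    {"author": "alice", "reply_to": "bob", "mentions": ["carol"]},
    {"author": "dave", "retweet_of": "bob"},
    {"author": "alice", "mentions": ["bob"]},
]
arcs = build_user_arcs(sample)
nodes = {user for arc in arcs for user in arc}
print(len(sample), "Tweets ->", len(nodes), "nodes and", len(arcs), "arcs")
```

In this toy sample, a reply that also mentions a third user contributes two arcs, so three Tweets produce four nodes.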


Fig. 2 User Network Layer for 5,000 Tweets Sampled from the 2016 US Election Dataset, Depicted via the Force-Directed Algorithm

Fig. 3 User Network Layer for 5,000 Tweets Sampled from the COVID-19 Network, Depicted via the Force-Directed Algorithm

Table 3 User network layer statistics

Sampled dataset      Network density    Average degree    Average weighted degree
2018 World Cup       7.36 × 10^-5       2.190             0.295
Game of Thrones      1.08 × 10^-4       1.363             0.628
2016 US Election     8.63 × 10^-5       2.169             0.375
COVID-19             8.97 × 10^-5       1.533             0.338

For comparison, Fig. 3 depicts the user network layer for the 5,000 Tweet sample from the COVID-19 Network dataset. Similar to Fig. 2, influential users and authority nodes are on the outside of the graph, with many nodes surrounding them. In contrast to Fig. 2, Fig. 3 contains a densely packed outer shell of nodes, which suggests a far more fractured network of individual users connecting with a small number of other users. Figure 2, however, reflects a network by which a relatively small number of influential nodes reaches a large number of users. This is visually evident by the node clusters surrounding the influential users and authority nodes. Of note, the top five hub nodes in Figs. 2 and 3 are near the center of the graph with fewer adjacent nodes, seemingly undervaluing their potential influence.
To illustrate the need for both visualizations and quantitative analysis using established metrics, consider the summary of network statistics in Table 3. Despite having similar


network densities, the user network layers for the 2016 US


Election and COVID-19 datasets exhibit very different user
behaviors in Figs. 2 and 3. The average degree metric better
conveys what the user network layer visualizations depict.
The higher degree in the 2016 US Election network shows
that, on average, each user interacts with more other users,
relative to the COVID-19 network. Comparing the average
weighted degrees for those two networks, the relatively close
values indicate that, although COVID-19 network users
interact with fewer other users, their connections with them
are stronger, on average, than the 2016 US Election users.
Even absent a visualization of the Game of Thrones user
network layer, the statistics in Table 3 convey that users have
very strong connections with very few other users, relative
to the other datasets.

Fig. 5 LDA Coherence versus number of topics for 15,000 tweets sampled from the 2018 World Cup Dataset
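The kinds of statistics reported in Table 3 can be sketched as follows. This is a minimal sketch assuming the standard directed-graph definitions (density as arcs over ordered node pairs, degree as in-degree plus out-degree); the function name and the toy arc weights are illustrative, and the exact formulas and weighting behind Table 3 are not restated in this section.

```python
def layer_statistics(arcs, nodes=None):
    """Return (density, average degree, average weighted degree).

    `arcs` maps (source, target) pairs to weights. Definitions assumed here:
    density = m / (n * (n - 1)) for a directed graph, and each arc adds one
    unit of out-degree to its source and one of in-degree to its target.
    """
    if nodes is None:
        nodes = {u for (u, _) in arcs} | {v for (_, v) in arcs}
    n, m = len(nodes), len(arcs)
    density = m / (n * (n - 1)) if n > 1 else 0.0
    average_degree = 2 * m / n if n else 0.0
    average_weighted_degree = 2 * sum(arcs.values()) / n if n else 0.0
    return density, average_degree, average_weighted_degree


# Toy network: four users, one of whom ("d") has no interactions.
toy_arcs = {("a", "b"): 1.0, ("b", "a"): 0.5, ("a", "c"): 2.0}
density, avg_degree, avg_weighted_degree = layer_statistics(
    toy_arcs, nodes={"a", "b", "c", "d"}
)
print(density, avg_degree, avg_weighted_degree)
```

Two networks can share a similar density while differing in average degree or average weighted degree, which is why the discussion of Table 3 considers all three together.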

4.2 Topic modeling and integration as a network layer

Although Latent Dirichlet allocation (LDA) discovers k topics of discussions among a corpus, an analyst must identify the optimal number of topics to extract. This number depends on the data; a collection of Tweets gathered via a very focused, topic-related query manifests fewer topics. As Sect. 3.3 discussed, topic coherence quantifies the performance of LDA. A model exhibiting larger coherence should yield higher interpretability upon inspection. To illustrate this effect, Figs. 4 and 5 present the LDA coherence for k = 2, ..., 25 for the respective samples from the Game of Thrones and 2018 World Cup datasets.
Within Fig. 4, the coherence scores for the Game of Thrones data range from 0.14 to 0.30, and they generally increase over the full range of k explored. In contrast, the coherence scores for the 2018 World Cup exhibit higher average values within a range of 0.25 to 0.33, but the effect of the number of topics is more nuanced. There is not a readily discernible trend, indicating a more exhaustive line search on k such as simultaneous search is appropriate when tuning LDA performance. Parsimony suggests k = 10 topics as reasonable for the 2018 World Cup dataset.

Fig. 4 LDA Coherence versus number of topics for 15,000 tweets sampled from the Game of Thrones Dataset

With LDA as an unsupervised machine learning technique, one can at best conjecture about the reason for the difference in its performance in Figs. 4 and 5. For example, the Game of Thrones dataset tends to manifest opinionated reactions to specific episodes of the show, resulting in relatively differentiated, topic-focused language across the Tweets. In contrast, the 2018 World Cup dataset contains Tweets about a sequence of games, but user descriptions of the actions and players from game-to-game will be less variable. That is, the plot varies among television show episodes more than football matches. To visually depict this difference, Fig. 6 shows the intertopic distances of LDA models for both datasets when k = 10 (Sievert and Shirley 2014). The lack of clear topic separation present in the 2018 World Cup Dataset can be seen through the cluster of overlapping topics, especially when contrasted with greater relative distance between topics present in the Game of Thrones Dataset.
To further convey this effect, Table 4 presents the top 10 highest topic-specific probability words for five of the topics modeled by LDA when k = 10. Whereas words within each topic exhibit some intuitive relationship, the reuse of some words in several of the topics suggests that many of the Tweets, regardless of their underlying message, use the same verbiage. As a result, the LDA model struggles to clearly discern distinct topics of discussion. Since the best coherence scores for both datasets are approximately 0.3, which reflects low performance for LDA models, topics will be more cryptic and difficult for an analyst to manually label.
Additionally, if coherence is low for the LDA model, Tweets are less likely to exhibit a strong connection with a topic because the words with the highest probability of topic membership will have less semantic connection with each


other. As a direct result, Tweets may be categorized into topics for which the fit is not ideal. Revisiting Table 4, the Tweet “Kylian Mbappé will donate everything he earns playing for France at the World Cup to charity” was connected with Topic 0. Intuitively, this Tweet seems better suited for membership in Topic 4 because it refers to a specific player and his country. However, the tokens ‘world’ and ‘cup’ exhibited stronger connection to Topic 0. Such counter-intuitive topic modeling results can affect both influential user identification and community detection, and they motivate the use of broad queries to facilitate higher topic coherence scores for LDA, i.e., more discernible topic modeling.

Fig. 6 Intertopic distance maps of LDA models

Table 4 Selection of topics from LDA model of 2018 World Cup Data

Topic no   Top 10 highest probability words
Topic 0    World, cup, good, sorry, Russia, champion, fifa, congratulation, fra, win
Topic 1    Eng, ronaldo, messi, world, final, England, complete, premierleague, cup, paulpogba
Topic 3    Penalty, frabel, win, team, save, proud, France, eng, time, threelion
Topic 4    Mbappe, fra, player, Kylian, young, award, golden, fifa, score, ball
Topic 9    Team, France, congratulation, win, African, dear, Khaledbeydoun, cut, racism, xenophobi

To determine the effectiveness of LDA on a dataset more representative of a generic query of Tweets, samples were taken from each dataset and conjoined into an aggregate collection of Tweets, hereafter denoted as the Joint dataset. The higher diversity of word usage and topic discussion in the Joint dataset enabled LDA models to attain higher coherence values, as exhibited in Fig. 7.

Fig. 7 LDA coherence versus number of topics for the joint dataset

The range of coherence scores in Fig. 7 generally increases with k, manifesting higher coherence scores just below 0.50. The increase in coherence past 50 topics indicates that, as conjectured, a larger amount of topic separation is possible with a more diverse dataset of Tweets. This finding is important when creating a directed multilayer network that includes a topic layer to help identify influencers and


communities; larger coherence scores better justify connec-


tions from users to topics via their Tweets.
Table 5 presents the top 10 highest topic-specific prob-
ability words for five of the topics modeled by LDA when
k = 10. The improved topic separation is apparent when
inspecting word membership in topics: in this LDA model,
words strongly associated with the different topics appear
to come from each of the different datasets.
High LDA coherence scores and the well-separated nature
of the identified topics allow for meaningful extractive topic
summarization. This activity reduces the work required of
an analyst to infer meaning for a topic. For example, within
Table 5, both Topics 3 and 8 appear to discuss the forecast
of the 2016 US Election, but differentiation of the topics is
elusive using only the highest probability words. Extrac-
tive topic summaries characterized Topic 3 as “Now, 95%
for Trump: Live Presidential Forecast – Election Results
2016 – The New York Times.”, whereas they characterized
Topic 8 as “RT @DrewLinzer: My final 2016 presidential
election forecast: Clinton 323 - Trump 215.” These summa-
ries provide added context to convey that Topic 3 is mainly
concerned with the conversation surrounding a forecast projecting Trump to win, whereas Topic 8 is discussing a different poll projecting Clinton as the winner. That insight resulting from summaries obviates the need for an analyst to conduct a manual inspection of Tweets. In addition, the intertopic distance map in Fig. 8 reveals that these two topics do occupy distinct spaces despite initially appearing similar. Summaries also provide context when the word membership in a topic makes the topic difficult to identify, in general. For example, the summary of Topic 5 is, “80% of your team is African, cut out the racism and xenophobia. Africa did not win the #Worldcup France did. Africa did not even win it for France”, revealing both the controversy aligned with Topic 5 and the countering stances of the users engaged in the discourse.

Fig. 8 Intertopic distance map for the joint dataset

By augmenting the user layer with the topic layer in a multilayer network, analysis may discover connections between users through the topic layer in the absence of direct connection between them. This modeling characteristic more accurately depicts the dynamics of a social network because users engaged in similar conversations have a higher likelihood of seeing each other’s Tweets; such connections are not direct, but are justified in the model to represent the weaker, more distant relationships.
A visualization of the topic layer for the LDA model referenced in Table 5 can be seen in Fig. 9. Of note, Topics 2 and 4 are a part of the topic layer, but they are not depicted in Fig. 9 because they are not connected to any other topics, as per the methodology set forth in Sect. 3.3. The weights on the edges in Fig. 9 represent the cosine similarities between topics and are color mapped to show the relative strength of topical similarities. Interestingly, Topics 6 and 8 are not directly connected, but they are connected through other topics with which they are similar. This illustration demonstrates the ability of the topical layer to model relational intricacies in the conversations.

4.3 Influential user identification results

Preliminary analysis applied PageRank and eigenvector centrality to identify influential users, both with and without

Table 5 Selection of topics from LDA model of joint data

Topic no   Top 10 highest probability words
Topic 1    Like, watch, season, come, episode, time, atch, fifaworldcup, people, go
Topic 3    New, election, forecast, york, presidential, result, times, final, live, fra
Topic 5    France, win, team, election, fra, fifaworldcup, congratulation, forecast, African, trump
Topic 8    Clinton, forecast, chance, poll, election, gt, good, trump, fivethirtyeight, vote
Topic 10   Forecast, case, update, election, late, poll, death, new, today, coronavirus
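To make the topic-layer construction more concrete, the sketch below links topics whose word distributions have a cosine similarity at or above a cutoff. The topic labels, word probabilities, and the `threshold` value are hypothetical, and the rule that decides which topics are connected is the one set forth in Sect. 3.3, which may differ from the simple cutoff assumed here.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse word-probability dictionaries."""
    dot = sum(p[word] * q[word] for word in set(p) & set(q))
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def topic_layer_edges(topic_word_probs, threshold):
    """Return {(topic_a, topic_b): similarity} for sufficiently similar pairs."""
    edges = {}
    topic_ids = sorted(topic_word_probs)
    for index, topic_a in enumerate(topic_ids):
        for topic_b in topic_ids[index + 1:]:
            similarity = cosine_similarity(
                topic_word_probs[topic_a], topic_word_probs[topic_b]
            )
            if similarity >= threshold:  # assumed connection rule
                edges[(topic_a, topic_b)] = similarity
    return edges


# Hypothetical topics with made-up probabilities (not values from the LDA model).
toy_topics = {
    "forecast_A": {"election": 0.5, "forecast": 0.5},
    "forecast_B": {"forecast": 0.5, "poll": 0.5},
    "television": {"season": 0.6, "episode": 0.4},
}
topic_edges = topic_layer_edges(toy_topics, threshold=0.3)
print(topic_edges)
```

In this toy case the two forecast topics share one word and are joined by a weighted edge, whereas the television topic shares no vocabulary with either and remains unconnected, analogous to the isolated topics omitted from the topic layer depiction.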


the topic layer, to assess its impact. Noting the quality of topic identification via LDA can affect the identification of influential users, Table 6 presents the top ten identified influential users for the COVID-19 dataset, for which the topic coherence scores were low and the topic separation was relatively weak.

Fig. 9 Topic layer connections from Joint data show affiliations of topics and the strength of their connections

Findings within Table 6 vary by method and network type. Although there is not a single ‘correct’ answer to assess the quality of methods, some observations regarding influential user characteristics can assist in determining their relative performance. For example, both PageRank and eigenvector centrality identify several highly influential bots in the multilayer network. This result is counter-intuitive because these bots either scrape data or share news articles but do not actively engage in discussion or offer views to stimulate conversation from other users. Moreover, many of these bots reply with information (e.g., a requested statistic) to users who mentioned them. This dynamic induces an artificially high number of connections with other users. While this information can be useful to an analyst, bots such as these often cannot hold opinions and are therefore of less interest to this research.
Applying PageRank or eigenvector centrality to only the user network layer identified a number of high-profile politicians and government organizations as being influential. This outcome is logical, given the nature of data concerning COVID-19. Although these results are similar, eigenvector centrality identifies as its most influential user an unverified user with fewer than 1000 followers. Such a conclusion seems conceptually unlikely, and the PageRank outcomes do not comport with it.
Whereas topic inclusion exhibited a negative impact on influential user identification when the LDA model was poor, results are more promising for the Joint dataset, which has better coherence and topic separation. For the user layer only and the multilayer network, respectively, Tables 7 and 8 present the top ten identified influential users for the Joint dataset, as determined by PageRank, HITS, betweenness centrality, and eigenvector centrality.
The effect of both the topic layer and the method of influence ranking is evident. Within Table 7, identifying influential users via only the user network layer yields no users common to every ranking, two users (i.e., “Five Thirty Eight” and “GMA”) common to three rankings, and three users (i.e., “538Politics”, “Ginger_Zee”, and “Author”) common to two rankings. Different node properties influence the various ranking methods, and all but the HITS algorithm

Table 6 Top Influential Users for COVID-19 Dataset via the User Network Layer and the Multilayer Network, using Selected Techniques

        User network layer                    Multilayer network
Rank    PageRank         Eigenvector          PageRank               Eigenvector
1       Donald Trump     Unverified user      Donald Trump           News Blog
2       YouTube          Donald Trump         ANI                    ANI
3       Boris Johnson    Kamala Harris        Bot                    Bot
4       WHO              Joe Biden            News Blog              Bot
5       Change           Tamara McCleary      Global Pandemic.NET    Journalist
6       thehill          Business Writer      Journalist             Data Bot
7       CDCgov           Journalist           Medical Journal        Data Bot
8       GOP              CPHO Canada          Data Bot               Medical Journal
9       Narendra Modi    Boris Johnson        Journalist             Global Pandemic.NET
10      Joe Biden        DrRP Nishank         Data Bot               Journalist
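For readers who want to see how a PageRank ranking of this kind arises, the following minimal sketch runs a power iteration on a weighted, directed network. It is not the implementation this research used: the `damping` value of 0.85 is a conventional default assumed here, the function and argument names are illustrative, and the toy arcs are invented. Each step moves along an outgoing arc with probability proportional to its weight, and otherwise jumps to a uniformly chosen node, which keeps disconnected components from trapping the walk.

```python
def pagerank(arcs, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a weighted directed graph.

    `arcs` maps (source, target) pairs to positive weights.
    """
    nodes = sorted({u for (u, _) in arcs} | {v for (_, v) in arcs})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (source, _), weight in arcs.items():
        out_weight[source] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Nodes without outgoing arcs spread their rank uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (source, target), weight in arcs.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy network: three users mention or Retweet "bob"; "bob" replies to "alice".
toy_arcs = {
    ("alice", "bob"): 2.0,
    ("carol", "bob"): 1.0,
    ("dave", "bob"): 1.0,
    ("bob", "alice"): 1.0,
}
scores = pagerank(toy_arcs)
top_user = max(scores, key=scores.get)
print(top_user, round(scores[top_user], 3))
```

The user who receives the most incoming weight ranks first, and a user who is referenced by a highly ranked user inherits part of that score, which is consistent with high in-degree accounts appearing near the top of PageRank rankings.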


Table 7 Top Influential Users for the Joint Dataset via the User Network Layer

Rank    PageRank            HITS               Betweenness         Eigenvector
1       Nate Silver 538     Donald Trump       GMA                 538politics
2       538Politics         Unverified user    Five Thirty Eight   matthewjdowd
3       Five Thirty Eight   GOP                Ginger_Zee          RobMarciano
4       FIFA World Cup      Unverified user    ringer              rickklein
5       Nate_Cohn           Unverified user    imarleneking        Five Thirty Eight
6       270toWin            Unverified user    SkyNews             Ginger_Zee
7       Khaled Beydoun      Unverified user    Professor           Peggynoonannyc
8       BBC MOTD            Unverified user    Lawrence            GMA
9       GMA                 Unverified user    Author              Author
10      YouTube             Unverified user    ABC San Diego       rachel_handler
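The agreement among the four rankings in Table 7 can be tallied with a short script. The sketch below transcribes the table and counts how many rankings each account appears in; it assumes that identical labels denote the same account, compares labels case-insensitively, and excludes the anonymized “Unverified user” placeholder, which stands for different accounts.

```python
from collections import Counter

# Rankings transcribed from Table 7 (user network layer, Joint dataset).
TABLE_7 = {
    "PageRank": ["Nate Silver 538", "538Politics", "Five Thirty Eight",
                 "FIFA World Cup", "Nate_Cohn", "270toWin", "Khaled Beydoun",
                 "BBC MOTD", "GMA", "YouTube"],
    "HITS": ["Donald Trump", "Unverified user", "GOP"] + ["Unverified user"] * 7,
    "Betweenness": ["GMA", "Five Thirty Eight", "Ginger_Zee", "ringer",
                    "imarleneking", "SkyNews", "Professor", "Lawrence",
                    "Author", "ABC San Diego"],
    "Eigenvector": ["538politics", "matthewjdowd", "RobMarciano", "rickklein",
                    "Five Thirty Eight", "Ginger_Zee", "Peggynoonannyc", "GMA",
                    "Author", "rachel_handler"],
}
PLACEHOLDER = "unverified user"


def ranking_overlap(rankings):
    """Group accounts by the number of rankings (two or more) they appear in."""
    counts = Counter()
    for users in rankings.values():
        distinct = {user.lower() for user in users} - {PLACEHOLDER}
        counts.update(distinct)
    grouped = {}
    for user, appearances in counts.items():
        if appearances >= 2:
            grouped.setdefault(appearances, []).append(user)
    return {appearances: sorted(users) for appearances, users in grouped.items()}


overlap = ranking_overlap(TABLE_7)
for appearances in sorted(overlap, reverse=True):
    print(appearances, "rankings:", ", ".join(overlap[appearances]))
```

The same function can be applied to the columns of any of the ranking tables to compare methods or network models.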

Table 8 Top Influential Users for the Joint Dataset via the Multilayer Network

Rank    PageRank            HITS                Betweenness         Eigenvector
1       FIFA World Cup      Unverified user     Five Thirty Eight   Unverified user
2       Nate Silver 538     Unverified user     Nate Silver 538     Unverified user
3       Five Thirty Eight   Unverified user     Unverified user     Five Thirty Eight
4       Khaled Beydoun      Unverified user     Unverified user     Unverified user
5       BBC MOTD            Five Thirty Eight   Unverified user     Unverified user
6       ManUtd              Unverified user     katz                Unverified user
7       Donald Trump        Unverified user     Unverified user     Unverified user
8       YouTube             Unverified user     Unverified user     Unverified user
9       brfootball          Unverified user     Unverified user     Unverified user
10      HNS_CFF             Unverified user     Unverified user     Unverified user

identify top influential users who have verified Twitter accounts. Section further examines this phenomenon.
Within Table 8, the rankings determined via the multilayer network are notably different. PageRank identifies six of the same top ten influential users that it found with only the user network layer. However, the remaining three methods generally identify low profile, unverified users as being highly influential.
When identifying influential users via either the user network layer only or the multilayer network, PageRank outperforms the other methods based on three factors. First, it exhibits relative consistency in identifying some influencers. Second, many of the users PageRank identifies have verified Twitter accounts. Third, many of the same users have hundreds of thousands if not millions of followers. Thus, these influential users have a high in-degree because other users frequently mention them or Retweet their Tweets.
Another characteristic difference between the two sets of rankings is that rankings leveraging only the user network layer tend to identify influential users related to politics, news, or entertainment, whereas the rankings from the multilayer network identify influential users related to politics and sports. This outcome implies that users Tweeting about sports are more likely to be connected via their topics of conversation than via direct conversations, and it reveals opportunities for marketing sports brands and merchandise that a company might otherwise overlook.

4.4 Community detection results

Whereas topic modeling can help find users having specific, topical interests, community detection finds the groups of users having more generally related interests. In doing so, one may design branding or product marketing material for a broader community rather than a topical interest group, thereby engaging with a larger set of potential customers. Of interest is the merit of the directed multilayer network model for detecting communities of users.
As discussed in Sect. 3.5, this research applies both the Greedy Modularity Algorithm (GMA) and the Leiden algorithm to detect communities of users, both for the user network layer only and for the multilayer network. That discussion noted the potential disadvantages of applying the Leiden algorithm to the directed multilayer network: the algorithm applies to undirected networks, so selected transformations are necessary that may reduce model efficacy.
For the 2016 US Election dataset, Table 9 reports the number of communities, modularity, partition coverage, and partition performance for the aforementioned combinations of network models and community detection methods.


Table 9  Community Detection Results for the 2016 US Election Dataset and Alternative Network Models & Detection Algorithms

               User network layer      Multilayer network
Measure        GMA        Leiden       GMA        Leiden
Communities    1125       822          1059       172
Modularity     0.881      0.880        0.589      0.619
Coverage       0.969      0.921        0.861      0.853
Performance    0.901      0.901        0.922      0.924

Recall that this dataset has a relatively low coherence for topic identification.

As reported in Table 9, the Leiden algorithm identified fewer communities than GMA for each type of network, and notably fewer for the directed multilayer network; the undirected network representation required to enable the Leiden algorithm artificially connected more components. Otherwise, the GMA and Leiden results for other metrics were comparable.

Both algorithms identified fewer communities when applied to the multilayer network. The topic layer helped identify connections between nodes that would otherwise not be detected. In the user network layer alone, there are 1045 (disconnected) components, whereas the multilayer network has only 991. Thus, the connectivity between users modeled via the topic layer helps identify larger communities. For the 2016 US Election dataset, the modularity and partition coverage metrics are worse for both GMA and Leiden when applied to the multilayer network, a result consistent with degraded influential user identification via the multilayer network. Only the partition performance is elevated for the multilayer network, by about 2.5%.

Table 10  Community Detection Results for the Joint Dataset and Alternative Network Models & Detection Algorithms

               User network layer      Multilayer network
Measure        GMA        Leiden       GMA        Leiden
Communities    3192       1990         2608       585
Modularity     0.969      0.973        0.799      0.824
Coverage       0.983      0.870        0.945      0.920
Performance    0.978      0.983        0.959      0.961

For the Joint dataset, Table 10 reports the number of communities, modularity, partition coverage, and partition performance for both the user network layer and the multilayer network, when applying the GMA and Leiden algorithms. Relative to the 2016 US Election dataset, the Joint dataset has a higher coherence for topic identification.

Within Table 10, the Leiden algorithm again identified fewer communities than GMA for both types of network models. Despite the addition of low weight edges to connect the components of the network, the Leiden algorithm yielded higher modularity scores than GMA. The significance of this improvement as it relates to the required graph transformations would require further research to ascertain, and we propose that exploration as a sequel to this research. The Leiden algorithm also yielded slightly lower partition coverage and marginally higher partition performance for both network models. Compared to their performance on the 2016 US Election dataset, both GMA and Leiden performed better on most metrics, with notably higher modularity for this dataset having high topic coherence. This result reinforces the merit of the multilayer network for modeling and analyzing user interactions attained via broad search queries.

4.5 Impact of dataset size on multilayer network approach by query type

Common to results in Sects. 4.2 and 4.4, the efficacy of methods varies by the type of query. LDA topic separation was better for the Joint dataset, yielding a coherence of 0.5. In turn, these results allowed the directed multilayer network approach to identify influential users via PageRank and detect communities using either GMA or the Leiden algorithm. By comparison, the directed multilayer network approach was not well suited to analyze datasets attained via topic-specific queries. LDA encountered challenges attempting to differentiate topics within the 2016 US Election dataset because, e.g., Tweets from different political parties will use much of the same language. PageRank and other methods can identify influential users for datasets culled using topic-specific queries, but the performance is better when applied to a single, user-layer network.

Redundancy is an aspect of Twitter data that compels an examination of the appropriate size for data queries. For example, despite the 2016 US Election dataset containing over 42,000 Tweets, it has only 15,000 unique Tweets. The majority of its communications are Retweets. Four of its Tweets and the ensuing Retweets and replies account for over 1,000 of the dataset’s instances. Although one would expect some data redundancy in Twitter, its existence is potentially beneficial: smaller datasets may suffice for SNA.

To examine the potential reduction in dataset size, testing examines the process through the third step: the identification of influential users. As a benchmark for expectations, analysis identified the top ten influential users for the entire 2016 US Election dataset and for 50,000 observations sampled from the Joint dataset using only the user network layer. For various sample sizes, 50 trials of bootstrap sampling (with replacement) and analysis of data from each dataset identified the top ten influential users. Table 11 reports, for each of the sample sizes, the average percentage of top influential users from the 50,000 observation samples found by the smaller samples.
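The sampling experiment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each interaction is reduced to a hypothetical (source user, target user) pair whose direction points toward the user receiving attention, repeated pairs simply add edge weight, and PageRank is computed by a plain power iteration with a damping factor of 0.85. The function names `pagerank`, `top_users`, and `bootstrap_overlap` are illustrative.

```python
import random
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted directed graph.

    `edges` is a list of (source, target) pairs; repeated pairs
    accumulate weight (an assumed weighting for illustration).
    """
    weights = defaultdict(float)
    out_weight = defaultdict(float)
    nodes = set()
    for src, dst in edges:
        weights[(src, dst)] += 1.0
        out_weight[src] += 1.0
        nodes.update((src, dst))
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing arcs is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight.get(v, 0.0) == 0.0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


def top_users(edges, k=10):
    """Return the k highest-ranked users, breaking ties by name."""
    rank = pagerank(edges)
    return sorted(rank, key=lambda user: (-rank[user], user))[:k]


def bootstrap_overlap(edges, sample_size, k=10, trials=50, seed=0):
    """Average percentage of benchmark top-k users recovered by samples."""
    benchmark = set(top_users(edges, k))
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = rng.choices(edges, k=sample_size)  # sampling with replacement
        found = set(top_users(sample, k))
        total += len(found & benchmark) / len(benchmark)
    return 100.0 * total / trials
```

Calling `bootstrap_overlap(interactions, sample_size)` for each sample size of interest yields one percentage per size, which is the shape of the comparison reported in Table 11. Fixing the seed keeps the 50 trials reproducible.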


Table 11  Average (%) of Top Influential Users from a 50,000 Tweet dataset found by 50 Samples Each of Smaller Datasets

Sample Size    2016 US election data (%)    Joint data (%)
250            61.6                         38.2
500            65.0                         40.6
1000           65.6                         44.2
2500           67.6                         46.0
5000           71.0                         46.8

Observable in Table 11, smaller datasets produce similar results for specific queries, but more general queries that collect data from different conversations require more data to accurately identify influential users.

5 Conclusions and recommendations

This research proposed a four-step process for analyzing social networks to identify and target individuals and communities with brand and product-specific marketing. For such marketing, it is intuitively helpful to understand a target audience’s interests, i.e., their topics of discussion. Within this context, this study set forth and tested a four-step process that leveraged a directed multilayer network approach for analysis. Augmenting traditional user network (layer) construction, the proposed process leverages Latent Dirichlet Allocation (LDA) with extractive summarization to identify topics; constructs a directed multilayer network with a user layer, topic layer, and appropriate arcs to represent connections; identifies influential users (e.g., via PageRank); and detects the related communities of interest (e.g., via a Greedy Modularity Algorithm).

Testing these techniques for named datasets attained via specific queries and a more generally focused dataset sampled from the named datasets revealed several important findings. First, LDA better identified topics via the directed multilayer network approach when analyzing datasets attained via a broad query, enabling higher coherence scores and better topic separation. The proposed directed multilayer network approach was effective for identifying influential users and communities for such datasets. In contrast, the proposed four-step process was not effective for datasets attained via specific queries. LDA had difficulty identifying distinct topics of conversation, so the inclusion of a topic layer in the network degraded the processes of influential user identification and community detection.

Testing also revealed several procedural insights. The proposed weighting schemes to quantify directed user-to-user, directed user-to-topic, and undirected inter-topic relationships in the multilayer network are conceptually sound and easy to implement. PageRank is the best-performing technique among those tested to identify influential users, regardless of the dataset or modeling approach. Twitter verification status strongly relates to influential user identification. For broad-query datasets analyzed via the directed multilayer network approach, larger samples than would be required by a topic-specific dataset are necessary for procedural accuracy. Finally, both GMA and the Leiden algorithm are useful for community detection, regardless of dataset query type or network modeling approach.

An interested analyst or company can readily replicate and automate the proposed four-step process to gather information for marketing via social media. In doing so, it is important to use broad search queries and gather large datasets of Tweets. As a check on expected outcomes, analysis should proceed with the proposed directed multilayer network approach only if the LDA topic coherence is at least approaching 0.5.

The impact of this research would benefit from the following extensions. First, additional study should examine the thresholds for including user-to-topic and inter-topic relationships as arcs in the directed multilayer network. Second, it is relevant to examine more broad-query datasets to verify or refine the proposed threshold for LDA coherence. Third, the effect of required network transformations on the efficacy of the Leiden algorithm merits study, arguably using datasets with known community membership. Finally, the impact of Twitter’s recent changes in user verification should be studied to determine if status has an effect on social influence.

As a caveat to the recommendations, it is important to note that relationships are not static, nor is user discourse. Although testing demonstrated the potential benefit of the proposed four-step process for analyzing large, broad-query datasets, analysis supporting marketing must be an iterative process. Only by analyzing a market repeatedly over time can one be aware not only of current user interests but also of the evolving user interests that allow a company to exercise marketing initiatives.

6 Disclaimer

The views expressed in this article are those of the authors and do not reflect the official policy or position of the United States Air Force, United States Army, United States Department of Defense, or United States Government.

Acknowledgements The authors thank the editor and the anonymous reviewers for their detailed and constructive comments that improved both the content and presentation of this paper.


Declarations

Conflict of interest The authors received no funds, grants, or other support for this research.
