Structural - Social Networks: Twitter Analysis
https://doi.org/10.1007/s13278-023-01063-2
ORIGINAL ARTICLE
Received: 12 January 2023 / Revised: 4 March 2023 / Accepted: 4 March 2023 / Published online: 7 April 2023
This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply 2023
Abstract
Effective employment of social media for any social influence outcome requires a detailed understanding of the target audience. Social media provides a rich repository of self-reported information that provides insight regarding the sentiments and implied priorities of an online population. Using Social Network Analysis, this research models user interactions on Twitter as a weighted, directed network. Topic modeling through Latent Dirichlet Allocation identifies the topics of discussion in Tweets, which this study uses to induce a directed multilayer network wherein users (in one layer) are connected to the conversations and topics (in a second layer) in which they have participated, with inter-layer connections representing user participation in conversations. Analysis of the resulting network identifies both influential users and highly connected groups of individuals, informing an understanding of group dynamics and individual connectivity. The results demonstrate that the generation of a topically-focused social network to represent conversations yields more robust findings regarding influential users, particularly when analysts collect Tweets from a variety of discussions through more general search queries. Within the analysis, PageRank performed best among four measures used to rank individual influence within this problem context. In contrast, the results of applying both the Greedy Modularity Algorithm and the Leiden Algorithm to identify communities were mixed; each method yielded valuable insights, but neither technique was uniformly superior. The demonstrated four-step process is readily replicable, and an interested user can automate the process with relatively low effort or expense.
Keywords Social network analysis · Networks · Multilayer networks · Natural language processing
Mathematics Subject Classification 62H30 · 68T50 · 90B10 · 90B90 · 91B08 · 91C20 · 91D30
Social Network Analysis and Mining (2023) 13:65
Austin P. Logan, Phillip M. LaCasse and Brian J. Lunday have contributed equally to this work.
* Phillip M. LaCasse, phillip.lacasse@afit.edu
Austin P. Logan, austin.logan.2@us.af.mil
Brian J. Lunday, brian.lunday@afit.edu
1 Directorate of Plans, Programs, and Requirements, Air Combat Command, 129 Andrews Street, Langley Air Force Base, VA 23665, USA
2 Department of Operational Sciences, Air Force Institute of Technology, 2950 Hobson Way, Wright‑Patterson Air Force Base, OH 45433, USA
1 Introduction
Social media provides an open environment for individual users to send and receive information within the greater online community, simultaneously influencing and being influenced by it. This influence may take the form of mass content dissemination to a wide audience. It could also consist of targeted, tailored advertising to individuals based on their online behaviors. A less benign example is the use of propaganda or misinformation to shape public opinion or sow discord.
Twitter is a compelling source for social network analysis because its content is heavily text-based compared to other common, image-based platforms, such as Instagram, Snapchat, or TikTok. Moreover, the unique features of Tweet
sharing and hashtag labeling distinguish Twitter from sites such as Facebook where post sharing is less emphasized.
For the most part, the capability for widespread social influence is restricted to entities with the resources and technical skills necessary to perform large-scale analysis of open source data. This research sets forth and demonstrates a set of analyses that are readily usable by smaller entities without a large capital investment.
This research makes four contributions to the larger discipline of social network analysis.
1. We propose and demonstrate a framework to analyze Twitter data consisting of the following four steps:
   • Topic discovery
   • Construction of the multilayer network
   • Identification of influential users and topics
   • Detection of communities within the network
2. We propose and demonstrate a multilayer network structure consisting of a user layer and a topics layer. This structure leverages relationships between and among the layers to provide meaningful insight into the network and its potential influence. As far as we can determine, the proposed structure is unique to this research.
3. We propose and demonstrate two new SNA arc weighting techniques: one between users based on the inverse of the long-term proportion of interactions and the other between topics based on cosine similarity.
4. We evaluate four alternative methods to identify influential users and two alternative algorithms to discover communities.
The remainder of the manuscript is organized as follows. Section 2 reviews the technical literature related to topic modeling and social network analysis (SNA), including methods previously used in these fields that can answer key research questions. Section 3 explains the methodology to conduct the social network analysis, to include creation of the multilayer network and the metrics to evaluate it. Section 4 presents the results of applying the proposed SNA process to a population of Tweets and discusses both the algorithmic results and instance-specific implications. Finally, Sect. 5 concludes the work and recommends how one may best apply the process and leverage the study’s insights.
2 Literature review
This study leverages existing research related to social network analysis, influencer identification, sentiment analysis, community detection, and both directional relationship and multilayer modeling between entities in SNA.
On some level, social ties and connections have been worthy of study dating back to antiquity, as evidenced by the presence of genealogies in ancient texts such as the Bible or Greco-Roman poems and histories. Freeman (2004) traces the development of social network analysis from early sociometric studies at Sing Sing prison (Moreno 1932) and the Hudson School for Girls (Moreno 1933) through its formal establishment as a rigorous discipline in the 20th century. In particular, four features characterize modern social network analysis: structural intuition based on ties linking social actors, grounding in systemic empirical data, employment of graphic imagery, and quantification via rigorous mathematical modeling (Freeman 2004).
Scott and Carrington (2011) define social network analysis as the specific logic behind the relationships that people choose to form and maintain, resulting in a social configuration that can be represented graphically. This SNA approach leverages connections between entities to construct a graphical representation of a network composed of nodes and arcs, wherein the arcs convey the relative strength of connections (Legradi 2009), as determined by mathematical analysis. Within this context, Allard (1990) outlines two main goals of SNA: understanding the factors that affect relationships and their correlations, and ascertaining the effects of these relationships, including the possible identification of an informal leader.
An important aspect of social media culture research related to this work is the phenomenon of influencers. Zhang and Vos (2015) examine social media culture in depth and conclude that the most effective way to spread a message on social media is through highly influential users known as influencers. Influencers have acquired the reputation of being compelling and reliable sources of information and are connected to large numbers of users who follow, comment on, and share their messages.
Given a social network representation of people and their interactions, it is therefore important to identify these disproportionately influential individuals. Several studies (e.g., Bakshy et al. (2011); Erlandsson et al. (2016); Dewi et al. (2017); Bhavnani et al. (2021)) research methods to characterize the influence of nodes within networks. Such methods range in approach from direct observations such as counting the average number of interactions by others for a user’s Tweets (Erlandsson et al. 2016), to indirect inferences such as representing physical interactions among people as a network and modeling how quickly a virus would spread from an individual (Doerr et al. 2013). Of note, whereas many of the most influential users in a social network have a large number of followers, follower count alone is not a strong enough metric for quantifying influence (Erlandsson et al. 2016). Additionally, Pudjajana et al. (2018) identify several centrality metrics useful to compare the influence of nodes, which Jin (2020) demonstrated well. Also related to
influential node identification, Sheth et al. (2022) and Venkatesan and Prabhavathy (2019) study methods to discover anomalous users within social networks.
However, it is not sufficient simply to identify influencers; one must also characterize the messages that they share with followers. Sentiment analysis is a technique to label a message as either positive or negative, and it can effectively monitor users’ emotions towards a topic over time. Tsugawa and Ohsaki (2015) and Salehi et al. (2018) outline many of the methods to perform sentiment analysis. Featherstone and Barnett (2020) employ self-reported attitude scores to validate sentiment scores obtained from a comprehensive study on public opinion towards genome editing. Results are promising, although the strength of the relationship between attitude score and sentiment did vary between the subgroups sampled (Featherstone and Barnett 2020).
Moreover, sentiments can affect consumer behavior. As examples, both Gazdaggyori (2021) and Hamraoui and Boubaker (2022) apply sentiment analysis to Tweets related to financial stock performance. Although the growth of the studied stocks was inconsistent and relatively short-lived, both studies demonstrate that the sentiment of social media users reflects changes in consumer attitude that influence investor behaviors. The same phenomenon is observed in the pro-vaccine and anti-vaccine communities; each community contains its stable of influencers whose messages and sentiments produce predictable effects on vaccination coverage in children (Featherstone et al. 2020).
Other aspects of SNA of interest to this research are topic modeling and community discovery. Topic modeling determines the most frequent topics of discussion in a collection of Tweets. The two primary topic modeling methods are Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). Kalepalli et al. (2020) directly compare LDA and LSA, and both Rahmadan et al. (2020) and Yang et al. (2021) demonstrate the efficacy of LDA for topic modeling using Twitter data. In a related work, Jiwanggi and Adriani (2016) detail methods for extracting a summary of topics from a collection of Tweets. Topic modeling is frequently useful to model the evolution of public discourse on topics of interest such as vaccination (Featherstone et al. 2020) or gene editing (Ji et al. 2022). Community discovery methods seek to identify the closely connected groups of users in a network, whether by individual communications or communications related to a common topic such as vaccine hesitancy (Ruiz et al. 2021). Various user features can help detect communities of connected users, as Pacheco et al. (2021) recently studied.
Of interest to this research is the modeling of social networks with directional representation of interactions. Aiello et al. (2010) characterize representation decisions for constructing social networks and the links between users. Relationships and communication are often asymmetric, and directed social networks better represent these interactions. However, directed social networks are not a modeling panacea, and both Malliaros and Vazirgiannis (2013) and Tsopze and Domgue (2021) describe the challenges associated with their modeling and analysis.
For SNA, a multilayer network can represent the complexities of interactions better than a single-layer network model. Multilayer networks can include entity-specific layers and include edges or directed edges (i.e., arcs) to represent different interactions and the relative strength thereof. Figure 1 depicts an example of a multilayer network with a user layer, a topic layer, and inter-nodal arcs both within and between layers.
Fig. 1 Multilayer Social Network Representation of Entity-specific Layers and Directed-arc Representation of Interactions
Some SNA techniques are unique to multilayer networks, directed networks, or the combination of both. Kolda et al. (2005) define methods to quantify node influence in multilayer networks. Tang and Liu (2011) propose methods to detect communities in directed networks. This research leverages each of these contributions.
3 Methodology
Section 3.1 describes the analyzed datasets. Sections 3.2 and 3.3 contain the approaches for creating the user and topic layers in the multilayer network. Finally, Sects. 3.4 and 3.5 present the methods used to discover influential users and communities.
3.1 Description of datasets
This research employs four traditional datasets studied within the SNA literature, listed in Table 1, along with customized datasets created by sampling Tweets from each of the four named sets. The Tweets are almost exclusively
written in the English language, and the information associated with each Tweet consists of username, number of followers, date of account creation, and verification status. The topics of conversation are sports, entertainment, politics, and health, respectively.
Table 1 Summary of datasets and key features
Dataset name        # Tweets    Retweets    Replies
2018 World Cup      530,000     Yes         No
Game of Thrones     760,660     No          Yes
2016 US Election    42,013      Yes         Yes
COVID-19            179,108     No          Yes
The first step in this analysis is to demonstrate the analytical techniques on the well understood datasets in Table 1. Such analysis can reveal differences in results pertaining both to the identification of influential users and to general network metrics. It can also provide insight into Tweet query practices and their effect on the resulting social network representation.
Next, we apply the techniques to customized datasets created by sampling a uniform number of Tweets from each of the named datasets. These customized datasets include more diverse topics and conversations, and they better represent the topical diversity of the Twittersphere. Moreover, such datasets help validate community identification and topic modeling methods because it is reasonable to expect topics and communities identified in the composite dataset to map to those in the four original datasets.
As mentioned in Sect. 1, the intended beneficiaries of this framework are entities without the budget required for more expensive APIs that collect large amounts of data about individual Tweets. As such, analytic techniques herein consider only the Tweet data available from a low-cost accessible API: Tweet text, username, and verification status. As is typical with most analyses, some minor data cleaning was necessary to ensure that subsequent analysis only considered Tweets with an associated username.
network can distinguish a celebrity who does not Tweet frequently but has many followers from a bot or spammer that Tweets frequently about other users.
Edges or arcs within a social network have associated weights to convey the strength of the connection implied by interactions between users. All of the datasets within Table 1 include three types of user interactions: a user mentioning another user, a user Retweeting another user’s Tweet, and a user replying to another user’s Tweet. Although no formal direct measure of user relationships exists, these interactions can inform a proxy metric that represents the relative, implied strength of relationships via arc-specific weights.
Some interactions imply a closer relationship between users. This is evident by observing the ratio of likes to Retweets for nearly every Tweet in the Twittersphere. Tweets consistently have far more likes than Retweets (e.g., Perdana and Pinandito (2018)), implying that a Retweet conveys a stronger engagement with a Tweet than a like. Moreover, Tweets typically have fewer mentions in new Tweets than Retweets, and fewer replies than mentions, indicating increasing degrees of engagement.
For this reason, the inverse of the frequency with which replies, mentions, and Retweets occur can provide a suitable proxy for the strength of connection implied by an interaction. For example, if the distribution of replies, mentions, and Retweets was uniform, then any action would contribute the same weight (i.e., $1/0.\overline{3} = 3$) to an arc from one user to the author of the original Tweet. If the distribution were 14.3%, 28.6%, and 57.2%, a reply would contribute twice as much weight to the arc (i.e., 7) as a mention (i.e., 3.5) and four times as much as a Retweet (i.e., 1.75). If a user interacts with another user several times within a dataset, the net contributions of the interactions to the arc weight are additive. Among the responses to Tweets in Table 1 datasets, the distribution of interactions consisted of 6.97% replies, 39.49% mentions, and 53.54% Retweets. Thus, each reply, mention, and Retweet contributes 14.35, 2.53, and 1.87 to an arc weight, respectively, for the generation of the user networks or user network layers.
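The inverse-frequency arc weighting described above can be illustrated with a minimal Python sketch. It assumes interactions are available as (interacting user, original author, interaction type) tuples; the function names, the tuple layout, and the example usernames are hypothetical and are not taken from the study. Only the three proportions (6.97%, 39.49%, 53.54%) come from the text.

```python
from collections import defaultdict


def interaction_weights(proportions):
    """Weight of one interaction = inverse of that interaction type's proportion."""
    return {kind: 1.0 / share for kind, share in proportions.items()}


def build_arc_weights(interactions, weights):
    """Sum the weights of all interactions on each directed (source, target) arc.

    Each arc points from the interacting user to the author of the original Tweet,
    and repeated interactions between the same pair are additive.
    """
    arcs = defaultdict(float)
    for source, target, kind in interactions:
        arcs[(source, target)] += weights[kind]
    return dict(arcs)


# Proportions reported in the text for the Table 1 datasets.
proportions = {"reply": 0.0697, "mention": 0.3949, "retweet": 0.5354}
weights = interaction_weights(proportions)

# Hypothetical interactions, for illustration only.
interactions = [
    ("user_a", "user_b", "reply"),
    ("user_a", "user_b", "retweet"),
    ("user_c", "user_b", "mention"),
]
arcs = build_arc_weights(interactions, weights)

for kind, weight in weights.items():
    print(f"{kind}: {weight:.2f}")
```

Rounded to two decimals, the per-interaction weights reproduce the 14.35, 2.53, and 1.87 stated above, and the arc from `user_a` to `user_b` accumulates the reply and Retweet contributions additively.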
the user network layer, combined with arcs indicating user participation in the topics, induces a multilayer network to more accurately represent both the direct and indirect connections among users participating in the discourse on the social media platform.
For the first step, this research conducts topic modeling using Latent Dirichlet Allocation (LDA). Pritchard et al. (2000) set forth LDA as an unsupervised clustering method to assign individual creatures (e.g., birds, people) to populations (e.g., species, tribes) based on genotype similarities. Blei et al. (2003) first applied LDA to topic modeling for text-based documents. The authors describe LDA as “a generative probabilistic model of a corpus”; it synthesizes a user-defined number of topics and populates them with the words that have the highest probabilities of belonging to them. The three key elements of an LDA model are the topics, documents, and corpus. Within this research, topics are the clusters to which statistical analysis will assign words from the Tweets. Documents are the individual Tweets appearing in the data, and the corpus is the complete collection of documents. Given k topics and V unique words in the corpus, LDA creates a $k \times V$ probability matrix $\beta$, where $\beta_{ij}$ represents the probability that topic i includes word j.
Preprocessing of Tweets removes stop words, tokenizes the remaining text, and lemmatizes the individual words (i.e., tokens) to ensure only relevant text remains for topic discovery via LDA. Stop word removal deletes common words that provide no contextual meaning, such as articles, conjunctions, prepositions, and pronouns. Tokenization partitions the remaining text into words. Lemmatization replaces different forms of a word (e.g., runner, running, runs) with a common root word (e.g., run).
Only a user-defined number of topics k is necessary to apply LDA to preprocessed Twitter data. Although the optimal number of topics depends on the data, a coherence metric can assess the effectiveness of a k-topic LDA model by assigning a score to the set of highest probability words in each topic based on their similarity and interpretability by a human. Thus, a line search on k can identify the best number of topics to maximize coherence. Although several researchers (e.g., see Bouma (2009); Newman et al. (2010); Mimno et al. (2011)) have developed alternative coherence metrics, Röder et al. (2015) conducted an extensive, comparative study of such metrics and introduced two new metrics, identifying a superlative coherence metric the authors denoted as $C_V$. This research uses their recommended metric.
Once LDA is complete, the second step generates the directed multilayer network representation. An analyst names the topics after inspecting the highest probability words associated with each topic. Such a task is not arduous, given familiarity with the language-of-origin for the Tweet. Although this manual naming process is not strictly necessary because LDA identifies the topics, it does provide useful context for analysis. The new topic layer consists of a node for each topic, and the LDA results (i.e., $\beta$) inform arc creation between the user layer and the topic layer.
The two aspects of arc creation are which arcs to generate and how to weight those arcs. This research generates user-to-topic arcs for only the strongest Tweet-to-topic relationship for each of the user’s Tweets, as measured by a similarity index of words within a Tweet to each of the k topics. For a given Tweet and a vector s of length V, wherein $s_v$ is the number of times token v appears within the Tweet, the Tweet-to-topic similarity index for each topic i equals $\beta_i \cdot s$. As an aside, although it is possible for a single Tweet to relate to multiple topics rather than only its most relevant topic, such an alternative is perhaps a compelling sequel to this work, albeit a more computationally burdensome endeavor.
Additionally, arc generation only creates Tweet-to-topic arcs if the similarity index was in the top 25% of all such maximal indices for the corpus of Tweets. Doing so avoids establishing weak connections between users and topics. Although no formal research exists to determine such a threshold, future work could utilize labeled training data with a machine learning approach to explore better decisions in this space.
For this directed multilayer network, this research creates a pair of user-to-topic and topic-to-user arcs as determined by the similarity index. Inducing the opposite-direction topic-to-user arc represents scenarios wherein users scroll through a topic of conversation on Twitter and find another user via their topic-specific Tweets. The weight for each of the arcs in the generated pair is equal to the value of the similarity index.
After establishing connections between the user layer and the topic layer of the multilayer network, the final step to complete the multilayer network model is to represent the connections between the topics in the topic layer. For a given topic, the word probability vector is a vector of length V that indicates in each entry the likelihood that a word is in that topic. From this definition, the cosine similarity between two topics is the angle between their word probability vectors, as Equation (1) calculates for two vectors A and B.
$$\cos^{-1}\left(\frac{A^{T}B}{\lVert A\rVert\,\lVert B\rVert}\right) \tag{1}$$
Although the theoretical domain of Equation (1) is [−1, 1], only a range of [0, 1] is feasible for cosine similarities between topics; each vector is in the non-negative orthant because every element is a non-negative probability.
The interpretation of these values is as follows. A similarity of 0 means the vectors are orthogonal, so no inter-topic relationship exists; a value close to 1 results from nearly parallel vectors, indicating similar relative distributions of
word-to-topic probabilities for the two topics. A pair of directed arcs is generated between two topics if their cosine similarity exceeds 0.5. Differing from the user-to-topic connections, arcs may connect a single topic to multiple other topics.
The resulting directed multilayer network includes respective user and topic layers; weighted, directed arcs connecting users based on replies, mentions, and Retweets; pairs of weighted, directed arcs connecting users to topics based on the vocabulary of a user’s Tweets, with at most one user-to-topic connection formed by a single Tweet; and pairs of weighted, directed arcs connecting topics based on the cosine similarities of their respective word probability distributions.
Complementing the first two topic-focused steps of the process is the summarization of Tweets linked to each topic. This research creates extractive summaries rather than abstractive summaries, favoring the former for its simplicity. Moreover, the ability of an abstractive summary to generate unique thoughts is mitigated by the methods by which Twitter creates trending topics; they often present the most relevant Tweets within a conversation, an outcome similar to an extractive summary (Rudrapal et al. 2018). This research uses the TextRank algorithm to create extractive summaries. Created by Mihalcea and Tarau (2004), it applies a graph-based ranking technique that induces a graph wherein nodes represent sentences (i.e., Tweets) and edges are weighted by a user-defined sentence similarity metric. This work utilizes the better performing metric (i.e., BM25) proposed by Barrios et al. (2015) in lieu of the alternatives originally set forth by Mihalcea and Tarau (2004). In comparison, BM25 considers the inverse frequency of words within a document to increase the relative similarity metric for documents containing words that are rare in the topic-specific corpus.
For the TextRank generated graph, PageRank subsequently identifies the most important sentences for inclusion in the extractive summary of each topic. Page et al. (1999) propose the PageRank algorithm to calculate the most important sentences for extractive summaries. The algorithm repeatedly applies an extended random walk on a graph to determine the long-term probabilities of residing at each node. In a random walk, a simulated entity sequentially travels from one node to an adjacent node with a probability equal to the arc weight, relative to the total weights of arcs emanating from the current node. The authors modify the adjacent step probabilities of the random walk to create small, nonzero probabilities of traversing from a given node to any other (i.e., non-adjacent) node in the network to mitigate the effect of disconnected network components on long-term probability calculations. The application of multiple random walks from initial entity locations determined via a uniform distribution over the nodes more notably mitigates that effect. Augmenting the list of the highest probability words for each topic identifiable via $\beta$, this extractive summarization of the most relevant topic-specific Tweets provides additional context regarding topics and reduces the creative, cognitive labor required to analyze Twitter data scrapes.
3.4 Influential user identification
For the third step in the process, it is worth noting that there are myriad methods to identify influential nodes within a network. Although many such methods produce reasonable results for smaller, highly-connected networks, the relative performance of the methods depends notably on the network. Given that this research examines larger networks expected to be relatively disconnected, it is relevant to evaluate alternative methods to identify influential nodes. Testing within Sect. 4 compares rankings via the PageRank algorithm, the Hyperlink-Induced Topic Search (HITS) algorithm, betweenness centrality, and eigenvector centrality. For each of these techniques, a higher computed value indicates greater influence.
As described in Sect. 3.3, the PageRank algorithm uses long-term node visit probabilities for a random walk to rank order the users and infer a relative degree of influence.
Kleinberg (1999) modifies the PageRank algorithm to create the HITS algorithm to identify influential nodes. The author conjectures a conceptual shortcoming of the PageRank algorithm for directed network analysis; whereas PageRank readily identifies authority nodes having many inbound arcs, it can underestimate the influence of hub nodes having many outbound arcs. It is arguably influential to direct connections, not just to be the directed target of connections. Accounting for both authority and hub behaviors of nodes, the HITS algorithm identifies a root set of nodes via a targeted search query and augments it with all nodes adjacent via outgoing arcs from the root set. For this larger subgraph, the algorithm iteratively updates each node’s authority and hub scores to be equal to the sum of the hub and authority scores of nodes respectively connected to or from the node, until convergence. The HITS algorithm yields two metrics, one each for authority and hub rankings. In practice, researchers often average these scores to enable a direct comparison with other influential node identification methods, and this research does likewise.
As a third method to identify influential nodes, betweenness centrality (Freeman 1977) computes for a given node v the frequency with which it is on one of the shortest paths between a pair of nodes (s, t), considered over all node pairs $s, t \in V$, as per Equation (2).
$$c_B(v) = \sum_{s,t \in V} \frac{\sigma(s, t \mid v)}{\sigma(s, t)} \tag{2}$$
For social network analysis, betweenness centrality computations use the inverse of edge weights as edge distances
because larger weights indicate strong connections that would conceptually correspond to a shorter distance (i.e., an edge more likely to be traversed). A notable downside to this method is that it requires calculating the shortest paths between all pairs of nodes. Although either a repeated application of Dijkstra’s algorithm or the Floyd–Warshall algorithm can run in $O(n^3)$ time (Ahuja et al. 1993), such effort remains computationally expensive for larger networks, and both algorithms require modification to identify alternative optima for shortest (s, t)-paths.
Finally, eigenvector centrality (Landau 1895) provides another alternative to identify influential nodes. This metric leverages the idea that nodes of high influence are adjacently connected to other nodes of high importance. Given an $N \times N$ node adjacency matrix A, wherein $A_{ij}$ is equal to the weight of the connection between nodes i and j, solve the eigenvector equation $Ax = \lambda x$. Designating $\lambda$ as the largest eigenvalue, the corresponding vector x indicates the respective influence scores for each of the nodes. This metric is conceptually simple and easy to calculate.
3.5 Community detection
For the fourth step in the proposed SNA process, multiple methods for community detection exist in the literature. Among them, this research tests and compares the Greedy Modularity Algorithm (GMA) and the Leiden algorithm (Traag et al. 2019).
Before discussing these methods, it is important to note the characteristics of data that inform such choices. In this research, communities are identified using only the information contained within the directed multilayer network’s nodes and arcs. Given the nature of social media data, especially data gathered via broad queries of Tweets among many unique users, the resulting social network structures tend to be fragmented. Even when querying data by common keywords, the likelihood of capturing a back-and-forth conversation via Tweets between two or more users is exceedingly small, considering the millions of Tweets posted daily.
The first agglomerative technique this research uses is the Greedy Modularity Algorithm (GMA). The GMA is a modification of the Clauset–Newman–Moore (CNM) algorithm set forth by Clauset et al. (2004). Like CNM, GMA is a heuristic approach to maximize a modularity metric that measures the strength of community classification. Whereas Clauset et al. (2004) designed the CNM algorithm for undirected networks, the GMA seeks to maximize the modularity metric in Equation (3), adapted for directed networks when implemented via the NetworkX library (Hagberg et al. 2008) for the Python programming language.
$$Q = \sum_{c=1}^{n} \left( \frac{L_c}{m} - \gamma \left( \frac{k_c^{in}\, k_c^{out}}{2m} \right)^{2} \right) \tag{3}$$
Therein, $L_c$ is the number of arcs within community c; m is the total number of edges in the graph; $k_c^{in}$ and $k_c^{out}$ are the sums of the respective in-degree and out-degree weights in community c; and $\gamma$ is a positive, user-defined resolution parameter to balance the importance of edges within a community and edges connecting communities. Smaller $\gamma$-values yield fewer, larger communities, and larger $\gamma$-values yield more, smaller communities (Newman 2016). At initialization, there are n = N communities, and $Q \le 0$ because $L_c = 0$ for c = 1, ..., n. Within an iteration, the CNM algorithm calculates the net change to network modularity that would result from conjoining pairs of communities via an edge connecting them. If the maximal such change to modularity is positive, the edge is added and the algorithm proceeds to the next iteration; otherwise, GMA terminates with identified community structures.
The Leiden algorithm is an agglomerative community detection algorithm for undirected, fully connected networks. Fortunately, Malliaros and Vazirgiannis (2013) discussed transformations one can apply to a directed network to enable the application of the Leiden algorithm. First, edges replace pairs of equal-weight, opposite direction arcs between nodes. Second, edges replace singular arcs between nodes, with the same total edge weight. This transforma-
tion implies two-way connections that do not exist in the
Accordingly, the idealized version of a social network com-
data, but it allows for the exploration of a larger number of
munity as clique subgraph having k nodes and k(k − 1)∕2
community detection methods for which the results can be
edges (or k(k − 1) directed arcs) is elusive. Rather, commu-
validated in comparison to the original social network struc-
nity detection methods must consider the implicit networks
ture. Third, a minimal number of low-weight edges augment
of users who are not in direct conversation with each other,
the social network to ensure all nodes are fully connected.
but who share the same topics of conversation or common
These artificial connections minimally modify the network
connections with other users. This low level of direct con-
representation in a manner that should be negligible but for
nectivity motivates the use of agglomerative community
which the results of any community discovery algorithm
detection methods, wherein every node begins as a sole
should be validated against the original network.
member of its own community, and an algorithm iteratively
For an undirected multilayer representation of the
conjoins smaller communities to improve the collective
directed multilayer network via the aforementioned
strength of the respective communities, as measured via a
steps, the Leiden algorithm (Traag et al. 2019) can detect
customized metric.
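The three transformations described for preparing a directed network for the Leiden algorithm can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Several choices are assumptions made only for illustration: "fully connected" is read as "forming a single connected component", opposite-direction arcs are merged by summing their weights, self-loops are dropped, components are chained through their alphabetically first nodes, and the names `to_undirected`, `components`, `connect_components`, and `epsilon` are hypothetical.

```python
from collections import defaultdict


def to_undirected(arcs):
    """Steps 1 and 2: collapse weighted arcs (u, v, w) into undirected edges.

    Assumption: weights of opposite-direction arcs are summed, so the total
    weight between two nodes is preserved. Self-loops are dropped.
    """
    edges = defaultdict(float)
    for u, v, w in arcs:
        if u != v:
            edges[frozenset((u, v))] += w
    return dict(edges)


def components(nodes, edges):
    """Return the connected components (sorted lists of nodes) via union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for edge in edges:
        u, v = tuple(edge)
        parent[find(u)] = find(v)
    groups = defaultdict(list)
    for v in nodes:
        groups[find(v)].append(v)
    return sorted(sorted(group) for group in groups.values())


def connect_components(nodes, edges, epsilon=1e-6):
    """Step 3: add (number of components - 1) low-weight edges.

    Assumption: consecutive components are chained through their first nodes;
    the source does not state which endpoints receive the artificial edges.
    """
    comps = components(nodes, edges)
    augmented = dict(edges)
    for first, second in zip(comps, comps[1:]):
        augmented[frozenset((first[0], second[0]))] = epsilon
    return augmented


# Toy directed network: one reciprocated pair, two single arcs, one isolated user.
nodes = ["a", "b", "c", "d", "e", "f", "g"]
arcs = [("a", "b", 2.0), ("b", "a", 2.0), ("c", "d", 1.0), ("e", "f", 3.0)]
edges = to_undirected(arcs)
augmented = connect_components(nodes, edges)
print(len(components(nodes, edges)), "components before,",
      len(components(nodes, augmented)), "after")
```

Because the added edges are artificial, any communities found on the augmented network should still be checked against the original directed structure, as the text cautions.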
65 Page 8 of 18 Social Network Analysis and Mining (2023) 13:65
identify top influential users who have verified Twitter accounts. Section further examines this phenomenon.

Within Table 8, the rankings determined via the multilayer network are notably different. PageRank identifies six of the same top ten influential users that it found with only the user network layer. However, the remaining three methods generally identify low profile, unverified users as being highly influential.

When identifying influential users via either the user network layer only or the multilayer network, PageRank outperforms the other methods based on three factors. First, it exhibits relative consistency in identifying some influencers. Second, many of the users PageRank identifies have verified Twitter accounts. Third, many of the same users have hundreds of thousands if not millions of followers. Thus, these influential users have a high in-degree because other users frequently mention them or Retweet their Tweets.

Another characteristic difference between the two sets of rankings is that rankings leveraging only the user network layer tend to identify influential users related to politics, news, or entertainment, whereas the rankings from the multilayer network identify influential users related to politics and sports. This outcome implies that users Tweeting about sports are more likely to be connected via their topics of conversation than via direct conversations, and it reveals opportunities for marketing sports brands and merchandise that a company might otherwise overlook.

4.4 Community detection results

Whereas topic modeling can help find users having specific, topical interests, community detection finds the groups of users having more generally related interests. In doing so, one may design branding or product marketing material for a broader community rather than a topical interest group, thereby engaging with a larger set of potential customers. Of interest is the merit of the directed multilayer network model for detecting communities of users.

As discussed in Sect. 3.5, this research applies both the Greedy Modularity Algorithm (GMA) and the Leiden algorithm to detect communities of users, both for the user network layer only and for the multilayer network. That discussion noted the potential disadvantages of applying the Leiden algorithm to the directed multilayer network: the algorithm applies to undirected networks, so selected transformations are necessary that may reduce model efficacy.

For the 2016 US Election dataset, Table 9 reports the number of communities, modularity, partition coverage, and partition performance for the aforementioned combinations of network models and community detection methods.
Table 9 Community Detection Results for the 2016 US Election Dataset and Alternative Network Models & Detection Algorithms

               User network layer     Multilayer network
Measure        GMA       Leiden       GMA       Leiden
Communities    1125      822          1059      172
Modularity     0.881     0.880        0.589     0.619
Coverage       0.969     0.921        0.861     0.853
Performance    0.901     0.901        0.922     0.924

Recall that the 2016 US Election dataset has a relatively low coherence for topic identification.

As reported in Table 9, the Leiden algorithm identified fewer communities than GMA for each type of network, and notably fewer for the directed multilayer network; the undirected network representation to enable the Leiden algorithm artificially connected more components. Otherwise, the GMA and Leiden results for other metrics were comparable.

Both algorithms identified fewer communities when applied to the multilayer network. The topic layer helped identify connections between nodes that would otherwise not be detected. In the user network layer alone, there are 1045 (disconnected) components, whereas the multilayer network has only 991. Thus, the connectivity between users modeled via the topic layer helps identify larger communities. The modularity and partition coverage metrics are worse for both GMA and Leiden when applied to the multilayer network for the 2016 US Election dataset, a result consistent with degraded influential user identification via the multilayer network. Only the partition performance is elevated for the multilayer network, by about 2.5%.

For the Joint dataset, Table 10 reports the number of communities, modularity, partition coverage, and partition performance for both the user network layer and the multilayer network, when applying the GMA and Leiden algorithms. Relative to the 2016 US Election dataset, the Joint dataset has a higher coherence for topic identification.

Table 10 Community Detection Results for the Joint Dataset and Alternative Network Models & Detection Algorithms

               User network layer     Multilayer network
Measure        GMA       Leiden       GMA       Leiden
Communities    3192      1990         2608      585
Modularity     0.969     0.973        0.799     0.824
Coverage       0.983     0.870        0.945     0.920
Performance    0.978     0.983        0.959     0.961

Within Table 10, the Leiden algorithm again identified fewer communities than GMA for both types of network models. Despite the addition of low weight edges to connect the components of the network, the Leiden algorithm yielded higher modularity scores than GMA. The significance of this improvement as it relates to the required graph transformations would require further research to ascertain, and we propose that exploration as a sequel to this research. The Leiden algorithm also yielded slightly lower partition coverage and marginally higher partition performance for both network models. Compared to their performance on the 2016 US Election dataset, both GMA and Leiden performed better on most metrics, with notably higher modularity for this dataset having high topic coherence. This result reinforces the merit of the multilayer network for modeling and analyzing user interactions attained via broad search queries.

4.5 Impact of dataset size on multilayer network approach by query type

Common to results in Sects. 4.2 and 4.4, the efficacy of methods varies by the type of query. LDA topic separation was better for the Joint dataset, yielding a coherence of 0.5. In turn, these results allowed the directed multilayer network approach to identify influential users via PageRank and detect communities using either GMA or the Leiden algorithm. By comparison, the directed multilayer network approach was not well suited to analyze datasets attained via topic-specific queries. LDA encountered challenges attempting to differentiate topics within the 2016 US Election dataset because, e.g., Tweets from different political parties will use much of the same language. PageRank and other methods can identify influential users for datasets culled using topic-specific queries, but the performance is better when applied to a single, user-layer network.

Redundancy is an aspect of Twitter data that compels an examination of the appropriate size for data queries. For example, although the 2016 US Election dataset contains over 42,000 Tweets, it has only 15,000 unique Tweets. The majority of its communications are Retweets. Four of its Tweets and the ensuing Retweets and replies account for over 1,000 of the dataset's instances. Although one would expect some data redundancy in Twitter, its existence is potentially beneficial: smaller datasets may suffice for SNA.

To examine the potential reduction in dataset size, testing examines the process through the third step: the identification of influential users. As a benchmark for expectations, analysis identified the top ten influential users for the entire 2016 US Election dataset and for 50,000 observations sampled from the Joint dataset using only the user network layer. For various sample sizes, 50 trials of bootstrap sampling (with replacement) and analysis of data from each dataset identified the top ten influential users. Table 11 reports for each of the sample sizes the average percentage of top influential users from the 50,000 observation samples found by the smaller samples.
Table 11 Average (%) of Top Influential Users from a 50,000 Tweet dataset found by 50 Samples Each of Smaller Datasets

Sample Size    2016 US election data (%)    Joint data (%)
250            61.6                         38.2
500            65.0                         40.6
1000           65.6                         44.2
2500           67.6                         46.0
5000           71.0                         46.8

As Table 11 shows, smaller datasets produce similar results for specific queries, but more general queries that collect data from different conversations require more data to accurately identify influential users.

5 Conclusions and recommendations

This research proposed a four-step process for analyzing social networks to identify and target individuals and communities with brand and product-specific marketing. For such marketing, it is intuitively helpful to understand a target audience's interests, i.e., their topics of discussion. Within this context, this study set forth and tested a four-step process that leveraged a directed multilayer network approach for analysis. Augmenting traditional user network (layer) construction, the proposed process leverages Latent Dirichlet Allocation (LDA) with extractive summarization to identify topics; constructs a directed multilayer network with a user layer, topic layer, and appropriate arcs to represent connections; identifies influential users (e.g., via PageRank); and detects the related communities of interest (e.g., via a Greedy Modularity Algorithm).

Testing these techniques for named datasets attained via specific queries and a more generally focused dataset sampled from the named datasets revealed several important findings. First, LDA better identified topics via the directed multilayer network approach when analyzing datasets attained via a broad query, enabling higher coherence scores and better topic separation. The proposed directed multilayer network approach was effective for identifying influential users and communities for such datasets. In contrast, the proposed four-step process was not effective for datasets attained via specific queries. LDA had difficulty identifying distinct topics of conversation, so the inclusion of a topic layer in the network degraded the processes of influential user identification and community detection.

Testing also revealed several procedural insights. The proposed weighting schemes to quantify directed user-to-user, directed user-to-topic, and undirected inter-topic relationships in the multilayer network are conceptually sound and easy to implement. PageRank is the superlative technique among those tested to identify influential users, regardless of the dataset or modeling approach. Twitter verification status strongly relates to influential user identification. For broad-query datasets analyzed via the directed multilayer network approach, larger samples than would be required by a topic-specific dataset are necessary for procedural accuracy. Finally, both GMA and the Leiden algorithm are useful for community detection, regardless of dataset query type or network modeling approach.

An interested analyst or company can readily replicate and automate the proposed four-step process to gather information for marketing via social media. In doing so, it is important to use broad search queries and gather large datasets of Tweets. As a check on expected outcomes, analysis should proceed with the proposed directed multilayer network approach only if the LDA topic coherence is at least approaching 0.5.

The impact of this research would benefit from the following extensions. First, additional study should examine the thresholds for including user-to-topic and inter-topic relationships as arcs in the directed multilayer network. Second, it is relevant to examine more broad-query datasets to verify or refine the proposed threshold for LDA coherence. Third, the effect of required network transformations on the efficacy of the Leiden algorithm merits study, arguably using datasets with known community membership. Finally, the impact of Twitter's recent changes in user verification should be studied to determine if status has an effect on social influence.

As a caveat to the recommendations, it is important to note that relationships are not static, nor is user discourse. Although testing demonstrated the potential benefit of the proposed four-step process for analyzing large, broad-query datasets, analysis supporting marketing must be an iterative process. Only by analyzing a market repeatedly over time may one be aware not only of user interests but also of the evolving user interests that allow a company to exercise marketing initiatives.

6 Disclaimer

The views expressed in this article are those of the authors and do not reflect the official policy or position of the United States Air Force, United States Army, United States Department of Defense, or United States Government.

Acknowledgements The authors thank the editor and the anonymous reviewers for their detailed and constructive comments that improved both the content and presentation of this paper.
Rahmadan MC, Hidayanto AN, Ekasari DS et al (2020) Sentiment analysis and topic modelling using the LDA method related to the flood disaster in Jakarta on Twitter. In: 2020 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), IEEE, pp 126–130
Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. ACM, pp 399–408
Rudrapal D, Das A, Bhattacharya B (2018) A survey on automatic twitter event summarization. J Inf Process Syst 14(1):79–100
Ruiz J, Featherstone JD, Barnett GA (2021) Identifying vaccine hesitant communities on twitter and their geolocations: a network approach
Salehi A, Ozer M, Davulcu H (2018) Sentiment-driven community profiling and detection on social media. In: Proceedings of the 29th ACM Conference on Hypertext and Social Media. ACM, pp 229–237
Scott J, Carrington PJ (2011) The SAGE Handbook of Social Network Analysis. SAGE Publications, London, UK
Sheth A, Shalin VL, Kursuncu U (2022) Defining and detecting toxicity on social media: context and knowledge are key. Neurocomputing 490:312–318
Sievert C, Shirley K (2014) Ldavis: A method for visualizing and interpreting topics. In: Proceedings of Workshop on Interactive Language Learning, Visualization, and Interfaces, Association for Computational Linguistics, pp 63–70
Tang L, Liu H (2011) Leveraging social media networks for classification. Data Min Knowl Discov 23(3):447–478
Traag VA, Waltman L, Van Eck NJ (2019) From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9(1):1–12
Tsopze N, Domgue FG (2021) Boolean factor based community extraction from directed networks with the non reciprocal link relationship. Inf Sci 569:544–556
Tsugawa S, Ohsaki H (2015) Negative messages spread rapidly and widely on social media. In: Proceedings of the 2015 ACM on Conference on Online Social Networks. ACM, pp 151–160
Venkatesan M, Prabhavathy P (2019) Graph based unsupervised learning methods for edge and node anomaly detection in social network. In: 2019 IEEE 1st International Conference on Energy, Systems and Information Processing (ICESIP), IEEE, pp 1–5
Yang Y, Hsu JH, Löfgren K et al (2021) Cross-platform comparison of framed topics in Twitter and Weibo: machine learning approaches to social media text mining. Soc Netw Anal Min 11(1):1–18
Zhang B, Vos M (2015) How and why some issues spread fast in social media. Online J Commun Media Technol 5(1):90–113

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.