Social Network Analysis - AA - Article - Session 12
Preface
speed on terminology and concepts to create and analyze their first network graph.
This is a long article. Feel free to bookmark this article and come back to it from time
to time as you are first learning social network analysis. An outline of this article is
Part 1: Background
2. Edge Direction
3. Edge Weight
4. Centrality Measures
5. Network-Level Measures
6. Path-Level Measures
Part 3: Application
1. Implementation
3. Dataset
12. Recap
Social network analysis (SNA), also known as network science, is a field of data
analytics that uses networks and graph theory to understand social structures. SNA
In order to build SNA graphs, two key components are required: actors and
pages on the internet often link to other webpages — either on their own website or
another website. These links can be considered relationships between actors (web
Networks are all around us — such as road networks, internet networks, and online
social networks like Facebook. While this article focuses on social network analysis
(keyword: social), learning these techniques will give you valuable tools in your
A social network graph contains both points and lines connecting those dots — similar
to a connect-the-dot puzzle. The points represent the actors and the lines represent
the relationships. An example of a social network graph can be seen below (taken
Like many things in data science, there are a variety of tools you can use to conduct SNA.
This guide focuses on a specific set of tools in order to get you started making network
graphs and conducting analysis on them. In no way are these the only or best tools
available.
Gephi
This guide will use Gephi, free software for Mac, PC, and Linux, to build
network graphs and run some analytics on them. Gephi provides a graphical interface
(seen below) and will not require any coding to use. Gephi can be downloaded here.
Python/Excel
To build network graphs in Gephi, the data must be in a specific format, so a
tool is needed to create the CSV files that Gephi imports.
With simple data, Excel should suffice. However, when working with large amounts of
data, or data whose relationships must be extracted, Python is recommended.
Don’t fret if you do not have any Python skills — you should still be able to build some
basic networks.
Data Source
You will also need a data source for your network. Network data have two
requirements: actors and relationships. In some datasets the relationships must be
extracted, while in others they are stated explicitly. I recommend using
datasets from Kaggle to get started. Some recommendations are listed below:
Up until now, I have referred to both actors and relationships. In network science,
actors are referred to as nodes (the dots on the graph) and relationships as edges (the
lines on the graph). You will see me use this terminology throughout the rest of this
article.
Examples for nodes and relations
Nodes can represent a variety of ‘actors’. In internet networks, nodes can represent
web pages. In social networks, nodes can represent people. In supply chain networks,
nodes can represent organizations. In foreign relations networks, nodes can represent
countries. While nodes can represent a variety of things, in every case they are the
things that have relationships.
Nodes and edges are a key concept in networks, so make sure you have a good
Edge Direction
There are two types of edges: directed and undirected. It will be necessary to decipher
what type of edge your data contains when building a network graph.
Directed edges are applied from one node to another with a starting node and an
ending node. For example, when a Twitter user tags another Twitter user in a tweet,
that relationship is directed. The user who wrote the tweet (starting node) applied
that relationship to the user who they tagged (ending node). The tagged user has not
payments. If a customer (starting node) pays a coffee shop (ending node) for a coffee,
that relationship is not necessarily reciprocated because the coffee shop has not also
reciprocated by both parties without a clear starting node and ending node. For
example, if two people are friends on Facebook, that relationship is undirected. This is
because it can be said that Person A is friends with Person B, but it can also be said
Meetup groups. This is because it can be said that Person A is in a group with Person
An edge’s weight is the number of times that edge appears between two specific
nodes. For example, if Person A buys a coffee from a coffee shop 3 times, the edge
connecting Person A and the coffee shop will have a weight of 3. However, if Person B
only buys coffee from the coffee shop once, the edge connecting Person B and the
coffee shop will have a weight of 1.
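The coffee shop example can be sketched in a few lines of Python. This is only an illustration: the purchase records are made up, and counting repeated pairs with `Counter` is one simple way to get a weight, not necessarily how any particular tool does it.

```python
from collections import Counter

# Hypothetical purchase records: one (customer, shop) pair per purchase.
purchases = [
    ("Person A", "Coffee Shop"),
    ("Person A", "Coffee Shop"),
    ("Person A", "Coffee Shop"),
    ("Person B", "Coffee Shop"),
]

# The weight of an edge is how many times that pair of nodes appears.
edge_weights = Counter(purchases)

for (source, target), weight in edge_weights.items():
    print(f"{source} -> {target}: weight {weight}")
```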
Centrality Measures
measures are used on specific nodes within the network, and do not provide
information on a network level. There are several centrality measures, but this guide
Degree
network (see the Edge Direction section), there is only one measure for degree. For
example, if Node A has edges connecting it to Node B and Node D, then Node A’s
degree is 2.
However, in a directed network, there are actually three different degree measures.
Because these edges have a starting and end node, the in-degree (number of edges the
node is an end node of), out-degree (number of edges the node is a starting node of), and
degree (number of edges the node is either a starting node or end node of) can be
calculated.
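As a rough sketch, the three degree measures can be counted straight from a list of directed edges. The edge list below is hypothetical and only meant to show the counting.

```python
from collections import Counter

# Hypothetical directed edges as (starting node, ending node) pairs.
edges = [("A", "B"), ("A", "D"), ("C", "A")]

# Out-degree: how often a node appears as a starting node.
out_degree = Counter(source for source, _ in edges)
# In-degree: how often a node appears as an ending node.
in_degree = Counter(target for _, target in edges)


def degree(node):
    """Total degree: edges the node either starts or ends."""
    return in_degree[node] + out_degree[node]


print(in_degree["A"], out_degree["A"], degree("A"))  # 1 2 3
```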
Closeness
Closeness measures how well connected a node is to every other node in the network.
A node’s closeness is the average number of hops required to reach every other node
in the network. A hop is the path of an edge from one node to another. For example,
connected to Node C. For Node A to reach Node C it would take two hops.
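To make the hop counting concrete, here is a minimal sketch that computes the average number of hops from one node to every other node using a breadth-first search. The three-node network is an assumption chosen to mirror the example above, and the function names are my own. Note that some libraries report the reciprocal of this average so that a higher value means a more central node.

```python
from collections import deque


def hops_from(adjacency, start):
    """Breadth-first search: fewest hops from start to each reachable node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def average_hops(adjacency, node):
    """Average hops from node to every other node it can reach."""
    hops = hops_from(adjacency, node)
    others = [h for n, h in hops.items() if n != node]
    return sum(others) / len(others)


# Assumed undirected network: A - B - C
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(average_hops(adjacency, "A"))  # 1.5
print(average_hops(adjacency, "B"))  # 1.0
```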
Betweenness
the percentage of shortest paths in the network that the node is in.
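Taking that description literally, the share of shortest paths that pass through a node can be sketched with a brute-force search. This is an illustration on an assumed three-node network, with function names of my own choosing. Standard betweenness centrality sums a per-pair fraction instead, so the numbers a tool reports may differ from this simple share.

```python
from collections import deque
from itertools import combinations


def all_shortest_paths(adjacency, start, end):
    """Return every shortest path from start to end as a list of nodes."""
    hops = {start: 0}
    previous = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                previous[neighbor] = [node]
                queue.append(neighbor)
            elif hops[neighbor] == hops[node] + 1:
                previous[neighbor].append(node)
    if end not in hops:
        return []
    paths = []

    def walk_back(node, tail):
        if node == start:
            paths.append([start] + tail)
            return
        for earlier in previous[node]:
            walk_back(earlier, [node] + tail)

    walk_back(end, [])
    return paths


def share_of_shortest_paths(adjacency, node):
    """Fraction of shortest paths between other nodes that include node."""
    total = through = 0
    others = [n for n in adjacency if n != node]
    for start, end in combinations(others, 2):
        for path in all_shortest_paths(adjacency, start, end):
            total += 1
            through += node in path
    return through / total if total else 0.0


# Assumed undirected network: A - B - C
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(share_of_shortest_paths(adjacency, "B"))  # 1.0
print(share_of_shortest_paths(adjacency, "A"))  # 0.0
```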
Network-Level Measures
Metrics can also be calculated on the network level, evaluating the entire network
instead of a single node in it. Like centrality measures, there are a variety of network-
Network Size
Network size is the number of nodes in the network. The size of a network does not
take into consideration the number of edges. For example, a network with nodes A, B,
Network density is the number of edges divided by the total possible edges. For
example, in a network where Node A is connected to Node B and Node B is connected to
Node C, the network density is 2/3 because there are two edges out of a possible three.
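The density calculation can be written as a short function. This sketch assumes no self-loops and at least two nodes; the `directed` flag is my own addition for networks where each pair of nodes can have an edge in both directions.

```python
def density(node_count, edge_count, directed=False):
    """Edges present divided by the total possible edges."""
    # Every node can connect to every other node.
    possible = node_count * (node_count - 1)
    if not directed:
        # In an undirected network A-B and B-A are the same edge.
        possible //= 2
    return edge_count / possible


# Three nodes (A, B, C) and two edges (A-B, B-C): 2 out of a possible 3.
print(density(3, 2))
```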
Path-Level Measures
Path-level measures provide information for a path between one node and another
node. Paths follow edges between nodes, known as hops. There are also many
different path-level measures, but this article will cover length and distance.
Length
Length is the number of edges between the starting and ending nodes, known as hops.
In order to calculate the length between two nodes, a path must be predetermined.
Distance
Distance is the number of edges or hops between the starting and ending nodes
following the shortest path. Unlike length, the distance between two nodes uses only
the shortest path, that is, the path that requires the fewest hops.
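The difference between length and distance is easier to see side by side. The sketch below uses a small made-up network; the length function simply counts the hops of a path you hand it, while the distance function searches for the shortest path with a breadth-first search.

```python
from collections import deque

# Hypothetical undirected network: A, B, and C all connect; D hangs off C.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}


def path_length(path):
    """Length of a predetermined path: the number of edges (hops) it follows."""
    return len(path) - 1


def distance(adjacency, start, end):
    """Hops along the shortest path, found with a breadth-first search."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            return hops[node]
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return None  # no path between the two nodes


print(path_length(["A", "B", "C", "D"]))  # 3
print(distance(adjacency, "A", "D"))      # 2
```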
Not all nodes in a network will necessarily be connected to each other. A connected
component is a group of nodes that are connected to each other, but not connected to
another group of nodes. Another way of thinking of this is a group of connected nodes
that have no path to a node from another group. Depending on the network, there can
be many connected components, or even only one. The below diagram shows a
of thinking about it is that a bridge is a node that is the sole connection of a group of
Hubs and Authorities are node classifications used in directed networks. A hub is a
node that has many edges pointing out of it. You can also think of a hub as a node that
is the starting node of many edges. An authority, on the other hand, is a node that has
many edges pointing to it. You can also think of authority as a node that is the ending
node of many edges. There is no predefined number of edges that makes a node a
hub or an authority; the threshold depends on the network. In addition, remember that
not all nodes in a directed network will be a hub or an authority.
Dyads and Cliques
Dyads and cliques are groupings of nodes connected by edges. A dyad is a pair of
two nodes, while a clique is a group of three or more nodes that are all connected to
one another. While a dyad or clique
component.
Implementation
Now that you have an understanding of social network analysis terms and concepts,
this guide will walk you through applying these techniques to a dataset using the
Gephi software.
First, download and install the Gephi software for the operating system your machine
is running. Gephi is available for Mac, PC, and Linux and can be downloaded here.
Dataset
For this guide, we will be using the Marvel Universe Social Network dataset from
Kaggle. While this dataset is already laid out with a node and edge list, working
with datasets not structured as a network will require some data transformation
After downloading the dataset, there will be three CSV files: nodes, edges, and network.
The nodes file contains a list of all the nodes in the network. This file has two
columns: node and type. This network contains two different types of nodes that
represent different actor types. If you are familiar with object-oriented programming,
you can think of the node type as a class and nodes as objects. The two types of nodes
There is no data preparation needed to import this node list into Gephi, so we will
The edges file also contains two columns: hero and comic. Each row in this table
represents a single edge. The hero node and comic node are the two nodes connected
by the edge.
In Gephi, an edges table requires the column headers of ‘source’ and ‘target’. In an
undirected network it does not matter which node is in which column. However, in a
directed network the source column contains the starting node and the target column
contains the ending node. Rename column A to ‘source’ and column B to ‘target’.
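If you would rather not rename the headers by hand, the same change can be sketched in Python with the standard `csv` module. The rows below are invented stand-ins for the real edges file, and an in-memory buffer is used so the example runs on its own; with real data you would open the downloaded file instead.

```python
import csv
import io

# Stand-in for the edges file; these rows are made up for illustration.
raw_edges = io.StringIO("hero,comic\nHERO ONE,COMIC 1\nHERO TWO,COMIC 1\n")

reader = csv.reader(raw_edges)
next(reader)  # skip the original 'hero,comic' header row

output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(["source", "target"])  # the headers the edges table needs
writer.writerows(reader)  # copy every edge row unchanged

print(output.getvalue())
```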
Now that the node and edge lists are properly formatted for Gephi, it is time to load
the data!
Open the Gephi software. You should see the below screen.
Click on ‘New Project’. If you do not see the welcome screen, go to File > New Project.
Then navigate to the folder containing the datasets and open the nodes file.
An import wizard will then step you through correctly importing the node list. Set
Separator to Comma, Import as to Nodes table, and Charset as UTF-8. Then click
next.
After clicking next, the wizard will provide additional setting configurations. Set Time
representation to Intervals. For Imported columns, check the node and type boxes
There is one more step in importing the nodes list. Set Graph Type to Undirected and
Edges merge strategy to Sum. Ensure that it is set up to append to the existing
the import wizard set Separator to Comma, Import as to Edges table, and Charset to
Then, set Graph Type to Undirected and Edges merge strategy to Sum. Choose
switch your view between these two lists by clicking on Nodes or Edges in the top left-
hand corner.
Now that the data has been imported, it is time to view the graph. Click on the
Overview tab.
In order to make the graph more readable, we will need to use a layout function to
using the ForceAtlas 2 function. Select this function and then click run. You will see
the nodes move in real-time, and you can stop the function when you like the position
of the nodes.
After running the layout function your graph should look something like the one
shown below. You can continue to play with other layout functions if you wish to get a
better node position. This is by no means the best node layout for this graph. In
addition, you can change the parameters of layout functions. While this guide uses the
stock ForceAtlas 2 parameters, changing them can give you better control over the
node positions.
This guide previously covered the network-level measures of Size and Density. Let's
calculate what the network size and density of this Marvel network are.
The network size is easy to find. In the upper right-hand corner is a pane called
Context. This window provides the number of nodes and edges in the graph. Because
a network’s size is the number of nodes in it, the network size of our Marvel network
is 19,090.
To find the network density, we will take our first dive into the statistics window. Click
density of 0.001
You can save this report by clicking the save button in the bottom left-hand corner, or
Recall that centrality measures are calculated at the node level, not the network level. However,
centrality measures can also be averaged to get a network-level metric. In Gephi, you
calculate centrality measures as a network-level average, which then also inputs the
Node Degree
To calculate node degree, click run on the average degree algorithm in the statistics
window.
The report will provide you with the average degree for the network, as well as a
distribution graph. While these can be useful in some applications, we are more
and click on the node table. You will see a new column in the data titled degree.
Node Closeness and Betweenness
degree. In the statistics window, click run on the network diameter algorithm.
Select undirected and click OK. Depending on the specs of your machine this may
Like with the node measure, Gephi will provide a network-level report. Click close on
Edge weights are auto-calculated in Gephi and can be found in the edge list in the
data laboratory.
Using Color in Network Graphs
Currently, our graph nodes and edges are black, providing no additional information.
Both nodes and edges can be color-coded in Gephi to provide additional information.
To color-code the nodes of the graph based on the node degree, click on the nodes
Partition, and Ranking. If you want to change the color of all nodes of the graph to the
same color, use the Unique window. Partition will break the nodes into color-coded
Let's color the nodes by their degree. To do this, click on the ranking section and select
degree.
A color scale will be used to color the nodes. To select a new scale, click on the color
change the color of edges to a specific color using the Unique color tab for edges, or
You may also notice that the majority of the graph is colored red. This is because most
nodes in the graph have a low degree. Zooming in will show that some nodes are
yellow or blue.
To make these nodes easier to see in the graph let’s scale the size of the nodes to the
node degree as well. To do this click on the nodes and size buttons in the appearance
window.
Then, click on ranking and select degree. Change the minimum size to 1 and the
maximum size to 100.
Then click apply.
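To get a feel for what a ranking by size does, here is a minimal sketch that maps a node's degree onto the 1 to 100 size range. It assumes a straight linear mapping, which may not match exactly how Gephi interpolates sizes, and the degree values are hypothetical.

```python
def scaled_size(degree, min_degree, max_degree, min_size=1.0, max_size=100.0):
    """Linearly map a degree onto the range [min_size, max_size]."""
    if max_degree == min_degree:
        return min_size  # every node has the same degree
    fraction = (degree - min_degree) / (max_degree - min_degree)
    return min_size + fraction * (max_size - min_size)


# Hypothetical network whose degrees run from 1 to 201.
print(scaled_size(1, 1, 201))    # 1.0
print(scaled_size(101, 1, 201))  # 50.5
print(scaled_size(201, 1, 201))  # 100.0
```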
As can be seen in the above graph, we can now better see what nodes have a high
degree.
Let's also change the background from white to black. Depending on the colors used
in a graph, either background may look better, and the choice is often a matter of
personal preference. To
That is the end of this guide. While this should get you started making your first
network graph using the Marvel dataset, I encourage you to continue playing around
with this graph in Gephi. There are many more measures you can calculate and other
Your next step should be to take another dataset and try to reproduce these steps on
that data. Finally, you can try to collect your own data and transform it into network
data.
I hope this guide was useful to you! Feel free to reach out if you have any questions!