Social Network Analysis - AA - Article - Session 12


How To Get Started with Social Network Analysis

A Complete Beginner’s Guide to Getting Up and Running Making Beautiful Network Graphs

Preface

This guide is intended to get complete beginners in social network analysis up to

speed on terminology and concepts to create and analyze their first network graph.

This is a long article. Feel free to bookmark this article and come back to it from time

to time as you are first learning social network analysis. An outline of this article is

also provided below:

Part 1: Background

1. Why Should I Care About Social Network Analysis?

2. What Does a Social Network Graph Look Like?

3. What Tools Do I Need To Get Started?


Part 2: Terms and Concepts

1. Nodes and Edges

2. Edge Direction

3. Edge Weight

4. Centrality Measures

5. Network-Level Measures

6. Path-Level Measures

7. Connected Components and Bridges

8. Hubs and Authorities

9. Dyads and Cliques

Part 3: Application

1. Implementation

2. Download and Install Gephi

3. Dataset

4. Loading Network Data into Gephi

5. Using Layout Functions

6. Calculating Network-Level Measures

7. Calculating Centrality Measures

8. Calculating Edge Weights

9. Using Color in Network Graphs


10. Using Size in Network Graphs

11. Changing Background Color

12. Recap

Why Should I Care About Social Network Analysis?

Social network analysis (SNA), also known as network science, is a field of data

analytics that uses networks and graph theory to understand social structures. SNA

techniques can also be applied to networks outside of the societal realm.

In order to build SNA graphs, two key components are required: actors and

relationships. A common application of SNA techniques is with the internet. Web

pages on the internet often link to other webpages — either on their own website or

another website. These links can be considered relationships between actors (web

pages). This is actually a key component of search engine architecture.

Networks are all around us — such as road networks, internet networks, and online
social networks like Facebook. While this article focuses on social network analysis
(keyword: social), learning these techniques will give you valuable tools in your

toolbelt to provide insight on a variety of data sources.

What Does a Social Network Graph Look Like?

A social network graph contains points and the lines connecting them, similar to a

connect-the-dots puzzle. The points represent the actors and the lines represent

the relationships. An example of a social network graph can be seen below (taken

from my article “Community Detection of ISIS Twitter Accounts”).

Example Social Network Graph

What Tools Do I Need To Get Started?

Like many things in data science, there are a variety of tools you can use to conduct SNA.
This guide focuses on a specific set of tools in order to get you started making network
graphs and conducting analysis on them. In no way are these the only or best tools

available.

Gephi

This guide will use Gephi, free software for Mac, PC, and Linux, to build network graphs and run some analytics on them. Gephi provides a graphical interface (seen below) and does not require any coding. Gephi can be downloaded here.

Gephi Overview Page

Python/Excel

Gephi expects network data in a specific format, so you will need a tool to create CSV files in that format. With simple data, Excel should suffice. However, for large amounts of data, or for data whose relationships must first be extracted, I recommend using Python.

Don’t fret if you do not have any Python skills — you should still be able to build some

basic networks.
Data Source

You will also need a data source for your network. Network data have two requirements: actors and relationships. In some datasets the relationships must be extracted, while in others they are stated explicitly. I recommend using datasets from Kaggle to get started. Some recommendations are listed below:

1. Marvel Universe Social Network

2. Wikipedia Article Network

3. Deezer Social Network

Nodes and Edges

Up until now, I have referred to both actors and relationships. In network science,

actors are referred to as nodes (the dots on the graph) and relationships as edges (the

lines on the graph). You will see me use this terminology throughout the rest of this

article.
Examples for nodes and relations

Nodes can represent a variety of ‘actors’. In internet networks, nodes can represent

web pages. In social networks, nodes can represent people. In supply chain networks,

nodes can represent organizations. In foreign relations networks, nodes can represent

countries. Whatever a node represents, it is always the thing that has a relationship with another thing.

Edges can represent a variety of ‘relationships’. In internet networks, edges can

represent hyperlinks. In social networks, edges can represent connections. In supply


chain networks, edges can represent the transfer of goods. In foreign relations
networks, edges can represent policies. Like nodes, edges can represent a variety of

things.

Nodes and edges are a key concept in networks, so make sure you have a good

understanding of them before tackling the other concepts.

Edge Direction

There are two types of edges: directed and undirected. It will be necessary to decipher

what type of edge your data contains when building a network graph.

Directed edges run from one node to another, with a starting node and an ending node. For example, when a Twitter user tags another Twitter user in a tweet, that relationship is directed. The user who wrote the tweet (starting node) applied that relationship to the user they tagged (ending node). The tagged user has not necessarily reciprocated that relationship. Another example of a directed edge is a payment. If a customer (starting node) pays a coffee shop (ending node) for a coffee, that relationship is not reciprocated, because the coffee shop has not also paid the customer.


Undirected edges are the opposite of directed edges. These relationships are shared by both parties, with no clear starting node or ending node. For example, if two people are friends on Facebook, that relationship is undirected: Person A is friends with Person B, and Person B is equally friends with Person A. Another example of an undirected edge is shared membership in a Meetup group: Person A is in a group with Person B, and Person B is equally in a group with Person A.


Edge Weight

An edge’s weight is the number of times that edge appears between two specific

nodes. For example, if Person A buys a coffee from a coffee shop 3 times, the edge

connecting Person A and the coffee shop will have a weight of 3. However, if Person B

only buys coffee from the coffee shop once, the edge connecting Person B and the

coffee shop will have a weight of 1.
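
The coffee shop example can be sketched in a few lines of plain Python. This is only an illustration of the counting idea: the `purchases` list is made-up data, and each record stands for one coffee bought.

```python
from collections import Counter

# Hypothetical purchase records: one (customer, shop) pair per coffee bought.
purchases = [
    ("Person A", "Coffee Shop"),
    ("Person A", "Coffee Shop"),
    ("Person A", "Coffee Shop"),
    ("Person B", "Coffee Shop"),
]

# Each distinct (source, target) pair becomes one edge. Its weight is the
# number of times that pair appears in the raw records.
edge_weights = Counter(purchases)

for (source, target), weight in edge_weights.items():
    print(f"{source} -> {target}: weight {weight}")
```

Counting repeated rows like this gives the same weights described above: 3 for Person A and 1 for Person B.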

Centrality Measures

Centrality is a collection of metrics used to quantify how important and influential a

specific node is to the network as a whole. It is important to remember that centrality

measures are used on specific nodes within the network, and do not provide

information on a network level. There are several centrality measures, but this guide

will cover degree, closeness, and betweenness.

Degree

A node’s degree is the number of edges the node has. In an undirected network (see the Edge Direction section), there is only one measure for degree. For example, if Node A has edges connecting it to Node B and Node D, then Node A’s degree is 2.

However, in a directed network, there are three different degree measures. Because these edges have a starting node and an ending node, you can calculate the in-degree (the number of edges for which the node is the ending node), the out-degree (the number of edges for which the node is the starting node), and the degree (the number of edges for which the node is either the starting node or the ending node).
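
As a rough sketch, the three directed degree measures can be counted straight from an edge list. The edges below are invented for illustration, and the variable names are just placeholders.

```python
from collections import Counter

# Hypothetical directed edges as (starting node, ending node) pairs.
directed_edges = [("A", "B"), ("A", "D"), ("C", "A"), ("B", "D")]

# Out-degree: how many edges start at each node.
out_degree = Counter(source for source, _ in directed_edges)
# In-degree: how many edges end at each node.
in_degree = Counter(target for _, target in directed_edges)

# Degree: edges where the node is either the starting or the ending node.
nodes = sorted({node for edge in directed_edges for node in edge})
degree = {node: in_degree[node] + out_degree[node] for node in nodes}

for node in nodes:
    print(node, "in:", in_degree[node], "out:", out_degree[node], "total:", degree[node])
```

A `Counter` returns 0 for nodes it has never seen, so a node with no incoming edges (Node C here) still gets an in-degree of 0 rather than an error.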

Closeness
Closeness measures how well connected a node is to every other node in the network. A node’s closeness is based on the average number of hops required to reach every other node in the network, always taking the shortest route. Many tools report closeness as the reciprocal of this average, so that a higher score means a node is closer to everyone else. A hop is the path of an edge from one node to another. For example, as seen in the diagram below, Node A is connected to Node B, and Node B is connected to Node C. For Node A to reach Node C it would take two hops.
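
The hop counting behind closeness can be sketched with a breadth-first search. This is a minimal illustration on the A, B, C chain from the example, not how any particular tool implements it; `hops_from` and `average_hops` are hypothetical helper names, and the sketch assumes every node can reach every other node.

```python
from collections import deque

def hops_from(graph, start):
    """Breadth-first search: fewest hops from start to each reachable node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

def average_hops(graph, node):
    """Average number of hops from node to every other reachable node."""
    hops = hops_from(graph, node)
    others = [h for other, h in hops.items() if other != node]
    return sum(others) / len(others)

# Undirected chain A - B - C, stored as an adjacency list.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

for node in chain:
    avg = average_hops(chain, node)
    # The reciprocal is the form in which closeness is often reported.
    print(node, "average hops:", avg, "reciprocal:", round(1 / avg, 3))
```

Node B sits in the middle of the chain, so it needs fewer hops on average than Node A or Node C.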

Betweenness

Betweenness measures how often a node sits on the shortest paths that other nodes use to reach each other. A node’s betweenness is the number of shortest paths that pass through the node divided by the total number of shortest paths. This gives the share of shortest paths in the network that the node is on. (Formal definitions compute this share separately for each pair of nodes and then add the results.)
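
The ratio described above can be sketched by listing every shortest path between every pair of nodes and checking which ones pass through a given node. This is a brute-force illustration for small graphs only. `all_shortest_paths` and `shortest_path_share` are hypothetical helper names, and leaving out pairs in which the node is itself an endpoint is an assumption of this sketch. Tools that follow the formal per-pair definition can report different numbers.

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(graph, source, target):
    """Return every shortest path from source to target as a list of nodes."""
    # BFS that records, for each node, all predecessors lying on a shortest path.
    hops = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                preds[neighbor] = [node]
                queue.append(neighbor)
            elif hops[neighbor] == hops[node] + 1:
                preds[neighbor].append(node)
    if target not in hops:
        return []
    # Walk backwards from the target; all shortest paths have equal length,
    # so they all reach the source in the same round.
    paths = [[target]]
    while paths[0][0] != source:
        paths = [[p] + path for path in paths for p in preds[path[0]]]
    return paths

def shortest_path_share(graph, node):
    """Share of shortest paths between other node pairs that pass through node."""
    through = 0
    total = 0
    for source, target in combinations(sorted(graph), 2):
        if node in (source, target):
            continue  # assumption: only count the node as an intermediate stop
        for path in all_shortest_paths(graph, source, target):
            total += 1
            if node in path:
                through += 1
    return through / total if total else 0.0

# Undirected chain A - B - C: the only way from A to C is through B.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(shortest_path_share(chain, "B"))
print(shortest_path_share(chain, "A"))
```

In the chain, every shortest path between the other two nodes runs through Node B, while none runs through Node A.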

Network-Level Measures

Metrics can also be calculated on the network level, evaluating the entire network

instead of a single node in it. Like centrality measures, there are a variety of network-

level measures. This guide will cover size and density.

Network Size

Network size is the number of nodes in the network. The size of a network does not

take into consideration the number of edges. For example, a network with nodes A, B,

and C has a size of 3.


Network Density

Network density is the number of edges divided by the total number of possible edges. For example, in a network with Node A connected to Node B, and Node B connected to Node C, the network density is 2/3 because there are two edges out of a possible three (A-B, B-C, and A-C).
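
As a small sketch, density can be computed from the node and edge counts alone. The function name `density` is just a placeholder. The count of possible edges uses the standard formulas: n(n-1)/2 for an undirected network of n nodes, and n(n-1) for a directed one.

```python
def density(num_nodes, num_edges, directed=False):
    """Number of edges divided by the number of possible edges."""
    possible = num_nodes * (num_nodes - 1)  # every ordered pair of distinct nodes
    if not directed:
        possible //= 2  # A-B and B-A are the same undirected edge
    return num_edges / possible if possible else 0.0

# The example above: three nodes, two undirected edges.
print(density(3, 2))
```

With three nodes there are three possible undirected edges, so two edges give a density of 2/3.
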
Path-Level Measures

Path-level measures provide information for a path between one node and another

node. Paths follow edges between nodes, known as hops. There are also many

different path-level measures, but this article will cover length and distance.

Length

Length is the number of edges, or hops, along a given path between the starting and ending nodes. Because two nodes can be joined by more than one path, the path must be chosen before its length can be calculated.
Distance

Distance is the number of edges, or hops, between the starting and ending nodes along the shortest path. Unlike length, the distance between two nodes uses only the shortest path, that is, the path that requires the fewest hops.
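
The difference between length and distance can be sketched as follows. The four-node network and the helper names `path_length` and `distance` are made up for illustration: length is measured along a path you supply, while distance searches for the path with the fewest hops.

```python
from collections import deque

def path_length(graph, path):
    """Number of hops along a given path; every hop must be an existing edge."""
    for a, b in zip(path, path[1:]):
        if b not in graph[a]:
            raise ValueError(f"no edge between {a} and {b}")
    return len(path) - 1

def distance(graph, start, end):
    """Fewest hops from start to end, or None if end cannot be reached."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            return hops[node]
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return None

# Hypothetical undirected network: A-B, B-C, C-D, plus a direct edge A-D.
graph = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}

long_way = path_length(graph, ["A", "B", "C", "D"])  # length of one chosen path
shortest = distance(graph, "A", "D")                 # distance between A and D
print(long_way, shortest)
```

The path through B and C has a length of 3, but the distance between A and D is 1 because a direct edge exists.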

Connected Components and Bridges

Not all nodes in a network will necessarily be connected to each other. A connected

component is a group of nodes that are connected to each other, but not connected to

another group of nodes. Another way of thinking of this is a group of connected nodes

that have no path to a node from another group. Depending on the network, there can

be many connected components, or even only one. The below diagram shows a

network with two connected components.


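A bridge is a node that, when removed, creates an additional connected component. Another way of thinking about it is that a bridge is the sole connection between one group of connected nodes and another group of connected nodes. (Graph theory texts usually reserve the term bridge for an edge with this property and call such a node a cut vertex or articulation point.)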
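
Both ideas can be sketched in plain Python: group the nodes into connected components, then test whether removing a node increases the number of components. The network below and the helper names are hypothetical, and the removal test is a simple brute-force check rather than an efficient algorithm.

```python
def connected_components(graph):
    """Group the nodes of an undirected graph into connected components."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node])
        seen |= component
        components.append(component)
    return components

def without_node(graph, removed):
    """Copy of the graph with one node and all of its edges deleted."""
    return {
        node: [nb for nb in neighbors if nb != removed]
        for node, neighbors in graph.items()
        if node != removed
    }

def splits_network(graph, node):
    """True if removing node leaves more connected components than before."""
    before = len(connected_components(graph))
    after = len(connected_components(without_node(graph, node)))
    return after > before

# Two triangles (A-B-C and C-D-E) that share Node C, plus a separate pair F-G.
graph = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D", "E"],
    "D": ["C", "E"], "E": ["C", "D"], "F": ["G"], "G": ["F"],
}

print(connected_components(graph))
print(splits_network(graph, "C"), splits_network(graph, "A"))
```

This network starts with two connected components. Removing Node C separates A and B from D and E, while removing Node A leaves everything else connected.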


Hubs and Authorities

Hubs and Authorities are node classifications used in directed networks. A hub is a

node that has many edges pointing out of it. You can also think of a hub as a node that

is the starting node of many edges. An authority, on the other hand, is a node that has

many edges pointing to it. You can also think of authority as a node that is the ending

node of many edges. There is no predefined number of edges that makes a node a hub or an authority; the threshold depends on the network. In addition, remember that not all nodes in a directed network will be a hub or an authority.
Dyads and Cliques

Dyads and cliques are groups of nodes connected by edges. A dyad is a pair of two connected nodes, while a clique is a group of three or more nodes in which every node is directly connected to every other node. While a dyad or clique may be a connected component on its own, it can also be part of a larger connected component.
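
A quick way to test a group of nodes is to check every pair within it. The sketch below assumes the usual graph theory meaning of a clique, in which every pair of nodes in the group is directly connected. The network and the name `is_clique` are made up for illustration.

```python
from itertools import combinations

def is_clique(graph, nodes):
    """True if every pair of the given nodes is directly connected."""
    return all(b in graph[a] for a, b in combinations(nodes, 2))

# Hypothetical undirected network: A, B, and C all know each other; D only knows C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

print(is_clique(graph, ["A", "B", "C"]))  # every pair is connected
print(is_clique(graph, ["B", "C", "D"]))  # B and D are not connected
print(is_clique(graph, ["C", "D"]))       # a connected pair, i.e. a dyad
```

Here A, B, and C form a clique inside a larger connected component that also contains D.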

Implementation
Now that you have an understanding of social network analysis terms and concepts,

this guide will walk you through applying these techniques to a dataset using the

Gephi software.

Download and Install Gephi

First, download and install the Gephi software for the operating system your machine

is running. Gephi is available for Mac, PC, and Linux and can be downloaded here.

Dataset

For this guide, we will be using the Marvel Universe Social Network dataset from Kaggle. This dataset is already laid out as a node list and an edge list. Datasets that are not structured as a network will require some data transformation first, and I recommend using Python and Pandas in these situations.

The dataset can be downloaded here.

After downloading the dataset, there will be three csv files: nodes, edges, and network.

Open the file nodes.csv in Excel.


nodes.csv in Excel

The nodes file contains a list of all the nodes in the network. This file has two

columns: node and type. This network contains two different types of nodes that

represent different actor types. If you are familiar with object-oriented programming,

you can think of the node type as a class and nodes as objects. The two types of nodes

in this network are heroes and comics.

There is no data preparation needed to import this node list into Gephi, so we will

close the file.

Open the file edges.csv in Excel.


edges.csv in Excel

The edges file also contains two columns: hero and comic. Each row in this table

represents a single edge. The hero node and comic node are the two nodes connected

by the edge.

In Gephi, an edges table requires the column headers of ‘source’ and ‘target’. In an

undirected network it does not matter which node is in which column. However, in a

directed network the source column contains the starting node and the target column

contains the ending node. Rename column A to ‘source’ and column B to ‘target’.

Then save the file.
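
If you would rather script the renaming step than do it in Excel, a minimal sketch using only Python's built-in csv module could look like the following. The function name `rename_edge_headers` and the output file name used in the note below are hypothetical, and the sketch assumes the file has exactly the two columns described above.

```python
import csv

def rename_edge_headers(in_path, out_path):
    """Copy an edge list, replacing its header row with 'source' and 'target'."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        next(reader, None)                     # skip the original header row (hero, comic)
        writer.writerow(["source", "target"])  # headers Gephi expects for an edges table
        writer.writerows(reader)               # copy every edge row unchanged
```

Calling `rename_edge_headers("edges.csv", "edges_gephi.csv")` would write a renamed copy next to the original, leaving the downloaded file untouched. The same rename can be done with Pandas if you already use it.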


edges.csv with renamed column headers

Loading Network Data into Gephi

Now that the node and edge lists are properly formatted for Gephi, it is time to load

the data!

Open the Gephi software. You should see the below screen.

Click on ‘New Project’. If you do not see the welcome screen, go to File > New Project.

Then, click the Data Laboratory tab.


The data laboratory tab is where we will load in our edge and node list files. To import

a list click the import spreadsheet button.

Then navigate to the folder containing the datasets and open the nodes file.
An import wizard will then step you through correctly importing the node list. Set

Separator to Comma, Import as to Nodes table, and Charset as UTF-8. Then click

next.
After clicking next, the wizard will provide additional setting configurations. Set Time

representation to Intervals. For Imported columns, check the node and type boxes

and set their data types to string. Then, click finish.

There is one more step in importing the nodes list. Set Graph Type to Undirected and

Edges merge strategy to Sum. Ensure that it is set up to append to the existing

workspace. Then, click OK.


You should now see some data in the data laboratory window! Next we need to import
the edges list.
To import the edges list, click on Import Spreadsheet and open the edges.csv file. In

the import wizard set Separator to Comma, Import as to Edges table, and Charset to

UTF-8. Then click next.


Set Time representation to Intervals then click Finish.

Then, set Graph Type to Undirected and Edges merge strategy to Sum. Choose

append to existing workspace. Click OK.


Congrats! You have just imported the node and edge lists! In the Data Laboratory, you can switch your view between these two lists by clicking on Nodes or Edges in the top left-hand corner.
Now that the data has been imported it is time to view the graph. Click on the

overview tab.

Using Layout Functions


You might be disappointed in the graph that was visualized. It will likely look like the

black mess below.

In order to make the graph more readable, we will need to use a layout function to

change the position of nodes in the graph.


Gephi offers a variety of layout functions. In this guide, we will be using the ForceAtlas 2 function. Select this function and then click run. You will see the nodes move in real time, and you can stop the function when you like the position of the nodes.
After running the layout function your graph should look something like the one

shown below. You can continue to play with other layout functions if you wish to get a

better node position. This is by no means the best node layout for this graph. In

addition, you can change the parameters of layout functions. While this guide uses the

stock ForceAtlas 2 parameters, changing them can give you better control over the

node positions.

Calculating Network-Level Measures

This guide previously covered the network-level measures of Size and Density. Let's

calculate what the network size and density of this Marvel network are.

The network size is easy to find. In the upper right-hand corner is a pane called

Context. This window provides the number of nodes and edges in the graph. Because

a network’s size is the number of nodes in it, the network size of our Marvel network

is 19,090.
To find the network density, we will take our first dive into the statistics window. Click

on the statistics tab shown below.

You should then see the below window.


The statistics window contains many measures that can be calculated on the network.

To find the network density, click run for Graph Density.

Select undirected, and then click OK.


A new window will then pop up showing the results. This Marvel network has a density of 0.001.
You can save this report by clicking the save button in the bottom left-hand corner, or

close it by clicking the close button in the bottom right-hand corner.

Calculating Centrality Measures

Recall that centrality measures apply at the node level, not the network level. However, centrality measures can also be averaged to get a network-level metric. In Gephi, you run a centrality measure as a network-level statistic, and Gephi then also writes the node-level values into the data laboratory tab.

Node Degree
To calculate node degree, click run on the average degree algorithm in the statistics

window.

The report will provide you with the average degree for the network, as well as a

distribution graph. While these can be useful in some applications, we are more

interested in the degree on a node-level. Click close on the report.


To see the degree for each node in the network, go back to the data laboratory window

and click on the node table. You will see a new column in the data titled degree.
Node Closeness and Betweenness

Calculating node closeness and betweenness is similar to calculating node degree. In the statistics window, click run on the network diameter algorithm.
Select undirected and click OK. Depending on the specs of your machine this may

take a little bit to calculate.

As with node degree, Gephi will provide a network-level report. Click close on this report and go to the data laboratory.


In the data laboratory, you will find additional columns in the node table including

the node betweenness and closeness.


Calculating Edge Weights

Edge weights are auto-calculated in Gephi and can be found in the edge list in the

data laboratory.
Using Color in Network Graphs

Currently, our graph nodes and edges are black, providing no additional information.

Both nodes and edges can be color-coded in Gephi to provide additional information.

Coloring options can be found in the appearance window.

To color-code the nodes of the graph based on the node degree, click on the nodes

button and the color palette button in the appearance window.


There are three options to encode information in the color of nodes: Unique,

Partition, and Ranking. If you want to change the color of all nodes of the graph to the

same color, use the Unique window. Partition will break the nodes into color-coded

groups. Ranking will color-code the nodes on a scale.

Let's color the nodes by their degree. To do this, click on the ranking section and select degree.
A color scale will be used to color the nodes. To select a new scale, click on the color

selector button to the right of the color scale.


You can select any color scale to use. Then click apply.
As you can see in the above image, coloring our nodes also colored our edges. You can

change the color of edges to a specific color using the Unique color tab for edges, or

apply a ranking or partitioning color scale to them.

Using Size in Network Graphs

You may also notice that the majority of the graph is colored red. This is because most

nodes in the graph have a low degree. Zooming in will show that some nodes are

yellow or blue.
To make these nodes easier to see in the graph let’s scale the size of the nodes to the

node degree as well. To do this click on the nodes and size buttons in the appearance

window.

Then, click on ranking and select degree. Change the minimum size to 1 and the
maximum size to 100.
Then click apply.
As can be seen in the above graph, we can now better see what nodes have a high

degree.

Changing Background Color

Let's also change the background from white to black. Depending on the colors used in a graph, either background may look better, and the choice is often a matter of personal preference. To change the color to black, press the lightbulb button.


Recap

That is the end of this guide. While this should get you started making your first network graph using the Marvel dataset, I encourage you to continue playing around with this graph in Gephi. There are many more measures you can calculate and other appearances you could use.

Your next step should be to take another dataset and try to reproduce these steps on

that data. Finally, you can try to collect your own data and transform it into network
data.

I hope this guide was useful to you! Feel free to reach out if you have any questions!
