Evaluation and Development of Data Mining Tools
for Online Social Networks
Dhiraj Murthy, Alexander Gross, Alex Takata, and Stephanie Bond
Social Network Innovation Lab, Bowdoin College, Brunswick, Maine
{dmurthy,agross,atakata,sbond2}@bowdoin.edu
Abstract. This chapter reviews existing data mining tools for scraping data from heterogeneous online social networks. It introduces not only the complexities of scraping
data from these sources (which include diverse data forms), but also presents currently available tools including their strengths and weaknesses. The chapter introduces
our solution to effectively mining online social networks through the development of
VoyeurServer, a tool we designed which builds upon the open-source Web-Harvest
framework. We have shared details of how VoyeurServer was developed and how it
works so that data mining developers can develop their own customized data mining
solutions built upon the Web-Harvest framework. We conclude the chapter with future
directions of our data mining project so that developers can incorporate relevant features into their data mining applications.
Keywords. Data Mining, Online Social Networks, Web Content Extraction, Web-Harvest 2.0
1
Introduction
The practice of data mining and web-content extraction is an important
and growing field. Many disciplines are looking at ‘big data’ and ways
to mine and analyze this data as the key to solving everything from
technical problems to better understanding social interactions. For example, large sets of tweets mined from Twitter have been analyzed to
detect natural disasters [1, 2], predict the stock market [3], and track
the time of our daily rituals [4]. As our use of blogs, social networks,
and social media continues to increase, so does our creation of more
web-based hyperlinked data. The successful extraction of this web-based data is of considerable research and commercial value.
Data mining often goes beyond information retrieval, towards a
meta-discovery of structures and entities hidden in seas of data. As
our social interactions become increasingly mediated by Internet-based technologies, the potential to use web-based data for understanding social structures and interactions will continue to increase.
Online social networks are defined as ‘web-based services that allow individuals to (1) construct a public or semi-public profile within a
bounded system, (2) articulate a list of other users with whom they
share a connection, and (3) view and traverse their list of connections
and those made by others within the system’ [5]. Individuals interact
within online social networks through portals such as Facebook, which create a personalized environment and interaction space for the user by combining knowledge of one user's online activity and relationships with information about other networked individuals. It is through data mining algorithms that Twitter, for example, determines recommendations for users to follow or topics which
may be of potential interest. One way to study social networks is by
examining relationships between users and the attributes of these relationships. However, data on a blog, Facebook, or Twitter is not inherently translatable into network-based data. This is where data mining
becomes useful. Social networks typically only provide individual portal
access to one’s egocentric network. Put in the language of social network analysis (SNA), the visible network is constructed in relation to
ego (the individual being studied) and relations of ego, known as ‘alters’, are seen (e.g. Facebook friends). However, in a restricted profile
environment, the alters’ relationships are not revealed. In order to understand network structure (which is key to a systems perspective), the
researcher must use methods like data mining in order to gather information about all users and interactions by iterating over the data. A
variety of different types of tools have been developed to collect this
type of information using the text based framework of the web. These
tools were created for a wide array of purposes. The majority of these
tools have been commercially released. Some of these tools can be
used to construct profiles of individuals based on data from multiple
sources. Given issues of privacy, these tools must be employed ethically.
Despite the existence of a variety of tools, their simplicity and robustness vary widely. There are many types of networks and
online communities that could qualify as a subject of network-based
research. Many of these virtual organizations and networks often share
key elements and structures that are common across social network
technologies. These could include users, groups, communications, and
relationship networks between these entities. Unlike the simpler data that is the subject of most data mining projects, network analysis is not merely focused on generating lists of entities and information. Social networks are more organic in their growth and place emphasis on
relational attributes. SNA seeks to understand how individuals and
groups within networks (termed ‘cliques’) are connected together.
The Social Network Innovation Lab (SNIL) is an interdisciplinary
research lab dedicated to understanding online social networks, social
media, and cyberinfrastructure for virtual organizations. Research at
the SNIL often involves the need for tools that are able to extract social
network-based data for analysis from varied online social communities.
The SNIL currently has projects which require data mining of popular
microblogging services, shared interest forums and traditional social
networks. As part of our ongoing research, we have begun to investigate and develop our own custom data mining tools. As part of this
project, we have researched existing tools, developed a conceptual
framework for general data mining of online social networks, and built and tested prototype implementations of the toolkit
while acquiring data for use in current ongoing projects.
In this chapter, we will consider a variety of common methodologies and technologies for generic data mining and web content extraction. We will highlight a number of features and functionalities we see
as key to effective data mining for social network analysis. We will then
review several current data mining software tools and their goodness
of fit for data mining online social networks. The remainder of the
chapter discusses our development of a data-mining framework for
online social networks. Specifically, we introduce our initial development work in extending the Web-Harvest 2.0 framework to incorporate
some of the important identified needs of data mining for online social
networks. This is followed by a case study of some of our initial results
and discoveries in the use of our pilot technology to acquire data from
an actual online virtual community which is organized around social
network technologies. The remaining sections summarize what we
have learned through this process, and lay out a course for future
development.
2
Web-Content Extraction Technologies
The nature of online social networking sites is such that the information
and data that constitutes the network and its entities are by necessity
distributed over a vast array of unique and dynamically generated
page instances. Even considering only a basic set of common SNS
features (user profiles, friend lists, discussion boards), it is easy to see
how social network-mediated data grows exponentially. In order to
study virtual communities as social networks, the researcher needs to
transform this sea of distributed data into data formatted for network
analysis software (most commonly UCINET for smaller networks and
Pajek for larger networks). In the absence of direct access to the database systems that drive social networks or a site-provided API, one
must utilize other means to capture SNS data for research.
The field of online data extraction or web scraping has existed in one
form or another since the advent of web-based information. The majority of information on the Internet is circulated in the form of HTML content, which wraps data in a nested set of tags that specify how the data is to be visually rendered in the browser. This is suitable for making data easily read and understood on screen or in print, but not so useful when clean, organized, machine-readable datasets are desired. Most online data extraction tools take advantage of
the fact that the HTML is itself a structured data interchange format,
albeit one focused on the display of information. These tools leverage
the HTML format to create language parsers, which can extract the
simple content of the page in an organized way while discarding the
irrelevant material. Generally, most online data extraction technologies
can be classified into several categories with a few hybrids.
2.1
Formats, Conventions, Utilities, and Languages
Technologies in this class are low-level constructs that often derive
from some sort of published standard grammar. This grammar may
then be implemented in whole or in part by other higher-level technologies. They often simply define a way in which data can be ordered,
searched, manipulated, or transformed. For instance, the XPath standard defines a format for finding and isolating pieces of information
from a structured XML document. Similarly, regular expressions are a
format for performing advanced searches and manipulation on unstructured strings of characters. XSLT is a language defined to assist
in the transformation of one type of structured XML document into another (e.g. transforming an HTML document into a simpler RSS feed or
vice-versa). Without well-defined standards for interacting with various
types of data, extraction would be much more difficult. Yet because of the low-level nature of these constructs, it would be hard to use them in isolation to perform any kind of advanced extraction project without constructing a broader framework for their application to a set of data.
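As a brief illustration of how two of these low-level constructs behave in practice, the following Java sketch applies an XPath expression to structured markup and a regular expression to unstructured text. The XML fragment and the handle pattern are invented for illustration and not tied to any particular site:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class LowLevelExtraction {
    // XPath: isolate one piece of information from structured markup.
    public static String nameFrom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("//profile/name", doc);
        } catch (Exception e) {
            return null;
        }
    }

    // Regular expression: pull a pattern out of unstructured text.
    public static String firstHandle(String text) {
        Matcher m = Pattern.compile("@(\\w+)").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nameFrom("<profile><name>Ada</name></profile>")); // Ada
        System.out.println(firstHandle("ping @ada_l about the dataset"));    // ada_l
    }
}
```

Each call does one narrow job; as noted above, a real extraction project still needs a broader framework wrapped around such calls.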
2.2
Libraries
Data extraction libraries often perform the job of wrapping one or more
lower level data manipulation/extraction constructs into an organized
framework within the context of a specific programming language.
These libraries then manage the implementation of a given construct
within a framework useful for further development within a given programming language. Development libraries leave the end goals completely open to the developer. Depending on the time, investment, and
goals of the developer, development libraries can be used to create
anything from simple one-off scripts to higher-level applications with many advanced features.
2.3
Web-based APIs and CLUI
Web-based APIs and command-line user interfaces (CLUIs) often provide a kind
of standardized abstraction layer to certain sets of web content. These
typically wrap development libraries with their exact nature dependent
on the hosting server and application. Furthermore, they will generally
apply and be structured around the content available for one data
source (e.g. a particular website, web-enabled technology, or application). Examples of such tools include Google's OpenSocial API framework, which is an open standard for a set of API features
that any developer could implement for their social networking site.
Other examples include APIs provided by most large popular social
networking sites like Twitter and Facebook.
2.4
Applications
The vast majority of data extraction solutions take the form of applications. Applications make use of a large set of extraction technologies
and development libraries and wrap them in an interface designed
around a set of desired functionality. Depending on that set of functionality and the level of expertise expected of the user by the developer, there can be a wide range of different types of data extraction
applications, each potentially best suited toward certain sets of tasks
and users. This spectrum ranges from self-adapting, learning, fully GUI-based extractors for non-technical users to applications for advanced data extraction that may require some knowledge of programming or data extraction utilities. Many applications fall into this
category. Some of the most common are Helium Scraper, Djuggler,
Newprosoft, Deixto, and Web Harvest.
2.5
Enterprise Suites
This class of data extraction solutions is characterized by providing
very high level, multi-featured, and advanced software solutions, often
delivered in a suite of highly specialized applications. The implementations of these software packages are often fully private (as the code is
often developed from the ground up based on the company's own
proprietary development libraries). Like many enterprise solutions
these software products are often so powerful, advanced, and feature-rich that special training and/or ongoing technical support from the
company itself are required to use these tools effectively. This support
and training often come at significant additional cost beyond the original software license. Though potentially expensive, this support may
allow the client to obtain custom solutions, developed for them by the company's developers in response to the client's specific needs. Pentaho and QL2 are two examples of enterprise solutions.
3
Considerations for Data Mining of Online Social Networks
3.1
Data extraction specification language.
One of the most important features of any professional-level data-mining application for research is the implementation of a robust and dynamic data query specification language. This language should include the ability to define functions, execute loops, and perform conditional branching. Many basic data mining applications just use a GUI to let the user specify the desired extractions, but this will always have its limitations. Some tools use a command-based query language like SQL to scrape data, but a better alternative is a specification language robust enough to effectively emulate a programming language. Conditional branching, loops, and functions, as well as the ability to define and access global and local variables, are all needed qualities for successful data mining of complex structures like social networks.
3.2
Flexible I/O.
At its simplest level, data extraction centers around taking one kind of
information as input and translating, manipulating, or filtering it into another format more appropriate for one's research objectives. In order to
allow for the most possible types of automated data extraction and
manipulations, data mining tools should be able to both read in and
output to a large number of potential data formats. Ideally, the tool
should be able to take input from and make output to all of its supported formats. Common formats that will provide the most amount of
use within a data mining tool kit are various kinds of structured text files like HTML/XML, delimited text, or JSON. Additional important features include the ability to read and write to different types of databases or APIs, or even to execute local system commands. The power and usability of a tool will increase with the number of ways it is able to take input and give output.
3.3
API Interfacing.
There are many different types of social networks, from small shared-interest communities to large global networks. The administrators of these networks often recognize the importance of allowing different ways for people to access their data, and they often enable third parties to develop applications which extend or enhance participation in the network. These Application Programming Interfaces
(APIs) often provide alternate methods for requesting information from
a site. As opposed to simply requesting a page and extracting data
from it, APIs allow developers to make a special kind of request to the API and receive just the raw data they are looking for. Any serious data
mining tool for the analysis of social networks should be created in
such a way as to allow content extraction from both traditional web
pages as well as APIs.
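The contrast can be sketched with a toy comparison: the same follower count extracted from presentation-oriented HTML versus from an API-style response. Both the markup and the JSON field names below are hypothetical and not drawn from any real site's API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApiVersusScrape {
    // Scraping: dig the value out of presentation-oriented HTML.
    public static String fromHtml(String html) {
        Matcher m = Pattern
                .compile("<span class=\"followers\">(\\d+)</span>")
                .matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // API response: already raw data, so extraction is far simpler.
    public static String fromJson(String json) {
        Matcher m = Pattern
                .compile("\"followers\"\\s*:\\s*(\\d+)")
                .matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div><span class=\"followers\">42</span> followers</div>";
        String json = "{\"user\": \"ada\", \"followers\": 42}";
        System.out.println(fromHtml(html)); // 42
        System.out.println(fromJson(json)); // 42
    }
}
```

A tool that supports both paths can fall back to page scraping when a site offers no API, and use the cleaner API route when one exists.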
3.4
Job Scheduling.
There are several types of job scheduling features that would be most
useful in data mining of online social networks.
Now. This option would immediately execute a job. This is the most
basic type of scheduling operation.
Later. This option allows a user to schedule a job for a specific time.
This could be useful to extract data from a site during low traffic hours,
or for a situation where it is known that new information will be posted
or made available at a specified time.
Chain. The chain option would allow one job to be scheduled to start
once another had completed. This is very useful when one data
extraction task is dependent upon the completion of one or more other
tasks. With this option the whole flow could be specified in advance
and sent to the server application as a single project.
Recurring. Recurring jobs are quite valuable in data mining of online
social networks. The vast majority of social network data presents difficulties for mining because social networks can be in continuous flux. For research purposes, it becomes necessary to capture a snapshot of data as it exists at a specified point in time. A feature to consider would be to include data extraction tasks that update the data at regular intervals.
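All four scheduling modes can be modeled with Java's standard `ScheduledExecutorService`. The sketch below uses a simple counter as a stand-in for a real scrape job; it illustrates only the scheduling logic, not Web-Harvest itself:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ScrapeScheduler {
    // Runs a stand-in scrape job under all four modes; returns the run count.
    public static int demo() {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
        AtomicInteger runs = new AtomicInteger();
        CountDownLatch recurringDone = new CountDownLatch(3);
        Runnable job = runs::incrementAndGet; // stand-in for a scrape

        try {
            pool.execute(job);                             // "now"
            pool.schedule(job, 20, TimeUnit.MILLISECONDS); // "later"
            ScheduledFuture<?> recurring = pool.scheduleAtFixedRate(
                    () -> { job.run(); recurringDone.countDown(); },
                    0, 10, TimeUnit.MILLISECONDS);         // "recurring"
            recurringDone.await();   // "chain": wait for the dependency
            recurring.cancel(false);
            pool.execute(job);       // chained job starts only after the await
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("jobs executed: " + demo());
    }
}
```

In a real server, each `Runnable` would invoke the extraction engine with a scraper configuration rather than increment a counter.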
3.5
Concurrency.
Most web-content extraction tools acquire data by essentially creating
a virtual agent to make automated requests from a web host. This is
the same way a web browser works: the browser requests the web site, and the host sends a file containing the information (i.e. HTML) needed to display the web page in the requester's browser. In data mining, this file is simply grabbed and parsed using a variety of methods to obtain data. Most tools just use one agent to visit each page in sequence. By creating multiple page request agents, a data mining tool
could make multiple concurrent requests to the same web host (for
different information). This allows the user to take advantage of the
scalability of the hosting server. For large jobs this feature would play a
key role in speeding up the acquisition of data, and should be a key
component of any data mining tool for online social networks.
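A minimal sketch of this multi-agent pattern uses a thread pool of page-request agents. The `fetch` method here is a stand-in for a real HTTP request, and the URLs are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentAgents {
    // Stand-in for a page-request agent; a real agent would issue an HTTP GET.
    static String fetch(String url) {
        return "<html>content of " + url + "</html>";
    }

    // Dispatch one agent per URL and gather the pages as they complete.
    public static List<String> fetchAll(List<String> urls, int agents) {
        ExecutorService pool = Executors.newFixedThreadPool(agents);
        List<Future<String>> pending = new ArrayList<>();
        for (String url : urls) {
            pending.add(pool.submit(() -> fetch(url)));
        }
        List<String> pages = new ArrayList<>();
        try {
            for (Future<String> f : pending) pages.add(f.get()); // keeps request order
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("/user/1", "/user/2", "/user/3");
        System.out.println(fetchAll(urls, 3).size() + " pages fetched");
    }
}
```

The pool size bounds how many simultaneous requests hit the host, which matters for the rate-limiting concerns discussed in Section 3.7.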
3.6
Progress Management.
More often than not, the analysis of social networks requires large
amounts of data. This is because these networks are most often user-agent based, and each user will generate some amount of activity. Networks can be analyzed effectively by collecting the activity of individual users and their connections (egocentric networks). However, it is
often useful to collect data on all users (or at least from large subsets).
This enables comparative analysis and discovery of connections between subnetworks (e.g. users who act as bridges from one group to
another). In data mining tasks, data extraction is often limited by the
speed that the hosting server allows clients to access data. Aside from
concurrency, there is often little that can be done if the social network
you are mining is slow or very large. If the network is both, it could take
hours or perhaps days to complete a data extraction task. This is why
we identify progress management as an important feature for the data
mining of online social networks. Wherever possible, ideal tools should
attempt to keep track of the progress of data extraction tasks as well
as expected time to completion. This feature will be of great value to
those who are charged with managing one or more data extraction
projects by giving them the information they need to anticipate when data will be ready for post-processing.
3.7
Playing nice.
When setting up large data extraction tasks, the operator of the software tool might be tempted to create a large number of page request agents, which generates a large amount of traffic on the hosting site. This is not only considered bad ‘netiquette’ [6], but also raises ethical and legal concerns. If one’s data mining project is part of academic research, the relevant Institutional Review Board (IRB) should be consulted as well to confirm ethical compliance with human subjects. A
large volume of page requests with a web host could degrade the quality of the experience of other users of the site. Also it could result in the
web host banning all requests from your IP address if the host believes
your requests are malicious. Furthermore, many SNS have specific
policies or terms of service in place that would dictate how much data
can be requested per agent. It is in the best interest of the data extractor to ‘play nice’ and follow any conventions, whether explicit or implicit, about not requesting too much data from a host. If in doubt, contact the host whose data you intend to mine. Any data mining tool kit
for online social networks should implement some kind of standard rate
limiting, but also provide the ability to create custom guidelines depending on the known Terms of Service (TOS) of a web host or for
when the user knows it is acceptable to request large volumes of data.
The idea is to be able to set the tools to get data as fast as possible,
but not “too fast” for the given host.
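One simple way to enforce such a limit is a throttle that spaces requests to a host by a minimum interval. The interval used below is an arbitrary placeholder; the appropriate value is an assumption that should come from the host's TOS or from contact with the host:

```java
public class PoliteThrottle {
    private final long minIntervalMillis;
    private long lastRequest = 0;

    // minIntervalMillis encodes the per-host request budget (an assumption;
    // the real limit should be taken from the host's stated policies).
    public PoliteThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Block until at least minIntervalMillis has passed since the last request.
    public synchronized void acquire() {
        long wait = lastRequest + minIntervalMillis - System.currentTimeMillis();
        if (wait > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        lastRequest = System.currentTimeMillis();
    }

    public static void main(String[] args) {
        PoliteThrottle throttle = new PoliteThrottle(100); // max ~10 requests/sec
        long start = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            // a real agent would issue its page request here
        }
        System.out.println("3 throttled requests took "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
```

Sharing one throttle instance per host among all page-request agents keeps the combined request rate, not just each agent's rate, under the chosen limit.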
3.8
Client-Server Paradigm.
Extracting data from the web can require significant processing power
as well as bandwidth. Many types of data extraction projects may be ongoing, and most users would not want their computer constantly running potentially resource-expensive scripts. This is why, for data mining, the ideal solution is a client-server paradigm (where each user simply submits their jobs to a server for handling).
That way, the designated server can handle all the heavy processing
and high data load while the client’s machine remains free for use. The server application just needs to notify the client when the final collected data is available. The client-side application gives a lot of flexibility to the user requesting extraction jobs. Users can use the client to log onto the main server and manage all their running jobs, no matter where they are physically located. The server should provide the client with options such as checking job progress, creating new jobs, aborting running jobs, changing scheduling, and changing the extraction specification. This also allows multiple users
with different data extraction needs to utilize one centralized server. If
the server was designed to be appropriately powerful and scalable,
then a powerful open research service could be provided.
4
Review of Existing Data-Mining Tools to Mine Online Social
Networks
After the evaluation of several commonly available tools and technologies for online data extraction, we determined that Web-Harvest 2.0
was the best fit for the needs of our project (which included mining
data from two online life science communities of practice which used
social networking technologies).
4.1
Common Data-Mining Tools
Among the tools considered were Helium Scraper, Newprosoft, Happy
Harvester, Djuggler, Rapid Miner, Deixto, and Web-Harvest. Based on
our evaluation it was determined that Helium Scraper, Newprosoft,
Happy Harvester, and Djuggler, were all excellent GUI based scraping
applications. However, these tools also shared the same limitations. All
four tools were single operating system applications that only allow
scraper configurations to be defined within the context of the application. They also have no ability to be controlled or configured from the
command-line. Their source code is not open source, and scripts could not be written against their various executables. When considered against our project goals (which required modifications for large
scrapes to be conducted with minimal impact on the host), it became
clear that these tools could not be leveraged to achieve our desired
functionality. Rapid Miner is one of the leading open-source applications for data mining and analytics with solid data extraction capabilities. Rapid Miner was evaluated as a potential fit for our project’s
needs. It is open-source, cross-platform, uses XML-based configuration files which can be developed through the interface or written directly, and the code base can be scripted against both in application
wrapper interfaces as well as from the command line. The issue is that
Rapid Miner is such a powerful tool that it has a very steep learning
curve and includes features which would not be needed for our project. Using Rapid Miner would prevent us from being able to develop
a fast lightweight utility in a reasonable amount of time. DEiXTo is another web extraction technology that was potentially able to meet most
of our project’s needs. DEiXTo is a single platform GUI-based web
extraction application built on top of an open source Perl-based
scraper utility. DEiXTo also uses an XML-based configuration language
that potentially allows configurations to be defined outside of the GUI.
The Perl module that forms the backbone of DEiXTo’s extraction technology could also be scripted against on any operating system or code
framework which supports the Perl scripting language. However, the DEiXTo file format (.wpf) is unduly complicated and not well documented. This means that most .wpf files must be developed within the GUI application, which is single-platform and closed source. DEiXTo is also limited in that output can be written only to specific file formats and in specific ways. While the features and options available
in DEiXTo would allow us to accomplish our project goals, it was determined that another tool, Web-Harvest 2.0, also had all the capabilities we needed for our project, but was easier to work with, more configurable, and more open in terms of input and output capabilities.
4.2
Web-Harvest
Web-Harvest 2.0 is a hybrid technology that consists of a GUI based
application wrapped around a Java open-source development library.
This library, in turn, implements several of the most common and powerful extraction utility formats such as XPath and regular expressions.
The Web-Harvest 2.0 platform also defines a syntax for specifying custom
data extraction workflows. This was ideal for several reasons. First, the
graphical user interface allowed for quick start-up of development using the features of the GUI to easily debug, learn, and understand how
to develop complex workflows in the Web-Harvest scraper configuration format. Defining workflows via this format is, in many ways, better
than coding library solutions that require workflows to be defined in the
context of that code base. This is because the configuration syntax is
just a simple standard which can be written with any text editor. This
frees the developer from the additional nuances of any specific high-level programming language. Furthermore, once tested, these workflows could be easily shared with others and passed to the development package, which could execute the scraper configurations through
code. The fact that the core of Web-Harvest is an open-source data
extraction engine allowed for our project to wrap this engine in our own
lightweight code. We discarded the overhead of a GUI in favor of a
lightweight command-line interface implemented with a client-server
pattern. We were also freed from the limitations placed on the extraction engine by the GUI, by taking advantage of the Java programming
language to develop our own features not present in the GUI or the
engine itself. This included multiple simultaneous extractions as well
as timed and repeated extractions. Furthermore all the configuration
files we had previously developed could be simply passed to this engine and executed in the same manner as through the GUI. Web-Harvest 2.0 was a good fit because of its hybrid nature. Most pure application-based scrapers are not extendable, and few define a configuration format. This forces the developer to work within the confines of what the application allows. Pure development-package-based extraction tools can have a steep learning curve and can
be difficult to debug. Relying purely on data extraction utilities and
standards like XPath, and regular expressions requires that an entire
framework be built around them in order to execute complex dynamic
extraction workflows. This can be time-consuming and resource intensive. Given the remit of our project, Web-Harvest served as an ideal
solution in terms of features and the ability to separate between code
and UI (which allowed us to quickly develop our own tools using the
power of the Web-Harvest engine). Few if any other tools would have
efficiently allowed us to work in this way.
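For reference, a minimal scraper configuration in the spirit of the Web-Harvest 2.0 syntax might look as follows. The URL and XPath expression are invented for illustration; consult the Web-Harvest documentation for the full set of processor elements:

```xml
<config charset="UTF-8">
    <!-- Fetch a (hypothetical) member-list page and clean it to XHTML -->
    <var-def name="page">
        <html-to-xml>
            <http url="http://example.org/community/members"/>
        </html-to-xml>
    </var-def>
    <!-- Pull every profile link out of the cleaned page with XPath -->
    <var-def name="profileLinks">
        <xpath expression="//a[contains(@href,'/profile/')]/@href">
            <var name="page"/>
        </xpath>
    </var-def>
</config>
```

Because a configuration like this is plain XML, it can be written in any text editor, debugged in the GUI, and then handed unchanged to the engine for execution through code.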
5
Extension of Web-Harvest for Data Mining of Online Social
Networks
After a review of existing data mining tools and a consideration of the
desired features of data mining for online social networks, it was determined the best course of action was to develop extensions and a
application wrapper for the open-source Web-Harvest 2.0 data extraction engine. Web-Harvest 2.0 already incorporates many of the basic
functionalities identified as important in mining online social networks.
Because the code is open source, we saw Web-Harvest as a good
place to begin testing and developing a truly social network-centric
data mining tool. Our plan centered on taking the existing extraction features and wrapping the code base within a multi-threaded client-server model. Once the base extraction modules were wrapped in this way, we could focus on adding additional management features to the wrapper, such as scheduling, process control, and progress management.
5.1
Related Work
The data mining literature regarding either Web-Harvest or extensions
to Web-Harvest is minimal. Web-Harvest has been used successfully
as a basic scraper in the literature. One such example was in a
study by Nagel and Duval [7]. They used Web-Harvest in their study in
order to collect large amounts of information from a website. For their
study, they only needed a simple web scraper, and used Web-Harvest
in its original form to mine publication data from Springer, an academic
publisher. They used the software to collect data including titles, authors, affiliations, and postal addresses.
Katzdobler and Filho use Web-Harvest extensively [8]. They combined Web-Harvest with JENA, a tool used to build semantic web applications, as well as an ontology which described what type of information they wanted to extract. The ontology is accessed through the JENA API, and Web-Harvest extracts the information from the site. However, manual creation of the configuration file and manual startup of Web-Harvest are needed.
TagCrawler is a program written using Web-Harvest and is one of the few cases of Web-Harvest being directly extended [9]. The creators of TagCrawler desired a web crawling tool that would be able to retrieve information from tagging communities. While the end goal of that project was not related to ours, their use of Web-Harvest as a base to build upon shows that this is a method which has been successfully deployed.
5.2
Voyeur Server Project
This program is designed as an extension to the existing Web-Harvest
2.0 framework. It uses Web-Harvest’s existing functionality in terms of
scraping and use of configuration files, but adds on several layers of
additional features. In the development of this extension, we tried to
take into consideration all the features we identified in Section 3 as
being key considerations for the data mining of online social networks
and attempted to push and adapt the Web-Harvest 2.0 engine to better
fit this model.
Project Overview. Web-Harvest is an excellent basic web scraper, and the framework has been able to satisfy many of our scraping needs. These include a robust query specification language and the ability to import and export data in a number of important formats, including MySQL database integration. Our most important requirements were the ability to repeat a scrape upon completion and to start a scrape on a specific date. It was also important that the tool be able to run concurrent scrapes; this is not only more efficient, but allows users to collect sets of scraped data for analysis rather than a single variable or page. Finally, we needed to be able to limit a scrape, check the status of all running scrapes, stop a running scrape, and update a scrape. Our tool adds all of these features by building on top of the Web-Harvest source code.
Project Structure and Development. Our program is organized around a client-server model, with the server able to create any number of threads. The program is designed to run through the terminal, with one window for the client and one window for the server. The server first binds to a port; once it has, the client can connect and begin communicating. The server and client communicate using various readers and writers: the server uses a BufferedReader to receive information from the client and an ObjectOutputStream to send information back, while the client receives information through an ObjectInputStream and sends information with a PrintWriter.
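This reader/writer pairing can be sketched as a single round trip over the loopback interface. The class and method names below are illustrative, not the actual VoyeurServer code; note that the ObjectOutputStream must be flushed after construction so its stream header reaches the peer's ObjectInputStream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

public class WireSketch {

    // Round-trips one command: the server reads the client's line with a
    // BufferedReader and replies via an ObjectOutputStream; the client
    // writes with a PrintWriter and reads with an ObjectInputStream.
    public static String roundTrip(String command) {
        try (ServerSocket listener = new ServerSocket(0)) { // ephemeral port
            Thread server = new Thread(() -> {
                try (Socket s = listener.accept()) {
                    ObjectOutputStream out =
                            new ObjectOutputStream(s.getOutputStream());
                    out.flush(); // push header so the client's OIS can start
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(s.getInputStream()));
                    out.writeObject("ACK: " + in.readLine()); // reply
                    out.flush();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            server.start();
            try (Socket s = new Socket("127.0.0.1", listener.getLocalPort())) {
                PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                ObjectInputStream in =
                        new ObjectInputStream(s.getInputStream());
                out.println(command);                     // client -> server
                String reply = (String) in.readObject();  // server -> client
                server.join();
                return reply;
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

In VoyeurServer the connection is long-lived rather than one round trip per socket, but the stream choices are the same: line-oriented text for commands in one direction, serialized objects for results in the other.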
When both the client and server are running, the user can enter input into the client. The client sends the input to the server, which parses what the user has entered. If the command requests a new scrape, the server takes the supplied information and creates a new thread, which calls the Web-Harvest scrape code. This thread is then stored in an ArrayList in the server class. The thread also receives information regarding any date limit or repetitions and then performs all the necessary procedures. When the user enters a command that affects an existing scrape thread, the server retrieves that specific thread from the ArrayList and passes the command to it. In general, the VoyeurClient receives inputs from the user and relays them to the VoyeurServer, which in turn relays information and commands to the VSServerThread. If the command sent to the VSServerThread returns information, the server receives it and sends it back to the client, which prints it.
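The dispatch-and-registry pattern described above can be sketched as follows. The class and method names are ours (a hypothetical stand-in for VSServerThread), not the actual VoyeurServer identifiers; the essential point is that each started thread is kept in a list so later commands can be routed back to it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a scrape job thread.
class ScrapeThread extends Thread {
    private final String configFile;
    private volatile String state = "created";

    ScrapeThread(String configFile) { this.configFile = configFile; }

    @Override public void run() {
        state = "running";
        // ... the Web-Harvest scrape would execute here ...
        state = "finished";
    }

    String status() { return configFile + ": " + state; }
}

public class Dispatcher {
    // Threads are kept in a list so that later commands ("status",
    // "stop", "update") can be routed back to the job they target.
    private final List<ScrapeThread> jobs = new ArrayList<>();

    public String handle(String input) {
        String[] parts = input.split("\\s+", 2);
        if (parts[0].equals("new")) {
            ScrapeThread t = new ScrapeThread(parts[1]);
            jobs.add(t);
            t.start();
            return "started job " + (jobs.size() - 1);
        } else if (parts[0].equals("status")) {
            ScrapeThread job = jobs.get(Integer.parseInt(parts[1]));
            try {
                job.join(); // for this sketch, wait so the status is settled
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return job.status();
        }
        return "unknown command";
    }
}
```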
Fig. 1. Schematic of data flow in VoyeurServer
One of the first problems we tackled was enabling the program to run concurrent scrapes. Because we based the program on the client-server model, we were able to use server threads. Each time a new scrape is requested, the server spawns a new thread and stores it in an ArrayList so that it can be accessed later. Each thread runs independently of all other threads, and since all of the actual scraping logic occurs within the thread, the user can run multiple scrapes concurrently.
When beginning a new scrape of a file, the user can enter three pieces of information. First, they must enter the path of the Web-Harvest configuration file they wish to use. Second, they can enter a date, which will be the start date of the scraper. Third, if they want the scrape to repeat, they can specify how frequently it should do so. When this information is entered, the server parses it and determines which scrape options have been selected. The server then creates a new thread object and passes in the user's choices regarding the start date and the repetition. When created, the thread stores the date and repeat frequency, if present. Additionally, there are booleans for both the start-date option and the repeat option, set according to the information entered. Once the thread has been created, its run() method begins. First, the important Web-Harvest classes are created.
Web-Harvest is organized such that all of the scraping functionality can be accessed through two of its classes. First, an instance of the ScraperConfiguration class is created with the path of the configuration file as its parameter. Second, an instance of the Scraper class is created, which takes the ScraperConfiguration instance as a parameter. To run a scrape, the execute() method of the Scraper class is called.
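A minimal sketch of this two-class pattern follows. It assumes the Web-Harvest 2.0 API; the package names, the configuration file name, and the Scraper constructor's second (working-directory) argument are assumptions that should be verified against the version in use.

```java
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class SingleScrape {
    public static void main(String[] args) throws Exception {
        // 1. Load the XML configuration file that defines the scrape
        //    ("users.xml" is a hypothetical file name).
        ScraperConfiguration config = new ScraperConfiguration("users.xml");
        // 2. Create a Scraper for that configuration; the second argument
        //    is the working directory used for relative paths and output.
        Scraper scraper = new Scraper(config, "scrape-output");
        // 3. Run the scrape.
        scraper.execute();
    }
}
```

In VoyeurServer these three calls happen inside each job thread's run() method rather than in main().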
Within the run() method there is a loop containing all the code for calling scrapes. Which type of scrape is called depends entirely on which booleans were set by the constructor; this setup made it easy to implement both the start-date limitation and the option to repeat. Within the infinite loop, a series of if-else statements checks the status of the date and repeat boolean variables. The date is checked using the Java Calendar class: when the date boolean is true, the thread enters a loop that compares the current time (obtained from a Calendar instance) with the time passed in by the constructor. Once the two match, the scraper's execute() method is called and the loop exits.
VoyeurServer Features. As mentioned previously, the key features of VoyeurServer include a command-line user interface, job scheduling, process management, progress monitoring, and database access.
Repeating Scrapes. A boolean designates whether or not the program should repeat. This section of the code uses the Calendar class to check whether the current time has reached the time at which the scrape should repeat. First, the program creates a Calendar instance offset by the period over which the user wishes to repeat. Next, it checks whether the current time has reached that repeat time. Finally, when that time is reached, the scraper executes. The scrape will continue to repeat because control remains within the infinite loop of the run() method.
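The same Calendar comparison gates both the start-date wait and the repeat schedule. The bookkeeping can be sketched as below; the class and method names are ours, not VoyeurServer's, and a fixed period in minutes stands in for the user's repeat frequency.

```java
import java.util.Calendar;

public class RepeatSchedule {
    private final Calendar nextRun;

    // Next run = now + the repeat period chosen by the user.
    RepeatSchedule(int periodMinutes) {
        nextRun = Calendar.getInstance();
        nextRun.add(Calendar.MINUTE, periodMinutes);
    }

    // True once the current time has reached the scheduled time; the
    // run() loop polls this on each pass.
    boolean due() {
        return !Calendar.getInstance().before(nextRun);
    }

    // After a scrape executes, push the schedule forward by one period
    // so the loop will fire again at the next interval.
    void advance(int periodMinutes) {
        nextRun.add(Calendar.MINUTE, periodMinutes);
    }
}
```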
Individual Scrapes. The final scrape option is a single stand-alone scrape, which also has a boolean associated with it. If, while passing through the infinite loop, the date and repeat variables are false and the single-scrape variable is true, the program runs one scrape.
Updating Scrapes. The next problem we approached was how to organize the program so that the user could update the initial parameters of a scrape after it has been created. When creating the scrape, it was important that the thread object be stored so that it could be accessed later. With that in place, the user can call methods of the thread that change the parameters of the scrape: to change either the date or the repeat status, a method is called on the thread that switches the associated booleans. Since the thread is set up to loop continuously, a change in the booleans is all that is necessary to update the parameters of the scrape.
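This update mechanism amounts to volatile flags flipped by setter methods that the server calls on the stored thread; the loop inside run() picks the change up on its next pass. The sketch below uses our own illustrative names, not the actual VoyeurServer code.

```java
public class UpdatableJob extends Thread {
    // volatile so the running loop sees changes made by the server thread.
    private volatile boolean repeat = false;
    private volatile boolean stopRequested = false;

    // Called by the server when the user issues an "update" command.
    public void setRepeat(boolean repeat) { this.repeat = repeat; }

    public void requestStop() { stopRequested = true; }

    public boolean isRepeating() { return repeat; }

    @Override public void run() {
        while (!stopRequested) {
            if (repeat) {
                // ... wait for the next scheduled time, then scrape ...
            }
            try {
                Thread.sleep(10); // re-check the flags on the next pass
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
```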
Status Check. The final feature is the thread status check. When called, the status returns the filename, the elapsed time of the scrape, the scrape finish time, the time of the next scrape, the current state, and the number of variables scraped.
Of these, the filename, finish time, and next scrape time are simply stored variables. If a scrape is running, elapsed time is calculated as the difference between the current time and the time the scrape began; if the scrape has finished, it is the difference between the finish time and the start time. The total elapsed time is converted from milliseconds to days, hours, minutes, and seconds through a series of divisions and modulus operations on the difference. The state of the scrape is determined by the three boolean variables: date, repeat, and single.
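The divide-and-mod conversion can be sketched as follows; the method name and output format are ours, not VoyeurServer's.

```java
public class Elapsed {
    // Converts an elapsed duration in milliseconds to days, hours,
    // minutes, and seconds via successive division and modulus.
    public static String format(long millis) {
        long seconds = millis / 1000;
        long days    = seconds / 86400;           // 24 * 60 * 60
        long hours   = (seconds % 86400) / 3600;
        long minutes = (seconds % 3600) / 60;
        long secs    = seconds % 60;
        return days + "d " + hours + "h " + minutes + "m " + secs + "s";
    }
}
```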
The most difficult task concerned the number of variables scraped, which serves as a measure of the scraper's progress. Each time the scraper extracts a piece of information from a website, a section of code in the scrape file increments a counter in a database. When the server calls the status method, the thread accesses that database and retrieves the count. All of these values are then concatenated into a string and returned to the server, which sends the string to the client; the client parses it, pulls out the individual values, and prints them in an organized manner.
6
Experimental Results: A Case Study
We used this tool to gather information from an online life science community of practice. This virtual community consisted mainly of users and their communication across a wide array of social network-mediated interactions, including profiles, blogs, and forums. This section details how we used our tool to acquire the information we needed for our research.
Our needs included capturing information on the users within this community and their communications with one another. Our eventual goal was to use this data to study patterns of trust in the development of scientific collaboration online. The community we studied did not provide an API; therefore, we had to rely on traditional web content extraction methodologies. We used the Web-Harvest specification language to develop separate jobs to collect user data, including profile information, as well as to collect posts. This is a fairly basic task. Because we wanted to limit the bandwidth consumption of our requests so as not to affect the site's service, we limited concurrency; even so, we were able to collect user and post information simultaneously. At first we collected data into text files to be reviewed, evaluated, and coded manually. As our research developed, we updated and modified our data collection job to write information directly to a database. We had previously developed an application, based on this database, to assist in the coding and classification of the community's data. Being able to execute web content extraction that interfaces directly with our downstream research applications represents an extremely powerful and desirable workflow for network analysis, and it is an area we envision our future work following.
Although this research is preliminary and its remit has not been to test all the features we have identified as important to data mining online social networks, our experience in developing the VoyeurServer tools has been positive and represents what we believe to be an important step towards the further development of this and other data mining tools specifically for online social networks. It is important to begin developing these domain-specific solutions so that good open-source options are available to researchers. In their absence, the market will likely be served primarily by existing companies whose tools focus much more on the domains of marketing and business knowledge. Such solutions will never be ideal for pure research and could lead to a period in which it becomes difficult for researchers to obtain this kind of information.
7
Future Work
Despite initial success in using VoyeurServer for mining data for network analysis, many of its potential capabilities have yet to be fully tested. In Section 3 we outlined key features for the data mining of online social networks; currently, the VoyeurServer extension of Web-Harvest implements these features only in a basic way. Further testing and development will determine whether this extension has a future as a general research tool, or whether extending Web-Harvest 2.0 is less preferable than developing a data mining toolkit for online social networks from scratch. For those considering developing custom wrappers for Web-Harvest, we suggest the following possibilities for extended functionality:
• Coded modules for common APIs, to ease the use of the extraction specification language for API-related tasks.
• Coded modules for specific database tasks. VoyeurServer currently relies on raw SQL statements.
• Specification files for whole projects, as opposed to per-file scrapes, incorporating timing and progress-monitoring options.
• Smart or automated concurrency, as opposed to having to design individual jobs or projects for concurrency.
• Rate-limiting features, including self-awareness of requests per second and of the bandwidth of incoming data, with the ability to impose limits on itself so that it never exceeds set bandwidth limits. It would also be useful to be able to set these values in a preference file or as a module, so that limits for various sites could be defined and reused.
Our continued work seeks to develop VoyeurServer in the following
ways:
• Support for higher levels of concurrency.
• Investigating the feasibility of a broad-based public research server providing network-structured extraction as a service.
• Reducing high per-thread resource use. Experience suggests that VoyeurServer is memory-intensive; our solution will need to make large numbers of jobs for various clients more efficient.
• An improved interface for the Web-Harvest backend (incorporating client-server features).
We have shared these improvements so that data mining developers
can be aware of issues we currently face and some possible solutions.
This will enable designers and developers to learn from our development challenges.
8
Conclusion
This chapter has reviewed various data mining tools for scraping data from online social networks. It has highlighted not only the complexities of scraping data from these sources (which include diverse data forms), but also introduced currently available tools and the ways in which we have sought to overcome their limitations through extensions to existing software. After reviewing the data scraping tools currently on the market, we developed a tool of our own, VoyeurServer, which builds upon the Web-Harvest framework. We have outlined the challenges we faced and our solutions, and described future directions of our data mining project. Concrete methods are provided for developers to build data mining solutions for online social networks using the Web-Harvest framework.
9
References
[1] Doan, S., Vo, B.-K., and Collier, N., "An Analysis of Twitter Messages in the 2011 Tohoku Earthquake", 2011.
[2] Hughes, A.L., Palen, L., Sutton, J., Liu, S.B., and Vieweg, S., "“Site-Seeing” in Disaster: An Examination of On-Line Social Convergence", 5th International ISCRAM Conference, 2008.
[3] Bollen, J., "Twitter Mood as a Stock Market Predictor", Computer, 44(10), 2011, pp. 91-94.
[4] Golder, S.A., and Macy, M.W., "Diurnal and Seasonal Mood Vary
with Work, Sleep, and Daylength across Diverse Cultures", Science,
333(6051), 2011, pp. 1878-1881.
[5] Boyd, D.M., and Ellison, N.B., "Social Network Sites: Definition,
History, and Scholarship", Journal of Computer-Mediated
Communication, 13(1), 2008, pp. 210-230.
[6] Morzy, M., "Internet Forums: What Knowledge Can Be Mined from
Online Discussions": Knowledge Discovery Practices and Emerging
Applications of Data Mining: Trends and New Domains, IGI Global, 2011. ,
pp. 315-336.
[7] Nagel, T., and Duval, E., "Muse: Visualizing the Origins and Connections of Institutions Based on Co-Authorship of Publications", 2nd International Workshop on Research 2.0, at the 5th European Conference on Technology Enhanced Learning: Sustaining TEL, 2010, pp. 48-52.
[8] Katzdobler, F.-J., and Filho, H.P.B., "Knowledge Extraction from Web", Retrieved from: http://subversion.assembla.com/svn/iskm/FinalDocumentation/FinalReport.pdf, 2009.
[9] Yin, R.M., TagCrawler: A Web Crawler Focused on Data Extraction from Collaborative Tagging Communities, University of British Columbia, 2007.