Author's Accepted Manuscript: Neurocomputing
www.elsevier.com/locate/neucom
PII: S0925-2312(16)30683-X
DOI: http://dx.doi.org/10.1016/j.neucom.2016.06.045
Reference: NEUCOM17295
To appear in: Neurocomputing
Received date: 25 September 2015
Revised date: 3 June 2016
Accepted date: 14 June 2016
Cite this article as: MohammadNoor Injadat, Fadi Salo and Ali Bou Nassif, Data
Mining Techniques in Social Media: A Survey, Neurocomputing,
http://dx.doi.org/10.1016/j.neucom.2016.06.045
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
1 Department of Electrical and Computer Engineering, University of Western Ontario, 1151 Richmond St, London, Ontario N6A 3K7, Canada
2 Department of Electrical and Computer Engineering, University of Sharjah, Sharjah, United Arab Emirates
ABSTRACT
Today, the use of social networks is growing ceaselessly and rapidly. More alarming is the fact that these networks have become a
substantial pool of unstructured data that belong to a host of domains, including business, government, and health. The
increasing reliance on social networks calls for data mining techniques that can help restructure the unstructured data
and place them within a systematic pattern. The goal of the present survey is to analyze the data mining techniques that were
applied to social media networks between 2003 and 2015. Using criterion-based research strategies, 66 articles were
identified to constitute the source of the present paper. After a careful review of these articles, we found that 19 data mining
techniques have been used with social media data to address 9 different research objectives in 6 different industrial and services
domains. However, data mining applications in social media are still immature and require more effort by academia and
industry to adequately perform the job. We suggest that more research be conducted by both academia and industry, since
the studies done so far do not sufficiently cover the available data mining techniques.
Keywords: Data Mining, Social Media, Social Media Networks Analysis, Survey
1. INTRODUCTION
Undoubtedly, the world is shrinking into a small village owing to the tangible influence of social media. It connects people
from different parts of the world, ages, and nationalities and allows them to share their opinions, experiences, feelings, hobbies,
pictures, and videos. This has opened the door for public and private organizations from all domains to promote, benefit, analyze,
learn, and improve their organizations based on the data provided in social media. Thus, the significance of social media for
academia and industry is quite conspicuous in the amount of research done by these two sectors, seeking answers to pivotal
questions.
Social media data are unstructured and appear in different forms, such as text, voice, images, and
videos [1]. Moreover, social media provides an enormous amount of continuous real-time data, which makes traditional
statistical methods unsuitable for analyzing these massive data [2]. Therefore, data mining techniques can play an important role
in overcoming this problem.
Despite the large body of empirical research on data mining techniques and social media, only a small number of studies
compare data mining techniques in terms of accuracy, performance, and suitability. For instance, it was observed that the
accuracy of certain machine learning techniques is calculated in various ways, which makes it difficult to judge the
suitability of the data mining techniques.
Many researchers have selected their data mining techniques based solely on expert judgment (A31, A56). A few surveys have
been conducted in this area, but without giving full justification for using data mining techniques in social media [3,4]. However,
some studies discussed certain aspects of the data mining techniques used in social media. In [5], Vuori and Väisänen discussed
information gathering and knowledge and information sharing through social media for companies. In [6], Rafeeque and
Sendhilkumar reviewed the work and challenges related to short text analysis. Akin to this study, in [7], Tsytsarau and Palpanas
reviewed the development of opinion mining and sentiment analysis, providing a summary of the proposed methods of
contradiction analysis. In [8], Gole and Tidke discussed mining big data in social media and its challenges as a result of big
data features such as Volume, Velocity, Variety, Veracity, and Value.
To the best of our knowledge, no previous study systematically concentrates on the data mining
techniques implemented in social media research, which triggered the idea of the present survey. The review presented in this paper
discusses the research published in the period from January 1, 2007 to January 7, 2015. The goal of this study is to probe the
available articles with regard to: (I) the data mining techniques used to extract social media data, (II) the research areas that
require mining data from social media, (III) a comparison between machine learning and non-machine learning data mining
techniques, (IV) a comparison between different data mining techniques, and (V) the strengths and weaknesses of the recommended
data mining techniques in social media.
This manuscript is divided into five sections. Section 2 explains the implemented methodology. Section 3 describes our
findings. Section 4 discusses the limitations of this review. Finally, Section 5 presents our findings, recommendations, and future
work.
2. METHODOLOGY
In this review, we conducted a survey based on the Systematic Literature Review (SLR) methodology proposed by Kitchenham and
Charters [9], which consists of planning, conducting, and reporting phases, where each phase consists of several
stages. In the planning phase we created a review protocol, which consists of six stages: specifying research questions, designing
the search strategy, identifying the study selection procedures, specifying the quality assessment rules, detailing the data
extraction strategy, and synthesizing the extracted data. Fig. 1 shows the review protocol stages.
The research questions have been specified based on the objectives of this review. At the next stage, we designed the search
strategy referring to the first stage to retrieve the required and related articles. We also identified the search terms and article
selection process, which is required for an accurate search. Stage three covered the selection criteria which specify the inclusion
and exclusion rules; we also included more related articles from the references in the articles we used to enrich our literature
resources related to the research questions. Stage four included the quality questions to filter the related articles. In stage five, we
described the extraction strategy used to obtain the required data which could answer the research questions. Finally, in the last
stage, we identified the methodologies used to synthesize the extracted data.
As indicated by Kitchenham and Charters [9], the review protocol is a critical element of any SLR.
Therefore, to avoid researcher bias and to ensure the quality of the review protocol, the authors held regular meetings.
Subsections 2.1 to 2.6 describe in detail the review protocol followed in this review.
Science Direct
ACM Digital Library
Computing Research Repository
Web of Science
SPIE
The first search process included journals and Tier I social-network-related conferences, such as the International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), the ACM Conference on Online Social Networks (COSN), the
International World Wide Web Conference (WWW), and the International Conference on Data Engineering (ICDE), from the
digital libraries listed above. The search terms were matched against any part of the articles (metadata), and the search was
restricted to articles published between January 2003 and January 2015, because the most popular social networks (Facebook,
Twitter, LinkedIn, and MySpace) began after 2002 [14].
The inclusion and exclusion criteria applied in this survey are defined below:
Inclusion criteria:
Use data mining techniques in social media.
Use machine learning and non-machine learning data mining techniques in social media.
Comparative studies that compare data mining techniques with one another.
Comparative studies that compare data mining and non-data mining techniques.
Consider the latest edition of an article when different versions of the same research are available.
Consider only articles published between January 2003 and January 2015.
Exclusion criteria:
Exclude articles that include data mining that is not related to social media.
Exclude articles that do not include data mining but are related to social media.
Exclude non-journal and non-conference articles.
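The criteria above can be read as a simple filter over candidate articles. The following is a minimal sketch in Python, not the authors' tooling: the record fields (`published`, `venue_type`, `uses_data_mining`, `about_social_media`) and the sample records are hypothetical, and the rule about keeping only the latest version of a study is not modelled.

```python
from dataclasses import dataclass
from datetime import date

# Search window stated in the criteria: January 2003 to January 2015.
WINDOW_START = date(2003, 1, 1)
WINDOW_END = date(2015, 1, 31)


@dataclass
class Candidate:
    # All field names are hypothetical, chosen only for this illustration.
    title: str
    published: date
    venue_type: str            # "journal", "conference", or anything else
    uses_data_mining: bool
    about_social_media: bool


def is_included(article: Candidate) -> bool:
    """Return True when an article passes the inclusion and exclusion rules."""
    if not (article.uses_data_mining and article.about_social_media):
        return False           # data mining without social media, or the reverse
    if article.venue_type not in {"journal", "conference"}:
        return False           # non-journal and non-conference articles
    return WINDOW_START <= article.published <= WINDOW_END


# Invented sample records, one per rule.
candidates = [
    Candidate("Tweet sentiment with SVM", date(2013, 5, 1), "journal", True, True),
    Candidate("Market basket mining", date(2012, 3, 1), "journal", True, False),
    Candidate("Blog post on hashtags", date(2014, 8, 1), "blog", True, True),
    Candidate("Early forum clustering", date(2002, 11, 1), "conference", True, True),
]
selected = [c.title for c in candidates if is_included(c)]
print(selected)
```

Only the first invented record satisfies every rule; the other three each fail exactly one criterion (topic, venue type, and publication window, respectively).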
Finally, after applying all filtration steps, 66 articles were considered as the
resources for this review. The selected articles are listed in Appendix (A), Table
A1.
The scores that resulted from applying the QARs on the selected articles are shown in Appendix (A), Table A2.
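Judging from the rows of Table A2, each article's total is the sum of its ten QAR marks, each of which ranges from 0 to 1 in steps of 0.25. A minimal sketch of that arithmetic, using the marks listed for A1 and A5 (the function name is an assumption made for this illustration):

```python
# QAR marks for two articles, copied from Table A2 (QAR1 to QAR10).
qar_marks = {
    "A1": [0.75, 0.25, 0.25, 0.75, 0.75, 1, 1, 0.75, 0.75, 0.75],
    "A5": [1, 1, 0.5, 1, 1, 1, 1, 1, 0.75, 0.75],
}


def total_score(marks):
    """Sum the ten QAR marks of one article."""
    if len(marks) != 10:
        raise ValueError("expected ten QAR marks")
    return sum(marks)


totals = {article: total_score(marks) for article, marks in qar_marks.items()}
print(totals)  # {'A1': 7.0, 'A5': 9.0}
```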
Not all selected articles answered all five RQs. Appendix (A), Table A3 illustrates the RQs that were answered by each
selected study.
Fig. 3 shows that SVM, BN, and DT are the most applied techniques in the area of social media, with a share of 51% of
the selected articles. Novel techniques, at 9%, were not counted among the most applied techniques because each
article proposes its own dedicated novel technique. Table 4 includes detailed information about the frequencies of the data mining
techniques used by the selected articles in this review.
Quality Improvement
Risk Management
Semantic Analysis
Sentiment Analysis
Fig. 7 illustrates the distribution of these research areas. Sentiment analysis and quality improvement were the most
active areas among the articles, with frequencies of 21 and 14, respectively.
MACHINE LEARNING VERSUS NON-MACHINE LEARNING METHODS IN MINING SOCIAL MEDIA DATA (RQ3)
Data mining is the process of extracting hidden knowledge from data [16]. This can be done in many ways,
for example with machine learning methods such as KNN, K-means, and SVM. Statistical methods are in some cases considered
non-machine learning methods that are used to discover patterns. As Berson and Smith [11] mentioned, “statistical techniques are
driven by the data and are used to discover patterns and build predictive models”.
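As an illustration of one of the machine learning methods named above, the following is a minimal k-NN sketch in pure Python. The toy feature vectors, the labels, and the choice of k = 3 are assumptions made for this example and do not come from any of the reviewed articles.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional features, e.g. counts of positive and
# negative words in a short post.
train = [
    ((5, 0), "positive"), ((4, 1), "positive"), ((6, 1), "positive"),
    ((0, 5), "negative"), ((1, 4), "negative"), ((1, 6), "negative"),
]

print(knn_predict(train, (5, 1)))  # positive
print(knn_predict(train, (0, 4)))  # negative
```

The method needs no explicit model: the labelled examples themselves act as the classifier, which is one reason k-NN is often used as a baseline.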
Out of the 66 papers identified, only three contain either experimental or theoretical knowledge about non-machine
learning methods. Two of these papers (A11, A19) integrated non-machine learning methods with machine learning methods to
improve the results of their proposed solutions. The third paper (A53) stated that text mining techniques that depend on
machine learning methods differ from non-machine learning methods in three respects: (i) in traditional quantitative analysis
methods, conclusions are derived from a population sample, whereas machine learning methods allow the researcher to derive
conclusions from the entire population; (ii) traditional quantitative methods require the researcher to analyze the data using a
theoretical platform, while machine learning methods give the researcher the ability to extract the actual meaning of the
mined data contained in natural language text; and (iii) machine learning methods investigate the textual data without human
interaction, whereas traditional quantitative methods need the researcher to interpret the data before analyzing them.
Fig. 6 Domains Distribution per Year
However, we disagree with the authors of paper A53, because the definition of data mining comprises three concepts [17]:
Statistics, Data (Big or Small), and Machine Learning and Lifting. Thus, data mining includes all statistics (the descriptive and
non-inferential parts of classical statistics) and Exploratory Data Analysis (EDA), using the power of computers for the
purpose of lifting and learning the patterns of the data [17].
Consequently, machine learning data mining techniques and non-machine learning data mining techniques, such as traditional
quantitative methods in statistics, are complementary to each other in data mining.
3.3. DATA MINING TECHNIQUES VERSUS OTHER DATA MINING TECHNIQUES (RQ4)
This RQ compares the different data mining techniques that have been used in the selected articles. Since most of the articles
based their findings on weak statistical analysis or on no statistics at all, we built our comparison on their
judgments, which relied on the experiments they conducted or on their article references. For instance, papers (A31, A53)
indicate that the SVM technique is one of the best categorization and feature selection techniques available, relying on references
published in 1998 and 2003; however, the paper was published in 2013. Further details are provided in Section 5.
After reviewing the selected papers, we found that many papers share common findings on the same data mining techniques.
For instance, papers (A31, A45, A53, A59) found that SVM outperforms other techniques such as Naïve Bayes. In contrast,
papers (A41, A51) claimed that Naïve Bayes and MLP performed better than SVM. Some other papers (A3, A20, A35)
claimed that K-Means performed better than other techniques such as C4.5. Finally, (A42, A60) found that the DBA technique
outperforms other techniques in terms of working with noisy data.
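Conflicting claims such as these are easier to reconcile when two techniques are evaluated on the same test set, with the same accuracy measure and a significance test. The following is a hedged sketch of one possible procedure, an exact McNemar test written in pure Python. The label and prediction vectors are invented for the example, and none of the reviewed articles is claimed to have used this procedure.

```python
from math import comb


def accuracy(y_true, y_pred):
    """Fraction of test items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on one test set.

    Only the items on which exactly one classifier is correct are informative.
    """
    only_a = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented labels and predictions for ten test items.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # technique A: 2 errors
pred_b = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]   # technique B: 6 errors

print(accuracy(y_true, pred_a))               # 0.8
print(accuracy(y_true, pred_b))               # 0.4
print(mcnemar_exact(y_true, pred_a, pred_b))  # 0.125
```

In this toy case technique A is twice as accurate, yet with only four informative items the difference is not significant at the usual 0.05 level, which shows why raw accuracy figures alone are weak grounds for ranking techniques.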
[Figure: number of articles per research area, grouped by domain (BM, EDU, GP, MH, SN); vertical axis: No. of articles.]
This part of the review represents a good source of information where the best practices of the primary data mining
techniques could be implemented. Table 5 summarizes the data mining techniques that could be implemented in the social media
area. In addition to the traditional data mining techniques, Appendix (A), Table A7, summarizes the description and the main
features of the novel techniques proposed by the researchers.
An immediate recommendation is that the area of social media still calls for more in-depth research that takes into account
the accurate implementation of data mining techniques in the academic and industrial sectors. A thorough investigation of the
literature in this area reveals that a significant number of the studies have not applied any statistical tests.
Research in the social media domain should therefore adopt a twofold approach that combines accurate
recording of experimental results with appropriate statistical analysis.
The systematic literature review conducted in this study reveals that quite a few articles applied statistical tests, such as
ANOVA, MANOVA, and t-test; these parametric statistical tests require normally distributed data [11]. Apparently, the majority
of the studies reviewed failed to meet this condition and, therefore, the data provided can hardly be held reliable.
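When the normality assumption cannot be verified, a distribution-free procedure is one possible alternative to the parametric tests mentioned above. The following is a minimal sketch of an exact paired sign-flip (permutation) test in Python. The per-fold accuracy values, given in whole percent, are invented for illustration and do not come from the reviewed articles.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided p-value for paired scores, with no normality assumption.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so every sign assignment is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


# Invented accuracies (in percent) of two techniques on the same five folds.
technique_a = [82, 85, 80, 84, 83]
technique_b = [78, 80, 79, 79, 80]

print(paired_sign_flip_test(technique_a, technique_b))  # 0.0625
```

Full enumeration is only practical for a small number of pairs, since the number of sign assignments doubles with each additional pair; for larger samples a random subset of assignments is commonly drawn instead.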
Our study also found that very few surveys and case studies have shed light on data mining techniques in social media from
the software engineering perspective. For example, most of the published papers in the health domain were written
by health researchers, who provide hardly any information about the methods used in their papers.
In addition to the method-related gap, a second gap concerns the domains covered. The domains of
Education, Customer Relationship Management (CRM), and Human Resource Management (HRM), among others, have not yet
been explored by software engineers. We recommend that future research bridge this gap by investigating CRM and
HRM using data mining techniques. Such studies are anticipated to yield a more generic view and understanding of data mining
techniques.
TABLE 5
STRENGTHS AND WEAKNESS
APPENDIX (A)
[Figure: iteration (frequency) of data mining techniques per publication year; horizontal axis: Iteration of Data Mining Techniques, 0 to 10. Techniques shown: AdaBoost, ANN, Apriori, Bayesian Networks, Decision Trees, Density Based Algorithm, Fuzzy, GA, Hierarchical Clustering, K-Means, k-NN, LDA, Linear Regression, Logistic Regression, Markov, Maximum Entropy, Novel, SVM, Wrapper.]
TABLE A1
SELECTED ARTICLES
ID Title Year Ref
A1 #tag: Meme or Event? 2014 [19]
A2 @Phillies tweeting from philly? Predicting twitter user locations with spatial word usage 2012 [20]
A3 A framework for building web mining applications in the world of blogs: A case study in product sentiment analysis 2012 [21]
A4 A Novel Data-Mining Approach Leveraging Social Media to Monitor Consumer Opinion of Sitagliptin 2015 [22]
A5 A probabilistic generative model for mining cybercriminal networks from online social media 2014 [23]
A6 A semantic triplet based story classifier 2012 [24]
A7 An algorithm for local geoparsing of microtext 2013 [25]
A8 An interests discovery approach in social networks based on semantically enriched graphs 2012 [26]
A9 An Unsupervised Feature Selection Framework for Social Media Data 2014 [27]
A10 Analyzing and visualizing web opinion development and social interactions with density-based clustering 2011 [28]
A11 Analyzing the political landscape of 2012 Korean presidential election in twitter 2014 [29]
A12 Anónimos: An LP-Based Approach for Anonymizing Weighted Social Network Graphs 2012 [30]
A13 Ant colony based approach to predict stock market movement from mood collected on Twitter 2013 [31]
A14 Batch kernel SOM and related Laplacian methods for social network analysis 2008 [32]
TABLE A2
QARS MARKS FOR THE SELECTED ARTICLES
ID QAR1 QAR2 QAR3 QAR4 QAR5 QAR6 QAR7 QAR8 QAR9 QAR10 Total
A1 0.75 0.25 0.25 0.75 0.75 1 1 0.75 0.75 0.75 7
A2 0.75 0 0.25 0.5 0.75 0.75 0 0.75 0.5 0.75 5
A3 1 0.75 0.75 0.75 0.75 0.5 0.25 0.75 0.25 0.5 6.25
A4 1 0.75 1 0.75 1 0.75 1 0.5 0.25 0.75 7.75
A5 1 1 0.5 1 1 1 1 1 0.75 0.75 9
A6 0.75 0.25 1 0.5 0.75 0.25 0 0.5 0.25 0.75 5
A7 1 0.5 0.75 0.75 0.5 0.75 0.5 0.75 0.25 0.5 6.25
A8 0.75 0 0.25 0.5 0.75 0.5 0.5 0.75 0.25 0.75 5
A9 1 0.75 1 0.75 1 1 1 1 0.5 0.75 8.75
A10 1 1 1 0.75 0.75 1 1 0.75 0.25 0.5 8
A11 1 1 1 0.75 0.75 0.75 0.75 0.75 0.25 0.5 7.5
A12 1 0.75 1 0.75 0.75 0.5 1 0.75 0.25 0.5 7.25
A13 0.75 0.5 0.5 0.5 0.5 0.25 0.25 0.75 0.25 0.75 5
A14 0.75 0.25 1 0.75 0.5 0.75 0.25 0.75 0.75 0.75 6.5
A15 0.75 0.5 0.75 0.5 0.75 0.25 0.25 0.5 0.25 0.75 5.25
A16 0.75 0.75 0.5 0.75 0.5 0.75 0 0.75 0.5 0.5 5.75
A17 1 0.75 0.75 0.75 1 1 0.5 0.5 0.25 0.75 7.25
A18 1 0.75 1 0.75 1 0.5 0.5 0.5 0.25 0.5 6.75
A19 1 0.75 0.75 1 0.75 1 1 1 0.75 0.75 8.75
A20 0.75 0.75 0.5 1 1 1 0 1 1 1 8
A21 0.5 0.5 0.5 0.75 0.75 1 1 0.75 0.5 0.75 7
A22 1 1 0.75 0.75 0.5 1 0 0.75 0.5 0.75 7
A23 0.75 0 0.25 0.5 0.75 0.75 0.75 0.75 0.5 0.75 5.75
A24 1 0.25 0.25 1 0.25 0.75 1 0.75 0.75 1 7
A25 0.75 0.25 0.25 1 0.25 0.75 1 0.75 1 0.75 6.75
A26 0.75 1 0.25 0.75 0.75 1 1 0.75 0.75 0.75 7.75
A27 0.75 0.25 0.25 0.5 0.75 0.75 0.5 0.5 0.5 0.75 5.5
A28 0.75 0 0.25 0.5 0.75 0.75 1 0.75 0.5 0.75 6
A29 1 0.75 0.75 1 1 0.75 0.75 0.75 0.25 0.75 7.75
A30 0.75 0.5 0.5 0.75 0.75 0 0 0.75 0.25 0.75 5
A31 1 1 1 0.75 1 1 1 1 1 1 9.75
A32 0.75 0.75 0.75 1 1 1 0.5 0.75 0.5 0.5 7.5
A33 1 0.5 0.5 1 0.75 0.75 0.75 1 0.5 1 7.75
A34 1 0.75 0.75 1 0.75 1 1 1 0.75 0.75 8.75
A35 1 1 1 1 0.75 1 0.75 0.75 0.25 1 8.5
A36 1 0.25 0.25 0.75 0.75 0.75 0 1 0.5 0.5 5.75
A37 1 1 0.75 0.75 0.75 0.75 1 0.75 0.5 0.5 7.75
A38 0.75 0.75 0.25 0.75 0.5 0.75 0.5 0.5 0.5 0.5 5.75
A39 0.75 0.75 0.75 0.75 0.5 0.25 0.25 0.75 0.25 0.75 5.75
A40 1 1 1 1 0.75 1 0 0.75 1 0.5 8
A41 1 1 1 1 0.75 1 0.75 1 0.5 0.75 8.75
A42 1 1 1 1 0.75 0.75 0 0.75 0.5 0.75 7.5
A43 0.75 0.75 0.75 0.5 0.75 0.75 0.5 0.5 0.5 0.5 6.25
A44 1 0.75 0.75 0.75 0.5 0.75 0.75 0.75 0.5 0.75 7.25
A45 1 1 1 1 0.75 1 1 1 0.75 0.75 9.25
A46 0.75 0 0.75 0.75 0.75 0.75 0.5 0.75 0.5 0.75 6.25
A47 0.75 0.75 0.5 0.5 0.75 1 0.5 0.75 0.5 0.75 6.75
A48 1 0.75 0.75 0.5 0.75 0.75 0 0.75 0.75 0.5 6.5
A49 0.75 0.5 0.75 0.75 0.5 0.5 0 0.5 0.25 0.75 5.25
A50 0.75 0 0.25 0.75 0.5 1 0.75 0.75 0.75 0.75 6.25
A51 1 0.5 0.5 0.5 0.75 0.5 0.75 0.5 0.25 0.5 5.75
A52 1 0.5 0.75 0.5 0.75 0.75 0 0.5 0.25 0.5 5.5
A53 1 1 0.75 0.75 1 1 1 1 1 0.75 9.25
A54 0.5 0.75 0.25 0.5 0.5 0.75 0 0.75 0.5 0.75 5.25
A55 1 1 0.75 1 0.75 0.75 0 0.75 0.5 0.75 7.25
A56 1 1 0.75 0.75 0.75 0.75 0 0.75 0.25 0.5 6.5
A57 0.5 0.5 0.25 0.75 0.75 0.5 0.25 0.5 0.5 0.5 5
A58 1 0.75 0.75 0.75 0.75 1 0.75 1 0.75 0.5 8
A59 1 1 1 1 1 1 1 0.75 0.75 0.5 9
A60 1 1 0.75 1 0.75 1 1 0.5 0.25 0.75 8
A61 0.75 0.5 0.25 0.5 0.75 0.75 0 0.75 0.75 0.5 5.5
A62 0.75 0.75 0.5 0.75 0.5 0.75 0.75 0.75 0.75 0.5 6.75
A63 0.75 0.5 0.25 0.75 0.75 0.25 0.5 0.25 0.25 0.75 5
A64 1 1 0.75 0.5 0.75 0.5 0.5 0.5 0.25 0.5 6.25
A65 1 1 1 0.75 1 0.75 0 0.5 0.25 0.5 6.75
A66 0.75 0.75 0.5 0.75 0.5 0.5 0.75 0.5 0.5 0.75 6.25
TABLE A3
RQS ANSWERED BY ARTICLES
TABLE A4
ARTICLES PERCENTAGE PER JOURNAL
Publication Venue Type Freq. %
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY Jour. 2 3
AI MAGAZINE Jour. 1 2
IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES Jour. 1 2
IEEE TRANSACTIONS ON MULTIMEDIA Jour. 2 3
TABLE A5
Microblogging 31 A1, A2, A5, A7, A9, A11, A13, A15, A17, A21, A28, A30, A32, A35, A39, A41, A42, A45, A47, A49, A50, A51, A54, A57, A59, A60, A61, A62, A63, A64, A66
Product Reviews 1 A33
Social Networks 12 A8, A10, A12, A14, A15, A18, A26, A32, A46, A53, A54, A59
Video and Photo sharing 11 A9, A12, A15, A23, A29, A34, A36, A38, A40, A43, A58
TABLE A7
NOVEL TECHNIQUES FEATURES
A17 | Biterm Topic Model (BTM) | Capture the topics within short texts by explicitly modeling word co-occurrence patterns in the whole corpus. Discover more prominent and coherent topics than the state-of-the-art competitors. | Outperforms the online LDA in terms of effectiveness.
A37 | Interest-based Factor Graph Model (I-FGM) | Proposed to take both network topology and node features into consideration. Makes the most of the strong inference abilities of the probability model and the graph model. | -
A58 | Topic-Sensitive Influencer Mining (TSIM) | Aims to find the influential nodes in the networks. Improves the performance significantly in the applications of friends’ suggestion and photo recommendation. | Outperforms LDA in terms of friends’ suggestion and photo recommendation.
A34 | Latent Space Method | Discovers the latent semantic space from both context and content links in multimedia information networks. Solve the problem with sparse context. The learned latent semantic space can be applied for many applications, such as multimedia annotation and retrieval. | Extends the traditional LSI algorithm by low-rank approximation.
A32 | Novel | The proposed framework performs language knowledge integration and feature reduction simultaneously. Improves the short texts clustering performance. Scales linearly with the number of short texts and the number of integrated languages. | -
A64 | Online Incremental Clustering Algorithm | Provide useful situation awareness information through a set of tightly integrated components. Enhance timely situation awareness across a range of crisis types. | Resolves the weaknesses in K-Means and EM.
A29 | Decision Fusion for Multimodal Biometrics | Reduces the false acceptance rate for both single biometric traits and multimodal biometrics when the social network analysis is employed. Independently classify an actor from the relationship among actors. | -
A9 | Unsupervised Feature Selection Framework (LUFS) | Exploit link information effectively in comparison with the state-of-the-art unsupervised feature selection methods. | -
A43 | Neighborhood Similarity Measure | Encodes both the local density information and semantic information. Enhances the scalability to conduct approximated nearest neighbor search. Enhance the robustness on diversified genres of images. | Outperforms the k-NN methods using the labeled data only.
A8 | Semantic Social Graph (SSG) | Discovers the implicit semantic relations between entities in text messages. Enriches graph representation of entities contained in text messages generated by a user. | Significantly outperforms Naive Bayes classifier in accuracy and reliability.
ACKNOWLEDGEMENT:
MohammadNoor Injadat and Fadi Salo would like to thank the University of Western Ontario for
supporting this research.
Dr. Ali Bou Nassif would like to thank the University of Sharjah for supporting this research.
REFERENCES
[1] A.L. Kavanaugh, E.A. Fox, S.D. Sheetz, S. Yang, L.T. Li, D.J. Shoemaker, et al., Social media use by government: From the routine to
the critical, Gov. Inf. Q. 29 (2012) 480–491. doi:10.1016/j.giq.2012.06.002.
[2] H. Chen, R.H.L. Chiang, V.C. Storey, Business Intelligence and Analytics: From Big Data To Big Impact, Mis Q. 36 (2012) 1165–
1188.
[3] M. Zuber, A Survey of Data Mining Techniques for Social Network Analysis, Int. J. Res. Comput. Eng. Electron. 3 (2014) 1–8.
[4] S. Yu, S. Kak, A survey of prediction using social media, arXiv Prepr. arXiv1203.1647. (2012) 1–20. http://arxiv.org/abs/1203.1647.
[5] V. Vuori, J. Väisänen, The use of social media in gathering and sharing competitive intelligence, ICEB 2009 Proc. (2009) 1–8.
[6] P.C. Rafeeque, S. Sendhilkumar, A survey on Short text analysis in Web, 2011 Third Int. Conf. Adv. Comput. (2011) 365–371.
doi:10.1109/ICoAC.2011.6165203.
[7] M. Tsytsarau, T. Palpanas, Survey on mining subjective data on the web, Data Min. Knowl. Discov. 24 (2012) 478–514.
doi:10.1007/s10618-011-0238-6.
[8] S. Gole, B. Tidke, A survey of Big Data in social media using data mining techniques, in: 2015 Int. Conf. Adv. Comput. Commun.
Syst. (ICACCS -2015), 2015: pp. 1–5. doi:10.1109/ICACCS.2015.7324059.
[9] B. Kitchenham, S. Charters, Guidelines for performing Systematic Literature Reviews in Software Engineering, Tech. Rep.
EBSE-2007-01, Keele Univ. Univ. Durham. (2007). doi:10.1145/1134285.1134500.
[10] D. Hand, Statistics and data mining: intersecting disciplines, ACM SIGKDD Explor. Newsl. 1 (1999) 16–19.
doi:10.1145/846170.846171.
[11] A. Berson, S.J. Smith, Building Data Mining Applications for CRM, McGraw-Hill, Inc., New York, NY, USA, 2002.
[12] X. Wu, V. Kumar, The top ten algorithms in data mining, CRC Press, 2009.
[13] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, et al., Top 10 algorithms in data mining, Knowl. Inf. Syst. 14
(2008) 1–37. doi:10.1007/s10115-007-0114-2.
[14] D.M. Boyd, N.B. Ellison, Social network sites: Definition, history, and scholarship, J. Comput. Commun. 13 (2007) 210–230.
doi:10.1111/j.1083-6101.2007.00393.x.
[15] M.G. Smith, L. Bull, Feature construction and selection using genetic programming and a genetic algorithm, in: Genet. Program.,
Springer, 2003: pp. 229–237.
[16] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, others, Knowledge Discovery and Data Mining: Towards a Unifying Framework., in:
KDD, 1996: pp. 82–88.
[17] B. Ratner, Statistical and machine-learning data mining: Techniques for better predictive modeling and analysis of big data, CRC Press,
2011.
[18] D. Pohl, A. Bouchachia, H. Hellwagner, Social media for crisis management: clustering approaches for sub-event detection, Multimed.
Tools Appl. (2013) 1–32. doi:10.1007/s11042-013-1804-2.
[19] D. Kotsakos, P. Sakkos, I. Katakis, D. Gunopulos, #tag: Meme or Event?, in: 2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal.
Min., 2014: pp. 391–394. doi:10.1109/ASONAM.2014.6921615.
[20] H.W. Chang, D. Lee, M. Eltaher, J. Lee, @Phillies tweeting from philly? Predicting twitter user locations with spatial word usage, in:
Proc. 2012 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 111–118.
doi:10.1109/ASONAM.2012.29.
[21] E. Costa, R. Ferreira, P. Brito, I.I. Bittencourt, O. Holanda, A. MacHado, et al., A framework for building web mining applications in
the world of blogs: A case study in product sentiment analysis, Expert Syst. Appl. 39 (2012) 4813–4834.
doi:10.1016/j.eswa.2011.09.135.
[22] A. Akay, A. Dragomir, A Novel Data-Mining Approach Leveraging Social Media to Monitor Consumer Opinion of Sitagliptin, IEEE J.
Biomed. Heal. Informatics. 19 (2015) 389–396. doi:10.1109/JBHI.2013.2295834.
[23] R.Y.K. Lau, Y. Xia, Y. Ye, A probabilistic generative model for mining cybercriminal networks from online social media, IEEE
Comput. Intell. Mag. 9 (2014) 31–43. doi:10.1109/MCI.2013.2291689.
[24] B. Ceran, R. Karad, A. Mandvekar, S.R. Corman, H. Davulcu, A semantic triplet based story classifier, in: Proc. 2012 IEEE/ACM Int.
Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 573–580. doi:10.1109/ASONAM.2012.97.
[25] J. Gelernter, S. Balaji, An algorithm for local geoparsing of microtext, Geoinformatica. 17 (2013) 635–667.
doi:10.1007/s10707-012-0173-8.
[26] A. Al-Kouz, S. Albayrak, An interests discovery approach in social networks based on semantically enriched graphs, in: Proc. 2012
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 1272–1277. doi:10.1109/ASONAM.2012.219.
[27] J. Tang, H. Liu, An Unsupervised Feature Selection Framework for Social Media Data, IEEE Trans. Knowl. Data Eng. 4347 (2014)
2914–2927. doi:10.1109/TKDE.2014.2320728.
[28] C.C. Yang, T.D. Ng, Analyzing and visualizing web opinion development and social interactions with density-based clustering, IEEE
Trans. Syst. Man, Cybern. Part A Syst. Humans. 41 (2011) 1144–1155. doi:10.1109/TSMCA.2011.2113334.
[29] M. Song, M.C. Kim, Y.K. Jeong, Analyzing the political landscape of 2012 korean presidential election in twitter, IEEE Intell. Syst. 29
(2014) 18–26. doi:10.1109/MIS.2014.20.
[30] S. Das, Ö. Eğecioğlu, A. El Abbadi, Anónimos: An LP-based approach for anonymizing weighted social network graphs, IEEE Trans.
Knowl. Data Eng. 24 (2012) 590–604. doi:10.1109/TKDE.2010.267.
[31] S. Bouktif, M.A. Awad, Ant colony based approach to predict stock market movement from mood collected on Twitter, in: 2013
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min. Ant, 2013: pp. 837–845. doi:10.1145/2492517.2500282.
[32] R. Boulet, B. Jouve, F. Rossi, N. Villa, Batch kernel SOM and related Laplacian methods for social network analysis, Neurocomputing.
71 (2008) 1257–1273. doi:10.1016/j.neucom.2007.12.026.
[33] M. Saravanan, S. Buveneswari, S. Divya, V. Ramya, Bayesian filters for mobile recommender systems, in: Proc. - 2011 Int. Conf. Adv.
Soc. Networks Anal. Mining, ASONAM 2011, 2011: pp. 715–721. doi:10.1109/ASONAM.2011.51.
[34] P.M. Hartmann, M. Zaki, N. Feldmann, A. Neely, Big Data for Big Business? A Taxonomy of Data-driven Business Models used by
Start-up Firms, Cambridge Serv. Alliance Blog. (2014) 1–29. http://cambridgeservicealliance.blogspot.co.uk/2014/04/big-data-for-big-
business_3.html.
[35] X. Cheng, X. Yan, Y. Lan, J. Guo, BTM: Topic Modeling over Short Texts, IEEE Trans. Knowl. Data Eng. 26 (2014) 2928–2941.
doi:10.1109/TKDE.2014.2313872.
[36] M.A. Rahman, A. El Saddik, W. Gueaieb, Building dynamic social network from sensory data feed, IEEE Trans. Instrum. Meas. 59
(2010) 1327–1341. doi:10.1109/TIM.2009.2038307.
[37] Y. Lu, F. Wang, R. Maciejewski, Business Intelligence from Social Media: A Study from the VAST Box Office Challenge, IEEE Comput. Graph. Appl.
34 (2014) 58–69. doi:10.1109/MCG.2014.61.
[38] B.J. Jansen, K. Sobel, G. Cook, Classifying ecommerce information sharing behaviour by youths on social networking sites, J. Inf. Sci.
37 (2011) 120–136. doi:10.1177/0165551510396975.
[39] E. Ferrara, M. JafariAsbagh, O. Varol, V. Qazvinian, F. Menczer, A. Flammini, Clustering memes in social media, in: Proc. 2013
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min. - ASONAM ’13, 2013: pp. 548–555. doi:10.1145/2492517.2492530.
[40] H.-N. Kim, A.-T. Ji, I. Ha, G.-S. Jo, Collaborative filtering based on collaborative tagging for enhancing the quality of
recommendation, Electron. Commer. Res. Appl. 9 (2010) 73–83. doi:10.1016/j.elerap.2009.08.004.
[41] M. Wang, F. Li, M. Wang, Collaborative visual modeling for automatic image annotation via sparse model coding, Neurocomputing.
95 (2012) 22–28. doi:10.1016/j.neucom.2011.04.049.
[42] X. Si, E.Y. Chang, Z. Gyöngyi, M. Sun, Confucius and its intelligent disciples: integrating social with search, Proc. VLDB Endow. 3
(2010) 1505–1516. doi:10.1145/1645953.1645955.
[43] J. Piorkowski, L. Zhou, Content Feature Enrichment for Analyzing Trust Relationships in Web Forums, in: 2013 IEEE/ACM Int. Conf.
Adv. Soc. Networks Anal. Min., 2013: pp. 1486–1487.
[44] I. Ting, S. Wang, Content Matters: A study of hate groups detection based on social networks analysis and web mining, in: 2013
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2013: pp. 1196–1201. doi:10.1145/2492517.2500254.
[45] P. Biyani, C. Caragea, P. Mitra, C. Zhou, J. Yen, G.E. Greer, et al., Co-training over Domain-independent and Domain-dependent
features for sentiment analysis of an online cancer support community, in: 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal.
Mining, ASONAM 2013, 2013: pp. 413–417. doi:10.1145/2492517.2492606.
[46] A. Beykikhoshk, T. Caelli, Data-Mining Twitter and the Autism Spectrum Disorder: A Pilot Study, in: 2014 IEEE/ACM Int. Conf.
Adv. Soc. Networks Anal. Min., 2014: pp. 349–356.
[47] P.P. Paul, M.L. Gavrilova, R. Alhajj, Decision Fusion for Multimodal Biometrics Using Social Network Analysis, IEEE Trans. Syst.
Man, Cybern. Syst. 44 (2014) 1522–1533.
[48] J.S. Alowibdi, U.A. Buy, P.S. Yu, L. Stenneth, Detecting Deception in Online Social Networks, in: 2014 IEEE/ACM Int. Conf. Adv.
Soc. Networks Anal. Min., 2014: pp. 383–390.
[49] D. Schniederjans, E.S. Cao, M. Schniederjans, Enhancing financial performance with social media: An impression management
perspective, Decis. Support Syst. 55 (2013) 911–918. doi:10.1016/j.dss.2012.12.027.
[50] J. Tang, X. Wang, H. Gao, X. Hu, H. Liu, Enriching short text representation in microblog for clustering, Front. Comput. Sci. China. 6
(2012) 88–101. doi:10.1007/s11704-011-1167-7.
[51] A. Ghose, P.G. Ipeirotis, Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics,
IEEE Trans. Knowl. Data Eng. 23 (2011) 1498–1512. doi:10.1109/TKDE.2010.188.
[52] G. Qi, C. Aggarwal, Q. Tian, Exploring Context and Content Links in Social Media: A Latent Space Method, IEEE Trans.
Pattern Anal. Mach. Intell. 34 (2012) 850–862.
[53] B. Yee Liau, P. Pei Tan, Gaining customer knowledge in low cost airlines through text mining, Ind. Manag. Data Syst. 114 (2014)
1344–1359. doi:10.1108/IMDS-07-2014-0225.
[54] C.H.C. Leung, A.W.S. Chan, A. Milani, J. Liu, Y. Li, Intelligent Social Media Indexing and Sharing Using an Adaptive Indexing
Search Engine, ACM Trans. Intell. Syst. Technol. 3 (2012) 1–27. doi:10.1145/2168752.2168761.
[55] F. Tan, L. Li, Z. Zhang, Y. Guo, Latent Co-interests’ Relationship Prediction, Tsinghua Sci. Technol. 18 (2013) 379–386.
[56] S.Y. Wang, W.S. Liao, L.C. Hsieh, Y.Y. Chen, W.H. Hsu, Learning by expansion: Exploiting social media for image classification
with few training examples, Neurocomputing. 95 (2012) 117–125. doi:10.1016/j.neucom.2011.05.043.
[57] L. Dickens, I. Molloy, J. Lobo, Learning Stochastic Models of Information Flow, in: 2012 IEEE 28th Int. Conf. Data Eng., 2012: pp.
570–581.
[58] J. Biel, D. Gatica-Perez, Mining Crowdsourced First Impressions in Online Social Video, IEEE Trans. Multimed. 16 (2014)
2062–2074.
[59] X. Chen, M. Vorvoreanu, K. Madhavan, Mining Social Media Data for Understanding Students’ Learning Experiences, IEEE Trans.
Learn. Technol. 7 (2014) 246–259.
http://web.ics.purdue.edu/~chen654/pub/XinChen_etal_IEEETrans_tlt-cs_Mining_Twitter.pdf,
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6697807.
[60] C.H. Lee, Mining spatio-temporal information on microblogging streams using a density-based online clustering method, Expert Syst.
Appl. 39 (2012) 9623–9641. doi:10.1016/j.eswa.2012.02.136.
[61] S. Wang, Q. Huang, S. Jiang, Q. Tian, L. Qin, Nearest-neighbor method using multiple neighborhood similarities for social media data
mining, Neurocomputing. 95 (2012) 105–116. doi:10.1016/j.neucom.2011.06.039.
[62] A. Akay, A. Dragomir, B.-E. Erlandsson, Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care,
IEEE J. Biomed. Heal. Informatics. 19 (2015) 210–218.
[63] N. Collier, N.T. Son, N.M. Nguyen, OMG U got flu? Analysis of shared health messages for bio-surveillance, J. Biomed. Semantics. 2
(2011) 1–10. doi:10.1186/2041-1480-2-S5-S9.
[64] F. Rossi, N. Villa-Vialaneix, Optimizing an organized modularity measure for topographic graph clustering: A deterministic annealing
approach, Neurocomputing. 73 (2010) 1142–1163. doi:10.1016/j.neucom.2009.11.023.
[65] A. Jaiswal, W. Peng, T. Sun, Predicting Time-sensitive User Locations from Social Media, in: 2013 IEEE/ACM Int. Conf. Adv. Soc.
Networks Anal. Min., 2013: pp. 870–877. doi:10.1145/2492517.2500229.
[66] D.H.-L. Goh, A. Chua, C.S. Lee, K. Razikin, Resource discovery through social tagging: a classification and content analytic approach,
Online Inf. Rev. 33 (2009) 568–583. doi:10.1108/14684520910969961.
[67] G. Cai, H. Wu, R. Lv, Rumors Detection in Chinese via Crowd Responses, in: 2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal.
Min., 2014: pp. 912–917.
[68] Y. Chen, X. Zhang, Z. Li, J.-P. Ng, Search engine reinforced semi-supervised classification and graph-based summarization of
microblogs, Neurocomputing. 152 (2015) 274–286. doi:10.1016/j.neucom.2014.10.068.
[69] R. Dehkharghani, H. Mercan, A. Javeed, Y. Saygin, Sentimental causal rule discovery from Twitter, Expert Syst. Appl. 41 (2014)
4950–4958. doi:10.1016/j.eswa.2014.02.024.
[70] C.-Y. Lin, L. Wu, Z. Wen, H. Tong, V. Griffiths-Fisher, L. Shi, et al., Social Network Analysis in Enterprise, Proc. IEEE. 100 (2012)
2759–2776. doi:10.1109/JPROC.2012.2203090.
[71] L. Kwok, B. Yu, Spreading Social Media Messages on Facebook: An Analysis of Restaurant Business-to-Consumer Communications,
Cornell Hosp. Q. 54 (2013) 84–94. doi:10.1177/1938965512458360.
[72] A. Malhotra, L. Totti, W. Meira, P. Kumaraguru, V. Almeida, Studying user footprints in different online social networks, in: Proc. 2012
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 1065–1070. doi:10.1109/ASONAM.2012.184.
[73] T. Finin, A. Joshi, P. Kolari, A. Java, A. Kale, A. Karandikar, The Information Ecology of Social Media and Online Communities, AI
Mag. 29 (2008) 77–92. doi:10.1609/aimag.v29i3.2158.
[74] A. Gal-Tzur, S.M. Grant-Muller, T. Kuflik, E. Minkov, S. Nocera, I. Shoor, The potential of social media in delivering transport policy
goals, Transp. Policy. 32 (2014) 115–123. doi:10.1016/j.tranpol.2014.01.007.
[75] P. Bogdanov, M. Busch, J. Moehlis, A.K. Singh, B.K. Szymanski, The social media genome: modeling individual topic-specific
behavior in social media, in: Proc. 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2013: pp. 236–242.
doi:10.1145/2492517.2492621.
[76] Q. Fang, J. Sang, C. Xu, Y. Rui, Topic-sensitive influencer mining in interest-based social media networks via hypergraph learning,
IEEE Trans. Multimed. 16 (2014) 796–812. doi:10.1109/TMM.2014.2298216.
[77] G. Paltoglou, M. Thelwall, Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media, ACM Trans. Intell. Syst.
Technol. 3 (2012) 1–19. doi:10.1145/2337542.2337551.
[78] C.H. Lee, Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams,
Expert Syst. Appl. 39 (2012) 13338–13356. doi:10.1016/j.eswa.2012.05.068.
[79] S. O’Banion, L. Birnbaum, Using explicit linguistic expressions of preference in social media to predict voting behavior, in: 2013
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2013: pp. 207–214. doi:10.1145/2492517.2492538.
[80] J.H. Wang, M.S. Lin, Using inter-comment similarity for comment spam detection in Chinese blogs, in: Proc. - 2011 Int. Conf. Adv.
Soc. Networks Anal. Mining, ASONAM 2011, 2011: pp. 189–194. doi:10.1109/ASONAM.2011.49.
[81] J. Dickerson, V. Kagan, V. Subrahmanian, Using Sentiment to Detect Bots on Twitter: Are Humans more Opinionated than Bots?, in:
2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2014: pp. 620–627. http://jpdickerson.com/pubs/dickerson14using.pdf.
[82] J. Yin, A. Lampert, M. Cameron, B. Robinson, R. Power, Using social media to enhance emergency situation awareness, IEEE Intell.
Syst. 27 (2012) 52–59. doi:10.1109/MIS.2012.6.
[83] E. Ferrara, P. De Meo, G. Fiumara, R. Baumgartner, Web data extraction, applications and techniques: A survey, Knowledge-Based
Syst. 70 (2014) 301–323. doi:10.1016/j.knosys.2014.07.007.
[84] A. Boutet, H. Kim, E. Yoneki, What’s in twitter: I know what parties are popular and who you are supporting now!, in: Proc. 2012
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 132–139. doi:10.1109/ASONAM.2012.32.
MohammadNoor Injadat received the BSc degree in computer science from Al al-Bayt University, Jordan, in 2000 and the MSc
degree in computer science from University Putra Malaysia, Malaysia, in 2002. He obtained a Master of Engineering
in Electrical and Computer Engineering from the University of Western Ontario in 2015. He is currently working toward
his PhD degree in Software Engineering at the Department of Electrical and Computer Engineering, University of
Western Ontario, Canada. His research interests include data mining, machine learning, social network analysis, data
analytics, and cloud computing. MohammadNoor is a member of the IEEE Computer Society.
Fadi Salo received the BSc degree in computer science from Al-Ahliyya Amman University, Jordan, in 1999 and the MSc
degree in computer science from University Putra Malaysia, Malaysia, in 2005. He obtained a Master of Engineering in
Electrical and Computer Engineering from the University of Western Ontario in 2015. He is currently working toward his
PhD degree in Software Engineering at the Department of Electrical and Computer Engineering, University of Western
Ontario, Canada. His research interests include data mining, text mining, machine learning, social network analysis,
data analytics, and intrusion detection systems. Fadi is a member of the IEEE Computer Society.
Ali Bou Nassif is currently an Assistant Professor at University of Sharjah, UAE. He obtained a Master’s degree in
Computer Science and a Ph.D. degree in Electrical and Computer Engineering from Western University in 2009 and
2012, respectively. Ali’s research interests include the applications of statistical and artificial intelligence models in
different areas such as software engineering, electrical engineering, e-learning and social media, as well as cloud
computing and mobile computing. Ali is a registered professional engineer in Ontario and a member of the IEEE
Computer Society and the Association for Computing Machinery (ACM).