Doan Uccs 0892D 10279
by
TRI SI DOAN
B.S., Computer Science, Ho Chi Minh City University of Technology, Viet Nam, 1990
Doctor of Philosophy
2017
This dissertation for the Doctor of Philosophy degree by
TRI SI DOAN
has been approved by
Terrance Boult
Rory Lewis
Jonathan Ventura
Abhijit Bendale
Doan, Tri Si (Ph.D., Engineering)
ABSTRACT
Data mining practitioners often face problems of the unavailability of all training data at the
same time and the inability to process a large amount of data due to constraints such as lack
of adequate system memory. Building a data mining system with whatever data is available
at a certain time is a practical solution. A problem arises when new classes are introduced
into a trained system during testing, because the learned model does not have the ability
to handle unknown classes. Traditional mining
models fail to detect and learn new classes. While current solutions have been well studied
in computer vision, the challenge of how computer systems deal with unknown classes has
received less attention, particularly in text classification tasks in Natural Language Processing
(NLP). It is in this realm that this dissertation will focus its resources while overcoming
the aforementioned challenges. In this thesis, we extend the ensemble learning approach
to overcome the large scale data challenge. First, we introduce our solution to select an
algorithm for each partitioned region using meta-learning approaches. Next, we propose a
solution to aggregate the final prediction. The problem with the majority vote methodology
is that the majority outvotes the minority of trained classifier results, often rendering an incorrect final prediction.
ACKNOWLEDGMENTS
I would like to express my deepest appreciation to my advisor, Professor Jugal Kalita, for
his guidance and support during my study at UCCS. Without his guidance and persistent
help, this dissertation would not have been possible. I am also grateful to the members of
my committee for their patience and support in overcoming the numerous obstacles I have
faced through my research: especially Prof. Boult, for several discussions on the open
set problem and help with a conference poster; Prof. Lewis, for his generous support and for
inspiring me to pursue the area of data mining; Prof. Ventura, for his discussions; and
Dr. Abhijit Bendale, a scientist at Samsung Research America, for providing his dissertation.
I would like to thank my fellow students at the Linc Lab for their feedback, particularly
Michael Bankston and Connor Clark, for their willingness to help. My special thanks go to
Thomas Conley, who used his spare time to read my dissertation and helped with LaTeX. In
addition, I would like to express my gratitude to the Writing Center staff, Heather Gutekunst,
Makaela Worden, and Malcome Campbell, for help with my dissertation writing.
I am also grateful to Micheal Bihn for checking typos. I would like to thank Ethan Rudd
for his kind help and troubleshooting skills, and Chad Mello for his charming support. Last but
not least, I would like to thank Ali Langfels for administrative support.
I would like to express my special thanks to my parents and my wife, who have sacrificed
tremendously and missed me for more than seven years. Finally, I dedicate this Ph.D.
dissertation to my two lovely daughters, Vy and Anh, who have been with me during my
study.
TABLE OF CONTENTS
CHAPTER
I INTRODUCTION 1
1 Purpose of study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Scope of study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4 Ensemble modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.2 Ensemble learning with different base classifiers . . . . . . . . . . . 20
6 Summary of work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.2 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Proposed approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Bibliography 112
Appendices 126
2 IMDB movie review dataset . . . . . . . . . . . . . . . . . . . . . . . . . 130
LIST OF FIGURES
2.1 New split outside current boundary. Adapted from (Lakshminarayanan et al., 2014) 36
3.6 Example of initial values for a new ball of the same class in NCC . . . . . . 71
3.7 Second case of initial new ball of the same class in NCC . . . . . . . . . . 71
3.8 Expansion of boundary after update . . . . . . . . . . . . . . . . . . . . . 74
1.2 Star vs useful votes between elite and non-elite users . . . . . . . . . . . . 127
LIST OF TABLES
3.1 F-score results on Amazon product review dataset with 10, 20 domains . . . 76
3.2 F-score results on Amazon product review dataset with 30 and 40 domains . 76
4.4 F-scores in Experiment 1a on IMDB movie review . . . . . . . . . . . . . 104
CHAPTER I
INTRODUCTION
People often examine a problem from multiple angles or perspectives. The main reason is to guarantee that the final conclusion has sufficient
support. It is commonly referred to as wisdom of the crowd because reasoning shared among
many is usually considered better than that of an individual. Trial by jury is one example
that shows how the aggregated answer among a group can be obtained. In machine learning,
ensemble learning is similar in concept (Rokach, 2010): multiple classifiers or experts
solve the problem and their results are combined. Each model plays the role of an expert in the crowd.
In several data mining fields, increasing volumes of data pose a big challenge because
computational power does not suffice. With existing computing systems, ensemble learning
can be a feasible alternative solution. In fact, ensemble learning has been extensively
studied in the data mining literature (Cesa-Bianchi and Lugosi, 2006), (Chernov et al., 2010)
and has excelled in machine learning competitions (Vovk and Zhdanov, 2009).
The popular uses of the ensemble approach consist of a long list of domain applications
such as computer vision (Bosch et al., 2007), (Samat et al., 2014), (Du et al., 2015), object
tracking (Kalal et al., 2012),(Zhang et al., 2014), bioinformatics and biomedicine (Yang
et al., 2010), (Chen and Ishwaran, 2012), spam detection (Saeedian and Beigy, 2012),
(Neumayer, 2006), Natural Language Processing (NLP) (Xu and Jelinek, 2004), (Whitehead
and Yaeger, 2010), (Luca, 2011) and network intrusion detection (Tavallaee et al., 2009).
1 Purpose of study
Ensemble learning has demonstrated superior performance in many classification problems. Ensemble models often place high in data
mining competitions and thus demonstrate the advantage of ensemble learning over any
single data mining algorithm. Our goal is to develop a data mining model that scales well to
data. To achieve this goal, we investigate a feasible solution using ensemble learning in which:
• the new combination approach is able to balance the votes of minority and majority
classifiers.
In the first challenge, we explore the use of ensemble learning to address the problems posed by large datasets that can only be processed
fractionally due to system memory constraints. In the second challenge, we extend our
study of ensemble learning to deal with text classification under open set settings, to address
limitations of current models. Finally, we investigate enhancing the ability of the ensemble to combine the predictions of its members.
Most ensemble models work well under an implicit assumption of having a single set of
data points. However, a divide-and-conquer approach can generate several datasets which
may have different distributions, such that ensemble members may work well only in particular
data regions. A prediction generated by naively aggregating predictions from all ensemble members may therefore be unreliable.
We overcome this challenge with a new method for combining the predictions of different
classifiers. This solution provides crucial support for the scalable model proposed in this
thesis. We summarize the goals of our approach as follows: it is dynamic, being able to
handle real-world assumptions, and scalable, making it more amenable to dealing with large datasets.
• The ‘real world assumption’ addresses the problems that arise when training data
becomes available incrementally.
2 Scope of study
Our objective is to propose a scalable algorithm with ensemble learning. Our study focuses
on the classification problem using ensemble learning even though similar solutions can
be achieved for the regression task as well. In particular, we follow an incremental approach
where training data are learned in chunks. We use incremental learning to handle large
scale data by building a model with available data and allowing the model to update itself
efficiently. Figure 1.2 illustrates two ensemble models. The left figure describes a situation
where ensemble members are created from different subsets of data with sampling. The
right figure describes an ensemble model where different ensemble members are created
State-of-the-art ensemble models such as Random Forests and eXtreme Gradient Boosting
follow the approach of the left figure. Two types of ensemble classifiers built
from the same type of base classifier are tree-based and nearest-centroid-based classifiers.
An ensemble of tree-based models is known for its high performance, whereas one built
from Nearest Centroid classifiers is a natural choice for dealing with the open set problem. We
investigate solutions with both approaches. We further study the second approach (refer to
the right figure in Figure 1.2), where ensemble members are different types of classifiers.
3 Methodology
Ensemble learning refers to a learning technique that achieves high predictive power by
aggregating or combining the predictions of a set of different models, supporting the concept of
the “wisdom of the crowd”. In general, an ensemble model consists of several base algorithms, which raises the following challenge:
• Algorithm selection for a given data space: Searching for a suitable classifier for a dataset is often
difficult, since a poor choice may affect the final prediction of the ensemble model.
An automatic advice system may provide a better way to deal with an unfamiliar data
domain, the high cost of consulting experts, or time-consuming experiments.
Ensemble models often outperform their best individual member (Džeroski and Ženko, 2004). In order to achieve improved performance, two main components
define an ensemble learning system: 1) diversity in classification algorithms, and
2) a combining strategy (Dietterich, 2002). Diversity aims at learning from different errors by
a variety of classifiers (Brown et al., 2005a). Without diversity, there is no effective learning
gain because the same errors are always treated the same way by the system (Brown et al.,
2005a), (Peterson and Martinez, 2005). It is noted that diversity requires that the classifiers
be independent of each other. (Chandra and Yao, 2006) further consider the interactions
among ensemble members in the cost function of each ensemble member. As a result, each classifier minimizes its mean
square error (MSE) together with the remaining ensemble members. However, independence
among classifiers does not guarantee the best outcome in the combined classifier (Kuncheva
et al., 2000). Combination of many independent and poor performing classifiers may not
improve the ensemble model’s performance. Diversity can be achieved in several ways:
• Data diversity generates multiple datasets from the original data source; different
training sets produce different models.
• The architecture is the structure of an ensemble model that may distinguish one
ensemble from another.
Diversified data can be obtained by Bagging (Breiman, 1996), Boosting (Freund et al.,
1996), or the use of Random subspaces (Ho, 1998) while hyper-parameter searches can
produce a large number of predictors from a base classifier. Ensemble learning can also use more
than one layer, where the outcome of one classifier is an input to another classifier.
The method of combining classifiers determines how a final decision is made by
the ensemble. Common combining methods include majority voting, such as
Unweighted or Weighted Majority voting (Kim et al., 2011), and the Borda count, which takes
the votes for every class into account. Voting methods such as Unweighted/Weighted Majority
voting implement a winner-take-all strategy, in which the vote is assigned only to the class
chosen by the classifier and not to other classes. The Borda count considers the class with the
most support as well as the class with the second most support, and so on; the votes are summed
and the class with the most votes is taken as the ensemble decision. A sketch of these combiners follows.
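The sketch below (our own illustration, not code from the systems cited above) implements unweighted majority voting, weighted majority voting, and the Borda count for label outputs:

```python
from collections import Counter

def majority_vote(labels):
    """Unweighted majority vote: the most frequent predicted label wins."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """Each classifier's vote counts with its weight; the largest total wins."""
    scores = {}
    for label, w in zip(labels, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

def borda_count(rankings, classes):
    """Each classifier ranks all classes; rank r earns len(classes)-1-r points."""
    scores = dict.fromkeys(classes, 0)
    for ranking in rankings:
        for rank, label in enumerate(ranking):
            scores[label] += len(classes) - 1 - rank
    return max(scores, key=scores.get)

print(majority_vote(["spam", "ham", "spam"]))                          # spam
print(weighted_majority_vote(["spam", "ham", "ham"], [0.5, 0.3, 0.3])) # ham
print(borda_count([["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]],
                  classes=["a", "b", "c"]))                            # a
```

Note that the Borda count requires each classifier to submit a full ranking of the classes rather than a single label, which is why it can reward a class that is nobody’s first choice but everybody’s second.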
Currently, the diversity of ensemble learning is assessed using two approaches, pairwise and non-pairwise
measures. In the pairwise approach, the diversity of an ensemble learner with n classifiers
is defined as the average over all n(n−1)/2 pair values. Pairwise metrics include the
disagreement measure, the probability that two classifiers disagree on their results (Ho, 1998),
and the double-fault measure, the probability of both classifiers being incorrect (Ruta and
Gabrys, 2001).
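A minimal sketch of these two pairwise measures (function and variable names are ours); the ensemble-level value is the average of the measure over all n(n−1)/2 classifier pairs:

```python
import numpy as np

def disagreement(pred_a, pred_b, y):
    """Fraction of examples on which exactly one of the two classifiers is correct."""
    return np.mean((pred_a == y) != (pred_b == y))

def double_fault(pred_a, pred_b, y):
    """Fraction of examples on which both classifiers are wrong."""
    return np.mean((pred_a != y) & (pred_b != y))

y = np.array([1, 1, 1, 1, 1])
a = np.array([1, 0, 1, 1, 1])   # one error
b = np.array([1, 1, 0, 1, 1])   # one error, on a different example
print(disagreement(a, b, y))    # 0.4: their correctness differs on 2 of 5 examples
print(double_fault(a, b, y))    # 0.0: they are never wrong together
```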
On the other hand, the non-pairwise approach takes all classifiers into account at once.
Included in this category are entropy measurement (Cunningham and Carney, 2000), Kohavi-
Wolpert variance (KW variance) (Hansen and Salamon, 1990), Interrater agreement (Fleiss
et al., 2013), and the measure of difficulty (Hansen and Salamon, 1990). The entropy
measure captures disagreement among predicted labels; KW variance measures the variance of
predicted labels over different training sets; interrater agreement measures the number of
times the individual learners agree on correct labels; and the measure of difficulty estimates
the shared difficulty that data points present to the classifiers in classification (e.g., data
points that are often given incorrect labels) (Giacinto and Roli, 2001). Each metric requires a
minimum threshold to measure diversity.
A divide-and-conquer strategy decomposes a problem into subproblems in order to solve it efficiently. In the same vein, ensemble learning breaks down the data space
into several sub-regions and performs the learning task in each sub-region. Known as the
wisdom of the crowd approach, diversity of experts is vital to produce the best decision. This
concept that inspired ensemble learning assumes that a diversity among ensemble classifiers
(known as base classifiers) is important for the accuracy of the model. Practically, an ensemble model can outperform
its best classifier member (Džeroski and Ženko, 2004). Understanding diversity is vital to
justify the choice of algorithms used in ensemble learning. Consider an ensemble model
with a majority vote scheme. Let A, B and C denote 3 binary classifiers. Suppose we have a
test set of 10 examples and the ground truth is all “1”. We can see that an ensemble model
may sometimes actually boost the final accuracy using an error correcting
method (case 2), while there is no gain in case 1. Taking a closer look, all three classifiers
A, B and C agree on most of the testing examples in case 1: when A classifies an example as 1, so does
B, and even C, except on the second testing example. In case 2, both A and B achieve 80%
accuracy but they disagree on the third and ninth examples. C misclassifies the second, ninth
and tenth examples, resulting in 70% accuracy. A majority vote works well to correct
such errors when the ensemble members are independent or uncorrelated. To see this, consider another example with even
lower performing individuals. Given 10 test examples with ground
truth all “1”, Figure 1.3 illustrates how the ensemble can produce correct predictions from weak but uncorrelated members.
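Since the original vote table appears only in a figure, the following toy reconstruction uses hypothetical votes with the same accuracies (80%, 80%, 70%) and fully uncorrelated errors to show the error-correcting effect:

```python
import numpy as np

y = np.ones(10, dtype=int)        # ground truth: all "1"
# Hypothetical votes whose errors never overlap (uncorrelated members).
A = y.copy(); A[[1, 4]] = 0       # 80% accurate
B = y.copy(); B[[2, 8]] = 0       # 80% accurate
C = y.copy(); C[[3, 6, 9]] = 0    # 70% accurate

ensemble = (A + B + C >= 2).astype(int)   # majority of three binary votes
for name, p in [("A", A), ("B", B), ("C", C), ("majority", ensemble)]:
    print(name, (p == y).mean())  # 0.8, 0.8, 0.7 -- and 1.0 for the majority vote
```

Because no two members err on the same example, every position has at least two correct votes, so the majority vote reaches 100% accuracy from individually weaker members.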
We now observe that a more diverse set of algorithms is better suited for an ensemble
model. In other words, an ensemble model is only efficient when the ensemble members
are less correlated. Importantly, these uncorrelated ensemble members may include low
performing classifiers, but not classifiers that perform no better than chance. In fact, an ensemble of
such under-performing classifiers (e.g., random guessers) will not improve the final performance.
It is sufficient to say that when all training examples are correctly labeled by the same
number of base classifiers, ensemble margins (referred to also as the uniformity condition)
have been achieved with diversity. However, (Tang et al., 2006) indicate that maximizing
diversity does not come with the desired improvement of accuracy, unlike maximizing
the ensemble margin; this leads to a weak relationship between diversity and performance. To complicate the issue,
high diversity may lead to high performance of ensemble learning in one study but poor
performance in another. When maximizing diversity, the results may be unpredictable, posing the question of how diversity should be exploited.
4 Ensemble modeling
In an earlier work, (Dietterich, 2000) indicated the fundamental reasons for the use of
ensemble learning: computational, statistical and representational benefits that accrue. This
work supports the bias-variance decomposition findings in (Kohavi et al., 1996) and later well-known studies. For a classification
task where we have a set of training examples of the form (x, y), a classification model will
predict which class a particular example belongs to. Let a general classification function be
defined as

$$\hat{y} = f(x; \theta), \quad f: X \rightarrow Z,$$

where $\hat{y}$ denotes the predicted class and Z denotes a set of nominal values. In the same way, a regression task is defined by a function

$$\hat{y} = f(x; \theta), \quad f: X \rightarrow R,$$

where R denotes a set of real values. In both cases, θ denotes parameter values.
It is known that traditional classification and regression models require all training
data to be loaded into system memory at the beginning of a specific learning stage. After a training
process is finished, any new training example can be learned only by starting a new training
process again. Instead of resetting the system, incremental learning is designed to learn a new
concept on the fly. This characteristic gives it the ability to learn with incomplete training
data. Following a divide-and-conquer strategy, an ensemble model decomposes a problem
into subproblems, and solutions are aggregated into a final solution. Predicted outcomes
from all ensemble members can be aggregated by a second layer of classifiers, where another
combining model is trained. For such a model to work well, diversity among ensemble members is very important. With diversity, Windeatt et al.
(Windeatt, 2005) demonstrate that the ensemble model can achieve a smaller squared error than
the average squared error of the base predictors, given a proper ensemble predictor in a
regression problem. For multiple datasets, Domingos et al. (Domingos, 2000) formulated the expected error of an ensemble of size S as

$$E\left[\left(\bar{f} - t\right)^2\right] = \overline{\mathrm{bias}}^2 + \frac{1}{S}\,\overline{\mathrm{var}} + \left(1 - \frac{1}{S}\right)\overline{\mathrm{covar}}, \qquad \text{(I.3)}$$

where E(·) denotes expected error, S denotes the size of the ensemble, $\bar{f} = \frac{1}{S}\sum_i f_i$ with $f_i$ the outcome of the i-th
ensemble member, and t denotes the target. Bias refers to the simplifying assumptions made by a
model to make the target function easier to learn. Variance (referred to as var) indicates the variability of a model’s predictions across training samples.
Equation I.3 shows the influences on the ensemble error of the mean bias (between prediction
and ground truth), the mean variance (among ensemble members) and the mean covariance (average
pairwise covariance among ensemble members). Firstly, when the number of ensemble
members increases, the variance term in this loss function decreases. Secondly, a negative
covariance may reduce the expected loss of the ensemble model. It turns out that low
covariance among members is desirable. For a classification task, the same result can be shown using a bias and variance decomposition of the classification error.
Figure 1.4: Adapted from “Understanding the Bias-Variance Tradeoff” by Scott Fortmann-Roe.
Given a particular dataset, the results can often vary from one classifier to
another. The expected error due to model mismatch is defined as bias, whereas the
variation due to the training sample and randomization is referred to as variance. Determining
a good trade-off (James et al., 2013) between bias and variance is important to achieve good
generalization. Let $f(x^*)$ be the true target at a test point $x^*$, $y^*$ a noisy observation with $E[y^*] = f(x^*)$, and $g(x^*)$ the model’s prediction. The expected squared error is defined as

$$
\begin{aligned}
E[(g(x^*) - y^*)^2] &= E[g(x^*)^2 - 2\,g(x^*)\,y^* + y^{*2}] \\
&= E[g(x^*)^2] - 2\,E[g(x^*)]\,f(x^*) + E[y^{*2}] \quad \text{since } E[g(x^*)\,y^*] = E[g(x^*)]\,f(x^*) \\
&= E[(g(x^*) - E[g(x^*)])^2] + E[g(x^*)]^2 - 2\,E[g(x^*)]\,f(x^*) \\
&\qquad + E[(y^* - f(x^*))^2] + f(x^*)^2 \quad \text{since } E[X^2] = E[(X - E[X])^2] + E[X]^2 \\
&= \underbrace{E[(g(x^*) - E[g(x^*)])^2]}_{\text{variance}} + \underbrace{\left(E[g(x^*)] - f(x^*)\right)^2}_{\text{bias}^2} + \underbrace{E[(y^* - f(x^*))^2]}_{\text{irreducible error}}
\end{aligned}
\tag{I.4}
$$

Note that the third term does not depend on the estimator g(x), so it is often referred to as the
irreducible error.
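The decomposition in Equation I.4 can be checked numerically. The sketch below (our own illustration) uses a deliberately simple learner — one that predicts the mean of a few noisy observations of f(x*) — and verifies that the Monte Carlo estimate of the expected squared error matches variance + bias² + noise:

```python
import numpy as np

rng = np.random.default_rng(0)
f_star, noise_sd, n_train, trials = 1.5, 0.3, 5, 200_000

# Hypothetical learner: g(x*) is the mean of n_train noisy observations of f(x*).
g = (f_star + noise_sd * rng.normal(size=(trials, n_train))).mean(axis=1)
y_star = f_star + noise_sd * rng.normal(size=trials)   # fresh noisy targets

mse = np.mean((g - y_star) ** 2)
variance = g.var()                       # E[(g - E[g])^2]
bias_sq = (g.mean() - f_star) ** 2       # (E[g] - f(x*))^2
noise = noise_sd ** 2                    # E[(y* - f(x*))^2]
print(mse, variance + bias_sq + noise)   # the two estimates agree closely
```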
There are two common ways to reduce this bias-variance problem: ensemble methods and regularization.
• Regularization uses the L1 norm or L2 norm to deal with the bias and variance problem.
• Ensemble learning averages all classifiers’ outputs to reduce variance (refer to Figure
1.5). Random Forests (an ensemble of Decision trees) take this into account in their design
(Breiman, 2001) by first building a set of Decision trees. The difference among trees is ob-
tained by sampling examples and sampling features using Bagging (Bootstrap Aggregation).
The optimal split decision at an intermediate node relies on a random selection of features or a linear
combination of features. Bagging leaves some examples out-of-bag (OOB), which are used to test each tree. The final prediction is obtained with a
combiner.
While sampling examples allows for faster training time due to the smaller subset of
examples, randomized feature selection reduces possible correlation among features, which
can improve performance. These features give Random Forests a predictive power that is
competitive with state-of-the-art methods. Random Forests possess several properties, such as a natural design that allows them to work in parallel and to scale to large datasets.
Random Forests have become an attractive method for a variety of applications in computer
vision such as image classification, pedestrian detection (Marin et al., 2013), and facial
key point tracking (Lepetit and Fua, 2013). Many variants of Random Forests have been
successfully tailored to specific data mining needs. For example, Hough Forests, a variant
of Random Forests, implement Hough transformation during the process of building trees
to identify features of a particular shape within an image for object detection (Gall and
Lempitsky, 2013). Nearest Class Mean Forests (Ristin et al., 2014) use nearest class mean
(the mean descriptor for each class) at decision nodes to identify objects. Historically, most
Forests models have been designed and implemented for batch learning (referred to as off-line),
where all training data are available at training time. The general issue with these off-line
techniques is that all training data and grown trees remain in the main memory.
Ensemble models built with Bagging, Boosting and Stacking are considered conventional
ensemble models that are well studied in the machine learning literature. Bagging (Bootstrap
Aggregating) samples examples with replacement, which allows for creating different classifiers. A
special ensemble model, e.g., the Random Subspace method, applies Bagging to the feature space.
Blending (Jahrer et al., 2010) is a generalization of this idea: it uses multiple layers, where
results from the current layer are aggregated by the next layer. The classifier in the second layer
estimates weights for the outputs of the base learners (Muja and Lowe, 2009).
These multi-layer architectures define stacked ensemble models. The ensemble members can perform independently before aggregating their
predictions into the next layer. On the contrary, the negative correlation ensemble (Liu and Yao,
1999) trains its neural network members simultaneously using a correlation penalty term
in its loss function. In both regression and classification tasks, this model uses the simple
averaging method. The negative correlation model for regression (Brown et al., 2005b) is
designed with two neural networks, a Multi-Layer Perceptron (MLP) and a radial basis function
neural network. A negative correlation model for classification has also been proposed by Dam et al.
The emergence of neural networks in data mining is fueled by their high performance in
tasks such as face recognition. For example, the Convolutional Neural Network (CNN) architecture
(LeCun et al., 1998) is a feedforward neural network that has multiple layers, including
convolution, activation and pooling layers, followed by a fully connected layer and a final output layer.
In a deep structure, one CNN can stack on top of another CNN to form multiple layers
of non-linear operation. An ensemble of deep SVMs (Abdullah et al., 2009) uses one SVM
in the first hidden layer for feature extraction. This layer is used as an input for the next
layer to approximate the target function. An ensemble of deep learning for regression (Qiu et al.,
2014) uses Support Vector Regression (SVR) as the aggregate algorithm. Neural Networks
(NN), particularly Deep learning (including CNN) have demonstrated their predictive power
in the computer vision domain. We believe that it is feasible to apply Deep Learning in the
field of Natural Language Processing (NLP); however, words and sentences in NLP
replace the image pixels of computer vision. Word embedding models, for example Word2vec (Le and Mikolov,
2014) or GloVe (Pennington et al., 2014), map each word in the vocabulary to a dense vector, in contrast to sparse one-hot encodings that merely index words.
Such word vectors can be put together in matrix form for computation.
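As a toy illustration (the vocabulary and embedding values below are invented), a sentence becomes a matrix by stacking rows of an embedding table, and a crude fixed-size representation can be obtained by averaging:

```python
import numpy as np

vocab = {"good": 0, "movie": 1, "boring": 2}   # toy vocabulary
one_hot = np.eye(len(vocab))                   # sparse one-hot rows, no similarity
E = np.array([[0.8, 0.1],                      # dense 2-d embeddings
              [0.0, 0.5],                      # (values are invented)
              [-0.7, 0.2]])

ids = [vocab[w] for w in ["good", "movie"]]
M = E[ids]                     # the sentence as a (length x dim) matrix
print(one_hot[ids])            # one-hot indexing, by contrast, is all 0s and one 1
print(M.mean(axis=0))          # a simple fixed-size sentence representation
```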
Word vectors in natural language processing are sometimes referred to as word embeddings, a dense, learned
representation. Character embeddings using CNNs have also been used for part-of-speech
tagging (dos Santos and Zadrozny, 2014). A Deep Learning architecture with 9 layers has
been used by (dos Santos and Zadrozny, 2014) for sentiment analysis and by (Zhang et al.,
2015b) for text categorization tasks. (Kim et al., 2015) use the output of a character-level model
as the input to an LSTM (Sak et al., 2014) and apply it to different languages.
Let Ω = {ω1, . . . , ωc} denote a set of classes. There are two common outcomes of a classifier for a classification task: a class label, and a
probability of belonging to each class (a probability of membership).
(Kuncheva, 2004) proposes a probabilistic framework for combining label outputs. Let
P(ωk is the true class | s), or P(ωk | s) for short, be the probability of class ωk given the vector
s = [s1, . . . , sL] of the classifiers’ predicted labels. Assuming conditionally independent classifiers,

$$P(\omega_k \mid s) = \frac{P(\omega_k)}{P(s)} \times \prod_i P(s_i \mid \omega_k). \qquad \text{(I.6)}$$
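A sketch of this combiner follows; in practice the terms P(s_i | ω_k) would be estimated from each classifier’s confusion matrix on validation data, whereas the numbers below are invented:

```python
import numpy as np

priors = np.array([0.5, 0.5])                 # P(omega_k) for two classes
# cond[i, j, k] = P(classifier i outputs label j | true class is k); in practice
# these come from each classifier's confusion matrix (numbers here are invented).
cond = np.array([[[0.9, 0.2], [0.1, 0.8]],    # classifier 1
                 [[0.7, 0.4], [0.3, 0.6]]])   # classifier 2

def nb_combine(label_vector):
    """Posterior over classes given the label vector s (Equation I.6)."""
    post = priors.copy()
    for i, s_i in enumerate(label_vector):
        post *= cond[i, s_i, :]               # multiply in P(s_i | omega_k)
    return post / post.sum()                  # 1/P(s) is just normalization

print(nb_combine([0, 1]))   # the classifiers disagree; reliability breaks the tie
```

The division by P(s) in Equation I.6 is only a normalization, so it suffices to normalize the product at the end.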
For an ensemble of classifiers, there are two ways to generate the final decision: 1)
combining and 2) generating. A combining method commonly uses the Majority Vote (hard
vote), the (Randomized) Weighted Majority Vote (soft vote), or a Naive Bayes combiner, while
a generating method often uses the average or the product. Since classifiers may generate different
types of outputs, the choice of combiner depends on the output type.
Decision trees are well-known for high accuracy in the field of data mining (Quinlan, 2014).
Decision trees are also sensitive and unstable with respect to training data. This means that a different
set of data or additional data during training will result in different trees. The process of
classification with Decision trees is as follows. In Figure 1.7, an incoming data example is
classified by following down from the top of the tree (root) to the leaf. At each intermediate
node, a splitting decision is made to determine the left or the right direction this example
should follow. To measure the possible classification error, we define a loss function L(·) over the
labels $y^m = (y_1^m, \ldots, y_{n_m}^m)$ at the current node m. The reduction in error obtained by splitting at
node m is determined by

$$\delta(y^m) = L(y^m) - \left[\frac{n_{ml}}{n_m} L(y^{ml}) + \frac{n_{mr}}{n_m} L(y^{mr})\right],$$

with $n_{ml}$ and $n_{mr}$ being the numbers of examples that go to the left or the right of the split, and L(·) being the loss
function.
$$L_{gini}(y^{(m)}) = \sum_{d \in D} p^{(m)}(d)\left[1 - p^{(m)}(d)\right], \qquad \text{(I.7)}$$

where $p^{(m)}(d)$ is the proportion of examples of class d at node m.
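A minimal sketch of the Gini loss and the split gain δ defined above (our own illustration):

```python
from collections import Counter

def gini(labels):
    """L_gini(y) = sum over classes d of p(d) * (1 - p(d))."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def split_gain(parent, left, right):
    """delta(y^m): node impurity minus the weighted impurity of the children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(split_gain(parent, [0, 0, 0], [1, 1, 1]))  # 0.5: a perfect split
print(split_gain(parent, [0, 1, 0], [1, 0, 1]))  # ~0.06: almost no improvement
```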
Two loss functions are often used in classification problems: the Gini loss above and the entropy
loss. A tree grown without limits tends to overfit the training data; to avoid this problem, stopping criteria are often introduced. These include the depth of the tree
reaching a certain threshold, or the number of data points or the homogeneity of the distribution in
the terminal nodes going below certain parameterized threshold values. The main objective
of using a stopping criterion is to prevent a Decision tree from growing too complex, so
that errors on testing cases do not deteriorate compared to errors in training. Pruning the
tree is another method to deal with overfitting once a tree has grown to its fullest. Such
approaches may lead to different prediction results because the trees become different.
This approach has emerged as a popular practice, with the diversity of the ensemble members
coming from different types of classifiers instead of relying on different subsets of examples.
Each member is distinguished from the others by its predictive performance, and usually the final
ensemble performance is better.
A common way to combine such members is voting. The simplest form is a democratic system, where the consent of more than half of the
committee members determines the final decision. For an ensemble of classifiers where each
classifier is selected as a different expert, different weights may need to be used. In their study, (Littlestone
and Warmuth, 1989) discover the benefit of a weighted majority of votes and
propose a process to address the drawback that arises from the high mistake bound of a plain weighted
majority of votes. (Zamani et al., 2014) introduce a weighted majority of votes with cascades.
However, these current solutions only work under the closed set assumption, where each class is known at training time.
The cold start problem arises when a new member has relevant knowledge of a new
class that the remaining members lack. A majority of votes from the remaining members can
outvote the knowledgeable voter and result in incorrect predictions, until classifiers with
relevant prior knowledge of the new class become sufficiently numerous. We investigate the
combination approach further in Chapter IV while tackling a large scale problem. In particular, our
goal is to propose a new weighted majority vote that is able to work in open set scenarios. To
achieve this goal, the proposed combination method should be able to deal with the cold start
problem, in which a knowledgeable minority is outvoted by an uninformed majority. A sketch of the classic weighted majority update follows.
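The core of the (Littlestone and Warmuth, 1989) scheme is a multiplicative update: every expert that errs has its weight multiplied by a factor β in (0, 1). The sketch below is a simplified binary-label version — not the cascade of (Zamani et al., 2014), and not our proposed method:

```python
import numpy as np

def weighted_majority(predictions, y, beta=0.5):
    """Multiply the weight of every expert that errs by beta; predict 0/1 by
    weighted vote. predictions has shape (n_experts, n_rounds)."""
    w = np.ones(predictions.shape[0])
    guesses = []
    for t in range(predictions.shape[1]):
        votes = predictions[:, t]
        guesses.append(int(w[votes == 1].sum() >= w[votes == 0].sum()))
        w[votes != y[t]] *= beta          # penalize the experts that were wrong
    return np.array(guesses), w

preds = np.array([[1, 1, 0, 1],
                  [0, 1, 1, 1],
                  [1, 0, 1, 0]])
truth = np.array([1, 1, 1, 1])
print(weighted_majority(preds, truth))    # repeatedly wrong experts lose weight
```

The cold start problem is visible in this scheme: a single expert that alone recognizes a new class starts with no extra weight, so it is outvoted by the rest until their weights decay through repeated mistakes on that class.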
6 Summary of work
We investigate issues that arise when data are not available at once or arrive continuously.
This often results in large data sets. We explore the incremental learning approach with
homogeneous models. Inspired by the recent introduction of the Mondrian process (Roy
and Teh, 2009) and Mondrian Forests (Lakshminarayanan et al., 2014), we explore the use of the Mondrian process for incremental ensemble learning. Studying Mondrian Forests (Lakshminarayanan et al.,
2014), we observe that individual sampling of features degrades the performance, particularly
when the selection of informative features is not guaranteed. Mondrian Forests also suffer
from a critical drawback: extensive memory is required, because they maintain all training
samples in memory to compute a split in each node. This becomes an acute problem
when the training set reaches about 500,000 samples. Our goal is to build a new model that applies to
sentiment analysis on review datasets, which evolve over time. This has not been
attempted before, despite the fact that reviews are often generated as a time series. Using
a stationary approach for the analysis of streaming data loses a lot of information; for a
business, information arriving along the time line can make or break the business. One challenge of our study is that the current
version of Mondrian Forests is unable to work with sparse or high-dimensional data. These
limitations require a redesign, although we keep the main idea of a sampling approach using the Mondrian
process.
Problem 2: Learning with unknown classes. Traditional machine learning works well with prior knowledge of all classes at training time. In the real world,
examples of unknown classes are commonly seen during testing. This scenario is likely to occur
with continuous data, due to the unavailability of all training data at once. Much of the significant
work on open set problems has been done by a group of researchers at the VAST Lab,
UCCS (Scheirer et al., 2013a), (Jain et al., 2014b), (Bendale and Boult, 2015a). Their work
has inspired our curiosity in this new idea in data mining. In our research, we found that
the concept of an open set of classes had been used in earlier literature, such as (Li et al.,
2006) in stylometry, (Sommer and Paxson, 2010) in network intrusion detection, and (Koppel
et al., 2011) in authorship attribution. For example, (Stolerman et al., 2011) demonstrate
their solution to tackle the problem of determining an unknown author. The fundamental
problem in forensic stylometry relies on linguistic analysis to determine the genuine author
of an unknown text. (Sommer and Paxson, 2010) and (Stolerman et al., 2011) depend on
an approximate solution that computes the probability of new classes to solve this challenge.
Open set recognition and open world problems (Scheirer et al., 2013a), (Jain et al., 2014b),
(Bendale and Boult, 2015a) have gained new attention among data mining practitioners
after these papers’ publication. In this work, we improve the online version introduced
by (Bendale and Boult, 2015a) and successfully apply it to text classification
problems.
Problem 3: Ensemble learning for large scale problems. Large scale problems are
among the most challenging for data mining tasks, especially with recent advanced technologies
for collecting data in many fields such as communication, health care, or banking. Much of the
attention in the data mining community focuses on distributed approaches (Lin et al., 2014).
While a top-down approach to problem solving is very popular, ensemble
learning can use this approach to tackle the large scale problem in a two-step process.
First, the decomposition of a dataset into subsets of data can be performed with random
sampling of features and/or random sampling of examples; a solution for each subset
of data can then be obtained with the best selected algorithm (an expert for the specific subset).
Second, the solutions for the set of sub-problems are aggregated and integrated into the final
solution. Existing ensemble learning of this kind (Ho, 1998) is limited to a single type of base
classifier (tree-based classifiers), rather than the different base classifiers used by recent top performers.
We tackle the large dataset problem via two related sub-problems: finding a ranked list
of algorithms for a given dataset (a subset of the data space), and finding the best combination
and weight assignment for each algorithm serving as an ensemble member. Published papers
from winning data mining competitions focus on the model architecture rather than the
logic behind each ensemble model; trial and error is the main practice of many data mining
practitioners. We tackle this challenge in two studies: algorithm selection, and the development of a new combination method.
A common challenge of ensemble learning is to choose its members. Choosing the base
classifiers by hand is difficult, so we study how algorithms can be chosen for ensemble learning based on a meta-learning framework.
Given an unknown dataset, our approach generates a ranked list of algorithms, of which the
top k (k is a parameter indicating the number of algorithms in the final ensemble model)
are selected as ensemble members. However, a selection of best performers (ensemble members) may not guarantee the best prediction of an
ensemble model. A naive method is to use a brute-force approach to search through the entire
space of member combinations. The simplest way to obtain a decision is a majority of the votes, which was actually used in the early development
of ensemble models. When classifiers are distinct, each classifier may provide different
predicted results for a given set of data examples. A Weighted Majority vote that assigns
different weights to ensemble members is a better method. In our work, we suggest a new combination method along these lines.
The era of big data has emerged in many fields, with unprecedented growth of data at
a magnitude of millions or more records per day. Big data are testing the limitations
of data mining algorithms as well as computer resources. When big data are
generated as a result of streaming, new patterns may evolve as time elapses. The hidden
patterns may represent a new concept, referred to as a concept drift by (Widmer and Kubat,
1996), requiring the system to adapt without interfering with the working model.
The approaches in data mining can be classified into two types: static and dynamic. A
static method obtains a solution with a single data mining algorithm, commonly known as
batch learning. Distributed platforms perform parallel processing of data mining tasks, e.g., the Hadoop distributed system. Recently,
the Spark distributed system has been used effectively to implement parallel functions.
Incremental learning is designed as a solution for continuous data, but the approach works for non-streaming data too. This method
addresses the limitation of computer resources, especially memory. Commonly, data
mining algorithms are required to load and store an entire training dataset, which may not
fit in memory at once. A solution is to break the data into chunks and read them in sequence. Streaming
data shares a similar characteristic, in that data examples arrive over time and need
to be processed in sequence. To deal with this problem, incremental learning starts at the
time of incoming data and applies its adjustments to the next prediction task. As a result,
incremental learning can be studied as a means to deal with the large data problem, where the entire dataset cannot be loaded at once.
Ensemble learning, on the other hand, can integrate the main concept of incremental learning
into its own design. Incremental learning can be implemented with a single classifier of an
ensemble model (e.g., an incremental Decision tree), with an ensemble model built from a single type of
base classifier (e.g., incremental Random Forests), or with an ensemble model built from different base
classifiers. For the sake of simplicity, we use the term incremental ensemble learning to refer
to an ensemble model using an incremental approach, while the term ensemble learning is used otherwise.
1. Incremental ensemble model: In Chapter II, we address the challenges that traditional
machine learning algorithms face, including the unavailability of an entire large dataset
at one time due to the high cost of collection, the nature of the data (examples introduced
in incremental order), and the problem of loading the entire dataset for training. A
feasible solution is building a model on the currently available data and updating it with
new training examples as they become available. While Random Forests are
considered a feasible solution for incremental learning, their drawback is the inability
to revise the split at a current node, making it hard to obtain efficient
performance. Our proposed model tackles this problem and outperforms other tree-based incremental models.
2. Tackling unknown classes: In Chapter III, we show that streaming data can present an
open set problem by its basic nature. We review several previous works in computer vision, their
solutions and their concerns. We present a new solution to the open set problem, along
with the challenge of sparse data. We indicate the problem of traditional tree-based
ensembles, which fail to extend to deal with the open set problem, and propose our own solution.
3. Large scale data with ensemble learning: In Chapter IV, we address the remaining
challenge: a single model is not a good choice for streaming data. Instead, we use the divide and conquer approach,
decomposing data into sub-datasets, building models by training on each region,
and aggregating all predictions into a final result. This is a challenge because there
is no guarantee that a testing example belongs to the same region used in training.
We demonstrate that ensemble learning has the ability to correct this error. The
contribution is that we can use the outcome of the previous study and perform the
final combination accordingly.
In the following chapters, the above mentioned problems and proposed solutions are
discussed in detail.¹
¹ Each chapter in this thesis is presented as a “stand-alone” technique. The symbols and
notations are specific to each chapter, unless specified otherwise.
6.2 Publications
tional Conference on Machine Learning and Applications (ICMLA), Dec 2016, Anaheim,
CA (Chapter II)
2. T. Doan, J. Kalita “Breaking a challenge for text classification in open world recogni-
tion” IEEE Computing and Communication Workshop and Conference (CCWC), Jan
IEEE International Conference on Data Mining (ICDM), Nov 2015, Atlantic City, NJ
4. T. Doan, J. Kalita “Algorithm Selection using performance and run time behavior” In-
CHAPTER II

Data generated at high speed in real time from network traffic monitors, log records from
click streams, and news feeds result in an increased demand for scalable algorithms that can
handle a massive number of data points. Despite the fact that some current state-of-the-art
computational systems can meet this high demand, they are not practical for most in the data
mining community, given the costs posed by large amounts of data. This thesis focuses on providing
a suitable tool to discover in-depth knowledge in application domains under the constraint
of limited memory. Incremental learning is closely related to online learning,
where the main difference is the ability to phase out old knowledge to reserve space for
the new.
the new. While both approaches give a system the ability to update for incoming training
data, online learning requires extra resources to deal with dynamic change in contents
known as concept drift. One state-of-the-art method that is the first candidate in this thesis
is incremental learning using the Support Vector Machines (SVM). The goal is to adapt
learning for new concepts without sacrificing the performance, because retraining the
An alternative is allocating space for learning new concepts in an additive manner. In ensemble learning,
each batch of data (referred to as a sample of examples) can be used to construct a new
ensemble member or classifier. Ensemble models train several members, where each member is
constructed with a subset of the data space. Since each classifier in ensemble learning is
constructed using only data examples from an individual batch, the strategy for sampling data is
critical. Such an incremental ensemble model also supports traditional data mining algorithms, which load the entire training data of a batch.
This feature makes an incremental ensemble model advantageous in real world applications.
In practice, finite memory resources hinder the use of many data mining algorithms, due
to the fact that many of them require all the training data to reside in memory, which
is not conducive to incremental computation. There are two types of incremental approaches:
batch and instance. Batch incremental ensemble learning (or simply ensemble learning)
requires the specification of a batch size, determined based on the total available memory in the
system. In instance incremental ensemble learning, each example arrives and is processed one at a time.
With incremental learning, a new example may cause the underlying distribution of
the data to change. As a result, the predictions may suddenly become less accurate. Typically,
two approaches are used to improve the accuracy of incremental learning: 1) sampling data
such that we guarantee that the distribution of the data is preserved, and 2) using a sliding
window method to deal with concept drift. Inspired by a new method of sampling data
(Roy and Teh, 2009), this dissertation investigates a novel design of incremental batch
learning based on KD structures. Herein, the number of ensemble members is established
a priori and training data arrive sequentially, requiring adjustment of the ensemble members. To
deal with larger datasets, we also propose a strategy that maintains and updates the set of
ensemble classifier members, so that we can update the old model by dropping an ensemble member.
Limitations of current ensemble learning with Random Forests include the need for
a large number of Decision trees to achieve good accuracy (Denil et al., 2013), and that
retraining is slow with batch learning. Finding an efficient way to deal with these issues is
an interesting challenge.
1 Related work
Incremental learning is inspired by the need to use data mining algorithms on stream data,
where training data instances arrive along a timeline. An early model using an incremental
Decision tree recomputes the best feasible split after each incoming data instance arrives. The downside of this approach is that
it may produce an unstable tree in some rare cases, when the splitting feature
changes repeatedly as a result of incoming data. Furthermore, a single Decision tree is
known to be outperformed by a forest of Decision trees (an ensemble model) that uses
randomization, as in Random Forests (Breiman, 2001), where a random selection of features is used for splits. In the opposite direction,
Decision Forests proposed by (Criminisi et al., 2012) take an entire set of features as well
as combinations into consideration for each split. With training data arriving on a timeline,
incoming data cannot be used to correct a previous split once a decision at a particular node
has been made. To overcome this problem, (Domingos and Hulten, 2000) keep
several splitting candidates in every leaf and propose a method, known as the Hoeffding bound,
to estimate the probability of a good split. The use of this splitting method produces Hoeffding
Trees, which are reported to have better performance (Bifet et al., 2009).
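The Hoeffding bound states that, with probability 1 − δ, the true mean of a variable with range R lies within ε = √(R² ln(1/δ) / (2n)) of the mean observed over n samples; a Hoeffding tree commits to a split once the observed gap between the two best split candidates exceeds ε. A sketch (the numbers are illustrative):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that, with probability 1 - delta, the true mean of n
    observations spanning value_range lies within epsilon of the sample mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Commit to a split once the observed gain gap between the two best split
# candidates exceeds epsilon.
best, second_best, n = 0.42, 0.37, 3000
eps = hoeffding_bound(value_range=1.0, delta=1e-6, n=n)
print(eps, best - second_best > eps)   # True: enough evidence to split now
```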
Because incremental learning focuses on continuous data (or data stream), it has to
be able to deal with the problem of concept drift. For example, the incremental
ensemble learning proposed by (Günter and Bunke, 2002) constructs one classifier at a time from
distinct feature subsets or overlapping features between these subsets. A new classifier is then able
to learn a new concept from the most recent data. Incremental learning addresses a principal
weakness of many traditional machine learning algorithms, which are unable to learn from an
incoming data example without restarting the entire learning process. One way is to focus
on the ability to learn additional knowledge in an incremental manner, rather than on dealing
with limited memory resources. For example, the proposed method in (Yang et al., 2009)
maintains old data so that it remains accessible while providing the ability to learn on the fly.
An efficient incremental learner tries to save space for new incoming data.
Decision trees were among the first algorithms adapted to incremental approaches.
One example of a successful model is the Very Fast Decision Tree (VFDT) (Domingos and
Hulten, 2000), which implements the Hoeffding inequality bound to ensure sufficient data at
splitting nodes. Additionally, incremental Bayesian models support dynamic updates for the model.
A non-tree single learner such as the online Naive Bayes classifier proposed by (John and
Langley, 1995) uses updated counters to incrementally compute the probability of the class
the new example belongs to. A recently proposed incremental Naive Bayes (iNB)
algorithm (Ren et al., 2014) computes the posterior probability of 1) test examples in each
class and 2) the class conditional probability after an incoming data example arrives. Next, it
uses this information to adjust the degree of error between the classification prediction and the
observation.
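The following sketch shows counter-based online Naive Bayes in the spirit of (John and Langley, 1995); it is our simplified illustration with add-one smoothing, not the iNB algorithm of (Ren et al., 2014):

```python
from collections import Counter, defaultdict
import math

class OnlineNaiveBayes:
    """Counter-based Naive Bayes for categorical features: each example updates
    a handful of counters, so no training data has to stay in memory."""
    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (class, feature index) -> values
        self.n = 0

    def learn_one(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feature_counts[(y, i)][v] += 1

    def predict_one(self, x):
        def log_posterior(c):
            lp = math.log(self.class_counts[c] / self.n)
            for i, v in enumerate(x):
                counts = self.feature_counts[(c, i)]
                lp += math.log((counts[v] + 1) / (sum(counts.values()) + 1))  # add-one
            return lp
        return max(self.class_counts, key=log_posterior)

nb = OnlineNaiveBayes()
for x, y in [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"),
             (("sunny", "mild"), "no")]:
    nb.learn_one(x, y)                        # counters are updated in place
print(nb.predict_one(("sunny", "hot")))       # "no"
```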
Stochastic Gradient Descent (SGD) updates the model’s weights one sample at a time by optimizing an objective function. Under
a batch incremental setting, SGD has the advantages of low computational expense and high
performance.
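A batch incremental setting maps directly onto scikit-learn’s partial_fit interface; a sketch on synthetic batches (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])
clf = SGDClassifier()                          # a linear model trained by SGD

for _ in range(10):                            # a stream of small batches
    X = rng.normal(size=(32, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)     # classes required on the first call

print(clf.predict(rng.normal(size=(3, 5))))
```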
SVMs have also been used for incremental learning. SVM classifiers rely on two
principles: maximizing the margin and minimizing structural risk. By maximizing the margin of a separating hyperplane, SVMs use support vectors for
classification. SVMs need to look at the whole training set to determine these support
vectors. As a result, an incremental SVM (Laskov et al., 2006) preserves the current support
vectors and adds them to the training set in the next step. However, these support vectors vary
depending on the selected kernel function, and they become less helpful when a concept drift
occurs. Incremental SVMs also have difficulty discarding old concepts, since the support vectors that encode them are retained.
In a different direction of study, (Ristin et al., 2014) use a Nearest Class Mean (NCM)
classifier to construct their proposed incremental model, named Nearest Class Mean Forest,
based on a Random Forest prototype. Their approach is to minimize the cost of updating the statistical
summaries at each leaf node. Here the tree is allowed to grow incrementally using a splitting criterion based on class means.
Online learning, which shares some characteristics with incremental learning, has also
evolved in a similar fashion. Under the framework for online learning called MOA
(Massive Online Analysis), the online Random Forests proposed by (Bifet et al., 2011) sample
the incoming data based on the Poisson distribution. Here, the growing tree and the
thresholds for propagating samples in each direction of a split are random. The statistics of
splits for all random tests are maintained at a node and propagated down to child nodes,
similarly to the Hoeffding Trees approach. Likewise, the model proposed by (Denil et al.,
2013) grows a tree breadth-first and keeps all potential children. In this thesis, our goal
is to design a new incremental learning algorithm that addresses the problem of the high
computational cost of SVMs while retaining comparably high performance. We focus on a
novel sampling technique that identifies the weak performers among ensemble members.
Incremental learning supports two approaches: an instance approach and a batch approach. The batch approach requires a full batch of examples to train, so that it learns from
the most recent data. Our work focuses on ensemble learning, where the ability to integrate a
new model or to eliminate an old one is an advantage of the design. Our approach shares
common ground with the Random Forests model, a popular ensemble method built from randomized Decision trees.
Given a Decision tree, the ability to induct a new data instance into it relies on how a
new split can be processed. Maintaining all relevant examples at each node is the simplest
solution, but costly in terms of memory. Without keeping this information, there is no
guaranteed way to make a new split optimal in response to an incoming data instance due to
the recursive design. With an instance incremental approach, different strategies of splitting can be applied.
Many existing variant models of Random Forests distinguish themselves by the way
the trees are grown. For example, Extremely Randomized Trees (Geurts et al., 2006) use the
entire set of data examples, and decision boundaries are picked at random instead of by the best feature as in
Random Forests. Rotation Forests (Rodriguez et al., 2006) randomly split the feature set into subsets
and apply Principal Component Analysis (PCA); the Rotation Forest is then
built as an ensemble over these new features. Conversely, Gradient-Boosted Trees (GBTs)
(Friedman, 2002) train one tree at a time, where each new tree helps correct the errors made by
previously trained trees. With each tree added, the model becomes more expressive.
Several distributed platforms support learning at scale, including Dato (formerly known as GraphLab) (Low et al., 2014) and Apache Spark
(Zaharia et al., 2012). These platforms integrate the advantages of a distributed system that
reduces the number of high-cost input/output operations (Dean and Ghemawat, 2008).
In our work, we are attracted to the concept of the Mondrian process, based on which a
classification algorithm called online Mondrian Forest (Lakshminarayanan et al., 2014) has
been developed. Mondrian Forest uses a new way of sampling data that guarantees an invariant
distribution; readers with further interest are referred to (Roy and Teh, 2009). Inspired by
this work, this thesis proposes a novel algorithm that overcomes the limitations of the online
Mondrian Forest in more efficient ways. Our proposed work differs from Mondrian Forest in that it does not retain all training points over
time and can be used in batch mode. We use labels to guide the split of a node, whereas Mondrian
Forest picks the feature to split a node independently of the data’s labels. We implement stratified
sampling of features to include informative features in each sub-region, instead of random
sampling. Our hypothesis is that this will reduce the risk of constructing a weak model that
affects the performance. Our proposed algorithm makes two main contributions, as follows.
• We propose an incremental learning approach that trains with higher accuracy than
other state-of-the-art incremental methods.
• We demonstrate that our solution generates comparable results with offline models at
each incremental size.
Figure 2.1: New split outside the current boundary. Adapted from (Lakshminarayanan et al., 2014)
The Mondrian process (Roy and Teh, 2009) was introduced as a new class of distributions
that supports random partitions of a data space that are not limited to regular grids. The Mondrian
process uses recursive axis-aligned cuts to partition the data space hierarchically (see Figure 2.3),
akin to a Decision tree. This feature gives Mondrian processes the ability to guarantee an
invariant distribution. In Figure 2.1, a Mondrian process starts when a new data point is introduced. Instead
of introducing a new cutting plane over the whole data space, the Mondrian process makes the
cut on a region determined by existing boundaries. As the new data point enters, the
boundary is redrawn (see Figure 2.2), and a similar process is repeated. Notably, a cut point
is selected uniformly at random over the intervals in the recursion where the split occurs.
Figure 2.2: New split inside the boundary. Adapted from (Lakshminarayanan et al., 2014)
Figure 2.3: Random Forests vs. Mondrian Forests. Adapted from (Lakshminarayanan et al., 2014)
The main difference between a traditional Decision tree and a Mondrian tree is the splitting
method (refer to Figure 2.3). Looking at Figure 2.3, we see that a cut in the Decision tree is
illustrated by a line extending through the entire data space, while the cut in the Mondrian tree
is computed from the range of the training data at the time of the split. In other words, the cut
is restricted to the known region defined by the current data points, so long as the new split is
consistent with the current tree.
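A sketch of this restricted-cut mechanic, simplified from (Lakshminarayanan et al., 2014) — we omit the exponential split-time variable and the budget λ: the split dimension is chosen with probability proportional to the side length of the node’s bounding box, and the cut location is uniform within that side:

```python
import numpy as np

def sample_mondrian_cut(X, rng):
    """Sample one Mondrian-style cut restricted to the bounding box of the data
    at a node: the dimension is chosen with probability proportional to its side
    length, then the cut location is drawn uniformly inside that side."""
    lower, upper = X.min(axis=0), X.max(axis=0)
    lengths = upper - lower
    dim = rng.choice(len(lengths), p=lengths / lengths.sum())
    return dim, rng.uniform(lower[dim], upper[dim])

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(20, 2)) * np.array([4.0, 1.0])  # a wide, flat box
print(sample_mondrian_cut(X, rng))   # the wide dimension 0 is cut most often
```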
Similar to Random Forests, Mondrian Forests are able to deal with overfitting. In a Mondrian
process (Roy and Teh, 2009), a distribution is chosen by re-sampling, and it guarantees a
distribution similar to the one before. Here, we let MT(λ, D) denote the distribution of a Mondrian
process. Additionally, we can choose a probability distribution MT′ that includes the new
example such that the new distribution is equal to the distribution MT
before including the new example. Originally, λ denotes a budget (Roy and Teh, 2009); we keep
the aforementioned notation so that we can compare our potential contributions.
Further details on the splitting mechanism of the Mondrian process are discussed in
(Roy and Teh, 2009). Because it guarantees similar distributions in incremental learning,
this enables a Mondrian tree to achieve results similar to the trees created using
batch learning.
3 Proposed approach
Our proposed approach builds on the work of (Lakshminarayanan et al., 2014). Each Decision tree is independently built
from randomized sampling of examples, similar to (Breiman, 2001), and is updated or used
for prediction. The final prediction for an unseen example is generated by averaging the
predictions from all trees. Intuitively, each tree is represented by a KD-tree
structure (refer to Figure 2.4), where a node’s ranges are bounded by an axis-aligned box.
Similar to the pre-pruning of Random Forests (Breiman, 2001), we use two methods to control
the split: either i) a predefined minimum number of data points, or ii) a required minimum depth.
To force the phasing out of an old concept, we select a candidate tree with low prediction
performance over a time period (equivalent to half of the predefined number of trees).
We also introduce a weight factor for each data point to determine which particular data
point will potentially be chosen for removal. Each data point is represented by a tuple
<index, count>, where index denotes an instance’s order of arrival and count denotes
the number of data examples arriving after it. Initially, we set the count to zero as the first
data point arrives. Herein, it increases at each incoming instance or decreases after removing
an instance. For the ith data instance, its weight is defined as Wi = index/(1 + count).
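A sketch of this bookkeeping (our own illustration of the <index, count> tuple and the weight W_i = index/(1 + count)):

```python
class ArrivalTracker:
    """Maintain <index, count> per instance; W_i = index / (1 + count).
    Old instances (small index, large count) end up with the smallest weights."""
    def __init__(self):
        self.counts = {}        # instance index -> number of later arrivals
        self.next_index = 1

    def arrive(self):
        for idx in self.counts:            # every stored instance ages by one
            self.counts[idx] += 1
        self.counts[self.next_index] = 0   # the new instance has count 0
        self.next_index += 1

    def weight(self, idx):
        return idx / (1 + self.counts[idx])

    def removal_candidate(self):
        return min(self.counts, key=self.weight)

t = ArrivalTracker()
for _ in range(5):
    t.arrive()
print([round(t.weight(i), 2) for i in sorted(t.counts)])  # [0.2, 0.5, 1.0, 2.0, 5.0]
print(t.removal_candidate())                              # 1: the oldest instance
```

With this definition, old instances accumulate large counts and small weights, so the lowest-weight instance — the oldest — is the natural removal candidate, matching the removal policy described next.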
To incorporate a newly arriving example d, we obtain its count and update the summary
at each node (refer to line 2 in Algorithm 1) as d traverses the tree. Two
conditions are required for a split: sufficient data points according to the Hoeffding bound, and the
depth threshold. When both hold (line 6), the tree simply grows (line 7). Otherwise, the process
rebuilds the tree, obtained using samples from all current data instances (line 6). Building an entire tree is a
result of low performance during testing. If there is no testing, the oldest data instances
are the candidates for removal. A new tree is built with half the examples from recently
seen examples, while the remaining data points are obtained by sampling the entire set of data
points.
In the newly designed incremental learning algorithm, we apply the Mondrian process
for sampling data. The idea is illustrated as follows. Utilizing the statistical summaries
of internal nodes, we suggest a simpler scoring system, in which a new data point goes
down the tree until it falls into the boundary of a specific label, with its class
probabilities computed. When this region has no label, the probability of the nearest ancestor can be
used. In fact, we also have the option to use the tree depth threshold to control how deep
the tree can grow, by limiting the total number of splits it is allowed to make.
The conditioning test at line 5 compares the minimum number of required data points
with the approximate number of data points determined by a Hoeffding approach for
confidence of splitting (De Rosa, 2016). This approximate strategy has worked well in
(Lakshminarayanan et al., 2014) and gives the Mondrian process (Roy and Teh, 2009) an
advantage in processing new data examples. This unique feature of the Mondrian process
distinguishes it from other tree-based incremental learners. By tracking the occurrence
count for each example, we have the ability to select a potential candidate for removal
by its lowest weight, to save space for incoming data. Furthermore, we adapt a variant of
the sliding window approach ADWIN (ADaptive WINdowing) proposed by (Bifet et al.,
2009) to detect changes in the error performance of each Decision tree. As shown in Table 2.1,
this method works more effectively with our design than the estimated average error of all trees.
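As a minimal sketch of such a Hoeffding-style split test, assume observations bounded in a known range and a user-chosen failure probability; the function names and the numeric example are illustrative, not taken from our implementation.

import math

def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    # With probability 1 - delta, the true mean lies within epsilon of the
    # sample mean after n observations bounded in an interval of this range.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def confident_to_split(best_gain: float, second_gain: float, n: int,
                       delta: float = 1e-5, value_range: float = 1.0) -> bool:
    # Split only when the observed gain margin exceeds the bound, i.e., we
    # are confident the best split attribute truly is the best.
    return (best_gain - second_gain) > hoeffding_epsilon(value_range, delta, n)

print(confident_to_split(0.30, 0.21, n=200))    # False: too few examples
print(confident_to_split(0.30, 0.21, n=2000))   # True: margin is now reliable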
In addition, if the current boundary at a splitting node does not include the new data
instance, we either extend the boundary to cover the arriving data instance or replace the
node with a new parent, including a sibling to keep any data it does not cover. For details of
sampling with the Mondrian process, refer to (Lakshminarayanan et al., 2014). In summary,
our design has three distinguishing features:
• First, we do not keep any training data at each node; instead, we maintain class
frequencies and other summaries.
• Second, when a new data instance traverses down the tree, it initializes the update
process for the count (the larger the value, the older the concept) of each instance (data
point). The count is used to compute each instance's weight factor, which is used to
select the most likely candidate when dropping data instances. The other dropping case
involves low performance of a particular Decision tree.
• Third, we implement the Hoeffding bound to determine when there are sufficient data
examples for splitting a node during incremental learning, and we use a sliding window to
monitor each tree's performance.
We evaluate our proposed method to perform sentiment analysis with three datasets:
restaurant reviews from the Yelp dataset, product reviews from the Amazon dataset, and
movie reviews from the IMDB dataset. For these text review datasets, we first explore
the representation of word features. These pre-experiments use four models: the Online
Random Forests (ORF), our proposed model (iRF), Hoeffding Trees, and the incremental
Naive Bayes (iNB). Each tree based model has 100 trees. These algorithms are selected
because they are incremental learners, which usually are not good at text mining problems.
For a review dataset, we generate features using statistical summaries and a bag of words
method. A bag of words approach treats each word as a feature. However, many words are
rarely used compared to others, which results in many zeros in the training data. This
sparsity problem makes learning difficult, and a simple bag of
words may not capture the correct meaning of the text. To address the shortcomings of a
simple bag of words, we first explore a phrase-based sentiment analysis approach (Socher
et al., 2013). However, our experimental results using this method, where we represent each
sentence in the review with an average word vector and train with Random Forests, are not
satisfactory. As a result, we explore two advanced techniques, i.e., Word2Vec (Mikolov et al., 2013)
and FOFE (Fixed-size Ordinally Forgetting Encoding) (Zhang et al., 2015a), for representing
sentences. These two techniques allow us to transform a review text into a uniform format
upon which hybrid features are built. These new features are constructed with the following
combinations of settings: i) Bag of Words on FOFE (referred to simply as FOFE), and ii) Bag of Words
with Word2Vec. We compare the accuracy of the two settings in order to select the method,
besides the statistical summary features, used to generate our features.
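As a minimal sketch of the FOFE idea (Zhang et al., 2015a), a sentence over a vocabulary of size V is encoded into a fixed-size vector by z_t = α·z_{t−1} + e_t, where e_t is the one-hot vector of the t-th word and 0 < α < 1 is the forgetting factor; the toy vocabulary and α below are illustrative assumptions.

import numpy as np

def fofe_encode(word_ids, vocab_size, alpha=0.7):
    z = np.zeros(vocab_size)
    for w in word_ids:
        z *= alpha     # geometrically forget earlier words
        z[w] += 1.0    # add the one-hot vector of the current word
    return z

vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
sentence = [vocab[w] for w in ["the", "movie", "was", "great"]]
print(fofe_encode(sentence, len(vocab)))   # earlier words carry smaller mass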
Reviews are split into sentences to generate vectors. The FOFE method is applied
without using pretrained feature vectors. We do not use the Google pretrained set for a fair
comparison. Table 2.1, Table 2.2 and Table 2.3 show the accuracies of the selected algorithms.
We observe that the Word2Vec method provides consistent results, accounting for more
than half of the best results. Word2Vec produces better results than FOFE does with the
corresponding algorithms. Based on these results, we prefer to use the Word2Vec method
for the representation of the text datasets in our next evaluation task. Figure 2.5 illustrates the
additional statistical features, which include:
• Popularity: A review may have a number of up votes (useful votes) and a number
of down votes (funny votes). Popularity is defined as the sum of up votes and down
votes.
• Confidence-adjusted time: the ratio of the number of reviews posted to the time elapsed
since the reviewer registered (in months).
4 Experimental setup
We compare our proposed model with a number of state-of-the-art algorithms. These include Online Random Forests (ORF-Safari) (Saffari et al., 2009), Extremely
Randomized Trees (ERT-1, ERT-k) (Geurts et al., 2006), and Random Forests (Breiman
RF*) (Breiman, 2001). Our evaluation includes other incremental algorithms, in particular
incremental Naive Bayes (Ren et al., 2014) and Stochastic Gradient Descent (Bottou, 2010).
Stochastic Gradient Descent (SGD) is selected because it has worked well for sparse data.
We report the result of the Mondrian Online Random Forest model only on numeric
datasets because it is unable to work on the sparse data that arises when many words rarely
occur in a majority of reviews. The Online Mondrian Random Forest also has difficulty with
numeric data (e.g., the Mnist and Poker datasets). Non-incremental learners are retrained for
each incremental size, introducing a biased comparison with incremental learners because they
are allowed to retrain on all currently available data from scratch. To set up a fair comparison, we
do not remove the current training examples while allowing incoming data to be read in
sequence by the incremental learning methods. Training sizes range from 10% to 90% in 10%
incremental steps.
We use three review datasets (Yelp, IMDB and Amazon) as well as three
benchmark datasets (Mnist, Poker and Cover Type). In addition, our final feature set
for the review datasets includes statistical features and Word2Vec-based features.
We conduct experiments that focus on ensembles of tree-based models (refer to Table 2.4)
but also include experiments with other incremental learning models, such as incremental
Naive Bayes and SGD.
In the first set of experiments in this thesis, we compare our proposed model with other
ensembles of tree-based models such as Hoeffding Trees (HT), Extra Trees (ERT-k, ERT-1)
and Online Random Forests (ORF). Table 2.4 indicates that our proposed model significantly
outperforms all the tree-based models as well as incremental Naive Bayes (iNB). To verify this
claim, we next compute a one-way ANOVA test at the 95% confidence level.
The F = 3.91732939 and p-value = 0.01327529 indicate that we can reject the null
hypothesis. We further report the result of the paired t-test in Table 2.5 to conclude that
our model is better than the tree-based incremental learning models. Figure 2.6 illustrates the
incremental learning models used in the experiments. The Cover Type dataset seems to be the
most challenging: half of the algorithms produce very low performance at the initial
10% training size, while the others reach around 50% to 60%, similar to a random guess at
the same proportion of training size. The Online Random Forest model yields strong results
on this dataset.
For non-tree-based classifiers, incremental Naive Bayes shows just better than average
performance on both numeric and text data. A significant improvement is obtained by
Naive Bayes on the Yelp dataset. Stochastic Gradient Descent (SGD) outperforms our model
on the Amazon and Yelp reviews. However, our model shows consistent performance on the
remaining datasets. Hoeffding Trees and Online Random Forests are comparable, but increasing the number of
trees in Online Random Forests significantly improves its accuracy. Next, we present our statistical tests.
We use a two-way repeated-measures ANOVA test to validate our comparison results. The two-way
ANOVA is used with the following two factors: algorithms and different training sizes. The
algorithms include Hoeffding Trees, Extra Trees (ERT-k with k=100), Extra Trees with a single
tree member, Online Random Forests, incremental Naive Bayes, Online Mondrian Forest
and our model (incremental Random Forests). Our hypotheses are as follows:
• H01 : All classifiers are the same, i.e., all classifiers perform similarly.
• H02 : Different training sizes do not impact the performance.
The result of the ANOVA test is a p-value of 0.00362303 for the first hypothesis H01 and
1e-4 for the second hypothesis H02, at the 95% confidence level. As a result, we reject both null
hypotheses (refer to Table 2.5). From these results, we conclude that our proposed model is significantly better than the other
incremental models in the context of numeric data. We conclude that increasing the number
of trees can indeed improve the performance of the model. We observe that a low number
of trees (e.g., n=50) can severely affect the performance of our algorithm. To improve
performance on text data, hashing methods such as MurmurHash (Couceiro et al., 2009) are a
possible direction.
When data consist of many low-informative features, our model's performance deteriorates,
requiring feature selection and feature generation. However, the concepts of strongly
informative and low-informative features need to be investigated further. A related idea is feature
dropping, which has been successfully applied in neural networks (Hinton et al., 2012),
allowing a reduction in training time. We believe that such techniques become important when
the number of features is large. In summary, our approach has the following properties:
2. Our approach allows optimal tree learning on the fly thanks to the properties of the Mondrian
process.
3. Our approach produces accuracy nearly identical to that of trees built by conventional
batch learners, given enough examples.
In conclusion, the model introduced in this chapter can handle sparse datasets, which often arise in text mining tasks.
A problem involving a large amount of data may encounter an unseen class issue when training
examples for some classes are not available at training time. The appearance of unknown
objects hampers classification under the traditional closed set assumption. Unknown classes
may be in the minority, but they may be in the majority as well. As a result, unbalanced
classes in an open set cannot be handled with common techniques such as stratified sampling.
The open set problem was first introduced in computational linguistics for authorship
identification (Diederich et al., 2003), often also called authorship attribution. In
this problem, a list of authors and samples of their styles, presented in the form of text
articles, is known, and the actual author of an unknown text must be determined. It may be the
case that the real author is not among them. Schaalje et al. (Schaalje et al., 2011), (Schaalje
and Fields, 2012) use the term open set to introduce the concept of openness. A statistical
analysis of literary style, known as stylometry, is the main technique used to determine who the
actual author is.
Researchers in the computer vision domain (Scheirer et al., 2013a), (Bendale and Boult,
2015b) have presented mathematically intensive work that lays out a breakthrough framework
to deal with the open set problem in the field of computer vision. By exposing the
weaknesses of current data mining techniques, which may produce unreliable solutions
in real world applications, they establish a new framework for data mining algorithm
development. (Scheirer et al., 2013a) quantify the openness of the data space as follows:
openness = 1 − sqrt((2 × trainingclasses) / (testingclasses + targetclasses))   (III.1)
where trainingclasses denotes the number of classes in training, testingclasses denotes the
number of classes in testing, and targetclasses denotes the number of target classes.
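A direct transcription of Equation III.1 makes the openness measure concrete; the class counts in the example are illustrative.

import math

def openness(training_classes: int, testing_classes: int, target_classes: int) -> float:
    return 1.0 - math.sqrt(
        (2.0 * training_classes) / (testing_classes + target_classes)
    )

print(openness(10, 10, 10))              # 0.0: a fully closed problem
print(round(openness(8, 10, 10), 3))     # 0.106: train on 8 of 10 classes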
In open set recognition, the risk of the unknown can be estimated by an empirical risk
function on the training data without sufficient knowledge. Its objective is to minimize what
is defined as the total recognition error (Scheirer et al., 2013a). Studies on open set are
found beyond computer vision (for object detection) in computational linguistics (authorship
attribution) and in biometric verification (Rattani et al., 2015b). Initial work on the open set
problem has focused on an off-line setting where data is considered statically. Our work
on the open set problem has been inspired by the research published in an excellent paper
(Bendale and Boult, 2015b). These authors give new life to the use of data mining on
evolving data, where the closed set assumption holds less. Like anomaly detection, open
world recognition can detect new objects, unknown classes, or simply rare classes, e.g., in
fraud detection (He et al., 2008). The difference is that open world recognition allows the system to
update the classifier with these new classes in its multi-class classification system.
Incremental or online learning poses challenges because of finite system resources and
unavailability of training data in one place. In such a case, incremental ensemble learning
is a natural choice due to its ability to process training data in chunks, and the ability to
update the learned models. The logic behind incremental ensemble learning for the open set
problem is based on the wisdom of the crowd, which uses the combination of knowledge
from many experts to help identify the common uncertainty among them.
Classification is often referred to as the task of discriminating one class from others in a
given set of classes. Traditionally, classifiers work well assuming that prior knowledge
of all classes is given. This scenario is defined as a closed set in the context of classifying
objects from prior knowledge about them. Unfortunately, the presentation of an example from
an unknown class during testing can lead to poor performance of even state-of-the-art
classifiers, because such unseen examples are incorrectly identified as one of the known classes.
Even in traditional batch learning, in which all classes are assumed to be presented at
training time, it is normal to see unknown objects during testing, because we often have
insufficient knowledge of the evolving real world. Incremental learning, particularly online
learning, complicates this problem, since continuous data, known as streaming data, may
introduce unknown objects at any time. In fact, a large dataset is more likely to include
unknown classes, because it is impractical to collect examples of every class under a
1-vs-rest approach at training time (Wilber et al., 2013). Without a label even as
simple as unknown, a classifier suffers from uncertain decisions and poor performance (see
Figure 3.1). This problem is ill-defined for unsupervised and semi-supervised learning,
for two reasons. Unsupervised learning does not take labels into account and often works
well to identify existing groups based on similarity; this is known as clustering. Semi-supervised
learning considers both labeled and unlabeled examples and propagates the known labels to
unlabeled data.
1 Problem definition
The generation and collection of a large volume of data has become the norm in domains
such as computer vision and computational linguistics. A model is built from a set of
training data and is evaluated with a test set, with or without a validation set. What if some
classes are not observed in training but are present at testing? This may be a result of bad
samples, under-sampling, or samples unavailable at the time data is collected for training.
Traditional classifiers learn to identify a class by discriminating it from other classes. With an
unobserved class at training, the performance of such systems is in question because they will
classify unknown examples as one of the known classes. Bad samples can be dealt with using
advanced sampling techniques. However, the best sampling techniques cannot tackle the problem
of an unknown class that is present only at testing time due to the dynamic nature of data. This
problem is a common weakness of the prior knowledge assumption in many data mining
algorithms. Intuitively, an unknown class can appear at any time.
In many knowledge domains, the acquired knowledge of the data is often far from
complete, given the dynamic real world. Implicitly, a multi-class classification model is only
able to distinguish one class from the rest, assuming a full knowledge of classes a priori,
referred to as a closed set of classes. This assumption may not be guaranteed and cannot be
assumed in many real world scenarios, as an unseen class may only be available at the time of
prediction and may be impossible to learn using traditional techniques. Some examples are:
a) a self-driving car cannot learn all classes of objects, a priori; b) an authorship attribution
system may not have any knowledge of most authors while investigating true authorship of
an anonymous text; and c) a biological system may lack knowledge of rare existing species
when investigating an underwater living organism. Human experts may have to thoroughly
check all known species before making any conclusion about the outlier.
The notion of complete knowledge becomes more subjective due to the openness of the
surrounding real world. As a result, a recognition system deployed in the real world under
the static world assumption produces unreliable results, known as risk. This motivates a new
direction of research seen in recent studies built on the open set conception. Studies on open
set recognition, such as forensic authorship attribution (Schaalje and Fields, 2012), image
processing (Costa et al., 2012), fake fingerprint detection (Rattani et al., 2015a), object
detection (De Rosa et al., 2016), and authorship attribution in social media (Rocha et al., 2017),
as well as on open world recognition, such as (Bendale and Boult, 2015b), (De Rosa et al.,
2016), illustrate a growing interest over a short time.
The remainder of this chapter is organized as follows: Section III.2 discusses related
work. Section III.3 illustrates our proposed model. Section III.4 describes the set up of
our experiment to compare our algorithm with other state-of-the-art solutions. We present
results and discussion in Section III.5. Finally, Section III.6 summarizes this chapter.
2 Related work
A recognition system often deals with an unknown scenario, referred to as openness, when
identifying objects of interest. Object detection is a typical scenario where an open set
recognition is naturally described. The object of interest, e.g., a human face, may be among
others, e.g., cars, trees, and electric poles, where the recognition of one class, referred to as the
positive class, is discriminated from the others using specific features. Such a system supports a
rejection mechanism. This type of mechanism is widely used for face detection (Jin et al.,
2004), document classification (Manevitz and Yousef, 2001), and speech processing (Shen et al.).
Intuitively, the 1-vs-1 classifier (Rocha and Goldenstein, 2009) has a goal similar to
open set recognition: to distinguish one class from another by training with one class as positive
and the other as negative. In the same vein, the 1-vs-all classifier treats the examples
of all non-positive classes as negative. Adapting the idea from the work of (Schölkopf et al.,
2001), (Scheirer et al., 2013a) proposed the 1-vs-Set machine and further formalized the notion
of open space risk.
A data mining application, e.g., forensic authorship attribution (Stolerman et al., 2014),
may want a system with the ability to discriminate unknown authors from known authors, i.e.,
to recognize when an unknown author, treated as a class, is not in the set of known authors,
referred to as the known classes. The openness of this system is due to incomplete knowledge
of the real world. In fact, earlier studies of open set recognition were conducted on
authorship attribution by (Schaalje et al., 2011), (Schaalje and Fields, 2012); this branch of
natural language processing determines the authorship of unknown documents.
(Fei and Liu, 2016), in their work on text classification, address the open set problem
with a closed decision boundary. They name their approach center-based similarity space
learning, or CBS learning. The idea of their approach is to transform the original data space
of documents into a space representation of centroid classes. This transformation generates
a new feature space, one for unigrams and one for bigrams, for each text document. A
document in close proximity to one class is assigned to this class, and a new center is
computed. A drawback of this approach is that the center of each class may or may not be
an actual data point that represents a document. As a result, the distance from a document
of interest to the center of the class to which it should actually belong may be longer than
the distance to the class it is assigned to.
The open set concept presented in (Scheirer et al., 2013b) is the first one
that theoretically addresses the closed set assumption in an object recognition system. At
first, a single class model (e.g., a one-class SVM) emerges as a feasible solution for single
class detection (implicitly rejecting the others). However, the one-class SVM relies on a closed set
assumption because it does not take into account the presence of a negative class.
When an unknown class is introduced, a model should be able to detect it and reject it. This
ability allows the system to avoid misclassifying the unknown as a known class. A common
way is to measure a distance from an unknown object to a known object. However, a dataset
may have features that use different measurement units. For example, consider a subset
of the IMDB movie dataset with a gross revenue feature (in hundreds of millions of dollars) and a
number-of-reviews feature given as a non-negative integer. The gross feature measures the revenue
before tax of a particular movie while the number of reviews indicates the count of reviews
for this movie. Here are our rules to detect an outlier by thresholding:
• the gross revenue is greater than 187 or less than 150 (in hundreds of millions of dollars), and
In Figure 3.2, the left plot illustrates how outliers can be obtained by using thresholds.
However, three more outliers remain (bold red circles). The reason these red circles are
considered outliers is that they are not part of the remaining data points. We see that a movie
with a gross revenue of $176 hundred million but only 42 reviews, and another with a gross
revenue of $185 hundred million but only 45 reviews, are examples of abnormal data points.
Ignoring these outliers will affect accuracy in both classification and regression tasks.
However, thresholds over features with different units may have difficulty detecting these
outliers. This is a downside of simple thresholds, which do not take into account the fact
that the variance in each direction is different. This problem poses a big challenge when we
measure the distance from a data example to the center of a class to determine whether or
not the data point belongs to the class using a distance-based approach.
The Mahalanobis distance addresses this by measuring the distance between data points
while considering the distribution of the data, in a manner that can deal with the covariance
among the variables. The Mahalanobis distance between point x and point y is defined as
follows:
d(x, y) = sqrt((x − y)′ C^(−1) (x − y))   (III.2)
where C denotes the covariance matrix. The Mahalanobis distance is equivalent to the
Euclidean distance in a transformed space obtained by rescaling the axes to unit variance.
The Mahalanobis distance is unitless and scale invariant. We use the Mahalanobis distance for our
distance-based method in ensemble learning with unknown classes, as well as in combining
classifiers.
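A minimal sketch of Equation III.2 with an empirical covariance matrix estimated from data; the toy data with deliberately unequal variances is an illustrative assumption.

import numpy as np

def mahalanobis(x, y, cov_inv):
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) * np.array([50.0, 5.0])  # unequal variances
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))       # C^(-1)

center = data.mean(axis=0)
point = np.array([100.0, 2.0])
# Unlike a raw Euclidean distance, this value is unitless and accounts for
# the different variance along each axis.
print(round(mahalanobis(point, center, cov_inv), 3))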
In their first work on open set recognition, (Scheirer et al., 2013b) introduce their 1-vs-Set
model to counter the closed set assumption. The main idea is that there is a risk in class
assignment in open space. This is known as open space risk. This risk can be formulated in
the Compact Abating Probability Model (CAP) (Scheirer et al., 2013b) for a class assignment
which describes an inverse relationship between the probability of class membership and the
distance as data points move toward open space, away from the known data. In the CAP model,
any point beyond a given threshold distance τ from the closest training data has zero probability.
However, computing the confidence level of class membership remains a challenge.
To deal with the problem of open space risk, many researchers currently favor distance
based approaches where each class is represented by a closed sphere with its center being
computed as the class mean. Figure 3.3 illustrates a data space with two classes where each
class may be represented by more than a ball-shaped class boundary. This closed boundary
model (referred to as Nearest Class Mean) addresses the problem of open space risk and is
implemented as an NCM forest by (Ristin et al., 2014). (Bendale and Boult, 2015a) adapt this
method and propose their NNO model, which has been further modified by (De Rosa et al., 2016).
Intuitively, Bayes theorem (refer to Equation B.1) is a common way to compute the
posterior probability of an unknown class given that other classes are taken into account.
The prior probability of an event describes the probability of the event before any collection of new
data. When we know more about a hypothesis H, the adjusted probability is the posterior
probability. Since the evidence does not depend on H, Bayes' theorem can be written in
terms of an unnormalized posterior probability. One way to estimate an approximate lower
bound for unknown classes is proposed by (Schaalje and Fields, 2012). This solution relies on experimental
thresholds and may be domain specific in practice. If a model generates only scores for
class assignment (e.g., SVM), mapping raw decision scores to probability scores is a way to
determine the level of uncertainty of an existing unknown class w.r.t. open space risk. These
calibrated scores can be estimated by several methods, e.g., (Platt et al., 1999), (Huang et al.,
2006), (Bravo et al., 2008), (McCann and Lowe, 2012). Despite the fact that the estimation
method proposed in Platt's work (Platt et al., 1999) can be used to calibrate scores for
general non-linear models in addition to SVMs, there remains a gap between the threshold
selection heuristic and the Gaussian distribution's theoretical assumption for the support of
unknown classes.
(Scheirer et al., 2014a) argue that EVT (Extreme Value Theory) (Lindgren et al., 1987)
has the theoretical underpinning to address the occurrence of rare events, i.e., the appearance
of unknown classes. Because the appearance of unknown classes is rare, they recommend
using the Weibull distribution, which is independent of the model regardless of the overall
distribution. In fact, sampling the extrema (as outliers for the unknown class) in the right
tail is always guaranteed by the Weibull distribution (Scheirer et al., 2011). These authors
introduce the scoring function approach in their Weibull calibrated SVM or W-SVM (a
variant of binary SVM) for a linear boundary where the optimization problem of the SVM
transforms into an optimization of the risk within a space between two parallel hyperplanes.
For a multi-class problem, (Jain et al., 2014a) proposed PI-SVM (Probability of Inclusion
SVM) to estimate the unnormalized posterior probability of inclusion, where C denotes the known
classes, y denotes the ground truth, yi is one of the known classes of C, and ŷ denotes the
predicted label w.r.t. a data point x. A threshold has been used as a rejection
option in several studies (Platt et al., 1999), (Bartlett and Wegkamp, 2008). Furthermore, the
heuristic of selecting a decision threshold is natural only given prior knowledge of unseen classes.
Another related work is the Center Based Similarity (CBS) approach proposed by (Fei
and Liu, 2016). The data space in the original problem is transformed into a representation
of class means, called a similarity space. Their model, named cbsSVM, is applied to this
new space. By limiting the boundary to a ball, they avoid the problem of open space risk in
a traditional SVM model, but still support the detection of unknown classes. Their SVM
uses the Platt calibrated score method assuming that the Gaussian distribution for feature
normalization and center-based positive classes hold true. However, this assumption may not
always hold in practice (Scheirer et al., 2011). One common weakness of the models in
(Scheirer et al., 2013b), (Jain et al., 2014a) and (Fei and Liu, 2016) is their static setting design.
When the data is continuous, there is no control over the appearance of unknown classes over
time. Incremental learning does not require the system to retrain from the beginning,
which is important from a practical point of view. While the open set problem has not been
fully solved, studies such as (Scheirer et al., 2013b), (Scheirer et al., 2014a), (Jain et al., 2014a)
in the field of computer vision have been laying out solid foundations to address the
limitations of the closed set assumption.
Even though an open world concept was introduced by (Stolerman et al., 2013) in
stylometric authorship attribution, (Bendale and Boult, 2015b) are, to the best of our knowledge,
the first contributors to handle incremental learning in their redefined open world recognition
approach. Their NNO (Nearest Neighbor Outlier) model extends the NCM (Nearest
Class Mean) model (Ristin et al., 2014) to achieve open world recognition. The main idea is
as follows:
Sy(x, τ) = Zτ (1 − (1/τ) dW(x, µy))   (III.6)

where τ denotes a threshold that determines a ball around each class mean, and Zτ denotes a
normalization factor based on the gamma function, computed as Zτ = Γ(m/2 + 1)/(π^(m/2) τ^m),
with m the dimensionality of the data.
The NCM model and its variants have limitations similar to those of the class mean
approach, which is known to yield linear classifiers. The NNO model also needs to obtain
a metric W offline during training on an initial set of known classes. To tackle this
problem, (De Rosa et al., 2016) use the Hoeffding bound method to incrementally estimate
a confidence threshold for an unknown class. The Hoeffding bound (Hoeffding, 1963) is used in
incremental learning (Bifet et al., 2009) to estimate good attributes for streaming data when
the entire dataset cannot be observed at once.
As we apply this incrementally updated threshold from (De Rosa et al., 2016), we briefly
review the formulation of the Hoeffding inequality. Let τt denote the threshold used to assign
an instance to the unknown class at time t, given that a, b ∈ [0, 1] and δ denotes the desired
confidence level s.t. δ = 1/(t × C), where C is the number of current classes at time t. The
more training data we have, the closer the estimated threshold becomes to τt. The estimation
of prediction confidence follows (De Rosa et al., 2016).
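A minimal sketch of how the resulting confidence bound tightens as data accumulates, with δ = 1/(t × C) as above; the function name and example values are illustrative assumptions.

import math

def hoeffding_gap(t: int, num_classes: int, n: int,
                  a: float = 0.0, b: float = 1.0) -> float:
    delta = 1.0 / (t * num_classes)        # desired confidence level
    return math.sqrt(((b - a) ** 2) * math.log(1.0 / delta) / (2.0 * n))

# The estimated threshold approaches the true tau_t as n grows.
for n in (10, 100, 1000):
    print(n, round(hoeffding_gap(t=50, num_classes=5, n=n), 4))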
However, their proposed online approach (refer to lines 11 and 12 of the algorithm in
(De Rosa et al., 2016)) also suffers from a drawback: an updated class ball created to cover
an incoming data item may overlap the class ball of a different existing class. In this case,
the new class ball may contain many incorrect data points because it covers a majority
of data points of other classes. Our proposed distance-based model uses multiple centroids
and local learning similar to online NNO (De Rosa et al., 2016), but differs with two main
contributions. First, we address the initial-value issue of incrementally adding new classes in our
solution. Second, we optimize the nearest neighbor search for determining the nearest local
balls.
We are interested in developing an incremental learning approach, which is more suitable for
open set recognition especially using the ensemble approach, to facilitate classification when
the nature of data collection is continuous. Incremental learning can work well in off-line
learning and online learning when new classes may be introduced during a trained model’s
use. Our first choice is not an incremental SVM model, due to the high computational cost
of updates and performance that depends on the number of support vectors retained. Tree-based
models have their own weaknesses: splitting the data space with axis-aligned boundaries yields
highly incorrect predictions when a tree is built with a small number of splits, and the overfitting
problem prevents the generalization of a solution. A new class can exist at an arbitrary location
in the data space. This introduces a big challenge for an ensemble of tree-based models due to
the high cost of re-doing any splitting at parent nodes. While the Mondrian process supports this
ability, it comes at a cost, as the data sampling method does not take labels into account. As a
result, extending our previous work on an incremental learning model is not a feasible solution.
Using a rotation method to change decision boundaries also has a limitation, since it results in
high computational costs.
Using an ensemble of tree-based models such as Random Forests has also been criticized due
to its averaging prediction strategy. One feasible solution is to use the closed-shape boundary
method, where each class is determined by a closed boundary. The Rocchio algorithm in
Information Retrieval uses this concept, with a class being determined by a round-shaped
boundary. The Rocchio algorithm has its own problem, since this vector space method may
fail to classify classes if two data points belong to two different classes but refer to the same
object.
The Nearest Class Mean model addresses this problem by changing its mean to reflect
the new examples. It uses a threshold to detect unknown classes. A class may change its
boundary by shifting the class mean of each ball to its new class mean to counter each
mistaken classification. This strategy has been used efficiently in (Schaalje et al., 2011) for
open set authorship attribution. However, we do not use this technique, for the reasons
discussed below.
Remember that the goal in open set recognition for the general classification problem is
to counterbalance the open space risk with the empirical risk. Our model design uses a set of
closest neighbors represented in terms of class centroids. Here, a centroid or core point of
a ball-shaped boundary is an actual data point, unlike a class mean, which may not correspond
to any real data point inside the ball-shaped boundary. A class is a set of ball clusters
where each ball has a minimum number of member points. We allow the reduction of a ball's
radius to take into account errors in label assignment for each ball. Since only the nearest
ball's radius is adjusted, we refer to our model as the Nearest Class Centroid (NCC) model.
This design minimizes open space risk since it guarantees the CAP property (III.3). Let dx
denote the distance from x to its nearest centroid neighbor. Then the probability px abates
linearly with distance and vanishes beyond τ:

px = (τ − dx)/τ  if dx ≤ τ;  px = 0  if dx > τ.   (III.8)
where Br(xi) denotes a closed ball with radius r around an arbitrary known positive training
point xi ∈ K, where K denotes the set of balls representing a class, i = 1..N. The posterior
probability of class y given x is
p(y|x) = exp(−(1/2) dW(x, µy)) / Σ_{y′∈Y} exp(−(1/2) dW(x, µy′))   (III.11)
using a low-rank Mahalanobis distance dW(x, µ) = sqrt((x − µ)^T W^T W (x − µ)), where x and µ
denote a data point and a class mean, respectively.
As this is a case of the cold start problem, thresholds for unknown class detection can be
established by a pre-training stage (Bendale and Boult, 2015b). A way to update the
threshold for unknown class detection is suggested by (De Rosa et al., 2016). We use
their proposed update method for better performance. In fact, our model is on par with the
extension of NNO by (De Rosa et al., 2016), in which we use a set of centroids instead of
a class mean for representation, but differs in how each centroid is selected. The posterior
probability is
p(y|x) = exp(−(1/2) dW(x, µy)) / Σ_{y′∈Y} exp(−(1/2) dW(x, µy′))   (III.12)
such that the estimation of prediction confidence for nearest ball b∗ is defined by (De Rosa
et al., 2016) as
p(mcj |x) = (1/Z) exp(−(1/2) dW(x, mcj))   (III.13)

where p(mcj |x) denotes the posterior probability of a centroid mcj and
Z = Σc Σj exp(−(1/2) dW(x, mcj)) denotes a normalizer.
Cy(xt, b∗) = p(y|x) × exp(−dW(xt, cb∗)/(2 rb∗))   (III.14)
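A minimal sketch of Equations III.12-III.14: a softmax over negative half-distances yields the class posterior, and the confidence for the nearest ball decays with the distance to its centroid scaled by the ball radius. Plain Euclidean distances stand in for dW here, and all values are illustrative.

import numpy as np

def posterior(distances):
    scores = np.exp(-0.5 * np.asarray(distances))
    return scores / scores.sum()                    # Equation III.12

def confidence(p_y, dist_to_center, radius):
    return p_y * np.exp(-dist_to_center / (2.0 * radius))   # Equation III.14

d = [1.2, 0.4, 3.0]                   # distances to three class centroids
p = posterior(d)
print(p.round(3))                     # nearest centroid dominates
print(round(confidence(p[1], dist_to_center=0.4, radius=1.0), 3))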
Whenever a new incoming data point arrives, its distance to the center of the nearest class is
measured. Computing the distances from all current data points to the data point of interest
is not practical. In fact, we only need to determine the nearest classes to this data point.
Our approach is to use only special data points that are the centers of each class; this number
is much smaller than the total number of current data points. Then, we can measure the
distance from the data example to the centroid of the nearest class to predict its class
membership. (Figure 3.5 shows an example of initial values for a new ball in Online NNO.)
As we do not know how many balls are needed for one particular class, we adapt the
technique, known as DBSCAN, which was introduced in (Ester et al., 1996). Our proposed
method is described in Algorithm 6, using an actual data point at the center of a class ball.
This is the main distinguishing feature of our algorithm compared to NCM Forest (Ristin
et al., 2014) and its variant NNO (Bendale and Boult, 2015b), both of which rely on
K-Means clustering, assuming a fixed number of centroids per class to determine the centroid
of each class. Our approach is more flexible, as the number of centroids may increase. The
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) technique answers
the query of whether or not a data point belongs to a cluster of points that share similar
features. This technique does not require the number of clusters a priori like the K-Means
method, and as a result it can perform well on many real world datasets. DBSCAN uses the
following notions:
• A core point is a point that has at least a minimum number of points inside a circle of
radius r centered at it.
• Any point not density-reachable from a core point is considered an outlier, or a feasible candidate for a new class.

Our main idea is that a data point is rejected by a class if it is an outlier with respect
to the class. When this data point is rejected by all known classes, it may be a candidate for a new class.
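A minimal sketch of this rejection step using scikit-learn's DBSCAN: points labeled -1 are not density-reachable from any cluster and are treated as outliers, i.e., candidates for a new class. The radius eps, min_samples, and the toy data are illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
known_class = rng.normal(loc=0.0, scale=0.5, size=(40, 2))
strays = np.array([[4.0, 4.0], [4.2, 3.8]])      # far from the known class
X = np.vstack([known_class, strays])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print("outlier indices (label -1):", np.where(labels == -1)[0])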
Figure 3.6: Example of initial values for a new ball of the same class in NCC
Figure 3.7: Second case of initial new ball of the same class in NCC
We further investigate how an overlapping region can affect the accuracy. In fact, another
difference of our NCC model from the online NNO (De Rosa et al., 2016) model is how a
72
new ball is initialized, as illustrated in Figure 3.6 and Figure 3.7. For a new ball boundary,
Online NNO assigns the nearest ball's radius (refer to Figure 3.5) to the new ball,
which may overlap with the existing boundary of another class. This can lead to incorrect
label assignment in prediction. To avoid this, we make the new ball's radius equal to the
radius of the current nearest ball if the distance between them is greater than that radius (see
Figure 3.7). When a data point is close to the boundary of another existing class, the radius of
the new ball is reduced to r′ = distance(center of the nearest ball, data point) minus the radius of
this nearest ball, so that the two balls do not overlap (see Figure 3.6). We refer to this as
conservative initialization (refer to Algorithm 4), instead of spanning the full space as in (De Rosa et al., 2016).
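The sketch below gives one consistent reading of this conservative rule: the new radius never exceeds the nearest ball's radius and never reaches into it, covering both the far case (Figure 3.7) and the close case (Figure 3.6) with a single formula. The function name is an illustrative assumption.

import numpy as np

def init_new_ball_radius(point, nearest_center, nearest_radius):
    dist = float(np.linalg.norm(point - nearest_center))
    # Cap at dist - nearest_radius so the balls cannot overlap, and at the
    # nearest ball's own radius so we never span more space than it does.
    return min(nearest_radius, max(dist - nearest_radius, 0.0))

print(init_new_ball_radius(np.array([5.0, 0.0]), np.zeros(2), 1.0))  # 1.0 (far)
print(init_new_ball_radius(np.array([1.5, 0.0]), np.zeros(2), 1.0))  # 0.5 (close)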
In Figure 3.8, we show the case of updating a ball by expanding when the nearest ball
gives a correct prediction, such that the ball center or core point remains in its location.
We adjust the ball's radius and update the list of core points during training. Figure 3.8 also shows
another case, of creating a new ball that is far away from the nearest ball; here we initialize
the new ball's radius to its nearest ball's radius instead of the distance between the two, as
proposed in (De Rosa et al., 2016).

Our main algorithm works as follows. Each data point is tested against the nearest class
ball. A new ball is created if the distance between the training data point and the center of the nearest
ball of the same class is greater than the distance between the training data point and the center of
the nearest ball of a different class; otherwise, we expand the class ball's boundary. The new
boundary may or may not be updated. For testing, the basic idea is that if a data point is not
covered by its nearest class boundary, it is treated as an outlier or noise. When none of the
known classes covers the data point, it is predicted as unknown.
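A minimal sketch of this per-example training step under the representation assumed above (each ball stored as a center, radius and label; all names are illustrative):

import numpy as np

def train_step(x, label, balls, init_radius, default_radius=1.0):
    # balls: list of dicts {'center': ndarray, 'radius': float, 'label': str}
    dists = [float(np.linalg.norm(x - b["center"])) for b in balls]
    same = [i for i, b in enumerate(balls) if b["label"] == label]
    other = [i for i, b in enumerate(balls) if b["label"] != label]
    if not same:
        # First example of a brand-new class: start its first ball.
        balls.append({"center": x, "radius": default_radius, "label": label})
        return balls
    i_same = min(same, key=lambda i: dists[i])
    d_other = min((dists[i] for i in other), default=float("inf"))
    if dists[i_same] > d_other:
        # A different class's ball is closer: create a new same-class ball
        # with a conservatively initialized radius (see the sketch above).
        i_o = min(other, key=lambda i: dists[i])
        r = init_radius(x, balls[i_o]["center"], balls[i_o]["radius"])
        balls.append({"center": x, "radius": r, "label": label})
    else:
        # Safe to expand the nearest same-class ball to cover x.
        balls[i_same]["radius"] = max(balls[i_same]["radius"], dists[i_same])
    return balls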
4 Experiment setup
To validate the proposed model, we experiment with three datasets. The first two are the
20 newsgroups (Asuncion and Newman, 2007) and Amazon reviews (Jindal and Liu,
2008) datasets, which have been used in the experiments of (Fei and Liu, 2016) for document
classification based on topics using a simple bag of words with TF-IDF computation. The
last dataset is a subset of the IMDB movie reviews dataset, which consists of 18 genres of
movies: action, animation, comedy, documentary, family, musical, romance, war,
adventure, biography, crime, drama, fantasy, music, mystery, sci-fi, thriller, and western.
• The Amazon reviews dataset in (Fei and Liu, 2016) contains 1000 reviews with a total of 50 types
of products, or domains.
• The subset of IMDB movie reviews is grouped into 3 genre domains with related content.
To validate classifier models in an open world setting, we use the evaluation protocol
proposed in (Bendale and Boult, 2015b). In the experiment, classes (referred to as categories)
are injected into the system in incremental steps while unknown test classes are evaluated.
Since the results in (Fei and Liu, 2016) are used for comparison, we do not include run time
measurements.
Two domains of the 20 newsgroups dataset consist of 10 and 20 topics whereas the
Amazon product reviews are split into 5 domains. Five domains of the Amazon dataset
consist of 10, 20, 30, 40 and 50 classes. For IMDB movies, 3 genre domains consist of 6,
12 and 18 classes. We select a set of movies from the year 2010 to the year 2016, with 5 reviews
each, for a total of 1,977,035 reviews across the three genre domains. The grouping is based on
correlated topics, which allows us to maintain a fair balance among the three groups.
A correct class assignment is referred to as a TP (True Positive: correct class assignment
for a known class) or a TN (True Negative: correct assignment of an unknown class).
An incorrect class assignment is denoted either as a False Positive (FP: incorrect assignment
to a known class) or a False Negative (FN).
Table 3.1: F-score results on Amazon product review dataset with 10 and 20 domains

Experiment on        10 domains                20 domains
Amazon data      0.25  0.5   0.75  1       0.25  0.5   0.75  1
1-vs-Set*        0.59  0.69  0.70  0.69    0.51  0.56  0.59  0.62
W-svm*(linear)   0.60  0.69  0.69  0.70    0.55  0.62  0.63  0.64
W-svm*(rbf)      0.25  0.59  0.70  0.79    0.39  0.50  0.57  0.70
PI-svm*(linear)  0.60  0.70  0.70  0.71    0.55  0.62  0.63  0.64
PI-svm*(rbf)     0.25  0.59  0.72  0.77    0.40  0.55  0.68  0.71
cbsSVM           0.45  0.72  0.78  0.87    0.57  0.69  0.70  0.76
onlineNNO        0.61  0.71  0.77  0.83    0.61  0.66  0.70  0.76
our model        0.61  0.72  0.78  0.85    0.61  0.66  0.71  0.78
Table 3.2: F-score results on Amazon product review dataset with 30 and 40 domains

Experiment on        30 domains                40 domains
Amazon data      0.25  0.5   0.75  1       0.25  0.5   0.75  1
1-vs-Set*        0.46  0.51  0.54  0.59    0.43  0.49  0.53  0.56
W-svm*(linear)   0.52  0.57  0.58  0.60    0.50  0.55  0.56  0.57
W-svm*(rbf)      0.37  0.44  0.50  0.65    0.35  0.40  0.46  0.61
PI-svm*(linear)  0.52  0.58  0.58  0.60    0.50  0.55  0.56  0.57
PI-svm*(rbf)     0.38  0.52  0.63  0.68    0.37  0.50  0.60  0.63
cbsSVM           0.56  0.65  0.63  0.69    0.54  0.63  0.62  0.65
onlineNNO        0.59  0.64  0.63  0.69    0.55  0.60  0.63  0.66
our model        0.59  0.62  0.64  0.71    0.55  0.60  0.64  0.66
report the F-score, suggested by (Scheirer et al., 2013b), (De Rosa et al., 2016), which is

F-score = 2 × (Precision × Recall) / (Precision + Recall)   (III.15)

where Precision = TP/(TP + FP) and Recall = TP/(TP + FN).
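A direct transcription of Equation III.15; the counts in the example are illustrative.

def f_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f_score(tp=60, fp=15, fn=25), 3))   # 0.75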
We include 1-vs-Set (Scheirer et al., 2013b), PI -SVM (Jain et al., 2014a) and online
NNO (De Rosa et al., 2016) for comparison. In fact, the online NNO model, a modified
NNO, can generate the same performance as NNO but does not need a training phase to
Table 3.3: F-score results on Amazon product review dataset with 50 domains

Experiment on Amazon   0.25  0.5   0.75  1
1-vs-Set*              0.42  0.48  0.51  0.55
W-svm*(linear)         0.49  0.54  0.55  0.56
W-svm*(rbf)            0.32  0.37  0.44  0.58
PI-svm*(linear)        0.49  0.55  0.55  0.56
PI-svm*(rbf)           0.36  0.51  0.63  0.63
cbsSVM                 0.56  0.62  0.59  0.63
onlineNNO              0.55  0.62  0.65  0.72
our model              0.55  0.62  0.68  0.75
establish the initial threshold. We were not able to obtain the implementation of cbsSVM from
(Fei and Liu, 2016), to which our proposed solution's results are compared. We report the
results of online NNO and NCC (our model) and compare them with the results reported by
(Fei and Liu, 2016). We extend our experiment with the IMDB dataset (Table 3.4), for which
we reimplement cbsSVM (similar to the Rocchio algorithm). Table 3.1 and Table 3.2 illustrate
the experimental results for the Amazon dataset and compare the performance of our model to
online NNO (De Rosa et al., 2016). Table 3.4 reports the results for the 20 newsgroups dataset.
• For the rbf (kernel) SVM, we use (C=5, γ=0.2) for Amazon, (C=5, γ=0.5) for IMDB, and (C=10, γ=0.5)
for the 20 newsgroups dataset.
By withholding test classes during training, we evaluate the performance of the algorithms with
unknown classes in different scenarios. For example, when we say training is on 25% (or
.25), it corresponds to an experiment where we train on examples from 8 classes
and test on examples from the 10-class domain in the case of the Amazon reviews dataset.
The F-score performances of 1-vs-Set, W-SVM, PI-SVM and cbsSVM in Table 3.1,
Table 3.2, Table 3.3 and Table 3.4 are taken from (Fei and Liu, 2016) (marked as *), followed by
the results of our own experiments.
Table 3.1, Table 3.2, and Table 3.3 report the F-score performance with subsets of
data domains in 10, 20, 30, 40 and 50 domains for the Amazon review dataset. Similarly,
Table 3.4 reports the F-score performance for subsets of 10 and 20 topics of the newsgroups dataset,
while Table 3.5 and Table 3.6 report the F-score performance for the IMDB movie genre
dataset. In general, F-score performance (in short, performance) decreases as the number
of classes unseen in training increases. Overall, our proposed model performs well in all
three experiments, especially with the IMDB dataset at the 0.5, 0.75 and 1 settings. We
obtain similar performance on the 20 newsgroups and Amazon datasets. The
0.25 setting is the most challenging since it represents using only 25% of the testing classes
during training. Both our model and online NNO place at the top 80% of the time. Our model
shows more improvement as the number of testing classes increases. Our NCC model and
online NNO (as well as NNO) remain better than a random guess even as the number of
unseen classes grows. Interestingly, we observe that linear models such as 1-vs-Set and
W-SVM (linear) come very close to our performance, even outperforming it in the case of the
20 newsgroups dataset. This mirrors the closed world setting, where linear classifiers often
perform well on text data.
The cbsSVM model makes the most gain when only half the test classes are presented
during training with the Amazon dataset, especially with 50 domains. However, we believe
this is an unusual result since the performance of an algorithm is often low when the number
of categories (domains) is high, given the high number of unobserved classes in this case: 38
out of 50 domains at the .25 setting of observed test classes. Our model performs well across the
entire set of experiments: 70%, or 14 out of 20, with the Amazon dataset and 87.5%, or 7 out of 8,
with the 20 newsgroups dataset.
To experiment with the IMDB genre dataset, we simulate the cbsSVM model using the
pseudocode in (Fei and Liu, 2016). First, we process the IMDB dataset with unigram
and bigram representations. The Rocchio algorithm is used to compare each document with the
center of each class. Finally, the SVM model is used as the final classifier with Platt's
calibration option. Table 3.5 and Table 3.6 present the F-measure performance in
this experiment. 1-vs-Set shows poor performance for text datasets in general. We observe
that W-svm does better than the cbsSVM model. Our model demonstrates good performance overall.
Both distance-based approaches, online NNO and our NCC model, perform better than the
others when all classes are presented in the training stage, referred to as the closed set assumption.
The accuracy of all algorithms, computed based on formula [13] in (Scheirer et al., 2013b),
decreases as the number of unknown classes introduced into testing increases (refer to
Figures 3.10 and 3.12), with some exceptions for the 20 newsgroups dataset (see Figure 3.11).
However, accuracy losses vary in two distinct ways. Our model and online NNO gradually
lose accuracy as more unknown classes appear in testing. 1-vs-Set, W-SVM and PI-SVM
perform well in our experiment with the IMDB dataset but not as well in (Fei and Liu, 2016).
Furthermore, we address the issue of initializing values when a new ball is added, and
we perform a nearest neighbor search as described in the DBSCAN algorithm (Ester et al.,
1996). The results demonstrate that our proposed model does not suffer from the issue of
the poor radius approximation r0 in (Fei and Liu, 2016) when positive and negative training
examples are unbalanced.
Both Figure 3.10 and Figure 3.12 show similar trends of accuracy loss as the number of
unseen classes or domains increases (Figure 3.12 reports accuracy on unknown test cases on
the IMDB domain dataset). This shows how classification becomes a hard problem when
unknown classes exist only in testing. While the performance of the linear models decreases
gradually, the RBF kernel suffers significant losses. Our model shows a more gradual loss.
We do not experiment with the Exploratory EM model presented in (Fei and Liu,
2016) because it is not suitable in the context of open set recognition. Last but not least, the
cbsSVM model is not an incremental learning model; in this respect it is similar to
1-vs-Set, W-SVM and PI-SVM, which all retrain whenever unknown classes are introduced
into the system. The unexpected performance of cbsSVM (actually our simulation
of cbsSVM) on the IMDB dataset may be due to two factors: the single class representation
and the features used. We expect that the addition of trigrams may further improve the
result.
Our proposed model seems to work better than online NNO. We believe there are
two main reasons for the success of our proposed algorithm: the use of a cluster-based
approach, as DBSCAN performs well even without prior knowledge of the number of ball-shaped
components for each class, and the new way of initializing values in our proposed
model, which is more effective than the way it is performed in the Online NNO model.
In order to validate the results of our experiment, we compute a two-way repeated-measures
ANOVA test. There are two main factors in the ANOVA test in our study: algorithms
and varying training sizes. The algorithms include 1-vs-set, W-svm-linear, W-svm-rbf, PI-svm-linear,
PI-svm-rbf, cbsSVM, online NNO and our proposed model NCC. The training
sizes are 25%, 50%, 75% and 100%. ANOVA also allows testing whether or not there are
interactions between the first factor and the second factor. Interaction testing can indicate
whether the factors are independent of each other, for example, whether the effect of training
size on the performance depends on the algorithm.
• H01 : All the classifiers are the same, i.e., they produce the same results.
• H02 : Different testing class sizes during training do not impact the performance, and
• H03 : There is no interaction between the two factors.
The result of the two-way ANOVA test is a p-value of 1.32e-06 for the first hypothesis H01,
2e-16 for the second hypothesis H02, and 0.052 for the third hypothesis H03, all at the 95%
confidence level. As a result, we reject the first two null hypotheses. We conclude that there are
differences among these classifiers and that the proportion of testing classes present during
training has an impact on the performance. However, we cannot reject the third null hypothesis,
indicating no significant interaction between the two factors.
Further, we compute pairwise t-tests for the top pairwise classifiers. We compute these
tests with the 25% open set setting, which is considered the most challenging task because
it has the largest fraction of unseen classes. Table 3.7 indicates that our proposed model can
be considered better than W-SVM rbf, PI-SVM rbf and cbsSVM. We fail to reject the null
hypotheses when comparing our proposed model with 1-vs-set, W-SVM linear, PI-SVM linear
and Online NNO, so we cannot conclude that it outperforms these.
In this chapter, we have presented our proposed solution that attempts to overcome the
deficiencies of online NNO in initializing and updating information for a class ball. While
our proposed model works well in the text domain, we see room for improvement, since this
design does not focus on replacing old concepts with new ones, for example by creating a
new model as new data examples arrive and using a modified combination method. In summary:
2. Our incremental learning design has advantages over 1-vs-set, W-SVM and PI-SVM,
which must retrain when new classes appear.
3. NCC adapts better to different data domains due to its use of a clustering technique.
CHAPTER IV
An individual often has only partial knowledge required to solve a particular problem. A
committee is a common way to address this issue because committee members can pool
their expertise in order to make a better decision. Similarly, an ensemble learning can
combine models trained in different sub-regions of a data space to provide scalability to data
mining algorithms. We hypothesize that an ensemble model can achieve scalability if the
combining method also improves the predicted results of its classifier members operating on
the different regions of original data space. For example, if there is no single classifier that
excels in the entire data space, a solution is to split data into smaller regions and to generate
the final prediction. However, if the majority of classifiers are not trained in the same region
where testing occurs, the majority voting method will fail to generate proper results. Herein,
we propose a novel method to overcome the challenge of efficiently assigning and adjusting
the weights of ensemble members.
(Kuncheva, 2002) illustrates that an ensemble model may perform better than any base
classifier as long as all the classifiers are complementary. In other words, the errors made by
the ensemble classifier members must be independent (Zupanski and Zupanski, 2006). For
the analysis below, let D = {D1, . . . , DL} denote the L classifier members
of the ensemble. Let K denote the number of sub-regions, referred to as regions of competence,
into which the original feature space is decomposed. Let R1 , · · · , RK denote the K sub-
regions. Let D∗ ∈ D denote the ensemble member with the highest average accuracy over
the entire feature space. Let P(Di |Rj) denote the probability of correct classification by Di
when the input is drawn from region Rj. The overall probability of correct classification is
computed as

P(correct) = Σ_{j=1}^{K} P(Rj) P(Dij |Rj)   (IV.1)
where P(Rj ) denotes the probability that an input is drawn from region Rj . To maximize
P(correct), we assign Dij such that P (Dij |Rj ) ≥ P (Di |Rj ) where i ∈ [1, · · · , L]. As a
result,
P(correct) ≥ Σ_{j=1}^{K} P(Rj) P(D∗ |Rj) = P(D∗).   (IV.2)
Assuming that Dij is the best classifier of the ensemble model D in region Rj, this simply
indicates that the ensemble model is at least as good as the best single classifier D∗, regardless of the
region from which an input is drawn.
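A small worked instance of Equations IV.1 and IV.2 (all probabilities below are illustrative assumptions): selecting the best classifier per region matches or beats the single best classifier over the whole space.

import numpy as np

p_region = np.array([0.5, 0.3, 0.2])             # P(R_j)
# P(D_i | R_j): rows are classifiers, columns are regions of competence.
p_correct = np.array([
    [0.9, 0.5, 0.6],   # D1 excels in R1
    [0.6, 0.8, 0.5],   # D2 excels in R2
    [0.7, 0.6, 0.9],   # D3 excels in R3
])

best_single = (p_correct @ p_region).max()              # P(D*)
per_region = (p_correct.max(axis=0) * p_region).sum()   # Equation IV.1
print(round(best_single, 3), round(per_region, 3))      # 0.72 vs. 0.87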
Bagging and Random Subspace are two techniques often used in an ensemble model to create diverse training subsets:
• Bagging (Breiman, 1996) generates subsets by sampling examples with replacement, so each ensemble member is trained on a different bootstrap sample.
• Random Subspace (Ho, 1998) splits a feature space into smaller regions that may
overlap. This is an efficient way to reduce a high dimensionality domain, particularly
when the number of features is larger than the number of examples.
Another variation of Bagging is the Pasting method (Breiman, 1999), where subsets are
randomly generated from a dataset without replacement. The ensemble model is reported to have
higher accuracy (Louppe and Geurts, 2012) using both methods. To aggregate the final prediction,
two strategies exist:
• The boosting method focuses on fixing situations where the previous classifiers make
wrong predictions.
• The combining method, commonly used with bagging, aggregates the outcomes from the
ensemble members.
A boosting model such as AdaBoost (Freund et al., 1996) can be built on top of any
classifier by assigning a weight to each training example. At each round, the weight of the
misclassified examples is increased to correct the problem; that is, boosting gives more
weight to misclassified instances in the next iteration based on the previous predictions. A
well-known model built on this idea is Gradient Boosting.
A related ensemble method is stacking, which is defined as a two-layer model. Stacking
can tackle both variance and bias problems by using another classifier, also known as the
combiner, to reduce generalization error. The stacking method, or stacked generalization,
can be described as follows:
• The first classifier is trained on a random subset of the available training data, similar
to bagging.
• The second classifier is trained on a dataset where half the examples come from subsets
different from the origin subset and half come from misclassified instances.
• The third classifier is trained with the instances on which the first and the second classifiers
disagree.
This scheme uses held-out data for training instead of out-of-fold prediction, whereas the combining method, which can
be applied in both Bagging and Boosting approaches, focuses on how to fix incorrect predictions
when aggregating the members' outputs.
Given the different predicted outcomes of the ensemble members, an ensemble model can use a
combining method to aggregate a final prediction. A simple solution is the majority vote method,
which considers all classifiers as equal. Majority vote was used in Bagging by
(Breiman, 1996), in Extra Trees (Geurts et al., 2006) and in Random Forests (Breiman,
2001), where an ensemble model was initially comprised of homogeneous classifiers. For
an ensemble of L classifiers, majority vote (Battiti and Colla, 1994) will generate a correct
prediction if at least ⌊L/2⌋ + 1 classifiers obtain the correct answer. Let us consider an
ensemble of three classifiers C1 , C2 , and C3 with accuracies 0.6, 0.6 and 0.8, respectively. With majority vote, the ensemble is correct whenever at least two classifiers classify correctly, giving a final accuracy of Pmajority = 0.6² · 0.2 + 2 · 0.4 · 0.6 · 0.8 + 0.6² · 0.8 = 0.744 < 0.8, the accuracy of the best single classifier in this model. If we drop the two poor performers C1 and C2 , our ensemble model achieves an accuracy of 0.8, the same as the single best classifier. This pitfall of Majority vote can be rectified by adding coefficients or weights w1 , w2 and w3 such that w1 = w2 = 0 and w3 = 1. We see that Weighted Majority vote can thus recover the accuracy of the best single classifier.
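The arithmetic above can be checked by enumerating the eight correctness patterns of the three classifiers; a short sketch:

from itertools import product

acc = [0.6, 0.6, 0.8]

def vote_accuracy(weights):
    total = 0.0
    for outcome in product([0, 1], repeat=3):       # 1 = member is correct
        p = 1.0
        for a, o in zip(acc, outcome):
            p *= a if o else (1 - a)
        correct_weight = sum(w for w, o in zip(weights, outcome) if o)
        if correct_weight > sum(weights) / 2:       # weighted majority wins
            total += p
    return total

print(vote_accuracy([1, 1, 1]))   # plain majority vote: 0.744
print(vote_accuracy([0, 0, 1]))   # weights favoring C3: 0.800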
The Majority vote is effective so long as all the committee members have identical predictive power. However, many studies have criticized this assumption (Ruta and Gabrys, 2001) and have pointed out the limitations of the Majority vote (Kuncheva et al., 2003). In the introduction to this thesis, we illustrate that the final prediction can be improved if and only if a diversity of classifiers is guaranteed (Kuncheva and Whitaker, 2003). Moreover, the vote of the best classifier member needs to weigh more than that of a weak classifier member.
As sampling techniques are commonly used to create different samples of data, each subset may be generated from a different portion of the data space. Many researchers question the use of invariant weights (Kim et al., 2011) in ensemble models because each member of the ensemble model may have different expertise on a given data domain. As a result, we need to adjust the weights of members who give incorrect advice. Suppose we construct an ensemble model for a binary problem where each member's (expert's) decision is either correct (1) or incorrect (0), and member i carries a weight wi , i ∈ [1, n].
Different weighting schemes are used in ensemble models such as AdaBoost, eXtreme Gradient Boosting (XGB), and Random Forests. While AdaBoost assigns fixed weights to each ensemble member, XGB allows updating the weights of members in each training step to address incorrect predictions. The Dynamic Weight method allows adjusting the weights of classifiers that generate correct predictions. In particular, the Dynamic Weight method often uses an extra classifier, for example a single-layer neural network, to learn the weights. This combining method has demonstrated high accuracy in many reports (Jin et al., 2003), (Karmaker et al., 2007), (Valdovinos and Sanchez, 2009).
3 Related work
The Weighted Majority vote algorithm was first studied by (Littlestone and Warmuth, 1989), who showed that it can be generalized by adding a quota, a minimum number of votes needed to achieve a majority. In the Weighted Majority vote, we reduce the current weight by half when an ensemble member (expert) makes an incorrect prediction.
Let n denote the number of ensemble members (referred to as experts or members), W the total weight, and let m and M denote the number of mistakes made by the best member and by the Weighted Majority vote algorithm, respectively. If at least half of the total weight of the ensemble members produces incorrect predictions, the Weighted Majority vote algorithm makes an incorrect prediction. Because the penalty halves every weight corresponding to an incorrect prediction, each such mistake reduces the total weight W by at least 1/4. Starting from a total weight of n, after M mistakes the total weight satisfies W ≤ n(3/4)^M . On the other hand, the best ensemble member makes m mistakes, so its weight is (1/2)^m . Combining (1/2)^m ≤ W ≤ n(3/4)^M yields M ≤ 2.41(m + log2 n), where m is the number of mistakes made by the best expert. Since this algorithm only guarantees an upper bound with constant 2.41, it may not always be better than a random guess. To illustrate this, consider an
ensemble of 10 members where the best member makes 20% mistakes. The upper bound is 2.41(20 + log2 10) ≈ 56% mistakes over 100 trials, which is far from a good bound because it exceeds the 50% error rate of random guessing.
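A minimal sketch of this halving scheme on a synthetic stream (the expert error rates are illustrative):

import random
random.seed(0)

n, T = 10, 500
weights = [1.0] * n
mistakes = 0
for t in range(T):
    truth = random.randint(0, 1)
    # Expert 0 errs 20% of the time; the rest guess at random.
    votes = [truth if random.random() < 0.8 else 1 - truth] + \
            [random.randint(0, 1) for _ in range(n - 1)]
    mass = [0.0, 0.0]
    for w, v in zip(weights, votes):
        mass[v] += w                 # weighted vote for each label
    pred = 0 if mass[0] > mass[1] else 1
    mistakes += (pred != truth)
    for i, v in enumerate(votes):
        if v != truth:
            weights[i] /= 2          # halve the weight of every wrong expert
print("WM mistakes:", mistakes, "out of", T)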
An improved version, the Randomized Weighted Majority (RWM) vote, also proposed by (Littlestone and Warmuth, 1989), normalizes the weight of each member by the weight total and views the normalized weights as probabilities. That is, we replace the weight of the ith classifier (expert) with the ratio wi /W , where W = Σi wi ; we choose expert i with probability wi /W and predict what that expert says. When an expert makes a mistake, its weight is multiplied by a factor β = 1 − ε, where ε is the fraction of weight reduction.
(Littlestone and Warmuth, 1989) show that the expected number of mistakes M satisfies

$$M \le \frac{m \ln(1/\beta) + \ln(n)}{1 - \beta} \tag{IV.3}$$
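The randomized variant differs only in how the prediction is chosen: an expert is sampled in proportion to its weight, and every wrong expert is penalized by the factor β. A sketch, reusing the synthetic experts above:

import random
random.seed(0)

n, T, beta = 10, 500, 0.7
weights = [1.0] * n
mistakes = 0
for t in range(T):
    truth = random.randint(0, 1)
    votes = [truth if random.random() < 0.8 else 1 - truth] + \
            [random.randint(0, 1) for _ in range(n - 1)]
    # Choose expert i with probability w_i / W and follow its prediction.
    i = random.choices(range(n), weights=weights)[0]
    mistakes += (votes[i] != truth)
    for j, v in enumerate(votes):
        if v != truth:
            weights[j] *= beta       # multiplicative penalty for wrong experts
print("RWM mistakes:", mistakes, "out of", T)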
(Polikar et al., 2001) adapted the Weighted Majority vote in their Learn++ algorithm to penalize ensemble members with incorrect predictions. This situation arises when the testing data come from different regions than the training data. The idea is to update the weight of each classifier member and retrain the model. One modified version of the Weighted Majority vote was proposed by (Muhlbaier et al., 2004), and another by (Muhlbaier et al., 2009), known as Dynamically Weighted Consult and Vote. These combining methods use prediction error rates as indicators to adjust weights and aggregate the results.
Generally, combining methods can obtain the final prediction by focusing on the best classifiers to aggregate the results. The simplest way to determine the best classifier is to ignore other ensemble members with low performance. On the other hand, (Zamani et al., 2014) recommend that the number of best classifiers be equal to the number of classes plus one to improve the prediction. In their ensemble model, the classifiers form a cascade in which the output of one classifier is the input to the others, and the correct label of an example is fed back to the classifier that produced the output label. This reduces the weights of incorrect members, so that the number of mistakes made by the ensemble model can be lowered below that of the Randomized Weighted Majority vote when the training size is increased. However, their proposed ensemble model requires high running time and performs poorly on unbalanced classes.
In this thesis, we propose a novel combining method for an ensemble model, where each classifier may be trained in sequence, each acquiring different aspects of the classification knowledge. To our knowledge, our proposed method is the first investigation into developing such a combining method for the open set setting. The Randomized Weighted Majority vote may not generalize well when dealing with classifiers trained on different regions of the data, even though training members on different data is a common way of enhancing expertise through diversity. The problem is that when a majority of ensemble members are not trained in the same region as the testing data, a combining method may incorrectly reject the correct predicted outcomes. Figure 4.2 illustrates another example of two groups of classifiers. Here,
members of group 1 have been trained on classes 1 and 2, while all members of group 2 have been trained on classes 1, 2 and 3. Note that at this point, the ensemble model includes two groups of members which have acquired different aspects of the necessary classification knowledge. Now, consider a testing stage where an object from class 3 is introduced, as shown in Figure 4.3. If a simple majority vote is taken, we see that the three members without knowledge of class 3 outvote the minority group of two, and they continue doing so until the minority group gains sufficient votes to outvote the remaining members, which may only be after several misclassifications. While a modified weight for the minority may be a promising direction to explore, it is clear in this example that a simple Weighted Majority vote does not provide a mechanism to solve this problem. Herein, we demonstrate precisely how to overcome this limitation.
Figure 4.3: An ensemble member not trained on a certain class, during testing
It is important to reiterate that simply coming up with a good voting process does not always guarantee the correct solution unless knowledge from training is taken into account. Since the training process can be used to obtain confidence levels for the votes used in aggregating results, we can use this confidence level as the weight of a classifier's decision. Decision making in ensemble learning represents a core weakness that needs to be resolved. The dilemma between exploration and exploitation arises when an ensemble model is forced to make decisions under uncertainty, and we seek to obtain more information to ensure a higher level of confidence in decision making. A popular example is the multi-armed bandit: when we pull the lever of a slot machine, we may get lucky and receive a reward. Some slot machines reward us more than others, but which machines are the most generous is unknown. Each machine has an unknown probability of generating a prize, which corresponds to the fact that each ensemble classifier has a different confidence level in predicting.
Algorithms for this setting are surveyed in (Kuleshov and Precup, 2014). Multi-armed bandit methodologies often refer to a one-armed bandit, a term for a slot machine. For ensemble learning, the multi-armed bandit setting illustrates how to formulate a solution for adjusting the weights of an ensemble model. The classical formulation assumes that each arm is independent of all the others, whereas this thesis allows regions to overlap when sampling the data. In our proposed solution, we update the weights of the ensemble classifiers based on posterior beliefs computed from each classifier's training history, as described next.
5 Proposed method
Intuitively, we want to have more confidence in a classifier that is trained on the region from which a tested example comes. A simple solution is to determine the best classifier, give it a larger weight, and give the other classifiers proportional weights; these weights can then be adjusted based on performance in the previous step. The difficulty of this approach occurs when two classifiers generate different predictions but were trained on nearby regions. To counter this issue, we exploit the training history of each classifier. In this thesis, we use Bayesian inference to formulate the posterior belief and compute the weight of each classifier. We consider the information known up to time t, such that a classifier C maps data D to distributions over actions a. The update proceeds as follows:
• Initially, our belief for each classifier is drawn from a uniform distribution.
• We compute a new belief factor (referred to as a weight) ri based on our prior belief factor.
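One standard way to realize these two steps, assuming a Bernoulli (correct/incorrect) observation per classifier, is a Beta-Bernoulli update in which the uniform prior is Beta(1, 1); the sketch below is illustrative rather than the exact update rule used in this chapter.

import numpy as np

L = 5                                # number of ensemble members
alpha = np.ones(L)                   # Beta(1, 1) = uniform prior belief
beta = np.ones(L)

def update(i, correct):
    """Update the belief for classifier i after one prediction."""
    if correct:
        alpha[i] += 1
    else:
        beta[i] += 1

def weights():
    # Posterior mean of each member's probability of being correct,
    # normalized so the weights sum to one.
    means = alpha / (alpha + beta)
    return means / means.sum()

update(0, True)
update(1, False)
print(weights())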
Let c1 , · · · , cn denote the class labels, s = (s1 , · · · , sL ) the votes of the L classifiers, and P(ck |s), k = 1..n, the probability that class ck is the true label given the votes s. We assume that all classifiers are independent, such that P(s1 , s2 , · · · , sL |ck ) = Πi P(si |ck ). Let I+k denote the set of indices of classifiers that vote for class ck and I−k the set of indices of classifiers that vote for another class label. The posterior probability can be written as

$$P(c_k \mid s) = \frac{P(c_k)}{P(s)} \prod_{i \in I_{+}^{k}} P(s_i \mid c_k) \times \prod_{i \in I_{-}^{k}} P(s_i \mid c_k) \tag{IV.5}$$
This formula describes the optimality condition for several combination rules (Kuncheva and Rodrı́guez, 2014). Our goal is to determine the probability of a correct prediction for each ensemble member and use it in the combination method. Keeping track of which examples worked well for which classifier during training may be helpful, but our research showed that this approach suffers from high memory cost, as in (Muhlbaier et al., 2009). Instead, we apply Bayesian inference to compute the posterior probability given prior knowledge, adapting the multi-armed bandit algorithm to improve the performance of the Randomized Weighted Majority vote solution.
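As a concrete instance of Eq. (IV.5), the sketch below combines three votes for a two-class problem; the per-classifier conditional accuracies P(si |ck ) are made-up numbers standing in for estimates obtained on validation data.

import numpy as np

priors = np.array([0.5, 0.5])             # P(c_k) for two classes
# cond[i, k] = assumed P(member i votes correctly | true class c_k)
cond = np.array([[0.7, 0.6],
                 [0.8, 0.8],
                 [0.6, 0.9]])
votes = np.array([0, 0, 1])               # class index each member voted for

post = priors.copy()
for i, v in enumerate(votes):
    for k in range(2):
        # P(s_i | c_k): the member's accuracy if its vote matches c_k,
        # otherwise the complementary error probability.
        post[k] *= cond[i, k] if v == k else 1 - cond[i, k]
post /= post.sum()                        # normalize by P(s)
print("P(c_k | s) =", post)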
During training, we have a set of actions a and contexts x for which a reward r is observed. The historical observations are modeled by a likelihood function P(r|a, x, θ) depending on some parameters θ. For some prior distribution P(θ) on these parameters, the posterior distribution is given by Bayes' rule: $P(\theta \mid D) \propto \prod P(r \mid a, x, \theta)\, P(\theta)$.
In our proposed solution, we use the Mahalanobis metric to measure the distance from a testing example to the center of the distribution of data points in each region of the original data set. We found that if testing data points lie in the same region of the data space where a classifier was trained, that classifier predicts with more confidence. The classifier assigned to the region with the smallest Mahalanobis distance to the testing data point therefore receives a larger weight. Our method, which stores only the mean and the covariance matrix of each training region, is more efficient than a method that stores the entire training history. In addition, an upper bound for the bandit problem using Bayesian inference has been obtained (Kaufmann et al., 2012). Since our proposed solution is based on the multi-armed bandit setting, it assumes that the classifiers (as arms) are independent. Fortunately, a diversity of classifiers in the ensemble model can be achieved by independently selecting the classifiers, which in turn guarantees that the samples can be drawn from each classifier in an independent and identically distributed manner.
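A minimal sketch of the distance-based weighting follows, assuming each region is summarized only by its mean and inverse covariance; scipy's mahalanobis helper computes the distance, and the inverse-distance normalization is one simple choice, not necessarily the exact rule used in our experiments.

import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
# Summaries of two training regions: (mean, inverse covariance).
regions = []
for center in ([0.0, 0.0], [5.0, 5.0]):
    pts = rng.normal(loc=center, scale=1.0, size=(200, 2))
    cov_inv = np.linalg.inv(np.cov(pts, rowvar=False))
    regions.append((pts.mean(axis=0), cov_inv))

def region_weights(x):
    # Smaller Mahalanobis distance => larger weight for that region's classifier.
    d = np.array([mahalanobis(x, mu, ci) for mu, ci in regions])
    w = 1.0 / (d + 1e-12)
    return w / w.sum()

print(region_weights(np.array([4.5, 5.2])))   # favors the second region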
6 Experimental setup
We used the IMDB movie reviews and Amazon product reviews datasets to evaluate our proposed combining method in two experiments. First, we test under the open set setting, creating an ensemble of different classifiers. For the Amazon dataset, we use 50 domains, and we use 3 domain groups in the IMDB movie reviews dataset. The fraction of testing classes introduced into training is varied over 0.25 (25%), 0.5 (50%), 0.75 (75%) and 1 (100%).
In our prior work, reported in Chapter III, the proposed NCC model revealed a limitation in difficult test cases where fewer than 25% of the testing classes existed during training. In other words, when the number of untrained classes is high, NCC does not perform as well as other open set solutions. We hypothesize that a combining method can mitigate this weakness if there exist strong classifier members in the ensemble model.
In the experimental setting, our strong classifier may be picked from one of the proposed models for open set classification. This intentionally designed ensemble model allows traditional classifiers to work well as ensemble members under the open set scenario. While the NCC classifier has been designed to work with open sets, the other classifiers can be selected from a pool of classifiers. Alternatively, the selection of the remaining classifiers can be formulated as an algorithm selection problem. In the second experiment, the IMDB movie review and Amazon product review datasets are used to predict a negative or positive sentiment, exploring what users think about the services they have paid for (Pang et al., 2002), (Agarwal et al., 2011). We choose the word embedding method Word2Vec (Mikolov et al., 2013) and use statistical features to combine classifiers. We apply 65%/20%/15% splits for training, validation, and testing, respectively.
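One common way to turn Word2Vec into fixed-length review features is to average the word vectors of each review; the sketch below uses gensim 4.x on a toy corpus, and the vector size and averaging scheme are assumptions rather than the exact feature pipeline of this thesis.

import numpy as np
from gensim.models import Word2Vec

reviews = [["great", "movie", "loved", "it"],
           ["terrible", "acting", "boring", "plot"]]
model = Word2Vec(sentences=reviews, vector_size=50, min_count=1, seed=0)

def review_vector(tokens):
    # Average the embeddings of in-vocabulary tokens (zeros if none).
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = np.vstack([review_vector(r) for r in reviews])
print(features.shape)    # (2, 50)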
Other classifiers include Naive Bayes (NB), eXtreme Gradient Boosting (XGB), Logistic Regression (LR), Stochastic Gradient Descent (SGD), and Linear SVM. In addition, we use neural network models implemented in the Keras library: a Convolutional Neural Network (CNN) (LeCun et al., 1995) and a recurrent neural network (RNN) (Sak et al., 2014). The RNN model is a Long Short Term Memory (LSTM) network, whose architecture is described below. Figure 4.4 illustrates the architecture of our approach. In the preprocessing step, we extract the data and remove corrupted words and numbers. The next step is to generate feature sets for the review datasets.
Table 4.1: F-scores in Experiment 1a on Amazon product review dataset

                          |       10 domains        |       20 domains
Experiment on Amazon      | 0.25  0.5   0.75  1     | 0.25  0.5   0.75  1
1-vs-set                  | 0.59  0.68  0.74  0.78  | 0.56  0.62  0.68  0.72
W-SVM linear              | 0.61  0.66  0.76  0.79  | 0.58  0.64  0.70  0.75
PI-SVM linear             | 0.60  0.69  0.78  0.84  | 0.57  0.66  0.69  0.73
Online NNO                | 0.60  0.71  0.77  0.83  | 0.58  0.66  0.70  0.76
Our ensemble model        | 0.61  0.71  0.79  0.84  | 0.59  0.67  0.72  0.78
The LSTM network has the following structure (see the sketch below):
• 5 layers
• two dropout layers, one between the embedding and LSTM layers and one between the LSTM and dense layers
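Read literally, the five layers above correspond to the following Keras sketch; the vocabulary size, embedding dimension, LSTM width, and dropout rates are assumptions for illustration.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),  # word index -> dense vector
    Dropout(0.5),                                # between embedding and LSTM
    LSTM(128),
    Dropout(0.5),                                # between LSTM and dense
    Dense(1, activation="sigmoid"),              # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A dummy batch of 4 padded reviews, each 100 word indices long.
dummy = np.random.randint(0, 20000, size=(4, 100))
print(model.predict(dummy).shape)    # (4, 1)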
To obtain the best model for each algorithm, we use hyperopt, a distributed hyper-parameter search library, with its random search option. Random search runs fast, and its results are often comparable with those from grid search. Notably, we limit the tuned algorithms to 1-vs-set, W-SVM linear and PI-SVM because the experiments are highly time consuming. The F-scores of the algorithms in Experiment 1 are shown in Table 4.1, Table 4.2 and Table 4.3. We observe significant improvements for our model as well as for 1-vs-set, W-SVM and PI-SVM, which rely on parameter tuning. An ensemble model with our proposed combining method achieves the best F-scores overall.
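A minimal sketch of random-search tuning with hyperopt follows; the logistic-regression objective and search space are illustrative stand-ins for the actual models tuned in this experiment.

from hyperopt import fmin, hp, rand, Trials
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(params):
    clf = LogisticRegression(C=params["C"], max_iter=1000)
    # hyperopt minimizes, so return the negated cross-validated F1 score.
    return -cross_val_score(clf, X, y, cv=3, scoring="f1").mean()

space = {"C": hp.loguniform("C", -5, 5)}
trials = Trials()
best = fmin(objective, space, algo=rand.suggest, max_evals=25, trials=trials)
print("best hyper-parameters:", best)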
We observe a similar trend on the IMDB movie review dataset. We also notice that the homogeneous ensemble models make large gains when hyper-parameter search is used. Our model also obtains higher performance when half of the testing classes are present during training.
Table 4.4: F-scores in Experiment 1a on IMDB movie review dataset

                          |        6 classes        |       12 classes
Experiment on IMDB        | 0.25  0.5   0.75  1     | 0.25  0.5   0.75  1
1-vs-set                  | 0.50  0.57  0.68  0.78  | 0.49  0.55  0.63  0.73
W-SVM linear              | 0.52  0.63  0.69  0.81  | 0.51  0.64  0.72  0.79
PI-SVM linear             | 0.53  0.61  0.71  0.81  | 0.50  0.62  0.64  0.80
Online NNO                | 0.52  0.62  0.75  0.81  | 0.50  0.60  0.63  0.78
Our ensemble model        | 0.53  0.64  0.76  0.82  | 0.52  0.64  0.74  0.80
We perform Experiment 2 assuming that we have prior knowledge of all classes. Table 4.6 illustrates the accuracy of ensemble models with different combining methods compared to single-algorithm solutions. In Table 4.7, we show that our proposed method provides comparable results to common combining methods. Table 4.6 also compares the performance of our ensemble model and several state-of-the-art models, including Logistic Regression and Linear SVM, under a closed set assumption. Our ensemble model outperforms all single classifiers in both tasks. We also report the accuracy of the ensemble model with different combining methods.
To validate the comparison results in this study, we use the two-way ANOVA test at a 95% confidence level. In the open set experiment, our null hypotheses are H01 , that all classifiers perform equally, and H02 , that unknown classes have no impact on performance. The p-value from the ANOVA test is 2.69E-05 for the first hypothesis H01 and 2.64E-08 for the second hypothesis H02 . As a result, we reject both null hypotheses: the classifiers do differ, and unknown classes do impact performance in the first experiment. We also compute a pairwise t-test between our proposed ensemble model and the 1-vs-set classifier to evaluate whether our ensemble model with the proposed combining method makes an improvement.
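For the pairwise comparison, a paired t-test over matched F-scores can be computed with scipy; here the Amazon F-scores of our model and 1-vs-set from Table 4.1 serve as input.

from scipy.stats import ttest_rel

# F-scores from Table 4.1 (10-domain then 20-domain settings).
ours = [0.61, 0.71, 0.79, 0.84, 0.59, 0.67, 0.72, 0.78]
one_vs_set = [0.59, 0.68, 0.74, 0.78, 0.56, 0.62, 0.68, 0.72]

stat, p = ttest_rel(ours, one_vs_set)
print(f"t = {stat:.3f}, p = {p:.4f}")   # p < 0.05 => significant difference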
Table 4.8 indicates that our proposed method is significantly better than Randomized Weighted Majority vote and Majority vote, but not than Dynamic Weighted Majority vote. However, we believe we have demonstrated a solution that combines the outcomes of an ensemble model to improve performance in the open-set environment. Two questions we want to explore further: Does the performance improve because we include the open set assumption in the ensemble model? How does the performance vary with the fraction of unknown classes?
Our proposed combining method illustrates a promising solution for both closed-set and open-set scenarios, although our proposed method is not significantly different in performance compared to Dynamic Weighted Majority vote. However, this is not a fair comparison for two reasons: i) Dynamic Weighted Majority vote requires keeping historical training data, and ii) the experiment was designed to favor existing combining methods. Our method needs less memory and is able to work in the open-set scenario. The contributions of this chapter thus include a combining method that works without storing the full training history.
To aggregate the predicted results of an ensemble model, this work has proposed a novel combining method and opened a direction for future research. Experimental results show that this proposed combining method yields significant improvements. One reason is that our ensemble model does not need expensive, time-consuming hyper-parameter search in either the mixed open set or the traditional setting. We believe there is room for further improvement.
1 Summary
Data mining algorithms have long been applied successfully in a variety of domains. However, the growing volume of data poses challenges that have not been addressed by traditional mining algorithms. In this thesis we studied several such challenges that are not well studied or documented in Natural Language Processing.
This thesis focuses on ensemble learning methodologies, which have recently gained popularity. Traditional ensemble learning models often utilize a single type of classifier to build members generated from different data subsets and different parameter subsets. Accordingly, these members share a common denominator, in that the same data space of the original data set is shared among them.
In this thesis we focus on an approach whereby the data space of each classifier is distinct. We created this setting by assuming that the data becomes available incrementally, which means that the model needs to be trained in an incremental fashion as well. As mentioned, the drawbacks of current online NNO models are overcome by our proposed model. In essence, overcoming this challenge solves the problem of text classification under the open set assumption.
We overcome the challenge of dealing with unknown classes during testing, which exists when data arrives along different timelines, by invoking the idea of online Nearest Non-Outlier (NNO) models for open set problems, which we adapt with clustering methodologies (DBSCAN). We apply our model to the problem of text classification and compare it with other open set solutions.
Our approach needs to overcome various smaller sub-challenges, as addressed in this thesis. One of these is that the likely different data spaces seen during training and testing by each classifier make the combination step of an ensemble model complex. This is evident when one views state-of-the-art ensemble models such as Random Forests and eXtreme Gradient Boosting, which assume the same data space for both training and testing. However, this is not the case for open set models such as NNO and online NNO, which serve as our base classifiers.
Current combination methods for ensemble models, such as Majority vote and Weighted Majority vote, suffer low accuracy when a majority of the classifiers are untrained on the test region. Storing the training history may be a solution, but it wastes memory. Conversely, the solution presented in this thesis does not waste as many resources and uses Bayesian inference over compact summaries of the training regions.
2 Future work
While our work tackles several issues in data mining with ensemble learning, it is not complete. We show that incremental learning plays an important role in real-world data mining problems. However, the number of incremental learning classifiers is relatively small compared to the current pool of data mining classifiers, even among state-of-the-art classifiers. Our belief is that ensemble learning is the best candidate for this task. Our future work is to look at ensembles of tree-based models in two directions: 1) extending ensembles of tree-based classifiers, especially eXtreme Gradient Boosting (XGB), to give them the ability to deal with the open set problem; and 2) studying strategies for updating such ensembles incrementally.
To tackle a large-scale data problem, we will need to split the data space into several subspaces. How many subspaces are optimal remains an open question. Should the number of subspaces be fixed or varied? Our intuition is that the number of subsets may be set equal to the number of target classes; for example, if our task is to classify data with two classes, it may be reasonable to split into two subspaces. However, our study of scalable ensembles offers no guarantee to support this intuition. We think that each data domain may have its own suitable partitioning.
On the other hand, the search for a suitable algorithm for each sub-region has not been discussed in this thesis, even though we list our published paper related to this problem in the publication section. Our current solution has a limitation: it works only if the original dataset has at least four features. We want a more general solution for algorithm selection.
While most of our work experiments on text documents (the NLP domain), we want to study the performance of our proposed methods in other fields, such as computer vision, security, and commerce, which show high interest in incremental learning approaches as well as open set solutions. We also see feasible work for ensemble learning in distributed environments, which is not the focus of this thesis. We hope that future work will look at distributed systems, which are well suited to dealing with big data.
Bibliography
Azizi Abdullah, Remco C Veltkamp, and Marco A Wiering. An ensemble of deep support
vector machines for image categorization. In Soft Computing and Pattern Recognition,
2009. SOCPAR’09. International Conference of, pages 301–306. IEEE, 2009. 17
Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. Senti-
ment analysis of twitter data. In Proceedings of the workshop on Languages in Social
media, pages 30–38. Association for Computational Linguistics, 2011. 101
Arthur Asuncion and David Newman. UCI machine learning repository, 2007. 74
David Barber. Bayesian reasoning and machine learning. Cambridge University Press,
2012. 133
Peter L Bartlett and Marten H Wegkamp. Classification with a reject option using a hinge
loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008. 63
Roberto Battiti and Anna Maria Colla. Democracy in neural nets: Voting schemes for
classification. Neural Networks, 7(4):691–707, 1994. 89
Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999. 14
Abhijit Bendale and Terrance Boult. Towards open world recognition. IEEE Conference on
Computer Vision and Pattern Recognition, 2015a. 23, 61
Abhijit Bendale and Terrance Boult. Towards open world recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 1893–1902,
2015b. 52, 53, 56, 64, 68, 70, 75
James Bergstra, Daniel Yamins, and David Cox. Making a science of model search:
Hyperparameter optimization in hundreds of dimensions for vision architectures. In
International Conference on Machine Learning, pages 115–123, 2013. 103
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New
ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pages 139–148. ACM, 2009.
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Jesse Read, Philipp Kranen, Hardy
Kremer, Timm Jansen, and Thomas Seidl. Moa: a real-time analytics open source
framework. In Machine Learning and Knowledge Discovery in Databases, pages 617–
620. Springer, 2011. 34
Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks
and discriminant analysis in predicting forest cover types from cartographic variables.
Computers and Electronics in Agriculture, 24(3):131–151, 1999. 131
Anna Bosch, Andrew Zisserman, and Xavier Munoz. Image classification using random
forests and ferns. In Computer Vision, IEEE 11th International Conference on, pages 1–8,
2007. 2
Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings
of COMPSTAT, pages 177–186. Springer, 2010. 33, 45
Cristian Bravo, Jose Luis Lobato, Richard Weber, and Gaston L’Huillier. A hybrid system
for probability estimation in multiclass problems combining svms and neural networks.
In Hybrid Intelligent Systems, 2008. HIS’08. Eighth International Conference on, pages
649–654. IEEE, 2008. 62
Leo Breiman. Pasting small votes for classification in large databases and on-line. Machine
Learning, 36(1-2):85–103, 1999. 5, 88
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. 10, 15, 31, 38, 45, 89
Gavin Brown and Ludmila I Kuncheva. Good and bad diversity in majority vote ensembles.
In Multiple classifier systems, pages 124–133. Springer, 2010. 10
Gavin Brown, Jeremy Wyatt, Rachel Harris, and Xin Yao. Diversity creation methods: a
survey and categorisation. Information Fusion, 6(1):5–20, 2005a. 6
Gavin Brown, Jeremy L Wyatt, and Peter Tiňo. Managing diversity in regression ensembles.
Journal of Machine Learning Research, 6(Sep):1621–1650, 2005b. 16
John W Byers, Michael Mitzenmacher, and Georgios Zervas. The Groupon effect on yelp
ratings: a root cause analysis. In Proceedings of the 13th ACM Conference on Electronic
Commerce, pages 248–265. ACM, 2012. 128
Robert Cattral, Franz Oppacher, and Dwight Deugo. Evolutionary data mining with automatic rule generalization.
Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge
university press, 2006. 1
Arjun Chandra and Xin Yao. Evolving hybrid ensembles of learning machines for better
generalisation. Neurocomputing, 69(7):686–700, 2006. 6
Xi Chen and Hemant Ishwaran. Random forests for genomic data analysis. Genomics, 99
(6):323–329, 2012. 2
Alexey Chernov, Yuri Kalnishkan, Fedor Zhdanov, and Vladimir Vovk. Supermartingales in
prediction with expert advice. Theoretical Computer Science, 411(29-30):2647–2669,
2010. 1
Filipe Costa, Michael Eckmann, Walter J. Scheirer, and Anderson Rocha. Open set source
camera attribution. In XXV SIBGRAPI - Conference on Graphics, Patterns and Images,
August 2012. 56
Maria Couceiro, Paolo Romano, Nuno Carvalho, and Luı́s Rodrigues. D2stm: Dependable
distributed software transactional memory. In Dependable Computing, 2009. PRDC’09.
15th IEEE Pacific Rim International Symposium on, pages 307–313. IEEE, 2009. 49
Antonio Criminisi, Jamie Shotton, and Ender Konukoglu. Decision forests: A unified
framework for classification, regression, density estimation, manifold learning and semi-
supervised learning. Now, 2012. 31
Padraig Cunningham and John Carney. Diversity versus quality in classification ensembles
based on feature selection. In Machine Learning: European Conference on Machine
Learning, pages 109–116. Springer, 2000. 7
Hai H Dam, Hussein A Abbass, Chris Lokan, and Xin Yao. Neural-based learning classifier
systems. IEEE Transactions on Knowledge and Data Engineering, 20(1):26–39, 2008.
16
Rocco De Rosa. Confidence decision trees via online and active learning for streaming (big)
data. arXiv preprint arXiv:1604.03278, 2016. 41
Rocco De Rosa, Thomas Mensink, and Barbara Caputo. Online open world recognition.
arXiv preprint arXiv:1604.02275, 2016. 56, 61, 64, 65, 68, 69, 71, 72, 76, 77
Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters.
Communications of the ACM, 51(1):107–113, 2008. 35
Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research, 7(Jan):1–30, 2006. 134
Misha Denil, David Matheson, and Nando de Freitas. Consistency of online random forests.
arXiv preprint arXiv:1302.4853, 2013. 31, 34
Joachim Diederich, Jörg Kindermann, Edda Leopold, and Gerhard Paass. Authorship
attribution with support vector machines. Applied Intelligence, 19(1-2):109–123, 2003.
51
Thomas G Dietterich. Ensemble learning. The handbook of Brain Theory and Neural
Networks, 2:110–125, 2002. 6
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the
sixth ACM SIGKDD international conference on Knowledge Discovery and Data Mining,
pages 71–80. ACM, 2000. 32, 41
Cı́cero Nogueira dos Santos and Bianca Zadrozny. Learning character-level representations
for part-of-speech tagging. In ICML, pages 1818–1826, 2014. 18
Peijun Du, Alim Samat, Björn Waske, Sicong Liu, and Zhenhong Li. Random forest and
rotation forest for fully polarized sar image classification using polarimetric and spatial
features. ISPRS Journal of Photogrammetry and Remote Sensing, 105:38–53, 2015. 2
Saso Džeroski and Bernard Ženko. Is combining classifiers with stacking better than
selecting the best one? Machine Learning, 54(3):255–273, 2004. 6, 8
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm
for discovering clusters in large spatial databases with noise. In KDD, pages 226–231,
1996. 70, 71, 81
Geli Fei and Bing Liu. Breaking the closed world assumption in text classification. In
Proceedings of NAACL-HLT, pages 506–514, 2016. 57, 63, 74, 75, 76, 77, 79, 80, 81, 82
Joseph L Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and
proportions. John Wiley & Sons, 2013. 7
Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996. 6, 88
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning,
volume 1. Springer series in statistics Springer, Berlin, 2001. 14
Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis,
38(4):367–378, 2002. 35
Juergen Gall and Victor Lempitsky. Class-specific hough forests for object detection. In
Decision Forests for Computer Vision and Medical Image Analysis, pages 143–157.
Springer, 2013. 15
Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine
Learning, 63(1):3–42, 2006. 35, 45, 89
Giorgio Giacinto and Fabio Roli. Design of effective neural network ensembles for image
classification purposes. Image and Vision Computing, 19(9):699–707, 2001. 8
Prasanta Gogoi, Monowar H Bhuyan, DK Bhattacharyya, and Jugal K Kalita. Packet and
flow based network intrusion dataset. In Contemporary Computing, pages 322–334.
Springer, 2012. 2
Simon Günter and Horst Bunke. Creation of classifier ensembles for handwritten word
recognition using feature selection algorithms. In Frontiers in Handwriting Recognition,
2002. Proceedings. Eighth International Workshop on, pages 183–188. IEEE, 2002. 32
Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on
Pattern Analysis & Machine Intelligence, 4(10):993–1001, 1990. 7, 8
Jingrui He, Yan Liu, and Richard Lawrence. Graph-based rare category detection. In Data
Mining. Eighth IEEE International Conference on, pages 833–838. IEEE, 2008. 53
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detec-
tors. arXiv preprint arXiv:1207.0580, 2012. 49
Tin Kam Ho. The random subspace method for constructing decision forests. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844, 1998. 6, 7, 24,
88
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal
of the American Statistical Association, 58(301):13–30, 1963. 64
Tzu-Kuo Huang, Ruby C Weng, and Chih-Jen Lin. Generalized bradley-terry models and
multi-class probability estimates. Journal of Machine Learning Research, 7(Jan):85–115,
2006. 62
Michael Jahrer, Andreas Töscher, and Robert Legenstein. Combining predictions for
accurate recommender systems. In Proceedings of the 16th ACM SIGKDD international
conference on Knowledge Discovery and Data Mining, pages 693–702. ACM, 2010. 16
Lalit P. Jain, Walter J. Scheirer, and Terrance E. Boult. Multi-class open set recognition
using probability of inclusion. In The European Conference on Computer Vision (ECCV),
September 2014a. 62, 63, 64, 76
Lalit P Jain, Walter J Scheirer, and Terrance E Boult. Multi-class open set recognition using
probability of inclusion. In Computer Vision–European Conference on Computer Vision,
pages 393–409. Springer, 2014b. 23, 53
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to
statistical learning, volume 6. Springer, 2013. 13
Gareth M James. Variance and bias for general loss functions. Machine Learning, 51(2):
115–135, 2003. 12
Hongliang Jin, Qingshan Liu, and Hanqing Lu. Face detection using one-class-based support
vectors. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE
International Conference on, pages 457–462. IEEE, 2004. 56
Rong Jin, Yan Liu, Luo Si, Jaime G Carbonell, and Alexander Hauptmann. A new boosting
algorithm using input-dependent regularizer. Pattern Analysis & Applications, 2003. 92
Jindal. Finding local experts from Yelp dataset. Master’s thesis, University of Illinois at
Urbana-Champaign, Illinois, 2015. 128
Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the 2008
International Conference on Web Search and Data Mining, pages 219–230. ACM, 2008.
74
George H John and Pat Langley. Estimating continuous distributions in bayesian classifiers.
In 11 Conference on Uncertainty in AI, pages 338–345. Morgan Kaufmann Publishers
Inc., 1995. 32
Amitava Karmaker, Kihoon Yoon, Chau Nguyen, and Stephen Kwek. iboost: Boosting
using an instance-based exponential weighting scheme. International Journal of Hybrid
Intelligent Systems, 4(4):243–254, 2007. 92
Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On bayesian upper confidence
bounds for bandit problems. In AISTATS, pages 592–600, 2012. 99
Hyunjoong Kim, Hyeuk Kim, Hojin Moon, and Hongshik Ahn. A weight-adjusted voting
algorithm for ensembles of classifiers. Journal of the Korean Statistical Society, 40(4):
437–449, 2011. 7, 91
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural
language models. arXiv preprint arXiv:1508.06615, 2015. 18
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014. 103
Ron Kohavi, David H Wolpert, et al. Bias plus variance decomposition for zero-one loss
functions. In ICML, volume 96, pages 275–83, 1996. 10
Eun Bae Kong and Thomas G Dietterich. Error-correcting output coding corrects bias and
variance. In ICML, pages 313–321, 1995. 12
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Authorship attribution in the wild.
Language Resources and Evaluation, 45(1):83–94, 2011. 23
Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems.
arXiv preprint arXiv:1402.6028, 2014. 97
Ludmila I Kuncheva. A theoretical study on six classifier fusion strategies. IEEE Transac-
tions on Pattern Analysis & Machine Intelligence, 2(2):281–286, 2002. 86
Ludmila I Kuncheva. Combining pattern classifiers: methods and algorithms. John Wiley
& Sons, 2004. 18
Ludmila I Kuncheva and Juan J Rodrı́guez. A weighted voting framework for classifiers
ensembles. Knowledge and Information Systems, 38(2):259–275, 2014. 98
Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. Mondrian forests: Efficient
online random forests. In Advances in Neural Information Processing Systems, pages
3140–3148, 2014. x, 22, 35, 36, 37, 38, 41
Pavel Laskov, Christian Gehl, Stefan Krüger, and Klaus-Robert Müller. Incremental support
vector learning: Analysis, implementation and applications. Journal of Machine Learning
Research, 7(Sep):1909–1936, 2006. 33, 66
Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time
series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. 101
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 17,
131
Vincent Lepetit and Pascal Fua. Keypoint recognition using random forests and random
ferns. In Decision Forests for Computer Vision and Medical Image Analysis, pages
111–124. Springer, 2013. 15
Jiexun Li, Rong Zheng, and Hsinchun Chen. From fingerprint to writeprint. Communications
of the ACM, 49(4):76–82, 2006. 23
Chieh-Yen Lin and et al. Large-scale logistic regression and linear support vector machines
using spark. In Big Data, 2014 IEEE, pages 519–528. IEEE, 2014. 23
Georg Lindgren, Holger Rootzén, Dag Tjøstheim, Richard A Davis, L Nilsson, S Uvell,
Anders Milhøj, and Jacques de Maré. Extreme values: Theory and technical applications
[with discussion and reply]. Scandinavian journal of statistics, pages 241–279, 1987. 62
Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. In Foundations
of Computer Science, 1989., 30th Annual Symposium on, pages 256–261. IEEE, 1989.
21, 92, 93
Yong Liu and Xin Yao. Ensemble learning via negative correlation. Neural Networks, 12
(10):1399–1404, 1999. 16
Gilles Louppe and Pierre Geurts. Ensembles on random patches. In Joint European
Conference on Machine Learning and Knowledge Discovery in Databases, pages 346–
361. Springer, 2012. 88
Yucheng Low, Joseph E Gonzalez, Aapo Kyrola, Danny Bickson, Carlos E Guestrin, and
Joseph Hellerstein. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.
Michael Luca. Reviews, reputation, and revenue: The case of yelp. com. Com (September
16, 2011). Harvard Business School NOM Unit Working Paper, 2(12-016), 2011. 2
Leandros A Maglaras, Jianmin Jiang, and Tiago J Cruz. Combining ensemble methods and
social network metrics for improving accuracy of ocsvm on intrusion detection in scada
systems. Journal of Information Security and Applications, 30:15–26, 2016. 2
Larry M Manevitz and Malik Yousef. One-class svms for document classification. Journal
of Machine Learning Research, 2(Dec):139–154, 2001. 56
Javier Marin, David Vázquez, Antonio M López, Jaume Amores, and Bastian Leibe. Ran-
dom forests of local experts for pedestrian detection. In Computer Vision (ICCV), 2013
IEEE International Conference on, pages 2592–2599. IEEE, 2013. 15
Sancho McCann and David G Lowe. Local naive bayes nearest neighbor for image classifi-
cation. In CVPR, 2012 IEEE Conference on, pages 3650–3656. IEEE, 2012. 62
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed
representations of words and phrases and their compositionality. In Advances in Neural
Information Processing Systems, pages 3111–3119, 2013. 43, 101
Michael Muhlbaier, Apostolos Topalis, and Robi Polikar. Learn++. mt: A new approach to
incremental learning. In International Workshop on Multiple Classifier Systems, pages
52–61. Springer, 2004. 94
Michael D Muhlbaier, Apostolos Topalis, and Robi Polikar. Learn nc: Combining ensemble
of classifiers with dynamically weighted consult-and-vote for efficient incremental learn-
ing of new classes. IEEE Transactions on Neural Networks, 20(1):152–168, 2009. 94,
98
Marius Muja and David G Lowe. Fast approximate nearest neighbors with automatic
algorithm configuration. VISAPP (1), 2, 2009. 16
Robert Neumayer. Clustering based ensemble classification for spam filtering. Citeseer,
2006. 2
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification
using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical
methods in Natural Language processing-Volume 10, pages 79–86. Association for
Computational Linguistics, 2002. 101
Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for
word representation. In EMNLP, volume 14, pages 1532–1543, 2014. 17
Adam H Peterson and Tony R Martinez. Estimating the potential for combining learning
models. In Proceedings of the ICML workshop on Meta-Learning, pages 68–75, 2005. 6
John Platt et al. Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74,
1999. 62, 63
Robi Polikar, Lalita Upda, Satish S Upda, and Vasant Honavar. Learn++: An incremental
learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man,
and Cybernetics, part C (applications and reviews), 31(4):497–508, 2001. 93
Ajita Rattani, Walter J. Scheirer, and Arun Ross. Open set fingerprint spoof detection across
novel fabrication materials. IEEE Transactions on Information Forensics and Security
(T-IFS), 10, November 2015a. 56
Ajita Rattani, Walter J Scheirer, and Arun Ross. Open set fingerprint spoof detection across
novel fabrication materials. Information Forensics and Security, IEEE Transactions on,
10(11):2447–2460, 2015b. 53
Shuxia Ren, Yangyang Lian, and Xiaojian Zou. Incremental naive Bayesian learning
algorithm based on classification contribution degree. Journal of Computers, 9(8):1967–
1974, 2014. 33, 45
Marko Ristin, Matthieu Guillaumin, Juergen Gall, and Luc Van Gool. Incremental learning
of ncm forests for large-scale image classification. In Computer Vision and Pattern
Recognition (CVPR), pages 3654–3661. IEEE, 2014. 15, 33, 61, 64, 70
Anderson Rocha, Walter J Scheirer, et al. Authorship attribution for social media forensics. IEEE Transactions on Information Forensics and Security, 2017. 56
Juan José Rodriguez, Ludmila I Kuncheva, and Carlos J Alonso. Rotation forest: A
new classifier ensemble method. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 28(10):1619–1630, 2006. 35
Daniel M Roy and Yee W Teh. The mondrian process. In Advances in Neural Information
Processing Systems, pages 1377–1384, 2009. 22, 30, 35, 36, 38, 39, 41
Dymitr Ruta and Bogdan Gabrys. Analysis of the correlation between majority voting error
and the diversity measures in multiple classifier systems. Internal Report, 2001. 7, 90
Mehrnoush Famil Saeedian and Hamid Beigy. Learning to filter spam emails: An ensemble
learning approach. International Journal of Hybrid Intelligent Systems, 9(1):27–43, 2012.
2
Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof. On-line
random forests. In Computer Vision Workshops IEEE 12th International Conference on,
pages 1393–1400. IEEE, 2009. 41, 45
Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent
neural network architectures for large scale acoustic modeling. In Fifteenth Annual
Conference of the International Speech Communication Association, 2014. 18, 101
Alim Samat, Peijun Du, Sicong Liu, Jun Li, and Liang Cheng. Ensemble extreme learn-
ing machines for hyperspectral image classification. Selected Topics in Applied Earth
Observations and Remote Sensing, IEEE Journal of, 7(4):1060–1069, 2014. 2
G Bruce Schaalje and Paul J Fields. Open-set nearest shrunken centroid classification.
Communications in Statistics-Theory and Methods, 41(4):638–652, 2012. 51, 52, 56, 57,
62
G Bruce Schaalje, Paul J Fields, Matthew Roper, and Gregory L Snow. Extended nearest
shrunken centroid classification: a new method for open-set authorship attribution of texts
of varying sizes. Literary and Linguistic Computing, 26(1):71–88, 2011. 51, 57, 67
Walter J Scheirer, Anderson Rocha, Ross J Micheals, and Terrance E Boult. Meta-
recognition: The theory and practice of recognition score analysis. Pattern Analysis
and Machine Intelligence, IEEE Transactions on, 33(8):1689–1695, 2011. 62, 63
Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. To-
ward open set recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, 35(7):1757–1772, 2013a. 23, 52, 53, 57, 68
Walter J. Scheirer, Anderson Rocha, Archana Sapkota, and Terrance E. Boult. Towards
open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence
(T-PAMI), 36, July 2013b. 58, 60, 63, 64, 76, 81
Walter J. Scheirer, Lalit P. Jain, and Terrance E. Boult. Probability models for open set
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI),
36, November 2014a. 62, 64
Walter J Scheirer, Lalit P Jain, and Terrance E Boult. Probability models for open set
recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(11):
2317–2324, 2014b. 57
Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C
Williamson. Estimating the support of a high-dimensional distribution. Neural computa-
tion, 13(7):1443–1471, 2001. 57
Yufeng Shen and Yingchun Yang. A novel data description kernel based on one-class svm
for speaker verification. In 2007 IEEE International Conference on Acoustics, Speech
and Signal Processing-ICASSP’07, volume 2, pages II–489. IEEE, 2007. 56
Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, An-
drew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality
over a sentiment treebank. In Proceedings of the conference on Empirical Methods in
Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer, 2013. 43
Robin Sommer and Vern Paxson. Outside the closed world: On using machine learning for
network intrusion detection. In Security and Privacy (SP), 2010 IEEE Symposium on,
pages 305–316. IEEE, 2010. 23
Ariel Stolerman, Rebekah Overdorf, Sadia Afroz, and Rachel Greenstadt. Classify, but
verify: Breaking the closed-world assumption in stylometric authorship attribution. In
IFIP Working Group, volume 11, 2011. 23
Ariel Stolerman, Rebekah Overdorf, Sadia Afroz, and Rachel Greenstadt. Classify, but
verify: Breaking the closed-world assumption in stylometric authorship attribution. In
IFIP Working Group, volume 11, 2013. 64
Ariel Stolerman, Rebekah Overdorf, Sadia Afroz, and Rachel Greenstadt. Breaking the
closed-world assumption in stylometric authorship attribution. In IFIP International
Conference on Digital Forensics, pages 185–205. Springer, 2014. 57
Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali-A Ghorbani. A detailed analysis of
the kdd cup 99 data set. In Proceedings of the Second IEEE Symposium on Computational
Intelligence for Security and Defence Applications 2009, 2009. 2
Rosa Maria Valdovinos and Jose Salvador Sanchez. Combining multiple classifiers with
dynamic weighted voting. In International Conference on Hybrid Artificial Intelligence
Systems, pages 510–516. Springer, 2009. 92
Vladimir Vovk and Fedor Zhdanov. Prediction with expert advice for the brier game. Journal
of Machine Learning Research, 10(Nov):2445–2471, 2009. 1
Matthew Whitehead and Larry Yaeger. Sentiment mining using ensemble classification
models. In Innovations and Advances in Computer Sciences and Engineering, pages
509–514. Springer, 2010. 2
Gerhard Widmer and Miroslav Kubat. Learning in the presence of concept drift and hidden
contexts. Machine Learning, 23(1):69–101, 1996. 25
Michael J Wilber, Walter J Scheirer, Philipp Leitner, Brian Heflin, James Zott, Daniel
Reinke, David K Delaney, and Terrance E Boult. Animal recognition in the mojave desert:
Vision tools for field biologists. In Applications of Computer Vision (WACV), 2013 IEEE
Workshop on, pages 206–213. IEEE, 2013. 54
Terry Windeatt. Diversity measures for multiple classifier system analysis and design.
Information Fusion, 6(1):21–36, 2005. 12
Pengyi Yang, Yee Hwa Yang, Bing B Zhou, and Albert Y Zomaya. A review of ensemble
methods in bioinformatics. Current Bioinformatics, 5(4):296–308, 2010. 2
Xinzhu Yang, Bo Yuan, and Wenhuang Liu. Dynamic weighting ensembles for incremental
learning. In Pattern Recognition, 2009. CCPR 2009. Chinese Conference on, pages 1–5.
IEEE, 2009. 32
Xiao Yu, Xiang Ren, Yizhou Sun, Bradley Sturt, Urvashi Khandelwal, Quanquan Gu, Bran-
don Norick, and Jiawei Han. Recommendation in heterogeneous information networks
with implicit user feedback. In Proceedings of the 7th ACM Conference on Recommender
Systems, pages 347–350. ACM, 2013. 128
Guo-Xun Yuan and et al. Recent advances of large-scale linear classification. Proceedings
of the IEEE, 100(9):2584–2603, 2012. 80
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy
McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed
datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of
the 9th USENIX conference on Networked Systems Design and Implementation, pages
2–2. USENIX Association, 2012. 35
Jianming Zhang, Shugao Ma, and Stan Sclaroff. Meem: robust tracking via multiple experts
using entropy minimization. In European Conference on Computer Vision, pages 188–203.
Springer, 2014. 2
Shiliang Zhang, Hui Jiang, Mingbin Xu, Junfeng Hou, and Lirong Dai. The fixed-size
ordinally-forgetting encoding method for neural network language models. In Proceedings
of ACL, 2015a. 43
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text
classification. In Advances in Neural Information Processing Systems, pages 649–657,
2015b. 18
Dusanka Zupanski and Milija Zupanski. Model error estimation employing an ensemble
data assimilation approach. Monthly Weather Review, 134(5):1337–1354, 2006. 87
Appendix A
DATASETS OVERVIEW
Businesses compete with each other throughout their operation. They need customers, and they want customers to remember them as their first choice. Media advertising is often considered a high expense that strains the limited budgets of small and medium businesses. Having their own website helps businesses reach out to their customers and receive constructive feedback. Websites have their own setbacks, however: information can spread quickly to large communities, and a website is of little help to a start-up that has not yet had any customers.
Businesses often want to know what customers think about the quality of their services in order to improve and make more profit. They often spend large amounts of money on advertisements to promote their services and build their image; indeed, successful businesses can be recognized by their profit and their customers' satisfaction. Many social networks (e.g., Yelp) provide a platform where customers can express their feelings and thoughts, offering guidelines that help other customers decide on a business.
Figure 1.2: Star vs useful votes between elite and non-elite users
Why Yelp? Yelp promotes interaction and the sharing of information among communities where one end is the user and the other is the business. Yelp provides a public forum for its registered members to share their thoughts or feelings about the services they pay for (see Figure 1.2). This online forum has also allowed businesses to get customer feedback, and the number of reviews has increased dramatically since 2006. By getting a better rating (more stars), businesses gain a channel for promoting themselves to potential customers, or they can improve their service to keep their customers and achieve their goals. Other social networking data, such as Amazon reviews, where the number of review texts is over a million, are also used in this study.
In the Yelp datasets, an elite flag indicates a confidence level for reviewers. Elite reviewers usually receive more "useful" votes on their reviews, based on the ratings other Yelp users give them. Elite users often post at least 20 reviews and tend to give 3-4 stars for the services presented in their reviews. Non-elite users have few reviews and tend to spread their votes across the rating scale (see Figure 1.2). Word usage is concentrated around the common 3- and 4-star reviews (see Figure 1.3). For example, restaurant goers may want to learn from others' experiences using a variety of criteria such as food quality, service, ambiance, discounts and worthiness. Food and service play the most important role in customer satisfaction, and ambiance reflects the look and feel of the place. While discounts indicate the promotion strategy used to attract customers, worthiness expresses the level of satisfaction with the services the customers pay for, relative to the cost of these services.
Yelp users may post reviews and ratings of businesses and services, or simply express their thoughts on other reviews. Negative reviews from one person's perspective may affect potential customers' decisions; for example, a potential customer may cancel a service and persuade others to do the same. A rating typically reflects the overall experience at a specific time of service use, but it may change over time. In addition, a star rating does not explain why customers rate a specific business with a thumbs up or a thumbs down. Analyzing the customers' comments allows a better understanding of how a number of individual rating stars can make or break a business.
There is no lack of studies on the datasets from different viewpoints that unveil valuable
information on a wide range of topics such as the effect of promotional strategies (Byers
et al., 2012), the benefit of retrieving knowledge from implicit user feedback (Yu et al.,
2013), or the important role of local reviewers (Jindal, 2015).
Surprisingly, most published studies of sentiment analysis on these datasets use traditional data mining methods in which time is not a main factor. Figure 1.5 illustrates that consumers posted more negative reviews in the early years of Yelp's operation (2006-2007) but gave more positive feedback from 2010 onward. Ignoring the timeline may affect the reliability of the results, since it treats outdated information as being as important as current information. Such methods lack the ability to capture rapid changes in knowledge, so their findings are valid only under a static view of business activities.
In the real world, an evolving data source such as Yelp may be better modeled with a
dynamic approach that accurately reflects continuous changes in business activities. Incremental
learning, an example of such a dynamic approach, has the ability to learn a new concept
without retraining on the entire dataset. The question is how to quantify the ways customers and
businesses are influenced, and how business ratings change in response to recent feedback,
using an incremental learning approach.
We obtained over 2.2 million reviews from the Yelp Dataset Challenge. In the first
study, we focus on a subset of businesses in the restaurant category (25,071 records containing
the word 'restaurant'), while in the last study we consider all businesses with all available information.
As a result, our first dataset for sentiment reviews includes both chain restaurants (such
as McDonald's, Wendy's, Steak 'n Shake, etc.) and independent restaurants. Chain
restaurants often share a similar ambiance, menu and service, but each individual restaurant still
uses the review system for feedback to attract more customers. In fact, the quality of public
reviews has been used as a main metric to measure the actual quality of independent
restaurants.
Yelp reviews contain information in raw text format, along with a posting date and a
number of rating stars. In particular, Yelp allows users to vote on each review to indicate
how they think a particular review fits into three categories (useful, cool and funny). The
largest number of votes is typically for the useful rating, while the smallest is for the funny
rating (see Figure 1.6). Useful votes express what people think about the review of the
business: they agree with what the reviewer says. A funny vote softens a negative review,
making it read as less harsh than saying "it sucks" outright. These votes allow consumers
to tell other consumers which reviews they like and why.
We found that reviews posted by elite users with high useful votes often average 3.5 to
4 stars, and 2.5 to 5 stars for low useful votes (refer to Figure 1.2). For non-elite
users, the same phenomenon is observed for high useful values, while counts below 1000 useful
votes appear inconsistent. A low number of useful votes may indicate either a positive
review (more stars) or a negative review (few stars), while a review with a high number of useful
votes can be considered unbiased regardless of explicit user status. Funny votes, in fact, are more
ambiguous and tend to be an indicator of negative sentiment.
Ratings ranging from 1 (worst) to 5 (best) stars are given by reviewers, while an
average rating in the same range is generated by Yelp. While 3 stars can be considered
neutral, we observe that implicit negative sentiment has become more common over the
years, so that the business rating generated by Yelp (BStar) and the ratings given by its own
reviewers (UStar) have been moving in opposite directions. Rating stars given by elite users
are often close to Yelp's ratings, while non-elite users often rate differently from Yelp.
When there is a large difference between the two ratings, a low number of votes for the
corresponding review may represent a vote from an unfair competitor or a disgruntled
employee. For example, a reviewer gives 2 stars (refer to the 9th row in Table ??) while the
business being reviewed received 4.5 stars from Yelp. To avoid this problem, we exclude
any review from a non-elite user who currently has fewer useful votes than the number of
reviews posted. In addition, these non-elite users must have rated more than one restaurant.
Finally, we retain a total of 113,197 users with 424,521 reviews of 12,437 restaurants for
this study. We assign a label of positive sentiment for 4- and 5-star reviews and negative
sentiment for 1- and 2-star reviews. Generating features is a challenge in incremental learning
because we do not know all feature values in advance. In this study, we assume that we have the
entire corpus available for the purpose of extracting features, as sketched below.
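A minimal sketch of this filtering-and-labeling pipeline is shown below, assuming the reviews sit in a pandas DataFrame. The column names (user_id, business_id, stars, elite, useful_votes, review_count, text) and the file path are hypothetical placeholders; this illustrates the rules described above rather than the exact preprocessing code used in this study.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One row per review; column names here are hypothetical placeholders.
reviews = pd.read_json("yelp_reviews.json", lines=True)

# Exclude reviews from non-elite users with fewer useful votes than reviews posted.
keep = reviews["elite"] | (reviews["useful_votes"] >= reviews["review_count"])
reviews = reviews[keep]

# Non-elite users must also have rated more than one restaurant.
n_places = reviews.groupby("user_id")["business_id"].transform("nunique")
reviews = reviews[reviews["elite"] | (n_places > 1)]

# Label sentiment: 4-5 stars positive (1), 1-2 stars negative (0); drop 3-star reviews.
reviews = reviews[reviews["stars"] != 3]
reviews["sentiment"] = (reviews["stars"] >= 4).astype(int)

# Feature extraction over the full corpus fixes the vocabulary up front,
# matching the assumption stated above for incremental learning.
vectorizer = TfidfVectorizer(max_features=50000, stop_words="english")
X = vectorizer.fit_transform(reviews["text"])
y = reviews["sentiment"].values
```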
• MNIST: This dataset (LeCun et al., 1998) is a subset of the handwritten digits available
from NIST. It includes 60,000 training and 10,000 test images across 10 classes. Each
image is represented by a 784-dimensional feature vector.
• Forest Cover Type: This dataset (Blackard and Dean, 1999) includes 581,012 examples,
each represented by 54 features and belonging to one of 7 classes. Each
example gives the forest cover type for a 30 x 30 meter cell on the ground, collected by
the US Geological Survey and the US Forest Service.
• Poker: This dataset (Cattral et al., 2002) includes 1,025,010 examples, where each
example is represented by 10 features and belongs to one of 10 classes. Each card is
described by two attributes: suit and rank. The class denotes the value of a poker hand.
After removing duplicates, there are 1,000,000 instances.
Appendix B
BAYESIAN INFERENCE
p(H|D) = \frac{p(D|H)\, p(H)}{p(D)}    (B.1)
where
• p(H): the prior probability of the hypothesis H
• p(D|H): the likelihood of the data D under the hypothesis H
• p(D): the evidence, i.e., the marginal probability of the data
• p(H|D): the posterior probability of H after observing D
Bayes' theorem allows a model to take prior knowledge into account. The prior probability
of an event describes the probability of the event before a new collection of data is observed.
Once we learn about H from the data, the adjusted probability is the posterior
probability. Since the evidence does not depend on H, Bayes' theorem can be written as
p(H|D) ∝ p(D|H)p(H).
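As a concrete numerical illustration of Equation B.1 (with invented numbers, not data from this study), the following sketch updates a uniform prior over two hypotheses after a single observation:

```python
# Two hypotheses about a review stream: H1 = "mostly positive", H2 = "mostly negative".
# All numbers are invented, purely to illustrate Equation B.1.
prior = {"H1": 0.5, "H2": 0.5}            # p(H): prior probabilities
likelihood = {"H1": 0.8, "H2": 0.3}       # p(D|H): chance of seeing a positive review

# Evidence p(D) = sum over hypotheses of p(D|H) * p(H)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Posterior p(H|D) = p(D|H) * p(H) / p(D)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)   # {'H1': 0.727..., 'H2': 0.272...}
```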
Several data mining algorithms were developed in the field of statistics, particularly regression.
These probabilistic models are also known as Bayesian models (Barber, 2012) because
they describe the relationship between variables and data in the presence of real-world
uncertainty using probabilities. Without a Bayesian approach, any data-generating process is
formulated as a stochastic model with noise, and the objective is to maximize the probability of
observing the data. Using a Bayesian approach, a data mining model (classifier) can be defined
as a hypothesis given input data from some distributions (refer to Figure 0.1).
A Bayesian inference framework studies the distribution of model parameters and finds the
parameters that maximize the probability of observing the given data. We can thus select the best
model from a family of models for a given dataset. This illustrates the learning-to-learn concept
that we present in our first topic. We address the challenge of unknown objects by taking into
account the uncertainty of real-world problems with the help of Bayes' theorem. We
further use Bayesian inference to optimize the prediction method of an ensemble
model.
The Bayesian inference framework also supports an ensemble learning approach
through mixture models. Real-world data are often too complicated to describe with a single
distribution. The Bayesian framework uses a mixture model when the data are described by a mixture
of distributions, much as we use ensemble learning to tackle a large-data problem. For
example, consider a student dataset at university A that includes gender and age. We
may have two distributions of male and female students, or of traditional and non-traditional
students. While each of these groups may be represented by a Gaussian distribution, the population
of exchange students or non-credit enrolled students does not follow the same distributions. Since
the model representing each such group has its own parameters, a collection of different models may
provide better predictive power than a single model. A sketch of this idea appears below.
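The following sketch, using scikit-learn's GaussianMixture on synthetic data invented for illustration, mirrors this idea: we fit a family of mixture models and let the Bayesian Information Criterion select the best member of the family.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the student-age example: two subpopulations
# (e.g., traditional and non-traditional students), invented for illustration.
ages = np.concatenate([rng.normal(20, 1.5, 500),
                       rng.normal(32, 4.0, 200)]).reshape(-1, 1)

# Fit a family of mixture models and select the member with the lowest BIC.
models = [GaussianMixture(n_components=k, random_state=0).fit(ages)
          for k in range(1, 5)]
best = min(models, key=lambda m: m.bic(ages))
print("chosen number of components:", best.n_components)
print("component means:", best.means_.ravel())
```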
The correlation coefficient between two random variables X and Y is defined as
\rho_{X,Y} = \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}    (B.2)
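Equation B.2 can be checked numerically; the sketch below uses NumPy on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y), as in Equation B.2
rho = np.cov(x, y, ddof=0)[0, 1] / (x.std() * y.std())
print(rho)   # agrees with np.corrcoef(x, y)[0, 1]
```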
2 Statistical tests
In a comparative experiment, we want to conclude whether algorithm A is
better than algorithm B. Such a conclusion should be supported by a statistical test, such as
the Wilcoxon signed-rank test or an ANOVA test. Several statistical tools are used to verify our claims
and validate the reported performances. Throughout this study, we compare the performance
of several algorithms on multiple datasets. To avoid drawing conclusions from comparative
results that arise by chance, we follow the recommendations of Demšar (Demšar, 2006)
and use the non-parametric Friedman test to evaluate the results.
One advantage of the Friedman test is that it requires no assumptions about the
distribution of the data. The classifiers are ranked on N datasets, where a rank of 1 indicates
the best model. The test compares the following hypotheses:
H0: All classifiers have similar performance vs. Ha: There is a significant difference in
the performance of the classifiers.
Each algorithm is ranked on each dataset separately based on the performance measure.
We compute the mean rank of a particular classifier j over the n datasets as
\bar{R}_j = \frac{1}{n} \sum_{i=1}^{n} R_{ij}. Let S_{total} and S_{error} be the sum of
the squared totals (the variation in the ranks) and the sum of the squared errors, respectively.
The Friedman test statistic is defined as
\chi^2 = \frac{S_{total}}{S_{error}}
The statistic follows a Chi-square distribution with k - 1 degrees of freedom, where k is the
number of classifiers. If the p-value resulting from this test is small (< 0.05), there is
sufficient evidence to reject the null hypothesis.
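A minimal sketch of this test with SciPy's friedmanchisquare, assuming we have a performance score for each of three classifiers on the same six datasets (all values invented for illustration):

```python
from scipy.stats import friedmanchisquare

# Accuracy of three classifiers on the same six datasets (rows aligned
# by dataset); all values are invented for illustration.
clf_a = [0.85, 0.80, 0.91, 0.78, 0.88, 0.83]
clf_b = [0.82, 0.79, 0.88, 0.75, 0.85, 0.80]
clf_c = [0.84, 0.81, 0.90, 0.77, 0.86, 0.82]

stat, p_value = friedmanchisquare(clf_a, clf_b, clf_c)
if p_value < 0.05:
    print(f"Reject H0: classifiers differ significantly (p = {p_value:.4f})")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f})")
```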
If H0 is rejected, the post-hoc pairwise Nemenyi test is recommended (Demšar, 2006)
to locate where the differences lie. Two classifiers perform significantly differently if their
average ranks R_i and R_j differ by more than a critical threshold. The test statistic is
computed as
z = \frac{R_i - R_j}{\sqrt{\frac{M(M+1)}{6N}}}    (B.3)
where M is the number of classifiers and N is the number of datasets.
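A sketch of the pairwise statistic in Equation B.3 with NumPy, using hypothetical mean ranks for M classifiers over N datasets:

```python
import numpy as np
from scipy.stats import norm

M, N = 3, 6                                # M classifiers, N datasets (hypothetical)
mean_ranks = np.array([1.4, 2.5, 2.1])     # invented mean ranks, one per classifier

def nemenyi_z(r_i, r_j, m, n):
    """z statistic of Equation B.3 for two classifiers' mean ranks."""
    return (r_i - r_j) / np.sqrt(m * (m + 1) / (6.0 * n))

z = nemenyi_z(mean_ranks[0], mean_ranks[1], M, N)
# Two-sided p-value; in practice it must be corrected for multiple comparisons.
p = 2 * norm.sf(abs(z))
print(f"z = {z:.3f}, p = {p:.4f}")
```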