Using Weighted Nearest Neighbor to Benefit from Unlabeled Data
1 Introduction
The combination of supervised and unsupervised learning [1] is a growing sub-
field of Machine Learning. Applications such as text- or image-mining and molec-
ular profiling have revealed application areas that yield very little (and often
expensive) labeled data but often plenty of unlabeled data. As traditional ma-
chine learning algorithms are not able to use and benefit from the information
available in the unlabeled data, custom-built algorithms should be able to outperform them. Current research in semi-supervised learning using algorithms such
as Co-Training [2] or more recent approaches based on graph representations [3]
confirms that this is indeed possible.
Most of the semi-supervised learning approaches use the labeled and unla-
beled data simultaneously or at least in close collaboration. Roughly speaking,
the unlabeled data provides information about the structure of the domain, i.e.
it helps to capture the underlying distribution of the data. The challenge for the
algorithms can be viewed as realizing a kind of trade-off between robustness and
information gain [1]. To make use of unlabeled data, one must make assump-
tions, either implicitly or explicitly. As reported in [3], the key to semi-supervised
learning is the prior assumption of consistency, which allows for exploiting the geometric structure of the data distribution. Close data points should belong to
the same class and decision boundaries should lie in regions of low data density;
this is also called the “cluster assumption”.
In this paper, we introduce a very simple two-stage approach that uses the
available unlabeled data to improve on the predictions made when learning only
from the labeled examples. In a first stage, it uses an off-the-shelf classifier to
build a model based on the small amount of available training data, and in the
second stage it uses that model to transform the available unlabeled data into a
weighted “pre-labeled” data-set that together with the original data is used in a
nearest neighbor classifier. We will show that the proposed algorithm improves
on the classifier built in stage 1, especially in cases where much more unlabeled
data is available compared to the labeled data.
The rest of the paper is structured as follows: in section 2 we describe a few
related semi-supervised learning techniques. Section 3 introduces the proposed
algorithm in detail. In section 4 we show experimental results using an array of
different classifiers used in the first stage. Section 5 concludes and presents some
directions for future work.
2 Related Work
Early methods in semi-supervised learning used mixture models (in which each mixture component represents exactly one class) and extensions of the EM algorithm [4]. More recent approaches belong to one of the following categories: self-training, co-training, transductive SVMs, split learning, and graph-based methods. In the self-training approach, a classifier is trained on the labeled data and then used to classify the unlabeled data. The most confident of these (now labeled) points are added to the training set, together with their predicted labels, and the process is repeated until convergence [5]. Approaches based on
co-training [2] assume that the features describing the objects can be divided
in two subsets such that each of them is sufficient to train a good classifier, and
that the two sets are conditionally independent given the class attribute. Two
classifiers are iteratively trained, each on one set, and they teach each other with
a respective subset of unlabeled data and their highest confidence predictions.
The transductive SVMs [6] are a “natural” extension of SVMs to the semi-
supervised learning scheme. They aim at finding a labeling of the unlabeled data
so that the decision boundary has a maximum margin on the original labeled
data and on the (newly labeled) unlabeled data.
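To make the self-training scheme described at the start of this section concrete, the following minimal sketch (our own illustration in Python, assuming a scikit-learn-style classifier with fit and predict_proba, NumPy arrays and a fixed confidence threshold, none of which are prescribed by [5]) adds confident predictions back to the training set until nothing confident remains:

import numpy as np

def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.95, max_iter=20):
    """Minimal self-training loop: repeatedly add the most confident
    predictions on unlabeled data to the training set and retrain."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():                      # nothing confident left: stop
            break
        pseudo = clf.classes_[proba.argmax(axis=1)]  # predicted ("pseudo") labels
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo[confident]])
        X_unlab = X_unlab[~confident]
    clf.fit(X_lab, y_lab)
    return clf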
Graph-based methods attempt to capture the underlying structure of the
data with a graph whose vertices are the available data (both labeled and un-
labeled) and whose (possibly weighted) edges encode the pairwise relationships
among this data. Examples of recent work in that direction include Markov ran-
dom walks [7], cluster kernels [8], and regularization on graphs [3]. The learning
problem on graphs can generally be viewed as the estimation of a classification function f that should be close to a given labeling y on the labeled data and smooth over the whole graph. Different graph-based methods mainly vary in their choice of loss function and regularizer [9]. For example, the work on
graph cuts [10] minimizes the cost of a cut in the graph for a two-class problem,
while [11] minimizes the normalized cut cost and [12, 3] minimize a quadratic
cost. As noticed in [9], these differences are not actually crucial. What is far
more important is the construction and the quality of the graph, which should
reflect domain knowledge through the similarity function which is used to assign
edges and their weights.
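To make this loss-plus-regularizer view concrete, a representative quadratic objective over the graph (intended only as an illustration of the general shape; the methods cited above each optimize their own variant) is

\min_f \; \sum_{i \in L} (f_i - y_i)^2 \;+\; \mu \sum_{i,j} w_{ij} (f_i - f_j)^2,

where L denotes the set of labeled vertices, w_{ij} the edge weights and \mu > 0 balances fidelity to the given labels y against smoothness of f over the graph.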
Collective classification [13] is an ILP approach that uses the relational struc-
ture of the combined labeled and unlabeled data-set to enhance classification ac-
curacy. With relational approaches, the predicted label of an example will often
be influenced by the labels of related examples. The idea behind collective clas-
sification is that the predicted labels of a test-example should also be influenced
by the predictions made for related test-examples. The algorithm presented in
this paper is closely related to this, but works on non-relational data by using a
distance and the nearest neighbor relation that results from it.
3 Yatsi
The Yatsi algorithm⁴ that we present in this paper incorporates ideas from different algorithms that were discussed in the previous section. Since we really like the idea of giving the user the option to choose from a number of machine learning algorithms (as is possible in co-training), we develop a technique that builds on top of any standard machine learning algorithm. To incorporate the general idea behind collective classification, we use a nearest neighbor approach and the distance between examples as a way of relating them to each other.
Algorithm 1 High-level pseudo-code for the two-stage Yatsi algorithm.
Input: a set of labeled data Dl and a set of unlabeled data Du, an off-the-shelf classifier C and a nearest neighbor number K; let N = |Dl| and M = |Du|
Step 1:
  Train the classifier C using Dl to produce the model Ml
  Use the model Ml to “pre-label” all the examples from Du
  Assign weights of 1.0 to every example in Dl and of F × (N/M) to every example in Du
  Merge the two sets Dl and Du into D
Step 2:
  For every example that needs a prediction:
    Find the K nearest neighbors to the example in D to produce the set NN
    For each class:
      Sum the weights of the examples from NN that belong to that class
    Predict the class with the largest sum of weights.
⁴ Yatsi was developed during a time when we were experimenting with a number of multi-stage classifiers that dealt with labeled and unlabeled data. At the time, we referred to the presented algorithm as “Yet Another Two-Stage Idea”, hence the name Yatsi.
The Yatsi classifier (see Algorithm 1 for high-level pseudo-code) uses both labeled and unlabeled data in a two-stage set-up. In the first stage, a standard, off-the-shelf classifier (or regression algorithm) is trained on the available training
data. Since this kind of data is limited in the specific application areas we are
looking at, it is best to choose an algorithm that can learn a model well using
only a small amount of learning data.
In the second stage, the model generated from the learning data is used
to “pre-label” all the examples in the test set. These pre-labeled examples are
then used together with the original training data in a weighted nearest neighbor
algorithm. The weights used by the nearest neighbor classifier are meant to limit
the amount of trust the algorithm puts in the labels generated by the model from
the first step. As a default value, we set the weights of the training data to 1.0
and the weights of the pre-labeled test-data to N/M with N the number of
training examples and M the number of test-examples. Conceptually, this gives
equal weights to the whole train- and the whole test-set. By adding a parameter
F to the algorithm that will cause the weight of the test-examples to be set to
F ∗ (N/M ), it becomes possible to vary the influence one wants to give to the
unlabeled data and the classifier built in step 1. Values of F between 0.0 and
1.0 will lower the influence of the test-data and the learned model from the first step, while values larger than 1.0 will increase their influence. In the experiments, we will test values ranging from 0.01 to 10. An F-value of 10.0 adjusts the weights of the individual examples so as to give the total test-set 10 times the weight of the total training set.
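To make the two stages concrete, the following sketch (our own Python illustration, assuming a scikit-learn-style base classifier with fit and predict, NumPy arrays and plain Euclidean distance; the names yatsi_predict, k and F are ours) implements the weighted vote of Algorithm 1:

import numpy as np

def yatsi_predict(base_clf, X_train, y_train, X_test, k=10, F=1.0):
    """Two-stage Yatsi sketch: pre-label the test set with a base classifier,
    then classify each test example by a weighted k-NN vote over the combined
    training and pre-labeled test data."""
    N, M = len(X_train), len(X_test)

    # Stage 1: train the base classifier and "pre-label" the test data.
    base_clf.fit(X_train, y_train)
    y_pre = base_clf.predict(X_test)

    # Merge both sets; training examples get weight 1.0,
    # pre-labeled test examples get weight F * N / M.
    X_all = np.vstack([X_train, X_test])
    y_all = np.concatenate([y_train, y_pre])
    w_all = np.concatenate([np.ones(N), np.full(M, F * N / M)])

    # Stage 2: weighted k-nearest-neighbor vote for every test example.
    # (Following Algorithm 1 literally, a test example is itself part of D
    # and may therefore appear among its own neighbors with its pre-label.)
    preds = []
    for x in X_test:
        dist = np.linalg.norm(X_all - x, axis=1)   # Euclidean distance to everything
        nn = np.argsort(dist)[:k]                  # indices of the k nearest neighbors
        best_class, best_sum = None, -np.inf
        for c in np.unique(y_all[nn]):
            s = w_all[nn][y_all[nn] == c].sum()    # summed weight per class
            if s > best_sum:
                best_class, best_sum = c, s
        preds.append(best_class)
    return np.array(preds)

A call such as yatsi_predict(DecisionTreeClassifier(), X_l, y_l, X_u), with scikit-learn's DecisionTreeClassifier standing in for J48 and X_l, y_l, X_u as placeholder arrays, would roughly mimic the default setting used in the experiments below.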
With tj being the target value of example j and distij being the distance between examples i and j, both ways of incorporating the weights of the examples are equivalent. As such, although we have not yet implemented this and do not have any experimental results, Yatsi can be used for predicting continuous target values as well without major changes.
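For instance, one natural weighted nearest-neighbor estimate for a continuous target, reusing the example weights from above (this particular formula is our own illustration and is not taken verbatim from the text), would be

\hat{t}_i = \frac{\sum_{j \in NN_K(i)} (w_j / dist_{ij})\, t_j}{\sum_{j \in NN_K(i)} (w_j / dist_{ij})},

i.e. a distance- and weight-scaled average of the target values of the K nearest neighbors of example i.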
For our experiments, we fixed the number of nearest neighbors to 10. However, this is not a requirement of the Yatsi algorithm. Cross-validation on the labeled training examples would be one way of adapting the number of nearest neighbors.
However, the resulting values of k obtained in this fashion might be misleading
because of the large amount of extra examples that will be available in the second
step of the Yatsi algorithm.
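A possible instantiation of this cross-validation idea (using scikit-learn purely for illustration; the paper neither implements nor prescribes it, and the caveat above applies to the value it returns) is the following:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def tune_k(X_lab, y_lab, candidate_ks=(1, 5, 10, 20), folds=5):
    """Choose the number of neighbors by cross-validation on the labeled data only.
    With very small labeled sets, the candidate values must not exceed the size
    of the training folds."""
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(candidate_ks)},
                          cv=folds)
    search.fit(X_lab, y_lab)
    return search.best_params_["n_neighbors"]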
Since the algorithm is designed to work in applications where the amount
of labeled training data is limited, one can in most cases get away with less
efficient algorithms in the first step. As we expect the amount of test data to
greatly exceed that of the training data, most of the computational complexity
will lie in the search for nearest neighbors, as this search spans the combined
sets of training and test examples.
Yatsi will therefore greatly benefit from using efficient nearest neighbor
search algorithms. Currently, we use KD-trees [15] to speed up the nearest
neighbor search. However, recently a lot of research effort has gone into the
development of more efficient search strategies for nearest neighbors, which can
be directly applied to the Yatsi algorithm. Examples of such search strategies
are cover trees [16] and ball trees [17].
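As an illustration of how such a structure replaces a brute-force distance computation in the nearest neighbor step (the KDTree class from scikit-learn is used here purely as a stand-in; the paper does not specify a particular implementation beyond [15]):

from sklearn.neighbors import KDTree

def knn_indices(X_all, X_query, k=10):
    """Return, for each query point, the indices of its k nearest neighbors
    in X_all (Euclidean distance), using a KD-tree instead of a linear scan."""
    tree = KDTree(X_all)               # built once over the merged data set D
    _, idx = tree.query(X_query, k=k)  # query returns (distances, indices)
    return idx                         # shape: (len(X_query), k)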
4 Experimental Results
Table 1. Number of statistically significant wins, draws and losses (in that order:
W/D/L) in predictive accuracy of Yatsi vs. the classifier trained in stage 1, for different
values of the weighting parameter. (Tested with a paired t-test, confidence level 0.05, two-tailed.)
Table 2. Predictive accuracies of J48 and Yatsi using J48 as the stage 1 classifier
averaged over 20 runs of the experiments. The data-sets were split into training- and
test-set with a 5%-95% ratio. Significant improvements or degradations were determined with a two-tailed test at the 5% significance level.
For additional insight into the results we have selected a few typical datasets
and plot in Tables 3 and 4 error rates for J48 and Yatsi(J48) using different
values for the weighting parameter F. The percentage of labels available for
training varies between 1% and 20%. Some general trends are clearly visible in these plots: more labels usually lead to better results overall, and with a very small number of labels J48 usually performs worse than Yatsi, but J48 can outperform Yatsi when given more labeled data. With regard to
the weighting parameter F we see that values of 0.1 and 1.0 consistently perform
better than a value of 10 or the no-weight option, which clearly indicates the
advantage of taking a cautious approach that puts more trust in the originally supplied labels than in the labels generated by the first-stage classifier.
As already stated, all previous experiments were run with the number of nearest neighbors for the second stage fixed to 10. Because of the use of weights and
the large difference in weights between training and test examples, we thought
it might make sense to use a larger number of nearest neighbors, so we also per-
formed experiments with 20 and 50 nearest neighbors in the 1% labeled training
data case. Overall, these experiments showed very little difference compared with the 10-nearest-neighbor ones. When there was a difference, there was a small improvement for low values of F (0.1 or 1.0) and a small loss for the cases where a high
weight was given to the test-examples (F = 10.0 or no weights used at all).
5 Conclusions and Future Work
We have presented a simple two-stage idea that benefits from the availability of unlabeled data to improve on the predictive accuracy of standard classifiers. Yatsi uses an off-the-shelf classification or regression algorithm in a first step and uses weighted nearest neighbor on the combined set of training data and “pre-labeled” test data for the actual predictions. Experimental results obtained with a large array of different first-stage classifiers, different amounts of available unlabeled data, and a relatively large selection of data-sets show that Yatsi will usually improve on or match the predictive performance of the base classifier generated in the first stage. These improvements are largest in cases where
there is a lot more unlabeled data available than there is labeled data.
The Yatsi algorithm in its current form is quite simple and therefore a
number of further improvements are possible. Some ideas have already been
presented in section 3 such as the inclusion of a more efficient nearest neighbor
search algorithm or the use of cross validation to determine the best number
of nearest neighbors to use. More elaborate extensions could include some sort
of EM-algorithm that tries to match the “pre-labels” of test-examples with the
eventually predicted values. Distance functions other than simple Euclidean distance could encode specialized domain knowledge and thus help improve classification performance. These directions would relate Yatsi more closely to
both graph-based and kernel-based methods of semi-supervised learning.
References
1. Seeger, M.: Learning with labeled and unlabeled data. Technical report, Edinburgh
University (2001)
2. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In:
COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan
Kaufmann (1998) 92–100
3. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local
and global consistency. In: Proceedings of the Annual Conf. on Neural Information
Processing Systems, NIPS. (2004)
4. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled
and unlabeled documents using EM. Machine Learning 39 (2000) 103–134
5. Rosenberg, C., Hebert, M., Schneiderman, H.: Semi-supervised self-training of
object detection models. In: 7th IEEE Workshop on Applications of Computer
Vision / IEEE Workshop on Motion and Video Computing (WACV/MOTION
2005), 5-7 January 2005, Breckenridge, CO, USA, IEEE Computer Society (2005)
29–36
6. Joachims, T.: Transductive inference for text classification using support vector
machines. In Bratko, I., Džeroski, S., eds.: Proceedings of ICML99, 16th Interna-
tional Conference on Machine Learning, Morgan Kaufmann (1999) 200–209
7. Szummer, M., Jaakkola, T.: Partially labeled classification with markov random
walks. In Dietterich, T., Becker, S., Ghahramani, Z., eds.: Advances in Neural
Information Processing Systems 14 [Neural Information Processing Systems, NIPS
2001, December 3-8, 2001, Vancouver and Whistler, British Columbia, Canada],
Cambridge, MA, MIT Press (2001) 945–952
8. Chapelle, O., Weston, J., Schölkopf, B.: Cluster kernels for semi-supervised learn-
ing. In Becker, S., Thrun, S., Obermayer, K., eds.: Advances in Neural Information
Processing Systems 15 [Neural Information Processing Systems, NIPS 2002, De-
cember 9-14, 2002, Vancouver, British Columbia, Canada], Cambridge, MA, MIT
Press (2002) 585–592
9. Zhu, X.: Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon Uni-
versity, School of Computer Science, Pittsburgh, Pennsylvania (PA), USA (2005)
10. Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph min-
cuts. In Brodley, C., Pohoreckyj Danyluk, A., eds.: Proceedings of the Eighteenth
International Conference on Machine Learning (ICML 2001), Williams College,
Williamstown, MA, USA, June 28 - July 1, 2001, Morgan Kaufmann (2001) 19–26
11. Joachims, T.: Transductive learning via spectral graph partitioning. In Fawcett,
T., Mishra, N., eds.: Machine Learning, Proceedings of the Twentieth International
Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, AAAI Press
(2003) 290–297
12. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using Gaussian
fields and harmonic functions. In Fawcett, T., Mishra, N., eds.: Machine Learning,
Proceedings of the Twentieth International Conference (ICML 2003), August 21-
24, 2003, Washington, DC, USA, AAAI Press (2003) 912–919
13. Neville, J., Jensen, D.: Collective classification with relational dependency net-
works. In: Proceedings of the Second International Workshop on Multi-Relational
Data-Mining. (2003)
14. Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph
mincuts. In: Proceedings of the Eighteenth International Conference on Machine
Learning, Morgan Kaufmann (2001)
15. Friedman, J., Bentley, J., Finkel, R.: An algorithm for finding best matches in
logarithmic expected time. ACM Transactions on Mathematical Software 3 (1977)
209–226
16. Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In
pre-print, available from www.cs.rochester.edu/u/beygel/publications.html (2005)
17. Omohundro, S.: Efficient algorithms with neural network behavior. Journal of
Complex Systems 1 (1987) 273–347
Table 3. Exemplary plots of error rates for J48 and Yatsi(J48) using different values for the weighting parameter F for some selected datasets with between 1% and 20% known labels (I).
[Plots: % incorrect vs. size of training file in % (0 to 20%) for the datasets anneal.ORIG, anneal, balance-scale, breast-cancer, cleveland-14-heart-disease and ecoli, each comparing J48 with YATSI (no weights, F = 0.1, F = 1.0, F = 10.0).]
Table 4. Exemplary plots of error rates for J48 and Yatsi(J48) using different values for the weighting parameter F for some selected datasets with between 1% and 20% known labels (II).
[Plots: % incorrect vs. size of training file in % (0 to 20%) for the datasets horse-colic, hungarian-14-heart-disease, ionosphere, sonar, soybean and vote, each comparing J48 with YATSI (no weights, F = 0.1, F = 1.0, F = 10.0).]