K Nearest Neighbours Based on Mutual Information
1 Introduction
Missing values in data sets may have different origins, such as the death of patients, equipment malfunctions, refusal of respondents to answer certain questions, and so on [1]. The correct treatment of missing values is an important step in solving pattern recognition tasks. This paper focuses on classification problems where the input data set contains missing values. We present a variant of the K-nearest neighbours (KNN) algorithm based on Mutual Information (MI). This approach finds the nearest neighbours using a distance measure that takes into account the relevance of each input feature to the classification task. The aim of this paper is to analyze the performance of the KNN algorithm on incomplete data classification tasks, and to improve its effectiveness using MI.
The remainder of this work is organized as follows. Section 2 introduces the notation used in this paper. Missing data imputation with the standard KNN approach and the proposed KNN algorithm based on the MI concept are both described in Section 3. Section 4 proposes a KNN procedure using MI for performing decision tasks. Next, in Section 5, experimental results on both real and artificial data sets are shown. Finally, Section 6 presents the main conclusions and future work.
∗ This work is partially supported by Ministerio de Educación y Ciencia under grant
1 In the following, the terms pattern, case, input vector and instance are used as synonyms.
where $d_j(x_{aj}, x_{bj})$ is the distance between $\mathbf{x}_a$ and $\mathbf{x}_b$ on its $j$-th attribute:
$$
d_N(x_{aj}, x_{bj}) = \frac{|x_{aj} - x_{bj}|}{\max(x_j) - \min(x_j)}, \tag{3}
$$
where $\max(x_j)$ and $\min(x_j)$ are respectively the maximum and minimum values observed in the training set for the numerical attribute $x_j$.
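To make the metric concrete, here is a minimal sketch of the range-normalized distance of Eq. (3) combined into an MI-weighted distance. The function name, the weight vector alpha (the per-attribute MI values α_j mentioned in Section 5), and the weighted-sum combination are our assumptions; the text only states that the distance takes feature relevance into account.

```python
import numpy as np

def mi_weighted_distance(xa, xb, x_min, x_max, alpha):
    """Hypothetical MI-weighted distance between two complete patterns.

    xa, xb       : 1-D arrays holding the numerical attributes of both patterns.
    x_min, x_max : per-attribute minima and maxima observed in the training set.
    alpha        : per-attribute MI weights alpha_j (assumed form of weighting).
    """
    # Eq. (3): range-normalized distance on each attribute j.
    d_n = np.abs(xa - xb) / (x_max - x_min)
    # Assumed combination: MI-weighted sum of the per-attribute distances.
    return float(np.sum(alpha * d_n))
```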
Pattern x is assigned to the class for which the weights of the representatives among the K_C nearest neighbours sum to the largest value. In this way, β_k is computed taking into account how relevant the input features are to the classification task.
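A minimal sketch of this decision rule, assuming the neighbour weights β_k have already been computed from the MI-weighted distances (their exact form is not reproduced here):

```python
import numpy as np

def mi_knn_classify(distances, labels, beta, k_c):
    """Weighted vote among the K_C nearest neighbours (sketch).

    distances : MI-weighted distance from pattern x to every training case.
    labels    : class label of each training case.
    beta      : weight beta_k of each training case (assumed precomputed).
    k_c       : number of neighbours used for classification.
    """
    nearest = np.argsort(distances)[:k_c]      # indices of the K_C nearest cases
    votes = {}
    for k in nearest:
        votes[labels[k]] = votes.get(labels[k], 0.0) + beta[k]
    # Class whose representatives accumulate the largest weight sum.
    return max(votes, key=votes.get)
```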
5 Experimental Results
In order to compare the proposed method with the standard KNN approach, we have chosen a complete problem, Telugu, and an incomplete data set, Hepatitis. The main reason for using a complete problem is to measure the influence of the percentage of artificially inserted missing data on the classification accuracy. Both tested methods are trained using the same training, validation, and test sets, obtained by the stratified 10-fold cross-validation method. For each fold of the Telugu set, a percentage of missing data (5%, 10%, 20%, 30% and 40%) is inserted into the selected attributes completely at random [1].
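As an illustration of this setup, missing entries can be inserted completely at random as follows; the NaN marker and the helper name are our own choices, not taken from the paper:

```python
import numpy as np

def insert_mcar(X, attributes, rate, rng=None):
    """Insert a given rate of missing values (NaN) completely at random.

    X          : (n_samples, n_features) data matrix; a modified copy is returned.
    attributes : indices of the attributes receiving missing values
                 (e.g. [0, 1] for x1 and x2 in the Telugu experiment).
    rate       : fraction of entries set to missing in each selected attribute.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = X.astype(float).copy()
    n = X.shape[0]
    for j in attributes:
        rows = rng.choice(n, size=int(rate * n), replace=False)
        X[rows, j] = np.nan
    return X
```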
In all tested problems, the test set is first classified by KNNclassify and MI-KNNclassify without estimating the missing values. Next, we check whether the classification performance can be improved by imputing the missing values with KNNimpute and MI-KNNimpute. The optimal values of K for imputation (K_I) and for classification (K_C) are selected by cross-validation.
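For reference, a bare-bones sketch of the imputation step: each missing entry of a pattern is replaced by the average of that attribute over its K_I nearest donors, with the neighbour search restricted to the attributes observed in the pattern. The weighted Euclidean distance and the unweighted average below are stand-ins of our own; the paper's Eq. (3) normalization and exact MI weighting would slot into the distance computation.

```python
import numpy as np

def knn_impute(x, X_train, k_i, alpha=None):
    """Impute the missing entries of x from its K_I nearest neighbours (sketch).

    x       : 1-D array with NaN marking the missing attributes.
    X_train : complete training patterns acting as donors.
    k_i     : number of neighbours used for imputation.
    alpha   : optional MI weights; None corresponds to the standard KNNimpute.
    """
    obs = ~np.isnan(x)                         # attributes observed in x
    w = np.ones(obs.sum()) if alpha is None else alpha[obs]
    # Distance over the observed attributes only (assumed rule).
    d = np.sqrt((w * (X_train[:, obs] - x[obs]) ** 2).sum(axis=1))
    donors = X_train[np.argsort(d)[:k_i]]
    x = x.copy()
    x[~obs] = donors[:, ~obs].mean(axis=0)     # average the donors' values
    return x
```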
features equally contribute to solving the problem. Considering this fact, missing values have been randomly introduced in the attributes x1 and x2. Table 1 shows the classification results obtained without imputation. We test the following values for K_I and K_C: 1, 5, 10, 15, 20, 30, 40 and 50. For each percentage of missing data, K_C has been selected by cross-validation. We can see how MI-KNNclassify clearly outperforms the standard KNN method in all simulations. After that, we test whether missing data imputation provides better classification results. The different values of K_I and K_C have also been selected by cross-validation (using the classification accuracy on the validation set). As we can observe in Table 2, missing data estimation helps to improve the classifier accuracy in all simulations. With low percentages of missing data (from 5% to 30%), the combined approach of MI-KNNclassify and MI-KNNimpute outperforms the other tested procedures. The advantage of using MI-KNNimpute is not so clear for higher percentages of missing data, as we can see at 40%. In this case, KNNimpute is a better approach for imputation. This is because the estimated MI is less accurate (we have less information to compute the values of MI, i.e., the weights α_j).
Table 1: Test classification accuracy (%) for Telugu data using KNNclassify
and MI-KNNclassify (without imputation), for different percentages of missing
values in the attributes x1 and x2 . The best results are shown in bold face.
Table 2: Test classification accuracy (%) for Telugu data using KNNclassify and MI-KNNclassify after imputation of the attributes x1 and x2.
Table 3: Test classification accuracy (%) for Hepatitis data without imputation vs. number of nearest neighbours (K_C) in KNNclassify and MI-KNNclassify. Values selected by cross-validation are shown in bold face.
6 Conclusions
In this work, we have introduced a new KNN method to classify and impute incomplete data based on the MI concept. The proposed procedure selects the K nearest cases taking into account the relevance of the input attributes to the target class. This is done using an MI-weighted distance metric. During the missing data estimation stage, this method provides an imputation oriented towards solving the classification task. Moreover, an effective KNN classifier based on the MI-weighted distance has also been proposed. Experimental results on both artificial and real incomplete databases show the usefulness of the proposed approach.

Future work includes implementing a more robust MI estimator for incomplete data and carrying out an extensive comparison with other machine learning methods.
References
[1] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, New Jersey, USA, 2nd edition, 2002.
[2] O. Troyanskaya, M. Cantor, O. Alter, G. Sherlock, P. Brown, D. Botstein, R. Tibshirani, T. Hastie, and R. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.
[3] G. E. Batista and M. C. Monard. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell., 17(5-6):519–533, 2003.
[4] N. Kwak and C.-H. Choi. Input feature selection by mutual information based on Parzen window. IEEE Trans. Pattern Anal. Mach. Intell., 24(12):1667–1671, 2002.
[5] F. Rossi, A. Lendasse, D. François, V. Wertz, and M. Verleysen. Mutual information for the selection of relevant variables in spectrometric nonlinear modelling. Chemometr. Intell. Lab. Syst., 80:215–226, 2006.
[6] T. Denœux. A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Trans. Syst. Man Cybern. C Appl. Rev., 25(5):804–813, 1995.