An Empirical Study of Distance Metrics For K-Nearest Neighbor Algorithm
Abstract
1. Introduction
Data mining is the extraction of knowledge hidden in data, and it is often performed on large datasets. The knowledge obtained from data mining has been used in many fields, for example to predict future situations, to assist in medical diagnosis, and to forecast relations over time.
Current data mining methodology is commonly divided into several tasks, such as classification, clustering, and association mining, each with a different purpose. A classification task tries to classify data with high accuracy so that future examples can be labeled correctly, for instance distinguishing patients with heart disease from those who are healthy. A clustering task tries to partition the data into groups such that data in the same group are similar to one another, whereas they are dissimilar to data in other groups. An association mining task tries to discover relationships among items that frequently occur together in the data.
DOI: 10.12792/iciae2015.051
2. Background
2.1 k-Nearest Neighbor

The k-nearest neighbor (k-NN) algorithm classifies an unseen example by finding the k training examples nearest to it under a chosen distance metric and assigning the class that holds the majority among those neighbors. The choice of distance metric therefore directly affects which neighbors are selected, and hence the classification accuracy.

2.2 Distance Metrics

Let x_s be a row vector of the data matrix X and y_t a row vector of the data matrix Y, each with n coordinates. The distance metrics compared in this study are defined as follows.

1. Euclidean Distance

The Euclidean distance is the ordinary straight-line distance between two points, defined by Eq. (1)

    d_st^2 = (x_s - y_t)(x_s - y_t)'    (1)

2. Standardized Euclidean Distance

The standardized Euclidean distance scales each coordinate by its variance, defined by Eq. (2)

    d_st^2 = (x_s - y_t) V^-1 (x_s - y_t)'    (2)

where V is the n-by-n diagonal matrix whose j-th diagonal element is the variance of coordinate j.

3. Mahalanobis Distance

The Mahalanobis distance is a measure between a point and a distribution of data, defined by Eq. (3)

    d_st^2 = (x_s - y_t) C^-1 (x_s - y_t)'    (3)

where C is the covariance matrix.

4. City Block Distance

The city block distance is the sum of the absolute differences of the coordinates, defined by Eq. (4)

    d_st = sum_{j=1}^{n} |x_sj - y_tj|    (4)

5. Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance with exponent p, defined by Eq. (5)

    d_st = ( sum_{j=1}^{n} |x_sj - y_tj|^p )^{1/p}    (5)

For p = 1 it reduces to the city block distance, for p = 2 to the Euclidean distance, and as p tends to infinity to the Chebychev distance.

6. Chebychev Distance

The Chebychev distance is the maximum absolute difference over the coordinates, defined by Eq. (6)

    d_st = max_j { |x_sj - y_tj| }    (6)

7. Cosine Distance

The cosine distance is computed as one minus the cosine of the included angle between points, defined by Eq. (7)

    d_st = 1 - (x_s y_t') / ( sqrt(x_s x_s') sqrt(y_t y_t') )    (7)

8. Correlation Distance

Distance based on correlation is a measure of statistical dependence between two vectors, defined by Eq. (8)

    d_st = 1 - ((x_s - xbar_s)(y_t - ybar_t)') / ( sqrt((x_s - xbar_s)(x_s - xbar_s)') sqrt((y_t - ybar_t)(y_t - ybar_t)') )    (8)

where xbar_s and ybar_t are the means of the coordinates of x_s and y_t.

9. Hamming Distance

The Hamming distance, which is the percentage of coordinates that differ, can be defined by Eq. (9)

    d_st = #(x_sj != y_tj) / n    (9)

10. Jaccard Distance

The Jaccard distance is one minus the Jaccard coefficient, i.e., the percentage of nonzero coordinates that differ, defined by Eq. (10)

    d_st = #[(x_sj != y_tj) AND ((x_sj != 0) OR (y_tj != 0))] / #[(x_sj != 0) OR (y_tj != 0)]    (10)

11. Spearman Distance

The Spearman distance is one minus the sample Spearman rank correlation between the two vectors, defined by Eq. (11)

    d_st = 1 - ((r_s - rbar_s)(r_t - rbar_t)') / ( sqrt((r_s - rbar_s)(r_s - rbar_s)') sqrt((r_t - rbar_t)(r_t - rbar_t)') )    (11)

where
rsj is the rank of xsj taken over x1j, x2j, ..., xmx,j,
rtj is the rank of ytj taken over y1j, y2j, ..., ymy,j,
rs and rt are the coordinate-wise rank vectors of xs and yt,
i.e., rs = (rs1, rs2, ..., rsn) and rt = (rt1, rt2, ..., rtn),
and rbar_s = rbar_t = (n+1)/2.
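As a concrete illustration of the definitions above, the following Python sketch implements a few of the metrics for plain list vectors (the function names are illustrative, not from the paper):

```python
import math

def cityblock(x, y):
    # Eq. (4): sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    # Chebychev distance: maximum absolute coordinate difference
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p=2):
    # Eq. (5): generalizes city block (p=1) and Euclidean (p=2)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    # Eq. (7): one minus the cosine of the included angle
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

def hamming(x, y):
    # Eq. (9): fraction of coordinates that differ
    return sum(a != b for a, b in zip(x, y)) / len(x)
```

For example, between (0, 0) and (3, 4) the Minkowski distance with p = 2 is 5.0 while the city block distance is 7, which already hints at how the metric choice can reorder a point's nearest neighbors.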
4. Experimental Results

4.1 Generated Data

[Figure: Experimental procedure. Each dataset is split into a training set (52%) and a test set; the k-nearest neighbor classifier is run with each distance metric (Euclidean, standardized Euclidean, Mahalanobis, city block, Minkowski, Chebychev, cosine, correlation, Hamming, Jaccard, Spearman), and the classification performance is evaluated.]

[Table: Mean and SD matrices used to generate Class 1 and Class 2 of each synthetic dataset, e.g., class means [0 0 0; 3 0 0] with covariance [1 0 0; 0 1 0; 0 0 1], means [0 0 0; 0 0 3] with covariance [0.2 0 0; 0 0.2 0; 0 0 1], and correlated cases with means [0 0 0; 3 3 0] and covariance [1 0.9 0; 0.9 1 0; 0 0 1].]
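A dataset of this kind can be generated with a short sketch like the following, which draws two Gaussian classes from per-coordinate mean and standard-deviation vectors. The particular parameters below are illustrative choices in the spirit of the table, not necessarily the exact values used in the paper.

```python
import random

def generate_class(mean, sd, size, rng):
    # Draw `size` points; each coordinate j is sampled from
    # an independent Gaussian with the given mean[j] and sd[j].
    return [[rng.gauss(m, s) for m, s in zip(mean, sd)]
            for _ in range(size)]

def generate_dataset(rng=None):
    rng = rng or random.Random(0)
    # Illustrative parameters: class 1 centered at the origin,
    # class 2 shifted along the first coordinate.
    class1 = generate_class([0, 0, 0], [1, 1, 1], 100, rng)
    class2 = generate_class([3, 0, 0], [1, 1, 1], 100, rng)
    X = class1 + class2
    y = [1] * len(class1) + [2] * len(class2)
    return X, y
```

Seeding the random generator makes the generated datasets reproducible across runs.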
Fig. 3. Distribution of the eight synthetic datasets, (a) Dataset 1 through (h) Dataset 8; each one has four kinds of distribution.
4.2 Experimental Results

[Figure: Classification accuracy of the k-nearest neighbor classifier with each distance metric (cityblock, chebychev, correlation, cosine, euclidean, jaccard, mahalanobis, minkowski, seuclidean, spearman, hamming) on Datasets 1 through 8.]
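The per-metric comparison can be sketched as a plain k-NN classifier that takes the distance function as a parameter and is evaluated on the same train/test split for every metric. This is a minimal sketch with two stand-in metrics, not the full list of eleven.

```python
from collections import Counter

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def cityblock(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def knn_predict(train_X, train_y, x, k, metric):
    # Rank training points by distance to x, then vote among the k nearest.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda p: metric(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k, metric):
    # Fraction of test examples whose predicted class matches the true class.
    correct = sum(knn_predict(train_X, train_y, x, k, metric) == t
                  for x, t in zip(test_X, test_y))
    return correct / len(test_X)

# Evaluating accuracy(...) once per metric on a fixed split reproduces
# the style of comparison shown in the figure above.
```

Because the classifier itself is unchanged, any accuracy difference across runs is attributable to the distance function alone.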
5. Conclusions