K-Nearest Neighbor Classification for Glass
Identification Problem
Mashael S. Aldayel
Department of Information Technology
King Saud University
Riyadh, Saudi Arabia
maldayel@ksu.edu.sa
Abstract— The discovery of knowledge from criminal evidence
databases is important for conducting effective criminological
investigations. The aim of data mining is to extract knowledge
from a database and produce unambiguous and reasonable
patterns. K-Nearest Neighbor (KNN) is one of the most
successful data mining methods used in classification problems.
Many researchers have shown that combining different
classifiers through voting yields better performance than using
single classifiers. This paper applies KNN to help
criminological investigators identify the glass type. It also
examines whether integrating KNN with another classifier
through voting can enhance its accuracy in identifying the glass
type. The results show that applying voting can enhance the
KNN accuracy in the glass identification problem.
Keywords— Data Mining; glass classification; K-Nearest
Neighbor; voting
I. INTRODUCTION
The study of the glass classification problem was motivated by
criminological investigation. Glass left at the scene of a crime
can be used as evidence if it is correctly identified. A
frequent casework requirement is the comparison of glass from
a crime scene with glass particles found to be associated with a
suspect. Such glass particles are often exceedingly small. It is
important to identify and compare these small glass fragments
that may be significant in a forensic context [1][2].
The quantitative analysis of glass [1] gives the oxide
concentration for up to 15 elements. In most instances,
however, only the oxides of Sodium, Magnesium, Aluminum,
Silicon, Potassium and Calcium occur at levels high enough for
measurement.
The main goal of this paper is to correctly classify a single
fragment of glass based on the main components measured.
The proposed data mining algorithm for the glass problem is the
k-nearest neighbor classification method, one of the most
commonly used data mining techniques in pattern recognition
and classification problems. Recently, many studies have
compared single classifiers against multiple classifiers combined
by voting, and have shown that voting gives better results than
the individual classifiers [11]. This paper investigates applying
KNN to the identification of glass, allowing comparisons with
other data mining techniques used on the same dataset. It also
investigates whether integrating KNN with another classifier
using voting can enhance its classification accuracy on the glass
dataset. The dataset used in this paper is the "Glass Identification
Dataset", taken from the UCI Machine Learning Repository [3].
The WEKA open-source software was used throughout this study
as the tool for data mining analysis; it was chosen for its
computational efficiency, wide range of algorithms, and good
data preparation tools [5].
The remainder of the paper is organized as follows. Section II
reviews related work. Section III describes the glass
identification dataset. Section IV presents the data analysis
methodology. The data analysis results and the conclusion are
given in Sections V and VI, respectively.
II. RELATED WORK

Relatively little research has been conducted on the glass
identification problem. Researchers have applied many data
mining techniques, including fuzzy clustering [6] and several
variants of the KNN technique such as feature weighting [4],
AdaBoost [5], locally adaptive KNN [7], bagging, kernel
density estimation, and support vector machines, on the same
glass identification dataset. The reported classification
accuracies, summarized in Table I, can thus be compared with
this paper; they are all lower than the results achieved here.

TABLE I. CLASSIFICATION ACCURACY ON THE GLASS DATASET IN RELATED WORK

Method                           Accuracy Rate
Boosting NN [5]                  75.6%
Naïve KNN [5]                    73.2%
Adaptive metric NN [7]           75.2%
Discriminant adaptive NN [7]     72.9%
C4.5 decision tree [7]           68.2%
Wilson editing [8]               67.4%
Multi-edit [8]                   60.1%
Citation editing [8]             70.0%
Supervised clustering [8]        71.5%
Vivencio et al. [4] propose a feature-weighting nearest
neighbor method based on the chi-square statistical test, to be
used in conjunction with a KNN classifier. The distance metric
used in this work is the chi-squared statistical score, which can
also be used to rank features. In forty-four out of forty-five
experiments, the feature-weighted approach is preferred.
Athitsos and Sclaroff [5] introduce an algorithm that uses
boosting (AdaBoost) to learn a distance measure for KNN
classification. However, experiments with eight UCI datasets,
including the glass identification dataset, yield no clear winner
among the compared methods: the boosted distance measure,
boosting using output codes, and KNN classification using an
unoptimized distance measure.
Ruying and Rongcang [6] propose a fuzzy clustering
algorithm to obtain the minimum reduction of a decision table.
They use an attribute reduction algorithm based on fuzzy rough
sets, construct a new decision table, and then eliminate
redundant values from it. Finally, they extract valuable rules
according to the resulting minimum reduction. However, they
give no clear information about the accuracy of their glass
identification experiment, mentioning only that 31 useful
decision rules are extracted from 214 rules.
Dorneniconi et al. [7] propose a locally adaptive nearest
neighbor method to minimize bias and smooth the
classification process. They use a chi-squared distance to
compute a flexible metric between neighbors. They compare
their method with other classification techniques, including
KNN, discriminant adaptive NN, and the C4.5 decision tree.
Their experiments use nine datasets, including the glass
identification dataset, and show that the classification accuracy
rate is improved to 75.2%.
Zeidat et al. [8] compare several popular editing
techniques, namely Wilson, multi-edit, and citation editing,
according to classification accuracy and training set
compression rate. Their experiments include 11 UCI datasets
as well as a set of two-dimensional synthetic datasets, using
Manhattan distance and 5-fold cross-validation. Their
experiments demonstrate the effect of editing techniques but
do not show much enhancement of the nearest neighbor
classifier.
III. DATASET DESCRIPTION

The UCI Machine Learning Repository provides many
datasets. One of them is the glass identification dataset [3], in
which the type of glass is determined from its components.
The dataset contains 214 instances and 10 attributes, as shown
in Table II. Attributes 2 through 9 are measured as weight
percent of the corresponding oxide. The glass has seven types,
which correspond to different usages: windows float processed,
windows non-float processed, vehicle windows float processed,
vehicle windows non-float processed, containers, tableware,
and headlamps. Each type is represented by an integer label
from 1 to 7.

Note that the refractive index is determined for each glass
sample and is used together with the elemental analyses to
discriminate between pairs of glasses or to help classify a glass
according to its usage [4].

TABLE II. ATTRIBUTE DESCRIPTION IN THE GLASS IDENTIFICATION DATASET

Attribute Name      Attribute Type          Attribute Values (Min / Max / Mean / Std. Dev.)
Refractive Index    Numeric                 1.511 / 1.534 / 1.518 / 0.003
Sodium              Numeric                 10.73 / 17.38 / 13.408 / 0.817
Magnesium           Numeric                 0 / 4.49 / 2.685 / 1.442
Aluminum            Numeric                 0.29 / 3.5 / 1.445 / 0.499
Silicon             Numeric                 69.81 / 75.41 / 72.651 / 0.775
Potassium           Numeric                 0 / 6.21 / 0.497 / 0.652
Calcium             Numeric                 5.43 / 16.19 / 8.957 / 1.423
Barium              Numeric                 0 / 3.15 / 0.175 / 0.497
Iron                Numeric                 0 / 0.51 / 0.057 / 0.097
Glass type          Nominal (categorical)   Label (count): 1 (70), 2 (76), 3 (17), 4 (0), 5 (13), 6 (9), 7 (29)
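For readers who want to reproduce the experiments outside WEKA, the following is a minimal sketch of loading the dataset in Python with pandas; this is only an assumption of the sketch, as the paper itself works in WEKA. The column names follow the UCI documentation for glass.data.

```python
# Sketch: load the UCI Glass Identification dataset [3] with pandas.
# Assumes glass.data has been downloaded from the repository URL in [3];
# column names follow the UCI attribute documentation.
import pandas as pd

columns = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]
glass = pd.read_csv("glass.data", header=None, names=columns)

X = glass.drop(columns=["Id", "Type"])  # the nine measured attributes
y = glass["Type"]                       # class labels 1-7 (label 4 is unused)

print(X.describe())      # reproduces the Min/Max/Mean/Std. Dev. in Table II
print(y.value_counts())  # class counts: 70, 76, 17, 13, 9, 29
```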
IV. METHODOLOGY
The proposed data mining algorithms for handling the glass
identification problem are the k-nearest neighbor and voting
methods. This research investigates whether integrating KNN
with another classifier using voting can enhance its accuracy.
WEKA is used as the tool for data mining analysis; it was
chosen for its computational efficiency, wide range of
algorithms, and good data preparation tools.

This study goes through two phases: data preprocessing and
applying classifiers. The data preprocessing phase prepares the
dataset for the second phase. The second phase uses single
classifiers and multiple classifiers (voting with KNN and HNB)
to construct a highly accurate prediction model for the glass
identification problem.
A. Data Preprocessing
Even if a dataset is delivered in the right format, it may still
need preprocessing before a data mining algorithm can be
applied, and preprocessing increases the quality of the analysis
results. Among the many data preprocessing techniques, this
study uses data cleaning, dimension reduction, standardization,
transformation, and discretization.

The glass data is relatively clean, since there are no missing
values; hence, I only remove outliers from the dataset.
Dimension reduction is done by selecting attributes with the
information gain ranking method. This method evaluates the
worth of an attribute by measuring its information gain with
respect to the class, so only attributes that contribute
significantly to glass identification are retained. The analysis
indicates that the Iron and Silicon attributes have the lowest
ranks, so it is better to eliminate them, while the remaining
attributes have a strong correlation with the glass type, as
shown in Table III.
TABLE III. INFORMATION GAIN OF EACH ATTRIBUTE WITH RESPECT TO THE CLASS (WEKA)

Attribute           Rank
Aluminum            0.566
Magnesium           0.563
Potassium           0.543
Calcium             0.472
Barium              0.412
Sodium              0.335
Refractive Index    0.332
Iron                0.099
Silicon             0
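To make the ranking step concrete, here is a hedged sketch in Python. It uses scikit-learn's mutual_info_classif as a stand-in for WEKA's information gain evaluator, so the scores will not match Table III exactly; X and y are the frames from the loading sketch above.

```python
# Sketch of the attribute-ranking step, using scikit-learn's
# mutual_info_classif as a stand-in for WEKA's InfoGainAttributeEval.
# Both measure how much information an attribute carries about the
# class, but the estimators differ, so scores will deviate from Table III.
from sklearn.feature_selection import mutual_info_classif

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")

# Following Table III, drop the two lowest-ranked attributes
# (Iron and Silicon) before training.
X_reduced = X.drop(columns=["Fe", "Si"])
```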
B. Voting
The multiple-classifier voting technique is used to combine
the decisions of multiple classifiers. It involves dividing the
training data into smaller equal subsets and building a classifier
for each subset. The simplest form of voting is majority voting,
where each individual classifier contributes a single vote and
the final decision is the class with the most votes. Applying
voting to classification algorithms has been shown to improve
their accuracy [11]. In my experiments I found that HNB
(Hidden Naive Bayes) and KNN performed well on most glass
classes. As a result, a combination of the best classifiers on the
glass dataset, HNB and KNN, is used to form the voting
technique.
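A sketch of this combination in scikit-learn terms is shown below. WEKA's HNB has no scikit-learn counterpart, so GaussianNB stands in here purely for illustration; the KNN member scales the attributes to [0, 1] as described in the next subsection.

```python
# Sketch of majority ("hard") voting between a KNN classifier and a
# Bayesian classifier. GaussianNB is only a stand-in for WEKA's Hidden
# Naive Bayes (HNB), which scikit-learn does not provide.
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=1))
nb = GaussianNB()

# Each member casts one vote; the class with the most votes wins.
vote = VotingClassifier(estimators=[("knn", knn), ("nb", nb)], voting="hard")
```

Note that with only two voters a tie is possible; scikit-learn breaks ties by class order, whereas WEKA's Vote meta-classifier offers several combination rules, so this sketch only approximates the setup used here.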
C. K-Nearest Neighbor Classifier
The k-nearest neighbor classifier is popular in a wide range
of classification problems due to its simplicity and relatively
high convergence speed. KNN considers the k nearest instances
{i1, i2, ..., ik} to an instance x and decides upon the most
frequent class in the corresponding set {c1, c2, ..., ck}; this
most frequent class is assumed to be the class of instance x. To
determine the nearest instances, the KNN technique adopts a
distance metric that measures the proximity of instance x to the
stored instances. Various distance metrics can be used; the
Euclidean distance is used in this paper because it performs
well when the continuous attributes are normalized so that they
have the same influence on the distance measure between
instances. Furthermore, the data dimensionality has been
reduced to limit its effect on the performance of the Euclidean
distance [4][5][10][11].

However, KNN classifiers have several disadvantages. The
main one is the large memory requirement needed to store the
whole training set: if the training set is large, the response time
will also be large, resulting in poor run-time performance.
Despite the memory requirement, KNN in general performs
well in classification problems. Moreover, KNN is very
sensitive to irrelevant or redundant attributes, which degrade
the classification accuracy; hence, the dataset should be
preprocessed with a careful attribute selection technique.
Another disadvantage of KNN is the selection of k: if k is too
small, the result can be sensitive to noise; if k is too large, the
result can be incorrect because the neighbors include too many
points from other classes [7][8][10].
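The decision rule itself is small enough to write out. The following from-scratch sketch (assuming NumPy, with hypothetical X_train and y_train arrays already normalized) mirrors the description above: Euclidean distances to all stored instances are computed, the k nearest are selected, and the majority class wins.

```python
# From-scratch sketch of the KNN decision rule described above.
# X_train is an (n, d) array of normalized instances, y_train their labels.
from collections import Counter
import numpy as np

def knn_predict(x, X_train, y_train, k=1):
    # Euclidean distance from query x to every stored training instance.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k nearest instances {i1, ..., ik}.
    nearest = np.argsort(dists)[:k]
    # The most frequent class in {c1, ..., ck} is assigned to x.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```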
V. DATA ANALYSIS RESULTS
Many studies show that combining multiple classifiers leads
to a significant accuracy improvement on a given dataset.
Accordingly, identifying the glass type with combined
classifiers is better than with a single classifier, because the
classification decision depends on the collective outputs of
several models.

I first preprocess the glass dataset. Then I examine the
accuracy of single classifiers on the dataset and choose the best
of them to build the voting method. I found that HNB and
KNN performed well on most glass classes, as seen in Table
IV. Hence, I use a combination of the best classifiers on the
glass dataset, HNB and KNN, to form the voting technique.
The results show that the combination of different classifiers
outperforms the single classifiers in identifying the glass type.
The best single classifiers used in this paper are HNB and the
KNN classifier using Euclidean distance with k = 1, 3, 5, 7, and
9. I found that KNN performs best when K=1, so K=1 is used
in the voting method to classify the glass dataset. The chosen
test mode is 10-fold cross-validation, which gave the best
experimental results in terms of classification accuracy.
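To replicate this protocol outside WEKA, a rough sketch follows; it cross-validates the voting model from the earlier sketches (vote, with the reduced attribute set X_reduced). Plain, unstratified 10-fold splitting is assumed because the rarest class has only nine instances, so the exact numbers will differ from Table IV, which was produced in WEKA.

```python
# Sketch: 10-fold cross-validation of the voting model built earlier.
# Unstratified KFold is used because class 6 has only 9 instances,
# which is fewer than 10 folds would require for stratification.
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(vote, X_reduced, y, cv=cv)
print(f"mean accuracy over 10 folds: {scores.mean():.4f}")
```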
The voting result shows that 172 of 214 instances are
classified correctly. Table IV shows common accuracy
measures for the data mining methods and demonstrates that
applying voting can enhance the KNN accuracy in the glass
identification problem.
TABLE IV. CLASSIFICATION ACCURACY DETAILS ON THE GLASS DATASET

Method                            Precision   Recall   F-Measure   Accuracy Rate
HNB                               0.801       0.799    0.797       79.9065%
KNN (K=1)                         0.796       0.79     0.789       78.972%
KNN (K=3)                         0.789       0.78     0.778       78.0374%
KNN (K=5)                         0.768       0.748    0.745       74.7664%
KNN (K=7)                         0.756       0.738    0.733       73.8318%
KNN (K=9)                         0.707       0.701    0.685       70.0935%
Proposed Voting, K=1 (KNN+HNB)    0.806       0.804    0.802       80.3738%

TABLE V. CONFUSION MATRIX (ROWS: ACTUAL CLASS; COLUMNS: PREDICTED CLASS)

Actual\Predicted     1     2     3     4     5     6     7
1                   61     6     3     0     0     0     0
2                   10    60     3     0     3     0     0
3                    8     2     7     0     0     0     0
4                    0     0     0     0     0     0     0
5                    0     0     0     0    11     0     2
6                    0     2     0     0     0     7     0
7                    1     0     1     0     1     0    26
Another accuracy measurement is the confusion matrix in
Table V. Each element in the matrix is a count of instances:
rows represent the actual class of the instances, and columns
represent the predicted class. The matrix shows the following:

• 70 instances of class (1) are classified as follows: 61 in
class (1), 6 in class (2), and 3 in class (3).
• 76 instances of class (2) are classified as follows: 10 in
class (1), 60 in class (2), 3 in class (3), and 3 in class (5).
• 17 instances of class (3) are classified as follows: 8 in
class (1), 2 in class (2), and 7 in class (3).
• No instance is classified in class (4) because the dataset
contains none.
• 13 instances of class (5) are classified as follows: 11 in
class (5) and 2 in class (7).
• 9 instances of class (6) are classified as follows: 2 in
class (2) and 7 in class (6).
• 29 instances of class (7) are classified as follows: 1 in
class (1), 1 in class (3), 1 in class (5), and 26 in class (7).

In Table V, I notice the large confusion rate between classes
(1) and (2), which reflects the class distribution of the glass
dataset, where most instances belong to classes (1) and (2). The
214 glass instances are actually distributed as 70 of (1), 76 of
(2), 17 of (3), 0 of (4), 13 of (5), 9 of (6), and 29 of (7); in other
words, 70 windows float processed, 76 windows non-float
processed, 17 vehicle windows float processed, 13 containers,
9 tableware, and 29 headlamps.
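A Table V-style matrix can also be produced outside WEKA from cross-validated predictions, as in the sketch below (reusing vote, X_reduced, y, and cv from the earlier sketches; listing all seven labels keeps the empty class-4 row and column).

```python
# Sketch: build a confusion matrix from cross-validated predictions.
# Rows are actual classes, columns predicted classes, as in Table V.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(vote, X_reduced, y, cv=cv)
print(confusion_matrix(y, y_pred, labels=[1, 2, 3, 4, 5, 6, 7]))
```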
Figure 1. Classification Accuracy Details in Data Mining Methods (precision, recall, F-measure, and accuracy rate for Voting, HNB, 1NN, 3NN, 5NN, 7NN, and 9NN).
VI. CONCLUSION

The discovery of knowledge from criminal evidence
databases is important for effective criminological
investigation. The aim of data mining is to extract knowledge
from databases and produce unambiguous and reasonable
patterns. K-nearest neighbor is one of the most successful data
mining methods used in classification problems, and many
researchers have shown that combining different classifiers
through voting results in better performance than using single
classifiers. This paper applies KNN, with K = 1, 3, 5, 7, and 9,
to help criminological investigators identify the glass class. A
combination of the best classifiers on the glass dataset, HNB
and KNN (K=1), is used to form a voting technique. The
results show that integrating KNN with HNB using voting can
enhance classification accuracy in identifying the glass types:
172 of 214 instances are classified correctly. The accuracy
measurements of applying voting and KNN are explained in
detail. Furthermore, this paper has raised some interesting
possibilities for further research on voting with different
combinations of classifiers and different datasets.

ACKNOWLEDGEMENT

The author would like to thank the Research Center of the
College of Computer and Information Sciences at King Saud
University for partially funding this work.

REFERENCES
[1] K. W. Terry, A. Van Riessen, and B. F. Lynch, Identification of Small
Glass Fragments for Forensic Purposes. Government Chemical
Laboratories, Criminology Research Council (Australia) and Western
Australian Institute of Technology, 1983.
[2] WEKA, "Waikato Environment for Knowledge Analysis," Version
3.6.3, New Zealand, 1999-2010. Online:
http://www.cs.waikato.ac.nz/~ml/WEKA
[3] B. German, "Glass Identification Dataset," UCI Machine Learning
Repository, September 1987. Online:
http://archive.ics.uci.edu/ml/datasets/Glass+Identification
[4] D. P. Vivencio, E. Hruschka, M. Nicoletti, E. dos Santos, and S. Galvao,
"Feature-weighted k-Nearest Neighbor Classifier," in Proc. IEEE
Symposium on Foundations of Computational Intelligence (FOCI 2007),
pp. 481–486, 2007.
[5] V. Athitsos and S. Sclaroff, "Boosting Nearest Neighbor Classifiers for
Multiclass Recognition," in Proc. IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops (CVPR
Workshops 2005), pp. 45–45, 2005.
[6] S. Ruying and H. Rongcang, "Data Mining Based on Fuzzy Rough Set
Theory and its Application in the Glass Identification," in Proc.
International Conference on Information and Automation (ICIA '09),
pp. 154–157, 2009.
[7] C. Dorneniconi, J. Peng, and D. Gunopulos, "An Adaptive Metric
Machine for Pattern Classification," Advances in Neural Information
Processing Systems 13, vol. 13, p. 458, 2001.
[8] N. Zeidat, S. Wang, and C. F. Eick, "Dataset editing techniques: a
comparative study," 2005.
[9] P. Cunningham and S. J. Delany, "k-Nearest neighbour classifiers,"
Multiple Classifier Systems, pp. 1–17, 2007.
[10] X. Wu et al., "Top 10 algorithms in data mining," Knowledge and
Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
[11] M. Shouman, T. Turner, and R. Stocker, "Applying k-Nearest
Neighbour in Diagnosing Heart Disease Patients," International Journal
of Information and Education Technology, vol. 2, 2012.