
K-Nearest Neighbor Classification for Glass Identification Problem


Mashael S. Aldayel
Department of Information Technology
King Saud University
Riyadh, Saudi Arabia
maldayel@ksu.edu.sa

Abstract— The discovery of knowledge from criminal evidence databases is important for effective criminological investigation. The aim of data mining is to extract knowledge from a database and produce unambiguous and reasonable patterns. K-Nearest Neighbor (KNN) is one of the most successful data mining methods used in classification problems. Many researchers have shown that combining different classifiers through voting results in better performance than using single classifiers. This paper applies KNN to help criminological investigators identify the glass type. It also checks whether integrating KNN with another classifier using voting can enhance its accuracy in identifying the glass type. The results show that applying voting can enhance the KNN accuracy in the glass identification problem.

Keywords— Data Mining; glass classification; K-Nearest Neighbor; voting

I. INTRODUCTION

The study of the glass classification problem was motivated by criminological investigation. At the scene of the crime, the glass left behind can be used as evidence if it is correctly identified. A frequent casework requirement is the comparison of glass from a crime scene with glass particles found to be associated with a suspect. Such glass particles are often exceedingly small, so it is important to identify and compare these small glass fragments that may be significant in a forensic context [1][2].

The quantitative analysis of glass [1] gives the oxide concentration for up to 15 elements. In most instances, however, only the oxides of Sodium, Magnesium, Aluminum, Silicon, Potassium and Calcium occur at levels high enough for measurement. The main goal of this paper is to correctly classify a single fragment of glass based upon the main components measured.

The proposed data mining algorithm for the glass problem is the k-nearest neighbor classification method, which is one of the most commonly used data mining techniques in pattern recognition and classification problems. Recently, many studies have compared single classifiers with multiple classifiers combined by voting and have shown that voting gives better results than single classifiers [11]. This paper investigates applying KNN to the identification of glass, to allow comparisons with other data mining techniques used on the same dataset. It also investigates whether integrating KNN with another classifier using voting can enhance its classification accuracy on the glass dataset.

The dataset used in this paper is the "Glass Identification Dataset", taken from the UCI Machine Learning Repository [3]. The WEKA open source software was used throughout this study as the tool for data mining analysis. It was chosen for its computational efficiency, its large range of algorithms and its good data preparation tools [5].

The remainder of the paper is organized as follows. Section II deals with related work. Section III describes the glass identification dataset. Section IV presents the data analysis methodology. The data analysis results and the conclusion are given in sections V and VI, respectively.

II. RELATED WORK

Relatively little research has been conducted in the field of glass identification problems. Researchers have applied many data mining techniques, including fuzzy clustering [6] and many variants of the KNN technique such as feature weighting [4], AdaBoost [5], locally adaptive KNN [7], bagging, kernel density, and support vector machines, to the same glass identification dataset. Thus their results can be compared with this paper on the basis of classification accuracy, as shown in table I; they are all lower than the results achieved in this paper.
TABLE I. CLASSIFICATION ACCURACY ON THE GLASS DATASET IN RELATED WORKS

Method                         Accuracy Rate
Boosting NN [5]                75.6%
Naïve KNN [5]                  73.2%
Adaptive metric NN [7]         75.2%
Discriminant Adaptive NN [7]   72.9%
C4.5 Decision Tree [7]         68.2%
Wilson Editing [8]             67.4%
Multi-edit [8]                 60.1%
Citation Editing [8]           70.0%
Supervised Clustering [8]      71.5%

Vivencio et al. [4] propose a feature-weighting nearest neighbor method based on the chi-square statistical test, to be used in conjunction with a KNN classifier. The distance metric used in their work is the chi-squared statistical score, which can also be used to rank features. Forty-four out of forty-five experiments favored the feature-weighted approach.

Athitsos and Sclaroff [5] introduce an algorithm that uses boosting (AdaBoost) to learn a distance measure for KNN classification. However, experiments on eight UCI datasets, including the glass identification dataset, yield no clear winner among the compared methods: boosting using output codes, and KNN classification using an un-optimized distance measure.

Ruying and Rongcang [6] propose a fuzzy clustering algorithm to obtain the minimum reduction of a decision table. They use an attribute reduction algorithm based on fuzzy rough sets, construct a new decision table, and then eliminate redundant values from it. Finally, they extract valuable rules from the resulting minimum reduction. However, they do not give clear information about the accuracy of their glass identification experiment, mentioning only that 31 useful decision rules are extracted from 214 rules.

Domeniconi et al. [7] propose a locally adaptive nearest neighbor method to minimize bias and smooth the classification process. They use the chi-squared distance to compute a flexible metric between neighbors. They compare their method with other classification techniques, including KNN, discriminant adaptive NN and the C4.5 decision tree. Their experiments use nine datasets, including the glass identification dataset, and show that the classification accuracy rate is improved to 75.2%.

Zeidat et al. [8] compare several popular editing techniques, namely Wilson, multi-edit, and citation editing, according to classification accuracy and training set compression rate. Their experiments include 11 UCI datasets as well as a set of two-dimensional synthetic datasets, using the Manhattan distance and 5-fold cross-validation. Their experiments demonstrate the effect of editing techniques but do not show much enhancement of the nearest neighbor classifier.

III. DATASET DESCRIPTION

The UCI Machine Learning Repository provides many datasets. One of them is the glass identification dataset [3], which determines the type of glass based on its components. The number of instances in this dataset is 214. The dataset has 10 attributes, as shown in table II. Attributes 2 to 9 are measured as the weight percent of the corresponding oxide.

The glass has seven types, which correspond to different usages: windows float processed, windows non-float processed, vehicle windows float processed, vehicle windows non-float processed, containers, tableware and headlamps. Each type is represented by a number from 1 to 7. Note that the refractive index attribute is a measurement taken for each glass sample which was used, together with the elemental analyses, to discriminate between pairs of glasses or to help classify a glass according to its usage [4].

TABLE II. ATTRIBUTES DESCRIPTION IN GLASS IDENTIFICATION DATASET

Attribute Name     Attribute Type         Attribute Values
Refractive Index   Numeric                Min 1.511, Max 1.534, Mean 1.518, Std. Dev. 0.003
Sodium             Numeric                Min 10.73, Max 17.38, Mean 13.408, Std. Dev. 0.817
Magnesium          Numeric                Min 0, Max 4.49, Mean 2.685, Std. Dev. 1.442
Aluminum           Numeric                Min 0.29, Max 3.5, Mean 1.445, Std. Dev. 0.499
Silicon            Numeric                Min 69.81, Max 75.41, Mean 72.651, Std. Dev. 0.775
Potassium          Numeric                Min 0, Max 6.21, Mean 0.497, Std. Dev. 0.652
Calcium            Numeric                Min 5.43, Max 16.19, Mean 8.957, Std. Dev. 1.423
Barium             Numeric                Min 0, Max 3.15, Mean 0.175, Std. Dev. 0.497
Iron               Numeric                Min 0, Max 0.51, Mean 0.057, Std. Dev. 0.097
Glass type         Nominal (categorical)  Labels 1-7 with counts 70, 76, 17, 0, 13, 9, 29
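For readers who want to work with the dataset outside WEKA, the short sketch below loads it directly from the UCI repository with Python and pandas. The raw-file URL and the short column names are assumptions based on the repository's published file layout, not details given in the paper.

```python
# A minimal sketch of loading the UCI glass identification dataset [3].
# The URL and column names are assumptions about the repository layout;
# the paper itself worked with WEKA's own loader.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "glass/glass.data")
COLUMNS = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]

glass = pd.read_csv(URL, header=None, names=COLUMNS).drop(columns="Id")
X = glass.drop(columns="Type")   # the nine numeric attributes of table II
y = glass["Type"]                # class labels 1-7 (label 4 has no instances)

print(glass.shape)               # expected: (214, 10)
print(y.value_counts().sort_index())
```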
IV. METHODOLOGY

The proposed data mining algorithms for handling the glass identification problem are the k-nearest neighbor and voting methods. This research investigates whether integrating KNN with another classifier using voting can enhance its accuracy. WEKA is used as the tool for data mining analysis; it was chosen for its computational efficiency, large range of algorithms and good data preparation tools. This study goes through two phases: data pre-processing and applying classifiers. The data pre-processing phase prepares the dataset for the second phase. The second phase uses single classifiers and multiple classifiers (voting with KNN and HNB) to construct a highly accurate prediction model for the glass identification problem.

A. Data Preprocessing

Even if a dataset is delivered in the right format, it may still need preprocessing in order to apply a data mining algorithm and increase the quality of the analysis results. There are many data preprocessing techniques; the ones used in this study are data cleaning, dimension reduction, standardization, transformation, and discretization. The glass data is relatively clean, since there are no missing values; hence, I only remove outliers from the dataset. Dimension reduction is done by selecting attributes with the information gain ranking method. This method evaluates the worth of an attribute by measuring the information gain with respect to the class, so only attributes that make a significant contribution to glass identification are selected. The result of this analysis indicates that the Iron and Silicon attributes have the lowest ranks, so it is better to eliminate them, while the remaining attributes have a strong correlation with the glass type, as shown in table III.

TABLE III. INFORMATION GAIN OF EACH ATTRIBUTE WITH RESPECT TO THE CLASS, AS COMPUTED IN WEKA

Attribute          Rank
Aluminum           0.566
Magnesium          0.563
Potassium          0.543
Calcium            0.472
Barium             0.412
Sodium             0.335
Refractive Index   0.332
Iron               0.099
Silicon            0
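The paper computes these ranks with WEKA's InfoGainAttributeEval. As a rough cross-check, the sketch below implements the same idea in Python: discretize each numeric attribute, then measure the drop in class entropy. The equal-width binning is my assumption (WEKA uses an MDL-based discretization), so the resulting ranks will only approximate table III. It reuses X and y from the loading sketch above.

```python
# A minimal sketch of information-gain attribute ranking, analogous to
# WEKA's InfoGainAttributeEval. Equal-width binning is an assumption.
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy of a label distribution, in bits."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(feature: pd.Series, labels: pd.Series, bins: int = 10) -> float:
    """H(class) - H(class | discretized feature)."""
    binned = pd.cut(feature, bins=bins)
    cond = sum(
        (group.size / labels.size) * entropy(group)
        for _, group in labels.groupby(binned, observed=True)
    )
    return entropy(labels) - cond

# Rank all attributes; the lowest-ranked ones (Iron, Silicon in table III)
# are the candidates for removal in section IV.A.
gains = {col: information_gain(X[col], y) for col in X.columns}
for col, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{col:>4s} {g:.3f}")
```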
B. Voting

The multiple classifier voting technique is used to combine the decisions of multiple classifiers. It involves dividing the training data into smaller, equal subsets and building a classifier for each subset. The simplest form of voting is majority voting, where each individual classifier contributes a single vote and the final decision is the class with the most votes. Applying voting to classification algorithms has shown successful improvements in their accuracy [11]. In my experiments I found that HNB (Hidden Naive Bayes) and KNN performed well on most of the glass classes. As a result, a combination of the best classifiers on the glass dataset, HNB and KNN, is used to form the voting technique.

C. K-Nearest Neighbor Classifier

The k-nearest neighbor classifier is a popular method for a wide range of classification problems due to its simplicity and relatively high convergence speed. KNN considers the k nearest instances {i1, i2, ..., ik} to an instance x and decides upon the most frequent class in the corresponding set {c1, c2, ..., ck}; this most frequent class is assumed to be the class of instance x. To determine the nearest instances, the KNN technique adopts a distance metric that measures the proximity of instance x to each of the stored instances. Various distance metrics can be used. The Euclidean distance is used in this paper because it performs well when the continuous attributes are normalized so that they have the same influence on the distance measure between instances. Furthermore, the data dimensionality has been reduced to prevent or lessen its effect on the performance of the Euclidean distance [4][5][10][11].

However, KNN classifiers have several disadvantages. The main one is the large memory requirement needed to store the whole training set: if the training set is large, the response time will also be large, resulting in poor run-time performance. Despite the memory requirement, KNN in general performs well in classification problems. Moreover, KNN is very sensitive to irrelevant or redundant attributes, which degrade the classification accuracy; hence, the dataset should be preprocessed with a careful attribute selection technique. Another disadvantage of KNN is the selection of k: if k is too small, the result can be sensitive to noise, while if k is too large, the result can be incorrect because the neighbors include too many points from other classes [7][8][10].
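As a concrete illustration of sections IV.B and IV.C, the sketch below builds the KNN classifiers for k = 1, 3, 5, 7 and 9 and a majority-voting combination, scored with 10-fold cross-validation as in section V. It uses scikit-learn rather than WEKA, and since Hidden Naive Bayes has no scikit-learn implementation, GaussianNB stands in as the Bayesian voter; the exact figures in table IV will therefore not be reproduced.

```python
# A sketch of the classifiers of sections IV.B-IV.C in scikit-learn.
# GaussianNB is a stand-in for WEKA's HNB (an assumption, not the paper's
# exact setup). Reuses X and y from the loading sketch above.
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Drop Iron and Silicon, the two lowest-ranked attributes (section IV.A).
X_sel = X.drop(columns=["Fe", "Si"])

def knn(k: int):
    # Standardizing the attributes gives each one the same influence on
    # the Euclidean distance, as section IV.C recommends.
    return make_pipeline(StandardScaler(),
                         KNeighborsClassifier(n_neighbors=k, metric="euclidean"))

# Single KNN classifiers for the k values examined in the paper, scored
# with 10-fold cross-validation (scikit-learn warns that class 6 has only
# 9 instances, fewer than the 10 folds, but still runs).
for k in (1, 3, 5, 7, 9):
    acc = cross_val_score(knn(k), X_sel, y, cv=10).mean()
    print(f"KNN (K={k}): accuracy {acc:.4f}")

# Majority ("hard") voting: each classifier casts one vote per instance
# and the class with the most votes wins.
vote = VotingClassifier(estimators=[("knn1", knn(1)), ("nb", GaussianNB())],
                        voting="hard")
acc = cross_val_score(vote, X_sel, y, cv=10).mean()
print(f"Voting (1NN + NB): accuracy {acc:.4f}")
```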
V. DATA ANALYSIS RESULTS

Many studies show that the combination of multiple classifiers leads to a significant accuracy improvement on a given dataset. Identifying the glass type by combining classifiers is therefore better than using a single classifier, because the classification decision depends on the collective outputs of numerous models. I first pre-process the glass dataset, then examine the accuracy of single classifiers on the dataset and choose the best of them for the voting method. I found that HNB and KNN performed well on most of the glass classes, as seen in table IV; hence, I use a combination of the best classifiers on the glass dataset, HNB and KNN, to form the voting technique. The results show that combining different classifiers outperforms the single classifiers in identifying the glass type.

The best single classifiers used in this paper are HNB and the KNN classifier using Euclidean distance with k = 1, 3, 5, 7 and 9. I found that KNN gives its best results when K=1, so K=1 is used in the voting method to classify the glass dataset. The chosen test mode is 10-fold cross-validation, which gave the best experimental results in terms of classification accuracy. The voting result shows that 172 of the 214 instances are classified correctly. Table IV shows the common accuracy measures of the data mining methods used; it also shows that applying voting can enhance the KNN accuracy in the glass identification problem.

TABLE IV. CLASSIFICATION ACCURACY DETAILS ON THE GLASS DATASET

Method                           Precision  Recall  F-Measure  Accuracy Rate
HNB                              0.801      0.799   0.797      79.9065%
KNN (K=1)                        0.796      0.79    0.789      78.972%
KNN (K=3)                        0.789      0.78    0.778      78.0374%
KNN (K=5)                        0.768      0.748   0.745      74.7664%
KNN (K=7)                        0.756      0.738   0.733      73.8318%
KNN (K=9)                        0.707      0.701   0.685      70.0935%
Proposed Voting, K=1 (KNN+HNB)   0.806      0.804   0.802      80.3738%

Figure 1. Classification Accuracy Details in Data Mining Methods (bar chart of the precision, recall, F-measure and accuracy rate of Voting, HNB, 1NN, 3NN, 5NN, 7NN and 9NN).

Another accuracy measurement is the confusion matrix in table V. Each element in the confusion matrix is a count of instances; rows represent the actual class of the instances, and columns represent the predicted class.

TABLE V. CONFUSION MATRIX

                    Predicted Class
Actual Class    1    2    3    4    5    6    7
     1         61    6    3    0    0    0    0
     2         10   60    3    0    3    0    0
     3          8    2    7    0    0    0    0
     4          0    0    0    0    0    0    0
     5          0    0    0    0   11    0    2
     6          0    2    0    0    0    7    0
     7          1    0    1    0    1    0   26

The matrix shows the following:

• 70 instances of class (1) are classified as follows: 61 in class (1), 6 in class (2) and 3 in class (3).
• 76 instances of class (2) are classified as follows: 10 in class (1), 60 in class (2), 3 in class (3) and 3 in class (5).
• 17 instances of class (3) are classified as follows: 8 in class (1), 2 in class (2) and 7 in class (3).
• No instance is classified in class (4), because the dataset contains none.
• 13 instances of class (5) are classified as follows: 11 in class (5) and 2 in class (7).
• 9 instances of class (6) are classified as follows: 2 in class (2) and 7 in class (6).
• 29 instances of class (7) are classified as follows: 1 in class (1), 1 in class (3), 1 in class (5) and 26 in class (7).

In table V, I notice the large confusion rates between classes (1) and (2), which is caused by the class distribution of the glass dataset, where most instances belong to classes (1) and (2). The 214 glass instances are actually distributed as 70 of class (1), 76 of class (2), 17 of class (3), 0 of class (4), 13 of class (5), 9 of class (6) and 29 of class (7). In other words, the categories that occur are 70 window float processed, 76 windows non-float processed, 17 vehicle window float processed, 13 containers, 9 tableware and 29 headlamps.

VI. CONCLUSION

The discovery of knowledge from criminal evidence databases is important for effective criminological investigation. The aim of data mining is to extract knowledge from a database and produce unambiguous and reasonable patterns. K-nearest neighbor is one of the most successful data mining methods used in classification problems, and many researchers have shown that combining different classifiers through voting results in better performance than using single classifiers. This paper applies KNN, with K = 1, 3, 5, 7 and 9, to help criminological investigators identify the glass class. A combination of the best classifiers on the glass dataset, HNB and KNN (K=1), is used to form a voting technique. The results show that integrating KNN with HNB using voting can enhance the classification accuracy in identifying the glass types: 172 of 214 instances are classified correctly. The accuracy measurements of applying voting and KNN are explained in detail. Furthermore, this paper has raised some interesting possibilities for further research with voting using different multiple classifiers and different datasets.

ACKNOWLEDGEMENT

The author would like to thank the partial support funded by the Research Center of the College of Computer and Information Sciences at King Saud University.
REFERENCES

[1] K. W. Terry, A. Van Riessen, and B. F. Lynch, Identification of Small Glass Fragments for Forensic Purposes. Government Chemical Laboratories, Criminology Research Council (Australia) and Western Australian Institute of Technology, 1983.
[2] WEKA, "Waikato Environment for Knowledge Analysis", Version 3.6.3, New Zealand, 1999-2010. [Online]. Available: http://www.cs.waikato.ac.nz/~ml/WEKA
[3] B. German, "Glass Identification Dataset", UCI Machine Learning Repository, September 1987. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Glass+Identification
[4] D. P. Vivencio, E. Hruschka, M. Nicoletti, E. dos Santos, and S. Galvao, "Feature-weighted k-Nearest Neighbor Classifier," in Proc. IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007), pp. 481–486, 2007.
[5] V. Athitsos and S. Sclaroff, "Boosting Nearest Neighbor Classifiers for Multiclass Recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pp. 45–45, 2005.
[6] S. Ruying and H. Rongcang, "Data Mining Based on Fuzzy Rough Set Theory and its Application in the Glass Identification," in Proc. International Conference on Information and Automation (ICIA '09), pp. 154–157, 2009.
[7] C. Domeniconi, J. Peng, and D. Gunopulos, "An Adaptive Metric Machine for Pattern Classification," Advances in Neural Information Processing Systems 13, vol. 13, p. 458, 2001.
[8] N. Zeidat, S. Wang, and C. F. Eick, "Dataset editing techniques: a comparative study," 2005.
[9] P. Cunningham and S. J. Delany, "k-Nearest neighbour classifiers," Multiple Classifier Systems, pp. 1–17, 2007.
[10] X. Wu et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
[11] M. Shouman, T. Turner, and R. Stocker, "Applying k-Nearest Neighbour in Diagnosing Heart Disease Patients," International Journal of Information and Education Technology, vol. 2, 2012.