Data Mining Discretization Methods and Performances
B-28
Discretization is known to be one of the most important data preprocessing tasks in data mining. Many discretization methods are currently available, including Boolean Reasoning, Equal Frequency Binning, Entropy, and others. Each method was developed for a specific problem or domain area. Consequently, using such a method in another area might not be appropriate. Inappropriate use of a technique can cause important data to be lost, which leads to inaccurate results and unreliable models. This study evaluates the performance of various discretization methods on several domain areas. The experiments were validated using 10-fold cross-validation, and a ranking of the methods by performance was obtained from the experiments. The results suggest that different discretization methods perform better in different domain areas.
1. Introduction
In data mining, discretization is known to be one of the most important data preprocessing tasks. Most existing machine learning algorithms are capable of extracting knowledge from databases that store discrete attributes. If the attributes are continuous, the algorithms can be integrated with a discretization algorithm that transforms them into discrete features. Discretization methods reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals (5)(2). Discretization can make learning more accurate and faster. The results of learning (decision trees, induction rules) are usually more compact, shorter and more accurate when discrete features are used than when continuous features are used (7)(8). Furthermore, according to (3), the most important performance criterion of a discretization method is the accuracy rate.
This paper begins with a brief description of the importance of discretization in data mining. Section 2 lists the steps of the discretization process and describes the chosen methods. The subsequent section explains the experimental evaluation conducted in this study and presents the results obtained. Discussion of the findings and concluding remarks follow in Section 4.
In order to carry out the process, a discretization method has to be applied. The following subsections describe the discretization methods used in this study, namely Boolean Reasoning, Equal Frequency Binning, Entropy Minimum Description Length (Entropy/MDL), Naive and Semi Naive.
2. Discretization Process
A typical discretization process consists of four steps: (i) sort all the continuous values of the feature to be discretized; (ii) choose a cut point to split the continuous values into intervals; (iii) split or merge the intervals of continuous values; and (iv) choose the stopping criterion of the discretization process (8).
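The four steps above can be sketched in Python, using the Entropy/MDL method of Section 2.3 as the concrete case. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names (`entropy`, `best_cut`, `mdl_accepts`, `entropy_mdl_cuts`) are hypothetical, and the stopping test is assumed to take the Fayyad-Irani form of the minimum description length criterion, which the paper does not spell out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """Step (ii): find the boundary index with the lowest weighted entropy.

    `pairs` is a list of (value, label) sorted by value. A boundary at index i
    splits it into pairs[:i] and pairs[i:]. Returns (i, entropy) or None.
    """
    n = len(pairs)
    labels = [lab for _, lab in pairs]
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        e = (i * entropy(labels[:i]) + (n - i) * entropy(labels[i:])) / n
        if best is None or e < best[1]:
            best = (i, e)
    return best


def mdl_accepts(pairs, i, split_entropy):
    """Step (iv): MDL stopping test (assumed Fayyad-Irani form)."""
    n = len(pairs)
    labels = [lab for _, lab in pairs]
    left, right = labels[:i], labels[i:]
    gain = entropy(labels) - split_entropy
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n


def entropy_mdl_cuts(values, labels):
    """Return the sorted cut points for one continuous attribute."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # step (i)
    cuts = []

    def recurse(part):
        found = best_cut(part)
        if found is None:
            return
        i, e = found
        if not mdl_accepts(part, i, e):
            return  # stopping criterion reached for this interval
        cuts.append((part[i - 1][0] + part[i][0]) / 2)
        recurse(part[:i])  # step (iii): keep splitting both intervals
        recurse(part[i:])

    recurse(pairs)
    return sorted(cuts)
```

For example, `entropy_mdl_cuts(list(range(1, 21)), ["a"] * 10 + ["b"] * 10)` returns `[10.5]`, a single cut between the two classes, whereas strictly alternating labels produce no cut because the stopping test rejects every candidate.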
2.3 Entropy/MDL
According to (7), Entropy/MDL is an algorithm that recursively partitions the value set of each attribute so that a local measure of entropy is optimized. In this algorithm, the minimum description length principle defines the stopping criterion for the partitioning process.
2.4 Naive
The Naive algorithm takes both condition attributes and decision attributes into consideration (1). The algorithm sorts the condition attributes first, then considers a cut point between two values of each attribute.
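A minimal Python sketch of the Naive idea for a single condition attribute follows. The text describes the algorithm only at a high level, so the rule used here for keeping a candidate cut (keep the midpoint between two adjacent distinct values only when the sets of decision values observed at those values differ) is an assumption, and the function name `naive_cuts` is hypothetical.

```python
from collections import defaultdict


def naive_cuts(values, decisions):
    """Return cut points for one condition attribute.

    `values` holds the continuous attribute values and `decisions` the
    corresponding decision attribute values, in the same order.
    """
    # Collect the decision values observed at each distinct attribute value.
    by_value = defaultdict(set)
    for v, d in zip(values, decisions):
        by_value[v].add(d)

    ordered = sorted(by_value)  # sort the attribute values first
    cuts = []
    for lo, hi in zip(ordered, ordered[1:]):
        # Assumed rule: keep the midpoint only if the decision sets differ.
        if by_value[lo] != by_value[hi]:
            cuts.append((lo + hi) / 2)
    return cuts
```

With values `[1.0, 2.0, 3.0, 4.0, 5.0]` and decisions `[0, 0, 1, 1, 0]`, the sketch returns `[2.5, 4.5]`: one cut at each point where the decision changes.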
ISBN 978-979-16338-0-2
Proceedings of the International Conference on Electrical Engineering and Informatics Institut Teknologi Bandung, Indonesia June 17-19, 2007
iteration. The overall results of the experiment are presented in the next section.
[Plot omitted. Recoverable details: axis ticks 0 to 100 and 2, 3, 4; legend entries Naive, SemiNaive.]
Figure 1: Classification accuracy obtained for various numbers of classes of the medical dataset
[Plot omitted. Recoverable details: y-axis "% accuracy", ticks 0 to 100; x-axis "No. of class", ticks 2, 7, 8.]
reveals that all discretization methods give higher accuracy for small class sizes on the engineering dataset. Thus, the larger the class size, the lower the classification accuracy. In general, this experiment has shown that, for the engineering dataset, the class size affects the classification accuracy of all the discretization methods tested. A further study of suitable class sizes for engineering datasets should be conducted. In addition, a study of the effect of different types of datasets should be carried out.
References
(1) Ohrn, A.: Discernibility and Rough Sets in Medicine: Tools and Applications. PhD Thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway, NTNU report 1999:133, IDI report 1999:14, ISBN 82-7984-014-1, 239 pages (Pub. 1999).
(2) Kurgan, L. and Cios, K.J.: Discretization Algorithm that Uses Class-Attribute Interdependence Maximization. Proceedings of the 2001 International Conference on Artificial Intelligence (IC-AI 2001), pp. 980-987, Las Vegas, Nevada (Pub. 2001).
(3) Chmielewski, M.R. and Grzymala-Busse, J.W.: Global Discretization of Continuous Attributes as Preprocessing for Machine Learning. International Journal of Approximate Reasoning, Vol. 15, pp. 319-331 (Pub. 1996).
(4) The University of California: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
(5) Ratanamahatana, C.A.: CloNI: Clustering of N-Interval discretization. Proceedings of the 4th International Conference on Data Mining Including Building Application for CRM & Competitive Intelligence, Rio de Janeiro, Brazil (Pub. 2003).
(6) Ohrn, A.: ROSETTA Technical Reference Manual. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway (Pub. 2001).
(7) Dougherty, J., Kohavi, R. and Sahami, M.: Supervised and unsupervised discretization of continuous features. In Proc. Twelfth International Conference on Machine Learning. Los Altos, CA: Morgan Kaufmann, pp. 194-202 (Pub. 1995).
(8) Liu, H. et al.: Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6, 393-423 (Pub. 2002).
(9) Karthigasoo, S., Cheah, Y.N. and Manickam, S.: Improving the Accuracy of Medical Decision Support via a Knowledge Discovery Pipeline using Ensemble Techniques. Journal of Advancing Information and Management Studies, 2(1) (Pub. 2005).
(10) Nguyen, H.S. and Skowron, A.: Quantization of real-valued attributes. In Proc. Second International Joint Conference on Information Sciences, pp. 34-37 (Pub. 1995).
Figure 2: Classification accuracy obtained for various numbers of classes of the engineering dataset