Rough Set Approach for Clustering Categorical Data Using Information-Theoretic Dependency Measure

I.-K. Park, G.-S. Choi

Information Systems
Article history: Received 8 January 2014; Accepted 3 June 2014

Keywords: Clustering; Categorical data; Rough set theory; Information system; Attribute dependency

Abstract: A variety of clustering algorithms exist to group objects having similar characteristics, but many of them are difficult to apply to categorical data. While some of these algorithms cannot handle categorical data at all, others cannot handle the uncertainty inherent in categorical data. Handling such uncertainty is a prerequisite for clustering categorical data. An algorithm termed minimum-minimum roughness (MMR) was proposed, which uses rough set theory to deal with these problems in clustering categorical data, and many later algorithms were developed to improve the handling of hybrid data. This research proposes information-theoretic dependency roughness (ITDR), another technique for categorical data clustering, which takes into account the information-theoretic attribute dependency degree of categorical-valued information systems. It improves on all of its predecessors: MMR, MMeR, SDR and standard deviation of standard deviation roughness (SSDR). Experimental results on two benchmark UCI datasets show that the ITDR technique outperforms the baseline categorical data clustering technique with respect to computational complexity and the purity of clusters.
quality of approximations of a set [6]. However, since the application of rough set theory to categorical data clustering is relatively new, the focus of MMR is still on the evaluation of its performance. To this end, computational complexity and cluster purity remain outstanding issues, since all attributes are considered for selection and objects from different classes can appear in the same cluster.
The problem with all the above-mentioned algorithms is that they mostly deal with numerical data sets, that is, databases whose attributes have numeric domains. The basic reason for dealing with numerical attributes is that they are very easy to handle and it is easy to define similarity on them. But categorical data have multi-valued attributes. Thus, similarity can be defined in terms of common objects, common values for the attributes, and the association between the two. In such cases horizontal co-occurrences (common values for the objects) as well as vertical co-occurrences (common values for the attributes) can be examined [7]. Other algorithms that can handle categorical data have been proposed, including work by Huang [1], Gibson et al. [10], Guha et al. [12] and Dempster et al. [9]. While these algorithms or methods are very helpful for forming clusters from categorical data, they have the disadvantage that they cannot deal with uncertainty. However, in real world applications it has been found that there is often no sharp boundary between clusters. Some work has been done by Huang [1] and Kim et al. [14], who developed clustering algorithms using fuzzy sets that can handle categorical data. But these algorithms suffer from a stability problem: they do not give consistent results across multiple runs.
In this paper, we propose ITDR, an alternative technique for categorical data clustering. The technique differs from the baseline method in that the rough attribute dependencies in categorical-valued information systems are used to select the clustering attribute based on a rough entropy measure [19,22,24]. Further, we use a divide and conquer method to group the objects. We show that the proposed technique achieves lower computational complexity with higher purity compared to MMR, without the lower and upper approximation process. The rest of this paper is organized as follows. Section 2 gives an overview of standard clustering methods in the literature. Section 3 describes the Information-Theoretic Dependency Roughness (ITDR) technique, with an illustrative example on the Animal world dataset [8]. Section 4 presents the experimental analysis, comparing ITDR with the MMR technique on the Zoo dataset. Finally, the conclusion of this work is given in Section 5.
2. Literature review

In this section we first present the literature review that forms the basis of the proposed work, along with the definitions of the concepts and the notation used in the rest of the paper. Dempster et al. [9] present a partition-based clustering method called the expectation-maximization (EM) algorithm. EM first randomly assigns different probabilities to each class or category for each cluster. These probabilities are then successively adjusted to maximize the likelihood of the data given the specified number of clusters. Since the EM algorithm computes classification probabilities, each observation is assigned to the cluster with the largest classification probability. After a large number of iterations, EM terminates at a locally optimal solution.
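Since EM is described only procedurally here, a minimal sketch may help. The following is an illustrative EM for a mixture of independent categorical attributes in Python (the paper gives no code and its prototype is MATLAB; the integer encoding, Dirichlet initialization and variable names are our assumptions):

```python
import numpy as np

def em_categorical(X, k, n_iter=100, seed=0):
    """EM for a mixture of independent categorical attributes.
    X: integer-coded data, shape (n_objects, n_attributes); k: clusters."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    n_vals = [int(X[:, j].max()) + 1 for j in range(m)]
    pi = np.full(k, 1.0 / k)                        # mixing weights
    theta = [rng.dirichlet(np.ones(v), size=k) for v in n_vals]
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each object.
        log_r = np.tile(np.log(pi + 1e-12), (n, 1))
        for j in range(m):
            log_r += np.log(theta[j][:, X[:, j]] + 1e-12).T
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and per-cluster category probabilities.
        pi = r.mean(axis=0)
        for j in range(m):
            for v in range(n_vals[j]):
                theta[j][:, v] = r[X[:, j] == v].sum(axis=0) + 1e-9
            theta[j] /= theta[j].sum(axis=1, keepdims=True)
    return r.argmax(axis=1)                         # largest probability wins
```

The final line mirrors the description above: each observation goes to the cluster with the largest classification probability.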
Han et al. [18] propose a clustering algorithm to cluster related items in a market database based on an association rule hyper-graph. A hyper-graph is used as a model for relatedness. The approach targets binary transactional data. It assumes that the item sets defining clusters are disjoint and there is no overlap amongst them. However, this assumption may not hold in practice, as transactions in different clusters may have a few common items. K-modes [1] extends K-means and introduces a new dissimilarity measure for categorical data; a sketch of this measure and the mode update follows below. The dissimilarity measure between two objects is calculated as the number of attributes whose values do not match. The K-modes algorithm then replaces the means of clusters with modes, using a frequency-based method to update the modes in the clustering process to minimize the clustering cost function. One advantage of K-modes is that its results are easy to interpret. However, K-modes produces locally optimal solutions that depend on the initial modes and the order of objects in the data set, so it must be run multiple times with different starting modes to test the stability of the clustering solution.
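The dissimilarity measure and the frequency-based mode update just described are easy to state in code; a minimal sketch (function names are illustrative, not from [1]):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """d(x, y) = number of attributes whose values do not match."""
    return sum(a != b for a, b in zip(x, y))

def update_mode(cluster):
    """Frequency-based update: the mode takes each attribute's most
    frequent value among the objects currently in the cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

# e.g. matching_dissimilarity(("hair", "milk", "fly"),
#                             ("feather", "milk", "fly")) == 1
```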
Ralambondrainy proposes a method to convert multiple-category attributes into binary attributes and treat them as numeric in the K-means algorithm [15]. Huang also proposes the K-prototypes algorithm, which allows clustering of objects described by a combination of numeric and categorical attributes. CACTUS (clustering categorical data using summaries) [10] is a summarization-based algorithm. In CACTUS, the authors cluster categorical data by generalizing the definition of a cluster for numerical attributes. Summary information constructed from the data set is assumed to be sufficient for discovering well-defined clusters. CACTUS finds clusters in subsets of all attributes and thus performs a subspace clustering of the data. Guha et al. [12] propose a hierarchical clustering method termed ROCK (robust clustering using links), which can measure the similarity or proximity between a pair of objects. In ROCK, the number of "links" between two objects is computed as their number of common neighbors; a sketch follows below. An agglomerative hierarchical clustering algorithm is then applied: first, the algorithm assigns each object to a separate cluster; clusters are then merged repeatedly according to the closeness between clusters, where closeness is defined as the sum of the number of "links" between all pairs of objects.
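A small sketch of the link computation, treating each object as a set of attribute values and using a Jaccard neighbor test with threshold theta (ROCK's similarity function and threshold are parameters; this instantiation is one common choice, not necessarily the authors' setup):

```python
def rock_links(objects, theta=0.5):
    """link(p, q) = number of common neighbors of p and q, where two
    objects are neighbors if their Jaccard similarity is at least theta."""
    sets = [set(o) for o in objects]
    n = len(sets)
    nbr = [[len(sets[i] & sets[j]) / len(sets[i] | sets[j]) >= theta
            for j in range(n)] for i in range(n)]
    return [[sum(nbr[i][t] and nbr[j][t] for t in range(n))
             for j in range(n)] for i in range(n)]
```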
Gibson et al. [10] propose an algorithm called STIRR (sieving through iterated relational reinforcement), a generalized spectral graph partitioning method for categorical data. STIRR is an iterative approach which maps categorical data to non-linear dynamic systems. If the dynamic system converges, the categorical data can be clustered. Clustering naturally lends itself to this combinatorial formulation. However, STIRR requires a non-trivial post-processing step to identify sets of closely related attribute values, and certain classes of clusters are not discovered by it. Moreover, Zhang et al. [17] argue that STIRR cannot guarantee convergence and therefore propose a revised dynamic system. Squeezer, a one-pass algorithm, puts the first tuple in a cluster and then either puts each subsequent tuple into an existing cluster or starts a new cluster with it, based on a given similarity function. He et al. [13] explore the categorical data clustering (CDC) and link clustering (LC) problems, propose LCBCDC (link clustering based categorical data clustering), and compare the results with Squeezer and K-modes. In reviewing these algorithms, some of the methods, such as STIRR and EM, cannot guarantee convergence, while others have scalability issues. In addition, all of the algorithms share one common assumption: each object can be classified into only one cluster, and all objects in a cluster belong to it with the same degree of confidence [11]. However, in real world applications it is difficult to draw clear boundaries between clusters. Therefore, the uncertainty of the objects belonging to a cluster needs to be considered (Fig. 1).

[Fig. 1. The objects splitting: nine objects (Tiger, Cheetah, Giraffe, Zebra, Ostrich, Penguin, Albatross, Eagle, Viper) and three possible ways of splitting them into clusters.]
One of the first attempts to handle uncertainty is fuzzy K-means. In this algorithm, each pattern or object is allowed to have membership in all clusters rather than a distinct membership in exactly one cluster. Krishnapuram and Keller [16] propose a possibilistic approach to clustering in which the membership of a feature vector in a class does not depend on its memberships in other classes. Krishnapuram et al. also present several fuzzy and probabilistic algorithms to detect linear and quadratic shell clusters. Note that this initial work on handling uncertainty was based on numerical data. Huang proposes a fuzzy K-modes algorithm with a new procedure to generate the fuzzy partition matrix from categorical data within the framework of the fuzzy K-means algorithm [14]. The method finds fuzzy cluster modes when a simple matching dissimilarity measure is used for categorical objects.
By assigning confidence values to objects in different clusters, the core and boundary objects of the clusters can be identified, which provides more useful information for dealing with boundary objects. More recently, Kim et al. [2] have extended the fuzzy K-modes algorithm by using fuzzy centroids to represent the clusters of categorical data instead of the hard-type centroid used in fuzzy K-modes. The use of fuzzy centroids makes it possible to fully exploit the power of fuzzy sets in representing the uncertainty in the classification of categorical data. However, the fuzzy K-modes and fuzzy centroid algorithms suffer from the same problem as K-modes: they require multiple runs with different starting modes to test the stability of the membership fuzziness and to obtain a better solution. This necessitates multiple runs of these algorithms to determine an acceptable value of this parameter. Therefore, there is a need for a categorical data clustering method that can handle uncertainty in the clustering process while providing stable results. One methodology with potential for handling uncertainty is rough set theory, which has received considerable attention in the computational intelligence literature since its development by Pawlak in the 1980s [3]. Unlike fuzzy set based approaches, rough sets require no domain expertise to assign fuzzy memberships, and may still provide satisfactory results for rough clustering. This motivates developing a rough set based approach for categorical data clustering; one such approach, termed standard deviation of standard deviation roughness (SSDR) [21], has been presented and evaluated on large scale data sets.

3. Information-theoretic dependency measure

3.1. Rough set

In 1982, Pawlak introduced the theory of rough sets [3]. This theory was initially developed for a finite universe of discourse in which the knowledge base is a partition obtained by an equivalence relation defined on the universe of discourse. In rough set theory, the data is organized in a table called a decision table. Rows of the decision table correspond to objects, and columns correspond to attributes. In the data set, a class label indicates the class to which each row belongs. The class label is called the decision attribute; the rest of the attributes are the condition attributes. We write C for the condition attributes and D for the decision attributes, where $C \cap D = \emptyset$, and $t_j$ denotes a tuple of the data table. Rough set theory defines three regions based on the equivalence classes induced by the attribute values: the lower approximation, the upper approximation, and the boundary. The lower approximation contains all objects that can be classified with certainty based on the data collected, the upper approximation contains all objects that can possibly be so classified, and the boundary is the difference between the upper and the lower approximation. Hu [8] presented the formal definitions of rough set theory.

Definition 1. Ind(B) is a relation on U. Given two objects $x_i, x_j \in U$, they are indiscernible by the set of attributes $B \subseteq A$ if and only if $a(x_i) = a(x_j)$ for every $a \in B$. That is, $(x_i, x_j) \in \mathrm{Ind}(B)$ if and only if $a(x_i) = a(x_j)$ for all $a \in B$, where $B \subseteq A$.

Definition 2. Given Ind(B), the set of objects $x_i$ having the same values for the set of attributes in B constitutes an equivalence class, $[x_i]_{\mathrm{Ind}(B)}$, also known as the elementary set with respect to B.

Definition 3. Let S = (U, A, V, f) be an approximation space and let Y and X be any subsets of A. Attribute Y is called …
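Definitions 1 and 2 translate directly into code. A minimal sketch, assuming objects are represented as dicts from attribute names to values (the representation is ours, not the paper's):

```python
from collections import defaultdict

def equivalence_classes(U, B):
    """Partition the universe U by the indiscernibility relation Ind(B):
    objects land in the same class iff a(x_i) == a(x_j) for every a in B.
    U is a list of dict-like objects mapping attribute names to values."""
    classes = defaultdict(list)
    for i, x in enumerate(U):
        classes[tuple(x[a] for a in B)].append(i)   # key = B-signature of x
    return list(classes.values())

# e.g. equivalence_classes([{"Fly": 1}, {"Fly": 0}, {"Fly": 1}], ["Fly"])
# -> [[0, 2], [1]]
```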
Definition 4. Let S = (U, A, V, f) be an approximation space, and let X and Y be any subsets of A with X, Y ≠ Ø. The ITDR of attribute Y on attribute X, denoted $X \Rightarrow_H Y$, is defined by the following equation:

$$H(Y_i \mid X_j) = \begin{cases} \sum_{j=1}^{n} Q_j \log_2\!\left(\dfrac{|X_j \cap Y_i|}{|X_j|}\right), & |X_j \cap Y_i| > 0 \\[4pt] 1.0, & |X_j \cap Y_i| = 0 \end{cases} \quad (1)$$

where $H(Y_i \mid X_j)$ is a function on A, and $Q_j = |X_j|/|U|$ is the probability of the equivalence class $X_j$ on U. Obviously $0 \le h \le 1$, where h denotes the value of $H(Y_i \mid X_j)$. Attribute Y is said to depend totally (in a degree h) on attribute X if h = 1; otherwise, Y depends partially on X. Thus, attribute Y depends totally (partially) on attribute X if all (some) elements of the universe U can be uniquely classified into the equivalence classes of the partition U/Y, employing X.
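Read literally, Eq. (1) can be transcribed as below, reusing equivalence_classes from the sketch above and taking $Q_j = |X_j|/|U|$ as stated after the equation. The printed formula is ambiguous about sign and normalization (a log of a ratio at most 1 is non-positive, while the text bounds h by [0, 1]), so this is a literal transcription, not a verified implementation:

```python
import math

def entropy_roughness(U, x, y):
    """Literal transcription of Eq. (1). X_j and Y_i are the equivalence
    classes induced by attributes x and y; Q_j = |X_j|/|U| is the class
    probability named after Eq. (1). Treat the value as illustrative."""
    X = [set(c) for c in equivalence_classes(U, [x])]
    Y = [set(c) for c in equivalence_classes(U, [y])]
    h = 0.0
    for Yi in Y:
        for Xj in X:
            overlap = len(Xj & Yi)
            if overlap == 0:
                h += 1.0                    # the |X_j ∩ Y_i| = 0 branch
            else:
                h += (len(Xj) / len(U)) * math.log2(overlap / len(Xj))
    return h
```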
3.2. Information-theoretic dependency roughness

Mazlak et al. [4] attempt to use rough set theory (RST) for choosing partitioning attributes for clustering. They use a measure called total roughness to determine the crispness of the partition. However, for partitioning, the method starts with binary-valued attributes and uses the total roughness criterion only for multi-valued attributes. This creates a handicap, because the partitioning is done on a binary attribute even when the total roughness for a multi-valued attribute is lower. MMR overcomes this drawback by clustering the objects on all attributes. In addition, MMR proposes a new way to measure data similarities based on the roughness concept: it utilizes a measure termed mean roughness, comparable to that proposed by Mazlak et al. [4] based on RST. The relevant formulas are reproduced below:
$$\mathrm{MMR}_H(Y_i \mid X_j) = \frac{\mathrm{MR}_H(Y_{i[a_1]} \mid X_j) + \cdots + \mathrm{MR}_H(Y_{i[a_{|V(a_i)|}]} \mid X_j)}{|V(a_i)|} \quad (3)$$

where $V(a_i)$ is the set of values of attribute $a_i \in A$.

Definition 7. Given n attributes, the min-mean-min-roughness of attribute $a_i \in Y$ with respect to $a_j \in X$, where $i \ne j$, refers to the minimum of $\mathrm{MMR}_H(a_i \mid a_j)$, denoted $\mathrm{MMMR}_H(Y_i \mid X_j)$, and is obtained by the following formula:

$$\mathrm{MMMR}_H(Y_i \mid X_j) = \min(\mathrm{MMR}_H(Y_1 \mid X_1), \ldots, \mathrm{MMR}_H(Y_m \mid X_n)) \quad (4)$$

The ITDR technique for selecting the partitioning attribute is based on the mean degree of rough entropy; the justification is that a higher degree of rough entropy implies more accuracy in selecting the partitioning attribute. The lower the mean roughness, the higher the crispness of the clustering. Min-roughness determines the best crispness each attribute can achieve, and ITDR determines the best split on the attributes. The ITDR algorithm iteratively divides the group of objects with the goal of achieving better clustering crispness. The algorithm takes the number of clusters, k, as one input and terminates when this pre-defined number k is reached. Next, we present an illustrative example of the ITDR algorithm.

Algorithm. ITDR
Input: Dataset without clustering attributes
Output: Clustering attribute
1. Compute the equivalence classes using the indiscernibility relation on each attribute.
2. Determine the entropy roughness of attribute $a_i$ with respect to all $a_j$, where $i \ne j$.
3. Compute the information-theoretic dependency degree of each attribute with respect to the other attributes.
4. Select a clustering attribute based on the maximum degree of dependency (lowest entropy value).
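Assuming step 2's "entropy roughness" is the measure of Eq. (1), the four steps reduce to a short attribute-selection routine. This sketch averages each attribute's dependency over all other attributes; it is illustrative, not the authors' MATLAB implementation:

```python
def select_clustering_attribute(U, attributes):
    """Steps 1-4 of the ITDR algorithm, under the assumption that the
    'entropy roughness' of step 2 is the measure of Eq. (1). For each
    candidate attribute, average its dependency on all other attributes
    and return the attribute with the maximum mean degree of dependency
    (step 4's 'maximum (lowest entropy value)')."""
    best_attr, best_score = None, float("-inf")
    for ai in attributes:
        others = [aj for aj in attributes if aj != ai]
        mean_dep = sum(entropy_roughness(U, aj, ai)
                       for aj in others) / len(others)
        if mean_dep > best_score:
            best_attr, best_score = ai, mean_dep
    return best_attr
```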
[Table 1. Animal world dataset. Columns: Row, Hair, Teeth, Eye, Feather, Feet, Eat, Milk, Fly, Swim; the data rows are not recoverable from the extraction.]
[Table 3. ITDR calculation; table body not recoverable from the extraction.]
more objects is selected for further splitting (see the sketch below). The algorithm terminates when it reaches a pre-defined number of clusters. This number is subjective and is pre-decided based either on user or domain knowledge.
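Combining the split-the-largest-cluster policy and the stopping rule described here with the attribute selection above gives the following outer-loop sketch; splitting a cluster into the selected attribute's equivalence classes is our assumption about the unstated splitting step:

```python
def itdr_cluster(U, attributes, k):
    """Outer divide-and-conquer loop: repeatedly take the cluster with the
    most objects, pick a clustering attribute for it, and split it into
    that attribute's equivalence classes, stopping once k clusters exist.
    Assumes each selected attribute takes at least two values within the
    cluster being split; a single split may briefly overshoot k."""
    clusters = [list(range(len(U)))]
    while len(clusters) < k:
        big = max(clusters, key=len)        # the cluster with more objects
        clusters.remove(big)
        sub = [U[i] for i in big]
        a = select_clustering_attribute(sub, attributes)
        for eq in equivalence_classes(sub, [a]):
            clusters.append([big[i] for i in eq])
    return clusters
```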
4. Experimental analysis

In order to test ITDR, a prototype implementation was developed in MATLAB and tested on several data sets obtained from the UCI Machine Learning Repository. Validating clustering results is a non-trivial task; here, the purity of the clusters is used to measure cluster quality [5]. The purity of a cluster is defined as

$$\mathrm{Purity}(i) = \frac{\text{number of data in } C}{\text{number of data in } X} \quad (5)$$

where C depicts both the ith cluster and its corresponding class. The overall purity is then defined as

$$\text{Overall purity} = \frac{1}{\text{number of clusters}} \sum_{i=1}^{\text{number of clusters}} \mathrm{Purity}(i) \quad (6)$$

According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 1. Similar measures have been used by Kim et al. [2] and Guha et al. [12]. The Zoo dataset comprises 101 objects, where each data point describes an animal in terms of 18 categorical attributes [23]. Each animal is classified into one of seven classes; therefore, for ITDR, the number of clusters is set at seven.
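Eqs. (5) and (6) amount to a few lines of code. Since the denominator X in Eq. (5) is underspecified in the extraction, this sketch uses the common convention of dividing by the cluster size:

```python
from collections import Counter

def overall_purity(clusters, labels):
    """Cluster purity in the spirit of Eqs. (5)-(6): the fraction of each
    cluster carrying its majority class label, averaged over clusters."""
    per_cluster = [Counter(labels[i] for i in c).most_common(1)[0][1] / len(c)
                   for c in clusters if c]
    return sum(per_cluster) / len(per_cluster)

# e.g. overall_purity([[0, 1], [2, 3]], ["m", "m", "b", "m"]) == 0.75
```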
Table 5 summarizes the results of running the ITDR algorithm on the Zoo dataset. All 101 objects are assigned to clusters and compared against the majority class label of the cluster into which they are classified; the overall purity of the clusters, computed via Table 6, is 85.42%.
5. Conclusion

Most algorithms for clustering categorical data are not designed to handle uncertainty in the data. The majority of clustering techniques implicitly assume that an object can be classified into at most one cluster and that all objects classified into a cluster belong to it with the same degree of confidence. However, in many applications there can be objects with the potential of being classified into more than one cluster. Thus there is a need for a clustering algorithm that can handle this uncertainty in the clustering process. The ITDR algorithm proposed in this paper has potential for clustering categorical attributes in data mining and is capable of handling uncertainty in the clustering process. Unlike other algorithms, ITDR requires only one input, the number of clusters, and has been tested to be stable.

Categorical data clustering has emerged as a new approach to handling uncertainty in the clustering process. In this paper we proposed a new algorithm, ITDR, which is more efficient than most of the earlier algorithms developed in this direction, including MMR, MMeR and SDR. It handles uncertain data using rough set theory. First, we proposed a method in which both numerical and categorical data can be handled; second, by providing the distance of relevance we obtain much better results than MMR, which chooses the table to be clustered according to the number of objects. The computed purity ratio shows its superiority over MMeR. Future enhancements of this algorithm may be possible by considering hybrid techniques like rough-fuzzy clustering or fuzzy-rough clustering.

Acknowledgment

This article is a revised and expanded version of a paper entitled "A Categorical Data Clustering Based on Information-Theoretic Dependency Roughness" presented at the International Symposium on Advanced and Applied Convergence, November 14-16, 2013, Seoul, Korea.

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.is.2014.06.008.

References

[1] Z. Huang, Extensions to the K-means algorithm for clustering large data sets with categorical values, Data Min. Knowl. Discov. 2 (3) (1998) 283-304.
[2] D. Kim, K. Lee, D. Lee, Fuzzy clustering of categorical data using fuzzy centroids, Pattern Recogn. Lett. 25 (11) (2004) 1263-1271.
[3] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982) 341-356.
[4] L.J. Mazlak, A. He, Y. Zhu, S. Coppock, A rough set approach in choosing partitioning attributes, in: Proceedings of the ISCA 13th International Conference (CAINE-2000), 2000, pp. 1-6.
[5] D. Parmar, T. Wu, J. Blackhurst, MMR: an algorithm for clustering categorical data using rough set theory, Data Knowl. Eng. 63 (2007) 879-893.
[6] Z. Pawlak, A. Skowron, Rudiments of rough sets, Inf. Sci. 177 (1) (2007) 3-27.
[7] S. Wu, A. Liew, H. Yan, M. Yang, Cluster analysis of gene expression data based on self-splitting and merging competitive learning, IEEE Trans. Inf. Technol. Biomed. 8 (1) (2004) 5-15.
[8] X. Hu, Knowledge Discovery in Databases: An Attribute Oriented Rough Set Approach, Ph.D. dissertation, University of Regina, 1995.
[9] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. 39 (1) (1977) 1-38.
[10] D. Gibson, J. Gehrke, R. Ramakrishnan, CACTUS: clustering categorical data using summaries, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 73-83.
[11] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, J. Intell. Inf. Syst. 17 (2-3) (2001) 107-145.
[12] S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Inf. Syst. 25 (5) (2000) 345-366.
[13] Z. He, X. Xu, S. Deng, A link clustering based approach for clustering categorical data, in: Proceedings of the WAIM Conference, 2004.
[14] E. Ruspini, A new approach to clustering, Inf. Control 15 (1) (1969) 22-32.
[15] H. Ralambondrainy, A conceptual version of the K-means algorithm, Pattern Recogn. Lett. 16 (11) (1995) 1147-1157.
[16] R. Krishnapuram, J. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst. 1 (2) (1993) 98-110.
[17] Y. Zhang, A. Fu, C. Cai, P. Heng, Clustering categorical data, in: Proceedings of the 16th International Conference on Data Engineering, 2000, pp. 305-324.
[18] E. Han, G. Karypis, V. Kumar, B. Mobasher, Clustering based on association rule hyper-graphs, in: Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997, pp. 9-13.
[19] T. Beaubouef, F.E. Petry, G. Arora, Information-theoretic measures of uncertainty for rough sets and rough relational databases, J. Inf. Sci. 109 (1998) 185-195.
[20] T. Herawan, R. Ghazali, I.T.R. Yanto, M.M. Deris, Rough set approach for categorical data clustering, Int. J. Database Theory Appl. 3 (1) (2010) 33-52.
[21] B.K. Tripathy, A. Ghosh, SSDR: an algorithm for clustering categorical data using rough set theory, Adv. Appl. Sci. Res. 2 (3) (2011) 314-326.
[22] 〈http://xxx.sf.nchc.org.tw/ftp/cs/papers/0412/0412019.pdf〉.
[23] 〈http://archive.ics.uci.edu/ml/datasets/Zoo〉.
[24] I.K. Park, G.S. Choi, J.J. Kang, Categorical data clustering based on information-theoretic dependency, in: ISAAC Conference, 2013, pp. 316-318.