An Analysis of Recent Work on Clustering Algorithms

Daniel Fasulo
Abstract

This paper describes four recent papers on clustering, each of which approaches the clustering problem from a different perspective and with different goals. It analyzes the strengths and weaknesses of each approach and describes how a user could decide which algorithm to use for a given clustering application. Finally, it concludes with ideas that could make the selection and use of clustering algorithms for data analysis less difficult.
1 Introduction
1.1 Informal Problem Definition
Clustering can be loosely defined as the process of organizing objects into groups whose members are similar in some way. There are two major styles of clustering: partitioning (often called k-clustering), in which every object is assigned to exactly one group, and hierarchical clustering, in which each group of size greater than one is in turn composed of smaller groups.

Both hierarchical clustering and k-clustering had been studied extensively by the mid-1970's, and comparatively little clustering research was carried out in the 1980's. In recent years, however, the advent of World Wide Web search engines (and, specifically, the problem of organizing the large amount of data they produce) and the concept of "data mining" massive databases has led to a renewal of interest in clustering algorithms.
The goal of this paper is to examine recent work in clustering. In particular, the focus is on clustering algorithms whose goal is to identify groups of related items in an input set. This is in contrast to certain "clustering-like" problems such as graph partitioning [33] or segmentation problems [23], which organize objects into groups but often have other objectives. As much as possible, this paper also emphasizes the fundamental ideas behind the construction of clusters and attempts to avoid the details of specific clustering applications, such as determining good measures for text document similarity.
The remainder of this section summarizes prior work in clustering and certain related problems, and briefly describes the issues involved in representing the input to clustering algorithms. Section 2 focuses on summarizing and independently critiquing a set of four recent papers on clustering. Finally, Section 3 compares the strengths and weaknesses of the various algorithms and suggests ways in which the positive aspects of each could be synthesized in future work.

For a more detailed introduction to clustering, see the works by Everitt [12], Rasmussen [29], Kaufman and Rousseeuw [21], Jain and Dubes [20], and Gordon [18]. In particular, the works by Everitt and by Jain and Dubes provide clear and comprehensive coverage of the "classical" approaches to clustering.
1.2 k-clustering
In general, k-clustering algorithms take as input a set S of objects and an integer k, and output a partition of S into subsets $S_1, S_2, \ldots, S_k$. By far the most common type of k-clustering algorithm is the optimization algorithm. Optimization algorithms typically assume that the elements of S are drawn from a d-dimensional metric space, usually $\mathbb{R}^d$, and define a cost function $c : \{X : X \subseteq S\} \to \mathbb{R}^+$ which associates a cost with each cluster. The goal of the algorithm is then to minimize $\sum_{i=1}^{k} c(S_i)$, the sum of the costs of the clusters.
The most well-known optimization criterion is the sum-of-squares criterion. Let $x^i_r$ be the rth element of $S_i$, $|S_i|$ be the number of elements in $S_i$, and $d(x^i_r, x^i_s)$ be the distance between $x^i_r$ and $x^i_s$. The sum-of-squares criterion is then defined with the following cost function:

$$c(S_i) = \sum_{r=1}^{|S_i|} \sum_{s=1}^{|S_i|} d(x^i_r, x^i_s)^2. \qquad (1.1)$$
This definition of the clustering problem is known to be NP-hard [15]. Nevertheless, the k-means algorithm of MacQueen [24] is a popular clustering algorithm which uses the sum-of-squares criterion. The algorithm relies on the ability to calculate the centroid of each cluster $S_i$, denoted $\bar{x}^i$. Technically the algorithm optimizes using the cost function

$$c(S_i) = \sum_{r=1}^{|S_i|} d(\bar{x}^i, x^i_r), \qquad (1.2)$$

which can be shown to produce the same results as (1.1) above. Thus the k-means algorithm produces the requisite set of k clusters, along with the centroid (often termed the representative element) for each.
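To make the iteration concrete, here is a minimal sketch of a k-means procedure in Python (a Lloyd-style loop; the function name, initialization, and stopping rule are illustrative choices rather than MacQueen's exact formulation):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: alternate between assigning each point to the
    nearest centroid and recomputing each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct input points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: label each point with the index of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```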
Since many types of data do not belong to spaces in which the mean is defined, Kaufman and Rousseeuw [21] have developed a similar algorithm for what they term the "k-medoids" problem. Their algorithm, called PAM (for "Partitioning Around Medoids"), relies on the ability to find the medoid of each $S_i$, denoted $\hat{x}^i$. Note that $\hat{x}^i \in S_i$, and $\hat{x}^i$ is chosen to minimize $\sum_{r=1}^{|S_i|} d(\hat{x}^i, x^i_r)$. This leads to the optimization criterion

$$c(S_i) = \sum_{r=1}^{|S_i|} d(\hat{x}^i, x^i_r). \qquad (1.3)$$
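As a small illustration of criterion (1.3), the following sketch picks the medoid of a cluster from a precomputed distance matrix; the names are illustrative, and this is not Kaufman and Rousseeuw's PAM itself, only its inner cost computation:

```python
import numpy as np

def medoid_index(dist_matrix, members):
    """Return the medoid of a cluster: the member minimizing the summed distance
    to all other members, as in criterion (1.3). `dist_matrix` is the full n-by-n
    distance matrix over the input set; `members` is a list of member indices."""
    sub = dist_matrix[np.ix_(members, members)]   # pairwise distances within the cluster
    return members[int(sub.sum(axis=1).argmin())]
```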
Kaufman and Rousseeuw's work has spawned a recent series of papers in the data mining
community [28, 11, 10, 39, 1].
An alternate optimization criterion has been proposed by Gonzalez [17]. Instead of minimizing $\sum_{i=1}^{k} c(S_i)$, Gonzalez minimizes $\max_{1 \le i \le k} c(S_i)$, where $c(S_i)$ is given by

$$c(S_i) = \max_{x^i_r, x^i_s \in S_i} d(x^i_r, x^i_s). \qquad (1.4)$$
in Section 2.3. Hartuv et al. [19] also use the similarity graph notion. They repeatedly
apply a minimum cut algorithm to form clusters. Additional discussion of the relationship
between certain graph problems and clustering can be found in a paper by Matula [25] and
in Jain and Dubes' book.
1.3 Hierarchical Clustering
Hierarchical clustering is an appealing approach to problems in which the input set S does not have a single "obvious" partition into well-separated clusters. The goal of a hierarchical clustering algorithm is to produce a tree T(S) in which the nodes represent subsets of S. In particular, S itself is at the root of the tree, the leaves comprise the individual elements of S, and internal nodes are defined as the union of their children. A path down a well-constructed tree should then visit sets of increasingly tightly-related elements, and any set of nodes such that every path from a leaf to the root hits exactly one element in the set can be considered a partition of S into clusters. Hence, given the tree, the user can conveniently trade off between the number of clusters and the compactness of each cluster.
There are two major types of hierarchical clustering algorithms. Divisive algorithms
work by recursively partitioning S until singleton sets are achieved; agglomerative algo-
rithms work by beginning with singleton sets and merging them until S is achieved. This
section will focus on agglomerative methods, which are far more common.
Most agglomerative methods follow the same conceptual framework. First, the elements are placed into a list of singleton sets $S_1, S_2, \ldots, S_n$. Then a cost function is used to find the pair of sets $\{S_i, S_j\}$ from the list which is "cheapest" to merge. Finally, $S_i$ and $S_j$ are removed from the list of sets and replaced with $S_i \cup S_j$. This process is repeated until there is only one set remaining. Hence, the major difference between agglomerative clustering algorithms is the definition of the cost of merging two sets $S_i$ and $S_j$, which will be denoted $c(S_i, S_j)$. Figure 1 summarizes the most well-known cost functions and lists examples of algorithms that use each.
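The generic merge loop can be sketched as follows, here instantiated with the single-link cost (the minimum distance between an element of one set and an element of the other); this is an illustrative, deliberately unoptimized sketch, not an efficient implementation:

```python
import numpy as np

def single_link_agglomerative(dist_matrix, target_k=1):
    """Minimal sketch of the generic agglomerative loop with the single-link
    merge cost c(Si, Sj) = min distance between an element of Si and one of Sj.
    Returns the clusters (as index lists) once `target_k` of them remain."""
    clusters = [[i] for i in range(len(dist_matrix))]
    while len(clusters) > target_k:
        # Find the cheapest pair of clusters to merge under the single-link cost.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = dist_matrix[np.ix_(clusters[a], clusters[b])].min()
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        # Replace the pair with its union.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```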
In addition to the methods in Figure 1, Ward's method [36] is also frequently used. Ward's method can be seen as the hierarchical clustering equivalent to the sum-of-squares criterion for k-clustering. As in the above methods, at each step of the algorithm there is a list of clusters $S_1, \ldots, S_n$ and the goal is to choose a pair to merge. As in the sum-of-squares method, the cost of each $S_i$ is defined as

$$c(S_i) = \sum_{r=1}^{|S_i|} \sum_{s=1}^{|S_i|} d(x^i_r, x^i_s)^2. \qquad (1.5)$$

The cost of the current solution is then defined as $\sum_{i=1}^{n} c(S_i)$, and the pair of sets to merge is chosen to minimize the increase in the cost of the total solution. Also as in the sum-of-squares method, algorithms implementing Ward's method typically rely on the ability to calculate the centroid of each cluster.
Hierarchical methods suffer criticism similar to that of statistical k-clustering methods. The average-link, total-link, and Ward's methods tend to favor spherical clusters, while single-link clustering is more analogous to density-based methods and can produce undesirably "elongated" clusters. Nevertheless, hierarchical methods are widely used, particularly in the document clustering community [5, 37, 35].
1.4 Input to Clustering Algorithms
Fundamentally, every clustering algorithm operates on a set of input vectors $x_1, x_2, \ldots, x_n$ in a d-dimensional space. It is important to note that the input spaces from different clustering problems may have different mathematical properties, and that these properties can influence which clustering algorithms may be applied to the data. For example, methods which calculate centroids cannot be used with spaces whose dimensions are not numerical, and the method of Gonzalez in Section 1.2 must be used with vectors from a metric space.
In practice, many clustering algorithms do not directly examine or manipulate the input vectors. For example, the algorithms of Kaufman and Rousseeuw all take as input a similarity matrix M in which entry $M_{ij}$ represents the degree of similarity between $x_i$ and $x_j$. This matrix is assumed to be pre-calculated in an application-specific step. Their book goes into great detail on how to calculate similarity coefficients for different types of data, including non-numerical data.
A closely related style of input is the distance matrix D, in which $D_{ij}$ represents the distance between element i and element j. This distance relation must typically follow the definition of a distance metric on the input space. A major advantage of this method is its simplicity; the Euclidean or Manhattan distance metrics can be used on any problem with numerical vectors.
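For numerical vectors, such a matrix can be computed directly; the following short sketch (names illustrative) builds D under either metric:

```python
import numpy as np

def distance_matrix(points, metric="euclidean"):
    """Build the n-by-n distance matrix D with D[i, j] = d(x_i, x_j),
    using either the Euclidean or Manhattan metric on numerical vectors."""
    diffs = points[:, None, :] - points[None, :, :]   # shape (n, n, d)
    if metric == "euclidean":
        return np.sqrt((diffs ** 2).sum(axis=2))
    elif metric == "manhattan":
        return np.abs(diffs).sum(axis=2)
    raise ValueError("unsupported metric: " + metric)
```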
Lastly, some clustering methods such as single-link hierarchical clustering and reciprocal
nearest neighbor clustering [9, 27] require as input only a list of the input vectors and a
function which returns the nearest neighbor of any input. This method has the advantage
that the input is not of size $n^2$, which may be prohibitive for some applications.
matrix of each distribution. Let $\mu_i$ and $\Sigma_i$ denote the mean vector and covariance matrix respectively for $E_i$. This framework is quite general; for example, observe that if $\Sigma_i = \sigma^2 I$ for all i then maximizing (2.2) is equivalent to optimizing with the sum-of-squares criterion. Given this parameterization, a standard iterative expectation-maximization (EM) process can be used to find the values of the $\mu_i$, the $\Sigma_i$, and $\gamma$.
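For reference, the following is a minimal EM sketch for a Gaussian mixture with unconstrained covariance matrices; it is an illustrative implementation (the function name, the initialization, and the small regularization term are assumptions of this sketch), not Banfield and Raftery's own procedure:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, k, iters=100, seed=0):
    """Minimal EM sketch for a k-component Gaussian mixture with full covariance
    matrices. Returns mixing weights, means, and covariances."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        dens = np.column_stack([
            weights[j] * multivariate_normal(means[j], covs[j]).pdf(X)
            for j in range(k)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        Nj = resp.sum(axis=0)
        weights = Nj / n
        means = (resp.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return weights, means, covs
```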
The major problem with this model is that if the parameters $\mu_i$ and $\Sigma_i$ are allowed to vary freely for each distribution then finding a global optimum for (2.2) can be too time and space intensive for large problems. On the other hand, previously-suggested constraints, such as $\Sigma_i = \sigma^2 I$, are too limiting. A major contribution of Banfield and Raftery's work is to reparameterize the distributions in a way which allows more flexibility in the characteristics of each distribution while still being solvable for larger problems than the unconstrained model. Specifically, they assume each distribution is multivariate normal as before, but decompose $\Sigma_i$ via singular value decomposition as follows:

$$\Sigma_i = D_i \Lambda_i D_i^T, \qquad (2.3)$$

where $D_i$ is the matrix of normalized eigenvectors and $\Lambda_i$ is the diagonal matrix of eigenvalues of $\Sigma_i$. Since the covariance matrix is positive semi-definite, this decomposition is always possible. Under this parameterization, the orientation of $E_i$ is determined by $D_i$, and its shape and density contours are determined by $\Lambda_i$. The matrix $\Lambda_i$ can then be further parameterized as $\lambda_i A_i$, where $\lambda_i$ is the principal eigenvalue of $\Sigma_i$. Hence $\lambda_i$ determines the "volume" of each cluster and $A_i$ determines the shape.
Banfield and Raftery use this parameterization to derive several constrained clustering models. They propose one model in which $A_i = I$ for all i; under this assumption, the distributions are all spherical but have different sizes. Another proposed model is to constrain all the $A_i$'s to be equal, but allow $\lambda_i$ and $D_i$ to vary for each distribution. This results in clusters which have the same shape but different orientations and sizes. Finally, Banfield and Raftery demonstrate the derivation of a model which is not purely Gaussian, but instead has one designated dimension in which the points vary uniformly in a cluster-determined interval. While these constrained models do not apply to all sets of input, when used appropriately they can significantly speed up the clustering process by reducing the number of model parameters.
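The volume/shape/orientation parameterization can be illustrated by recovering $\lambda_i$, $A_i$, and $D_i$ from a covariance matrix via its eigendecomposition; the sketch below is only an illustration of the algebra described above:

```python
import numpy as np

def decompose_covariance(sigma):
    """Decompose a covariance matrix as sigma = lambda_ * D @ A @ D.T, where
    lambda_ (the principal eigenvalue) controls volume, A (a diagonal matrix
    with leading entry 1) controls shape, and D (the eigenvectors) controls
    orientation."""
    eigvals, D = np.linalg.eigh(sigma)          # eigenvalues in ascending order
    eigvals, D = eigvals[::-1], D[:, ::-1]      # reorder so the principal one comes first
    lambda_ = eigvals[0]
    A = np.diag(eigvals / lambda_)              # shape matrix, normalized by the volume term
    return lambda_, A, D

# Sanity check: sigma is recovered exactly as lambda_ * D @ A @ D.T.
```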
Banfield and Raftery also derive a statistic called the approximate weight of evidence (AWE) which estimates the Bayesian posterior probability of a clustering solution produced via the above process. This statistic can be used to compare the results obtained with different mixture models and/or values of k. In subsequent work by Raftery and Fraley [13], the AWE statistic has been replaced by an alternative approximation called the Bayesian information criterion (BIC) derived by Schwarz [30].

Finally, a common criticism of mixture models (and related methods such as k-means) is that they do not address the problem of "noise", defined loosely as input elements which
do not belong to any cluster. Banfield and Raftery address this shortcoming by formally defining a notion of noise as a Poisson process with intensity $\lambda$ which distributes points throughout the space containing all of the input points. Suppose that $\gamma_r = 0$ if $x_r$ is considered noise, and that $E_0$ is the set of all noise points. Then (2.2) can be generalized to

$$L(\theta, \gamma, \lambda) = \frac{(\lambda V)^{|E_0|} e^{-\lambda V}}{|E_0|!} \prod_{r=1}^{n} f_{\gamma_r}(x_r \mid \theta), \qquad (2.4)$$

where V denotes the hypervolume of the space containing the input points.
Despite Banfield and Raftery's attempts to make clustering via mixture models more practical, the technique is still relatively unknown compared to older approaches such as k-means and single-link clustering. There are several probable causes.

First, the technique does not focus on efficiency. No attempt is made to analyze the time or space required for the algorithm as a function of input size, and the examples upon which the algorithm is run are in general quite small. Hence, clustering by mixture models may not be suitable for the currently trendy applications of clustering in computer science, since these often impose severe time and space limitations. At a minimum, further study should be conducted to determine the practical limits of the algorithm as a function of input size and dimensionality.
Second, a large degree of manual intervention is required. As currently implemented,
the user is required to specify to the algorithm a model and a number of clusters along
with the data set. The user can then vary the model and number of clusters and use the
AWE/BIC statistic to compare the results. The user can also alter the results by selecting
which attributes of the data to use as input, as well as how those attributes are represented.
It would be interesting to see an attempt to make the computer optimize over all of these
parameters simultaneously instead of splitting the task between the user and the computer.
This idea will be discussed further in Section 3.2.
Finally, mixture models rely on the assumption that the data fits a Gaussian distribution.
This may not be true in many cases; but even worse, the data may not be numerical, making
this approach totally inapplicable. In particular, most databases contain a large amount
of categorical data, meaning that each item has attributes which are drawn from a small
set of possibilities over which there is no inherent notion of distance. Examples would be
the color of a person's hair or the manufacturer of a car. The next section describes an
approach which specializes in this type of data.
2.2 Gibson, Kleinberg, and Raghavan: Dynamical Systems
This section describes the paper "Clustering Categorical Data: An Approach Based on Dynamical Systems" by Gibson, Kleinberg, and Raghavan [16]. Their approach relies on a simple process which iteratively computes weights on the vertices of a graph until a fixed point is reached; in this respect it is similar to the work of Kleinberg on finding authoritative web sources [22] and to the work of Teng et al. on spectral graph partitioning [33, 26].

1. For each node v, update $w_v$ as follows:
   For each tuple $\tau = \{v, u_1, \ldots, u_{d-1}\}$ containing v do
       $x_\tau \leftarrow \oplus(w_{u_1}, \ldots, w_{u_{d-1}})$
   $w_v \leftarrow \sum_\tau x_\tau$
2. Normalize the configuration such that the sum of the squares of the weights in each field is 1.

Figure 2: The configuration update procedure of Gibson et al. The combining operator $\oplus$ can take on a number of different definitions.
In this approach, the input data is a set $T = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of tuples with d fields, each of which can take one of a small set of values. A key assumption of this approach is that this set is not metric, so one can determine whether two tuples have the same value for a particular field, but no additional information such as degree of similarity is associated with non-matching values.

A convenient way to view the input is as a graph in which the vertices are the set of all values appearing in the input tuples. In this view of the data, the values are referred to as nodes and the set of nodes is denoted $V = \{v_1, v_2, \ldots, v_m\}$. Each tuple can be considered a path through the graph, and the edge set of the graph is simply the collection of edges in the paths. Associated with each node v in the graph is a weight $w_v$, and the vector w containing the weights of all of the nodes is referred to as a configuration of the graph.
At the core of this approach is a function f which maps the current configuration to the next configuration. The idea is to repeatedly apply f until a fixed point (referred to by the authors as a basin) is reached; i.e., $f(w) = w$. The function f is defined procedurally in Figure 2. In order to fully specify f, the combining operator $\oplus$ must be defined. The operators considered in the paper include the following:

The product operator $\prod$: $\oplus(w_1, \ldots, w_d) = w_1 w_2 \cdots w_d$.
The $S_p$ operator: $\oplus(w_1, \ldots, w_d) = ((w_1)^p + (w_2)^p + \cdots + (w_d)^p)^{1/p}$.
The $S_\infty$ operator: $\oplus(w_1, \ldots, w_d) = \max\{|w_1|, \ldots, |w_d|\}$.
In addition to choosing the combining operator, the user must also choose an initial configuration. The authors suggest two methods. In order to explore the data without introducing any bias, one can use random or uniform node weights for the initial configuration. On the other hand, to bias the results towards a grouping based on a certain set of values, one can give those values high initial weight.
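A minimal sketch of the update-and-normalize iteration of Figure 2, instantiated with the $S_p$ combining operator, might look like the following; representing each node as a (field, value) pair and the tuple encoding are illustrative assumptions of this sketch:

```python
import math
from collections import defaultdict

def iterate_configuration(tuples, weights, p=2, iters=20):
    """Sketch of the Figure 2 update: each node's weight becomes the sum, over the
    tuples containing it, of the S_p combination of the other values' weights,
    followed by per-field normalization so each field's weights have unit norm.
    `tuples` is a list of tuples, each a list of (field, value) pairs; `weights`
    is the initial configuration, a dict keyed by (field, value)."""
    for _ in range(iters):
        new_w = defaultdict(float)
        for tup in tuples:
            for field, value in tup:
                others = [weights[(f, v)] for f, v in tup if (f, v) != (field, value)]
                combined = sum(abs(w) ** p for w in others) ** (1.0 / p)   # S_p operator
                new_w[(field, value)] += combined
        # Normalize: the sum of squared weights within each field becomes 1.
        norms = defaultdict(float)
        for (field, value), w in new_w.items():
            norms[field] += w * w
        weights = {
            (field, value): (w / math.sqrt(norms[field]) if norms[field] > 0 else 0.0)
            for (field, value), w in new_w.items()
        }
    return weights
```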
Gibson et al. claim that relatively little has been formally proved about the behavior
of dynamical systems such as the one outlined above. However, they do present a few
formal results in their paper. To avoid unnecessary repetition of the paper these results are
summarized in a non-rigorous fashion below; see the paper for a more technical presentation.
Result 2.2.1 For every set T of tuples and every initial configuration w, $f^i(w)$ converges to a fixed point as $i \to \infty$ when $S_\infty$ is used as the combining rule.

Result 2.2.2 If $T_a$ and $T_b$ are two sets of tuples with no field values in common and the input set is $T_a \cup T_b$, the chance that the iterative process will converge to weights which "distinguish" the values in $T_a$ from those in $T_b$ grows quickly as the sets get more unequal in size.

Result 2.2.3 If the dynamical system has multiple basins, they can all be discovered by starting iteration with a "relatively small" set of random initial configurations.
These results hint at a methodology for using this type of dynamical system for data exploration. One can apply the iterative process to a set of tuples until it converges (which is proven to happen in some cases by Result 2.2.1 above, and happens anecdotally for a wider range of combining rules according to the authors). At that point, some nodes will have high weights and some will have lower weights. Result 2.2.2 indicates that by separating tuples containing high-weight nodes from those containing low-weight nodes, one can often find a partition of the input set that has some semantic meaning. Again, the authors provide additional anecdotal evidence to support this conclusion. Finally, Result 2.2.3 shows that runs from different initial configurations can often find different meaningful groupings within the data. Since the theorems in the paper are not sufficient to fully support the behavioral claims, the authors present a large number of additional empirical results in support of this methodology.
In addition to this methodology, the authors suggest another approach inspired by spectral graph partitioning. Instead of operating on one configuration at a time, the user can maintain several configurations $w^{(1)}, w^{(2)}, \ldots, w^{(m)}$ simultaneously. The user can then perform the following two steps until $w^{(1)}$ achieves a fixed point:

1. Update $w^{(i)} \leftarrow f(w^{(i)})$ for $i = 1, 2, \ldots, m$.
2. Update the set of vectors $w^{(1)}, w^{(2)}, \ldots, w^{(m)}$ to be orthonormal.
The second step, in which the vectors are made orthonormal, can introduce negative weights into the configurations; the authors claim that separating the positive-weight nodes from the negative-weight nodes in the various configurations can result in an informative partition of the data. They term configuration $w^{(1)}$ after iteration the "principal basin", and the others "non-principal basins."
This approach shows several promising features. It converges quickly; fewer than 20 iterations are required in most of the examples in the paper. The authors also demonstrate that it can identify "clusters" even in the presence of irrelevant fields and large numbers of randomly generated "noise" inputs. Also, the non-principal basins can apparently be used to separate multiple clusters within the same input set.
Nevertheless, this approach has several weaknesses. The most basic is the current lack of both theoretical and practical understanding of the behavior of the algorithm. Theoretically, convergence is not guaranteed with most combining operators, and there are few results characterizing exactly what types of clusters may be identified by this method. Partially because of the lack of theoretical understanding, there is no clearly defined methodology for using this approach on practical clustering problems. The authors allude to "masking" and "augmenting" certain terms in the combining operator, but do not provide insight into when or how to perform these modifications. They also provide little in the way of insight on how to choose the right combining operator or initial state for a specific task. Until these issues are addressed, this approach is likely to remain little more than a curiosity.
Additionally, the fact that the approach deals only with categorical data can be a drawback. Categorical data is more general than numerical data, because even numerical vectors can be treated as categorical data. However, doing so discards useful similarity/distance information, and should probably be avoided if possible. The next section describes a graph-based approach which, unlike the papers considered so far, can deal with any type of data for which there is a well-defined notion of similarity.
2.3 Ben-Dor and Yakhini: Clique Graphs
Clustering has recently become popular in computational biology as a technique for analyzing DNA microarray data [8, 3, 19]. A pair of recent papers, one by Ben-Dor and Yakhini [3] and the other by Hartuv et al. [19], stand out in this literature because they describe novel clustering algorithms that could potentially be useful for many applications. The papers contain many similar ideas, but this section will focus exclusively on the work by Ben-Dor and Yakhini.
Underlying Ben-Dor and Yakhini's approach is a model of input called the corrupted clique model. The basic idea of this model is that ideally, all elements within each cluster should be similar to one another, and not similar to any elements of other clusters. Hence, the similarity graph for the data should appear as a set of vertex-disjoint cliques; this type of graph is called a clique graph. In practice, however, the similarity relation is usually approximate, so the actual similarity graph derived from the data has some extra and some missing edges in it when compared to the ideal similarity graph. In the corrupted clique model, one assumes that the input is a clique graph that has been corrupted by adding each non-existing edge with probability $\alpha$ and removing each existing edge with probability $\alpha$. The goal of the clustering algorithm is thus to find the original clique graph given the corrupted version.
Ben-Dor and Yakhini present two algorithms to accomplish this task. The first is a theoretical algorithm, about which they prove several performance results. The second is a practical heuristic implementation based on the ideas of the theoretical algorithm.
Several definitions are required to describe the theoretical algorithm. (In order to be more consistent with the other sections, the notation here has been changed significantly from that in the original paper.) Suppose that V denotes the set of vertices in the input graph, and that V is a union of disjoint sets $V = E_1 \cup E_2 \cup \cdots \cup E_k$, where each $E_i$ represents a cluster. Suppose that for $v \in V$, $C(v) = i$ if and only if $v \in E_i$, so C(v) is the label of the cluster which contains v.

A core V' of V is defined as $V' = E'_1 \cup E'_2 \cup \cdots \cup E'_k$, where each $E'_i$ is a non-empty subset of $E_i$. The core V' classifies an input $v \in V \setminus V'$ with the following function:

$$C_{V'}(v) = \arg\max_{1 \le i \le k} \deg(v, E'_i)/|E'_i|. \qquad (2.5)$$

Intuitively, $C_{V'}(v)$ represents the cluster which appears to contain v based only on the similarity of v to elements in the core. Hence, if $C_{V'}(v) = C(v)$, then V' correctly classifies v. The essence of the clustering problem in this approach is to identify a small core which correctly classifies all of the input elements.
The theoretical algorithm relies upon the following result, stated non-technically here for brevity.

Result 2.3.1 Let m denote the size of the smallest cluster in the input. If $\alpha$ is not too large and m is not too small, then with high probability a random small subset of V contains a core which correctly classifies all inputs.
Based on this result, the theoretical algorithm can be summarized as follows:

1. Consider a small subset V' of V.
2. Consider all ways of partitioning V' into k non-empty clusters. Each such partition is called a core candidate.
3. With each core candidate, classify all of the remaining points. Keep the classification which results in the clique graph most similar to the input graph.
The practical algorithm, which the authors call Cast, works slightly differently. It requires no prior knowledge of the number of clusters, and avoids brute-force enumeration of core candidates. It takes as input a similarity function $s : V \times V \to [0, 1]$ and a coefficient t, and defines the affinity $a(v, E)$ of an element v to a cluster E as

$$a(v, E) = \sum_{u \in E} s(u, v). \qquad (2.6)$$

An element v is said to have high affinity for E if $a(v, E) \ge t|E|$; otherwise it has low affinity for E. The algorithm operates by constructing one cluster at a time. In general, it alternates between adding high-affinity elements to a cluster and removing low-affinity elements. When this process stabilizes, the cluster is considered finished, and a new cluster is started. The process stops when all elements have been assigned to a cluster.
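The alternating add/remove process can be sketched as follows; this is a hedged approximation of the idea described above (with an iteration cap added to guard against oscillation), not the authors' exact Cast implementation:

```python
import numpy as np

def cast_like_clustering(sim, t, max_rounds=100):
    """Sketch of the affinity heuristic: grow one cluster at a time, alternately
    adding high-affinity elements and removing low-affinity ones until the
    cluster stabilizes. `sim` is an n-by-n similarity matrix with entries in
    [0, 1]; `t` is the affinity threshold coefficient."""
    n = len(sim)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        cluster = {unassigned.pop()}          # seed with an arbitrary unassigned element
        for _ in range(max_rounds):           # cap the rounds to guard against oscillation
            changed = False
            # Add unassigned elements whose affinity to the cluster is high.
            for v in list(unassigned):
                if sim[v, list(cluster)].sum() >= t * len(cluster):
                    cluster.add(v)
                    unassigned.remove(v)
                    changed = True
            # Remove members whose affinity to the cluster has become low.
            for v in list(cluster):
                if len(cluster) > 1 and sim[v, list(cluster)].sum() < t * len(cluster):
                    cluster.remove(v)
                    unassigned.add(v)
                    changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
    return clusters
```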
The authors present results of the practical algorithm on both simulated and real biological data; the results on simulated data are impressive in terms of accuracy, but the authors do not report the running time or storage requirements of the algorithm. Hence, the scalability of the method is difficult to determine. However, the concept of identifying a high-quality "core" with a small number of elements which can be used to classify the remaining points hints at a useful way of scaling the algorithm: use a relatively small set of elements to generate the core, then use a simpler strategy to classify the remaining inputs and adjust the clusters as needed. A similar strategy has been successfully applied in the data mining community in systems such as Clarans [28] and Birch [39].
Otherwise, the algorithm seems very appealing for data whose similarity patterns fit the corrupted clique model. However, it is important to note that by assuming a similarity matrix as input, the authors have avoided addressing issues such as how to select which attributes to use for clustering and how to represent them to the algorithm. The next section describes another general clustering algorithm that addresses these and other pragmatic concerns.
2.4 Agrawal, Gehrke, et al.: Subspace Clustering
This section describes the paper "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications" by Agrawal, Gehrke, Gunopulos, and Raghavan [1]. The paper describes the Clique clustering algorithm, which is targeted specifically at data mining large databases. In particular, the authors identify three goals for practical data mining systems and argue that Clique satisfies all of them. These goals are summarized below:
Effective Treatment of High Dimensionality: Each item in a database may have a large number of attributes, many of which may be irrelevant or misleading with respect to the formation of clusters.

Interpretability of Results: Many applications require that the algorithm produce a simple "summary" of each cluster for the user.

Scalability and Usability: The clustering algorithm must be fast and easy to use even on large databases, and be insensitive to noise and to the order in which the data is read.
The Clique system takes a three-step approach to clustering. First, a set of "subspaces" is chosen in which to cluster the data. Then, clustering is performed independently in each subspace. Finally, a compact summary of each cluster is generated in the form of a disjunctive normal form (DNF) expression.
Suppose the input is a set of n d-dimensional numerical vectors from the space $A = A_1 \times A_2 \times \cdots \times A_d$, where each $A_i$ is an interval. Suppose furthermore that each $A_i$ is partitioned into $\xi$ subintervals, where $\xi$ is a user-supplied parameter. Applying such partitions in the original d-space divides the space into a set of d-dimensional axis-aligned rectangles which are termed regions by the authors. A region is termed dense if the proportion of input vectors contained within the region exceeds $\tau$, another user-supplied parameter. (The authors show that the method can be modified to handle categorical data, but presenting it in the context of numerical data is more intuitive.)
Now consider a graph in which the vertices are dense regions, and there is an edge between vertices if the corresponding regions in space share a face. A cluster can then be succinctly defined as a connected component in this graph. Also, observe that a region can be described simply as a conjunction of range checks in each of the dimensions; hence, a cluster can be described as a disjunction of the conjunctions describing the regions which comprise it.
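For the full-dimensional space, the dense-region step and the connected-component step can be sketched as follows; the parameter names xi and tau mirror the user-supplied parameters above, and the grid construction details are illustrative assumptions of this sketch:

```python
import numpy as np
from collections import defaultdict

def dense_region_clusters(points, xi, tau):
    """Bin points into a grid with `xi` intervals per dimension, keep the regions
    whose share of the input exceeds `tau`, and group face-adjacent dense regions
    into clusters by computing connected components."""
    n, d = points.shape
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Map each point to the index of its grid region (a tuple of interval indices).
    cells = np.clip(((points - lo) / (hi - lo + 1e-12) * xi).astype(int), 0, xi - 1)
    counts = defaultdict(int)
    for cell in map(tuple, cells):
        counts[cell] += 1
    dense = {cell for cell, c in counts.items() if c / n > tau}
    # Connected components over dense regions sharing a face (differ by 1 in one axis).
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            cell = stack.pop()
            component.append(cell)
            for axis in range(d):
                for delta in (-1, 1):
                    nbr = cell[:axis] + (cell[axis] + delta,) + cell[axis + 1:]
                    if nbr in dense and nbr not in seen:
                        seen.add(nbr)
                        stack.append(nbr)
        clusters.append(component)
    return clusters
```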
These definitions are sufficient to describe how the authors perform clustering in the full-dimensional input space. First, they identify the dense regions, then use the above graph representation to find the clusters. Second, a DNF formula is generated for each cluster by taking a disjunction of the descriptions of each region in the cluster and then heuristically simplifying the resulting expression.
However, the authors are not content to perform clustering only on the full-dimensional data; they also want to consider what they call the subspaces of the original data. Specifically, the input space A has $2^d$ subspaces which can be formed by selecting all possible subsets of the original dimensions to represent the input data. During a preprocessing step, the authors use an algorithm to identify every subspace which contains at least one dense region. The clustering stages defined above are then run separately on each such subspace. Because it may have to consider every possible subspace of the original dimensions, this preprocessing step takes exponential time in the number of dimensions. However, the authors argue that using various tricks and heuristics one can keep the actual time "reasonable" on real data.
Instead of relying on subspaces, traditional approaches to dimensionality reduction have relied upon singular value decomposition of the original data [7, 14]. The idea is that the data can be transformed to the coordinate system defined by its eigenvectors, and then those axes corresponding to small eigenvalues can be discarded. The authors argue that this approach complicates the analysis of the final clusters, since they are formed in terms of linear combinations of the original input values. They also demonstrate several situations in which "obvious" clusters are obscured by the SVD approach.
The main contribution of this paper seems to lie in its concrete definitions of the desirable properties of approaches to data mining, and the unique concept of searching for clusters in the subspaces of the input set. However, the specific approach taken by Clique is easily criticized. It makes use of an exponential-time algorithm to analyze the subspace structure. In addition, the parameters $\xi$ and $\tau$ seem obscure, difficult to choose, and potentially limiting. For example, different clusters may have different "densities", making a single value of $\tau$ for the whole data set impractical, and using the wrong value can obviously be disastrous. Similarly, differences in the distributions in each dimension may call for different optimal values of $\xi$. On the other hand, adding even more parameters, such as a separate value of $\xi$ for each dimension, simply adds to the difficulty of tuning the algorithm for a data set.
the clusters are assumed to be spherical, and in the case of single-link hierarchical clustering the clusters are assumed to be "well separated" so that incorrect links are not made.
Analysis based on these ideas can be applied to the algorithms in Section 2. If the
data is numerical and the clusters are believed to be spherical or ellipsoidal, then
approaches based on mixture models may be appropriate. On the other hand, if
the data is non-numerical and/or little is known a priori about the geometry of the
clusters, one of the graph-based approaches may be more appropriate.
In general, categorical data is best handled by the approaches of Ben-Dor and Gibson. The former is probably the best possibility when a well-defined notion of similarity is available for the data in question, while the latter is a better heuristic approach for "exploring" data and finding clusters which arise from the co-occurrence of certain categorical values.
Mixed data is handled best by the approaches of Ben-Dor and Agrawal. Ben-Dor's
more rigorous approach once again has the advantage in the presence of a known
similarity measure for the data, whereas Agrawal's approach is best used when it is
unclear which attributes of the data should be examined while clustering.
The most general case, in which little is known about the data or the clusters, is not handled well by any of the algorithms, since they all require that the user have sufficient knowledge to choose appropriate input parameters. This unfortunate case is discussed more fully in Section 3.2 below.
Scalability: Despite the ongoing exponential increases in the power of computers, scalability remains a major issue in many clustering applications. In commercial data mining applications, the quantity of the data to be clustered can far exceed the main memory capacity of the computer, making both time and space efficiency critical; this issue is addressed by clustering systems in the database community such as Birch [39]. In other cases, such as clustering algorithms embedded within World Wide Web search engines, time efficiency is imperative because search engine users want results very quickly.
Scalability is not directly addressed in Banfield's or Ben-Dor's papers, while the other two papers present empirical results showing that their algorithms can handle large input sets, although the time required by Agrawal's approach grows quickly with the number of dimensions of the input. Statistical mixture models often require quadratic space and the EM algorithm converges relatively slowly, making scalability an issue. Ben-Dor's theoretical algorithm makes use of sampling to reduce the amount of work needed to form the "cores" of each cluster, but the practical algorithm does not make use of this methodology. Of course, sampling can be used to scale any clustering algorithm; this idea will also be mentioned again in Section 3.2.
Noise: While traditional methods such as those described in Section 1 tend to ignore the problem of noise, all of the algorithms presented in Section 2 deal with noise in some way or another. In the mixture model approach, noise is modeled as a Poisson process and the clustering procedure explicitly labels some inputs as noise. In the dynamical systems and subspace approaches, noise points are simply left out of the clusters. Specifically, in the former approach the authors claim that these elements do not achieve high enough weights to be placed in clusters, while in the latter noise elements tend to fall outside "dense regions" and are thus ignored. Finally, Ben-Dor's system does not specifically address the issue of noise, but under the clique graph model, noise points will tend to fall into extremely small clusters (often singletons) which can subsequently be ignored.
Result Presentation: In many practical clustering applications, it is useful if a succinct "summary" of each cluster can be given to the user. All of the systems within this paper except Ben-Dor's have this ability. Banfield's method generates a centroid and covariance matrix for each cluster, Gibson's dynamical system identifies a set of high-weight values around which each cluster is formed, and Agrawal's approach produces a DNF expression summarizing each cluster. Perhaps a post-processing phase could be added to Ben-Dor's algorithm to choose a representative "median element" for each cluster.
hypothesis requires modification. It is at this type of clustering that the methods in this paper excel.
Finally, there is classification to explore a data set, which hereafter will be referred to as data mining. In this case the user has no a priori assumptions about the data, but wants to know if it falls into "meaningful groups" (a term for which the user may not even have a specific definition). Data mining is currently a trendy application of clustering technology, but very little work has been done to gracefully eliminate fixed input parameters to clustering algorithms. Thus, most data mining approaches implicitly require the user to alternate between running the clustering algorithm, modifying the parameters, and choosing the results which seem "best."
This leads to the following set of proposals for continuing research in clustering, and in
particular data mining:
Gracefully eliminate the need for a priori assumptions about the data. An obvious candidate for elimination is fixed numerical inputs (such as in Agrawal's subspace clustering approach). Obviously, one can define a notion of output quality, as Banfield and Raftery have done with their AWE statistic, and then iterate over many possible parameter values searching for a maximum. This trivially eliminates the need for fixed inputs, at the cost of a potentially large increase in running time. More intelligent strategies than brute-force sampling to choose parameters would be helpful.

Additionally, current clustering algorithms make implicit assumptions about the input data. Common examples of such assumptions are that the units in each dimension are scaled appropriately, that the predefined notion of distance/similarity is ideal, and/or that the set of data attributes used for clustering is correct (although the approaches by Gibson and Agrawal are notable for attempting to avoid this assumption). An interesting proposal would be to explore useful ways to explicitly parameterize these assumptions and allow the computer to search for good specific choices. For example, the computer could be allowed to select which dimensions of the data to use in the similarity calculation, or to choose whether or not to normalize the data in some dimension.
Use sampling to improve efficiency. Theoretically-founded clustering algorithms are often dismissed as too inefficient to be used in practice. This problem will be exacerbated if the previous proposal is followed and the computer is forced to search over an even broader array of choices in the pursuit of "good clustering." The theoretical results of Ben-Dor show that with a good choice of a similarity measure, forming a set of "cluster cores" and then classifying the remaining inputs is a very effective strategy, as long as the sample size is big enough to get representatives from every important cluster. This strategy, sketched below, could be pursued to help "scale up" computationally intensive clustering methods.
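A generic version of this sample-then-classify strategy might look like the following sketch; the clustering routine passed in and the use of centroids as "cores" are illustrative assumptions, not any of the surveyed algorithms:

```python
import numpy as np

def cluster_with_sampling(points, cluster_fn, sample_size, seed=0):
    """Generic scale-up sketch: cluster a random sample with an expensive method,
    treat the resulting cluster centroids as "cores", and assign every remaining
    point to the core it is closest to."""
    rng = np.random.default_rng(seed)
    sample_idx = rng.choice(len(points), size=sample_size, replace=False)
    sample = points[sample_idx]
    # `cluster_fn` is any expensive clustering routine returning a label per sample point.
    sample_labels = cluster_fn(sample)
    cores = np.array([sample[sample_labels == c].mean(axis=0)
                      for c in np.unique(sample_labels)])
    # Classify all points by nearest core centroid.
    dists = np.linalg.norm(points[:, None, :] - cores[None, :, :], axis=2)
    return dists.argmin(axis=1)
```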
3.3 Conclusion
This paper has described four recently-developed approaches to clustering. Each has both positive and negative aspects, and each is suitable for different types of data and different assumptions about the cluster structure of the input.

However, all of the algorithms still rely to some extent on rigid assumptions about the data which must be provided by the user. These assumptions appear both explicitly as fixed numerical parameters and implicitly in the way in which the user represents the input to the algorithm. While prior knowledge of the application can help the user choose appropriate assumptions, oftentimes the user must still alternate between running the clustering algorithm and updating the assumptions based on the result.
The paper concludes by suggesting a new general strategy for clustering as a method of data mining. In this strategy, the assumptions required by the various clustering algorithms can be explicitly parameterized, allowing the computer more freedom to search for the best way to cluster the data. To compensate for the broadening of the search space of possible clusterings, it is recommended that an implementation of this strategy use sampling to reduce the number of inputs if necessary. It is hoped that future work will lead to a specific clustering algorithm based on these ideas that can then be validated on actual data.
References

[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD 98), pages 94-104, 1998.

[2] Jeffrey D. Banfield and Adrian E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, September 1993.

[3] Amir Ben-Dor and Zohar Yakhini. Clustering gene expression patterns. In Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99), 1999.

[4] Gilles Celeux and Gerard Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781-793, 1995.

[5] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: a cluster-based approach to browsing large document collections. In 15th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, 1992.

[6] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20:364-366, 1977.

[7] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973.

[8] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95:14863-14868, December 1998.

[9] A. El-Hamdouchi and Peter Willet. Hierarchic document clustering using Ward's method. In Proceedings of the Ninth International Conference on Research and Development in Information Retrieval, pages 149-156, 1986.

[10] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 96), 1996.

[11] Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. In Proceedings of the Fourth International Symposium on Large Spatial Databases (SDD 95), 1995.

[12] Brian S. Everitt. Cluster Analysis. Halsted Press, third edition, 1993.

[13] C. Fraley and A. E. Raftery. How many clusters? Which cluster method? Answers via model-based cluster analysis. Technical Report 329, Department of Statistics, University of Washington, February 1998.

[14] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

[15] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

[16] D. Gibson, J. Kleinberg, and P. Raghavan. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the 29th ACM Symposium on Theory of Computing (STOC 97), 1997.

[17] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985.

[18] A. D. Gordon. Classification: Methods for the Exploratory Analysis of Multivariate Data. Chapman and Hall, 1981.

[19] Erez Hartuv, Armin Schmitt, Jorg Lang, et al. An algorithm for clustering cDNAs for gene expression analysis. In Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99), 1999.

[20] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[21] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., 1990.

[22] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.

[23] Jon Kleinberg, Christos Papadimitriou, and Prabhakar Raghavan. Segmentation problems. In Proceedings of the 30th ACM Symposium on Theory of Computing (STOC 98), pages 473-481, 1998.

[24] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium, 1:281-297, 1967.

[25] David W. Matula. Classification and Clustering, chapter Graph Theoretic Techniques for Cluster Analysis Algorithms. Academic Press, Inc., 1977.

[26] Gary L. Miller, Shang-Hua Teng, William Thurston, and Stephen A. Vavasis. Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM, 44(1):1-29, January 1997.

[27] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26:354-359, 1983.

[28] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Databases (VLDB 94), 1994.

[29] Edie Rasmussen. Information Retrieval, chapter Clustering Algorithms, pages 419-442. Prentice Hall, 1992.

[30] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461-464, 1978.

[31] A. J. Scott and M. J. Symons. Clustering methods based on likelihood ratio criteria. Biometrics, 27:387-397, 1971.

[32] R. Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16:30-34, 1973.

[33] Daniel A. Spielman and Shang-Hua Teng. Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the IEEE Symposium on Foundations of Computer Science, 1996.

[34] M. J. Symons. Clustering criteria and multivariate normal mixtures. Biometrics, 37:35-43, 1981.

[35] Ellen M. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing & Management, 22(6):465-476, 1986.

[36] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):235-244, 1963.

[37] Peter Willet. Recent trends in hierarchical document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988.

[38] Zhigang Xiang. Color image quantization by minimizing the maximum intercluster distance. ACM Transactions on Graphics, 16(3):260-276, July 1997.

[39] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD 96), pages 103-114, 1996.