G–97–10
March 1997
The texts published in the HEC research report series are the sole responsibility of their
authors. The publication of these research reports is supported by a grant from the Fonds F.C.A.R.
Cluster Analysis and Mathematical Programming
Pierre Hansen
GERAD and École des Hautes Études Commerciales
Montréal, Canada
Brigitte Jaumard
GERAD and École Polytechnique de Montréal
Canada
February, 1997
The main concepts of cluster analysis are reviewed in the next section: steps of a cluster analysis study, types of
clusterings and criteria. Section 3 is devoted to hierarchical clustering. Agglomera-
tive and divisive algorithms are reviewed. Section 4 addresses partitioning problems,
and is organized by solution technique. Six of them are considered: dynamic pro-
gramming, graph theoretical algorithms, branch-and-bound, cutting planes, column
generation and heuristics. Other less frequently used clustering paradigms are ex-
amined in Section 5: sequential clustering, additive clustering and representation of
dissimilarities by trees. Brief conclusions are drawn in Section 6.
Most cluster analysis methods rely upon dissimilarities (or similarities, or proximi-
ties) between entities, i.e., numerical values either directly observed or, more often,
computed from the data before clustering. A general scheme for dissimilarity-based
clustering is the following:
(a) Sample. Select a sample O = {O1 , O2 , . . . , ON } of N entities among which
clusters are to be found.
(b) Data. Observe or measure p characteristics of the entities of O. This yields
an N × p data matrix X.
(c) Dissimilarities. Compute from the matrix X an N × N matrix D = (dkℓ ) of
dissimilarities between entities. Such dissimilarities (usually) satisfy the properties
dk` ≥ 0, dkk = 0, dk` = d`k for k, ` = 1, 2, . . . , N . They need not satisfy the triangle
inequality, i.e., be distances.
(d) Constraints. Choose the type of clustering desired (hierarchy of partitions,
partition, . . . ). Specify also further constraints on the clusters, if any (maximum
weight or cardinality, connectedness, . . . ).
(e) Criterion. Choose a criterion (or possibly two criteria) to express homogene-
ity and/or separation of the clusters in the clustering to be found.
(f) Algorithm. Choose or design an algorithm for the problem defined in (d),
(e). Obtain or write the corresponding software.
(g) Computation. Apply the chosen algorithm to matrix D = (dk` ) thus obtain-
ing clusters, and clusterings of the chosen type.
(h) Interpretation. Apply formal or informal tests to select the best cluster-
ing(s) among those obtained in (g). Describe clusters by their lists of entities and
descriptive statistics. Proceed to a substantive interpretation of the results.
Steps (d) and (e) define a clustering problem as a mathematical program. Steps (a)
to (c) and (h) correspond to a statistical viewpoint on clustering. They are in many
ways delicate and discussed at length in the literature [111, 73, 48, 83]. We focus here
on steps (d) to (g) which correspond to a mathematical programming viewpoint.
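As an illustration of step (c), a minimal Python sketch (ours) computing an N × N matrix of Euclidean dissimilarities from an N × p data matrix X; the function name and the choice of Euclidean distance are assumptions of the example, since many other dissimilarities are possible:

    import numpy as np

    def dissimilarity_matrix(X):
        """Compute an N x N matrix D of Euclidean dissimilarities from an N x p data matrix X.
        D satisfies d_kk = 0, d_kl >= 0 and d_kl = d_lk."""
        X = np.asarray(X, dtype=float)
        # pairwise squared distances via ||x_k - x_l||^2 = ||x_k||^2 + ||x_l||^2 - 2 x_k.x_l
        sq_norms = (X ** 2).sum(axis=1)
        D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        return np.sqrt(np.maximum(D2, 0.0))   # clip tiny negatives due to rounding

    if __name__ == "__main__":
        X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
        print(dissimilarity_matrix(X))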
Several remarks are in order. First, dissimilarities may be computed from other
sources than a matrix of measurements X, for instance when comparing biological
sequences or partitions. Second, for some methods only the order of the dissimilarities
matters. This information can be obtained by questions such as “Are these two
entities more similar than those two other ones?”. Third, cluster analysis is not the
only way to study dissimilarities or distances between entities in the field of data
analysis. Another much used technique is principal component analysis (e.g. [99]).
Fourth, few assumptions are made on the clusters in the above scheme and they are
usually in set-theoretic terms. In some circumstances, more knowledge is available.
For instance, the set of entities may be associated with a mixture of distributions, the
number and parameters of which are to be found (e.g. [96] chap. 3). Alternatively, clusters
may correspond to given objects such as characters, to be recognized. This last case
pertains to pattern recognition, a field close to but different from cluster analysis.
Fifth, instead of computing dissimilarities, direct clustering may be performed on the
matrix X. An early example is maximization of the bond energy, or sum over all cells
of the products of their values with the values of adjacent cells [92]. Heuristics are based
on permuting rows and columns, and an exact solution is obtained by solving two
associated traveling salesman problems [89]. Clusters found by direct clustering may
be interpreted in conceptual terms. Recently, conceptual clustering has become a
very active field of research (e.g. [34, 106]).
Cluster analysis algorithms are designed to find various types of clusterings, e.g.,
(i) Subset C of O;
(ii) Partition PM = {C1 , C2 , . . . , CM } of O into M clusters, i.e., such that
     (ii a) Cj ≠ ∅,   j = 1, 2, . . . , M ;
     (ii b) Ci ∩ Cj = ∅,   i, j = 1, 2, . . . , M, i ≠ j;
     (ii c) C1 ∪ C2 ∪ · · · ∪ CM = O;
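A few lines of Python (ours) checking conditions (ii a)–(ii c) for a candidate partition:

    def is_partition(clusters, O):
        """True if `clusters` (a list of sets) is a partition of the set O:
        nonempty, pairwise disjoint and covering O (conditions (ii a)-(ii c))."""
        if any(len(C) == 0 for C in clusters):        # (ii a)
            return False
        union = set()
        for C in clusters:
            if union & C:                             # (ii b) overlap found
                return False
            union |= C
        return union == set(O)                        # (ii c)

    O = {1, 2, 3, 4, 5}
    print(is_partition([{1, 2}, {3}, {4, 5}], O))     # True
    print(is_partition([{1, 2}, {2, 3}, {4, 5}], O))  # False (clusters not disjoint)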
By far the most used types of clustering are the partition and the complete hierar-
chy of partitions, i.e., the one containing N partitions. This last hierarchy can also
be defined as a set of 2N − 1 clusters which are pairwise disjoint or included one into
the other. Recently, weakenings of hierarchies are also increasingly studied. They
include hierarchies of packings [91], weak hierarchies [2] and pyramids [35]. Work has
also been done on fuzzy clustering, in which entities have a degree of membership in
one or several clusters [10].
In constrained clustering, additional requirements are imposed on the clusters.
The most frequent are bounds on their cardinality, bounds on their weight, assuming
entities to be weighted, or connectedness, assuming an adjacency matrix between
entities is given.
2.3 Criteria
(ii) the cut c(Cj ) of Cj , or sum of dissimilarities between entities of Cj and entities
outside Cj :

    c(Cj ) = Σ_{k: Ok ∈ Cj} Σ_{ℓ: Oℓ ∉ Cj} dkℓ ;

and one might also consider a normalized cut, which corrects the previous measure
to eliminate the effect of the cluster’s size by dividing c(Cj ) by |Cj | (N − |Cj |).
(ii) the radius r(Cj ) of Cj , or minimum for all entities Ok of Cj of the maximum
dissimilarity between Ok and another entity of Cj :

    r(Cj ) = Min_{k: Ok ∈ Cj}  Max_{ℓ: Oℓ ∈ Cj}  dkℓ ;
(iii) the star st(Cj ) of Cj , or minimum for all entities Ok of Cj of the sum of dissimi-
larities between Ok and the other entities of Cj :

    st(Cj ) = Min_{k: Ok ∈ Cj}  Σ_{ℓ: Oℓ ∈ Cj}  dkℓ ;

and one might also consider a normalized star and a normalized clique, defined as
st(Cj ) divided by |Cj | − 1 and cl(Cj ) divided by |Cj | (|Cj | − 1), respectively.
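A small Python sketch (ours) of the cut and star criteria and their normalized variants, for a full dissimilarity matrix D given as a nested list:

    def cut(D, C, N):
        """Sum of dissimilarities between entities in cluster C and entities outside C."""
        C = set(C)
        outside = [l for l in range(N) if l not in C]
        return sum(D[k][l] for k in C for l in outside)

    def normalized_cut(D, C, N):
        return cut(D, C, N) / (len(C) * (N - len(C)))

    def star(D, C):
        """Minimum over entities k of C of the sum of dissimilarities from k to the rest of C."""
        C = list(C)
        return min(sum(D[k][l] for l in C if l != k) for k in C)

    def normalized_star(D, C):
        return star(D, C) / (len(C) - 1)

    D = [[0, 1, 4, 5],
         [1, 0, 3, 6],
         [4, 3, 0, 2],
         [5, 6, 2, 0]]
    print(cut(D, {0, 1}, 4), star(D, {0, 1, 2}))   # 18 and 4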
where ‖ · ‖2 denotes the Euclidean distance and

    x̄ = (1 / |Cj |) Σ_{k: Ok ∈ Cj} xk ;
partition to the next are subject to constraints. Sixth, asymmetric dissimilarities may
be reduced to symmetric dissimilarities, e.g. by taking minimum or maximum val-
ues associated with opposite directions for each pair of entities. Alternatively, definitions
given above may be adapted [76].
Criteria used in additive clustering differ from those described here, and will be
examined in Section 5.
3 Hierarchical Clustering
Agglomerative hierarchical clustering algorithms are among the oldest and still most
used methods of cluster analysis [23, 49]. They proceed from an initial partition in
N single-entity clusters by successive mergings of clusters until all entities belong to
the same cluster. Thus, they fit into the following scheme:
Initialization
PN = {C1 , C2 , . . . , CN };
Cj = {Oj } j = 1, 2, . . . , N ;
k = 1;
Current step:
While N − k > 1 do
    select Ci , Cj ∈ PN−k+1 following a local criterion;
    CN+k = Ci ∪ Cj ;
    PN−k = ( PN−k+1 ∪ {CN+k } ) \ {Ci , Cj };
    k = k + 1
EndWhile
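A naive Python sketch of this agglomerative scheme (our own illustration, recomputing inter-cluster dissimilarities at each step rather than using the faster implementations discussed below); passing min or max as the local criterion gives single or complete linkage:

    def agglomerative(D, linkage=min):
        """Generic agglomerative scheme: start from singletons and repeatedly merge the
        pair of clusters minimizing the inter-cluster dissimilarity given by `linkage`
        (min -> single linkage, max -> complete linkage). Returns the list of merges."""
        N = len(D)
        partition = [frozenset([j]) for j in range(N)]   # P_N = {{O_1},...,{O_N}}
        merges = []
        while len(partition) > 1:
            # local criterion: pair of clusters with smallest updated dissimilarity
            best = min(((linkage(D[k][l] for k in Ci for l in Cj), Ci, Cj)
                        for i, Ci in enumerate(partition)
                        for Cj in partition[i + 1:]), key=lambda t: t[0])
            _, Ci, Cj = best
            partition = [C for C in partition if C not in (Ci, Cj)] + [Ci | Cj]
            merges.append((set(Ci), set(Cj), best[0]))
        return merges

    D = [[0, 1, 4, 5],
         [1, 0, 3, 6],
         [4, 3, 0, 2],
         [5, 6, 2, 0]]
    print(agglomerative(D))          # single linkage
    print(agglomerative(D, max))     # complete linkage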
By a local criterion, we mean a criterion which uses only the information given
in D and the current partition. Thus the algorithm uses no memory of how this
partition was reached and no look-ahead beyond the next partition.
Many local criteria have been considered. They correspond to criteria for the
partitions obtained, sometimes defined in an implicit way. This is the case for the
single-linkage algorithm, which merges at each step the two clusters for which the
smallest inter-cluster dissimilarity is minimum. Indeed, a well-known graph theoretic
result of [105] can be reformulated as follows. Let G = (V, E) denote a complete
graph, with vertices vk associated with entities Ok , for k = 1, 2, . . . , N , and edges
{vk , vℓ } weighted by the dissimilarities dkℓ . Let MST denote a minimum spanning
tree of G.
Proposition 1 [105] The values of the split for all subsets of entities of O, and
hence for all partitions of O, belong to the set of dissimilarity values associated with
the edges of MST.
For other criteria, the partitions obtained after several steps of an agglomerative
algorithm are not necessarily optimal. For instance, the complete-linkage algorithm
merges at each step the two clusters for which the resulting cluster, as well as the
resulting partition, has smallest diameter. After two steps or more this partition may
not have minimum diameter. An algorithm to find minimum diameter partitions is
discussed in the next section.
An interesting updating scheme for dissimilarities in agglomerative hierarchical
clustering has been proposed in [87] and extended in [79, 80]. A parametric formula
gives new dissimilarity values between cluster Ck and Ci , Cj when these last two are
merged:
dk,i∪j = αi dik + αj djk + βdij + δ|dik − djk |.
Values of the parameters, a few examples of which are given in Table 1, correspond
to single-linkage, complete-linkage and other methods. Clusters to be merged at each
iteration are those corresponding to the smallest updated dissimilarity. Using heaps,
an O(N² log N ) uniform implementation of agglomerative hierarchical clustering is
obtained [26].
Better results can be derived in a few cases: finding the MST of G, ranking its
edges by non-decreasing values and merging entities at endpoints of successive edges
yields a Θ(N²) implementation of the single-linkage algorithm [50]. At each iteration,
clusters correspond to connected components of a graph with the same vertex set
as G and as edges those of MST considered so far. A Θ(N²) algorithm based on similar
principles has also been obtained [112] for clustering with asymmetric dissimilarities
and strongly connected components as clusters.
Table 1: Coefficients in updating formula for agglomerative hierarchical clustering

Method              αi                                 αj                                 β                              δ
Single linkage      1/2                                1/2                                0                              −1/2
Complete linkage    1/2                                1/2                                0                              1/2
Average linkage     |Ci |/(|Ci |+|Cj |)                |Cj |/(|Ci |+|Cj |)                0                              0
Centroid            |Ci |/(|Ci |+|Cj |)                |Cj |/(|Ci |+|Cj |)                −|Ci ||Cj |/(|Ci |+|Cj |)²     0
Ward’s method       (|Ci |+|Ck |)/(|Ci |+|Cj |+|Ck |)  (|Cj |+|Ck |)/(|Ci |+|Cj |+|Ck |)  −|Ck |/(|Ci |+|Cj |+|Ck |)     0
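The updating formula can be sketched in a few lines of Python; the coefficient table below mirrors Table 1, and the function name is ours:

    # update: d(k, i∪j) = a_i*d(i,k) + a_j*d(j,k) + b*d(i,j) + g*|d(i,k) - d(j,k)|
    COEFFS = {
        "single":   lambda ni, nj, nk: (0.5, 0.5, 0.0, -0.5),
        "complete": lambda ni, nj, nk: (0.5, 0.5, 0.0, 0.5),
        "average":  lambda ni, nj, nk: (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
        "centroid": lambda ni, nj, nk: (ni / (ni + nj), nj / (ni + nj),
                                        -ni * nj / (ni + nj) ** 2, 0.0),
        "ward":     lambda ni, nj, nk: ((ni + nk) / (ni + nj + nk),
                                        (nj + nk) / (ni + nj + nk),
                                        -nk / (ni + nj + nk), 0.0),
    }

    def updated_dissimilarity(dik, djk, dij, ni, nj, nk, method="single"):
        """New dissimilarity between cluster Ck and the merged cluster Ci ∪ Cj."""
        ai, aj, b, g = COEFFS[method](ni, nj, nk)
        return ai * dik + aj * djk + b * dij + g * abs(dik - djk)

    # Single linkage reduces to min(dik, djk), complete linkage to max(dik, djk):
    print(updated_dissimilarity(3.0, 5.0, 1.0, 1, 1, 1, "single"))    # 3.0
    print(updated_dissimilarity(3.0, 5.0, 1.0, 1, 1, 1, "complete"))  # 5.0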
The reducibility property holds when

    d(Ci , Cj ) ≤ min{ d(Ci , Ck ), d(Cj , Ck ) }   implies   min{ d(Ci , Ck ), d(Cj , Ck ) } ≤ d(Ci ∪ Cj , Ck )   ∀ i, j, k;
in words, merging two clusters Ci and Cj less dissimilar between themselves than
with another cluster Ck cannot make the resulting dissimilarity with Ck smaller than
the smallest initial one. Dissimilarities D = (dkℓ ) induce a nearest neighbor relation,
with one or more pairs of reciprocal nearest neighbors. When the reducibility prop-
erty holds, each pair of reciprocal nearest neighbors will be merged before merging with
other clusters. Updating chains of nearest neighbors yields a Θ(N²) agglomerative
hierarchical clustering algorithm for the (average) variance criterion [7]. This result
extends to the single-linkage, complete-linkage and average-linkage algorithms [98].
When entities of O belong to a low-dimensional Euclidean space and dissimilarities
are equal to distances between them, techniques from computational geometry can be
invoked, to get even faster algorithms. Extensions of agglomerative hierarchical clus-
tering algorithms to weak hierarchies or pyramids have been much studied recently,
e.g., in [2, 9].
Divisive hierarchical clustering algorithms are less frequently used than agglomerative
ones. They proceed from an initial cluster containing all entities by successive bipar-
titions of one cluster at a time until all entities belong to different clusters. Thus,
they fit into the following scheme:
Initialization
    P1 = {C1 } = { {O1 , O2 , . . . , ON } };
    k = 1;
Current step:
While k < N do
    select Cj ∈ Pk following a first local criterion;
    partition Cj into C2k and C2k+1 following a second local criterion;
    Pk+1 = ( Pk ∪ {C2k } ∪ {C2k+1 } ) \ {Cj };
    k = k + 1
EndWhile
The role of the first local criterion is not crucial, as it only determines the order
in which clusters will be bipartitioned. The real difficulty lies in bipartitioning the
chosen cluster according to the second criterion, a problem which requires a specific
algorithm for each case, and which may be NP-hard. Only a few divisive clustering
algorithms have, as yet, been proposed.
For the minimum diameter criterion one exploits a property of any maximum
spanning tree MST′ of the graph G defined above:
Proposition 2 The two color classes of a 2-coloring of MST′ form a bipartition of O
with minimum diameter.
Note that the diameter of this bipartition is equal to the largest dissimilarity of
an edge outside MST′ closing an odd cycle with the other edges in MST′.
Using Proposition 2 at all levels yields an O(N³) divisive hierarchical algorithm [102,
75]. A more careful implementation, building simultaneously maximum spanning
trees at all levels, takes O(N² log N ) time [55].
It follows from Proposition 2 and the remark following it that there are at most
O(N ) candidate values for the diameter of a bipartition. This property can be used
in a divisive algorithm for hierarchical clustering with the average diameter criterion.
Candidate values for the largest diameter are considered in sequence and minimum
values for the smallest diameter are sought by dichotomous search. Existence of
a bipartition with given diameters is tested by solving a quadratic boolean equa-
tion [59] or by a specialized labelling algorithm [97, 46]. The resulting algorithm
takes O(N³ log N ) time. It is more difficult to build an algorithm for average linkage
divisive hierarchical clustering: bipartitioning O to maximize the average between
clusters dissimilarity is strongly NP-hard [60]. However, moderate size problems
(N ≤ 40) can be tackled, using hyperbolic and quadratic 0–1 programming. For
several criteria, when entities are points in R², there are hyperplanes separating the
clusters. This property is exploited in an algorithm for hierarchical divisive minimum
sum-of-squares clustering in low-dimensional spaces [72] which solves instances with
N ≤ 20000 in R², N ≤ 500 in R³ and N ≤ 150 in R⁴.
[Figure: a dendrogram (left) and an espalier (right) for entities O1 , . . . , O8 and clusters
C9 , . . . , C15 ; the vertical axis gives cluster diameters, and the espalier additionally
displays splits along its horizontal axis.]
In a dendrogram, vertical lines correspond to entities or clusters and horizontal lines joining endpoints of vertical lines to merg-
ings of clusters. The height of the horizontal lines corresponds to the value of the
updated dissimilarity between the clusters merged. This is a measure of separation
or homogeneity of the clusters obtained. In espaliers the length of the horizontal lines
is used to represent a second measure of homogeneity or separation of the clusters. If
the reducibility condition holds, the updated dissimilarities d′kℓ satisfy the ultrametric
inequality [88]:

    d′kℓ ≤ max( d′kj , d′jℓ )   ∀ j, k, ℓ.

One may also seek an ultrametric D′ = (d′kℓ ) fitting a given dissimilarity matrix D as
closely as possible, e.g., by minimizing

    Σ_{k,ℓ} ( dkℓ − d′kℓ )²   or   Σ_{k,ℓ} | dkℓ − d′kℓ |.
In the former case, which is NP-hard [86], a combination of the average linkage
algorithm with branch-and-bound solves small instances (N ≤ 20) [17]; in the lat-
ter case a branch-and-bound method solves slightly larger instances. Heuristics use
penalty methods [30], in which violations of the ultrametric inequality are penalized,
or iterative projection strategies [77]. They can be extended to the case where some
data are missing [31] and to more general problems discussed in Section 5.
4 Partitioning

4.1 Dynamic programming

When entities are ordered and optimal clusters are strings of consecutive entities (the
string property), partitions may be found by dynamic programming over clusters of
the form

    Cm = {Ok , Ok+1 , . . . , Om }.
Using updating to compute the f (Cj ) for all potential clusters yields O(N²) algo-
rithms for various criteria [102, 109]. Note that the string property does not always
hold. Optimal clusters for one-dimensional clique partitioning do not necessarily sat-
isfy it [11]. However, they enjoy a weaker nestedness property: let [Cj ] denote the
range of the entities Ok , . . . , Oℓ of Cj , i.e., [xk , xℓ ]. Then, for any two clusters Ci and
Cj in the set of optimal partitions,

    [Ci ] ∩ [Cj ] ∈ { ∅, [Ci ], [Cj ] }.

So, ranges of any two clusters are either disjoint or included one into the other.
Exploiting this property leads to a polynomial algorithm for one-dimensional clique
partitioning, also based on dynamic programming [70]. A detailed discussion of nest-
edness and related properties is given in [78].
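A minimal Python sketch of dynamic programming over clusters of consecutive entities (the string property); the choice of f (here the clique, i.e., the sum of within-cluster dissimilarities |xi − xj| on the line) and the function names are illustrative:

    def partition_line(x, M, f):
        """Optimally partition ordered entities x[0..N-1] into M clusters that are strings
        of consecutive entities, minimizing the sum of f over clusters.
        F[l][m] = best value of splitting the first m entities into l clusters."""
        N = len(x)
        INF = float("inf")
        F = [[INF] * (N + 1) for _ in range(M + 1)]
        F[0][0] = 0.0
        for l in range(1, M + 1):
            for m in range(l, N + 1):
                # the last cluster is {x[k], ..., x[m-1]} for some k
                F[l][m] = min(F[l - 1][k] + f(x[k:m]) for k in range(l - 1, m))
        return F[M][N]

    def clique(cluster):
        """Sum of dissimilarities (here |xi - xj|) within a cluster of points on a line."""
        return sum(abs(a - b) for i, a in enumerate(cluster) for b in cluster[i + 1:])

    print(partition_line([0.0, 0.1, 0.2, 5.0, 5.1, 9.0], 3, clique))  # 0.4 + 0.1 + 0 = 0.5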
When clustering entities in higher-dimensional spaces, there does not seem to be an
equivalent of the string property. In a few particular cases, the recurrence equation
can be extended [81, 38]. Several authors, e.g. [110], have proposed to impose an
order on the entities, for instance the order of points on a Peano curve or the order
of traversal in a traveling salesman tour, and then to apply dynamic programming
to the resulting one-dimensional problem. Such a procedure quickly gives an optimal
solution to an approximation of the given problem. Its proximity to the optimal
solution of the problem itself depends on the first step, which is somewhat arbitrary.
To obtain an optimal solution in the general case, nonserial dynamic program-
ming [8] must be used. Let F (S, ℓ) denote the optimal value of a clustering of the entities
of subset S into ℓ clusters. The recurrence relation then becomes

    F (S, ℓ) = Min { F (S \ Cℓ , ℓ − 1) + f (Cℓ ) :  Cℓ ⊂ S,  |Cℓ | ≤ |S| − ℓ + 1 }.
Applying this equation takes time exponential in N , so only small sets of entities
(N ≤ 20) may be considered. Sometimes, constraints accelerate the computations,
e.g., if all clusters must be small.
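A small Python sketch of this recursion over subsets (hence exponential in N and only for tiny instances); the memoization, the anchor-based enumeration of the last cluster and the sum-of-cliques criterion in the example are our choices:

    from functools import lru_cache
    from itertools import combinations

    def best_partition_value(entities, L, f):
        """F(S, l): optimal value of clustering subset S into l clusters, computed by the
        nonserial recursion over subsets (time exponential in |entities|)."""
        @lru_cache(maxsize=None)
        def F(S, l):
            if l == 1:
                return f(S)
            members = sorted(S)
            anchor = members[0]              # enumerate the cluster containing `anchor` last
            rest = members[1:]
            best = float("inf")
            for mask in range(2 ** len(rest)):
                C = frozenset([anchor] + [m for i, m in enumerate(rest) if mask >> i & 1])
                if len(C) <= len(S) - l + 1:   # leave at least l-1 entities for the other clusters
                    best = min(best, F(S - C, l - 1) + f(C))
            return best
        return F(frozenset(entities), L)

    # Example: minimum sum-of-cliques partition of 4 entities into 2 clusters.
    D = {(0, 1): 1.0, (0, 2): 4.0, (0, 3): 5.0, (1, 2): 3.0, (1, 3): 6.0, (2, 3): 2.0}
    clique = lambda C: sum(D[p] for p in combinations(sorted(C), 2))
    print(best_partition_value({0, 1, 2, 3}, 2, clique))  # 3.0, attained by {0,1} and {2,3}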
The problem of maximizing the average split, or the sum-of-splits, of a partition into M clusters is related
but different. Its solution relies on the following result:
Proposition 3 [61] Let C = {C1 , C2 , . . . , C2N−1 } denote the set of clusters obtained
when applying the single-linkage algorithm to O. Then for all M there exists a partition
P*M which maximizes the average split and consists solely of clusters of C.
Consider then the dual graph of the single-linkage dendrogram, as defined in [61].
It is easy to show that any partition of O into M clusters of C corresponds to a
source-sink path with M arcs in that graph. Then, weight the arcs of the dual graph
by the splits of the clusters associated with the edges of the dendrogram they cross.
Using dynamic programming to find a cardinality constrained longest path yields a
partition P*M with maximum average split in Θ(N²) time.
The relationship between graph coloring and finding a bipartition with minimum
diameter was also mentioned in the previous section. In fact, this relationship extends
to the general case.
This relationship can be exploited in the reverse direction to show minimum di-
ameter partitioning is NP-hard for M ≥ 3 [13, 58], and adapted to prove further
NP-hardness results [115]. Updating may be used to exploit Proposition 4 efficiently.
Consider graph Gt to which an edge is added. If the vertices of this edge do not have
the same color, or if local recoloring (e.g., by bichromatic interchange) gives a coloring
with no more colors than previously, one can proceed to the next graph. When there
is some structure in the set O under study, it will be reflected in the graphs Gt , which
are easier to color than random ones, and instances with N ≤ 600 could indeed be
solved.
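The coloring connection can be sketched in Python: a partition of O into M clusters of diameter at most t exists exactly when the threshold graph Gt, here assumed to contain an edge for every pair with dissimilarity greater than t, is M-colorable; the brute-force colorability test below is ours and only meant for tiny instances:

    from itertools import product

    def threshold_graph_edges(D, t):
        """Edges of G_t: pairs of entities whose dissimilarity exceeds the threshold t
        (assumed definition; such pairs must end up in different clusters)."""
        N = len(D)
        return [(k, l) for k in range(N) for l in range(k + 1, N) if D[k][l] > t]

    def m_colorable(N, edges, M):
        """Brute-force test whether the graph on N vertices with the given edges is M-colorable.
        A proper M-coloring corresponds to a partition into M clusters of diameter <= t."""
        for coloring in product(range(M), repeat=N):
            if all(coloring[k] != coloring[l] for k, l in edges):
                return True
        return False

    D = [[0, 1, 4, 5],
         [1, 0, 3, 6],
         [4, 3, 0, 2],
         [5, 6, 2, 0]]
    # Is there a partition of the 4 entities into M = 2 clusters with diameter at most t = 2?
    print(m_colorable(4, threshold_graph_edges(D, 2), 2))  # True: {O1, O2} and {O3, O4}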
Minimum diameter partitions are not unique. Enumerating them is discussed
in [54]. Alternatively, one can adapt the coloring algorithm to find a partition mini-
mizing the second largest cluster diameter, subject to the first being minimum, then
the third largest and so on [27].
Partitions obtained with the single-linkage algorithm may suffer from the chain-
ing effect: dissimilar entities at the ends of a long chain of pairwise similar entities
are assigned to the same cluster. Partitions obtained by the coloring algorithm for
minimum diameter may suffer from the dissection effect [23]: similar entities may
be assigned to different clusters. To avoid both effects one may seek compromise
solutions, i.e., efficient partitions for the split and diameter criteria. The resulting
bicriterion cluster analysis algorithm [28] is based on Propositions 1 and 4. To impose
a minimum value on the split it suffices to merge the vertices of G at endpoints of
successive edges of MST. Then the resulting reduced graph GR of G can be colored
as described above. Splits and diameters of the efficient partitions may be represented
graphically on a diameter-split map. It can be used to evaluate whether the set O
possesses some structure or not and which partitions appear to be the most natural
ones. A single efficient partition for a value of M is a good indication.
Some clustering algorithms apply to graphs, which may be viewed as partial graphs
Gt as defined above, for a given t. Clusters may then be defined as maximal com-
ponents with minimum degree at least δ [91]; an O(N + |E|) algorithm provides a
hierarchy of packings corresponding to successive values of δ. When clustering points
in R2 , geometric properties may be exploited to obtain low-order polynomial algo-
rithms. For instance, minimum average diameter bipartitioning in the plane can be
done in O(n log² n / log log n) time [74] and minimizing any monotone function of
the diameters of an M -cluster partition can be performed in O(n^(5M)) time [16].
4.3 Branch-and-bound
Branch-and-bound algorithms have been applied, with some success, to several par-
titioning problems of cluster analysis. Their efficiency depends on sharpness of the
bounds used, availability of a good heuristic solution and efficient branching, i.e.,
rules which improve bounds for all subproblems obtained in a fairly balanced way.
An algorithm for minimum sum-of-squares partitioning [85, 36] exploits bounds
based on assignments of entities to clusters already made, and additivity of bounds for
separate subsets of entities. It solves problems with N ≤ 120 and a few well-separated
clusters of points of R², but its performance deteriorates in higher dimensional spaces.
Another algorithm [84], for minimum sum-of-cliques partitioning, uses bounds based
on ranking dissimilarities, which are not very sharp. Problems with N ≤ 50, M ≤ 5
can be solved.
Better results are obtained when bounds result from the solution of a mathemati-
cal program. For minimum sum-of-stars partitioning (the M -median problem) the
well-known DUALOC algorithm [42] combined with Lagrangian relaxation of the car-
dinality constraint [57] is very efficient. Problems with N ≤ 900 are solved exactly
and the dimension of the space considered does not appear to be an obstacle.
A variant of the minimum sum-of-cliques partitioning problem arises when one
seeks a consensus partition, i.e., one which is at minimum total distance from a given
set of partitions [104], the distance between two partitions being measured by the number
of pairs of entities in the same cluster in one partition and in different clusters in the
other. Dissimilarities may then be positive or negative and the number of clusters is
not fixed a priori. This problem can be expressed as follows [90]:
    Minimize   Σ_{k=1}^{N−1} Σ_{ℓ=k+1}^{N} dkℓ ykℓ
    subject to:
        ykℓ + yℓq − ykq ≤ 1
        −ykℓ + yℓq + ykq ≤ 1        k = 1, 2, . . . , N − 2;  ℓ = k + 1, . . . , N − 1;  q = ℓ + 1, . . . , N
        ykℓ − yℓq + ykq ≤ 1
    and
        ykℓ ∈ {0, 1},   k = 1, 2, . . . , N − 1;  ℓ = k + 1, k + 2, . . . , N,
where yk` = 1 if Ok and O` belong to the same cluster and yk` = 0 otherwise. Problems
with N ≤ 72 could be solved [90] by applying the revised simplex method to the dual
of the continuous relaxation of the above formulation. No duality gap was observed
(nor a branching rule specified for the case where there would be one). A direct
branch-and-bound approach is proposed in [39]. A first bound equal to the sum of
negative dissimilarities is improved upon by using logical relations between the ykℓ
variables (or, in other words, exploiting consequences of the triangle inequalities).
For instance, if variable ykℓ is equal to 1, then for all indices q either both ykq and yℓq
are equal to 1 or both are equal to 0 in any feasible solution. If these variables are
free, the bound may be increased by

    min{ max{dkq , 0} + max{dℓq , 0},  max{−dkq , 0} + max{−dℓq , 0} }.
Many further consequences are taken into account and the resulting bounds are quite
sharp. Instances with N ≤ 158 could be solved, more quickly than with a cutting-
plane approach, but less quickly than with a combination of heuristic, cutting planes
and branch-and-bound (see next subsection).
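A small Python illustration of this bound improvement (our own verification sketch): with ykℓ fixed to 1 and ykq, yℓq forced to be equal, the increase over the sum-of-negative-dissimilarities bound is the smaller of the two joint outcomes.

    def bound_increase(dkq, dlq):
        """Increase of the basic bound (sum of negative dissimilarities) obtained by forcing
        y_kq = y_lq once y_kl = 1: either both pairs are joined (cost dkq + dlq) or both are
        separated (cost 0), while the basic bound already counts min(dkq,0) + min(dlq,0)."""
        basic = min(dkq, 0) + min(dlq, 0)
        both_joined = dkq + dlq - basic        # = max(dkq,0) + max(dlq,0)
        both_separated = 0 - basic             # = max(-dkq,0) + max(-dlq,0)
        return min(both_joined, both_separated)

    # Example with one positive and one negative dissimilarity: the bound rises by 2.
    print(bound_increase(3.0, -2.0))   # min(3 + 0, 0 + 2) = 2.0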
4.4 Cutting planes

Until recently, few papers on cluster analysis advocated the cutting-plane approach.
The minimum sum-of-cliques partitioning problem has attracted the most attention.
To this end, the convex hull H of integer solutions to the problem defined in the pre-
vious section is studied.
In particular, for disjoint nonempty subsets U and V of O, the 2-partition inequality

    y(U : V ) − y(U ) − y(V ) ≤ min{ |U |, |V | },

where y(U : V ) denotes the sum of the variables corresponding to pairs of entities one
in U and the other in V , y(U ) = y(U : U ) and y(V ) = y(V : V ), is valid and a facet
if and only if |U | ≠ |V |.
Several further families of facets are given. These results are used in a cutting plane
algorithm [51] to solve instances with N ≤ 158. It appears that the triangle inequal-
ities suffice in almost all cases. Facets of the polytope obtained when a cardinality
constraint is added have also been studied [19].
Recently, cutting planes have been combined with heuristics, relocation of the
best known solution to the origin (which eases the separation problem), and branch-
and-bound [101]. Minimum sum-of-cliques problems of the literature with N ≤ 158
are solved very quickly.
Cutting-planes were also used in [82] to solve, in moderate time, the auxiliary
problem in a column generation approach (see next subsection) to a constrained
minimum-sum-of-cuts partitioning problem (called min-cut clustering).
The cutting plane approach does not seem to be easy to adapt to clustering prob-
lems with objectives which are not sums of dissimilarities or to problems in which the
number of clusters is fixed. Further work on cutting planes for clustering and related
problems may be found in [19, 20, 43].
4.5 Column generation

Partitioning problems can also be given a set-partitioning formulation with one column
for each of the 2^N − 1 nonempty subsets Ct of O:

    Minimize   Σ_t f (Ct ) yt
    subject to:
        Σ_t ajt yt = 1,   j = 1, 2, . . . , N,
        Σ_t yt = M,
        yt ∈ {0, 1} for all t,

where yt is equal to 1 if cluster Ct belongs to the partition and 0 otherwise, and
where ajt is equal to 1 if entity Oj belongs to cluster Ct and 0 otherwise. Despite
its enormous size this formulation turns out to be one of the most useful. In order
to solve this problem one needs (i) to solve efficiently its continuous relaxation and
(ii) to proceed efficiently to a branch-and-bound phase in case the solution of the
relaxation is not in integers. We discuss these two aspects in turn.
The standard way to solve linear programs with an exponential number of columns
is to use column generation [47, 21]. In this extension of the revised simplex method,
the entering column is obtained by solving an auxiliary problem, where the unknowns
are the coefficients aj of the column:
    Min   f (Cj ) − Σ_{j=1}^{N} aj uj − uN+1
    subject to:
        aj ∈ {0, 1},   j = 1, 2, . . . , N,
where (u1 , . . . , uN , uN +1 ) are the dual variables at the current iteration. Difficulty
varies depending on the form of f (Cj ) as a function of the aj . For minimum sum-
of-stars clustering (or the M -median problem), the first clustering problem solved
by column generation [44], solving the auxiliary problem is straightforward: for each
potential cluster center k in turn, set aj = 1 if dkj < uj and aj = 0 otherwise. If

    Σ_{j: aj = 1} ( dkj − uj ) − uN+1 < 0,

the column so defined is a candidate to enter the basis.
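A minimal Python sketch of this pricing step for minimum sum-of-stars (M-median) column generation; the function name and the dual-value vector u (entries u[0..N−1] for the covering constraints and u[N] for the cardinality constraint) are assumptions of the example:

    def price_star_columns(D, u):
        """For each potential center k, build the column a with a_j = 1 iff d_kj < u_j and
        report it if its reduced cost  sum_{j: a_j = 1}(d_kj - u_j) - u_{N+1}  is negative."""
        N = len(D)
        candidates = []
        for k in range(N):
            a = [1 if D[k][j] < u[j] else 0 for j in range(N)]
            reduced_cost = sum(D[k][j] - u[j] for j in range(N) if a[j]) - u[N]
            if reduced_cost < 0:
                candidates.append((reduced_cost, k, a))
        return candidates

    D = [[0, 1, 4, 5],
         [1, 0, 3, 6],
         [4, 3, 0, 2],
         [5, 6, 2, 0]]
    u = [2.0, 2.0, 2.5, 2.5, 0.5]    # illustrative dual values: u[0..3] and u[4] = u_{N+1}
    for rc, k, a in price_star_columns(D, u):
        print("center", k, "column", a, "reduced cost", rc)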
For the capacitated version of this problem the auxiliary problem reduces to
a knapsack problem. For the sum of cliques problem the subproblem reduces to
quadratic 0–1 programming:
    Min   Σ_{j=1}^{N−1} Σ_{k=j+1}^{N} djk aj ak − Σ_{j=1}^{N} aj uj − uN+1
in 0–1 variables aj [93, 82, 68]. For the minimum sum-of-squares problem, it reduces
to a hyperbolic 0–1 program, in view of Huyghens’ theorem, which states that the
sum of squared distances to the centroid is equal to the sum of squared distances
between entities divided by the cardinality of the cluster:
    Min   ( Σ_{j=1}^{N−1} Σ_{k=j+1}^{N} d²jk aj ak ) / ( Σ_{j=1}^{N} aj )  −  Σ_{j=1}^{N} aj uj − uN+1
in 0–1 variables. An iterative solution scheme [37] reduces this problem to a sequence
of quadratic programs in 0–1 variables. These last problems, as well as other quadratic
0–1 programs discussed above, can be solved by an algebraic (or variable elimination)
method [24], linearization [113], cutting planes [3] or branch-and-bound [63], possibly
exploiting the persistency properties of roof duality theory [56]. Combining column
generation with an interior point method [41] allows solution of minimum sum-of-
squares partitioning problems with N ≤ 150.
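The Huyghens identity invoked above is easily checked numerically; the short Python sketch below (ours) compares the sum of squared distances to the centroid with the sum of squared pairwise distances divided by the cluster cardinality:

    import numpy as np

    def ssq_to_centroid(X):
        """Sum of squared Euclidean distances from the points (rows of X) to their centroid."""
        X = np.asarray(X, dtype=float)
        centroid = X.mean(axis=0)
        return float(((X - centroid) ** 2).sum())

    def ssq_pairwise_over_card(X):
        """Sum of squared pairwise distances divided by the cluster cardinality."""
        X = np.asarray(X, dtype=float)
        n = len(X)
        total = sum(((X[j] - X[k]) ** 2).sum() for j in range(n - 1) for k in range(j + 1, n))
        return total / n

    X = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [4.0, 1.0]]
    print(ssq_to_centroid(X), ssq_pairwise_over_card(X))   # the two values coincide (14.75)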
Once the entering column is found the algorithm proceeds to a simplex iteration
as in the revised simplex method. However, convergence may be slow, particularly if
there are few clusters in the partition and hence massive degeneracy of the optimal
solution. In fact, even when the optimal solution is found many more iterations may
be needed to prove its optimality. Columns in the primal correspond to cutting planes
in the dual; a good approximation of the dual polytope around the optimal value for
the dual is needed, but little information is available about this optimum. A recent
bundle method in the L1 -norm [40] stabilizes the algorithm while remaining within
the column generation framework. It gives good results for continuous sum-of-stars
clustering in the plane (the multisource Weber problem), instances with N = 1060 and
M ≤ 50 being solved [62].
Once the linear relaxation of the master problem is solved, one must check for inte-
grality of the solution. For some problems, such as minimum sum-of-cliques clustering, it
seems to be fairly often the case. Otherwise, branch-and-bound is needed. Extension
of standard dual and primal procedures of mixed-integer programming to column
generation [71, 66] is only efficient when there are few integer variables. Setting one
fractional variable yt at 1 modifies substantially the problem as all constraints corre-
sponding to elements of Ct are satisfied; but setting yt at 0 only excludes one column
among an enormous number. So other branching rules are needed, and have indeed
been found. A first proposal [100] was made in 1983 for capacitated sum-of-stars
partitioning (or the capacitated M -median problem with single-supply constraints):
branching is done by assigning an entity to a center, which implies this center is
selected in some cluster of the partition, or forbidding it to belong to a cluster with
that center. Another fairly close branching rule, first proposed [107] for the parti-
tioning problem (but not for column generation) is to specify that two entities Oj
and Ok must belong to the same cluster or not. So branching is done in the auxiliary
problem by adding the constraints aj = ak in one branch, and aj + ak ≤ 1 in the
other. Columns not satisfying these constraints are removed. This rule appears to
be more efficient than the previous one [67] and variants of it have been applied with
success in several papers on scheduling problems, e.g., [33]. Nevertheless, some recent
column generation methods for clustering, e.g., [93, 82], still stopped after solution of
the relaxation of the master problem or used some heuristic from that point. In a recent
survey [4], the name “branch-and-price” has been proposed for the combination of column
generation and branch-and-bound.
4.6 Heuristics
For many criteria, exact solution of large clustering problems is out of reach. So there
is room for heuristics. Moreover, finding a good initial solution may be important in
column generation (if it is well exploited, i.e., if columns close to those of this solution
are used to complete the basis; otherwise beginning with the heuristic solution may
slow down the solution process).
Traditional heuristics use exchange of entities between clusters or redefinition of
clusters from their centroids. The HMEANS algorithm, e.g. [109], for minimum-sum-
of-squares partitioning draws an initial partition at random, then proceeds to best
exchanges of entities from one cluster to another until a local minimum is attained.
The KMEANS algorithm for the same problem also draws an initial partition at
random, then computes the cluster centroids, assigns each entity to the closest of
them and iterates until a local minimum is attained. Both procedures can be repeated
a given number of times. They give good results when there are few clusters but
deteriorate when there are many. Experiments show that the best clustering found
with KMEANS may be more than 50% worse than the best known one.
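A minimal Python sketch of the KMEANS heuristic as described above (random initial partition, centroid computation, reassignment to the closest centroid, iterate); the function name, the treatment of empty clusters and the convergence test are our assumptions:

    import numpy as np

    def kmeans(X, M, rng=np.random.default_rng(0), max_iter=100):
        """KMEANS heuristic for minimum sum-of-squares partitioning: start from a random
        partition, alternate centroid computation and reassignment to the closest centroid."""
        X = np.asarray(X, dtype=float)
        N = len(X)
        labels = rng.integers(0, M, size=N)            # random initial partition
        for _ in range(max_iter):
            # centroids of the current clusters (re-seed empty clusters at a random point)
            centroids = np.array([X[labels == m].mean(axis=0) if np.any(labels == m)
                                  else X[rng.integers(N)] for m in range(M)])
            new_labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
            if np.array_equal(new_labels, labels):     # local minimum reached
                break
            labels = new_labels
        ssq = sum(((X[labels == m] - X[labels == m].mean(axis=0)) ** 2).sum()
                  for m in range(M) if np.any(labels == m))
        return labels, ssq

    X = np.array([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9], [9, 0], [9.2, 0.1]])
    print(kmeans(X, 3))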
Much better results have been obtained with metaheuristics, i.e., simulated an-
nealing, Tabu search, genetic search, etc. [103]. The recent Variable Neighborhood
Search [72] proceeds by local search to a local minimum, then explores increasingly
distant neighborhoods of that partition by drawing a perturbation at random and
doing again a local search. It moves to a new partition and iterates if and only if
a better one than the incumbent is found. Experiments show this procedure is very
efficient for approximate solutions of large clustering problems.
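A self-contained Python sketch of the Variable Neighborhood Search loop described above, using a KMEANS-type reassignment as the local search; the neighborhood sizes, the perturbation and the stopping rule are illustrative assumptions:

    import numpy as np

    def ssq(X, labels, M):
        """Sum of squared distances of entities to their cluster centroids."""
        return sum(((X[labels == m] - X[labels == m].mean(axis=0)) ** 2).sum()
                   for m in range(M) if np.any(labels == m))

    def local_search(X, labels, M):
        """Reassign entities to the closest centroid and recompute, until no improvement."""
        labels = labels.copy()
        improved = True
        while improved:
            improved = False
            centroids = np.array([X[labels == m].mean(axis=0) if np.any(labels == m) else X[0]
                                  for m in range(M)])
            new = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
            if ssq(X, new, M) < ssq(X, labels, M) - 1e-12:
                labels, improved = new, True
        return labels

    def vns(X, M, kmax=3, iters=30, rng=np.random.default_rng(1)):
        """VNS sketch: perturb k entities of the incumbent at random, rerun the local
        search, accept only improvements, otherwise enlarge the neighborhood size k."""
        X = np.asarray(X, dtype=float)
        best = local_search(X, rng.integers(0, M, size=len(X)), M)
        for _ in range(iters):
            k = 1
            while k <= kmax:
                trial = best.copy()
                moved = rng.choice(len(X), size=k, replace=False)
                trial[moved] = rng.integers(0, M, size=k)       # shaking
                trial = local_search(X, trial, M)
                if ssq(X, trial, M) < ssq(X, best, M):
                    best, k = trial, 1                          # move and restart neighborhoods
                else:
                    k += 1
        return best, ssq(X, best, M)

    X = np.array([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9], [9, 0], [9.2, 0.1]], dtype=float)
    print(vns(X, 3))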
5 Other Clustering Paradigms

5.1 Sequential clustering

Most clustering algorithms give results regardless of whether the given set of entities
possesses some structure or not. Moreover, all entities must usually be assigned to
some cluster. This disregards the possibility of noise, i.e., entities (possibly all of
them) which can only be classified arbitrarily. It may therefore be preferable to
consider packing problems instead of partitioning problems. Moreover, one may wish
to study clusters one at a time, beginning by the most obvious one, removing its
entities and iterating. The so-defined sequential clustering [72] is close to methods of
image processing:
Current step:
    Find clusters Ck ⊂ O with |Ck | = k, for k = 1, 2, . . . , |O|, which optimize a criterion;
    Evaluate the best value k* of k and the significance of cluster Ck* . If it is significant
    (different from noise), set O = O \ Ck* and iterate; otherwise stop.
Thus, at each step, a single-cluster parametric clustering problem is solved, and
followed by a test based on the distribution of values of the criterion. Some cases
are easy: finding a maximum split cluster can be done in Θ(N²) time in view of
Proposition 1, rediscovered in [18]. Finding a minimum radius cluster or a minimum
star cluster takes O(N² log N ) time by ranking dissimilarities. Finding a minimum
diameter cluster is NP-hard, as well as finding a minimum clique cluster. The former
problem can be solved by reducing it to a sequence of maximum clique problems, and
the latter by expressing it as a quadratic knapsack problem. Other geometric criteria
are considered in [1] and [25].
5.2 Additive clustering

In addition to finding clusters one may use them to explain dissimilarities (or simi-
larities) between pairs of entities, as proposed in additive clustering [108, 95]. Given
a matrix S = (sk` ) of similarities between pairs of entities of O one seeks M over-
lapping clusters C1 , C2 . . . , CM and corresponding weights λ1 , λ2 , . . . , λM to minimize
the sum-of-squares of errors:
    Σ_{k=1}^{N−1} Σ_{ℓ=k+1}^{N} ( skℓ − Σ_{j: Ok , Oℓ ∈ Cj} λj )².
In a variant of that model, one cluster contains all entities. Many heuristics have been
proposed for its solution, using various techniques of mathematical programming. If
one cluster is considered at a time, in a qualitative factor analysis technique [95], the
problem is easier and can be reduced to quadratic or hyperbolic 0–1 programming
with a cardinality constraint [64].
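A short Python function (ours) computing this additive clustering objective for given overlapping clusters and weights:

    def additive_clustering_error(S, clusters, weights):
        """Sum of squared errors  sum_{k<l} ( s_kl - sum_{j: O_k, O_l in C_j} lambda_j )^2
        for overlapping clusters (sets of entity indices) with weights lambda_j."""
        N = len(S)
        total = 0.0
        for k in range(N - 1):
            for l in range(k + 1, N):
                fitted = sum(w for C, w in zip(clusters, weights) if k in C and l in C)
                total += (S[k][l] - fitted) ** 2
        return total

    S = [[0.0, 0.9, 0.3, 0.1],
         [0.9, 0.0, 0.4, 0.2],
         [0.3, 0.4, 0.0, 0.8],
         [0.1, 0.2, 0.8, 0.0]]
    # Two overlapping clusters with weights 0.8 and 0.4.
    print(additive_clustering_error(S, [{0, 1}, {1, 2, 3}], [0.8, 0.4]))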
In the trees associated with ultrametrics, the length of the path from the root to any
single entity is a constant. This property may be relaxed. Then the general problem
of representing dissimilarities by additive trees arises: the length corresponding to
dk` will be that of the path between the vertices vk and v` of the additive tree T . So
both the topology of T and the length of its edges must be determined. This topic is
studied in depth in [5].
In order to be representable by an additive tree, it is necessary and sufficient that
the dissimilarity D′ satisfy the four-point condition [14]:

    d′kℓ + d′mn ≤ max{ d′km + d′ℓn , d′kn + d′ℓm }   for all k, ℓ, m, n.
6 Conclusions
Mathematical programming has been applied with success to cluster analysis in the
last 25 years. This has made it possible to (i) define precisely many cluster analysis problems;
(ii) determine their computational complexity; (iii) clarify the objectives underlying
known algorithms, and exhibit some important properties, e.g., for the split criterion;
(iv) obtain improved and sometimes best possible algorithms for known easy prob-
lems; (v) obtain polynomial and sometimes best possible algorithms for new problems,
e.g., average split partitioning; (vi) obtain non-polynomial but useful algorithms for
NP-hard problems, e.g., clique partitioning and minimum sum-of-squares partition-
ing; (vii) devise useful heuristics, yielding near-optimal solutions for large instances;
(viii) establish ties between cluster analysis and other subfields of mathematical pro-
gramming and computational geometry, where similar problems are studied.
While many results have been obtained, much remains to be done to completely
integrate cluster analysis within mathematical programming. Axiomatics are needed,
particularly for partitioning. New exact algorithms should be devised, mostly for
divisive hierarchical clustering, sequential clustering and additive clustering, where
few or none exist, but also for partitioning with little studied criteria. Heuristics for
large instances deserve further study. Empirical comparison of methods is also too
rare, with a few exceptions (e.g. [94]). Finally, gathering existing software, often hard
to access, and streamlining it into a package would be of help.
References
[1] A. Aggarwal, H. Imai, N. Katoh and S. Suri, Finding k Points with Minimum
Diameter and Related Problems, Journal of Algorithms 12 (1991) 38–56.
[2] H.J. Bandelt and A.W.M. Dress, Weak Hierarchies Associated with Similarity
Measures: an Additive Clustering Technique, Bulletin of Mathematical Biology
51 (1989) 133–166.
[4] C. Barnhart, E.L. Johnson, G.L. Nemhauser and M.W.P. Savelsbergh, Branch
and Price: Column Generation for Solving Huge Integer Programs, Computa-
tional Optimization Center COC-94-03, Georgia Institute of Technology, At-
lanta, 1994, (revised 1995).
[5] J.-P. Barthelemy and A. Guénoche, Les Arbres et les représentations des prox-
imités (Masson: Paris 1988) English translation: Trees and Proximity Relations
(Wiley: Chichester 1991).
[9] P. Bertrand, Structural Properties of Pyramidal Clustering. In: I. Cox, P.
Hansen and B. Julesz (eds.) Partitioning Data Sets (American Mathematical
Society: Providence 1995) 35–53.
[10] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms
(Plenum: New York 1981).
[11] E. Boros and P.L. Hammer, On Clustering Problems with Connected Optima
in Euclidean Spaces, Discrete Mathematics 75 (1989) 81–88.
[14] P. Buneman, The Recovery of Trees from Measures of Dissimilarity. In: F.R.
Hodson, D.G. Kendall and P. Tautu (eds.) Mathematics in Archeological and
Historical Sciences (Edinburgh University Press: Edinburgh 1971) 387–395.
[18] M.S. Chang, C.Y. Tang and R.C.T. Lee, A Unified Approach for Solving Bottle-
neck k-Bipartition Problems, Proceedings of the 19th Annual Computer Science
Conference (San Antonio, Texas, March 5–7, ACM, 1991) 39–47.
[19] S. Chopra and M.R. Rao, On the Multiway Cut Polyhedron, Networks 21 (1991)
51–89.
[20] S. Chopra and J.H. Owen, Extended Formulations for the A-Cut Problem,
Mathematical Programming 73 (1996) 17–30.
[21] V. Chvátal, Linear Programming (New York: Freeman, 1983).
[24] Y. Crama, P. Hansen and B. Jaumard, The Basic algorithm for Pseudo-Boolean
Programming Revisited, Discrete Applied Mathematics 29 (1990) 171–185.
[25] A. Datta, H.-P. Lenhof, Ch. Schwarz and M. Smid, Static and Dynamic Al-
gorithms for k-point Clustering Problems, Journal of Algorithms 19 (1995)
474–503.
[26] W.H.E. Day and H. Edelsbrunner, Efficient Algorithms for Agglomerative Hi-
erarchical Clustering Methods, Journal of Classification 1 (1984) 7–24.
[29] G. De Soete, A Least Squares Algorithm for Fitting Additive Trees to Proximity
Data, Psychometrika 48 (1983) 621–626.
[34] E. Diday, From Data to Knowledge: Probabilistic Objects for a Symbolic Data
Analysis. In: I. Cox, P. Hansen and B. Julesz (eds.) Partitioning Data Sets
(American Mathematical Society: Providence 1995) 35–53.
[35] E. Diday, Orders and Overlapping Clusters by Pyramids (Research Report, 730,
INRIA, France 1987).
[36] G. Diehr, Evaluation of a Branch and Bound Algorithm for Clustering, SIAM
Journal on Scientific and Statistical Computing 6 (1985) 268–284.
[39] U. Dorndorf and E. Pesch, Fast Clustering Algorithms, ORSA Journal on Com-
puting 6 (1994) 141–153.
[43] C.E. Ferreira, A. Martin, C.C. De Souza, R. Weismantel and L.A. Wolsey, For-
mulation and Valid Inequalities for the Node Capacitated Graph Partitioning
Problem, Mathematical Programming 74 (1996) 247–266.
[44] R. Garfinkel, A.W. Neebe and M.R. Rao, An Algorithm for the M -median Plant
Location Problem, Transportation Science 8 (1974) 217–236.
[47] P.C. Gilmore and R.E. Gomory, A Linear Programming Approach to the Cut-
ting Stock Problem, Operations Research 9 (1961) 849–859.
[48] A.D. Gordon, Classification: Methods for the Exploratory Analysis of Multi-
variate Data (New York: Chapman and Hall, 1981).
[50] J.C. Gower and G.J.S. Ross, Minimum Spanning Trees and Single Linkage
Cluster Analysis, Applied Statistics 18 (1969) 54–64.
[53] A. Guénoche, Partitions with Minimum Diameter (paper presented at the In-
ternational Federation of Classification Societies Conference, Charlottesville,
USA, 1989).
[60] P. Hansen, B. Jaumard and E. da Silva, Average-Linkage Divisive Hierarchical
Clustering, Les Cahiers du GERAD, G–91–55 (1991). To appear in Journal of
Classification.
[64] P. Hansen, B. Jaumard and C. Meyer, Exact Sequential Algorithms for Additive
Clustering, Les Cahiers du GERAD, (1997) (forthcoming).
[71] P. Hansen, M. Minoux and M. Labbe, Extension de la programmation linéaire
généralisée au cas des programmes mixtes, Comptes Rendus de l’Académie des
Sciences, Paris, 305 (1987) 569–572.
[76] L.J. Hubert, Min and Max Hierarchical Clustering Using Asymmetric Similarity
Measures, Psychometrika 38 (1973) 63–72.
[77] L.J. Hubert and P. Arabie, Iterative Projection Strategies for the Least-Squares
Fitting of Tree Structure to Proximity data, British Journal of Mathematical
and Statistical Psychology 48 (1995) 281–317.
[78] F.K. Hwang, U.G. Rothblum and Y.-C. Yao, Localizing Combinatorial Proper-
ties of Partitions, AT&T Bell Labs Report, (1995).
[80] M. Jambu, Exploratory and Multivariate Data Analysis (Academic Press: New
York 1991).
[81] R.E. Jensen, A Dynamic Programming Algorithm for Cluster Analysis, Opera-
tions Research 17 (1969) 1034–1057.
[82] E.L. Johnson, A. Mehrotra and G.L. Nemhauser, Min-cut Clustering, Mathe-
matical Programming 62 (1993) 133–151.
[84] G. Klein and J.E. Aronson, Optimal Clustering: A Model and Method, Naval
Research Logistics 38 (1991) 447–461.
[85] W.L.G. Koontz, P.M. Narendra and K. Fukunaga, A Branch and Bound Clus-
tering Algorithm, IEEE Transactions on Computers C–24 (1975) 908–915.
[87] G.N. Lance and W.T. Williams, A General Theory of Classificatory Sorting
Strategies. 1. Hierarchical Systems, The Computer Journal 9 (1967) 373–380.
[89] J.K. Lenstra, Clustering a Data Array and the Traveling Salesman Problem,
Operations Research 22 (1974) 993–1009.
[91] D.W. Matula and L.L. Beck, Smallest-Last Ordering and Clustering and Graph-
Coloring Algorithms, Journal of the Association for Computing Machinery 30
(1983) 417–427.
[92] W.T. McCormick Jr, P.J. Schweitzer and T.W. White, Problem Decomposition
and Data Reorganization by a Clustering Technique, Operations Research 20
(1972) 993–1009.
[93] M. Minoux and E. Pinson, Lower Bounds to the Graph Partitioning Prob-
lem through Generalized Linear Programming and Network Flows, RAIRO–
Recherche Opérationnelle 21 (1987) 349–364.
[94] G.W. Milligan and M.C. Cooper, An Examination of Procedures for Determin-
ing the Number of Clusters in a Data Set, Psychometrika 50 (1985) 159–179.
[95] B. Mirkin, Additive Clustering and Qualitative Factor Analysis Methods for
Similarity Matrices, Journal of Classification 4 (1987) 7–31, (Erratum 6, 271–
272).
[97] C. Monma and S. Suri, Partitioning Points and Graphs to Minimize the Max-
imum or the Sum of Diameters. In: Y. Alavi, G. Chartrand, O.R. Oellerman,
A.J. Schwenk, eds., Graph Theory, Combinatorics, and Applications, Proceed-
ings of the Sixth Quadrennial International Conference on the Theory and Ap-
plications of Graphs (New York: Wiley, 1991) 899–912.
[99] J. Ponthier, A.-B. Dufour and N. Normand, Le modèle Euclidien en analyse des
données (Ellipses: Paris 1990).
[100] A.W. Neebe and M.R. Rao, An Algorithm for the Fixed-Charge Assignment
of Users to Sources Problem, Journal of the Operational Research Society 34
(1983) 1107–1113.
[102] M.R. Rao, Cluster Analysis and Mathematical Programming, Journal of the
American Statistical Association 66 (1971) 622–626.
[103] C.R. Reeves, (ed.) Modern Heuristic Techniques for Combinatorial Problems
(Blackwell: London, 1993).
[105] P. Rosenstiehl, L’arbre minimum d’un graphe. In: P. Rosenstiehl (ed.): Théorie
des Graphes (Paris, Dunod, 1967) 357–368.
[106] A. Rusch and R. Wille, Knowledge Spaces and Formal Concept Analysis. In: H.-H.
Bock and W. Polasek (eds.) Data Analysis and Information Systems (Springer:
Berlin 1996) 427–436.
[107] D.M. Ryan and B.A. Foster, An Integer Programming Approach to Scheduling.
In: A. Wren (ed.), Computer Scheduling of Public Transport Urban Passenger
Vehicle and Crew Scheduling (North-Holland: Amsterdam 1981) 269–280.
[109] H. Späth, Cluster Analysis Algorithms for Data Reduction and Classification of
Objects (Ellis Horwood, Chichester, 1980).
[110] L.E. Stanfel, A Recursive Lagrangian Method for Clustering Problems, Euro-
pean Journal of Operational Research 27 (1986) 332–342.
[111] P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy (Freeman: San Francisco
1973).
[112] R.E. Tarjan, An Improved Algorithm for Hierarchical Clustering Using Strong
Components, Information Processing Letters 17 (1983) 37–41.
[114] H.D. Vinod, Integer Programming and the Theory of Grouping, Journal of the
American Statistical Association 64 (1969) 506–519.