International Journal of Pattern Recognition and Artificial Intelligence
© World Scientific Publishing Company
K-means Clustering for Problems with Periodic Attributes
M. Vejmelka∗
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodarenskou vezi 2, 182 07 Prague 8, Czech Republic, vejmelka@cs.cas.cz
P. Musilek
Department of Electrical and Computer Engineering, University of Alberta,
W2-030 ECERF, Edmonton, Alberta, T6G 2V4 Canada, musilek@ece.ualberta.ca
M. Paluš, E. Pelikán
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodarenskou vezi 2, 182 07 Prague 8, Czech Republic, mp/emil@cs.cas.cz
The K-means algorithm is very popular in the machine learning community due to its
inherent simplicity. However, in its basic form it is not suitable for use in problems which
contain periodic attributes, such as oscillator phase, hour of day or directional heading.
A commonly used technique of trigonometrically encoding periodic input attributes to
artificially generate the required topology introduces a systematic error. In this paper, a
metric which induces a conceptually correct topology for periodic attributes is embedded
into the K-means algorithm. This requires solving a non-convex minimization problem
in the maximization step. Results of numerical experiments comparing the proposed
algorithm to K-means with trigonometric encoding on synthetically generated data are
reported. The advantage of using the proposed K-means algorithm is also shown on a
real example using gas load data to build simple predictive models.
Keywords: clustering algorithms; similarity measures; K-means; periodic attributes.
1. Introduction
Clustering is one of the most important problems of data analysis, concerned with
finding structures intrinsic to a set of unlabeled patterns. Typical goals of clustering
include finding prototypes for homogeneous groups of data, description of unknown
properties of data, or detecting unusual data points. To aid the clustering process, it is common to use a distance measure to assess the (reciprocal of) similarity between patterns [5].
K-means [7] is one of the best known clustering algorithms. Due to its simplicity and ease of implementation, it has been used in numerous practical applications. There have also been many works that extend the original form of the algorithm
∗ Department of Cybernetics, Czech Technical University, Technicka 2, 166 27 Prague 6, Czech Republic
to improve its convergence [2], efficiency [6], cluster shape [3], etc. The distance measure used for K-means clustering is usually the Euclidean distance or the more general Minkowski distance [5], which provides a variety of metrics to suit many different types of attributes. However, these conventional distance measures cannot be used to effectively assess the similarity of attributes that are periodic.
To deal with periodic attributes, trigonometric encoding [11] can be used to preprocess the data before clustering. However, this ad hoc approach has several disadvantages. First, the dimensionality of the problem is increased by one for each periodic attribute treated this way. Second, the non-linearities introduced by the trigonometric functions distort the feature space and create artificial structures not present in the original data.
To avoid the above-mentioned problems, we propose an alternative metric suitable for assessing distances between periodic attributes. This metric is first formally introduced and subsequently used in the development of a modified K-means algorithm.
It is then generalized so that it is applicable to problems involving both periodic and
non-periodic attributes. The main virtue of the suggested metric is its unbiasedness:
it does not deform the feature space and thus does not impose any structure on
the data analyzed. In addition, its use does not require the dimensionality of the
feature space to increase.
The new algorithm has been compared to both standard K-means and K-means
with trigonometric encoding of periodic attributes. The results of numerical simulations show the superiority of the proposed Hybrid K-means in problems with
mixed periodic and non-periodic attributes.
The remainder of this paper is organized as follows. Sec. 2 provides a brief
overview of partitioning clustering algorithms. The problems arising when clustering
patterns with periodic attributes are introduced in Sec. 3. Subsequently, the K-means algorithm for clustering patterns with periodic attributes is formalized in
Sec. 4. In Sec. 5, the new distance measure is combined with the Euclidean distance
and a K-means algorithm suitable for clustering patterns containing both periodic
and non-periodic attributes is formulated. Results of numerical simulations using
the proposed algorithm and its variants are presented in Sec. 6. Finally, Sec. 7
provides the major conclusions and points to potential directions of future work.
The Appendix contains the pseudocode of the algorithm introduced in the paper.
2. Clustering Algorithms
Clustering algorithms are broadly divided into hierarchical and partitioning [1]. While
hierarchical techniques build clusters gradually (either by merging or splitting),
partitioning algorithms form clusters directly by dividing data into several subsets.
In partitioning algorithms, which are the subject of this work, certain heuristics are used to iteratively optimize the location and size of the subsets. There are different reallocation schemas facilitating this optimization process; they can be grouped into probabilistic clustering and clustering based on the use of an objective
function. Typical instances of these approaches are mixture-density models and K-means clustering, respectively. These two algorithms are briefly reviewed in the
following text. In addition to the conventional view of K-means as an algorithm
based on objective-function optimization, it is also described in probabilistic terms.
This probabilistic description is later used to derive a new algorithm applicable to
periodic attributes.
2.1. Mixture-density Models
Probabilistic clustering models assume that the data to be clustered come from a
mixture model of several random populations. The clustering problem is to find the
distributions and prior probabilities of these populations.
Assume that the prior probability P(w_k) for cluster w_k, k = 1, . . . , K, and the form of the conditional probability density p(x|w_k, θ_k) are known, with the parameter vector θ_k unknown. The mixture probability density for the whole data set is
p(x|θ) = Σ_{k=1}^{K} p(x|w_k, θ_k) P(w_k),   (1)

where θ = (θ_1, . . . , θ_K) and Σ_{k=1}^{K} P(w_k) = 1. After finding the parameter vector θ, the posterior probability for assigning a pattern to a cluster can be calculated using Bayes' theorem [12].
The unknown parameters that maximize the probability of generating all given observations can be found using the maximum likelihood method, usually expressed in the following logarithmic form:

l(θ) = Σ_{i=1}^{N} log p(x_i|θ).   (2)
As the maximum of the log-likelihood equation at ∂l(θ)/∂θ_k = 0 usually cannot be found analytically, iterative methods are required to approximate the solutions. The expectation-maximization (EM) algorithm is the most popular approach. It augments the observable features x_i^o by a vector x_i^m = (x_{i1}^m, . . . , x_{iK}^m) whose components are assumed missing and attain values x_{ik}^m ∈ {0, 1} depending on whether x_i = {x_i^o, x_i^m} belongs to cluster k (x_{ik}^m = 1) or not (x_{ik}^m = 0). The log-likelihood of the complete data is

l(θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} x_{ik}^m log [P(w_k) p(x_i^o|θ_k)].   (3)
The standard EM algorithm generates a sequence of parameter estimates {θ^0, θ^1, . . . , θ^∗}, where ∗ represents the reaching of the convergence criterion, by executing the following steps [12]:
(1) Initialize θ^0 and set t = 0;
(2) E-step: determine the expectation of the complete-data log-likelihood, Q(θ, θ^t) = E[log p(x^o, x^m|θ) | x^o, θ^t];
(3) M-step: select a new parameter estimate that maximizes the Q-function, θ^{t+1} = arg max_θ Q(θ, θ^t);
(4) Increment t and repeat steps E and M until convergence.
2.2. K-means Clustering
K-means is the best known clustering algorithm based on quantization error minimization [12]. The error can be described as the average distance between each pattern and its closest prototype:

E(w) = (1/2) Σ_{i=1}^{N} min_k (x_i − w_k)²,   (4)

where E is the quantization error, w_k, k ∈ {1, . . . , K}, is the prototype of cluster k, and x_i, i ∈ {1, . . . , N}, is the i-th sample of the data set to be clustered.
The equation above can be reformulated by writing s_w(i) for the subscript of the prototype closest to the sample x_i [2]. Then

E(w) = (1/2) Σ_{i=1}^{N} (x_i − w_{s_w(i)})².   (5)
The quantization error can be minimized using the gradient descent algorithm Δw = −ε_t (∂E(w)/∂w) to perform either batch or online updates of the cluster prototypes, with learning rate ε_t. In this paper, the batch variant in the form

Δw_k = Σ_i { ε_t (x_i − w_k)  if k = s_w(i);  0  otherwise }   (6)

is considered. This step is repeated until convergence, which is guaranteed for suitably chosen learning rates ε_t.
2.3. Probabilistic Framework of K-means
The sum of squares appearing in (5) can be viewed as (the negative of) the log-likelihood of the normally distributed mixture model that is widely used in statistics. K-means can therefore be understood within a general probabilistic framework [1] and derived as an EM-style algorithm. The following notation has been adopted from [2].
Equation (3) can be modified for the K-means algorithm by assuming the missing parameters to be the assignments of patterns x_i to the prototypes w_k. Instead of considering the expected value over the distribution of these missing
parameters, the values of the parameters w′ that minimize the cost given their previous values w are considered:

Q(w′, w) = Σ_{i=1}^{N} (1/2) (x_i − w′_{s_w(i)})².   (7)
In the case of K-means, the new set of prototypes w′ that minimizes Q(w′, w) can be found analytically by solving the equation ∂Q(w′, w)/∂w′_k = 0:

w′_k = (1/N_k) Σ_{i: k = s_w(i)} x_i,   (8)
where N_k is the number of samples assigned to prototype w_k. This algorithm can be reformulated to a form similar to the gradient descent algorithm (6),

Δw_k = Σ_i { (1/N_k)(x_i − w_k)  if k = s_w(i);  0  otherwise },   (9)

where Δw_k = w′_k − w_k; it is thus equivalent to batch gradient descent [2] with a learning rate dependent on the cluster size, ε_t = 1/N_k.
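For concreteness, the two steps can be stated in a few lines of code. The following Python sketch is our own illustrative transcription of the expectation step s_w(i) and the maximization step (8); the array names X (N samples in rows) and W (K prototypes in rows) are not part of the original formulation.

import numpy as np

def kmeans_step(X, W):
    # Expectation step: assign each sample to its nearest prototype,
    # s_w(i) = argmin_k ||x_i - w_k||^2.
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    s = d2.argmin(axis=1)
    # Maximization step (8): move each prototype to the mean of the
    # samples assigned to it; empty clusters are left where they are.
    W_new = W.copy()
    for k in range(W.shape[0]):
        if np.any(s == k):
            W_new[k] = X[s == k].mean(axis=0)
    return W_new, s

Iterating kmeans_step until the assignments no longer change realizes the batch update (9).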
3. Clustering with Periodic Attributes
In this section, periodic attributes are introduced and some examples of such attributes arising in practice are listed. A popular technique for dealing with periodic attributes is presented, and a systematic error arising from its use is shown.
3.1. Periodic Attributes
Intuitively, a periodic attribute is an attribute whose topology can be naturally represented by a circle connecting the supremum and infimum of the set of values. The lengths of segments between points on the circle correspond to our perception of distance. A fitting example is an attribute representing the hour of day, with values in ⟨0, 24). What is the distance between hour 23:00 (11 PM) and 01:00 (1 AM)? Events occurring at these times every day are commonly understood to be 2 hours apart, not 22 hours apart as would be indicated by their Euclidean distance.
Examples of periodic attributes conforming to the above considerations include azimuth, cyclic time indices (minute of hour), oscillator phases, or the hue of the HSV color model (the saturation and value can be considered standard non-periodic attributes). The part of statistics mainly concerned with periodic attributes and spherical sample spaces is circular statistics. This subject has been treated at length by Mardia and Jupp [8], who suggest the cosine distance as a useful distance metric:

d_cos(θ, ξ) = 1 − cos(θ − ξ),   (10)

where θ, ξ are angles in radians. Any periodic attribute can be mapped onto ⟨0, 2π) and represented by an angle (a direction). To elucidate the relationship between a
common technique employed in clustering periodic attributes, which shall be termed K-means with trigonometric encoding, and the cosine distance, some additional quantities will be defined:

C̄ = (1/N) Σ_{i=1}^{N} cos(θ_i),
S̄ = (1/N) Σ_{i=1}^{N} sin(θ_i),   (11)

where θ_i are the samples of the periodic attribute mapped onto ⟨0, 2π). Then the mean direction is defined as

θ̄ = tan^{−1}(S̄/C̄).   (12)
The mean direction is similar to the mean of a set of samples in R. It also has an associated mean resultant length

R̄ = sqrt(C̄² + S̄²),   (13)
which characterizes the spread of points around the mean direction θ̄.
If the dispersion of a set of samples of a periodic attribute around a value α is computed as [8]

D(α) = (1/N) Σ_{i=1}^{N} (1 − cos(θ_i − α)),   (14)
then the mean direction θ̄ is the minimizer of the dispersion D(α) for the set of samples θ_i. Up to the constant factor 1/N, the dispersion D(α) is the sum of cosine distances from α to each θ_i.
The K-means algorithm with trigonometric encoding works by first transforming the periodic attribute t ∈ ⟨0, D) by the pair of functions

c = cos(2πt/D),  s = sin(2πt/D),   (15)

and then applying the standard maximization step (8). Optionally, the values c, s can be linearly mapped onto the range ⟨0, D) to preserve the scale of the attribute. The new prototype coordinate resulting from applying the maximization step (8) in the c dimension is exactly C̄, and the new prototype coordinate in the s dimension is S̄. The corresponding value in the original (unmapped) space is t̄ = θ̄D/(2π), the mean direction linearly scaled to ⟨0, D). We have thus shown that, given a set of samples t_1, t_2, . . . , t_N, the K-means algorithm minimizes (14), which is in fact a sum of cosine distances from the prototype to each of the samples t_i mapped to ⟨0, 2π).
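The mean direction is straightforward to compute. The short Python sketch below is our own illustration rather than part of the original derivation; it evaluates the trigonometric moments (11) and the mean direction (12), using atan2 instead of a bare tan^{−1} so that the quadrant is resolved correctly.

import numpy as np

def mean_direction(t, D):
    # Map the samples from <0, D) onto angles in <0, 2*pi).
    theta = 2.0 * np.pi * np.asarray(t, dtype=float) / D
    C = np.cos(theta).mean()                       # C-bar, eq. (11)
    S = np.sin(theta).mean()                       # S-bar, eq. (11)
    theta_bar = np.arctan2(S, C) % (2.0 * np.pi)   # mean direction (12)
    R_bar = np.hypot(C, S)                         # mean resultant length (13)
    return theta_bar * D / (2.0 * np.pi), R_bar    # t-bar and R-bar

The returned t̄ is the value that the maximization step of K-means with trigonometric encoding effectively produces, i.e. the minimizer of the dispersion (14).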
3.2. Effects of Trigonometric Encoding
Problems involving periodic attributes are sometimes approached by mapping the periodic attributes onto a circle using the trigonometric functions (15) as shown above, for example in [11]. The dimensionality of the clustering problem is effectively increased by one for each periodic attribute that is mapped in this way. More importantly, the non-linearities of the trigonometric functions distort the spatial relationships between patterns. The problem manifests itself in both steps of the K-means algorithm.
In the expectation step of K-means, distances are measured using the standard Euclidean metric, i.e. along straight lines in the plane spanned by the encoding variables c, s (15). This means that the ratio of the distances between points a, c and a, b in Fig. 1 is √2, whereas a correct representation would indicate the ratio of distances to be 2. This distortion is an additional effect stemming from the use of the Euclidean distance to partition the samples for the maximization step. The deformation caused by measuring distances along straight lines preserves the ordering of distances measured along the circle segments but not the ratios of these distances, thus distorting the value of the cost function.
Fig. 1. Deformation of distances caused by trigonometric encoding.
In the maximization step, each coordinate is considered separately when optimizing prototype locations. The K-means algorithm thus works with the projections of the points on the unit circle onto the horizontal and vertical axes shown in Fig. 1. Here, another type of spatial deformation comes into play: points equally spaced on the circle are not equally spaced when projected onto the axes. The result of the maximization step has been described in Sec. 3.1 and corresponds to using the cost function (14) based on the metric (10).
The resulting clusters are biased toward a structure artificially created by the selective contraction of the original feature space resulting from the use of the trigonometric encoding (15). This can easily be seen when there is no real structure in the input data, as depicted in Fig. 2(a), where three ‘clusters’ were generated from a uniform distribution spanning the space ⟨0, 10)². The first attribute, depicted on the vertical axis, is periodic with period D = 10, and the mapping (15) was applied to it, making the problem three-dimensional. The range of the transformed variables c, s resulting from the application of the trigonometric encoding was linearly mapped to ⟨0, D) so that the scaling of the attribute was preserved. The clusters computed by K-means with trigonometric encoding, shown in Fig. 2(b), have a horizontal stripe structure. Even though runs with different seeds may produce slightly different configurations (e.g. the stripes are shifted), the bias in the results is clearly visible. Indeed, for uniformly distributed data, any K-means algorithm, including the one proposed in this paper, will converge to some (meaningless) cluster configuration. However, the use of trigonometric encoding causes a systematic error that is present no matter what input is provided to the algorithm.
(a) Generated clusters. (b) Clusters identified by K-means with trig. encoding.
Fig. 2. The clusters identified by the K-means algorithm with trigonometric encoding display a horizontal striped structure (b); the actual ‘clusters’, with all points drawn from a uniform distribution on ⟨0, 10)², are shown in (a).
In some cases, a smooth transformation of the feature space may be advantageous or may be required by the preprocessing method. In our case, however, the deformation occurs as a by-product of a preprocessing method intended to approximate the spatial relationships of periodic attributes.
4. K-means for Periodic Attributes
In this section, a new EM problem will be formulated, incorporating the features of a metric that does not distort the feature space. This metric has also been referenced in Mardia and Jupp [8]:

d_uni(θ, ξ) = π − |π − |θ − ξ||.   (16)

It can perhaps be more intuitively rewritten as

ρ_D(x, y) = min{|x − y|, D − |x − y|},   (17)
for D = 2π. The behavior of the metric is now clearer: if the term |x − y| is smaller than (or equal to) the term D − |x − y|, then the distance is exactly equal to the Euclidean
distance. Otherwise the metric ‘wraps around’, which intuitively corresponds to
passing through point D ≡ 0 when tracing the shortest distance path on the circle
of values of the periodic attribute. The form (17) can be used with any period
D > 0. This situation is depicted in Fig. 3.
Fig. 3. Shortest paths between the points 0 < a < b < c < D. Between a, b the shortest path does not pass through D ≡ 0. The shortest path between points a, c ‘wraps’ around the circle and passes through the point D ≡ 0. The point a′ = a + D/2 separates the two regions.
In the following, the expectation and maximization steps will be modified to work with this metric. The derivations will be performed for a single periodic attribute with period D. The variables w, x, y will be used to represent elements of the set ⟨0, D) of this periodic attribute.
4.1. Expectation Step
In the expectation step it is necessary to replace the Euclidean metric with the metric (17). The expectation step thus becomes

s_w(i) = arg min_k ρ_D(w_k, x_i).   (18)
The modified expectation step ensures that each sample xi will be assigned to the
cluster prototype closest in the sense of the metric (17).
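As an illustration (our own sketch, with function names chosen here for clarity), the metric (17) and the assignment rule (18) can be written directly:

def rho(x, y, D):
    # Periodic distance (17): length of the shorter arc between x and y
    # on a circle of circumference D.
    d = abs(x - y)
    return min(d, D - d)

def expectation_step(xs, ws, D):
    # Assignment (18): index of the prototype closest to each sample
    # in the sense of rho.
    return [min(range(len(ws)), key=lambda k: rho(ws[k], x, D))
            for x in xs]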
4.2. Maximization Step
The maximization step computes a new cluster prototype location that minimizes the functional

Q(w, w′) = (1/2) Σ_{i=1}^{N} ρ_D(w′_{s_w(i)}, x_i)²,   (19)

where w represents the current cluster prototypes and w′ represents the new cluster prototypes with respect to which the functional is minimized. This is not a simple convex quadratic function and in general has multiple local minima. The above
minimization problem is solved separately for each cluster, as the positions of the cluster centers are mutually independent. The set of patterns belonging to cluster k is

X_k = {x_i | s_w(i) = k},   (20)

where N_k = card X_k is the number of patterns assigned to cluster k. An algorithm computing the global minimum of Q(w, w′) in O(N_k log N_k) time is presented. It is possible to optimize the algorithm further to run in O(N_k) time by pre-sorting the input data and using indirect indexing to access the sorted array. However, for the sake of simplicity and clarity, only the O(N_k log N_k) algorithm is shown.
For each pattern x_i ∈ X_k, the interval ⟨0, D) can be divided into two subintervals such that if the new cluster prototype were located in one of them, the metric would wrap around when computing the distance, and it would not wrap around if the new prototype were in the other one. This is illustrated in Fig. 4 for one pattern located in the interval ⟨D/2, D) and another pattern located in the interval ⟨0, D/2). The point where the shortest distance path changes for the pattern in the upper interval is x_i − D/2. The same happens for patterns located in the interval ⟨0, D/2), but the switch point is at x_i + D/2. A function can therefore be defined assigning a switch point to each x_i:

m(x_i) = x_i + D/2 if x_i < D/2;  m(x_i) = x_i − D/2 if x_i ≥ D/2.   (21)
Fig. 4. Top: wrapping behavior, depending on the position of the new prototype, for a pattern in ⟨D/2, D). Bottom: wrapping behavior, depending on the position of the new prototype, for a pattern in ⟨0, D/2).

The set Y_k = {y_i | y_i = m(x_i), x_i ∈ X_k} is a set of switch points which split the interval ⟨0, D) into at most N_k + 1 subintervals. If any two points coincide exactly, then their switch points coincide as well; the number of unique subintervals is then decreased by one. Starting at the point w′_k = 0, the metric wraps for all x_i ∈ X_k with x_i > D/2 and does not wrap on the interval ⟨0, D/2). Moving the potential
new prototype in the positive direction, N_k switch points are passed on the interval ⟨0, D). All switch points of the patterns in the interval ⟨D/2, D) are reached before all switch points of the patterns in the interval ⟨0, D/2). The interval ⟨0, D) can be divided into subintervals inside which the potential new prototype can move without the shortest distance path changing discontinuously for any x_i. The optimal position within such a subinterval can be computed analytically:

w′_k = (1/N_k) Σ_{i=1}^{N_k} p_i,  where p_i = x_i if ρ_D does not wrap,
                                   p_i = x_i − D if ρ_D wraps and x_i ≥ D/2,
                                   p_i = x_i + D if ρ_D wraps and x_i < D/2.   (22)

It is possible that the new w′ might not lie in the range ⟨0, D) and will have to be corrected by ±D. The optimal position with the smallest cost function (19) over all the subintervals is then set as the new cluster prototype w′_k.
The optimal location and cost of the center, if it lies in the first subinterval, can be computed in O(N_k) time using (22). However, this full computation is only necessary for the first subinterval. After the initial computation it is possible to update the potential optimal location and its associated cost for the remaining subintervals in constant time O(1). The total computational cost is thus dominated by the O(N_k log N_k) cost of the sorting procedure. The pseudocode for the algorithm is shown in the Appendix.
5. Hybrid K-means
A metric measuring distances in problems with mixed periodic and non-periodic
attributes is presented and a Hybrid K-means algorithm suitable for finding clusters
in such problems is described.
Using the distance metric (17) for periodic attributes and the Euclidean metric for non-periodic attributes, it is possible to define a metric for problems in which only some attributes are periodic. Let I ⊂ {1, 2, . . . , M} denote the set of indices of the periodic attributes, where M is the total number of attributes, and let D = {D_j | j ∈ I} be the set of periods of the periodic attributes. A metric representing distances in this space can be written as

ρ_{I,D}(x, y) = sqrt( Σ_{j=1}^{M} d_j² ),  where d_j = ρ_{D_j}(x_j, y_j) if j ∈ I, and d_j = x_j − y_j otherwise.   (23)
The function (23) is a metric, as it satisfies all the necessary conditions: it is non-negative and symmetric with respect to its arguments, and zero if and only if x = y. The triangle inequality for the function (23) is satisfied for each coordinate individually, and therefore also for the square root of the sum of squares. This metric is useful in practice when the clustering problem contains attributes which belong to different domains altogether (hour of day, temperature, wind speed) and some of the attributes are periodic. Introducing coefficients to weigh the contributions of individual attributes to the total distance preserves the metric properties of the function (23), so standard attribute scaling procedures can be applied. A modified version of the standard K-means algorithm can now be presented which is able to effectively find clusters in problems where only some attributes are periodic.
In the expectation step of the Hybrid K-means, the assignments of data points to centers are computed as

s_w(i) = arg min_k ρ_{I,D}(x_i, w_k).   (24)
The Hybrid K-means algorithm uses the original standard K-means maximization
step (8) for non-periodic attributes and the maximization step introduced in Sec. 4.2
for periodic attributes.
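To make the construction concrete, a minimal Python sketch of the metric (23) follows. It is our own illustration; the dictionary argument periods (mapping the index of each periodic attribute to its period D_j) is a representation we chose for the sets I and D.

import math

def rho_hybrid(x, y, periods):
    # Hybrid metric (23): periodic coordinates use the wrapped
    # distance (17), the remaining coordinates are Euclidean.
    total = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        d = abs(xj - yj)
        if j in periods:                  # j in I
            d = min(d, periods[j] - d)    # rho_{D_j}(x_j, y_j)
        total += d * d
    return math.sqrt(total)

For example, with periods = {0: 24.0} the first attribute is treated as an hour of day, so rho_hybrid((23.0, 5.0), (1.0, 5.0), {0: 24.0}) evaluates to 2, not 22.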
6. Experiments
In this section, the Hybrid K-means algorithm is compared to K-means with trigonometric encoding, both on synthetic data sets containing periodic as well as non-periodic attributes and on real data describing the consumption of natural gas as a function of time.
6.1. Case Study
The presented case study has two attributes so that the results can be visually examined. In the following intentionally constructed example, three clusters were generated from a normal distribution with σ = 1 and the respective prototypes {(3, 9), (5, 3), (9, 6)} in the space R², with one attribute periodic with D = 10, shown on the horizontal axes in Figs. 5(a)–5(c). The original clusters, each containing 50 points, are shown in Fig. 5(a).
(a) Original clusters. (b) Trigonometric encoding. (c) Hybrid K-means.
Fig. 5. A case study of the problems caused by using trigonometric encoding with K-means.
To estimate the clusters using K-means with trigonometric encoding, (15) was first applied to each periodic attribute, and then the range ⟨−1, 1⟩ of both resulting variables was linearly mapped to ⟨0, D⟩ to preserve the scale of the attribute. The result of the K-means algorithm with trigonometric encoding is shown in
Fig. 5(b). It can be clearly seen that the points near the center of the image are frequently misclassified. The result is reminiscent of the stripe structure seen in Fig. 2(b). Repeated runs with the above parameters produce similar results.
The Hybrid K-means algorithm clustered the points as depicted in Fig. 5(c). The clusters are quite clear in the original data, and it is immediately seen that the Hybrid K-means algorithm does not suffer from the problems of the K-means algorithm with trigonometric encoding.
6.2. Numerical Studies
The proposed algorithm has been compared to K-means clustering with trigonometric encoding using synthetic data generated similarly to the case study above. To assess the merit of using the Hybrid K-means, experiments have been performed for all combinations of the following parameters: dimension M ∈ {2, 4, 6, 8}, number of clusters K ∈ {2, 3, . . . , 6}, and standard deviation of the patterns around the cluster prototypes σ ∈ {0.8, 1.2, . . . , 2.4}. The standard deviations were chosen so that the clusters did not overlap excessively, as heavy overlap only confounds the results: no method can separate overlapping points. In each problem, the patterns were generated in the space ⟨0, 10)^M. Three scenarios were considered: in the first one, only one attribute was periodic; in the second one, half of the attributes were periodic; and in the third one, all of the attributes were periodic. The period of all periodic attributes was set to D = 10.
For each combination of the above parameters, 100 different realizations of the clusters were created. First, the required number of cluster prototypes was drawn from a uniform distribution. If the cluster centers were too close, a new set of cluster centers was drawn. The minimum distance between any pair of cluster centers was set to 3 to ensure that clusters with σ = 2.4 did not overlap excessively. Each cluster was then populated with 500 points distributed normally around the cluster prototype with the given standard deviation. The clustering algorithms were initialized with the Forgy approach [4] and run 5 times. For each trial, the standard clustering criterion

J = (1/2) Σ_{i=1}^{N} ρ_{I,D}(x_i, w_{s_w(i)})²,   (25)

summing the squares of the distances of all samples to their assigned prototypes, was computed, and the trial with the smallest criterion was used. This is standard practice when applying K-means with random initialization. The performance of the algorithms has been evaluated by computing the number of erroneously assigned points, according to standard criteria for evaluating the performance of clustering methods.
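The restart protocol can be summarized by the following Python sketch. It is illustrative only: hybrid_kmeans and forgy_init are hypothetical names standing in for one full clustering run and for the Forgy initialization, while rho_hybrid is the metric sketched in Sec. 5.

def best_of_restarts(X, K, periods, runs=5):
    # Run the clustering `runs` times from random initializations and
    # keep the partition with the smallest criterion (25).
    best = None
    for _ in range(runs):
        W, s = hybrid_kmeans(X, K, periods, init=forgy_init)
        J = 0.5 * sum(rho_hybrid(x, W[k], periods) ** 2
                      for x, k in zip(X, s))        # criterion (25)
        if best is None or J < best[0]:
            best = (J, W, s)
    return best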
The deformation caused by the trigonometric encoding severely affects the performance of the K-means algorithm in many configurations. The histograms in Fig. 6 clearly show that for mixed attribute sets (periodic and non-periodic), the advantage of Hybrid K-means over K-means with trigonometric encoding is much larger than when all the attributes are periodic. This stems from the fact that there are two distinct deformation effects caused by the trigonometric encoding, as described in Sec. 3.2. If all of the attributes are periodic, then the deformation in the expectation step is the same for all attributes and its negative impact is much smaller. The use of the cosine metric itself does not cause large performance degradation; this is evident from the range of the differences of misclassification counts in the histogram in Fig. 6(c). However, when a mixed attribute set is provided as input, the distortion effect caused by the expectation step is also present as a side effect of using trigonometric encoding and ‘fitting’ periodic attributes into the standard K-means framework. Mixed attribute sets occur quite often and constitute a much more important practical case.
The plots in Fig. 7 show the dependence of the misclassification rate on the number of clusters and on their size, for one periodic attribute and for all attributes periodic. The plots confirm that when all attributes are periodic, the difference in performance is very small compared to the setup with mixed attributes.
Summarizing the performance tests: there are some configurations where the two algorithms perform comparably, and many configurations where the proposed algorithm is significantly better than K-means with trigonometric encoding.
6.3. Application to Real Data
To demonstrate the utility of the proposed Hybrid K-means algorithm, it has been applied to a real data set describing the consumption of natural gas as a function of calendar days. The set contains 1153 data points collected over a period of about three years, between 2003 and 2005. The clustering algorithm has been used to separate the data into several segments, with the goal of building a simple regression model for each segment. The distribution of the data, shown in Fig. 8(a), lends itself to separation into four clusters corresponding to the heating season (upper data points corresponding to high gas consumption at the beginning and end of a calendar year), the non-heating season (lower data points corresponding to the low gas consumption during summer), and two transient periods (two slanted data segments, one with negative and one with positive slope).
One of the attributes, the calendar day, is periodic and thus makes this data set suitable for application of the Hybrid K-means algorithm. By visual inspection, it is obvious that the data points at the beginning and at the end of the year should be grouped together into a cluster corresponding to the heating season. The standard K-means algorithm completely fails to achieve such grouping, as shown in Fig. 8(b). Both remaining algorithms, K-means with trigonometric encoding and Hybrid K-means, group the beginning and the end of the heating season together. Each of the two algorithms does this, however, with a different quality of separation between the heating/non-heating season and the transient periods. In particular, the results obtained by Hybrid K-means provide much better vertical separation of the clusters (with respect to the values of gas consumption). The greater vertical overlap of the neighboring clusters obtained by K-means with trigonometric encoding is caused by the deformation of the feature space described in Sec. 3.2.
To quantify the effect of the different partitions obtained by application of the three variants of K-means, a simple regression model has been built for each cluster to predict the value of consumption from the value of the day index. The model has the following form:

Consumption = m · Day + b.   (26)

Mean absolute percentage errors (MAPE) of the predictions obtained by applying the regression models to the data set used for clustering are summarized in Table 1. The results clearly show the superiority of the proposed Hybrid K-means algorithm. The regression models based on the partition obtained by this algorithm reduce MAPE by 1.42% with respect to the standard K-means algorithm, and by 1.06% compared to K-means with trigonometric encoding. The overall high values of MAPE are caused by the simplicity of the prediction model, which considers only the day index as the driver of gas consumption. More sophisticated forecasting systems include values of temperature and past consumption to build a non-linear autoregressive model, and achieve MAPE around 3.6% [10].
Table 1. Mean absolute percentage error (MAPE) obtained by applying the regression models to the clusters from Fig. 8.

Algorithm        MAPE
Standard         13.13%
Trigonometric    12.87%
Hybrid           11.81%
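As an aside, the per-cluster fitting and MAPE evaluation used here amount to only a few lines of code. The sketch below is our own illustration, not the authors' implementation:

import numpy as np

def cluster_mape(day, consumption, assignment):
    # Fit the linear model (26) separately in each cluster and return
    # the overall mean absolute percentage error of the resulting
    # piecewise predictor on the same data.
    day = np.asarray(day, dtype=float)
    cons = np.asarray(consumption, dtype=float)
    assignment = np.asarray(assignment)
    pred = np.empty_like(cons)
    for k in np.unique(assignment):
        idx = assignment == k
        m, b = np.polyfit(day[idx], cons[idx], 1)   # least-squares fit
        pred[idx] = m * day[idx] + b
    return 100.0 * np.mean(np.abs((cons - pred) / cons))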
7. Conclusions and Future Work
In this paper, we have examined the clustering problem with periodic attributes. It has been shown that the standard K-means algorithm combined with trigonometric encoding suffers from several drawbacks. Trigonometric encoding distorts spatial relationships in the pattern space, which affects both the expectation step and the maximization step of K-means. The resulting partition is burdened with a systematic error caused by the fact that clusters tend to form in the regions where trigonometric encoding contracts the feature space most. In addition, the dimensionality of the feature space is increased when this approach is used.
To alleviate these problems, a solution of the minimization problem using an alternative metric has been shown. This metric correctly represents distances for periodic attributes, and its use does not entail increasing the dimensionality of the feature space. Based on this metric, novel expectation and maximization steps for the K-means algorithm have been derived. Subsequently, a modified K-means clustering algorithm, Hybrid K-means, for problems containing both periodic and non-periodic attributes has been developed.
To explore the behavior of the Hybrid K-means algorithm on synthetic data, a large number of numerical simulations has been performed. The simulations show that the Hybrid K-means algorithm compares favorably with K-means with trigonometric encoding. The statistics of the results clearly show that the dominant cause of performance degradation lies in the expectation step of K-means with trigonometric encoding. As a consequence, the performance gain of the proposed algorithm over the K-means algorithm with trigonometric encoding is largest on mixed attribute datasets with only a small number of periodic attributes. Such datasets are encountered in practice much more frequently than sets with only periodic attributes or with many periodic attributes. As the number of periodic attributes increases, the distortion of the feature space caused by the trigonometric mapping itself has a smaller impact on the quality of the clustering.
The proposed algorithm has also been compared to K-means with trigonometric encoding and to the standard K-means algorithm on a gas load data set. To quantify the quality of the separation of the dataset into the heating, non-heating and two transient periods, simple regression models have been built for each time period identified by each of the three variants of the K-means algorithm. The linear regression models based on the clusters generated by the Hybrid K-means algorithm show the smallest mean absolute percentage error.
The algorithms described in this paper will be applied to several practical problems, such as time series prediction and processing of complex biological signals.
Acknowledgment
This work was supported in part by the EU 6th Framework Programme project BRACCIA (Contract No. 517133 NEST) and the Institutional Research Plan AV0Z10300504, by the Grant Agency of the Academy of Sciences of the Czech Republic (grant No. 1ET400300513), and by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
Appendix A. Optimal Maximization Step Pseudocode
Input: X = {x_1, x_2, . . . , x_{N_k}} (the patterns assigned to cluster k), D > 0
Output: w_k^*

Init:
  Y = sort(X)
  Z = {i | y_i > D/2}; if Z = ∅ then m = N_k + 1 else m = min {i | y_i ∈ Z}
  sum = Σ_{i=1}^{m−1} y_i + Σ_{i=m}^{N_k} (y_i − D)
  w_k = sum/N_k
  c = Σ_{i=1}^{m−1} (w_k − y_i)² + Σ_{i=m}^{N_k} (w_k − y_i + D)²
  c^* = c;  w_k^* = w_k

Loop I (the unwrapping phase):
  for i = m : N_k
    sum = sum + D;  w_k′ = sum/N_k
    c = c + (w_k′ − y_i)² − (w_k − y_i + D)² + (N_k − 1)(w_k′² − w_k²) − 2(w_k′ − w_k)(sum − y_i)
    w_k = w_k′
    if c < c^* then c^* = c;  w_k^* = w_k

Loop II (the wrapping phase):
  for i = 1 : m − 1
    sum = sum + D;  w_k′ = sum/N_k
    c = c + (w_k′ − y_i − D)² − (w_k − y_i)² + (N_k − 1)(w_k′² − w_k²) − 2(w_k′ − w_k)(sum − y_i − D)
    w_k = w_k′
    if c < c^* then c^* = c;  w_k^* = w_k

Termination:
  return w_k^* (corrected by ±D into ⟨0, D) if necessary)
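For readers who prefer executable code, the following Python sketch computes the same global minimizer. It is our own direct O(N_k²) transcription of the idea behind the algorithm (one candidate per wrap pattern of the sorted samples); the constant-time cost updates of Loops I and II, which bring the complexity down to O(N_k log N_k), are deliberately omitted for clarity.

import math

def periodic_mean(xs, D):
    # Global minimizer of the cost (19) for one cluster and a single
    # periodic attribute with period D.
    ys = sorted(x % D for x in xs)
    n = len(ys)
    best_w, best_c = None, math.inf
    # Unwrapping the first k sorted points by +D enumerates, up to a
    # shift of D, every wrap pattern the prototype can induce as it
    # moves around the circle; cf. eq. (22).
    for k in range(n):
        adj = [y + D for y in ys[:k]] + ys[k:]
        w = sum(adj) / n                       # optimal center, eq. (22)
        c = sum((w - a) ** 2 for a in adj)     # cost of this wrap pattern
        if c < best_c:
            best_c, best_w = c, w % D          # corrected into <0, D)
    return best_w

Each candidate cost is an upper bound on the true cost (19) at its center, and the bound is tight for the pattern actually induced by the optimal center, so taking the minimum over all n candidates yields the global minimizer.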
References
1. P. Berkhin, Survey of clustering data mining techniques, Tech. rep., Accrue Software,
San Jose, California (2002).
2. L. Bottou, Y. Bengio, Convergence properties of the K-means algorithms, in:
G. Tesauro, D. Touretzky, T. Leen (Eds.), Advances in Neural Information Processing
Systems, Vol. 7, The MIT Press, 1995, pp. 585–592.
3. M.-C. Su, C.-H. Chou, A modified version of the k-means algorithm with a distance
based on cluster symmetry, IEEE Trans. Pattern Analysis and Machine Intelligence
23 (6) (2001) 674–680.
4. E. Forgy, Cluster analysis of multivariate data: efficiency vs. interpretability of classifications, Biometrics 21 (1965) 768–769.
5. A. Jain, M. Murty, P. Flynn, Data clustering: A review, ACM Computing Surveys
31 (3) (1999) 264–323.
6. T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu, An efficient
k-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Analysis and Machine Intelligence 24 (7) (2002) 881–892.
7. J. B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967, pp. 281–297.
8. K. Mardia, P. Jupp, Directional Statistics, Wiley, Chichester, 2000.
9. R. Mitchell, Forecasting electricity demand using clustering, Applied Informatics 378,
2003, pp. 378–134.
10. P. Musilek, E. Pelikan, T. Brabec, M. Simunek, Recurrent neural network based gating for natural gas load prediction system, in: Proc. WCCI 2006, World Congress on Computational Intelligence, Vancouver, BC, Canada, 2006, pp. 7127–7132.
11. N. H. Viet, J. Mandziuk, Neural and fuzzy neural networks in prediction of natural
gas consumption, Neural, Parallel & Scientific Computations 13, 2005, pp. 265–286.
12. R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural
Networks 16 (3), 2005, pp. 645 – 678.
(a) One periodic attribute. (b) Half of attributes periodic. (c) All attributes periodic.
Fig. 6. Histograms of the differences between the misclassification counts of Hybrid K-means, E_h, and of K-means with trigonometric encoding, E_trig, for different proportions of periodic attributes. The histograms collect the misclassification differences over all tested parameter combinations.
Fig. 7. Dependence of the misclassification rate on the cluster standard deviation and on the number of clusters, for one periodic attribute (panels (a), (b)) and for all attributes periodic (panels (c), (d)).
(a) Gas load data. (b) Standard K-means. (c) K-means with trig. encoding. (d) Hybrid K-means.
Fig. 8. Gas load data (a) and the clustering obtained using three variants of the K-means algorithm (b–d).