International Journal of Pattern Recognition and Artificial Intelligence
© World Scientific Publishing Company
K-means Clustering for Problems with Periodic Attributes
M. Vejmelka∗
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodarenskou vezi 2, 182 07 Prague 8, Czech Republic, vejmelka@cs.cas.cz
P. Musilek
Department of Electrical and Computer Engineering, University of Alberta,
W2-030 ECERF, Edmonton, Alberta, T6G 2V4 Canada, musilek@ece.ualberta.ca
M. Paluš, E. Pelikán
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodarenskou vezi 2, 182 07 Prague 8, Czech Republic, mp/emil@cs.cas.cz
The K-means algorithm is very popular in the machine learning community due to its
inherent simplicity. However, in its basic form it is not suitable for use in problems which
contain periodic attributes, such as oscillator phase, hour of day or directional heading.
A commonly used technique of trigonometrically encoding periodic input attributes to
artificially generate the required topology introduces a systematic error. In this paper, a
metric which induces a conceptually correct topology for periodic attributes is embedded
into the K-means algorithm. This requires solving a non-convex minimization problem
in the maximization step. Results of numerical experiments comparing the proposed
algorithm to K-means with trigonometric encoding on synthetically generated data are
reported. The advantage of using the proposed K-means algorithm is also shown on a
real example using gas load data to build simple predictive models.
Keywords: clustering algorithms; similarity measures; K-means; periodic attributes.
1. Introduction
Clustering is one of the most important problems of data analysis, concerned with
finding structures intrinsic to a set of unlabeled patterns. Typical goals of clustering
include finding prototypes for homogeneous groups of data, description of unknown
properties of data, or detecting unusual data points. To aid the clustering process, it is common to use a distance measure to assess the (reciprocal of) similarity between patterns [5].
K-means [7] is one of the best known clustering algorithms. Due to its simplicity and ease of implementation, it has been used in numerous practical applications. There have also been many works that extend the original form of the algorithm
∗ Department of Cybernetics, Czech Technical University, Technicka 2, 166 27 Prague 6, Czech Republic
to improve its convergence [2], efficiency [6], cluster shape [3], etc. The distance measure used for K-means clustering is usually the Euclidean distance or the more general Minkowski distance [5], which provides a variety of metrics to suit many different types of attributes. However, these conventional distance measures cannot be used to effectively assess the similarity of attributes that are periodic.
To deal with periodic attributes, trigonometric encoding [11] can be used to preprocess the data before clustering. However, this ad hoc approach has several disadvantages. First, the dimensionality of the problem is increased by one for each periodic attribute treated this way. Second, the non-linearities introduced by the trigonometric functions distort the feature space and create artificial structures not present in the original data.
To avoid the above-mentioned problems, we propose an alternative metric suitable for assessing distances between periodic attributes. This metric is first formally introduced and subsequently used in the development of a modified K-means algorithm.
It is then generalized so that it is applicable to problems involving both periodic and
non-periodic attributes. The main virtue of the suggested metric is its unbiasedness:
it does not deform the feature space and thus does not impose any structure on
the data analyzed. In addition, its use does not require the dimensionality of the
feature space to increase.
The new algorithm has been compared to both standard K-means and K-means
with trigonometric encoding of periodic attributes. The results of numerical simulations show the superiority of the proposed Hybrid K-means in problems with
mixed periodic and non-periodic attributes.
The remainder of this paper is organized as follows. Sec. 2 provides a brief
overview of partitioning clustering algorithms. The problems arising when clustering
patterns with periodic attributes are introduced in Sec. 3. Subsequently, the K-means algorithm for clustering patterns with periodic attributes is formalized in
Sec. 4. In Sec. 5, the new distance measure is combined with the Euclidean distance
and a K-means algorithm suitable for clustering patterns containing both periodic
and non-periodic attributes is formulated. Results of numerical simulations using
the proposed algorithm and its variants are presented in Sec. 6. Finally, Sec. 7
provides the major conclusions and points to potential directions of future work.
The Appendix contains the pseudocode of the algorithm introduced in the paper.
2. Clustering Algorithms
Clustering algorithms are broadly divided into hierarchical and partitioning [1]. While
hierarchical techniques build clusters gradually (either by merging or splitting),
partitioning algorithms form clusters directly by dividing data into several subsets.
In partitioning algorithms, which are the subject of this work, certain heuristics are used to iteratively optimize the location and size of the subsets. There are different reallocation schemas facilitating this optimization process; they can be grouped into probabilistic clustering and clustering based on the use of an objective
function. Typical instances of these approaches are mixture-density models and K-means clustering, respectively. These two algorithms are briefly reviewed in the
following text. In addition to the conventional view of K-means as an algorithm
based on objective-function optimization, it is also described in probabilistic terms.
This probabilistic description is later used to derive a new algorithm applicable to
periodic attributes.
2.1. Mixture-density Models
Probabilistic clustering models assume that the data to be clustered come from a
mixture model of several random populations. The clustering problem is to find the
distributions and prior probabilities of these populations.
Assume that the prior probability P(w_k) for cluster w_k, k = 1, . . . , K, and the form of the conditional probability density p(x|w_k, θ_k) are known, with the parameter vector θ_k unknown. The mixture probability density for the whole data set is
p(x|θ) = Σ_{k=1}^{K} p(x|w_k, θ_k) P(w_k),   (1)

where θ = (θ_1, . . . , θ_K) and Σ_{k=1}^{K} P(w_k) = 1. After finding the parameter vector θ, the posterior probability for assigning a pattern to a cluster can be calculated using Bayes' theorem [12].
The unknown parameters that maximize the probability of generating all given observations can be found using the maximum likelihood method, usually expressed in the following logarithmic form:

l(θ) = Σ_{i=1}^{N} log p(x_i|θ).   (2)
As the maximum of the log-likelihood equation at ∂l(θ)/∂θ_k = 0 usually cannot be found analytically, iterative methods are required to approximate the solutions. The expectation-maximization (EM) algorithm is the most popular approach. It augments the observable features x_i^o by a vector x_i^m = (x_{i1}^m, . . . , x_{iK}^m) whose components are assumed missing and attain values x_{ik}^m ∈ {0, 1} depending on whether x_i = {x_i^o, x_i^m} belongs to cluster k (x_{ik}^m = 1) or not (x_{ik}^m = 0). The log-likelihood of the complete data is

l(θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} x_{ik}^m log [P(w_k) p(x_i^o|θ_k)].   (3)
The standard EM algorithm generates a sequence of parameter estimates {θ^0, θ^1, . . . , θ^∗}, where ∗ represents the reaching of the convergence criterion, by executing the following steps [12]:
(1) Initialize θ^0 and set t = 0;
(2) E-step: determine the expectation of the complete-data log-likelihood, Q(θ, θ^t) = E[log p(x^o, x^m|θ) | x^o, θ^t];
(3) M-step: select a new parameter estimate that maximizes the Q-function, θ^{t+1} = arg max_θ Q(θ, θ^t);
(4) Increment t and repeat steps E and M until convergence.
2.2. K-means Clustering
K-means is the best known clustering algorithm based on quantization error minimization [12]. The error can be described as the average distance between each pattern and its closest prototype:

E(w) = (1/2) Σ_{i=1}^{N} min_k (x_i − w_k)²,   (4)

where E is the quantization error, w_k, k ∈ {1, . . . , K}, is the prototype of cluster k, and x_i, i ∈ {1, . . . , N}, is the i-th sample of the data set to be clustered.
The equation above can be reformulated by writing s_w(i) for the subscript of the prototype closest to the sample x_i [2]. Then

E(w) = (1/2) Σ_{i=1}^{N} (x_i − w_{s_w(i)})².   (5)
The quantization error can be minimized using the gradient descent algorithm Δw = −ε_t (∂E(w)/∂w) to perform either batch or online updates of the cluster prototypes, with learning rate ε_t. In this paper, the batch variant in the form

Δw_k = Σ_i { ε_t (x_i − w_k)  if k = s_w(i);  0  otherwise }   (6)

is considered. This step is repeated until convergence, which is guaranteed for suitably chosen learning rates ε_t.
2.3. Probabilistic Framework of K-means
The sum of squares appearing in (5) can be viewed as (the negative of) the log-likelihood of the normally distributed mixture model that is widely used in statistics. K-means can therefore be understood within a general probabilistic framework [1] and derived as an EM-style algorithm. The following notation has been adopted from [2].
Equation (3) can be modified for the K-means algorithm by assuming the missing parameters to be the assignments of patterns x_i to the prototypes w_k. Instead of considering the expected value over the distribution of these missing
parameters, the values of the parameters w′ that minimize the cost given their previous values w are considered:

Q(w′, w) = Σ_{i=1}^{N} (1/2) (x_i − w′_{s_w(i)})².   (7)
In the case of K-means, the new set of prototypes w′ that minimizes Q(w′, w) can be found analytically by solving the equation ∂Q(w′, w)/∂w′_k = 0:

w′_k = (1/N_k) Σ_{i: k = s_w(i)} x_i,   (8)
where N_k is the number of samples assigned to prototype w_k. This algorithm can be reformulated to a form similar to the gradient descent algorithm (6),

Δw_k = Σ_i { (1/N_k)(x_i − w_k)  if k = s_w(i);  0  otherwise },   (9)

where Δw_k = w′_k − w_k; it is thus equivalent to batch gradient descent [2] with a learning rate dependent on the cluster size, ε_t = 1/N_k.
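For concreteness, the two steps can be stated in a few lines of code. The following Python sketch is our own illustrative transcription of the expectation step s_w(i) and the maximization step (8); the array names X (N samples in rows) and W (K prototypes in rows) are not part of the original formulation.

import numpy as np

def kmeans_step(X, W):
    # Expectation step: assign each sample to its nearest prototype,
    # s_w(i) = argmin_k ||x_i - w_k||^2.
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    s = d2.argmin(axis=1)
    # Maximization step (8): move each prototype to the mean of the
    # samples assigned to it; empty clusters are left where they are.
    W_new = W.copy()
    for k in range(W.shape[0]):
        if np.any(s == k):
            W_new[k] = X[s == k].mean(axis=0)
    return W_new, s

Iterating kmeans_step until the assignments no longer change realizes the batch update (9).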
3. Clustering with Periodic Attributes
In this section, periodic attributes are introduced and some examples of such attributes arising in practice are listed. A popular technique for dealing with periodic attributes is presented, and a systematic error arising from its use is shown.
3.1. Periodic Attributes
Intuitively, a periodic attribute is an attribute whose topology can be naturally represented by a circle connecting the supremum and infimum of the set of values. The lengths of segments between points on the circle correspond to our perception of distance. A fitting example is an attribute representing the hour of day, with values in ⟨0, 24). What is the distance between hour 23:00 (11 PM) and 01:00 (1 AM)? Events occurring at these times every day are commonly understood to be 2 hours apart, not 22 hours apart as would be indicated by their Euclidean distance.
Examples of periodic attributes conforming to the above considerations include azimuth, cyclic time indices (minute of hour), oscillator phases, or the hue of the HSV color model (the saturation and value can be considered standard non-periodic attributes). The part of statistics mainly concerned with periodic attributes and spherical sample spaces is circular statistics. This subject has been treated at length by Mardia and Jupp [8], who suggest the cosine distance as a useful distance metric:

d_cos(θ, ξ) = 1 − cos(θ − ξ),   (10)

where θ, ξ are angles in radians. Any periodic attribute can be mapped onto ⟨0, 2π) and represented by an angle (a direction). To elucidate the relationship between a
common technique employed in clustering periodic attributes, which shall be termed K-means with trigonometric encoding, and the cosine distance, some additional quantities will be defined:

C̄ = (1/N) Σ_{i=1}^{N} cos(θ_i),
S̄ = (1/N) Σ_{i=1}^{N} sin(θ_i),   (11)

where θ_i are the samples of the periodic attribute mapped onto ⟨0, 2π). Then the mean direction is defined as

θ̄ = tan^{−1}(S̄/C̄).   (12)
The mean direction is similar to the mean of a set of samples in R. It also has an associated mean resultant length

R̄ = sqrt(C̄² + S̄²),   (13)
which characterizes the spread of points around the mean direction θ̄.
If the dispersion of a set of samples of a periodic attribute around a value α is computed as [8]

D(α) = (1/N) Σ_{i=1}^{N} (1 − cos(θ_i − α)),   (14)
then the mean direction θ̄ is the minimizer of the dispersion D(α) for the set of samples θ_i. Up to the constant factor 1/N, the dispersion D(α) is the sum of cosine distances from α to each θ_i.
The K-means algorithm with trigonometric encoding works by first transforming the periodic attribute t ∈ ⟨0, D) by the pair of functions

c = cos(2πt/D),  s = sin(2πt/D),   (15)

and then applying the standard maximization step (8). Optionally, the values c, s can be linearly mapped onto the range ⟨0, D) to preserve the scale of the attribute. The new prototype coordinate resulting from applying the maximization step (8) in the c dimension is exactly C̄, and the new prototype coordinate in the s dimension is S̄. The corresponding value in the original (unmapped) space is t̄ = θ̄D/(2π), the mean direction linearly scaled to ⟨0, D). We have thus shown that, given a set of samples t_1, t_2, . . . , t_N, the K-means algorithm minimizes (14), which is in fact a sum of cosine distances from the prototype to each of the samples t_i mapped to ⟨0, 2π).
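The mean direction is straightforward to compute. The short Python sketch below is our own illustration rather than part of the original derivation; it evaluates the trigonometric moments (11) and the mean direction (12), using atan2 instead of a bare tan^{−1} so that the quadrant is resolved correctly.

import numpy as np

def mean_direction(t, D):
    # Map the samples from <0, D) onto angles in <0, 2*pi).
    theta = 2.0 * np.pi * np.asarray(t, dtype=float) / D
    C = np.cos(theta).mean()                       # C-bar, eq. (11)
    S = np.sin(theta).mean()                       # S-bar, eq. (11)
    theta_bar = np.arctan2(S, C) % (2.0 * np.pi)   # mean direction (12)
    R_bar = np.hypot(C, S)                         # mean resultant length (13)
    return theta_bar * D / (2.0 * np.pi), R_bar    # t-bar and R-bar

The returned t̄ is the value that the maximization step of K-means with trigonometric encoding effectively produces, i.e. the minimizer of the dispersion (14).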
3.2. Effects of Trigonometric Encoding
Problems involving periodic attributes are sometimes approached by mapping the periodic attributes onto a circle using the trigonometric functions (15) as shown above, for example in [11]. The dimensionality of the clustering problem is effectively increased by one for each periodic attribute that is mapped in this way. More importantly, the non-linearities of the trigonometric functions distort the spatial relationships between patterns. The problem manifests itself in both steps of the K-means algorithm.
In the expectation step of K-means, distances are measured using the standard Euclidean metric, i.e. along straight lines in the plane spanned by the encoding variables c, s (15). This means that the ratio of the distances between points a, c and a, b in Fig. 1 is √2, whereas a correct representation would indicate the ratio of distances to be 2. This distortion is an additional effect stemming from the use of the Euclidean distance to partition the samples for the maximization step. The deformation caused by measuring distances along straight lines preserves the ordering of distances measured along the circle segments but not the ratios of these distances, thus distorting the value of the cost function.
Fig. 1. Deformation of distances caused by trigonometric encoding.
In the maximization step, each coordinate is considered separately when optimizing prototype locations. The K-means algorithm thus works with the projections of the points on the unit circle onto the horizontal and vertical axes shown in Fig. 1. Here, another type of spatial deformation comes into play: points equally spaced on the circle are not equally spaced when projected onto the axes. The result of the maximization step has been described in Sec. 3.1 and corresponds to using the cost function (14) based on the metric (10).
The resulting clusters are biased toward a structure artificially created by the selective contraction of the original feature space resulting from the use of the trigonometric encoding (15). This can easily be seen when there is no real structure in the input data, as depicted in Fig. 2(a), where three ‘clusters’ were generated from a uniform distribution spanning the space ⟨0, 10)². The first attribute, depicted on the vertical axis, is periodic with period D = 10, and the mapping (15) was applied to it, making the problem three-dimensional. The range of the transformed variables c, s resulting from the application of the trigonometric encoding was linearly mapped to ⟨0, D) so that the scaling of the attribute was preserved. The clusters computed by K-means with trigonometric encoding, shown in Fig. 2(b), have a horizontal stripe structure. Even though runs with different seeds may produce slightly different configurations (e.g. the stripes are shifted), the bias in the results is clearly visible. Indeed, for uniformly distributed data, any K-means algorithm, including the one proposed in this paper, will converge to some (meaningless) cluster configuration. However, the use of trigonometric encoding causes a systematic error that is present no matter what input is provided to the algorithm.
(a) Generated clusters. (b) Clusters identified by K-means with trig. encoding.
Fig. 2. The clusters identified by the K-means algorithm with trigonometric encoding display a horizontal striped structure (b); the actual ‘clusters’, with all points drawn from a uniform distribution on ⟨0, 10)², are shown in (a).
In some cases, a smooth transformation of the feature space may be advantageous or may be required by the preprocessing method. In our case, however, the deformation occurs as a by-product of a preprocessing method intended to approximate the spatial relationships of periodic attributes.
4. K-means for Periodic Attributes
In this section, a new EM problem will be formulated, incorporating the features of a metric that does not distort the feature space. This metric has also been referenced in Mardia and Jupp [8]:

d_uni(θ, ξ) = π − |π − |θ − ξ||.   (16)

It can perhaps be more intuitively rewritten as

ρ_D(x, y) = min{|x − y|, D − |x − y|},   (17)
for D = 2π. The behavior of the metric is now clearer: if the term |x − y| is smaller than (or equal to) the term D − |x − y|, then the distance is exactly equal to the Euclidean
distance. Otherwise the metric ‘wraps around’, which intuitively corresponds to
passing through point D ≡ 0 when tracing the shortest distance path on the circle
of values of the periodic attribute. The form (17) can be used with any period
D > 0. This situation is depicted in Fig. 3.
Fig. 3. Shortest paths between the points 0 < a < b < c < D. Between a, b the shortest path does not pass through D ≡ 0. The shortest path between points a, c ‘wraps’ around the circle and passes through the point D ≡ 0. The point a′ = a + D/2 separates the two regions.
In the following, the expectation and maximization steps will be modified to work with this metric. The derivations will be performed for a single periodic attribute with period D. The variables w, x, y will be used to represent elements of the set ⟨0, D) of this periodic attribute.
4.1. Expectation Step
In the expectation step it is necessary to replace the Euclidean metric with the metric (17). The expectation step thus becomes

s_w(i) = arg min_k ρ_D(w_k, x_i).   (18)
The modified expectation step ensures that each sample xi will be assigned to the
cluster prototype closest in the sense of the metric (17).
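As an illustration (our own sketch, with function names chosen here for clarity), the metric (17) and the assignment rule (18) can be written directly:

def rho(x, y, D):
    # Periodic distance (17): length of the shorter arc between x and y
    # on a circle of circumference D.
    d = abs(x - y)
    return min(d, D - d)

def expectation_step(xs, ws, D):
    # Assignment (18): index of the prototype closest to each sample
    # in the sense of rho.
    return [min(range(len(ws)), key=lambda k: rho(ws[k], x, D))
            for x in xs]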
4.2. Maximization Step
The maximization step computes a new cluster prototype location that minimizes the functional

Q(w, w′) = (1/2) Σ_{i=1}^{N} ρ_D(w′_{s_w(i)}, x_i)²,   (19)

where w represents the current cluster prototypes and w′ represents the new cluster prototypes with respect to which the functional is minimized. This is not a simple convex quadratic function and in general has multiple local minima. The above
minimization problem is solved separately for each cluster, as the positions of the cluster centers are mutually independent. The set of patterns belonging to cluster k is

X_k = {x_i | s_w(i) = k},   (20)

where N_k = card X_k is the number of patterns assigned to cluster k. An algorithm computing the global minimum of Q(w, w′) in O(N_k log N_k) time is presented. It is possible to optimize the algorithm further to run in O(N_k) time by pre-sorting the input data and using indirect indexing to access the sorted array. However, for the sake of simplicity and clarity, only the O(N_k log N_k) algorithm is shown.
For each pattern x_i ∈ X_k, the interval ⟨0, D) can be divided into two subintervals such that if the new cluster prototype were located in one of them, the metric would wrap around when computing the distance, and it would not wrap around if the new prototype were in the other one. This is illustrated in Fig. 4 for one pattern located in the interval ⟨D/2, D) and another pattern located in the interval ⟨0, D/2). The point where the shortest distance path changes for the pattern in the upper interval is x_i − D/2. The same happens for patterns located in the interval ⟨0, D/2), but the switch point is at x_i + D/2. A function can therefore be defined assigning a switch point to each x_i:

m(x_i) = x_i + D/2 if x_i < D/2;  m(x_i) = x_i − D/2 if x_i ≥ D/2.   (21)
Fig. 4. Top: wrapping behavior, depending on the position of the new prototype, for a pattern in ⟨D/2, D). Bottom: wrapping behavior, depending on the position of the new prototype, for a pattern in ⟨0, D/2).

The set Y_k = {y_i | y_i = m(x_i), x_i ∈ X_k} is a set of switch points which split the interval ⟨0, D) into at most N_k + 1 subintervals. If any two points coincide exactly, then their switch points coincide as well; the number of unique subintervals is then decreased by one. Starting at the point w′_k = 0, the metric wraps for all x_i ∈ X_k with x_i > D/2 and does not wrap on the interval ⟨0, D/2). Moving the potential
new prototype in the positive direction, N_k switch points are passed on the interval ⟨0, D). All switch points of the patterns in the interval ⟨D/2, D) are reached before all switch points of the patterns in the interval ⟨0, D/2). The interval ⟨0, D) can be divided into subintervals inside which the potential new prototype can move without the shortest distance path changing discontinuously for any x_i. The optimal position within such a subinterval can be computed analytically:

w′_k = (1/N_k) Σ_{i=1}^{N_k} p_i,  where p_i = x_i if ρ_D does not wrap,
                                   p_i = x_i − D if ρ_D wraps and x_i ≥ D/2,
                                   p_i = x_i + D if ρ_D wraps and x_i < D/2.   (22)

It is possible that the new w′ might not lie in the range ⟨0, D) and will have to be corrected by ±D. The optimal position with the smallest cost function (19) over all the subintervals is then set as the new cluster prototype w′_k.
The optimal location and cost of the center, if it lies in the first subinterval, can be computed in O(N_k) time using (22). However, this full computation is only necessary for the first subinterval. After the initial computation it is possible to update the potential optimal location and its associated cost for the remaining subintervals in constant time O(1). The total computational cost is thus dominated by the O(N_k log N_k) cost of the sorting procedure. The pseudocode for the algorithm is shown in the Appendix.
5. Hybrid K-means
A metric measuring distances in problems with mixed periodic and non-periodic
attributes is presented and a Hybrid K-means algorithm suitable for finding clusters
in such problems is described.
Using the distance metric (17) for periodic attributes and the Euclidean metric for non-periodic attributes, it is possible to define a metric for problems in which only some attributes are periodic. Let I ⊂ {1, 2, . . . , M} denote the set of indices of the periodic attributes, where M is the total number of attributes, and let D = {D_j | j ∈ I} be the set of periods of the periodic attributes. A metric representing distances in this space can be written as

ρ_{I,D}(x, y) = sqrt( Σ_{j=1}^{M} d_j² ),  where d_j = ρ_{D_j}(x_j, y_j) if j ∈ I, and d_j = x_j − y_j otherwise.   (23)
The function (23) is a metric, as it satisfies all the necessary conditions: it is non-negative and symmetric with respect to its arguments, and zero if and only if x = y. The triangle inequality for the function (23) is satisfied for each coordinate individually, and therefore also for the square root of the sum of squares. This metric is useful in practice when the clustering problem contains attributes which belong to different domains altogether (hour of day, temperature, wind speed) and some of the attributes are periodic. Introducing coefficients to weigh the contributions of individual attributes to the total distance preserves the metric properties of the function (23), so standard attribute scaling procedures can be applied. A modified version of the standard K-means algorithm can now be presented which is able to effectively find clusters in problems where only some attributes are periodic.
In the expectation step of the Hybrid K-means, the assignments of data points to centers are computed as

s_w(i) = arg min_k ρ_{I,D}(x_i, w_k).   (24)
The Hybrid K-means algorithm uses the original standard K-means maximization
step (8) for non-periodic attributes and the maximization step introduced in Sec. 4.2
for periodic attributes.
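To make the construction concrete, a minimal Python sketch of the metric (23) follows. It is our own illustration; the dictionary argument periods (mapping the index of each periodic attribute to its period D_j) is a representation we chose for the sets I and D.

import math

def rho_hybrid(x, y, periods):
    # Hybrid metric (23): periodic coordinates use the wrapped
    # distance (17), the remaining coordinates are Euclidean.
    total = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        d = abs(xj - yj)
        if j in periods:                  # j in I
            d = min(d, periods[j] - d)    # rho_{D_j}(x_j, y_j)
        total += d * d
    return math.sqrt(total)

For example, with periods = {0: 24.0} the first attribute is treated as an hour of day, so rho_hybrid((23.0, 5.0), (1.0, 5.0), {0: 24.0}) evaluates to 2, not 22.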
6. Experiments
In this section, the Hybrid K-means algorithm is compared to K-means with trigonometric encoding, both on synthetic data sets containing periodic as well as non-periodic attributes and on real data describing the consumption of natural gas as a function of time.
6.1. Case Study
The presented case study has two attributes so that the results can be visually examined. In the following intentionally constructed example, three clusters were generated from a normal distribution with σ = 1 and the respective prototypes {(3, 9), (5, 3), (9, 6)} in the space R², with one attribute periodic with D = 10, shown on the horizontal axes in Figs. 5(a)–5(c). The original clusters, each containing 50 points, are shown in Fig. 5(a).
(a) Original clusters. (b) Trigonometric encoding. (c) Hybrid K-means.
Fig. 5. A case study of the problems caused by using trigonometric encoding with K-means.
To estimate the clusters using K-means with trigonometric encoding, (15) was first applied to each periodic attribute, and then the range ⟨−1, 1⟩ of both resulting variables was linearly mapped to ⟨0, D⟩ to preserve the scale of the attribute. The result of the K-means algorithm with trigonometric encoding is shown in
Fig. 5(b). It can be clearly seen that the points near the center of the image are frequently misclassified. The result is reminiscent of the stripe structure seen in Fig. 2(b). Repeated runs with the above parameters produce similar results.
The Hybrid K-means algorithm clustered the points as depicted in Fig. 5(c). The clusters are quite clear in the original data, and it is immediately seen that the Hybrid K-means algorithm does not suffer from the problems of the K-means algorithm with trigonometric encoding.
6.2. Numerical Studies
The proposed algorithm has been compared to K-means clustering with trigonometric encoding using synthetic data generated similarly to the case study above. To assess the merit of using the Hybrid K-means, experiments have been performed for all combinations of the following parameters: dimension M ∈ {2, 4, 6, 8}, number of clusters K ∈ {2, 3, . . . , 6}, and standard deviation of the patterns around the cluster prototypes σ ∈ {0.8, 1.2, . . . , 2.4}. The standard deviations were chosen so that the clusters did not overlap excessively, as heavy overlap only confounds the results: no method can separate overlapping points. In each problem, the patterns were generated in the space ⟨0, 10)^M. Three scenarios were considered: in the first one, only one attribute was periodic; in the second one, half of the attributes were periodic; and in the third one, all of the attributes were periodic. The period of all periodic attributes was set to D = 10.
For each combination of the above parameters, 100 different realizations of the clusters were created. First, the required number of cluster prototypes was drawn from a uniform distribution. If the cluster centers were too close, a new set of cluster centers was drawn. The minimum distance between any pair of cluster centers was set to 3 to ensure that clusters with σ = 2.4 did not overlap excessively. Each cluster was then populated with 500 points distributed normally around the cluster prototype with the given standard deviation. The clustering algorithms were initialized with the Forgy approach [4] and run 5 times. For each trial, the standard clustering criterion

J = (1/2) Σ_{i=1}^{N} ρ_{I,D}(x_i, w_{s_w(i)})²,   (25)

summing the squares of the distances of all samples to their assigned prototypes, was computed, and the trial with the smallest criterion was used. This is standard practice when applying K-means with random initialization. The performance of the algorithms has been evaluated by computing the number of erroneously assigned points, according to standard criteria for evaluating the performance of clustering methods.
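The restart protocol can be summarized by the following Python sketch. It is illustrative only: hybrid_kmeans and forgy_init are hypothetical names standing in for one full clustering run and for the Forgy initialization, while rho_hybrid is the metric sketched in Sec. 5.

def best_of_restarts(X, K, periods, runs=5):
    # Run the clustering `runs` times from random initializations and
    # keep the partition with the smallest criterion (25).
    best = None
    for _ in range(runs):
        W, s = hybrid_kmeans(X, K, periods, init=forgy_init)
        J = 0.5 * sum(rho_hybrid(x, W[k], periods) ** 2
                      for x, k in zip(X, s))        # criterion (25)
        if best is None or J < best[0]:
            best = (J, W, s)
    return best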
The deformation caused by the trigonometric encoding severely affects the performance of the K-means algorithm in many configurations. The histograms in Fig. 6 clearly show that for mixed attribute sets (periodic and non-periodic), the advantage of Hybrid K-means over K-means with trigonometric encoding is much larger than when all the attributes are periodic. This stems from the fact that there are two distinct deformation effects caused by the trigonometric encoding, as described in Sec. 3.2. If all of the attributes are periodic, then the deformation in the expectation step is the same for all attributes and its negative impact is much smaller. The use of the cosine metric itself does not cause large performance degradation; this is evident from the range of the differences of misclassification counts in the histogram in Fig. 6(c). However, when a mixed attribute set is provided as input, the distortion effect caused by the expectation step is also present as a side effect of using trigonometric encoding and ‘fitting’ periodic attributes into the standard K-means framework. Mixed attribute sets occur quite often and constitute a much more important practical case.
The plots in Fig. 7 show the dependence of the misclassification rate on the number of clusters and on their size, for one periodic attribute and for all attributes periodic. The plots confirm that when all attributes are periodic, the difference in performance is very small compared to the setup with mixed attributes.
Summarizing the performance tests: there are some configurations where the two algorithms perform comparably, and many configurations where the proposed algorithm is significantly better than K-means with trigonometric encoding.
6.3. Application to Real Data
To demonstrate the utility of the proposed Hybrid K-means algorithm, it has been applied to a real data set describing the consumption of natural gas as a function of calendar days. The set contains 1153 data points collected over a period of about three years, between 2003 and 2005. The clustering algorithm has been used to separate the data into several segments, with the goal of building a simple regression model for each segment. The distribution of the data, shown in Fig. 8(a), lends itself to separation into four clusters corresponding to the heating season (upper data points corresponding to high gas consumption at the beginning and end of a calendar year), the non-heating season (lower data points corresponding to the low gas consumption during summer), and two transient periods (two slanted data segments, one with negative and one with positive slope).
One of the attributes, the calendar day, is periodic and thus makes this data set suitable for application of the Hybrid K-means algorithm. By visual inspection, it is obvious that the data points at the beginning and at the end of the year should be grouped together into a cluster corresponding to the heating season. The standard K-means algorithm completely fails to achieve such grouping, as shown in Fig. 8(b). Both remaining algorithms, K-means with trigonometric encoding and Hybrid K-means, group the beginning and the end of the heating season together. Each of the two algorithms does this, however, with a different quality of separation between the heating/non-heating season and the transient periods. In particular, the results obtained by Hybrid K-means provide much better vertical separation of the clusters (with respect to the values of gas consumption). The greater vertical overlap of the neighboring clusters obtained by K-means with trigonometric encoding is caused by the deformation of the feature space described in Sec. 3.2.
To quantify the effect of the different partitions obtained by application of the three variants of K-means, a simple regression model has been built for each cluster to predict the value of consumption from the value of the day index. The model has the following form:

Consumption = m · Day + b.   (26)

Mean absolute percentage errors (MAPE) of the predictions obtained by applying the regression models to the data set used for clustering are summarized in Table 1. The results clearly show the superiority of the proposed Hybrid K-means algorithm. The regression models based on the partition obtained by this algorithm reduce MAPE by 1.42% with respect to the standard K-means algorithm, and by 1.06% compared to K-means with trigonometric encoding. The overall high values of MAPE are caused by the simplicity of the prediction model, which considers only the day index as the driver of gas consumption. More sophisticated forecasting systems include values of temperature and past consumption to build a non-linear autoregressive model, and achieve MAPE around 3.6% [10].
Table 1. Mean absolute percentage error (MAPE) obtained by applying the regression models to the clusters from Fig. 8.

Algorithm        MAPE
Standard         13.13%
Trigonometric    12.87%
Hybrid           11.81%
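As an aside, the per-cluster fitting and MAPE evaluation used here amount to only a few lines of code. The sketch below is our own illustration, not the authors' implementation:

import numpy as np

def cluster_mape(day, consumption, assignment):
    # Fit the linear model (26) separately in each cluster and return
    # the overall mean absolute percentage error of the resulting
    # piecewise predictor on the same data.
    day = np.asarray(day, dtype=float)
    cons = np.asarray(consumption, dtype=float)
    assignment = np.asarray(assignment)
    pred = np.empty_like(cons)
    for k in np.unique(assignment):
        idx = assignment == k
        m, b = np.polyfit(day[idx], cons[idx], 1)   # least-squares fit
        pred[idx] = m * day[idx] + b
    return 100.0 * np.mean(np.abs((cons - pred) / cons))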
7. Conclusions and Future Work
In this paper, we have examined the clustering problem with periodic attributes. It has been shown that the standard K-means algorithm combined with trigonometric encoding suffers from several drawbacks. Trigonometric encoding distorts spatial relationships in the pattern space, which affects both the expectation step and the maximization step of K-means. The resulting partition is burdened with a systematic error caused by the fact that clusters tend to form in the regions where trigonometric encoding contracts the feature space most. In addition, the dimensionality of the feature space is increased when this approach is used.
To alleviate these problems, a solution of the minimization problem using an alternative metric has been shown. This metric correctly represents distances for periodic attributes, and its use does not entail increasing the dimensionality of the feature space. Based on this metric, novel expectation and maximization steps for the K-means algorithm have been derived. Subsequently, a modified K-means clustering algorithm, Hybrid K-means, for problems containing both periodic and non-periodic attributes has been developed.
To explore the behavior of the Hybrid K-means algorithm on synthetic data, a large number of numerical simulations has been performed. The simulations show that the Hybrid K-means algorithm compares favorably with K-means with trigonometric encoding. The statistics of the results clearly show that the dominant cause of performance degradation lies in the expectation step of K-means with trigonometric encoding. As a consequence, the performance gain of the proposed algorithm over the K-means algorithm with trigonometric encoding is largest on mixed attribute datasets with only a small number of periodic attributes. Such datasets are encountered in practice much more frequently than sets with only periodic attributes or with many periodic attributes. As the number of periodic attributes increases, the distortion of the feature space caused by the trigonometric mapping itself has a smaller impact on the quality of the clustering.
The proposed algorithm has also been compared to K-means with trigonometric encoding and to the standard K-means algorithm on a gas load data set. To quantify the quality of the separation of the dataset into the heating, non-heating and two transient periods, simple regression models have been built for each time period identified by each of the three variants of the K-means algorithm. The linear regression models based on the clusters generated by the Hybrid K-means algorithm show the smallest mean absolute percentage error.
The algorithms described in this paper will be applied to several practical problems, such as time series prediction and processing of complex biological signals.
Acknowledgment
This work was supported in part by the EU 6th Framework Programme project BRACCIA (Contract No. 517133 NEST) and the Institutional Research Plan AV0Z10300504, by the Grant Agency of the Academy of Sciences of the Czech Republic (grant No. 1ET400300513), and by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
Appendix A. Optimal Maximization Step Pseudocode
Input: X = {x_1, x_2, . . . , x_{N_k}} (the patterns assigned to cluster k), D > 0
Output: w_k^*

Init:
  Y = sort(X)
  Z = {i | y_i > D/2}; if Z = ∅ then m = N_k + 1 else m = min {i | y_i ∈ Z}
  sum = Σ_{i=1}^{m−1} y_i + Σ_{i=m}^{N_k} (y_i − D)
  w_k = sum/N_k
  c = Σ_{i=1}^{m−1} (w_k − y_i)² + Σ_{i=m}^{N_k} (w_k − y_i + D)²
  c^* = c;  w_k^* = w_k

Loop I (the unwrapping phase):
  for i = m : N_k
    sum = sum + D;  w_k′ = sum/N_k
    c = c + (w_k′ − y_i)² − (w_k − y_i + D)² + (N_k − 1)(w_k′² − w_k²) − 2(w_k′ − w_k)(sum − y_i)
    w_k = w_k′
    if c < c^* then c^* = c;  w_k^* = w_k

Loop II (the wrapping phase):
  for i = 1 : m − 1
    sum = sum + D;  w_k′ = sum/N_k
    c = c + (w_k′ − y_i − D)² − (w_k − y_i)² + (N_k − 1)(w_k′² − w_k²) − 2(w_k′ − w_k)(sum − y_i − D)
    w_k = w_k′
    if c < c^* then c^* = c;  w_k^* = w_k

Termination:
  return w_k^* (corrected by ±D into ⟨0, D) if necessary)
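For readers who prefer executable code, the following Python sketch computes the same global minimizer. It is our own direct O(N_k²) transcription of the idea behind the algorithm (one candidate per wrap pattern of the sorted samples); the constant-time cost updates of Loops I and II, which bring the complexity down to O(N_k log N_k), are deliberately omitted for clarity.

import math

def periodic_mean(xs, D):
    # Global minimizer of the cost (19) for one cluster and a single
    # periodic attribute with period D.
    ys = sorted(x % D for x in xs)
    n = len(ys)
    best_w, best_c = None, math.inf
    # Unwrapping the first k sorted points by +D enumerates, up to a
    # shift of D, every wrap pattern the prototype can induce as it
    # moves around the circle; cf. eq. (22).
    for k in range(n):
        adj = [y + D for y in ys[:k]] + ys[k:]
        w = sum(adj) / n                       # optimal center, eq. (22)
        c = sum((w - a) ** 2 for a in adj)     # cost of this wrap pattern
        if c < best_c:
            best_c, best_w = c, w % D          # corrected into <0, D)
    return best_w

Each candidate cost is an upper bound on the true cost (19) at its center, and the bound is tight for the pattern actually induced by the optimal center, so taking the minimum over all n candidates yields the global minimizer.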
References
1. P. Berkhin, Survey of clustering data mining techniques, Tech. rep., Accrue Software,
San Jose, California (2002).
2. L. Bottou, Y. Bengio, Convergence properties of the K-means algorithms, in:
G. Tesauro, D. Touretzky, T. Leen (Eds.), Advances in Neural Information Processing
Systems, Vol. 7, The MIT Press, 1995, pp. 585–592.
3. M.-C. Su, C.-H. Chou, A modified version of the k-means algorithm with a distance
based on cluster symmetry, IEEE Trans. Pattern Analysis and Machine Intelligence
23 (6) (2001) 674–680.
4. E. Forgy, Cluster analysis of multivariate data: efficiency vs. interpretability of classifications, Biometrics 21 (1965) 768–769.
5. A. Jain, M. Murty, P. Flynn, Data clustering: A review, ACM Computing Surveys
31 (3) (1999) 264–323.
6. T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu, An efficient
k-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Analysis and Machine Intelligence 24 (7) (2002) 881–892.
7. J. B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967, pp. 281–297.
8. K. Mardia, P. Jupp, Directional Statistics, Wiley, Chichester, 2000.
9. R. Mitchell, Forecasting electricity demand using clustering, Applied Informatics 378,
2003, pp. 378–134.
10. P. Musilek, E. Pelikan, T. Brabec, M. Simunek, Recurrent neural network based gating for natural gas load prediction system, in: Proc. WCCI 2006, World Congress on Computational Intelligence, Vancouver, BC, Canada, 2006, pp. 7127–7132.
11. N. H. Viet, J. Mandziuk, Neural and fuzzy neural networks in prediction of natural
gas consumption, Neural, Parallel & Scientific Computations 13, 2005, pp. 265–286.
12. R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural
Networks 16 (3), 2005, pp. 645 – 678.
(a) One periodic attribute. (b) Half of attributes periodic. (c) All attributes periodic.
Fig. 6. Histograms of the differences between the misclassification counts of Hybrid K-means, E_h, and of K-means with trigonometric encoding, E_trig, for different proportions of periodic attributes. The histograms collect the misclassification differences over all tested parameter combinations.
Fig. 7. Dependence of the misclassification rate on the cluster standard deviation and on the number of clusters, for one periodic attribute (panels (a), (b)) and for all attributes periodic (panels (c), (d)).
(a) Gas load data. (b) Standard K-means. (c) K-means with trig. encoding. (d) Hybrid K-means.
Fig. 8. Gas load data (a) and the clustering obtained using three variants of the K-means algorithm (b–d).