
Periodicity Detection in Time Series Databases

2005, IEEE Transactions on Knowledge and Data Engineering

Periodicity mining is used for predicting trends in time series data. Discovering the rate at which the time series is periodic has always been an obstacle for fully automated periodicity mining. Existing periodicity mining algorithms assume that the periodicity rate (or simply the period) is user-specified. This assumption is a considerable limitation, especially in time series data where the period is not known a priori. In this paper, we address the problem of detecting the periodicity rate of a time series database. Two types of periodicities are defined, and a scalable, computationally efficient algorithm is proposed for each type. The algorithms perform in $O(n \log n)$ time for a time series of length $n$. Moreover, the proposed algorithms are extended in order to discover the periodic patterns of unknown periods at the same time without affecting the time complexity. Experimental results show that the proposed algorithms are highly accurate with respect to the discovered periodicity rates and periodic patterns. Real-data experiments demonstrate the practicality of the discovered periodic patterns.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 7, JULY 2005

Mohamed G. Elfeky, Walid G. Aref, Senior Member, IEEE, and Ahmed K. Elmagarmid, Senior Member, IEEE

Index Terms: Periodic patterns mining, temporal data mining, time series forecasting, time series analysis.

1 INTRODUCTION

Time series data captures the evolution of a data value over time. Everyday life offers many examples of time series data: meteorological data containing several measurements (e.g., temperature and humidity), stock prices in financial markets, power consumption data reported by energy companies, and event logs monitored in computer networks. Periodicity mining is a tool that helps in predicting the behavior of time series data [23].
For example, periodicity mining allows an energy company to analyze power consumption patterns and predict periods of high and low usage so that proper planning may take place.

W.G. Aref and M.G. Elfeky are with the Department of Computer Sciences, Purdue University, 250 N. University St., West Lafayette, IN 47907-2066. E-mail: {aref, mgelfeky}@cs.purdue.edu.
A.K. Elmagarmid is with Hewlett-Packard Company, 10955 Tantau Ave., Bldg. 45, ms 4143, Cupertino, CA 95014. E-mail: ahmed_elmagarmid@hp.com.
Manuscript received 3 May 2004; revised 18 Oct. 2004; accepted 4 Feb. 2005; published online 18 May 2005. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-0129-0504.
1041-4347/05/$20.00 © 2005 IEEE

Research in time series data mining has concentrated on discovering different types of patterns: sequential patterns [3], [21], [12], [6], temporal patterns [8], periodic association rules [20], partial periodic patterns [14], [13], [25], [4], and surprising patterns [17], to name a few. These periodicity mining techniques require the user to specify a period that determines the rate at which the time series is periodic. They assume that users either know the value of the period beforehand or are willing to try various period values until satisfactory periodic patterns emerge. Since the mining process must be executed repeatedly to obtain good results, this trial-and-error scheme is clearly not efficient. Even in the case of time series data with a priori known periods, there may be obscure periods and, consequently, interesting periodic patterns that will not be discovered.

The solution to these problems is to devise techniques for discovering potential periods in time series data. Research in this direction has focused either on devising general techniques for discovering potential periods [15], [7] or on devising special techniques for specific periodicity mining problems [24], [19].
Both approaches require multiple phases over the time series in order to output the periodic patterns themselves.

In this paper, we address the problem of discovering potential periods in time series databases, hereafter referred to as periodicity detection. We define two types of periodicities: segment periodicity and symbol periodicity. Whereas segment periodicity concerns the periodicity of the entire time series, symbol periodicity concerns the periodicities of the various symbols or values of the time series. For each periodicity type, a convolution-based algorithm is proposed and analyzed, both theoretically and empirically. Furthermore, we extend the symbol periodicity detection algorithm so that it can discover the periodic patterns of unknown periods, termed obscure periodic patterns. Hence, we detect periodicity rates as well as frequent periodic patterns simultaneously.

The rest of the paper is structured as follows: In Section 2, we outline the related work, with the main emphasis on distinguishing the contributions of this paper. In Section 3, we introduce the notation used throughout the paper, and we formally define the periodicity detection problem as well as the notion of the segment and symbol periodicity types. Sections 4 and 5 describe the two proposed algorithms for periodicity detection in time series databases. Moreover, Section 4 describes the proposed algorithm for mining obscure periodic patterns. In Section 6, the performance of the proposed algorithms is studied extensively, validating the algorithms' accuracy, examining their resilience to noise, and justifying their practicality. Finally, we summarize our findings in Section 7.

Published by the IEEE Computer Society

2 BACKGROUND

Discovering the periodicity rate of time series data has drawn the attention of the data mining research community only very recently. Indyk et al.
[15] have addressed this problem under the name periodic trends and have developed an $O(n \log^2 n)$ time algorithm, where $n$ is the length of the time series. Their notion of a periodic trend is the relaxed period of the entire time series, which is similar to our notion of segment periodicity (Section 3.3). However, our proposed algorithm for segment periodicity detection (Section 5) performs in $O(n \log n)$ time. We conduct a thorough performance study to compare our proposed segment periodicity detection algorithm to the periodic trends algorithm of [15]. In addition to the saving in time, our proposed segment periodicity detection algorithm is more resilient to noise and produces more accurate periods. The proposed segment periodicity detection algorithm favors shorter periods, whereas the periodic trends algorithm of [15] favors longer ones. Shorter periods are more accurate than longer ones since they are more informative. For example, if the daily power consumption of a specific customer has a weekly pattern, it is more informative to report a period of length 7 than to report the periods 14, 21, or other multiples of 7.

Specific to partial periodic patterns, Ma and Hellerstein [19] have developed a linear distance-based algorithm for discovering the potential periods regarding the symbols of the time series. In [24], a similar algorithm has been proposed with some pruning techniques. However, both algorithms miss some valid periods since they only consider the adjacent interarrivals. For example, consider a symbol that occurs in a time series in positions 0, 4, 5, 7, and 10. Since that symbol occurs in positions 0, 5, and 10, one of the underlying periods for that symbol should be 5. However, a distance-based algorithm only considers the adjacent interarrival times 4, 1, 2, and 3 as candidate periods, which clearly do not include the value 5.
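The interarrival example above can be reproduced with a short sketch. The helper names `adjacent_interarrivals` and `all_interarrivals` are hypothetical and are not taken from [19] or [24]; the sketch only contrasts the two sets of candidate periods for the positions quoted in the text.

```python
from itertools import combinations

def adjacent_interarrivals(positions):
    """Gaps between consecutive occurrences only."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

def all_interarrivals(positions):
    """Gaps between every pair of occurrences (quadratic in the number of occurrences)."""
    return sorted({later - earlier for earlier, later in combinations(positions, 2)})

positions = [0, 4, 5, 7, 10]        # occurrences of one symbol, as in the example
adjacent = adjacent_interarrivals(positions)
every_pair = all_interarrivals(positions)
print(adjacent)      # [4, 1, 2, 3]
print(every_pair)    # [1, 2, 3, 4, 5, 6, 7, 10]
```

The period 5 appears only in the all-pairs set, which is the set whose size grows quadratically with the number of occurrences.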
Should a distance-based algorithm [24], [19] be extended to include all possible interarrivals, its complexity would increase to $O(n^2)$. Although Berberidis et al. [7] have proposed an algorithm that considers all possible potential periods, their algorithm is inefficient as it considers one symbol at a time. Moreover, the algorithms of [24], [19], [7] require an additional phase over the time series in order to output the periodic patterns. Not only does our proposed symbol periodicity detection algorithm (Section 4) perform in $O(n \log n)$ time, it also discovers all possible potential periods as well as their corresponding periodic patterns simultaneously.

Hence, we distinguish this paper by the following contributions:

1. We introduce a new notion of time series periodicity in terms of the symbols of the time series, which is termed symbol periodicity.
2. We propose a convolution-based algorithm for symbol periodicity detection. Convolution allows the algorithm to consider all symbols and periods at the same time and to discover the potential periods and the obscure periodic patterns simultaneously.
3. We propose a new algorithm for segment periodicity detection that outperforms the periodic trends algorithm of [15] with respect to time performance, resilience to noise, and accuracy of output periods.

3 PERIODICITY DETECTION PROBLEM

3.1 Notation

Assume that a sequence of $n$ time-stamped feature values is collected in a time series. For a given feature $e$, let $e_i$ be the value of the feature at time-stamp $i$. The time series of feature $e$ is represented as $T = e_0, e_1, \ldots, e_{n-1}$. For example, the feature in a time series for power consumption might be the hourly power consumption rate of a certain customer, while the feature in a time series for stock prices might be the final daily stock price of a specific company.
If we discretize¹ the time series feature values into nominal discrete levels² and denote each level (e.g., high, medium, low, etc.) by a symbol (e.g., $a$, $b$, $c$, etc.), then the set of collected feature values can be denoted as $\Sigma = \{a, b, c, \ldots\}$. Hence, we can view $T$ as a sequence of $n$ symbols drawn from a finite alphabet $\Sigma$.

A time series may also be a sequence of $n$ time-stamped events drawn from a finite set of nominal event types. An example is the event log in a computer network that monitors the various events that occur. Each event type can be denoted by a symbol (e.g., $a$, $b$, $c$, etc.) and, hence, we can use the same notation above.

3.2 Symbol Periodicity

In a time series $T$, a symbol $s$ is said to be periodic with a period $p$ if $s$ exists "almost" every $p$ time-stamps. For example, in the time series $T = abcabbabcb$, the symbol $b$ is periodic with period 4 since $b$ exists every four time-stamps (in positions 1, 5, and 9). Moreover, the symbol $a$ is periodic with period 3 since $a$ exists almost every three time-stamps (in positions 0, 3, and 6 but not 9). We define symbol periodicity as follows.

Let $\pi_{p,l}(T)$ denote the projection of a time series $T$ according to a period $p$ starting from position $l$; that is,

$$\pi_{p,l}(T) = e_l, e_{l+p}, e_{l+2p}, \ldots, e_{l+(m-1)p},$$

where $0 \le l < p$, $m = \lceil (n-l)/p \rceil$, and $n$ is the length of $T$. For example, if $T = abcabbabcb$, then $\pi_{4,1}(T) = bbb$ and $\pi_{3,0}(T) = aaab$. Intuitively, the ratio of the number of occurrences of a symbol $s$ in a certain projection $\pi_{p,l}(T)$ to the length of this projection indicates how often this symbol occurs every $p$ time-stamps. However, this ratio is not quite accurate since it captures all the occurrences, even the outliers. In the example above, the symbol $b$ will be considered periodic with period 3 with a frequency of $1/4$, which is not quite true. As another example, if, for a certain $T$, $\pi_{p,l}(T) = abcbac$, this means that the symbol changes every $p$ time-stamps and so no symbol should be periodic with a period $p$.
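A minimal sketch of the projection and of the plain occurrence ratio discussed above, assuming the time series is held in a Python string. The function names `projection` and `occurrence_ratio` are illustrative choices, not the paper's.

```python
def projection(series, p, l):
    """pi_{p,l}(T): the symbols at positions l, l+p, l+2p, ... of `series`."""
    if not 0 <= l < p:
        raise ValueError("expected 0 <= l < p")
    return series[l::p]

def occurrence_ratio(series, symbol, p, l):
    """Share of the projection occupied by `symbol`, counting every occurrence."""
    projected = projection(series, p, l)
    return projected.count(symbol) / len(projected)

T = "abcabbabcb"
print(projection(T, 4, 1))             # bbb
print(projection(T, 3, 0))             # aaab
print(occurrence_ratio(T, "b", 3, 0))  # 0.25
```

The last value is the frequency of $1/4$ that the text calls misleading: the single $b$ in $\pi_{3,0}(T)$ is an outlier, yet the plain ratio still counts it.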
We remedy this inaccuracy by considering only the consecutive occurrences. A consecutive occurrence of a symbol $s$ in a certain projection $\pi_{p,l}(T)$ indicates that the symbol $s$ reappeared in $T$ after $p$ time-stamps from the previous appearance, which means that $p$ is a potential period for $s$. Let $F_2(s, T)$ denote the number of times the symbol $s$ occurs in two consecutive positions in the time series $T$. For example, if $T = abbaaabaa$, then $F_2(a, T) = 3$ and $F_2(b, T) = 1$.

1. The problem of discretizing time series into a finite alphabet is orthogonal to our problem and is beyond the scope of this paper. See [10], [16] for an exhaustive overview of discretizing and segmentation techniques.
2. Nominal values are distinguished by name only and do not have an inherent order or a distance measure.

Definition 1. If a time series $T$ of length $n$ contains a symbol $s$ such that $\exists\, l, p$, where $0 \le l < p$, and

$$\frac{F_2(s, \pi_{p,l}(T))}{\lceil (n-l)/p \rceil - 1} \ge \psi,$$

where $0 \le \psi \le 1$; then, $s$ is said to be periodic in $T$ with period $p$ at position $l$ with respect to periodicity threshold $\psi$.

For example, in the time series $T = abcabbabcb$, $\frac{F_2(a, \pi_{3,0}(T))}{\lceil 10/3 \rceil - 1} = 2/3$; thus, the symbol $a$ is periodic with period 3 at position 0 with respect to a periodicity threshold $\psi \le 2/3$. Similarly, the symbol $b$ is periodic with period 3 at position 1 with respect to a periodicity threshold $\psi \le 1$.

3.2.1 Obscure Periodic Patterns

The main advantage of Definition 1 is that, not only does it determine the candidate periodic symbols, but it also determines their corresponding periods and locates their corresponding positions. Thus, there are no presumptions of the period value and, so, obscure periodic patterns can be defined as follows:

Definition 2. If a time series $T$ of length $n$ contains a symbol $s$ that is periodic with period $p$ at position $l$ with respect to an arbitrary periodicity threshold, then a periodic single-symbol pattern of length $p$ is formed by inserting the symbol $s$ in position $l$ and inserting the "don't care" symbol $\ast$ in all other positions.

The support of a periodic single-symbol pattern, formed according to Definition 2, is estimated by $\frac{F_2(s, \pi_{p,l}(T))}{\lceil (n-l)/p \rceil - 1}$. For example, in the time series $T = abcabbabcb$, the pattern $a{\ast}{\ast}$ is a periodic single-symbol pattern of length 3 with a support value of $2/3$, and so is the single-symbol pattern ${\ast}b{\ast}$ with a support value of 1. However, we cannot deduce that the pattern $ab{\ast}$ is also periodic since we cannot estimate its support.³ The only thing we know for sure is that its support value will not exceed $2/3$.

Definition 3. In a time series $T$ of length $n$, let $S_{p,l}$ be the set of all the symbols that are periodic with period $p$ at position $l$ with respect to an arbitrary periodicity threshold. Let $\mathcal{S}_p$ be the Cartesian product of all $S_{p,l}$ in an ascending order of $l$; that is,

$$\mathcal{S}_p = (S_{p,0} \cup \{\ast\}) \times (S_{p,1} \cup \{\ast\}) \times \cdots \times (S_{p,p-1} \cup \{\ast\}).$$

Every ordered $p$-tuple $(s_0, s_1, \ldots, s_{p-1})$ that belongs to $\mathcal{S}_p$ corresponds to a candidate periodic pattern of the form $s_0 s_1 \ldots s_{p-1}$, where $s_i \in S_{p,i} \cup \{\ast\}$.

For example, in the time series $T = abcabbabcb$, we have $S_{3,0} = \{a\}$, $S_{3,1} = \{b\}$, and $S_{3,2} = \{\}$. Then, the candidate periodic patterns are $a{\ast}{\ast}$, ${\ast}b{\ast}$, and $ab{\ast}$, ignoring the "don't care" pattern ${\ast}{\ast}{\ast}$.

3. This is similar to the Apriori property of the association rules [2], that is, if A and B are two frequent itemsets, then AB is a candidate frequent itemset that may turn out to be infrequent.

3.3 Segment Periodicity

Unlike symbol periodicity that focuses on the symbols (where different symbols may have different periods), segment periodicity focuses on the entire time series. A time series $T$ is said to be periodic with a period $p$ if it can be divided into equal-length segments, each of length $p$, that are "almost" similar. For example, the time series $T = abcabcabc$ is clearly periodic with a period 3. Likewise, the time series $T = abcabdabc$ is periodic with a period 3 despite the fact that its second segment is not identical to the other segments.

Since the symbols are considered nominal, i.e., no inherent order is assumed, we can simply use Hamming distance to measure the similarity between two segments:

$$H(u, v) = \sum_{j=0}^{m-1} \begin{cases} 1 & u_j \ne v_j \\ 0 & u_j = v_j \end{cases}, \qquad S(u, v) = 1 - H(u, v)/m,$$

where $u$ and $v$ are two segments of equal length $m$; $u_j$ and $v_j$ are the symbols at position $j$ of the two segments $u$ and $v$, respectively; $H$ is the Hamming distance; and $S$ is the similarity measure. The similarity function is defined in such a way that the higher its value, the more similar the two segments are, and $u = v \Leftrightarrow S(u, v) = 1$. Segment periodicity is therefore defined as follows.

Definition 4. If a time series $T$ of length $n$ can be sliced into equal-length segments $T_0, T_1, \ldots, T_i, \ldots, T_N$, each of length $p$, where $T_i = e_{ip}, \ldots, e_{ip+p-1}$, $N = \lfloor n/p \rfloor - 1$, $S(T_i, T_j) \ge \psi \;\; \forall i, j = 0, 1, \ldots, N$, and $0 \le \psi \le 1$, then $T$ is said to be periodic with a period $p$ with respect to periodicity threshold $\psi$.

Definition 5. A period $p$ is said to be perfect if $S(T_i, T_j) = 1 \;\; \forall i, j = 0, 1, \ldots, N$.

For example, the time series $T = abcabcabc$ has a perfect period 3 and the time series $T = abcabdabc$ is periodic with a period 3 with respect to any periodicity threshold $\psi \le 2/3$.

4 SYMBOL PERIODICITY DETECTION

Assume first that the period $p$ is known for some symbols of a specific time series $T$. Then, the problem is reduced to detecting those symbols that are periodic with period $p$. A way to solve this simpler problem is to shift the time series $p$ positions, denoted as $T^{(p)}$, and compare this shifted version $T^{(p)}$ to the original version $T$.
For example, if $T = abcabbabcb$, then shifting $T$ three positions results in $T^{(3)} = {\ast}{\ast}{\ast}abcabba$. Comparing $T$ to $T^{(3)}$ results in four symbol matches. If the symbols are mapped in a particular way, we can deduce that those four matches are actually two for the symbol $a$, both at position 0, and two for the symbol $b$, both at position 1.

Therefore, our proposed algorithm for symbol periodicity detection relies on two main ideas. The first is to obtain a mapping scheme for the symbols, which reveals, upon comparison, the symbols that match and their corresponding positions. Going back to the original problem where the period is unknown, the second idea is to use the concept of convolution in order to shift and compare the time series for all possible values of the period at the same time. The remaining part of this section describes those ideas in detail, starting with the concept of convolution (Section 4.1) since it motivates our mapping scheme (Section 4.2).

Fig. 1. A clarifying example for the mapping scheme.

4.1 Convolution

A convolution [9] is defined as follows: Let $X = [x_0, x_1, \ldots, x_{n-1}]$ and $Y = [y_0, y_1, \ldots, y_{n-1}]$ be two finite length sequences of numbers,⁴ each of length $n$. The convolution of $X$ and $Y$ is defined as another finite length sequence $X \otimes Y$ of length $n$ such that

$$(X \otimes Y)_i = \sum_{j=0}^{i} x_j \, y_{i-j}$$

for $i = 0, 1, \ldots, n-1$. Let $X' = [x'_0, x'_1, \ldots, x'_{n-1}]$ denote the reverse of the vector $X$, i.e., $x'_i = x_{n-1-i}$. Taking the convolution of $X'$ and $Y$ and obtaining its reverse leads to the following:

$$(X' \otimes Y)'_i = (X' \otimes Y)_{n-1-i} = \sum_{j=0}^{n-1-i} x'_j \, y_{n-1-i-j} = \sum_{j=0}^{n-1-i} x_{n-1-j} \, y_{n-1-i-j},$$

i.e.,

$$\begin{aligned}
(X' \otimes Y)'_0 &= x_0 y_0 + x_1 y_1 + \cdots + x_{n-1} y_{n-1}, \\
(X' \otimes Y)'_1 &= x_1 y_0 + x_2 y_1 + \cdots + x_{n-1} y_{n-2}, \\
&\;\;\vdots \\
(X' \otimes Y)'_{n-1} &= x_{n-1} y_0.
\end{aligned}$$

In other words, the component of the resulting sequence at position $i$ corresponds to shifting one of the input sequences $i$ positions and comparing it to the other input sequence.
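The reverse-convolve-reverse identity can be checked numerically. The sketch below assumes NumPy and uses `np.convolve`, truncated to the first $n$ components to match the equal-length definition above; the helper names are hypothetical. It compares the result with an explicit shift-and-compare loop.

```python
import numpy as np

def reverse_convolve_reverse(x, y):
    """(X' (x) Y)' from Section 4.1, keeping only the first n components."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    conv = np.convolve(x[::-1], y)[:n]      # (X' (x) Y)_i for i = 0..n-1
    return conv[::-1]

def shift_and_compare(x, y):
    """Component i is the sum over k of x[k+i] * y[k], i.e., x shifted by i against y."""
    n = len(x)
    return np.array([sum(x[k + i] * y[k] for k in range(n - i)) for i in range(n)])

x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
print(reverse_convolve_reverse(x, y).tolist())   # [70, 56, 39, 20]

# With a 0/1 indicator vector, component p counts occurrences that repeat p positions later.
indicator_b = [1 if symbol == "b" else 0 for symbol in "abcabbabcb"]
print(reverse_convolve_reverse(indicator_b, indicator_b)[3])   # 2
```

The indicator example counts the two pairs of $b$ that lie three positions apart (positions 1 and 4, and positions 4 and 7).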
Therefore, the symbol periodicity detection algorithm performs the following steps:

1. Converts the time series $T$ into two finite sequences of numbers $\Phi(T)$ and $\Phi(T)'$, where $\Phi(T)'$ is the reverse of $\Phi(T)$ (based on the mapping scheme $\Phi$ described in Section 4.2);
2. Performs the convolution between the two sequences, $\Phi(T)' \otimes \Phi(T)$;
3. Reverses the output, $(\Phi(T)' \otimes \Phi(T))'$;
4. Analyzes the component values of the resulting sequence to get the periodic symbols and their corresponding periods and positions (Section 4.2).

It is well-known that convolution can be computed by the fast Fourier transform (FFT) [18] as follows:

$$X \otimes Y = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(X) \cdot \mathrm{FFT}(Y)\big).$$

This computation reduces the time complexity of the convolution to $O(n \log n)$. The brute force approach of shifting and comparing the time series for all possible values of the period has the time complexity $O(n^2)$. Moreover, an external FFT algorithm [22] can be used for large databases that are mined while on disk.

4. The general definition of convolution does not assume equal-length sequences. We adapt the general definition to conform to our problem, in which convolutions only take place between equal-length sequences.

4.2 Mapping Scheme

Let $T = e_0, e_1, \ldots, e_{n-1}$ be a time series of length $n$, where the $e_i$ are symbols from a finite alphabet $\Sigma$ of size $\sigma$. Let $\Phi$ be a mapping for the symbols of $T$ such that $\Phi(T) = \Phi(e_0), \Phi(e_1), \ldots, \Phi(e_{n-1})$. Let $C(T) = (\Phi(T)' \otimes \Phi(T))'$ and $c_i(T)$ be the $i$th component of $C(T)$. The challenge to our algorithm is to obtain a mapping $\Phi$ of the symbols that satisfies two conditions: 1) when the symbols match, this should contribute a nonzero value in the product $\Phi(e_j) \cdot \Phi(e_{i-j})$; otherwise, it should contribute 0, and 2) the value of each component of $C(T)$, $c_i(T) = \sum_{j=0}^{i} \Phi(e_j) \cdot \Phi(e_{i-j})$, should identify the symbols that cause the occurrence of this value and their corresponding positions.

We map the symbols to the binary representation of increasing powers of two [1].
For example, if the time series contains only the three symbols $a$, $b$, and $c$, then a possible mapping could be $a: 001$, $b: 010$, and $c: 100$, corresponding to power values of 0, 1, and 2, respectively. Hence, a time series of length $n$ is converted to a binary vector of length $\sigma n$. For example, let $T = acccabb$; then, $T$ is converted to the binary vector $\bar{T} = 001100100100001010010$. Adopting regular convolution, defined previously, results in a sequence $C(\bar{T})$ of length $\sigma n$. Considering only the $n$ positions $0, \sigma, 2\sigma, \ldots, (n-1)\sigma$, which are the exact start positions of the symbols, gives back the sequence $C(T)$. The latter derivation of $C(T)$ can be written as $C(T) = \pi_{\sigma,0}(C(\bar{T}))$ using the projection notation defined in Section 3.2.

The first condition is satisfied since the only way to obtain a value of 1 contributing to a component of $C(T)$ is that this 1 comes from the same symbol. For example, for $T = acccabb$, although $c_1(\bar{T}) = 1$, this is not considered one of $C(T)$'s components. However, $c_3(\bar{T}) = 3$ and so $c_1(T) = 3$, which corresponds to three matches when $T$ is compared to $T^{(1)}$. Those matches are seen from the manual inspection of $T$ to be two $c$'s and one $b$. Nevertheless, it is not possible to determine those symbols only by examining the value of $c_1(T)$, i.e., the second condition is not yet satisfied. Therefore, we modify the convolution definition to be

$$(X \otimes Y)_i = \sum_{j=0}^{i} 2^j \, x_j \, y_{i-j}.$$

The reason for adding the coefficient $2^j$ is to get a different contribution for each match, rather than an unvarying contribution of 1. For example, when the new definition of convolution is used for the previous example, $c_1(T) = 2^1 + 2^{11} + 2^{14} = 18{,}434$. Fig. 1 illustrates this calculation. The powers of 2 for this value are 1, 11, and 14.
Examining those powers modulo 3, which is the size of the alphabet in this particular example, results in 1, 2, and 2, respectively, which correspond to the symbols $b$, $c$, and $c$, respectively. Fig. 1 gives another example for $c_4(T)$ containing only one power of 2, which is 6, that corresponds to the symbol $a$ since $6 \bmod 3 = 0$ and $a$ was originally mapped to the binary representation of $2^0$. This means that comparing $T$ to $T^{(4)}$ results in only one match of the symbol $a$. Moreover, the power value of 6 reveals that the symbol $a$ is at position 0 in $T^{(4)}$. Note that in the binary vector, the most significant bit is the leftmost one, whereas the most significant position of the time series $T$ is the rightmost one. Therefore, not only can the power values reveal the number of matches of each symbol at each period, they also reveal their corresponding starting positions. This latter observation complies with the definition of symbol periodicity.

Formally, let $s_0, s_1, \ldots, s_{\sigma-1}$ be the symbols of the alphabet of a time series $T$ of length $n$. Assume that each symbol $s_k$ is mapped to the $\sigma$-bit binary representation of $2^k$ to form $\bar{T}$. The sequence $C(\bar{T})$ is computed such that $c_i(\bar{T}) = \sum_{j=0}^{i} 2^j \, e_j \cdot e_{i-j}$ for $i = 0, 1, \ldots, \sigma n - 1$. Thus, $C(T) = \pi_{\sigma,0}(C(\bar{T}))$. Assume that $c_p(T)$ is a nonzero component of $C(T)$. Let $W_p$ denote the set of powers of 2 contained in $c_p(T)$, i.e.,

$$W_p = \{w_{p,1}, w_{p,2}, \ldots\}, \quad \text{where } c_p(T) = \sum_h 2^{w_{p,h}},$$

and let

$$W_{p,k} = \{w_{p,h} : w_{p,h} \in W_p \wedge w_{p,h} \bmod \sigma = k\}.$$

As shown in the previous example, the cardinality of each $W_{p,k}$ represents the number of matches of the symbol $s_k$ when $T$ is compared to $T^{(p)}$. Moreover, let

$$W_{p,k,l} = \{w_{p,h} : w_{p,h} \in W_{p,k} \wedge (n - p - 1 - \lfloor w_{p,h}/\sigma \rfloor) \bmod p = l\}.$$

Revisiting the definition of symbol periodicity, we observe that the cardinality of each $W_{p,k,l}$ is equal to the desired value of $F_2(s_k, \pi_{p,l}(T))$.
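The decoding of $c_p(T)$ into the sets $W_{p,k,l}$ can be traced with exact Python integers. This is a sketch under stated assumptions, not the paper's FFT-based implementation: it evaluates a single component by a direct sum, and it indexes the bits as in the reversed convolution of Section 4.1, so that power $j$ pairs the bit $j$ places from the right end of $\bar{T}$ with the bit $\sigma p$ places to its left. The function names are hypothetical.

```python
from collections import defaultdict

def binary_vector(series, alphabet):
    """Concatenate sigma-bit blocks; symbol s_k becomes the binary representation of 2**k."""
    sigma = len(alphabet)
    bits = []
    for symbol in series:
        block = [0] * sigma
        block[sigma - 1 - alphabet.index(symbol)] = 1   # most significant bit is leftmost
        bits.extend(block)
    return bits

def component(series, alphabet, p):
    """c_p(T): sum of 2**j over the bit pairs that match at binary shift sigma*p."""
    bits = binary_vector(series, alphabet)
    last = len(bits) - 1
    shift = len(alphabet) * p
    return sum(1 << j for j in range(len(bits) - shift)
               if bits[last - j] and bits[last - shift - j])

def decode(value, n, sigma, p):
    """Group the powers of two in c_p(T) into the sets W_{p,k,l}, keyed by (k, l)."""
    sets = defaultdict(list)
    w = 0
    while value:
        if value & 1:
            k = w % sigma                       # which symbol matched
            l = (n - p - 1 - w // sigma) % p    # its position within the period
            sets[(k, l)].append(w)
        value >>= 1
        w += 1
    return dict(sets)

def f2(series, symbol):
    """F_2(s, T): number of times `symbol` occupies two consecutive positions."""
    return sum(1 for a, b in zip(series, series[1:]) if a == b == symbol)

T, alphabet, p = "abcabbabcb", "abc", 3
W = decode(component(T, alphabet, p), len(T), len(alphabet), p)
print(W)   # {(1, 1): [7, 16], (0, 0): [9, 18]}
```

Each list length can be compared against `f2(T[l::p], alphabet[k])`, the brute-force value of $F_2(s_k, \pi_{p,l}(T))$. For the series $acccabb$ used earlier, `component("acccabb", "abc", 1)` evaluates to 18434.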
Working out the example of Section 4 where $T = abcabbabcb$, $n = 10$, and $\sigma = 3$, let $s_0, s_1, s_2 = a, b, c$, respectively. Then, for $p = 3$, $W_3 = \{18, 16, 9, 7\}$, $W_{3,0} = \{18, 9\}$, and $W_{3,0,0} = \{18, 9\} \Rightarrow F_2(a, \pi_{3,0}(T)) = 2$, which conforms to the results obtained previously. As another example, if $T = cabccbacd$, where $n = 9$, $\sigma = 4$, and $s_0, s_1, s_2, s_3 = a, b, c, d$, respectively, then, for $p = 4$, $W_4 = \{18, 6\}$, $W_{4,2} = \{18, 6\}$, $W_{4,2,0} = \{18\} \Rightarrow F_2(c, \pi_{4,0}(T)) = 1$, and $W_{4,2,3} = \{6\} \Rightarrow F_2(c, \pi_{4,3}(T)) = 1$, which are correct since $\pi_{4,0}(T) = ccd$ and $\pi_{4,3}(T) = cc$.

One final detail about our algorithm is the use of the values $w_{p,h}$ to estimate the support of the candidate periodic patterns formed according to Definition 3. Let $s_{j_0} s_{j_1} \ldots s_{j_{p-1}}$ be a candidate periodic pattern that is neither a single-symbol pattern nor the "don't care" pattern, i.e., at least two of the $s_{j_i}$ are not $\ast$. The set $W_{p,j_i,i}$ contains the values responsible for the symbol $s_{j_i} \ne \ast$. Let $\mathcal{W}_p$ be a subset of the Cartesian product of the sets $W_{p,j_i,i}$ for all $i$ where $s_{j_i} \ne \ast$ such that all the values in an ordered pair should have the same value of $\left\lfloor \frac{n - p - 1 - \lfloor w_{p,h}/\sigma \rfloor}{p} \right\rfloor$. The support estimate of that candidate periodic pattern is $\frac{|\mathcal{W}_p|}{\lfloor n/p \rfloor}$. For example, if $T = abcabbabcb$, $W_{3,0,0} = \{18, 9\}$ corresponds to the symbol $a$, and $W_{3,1,1} = \{16, 7\}$ corresponds to the symbol $b$, then for the candidate periodic pattern $ab{\ast}$, $\mathcal{W}_p = \{(18, 16), (9, 7)\}$, and the support of this pattern is $2/3$.

Therefore, our algorithm scans the time series once to convert it into a binary vector according to the proposed mapping, performs the modified convolution on the binary vector, and analyzes the resulting values to determine the symbol periodicities and, consequently, the periodic single-symbol patterns. Then, the set of candidate periodic patterns is formed and the support of each pattern is estimated. The complexity of our algorithm is the complexity of the convolution step that is performed over a binary vector of length $\sigma n$.
Hence, the complexity is $O(\sigma n \log \sigma n)$ when FFT is used to compute the convolution. Given that, in practice, $\sigma \ll n$, the complexity of the symbol periodicity detection algorithm is, in fact, $O(n \log n)$. Note that adding the coefficient $2^j$ to the convolution definition still preserves that

$$X \otimes Y = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(X) \cdot \mathrm{FFT}(Y)\big),$$

since the coefficient can be absorbed into the first operand ($x_j \mapsto 2^j x_j$). The complete algorithm is sketched in Fig. 2.

5 SEGMENT PERIODICITY DETECTION

The idea behind our new algorithm for segment periodicity detection follows the same idea of shifting and comparing the time series for all possible values. However, the interest is turned to the total number of matches rather than the specific symbols that match. For example, if $T = abcabdabc$, then shifting $T$ three positions results in $T^{(3)} = {\ast}{\ast}{\ast}abcabd$. Comparing $T$ to $T^{(3)}$ results in four matches out of six possible matches. We argue that such an occurrence of a large number of matches corresponds to a candidate period for the time series in hand. To ascertain this argument, assume that comparing $T$ with $T^{(p)}$ results in $n - p$ matches for a certain index $p$ and a time series $T$ of length $n$. Then, $e_p = e_0, e_{p+1} = e_1, \ldots, e_{n-1} = e_{n-1-p}$ and, also, $e_{2p} = e_p, e_{2p+1} = e_{p+1}, \ldots$, etc., which means that the segment of length $p$ is periodic and $p$ is a perfect period for $T$. If, for another index $q$, comparing $T$ with $T^{(q)}$ results in a number of matches slightly less than $n - q$ due to a few mismatches, then $q$ can be considered a candidate period for $T$.

Similar to the proposed algorithm for symbol periodicity detection, our proposed algorithm for segment periodicity detection uses the concept of convolution in order to shift and compare the time series for all possible values of the period. However, segment periodicity detection needs a mapping scheme simpler than that of symbol periodicity detection since what matters is the total number of matches rather than the symbols that match.
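The match-counting argument can be illustrated by brute force. The sketch below is quadratic and is meant only to show the quantity that the convolution computes for all shifts at once; `match_counts` is a hypothetical helper name.

```python
def match_counts(series):
    """For every shift p, count the positions where T and T shifted by p carry the same symbol."""
    n = len(series)
    return {p: sum(series[i] == series[i + p] for i in range(n - p)) for p in range(1, n)}

noisy = match_counts("abcabdabc")
clean = match_counts("abcabcabc")
print(noisy[3], "of", 9 - 3)    # 4 of 6
print(clean[3], "of", 9 - 3)    # 6 of 6
```

For $abcabcabc$ the count at shift 3 reaches $n - p$, the perfect-period case; for $abcabdabc$ it falls short by the two comparisons that involve the symbol $d$.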
Accordingly, the mapping scheme $\Phi$ for segment periodicity detection only needs to satisfy the following condition: When the symbols match, this should contribute a nonzero value in the product $\Phi(e_j) \cdot \Phi(e_{i-j})$; otherwise, it should contribute 0.

Fig. 2. The symbol periodicity detection algorithm.

We map the symbols to the $\sigma$th roots of unity⁵ [5], where $\sigma$ is the size of the symbols alphabet. For example, if the time series contains only three symbols, then they are mapped to $\omega^0$, $\omega^1$, $\omega^2$. Moreover, we modify the regular convolution definition to be

$$(X \otimes Y)_i = \sum_{j=0}^{i} x_j \, y_{i-j}^{-1}.$$

Without loss of generality, assume that a symbol $s_k$ is mapped to $\omega^k$. Adopting the modified convolution, a symbol match (say $s_k$) contributes the product $\omega^k \omega^{-k} = 1$. A mismatch, i.e., $s_k \ne s_l$, contributes a perturbative term $\omega^{k-l} \ne 1$. Since the sum of the $\sigma$th roots of unity is equal to zero, i.e., $\sum_{i=0}^{\sigma-1} \omega^i = 0$, then $E(\omega^Z) = 0$ when $Z$ is a uniformly distributed random variable over $\{0, 1, \ldots, \sigma-1\}$. In other words, if the convolution computation is repeated for all possible mappings over $\{\omega^0, \omega^1, \ldots, \omega^{\sigma-1}\}$, then the mean value at a mismatch position will be equal to 0 and the mean value at a match position will be equal to 1, which satisfies the aforementioned condition. Therefore, we can obtain the convolution of the time series $C(T)$ as the mean of the inner product $\sum_{j=0}^{i} \Phi(e_j) \cdot \Phi(e_{i-j})^{-1}$ over all the possible mappings. Experiments show that accurate estimates can be achieved with few iterations rather than iterating over all possible mappings.

5. The $n$th roots of unity are the $n$ solutions to the equation $x^n = 1$. The $k$th root has the form $\omega^k = e^{2\pi i k/n}$.

Similar to the symbol periodicity detection algorithm, the complexity of the segment periodicity detection algorithm is the complexity of the convolution step that is performed over a vector of length $n$, yet is repeated $\ell$ times. Hence, the complexity is $O(\ell n \log n)$ when FFT is used to compute the convolution. The maximum value for $\ell$ is $\sigma!$, which is the number of all possible mappings of the symbols. Clearly, this value is too large and, hence, is not negligible. However, in practice, few iterations are enough to get accurate estimates, i.e., $\ell \ll \sigma!$. Therefore, the practical complexity of the segment periodicity detection algorithm is $O(n \log n)$. Note that the modified convolution definition used here still preserves that

$$X \otimes Y = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(X) \cdot \mathrm{FFT}(Y)\big).$$

The complete algorithm is sketched in Fig. 3.

Fig. 3. The segment periodicity detection algorithm.

6 EXPERIMENTAL STUDY

This section contains the results of an extensive experimental study that examines various aspects of the proposed algorithms.⁶ The most important aspect is the accuracy with respect to the discovered periods, which is the subject of Section 6.1. As noisy data is inevitable, Section 6.2 scrutinizes the resilience of the proposed algorithms to various types of noise that can occur in time series data. The estimation accuracy of the segment periodicity detection algorithm is studied in Section 6.3. Then, the time performance of the proposed algorithms is studied in Section 6.4. The practicality and usefulness of the results are explored using real data experiments shown in Section 6.5.

6. The source code of the proposed algorithms is available at http://www.cs.purdue.edu/~mgelfeky/Source.

The periodic trends algorithm of [15] is compared with our proposed algorithms throughout the experiments. As discussed earlier in Section 2, the periodic trends algorithm of [15] is the fastest in the literature for detecting all valid candidate periods. In our experiments, we exploit synthetic data as well as real data.
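Before turning to the data sets, the averaging argument behind the roots-of-unity mapping of Section 5 can be checked on a toy series. The sketch below is an illustration under stated assumptions, not the algorithm of Fig. 3: it assumes NumPy, uses `np.convolve` rather than an explicit FFT, and draws each symbol's exponent independently from $\{0, \ldots, \sigma-1\}$, which is the setting in which $E(\omega^Z) = 0$ applies directly. Enumerating all such assignments makes the mean exact; the helper names are hypothetical.

```python
import itertools
import numpy as np

def modified_convolution(x):
    """Reversed modified convolution of x with itself: component p = sum_k x[k+p] * x[k]**-1."""
    n = len(x)
    inverse = np.conj(x)                 # the inverse of a root of unity is its conjugate
    return np.convolve(x[::-1], inverse)[:n][::-1]

def mean_over_mappings(series, alphabet, mappings):
    """Average the modified convolution over the given exponent assignments."""
    sigma = len(alphabet)
    omega = np.exp(2j * np.pi / sigma)
    index = np.array([alphabet.index(symbol) for symbol in series])
    total = np.zeros(len(series), dtype=complex)
    count = 0
    for exponents in mappings:
        x = omega ** np.asarray(exponents)[index]
        total += modified_convolution(x)
        count += 1
    return (total / count).real

def exact_matches(series):
    """Number of matching positions between T and T shifted by p, for p = 0..n-1."""
    n = len(series)
    return [sum(series[i] == series[i + p] for i in range(n - p)) for p in range(n)]

T, alphabet = "abcabdabc", "abcd"
sigma = len(alphabet)

# Exhaustive average over every assignment of exponents (sigma**sigma of them).
exhaustive = mean_over_mappings(T, alphabet, itertools.product(range(sigma), repeat=sigma))

# A few random assignments give a noisy estimate of the same quantity.
rng = np.random.default_rng(0)
sampled = mean_over_mappings(
    T, alphabet, (rng.integers(0, sigma, size=sigma) for _ in range(20)))
```

With the exhaustive average, every mismatch term cancels and the result agrees with `exact_matches(T)` up to floating-point error; the sampled version trades that exactness for far fewer convolutions.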
We generate controlled synthetic time series data by tuning the following parameters: data distribution, period, alphabet size, and the type and amount of noise. Both uniform and normal data distributions are considered. The types of noise are replacement, insertion, deletion, or any mixture of them. Inerrant data is generated by repeating a pattern, of length equal to the period, that is randomly generated from the specified data distribution. The pattern is repeated until it spans the specified time series length. Noise is introduced randomly and uniformly over the whole time series. Replacement noise is introduced by replacing the symbol at a randomly selected position in the time series with another symbol. Insertion or deletion noise is introduced by inserting a new symbol or deleting the current symbol at a randomly selected position in the time series.

Two databases serve the purpose of real data experiments. The first one is a relatively small database that contains the daily power consumption rates of some customers over a period of one year. It is made available through the CIMEG⁷ project. The database size is approximately 5 Megabytes. The second database is a Wal-Mart database of 70 Gigabytes, which resides on an NCR Teradata Server running the NCR Teradata Database System. It contains sanitized data of timed sales transactions for some Wal-Mart stores over a period of 15 months. The timed sales transactions data has a size of 130 Megabytes.

7. CIMEG: Consortium for the Intelligent Management of the Electric Power Grid, http://helios.ecn.purdue.edu/~cimeg.

In both databases, the numeric data values are discretized into five levels, i.e., the alphabet size equals 5. The levels are very low, low, medium, high, and very high. For the power consumption data, discretizing is based on discussions with domain experts (very low corresponds to less than 6,000 Watts/Day, and each level has a 2,000 Watts range).
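The synthetic data generation and the power consumption discretization described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, only the uniform distribution is shown, the noise ratio is assumed to lie between 0 and 1 with an alphabet of at least two symbols, and the level boundaries (half-open intervals at 6,000, 8,000, 10,000, and 12,000 Watts/Day, with letters "a" to "e") are one reading of the description.

```python
import random


def generate_inerrant(length, period, alphabet_size, rng):
    """Repeat one uniformly random pattern of `period` symbols up to `length`."""
    pattern = [rng.randrange(alphabet_size) for _ in range(period)]
    return [pattern[i % period] for i in range(length)]


def add_noise(series, ratio, kinds, alphabet_size, rng):
    """Apply int(ratio * len(series)) noise events, split evenly among `kinds`.

    `kinds` is a string such as "R", "ID", or "RID"
    (R = replacement, I = insertion, D = deletion).
    """
    noisy = list(series)
    for event in range(int(ratio * len(series))):
        kind = kinds[event % len(kinds)]
        if kind == "R":  # replace the symbol at a random position by another one
            pos = rng.randrange(len(noisy))
            noisy[pos] = rng.choice(
                [s for s in range(alphabet_size) if s != noisy[pos]]
            )
        elif kind == "I":  # insert a new symbol at a random position
            noisy.insert(rng.randrange(len(noisy) + 1), rng.randrange(alphabet_size))
        elif kind == "D":  # delete the symbol at a random position
            del noisy[rng.randrange(len(noisy))]
    return noisy


def discretize_power(watts_per_day):
    """Map a daily consumption value to a level from 'a' (very low) to 'e' (very high)."""
    if watts_per_day < 6000:
        return "a"
    # Assumed boundaries: each further level spans 2,000 Watts; the top level is open-ended.
    return "abcde"[min(4, 1 + int((watts_per_day - 6000) // 2000))]


if __name__ == "__main__":
    rng = random.Random(7)
    clean = generate_inerrant(20, 5, 10, rng)
    print(clean)
    print(add_noise(clean, 0.2, "RID", 10, rng))
    print([discretize_power(v) for v in (4500, 6500, 9000, 13000)])
```

Cycling through `kinds` is one simple way to distribute the noise ratio equally among the selected noise types.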
For the timed sales transactions data, discretizing is based on manual inspection of the values (very low corresponds to zero transactions per hour, low corresponds to less than 200 transactions per hour, and each level has a 200 transactions range).

6.1 Accuracy

Synthetic data, both inerrant and noisy, are used in this experiment in order to inspect the accuracy of the proposed algorithms. The accuracy measure that we use is the ability of the algorithms to detect the periodicities that are artificially embedded into the synthetic data. To accurately discover a period, it is not enough to discover it at any periodicity threshold value. In other words, the periods discovered with a high periodicity threshold value are better candidates than those discovered with a lower periodicity threshold value. Therefore, we define the confidence of a discovered period to be the highest periodicity threshold value at which this period is still detected. The accuracy is measured by the average confidence of all the periods that were artificially embedded into the synthetic data.

Figs. 4 and 5 give the results of this experiment. We use the symbols “U” and “N” to denote the uniform and the normal distributions, respectively, and the symbol “P” to denote the period. Recall that inerrant synthetic data is generated in such a way that it is perfectly periodic, i.e., the periodicities embedded are P, 2P, .... If the data is perfectly periodic, the confidence of all the periodicities should be 1. Time series lengths of 1M symbols are used with an alphabet size of 10. The values collected are averaged over 100 runs.

Fig. 4. Accuracy of the symbol periodicity detection algorithm. (a) Inerrant data. (b) Noisy data.

Fig. 5. Accuracy of the segment periodicity detection algorithm. (a) Inerrant data. (b) Noisy data.

Fig. 4a shows that the symbol periodicity detection algorithm is able to detect all the embedded periodicities in the inerrant time series data with the highest possible confidence. Fig. 5a shows a similar behavior only when the period divides the time series length. In general, Fig. 5 illustrates that the behavior of the segment periodicity detection algorithm depends on the period. When the period divides the time series length, the algorithm detects all the periods at the highest confidence. When the period does not divide the time series length, the algorithm favors the lower values. This behavior is due to the misalignment of the last portion of the time series that does not form a complete periodic segment. Fig. 4 shows an unbiased behavior of the symbol periodicity detection algorithm with respect to the period, since what matters are the symbols rather than the whole segment. Both figures show an expected decrease in the confidence values due to the presence of noise.

Fig. 6 gives the results of the same experiment for the periodic trends algorithm of [15]. However, in order to inspect the accuracy of that algorithm, we first briefly discuss its output. The algorithm computes an absolute value for each possible period and then outputs the periods that correspond to the minimum absolute values as the best candidate periods. In other words, if the absolute values are sorted in ascending order, the corresponding periods will be ordered from the best to the worst candidate. Therefore, it is the rank of the period in this candidacy order that favors one period over another, rather than its corresponding absolute value. Normalizing the ranks to be real-valued ranging from 0 to 1 is trivial. The normalized rank can be considered as the confidence value of each period (the best candidate period, which has rank 1, will have a normalized rank value of 1, and worse candidate periods will have lower normalized rank values).
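One possible normalization is sketched below in Python. The text does not give the formula, so the linear scaling used here (best candidate mapped to 1, worst mapped to 0) and the function name `normalized_rank_confidence` are assumptions made for illustration only.

```python
def normalized_rank_confidence(scores):
    """Turn per-period absolute values (smaller is better) into confidences.

    `scores` maps each candidate period to the absolute value computed for it.
    Assumption: ranks are scaled linearly so that the best candidate gets 1.0
    and the worst candidate gets 0.0.
    """
    ordered = sorted(scores, key=scores.get)  # best candidate first
    if len(ordered) == 1:
        return {ordered[0]: 1.0}
    last = len(ordered) - 1
    return {period: 1.0 - index / last for index, period in enumerate(ordered)}


if __name__ == "__main__":
    # Hypothetical absolute values for three candidate periods.
    print(normalized_rank_confidence({7: 0.2, 14: 0.1, 5: 0.9}))
```

With this choice only the order of the absolute values matters, which matches the observation that the rank, not the absolute value itself, favors one period over another.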
This means that, if the data is perfectly periodic, the embedded periodicities should have the highest ranks and, so, confidence values close to 1. Fig. 6b shows a biased behavior of the periodic trends algorithm with respect to the period, as it favors the longer periods. However, we believe that the shorter periods are more accurate than the longer ones since they are more informative. For example, if the power consumption of a specific customer has a weekly pattern, it is more informative to report the period of 7 days than to report the periods of 14, 21, or other multiples of 7.

Fig. 6. Accuracy of the periodic trends algorithm. (a) Inerrant data. (b) Noisy data.

Fig. 7. Resilience to noise of the symbol periodicity detection algorithm. (a) Uniform, Period = 25. (b) Normal, Period = 32.

Fig. 8. Resilience to noise of the segment periodicity detection algorithm. (a) Uniform, Period = 32. (b) Normal, Period = 25.

6.2 Resilience to Noise

As mentioned before, there are three types of noise: replacement, insertion, and deletion noise. This set of experiments studies the behavior of the periodicity detection algorithms toward these types of noise as well as different mixtures of them. Results are given in Figs. 7 and 8, in which we use the symbols “R,” “I,” and “D” to denote the three types of noise, respectively. Two or more types of noise can be mixed, e.g., “R I D” means that the noise ratio is distributed equally among replacement, insertion, and deletion, while “I D” means that the noise ratio is distributed equally among insertion and deletion only. Time series lengths of 1M symbols are used with an alphabet size of 10. The values collected are averaged over 100 runs. Since the behaviors were similar regardless of the period or the data distribution, an arbitrary combination of a period and a data distribution is selected for each figure.

The figures show an expected decrease in the confidence with the increase in noise. Both algorithms are quite resilient to replacement noise. At a periodicity threshold of 40 percent, both algorithms can tolerate 50 percent replacement noise in the data. When the other types of noise are involved, separately or mixed with replacement noise, the algorithms perform poorly. However, segment periodicity detection can be considered roughly resilient to those other types since periodicity thresholds in the range of 10 percent to 20 percent are not uncommon.

Fig. 9. Estimation accuracy of the segment periodicity detection algorithm. (a) Wal-Mart data. (b) Synthetic data.

Fig. 10. Time behavior of the periodicity detection algorithms. (a) Wal-Mart data. (b) Synthetic data.

6.3 Estimation Accuracy of Segment Periodicity

The next experiment studies the estimation accuracy of the proposed segment periodicity detection algorithm. We examine the accuracy by measuring the closeness of the convolution values estimated by the algorithm to the exact values. Fig. 9 gives the results for both real data (Wal-Mart timed sales transactions) and synthetic data (uniform and normal data). Fig. 9 shows that the estimation accuracy increases when more iterations are carried out. This is expected, as the mean values converge to the exact values over all iterations. More importantly, Fig. 9 shows that an estimation accuracy of 90 percent is achieved after only two to four iterations, which supports our claim that few iterations are enough to obtain accurate estimates.

6.4 Time Performance

To evaluate the time performance of the proposed periodicity detection algorithms, Fig. 10a exhibits the time behavior of both algorithms with respect to the time series length. Portions of the Wal-Mart timed sales transactions data are used, with lengths that are powers of 2 up to 128 Megabytes. Since the previous experiment (Fig. 9) shows that few iterations suffice, only three iterations of the segment periodicity detection algorithm are carried out in this experiment. Fig. 10a shows that the execution time is linearly proportional to the time series length. More importantly, the figure shows that segment periodicity detection takes less execution time than symbol periodicity detection. Fig. 10b supports this latter observation using synthetic data (1M symbols with an alphabet size of 10) and increasing the number of iterations for the segment periodicity detection. Note that there are no iterations in the symbol periodicity detection algorithm. The straight line in the figure represents a fixed value for the execution time of the symbol periodicity detection algorithm. Fig. 10b shows that, up to 18 iterations, segment periodicity detection still takes less execution time than symbol periodicity detection.

Furthermore, Fig. 10a shows that the proposed segment periodicity detection algorithm outperforms the periodic trends algorithm of [15] with respect to the execution time. This experimental result agrees with the theoretical results, as the periodic trends algorithm [15] performs in $O(n \log^2 n)$ time whereas our proposed segment periodicity detection algorithm performs in $O(n \log n)$ time.

TABLE 1. Segment Periodicity Detection Output

TABLE 2. Symbol Periodicity Detection Output

6.5 Real Data Experiments

Tables 1 and 2 display the output of both algorithms for the Wal-Mart and CIMEG data for different values of the periodicity threshold. Clearly, the algorithms output fewer periods for higher periodicity threshold values, and the periods detected with respect to a certain value of the periodicity threshold are enclosed within those detected with respect to a lower value. To verify their accuracy, the algorithms should at least output the periods that are expected in the time series. From Fig. 11, Wal-Mart data has an expected period of 24 that corresponds to the daily pattern of the number of transactions per hour. CIMEG data has an expected period of 7 that corresponds to the weekly pattern of power consumption rates per day.

Fig. 11. Plot of Wal-Mart time series. (a) Zoomed out. (b) Zoomed in.

For the segment periodicity detection algorithm, Table 1 shows that, for Wal-Mart data, a period of 24 hours is detected when the periodicity threshold is 70 percent or less. In addition, the algorithm detects many more periods, some of which are quite interesting. A period of 168 hours (24 × 7) can be explained as the weekly pattern of the number of transactions per hour. This period value is not easily seen in Fig. 11, yet it is there (every seven cycles). A period of 3,961 hours shows a periodicity of exactly 5.5 months plus one hour, which can be explained as the daylight savings hour. One may argue against the clarity of this explanation, yet this shows that there may be obscure periods, unknown a priori, that the algorithm can detect. Similarly, for CIMEG data, the period of 7 days is detected when the periodicity threshold is 60 percent or less. Other clear periods are those that are multiples of 7. However, a period of 123 days is difficult to explain.

A similar behavior is shown in Table 2 for the symbol periodicity detection algorithm, with a higher number of output periods than that of segment periodicity detection. This is expected since symbol periodicity detects periods for individual symbols rather than for the entire time series.

TABLE 3. Symbol Periodicity Detection Output (Cont’d)

Further analysis of the output of the symbol periodicity detection algorithm is shown in Table 3. Exploring the period of 24 hours for Wal-Mart data and that of 7 days for CIMEG data produces the results given in Table 3. Note that a periodic single-symbol pattern is reported as a pair, consisting of a symbol and a starting position for a certain period.
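The (symbol, position) reporting can be illustrated with a short Python sketch. This is a simplified stand-in, not the symbol periodicity detection algorithm itself: the formal periodicity measure is defined in an earlier section and is not reproduced here, so the sketch simply assumes that a pair qualifies when the symbol occupies that position in at least the threshold fraction of complete periods. The function name `single_symbol_patterns` is hypothetical, and positions are counted from 0.

```python
def single_symbol_patterns(series, period, threshold):
    """Return sorted (symbol, position) pairs that recur with the given period.

    Assumption: a pair qualifies when `symbol` appears at offset `position`
    in at least `threshold` (a fraction in [0, 1]) of the complete periods.
    """
    cycles = len(series) // period
    counts = {}
    for cycle in range(cycles):
        for position in range(period):
            key = (series[cycle * period + position], position)
            counts[key] = counts.get(key, 0) + 1
    return sorted(pair for pair, count in counts.items() if count / cycles >= threshold)


if __name__ == "__main__":
    # Toy series over the alphabet {a, b, c, d} with an assumed period of 3.
    print(single_symbol_patterns("abcabdabcabc", 3, 0.75))
```

Lowering the threshold can only add pairs to the output, which mirrors the nesting of detected periods across threshold values noted for Tables 1 and 2.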
For example, (b, 7) for Wal-Mart data with respect to a periodicity threshold of 80 percent or less represents the periodic single-symbol pattern *******b****************. Knowing that the symbol b represents the low level for Wal-Mart data (less than 200 transactions per hour), this periodic pattern can be interpreted as follows: In 80 percent of the days, less than 200 transactions per hour occur in the eighth hour of the day (between 7:00 a.m. and 8:00 a.m.). As another example, (a, 3) for CIMEG data with respect to a periodicity threshold of 50 percent or less represents the periodic single-symbol pattern ***a***. Knowing that the symbol a represents the very low level for CIMEG data (less than 6,000 Watts/Day), this periodic pattern can be interpreted as follows: In 50 percent of the weeks, less than 6,000 Watts/Day are consumed in the fourth day of the week. Finally, Table 4 gives the final output of periodic patterns of Wal-Mart data for the period of 24 hours and a periodicity threshold of 35 percent. Each pattern can be interpreted in a similar way to the above.

Running the periodic trends algorithm of [15] on the real data gives worse behavior than that of the proposed segment periodicity detection. The periods known a priori (24 for Wal-Mart data and 7 for CIMEG data) are discovered only after very low periodicity threshold values (< 30 percent) are considered. The reason is that the periodic trends algorithm favors the longer periods and, so, the shorter periods appear late in the candidate order. For example, for Wal-Mart data, the periodic trends algorithm detects that a period of 2,808 hours (a multiple of 24) is the best candidate period; and, for CIMEG data, it detects that a period of 3,708 days (not a multiple of 7) is the best candidate period.

7 CONCLUSIONS

In this paper, we have defined two types of periodicities for time series databases.
Whereas symbol periodicity addresses the periodicity of the symbols in the time series, segment periodicity addresses the periodicity of the entire time series regarding its segments. We have proposed a scalable, computationally efficient algorithm for detecting each type of periodicity in $O(n \log n)$ time, for a time series of length $n$. An empirical study of the algorithms using real-world and synthetic data sets demonstrates the practicality of the problem, validates the accuracy of the algorithms, and validates the usefulness of their outputs. Moreover, segment periodicity detection takes less execution time whereas symbol periodicity detects more periods. We can conclude that, in practice, segment periodicity detection could be applied first and, if the results are not sufficient or not appealing, symbol periodicity detection can be applied afterwards. Finally, we have extended the proposed symbol periodicity detection algorithm to discover the obscure periodic patterns.

TABLE 4. Obscure Periodic Patterns for Wal-Mart Data

REFERENCES

[1] K. Abrahamson, “Generalized String Matching,” SIAM J. Computing, vol. 16, no. 6, pp. 1039-1051, 1987.
[2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int’l Conf. Very Large Data Bases, Sept. 1994.
[3] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 11th Int’l Conf. Data Eng., Mar. 1995.
[4] W. Aref, M. Elfeky, and A. Elmagarmid, “Incremental, Online, and Merge Mining of Partial Periodic Patterns in Time-Series Databases,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 3, pp. 332-342, 2004.
[5] M. Atallah, F. Chyzak, and P. Dumas, “A Randomized Algorithm for Approximate String Matching,” Algorithmica, vol. 29, no. 3, pp. 468-486, 2001.
[6] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential Pattern Mining Using A Bitmap Representation,” Proc. Eighth Int’l Conf. Knowledge Discovery and Data Mining, July 2002.
[7] C. Berberidis, W. Aref, M. Atallah, I. Vlahavas, and A. Elmagarmid, “Multiple and Partial Periodicity Mining in Time Series Databases,” Proc. 15th European Conf. Artificial Intelligence, July 2002.
[8] C. Bettini, X. Wang, S. Jajodia, and J. Lin, “Discovering Frequent Event Patterns with Multiple Granularities in Time Sequences,” IEEE Trans. Knowledge and Data Eng., vol. 10, no. 2, pp. 222-237, 1998.
[9] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms. Cambridge, Mass.: The MIT Press, 1990.
[10] C. Daw, C. Finney, and E. Tracy, “A Review of Symbolic Analysis of Experimental Data,” Rev. Scientific Instruments, vol. 74, no. 2, pp. 915-930, 2003.
[11] M. Elfeky, W. Aref, and A. Elmagarmid, “Using Convolution to Mine Obscure Periodic Patterns in One Pass,” Proc. Ninth Int’l Conf. Extending Data Base Technology, Mar. 2004.
[12] M. Garofalakis, R. Rastogi, and K. Shim, “SPIRIT: Sequential Pattern Mining with Regular Expression Constraints,” Proc. 25th Int’l Conf. Very Large Data Bases, Sept. 1999.
[13] J. Han, G. Dong, and Y. Yin, “Efficient Mining of Partial Periodic Patterns in Time Series Databases,” Proc. 15th Int’l Conf. Data Eng., Mar. 1999.
[14] J. Han, W. Gong, and Y. Yin, “Mining Segment-Wise Periodic Patterns in Time Related Databases,” Proc. Fourth Int’l Conf. Knowledge Discovery and Data Mining, Aug. 1998.
[15] P. Indyk, N. Koudas, and S. Muthukrishnan, “Identifying Representative Trends in Massive Time Series Data Sets Using Sketches,” Proc. 26th Int’l Conf. Very Large Data Bases, Sept. 2000.
[16] E. Keogh, S. Chu, D. Hart, and M. Pazzani, “Segmenting Time Series: A Survey and Novel Approach,” Data Mining in Time Series Databases, M. Last, A. Kandel, and H. Bunke, eds., World Scientific Publishing, June 2004.
[17] E. Keogh, S. Lonardi, and B. Chiu, “Finding Surprising Patterns in a Time Series Database in Linear Time and Space,” Proc. Eighth Int’l Conf. Knowledge Discovery and Data Mining, July 2002.
[18] D. Knuth, The Art of Computer Programming, vol. 2, second ed., series in computer science and information processing. Reading, Mass.: Addison-Wesley, 1981.
[19] S. Ma and J. Hellerstein, “Mining Partially Periodic Event Patterns with Unknown Periods,” Proc. 17th Int’l Conf. Data Eng., Apr. 2001.
[20] B. Ozden, S. Ramaswamy, and A. Silberschatz, “Cyclic Association Rules,” Proc. 14th Int’l Conf. Data Eng., Feb. 1998.
[21] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. Fifth Int’l Conf. Extending Data Base Technology, Mar. 1996.
[22] J. Vitter, “External Memory Algorithms and Data Structures: Dealing with Massive Data,” ACM Computing Surveys, vol. 33, no. 2, pp. 209-271, June 2001.
[23] A. Weigend and N. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past. Reading, Mass.: Addison-Wesley, 1994.
[24] J. Yang, W. Wang, and P. Yu, “Mining Asynchronous Periodic Patterns in Time Series Data,” Proc. Sixth Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2000.
[25] J. Yang, W. Wang, and P. Yu, “InfoMiner+: Mining Partial Periodic Patterns with Gap Penalties,” Proc. Second Int’l Conf. Data Mining, Dec. 2002.

Mohamed G. Elfeky received the BSc and MSc degrees in computer science from Alexandria University, Egypt, in 1996 and 1999, respectively. He received the PhD degree in computer science from Purdue University in 2005. His current research interests include data mining, data quality, and time series databases.

Walid G. Aref is an associate professor of computer science at Purdue. His research interests are in developing database technologies for emerging applications, e.g., spatial, multimedia, genomics, and sensor-based databases. He is also interested in indexing, data mining, scalable media servers, and geographic information systems (GIS).
Professor Aref’s research has been supported by the US National Science Foundation (NSF), the Purdue Research Foundation, CERIAS, Panasonic, and Microsoft Corp. In 2001, he received the CAREER Award from NSF, and in 2004, he was selected as a Purdue University Faculty Scholar. Professor Aref is a member of the ACM, a member of the IEEE Computer Society, and a senior member of the IEEE.

Ahmed K. Elmagarmid received the BS degree in computer science from the University of Dayton and the MS and PhD degrees from The Ohio State University in 1977, 1981, and 1985, respectively. He received a Presidential Young Investigator award from the National Science Foundation and distinguished alumni awards from the Ohio State University and the University of Dayton in 1988, 1993, and 1995, respectively. Professor Elmagarmid is the editor-in-chief of Distributed and Parallel Databases: An International Journal and of the book series on advances in database systems, and serves on the editorial boards of the IEEE Transactions on Knowledge and Data Engineering, Information Sciences, and the Journal of Communications Systems. He has served on the editorial boards of the IEEE Transactions on Computers and the IEEE Data Engineering Bulletin. He is on the steering committees for the IEEE International Conference on Data Engineering and the IEEE Symposium on Research Issues in Data Engineering, and has served on the organization committees of several international conferences. Professor Elmagarmid is the director of the Indiana Center for Database Systems (ICDS) and the newly formed Indiana Telemedicine Incubator. His research interests are in the areas of video databases, multidatabases, data quality, and their applications in telemedicine and digital government. He is the author of several books in databases and multimedia. He was chief scientist for Hewlett-Packard from 2001 to 2003 while on leave from Purdue.
He has served widely as an industry consultant and/or adviser to Telcordia, Harris, IBM, MCC, UniSql, MDL, BNR, etc. He served as a faculty member at the Pennsylvania State University from 1985-1988 and has been with the Department of Computer Science at Purdue University since 1988. He is a senior member of the IEEE and a member of the IEEE Computer Society.