Extracting Human-Readable Knowledge Rules in Complex Time-Evolving Environments
Pu Yang and David L. Roberts
Department of Computer Science, North Carolina State University, Raleigh, North Carolina, USA
Abstract: A production rule system is a reasoning system that uses rules for knowledge representation. Manual rule acquisition requires a great amount of human effort and time. In this paper, we present a data-driven technique for autonomously extracting human-readable rules from complex, time-evolving environments, making rule acquisition for production rule systems efficient. Complex, time-evolving environments are often highly dynamic and hard to predict. We represent these environments using sets of attributes and transform those attributes to the frequency domain, which enables analysis to extract important features. We extract human-readable knowledge rules from these features by applying rule-based classification techniques and translating the resulting decision rules back to the time domain. We present an evaluation of our methodology on three environments: hurricane data, a real-time strategy game, and a currency exchange. Experiments show the extracted rules are human-readable and achieve good prediction accuracy.
Keywords: Knowledge Acquisition, Production Rule, Human-readable, Time Series Prediction
1. Introduction
A production rule system is one type of knowledge representation and reasoning system: it provides reasoning capability through a set of knowledge rules. These rules are a representation found useful in expert systems and automated planning. There are two ways to acquire the rules: manually and autonomously. Manual rule acquisition requires expertise and is tedious and inefficient, especially in a complex environment; this is called the knowledge acquisition bottleneck [1]. Directly extracting time domain rules can be problematic. Mörchen [2] pointed out that a big problem when mining time series data is the high dimensionality. Because looking at each point in time in isolation ignores valuable temporal information, sliding windows are used; however, enlarging the window to capture more interesting patterns produces very high dimensional vectors, introducing the curse of dimensionality [3], which is a major challenge for all rule-based classifiers. Feature selection might appear to solve this problem, but selecting features in the time domain introduces inaccuracies: Deng and Runger [4] discussed how feature filtering and wrapper methods lead to low-quality feature subsets. We therefore avoid these problems by selecting features in the frequency domain.
We present a method for autonomously extracting human-readable knowledge rules from time-evolving environments that leverages techniques from signal processing and machine learning. To make the time series easier to learn from, we transform the attribute values from the time domain to the frequency domain using wavelet transformations. The time domain, which describes how a signal changes over time, is the traditional representation for time series analysis in AI and ML; the frequency domain instead describes how much of a signal lies within a given frequency range. The intuition behind the transformation is that it retains only the clean signal: Fugal [5] stated that the frequency transformation separates noise from the time series and keeps only the cleaner, more informative components. Our insight is that, using the frequency domain representation of these informative features, we can build rule-based classifiers to extract frequency domain knowledge rules more easily and accurately, and that those rules can be converted into easily human-interpretable knowledge rules. To make the rules easily interpretable, we convert the frequency domain knowledge rules back to time domain knowledge rules, which have a natural interpretation in terms of the original attributes. The intuition behind translating rules back to the time domain is that the frequency coefficients, which retain a significant amount of the information in the original time series, can be translated back in a manner that preserves the important information but affords a more easily understandable interpretation in terms of the original features.
To characterize the practicality and accuracy of our method, we tested it on three real-world time-series domains: the North Atlantic hurricane database, a corpus of action real-time strategy game replays, and financial exchange prices. These data sets were used both to train our model and to characterize its accuracy. A standard ML-style evaluation of the results verified that the extracted human-readable rules contained useful knowledge. Accordingly, our contributions are: (1) leveraging frequency domain features to create human-readable knowledge rules by translating the rules back to the time domain; and (2) producing knowledge rules automatically, thereby dramatically reducing the effort required to extract human-readable knowledge from data.
Fig. 1: Workflow: The environment is modeled as attribute time series that are transformed to the frequency domain. Rule-based classifiers that use features extracted from the frequency domain produce knowledge rules describing the important features of the time series. Finally, the knowledge rules are made human-readable by translating the output of the rule-based classifier back to the time domain.
2. Related Work
Manual knowledge acquisition is the process of transferring knowledge from a domain expert to an agent by encoding the acquired expertise in the agent's knowledge base. Sviokla [6] and Turban et al. [7] indicated that the yearly productivity of a manual knowledge engineer is limited to hundreds of rules. To improve productivity, Wagner [8], [9] described several end-user expert systems that do not rely on knowledge engineers, but (rightfully) acknowledged the difficulties in maintaining these systems. Wagner [1] also pioneered methods for breaking the knowledge acquisition bottleneck through the application of the principles of Bazaar-style, open-source development, where knowledge is acquired in an open, bottom-up manner with broad testing before final adoption of the knowledge rules.

Autonomous knowledge acquisition is when agents build a knowledge base by learning from data, generally with little or no human involvement. Autonomous techniques have focused on a number of areas, including developments in how knowledge rules are represented [10]. Kim et al. [11] used codified expressions for knowledge acquisition. Other techniques have focused on knowledge maintenance systems such as the Knowledge Base Management System (KBMS), a development environment for knowledge-based systems [12]. In a KBMS the expert does not need a knowledge engineer to encode rules because the KBMS automates much of the process of application building. Finally, there are techniques in autonomous knowledge acquisition that are process focused. For example, Dhar [13] extracted certain types of patterns from data or information stored in data repositories; the extracted knowledge revealed interesting distributions of outcomes that are not directly observable in the data. This type of work is most similar to our approach. Our algorithm autonomously identifies knowledge rules that are easily interpretable by humans.
3. Methodology

Our basic process (shown in Figure 1) is:
1) The environment is represented using representative attributes, the values of which evolve over time. The values may, or may not, evolve at regular intervals.
2) Samples of the time series are made uniform in length by either up- or down-sampling, and the values are normalized. Using signal processing techniques, the values are converted from the time domain to the frequency domain, which reduces each time series to a smaller, more manageable set of frequency domain attributes that capture important trends in the data.
3) A rule-based classifier is applied to the frequency domain attributes to output frequency domain rules.
4) The frequency domain rules are translated back into the time domain (the original variables and values), retaining the information in the frequency domain rules but affording easier interpretation by humans.
values and interpolate or smooth the values in between. When up-sampling a time series, we interpolate additional values between the extremal values. When down-sampling a time series, we uniformly eliminate values between local extremal values to decrease its length to the average. Once the time series are of uniform length, we normalize their values to account for differences in scale. We normalize the values to be between 0 and 1 using

n(x, S) = (x - min_S) / (max_S - min_S)    (1)

where x is the original value of time series S, max_S is the global maximum of the time series, and min_S is the global minimum. n(x, S) is then the normalized value of x. An example of a normalized time series is presented as the top-most series in Figure 2.

Fig. 2: A discrete wavelet transformation decomposition of a time series. The original time series is the top-most series. The low-frequency part of the series is a_6 (second from top). The high-frequency parts are d_6, d_5, d_4, d_3, d_2, and d_1 (third from top to bottom, respectively).
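As a concrete illustration, the following Python sketch performs this preprocessing step. The linear interpolation used for resampling is a simplified stand-in for the extremum-preserving up-/down-sampling described above, and the function names are ours, not the paper's:

import numpy as np

def resample_uniform(series, target_len):
    # Up- or down-sample a 1-D series to target_len points with linear
    # interpolation over a common [0, 1] index (a simplified stand-in for
    # the extremum-preserving scheme described in the text).
    series = np.asarray(series, dtype=float)
    old_idx = np.linspace(0.0, 1.0, num=len(series))
    new_idx = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_idx, old_idx, series)

def normalize(series):
    # Min-max normalization: n(x, S) = (x - min_S) / (max_S - min_S).
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    if hi == lo:
        return np.zeros_like(series)  # constant series: map everything to 0
    return (series - lo) / (hi - lo)

# Bring two series of different lengths to a common length of 256.
raw = [np.sin(np.linspace(0, 6, 310)), np.cos(np.linspace(0, 9, 198))]
uniform = [normalize(resample_uniform(s, 256)) for s in raw]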
The DWT decomposes a time series with respect to a mother wavelet function. Many mother wavelet functions exist; among them, Haar wavelets are good at capturing big jumps, or trends [5]. For example, in Figure 2, the top-most time series is an original time series. Starting from the original time series, at level 1 the DWT computes two sets of coefficients: approximation coefficients and detail coefficients. Approximation coefficients represent the low-frequency part of the time series, which captures the global trends. Detail coefficients represent the high-frequency part of the time series, which captures the local oscillations (noise). To further filter the noise, the DWT can be applied recursively to the low-frequency approximation. Each recursive application of the DWT halves the number of coefficients, creating successively coarser approximations. In Figure 2, from the bottom of the figure going up are the six successive high-frequency detail series, and second from the top is the approximation series. As can be seen in the figure, this approximation captures the global trends of the time series.
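A decomposition like the one in Figure 2 can be reproduced with the PyWavelets library; the short sketch below is our own example, not the paper's code:

import numpy as np
import pywt  # PyWavelets

# A normalized time series of length 256 (e.g., output of the step above).
signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)

# One level of the Haar DWT: approximation (low-frequency) and detail
# (high-frequency) coefficients.
cA1, cD1 = pywt.dwt(signal, 'haar')

# Recursive decomposition to level 6. wavedec returns
# [cA6, cD6, cD5, cD4, cD3, cD2, cD1], mirroring Figure 2.
coeffs = pywt.wavedec(signal, 'haar', level=6)
approximation = coeffs[0]   # 4 coefficients capturing the global trend
details = coeffs[1:]        # noise-like local oscillations at each level

print(len(approximation))   # -> 4 for a length-256 input at level 6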
Each rule's confidence and support are computed from the training instances that reach the corresponding node in the tree. The higher the confidence, the more accurate the rule; the higher the support, the more general the rule. The combination of these two parameters allows knowledge engineers to trade off the predictive power of the extracted rules against the number of rules extracted: with too few rules, important knowledge may be omitted; with too many, spurious relationships may be reported. Further, the complexity of the rules can be controlled by the DWT level parameter: the more applications of the DWT, the shallower the rules will be.
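To make this concrete, the sketch below walks a fitted scikit-learn decision tree (one possible rule-based classifier; the paper does not prescribe this particular implementation) and keeps only the leaf rules whose confidence and support pass the chosen thresholds:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_rules(clf, feature_names, min_conf=0.7, min_support=100):
    # Walk a fitted decision tree and keep only the leaf rules whose
    # confidence (majority-class fraction at the leaf) and support
    # (training samples reaching the leaf) pass the thresholds.
    tree = clf.tree_
    rules = []

    def recurse(node, conditions):
        if tree.children_left[node] == -1:  # leaf node
            counts = tree.value[node][0]
            support = int(tree.n_node_samples[node])
            conf = float(counts.max() / counts.sum())
            if conf >= min_conf and support >= min_support:
                label = clf.classes_[int(counts.argmax())]
                rules.append((" AND ".join(conditions), label, conf, support))
            return
        name = feature_names[tree.feature[node]]
        thr = tree.threshold[node]
        recurse(tree.children_left[node], conditions + [f"{name} <= {thr:.3f}"])
        recurse(tree.children_right[node], conditions + [f"{name} > {thr:.3f}"])

    recurse(0, [])
    return rules

# Usage (X holds rows of approximation coefficients, y holds class labels;
# hypothetical variable names):
# clf = DecisionTreeClassifier(min_samples_leaf=100).fit(X, y)
# for cond, label, conf, sup in extract_rules(clf, feature_names):
#     print(f"IF {cond} THEN {label}  (conf={conf:.2f}, support={sup})")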
4. Experiments
We tested our approach on data from three real-world domains: a corpus of data describing hurricanes in the Atlantic Ocean, a corpus of replay log files from professional gamers playing an action real-time strategy game, and historical data about currency exchange rates. Moreover, we performed an ML-style evaluation of the rules to validate that the conversion process that yields human-readable rules also yields rules that capture useful information. Validation of the extracted rules with subject matter experts is left for future work; however, the ML-style evaluation indicates the extracted knowledge has predictive power. Further, the reader should be able to judge whether or not the extracted rules are human-readable. For benchmarking, we also extracted rules directly from the time domain by using a sliding window to generate time series vectors, selecting relevant features, and building a decision tree to create rules. We tried three different sliding window lengths (3, 4, and 5) and five different feature selection approaches (information gain, chi-square, hill climbing, correlation feature selection (CFS), and random forest with recursive feature elimination (RF-RFE)), giving 15 benchmark configurations.
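A minimal sketch of one such benchmark configuration follows; the window length and selector are chosen arbitrarily here, and mutual information stands in for information gain:

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def sliding_windows(series, window=4):
    # Turn one time series into overlapping fixed-length vectors.
    s = np.asarray(series, dtype=float)
    return np.array([s[i:i + window] for i in range(len(s) - window + 1)])

# X_win: stacked window vectors from all labeled series; y_win: their labels
# (hypothetical variable names). Each of the 15 benchmarks pairs one window
# length with one feature selector in front of a decision tree.
# benchmark = make_pipeline(SelectKBest(mutual_info_classif, k=3),
#                           DecisionTreeClassifier())
# benchmark.fit(X_win, y_win)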
We define

v_=(attr_i)[E] = (1 / |G_=|) Σ_{g_j in G_=} AVG(g_j(attr_i)[E])    (2)

to be the time series value of attribute attr_i in segment E for the observational instances in G_=. Here, AVG(g_j(attr_i)[E]) is the average of the original values of attribute attr_i in segment E of the original time series for observational instance g_j in G_=. We compute v_<>(attr_i)[E] similarly over the remaining instances G_<>. If v_=(attr_i)[E] < v_<>(attr_i)[E], then the corresponding time domain rule is v(attr_i)[E] > v_=(attr_i)[E]; otherwise, it is v(attr_i)[E] < v_=(attr_i)[E].
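A small Python sketch of this translation step is given below; the container format for instances and the function names are our own assumptions, not the paper's:

import numpy as np

def segment_mean(series, seg_index, n_segments):
    # AVG(g_j(attr_i)[E]): mean of the original time domain values of one
    # attribute inside segment E of an instance's lifetime.
    parts = np.array_split(np.asarray(series, dtype=float), n_segments)
    return parts[seg_index].mean()

def translate_rule(instances_eq, instances_neq, attr, seg, n_segments=4):
    # instances_eq / instances_neq: lists of dicts mapping attribute name to
    # its original time series, split into G_= (instances matching the rule's
    # class label) and G_<> (the rest). Hypothetical container format.
    v_eq = np.mean([segment_mean(g[attr], seg, n_segments) for g in instances_eq])
    v_neq = np.mean([segment_mean(g[attr], seg, n_segments) for g in instances_neq])
    # Direction chosen exactly as described in the text above.
    direction = ">" if v_eq < v_neq else "<"
    return f"{attr}[segment {seg + 1}] {direction} {v_eq:.3f}"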
To extract the rules, we set the confidence threshold above 70% and the support above 100; there were 1,176 instances in the dataset. The most significant extracted rules are presented in Table 1. Interestingly, of the five attributes (latitude, longitude, wind speed in knots, direction of movement in degrees, and speed), only latitude and longitude appear in our knowledge rules as predictive of whether a hurricane hits the east coast of the US. From the perspective of time, the 2nd and 3rd one-thirds of a hurricane's lifetime are critical to whether or not the hurricane makes landfall. Due to space constraints, we are unable to discuss all of the rules in Table 1. Map illustrations of Rule2 and Rule4 are shown in Figure 3. In Rule2 and Rule4, latitude 26.5N is the critical value distinguishing landfall from no landfall. When the longitude of a hurricane in the 2nd one-third of its lifetime is to the west of 76.3W, the hurricane may curve north or south. When it curves north above 26.5N during the 3rd one-third of its lifetime, it is likely to hit the east coast of the US, with probability 0.862. When it curves south below 26.5N during the 3rd one-third of its lifetime, it is unlikely to make landfall, with probability 0.944.

We performed 10-fold cross-validation to validate our rules for this binary classification problem. A correctly predicted landfall is a true positive (TP), and a correctly predicted non-landfall is a true negative (TN). The overall accuracy is 0.87, the sensitivity is 0.85, the specificity is 0.89, and the AUC is 0.87. The average rule length is 2 conditions. With the same confidence and support thresholds, the best of the 15 benchmarks uses a sliding window of length 4 and random forest with recursive feature elimination, and has accuracy 0.74, sensitivity 0.77, specificity 0.78, and AUC 0.88, with an average rule length of 10 conditions. Rules with fewer conditions should be easier for humans to understand.

Fig. 3: Map illustration of Rule2 and Rule4 in Table 1. Horizontal lines are latitude; vertical lines are longitude.
The average game length was 256 events, so we up- or down-sampled the difference-team-capability time series to be 256 samples long using the procedure described above. We recursively applied the DWT six times to obtain four approximation coefficients, representing the 1st, 2nd, 3rd, and 4th one-fourths of a game's lifetime; each of the four stages corresponds to approximately 12 minutes of gameplay. We then used 20 approximation coefficients (five attribute time series represented by four approximation coefficients each), labeled with the winning team, as input to the decision tree.
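The feature construction for this domain might look like the sketch below; the specific attribute names and variable names are our illustrative assumptions, not taken from the corpus:

import numpy as np
import pywt
from sklearn.tree import DecisionTreeClassifier

def game_features(attribute_series, length=256, level=6):
    # Build one 20-dimensional feature vector per game: for each of the five
    # difference-team-capability series, keep the four level-6 Haar
    # approximation coefficients (256 / 2**6 = 4), which roughly cover the
    # four quarters of the game's lifetime.
    feats = []
    for series in attribute_series:
        s = np.asarray(series, dtype=float)
        s = np.interp(np.linspace(0, 1, length),
                      np.linspace(0, 1, len(s)), s)            # resample to 256
        feats.extend(pywt.wavedec(s, 'haar', level=level)[0])  # 4 approx. coeffs
    return np.array(feats)  # 5 attributes x 4 coefficients = 20 features

# games: per-game lists of the five difference series; winners: team labels
# (hypothetical names).
# X = np.vstack([game_features(g) for g in games])
# clf = DecisionTreeClassifier(min_samples_leaf=200).fit(X, winners)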
We used three confidence thresholds: 70%, 80%, and 90%. For each, we required a support of 200. At the 70% confidence level we identified eight rules, at the 80% level seven rules, and at the 90% level five rules. Due to space constraints, we present only the two rules with the highest confidence values. Here, dattr[seg] denotes the difference in capability attr between Team1 and Team2 in the seg one-fourth of the game lifetime, and agi, dam, and str denote agility, damage, and strength. Rule1 (confidence 96%): IF ddam[2nd] > 59.7 THEN Team2 Lose. Rule2 (confidence 97%): IF ddam[2nd] > 59.7, dagi[3rd] < 18.1, and dstr[3rd] < 27.7 THEN Team2 Win. According to Rule1, if Team2 has a damage capability deficit greater than 59.7 during the 2nd one-fourth of the game lifetime, then Team2 will lose with 96% probability. However, if Team2 can achieve an agility advantage greater than 18.1 and a strength advantage greater than 27.7 during the 3rd one-fourth of the game lifetime, then Team2 can win with 97% probability according to Rule2. Comparing Rule1 and Rule2, we conclude that an advantage in agility and strength in the 3rd one-fourth of the game lifetime can compensate for a disadvantage in damage in the 2nd one-fourth.

Table 2 contains a high-level overview of the rules at each of the confidence levels; the entries indicate the percentage of rules in which each attribute or game stage appears. Of particular interest is the importance of the damage attribute and the 2nd one-fourth of the game lifetime. At all confidence levels, every rule tested the value of the damage attribute during the 2nd one-fourth of the game lifetime, indicating this value has the largest influence on the outcome of the game. Also notable is the lack of rules pertaining to the gold attribute. This indicates that while gold is useful for increasing capabilities, it must be used wisely or it won't help a team win. Similarly, while all rules contained statements about the 2nd one-fourth of the game lifetime and the majority contained statements about the 1st one-fourth, fewer contained statements about the 3rd one-fourth and none about the 4th one-fourth. This indicates that advantages are gained before the end of the game.

Table 3: Evaluation metrics for the game data: classification accuracy (CA), sensitivity (Sens), specificity (Spec), and area under the ROC curve (AUC).
Confidence   CA       Sens     Spec     AUC
70%          0.7860   0.8227   0.7485   0.7856
80%          0.8213   0.7981   0.8449   0.8303
90%          0.8817   0.8902   0.8730   0.8853
We performed three-fold cross-validation to validate the accuracy of our model; the results are presented in Table 3. Because Defense of the Ancients is an adversarial game, this is a binary classification problem: one team wins or loses. If the team wins it is a true positive (TP); if the other team wins it is a true negative (TN). We used four metrics. Table 3 shows all values are above 0.74 for all confidence thresholds and partitions, and the average accuracy is 0.83. The average rule length is 3 conditions. The best of the 15 benchmarks uses a sliding window of length 5 and correlation feature selection, which achieves accuracy 0.712 with an average rule length of 15 conditions.
than 0.0879, and the average price over the 24th to the 30th of the month is not greater than 0.073, then the first-day price of the next month will be higher than the last-day price of the current month, with confidence 95%.

Table 4: Accuracy summary of 10-fold cross-validation for each currency pair. H-x is the Haar mother wavelet at DWT level x. MAs (moving average series) include autoregressive integrated moving average, simple moving average, exponential moving average, weighted moving average, double-exponential moving average, and zero-lag exponential moving average; the value in the MAs column is the highest prediction accuracy obtained across all of the MA techniques. Guess is a random guess. BestBM is the best of the 15 benchmarks. Bold-face entries indicate the highest prediction accuracy.
USD to    EUR    CAD    CNY    GBP    XAG    XAU    Average
H-1       58%    58%    52%    53%    55%    66%    57%
H-2       58%    57%    59%    53%    91%    49%    61%
H-3       58%    57%    62%    53%    68%    49%    58%
MAs       44%    47%    41%    49%    40%    44%    44%
Guess     47%    45%    33%    34%    36%    40%    39%
BestBM    50%    57%    53%    48%    58%    48%    52%
Table 4 shows the 10-fold cross-validation accuracy for each currency pair using the three sets of rules our method identified (one per DWT level). For the baseline comparison, we include the various MA techniques and a random guess. This is a ternary classification problem: increase, equal, or decrease. The average accuracy of our method over the six currency pairs and three DWT levels is 58.7%; the average accuracy of the traditional methods (the various MAs) over the six currency pairs is 40.1%; the average accuracy of random guessing is 39%; and the average accuracy of the best of the 15 benchmarks is 52%. The entries in the table corresponding to the highest prediction accuracy on each data set are shown in bold face; in all cases, those entries were for one of our knowledge rule sets. Perhaps more importantly, even our worst-performing set of knowledge rules, in each case, achieved a classification accuracy higher than the best-performing MA method.

or ontology to filter high-quality rules. Second, to cooperate with domain experts to further investigate the semantics of the rules. Third, our method has three free parameters: the DWT level, the confidence threshold, and the support threshold. In the future, we hope to develop a better understanding of how to set these free parameters. One approach is to use an optimization algorithm, such as a genetic algorithm [20] or randomized hill climbing [21], to determine the best values for these parameters. This way, we can ensure that there is solid reasoning behind picking a specific threshold value or number of DWT coefficients.

References

[1] C. Wagner, "Breaking the knowledge acquisition bottleneck through conversational knowledge management," Information Resources Management Journal (IRMJ), 2006.
[2] F. Mörchen, "Time series feature extraction for data mining using DWT and DFT," 2003.
[3] R. Bellman, Dynamic Programming. Dover Publications, Mar. 2003.
[4] H. Deng and G. C. Runger, "Feature selection via regularized trees," in The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, June 10-15, 2012, pp. 1-8.
[5] D. L. Fugal, Conceptual Wavelets in Digital Signal Processing. Space and Signals Technical Publishing, 2009.
[6] J. J. Sviokla, "An examination of the impact of expert systems on the firm: The case of XCON," MIS Quarterly, vol. 14, no. 2, pp. 127-140, 1990.
[7] E. Turban, J. E. Aronson, and T.-P. Liang, Decision Support Systems and Intelligent Systems. Prentice Hall, 2005.
[8] C. Wagner, "End-users as expert system developers," Journal of End User Computing, 2000.
[9] C. Wagner, "Knowledge management through end user developed expert systems: potential and limitations," in Advanced Topics in End User Computing, Idea Group Publishing, 2003.
[10] S. Cauvin, "Dynamic application of action plans in the Alexip knowledge-based system," Control Engineering Practice, vol. 4, no. 1, pp. 99-104, 1996.
[11] G. Kim, D. Nute, H. Rauscher, and D. L. Loftis, "AppBuilder for DSSTools: an application development environment for developing decision support systems in Prolog," Computers and Electronics in Agriculture, vol. 27, pp. 107-125, 2000.
[12] R. C. Hicks, "Knowledge base management systems - tools for creating verified intelligent systems," Knowledge-Based Systems, vol. 16, no. 3, pp. 165-171, 2003.
[13] V. Dhar, "Data mining in finance: using counterfactuals to generate knowledge from organizational information systems," Information Systems, vol. 23, no. 7, 1998.
[14] Z. Wang, "Fast algorithms for the discrete W transform and for the discrete Fourier transform," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 4, pp. 803-816, 1984.
[15] R. W. Ramirez, The FFT: Fundamentals and Concepts. Prentice Hall PTR, Sept. 1984.
[16] W. W. Cohen, "Fast effective rule induction," in Machine Learning: Proceedings of the Twelfth International Conference (ML95), 1995.
[17] P. Clark and R. Boswell, "Rule induction with CN2: Some recent improvements," in Machine Learning - EWSL-91, 1991.
[18] P. Domingos, "The RISE system: Conquering without separating," in Tools with Artificial Intelligence, 1994.
[19] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[20] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, MA, USA: MIT Press, 1998.
[21] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Pearson Education, 2003.