
Information Sciences 178 (2008) 3840–3859


Local anomaly detection for mobile network monitoring


Pekka Kumpulainen a,*, Kimmo Hätönen b,1

a Tampere University of Technology, Automation Science and Engineering, Box 692, 33101 Tampere, Finland
b Nokia Siemens Networks, Research, Technology and Platforms, Box 6, 02022 Nokia Siemens Networks, Finland

Article history: Received 2 November 2007; received in revised form 25 April 2008; accepted 29 May 2008

Keywords: Local anomaly detection; Outlier; Mobile networks; System log; Self-organizing map; Adaptive thresholds

Abstract

Huge amounts of operation data are constantly collected from various parts of communication networks. These data include measurements from the radio connections and system logs from servers. System operators and developers need robust, easy to use decision support tools based on these data. One of their key applications is to detect anomalous phenomena of the network. In this paper we present an anomaly detection method that describes the normal states of the system with a self-organizing map (SOM) identified from the data. Large deviation of the data samples from the SOM nodes is detected as anomalous behavior. Large deviation has traditionally been detected using global thresholds. If variation of the data occurs in separate parts of the data space, global thresholds either fail to reveal anomalies or reveal false anomalies. Instead of one global threshold, we can use local thresholds, which depend on the local variation of the data. We also present a method to find an adaptive threshold using the distribution of the deviations. Our anomaly detection method can be used both in exploration of history data and in comparison of unforeseen data against a data model derived from history data. It is applicable to a wide range of processes that produce multivariate data. In this paper we present examples of this method applied to server log data and radio interface data from mobile networks.

© 2008 Elsevier Inc. All rights reserved.

* Corresponding author. Tel.: +358 40 8490930; fax: +358 3 31152171. E-mail addresses: pekka.kumpulainen@tut.fi (P. Kumpulainen), kimmo.hatonen@nsn.com (K. Hätönen).
1 Senior specialist: security research.

doi:10.1016/j.ins.2008.05.038

1. Introduction

Detection of anomalies or outliers is an important task in data analysis. Locating rare or exceptional parts of the data can reveal valuable new information about the system. As Kruskal wrote in 1960 [18]: "An apparently wild (or otherwise anomalous) observation is a signal that says: 'Here is something from which we may learn a lesson, perhaps of a kind not anticipated beforehand, and perhaps more important than the main object of the study'."
Mobile telecommunication networks are complex distributed systems. The amount of data traffic and the number of services are ever-growing. Network monitoring and management tools collect and use large amounts of data provided by network elements. These data include, for example, performance measurements from the radio interface and log data from application servers. A data set collected from an operational GSM network during one day consists of several gigabytes of data. It is impossible for network operators to analyze all the data manually, especially in a multivariate space, where all the dimensions of the data cannot be visualized simultaneously. Therefore, automated multivariate methods are needed to analyze the data sets.
Fault and intrusion detection and network optimization are examples of continuously ongoing network monitoring tasks. An operational GSM network can consist of thousands of base stations, hundreds of base station controllers and dozens of management system servers. Monitoring tasks can be divided into several subtasks, each focusing on a geographical part of the network or on a group of services. The task execution is based on a selected set of data, which can contain millions of events or several hundreds of time series, each representing the history of one performance indicator of a single network element or process. To find malfunctions or sub-optimally behaving network elements or processes, operators use tools that analyze these time series and typically search for data samples whose values exceed pre-defined thresholds or combinations of thresholds [21]. There can be several hundreds of thresholds stored and maintained in the analysis tools.
Universally applicable automated analysis tools are practically impossible to create due to the intricacy of the systems and the diverse information requirements of the users. One suitable task for automatic data mining tools is to model the normal or most common behavior of the system and to detect unusual situations. The data collected from various parts of a mobile network usually include samples from normal states as well as from abnormal situations. With such data we have to assume that the vast majority of the data comes from normal functionality and that the rare states represent some sort of anomalies or outliers. Events outside the common model can thus be regarded as exceptional, providing new information about the behavior of the system under study.
Anomalies or outliers in data are often signs of malfunction or otherwise undesired performance and, thus, they should be
detected as soon as possible. In multivariate data there is a vast variety of sources causing anomalous behavior. Therefore a
human expert is usually required to further analyze these situations. The purpose of the analysis tools is to filter out the
majority of the data and to present to the user only a limited amount of data, containing the most useful new information about the system.
A general definition of an outlier was given by Hawkins [9]: "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism". This definition is very broad, but it gives no guidelines on how to determine whether an observation is an outlier or not.
Various statistical methods have been used in outlier detection [1], and for online monitoring purposes there are specific tests in statistical process control (SPC) [8]. These are mostly univariate methods and rely on knowledge of the underlying distribution. Multivariate SPC methods [7] also assume multinormal distributions. However, these methods are not well applicable in telecommunication network management, since neither radio performance measurements nor log activity counters usually follow any known distribution. Some traffic related features have heavy tail distributions and are closely related to Ethernet traffic [38], which is self-similar by nature [22]. Poisson models, for example, do not fit the network traffic [28], and more complicated models are required, such as mixtures of exponentials [6]. The variables used in network management are aggregated from several counters. The original counters and the formulas used to calculate the KPIs (Key Performance Indicators) are often company confidential. We encounter a variety of distributions both in server log activity and in radio interface performance measurements. Examples of distributions are depicted in Fig. 1. From left to right, there are two histograms presenting aggregated log activity variables, and the last one is an example of a radio interface performance KPI. These examples include skewed, heavy-tailed and multimodal distributions, and the real data contain a variety of other types of distributions. In practice it is impossible to use any single distribution model, and a mixture of symmetric distributions like a GMM (Gaussian Mixture Model) does not fit these data very well. Therefore we prefer a method which does not need any assumptions about the underlying distribution.
ICA (Independent Component Analysis) has been applied to multivariate SPC in order to perform better with non-normally distributed variables [15]. Knorr et al. present the notion of distance based outliers [16]. This requires no assumptions about the distribution, but is based on the distances between the data samples. Principal component classifiers have also been used in anomaly detection [32].
All the methods mentioned above are global in the sense that they treat the whole data set as one group. If the data are clustered, they may fail. The definition of local outliers by Breunig et al. [3] takes the clustering structure of the data into account.

Fig. 1. Histograms of three types of variables. The ones on the left and in the middle are log activity counters and the one on the right is an example of a performance KPI from a radio interface.

Their method detects outliers in the local neighborhood. Recently, an enhanced Support Vector Machine was proposed by Shon and Moon [31] for network anomaly detection, and the combination of SVD and multiscale transforms was found effective for anomaly detection in self-similar network data [30].
Unsupervised anomaly detection methods have been widely used in network intrusion detection; see for example [40,23]. Multivariate statistical methods have also been used [39,14].
Höglund et al. [12] present an anomaly detection method based on the quantization errors of a self-organizing map (SOM) [17]. In this method the SOM is used to approximate the distribution and the cluster structure of the data. Thanks to the efficiency of the SOM algorithm, the method can be used also with large data sets. It works well in most cases. However, it has shortcomings if the data are not homogeneous with constant variance. The global threshold is likely to be too low in those parts of the data space where the natural variation is large, thus producing false alarms. On the other hand, where the natural variance is low, the global threshold tends to be too high, leaving locally anomalous samples undetected.
We have developed an improvement to that method. The motivation for this method comes from the ability of the SOM to approximate the data regardless of the distributions of the variables in use. In our case there is also an application point of view: the SOM engine is readily implemented and available in the application, and our improvement adds only minimal requirements to the implementation. Our improvement emphasizes the local characteristics of the data. We cluster the SOM nodes into groups and calculate thresholds for each group separately. This method is more sensitive locally where the data have less variance. At the same time, the number of false positives can be reduced in those parts of the data space where there is larger variance. In addition to detecting individual outlier samples, this method can reveal new states that are relatively rare but contain too many samples to be detected as outliers. Such cases can result from new, previously unseen behavior of the process that is worth learning about, or they can be caused by a sustained error, and thus require attention. We also introduce a method to adapt the anomaly thresholds to the distribution of the variable used for anomaly detection.
In this paper we present the original algorithm and the improved one. The main features of the methods are illustrated with a synthetic example. Finally, we apply the method to genuine data collected from mobile networks. In Section 3, we use system log data. First, in Section 3.1, we introduce a scaling method for the log data. In Section 3.2, history data from a system log are analyzed to find outliers and groups of potentially anomalous samples in the data. In Section 3.3, the identified anomaly detection model is applied to new log data to find out whether they are consistent with the previous behavior of the system. In Section 3.4, the original and improved methods are compared on this data set. The comparison is then repeated in Section 3.5 using another data set from system server logs, and the performance of the methods is discussed. In Section 4, the improved method is applied to radio interface performance data, and adaptive thresholds for anomaly detection are discussed. In Section 5, we demonstrate the performance of two more traditional methods: a GMM (Gaussian Mixture Model) and a clustering based anomaly detection method. Typical properties of these methods are first discussed using the same synthetic example that is used in Section 2 for demonstrating the SOM based methods. The GMM and clustering methods are then applied to the network system log data used in Sections 3.2, 3.3 and 3.4. The results show that these methods tend to lack the sensitivity to detect the new states in new data.

2. Method description

Our goal is to improve the anomaly detection method by Höglund et al. [12]. Our improved method detects local dis-
tance based outliers. In this section we first describe the original method and after that, we introduce the improved
version.

2.1. The original method

A SOM consists of nodes, sometimes called neurons. Each node contains a code vector, describing its position in the data space, and a position in the lower dimensional grid [17]. Most often a two-dimensional grid is used.
The original method is based on the quantization errors of a SOM. Reference data are used to train a one-dimensional SOM, which is more flexible than a two-dimensional one. A best matching unit (BMU) is searched for each data sample. The BMU is the SOM node that has the minimum distance to the data sample. This distance is also called the quantization error. The quantization errors are then compared to a threshold, which is a predefined quantile of all the quantization errors in the training data. Samples exceeding the threshold are labeled as anomalous, to be examined further by human experts. The basic idea of the algorithm is listed step by step below (a code sketch is given after the list).

1. Fit a SOM to a reference data set of $n$ samples; drop out nodes with no hits.
2. Calculate the distances $D_1, \ldots, D_n$ to the BMUs (quantization errors) for the reference data. A predefined percentile of these distances is used as the anomaly threshold.
3. For a sample to test, calculate the distance $D_{n+1}$ to its BMU; the sample is considered an anomaly if the distance exceeds the threshold.
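As an illustration, steps 2 and 3 might be implemented as follows (a minimal sketch of our own, not the authors' implementation; it assumes the SOM of step 1 has already been trained with an external library and is given as the array `code_vectors`):

import numpy as np

def quantization_errors(data, code_vectors):
    """Distance from each sample to its best matching unit (BMU)."""
    # Pairwise Euclidean distances, shape (n_samples, n_nodes)
    dists = np.linalg.norm(data[:, None, :] - code_vectors[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)             # index of the BMU of each sample
    qe = dists[np.arange(len(data)), bmu]  # quantization error = distance to BMU
    return qe, bmu

def fit_global_threshold(ref_data, code_vectors, percentile=95):
    """Step 2: global anomaly threshold from the reference data."""
    qe, _ = quantization_errors(ref_data, code_vectors)
    return np.percentile(qe, percentile)

def detect_global(new_data, code_vectors, threshold):
    """Step 3: flag samples whose quantization error exceeds the threshold."""
    qe, _ = quantization_errors(new_data, code_vectors)
    return qe > threshold                  # boolean mask of anomalies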

An example was generated to visualize the method. The left part of Fig. 2 illustrates a two-dimensional example case. The blue dots represent the data samples. The data set consists of three groups. Each group contains Gaussian random samples with similar variances within a group and means close to each other. Group 1 has a small variance, group 2 a medium variance and group 3 the largest variance. The small red circles, connected by lines, are the SOM nodes after training. The larger black circles represent the anomaly threshold, which was selected to be the 95th percentile of the quantization errors. Any data points outside these circles are assumed to be anomalies, marked with pentagrams.

Fig. 2. Scatter plot of the synthetic example data. Global anomaly thresholds are on the left and local thresholds on the right.
Fig. 2 demonstrates the ability of the SOM to approximate the data space regardless of the distribution or the clustering structure. Especially the one-dimensional line, instead of the more often used two-dimensional grid, is very flexible and useful for this purpose. The original method uses one global threshold for anomaly detection, which makes the anomaly threshold circles equal in size all over the data space. Therefore, in this example, most of the detected anomalies are located in group 3, which has the largest variance. Only one sample close to group 2 is considered an anomaly.

2.2. The improved method

In order to further emphasize the local structure of the data, we introduce local thresholds. These also take the local variance of the data into account. The improved method for local anomaly detection was presented in [19]. In this paper we extend the method by introducing adaptive thresholds, which utilize the distribution of the quantization errors.
In the local anomaly detection method we form groups of SOM nodes and calculate a local anomaly threshold for each group. The groups are created by clustering the SOM code vectors. The algorithm is as follows (a code sketch is given after the list):

1. Train a SOM and drop out nodes that have less than a specified number of hits.
2. Cluster the code vectors; these clusters are called reference groups later on.
3. Set the anomaly threshold to a predefined percentile of the distances from the BMUs (quantization errors) in each reference group.
4. For a sample to test, calculate the distance $D_{n+1}$ to its BMU; the sample is considered an anomaly if the distance exceeds the local threshold of the BMU's reference group.
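A corresponding sketch of the local variant, reusing `quantization_errors` from the previous listing (SciPy's hierarchical clustering with Ward linkage here mirrors the clustering choice made in Section 3.2; the function names and the assumption that every reference group receives reference samples are ours):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def fit_local_thresholds(ref_data, code_vectors, n_groups=7, percentile=95):
    """Steps 2 and 3: cluster the code vectors and set one threshold per group."""
    groups = fcluster(linkage(code_vectors, method='ward'),
                      n_groups, criterion='maxclust')   # reference group per node
    qe, bmu = quantization_errors(ref_data, code_vectors)
    # Assumes every group is hit by at least one reference sample
    thresholds = {g: np.percentile(qe[groups[bmu] == g], percentile)
                  for g in np.unique(groups)}
    return groups, thresholds

def detect_local(new_data, code_vectors, groups, thresholds):
    """Step 4: compare each sample to the threshold of its BMU's group."""
    qe, bmu = quantization_errors(new_data, code_vectors)
    return np.array([q > thresholds[g] for q, g in zip(qe, groups[bmu])])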

This method gives higher thresholds in those parts of the space where the data samples deviate more from their BMUs, and lower thresholds where the data samples are closer to the BMUs and have less variance.
When new data are available, the distance from the BMU is calculated for each data sample. The distance is compared to the threshold of the reference group that the BMU belongs to. If the threshold is exceeded, the sample is labeled as an anomaly. The results of this method on the previous two-dimensional example are presented in the right part of Fig. 2. The anomaly threshold is calculated separately for each group. The threshold is larger where the data have more variation and smaller in the parts of the data space where the clusters are denser.
Now the sizes of the threshold circles vary according to the variance of the data. There are fewer anomalies detected in
group 3 than with the original method. Thus the risk of false positive detections is reduced in those parts of the data space
where the natural variance of the data is higher. At the same time, there are more anomalies detected in groups 1 and 2. The
sensitivity of the method is locally adapted to the variation in the data.
The histograms of the quantization errors in the different groups are depicted in Fig. 3. The 95% thresholds for anomaly detection in each group are marked with a solid vertical line. As can be seen, the threshold varies from 0.57 in group 1 to 1.33 in group 3. The global threshold is 1.07 and is marked on each histogram with a dotted line.

Fig. 3. Histograms of the quantization errors including the global and local thresholds (group 1: 47 samples, threshold 0.57; group 2: 98 samples, threshold 0.81; group 3: 108 samples, threshold 1.33).
Using a 95% threshold will always label 5% of the data in each reference group as anomalous. With larger data sets this is an unreasonably large number. A trivial solution is to reduce the percentage or to limit the maximum number of anomalies. These solutions do not utilize the distribution of the quantization errors. Thus they can result in a situation where the threshold falls between samples that have almost the same quantization error. We introduce a simple adaptive threshold selection method which utilizes the distribution of the quantization errors.
To calculate the adaptive threshold, we first select the maximum allowed percentage of anomalies, typically 5%. The largest 5% of the quantization errors $QE_i$ are sorted. Each quantization error $QE_i$ is associated with the difference to the next larger error value:
$$\Delta QE_i = QE_{i+1} - QE_i.$$
Starting from the lowest errors, the first difference exceeding a predefined limit, $d_{LIM}$, is searched for. Thus we find the minimum $QE_i$ such that $\Delta QE_i > d_{LIM}$. The anomaly threshold is set between $QE_i$ and $QE_{i+1}$. This method ensures that the last non-anomalous and the first anomalous samples will be at least $d_{LIM}$ apart from each other. If none of the differences exceeds the limit, then the maximum difference is selected. In that case the distance between anomalous and normal samples will be the largest possible.
The method is equivalent to finding the first zero valued gap in the last 5% tail of a histogram with an equal bin width of $d_{LIM}$. This is an intuitive way for a human to select a threshold from a histogram.
The limit $d_{LIM}$ is a parameter to choose. We have calculated the limit using the range of the quantization errors, $R = \max(QE) - \min(QE)$, and the square root of the number of samples $N$; the square root of $N$ has traditionally been used in histogram based distribution estimates:
$$d_{LIM} = \frac{R}{k\sqrt{N}},$$
where $k$ is a constant to select.
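Our reading of this procedure as a sketch (the midpoint between $QE_i$ and $QE_{i+1}$ is one possible choice for a threshold "between" the two values, and the guard for a degenerate tail is our addition):

import numpy as np

def adaptive_threshold(qe, max_fraction=0.05, k=3):
    """Adaptive anomaly threshold from the distribution of quantization errors."""
    qe = np.sort(np.asarray(qe))
    d_lim = (qe[-1] - qe[0]) / (k * np.sqrt(len(qe)))       # d_LIM = R / (k * sqrt(N))
    tail = qe[int(np.ceil((1 - max_fraction) * len(qe))):]  # largest 5% of the errors
    if tail.size < 2:                # degenerate case: tail too small to hold a gap
        return qe[-1]
    gaps = np.diff(tail)             # differences between successive sorted errors
    big = np.nonzero(gaps > d_lim)[0]
    i = big[0] if big.size else gaps.argmax()  # first gap > d_LIM, else the largest gap
    return (tail[i] + tail[i + 1]) / 2         # threshold between QE_i and QE_(i+1)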
Examples of the adaptive threshold are given in the last experiment using the radio interface data.

2.2.1. Selection of the parameters


This method requires some parameters to be set. First, the number of nodes in the SOM has to be selected. There are no universal guidelines on what a good number is for a specific purpose. According to our experience, when the SOM is used as an intermediate tool in anomaly detection, the number of nodes is not very crucial as long as it is big enough to approximate the data space. In the examples presented in this paper we have used $N/5$ nodes for the smaller data sets and $3\sqrt{N}$ for the larger data set, in order to keep the number computationally reasonable. If the number of nodes is very large, we end up overfitting the data: the SOM will adapt too well to the reference data set and its generalization ability will suffer. This is a feature common to all parameterized methods, including neural networks [4] as well as very basic statistical models such as linear regression using polynomials of too high an order [2]. Increasing the number of parameters in a model will yield a better fit to the data at hand, but the fit to any new data set will be worse. In all examples we have initialized the SOM according to the first principal component of the data. This is the best linear approximation of the data and gives unambiguous results, unlike random initialization.
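A sketch of such a principal-component initialization (our illustration; SOM toolboxes typically provide linear initialization built in):

import numpy as np

def pca_init_1d_som(data, n_nodes):
    """Initialize the code vectors of a 1-D SOM along the first principal component."""
    mean = data.mean(axis=0)
    centered = data - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]                                  # first principal component direction
    proj = centered @ pc1                        # projections of the data onto it
    positions = np.linspace(proj.min(), proj.max(), n_nodes)
    return mean + positions[:, None] * pc1[None, :]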
The second parameter to choose is the minimum number of hits required for each node. The number of hits in a node tells how many samples of the data are represented by the node. In the examples presented in this paper we have required a minimum of 3 hits. This is strictly a practical selection: three can be thought of as "many samples", two is only a pair, and one is definitely not many. Discarding the nodes that have too few hits effectively compensates for potential overfitting.
The next selections required are the number of reference groups and the clustering algorithm to use. We have used hierarchical clustering, which gives an unambiguous clustering for a given number of clusters. The number of clusters can be manually selected and given to the clustering algorithm. The other option is to use some criterion to select the best number of clusters, such as the Davies–Bouldin index [5] or the mean silhouette [29]. Even then, in addition to the criterion itself, the range of numbers to try, i.e. its minimum and maximum, has to be selected. We have used the Davies–Bouldin index to select the best number between 3 and 10. In our method the main result of clustering is the division of the data set into groups, each having its own local variance. According to our tests with the data sets presented in this paper, using more than 10 clusters results in several groups having similar local variances, thus giving no improvement to the method.
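Selecting the number of reference groups by the Davies–Bouldin index could look like this (a sketch using scikit-learn's implementation of the index, where lower values indicate better clustering; the function name is ours):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import davies_bouldin_score

def choose_n_groups(code_vectors, k_min=3, k_max=10):
    """Pick the number of reference groups by the Davies-Bouldin index."""
    links = linkage(code_vectors, method='ward')
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = fcluster(links, k, criterion='maxclust')
        scores[k] = davies_bouldin_score(code_vectors, labels)
    return min(scores, key=scores.get)   # lower index = better clustering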
Finally, the anomaly threshold has to be selected. In this kind of application there is no way to tell the optimal threshold; it has to be selected heuristically. In the absence of predefined anomalies, one has to assume a certain percentage of the history data to be anomalous. For constant thresholds we have used a typical 95% threshold, assuming 5% of the reference data to be anomalous. The parameters for the adaptive thresholds are described in the previous section.

3. Experiments with system log data

We present experiments in which these methods have been applied to two data sets collected from the network management system servers of two small networks used for field tests of new mobile technology. The collected data contain information about the automatic system processes and manual operator activity. Thus, they can be used for monitoring system behavior for maintenance purposes and for searching for security incidents in security monitoring.
The data in set 1 contains about 1 million log entries per day. The data was extracted from a management application server, which controls about a dozen base station controllers with a full range of network services. The log entries have been classified by their source or originating application. These classes include, e.g., unix syslog, cron log, login logs, and several management application logs. For our experiments we have computed time series by counting the frequencies of entries in selected classes per time unit. Data set 2 contains similar data, except that the network from which it was collected is smaller.
In networks, a major system malfunction leaves a clear sign in the system logs. Some processes can either begin to produce huge amounts of log entries or they can stop logging. In both cases the resulting logging activity is anomalous when compared to the normal logging behavior, which follows fairly static cycles. Also in security monitoring, any anomaly in system component behavior or user activity, or the appearance of new software, may be a sign of an incident.
The methods have been applied in two types of analysis. First, we have analyzed the reference data by training a SOM with it and by clustering the SOM nodes. There were seven groups of SOM nodes in both data sets. The groups represent common system states that can be analyzed, described and named. The SOM nodes represent 95% of the data. The rare states appear as anomalies that differ from all the groups. Their number is quite low, which enables us to analyze them manually. This kind of analysis can be used, for example, for periodical audits of the systems or for on-demand analysis of the system history when we suspect that, for example, a security incident has taken place.
The other type of analysis that we have done is the comparison of the test data set against the previously identified model. This has resulted in a set of anomalies that are analyzed further. This type of analysis can be used for continuous monitoring of a system. In the analysis we have used variables that record the logging activity level of system components and functions. Seven variables were available in data set 1 and eight variables in data set 2.
These examples simulate normal use cases in a network operator's system log analysis. The previously collected reference data is used to calculate the scaling parameters and to tune the anomaly detection model. The model is then used to examine the performance of the system and to detect potential problems later on.
The examples represent a real life situation also in the sense that there is no information available about whether a sample is truly anomalous or not. Thus there are no "right results" and no measure of performance available. Therefore there is no straightforward way to compare the methods; the validity of the results has to be left for the end user, the network management expert, to decide. The anomaly detection methods in these applications are merely aimed at finding the most suspicious parts of the data for the user to study further.

3.1. Scaling

Scaling plays a very important role in anomaly detection. Scaling defines what kind of anomalies will be detected [10,11].
Proper scaling should make the variables equal in their importance within the problem in which they are used. Decisions
about the relative importance of the variables always require process knowledge and experience concerning the variables.
This makes it more difficult to implement applications and to adapt them to new processes. However, our experience shows that utilizing expert knowledge improves the performance of the methods in practice.
All multivariate methods based on distances or variances are very sensitive to scaling. Variables with larger variance
dominate in the methods. It is common to scale all the variables to zero mean and unit variance. This is rarely the best pos-
sible choice. Variables with small variance are amplified and therefore could be overvalued in the analysis. Also the noise is
amplified. At the same time the variables with high variance are attenuated and possibly underrated in the analysis.
Log data consist of event counts. The significance of a difference in counts depends on the magnitude of the variable value. A difference of 2 events, for example, is more significant when there are 10 events altogether than when there are 1000.
The basic normalization is done by subtracting the mean and dividing by the standard deviation. For system log data we use a robust logarithmic scaling that preserves the importance of the variables. Our scaling first takes the natural logarithm of the variable plus one and then divides by a robust standard deviation:
$$x_{\log s} = \frac{\ln(x + 1)}{s},$$
where $s = \mathrm{std}\{\ln(x + 1) \mid x > 0,\ x < q_{99}\}$ and $q_{99}$ refers to the 0.99 quantile of the variable $x$.
Adding one to the variable eliminates the need to handle the zeros in the data separately: zeros remain zeros instead of becoming minus infinity. This also separates ones from zeros in the original scale, since values of one are scaled to $\ln(2)$ instead of $\ln(1)$, which equals zero. The standard deviation is calculated ignoring the zeros and 1% from the upper tail.
Finally the mean is subtracted:
$$x_{scaled} = x_{\log s} - \mathrm{mean}\{x_{\log s}\}.$$
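A direct transcription of this scaling into Python might read (a sketch; it assumes the variable has at least some nonzero values below the 0.99 quantile):

import numpy as np

def robust_log_scale(x):
    """Robust logarithmic scaling for an event-count variable."""
    xlog = np.log(x + 1)                               # zeros stay zero, ones map to ln(2)
    core = xlog[(x > 0) & (x < np.quantile(x, 0.99))]  # drop zeros and the top 1%
    xlogs = xlog / core.std()                          # divide by robust std estimate
    return xlogs - xlogs.mean()                        # finally subtract the mean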
An example of the difference between normalization and the logarithmic scaling, applied to a variable that has peaks and smaller scale variation, is illustrated in Fig. 4. The normalized scaling leaves high peaks and attenuates the smaller scale variation so that any changes within it are difficult to detect. The logarithmic scaling attenuates the peaks, which still remain detectable, but leaves more variation on the smaller scale.
Scaling determines the relative importance of the variables in multivariate data. Any method utilizing similarity measures, such as correlations or Euclidean distances, is directly affected by scaling. Regularized multivariate regression methods that rely on the covariance structure, such as PLS (Partial Least Squares) and PCR (Principal Component Regression) or the more general CR (Continuum Regression), are affected more by the scaling than by the method selection [34]. The Euclidean distance, which is widely used in SOM and various clustering methods, is also dependent on the scaling of the variables. Scaling determines the importance of the distance in each direction of the multivariate space. The three scatter plots in Fig. 5 depict the effect of scaling. We call the variables on the x and y axes variables x and y, respectively. The variable x is the one presented also in Fig. 4. Without any scaling, in the original data on the left, the distances between samples are mostly determined by y.

Fig. 4. Example of scaling: normalized (top) and robust logarithmic scaling (bottom).


Fig. 5. Scatter plots of two variables on three scales: original variables, normalized and robust logarithmic scaling. The variable on the x axis is the same one as in the previous figure.

The x variable has much less variation, except for the three samples far out. The traditional normalization to zero mean and unit variance is presented in the middle. This method is very much affected by any outliers in the data and can hide meaningful variation, as in this case. The direction of the x variable is dominated by the three outliers, and the rest of the variation in this direction is shrunk to a negligible size. Using robust mean and variance estimates would reduce the effect of possible outliers. Still, if a variable has no significant variation within the data set at hand, normalization will amplify any noise present, even with robust estimates.
We think that scaling should always be performed so that the importance of each variable is equal in the scaled data. Therefore there is no universal method for scaling; each process and data set requires an individual scaling method. Sometimes each variable needs a unique scaling [20]. The scatter plot on the right in Fig. 5 presents the robust logarithmic scaling we used for the log activity data in this case. The small scale variation of the x variable is brought out, but the three samples that have higher values are still clearly separated, as is the group of high values in variable y.

3.2. Analysis of the reference data of system log data set 1

The reference data in set 1 contains hourly samples of seven variables. The measurement period available is 33 days, yielding a total of 792 samples per variable.
First we trained a one-dimensional SOM using the SOM Toolbox for Matlab [33]. We used N/5 nodes, where N was the number of samples in the reference data. Nodes that had less than three hits were dropped out. In the original method only nodes with no hits were dropped. The nodes were then clustered to form reference groups. We used hierarchical clustering with Ward linkage [13,37]. The number of clusters was determined by the minimum Davies–Bouldin index [5]. We limited the number of clusters to between 3 and 10. The anomaly threshold for each group was determined as the 95th percentile of the distances from the BMU in each reference group.
In this case we ended up with a SOM of 121 nodes and 7 reference groups. The histograms of the distances from BMUs are
presented in Fig. 6. The first histogram at the top presents the distances of the whole data set and the global threshold. The
following histograms are for each of the 7 reference groups.
In this case reference groups 1, 2 and 3 are small, where the size of a reference group refers to the number of samples in the reference data associated with the group. Small groups containing only a few samples represent process states that do not occur very often, yet too often to be detected as anomalies by most outlier detection methods. However, this local anomaly detection method gives an indication of anomalous behavior in the form of a small reference group. These groups can represent rare, possibly acceptable or even desirable behavior of the process that is worth learning about, or they can be caused by a sustained error that requires immediate attention. In either case they should be studied further by human experts.
The characteristics of groups 1, 2 and 3 are studied further using the cluster centroids and the component planes of the SOM. The cluster centroids, i.e. the means of the code vectors in each reference group, are presented in Fig. 7. Groups 1 and 2 are clearly distinguished from the others. Variable 5 has very high values in both groups. The groups are separated from each other by variable 2, which has high values in group 2, and by variable 6, which has low values in group 2. In group 3, variable 3 has high values, which distinguishes it from all the other groups.
Component planes are a common way to visualize a two-dimensional SOM. A component plane presents the SOM grid for one variable, and the values of that variable in each node of the grid are color coded on the plane. In this case, when we have a one-dimensional SOM, the component planes can be replaced by line plots as in Fig. 8. Each line plot presents the values of the SOM code vectors corresponding to one variable. The x axis is the number of the SOM node along the one-dimensional "grid".
Fig. 6. Histograms of the quantization errors in all the data on top and in each reference group below. Panels: all the data, threshold 2.04; group 1, 11 samples, threshold 2.85; group 2, 13 samples, threshold 9.40; group 3, 34 samples, threshold 2.35; group 4, 108 samples, threshold 1.53; group 5, 121 samples, threshold 2.36; group 6, 357 samples, threshold 1.40; group 7, 148 samples, threshold 2.08.

Clustering of the code vectors in a two-dimensional SOM is usually presented color coded or with markers on another
plane [36]. In this one-dimensional grid the clustering information can be integrated into the line plots. The clusters are sep-
arated by vertical lines. The numbers of the clusters from 1 to 7 are below variable 1 at the bottom of the figure.
Groups 1 and 2 are located at the end of the SOM line. They are clearly the only ones having high values in variable 5.
These states of the system are common enough not to be detected as anomalies, but still rare enough that they should be
examined further.
Fig. 8 also shows that groups 5 and 6 consist of two parts. The line of SOM nodes starts from group 6 and, after visiting group 5, winds back to group 6 and then returns to group 5 again. For a one-dimensional SOM grid this kind of warping is more common than for a two-dimensional one, which has a more rigid topology.
The end users in real life applications do not want to see anything but the essential. A plain time series plot with the detected outliers is often useful for the system expert. An example of a time series for a period of one day is given in Fig. 9. The anomalies detected in the reference data are highlighted with a horizontal line. The most interesting anomalies can then be studied further from the original logs. The reference groups that the samples belong to are coded by marker type. This section of the data contains samples from three groups: group 5, marked by a pentagram, group 7, marked with a square, and the last sample from group 3, marked with a diamond shape.
Fig. 7. Bar plot of the reference group centroids.

3.3. Applying the model to new data: the test data of data set 1

In data set 1 we have additional test data. It contains 9 days, i.e. 216 samples, of hourly data, recorded right after the reference data. The test data were scaled in the same way as the reference data, using the scale factors calculated from the reference data.
The BMU was searched for each sample of the test data. The BMU defined the reference group, and thereby the anomaly threshold, to be used for each sample.
The numbers of test data samples assigned to each reference group, together with the numbers of samples detected as anomalies, are listed in Table 1. The majority of the data are assigned to group 6. There are also a lot of anomalies in that group.
The error vectors of the anomalous samples in group 6 are illustrated in Fig. 10 as box plots [24]. An error vector is the difference between a data sample and its BMU code vector. The most obvious cause of the anomalies is variable 4, which is far on the negative side. This indicates that in all these samples this variable has much lower values than the corresponding component of the code vector in their BMU nodes.
The test data assigned to group 6 and the code vectors in the same group are compared in Fig. 11. The code vectors represent the approximated normal behavior in the group. The values of variable 4 are well below the ones in the code vectors. On the other hand, values of variable 4 close to −2 do exist in the reference data. Such states are assigned to groups 5 and 7; it is the combination of the other variables that causes these samples to be assigned to group 6 in the test data.
Variable 2 also has lower values in the test data anomalies than in the code vectors. This motivates inspection of the combination of variables 2 and 4, which is illustrated in Fig. 12. The figure shows a part of the reference data and the test data for variables 2 and 4. Together they seem to explain the anomalous behavior in group 6. Only the anomalies detected in that group are marked in the figure, by large dots over the line. All the data after the gray vertical line are test data.
The samples detected as anomalies at the end of the test data set are from a normal operational state. Both variables are within their normal ranges, but this kind of combination was not present in the reference data. Such groups of similar anomalies are a sign of new, unseen behavior, and the model should be updated.

3.4. Comparison of the methods on system log data set 1

Models according to both the original global method and the new local one were created from data set 1. The reference data of 792 samples were used to identify the SOM, and the 95% quantile was used as the anomaly threshold for both methods.
The identified models were applied to the test data set. The test set contains a period of 9 days, 216 samples, collected right after the reference data period. The anomalies detected in the test data set are illustrated in Fig. 13 using variable 4, which was also presented in the previous example.
The global method detects 17 anomalies while the local one finds 45. A larger number of detected anomalies is, as such, not generally desirable. However, in this case the system has entered a new state, not present in the reference data, as described in the previous section. The local thresholds in the improved model allow this state to be detected earlier. The global threshold is too high for this state to be detected as something new until the very end, where variable 4 has very low values.

Fig. 8. Code vectors of the SOM with the reference groups numbered below.

3.5. Comparison of the methods on system log data set 2

The second data set of system logs contains 8 variables recorded from a system similar to that of the first data set. The reference data, which was received first, contains 13 days of data, 312 samples. 205 samples were received later to be used as test data. The models were identified using the same scaling and model parameters as in the previous case. This time we ended up with a SOM of 30 nodes and 7 reference groups in the local model.
Fig. 9. Time series zoomed in on one day; the reference groups of the samples are shown by marker type.

Table 1
Number of samples and detected anomalies in the test data associated with each reference group

Reference group    Samples in test data    Anomalies in test data
1                    1                       0
2                    0                       0
3                    9                       0
4                   36                       5
5                    3                       0
6                  129                      35
7                   38                       5

The numeric results of the local method are presented in Table 2. The first column is the reference group number. The second and third columns are the numbers of samples associated with each group from the reference and the test data, respectively. The fourth column is the local anomaly detection threshold for each reference group. The fifth column presents the numbers of anomalies detected from the test set by the local method.

Table 2
Number of samples in the reference and test data associated with each reference group, followed by the local threshold and the number of anomalies detected from the test set in each group

Reference group    Samples in reference data    Samples in test data    Local threshold    Local anomalies in test data
1                     84                           11                     0.45               11
2                     11                            0                     3.93                0
3                     24                            0                     1.89                0
4                     17                           57                     7.49                5
5                     40                          127                     3.22               13
6                     16                           10                    14.18                0
7                    120                            0                     1.60                0

Fig. 10. Box plot of the quantization errors of the anomalies in the test data that are assigned to reference group 6.

Fig. 11. Box plots of the anomalies in the test data and the SOM code vectors in reference group 6.

The total number of anomalies detected by the local method is 29. The global method detects 77 anomalies. The time series of one variable and the locations of the anomalies are presented in Fig. 14. The third plot at the bottom presents the reference groups associated with each sample in the test data. Most of the data are associated with reference groups 4 and
5. The local anomaly thresholds in these groups are 7.49 and 3.22 respectively, while the threshold in the global model is
2.36. In this case the global method detects more anomalies, most of which are within normal local variation according
to the local method.
When compared against the model, the test data will be analyzed either one sample at a time or in batches of 24 samples if used on a daily basis. In that case the global method gives a number of unnecessary detections. The local method gives fewer anomaly detections, since the samples are from states where the variation was large also in the reference data.

Fig. 14. Time series example with the anomalies and the corresponding reference groups in test data set 2 (global method: 77 anomalies; local method: 29 anomalies).
Fig. 12. Time series plots of variables 4 and 2, reference and test data concatenated.

Fig. 13. Variable 4 with the anomalies detected by the local and global methods (global method: 17 anomalies; local method: 45 anomalies).

4. Adaptive thresholds with radio interface data

The third data set is a sample from the network performance database of a commercial European GSM operator [20]. The most important Key Performance Indicators (KPIs) were calculated from the network radio performance counters in the database. The KPIs used here were Dropped Call Rate, Handover Success, Congestion, Radio Quality, Downlink Signal Strength and Call Setup Success Rate. The performance data were collected from 2385 GSM radio cells. The measurement period is 6 weeks. The data is aggregated so that for each performance indicator there is one sample per day available.
The network contains various types of radio cells. Due to differences in their behavior it is reasonable to analyze each type separately. We use the same division of the cells as Kylväjä et al. [20]. The cells are divided into three groups by their functional type: indoor, micro and macro cells. The macro cells are further divided into four categories according to both the total amount of traffic and the number of handovers (HO) within the cells. The mean values of traffic and HO over the available history are calculated for each cell. The 2/3 quantile of the distribution of the mean values is used to threshold the cells into the groups "high traffic", "low traffic", "high handovers" and "low handovers", as sketched below.
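For illustration, the 2/3 quantile split could be done as follows (a sketch; the label names are ours, and the same split is applied separately to the traffic means and the handover means):

import numpy as np

def split_by_quantile(cell_means, q=2/3):
    """Label cells 'high' or 'low' by the 2/3 quantile of their history means."""
    cut = np.quantile(cell_means, q)
    return np.where(cell_means >= cut, 'high', 'low')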
Here we present results for only one category: macro cells with low traffic and a low number of handovers. This group contains 1277 cells – about half of the data.
The first 5 weeks, i.e. 35 days, were used as reference data to identify the anomaly detection model. Samples with missing values were dropped, which resulted in a total of 42,579 samples in the reference data. The last week, which was used as the test data set, contains 8993 samples after the samples with missing values had been dropped.

4.1. Scaling

We used the scaling introduced in [20]. The quality indicators were scaled continuously and piecewise linearly to the interval [0, 1], the extremes corresponding to the worst and the best performance, respectively. The mapping was constructed on the basis of a priori information from network experts. Four values of each performance indicator were defined: worst possible, very poor, satisfactory, and best possible. These values were scaled to 0, 0.2, 0.9 and 1, respectively, and the scaling function was created with linear interpolation. The scaling function parameters can easily be adjusted by an expert for various performance indicators and networks. The scaled performance indicators are within the same range, and the same value refers to the same level of performance in each indicator.
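A sketch of such an expert-defined piecewise linear mapping (the anchor KPI values below are hypothetical examples, not values from the paper; `np.interp` requires increasing anchor values, so an indicator where lower raw values are better would have its anchors and targets reversed first):

import numpy as np

# Hypothetical expert-defined anchors for one KPI:
# worst possible, very poor, satisfactory, best possible
kpi_anchors   = [0.0, 5.0, 60.0, 100.0]   # raw KPI values, increasing
scaled_values = [0.0, 0.2, 0.9, 1.0]      # corresponding scaled values

def scale_kpi(values):
    """Piecewise linear scaling of a KPI to [0, 1]: 0 = worst, 1 = best."""
    return np.interp(values, kpi_anchors, scaled_values)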

4.2. Results

The radio interface data is a substantially larger data set than in the previous cases. In order to keep the number of SOM nodes computationally reasonable, the number of nodes has to be less than the $N/5$ used with the previous data sets. In this case we used $3\sqrt{N}$ nodes, where $N$ is the number of samples. The other parameters of the method were the same as in the previous experiment: the minimum number of samples required per node was 3 and the number of reference groups was limited to between 3 and 10. This yielded a model of 619 nodes on the SOM and 6 reference groups.
In this experiment we demonstrate the advantages of the adaptive anomaly thresholds. The adaptive thresholds for each reference group were calculated by the formula presented in Section 2.2. We present the results using two values of the constant k: k = 3 and k = 2. Given 5% as the maximum percentage of anomalies, the adaptive threshold will always be equal to or higher than the 95% threshold. Larger values of k bring the threshold down towards the 95% threshold.
The histograms of the quantization errors in the reference data are illustrated in Fig. 15. The 95% anomaly thresholds are marked by green vertical lines and the adaptive thresholds by red dashed lines. The difference between these two thresholds varies; in groups 1 and 2 they are closer to each other than in the rest of the groups.
The adaptive thresholds produce fewer anomaly detections. The results from the reference data, i.e. the numbers of data samples and of detected anomalies in each reference group, are presented in Table 3. The fixed 95% threshold will always detect 5% of the data as anomalies. The adaptive thresholds depend on the distribution of the quantization errors in the reference groups, and the resulting number of anomalies varies accordingly. The total number of anomalies detected using the fixed 95% threshold is 2129. The number of detected anomalies is significantly lower when adaptive thresholds are used. The adaptive thresholds produce a total of 1018 and 394 anomalies for the values 2 and 3 of the constant k, respectively.
The results for the test data are presented in Table 4. The total number of anomalies detected using the fixed 95% threshold is 421. This is far too many for a human expert to analyze without the help of additional data mining methods. The adaptive thresholds produce a total of 208 and 84 anomalies for the values 2 and 3 of the constant k, respectively. Thus the number of anomalies is more reasonable for further analysis.

5. Comparison with reference methods

In this section we present comparative results of two reference methods: Gaussian Mixture Model (GMM) and a clustered
distance based method.

5.1. GMM

Generally, an observation can be regarded as abnormal if the probability of such a state is small. A threshold for abnormality can be drawn from the multivariate pdf (probability density function) $f_n(x)$. The unsupervised anomaly detection problem can be regarded as finding the contour of constant probability $f_n(x) = c$ such that samples in the data space outside that contour have a low enough probability to be considered anomalous situations. The probability of an observation $x_i$ with $f_n(x_i) < c$ is $p$ if the constant $c$ is defined by
$$\int_{f_n(x) < c} f_n(x)\, \mathrm{d}x = p.$$
This gives a threshold for abnormality with the probability of a false decision being $p$. Typically $p = 0.05$; thus 5% from the edge of the multivariate distribution is considered rare enough to be potentially anomalous.
Usually the distribution is not known and it has to be identified from the data. Multimodal distributions can be described by a mixture of $M$ component distributions [27]:
$$p(x) = \sum_{j=1}^{M} P(j)\, p(x \mid j),$$
where the $P(j)$ are the mixing coefficients that describe the proportion of each component density function $p(x \mid j)$. When the component distributions are normal distributions, the model is called a Gaussian Mixture Model (GMM).
The model is fitted using the EM algorithm. The result depends on the initial guess of the Gaussians. Typically the model is fitted several times using random centers and the one with the largest likelihood is selected. However, in the case of unsupervised anomaly detection there is no proof that the model with the largest likelihood is optimal for the task. The number of Gaussians in the model has to be either selected manually, or one can use an information criterion for the selection, such as the Akaike or Bayes information criterion [25]. The EM algorithm estimates the parameters that maximize the log-likelihood. In general, however, this maximum does not exist; regularization heuristics are required, and there is no rigorous way to set the regularization parameters [26].
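A sketch of GMM based detection using scikit-learn (our choice of library, not the authors'; `reg_covar` stands in for the diagonal regularization and `n_init=5` for the repeated fitting with random centers mentioned above, and the density threshold $c$ is estimated by random simulation from the fitted model, as is done for the synthetic example in Section 5.3):

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_detector(ref_data, max_components=10, p=0.05, n_sim=100_000, seed=0):
    """Fit a GMM, choose the component count by AIC, and estimate the threshold c."""
    models = [GaussianMixture(m, reg_covar=1.0, n_init=5, random_state=seed).fit(ref_data)
              for m in range(1, max_components + 1)]
    gmm = min(models, key=lambda g: g.aic(ref_data))
    # Approximate the constant-density contour f(x) = c with P(f(x) < c) = p
    sim, _ = gmm.sample(n_sim)
    log_c = np.quantile(gmm.score_samples(sim), p)
    return gmm, log_c

def detect_gmm(gmm, log_c, data):
    """Samples falling in regions of low model density are flagged as anomalies."""
    return gmm.score_samples(data) < log_c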
Fig. 15. Histograms of the quantization errors in the reference data, with the fixed 95% thresholds (solid green lines) and the adaptive thresholds (red dashed lines). Panels: group 1, 4013 samples; group 2, 2553 samples; group 3, 9078 samples; group 4, 13324 samples; group 5, 2588 samples; group 6, 11023 samples.

Table 3
Number of samples and detected anomalies in the reference data using the fixed 95% and the adaptive thresholds

Reference group    Samples in reference data    Anomalies, fixed 95%    Anomalies, adaptive k = 3    Anomalies, adaptive k = 2
1                   4013                         201                     88                           63
2                   2553                         128                     64                            1
3                   9078                         454                    119                           58
4                  13324                         666                    127                           91
5                   2588                         129                     26                           24
6                  11023                         551                     74                           44

Table 4
Number of samples and detected anomalies in the test data using the fixed 95% and the adaptive thresholds

Reference group    Samples in test data    Anomalies, fixed 95%    Anomalies, adaptive k = 3    Anomalies, adaptive k = 2
1                    791                    42                      24                           16
2                    481                    28                      18                            0
3                   1964                    93                      21                           10
4                   2907                   110                      21                           17
5                    528                    17                       4                            4
6                   2322                   131                      15                           10

5.2. Clustered distance based method

There are several ways to use cluster based techniques to detect local anomalies [35]. Here we use a clustered distance
based method.
The reference data is clustered and each sample is assigned to a cluster. The distance from the nearest cluster centroid is calculated for each sample. In unsupervised anomaly detection we do not know which samples are anomalies. We set the anomaly threshold in the same way as in the SOM based methods, leaving 5% of the samples in each cluster outside the threshold. Samples in new data are assigned to the nearest cluster centroids and the distances to the centroids are calculated. The samples whose distance is larger than the threshold are considered anomalies.
There is no unique right answer to the number of clusters to use. It has to be selected either manually or using some criterion such as the Davies–Bouldin index [5] or the mean silhouette [29]. The clustering method can be selected freely.
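An illustrative implementation of the clustered distance based detector (a sketch under the same Ward linkage choice as in Section 5.3; the function names are ours):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def fit_cluster_detector(ref_data, n_clusters=8, percentile=95):
    """Cluster the reference data and set a distance threshold per cluster."""
    labels = fcluster(linkage(ref_data, method='ward'),
                      n_clusters, criterion='maxclust') - 1  # 0-based labels
    centroids = np.array([ref_data[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
    # Distance of each reference sample to the centroid of its own cluster
    dists = np.linalg.norm(ref_data - centroids[labels], axis=1)
    thresholds = np.array([np.percentile(dists[labels == c], percentile)
                           for c in np.unique(labels)])
    return centroids, thresholds

def detect_clustered(data, centroids, thresholds):
    """Assign each sample to the nearest centroid and compare the distance."""
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    return d[np.arange(len(data)), nearest] > thresholds[nearest]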

5.3. Synthetic example

The GMM and the cluster based methods were applied to the synthetic two-dimensional data, which was also used in
Section 2 for demonstrating the SOM based methods.
The GMM was identified using the EM algorithm, with a regularization term of 1.0 added to the diagonal of the covariance matrices to improve the convergence of the algorithm. The number of components was selected by the Akaike information criterion (AIC) from the range 1–10. The identification was repeated five times for each number of components and the fit with the largest likelihood was selected, resulting in a model with three components. The 5% anomaly threshold was derived by random simulation from the model.
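A minimal sketch of this reference method, assuming scikit-learn's GaussianMixture (an assumption on our part; the paper itself cites NETLAB [27] for the mixture model). The reg_covar parameter plays the role of the diagonal regularization described above.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_aic(X_ref, k_range=range(1, 11), reg=1.0, n_init=5, seed=0):
    # Fit a GMM for each candidate number of components; n_init=5 repeats
    # each fit from random initializations and keeps the highest-likelihood
    # run internally.
    fits = [GaussianMixture(n_components=k, reg_covar=reg, n_init=n_init,
                            random_state=seed).fit(X_ref)
            for k in k_range]
    # Select the number of components by the Akaike information criterion.
    return min(fits, key=lambda gm: gm.aic(X_ref))

def gmm_threshold(gm, pct=5, n_sim=100_000):
    # Derive the anomaly threshold by random simulation: draw samples from
    # the fitted model and take the 5th percentile of their log-densities.
    sim, _ = gm.sample(n_sim)
    return np.percentile(gm.score_samples(sim), pct)

A new sample x is then flagged as anomalous when gm.score_samples(x.reshape(1, -1)) falls below the simulated threshold.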
The cluster-based method was carried out using hierarchical clustering with Ward linkage, as in the local SOM-based method. The number of clusters was selected from the range 1–10 by the Davies–Bouldin index, resulting in three clusters.
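The cluster-count selection can be sketched in the same setting; scikit-learn's davies_bouldin_score is assumed here (lower values indicate better separation, and the index is defined only for two or more clusters, so the scan effectively starts at k = 2).

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score

def select_n_clusters(X, k_range=range(2, 11)):
    # Evaluate the Davies-Bouldin index for each candidate k and
    # return the k with the lowest (best) index.
    scores = {}
    for k in k_range:
        labels = AgglomerativeClustering(n_clusters=k,
                                         linkage="ward").fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return min(scores, key=scores.get)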
Fig. 16 depicts the results on the synthetic example data. The GMM method was able to find only one anomaly: the normal distributions spread the probability mass from −∞ to ∞, and the region of 95% probability covers almost all the data. The cluster-based method, by construction, leaves 5% of the data points in each cluster outside the threshold as anomalies. The findings differ from those of our SOM-based method with either local or global thresholds. The cluster-based method assumes a circular shape for each cluster and finds anomalies at the extreme ends of the data point groups; it misses anomalies that stretch out from the middle of elongated data groups.

Fig. 16. Scatter plot of the synthetic example data. Anomaly thresholds of the GMM on the left and cluster-based thresholds on the right.

Fig. 17. Variable 4 as a function of the sample index, with the seven anomalies detected by the cluster and distance-based method.

5.4. Results on the system log data

The reference methods were applied to system log data set 1, which was used in Sections 3.2, 3.3 and 3.4. The reference data set was used to identify the GMM and the clusters of the cluster-based method. They were identified using the same algorithms and parameters as in the synthetic example of the previous section. The number of components in the GMM and the number of clusters were selected from the range 3–10, the same range we used for the number of reference groups in the local SOM-based method. Anomalies were detected in both the reference data and the test data.
The GMM ends up with a five-component model. Only 6 samples, i.e., 0.76%, fall outside the 5% anomaly threshold in the reference data, and no anomalies at all were detected in the test data set. These results indicate that the GMM does not represent the true distribution of the data. The Gaussians tend to spread wider than the data, as seen in the two-dimensional example in Fig. 16. The data from both the system logs and the radio performance measurements have sharp edges, as seen in Fig. 1, and multivariate Gaussian distributions are therefore unable to describe the data in sufficient detail.
Clustering ends up with 8 clusters. By definition, 5% of the reference data, i.e., 39 samples, are anomalies. However, only 7 anomalies are detected in the test data, fewer than with either of the SOM-based methods. The detected anomalies are shown on variable 4 in Fig. 17; the corresponding results of the SOM-based methods can be seen in Fig. 13. The clustering method detects some anomalies, but it fails to detect the system entering a new state in the combination of variables 4 and 2 (see Fig. 12). That state is best detected by the local SOM-based method.

6. Conclusions

We have introduced a local anomaly detection method that uses SOM and clustering. The method is independent of the distribution or clustering structure of the data and it takes the local variations in variance into account. It can therefore be used with a wide range of systems that produce multivariate data. The scaling of the data is essential, as in all methods utilizing distances or variances. We introduced a scaling that is suitable for the system log data; for the radio interface data we used a previously developed scaling method that utilizes a priori expert knowledge. When transferring the method to another environment, the scaling has to be revised to match the importance of the variables used in the analysis. The method scales up to larger systems and can also be used for large data sets.
We compared the original global method to our local method using two data sets. The local method produces fewer false alarms when the system is in states where the variance of the data is naturally high. In states where the variance is lower, the local method is more sensitive than the global one and is able to detect local anomalies that remain undetected by the global method. The local method also detects situations that occur too often to be flagged as outliers, yet rarely enough to merit further study.
This method has proved in practice to be useful in finding outliers and new phenomena in network system log data.
We also introduced adaptive anomaly thresholds, which are especially useful with large data sets. Instead of a predefined fixed percentile, we find a natural threshold by utilizing the distribution of the quantization error. We presented an example using radio interface performance data from a commercial network operator. A fixed 95% threshold on the reference data produces hundreds of anomaly detections, far more than any human expert can handle. The adaptive thresholds produce fewer detections while still discovering the most severe anomalies. Simply raising a fixed threshold would also yield fewer anomalies; however, an adaptive threshold derived from the distribution of the data ensures a clear separation between the anomalous and non-anomalous samples.

Finally, we presented a brief comparison of the SOM-based methods with the GMM and clustering methods. The two-dimensional synthetic example demonstrates the typical properties of these methods. Neither of the reference methods performs well in discovering new states in system log data; the GMM in particular fails due to the non-normal distributions of the data.

References

[1] V. Barnett, Outliers in Statistical Data, Wiley, Chichester, 1987.
[2] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[3] S. Breunig, H.-P. Kriegel, R. Ng, J. Sander, LOF: identifying density-based local outliers, in: W. Chen, J.F. Naughton, P.A. Bernstein (Eds.), Proceedings of
the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 16–18, 2000, pp. 93–104.
[4] R. Caruana, S. Lawrence, C.L. Giles, Overfitting in neural networks: backpropagation, conjugate gradient, and early stopping, in: Advances in Neural
Information Processing Systems, Denver, Colorado, November 28–30, 2001.
[5] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (2) (1979) 224–227.
[6] A. Feldmann, W. Whitt, Fitting mixtures of exponentials to long-tail distributions to analyze network performance models, Performance Evaluation 31
(3–4) (1998) 245–279.
[7] C. Fuchs, R. Kenett, Multivariate Quality Control, Marcel Dekker, Inc., New York, 1998. 212 p.
[8] E.L. Grant, R.S. Leavenworth, Statistical Quality Control, seventh ed., McGraw-Hill, New York, 1996. 764 p.
[9] D. Hawkins, Identification of Outliers, Chapman & Hall, London, 1980.
[10] K. Hätönen, P. Kumpulainen, P. Vehviläinen, Pre and post-processing for mobile network performance data, in: Proceedings of seminar days of Finnish
Society of Automation, Helsinki, Finland, September 2003.
[11] K. Hätönen, S. Laine, T. Similä, Using the LogSig-function to integrate expert knowledge to Self-Organising Map (SOM) based analysis, IEEE
International Workshop on Soft Computing in Industrial Applications, Birmingham University, New York, June 23–25, 2003.
[12] A.J. Höglund, K. Hätönen, A.S. Sorvari, A computer host-based user anomaly detection system using the self-organizing map, Proc. IEEE-INNS-ENNS
International Joint Conference on Neural Networks (IJCNN), vol. 5, IEEE, 2000, pp. 411–416.
[13] R.A. Johnson, D.W. Wichern, Applied multivariate statistical analysis, fourth ed., Prentice-Hall, 1998.
[14] A. Kanaoka, E. Okamoto, Multivariate statistical analysis of network traffic for intrusion detection, in: Proceedings of the 14th International Workshop
on Database and Expert Systems Applications, IEEE, 1–5 September 2003, pp. 472–476.
[15] M. Kano, S. Hasebe, I. Hashimoto, Evolution of Multivariate statistical process control: application of independent component analysis and external
analysis, in: Proceedings of the Foundations of Computer Aided Process Operations Conference (FOCAPO 2003), Coral Springs, US, January 12–15, 2003,
pp. 385–388.
[16] E.M. Knorr, R.T. Ng, V. Tucakov, Distance-based outliers: algorithms and applications, The VLDB Journal 8 (3–4) (2000) 237–253.
[17] T. Kohonen, Self-Organizing Maps, Series in Information Sciences, vol. 30, Springer, Heidelberg, 1995.
[18] W.H. Kruskal, Some remarks on wild observations, Technometrics 2 (1) (1960) 1–3.
[19] P. Kumpulainen, K. Hätönen, Local anomaly detection for network system log monitoring, in: M. Konstantionis, I. Lazaros (Eds.), Proceedings of the 10th International Conference on Engineering Applications of Neural Networks, Thessaloniki, Greece, 2007, pp. 34–44.
[20] M. Kylväjä, P. Kumpulainen, K. Hätönen, Information summarization for network performance management, in: M. Laszlo, J.V. Zsolt (Eds.), Proceedings
of the 10th IMEKO TC10 International Conference on Technical Diagnostics, Budapest, Hungary, 2005, pp. 167–172.
[21] J. Laiho, K. Raivio, P. Lehtimäki, K. Hätönen, O. Simula, Advanced analysis methods for 3G cellular networks, IEEE Transactions on Wireless Communications 4 (3) (2005) 930–942.
[22] W.E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the self-similar nature of Ethernet traffic (extended version), IEEE/ACM Transactions on
Networking 2 (1) (1994) 1–15.
[23] K. Leung, C. Leckie, Unsupervised anomaly detection in network intrusion detection using clusters, in: Proceedings of the 28th Australasian Computer
Science Conference, Newcastle, NSW, Australia, 2005.
[24] R. McGill, J.W. Tukey, W.A. Larsen, Variations of boxplots, The American Statistician 32 (1978) 12–16.
[25] G. McLachlan, D. Peel, Finite Mixture Models, John Wiley & Sons, New York, 2000.
[26] Sayan Mukherjee, Vladimir Vapnik, Multivariate density estimation: a support vector machine approach, Massachusetts Institute Of Technology,
Artificial Intelligence Laboratory and Center For Biological and Computational Learning Department of Brain And Cognitive Sciences, A.I. Memo No.
1653 April 1999, C.B.C.L Paper No. 170, <ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-1653.pdf>.
[27] I.T. Nabney, NETLAB Algorithms for Pattern Recognition, Advances in Pattern Recognition, Springer, 2001.
[28] V. Paxson, S. Floyd, Wide area traffic: the failure of Poisson modeling, IEEE/ACM Transactions on Networking 3 (3) (1995) 226–244.
[29] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics
20 (1987) 53–65.
[30] C.S. Sastry, S. Rawat, A.K. Pujari, V.P. Gulati, Network traffic analysis using singular value decomposition and multiscale transforms, Information
Sciences 177 (23) (2007) 5275–5291.
[31] T. Shon, J. Moon, A hybrid machine learning approach to network anomaly detection, Information Sciences 177 (2007) 3799–3821.
[32] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, L.-W. Chang, A novel anomaly detection scheme based on principal component classifier, in: Proceedings of
the IEEE Foundations and New Directions of Data Mining Workshop, 2003.
[33] SOM Toolbox. <http://www.cis.hut.fi/projects/somtoolbox/>.
[34] R. Sundberg, Continuum Regression, Encyclopedia of Statistical Sciences, second ed., Research Report 2002:4, 2002.
[35] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005.
[36] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Transactions on Neural Networks 11 (3) (2000) 586–600.
[37] J.H. Ward Jr., Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (301) (1963) 236–244.
[38] C. Williamson, E. Halepovic, H. Sun, Y. Wu, Characterization of CDMA2000 cellular data network traffic, in: Proceedings of the IEEE Conference on Local
Computer Networks, Washington, DC, USA, 2005, pp. 712–719.
[39] N. Ye, S.M. Emran, Q. Chen, S. Vilbert, Multivariate statistical analysis of audit trails for host-based intrusion detection, IEEE Transactions on Computers
51 (7) (2002) 810–820.
[40] J. Zhang, M. Zulkernine, Anomaly based network intrusion detection with unsupervised outlier detection, in: 2006 IEEE International Conference on
Communications, vol. 5, 2006, pp. 2388–2393.
