Anomaly detection using baseline and K-means clustering
Moisés F. Lima∗ , Bruno B. Zarpelão† , Lucas D. H. Sampaio∗ , Joel J. P. C. Rodrigues‡ , Taufik Abrão∗
and Mario Lemes Proença Jr.∗
∗ Computing Science Department, State University of Londrina (UEL), Londrina, Brazil
† School of Elect. & Comp. Engineering, University of Campinas (UNICAMP), Campinas, Brazil
‡ Instituto de Telecomunicações, University of Beira Interior, Covilhã, Portugal
E-mails: {moisesflima, brunozarpelao, lucas.dias.sampaio}@gmail.com, joeljr@ieee.org, {taufik, proenca}@uel.br
Abstract: Anomaly detection refers to methods that provide
warnings of unusual behaviors which may compromise the
security and performance of communication networks. This
paper proposes a novel model for network anomaly detection
combining baseline, K-means clustering and particle swarm
optimization (PSO). The baseline consists of profiles of normal
network traffic behavior, generated by applying the Baseline for
Automatic Backbone Management (BLGBA) model to a historical
SNMP network data set, while K-means is an unsupervised
learning clustering algorithm used to recognize patterns or
features in data sets. To escape from the local optima
problem, K-means is combined with PSO, a metaheuristic
whose main characteristics include low computational
complexity and dependence on a small number of input parameters.
The proposed anomaly detection approach clusters data
from the baseline and real traffic using K-means combined with
PSO. Anomalous behaviors are identified by comparing the
distance between real traffic samples and cluster centroids
against a threshold. Tests were performed on the network of the
State University of Londrina, and the obtained detection and
false alarm rates are promising.
1. INTRODUCTION
Identifying network anomalies is essential for the communication
networks of enterprises and institutions. The goal of
anomaly detection is to provide an early warning about
unusual behavior that can affect the security and the
performance of a network. It is very important to detect and
treat anomalies efficiently, because they affect the quality
of the services provided, degrading network
performance and even interrupting operations. Due to the
large number of anomalous events that can occur in networks,
the main challenge is to detect and classify anomalies automatically [1]–[3].
Anomaly detection techniques are divided into three major
areas according to [4]: statistical anomaly detection, data mining
and machine learning based techniques. Two other research
areas, information theory and spectral theory, are included in
the anomaly detection classification provided by [5].
Considering data mining techniques, there is a wide variety
of algorithms that can be applied to anomaly detection, among
which clustering stands out as the most important unsupervised
learning process for finding patterns in unlabeled data [6].
Among the wide variety of applications for anomaly detection,
common ones include network traffic monitoring, intrusion
detection for cyber security, fault detection in safety-critical
systems, insurance, and military surveillance of enemy activities,
among many others [5].
This work proposes the use of the K-means clustering algorithm [7]
combined with the network behavior profiles called
baseline [8] and Particle Swarm Optimization (PSO) [9] for
anomaly detection. This model fits into the data mining based
methods and aims to detect volume anomalies.
Classified as an unsupervised learning technique, K-means
clustering is a classical algorithm, initially developed by J.
MacQueen in 1967. Although it is a simple algorithm, it
is unable to escape from local optima, a limitation that
can be overcome by combining it with the PSO algorithm. PSO,
developed in 1995 by Kennedy and Eberhart [9], is a highly
efficient heuristic technique with low computational
complexity and the capability to escape from local optima. The baseline
consists of different normal behavior profiles for a specific
network device or segment, generated by the GBA tool
(Automatic Backbone Management) [8] using data collected
from Simple Network Management Protocol (SNMP) objects.
The proposed anomaly detection system (ADS) combines
the K-means and PSO algorithms to calculate the
cluster centroids of the real traffic collected from an SNMP object
and of its respective baseline. Anomaly detection is performed
by comparing real traffic samples with the cluster centroids.
Tests were carried out in a real network environment
at the State University of Londrina (UEL), Brazil. Numerical
results show that the obtained detection and false
alarm rates are promising.
This paper is organized as follows. Section 2 presents
related work on network anomalies, and the traffic
characterization model is detailed in Section 3. Section 4 describes
the K-means and swarm optimization aspects. Section 5
details the proposed anomaly detection approach, while
Section 6 discusses the adopted test setup and the respective
performance results. Finally, the main conclusions and future
work are presented in Section 7.
2. RELATED WORK
In recent years, several works such as [2], [10], [11] have
been developed in the anomaly detection area. Although they use
different approaches, they share the goal of maximizing
the detection rate while minimizing the false alarm rate.
Establishing a model of normal network traffic
and increasing the detection rate while lowering the
false alarm rate are still challenging tasks, which keeps
anomaly detection an open research area.
Xiao et al. [7] proposed a K-means algorithm based on PSO
for network anomaly detection. K-means is a hill-climbing method:
if its initial synaptic weights and input patterns
are not set correctly, the method does not converge,
or converges to a local optimum. Because of this tendency
to converge to a local minimum, Particle Swarm Optimization,
which has a good global search ability, is combined with K-means
to solve the local convergence problem. The KDD
CUP 1999 dataset [12] was used to evaluate the proposed
method. Experimental results show that the method
is effective for partitioning large datasets and is useful for
anomaly detection, reaching satisfactory detection rates for
different classes of anomalies.
In [13], Liu proposed a modified version of the traditional
quantum-behaved particle swarm optimization (QPSO), called
MQPSO. This algorithm is employed to train a wavelet neural
network (WNN), which is used for network anomaly detection.
A multidimensional vector composed of the WNN parameters
is associated with a particle in the evolutionary learning
algorithm. A suitable parameter combination determines the
feasibility of the search space for obtaining the optimal solution.
To validate the proposed approach, the KDDCup99 [12]
training dataset was used as the experimental data set. Results
showed that the proposed algorithm has better training
performance, faster convergence, and a better ability to detect
new, unknown types of attacks than QPSO.
3. TRAFFIC CHARACTERIZATION: BLGBA AND BASELINE
The first step in detecting anomalies is to adopt a model that
characterizes the network traffic efficiently, which is
a significant challenge due to the non-stationary nature of
network traffic. The traffic behavior of large networks is composed of
daily cycles, where traffic levels are usually higher during working
hours and also differ between workdays and weekends. An
efficient traffic characterization model should therefore
represent these characteristics faithfully. In this work,
the GBA tool is used to generate different profiles of normal
behavior for each day of the week, meeting this requirement.
These behavior profiles are named Digital Signature of Network
Segment (baseline), proposed by Proença in [8] and
applied to anomaly detection with good results in [3].
The BLGBA algorithm is based on a
variation of the calculation of the statistical mode. To
determine the expected value for a given second of the day, the
model analyzes the values observed at the same second in previous
weeks. These values are distributed in frequencies, based
on the difference between the greatest element $G_{aj}$ and the
smallest element $S_{aj}$ of the sample, using 5 classes. This difference,
divided by five, forms the amplitude $h$ between the classes,
$h = (G_{aj} - S_{aj})/5$. Then, the limits of each class $L_{Ck}$ are
obtained. They are calculated by $L_{Ck} = S_{aj} + h \cdot k$, where
$C_k$ represents the $k$th class ($k = 1 \ldots 5$). The value included
in the baseline is the greatest element of the class whose accumulated
frequency is equal to or greater than 80%.
The samples for the generation of the baseline are collected
second by second throughout the day by the GBA tool. Two types
of baseline are generated: the bl-7, consisting of one baseline
for each day of the week, and the bl-3, consisting of one
baseline for the workdays, one for Saturday and another one
for Sunday.
Figure 1 shows one day of monitoring
of the UEL network. Data were collected from the SNMP object
ifInOctets at the University’s Web server on
02/08/2010. The monitored traffic is represented in green and
the respective baseline values by the blue line. The real traffic
follows the baseline closely, except from 5 p.m. to 10 p.m., when
a volume anomaly occurs.
Figure 1. Test Day. Traffic and baseline of 02/08/2010 from ifInOctets
SNMP object, on main Web-Server of State University of Londrina.
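The BLGBA rule for a single second of the day can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the GBA tool itself: the function name blgba_value is hypothetical, a value lying exactly on a class limit is placed in the lower class, and the class selected is the first one whose accumulated frequency reaches 80%.

```python
def blgba_value(samples):
    """Return the baseline value for one second of the day.

    `samples` holds the values observed at that same second in previous weeks.
    """
    if not samples:
        raise ValueError("at least one historical sample is required")
    smallest, greatest = min(samples), max(samples)
    if greatest == smallest:
        return greatest  # assumption: all values equal, nothing to classify
    h = (greatest - smallest) / 5  # amplitude between the 5 classes
    classes = [[] for _ in range(5)]
    for value in samples:
        k = 0
        # Advance while the value lies above the upper limit
        # smallest + h * (k + 1) of the current class.
        while k < 4 and value > smallest + h * (k + 1):
            k += 1
        classes[k].append(value)
    accumulated = 0
    for members in classes:
        accumulated += len(members)
        # Same test as accumulated / len(samples) >= 0.8, using integers.
        if 5 * accumulated >= 4 * len(samples):
            return max(members)


# Hypothetical history for one second: nine ordinary values and one outlier.
history = [10, 12, 11, 13, 12, 11, 50, 12, 10, 13]
print(blgba_value(history))
```

In this made-up history the outlier falls alone into the last class, so the first class already holds 90% of the values and its greatest element is returned.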
4. K-MEANS CLUSTERING AND PARTICLE SWARM OPTIMIZATION
K-means is a well-known clustering algorithm created by J.
MacQueen. It can be used for unsupervised learning of neural
networks, pattern recognition, clustering analysis and more.
The algorithm partitions a data set into K
groups based on its attributes. The grouping is performed by
minimizing the sum of squares of distances between the data and the
corresponding cluster centroid. The K-means algorithm lacks a
diversity mechanism to escape from local optima. Thus,
to overcome this drawback while keeping the
computational complexity under control, which is a concern
for high-dimensional problems, the
K-means algorithm can be combined with PSO [7], [14].
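The grouping step can be illustrated with a short Python sketch. It assumes the common Lloyd-style iteration (assign each point to its nearest centroid, then recompute each centroid as the mean of its members); the function names, the fixed iteration count and the choice to keep an empty cluster's previous centroid are assumptions made for illustration. Because the result depends on the initial centroids, this plain procedure can stop at a local optimum, which is the limitation discussed above.

```python
import math


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assignment/update rounds."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # Mean of the members per coordinate; keep the old centroid
            # when a cluster received no points (assumption).
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


def sum_of_squares(clusters, centroids):
    """Objective minimized by K-means: sum of squared distances."""
    return sum(math.dist(p, c) ** 2
               for members, c in zip(clusters, centroids)
               for p in members)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centroids, clusters = kmeans(points, [(0.0,), (10.0,)])
print(centroids, sum_of_squares(clusters, centroids))
```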
The PSO is an evolutionary computation technique based
on swarm intelligence, created by Kennedy and Eberhart in
1995 and inspired by the social behavior of birds [9]. PSO is powerful
since it is able to escape from local optima while keeping a
simple structure. In PSO, the candidate solutions in the search space
are called particles. Each particle has a fitness value, which is
measured by the function to be optimized, and a velocity
that drives its flight through the search space.
The PSO principle is the movement of a group of particles,
randomly distributed in the search space, each one with its
own position and velocity. The position of each particle is
modified by the application of the velocity in order to reach
a better performance [9]. The interaction among particles
is captured in the calculation of the particle velocity. Hence, at
each iteration, the velocity and position of all particles in
a population of size $M$ are updated. If better values for the
local or global solutions are found, the respective best
candidate-vector is updated, where $\mathbf{p}_i^{best}$ is the best position
obtained so far by the $i$th particle in the population of size
$M$, and $\mathbf{p}_g^{best}$ is the best position obtained by all particles
so far. The best local and global positions are column vectors
of dimension $D$.
In the PSO strategy, each candidate-vector at the $t$th iteration,
defined as $\mathbf{p}_i[t]$ with dimension $D \times 1$, is used in the velocity
calculation for the next iteration as:
$$\mathbf{v}_i[t+1] = \omega \cdot \mathbf{v}_i[t] + \phi_1 \cdot \mathbf{U}_{i1}[t](\mathbf{p}_i^{best}[t] - \mathbf{p}_i[t]) + \phi_2 \cdot \mathbf{U}_{i2}[t](\mathbf{p}_g^{best}[t] - \mathbf{p}_i[t]) \tag{1}$$
where $\omega$ is the inertia weight, set to one
in this work for simplicity; $\mathbf{U}_{i1}[t]$ and $\mathbf{U}_{i2}[t]$ are diagonal
matrices of dimension $D$ whose elements are random variables
with uniform distribution $\sim \mathcal{U} \in [0, 1]$, generated for
the $i$th particle at iteration $t = 1, 2, \ldots, T$; $\mathbf{p}_g^{best}[t]$ and $\mathbf{p}_i^{best}[t]$
are the best global position and the best local position found
until the $t$th iteration, respectively; $\phi_1$ and $\phi_2$ are acceleration
coefficients weighting the influence of the particle's own best
position and of the best global position on the velocity update, respectively.
The $i$th particle’s position at iteration $t$ is a clustering
candidate-vector $\mathbf{p}_i[t]$ of size $D \times 1$. The position of each
particle is updated using the new velocity vector (1) for that
particle, according to:
$$\mathbf{p}_i[t+1] = \mathbf{p}_i[t] + \mathbf{v}_i[t+1], \quad i = 1, \ldots, M \tag{2}$$
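A minimal Python sketch of one particle update following (1) and (2) is given below. The function name pso_step, the default acceleration coefficients (2.0) and the optional bound v_max, which clamps each velocity component to [-v_max, v_max], are assumptions made for illustration; the diagonal random matrices are realized as one independent uniform draw per dimension.

```python
import random


def pso_step(position, velocity, pbest, gbest,
             omega=1.0, phi1=2.0, phi2=2.0, v_max=None, rng=random):
    """Update one particle; all vectors are lists of equal length."""
    new_position, new_velocity = [], []
    for p, v, pb, gb in zip(position, velocity, pbest, gbest):
        u1 = rng.random()  # diagonal element of the first random matrix
        u2 = rng.random()  # diagonal element of the second random matrix
        # Velocity update: inertia + pull to own best + pull to global best.
        v_next = omega * v + phi1 * u1 * (pb - p) + phi2 * u2 * (gb - p)
        if v_max is not None:
            # Optional clamping of the component to [-v_max, v_max].
            v_next = min(v_max, max(-v_max, v_next))
        new_velocity.append(v_next)
        new_position.append(p + v_next)  # position update
    return new_position, new_velocity


# A particle already sitting on both best positions keeps moving by inertia.
print(pso_step([0.0], [1.0], [0.0], [0.0]))
```

Passing a seeded generator such as random.Random(0) as rng makes a run repeatable, which is convenient when comparing parameter settings.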
The PSO algorithm consists of repeated application of the
velocity and position updating equations until a stop criterion
is met. The stop criterion can be a fixed number of iterations
or the lack of improvement in the solution as
the algorithm evolves.
In order to reduce the likelihood that a particle
leaves the search space, a maximum velocity factor $V_m$ is
added to the PSO model (1), which is responsible for
limiting the velocity to the range $[\pm V_m]$. This adjustment
of velocity allows the particle to move in a continuous but
constrained subspace, and is simply accomplished by:
$$v_i[t] = \min\{V_m; \max\{-V_m; v_i[t]\}\} \tag{3}$$
From (3) it is clear that if $|v_i[t]|$ exceeds a positive constant
value $V_m$ specified by the user, the $i$th particle’s velocity is
set to $\mathrm{sign}(v_i[t]) V_m$, i.e., the particle's velocity in each
of the $D$ dimensions is clamped to a maximum magnitude $V_m$. If
the search space is defined by the bounds $[P_{min}; P_{max}]$,
then the value of $V_m$ is typically set to $V_m = \tau (P_{max} - P_{min})$,
where $0.1 \le \tau \le 1.0$.
In this work, the objective function to be minimized by
PSO is the sum of the Euclidean distances between the candidate-vector
and each data point of the $k$th cluster generated by K-means, given by:
$$J(\mathbf{p}) = \sum_{k=1}^{K} \sum_{n=1}^{N} \sqrt{|p_{kn} - c_k|^2} \tag{4}$$
where $K$ is the number of clusters, $N$ is the number of traffic
samples and $c_k$ is the $k$th cluster centroid.
5. NETWORK ANOMALY DETECTION MODEL BASED ON SWARM INTELLIGENCE
As seen in Section 3, the baseline is responsible for characterizing
normal traffic using historical SNMP network
data. Therefore, the proposed ADS does not need a pre-processing
phase to characterize normal traffic; the
baseline performs this task. The objective of combining K-means
and PSO is to enable an efficient calculation of the centroids
of the traffic samples and the baseline over high-dimensional
data. In this work, a fixed value of $K = 1$ is considered, and
the value of $c_k$ is calculated every 300 seconds. Then the
distance between each traffic sample and the cluster
centroid is calculated. If the distance of any sample in the
300-second interval exceeds a threshold, the interval is
considered anomalous.
The elements of the proposed ADS can be seen in Figure
2. The GBA tool [8] is responsible for collecting the real
traffic samples and generating the baseline. The PSO-Cls
system calculates the cluster centroids of the traffic
and baseline. Then, the PSO Alarm system analyzes the
distance between the cluster centroids and the real traffic samples
to determine whether anomalies exist.
Figure 2. Proposed anomaly detection system model.
The anomaly detection process of the proposed system
is divided into two stages:
1. The PSO-Cls system analyzes the traffic data collected from
SNMP objects and their respective baseline every 300 seconds. First,
the traffic data and baseline from each 300-second
interval are clustered simultaneously. Then, a centroid for each
cluster is calculated, which represents the expected behavior
of the traffic samples in the cluster. The clustered data and
cluster centroids generated in this stage are used in the next
stage.
2. The PSO Alarm system is responsible for analyzing the
results generated by stage 1, verifying whether anomalies exist
in the analyzed interval. The PSO Alarm system checks
how close each traffic sample is to its
corresponding cluster centroid. The distance measure adopted
in this work is the Euclidean distance, which is
the straight-line distance between two points. A sample is
considered anomalous if the Euclidean distance between it
and its respective cluster centroid exceeds a threshold value
called $\lambda$. Then, the PSO Alarm system triggers an alarm to notify
the network administrator that an anomaly has occurred.
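The check performed in stage 2 can be sketched as follows. This is a hedged illustration rather than the PSO Alarm implementation: the function names are hypothetical, each sample is assumed to be a numeric tuple (here a traffic value paired with its baseline value), and the centroid is assumed to have been computed already in stage 1.

```python
import math


def interval_alarms(samples, centroid, threshold):
    """Indices of samples whose Euclidean distance to the centroid
    exceeds the threshold."""
    return [i for i, s in enumerate(samples)
            if math.dist(s, centroid) > threshold]


def interval_is_anomalous(samples, centroid, threshold):
    """An interval is flagged when at least one sample raises an alarm."""
    return bool(interval_alarms(samples, centroid, threshold))


# Made-up (traffic, baseline) pairs for one interval; the last one deviates.
samples = [(1.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
print(interval_alarms(samples, (1.5, 1.5), threshold=5.0))
```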
There is no consensus on the definition of anomalies
in network traffic. The same behavior deviation can be classified
differently according to distinct management policies.
For one network administrator, even small deviations should
be detected, in order to identify every possible undesired
use of network resources. Another network administrator may
be interested only in long deviations, which cause users
to experience degraded service quality. Different
approaches are found in the literature to define which behaviors
are anomalies. Thottan and Ji [15] consider as anomalies only
the behavior deviations that result in disruption of operations.
Tapiador et al. [16] showed some events that were not reported
in syslogs and did not disrupt operations, but
should have been detected because they harmed the
quality of service provided to end users.
Thus, in order to evaluate the proposed ADS in terms
of detection and false alarm rates for different
anomaly characteristics, a parameterized definition of volume
anomalies was implemented. Two parameters are used
to define anomalies: $\alpha$, which is related to the amplitude
of the anomaly, and $\beta$, which represents its duration. The
polling interval of the traffic monitor is
10 seconds, so one day has 8640 traffic samples. If, during
monitoring, a traffic sample exceeds or stays below its
baseline by $\alpha\%$, an alert interval is opened. If within the alert
interval there are $\beta$ samples that exceed or stay below their
baseline by $\alpha\%$, the interval is considered anomalous and must be
detected by the proposed ADS.
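The parameterized definition can be illustrated with the short Python sketch below, where alpha stands for the amplitude parameter and beta for the duration parameter. Several details are assumptions, since the text does not fix them: the alert interval is taken to span a fixed number of samples (window, 30 by default, i.e. 300 seconds at a 10-second polling interval), the deviation is measured relative to the baseline value, and the names deviates and anomalous_intervals are hypothetical.

```python
def deviates(traffic, baseline, alpha):
    """True when traffic differs from its baseline by more than alpha percent."""
    if baseline == 0:
        return traffic != 0  # assumption: any traffic over a zero baseline deviates
    return abs(traffic - baseline) / baseline * 100 > alpha


def anomalous_intervals(traffic, baseline, alpha=60.0, beta=25, window=30):
    """Return (start, end) index pairs of alert intervals that contain
    at least beta deviating samples."""
    intervals = []
    i, n = 0, len(traffic)
    while i < n:
        if deviates(traffic[i], baseline[i], alpha):
            end = min(i + window, n)  # the alert interval opens here
            count = sum(deviates(t, b, alpha)
                        for t, b in zip(traffic[i:end], baseline[i:end]))
            if count >= beta:
                intervals.append((i, end))
            i = end  # continue after the alert interval
        else:
            i += 1
    return intervals


# Made-up series: traffic doubles for 30 consecutive samples.
baseline = [100] * 60
traffic = [100] * 10 + [200] * 30 + [100] * 20
print(anomalous_intervals(traffic, baseline))
```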
6. NUMERICAL RESULTS
To validate our ADS, we used a real network
environment at the State University of Londrina (UEL). The
traffic used in the experiment was monitored on
02/08/2010, from the ifInOctets SNMP MIB object of the UEL main
Web server. ifInOctets counts the total number of octets
received on the interface. The objective of the proposed algorithm
is to detect the large difference between the real traffic (reading)
and its baseline between 5 p.m. and 10 p.m., as can be
seen in Figure 1, which indicates a volume anomaly.
As seen in Section 5, for each traffic sample the PSO Alarm
system calculates the Euclidean distance between the monitored
data sample and its respective cluster centroid to
verify whether the sample is anomalous. Every time the monitored
traffic deviates significantly from the baseline,
a substantial variation in the Euclidean distance values takes
place, which can characterize a volume anomaly according to
the parameters specified by the network administrator. So, if
this distance exceeds the threshold value $\lambda$, the PSO Alarm
system triggers an alarm to notify the network administrator.
The evaluation of the proposed ADS is based on two
performance metrics: the detection rate, which is
the detection probability given by (5), and the false alarm
rate, which is the proportion of alarms that do not correspond to a
significant variation between the real traffic and the baseline,
according to (6). The variables used to calculate the detection
and false alarm rates are:
∙ $\mathit{anomalies\ detected}$: number of anomalies that were correctly detected.
∙ $\mathit{num\ anomalies}$: number of anomalies that occurred in the traffic.
∙ $\mathit{false\ positives}$: number of alarms that do not correspond to an anomalous situation.
∙ $\mathit{num\ alarms}$: number of generated alarms.
$$\mathit{detection\ rate} = \mathit{anomalies\ detected} / \mathit{num\ anomalies} \tag{5}$$
$$\mathit{false\ positive\ rate} = \mathit{false\ positives} / \mathit{num\ alarms} \tag{6}$$
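Both metrics are simple ratios; a small Python sketch follows. The function names and the convention of returning 0.0 when a denominator is zero are assumptions, and the counts in the usage note are made-up values chosen only to show the arithmetic.

```python
def detection_rate(anomalies_detected, num_anomalies):
    """Fraction of the anomalies in the traffic that were correctly detected."""
    if num_anomalies == 0:
        return 0.0  # assumption: no anomalies means nothing to detect
    return anomalies_detected / num_anomalies


def false_positive_rate(false_positives, num_alarms):
    """Fraction of the generated alarms that match no anomalous situation."""
    if num_alarms == 0:
        return 0.0  # assumption: no alarms means no false alarms
    return false_positives / num_alarms


print(detection_rate(8, 10), false_positive_rate(1, 20))
```

For instance, with 8 of 10 anomalies detected the detection rate is 0.8, and with 1 false positive among 20 alarms the false positive rate is 0.05.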
To assess the efficiency of the proposed ADS, one
class of anomalies, called class 1, was defined. This class
covers long-duration anomalies that exceed or stay
below their baseline by 60%. The parameters used to define
this class are $\alpha = 60\%$ and $\beta = 25$.
To find the value of $\lambda$ that
gives the best detection and false alarm rates
for the test day, the detection algorithm was run several
times with different values of $\lambda$ for anomalies of class 1.
Figure 3 describes the performance of the PSO-based ADS in
terms of the trade-off between detection and false alarm rates,
with $\lambda$ varying over the interval $[1 \ldots 100]$ for class 1. The
best threshold found for class 1 was $\lambda = 5.74 \times 10^{6}$. The
results confirm that the proposed method is useful for anomaly
detection, achieving its best trade-off at a detection rate of
82.92% and a false alarm rate of 2.85% for class 1 over the test day
traffic.
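The threshold search described above can be sketched as a simple sweep. The sketch below is an assumption-laden illustration: the selection criterion (largest difference between detection rate and false alarm rate) is not stated in the text, the rates are computed per sample rather than per anomaly, and all function names and data are hypothetical.

```python
def rates(distances, is_anomalous, threshold):
    """Detection rate and false alarm rate for one threshold value."""
    alarms = [d > threshold for d in distances]
    detected = sum(a and y for a, y in zip(alarms, is_anomalous))
    false_pos = sum(a and not y for a, y in zip(alarms, is_anomalous))
    num_anomalies = sum(is_anomalous)
    num_alarms = sum(alarms)
    dr = detected / num_anomalies if num_anomalies else 0.0
    far = false_pos / num_alarms if num_alarms else 0.0
    return dr, far


def sweep(distances, is_anomalous, thresholds):
    """One (threshold, detection rate, false alarm rate) tuple per candidate."""
    return [(th, *rates(distances, is_anomalous, th)) for th in thresholds]


def best_threshold(distances, is_anomalous, thresholds):
    """Assumed criterion: maximize detection rate minus false alarm rate."""
    results = sweep(distances, is_anomalous, thresholds)
    return max(results, key=lambda r: r[1] - r[2])[0]


# Made-up distances and labels, only to exercise the functions.
distances = [1, 2, 9, 10, 3]
labels = [False, False, True, True, False]
print(best_threshold(distances, labels, [0.5, 2.5, 5, 9.5]))
```

Plotting the detection rate against the false alarm rate for every entry returned by sweep gives a curve of the kind referred to as Figure 3.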
Figure 4 shows the alarms generated by the proposed ADS
for anomalies of class 1 with $\lambda = 5.74 \times 10^{6}$ for the test day. The
y-axis represents the Euclidean distances between samples and
baseline cluster centroids, the x-axis represents the time at which
they occurred, and the red dotted line represents the threshold $\lambda$. One
can observe that between 5 p.m. and 10 p.m. there is a wide
deviation between baseline and traffic, which was correctly
detected by the PSO Alarm system. For this day,
the detection and false alarm rates reached 82.92% and 2.85%
for anomalies of class 1, demonstrating that the proposed system
achieved excellent results against anomalous traffic.
Figure 3. ROC curve for the test day
Figure 4. Alarms for the test day
7. CONCLUSIONS AND FUTURE WORK
This paper presented the K-means algorithm
combined with particle swarm optimization for anomaly
detection. Experimental results from a real network
environment showed that the proposed method is capable of
detecting volume anomalies in real network traffic, achieving
satisfactory results.
The experiments were performed using data monitored
from the SNMP object ifInOctets of the main web server of the
State University of Londrina. The proposed clustering-based
anomaly detection algorithm showed robustness against false
alarms while maintaining a good detection rate, achieving
an 82.92% detection rate with a 2.85% false alarm rate for the
test day, as shown in Figure 3.
Our ongoing work is centered on expanding the proposed
detection model through the simultaneous monitoring of several
SNMP objects, in order to correlate them and classify
the anomalies. Thus, the proposed approach can be extended
to other types of anomalies while improving the detection and
false alarm rates.
REFERENCES
[1] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic
feature distributions,” SIGCOMM Comput. Commun. Rev., vol. 35,
no. 4, 2005.
[2] A. Kind, M. P. Stoecklin, and X. Dimitropoulos, “Histogram-based
traffic anomaly detection,” IEEE Transactions on Network and Service
Management, vol. 6, no. 2, June 2009.
[3] B. B. Zarpelão, L. S. Mendes, M. L. Proença Jr., and J. J. P. C.
Rodrigues, “Parameterized anomaly detection system with automatic
configuration,” in GC’09 CSS. 2009 IEEE Global Communications
Conference (IEEE GLOBECOM 2009), Communications Software and
Services Symposium, 2009.
[4] A. Patcha and J. M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Computer
Networks: The International Journal of Computer and Telecommunications Networking, 2007.
[5] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A
survey,” ACM Computing Surveys, vol. 41, no. 3, July 2009.
[6] M. Jianliang, S. Haikun, and B. Liang, “The application on intrusion
detection based on k-means cluster algorithm,” in International Forum
on Information Technology and Applications, 2009.
[7] L. Xiao, Z. Shao, and G. Liu, “K-means algorithm based on particle
swarm optimization algorithm for anomaly intrusion detection,” in
WCICA 2006 . The Sixth World Congress on Intelligent Control and
Automation, 2006, pp. 5854 – 5858.
[8] M. L. Proença Jr., C. Coppelmans, M. Botolli, and L. S. Mendes,
Security and reliability in information systems and networks: Baseline
to help with network management. Springer, 2006, pp. 149–157.
[9] R. C. Eberhart and J. Kennedy, “A new optimizer using particle
swarm theory,” in Proceedings of the Sixth International Symposium
on Micromachine and Human Science, 1995, pp. 39–43.
[10] Y. ling Zhang, Z. guo Han, and J. xia Ren, “A network anomaly
detection method based on relative entropy theory,” in Proceedings of
the 2009 Second International Symposium on Electronic Commerce and
Security, 2009, pp. 231 – 235.
[11] V. Sotiris, P. Tse, and M. Pecht, “Anomaly detection through a bayesian
support vector machine,” Reliability, IEEE Transactions on, pp. 277 –
286, june 2010.
[12] The third international knowledge discovery and data mining tools
competition data set KDD99-Cup. Available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[13] L. li Liu and Y. Liu, “MQPSO based on wavelet neural network for
network anomaly detection,” in Wireless Communications, Networking
and Mobile Computing, 2009. WiCom ’09. 5th International Conference
on, 2009.
[14] B. Firouzi, T. Niknam, and M. Nayeripour, “A new evolutionary
algorithm for cluster analysis,” in International Journal of Computer
Science, 2009.
[15] M. Thottan and C. Ji, “Anomaly detection in IP networks,” IEEE
Transactions on Signal Processing, vol. 51, no. 8, pp. 2191–2204, 2004.
[16] J. M. Tapiador, P. G. Teodoro, and J. E. D. Verdejo, “Anomaly detection methods in wired networks: a survey and taxonomy,” Computer
Communications, vol. 27, no. 16, pp. 1569–1584, October 2004.