Detecting Botnets Using MapReduce
Jerome Francois, Shaonan Wang, Walter Bronzi, Radu State, Thomas Engel
University of Luxembourg Interdisciplinary Center for Security, Reliability and Trust
6 rue R. Coudenhove-Kalergi, L-1359 Luxembourg, Luxembourg firstname.lastname@uni.lu
I. INTRODUCTION
Attacks on the Internet have greatly increased in number and variety, leading to the emergence of defense techniques including firewalls, IDS (Intrusion Detection Systems) and antivirus software. However, recent studies have shown that new attacks are hard to detect [3], [4]. Thus, forensics is required for understanding these attacks and measuring their impact. This helps to recover the system to a safe state and to counter such attacks in the future. From a network point of view, attacks are increasingly distributed. Botnets are one of the major threats [5] and have evolved from a centralized model towards a decentralized, highly scalable architecture [6] based on peer-to-peer (P2P) networks [7].
Thus, detection and forensic analysis have to shift from the edges to the core of the network, i.e., to the operators (Internet Service Providers, ISPs). However, ISPs have to deal with a huge volume of traffic, while fast and efficient analysis is required. Many forensic tools still rely on manual analysis [8]. Hence, new assisted approaches have appeared, relying on host dependencies [3], profiling host behaviors [9], or using deep packet inspection [10]. Due to scalability issues in high-speed networks, common solutions avoid packet inspection and focus exclusively on NetFlow [11] data. NetFlow provides an aggregated view of the network traffic that excludes packet content and thus avoids many of the privacy issues which have to be considered in forensic analysis [9].
Therefore, we propose to detect the new generation of botnets from large NetFlow datasets, such as those gathered by each individual operator. We extend our previous approach [12] by leveraging cloud-computing paradigms, especially MapReduce [13], to detect densely interconnected hosts which are potential botnet members.
WIFS 2011, November 29th-December 2nd, 2011, Foz do Iguaçu, Brazil. 978-1-4577-1019-3/11/$26.00 © 2011 IEEE.
C. Hadoop
Our botnet detection method is based on Hadoop [2], an open-source implementation of MapReduce. A common deployment of a Hadoop cluster is represented in the lower part of figure 1, with a master node and slave nodes. The first key component is the Hadoop Distributed File System (HDFS) for storing data. The namenode daemon maintains the file namespace (the directory structure and the location of file blocks). The blocks themselves, however, are stored directly on the slaves (datanodes) with guaranteed redundancy. Although the master node is the entry point for locating data, scalability is ensured by direct data transfers between entities (slave machines, users) without forwarding through the master.
On the application side, the jobtracker takes a MapReduce job as input and is responsible for coordinating (task assignment) and monitoring the map and reduce tasks. To improve robustness, the jobtracker and namenode daemons may be executed on different master machines. Unlike the number of reduce tasks, the number of map tasks is determined automatically (the user can only give a hint) according to the data distribution on HDFS. This applies the main MapReduce paradigm, which aims to move the code to where the data is. When a node is overloaded, it transfers data blocks to another node, which then runs the code. The user does not have to deal with any of these aspects, which is a strength of Hadoop.
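The map/shuffle/reduce flow that Hadoop automates can be illustrated with a minimal single-machine simulation (plain Python rather than Hadoop's Java API; the record format and counting job are purely illustrative, not the paper's detection algorithm):

```python
from collections import defaultdict

# Toy NetFlow-like records: (source IP, destination IP) pairs (hypothetical data)
records = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.2"),  # a second flow between the same pair
]

def map_phase(record):
    """Map task: emit one intermediate (key, value) pair per input record."""
    src, dst = record
    yield (src, dst), 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: aggregate all values sharing one key."""
    return key, sum(values)

intermediate = [pair for rec in records for pair in map_phase(rec)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)
# {('10.0.0.1', '10.0.0.2'): 2, ('10.0.0.1', '10.0.0.3'): 1, ('10.0.0.2', '10.0.0.3'): 1}
```

In a real cluster, each map task runs on the datanode holding its input block and the shuffle moves intermediate pairs over the network, which is exactly the bookkeeping the jobtracker and HDFS hide from the user.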
D. Detection
[Figure: (b) MapReduce: mappers emit intermediate values that are shuffled to the reducers executed by the reduce tasks.]
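The actual link-analysis algorithm is detailed in [12]; purely as an illustration of how "densely interconnected hosts" can be surfaced in one MapReduce pass, the following sketch counts host degrees over a toy dependency graph (hypothetical data and threshold, a simplified stand-in for the paper's method):

```python
from collections import defaultdict

# Toy host dependency graph as directed (src, dst) links (hypothetical data)
links = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("C", "A"), ("E", "A"),
]

def map_degree(link):
    """Map task: emit each endpoint once per link so the reducer can count degrees."""
    src, dst = link
    yield src, 1
    yield dst, 1

def shuffle(pairs):
    """Group intermediate counts by host (the framework's shuffle step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

degrees = {host: sum(vals)
           for host, vals in shuffle(p for l in links for p in map_degree(l)).items()}

# Hosts whose degree exceeds a threshold are candidate P2P botnet members
THRESHOLD = 3
candidates = sorted(h for h, d in degrees.items() if d > THRESHOLD)
print(candidates)  # ['A']
```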
V. EVALUATION
A. Datasets and methodology
Because no public dataset with labeled botnet traffic at a scale equivalent to an ISP's is available, our evaluation is based on NetFlow data provided by a major Luxembourg Internet operator. This data is considered free of botnets, as it was checked by a traffic screening solution dedicated to ISPs [21]. The dataset involves around 16 million hosts within 720 million NetFlow records. The corresponding dependency graph contains 57 million links. Then, synthetic additional records reflecting the topology of three P2P protocols are added in a similar manner to [12]: Chord [22], Kademlia [23] and Koorde [24]. A P2P network maps each peer to an ID defined in a huge ID space, for example with a maximal node ID equal to 2^160. This paper focuses on structured P2P protocols since they guarantee high performance. Chord is a pioneering work whose routing complexity is O(log N). Koorde is an extension of Chord with a lower complexity: O(log N / log log N). Finally, Kademlia was chosen because it is widely used in the real world, for file sharing but also for botnet communications [7].
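The O(log N) bound for Chord routing can be made concrete with a small simulation of an ideal ring where every ID in a 2^m space is occupied by a node (an illustrative sketch, not the topology-generation code used to build the synthetic traces):

```python
# Ideal Chord ring: N = 2**m nodes occupy every ID in the space.
m = 10
N = 2 ** m  # 1024 nodes

def chord_hops(start, target):
    """Greedy Chord routing: each hop follows the largest finger 2**i
    that does not overshoot the clockwise distance to the target."""
    hops = 0
    current = start
    while current != target:
        distance = (target - current) % N
        step = 1 << (distance.bit_length() - 1)  # largest power of two <= distance
        current = (current + step) % N
        hops += 1
    return hops

# Worst case: the target sits just behind us on the ring, so every one
# of the m = log2(N) fingers is used once.
print(chord_hops(0, N - 1))  # 10
```

Each hop clears the most significant bit of the remaining clockwise distance, so no lookup takes more than m = log2(N) hops, which is the log(N) complexity cited above.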
Although the efficiency evaluation was done using the entire dataset (section V-C), generating synthetic botnet traces requires a lot of computation. Hence, the first experiment (section V-B), about detection accuracy, is based on a subset of the dataset: 2,133k records and 323k hosts. To strengthen the evaluation, only 1% of IP addresses are used to generate bot traces (a stealthy botnet).
[Figure: average iteration time (s) versus the number of links (#links, from 10^5 to 10^8) for 1, 4, 8 and 11 slaves.]
a huge dataset, since this ratio seems stable for 100 million links or more.
VI. RELATED WORK
NetFlow records have also been employed in the past for detecting various attacks, including botnets [25]. Like the rest of this paper, this section focuses on traffic analysis for botnet detection, botnets being one of the major threats on the Internet.
Since the tools and/or datasets of other approaches are not publicly available, only qualitative aspects are considered. By tracking hosts that look up specific DNS names, infected hosts may be detected [26]. For P2P botnets, an active honeypot can retrieve infected hosts by crawling the P2P network [27]. Another way to detect bots is to look for malicious activities such as spamming or scanning [28]. Correlating such activities with potential C&C communication patterns may be helpful [29], [30].
Closer to our approach, other works also consider graphs, especially host interactions [31], and complex analytic methods may be employed to detect P2P networks [32], [33]. Such traffic classification is also performed by BLINC [34]. Discovering service or host dependencies may be helpful for network management tasks [35]. BotGrep [36] uses a random walk technique to discover botnet clusters within an interaction graph. Link analysis is leveraged in our previous work based on flow dependency graphs [37] and host dependency graphs [38] for finding the root cause of attack traffic. This technique is refined in this paper to detect botnets. Although in [12] we focused our evaluation on detection performance in different scenarios, this paper is dedicated to showing the viability of the approach in a cloud-computing environment, by highlighting computational performance benefits and the specific use cases where it is most relevant (depending on the underlying data). Other works, like [39], aim at dividing the traffic to be analyzed among different intrusion detection systems before correlating the generated alerts. Finally, [40] is the closest related work, since they also propose a cloud-computing based botnet detection. The main difference with our approach is that it is focused on detecting malicious activities, especially spam, to create