Profiling Cyber Attackers Using Case-Based Reasoning
1 Introduction
A typical cyber-attack involves a substantial number of steps, each one often being a
cyber-attack in itself, such as reconnaissance, social engineering, remote installation of rootkits,
recruitment of bots and propagation of malware before attempting to hack into a target system.
At each step, the attacker leaves cyber traces, which can potentially lead to profiling and
ultimately identifying the person before the next step of an attack or, forensically, after it finishes.
While enormous attention has traditionally been placed on the profiling of criminals in physical
attacks and the identification of intention in the context of physical surveillance, the equivalent in
the context of cyber security has remained unexplored. This is considered a major challenge.
Henson et al. [1] have argued that cyber criminals' behaviour is different from that of
normal criminals and that, depending on their skills, experience, knowledge, techniques, educational
background, mode of operation and target, their profiles could vary immensely [2]. In terms of
technical means too, they are likely to adapt and devise new mechanisms continuously [3, 12].
Here, we examine whether a Case-based Reasoning (CBR) approach can help security and
forensic investigators to profile human attackers with regard to their behavioural (e.g. how risk
averse they are), demographic (e.g. gender) and technical characteristics (e.g. speed).
The structure of the paper is as follows: Section 2 will give an overview of the literature,
Section 3 will explain the methodology followed throughout this research and Section 4 will
present the research hypotheses and the relevant results. Finally, Section 5 will conclude this
research by summarising the outcomes and indicating the paths for future work.
2 Related work
Cyber-attacks have become one of the most serious types of crimes, as the damage they inflict
on the victim organisations can be severe. This section aims to investigate work related to cyber
profiling and identification of cyber attackers. Our focus is on the human attacker responsible
for the cyber-attack.
Work in the early 1990s by Landreth [21] attempted to classify hackers as novices, students,
tourists, crashers and thieves in an effort to reveal their motivation and individual
characteristics. The Hacker Profiling Project [22] has similarly attempted to codify the behaviour
and background of hackers with the use of questionnaires, aiming to reveal useful characteristics
such as age, demographics and personal attributes. Kjaerland [25] analysed incidents reported to
CERT/CC in order to classify attacker operation-related incidents and, in an attempt to expand the
classification window, presented the factors that were most likely to occur together. Work on
attackers' behaviour has also been conducted from a psychological point of view. Shaw et al. [26]
have argued that elements of malicious cyber activity may be related to a history of negative
social and personal experiences, lack of social skills, a sense of entitlement and ethical
flexibility. Watters et al. [27], working from a similar perspective, have attempted a qualitative
identification of cyber intruder profiles by conducting an ethnographic study of cyber-attacks.
However, these and similar pieces of work do not provide technical mechanisms that can
make use of a hacker's characteristics in practice and possibly in real time. Traditional
technical security measures have instead revolved around the characteristics of the attack rather
than the attacker. Take, for instance, botnet attacks. Defence usually concentrates on a large
number of distributed nodes in an effort to identify common patterns in inbound traffic, for
example by looking for similarities in network characteristics [10, 11, 12], data mining for the
identification of concurrent synchronisation relationships between bots [13], or passively
analysing DNS-based black-hole list lookup traffic [14]. In [4], Filippoupolitis et al. showed that
an approach that monitors and takes into account the characteristics of the attacker, as
observed from the side of the victim computer, can potentially help build a profile of the
attacker and tell whether it is a human or a bot, using a decision tree-based approach.
This work attempts to identify further whether human attacker identification is possible
by applying Case-based reasoning in a similar context. Case-based reasoning is founded on the
premise that similar problems have similar solutions. Its Retrieve, Reuse, Revise, Retain
process cycle [15] therefore retrieves the past cases most similar to the case under
investigation, then adapts and applies their knowledge in an attempt to provide a solution.
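The Retrieve, Reuse, Revise, Retain cycle can be illustrated with a minimal sketch. The class names, the toy similarity function and the feature keys below are hypothetical and chosen only for illustration; they are not taken from the system used in this study.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    features: dict   # problem description (hypothetical attribute names)
    solution: str    # known outcome or label for this case


@dataclass
class CaseBase:
    cases: list = field(default_factory=list)

    def retrieve(self, query, similarity):
        """Retrieve: return the stored case most similar to the query."""
        return max(self.cases, key=lambda c: similarity(query, c.features))

    def retain(self, features, solution):
        """Retain: store the solved problem as a new case."""
        self.cases.append(Case(features, solution))


def overlap_similarity(a, b):
    """Toy similarity: fraction of shared keys whose values are equal."""
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def solve(case_base, query, similarity, revise=lambda s: s):
    nearest = case_base.retrieve(query, similarity)   # Retrieve
    proposed = nearest.solution                       # Reuse
    confirmed = revise(proposed)                      # Revise (e.g. expert check)
    case_base.retain(query, confirmed)                # Retain
    return confirmed
```

In this sketch the Revise step defaults to accepting the reused solution unchanged; in practice it is where a domain expert or an adaptation rule would correct the proposal before it is retained.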
Case-Based Reasoning could help in decision support and provide reasoning in cases where
uncertainty and fuzziness are present, since reasoning is based on past evidence and not on
any proprietary rule system. A lot of work on reasoning upon event cases refers to the
workflow analogy, where work has been done by Minor et al. [16] on workflow adaptation, by Kyong
Joo Oh and Tae Yoon Kim [17] on financial traces monitoring and identification of daily
condition indicators, and by Kapetanakis et al. [8, 18, 19, 20], whose work has mainly focused on
business process monitoring and the detection of anomalous behaviour in changing business
processes.
With particular regard to Case-based reasoning and intrusion detection, CBR has been applied by
Schwartz et al. [23] to the Snort intrusion detection system and by Micarelli and Sansonetti [24] to
anomaly intrusion detection. The latter work has focused on rational architecture and
representation of potentially anomalous behaviour using CBR.
3 Methodology
This section will describe the methodology we adopted for this study in terms of the sample
features, the attacker characteristics and their classification based on the details of the experi-
mental data collection.
The main aim of this study is to identify an attacker's profile based on observable
characteristics collected from real systems that are susceptible to intrusion attempts. To
achieve this, a number of features that are closely related to potential cyber intruders
are evaluated, such as their skill level, risk aversion, education level, gender, predefined
goal, speed, mistakes, anti-forensic actions and success [4]. To carry out the study, 87
individuals were asked to attack a specified system that had a number of services running
(e.g. FTP server, web server, e-mail server, etc.). Each participant was given the IP
address of the target system and was prompted to attack it by whatever means necessary
to disable, control or stop the services. Once an attacker had successfully penetrated the
system, a cyber-intruder profiling tool detected this event and started recording changes
in the values of the system's observable features. While doing this, data were collected, coded
and stored for future use [4].
In the research process, successful attacks were listed, but every piece of information
from all of the attacks was used in the detection system. All actions were closely
monitored, such as any attempts by attackers to alter or delete log files after carrying out
their attack. Several factors were taken into consideration while defining each attack, for
example the way in which the attackers had gained, or were attempting to gain, access to
specific folders. The participants were finally asked to fill in a questionnaire that would
detail the values of non-observable features [4].
A Case-Based Reasoning technique was selected, based on the nature of the research problem,
in order to investigate whether intrusion attempts can be classified successfully.
More specifically, our system operated by identifying the characteristics of attackers while
their attacks were in progress. Any intrusion attempt was identified by relating and comparing
it to data from previous cases. For the needs of this work each attack was regarded as an
individual case for the CBR system. All cases were subject to analysis by forensic experts and
were classified based on their attack outcomes. These outcomes were used as evidence for the
solution part of the cases.
The ability of Case-based reasoning to express and reason upon specialised knowledge was
one of the main reasons it was chosen for this study. In addition, it uses simple knowledge
that has been well defined in the configuration stage, and its information can be understood
by the user more readily than that of, for example, rule-based systems [5].
Case-based reasoning was used as a complementary reasoning technique to the Machine
Learning one presented by Filippoupolitis et al. [4], based on the assumption that human
intrusion into cyber systems is predominantly characterised by fuzziness and uncertainty, as
identified in related research [6, 7]. Monitoring systems that deal with uncertainty
have been shown to be effective provided that a number of decisive measures are in place (e.g.
suitable temporal event representation, transformation to graph reasoning, pattern matching, etc.).
4 Profile Detection
For this work a number of experiments were designed and applied in an attempt to evaluate and
classify cyber profile behaviours. A pool of 87 real attack patterns was used as both
qualitative and quantitative evidence to formulate the case base. The investigated data
comprised information regarding the nature of the attack, trace evidence taken from the attack
environment and an expert ranking of the outcome of the attack (a team of human experts
classified all attacks in terms of success or failure). In addition, each case
contained profile information for the attacker in terms of background education, forensics
knowledge, networking expertise, etc. Each case was fully anonymised before being added to the
case base, and was cleansed of any redundant (noise) information.
For this work two main stages of experiments were conducted to incrementally build upon
reasoning and investigate the optimum evidence for argumentation while classifying potential
cyber-attacks. The two questions that we attempted to address were:
For the needs of the experiments MyCBR [9] was used (Fig. 1) as the main case-based reasoning
framework to accommodate the majority of them, using predominantly normalised Euclidean distance
to calculate similarity among attributes (equation 1).
Fig. 1. MyCBR [9] used for the calculation of similarities among cases
\[ d(x, y) = \sqrt{\sum_{i} \frac{(x_i - y_i)^2}{s_i^2}} \qquad (1) \]
where $s_i$ is the standard deviation of $x_i$, $y_i$ over the sample set of attributes.
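As an illustration of the normalised Euclidean distance, in which each squared attribute difference is divided by the variance of that attribute over the sample, the following is a minimal sketch. It does not use the MyCBR API; the function names, the use of the population standard deviation and the mapping from distance to similarity are all assumptions made for the example.

```python
import math
from statistics import pstdev


def attribute_std(cases):
    """Per-attribute standard deviation over all cases (population form, assumed)."""
    columns = zip(*cases)
    return [pstdev(col) for col in columns]


def normalised_euclidean(x, y, std):
    """Euclidean distance with each squared difference divided by s_i^2."""
    total = 0.0
    for xi, yi, si in zip(x, y, std):
        if si == 0:      # constant attribute carries no information; skip it
            continue
        total += ((xi - yi) / si) ** 2
    return math.sqrt(total)


def similarity(x, y, std):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed mapping)."""
    return 1.0 / (1.0 + normalised_euclidean(x, y, std))
```

Dividing by the standard deviation prevents attributes measured on large scales from dominating the distance, which matters when numeric attributes with very different ranges are combined in one case description.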
For the similarity calculation among evidence traces, a simple algorithm that counts events of
similar type (Components, equation 2) [8] was used, since it provides the necessary degree of
granularity among traces and has been proven effective [8] in the identification of quantitative
event patterns among trace data.
(2)
( )
where $N_i$ is the number of events of type $i$ common to both event traces, and $N_{total}$ and
$N'_{total}$ are the total expected numbers of events in traces $C$ and $C'$ respectively.
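A count-based trace similarity of this kind can be sketched as follows. The exact normalisation of equation 2 is not reproduced here: the sketch assumes that the number of shared events is divided by the length of the longer trace, and the event names in the usage note are hypothetical.

```python
from collections import Counter


def components_similarity(trace_a, trace_b):
    """Count events of each type shared by two traces, scaled to [0, 1]."""
    count_a, count_b = Counter(trace_a), Counter(trace_b)
    # N_i: events of type i present in both traces (multiset intersection).
    shared = sum((count_a & count_b).values())
    # Assumed normalisation: divide by the larger of the two trace lengths.
    longest = max(len(trace_a), len(trace_b))
    return shared / longest if longest else 1.0
```

For example, the traces `["login", "ls", "rm_log", "ls"]` and `["login", "ls", "wget"]` share one `login` and one `ls` event, giving 2 / 4 = 0.5 under this assumed normalisation.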
For the needs of the initial experiments, Euclidean distance similarities were applied with
empirical weights based primarily on the attributes and on experience rankings from
domain experts. As an initial result of the similarity measures, an attribute matrix was created
indicating two clusters of successful and unsuccessful attacks within the case base.
The frequency rates of the attributes can be seen in Fig. 2 below:
Fig. 2. Sample of the successful attacks cluster showing the frequency rates of attributes
As can be seen from Fig. 2, in addressing the first research question it was found
that successful attackers were in the majority male (33 successful attacks by male users versus 5
by female users), mainly had a high educational profile, were between 25 and 35 years old and had
a lower frequency of syntax and command mistakes, which is consistent with the findings of
Filippoupolitis et al. [4]. The second main cluster, containing the majority of unsuccessful
attacks, could also be characterised with specific relevance to the gender, education, risk and
skill attributes, potentially indicating a pattern for future identification (Fig. 3).
Fig. 3. Sample of unsuccessful attacks cluster indicating the attribute frequency rates
Finally, in order to answer the second research question, CBR was applied to classify a randomly
selected sample from the case base. For this stage, random samples of 10% to 12%
of the case base were selected and their classification information was hidden. CBR was called
to classify them using similarity measures and 3NN classification. The experiments were
conducted 10 times for each case and the results were averaged. With the selected case base, CBR
showed variable accuracy between 60% and 80%, with an average classification rate of 69%
over 6 different samples and approximately 540 to 600 iterations (10 repetitions x 6 samples x
9 or 10 cases per sample). All the samples contained a random selection of human attacks upon
which CBR was called to reason. CBR showed similar efficiency in the classification of both
intruders and non-intruders, with precision of 67% and 72% respectively. This level of accuracy
can be regarded as promising, since both of the stated questions have been addressed by the
findings.
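The hold-out procedure described above can be sketched as follows, assuming that each case is a (feature vector, label) pair and that plain Euclidean distance stands in for the similarity measures used in the study. The function names and parameters are hypothetical and this is not the MyCBR implementation.

```python
import random
from collections import Counter


def knn_vote(query, case_base, distance, k=3):
    """Label a query by majority vote among its k nearest cases."""
    nearest = sorted(case_base, key=lambda case: distance(query, case[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def holdout_accuracy(cases, distance, fraction=0.1, k=3, seed=0):
    """Hide the labels of a random sample and classify it against the rest."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(fraction * len(shuffled)))
    test, train = shuffled[:n_test], shuffled[n_test:]
    hits = sum(knn_vote(x, train, distance, k) == label for x, label in test)
    return hits / n_test


def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
```

Repeating `holdout_accuracy` with different seeds and averaging the results mirrors the repeated sampling described in the text; with two labels and k = 3 the vote can never tie.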
However, greater variation in the form of a different data sample could affect the CBR output,
since the current case base contained attack snapshots in controlled environments. Table 1
below shows a snapshot of the executed experiments. A brief explanation of the presented
columns and rows: column Case id refers to the anonymised cases; Actual Ranking refers to
whether the investigated case was an attack or not; the Ranking columns refer to the k nearest
neighbours of each investigated case; and Averaged refers to the final decision for the case
based on its neighbours' classification. Finally, F refers to an unsuccessful attack whereas S
refers to a successful one.
Case id  Actual   Ranking                                       Averaged
         Ranking  1st  2nd  3rd  4th  5th  6th  7th  8th  9th   3NN vote
                  3NN  3NN  3NN  3NN  3NN  3NN  3NN  3NN  3NN
0346A    F        S    S    S    F    F    S    S    F    F     False Positive
0353A    F        F    F    F    F    S    S    F    F    F     F
0343C    S        S    S    F    F    S    S    S    S    F     S
0348C    S        S    S    S    S    S    S    S    S    S     S
0357D    S        S    S    F    F    F    F    F    S    S     Missed Negative
0360A    S        S    S    S    S    S    F    S    F    F     S
2054     F        F    F    F    F    F    F    F    F    F     F
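The averaged decision in Table 1 can be checked with a short sketch. It assumes that the averaged 3NN vote is a simple majority over the nine rankings, which is consistent with every row shown; the rows are transcribed from the table and the function names are hypothetical.

```python
from collections import Counter

# Rows transcribed from Table 1: (case id, actual outcome, nine 3NN rankings).
ROWS = [
    ("0346A", "F", "SSSFFSSFF"),
    ("0353A", "F", "FFFFSSFFF"),
    ("0343C", "S", "SSFFSSSSF"),
    ("0348C", "S", "SSSSSSSSS"),
    ("0357D", "S", "SSFFFFFSS"),
    ("0360A", "S", "SSSSSFSFF"),
    ("2054", "F", "FFFFFFFFF"),
]


def averaged_vote(rankings):
    """Majority label over an odd number of S/F rankings (assumed rule)."""
    return Counter(rankings).most_common(1)[0][0]


def agreement(rows):
    """Fraction of rows whose averaged vote matches the actual outcome."""
    hits = sum(averaged_vote(r) == actual for _, actual, r in rows)
    return hits / len(rows)
```

Under this assumption, case 0346A is voted S although its actual outcome is F, and case 0357D is voted F although its actual outcome is S; these are the two rows flagged in the table, so five of the seven rows shown are classified correctly.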