Malicious Behavior Detection Using Windows Audit Logs: Konstantin Berlin David Slater Joshua Saxe


Malicious Behavior Detection using Windows Audit Logs


Konstantin Berlin, David Slater, Joshua Saxe
Invincea Labs, LLC, Arlington, VA, USA
kberlin@invincea.com, david.slater@invincea.com, josh.saxe@invincea.com
arXiv:1506.04200v2 [cs.CR] 25 Aug 2015

∗Corresponding author.

ABSTRACT

As antivirus and network intrusion detection systems have increasingly proven insufficient to detect advanced threats, large security operations centers have moved to deploy endpoint-based sensors that provide deeper visibility into low-level events across their enterprises. Unfortunately, for many organizations in government and industry, the installation, maintenance, and resource requirements of these newer solutions pose barriers to adoption and are perceived as risks to organizations' missions.

To mitigate this problem we investigated the utility of agentless detection of malicious endpoint behavior, using only the standard built-in Windows audit logging facility as our signal. We found that Windows audit logs, while emitting manageably sized data streams on the endpoints, provide enough information to allow robust detection of malicious behavior. Audit logs provide an effective, low-cost alternative to deploying additional expensive agent-based breach detection systems in many government and industrial settings, and can be used to detect, in our tests, 83% of malware samples with a 0.1% false positive rate. They can also supplement already existing host signature-based antivirus solutions, like Kaspersky, Symantec, and McAfee, detecting, in our testing environment, 78% of malware missed by those antivirus systems.

1. INTRODUCTION

Recent, public, high-profile breaches of large enterprise networks have painfully demonstrated that cybersecurity is rapidly becoming one of the more daunting challenges for large organizations and government entities. The challenge stems from the relative ease with which malware authors can permute, transform, and obfuscate their cyberweapons in order to avoid signature-based detection, the dominant approach for detecting cyber threats.

One strategy for combating obfuscation of the malware signature is to perform detection on the actual dynamic behavior of software, since obfuscating behavior is potentially harder, and requires significantly more research and development to create new behavior infection vectors [15]. Currently, collecting such dynamic behavior on an enterprise network is a huge challenge, primarily because it requires instrumentation of individual machines, or redirection of files to be run on a centralized virtual environment (e.g., FireEye, Lastline), which can be costly, can require installation of third party software, and can be difficult to maintain. Finally, most newer systems are typically perimeter defenses, where the malware is analyzed in a virtual environment, and can do nothing in the case when software manages to bypass them (e.g., by requiring a reasonably complicated user interaction in order to unpack itself and execute).

At the same time most organizations have existing infrastructure for monitoring their endpoint machines directly, called Security Information and Event Management (SIEM) systems, which collect vast amounts of security information from actual software being executed on an endpoint. This is an important distinction, since during actual execution, malware has no choice but to expose its behavior. Unfortunately, this information is typically used for forensic analysis, rather than detection. In the cases when detection is done, it is usually in the form of a network intrusion detection system (IDS), where the focus is to discover anomalous communication on the enterprise network, rather than malicious software behavior on a host [21].

One low-hanging fruit not currently utilized in dynamic behavior detection is the Windows audit logs. Windows audit logs can be easily collected using existing Windows enterprise management tools [5], and are built into the operating system, so there is a low performance and maintenance overhead in using them. In our work we investigate the potential of these audit logs to augment existing enterprise defense tools. Our main contributions are as follows:

• We demonstrate a scalable computational approach that processes audit logs into an interpretable set of features, which can then be used to build a malware detector.

• We demonstrate that a linear classification model, using a small subset of audit log features, is able to detect malicious behavior in a simulated enterprise setting with high accuracy.

• We show that this classifier can provide significant detection of malware that is completely missed by top
antivirus vendors, thus providing a cheap value-add to already installed host-based systems.

• We describe some interesting malware behavior that we discovered to be significant signals of malware, which supports our claim that our detection works on malicious behavior rather than behavioral signatures.

The rest of this manuscript is organized into several major sections. We start with Section 2, where we present relevant background information relating to malware detection techniques and machine learning. Then, in Section 3, we describe the dataset of audit logs that we use to compute and evaluate our machine-learning classifier, as well as how we determined classification labels for these logs. In Section 4 we describe how we extracted relevant features from the collected logs, and, based on these extracted features, how we learn our classifier. We show our results in Section 5, and end with a conclusion in Section 6.

2. BACKGROUND

Malware detection can be roughly split into two major approaches: i) detection based on static analysis, where the suspected executable file is scanned for malicious content; and ii) dynamic behavior detection, where the detection is done on the behavior of the executable as it is running [8].

In the commercial antivirus industry, static analysis has been the dominant approach for detection, and is typically implemented by blacklisting malware executables based on their byte signatures. Blacklisting using signatures is very effective in actual deployment, since, when properly done, the signature detector can have virtually zero false positives, with detection rates ≥ 90% for common malware [14, 7]. Unfortunately, since detection is done on signatures, small modifications of the source code, compiler, or the binary would evade the signature method. Even more advanced detection methods are still susceptible to simple obfuscation techniques [15]. Since small modifications of malware can be done at a low cost, the result has been an explosion in new malware being observed in the wild, and a significant lag between when a piece of malware is first observed and the time antivirus engines are able to detect it [20].

A complementary approach to static detection is dynamic behavior detection. Behavior is more difficult to obfuscate, so the advantage of dynamic analysis is that it is harder to recycle existing malware. Software is typically observed by automatic execution in virtualized environments (e.g., [2]), prior to allowing it to run on the endpoint system, in order to prevent infection. Automated execution of software is difficult since malware can try to detect if it is running in a sandbox, and prevent itself from performing malicious behavior, which essentially leads to an arms race between dynamic behavior detectors using a sandbox and malware [1, 10]. A somewhat related concept to dynamic tracing is taint analysis [22, 9], which requires significant instrumentation to the underlying system and can suffer from its own set of issues [18].

While we are proposing a dynamic behavior detection system, our approach is complementary to the above described methods, since the detection is done after the malware bypasses perimeter defense layers (like antivirus and sandbox) and executes behavior directly on the endpoint, where it has no choice but to behave maliciously if it is to accomplish its goal. Instead of proposing a custom agent-based approach, our work is motivated by the practical issues that typically prevent deployment of any new tools, no matter how effective they might be, such as reticence of IT departments to install new software, additional maintenance requirements, and costs. In our approach we mitigate those issues by taking advantage of already existing Windows audit log collection mechanisms in most SIEMs. While such an approach limits the type of event data that we can use to what is currently recorded by Windows audit logs, collection of this data can be implemented on a network administration level without any software installation, requires no additional maintenance of the hosts, and, as we describe later, induces minimal network and system overhead.

2.1 Machine Learning

Machine learning has been used extensively in static and dynamic malware detection (e.g., see a recent review [12]), though the potential performance in an enterprise setting is hard to judge, since results are typically computed on different, small datasets. The detection problem is commonly formulated as either a classification or an anomaly detection problem, depending on the type of data that is used.

In anomaly detection, malicious behavior is characterized as deviation from the "normal" expected behavior. The advantage of such an approach is that only normal behavior needs to be collected for training, which is very easy to get in large volumes on an enterprise network. In cybersecurity, anomaly detection methods are typically applied to network intrusion detection problems, where network flow traffic is analyzed [21].

Trying to detect abnormal behavior can be problematic when dealing with active users, since behavior can periodically deviate from expectation. For example, when new software is installed, job assignment changes, or a new user takes over a computer temporarily, the audit logs can deviate from expectation. In an enterprise setting, false alarms can have a very detrimental effect, since they can cause the administrators to lose faith in the detector. In this situation, we can potentially improve detection by also utilizing malware-associated signals. We therefore focus on expressing audit log malware detection as a classification problem, in the hopes that this will yield a more robust solution.

3. EXPERIMENTAL DATA

In this section we describe how we set up the Windows Audit Log to collect the required data, as well as the actual dataset that we will use for learning and evaluation. We start by describing our Audit Log settings and the types of behavior that we record. We then describe our efforts to collect real enterprise audit logs using these settings, as well as a set of diverse software samples (benign and malicious) that we run through a virtual sandbox environment. Finally, we discuss how we decide which binaries are benign and which are malicious for use in the training and testing of our detection approach.

3.1 Configuration

Collecting Windows audit logs, while straightforward, requires defining the type of system objects that will be monitored (e.g., files or registry keys, network events), what types of access to those objects should be recorded (reads, writes, permission changes, etc.), and whose accesses should be monitored (users, system, all, etc.). The configuration
for this information is held in three places: i) the security access control list (SACL), which controls the access types audited for all file and registry objects; ii) the local security policy, which controls what sorts of objects and events generate events in the log; and iii) the Windows firewall, which partially controls logging for network specific events, such as network connections. The specific policy can easily be applied to the entire enterprise network using domain controllers.

While turning on all logging mechanisms would provide the most coverage of possible behaviors, the volume of log events quickly overwhelms commodity storage systems and significantly impacts the performance of the monitored computers. To filter out the vast majority of these events while maintaining their utility in detecting malicious behavior, we removed read events from our collection, as nearly all malicious behavior involves some form of modification to the system (reconnaissance is one notable exception, though the software still has to gain permanence). We also did not record network events, due to the limitation of our sandbox environment. In total, we end up only recording file/registry writes, deletes, and executes, and process spawns. Our settings generated about 100-200 MB of data (representing about 300-600 thousand events) per computer per day on an enterprise network, with peak volume days being about double that. The logs can be compressed for long-term storage at a ratio of 16:1. No performance degradation was noticed by any users of the monitored machines. Thus, activating the additional logging needed for our approach does not pose a heavy storage or network load for a modern enterprise network, and has negligible memory or computational overhead.

3.2 Collection

Our collected audit logs consist primarily of two sources: i) the enterprise logs collected from users of an enterprise network; and ii) sandboxed virtual machine based runs on a set of malicious and benign binaries. While the enterprise logs represent exactly the environment on which we propose to deploy our system, we are unable to collect from them any of the malicious behavior that we require for machine learning.

The enterprise logs we collected came directly from an internal enterprise network, recorded using the described audit log configurations. These logs come from four Windows computers actively used by members of our sales, executive, and IT teams.

To generate malicious audit logs, as well as to diversify the benign behaviors recorded, we created a virtual sandbox environment in which we run several collections of binaries. We used Cuckoo Sandbox (CuckooBox) [2], an open source sandbox, in conjunction with VirtualBox [4] running Windows 7 SP1 VMs to automatically generate these audit logs from the binaries. We ran about twenty thousand binary samples through CuckooBox, with each sample taking an average of four minutes per run, with the maximum allowable runtime set to ten minutes. We will collectively refer to this dataset of logs generated by running through CuckooBox as the CuckooBox dataset. The binaries were sampled from our collection of portable executable (PE) format Windows files:

• (MAL2M) Over two million malware samples collected in 2012. While the majority of the files are malware, a small fraction is benignware incorrectly labeled as malware.

• (MAL3P) Around one thousand malware binaries manually created by a third party. All these files are known to be malware.

• (MALAPT) Five sets of binaries recovered from high profile APT cyber incidents. All these files are known to be malware.

• (UVPN) Around sixteen thousand binaries downloaded by virtual private network users that we received from our corporate partner. These binaries consist of downloaded files sampled from several million active end users, providing a recent and realistic set of mixed malicious/benign binaries, allowing us to estimate real world binary execution behavior. The set contains a mix of benign and malware executables.

• (OS) Windows XP Service Pack 3 and ReactOS (open source Windows clone) system files. All these files are known to be benignware.

Note that our CuckooBox dataset of audit logs contains both malware and benign audit logs.

3.3 Labeling

In order to use classification algorithms to build a detector we must provide classification labels of "malware", 1, or "benign", −1, for all of our samples. Given the non-negligible chance of benignware occurring in the MAL2M malware dataset, and the lack of any a priori classification of the UVPN binaries, we ran all the used binaries through VirusTotal. VirusTotal uses approximately 55 different malware engines to see if any of them label the executable as malware [3]. We define the VirusTotal score s, where 0 ≤ s ≤ 1.0, as the number of detections divided by the number of malware engines. The distribution of scores on our binary dataset is shown in Fig. 1A.

To reduce the chance of mislabeling we remove all binaries with ambiguous scores of 0 < s < 0.3, based on the valley centered at around s = 0.3 in Fig. 1A. Any binary with a score of s = 0 we label as benignware, and any binary with s ≥ 0.3 we label as malware. The exceptions are binaries for which we know the correct labels based on the data source: MAL3P and MALAPT we always label as malicious; and OS binaries we always label as benign.

Our CuckooBox audit log dataset contains 981 different malware families, as characterized by Kaspersky antivirus. As shown in Fig. 1, our corpus is well distributed over the size of the malware family, its VirusTotal score (indicating whether it is considered malicious or benign), its creation time (with some very recent ones coming from the VPN dataset), and the relative popularity of specific audit log events produced when running a binary. In particular, relative popularity is ranked by the number of times a given event occurred, where an event is represented by a string tuple ⟨a, b⟩, where a is the action (e.g., spawn, write, etc.) and b is the associated file path or regularized path. Note that this is not the popularity of the malware binary itself.

Out of 14679 confirmed malicious binaries that we successfully ran through CuckooBox, Kaspersky detected 88% of malware, McAfee 95%, and Symantec 90%. If we combine the above engines into a "meta"-engine, where we consider a detection valid if at least one of the engines detects malware, then for McAfee+Symantec we have 96% detection, and for McAfee+Symantec+Kaspersky we have 98% detection.
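
The labeling rule of Section 3.3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names (`virustotal_score`, `label_binary`, `meta_engine_detects`), the use of plain strings for the data-source tags, and the use of `None` to mark removed samples are assumptions introduced here. Only the thresholds (s = 0 is benign, s ≥ 0.3 is malware, 0 < s < 0.3 is removed), the source-based overrides, and the "at least one engine" meta-engine rule come from the text.

```python
from typing import Iterable, Optional

# Data sources whose labels are known a priori (Section 3.3).
KNOWN_MALICIOUS_SOURCES = {"MAL3P", "MALAPT"}
KNOWN_BENIGN_SOURCES = {"OS"}
MALWARE_THRESHOLD = 0.3  # scores strictly between 0 and this value are ambiguous


def virustotal_score(num_detections: int, num_engines: int) -> float:
    """Score s in [0, 1]: number of detections divided by number of engines."""
    if num_engines <= 0:
        raise ValueError("num_engines must be positive")
    return num_detections / num_engines


def label_binary(source: str, score: float) -> Optional[int]:
    """Return 1 (malware), -1 (benign), or None (ambiguous, removed)."""
    # Source-based overrides take precedence over the VirusTotal score.
    if source in KNOWN_MALICIOUS_SOURCES:
        return 1
    if source in KNOWN_BENIGN_SOURCES:
        return -1
    if score == 0:
        return -1
    if score >= MALWARE_THRESHOLD:
        return 1
    return None  # 0 < s < 0.3: dropped from the dataset


def meta_engine_detects(engine_verdicts: Iterable[bool]) -> bool:
    """A 'meta'-engine detection is valid if at least one engine detects."""
    return any(engine_verdicts)
```

For example, under these assumptions a UVPN binary flagged by 5 of 55 engines has s ≈ 0.09 and is removed, while one flagged by 20 of 55 engines has s ≈ 0.36 and is labeled as malware.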
[Figure 1, four panels. (A) x-axis: VirusTotal Score (s); y-axis: Fraction of Classified Binaries. (B) x-axis: Relative Size of Malware Family; y-axis: Fraction of Malware Binaries. (C) x-axis: Year Created; y-axis: Fraction of Binaries. (D) x-axis: Popularity Rank; y-axis: Number of Occurrences; both axes logarithmic.]

Figure 1: Characterization of our database binary files and logs. (A) The VirusTotal classification of all the binaries run through CuckooBox with s > 0. (B) The distribution of our malware binaries as a function of family size. (C) The distribution of compile timestamp (or file creation date, if timestamp is unavailable) in our CuckooBox run binary files. Extreme values (due to corrupted stamps) are not shown. (D) The popularity of events produced in CuckooBox runs of binaries, ranked by the number of occurrences (where events are defined by action type and target). This demonstrates a long tail of observed malicious events.

Note that our detection percent for various antivirus engines is potentially more optimistic than in a real deployment environment for several reasons: i) a large fraction of our malware is more than two years old; ii) while we potentially have newer malware in the VPN binary dataset, there is a chance that it is not detected by more than 30% of antivirus engines, and so is not included in our statistics; and iii) the antivirus engines are potentially set up with more aggressive detection settings than they would otherwise be when actually deployed.

In the rest of the manuscript we will refer to y as the classification vector of size M, where $y_i \in \{-1, 1\}$ is the ith sample's classification, M is the number of observations in the dataset, and −1 and 1 denote benign and malicious, respectively.

4. METHOD

Our method for deriving a classifier from our audit logs is divided into two stages. In the first stage we perform the feature engineering, where we use domain knowledge to extract features out of audit logs. In the second stage we use machine learning to compute a classifier that classifies audit logs based on the extracted features.

4.1 Feature Engineering

Our malware logs, on average, are 4 minute long recordings of the whole system during which a binary was executing in CuckooBox. On the other hand, the enterprise logs are continuous recordings lasting for hours at a time. We therefore split all our enterprise logs into disjoint 4 minute windows to mimic our CuckooBox runs. In order to remove some common host specific artifacts from file paths and registries, we ran them all through a series of regular expressions that abstracted out the host specific strings (e.g., username in the path, UIDs in the registry, etc.).

The most direct approach for mapping events in the log to features is to represent each single event as a single feature, and represent each log as a bag-of-words of individual events. However, the time order between events is lost in such a mapping. This potentially might not significantly impact results, since malware can be multithreaded, and infect multiple processes at the same time, so it is not possible to unambiguously order all log events. However, some orderings have meaning, since an event could represent different behavior depending on the context.

One improvement over the one-to-one mapping, which partially preserves order, is to represent all sequences of q contiguous, time-ordered events (event q-grams) in the log as features, for some number q. For example, one feature could be the execution of "a.dll" followed immediately by the execution of "b.dll", rather than two independent features. The drawback of such an approach is that it causes an exponential increase in the number of features, limiting how big we can computationally make q. Also, as we make q larger we are potentially getting features that do not generalize as well, so the utility of computing q-grams for large q is low. In our approach, we group all events in a log by their associated process IDs, extract all q-grams for each process, and then aggregate them together into one large feature set of q-grams that represents the entire log. We use q ∈ {1, 2, 3}, a range of sizes we determined to be a practical compromise between computational complexity and quality of results. Representing all events as q-grams for q = 1, 2, 3 resulted in about seven million unique features.

We can now represent our combined benign and malware audit log dataset simply by an M × N matrix A, where entry $a_{ij}$ is the number of observations of feature j in log i. Unfortunately there is a danger that different users have a different number of benign processes running at a time, so the counts can vary significantly between different logs within the same timeframe. In addition, the number of events occurring in CuckooBox is potentially smaller than on an enterprise user machine, where more software is actively being used. To reduce the effect of heterogeneity in our audit log data, we drop all counts from our matrix, such that the matrix only stores binary values: 0 if the feature never occurred in the log, and 1 otherwise.

When running binaries in the sandbox it is possible that a fraction of these binaries did not execute properly due to various reasons. To filter them out, we compute the mean and standard deviation of the number of features for our CuckooBox audit log dataset, and remove all the observations where the number of features is more than two standard deviations below the mean.

4.1.1 Feature Filtering

The large number of features that we extract from the logs can make it intractable to compute a classifier. In our case,
most features are not useful for detection because they occur in a tiny fraction of logs (see Fig. 1D). Our first step is to filter out the long tail of the feature distribution, such that we would be able to tractably apply a standard machine learning algorithm. Our goal is to do a computationally fast filter that will preserve the sparsity of the data matrix. Therefore, we avoid performing expensive operations like matrix decompositions, and filter the features based on the uncentered correlation coefficient, $c_j$, between the labels y and the jth column of A, $a_{*j}$,

$$ c_j = \frac{a_{*j}^T y}{\|a_{*j}\|_2 \, \|y\|_2}, \qquad (1) $$

which can be efficiently computed using a sparse matrix multiplication. Using the $|c_j|$ values, we select the top 50000 most correlated (or inversely correlated) features, a conservative number that is small enough to practically train our machine-learning classifier on a desktop computer, but two orders of magnitude bigger than the number of popular events in our log, while still preserving sparsity of the feature matrix.

Note that in the case when N is extremely large (e.g., billions), we can add a pre-filtering step before our correlation filter by using a probabilistic counter [13] and performing two passes through the audit log dataset. On the first pass we count the number of occurrences of each feature, and on the second pass we only add a feature to A if its count is above a certain threshold. Fortunately, for our current dataset, we were able to compute the correlation directly without pre-filtering.

4.2 Learning Algorithm

Picking the optimal machine learning approach for a specific problem is a research topic itself, and so we leave building an optimal detector to future work. In our case, we focus on building an easily deployable detector that can perform detection very quickly using very few important features, since: i) we can practically deploy such a detector in a large enterprise network by using simple pre-filtering for important features on the SIEM system, activating the more expensive detection system only in the case when one of the important features occurs; and ii) in the case when a detection does occur, we can provide an explanation that can be verified by a human agent. Therefore, we will focus specifically on the two-class $\ell_1$-regularized logistic regression (LR) machine-learning classifier, where the $\ell_1$-norm induces a sparse solution [19].

To confirm our results, which we give later, we have also tried SVMs and random forests [17], with both methods yielding similar performance to LR. We do not report on those results.

4.2.1 Logistic Regression

An $\ell_1$-regularized LR classifier can be computed from y and A by minimizing the associated convex loss function

$$ \langle \hat{x}, \hat{b} \rangle = \arg\min_{x,b} \sum_i \log\left(1 + e^{-y_i (a_i x + b)}\right) + \lambda \|x\|_1, \qquad (2) $$

where $a_i$ is the feature row vector of the ith observation, $y_i$ is the associated label, x is the column vector of the classifier feature weights, b is the offset value, $\|\cdot\|_1$ is the $\ell_1$-norm, and λ is a regularization parameter. We will describe how we pick λ through cross-validation in the next subsection.

Given the LR classifier $\langle \hat{x}, \hat{b} \rangle$, the probability of the ith observation being malware, $p_i$, is evaluated using the logistic function

$$ p_i = \frac{1}{1 + e^{-(a_i \hat{x} + \hat{b})}}. \qquad (3) $$

4.2.2 Regularization

The value of the regularization parameter λ is not known a priori, since it is unclear what the expected total loss is for our dataset. In order to determine the proper λ parameter in eq. (2) we will use internal 10-fold cross-validation, where for each λ value we compute the average loss of the validation sets. The cross-validation can be computed fairly quickly, since the optimization problem can be sped up using warm restarts, where we use the solution from a previously computed λ value to converge faster to the solution for the new λ [11]. Rather than picking the value of λ that gives the lowest loss during cross-validation, we will err on the side of parsimony and use the "one-standard-error" rule, since our deployment environment will somewhat diverge from our testing environment [11]. Note that this validation is not our actual validation, which we report on later, but an internal cross-validation to select the λ value.

4.3 Validation

The LR classifier's tradeoff between detection and false alarm rates can be controlled by the threshold value at which we consider the probability $p_i$ (see eq. (3)) high enough to be malicious. The full range of this tradeoff is summarized by the receiver operating characteristic (ROC) curve, where on the y-axis we have the true positive rate (TPR), the number of malicious logs classified by our classifier as malicious divided by the total number of malicious logs; and on the x-axis we show the false positive rate (FPR), the number of logs we classified as malicious that are actually benign divided by the total number of benign logs. We evaluate our approach based on TPR and FPR because both are independent of the ratio of benign vs. malicious logs in our dataset, which varies between deployment environments.

The TPR and FPR cannot be properly estimated using the data that was used to learn the classifier, and instead must be computed through validation on previously unseen data, where the classifier is trained on "training" data and tested on "test" data, with the most common type of validation being cross-validation [17]. Unfortunately, direct application of the random split cross-validation approach, in our case, could provide misleading results due to the differences between our validation data and the deployment environment. This is a direct consequence of us not being able to observe malware infections directly on an enterprise network, and so having to simulate the infections using a sandbox.

Consider a typical enterprise deployment environment: on a typical day there are a multitude of users active, most of them using Office products, a web browser, or conferencing applications; lots of software is actively running on each computer, and is being constantly interacted with, but software installations are uncommon. We contrast this with the CuckooBox environment that we used to exhibit malware behavior, where only a few processes are active, there is no active interaction with any applications, but software is constantly being installed and CuckooBox monitoring tools, which would not exist on a real enterprise machine, are
mixed in with actual monitored software behavior. Such would also validate on the enterprise data. While this would
heterogeneity between benign enterprise data and malware demonstrate that our classifier does indeed distinguish be-
CuckooBox data can result in us not being able to distin- tween benignware and malware in the same environment,
guish between a classifier that primarily detects CuckooBox it is not clear what our performance is on actual enterprise
vs. enterprise environment and one that actually detects data. If our FPR on the enterprise data is lower than the
malicious behavior. fraction of benignware that is mislabeled as malware, our
Another problem in estimating the TPR and the FPR reported FPR might be artificially high.
from validation results is the unexpected dataset “twinning”, One approach would be to get rid of the CuckooBox be-
where a fraction of our samples in the test set are behav- nign data, and instead test directly on CuckooBox malicious
iorally almost identical to the samples in the training set. vs. enterprise datasets. However, testing directly on enter-
Indeed, we have observed executions of binaries with differ- prise data and synthetically generated malware CuckooBox
ent SHA1s that produced almost identical audit logs. Re- logs could be misleading, since it is still possible to separate
lated to this, is the problem of concept drift, where malware the benign samples from malware samples just by perform-
behaviors tend to evolve over time and drift away from the ing detection for CuckooBox environmental features (e.g.,
samples in our dataset. In both cases, random split valida- CuckooBox specific DLL calls). It is also conceivable for
tion would produce overly optimistic results. a bad classifier to pass the CuckooBox validation, and also
On the other hand, if our classifier is actually classify- the proposed validation, by first detecting the execution en-
ing based on behavior, estimating the FPR only on Cuck- vironment, and if it does not detect the CuckooBox envi-
ooBox benignware might result in a pessimistic rate, since ronment then automatically classify it as benign, otherwise
we potentially have a number of malware binaries in our perform actual detection using some CuckooBox model. We
VPN dataset that were missed by VirusTotal meta-engine, avoid such a possibility by synthetically generating enter-
but are detected by our classifier. Additionally, separating prise malicious logs by logically OR-ing it with a malware
behavior of VPN binaries (which mostly contain new soft- CuckooBox log
ware installers) from malware is much harder than separat-
as = ae ∨ ac , (4)
ing typical behavior of an enterprise endpoint from malicious
behavior, since those tend to deviate less from expected be- where ae and ac are the feature vectors associated with the
havior. enterprise and CuckooBox samples, respectively, and as is
Therefore, in order to present a more accurate estimate the resulting synthetic feature vector that we use instead of
of expected performance, we have designed six different val- the CuckooBox ac vector in our validation.
idations that, when taken together, show that our classifier There is still a possibility that some of our detection re-
provides robust performance under the more realistic set of sults on the synthetic dataset are inflated because we could
assumptions. First, to address dataset twinning or concept be using some CuckooBox features to increase our score
drift, in addition to the validation based on random dataset (these features would not exist in deployment, so our true
splitting, we also validate results by testing only on malware TPR would be potentially lower). So in addition to remov-
that is at least one or two years older than malware in our ing obvious CuckooBox features (see Feature Engineering),
training set. Since compile time in an executable can be we also find all CuckooBox features that occur in > 1% of
faked or be corrupted, we remove all executables that have benign CuckooBox (in order not to select accidentally mal-
compile time before the year 1995 or compile time after 2014. ware features due to mislabeled data), and we remove all of
As an alternative, we also validate by splitting on malware those features that have positive weights in our LR model.
families instead of compile time, where we define a family This gives us a very conservative guess for the TPR (though
based on Kaspersky’s label. In our malware families test not a lower bound, since we potentially missed some fea-
we remove “Trojan.Win32.Generic” from both the training tures), and represents our best attempt at estimating the
and testing datasets, though we acknowledge that family true deployment TPR vs. FPR tradeoff.
labeling can be somewhat ambiguous [14]. To sum it up, the random training testing split, the com-
The second variable in our validations is the environment pile time split, and the family split, computed separately for
of our test data, where we test the same classifier, first in CuckooBox and synthetic enterprise data form our six val-
CuckooBox, and then enterprise environments, separately. idations. In addition, our chances of detecting on environ-
By maintaining the environment constant for the benign and ment is further minimized by our choice of a ℓ1 -regularizer,
malicious samples in the test set, we (mostly) mitigate the since a feature that detects behavior would better generalize
possibility that our evaluation results are tainted due to the across our two environments, and so would likely be chosen
heavily biased ratio of benign vs. malicious samples in our over a feature that detects on just the environment.
two environments. We describe this approach and how we
generate malicious samples for the enterprise environment 5. RESULTS
next.
Our experimental dataset consists of 32,078 samples and
6,898,953 unique extracted features. Out of the samples,
4.3.1 Mitigating Environmental Bias 17,399 are benign, 14,679 malicious. 20,362 audit logs are
We start with the the closest approximation to the stan- from binaries executed in CuckooBox, and 11,716 are In-
dard 10-fold cross-validation approach, where we split all the vincea’s enterprise four minute windowed audit logs. Out
CuckooBox datasets into 10 disjoint sets, and for each vali- of the 20,362 CuckooBox audit logs, 5,683 are benign and
dation iteration train on the 9 out of 10 CuckooBox sets and 14,679 are malicious. Of the 14,679 malicious audit logs,
all of the enterprise data, and then validate on the remaining 3,010 are known to be malicious based on their origination.
CuckooBox set (no enterprise data). Note the difference be- The density of the A matrix, as computed as the number of
tween this approach and standard cross-validation, where we non-zeros divided by number of entries, is 1.2×10−4 , indicat-
ing that the matrix is indeed very sparse. All computations enterprise level FPRs.
were performed on a 2014 16GB MacBook Pro laptop run-
ning MATLAB 8.3. The LR classifier was computed using 5.1 Detecting AV Missed Malware
the Glmnet software package [11, 16], with 20-fold internal One important advantage of our approach over standard
cross-validation for determining λ. antivirus engines is that we are using observation vectors
The results for the six various validations of our LR classi- that are currently not utilized for detection, making it less
fier are shown in Fig. 2. In the top validation panels we test likely that malware would be able to hide from it. As the
exclusively on CuckooBox data, while in the bottom pan- result, assuming our classification labels are mostly correct,
els exclusively on the synthetic enterprise data. Note that we are actually able to detect a large fraction of malware
the classifier used to compute the TPR and the FPR in the missed by popular antivirus engines.
top and bottom of each column is the same. Since we are Recall that 88%, 95%, and 90% of our malware are de-
proposing to deploy the classifier in an enterprise setting, tected by Kaspersky, McAfee, Symantec, as well as 96%
we are primarily concerned with performances in the very when using McAfee and Symantec as a “meta”-engine, and
low FPR regions (≤ 10−2 ). We roughly estimate that the 98% when using all three. Removing those audit logs from
10−2 and 10−3 FPR to be equivalent to a false alarm once our validation dataset, in Fig. 3 we show that we are able
a day and once every 8 days/ per computer, respectively, to detect a significant fraction of malware that is missed by
assuming 8 hr/day usage. standard antivirus engines, as well as meta-engines consist-
Fig. 2A shows that for random splitting our LR classifier ing of several popular antivirus engines. Specifically, we are
has a robust detection level of 77% TPR at 1% FPR. Fig. able to detect around 80% of malware completely missed
2E shows that for the synthetic enterprise version of the by a Mcafee + Symantec + Kaspersky ensemble detector.
validation we can see the large improvement in the ROC Note that other than removal of detected malware from our
curve at the lower FPR values due to the removal of the validation dataset, the validation procedure in the figure is
VirusTotal missed malware, with the TPR at about 89% identical to that of Fig. 2E.
for 1% FPR, and even larger gain at deployment-relevant It is possible that some of our detection in Fig. 3 is due
FPRs. The associated bar plot of TPRs at FPRs 10−2 and to the varying definition of malware between the major ven-
10−3 are shown in Fig. 2D and H, grouped by malware type dors and VirusTotal consensus. For example, while Kasper-
(as determined by Kaspersky antivirus) . sky and Symantec have approximately the same detection
In addition to the validation based on random dataset rate on our dataset, our ROC curve for Kaspersky is worse
splitting, we also validated results by testing only on mal- than for Symantec. This suggests that Kaspersky has po-
ware that is at least one or two years older than malware in tentially a different threshold for what it considers malware,
the training set (Figs. 2B and F), and separately testing on which results in our training data missing some category of
malware families not included in our training dataset (Figs. malware. Therefore, we have also analyzed our TPR for
2C and G). The number of samples used to train the clas- known malware, labeled as malware, not based on Virus-
sifier in Fig. 2B and F is smaller than in the other panels, Total score, but due to known origination of the associated
but all the ROC curves for all the years were computed us- binary executable. Our analysis showed 79% TPR at 1%
ing the same training dataset, with only the validation set FPR, and 72% TPR at 0.1% FPR, when measuring perfor-
being adjusted during ROC curve computation. The zero mance on this known malware, using the Kaspersky malware
year ROC curve line is worse than the ROC curve in first only trained classifier, thus confirming our overall results.
column, even though it is the same validation, because the
amount of training data is significantly reduced in the sec- 5.2 Important Features
ond column in order to be able to keep the same classifier
In addition to the validation above we also looked at the
for all three curves.
actual features that are used for detection. To do this, we
We observe around an 8% drop in detection for each ad-
recomputed the regularized LR model on our full dataset,
ditional minimum year difference between the training and
which resulted in a LR model that uses 1,704 features. Of
the testing set, which demonstrates that concept drift does
these features, the sum of the positive weights (indicating
affect our detection. The detection rate is still fairly high
malicious features) was around 159 vs. -73 for the nega-
for the synthetic enterprise data (bottom panel), where the
tive weights (indicating benign features). This shows that
TPR is at least 67% with a FPR of 0.1%, and at least 73%
our detection relies more on malware behavior than benign
for the FPR of 1%. These results are promising, consider-
behavior for classification, as we have expected.
ing that commercial antivirus solutions have a 60% TPR for
While in general interpreting Windows audit logs is diffi-
previously unseen malware [20].
cult without more fine grained knowledge (we only know the
Since we did not filter all malware audit logs for executions
DLL executed, not the function called), we did observe some
that did not exhibit malicious behavior (ex. crashed prema-
features that are directly interpretable. We note that none
turely or required interaction to activate), our reported TPR
of the features by themselves are enough to classify a log
is potentially underestimated. We performed manual exam-
as malicious, and that it is observation of at least several of
ination of a small fraction of the missed detections, and a
these features that causes detection. To do this, we sort the
large fraction of them where GUI installers that simply did
non-zero LR features by their contribution to classification
not finish executing in an automated sandbox run. In terms
of malware, which we compute by multiplying the number
of practicality, FPR is vastly more critical, since it controls
of times a feature occurs in malware by the absolute weight
the number of false alarms that a enterprise network ad-
in our LR model. Below are several top features that we
ministrator would have to handle. For example a detector
found to have a direct interpretation:
with a 5% FPR is simply not deployable. Our results show
The #2 most important feature is:
that a significant level of detection can be achieved close to
Executing edited file.
[Figure 2 plot: eight panels, (A)-(D) in the top row and (E)-(H) in the bottom row. Panels A-C and E-G are ROC curves of TPR vs. FPR, with FPR on a log scale from 10^-3 to 10^0. AUC values shown: (A) 0.97; (B) 0.96, 0.89, and 0.79 for 0, 1, and 2 years; (C) 0.97; (E) 0.95; (F) 0.96, 0.91, and 0.84 for 0, 1, and 2 years; (G) 0.95. Panels D and H are bar plots of fraction detected by Kaspersky classification, with # observations per type: trojan 5027, virus 912, adware 895, packed 858, backdoor 732, worm 625, net 415, hoax 215, downloader 178, email 135.]

Figure 2: 10-fold validation results for our ℓ1-regularized LR classifier. The top panels are validation results computed only on CuckooBox collected audit log data, while the bottom panels show validation results computed only on synthetic malware and pure benign enterprise audit logs. Each column shares the same classifier between the top and bottom panel. The first column shows the ROC validation curves for randomly split data. The second column shows ROC validation curves based on compile time splits, where the training malware data is at least one (middle line) or two (bottom line) years older than the validation malware data. The third column shows ROC validation curves based on malware family splits, where we train on one set of malware families, and validate on the rest of the families. The fourth column shows results from the first column, at two deployment-relevant false positive rates (10^-2 top and blue, and 10^-3 bottom and yellow), broken down by specific malware types, with the total number of observations of each malware type shown on the right axis.
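
The synthetic enterprise malware evaluated in the bottom panels of Figure 2 is built with the logical OR of Eq. (4) in Section 4.3.1. The following is a minimal Python sketch of that step, not the authors' MATLAB code. It assumes each binary feature vector is stored sparsely as a set of active feature ids, so that OR becomes set union. The function names are hypothetical, and pairing each malware log with a randomly chosen enterprise log is an assumption, since the pairing rule is not stated here.

```python
import random


def synthesize_enterprise_malware(enterprise_log, cuckoo_malware_log):
    """Eq. (4): a_s = a_e OR a_c.

    Each argument is an iterable of active (non-zero) feature ids, so the
    element-wise OR of the two binary vectors is the union of the id sets.
    """
    return frozenset(enterprise_log) | frozenset(cuckoo_malware_log)


def to_dense(active_ids, n_features):
    """Expand a set of active feature ids into a dense 0/1 list."""
    return [1 if j in active_ids else 0 for j in range(n_features)]


def build_synthetic_set(enterprise_logs, cuckoo_malware_logs, seed=0):
    """Create one synthetic enterprise malware vector per CuckooBox malware log.

    ASSUMPTION: the enterprise log to combine with is drawn at random.
    """
    rng = random.Random(seed)
    return [
        synthesize_enterprise_malware(rng.choice(enterprise_logs), malware_log)
        for malware_log in cuckoo_malware_logs
    ]


if __name__ == "__main__":
    a_e = {0, 3}  # features seen in a benign enterprise window
    a_c = {3, 5}  # features seen in a CuckooBox malware run
    a_s = synthesize_enterprise_malware(a_e, a_c)
    print(to_dense(a_s, 6))
```

Because the synthetic vector always contains every feature of the enterprise log, a classifier that only recognizes the execution environment would see an enterprise sample and could not score well on this set.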

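The TPR values at the two deployment-relevant FPRs, and the rough conversion from FPR to false alarms per day given in Section 5, can be reproduced with a few lines of arithmetic. This is a hedged sketch: the function names are hypothetical, the threshold rule (flag a sample when its score is strictly above the threshold) is one reasonable choice rather than the paper's stated procedure, and the alarm estimate assumes one classification per non-overlapping four minute window.

```python
import math


def tpr_at_fpr(benign_scores, malware_scores, target_fpr):
    """Pick the lowest threshold whose FPR does not exceed target_fpr.

    Returns (threshold, tpr, fpr). A sample is flagged when score > threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign samples we may flag without exceeding the target.
    allowed = math.floor(target_fpr * len(ranked))
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    fpr = sum(s > threshold for s in benign_scores) / len(benign_scores)
    tpr = sum(s > threshold for s in malware_scores) / len(malware_scores)
    return threshold, tpr, fpr


def alarms_per_day(fpr, hours_per_day=8.0, window_minutes=4.0):
    """Expected false alarms per computer per day.

    ASSUMPTION: one classifier decision per non-overlapping window.
    """
    windows_per_day = hours_per_day * 60.0 / window_minutes
    return fpr * windows_per_day


if __name__ == "__main__":
    for fpr in (1e-2, 1e-3):
        rate = alarms_per_day(fpr)
        print(f"FPR={fpr:g}: {rate:.2f} alarms/day, one every {1 / rate:.1f} days")
```

With 8 hours of use and four minute windows there are 120 windows per day, so an FPR of 10^-2 gives about 1.2 false alarms per day and an FPR of 10^-3 gives about one every 8.3 days, consistent with the paper's rough estimate of once a day and once every 8 days.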
This represents the execution of a binary that was edited by some process at an earlier point. While not always occurring in malicious software, this is clearly suspicious behavior.

The #3 most important feature is:
Write to ...\windows defender\...\detections.log.
This feature represents an actual detection by the Microsoft antivirus engine. This is a good validation of our algorithm, since this is clearly a strong non-signature indicator of malware, and we were able to discover it automatically.

The #9 most important feature is:
Write to ...[system]\msvbvm60.dll.
While at first it might seem surprising that a library for Visual Basic 6 is an indicator of malware, a bit of research shows that the use of Visual Basic in malware has been on the rise because VB code is tricky to reverse engineer [6]. Add to this the fact that support for the language ended more than 10 years ago, and it is clear why this would be a good indicator of malware behavior.

While these are just three examples of the features that our LR model detects on, they clearly demonstrate that we are able to automatically recover some important non-signature events.

5.3 Limitations

It is clear that this, or any other approach, is not a panacea to the malware problem. Given a large enough deployment of such a detector, malware authors will start finding ways to obfuscate their audit log trail. For one thing, slow moving malware is an open problem, and potentially will not be detected using the four minute window of an audit log.

The ROC curve estimates for the low FPR regions are not as reliable as for the less relevant higher FPR regions. More reliable estimates for the deployment-relevant region would require significantly more samples. Also, given the somewhat artificial conditions in which we collected malware audit logs, our estimates of the TPRs could potentially be inaccurate. The TPR could further be affected by the constantly changing distribution of malware in the wild, which we are not able to effectively approximate.

Our approach is only an initial demonstration of audit log detection feasibility and can undoubtedly be improved through a combination of better training data, improvements in feature engineering, and algorithmic tuning. Other problems can be mitigated by Microsoft improving their audit log capabilities.

6. CONCLUSION

We demonstrated that audit logs can potentially provide an effective, low-cost signal for detecting malicious behavior on enterprise networks, and add value to current antivirus detection systems. Our LR model yields a detection rate of 85% at an expected false positive rate of 1%, and detected 80% of malware missed by commercial antivirus systems. While audit logs do not directly record certain malicious tactics (e.g., thread injection), our results show that they still provide adequate information to robustly detect most malware. Since audit logs can easily be collected on enterprise networks and aggregated within SIEM databases, we believe our solution is deployable at reasonably low cost, though further work needs to be done to thoroughly test
[Figure 3 plot: three panels. (A) and (B) are ROC curves of TPR vs. FPR, with FPR on a log scale from 10^-3 to 10^0. (C) shows fraction detected vs. VirusTotal score (0.3 to 0.9). Legend AUC values: 0.80, 0.89, 0.90 (panel A); 0.91, 0.90 (panel B).]

Figure 3: 10-fold validated ROC curves of malware that is detected by our classifier and is missed by antivirus engines. In each instance the model was trained only on antivirus-detected data, and validated only on malware missed by that engine. (A) Validation on specific antivirus engines: Kaspersky, black squares; McAfee, red x; and Symantec, blue triangles. (B) Validation on meta-engines: Kaspersky+McAfee+Symantec, black squares; McAfee+Symantec, blue x. (C) The fraction of malware detected, based on VirusTotal score for the composite (all three) engine, at two false positive rates (10^-2 left and blue, and 10^-3 right and yellow). Lower VirusTotal scores indicate malware that is harder to detect.
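
The train/validate partition described in the Figure 3 caption can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the authors' code: the `Sample` record and `split_for_engines` function are hypothetical, and a meta-engine is read as "flagged by at least one of the listed engines", which is consistent with the meta-engine detection rates in Section 5.1 being at least as high as each individual engine's. How benign samples are divided across the 10 folds is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str
    malicious: bool
    detected_by: frozenset  # names of antivirus engines that flagged the sample


def split_for_engines(samples, engines):
    """Partition samples for an 'AV-missed malware' validation.

    Returns (train_malware, missed_malware, benign):
      train_malware  - malware flagged by at least one engine in `engines`
      missed_malware - malware flagged by none of them (validation only)
      benign         - all benign samples
    """
    engines = frozenset(engines)
    train_malware = [s for s in samples if s.malicious and s.detected_by & engines]
    missed_malware = [s for s in samples if s.malicious and not (s.detected_by & engines)]
    benign = [s for s in samples if not s.malicious]
    return train_malware, missed_malware, benign


if __name__ == "__main__":
    data = [
        Sample("m1", True, frozenset({"Kaspersky"})),
        Sample("m2", True, frozenset({"McAfee"})),
        Sample("m3", True, frozenset()),
        Sample("b1", False, frozenset()),
    ]
    train, missed, benign = split_for_engines(data, {"McAfee", "Symantec"})
    print([s.sample_id for s in train], [s.sample_id for s in missed])
```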

this claim.

Importantly, by putting multiple obstacles, such as audit log detection, in the way of would-be attackers, we can increase the time and cost required to develop effective malware. In our future work we will explore improvements and integration of other audit log signals, like network flow, in our detection.

7. DATA AND SOFTWARE

We are actively working on getting the full dataset anonymized and approved for release to the public. The source code that completely reproduces our figures, and the link to our data, can be found at GitHub: https://github.com/konstantinberlin/malware-windows-audit-log-detection.

8. ACKNOWLEDGEMENT

We would like to thank Robert Gove for his comments and discussion.

9. REFERENCES

[1] Anubis. https://anubis.iseclab.org/.
[2] Cuckoo Sandbox. http://www.cuckoosandbox.org.
[3] VirtualBox. https://www.virtualbox.org.
[4] VirusTotal. https://www.virustotal.com.
[5] Description of security events in Windows Vista and in Windows Server 2008. https://support.microsoft.com/en-us/kb/947226, January 2009.
[6] Visual basic platform is becoming increasingly popular among malware writers. http://www.lavasoft.com/mylavasoft/securitycenter/whitepapers/visual-basic-platform-is-becoming-increasingly-popular-among-malware, September 2012.
[7] File detection test of malicious software. http://www.av-comparatives.org/wp-content/uploads/2015/04/avc_fdt_201503_en.pdf, April 2015.
[8] M. Egele, T. Scholte, E. Kirda, and C. Kruegel. A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys (CSUR), 44(2):6, 2012.
[9] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems (TOCS), 32(2):5:1–5:29, June 2014.
[10] D. Fleck, A. Tokhtabayev, A. Alarif, A. Stavrou, and T. Nykodym. Pytrigger: A system to trigger & extract user-activated malware behavior. In Proceedings of the 2013 International Conference on Availability, Reliability and Security, pages 92–101. IEEE, 2013.
[11] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, February 2010.
[12] E. Gandotra, D. Bansal, and S. Sofat. Malware analysis and classification: A survey. Journal of Information Security, 5(02):56, 2014.
[13] M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
[14] A. Mohaisen and O. Alrawi. AV-meter: An evaluation of antivirus scans and labels. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 112–131. Springer, 2014.
[15] A. Moser, C. Kruegel, and E. Kirda. Limits of static analysis for malware detection. In Proceedings of the 23rd Computer Security Applications Conference, pages 421–430, 2007.
[16] J. Qian, T. Hastie, J. Friedman, R. Tibshirani, and N. Simon. Glmnet for MATLAB. http://www.stanford.edu/~hastie/glmnet_matlab, 2013.
[17] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[18] A. Slowinska and H. Bos. Pointless tainting?: evaluating the practicality of pointer tainting. In Proceedings of the 4th ACM European Conference on Computer Systems, pages 61–74. ACM, 2009.
[19] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[20] G. Vigna. Antivirus isn't dead, it just can't keep up. http://labs.lastline.com/lastline-labs-av-isnt-dead-it-just-cant-keep-up, May 2014.
[21] T.-F. Yen, A. Oprea, K. Onarlioglu, T. Leetham, W. Robertson, A. Juels, and E. Kirda. Beehive: Large-scale log analysis for detecting suspicious activity in enterprise networks. In Proceedings of the 29th Annual Computer Security Applications Conference, pages 199–208. ACM, 2013.
[22] H. Yin, D. Song, M. Egele, C. Kruegel, and E. Kirda. Panorama: capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 116–127. ACM, 2007.
