Malicious Behavior Detection Using Windows Audit Logs

Konstantin Berlin, David Slater, Joshua Saxe
Invincea Labs, LLC, Arlington, VA, USA
kberlin@invincea.com, david.slater@invincea.com, josh.saxe@invincea.com

arXiv:1506.04200v2 [cs.CR] 25 Aug 2015
[Figure 1: four panels, (A)–(D). Axes: (A) Fraction of Binaries vs. VirusTotal Score (s); (B) Fraction of Binaries vs. Relative Size of Malware Family; (C) Fraction of Binaries vs. Year Created; (D) Number of Occurrences vs. Popularity Rank.]

Figure 1: Characterization of our database binary files and logs. (A) The VirusTotal classification of all the binaries run through CuckooBox with s > 0. (B) The distribution of our malware binaries as a function of family size. (C) The distribution of compile timestamps (or file creation dates, if the timestamp is unavailable) in our CuckooBox-run binary files. Extreme values (due to corrupted stamps) are not shown. (D) The popularity of events produced in CuckooBox runs of binaries, ranked by the number of occurrences (where events are defined by action type and target). This demonstrates a long tail of observed malicious events.

Note that our detection percentages for the various antivirus engines are potentially more optimistic than in a real deployment environment, for several reasons: (i) a large fraction of our malware is more than two years old; (ii) while we potentially have newer malware in the VPN binary dataset, there is a chance that it is not detected by more than 30% of antivirus engines, and so is not included in our statistics; and (iii) the antivirus engines are potentially set up with more aggressive detection settings than they would otherwise use when actually deployed.

In the rest of the manuscript we will refer to y as the classification vector of size M, where y_i ∈ {−1, 1} is the ith sample's classification, M is the number of observations in the dataset, and −1, 1 mean benign or malicious, respectively.

4. METHOD

Our method for deriving a classifier from our audit logs is divided into two stages. In the first stage we perform the feature engineering, where we use domain knowledge to extract features out of audit logs. In the second stage we use machine learning to compute a classifier that classifies audit logs based on the extracted features.

4.1 Feature Engineering

We split the enterprise audit logs into four minute windows to mimic our CuckooBox runs. In order to remove some common host-specific artifacts from file paths and registries, we ran them all through a series of regular expressions that abstracted out the host-specific strings (e.g., the username in a path, UIDs in the registry, etc.).

The most direct approach for mapping events in the log to features is to represent each single event as a single feature, and to represent each log as a bag-of-words of individual events. However, the time order between events is lost in such a representation, and in general there is not enough information to unambiguously order all log events. Still, some orderings have meaning, since an event can represent different behavior depending on its context.

One improvement over the one-to-one mapping, which partially preserves order, is to represent all contiguous, time-ordered sequences of q events (event q-grams) in the log as features, for some number q. For example, one feature could be the execution of "a.dll" followed immediately by the execution of "b.dll", rather than two independent features. The drawback of such an approach is that it causes an exponential increase in the number of features, limiting how big we can computationally make q. Also, as we make q larger we potentially get features that do not generalize as well, so the utility of computing q-grams for large q is low. In our approach, we group all events in a log by their associated process IDs, extract all q-grams for each process, and then aggregate them into one large feature set of q-grams that represents the entire log. We set q = {1, 2, 3}, a range of sizes we determined to be a practical compromise between computational complexity and quality of results. Representing all events as 1-3 grams resulted in about seven million unique features.

We can now represent our combined benign and malware audit log dataset simply by an M × N matrix A, where entry a_ij is the number of observations of feature j in log i. Unfortunately, there is a danger that different users have a different number of benign processes running at a time, so the counts can vary significantly between different logs within the same timeframe. In addition, the number of events occurring in CuckooBox is potentially smaller than on an enterprise user machine, where more software is actively being used. To reduce the effect of heterogeneity in our audit log data, we drop all counts from our matrix, so that the matrix only stores binary values: 0 if the feature never occurred in the log, and 1 otherwise.

When running binaries in the sandbox it is possible that a fraction of these binaries did not execute properly, for various reasons. To filter them out, we computed the mean and standard deviation of the number of features per observation in our CuckooBox audit log dataset, and removed all observations where the number of features was more than two standard deviations below the mean.
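To make the feature construction concrete, the following is a minimal Python sketch of the per-process q-gram extraction described above. It assumes each audit log has already been parsed into time-ordered (process id, action, target) tuples; the regular expressions are illustrative stand-ins for the host-specific normalization (the paper does not list its actual patterns), and all function names here are ours.

```python
import re
from collections import defaultdict

# Illustrative normalization patterns only -- the paper's actual regular
# expressions for abstracting host-specific strings are not given.
NORMALIZERS = [
    (re.compile(r"c:\\users\\[^\\]+", re.I), r"c:\\users\\[USER]"),  # username in a path
    (re.compile(r"s-1-5-21-[\d-]+", re.I), "[SID]"),                 # UID/SID in a registry key
]

def normalize(target):
    """Abstract out host-specific substrings from a file path or registry key."""
    for pattern, repl in NORMALIZERS:
        target = pattern.sub(repl, target)
    return target

def log_features(events, q_values=(1, 2, 3)):
    """events: time-ordered (process_id, action, target) tuples for one audit log.
    Returns the set of per-process q-grams (q = 1, 2, 3); presence is binary,
    since counts are dropped when the feature matrix is binarized."""
    per_process = defaultdict(list)
    for pid, action, target in events:
        per_process[pid].append((action, normalize(target)))
    features = set()
    for seq in per_process.values():
        for q in q_values:
            for i in range(len(seq) - q + 1):
                features.add(tuple(seq[i:i + q]))
    return features
```

Each log's feature set would then be hashed into column indices of the binary matrix A described above.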
4.1.1 Feature Filtering

The large number of features that we extract from the logs can make it intractable to compute a classifier. In our case, most features are not useful for detection because they occur in only a tiny fraction of logs (see Fig. 1D). Our first step is therefore to filter out the long tail of the feature distribution, so that we can tractably apply a standard machine learning algorithm. Our goal is a computationally fast filter that preserves the sparsity of the data matrix. We therefore avoid expensive operations like matrix decompositions, and instead filter the features based on the uncentered correlation coefficient, c_j, between the labels y and the jth column of A, a_{*j},

    c_j = \frac{a_{*j}^T y}{\|a_{*j}\|_2 \, \|y\|_2},    (1)

which can be computed efficiently using a sparse matrix multiplication. Using the |c_j| values, we select the top 50,000 most correlated (or anti-correlated) features, a conservative number that is small enough to practically train our machine-learning classifier on a desktop computer, yet two orders of magnitude bigger than the number of popular events in our log, while still preserving the sparsity of the feature matrix.

Note that in the case when N is extremely large (e.g., billions), we can pre-filter before the correlation step by using a probabilistic counter [13] and performing two passes through the audit log dataset: on the first pass we count the number of occurrences of each feature, and on the second pass we only add a feature to A if its count is above a certain threshold. Fortunately, for our current dataset we were able to compute the correlations directly, without pre-filtering.
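A small sketch of the correlation-based filter in eq. (1), assuming the binary feature matrix A is stored as a SciPy sparse matrix and y ∈ {−1, +1}; the function and variable names are ours, not from the paper.

```python
import numpy as np
import scipy.sparse as sp

def top_correlated_features(A, y, k=50000):
    """A: sparse M x N binary feature matrix; y: labels in {-1, +1}.
    Ranks columns by |c_j| from eq. (1) and keeps the k best."""
    A = sp.csc_matrix(A, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    numer = A.T.dot(y)                                      # a_{*j}^T y for every column j
    col_norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=0)).ravel())
    col_norms[col_norms == 0] = 1.0                         # guard against empty columns
    c = numer / (col_norms * np.linalg.norm(y))
    keep = np.argsort(-np.abs(c))[:k]
    return A[:, keep], keep
```

The only dense object this creates is the length-N vector of correlations, so the sparsity of A is preserved throughout.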
4.2 Learning Algorithm

Picking the optimal machine learning approach for a specific problem is a research topic in itself, so we leave building the most optimal detector to future work. In our case, we focus on building an easily deployable detector that can perform detection very quickly using very few important features, since: (i) we can practically deploy such a detector in a large enterprise network by using simple pre-filtering for the important features on the SIEM system, activating the more expensive detection system only when one of the important features occurs; and (ii) when a detection does occur, we can provide an explanation that can be verified by a human agent. Therefore, we focus specifically on the two-class ℓ1-regularized logistic regression (LR) classifier, where the ℓ1-norm induces a sparse solution [19].

To confirm our results, which we give later, we also tried SVMs and random forests [17], with both methods yielding performance similar to LR. We do not report on those results.

4.2.1 Logistic Regression

An ℓ1-regularized LR classifier can be computed from y and A by minimizing the associated convex loss function

    \langle \hat{x}, \hat{b} \rangle = \arg\min_{x,b} \sum_i \log\left(1 + e^{-y_i (a_i x + b)}\right) + \lambda \|x\|_1,    (2)

where a_i is the feature row vector of the ith observation, y_i is the associated label, x is the column vector of the classifier feature weights, b is the offset value, ‖·‖_1 is the ℓ1-norm, and λ is a regularization parameter. We describe how we pick λ through cross-validation in the next subsection.

Given the LR classifier \langle \hat{x}, \hat{b} \rangle, the probability of the ith observation being malware, p_i, is evaluated using the logistic function

    p_i = \frac{1}{1 + e^{-(a_i \hat{x} + \hat{b})}}.    (3)

4.2.2 Regularization

The value of the regularization parameter λ is not known a priori, since it is unclear what the expected total loss is for our dataset. In order to determine the proper λ parameter in eq. (2) we use internal 10-fold cross-validation, where for each λ value we compute the average loss over the validation sets. The cross-validation can be computed fairly quickly, since the optimization can be sped up using warm restarts, where we use the solution from a previously computed λ value to converge faster to the solution for the new λ [11]. Rather than picking the value of λ that gives the lowest loss during cross-validation, we err on the side of parsimony and use the "one-standard-error" rule, since our deployment environment will somewhat diverge from our testing environment [11]. Note that this validation is not our actual validation, which we report on later, but an internal cross-validation used only to select the λ value.
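The paper computes the ℓ1-regularized LR model with the Glmnet package under MATLAB, using warm restarts along the λ path. The sketch below is a scikit-learn approximation (no warm restarts) of the internal cross-validation with a one-standard-error rule; note that scikit-learn parameterizes regularization as C ≈ 1/λ, so the most parsimonious admissible model is the one with the smallest C.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def fit_l1_lr_one_se(A, y, Cs=np.logspace(-3, 2, 20), n_folds=10, seed=0):
    """L1-regularized LR, with the regularization strength chosen by internal
    cross-validation and the one-standard-error rule."""
    y = np.asarray(y)
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    losses = np.zeros((len(Cs), n_folds))
    for i, C in enumerate(Cs):
        for j, (tr, va) in enumerate(folds.split(A, y)):
            clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
            clf.fit(A[tr], y[tr])
            losses[i, j] = log_loss(y[va], clf.predict_proba(A[va])[:, 1])
    mean, se = losses.mean(axis=1), losses.std(axis=1) / np.sqrt(n_folds)
    best = mean.argmin()
    admissible = np.where(mean <= mean[best] + se[best])[0]
    C_1se = Cs[admissible].min()          # err on the side of parsimony
    return LogisticRegression(penalty="l1", C=C_1se, solver="liblinear").fit(A, y)
```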
4.3 Validation

The LR classifier's tradeoff between detection and false alarm rates can be controlled by the threshold value at which we consider the probability p_i (see eq. (3)) high enough to be malicious. The full range of this tradeoff is summarized by the receiver operating characteristic (ROC) curve, where on the y-axis we have the true positive rate (TPR), the number of malicious logs classified by our classifier as malicious divided by the total number of malicious logs, and on the x-axis we show the false positive rate (FPR), the number of logs we classified as malicious that are actually benign divided by the total number of benign logs. We evaluate our approach based on TPR and FPR because both are independent of the ratio of benign vs. malicious logs in our dataset, a ratio which varies between deployment environments.

The TPR and FPR cannot be properly estimated using the data that was used to learn the classifier; they must instead be computed through validation on previously unseen data, where the classifier is trained on "training" data and tested on "test" data, with the most common type of validation being cross-validation [17]. Unfortunately, direct application of the random-split cross-validation approach could, in our case, provide misleading results due to the differences between our validation data and the deployment environment. This is a direct consequence of us not being able to observe malware infections directly on an enterprise network, and so having to simulate the infections using a sandbox.

Consider a typical enterprise deployment environment: on a typical day there are a multitude of users active, most of them using Office products, a web browser, or conferencing applications; lots of software is actively running on each computer and is constantly being interacted with, but software installations are uncommon. We contrast this with the CuckooBox environment that we used to exhibit malware behavior, where only a few processes are active, there is no active interaction with any applications, but software is constantly being installed, and CuckooBox monitoring tools, which would not exist on a real enterprise machine, are mixed in with the actual monitored software behavior. Such heterogeneity between benign enterprise data and malware CuckooBox data can leave us unable to distinguish between a classifier that primarily detects the CuckooBox vs. enterprise environment and one that actually detects malicious behavior.

Another problem in estimating the TPR and the FPR from validation results is unexpected dataset "twinning", where a fraction of the samples in the test set are behaviorally almost identical to samples in the training set. Indeed, we have observed executions of binaries with different SHA1s that produced almost identical audit logs. Related to this is the problem of concept drift, where malware behaviors tend to evolve over time and drift away from the samples in our dataset. In both cases, random-split validation would produce overly optimistic results.

On the other hand, if our classifier is actually classifying based on behavior, estimating the FPR only on CuckooBox benignware might result in a pessimistic rate, since we potentially have a number of malware binaries in our VPN dataset that were missed by the VirusTotal meta-engine but are detected by our classifier. Additionally, separating the behavior of VPN binaries (which mostly consist of new software installers) from malware is much harder than separating the typical behavior of an enterprise endpoint from malicious behavior, since endpoint behaviors tend to deviate less from expected behavior.

Therefore, in order to present a more accurate estimate of expected performance, we have designed six different validations that, taken together, show that our classifier provides robust performance under a more realistic set of assumptions. First, to address dataset twinning and concept drift, in addition to the validation based on random dataset splitting, we also validate results by testing only on malware that is at least one or two years newer than the malware in our training set. Since the compile time in an executable can be faked or corrupted, we remove all executables that have a compile time before the year 1995 or after 2014. As an alternative, we also validate by splitting on malware families instead of compile time, where we define a family based on Kaspersky's label. In our malware-families test we remove "Trojan.Win32.Generic" from both the training and testing datasets, though we acknowledge that family labeling can be somewhat ambiguous [14].

The second variable in our validations is the environment of our test data, where we test the same classifier first in the CuckooBox environment and then in the enterprise environment, separately. By keeping the environment constant for the benign and malicious samples in the test set, we (mostly) mitigate the possibility that our evaluation results are tainted by the heavily biased ratio of benign vs. malicious samples across our two environments. We describe this approach, and how we generate malicious samples for the enterprise environment, next.
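A hypothetical sketch of the two non-random splits described above: holding out malware by compile year (to expose concept drift) and by family label. The compile_years and families arrays are assumed inputs; the paper's exact bookkeeping is not shown in this excerpt.

```python
import numpy as np

def compile_time_split(compile_years, train_cutoff, min_gap_years):
    """Train on malware compiled up to train_cutoff; test only on malware
    compiled at least min_gap_years later."""
    compile_years = np.asarray(compile_years)
    train_idx = np.flatnonzero(compile_years <= train_cutoff)
    test_idx = np.flatnonzero(compile_years >= train_cutoff + min_gap_years)
    return train_idx, test_idx

def family_split(families, held_out_families):
    """Hold out entire malware families (e.g., Kaspersky labels) for testing."""
    test_mask = np.isin(np.asarray(families), list(held_out_families))
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```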
4.3.1 Mitigating Environmental Bias

We start with the closest approximation to the standard 10-fold cross-validation approach, where we split the CuckooBox dataset into 10 disjoint sets, and for each validation iteration train on 9 of the 10 CuckooBox sets plus all of the enterprise data, and then validate on the remaining CuckooBox set (no enterprise data). Note the difference between this approach and standard cross-validation, where we would also validate on the enterprise data. While this demonstrates that our classifier does indeed distinguish between benignware and malware in the same environment, it is not clear what our performance would be on actual enterprise data. If our FPR on the enterprise data is lower than the fraction of benignware that is mislabeled as malware, our reported FPR might be artificially high.

One approach would be to get rid of the CuckooBox benign data, and instead test directly on the CuckooBox malicious vs. enterprise benign datasets. However, testing directly on enterprise data and synthetically generated malware CuckooBox logs could be misleading, since it would still be possible to separate the benign samples from the malware samples just by detecting CuckooBox environmental features (e.g., CuckooBox-specific DLL calls). It is also conceivable for a bad classifier to pass the CuckooBox validation, and also the proposed validation, by first detecting the execution environment: if it does not detect the CuckooBox environment it automatically classifies the log as benign, and otherwise it performs actual detection using some CuckooBox-specific model. We avoid such a possibility by synthetically generating enterprise malicious logs, logically OR-ing each enterprise benign feature vector with a malware CuckooBox feature vector,

    a_s = a_e \vee a_c,    (4)

where a_e and a_c are the feature vectors associated with the enterprise and CuckooBox samples, respectively, and a_s is the resulting synthetic feature vector that we use instead of the CuckooBox vector a_c in our validation.

There is still a possibility that some of our detection results on the synthetic dataset are inflated because we could be using some CuckooBox features to increase our score (these features would not exist in deployment, so our true TPR could be lower). So in addition to removing obvious CuckooBox features (see Feature Engineering), we also find all CuckooBox features that occur in more than 1% of the benign CuckooBox logs (so as not to accidentally select malware features due to mislabeled data), and we remove all of those features that have positive weights in our LR model. This gives us a very conservative guess for the TPR (though not a lower bound, since we potentially missed some features), and represents our best attempt at estimating the true deployment TPR vs. FPR tradeoff.

To sum up, the random training/testing split, the compile-time split, and the family split, each computed separately for CuckooBox and synthetic enterprise data, form our six validations. In addition, our chance of detecting on the environment is further minimized by our choice of an ℓ1-regularizer, since a feature that detects behavior will generalize better across our two environments, and so is likely to be chosen over a feature that detects only the environment.
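Eq. (4) reduces to an element-wise OR of binary feature vectors. A minimal sketch, assuming dense 0/1 arrays and a positional pairing of enterprise benign logs with CuckooBox malware logs (the excerpt does not specify how pairs are chosen):

```python
import numpy as np

def synthesize_enterprise_malicious(enterprise_benign, cuckoo_malware):
    """Eq. (4): a_s = a_e OR a_c, element-wise over binary feature vectors.
    Rows are paired positionally here; the pairing policy is an assumption."""
    n = min(len(enterprise_benign), len(cuckoo_malware))
    return np.logical_or(enterprise_benign[:n], cuckoo_malware[:n]).astype(np.int8)
```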
5. RESULTS

Our experimental dataset consists of 32,078 samples and 6,898,953 unique extracted features. Of the samples, 17,399 are benign and 14,679 are malicious; 20,362 audit logs are from binaries executed in CuckooBox, and 11,716 are Invincea's enterprise four minute windowed audit logs. Of the 20,362 CuckooBox audit logs, 5,683 are benign and 14,679 are malicious. Of the 14,679 malicious audit logs, 3,010 are known to be malicious based on their origination. The density of the A matrix, computed as the number of non-zeros divided by the number of entries, is 1.2 × 10^-4, indicating that the matrix is indeed very sparse. All computations were performed on a 2014 16GB MacBook Pro laptop running MATLAB 8.3. The LR classifier was computed using the Glmnet software package [11, 16], with 20-fold internal cross-validation for determining λ.

The results for our six validations of the LR classifier are shown in Fig. 2. In the top panels we test exclusively on CuckooBox data, while in the bottom panels we test exclusively on the synthetic enterprise data. Note that the classifier used to compute the TPR and FPR in the top and bottom panels of each column is the same. Since we are proposing to deploy the classifier in an enterprise setting, we are primarily concerned with performance in the very low FPR region (≤ 10^-2). We roughly estimate that FPRs of 10^-2 and 10^-3 are equivalent to a false alarm once a day and once every 8 days per computer, respectively, assuming 8 hr/day usage.

Fig. 2A shows that for random splitting our LR classifier has a robust detection level of 77% TPR at 1% FPR. Fig. 2E shows that for the synthetic enterprise version of the validation there is a large improvement in the ROC curve at the lower FPR values, due to the removal of the VirusTotal-missed malware, with the TPR at about 89% for 1% FPR and an even larger gain at deployment-relevant FPRs. The associated bar plots of TPRs at FPRs of 10^-2 and 10^-3 are shown in Fig. 2D and H, grouped by malware type (as determined by Kaspersky antivirus).

In addition to the validation based on random dataset splitting, we also validated results by testing only on malware that is at least one or two years newer than the malware in the training set (Figs. 2B and F), and separately by testing on malware families not included in our training dataset (Figs. 2C and G). The number of samples used to train the classifier in Figs. 2B and F is smaller than in the other panels, but all the ROC curves for all the years were computed using the same training dataset, with only the validation set being adjusted during ROC curve computation. The zero-year ROC curve is worse than the ROC curve in the first column, even though it is the same validation, because the amount of training data is significantly reduced in the second column in order to keep the same classifier for all three curves.

We observe around an 8% drop in detection for each additional minimum year of difference between the training and the testing set, which demonstrates that concept drift does affect our detection. The detection rate is still fairly high for the synthetic enterprise data (bottom panels), where the TPR is at least 67% at an FPR of 0.1%, and at least 73% at an FPR of 1%. These results are promising, considering that commercial antivirus solutions have a 60% TPR for previously unseen malware [20].

Since we did not filter the malware audit logs to remove executions that did not exhibit malicious behavior (e.g., the binary crashed prematurely or required interaction to activate), our reported TPR is potentially underestimated. We performed a manual examination of a small fraction of the missed detections, and a large fraction of them were GUI installers that simply did not finish executing in an automated sandbox run. In terms of practicality, the FPR is vastly more critical, since it controls the number of false alarms that an enterprise network administrator would have to handle; for example, a detector with a 5% FPR is simply not deployable. Our results show that a significant level of detection can be achieved close to enterprise-level FPRs.

5.1 Detecting AV Missed Malware

One important advantage of our approach over standard antivirus engines is that we are using observation vectors that are currently not utilized for detection, making it less likely that malware will be able to hide from it. As a result, assuming our classification labels are mostly correct, we are actually able to detect a large fraction of the malware missed by popular antivirus engines.

Recall that 88%, 95%, and 90% of our malware is detected by Kaspersky, McAfee, and Symantec, respectively, with 96% detected when using McAfee and Symantec as a "meta"-engine, and 98% when using all three. Removing those detected audit logs from our validation dataset, we show in Fig. 3 that we are able to detect a significant fraction of the malware that is missed by standard antivirus engines, as well as by meta-engines consisting of several popular antivirus engines. Specifically, we are able to detect around 80% of the malware completely missed by a McAfee + Symantec + Kaspersky ensemble detector. Note that other than the removal of detected malware from the validation dataset, the validation procedure in the figure is identical to that of Fig. 2E.

It is possible that some of our detection in Fig. 3 is due to varying definitions of malware between the major vendors and the VirusTotal consensus. For example, while Kaspersky and Symantec have approximately the same detection rate on our dataset, our ROC curve for Kaspersky is worse than for Symantec. This suggests that Kaspersky potentially has a different threshold for what it considers malware, which results in our training data missing some category of malware. Therefore, we have also analyzed our TPR for known malware, labeled as malware not based on the VirusTotal score but on the known origination of the associated binary executable. Our analysis showed a 79% TPR at 1% FPR and a 72% TPR at 0.1% FPR when measuring performance on this known malware using the classifier trained only on Kaspersky-detected malware, thus confirming our overall results.
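The deployment-relevant numbers quoted above (TPR at an FPR of 10^-2 or 10^-3) can be read off an ROC curve computed from the classifier scores. A small sketch, assuming labels in {−1, +1} and scores given by the probabilities of eq. (3):

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, target_fpr=1e-2):
    """Detection rate at a deployment-relevant false positive rate,
    read off the ROC curve (y_true in {-1, +1}, scores = p_i from eq. (3))."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return float(np.interp(target_fpr, fpr, tpr))
```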
[Figure 2: eight panels, (A)–(H). Panels (A)–(C) and (E)–(G) are ROC curves (x-axis FPR on a log scale from 10^-3 to 10^0, y-axis TPR); panels (D) and (H) are bar plots of Fraction Detected by Kaspersky classification (trojan 5027, virus 912, adware 895, packed 858, backdoor 732, worm 625, net 415, hoax 215, downloader 178, email 135 observations). Reported AUCs: (A) 0.97; (B) 0.96, 0.89, and 0.79 for the 0-, 1-, and 2-year splits; (C) 0.97; (E) 0.95; (F) 0.96, 0.91, and 0.84; (G) 0.95.]

Figure 2: 10-fold validation results for our ℓ1-regularized LR classifier. The top panels are validation results computed only on CuckooBox-collected audit log data, while the bottom panels show validation results computed only on synthetic malware and pure benign enterprise audit logs. Each column shares the same classifier between the top and bottom panel. The first column shows the ROC validation curves for randomly split data. The second column shows ROC validation curves based on compile-time splits, where the training malware data is at least one (middle line) or two (bottom line) years older than the validation malware data. The third column shows ROC validation curves based on malware-family splits, where we train on one set of malware families and validate on the rest of the families. The fourth column shows results from the first column, at two deployment-relevant false positive rates (10^-2, top and blue, and 10^-3, bottom and yellow), broken down by specific malware types, with the total number of observations of each malware type shown on the right axis.
5.2 Important Features

In addition to the validation above, we also looked at the actual features that are used for detection. To do this, we recomputed the regularized LR model on our full dataset, which resulted in an LR model that uses 1,704 features. Of these features, the sum of the positive weights (indicating malicious features) was around 159, versus −73 for the negative weights (indicating benign features). This shows that our detection relies more on malware behavior than on benign behavior for classification, as we expected.

While in general interpreting Windows audit logs is difficult without more fine-grained knowledge (we only know the DLL executed, not the function called), we did observe some features that are directly interpretable. We note that none of these features by themselves is enough to classify a log as malicious; it is the observation of at least several of them that causes a detection. To find them, we sorted the non-zero LR features by their contribution to the classification of malware, which we compute by multiplying the number of times a feature occurs in malware by its absolute weight in our LR model. Below are several top features that we found to have a direct interpretation.

The #2 most important feature is:

    Executing edited file.

This represents the execution of a binary that was at some earlier point edited by some process. While not always occurring in malicious software, this is clearly suspicious behavior.

The #3 most important feature is:

    Write to ...\windows defender\...\detections.log.

This feature represents an actual detection by the Microsoft antivirus engine. This is a good validation of our algorithm, since this is clearly a strong non-signature indicator of malware, and we were able to discover it automatically.

The #9 most important feature is:

    Write to ...[system]\msvbvm60.dll.

While at first it might seem surprising that a library for Visual Basic 6 is an indicator of malware, a bit of research shows that the use of Visual Basic in malware has been on the rise because VB code is tricky to reverse engineer [6]. Add to this the fact that support for the language ended more than 10 years ago, and it is clear why this would be a good indicator of malware behavior.

While these are just three examples of the features that our LR model detects on, they clearly demonstrate that we are able to automatically recover some important non-signature events.
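A short sketch of the feature-ranking rule described above (occurrences in malware multiplied by the absolute LR weight); the array names are assumptions, and A_malware is the set of rows of the binary feature matrix belonging to malware samples.

```python
import numpy as np

def rank_malicious_features(weights, A_malware, feature_names, top=10):
    """Rank non-zero LR features by (occurrences in malware) * |weight|."""
    weights = np.asarray(weights)
    counts = np.asarray((A_malware > 0).sum(axis=0)).ravel()  # per-feature counts in malware logs
    nz = np.flatnonzero(weights)
    order = nz[np.argsort(-(counts[nz] * np.abs(weights[nz])))]
    return [(feature_names[i], counts[i] * abs(weights[i])) for i in order[:top]]
```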
5.3 Limitations

It is clear that this, or any other approach, is not a panacea for the malware problem. Given a large enough deployment of such a detector, malware authors will start finding ways to obfuscate their audit log trail. For one thing, slow-moving malware is an open problem, and potentially will not be detected within the four minute window of an audit log.

The ROC curve estimates for the low FPR regions are also not as reliable as those for the less relevant, higher FPR regions; more reliable estimates for the deployment-relevant region would require significantly more samples. In addition, given the somewhat artificial conditions under which we collected malware audit logs, our estimates of the TPRs could be inaccurate. The TPR could further be affected by the constantly changing distribution of malware in the wild, which we are not able to effectively approximate.

Our approach is only an initial demonstration of the feasibility of audit log detection, and it can undoubtedly be improved through a combination of better training data, improved feature engineering, and algorithmic tuning. Other problems could be mitigated by Microsoft improving its audit log capabilities.

6. CONCLUSION

We demonstrated that audit logs can potentially provide an effective, low-cost signal for detecting malicious behavior on enterprise networks, adding value to current antivirus detection systems. Our LR model yields a detection rate of 85% at an expected false positive rate of 1%, and detected 80% of the malware missed by commercial antivirus systems. While audit logs do not directly record certain malicious tactics (e.g., thread injection), our results show that they still provide adequate information to robustly detect most malware. Since audit logs can easily be collected on enterprise networks and aggregated within SIEM databases, we believe our solution is deployable at reasonably low cost, though further work needs to be done to thoroughly test it in a real deployment environment.
[Figure 3: three panels, (A)–(C). Panels (A) and (B) are ROC curves (x-axis FPR, y-axis TPR); panel (C) is a bar plot of Fraction Detected.]

Figure 3: 10-fold validated ROC curves of malware that is detected by our classifier and is missed by antivirus engines. In each instance the model was trained only on antivirus-detected data, and validated only on malware missed by that engine. (A) Validation against specific antivirus engines: Kaspersky (black squares), McAfee (red x), and Symantec (blue triangles). (B) Validation against meta-engines: Kaspersky+McAfee+Symantec (black squares), McAfee+Symantec (blue x). (C) The fraction of malware detected, based on the VirusTotal score, for the composite (all three) engine, at two false positive rates (10^-2, left and blue, and 10^-3, right and yellow). Lower VirusTotal scores indicate malware that is harder to detect.