
2012 Tenth Annual International Conference on Privacy, Security and Trust

PDF Scrutinizer: Detecting JavaScript-based Attacks in PDF Documents
Florian Schmitt¹, Jan Gassen² and Elmar Gerhards-Padilla²
¹ University of Bonn - Institute of Computer Science 4
Friedrich-Ebert-Allee 144, 53113 Bonn, Germany
² Fraunhofer FKIE
Friedrich-Ebert-Allee 144, 53113 Bonn, Germany
{schmittf, gassen, padilla}@cs.uni-bonn.de

Abstract: PDF documents have long been part of everyday life for average computer users, corporate businesses and critical structures such as public authorities and the military. Because the format is so widespread and because out-of-date versions of PDF readers are quite common, PDF documents have become a popular malware distribution vector. In this context, malicious documents have useful features: they appear trustworthy, attacks can be camouflaged by inconspicuous document content, and they can still often download and install malware undetected by firewall and anti-virus software. In this paper we present PDF Scrutinizer, a malicious PDF detection and analysis tool. We use static as well as dynamic techniques to detect malicious behavior in an emulated environment. We evaluate the quality and the performance of the tool with PDF documents from the wild, and show that PDF Scrutinizer reliably detects current malicious documents while keeping a low false-positive rate and reasonable runtime performance.

Keywords: PDF, document malware, client-side exploits, malicious PDF

I. INTRODUCTION

Nowadays malicious software is a major issue of information technology. Since attacks using remote exploits have become rather difficult, client-side exploits are commonly used for carrying out cyber attacks. In particular, attacks using malicious PDF (Portable Document Format) documents are increasingly popular with malware writers and operators. As PDF documents became widespread in the population and the general acceptance of the format has grown, malware authors realized the potential of PDF for malware distribution [1].

PDF is an electronic document format maintained by Adobe Systems Inc., which allows the filing and exchange of documents regardless of the operating system. Dynamic content is enabled by the ability to embed JavaScript code in PDF documents. Although the format simplifies the handling of documents using computers, it also opens possibilities for client-side attacks. Thus, PDF documents are used for targeted and non-targeted attacks on private persons, business organizations and government agencies.

In July 2011, a malicious PDF document was found whose content was a call for papers for the "2012 AIAA Strategic and Tactical Missile Systems Conference" [2]. The included malware installs a backdoor in order to gain access to the victim's network. It must be assumed that this was a targeted attack against defense contractors or military authorities [3].

Non-targeted malicious PDF documents are often mass-mailed to web users via email [1]. With the help of social engineering tricks, the user is encouraged to open a malicious attachment. In [4], an example is given of a spam-based campaign using attached malicious PDF documents. The email messages contain, for instance, spoofed orders or invoices in order to induce the recipient to open the attached document. Another observed method, although not targeting a single person or institution, will apparently mostly affect enterprises: the email body is crafted to trick the victim into thinking that the email comes from a copier machine which automatically sends scanned documents, converted into PDF documents, via email [5]. There is a good chance that users who are accustomed to such a system will open the attachment.

Although several PDF readers exist, it is likely that Adobe Reader is the most widespread PDF document viewer, which is why attackers mostly focus on its vulnerabilities. In July 2011, a report was published stating that 60 % of the users of Adobe Reader are using vulnerable versions of the program [6]. Similarly, in [7] it is reported that although Adobe Reader is installed on 83 % of companies' machines, 56 % of those installations are out of date and thus vulnerable. Such a huge number of potential victims is a great incentive for criminals to increase the usage of PDF documents as a malware distribution strategy.

Although governments do not necessarily reveal internally used software, authorities often publish documents or forms on their websites, so it can be presumed that they do use PDF creation and reading software internally. "There's not a federal agency that does not use PDF. Acrobat software and Adobe PDF are key technologies in some capacity at all branches and levels of government, the military and virtually every agency," said Greg Pisocky, Business Development Manager at Adobe Systems Inc. [8].

These examples show that the use of PDF is both a relief and a threat for governments, companies and private users. Attacks can be executed without the victim becoming suspicious by masquerading malicious documents as relevant information. Additionally, since the client initiates the attack by opening a

978-1-4673-2326-0/12/$31.00 ©2012 IEEE


malicious document, security solutions such as firewalls generally do not prevent the assault. Once control over the PDF reader's process is taken, malware can be downloaded and installed onto the client's system.

In this paper we present PDF Scrutinizer, an analyzer that uses static and dynamic detection approaches to classify PDF documents. Common existing approaches often rely on signature-based detection techniques, where embedded code is scanned for known exploits, which is a fast way to detect malicious PDF documents. This approach, however, suffers from sophisticated obfuscation methods, which are used within malicious PDF documents. Hence, with PDF Scrutinizer, in addition to using signature-based techniques, embedded code is executed in a controlled analysis environment. With our technique, on the one hand, vulnerable objects are emulated to detect known attacks. On the other hand, dynamic heuristics are used to examine the operations of the interpreter during evaluation, with the aim of detecting malicious behavior.

In our evaluation, we will show that PDF Scrutinizer is able to detect current malicious PDF documents while maintaining a low false-positive rate. Furthermore, we will show that PDF Scrutinizer is able to process large amounts of PDF documents with reasonable runtime.

Today, PDF Scrutinizer focuses on JavaScript-based attacks. Still, it is possible to integrate detection modules for non-JavaScript-based exploits. In our evaluation, two modules for common non-JavaScript-based exploits were used.

II. PRINCIPLES

The Portable Document Format (PDF) is a format representing electronic documents, published by Adobe Systems in 1993. It was developed in response to the demand for a document format with which documents can be created, viewed, printed and exchanged reliably and independently of the environment [9]. Today, PDF is virtually the standard electronic document format for printable documents. JavaScript support was introduced with PDF version 1.3 and allows, for example, the usage of interactive forms, database communication or dynamic document content.

PDF is essentially a text format consisting of human-readable identifiers, but it also contains binary data, e.g., in the form of images, embedded files or encrypted data. The format is organized by referencing objects, each identified by an object number and a generation number. PDF documents allow incremental updates, whereby objects are superseded by newer generations. Each object is an instance of one of eight data types. One notable object type is called name, a unique identifier which starts with a slash.

A JavaScript action is an object holding either a string or a reference to a string or stream object containing JavaScript code. The PDF specification allows the evaluation of JavaScript actions at several points in time during the lifetime of a document within a PDF reader. For example, when a document is loaded, a special object, the document catalog, is examined. JavaScript actions referenced by the /OpenAction or /Names keys of the document catalog are executed. In addition, actions can be triggered after the document is loaded, for example when the user prints or closes the document, or even only after a certain page is displayed. This can be done by connecting the additional action name /AA of a page with a JavaScript action.

The capabilities of JavaScript code used in PDF documents are not limited to the core JavaScript features, but are enhanced by a diverse JavaScript for Acrobat API. Important objects offered by the API are the static app class, which encapsulates the reader application, and the doc object, which represents the currently opened document.

Malicious Techniques used in PDF Documents

The PDF format enables attackers to utilize various techniques allowing them to perform stealthy and effective attacks against the victim's PDF reader. One commonly used technique to prevent attacks from being detected is called obfuscation, which is explained below. Subsequently, the general procedure of exploiting vulnerabilities is explained in section II-B, and heap spraying, a method widely used in JavaScript-based attacks, is introduced in section II-C. Additionally, the malicious usage of embedded files is illustrated in section II-D.

A. Obfuscation

Obfuscation is a technique used to disguise the real content of a piece of information, or the real functionality of source code. A PDF document itself can be obfuscated to hide several properties of the document. In contrast, JavaScript code can be obfuscated to reduce its readability. In both cases, the motivation for obfuscation can be legitimate, for example to protect one's intellectual achievements, or malicious, for example to complicate static analysis of the document or code.

A basic code obfuscation method is the reduction of readability. For instance, the code is split up where possible and interspersed with useless multi-line comments. Another often used technique is the scrambling of identifiers, in which variable names and function names are replaced by long, semantically equivalent sequences which are difficult for humans to read. Dead code, which never gets executed, is often introduced to distract the analyst.

JavaScript allows evaluation of code at runtime, which enables the usage of more complex obfuscation techniques. For instance, code can be stored encrypted in a string, called the payload, and is deciphered and executed only at runtime. With the use of the Acrobat API it is furthermore unnecessary to include the payload in the script, because it can be distributed to locations within the document. For example, metadata fields of the document and even the words of the document can be, and are, used to store the payload.

B. Exploitation of Vulnerabilities

When exploiting vulnerabilities in PDF readers, the attacker's goal is to execute shellcode with the privileges of the reader's process. PDF exploits can be categorized into two kinds: JavaScript-based and non-JavaScript-based [1].
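As a minimal sketch of the runtime-evaluation obfuscation described in section II-A, and of why purely signature-based scanning struggles with it, the following Python example compares a static signature match against the plain and the encoded form of a script. It is an illustration under stated assumptions and not part of PDF Scrutinizer: the method name vulnerableMethod, the signature, the JavaScript helper decode and the use of Base64 as a stand-in for encryption are all hypothetical.

```python
import base64
import re

# Hypothetical signature: a call to a made-up API method name.
SIGNATURE = re.compile(r"vulnerableMethod\s*\(")


def matches_signature(code: str) -> bool:
    """Return True if the static signature occurs in the given code string."""
    return SIGNATURE.search(code) is not None


def obfuscate(script: str) -> str:
    """Wrap a script so that its text only appears after decoding at runtime.

    Base64 stands in for the encryption mentioned in the text; `decode` is a
    hypothetical JavaScript helper that the wrapper would call before eval.
    """
    payload = base64.b64encode(script.encode("ascii")).decode("ascii")
    return f'var p = "{payload}"; eval(decode(p));'


def code_seen_by_eval(wrapper: str) -> str:
    """Recover the string that would reach eval in the wrapper above."""
    match = re.search(r'var p = "([A-Za-z0-9+/=]*)"', wrapper)
    if match is None:
        raise ValueError("no payload string found")
    return base64.b64decode(match.group(1)).decode("ascii")


plain = 'vulnerableMethod("crafted argument");'
wrapped = obfuscate(plain)

print(matches_signature(plain))                       # signature checked on the plain script
print(matches_signature(wrapped))                     # signature checked on the encoded wrapper
print(matches_signature(code_seen_by_eval(wrapped)))  # signature checked on the string passed to eval
```

The third check mirrors the general idea of inspecting code at the moment it is evaluated rather than only as it appears in the file; a real payload may be nested several layers deep or scattered across the document, which this sketch does not attempt to handle.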

JavaScript-based exploits attack methods of the JavaScript for Acrobat API whose implementations contain security flaws. To trigger an exploit, a vulnerable method is called with crafted arguments. The aim is to overwrite the method's return address so that the control flow is changed. In order for shellcode to be executed, the process memory of the PDF reader is prepared by a technique called heap spraying.

With non-JavaScript-based exploits, for example, embedded Flash files are used to conduct heap spraying, and afterwards a vulnerability unrelated to the JavaScript engine is exploited. There are even known vulnerabilities where heap spraying is unnecessary and the control flow can instead be redirected directly to shellcode embedded in one of the document's objects.

C. Heap Spraying

To place shellcode in the PDF reader's process memory, it can be assigned as a Unicode string either to a variable or as an array value. In both cases, the shellcode is stored in the heap area. Since the exact location of the shellcode in memory is unknown, heap spraying fills the heap space with multiple blocks, each consisting of a NOP sled and shellcode. When an exploit takes place and a return address is overwritten with an address that points approximately into the sprayed memory region, the chance is high that the program counter is set to an address located in one of the NOP sleds, leading to shellcode execution [10].

D. Malicious Embedded Files

The PDF format allows embedding of arbitrary files into documents. This feature is also used by malware operators to disguise malicious content and to exploit additional vulnerabilities. For instance, Adobe Reader can display embedded Flash programs directly. This allows exploitation not only of the PDF reader program, but also of the integrated Flash Player, as shown in [11] and [12].

Another particularly interesting technique, named nested PDF, consists of embedding a malicious PDF document into a benign document [13]. The top-level document can be constructed in a way that the embedded document is loaded instantaneously; even multi-layered constructions are possible. With this approach, the malicious content is encapsulated in a single PDF object.

III. RELATED WORK

There are a number of existing malicious PDF analysis tools. In the following we present three dynamic and one purely static analysis tool, and compare each one with PDF Scrutinizer.

Wepawet is a web service for analysis and detection of web-based malware [14] [15]. It lists found exploits, evaluated JavaScript code, potentially detected shellcode and found malware. Wepawet uses anomaly detection and machine learning approaches to classify PDF documents. Like PDF Scrutinizer, Wepawet uses an emulated environment to execute extracted JavaScript code. During the execution, dynamic information is logged and features are extracted. With Wepawet, the authors presented ten features, which include, for instance, the number of bytes allocated through string operations, the number of likely shellcode strings and the values of attributes and parameters in method calls [15]. Using a known-good dataset they learned anomaly thresholds, which are used to classify newly analyzed documents. Since the internals of the analysis environment are non-public, it is unknown how much and which parts of the Acrobat API are currently emulated in Wepawet. In contrast to Wepawet, PDF Scrutinizer does not use a machine learning approach in which general information about the execution is collected and tested against learned thresholds. Instead, the occurrence of a small number of events which likely represent malicious operations induces a malicious classification in PDF Scrutinizer.

In [16], a standalone PDF document analyzer called MDScan is presented. For execution of extracted JavaScript code, the authors modified the JavaScript engine SpiderMonkey, for which they emulated parts of the JavaScript for Acrobat API. To detect malicious documents, MDScan dynamically analyzes used string variables for the presence of shellcode. Because this approach does not rely on previously known vulnerabilities, they claim to be able to detect malicious documents which exploit unknown vulnerabilities in PDF readers. On the other hand, concrete, known vulnerabilities cannot be identified, so details about the attack, such as CVE-IDs, are not given. In addition, they rely solely on detection of shellcode. Even if most malicious PDF documents contain shellcode, with this approach the detection rate depends exclusively on the quality of the used shellcode detector. Furthermore, shellcode is not necessarily stored in a string variable, but can also be wrapped in an array, or can consist of multiple arrays. PDF Scrutinizer uses multiple heuristics to be able to detect a broader range of malicious operations. In addition, to enable the identification of concrete vulnerabilities, known vulnerable methods are emulated.

A purely static approach using machine learning is implemented in PJScan [17]. Here, JavaScript code is extracted and subjected to a lexical analysis to determine features. Afterwards, classifications are made by comparing new samples with previously learned thresholds. Since the code is not executed, dynamically constructed and evaluated scripts are not available and only top-level code can be considered. Compared to the previously presented tools and PDF Scrutinizer, PJScan is distinguished by significantly higher performance, because no code has to be executed. Drawbacks of this approach are that no information about, and no identification of, concrete attacks can be given. Additionally, it has a high false-positive rate of about 16 % when testing benign PDF documents that include JavaScript code.

IV. IMPLEMENTATION

A. Overview

Because of the widespread use of PDF, a huge number of PDF documents are in circulation. The major demand on automated malware processing software is a preferably reliable

[Fig. 1. Functionality of PDF Scrutinizer. Labels recoverable from the figure: PDF document; PDF Scrutinizer; classifications malicious, suspicious, benign; outputs JavaScript code, detected shellcode, found exploits, triggered heuristics, embedded files.]

[Fig. 2. Architecture of PDF Scrutinizer. Labels recoverable from the figure: parsing, parsed document, extract JavaScript, execute JavaScript, API emulation, dynamic heuristics; components PDFBox, Rhino, libemu.]

classification, so that manual analysis becomes unnecessary. Thus, the classifications malicious, suspicious and benign were chosen. As a result, the number of documents that need to be manually examined is reduced to the documents which are classified as suspicious.

The main functionality of PDF Scrutinizer is the classification of PDF documents into these three categories. To support the analyst during the process of examining suspicious PDF documents, PDF Scrutinizer not only displays the resulting classification, but also provides further information on the reasons for the classification. For example, if any known exploits were used in the document, their CVE-IDs are given, and if any heuristics were triggered, they are indicated. All occurring JavaScript code is saved, as well as any found shellcode. Additionally, all embedded files are extracted and saved for later analysis. To overcome the nested PDF technique, any embedded PDF documents are saved and afterwards analyzed separately by PDF Scrutinizer. Figure 1 illustrates the main functionality of the tool. In the following, the main activities of PDF Scrutinizer are described.

1) Parsing: At first the document is loaded and parsed. All PDF objects are analyzed and saved in memory, so that subsequent steps can access them. Because malicious PDF documents are often malformed and corrupt, the parsing must not be limited to what the PDF specification permits. Instead, the parsing component should try to extract PDF objects at all costs. In PDF Scrutinizer, we focused on simulating the way Adobe Reader parses documents, because it is the most widespread reader with the most known vulnerabilities. Because of ambiguities in the PDF specification and the number of different readers, which run on different platforms, including mobile devices and browser-integrated PDF readers, this approach is not sufficient to achieve an optimal detection rate. It would thus be conceivable in the future to implement multiple instances which emulate different readers. However, this approach would certainly increase the analysis time significantly.

2) Extraction of actions: The next step is the extraction of JavaScript actions. PDF Scrutinizer first tries to find actions the same way any PDF reader does. The document catalog dictionary is examined to see whether an /OpenAction is registered. Furthermore, the document's catalog can store a reference to the /Names array. This array is scanned for included JavaScript actions. Since, for example, any page or annotation can store an additional action referencing a JavaScript action, every PDF object which contains the additional action name (/AA) is analyzed. JavaScript execution can also be initiated using AcroForms, so the document is searched for those. If any AcroForms are found, the encapsulated JavaScript code is extracted. All encountered JavaScript code is collected, saved for later analysis and processed in the next step.

3) Execution of actions: The extracted JavaScript code is executed in a modified JavaScript engine in which parts of the JavaScript for Acrobat API are emulated. To provide certain API methods, it is necessary that their implementations access the loaded PDF document. Only in this way are they able to return the correct values, which are needed for the code to continue error-free execution and expose the real, malicious functionality. The API emulation is described in more detail in section IV-C. To detect malicious behavior, the modified interpreter keeps track of operations and variable values during evaluation of code. To this end, several dynamic heuristics were developed which monitor the control flow for malicious operations; they are explained individually in section IV-E.

B. Architecture

PDF Scrutinizer is a Java library which extends and connects existing components. Apache PDFBox [18] is used as the interface to the loaded PDF document; every interaction with the document is performed through the library. PDFBox was selected because it implements a large part of the PDF specification and because the included PDF parser works well, even with malformed documents. However, PDFBox was modified in several ways to emulate the parsing methodology of Adobe Reader and to improve the processing of malicious documents.

To execute JavaScript code which was extracted from the PDF document, Mozilla Rhino [19] is utilized. Rhino only supports the core features of JavaScript and does not include the JavaScript for Acrobat API. In order to be able to execute Acrobat JavaScript nevertheless, parts of the API were emulated and details in the behavior of the interpreter were modified. For this purpose, Rhino allows adding custom objects into the context of the interpreter. Thus, it is possible to emulate API methods by implementing their functionality and, if required, accessing the document through PDFBox.

Malicious PDF documents commonly rely on shellcode

in order to execute malicious commands within the reader's process. Thus, recognition of shellcode can be used to detect potential attacks within the examined document. This can be achieved by using libemu [20], which is a shellcode detection and analysis library. It tries to detect shellcode by searching for GetPC code, which is used in a lot of shellcode to find the current value of the program counter. In PDF Scrutinizer, libemu is used to analyze variable values for the presence of shellcode during execution.

The architecture and the interaction of the components in PDF Scrutinizer are sketched in figure 2.

C. API Emulation

To successfully execute JavaScript actions which were extracted from PDF documents, the JavaScript for Acrobat API has to be emulated. Malicious JavaScript using the Acrobat API commonly only uses a subset of all available methods. Since the entire Acrobat API is rather complex, only the most frequently used methods, based on observation of the available samples, were emulated in PDF Scrutinizer. The emulated API can be extended easily.

Essential parts of the emulated API correspond to the JavaScript for Acrobat API and thus provide compliant results. For instance, the methods doc.getAnnots and doc.getPageNthWord were implemented according to the specification. With these methods, early-stage JavaScript code extracts shellcode or JavaScript code that is, for example, located in the document's annotations or text. It is important to note that this part contributes significantly to the success of the further detection methods. If the emulation falls short in this early loading phase, further malicious code is not correctly constructed and never gets executed, so the heuristics would not be able to detect the attack. Since encrypted JavaScript code or shellcode can also be stored in the document's metadata, it is important to make the actual metadata values available to the engine. In our emulation, corresponding to the JavaScript for Acrobat API, these properties are encapsulated in an object accessible through doc.info.

Apart from methods which conform to the API, methods known to be vulnerable are additionally emulated in order to detect the usage of known exploits. Known vulnerable methods are prone, for example, to buffer-overflow or use-after-free attacks in the emulated reader version. In PDF Scrutinizer, these methods are of course not actually vulnerable, but they analyze the passed parameters for crafted values known to trigger exploits. This way, it is feasible to detect specific known vulnerabilities a malicious document is trying to exploit. We included the methods which are vulnerable to our knowledge.

In addition, stub methods without any functionality are needed so that the engine does not abort if any JavaScript code invokes them. If such a method is not expected to deliver a return value, the return value can simply be omitted. If a return object is needed, we supply a mock object which contains at least the correct data structure and default values. We further provided the ability to modify the JavaScript engine directly, in a way that it acts more tolerantly and does not throw exceptions if invoked methods are non-existent. Although this modification enabled certain scripts to execute completely, even if methods were called that were not covered in the context, there were negative side effects. We encountered malicious PDF documents which use exception handling constructs to detect simulated environments. For example, a non-existent method is called on purpose and the malicious behavior is only exposed within the error handler. With the proposed modification activated, the error handler would not be called and thus the malicious code would not be executed. Therefore, we disabled this extension during evaluation.

The JavaScript for Acrobat API provides means to execute scripts asynchronously. Given scripts are executed either once after a given period has elapsed or periodically. In malicious documents, asynchronous methods are used, for instance, to delay malicious operations until the document is rendered. This is done because a victim could become suspicious if, after loading a document, the reader is not responding and is not showing the content.

Asynchronous methods can also be utilized to detect emulated environments which, for simplicity, do not delay the script execution but execute given scripts instantaneously. Hence, asynchronous API methods were implemented in PDF Scrutinizer. For example, a script which changes a variable can be delayed, and subsequent code can easily determine, by checking the value of the variable, whether the script was executed instantly or is still being delayed. With this technique, the script can expose an emulated environment and avoid execution of malicious code to remain undetected.

This is a general issue of the emulation approach: since only a subset of the API methods is available, a document could detect that it is running in an analysis environment by invoking uncommonly used API methods and checking the return value for plausibility. With this information, a malicious document could disguise its malicious behavior to bypass detection, or it could go into an endless loop. In order to identify the latter, a basic endless loop detection has been developed; when it triggers, the document is marked as suspicious.

D. Static Heuristics

Static heuristics treat source code as a string and analyze it using string analysis, with the goal of discovering malicious or suspicious elements in the code. Although this approach can be outsmarted using obfuscation or dynamic evaluation of code, it provides a fast way to test for signatures or to detect suspicious strings. In PDF Scrutinizer, the developed static heuristics process all extracted JavaScript actions, as well as any code which is sent to the eval method of the interpreter. Additionally, any code that is executed by asynchronous methods is forwarded to the static heuristics. In the following, the developed static heuristics are introduced.

RegexMalicious includes a list of signatures which are based on regular expressions. The code is checked for occurrences of the signatures. If any signatures are found, the document is marked as malicious. Unfortunately,

no source of signatures tailored to malicious Acrobat JavaScript is publicly available. Thus, a new list containing vulnerable method calls, including parameters which are often used by exploits, was created based on a set of malicious PDF documents containing known attacks.

RegexSuspicious is similar to the first static heuristic, but in contrast, it marks the document as suspicious when a match occurs. Currently, the set of regular expressions consists of words which are used in some malicious JavaScript code as variable names, e.g. "exploit", "shellcode" or "heapspray".

VulnerableMethodCalls uses a list of known vulnerable API methods and examines how many of them are found in the code. If the code string contains any vulnerable method calls, it is marked as suspicious.

E. Dynamic Heuristics

To be able to detect malicious documents even if they use unknown vulnerabilities, dynamic heuristics were implemented. In PDF Scrutinizer, the JavaScript interpreter was modified in order to monitor certain variable values and to intercept certain bytecode instructions. This way, the relevant information is forwarded to the heuristics during execution.

1) StringLengthTester: Long strings in malicious JavaScript code usually occur when constructing NOP sleds later utilized in heap spraying. Therefore, the StringLengthTester heuristic tests used variables for an excessive number of characters. If the length of a used string exceeds a configurable threshold, the document is marked as malicious. Evaluation indicates that 100,000 characters is a suitable default threshold.

Strings are only examined at one point in Rhino's interpreter: the assignment of variables. It must be mentioned that this point is not only reached at the first assignment of a variable. Since in JavaScript every string change leads to the assignment of a new string, this point is reached whenever a string modification takes place.

2) HeapSprayDetector: Heap spraying is an essential technique used in most JavaScript-based attacks in PDF documents and overcomes the difficulty of determining the exact location of shellcode in memory. Hence, when exploiting a vulnerability, the overwritten return address only needs to be roughly in the right memory region. Usually, with heap spraying, strings consisting of NOP sled and shellcode are
analyzed whether the code tries to add multiple identical, large data blocks into an array, which is a strong sign of heap spraying. If so, the strings are not added to the array, which is why the blocks are not written into memory, leading to better runtime performance. At the same time, the heuristic is triggered and the document is marked as malicious.

3) ShellcodeTester: Shellcode commonly downloads and installs malware or, in case a malware binary is included in or appended to the PDF document, constructs an executable and installs it. Since installing malware is usually the attacker's goal, shellcode is included in virtually every malicious PDF document. Shellcode used within malicious JavaScript can either be stored in variables declared in the source code or be built at runtime. Either way, the shellcode is eventually stored in a string or array variable. Thus, with the ShellcodeTester heuristic, used variables are observed and tested with the shellcode detection and analysis library libemu. Shellcode instructions are commonly encoded using percent-encoding, whereby, for instance, the two opcodes 0x90 and 0x90 are given by the string "%u9090". The unescape method decodes the percent-encoded string and returns a string containing only the encoded bytes. That is why the result of the unescape method is additionally processed by the heuristic.

Since shellcode detection is rather slow, several techniques were utilized to reduce the number of strings which need to be tested. For instance, only strings whose length lies between a reasonable lower and upper bound are tested. In addition, strings which frequently contain the sequence "%u" are most likely percent-encoded shellcode instructions which will be unescaped later, so these strings are not processed by libemu.

If shellcode is detected in the remaining strings, the document is marked as malicious, the shellcode binary is stored and the libemu profile, which lists the shellcode's API calls, is saved. With the help of some modifications to libemu, potentially included malware executables are automatically saved during the analysis.

V. EVALUATION

To measure the performance of PDF Scrutinizer, in particular the false-positive and false-negative rates, sets of malicious and benign samples were analyzed. The used sets [21] contain 6,054 benign and 11,278 malicious samples, collected from email attachments and web sites.

For the evaluation, both sets were successively processed by
added to an array, effectively inflating the heap segment with PDF Scrutinizer, and the results were analyzed. The benign
large amounts of data. sample set was mainly utilized, to acquire the false positive
When executing code, which utilizes heap spraying, in rate, which is an important metric for an automated analysis
a JavaScript engine, the memory is filled with such large tool. Additionally, the performance of PDF Scrutinizer, when
blocks of nop-sled and shellcode, leading to a bad runtime analyzing benign samples was evaluated, which is a crucial
performance. The ambition to prevent heap spraying from factor for the practial suitability of the tool. When using as a
happening, led to the development of a feature, which can honeyclient in connection with a web crawler, large amounts
detect heap spraying and at the same time stop inflation of the of PDF documents need to be processed, from which likely
heap segment. With this feature, heap spraying is detected and most are benign in practise. Thus, the performance, when
avoided which reduces the analysis time significantly. analyzing benign PDF documents, has a large impact on the
The HeapSprayDetector feature hooks into the operation overall performance. Using the malicious sample set, the false
of the interpreter where elements are added to arrays. It is negative rate of PDF Scrutinizer was evaluated.
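The string-based dynamic heuristics of Section IV-E can be illustrated with a short Python sketch. It is not PDF Scrutinizer's implementation (the tool hooks into Rhino's interpreter and hands candidate strings to libemu); it only mirrors the decisions described in the text: the length check of StringLengthTester with the reported default of 100,000 characters, the decoding of "%u" sequences as performed by unescape, and the pre-filter that decides which strings are worth passing to a shellcode detector. The bounds MIN_CANDIDATE_LEN and MAX_CANDIDATE_LEN, the ratio PERCENT_U_RATIO and all function names are assumptions made for illustration, since the text only speaks of reasonable bounds and of strings that frequently contain "%u".

```python
import re

# Default threshold reported in the text for StringLengthTester.
STRING_LENGTH_THRESHOLD = 100_000

# Hypothetical values: the text only mentions "reasonable" bounds and
# strings that "frequently" contain "%u".
MIN_CANDIDATE_LEN = 32
MAX_CANDIDATE_LEN = 65_536
PERCENT_U_RATIO = 0.5

_PERCENT_U = re.compile(r"%u([0-9a-fA-F]{4})")


def is_excessively_long(value: str, threshold: int = STRING_LENGTH_THRESHOLD) -> bool:
    """StringLengthTester idea: flag strings whose length exceeds the threshold."""
    return len(value) > threshold


def decode_percent_u(encoded: str) -> bytes:
    """Decode %uXXXX sequences into bytes.

    Simplification: only %uXXXX sequences are handled and all other
    characters are ignored. Each sequence is one 16-bit unit, emitted low
    byte first, as it would be laid out in memory on little-endian x86.
    """
    out = bytearray()
    for match in _PERCENT_U.finditer(encoded):
        out += int(match.group(1), 16).to_bytes(2, "little")
    return bytes(out)


def is_mostly_percent_encoded(value: str) -> bool:
    """True if %uXXXX sequences (6 characters each) cover most of the string."""
    if not value:
        return False
    covered = 6 * len(_PERCENT_U.findall(value))
    return covered / len(value) >= PERCENT_U_RATIO


def should_scan_for_shellcode(value: str) -> bool:
    """Pre-filter: decide whether a string is handed to the shellcode detector."""
    if not MIN_CANDIDATE_LEN <= len(value) <= MAX_CANDIDATE_LEN:
        return False
    # Still-encoded strings are skipped; their unescaped form is tested later.
    return not is_mostly_percent_encoded(value)
```

For example, decode_percent_u("%u9090") yields the two bytes 0x90 0x90, while a string consisting almost entirely of "%u" sequences is skipped by should_scan_for_shellcode because its unescaped form will be examined once unescape has been called.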

TABLE I
ANALYSIS RESULT OF BENIGN SAMPLES

                                Count   Percentage
Overall samples                  6054   100.00 %
Processed without error          6039    99.75 %
Contained JavaScript actions      229     3.78 %
Analysis result
  Malicious                         0     0.00 %
  Suspicious                        3     0.05 %
  Benign                         6036    99.70 %
  Error                            15     0.25 %

Table I gives the analysis result of the benign sample set. The false-positive rate is zero, which is an optimal result. Three documents are falsely classified as suspicious because they use the vulnerable method doc.getAnnots in a legitimate manner. Fifteen documents exited with an error, which equals 0.25 % of the set. These errors are the result of missing mock objects and of the use of the XML Forms Architecture (XFA), which is currently not emulated in PDF Scrutinizer. In both cases, the embedded JavaScript cannot be completely executed without errors. JavaScript actions were found in 3.78 % of the documents. This corresponds to the observation that harmless documents rarely use JavaScript.

TABLE II
ANALYSIS RESULT OF MALICIOUS SAMPLES

                                Count   Percentage
Overall samples                 11278   100.00 %
Processed without error         11196    99.27 %
Contained JavaScript actions    10650    97.57 %
Analysis result
  Malicious                     10125    89.78 %
  Suspicious                      628     5.57 %
  Benign                          443     3.93 %
  Error                            82     0.73 %

The analysis result of the malicious sample set is shown in Table II. A total of 97.57 % of the malicious documents contained JavaScript code, which shows that most current malicious PDF documents utilize JavaScript-based attacks. PDF Scrutinizer processed 99.27 % of the malicious documents without error. Of the documents in the malicious data set, 89.78 % are correctly classified as malicious. This shows that the different detection methods together achieve a high detection rate on the samples used. From the malicious data set, 628 documents, corresponding to 5.57 % of the set, are categorized as suspicious. The reasons why these documents were not classified as malicious may be diverse. For example, the contained code could use mechanisms to detect the emulated analysis environment and disguise its malicious behavior. It is also conceivable that documents use parts of the PDF specification that are not correctly emulated. Lastly, errors during the execution of JavaScript code can lead to a premature end of the execution and thus merely to a suspicious classification. The remaining 3.93 % of the set are classified as benign. These documents could, on the one hand, be documents that use non-JavaScript-based vulnerabilities, which are at present not detected by PDF Scrutinizer and are therefore not correctly classified. Currently, PDF Scrutinizer has only rudimentary detection mechanisms for non-JavaScript-based attacks. On the other hand, false-negative detection results could stem from documents from which, erroneously, no code could be extracted. Another possibility is that some documents were malformed in such a way that they could not be successfully parsed using PDFBox. It cannot be excluded that the documents were damaged during a data transfer, as the original source is unknown. Finally, there is no certainty that the remaining documents are malicious at all until they have been examined by hand.

TABLE III
ANALYSIS PERFORMANCE

                              Benign          Malicious
Amount of Samples             6054            11278
Analysis Time                 5 min 33 sec    14 hours 2 min
Average Time per Sample       0.06 sec        4.48 sec
Average Samples per Minute    1100.73         13.39
Average Throughput            310 MB/min      0.33 MB/min

The performance achieved during the evaluation of the two sets is shown in Table III. With the benign sample set, the average time needed to process one document is 0.06 sec, which corresponds to about 1100 analyzed documents per minute. The relatively low analysis time comes from the fact that many benign documents do not contain JavaScript code that has to be executed, which is a time-consuming task. So, when used within a client honeypot that crawls the web for malicious PDF documents, PDF Scrutinizer can process large amounts of PDF documents per day (at the measured rate for benign documents, on the order of 1.5 million). During the evaluation of the malicious sample set, an average analysis time of 4.48 sec per sample was observed. Given that most malicious documents use JavaScript, this shows that the need to execute code increases the runtime substantially. Furthermore, malicious documents often perform computationally expensive operations such as decryption, and the dynamic heuristics observe the engine during execution, which also lowers the performance. Additionally, shellcode detection is a time-consuming task: about 17 % of the analysis time was consumed by libemu.

Heuristics

This section provides an evaluation of the implemented heuristics. It was investigated how often each heuristic was triggered during the analysis of the documents that were classified as malicious. The results are shown in Table IV.

The RegexMalicious heuristic, which uses signatures of vulnerable methods and parameters, achieved a rate of 82.24 %. Since obfuscation techniques are commonly used and the set of regular expressions contained only six signatures at the time of analysis, this can be considered a promising result. By improving the number and the quality of the signatures, the rate could be further enhanced.
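The signature-based static classification can be sketched as follows. The sketch is an illustration in Python, not the tool's code: the suspicious words are the ones named earlier in the description of the static heuristics ("exploit", "shellcode", "heapspray"), whereas the single entry in MALICIOUS_SIGNATURES is a hypothetical signature of a method call with a conspicuous parameter and is not one of the six signatures used in the evaluation. The function name classify_source is likewise an assumption.

```python
import re

# Words named in the text as typical variable names in malicious code.
SUSPICIOUS_WORDS = [
    re.compile(word, re.IGNORECASE)
    for word in ("exploit", "shellcode", "heapspray")
]

# Hypothetical signature (not taken from the paper): a method call whose
# format string requests an implausibly large field width.
MALICIOUS_SIGNATURES = [
    re.compile(r"util\.printf\s*\(\s*[\"']%\d{4,}"),
]


def classify_source(code: str) -> str:
    """Classify extracted JavaScript source by static pattern matching."""
    if any(sig.search(code) for sig in MALICIOUS_SIGNATURES):
        return "malicious"
    if any(word.search(code) for word in SUSPICIOUS_WORDS):
        return "suspicious"
    return "benign"
```

A string that is only assembled at runtime, such as "she" + "llcode", matches neither list, which illustrates why obfuscation limits what purely static signatures can achieve and why the dynamic heuristics are needed in addition.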

TABLE IV
TRIGGERING FREQUENCY OF THE HEURISTICS

                                Count   Percentage
Malicious classified samples    10125   100.00 %
Heuristic triggered
  RegexMalicious                 8327    82.24 %
  StringLengthTester             8756    86.48 %
  HeapSprayDetector              9868    97.46 %
  ShellcodeTester                7482    73.90 %

The highest rate is achieved by the HeapSprayDetector heuristic. This heuristic was triggered by 97.46 % of the malicious documents, which shows that heap spraying is used by the majority of malicious PDF documents and that it can be reliably detected using this heuristic. Excessively long strings were detected in 86.48 % of the documents using the StringLengthTester heuristic. Again, this is an indicator of the presence of malicious code. In 73.90 % of the documents in the malicious data set, shellcode was found using the ShellcodeTester heuristic. The result depends on the quality of the shellcode detector used and can vary. As shellcode is included in virtually every malicious document that uses JavaScript-based attacks, this rate should be improvable.

In summary, the heuristics achieve a high detection rate within the samples classified as malicious. This is an indicator that novel documents that use as yet unknown vulnerabilities should be reliably detectable using PDF Scrutinizer, as long as they exhibit the characteristics observed by these heuristics. Also noteworthy are the low false positive rate and the considerable performance observed during the evaluation of the benign sample set.

VI. FUTURE WORK

A limitation of the emulation approach is that the analysis becomes more complicated as soon as malicious document writers start to use a larger proportion of the JavaScript for Acrobat API. As emulated environments always differ in some way from the original, malicious PDF document authors can and will use such differences to detect simulated systems. Thus, enhancing the emulation quality is an ongoing process, with the aim of improving the flawless execution of Acrobat JavaScript and, with that, the detection rate of malicious documents.

VII. CONCLUSION

In this paper, we presented PDF Scrutinizer, a PDF document analysis tool that uses static and dynamic detection mechanisms to recognize malicious PDF documents. The attempt to emulate a PDF reader's behavior has proven to be promising. PDF Scrutinizer showed a detection rate of about 90 %, whereas false-positive classifications did not occur during our evaluation.

Since code contained in PDF documents is executed, obfuscation techniques, which hinder reliable static detection, are overcome. During the execution, details about the behavior are collected, which often make a manual analysis unnecessary. If a known attack is performed, PDF Scrutinizer can often recognize the CVE-IDs of the vulnerabilities used. Even if a document tries to exploit an unknown vulnerability, chances are high that PDF Scrutinizer still detects the attack using dynamic heuristics. This is particularly interesting for using the tool with a client honeypot, because this way, novel exploits could be identified at an early stage.

REFERENCES

[1] K. Selvaraj and N. F. Gutierrez, "The Rise of PDF Malware," 2010. [Online]. Available: http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/the_rise_of_pdf_malware.pdf
[2] M. Hypponen, "Military Targets," 2011. [Online]. Available: http://www.f-secure.com/weblog/archives/00002203.html
[3] The H Security, "Targeted attacks on arms manufacturers continue," 2011. [Online]. Available: http://www.h-online.com/security/news/item/Targeted-attacks-on-arms-manufacturers-continue-1283425.html
[4] P. O. Baccas, "Who ordered spam? New trick in PDF malware uncovered," 2011. [Online]. Available: http://nakedsecurity.sophos.com/2011/04/18/orders-spam-new-trick-in-pdf-malware/
[5] E. Balunsat, "How PDF files hide malware Example PDF scan from Xerox," 2011. [Online]. Available: http://blog.commtouch.com/cafe/malware/how-pdf-files-hide-malware-example-pdf-scan-from-xerox/
[6] Avast Software, "Six out of every ten users run vulnerable versions of Adobe Reader," 2011. [Online]. Available: http://public.avast.com/mkt/20110713_6_out_of_10_with_vulnerable_PDF.pdf
[7] M. Sutton, J. Sobrier, M. Geide, P. Kulkarni, and U. Wanve, "State of the Web - Quarter 2, 2011 Report," 2011. [Online]. Available: http://www.zscaler.com/pdf/Zscaler-Labs-State-of-the-Web-2011Q2.pdf
[8] D. Johnson, "PDF in Government," 2007. [Online]. Available: http://www.appligent.com/2007-02-02
[9] Adobe Systems Incorporated, "ISO 32000-1:2008. Document management – portable document format – part 1: PDF 1.7," Tech. Rep., Jul. 2008. [Online]. Available: http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
[10] A. Sotirov, "Heap Feng Shui in JavaScript," 2007. [Online]. Available: http://www.blackhat.com/presentations/bh-europe-07/Sotirov/Whitepaper/bh-eu-07-sotirov-WP.pdf
[11] L. Zeltser, "How to Extract Flash Objects from Malicious PDF Files," 2011. [Online]. Available: http://computer-forensics.sans.org/blog/2011/05/04/extract-flash-from-malicious-pdf-files
[12] S. Porst, "A brief analysis of a malicious PDF file which exploits this weeks Flash 0-day," 2010. [Online]. Available: http://blog.zynamics.com/2010/06/09/analyzing-the-currently-exploited-0-day-for-adobe-reader-and-adobe-flash
[13] G. Delugré, "An approach to PDF shielding," 2010. [Online]. Available: http://esec-lab.sogeti.com/post/2010/09/01/An-approach-to-PDF-shielding
[14] University of California, Santa Barbara, "Wepawet." [Online]. Available: http://wepawet.cs.ucsb.edu/
[15] M. Cova, C. Kruegel, and G. Vigna, "Detection and analysis of drive-by-download attacks and malicious JavaScript code," 2010. [Online]. Available: http://www.cs.ucsb.edu/~vigna/publications/2010_cova_kruegel_vigna_Wepawet.pdf
[16] Z. Tzermias, G. Sykiotakis, M. Polychronakis, and E. P. Markatos, "Combining static and dynamic analysis for the detection of malicious documents," 2011. [Online]. Available: http://www.syssec-project.eu/media/page-media/3/mdscan-eurosec11.pdf
[17] P. Laskov and N. Šrndić, "Static detection of malicious JavaScript-bearing PDF documents," 2011. [Online]. Available: http://doi.acm.org/10.1145/2076732.2076785
[18] The Apache Software Foundation, "Apache PDFBox - Java PDF Library." [Online]. Available: http://pdfbox.apache.org/
[19] Mozilla Foundation, "Rhino: JavaScript for Java." [Online]. Available: http://www.mozilla.org/rhino/
[20] P. Baecher and M. Koetter, "libemu - x86 Shellcode Emulation." [Online]. Available: http://libemu.carnivore.it/
[21] M. Parkour, "Version 4 April 2011 - 11,355+ Malicious documents - archive for signature testing and research," 2011. [Online]. Available: http://contagiodump.blogspot.com/2010/08/malicious-documents-archive-for.html
