PDF Scrutinizer: Detecting JavaScript-based Attacks in PDF Documents
Abstract: For a long time, PDF documents have been part of the everyday life of the average computer user, corporate businesses, and critical structures such as authorities and the military. Because the format is so widespread and because out-of-date versions of PDF readers are quite common, using PDF documents has become a popular malware distribution strategy. In this context, malicious documents have useful features: they appear trustworthy, attacks can be camouflaged by inconspicuous document content, and yet they can often download and install malware undetected by firewall and anti-virus software. In this paper we present PDF Scrutinizer, a malicious PDF detection and analysis tool. We use static as well as dynamic techniques to detect malicious behavior in an emulated environment. We evaluate the quality and the performance of the tool with PDF documents from the wild, and show that PDF Scrutinizer reliably detects current malicious documents while keeping a low false-positive rate and reasonable runtime performance.

Keywords: PDF, document malware, client-side exploits, malicious PDF

I. INTRODUCTION

Nowadays malicious software is a major issue in information technology. Since attacks using remote exploits have become rather difficult, client-side exploits are commonly used for carrying out cyber attacks. In particular, attacks using malicious PDF (Portable Document Format) documents are increasingly popular with malware writers and operators. As PDF documents became widespread and the general acceptance of the format grew, malware authors realized the potential of PDF for malware distribution [1].

PDF is an electronic document format maintained by Adobe Systems Inc. that allows the filing and exchange of documents regardless of the operating system. Dynamic content is enabled by the ability to embed JavaScript code in PDF documents. Although the format simplifies the handling of documents using computers, it also opens possibilities for client-side attacks. Thus, PDF documents are used for targeted and non-targeted attacks on private persons, business organizations and government agencies.

In July 2011, a malicious PDF document was found whose content was a call for papers for the "2012 AIAA Strategic and Tactical Missile Systems Conference" [2]. The included malware installs a backdoor in order to gain access to the victim's network. It must be assumed that this was a targeted attack against defense contractors or military authorities [3].

Non-targeted malicious PDF documents are often mass-mailed to web users via email [1]. With the help of social engineering tricks, the user is encouraged to open a malicious attachment. In [4], an example is given of a spam-based campaign using attached malicious PDF documents. The email messages contain, for instance, spoofed orders or invoices in order to induce the recipient to open the attached document. Another observed method, although not targeting a single person or institution, will apparently mostly affect enterprises: the email body is crafted to trick the victim into thinking that the email comes from a copier machine that automatically sends scanned documents, converted into PDF documents, via email [5]. There is a good chance that users who are accustomed to such a system will open the attachment.

Although several PDF readers exist, Adobe Reader is likely the most widespread PDF document viewer, which is why attackers mostly focus on its vulnerabilities. In July 2011, a report was published stating that 60 % of the users of Adobe Reader are using vulnerable versions of the program [6]. Similarly, [7] reports that although Adobe Reader is installed on 83 % of companies' machines, 56 % of those installations are out of date and thus vulnerable. Such a large number of potential victims is a great incentive for criminals to increase the use of PDF documents as a malware distribution strategy.

Although governments do not necessarily reveal which software they use internally, authorities often publish documents or forms on their websites, so it can be presumed that they use PDF creation and reading software internally. "There's not a federal agency that does not use PDF. Acrobat software and Adobe PDF are key technologies in some capacity at all branches and levels of government, the military and virtually every agency," said Greg Pisocky, Business Development Manager at Adobe Systems Inc. [8].

These examples show that the use of PDF is both a relief and a threat for governments, companies and private users. Attacks can be executed without the victim becoming suspicious by masquerading malicious documents as relevant information. Additionally, since the client initiates the attack, by opening a
JavaScript-based exploits attack methods of the JavaScript for Acrobat API whose implementations contain security flaws. To trigger an exploit, a vulnerable method is called with specially crafted arguments. The aim is to overwrite the method's return address so that the control flow is changed. In order for shellcode to be executed, the process memory of the PDF reader is prepared by a technique called heap spraying.

With non-JavaScript-based exploits, either embedded Flash files, for example, are used to conduct heap spraying and a vulnerability unrelated to the JavaScript engine is exploited afterwards, or vulnerabilities are used for which heap spraying is unnecessary because the control flow can be redirected directly to shellcode embedded in one of the document's objects.

C. Heap Spraying

To place shellcode in the PDF reader's process memory, it can be assigned as a Unicode string either to a variable or to an array element. In both cases, the shellcode is stored in the heap area. Since the exact location of the shellcode in memory is unknown, heap spraying fills the heap space with multiple blocks consisting of a NOP sled and shellcode. When an exploit takes place and a return address is overwritten with an address that points approximately into the sprayed memory region, the chance is high that the program counter is set to an address located in one of the NOP sleds, leading to shellcode execution [10].

D. Malicious Embedded Files

The PDF format allows embedding of arbitrary files into documents. This feature is also used by malware operators to disguise malicious content and to exploit additional vulnerabilities. For instance, Adobe Reader can display embedded Flash programs directly. This allows exploitation not only of the PDF reader program but also of the integrated Flash Player, as shown in [11] and [12].

Another particularly interesting technique, named nested PDF, embeds a malicious PDF document into a benign document [13]. The top-level document can be constructed in a way that the embedded document is loaded immediately; even multi-layered constructions are possible. With this approach, the malicious content is encapsulated in a single PDF object.

III. RELATED WORK

There are a number of existing malicious PDF analysis tools. In the following we present three dynamic and one purely static analysis tool and compare each one with PDF Scrutinizer.

Wepawet is a web service for analysis and detection of web-based malware [14] [15]. It lists found exploits, evaluated JavaScript code, potentially detected shellcode and found malware. Wepawet uses anomaly detection and machine learning approaches to classify PDF documents. Like PDF Scrutinizer, Wepawet uses an emulated environment to execute extracted JavaScript code. During the execution, dynamic information is logged and features are extracted. With Wepawet, the authors presented ten features, which include, for instance, the number of bytes allocated through string operations, the number of likely shellcode strings and the values of attributes and parameters in method calls [15]. Using a known-good dataset, they learned anomaly thresholds, which are used to classify newly analyzed documents. Since the internals of the analysis environment are not public, it is unknown how much and which parts of the Acrobat API are currently emulated in Wepawet. In contrast to Wepawet, PDF Scrutinizer does not use a machine learning approach in which general information about the execution is collected and tested against learned thresholds. Instead, the occurrence of a small number of events that likely represent malicious operations induces a malicious classification in PDF Scrutinizer.

In [16], a standalone PDF document analyzer called MDScan is presented. For execution of extracted JavaScript code, the authors modified the JavaScript engine SpiderMonkey, for which they emulated parts of the Acrobat for JavaScript API. To detect malicious documents, MDScan dynamically analyzes used string variables for the presence of shellcode. Because this approach does not rely on previously known vulnerabilities, the authors claim to be able to detect malicious documents that exploit unknown vulnerabilities in PDF readers. On the other hand, concrete, known vulnerabilities cannot be identified, so details about the attack, such as CVE-IDs, are not given. In addition, MDScan relies solely on detection of shellcode. Even if most malicious PDF documents contain shellcode, with this approach the detection rate depends exclusively on the quality of the shellcode detector used. Furthermore, shellcode is not necessarily stored in a string variable, but can also be wrapped in an array or be spread over multiple arrays. PDF Scrutinizer uses multiple heuristics to be able to detect a broader range of malicious operations. In addition, to enable the identification of concrete vulnerabilities, known vulnerable methods are emulated.

A purely static approach using machine learning is implemented in PJScan [17]. Here, JavaScript code is extracted and subjected to a lexical analysis to determine features. Afterwards, new samples are classified by comparing them with previously learned thresholds. Since the code is not executed, dynamically constructed and evaluated scripts are not available and only top-level code can be considered. Compared to the previously presented tools and PDF Scrutinizer, PJScan is distinguished by a significantly higher performance because no code has to be executed. The drawbacks of this approach are that no further information about and no identification of concrete attacks can be given. Additionally, it has a high false-positive rate of about 16 % when testing benign PDF documents that include JavaScript code.

IV. IMPLEMENTATION

A. Overview

Because of the widespread use of PDF, a huge number of PDF documents are in circulation. The main requirement for automated malware processing software is a highly reliable
[Figure 1: Main functionality of PDF Scrutinizer. Labels: PDF document, parsing, extract JavaScript, execute JavaScript; classification as malicious, suspicious or benign.]
[Figure 2: Architecture of PDF Scrutinizer. Labels: PDFBox (create and access the parsed document), Rhino (API emulation, dynamic heuristics), libemu; outputs: JavaScript code, detected shellcode, found exploits, triggered heuristics, embedded files.]
classification, so that manual analysis becomes unnecessary. Thus, the classifications malicious, suspicious and benign were chosen. As a result, the number of documents that need to be examined manually is reduced to those classified as suspicious.

The main functionality of PDF Scrutinizer is the classification of PDF documents into these three categories. To support the analyst while examining suspicious PDF documents, PDF Scrutinizer not only displays the resulting classification but also provides further information on the reasons for the classification. For example, if any known exploits were used in the document, their CVE-IDs are given, and if any heuristics were triggered, they are indicated. All encountered JavaScript code is saved, as well as any shellcode found. Additionally, all embedded files are extracted and saved for later analysis. To counter the nested PDF technique, any embedded PDF documents are saved and afterwards analyzed separately by PDF Scrutinizer. Figure 1 illustrates the main functionality of the tool. In the following, the main activities of PDF Scrutinizer are described.

1) Parsing: First, the document is loaded and parsed. All PDF objects are analyzed and kept in memory so that the following steps can access them. Because malicious PDF documents are often malformed and corrupt, the parsing must not be limited to what the PDF specification allows. Instead, the parsing component should try to extract PDF objects at all costs. In PDF Scrutinizer, we focused on simulating the way Adobe Reader parses documents, because it is the most widespread reader with the most known vulnerabilities. Because of ambiguities in the PDF specification and the number of different readers running on different platforms, including mobile devices and browser-integrated PDF readers, this approach is not sufficient to achieve an optimal detection rate. It would thus be conceivable in the future to implement multiple instances that emulate different readers. However, this approach would increase the analysis time significantly.

2) Extraction of actions: The next step is the extraction of JavaScript actions. PDF Scrutinizer first tries to find actions the same way any PDF reader does. The document catalog dictionary is examined for a registered /OpenAction. Furthermore, the document's catalog can store a reference to the /Names array. This array is scanned for included JavaScript actions. Since, for example, any page or annotation can store an additional action referencing a JavaScript action, every PDF object that contains the additional action name (/AA) is analyzed. JavaScript execution can also be initiated using AcroForms, which is why the document is searched for them. If any AcroForms are found, the encapsulated JavaScript code is extracted. All encountered JavaScript code is collected, saved for later analysis and processed in the next step.

3) Execution of actions: The extracted JavaScript code is executed in a modified JavaScript engine in which parts of the Acrobat for JavaScript API are emulated. To provide certain API methods, their implementations need to access the loaded PDF document. Only in this way are they able to return the correct values, which are needed for the code to continue error-free execution and expose its real, malicious functionality. The API emulation is described in more detail in Section IV-C. To detect malicious behavior, the modified interpreter keeps track of operations and variable values during evaluation of code. To this end, several dynamic heuristics were developed that monitor the control flow for malicious operations; they are explained individually in Section IV-E.

B. Architecture

PDF Scrutinizer is a Java library that extends and connects existing components. Apache PDFBox [18] is used as the interface to the loaded PDF document; every interaction with the document is performed through this library. PDFBox was selected because it implements a large part of the PDF specification and because the included PDF parser works well even with malformed documents. However, PDFBox was modified in several ways to emulate the parsing methodology of Adobe Reader and to improve the processing of malicious documents.

To execute JavaScript code extracted from the PDF document, Mozilla Rhino [19] is used. Rhino only supports the core features of JavaScript and does not include the Acrobat for JavaScript API. In order to execute Acrobat JavaScript nevertheless, parts of the API were emulated and details in the behavior of the interpreter were modified. For this purpose, Rhino allows adding custom objects into the context of the interpreter. Thus, it is possible to emulate API methods by implementing their functionality and, if required, accessing the document through PDFBox.

Malicious PDF documents commonly rely on shellcode
in order to execute malicious commands within the reader's process. Thus, recognition of shellcode can be used to detect potential attacks within the examined document. This can be achieved using libemu [20], a shellcode detection and analysis library. It tries to detect shellcode by searching for GetPC code, which is used in a lot of shellcode to find the current value of the program counter. In PDF Scrutinizer, libemu is used to analyze variable values for the presence of shellcode during execution.

The architecture and the interaction of the components in PDF Scrutinizer are sketched in Figure 2.

C. API Emulation

To successfully execute JavaScript actions extracted from PDF documents, the Acrobat for JavaScript API has to be emulated. Malicious JavaScript that uses the Acrobat API commonly uses only a subset of all available methods. Since the entire Acrobat API is rather complex, only the most frequently used methods, based on observation of the available samples, were emulated in PDF Scrutinizer. The emulated API can be extended easily.

Essential parts of the emulated API correspond to the JavaScript for Acrobat API and thus provide compliant results. For instance, the methods doc.getAnnots and doc.getPageNthWord were implemented according to the specification. With these methods, early-stage JavaScript code extracts shellcode or JavaScript code that is located, for example, in the document's annotations or text. It is important to note that this part contributes significantly to the success of the further detection methods. If the emulation falls short in this early loading phase, further malicious code is not constructed correctly and never gets executed, so the heuristics would not be able to detect the attack. Since encrypted JavaScript code or shellcode can also be stored in the document's metadata, it is important to make the actual metadata values available to the engine. In our emulation, corresponding to the JavaScript for Acrobat API, these properties are encapsulated in an object accessible through doc.info.

Apart from API-conformant methods, methods known to be vulnerable are additionally emulated in order to detect the use of known exploits. Known vulnerable methods are prone, for example, to buffer-overflow or use-after-free attacks in the emulated reader version. In PDF Scrutinizer, these methods are of course not actually vulnerable, but they analyze the passed parameters for crafted values known to trigger exploits. This way, it is feasible to detect specific known vulnerabilities that a malicious document is trying to exploit. We included the methods that are vulnerable to our knowledge.

In addition, stub methods without any functionality are needed so that the engine does not abort if JavaScript code invokes them. If such a method is not expected to deliver a return value, the omission is simple. If a return object is needed, we supply a mock object that contains at least the correct data structure and default values. We further provided the ability to modify the JavaScript engine directly so that it acts more tolerantly and does not throw exceptions if invoked methods are non-existent. Although this modification enabled certain scripts to execute completely even if they called methods not covered in the context, there were negative side effects. We encountered malicious PDF documents that use exception handling constructs to detect simulated environments. For example, a non-existent method is called on purpose and the malicious behavior is only exposed within the error handler. With the proposed modification activated, the error handler would not be called and thus the malicious code would not be executed. Therefore, we disabled this extension during evaluation.

The Acrobat for JavaScript API provides means to execute scripts asynchronously. Given scripts are executed either once after a given period has elapsed or periodically. In malicious documents, asynchronous methods are used, for instance, to delay malicious operations until the document is rendered. This is done because a victim could become suspicious if, after loading a document, the reader is not responding and is not showing the content.

Asynchronous methods can also be used to detect emulated environments that, for simplicity, do not delay script execution but execute the given scripts immediately. Hence, asynchronous API methods were implemented in PDF Scrutinizer. For example, a script that changes a variable can be delayed, and the following code can easily determine, by checking the value of the variable, whether the script was executed instantly or is still being delayed. With this technique, the script can expose an emulated environment and avoid executing malicious code in order to remain undetected.

This is a general issue of the emulation approach: since only a subset of the API methods is available, a document could detect that it is running in an analysis environment by invoking rarely used API methods and checking the return values for plausibility. With this information, a malicious document could disguise its malicious behavior to bypass detection, or it could go into an endless loop. In order to identify the latter, a basic endless loop detection has been developed; when it triggers, the document is marked as suspicious.

D. Static Heuristics

Static heuristics treat source code as a string and analyze it using string analysis, with the goal of discovering malicious or suspicious elements in the code. Although this approach can be outsmarted using obfuscation or dynamic evaluation of code, it provides a fast way to test for signatures or to detect suspicious strings. In PDF Scrutinizer, the developed static heuristics process all extracted JavaScript actions as well as any code that is sent to the eval method of the interpreter. Additionally, any code that is executed by asynchronous methods is forwarded to the static heuristics. In the following, the developed static heuristics are introduced.

RegexMalicious includes a list of signatures that are based on regular expressions. The code is checked for occurrences of the signatures. If any signatures are found, the document is marked as malicious. Unfortunately,
no source of signatures tailored to malicious Acrobat JavaScript is publicly available. Thus, a new list containing vulnerable method calls, including parameters that are often used by exploits, was created based on a set of malicious PDF documents containing known attacks.

RegexSuspicious is similar to the first static heuristic, but it marks the document as suspicious when a match occurs. Currently, the set of regular expressions consists of words that are used in some malicious JavaScript code as variable names, e.g. "exploit", "shellcode" or "heapspray".

VulnerableMethodCalls uses a list of known vulnerable API methods and examines how many of them are found in the code. If the code string contains any vulnerable method calls, it is marked as suspicious.

E. Dynamic Heuristics

To be able to detect malicious documents even if they use unknown vulnerabilities, dynamic heuristics were implemented. In PDF Scrutinizer, the JavaScript interpreter was modified in order to monitor certain variable values and to intercept certain bytecode instructions. This way, the relevant information is forwarded to the heuristics during execution.

1) StringLengthTester: Long strings in malicious JavaScript code usually occur when constructing NOP sleds that are later used in heap spraying. Therefore, the StringLengthTester heuristic tests used variables for an excessive number of characters. If the length of a used string exceeds a configurable threshold, the document is marked as malicious. Our evaluation indicates that 100,000 characters is a suitable default threshold.

Strings are only examined at one point in Rhino's interpreter: the assignment of variables. Note that this point is not only reached at the first assignment of a variable. As every string change in JavaScript leads to an assignment of a new string, this point is reached whenever a string modification takes place.

2) HeapSprayDetector: Heap spraying is an essential technique used in most JavaScript-based attacks in PDF documents; it removes the need to know the exact location of shellcode in memory. Hence, when exploiting a vulnerability, the overwritten return address only needs to be roughly in the right memory region. Usually, with heap spraying, strings consisting of a NOP sled and shellcode are added to an array, effectively inflating the heap segment with large amounts of data.

When code that uses heap spraying is executed in a JavaScript engine, the memory is filled with such large blocks of NOP sled and shellcode, leading to poor runtime performance. The aim of preventing heap spraying from happening led to the development of a feature that detects heap spraying and at the same time stops the inflation of the heap segment. With this feature, heap spraying is detected and avoided, which reduces the analysis time significantly.

The HeapSprayDetector feature hooks into the operation of the interpreter where elements are added to arrays. It is analyzed whether the code tries to add multiple identical, large data blocks to an array, which is a strong sign of heap spraying. If so, the strings are not added to the array, so the blocks are not written into memory, leading to better runtime performance. At the same time, the heuristic is triggered and the document is marked as malicious.

3) ShellcodeTester: Shellcode commonly downloads and installs malware or, in case a malware binary is included in or appended to the PDF document, constructs an executable and installs it. Since installing malware is usually the attacker's goal, shellcode is included in virtually every malicious PDF document. Shellcode used within malicious JavaScript can either be stored in variables declared in the source code or be built at runtime. Either way, the shellcode is eventually stored in a string or array variable. Thus, with the ShellcodeTester heuristic, used variables are observed and tested with the shellcode detection and analysis library libemu. Shellcode instructions are commonly encoded using percent-encoding, whereby, for instance, the two opcode bytes 0x90 and 0x90 are given by the string "%u9090". The unescape method decodes the percent-encoded string and returns a string containing only the encoded bytes. That is why the result of the unescape method is additionally processed by the heuristic.

Since shellcode detection is rather slow, several techniques were used to reduce the number of strings that need to be tested. For instance, only strings with a length between a reasonable lower and upper bound are tested. In addition, strings that frequently contain the sequence "%u" are most likely percent-encoded shellcode instructions that will be unescaped later, so these strings are not processed by libemu.

If shellcode is detected in the remaining strings, the document is marked as malicious, the shellcode binary is stored, and the libemu profile, which lists the shellcode's API calls, is saved. With the help of some modifications to libemu, potentially included malware executables are automatically saved during the analysis.

V. EVALUATION

To measure the performance of PDF Scrutinizer, especially the false positive and false negative rates, sets of malicious and benign samples were analyzed. The sets used [21] contain 6,054 benign and 11,278 malicious samples, collected from email attachments and web sites.

For the evaluation, both sets were processed successively by PDF Scrutinizer and the results were analyzed. The benign sample set was mainly used to determine the false positive rate, which is an important metric for an automated analysis tool. Additionally, the performance of PDF Scrutinizer when analyzing benign samples was evaluated, which is a crucial factor for the practical suitability of the tool. When the tool is used as a honeyclient in connection with a web crawler, large numbers of PDF documents need to be processed, most of which are likely benign in practice. Thus, the performance when analyzing benign PDF documents has a large impact on the overall performance. Using the malicious sample set, the false negative rate of PDF Scrutinizer was evaluated.
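The heuristics whose behavior is evaluated here can be illustrated with a minimal Python sketch. It is not the tool's implementation (PDF Scrutinizer is a Java library built on Rhino); it only mirrors the described logic of RegexSuspicious, StringLengthTester and HeapSprayDetector, plus the "%u9090" decoding mentioned for ShellcodeTester. The names Report, MonitoredArray, LARGE_BLOCK_LENGTH and MAX_IDENTICAL_BLOCKS are hypothetical, and the two numeric limits for heap spray detection are assumptions, since the text gives no concrete values. Only the 100,000-character threshold and the three suspicious words are taken from the text. The little-endian byte order in percent_u_to_bytes is likewise an assumption about how UTF-16 code units are laid out in memory on x86.

```python
import re
from dataclasses import dataclass, field

# Default threshold named in the text for StringLengthTester.
STRING_LENGTH_THRESHOLD = 100_000
# Hypothetical limits: the text only says "multiple identical, large data blocks".
LARGE_BLOCK_LENGTH = 1_000
MAX_IDENTICAL_BLOCKS = 10
# Words the text lists as examples for RegexSuspicious.
SUSPICIOUS_WORDS = re.compile(r"exploit|shellcode|heapspray", re.IGNORECASE)

SEVERITY = {"benign": 0, "suspicious": 1, "malicious": 2}


@dataclass
class Report:
    """Collects triggered heuristics and keeps the most severe verdict."""
    verdict: str = "benign"
    triggered: list = field(default_factory=list)

    def flag(self, heuristic, verdict):
        if heuristic not in self.triggered:
            self.triggered.append(heuristic)
        if SEVERITY[verdict] > SEVERITY[self.verdict]:
            self.verdict = verdict


def regex_suspicious(code, report):
    """Static check: treat the source code as a plain string."""
    if SUSPICIOUS_WORDS.search(code):
        report.flag("RegexSuspicious", "suspicious")


def on_string_assignment(value, report):
    """Dynamic check, meant to be called on every string assignment."""
    if len(value) > STRING_LENGTH_THRESHOLD:
        report.flag("StringLengthTester", "malicious")


class MonitoredArray:
    """Array stand-in whose append hook looks for heap spraying."""

    def __init__(self, report):
        self.items = []
        self._report = report
        self._large_block_counts = {}

    def append(self, value):
        if isinstance(value, str) and len(value) >= LARGE_BLOCK_LENGTH:
            count = self._large_block_counts.get(value, 0) + 1
            self._large_block_counts[value] = count
            if count > MAX_IDENTICAL_BLOCKS:
                # Flag the document and drop the block so the heap is not inflated.
                self._report.flag("HeapSprayDetector", "malicious")
                return
        self.items.append(value)


def percent_u_to_bytes(encoded):
    """Decode every "%uXXXX" unit into two bytes (little-endian); other text is ignored."""
    units = re.findall(r"%u([0-9a-fA-F]{4})", encoded)
    return b"".join(int(unit, 16).to_bytes(2, "little") for unit in units)


report = Report()
regex_suspicious("var shellcode = unescape('%u9090%u9090');", report)
sprayed = MonitoredArray(report)
block = "\u9090" * 2_000  # stand-in for a NOP sled plus shellcode
for _ in range(200):
    sprayed.append(block)
on_string_assignment("A" * (STRING_LENGTH_THRESHOLD + 1), report)
print(report.verdict, report.triggered, len(sprayed.items))
```

Dropping the repeated blocks instead of storing them reflects the design choice described for HeapSprayDetector: detection and avoidance happen in the same hook, so the analysis does not pay the memory and runtime cost of the spray.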
                                 Count    Percentage
Overall samples                   6054    100.00 %
Processed without error           6039     99.75 %
Contained JavaScript actions       229      3.78 %
Analysis result
  Malicious                          0      0.00 %
  Suspicious                         3      0.05 %
  Benign                          6036     99.70 %
  Error                             15      0.25 %
TABLE I
ANALYSIS RESULT OF BENIGN SAMPLES

                                 Benign          Malicious
Amount of Samples                6054            11278
Analysis Time                    5 min 33 sec    14 hours 2 min
Average Time per Sample          0.06 sec        4.48 sec
Average Samples per Minute       1100.73         13.39
Average Throughput               310 MB/min      0.33 MB/min
TABLE III
ANALYSIS PERFORMANCE
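The percentages in Table I follow directly from the counts. The short Python sketch below recomputes them and checks that the categories add up; the variable names are illustrative, and the counts are copied from the table rather than measured anew.

```python
# Counts copied from Table I (benign sample set).
TOTAL = 6054
COUNTS = {
    "processed without error": 6039,
    "contained JavaScript actions": 229,
    "malicious": 0,
    "suspicious": 3,
    "benign": 6036,
    "error": 15,
}


def percentage(count, total=TOTAL):
    """Share of all benign samples, rounded to two decimals as in the table."""
    return round(100.0 * count / total, 2)


shares = {label: percentage(count) for label, count in COUNTS.items()}

# Consistency checks: the three verdicts cover all processed samples,
# and processed samples plus errors cover the whole set.
verdict_total = COUNTS["malicious"] + COUNTS["suspicious"] + COUNTS["benign"]
covers_processed = verdict_total == COUNTS["processed without error"]
covers_all = COUNTS["processed without error"] + COUNTS["error"] == TOTAL

for label, share in shares.items():
    print(f"{label:30s} {share:6.2f} %")
print("consistent:", covers_processed and covers_all)
```

Under the assumption that a false positive is a benign sample classified as malicious, the row with a count of 0 corresponds to the reported rate of 0.00 %, while the 3 suspicious samples are the ones left for manual inspection.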
Count Percentage
If a known attack is performed, PDF Scrutinizer can often
Malicious classified samples 10125 100.00 %
recognize the CVE-IDs of the used vulnerabilities. Even if a
Heuristic triggered
RegexMalicious 8327 82.24 %
document tries to exploit an unknown vulnerability, chances
StringLengthTester 8756 86.48 % are high, that PDF Scrutinizer still detects the attack using
HeapSprayDetector 9868 97.46 % dynamic heuristics. This is particularly interesting for using the
ShellcodeTester 7482 73.90 % tool with a client honeypot, because this way, novel exploits
TABLE IV could be identified at an early stage.
T RIGGERING FREQUENCY OF THE HEURISTICS
R EFERENCES
[1] K. Selvaraj and N. F. Gutierrez, “The Rise of PDF Malware,” 2010.
[Online]. Available: http://www.symantec.com/content/en/us/enterprise/
The highest rate is achieved by the HeapSprayDetector media/security response/whitepapers/the rise of pdf malware.pdf
[2] M. Hypponen, “Military Targets,” 2011. [Online]. Available: http:
heuristic. This heuristic was triggered by 97.46 % of the //www.f-secure.com/weblog/archives/00002203.html
malicious documents, which shows that heap spraying is used [3] The H security, “Targeted attacks on arms manufacturers continue,”
by the major part of malicious PDF documents and that it 2011. [Online]. Available: http://www.h-online.com/security/news/item/
Targeted-attacks-on-arms-manufacturers-continue-1283425.html
can be reliably detected using this heuristic. Excessively long [4] P. O. Baccas, “Who ordered spam? New trick in PDF malware
strings were detected in 86.48 % of the documents, using the uncovered,” 2011. [Online]. Available: http://nakedsecurity.sophos.com/
StringLengthTester heuristic. Again, this is a indicator for the 2011/04/18/orders-spam-new-trick-in-pdf-malware/
[5] E. Balunsat, “How PDF files hide malware Example PDF scan from
presence of malicious code. In 73.90 % of the documents Xerox,” 2011. [Online]. Available: http://blog.commtouch.com/cafe/
in the malicious data set, shellcode was found using the malware/how-pdf-files-hide-malware-example-pdf-scan-from-xerox/
ShellcodeTester heuristic. The result depends on the quality [6] Avast Software, “Six out of every ten users run vulnerable versions of
Adobe Reader,” 2011. [Online]. Available: http://public.avast.com/mkt/
of the used shellcode detector and can vary. As shellcode 20110713 6 out of 10 with vulnerable PDF.pdf
is included in virtually every malicious document, that uses [7] M. Sutton, J. Sobrier, M. Geide, P. Kulkarni, and U. Wanve, “State of
JavaScript-based attacks, this rate should be improvable. the Web - Quarter 2, 2011 Report,” 2011. [Online]. Available: http:

In summary, the heuristics achieve a high detection rate on the samples
classified as malicious. This indicates that novel documents exploiting as
yet unknown vulnerabilities should be reliably detectable with PDF
Scrutinizer, as long as they exhibit the characteristics observed by these
heuristics. Also noteworthy are the low false-positive rate and the good
performance observed during the evaluation of the benign sample set.

VI. FUTURE WORK

A limitation of the emulation approach is that the analysis becomes more
difficult as soon as authors of malicious documents start to use a larger
proportion of the JavaScript for Acrobat API. As emulated environments
always differ in some way from the original, malicious PDF document
authors can and will use such differences to detect simulated systems.
Thus, enhancing the emulation quality is an ongoing process, with the aim
of improving the faithful execution of Acrobat JavaScript and, with it, the
detection rate for malicious documents.

VII. CONCLUSION

In this paper, we presented PDF Scrutinizer, a PDF document analysis tool
that uses static and dynamic detection mechanisms to recognize malicious
PDF documents. The attempt to emulate a PDF reader's behavior has proven
to be promising. PDF Scrutinizer showed a detection rate of about 90 %,
whereas false-positive classifications did not occur during our
evaluation.

Since code contained in PDF documents is executed, obfuscation techniques
that hinder reliable static detection are overcome. During the execution,
details about the behavior are collected, which often makes a manual
analysis unnecessary.