Deguard

Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

Statistical Deobfuscation of Android Applications

Benjamin Bichsel Veselin Raychev Petar Tsankov Martin Vechev


Department of Computer Science
ETH Zurich
bichselb@student.ethz.ch {veselin.raychev, ptsankov, martin.vechev}@inf.ethz.ch

ABSTRACT this probabilistic model to suggest a (statistically likely) de-


This work presents a new approach for deobfuscating An- obfuscation of new, obfuscated Android applications. Our
droid APKs based on probabilistic learning of large code approach enables a variety of security applications. For in-
bases (termed “Big Code”). The key idea is to learn a prob- stance, our system successfully deobfuscates Android APKs
abilistic model over thousands of non-obfuscated Android produced by ProGuard [6], the most popular obfuscation
applications and to use this probabilistic model to deob- tool for Android applications.
fuscate new, unseen Android APKs. The concrete focus
of the paper is on reversing layout obfuscation, a popular Focus: Layout Deobfuscation.
transformation which renames key program elements such The focus of this paper is on reversing layout obfuscation
as classes, packages and methods, thus making it difficult to of Android APKs. While general obfuscation can include
understand what the program does. other transformations (e.g., changes to the program’s data
Concretely, the paper: (i) phrases the layout deobfusca- representation or control-flow [25]), layout obfuscation re-
tion problem of Android APKs as structured prediction in mains a key part of virtually all obfuscation tools. In layout
a probabilistic graphical model, (ii) instantiates this model obfuscation, the names of program elements that carry key
with a rich set of features and constraints that capture the semantic information are replaced with other (short) identi-
Android setting, ensuring both semantic equivalence and fiers with no semantic meaning. Examples of such elements
high prediction accuracy, and (iii) shows how to leverage are comments, variable, method and class names. Renaming
powerful inference and learning algorithms to achieve over- these program elements makes it much harder for humans
all precision and scalability of the probabilistic predictions. to read and understand what the program does and is use-
We implemented our approach in a tool called DeGuard ful in a variety of security scenarios including protection of
and used it to: (i) reverse the layout obfuscation performed intellectual property.
by the popular ProGuard system on benign, open-source ap-
plications, (ii) predict third-party libraries imported by be- Benefits and Challenges.
nign APKs (also obfuscated by ProGuard), and (iii) rename Among others, reversing layout obfuscation for Android
obfuscated program elements of Android malware. The ex- APKs has various benefits including: (i) it makes it easier
perimental results indicate that DeGuard is practically ef- for security analysts to inspect Android applications obfus-
fective: it recovers 79.1% of the program element names cated with ProGuard, (ii) it identifies third-party libraries
obfuscated with ProGuard, it predicts third-party libraries embedded in Android APKs, and (iii) it enables one to au-
with accuracy of 91.3%, and it reveals string decoders and tomatically search for certain identifiers in the code.
classes that handle sensitive data in Android malware. However, reversing layout obfuscation is a hard problem.
The reason is that once the original names are removed from
the application and replaced with short meaningless identi-
1. INTRODUCTION fiers, there is little hope in recovering the original names by
This paper presents a new approach for deobfuscating simply inspecting the application alone, in isolation.
Android applications based on probabilistic models. Our
approach uses large amounts of existing Android programs Probabilistic Learning from “Big Code”.
available in public repositories (referred to as “Big Code”) To address challenges that are difficult to solve by consid-
to learn a powerful probabilistic model which captures key ering the program in isolation, the last couple of years have
features of non-obfuscated Android programs. It then uses seen an emerging interest in new kinds of statistical tools
Permission to make digital or hard copies of all or part of this work for personal or which learn probabilistic models from “Big Code” and then
classroom use is granted without fee provided that copies are not made or distributed use these models to provide likely solutions to tasks that
for profit or commercial advantage and that copies bear this notice and the full citation are difficult to solve otherwise. Examples of such tasks in-
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or clude machine translation between programming languages
republish, to post on servers or to redistribute to lists, requires prior specific permission [18], statistical code synthesis [32, 30], and predicting names
and/or a fee. Request permissions from permissions@acm.org. and types in source code [31, 9]. Interestingly, due to their
CCS’16, October 24 - 28, 2016, Vienna, Austria unique capabilities, some of these probabilistic systems have
c 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. quickly become popular in the developer community [31].
ISBN 978-1-4503-4139-4/16/10. . . $15.00
DOI: http://dx.doi.org/10.1145/2976749.2978422
This Work: Android Deobfuscation via “Big Code”. Example.
Motivated by these advances, we present a new approach Figure 1(a) shows a fragment of an Android application
for reversing Android layout obfuscation by learning from that has been obfuscated with ProGuard (the obfuscated
thousands of readily available, non-obfuscated Android ap- program elements are highlighted with red). The depicted
plications. Technically, our approach works by phrasing the code fragment can be easily obtained from the APK using
problem of predicting identifier names (e.g., class names, standard tools, such as Dex2Jar [2] and Java Decompiler [4].
method names, etc.) renamed by layout obfuscation as Here, the name of the class is replaced with the non-
structured prediction with probabilistic graphical models. descriptive name a and similarly, the private field of type
In particular, we leverage Conditional Random Fields (CRFs) SQLiteDatabase and the method returning a Cursor object
[23], a powerful model widely used in various areas includ- are renamed with the obscure names b and c, respectively.
ing computer vision and natural language processing. To our It is evident that inspection of this code, as well as any other
knowledge, this is the first time probabilistic graphical mod- code using the obfuscated class a, is challenging. For exam-
els learned from “Big Code” have been applied to address a ple, the intended behavior of the following two lines of code
core security challenge. Using our approach we present a is concealed due to the non-descriptive class and method
tool called DeGuard, and show that it can automatically names:
reverse layout obfuscation of Android APKs as performed a obj = new a();
by ProGuard with high precision. obj.c(str);

Main Contributions. As mentioned, ProGuard keeps the names of some pro-


The main contributions of this paper are: gram elements to preserve the application’s semantics. For
example, the name of the class SQLiteOpenHelper and its
• A structured prediction approach for performing prob- methods getWritableDatabase and rawQuery are not re-
abilistic layout deobfuscation of Android APKs. named because this class is part of the core Android API.
• A set of features and constraints cleanly capturing key
parts of Android applications. Combined, these ensure
our probabilistic predictions result in high precision
2.2 DeGuard
and preserve application’s semantics. Given an Android APK as input, DeGuard outputs a se-
mantically equivalent Android APK with descriptive names
• A complete implementation of our approach in a scal- for fields, methods, classes, and packages. We depict the
able probabilistic system called DeGuard1 . source code of the output APK produced by DeGuard in
• An evaluation of DeGuard on open-source Android Figure 1(d). The key steps of our approach are shown with
applications obfuscated by ProGuard and Android mal- thick gray arrows ( ) in Figure 1. We now describe these
ware samples. Our results show that DeGuard is steps.
practically effective: it correctly predicts 79.1% of the
program elements obfuscated by ProGuard, it iden- Dependency Graph.
tifies 91.3% of the imported third-party libraries, and DeGuard analyzes the input APK and formalizes the
reveals relevant string decoders and classes in malware. structure of the Android application as a graph over the
program elements, where an edge signifies that the corre-
sponding two program elements are related. The graph in
2. OVERVIEW Figure 1(b) illustrates a fragment of the generated depen-
In this section we provide an informal overview of our dency graph for our example.
statistical deobfuscation approach for Android. First, we The red circular nodes denote the unknown (i.e., obfus-
discuss ProGuard, which is the most widely used tool for cated) program elements whose names the tool will try to
obfuscating Android applications. We then present the key predict, and the purple rectangular nodes are the known
steps of our DeGuard system. The purpose here is to pro- program elements, which will not be modified by the tool.
vide an intuitive understanding of the approach. Full formal The name of the class a is therefore represented with a
details are presented in the later sections. red node, while the class SQLiteOpenHelper with a purple
2.1 ProGuard one. The graph’s edges are labeled with a particular rela-
tionship, which represents how the two program elements
ProGuard obfuscates program elements including names are related. For example, the edge from node a to node
of fields, methods, classes, and packages, by replacing them SQLiteOpenHelper is labeled with the relationship extends
with semantically obscure names. It also removes unused to formalize that the former class extends the latter. Since
classes, fields, and methods to minimize the size of the result- program elements can have multiple relationships, the de-
ing Android application package (APK) released to users. pendency graph may in general contain multiple edges be-
ProGuard processes both the application and all third-party tween two nodes (thus, technically, the dependency graph is
libraries that the application imports (e.g., advertising and a multigraph).
analytics libraries). All third-party libraries imported by the Formally, the relationships between two nodes represent
application are therefore concealed in the released APK. different feature functions. The constructed dependency
ProGuard cannot obfuscate all program elements as that graph, along with all feature functions specifies a Condi-
would change the application’s semantics. For example, the tional Random Field (CRF) [23], a powerful probabilistic
names of methods part of the Android API and the names graphical model. We formally define the dependency graph,
of classes referenced in static files, are kept intact. feature functions, and CRFs in Section 3, and the features
1
http://apk-deguard.com for Android applications in Section 4.
(partial) Dependency graph:
1 class a extends SQLiteOpenHelper {
extends
2 SQLiteDatabase b ; SQLiteOpenHelper a
3 public a (Context context) { field-in
4 super(context, "app.db", null, 1);
getWritableDatabase b
5 b = getWritableDatabase(); gets
6 }
7 Cursor c (String str){ Derive graph, Naming constraints:
8 return b .rawQuery(str); and constraints
9 }
C={ a 6= MainActivity ,··· }
10 } (b) Dependency graph, features, and constraints
(a) An Android application obfuscated by ProGuard
Predict

name 1 name 2 weight

1 class DBHelper extends SQLiteOpenHelper { SQLiteOpenHelper DBUtils 0.3


SQLiteOpenHelper DBHelper 0.2
2 SQLiteDatabase db ;
3 public DBHelper (Context context) { extends
SQLiteOpenHelper DBHelper
name 1 name 2 weight
4 super(context, "app.db", null, 1);
field-in DBUtils instance 0.5
5 db = getWritableDatabase();
DBHelper db 0.4
6 } getWritableDatabase db
gets DBUtils db 0.2
7 Cursor execSQL (String str){ Rename DBHelper instance 0.2
identifiers
8 return db .rawQuery(str); name 1 name 2 weight
9 } getWritableDatabase db 0.7
10 } getWritableDatabase instance 0.4
(d) Deobfuscated Android application using DeGuard (c) Graph with predicted unknown identifiers

Figure 1: Statistical deobfuscation of Android applications using DeGuard. The red color indicates the elements whose
names are to be renamed (in the input), the green color are the same elements with the new names (in the output), and the
purple color denotes the elements whose names are known and remain the same.

Syntactic and Semantic Constraints. Figure 1(c). DeGuard predicts that the name of the obfus-
Given an Android APK, DeGuard automatically derives cated class a is DBUtils and that the name of the obfuscated
a set of constraints which restricts the possible names as- field b is db. Below, we describe how DeGuard concludes
signed to the unknown program elements. These naming that these are the most likely names for this example.
constraints guarantee that the deobfuscated APK gener- To predict the names of the obfuscated elements, De-
ated by DeGuard is: (i) a syntactically well-formed pro- Guard performs a joint prediction that considers all pro-
gram, and (ii) semantically equivalent to the input APK. gram elements, known and unknown. To illustrate this in-
Two example constraints are: all fields declared in the same ference step, consider the graph in Figure 1(c). The tables
class must have distinct names and all classes that belong associated with the graph’s edges represent the likelihood,
to the same package must have distinct names. Any well- of each (pairwise) assignment of names to program elements
formed application must satisfy these two syntactic proper- (nodes). We remark that each table is derived from feature
ties. Naming constraints of methods are more intricate due functions associated with weights, which together represent
to method overriding. For example, if a method in a sub- (log-)likelihoods. We formally define feature functions and
class overrides a method in a superclass (in the input APK), explain the derivation of the likelihood tables in Section 3.
then the two methods must have the same name after de- Here, we illustrate how these likelihood tables are used to
obfuscation to preserve the overriding property. choose the most likely names. Our goal is to find an assign-
For example, suppose the package of class a also contains ment for all program elements that maximizes a score that
a class with the (non-obfuscated) name MainActivity. The is the sum of the weights in each table.
constraint a 6= MainActivity in Figure 1(b) specifies that For our example, according to the top-most table, the
the predicted name for node a must be distinct from the weight of assigning the name DBUtils to the class is 0.3.
name MainActivity. Indeed, if these two classes have iden- However, DeGuard does not select DBUtils as the name of
tical names, then the resulting output APK would not be this class. This is because selecting the name DBUtils does
syntactically well-formed. not result in the highest possible overall score. Suppose we
In Section 5, we describe an algorithm that, for any APK, select the name DBUtils. Then, we have two possible names
generates all necessary syntactic and semantic constraints. for the obfuscated field b, namely db and instance. Accord-
ing to the likelihood tables, both the former and the latter
Probabilistic Prediction. choice result in a total score of 1.2. However, if we select the
Using the derived dependency graph and constraints, De- name DBHelper and db, then the total score is 1.3, which
Guard infers the most likely names for all obfuscated ele- is the highest possible score for this example. DeGuard
ments. The predicted names for our example are depicted in therefore returns these names as most likely.
Formally, DeGuard performs a Maximum a Posteriori 3. BACKGROUND
(MAP) inference query on the CRF model defined by the de- In this section we provide the necessary background on
pendency graph and the feature functions. We define MAP probabilistic models, queries, and learning which we leverage
inference in Section 3. in this work. These concepts are well known in the field of
probabilistic graphical models [20]. The main purpose here
2.3 Security Applications is to review these parts and to illustrate how they are used
DeGuard can be used to tackle several practical security by our approach.
problems. In our evaluation, we show that DeGuard can
effectively reverse ProGuard’s layout obfuscation for benign Problem Statement.
Android APKs. Although ProGuard obfuscates 86.7% of the We phrase the problem of predicting the most likely names
program elements on average, DeGuard correctly predicts assigned to all obfuscated program elements as a problem in
the names for 79.1% of those elements. structured prediction. Intuitively, we model the elements
Predicting libraries is another important problem, which of a program as a tuple of random variables (V1 , . . . , Vn )
is particularly relevant for Android [14]. Mobile develop- ranging over a set of name labels Names. The set Names
ers tend to rely on a large number of libraries which often in our case contains all possible names from which we can
contain security vulnerabilities — from personal informa- choose to name program elements. Then, the joint distri-
tion release [15, 11] to severe man-in-the-middle vulnerabil- bution P (V1 , . . . , Vn ) (discussed later in this section) over
ities [1]. In our experiments, DeGuard reveals over 90% of these variables assigns a probability to each assignment of
the third-party libraries concealed by ProGuard. names to variables.
Further, numerous security analyses rely on descriptive ~ = (V1 , . . . , V ~ ) be the variables representing obfus-
Let O |O|
program identifiers. Examples include analyses that per- cated program elements, i.e., the variables whose names we
form statistical filtering of potential vulnerabilities [37] and would like to predict. The names of the remaining variables
probabilistic systems for detecting privacy leaks [11]. These K~ = (V ~ , . . . , Vn ) are known and will not be affected by
|O|+1
systems assume that the application’s program elements are the renaming. Then, to predict the most likely names for the
non-obfuscated. DeGuard can be used to deobfuscate ap- obfuscated program elements, we compute the Maximum a
plications before they are analyzed by such systems. Posteriori (MAP) inference query:
2.4 Challenges ~o = argmax P (O ~ = ~k)
~ = o~0 | K
o~0 ∈Ω
We discuss three key challenges when building a prediction
system for deobfuscating Android applications: ~
where Ω ⊆ Names |O| is the set of all possible assignments of
(i) Capturing the rich structure of Android applications: ~ and ~k ∈ Names |K|~
names to the obfuscated variables O, de-
precisely encoding the structure of Android applications us- ~
fines the names assigned to the known variables K. Next, we
ing a concise, yet adequate set of program elements and
describe how we actually represent and compute the condi-
relationships is important to ensure the predictions made ~ = ~k) for a given assignment
~ = ~o | K
by the system are accurate. This is difficult as a large set tional probability P (O
of relationships may hurt the scalability of the system while of names ~k.
missing important relationships, or defining bad ones, can
reduce the prediction accuracy. Dependency Graph.
(ii) Semantic equivalence: the rich structure of Java poses A dependency graph for a given program is an undirected
nontrivial constraints that must be captured to ensure the multigraph G = (V, E), where V is a set of random variables
resulting deobfuscated Android APK has equivalent seman- representing program elements and E ⊆ V ×V ×Rels is a set
tics to the input APK. of labeled edges. Here, Rels is a set of relationships between
(iii) Scalable learning: the expressive structure of An- program elements; we instantiate this set for Android appli-
droid applications inevitably results in complex dependency cations in Section 4. An edge (Vi , Vj , rel ) says that elements
graphs and a large variety of features that cannot be han- Vi and Vj are related via rel . An example of a dependency
dled efficiently by off-the-shelf machine learning systems. graph is shown in Figure 1(b).
According to our experiments, the most scalable available
prediction system for programs [31] required an order of Features and Weights.
magnitude longer than DeGuard to learn a probabilistic We define a pairwise feature function ϕ as follows:
model for Android.
ϕ : Names × Names × Rels → R
2.5 Scope and Limitations This function maps a pair of names and their relationship
In this work, we focus on deobfuscating Android appli- to a real number. In Section 4, we define several kinds of
cations that have been transformed using layout obfusca- feature functions and based on these we obtain the entire set
tion mechanisms, which rename fields, methods, classes and of pairwise features {ϕ1 , . . . , ϕm } automatically during the
packages with semantically obscure names. Other obfusca- learning phase (described at the end of this section). For
tion techniques, such as data obfuscation mechanisms, which example, for each observed edge (Vi , Vj , rel ) in the training
alter data structures, control-flow and cryptographic obfus- set of dependency graphs where the names assigned to Vi
cation mechanisms fall outside the scope of this work. We and Vj are ni and nj , respectively, we define a pairwise
remark that malicious Android applications often uses mul- feature ϕ(N, N 0 , Rel ) = 1 if N = ni , N 0 = nj , and Rel =
tiple obfuscation techniques to prevent reverse engineering. rel ; otherwise, ϕi (N1 , N2 , rel ) = 0. Further, for any ϕi , we
Security experts must thus use a combination of deobfusca- associate a weight wi , also computed during the learning
tion tools to effectively deobfuscated Android malware. phase.
Given a dependency graph G = (V, E), a prediction ~o (SQLiteOpenHelper, DBUtils, extends) and 0 for all other
~ and an assignment ~k for the
for the obfuscated variables O, inputs. That feature function also has a weight of wi = 0.3,
fixed, known variables K, ~ we associate a feature function fi which is determined during learning. We do not include the
to each pairwise feature ϕi defined as follows: type of relationship in the tables since in this example the
X program elements are connected via a single relationship.
fi (~o, ~k) = ϕi ((~o, ~k)j , (~o, ~k)l , rel ) In our example, the MAP inference query will return the
(Vj ,Vl ,rel)∈E assignment ~o highlighted in green (i.e., DBHelper and db)
as that assignment satisfies the constraints in Ω (in our ex-
Here, (~o, ~k)j denotes the jth name in the vector (~o, ~k). We ample we have a single inequality constraint) and the total
can think of fi as lifting ϕi to quantify ϕi ’s effect on all the score of ~o is the highest: 0.2 + 0.4 + 0.7 = 1.3. To compute
edges in the graph (i.e., adding up ϕi ’s effect on each edge). this score, DeGuard implements a greedy MAP inference
The end result computed by fi is a real number capturing algorithm which we describe in Section 6.1.
the overall effect of ϕi .
Learning from “Big Code”.
Conditional Random Fields. The input to the learning phase of DeGuard is a set
A conditional random field (CRF) is a probabilistic model of p programs {h~o(j) , ~k(j) i}pj=1 for which both vectors of
which defines a conditional probability distribution, that is,
~ = ~k) as follows:
~ = ~o | K names ~o and ~k are given. That is, the training data con-
P (O
tains non-obfuscated applications, which can be downloaded
m
from repositories for open-source Android applications, such
~ = ~k) = 1 exp(
X
~ = ~o | K
P (O wi fi (~o, ~k)), as F-Droid [3]. From this input, the learning outputs weights
Z i=1 {wi }m
i=1 such that names in the training data programs are
where each fi , 1 ≤ i ≤ m, is a feature function associated correctly predicted. There are several variations of this
with a weight wi , and Z is a normalization constant. We learning procedure [29, 20]. For our application we use learn-
do not define Z as it can be omitted for our specific type of ing with pseudo-likelihoods as described in [35, §5.4]. We
query. describe this algorithm in more detail in Section 6.1.
It is then immediate that the dependency graph, together
with the feature functions f1 , . . . , fm and their associated 4. FEATURE FUNCTIONS
weights w1 , . . . , wm , define a CRF. In this section, we present the pairwise feature functions
used for our Android deobfuscation task. As described ear-
Prediction via MAP Inference. lier in Section 3, these feature functions are used to build the
To compute the most likely assignment ~o for the variables dependency graph. Recall that the construction of an appli-
~ we perform a MAP inference query:
O, cation’s dependency graph amounts to introducing a node
~ = ~k)
~ = o~0 | K for each program element and then connecting the program
~o = argmax P (O
o~0 ∈Ω
elements that are related. The signature of a dependency
graph is therefore defined by the application’s program ele-
Using our CRF model, we can compute the probability of ments and the relationships between them.
an assignment ~o using the formula:
m 4.1 Program Elements
~ = ~k) = 1 X
~ = o~0 | K
P (O exp( wi fi (~o, ~k)) A program’s dependency graph is defined over nodes that
Z i=1 represent different program elements. To capture the struc-
ture of an Android application, we introduce nodes for each
We omit the constant Z (as it does not affect the result of
of the following program elements:
the MAP inference query), expand fi (~o, ~k), and rewrite the
formula as follows: • Types. We introduce a node for each primitive type
~ = ~k) ∼
~ = o~0 | K (e.g., int, long, float etc.), reference type (e.g., Object,
P (O ArrayList, etc.), and array type (e.g., int[], Object[],
m
X X etc.) that appears in the application. For example,
∼ exp( wi ϕi ((~o, ~k)j , (~o, ~k)l , rel )) = we introduce a node to represent the reference type
i=1 (Vj ,Vl ,rel)∈E
SQLiteDatabase in the example of Figure 1(a).
X m
X
= exp( wi ϕi ((~o, ~k)j , (~o, ~k)l , rel )) • Fields. We introduce a node for each field declared in
(Vj ,Vl ,rel)∈E i=1 the application’s classes. For example, we introduce a
node to represent the field of type SQLiteDatabase in
We refer to the above as the total score of an assignment (~o, ~k). the example of Figure 1(a).
• Packages. We introduce a node for each package in
MAP Inference Example. the application. For example, given a package a.b, we
We now explain the above equation by referring to our ex-
introduce two nodes: one to represent the package a,
ample from Section 2. The product wi ϕi ((~o, ~k)j , (~o, ~k)l , rel ) and another to represent the package a.b.
scores a particular pairwise feature function ϕi . In our
example, each row in the tables given in Figure 1(c) de- • Methods. We introduce a node for each method de-
fines a pairwise feature function and its weight. Consider clared in the application’s classes. For the example of
the first row of the top-most table. This row denotes a Figure 1(a), we introduce two nodes: one node to rep-
pairwise feature function which returns 1 if its inputs are resent the constructor <init>() and another node for
the method c(). If a class overrides a given method, To detect all overrides that occur in a given set of classes,
we use one node to represent both the method declared for each class we collect all of the methods it implements.
in the superclass and the one declared in the subclass. Then, for each of these methods, we link it with all the
This guarantees that overriding methods are renamed methods it overrides. Finally, we combine all methods that
consistently, which is necessary for preserving the ap- are (possibly indirectly) linked together into a single node.
plication’s semantics.
4.2 Relationships
• Expressions. We introduce a node to represent con- To capture the structure of an Android application, we
stant values (e.g., integers, strings, etc.) and the value introduce relationships between its program elements. We
null. For example, we add a node 5 to capture the
define all such relationships in Figure 2. The second column
constant value 5. Nodes for other kinds of constant in the table defines the type of each edge. For example, the
values are introduced analogously. first edge type is (m, op, performs-op). This edge type says
• Access Modifiers. We introduce nodes to repre- that it connects a method m to an operation op with the re-
sent access modifiers, such as static, synchronized, lationship performs-op. The third column specifies under
private, protected, public, and so forth. what condition the edge is added between two nodes (of the
correct type). The edge of the first type is added whenever
• Operations. We introduce nodes to represent opera- the method m performs an operation op. The types m and
tions (e.g., +, -, etc.). op are the ones we already defined in Section 4.1. We or-
ganize the relationships into two broad categories: method
We remark that we ignore generic types because they are
relationships and structural relationships.
removed by the compiler during the type erasure process [7].
Furthermore, we ignore the names of local variables and
method parameters since they are not part of the applica- Method Relationships.
tion’s APK. For example, we do not introduce a node to rep- Method relationships capture the semantic behavior of
resent the method parameter’s name str of the method c() methods. This is important because method names typically
in Figure 1(b). We do however capture the types of local describe the method’s behavior. For example, the method
variables and method parameters, e.g. we capture that the name execSQL in our example of Figure 1 describes that this
method c() has a parameter of type String. method executes an SQL command. We remark on several
points pertaining to method relationships. First, the pro-
gram elements denoted by o, arg and v are not necessarily
Known and Unknown Program Elements. fields. For example, the method call field.foo().bar() re-
We capture whether a node’s name is obfuscated and is
sults in two edges of type receiver: (foo, bar, receiver)
to be predicted, or is known and should not be predicted,
and (field, foo, receiver). Second, for every loop occur-
using the following set of rules:
ring in a method, we capture how different values and fields
• Nodes that represent packages, classes, methods, and are used within the loop using the relationships loop-read
fields that are part of the Android API are known. For and loop-write. To capture which classes are accessed by a
example, the node of the class SQLiteOpenHelper in method, we introduce the relationship writes-classfield
our example is known because this class is part of the and the relationship calls-classmethod.
Android API. All program elements that are part of Finally, we remark that the relationships defined above, in
the Android API are referred by their name, and thus addition to capturing the semantics of methods, also capture
any obfuscator keeps their names intact. how fields, classes and methods are used by the application.
For example, adding the information that method m reads
• Constructor methods (both dynamic and static) have field f also conveys that f is read by m.
fixed names and are thus known.
Structural Relationships.
• If a method overrides a known method (e.g., a method
The structural relationships capture the relations between
that is part of the Android API), then the nodes repre-
the nodes, such as whether two classes are defined within
senting both methods are known. We enforce this rule
the same package or not. These features are particularly
implicitly by keeping a single node to represent both
important for predicting obfuscated third-party libraries, as
methods. We explain this shortly.
the correct prediction of a small number of classes within
• All remaining packages, classes, methods, and fields the library’s package significantly aids to accurately predict
are unknown. the library’s remaining program elements. Note that for
method parameters, we express not only their type, but also
how often they occur. This is captured with the relation-
Grouping Method Nodes. ship argtype-N. Further, we define the two relationships
In the context of inheritance and method overriding, intro- read-before and written-before to capture the order of
ducing a node for each declared method leads to issues. The reads and writes to fields.
reason is that the two methods must have the same signa-
ture, where a method’s signature is defined by the method’s Comparison to Other Prediction Systems.
name along with the number and types of its parameters. Android applications have significantly more complex struc-
To guarantee that the deobfuscation is semantic-preserving, ture compared to programs encoded in, e.g., untyped, dy-
we combine all methods related via inheritance in a single namic languages. Precisely capturing this structure is key to
node. We refer to the process of combining all such methods enable the accurate prediction of Android applications. De-
as grouping. Guard is the first prediction system for programs that sup-
Relationship Type Condition for Relating Two Program Elements
Method Relationships
(m, op, performs-op) method m performs operation op (e.g. addition +, xor, etc.)
Method (m, t, performs-cast) method m performs a cast to type t
operation (m, t, instance-of) method m performs an instanceof check for type t
(m, e, returns) method m returns an expression e (e.g. a field or a method call)
Reads and (m, e, uses) expression e appears in m
writes (m, f , writes) method m modifies field f
Arguments and (m, arg, flows-into) there is a call o.m(. . . , arg, . . .) with argument arg
receivers (o, m, receiver) there is a call o.m(. . .)
Loops (v, m, loop-read) method m uses the value v within a loop
(f , m, loop-write) method m writes to the field f within a loop
Accessed (m, c, writes-classfield) class c that contains a field that is read by m
Classes (m, c, calls-classmethod) class c that contains a method called by m
Structural Relationships
(e, p, contained-in-package) package p contains a class, a method, or a field e
Packages (p1 , p2 , direct-subpackage-of) package p1 that is directly contained within a package p2
(p1 , p2 , subpackage-of) package p1 is contained within a package p2 , but not directly
(f , c, field-in) field f is declared in a class c
Classes (m, c, method-in) method m declared in a class c
(c1 , c2 , overrides) class c1 overrides a program element in class c2
(c, i, implements) class c that implements an interface i
(c1 , c2 , extends) class c1 extends class c2
(f , t, field-type) the type of field f is t
Types (m, t, return-type) method m that returns an object of return type t
(m, t, argtype-N) method m has N parameters of type t
Access modifiers (e, am, has-modifier) method or field e has access modifier am
(f , e, gets) assignment expression f = e (e.g. a field name or a method call)
Fields (f , e, initialized-by) there is an initialization statement statements f = e
(f1 , f2 , read-before) field f1 is read before field f2
(f1 , f2 , written-before) field f1 is written before field f2

Figure 2: Relationships used to relate the program elements of Android applications. The second column defines the edge
type (i.e. the program elements it related). Each relationship is added if the condition in the third column is true.

ports a rich set of structural relationships, including types, an indicator function for each pair of labels and kind of
structural hierarchies, and access modifiers. In our evalua- relationship observed in the training set of non-obfuscated
tion, we show that DeGuard strikes a balance between the programs. While the pairwise features are derived from the
accuracy and efficiency of prediction: the set of relationships training data, the weights associated to these features are
defined above are sufficient to accurately predict a significant learned during the learning phase.
part (roughly 80%) of the program elements obfuscated by
ProGuard, while keeping the complete prediction time rea-
sonable (under a minute on average). 5. CONSTRAINTS
In this section we define the constraints that our deob-
4.3 Pairwise Feature Functions fuscation mechanism must satisfy while renaming program
The pairwise features ϕi are derived from the relationships elements to ensure both syntactic and semantic validity of
defined above, based on the relationships observed in the the deobfuscated application. First, we describe method
dependency graphs used in the learning phase. Formally, let naming constraints, which are more complex to define due
G1 = (V1 , E1 ), . . . , Gm = (Vm , Em ) be the set of dependency to method overrides. Afterwards, we describe naming con-
graphs used in the learning phase with naming assignments. straints for fields, classes, and packages.
For each edge (Vi , Vj , rel ) ∈ E1 ∪ . . . ∪ Em that appears in
the dependency graphs, we define a pairwise feature 5.1 Naming Constraints for Methods
Method naming constraints are necessary for both seman-
1 if N = ni , N 0 = nj , Rel = rel

ϕi (N, N 0 , Rel ) = tic reasons and for syntactic well-formedness. According to
0 otherwise
Java’s semantics, whenever a class extends another class,
where ni and nj are the names assigned to the program the method declared in the subclass overrides a method de-
elements denoted by Vi and Vj . The pairwise features define clared in the super class if the two methods have the same
1 public class A { Algorithm 1 Detecting inequality constraints for method
2 public void a(A a) {} names
3 public void b(Object o) {} 1: function findConstraints
4 public void c() {} 2: object ← java.lang.Object
5 } 3: handleClass(object, ∅)
6 public class B extends A { 4: end function
5: function handleClass(class, aboveMethods)
7 public void g() {} 6: methods ← aboveMethods ∪ class.nonPrivateMethods()
8 private int h() {} 7: reportConstraints(methods ∪ class.privateMethods())
9 } 8: subClasses ← classes that directly extend/implement class
10 public class C extends B { 9: for subClass ∈ subClasses do
11 public void x() {} 10: handleClass(subClass, methods)
12 } 11: end for
12: end function
Figure 3: An example that illustrates different types of 13: function reportConstraints(methods)
method naming constraints 14: . Report all inequality constraints for methods
15: partitions ← partition methods by parameter types
16: for partition ∈ partitions do
17: report partition as an inequality constraint
signature, i.e. the same name and list of parameter types. 18: end for
19: end function
Method overrides thus change the application’s semantics.
Further, all methods within the same class must have dis-
tinct signatures. Below, we give an example to illustrate
different kinds of method naming constraints. Afterwards,
we describe how DeGuard derives these constraints. plicitly specified. We specify inequality constraints as sets
of program elements that have distinct names.
Example. Formally, let V = {V1 , . . . , Vn } be the set of nodes in
We illustrate method naming constraints with an example. the dependency graph. We define an inequality constraint
Consider the program in Figure 3. Here we have three classes as a set of nodes C ⊆ V . An assignment ~ y = (y1 , . . . , yn )
that exemplify different cases of method naming constraints. of names to program elements satisfies the inequality con-
The name of the method A.a(A) is not constrained by straint C if ∀Vi , Vj ∈ C. yi 6= yj . For example, to spec-
any method declared in Figure 3 because it has a unique ify that the methods C.x() and B.g() must have distinct
list of parameter types. Here, the method A.a(A) is the names, we use the inequality constraint C = {C.x(), B.g()}.
only method that has one parameter of type A. In contrast, Note that we define inequality constraint using sets of el-
A.b(Object) cannot be renamed to equals, because then ements whose elements must be pairwise distinct, as op-
it would override java.lang.Object.equals(Object) from posed to standard binary inequality constraints (e.g., C.x()
the Java standard library. This constraint is needed because 6= B.g()) because the encoding of the former is more concise.
A implicitly extends Object and the list of parameter types
a matches that of the method equals declared in the class Deriving Inequality Constraints for Methods.
Object. We next describe how we derive inequality constraints
The names of the methods B.g() and B.h() must be dis- for methods. Without loss of generality, we treat inter-
tinct even though their return types and access level mod- faces identically to classes. The inequality constraints for
ifiers are different. This is because neither the return type methods are derived using Algorithm 1. The function find-
nor the access level modifier is part of a method signature, Constraints calls handleClass with the class Object and
and therefore renaming both methods to the same name re- the empty set of method names (since Object has no su-
sults in the same method signature. per classes). The recursive call on Line 10 reaches all other
The names of the methods B.g() and A.c() must be dis- classes as every class (transitively) extends Object.
tinct due to the semantics of method overriding in the pres- The function handleClass(class, aboveMethods) reports
ence of inheritance. Here, the class B extends A. Therefore, all inequality constraints for class, where the parameter
a potential change of the name of B.g() to that of A.c() aboveMethods contains all methods that the methods de-
would result in overriding method A.c(). The names of the clared in class can potentially override.
methods B.h() and A.c() must be also distinct due to the The function reportConstraints reports the inequality
semantics of method overriding, even though B.h() is pri- constraints for the methods contained in methods. To do
vate. The method C.x()’s name is not constrained by the this, it first partitions the methods based on their parameter
name of B.h() because B.h() is declared as private. Ac- types. All methods in a given partition must have distinct
cording to Java’s semantics, no method may override private names, because otherwise, they would have the same signa-
methods. ture. On the other hand, no method is constrained by the
methods in the other partitions.
Expressing Method Constraints.
Our example shows that in addition to equality constraints, Result on the Example.
inequality constraints are also needed to formalize all nam- Here, we show the result of applying Algorithm 1 on the
ing properties of methods. Equality constraints can be han- program in Figure 3. For simplicity, we assume that A does
dled implicitly by representing methods that must have an not extend java.lang.Object. More precisely, a call to
identical signature with one node in the dependency graph handleClass(A, ∅) results in the inequality constraints
(see Section 4). This guarantees that all such methods are {A.c(), B.g(), B.h()} and {A.c(), B.g(), C.x()}. Note
renamed consistently. Inequality constraints, necessary to that we implicitly remove singleton inequality constraints
avoid accidental overrides due to inheritance, must be ex- as these are always satisfied.
5.2 Naming Constraints for Fields, Classes, and that maximize the probability P (O ~ = ~kj ), com-
~ = ~oj | K
Packages puted as defined in Section 3, for all programs h~oj , ~kj i in
The deobfuscation mechanism must satisfy the following the training set. Unfortunately, computing the weights us-
properties: (i) any two packages contained in the same pack- ing precise maximum likelihood estimation is prohibitively
age must have distinct names, (ii) any two classes contained expensive in our context, due to the large number of nodes
in the same package must have distinct names, and (iii) and possible labels that can be assigned to them. DeGuard
any two fields declared in the same class must have distinct therefore learns the weights using pseudo likelihood, which
names. We remark that the types of fields are irrelevant for approximates the conditional distribution P (O ~ | K)
~ as the
naming constraints. This is because a field is referred to by product of the conditional distributions P (Oi | N (Oi ), K)~
its name, and the type of a field is not part of this name. of each unknown node Oi ∈ O ~ conditioned on the node’s
This is in contrast to methods, which are called by their neighbors N (Oi ) and the known nodes K. ~ For the complete
signature and where the types of a method’s parameters are details on training using pseudo likelihoods see [35, §5.4].
part of the method’s signature. Using the training described above, the training of this
For a given Android app, the naming constraints for fields, model took about 2 hours on a 32-core machine with four
classes, and packages, are formalized using inequality con- 2.13GHz Xeon processors running Ubuntu 14.04 with 64-Bit
straints in the same way we formalize the constraints for OpenJDK Java 1.7.0 51.
method names. We derive all inequality constraints for fields,
classes, and packages, by iterating over all classes and pack- MAP Inference.
ages and reporting inequality constraints as defined by the To predict likely names ~o to be assigned to all obfuscated
above properties. ~ DeGuard computes the MAP inference query
elements O,

~ = ~k)
~ = o~0 | K
~o = argmax P (O
6. IMPLEMENTATION AND EVALUATION
o~0 ∈Ω
In this section we describe the implementation of De-
Guard and the experiments we conducted with it.
where ~k are the names assigned to the known elements K. ~
For this step, we use the publicly available Nice2Predict
6.1 The DeGuard System framework [5]. Nice2Predict computes the MAP query us-
We now present our DeGuard system, which is pub- ing a scalable, greedy algorithm, where names assigned to
licly available at http://apk-deguard.com. DeGuard is im- obfuscated program elements are iteratively changed one-
plemened using Soot [38], a framework for static analysis by-one or in pairs until the score stops improving. At every
of Java and Android applications. Given an Android APK, iteration, all naming constraints are checked for violations.
Soot transforms it into an intermediate format (called Jim- More details on this algorithm are provided in [31]. After
ple) that simplifies the analysis of the application. To con- predicting the names for all obfuscated elements, DeGuard
struct the application’s dependency graph, we use Soot’s renames them using the Soot API, and then constructs and
API to traverse all program elements. outputs the deobfuscated APK.
To predict the names of all obfuscated elements for a given
application, DeGuard performs a MAP inference query
on the CRF model constructed from the application’s pro-
6.2 Experimental Evaluation
gram elements, the set of pairwise features (described in 4), We now present our experiments with DeGuard. First,
and the feature weights. Next, we describe how DeGuard we evaluate DeGuard’s accuracy on deobfuscating benign,
learns a probabilistic model (the pairwise features and their open-source applications obfuscated using ProGuard. Sec-
weights) from non-obfuscated Android applications, and how ond, we discuss our experience in inspecting malware sam-
it uses this probabilistic model to predict likely names of ob- ples deobfuscated using DeGuard.
fuscated program elements using MAP inference.
6.2.1 ProGuard Experiments
Feature Functions and Weights. We perform two tasks to evaluate DeGuard’s perfor-
To learn all feature functions and weights, we downloaded mance on ProGuard-obfuscated applications. First, we mea-
1784 non-obfuscated Android applications from F-Droid [3], sure DeGuard’s accuracy on predicting the names of pro-
a popular repository for open-source Android applications. gram elements obfuscated by ProGuard. Second, based on
Out of these 1784 applications, we randomly selected 100 the results of the first task, we report DeGuard’s accuracy
which we intentionally left as our benchmark applications, on the task of predicting the names of obfuscated third-party
i.e., the ones we later use in our evaluation. We used the libraries imported in the APK.
remaining 1684 applications as our training set of applica- To conduct the above tasks, we obfuscated 100 benign ap-
tions. plications from F-Droid. These are the 100 applications that
The set of possible names assigned to obfuscated program we intentionally did not use during the learning phase. For
elements is drawn from the names observed in the training all 100 applications, we enabled ProGuard obfuscation by
set. The pairwise features ϕ1 , . . . , ϕm are also derived from modifying their build files, without modifying ProGuard’s
the training set, as described in Section 4.3. The only com- obfuscation rules, which specify which elements are obfus-
ponent missing in our probabilistic model are weights w ~ = cated. In our experiments, we use the non-obfuscated ver-
[w1 , . . . , wm ] associated with the pairwise features. One way sions of the applications as an oracle to check whether De-
to learn the weights is to use maximum likelihood estimation, Guard correctly deobfuscates the program elements’ names
where the weights w ~ are chosen such that the training data by renaming them to their original (i.e., non-obfuscated)
has the highest probability. That is, we chose weights w ~ names.
100
30
80
% of nodes

20 60

40
10
20
0
0 105 15 20 25 30 35 0
Fields Methods Classes Packages Total
Number of neighbors
Figure 4: Distribution of total number of neighbors over the Known Correct Mis-predicted
100 ProGuard-obfuscated Android applications. Figure 6: Average percentage of known, correctly predicted,
and mis-predicted program elements calculated over the 100
Android applications deobfuscated by DeGuard.
30
% of nodes

20 ments: (i) known, which are the elements not obfuscated


by ProGuard and which the system keeps as is, (ii) correct,
10 which are the elements that DeGuard correctly renames to
their original names, and (iii) mis-predicted, which are the el-
ements for which DeGuard predicts names that differ from
0 the original ones. Here, the first four bars show data about
010 5 15 20 25 30 35
fields, methods, classes, and packages, respectively, and the
Number of known neighbors fifth bar shows the aggregate data for all program elements.
Figure 5: Distribution of known number of neighbors over We use the predictions made for the package names in the
the 100 ProGuard-obfuscated Android applications. second task discussed in this section.
The data shows that ProGuard obfuscates a substantial
number of the program elements. On average, only 1.6% of
ProGuard-obfuscated APKs. the fields, 33% of the methods, 9.4% of the classes, and 9.3%
In Figures 4 and 5 we show two relevant metrics that re- of the packages are known. Thus, on average, ProGuard
veal the dependency structure of the 100 applications that obfuscates 86.7% of each application’s program elements.
we obfuscated using ProGuard. The bar chart depicted in The data further shows that DeGuard correctly deobfus-
Figure 4 shows the distribution of total number of neighbors. cates a significant part of the obfuscated program elements.
This figure shows one bar for each neighborhood size, where For example, while only 1.6% of the fields in the obfus-
the bar’s height indicates the percentage of nodes that have cated applications are known, after DeGuard deobfuscates
exactly that number of neighbors. For example, the fifth them, 80.6% of all fields have names identical to the origi-
bar indicates that the percentage of nodes with exactly 4 nal ones. We remark that 80.6% is a lower bound on how
neighbors is around 22%. Similarly, the bar chart shown in well DeGuard deobfuscates fields. This is because some
Figure 5 shows the distribution of known neighbors. The of the names classified as mis-predicted are semantically
data in these two figures reveals two key points about our close to the original ones. For example, in the application
features: (i) the nodes are well-connected (99% of the nodes FacebookNotifications, DeGuard suggested appView and
have at least 3 neighbors), and (ii) most nodes have known mWindowManager as names for two fields, while the original
neighbors (99% have at least one known neighbor). That is, names are webview and windowManager, respectively.
our features lead to dependency graphs where informed pre- Overall, the data shows that among all program elements,
diction seems possible (rather than graphs which are mostly DeGuard increases the percentage of names that are identi-
disconnected where there would be little or no flow into a cal to the original ones from 13.3% (in the obfuscated APK)
node whose name is to be predicted). to 79.1% (in the deobfuscated APK). We remark that the ap-
plications used in this experiment are benign. DeGuard’s
Task 1: Predicting Program Element Names. prediction accuracy on malicious applications may therefore
For this task, we deobfuscated the 100 benchmark ap- be lower.
plications, which we previously obfuscated with ProGuard.
We remark that ProGuard, in addition to renaming pro- Task 2: Predicting Third-party Libraries.
gram elements, also removes some elements. For example, We next use the deobfuscation results for package names
it removes fields, methods, and classes that are not used obtained from Task 1 in order to evaluate DeGuard’s effec-
by the application. Hence, we evaluate whether DeGuard tiveness for predicting third-party libraries.
correctly deobfuscates elements not removed by ProGuard. We first explain what we mean by the term library. We
Figure 6 shows the percentage of known elements (which identify libraries by their package names. We classified pack-
DeGuard does not try to reverse), correctly predicted el- age names into library and application-specific using a sim-
ements, and mis-predicted elements, averaged over all 100 ple heuristic: any package name that appears in multiple
applications deobfuscated by DeGuard. Each bar has three applications is classified as corresponding to a library. This
segments, which represent the three kinds of program ele- heuristic works well because most application-specific pack-
1 public final class d { 1 public class SearchOfficesView extends BaseView {
2 private String a = 2 private void g () {
3 System.getProperty("line.separator"); 3 m = getSystemService("location");
4 private char[] b ; 4 local = m .getBestProvider(...);
5 private byte[] c ; 5 o = m .getLastKnownLocation(local);
6 6 ...
7 public static byte[] a (String str) {...}; 7 }
8
8 }
(a) Obfuscated code 9 private void j () {...}
10 }
1 public final class Base64 { (a) Obfuscated code
2 private String NL =
3 System.getProperty("line.separator"); 1 public class SearchOfficesView extends BaseView {
4 private char[] ENC ; 2 private void init () {
5 private byte[] DEC ; 3 locationManager =getSystemService("location");
6 4 local = locationManager .getBestProvider(...);
7 public static byte[] decode (String str) {...}; 5 location =
8 } 6 locationManager .getLastKnownLocation(local);
(b) Deobfuscated code
7 ...
Figure 7: Deobfuscating the Base64 decoder found in the 8 }
9
GingerMaster malware sample.
10 private void requestLocationUpdates () {...}
11 }
(b) Deobfuscated code
age names are unique to a particular application. For ex-
ample, org.apache.commons.collections4 is classified as Figure 8: Deobfuscating fields and methods that store and,
a library, while com.pindroid.providers is classified as an respectively, handle location data. The code snippet is taken
application-specific package name. Based on our heuristic, from the Bgserv malware sample.
we identified a total of 133 libraries imported in the APKs.
To measure DeGuard’s effectiveness in predicting libraries
we use two (standard) metrics — precision and recall. Let Summary of ProGuard Experiments.
L denote the set of all obfuscated libraries imported by an In summary, our experiments demonstrate that: (i) De-
application and P denote the set of predicted libraries by Guard correctly predicts an overwhelming part of the pro-
DeGuard. Here, P contains all package names that map to gram elements obfuscated by ProGuard, thereby effectively
one of the 133 names that we have previously classified as reversing ProGuard’s obfuscation mechanism; (ii) it pre-
libraries. We compute DeGuard’s precision using the for- cisely identifies third-party Android libraries, and (iii) it is
mula precision = |L ∩ P |/|P |, and recall using the formula robust and efficient, taking on average less than a minute
recall = |L ∩ P |/|L|. Intuitively, precision shows the per- per application.
centage of libraries that DeGuard correctly predicts, and
recall captures the percentage of libraries that DeGuard 6.2.2 Experiments with Malware Samples
attempts to predict. We randomly selected one sample from each of the 49 mal-
DeGuard’s precision and recall for predicting libraries is, ware families reported in [40]. We used DeGuard to deob-
on average, 91.3% and 91.0%, respectively. This result indi- fuscate the selected samples and manually inspected some
cates that DeGuard predicts libraries more accurately com- of them. While we cannot report DeGuard’s exact preci-
pared to arbitrary program elements. Further, DeGuard sion on the selected samples (since they are all obfuscated),
almost never mis-predicts a library. This is likely because we report on several interesting examples that suggest that
the training set, which we use to learn the weights of all DeGuard can be useful when inspecting malware. We also
features, may contain applications that import the same li- discuss DeGuard’s limitations in the context of malware.
braries that we attempt to predict.
We remark that the benchmark applications in this ex-
Revealing Base64 String Decoders.
periment are not malicious, and so the libraries that they
Malware often disguises text strings using the Base64 en-
import are also benign. Malicious third-party libraries em-
coding scheme. DeGuard can be used to discover the classes
bedded in applications can be more difficult to identify.
that implement such standard encoding schemes. Concretely,
DeGuard discovered the Base64 decoders in three of the
Prediction Speed. malware samples that we inspected. We remark that eight
For most applications, DeGuard takes on average less other samples also have Base64 decoders, but the classes
than a minute to deobfuscate its APK. Around 10% of the that implement this encoding are not obfuscated.
time is spent in constructing the dependency graph and de- As an example, in Figure 7 we show code snippets taken
riving all syntactic and semantics constraints. The remain- from the GingerMaster [39] malware sample2 . Figure 7(a)
ing 90% of the time is spent in computing the most likely shows the obfuscated code, and Figure 7(b) the correspond-
naming assignment using the approximate MAP inference ing deobfuscated code obtained using DeGuard. DeGuard
query. An interesting future work item is to investigate discovers that the class d implements a Base64 decoder and
faster MAP inference algorithms leveraging the specifics of
2
our dependency graphs. SHA1: 2e9b8a7a149fcb6bd2367ac36e98a904d4c5e482
renames it to Base64. Further, DeGuard reveals that the new names: it only identifies bad names based on syntactic
method a(String) decodes strings formatted in Base64 (we guidelines and provides that feedback to developers.
confirmed this by inspecting the implementation of method The works of Allamanis et al.[8, 9] suggest names for Java
a(String)) and renames it to decode(String). The state- program elements using n-gram language models and neural
ment Base64.decode(String) is more descriptive compared networks. Their technique, however, only allows predicting
to d.a(String), and we thus believe that DeGuard can the name of a single program element and is thus not appli-
help security analysts in inspecting this malware sample. cable to a deobfuscation task where most names are missing.
LibRadar [26] detects third-party Android libraries by ex-
Revealing Sensitive Data Usage. tracting a unique fingerprint from each library and creating
Malware often steals personal information, such as loca- a mapping from fingerprints to library names. Obfuscated
tion data, device identifiers, and phone numbers. We search libraries are then identified by their fingerprints. In contrast
the deobfuscated code of our malware samples for identi- to DeGuard, LibRadar is less general as it predicts names
fiers such as location and deviceId and discovered that only for packages, and it is less robust because it relies on
DeGuard often deobfuscates the names of fields that store stable features (i.e., completely unaffected by obfuscation).
sensitive data and methods that handle sensitive data. As
an example, in Figure 8 we show an obfuscated code snip- Probabilistic models for programs.
pet taken from the Bgserv malware sample3 , along with A recent surge in the number of open-source repositories
the deobfuscated code output by DeGuard. We observe has triggered several authors to create large-scale probabilis-
that DeGuard renames the obfuscated field o, which stores tic models for code. These models are then used for novel ap-
the device’s location, to location. Further, it renames plications such as code completion [32], generating code from
the method j() to requestLocationUpdates(). We in- natural language [17, 10], sampling code snippets [27], pro-
spected the method j() to reveal that it instantiates the gramming language translation [18], type annotating pro-
interface LocationListener and implements the method grams [19, 31] and others.
onLocationChanged() to receive location updates. The name Closest to our work is [31] which also uses structured pre-
requestLocationUpdates() assigned by DeGuard captures diction and the Nice2Predict framework [5] to guess names
the behavior of this method. of local variables for JavaScript programs. Our setting how-
In the remaining malware samples, we discovered that De- ever is different and requires more diverse feature functions,
Guard renames a number of fields and methods that handle constraints and range of elements for which names are to
other kinds of sensitive data, such as device identifiers. We be predicted; further, we use an order of magnitude more
believe that DeGuard can help in inspecting malware that scalable learning mechanism than [31].
misuses sensitive data, e.g., by allowing security experts to Several works [22, 21, 16, 24] use graphical models to dis-
search for certain identifiers in the deobfuscated code. cover properties about programs such as function specifi-
cations and invariants. These works, however, do not use
Limitations. MAP inference to discover overall optimal solutions for all
In addition to obfuscating program identifiers, most of predicted properties (and most do not learn from existing
malware samples we inspected are obfuscated with addi- programs). The work of Shin et al. [34] uses neural networks
tional techniques to further hinder reverse engineering. Ex- and a large training set to identify libraries in binaries. In
amples include custom encoding of strings (e.g. using cus- the context of Android, a recent paper by Octeau et al. [28]
tom encryption), extensive use of reflection, control flow ob- uses a probabilistic model and static analysis to determine
fuscation, code reordering using goto statements, and oth- if two applications may communicate via the Android intent
ers. Reversing these additional obfuscation steps is beyond mechanisms. However, theirs is a rather different task than
the scope of DeGuard. To inspect malware, security ana- the one addressed by our work.
lysts must therefore use a combination of deobfuscation tools
tailored to reversing different obfuscation techniques. 8. CONCLUSION
We presented a new approach for layout deobfuscation
7. RELATED WORK of Android APKs. The key idea is to phrase the problem
This section summarizes the works that are most closely of reversing obfuscated names as structured prediction in a
related to ours. probabilistic graphical model and to leverage the vast avail-
ability of non-obfuscated Android programs to learn this
Suggesting names for program elements. model. We implemented our approach in a system called
Several works have studied the effect of identifier names [36, DeGuard and demonstrated that DeGuard can success-
33, 12] and have shown that good names have significant im- fully and with high precision reverse obfuscations performed
pact on one’s ability to understand the source code. These by ProGuard, a task beyond the reach of existing systems.
studies have inspired tools [13, 33] that rename identifiers We believe that our work indicates the promise of leveraging
within a project to make them follow a given coding con- probabilistic models over “Big Code” for addressing impor-
vention. In contrast to DeGuard, the systems presented tant challenges in security.
in [13] and [33] cannot be used for deobfuscation. The
tool described in [13] relies on the names to be replaced to 9. ACKNOWLEDGMENTS
have meaningful, non-obfuscated names, so it can improve
The research in this work has been partially supported
them. This system cannot predict meaningful replacements
by ERC starting grant #680358. We thank Matteo Panza-
of names such as “a”. The tool of [33] does not suggest
cchi for extending Nice2Predict [5] with support for pseudo
3
SHA1: 03f9fc8769422f66c59922319bffad46d0ceea94 likelihood estimation.
10. REFERENCES [23] J. D. Lafferty, A. McCallum, and F. C. N. Pereira.
[1] Advertising SDK Can Be Hijacked for Making Phone Conditional Random Fields: Probabilistic Models for
Calls, Geo-Locating. Segmenting and Labeling Sequence Data. In ICML,
http://www.hotforsecurity.com/blog/advertising-sdk- 2001.
can-be-hijacked-for-making-phone-calls-geo-locating- [24] B. Livshits, A. V. Nori, S. K. Rajamani, and
7461.html. A. Banerjee. Merlin: Specification inference for
[2] dex2jar. https://github.com/pxb1988/dex2jar. explicit information flow problems. In PLDI, 2009.
[3] F-Droid. https://f-droid.org/. [25] D. Low. Protecting Java Code via Code Obfuscation.
[4] Java Decompiler. http://jd.benow.ca/. Crossroads, 4(3), Apr. 1998.
[5] Nice2Predict. [26] Z. Ma, H. Wang, Y. Guo, and X. Chen. Libradar: fast
https://github.com/eth-srl/Nice2Predict. and accurate detection of third-party libraries in
android apps. In ICSE 2016 - Companion Volume,
[6] ProGuard. http://proguard.sourceforge.net/.
2016.
[7] Type Erasure. https://docs.oracle.com/javase/
[27] C. J. Maddison and D. Tarlow. Structured generative
tutorial/java/generics/genTypes.html.
models of natural source code. In ICML, 2014.
[8] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton.
[28] D. Octeau, S. Jha, M. Dering, P. McDaniel, A. Bartel,
Learning natural coding conventions. In FSE, 2014.
L. Li, J. Klein, and Y. Le Traon. Combining static
[9] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton.
analysis with probabilistic models to enable
Suggesting accurate method and class names. In FSE,
market-scale android inter-component analysis. In
2015.
POPL, 2016.
[10] M. Allamanis, D. Tarlow, A. D. Gordon, and Y. Wei.
[29] N. D. Ratliff, J. A. Bagnell, and M. Zinkevich.
Bimodal modelling of source code and natural
(Approximate) Subgradient Methods for Structured
language. In ICML, 2015.
Prediction. In AISTATS, 2007.
[11] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel,
[30] V. Raychev, P. Bielik, M. Vechev, and A. Krause.
J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel.
Learning programs from noisy data. In POPL, 2016.
Flowdroid: Precise context, flow, field, object-sensitive
[31] V. Raychev, M. Vechev, and A. Krause. Predicting
and lifecycle-aware taint analysis for android apps. In
program properties from ”big code”. In POPL, 2015.
PLDI, 2014.
[32] V. Raychev, M. Vechev, and E. Yahav. Code
[12] S. Butler, M. Wermelinger, Y. Yu, and H. Sharp.
completion with statistical language models. In PLDI,
Exploring the influence of identifier names on code
2014.
quality: An empirical study. In CSMR, 2010.
[33] P. A. Relf. Tool assisted identifier naming for
[13] B. Caprile and P. Tonella. Restructuring program
improved software readability: an empirical study. In
identifier names. In ICSM, 2000.
ISESE, 2005.
[14] K. Chen, X. Wang, Y. Chen, P. Wang, Y. Lee,
[34] E. C. R. Shin, D. Song, and R. Moazzezi. Recognizing
X. Wang, B. Ma, A. Wang, Y. Zhang, and W. Zou.
functions in binaries with neural networks. In
Following devil’s footprints: Cross-platform analysis of
USENIX Security, 2015.
potentially harmful libraries on android and ios. In
S&P, 2016. [35] C. Sutton and A. McCallum. An introduction to
conditional random fields. Found. Trends Mach.
[15] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung,
Learn., 4(4):267–373, Apr. 2012.
P. McDaniel, and A. N. Sheth. Taintdroid: An
information-flow tracking system for realtime privacy [36] A. A. Takang, P. A. Grubb, and R. D. Macredie. The
monitoring on smartphones. In OSDI, 2010. effects of comments and identifier names on program
comprehensibility: an experimental investigation. J.
[16] S. Gulwani and N. Jojic. Program verification as
Prog. Lang., 4(3):143–167, 1996.
probabilistic inference. In POPL, 2007.
[37] O. Tripp, S. Guarnieri, M. Pistoia, and A. Aravkin.
[17] T. Gvero and V. Kuncak. Synthesizing java
Aletheia: Improving the usability of static security
expressions from free-form queries. In OOPSLA, 2015.
analysis. In CCS, 2014.
[18] S. Karaivanov, V. Raychev, and M. Vechev.
[38] R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam,
Phrase-based statistical translation of programming
and V. Sundaresan. Soot - a Java Bytecode
languages. Onward!, 2014.
Optimization Framework. In Proceedings of the 1999
[19] O. Katz, R. El-Yaniv, and E. Yahav. Estimating types
Conference of the Centre for Advanced Studies on
in binaries using predictive modeling. In POPL, 2016.
Collaborative Research. IBM Press, 1999.
[20] D. Koller and N. Friedman. Probabilistic Graphical
[39] R. Yu. Ginmaster: A case study in android malware.
Models: Principles and Techniques - Adaptive
https://www.sophos.com/en-us/medialibrary/PDFs/
Computation and Machine Learning. The MIT Press,
technical%20papers/Yu-VB2013.pdf.
2009.
[40] Y. Zhou and X. Jiang. Dissecting android malware:
[21] T. Kremenek, A. Y. Ng, and D. Engler. A factor
Characterization and evolution. In S&P, 2012.
graph model for software bug finding. In IJCAI, 2007.
[22] T. Kremenek, P. Twohey, G. Back, A. Ng, and
D. Engler. From uncertainty to belief: Inferring the
specification within. In OSDI, 2006.

You might also like