Deguard
Deguard
Deguard
Figure 1: Statistical deobfuscation of Android applications using DeGuard. The red color indicates the elements whose
names are to be renamed (in the input), the green color are the same elements with the new names (in the output), and the
purple color denotes the elements whose names are known and remain the same.
Syntactic and Semantic Constraints. Figure 1(c). DeGuard predicts that the name of the obfus-
Given an Android APK, DeGuard automatically derives cated class a is DBUtils and that the name of the obfuscated
a set of constraints which restricts the possible names as- field b is db. Below, we describe how DeGuard concludes
signed to the unknown program elements. These naming that these are the most likely names for this example.
constraints guarantee that the deobfuscated APK gener- To predict the names of the obfuscated elements, De-
ated by DeGuard is: (i) a syntactically well-formed pro- Guard performs a joint prediction that considers all pro-
gram, and (ii) semantically equivalent to the input APK. gram elements, known and unknown. To illustrate this in-
Two example constraints are: all fields declared in the same ference step, consider the graph in Figure 1(c). The tables
class must have distinct names and all classes that belong associated with the graph’s edges represent the likelihood,
to the same package must have distinct names. Any well- of each (pairwise) assignment of names to program elements
formed application must satisfy these two syntactic proper- (nodes). We remark that each table is derived from feature
ties. Naming constraints of methods are more intricate due functions associated with weights, which together represent
to method overriding. For example, if a method in a sub- (log-)likelihoods. We formally define feature functions and
class overrides a method in a superclass (in the input APK), explain the derivation of the likelihood tables in Section 3.
then the two methods must have the same name after de- Here, we illustrate how these likelihood tables are used to
obfuscation to preserve the overriding property. choose the most likely names. Our goal is to find an assign-
For example, suppose the package of class a also contains ment for all program elements that maximizes a score that
a class with the (non-obfuscated) name MainActivity. The is the sum of the weights in each table.
constraint a 6= MainActivity in Figure 1(b) specifies that For our example, according to the top-most table, the
the predicted name for node a must be distinct from the weight of assigning the name DBUtils to the class is 0.3.
name MainActivity. Indeed, if these two classes have iden- However, DeGuard does not select DBUtils as the name of
tical names, then the resulting output APK would not be this class. This is because selecting the name DBUtils does
syntactically well-formed. not result in the highest possible overall score. Suppose we
In Section 5, we describe an algorithm that, for any APK, select the name DBUtils. Then, we have two possible names
generates all necessary syntactic and semantic constraints. for the obfuscated field b, namely db and instance. Accord-
ing to the likelihood tables, both the former and the latter
Probabilistic Prediction. choice result in a total score of 1.2. However, if we select the
Using the derived dependency graph and constraints, De- name DBHelper and db, then the total score is 1.3, which
Guard infers the most likely names for all obfuscated ele- is the highest possible score for this example. DeGuard
ments. The predicted names for our example are depicted in therefore returns these names as most likely.
Formally, DeGuard performs a Maximum a Posteriori 3. BACKGROUND
(MAP) inference query on the CRF model defined by the de- In this section we provide the necessary background on
pendency graph and the feature functions. We define MAP probabilistic models, queries, and learning which we leverage
inference in Section 3. in this work. These concepts are well known in the field of
probabilistic graphical models [20]. The main purpose here
2.3 Security Applications is to review these parts and to illustrate how they are used
DeGuard can be used to tackle several practical security by our approach.
problems. In our evaluation, we show that DeGuard can
effectively reverse ProGuard’s layout obfuscation for benign Problem Statement.
Android APKs. Although ProGuard obfuscates 86.7% of the We phrase the problem of predicting the most likely names
program elements on average, DeGuard correctly predicts assigned to all obfuscated program elements as a problem in
the names for 79.1% of those elements. structured prediction. Intuitively, we model the elements
Predicting libraries is another important problem, which of a program as a tuple of random variables (V1 , . . . , Vn )
is particularly relevant for Android [14]. Mobile develop- ranging over a set of name labels Names. The set Names
ers tend to rely on a large number of libraries which often in our case contains all possible names from which we can
contain security vulnerabilities — from personal informa- choose to name program elements. Then, the joint distri-
tion release [15, 11] to severe man-in-the-middle vulnerabil- bution P (V1 , . . . , Vn ) (discussed later in this section) over
ities [1]. In our experiments, DeGuard reveals over 90% of these variables assigns a probability to each assignment of
the third-party libraries concealed by ProGuard. names to variables.
Further, numerous security analyses rely on descriptive ~ = (V1 , . . . , V ~ ) be the variables representing obfus-
Let O |O|
program identifiers. Examples include analyses that per- cated program elements, i.e., the variables whose names we
form statistical filtering of potential vulnerabilities [37] and would like to predict. The names of the remaining variables
probabilistic systems for detecting privacy leaks [11]. These K~ = (V ~ , . . . , Vn ) are known and will not be affected by
|O|+1
systems assume that the application’s program elements are the renaming. Then, to predict the most likely names for the
non-obfuscated. DeGuard can be used to deobfuscate ap- obfuscated program elements, we compute the Maximum a
plications before they are analyzed by such systems. Posteriori (MAP) inference query:
2.4 Challenges ~o = argmax P (O ~ = ~k)
~ = o~0 | K
o~0 ∈Ω
We discuss three key challenges when building a prediction
system for deobfuscating Android applications: ~
where Ω ⊆ Names |O| is the set of all possible assignments of
(i) Capturing the rich structure of Android applications: ~ and ~k ∈ Names |K|~
names to the obfuscated variables O, de-
precisely encoding the structure of Android applications us- ~
fines the names assigned to the known variables K. Next, we
ing a concise, yet adequate set of program elements and
describe how we actually represent and compute the condi-
relationships is important to ensure the predictions made ~ = ~k) for a given assignment
~ = ~o | K
by the system are accurate. This is difficult as a large set tional probability P (O
of relationships may hurt the scalability of the system while of names ~k.
missing important relationships, or defining bad ones, can
reduce the prediction accuracy. Dependency Graph.
(ii) Semantic equivalence: the rich structure of Java poses A dependency graph for a given program is an undirected
nontrivial constraints that must be captured to ensure the multigraph G = (V, E), where V is a set of random variables
resulting deobfuscated Android APK has equivalent seman- representing program elements and E ⊆ V ×V ×Rels is a set
tics to the input APK. of labeled edges. Here, Rels is a set of relationships between
(iii) Scalable learning: the expressive structure of An- program elements; we instantiate this set for Android appli-
droid applications inevitably results in complex dependency cations in Section 4. An edge (Vi , Vj , rel ) says that elements
graphs and a large variety of features that cannot be han- Vi and Vj are related via rel . An example of a dependency
dled efficiently by off-the-shelf machine learning systems. graph is shown in Figure 1(b).
According to our experiments, the most scalable available
prediction system for programs [31] required an order of Features and Weights.
magnitude longer than DeGuard to learn a probabilistic We define a pairwise feature function ϕ as follows:
model for Android.
ϕ : Names × Names × Rels → R
2.5 Scope and Limitations This function maps a pair of names and their relationship
In this work, we focus on deobfuscating Android appli- to a real number. In Section 4, we define several kinds of
cations that have been transformed using layout obfusca- feature functions and based on these we obtain the entire set
tion mechanisms, which rename fields, methods, classes and of pairwise features {ϕ1 , . . . , ϕm } automatically during the
packages with semantically obscure names. Other obfusca- learning phase (described at the end of this section). For
tion techniques, such as data obfuscation mechanisms, which example, for each observed edge (Vi , Vj , rel ) in the training
alter data structures, control-flow and cryptographic obfus- set of dependency graphs where the names assigned to Vi
cation mechanisms fall outside the scope of this work. We and Vj are ni and nj , respectively, we define a pairwise
remark that malicious Android applications often uses mul- feature ϕ(N, N 0 , Rel ) = 1 if N = ni , N 0 = nj , and Rel =
tiple obfuscation techniques to prevent reverse engineering. rel ; otherwise, ϕi (N1 , N2 , rel ) = 0. Further, for any ϕi , we
Security experts must thus use a combination of deobfusca- associate a weight wi , also computed during the learning
tion tools to effectively deobfuscated Android malware. phase.
Given a dependency graph G = (V, E), a prediction ~o (SQLiteOpenHelper, DBUtils, extends) and 0 for all other
~ and an assignment ~k for the
for the obfuscated variables O, inputs. That feature function also has a weight of wi = 0.3,
fixed, known variables K, ~ we associate a feature function fi which is determined during learning. We do not include the
to each pairwise feature ϕi defined as follows: type of relationship in the tables since in this example the
X program elements are connected via a single relationship.
fi (~o, ~k) = ϕi ((~o, ~k)j , (~o, ~k)l , rel ) In our example, the MAP inference query will return the
(Vj ,Vl ,rel)∈E assignment ~o highlighted in green (i.e., DBHelper and db)
as that assignment satisfies the constraints in Ω (in our ex-
Here, (~o, ~k)j denotes the jth name in the vector (~o, ~k). We ample we have a single inequality constraint) and the total
can think of fi as lifting ϕi to quantify ϕi ’s effect on all the score of ~o is the highest: 0.2 + 0.4 + 0.7 = 1.3. To compute
edges in the graph (i.e., adding up ϕi ’s effect on each edge). this score, DeGuard implements a greedy MAP inference
The end result computed by fi is a real number capturing algorithm which we describe in Section 6.1.
the overall effect of ϕi .
Learning from “Big Code”.
Conditional Random Fields. The input to the learning phase of DeGuard is a set
A conditional random field (CRF) is a probabilistic model of p programs {h~o(j) , ~k(j) i}pj=1 for which both vectors of
which defines a conditional probability distribution, that is,
~ = ~k) as follows:
~ = ~o | K names ~o and ~k are given. That is, the training data con-
P (O
tains non-obfuscated applications, which can be downloaded
m
from repositories for open-source Android applications, such
~ = ~k) = 1 exp(
X
~ = ~o | K
P (O wi fi (~o, ~k)), as F-Droid [3]. From this input, the learning outputs weights
Z i=1 {wi }m
i=1 such that names in the training data programs are
where each fi , 1 ≤ i ≤ m, is a feature function associated correctly predicted. There are several variations of this
with a weight wi , and Z is a normalization constant. We learning procedure [29, 20]. For our application we use learn-
do not define Z as it can be omitted for our specific type of ing with pseudo-likelihoods as described in [35, §5.4]. We
query. describe this algorithm in more detail in Section 6.1.
It is then immediate that the dependency graph, together
with the feature functions f1 , . . . , fm and their associated 4. FEATURE FUNCTIONS
weights w1 , . . . , wm , define a CRF. In this section, we present the pairwise feature functions
used for our Android deobfuscation task. As described ear-
Prediction via MAP Inference. lier in Section 3, these feature functions are used to build the
To compute the most likely assignment ~o for the variables dependency graph. Recall that the construction of an appli-
~ we perform a MAP inference query:
O, cation’s dependency graph amounts to introducing a node
~ = ~k)
~ = o~0 | K for each program element and then connecting the program
~o = argmax P (O
o~0 ∈Ω
elements that are related. The signature of a dependency
graph is therefore defined by the application’s program ele-
Using our CRF model, we can compute the probability of ments and the relationships between them.
an assignment ~o using the formula:
m 4.1 Program Elements
~ = ~k) = 1 X
~ = o~0 | K
P (O exp( wi fi (~o, ~k)) A program’s dependency graph is defined over nodes that
Z i=1 represent different program elements. To capture the struc-
ture of an Android application, we introduce nodes for each
We omit the constant Z (as it does not affect the result of
of the following program elements:
the MAP inference query), expand fi (~o, ~k), and rewrite the
formula as follows: • Types. We introduce a node for each primitive type
~ = ~k) ∼
~ = o~0 | K (e.g., int, long, float etc.), reference type (e.g., Object,
P (O ArrayList, etc.), and array type (e.g., int[], Object[],
m
X X etc.) that appears in the application. For example,
∼ exp( wi ϕi ((~o, ~k)j , (~o, ~k)l , rel )) = we introduce a node to represent the reference type
i=1 (Vj ,Vl ,rel)∈E
SQLiteDatabase in the example of Figure 1(a).
X m
X
= exp( wi ϕi ((~o, ~k)j , (~o, ~k)l , rel )) • Fields. We introduce a node for each field declared in
(Vj ,Vl ,rel)∈E i=1 the application’s classes. For example, we introduce a
node to represent the field of type SQLiteDatabase in
We refer to the above as the total score of an assignment (~o, ~k). the example of Figure 1(a).
• Packages. We introduce a node for each package in
MAP Inference Example. the application. For example, given a package a.b, we
We now explain the above equation by referring to our ex-
introduce two nodes: one to represent the package a,
ample from Section 2. The product wi ϕi ((~o, ~k)j , (~o, ~k)l , rel ) and another to represent the package a.b.
scores a particular pairwise feature function ϕi . In our
example, each row in the tables given in Figure 1(c) de- • Methods. We introduce a node for each method de-
fines a pairwise feature function and its weight. Consider clared in the application’s classes. For the example of
the first row of the top-most table. This row denotes a Figure 1(a), we introduce two nodes: one node to rep-
pairwise feature function which returns 1 if its inputs are resent the constructor <init>() and another node for
the method c(). If a class overrides a given method, To detect all overrides that occur in a given set of classes,
we use one node to represent both the method declared for each class we collect all of the methods it implements.
in the superclass and the one declared in the subclass. Then, for each of these methods, we link it with all the
This guarantees that overriding methods are renamed methods it overrides. Finally, we combine all methods that
consistently, which is necessary for preserving the ap- are (possibly indirectly) linked together into a single node.
plication’s semantics.
4.2 Relationships
• Expressions. We introduce a node to represent con- To capture the structure of an Android application, we
stant values (e.g., integers, strings, etc.) and the value introduce relationships between its program elements. We
null. For example, we add a node 5 to capture the
define all such relationships in Figure 2. The second column
constant value 5. Nodes for other kinds of constant in the table defines the type of each edge. For example, the
values are introduced analogously. first edge type is (m, op, performs-op). This edge type says
• Access Modifiers. We introduce nodes to repre- that it connects a method m to an operation op with the re-
sent access modifiers, such as static, synchronized, lationship performs-op. The third column specifies under
private, protected, public, and so forth. what condition the edge is added between two nodes (of the
correct type). The edge of the first type is added whenever
• Operations. We introduce nodes to represent opera- the method m performs an operation op. The types m and
tions (e.g., +, -, etc.). op are the ones we already defined in Section 4.1. We or-
ganize the relationships into two broad categories: method
We remark that we ignore generic types because they are
relationships and structural relationships.
removed by the compiler during the type erasure process [7].
Furthermore, we ignore the names of local variables and
method parameters since they are not part of the applica- Method Relationships.
tion’s APK. For example, we do not introduce a node to rep- Method relationships capture the semantic behavior of
resent the method parameter’s name str of the method c() methods. This is important because method names typically
in Figure 1(b). We do however capture the types of local describe the method’s behavior. For example, the method
variables and method parameters, e.g. we capture that the name execSQL in our example of Figure 1 describes that this
method c() has a parameter of type String. method executes an SQL command. We remark on several
points pertaining to method relationships. First, the pro-
gram elements denoted by o, arg and v are not necessarily
Known and Unknown Program Elements. fields. For example, the method call field.foo().bar() re-
We capture whether a node’s name is obfuscated and is
sults in two edges of type receiver: (foo, bar, receiver)
to be predicted, or is known and should not be predicted,
and (field, foo, receiver). Second, for every loop occur-
using the following set of rules:
ring in a method, we capture how different values and fields
• Nodes that represent packages, classes, methods, and are used within the loop using the relationships loop-read
fields that are part of the Android API are known. For and loop-write. To capture which classes are accessed by a
example, the node of the class SQLiteOpenHelper in method, we introduce the relationship writes-classfield
our example is known because this class is part of the and the relationship calls-classmethod.
Android API. All program elements that are part of Finally, we remark that the relationships defined above, in
the Android API are referred by their name, and thus addition to capturing the semantics of methods, also capture
any obfuscator keeps their names intact. how fields, classes and methods are used by the application.
For example, adding the information that method m reads
• Constructor methods (both dynamic and static) have field f also conveys that f is read by m.
fixed names and are thus known.
Structural Relationships.
• If a method overrides a known method (e.g., a method
The structural relationships capture the relations between
that is part of the Android API), then the nodes repre-
the nodes, such as whether two classes are defined within
senting both methods are known. We enforce this rule
the same package or not. These features are particularly
implicitly by keeping a single node to represent both
important for predicting obfuscated third-party libraries, as
methods. We explain this shortly.
the correct prediction of a small number of classes within
• All remaining packages, classes, methods, and fields the library’s package significantly aids to accurately predict
are unknown. the library’s remaining program elements. Note that for
method parameters, we express not only their type, but also
how often they occur. This is captured with the relation-
Grouping Method Nodes. ship argtype-N. Further, we define the two relationships
In the context of inheritance and method overriding, intro- read-before and written-before to capture the order of
ducing a node for each declared method leads to issues. The reads and writes to fields.
reason is that the two methods must have the same signa-
ture, where a method’s signature is defined by the method’s Comparison to Other Prediction Systems.
name along with the number and types of its parameters. Android applications have significantly more complex struc-
To guarantee that the deobfuscation is semantic-preserving, ture compared to programs encoded in, e.g., untyped, dy-
we combine all methods related via inheritance in a single namic languages. Precisely capturing this structure is key to
node. We refer to the process of combining all such methods enable the accurate prediction of Android applications. De-
as grouping. Guard is the first prediction system for programs that sup-
Relationship Type Condition for Relating Two Program Elements
Method Relationships
(m, op, performs-op) method m performs operation op (e.g. addition +, xor, etc.)
Method (m, t, performs-cast) method m performs a cast to type t
operation (m, t, instance-of) method m performs an instanceof check for type t
(m, e, returns) method m returns an expression e (e.g. a field or a method call)
Reads and (m, e, uses) expression e appears in m
writes (m, f , writes) method m modifies field f
Arguments and (m, arg, flows-into) there is a call o.m(. . . , arg, . . .) with argument arg
receivers (o, m, receiver) there is a call o.m(. . .)
Loops (v, m, loop-read) method m uses the value v within a loop
(f , m, loop-write) method m writes to the field f within a loop
Accessed (m, c, writes-classfield) class c that contains a field that is read by m
Classes (m, c, calls-classmethod) class c that contains a method called by m
Structural Relationships
(e, p, contained-in-package) package p contains a class, a method, or a field e
Packages (p1 , p2 , direct-subpackage-of) package p1 that is directly contained within a package p2
(p1 , p2 , subpackage-of) package p1 is contained within a package p2 , but not directly
(f , c, field-in) field f is declared in a class c
Classes (m, c, method-in) method m declared in a class c
(c1 , c2 , overrides) class c1 overrides a program element in class c2
(c, i, implements) class c that implements an interface i
(c1 , c2 , extends) class c1 extends class c2
(f , t, field-type) the type of field f is t
Types (m, t, return-type) method m that returns an object of return type t
(m, t, argtype-N) method m has N parameters of type t
Access modifiers (e, am, has-modifier) method or field e has access modifier am
(f , e, gets) assignment expression f = e (e.g. a field name or a method call)
Fields (f , e, initialized-by) there is an initialization statement statements f = e
(f1 , f2 , read-before) field f1 is read before field f2
(f1 , f2 , written-before) field f1 is written before field f2
Figure 2: Relationships used to relate the program elements of Android applications. The second column defines the edge
type (i.e. the program elements it related). Each relationship is added if the condition in the third column is true.
ports a rich set of structural relationships, including types, an indicator function for each pair of labels and kind of
structural hierarchies, and access modifiers. In our evalua- relationship observed in the training set of non-obfuscated
tion, we show that DeGuard strikes a balance between the programs. While the pairwise features are derived from the
accuracy and efficiency of prediction: the set of relationships training data, the weights associated to these features are
defined above are sufficient to accurately predict a significant learned during the learning phase.
part (roughly 80%) of the program elements obfuscated by
ProGuard, while keeping the complete prediction time rea-
sonable (under a minute on average). 5. CONSTRAINTS
In this section we define the constraints that our deob-
4.3 Pairwise Feature Functions fuscation mechanism must satisfy while renaming program
The pairwise features ϕi are derived from the relationships elements to ensure both syntactic and semantic validity of
defined above, based on the relationships observed in the the deobfuscated application. First, we describe method
dependency graphs used in the learning phase. Formally, let naming constraints, which are more complex to define due
G1 = (V1 , E1 ), . . . , Gm = (Vm , Em ) be the set of dependency to method overrides. Afterwards, we describe naming con-
graphs used in the learning phase with naming assignments. straints for fields, classes, and packages.
For each edge (Vi , Vj , rel ) ∈ E1 ∪ . . . ∪ Em that appears in
the dependency graphs, we define a pairwise feature 5.1 Naming Constraints for Methods
Method naming constraints are necessary for both seman-
1 if N = ni , N 0 = nj , Rel = rel
ϕi (N, N 0 , Rel ) = tic reasons and for syntactic well-formedness. According to
0 otherwise
Java’s semantics, whenever a class extends another class,
where ni and nj are the names assigned to the program the method declared in the subclass overrides a method de-
elements denoted by Vi and Vj . The pairwise features define clared in the super class if the two methods have the same
1 public class A { Algorithm 1 Detecting inequality constraints for method
2 public void a(A a) {} names
3 public void b(Object o) {} 1: function findConstraints
4 public void c() {} 2: object ← java.lang.Object
5 } 3: handleClass(object, ∅)
6 public class B extends A { 4: end function
5: function handleClass(class, aboveMethods)
7 public void g() {} 6: methods ← aboveMethods ∪ class.nonPrivateMethods()
8 private int h() {} 7: reportConstraints(methods ∪ class.privateMethods())
9 } 8: subClasses ← classes that directly extend/implement class
10 public class C extends B { 9: for subClass ∈ subClasses do
11 public void x() {} 10: handleClass(subClass, methods)
12 } 11: end for
12: end function
Figure 3: An example that illustrates different types of 13: function reportConstraints(methods)
method naming constraints 14: . Report all inequality constraints for methods
15: partitions ← partition methods by parameter types
16: for partition ∈ partitions do
17: report partition as an inequality constraint
signature, i.e. the same name and list of parameter types. 18: end for
19: end function
Method overrides thus change the application’s semantics.
Further, all methods within the same class must have dis-
tinct signatures. Below, we give an example to illustrate
different kinds of method naming constraints. Afterwards,
we describe how DeGuard derives these constraints. plicitly specified. We specify inequality constraints as sets
of program elements that have distinct names.
Example. Formally, let V = {V1 , . . . , Vn } be the set of nodes in
We illustrate method naming constraints with an example. the dependency graph. We define an inequality constraint
Consider the program in Figure 3. Here we have three classes as a set of nodes C ⊆ V . An assignment ~ y = (y1 , . . . , yn )
that exemplify different cases of method naming constraints. of names to program elements satisfies the inequality con-
The name of the method A.a(A) is not constrained by straint C if ∀Vi , Vj ∈ C. yi 6= yj . For example, to spec-
any method declared in Figure 3 because it has a unique ify that the methods C.x() and B.g() must have distinct
list of parameter types. Here, the method A.a(A) is the names, we use the inequality constraint C = {C.x(), B.g()}.
only method that has one parameter of type A. In contrast, Note that we define inequality constraint using sets of el-
A.b(Object) cannot be renamed to equals, because then ements whose elements must be pairwise distinct, as op-
it would override java.lang.Object.equals(Object) from posed to standard binary inequality constraints (e.g., C.x()
the Java standard library. This constraint is needed because 6= B.g()) because the encoding of the former is more concise.
A implicitly extends Object and the list of parameter types
a matches that of the method equals declared in the class Deriving Inequality Constraints for Methods.
Object. We next describe how we derive inequality constraints
The names of the methods B.g() and B.h() must be dis- for methods. Without loss of generality, we treat inter-
tinct even though their return types and access level mod- faces identically to classes. The inequality constraints for
ifiers are different. This is because neither the return type methods are derived using Algorithm 1. The function find-
nor the access level modifier is part of a method signature, Constraints calls handleClass with the class Object and
and therefore renaming both methods to the same name re- the empty set of method names (since Object has no su-
sults in the same method signature. per classes). The recursive call on Line 10 reaches all other
The names of the methods B.g() and A.c() must be dis- classes as every class (transitively) extends Object.
tinct due to the semantics of method overriding in the pres- The function handleClass(class, aboveMethods) reports
ence of inheritance. Here, the class B extends A. Therefore, all inequality constraints for class, where the parameter
a potential change of the name of B.g() to that of A.c() aboveMethods contains all methods that the methods de-
would result in overriding method A.c(). The names of the clared in class can potentially override.
methods B.h() and A.c() must be also distinct due to the The function reportConstraints reports the inequality
semantics of method overriding, even though B.h() is pri- constraints for the methods contained in methods. To do
vate. The method C.x()’s name is not constrained by the this, it first partitions the methods based on their parameter
name of B.h() because B.h() is declared as private. Ac- types. All methods in a given partition must have distinct
cording to Java’s semantics, no method may override private names, because otherwise, they would have the same signa-
methods. ture. On the other hand, no method is constrained by the
methods in the other partitions.
Expressing Method Constraints.
Our example shows that in addition to equality constraints, Result on the Example.
inequality constraints are also needed to formalize all nam- Here, we show the result of applying Algorithm 1 on the
ing properties of methods. Equality constraints can be han- program in Figure 3. For simplicity, we assume that A does
dled implicitly by representing methods that must have an not extend java.lang.Object. More precisely, a call to
identical signature with one node in the dependency graph handleClass(A, ∅) results in the inequality constraints
(see Section 4). This guarantees that all such methods are {A.c(), B.g(), B.h()} and {A.c(), B.g(), C.x()}. Note
renamed consistently. Inequality constraints, necessary to that we implicitly remove singleton inequality constraints
avoid accidental overrides due to inheritance, must be ex- as these are always satisfied.
5.2 Naming Constraints for Fields, Classes, and that maximize the probability P (O ~ = ~kj ), com-
~ = ~oj | K
Packages puted as defined in Section 3, for all programs h~oj , ~kj i in
The deobfuscation mechanism must satisfy the following the training set. Unfortunately, computing the weights us-
properties: (i) any two packages contained in the same pack- ing precise maximum likelihood estimation is prohibitively
age must have distinct names, (ii) any two classes contained expensive in our context, due to the large number of nodes
in the same package must have distinct names, and (iii) and possible labels that can be assigned to them. DeGuard
any two fields declared in the same class must have distinct therefore learns the weights using pseudo likelihood, which
names. We remark that the types of fields are irrelevant for approximates the conditional distribution P (O ~ | K)
~ as the
naming constraints. This is because a field is referred to by product of the conditional distributions P (Oi | N (Oi ), K)~
its name, and the type of a field is not part of this name. of each unknown node Oi ∈ O ~ conditioned on the node’s
This is in contrast to methods, which are called by their neighbors N (Oi ) and the known nodes K. ~ For the complete
signature and where the types of a method’s parameters are details on training using pseudo likelihoods see [35, §5.4].
part of the method’s signature. Using the training described above, the training of this
For a given Android app, the naming constraints for fields, model took about 2 hours on a 32-core machine with four
classes, and packages, are formalized using inequality con- 2.13GHz Xeon processors running Ubuntu 14.04 with 64-Bit
straints in the same way we formalize the constraints for OpenJDK Java 1.7.0 51.
method names. We derive all inequality constraints for fields,
classes, and packages, by iterating over all classes and pack- MAP Inference.
ages and reporting inequality constraints as defined by the To predict likely names ~o to be assigned to all obfuscated
above properties. ~ DeGuard computes the MAP inference query
elements O,
~ = ~k)
~ = o~0 | K
~o = argmax P (O
6. IMPLEMENTATION AND EVALUATION
o~0 ∈Ω
In this section we describe the implementation of De-
Guard and the experiments we conducted with it.
where ~k are the names assigned to the known elements K. ~
For this step, we use the publicly available Nice2Predict
6.1 The DeGuard System framework [5]. Nice2Predict computes the MAP query us-
We now present our DeGuard system, which is pub- ing a scalable, greedy algorithm, where names assigned to
licly available at http://apk-deguard.com. DeGuard is im- obfuscated program elements are iteratively changed one-
plemened using Soot [38], a framework for static analysis by-one or in pairs until the score stops improving. At every
of Java and Android applications. Given an Android APK, iteration, all naming constraints are checked for violations.
Soot transforms it into an intermediate format (called Jim- More details on this algorithm are provided in [31]. After
ple) that simplifies the analysis of the application. To con- predicting the names for all obfuscated elements, DeGuard
struct the application’s dependency graph, we use Soot’s renames them using the Soot API, and then constructs and
API to traverse all program elements. outputs the deobfuscated APK.
To predict the names of all obfuscated elements for a given
application, DeGuard performs a MAP inference query
on the CRF model constructed from the application’s pro-
6.2 Experimental Evaluation
gram elements, the set of pairwise features (described in 4), We now present our experiments with DeGuard. First,
and the feature weights. Next, we describe how DeGuard we evaluate DeGuard’s accuracy on deobfuscating benign,
learns a probabilistic model (the pairwise features and their open-source applications obfuscated using ProGuard. Sec-
weights) from non-obfuscated Android applications, and how ond, we discuss our experience in inspecting malware sam-
it uses this probabilistic model to predict likely names of ob- ples deobfuscated using DeGuard.
fuscated program elements using MAP inference.
6.2.1 ProGuard Experiments
Feature Functions and Weights. We perform two tasks to evaluate DeGuard’s perfor-
To learn all feature functions and weights, we downloaded mance on ProGuard-obfuscated applications. First, we mea-
1784 non-obfuscated Android applications from F-Droid [3], sure DeGuard’s accuracy on predicting the names of pro-
a popular repository for open-source Android applications. gram elements obfuscated by ProGuard. Second, based on
Out of these 1784 applications, we randomly selected 100 the results of the first task, we report DeGuard’s accuracy
which we intentionally left as our benchmark applications, on the task of predicting the names of obfuscated third-party
i.e., the ones we later use in our evaluation. We used the libraries imported in the APK.
remaining 1684 applications as our training set of applica- To conduct the above tasks, we obfuscated 100 benign ap-
tions. plications from F-Droid. These are the 100 applications that
The set of possible names assigned to obfuscated program we intentionally did not use during the learning phase. For
elements is drawn from the names observed in the training all 100 applications, we enabled ProGuard obfuscation by
set. The pairwise features ϕ1 , . . . , ϕm are also derived from modifying their build files, without modifying ProGuard’s
the training set, as described in Section 4.3. The only com- obfuscation rules, which specify which elements are obfus-
ponent missing in our probabilistic model are weights w ~ = cated. In our experiments, we use the non-obfuscated ver-
[w1 , . . . , wm ] associated with the pairwise features. One way sions of the applications as an oracle to check whether De-
to learn the weights is to use maximum likelihood estimation, Guard correctly deobfuscates the program elements’ names
where the weights w ~ are chosen such that the training data by renaming them to their original (i.e., non-obfuscated)
has the highest probability. That is, we chose weights w ~ names.
100
30
80
% of nodes
20 60
40
10
20
0
0 105 15 20 25 30 35 0
Fields Methods Classes Packages Total
Number of neighbors
Figure 4: Distribution of total number of neighbors over the Known Correct Mis-predicted
100 ProGuard-obfuscated Android applications. Figure 6: Average percentage of known, correctly predicted,
and mis-predicted program elements calculated over the 100
Android applications deobfuscated by DeGuard.
30
% of nodes