Testing Using String Constraint Solver

Author ..........................................................
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
May 21, 2009

Certified by ....................................................
Michael D. Ernst
Associate Professor
Thesis Supervisor
Abstract
This dissertation presents techniques and tools for improving software reliabil-
ity, by using an expressive string-constraint solver to make implementation-based
testing more effective and more applicable. Concolic testing is a paradigm of
implementation-based systematic software testing that combines dynamic sym-
bolic execution with constraint-based systematic execution-path enumeration. Con-
colic testing is easy to use and effective in finding real errors. It is, however, limited
by the expressiveness of the underlying constraint solver. Therefore, to date, con-
colic testing has not been successfully applied to programs with highly-structured
inputs (e.g., compilers), or to Web applications.
This dissertation shows that the effectiveness and applicability of concolic test-
ing can be greatly improved by using an expressive and efficient string-constraint
solver, i.e., a solver for constraints on string variables. We present the design, im-
plementation, and experimental evaluation of a novel string-constraint solver. Fur-
thermore, we show novel techniques for two important problems in concolic test-
ing: getting past input validation in programs with highly-structured inputs, and
creating inputs that demonstrate security vulnerabilities in Web applications.
1 Introduction
1.1 Hampi: A Solver for String Constraints
1.2 Grammar-Based Concolic Testing
1.3 Concolic Security Testing
1.4 Contributions and Results
2 Concolic Testing
6 Conclusions
6.1 Discussion
6.2 Future Work
Bibliography
Chapter 1
Introduction
(i) Systematic testing of UNIX programs. Our experiments show that our
technique led to up to 2x improvements in line coverage (up to 5x cov-
erage improvements in parser code), eliminated all illegal inputs, and
enabled discovering 3 distinct, previously unknown, inputs that led to
infinitely-looping program execution.
(ii) A large security-critical application, the JavaScript interpreter of Inter-
net Explorer 7 (IE7). Our technique explores deeper program paths and
avoids dead-ends due to non-parsable inputs. Compared to regular
concolic testing, our technique increased coverage of the code gener-
ation module of the IE7 JavaScript interpreter from 53% to 81% while
using three times fewer tests.
3. Concolic security testing [64], a novel concolic testing technique (Chapter 5).
We created ARDILLA, a tool that implements this technique for PHP and ap-
plied ARDILLA to finding security vulnerabilities in Web applications. We
evaluated the tool on five PHP applications and found 68 previously un-
known vulnerabilities (23 SQLI, 33 first-order XSS, and 12 second-order XSS).
This dissertation begins with a description of concolic testing (Chapter 2). The
next three chapters form the core of this dissertation and present the main con-
tributions. Specifically, we present our solver for string constraints (Chapter 3),
describe grammar-based concolic testing (Chapter 4), and describe concolic secu-
rity testing (Chapter 5). We conclude with potential directions for future work
(Chapter 6).
Chapter 2
Concolic Testing
The purpose of concolic testing is to find program inputs that expose errors. The
idea of concolic testing is to run the program on an input, observe the program
execution on that input, and create additional inputs derived from the original.
The additional inputs are perturbed versions of the original input and exercise
execution-flow paths that are similar to that of the original input.
We illustrate the high-level idea of concolic testing. Running the program on
the seed input (I1) exercises a particular concrete execution path. In the following
figure, the arrow denotes the execution path. The small circles denote the dynamic
execution of conditional statements (e.g., if) that depend on the program input.
In the same way, from the original input I1, concolic testing creates additional
inputs I3, I4, and I5 that exercise alternative execution-flow paths in the program.
Each alternative path agrees with the original path up to the last conditional, and then the
execution takes the opposite branch. Note that an alternative path need not be of the
same length as the original path.
Each of the newly-created inputs is then executed, and thus concolic testing
generates additional inputs. For example, executing I4 allows concolic testing to
generate I6 and I7 (for clarity, paths exercised by previously-generated inputs have
been shortened). Thus, concolic testing runs the program multiple times, each time
with a different input. Each newly-generated input may lead to the generation of
additional inputs.
Concolic testing continues the process of running inputs and generating new
inputs, until all execution-flow paths have been exercised, or until the time limit is
reached.
In principle, concolic testing can exercise every feasible path in the program.
In practice, however, concolic testing is usually incomplete because the number of
feasible control paths may be astronomical (or even infinite) and because the preci-
sion of concolic execution, constraint generation, and solving is inherently limited.
Nevertheless, concolic testing has been shown to be very effective in finding errors
in programs [16,17,46,102].
Algorithm
Now, we discuss the concolic testing algorithm in more detail. Concolic test-
ing [17,46, 102] combines concolic execution with execution-path enumeration.
Given a sequential deterministic program P under test and a set of initial (seed)
program inputs, concolic testing finds inputs for P that expose errors. The algo-
rithm performs concolic execution of program P with each input, collects con-
straints generated during the execution, and generates new inputs by modifying
and solving the constraints. Each such newly-generated input aims to exercise a
different execution-flow path in P. The algorithm repeats concolic execution and
input generation, thus executing the program multiple times, each time with a dif-
ferent input. Each newly generated input may lead to the generation of additional
inputs. The algorithm terminates when a testing time limit is reached or no more
inputs can be generated (i.e., all execution paths in the code have been exercised).
First, the worklist of inputs is initialized with the seed inputs (line 4). Then,
the algorithm loops until there are no more inputs to explore, or the time limit is
reached (line 5). In the loop, the algorithm takes an input from the worklist (line 6),
and checks whether running P with the input triggers a runtime error (line 7).
The program is then executed concolically (procedure concolicExecution, explained
below). The result of concolic execution is a path constraint, which concolic testing
uses to generate new inputs (procedure createNewInputs, explained below).
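To make the structure of this loop concrete, the following Python sketch restates the worklist
algorithm described above. It is only an illustration of the control flow: run_and_check,
concolic_execution, and create_new_inputs are hypothetical stand-ins for the procedures named
in the text, not the dissertation's implementation.

import collections
import time

def concolic_testing(program, seed_inputs, time_limit_sec,
                     run_and_check, concolic_execution, create_new_inputs):
    # Worklist-driven concolic testing loop (simplified sketch).
    errors = set()
    worklist = collections.deque(seed_inputs)          # line 4: seed the worklist
    deadline = time.time() + time_limit_sec
    while worklist and time.time() < deadline:          # line 5: until empty or timeout
        inp = worklist.popleft()                        # line 6: take an input
        errors |= run_and_check(program, inp)           # line 7: run, record runtime errors
        pc = concolic_execution(program, inp)           # collect the path constraint
        worklist.extend(create_new_inputs(pc))          # enqueue inputs for alternative paths
    return errors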
Concolic Execution
Concolic execution (procedure concolicExecution) combines concrete and symbolic
execution [65]. Concolic execution is also known as dynamic symbolic execution,
because it is carried out while the program is running on a particular concrete
input. Dynamic execution allows any imprecision in symbolic execution to be alle-
viated using concrete values: whenever symbolic execution does not know how to
generate a symbolic expression for a runtime value, it can simplify the expression
by using the concrete runtime value [46].
Concolic execution records how the input affects the control flow in the pro-
gram. The result of concolic execution is a path constraint that is a logic formula
parameters: Program P, concrete input
result     : Path constraint pc
12 procedure concolicExecution(P, input):
13     path constraint pc := true
14     foreach instruction inst executed by P with input do
15         update the symbolic store
16         switch inst do
17             case input-dependent conditional statement
18                 c := expression for the executed branch
19                 pc := pc ∧ c
20             otherwise
21                 if inst reads byte from input then
22                     mark input byte as symbolic variable
23     return pc
that is satisfied by the currently executed concrete input and any other concrete in-
put that will drive the program's execution along the same control path. Symbolic
variables in the path constraint refer to bytes in the program's input. The path
constraint represents the equivalence class of all inputs that drive the execution of
the program along the same control path as the original input. By memoizing the
path constraints (or by using generation search [48]), concolic testing can guaran-
tee executing at most one input from each equivalence class.
The algorithm keeps a symbolic store which is a map from concrete runtime val-
ues to symbolic expressions over symbolic variables. The algorithm updates the
symbolic store whenever the program manipulates input data (line 15). At every
conditional statement that involves symbolic expressions, the algorithm extends
the current path constraint pc with an additional conjunct c that represents the
branch of the conditional statement taken in the current execution (line 19). At
every instruction that reads a byte from the input, a fresh symbolic variable is as-
sociated with the input byte (line 22).
Implementation strategies for concolic execution include source rewriting [18,
27,46,101,102], binary rewriting [84], and modifying the execution environment [2,
16, 48,60, 110]. Concolic execution tools exist for C [18,27,46, 102], Java [60, 101],
.NET [110], and PHP [2,116].
Execution-Path Enumeration
The procedure createNewInputs generates new inputs from the path constraint:
for every prefix of the path constraint, it negates the last conjunct in the prefix.
Each resulting alternative path constraint describes executions that follow the original
control path up to the conditional statement corresponding to the last constraint in that
prefix, and then take the opposite branch (assuming concolic execution and constraint
solving have perfect precision, otherwise the actual execution may diverge from this
path). The algorithm calls the constraint solver to find a concrete input that satisfies
the alternative path constraint (line 29). If the constraint solver can find such a value
(line 30), this new test input is generated (line 31).
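The following Python sketch illustrates this prefix-negation scheme; negate and solve are
hypothetical stand-ins for constraint negation and for the underlying constraint solver.

def create_new_inputs(pc, negate, solve):
    # For each prefix of the path constraint, negate its last conjunct and
    # ask the solver for a concrete input that satisfies the result (sketch).
    new_inputs = []
    for i in range(len(pc)):
        alternative = pc[:i] + [negate(pc[i])]    # same prefix, opposite branch at position i
        inp = solve(alternative)                  # line 29: query the constraint solver
        if inp is not None:                       # line 30: satisfiable?
            new_inputs.append(inp)                # line 31: record the new test input
    return new_inputs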
Example
We illustrate concolic testing on an example program (from Godefroid et al. [48]).
The program takes a 4-character array as input and has an error that can be ex-
posed by only one input, bad! (for readability, we print the character array as
a string). Finding this error without examining the implementation is hard -
assuming 8-bit characters, only 1 input in 2^(8*4) (more than 4 billion) exposes the
problem. Thus, randomly sampling the input space is unlikely to lead to quickly
exposing the error. Concolic testing is implementation-based and finds the error-
causing input very quickly, in a few program executions.
Concolic testing starts with an arbitrary seed input, and executes that seed in-
put to collect constraints and create additional inputs. Then, concolic testing exe-
cutes the program repeatedly with newly generated inputs. The process continues
until all execution paths are explored, or until the time limit is reached.
Execution 1. Concolic testing starts with an arbitrary input, e.g., good, and per-
forms concolic execution of the program on the input. Concolic execution marks
each character in the input as symbolic. We denote the symbolic marks i0, ..., i3
(corresponding to input[0], ..., input[3]). When the execution reaches the first
if statement (line 3), the path constraint (initially set to true) gets extended with
the conjunct i0 ≠ b. This expression corresponds to the branch that is taken in the
execution (in this case, the else branch of the conditional because the concrete
value of i0 is g). When the program execution finishes, the path constraint is:
1  void main(char input[4]) {
2    int count = 0;
3    if (input[0] == 'b')
4      count++;
5    if (input[1] == 'a')
6      count++;
7    if (input[2] == 'd')
8      count++;
9    if (input[3] == '!')
10     count++;
11   if (count >= 3)
12     abort(); // error
13 }
(i0 ≠ b) ∧ (i1 ≠ a) ∧ (i2 ≠ d) ∧ (i3 ≠ !). Concolic testing generates four new
inputs by systematically negating conjuncts in the path constraint and solving the
resulting constraint.

pc1  (i0 ≠ b) ∧ (i1 ≠ a) ∧ (i2 ≠ d) ∧ (i3 = !)  →  goo!
pc2  (i0 ≠ b) ∧ (i1 ≠ a) ∧ (i2 = d)             →  godd
pc3  (i0 ≠ b) ∧ (i1 = a)                        →  gaod
pc4  (i0 = b)                                   →  bood
For simplicity of exposition, we assume that all generated inputs are of the
same length as the original input, and that the constraint solver prefers the con-
crete values from the original input to other concrete values. For example, there
are many solutions to (io # b) A (il = a), and we assume that the solver chooses
gaod. This assumption does not change the technique. In fact, this assumption is
commonly used in the implementation of concolic testing [16,48,60].
Each newly generated input explores a unique execution path. Each path shares
a prefix with the original path, up to the conditional that corresponds to the negated
conjunct, and then the new paths diverge. For example, the original input good ex-
ecutes the path (numbers indicate line numbers in the example program): 2,3,5,7,9,11.
Input gaod executes the path 2,3,5,6,7,9,11. The two paths agree on the prefix 2,3,5
and then diverge.
Execution 2. Concolic testing executes one of the generated inputs, e.g., bood.
Concolic execution of this input results in path constraint: (i0 = b) ∧ (i1 ≠ a) ∧
(i2 ≠ d) ∧ (i3 ≠ !). By systematically negating conjuncts in this path constraint and
solving the resulting formulas, concolic testing creates three additional inputs.

pc5  (i0 = b) ∧ (i1 ≠ a) ∧ (i2 ≠ d) ∧ (i3 = !)  →  boo!
pc6  (i0 = b) ∧ (i1 ≠ a) ∧ (i2 = d)             →  bodd
pc7  (i0 = b) ∧ (i1 = a)                        →  baod
Execution n ≤ 16. After a few more iterations, concolic testing generates the
error-exposing input bad!. The exact number of iterations depends on the algo-
rithm for selecting the next input to execute but, in this example, the number is not
larger than 16, i.e., the number of feasible execution paths in the program. This is
much fewer than the 2^32 possible inputs.
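To tie the pieces together, the following Python sketch replays Execution 1: it represents the
path constraint of the seed input good as (index, character, is-equal) conjuncts and generates
the four new inputs, using a toy solver that, as assumed above, prefers the concrete characters
of the original input. Both the constraint representation and the solver are illustrative only.

def solve_preferring(original, constraint):
    # Toy solver: satisfy each conjunct, keeping the original character whenever allowed.
    chars = list(original)
    for index, ch, must_equal in constraint:
        if must_equal:
            chars[index] = ch                            # forced equality
        elif chars[index] == ch:
            chars[index] = 'o' if ch != 'o' else 'x'     # any other character works
    return "".join(chars)

def new_inputs(original, pc):
    # Negate the last conjunct of every prefix of the path constraint.
    inputs = []
    for i in range(len(pc)):
        index, ch, must_equal = pc[i]
        alternative = pc[:i] + [(index, ch, not must_equal)]
        inputs.append(solve_preferring(original, alternative))
    return inputs

# Path constraint of "good": i0 != 'b', i1 != 'a', i2 != 'd', i3 != '!'
pc = [(0, 'b', False), (1, 'a', False), (2, 'd', False), (3, '!', False)]
print(new_inputs("good", pc))   # ['bood', 'gaod', 'godd', 'goo!'] (pc4, pc3, pc2, pc1 above)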
Chapter 3
Hampi: A Solver for String Constraints
Many automatic testing [18, 46, 102], analysis [50, 118], and verification [22, 23,
59] techniques for programs can be effectively reduced to a constraint-generation
phase followed by a constraint-solving phase. This separation of concerns often
leads to more effective and maintainable tools. Such an approach to analyzing
programs is becoming more effective as the efficiency of off-the-shelf constraint
solvers for Boolean SAT [35,86] and other theories [33,42] continues to increase.
Most of these solvers are aimed at propositional logic, linear arithmetic, or the
theory of bit-vectors.
Many programs, such as Web applications, take string values as input, manip-
ulate them, and then use them in sensitive operations such as database queries.
Analyses of string-manipulating programs in techniques for automatic testing [12,
36,45], verifying correctness of program output [103], and finding security faults [41,
116] produce string constraints, which are then solved by custom string solvers
written by the authors of these analyses. Writing a custom solver for every appli-
cation is time-consuming and error-prone, and the lack of separation of concerns
may lead to systems that are difficult to maintain. Thus, there is a clear need for an
effective and sufficiently expressive off-the-shelf string-constraint solver that can
be easily integrated into a variety of applications.
We designed and implemented HAMPI, a solver for constraints over fixed-size
string variables. HAMPI constraints express membership in regular and fixed-
size context-free languages.1 HAMPI constraints may contain a fixed-size string
variable, context-free language definitions, regular-language definitions and op-
erations, and language-membership predicates. Given a set of constraints over a
string variable, HAMPI outputs a string that satisfies all the constraints, or reports
that the constraints are unsatisfiable. HAMPI is designed to be used as a compo-
nent in testing, analysis, and verification applications. HAMPI can also be used
1All fixed-size languages are finite, and every finite language is regular. Hence, it would suffice
to say that HAMPI supports only fixed-size regular languages. However, it is important to empha-
size the ease-of-use that HAMPI provides by allowing users to specify context-free languages.
to solve the intersection, containment, and equivalence problems for regular and
fixed-size context-free languages.
A key feature of HAMPI is fixed-sizing of regular and context-free languages.
Fixed-sizing makes HAMPI different from custom string-constraint solvers used in
many testing and analysis tools [36]. As we demonstrate, for many practical appli-
cations, fixed-sizing the input languages is not a handicap. In fact, it allows for a
more expressive input language that allows operations on context-free languages
that would be undecidable without fixed-sizing. Furthermore, fixed-sizing makes
the satisfiability problem solved by HAMPI tractable. This difference is similar to
that between model-checking and bounded model-checking [11], and it has been
previously explored in automated decision procedures for integer arithmetic [24]
and relational logic [58].
HAMPI's input language can encode queries that help identify SQL injection
attacks, such as: "Find a string v of size 12 characters, such that the SQL query
SELECT msg FROM messages WHERE topicid=v is a syntactically valid SQL
statement, and that the query contains the substring OR 1=1" (where OR 1=1 is
a common tautology that can lead to SQL injection attacks). HAMPI finds a string
value that satisfies the constraints, or answers that no satisfying value exists (for
the above example, string 1 OR 1=1 is a solution).
HAMPI works in four steps. First, HAMPI normalizes the input constraints and gen-
erates what we refer to as the core string constraints. The core string constraints
are expressions of the form v ∈ R or v ∉ R, where v is a fixed-size string variable,
and R is a regular expression. Second, HAMPI translates these core string constraints
into a quantifier-free logic of bit-vectors. A bit-vector is a fixed-size, ordered list of
bits. The fragment of bit-vector logic that HAMPI uses contains standard Boolean
operations, extracting sub-vectors, and comparing bit-vectors. Third, HAMPI hands
the bit-vector constraints to STP [42], a constraint solver for bit-vectors and arrays.
Fourth, if STP reports that the constraints are unsatisfiable, then HAMPI reports
the same. Otherwise, STP reports that the input constraints are satisfiable and
generates a satisfying assignment in its bit-vector language, which HAMPI decodes
to output a string solution.
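The four steps can be viewed as a small pipeline. The Python sketch below shows only the
control flow; the helper functions passed in are hypothetical stand-ins for HAMPI's normalizer,
bit-vector encoder, STP invocation, and decoder.

def hampi_solve(constraints, normalize, encode_to_bitvector, run_stp, decode):
    # Sketch of the four-step solving pipeline described above.
    core = normalize(constraints)             # step 1: core constraints v in R / v not in R
    bv_formula = encode_to_bitvector(core)    # step 2: quantifier-free bit-vector logic
    assignment = run_stp(bv_formula)          # step 3: hand the formula to STP
    if assignment is None:                    # step 4: report unsatisfiable,
        return None                           #         or decode a satisfying string
    return decode(assignment)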
Experimental Evaluation
Figure 3-1: Fragment of a PHP program that displays messages stored in a MySQL
database. This program is vulnerable to an SQL injection attack. Section 3.1 dis-
cusses the vulnerability.
This program is vulnerable to an SQL injection attack. An attacker can read all
messages from the database (including ones intended to be private) by crafting a
malicious URL such as
http://www.mysite.com/?topicid=1' OR '1'='1
Upon being invoked with that URL, the program reads
1' OR '1'='1
as the value of the $mytopicid variable, and submits the following query to the
database in line 4:
SELECT msg FROM messages WHERE topicid='1' OR '1'='1'
The WHERE condition is always true because it contains the tautology '1'='1'.
Thus, the query retrieves all messages, possibly leaking private information.
A programmer or an automated tool might ask, "Can an attacker exploit the
topicid parameter and introduce a tautology into the query at line 4 in the code
of Figure 3-1?" The HAMPI solver answers such questions, and creates strings that
can be used as exploits.
HAMPI constraints can formalize the above question (Figure 3-2). Automated
vulnerability-scanning tools [64,116] can create HAMPI constraints via either static
or dynamic program analysis (we demonstrate both static and dynamic techniques
in our evaluation in Section 3.3.1 and in Chapter 5). Specifically, a tool could create
the HAMPI input of Figure 3-2 from analyzing the code of Figure 3-1.
We now discuss various features of the HAMPI input language that Figure 3-2
illustrates. (Section 3.2.1 describes the input language in more detail.)
* Keyword var (line 2) introduces a string variable v. The variable has a fixed
size of 12 characters. The goal of the HAMPI solver is to find a string that,
when assigned to the string variable, satisfies all the constraints. HAMPI can
look for solutions of any fixed size; we chose 12 for this example.
15 //constraint conjuncts
16 assert q in SqlSmall;
Figure 3-2: The HAMPI input that finds an attack vector that exploits the SQL in-
jection vulnerability from Figure 3-1.
Figure 3-3: Schematic view of the HAMPI string solver. Section 3.2 describes the
HAMPI solver.
A HAMPI input must declare a single string variable and specify the variable's size
in number of characters. If the input constraints are satisfiable, then HAMPI finds
a value for the variable that satisfies all constraints. Line 2 in Figure 3-2 declares a
variable v of size 12 characters.
Sometimes, an application of a string-constraint solver requires examining strings
up to a given length. Users of HAMPI can deal with this issue in two ways: (i) re-
peatedly run HAMPI for different fixed sizes of the variable (can be fast due to the
optimizations of Section 3.2.6), or (ii) adjust the constraint to allow "padding" of
the variable (e.g., using Kleene star to denote trailing spaces). It would be straight-
forward to extend HAMPI to permit specifying a size range, using syntax such as
var v:1..12.
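For the first option, a driver can simply re-run the solver for each candidate size, as in the
Python sketch below; build_constraints and run_hampi are hypothetical wrappers around
constraint generation and the solver invocation, not part of HAMPI itself.

def solve_for_some_size(build_constraints, max_size, run_hampi):
    # Try fixed sizes 1..max_size until one is satisfiable (sketch).
    # build_constraints(size) returns constraint text declaring a variable of that size;
    # run_hampi returns a solution string or None.
    for size in range(1, max_size + 1):
        solution = run_hampi(build_constraints(size))
        if solution is not None:
            return size, solution
    return None                               # unsatisfiable for every size up to max_size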
Input    ::= Var Stmt*                              HAMPI input
Stmt     ::= Cfg | Reg | Val | Assert               statement
Var      ::= var Id : Int                           string variable
Cfg      ::= cfg Id := CfgProdRHS                   context-free lang.
Reg      ::= reg Id := RegElem                      regular-lang.
RegElem  ::= StrConst                               constant
          |  Id                                     var. reference
          |  fixsize( Id , Int )                    CFG fixed-sizing
          |  or( RegElem* )                         union
          |  concat( RegElem* )                     concatenation
          |  star( RegElem )                        Kleene star
Val      ::= val Id := ValElem                      temp. variable
ValElem  ::= Id | StrConst | concat( ValElem* )
Assert   ::= assert Id [not]? in Id                 membership
          |  assert Id [not]? contains StrConst     substring
Figure 3-4: Summary of HAMPI's input language. Terminals are bold-faced,
nonterminals are italicized. A HAMPI input (Input) is a variable declaration, fol-
lowed by a list of statements: context-free-grammar declarations, regular-language
declarations, temporary variables, and assertions. Some nonterminals are omitted
for readability.
HAMPI input can declare context-free languages using grammars in the standard
notation, Extended Backus-Naur Form (EBNF). Terminals are enclosed in double
quotes (e.g., "SELECT "), and productions are separated by the vertical bar symbol
(|). Grammars may contain special symbols for repetition (+ and *) and character
ranges (e.g., [a-z]).
For example, lines 5-10 in Figure 3-2 show the declaration of a context-free
grammar for a subset of SQL.
HAMPI's format of context-free grammars is as expressive as that of widely-
used tools such as Yacc/Lex; in fact, we have written a simple syntax-driven script
that transforms a Yacc specification to HAMPI format (available on the HAMPI web-
site).
HAMPI input can declare regular languages. The following regular expressions
define regular languages: (i) a singleton set with a string constant, (ii) a concate-
nation/union of regular languages, (iii) a repetition (Kleene star) of a regular lan-
guage, (iv) a fixed-sizing of a context-free language. Every regular language can
be expressed using the first three of those operations [105].
For example, (b*ab*ab*)* is a regular expression that describes the language
of strings over the alphabet {a, b}, with an even number of a symbols. In HAMPI
syntax this is:
reg Bstar := star("b"); // 'helper' expression
reg EvenA := star(concat(Bstar, "a", Bstar, "a", Bstar));
HAMPI allows construction of regular languages by fixed-sizing context free
languages. The set of all strings of a given size from a context-free language is
regular (because every finite language is regular).
For example, in Figure 3-2, to describe the regular language that consists of all
syntactically correct SQL strings of length 53 (according to the SqlSmall gram-
mar) we would write fixsize(SqlSmall, 53). Using the fixsize operator is
more convenient than writing the regular expression explicitly.
HAMPI can infer the correct size and perform the fixed-sizing automatically.
For example, in line 16 in Figure 3-2, HAMPI automatically infers that 53 is the
correct size for q (sum of the size of the string-variable and the constant strings in
the declaration of q: 12+40+1).
Constraints - assert
HAMPI constraints (declared by the assert keyword) specify membership of vari-
ables in regular languages. For example, line 16 in Figure 3-2 declares that the
string value of the temporary variable q is in the regular language defined by Sql-
SmallFixedSize.
HAMPI uses the following algorithm to create regular expressions that specify the
set of strings of a fixed length that are derivable from a context-free grammar:
1. Expand all special symbols in the grammar (e.g., repetition, option, character
range).
2. Remove ε-productions from the grammar.
3. Construct the regular expression that encodes all fixed-sized strings of the
grammar as follows: First, pre-compute the length of the shortest and longest
(if it exists) string that can be generated from each nonterminal (i.e., lower and
upper bounds). Second, given a size n and a nonterminal N, examine all pro-
ductions for N. For each production N ::= S1 ... Sk, where each Si may be
a terminal or a nonterminal, enumerate all possible partitions of n charac-
ters to the k grammar symbols (HAMPI uses the pre-computed lower and up-
per bounds to make the enumeration more efficient). Then, create the sub-
expressions recursively and combine the subexpressions with a concatena-
tion operator. Memoization of intermediate results (Section 3.2.6) makes this
(worst-case exponential in k) process scalable; a Python sketch of this enumeration
appears after this list.
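As a concrete illustration of step 3, the following Python sketch enumerates the set of strings
of an exact size derivable from a nonterminal, memoizing results per (symbol, size) pair. It is
a simplification of what HAMPI does: it produces the strings themselves rather than a regular
expression, it assumes ε-productions have already been removed (step 2), and it omits the
pre-computed length bounds that prune the partition enumeration.

from itertools import product

# Illustrative grammar encoding: nonterminal -> list of productions, where each
# production is a list of symbols (terminal strings or nonterminal names).
GRAMMAR = {
    "E": [["()"], ["E", "E"], ["(", "E", ")"]],   # well-balanced parentheses (Section 3.2.5)
}

def compositions(n, k):
    # All ordered ways to split n characters into k positive parts.
    if k == 1:
        yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def fixed_size_strings(symbol, n, grammar=GRAMMAR, memo=None):
    # Set of strings of exactly n characters derivable from symbol.
    if memo is None:
        memo = {}
    if symbol not in grammar:                     # terminal symbol
        return {symbol} if len(symbol) == n else set()
    key = (symbol, n)
    if key not in memo:
        result = set()
        for production in grammar[symbol]:
            # Partition the n characters over the production's symbols, then
            # combine the recursively computed sub-results by concatenation.
            for sizes in compositions(n, len(production)):
                parts = [fixed_size_strings(s, m, grammar, memo)
                         for s, m in zip(production, sizes)]
                for combo in product(*parts):
                    result.add("".join(combo))
        memo[key] = result
    return memo[key]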
Example of Grammar Fixed-Sizing
Consider the following grammar of well-balanced parentheses and the problem
of finding the regular language that consists of all strings of length 6 that can be
generated from the nonterminal E.
cfg E := "()" | E E | "(" E ")" ;
The grammar does not contain special symbols or ε-productions, so the first two
steps of the algorithm do nothing. Then, HAMPI determines that the shortest string
E can generate is of length 2. There are three productions for the nonterminal E,
so the final regular expression is a union of three parts. The first production, E
:= "()", generates no strings of size 6 (and only one string of size 2). The sec-
ond production, E := E E, generates strings of size 6 in two ways: either the
first occurrence of E generates 2 characters and the second occurrence generates 4
characters, or the first occurrence generates 4 characters and the second occurrence
generates 2 characters. From the pre-processing step, HAMPI knows that the only
other possible partition of 6 characters is 3-3, which HAMPI tries and fails (because
E cannot generate 3-character strings). The third production, E := "(" E ")",
generates strings of size 6 in only one way: the nonterminal E must generate 4
characters. In each case, HAMPI creates the sub-expressions recursively. The resulting
regular expression for this example is (plus signs denote union and square
brackets group sub-expressions):

()[()() + (())] + [()() + (())]() + ([()() + (())])
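For illustration, running the enumeration sketch shown after the algorithm above on this
grammar with size 6 yields exactly the five strings that this regular expression denotes
(the snippet assumes the definitions from that sketch):

print(sorted(fixed_size_strings("E", 6)))
# ['((()))', '(()())', '(())()', '()(())', '()()()']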
After STP finds a solution to the bit-vector formula (if one exists), HAMPI de-
codes the solution by reading 8-bit sub-vectors as consecutive ASCII characters.
3.2.4 Complexity
The satisfiability problem for HAMPI's logic (core string constraints) is NP-complete.
To show NP-hardness, we reduce the 3-CNF (conjunctive normal form) Boolean
satisfiability problem to the satisfiability problem of the core string constraints in
HAMPI's logic. Consider an arbitrary 3-CNF formula with n Boolean variables
and m clauses. A clause in 3-CNF is a disjunction (v) of three literals. A literal
is a Boolean variable (vi) or its negation (¬vi). For every 3-CNF clause, a HAMPI
constraint can be generated. Let Σ = {T, F} denote the alphabet. For the clause
(v0 ∨ v1 ∨ ¬v2), the equivalent HAMPI constraint is

V ∈ [ TΣΣ···Σ + ΣTΣ···Σ + ΣΣFΣ···Σ ]

where the HAMPI variable V is an n-character string representing the possible as-
signments to all n Boolean variables satisfying the input 3-CNF formula. Each of
the HAMPI regular-expression disjuncts in the core string constraint shown above,
such as TΣΣ···Σ, is also of size n and has a T in the ith slot for vi (and F for ¬vi),
i.e., it has the form Σ···Σ T Σ···Σ with i−1 symbols Σ before the T and n−i after it.
The total number of such HAMPI constraints is m, the number of clauses in the
input 3-CNF formula (here m = 1). This reduction from a 3-CNF Boolean formula
into HAMPI is clearly polynomial-time.
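To make the reduction concrete, here is a small Python sketch that builds the regular
expression described above for a clause over the alphabet {T, F}, and that checks a candidate
assignment string against all clause expressions (the kind of polynomial-time check used in
the NP-membership argument that follows). The function names are illustrative, not part of HAMPI.

import re

def clause_to_regex(clause, n):
    # clause: list of (variable_index, is_positive) literals over variables 0..n-1.
    # Returns a regex matching every n-character T/F string satisfying the clause.
    disjuncts = []
    for i, positive in clause:
        ch = "T" if positive else "F"
        disjuncts.append("[TF]" * i + ch + "[TF]" * (n - i - 1))
    return "|".join(disjuncts)

def satisfies(assignment, clauses, n):
    # Polynomial-time check of a witness: one regex membership test per clause.
    return all(re.fullmatch(clause_to_regex(c, n), assignment) for c in clauses)

# Example: the clause (v0 or v1 or not v2) over n = 4 variables.
clauses = [[(0, True), (1, True), (2, False)]]
print(clause_to_regex(clauses[0], 4))  # T[TF][TF][TF]|[TF]T[TF][TF]|[TF][TF]F[TF]
print(satisfies("FFFT", clauses, 4))   # True: the third literal (not v2) is satisfied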
To establish that the satisfiability problem for HAMPI's logic is in NP, we only
need to show that for any set of core string constraints, there exists a polynomial-
time verifier that can check any short witness. The size of a set of core string con-
straints is the size k of the string variable plus the sum r of the sizes of regular
expressions. A witness has to be of size k, and it is easy to check, in time polyno-
mial in k + r, whether the witness belongs to each regular language.
var v:2;
cfg E := "()" | E E | "(" E ")" ;
val q := concat( "((" , v , "))" );
assert q in E;            // turns into constraint c1
assert q contains "())";  // turns into constraint c2
HAMPI follows the solving algorithm outlined in Section 3.2 (The alphabet of the
regular expression or context-free grammar in a HAMPI input is implicitly re-
stricted to the terminals specified):
step 1. Normalize constraints to core form, using the algorithm in Section 3.2.2. In
this example, the size inference required to normalize the first assert gives size 6
for q (computed as 2+2+2 from the declarations of v and q). The core constraints
are:

c1: ((v)) ∈ ()[()() + (())] + [()() + (())]() + ([()() + (())])
c2: ((v)) ∈ [( + )]* ()) [( + )]*
step 2. Encode the core-form constraints in bit-vector logic. We show how HAMPI
encodes constraint c1; the process for c2 is similar. HAMPI creates a bit-vector
variable bv of length 6*8=48 bits to represent the left-hand side of c1 (which is
6 bytes long). Characters are encoded using ASCII codes: '(' is 40 in ASCII, and
')' is 41. HAMPI encodes the left-hand-side expression of c1, ((v)), as formula
L1, by specifying the constant values: L1: (bv[0] = 40) ∧ (bv[1] = 40) ∧ (bv[4] =
41) ∧ (bv[5] = 41). Bytes bv[2] and bv[3] are reserved for v, a 2-byte variable.
The top-level regular expression in the right-hand side of c1 is a 3-way
union, so the result of the encoding is a 3-way disjunction. For the first dis-
junct, ()[()() + (())], HAMPI creates the following formula: D1a: (bv[0] =
40) ∧ (bv[1] = 41) ∧ ((bv[2] = 40 ∧ bv[3] = 41 ∧ bv[4] = 40 ∧ bv[5] = 41) ∨ (bv[2] =
40 ∧ bv[3] = 40 ∧ bv[4] = 41 ∧ bv[5] = 41)).
Formulas D1b and D1c for the remaining disjuncts are similar. The bit-vector
formula that encodes c1 is C1 = L1 ∧ (D1a ∨ D1b ∨ D1c). Similarly, a formula C2 (not shown
here) encodes c2. The formula that HAMPI sends to the STP solver is (C1 ∧ C2).
step 3. STP finds a solution that satisfies the formula: bv[0] = 40, bv[1] = 40, bv[2] =
41, bv[3] = 40, bv[4] = 41, bv[5] = 41. In decoded ASCII, the solution is "(()())"
(quote marks not part of solution string).
step 4. HAMPI reads the assignment for variable v off the STP solution, by de-
coding the elements of bv that correspond to v, i.e., elements 2 and 3. It reports the
solution for v as ")(". (String "()" is another legal solution for v, but STP only
finds one solution.)
3.2.6 Optimizations
Optimizations in HAMPI aim at reducing computation time.
Memoization
HAMPI stores and reuses partial results during the computation of fixed-sizing of
context-free grammars (Section 3.2.2) and during the encoding of core constraints
in bit-vector logic (Section 3.2.3).
Example. Consider the example from Section 3.2.5, i.e., fixed-sizing the context-
free grammar of well-balanced parentheses to size 6.
cfg E := "()" | E E | "(" E ")" ;
Consider the second production E := E E. There are two ways to construct a
string of 6 characters: either construct 2 characters from the first occurrence of E
and construct 4 characters from the second occurrence, or vice-versa. After creat-
ing the regular expression that corresponds to the first of these ways, HAMPI cre-
ates the second expression from the memoized sub-results. HAMPI's implementa-
tion shares the memory representations of common subexpressions. For example,
HAMPI uses only one object to represent all three occurrences of ()() + (()) in
constraint c1 of the example in Section 3.2.5.
Constraint Templates
Constraint templates capture common encoded sub-expressions, modulo offset
in the bit-vector. This optimization is related to the template mechanism pro-
posed by Shlyakhter et al. [104], and to the sharing detection in the KodKod model
finder [111]. During the bit-vector encoding step (Section 3.2.3), HAMPI may en-
code the same regular expression multiple times as bit-vector formulas, as long
as the underlying offsets in the bit-vector are different. For example, the (con-
stant) regular expression )( may be encoded as (bv[0] = 41) ∧ (bv[1] = 40) or as
(bv[3] = 41) ∧ (bv[4] = 40), depending on the offset in the bit-vector (0 and 3, re-
spectively).
HAMPI creates a single "template", parameterized by the offset, for the encoded
expression, and instantiates the template every time, with appropriate offsets. For
the example above, the template is T(p) = (bv[p] = 41) ∧ (bv[p + 1] = 40), where p is
the offset parameter. HAMPI then instantiates the template to T(0) and T(3).
As another example, consider c1 in Section 3.2.5: The subexpression ()() + (())
occurs 3 times in c1, each time with a different offset (2 for the first occurrence, 0
for the second, and 1 for the third). The constraint-template optimization enables
HAMPI to do the encoding once and reuse the results, with appropriate offsets.
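A minimal sketch of the idea in Python, using a closure: a template is built once for an
encoded constant and instantiated at each offset. The (index, value) pairs stand for
byte-equality conjuncts and are illustrative only.

def make_template(byte_values):
    # Build a reusable template for a constant string's bit-vector encoding.
    # byte_values: ASCII codes of the constant, e.g. [41, 40] for ")(".
    def instantiate(offset):
        # One (index, value) equality per byte, shifted by the offset parameter.
        return [(offset + i, value) for i, value in enumerate(byte_values)]
    return instantiate

T = make_template([41, 40])   # template for the constant ")("
print(T(0))                   # [(0, 41), (1, 40)]  i.e. (bv[0] = 41) and (bv[1] = 40)
print(T(3))                   # [(3, 41), (4, 40)]  i.e. (bv[3] = 41) and (bv[4] = 40)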
Server Mode
The server mode improves HAMPI's efficiency on simple constraints and on re-
peated calls. Because HAMPI is a Java program, the startup time of the Java virtual
machine may be a significant overhead when solving small constraints. Therefore,
we added a server mode to HAMPI, in which the (constantly running) solver ac-
cepts inputs passed over a network socket, and returns the results over the same
socket. This enables HAMPI to be efficient over repeated calls, for tasks such as
solving the same constraints on string variables of different sizes.
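The sketch below shows the general shape of such a server mode in Python: a long-running
process that reads constraint text from a socket and writes back the result. The solve
function, the port, and the plain-text protocol are placeholders, not HAMPI's actual interface.

import socket

def serve(solve, port=4242):
    # Keep the solver resident and answer requests over a socket (sketch).
    with socket.socket() as server:
        server.bind(("localhost", port))
        server.listen()
        while True:
            conn, _ = server.accept()
            with conn:
                request = conn.recv(65536).decode()   # constraint text from the client
                answer = solve(request) or "UNSAT"    # reuse the already-warm solver
                conn.sendall(answer.encode())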
3.3 Evaluation
We experimentally evaluated HAMPI's applicability to practical problems involving
string constraints, and compared HAMPI's performance and scalability to another
string-constraint solver.
Experiments
Results. We found empirically that when a solution exists, it can be very short.
In 306 of the 1,367 inputs, the intersection was not empty (both solvers produced
identical results). Out of the 306 inputs with non-empty intersections, we mea-
sured the percentage for which HAMPI found a solution (for increasing values of
N): 2% for N = 1, 70% for N = 2, 88% for N = 3, and 100% for N = 4. That is, in
this large dataset, all non-empty intersections contain strings no longer than 4
characters. Due to false positives inherent in Wassermann and Su's static analysis,
the strings generated from the intersection do not necessarily constitute real attack
vectors. However, this is a limitation of the static analysis, not of HAMPI.
HAMPI solves most queries quickly. Figure 3-7 shows the percentage of inputs
that HAMPI can solve in the given time, for 1 ≤ N ≤ 4, i.e., until all satisfying
assignments are found. For N = 4, HAMPI can solve 99.7% of inputs within 1
second.
We measured how HAMPI's solving time depends on the size of the grammar.
We measured the size of the grammar as the sum of lengths of all productions
(we counted ε-productions as of length 1). Among the 1,367 grammars in the
dataset, the mean size was 5490.5, standard deviation 4313.3, minimum 44, max-
imum 37955. We ran HAMPI for N = 4, i.e., the length at which all satisfying
assignments were found. Figure 3-8 shows the solving time as a function of the
grammar size, for all 1,367 inputs.
Our experimental results, obtained on a large dataset from a powerful static
analysis and real Web applications, indicate that HAMPI's fixed-size solving algo-
rithm is applicable to real problems.
[Figure 3-7: percentage of inputs solved within a given time (sec.), shown for string sizes 1 to 4.]
Figure 3-8: HAMPI solving time as function of grammar size (number of all
elements in all productions), on 1,367 inputs from the Wassermann and Su
dataset [114]. The size of the string variable was 4, the smallest at which HAM-
PI finds all satisfying assignments for the dataset. Each point represents an input;
shapes indicate SAT/UNSAT. Section 3.3.1 describes the experiment.
3.3.2 Comparing Performance to CFGAnalyzer
We evaluated HAMPI's utility in analyzing context-free grammars, and compared
HAMPI'S performance to a specialized decision procedure, CFGAnalyzer [4]. CF-
GAnalyzer is a SAT-based decision procedure for bounded versions of 6 problems
(5 undecidable) that involve context-free grammars: universality, inclusion, inter-
section, equivalence, ambiguity, and emptiness (decidable). We downloaded the
latest available version 2 (released 3 December 2007) and configured the program
according to the manual.
[Figure 3-9: average solving time (sec.) vs. string size (characters), comparing CFGAnalyzer and Hampi.]
Results. HAMPI is faster than CFGAnalyzer for all sizes larger than 4 characters.
Figure 3-9 shows the results averaged over all pairs of grammars. Importantly,
2
http://www.tcs.ifi.lmu.de/~mlange/cfganalyzer
HAMPI's advantage grows as the size of the problem increases (up to 6.8x at size 50). For
the largest problems (N = 50), HAMPI was faster (by up to 3000x) on 99 of the 100
grammar pairs, and 1.3x slower on the remaining 1 pair of grammars. The HAMPI
website contains all experimental data and detailed results.
HAMPI is also faster on grammar-membership constraints. We performed an
additional experiment: searching for any string of a given length from a context-
free grammar. The results were similar to those for intersection: e.g., HAMPI finds
a string of size 50, on average, in 1.5 seconds, while CFGAnalyzer finds one in 8.7
seconds (5.8x difference). The HAMPI website contains all experimental data and
detailed results.
MONA [66] uses finite-state automata and tree automata to reason about sets of
strings. However, the user still has to translate their input problem into MONA's
input language (weak monadic second-order theory of one successor). MONA also
provides automata-based tools, similar to other libraries [38-40]. Fido [66, 67] is
a higher-level formalism implemented on top of the MONA solver to concisely
represent regular sets of strings and trees.
Word equations [12,96] describe equality between two strings that contain string
variables. Rajasekar [96] proposes a logic programming approach that includes
constraints on individual words. His solver handles concatenation but not regu-
lar language membership. Bjorner et al. [12] describe a constraint solver for word
queries over a variety of operations, and translate string constraints to the lan-
guage of the Z3 solver [33]. If there is a solution, Z3 returns a finite bound for the
set of strings, which is then explored symbolically. Alonso et al. [1] describe a solver
for word equations, based on genetic algorithms. However, unlike HAMPI, these
tools do not support context-free grammars directly.
Hooimeijer and Weimer [54] describe a decision procedure for regular language
constraints, focusing on generating sets of satisfying assignments rather than in-
dividual strings. Unlike HAMPI, the associated implementation does not easily
allow users to express fixed-size context-free grammars.
Custom String Solvers
Many analyses use custom solvers for string constraints [7,15,20,36,41,45,83,114-
116]. All of these approaches include some implementation for language inter-
section and language inclusion; most, similarly to HAMPI, can perform regular-
language intersection. Each of these implementations is tightly integrated with
the associated program analysis, making a direct comparison with HAMPI imprac-
tical.
Christensen et al. [20] have a static analysis tool to check for SQL injection vul-
nerabilities that uses automata-based techniques to represent over-approximation
of string values. Fu et al. [41] also use an automata-based method to solve string
constraints. Ruan et al. [97] use a first-order encoding of string functions occurring
in C programs, and solve the constraints using a linear arithmetic solver.
Besides the custom solvers by Wassermann et al. [114], the solver by Emmi
et al. [36] is closest to HAMPI. Emmi et al. used their solver for automatic test
case generation for database applications. Unlike HAMPI, their solver allows con-
straints over unbounded regular languages and linear arithmetic, but does not
support context-free grammars.
Many of the program analyses listed here perform similar tasks when reasoning
about string-valued variables. This is strong evidence that a unified approach, in
the form of an external string-constraint solver such as HAMPI, is warranted.
3.5 Conclusion
We presented HAMPI, a solver for constraints over fixed-size string variables. HAM-
PI constraints express membership in regular and fixed-size context-free languages.
HAMPI constraints may contain a fixed-size string variable, context-free language
definitions, regular-language definitions and operations, and language-membership
predicates. Given a set of constraints over a string variable, HAMPI outputs a string
that satisfies all the constraints, or reports that the constraints are unsatisfiable.
HAMPI works by encoding the constraints in bit-vector logic and solving them with
STP.
HAMPI is designed to be used as a component in testing, analysis, and veri-
fication applications. HAMPI can also be used to solve the intersection, contain-
ment, and equivalence problems for regular and fixed-size context-free languages.
We evaluated HAMPI's usability and effectiveness as a component in static- and
dynamic-analysis tools for PHP Web applications. Our experiments show that
HAMPI is expressive enough to easily encode constraints arising in finding SQL
injection attacks, and in systematic testing of real-world programs. In our experi-
ments, HAMPI was able to find solutions quickly, and scale to practically-relevant
problem sizes.
By using a general-purpose freely-available string-constraint solver such as
HAMPI, builders of analysis and testing tools can save significant development
effort, and improve the effectiveness of their tools.
Chapter 4
Grammar-Based Concolic Testing
By tracking the tokens returned by the lexer (i.e., the function nextToken,
line 3 in Figure 4-1) and considering those as symbolic inputs, our concolic-testing
1 // Lexer: Reads and returns next token from file.
2 // Terminates on erroneous inputs.
3 Token nextToken() {
4   ...
5   readInputByte();
6   ...
7}
8
9 // Parser: Parses the input file, returns parse tree.
10 // Terminates on erroneous inputs.
11 ParseTree parse() {
12 ...
13 Token t = nextToken();
14 ...
15 }
16
17 void main() {
18 ...
19 ParseTree t = parse();
20 ...
21 Bytecode code = codeGen(t);
22 ... // Execute code
23 }
Figure 4-1: Sketch of an interpreter. The interpreter processes the inputs in stages:
lexer (function nextToken), parser (function parse), and code generator (func-
tion codeGen). Next, the interpreter executes the generated bytecode (omitted
here).
algorithm generates constraints in terms of such tokens. For example, running the
interpreter on the valid input
function f(){ }
may correspond to the sequence of symbolic token constraints
token0 = function
token1 = id
token2 = (
token3 = )
token4 = {
token5 = }
Negating the fourth constraint in this path constraint leads to the new sequence of
constraints:
token0 = function
token1 = id
token2 = (
token3 ≠ )
There are many ways to satisfy this constraint, but most solutions lead to non-
parsable inputs. In contrast, our grammar-based constraint solver can directly
conclude that the only way to satisfy this constraint while generating a valid input
according to the grammar is to set
token3 = id
and to complete the remainder of the input with, for example,
token4 = )
token5 = {
token6 = }
Thus, the generated input that corresponds to this solution is
function f(id){ }
where id can be any identifier.
Similarly, the grammar-based constraint solver can immediately prove that negat-
ing the third constraint in the previous path constraint, thus leading to the new
path constraint
token0 = function
token1 = id
token2 ≠ (
is unsolvable, i.e., there are no inputs that satisfy this constraint and are recognized
by the grammar. Grammar-based concolic testing prunes in one iteration the en-
tire sub-tree of lexer executions corresponding to all possible non-parsable inputs
matching this case.
1
Symbolic variables could also be associated with other values returned by the tokenization
function for specific types of tokens, such as the string value associated with each identifier, the
numerical value associated with each number, etc.
parameters: Program P, seed inputs, grammar G
result     : Error reports for P
1  procedure grammarBasedConcolic(P, inputs, G):
2      errors := ∅
3      worklist := emptyQueue()
4      enqueueAll(worklist, inputs)
5      while not empty(worklist) and not timeExpired() do
6          input := dequeue(worklist)
7          errors := errors ∪ Run&Check(P, input)
8          pc := concolicExecution(P, input)
9          newInputs := createNewInputs(pc, G)
10         enqueueAll(worklist, newInputs)
11     return errors
productions and in the right-hand sides of the original productions for S. The algo-
rithm employs a simple unroll-and-prune approach: in the ith iteration of the main
loop (line 5), the algorithm unrolls the right-hand sides of productions to expose
a 0 ... i prefix of terminals (line 14), and prunes those productions that violate the
constraint ci on the ith token variable tokeni in the regular expression R (line 16).
During each round of unrolling and pruning, the algorithm uses the worklist W to
store productions that have not yet been unrolled and examined for conformance
with the regular expression.
Finally, the algorithm produces a result string. After the unrolling and pruning,
the algorithm checks emptiness [55] of the resulting language L(G') and generates
a string s from the intersection grammar G' (line 20). For speed, our implementa-
tion uses a bottom-up strategy that generates a string with the shortest derivation
tree (i.e., tree of smallest height) for each nonterminal in the grammar, by com-
bining the strings from the right-hand sides of productions for nonterminals. This
strategy is fast due to memoizing strings during generation. Section 4.2.2 discusses
alternatives and limitations of the algorithm in Figure 4-3.
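A minimal Python sketch of this bottom-up strategy: settle, round by round, every nonterminal
whose production uses only terminals or already-settled nonterminals, memoizing the generated
string. The grammar encoding is the same illustrative one used in the earlier sketch; this is
not the tool's implementation.

def shortest_strings(grammar):
    # Map each nonterminal to a string generated with a derivation tree of small height.
    # grammar: nonterminal -> list of productions (lists of terminals/nonterminals).
    known = {}
    changed = True
    while changed:                     # one round per level of derivation-tree height
        changed = False
        for nt, productions in grammar.items():
            if nt in known:
                continue
            for prod in productions:
                # Usable only if every symbol is a terminal or already settled.
                if all(sym in known or sym not in grammar for sym in prod):
                    known[nt] = "".join(known.get(sym, sym) for sym in prod)
                    changed = True
                    break
    return known

# Example: the final intersection grammar from the solving example that follows.
G = {
    "S'": [["(let ((id ", "S'", ")) ", "S'", ")"],
           ["(", "Op", " ", "S'", " ", "S'", ")"],
           ["num"], ["id"]],
    "Op": [["+"], ["-"]],
    "S":  [["(+ ", "(let ((id ", "S'", ")) ", "S'", ")", " ", "S'", ")"]],
}
print(shortest_strings(G)["S"])   # (+ (let ((id num)) num) num)   (modulo whitespace)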
Solving Example
token1 ∈ {(}
token2 ∈ {+}
token3 ∈ {(}
token4 ∈ {(, ), num, id, let}
S' ::= (let ((id S')) S') | (Op S' S') | num | id
Op ::= + | -
S  ::= (let ((id S')) S') | (Op S' S') | num | id
Next, the main iteration begins. The first conjunct in the grammar constraint is
token1 ∈ {(}, therefore the algorithm (line 16) removes the last two productions for
the start symbol S. The result is the following grammar (execution is now back at
the top of the loop in line 5).
S' ::= (let ((id S')) S') | (Op S' S') | num | id
Op ::= + | -
S  ::= (let ((id S')) S') | (Op S' S')
In the next iteration of the for loop, the algorithm examines the second con-
junct in the regular path constraint, token2 ∈ {+}. The algorithm prunes the first
production rule from S since let does not match + (line 16), and then expands the
nonterminal Op in the production S ::= (Op S' S') (line 14). The production is re-
placed by two productions, S ::= (+ S' S') and S ::= (- S' S'), which are added to the
worklist W. The grammar G' is then
S' ::= (let ((id S')) S') | (Op S' S') | num | id
Op ::= + | -
S  ::= (+ S' S') | (- S' S')
In the next iteration of the while loop, the second of the new productions is
removed from the grammar (line 16) because it violates the grammar constraint.
After the removal, the execution is now again at the top of the loop in line 5.
S' ::= (let ((id S')) S') | (Op S' S') | num | id
Op ::= + | -
S  ::= (+ S' S')
After 2 more iterations of the for loop, the algorithm arrives at the final gram-
mar
S' ::= (let ((id S')) S') | (Op S' S') | num | id
Op ::= + | -
S  ::= (+ (let ((id S')) S') S')
As the last two steps, the algorithm checks that L(G') ≠ ∅ (line 17) and generates
a string s from the final grammar G' for the intersection of G and R (line 20). Our
bottom-up strategy generates the string (+ (let (( id num )) num) num). From this
string of tokens, our tool generates a matching string of input bytes by applying
an application-specific de-tokenization function.
Domain Knowledge
Using a grammar to filter out invalid inputs may reduce code coverage in the lexer
and parser themselves, since the grammar explicitly prevents the execution of code
paths handling invalid inputs in those stages. For testing those stages, traditional
concolic testing can be used. Moreover, our experiments (Sections 4.4 and 4.5) indicate
that grammar-based concolic testing does not decrease coverage in the lexer or
parser.
The grammar-based concolic testing approach uses the actual lexer and parser code
of the program under test. Indeed, removing these layers and using automatically
generated software stubs simulating those parts may feed unrealistic inputs to the
rest of the program.
4.3 Grammar-Based Concolic Testing Case Studies
We performed two case studies to evaluate grammar-based concolic testing. For
the case studies, we implemented our technique in two existing concolic-testing
tools, Klee [16] (Section 4.4) and SAGE [48] (Section 4.5). For each case study, we
designed experiments with the following goals:
* Measure whether test inputs generated by grammar-based concolic testing
are effective in exercising long execution paths in the program, i.e., reaching
beyond the lexer and parser.
* Compare grammar-based concolic testing to other approaches, in particular
concolic testing.
The results of our case studies show that grammar-based concolic testing is a
general and powerful technique. In each case study, we were able to successfully
implement grammar-based concolic testing in a tool with which we were initially
unfamiliar (though we did interact with the tools' authors). Moreover, in each case
study, grammar-based concolic testing significantly improved the effectiveness of
the testing tool (in terms of code coverage), for programs with highly-structured
inputs.
Summary of Results
Compared to concolic testing, grammar-based concolic testing led to up to 2x im-
provements in line coverage (up to 5x coverage improvements in parser code),
eliminated all illegal inputs, and enabled discovering 3 distinct, previously un-
known, inputs that led to infinitely-looping program execution. These results
show that using HAMPI can improve the effectiveness of automated test-case gen-
eration and testing tools.
4.4.1 Experiment
We compared the coverage achieved and numbers of legal (and rejected) inputs
generated by running Klee with and without HAMPI string constraints. While
previous studies used custom-made string-constraint solvers, we are able to apply
HAMPI as an "off-the-shelf" solver for Klee without modifying Klee at all.
Klee provides an Application Programming Interface (API) for target programs
to mark inputs as symbolic and to place constraints on them. The code snippet
below uses klee_assert to impose the constraint that all elements of buf must
be numeric before the target program runs:

char buf[10];                  // program input
klee_make_symbolic(buf, 10);   // make all 10 bytes symbolic
for (int i = 0; i < 10; i++)   // constrain each byte to be a numeric digit
  klee_assert(buf[i] >= '0' && buf[i] <= '9');
1. Automatically convert its Yacc specification into HAMPI's input format (de-
scribed in Section 3.2.1), using a script we wrote. To simplify lexical analysis,
we used either a single letter or numeric digit to represent certain tokens,
depending on its Lex specification (this should not reduce coverage in the
parser).
3. Run HAMPI to compile the input grammar file into STP bit-vector constraints
(described in Section 3.2.3).
4. Automatically convert the STP constraints into C code that expresses the
equivalent constraints using C variables and calls to klee_assert(), with
a script we wrote (the script performs only simple syntactic transformations
since STP operators map directly to C operators).
5. Run Klee on the target program using an N-byte input buffer, first marking
that buffer as symbolic, then executing the C code that imposes the input
constraints, and finally executing the program itself.
6. After a 1-hour time-limit expires, collect all generated inputs and run them
through the original program (compiled using gcov) to measure coverage
and legality of each input.
7. As a control, run Klee for 1 hour using an N-byte symbolic input buffer (with
no initial constraints), collect test cases, and run them through the original
program to measure coverage and legality of each input.
Table 4.1: The result of using HAMPI grammars to improve coverage of test cases
generated by the Klee systematic testing tool. ELOC lists Executable Lines of Code,
as counted by gcov over all .c files in the program (whole-project line counts are
several times larger, but much of that code does not directly execute). Each trial
was run for 1 hour. Klee generates a new input only when it covers new lines
that previous inputs have not yet covered (to create minimal test suites); the total
number of explored paths is usually hundreds of times greater than the number of
generated inputs. Rows concolic show results for runs of Klee without a HAMPI
grammar. Rows concolic+grammar show results for runs of Klee with a HAM-
PI grammar. Rows combined show accumulated results for both kinds of runs.
Section 4.4 describes the experiment.
The rest of this section describes our experiments and discusses the results.
Naturally, because they come from a limited sample, these experimental results
need to be taken with caution. However, our evaluation is extensive and per-
formed with a large, widely-used JavaScript interpreter, a representative "real-
world" program.
2
http://www.ecma-international.org
fuzzing generates test inputs by randomly modifying an initial input. We used a
Microsoft-internal widely-used fuzzing tool.
concolic generates test inputs using the concolic testing algorithm of Section 2. We
used SAGE [48], an existing tool that implements concolic testing for Win-
dows programs.
concolic+tokens extends concolic testing with only the lexical part of the gram-
mar, i.e., marks token identifiers as symbolic, instead of individual input
bytes, but does not use a grammar. This strategy was implemented as an
extension of SAGE.
concolic+grammar is our main strategy, i.e., the grammar-based concolic test-
ing. This strategy extends concolic testing with both symbolic tokens and an
input grammar. This strategy was implemented as an extension of SAGE.
Figure 4-4 tabulates the strategies used in the evaluation and shows their char-
acteristics.
Other strategies are conceivable. For example, concolic testing could be com-
bined directly with the grammar, without tokens. Doing so requires transform-
ing the grammar into a scannerless grammar [99]. Another possible strategy is
bounded exhaustive enumeration [72,108]. We have not included the latter in our
evaluation because, while all other strategies we evaluated can be time-bounded
(i.e., can be stopped at any time), exhaustive enumeration up to some input length
is biased if terminated before completion, which makes it hard to fairly compare
to time-bounded techniques.
4.5.3 Methodology
We used randomly generated seed inputs. To avoid bias when using test gener-
ation strategies that require seed inputs (see Figure 4-4), we used 50 seed inputs
with 15 to 20 tokens generated randomly from the grammar. Section 4.5.4 provides
more information about selecting the size of seed inputs. Also, to avoid bias across
all strategies, we ran all experiments inside the same test harness.
The concolic+tokens and concolic+grammar strategies require identifying the tok-
enization function that creates grammar tokens. Our implementation allows doing
so in a simple way, by overriding a single function in our framework.
For each of the examined modules (lexer, parser, and code generator), we mea-
sured the reachability rate, i.e., the percentage of inputs that execute at least one
instruction of the module. Deeper modules always have lower reachability rates.
We measured instruction coverage, i.e., the ratio of the number of unique exe-
cuted instructions to all instructions in the module of interest. This coverage met-
ric seems the most suitable for our needs, since we want to estimate the error-
finding potential of the generated inputs, and blocks with more instructions are
more likely to contain errors than short blocks. In addition to the total instruction
coverage for the interpreter, we also measured coverage in the lexer, parser, and
code generator modules.
We ran each test input generation strategy for 2 hours. The 2-hour time in-
cluded all experimental tasks: program execution, symbolic execution (where ap-
plicable), constraint solving (where applicable), generation of new inputs, and cov-
erage measurements.
For reference, we also include coverage data and reachability results obtained
with a "manual" test suite, created over several years by the developers and testers
of this JavaScript interpreter. The suite consists of more than 2,800 hand-crafted
inputs that exercise the interpreter thoroughly.
As an additional control, we used a test suite with full production coverage, cre-
ated by using Purdom's algorithm [74,94], which gives a set of strings that cover
all grammar productions.
Longer, randomly generated, inputs are more likely to be accepted by the grammar
and rejected by the parser. For example, the grammar specifies that break state-
ments may occur anywhere in the function body, while the parser enforces that
break statements may appear only in loops and switch statements. Enforcing
this is possible by modifying the grammar but it would make the grammar much
larger. Another example of over-approximation concerns line breaks and semi-
colons. The standard specifies that certain semicolons may be omitted, as long
as there are appropriate line breaks in the file.3 However, the grammar does not
enforce this requirement and allows omitting all semicolons.
We selected 15 to 20 tokens as the size range of the seed inputs used in the other
experiments. The results in Figure 4-5 indicate that this length makes the seed
inputs sufficiently varied without sacrificing the reachability rate (i.e., reachability
of the code generation module).
4.5.5 Results
Coverage and Reachability
Figure 4-6 tabulates the coverage and reachability results for the 2-hour runs with
each of the five automated test generation strategies previously discussed. For
comparison, results obtained with the manually-written test suite are also included,
even though running it requires between 2 and 3 hours (the 2,820 input JavaScript
programs are typically much larger).
3 See Section 7.9 of the specification: http://interglacial.com/javascript_spec/a-7.html#a-7.9
strategy              inputs   coverage %               reachability %
                               total lexer parser gen.   lexer parser gen.
fuzzing                 8658     14    25     25    52     99     99    18
fuzzing+grammar         7837     12    22     24    61    100    100    72
concolic                6883     15    26     29    54     99     99    17
concolic+tokens         3086     16    35     39    53    100    100    16
concolic+grammar        2378     20    25     42    82    100    100    81
seed inputs               50     11    18     21    51    100    100    66
full prod. coverage       15     12    20     27    52    100    100    53
manual test suite       2820     59    62     76    92    100    100   100
Figure 4-6: Coverage and reachability results for 2-hour runs (in %). The seed inputs
row lists statistics for the seed inputs used by some of the test generation strate-
gies (see Sections 4.5.3 and 4.5.4). The manual test suite takes more than 2 hours
(less than 3 hours) to run and is included here for reference. The inputs column
gives the number of inputs tested by each strategy. The coverage columns give
the instruction coverage for lexer, parser, and code generator (gen.) modules. The
reachability columns give the percentage of inputs that reach the module's entry-
point.
In summary, the results of these experiments validate our claim that grammar-
based concolic testing is effective in reaching deeper into the tested application
and exercising the code more thoroughly than other automated test-generation
strategies.
Relative Coverage
Grammar-based concolic testing creates test inputs that cover code not covered by
other strategies. Figure 4-7 compares the instructions covered with concolic+grammar
and the other analyzed strategies. The numbers show that the inputs generated by
concolic+grammar covered most of the instructions covered by the inputs generated
by the other strategies (see the small numbers in the S - GBC column), while covering
many other instructions (see the relatively larger numbers in the GBC - S column).
Combined with the results of Section 4.5.5, this shows that concolic+grammar
achieved the highest total coverage, highest reachability rate, and highest coverage
in the deepest module, while using the smallest number of inputs.
Figure 4-8 presents statistics related to the concolic executions performed dur-
ing the 2-hour runs of each of the three concolic strategies evaluated. We make the
following observations.
* All three concolic strategies performed roughly the same number of symbolic
executions.
* However, the concolic strategy created a larger average number of symbolic
variables because it operated on characters, while the other two strategies
worked on tokens (cf. Figure 4-4).
* The concolic+tokens strategy created the smallest average number of symbolic
variables per execution. This is because concolic+tokens generated many unparsable
inputs (cf. Figure 4-6), which the parser rejected early; therefore, no symbolic
variables were created for the tokens after the parse error.
Token-based strategies avoided path explosion in the lexer. Figure 4-8 shows
how constraint creation was distributed among the lexer, parser, and code gen-
erator modules of the JavaScript interpreter. The two token-based strategies
(concolic+tokens and concolic+grammar) generated no constraints in the lexer. This helped
to avoid path explosion in that module. Those strategies did explore the lexer
(indeed, Figure 4-6 shows high coverage) but they did not get lost in the lexer's
error-handling paths.
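To make the contrast concrete, the following sketch (Python; the helper names are illustrative assumptions, not SAGE's API) shows the difference between marking individual input bytes symbolic and marking whole tokens symbolic:

# Sketch: byte-level vs. token-level symbolic marking (illustrative helpers,
# not SAGE's actual API).

def tokenize(source: str) -> list[str]:
    # Stand-in for the interpreter's lexer; a real lexer returns token kinds
    # such as 'function', 'identifier', '(', ')', '{', '}'.
    return source.split()

def symbolic_vars_per_byte(data: bytes) -> list[str]:
    # Plain concolic testing: one symbolic variable per input byte, so every
    # character-level branch in the lexer can generate constraints.
    return [f"byte_{i}" for i in range(len(data))]

def symbolic_vars_per_token(source: str) -> list[str]:
    # Grammar-based concolic testing: one symbolic variable per token, so the
    # lexer's character-level branches generate no constraints and symbolic
    # execution effectively starts above the lexer.
    return [f"token_{i}" for i in range(len(tokenize(source)))]

program = "function f ( ) { return 1 ; }"
print(len(symbolic_vars_per_byte(program.encode())))   # one variable per byte
print(len(symbolic_vars_per_token(program)))           # one variable per token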
All strategies created constraints in the deepest, code generator, module. How-
ever, there were few such constraints because the parser transforms the stream of
tokens into an Abstract Syntax Tree (AST) and subsequent code, such as the code
generator, operates on the AST. When processing the AST in later stages, symbolic
variables associated with input bytes or tokens are largely absent, so symbolic exe-
cution does not create constraints from code branches in these stages. The number
of symbolic constraints in those deeper stages could be increased by associating
symbolic variables with other values returned by the tokenization function such
as string and integer values associated with some tokens.
4 http://www.immunitysec.com/resources-freesoftware.shtml
5 http://peachfuzz.sourceforge.net/
6 http://labs.idefense.com/software/fuzzing.php
7 http://autodafe.sourceforge.net
One approach to the path-explosion problem relies on a provided compact "summary"
of the entire lexer/parser, which may include over-approximations. Computing such a
finite-size summary automatically may be impossible due to infinitely many paths or
limited symbolic reasoning capability when analyzing the lexer/parser. Grammar-based
concolic testing and test summaries are complementary techniques that could be used
simultaneously.
Another approach to path explosion consists of abstracting lower-level func-
tions using software stubs, marking their return values as symbolic, and then re-
fining these abstractions to eliminate infeasible program paths [71]. In contrast,
grammar-based concolic testing is always grounded in concrete executions, and
thus does not require the expensive step of removing infeasible paths.
Emmi et al. [36] extend systematic testing with constraints that describe the
state of the data for database applications. Our approach also solves path and data
constraints simultaneously, but ours is designed for compilers and interpreters in-
stead of database applications.
Majumdar and Xu's recent and independent work [72] is closest to ours. These
authors combine grammar-based fuzzing with concolic testing by exhaustively
pre-generating strings from the grammar (up to a given length), and then perform-
ing concolic testing starting from those pre-generated strings, treating only vari-
able names, number literals etc. as symbolic. Exhaustive generation inhibits scal-
ability of this approach beyond very short inputs. Also, the exhaustive grammar-
based generation and the concolic testing parts do not interact with each other in
Majumdar and Xu's framework. In contrast, our grammar-based concolic testing
approach is more powerful as it exploits the grammar for solving constraints gen-
erated during symbolic execution to generate input variants that are guaranteed to
be valid.
4.7 Conclusion
We introduced grammar-based concolic testing to enhance the effectiveness of
concolic testing for applications with complex, highly-structured inputs, such as
interpreters and compilers. Grammar-based concolic testing tightly integrates
implementation-based and specification-based testing, and leverages the strengths
of both.
As shown by our case studies, grammar-based concolic testing generates tests
that exercise more code in the deeper, harder-to-test layers of the program un-
der test. In our experiments with UNIX programs, our technique led to up to
2x improvements in line coverage, eliminated all illegal inputs, and enabled dis-
covering 3 distinct, previously unknown, inputs that led to infinitely-looping pro-
gram execution. In our experiments with the JavaScript interpreter in Internet
Explorer 7, grammar-based concolic testing strongly outperformed both concolic
testing and fuzzing. Code generator coverage improved from 61% to 81% and
deep reachability improved from 72% to 80%. Deep parts of the application are
the hardest to test automatically and our technique shows how to address this.
Since grammars are often partial specifications of valid inputs, grammar-based
fuzzing approaches are fundamentally limited. Thanks to concolic testing, some
of this incompleteness can be compensated for, which explains why grammar-based con-
colic testing also outperformed grammar-based fuzzing in our experiments.
Chapter 5
Concolic Security Testing
SQL Injection
A SQLI vulnerability results from the application's use of user input in construct-
ing database statements. The attacker invokes the application, passing as an input
a (partial) SQL statement, which the application executes. The attack permits the
attacker to get unauthorized access to, or to damage, the data stored in a database.
To prevent this attack, applications need to sanitize input values that are used in
constructing SQL statements, or else reject potentially dangerous inputs.
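As a minimal illustration of the vulnerable pattern (a Python sketch, independent of the PHP example program discussed below), concatenating raw input into a query lets the input change the statement's structure, whereas a bound parameter cannot:

# Minimal SQL-injection illustration (Python/sqlite3 sketch, not the PHP
# example program of Figure 5-1).
import sqlite3

def messages_for_topic_vulnerable(conn: sqlite3.Connection, topicid: str):
    # Unsafe: raw user input is concatenated into the statement, so an input
    # such as "1' OR '1'='1" changes the structure of the WHERE clause.
    query = "SELECT msg FROM messages WHERE topicid='" + topicid + "'"
    return conn.execute(query).fetchall()

def messages_for_topic_safe(conn: sqlite3.Connection, topicid: str):
    # Safer: the value is passed as a bound parameter and cannot alter the
    # statement's syntactic structure.
    return conn.execute(
        "SELECT msg FROM messages WHERE topicid=?", (topicid,)
    ).fetchall()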
First-order XSS
A first-order XSS (also known as Type 1, or reflected, XSS) vulnerability results
from the application inserting part of the user's input in the next HTML page that
it renders. The attacker uses social engineering to convince a victim to click on
a (disguised) URL that contains malicious HTML/JavaScript code. The victim's
browser then displays HTML and executes JavaScript that was part of the attacker-
crafted malicious URL. The attack can result in website defacement, or in theft of
browser cookies and other sensitive user data. To prevent first-order XSS attacks,
users need to check link anchors before clicking on them, and applications need to
reject or modify input values that may contain script code.
Second-order XSS
A second-order XSS (also known as Type 2, persistent, or stored, XSS) vulnerability
results from the application storing (part of) the attacker's input in a database, and
later inserting it in an HTML page that is displayed to multiple victim users (e.g.,
in an online bulletin-board application).
It is harder to prevent second-order XSS than first-order XSS. Applications need
to reject or sanitize input values that may contain script code and are either dis-
played in HTML output, or used in database commands.
Second-order XSS is much more damaging than first-order XSS because: (i)
social engineering is not required (the attacker can directly supply the malicious
input without tricking users into clicking on a URL), and (ii) a single malicious
script planted once into a database executes on the browsers of many victim users.
http://www.mysite.com/?mode=display&topicid=1
In this example URL, the input has two key-value pairs: mode=display and
topicid=1. The input passed inside the URL is available to the PHP program via
the $_GET associative array.
The program can operate in two modes: posting a message or displaying all
messages for a given topic. When posting a message, the program constructs and
submits the SQL statement to store the message in the database (lines 25 and 28)
and then displays a confirmation message (line 29). In the displaying mode, the
program retrieves and displays messages for the given topic (lines 39, 40, and 44).
This program is vulnerable to the following attacks, all of which our technique
can automatically generate:
Figure 5-1: Example PHP program that implements a simple message board using
a MySQL database. This program is vulnerable to SQL injection and cross-site
scripting attacks. Section 5.1.1 discusses the vulnerabilities. (For simplicity, the
figure omits code that establishes a connection with the database.)
This string leads to an attack because the query that the program submits to the
database in line 40,
SELECT msg FROM messages WHERE topicid='1' OR '1'='1'
contains a tautology in the WHERE clause and will retrieve all messages, possibly
leaking private information.
To exploit the vulnerability, the attacker must create an attack vector, i.e., the
full set of inputs that make the program follow the exact path to the vulnerable
mysql_query call and execute the attack query. In our example, the attack vector
must contain at least the parameters mode and topicid set to appropriate values. For
example:
mode → display
topicid → 1' OR '1'='1
5.2 Technique
Our technique generates a set of concrete inputs, executes the program under test
with each input, and dynamically observes whether and how data flows from a
user input to a sensitive sink (e.g., a function such as mysql_query or echo), in-
cluding any data flows that pass through a database. If an input reaches a sensitive
sink, our technique modifies the input by either using a string-constraint solver, or
using a library of attack patterns, in an attempt to pass malicious data through the
program.
This section first shows the five components of our technique (Section 5.2.1)
and then describes the algorithms for automatically generating first-order (Sec-
tion 5.2.2) and second-order (Section 5.2.3) attacks.
* The Input Generator creates a set of inputs for the program under test, aim-
ing to cover many execution paths.
* The Concolic Executor runs the program on each input produced by the in-
put generator and tracks how the user input flows into sensitive sinks. For
each sensitive sink, the executor outputs a set of symbolic expressions on
string variables (input parameters). These expressions indicate how the in-
put values flow into the sink.
* The Attack Generator takes a list of symbolic expressions (one for each sensitive
sink) and creates candidate attacks by modifying the inputs using a String-Constraint
Solver.
* The Attack Checker runs the program on the candidate attacks to determine
which are real attacks.
* The Concolic Database is a relational database engine that can execute SQL
statements both concretely and symbolically. Our technique uses this compo-
nent to track the flow of symbolic data through the database, which is critical
for accurate detection of second-order XSS attacks.
Figure 5-2: The architecture of ARDILLA. The inputs to ARDILLA are the PHP
program and its associated database state. The output is a set of attack vectors
for the program. Each attack vector is a complete input that exposes a security
vulnerability.
5.2.2 First-order Attacks
Figure 5-3 shows the algorithm for generating SQLI and first-order XSS attacks
(both called first-order because they do not involve storing malicious inputs in the
database). The algorithms for creating SQLI and first-order XSS attacks are iden-
tical except for the sensitive sinks (mysql_query for SQLI, echo and print for
XSS) and details in the attack generator/checker.
Figure 5-3: Algorithm for creating SQLI and first-order XSS attacks.
The algorithm takes the program P under test and its associated database db
populated with the proper tables and initial data (usually done via an installation
script or taken from an existing installation). Until a time limit is reached, the algo-
rithm generates new concrete inputs (line 3), runs the program on each input and
collects symbolic expressions (line 4), generates candidate attack vectors (line 5)
and checks the candidates to find real attack vectors (line 6).
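The control structure of this loop can be sketched as follows (Python; the component callables mirror Section 5.2.1, but their names and signatures are assumptions, not ARDILLA's interfaces):

# Sketch of the first-order attack-generation loop. All four components are
# passed in as callables; the names mirror Section 5.2.1 but are assumptions,
# not ARDILLA's actual interfaces.
import time
from typing import Callable, Iterable

def first_order_attacks(
    generate_inputs: Callable[[], Iterable[dict]],            # Input Generator
    concolic_run: Callable[[dict], list],                     # Concolic Executor
    make_candidates: Callable[[dict, list], Iterable[dict]],  # Attack Generator
    is_real_attack: Callable[[dict, dict], bool],             # Attack Checker
    time_limit_sec: int = 1800,
) -> list[tuple[dict, dict]]:
    attacks = []
    deadline = time.time() + time_limit_sec
    while time.time() < deadline:                 # until the time limit is reached
        for inp in generate_inputs():             # new concrete inputs (line 3)
            sink_exprs = concolic_run(inp)        # symbolic expressions per sink (line 4)
            for cand in make_candidates(inp, sink_exprs):   # candidate vectors (line 5)
                if is_real_attack(inp, cand):               # check candidates (line 6)
                    attacks.append((inp, cand))   # cand is an attack vector
    return attacks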
Example SQLI
Here is how our technique generates the SQL injection attack presented in Sec-
tion 5.1.1. First, new inputs are successively generated and the program executes
concolically on each input until some input allows the program to reach line 40 in
the code in Figure 5-1, which contains the sensitive sink mysql_query. An exam-
ple of such an input I is (our input generator picks 1 as the default "don't care"
value):
mode → display
topicid → 1
Second, the concolic executor runs the program on I, marks each input parameter
as a symbolic variable, and creates symbolic expressions for string values that
flow into sensitive sinks. In the example, the concrete string value that reaches the
sensitive sink is:
"SELECT msg FROM messages WHERE topicid='1'"
The corresponding symbolic expression is:
concat("SELECT msg FROM messages WHERE topicid='", topicid, "'")
Third, ARDILLA uses the HAMPI string-constraint solver to find a value of the
parameter topicid for which the query is an attack. ARDILLA uses a list of SQLI
attack patterns, e.g., OR '1'='1'. The string-constraint solver finds a value for
which the concrete value is a valid SQL statement and contains the attack pattern
(Chapter 3 describes the necessary HAMPI string constraints). One such value is
1' OR '1'='1. Using this value alters input I into I':
mode → display
topicid → 1' OR '1'='1
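Conceptually, the solver is asked for a string v such that substituting v into the query template yields a grammatically valid SQL statement containing an attack pattern. The brute-force sketch below conveys that constraint only; it is not HAMPI's algorithm or input language, and is_valid_sql is a crude stand-in for membership in the bounded SQL grammar:

# Conceptual sketch of the constraint solved in the third step (Python; this is
# a brute-force stand-in for HAMPI, not HAMPI's algorithm or input language).

def is_valid_sql(query: str) -> bool:
    # Crude stand-in for membership in the bounded SQL grammar of Figure 3-2:
    # here we only require balanced single quotes.
    return query.count("'") % 2 == 0

def find_attack_value(candidates, attack_pattern="OR '1'='1'"):
    # The query template mirrors the symbolic expression
    # concat("SELECT msg FROM messages WHERE topicid='", topicid, "'").
    prefix = "SELECT msg FROM messages WHERE topicid='"
    suffix = "'"
    for v in candidates:
        query = prefix + v + suffix
        if is_valid_sql(query) and attack_pattern in query:
            return v
    return None

# Example: the second candidate satisfies both constraints.
print(find_attack_value(["1", "1' OR '1'='1"]))   # prints: 1' OR '1'='1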
Fourth, the attack checker runs the program on I' and determines that I' is a
real attack, i.e., changes the syntactic structure of the SQL statement.
Finally, the algorithm outputs I' as an attack vector for the first-order SQLI
vulnerability in line 40 of Figure 5-1.
Example XSS
Here is how our technique generates the first-order XSS attack presented in Sec-
tion 5.1.1. First, new inputs are successively generated and the program executes
concolically on each input until some input allows the program to reach line 29 in
the code in Figure 5-1, which contains the sensitive sink echo. An example of such
an input I is:
mode → add
topicid → 1
msg → 1
poster → 1
(Even though only the value of mode determines whether execution reaches line 29,
all parameters are required to be set; otherwise the program rejects the input in
line 17.)
Second, the concolic executor runs the program on I, marks each input parameter
as a symbolic variable, and creates symbolic expressions for string values that
flow into sensitive sinks. In the example, the executor determines that the value of
the parameter poster flows into the local variable myposter, which flows into
the sensitive sink echo in line 29:
$myposter = $_GET['poster'];
The algorithm takes the program P under test and a database db. In the first
step (line 3), the algorithm makes a symbolic copy of the concrete database, thus
creating a concolic database. Then, until a time limit expires, the algorithm gen-
erates new concrete inputs and attempts to create attack vectors by modifying the
inputs. The algorithm maintains a set of inputs generated so far (in the inputs vari-
able), from which, in each iteration, the algorithm picks two inputs (lines 6 and 7).
Then, the algorithm performs concolic executions of the two inputs, in sequence,
(lines 8 and 9) using the concolic database. The first execution (simulating the at-
tacker) sets the state of the database that the second execution (simulating
the victim) uses. Next, the attack generator (line 10) creates candidate second-order
XSS attack scenarios (i.e., input pairs) using a library of attack patterns. Finally, the
attack checker (line 11) checks candidate second-order XSS attack scenarios (i.e.,
input pairs).
To favor execution paths that lead to second-order XSS attacks, on line 6 our
implementation heuristically picks an input that executes a database write, and on
line 7 picks an input that executes a database read on the same table.
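A sketch of this second-order loop (Python; all components are passed in as callables and their names are assumptions) makes the role of the shared concolic database and the read/write pairing heuristic explicit:

# Sketch of the second-order XSS algorithm (components passed as callables;
# names are assumptions, as in the first-order sketch).
import time

def second_order_attacks(generate_inputs, pick_pair, concolic_run,
                         make_candidates, is_attack_scenario,
                         make_concolic_db, time_limit_sec=1800):
    cdb = make_concolic_db()                 # symbolic copy of the concrete database (line 3)
    inputs, scenarios = [], []
    deadline = time.time() + time_limit_sec
    while time.time() < deadline:
        inputs.extend(generate_inputs())
        # Heuristic (lines 6-7): i1 writes to a database table, i2 reads from it.
        i1, i2 = pick_pair(inputs)
        concolic_run(i1, cdb)                # attacker's run updates the concolic DB (line 8)
        sink_exprs = concolic_run(i2, cdb)   # victim's run reads the tainted data back (line 9)
        for i1_attack in make_candidates(i1, sink_exprs):    # candidate pairs (line 10)
            if is_attack_scenario(i1_attack, i2):            # check pairs (line 11)
                scenarios.append((i1_attack, i2))
    return scenarios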
Example XSS
Here is how our technique generates the second-order XSS attack introduced in
Section 5.1.1. First, the input generator creates inputs and picks the following
pair, I1:
mode → add
topicid → 1
msg → 1
poster → 1
and I2:
mode → display
topicid → 1
Second, the concolic executor runs the program on I1, using the concolic database.
During this execution, the program stores the value 1 of the input parameter msg
(together with the symbolic expression msg) in the database (line 25 of Figure 5-1).
Third, the concolic executor runs the program on I2, using the concolic database.
During this execution, the program retrieves the value 1 from the database (together
with the value's stored symbolic expression msg) and outputs the value via
the echo in line 44. echo is a sensitive sink, and the symbolic expression associated
with the string value that flows into the sink is concat("Thank you ", msg),
where msg is a parameter from I1. Thus, the algorithm has dynamically tracked
the flow of user data across two executions: from msg to the local variable mymsg
(line 20), into the database (line 28), back out of the database (line 40), into the $row
array (line 43), and finally as a parameter to echo (line 44).
Fourth, the attack generator uses the attack-pattern library (similarly to the
first-order XSS example before) to alter msg in I1 to create an attack-candidate
input I1':
mode → add
topicid → 1
msg → <script>alert("XSS")</script>
poster → 1
Fifth, the attack checker runs the program, in sequence, on I1' and I2 (note that I2
remains unchanged), and determines that this sequence of inputs is an attack scenario.
Finally, the algorithm outputs the pair (I1', I2) as a second-order XSS attack scenario
that exploits the vulnerability in line 44 of Figure 5-1.
5.3 The ARDILLA Tool for Creating SQLI and XSS At-
tacks
ARDILLA is an implementation of concolic security testing. ARDILLA generates
concrete attack vectors for Web applications written in PHP. The user of ARDILLA
needs to specify the type of attack (SQLI, first-order XSS, or second-order XSS), the
PHP program to analyze, and the initial database state. The outputs of ARDIL-
LA are attack vectors. This section describes ARDILLA's implementation of each
component of the technique described in Section 5.2.
1 http://www.zend.com
5.3.3 Attack Generator
The attack generator creates candidate attack vectors that are variants of the given
input. The attack generator starts with an input for which there is dataflow from
a parameter to a sensitive sink. For each parameter whose value flows into the
sink (i.e., symbolic variable in the symbolic expression), the generator creates new
inputs that differ only in that parameter. The generator can create new inputs
in one of two modes: using a string-constraint solver or an attack-pattern library.
When using a string-constraint solver, the generator finds the value of that param-
eter that, when used in an input, results in an attack string (a value that may result
in an attack if supplied to a vulnerable input parameter). When using the attack-
pattern library, the generator systematically replaces the value of that parameter
by values taken from the library, i.e., a set of values that may result in an attack if
supplied to a vulnerable input parameter.
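In the attack-pattern-library mode, candidate generation is essentially a substitution loop; the sketch below (Python, with illustrative parameter names) makes this concrete:

# Sketch of pattern-library candidate generation (illustrative names only).

def candidates_from_patterns(inp: dict, tainted_params: list[str],
                             attack_patterns: list[str]) -> list[dict]:
    # For each parameter whose value flows into the sink, create variants of
    # the input that differ only in that parameter.
    variants = []
    for param in tainted_params:             # symbolic variables in the sink's expression
        for pattern in attack_patterns:
            variant = dict(inp)              # keep all other key-value pairs unchanged
            variant[param] = pattern
            variants.append(variant)
    return variants

# Example: altering only 'msg' while keeping the rest of the input intact.
seed = {"mode": "add", "topicid": "1", "msg": "1", "poster": "1"}
print(candidates_from_patterns(seed, ["msg"],
                               ['<script>alert("XSS")</script>'])[0])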
Following previous work [41,52], we define an SQLI attack as a syntactically
valid (according to the grammar) SQL statement with a tautology in the WHERE
clause, e.g., OR 1=1. In our experiments, we used 4 attack strings (tautology pat-
terns), distilled from several security lists.2,3
The input to HAMPI includes a partial SQL grammar (similar to that in Fig-
ure 3-2). We manually wrote a grammar that covers a subset of SQL queries com-
monly observed in Web-applications, which includes SELECT, INSERT, UPDATE,
and DELETE, all with WHERE clauses. The grammar has 14 nonterminals, 16 termi-
nals, and 27 productions. Each terminal is represented by a single unique charac-
ter.
ARDILLA's XSS attack-pattern library4 contains 113 XSS attack patterns, includ-
ing many filter-evading patterns (that use various character encodings, or that
avoid specific strings in patterns).
ARDILLA's goal is creating concrete exploits, not verifying the absence of vul-
nerabilities. Moreover, ARDILLA checks every candidate attack input. There-
fore, ARDILLA is useful even given the pattern library's inevitable incompleteness
(missing attack patterns), and potential unsoundness (patterns that do not lead to
attacks).
The attack library needs to be integrated in ARDILLA to be effective; the library
alone is not enough to construct attacks. ARDILLA constructs each attack input
so that the execution reaches the vulnerable call site (using random values is in-
effective for this purpose [2]). In particular, the constructed attack inputs contain
many key-value pairs. Strings from the attack library constitute only 1 value in
each attack input.
2 http://www.justinshattuck.com/2007/01/18/mysql-injection-cheat-sheets,
http://ferruh.mavituna.com/sql-injection-cheatsheet-oku,
http://pentestmonkey.net/blog/mysql-sql-injection-cheat-sheet
3 ARDILLA's list omits attacks that transform one query into multiple queries, because the PHP
mysql_query function only allows one query to be executed per call.
4 http://ha.ckers.org/xss.html
5.3.4 Attack Checker
The attack checker determines whether a candidate input is an attack. The attack
checker compares the candidate input's execution to that of the original input. The
attack checker ensures that ARDILLA creates concrete exploits, which are much
easier for programmers to fix than reports of abstract traces [9,37].
The attack checker is different for SQLI and XSS. In both types of Web at-
tacks, the PHP program interacts with another component (a database or a Web
browser) in a way the programmer did not intend. The essence of an SQLI attack
is a change in the structure of the SQL statement that preserves its syntactic validity
(otherwise, the database rejects the statement and the attack attempt is unsuccess-
ful) [107]. The essence of an XSS attack is the introduction of additional script-
inducing constructs (e.g., <script> tags) into a dynamically-generated HTML
page [115].
ARDILLA detects attacks by comparing surely-innocuous runs with potentially-
malicious ones. We assume that the input generator creates innocuous (non-attack)
inputs, since the values of the input parameters are simple constants such as 1
or literals from the program text. Therefore, the innocuous input represents how
the program is intended to interact with a component (database or browser). The
attack generator creates potentially-malicious inputs.
The checker runs the program on the two inputs and compares the executions.
Running the program on the attack-candidate input avoids two potential sources
of false warnings: (i) input sanitizing: the program may sanitize (i.e., modify to
make harmless) the input before passing it into a sensitive sink; ARDILLA does not
require the user to specify a list of sanitizing routines. (ii) input filtering: the
program may reject inputs that match a malicious-input pattern (blacklisting), or that
fail to match an innocuous-input pattern (whitelisting). The symbolic expressions are
unaffected by control flow (they reflect only data flow) and therefore cannot capture
input filtering.
The SQLI attack checker compares database statements (e.g., SELECT, INSERT)
issued by the PHP program executed separately on the two inputs. The checker
compares the first pair of corresponding statements, then the second, etc. The
checker signals an attack if the statements in any pair are both valid SQL but have
different syntactic structure (i.e., parse tree).
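The structural comparison can be approximated by normalizing literals and comparing the remaining token sequences, as in the sketch below (Python; a crude stand-in for comparing real parse trees, and it omits the check that both statements are valid SQL):

# Sketch of the SQLI check: same statement position, different syntactic
# structure. The "fingerprint" is a crude stand-in for a real parse tree.
import re

def fingerprint(stmt: str) -> list[str]:
    # Tokenize and collapse string/number literals, so two statements that
    # differ only in literal values get the same fingerprint.
    tokens = re.findall(r"'[^']*'|\w+|[^\s\w]", stmt)
    return ["LIT" if t.startswith("'") or t.isdigit() else t.upper()
            for t in tokens]

def is_sqli(innocuous_stmt: str, candidate_stmt: str) -> bool:
    return fingerprint(innocuous_stmt) != fingerprint(candidate_stmt)

# The tautology changes the WHERE clause's structure, so this reports an attack:
print(is_sqli("SELECT msg FROM messages WHERE topicid='1'",
              "SELECT msg FROM messages WHERE topicid='1' OR '1'='1'"))  # True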
The XSS attack checker signals an attack if the HTML page produced from
the execution of a candidate attack input (or sequence of inputs, for second-order
attacks) contains additional script-inducing constructs.
Figure 5-5: Example state of the concolic database table messages used by the
PHP program of Figure 5-1. Each concrete column (left-most two columns) has a
symbolic counterpart (right-most two columns) that contains a symbolic expres-
sion. The 0 values represent empty symbolic expressions.
Figure 5-5 shows an example database state during the execution of the pro-
gram in Figure 5-1. Assume the database was pre-populated with a test message
in topic 1, so the symbolic expressions for fields in the first row are empty. When
the user posts a message Hello in topic 2 (line 28), the symbolic expressions from
the respective input parameters are stored along with their concrete values in the
second row. Later, when the user fetches data from that row (line 43), the symbolic
expressions are also fetched and propagated to the assigned variables.
ARDILLA dynamically rewrites SQL statements in the program under test. The
rewritten SQL statements account for the new columns, either updating or reading
symbolic expressions. Our current implementation handles a subset of SQL, rewriting
statement strings before passing them into mysql_query: CREATE TABLE, INSERT,
UPDATE, and (non-nested) SELECT. (The DELETE statement and WHERE condition do
not need to be rewritten; the database server can locate the relevant rows using the
concrete values.)
* CREATE TABLE creates a new table. ARDILLA rewrites the statement to add a
duplicate for each column (e.g., the two right-most columns in Figure 5-5) to use
for storing symbolic expressions.
* INSERT adds new rows to tables. ARDILLA rewrites the statement to store sym-
bolic expressions in the duplicate columns. For example, consider the following
PHP string representing an SQL statement (PHP automatically performs the string
concatenation):
INSERT INTO messages VALUES('$_GET['msg']',
'$_GET['topicid']')
Consider an execution in which parameters msg and topicid have concrete val-
ues Hello and 2 and have symbolic expressions that contain only the parameters
themselves. ARDILLA dynamically rewrites the statement as follows:
INSERT INTO messages VALUES('Hello','2',
'msg', 'topicid')
* UPDATE modifies values in tables. For example, for:
UPDATE messages SET msg='$_GET['msg']'
WHERE topicid='$_GET['topicid']'
ARDILLA's dynamic rewriting for UPDATE is similar to that for INSERT (the WHERE
condition is unchanged):
UPDATE messages SET msg='Hi',
msg_s='msg' WHERE topicid='3'
* SELECT finds and returns table cells. ARDILLA rewrites the statement to include
the duplicate (symbolic) column names in the selection. Thereafter, ARDILLA uses
the value retrieved from the duplicate column as the symbolic expression for the
concrete value retrieved from the original column. For example, consider the con-
crete statement executed in line 39 of the program in Figure 5-1 (given the example
state of the concolic database in Figure 5-5).
SELECT msg FROM messages WHERE topicid = '2'
ARDILLA rewrites the statement to:
SELECT msg, msg_s FROM messages WHERE topicid = '2'
The result of executing this rewritten statement on the table in Figure 5-5 is a 1-
row table with concrete string Hello and associated symbolic expression msg, in
columns msg and msg_s.
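A simplified sketch of the INSERT and SELECT rewriting (Python string manipulation; ARDILLA's actual rewriter runs inside the PHP runtime and handles more cases) mirrors the examples above:

# Simplified sketch of the concolic-database rewriting for INSERT and SELECT.
# ARDILLA's rewriter operates inside the PHP runtime; this only mirrors the
# examples above.

def rewrite_insert(table: str, concrete_values: list[str],
                   symbolic_exprs: list[str]) -> str:
    # Append the symbolic expressions as extra values, one per duplicated column.
    vals = ", ".join(f"'{v}'" for v in concrete_values + symbolic_exprs)
    return f"INSERT INTO {table} VALUES({vals})"

def rewrite_select(columns: list[str], rest: str) -> str:
    # Also select the duplicate (symbolic) column for every requested column.
    cols = ", ".join(columns + [c + "_s" for c in columns])
    return f"SELECT {cols} {rest}"

print(rewrite_insert("messages", ["Hello", "2"], ["msg", "topicid"]))
# INSERT INTO messages VALUES('Hello', '2', 'msg', 'topicid')
print(rewrite_select(["msg"], "FROM messages WHERE topicid = '2'"))
# SELECT msg, msg_s FROM messages WHERE topicid = '2'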
5.4 Evaluation
We evaluated ARDILLA on five open-source programs. We downloaded
the programs from http://sourceforge.net: schoolmate 1.5.4 (tool for
school administration, 8181 lines of code, or LOC), webchess 0.9.0 (online
chess game, 4722 LOC), faqforge 1.3.2 (tool for creating and managing docu-
ments, 1712 LOC), EVE 1.0 (player-activity tracker for an online game, 915 LOC),
and geccbblite 0.1 (a simple bulletin board, 326 LOC). We used the latest available
versions as of 5 September 2008.
We performed the following procedure for each subject program.
1. Run the program's installation script to create the necessary database tables.
2. Pre-populate the database with representative data (e.g., defaults where avail-
able).
3. Run ARDILLA with a 30-minute time limit in each of three modes: SQLI, first-
order XSS, and second-order XSS. The time limit includes all experimental
tasks, i.e., input generation, concolic execution, attack generation (includ-
ing string-constraint solving), and attack checking. When necessary, we pro-
vided the input generator with (non-administrator) username and password
combinations. Doing so poses no methodological problems because an at-
tacker can use a legitimate account to launch an attack. The SQLI mode used
the HAMPI string-constraint solver, while the XSS modes used the attack-
pattern library.
4. Manually examine attack vectors reported by ARDILLA to determine if they
reveal true security vulnerabilities. We did not know of any SQLI or XSS vulnerabilities
in the subject programs before performing the experiments. (Thanks to previous
studies [114,115], we were aware of the presence of first-order XSS and SQLI
vulnerabilities in geccbblite and EVE.)
We ran ARDILLA in two modes for checking validity of XSS attacks: lenient
and strict. (The SQLI checker has only one mode.) In the lenient mode, the XSS
checker reports a vulnerability when the outputs differ in script-inducing elements
or HTML elements such as href. In the strict mode, the XSS checker only reports
a vulnerability when the outputs differ in script-inducing elements.
5.4.1 Measurements
Number of sensitive sinks (all) is the statically computed number of echo/print
(for XSS) or mysql_query statements (for SQLI) whose parameter is not a constant
string.
Number of reached sinks (reach) on all generated inputs is an indication of cov-
erage achieved by the input generator. This measure is suitable for ARDILLA, be-
cause ARDILLA looks for attacks on sensitive sinks.
Number of tainted sinks (taint) is the number of sensitive sinks reached with non-empty
symbolic expressions during concolic execution. Each such occurrence potentially
exposes a vulnerability, which ARDILLA tests using the attack generator and checker.
Number of validated vulnerabilities (Vuln): We count at most one vulnerability
per sensitive sink, since a single-line code-fix would eliminate all attacks on the
sink. If a single attack vector attacks multiple sensitive sinks, then we examine and
count each vulnerability separately. This number does not include false positives.
We manually inspected each ARDILLA report and determined whether it really
constituted an attack (i.e., corruption or unintended disclosure of data for SQLI,
and unintended HTML structure for XSS). For second-order XSS, we checked that
the attacker's malicious input can result in an unintended Web page for the victim.
Number of false positives (F) is the number of ARDILLA reports that are not vali-
dated vulnerabilities.
HAMPI performance: we measured solving time for each string constraint and
each size of N = 1 ... 6. (HAMPI found all known solutions for N < 6.)
5.4.2 Results
ARDILLA found 23 SQLI, 33 first-order XSS, and 12 second-order XSS vulnerabil-
ities in the subject programs (see Figure 5-6). The attacks that ARDILLA found, as
well as the attack patterns we used, are available at http://pag.csail.mit.edu/ardilla.
We examined two of the three instances in which ARDILLA found no vulnerabilities.
In geccbblite, we manually determined that there are no first-order XSS
vulnerabilities.
                       sensitive sinks      lenient       strict
program      mode     all  reach  taint    Vuln    F    Vuln    F
schoolmate   SQLI     218    28     23       6     0      6     0
             XSS1     122    26     20      14     6     10     0
             XSS2     122     4      4       4     0      2     0
webchess     SQLI      93    42     40      12     0     12     0
             XSS1      76    39     39      13    18     13     0
             XSS2      76    40      0       0     0      0     0
faqforge     SQLI      33     7      1       1     0      1     0
             XSS1      35    10      4       4     0      4     0
             XSS2      35     0      0       0     0      0     0
EVE          SQLI      12     6      6       2     0      2     0
             XSS1      24     5      4       2     0      2     0
             XSS2      24     5      3       3     0      2     0
geccbblite   SQLI      10     8      6       2     0      2     0
             XSS1      17    17     11       0     0      0     0
             XSS2      17    17      5       5     0      4     0
Total        SQLI     366    91     76      23           23
             XSS1     274    97     78      33           29
             XSS2     274    66     12      12            8
Figure 5-6: Results of running ARDILLA to create SQLI, XSS1 (first-order XSS), and
XSS2 (second-order XSS) attacks. The lenient and strict columns refer to ARDILLA
modes (Section 5.4). Section 5.4.1 describes the remaining columns (the Vuln columns
list the discovered real vulnerabilities).
which causes execution to construct the following malicious SQL statement which
bypasses authentication (-- starts an SQL comment):
Two of our subject programs were previously analyzed for vulnerabilities. In gec-
cbblite, a previous study [115] found 1 first-order XSS vulnerability, and 7 second-
order XSS vulnerabilities (possibly including false positives). However, ARDILLA
and our manual examination of geccbblite found no first-order XSS vulnerabili-
ties. In EVE, another study [114] found 4 SQLI vulnerabilities. The result data
from neither study are available so we cannot directly compare the findings or
establish the reason for the inconsistencies.
Comparison to Black-box Fuzzing
We compared ARDILLA to a black-box fuzzer, Burp Intruder5 (listed among the 10
most popular Web-vulnerability scanners6). Burp Intruder is a fuzzer for finding
first-order XSS attacks. We configured the fuzzer according to its documentation.
The fuzzer requires manual setting up of HTTP request patterns to send to the
Web application. Additionally, the fuzzer requires manual indication of variables
to mutate. This is hard because it requires examining the source to find names of
parameters read from the $_GET and $_POST arrays. We ran the fuzzer using the
same attack-pattern library that ARDILLA uses, and on the same subject programs.
(We have not been able to successfully configure webchess to run with the fuzzer.)
We ran the fuzzer until completion (up to 8 hours).
The fuzzer found 1 first-order XSS vulnerability in schoolmate, 3 in faqforge, 0
in EVE, and 0 in geccbblite. All 4 vulnerabilities reported by the fuzzer were also
discovered by ARDILLA.
Limitations
The main limitation of ARDILLA stems from the input generator. ARDILLA can
only generate attacks for a sensitive sink if the input generator creates an input
that reaches the sink. However, effective input generation for PHP is challeng-
ing [2,77,116], complicated by the dynamic language features and execution model
(running a PHP program often generates an HTML page with forms and links
that require user interaction to execute code in additional files). In particular, the
generator that ARDILLA uses can create inputs only for one PHP script at a time
and cannot simulate sessions (i.e., user-application interactions that involve mul-
tiple pages), which is a serious hindrance to achieving high coverage in Web ap-
plications; line coverage averaged less than 50%. In fact, only on one application
(webchess) did the input generator run until the full 30-minute time limit-in all
other cases, the generator finished within 2 minutes because it did not manage to
cover more code. We also attempted to run the generator on a larger application,
the phpBB Web-forum creator (35 kLOC), but it achieved even lower coverage
(14%). ARDILLA uses the input generator as a black box and any improvement in
input generation (such as proposed recently by Artzi et al. [3]) is likely to improve
ARDILLA's effectiveness.
5 http://portswigger.net/intruder
6 http://sectools.org/web-scanners.html
able for new code. However, it requires rewriting existing code, while our tech-
nique requires no change to the programming language, the libraries, or the appli-
cation.
Static approaches can, in principle, prove the absence of vulnerabilities [70,113-
115, 117]. In practice, however, analysis imprecision causes false warnings. Ad-
ditionally, static techniques do not create concrete attack vectors. In contrast, our
technique does not introduce such imprecision, and creates attack vectors.
Dynamic monitoring aims to prevent SQLI attacks by tracking user-provided
values [52,87,92,107] during operation of a deployed application. However, dy-
namic monitoring does not help to remove errors before software deployment,
and requires either modifying the application, or running a modified server. For
example, CANDID [6] modifies the application source and requires changing the
runtime system, with performance overhead of up to 40% on the production ap-
plication.
Information-flow control restricts the flow of information between pieces of
software, either statically [98] or dynamically [119,121]. Information-flow control
enforces confidentiality and integrity policies on the data and prevents attacks that
use inappropriate information flows. However, some SQLI and XSS attacks abuse
legitimate information flows; the SQL queries or the JavaScript can be dynamically
generated and can depend on legal user input. Information-flow control requires
modifying the application and either the operating system and the libraries, or the
programming language. System-level techniques may have runtime performance
overhead up to 40% [119].
Static and dynamic approaches can be combined [51,56]. Lam et al. [68] com-
bine static analysis, model checking, and dynamic monitoring. QED [75] combines
static analysis and model checking to automatically create SQLI and first-order
XSS attacks on Java applications. In contrast to ARDILLA, QED (i) does not tar-
get second-order XSS, and (ii) requires programmers to use a custom specification
language to describe attacks.
Saner [5] combines static and dynamic analyses to find potential XSS and SQLI
vulnerabilities. Saner focuses on the sanitization process and abstracts away other
details of the application, i.e., Saner creates attack vectors only for extracted, pos-
sibly infeasible, paths from the static dependency graph (Saner does dynamically
validate the exploitability of string-manipulating code from those paths, but ig-
nores control flow). Saner also reports a vulnerability whenever a path from source
to sink contains no custom sanitization. The path, however, may be infeasible or not
exploitable. Saner tests each source-to-sink path independently and may miss at-
tacks in which output is constructed from multiple sinks. To detect attacks, Saner
simply searches for specific strings in the output, whereas ARDILLA compares the
structure of HTML or SQL between innocuous and attack runs.
Apollo [2] generates test inputs for PHP, checks the execution for crashes, and
validates the output's conformance to HTML standards. The goal of ARDILLA
is different: to find security vulnerabilities. ARDILLA uses the test-input generator
subcomponent of Apollo as a black box. ARDILLA's taint propagation implementa-
tion is partially based on that of Apollo, but we enhanced it significantly by adding
propagation across function calls, taint filters, taint sinks, and tracing taint across
database calls.
Emmi et al. [36] model a database using symbolic constraints and provide a
custom string-constraint solver to create database states that help exercise various
execution paths in the Web application. Our work differs in its goal (finding secu-
rity vulnerabilities vs. improving test coverage) and in the targeted language (PHP
vs. Java).
Wassermann et al.'s tool [116] executes a PHP application on a concrete input
and collects symbolic constraints. Upon reaching an SQL statement, the tool at-
tempts to create an input that exposes an SQL injection vulnerability, by using
a string analysis [83]. The tool has re-discovered 3 previously known vulnera-
bilities. The most important differences between Wassermann's work and ours
are: (i) Their tool has not discovered any previously unknown vulnerabilities, and
requires a precise indication of an attack point. Our tool has discovered 68 pre-
viously unknown vulnerabilities and requires no indication of vulnerable points.
(ii) Their technique focuses on SQLI, while ours targets both SQLI and XSS. (iii)
Their tool performs source-code instrumentation and backward-slice computation
by re-executing and instrumenting additional code. Our tool works on unchanged
application code. (iv) Their tool requires manual loading of pages and supplying
of inputs to the page, while ours is fully automatic.
5.6 Conclusion
We have presented concolic security testing, a technique for creating SQL injec-
tion and cross-site scripting (XSS) attacks in Web applications and an automated
tool, ARDILLA, that implements the technique for PHP. Our technique is based
on input generation, concolic execution, and input mutation. To find a variant
of the input that exposes a vulnerability, input mutation uses the HAMPI string-
constraint solver, or a library of attack patterns. Using a novel concolic database to
store symbolic expressions, ARDILLA can effectively and accurately find the most
damaging type of input-based Web application attack: stored (second-order) XSS.
A novel attack checker that compares the output from running on an innocuous
input and on a candidate attack vector allows ARDILLA to detect vulnerabilities
with high accuracy. In our experiments, ARDILLA found 68 attack vectors in five
programs, each exposing a different vulnerability, with few false positives.
Chapter 6
Conclusions
This chapter presents the summary of the contributions of this dissertation, dis-
cusses lessons learned (Section 6.1), and proposes directions for future work (Sec-
tion 6.2).
Software systems are becoming steadily more complex and our reliance on
them also increases. Testing is currently the dominant way of ensuring software
reliability, and we think that it is likely to remain so. Despite its well-known
deficiencies, e.g., labor-intensiveness and incompleteness, testing is the easiest-to-adopt
strategy for software reliability, and it can be very effective.
The cost of testing can be reduced by automation. Testing is expensive because
creating effective test suites requires skilled test engineers. The aim of the research
presented in this dissertation is to automate the process of creating effective test
inputs.
Concolic testing is a paradigm of implementation-based software testing. Con-
colic testing is based on the premise that the software implementation can be ex-
ploited to find reliability errors. A number of automated concolic testing tools
have been developed and shown to efficiently find errors in real software. This
dissertation presents techniques and tools in the concolic testing paradigm.
The key idea of this dissertation is that by enhancing the underlying constraint
solver with the theory of strings, concolic testing can be made much more effective.
We presented such a string-constraint solver, and evaluated its efficiency. Further-
more, we showed the effectiveness of our idea by improving concolic testing of
programs that have structured inputs (e.g., programs whose inputs come from a
context-free grammar), and of Web applications that manipulate string inputs.
6.1 Discussion
Software testing has two main driving forces: quality and security. We feel that
software quality is today higher than ever. Thanks partly to adoption of good test-
ing practices (e.g., automated regression tests) and software-engineering practices
(e.g., automated build systems) in the industry, the ease-of-use, feature repertoire,
and general reliability of today's commonly-used software are impressive, despite
the vast and very rapidly increasing amount of software we use every day. Software
security, however, is a massive and growing problem. Thousands of severe
vulnerabilities are reported every year, and their numbers are quickly increasing.
There is a transfer of ideas and technologies from research to hackers. Testing
and analysis technologies that used to exist as research prototypes are now widely
used by villains. For example, fuzz testing, developed in 1990 at the University of
Wisconsin [82], has leapt to prominence in the 2000s as the main technology used
by blackhats (as well as whitehat security professionals). The transfer of more
sophisticated technologies, such as concolic testing, is very likely and perhaps
imminent.
The arms race between researchers and hackers will lead to more automated
technologies and more effective tools. We think that progress in this area will
be fueled by combining implementation-based and specification-based testing, by
powerful constraint solvers that combine many theories, and by contests of error-
finding tools.
Exploiting the knowledge of input-format specifications, and combining it with
implementation-based analyses (concolic security testing is an example of such a
combination), will play an important role in future testing tools. This combined
approach compensates for the inherent limitations of each analysis with the strengths
of the other. Effective testing tools will require powerful constraint solvers
that combine many theories: strings, integers, bit vectors, arrays, functions, etc.
Creating software error-finding contests (such as the Iron Chef contest or the
SAMATE project) will be important in fostering development of sophisticated tools.
Similar competitions have proved to be very valuable in many fields of computer
science, from formal methods to bioinformatics.
We feel that the falsification view (i.e., finding errors) will continue and ex-
pand its dominance over the validation view (i.e., proving the absence of errors)
of software reliability. We think that the error-finding approach is likely to become
important even in domains, such as security, that traditionally have been viewed
mostly with validation in mind. Validation techniques, such as static analysis,
will be important as preludes or guides for test-input generation [28]. Similarly,
bounded verification [34,59] is likely to grow in significance, provided it is com-
bined with generation of concrete inputs that demonstrate real errors. We think
that automatic software verification is desirable and it may become feasible, but
it is going to be achievable only for small, critical, pieces of software architecture,
e.g., common data structures [120] or the Java bytecode verifier.
[1] Cesar L. Alonso, David Alonso, Mar Callau, and Jose Luis Montafia. A word
equation solver based on Levensthein distance. In Mexican InternationalCon-
ference on Artificial Intelligence, 2007.
[2] Shay Artzi, Adam Kiezun, Julian Dolby, Frank Tip, Danny Dig, Amit Parad-
kar, and Michael Ernst. Finding bugs in dynamic Web applications. In Inter-
nationalSymposium on Software Testing and Analysis, 2008.
[3] Shay Artzi, Adam Kiezun, Julian Dolby, Frank Tip, Danny Dig, Amit Parad-
kar, and Michael D. Ernst. Finding bugs in web applications using dynamic
test generation and explicit state model checking. Technical Report MIT-
CSAIL-TR-2009-010, Massachusetts Institute of Technology Computer Sci-
ence and Artificial Intelligence Laboratory, 2009.
[4] Roland Axelsson, Keijo Heljank, and Martin Lange. Analyzing context-free
grammars using an incremental SAT solver. In International Colloquium on
Automata, Languages and Programming,2008.
[5] Davide Balzarotti, Marco Cova, Victoria Felmetsger, Nenad Jovanovic, Engin
Kirda, Christopher Kruegel, and Giovanni Vigna. Saner: Composing static
and dynamic analysis to validate sanitization in Web applications. In IEEE
Symposium on Security and Privacy, 2008.
[7] Adam Barth, Juan Caballero, and Dawn Song. Secure content sniffing for
web browsers or how to stop papers from reviewing themselves. In IEEE
Symposium on Security & Privacy,2009.
[8] Marco Benedetti. sKizzo: a Suite to Evaluate and Certify QBFs. In Conference
on Automated Deduction, 2005.
[9] Nicolas Bettenburg, Sascha Just, Adrian Schroter, Cathrin Weiss, Rahul
Premraj, and Thomas Zimmermann. What makes a good bug report? In
InternationalSymposium on the Foundationsof Software Engineering,2008.
[10] Armin Biere. Resolve and expand. In InternationalConference on Theory and
Applications of SatisfiabilityTesting, 2005.
[11] Armin Biere, Alessandro Cimatti, Edmund M. Clarke, Ofer Strichman, and
Yunshan Zhu. Bounded model checking. Advances in Computers, 58, 2003.
[12] Nikolaj Bjorner, Nikolai Tillmann, and Andrei Voronkov. Path feasibility
analysis for string-manipulating programs. In International Conference on
Tools and Algorithms for the Constructionand Analysis of Systems, 2009.
[13] Nikita Borisov, David Brumley, Helen J. Wang, and Chuanxiong Guo.
Generic application-level protocol analyzer and its language. In Network and
DistributedSystem Security Symposium, 2007.
[14] Chandrasekhar Boyapati, Sarfraz Khurshid, and Darko Marinov. Korat: au-
tomated testing based on Java predicates. InternationalSymposium on Software
Testing and Analysis, 2002.
[15] Juan Caballero, Stephen McCamant, Adam Barth, and Dawn Song. Extract-
ing models of security-sensitive operations using string-enhanced white-box
exploration on binaries. Technical Report UCB/EECS-2009-36, EECS Depart-
ment, University of California, Berkeley, 2009.
[16] Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. Klee: Unassisted and
automatic generation of high-coverage tests for complex systems programs.
In Symposium on Operating Systems Design and Implementation, 2008.
[17] Cristian Cadar and Dawson R. Engler. Execution generated test cases: How
to make systems code crash itself. In SPIN Workshop on Model Checking of
Software, 2005.
[18] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Daw-
son R. Engler. EXE: automatically generating inputs of death. In Conference
on Computer and Communications Security, 2006.
[19] Cenzic. Application security trends report Q1 2008. http : / /www. cenz ic.
com.
[20] Aske Simon Christensen, Anders Moller, and Michael I. Schwartzbach. Pre-
cise analysis of string expressions. In InternationalStatic Analysis Symposium,
2003.
[21] Koen Claessen and John Hughes. QuickCheck: A lightweight tool for ran-
dom testing of Haskell programs. In InternationalConference on Functional
Programming,2000.
[22] Edmund M. Clarke, Daniel Kroening, and Flavio Lerda. A tool for checking
ANSI-C programs. In InternationalConference on Tools and Algorithms for the
Constructionand Analysis of Systems, 2004.
[23] Edmund M. Clarke, Daniel Kroening, Natasha Sharygina, and Karen Yorav.
Predicate abstraction of ANSI-C programs using SAT. Formal Methods in
System Design, 25(2-3), 2004.
[24] Byron Cook, Daniel Kroening, and Natasha Sharygina. Cogent: Accurate
theorem proving for program verification. In InternationalConference on Com-
puter Aided Verification, 2005.
[25] William R. Cook and Siddhartha Rai. Safe query objects: statically typed
objects as remotely executable queries. In InternationalConference on Software
Engineering,2005.
[26] David Coppit and Jiexin Lian. yagg: an easy-to-use generator for structured
test inputs. Automated Software Engineering,2005.
[27] CREST: automatic test generation tool for C. http: / /code . google . com/
p/crest.
[28] Christoph Csallner and Yannis Smaragdakis. DSD-Crasher: A hybrid analy-
sis tool for bug finding. ISSTA, 2007.
[29] Weidong Cui, Jayanthkumar Kannan, and Helen J. Wang. Discoverer: Auto-
matic protocol reverse engineering from network traces. In USENIX Security
Symposium, 2007.
[30] Brett Daniel, Danny Dig, Kely Garcia, and Darko Marinov. Automated testing of refactoring engines. In International Symposium on the Foundations of Software Engineering, 2007.
[31] Robert Dąbrowski and Wojciech Plandowski. On word equations in one variable. In International Symposium on Mathematical Foundations of Computer Science, 2002.
[32] Robert Dąbrowski and Wojciech Plandowski. Solving two-variable word equations. Lecture Notes in Computer Science, 2004.
[33] Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008.
[34] Greg Dennis, Kuat Yessenov, and Daniel Jackson. Bounded verification of voting software. In International Conference on Verified Software: Theories, Tools, Experiments, 2008.
[35] Niklas Eén and Niklas Sörensson. An extensible SAT solver. In International Conference on Theory and Applications of Satisfiability Testing, 2003.
[36] Michael Emmi, Rupak Majumdar, and Koushik Sen. Dynamic test input generation for database applications. In International Symposium on Software Testing and Analysis, 2007.
[37] Dawson R. Engler and Madanlal Musuvathi. Static analysis versus software model checking for bug finding. In International Conference on Verification, Model Checking, and Abstract Interpretation, 2004.
[38] BRICS finite state automata utilities. http://www.brics.dk/automaton/faq.html.
[39] Finite state automata utilities. http://www.let.rug.nl/~vannoord/Fsa/fsa.html.
[40] AT&T Finite State Machine Library. http://www.research.att.com/~fsmtools/fsm.
[41] Xiang Fu, Xin Lu, Boris Peltsverger, Shijun Chen, Kai Qian, and Lixin Tao. A static analysis framework for detecting SQL injection vulnerabilities. In International Computer Software and Applications Conference, 2007.
[42] Vijay Ganesh and David L. Dill. A decision procedure for bit-vectors and arrays. In International Conference on Computer Aided Verification, 2007.
[43] Patrice Godefroid. The soundness of bugs is what matters (position state-
ment). In Workshop on the Evaluation of Software Defect Detection Tools, 2005.
[46] Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: Directed automated random testing. In Programming Language Design and Implementation, 2005.
[47] Patrice Godefroid, Michael Y. Levin, and David A. Molnar. Active property checking. In International Conference on Embedded Software, 2008.
[53] Kenneth V. Hanford. Automatic generation of test cases. IBM Systems Journal,
9(4), 1970.
[54] Pieter Hooimeijer and Westley Weimer. A decision procedure for subset constraints over regular languages. In Programming Language Design and Implementation, 2009.
[55] John E. Hopcroft and Jeffrey D. Ullman. Introduction to automata theory, lan-
guages and computation. Addison-Wesley Series in Computer Science, 1979.
[56] Yao-Wen Huang, Fang Yu, Christian Hang, Chung-Hung Tsai, Der-Tsai Lee,
and Sy-Yen Kuo. Securing Web application code by static analysis and run-
time protection. In International World Wide Web Conference, 2004.
[59] Daniel Jackson and Mandana Vaziri. Finding bugs with a constraint solver. In International Symposium on Software Testing and Analysis, 2000.
[60] Karthick Jayaraman, David Harvison, Vijay Ganesh, and Adam Kiezun. jFuzz: A concolic whitebox fuzzer for Java. In NASA Formal Methods Symposium, 2009.
[61] Susmit Kumar Jha, Sanjit A. Seshia, and Rhishikesh Shrikant Limaye. On the
computational complexity of satisfiability solving for string theories. Techni-
cal Report UCB/EECS-2009-41, EECS Department, University of California,
Berkeley, 2009.
[63] Adam Kiezun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, and
Michael D. Ernst. Hampi: A solver for string constraints. In International
Symposium on Software Testing and Analysis, 2009.
[64] Adam Kiezun, Philip J. Guo, Karthick Jayaraman, and Michael D. Ernst. Automatic creation of SQL injection and cross-site scripting attacks. In International Conference on Software Engineering, 2009.
[65] James C. King. Symbolic execution and program testing. Communications of the ACM, 19(7), 1976.
[66] Nils Klarlund. Mona & Fido: The logic-automaton connection in practice. In International Workshop on Computer Science Logic, 1998.
[67] Nils Klarlund and Michael I. Schwartzbach. A domain-specific language for
regular sets of strings and trees. IEEE Transactions on Software Engineering,
25(3), 1999.
[68] Monica S. Lam, Michael Martin, Benjamin Livshits, and John Whaley. Securing Web applications with static and dynamic information flow tracking. In Symposium on Partial Evaluation and Semantics-based Program Manipulation, 2008.
[70] Benjamin Livshits and Monica Lam. Finding security vulnerabilities in Java
applications with static analysis. In USENIX Security Symposium, 2005.
[71] Rupak Majumdar and Koushik Sen. LATEST: Lazy dynamic test input gener-
ation. Technical Report UCB/EECS-2007-36, EECS Department, University
of California, Berkeley, 2007.
[72] Rupak Majumdar and Ru-Gang Xu. Directed test generation using symbolic grammars. In Automated Software Engineering, 2007.
[73] G. S. Makanin. The problem of solvability of equations in a free semigroup. Sbornik: Mathematics, 32(2), 1977.
[74] Brian A. Malloy and James F. Power. An interpretation of Purdom's algorithm for automatic generation of test cases. In International Conference on Computer and Information Science, 2001.
[75] Michael Martin and Monica Lam. Automatic generation of XSS and SQL in-
jection attacks with goal-directed model checking. In USENIX Security Sym-
posium, 2008.
[76] Peter M. Maurer. Generating test data with enhanced context-free grammars.
IEEE Software, 7(4), 1990.
[77] Sean McAllister, Engin Kirda, and Christopher Krügel. Leveraging user interactions for in-depth testing of Web applications. In International Symposium on Recent Advances in Intrusion Detection, 2008.
[78] Russell A. McClure and Ingolf H. Krüger. SQL DOM: compile time checking of dynamic SQL statements. In International Conference on Software Engineering, 2005.
[79] Bruce McKenzie. Generating strings at random from a context free gram-
mar. Technical Report TR-COSC 10/97, Department of Computer Science,
University of Canterbury, 1997.
[80] David Melski and Thomas Reps. Interconvertibility of set constraints and context-free language reachability. In Symposium on Partial Evaluation and Semantics-based Program Manipulation, 1997.
[81] Barton P. Miller, Lars Fredriksen, and Bryan So. An empirical study of the
reliability of UNIX utilities. Communications of the ACM, 33(12), 1990.
[82] Barton P. Miller, Lars Fredriksen, and Bryan So. An empirical study of the reliability of UNIX utilities. Communications of the ACM, 33(12), 1990.
[84] David A. Molnar and David Wagner. Catchconv: Symbolic execution and
run-time type inference for integer conversion errors. Technical Report
UCB/EECS-2007-23, EECS Department, University of California, Berkeley,
Feb 2007.
[85] Robert C. Moore. Removing left recursion from context-free grammars. In Conference of the North American Chapter of the Association for Computational Linguistics, 2000.
[86] Matthew W. Moskewicz, Conor F. Madigan, Ying Zhao, Lintao Zhang, and Sharad Malik. Chaff: engineering an efficient SAT solver. In Design Automation Conference, 2001.
[87] James Newsome and Dawn Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Network and Distributed System Security Symposium, 2005.
[89] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. In International Conference on Software Engineering, 2007.
[90] Ruoming Pang, Vern Paxson, Robin Sommer, and Larry Peterson. binpac: a yacc for writing application protocol parsers. In ACM SIGCOMM Conference on Internet Measurement, 2006.
[91] Gilles Pesant. A regular language membership constraint for finite sequences of variables. In International Conference on Principles and Practice of Constraint Programming, 2004.
[92] Tadeusz Pietraszek and Chris Vanden Berghe. Defending against injection attacks through context-sensitive string evaluation. In International Symposium on Recent Advances in Intrusion Detection, 2005.
[94] Paul Purdom. A sentence generator for testing parsers. BIT Numerical Math-
ematics, 12(3), 1972.
[97] Hui Ruan, Jian Zhang, and Jun Yan. Test data generation for C programs
with string-handling functions. In Theoretical Aspects of Software Engineering
Conference, 2008.
[101] Koushik Sen and Gul Agha. CUTE and jCUTE: Concolic unit testing and
explicit path model-checking tools. Lecture Notes in Computer Science, 4144,
2006.
[102] Koushik Sen, Darko Marinov, and Gul Agha. CUTE: A concolic unit testing engine for C. In International Symposium on the Foundations of Software Engineering, 2005.
[103] Daryl Shannon, Sukant Hajra, Alison Lee, Daiqian Zhan, and Sarfraz Khurshid. Abstracting symbolic execution with string analysis. In Testing: Academic and Industrial Conference Practice and Research Techniques, 2007.
[104] Ilya Shlyakhter, Manu Sridharan, Robert Seater, and Daniel Jackson. Exploiting subformula sharing in automatic analysis of quantified formulas. In International Conference on Theory and Applications of Satisfiability Testing, 2003.
[108] Kevin Sullivan, Jinlin Yang, David Coppit, Sarfraz Khurshid, and Daniel
Jackson. Software assurance by bounded exhaustive testing. In International
Symposium on Software Testing and Analysis, 2004.
[109] Michael Sutton, Adam Greene, and Pedram Amini. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley, 2007.
[110] Nikolai Tillmann and Jonathan de Halleux. Pex: White box test generation for .NET. Lecture Notes in Computer Science, 4966, 2008.
[111] Emina Torlak and Daniel Jackson. Kodkod: A relational model finder. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2007.
[113] Gary Wassermann, Carl Gould, Zhendong Su, and Premkumar Devanbu. Static checking of dynamically generated queries in database applications. In International Conference on Software Engineering, 2004.
[114] Gary Wassermann and Zhendong Su. Sound and precise analysis of Web
applications for injection vulnerabilities. In Programming Language Design
and Implementation, 2007.
[115] Gary Wassermann and Zhendong Su. Static detection of cross-site scripting vulnerabilities. In International Conference on Software Engineering, 2008.
[116] Gary Wassermann, Dachuan Yu, Ajay Chander, Dinakar Dhurjati, Hiroshi Inamura, and Zhendong Su. Dynamic test input generation for Web applications. In International Symposium on Software Testing and Analysis, 2008.
[117] Yichen Xie and Alex Aiken. Static detection of security vulnerabilities in
scripting languages. In USENIX Security Symposium, 2006.
[118] Yichen Xie and Alex Aiken. Saturn: A scalable framework for error detection using Boolean satisfiability. In International Conference on Computer Aided Verification, 2007.
[119] Maxwell Krohn, Alexander Yip, Micah Brodsky, Natan Cliffer, Frans Kaashoek, Eddie Kohler, and Robert Morris. Information flow control for standard OS abstractions. In Symposium on Operating Systems Principles, 2007.
[120] Karen Zee, Viktor Kuncak, and Martin C. Rinard. Full functional verification of linked data structures. In Programming Language Design and Implementation, 2008.
[121] Nickolai Zeldovich, Silas Boyd-Wickizer, and David Mazières. Securing distributed systems with information flow control. In Symposium on Networked Systems Design and Implementation, 2008.