
Baars, A.I., Harman, M., Hassoun, Y., Lakhotia, K., McMinn, P., Tonella, P. and Vos, T. (2011) Symbolic search-based testing. In: Alexander, P., Pasareanu, C.S. and Hosking, J.G. (eds.) Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), 06-10 Nov 2011, Lawrence, KS, USA. IEEE, pp. 53-62. ISBN 978-1-4577-1638-6. https://doi.org/10.1109/ASE.2011.6100119

Symbolic Search-Based Testing

Arthur Baars, Universidad Politécnica de Valencia, Valencia, Spain (abaars@pros.upv.es)
Mark Harman, University College London, CREST Centre, London, U.K. (mark.harman@ucl.ac.uk)
Youssef Hassoun, King's College London, London, U.K. (youssef.hassoun@kcl.ac.uk)
Kiran Lakhotia, University College London, CREST Centre, London, U.K. (k.lakhotia@ucl.ac.uk)
Phil McMinn, University of Sheffield, Sheffield, U.K. (p.mcminn@sheffield.ac.uk)
Paolo Tonella, Fondazione Bruno Kessler, Trento, Italy (tonella@fbk.eu)
Tanja Vos, Universidad Politécnica de Valencia, Valencia, Spain (tvos@dsic.upv.es)

Abstract—We present an algorithm for constructing fitness functions that improve the efficiency of search-based testing when trying to generate branch adequate test data. The algorithm combines symbolic information with dynamic analysis and has two key advantages: it does not require any change in the underlying test data generation technique, and it avoids many problems traditionally associated with symbolic execution, in particular the presence of loops. We have evaluated the algorithm on industrial closed source and open source systems using both local and global search-based testing techniques, demonstrating that both are statistically significantly more efficient using our approach. The test for significance was done using a one-sided, paired Wilcoxon signed rank test. On average, the local search requires 23.41% and the global search 7.78% fewer fitness evaluations when using a symbolic execution based fitness function generated by the algorithm.

Index Terms—Search-Based Testing, Symbolic Execution, Fitness Functions
I. INTRODUCTION

Automation is essential in software testing because the process is very slow, and consequently expensive, if undertaken manually. This need to automate software testing has provided a rich set of challenging problems for the research community for over thirty years. One approach to software test automation that has received a great deal of recent attention is Search-Based Software Testing (SBST). SBST uses meta-heuristic algorithms to automate the generation of test inputs that meet a test adequacy criterion. One of the most widely-studied test adequacy criteria in SBST is branch coverage ([1], [2], [3], [4], [5], [6], [7]), the adequacy criterion considered in this paper.

Despite the large body of work in SBST focusing on branch coverage, the state-of-the-art fitness function definitions used for branch adequate testing have changed little since the early seminal work on the Daimler Automated Software Testing System, which has been in use for more than a decade [7]. Though there have been many developments in SBST, these focus on changing the search algorithms and the way in which they are used, rather than the underlying fitness functions on which all meta-heuristic search relies.

This paper takes a different approach and proposes to use static analysis. In particular, we use a form of partial symbolic execution to statically collect information available at compile time that can be used to define richer and more expressive fitness functions. We do not perform a complete symbolic execution, as this would be computationally expensive. Rather, we compute smaller amounts of symbolic information that can be used to imbue a fitness function with a much finer characterisation of the true search landscape, which defines the location of the global optima that represent the coverage of individual branches.

Our aim in attacking the underlying fitness function is to provide an approach that makes SBST more efficient (and possibly more effective), regardless of the particular search algorithm used to generate the test data. We present results for two widely used approaches to demonstrate, empirically, that our approach does indeed make SBST more efficient.

There are many search algorithms that we could have chosen to study in our experimental work. A recent survey of search-based approaches in software engineering [8] listed 15 search-based algorithms that have been used in SBST work. Clearly, it is not possible to report results for all of them in this paper. Rather than make an arbitrary choice of algorithms to study, we chose to empirically study a local search (the widely used Alternating Variable Method (AVM) of Korel [3]) and a global search (the Genetic Algorithm approach used by Wegener et al. [7] and widely followed by subsequent SBST research). Our reason for this choice was that these two approaches characterise the two possible outcomes for the primary choice of which algorithm to use: whether it will be local or global. Most of the other algorithms used in SBST are formed by a combination of local and global search.

Our results indicate that the partial symbolic information we compute can indeed improve the efficiency of SBST, both for real world production code and for open source. In the case of a local search, the information also leads to improved effectiveness of SBST. The primary contributions of this paper are:

• We introduce an algorithm for improving SBST by enriching fitness functions with statically collected symbolic information.
Because our approach targets the fitness function itself, it applies to any and every SBST technique and can be incorporated without change to the search-based algorithm that uses the enriched fitness function.

• We introduce an approach to overcome the problem of loops in traditional symbolic execution that allows us to approximate the impact of symbolic information on fitness.

• We introduce a new metric called the approximation level¹ to account for uncertainty whenever we cannot compute precise symbolic information, such as in the presence of loops.

• We present the results of an empirical study on both open and closed source code, the results of which indicate that our enriched fitness functions are significantly more efficient than their traditional counterparts.

¹ Note that the definition of approximation level in this paper is not to be confused with the approximation level defined in [7], which is equivalent to the approach level metric described in Section II.

The rest of the paper is organised as follows: The next section provides an overview of the standard fitness function used in SBST for branch coverage. Section III introduces our fitness function enhancement approach, while Section IV introduces the code analysis algorithm based on symbolic execution. Section V presents the empirical study, with corresponding threats to validity discussed in Section VI. Section VII describes related work and Section VIII draws conclusions.

II. BACKGROUND

Meta-heuristic algorithms rely on a fitness function to guide the search towards a global optimum, i.e., the desired test data. For branch coverage, the state-of-the-art fitness function comprises two measures: a branch distance and an approach level. When both these measures are 0, the desired test data has been found.

The approach level records how many of a target branch's control dependent nodes were not executed by a particular input. The fewer control dependent nodes executed, the 'further away' the input is from executing the target in control flow terms. Consider the example from Figure 1 and assume the target is the true branch of node (3). If an input takes the false branch at node (1), the approach level is 2; if an input takes the false branch at node (2), the approach level is 1, and so forth.

Whenever an input misses the target branch, the branch distance measure is used to compute how close the input was to staying on a path leading to the target. It is computed using the condition of the last Control Flow Graph (CFG) node in an input's execution trace which holds a transitive control dependence on the target, and where execution diverged from the target. Resuming the example from Figure 1, if an input takes the false branch at node (1), the branch distance is computed through |x - 0| + K, where K is a failure constant (K = 1 throughout this paper). Different branch distance formulae exist depending on the relational predicate types used within the condition of branching nodes on which the target is control dependent. The interested reader is referred to the work of Tracey et al. [9] for a complete list of branch distance functions.

        void foo(int x, int y, int z) {
    (1)     if (x == 0)
    (2)         if (y == z)
    (3)             if (x == z) {
    (4)                 // TARGET
                    }
        }

Fig. 1. Example C code used to demonstrate the standard fitness function used in Search-Based Testing. (The original figure also shows, on the right, the CFG sub-graph from node (1) to the target, with edges labelled a, b and c; the graph itself is omitted here.)
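To make these measures concrete, the following is a minimal C sketch (our illustration, not code from the paper's tool) of branch distance calculations in the style of Tracey et al. [9], applied to the divergence at node (1) of Figure 1; the helper names are ours:

    #include <math.h>

    #define K 1.0  /* failure constant; K = 1 throughout this paper */

    /* Branch distance for a predicate a == b: zero iff the predicate
     * holds, otherwise |a - b| + K. */
    static double dist_eq(double a, double b)
    {
        return a == b ? 0.0 : fabs(a - b) + K;
    }

    /* Branch distance for a predicate a < b. */
    static double dist_lt(double a, double b)
    {
        return a < b ? 0.0 : (a - b) + K;
    }

    /* Input diverges at node (1) of Figure 1 by taking the false
     * branch: the branch distance is |x - 0| + K. */
    static double branch_distance_node1(int x)
    {
        return dist_eq(x, 0.0);
    }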
For a target branch t, an input vector v, and a node n where execution diverged from t, the complete fitness value is then computed by combining the branch distance and approach level:

    ff_n(t, v) = approach_level(t, v) + norm(branch_distance(t, v))

Note that the branch distance measure is normalized to a value in [0, 1], using either of the following normalization functions [10]:

    norm(d) = 1 - 1.001^(-d), or
    norm(d) = d / (d + 1)   (used in this paper)

This fitness function can be inefficient when multiple, interdependent conditions need to be satisfied, as in the example from Figure 1. For instance, when trying to cover the true branch at node (3), the values chosen for the inputs x, y and z that satisfy the first two conditions are unlikely to traverse the true branch at node (3). This is because the probability of the search optimizing both y and z to 0 is low. In general, optimizing each condition in 'isolation', as is the case with the standard fitness function, can be considered sub-optimal.

Symbolic execution, on the other hand, is able to capture such constraints and interdependencies between variables in the form of a path condition. For example, the path condition describing the execution where all conditions evaluate to true in Figure 1 would be ⟨x = 0 ∧ y = z ∧ x = z⟩, where x, y and z denote symbolic variables corresponding to the three integer inputs x, y, z. For the purpose of testing, a path condition can be fed to a constraint solver to obtain concrete input values which can be used to execute the program. However, it is well known that static symbolic execution of a program faces several challenges, arising from loops and code constructs that cannot easily be symbolically executed, such as unknown (library) functions, complex pointer arithmetic and function pointers, to name but a few. Loops in particular are a common problem, because they can result in infinitely many program paths and, further, when trying to cover a target branch, it may not be possible to determine a priori the number of loop iterations necessary to reach the target.

The field of Dynamic Symbolic Execution (DSE), also known as concolic testing, first introduced by Godefroid et al. [11], tries to overcome some of the challenges faced by static symbolic execution. In DSE, information obtained through dynamic analysis is used to aid symbolic execution. The work presented in this paper proposes to do the opposite, i.e., to use information gathered through symbolic execution to aid SBST.

III. SYMBOLICALLY ENHANCED FITNESS FUNCTION

The hypothesis underlying the research work presented in this paper is that incorporating information obtained from symbolic execution into the fitness function for branch coverage reduces the number of fitness evaluations required to cover a branch. We call our approach fitness function enhancement because the information collected along a program path using symbolic execution is used in place of the traditional approach level and branch distance measures. Before the formal presentation of the fitness function enhancement algorithm, we provide some initial intuition.

We propose to replace the branch distance measure introduced in Section II with a path distance measure. Assume an input follows the false branch at node (1) in Figure 1 and that our target is the true branch of node (3). We start by computing a path expression [12] representing all paths from node (1) to the target. Let this path expression be abc (the edge labels a, b, c refer to the sub-graph shown on the right in Figure 1).
We then symbolically execute this path expression to obtain a set of (partial) path conditions. In our example this set is a singleton of the form {⟨x = 0 ∧ y = z ∧ x = z⟩}, because there is only one path from node (1) to the true branch of node (3). Next we apply the branch distance measure from Section II to each of the atomic conditions in the path condition (i.e., x = 0, y = z, x = z), and sum the results to form a path distance. In case we have more than one path distance, we choose the minimum for our fitness computation. The intuition behind this choice is that the path condition with the smallest path distance is the closest to being satisfied by an input.

It may not always be possible to symbolically execute a path expression due to sources of uncertainty. To account for this we introduce a second measure called the approximation level. The approximation level will be defined as the number of conditions that cannot be added to a path condition, and are thus not considered in the path distance. For example, a condition that uses variables whose definition originates from a statement inside a loop will be dropped. This is because, in general, we do not know how often a loop is executed, and thus we also do not know the final value of the variables that are defined in a loop. Other sources of uncertainty can include variables defined through system calls to which we do not have access. The next section will provide formal definitions of the approximation level and path distance, along with the algorithm for computing the enhanced fitness function (eff).

        void foo(int n, int a, int b) {
    (1)     int s = 0;
    (2)     if (n > 0) {
    (3)         for (int i = 0; i < n; i++)
    (4)             s += i;
    (5)         if (a == n) {
    (6)             if (s == 10) {
    (7)                 if (b == s) {
    (8)                     // TARGET (t)
                        }
                    }
                }
            }
        }

Fig. 2. Illustrative C code involving a loop with nested if-statements, used to demonstrate the symbolically enhanced fitness function.

A. Definitions

Let p be the path expression representing all paths between a start node n and a target node t. This path expression may contain loops, represented as terms of the form A*. For such terms, we may opt for an arbitrary level of unrolling (e.g. k times), but we cannot handle unbounded (potentially infinite) unrollings. As a consequence, when the upper limit for the number of unrollings is reached, we make the assumption that variables defined in any successive loop iterations are destroyed, since in general we cannot determine the required number of loop iterations. Such variables will be represented by the term D[A]. A path expression involving loops (i.e., A*) can thus be expanded as:

    A* = 1 + A + A² + A³ + ... + A^k D[A]

By replacing A* with this expansion in the path expression p, we obtain an approximated path expression p′ which contains some destroy terms of the form D[A]. In the path condition produced by symbolic execution of p′, we drop any clauses involving variables whose definition is inside a destroyed sub-path. Such destroyed clauses are counted, and their number defines the approximation level used in the fitness function. Path conditions built by dropping one or more conditions are said to be partial; the others are said to be complete.

Definition 1. Branch distance: A quantification in the range [0, 1] of a boolean branch condition, such that the value zero is obtained iff the condition evaluates to true. Values close to 1 indicate that the condition is far from being satisfied. Intermediate values should be such as to smoothly guide the search toward satisfying the condition.
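As an illustration of the path distance (ours, reusing the dist_eq helper sketched in Section II), the path distance for the single path condition ⟨x = 0 ∧ y = z ∧ x = z⟩ of Figure 1 is simply the sum of the branch distances of its atomic conditions:

    /* Path distance for the path condition <x = 0, y = z, x = z>
     * from Figure 1: the sum of the branch distances of the atomic
     * conditions. It is zero iff all three conjuncts hold. */
    static double path_distance_fig1(int x, int y, int z)
    {
        return dist_eq(x, 0) + dist_eq(y, z) + dist_eq(x, z);
    }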
Definition 2. Path distance: A quantification of the (partial or complete) path condition, given by the sum of the branch distances computed for the conditions appearing as non-destroyed in the path condition for the approximated path expression. It is zero when all conjuncts in the path condition evaluate to true.

Whenever we build a partial path condition, we are dropping a number of conditions which involve data dependencies originating in a loop. The number of dropped conditions is the approximation level.

Definition 3. Approximation level: The approximation level along a path is the number of conditions that are dropped from the path condition because they involve variables defined inside loops that are used in the condition.

An approximated path expression p′ can always be normalized into a sum of alternative paths. In fact, in p′, loops A* are replaced by alternative k-bounded loop iterations, hence p′ contains only sequence (multiplication) and alternative (sum) operators, which can be normalized into a sum of products by resorting to distribution of multiplication over sum.

Definition 4. Fitness function: Let the normalized approximated path expression p′ have the form p_1 + p_2 + ... + p_h. The fitness function eff for a node n is defined as:

    eff_n = min{ff_1, ff_2, ..., ff_h}

where ff_1, ff_2, ..., ff_h are the fitness functions for the atomic paths in p′, each being computed as the sum of approximation level and path distance:

    ff_i = approximation_level + path_distance(p_i)

Consider the example code in Figure 2 and assume our target is the true branch at node (7). If an input traverses the false branch at node (2), the path expression representing all paths from node (2) to the target is a(bc)*defg. We may distinguish paths not entering the loop from paths which enter it one or more times (i.e., k = 1). The first case (not entering the loop) is described by the path expression adefg. Symbolically executing this path expression yields the following path condition: ⟨n > 0 ∧ 0 ≥ n ∧ a = n ∧ s = 10 ∧ b = s⟩. The path described by this path condition is clearly infeasible, because the conditions n > 0 and 0 ≥ n are mutually exclusive. Hence, we will not consider it any further.

The second case (entering the loop one or more times) can be described by the path expression abcD[bc]defg. The term D[bc] indicates that all variable definitions occurring along the path bc are to be treated as unknown, because the number of iterations for the loop bc is unknown (it will be one or more). For the example in Figure 2, two variables are defined inside the loop bc: s and i. Since we do not know how often the loop will be executed, we also do not know the final values of s and i when we exit the loop. Therefore we drop any conditions obtained by symbolically executing the sub-path following the term D[bc] that involve such variables, i.e., we do not add those conditions to the path condition. The approximation level accounts for this by being incremented for each condition that is dropped.

Completing our example, symbolically executing the path expression abcD[bc]defg yields the path condition ⟨n > 0 ∧ 0 < n ∧ a = n⟩. Since we dropped three conditions (1 ≥ n, s = 10, b = s), the approximation level is 3. Thus, the approximation level allows us to distinguish an input reaching node (2) and taking the false branch from an input taking the true branch at node (2) and thus getting closer to the target. Note that the approximation level reaches 0 once we reach node (5).
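Definition 4 can be sketched in C as follows; the struct layout and the callback-based evaluation are our own illustrative assumptions about how the per-path data produced by the analysis of Section IV might be represented:

    #include <float.h>

    /* One alternative path p_i from the normalized approximated path
     * expression: its approximation level (the number of dropped
     * conditions) and a routine evaluating its path distance. */
    struct path_fitness {
        int approximation_level;
        double (*path_distance)(const int *input);
    };

    /* eff_n = min{ff_1, ..., ff_h}, where
     * ff_i = approximation_level + path_distance(p_i). */
    static double eff(const struct path_fitness *paths, int h,
                      const int *input)
    {
        double best = DBL_MAX;
        for (int i = 0; i < h; i++) {
            double ff_i = paths[i].approximation_level
                        + paths[i].path_distance(input);
            if (ff_i < best)
                best = ff_i;
        }
        return best;
    }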
IV. ALGORITHM TO COMPUTE SYMBOLICALLY ENHANCED FITNESS FUNCTIONS

Algorithm 1: Compute symbolically enhanced fitness functions
Input: CFG: control flow graph of the program under test; t: target edge to be covered
Output: eff_n: fitness function to be used by each test case reaching node n, for each CFG node n holding a transitive control dependency on t

 1: for each CFG node n ∈ N such that n holds a transitive control dependency on t do
 2:    Compute the sub-graph subCFG_n of CFG from n to t, i.e., the intersection between nodes/edges forward reachable from n and nodes/edges backward reachable from t
 3:    Apply the node reduction algorithm [12] to determine the path expression p for subCFG_n
 4:    Compute the approximated path expression p′ from p by approximating loops A* in the path expression p as A* ≈ 1 + A + A² + A³ + ... + A^k D[A], for some fixed value of k
 5:    Normalize the approximated path expression p′ into a sum of products: p′ = p_1 + p_2 + ... + p_h
 6:    for each path p_i in the normalized path expression p′ do
 7:       Perform a symbolic execution along p_i, keeping track of destroyed variables and annotating destroyed conditions as D[c]; the result is the path condition pc_i
 8:       Turn the path condition pc_i into a fitness function ff_i by replacing conditions with branch distances and destroyed conditions with 1
 9:    end for
10:    Define the fitness function eff_n for node n as: eff_n = min{ff_1, ff_2, ..., ff_h}
11: end for

Algorithm 1 shows the pseudo-code for the computation of the enhanced fitness functions introduced in the previous section. Input to the algorithm is a program, represented as its Control Flow Graph (CFG), and a CFG edge t that represents the current test target, i.e., the branch to be covered. The output produced by the algorithm is a set of symbolically enhanced fitness functions, one for each CFG node n that holds a transitive control dependency on t. For each such node that is part of the execution trace of an input, the corresponding fitness function is evaluated, with the minimum value forming the overall fitness for that input.

For all nodes n that hold a transitive control dependency on the target branch, Algorithm 1 determines the path expression p representing all paths from n to the target t (steps 2-3). Then, loops are approximated (typically as A* ≈ 1 + A D[A]) and an approximated path expression p′ is computed and normalized into a sum of products (steps 4-5). For each normalized approximated path expression p_i composing the path p′, symbolic execution is used to compute the corresponding path condition pc_i (step 7). Whenever a destroyed sub-path is encountered during the symbolic execution, all variables defined inside the sub-path are collected among the destroyed variables. Subsequently added conditions which make use of destroyed variables are marked as destroyed conditions. In step 8, the path condition pc_i is converted into a fitness function for p_i by replacing conditions with branch distances, except for destroyed conditions, which increase the approximation level by one. The final fitness function for node n is the minimum among the fitness function values computed along the alternative paths appearing in the normalized approximated path expression.

It is important to note that we cannot use a constraint solver to provide a set of input values that satisfy a path condition pc_i. This is because the path expression p_i does not always capture all execution paths from the entry node of a CFG to a target edge t. It is computed using only a sub-graph of the entire CFG (see step 2 in Algorithm 1), i.e., the graph representing all execution paths between a critical branching node and t. Consequently, pc_i may contain local variables, rendering the use of a constraint solver infeasible.
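The bookkeeping of steps 7-8 can be sketched as follows. This is our illustration only: variables are identified by small integer indices and the destroyed set is kept as a bitmask, which is an implementation choice, not something prescribed by the paper:

    #include <stdint.h>

    /* Symbolic execution state along one path p_i: which variables
     * have been destroyed by a D[A] term, and how many conditions
     * were dropped (the approximation level). */
    struct sym_state {
        uint64_t destroyed; /* bit v set => variable v is destroyed */
        int approximation_level;
    };

    /* Entering a destroyed sub-path D[A]: every variable defined in
     * A becomes unknown. */
    static void apply_destroy(struct sym_state *s, uint64_t defined_in_A)
    {
        s->destroyed |= defined_in_A;
    }

    /* Adding a condition whose used variables are given as a bitmask:
     * if any used variable is destroyed, the condition is dropped and
     * counted (it contributes 1 to ff_i via the approximation level);
     * otherwise it contributes its branch distance to the path
     * distance. */
    static void add_condition(struct sym_state *s, uint64_t used_vars,
                              double branch_distance,
                              double *path_distance)
    {
        if (used_vars & s->destroyed)
            s->approximation_level++; /* destroyed condition */
        else
            *path_distance += branch_distance;
    }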
V. EMPIRICAL STUDY

The aim of the empirical study in this paper is to analyse the impact of using the enhanced fitness function in SBST. The two research questions to be addressed by the study are as follows:

Research Question 1 - Effect of eff on branch coverage. The level of branch coverage achieved, i.e., the effectiveness of the testing technique, is often the main focus when investigating an automated test data generation approach. Our proposed change in fitness function should not negatively affect the level of coverage achieved by a test data generation technique. Does this hypothesis hold?

Research Question 2 - Effect of eff on efficiency. Alongside coverage, efficiency is also an important factor of any testing technique. Does the enhanced fitness function make SBST more efficient, and if so, what is the performance increase?

We selected two commonly used search algorithms for evaluation: a form of hill climbing known as the Alternating Variable Method (AVM), first introduced by Korel [3], and a Genetic Algorithm (GA) based on the approach described by Wegener et al. [7]. Details of the two algorithms can be found in Section V-A and Section V-B respectively. The search-based testing framework IGUANA [13] was extended to include the enhanced fitness function proposed in this paper and subsequently used to perform the test data searches.

The study was performed on 338 branches, drawn from five different C programs², two of which were provided by Daimler, two of which are open source, and one of which is the well-studied triangle program. The input domain for each function is composed of global variables and formal parameters. We chose not to use any input domain reduction and defined the domain of each variable according to its declared type. Details of the subjects used in the empirical study can be found in Table I. The programs f2 and defroster are industrial case studies provided by Daimler and represent production code for engine and rear window defroster control systems. The code is machine generated from a design model of the desired behaviour. To complement the industrial examples, two open-source case studies were selected. tiff-3.8.2 is a library for manipulating images in the Tag Image File Format (TIFF). The functions tested comprise routines for placing images on pages and for building 'overview' compressed sample images. Finally, triangle is the well-known triangle classification program, often used as a benchmark in automated test data generation studies.

² Programs were chosen arbitrarily. However, all branches in the empirical study have been used to evaluate search-based testing techniques in the past [1], [2], [14]. Thus, we considered them good candidates for evaluating a new fitness function for SBST.

Each search for test data was performed 30 times for every combination of fitness function and search algorithm. If test data was not found to cover a branch after 100,000 fitness evaluations, the search was terminated. Serendipitous coverage, i.e., branches covered by accident during the test data generation process, was ignored, so that a distinct search was carried out for every branch. The success or failure of each search was recorded, along with the number of fitness evaluations required to find the test data. From this, the 'success rate' of each branch can be calculated - the percentage of the 30 runs in which test data to execute the branch was found. The 30 runs were performed using an identical list of fixed seeds for random number generation, so as to provide a basis for assessment with tests for statistical significance using a one-sided, paired Wilcoxon signed rank test. Such tests are necessary to provide robust results in the presence of the inherently stochastic behaviour of the search algorithms.
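The following hypothetical driver (ours; search_for_branch and the seed list are assumptions, not part of IGUANA's published interface) illustrates this protocol: each branch gets 30 independent searches, each capped at 100,000 fitness evaluations and seeded from a fixed list so that both fitness functions see identical random sequences:

    #define RUNS 30
    #define MAX_EVALS 100000L

    /* Assumed search entry point: returns 1 if test data covering the
     * branch was found within the evaluation budget, 0 otherwise. */
    extern int search_for_branch(int branch_id, unsigned seed,
                                 long max_evals, long *evals_used);

    /* Success rate of a branch: percentage of the 30 runs in which
     * covering test data was found. */
    static double success_rate(int branch_id, const unsigned seeds[RUNS])
    {
        int successes = 0;
        for (int run = 0; run < RUNS; run++) {
            long evals = 0;
            if (search_for_branch(branch_id, seeds[run], MAX_EVALS, &evals))
                successes++;
        }
        return 100.0 * successes / RUNS;
    }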
To facilitate replication, we now discuss the configuration of the two search algorithms used in the study.

A. Alternating Variable Method Setup

The AVM is a simple but effective optimization technique [2]. It is a form of hill climbing and works by continuously changing an input parameter to a function in isolation. Initially all (arithmetic type) inputs are initialized with random values. Then, so-called exploratory moves are made for each input in turn. These consist of adding or subtracting a delta from the value of an input. For integral types the delta starts off at 1, i.e., the smallest increment (decrement). When a change leads to an improved fitness value, the search tries to accelerate towards an optimum by increasing the size of the neighbourhood move with every step. These are known as pattern moves. The formula used to calculate the delta added to or subtracted from an input is:

    δ = 2^it · dir · 10^(-prec_i)

where it is the repeat iteration of the current move (for pattern moves), dir is either -1 or 1, and prec_i is the precision of the i-th input variable. The precision applies to floating point variables only (i.e., it is 0 for integral types). It denotes a scale factor for the size of a neighbourhood move. For example, setting the precision (prec_i) of an input to 1 limits the smallest possible move to ±0.1. Increasing the precision to 2 limits the smallest possible move to ±0.01, and so forth. For all experiments carried out in this paper, the precision for floating point variables was fixed at 3.

Once no further improvements can be found for an input, the search continues optimizing the next input parameter, and may recommence with the first input if necessary. In case the search stagnates, i.e., no move leads to an improvement, the search restarts at another randomly chosen location in the search space. This is known as a random restart strategy and is designed to overcome local optima and enable the AVM to explore a wider region of the input domain for the function under test.

TABLE I
Details of the test subjects. The Lines of Code column contains the ansic output of the SLOCCount tool [15] used in its default setting and applied to the root source directory of each program.

Test Subject / Function    | Lines of Code | Branches | Loops | Approximate Domain Size
bibclean                   | 10,252        |          |       |
  check ISBN               |               | 54       | 1     | 2^112
  check ISSN               |               | 54       | 1     | 2^112
defroster                  | 179           |          |       |
  Defroster main           |               | 72       | 0     | 2^137
f2                         | 305           |          |       |
  F2                       |               | 46       | 0     | 2^272
tiff-3.8.2                 | 47,794        |          |       |
  TIFF GetSourceSamples    |               | 32       | 2     | 2^135
  TIFF SetSample           |               | 28       | 0     | 2^1102
  PlaceImage               |               | 24       | 0     | 2^8402
triangle                   | 53            |          |       |
  triangle                 |               | 28       | 0     | 2^96
Total                      | 58,583        | 338      | 4     |
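The following is a minimal sketch of our reading of the AVM's moves; the evaluate callback (lower fitness is better) and the double-valued input vector are assumptions made for illustration:

    #include <math.h>

    /* delta = 2^it * dir * 10^(-prec_i): it is the repeat iteration of
     * the current (pattern) move, dir is -1 or +1, and prec_i is the
     * precision of the i-th input (0 for integral types). */
    static double avm_delta(int it, int dir, int prec_i)
    {
        return pow(2.0, it) * dir * pow(10.0, -prec_i);
    }

    /* Hill climb on input i: exploratory moves in both directions,
     * accelerating into pattern moves while fitness improves. */
    static void avm_optimize_variable(double *input, int i, int prec_i,
                                      double (*evaluate)(const double *))
    {
        double best = evaluate(input);
        for (;;) {
            int improved = 0;
            for (int dir = -1; dir <= 1; dir += 2) {
                int it = 0; /* delta doubles with each accepted move */
                for (;;) {
                    double old = input[i];
                    input[i] += avm_delta(it, dir, prec_i);
                    double f = evaluate(input);
                    if (f < best) { best = f; it++; improved = 1; }
                    else { input[i] = old; break; }
                }
            }
            if (!improved)
                break; /* no move helps: move on to the next input */
        }
    }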
B. Genetic Algorithm Setup

A GA is a global search algorithm first proposed by Holland in the 1970s [16]. The configuration of the GA used in this paper is based on the approach described by Wegener et al. [7], who used the GEATbx toolbox by Hartmut Pohlheim [17]. An overall population of 300 individuals is divided into six competing sub-populations, which begin with 50 individuals each. Each sub-population evolves separately using selection, recombination, mutation and re-insertion strategies.

After evaluation, individuals in each sub-population are sorted using a linear ranking method [18] with a selection pressure of 1.7. Then, individuals are selected for reproduction through Stochastic Universal Sampling (SUS) [19]. In SUS, the probability of an individual being selected is proportionate to its (rank-based) fitness value. Selected individuals are recombined using a discrete recombination strategy [20], whereby an offspring receives each gene from either parent with an equal probability. After recombination, offspring individuals are mutated according to the breeder genetic algorithm mutation strategy [20]. The mutation operator is applied with probability 1/len, where len is the number of genes in an individual (i.e., the length of the input vector). For each gene to be mutated, a mutation range r_i = size · dom_i is defined, where dom_i is the domain size of the i-th input parameter and size is a mutation step size. The mutation step size varies for each of the six sub-populations and is defined as size = 10^(-pop), with 1 ≤ pop ≤ 6. The mutated value of an input parameter can thus be computed as v_i = x_i ± r_i · η. Addition or subtraction is chosen with equal probability, and η = Σ_{x=0..15} α_x · 2^(-x), where α_x is 1 with a probability of 1/16 and 0 otherwise. After mutation, offspring are reinserted into a sub-population using an elitist reinsertion strategy. That is, the top 10% of the current generation is retained and the remaining individuals are replaced by fitter offspring.

A feature of the Wegener model is that the six sub-populations of the GA compete with one another for the number of individuals each sub-population evolves. An average fitness value is computed for each sub-population and this value is used to linearly rank the sub-populations (again using a selection pressure of 1.7). The rank-based fitness value rank of a sub-population is then used to compute a progress value prog for the population in generation g using the formula prog_{g+1} = 0.9 · prog_g + 0.1 · rank. Then, after every four generations, the populations are ranked according to their progress value prog, and the size of each sub-population is updated, with weaker sub-populations transferring individuals to stronger ones. However, no sub-population can lose its last five individuals, preventing it from dying out. Finally, a general migration of individuals takes place after every 20th generation, where sub-populations randomly exchange 10% of their individuals with one another.
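The mutation operator described above can be sketched in C as follows (our illustration; rand01 is an assumed uniform random helper):

    #include <math.h>
    #include <stdlib.h>

    static double rand01(void) { return rand() / (RAND_MAX + 1.0); }

    /* eta = sum over x = 0..15 of alpha_x * 2^(-x), where each
     * alpha_x is 1 with probability 1/16 and 0 otherwise. */
    static double breeder_eta(void)
    {
        double eta = 0.0;
        for (int x = 0; x <= 15; x++)
            if (rand01() < 1.0 / 16.0)
                eta += pow(2.0, -x);
        return eta;
    }

    /* Mutate one gene: v_i = x_i +/- r_i * eta, with r_i = size * dom_i
     * and size = 10^(-pop) for sub-population pop (1 <= pop <= 6).
     * The 1/len per-gene mutation probability is the caller's job. */
    static double breeder_mutate(double x_i, double dom_i, int pop)
    {
        double size = pow(10.0, -pop);
        double r_i = size * dom_i;
        int sign = rand01() < 0.5 ? -1 : 1; /* +/- equally likely */
        return x_i + sign * r_i * breeder_eta();
    }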
C. Results

Research Question 1 - Effect of eff on branch coverage. Figure 3 shows the coverage achieved by the AVM and the GA for each test subject. A branch is counted as covered if the search for test data succeeded in at least one out of the thirty runs. As can be seen, using a symbolically enhanced fitness function does not negatively affect the level of branch coverage achieved by either local or global search. Instead, the GA is able to cover a branch that it previously failed to cover. Similarly, the local search is able to cover more branches when using the enhanced fitness functions.

[Fig. 3 (two bar charts, omitted): Branch coverage achieved by the Genetic Algorithm (top) and the Alternating Variable Method (bottom) when using the standard ("Std. FF") and enhanced ("Enhanced FF") fitness functions, per test subject. The graphs confirm that symbolically enhanced fitness functions are equally or more effective than the standard fitness functions.]

To gain a better understanding of how the proposed approach affects each search algorithm, we also computed the success rate for each search target. Table II lists the branches for which we observed a difference in success rate when using the enhanced fitness function. The GA exhibits little variation. For three branches, the success rate is slightly reduced when using the enhanced fitness function. However, for five branches the success rate increases. Compared to the GA, the enhanced fitness function has a bigger impact on the success rate of the AVM. For branches where we observed a difference, the trend is an increase in success rate. Five branches stand out particularly because the AVM failed to find test data for these using the standard fitness function. With the enhanced fitness function, the search was able to find the required test data in all of the 30 runs.

The effect of the enhanced fitness function is not always beneficial though; for example, branches in the function F2 from Daimler are covered with a reduced success rate. This function is interesting because some if statements check whether a subtraction operation (on operands of type short int) resulted in an over- or underflow. For example, for one branch where the enhanced fitness function performed worse than the standard fitness function, the path distance measure is computed using the path condition ⟨V11 ≥ 0 ∧ V9 < 0 ∧ (V11 − V9) < 0⟩. The conjuncts of the path condition correspond to three nested if statements in the original code. When the path distance is computed, the first two conjuncts pull in the opposite direction of the last conjunct. That is, as the branch distance for the first two conjuncts converges towards 0, the branch distance for the third conjunct increases until an overflow occurs. The standard fitness function, which optimizes each of the conjuncts in turn, does not appear to suffer from this problem and is able to reliably find the required test data. All cases where the enhanced fitness function did worse than the standard fitness function for F2 were in code that checks for over- or underflow errors.
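The generated source is not available to us, but the general shape of such a check, with the three nested if statements corresponding to the three conjuncts, would be something like the following hypothetical fragment:

    /* Hypothetical reconstruction (ours) of the shape of F2's check:
     * with V11 >= 0 and V9 < 0, the short subtraction V11 - V9 can
     * only appear negative if it wrapped around the range of short. */
    static int underflow_detected(short V9, short V11)
    {
        if (V11 >= 0)
            if (V9 < 0)
                if ((short)(V11 - V9) < 0)
                    return 1; /* target branch: overflow occurred */
        return 0;
    }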
Research Question 2 - Effect of eff on efficiency. The results support the hypothesis that enhancing the fitness function with information gathered from symbolic execution can reduce the number of fitness evaluations required to cover a branch. Details of the average number of fitness evaluations required by each search technique are given in Table III. The trend for the GA is to require fewer fitness evaluations with the enhanced fitness function. This difference is particularly visible for three functions, where we observed more than a 25% reduction in fitness computations. However, there is again one case (PlaceImage from tiff-3.8.2) where we see a small increase in the number of fitness evaluations.

As with the success rates, the AVM benefits more from the enhanced fitness function than the GA. Four functions require fewer than 50% of the fitness evaluations compared to the standard fitness function. This is not surprising, since all these functions contain branches for which the AVM failed to find test data using the standard fitness function, but for which it achieved a 100% success rate using the enhanced fitness function. Conversely, the AVM uses more fitness evaluations with the enhanced fitness function for F2, because branches are covered with a lower success rate, and each failed search results in 100,000 fitness evaluations.

To see if the differences in efficiency for the GA and the AVM are statistically significant, we used the statistical tool R [21] to perform a paired, one-sided Wilcoxon signed rank test with continuity correction and specified an alpha level of 0.01. For both the GA and the AVM we obtained a p value of 2.2 × 10^(-16). This p value indicates that the difference in the number of fitness evaluations required by each search algorithm is statistically significant (p ≤ α).

Finally, we also recorded the time taken to perform the up-front static analysis required by the enhanced fitness function. To obtain a reasonable sample pool we repeated this analysis 30 times for each function. The average analysis times, alongside standard deviations, are recorded in Table IV. Loops often result in path explosion, even when only a single loop unrolling is performed. Thus, the analysis takes longer for functions containing one or more loops. Note that the symbolic analysis is performed once per function and can be re-used by a search algorithm for all branches contained within that function. Therefore, compared to the overall execution time of the test data generation algorithms, we consider the analysis times reported in this paper acceptable. Future work might investigate how we can make the static analysis more efficient, for example by re-using symbolic information for nested branches.

TABLE II
Difference in success rates with the standard fitness function and the enhanced fitness function. Branches are only listed if there is a difference for either the AVM or GA. A 0% success rate means a search algorithm was unable to cover a branch in all of the 30 repeat runs. A 100% success rate means a search was able to find the required test data in each of the 30 trials.

Test Subject / Function (Branch ID)   | AVM Standard / Enhanced   | GA Standard / Enhanced
bibclean
  check ISSN (53T)                    | 0% / 0% (0%)              | 43.33% / 60.00% (+16.67%)
  check ISSN (55T)                    | 0% / 0% (0%)              | 16.67% / 66.67% (+50.00%)
  check ISSN (58T)                    | 0% / 0% (0%)              | 100% / 96.67% (-3.33%)
  check ISSN (58F)                    | 0% / 0% (0%)              | 100% / 76.67% (-23.33%)
f2
  F2 (4T)                             | 100% / 53.33% (-46.67%)   | 100% / 100% (0%)
  F2 (15T)                            | 100% / 46.67% (-53.33%)   | 100% / 100% (0%)
  F2 (35T)                            | 100% / 100% (0%)          | 100% / 96.67% (-3.33%)
  F2 (43T)                            | 3.33% / 3.33% (0%)        | 0% / 6.67% (+6.67%)
tiff-3.8.2
  TIFF GetSourceSamples (14T)         | 6.67% / 100% (+93.33%)    | 100% / 100% (0%)
  TIFF GetSourceSamples (17T)         | 0% / 100% (+100%)         | 100% / 100% (0%)
  TIFF GetSourceSamples (20T)         | 10.00% / 100% (+90.00%)   | 100% / 100% (0%)
  TIFF GetSourceSamples (23T)         | 10.00% / 100% (+90.00%)   | 100% / 100% (0%)
  TIFF GetSourceSamples (26T)         | 13.33% / 100% (+86.67%)   | 100% / 100% (0%)
  TIFF GetSourceSamples (29T)         | 16.67% / 100% (+83.33%)   | 100% / 100% (0%)
  TIFF GetSourceSamples (32T)         | 23.33% / 100% (+76.67%)   | 100% / 100% (0%)
  TIFF SetSample (2T)                 | 13.33% / 100% (+86.67%)   | 100% / 100% (0%)
  TIFF SetSample (5T)                 | 13.33% / 100% (+86.67%)   | 100% / 100% (0%)
  TIFF SetSample (8T)                 | 13.33% / 100% (+86.67%)   | 100% / 100% (0%)
  TIFF SetSample (11T)                | 26.67% / 100% (+73.33%)   | 100% / 100% (0%)
  TIFF SetSample (14T)                | 13.33% / 100% (+86.67%)   | 100% / 100% (0%)
  TIFF SetSample (17T)                | 20.00% / 100% (+80.00%)   | 100% / 100% (0%)
  TIFF SetSample (20T)                | 23.33% / 100% (+76.67%)   | 100% / 100% (0%)
triangle
  triangle (14T)                      | 0% / 100% (+100%)         | 100% / 100% (0%)
  triangle (15T)                      | 0% / 93.33% (+93.33%)     | 100% / 100% (0%)
  triangle (15F)                      | 0% / 100% (+100%)         | 96.67% / 100% (+3.33%)
  triangle (17T)                      | 0% / 100% (+100%)         | 100% / 100% (0%)
  triangle (17F)                      | 0% / 100% (+100%)         | 96.67% / 100% (+3.33%)
  triangle (19T)                      | 0% / 96.67% (+96.67%)     | 100% / 100% (0%)
  triangle (21T)                      | 0% / 96.67% (+96.67%)     | 100% / 100% (0%)
TABLE III
Normalized average fitness evaluations required by the GA and AVM using the standard and enhanced fitness functions.

Test Subject / Function    | AVM Standard / Enhanced    | GA Standard / Enhanced
bibclean
  check ISBN               | 100% / 99.99% (-0.01%)     | 100% / 99.99% (-0.01%)
  check ISSN               | 100% / 99.98% (-0.02%)     | 100% / 93.71% (-6.29%)
defroster
  Defroster main           | 100% / 99.57% (-0.43%)     | 100% / 68.15% (-31.85%)
f2
  F2                       | 100% / 157.44% (+57.44%)   | 100% / 91.61% (-8.39%)
tiff-3.8.2
  TIFF GetSourceSamples    | 100% / 12.41% (-87.59%)    | 100% / 73.05% (-26.95%)
  TIFF SetSample           | 100% / 8.78% (-91.22%)     | 100% / 73.87% (-26.13%)
  PlaceImage               | 100% / 99.17% (-0.83%)     | 100% / 100.53% (+0.53%)
triangle
  triangle                 | 100% / 35.35% (-64.65%)    | 100% / 98.79% (-1.21%)

TABLE IV
Average time (in milliseconds) taken over 30 trials to perform the up-front static analysis required by the enhanced fitness function. The standard deviation is shown in the rightmost column.

Test Subject / Function    | Analysis Time (ms) | (StdDev)
bibclean
  check ISBN               | 1,909,741.13       | (2,172.47)
  check ISSN               | 1,792,913.50       | (2,827.38)
defroster
  Defroster main           | 34,509.87          | (345.55)
f2
  F2                       | 318,279.23         | (431.68)
tiff-3.8.2
  TIFF GetSourceSamples    | 516,514.23         | (878.52)
  TIFF SetSample           | 1,092.77           | (12.22)
  PlaceImage               | 956.43             | (105.36)
triangle
  triangle                 | 716.03             | (6.24)

VI. THREATS TO VALIDITY

Naturally there are threats to validity in any empirical study such as this.
The first issue to address is the threat to the internal validity of the experiments, i.e., whether there has been a bias in the experimental design that could affect the obtained results. One potential source of bias comes from the configuration of the algorithms used in the test data generation tool IGUANA. The settings for the GA and AVM were taken from previous studies [1], [22], [14] that looked at generating branch adequate test data. Thus, they have been shown in the past to provide a good trade-off between effectiveness and efficiency.

Another potential source of bias comes from the inherent stochastic behaviour of the meta-heuristic search algorithms. The most reliable (and widely used) technique for overcoming this source of variability is to perform statistical tests using a sufficiently large sample of result data. In order to ensure a large sample size, experiments were repeated 30 times, providing a reasonable pool of data from which to draw observations, and ensuring sample means were normally distributed. To show that the enhanced fitness function is more efficient than the standard fitness function used in SBST, a test for a statistically significant difference in the sample means was performed. We used a one-sided, paired Wilcoxon signed rank test with the confidence level set at 99%.

A further source of bias includes the selection of the functions used in the empirical study, which could potentially affect its external validity, i.e., the extent to which it is possible to generalize from the results obtained. The study draws upon code from real world programs, both from industrial production code and from open source. While we sampled a variety of programming styles and sources, we only considered functions from five programs. Therefore caution is required before making any claims as to whether these results would be observed on other functions. Instead, the results reported herein should only be seen to provide some initial intuition, and a larger study is required to validate or refute our findings.

VII. RELATED WORK

The present paper is the first to develop an amended form of symbolic execution for SBST. Previous work on developing symbolic execution as a practical means of improving automated testing focused on constraint based testing techniques, leading to the development of the very active field now known as 'Dynamic Symbolic Execution' (DSE). This field began with the seminal work by Godefroid et al. [11] on Directed Automated Random Testing (DART), which combined symbolic execution with random testing. Since then a number of authors have followed this approach, which is sometimes referred to as 'concolic testing' [23] as well as DSE [24], [11], [25].

DSE and SBST have developed as separate schools of thought in automated software testing, each with their own advantages and disadvantages. The introduction of Dynamic Symbolic Execution represented a significant step forward in the development of the previous constraint based approaches to automated test data generation on which DSE builds. Our introduction of partial symbolic execution as a means of augmenting SBST seeks to provide a similar impetus to SBST research. Like DSE, we augment an existing test automation technique with a form of symbolic execution and, like DSE, we need to amend traditional symbolic execution to ameliorate its problems.
However, DSE performs a complete symbolic execution, sometimes using concrete values in place of symbolic values, whereas our approach does not use concrete values, but retains the symbolic nature of symbolic execution. Rather than performing a complete symbolic execution, we perform a localised or 'partial' symbolic computation and use approximation to overcome the problems of static symbolic execution.

The first authors to propose a combination of SBST and DSE were Inkumsah and Xie [26] with the EVACON framework. Their framework targets test data generation for object-oriented code written in Java and uses two existing tools: eToc [27], an evolutionary test data generation tool, and jCUTE [28], a DSE tool. Method sequences that put the class containing the method under test into specific states are constructed by eToc. Then, jCUTE is used to maximize code coverage of a given method sequence by generating values for the sequence's input parameters. The method sequences with optimized parameter values are then passed back to eToc for further optimization.

More recently, Lakhotia et al. [29] investigated a combination of SBST and DSE in order to improve DSE's ability to handle constraints over floating point variables. Their work integrated the AVM, also used in this paper, and Evolution Strategies into Pex [25], a DSE tool for .NET. Lakhotia et al. [30] also proposed a combination of symbolic execution with search in order to improve SBST. Inspired by the work on CUTE [23], they use symbolic execution to extend and improve the AVM for pointer inputs.

The work presented in this paper differs from all previous work in that it is the first to consider symbolic execution in order to improve a fitness function used in SBST. A benefit of this approach lies in its generality: it may be used with any search algorithm. Furthermore, the enhanced fitness function does not require a constraint solver, despite making use of symbolic execution techniques. The path condition generated through symbolic execution is transformed into a fitness function to guide an optimisation algorithm. This is an advantage when testing code that contains floating point computations or calls to system libraries.

VIII. CONCLUSION

This paper has introduced and evaluated a symbolic search-based software testing approach for the branch coverage test adequacy criterion. We propose to replace the existing branch distance and approach level measures with two new measures: path distance and approximation level. The new metrics make use of information gathered from symbolic execution. An empirical study, performed on 338 branches taken from a mix of open source and industrial programs, confirmed our hypothesis that a symbolically enhanced fitness function can make search algorithms more efficient. The proposed approach was evaluated with two commonly used algorithms in Search-Based Software Testing: the Alternating Variable Method and a Genetic Algorithm.

The main goal of the enhanced fitness function is to make search-based testing more efficient. However, it also enables the Alternating Variable Method, a form of hill climbing, to cover branches for which the search failed using the traditional fitness function. Future work will investigate how symbolic search-based testing can be further developed to improve not only the efficiency, but also the effectiveness, of a search algorithm.

ACKNOWLEDGEMENT

Arthur Baars, Kiran Lakhotia, Paolo Tonella and Tanja Vos are funded through the European Union project FITTEST (ICT-2009.1.2 no 257574).
Mark Harman is supported by EPSRC Grants EP/G060525/1, EP/D050863, GR/S93684 & GR/T22872 and also by the kind support of Daimler Berlin, BMS and Vizuri Ltd., London. Phil McMinn is supported in part by EPSRC grants EP/G009600/1, EP/F065825/1 and EP/I010386/1.

REFERENCES

[1] M. Harman, Y. Hassoun, K. Lakhotia, P. McMinn, and J. Wegener, "The impact of input domain reduction on search-based test data generation," in ESEC/SIGSOFT FSE, 2007, pp. 93-101.
[2] M. Harman and P. McMinn, "A theoretical and empirical study of search-based testing: Local, global, and hybrid search," IEEE TSE, vol. 36, no. 2, pp. 226-247, 2010.
[3] B. Korel, "Automated software test data generation," IEEE TSE, vol. 16, no. 8, pp. 870-879, Aug. 1990.
[4] K. Lakhotia, P. McMinn, and M. Harman, "An empirical investigation into branch coverage for C programs using CUTE and AUSTIN," The Journal of Systems and Software, vol. 83, no. 12, pp. 2379-2391, Dec. 2010.
[5] C. C. Michael, G. McGraw, and M. Schatz, "Generating software test data by evolution," IEEE TSE, vol. 27, no. 12, pp. 1085-1110, 2001.
[6] R. P. Pargas, M. J. Harrold, and R. Peck, "Test-data generation using genetic algorithms," Software Testing, Verification and Reliability, vol. 9, no. 4, pp. 263-282, 1999.
[7] J. Wegener, A. Baresel, and H. Sthamer, "Evolutionary test environment for automatic structural testing," Information & Software Technology, vol. 43, no. 14, pp. 841-854, 2001.
[8] M. Harman, A. Mansouri, and Y. Zhang, "Search based software engineering: A comprehensive analysis and review of trends, techniques and applications," Department of Computer Science, King's College London, Tech. Rep., April 2009.
[9] N. Tracey, J. A. Clark, K. Mander, and J. A. McDermid, "An automated framework for structural test-data generation," in ASE, 1998, pp. 285-288.
[10] A. Arcuri, "It does matter how you normalise the branch distance in search based software testing," in ICST, 2010, pp. 205-214.
[11] P. Godefroid, N. Klarlund, and K. Sen, "DART: directed automated random testing," ACM SIGPLAN Notices, vol. 40, no. 6, pp. 213-223, Jun. 2005.
[12] B. Beizer, Software Testing Techniques, 2nd edition. International Thomson Computer Press, 1990.
[13] P. McMinn, "IGUANA: Input Generation Using Automated Novel Algorithms. A plug and play research tool," University of Sheffield, Tech. Rep., 2007.
[14] P. McMinn, M. Harman, Y. Hassoun, K. Lakhotia, and J. Wegener, "Input domain reduction through irrelevant variable removal and its effect on local, global and hybrid search-based structural test data generation," IEEE TSE, to appear (2011).
[15] D. A. Wheeler, "More than a gigabuck: Estimating GNU/Linux's size," http://www.dwheeler.com/sloc/, Jun. 2001.
[16] J. H. Holland, "Genetic algorithms and the optimal allocation of trials," SIAM Journal of Computing, vol. 2, no. 2, pp. 88-105, Jun. 1973.
[17] H. Pohlheim, "Evolutionary algorithms: Overview, methods and operators," documentation for the Genetic and Evolutionary Algorithm Toolbox for use with Matlab (GEATbx), version 1.92, 1999.
[18] D. Whitley, "The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best," Computer Science Dept., Colorado State University, Fort Collins, CO, Tech. Rep., 1989.
[19] J. E. Baker, "Reducing bias and inefficiency in the selection algorithm," in Genetic Algorithms and their Applications (ICGA'87), J. J. Grefenstette, Ed., 1987, pp. 14-21.
[20] H. Mühlenbein and D. Schlierkamp-Voosen, "Predictive models for the breeder genetic algorithm, I: Continuous parameter optimization," Evolutionary Computation, vol. 1, no. 1, pp. 25-49, 1993.
[21] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2011, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org
[22] M. Harman, K. Lakhotia, and P. McMinn, "A multi-objective approach to search-based test data generation," in GECCO, 2007, pp. 1098-1105.
[23] K. Sen, D. Marinov, and G. Agha, "CUTE: a concolic unit testing engine for C," in ESEC/SIGSOFT FSE, 2005, pp. 263-272.
[24] C. Cadar and D. R. Engler, "Execution generated test cases: How to make systems code crash itself," in Model Checking Software, 12th International SPIN Workshop, vol. 3639, 2005, pp. 2-23.
[25] N. Tillmann and J. de Halleux, "Pex: white box test generation for .NET," in TAP, 2008, pp. 134-153.
[26] K. Inkumsah and T. Xie, "Evacon: A framework for integrating evolutionary and concolic testing for object-oriented programs," in ASE, November 2007, pp. 425-428.
[27] P. Tonella, "Evolutionary testing of classes," in ISSTA, 2004, pp. 119-128.
[28] K. Sen and G. Agha, "CUTE and jCUTE: Concolic unit testing and explicit path model-checking tools," in CAV, 2006, pp. 419-423.
[29] K. Lakhotia, N. Tillmann, M. Harman, and J. de Halleux, "FloPSy: Search-based floating point constraint solving for symbolic execution," in ICTSS, 2010, pp. 142-157.
[30] K. Lakhotia, M. Harman, and P. McMinn, "Handling dynamic data structures in search based testing," in GECCO, 2008, pp. 1759-1766.