Cost-effective test suite minimization in product lines using search techniques
Shuai Wang, Shaukat Ali, Arnaud Gotlieb
Certus Software V&V Center, Simula Research Laboratory, P.O. 134, 1325 Lysaker, Oslo, Norway
Department of Informatics, University of Oslo, P.O. 1080, 0316 Blindern, Oslo, Norway
ARTICLE INFO
Article history:
Received 1 December 2013
Revised 9 June 2014
Accepted 16 August 2014
Available online xxx
Keywords:
Product line
Search algorithm
Test suite minimization
ABSTRACT
Cost-effective testing of a product in a product line requires obtaining a set of relevant test cases from the entire test suite via test selection and minimization techniques. In this paper, we focus in particular on test minimization for product lines, which identifies and eliminates redundant test cases from test suites to reduce the total number of test cases to execute, thereby improving the efficiency of testing. However, such minimization may result in a minimized test suite with low test coverage, low fault revealing capability, and low priority test cases, and may require more time than the allowed testing budget as compared to the original test suite. To deal with these issues, we formulated the minimization problem as a search problem and defined a fitness function with optimization objectives based on the above issues. To assess the performance of our fitness function, we conducted an extensive empirical evaluation by investigating the fitness function with three weight-based Genetic Algorithms (GAs) and seven multi-objective search algorithms using an industrial case study and 500 artificial problems inspired by the industrial case study. The results show that the Random-Weighted Genetic Algorithm (RWGA) significantly outperforms the other algorithms, since RWGA can balance all the objectives together by dynamically updating weights during each generation. Based on the results of our empirical evaluation, we also implemented a tool called TEst Minimization using Search Algorithms (TEMSA) to support test minimization using various search algorithms in the context of product lines.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction
Product Line Engineering (PLE) has proven to be a cost-effective
way to develop similar products by exploiting and managing commonalities and variabilities among a large number of products
(Benavides et al., 2010; McGregor, 2001). PLE has shown promising
benefits in both academia and industry, such as reducing development
time and costs, speeding up product time-to-market and improving
the quality of products of a product line family (Benavides et al.,
2010). Cost-effective testing of a new product in a product line can be
achieved by systematic application of model-based PLE approaches
(McGregor, 2001), for example, based on feature models (Benavides
et al., 2010). Feature models have become the de-facto standard for
capturing commonalities and variabilities in product lines and can
be used for systematic selection of test cases for a new product in a
product line (Benavides et al., 2010).
* Corresponding author at: Certus Software V&V Center, Simula Research Laboratory, P.O. 134, 1325 Lysaker, Oslo, Norway. Tel.: +47 45239921.
E-mail addresses: shuai@simula.no (S. Wang), shaukat@simula.no (S. Ali), arnaud@simula.no (A. Gotlieb).
http://dx.doi.org/10.1016/j.jss.2014.08.024
0164-1212/© 2014 Elsevier Inc. All rights reserved.
New test cases for product lines are continuously added, further increasing the number of test cases. Executing all these test cases is practically impossible within the allocated testing time, which warrants efficient and sophisticated test minimization techniques (Engström and Runeson, 2011; Harman, 2011; Lopez-Herrejon et al., 2013; Runeson and Engström, 2012; Wang et al., 2013c). Such minimization may reduce cost (e.g., execution time), but on the other hand it may also reduce effectiveness (e.g., fault detection capability). This means an efficient technique must achieve a desirable balance among cost and effectiveness measures, which is a multi-objective optimization problem. A number of such objectives have been studied for multi-objective test optimization in regression testing in the existing literature, such as execution time, test coverage and fault detection capability (Harman, 2011; Yoo and Harman, 2012).
Multi-objective search algorithms are well adapted to the type of problem we are solving in this paper (Harman et al., 2009; Konak et al., 2006). Moreover, Weight-Based Genetic Algorithms (GAs) are also known to solve multi-objective optimization problems by assigning a set of weights to each objective (Konak et al., 2006). In our previous work (Wang et al., 2013c), drawing on our industrial collaboration and the existing literature (Engström and Runeson, 2011; Harman, 2011; Lopez-Herrejon et al., 2013), we defined a fitness function based on the following cost/effectiveness measures: (1) Cost: minimizing the number of test cases, thereby indirectly reducing execution cost, referred to as Test Minimization Percentage (TMP); (2) Effectiveness: high Feature Pairwise Coverage (FPC) and Fault Detection Capability (FDC). In Wang et al. (2013c), we compared three weight-based Genetic Algorithms (GAs) and showed that the Random-Weighted GA (RWGA) had significantly better performance than the other algorithms.
Further investigation at Cisco revealed the following additional cost and effectiveness measures: (1) Cost: overall execution cost of the minimized test cases; (2) Effectiveness: priority of test cases. Notice that these two additional measures have also been investigated in the existing literature (Engström and Runeson, 2011; Harman, 2011; Lopez-Herrejon et al., 2013). Moreover, we also decided to compare weight-based algorithms with other types of algorithms, such as the Fast Non-dominated Sorting Genetic Algorithm (NSGA-II) (Konak et al., 2006; Yoo and Harman, 2007). Based on these observations, this paper extends our existing work (Wang et al., 2013c) by including: (1) four effectiveness measures, i.e., Test Minimization Percentage (TMP), Feature Pairwise Coverage (FPC), Fault Detection Capability (FDC) and Average Execution Frequency (AEF), and one cost measure, i.e., Overall Execution Time (OET); (2) a multi-objective fitness function considering all the proposed cost/effectiveness measures; (3) an empirical evaluation of the proposed fitness function in conjunction with three weight-based multi-objective search algorithms and seven other multi-objective search algorithms on an industrial case study, together with a scalability assessment using 500 artificial problems of varying complexity; and (4) a tool called TEst Minimization with Search Algorithms (TEMSA) built on top of jMetal (Durillo and Nebro, 2011).
The results of our empirical evaluation showed that: (1) for the industrial case study, all the selected search algorithms except the Improved Strength Pareto Evolutionary Algorithm (SPEA2) are cost-effective for solving the test minimization problem, and RWGA achieves the best performance; (2) for the 500 artificial problems, the results are consistent with the industrial case study, i.e., RWGA outperforms all the other search algorithms. Furthermore, RWGA managed to solve the problems of varying complexity.
The rest of the paper is organized as follows. Section 2 provides a brief introduction to the selected search algorithms. Section 3 presents a formal representation of our problem, definitions and functions for the cost/effectiveness measures, and the fitness function used by all the algorithms. Section 4 describes our industrial case study and the designed artificial problems, followed by the empirical evaluation of the applied algorithms in Section 5. Section 6 presents the results and analysis of the experiments, and Section 7 presents an overall discussion based on the obtained results. In Section 8, the tool support is discussed. Section 9 addresses the threats to validity of our empirical study, and related work is discussed in Section 10. Section 11 concludes the paper and presents future work.
2. Description of the selected algorithms
In this section, we provide a brief description of the ten algorithms that were selected for our experiments. These algorithms
are classied into four working mechanisms to guide the search
Table 1
Classification of the selected search algorithms.

Different mechanisms                                                       Algorithms   Minimization before?
GAs                      Weight-Based GA                                   WBGA         Yes
                                                                           WBGA-MO      Yes
                                                                           RWGA         Yes
                         Sorting-Based GA                                  NSGA-II      Yes
                         Cellular-Based GA                                 MOCell       No
Evolutionary             Strength Pareto EA                                SPEA2        Yes
Algorithms (EAs)         Evolution Strategies                              PAES         No
Swarm Algorithm          Particle Swarm Optimization                       SMPSO        No
Hybrid Algorithm         Cellular genetic algorithm + differential         CellDE       No
                         evolution
Stochastic Algorithm     Random Search                                     RS           Yes
for a solution $s_k$ ($s_k$ comprised of test cases $\{t_{s_k1}, t_{s_k2}, t_{s_k3}, \ldots, t_{nt_{s_k}}\}$ from $TS_{p_i}$, where $1 \le nt_{s_k} \le nt_{p_i}$) from $S_{p_i}$ for testing the product $p_i$, to achieve the following two objectives (i.e., maximum effectiveness and minimum cost):
$\nexists\, s_l \in S_{p_i}$ with $s_l \neq s_k$ such that

$\sum_{j=1}^{n_{effect}} Effect_j(s_l) \ge \sum_{j=1}^{n_{effect}} Effect_j(s_k) \;\wedge\; \sum_{j=1}^{n_{cost}} Cost_j(s_l) \le \sum_{j=1}^{n_{cost}} Cost_j(s_k)$

Objective Function 1. TMP is measured using the following function:

$TMP_{s_k} = \left(1 - \frac{nt_{s_k}}{nt_{p_i}}\right) \times 100\%$

Objective Function 2. FPC is measured using the following function:

$FPC_{s_k} = \frac{Num\_FP_{s_k}}{Num\_FP_{p_i}}$
$Num\_FP_{s_k}$ is the number of feature pairs covered by the solution $s_k$, computed as $Num\_FP_{s_k} = \sum_{j=1}^{nt_{s_k}} Num\_FP_{tc_j}$, where the number of feature pairs covered by a single test case is $Num\_FP_{tc_i} = C_2^{size(F_{tc_i})}$ and $size(F_{tc_i})$ is the number of features tested by test case $tc_i$. For instance, if test case $tc_i$ is used to test three features, then the feature pairs covered by $tc_i$ are $C_2^3 = 3 \times 2/2 = 3$. Notice that if some feature pairs are repetitions of the feature pairs covered by the previous test cases, the repeated pairs are removed when computing $Num\_FP_{s_k}$. Meanwhile, feature pairs only refer to the valid feature pairs that do not violate the constraints defined among various features. In our case, there are no specific constraints defined on the features, but if such constraints existed in our context, the invalid feature pairs would be eliminated when calculating $Num\_FP_{s_k}$ and $Num\_FP_{p_i}$.
$Num\_FP_{p_i}$ is the total number of feature pairs for testing the product $p_i$, measured as $Num\_FP_{p_i} = C_2^{size(F_{p_i})} = nt_{p_i}(nt_{p_i} - 1)/2$, where $F_{p_i}$ is the set of features representing the product $p_i$, consisting of $nt_{p_i}$ features. For instance, if $p_i$ is represented by ten features, all the feature pairs covered by $p_i$ are $C_2^{10} = 45$. Notice that FPC is calculated for a chosen test solution, ranges from 0 to 1, and a higher value shows better feature pairwise coverage.
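As a concrete illustration of the FPC computation, including the de-duplication of repeated pairs described above, consider the following sketch (the function names are illustrative, not from the paper):

```python
from itertools import combinations

def num_fp(per_test_case_features):
    """Distinct feature pairs covered by a solution; repeated pairs across
    test cases are counted once, matching the Num_FP_sk definition."""
    pairs = set()
    for features in per_test_case_features:
        pairs.update(combinations(sorted(features), 2))
    return len(pairs)

def fpc(per_test_case_features, n_product_features):
    total_pairs = n_product_features * (n_product_features - 1) // 2  # C(n, 2)
    return num_fp(per_test_case_features) / total_pairs

# One test case covering three features contributes C(3, 2) = 3 pairs;
# a product with ten features has C(10, 2) = 45 pairs in total.
print(fpc([{"f1", "f2", "f3"}], 10))  # 3/45 ~= 0.067
```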
3.2.1.3. Fault detection capability (FDC). Definition 3. FDC measures the fault detection capability of a selected test solution for a product.
In our context, fault detection refers to the rate of successful execution of a test case in a given time, e.g., a week in our case. More specifically, an execution of a test case is defined as a success if it detects faults in the given time (a week in our case) and as a fail if it does not detect any fault.
Objective Function 3. FDC is measured using the following function:

$FDC_{s_k} = \frac{\sum_{i=1}^{nt_{s_k}} SucR_{tc_i}}{nt_{s_k}}$, where $SucR_{tc_i} = \frac{NumSuc_{tc_i}}{NumSuc_{tc_i} + NumFail_{tc_i}}$

$NumSuc_{tc_i}$ is the number of successes for test case $tc_i$ and $NumFail_{tc_i}$ is the number of fails for test case $tc_i$. For instance, a test case is usually executed 1000 times per week at Cisco, so if it executes successfully 800 times, its SucR is 800/1000 = 0.8.
Note that the FDC value also ranges from 0 to 1 and a higher value represents better fault detection capability.
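These two formulas transcribe directly into code (illustrative names, not from the paper):

```python
def suc_r(num_suc, num_fail):
    """Success rate of one test case: successes over total executions."""
    return num_suc / (num_suc + num_fail)

def fdc(success_rates):
    """FDC of a solution: mean success rate of its selected test cases."""
    return sum(success_rates) / len(success_rates)

print(suc_r(800, 200))        # 0.8, the Cisco example above
print(fdc([0.8, 0.6, 0.7]))   # 0.7 for a three-test-case solution
```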
3.2.1.4. Average execution frequency (AEF). Definition 4. AEF measures the average execution frequency of the solution $s_k$, based on the execution frequency of each included test case during a given time, e.g., a week.
In our context, based on domain knowledge, we found that a test suite may have high priority if it is executed very frequently during the given time. Therefore, the minimized test suite should preferably consist of test cases with high priority, which can be determined by the average execution frequency. It is thus one of our main objectives to seek a solution with a higher average execution frequency from the solution space.
Objective Function 4. The function to measure AEF is presented below:

$AEF_{s_k} = \frac{\sum_{i=1}^{nt_{s_k}} EF_{tc_i}}{nt_{s_k}}$
For instance, suppose a solution $s_k$ includes three test cases $tc_1$, $tc_2$ and $tc_3$, executed 5 times, 6 times and 7 times, respectively. Then the AEF for the solution $s_k$ is 6 (i.e., (5 + 6 + 7)/3 = 6). Notice that a higher value of AEF represents a higher average execution frequency and thereby a higher priority.
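The worked example maps to a one-line mean (hypothetical function name):

```python
def aef(exec_frequencies):
    """Average execution frequency of the selected test cases."""
    return sum(exec_frequencies) / len(exec_frequencies)

print(aef([5, 6, 7]))  # 6.0, matching the example above
```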
3.2.2. Cost measures
Cost includes one element in our context, i.e., overall execution time (OET), which is defined as below.
3.2.2.1. Overall execution time (OET). Definition 5. OET measures the overall execution time for the solution $s_k$ consisting of a set of test cases.
Objective Function 5. OET can be measured using the function below:

$OET_{s_k} = \sum_{i=1}^{nt_{s_k}} AET_{tc_i}$

$nt_{s_k}$ is the number of test cases in the solution $s_k$, where $1 \le nt_{s_k} \le nt_{p_i}$. $AET_{tc_i}$ is the average execution time for test case $tc_i$ in the solution $s_k$, which can be calculated as:

$AET_{tc_i} = \frac{\sum_{j=1}^{EF_{tc_i}} ET_{tc_i,j}}{EF_{tc_i}}$

where $ET_{tc_i,j}$ is the execution time of the $j$th execution of test case $tc_i$.
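The two cost formulas in code (illustrative names; the per-execution times are hypothetical sample data):

```python
def aet(exec_times):
    """Average execution time of one test case over its EF executions."""
    return sum(exec_times) / len(exec_times)

def oet(solution_exec_times):
    """Overall execution time: sum of the AETs of the selected test cases."""
    return sum(aet(times) for times in solution_exec_times)

# Two selected test cases with per-execution times in minutes
print(oet([[2.0, 2.5, 3.0], [10.0, 12.0]]))  # 2.5 + 11.0 = 13.5 min
```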
Since the objectives are measured on different scales, each objective value is normalized using the function

$nor(x) = \frac{x}{x + 1}$

where $x$ can be any value obtained for an objective (i.e., TMP, FPC, FDC, OET or AEF), such as 0.56 for FDC or 0.5 h for OET. The specific fitness function for the weight-based GAs (i.e., WBGA, WBGA-MO and RWGA) can then be formulated accordingly; it ranges from 0 to 1, and the lower the value, the better the fitness.
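The exact weighted formula did not survive extraction; the sketch below is one standard weighted-sum combination consistent with the stated properties (normalized objectives, range 0 to 1, lower is better). It is an assumption, not the paper's exact definition; effectiveness objectives are inverted so that maximizing them minimizes the fitness.

```python
def nor(x):
    """Normalize a non-negative objective value into [0, 1)."""
    return x / (x + 1)

def fitness(tmp, fpc, fdc, aef, oet, weights):
    """Assumed weighted-sum fitness (lower is better): effectiveness
    measures (TMP, FPC, FDC, AEF) are to be maximized, so their normalized
    values are subtracted from 1; the cost measure OET is minimized."""
    w_tmp, w_fpc, w_fdc, w_aef, w_oet = weights
    return (w_tmp * (1 - nor(tmp)) + w_fpc * (1 - nor(fpc))
            + w_fdc * (1 - nor(fdc)) + w_aef * (1 - nor(aef))
            + w_oet * nor(oet))

# Equal weights W1 = (0.2, 0.2, 0.2, 0.2, 0.2) as used for WBGA_W1
print(fitness(0.8, 0.8, 0.85, 6.0, 0.5, (0.2,) * 5))
```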
Table 2
Four products in Saturn.

Product   # features   # test cases
C20       17           138
C40       25           167
C60       32           192
C90       43           239
In this way, the multi-objective problem is converted into a single-objective problem with a scalar objective function, which is a classical approach and is efficient to solve using GAs (Konak et al., 2006). Notice that RWGA assigns weights to the objectives dynamically during each generation, so there is no need to set fixed weights for the different objectives beforehand.
The other search algorithms are based on different mechanisms and there is no need to assign particular weights to each objective function. All the cost/effectiveness measures, including the five objectives with their functions, have been defined in Section 3.2. Notice that using NSGA-II, MOCell, SPEA2, PAES, SMPSO and CellDE, a set of individual solutions can be obtained after a specific number of fitness evaluations.
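A minimal sketch of the per-generation weight assignment that RWGA relies on, drawing fresh random weights each generation and normalizing them to sum to 1 (jMetal's exact scheme may differ in detail):

```python
import random

def random_weights(n_objectives):
    """RWGA-style weights: redrawn at every generation and normalized to
    sum to 1, so no fixed weight vector has to be chosen up front."""
    raw = [random.random() for _ in range(n_objectives)]
    total = sum(raw)
    return [w / total for w in raw]

print(random_weights(5))  # one weight per objective: TMP, FPC, FDC, AEF, OET
```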
4. Industrial case study and artificial problems
First, we used an industrial case study of a Videoconferencing System (VCS) product line called Saturn, provided by Cisco Systems, Norway, to evaluate our fitness function (CiscoSystems, 2010). Saturn has several VCS products, including C20 and C90, and has more than 2000 test cases. Notice that not all test cases are relevant for each product, and new test cases are developed for Saturn every day. Based on our collaboration with Cisco, we observed that the current testing practice is performed in two ways: (1) executing all the test cases for a product; (2) randomly selecting a subset of test cases for a product. Therefore, random search can be considered an approximation of current practice, though not a very accurate one; a more faithful comparison would require additional experiments involving real testers, which can be performed in the future. In this paper, we evaluate the performance of the selected search algorithms as compared with RS.
We chose four products, C20, C40, C60 and C90, from Saturn. There are 169 features in Saturn and each product includes a subset of these features. Meanwhile, each feature can be tested by at least one test case (usually more than one). More details are shown in Table 2. For instance, C20 has 17 features and 138 test cases are used to test these features. Each test case $tc_i$ has a successful execution rate ($SucR_{tc_i}$), an average execution time ($AET_{tc_i}$) and an execution frequency ($EF_{tc_i}$) (Section 3.2). In general, for Saturn, each feature is associated with 5–10 test cases and each test case $tc_i$ is associated with 1–5 features, with $SucR_{tc_i}$ ranging from 50% to 95%, $AET_{tc_i}$ ranging from 2 min to 60 min and $EF_{tc_i}$ ranging from 1 time per week to 50 times per week.
Moreover, we defined 500 artificial problems to empirically evaluate whether the fitness function defined in Section 3 can address our test minimization problem even with a large number of features and test cases. With this objective in mind, we first created a feature repository including 1200 features and a test case repository of 60,000 test cases. For each test case $tc_i$, three key attributes are assigned randomly (inspired by our industrial case study but with expanded ranges for generality), i.e., $SucR_{tc_i}$ ranges from 0% to 100%, $AET_{tc_i}$ ranges from 1 min to 100 min and $EF_{tc_i}$ ranges from 1 time to 100 times per week. Moreover, we created artificial problems with an increasing number of features and test cases, i.e., we used a range of 10–1000 with an increment of 10 for the number of features, and each feature can be associated with a number of test cases ranging from 5 to 45 with an increment of 10, so that 100 × 5 = 500 artificial problems were obtained. Based on that, each artificial problem consists of a
set of features associated with a test suite (a set of test cases), and the goal is to minimize the test suite using the selected multi-objective algorithms.
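The generation procedure described above can be sketched as follows (hypothetical structure; the paper's generator is not shown):

```python
import random

def make_problem(n_features, tcs_per_feature):
    """One artificial problem: each feature is associated with tcs_per_feature
    test cases, each with randomly drawn SucR, AET (min) and EF (runs/week)."""
    return [{"feature": f,
             "suc_r": random.uniform(0.0, 1.0),   # 0%-100%
             "aet": random.uniform(1.0, 100.0),   # 1-100 min
             "ef": random.randint(1, 100)}        # 1-100 times per week
            for f in range(n_features)
            for _ in range(tcs_per_feature)]

# 100 feature counts (10, 20, ..., 1000) x 5 association levels (5, 15, 25, 35, 45)
problems = [make_problem(nf, k)
            for nf in range(10, 1001, 10)
            for k in range(5, 46, 10)]
print(len(problems))  # 500
```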
5. Empirical evaluation
This section first presents the experiment design, including: (1) the research questions to be addressed, (2) the selected criteria for the algorithms and the parameter settings, and (3) the evaluation mechanisms for all the search algorithms. Moreover, we also describe the relevant statistical tests used for our experiments, followed by the experiment execution.
5.1. Experiment design
The goal of our experiments is to evaluate the fitness function in conjunction with the ten search algorithms (i.e., WBGA, WBGA-MO, RWGA, NSGA-II, MOCell, SPEA2, PAES, SMPSO, CellDE and RS) in terms of addressing the optimization problem: minimizing the test suite for testing a product while achieving high TMP, FPC, FDC and AEF, and an OET within the allocated time budget.
5.1.1. Research questions
We will answer the following research questions with our experiments:
1. RQ1: Are the selected multi-objective search algorithms cost-effective for solving our test minimization problem?
2. RQ2: Are the selected multi-objective search algorithms cost-effective as compared to RS?
3. RQ3: Which of the selected multi-objective search algorithms achieves the best performance?
4. RQ4: How does increasing the number of features and test cases impact the performance of the selected multi-objective search algorithms?
5.1.2. Selected criteria for the algorithms and parameter settings
In our experiments, we first compared three weight-based multi-objective GAs and RS, i.e., WBGA with two sets of fixed weights based on domain knowledge and expertise, WBGA_W1 (W1 = (0.2, 0.2, 0.2, 0.2, 0.2)) and WBGA_W2 (W2 = (0.1, 0.2, 0.2, 0.4, 0.1)), together with WBGA-MO and RWGA. For all of them, we used a standard one-point crossover with a rate of 0.9, and mutation of a variable is done with the standard probability 1/n, where n is the number of variables (the default standard parameter settings from jMetal (Durillo and Nebro, 2011)). Meanwhile, the population size and the maximum number of fitness evaluations are set to 100 and 25,000, respectively (all these standard settings performed well in our previous work (Wang et al., 2013c)). Finally, RS was used as the comparison baseline to assess the difficulty of the addressed minimization problems (Ali et al., 2010).
Moreover, we compared the best weight-based GA with the other six selected search algorithms, i.e., NSGA-II, MOCell, SPEA2, PAES, SMPSO and CellDE (RS is still used as a baseline for assessing the difficulty of the problems (Ali et al., 2010)). For all the other algorithms, we also used the default suggested parameter settings from jMetal (Durillo and Nebro, 2011), as shown in Table 3. Notice that different settings may lead to different performance for genetic algorithms, but standard settings usually perform well (Arcuri and Briand, 2011).
5.1.3. Evaluation mechanisms
To address RQ1, a set of thresholds for TMP, FPC, FDC and OET is selected, representing the minimum acceptable values for a particular context. Notice that these thresholds are set through domain analysis and discussion with the test engineers and a test manager.
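For illustration, the RQ1 check against these minimum acceptable values can be sketched as follows (threshold values taken from Section 6.2.1.1: 0.8 for TMP, 0.8 for FPC, 0.85 for FDC and 2 h for OET; the function name is hypothetical):

```python
def meets_thresholds(tmp, fpc, fdc, oet_hours,
                     t_tmp=0.8, t_fpc=0.8, t_fdc=0.85, t_oet=2.0):
    """True if a solution reaches every minimum acceptable value:
    effectiveness objectives at or above, cost (OET) at or below."""
    return (tmp >= t_tmp and fpc >= t_fpc
            and fdc >= t_fdc and oet_hours <= t_oet)

print(meets_thresholds(0.83, 0.81, 0.88, 1.5))  # True
```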
Table 3
Parameterization for the selected search algorithms (n is the number of variables).

Parameterization used in NSGA-II and SPEA2
  Population size        100
  Selection of parents   Binary tournament + binary tournament
  Recombination          Simulated binary, crossover rate = 0.9
  Mutation               Polynomial, mutation rate = 1.0/n

Parameterization used in MOCell
  Population size        100
  Neighborhood           1-hop neighbors (8 surrounding solutions)
  Selection of parents   Binary tournament + binary tournament
  Recombination          Simulated binary, crossover rate = 0.9
  Mutation               Polynomial, mutation rate = 1.0/n
  Archive size           100

Parameterization used in PAES
  Mutation               Polynomial, mutation rate = 1.0/n
  Archive size           100

Parameterization used in SMPSO
  Swarm size             100
  Mutation               Polynomial, mutation rate = 1.0/n
  Archive size           100

Parameterization used in CellDE
  Population size        100
  Neighborhood           1-hop neighbors (8 surrounding solutions)
  Selection of parents   Binary tournament + binary tournament
  Recombination          Differential evolution, crossover rate = 0.9
  Archive size           100
The mean fitness value (MFV) for a problem with $i$ features and $j$ test cases per feature, over $n_r$ runs, and its aggregations over the feature counts and the test case associations, are computed as:

$MFV_{i,j} = \frac{\sum_{r=1}^{n_r} OFV_{i,j,r}}{n_r}$

$MFVF_i = \frac{\sum_{j=5,\,j{+}{=}10}^{45} MFV_{i,j}}{5}$

$MFVTC_j = \frac{\sum_{i=10,\,i{+}{=}10}^{1000} MFV_{i,j}}{100}$
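These three aggregations translate directly into code (illustrative names; mfv_table is assumed to map a (features, test cases per feature) pair to an MFV value):

```python
def mfv(ofv_runs):
    """Mean fitness value over the n_r repeated runs of one problem (i, j)."""
    return sum(ofv_runs) / len(ofv_runs)

def mfv_f(mfv_table, i):
    """Average MFV for feature count i across the 5 association levels j."""
    return sum(mfv_table[(i, j)] for j in range(5, 46, 10)) / 5

def mfv_tc(mfv_table, j):
    """Average MFV for association level j across the 100 feature counts i."""
    return sum(mfv_table[(i, j)] for i in range(10, 1001, 10)) / 100
```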
Table 4
Results for comparing the weight-based algorithms with the given thresholds for each objective.

Products  Algorithms  TMP: 0.8     FPC: 0.8     FDC: 0.85    OET: 2 h
                      A     p      A     p      A     p      A     p
C20       WBGA_W1     0.45  0.12   0.37  0.08   0.39  0.19   0.45  0.16
          WBGA_W2     0.55  0.10   0.44  0.17   0.56  0.36   0.38  0.09
          WBGA-MO     0.61  0.17   0.67  0.13   0.54  0.32   0.39  0.08
          RWGA        0.69  0.19   0.76  0.19   0.75  0.10   0.32  0.03
          RS          0.22  0.01   0.34  0.02   0.47  0.05   0.55  0.23
C40       WBGA_W1     0.55  0.40   0.44  0.18   0.41  0.16   0.39  0.08
          WBGA_W2     0.68  0.04   0.39  0.20   0.56  0.08   0.42  0.19
          WBGA-MO     0.57  0.46   0.63  0.11   0.61  0.10   0.38  0.07
          RWGA        0.68  0.08   0.65  0.10   0.73  0.05   0.34  0.06
          RS          0.35  0.02   0.23  0.01   0.44  0.11   0.65  0.03
C60       WBGA_W1     0.61  0.10   0.48  0.38   0.32  0.04   0.43  0.33
          WBGA_W2     0.59  0.37   0.52  0.27   0.57  0.06   0.39  0.25
          WBGA-MO     0.56  0.22   0.64  0.09   0.66  0.07   0.32  0.08
          RWGA        0.67  0.13   0.71  0.17   0.78  0.05   0.39  0.15
          RS          0.39  0.01   0.32  0.02   0.35  0.02   0.71  0.01
C90       WBGA_W1     0.64  0.21   0.30  0.04   0.27  0.03   0.41  0.27
          WBGA_W2     0.66  0.18   0.53  0.12   0.37  0.10   0.43  0.35
          WBGA-MO     0.67  0.19   0.58  0.19   0.64  0.12   0.39  0.10
          RWGA        0.70  0.12   0.64  0.17   0.69  0.09   0.31  0.06
          RS          0.22  0.02   0.34  0.01   0.49  0.22   0.67  0.02

A: A12, p: p-value.
and FDC, and all of the A12 values are less than 0.5 for OET. Moreover, there are no significant differences (all p-values are greater than 0.05) when comparing with the given thresholds, as shown in Table 4.
For WBGA_W1 and WBGA_W2, the results are not stable. For some products and objectives (e.g., TMP for product C40), WBGA_W1 and WBGA_W2 have a higher chance of achieving better results than the given thresholds, since the A12 values are greater than 0.5 (Table 4). But for other products and objectives (e.g., FDC for product C90), WBGA_W1 and WBGA_W2 have a lower chance of obtaining better results as compared with the given thresholds, since the A12 values are much less than 0.5 (Table 4). Since the only difference between WBGA_W1 and WBGA_W2 is the set of weights, different weights can lead to totally different results when using WBGA, so providing an appropriate set of weights before using WBGA is essential to obtain the expected results.
Based on the results, we can also see that RS always has a lower probability of being useful in practice when compared with the given thresholds, since all the A12 values are less than 0.5 for TMP, FPC and FDC, and all of the A12 values are greater than 0.5 for OET. Moreover, there are significant differences when comparing with the given thresholds, since most of the p-values are less than 0.05, as shown in Table 4.
When comparing the results for OFV considering all the objectives together, WBGA-MO and RWGA have a higher probability of achieving a better OFV than the given thresholds, since all the A12 values are greater than 0.5 and there are almost no significant differences (most of the p-values are greater than 0.05), as shown in Table 5.
Table 5
Results for comparing the weight-based algorithms with the given thresholds for OFV.

Pair of algorithms        C20           C40           C60           C90
                          A     p       A     p       A     p       A     p
WBGA_W1 vs. thresholds    0.51  0.23    0.32  0.04    0.35  0.04    0.40  0.12
WBGA_W2 vs. thresholds    0.52  0.18    0.41  0.12    0.59  0.14    0.47  0.19
WBGA-MO vs. thresholds    0.56  0.11    0.55  0.15    0.62  0.09    0.56  0.16
RWGA vs. thresholds       0.62  0.32    0.59  0.18    0.67  0.07    0.71  0.04
RS vs. thresholds         0.28  0.02    0.33  0.02    0.39  0.01    0.27  0.01

A: A12, p: p-value.
Table 6
Results for comparing the weight-based algorithms for each objective (TMP, FPC, FDC, OET). [A12 and p-values for each pair of the weight-based algorithms and RS, per product (C20–C90) and objective; the individual cell values are not recoverable.]
RQ2: For all the objectives (i.e., TMP, FPC, FDC and OET), the results show that all the selected weight-based GAs have significantly better performance than RS, since all the A12 values for TMP, FPC and FDC are greater than 0.5, all the A12 values for OET are less than 0.5, and most of the p-values are less than 0.05 (Table 6). When comparing the OFV results considering all objectives for the algorithms and RS, we can see that all of the selected weight-based
Table 7
Results for comparing the weight-based algorithms for OFV.

Pair of algorithms        C20           C40           C60           C90
                          A     p       A     p       A     p       A     p
WBGA_W1 vs. WBGA_W2       0.41  0.23    0.42  0.38    0.51  0.46    0.46  0.26
WBGA_W1 vs. WBGA-MO       0.36  0.29    0.32  0.05    0.40  0.26    0.29  0.02
WBGA_W1 vs. RWGA          0.18  0.01    0.27  0.03    0.21  0.02    0.22  0.02
WBGA_W1 vs. RS            0.60  0.04    0.69  0.01    0.57  0.11    0.63  0.03
WBGA_W2 vs. WBGA-MO       0.39  0.04    0.40  0.12    0.37  0.04    0.43  0.22
WBGA_W2 vs. RWGA          0.33  0.02    0.30  0.02    0.38  0.04    0.31  0.02
WBGA_W2 vs. RS            0.60  0.04    0.59  0.04    0.63  0.03    0.67  0.03
WBGA-MO vs. RWGA          0.39  0.04    0.32  0.02    0.37  0.04    0.32  0.02
WBGA-MO vs. RS            0.60  0.03    0.69  0.02    0.62  0.03    0.64  0.03
RWGA vs. RS               0.66  0.01    0.79  0.02    0.65  0.02    0.72  0.02

A: A12, p: p-value.
Table 8
Results of the Vargha and Delaney statistics for the artificial problems (without/with the Mann–Whitney U test).

RQ    Pair of algorithms      A > B      A < B      A = B
RQ2   WBGA_W1 vs. RS          367/354    133/102    0/54
      WBGA_W2 vs. RS          392/371    108/96     0/33
      WBGA-MO vs. RS          445/426    55/45      0/29
      RWGA vs. RS             489/480    11/0       0/20
RQ3   WBGA_W1 vs. WBGA_W2     226/202    274/264    0/34
      WBGA_W1 vs. WBGA-MO     198/170    302/287    0/43
      WBGA_W1 vs. RWGA        119/93     381/371    0/36
      WBGA_W2 vs. WBGA-MO     211/198    289/264    0/38
      WBGA_W2 vs. RWGA        134/122    366/354    0/24
      WBGA-MO vs. RWGA        192/174    308/295    0/31
GAs have significantly better performance than RS, since all the A12 values are greater than 0.5 and all the p-values are less than 0.05 (Table 7).
RQ3: For all the objectives (i.e., TMP, FPC, FDC and OET), the results show that RWGA significantly outperforms WBGA_W1, WBGA_W2 and WBGA-MO, since all the A12 values for TMP, FPC and FDC are greater than 0.5, all the A12 values for OET are less than 0.5, and most of the p-values are less than 0.05 (Table 6). Similarly, when comparing OFV, the performance of RWGA is also significantly better than that of the other weight-based GAs, since all the A12 values are greater than 0.5 and all the p-values are less than 0.05 (Table 7).
In general, we can conclude that RWGA achieves the best performance among the selected weight-based GAs, both for the individual objectives and for the overall fitness value (OFV) when considering all the objectives together.
6.1.2. Results and analysis for artificial problems
To answer RQ2 and RQ3, we first conducted the Kruskal–Wallis test (together with the Bonferroni correction) for all the obtained results and obtained a p-value of 0.014 (less than 0.05), showing that there are significant differences between the results obtained by the selected weight-based GAs and RS. Moreover, we further compared the selected weight-based algorithms with RS and then compared each pair of them based on the mean fitness value (MFV) obtained after 25,000 generations for each algorithm and each of the problems. Recall that each problem was repeated 100 times to account for random variation (i.e., each MFV is a mean of the 100 OFV values from the 100 runs). Notice that we answer RQ4, in terms of scalability, for all ten search algorithms in Section 6.2.2.2.
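As a sketch of this statistical step (hypothetical OFV samples; scipy provides the Kruskal–Wallis H-test):

```python
from scipy.stats import kruskal

# Hypothetical OFV samples (one value per run) for three of the algorithms
rwga = [0.21, 0.19, 0.22, 0.20, 0.18]
wbga = [0.30, 0.28, 0.33, 0.29, 0.31]
rs   = [0.45, 0.48, 0.44, 0.47, 0.46]

stat, p = kruskal(rwga, wbga, rs)
print(p < 0.05)  # True: significant differences among the algorithms
```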
Table 8 summarizes the results of the Vargha and Delaney statistics, without/with the Mann–Whitney U test, for the 500 artificial problems. Two numbers are shown in each cell of the table, separated by a slash. The first number in the column A > B is the number of problems out of 500 for which algorithm A has better performance than B (A12 > 0.5); A < B means vice versa (A12 < 0.5); and A = B is the number of problems for which A has performance equivalent to B (A12 = 0.5). The second number, after the slash, in the column A > B is the number of problems out of 500 for which algorithm A has significantly better performance than B (A12 > 0.5 and p < 0.05); A < B means vice versa (A12 < 0.5 and p < 0.05); and A = B is the number of problems for which there are no significant differences in performance between A and B (p ≥ 0.05).
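For reference, the Vargha and Delaney A12 statistic used throughout these tables can be computed as follows (a standard implementation sketch, not taken from the paper):

```python
def a12(sample_a, sample_b):
    """Vargha-Delaney A12: the probability that a value drawn from A beats
    a value drawn from B, counting ties as half. A12 > 0.5 favors A."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in sample_a for b in sample_b)
    return wins / (len(sample_a) * len(sample_b))

print(a12([0.7, 0.8, 0.9], [0.5, 0.6, 0.85]))  # ~0.78: A tends to be larger
```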
6.1.2.1. Results and analysis for research questions 2 and 3. To answer RQ2 and RQ3, we compared WBGA_W1, WBGA_W2, WBGA-MO, RWGA and RS based on the mean fitness values for each problem ($MFV_{i,j}$, where 10 ≤ i ≤ 1000 with an increment of 10 and 5 ≤ j ≤ 45 with an increment of 10, i.e., there are 500 MFV values for the 500 artificial problems).
Table 9
Average running time for the selected weight-based search algorithms.

Algorithms   WBGA_W1   WBGA_W2   WBGA-MO   RWGA   RS
Time (s)     2.01      2.11      1.94      1.37   1.13
RQ2: All the selected weight-based GAs significantly outperformed RS in finding optimal solutions for most of the problems. For instance, WBGA-MO has better performance than RS for 445 problems, and its performance is significantly better for 426 problems. WBGA-MO has worse performance than RS for 55 problems (45 of them significantly worse). There were no significant differences between WBGA-MO and RS for 29 problems (Table 8).
RQ3: RWGA achieved the best performance among all the algorithms for most of the problems (the performance of RWGA is better than the other algorithms for 351.7 problems on average ((381 + 366 + 308)/3 = 351.7) and significantly better for 340 problems on average ((371 + 354 + 295)/3 = 340)). WBGA_W1 and WBGA_W2 have almost equivalent performance, which places them at the second level. Moreover, WBGA-MO achieved better performance than WBGA_W1 and WBGA_W2 (WBGA-MO outperformed WBGA_W1 and WBGA_W2 for 302 and 289 problems, respectively, and significantly outperformed them for 287 and 264 problems, respectively).
6.1.3. Timing analysis for weight-based GAs
In addition, we report the average time taken per run by the three weight-based GAs (i.e., WBGA, WBGA-MO and RWGA) as compared with RS (Table 9). The Kruskal–Wallis test was conducted to evaluate whether there are significant differences among the running times of the weight-based GAs as compared with RS (100 runs in our case). The p-value obtained by the Kruskal–Wallis test is 0.39, which shows that there are no significant differences among the running times of the weight-based GAs and RS. Based on these results, we can conclude that the selected weight-based GAs took a similar running time as RS to find an optimal solution for our test minimization problem.
6.2. Results and analysis for the best weight-based GA (RWGA) and the other selected search algorithms
This section presents the results and related analysis for our industrial case study and the designed artificial problems, using the best weight-based GA (RWGA) and the other selected algorithms.
6.2.1. Results and analysis for industrial case study
In this section, we summarize the results for the industrial case study.
6.2.1.1. Analysis for research question 1. We used the one-sample Mann–Whitney test, since we compared the results of the different algorithms with one set of fixed values for TMP, FPC, FDC, OET and OFV (i.e., the overall fitness value). Tables 10 and 11 show the performance of the algorithms as compared with the threshold values for each objective (i.e., TMP, FPC, FDC, OET and OFV). Based on the results, we answer RQ1 as follows.
For TMP, all the selected algorithms except RS have a higher probability of achieving better results than the given threshold, since most of the A12 values are greater than 0.5 and there are no significant differences between the results and the threshold (all the p-values are greater than 0.05). However, all the A12 values obtained by RS are less than 0.5 and all the p-values are less than 0.05, i.e., RS has a significantly lower chance of reaching the TMP threshold.
Table 10
Results for comparing the eight algorithms with the given thresholds for each objective.

Products  Algorithms  TMP: 0.8     FPC: 0.8     FDC: 0.85    OET: 2 h
                      A     p      A     p      A     p      A     p
C20       RWGA        0.69  0.19   0.76  0.19   0.75  0.10   0.32  0.03
          NSGA-II     0.72  0.23   0.63  0.33   0.73  0.21   0.29  0.01
          MOCell      0.65  0.32   0.61  0.34   0.57  0.17   0.34  0.03
          SPEA2       0.59  0.18   0.38  0.04   0.39  0.21   0.25  0.02
          PAES        0.68  0.24   0.54  0.27   0.54  0.18   0.43  0.12
          SMPSO       0.57  0.32   0.62  0.31   0.63  0.35   0.39  0.08
          CellDE      0.63  0.35   0.68  0.34   0.57  0.29   0.28  0.02
          RS          0.22  0.01   0.34  0.02   0.47  0.05   0.55  0.23
C40       RWGA        0.68  0.08   0.65  0.10   0.73  0.05   0.34  0.06
          NSGA-II     0.55  0.22   0.65  0.18   0.68  0.32   0.34  0.02
          MOCell      0.72  0.31   0.58  0.41   0.61  0.24   0.27  0.04
          SPEA2       0.57  0.25   0.31  0.04   0.33  0.04   0.25  0.02
          PAES        0.61  0.34   0.62  0.19   0.54  0.19   0.42  0.17
          SMPSO       0.65  0.17   0.57  0.23   0.64  0.27   0.38  0.07
          CellDE      0.72  0.19   0.61  0.31   0.59  0.34   0.26  0.02
          RS          0.35  0.02   0.23  0.01   0.44  0.11   0.65  0.03
C60       RWGA        0.67  0.13   0.71  0.17   0.78  0.05   0.39  0.15
          NSGA-II     0.61  0.32   0.55  0.23   0.62  0.11   0.48  0.21
          MOCell      0.54  0.22   0.63  0.18   0.57  0.24   0.35  0.04
          SPEA2       0.65  0.34   0.49  0.17   0.31  0.03   0.37  0.03
          PAES        0.47  0.41   0.57  0.23   0.62  0.18   0.42  0.11
          SMPSO       0.49  0.25   0.63  0.29   0.55  0.23   0.30  0.04
          CellDE      0.54  0.18   0.59  0.32   0.59  0.31   0.29  0.03
          RS          0.39  0.01   0.32  0.02   0.35  0.02   0.71  0.01
C90       RWGA        0.70  0.12   0.64  0.17   0.69  0.09   0.31  0.06
          NSGA-II     0.58  0.25   0.69  0.17   0.57  0.34   0.28  0.03
          MOCell      0.61  0.27   0.54  0.21   0.62  0.17   0.34  0.07
          SPEA2       0.48  0.14   0.47  0.09   0.43  0.08   0.24  0.02
          PAES        0.54  0.11   0.49  0.15   0.54  0.13   0.42  0.13
          SMPSO       0.63  0.35   0.55  0.23   0.65  0.25   0.31  0.04
          CellDE      0.68  0.29   0.59  0.17   0.71  0.23   0.23  0.02
          RS          0.22  0.02   0.34  0.01   0.49  0.22   0.67  0.02

A: A12, p: p-value.
For OET, all the selected algorithms except RS have a higher chance of obtaining a lower overall execution time than the given threshold, since most of the A12 values are less than 0.5 and most of the p-values are less than 0.05, i.e., they have significantly higher chances of meeting the time budget within the given threshold (2 h). Moreover, all the A12 values obtained by RS are greater than 0.5 and most of the p-values are less than 0.05, i.e., the whole execution time achieved by RS has a higher chance of significantly exceeding 2 h (Table 10).
When comparing the results for OFV considering all the objectives, the selected algorithms except SPEA2 and RS have a higher probability of obtaining a better OFV when compared with the given thresholds, since all the A12 values are greater than 0.5 and there are no significant differences (all the p-values are greater than 0.05). As for SPEA2, the obtained OFV has a significantly lower chance of matching the given thresholds (most of the A12 values are less than 0.5 and half of the p-values are less than 0.05). For RS, the obtained OFV also has a higher chance of being significantly worse than the thresholds, since all the
Table 11
Results for comparing the eight algorithms with the given thresholds for OFV.

Pair of algorithms        C20           C40           C60           C90
                          A     p       A     p       A     p       A     p
RWGA vs. thresholds       0.62  0.32    0.59  0.18    0.67  0.07    0.71  0.04
NSGA-II vs. thresholds    0.71  0.28    0.62  0.21    0.54  0.12    0.66  0.19
MOCell vs. thresholds     0.58  0.24    0.62  0.19    0.64  0.21    0.59  0.16
SPEA2 vs. thresholds      0.38  0.11    0.31  0.04    0.29  0.03    0.51  0.13
PAES vs. thresholds       0.55  0.23    0.58  0.16    0.61  0.27    0.60  0.33
SMPSO vs. thresholds      0.65  0.27    0.72  0.34    0.69  0.27    0.74  0.25
CellDE vs. thresholds     0.59  0.18    0.63  0.22    0.74  0.39    0.69  0.35
RS vs. thresholds         0.28  0.02    0.33  0.02    0.39  0.01    0.27  0.01

A: A12, p: p-value.
Table 12
Number of solutions obtained by each search algorithm that satisfy all the given thresholds. [Each weight-based GA (WBGA_W1, WBGA_W2, WBGA-MO, RWGA) returns a single balanced solution, while the Pareto-based algorithms return sets of non-dominated solutions (e.g., 11 for NSGA-II; the other extracted values include 15, 19 and 20); RS finds no solution meeting all thresholds.]
A12 values are less than 0.5 and all the p-values are less than 0.05 (Table 11).
We also report the number of solutions obtained by each selected algorithm that satisfy all the given thresholds (i.e., 0.8 for TMP, 0.8 for FPC, 0.85 for FDC and 2 h for OET), as shown in Table 12. Based on the results, we observe that the weight-based GAs manage to find one solution that balances all the objectives together (i.e., the best OFV value), while the Pareto-based algorithms obtain a set of non-dominated solutions; e.g., eleven non-dominated solutions are obtained using NSGA-II. Notice that RS cannot manage to find a solution that satisfies the required thresholds for all the objectives.
In general, the selected algorithms except SPEA2 and RS achieve the given thresholds for each objective well, and their overall performance in terms of OFV is also good.
6.2.1.2. Analysis for research questions 2 and 3. As in Section 6.1.1.2, for the Saturn products, all the p-values obtained by the Kruskal–Wallis test (together with the Bonferroni correction) are less than 0.05 for each objective and for OFV (0.009 for TMP, 0.031 for FPC, 0.015 for FDC, 0.024 for OET and 0.016 for OFV), showing that there are significant differences between the performance of the selected search algorithms for each objective and for OFV. Therefore, we further performed the Vargha and Delaney statistics and the Mann–Whitney U test to compare each pair of the eight search algorithms, as shown in Tables 13 and 14. In total, we have $C_2^8 = 28$ pairs to compare for the eight selected search algorithms. The obtained results are first compared for each objective and then compared for OFV considering all the objectives (Tables 13 and 14, respectively). Based on the results, we answer RQ2 and RQ3 as follows.
RQ2: For TMP, FPC, FDC and OET, we first compared the performance of the selected algorithms with RS. The results show that all the selected search algorithms have significantly better performance than RS for TMP and OET, since all the A12 values for TMP are greater than 0.5, all the A12 values for OET are less than 0.5, and most of the p-values are less than 0.05. For FPC and FDC, all the selected algorithms except SPEA2 achieve significantly better performance than RS, since almost all the A12 values are greater than 0.5 and most of the p-values are less than 0.05. SPEA2 has almost equivalent performance to RS for FPC and FDC, since most of the A12 values are close to 0.5 and the p-values are greater than 0.05 (Table 13).
When comparing the OFV results considering all objectives for the algorithms and RS, we can see that six of them (i.e., RWGA, NSGA-II, MOCell, PAES, SMPSO and CellDE) have significantly better performance than RS, since all the A12 values are greater than 0.5 and all the p-values are less than 0.05. SPEA2 has almost equivalent performance to RS, since most of the A12 values are close to 0.5 and the p-values are greater than 0.05 (Table 14).
RQ3: For TMP, CellDE achieved the best performance among the algorithms, since most of the A12 values are greater than 0.5 and the p-values are less than 0.05. Similarly, MOCell, NSGA-II and SPEA2 have the best performance for FPC, FDC and OET, respectively (Table 13).
When comparing the OFV results considering all objectives together, RWGA performs significantly better than all the other search algorithms, since all the A12 values are greater than 0.5 and most of the p-values are less than 0.05. Moreover, SPEA2 has significantly worse performance than the other six search algorithms, since most of the A12 values are less than 0.5 and most of the p-values are less than 0.05 (Table 14).
In general, we can conclude that different algorithms can achieve the best performance for different individual objectives (e.g., CellDE for the objective TMP), but RWGA has significantly better performance than the other search algorithms when considering all the objectives together.
6.2.2. Results and analysis for artificial problems
In this section, we summarize the results for the 500 designed artificial problems.
6.2.2.1. Results and analysis for research questions 2 and 3. To answer RQ2 and RQ3, the Kruskal–Wallis test (together with the Bonferroni correction) was first performed on the results of the eight search algorithms, and we obtained a p-value of 0.024 (less than 0.05), showing that there are significant differences between the performances of the selected algorithms. Therefore, we further compared the algorithms (i.e., RWGA, NSGA-II, MOCell, SPEA2, PAES, SMPSO, CellDE) with RS and then compared each pair of them based on the mean fitness value (MFV) achieved after 25,000 generations for each algorithm and each of the 500 problems. Recall that each problem was repeated 100 times to account for random variation, i.e., each MFV is a mean of 100 OFV values from 100 runs (Section 5.1.3). Table 15 summarizes the results of the Vargha and Delaney statistics without/with the Mann–Whitney U test for the 500 artificial problems (in the same format as Table 8).
RQ2: RWGA, NSGA-II, MOCell, PAES, SMPSO and CellDE achieved significantly better performance than RS in finding optimal solutions for our minimization problem for most of the problems. For instance, RWGA has better performance than RS for 489 problems, and its performance is significantly better for 480 problems. RWGA has worse performance than RS for 11 problems (none of them significantly worse). There were no significant differences between RWGA and RS for 20 problems. The performance of SPEA2 is close to RS (SPEA2 is better than RS for 298 problems and significantly better for 251 problems; SPEA2 is worse than RS for 202 problems and significantly worse for 179 problems; meanwhile, there is no significant difference between them for 70 problems) (Table 15).
RQ3: RWGA achieved the best performance among the other six algorithms for most of the problems, i.e., the performance of RWGA is better than the other six algorithms for 418.8 problems on average ((432 + 403 + 464 + 381 + 453 + 380)/6 = 418.8) and significantly better for 401.7 problems on average ((405 + 397 + 451 + 362 + 438 + 357)/6 = 401.7). NSGA-II, MOCell, PAES, SMPSO and CellDE have almost equivalent performance, which places them at the second level. Finally, the performance of SPEA2 is worse than that of the above-mentioned algorithms, i.e., SPEA2 achieved the worst performance on the 500 artificial problems (SPEA2 is worse than the other search algorithms for 377.8 problems and significantly worse for 354.7 problems on average) (Table 15).
6.2.2.2. Results and analysis for research question 4. To answer RQ4, we calculated Spearman's rank correlation between the mean fitness values for each algorithm and the increasing number of features ($MFVF_i$, where 10 ≤ i ≤ 1000 with an increment of 10, Section 5.1.3) and the increasing number of associated test cases ($MFVTC_j$, where 5 ≤ j ≤ 45 with an increment of 10, Section 5.1.3). In this setting, a fixed number of features is used to represent a product and each feature is associated with a different number of test cases, so that we have 100 × 5 = 500 artificial problems.
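A sketch of this correlation analysis (synthetic MFV_F data for illustration; scipy's spearmanr returns the rank correlation and its p-value):

```python
import random
from scipy.stats import spearmanr

features = list(range(10, 1001, 10))          # 10, 20, ..., 1000
# Hypothetical MFV_F values that improve (decrease) slightly as features grow
mfv_f = [0.5 - 0.0002 * f + random.gauss(0, 0.01) for f in features]

rho, p = spearmanr(features, mfv_f)
print(rho, p < 0.05)  # strength and significance of the trend
```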
Table 13
Results for comparing the eight algorithms for each objective (TMP: 0.8, FPC: 0.8, FDC: 0.85, OET: 2 h). [A12 and p-values for all 28 pairs of the eight algorithms (e.g., NSGA-II vs. RS, MOCell vs. SPEA2, ..., CellDE vs. RS), per product and objective; the individual cell values are not recoverable.]
Table 14
Results for comparing the eight algorithms for OFV.

Pair of algorithms      C20           C40           C60           C90
                        A     p       A     p       A     p       A     p
RWGA vs. NSGA-II        0.51  0.22    0.62  0.01    0.59  0.04    0.52  0.11
RWGA vs. MOCell         0.57  0.03    0.57  0.02    0.72  0.01    0.54  0.12
RWGA vs. SPEA2          0.55  0.07    0.62  0.01    0.69  0.03    0.64  0.04
RWGA vs. PAES           0.61  0.02    0.57  0.11    0.71  0.01    0.56  0.08
RWGA vs. SMPSO          0.71  0.01    0.67  0.03    0.64  0.02    0.55  0.09
RWGA vs. CellDE         0.56  0.14    0.75  0.01    0.60  0.06    0.68  0.04
RWGA vs. RS             0.66  0.01    0.79  0.02    0.65  0.02    0.72  0.02
NSGA-II vs. MOCell      0.59  0.11    0.51  0.23    0.61  0.09    0.53  0.19
NSGA-II vs. SPEA2       0.61  0.04    0.69  0.04    0.65  0.09    0.78  0.02
NSGA-II vs. PAES        0.58  0.15    0.56  0.17    0.61  0.07    0.45  0.22
NSGA-II vs. SMPSO       0.45  0.33    0.61  0.04    0.52  0.32    0.55  0.18
NSGA-II vs. CellDE      0.56  0.25    0.57  0.08    0.51  0.29    0.63  0.04
NSGA-II vs. RS          0.64  0.02    0.69  0.01    0.58  0.01    0.70  0.02
MOCell vs. SPEA2        0.65  0.02    0.73  0.04    0.55  0.09    0.64  0.04
MOCell vs. PAES         0.58  0.08    0.55  0.12    0.51  0.21    0.57  0.09
MOCell vs. SMPSO        0.59  0.14    0.56  0.13    0.52  0.08    0.51  0.14
MOCell vs. CellDE       0.49  0.21    0.59  0.17    0.60  0.19    0.53  0.19
MOCell vs. RS           0.71  0.02    0.75  0.01    0.67  0.04    0.66  0.02
SPEA2 vs. PAES          0.34  0.02    0.31  0.02    0.43  0.09    0.29  0.01
SPEA2 vs. SMPSO         0.28  0.01    0.32  0.02    0.38  0.02    0.41  0.07
SPEA2 vs. CellDE        0.45  0.11    0.32  0.03    0.28  0.02    0.27  0.02
SPEA2 vs. RS            0.55  0.32    0.51  0.24    0.48  0.12    0.54  0.08
PAES vs. SMPSO          0.48  0.17    0.51  0.09    0.43  0.17    0.52  0.25
PAES vs. CellDE         0.46  0.21    0.45  0.14    0.52  0.15    0.41  0.09
PAES vs. RS             0.65  0.02    0.72  0.02    0.66  0.03    0.71  0.01
SMPSO vs. CellDE        0.49  0.32    0.46  0.29    0.43  0.16    0.38  0.06
SMPSO vs. RS            0.63  0.01    0.77  0.01    0.64  0.03    0.69  0.02
CellDE vs. RS           0.72  0.01    0.75  0.01    0.68  0.02    0.75  0.01

A: A12, p: p-value.
6.2.3. Timing analysis for RWGA and the other selected search algorithms
Similarly, we also performed a timing analysis for the best weight-based GA (i.e., RWGA) and the other kinds of selected search algorithms (e.g., NSGA-II) as compared with RS. Table 17 shows the
Table 15
Results of the Vargha and Delaney statistics for the artificial problems (without/with the Mann–Whitney U test).

RQ    Pair of algorithms     A > B      A < B      A = B
RQ2   RWGA vs. RS            489/480    11/0       0/20
      NSGA-II vs. RS         467/449    33/14      0/37
      MOCell vs. RS          438/419    62/46      0/35
      SPEA2 vs. RS           298/251    202/179    0/70
      PAES vs. RS            430/414    70/59      0/27
      SMPSO vs. RS           388/362    112/93     0/45
      CellDE vs. RS          423/399    77/65      0/36
RQ3   RWGA vs. NSGA-II       432/405    68/57      0/32
      RWGA vs. MOCell        403/397    97/85      0/18
      RWGA vs. SPEA2         464/451    36/23      0/26
      RWGA vs. PAES          381/362    119/104    0/24
      RWGA vs. SMPSO         453/438    47/39      0/23
      RWGA vs. CellDE        380/357    120/103    0/40
      NSGA-II vs. MOCell     269/252    231/204    0/44
      NSGA-II vs. SPEA2      398/377    102/86     0/47
      NSGA-II vs. PAES       266/242    234/219    0/39
      NSGA-II vs. SMPSO      226/219    273/241    0/40
      NSGA-II vs. CellDE     245/227    255/239    0/34
      MOCell vs. SPEA2       395/374    105/91     0/35
      MOCell vs. PAES        302/281    198/167    0/52
      MOCell vs. SMPSO       267/243    233/200    0/57
      MOCell vs. CellDE      201/182    299/283    0/35
      SPEA2 vs. PAES         170/145    330/307    0/48
      SPEA2 vs. SMPSO        141/107    359/316    0/77
      SPEA2 vs. CellDE       179/152    321/303    0/45
      PAES vs. SMPSO         223/201    277/255    0/46
      PAES vs. CellDE        198/182    302/270    0/48
      SMPSO vs. CellDE       244/219    256/238    0/43
average running time per run taken by each selected search algorithm. We also obtained a p-value of 0.63 with the Kruskal–Wallis test, suggesting that there are no significant differences between the running times of the selected search algorithms as compared with RS. Based on these results, we can conclude that none of the selected search algorithms took a significantly different amount of time than RS to find an optimal solution.
7. Discussion
In this section, we discuss the results of the experiments based on our industrial case study and the 500 artificial problems. Section 7.1 provides insight into the results of RQ1–RQ3, while Section 7.2 discusses the results of RQ4. Recall that RQ1–RQ3 address questions related to the cost/effectiveness and performance of our minimization approach (as compared with RS), while RQ4 refers to the impact of the increasing number of features and test cases on the performance of the selected multi-objective search algorithms (Section 5.1.1). Before going deeper into the discussion of the specific research questions, the key results are first summarized below:
• Different algorithms achieve the best performance for different individual objectives, i.e., CellDE for TMP, MOCell for FPC, NSGA-II for FDC and SPEA2 for OET;
• RWGA achieves significantly the best performance among all the selected search algorithms when considering all the objectives together;
• The performance of the weight-based GAs (e.g., RWGA) increases with the growth of the number of features, but is not significantly influenced by the increasing number of associated test cases;
• The other search algorithms are not significantly impacted by the increasing complexity of the problems (increasing features and associated test cases).
Table 16
Spearman's correlation analysis for the selected algorithms with the increasing number of features and associated test cases.

                Number of features             Number of test cases
Algorithms      Spearman ρ    Prob > |ρ|       Spearman ρ    Prob > |ρ|
WBGA_W1         0.52          <0.0001          0.08          0.5764
WBGA_W2         0.47          <0.0001          0.10          0.5123
WBGA-MO         0.69          <0.0001          0.15          0.4892
RWGA            0.74          <0.0001          0.19          0.4719
NSGA-II         0.13          0.5546           0.10          0.6714
MOCell          0.11          0.6852           0.02          0.7653
SPEA2           0.09          0.8019           0.12          0.5417
PAES            0.02          0.7980           0.14          0.4928
SMPSO           0.15          0.5742           0.07          0.6986
CellDE          0.08          0.7006           0.13          0.3985
RS              0.08          0.8704           0.09          0.7619
Table 17
Average running time for the selected search algorithms.

Algorithms    Time (s)
RWGA          1.37
NSGA-II       1.54
MOCell        1.23
SPEA2         1.98
PAES          2.06
SMPSO         2.19
CellDE        1.45
RS            1.13
[Fig. 5: the balanced and the non-dominated solution spaces, panels (a) and (b).]
When the number of features increases, the balanced solution space also increases, so that the weight-based GAs (e.g., RWGA) have a better chance of finding an optimal balanced solution from a larger balanced solution space, with the aim of balancing all the objectives together.
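To make the weight-based mechanism concrete, the sketch below shows the random-weight scalarization step that RWGA-style algorithms apply at every generation (Konak et al., 2006); the class and method names are our own illustration, not TEMSA's actual code.

import java.util.Random;

public class RandomWeightScalarizer {
    private final Random random = new Random();

    // Draw fresh weights, normalized to sum to 1; RWGA-style algorithms
    // redo this at every generation so no fixed trade-off is imposed.
    public double[] nextWeights(int numObjectives) {
        double[] w = new double[numObjectives];
        double sum = 0.0;
        for (int i = 0; i < numObjectives; i++) {
            w[i] = random.nextDouble();
            sum += w[i];
        }
        for (int i = 0; i < numObjectives; i++) {
            w[i] /= sum;
        }
        return w;
    }

    // Scalar fitness of one solution under the current weights, assuming
    // the five objective values (TMP, FPC, FDC, OET, AEF) are normalized
    // to [0, 1] with higher values being better.
    public double fitness(double[] objectiveValues, double[] weights) {
        double f = 0.0;
        for (int i = 0; i < objectiveValues.length; i++) {
            f += weights[i] * objectiveValues[i];
        }
        return f;
    }
}

Because the weights change every generation, the population is repeatedly pulled toward different regions of the balanced solution space rather than toward a single fixed compromise.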
For the other algorithms, which are based on dominance theory (i.e., NSGA-II, MOCell, SPEA2, PAES, SMPSO and CellDE), the performance is not significantly influenced by the increasing number of features, since these algorithms guide the search toward a set of non-dominated solutions (usually one objective is the main concern of each solution). The overall fitness value (OFV, Section 5.1.3) does not change significantly even when the space of non-dominated solutions increases, as shown in (a) and (b) of Fig. 5. For instance, solutions a and b are two non-dominated solutions obtained by NSGA-II with TMP = 0.8, FPC = 0.5, FDC = 0.5, OET = 0.5, AEF = 0.5 and TMP = 0.5, FPC = 0.8, FDC = 0.5, OET = 0.5, AEF = 0.5, respectively; i.e., the main objective for solution a is the percentage of test minimization (TMP), while the main concern for solution b is feature pairwise coverage (FPC). When considering all the objectives together, however, solutions a and b are equivalent, since the OFV values of both solutions are the same (0.56 for each, Section 5.1.3). This is the main reason the performance of NSGA-II, MOCell, SPEA2, PAES, SMPSO and CellDE stays stable, as shown in Figs. 2 and 4.
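Assuming the OFV of Section 5.1.3 reduces to an equal-weight average of the five objective values (an assumption consistent with the identical scores above), the computation for the two solutions is:

\[
\mathrm{OFV}(a) = \tfrac{1}{5}(0.8 + 0.5 + 0.5 + 0.5 + 0.5) = 0.56,
\qquad
\mathrm{OFV}(b) = \tfrac{1}{5}(0.5 + 0.8 + 0.5 + 0.5 + 0.5) = 0.56.
\]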
For random search (RS), the performance remains the worst and is not influenced by the increasing number of features, since RS chooses solutions randomly without any guidance.
For other combinations of the three types of solution spaces (the overall solution space, the balanced solution space and the non-dominated solution space), the observations may differ from the ones in our experiments. This calls for more focused experiments that assess the performance of the search algorithms under varying combinations of increase, decrease and no change in the three types of solution spaces.
8. Automation

In this section, we present the tool support for test minimization using search algorithms together with our proposed fitness function. The tool, called TEst Minimization using Search Algorithms (TEMSA), minimizes test suites while improving effectiveness and reducing cost according to various cost/effectiveness measures. TEMSA is implemented as a web-based application on top of the jMetal library, using Java and Sencha ExtJS (a JavaScript framework for rich web applications) (Sencha, 2010). The architecture of TEMSA is shown in Fig. 6, and TEMSA can be tried out at http://zen-tools.com/TEMSA/.
The input of TEMSA is a set of selected features and test cases for testing a product, provided as an XML file in our industrial case study. To use TEMSA, users must provide an input file conforming to our specific XML schema (the schema and a sample input XML file can be downloaded from the tool website). Cost measures (e.g., OET) and effectiveness measures (e.g., TMP, FPC, FDC and AEF) are integrated into TEMSA to cover various objectives. Other objective measures can also be coded into TEMSA by formulating them mathematically as specific objective functions, as discussed in Section 3.2. Each test case includes three key attributes for the cost/effectiveness measures, i.e., SucR, EF and AET (Section 3.2), which are obtained from the test execution history. Based on these inputs, TEMSA outputs a set of minimized test cases organized as another XML file, which also conforms to our XML schema and can be transformed into any other schema when needed. In our context, the output can be fed into a test execution engine for scheduling and executing the test cases against the System Under Test (SUT). Afterwards, the results of the executed test cases are used to update the test execution history in the repository, i.e., SucR, EF and AET for the executed test cases.
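For illustration, a minimal sketch of the per-test-case data TEMSA consumes is given below. The Java names and the expansion of the abbreviations in the comments are our own reading; Section 3.2 gives the authoritative definitions of SucR, EF and AET.

import java.util.List;

public class TestCaseRecord {
    final String id;
    final double sucR; // SucR attribute from the execution history (Section 3.2)
    final double ef;   // EF attribute from the execution history (Section 3.2)
    final double aet;  // AET: average execution time, e.g., in seconds (Section 3.2)

    TestCaseRecord(String id, double sucR, double ef, double aet) {
        this.id = id;
        this.sucR = sucR;
        this.ef = ef;
        this.aet = aet;
    }

    // Example objective ingredient: the total execution time of a candidate
    // minimized suite, which an OET-style budget objective could compare
    // against the allowed testing time.
    static double totalExecutionTime(List<TestCaseRecord> suite) {
        return suite.stream().mapToDouble(tc -> tc.aet).sum();
    }
}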
Moreover, our tool recommends different algorithms depending on the test minimization objectives, based on the results of our empirical study, as shown in Fig. 7. First, a user selects the objective with the highest importance (Activity A1), and TEMSA automatically recommends the best algorithm for the selected objective. For instance, if the highest importance is to reduce the number of test cases (TMP), CellDE is recommended (Activity A2). Similarly, if feature pairwise coverage (FPC), fault detection capability (FDC) or the allowed time budget (OET) is the most important objective, TEMSA recommends MOCell, NSGA-II or SPEA2, respectively (Activities A3, A4 and A5). Finally, if a user wants to balance all the objectives together, no objective needs to be selected in Activity A1, and RWGA is recommended automatically (Activity A6).
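The recommendation logic of Fig. 7 can be summarized by the following sketch; the enum and method are illustrative, but the objective-to-algorithm mapping is exactly the one described above.

public class AlgorithmRecommender {
    enum Objective { TMP, FPC, FDC, OET, NONE /* balance all objectives */ }

    static String recommend(Objective mostImportant) {
        switch (mostImportant) {
            case TMP: return "CellDE";  // reduce the number of test cases
            case FPC: return "MOCell";  // feature pairwise coverage
            case FDC: return "NSGA-II"; // fault detection capability
            case OET: return "SPEA2";   // allowed time budget
            default:  return "RWGA";    // balance all objectives together
        }
    }

    public static void main(String[] args) {
        System.out.println(recommend(Objective.TMP)); // prints CellDE
    }
}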
9. Threats to validity
In this section, we discuss four threats to validity for our experiments and how we address them.
Please cite this article as: S. Wang et al., Cost-effective test suite minimization in product lines using search techniques, The Journal of Systems
and Software (2014), http://dx.doi.org/10.1016/j.jss.2014.08.024
JID: JSS
20
ARTICLE IN PRESS
[m5G;December 8, 2014;18:52]
S. Wang et al. / The Journal of Systems and Software 000 (2014) 122
Internal validity threats exist when the outcome of the results is influenced by internal factors (e.g., the parameter settings of the algorithms) and is not necessarily due to the application of the treatment (e.g., search algorithms) being studied (Ali et al., 2010; Barros and Neto, 2011). A possible threat to internal validity is that we experimented with only one (default) configuration of the parameter settings of the selected search algorithms. However, these settings are in accordance with the common guidelines in the literature and our previous experience with testing problems (Arcuri and Briand, 2011; Sheskin, 2007; Wang et al., 2013c).
Construct validity threats arise from the use of measures of effectiveness and cost that have severe limitations, as they are not precise. To reduce construct validity threats in our context, we used the same stopping criterion for all the algorithms, i.e., the number of fitness evaluations: we ran each algorithm for 25,000 fitness evaluations to seek the best solution for test minimization. This criterion is comparable across all the algorithms, since each iteration requires updating the obtained solution and computing the value of the fitness function.
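A minimal sketch of this shared stopping criterion, assuming a generic search interface (our own illustration, not TEMSA's code):

public class EvaluationBudgetLoop {
    static final int MAX_EVALUATIONS = 25_000; // identical budget for all algorithms

    interface Search {
        // Performs one iteration and returns how many fitness evaluations it used.
        int step();
    }

    static void run(Search search) {
        int evaluations = 0;
        while (evaluations < MAX_EVALUATIONS) {
            evaluations += search.step(); // stop on evaluations, not wall-clock time
        }
    }
}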
External validity threats exist when the outcome of the results is influenced by external factors and is not necessarily due to the application of the treatment being studied (Ali et al., 2010; Barros and Neto, 2011). One common external validity threat in software engineering experiments concerns the generalization of results: one may argue that results obtained from just one industrial case study cannot be generalized to other case studies. However, this threat to external validity is common to all empirical studies. In addition, we conducted an empirical evaluation using 500 carefully designed artificial problems, and the results are consistent with those of the industrial case study.
10. Related work

Test prioritization orders the test cases in the existing test suite so as to test the current system or program with the objective of achieving pre-defined criteria (e.g., executing the maximum number of test cases) within a given time budget. The main difference between selection and minimization is that test minimization permanently eliminates redundant test cases for the system (program), while test case selection temporarily selects relevant test cases for testing the modified version of the system (Harman, 2011; Yoo and Harman, 2012). From the perspective of optimization, however, there is no significant difference between test case selection and test minimization (Harman, 2011): both aim at choosing a subset of test cases from the existing test suite while achieving pre-defined optimization objectives, which is called the test case selection problem in Harman (2011).
Our work is related to the above-mentioned selection problem but, as opposed to the existing works, lies in the context of product line testing: for example, feature pairwise coverage is used to measure the effectiveness of the minimized test suite based on the coverage of feature pairs from the product line point of view, rather than at the code level (e.g., code coverage) as in most of the existing works on regression testing. Although a large number of test selection/minimization techniques exist in the context of regression testing, there is not enough evidence that these techniques can be adapted to product lines (e.g., whether code coverage in regression testing is equivalent to feature coverage in product line testing).
Yoo and Harman (2007) used a greedy algorithm and the multi-objective search algorithm NSGA-II for test case selection with the following three measures in the context of regression testing: code coverage, fault detection history and execution time. Notice that these three measures are similar to our cost/effectiveness measures FPC, FDC and OET. The difference with our work is that we first select a set of relevant test cases for a new product using feature models, as discussed in Wang et al. (2012, 2013a), and then perform test minimization based on two additional measures, TMP and AEF. In addition, we conducted an empirical study to compare ten multi-objective algorithms belonging to four different mechanisms and evaluated their performance, which was not addressed in Yoo and Harman (2007).
In another work related to test prioritization in regression testing (Walcott et al., 2006), a two-objective problem (i.e., code coverage and execution time) is converted into a single-objective problem using an arithmetical combination of weights in the fitness function. These two objectives are close to our cost/effectiveness measures FPC and OET. However, as compared with code coverage, our FPC focuses on the coverage of feature pairs from the product line point of view rather than the code level of the system. Moreover, three additional measures are defined mathematically in our work, and we do not limit our scope to weight-based GAs, i.e., different types of search algorithms are empirically evaluated, e.g., NSGA-II and SPEA2.
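As an illustration of such an arithmetical weight combination (a generic weighted-sum scalarization, not the exact fitness function of Walcott et al., 2006), a two-objective fitness can be written as:

\[
\mathit{fitness}(T) \;=\; w_{1} \cdot \mathit{coverage}(T) \;-\; w_{2} \cdot \mathit{execTime}(T),
\qquad w_{1} + w_{2} = 1,
\]

so that higher coverage and lower execution time both improve the single aggregated objective.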
As compared with our previous work (Wang et al., 2013c), we define two additional cost/effectiveness measures (i.e., AEF and OET) for the test minimization problem. Moreover, we further compare the performance of the best weight-based GA (i.e., RWGA) and six other search algorithms using an industrial case study and 500 designed artificial problems, which covers various kinds of search algorithms and makes the results and analysis more persuasive.
To the best of our knowledge and according to existing published reviews (Harman et al., 2009; Yoo and Harman, 2012), none of the existing works has studied test minimization in the context of product lines considering TMP, FPC, FDC, OET and AEF all together. In addition, ours is one of the few works that extensively compares the performance of search algorithms belonging to different mechanisms (e.g., evolutionary algorithms).
Durillo, J., Nebro, A., Luna, F., Alba, E., 2008. Solving three-objective optimization problems using a new hybrid cellular genetic algorithm. In: Rudolph, G., Jansen, T., Lucas, S., Poloni, C., Beume, N. (Eds.), Parallel Problem Solving from Nature – PPSN X. Springer, Berlin Heidelberg, pp. 661–670.
Durillo, J.J., Nebro, A.J., 2011. jMetal: a Java framework for multi-objective optimization. Adv. Eng. Softw. 42, 760–771.
Engström, E., Runeson, P., 2011. Software product line testing – a systematic mapping study. Inf. Softw. Technol. 53, 2–13.
Fischer, K.F., Raji, F., Chruscicki, A., 1981. A methodology for retesting modified software. In: National Telecommunications Conference (NTC), pp. 1–6.
GmbH, P., 2006. Variant Management with Pure::Variants. Technical White Paper.
Greer, D., Ruhe, G., 2004. Software release planning: an evolutionary and iterative approach. Inf. Softw. Technol. 46, 243–253.
Harman, M., 2011. Making the case for MORTO: multi objective regression test optimization. In: Proceedings of the 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops. IEEE Computer Society, pp. 111–114.
Harman, M., Mansouri, S.A., Zhang, Y., 2009. Search Based Software Engineering: A Comprehensive Analysis and Review of Trends, Techniques and Applications. Technical Report TR-09-03.
Knowles, J.D., Corne, D.W., 2000. Approximating the nondominated front using the Pareto archived evolution strategy. Evol. Comput. 8, 149–172.
Konak, A., Coit, D.W., Smith, A.E., 2006. Multi-objective optimization using genetic algorithms: a tutorial. Reliab. Eng. Syst. Saf. 91, 992–1007.
Kuo-Chung, T., Yu, L., 2002. A test generation strategy for pairwise testing. IEEE Trans. Softw. Eng. 28, 109–111.
Lopez-Herrejon, R.E., Chicano, F., Ferrer, J., Egyed, A., Alba, E., 2013. Multi-objective optimal test suite computation for software product line pairwise testing. In: International Conference on Software Maintenance (ICSM), pp. 404–407.
McDonald, J.H., 2009. Handbook of Biological Statistics.
McGregor, J., 2001. Testing a Software Product Line (CMU/SEI-2001-TR-022). Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA.
Nebro, A., Durillo, J., Luna, F., Dorronsoro, B., Alba, E., 2007. Design issues in a multiobjective cellular genetic algorithm. In: Obayashi, S., Deb, K., Poloni, C., Hiroyasu, T., Murata, T. (Eds.), Evolutionary Multi-Criterion Optimization. Springer, Berlin Heidelberg, pp. 126–140.
Nebro, A.J., Durillo, J.J., Garcia-Nieto, J., Coello Coello, C.A., Luna, F., Alba, E., 2009. SMPSO: a new PSO-based metaheuristic for multi-objective optimization. In: MCDM'09, IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making, pp. 66–73.
Runeson, P., Engström, E., 2012. Regression testing in software product line engineering. Adv. Comput. 86, 223–263.
Sencha, 2010. http://www.sencha.com/products/extjs.
Sheskin, D.J., 2007. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC.
Walcott, K.R., Soffa, M.L., Kapfhammer, G.M., Roos, R.S., 2006. TimeAware test suite prioritization. In: Proceedings of the 2006 International Symposium on Software Testing and Analysis. ACM, Portland, Maine, USA, pp. 1–12.
Wang, S., Ali, S., Gotlieb, A., 2013. Minimizing test suites in software product lines using weight-based genetic algorithms. In: Proceedings of the Fifteenth Annual Conference on Genetic and Evolutionary Computation. ACM, Amsterdam, The Netherlands, pp. 1493–1500.
Wang, S., Gotlieb, A., Ali, S., Liaaen, M., 2013. Automated test case selection using feature model: an industrial case study. In: Moreira, A., Schätz, B., Gray, J., Vallecillo, A., Clarke, P. (Eds.), Model-Driven Engineering Languages and Systems. Springer, Berlin Heidelberg, pp. 237–253.
Wang, S., Gotlieb, A., Liaaen, M., Briand, L.C., 2012. Automatic selection of test execution plans from a video conferencing system product line. In: Proceedings of the VARiability for You Workshop: Variability Modeling Made Useful for Everyone. ACM, Innsbruck, Austria, pp. 32–37.
Yoo, S., Harman, M., 2007. Pareto efficient multi-objective test case selection. In: Proceedings of the 2007 International Symposium on Software Testing and Analysis. ACM, London, United Kingdom, pp. 140–150.
Yoo, S., Harman, M., 2012. Regression testing minimization, selection and prioritization: a survey. Softw. Test. Verif. Reliab. 22, 67–120.
Yue, T., Briand, L.C., Labiche, Y., 2013. Facilitating the transition from use case models to analysis models: approach and experiments. ACM Trans. Softw. Eng. Methodol. 22, 1–38.
Zhang, Y., Finkelstein, A., Harman, M., 2008. Search based requirements optimisation: existing work and challenges. In: Paech, B., Rolland, C. (Eds.), Requirements Engineering: Foundation for Software Quality. Springer, Berlin Heidelberg, pp. 88–94.
Zheng, L., Harman, M., Hierons, R.M., 2007. Search algorithms for regression test case prioritization. IEEE Trans. Softw. Eng. 33, 225–237.
Zitzler, E., Laumanns, M., Thiele, L., 2001. SPEA2: improving the strength Pareto evolutionary algorithm. In: EUROGEN 2001 – Evolutionary Methods for Design, Optimization and Control with Applications to Industrial Problems, pp. 95–100.
Shuai Wang obtained his master's degree in computer science and technology, in the field of software testing, from Beihang University, Beijing, China. He is currently a PhD student at the Certus Software Verification & Validation Center hosted by Simula Research Laboratory and the Department of Informatics, University of Oslo, Norway. His main research interests are variability modeling, model-based testing and search-based testing. He is a student member of the ACM and the IEEE Computer Society.

Shaukat Ali is currently a research scientist at the Certus Software Verification & Validation Center, Simula Research Laboratory, Norway. He has been affiliated with Simula Research Laboratory since 2007 and has been involved in many industrial and research projects related to Model-Based Testing (MBT) and empirical software engineering since 2003. He has experience of working in industry and academic research groups in several countries, including the UK, Canada, Norway and Pakistan. He is a member of the IEEE Computer Society.

Arnaud Gotlieb obtained his PhD degree in the fields of software testing and constraint programming from the University of Nice-Sophia Antipolis in 2000. He worked for 7 years in industry at THALES (previously Dassault Electronics), where he was successively PhD student, engineer and project manager in the field of software testing. From 2002 to 2011, he worked in Rennes as an INRIA researcher on constraint-based testing. In 2011, he joined Simula Research Laboratory as a senior research scientist; since 2012, he has led the Certus Software Verification & Validation Center and headed the Software Engineering department of Simula. He is the main author of more than forty publications and coauthor of more than seventy, and he is the main architect of several constraint solving engines targeted at the testing of critical programs. He is a member of the IEEE Computer Society.