Progress On Approaches To Software Defect Prediction
Review Article
E-mail: jingxy_2000@126.com
Abstract: Software defect prediction is one of the most popular research topics in software engineering. It aims to predict defect-prone software modules before defects are discovered, so it can be used to better prioritise software quality assurance effort. In recent years, and especially in the last three years, many new defect prediction studies have been proposed. The goal of this study is to comprehensively review, analyse and discuss the state of the art of defect prediction. The authors survey almost 70 representative defect prediction papers from recent years (January 2014–April 2017), most of which were published in prominent software engineering journals and top conferences. The selected defect prediction papers are summarised along four aspects: machine learning-based prediction algorithms, manipulating the data, effort-aware prediction and empirical studies. The research community still faces a number of challenges in building methods, and many research opportunities exist. The identified challenges can give some practical guidelines for both software engineering researchers and practitioners in future software defect prediction.
1 Introduction

Software defect prediction is one of the most active research areas in software engineering and plays an important role in software quality assurance [1–5]. The growing complexity and interdependency of software have increased the difficulty of delivering high-quality, low-cost and maintainable software, as well as the chance of introducing software defects. A software defect usually produces incorrect or unexpected results and behaviours in unintended ways [6].

Defect prediction is therefore a crucial and essential activity. Using defect predictors can reduce costs and improve software quality by recognising defect-prone modules (instances) prior to testing, so that software engineers can effectively optimise the allocation of limited resources for testing and maintenance. Software defect prediction [7, 8] can be done by classifying a software instance (e.g. at the method, class, file, change or package level) as defective or non-defective. The defect prediction model can be built using various software metrics and historical defect data collected from previous releases of the same project [9, 10] or even from other projects [11–13]. Such a model is trained to predict whether software instances are defective or not. With the prediction model, software engineers can effectively allocate the available testing resources to the defect-prone instances to improve software quality in the early phases of the development life cycle. For example, if only 25% of testing resources are available, then software engineers can focus these resources on the instances that are more prone to defects. Therefore, high-quality, low-cost and maintainable software can be delivered within the given time, resources and budget. That is why software defect prediction is today a popular research topic in the software engineering field [14].

In the past few decades, more and more research works have paid attention to software defect prediction and a lot of papers have been published. There have already been several excellent systematic review works on software defect prediction [1, 3–5]. Catal and Diri [3] reviewed 74 defect prediction papers in 11 journals and several conference proceedings by focusing on the software metrics, datasets and methods used to build defect prediction models. Later on, Catal [4] investigated, by publication year, 90 defect prediction papers published between 1990 and 2009. They mainly surveyed machine learning and statistical analysis-based methods for defect prediction. Hall et al. [1] performed a systematic literature review to investigate how the context of the models, the independent variables used, and the modelling techniques applied influence the performance of defect models. Their analysis is based on 208 defect prediction studies published from January 2000 to December 2010. Recently, Malhotra [5] conducted a systematic review of studies in the literature that use machine learning techniques for software defect prediction. They identified 64 primary studies and seven categories of machine learning techniques from January 1991 to October 2013. In summary, the published defect prediction papers are complex and disparate, and no up-to-date comprehensive picture of the current state of defect prediction exists. These review works do not cover the latest progress of software defect prediction research.

To fill the gaps left by existing systematic review works, this paper tries to provide a comprehensive and systematic review of pioneering works on software defect prediction in the recent three years (i.e. from January 2014 to April 2017). A summary of the approaches can be used by researchers as a foundation for future investigations of defect prediction. Fig. 1 shows the number of papers matching the search keywords 'defect prediction' OR 'fault prediction' OR 'bug prediction' AND 'software' in four important computer science libraries (ACM, IEEE, Elsevier and Springer) from 2014 to 2016. It can be seen that the number of published papers is gradually increasing, which indicates that the research trend on the topic of defect prediction is growing.

Naturally, it is impossible to review all of the papers shown in Fig. 1. Similar to prior works [1, 3–5], we need to define some selection criteria. In this paper, we excluded papers that are not closely related to the research topic of defect prediction, repeated studies, and papers that do not belong to current research hotspots, by manual checking and screening. Thus, we carefully chose almost 70 representative defect prediction studies from recent years, most of which are published in prominent software engineering journals and top conferences. Through this paper, we provide an in-depth study of the major techniques, research hotspots and trends of software defect prediction.
Fig. 1 Published papers related to software defect prediction on ACM, IEEE, Elsevier and Springer libraries
source code [27]. These OO metrics simply count the number of variables, methods, classes and so on.

Process metrics are extracted from historical information archived in different software repositories, such as version control and issue tracking systems. These metrics reflect changes over time and quantify many aspects of the software development process, such as changes to source code, the number of code changes, developer information and so on. In recent years, a number of representative process metrics have been proposed. These metrics can be categorised into the following five groups: (i) code change-based metrics such as relative code change churn [28], change [29], change entropy [30], CM churn and code entropy [31]; (ii) developer information-based metrics such as the number of engineers [32], developer-module networks [33], developer network and social network [34], ownership and authorship [35, 36], developer focus and ownership [37], defect patterns of developers [38], and micro interaction metrics [39]; (iii) dependency analysis-based metrics such as dependency graph [40], socio-technical networks [41], change coupling [42], citation influence topic [43], and change genealogy [44]; (iv) project team organisation-based metrics such as organisational structure and organisational volatility [45–47]; (v) other process metrics such as popularity [48] and anti-pattern [49].

Table 1 shows the common CMs and process metrics for software defect prediction. Both CMs and process metrics can be used to build defect prediction models. However, there are ongoing debates on whether CMs are good defect predictors and whether process metrics are better than CMs. Menzies et al. [7] confirmed that CMs are still useful for building defect prediction models based on the NASA dataset. Zhang [50] found that defects can be predicted well by simply considering the LOC metric. Moser et al. [29] conducted a comparative analysis of the predictive power of code and process metrics for defect prediction. They observed that process metrics are more efficient defect predictors than CMs for the Eclipse dataset. Recently, Rahman et al. [21] performed an empirical study to compare code and process metrics. They concluded that CMs are less useful than process metrics because of the stagnation of CMs.

2.3 Public datasets

The defect dataset is one of the most important issues for conducting defect prediction. In the early stages, some academic researchers and companies employed non-public datasets, such as proprietary projects, to develop defect prediction models. However, it is not possible to compare the results of such methods with each other because their datasets cannot be obtained. Since machine learning researchers had similar problems in the 1990s, they created a repository called the University of California Irvine (UCI) Machine Learning Repository. Inspired by the success of the UCI repository, researchers created the PROMISE repository of empirical software engineering data, which has collected several publicly available datasets since 2005. In addition, some researchers [31, 44, 51–55] spontaneously publish their extracted datasets for further empirical studies on software defect prediction. In this section, we briefly introduce the existing publicly available and commonly used benchmark datasets. Table 2 shows the detailed description of the publicly available datasets.

The NASA benchmark dataset consists of 13 software projects [7, 56]. The number of instances ranges from 127 to 17,001, while the number of metrics ranges from 20 to 40. Each project in NASA represents a NASA software system or sub-system, which contains the corresponding defect-marking data and various static CMs. The repository records the number of defects for each instance by using a bug tracking system. The static CMs of the NASA datasets include size, readability and complexity attributes, among others, which are closely related to software quality.

The Turkish software dataset (SOFTLAB) consists of five projects, which are embedded controller software for white goods [12]. The SOFTLAB projects are obtained from the PROMISE repository and have 29 metrics. The number of instances ranges from 36 to 121.

Jureczko and Madeyski [57] collected a number of open source, proprietary and academic software projects, which are part of the PROMISE repository. The collected data consist of 92 versions of 38 different software development projects, including 48 versions of 15 open-source projects, 27 versions of six proprietary projects and 17 academic projects. Each project has 20 metrics in total, comprising McCabe's cyclomatic metrics, CK metrics and other OO metrics.

The datasets in ReLink were collected by Wu et al. [51] to improve defect prediction performance by increasing the quality of the defect data. The defect information in ReLink has been manually verified and corrected. ReLink consists of three projects and each one has 26 complexity metrics.

AEEEM was collected by D'Ambros et al. [31] to benchmark different defect prediction models. Each AEEEM dataset consists of 61 metrics: 17 source CMs, five previous-defect metrics, five entropy-of-change metrics, 17 entropy-of-source CMs, and 17 churn-of-source CMs [16].

The just-in-time (JIT) dataset was collected by Kamei et al. [52] to study the prediction of defect-inducing changes at the change level. It consists of six open source projects, and each project has 14 change metrics covering five dimensions: diffusion, size, purpose, history and experience.

The ECLIPSE1 bug data set was collected by Zimmermann et al. [53]. It contains defect data of three Eclipse releases (i.e. 2.0, 2.1 and 3.0), extracted at both the file and package levels. There are 31 static CMs at the file level and 40 metrics at the package level. The resulting data set lists the number of pre-release and post-release defects for every file and package in these three Eclipse releases.

The ECLIPSE2 bug data set was collected by Kim et al. [54] to study dealing with noise in defect prediction. This dataset contains two projects (SWT and Debug) from Eclipse 3.4. The defect data were collected by mining the Eclipse Bugzilla and CVS repositories. There are 17 metrics in total, covering four different types of metrics: complexity, OO, change and developer.

The NetGene dataset was collected by Herzig et al. [44] to study the predictive effectiveness of change genealogies in defect prediction. This dataset consists of four open source projects. Each project has a total of 456 metrics, including complexity metrics, network metrics and change genealogy metrics related to the history of a file.
Table 2 Summary of the publicly available datasets for software defect prediction

| Dataset | Description | Number of projects | Metrics used | Number of metrics | Granularity | Website |
| --- | --- | --- | --- | --- | --- | --- |
| NASA | NASA Metrics Data Program | 13 | CMs | 20 to 40 | function | http://openscience.us/repo/ |
| SOFTLAB | Software Research Laboratory from Turkey | 5 | CMs | 29 | function | http://openscience.us/repo/ |
| PROMISE | open source, proprietary and academic projects | 38 | CMs | 20 | class | http://openscience.us/repo/ |
| ReLink | recovering links between bugs and changes | 3 | CMs | 26 | file | http://www.cse.ust.hk/scc/ReLink.htm |
| AEEEM | open source projects for evaluating defect prediction models | 5 | CMs, process metrics | 61 | class | http://bug.inf.usi.ch/ |
| JIT | open source projects for JIT prediction | 6 | process metrics | 14 | change | http://research.cs.queensu.ca/kamei/jittse/jit.zip |
| ECLIPSE1 | Eclipse projects for predicting defects | 3 | CMs | 31, 40 | file, package | https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/ |
| ECLIPSE2 | Eclipse projects for handling noise in defect prediction | 2 | CMs, process metrics | 17 | file | https://code.google.com/archive/p/hunkim/wikis/HandlingNoise.wiki |
| NetGene | open source projects for predicting defects using change genealogies | 4 | CMs, process metrics | 465 | file | https://hg.st.cs.uni-saarland.de/projects/cg_data_sets/repository |
| AEV | industry projects for predicting defects | 3 | CMs, process metrics | 29 | file | http://www.ist.tugraz.at/_attach/Publish/AltingerHarald/MSR_2015_dataset_automotive.zip |
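The benchmark datasets above are typically distributed as flat files (CSV or ARFF) with one row per module, a set of metric columns and a defect label. As a minimal sketch, assuming a PROMISE-style CSV named camel-1.6.csv with a 'bug' count column (the file name and column names are illustrative, not prescribed by any of the repositories above), such a dataset can be loaded and split into metrics and labels as follows:

```python
import pandas as pd

# Hypothetical PROMISE-style file: one row per class, static metrics plus a
# 'bug' column giving the number of post-release defects.
data = pd.read_csv("camel-1.6.csv")

# Binarise the defect count: a module is defect-prone if it has >= 1 defect.
y = (data["bug"] > 0).astype(int)

# Everything except identifier and label columns is treated as a metric.
X = data.drop(columns=["name", "version", "bug"], errors="ignore")

print(f"{len(X)} modules, {X.shape[1]} metrics, {y.mean():.1%} defect-prone")
```

This X/y split is reused in the later sketches in this survey.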
Balance [7] is defined as 1 − √((0 − Pf)² + (1 − Pd)²)/√2. It combines Pf and Pd and is calculated from the Euclidean distance to the ROC (receiver operating characteristic) sweet spot Pf = 0 and Pd = 1. Hence, better (higher) balance values fall closer to the desired sweet spot of Pf = 0 and Pd = 1.

Accuracy is defined as (TP + TN)/(TP + FP + FN + TN), which denotes the percentage of correctly predicted instances.

Geometric mean (G-mean) [59, 65] is utilised for the overall evaluation of predictors in the imbalanced context. It computes the geometric mean of Pd and 1 − Pf, which is defined as √(Pd × (1 − Pf)).

Matthews correlation coefficient (MCC) [66, 67] measures the correlation between the observed and predicted binary classification, with values in [−1, 1]. A value of 1 denotes a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between observation and prediction.

AUC is the area under the ROC curve. This curve is plotted in a two-dimensional space with Pf as the x-coordinate and Pd as the y-coordinate. The AUC is known as a useful measure for comparing different models and is widely used because it is unaffected by class imbalance and independent of the prediction threshold. Other measures such as Pd and precision can vary with the prediction threshold value, whereas AUC considers prediction performance over all possible threshold values. A higher AUC represents better prediction performance, and an AUC of 0.5 corresponds to the performance of a random predictor [68].
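To make the relationships among these threshold-based measures concrete, the following sketch computes Pd, Pf, balance, G-mean, MCC and AUC for a toy set of predictions (the arrays are invented for illustration; only standard scikit-learn calls are used):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])             # 1 = defective
y_score = np.array([.1, .4, .8, .3, .2, .9, .6, .1, .7, .2])  # predicted risk
y_pred = (y_score >= 0.5).astype(int)                          # threshold 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
pd_ = tp / (tp + fn)          # probability of detection (recall)
pf = fp / (fp + tn)           # probability of false alarm
balance = 1 - np.sqrt((0 - pf) ** 2 + (1 - pd_) ** 2) / np.sqrt(2)
g_mean = np.sqrt(pd_ * (1 - pf))
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)   # threshold-independent

print(f"Pd={pd_:.2f} Pf={pf:.2f} balance={balance:.2f} "
      f"G-mean={g_mean:.2f} MCC={mcc:.2f} AUC={auc:.2f}")
```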
Popt [52, 60] and AUCEC [68] are two effort-aware measures. Similar to the AUC metric, Popt = 1 − Δopt, where Δopt is defined as the area between the predicted model and the optimal model. AUCEC is defined as the area under the cost-effectiveness curve of the predicted model. These measures consider the LOC to be inspected or tested by quality assurance teams or developers; in fact, the two are essentially equivalent. The cost-effectiveness curve is plotted in a two-dimensional space with the LOC inspected as the x-coordinate and recall as the y-coordinate. The idea of cost-effectiveness for defect prediction models has practical significance: cost-effectiveness measures how many defects can be found among the top n% of LOC inspected or tested. That is, if a certain prediction model can find more defects with less inspection and testing effort than other models, we can say that its cost-effectiveness is higher.

Besides these commonly used evaluation measures, there are some less common evaluation measures such as overall error rate, H-measure [69], J-coefficient [59] and Brier score [70].
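A simplified sketch of such a cost-effectiveness evaluation is shown below: modules are sorted by predicted risk, and the fraction of defects found is tracked against the fraction of LOC inspected. The normalised area is an AUCEC-style value; the data are illustrative and this is not the exact Popt formulation of [52, 60].

```python
import numpy as np

def cost_effectiveness_curve(risk, loc, defective):
    """Sort modules by predicted risk and accumulate LOC vs. defects found."""
    order = np.argsort(-np.asarray(risk))          # highest risk first
    loc = np.asarray(loc, dtype=float)[order]
    bugs = np.asarray(defective, dtype=float)[order]
    x = np.cumsum(loc) / loc.sum()                 # fraction of LOC inspected
    y = np.cumsum(bugs) / bugs.sum()               # fraction of defects found
    return np.insert(x, 0, 0.0), np.insert(y, 0, 0.0)

# Toy data: predicted risk, module size (LOC) and actual defectiveness.
risk = [0.9, 0.2, 0.7, 0.4, 0.1]
loc = [120, 800, 60, 300, 500]
defective = [1, 0, 1, 1, 0]

x, y = cost_effectiveness_curve(risk, loc, defective)
aucec = np.trapz(y, x)                             # area under the CE curve
print(f"defects found in top 20% of LOC: {np.interp(0.2, x, y):.0%}, "
      f"AUCEC = {aucec:.2f}")
```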
3 Categories of prediction algorithms

Machine learning techniques are the most popular methods for defect prediction [1, 3, 5]. As new machine learning techniques have been developed, various algorithms such as dictionary learning [71], collaborative representation learning [72], multiple kernel ensemble learning [73], deep learning [74, 75] and transfer learning [76, 77] have been applied to build better defect prediction models. From the machine learning point of view, we roughly divide the related defect prediction literature into the following three categories: supervised, semi-supervised and unsupervised methods. Supervised learning methods use all of the labelled training data in a project to build defect prediction models. Semi-supervised learning methods construct defect prediction models by employing only a small number of labelled training instances together with a large number of unlabelled instances in a project. Unsupervised learning methods do not require labelled training data; they directly utilise the unlabelled data in a project to learn defect prediction models.

3.1 Supervised methods

By utilising a recently developed dictionary learning technique, Jing et al. [71] designed a cost-sensitive discriminative dictionary learning (CDDL) approach to predict software defects. CDDL exploits the class information of historical data to improve the discriminative power and assigns different misclassification costs by increasing the punishment on type II misclassifications (i.e. a defective class is predicted as non-defective) to address the class imbalance problem.

At the same time, Jing et al. [72] presented a collaborative representation classification (CRC)-based software defect prediction (CSDP) method. CSDP makes use of the recently proposed CRC technique, whose main idea is that an instance can be collaboratively represented by a linear combination of all other instances.

Wang et al. [73] presented a software defect prediction method named multiple kernel ensemble learning (MKEL). MKEL can better represent the defect data in a high-dimensional feature space through multiple kernel learning and reduce the bias caused by the non-defective class by assembling a series of weak classifiers. Besides, MKEL specially designs a new sample weight vector updating strategy to relieve the class imbalance problem.

To bridge the gap between programmes' semantics and defect prediction features, Wang et al. [74] automatically learned semantic representations of programmes from source code by utilising deep learning techniques. They first extracted programmes' abstract syntax trees to obtain token vectors. Based on the token vectors, they leveraged a deep belief network to automatically learn semantic features.

With the utilisation of popular deep learning techniques, Yang et al. [75] presented the Deeper method for JIT defect prediction. Deeper first employs a deep belief network to extract a set of expressive metrics from an initial set of change metrics and then trains a classifier with the extracted metrics to predict defects.

Chen et al. [76] designed a novel double transfer boosting (DTB) algorithm for cross-company prediction. DTB first adopts data gravitation to reconstruct the distribution of the cross-company (CC) data so that it is close to the within-company (WC) data. Then, DTB utilises the transfer boosting learning technique to remove negative instances in the CC data by using a limited number of WC data.

Xia et al. [77] developed a cross-project defect prediction method named hybrid model reconstruction approach (HYDRA), which contains two phases: a genetic algorithm (GA) phase and an ensemble learning phase. At the end of these two phases, HYDRA creates a massive composition of classifiers that can be applied to predict defective instances in the target project.

From the view of optimisation, Canfora et al. [78] treated the defect prediction problem as a multi-objective optimisation problem. They proposed using a GA to train logistic regression and decision tree models, called the multi-objective defect predictor (MODEP). MODEP allows software engineers to select predictors that reach a specific trade-off between the cost of code inspection and the number of defect-prone instances or the number of defects that MODEP can predict.

Ryu et al. [79] presented a transfer cost-sensitive boosting (TCSBoost) method to deal with the class imbalance problem for cross-project defect prediction. Based on the distributional characteristics, TCSBoost assigns different misclassification costs to correctly and incorrectly classified instances in each iteration of the boosting algorithm.

Yang et al. [80] leveraged decision tree and ensemble learning techniques to design a two-layer ensemble learning (TLEL) method for JIT defect prediction. TLEL first builds a random forest model by combining decision trees and bagging. With the utilisation of a random under-sampling algorithm, TLEL then trains multiple different random forest models and assembles them once more with stacking.
3.2 Semi-supervised methods

By utilising and combining semi-supervised learning and ensemble learning, Wang et al. [81] presented a non-negative sparse-based SemiBoost (NSSB) method for software defect prediction. On one hand, NSSB makes better use of a large number of unlabelled instances and a small number of labelled instances through semi-supervised learning. On the other hand, NSSB assembles a number of weak classifiers to reduce the bias caused by the non-defective class through ensemble learning.

With the utilisation of graph-based semi-supervised learning and sparse representation learning techniques, Zhang et al. [82]
proposed a non-negative sparse graph-based label propagation (NSGLP) method for defect prediction. NSGLP first resamples the labelled non-defective instances to generate a balanced training dataset. Then, NSGLP constructs the weights using a non-negative sparse graph algorithm to better learn the data relationships. Finally, NSGLP iteratively predicts the unlabelled instances through a label propagation algorithm.

3.3 Unsupervised methods

It is a challenging problem to enable defect prediction for new projects or projects without sufficient historical data. To address this limitation, Nam and Kim [83] presented two unsupervised methods: clustering instances and labelling instances in clusters (CLA), and clustering instances, labelling instances, metric selection and instance selection (CLAMI). The key idea of these two methods is to label an unlabelled dataset by using the magnitude of the metric values. Hence, CLA and CLAMI are automated and require no manual effort.
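The core labelling heuristic behind CLA/CLAMI can be illustrated with a simplified sketch (a reading of the idea, not the authors' full algorithm with metric and instance selection): each instance is scored by how many of its metrics exceed the project-wide median, and instances with above-average scores are treated as defect-prone.

```python
import numpy as np

def cla_like_labels(X):
    """Label unlabelled instances by the magnitude of their metric values.

    X: (n_instances, n_metrics) array of software metrics.
    Returns a 0/1 array where 1 marks instances treated as defect-prone.
    """
    medians = np.median(X, axis=0)
    # k = number of metrics whose value exceeds the per-metric median.
    k = (X > medians).sum(axis=1)
    # Instances violating more than the average number of metric cutoffs
    # fall into the "defect-prone" group.
    return (k > k.mean()).astype(int)

# Toy unlabelled project: 100 modules, 20 skewed metrics.
X_unlabelled = np.random.RandomState(0).gamma(2.0, 2.0, size=(100, 20))
pseudo_labels = cla_like_labels(X_unlabelled)
print(f"{pseudo_labels.mean():.0%} of instances labelled defect-prone")
```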
From the perspective of data clustering, Zhang et al. [84] proposed leveraging spectral clustering (SC) to tackle the heterogeneity between the source project and the target project in cross-project defect prediction. SC is a connectivity-based unsupervised clustering method which does not require any training data. Thus, SC does not suffer from the problem of heterogeneity, which is an advantage of unsupervised clustering methods.

Summary: By comparing and analysing the above defect prediction methods based on machine learning techniques, as shown in Table 5, we can make the following observations: (i) defect prediction methods usually directly employ or modify existing well-known machine learning algorithms for different prediction contexts; (ii) most defect prediction methods leverage supervised learning techniques to build classification models, and supervised defect prediction methods are generally able to improve the prediction performance, especially the classification accuracy; (iii) most defect prediction methods are based on software CMs, possibly because they are easy to collect compared with process metrics.

4 Manipulating the data

In this section, we review the related defect prediction studies from the perspective of manipulating the data. Specifically, we compare and analyse these studies in the following three aspects: attributes for prediction, data adoption and dataset quality.

4.1 Attributes for prediction

4.1.1 New software metrics: Software metrics, such as source CMs, change churn, and the number of previous defects, have been actively studied to enable defect prediction and facilitate software quality assurance. Recently, several methods have been proposed to design new software metrics for defect prediction [39, 85–87].

Considering developer behaviour, Lee et al. [39] leveraged developer interaction information to propose micro-interaction metrics (MIMs). They separately used MIMs, source CMs and change history metrics to build defect prediction models and compared their prediction performance. They found that MIMs significantly improve overall defect prediction accuracy when combined with existing software metrics, and that they perform well in a cost-effective context. The findings can help software engineers to better identify their own inefficient behaviours during software development.

To measure the quality of source code, Okutan and Yildiz [85] introduced a new metric called lack of coding quality (LOCQ). They applied Bayesian networks to examine the probabilistic influential relationships between software metrics and defect-proneness. They found that response for class, lines of code, and LOCQ are the most effective metrics, whereas coupling between objects, weighted methods per class, and lack of cohesion of methods are less effective metrics for predicting defects.

To explore the predictive ability of mutation-aware defect prediction, Bowes et al. [86] defined 40 mutation metrics. They built separate defect prediction models to compare the effectiveness of these mutation-aware metrics and 39 source CMs (mutation-unaware), as well as their combination. They found that mutation-aware metrics can significantly improve defect prediction performance.

To investigate the effect of concerns on software quality, Chen et al. [87] approximated software concerns as topics by using a statistical topic modelling technique. They proposed a set of topic-based metrics including the number of topics, the number of defect-prone topics, topic membership and defect-prone topic membership. Results show that topic-based metrics provide additional explanatory power over existing structural and historical metrics, which can help better explain software defects.

4.1.2 Feature selection: Feature selection (metric selection) has become a focus of machine learning and data mining, and has also been used in software defect prediction [88, 89]. The aim of feature selection is to select the features that are most relevant to the target class from high-dimensional features and to remove the features that are redundant and uncorrelated. After feature selection, the classification performance of prediction models should improve.

To check the positive effects of combining feature selection and ensemble learning on the performance of defect prediction, Laradji et al. [90] presented an average probability ensemble (APE) method. APE can alleviate the effects of metric redundancy and data imbalance on defect prediction performance. They found that it is very necessary to carefully select relevant and informative features for accurate defect prediction.

Liu et al. [91] designed a new feature selection framework named feature clustering and feature ranking (FECAR). FECAR first partitions the original features into k clusters, and then selects
Table 5 Comparison of the defect prediction methods using machine learning techniques

| Category | Study | Technique | Dataset | Year |
| --- | --- | --- | --- | --- |
| supervised | Jing et al. [71] | dictionary learning, cost-sensitive learning | NASA | 2014 |
| | Jing et al. [72] | collaborative representation | NASA | 2014 |
| | Wang et al. [73] | multiple kernel ensemble learning, boosting | NASA | 2016 |
| | Wang et al. [74] | deep learning | PROMISE | 2016 |
| | Yang et al. [75] | deep learning | JIT | 2015 |
| | Chen et al. [76] | transfer learning, boosting | PROMISE | 2015 |
| | Xia et al. [77] | transfer learning, GA, AdaBoost | PROMISE | 2016 |
| | Canfora et al. [78] | multi-objective optimisation, GA | PROMISE | 2015 |
| | Ryu et al. [79] | boosting, cost-sensitive learning, transfer learning | PROMISE | 2017 |
| | Yang et al. [80] | decision tree, bagging, random forest | JIT | 2017 |
| semi-supervised | Wang et al. [81] | SemiBoost, graph learning, sparse representation | NASA | 2016 |
| | Zhang et al. [82] | sparse representation, sparse graph, label propagation | NASA | 2017 |
| unsupervised | Nam and Kim [83] | clustering, feature selection | ReLink | 2015 |
| | Zhang et al. [84] | spectral clustering | NASA, AEEEM, PROMISE | 2016 |
relevant features from each cluster. Results show that FECAR is effective at selecting features for defect prediction.

To study feature selection methods with a certain noise tolerance, Liu et al. [92] presented a feature selection method called feature clustering with selection strategies (FECS). FECS contains two phases: a feature clustering phase and a feature selection phase. They found that FECS is effective on both noise-free and noisy datasets.

Xu et al. [93] presented a feature selection framework named maximal information coefficient with hierarchical agglomerative clustering (MICHAC). MICHAC first ranks candidate features to filter out irrelevant ones by using the maximal information coefficient. Then MICHAC groups features with hierarchical agglomerative clustering and removes redundant features by selecting one feature from each resulting group.

Summary: Table 6 shows the comparison of defect prediction studies that focus on designing new software metrics and related feature selection techniques. Software metrics are very important for building defect prediction models. In the past decades, many effective metrics have been proposed, especially CMs and process metrics. To facilitate software quality assurance, new defect prediction metrics are being increasingly designed. However, there may be some redundant and irrelevant metrics which are harmful to the prediction performance of the learned predictors. To address this problem, feature selection techniques can be used to remove those redundant and irrelevant metrics with the aim of improving the performance.
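Most of the frameworks above follow a relevance-plus-redundancy recipe: rank metrics by their relevance to the defect label and keep only weakly correlated representatives. A minimal sketch of that general recipe (not FECAR, FECS or MICHAC themselves), using mutual information for relevance and pairwise correlation for redundancy:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def select_metrics(X: pd.DataFrame, y, k=8, max_corr=0.8):
    """Greedy relevance/redundancy metric selection sketch."""
    relevance = pd.Series(mutual_info_classif(X, y, random_state=0),
                          index=X.columns).sort_values(ascending=False)
    corr = X.corr().abs()
    selected = []
    for metric in relevance.index:
        # Skip metrics highly correlated with an already selected one.
        if all(corr.loc[metric, s] < max_corr for s in selected):
            selected.append(metric)
        if len(selected) == k:
            break
    return selected

# X, y as in the earlier dataset sketch; keep at most eight weakly
# correlated metrics.
# kept = select_metrics(X, y)
```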
4.2 Data adoption

4.2.1 Local versus global models: Menzies et al. [94] were the first to present local and global models for defect prediction and effort estimation. Local models refer to clustering the whole dataset into smaller subsets with similar data properties and building prediction models by training on these subsets. In contrast, a global model is trained on the whole dataset.

To investigate the performance of three methods, i.e. local, global and multivariate adaptive regression splines (MARS) models, Bettenburg et al. [95] built separate defect prediction models for each. Specifically, they constructed local models by splitting the data into smaller homogeneous subsets and learning several individual statistical models, one for each subset. They then built a global model on the whole dataset. In addition, they treated MARS as a global model with local considerations. In terms of model fit and predictive performance, they found that local models can significantly outperform the global model. For practical applications, they observed that MARS produces general trends and can be considered a hybrid between global and local models.

Herbold et al. [96] performed a large case study on local models in the context of cross-project defect prediction. They evaluated and compared the performance of a local model, a global model and a transfer learning method for cross-project defect prediction. Results show that local models make only a small difference in prediction performance compared with the global model and the transfer learning method for cross-project defect prediction.

Mezouar et al. [97] compared local and global defect prediction models in the effort-aware context. They found that there is always at least one local model that outperforms the global model, but there also always exists another local model that performs worse than the global model. They further observed that the worse local model is trained on a subset of the data with a low percentage of defects. Based on these findings, they recommended that files with smaller size should receive special attention in future effort-aware defect prediction studies, and that it is beneficial to combine the advantages of global models by taking local considerations into account in practice.
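The contrast between the two model families can be sketched as follows: a local approach clusters the training data and fits one classifier per cluster, routing each test instance to the classifier of its nearest cluster, whereas the global approach fits a single classifier on everything. This is a simplified sketch, not the exact setups of [94–97]; it reuses the training split from the earlier cost-sensitive sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Reuse the earlier training/test split, as plain NumPy arrays.
Xtr, Xte, ytr = np.asarray(X_train), np.asarray(X_test), np.asarray(y_train)

# Global model: a single classifier trained on the whole training set.
global_model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

# Local models: cluster the training data, then fit one classifier per cluster.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xtr)
local_models = {}
for c in range(km.n_clusters):
    mask = km.labels_ == c
    # Fall back to the global model if a cluster holds only one class.
    if len(np.unique(ytr[mask])) > 1:
        local_models[c] = LogisticRegression(max_iter=1000).fit(Xtr[mask], ytr[mask])
    else:
        local_models[c] = global_model

# Each test instance is scored by the model of the cluster it falls into.
local_pred = np.array([local_models[c].predict(row.reshape(1, -1))[0]
                       for c, row in zip(km.predict(Xte), Xte)])
```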
4.2.2 Heterogeneous data: Heterogeneous defect prediction refers to predicting the defect-proneness of software instances in a target project by using heterogeneous metric data collected from other projects. It provides a new perspective on defect prediction and has received much research interest. Recently, several heterogeneous defect prediction models [66, 98–100] have been proposed to predict defects across projects with heterogeneous metric sets (i.e. the source and target projects have different metric sets).

Since metrics other than the common metrics might have favourable discriminant ability, Jing et al. [66] proposed a heterogeneous defect prediction method named CCA+ (canonical correlation analysis), which utilises a unified metric representation (UMR) and a CCA-based transfer learning technique. Specifically, the UMR consists of three types of metrics: the common metrics of the source and target projects, the source-project-specific metrics, and the target-project-specific metrics. By learning a pair of projective transformations under which the correlation between the source and target projects is maximised, CCA+ can make the data distribution of the target project similar to that of the source project.

At the same time, Nam and Kim [98] presented another solution for heterogeneous defect prediction. They first employed a metric selection technique to remove redundant and irrelevant metrics from the source project. Then, they matched the metrics of the source and target projects based on metric similarity, such as distribution or correlation (e.g. Kolmogorov–Smirnov test-based matching, KSAnalyser). After these processes, they finally arrived at matched source and target metric sets. With the obtained metric sets, they built a heterogeneous defect prediction model to predict the labels of the instances in a target project.
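The metric-matching step of the second approach can be approximated with a short sketch: for every pair of source and target metrics, compare their distributions with a Kolmogorov–Smirnov test and keep the most similar pairs. This is a simplified illustration of the matching idea, not the authors' full KSAnalyser with its cutoffs and maximum-weight matching.

```python
import numpy as np
from scipy.stats import ks_2samp

def match_metrics(source: np.ndarray, target: np.ndarray, min_p=0.05):
    """Greedily pair source/target metric columns with similar distributions."""
    pairs, used_targets = [], set()
    for i in range(source.shape[1]):
        # p-value of the KS test: higher means more similar distributions.
        scores = [(ks_2samp(source[:, i], target[:, j]).pvalue, j)
                  for j in range(target.shape[1]) if j not in used_targets]
        if not scores:
            break
        p, j = max(scores)
        if p >= min_p:                  # only keep sufficiently similar pairs
            pairs.append((i, j, p))
            used_targets.add(j)
    return pairs

# source_X and target_X are metric matrices with different metric sets.
# matched = match_metrics(source_X, target_X)
```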
Table 6 Comparison of the defect prediction studies that handle the metrics

| Category | Study | Topic | Technique | Dataset | Year |
| --- | --- | --- | --- | --- | --- |
| new metrics | Lee et al. [39] | predictive ability evaluation of the micro interaction metrics | correlation-based feature subset selection, random forest | Eclipse system | 2016 |
| | Okutan and Yildiz [85] | evaluation of the lack of coding quality metric | Bayesian networks | PROMISE | 2014 |
| | Bowes et al. [86] | predictive ability evaluation of the mutation metrics | naive Bayes, logistic regression, random forest, J48 | 3 real-world systems | 2016 |
| | Chen et al. [87] | explanatory power evaluation of topic-based metrics | latent Dirichlet allocation, logistic regression, Spearman correlation analysis | 4 real-world systems | 2017 |
| feature selection | Laradji et al. [90] | validation of combining feature selection and ensemble learning | APE, weighted support vector machines | NASA | 2015 |
| | Liu et al. [91] | methods to build and evaluate feature selection techniques | FF-correlation, FC-relevance | NASA, Eclipse | 2014 |
| | Liu et al. [92] | evaluation of a cluster-based feature selection method with a certain noise tolerance | k-medoids clustering, heuristic selection strategy | NASA, Eclipse | 2015 |
| | Xu et al. [93] | methods to build and evaluate feature selection techniques | hierarchical agglomerative clustering, maximal information coefficient | NASA, AEEEM | 2016 |
He et al. [99] presented a cross-project defect prediction approach with imbalanced feature sets to address the problem of heterogeneous metric sets in cross-project defect prediction. They used the distribution characteristic vectors [13] of each instance as new metrics to enable defect prediction.

To deal with the class imbalance problem under the heterogeneous cross-project defect prediction setting, Cheng et al. [100] presented a cost-sensitive correlation transfer support vector machine (CCT-SVM) method based on CCA+ [66]. Specifically, to alleviate the influence of imbalanced data, they employed different misclassification costs for the defective and non-defective classes by incorporating the cost factors into the support vector machine (SVM) model.

Summary: Table 7 shows the comparison of defect prediction studies focusing on data adoption. Local versus global models investigate how to find training data with a distribution similar to the test data; prediction performance is usually better when training and test data have similar distributions [12, 16]. In practice, human labelling for a large number of unlabelled modules is costly and time consuming, and may not be perfectly accurate. Hence, it is difficult to collect defect information to label a dataset for training a prediction model. Heterogeneous defect prediction has good potential to use all of the heterogeneous data of software projects for defect prediction on new projects or projects with limited historical defect data [66, 98]. However, the main challenge of heterogeneous defect prediction is to overcome the data distribution differences between source and target projects. Thus, it needs to reshape the distribution of the source data to be similar to the target data. In short, it is very useful and interesting to develop high-quality heterogeneous defect prediction methods in future research.

4.3 Dataset quality

4.3.1 Handling class imbalance: Software defect datasets are often highly imbalanced [65, 101, 102]. That is, the number of defective instances (the minority) is usually much smaller than the number of non-defective (majority) ones. It is challenging for most conventional classification algorithms to work with data that have an imbalanced class distribution. The imbalanced distribution can cause misclassification of instances in the minority class, and this is an important factor accounting for unsatisfactory prediction performance [1, 61]. Recently, more and more researchers have paid attention to the class imbalance problem in defect prediction [69, 103–105].

To deal with the within-project and cross-project class imbalance problems simultaneously, Jing et al. [103] developed a unified defect prediction framework. They first utilised improved subclass discriminant analysis (ISDA) to achieve balanced subclasses for within-project imbalanced data classification. Then, they made the distributions of the source and target data similar through semi-supervised transfer component analysis (SSTCA), and combined SSTCA with ISDA to propose the SSTCA + ISDA approach for cross-project class imbalance learning.

To handle the class imbalance problem in the context of online change defect prediction, Tan et al. [104] proposed leveraging resampling and updatable classification techniques. They adopted several resampling methods, including simple duplication, the synthetic minority over-sampling technique (SMOTE), spread subsample, and resampling with/without replacement, to increase the percentage of defective instances in the training set and address the imbalanced data challenge.

Ryu et al. [69] investigated the feasibility of using class imbalance learning for cross-project defect prediction. They designed a boosting-based model named value cognitive boosting with support vector machine (VCB-SVM). It sets similarity weights according to the distributional characteristics and combines the weights with the asymmetric misclassification cost designed by the boosting algorithm for appropriate resampling.

By utilising class overlap reduction and ensemble imbalance learning techniques, Chen et al. [105] presented a new software defect prediction method called neighbour cleaning learning (NCL) + ensemble random under-sampling (ERUS). On one hand, they eliminated the overlapping non-defective instances through the NCL algorithm. On the other hand, they resampled non-defective instances to construct balanced training subsets through the ERUS algorithm. Finally, they assembled a series of classifiers into an ensemble prediction model through adaptive boosting (AdaBoost).

Wu et al. [106] presented a cost-sensitive local collaborative representation (CLCR) method for defect prediction. CLCR explores the neighbourhood information among software instances to enhance the prediction ability of the model and incorporates different cost factors for the defective and non-defective classes to handle the class imbalance problem.

Liu et al. [107] presented a new two-stage cost-sensitive learning method for defect prediction, which employs cost information in both the feature selection and classification stages. In the first stage, they designed three cost-sensitive feature selection algorithms by utilising different misclassification costs. In the second stage, they employed a cost-sensitive back-propagation neural network classification algorithm with a threshold-moving strategy.

Rodriguez et al. [108] handled the imbalanced data problem by comparing and evaluating different types of algorithms for software defect prediction. They separately used sampling, cost-sensitive, ensemble and hybrid methods to build prediction models. Results show that these techniques can enhance the correct classification of the defective class, and that the preprocessing of the data affects the improvement of the prediction model.
Table 7 Comparison of the defect prediction studies focusing on the data adoption

| Category | Study | Topic | Technique | Dataset | Year |
| --- | --- | --- | --- | --- | --- |
| local versus global models | Bettenburg et al. [95] | fit ability comparison of local and global models | clustering, MARS, linear regression | six open source projects | 2015 |
| | Herbold et al. [96] | predictive ability comparison of local and global models in a cross-project context | clustering, support vector machine | NASA, AEEEM, PROMISE | 2017 |
| | Mezouar et al. [97] | predictive ability comparison of local and global models | k-medoids clustering, spectral clustering | AEEEM, PROMISE | 2016 |
| heterogeneous defect prediction models | Jing et al. [66] | a method to solve the heterogeneous metric problem | canonical correlation analysis, transfer learning | NASA, SOFTLAB, ReLink, AEEEM | 2015 |
| | Nam and Kim [98] | a method to solve the heterogeneous metric problem | metric selection, metric matching, transfer learning | NASA, SOFTLAB, ReLink, AEEEM, PROMISE | 2015 |
| | He et al. [99] | feasibility of using different metrics | statistics, logistic regression | ReLink, AEEEM, PROMISE | 2014 |
| | Cheng et al. [100] | an improved method to solve the heterogeneous metric problem | canonical correlation analysis, transfer learning, support vector machine | NASA, SOFTLAB, ReLink, AEEEM | 2016 |
Malhotra and Khanna [109] compared three data sampling techniques (resample with replacement, spread subsample, SMOTE) and cost-sensitive MetaCost learners with seven different cost ratios for change prediction. They employed six machine learning classifiers to conduct change prediction under two settings: ten-fold cross-validation and inter-release validation. They found that the resample-with-replacement sampling technique performs better than the other techniques.
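Most of the studies above rebalance the training set before fitting a classifier. A minimal sketch of the two simplest options, random over-sampling of the defective class and random under-sampling of the non-defective class, in plain NumPy (SMOTE and the other techniques cited above are more elaborate):

```python
import numpy as np

def random_oversample(X, y, rng=np.random.default_rng(0)):
    """Duplicate random defective instances until both classes are equal."""
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority))
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, rng=np.random.default_rng(0)):
    """Drop random non-defective instances until both classes are equal."""
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]

# Assumes the defective class (label 1) is the minority.
# X_bal, y_bal = random_oversample(np.asarray(X_train), np.asarray(y_train))
```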
4.3.2 Handling noise: Defect information plays an important role in software maintenance activities such as measuring quality and predicting defects. Since current defect collection practices are based on optional bug-fix keywords or bug report links in change logs, the defect data collected from change logs and bug reports can include noise [51, 54, 110, 111]. Such biased defect data will affect defect prediction performance.

To examine the characteristics of mislabelling and the impact of realistic mislabelling on the prediction performance and interpretation of defect models, Tantithamthavorn et al. [112] conducted an in-depth study of 3931 manually curated issue reports from two large open-source systems for defect prediction. They found that issue report mislabelling is not random, that it rarely impacts the precision of defect prediction models, and that it does not heavily influence the most influential metrics.

Herzig et al. [113] investigated the impact of tangled code changes on defect prediction models. They found that up to 20% of all bug-fixing changes consist of multiple tangled changes, which can severely influence bug counting and bug prediction models. Due to tangled changes, at least 16.6% of all source files are incorrectly associated with bug reports when predicting bug-prone files. Experimental results on tangled bug datasets show that untangling tangled code changes can result in more accurate models. They recommended that future case studies explore better change organisation to reduce the impact of tangled changes.
4.3.3 Privacy-preserving data sharing: Due to privacy concerns, most companies are not willing to share their data. For these reasons, many researchers doubt the practicality of data sharing for research purposes. Recently, the privacy preservation issue has been investigated in several software engineering applications, e.g. software defect prediction [64, 114] and software effort estimation [115].

To deal with the privacy preservation problem in the multi-party scenario for cross-project defect prediction, Peters et al. [116] presented LACE2 (large-scale assurance confidentiality environment). LACE2 combines the CLIFF, LeaF and MORPH algorithms. With LACE2, data owners can incrementally add data to a private cache and contribute 'interesting' data that are not similar to the current content of the private cache. Results show that LACE2 can produce higher privacy and provide better defect prediction performance.

Summary: To better clarify and compare the differences among these defect prediction studies, we provide a brief summary in Table 8. The above-mentioned defect prediction studies mainly pay attention to dataset quality, including the class imbalance, data noise and privacy-preserving data sharing problems. These studies explore various modelling techniques and provide some profound implications. As mentioned in Section 3, some machine learning-based methods also consider the class imbalance problem by using different types of techniques, including cost-sensitive learning (Jing et al. [71], Ryu et al. [79]), ensemble learning (Wang et al. [73, 81]) and data sampling algorithms (Wang et al. [74], Yang et al. [75], Chen et al. [76], Zhang et al. [82]). It can be seen that data sampling, ensemble learning, cost-sensitive learning and their hybrids are commonly used to deal with the class imbalance problem in the domain of defect prediction. Moreover, most defect prediction studies use sampling techniques, possibly because they are simple, efficient and easy to implement. For handling noise, a noise-free data set is very important for building effective defect prediction models; data containing noise has a large impact on defect prediction performance. Regarding privacy, most data owners are not willing to share their data, which restricts the use of these data for cross-project defect prediction, especially for new projects or projects without sufficient historical defect data. Hence, it is very necessary to further investigate dataset quality in future defect prediction studies.

5 Effort-aware context-based defect prediction studies

To prioritise quality assurance efforts, defect prediction techniques are often used to rank software instances based on their probability of having a defect or their expected number of defects [61, 117, 118]. With these defect prediction techniques, software engineers can allocate limited testing or inspection resources to the most defect-prone instances. Recently, effort-aware context-based defect prediction methods [117, 118] have been proposed.
Table 8 Comparison of the defect prediction studies handling dataset quality

| Category | Study | Topic | Technique | Dataset | Year |
| --- | --- | --- | --- | --- | --- |
| class imbalance | Jing et al. [103] | method to solve the class imbalance problem | ISDA | SOFTLAB, NASA, ReLink, AEEEM | 2017 |
| | Tan et al. [104] | online change prediction with different sampling techniques | simple duplicate, SMOTE, spread subsample, resampling with/without replacements | six open source projects | 2015 |
| | Ryu et al. [69] | method to solve cross-project class imbalance | boosting, under-sampling, over-sampling | NASA, SOFTLAB | 2016 |
| | Chen et al. [105] | method to solve the class imbalance problem | ensemble random under-sampling | NASA | 2016 |
| | Wu et al. [106] | method to solve the class imbalance problem | cost-sensitive learning | NASA | 2016 |
| | Liu et al. [107] | method to solve the class imbalance problem | cost-sensitive learning | NASA | 2014 |
| | Rodriguez et al. [108] | predictive ability comparison of various class imbalance techniques | sampling, cost-sensitive, ensemble, hybrid | NASA | 2014 |
| | Malhotra and Khanna [109] | predictive ability comparison of various class imbalance techniques | resampling with replacement, spread subsample, SMOTE, MetaCost | six open source projects | 2017 |
| data noise | Tantithamthavorn et al. [112] | validation of mislabelling on the performance and interpretation | bootstrap resampling, random forest, Scott-Knott test | NASA, PROMISE | 2015 |
| | Herzig et al. [113] | validation of tangled code changes on the performance | heuristic-based untangling, Caret package | five open source projects | 2016 |
| privacy-preserving data sharing | Peters et al. [116] | privacy-preserving algorithm for cross-project prediction | CLIFF, LeaF, MORPH | PROMISE | 2015 |
These methods factor in the effort needed to inspect or test code when evaluating the effectiveness of prediction models, leading to more realistic performance evaluations. Generally, effort-aware studies provide a new interpretation and a practical, adoption-oriented view of defect prediction results. These methods often use the LOC metric as a proxy measure for inspection effort. Considering the practical significance, more and more defect prediction studies make use of effort-aware performance evaluation [21, 38, 52, 60, 68].

Zhou et al. [119] provided a comprehensive investigation of the confounding effect of class size on the associations between OO metrics and defect-proneness in the effort-aware prediction context. They employed statistical regression techniques to empirically study the effectiveness of defect prediction models. Results show that the performance of the learned models usually improves significantly after removing the confounding effect, in terms of both ranking and classification under effort-aware evaluation.

To perform an in-depth evaluation of the ability of slice-based cohesion metrics in effort-aware post-release defect prediction, Yang et al. [120] compared and evaluated the effect of slice-based cohesion metrics and baseline (code and process) metrics. They found that slice-based cohesion metrics are complementary to the code and process metrics. This suggests that there is practical value in applying slice-based cohesion metrics for effort-aware post-release defect prediction.

To examine the effect of the package-modularisation metrics proposed by Sarkar et al. [121] in the context of effort-aware defect prediction, Zhao et al. [122] compared and evaluated the effectiveness of these new package-modularisation metrics and traditional package-level metrics. They found that these new package-modularisation metrics are useful for developing high-quality software systems in the effort-aware context.

To investigate the predictive effectiveness of simple unsupervised models in the context of effort-aware JIT prediction, Yang et al. [123] performed an empirical study to compare their unsupervised models with the state-of-the-art supervised models under three prediction settings: cross-validation, time-wise cross-validation and cross-project prediction. They found that many simple unsupervised models achieve higher performance than the state-of-the-art supervised models in effort-aware JIT prediction.

To study the relationships between dependence clusters and defect-proneness in effort-aware defect prediction, Yang et al. [124] empirically evaluated the effect of dependence clusters with different statistical techniques. They found that, for large dependence clusters, functions inside dependence clusters tend to be more defect-prone, which helps us to better understand dependence clusters and their effect on software quality in the effort-aware context.

To examine the predictive power of network measures in effort-aware defect prediction, Ma et al. [125] performed an in-depth evaluation of the network measures with the logistic regression technique. They found that it is practically useful to use network measures for effort-aware defect prediction. They also suggest that researchers should carefully decide whether and when to use network measures for defect prediction in practice.

Zhao et al. [126] provided a thorough study of the actual usefulness of client usage context in package cohesion for effort-aware defect prediction. They evaluated the predictive ability of both context-based and non-context-based cohesion metrics, taken alone or together, for defect prediction, and found that there is practical value in package cohesion that considers client usage context for effort-aware defect prediction.

To examine the effect of sampling techniques on performance in effort-aware defect prediction, Bennin et al. [127] took into account the testing effort with an effort-aware measure. They used four sampling techniques, including random under-sampling, random over-sampling, SMOTE and purely SMOTE sampling, in their experiments. Results show that the over-sampling techniques achieved higher performance than the under-sampling techniques.

Bennin et al. [10] leveraged commonly used statistical and machine learning techniques to build 11 effort-aware prediction models for the more practical cross-release validation. They found that the performance of the prediction models depends significantly on the data set size and the percentage of defective instances in the data set.

Panichella et al. [128] proposed training defect predictors by maximising their cost-effectiveness through GAs. They employed regression trees and generalised linear models to build defect prediction models, and found that regression models trained by GAs achieve better performance than their traditional counterparts.

Summary: Table 9 summarises the details of the effort-aware defect prediction studies. It can be seen that most works are based on evaluating and analysing the relationship between software metrics and defect-proneness, and the prediction models constructed from them. Effort-aware context-based defect prediction studies consider the effort that is required to perform quality assurance activities, such as testing or code review, on the modules that are more prone to defects. Based on the prediction results, software engineers can allocate their limited testing resources to the defect-prone modules with the aim of finding more defects with less effort. From a practical point of view, it is more realistic and useful to apply effort-aware defect prediction models in actual software development, which can improve production efficiency and quality, and reduce development cost and software risk.

6 Empirical studies

In recent years, there have been many empirical studies analysing and evaluating defect prediction methods from different aspects [31, 56, 58, 59]. The related ideas and brief descriptions of these papers are discussed and summarised as follows.

Shepperd et al. [129] investigated the extent to which the research group that performs a defect prediction study is associated with the reported performance of the defect prediction models. By conducting a meta-analysis of 42 primary studies, they found that the reported performance of a defect prediction model shares a strong relationship with the group of researchers who construct the models. Their findings suggest that many published defect prediction works are biased.

Recently, Tantithamthavorn et al. [130] conducted an alternative investigation of Shepperd et al.'s data [129]. They found that research group shares a strong association with other explanatory variables (dataset and metric families), that this strong association among the explanatory variables makes it difficult to discern the impact of the research group on model performance, and that the research group has a smaller impact than the metric family after mitigating the impact of this strong association. Their findings suggest that researchers should experiment with a broader selection of datasets and metrics to combat any potential bias in the results.

Ghotra et al. [131] replicated a prior study [58] to examine whether the impact of classification techniques on the performance of defect prediction models is significant in two experimental settings. They found that the performance of different classification techniques differs little on the original NASA dataset, which is consistent with the results of the prior study. However, performance differs significantly on the cleaned NASA dataset. Their results suggest that some classification techniques outperform others for defect prediction.

To investigate the effect of parameter settings on the performance of defect prediction models, Tantithamthavorn et al. [132] conducted an empirical study with an automated parameter optimisation tool, Caret. By evaluating candidate parameter settings, Caret can find the optimised setting yielding the highest prediction performance. Results show that parameter settings can indeed significantly influence the performance of defect prediction models. The finding suggests that researchers should select the right parameters of the classification techniques in future defect prediction experiments.
sampling for the experiments. Results show that the over-sampling To examine the bias and variance of model validation
techniques achieved higher performance than the under-sampling techniques in the field of defect prediction, Tantithamthavorn et al.
techniques. [70] conducted an in-depth study with 12 most widely used model
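The resampling strategies compared in such studies are standard and can be reproduced with off-the-shelf tooling. The following is a minimal, illustrative sketch (not the authors' actual experimental setup) that applies random under-sampling, random over-sampling and SMOTE to an imbalanced defect dataset using the imbalanced-learn package; the feature matrix X and label vector y are synthetic placeholders.

```python
# Illustrative sketch of three standard resampling strategies for imbalanced
# defect data; X is a numeric feature matrix, y holds 0/1 defect labels.
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

np.random.seed(42)
X = np.random.rand(500, 10)                      # placeholder software metrics
y = (np.random.rand(500) < 0.1).astype(int)      # roughly 10% defective modules

samplers = {
    "random under-sampling": RandomUnderSampler(random_state=42),
    "random over-sampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name}: class counts after resampling = {np.bincount(y_res)}")
```

Under-sampling balances the classes by discarding non-defective instances, while the two over-sampling schemes keep all original data and add (duplicated or synthesised) defective instances.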
Bennin et al. [10] leveraged commonly used statistical and machine learning techniques to build 11 effort-aware prediction models for the more practical cross-release validation setting.

Effort-aware context-based defect prediction studies consider the effort that is required to perform quality assurance activities, such as testing or code review, on the modules that are more prone to defects. Based on the prediction results, software engineers can allocate their limited testing resources to the defect-prone modules with the aim of finding more defects with smaller effort. From the view of practice, it is more realistic and useful to apply effort-aware defect prediction models in actual software development, which can improve production efficiency and quality, and reduce development cost and software risk.
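Effort-aware evaluation is commonly operationalised by ranking modules by predicted defect density (predicted risk divided by module size) and measuring how many defects are found within a fixed inspection budget, often 20% of the total lines of code. The sketch below illustrates this idea under those assumptions; the column names and the 20% cut-off are illustrative choices rather than the protocol of any single study surveyed here.

```python
# Minimal sketch of effort-aware ranking and recall within a 20% LOC budget.
# 'loc', 'defective' and 'predicted_prob' are assumed placeholder column names.
import pandas as pd

def recall_at_effort(df: pd.DataFrame, budget: float = 0.2) -> float:
    # Rank modules by predicted risk per line of code (defect density).
    ranked = df.assign(density=df["predicted_prob"] / df["loc"].clip(lower=1))
    ranked = ranked.sort_values("density", ascending=False)
    # Inspect modules until the LOC budget (e.g. 20% of total LOC) is exhausted.
    within_budget = ranked["loc"].cumsum() <= budget * ranked["loc"].sum()
    found = ranked.loc[within_budget, "defective"].sum()
    total = max(int(df["defective"].sum()), 1)
    return found / total

modules = pd.DataFrame({
    "loc": [120, 30, 800, 55, 10],
    "defective": [1, 0, 1, 1, 0],
    "predicted_prob": [0.9, 0.2, 0.6, 0.7, 0.1],
})
print(f"recall at 20% effort: {recall_at_effort(modules):.2f}")
```

Ranking by density rather than raw probability is what makes the evaluation effort-aware: small risky modules are inspected before large ones that would consume most of the budget.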
6 Empirical studies

In recent years, there have been many empirical studies that analyse and evaluate defect prediction methods from different aspects [31, 56, 58, 59]. The related ideas and brief descriptions of these papers are discussed and summarised as follows.

Shepperd et al. [129] investigated the extent to which the research group that performs a defect prediction study is associated with the reported performance of its defect prediction models. By conducting a meta-analysis of 42 primary studies, they found that the reported performance of a defect prediction model shares a strong relationship with the group of researchers who construct the model. Their findings suggest that many published defect prediction works are biased.

Recently, Tantithamthavorn et al. [130] conducted an alternative investigation of Shepperd et al.'s data [129]. They found that research group shares a strong association with other explanatory variables (the dataset and metric families), that this strong association among the explanatory variables makes it difficult to discern the impact of research group on model performance, and that research group has a smaller impact than metric family once the impact of this strong association is mitigated. Their findings suggest that researchers should experiment with a broader selection of datasets and metrics to combat any potential bias in the results.

Ghotra et al. [131] replicated a prior study [58] to examine whether the impact of classification techniques on the performance of defect prediction models is significant in two experimental settings. They found that the performance of different classification techniques differs little on the original NASA dataset, which is consistent with the results of the prior study; however, the performance differs significantly on the cleaned NASA dataset. Their results suggest that some classification techniques outperform others for defect prediction.
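Comparisons of this kind are typically run as a cross-validated benchmark of several classifiers on the same dataset. A minimal sketch of such a benchmark is shown below, assuming scikit-learn and synthetic placeholder data; it illustrates the shape of the experiment rather than reproducing the replication in [131].

```python
# Minimal sketch of benchmarking several classifiers with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Placeholder imbalanced dataset standing in for a (cleaned) defect dataset.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.85],
                           random_state=1)

classifiers = {
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f} (+/- {auc.std():.3f})")
```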
To investigate the effect of parameter settings on the performance of defect prediction models, Tantithamthavorn et al. [132] conducted an empirical study with an automated parameter optimisation tool, Caret. By evaluating candidate parameter settings, Caret selects the optimised setting that yields the highest prediction performance. The results show that parameter settings can indeed significantly influence the performance of defect prediction models. This finding suggests that researchers should select the right parameters of the classification techniques in future defect prediction experiments.
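Caret performs this tuning in R; an analogous, minimal sketch of automated parameter optimisation in Python is shown below using scikit-learn's grid search. The candidate grid and the scoring choice are illustrative assumptions, not the settings evaluated in [132].

```python
# Minimal sketch of automated parameter optimisation via an exhaustive grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=800, n_features=20, weights=[0.85],
                           random_state=7)

# Candidate parameter settings; the tuner evaluates each combination and keeps
# the one with the highest cross-validated AUC.
grid = {"n_estimators": [50, 100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=7), grid,
                      cv=5, scoring="roc_auc")
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best cross-validated AUC: {search.best_score_:.3f}")
```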
To examine the bias and variance of model validation techniques in the field of defect prediction, Tantithamthavorn et al. [70] conducted an in-depth study of the 12 most widely used model validation techniques. They found that single-repetition holdout validation tends to yield estimates with 46–229% more bias and 53–863% more variance than the top-ranked model validation techniques, while out-of-sample bootstrap validation achieves the best balance between the bias and variance of estimates. Hence, they suggested that researchers adopt out-of-sample bootstrap validation in future defect prediction studies.
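Out-of-sample bootstrap validation trains a model on a sample drawn with replacement and evaluates it on the rows that were not drawn, repeating the procedure many times and averaging the results. The following is a small sketch of that procedure on assumed placeholder data; it is not the exact protocol of [70].

```python
# Minimal sketch of out-of-sample bootstrap validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=600, n_features=15, weights=[0.85],
                           random_state=3)
rng = np.random.default_rng(3)

scores = []
for _ in range(100):                               # number of bootstrap repetitions
    boot = rng.integers(0, len(y), size=len(y))    # indices drawn with replacement
    oob = np.setdiff1d(np.arange(len(y)), boot)    # out-of-bag (unseen) rows
    model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    scores.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))

print(f"out-of-sample bootstrap AUC: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```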
To recognise classification techniques that perform well in software defect prediction, Bowes et al. [133] applied four classifiers, namely random forest, naive Bayes, RPart and SVM, and analysed the level of prediction uncertainty by investigating individual defects. They found that the predictive performance of the four classifiers is similar, but that each detects different sets of defects. Based on the experimental results, they concluded that the prediction performance of classifiers with ensemble decision-making strategies outperforms majority voting.

Due to the lack of an identical performance evaluation measure with which to directly compare different defect prediction models, Bowes et al. [134] developed a Java tool, 'DConfusion', that allows researchers and practitioners to transform many reported performance measures back into a confusion matrix. They found that the tool produces very small errors when re-computing the confusion matrix across a variety of datasets and learners.
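The underlying idea is that, given a few reported measures together with the dataset size and defect rate, the four confusion-matrix cells can be recovered algebraically. The sketch below works through one such case (recall, precision, the number of instances and the number of defective instances); it illustrates the principle only and is not the DConfusion tool's implementation.

```python
# Minimal sketch: recovering a confusion matrix from reported measures.
# Assumes recall, precision, total instance count and defective count are reported.
def recover_confusion(recall: float, precision: float, n: int, n_defective: int):
    tp = recall * n_defective                    # recall = TP / (TP + FN)
    fn = n_defective - tp
    fp = tp * (1.0 - precision) / precision      # precision = TP / (TP + FP)
    tn = n - n_defective - fp
    return {key: round(value) for key, value in
            {"TP": tp, "FP": fp, "FN": fn, "TN": tn}.items()}

# Example: 1000 modules, 150 defective, reported recall 0.60 and precision 0.45.
print(recover_confusion(recall=0.60, precision=0.45, n=1000, n_defective=150))
# -> {'TP': 90, 'FP': 110, 'FN': 60, 'TN': 740}
```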
To examine the effectiveness of search-based techniques and their hybridised versions for defect prediction, Malhotra and Khanna [135] performed an empirical comparison of five search-based techniques, five hybridised techniques, four machine learning techniques and a statistical technique. By comparing the predictive performance of each model, they advocated that researchers should employ hybridised techniques to develop defect prediction models for recognising software defects.

Caglayan et al. [47] replicated a prior study [45] to examine the merits of organisational metrics for the performance of defect prediction models for large-scale enterprise software. They separately extracted organisational, code complexity, code churn and pre-release defect metrics to build prediction models. They found that the model with organisational metrics outperforms both the churn metric and the pre-release metric models.

To investigate how different aggregation schemes affect defect prediction models, Zhang et al. [136] conducted an in-depth analysis of 11 aggregation schemes on 255 open source software projects. They found that aggregation has a large effect both on the correlations among metrics and on the correlation between metrics and defects. Using only summation does not often achieve the best performance, and it tends to underestimate the performance of defect prediction models. Given their findings, they advised researchers to explore various aggregation schemes in future defect prediction studies.
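Aggregation arises because many metrics are computed at the file or method level while predictions are often made at a coarser granularity, so the fine-grained values must be rolled up. The sketch below applies a few candidate aggregation schemes with pandas; the column names and the chosen schemes are illustrative assumptions rather than the 11 schemes studied in [136].

```python
# Minimal sketch: aggregating file-level metrics to package level with several
# schemes instead of summation only.
import pandas as pd

files = pd.DataFrame({
    "package": ["core", "core", "core", "ui", "ui"],
    "cyclomatic_complexity": [4, 12, 7, 3, 25],
    "loc": [120, 340, 80, 60, 500],
})

aggregated = files.groupby("package").agg(
    cc_sum=("cyclomatic_complexity", "sum"),
    cc_mean=("cyclomatic_complexity", "mean"),
    cc_median=("cyclomatic_complexity", "median"),
    cc_max=("cyclomatic_complexity", "max"),
    loc_sum=("loc", "sum"),
)
print(aggregated)
```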
To construct a universal defect prediction model, Zhang et al. [67] presented a context-aware rank transformation method that considers six context factors. In a study of 1385 open source projects, the universal model achieved comparable performance to within-project models, produced similar performance on five external projects, and performed similarly among projects with different context factors.
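The intuition behind rank transformation is to replace raw metric values, whose scales differ widely across projects, with rank-based levels so that data from many projects become comparable. A minimal sketch of one such transformation (mapping each metric to ten quantile-based levels) is given below; it follows the general idea rather than the exact procedure of [67].

```python
# Minimal sketch: rank-transforming raw metric values into ten levels so that
# projects with very different metric scales can be pooled into one model.
import numpy as np
import pandas as pd

def rank_transform(values: pd.Series, n_levels: int = 10) -> pd.Series:
    # Convert raw values to percentile ranks, then bucket them into levels 1..n_levels.
    percentiles = values.rank(pct=True)
    return np.ceil(percentiles * n_levels).clip(1, n_levels).astype(int)

project_a = pd.Series([10, 200, 35, 900, 15], name="loc")   # large-scale project
project_b = pd.Series([2, 8, 5, 30, 3], name="loc")         # small-scale project

# After transformation both projects share the same 1..10 scale.
print(rank_transform(project_a).tolist())
print(rank_transform(project_b).tolist())
```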
Petric et al. [142] conducted an empirical study to examine whether defect prediction can be improved by using an explicit diversity technique within a stacking ensemble. They employed the stacking ensemble technique to combine four different types of classifiers with a weighted accuracy diversity model. Based on eight publicly available projects, the results demonstrate that stacking ensembles perform better than other defect prediction models, and that the essential factor is the use of ensemble diversity.

Liu et al. [143] performed an empirical study of a two-stage data preprocessing method for software defect prediction, which includes a feature selection stage and an instance reduction stage. The aim of feature selection is to remove irrelevant and redundant features, and the aim of instance reduction is to under-sample the non-defective instances. Experimental evaluation on the Eclipse and NASA projects shows the effectiveness of the proposed two-stage data preprocessing method.
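A minimal sketch of such a two-stage pipeline, on assumed placeholder data, is given below: a filter-based feature selection step followed by random under-sampling of the non-defective instances. The concrete selector and sampler are illustrative substitutes, not the specific algorithms evaluated in [143].

```python
# Minimal sketch of a two-stage preprocessing pipeline:
# stage 1 removes weak features, stage 2 under-samples non-defective instances.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           weights=[0.9], random_state=5)

pipeline = Pipeline(steps=[
    ("select", SelectKBest(mutual_info_classif, k=10)),   # stage 1: feature selection
    ("reduce", RandomUnderSampler(random_state=5)),       # stage 2: instance reduction
    ("classify", GaussianNB()),
])
auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"mean AUC with two-stage preprocessing: {auc.mean():.3f}")
```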
To validate the feasibility of using a simplified metric set for defect prediction, He et al. [137] constructed three types of predictors in three scenarios with six typical classifiers. They found that it is viable and practical to use a simplified metric set for defect prediction, and that models built with the simplified metric subset can provide satisfactory prediction performance.

Yang et al. [138] conducted an empirical study to examine the relationship between end-slice-based and metric-slice-based cohesion metrics, and compared the predictive power of different slice types for slice-based cohesion metrics. They found that end-slice-based and metric-slice-based cohesion metrics differ little; in practice, they suggested selecting the end slice when computing slice-based cohesion metrics for defect prediction.

Jaafar et al. [139] conducted an empirical evaluation of the impact of design pattern and anti-pattern dependencies on changes and faults in object-oriented systems. They found that classes with anti-pattern dependencies are more prone to defects than other classes, whereas the same does not hold for classes with design pattern dependencies. They also observed that classes having dependencies with anti-patterns are mostly involved in structural changes, which are the most common kind of change.

Chen et al. [140] conducted an empirical study to examine the predictive effectiveness of network measures in high-severity defect prediction. They used logistic regression to separately analyse the relationship between each network measure and defect-proneness, and then evaluated their predictive power compared with code metrics (CMs). They found that network measures are of practical value in high-severity defect prediction.
To investigate the relationship between process metrics and the number of defects, Madeyski and Jureczko [22] provided an in-depth evaluation of which process metrics have a large effect on improving predictors compared with product metrics. They found that the number of distinct committers (NDC) metric can significantly improve the performance of prediction models, and they recommended using the NDC process metric in future defect prediction studies.
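The NDC metric is straightforward to derive from version-control history: for each file (or class), count how many distinct authors committed changes to it in the period of interest. The sketch below computes NDC from a change log represented as a pandas DataFrame; the column names are assumed placeholders rather than a specific tool's schema.

```python
# Minimal sketch: computing the number of distinct committers (NDC) per file
# from a version-control change log.
import pandas as pd

changes = pd.DataFrame({
    "file": ["core/App.java", "core/App.java", "ui/View.java",
             "core/App.java", "ui/View.java"],
    "author": ["alice", "bob", "alice", "alice", "carol"],
})

ndc = changes.groupby("file")["author"].nunique().rename("NDC")
print(ndc)
# both core/App.java and ui/View.java end up with NDC = 2
```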
To investigate the performance of JIT models in the context of cross-project defect prediction, Kamei et al. [141] conducted an empirical evaluation on 11 open source projects. They found that training and test projects with similar data distributions usually obtain better performance, and that combining the data of multiple projects or combining several prediction models tends to produce strong performance. JIT cross-project prediction models built from carefully selected data usually tend to perform best.

Xu et al. [144] conducted an empirical evaluation of the impact of 32 feature selection techniques on defect prediction. They found that feature selection techniques are useful, and that the performance of the predictors trained with these techniques exhibits significant differences across the projects.

Summary: Table 10 shows the comparison of the empirical studies on defect prediction. We can observe that these papers mainly focus on different machine learning and statistical methods, classifiers or classifier ensemble techniques, parameter optimisation techniques, model validation techniques, performance evaluation measures, various software metrics and feature selection techniques. The above empirical studies provide a variety of profound, original, valuable and practical insights. These findings can help software researchers and practitioners to better understand the results reported by the papers and give some practical guidelines for future defect prediction studies.
7 Future directions and challenges

Although numerous efforts have been made and many outstanding studies have been proposed in software defect prediction, there are still many potential future challenges, and some new open problems require further investigation and exploration. Here, we address some possible future research challenges for the defect prediction problem.

(i) Generalisation: Most defect prediction studies are verified on open source software projects. The main reason is that these projects make it easy to collect and archive practical development history and make their data publicly available. However, current defect prediction methods might not be generalisable to closed software projects such as proprietary and commercial systems. That is, the lack of availability of such proprietary and commercial data leaves open the question of whether defect prediction models built using data from open source projects would apply to these projects. Such defect prediction methods have not been investigated in depth. Thus, researchers need to build more partnerships with business partners and gain access to their data repositories.

(ii) Overcoming imbalance: Software defect prediction always suffers from class imbalance, namely that the number of defective instances is often much smaller than the number of non-defective instances. This imbalanced data distribution hampers the effectiveness of many defect prediction models. Although some defect prediction studies [65, 69, 103, 104, 109] have been presented to deal with the class imbalance problem, it remains necessary and important to overcome this problem in future work and further improve the performance of defect prediction.

(iii) Focusing on effort: It is very necessary and important to evaluate prediction models in a realistic setting, e.g. how much effort can be saved for testing and code inspection with defect prediction models. Recent studies have attempted to address this problem by considering effort [117, 118, 120, 123, 127]. These defect prediction studies usually utilise the lines of code (LOC) metric as a proxy for effort. In future effort-aware defect prediction models, researchers should examine what the best way to measure effort is and provide better practical guidelines. Such research can have a large effect on the future applicability of defect prediction techniques in practice.

(iv) Heterogeneous defect prediction: Recently, several heterogeneous defect prediction models have been developed [66, 98, 100], which use heterogeneous metric data from other projects. This provides a new perspective on defect prediction. For new projects, or projects lacking sufficient historical data, it is very meaningful to study and use these methods for predicting defects. Because different data distributions exist between the source and target projects, it is difficult to build good defect prediction models that achieve satisfactory performance. Furthermore, studies on the feasibility of heterogeneous cross-project prediction are not yet mature in practice. At this point, the practical use of heterogeneous cross-project defect prediction remains an open research area.

(v) Privacy preservation issue: Due to business sensitivity and privacy concerns, most companies are not willing to share their data [64, 114–116]. In this scenario, it is often difficult to extract data from industrial companies. Therefore, current defect prediction models may not work for proprietary and industrial projects. If more such datasets were available, building cross-project defect prediction models would be more sound. In the future, it is very important and necessary to extensively study the privacy issue in the context of cross-project defect prediction.

(vi) Fair evaluation: Up to now, numerous methods have been proposed to solve the defect prediction problem. However, the corresponding evaluations are inconsistent and inadequate, which may lead to inappropriate conclusions about the performance of defect prediction methods. Hence, fairer evaluation of defect prediction methods is very necessary.

(vii) Reproducibility: Some defect prediction papers [51, 52, 78, 83, 123] provide replication packages such as datasets, implementation scripts and additional settings files. The main goal of replicating defect prediction studies is not only to replicate and validate their experiments but also to allow other researchers to compare the prediction performance of new models with the original studies [14, 145]. Even with access to the replication packages of previous studies, one may not obtain exactly the same results, since the selection of evaluation procedures, configuration parameters and performance metrics has a large impact on reproducibility. We therefore recommend that researchers provide their complete replication packages in future defect prediction studies.
8 Conclusion

Software quality assurance has become one of the most significant and expensive phases in the development of high-assurance software systems. As software systems play an increasingly key role in our daily lives, their complexity continues to increase, and this increased complexity makes quality assurance very difficult. Defect prediction models can identify the defect-prone modules so that quality assurance teams can effectively allocate limited resources for testing and code inspection by putting more effort on the defect-prone modules.

In recent years, new defect prediction techniques, problems and applications have been emerging quickly. This paper attempts to systematically summarise the typical works on software defect prediction in recent years. Based on the results obtained in this work, this paper will help researchers and software practitioners to better understand previous defect prediction studies from the perspectives of datasets, software metrics, evaluation measures and modelling techniques in an easy and effective way.

9 Acknowledgments

The authors thank the editors and anonymous reviewers for their constructive comments and suggestions. The authors also thank Professor Jifeng Xuan and Professor Xiaoyuan Xie from the School of Computer Science at Wuhan University for their insightful advice. This work was supported by the NSFC-Key Project of General Technology Fundamental Research United Fund under grant no. U1736211, the National Key Research and Development Program of China under grant no. 2017YFB0202001, the National Natural Science Foundation of China under grant nos. 61672208, 41571417 and U1504611, the Science and Technique Development Program of Henan under grant no. 172102210186, and the Research Foundation of Henan University under grant no. 2015YBZR024.
10 References

[1] Hall, T., Beecham, S., Bowes, D., et al.: 'A systematic literature review on fault prediction performance in software engineering', IEEE Trans. Softw. Eng., 2012, 38, (6), pp. 1276–1304
[2] Menzies, T., Milton, Z., Turhan, B., et al.: 'Defect prediction from static code features: current results, limitations, new approaches', Autom. Softw. Eng., 2010, 17, (4), pp. 375–407
[3] Catal, C., Diri, B.: 'A systematic review of software fault prediction studies', Expert Syst. Appl., 2009, 36, (4), pp. 7346–7354
[4] Catal, C.: 'Software fault prediction: a literature review and current trends', Expert Syst. Appl., 2011, 38, (4), pp. 4626–4636
[5] Malhotra, R.: 'A systematic review of machine learning techniques for software fault prediction', Appl. Soft Comput., 2015, 27, pp. 504–518
[6] Naik, K., Tripathy, P.: 'Software testing and quality assurance: theory and practice' (John Wiley & Sons, Hoboken, NJ, 2011)
[7] Menzies, T., Greenwald, J., Frank, A.: 'Data mining static code attributes to learn defect predictors', IEEE Trans. Softw. Eng., 2007, 33, (1), pp. 2–13
[8] Song, Q., Jia, Z., Shepperd, M., et al.: 'A general software defect proneness prediction framework', IEEE Trans. Softw. Eng., 2011, 37, (3), pp. 356–370
[9] Herzig, K.: 'Using pre-release test failures to build early post-release defect prediction models'. Proc. IEEE 25th Int. Symp. Software Reliability Engineering, 2014, pp. 300–311
[10] Bennin, K.E., Toda, K., Kamei, Y., et al.: 'Empirical evaluation of cross-release effort-aware defect prediction models'. Proc. IEEE Int. Conf. Software Quality, Reliability and Security, 2016, pp. 214–221
[11] Zimmermann, T., Nagappan, N., Gall, H., et al.: 'Cross-project defect prediction: a large scale experiment on data vs. domain vs. process'. Proc. 7th Joint Meeting of the European Software Engineering Conf. and ACM SIGSOFT Int. Symp. Foundations of Software Engineering, 2009, pp. 91–100
[12] Turhan, B., Menzies, T., Bener, A.B., et al.: 'On the relative value of cross-company and within-company data for defect prediction', Empir. Softw. Eng., 2009, 14, (5), pp. 540–578
[13] He, Z., Shu, F., Yang, Y., et al.: 'An investigation on the feasibility of cross-project defect prediction', Autom. Softw. Eng., 2012, 19, (2), pp. 167–199
[14] Kamei, Y., Shihab, E.: 'Defect prediction: accomplishments and future challenges'. Proc. IEEE 23rd Int. Conf. Software Analysis, Evolution, and Reengineering, 2016, pp. 33–45
[15] Kim, S., Whitehead, E.J., Zhang, Y.: 'Classifying software changes: clean or buggy?', IEEE Trans. Softw. Eng., 2008, 34, (2), pp. 181–196
[16] Nam, J., Pan, S.J., Kim, S.: 'Transfer defect learning'. Proc. 35th Int. Conf. Software Engineering, 2013, pp. 382–391
[17] Turhan, B., Mısırlı, A.T., Bener, A.: 'Empirical evaluation of the effects of mixed project data on learning defect predictors', Inf. Softw. Technol., 2013, 55, (6), pp. 1101–1118
[18] Zhang, Y., Lo, D., Xia, X., et al.: 'An empirical study of classifier combination for cross-project defect prediction'. Proc. IEEE 39th Annual Computer Software and Applications Conf., 2015, pp. 264–269
[19] Krishna, R., Menzies, T., Fu, W.: 'Too much automation? The bellwether effect and its implications for transfer learning'. Proc. 31st Int. Conf. Automated Software Engineering, 2016, pp. 122–131
[20] Andreou, A.S., Chatzis, S.P.: 'Software defect prediction using doubly stochastic Poisson processes driven by stochastic belief networks', J. Syst. Softw., 2016, 122, pp. 72–82
[21] Rahman, F., Devanbu, P.: 'How, and why, process metrics are better'. Proc. 2013 Int. Conf. Software Engineering, 2013, pp. 432–441
[22] Madeyski, L., Jureczko, M.: 'Which process metrics can significantly improve defect prediction models? An empirical study', Softw. Qual. J., 2015, 23, (3), pp. 393–422
[23] Radjenović, D., Heričko, M., Torkar, R., et al.: 'Software fault prediction metrics: a systematic literature review', Inf. Softw. Technol., 2013, 55, (8), pp. 1397–1418
[24] Halstead, M.H.: 'Elements of software science', vol. 7 (Elsevier, New York, 1977)
[25] McCabe, T.J.: 'A complexity measure', IEEE Trans. Softw. Eng., 1976, 2, (4), pp. 308–320
[26] Chidamber, S.R., Kemerer, C.F.: 'A metrics suite for object oriented design', IEEE Trans. Softw. Eng., 1994, 20, (6), pp. 476–493
[27] Abreu, F.B., Carapuça, R.: 'Candidate metrics for object-oriented software within a taxonomy framework', J. Syst. Softw., 1994, 26, (1), pp. 87–96
[28] Nagappan, N., Ball, T.: 'Use of relative code churn measures to predict system defect density'. Proc. 27th Int. Conf. Software Engineering, 2005, pp. 284–292
[29] Moser, R., Pedrycz, W., Succi, G.: 'A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction'. Proc. 30th Int. Conf. Software Engineering, 2008, pp. 181–190
[30] Hassan, A.E.: 'Predicting faults using the complexity of code changes'. Proc. 31st Int. Conf. Software Engineering, 2009, pp. 78–88
[31] D'Ambros, M., Lanza, M., Robbes, R.: 'Evaluating defect prediction approaches: a benchmark and an extensive comparison', Empir. Softw. Eng., 2012, 17, (4–5), pp. 531–577
[32] Weyuker, E.J., Ostrand, T.J., Bell, R.M.: 'Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models', Empir. Softw. Eng., 2008, 13, (5), pp. 539–559
[33] Pinzger, M., Nagappan, N., Murphy, B.: 'Can developer-module networks predict failures?'. Proc. 16th ACM SIGSOFT Int. Symp. Foundations of Software Engineering, 2008, pp. 2–12
[34] Meneely, A., Williams, L., Snipes, W., et al.: 'Predicting failures with developer networks and social network analysis'. Proc. 16th ACM SIGSOFT Int. Symp. Foundations of Software Engineering, 2008, pp. 13–23
[35] Bird, C., Nagappan, N., Murphy, B., et al.: 'Don't touch my code!: examining the effects of ownership on software quality'. Proc. 19th ACM SIGSOFT Symp. and the 13th European Conf. Foundations of Software Engineering, 2011, pp. 4–14
[36] Rahman, F.: 'Ownership, experience and defects: a fine-grained study of authorship'. Proc. 33rd Int. Conf. Software Engineering, 2011, pp. 491–500
[37] Posnett, D., Devanbu, P., Filkov, V.: 'Dual ecological measures of focus in software development'. Proc. 35th Int. Conf. Software Engineering, 2013, pp. 452–461
[38] Jiang, T., Tan, L., Kim, S.: 'Personalized defect prediction'. Proc. IEEE/ACM 28th Int. Conf. Automated Software Engineering, 2013, pp. 279–289
[39] Lee, T., Nam, J., Han, D., et al.: 'Developer micro interaction metrics for software defect prediction', IEEE Trans. Softw. Eng., 2016, 42, (11), pp. 1015–1035
[40] Zimmermann, T., Nagappan, N.: 'Predicting defects using network analysis on dependency graphs'. Proc. 30th Int. Conf. Software Engineering, 2008, pp. 531–540
[41] Bird, C., Nagappan, N., Gall, H., et al.: 'Using socio-technical networks to predict failures'. Proc. 20th IEEE Int. Symp. Software Reliability Engineering, 2009
[42] D'Ambros, M., Lanza, M., Robbes, R.: 'On the relationship between change coupling and software defects'. Proc. 16th Working Conf. Reverse Engineering, 2009, pp. 135–144
[43] Hu, W., Wong, K.: 'Using citation influence to predict software defects'. Proc. 10th Working Conf. Mining Software Repositories, 2013, pp. 419–428
[44] Herzig, K., Just, S., Rau, A., et al.: 'Predicting defects using change genealogies'. Proc. IEEE 24th Int. Symp. Software Reliability Engineering, 2013, pp. 118–127
[45] Nagappan, N., Murphy, B., Basili, V.: 'The influence of organizational structure on software quality'. Proc. ACM/IEEE 30th Int. Conf. Software Engineering, 2008, pp. 521–530
[46] Mockus, A.: 'Organizational volatility and its effects on software defects'. Proc. 18th ACM SIGSOFT Int. Symp. Foundations of Software Engineering, 2010, pp. 117–126
[47] Caglayan, B., Turhan, B., Bener, A., et al.: 'Merits of organizational metrics in defect prediction: an industrial replication'. Proc. IEEE/ACM 37th Int. Conf. Software Engineering, 2015, pp. 89–98
[48] Bacchelli, A., D'Ambros, M., Lanza, M.: 'Are popular classes more defect prone?'. Proc. 13th Int. Conf. Fundamental Approaches to Software Engineering, 2010, pp. 59–73
[49] Taba, S.E.S., Khomh, F., Zou, Y., et al.: 'Predicting bugs using antipatterns'. Proc. 29th Int. Conf. Software Maintenance, 2013, pp. 270–279
[50] Zhang, H.: 'An investigation of the relationships between lines of code and defects'. Proc. IEEE Int. Conf. Software Maintenance, 2009, pp. 274–283
[51] Wu, R., Zhang, H., Kim, S., et al.: 'Relink: recovering links between bugs and changes'. Proc. 19th ACM SIGSOFT Symp. Foundations of Software Engineering and 13th European Software Engineering Conf., 2011, pp. 15–25
[52] Kamei, Y., Shihab, E., Adams, B., et al.: 'A large-scale empirical study of just-in-time quality assurance', IEEE Trans. Softw. Eng., 2013, 39, (6), pp. 757–773
[53] Zimmermann, T., Premraj, R., Zeller, A.: 'Predicting defects for eclipse'. Proc. Third Int. Workshop on Predictor Models in Software Engineering, 2007, pp. 9–15
[54] Kim, S., Zhang, H., Wu, R., et al.: 'Dealing with noise in defect prediction'. Proc. 33rd Int. Conf. Software Engineering, 2011, pp. 481–490
[55] Altinger, H., Siegl, S., Dajsuren, Y., et al.: 'A novel industry grade dataset for fault prediction based on model-driven developed automotive embedded software'. Proc. 12th Working Conf. Mining Software Repositories, 2015, pp. 494–497
[56] Shepperd, M., Song, Q., Sun, Z., et al.: 'Data quality: some comments on the NASA software defect datasets', IEEE Trans. Softw. Eng., 2013, 39, (9), pp. 1208–1215
[57] Jureczko, M., Madeyski, L.: 'Towards identifying software project clusters with regard to defect prediction'. Proc. 6th Int. Conf. Predictive Models in Software Engineering, 2010, pp. 1–10
[58] Lessmann, S., Baesens, B., Mues, C., et al.: 'Benchmarking classification models for software defect prediction: a proposed framework and novel findings', IEEE Trans. Softw. Eng., 2008, 34, (4), pp. 485–496
[59] Jiang, Y., Cukic, B., Ma, Y.: 'Techniques for evaluating fault prediction models', Empir. Softw. Eng., 2008, 13, (5), pp. 561–595
[60] Mende, T., Koschke, R.: 'Revisiting the evaluation of defect prediction models'. Proc. 5th Int. Conf. Predictor Models in Software Engineering, 2009, pp. 1–10
[61] Arisholm, E., Briand, L.C., Johannessen, E.B.: 'A systematic and comprehensive investigation of methods to build and evaluate fault prediction models', J. Syst. Softw., 2010, 83, (1), pp. 2–17
[62] Xiao, X., Lo, D., Xin, X., et al.: 'Evaluating defect prediction approaches using a massive set of metrics: an empirical study'. Proc. 30th Annual ACM Symp. Applied Computing, 2015, pp. 1644–1647
[63] Menzies, T., Dekhtyar, A., Distefano, J., et al.: 'Problems with precision: a response to 'comments on 'data mining static code attributes to learn defect predictors''', IEEE Trans. Softw. Eng., 2007, 33, (9), pp. 635–636
[64] Peters, F., Menzies, T., Gong, L., et al.: 'Balancing privacy and utility in cross-company defect prediction', IEEE Trans. Softw. Eng., 2013, 39, (8), pp. 1054–1068
[65] Wang, S., Yao, X.: 'Using class imbalance learning for software defect prediction', IEEE Trans. Reliab., 2013, 62, (2), pp. 434–443
[66] Jing, X.Y., Wu, F., Dong, X., et al.: 'Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning'. Proc. 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 496–507
[67] Zhang, F., Mockus, A., Keivanloo, I., et al.: 'Towards building a universal defect prediction model with rank transformed predictors', Empir. Softw. Eng., 2016, 21, (5), pp. 1–39
[68] Rahman, F., Posnett, D., Devanbu, P.: 'Recalling the imprecision of cross-project defect prediction'. Proc. ACM SIGSOFT 20th Int. Symp. Foundations of Software Engineering, 2012, pp. 1–11
[69] Ryu, D., Choi, O., Baik, J.: 'Value-cognitive boosting with a support vector machine for cross-project defect prediction', Empir. Softw. Eng., 2016, 21, (1), pp. 43–71
[70] Tantithamthavorn, C., McIntosh, S., Hassan, A.E., et al.: 'An empirical comparison of model validation techniques for defect prediction models', IEEE Trans. Softw. Eng., 2017, 43, (1), pp. 1–18
[71] Jing, X.Y., Ying, S., Zhang, Z.W., et al.: 'Dictionary learning based software defect prediction'. Proc. 36th Int. Conf. Software Engineering, 2014, pp. 414–423
[72] Jing, X.Y., Zhang, Z.W., Ying, S., et al.: 'Software defect prediction based on collaborative representation classification'. Companion Proc. 36th Int. Conf. Software Engineering, 2014, pp. 632–633
[73] Wang, T., Zhang, Z., Jing, X.Y., et al.: 'Multiple kernel ensemble learning for software defect prediction', Autom. Softw. Eng., 2016, 23, (4), pp. 569–590
[74] Wang, S., Liu, T., Tan, L.: 'Automatically learning semantic features for defect prediction'. Proc. 38th Int. Conf. Software Engineering, 2016, pp. 297–308
[75] Yang, X., Lo, D., Xia, X., et al.: 'Deep learning for just-in-time defect prediction'. Proc. IEEE Int. Conf. Software Quality, Reliability and Security, 2015, pp. 17–26
[76] Chen, L., Fang, B., Shang, Z., et al.: 'Negative samples reduction in cross-company software defects prediction', Inf. Softw. Technol., 2015, 62, pp. 67–77
[77] Xia, X., Lo, D., Pan, S.J., et al.: 'Hydra: massively compositional model for cross-project defect prediction', IEEE Trans. Softw. Eng., 2016, 42, (10), pp. 977–998
[78] Canfora, G., Lucia, A.D., Penta, M.D., et al.: 'Defect prediction as a multiobjective optimization problem', Softw. Test. Verif. Reliab., 2015, 25, (4), pp. 426–459
[79] Ryu, D., Jang, J.I., Baik, J.: 'A transfer cost-sensitive boosting approach for cross-project defect prediction', Softw. Qual. J., 2017, 25, (1), pp. 235–272
[80] Yang, X., Lo, D., Xia, X., et al.: 'Tlel: a two-layer ensemble learning approach for just-in-time defect prediction', Inf. Softw. Technol., 2017, 87, pp. 206–220
[81] Wang, T., Zhang, Z., Jing, X.Y., et al.: 'Non-negative sparse-based semiboost for software defect prediction', Softw. Test. Verif. Reliab., 2016, 26, (7), pp. 498–515
[82] Zhang, Z., Jing, X.Y., Wang, T.: 'Label propagation based semi-supervised learning for software defect prediction', Autom. Softw. Eng., 2017, 24, (1), pp. 47–69
[83] Nam, J., Kim, S.: 'Clami: defect prediction on unlabeled datasets'. Proc. 30th IEEE/ACM Int. Conf. Automated Software Engineering, 2015, pp. 1–12
[84] Zhang, F., Zheng, Q., Zou, Y., et al.: 'Cross-project defect prediction using a connectivity-based unsupervised classifier'. Proc. 38th Int. Conf. Software Engineering, 2016, pp. 309–320
[85] Okutan, A., Yıldız, O.T.: 'Software defect prediction using Bayesian networks', Empir. Softw. Eng., 2014, 19, (1), pp. 154–181
[86] Bowes, D., Hall, T., Harman, M., et al.: 'Mutation-aware fault prediction'. Proc. 25th Int. Symp. Software Testing and Analysis, 2016, pp. 330–341
[87] Chen, T.H., Shang, W., Nagappan, M., et al.: 'Topic-based software defect explanation', J. Syst. Softw., 2017, 129, pp. 79–106
[88] Shivaji, S., Whitehead, E.J., Akella, R., et al.: 'Reducing features to improve code change-based bug prediction', IEEE Trans. Softw. Eng., 2013, 39, (4), pp. 552–569
[89] Gao, K., Khoshgoftaar, T.M., Wang, H., et al.: 'Choosing software metrics for defect prediction: an investigation on feature selection techniques', Softw. Pract. Experience, 2011, 41, (5), pp. 579–606
[90] Laradji, I.H., Alshayeb, M., Ghouti, L.: 'Software defect prediction using ensemble learning on selected features', Inf. Softw. Technol., 2015, 58, pp. 388–402
[91] Liu, S., Chen, X., Liu, W., et al.: 'Fecar: a feature selection framework for software defect prediction'. Proc. IEEE 38th Annual Computer Software and Applications Conf., 2014, pp. 426–435
[92] Liu, W., Liu, S., Gu, Q., et al.: 'Fecs: a cluster based feature selection method for software fault prediction with noises'. Proc. IEEE 39th Annual Computer Software and Applications Conf., 2015, pp. 276–281
[93] Xu, Z., Xuan, J., Liu, J., et al.: 'Michac: defect prediction via feature selection based on maximal information coefficient with hierarchical agglomerative clustering'. Proc. IEEE 23rd Int. Conf. Software Analysis, Evolution, and Reengineering, 2016, pp. 370–381
[94] Menzies, T., Butcher, A., Cok, D., et al.: 'Local versus global lessons for defect prediction and effort estimation', IEEE Trans. Softw. Eng., 2013, 39, (6), pp. 822–834
[95] Bettenburg, N., Nagappan, M., Hassan, A.E.: 'Towards improving statistical modeling of software engineering data: think locally, act globally!', Empir. Softw. Eng., 2015, 20, (2), pp. 294–335
[96] Herbold, S., Trautsch, A., Grabowski, J.: 'Global vs. local models for cross-project defect prediction', Empir. Softw. Eng., 2017, 22, (4), pp. 1866–1902
[97] Mezouar, M.E., Zhang, F., Zou, Y.: 'Local versus global models for effort-aware defect prediction'. Proc. 26th Annual Int. Conf. Computer Science and Software Engineering, 2016, pp. 178–187
[98] Nam, J., Kim, S.: 'Heterogeneous defect prediction'. Proc. 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 508–519
[99] He, P., Li, B., Ma, Y.: 'Towards cross-project defect prediction with imbalanced feature sets', CoRR, 2014, abs/1411.4228. Available at http://arxiv.org/abs/1411.4228
[100] Cheng, M., Wu, G., Jiang, M., et al.: 'Heterogeneous defect prediction via exploiting correlation subspace'. Proc. 28th Int. Conf. Software Engineering and Knowledge Engineering, 2016, pp. 171–176
[101] Zhang, H., Zhang, X.: 'Comments on 'data mining static code attributes to learn defect predictors'', IEEE Trans. Softw. Eng., 2007, 33, (9), pp. 635–637
[102] He, H., Garcia, E.A.: 'Learning from imbalanced data', IEEE Trans. Knowl. Data Eng., 2009, 21, (9), pp. 1263–1284
[103] Jing, X.Y., Wu, F., Dong, X., et al.: 'An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems', IEEE Trans. Softw. Eng., 2017, 43, (4), pp. 321–339
[104] Tan, M., Tan, L., Dara, S., et al.: 'Online defect prediction for imbalanced data'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 99–108
[105] Chen, L., Fang, B., Shang, Z., et al.: 'Tackling class overlap and imbalance problems in software defect prediction', Softw. Qual. J., 2018, 26, (1), pp. 97–125
[106] Wu, F., Jing, X.Y., Dong, X., et al.: 'Cost-sensitive local collaborative representation for software defect prediction'. Proc. Int. Conf. Software Analysis, Testing and Evolution, 2016, pp. 102–107
[107] Liu, M., Miao, L., Zhang, D.: 'Two-stage cost-sensitive learning for software defect prediction', IEEE Trans. Reliab., 2014, 63, (2), pp. 676–686
[108] Rodriguez, D., Herraiz, I., Harrison, R., et al.: 'Preliminary comparison of techniques for dealing with imbalance in software defect prediction'. Proc. 18th Int. Conf. Evaluation and Assessment in Software Engineering, 2014, pp. 1–10
[109] Malhotra, R., Khanna, M.: 'An empirical study for software change prediction using imbalanced data', Empir. Softw. Eng., 2017, 22, (6), pp. 2806–2851
[110] Herzig, K., Just, S., Zeller, A.: 'It's not a bug, it's a feature: how misclassification impacts bug prediction'. Proc. 2013 Int. Conf. Software Engineering, 2013, pp. 392–401
[111] Rahman, F., Posnett, D., Herraiz, I., et al.: 'Sample size vs. bias in defect prediction'. Proc. 2013 9th Joint Meeting on Foundations of Software Engineering, 2013, pp. 147–157
[112] Tantithamthavorn, C., McIntosh, S., Hassan, A.E., et al.: 'The impact of mislabelling on the performance and interpretation of defect prediction models'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 812–823
[113] Herzig, K., Just, S., Zeller, A.: 'The impact of tangled code changes on defect prediction models', Empir. Softw. Eng., 2016, 21, (2), pp. 303–336
[114] Peters, F., Menzies, T.: 'Privacy and utility for defect prediction: experiments with morph'. Proc. 34th Int. Conf. Software Engineering, 2012, pp. 189–199
[115] Qi, F., Jing, X.Y., Zhu, X., et al.: 'Privacy preserving via interval covering based subclass division and manifold learning based bi-directional obfuscation for effort estimation'. Proc. 31st IEEE/ACM Int. Conf. Automated Software Engineering, 2016, pp. 75–86
[116] Peters, F., Menzies, T., Layman, L.: 'Lace2: better privacy-preserving data sharing for cross project defect prediction'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 801–811
[117] Mende, T., Koschke, R.: 'Effort-aware defect prediction models'. Proc. 14th European Conf. Software Maintenance and Reengineering, 2010, pp. 107–116
[118] Kamei, Y., Matsumoto, S., Monden, A., et al.: 'Revisiting common bug prediction findings using effort-aware models'. Proc. IEEE Int. Conf. Software Maintenance, 2010, pp. 1–10
[119] Zhou, Y., Xu, B., Leung, H., et al.: 'An in-depth study of the potentially confounding effect of class size in fault prediction', ACM Trans. Softw. Eng. Methodol., 2014, 23, (1), p. 10
[120] Yang, Y., Zhou, Y., Lu, H., et al.: 'Are slice-based cohesion metrics actually useful in effort-aware post-release fault-proneness prediction? An empirical study', IEEE Trans. Softw. Eng., 2015, 41, (4), pp. 331–357
[121] Sarkar, S., Kak, A.C., Rama, G.M.: 'Metrics for measuring the quality of modularization of large-scale object-oriented software', IEEE Trans. Softw. Eng., 2008, 34, (5), pp. 700–720
[122] Zhao, Y., Yang, Y., Lu, H., et al.: 'An empirical analysis of package-modularization metrics: implications for software fault-proneness', Inf. Softw. Technol., 2015, 57, (1), pp. 186–203
[123] Yang, Y., Zhou, Y., Liu, J., et al.: 'Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models'. Proc. 24th ACM SIGSOFT Int. Symp. Foundations of Software Engineering, 2016, pp. 157–168
[124] Yang, Y., Harman, M., Krinke, J., et al.: 'An empirical study on dependence clusters for effort-aware fault-proneness prediction'. Proc. 31st IEEE/ACM Int. Conf. Automated Software Engineering, 2016, pp. 296–307
[125] Ma, W., Chen, L., Yang, Y., et al.: 'Empirical analysis of network measures for effort-aware fault-proneness prediction', Inf. Softw. Technol., 2016, 69, pp. 50–70
[126] Zhao, Y., Yang, Y., Lu, H., et al.: 'Understanding the value of considering client usage context in package cohesion for fault-proneness prediction', Autom. Softw. Eng., 2017, 24, (2), pp. 393–453
[127] Bennin, K.E., Keung, J., Monden, A., et al.: 'Investigating the effects of balanced training and testing datasets on effort-aware fault prediction models'. Proc. IEEE 40th Annual Computer Software and Applications Conf., 2016, pp. 154–163
[128] Panichella, A., Alexandru, C.V., Panichella, S., et al.: 'A search-based training algorithm for cost-aware defect prediction'. Proc. 2016 Genetic and Evolutionary Computation Conf., 2016, pp. 1077–1084
[129] Shepperd, M., Bowes, D., Hall, T.: 'Researcher bias: the use of machine learning in software defect prediction', IEEE Trans. Softw. Eng., 2014, 40, (6), pp. 603–616
[130] Tantithamthavorn, C., McIntosh, S., Hassan, A.E., et al.: 'Comments on 'researcher bias: the use of machine learning in software defect prediction'', IEEE Trans. Softw. Eng., 2016, 42, (11), pp. 1092–1094
[131] Ghotra, B., McIntosh, S., Hassan, A.E.: 'Revisiting the impact of classification techniques on the performance of defect prediction models'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 789–800
[132] Tantithamthavorn, C., McIntosh, S., Hassan, A.E., et al.: 'Automated parameter optimization of classification techniques for defect prediction models'. Proc. 38th Int. Conf. Software Engineering, 2016, pp. 321–332
[133] Bowes, D., Hall, T., Petrić, J.: 'Software defect prediction: do different classifiers find the same defects?', Softw. Qual. J., 2017. Available at https://doi.org/10.1007/s11219-016-9353-3
[134] Bowes, D., Hall, T., Gray, D.: 'Dconfusion: a technique to allow cross study performance evaluation of fault prediction studies', Autom. Softw. Eng., 2014, 21, (2), pp. 287–313
[135] Malhotra, R., Khanna, M.: 'An exploratory study for software change prediction in object-oriented systems using hybridized techniques', Autom. Softw. Eng., 2017, 24, (3), pp. 673–717
[136] Zhang, F., Hassan, A.E., McIntosh, S., et al.: 'The use of summation to aggregate software metrics hinders the performance of defect prediction models', IEEE Trans. Softw. Eng., 2017, 43, (5), pp. 476–491
[137] He, P., Li, B., Liu, X., et al.: 'An empirical study on software defect prediction with a simplified metric set', Inf. Softw. Technol., 2015, 59, pp. 170–190
[138] Yang, Y., Zhao, Y., Liu, C., et al.: 'An empirical investigation into the effect of slice types on slice-based cohesion metrics', Inf. Softw. Technol., 2016, 75, pp. 90–104
[139] Jaafar, F., Guéhéneuc, Y.G., Hamel, S., et al.: 'Evaluating the impact of design pattern and anti-pattern dependencies on changes and faults', Empir. Softw. Eng., 2016, 21, (3), pp. 896–931
[140] Chen, L., Ma, W., Zhou, Y., et al.: 'Empirical analysis of network measures for predicting high severity software faults', Sci. China Inf. Sci., 2016, 59, (12), pp. 1–18
[141] Kamei, Y., Fukushima, T., McIntosh, S., et al.: 'Studying just-in-time defect prediction using cross-project models', Empir. Softw. Eng., 2016, 21, (5), pp. 2072–2106
[142] Petrić, J., Bowes, D., Hall, T., et al.: 'Building an ensemble for software defect prediction based on diversity selection'. Proc. 10th ACM/IEEE Int. Symp. Empirical Software Engineering and Measurement, 2016, p. 46
[143] Liu, W., Liu, S., Gu, Q., et al.: 'Empirical studies of a two-stage data preprocessing approach for software fault prediction', IEEE Trans. Reliab., 2016, 65, (1), pp. 38–53
[144] Xu, Z., Liu, J., Yang, Z., et al.: 'The impact of feature selection on defect prediction performance: an empirical comparison'. Proc. IEEE 27th Int. Symp. Software Reliability Engineering, 2016, pp. 309–320
[145] Mende, T.: 'Replication of defect prediction studies: problems, pitfalls and recommendations'. Proc. 6th Int. Conf. Predictive Models in Software Engineering, 2010, pp. 1–10