
IET Software

Review Article

Progress on approaches to software defect prediction

ISSN 1751-8806
Received on 22nd June 2017
Revised 17th January 2018
Accepted on 10th February 2018
E-First on 12th March 2018
doi: 10.1049/iet-sen.2017.0148
www.ietdl.org

Zhiqiang Li1, Xiao-Yuan Jing1,2, Xiaoke Zhu1,3


1School of Computer, Wuhan University, Wuhan 430072, People's Republic of China
2School of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210023, People's Republic of China
3School of Computer and Information Engineering, Henan University, Kaifeng 475001, People's Republic of China

E-mail: jingxy_2000@126.com

Abstract: Software defect prediction is one of the most popular research topics in software engineering. It aims to predict defect-prone software modules before defects are discovered, so that it can be used to better prioritise software quality assurance effort. In recent years, and especially in the last three years, many new defect prediction studies have been proposed. The goal of this study is to comprehensively review, analyse and discuss the state of the art of defect prediction. The authors survey almost 70 representative defect prediction papers from recent years (January 2014–April 2017), most of which are published in prominent software engineering journals and top conferences. The selected defect prediction papers are summarised along four aspects: machine learning-based prediction algorithms, manipulating the data, effort-aware prediction and empirical studies. The research community still faces a number of challenges in building methods, and many research opportunities exist. The identified challenges can give some practical guidelines for both software engineering researchers and practitioners in future software defect prediction.

1 Introduction

Software defect prediction is one of the most active research areas in software engineering and plays an important role in software quality assurance [1–5]. The growing complexity and dependency of software have increased the difficulty of delivering high-quality, low-cost and maintainable software, as well as the chance of creating software defects. A software defect usually produces incorrect or unexpected results and behaviours in unintended ways [6].

Defect prediction is a crucial and essential activity. Using defect predictors can reduce costs and improve software quality by recognising defect-prone modules (instances) prior to testing, such that software engineers can effectively optimise the allocation of limited resources for testing and maintenance. Software defect prediction [7, 8] can be done by classifying a software instance (e.g. at the method, class, file, change or package level) as defective or non-defective. The defect prediction model can be built by using various software metrics and historical defect data collected from previous releases of the same project [9, 10] or even from other projects [11–13]. Such a model is trained to predict whether software instances are defective or not. With the prediction model, software engineers can effectively allocate the available testing resources to the defective instances to improve software quality in the early phases of the development life cycle. For example, if only 25% of testing resources are available, software engineers can focus these testing resources on fixing the instances that are more prone to defects. Therefore, high-quality, low-cost and maintainable software can be delivered within the given time, resources and budget. That is why software defect prediction is today a popular research topic in the software engineering field [14].

In the past few decades, more and more research works have paid attention to software defect prediction and a large number of papers have been published. There have already been several excellent systematic review works on software defect prediction [1, 3–5]. Catal and Diri [3] reviewed 74 defect prediction papers in 11 journals and several conference proceedings, focusing on software metrics, datasets and methods used to build defect prediction models. Later on, according to the publication year, Catal [4] investigated 90 defect prediction papers published between 1990 and 2009. They mainly surveyed machine learning and statistical analysis-based methods for defect prediction. Hall et al. [1] performed a systematic literature review to investigate how the context of the models, the independent variables used, and the modelling techniques applied influence the performance of defect models. Their analysis is based on 208 defect prediction studies published from January 2000 to December 2010. Recently, Malhotra [5] conducted a systematic review of studies in the literature that use machine learning techniques for software defect prediction. They identified 64 primary studies and seven categories of machine learning techniques from January 1991 to October 2013. In summary, the published defect prediction papers are complex and disparate, and thus no up-to-date comprehensive picture of the current state of defect prediction exists. These review works are not able to cover the latest progress of software defect prediction research.

To fill these gaps in existing systematic review works, this paper tries to provide a comprehensive and systematic review of pioneering works on software defect prediction in the recent three years (i.e. from January 2014 to April 2017). A summary of the approaches can be used by researchers as a foundation for future investigations of defect prediction. Fig. 1 shows the number of published papers with the search keywords 'defect prediction' OR 'fault prediction' OR 'bug prediction' AND 'software' in four important computer science libraries (ACM, IEEE, Elsevier and Springer) from 2014 to 2016. It can be seen that the number of published papers is gradually increasing, which indicates that the research trend on the topic of defect prediction is growing. Naturally, it is impossible to completely review all the papers shown in Fig. 1. Similar to prior works [1, 3–5], we need to define some selection criteria. In this paper, we excluded papers that are not closely related to the research topic of defect prediction, repeated studies and papers that do not belong to current research hotspots, by manual check and screening. Thus, we carefully chose almost 70 representative defect prediction studies from recent years, most of which are published in prominent software engineering journals and top conferences. Through this paper, we provide an in-depth study of the major techniques, research hotspots and trends of software defect prediction.

Fig. 1  Published papers related to software defect prediction on ACM, IEEE, Elsevier and Springer libraries

The long history of defect prediction has led to the proposal of many theories, approaches and models. Previous literature reviews [1, 3–5] covered defect prediction studies from 1990 to 2013. Within this article, we perform a detailed analysis of the defect prediction studies between 2014 and 2017. The above literature reviews provide a comprehensive picture of existing defect prediction studies and cover the long-term progress in this field. Based on these reviews, we can identify and analyse the research trends of defect prediction. Previous literature reviews have performed systematic reviews in the broader area of defect prediction, including modelling techniques, methods, metrics, datasets and performance evaluation measures. In this paper, we aim to summarise and analyse the defect prediction studies that focus on the research hotspots and emerging topics, e.g. machine learning-based prediction algorithms, manipulating the data, effort-aware prediction and empirical studies. In this respect, we hope that this review will provide a reference point for conducting future research and yield many high-quality studies in the defect prediction field.

This paper is organised as follows: In Section 2, we briefly introduce the defect prediction process, common software metrics, public datasets and widely used evaluation measures. Section 3 describes the categories of defect prediction algorithms and discusses some new representative techniques. Section 4 reviews various defect prediction studies that focus on manipulating the data. Sections 5 and 6 separately survey the effort-aware context-based defect prediction methods and the empirical studies. We outline some future directions and challenges in Section 7 and present conclusions in Section 8.

2 Background

In this section, we first briefly introduce an overview of the software defect prediction process. Secondly, we briefly review the common software metrics used in defect prediction studies. Thirdly, we present some publicly available benchmark datasets for defect prediction. Finally, we describe the widely used performance evaluation measures in the defect prediction literature.

2.1 Defect prediction process

Most existing software defect prediction studies have employed machine learning techniques [15–20]. An overview of the software defect prediction process based on machine learning classification models is shown in Fig. 2.

Fig. 2  Software defect prediction process

To build a defect prediction model, the first step is to create data instances from software archives such as version control systems (e.g. SVN, CVS, GIT), issue tracking systems (e.g. Bugzilla, Jira) and so on. The version control systems contain the source code and commit messages, while the issue tracking systems include defect information. According to the prediction granularity, each instance can represent a method, a class, a source code file, a package or a code change. Each instance usually contains a number of defect prediction metrics (features), which are extracted from the software archives. The metric values represent the complexity of the software and its development process. An instance is labelled as defective or non-defective according to whether it contains defects or not. Then, based on the obtained metrics and labels, a defect prediction model can be built by using a set of training instances. Finally, the prediction model classifies whether a new test instance is defective or not.
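To make this process concrete, the following Python sketch trains a classifier on labelled instances described by software metrics and then classifies new instances with scikit-learn. It is a minimal illustration, not the pipeline of any particular study: the synthetic data and the metric/label layout are placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: rows are instances (e.g. files), columns are metrics
# such as LOC, cyclomatic complexity or change churn; labels mark defects.
rng = np.random.default_rng(0)
X = rng.random((500, 20))                   # 500 instances, 20 metrics
y = (rng.random(500) < 0.15).astype(int)    # ~15% defective, i.e. imbalanced

# Split historical data into training and test instances.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Train a classification model on the labelled training instances.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classify new (test) instances as defective (1) or non-defective (0).
predicted_labels = model.predict(X_test)
defect_probability = model.predict_proba(X_test)[:, 1]
print(predicted_labels[:10], defect_probability[:10])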
2.2 Defect prediction metrics

Software defect prediction metrics play the most important role in building a prediction model that aims to improve software quality by predicting as many software defects as possible. Most defect prediction metrics can be divided into code metrics and process metrics [21]. Code metrics represent how complex the source code is, while process metrics represent how complex the development process is [7, 21, 22]. A systematic literature review of defect prediction metrics can be found in [23].

Code metrics are collected directly from the existing source code and mainly measure properties of the source code such as size and complexity. Their underlying assumption is that source code with higher complexity is more defect prone. In the past few decades, various code metrics (CMs) have been presented to measure code complexity. For example, the lines of code (LOC) metric is one of the most commonly used and representative size metrics. Based on the number of operators and operands, Halstead [24] presented a few complexity metrics. In order to measure the complexity of the software code structure, McCabe [25] proposed several cyclomatic metrics. Afterwards, as object-oriented (OO) programming became popular, CMs for OO languages were presented to improve the development process. The Chidamber and Kemerer (CK) metrics [26] are some of the most representative metrics for OO programmes. Beyond the CK metrics, researchers have presented other OO metrics based on the volume and quantity of source code [27]. These OO metrics simply count the number of variables, methods, classes and so on.
Process metrics are extracted from historical information archived in different software repositories such as version control and issue tracking systems. These metrics reflect the changes over time and quantify many aspects of the software development process such as changes of source code, the number of code changes, developer information and so on. In recent years, a number of representative process metrics have been proposed. These metrics can be categorised into the following five groups: (i) code change-based metrics such as relative code change churn [28], change [29], change entropy [30], CM churn and code entropy [31]; (ii) developer information-based metrics such as the number of engineers [32], developer-module networks [33], developer network and social network [34], ownership and authorship [35, 36], developer focus and ownership [37], defect patterns of developers [38] and micro interaction metrics [39]; (iii) dependency analysis-based metrics such as dependency graph [40], socio-technical networks [41], change coupling [42], citation influence topic [43] and change genealogy [44]; (iv) project team organisation-based metrics such as organisational structure and organisational volatility [45–47]; (v) other process metrics such as popularity [48] and anti-pattern [49].

Table 1 shows the common CMs and process metrics for software defect prediction. Both CMs and process metrics can be used to build defect prediction models. However, there are ongoing debates on whether CMs are good defect predictors and whether process metrics are better than CMs. Menzies et al. [7] confirmed that CMs are still useful for building defect prediction models based on the NASA dataset. Zhang [50] found that defects can be predicted well by simply considering the LOC metric. Moser et al. [29] conducted a comparative analysis of the predictive power of code and process metrics for defect prediction. They observed that process metrics are more efficient defect predictors than CMs for the Eclipse dataset. Recently, Rahman et al. [21] performed an empirical study to compare code and process metrics. They concluded that CMs are less useful than process metrics because of the stagnation of CMs.

2.3 Public datasets

The software defect dataset is one of the most important issues for conducting defect prediction. In the early stages, some academic researchers and companies employed non-public datasets such as proprietary projects to develop defect prediction models. However, it is not possible to compare the results of such methods with each other, because their datasets cannot be obtained. Since machine learning researchers had similar problems in the 1990s, they created a repository called the University of California Irvine (UCI) Machine Learning Repository. Inspired by the success of the UCI repository, researchers created the PROMISE repository of empirical software engineering data, which has collected several publicly available datasets since 2005. In addition, some researchers [31, 44, 51–55] spontaneously publish their extracted datasets for further empirical study on software defect prediction. In this section, we briefly introduce the existing publicly available and commonly used benchmark datasets. Table 2 shows the detailed description of the publicly available datasets.

The NASA benchmark dataset consists of 13 software projects [7, 56]. The number of instances ranges from 127 to 17,001, while the number of metrics ranges from 20 to 40. Each project in NASA represents a NASA software system or sub-system, which contains the corresponding defect-marking data and various static CMs. The repository records the number of defects for each instance by using a bug tracking system. The static CMs of the NASA datasets include size, readability and complexity attributes, which are closely related to software quality.

The Turkish software dataset (SOFTLAB) consists of five projects, which are embedded controller software for white goods [12]. The projects of SOFTLAB are obtained from the PROMISE repository and they have 29 metrics. The number of instances ranges from 36 to 121.

Jureczko and Madeyski [57] collected some open source, proprietary and academic software projects, which are part of the PROMISE repository. The collected data consist of 92 versions of 38 different software development projects, including 48 versions of 15 open-source projects, 27 versions of six proprietary projects and 17 academic projects. Each project has 20 metrics in total, which include McCabe's cyclomatic metrics, CK metrics and other OO metrics.

The datasets in ReLink were collected by Wu et al. [51] to improve the defect prediction performance by increasing the quality of the defect data. The defect information in ReLink has been manually verified and corrected. ReLink consists of three projects and each one has 26 complexity metrics.

AEEEM was collected by D'Ambros et al. [31] and used to benchmark different defect prediction models. Each AEEEM dataset consists of 61 metrics: 17 source CMs, five previous-defect metrics, five entropy-of-change metrics, 17 entropy-of-source-CMs and 17 churn-of-source CMs [16].

The just-in-time (JIT) dataset was collected by Kamei et al. [52] and used to study the prediction of defect-inducing changes at the change level. It consists of six open source projects and each project has 14 change metrics covering five dimensions: diffusion, size, purpose, history and experience.

The ECLIPSE1 bug data set was collected by Zimmermann et al. [53]. It contains defect data of three Eclipse releases (i.e. 2.0, 2.1 and 3.0), which is extracted at both the file and package levels. There are 31 static CMs at the file level and 40 metrics at the package level. The resulting data set lists the number of pre-release and post-release defects for every file and package in these three Eclipse releases.

The ECLIPSE2 bug data set was collected by Kim et al. [54] and used to study how to deal with noise in defect prediction. This dataset contains two projects (SWT and Debug) from Eclipse 3.4. The defect data are collected by mining the Eclipse Bugzilla and CVS repositories. There are 17 metrics in total, covering four different types of metrics: complexity, OO, change and developer.

The NetGene dataset was collected by Herzig et al. [44] and used to study the predictive effectiveness of change genealogies in defect prediction. This dataset consists of four open source projects. Each project has a total of 456 metrics, including complexity metrics, network metrics and change genealogy metrics related to the history of a file.
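Most of these repositories distribute each project as a flat table of metric values plus a defect label, so turning a downloaded file into a training set is typically only a few lines. The sketch below assumes a CSV file named project.csv with a label column named defective; both names are placeholders, since the real datasets use varying column names and formats (e.g. ARFF for many PROMISE projects).

import pandas as pd

# Placeholder file and column names; adjust to the dataset actually used.
data = pd.read_csv("project.csv")

# Separate the metric columns (features) from the defect label.
y = (data["defective"] > 0).astype(int)    # 1 = defective, 0 = non-defective
X = data.drop(columns=["defective"])

print(X.shape, y.mean())                   # instances/metrics and defect ratio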

Table 1 Summary of software metrics
Type | Categories | Representatives
CMs | size | LOC
CMs | complexity | Halstead, McCabe
CMs | OO | CK, other OO metrics
process metrics | code change | relative code change churn, change, change entropy, CM churn, code entropy
process metrics | developer information | the number of engineers, developer-module networks, developer network and social network, ownership and authorship, developer focus and ownership, developer defect patterns, micro interaction metrics
process metrics | dependency analysis | dependency graph, socio-technical networks, change coupling, citation influence topic, change genealogy
process metrics | project team organisation | organisational structure, organisational volatility
process metrics | others | popularity, anti-pattern
Table 2 Summary of the publicly available datasets for software defect prediction
Dataset | Description | Number of projects | Metric used | Number of metrics | Granularity | Website
NASA | NASA Metrics Data Program | 13 | CMs | 20 to 40 | function | http://openscience.us/repo/
SOFTLAB | Software Research Laboratory from Turkey | 5 | CMs | 29 | function | http://openscience.us/repo/
PROMISE | open source, proprietary and academic projects | 38 | CMs | 20 | class | http://openscience.us/repo/
ReLink | recovering links between bugs and changes | 3 | CMs | 26 | file | http://www.cse.ust.hk/scc/ReLink.htm
AEEEM | open source projects for evaluating defect prediction models | 5 | CMs, process metrics | 61 | class | http://bug.inf.usi.ch/
JIT | open source projects for JIT prediction | 6 | process metrics | 14 | change | http://research.cs.queensu.ca/kamei/jittse/jit.zip
ECLIPSE1 | eclipse projects for predicting defects | 3 | CMs | 31, 40 | file, package | https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/
ECLIPSE2 | eclipse projects for handling noise in defect prediction | 2 | CMs, process metrics | 17 | file | https://code.google.com/archive/p/hunkim/wikis/HandlingNoise.wiki
NetGene | open source projects for predicting defects using change genealogies | 4 | CMs, process metrics | 465 | file | https://hg.st.cs.uni-saarland.de/projects/cg_data_sets/repository
AEV | industry projects for predicting defects | 3 | CMs, process metrics | 29 | file | http://www.ist.tugraz.at/_attach/Publish/AltingerHarald/MSR_2015_dataset_automotive.zip
The AEV data set was collected by Altinger et al. [55]. It is a novel industry dataset obtained from three different automotive embedded software projects developed by Audi Electronics Venture GmbH. Each project has a total of 29 software metrics.

2.4 Evaluation measures

For defect prediction performance, various evaluation measures have been widely used [58–62]. The measurement of prediction performance is usually based on the analysis of data in a confusion matrix. This matrix reports how a prediction model classified the different defect categories compared with their actual classification. Table 3 shows the confusion matrix with the four defect prediction results. Here, true positive (TP), false negative (FN), false positive (FP) and true negative (TN) are the number of defective instances that are predicted as defective, the number of defective instances that are predicted as non-defective, the number of non-defective instances that are predicted as defective, and the number of non-defective instances that are predicted as non-defective, respectively.

Table 3 Confusion matrix
 | Predicted defective | Predicted non-defective
actual defective | TP | FN
actual non-defective | FP | TN

With the confusion matrix, we can define the following performance evaluation measures, which are commonly used in defect prediction studies. Table 4 shows the most commonly used performance evaluation measures for defect prediction.

Table 4 Commonly used performance evaluation measures
Measure | Defined as
Pd/recall/TP rate | TP/(TP + FN)
Pf/FP rate | FP/(FP + TN)
precision | TP/(TP + FP)
F-measure | 2 × Pd × precision/(Pd + precision) = 2TP/(2TP + FP + FN)
G-measure | 2 × Pd × (1 − Pf)/(Pd + (1 − Pf))
balance | 1 − sqrt((0 − Pf)^2 + (1 − Pd)^2)/sqrt(2)
accuracy | (TP + TN)/(TP + FP + FN + TN)
G-mean | sqrt(Pd × (1 − Pf))
MCC | (TP × TN − FP × FN)/sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
AUC | the area under the ROC curve
Popt | the area under effort-based cumulative lift charts, which compares a predicted model with an optimal model
AUCEC | the area under the cost-effectiveness curve

Pd, the probability of detection (also called recall or true positive rate), is defined as TP/(TP + FN). It denotes the ratio of the number of defective instances that are correctly classified as defective to the total number of defective instances. This measure is very important for defect prediction, because prediction models intend to find as many defective instances as possible.

Pf, the probability of false alarm (or false positive rate), is defined as FP/(FP + TN). It denotes the ratio of the number of non-defective instances that are wrongly classified as defective to the total number of non-defective instances.

Precision is defined as TP/(TP + FP). It denotes the ratio of the number of defective instances that are correctly classified as defective to the number of instances that are classified as defective. As observed by Menzies et al. [63], precision is a highly unstable performance indicator when data sets contain a low percentage of defects.

F-measure is a comprehensive measure defined as the harmonic mean of Pd and precision: 2 × Pd × precision/(Pd + precision).

Similar to the F-measure, the G-measure [64] is the harmonic mean of Pd and 1 − Pf, which is defined as 2 × Pd × (1 − Pf)/(Pd + (1 − Pf)). The term 1 − Pf represents specificity [59].
Balance [7] is defined as 1 − sqrt((0 − Pf)^2 + (1 − Pd)^2)/sqrt(2). It combines Pf and Pd and is calculated from the Euclidean distance to the ROC (receiver operating characteristic) sweet spot Pf = 0 and Pd = 1. Hence, better and higher balance values fall closer to the desired sweet spot of Pf = 0 and Pd = 1.

Accuracy is defined as (TP + TN)/(TP + FP + FN + TN), which denotes the percentage of correctly predicted instances.

Geometric mean (G-mean) [59, 65] is utilised for the overall evaluation of predictors in the imbalanced context. It computes the geometric mean of Pd and 1 − Pf, which is defined as sqrt(Pd × (1 − Pf)).

Matthews correlation coefficient (MCC) [66, 67] measures the correlation between the observed and predicted binary classification, with values in [−1, 1]. A value of 1 denotes a perfect prediction, 0 is no better than random prediction, and −1 represents total disagreement between observation and prediction.

AUC is the area under the ROC curve. This curve is plotted in a two-dimensional space with Pf as the x-coordinate and Pd as the y-coordinate. The AUC is known as a useful measure for comparing different models and is widely used because it is unaffected by class imbalance and independent of the prediction threshold. Other measures such as Pd and precision can vary according to the prediction threshold value, whereas AUC considers the prediction performance over all possible threshold values. A higher AUC represents better prediction performance, and an AUC of 0.5 means the performance of a random predictor [68].
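As an illustration of how these measures follow from the confusion matrix, the short Python sketch below computes Pd, Pf, precision, F-measure, G-measure, balance, accuracy, G-mean and MCC from predicted and actual labels. It is a minimal example (variable names are ours), using scikit-learn only to build the confusion matrix.

import math
from sklearn.metrics import confusion_matrix

def prediction_measures(y_true, y_pred):
    # Confusion matrix for binary labels (0 = non-defective, 1 = defective).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    pd = tp / (tp + fn)                       # probability of detection (recall)
    pf = fp / (fp + tn)                       # probability of false alarm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = 2 * pd * precision / (pd + precision) if (pd + precision) else 0.0
    g_measure = 2 * pd * (1 - pf) / (pd + (1 - pf))
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    g_mean = math.sqrt(pd * (1 - pf))
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return dict(Pd=pd, Pf=pf, precision=precision, F=f_measure, G=g_measure,
                balance=balance, accuracy=accuracy, G_mean=g_mean, MCC=mcc)

# AUC needs continuous scores rather than hard labels, e.g.
# sklearn.metrics.roc_auc_score(y_true, defect_probability).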
Popt [52, 60] and AUCEC [68] are two effort-aware measures. Similar to the AUC metric, Popt = 1 − Δopt, where Δopt is defined as the area between the predicted model and the optimal model. AUCEC is defined as the area under the cost-effectiveness curve of the predicted model. These measures consider the LOC to be inspected or tested by quality assurance teams or developers; in fact, the two are essentially equivalent. The cost-effectiveness curve is plotted in a two-dimensional space with the proportion of LOC inspected as one axis and recall as the other. The idea of cost-effectiveness for defect prediction models has practical significance: cost-effectiveness means how many defects can be found among the top n% of LOC inspected or tested. That is, if a certain prediction model can find more defects with less inspection and testing effort compared with other models, we can say that the cost-effectiveness of the model is higher.

Besides these commonly used evaluation measures, there are some less common evaluation measures such as overall error-rate, H-measure [69], J-coefficient [59] and Brier score [70].
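The effort-aware idea can be made concrete with a small sketch that ranks instances by predicted defect density (predicted risk divided by LOC) and reports how many defective instances are found within the first 20% of the total LOC. This is a simplified cost-effectiveness calculation in the spirit of Popt and AUCEC, not the exact formulation of [52, 60, 68]; the argument names are placeholders.

import numpy as np

def recall_at_effort(risk, loc, is_defective, effort_fraction=0.2):
    """Fraction of defective instances found when inspecting modules in
    decreasing order of risk density until effort_fraction of the total
    LOC has been spent (a simplified cost-effectiveness measure)."""
    density = risk / np.maximum(loc, 1)        # predicted risk per line
    order = np.argsort(-density)               # most cost-effective first
    budget = effort_fraction * loc.sum()
    spent, found = 0.0, 0
    for i in order:
        if spent + loc[i] > budget:
            break
        spent += loc[i]
        found += int(is_defective[i])
    return found / max(is_defective.sum(), 1)

# Example call with placeholder arrays:
# recall_at_effort(defect_probability, module_loc, actual_labels)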
3 Categories of prediction algorithms

Machine learning techniques are the most popular methods for defect prediction [1, 3, 5]. As new machine learning techniques have been developed, various algorithms such as dictionary learning [71], collaborative representation learning [72], multiple kernel ensemble learning [73], deep learning [74, 75] and transfer learning [76, 77] have been applied to build better defect prediction models. From the view of machine learning, we roughly divide the related defect prediction literature into the following three categories: supervised, semi-supervised and unsupervised methods. Supervised learning methods refer to the use of all labelled training data in a project to build defect prediction models. Semi-supervised learning methods construct defect prediction models by employing only a small number of labelled training data and a large number of unlabelled data in a project. Unsupervised learning methods do not require labelled training data; they directly utilise the unlabelled data in a project to learn defect prediction models.

3.1 Supervised methods

By utilising a recently developed dictionary learning technique, Jing et al. [71] designed a cost-sensitive discriminative dictionary learning (CDDL) approach to predict software defects. CDDL exploits the class information of historical data to improve the discriminant power and assigns different misclassification costs by increasing the punishment on type II misclassification (i.e. a defective class is predicted as non-defective) to address the class imbalance problem.

At the same time, Jing et al. [72] presented a collaborative representation classification (CRC)-based defect prediction (CSDP) method. CSDP makes use of the recently proposed CRC technique, whose main idea is that an instance can be collaboratively represented by a linear combination of all other instances.

Wang et al. [73] presented a software defect prediction method named multiple kernel ensemble learning (MKEL). MKEL can better represent the defect data in a high-dimensional feature space through multiple kernel learning and reduce the bias caused by the non-defective class through assembling a series of weak classifiers. Besides, MKEL specially designs a new sample weight vector updating strategy to relieve the class imbalance problem.

To bridge the gap between programmes' semantics and defect prediction features, Wang et al. [74] automatically learned a semantic representation of programmes from source code by utilising deep learning techniques. They first extracted programmes' abstract syntax trees to obtain the token vectors. Based on the token vectors, they leveraged a deep belief network to automatically learn semantic features.

With the utilisation of popular deep learning techniques, Yang et al. [75] presented a method named Deeper for JIT defect prediction. Deeper first employs a deep belief network to extract a set of expressive metrics from an initial set of change metrics and then trains a classifier with the extracted metrics to predict defects.

Chen et al. [76] designed a novel double transfer boosting (DTB) algorithm for cross-company prediction. DTB first adopts data gravitation to reconstruct the distribution of the cross-company (CC) data so that it is close to the within-company (WC) data. Then, DTB utilises the transfer boosting learning technique to remove negative instances in the CC data by using a limited number of WC data.

Xia et al. [77] developed a cross-project defect prediction method named hybrid model reconstruction approach (HYDRA), which contains two phases: a genetic algorithm (GA) phase and an ensemble learning phase. At the end of these two phases, HYDRA creates a massive composition of classifiers that can be applied to predict defective instances in the target project.

From the view of optimisation, Canfora et al. [78] treated the defect prediction problem as a multi-objective optimisation problem. They proposed to use a GA for training logistic regression and decision tree models, called the multi-objective defect predictor (MODEP). MODEP allows software engineers to select predictors reaching a specific trade-off, i.e. between the cost of code inspection and the number of defect-prone instances or the number of defects that MODEP can predict.

Ryu et al. [79] presented a transfer cost-sensitive boosting (TCSBoost) method to deal with the class imbalance problem for cross-project defect prediction. Based on the distributional characteristics, TCSBoost assigns different misclassification costs to the correctly/incorrectly classified instances in each iteration of the boosting algorithm.

Yang et al. [80] leveraged decision tree and ensemble learning techniques to design a two-layer ensemble learning (TLEL) method for JIT defect prediction. TLEL first builds a random forest model by combining decision trees and bagging. With the utilisation of a random under-sampling algorithm, TLEL then trains multiple different random forest models and assembles them once more with stacking.

3.2 Semi-supervised methods

By utilising and combining semi-supervised learning and ensemble learning, Wang et al. [81] presented a non-negative sparse-based SemiBoost (NSSB) method for software defect prediction. On one hand, NSSB makes better use of a large number of unlabelled instances and a small number of labelled instances through semi-supervised learning. On the other hand, NSSB assembles a number of weak classifiers to reduce the bias caused by the non-defective class through ensemble learning.
With the utilisation of graph-based semi-supervised learning and sparse representation learning techniques, Zhang et al. [82] proposed a non-negative sparse graph-based label propagation (NSGLP) method for defect prediction. NSGLP first resamples the labelled non-defective instances to generate a balanced training dataset. Then, NSGLP constructs the weights by using a non-negative sparse graph algorithm to better learn the data relationship. Finally, NSGLP iteratively predicts the unlabelled instances through a label propagation algorithm.

3.3 Unsupervised methods

It is a challenging problem to enable defect prediction for new projects or projects without sufficient historical data. To address this limitation, Nam and Kim [83] presented two unsupervised methods: clustering instances and labelling instances in clusters (CLA), and clustering instances, labelling instances, metric selection and instance selection (CLAMI). The key idea of these two methods is to label an unlabelled dataset by using the magnitude of the metric values. Hence, CLA and CLAMI have the advantage of working in an automated manner with no manual effort required.
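The labelling idea can be sketched as follows: for every instance, count how many of its metric values exceed the median of the corresponding metric, then treat the instances with higher counts as the defect-prone group. This is a simplified reading of the scheme in [83]; the exact threshold choices and the additional metric/instance selection of CLAMI are omitted.

import numpy as np

def cla_label(X):
    """Unsupervised labelling by magnitude of metric values (simplified CLA).
    X: instances x metrics. Returns 1 (predicted defective) or 0."""
    medians = np.median(X, axis=0)
    # K = number of metrics whose value lies above that metric's median.
    k = (X > medians).sum(axis=1)
    # Instances in the upper half of K values form the defect-prone cluster.
    return (k > np.median(k)).astype(int)

# labels = cla_label(metric_matrix)   # no historical defect data required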
From the perspective of data clustering, Zhang et al. [84] proposed to leverage spectral clustering (SC) to tackle the heterogeneity between the source project and the target project in cross-project defect prediction. SC is a connectivity-based unsupervised clustering method, which does not require any training data. Thus, SC does not suffer from the problem of heterogeneity. This is an advantage of unsupervised clustering methods.

Summary: By comparing and analysing the above defect prediction methods based on machine learning techniques, as shown in Table 5, we can make the following observations: (i) Defect prediction methods usually directly employ or modify existing well-known machine learning algorithms for different prediction contexts. (ii) Most defect prediction methods leverage supervised learning techniques to build classification models. Generally, supervised defect prediction methods are able to improve the prediction performance, especially the classification accuracy. (iii) Most defect prediction methods are based on software CMs, possibly because they are easy to collect as compared with process metrics.

Table 5 Comparison of the defect prediction methods using machine learning techniques
Category | Study | Technique | Dataset | Year
supervised | Jing et al. [71] | dictionary learning, cost-sensitive learning | NASA | 2014
supervised | Jing et al. [72] | collaborative representation | NASA | 2014
supervised | Wang et al. [73] | multiple kernel ensemble learning, boosting | NASA | 2016
supervised | Wang et al. [74] | deep learning | PROMISE | 2016
supervised | Yang et al. [75] | deep learning | JIT | 2015
supervised | Chen et al. [76] | transfer learning, boosting | PROMISE | 2015
supervised | Xia et al. [77] | transfer learning, GA, AdaBoost | PROMISE | 2016
supervised | Canfora et al. [78] | multiobjective optimisation, GA | PROMISE | 2015
supervised | Ryu et al. [79] | boosting, cost-sensitive learning, transfer learning | PROMISE | 2017
supervised | Yang et al. [80] | decision tree, bagging, random forest | JIT | 2017
semi-supervised | Wang et al. [81] | SemiBoost, graph learning, sparse representation | NASA | 2016
semi-supervised | Zhang et al. [82] | sparse representation, sparse graph, label propagation | NASA | 2017
unsupervised | Nam and Kim [83] | cluster, feature selection | ReLink | 2015
unsupervised | Zhang et al. [84] | spectral clustering | NASA, AEEEM, PROMISE | 2016

4 Manipulating the data

In this section, we review the related defect prediction studies from the perspective of manipulating the data. Specifically, we compare and analyse these studies in the following three aspects: attributes for prediction, data adoption and dataset quality.

4.1 Attributes for prediction

4.1.1 New software metrics: Software metrics, such as source CMs, change churn and the number of previous defects, have been actively studied to enable defect prediction and facilitate software quality assurance. Recently, several methods have been proposed to design new software metrics for defect prediction [39, 85–87].
Considering developer behaviour, Lee et al. [39] leveraged developer interaction information to propose micro-interaction metrics (MIMs). They separately used MIMs, source CMs and change history metrics to build defect prediction models and compared their prediction performance. They found that MIMs significantly improve the overall defect prediction accuracy when combined with existing software metrics and perform well in a cost-effective context. The findings can help software engineers to better identify their own inefficient behaviours during software development.

To measure the quality of source code, Okutan and Yildiz [85] introduced a new metric called lack of coding quality (LOCQ). They applied Bayesian networks to examine the probabilistic influential relationships between software metrics and defect-proneness. They found that response for class, lines of code and LOCQ are the most effective metrics, whereas coupling between objects, weighted methods per class and lack of cohesion of methods are less effective metrics for predicting defects.

To explore the predictive ability of mutation-aware defect prediction, Bowes et al. [86] defined 40 mutation metrics. They separately built defect prediction models to compare the effectiveness of these mutation-aware metrics and 39 source CMs (mutation-unaware), as well as combining them together. They found that mutation-aware metrics can significantly improve defect prediction performance.

To investigate the effect of concerns on software quality, Chen et al. [87] approximated software concerns as topics by using a statistical topic modelling technique. They proposed a set of topic-based metrics including the number of topics, the number of defect-prone topics, topic membership and defect-prone topic membership. Results show that topic-based metrics provide additional explanatory power over existing structural and historical metrics, which can help better explain software defects.

4.1.2 Feature selection: Feature selection (metric selection) has become a focus of machine learning and data mining, and it has also been used in software defect prediction [88, 89]. The aim of feature selection is to select the features which are more relevant to the target class from high-dimensional features and to remove the features which are redundant and uncorrelated. After feature selection, the classification performance of prediction models should improve.

To check the positive effects of combining feature selection and ensemble learning on the performance of defect prediction, Laradji et al. [90] presented an average probability ensemble (APE) method. APE can alleviate the effects caused by metric redundancy and data imbalance on the defect prediction performance. They found that it is very necessary to carefully select relevant and informative features for accurate defect prediction.

Liu et al. [91] designed a new feature selection framework named feature clustering and feature ranking (FECAR). FECAR first partitions the original features into k clusters and then selects relevant features from each cluster. Results show that FECAR is effective in selecting features for defect prediction.

To study feature selection methods with a certain noise tolerance, Liu et al. [92] presented a feature selection method called feature clustering with selection strategies (FECS). FECS contains two phases: a feature clustering phase and a feature selection phase. They found that FECS is effective on both noise-free and noisy datasets.

Xu et al. [93] presented a feature selection framework named maximal information coefficient with hierarchical agglomerative clustering (MICHAC). MICHAC first ranks candidate features to filter out irrelevant ones by using the maximal information coefficient. MICHAC then groups features with hierarchical agglomerative clustering and removes redundant features by selecting one feature from each resulting group.
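A common thread of FECAR and MICHAC is to group correlated metrics and keep one representative per group. The sketch below is a generic cluster-then-rank selector (hierarchical clustering on absolute feature correlations, then keeping, from each cluster, the metric most correlated with the defect label). It is a deliberately simplified stand-in for this family of techniques, not the exact FECAR or MICHAC procedure.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_then_rank(X, y, n_clusters=10):
    """Select one representative feature per correlation cluster."""
    corr = np.abs(np.corrcoef(X, rowvar=False))     # feature-feature |corr|
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    groups = fcluster(Z, t=n_clusters, criterion="maxclust")
    relevance = np.abs(np.array(
        [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]))
    selected = [int(np.flatnonzero(groups == g)[np.argmax(relevance[groups == g])])
                for g in np.unique(groups)]
    return sorted(selected)

# kept_metric_indices = cluster_then_rank(metric_matrix, labels, n_clusters=8)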
Summary: Table 6 shows the comparison of defect prediction studies that focus on designing new software metrics and related feature selection techniques. Software metrics are very important for building defect prediction models. In the past decades, many effective metrics have been proposed, especially CMs and process metrics. To facilitate software quality assurance, new defect prediction metrics are being increasingly designed. However, there may be some redundant and irrelevant metrics which are harmful to the prediction performance of the learned predictors. To address this problem, feature selection techniques can be used to remove those redundant and irrelevant metrics with the aim of improving the performance.

4.2 Data adoption

4.2.1 Local versus global models: Menzies et al. [94] were the first to present local and global models for defect prediction and effort estimation. Local models refer to clustering the whole dataset into smaller subsets with similar data properties and building prediction models by training them on these subsets. On the contrary, a global model is trained on the whole dataset.

To investigate the performance of three methods, namely local models, global models and multivariate adaptive regression splines (MARS), Bettenburg et al. [95] separately built defect prediction models for them. Specifically, they constructed local models by splitting the data into smaller homogeneous subsets and learned several individual statistical models, one for each subset. They then built a global model on the whole dataset. Besides, they treated MARS as a global model with local considerations. In terms of model fit and predictive performance, they found that local models can significantly outperform the global model. For practical applications, they observed that MARS produces general trends and can be considered a hybrid between global and local models.

Herbold et al. [96] performed a large case study on local models in the context of cross-project defect prediction. They evaluated and compared the performance of a local model, a global model and a transfer learning method for cross-project defect prediction. Results show that local models differ only slightly in prediction performance from the global model and the transfer learning method for cross-project defect prediction.

Mezouar et al. [97] compared local and global defect prediction models in the effort-aware context. They found that there is at least one local model that outperforms the global model, but there always exists another local model that performs worse than the global model. They further observed that the worse local model is trained on a subset of the data that has a low percentage of defects. Based on these findings, they recommended that files with smaller size should receive special attention in future effort-aware defect prediction studies and that, in practice, it is beneficial to combine the advantages of global models by taking local considerations into account.

4.2.2 Heterogeneous data: Heterogeneous defect prediction refers to predicting the defect-proneness of software instances in a target project by using heterogeneous metric data collected from other projects. It provides a new perspective on defect prediction and has received much research interest. Recently, several heterogeneous defect prediction models [66, 98–100] have been proposed to predict defects across projects with heterogeneous metric sets (i.e. the source and target projects have different metric sets).

Since the metrics other than the common metrics might have favourable discriminant ability, Jing et al. [66] proposed a heterogeneous defect prediction method, namely CCA+ (canonical correlation analysis), which utilises a unified metric representation (UMR) and a CCA-based transfer learning technique. Specifically, the UMR consists of three types of metrics, including the common metrics of the source and target projects, source-project-specific metrics and target-project-specific metrics. By learning a pair of projective transformations under which the correlation of the source and target projects is maximised, CCA+ can make the data distribution of the target project similar to that of the source project.

At the same time, Nam and Kim [98] presented another solution for heterogeneous defect prediction. They first employed a metric selection technique to remove redundant and irrelevant metrics from the source project. Then, they matched up the metrics of the source and target projects based on metric similarity such as distribution or correlation (e.g. Kolmogorov–Smirnov test-based matching, KSAnalyser). After these processes, they finally arrived at matched source and target metric sets. With the obtained metric sets, they built a heterogeneous defect prediction model to predict the labels of the instances in a target project.
Table 6 Comparison of the defect prediction studies that handle the metrics
Category | Study | Topic | Technique | Dataset | Year
new metrics | Lee et al. [39] | predictive ability evaluation of the micro interaction metrics | correlation-based feature subset selection, random forest | Eclipse system | 2016
new metrics | Okutan and Yildiz [85] | evaluation of the lack of coding quality metric | Bayesian networks | PROMISE | 2014
new metrics | Bowes et al. [86] | predictive ability evaluation of the mutation metrics | naive Bayes, logistic regression, random forest, J48 | 3 real-world systems | 2016
new metrics | Chen et al. [87] | explanatory power evaluation of topic-based metrics | latent Dirichlet allocation, logistic regression, Spearman correlation analysis | 4 real-world systems | 2017
feature selection | Laradji et al. [90] | validation of combining feature selection and ensemble learning | APE, weighted support vector machines | NASA | 2015
feature selection | Liu et al. [91] | methods to build and evaluate feature selection techniques | FF-correlation, FC-relevance | NASA, Eclipse | 2014
feature selection | Liu et al. [92] | evaluation of cluster-based feature selection method with a certain noise tolerance | k-medoids clustering, heuristic selection strategy | NASA, Eclipse | 2015
feature selection | Xu et al. [93] | methods to build and evaluate feature selection techniques | hierarchical agglomerative clustering, maximal information coefficient | NASA, AEEEM | 2016
He et al. [99] presented a cross-project defect prediction (CPDP) approach with imbalanced feature sets to address the problem of heterogeneous metric sets in cross-project defect prediction. They used the distribution characteristics vectors [13] of each instance as new metrics to enable defect prediction.

To deal with the class imbalance problem under the heterogeneous cross-project defect prediction setting, Cheng et al. [100] presented a cost-sensitive correlation transfer support vector machine (CCT-SVM) method based on CCA+ [66]. Specifically, to alleviate the influence of imbalanced data, they employed different misclassification costs for the defective and non-defective classes by incorporating the cost factors into the support vector machine (SVM) model.
Summary: Table 7 shows the comparison of defect prediction studies focusing on data adoption. Local versus global models investigate how to find training data with a distribution similar to that of the test data; better prediction performance can usually be achieved when training and test data have similar distributions [12, 16]. In practice, human labelling for a large number of unlabelled modules is costly and time consuming, and may not be perfectly accurate. Hence, it is difficult to collect defect information to label a dataset for training a prediction model. Heterogeneous defect prediction has good potential to use all heterogeneous data of software projects for defect prediction on new projects or projects with limited historical defect data [66, 98]. However, the main challenge of heterogeneous defect prediction is to overcome the data distribution differences between the source and target projects. Thus, it needs to reshape the distribution of the source data to be similar to that of the target data. In short, it is very useful and interesting to develop high-quality heterogeneous defect prediction methods in future research.

Table 7 Comparison of the defect prediction studies focusing on the data adoption
Category | Study | Topic | Technique | Dataset | Year
local versus global models | Bettenburg et al. [95] | fit ability comparison of local and global models | clustering, MARS, linear regression | six open source projects | 2015
local versus global models | Herbold et al. [96] | predictive ability comparison of local and global models in a cross-project context | clustering, support vector machine | NASA, AEEEM, PROMISE | 2017
local versus global models | Mezouar et al. [97] | predictive ability comparison of local and global models | k-medoids clustering, spectral clustering | AEEEM, PROMISE | 2016
heterogeneous defect prediction models | Jing et al. [66] | a method to solve the heterogeneous metric problem | canonical correlation analysis, transfer learning | NASA, SOFTLAB, ReLink, AEEEM | 2015
heterogeneous defect prediction models | Nam and Kim [98] | a method to solve the heterogeneous metric problem | metric selection, metric matching, transfer learning | NASA, SOFTLAB, ReLink, AEEEM, PROMISE | 2015
heterogeneous defect prediction models | He et al. [99] | feasibility of using different metrics | statistics, logistic regression | ReLink, AEEEM, PROMISE | 2014
heterogeneous defect prediction models | Cheng et al. [100] | an improved method to solve the heterogeneous metric problem | canonical correlation analysis, transfer learning, support vector machine | NASA, SOFTLAB, ReLink, AEEEM | 2016
4.3 Dataset quality

4.3.1 Handling class imbalance: Software defect datasets are often highly imbalanced [65, 101, 102]. That is, the number of defective instances (the minority) is usually much smaller than the number of non-defective (majority) ones. It is challenging for most conventional classification algorithms to work with data that have an imbalanced class distribution. The imbalanced distribution can cause misclassification of the instances in the minority class, and this is an important factor accounting for unsatisfactory prediction performance [1, 61]. Recently, more and more researchers have paid attention to the class imbalance problem in defect prediction [69, 103–105].
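Two of the simplest remedies that recur in the studies below are data sampling and cost-sensitive weighting. The sketch contrasts random under-sampling of the non-defective majority with class weighting in the learner; it is a generic illustration only, since the methods surveyed next combine these ideas with considerably more elaborate machinery.

import numpy as np
from sklearn.linear_model import LogisticRegression

def random_undersample(X, y, random_state=0):
    """Keep all defective instances and an equal number of non-defective ones."""
    rng = np.random.default_rng(random_state)
    defective = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)
    keep = np.concatenate(
        [defective, rng.choice(clean, size=len(defective), replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Option 1: rebalance the training data, then train as usual.
# X_bal, y_bal = random_undersample(X_train, y_train)
# model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Option 2: keep the data, but make minority-class errors costlier.
# model = LogisticRegression(max_iter=1000,
#                            class_weight="balanced").fit(X_train, y_train)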
To deal with the within-project and cross-project class imbalance problems simultaneously, Jing et al. [103] developed a unified defect prediction framework. They first utilised improved subclass discriminant analysis (ISDA) to achieve balanced subclasses for addressing within-project imbalanced data classification. Then, they made the distributions of the source and target data similar through semi-supervised transfer component analysis (SSTCA), and combined SSTCA with ISDA to propose the SSTCA + ISDA approach for cross-project class imbalance learning.
To handle the class imbalance problem in the context of online change defect prediction, Tan et al. [104] proposed to leverage resampling and updatable classification techniques. They adopted several resampling methods, including simple duplication, the synthetic minority over-sampling technique (SMOTE), spread subsample, and resampling with/without replacement, to increase the percentage of defective instances in the training set and so address the imbalanced data challenge.

Ryu et al. [69] investigated the feasibility of using class imbalance learning for cross-project defect prediction. They designed a boosting-based model named value cognitive boosting with support vector machine (VCB-SVM). It sets similarity weights according to distributional characteristics and combines the weights with the asymmetric misclassification cost designed by the boosting algorithm for appropriate resampling.

By utilising class overlap reduction and ensemble imbalance learning techniques, Chen et al. [105] presented a new software defect prediction method called neighbour cleaning learning (NCL) + ensemble random under-sampling (ERUS). On one hand, they eliminated the overlapping non-defective instances through the NCL algorithm. On the other hand, they resampled non-defective instances to construct balanced training subsets through the ERUS algorithm. Finally, they assembled a series of classifiers into an ensemble prediction model through adaptive boosting (AdaBoost).

Wu et al. [106] presented a cost-sensitive local collaborative representation (CLCR) method for defect prediction. CLCR explores the neighbourhood information among software instances to enhance the prediction ability of the model and incorporates different cost factors for the defective and non-defective classes to handle the class imbalance problem.

Liu et al. [107] presented a new two-stage cost-sensitive learning method for defect prediction, which employs cost information in both the feature selection and classification stages. In the first stage, they designed three cost-sensitive feature selection algorithms by utilising different misclassification costs. In the second stage, they employed a cost-sensitive back-propagation neural network classification algorithm with a threshold-moving strategy.

Rodriguez et al. [108] handled imbalanced data by comparing and evaluating different types of algorithms for software defect prediction. They separately used sampling, cost-sensitive, ensemble and hybrid techniques to build prediction models. Results show that these techniques can enhance the correct classification of the defective class and that the preprocessing of the data affects the improvement of the prediction model.
Malhotra and Khanna [109] compared three data sampling techniques (resample with replacement, spread subsample, SMOTE) and cost-sensitive MetaCost learners with seven different cost ratios for change prediction. They employed six machine learning classifiers to conduct change prediction under two settings: ten-fold cross validation and inter-release validation. They found that the resample-with-replacement sampling technique is better than the other techniques.

4.3.2 Handling noises: Defect information plays an important role in software maintenance, such as measuring quality and predicting defects. Since current defect collection practices are based on optional bug-fix keywords or bug report links in change logs, the defect data collected from change logs and bug reports can include noise [51, 54, 110, 111]. Such biased defect data will affect defect prediction performance.
To examine the characteristic of mislabelling, the impact of problem by using different types of techniques, which include cost-
realistic mislabelling on the prediction performance and the sensitive learning (Jing et al. [71], Ryu et al. [79]), ensemble
interpretation of defect models, Tantithamthavorn et al. [112] learning (Wang et al. [73, 81]) and data sampling algorithms
conducted an in-depth study of 3931 manually curated issue reports (Wang et al. [74], Yang et al. [75], Chen et al. [76], Zhang et al.
from two large open-source systems for defect prediction. They [82]). It can be seen that data sampling, ensemble learning, cost-
found that issue report mislabelling is not random and it has rarely sensitive learning and their hybrid methods are commonly used to
impact on the precision of defect prediction models as well as it deal with the class imbalance problem in the domain of defect
does not heavily influence the most influential metrics. prediction. Besides, most defect prediction studies use sampling
Herzig et al. [113] investigated the impact of tangled code techniques, possibly because they are simple, efficient and easy to
changes on defect prediction models. They found that up to 20% of be realised. For handling noises, noise-free data set is very
all bug fixing changes consisted of multiple tangled, which would important for building effective defect prediction models. Data
severely influence bug counting and bug prediction models. Due to contains noises, has a large impact on the defect prediction
tangled changes, there is at least 16.6% of all source files are performance. Due to privacy-preserving issue, most data owners
incorrectly associated with bug reports for predicting bug-prone are not willing to share their data, which will restrict us to utilise
files. Experimental results on tangled bug datasets show that these data for cross-project defect prediction, especially for new
untangling tangled code changes can result in more accurate projects or projects without sufficient historical defect data. Hence,
models. They recommended that future case studies can explore it is very necessary to further thoroughly investigate the dataset
better change organisation to reduce the impact of tangled changes. quality for future defect prediction studies.

4.3.3 Privacy-preserving data sharing: Due to privacy 5 Effort-aware context-based defect prediction
concerns, most companies are not willing to share their data. For studies
these reasons, many researchers doubt the practicality of data
sharing for the purposes of research. Recently, privacy preservation To prioritise quality assurance efforts, defect prediction techniques
issue has been investigated in some software engineering are often used to prioritise software instances based on their
applications, e.g. software defect prediction [64, 114], software probability of having a defect or the number of defects [61, 117,
effort estimation [115] and so on. 118]. With these defect prediction techniques, software engineers
To deal with the privacy preservation problem in multi-party can allocate limited testing or inspection resources to the most
scenario for cross-project defect prediction, Peters et al. [116] defect-prone instances. Recently, effort-aware context-based defect
presented LACE2 (large-scale assurance confidentiality prediction methods [117, 118] have been proposed. These methods

Table 8 Comparison of the defect prediction studies that handle dataset quality
Category | Study | Topic | Technique | Dataset | Year
class imbalance | Jing et al. [103] | method to solve class imbalance problem | ISDA | SOFTLAB, NASA, ReLink, AEEEM | 2017
class imbalance | Tan et al. [104] | online change prediction with different sampling techniques | simple duplicate, SMOTE, spread subsample, resampling with/without replacements | six open source projects | 2015
class imbalance | Ryu et al. [69] | method to solve cross-project class imbalance | boosting, under-sampling, over-sampling | NASA, SOFTLAB | 2016
class imbalance | Chen et al. [105] | method to solve class imbalance problem | ensemble random under-sampling | NASA | 2016
class imbalance | Wu et al. [106] | method to solve class imbalance problem | cost-sensitive learning | NASA | 2016
class imbalance | Liu et al. [107] | method to solve class imbalance problem | cost-sensitive learning | NASA | 2014
class imbalance | Rodriguez et al. [108] | predictive ability comparison of various class imbalance techniques | sampling, cost-sensitive, ensemble, hybrid | NASA | 2014
class imbalance | Malhotra and Khanna [109] | predictive ability comparison of various class imbalance techniques | resampling with replacement, spread subsample, SMOTE, MetaCost | six open source projects | 2017
data noise | Tantithamthavorn et al. [112] | validation of mislabelling on the performance and interpretation | bootstrap resampling, random forest, Scott-Knott test | NASA, PROMISE | 2015
data noise | Herzig et al. [113] | validation of tangled code change on the performance | heuristic-based untangling, Caret package | five open source projects | 2016
privacy-preserving data sharing | Peters et al. [116] | privacy-preserving algorithm for cross-project prediction | CLIFF, LeaF, MORPH | PROMISE | 2015
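Several of the studies in Table 8 (e.g. [106, 107]) rely on cost-sensitive learning, which penalises misclassifying a defective module more heavily than misclassifying a clean one instead of resampling the data. The following minimal sketch illustrates this idea with class weights; the 1:10 cost ratio, data and classifier are illustrative assumptions rather than settings taken from those studies.

# Hedged sketch: cost-sensitive learning via class weights. A misclassified
# defective module (class 1) is assumed to cost ten times more than a
# misclassified clean module (class 0); the ratio is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(7)
X = rng.rand(300, 4)                       # toy static-code metrics
y = (rng.rand(300) < 0.15).astype(int)     # roughly 15% defective modules

weighted = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
unweighted = LogisticRegression(max_iter=1000).fit(X, y)

# The weighted model typically flags more modules as defect-prone,
# trading some precision for higher recall of the minority class.
print((weighted.predict(X) == 1).sum(), (unweighted.predict(X) == 1).sum())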

Such effort-aware methods factor in the effort needed to inspect or test code when evaluating the effectiveness of prediction models, leading to more realistic performance evaluations. Generally, effort-aware studies provide a new interpretation and a practical, adoption-oriented view of defect prediction results. These methods often use the LOC metric as a proxy measure for inspection effort. Considering the practical significance, more and more defect prediction studies make use of effort-aware performance evaluation [21, 38, 52, 60, 68].
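As a concrete illustration of this kind of evaluation (a minimal sketch, not the exact measure used by the cited studies), the snippet below ranks modules by predicted defect probability per line of code and reports the fraction of defective modules found within an assumed inspection budget of 20% of the total LOC.

# Hedged sketch of an effort-aware evaluation: rank modules by predicted defect
# probability per LOC and measure recall within a fixed inspection budget.
# The 20% cut-off and the density ranking are illustrative assumptions.
def recall_at_effort(pred_prob, loc, is_defective, effort_fraction=0.2):
    order = sorted(range(len(loc)), key=lambda i: pred_prob[i] / loc[i], reverse=True)
    budget = effort_fraction * sum(loc)
    spent, found = 0.0, 0
    for i in order:
        if spent + loc[i] > budget:
            break
        spent += loc[i]
        found += is_defective[i]
    total_defective = sum(is_defective)
    return found / total_defective if total_defective else 0.0

# toy example: five modules with predicted probabilities, sizes and true labels
print(recall_at_effort(pred_prob=[0.9, 0.2, 0.7, 0.1, 0.6],
                       loc=[120, 800, 60, 300, 150],
                       is_defective=[1, 0, 1, 0, 1]))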
Zhou et al. [119] provided a comprehensive investigation of the confounding effect of class size on the associations between OO metrics and defect-proneness in the effort-aware prediction context. They employed statistical regression techniques to empirically study the effectiveness of defect prediction models. Results show that the performance of the learned models usually improves significantly after removing the confounding effect, in terms of both ranking and classification under effort-aware evaluation.

To perform an in-depth evaluation of the ability of slice-based cohesion metrics in effort-aware post-release defect prediction, Yang et al. [120] compared and evaluated the effect of slice-based cohesion metrics and baseline (code and process) metrics. They found that slice-based cohesion metrics are complementary to the code and process metrics. This suggests that there is practical value in applying slice-based cohesion metrics for effort-aware post-release defect prediction.

To examine the effect of the package-modularisation metrics proposed by Sarkar et al. [121] in the context of effort-aware defect prediction, Zhao et al. [122] compared and evaluated the effectiveness of these new package-modularisation metrics and traditional package-level metrics. They found that the new package-modularisation metrics are useful for developing quality software systems in the effort-aware context.

To investigate the predictive effectiveness of simple unsupervised models in the context of effort-aware JIT prediction, Yang et al. [123] performed an empirical study to compare their unsupervised models with state-of-the-art supervised models under three prediction settings: cross-validation, time-wise cross-validation and cross-project prediction. They found that many simple unsupervised models achieve higher performance than the state-of-the-art supervised models in effort-aware JIT prediction.
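A sketch of such a simple unsupervised model is shown below: it ranks changes by the reciprocal of a single change metric (churned LOC here), so that smaller changes are inspected first and no labelled training data are needed. The metric choice and toy data are illustrative assumptions rather than a reconstruction of the exact models in [123].

# Hedged sketch of a simple unsupervised, effort-aware JIT ranking:
# changes with a smaller value of one change metric are ranked higher,
# i.e. scored by the reciprocal of the metric value.
def rank_changes_unsupervised(changes, metric="churned_loc"):
    """Return changes ordered for inspection (smallest metric value first)."""
    return sorted(changes, key=lambda c: 1.0 / max(c[metric], 1), reverse=True)

changes = [
    {"id": "c1", "churned_loc": 520},
    {"id": "c2", "churned_loc": 12},
    {"id": "c3", "churned_loc": 85},
]
print([c["id"] for c in rank_changes_unsupervised(changes)])  # ['c2', 'c3', 'c1']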
To study the relationships between dependence clusters and defect-proneness in effort-aware defect prediction, Yang et al. [124] empirically evaluated the effect of dependence clusters with different statistical techniques. They found that functions inside dependence clusters, especially large dependence clusters, tend to be more defect-prone, which helps us to better understand dependence clusters and their effect on software quality in the effort-aware context.

To examine the predictive power of network measures in effort-aware defect prediction, Ma et al. [125] performed an in-depth evaluation of network measures with the logistic regression technique. They found that it is practically useful to use network measures for effort-aware defect prediction. They also suggest that researchers should carefully decide whether and when to use network measures for defect prediction in practice.

Zhao et al. [126] provided a thorough study on the actual usefulness of client usage context in package cohesion for effort-aware defect prediction. They evaluated the predictive ability of both context-based and non-context-based cohesion metrics, taken alone or together, for defect prediction, and found that there is practical value in considering client usage context in package cohesion for effort-aware defect prediction.

To examine the effect of sampling techniques on the performance of effort-aware defect prediction, Bennin et al. [127] took into account the testing effort with an effort-aware measure. They used four sampling techniques, including random under-sampling, random over-sampling, SMOTE and purely SMOTE sampling, for the experiments. Results show that the over-sampling techniques achieved higher performance than the under-sampling techniques.

Bennin et al. [10] leveraged commonly used statistical and machine learning techniques to build 11 effort-aware prediction models for the more practical cross-release validation. They found that the performance of the prediction models is significantly dependent on the dataset size and the percentage of defective instances in the dataset.

Panichella et al. [128] proposed to train defect predictors by maximising their cost-effectiveness through genetic algorithms (GAs). They employed regression trees and generalised linear models to build defect prediction models, and found that regression models trained by GAs achieve better performance than their traditional counterparts.

Summary: Table 9 summarises the details of the effort-aware defect prediction studies. It can be seen that most works are based on the evaluation and analysis of the relationship between software metrics and defect-proneness, and on the prediction models constructed. Effort-aware context-based defect prediction studies consider the effort that is required to perform quality assurance activities, such as testing or code review, on the more defect-prone modules. Based on the prediction results, software engineers can allocate their limited testing resources to the defect-prone modules with the aim of finding more defects with less effort. From a practical point of view, it is more realistic and useful to apply effort-aware defect prediction models in actual software development, which can improve production efficiency and quality, and reduce development cost and software risk.

6 Empirical studies

In recent years, there have been many empirical studies that analyse and evaluate defect prediction methods from different aspects [31, 56, 58, 59]. Related ideas and brief descriptions of these papers are discussed and summarised as follows.

Shepperd et al. [129] investigated the extent to which the research group that performs a defect prediction study associates with the reported performance of defect prediction models. By conducting a meta-analysis of 42 primary studies, they found that the reported performance of a defect prediction model shares a strong relationship with the group of researchers who construct the models. Their findings suggest that many published defect prediction works are biased.

Recently, Tantithamthavorn et al. [130] conducted an alternative investigation of Shepperd et al.'s data [129]. They found that the research group shares a strong association with other explanatory variables (the dataset and metric families), that this strong association among the explanatory variables makes it difficult to discern the impact of the research group on model performance, and that the research group has a smaller impact than the metric family after mitigating the impact of this strong association. Their findings suggest that researchers should experiment with a broader selection of datasets and metrics to combat any potential bias in the results.

Ghotra et al. [131] replicated a prior study [58] to examine whether the impact of classification techniques on the performance of defect prediction models is significant or not in two experimental settings. They found that the performance of different classification techniques differs little on the original NASA dataset, which is consistent with the results of the prior study. However, the performance differs significantly on the cleaned NASA dataset. Their results suggest that some classification techniques outperform others for defect prediction.

To investigate the effect of parameter settings on the performance of defect prediction models, Tantithamthavorn et al. [132] conducted an empirical study with an automated parameter optimisation tool, Caret. By evaluating candidate parameter settings, Caret can identify the optimised setting with the highest prediction performance. Results show that parameter settings can indeed significantly influence the performance of defect prediction models. This finding suggests that researchers should select the right parameters of the classification techniques in future defect prediction experiments.
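Caret is an R package; an analogous automated search can be sketched in Python with a grid search over candidate parameter settings evaluated by cross-validation. The parameter grid, scoring choice and data below are illustrative assumptions, not the configuration used in [132].

# Hedged sketch: automated parameter optimisation via grid search with
# cross-validation, analogous in spirit to tuning classifiers with Caret.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(3)
X = rng.rand(400, 6)                      # toy module metrics
y = (rng.rand(400) < 0.2).astype(int)     # toy defect labels

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=3), param_grid,
                      scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))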

To examine the bias and variance of model validation techniques in the field of defect prediction, Tantithamthavorn et al. [70] conducted an in-depth study of the 12 most widely used model validation techniques. They found that single-repetition holdout validation tends to yield estimates with 46–229% more bias and 53–863% more variance than the top-ranked model validation techniques, while out-of-sample bootstrap validation achieves the best balance between the bias and variance of estimates. Hence, they suggested that researchers should adopt out-of-sample bootstrap validation in future defect prediction studies.
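The idea behind out-of-sample bootstrap validation can be sketched as follows: each repetition trains on a bootstrap sample drawn with replacement and evaluates on the instances that were not drawn, and the final estimate is the mean across repetitions. This is a minimal illustration under assumed data and classifier choices, not the exact protocol of [70].

# Hedged sketch of out-of-sample bootstrap validation: train on a bootstrap
# sample, test on the instances left out of that sample, repeat and average.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def out_of_sample_bootstrap(X, y, n_repetitions=100, seed=0):
    rng = np.random.RandomState(seed)
    scores = []
    n = len(y)
    for _ in range(n_repetitions):
        train_idx = rng.randint(0, n, size=n)             # sample with replacement
        test_idx = np.setdiff1d(np.arange(n), train_idx)  # out-of-sample instances
        if len(np.unique(y[train_idx])) < 2 or len(np.unique(y[test_idx])) < 2:
            continue                                      # skip degenerate draws
        model = GaussianNB().fit(X[train_idx], y[train_idx])
        scores.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))
    return float(np.mean(scores))

rng = np.random.RandomState(1)
X = rng.rand(200, 5)
y = (X[:, 0] + 0.3 * rng.rand(200) > 0.8).astype(int)
print(round(out_of_sample_bootstrap(X, y), 3))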
To recognise classification techniques which perform well in software defect prediction, Bowes et al. [133] applied four classifiers, including random forest, naive Bayes, RPart and SVM, to predict and analyse the level of prediction uncertainty through the investigation of individual defects. They found that the predictive performance of the four classifiers is similar, but each detects different sets of defects. Based on the experimental results, they concluded that the prediction performance of classifiers with ensemble decision-making strategies outperforms majority voting.

Due to the lack of an identical performance evaluation measure with which to directly compare different defect prediction models, Bowes et al. [134] developed a Java tool, 'DConfusion', that allows researchers and practitioners to transform many performance measures back into a confusion matrix. They found that the tool produces very small errors when re-computing the confusion matrix with a variety of datasets and learners.
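The underlying arithmetic can be illustrated with a small sketch: if a study reports recall and precision together with the numbers of defective and non-defective instances, the four confusion-matrix cells can be recovered. This simplified example only illustrates the idea and is not the DConfusion tool itself.

# Hedged sketch: recover a confusion matrix from reported recall, precision
# and the class sizes. Rounding mirrors the small errors noted above.
def confusion_from_measures(recall, precision, n_defective, n_clean):
    tp = round(recall * n_defective)
    fn = n_defective - tp
    fp = round(tp * (1.0 - precision) / precision) if precision > 0 else n_clean
    tn = n_clean - fp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# e.g. a study reporting recall 0.70 and precision 0.50 on 100 defective
# and 900 clean modules
print(confusion_from_measures(0.70, 0.50, 100, 900))
# {'TP': 70, 'FN': 30, 'FP': 70, 'TN': 830}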
To examine the effectiveness of search-based techniques and their hybridised versions for defect prediction, Malhotra and Khanna [135] performed an empirical comparison of five search-based techniques, five hybridised techniques, four machine learning techniques and a statistical technique. By comparing the predictive performance of each model, they advocated that researchers should employ hybridised techniques to develop defect prediction models for recognising software defects.

Caglayan et al. [47] replicated a prior study [45] to examine the merits of organisational metrics on the performance of defect prediction models for large-scale enterprise software. They separately extracted organisational, code complexity, code churn and pre-release defect metrics to build prediction models. They found that the model with organisational metrics outperforms both the churn metric and pre-release metric models.

To investigate how different aggregation schemes affect defect prediction models, Zhang et al. [136] conducted an in-depth analysis of 11 aggregation schemes on 255 open source software projects. They found that aggregation has a large effect on both the correlation among metrics and the correlation between metrics and defects. Using only summation does not often achieve the best performance, and it tends to underestimate the performance of defect prediction models. Given their findings, they advised researchers to explore various aggregation schemes in future defect prediction studies.

To validate the feasibility of using a simplified metric set for defect prediction, He et al. [137] constructed three types of predictors in three scenarios with six typical classifiers. They found that it is viable and practical to use a simplified metric set for defect prediction, and that the models built with the simplified metric subset can provide satisfactory prediction performance.

Yang et al. [138] conducted an empirical study to examine the relationship between end-slice-based and metric-slice-based cohesion metrics, as well as to compare the predictive power of different types of slices for slice-based cohesion metrics. They found that end-slice-based and metric-slice-based cohesion metrics have little difference. In practice, they suggested selecting the end slice for computing slice-based cohesion metrics in defect prediction.

Jaafar et al. [139] conducted an empirical evaluation of the impact of design pattern and anti-pattern dependencies on changes and faults in OO systems. They found that classes with anti-pattern dependencies are more prone to defects than others, whereas this does not hold true for classes with design pattern dependencies. They also observed that classes having dependencies with anti-patterns mostly undergo structural changes, which are the most common changes.

Chen et al. [140] conducted an empirical study to examine the predictive effectiveness of network measures in high severity defect prediction. They separately used logistic regression to analyse the relationships between each network measure and defect-proneness, and then evaluated their predictive power compared with CMs. They found that network measures are of practical value in high severity defect prediction.

To investigate the relationship between process metrics and the number of defects, Madeyski and Jureczko [22] provided an in-depth evaluation of which process metrics have a large effect on improving the predictors compared with product metrics. They found that the number of distinct committers (NDC) metric can significantly improve the performance of prediction models, and they recommended using the NDC process metric in future defect prediction studies.

To investigate the performance of JIT models in the context of cross-project defect prediction, Kamei et al. [141] conducted an empirical evaluation on 11 open source projects. They found that training and test projects with similar data distributions usually obtain better performance, and that combining the data of multiple projects or several prediction models tends to produce strong performance. JIT cross-project prediction models built with carefully selected data usually tend to perform best.

Table 9 Comparison of effort-aware defect prediction studies
Study | Topic | Technique | Dataset | Year
Zhou et al. [119] | investigation of confounding effect of class size | Sobel's statistical test, linear regression | six open source Java projects | 2014
Yang et al. [120] | predictive ability analysis of slice-based cohesion metrics for post-release defects | principal component analysis, univariate/multivariate logistic regression | five open source C projects | 2015
Zhao et al. [122] | predictive ability evaluation of package-modularisation metrics | univariate/multivariate logistic regression | six open source Java projects | 2015
Yang et al. [123] | predictive ability comparison of simple unsupervised models for JIT prediction | change metrics-based unsupervised techniques | JIT | 2016
Yang et al. [124] | validation of the relationship between dependence clusters and defect-proneness at the function level | Spearman's rank correlation, Fisher's exact test, univariate/multivariate logistic regression | five open source C projects | 2016
Ma et al. [125] | a comprehensive evaluation on the predictive effectiveness of network measures | univariate/multivariate logistic regression | PROMISE | 2016
Zhao et al. [126] | validation on the value of considering client usage context in package cohesion | principal component analysis, univariate/multivariate logistic regression | ten open source Java projects | 2017
Bennin et al. [127] | predictive ability comparison of sampling techniques | sampling, statistical and machine learning | ten open source projects | 2016
Bennin et al. [10] | predictive ability comparison in cross-release prediction | statistical and machine learning | 25 open source projects | 2016
Panichella et al. [128] | fit ability comparison of search-based algorithm | GA, regression tree, generalised linear model | PROMISE | 2016

To construct a universal defect prediction model, Zhang et al. [67] presented a context-aware rank transformation method by studying six context factors. Through a study of 1385 open source projects, the universal model achieves comparable performance to the within-project models, produces similar performance on five external projects and performs similarly among projects with different context factors.

Petric et al. [142] conducted an empirical study to examine whether defect prediction can be improved by using an explicit diversity technique with a stacking ensemble. They employed the stacking ensemble technique to combine four different types of classifiers and a weighted accuracy diversity model. Based on eight publicly available projects, the results demonstrate that stacking ensembles perform better than other defect prediction models and that the essential factor is the use of diversity in the ensemble.
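The stacking idea can be sketched briefly: several diverse base classifiers are trained, and a meta-learner combines their predictions instead of a simple majority vote. The sketch below uses scikit-learn's stacking classifier with base learners of the four types listed in Table 10 for [142]; the data, meta-learner and parameters are illustrative assumptions, not a reconstruction of the exact ensemble in that study.

# Hedged sketch: a stacking ensemble that combines diverse base classifiers
# with a logistic-regression meta-learner, in the spirit of diversity-based
# ensembles for defect prediction.
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(5)
X = rng.rand(300, 5)
y = (X[:, 0] + X[:, 1] + 0.2 * rng.rand(300) > 1.1).astype(int)

stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=5)),
                ("knn", KNeighborsClassifier()),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000))
print(round(cross_val_score(stack, X, y, cv=5, scoring="roc_auc").mean(), 3))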
Liu et al. [143] performed an empirical study of a two-stage data preprocessing method for software defect prediction, which includes two stages: feature selection and instance reduction. The aim of feature selection is to remove irrelevant and redundant features, and the aim of instance reduction is to undersample non-defective instances. Experimental evaluation on the Eclipse and NASA projects shows the effectiveness of the two-stage data preprocessing method.

Xu et al. [144] conducted an empirical evaluation of the impact of 32 feature selection techniques on defect prediction. They found that feature selection techniques are useful and that the performance of predictors trained using these techniques exhibits significant differences over all the projects.

Summary: Table 10 shows a comparison of the empirical studies on defect prediction. We can observe that these papers mainly focus on different machine learning and statistical methods, classifiers or classifier ensemble techniques, parameter optimisation techniques, model validation techniques, performance evaluation measures, various software metrics and feature selection techniques. The above empirical studies provide a variety of profound, original, valuable and practical insights. These findings can help software researchers and practitioners to better understand the results reported by the papers and give some practical guidelines for future defect prediction studies.

7 Future directions and challenges

Although numerous efforts have been made and many outstanding studies have been proposed in software defect prediction, there are still many potential future challenges, and some new open problems require further investigation and exploration. Here, we address some possible future research challenges on the defect prediction problem.

Table 10 Empirical studies on defect prediction
Study | Topic | Technique | Year
Shepperd et al. [129] | identification of the effect of model building factors on predictive performance | statistical learning | 2014
Tantithamthavorn et al. [130] | identification of the effect of model building factors on predictive performance | statistical learning | 2016
Ghotra et al. [131] | predictive ability comparison of classification techniques | 31 statistical and machine learning algorithms | 2015
Tantithamthavorn et al. [132] | predictive ability analysis of automated parameter optimisation technique | 30 statistical and machine learning algorithms | 2016
Tantithamthavorn et al. [70] | predictive ability comparison of model validation techniques | naive Bayes, logistic regression, random forest | 2017
Bowes et al. [134] | applicability of a technique to allow cross-study performance evaluation | rpart, random forest, naive Bayes | 2014
Bowes et al. [133] | comparison of the performance of classifiers on predicting defects | random forest, naive Bayes, rpart, support vector machine | 2017
Malhotra and Khanna [135] | validation and comparison of search-based techniques and their hybridised versions | five search-based techniques, five hybridised techniques | 2017
Caglayan et al. [47] | validation of the merits of organisational metrics | correlation analysis, principal component analysis | 2015
Zhang et al. [136] | evaluation and analysis of the performance of different aggregation schemes | Spearman correlation, gain ratio, random forest, linear regression | 2017
He et al. [137] | feasibility of a simplified metric set in different prediction scenarios | decision tree/table, naive Bayes, logistic regression, support vector machine, Bayesian network | 2015
Yang et al. [138] | applicability of slice types on slice-based cohesion metrics | correlation analysis, principal component analysis | 2016
Jaafar et al. [139] | evaluation of the impact of design pattern and anti-pattern dependencies on changes and faults | Fisher's exact test, odds ratio | 2016
Chen et al. [140] | utility of network measures in high severity prediction | univariate/multivariate logistic regression | 2016
Madeyski and Jureczko [22] | predictive ability comparison of process metrics | stepwise linear regression | 2015
Kamei et al. [141] | validation of JIT prediction in a cross-project context | random forest | 2016
Zhang et al. [67] | a method to build and evaluate a universal defect prediction model | clustering, naive Bayes, logistic regression | 2016
Petric et al. [142] | methods to build and evaluate diversity ensemble models | naive Bayes, decision tree, k nearest neighbour, support vector machine, stacking ensemble | 2016
Liu et al. [143] | validation of data preprocessing techniques | information gain, chi-square, ReliefF, symmetric uncertainty, cosine similarity, random sampling | 2016
Xu et al. [144] | predictive ability comparison of feature selection techniques | filter, wrapper, clustering, extraction-based algorithms | 2016

(i) Generalisation: Most defect prediction studies are verified on open source software projects. The main reason is that these projects easily collect and archive practical development history and make their data publicly available. However, current defect prediction methods might not be generalisable to other closed software projects such as proprietary and commercial systems. That is, the lack of availability of such proprietary and commercial data leaves open the question of whether defect prediction models built using data from open source projects would apply to these projects. Such defect prediction methods are not investigated in depth. Thus, researchers need to build more partnerships with business partners and gain access to their data repositories.

(ii) Overcoming imbalance: Software defect prediction always suffers from class imbalance, namely the number of defective instances is often much smaller than the number of non-defective instances. This imbalanced data distribution hampers the effectiveness of many defect prediction models. Although some defect prediction studies [65, 69, 103, 104, 109] have been presented to deal with the class imbalance problem, it is still necessary and important to overcome this problem in future work and further improve the performance of defect prediction.

(iii) Focusing on effort: It is very necessary and important to evaluate prediction models in a realistic setting, e.g. how much testing and code inspection effort can be reduced with defect prediction models. Recent studies have attempted to address this problem when considering the effort [117, 118, 120, 123, 127]. These defect prediction studies usually utilise lines of code (LOC) as a proxy for effort. In future effort-aware defect prediction models, researchers should examine what is the best way to measure effort and provide better practical guidelines. Such research can have a larger effect on the future applicability of defect prediction techniques in practice.

(iv) Heterogeneous defect prediction: Recently, several heterogeneous defect prediction models have been developed [66, 98, 100], which use heterogeneous metric data from other projects. This provides a new perspective on defect prediction. For new projects or projects lacking sufficient historical data, it is very meaningful to study and use these methods for predicting defects. Because different data distributions exist between the source and target projects, it is difficult to build good defect prediction models that achieve satisfactory performance. Furthermore, studies on the feasibility of heterogeneous cross-project prediction are not yet mature in practice. At this point, the practical use of heterogeneous cross-project defect prediction remains an open research area.

(v) Privacy preservation issue: Due to business sensitivity and privacy concerns, most companies are not willing to share their data [64, 114–116]. In this scenario, it is often difficult to extract data from industrial companies. Therefore, current defect prediction models may not work for proprietary and industrial projects. If more such datasets were available, building cross-project defect prediction models would be more sound. In the future, it is very important and necessary to extensively study the privacy issue in the context of cross-project defect prediction.

(vi) Fair evaluation: Up to now, numerous methods have been proposed to solve the defect prediction problem. However, the corresponding evaluations are inconsistent and inadequate, which may lead to inappropriate conclusions about the performance of defect prediction methods. Hence, a fairer evaluation of defect prediction methods is very necessary.

(vii) Reproducibility: Some defect prediction papers [51, 52, 78, 83, 123] provide replication packages such as datasets, implementation scripts and additional settings files. The main goal of replication in defect prediction is not only to replicate and validate the original experiments but also to allow other researchers to compare the prediction performance of new models with the original studies [14, 145]. Even with access to the replication packages of previous studies, one may not obtain exactly the same results. The selection of evaluation procedures, configuration parameters and performance metrics has a large impact on reproducibility. Nevertheless, we recommend that researchers provide their complete replication packages in future defect prediction studies.

8 Conclusion

Software quality assurance has become one of the most significant and expensive phases during the development of high assurance software systems. As software systems play an increasingly key role in our daily lives, their complexity continues to increase, and this increased complexity makes their quality assurance very difficult. Defect prediction models can identify the defect-prone modules so that quality assurance teams can effectively allocate limited resources for testing and code inspection by putting more effort on the defect-prone modules.

In recent years, new defect prediction techniques, problems and applications have been emerging quickly. This paper attempts to systematically summarise all typical works on software defect prediction in recent years. Based on the results obtained in this work, this paper will help researchers and software practitioners to better understand the previous defect prediction studies from the perspectives of datasets, software metrics, evaluation measures and modelling techniques in an easy and effective way.

9 Acknowledgments

The authors thank the editors and anonymous reviewers for their constructive comments and suggestions. The authors also thank Professor Jifeng Xuan and Professor Xiaoyuan Xie from the School of Computer at Wuhan University for their insightful advice. This work was supported by the NSFC-Key Project of General Technology Fundamental Research United Fund under grant no. U1736211, the National Key Research and Development Program of China under grant no. 2017YFB0202001, the National Nature Science Foundation of China under grant nos. 61672208, 41571417 and U1504611, the Science and Technique Development Program of Henan under grant no. 172102210186, and the Research Foundation of Henan University under grant no. 2015YBZR024.

10 References

[1] Hall, T., Beecham, S., Bowes, D., et al.: 'A systematic literature review on fault prediction performance in software engineering', IEEE Trans. Softw. Eng., 2012, 38, (6), pp. 1276–1304
[2] Menzies, T., Milton, Z., Turhan, B., et al.: 'Defect prediction from static code features: current results, limitations, new approaches', Autom. Softw. Eng., 2010, 17, (4), pp. 375–407
[3] Catal, C., Diri, B.: 'A systematic review of software fault prediction studies', Expert Syst. Appl., 2009, 36, (4), pp. 7346–7354
[4] Catal, C.: 'Software fault prediction: a literature review and current trends', Expert Syst. Appl., 2011, 38, (4), pp. 4626–4636
[5] Malhotra, R.: 'A systematic review of machine learning techniques for software fault prediction', Appl. Soft Comput., 2015, 27, pp. 504–518
[6] Naik, K., Tripathy, P.: 'Software testing and quality assurance: theory and practice' (John Wiley & Sons, Hoboken, NJ, 2011)
[7] Menzies, T., Greenwald, J., Frank, A.: 'Data mining static code attributes to learn defect predictors', IEEE Trans. Softw. Eng., 2007, 33, (1), pp. 2–13
[8] Song, Q., Jia, Z., Shepperd, M., et al.: 'A general software defect proneness prediction framework', IEEE Trans. Softw. Eng., 2011, 37, (3), pp. 356–370
[9] Herzig, K.: 'Using pre-release test failures to build early post-release defect prediction models'. Proc. IEEE 25th Int. Symp. Software Reliability Engineering, 2014, pp. 300–311
[10] Bennin, K.E., Toda, K., Kamei, Y., et al.: 'Empirical evaluation of cross-release effort-aware defect prediction models'. Proc. IEEE Int. Conf. Software Quality, Reliability and Security, 2016, pp. 214–221
[11] Zimmermann, T., Nagappan, N., Gall, H., et al.: 'Cross-project defect prediction: a large scale experiment on data vs. domain vs. process'. Proc. 7th Joint Meeting of the European Software Engineering Conf. and ACM SIGSOFT Int. Symp. Foundations of Software Engineering, 2009, pp. 91–100
[12] Turhan, B., Menzies, T., Bener, A.B., et al.: 'On the relative value of cross-company and within-company data for defect prediction', Empir. Softw. Eng., 2009, 14, (5), pp. 540–578
[13] He, Z., Shu, F., Yang, Y., et al.: 'An investigation on the feasibility of cross-project defect prediction', Autom. Softw. Eng., 2012, 19, (2), pp. 167–199
[14] Kamei, Y., Shihab, E.: 'Defect prediction: accomplishments and future challenges'. Proc. IEEE 23rd Int. Conf. Software Analysis, Evolution, and Reengineering, 2016, pp. 33–45
[15] Kim, S., Whitehead, E.J., Zhang, Y.: 'Classifying software changes: clean or buggy?', IEEE Trans. Softw. Eng., 2008, 34, (2), pp. 181–196
[16] Nam, J., Pan, S.J., Kim, S.: 'Transfer defect learning'. Proc. 35th Int. Conf. Software Engineering, 2013, pp. 382–391
[17] Turhan, B., Mısırlı, A.T., Bener, A.: 'Empirical evaluation of the effects of mixed project data on learning defect predictors', Inf. Softw. Technol., 2013, 55, (6), pp. 1101–1118
[18] Zhang, Y., Lo, D., Xia, X., et al.: ‘An empirical study of classifier [52] Kamei, Y., Shihab, E., Adams, B., et al.: ‘A large-scale empirical study of
combination for cross-project defect prediction’. Proc. IEEE 39th Annual just-in-time quality assurance’, IEEE Trans. Softw. Eng., 2013, 39, (6), pp.
Computer Software and Applications Conf., 2015, pp. 264–269 757–773
[19] Krishna, R., Menzies, T., Fu, W.: ‘Too much automation? The bellwether [53] Zimmermann, T., Premraj, R., Zeller, A.: ‘Predicting defects for eclipse’.
effect and its implications for transfer learning’. Proc. 31st Int. Conf. Proc. Third Int. Workshop on Predictor Models in Software Engineering,
Automated Software Engineering, 2016, pp. 122–131 2007, pp. 9–15
[20] Andreou, A.S., Chatzis, S.P.: ‘Software defect prediction using doubly [54] Kim, S., Zhang, H., Wu, R., et al.: ‘Dealing with noise in defect prediction’.
stochastic Poisson processes driven by stochastic belief networks’, J. Syst. Proc. 33rd Int. Conf. Software Engineering, 2011, pp. 481–490
Softw., 2016, 122, pp. 72–82 [55] Altinger, H., Siegl, S., Dajsuren, Y., et al.: ‘A novel industry grade dataset for
[21] Rahman, F., Devanbu, P.: ‘How, and why, process metrics are better’. Proc. fault prediction based on model-driven developed automotive embedded
2013 Int. Conf. Software Engineering, 2013, pp. 432–441 software’. Proc. 12th Working Conf. Mining Software Repositories, 2015, pp.
[22] Madeyski, L., Jureczko, M.: ‘Which process metrics can significantly 494–497
improve defect prediction models? An empirical study’, Softw. Qual. J., 2015, [56] Shepperd, M., Song, Q., Sun, Z., et al.: ‘Data quality: some comments on the
23, (3), pp. 393–422 NASA software defect datasets’, IEEE Trans. Softw. Eng., 2013, 39, (9), pp.
[23] Radjenović, D., Heričko, M., Torkar, R., et al.: ‘Software fault prediction 1208–1215
metrics: a systematic literature review’, Inf. Softw. Technol., 2013, 55, (8), pp. [57] Jureczko, M., Madeyski, L.: ‘Towards identifying software project clusters
1397–1418 with regard to defect prediction’. Proc. 6th Int. Conf. Predictive Models in
[24] Halstead, M.H.: ‘Elements of software science’. vol. 7 (Elsevier, New York, Software Engineering, 2010, pp. 1–10
1977) [58] Lessmann, S., Baesens, B., Mues, C., et al.: ‘Benchmarking classification
[25] McCabe, T.J.: ‘A complexity measure’, IEEE Trans. Softw. Eng., 1976, 2, (4), models for software defect prediction: a proposed framework and novel
pp. 308–320 findings’, IEEE Trans. Softw. Eng., 2008, 34, (4), pp. 485–496
[26] Chidamber, S.R., Kemerer, C.F.: ‘A metrics suite for object oriented design’, [59] Jiang, Y., Cukic, B., Ma, Y.: ‘Techniques for evaluating fault prediction
IEEE Trans. Softw. Eng., 1994, 20, (6), pp. 476–493 models’, Empir. Softw. Eng., 2008, 13, (5), pp. 561–595
[27] Abreu, F.B., Carapuça, R.: ‘Candidate metrics for object-oriented software [60] Mende, T., Koschke, R.: ‘Revisiting the evaluation of defect prediction
within a taxonomy framework’, J. Syst. Softw., 1994, 26, (1), pp. 87–96 models’. Proc. 5th Int. Conf. Predictor Models in Software Engineering, 2009,
[28] Nagappan, N., Ball, T.: ‘Use of relative code churn measures to predict pp. 1–10
system defect density’. Proc. 27th Int. Conf. Software Engineering, 2005, pp. [61] Arisholm, E., Briand, L.C., Johannessen, E.B.: ‘A systematic and
284–292 comprehensive investigation of methods to build and evaluate fault prediction
[29] Moser, R., Pedrycz, W., Succi, G.: ‘A comparative analysis of the efficiency models’, J. Syst. Softw., 2010, 83, (1), pp. 2–17
of change metrics and static code attributes for defect prediction’. Proc. 30th [62] Xiao, X., Lo, D., Xin, X., et al.: ‘Evaluating defect prediction approaches
Int. Conf. Software Engineering, 2008, pp. 181–190 using a massive set of metrics: an empirical study’. Proc. 30th Annual ACM
[30] Hassan, A.E.: ‘Predicting faults using the complexity of code changes’. Proc. Symp. Applied Computing, 2015, pp. 1644–1647
31st Int. Conf. Software Engineering, 2009, pp. 78–88 [63] Menzies, T., Dekhtyar, A., Distefano, J., et al.: ‘Problems with precision: a
[31] D'Ambros, M., Lanza, M., Robbes, R.: ‘Evaluating defect prediction response to ‘comments on ‘data mining static code attributes to learn defect
approaches: a benchmark and an extensive comparison’, Empir. Softw. Eng., predictors’’’, IEEE Trans. Softw. Eng., 2007, 33, (9), pp. 635–636
2012, 17, (4–5), pp. 531–577 [64] Peters, F., Menzies, T., Gong, L., et al.: ‘Balancing privacy and utility in
[32] Weyuker, E.J., Ostrand, T.J., Bell, R.M.: ‘Do too many cooks spoil the broth? cross-company defect prediction’, IEEE Trans. Softw. Eng., 2013, 39, (8), pp.
Using the number of developers to enhance defect prediction models’, Empir. 1054–1068
Softw. Eng., 2008, 13, (5), pp. 539–559 [65] Wang, S., Yao, X.: ‘Using class imbalance learning for software defect
[33] Pinzger, M., Nagappan, N., Murphy, B.: ‘Can developer-module networks prediction’, IEEE Trans. Reliab., 2013, 62, (2), pp. 434–443
predict failures?’ Proc. 16th ACM SIGSOFT Int. Symp. Foundations of [66] Jing, X.Y., Wu, F., Dong, X., et al.: ‘Heterogeneous cross-company defect
Software Engineering, 2008, pp. 2–12 prediction by unified metric representation and CCA-based transfer learning’.
[34] Meneely, A., Williams, L., Snipes, W., et al.: ‘Predicting failures with Proc. 10th Joint Meeting on Foundations of Software Engineering, 2015, pp.
developer networks and social network analysis’. Proc. 16th ACM SIGSOFT 496–507
Int. Symp. Foundations of Software Engineering, 2008, pp. 13–23 [67] Zhang, F., Mockus, A., Keivanloo, I., et al.: ‘Towards building a universal
[35] Bird, C., Nagappan, N., Murphy, B., et al.: ‘Don't touch my code!: examining defect prediction model with rank transformed predictors’, Empir. Softw. Eng.,
the effects of ownership on software quality’. Proc. 19th ACM SIGSOFT 2016, 21, (5), pp. 1–39
Symp. and the 13th European Conf. Foundations of Software Engineering, [68] Rahman, F., Posnett, D., Devanbu, P.: ‘Recalling the imprecision of cross-
2011, pp. 4–14 project defect prediction’. Proc. ACM SIGSOFT 20th Int. Symp. the
[36] Rahman, F.: ‘Ownership, experience and defects: a fine-grained study of Foundations of Software Engineering, 2012, pp. 1–11
authorship’. Proc. 33rd Int. Conf. Software Engineering, 2011, pp. 491–500 [69] Ryu, D., Choi, O., Baik, J.: ‘Value-cognitive boosting with a support vector
[37] Posnett, D., Devanbu, P., Filkov, V.: ‘Dual ecological measures of focus in machine for cross-project defect prediction’, Empir. Softw. Eng., 2016, 21, (1),
software development’. Proc. 35th Int. Conf. Software Engineering, 2013, pp. pp. 43–71
452–461 [70] Tantithamthavorn, C., Mcintosh, S., Hassan, A., et al.: ‘An empirical
[38] Jiang, T., Tan, L., Kim, S.: ‘Personalized defect prediction’. Proc. IEEE/ACM comparison of model validation techniques for defect prediction models’,
28th Int. Conf. Automated Software Engineering, 2013, pp. 279–289 IEEE Trans. Softw. Eng., 2017, 43, (1), pp. 1–18
[39] Lee, T., Nam, J., Han, D., et al.: ‘Developer micro interaction metrics for [71] Jing, X.Y., Ying, S., Zhang, Z.W., et al.: ‘Dictionary learning based software
software defect prediction’, IEEE Trans. Softw. Eng., 2016, 42, (11), pp. defect prediction’. Proc. 36th Int. Conf. Software Engineering, 2014, pp. 414–
1015–1035 423
[40] Zimmermann, T., Nagappan, N.: ‘Predicting defects using network analysis [72] Jing, X.Y., Zhang, Z.W., Ying, S., et al.: ‘Software defect prediction based on
on dependency graphs’. Proc. 30th Int. Conf. Software Engineering, 2008, pp. collaborative representation classification’. Companion Proc. 36th Int. Conf.
531–540 Software Engineering, 2014, pp. 632–633
[41] Bird, C., Nagappan, N., Gall, H., et al.: ‘Using socio-technical networks to [73] Wang, T., Zhang, Z., Jing, X.Y., et al.: ‘Multiple kernel ensemble learning for
predict failures’, Proc. 20th IEEE Int. Symp. Software Reliability software defect prediction’, Autom. Softw. Eng., 2016, 23, (4), pp. 569–590
Engineering, 2009 [74] Wang, S., Liu, T., Tan, L.: ‘Automatically learning semantic features for
[42] D'Ambros, M., Lanza, M., Robbes, R.: ‘On the relationship between change defect prediction’. Proc. 38th Int. Conf. Software Engineering, 2016, pp. 297–
coupling and software defects’. Proc. 16th Working Conf. Reverse 308
Engineering, 2009, pp. 135–144 [75] Yang, X., Lo, D., Xia, X., et al.: ‘Deep learning for just-in-time defect
[43] Hu, W., Wong, K.: ‘Using citation influence to predict software defects’. Proc. prediction’. Proc. IEEE Int. Conf. Software Quality, Reliability and Security,
10th Working Conf. Mining Software Repositories, 2013, pp. 419–428 2015, pp. 17–26
[44] Herzig, K., Just, S., Rau, A., et al.: ‘Predicting defects using change [76] Chen, L., Fang, B., Shang, Z., et al.: ‘Negative samples reduction in cross-
genealogies’. Proc. IEEE 24th Int. Symp. Software Reliability Engineering, company software defects prediction’, Inf. Softw. Technol., 2015, 62, pp. 67–
2013, pp. 118–127 77
[45] Nagappan, N., Murphy, B., Basili, V.: ‘The influence of organizational [77] Xia, X., Lo, D., Pan, S.J., et al.: ‘Hydra: massively compositional model for
structure on software quality’. ACM/IEEE 30th Int. Conf. Software cross-project defect prediction’, IEEE Trans. Softw. Eng., 2016, 42, (10), pp.
Engineering, 2008, pp. 521–530 977–998
[46] Mockus, A.: ‘Organizational volatility and its effects on software defects’. [78] Canfora, G., Lucia, A.D., Penta, M.D., et al.: ‘Defect prediction as a
Proc. Eighteenth ACM SIGSOFT Int. Symp. Foundations of Software multiobjective optimization problem’, Softw. Test. Verif. Reliab., 2015, 25, (4),
Engineering, 2010, pp. 117–126 pp. 426–459
[47] Caglayan, B., Turhan, B., Bener, A., et al.: ‘Merits of organizational metrics [79] Ryu, D., Jang, J.I., Baik, J.: ‘A transfer cost-sensitive boosting approach for
in defect prediction: an industrial replication’. Proc. IEEE/ACM 37th IEEE cross-project defect prediction’, Softw. Qual. J., 2017, 25, (1), pp. 235–272
Int. Conf. Software Engineering, 2015, pp. 89–98 [80] Yang, X., Lo, D., Xia, X., et al.: ‘Tlel: a two-layer ensemble learning
[48] Bacchelli, A., D'Ambros, M., Lanza, M.: ‘Are popular classes more defect approach for just-in-time defect prediction’, Inf. Softw. Technol., 2017, 87, pp.
prone?’ Proc. 13th Int. Conf. Fundamental Approaches to Software 206–220
Engineering, 2010, pp. 59–73 [81] Wang, T., Zhang, Z., Jing, X.Y., et al.: ‘Non-negative sparse-based semiboost
[49] Taba, S.E.S., Khomh, F., Zou, Y., et al.: ‘Predicting bugs using antipatterns’. for software defect prediction’, Softw. Test. Verif. Reliab., 2016, 26, (7), pp.
Proc. 29th Int. Conf. Software Maintenance, 2013, pp. 270–279 498–515
[50] Zhang, H.: ‘An investigation of the relationships between lines of code and [82] Zhang, Z., Jing, X.Y., Wang, T.: ‘Label propagation based semi-supervised
defects’. Proc. IEEE Int. Conf. Software Maintenance, 2009, pp. 274–283 learning for software defect prediction’, Autom. Softw. Eng., 2017, 24, (1), pp.
[51] Wu, R., Zhang, H., Kim, S., et al.: ‘Relink: recovering links between bugs and 47–69
changes’. Proc. 19th ACM SIGSOFT Symp. the Foundations of Software [83] Nam, J., Kim, S.: ‘Clami: defect prediction on unlabeled datasets’. Proc. 30th
Engineering and 13th European Software Engineering Conf., 2011, pp. 15–25 IEEE/ACM Int. Conf. Automated Software Engineering, 2015, pp. 1–12

[84] Zhang, F., Zheng, Q., Zou, Y., et al.: ‘Cross-project defect prediction using a [117] Mende, T., Koschke, R.: ‘Effort-aware defect prediction models’. Proc. 14th
connectivity-based unsupervised classifier’. Proc. 38th Int. Conf. Software European Conf. Software Maintenance and Reengineering, 2010, pp. 107–116
Engineering, 2016, pp. 309–320 [118] Kamei, Y., Matsumoto, S., Monden, A., et al.: ‘Revisiting common bug
[85] Okutan, A., Yıldız, O.T.: ‘Software defect prediction using Bayesian prediction findings using effort-aware models’. Proc. IEEE Int. Conf.
networks’, Empir. Softw. Eng., 2014, 19, (1), pp. 154–181 Software Maintenance, 2010, pp. 1–10
[86] Bowes, D., Hall, T., Harman, M., et al.: ‘Mutation-aware fault prediction’. [119] Zhou, Y., Xu, B., Leung, H., et al.: ‘An in-depth study of the potentially
Proc. 25th Int. Symp. Software Testing and Analysis, 2016, pp. 330–341 confounding effect of class size in fault prediction’, ACM Trans. Softw. Eng.
[87] Chen, T.H., Shang, W., Nagappan, M., et al.: ‘Topic-based software defect Methodol., 2014, 23, (1), p. 10
explanation’, J. Syst. Softw., 2017, 129, pp. 79–106 [120] Yang, Y., Zhou, Y., Lu, H., et al.: ‘Are slice-based cohesion metrics actually
[88] Shivaji, S., Whitehead, E.J., Akella, R., et al.: ‘Reducing features to improve useful in effort-aware post-release fault-proneness prediction? An empirical
code change-based bug prediction’, IEEE Trans. Softw. Eng., 2013, 39, (4), study’, IEEE Trans. Softw. Eng., 2015, 41, (4), pp. 331–357
pp. 552–569 [121] Sarkar, S., Kak, A.C., Rama, G.M.: ‘Metrics for measuring the quality of
[89] Gao, K., Khoshgoftaar, T.M., Wang, H., et al.: ‘Choosing software metrics for modularization of large-scale object-oriented software’, IEEE Trans. Softw.
defect prediction: an investigation on feature selection techniques’, Softw. Eng., 2008, 34, (5), pp. 700–720
Pract. Experience, 2011, 41, (5), pp. 579–606 [122] Zhao, Y., Yang, Y., Lu, H., et al.: ‘An empirical analysis of package-
[90] Laradji, I.H., Alshayeb, M., Ghouti, L.: ‘Software defect prediction using modularization metrics: implications for software fault-proneness’, Inf. Softw.
ensemble learning on selected features’, Inf. Softw. Technol., 2015, 58, pp. Technol., 2015, 57, (1), pp. 186–203
388–402 [123] Yang, Y., Zhou, Y., Liu, J., et al.: ‘Effort-aware just-in-time defect prediction:
[91] Liu, S., Chen, X., Liu, W., et al.: ‘Fecar: A feature selection framework for simple unsupervised models could be better than supervised models’. Proc.
software defect prediction’. Proc. IEEE 38th Annual Computer Software and 24th ACM SIGSOFT Int. Symp. Foundations of Software Engineering, 2016,
Applications Conf., 2014, pp. 426–435 pp. 157–168
[92] Liu, W., Liu, S., Gu, Q., et al.: ‘Fecs: a cluster based feature selection method [124] Yang, Y., Harman, M., Krinke, J., et al.: ‘An empirical study on dependence
for software fault prediction with noises’. Proc. IEEE 39th Annual Computer clusters for effort-aware fault-proneness prediction’. Proc. 31st IEEE/ACM
Software and Applications Conf., 2015, pp. 276–281 Int. Conf. Automated Software Engineering, 2016, pp. 296–307
[93] Xu, Z., Xuan, J., Liu, J., et al.: ‘Michac: defect prediction via feature selection [125] Ma, W., Chen, L., Yang, Y., et al.: ‘Empirical analysis of network measures
based on maximal information coefficient with hierarchical agglomerative for effort-aware fault-proneness prediction’, Inf. Softw. Technol., 2016, 69, pp.
clustering’. Proc. IEEE 23rd Int. Conf. Software Analysis, Evolution, and 50–70
Reengineering, 2016, pp. 370–381 [126] Zhao, Y., Yang, Y., Lu, H., et al.: ‘Understanding the value of considering
[94] Menzies, T., Butcher, A., Cok, D., et al.: ‘Local versus global lessons for client usage context in package cohesion for fault-proneness prediction’,
defect prediction and effort estimation', IEEE Trans. Softw. Eng., 2013, 39, (6), pp. 822–834
[95] Bettenburg, N., Nagappan, M., Hassan, A.E.: 'Towards improving statistical modeling of software engineering data: think locally, act globally!', Empir. Softw. Eng., 2015, 20, (2), pp. 294–335
[96] Herbold, S., Trautsch, A., Grabowski, J.: 'Global vs. local models for cross-project defect prediction', Empir. Softw. Eng., 2017, 22, (4), pp. 1866–1902
[97] Mezouar, M.E., Zhang, F., Zou, Y.: 'Local versus global models for effort-aware defect prediction'. Proc. 26th Annual Int. Conf. Computer Science and Software Engineering, 2016, pp. 178–187
[98] Nam, J., Kim, S.: 'Heterogeneous defect prediction'. Proc. 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 508–519
[99] He, P., Li, B., Ma, Y.: 'Towards cross-project defect prediction with imbalanced feature sets', CoRR, 2014, abs/1411.4228. Available at http://arxiv.org/abs/1411.4228
[100] Cheng, M., Wu, G., Jiang, M., et al.: 'Heterogeneous defect prediction via exploiting correlation subspace'. Proc. 28th Int. Conf. Software Engineering and Knowledge Engineering, 2016, pp. 171–176
[101] Zhang, H., Zhang, X.: 'Comments on "data mining static code attributes to learn defect predictors"', IEEE Trans. Softw. Eng., 2007, 33, (9), pp. 635–637
[102] He, H., Garcia, E.A.: 'Learning from imbalanced data', IEEE Trans. Knowl. Data Eng., 2009, 21, (9), pp. 1263–1284
[103] Jing, X.Y., Wu, F., Dong, X., et al.: 'An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems', IEEE Trans. Softw. Eng., 2017, 43, (4), pp. 321–339
[104] Tan, M., Tan, L., Dara, S., et al.: 'Online defect prediction for imbalanced data'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 99–108
[105] Chen, L., Fang, B., Shang, Z., et al.: 'Tackling class overlap and imbalance problems in software defect prediction', Softw. Qual. J., 2018, 26, (1), pp. 97–125
[106] Wu, F., Jing, X.Y., Dong, X., et al.: 'Cost-sensitive local collaborative representation for software defect prediction'. Proc. Int. Conf. Software Analysis, Testing and Evolution, 2016, pp. 102–107
[107] Liu, M., Miao, L., Zhang, D.: 'Two-stage cost-sensitive learning for software defect prediction', IEEE Trans. Reliab., 2014, 63, (2), pp. 676–686
[108] Rodriguez, D., Herraiz, I., Harrison, R., et al.: 'Preliminary comparison of techniques for dealing with imbalance in software defect prediction'. Proc. 18th Int. Conf. Evaluation and Assessment in Software Engineering, 2014, pp. 1–10
[109] Malhotra, R., Khanna, M.: 'An empirical study for software change prediction using imbalanced data', Empir. Softw. Eng., 2017, 22, (6), pp. 2806–2851
[110] Herzig, K., Just, S., Zeller, A.: 'It's not a bug, it's a feature: how misclassification impacts bug prediction'. Proc. 2013 Int. Conf. Software Engineering, 2013, pp. 392–401
[111] Rahman, F., Posnett, D., Herraiz, I., et al.: 'Sample size vs. bias in defect prediction'. Proc. 2013 9th Joint Meeting on Foundations of Software Engineering, 2013, pp. 147–157
[112] Tantithamthavorn, C., McIntosh, S., Hassan, A.E., et al.: 'The impact of mislabelling on the performance and interpretation of defect prediction models'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 812–823
[113] Herzig, K., Just, S., Zeller, A.: 'The impact of tangled code changes on defect prediction models', Empir. Softw. Eng., 2016, 21, (2), pp. 303–336
[114] Peters, F., Menzies, T.: 'Privacy and utility for defect prediction: experiments with morph'. Proc. 34th Int. Conf. Software Engineering, 2012, pp. 189–199
[115] Qi, F., Jing, X.Y., Zhu, X., et al.: 'Privacy preserving via interval covering based subclass division and manifold learning based bi-directional obfuscation for effort estimation'. Proc. 31st IEEE/ACM Int. Conf. Automated Software Engineering, 2016, pp. 75–86
[116] Peters, F., Menzies, T., Layman, L.: 'Lace2: better privacy-preserving data sharing for cross project defect prediction'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 801–811
Autom. Softw. Eng., 2017, 24, (2), pp. 393–453
[127] Bennin, K.E., Keung, J., Monden, A., et al.: 'Investigating the effects of balanced training and testing datasets on effort-aware fault prediction models'. Proc. IEEE 40th Annual Computer Software and Applications Conf., 2016, pp. 154–163
[128] Panichella, A., Alexandru, C.V., Panichella, S., et al.: 'A search-based training algorithm for cost-aware defect prediction'. Proc. Genetic and Evolutionary Computation Conf., 2016, pp. 1077–1084
[129] Shepperd, M., Bowes, D., Hall, T.: 'Researcher bias: the use of machine learning in software defect prediction', IEEE Trans. Softw. Eng., 2014, 40, (6), pp. 603–616
[130] Tantithamthavorn, C., McIntosh, S., Hassan, A.E., et al.: 'Comments on "researcher bias: the use of machine learning in software defect prediction"', IEEE Trans. Softw. Eng., 2016, 42, (11), pp. 1092–1094
[131] Ghotra, B., McIntosh, S., Hassan, A.E.: 'Revisiting the impact of classification techniques on the performance of defect prediction models'. Proc. 37th Int. Conf. Software Engineering, 2015, pp. 789–800
[132] Tantithamthavorn, C., McIntosh, S., Hassan, A.E., et al.: 'Automated parameter optimization of classification techniques for defect prediction models'. Proc. 38th Int. Conf. Software Engineering, 2016, pp. 321–332
[133] Bowes, D., Hall, T., Petrić, J.: 'Software defect prediction: do different classifiers find the same defects?', Softw. Qual. J., 2017. Available at https://doi.org/10.1007/s11219-016-9353-3
[134] Bowes, D., Hall, T., Gray, D.: 'Dconfusion: a technique to allow cross study performance evaluation of fault prediction studies', Autom. Softw. Eng., 2014, 21, (2), pp. 287–313
[135] Malhotra, R., Khanna, M.: 'An exploratory study for software change prediction in object-oriented systems using hybridized techniques', Autom. Softw. Eng., 2017, 24, (3), pp. 673–717
[136] Zhang, F., Hassan, A.E., McIntosh, S., et al.: 'The use of summation to aggregate software metrics hinders the performance of defect prediction models', IEEE Trans. Softw. Eng., 2017, 43, (5), pp. 476–491
[137] He, P., Li, B., Liu, X., et al.: 'An empirical study on software defect prediction with a simplified metric set', Inf. Softw. Technol., 2015, 59, pp. 170–190
[138] Yang, Y., Zhao, Y., Liu, C., et al.: 'An empirical investigation into the effect of slice types on slice-based cohesion metrics', Inf. Softw. Technol., 2016, 75, pp. 90–104
[139] Jaafar, F., Guéhéneuc, Y.G., Hamel, S., et al.: 'Evaluating the impact of design pattern and anti-pattern dependencies on changes and faults', Empir. Softw. Eng., 2016, 21, (3), pp. 896–931
[140] Chen, L., Ma, W., Zhou, Y., et al.: 'Empirical analysis of network measures for predicting high severity software faults', Sci. China Inf. Sci., 2016, 59, (12), pp. 1–18
[141] Kamei, Y., Fukushima, T., McIntosh, S., et al.: 'Studying just-in-time defect prediction using cross-project models', Empir. Softw. Eng., 2016, 21, (5), pp. 2072–2106
[142] Petrić, J., Bowes, D., Hall, T., et al.: 'Building an ensemble for software defect prediction based on diversity selection'. Proc. 10th ACM/IEEE Int. Symp. Empirical Software Engineering and Measurement, 2016, p. 46
[143] Liu, W., Liu, S., Gu, Q., et al.: 'Empirical studies of a two-stage data preprocessing approach for software fault prediction', IEEE Trans. Reliab., 2016, 65, (1), pp. 38–53
[144] Xu, Z., Liu, J., Yang, Z., et al.: 'The impact of feature selection on defect prediction performance: an empirical comparison'. Proc. IEEE 27th Int. Symp. Software Reliability Engineering, 2016, pp. 309–320
[145] Mende, T.: 'Replication of defect prediction studies: problems, pitfalls and recommendations'. Proc. 6th Int. Conf. Predictive Models in Software Engineering, 2010, pp. 1–10
