Automated parameter optimization should be included in future  defect prediction studies

Automated Parameter Optimization for Defect Prediction Models
Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto
http://chakkrit.com kla@chakkrit.com @klainfo

Defect models are used to predict software modules that are likely to be defective in the future
[Figure: a timeline with a pre-release period, the release date, and a post-release period. Defect prediction models take Modules A, B, C, and D and label each module as either clean or defect-prone.]

Defect models are trained using classification techniques
Decision Tree Algorithms
Regression Algorithms
Clustering Algorithms
Ensemble Algorithms

Such classification techniques often require parameter settings
Example (Ensemble Algorithms): the number of trees in a random forest classifier
26 of the 30 most commonly used classification techniques require at least one parameter setting

Defect models may underperform if they are trained using suboptimal parameter settings
The default settings of random forest, naïve Bayes, and support vector machines are suboptimal
[Jiang et al., DEFECTS’08]
[Tosun et al., ESEM’09]
[Hall et al., TSE’12]

Different toolkits have different default settings for the same classification technique
[Figure: default setting of the number of trees in a random forest, on a scale of 10, 50, 100, and 500, comparing the randomForest package and the bigrf package]

The parameter space is too large for manual inspection
There are at least 17,000 possible settings to explore when training k-NN classifiers
[Kocaguneli et al., TSE’12]
How do automated parameter optimization techniques fare when applied to defect prediction?

Performance Improvement
Performance Stability

Caret: an off-the-shelf automated parameter optimization technique
(Step-1) Generate candidate settings (output: settings)
(Step-2) Evaluate candidate settings (output: performance for each setting)
(Step-3) Identify optimal setting (output: optimal setting)

(Step-1) Generate a set of candidate settings to evaluate
Example, #Trees for random forest: #Trees = 10, #Trees = 20, #Trees = 30, #Trees = 40, #Trees = 50

(Step-2) Evaluate the performance of each candidate setting using bootstrap validation
Out-of-sample bootstrap validation with 100 repetitions:
1. Generate a bootstrap sample of the defect dataset as the training corpus; the rows that were not sampled form the testing corpus
2. Construct a defect model from the training corpus
3. Calculate the performance of the model on the testing corpus
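The out-of-sample bootstrap described above can be sketched in a few lines of Python. This is a minimal illustration and not Caret's implementation: the function names, the majority-class "model", and the use of accuracy in place of AUC are all assumptions made to keep the example self-contained.

```python
import random

def out_of_sample_bootstrap(n_rows, rng):
    """Return (train_idx, test_idx) for one bootstrap repetition."""
    # Training corpus: n_rows indices drawn with replacement.
    train_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Testing corpus: the rows that were never drawn.
    drawn = set(train_idx)
    test_idx = [i for i in range(n_rows) if i not in drawn]
    return train_idx, test_idx

def bootstrap_validate(rows, fit, score, repetitions=100, seed=0):
    """Collect one performance score per bootstrap repetition."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        train_idx, test_idx = out_of_sample_bootstrap(len(rows), rng)
        if not test_idx:  # degenerate split: every row was drawn
            continue
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in test_idx]))
    return scores

# Hypothetical stand-ins for a real classifier and for AUC.
def fit_majority(train_rows):
    labels = [label for _, label in train_rows]
    return max(set(labels), key=labels.count)

def accuracy(model, test_rows):
    return sum(1 for _, label in test_rows if label == model) / len(test_rows)

# Toy dataset: (module size, is_defective).
rows = [(size, size > 30) for size in range(40)]
scores = bootstrap_validate(rows, fit_majority, accuracy)
print(len(scores), sum(scores) / len(scores))
```

In a real study, `fit` would train the classifier with one candidate setting and `score` would compute the AUC on the testing corpus.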

(Step-3) The optimal setting is the one that achieved the top performance score
#Trees = 10: AUC=0.65
#Trees = 20: AUC=0.68
#Trees = 30: AUC=0.70
#Trees = 40: AUC=0.80
#Trees = 50: AUC=0.86
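As a small worked example of Step-3, the snippet below picks the candidate with the highest AUC from the values shown on the slide. The dictionary layout and the function name `identify_optimal` are illustrative assumptions, not part of Caret.

```python
# AUC for each candidate #Trees setting, copied from the slide.
auc_by_trees = {10: 0.65, 20: 0.68, 30: 0.70, 40: 0.80, 50: 0.86}

def identify_optimal(performance):
    """Return the setting whose performance score is the highest."""
    return max(performance, key=performance.get)

optimal_trees = identify_optimal(auc_by_trees)
print(optimal_trees)  # 50
```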

We study a collection of 18 datasets from 5 open corpora
A threat of bias exists if researchers fixate on studying the same datasets with the same metrics [Tantithamthavorn et al., TSE’16]
1-7K Modules, 21-28% Defective Rate, 21-38 Metrics [Shepperd et al., TSE’13]
1-10K Modules, 15-32 Metrics [Zimmermann et al., PROMISE’07] [D’Ambros et al., MSR’10] [Kim et al., ICSE’11]
600-800 Modules, 20 Metrics [Jureczko et al., PROMISE’10]

Compute the performance improvement
For each defect dataset and each classification technique:
1. Construct and evaluate Caret-optimized models (optimized setting) 100 times to obtain the Caret-optimized AUC performance
2. Construct and evaluate default models (default setting) 100 times to obtain the default AUC performance
3. Average each set of 100 AUC values
Performance Improvement = average Caret-optimized AUC - average default AUC
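The improvement calculation can be illustrated as follows. This sketch assumes the improvement is the difference between the two averages, as the diagram suggests. The AUC values are invented for the example (three per setting instead of 100) and do not come from the study.

```python
from statistics import mean

def performance_improvement(optimized_aucs, default_aucs):
    """Average optimized AUC minus average default AUC."""
    return mean(optimized_aucs) - mean(default_aucs)

# Hypothetical AUC values for one technique on one dataset.
optimized = [0.80, 0.82, 0.78]
default = [0.60, 0.62, 0.58]
improvement = performance_improvement(optimized, default)
print(round(improvement, 2))  # 0.2
```

A positive value means the Caret-optimized models outperform the default models on average.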

Parameter settings can substantially influence the performance of defect prediction models
[Figure: boxplots of AUC performance improvement (y-axis, 0.0 to 0.4) per classification technique, in panels labeled Large, Medium, and Small: C5.0, AdaBoost, AVNNet, CART, PCANNet, NNet, FDA, MLPWeightDecay, MLP, LMT, GPLS, LogitBoost, KNN, xGBTree, GBM, NB, RBF, SVMRadial, GAM. Each boxplot presents the performance improvement across all 18 studied datasets.]

9 of the 26 studied classification techniques have a large performance improvement

C5.0 and AdaBoost have a median improvement of 0.27 and 0.14 AUC, respectively

The improvements for C5.0 and AdaBoost span up to 0.40 AUC

Performance Improvement
Caret improves the AUC performance by up to 40 percentage points
Performance Stability

Default settings may introduce instability into defect prediction models
Unstable performance estimates may introduce bias into the conclusions of research
[Jorgensen et al., TSE’07]
[Menzies and Shepperd, EMSE’12]

Estimating the stability of defect prediction models
For each defect dataset and each classification technique:
1. Construct and evaluate Caret-optimized models (optimized setting) 100 times to obtain the Caret-optimized AUC performance
2. Construct and evaluate default models (default setting) 100 times to obtain the default AUC performance
3. Compute the standard deviation (S.D.) of each set of 100 AUC values
Stability Ratio = S.D. of Optimized / S.D. of Default
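The stability ratio defined above is simple to compute. The sketch below uses the sample standard deviation from Python's `statistics` module. The slides do not say whether the sample or the population form is used, so that choice is an assumption. The AUC values are invented for illustration.

```python
from statistics import stdev

def stability_ratio(optimized_aucs, default_aucs):
    """S.D. of the optimized AUC values divided by S.D. of the default ones."""
    # Raises ZeroDivisionError if the default AUC values are all identical.
    return stdev(optimized_aucs) / stdev(default_aucs)

# Hypothetical AUC values for one technique on one dataset.
optimized = [0.79, 0.80, 0.81]
default = [0.58, 0.60, 0.62]
ratio = stability_ratio(optimized, default)
print(round(ratio, 2))  # 0.5
```

When both lists have the same length, the sample and population forms give the same ratio, because the normalizing factors cancel.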

Caret-optimized classifiers tend to be more stable than default classifiers
Cliff’s delta = Large
[Figure: boxplots of the stability ratio (y-axis, 0.0 to 2.0) per classification technique, in panels labeled Large, Medium, and Small: C5.0, AdaBoost, AVNNet, CART, PCANNet, NNet, FDA, MLPWeightDecay, MLP, LMT, GPLS, LogitBoost, KNN, xGBTree, GBM, NB, RBF, SVMRadial, GAM. Each boxplot presents the stability ratio across all 18 studied datasets.]

A stability ratio lower than one indicates that Caret-optimized classifiers tend to be more stable than default classifiers

The stability ratio is lower than 1 for 35% of the studied classification techniques

Caret-optimized classifiers are at least as stable as default classifiers
[Figure: the stability ratio boxplots continued, in panels labeled Medium, Small, and Negligible, ending with GAMBoost, RF, Ripper, PMR, PDA, MARS, SVMLinear, and J48]
The stability ratio is about 1 for 65% of the studied classification techniques

Performance Improvement
Caret improves the AUC performance by up to 40 percentage points.
Performance Stability
Caret-optimized classifiers tend to be more stable than classifiers trained with default settings.

Suboptimal parameter settings 
may have an impact on  
prior defect prediction results!

Prior findings on top-performing classification techniques
17 of 22 classification techniques are indistinguishable [Lessmann et al., TSE’08]
Classification techniques have a large impact on the performance [Ghotra et al., ICSE’15]
However, these studies have not taken parameter optimization into account

Identifying statistically distinct ranks of classification techniques
For each dataset (Dataset 1 through Dataset 18):
1. Collect the AUC performance distribution (100 values) of each technique (Technique 1 through Technique 26)
2. Apply the Scott-Knott ESD test to the 26 distributions
3. Obtain the ranking of the techniques for that dataset

Pool of ranking for each dataset
The Scott-Knott ESD rankings for Dataset 1 through Dataset 18 are pooled into one table, for example:
Dataset  T1  T2  T3
1        2   1   3
2        1   2   3
3        1   1   2

Compute the proportion of datasets where a classifier appears in the top rank
Pool of ranking for each dataset:
Dataset  T1  T2  T3
1        2   1   3
2        1   2   3
3        1   1   2
Likelihood for each technique:
T1    T2    T3
0.67  0.67  0
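The likelihood values on this slide can be reproduced from the example ranking table. The following is a minimal sketch. Representing each dataset's ranking as a dictionary and naming the function `top_rank_likelihood` are illustrative choices, not the authors' code.

```python
# Ranking of techniques T1..T3 on three datasets, copied from the slide
# (rank 1 is the top rank).
rank_pool = [
    {"T1": 2, "T2": 1, "T3": 3},  # Dataset 1
    {"T1": 1, "T2": 2, "T3": 3},  # Dataset 2
    {"T1": 1, "T2": 1, "T3": 2},  # Dataset 3
]

def top_rank_likelihood(rankings):
    """Proportion of datasets in which each technique holds rank 1."""
    n = len(rankings)
    return {
        technique: sum(1 for r in rankings if r[technique] == 1) / n
        for technique in rankings[0]
    }

likelihood = top_rank_likelihood(rank_pool)
print({t: round(v, 2) for t, v in likelihood.items()})
# {'T1': 0.67, 'T2': 0.67, 'T3': 0.0}
```

Ties are allowed: on Dataset 3 both T1 and T2 hold the top rank, so both count.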

Bootstrap resampling to combat sample selection bias
Pool of ranking for each dataset:
Dataset  T1  T2  T3
1        2   1   3
2        1   2   3
3        1   1   2
Bootstrap sample of ranking (datasets drawn with replacement):
Dataset  T1  T2  T3
1        2   1   3
2        1   2   3
2        1   2   3

Re-compute the likelihood for each sample
Bootstrap sample of ranking:
Dataset  T1  T2  T3
1        2   1   3
2        1   2   3
2        1   2   3
Likelihood for each technique:
T1    T2    T3
0.67  0.33  0

Repeat the bootstrap 100 times to estimate the confidence interval
Distribution of likelihood (one row per bootstrap sample):
T1    T2    T3
0.67  0.33  0
…     …     …
0.33  0     0
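The repeated bootstrap can be sketched as below, reusing the example ranking table from the slides. The slides do not state how the confidence interval is derived from the 100 likelihood values, so the percentile interval here is an assumption, and all function names are hypothetical.

```python
import random

# Ranking of techniques T1..T3 on three datasets, copied from the slide.
rank_pool = [
    {"T1": 2, "T2": 1, "T3": 3},
    {"T1": 1, "T2": 2, "T3": 3},
    {"T1": 1, "T2": 1, "T3": 2},
]

def top_rank_likelihood(rankings):
    """Proportion of datasets in which each technique holds rank 1."""
    n = len(rankings)
    return {t: sum(1 for r in rankings if r[t] == 1) / n for t in rankings[0]}

def bootstrap_likelihoods(rankings, repetitions=100, seed=0):
    """Likelihood of each technique in each bootstrap sample of datasets."""
    rng = random.Random(seed)
    results = []
    for _ in range(repetitions):
        # Draw as many datasets as the pool holds, with replacement.
        sample = [rng.choice(rankings) for _ in rankings]
        results.append(top_rank_likelihood(sample))
    return results

def percentile_interval(values, lower=0.025, upper=0.975):
    """Approximate percentile interval (assumed method, see lead-in)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]

distribution = bootstrap_likelihoods(rank_pool)
for technique in ("T1", "T2", "T3"):
    values = [row[technique] for row in distribution]
    print(technique, percentile_interval(values))
```

With only three datasets the intervals are very wide. The study pools rankings from 18 datasets, which gives the bootstrap more to work with.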

Caret optimization can substantially shift the top-ranked classification techniques
[Figure: top-rank likelihood estimate (y-axis, 0.0 to 1.0) for each classification technique, comparing the optimized classifier with the default classifier]

Caret increases the likelihood of appearing in the top rank by up to 83%

Automated parameter optimization should be included in future  defect prediction studies
