Better Classifier Calibration for Small Data Sets
TUOMO ALASALMI, JAAKKO SUUTALA, and JUHA RÖNING, University of Oulu, Finland
HELI KOSKIMÄKI, Oura Health Ltd., Finland
1 INTRODUCTION
In many machine learning applications, e.g. in the medical domain [5], the models need to be
explainable, or they will not be very useful. Obviously this means that the model needs to commu-
nicate to the user somehow what has led it to the given conclusion instead of just being a black-box
[13]. Another important factor in model explainability is information about how reliable a given
prediction is. This property is called classifier calibration. A classifier is well calibrated if the predicted probability of an event is close to the proportion of those events among
a group of similar predictions [6]. However, the main design objective for classifiers tends to be
good class separation and not accurate reliability estimation. Therefore, many classifiers are not
well calibrated out of the box. To improve this probability estimate, accurate classifier calibration
algorithms are needed. With accurate calibration, almost any model can output a good estimate of
the probability that the decision it has made is indeed correct [23]. Accurate probability estimates
are also important for cost sensitive decision making [31].
For calibration algorithms to work well, a minimum of about 1000 to 2000 samples is needed
for the calibration data set, depending on the learning algorithm, to avoid overfitting.
This is especially true for non-parametric calibration algorithms and calibration seems to improve
further with even bigger calibration data sets [23, 24]. To avoid biasing the calibration model, a
separate calibration data set is needed. This means that the total amount of training data needs
to be large. For example, if 10 % of the training data set is used for calibration and the rest for modelling, a
training data set with at least 10 000 samples is needed. In addition, a separate data set needs to
be held out for testing. Figure 1 illustrates the data set partitioning. In many real world modelling
tasks, however, relatively small data sets are quite common. As we will demonstrate in this article,
traditional calibration algorithms fail to deliver on small data sets. But with our proposed data
generation approach, calibration can often be improved despite the data set being small.

Fig. 1. Splitting of the available data into training, calibration, and test data sets. A part of the training data
is reserved for calibration to avoid bias. Training data contains validation data for hyperparameter tuning.
The rest of the article is structured as follows. The literature on calibration, with a view on small data
sets, is briefly reviewed in Section 2. In Section 3, a set of experiments is described. The results of
the experiments are summarized in Section 4 and presented in more detail in the Appendix. To
conclude the article, the results are discussed in Section 5.
2 CLASSIFIER CALIBRATION
There are three main categories of calibration techniques. These are the parametric calibration
algorithms such as Platt scaling [25] and the non-parametric histogram binning [30] and isotonic
regression [32] algorithms. In Platt scaling, a sigmoid function is fit to the prediction scores to
transform them into probabilities. It was originally developed to improve the calibration
of support vector machines (SVM) and might not be the right transformation for many other classifiers.
In binning, the prediction scores of a classifier are sorted and divided into bins of equal size.
When a test example is predicted, its prediction score can then be transformed into an estimated
probability of belonging to a particular class by calculating the frequency of training samples
belonging to that class in the corresponding bin. Binning has its drawbacks: the number of bins needs
to be specified and the probability estimates are discontinuous at bin boundaries. Also, depending
on the classifier used, the prediction scores might not be uniformly distributed, causing
some bins to contain significantly fewer examples than others, or even none at all. Several methods have tried
to overcome these problems, such as adaptive calibration of predictions (ACP) [15], selection over
Bayesian binnings (SBB) and averaging over Bayesian binnings (ABB) [21], as well as Bayesian
binning into quantiles (BBQ) [22]. In isotonic regression, a monotonically increasing function is
used to map the prediction scores into probabilities. Isotonic regression is not continuous in gen-
eral and can have undesirable jumps. To alleviate these problems, smoothing can be used [14].
In practice, however, the isotonicity assumption does not always hold [21]. This makes isotonic
regression suboptimal in those cases, although it remains quite effective [23] regardless.
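As a concrete illustration of the two classic calibration maps discussed above, the following R sketch fits Platt scaling with a logistic regression and a plain isotonic regression on a held-out calibration set. The data frame and column names are assumptions made for the example; this is not ENIR and not the code released with this article.

```r
# Illustrative sketch only: Platt scaling and plain isotonic regression in R.
# Assumes data frames 'calib' and 'test' with a raw prediction score column
# 'score' and a binary 0/1 label column 'y' (names are hypothetical).

# Platt scaling: logistic regression from raw scores to class probabilities.
platt   <- glm(y ~ score, data = calib, family = binomial)
p_platt <- predict(platt, newdata = test, type = "response")

# Isotonic regression: monotone step function from scores to probabilities.
iso     <- isoreg(calib$score, calib$y)
iso_map <- as.stepfun(iso)            # fitted monotone mapping as a function
p_iso   <- iso_map(test$score)
```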
For the isotonicity constraint to hold true, the ranking imposed by the classifier would need to
be perfect, which is rarely true with real-world data sets. The ensemble of near-isotonic regression
(ENIR) [20] allows violations of the ranking and uses regularization to penalize them.
In ENIR, a modified pool adjacent violators algorithm is used to find the solution path to a
near isotonic regression problem [26] and Bayesian information criterion (BIC) scoring is used to
combine the generated models. This ensemble is then used to post-process the classifier prediction
scores to map them into calibrated probabilities. In their experiments, ENIR was on average the
best performing calibration algorithm when compared to isotonic regression and BBQ with naive
Bayes (NB), logistic regression, and SVM classifiers. Similarly to what was accomplished with isotonic
regression [32], ENIR can be extended to multi-class problems, whereas the Bayesian binning
models cannot.
3 EXPERIMENTS
To test the effectiveness of using data generation for calibrating classifiers with small data sets,
a set of experiments was set up. ENIR was used as the calibration algorithm because, as a non-parametric
algorithm, it should work equally well with all classifiers. In addition, Platt scaling
was used with SVM. Representatives from top performing classifier groups were selected for the
experiments and their calibration performance with different calibration scenarios was compared
with two Bayesian classifiers.
To serve as a control, we used the uncalibrated prediction scores of each classifier. This calibration
scenario is referred to in the results as Raw. In this case, as there was no need for a separate calibration
data set, all data points in the training data set were used for classifier training. To test whether the raw
prediction scores could be improved by calibration to more closely resemble posterior probabilities,
the calibration algorithm ENIR was used in four different settings. First, ENIR was used in the
recommended way, i.e. a separate calibration data set was held out from the training data set and
used only for tuning the calibration model, not for classifier training. The size of the calibration
data set was set to 10 % of the training data and the remaining 90 % was used for training the
classifier. This scenario is called ENIR in the results. Second, ENIR was used the way the algorithm’s
creators used it, i.e. the full training data set was used both for training the classifier and for tuning the
calibration model. This scenario is called ENIR full. The DG (Data Generation) and DGG (Data Generation and Grouping) algorithms were also used
with ENIR calibration. These scenarios are called DG + ENIR and DGG + ENIR, respectively. With the SVM
classifier, Platt scaling was used with either a separate calibration data set as described above or
with the full training data set. These are called Platt and Platt full in the results. Finally, the Out-
of-Bag sample was used with ENIR calibration in the case of RF. This is called ENIR OOB in the
results. R and Matlab code for carrying out the experiments is available on GitHub1 .
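To make the ENIR scenario's 90/10 split concrete, the sketch below performs a stratified split of the training data in base R; the variable names are hypothetical and this is only a simplified illustration, not the released experiment code.

```r
# Illustrative sketch: stratified 90/10 split of the training data into a
# classifier-training part and a calibration part (variable names hypothetical).
set.seed(1)
calib_frac <- 0.10
calib_idx  <- unlist(lapply(split(seq_along(y_train), y_train), function(idx) {
  # sample roughly 10 % of each class for the calibration set
  idx[sample.int(length(idx), size = max(1, round(calib_frac * length(idx))))]
}))
x_calib <- x_train[calib_idx, , drop = FALSE];  y_calib <- y_train[calib_idx]
x_model <- x_train[-calib_idx, , drop = FALSE]; y_model <- y_train[-calib_idx]
```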
There are hundreds of different classifiers available. Each of them has its place, but
not all of them perform equally well when compared over a diverse set of problems [11]. For our
experiments we chose a representative from each of the top performing classifier groups, namely
a random forest, an SVM, and a feed forward neural network (NN) with a single hidden layer. In
addition, a naive Bayes classifier was tested as it is computationally simple, easy to interpret, and
surprisingly accurate despite the often unrealistic assumption of feature independence. Also,
the prediction scores of naive Bayes are not well calibrated which makes it a good candidate for
this experiment [8]. In addition, two Bayesian classifiers were used that should produce well cal-
ibrated probabilities without separate calibration. These were Bayesian logistic regression (BLR)
[12], which is a parametric linear classifier, and the Gaussian process classifier (GPC) [28], which is
non-parametric and nonlinear when a nonlinear covariance function such as the squared exponential is used.
We tested the GPC implemented with the expectation propagation (EP) approximation. Markov chain
Monte Carlo (MCMC) sampling can be considered the gold standard of
GPC approximations, but it is computationally very expensive, whereas the EP approximation has been
shown to agree very well with MCMC for both predictive probabilities and marginal
likelihood estimates at a fraction of the computational cost [18].
RF was implemented using the R package randomForest. The default number of trees (500), ntree,
was used and the hyperparameter mtry was tuned by increments or decrements of two based on
the Out-of-Bag error estimate. For SVM, the R package e1071 was used. A Gaussian kernel was
used and the regularization parameter cost was tuned with values $\{10^k\}_{k=-2}^{11}$. Good values for the
kernel spread hyperparameter gamma were estimated from the training data using the kernlab R
package, and the median value of the estimates was used [4]. The NN was implemented with the
R package nnet. The hidden layer size was tuned in the range from 1 to 9 neurons in increments of two
and the hyperparameter decay was tuned with values $\{10^k\}_{k=-4}^{0}$. As an activation function, a logistic
function was used. For the Gaussian process classifier, the GPML Matlab toolbox2 implementation
was used. A logistic likelihood function and a zero mean function were chosen, and the covariance
function was set to the isotropic squared exponential covariance function, which is in line with an SVM
with a Gaussian kernel and regularization parameter cost. The hyperparameters for length-scale and
signal magnitude were tuned by minimizing the negative log marginal likelihood (i.e., type II maximum
likelihood approximation) on the training data set. With the non-Bayesian methods, in every
case except RF, which used the Out-of-Bag error estimate, the tuning process was done using 10-fold
cross validation on the training data excluding the calibration data. Naive Bayes was implemented
with the R package e1071. Bayesian logistic regression was implemented using the R package arm;
the default hyperparameter values (i.e., a Cauchy prior with scale 2.5) were used and the model was
fitted with an approximate expectation maximization algorithm on the training data set.
1 https://github.com/biovaan/Calibration
2 http://www.gaussianprocess.org/gpml/code/matlab/doc/
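For concreteness, the sketch below shows how the SVM grid described above could be tuned with e1071 and kernlab in R. The object names are hypothetical and the snippet is a simplified stand-in for, not a copy of, the released code.

```r
library(e1071)
library(kernlab)
# Illustrative sketch of the SVM tuning described above ('x_train' is a numeric
# feature matrix and 'y_train' a factor of class labels; names are hypothetical).

# Kernel spread: median of kernlab's sigest() estimates on the training data.
gam <- median(sigest(as.matrix(x_train), scaled = FALSE))

# Regularization: grid of cost values 10^k for k = -2, ..., 11, selected by
# 10-fold cross validation on the training data.
tuned <- tune.svm(x = as.matrix(x_train), y = y_train,
                  kernel = "radial", gamma = gam, cost = 10^(-2:11),
                  tunecontrol = tune.control(cross = 10))
best_svm <- tuned$best.model
```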
Below we will introduce two error metrics that are commonly used to evaluate classifier calibra-
tion. These metrics are used to compare calibration performance of different calibration scenarios
in our experiments.
Logarithmic loss (logloss) is an error metric that gives the biggest penalty for being both confi-
dent and wrong about a prediction. It is therefore a good metric to evaluate classifier calibration
especially if cost sensitive decisions are made based on the classifier outcome. Logarithmic loss is
defined in Equation (1). In the equation, $N$ stands for the number of observations, $M$ stands for the
number of class labels, $\log$ is the natural logarithm, $y_{i,j}$ equals 1 if observation $i$ belongs to class $j$
and 0 otherwise, and $p_{i,j}$ stands for the predicted probability that observation $i$ belongs to class $j$.
A smaller value of logarithmic loss means better calibration.
Mean squared error (MSE) is another metric that is often used to evaluate classifier calibration.
The smaller the MSE value of a classifier, the better the calibration. However, MSE puts less em-
phasis on single confident but wrong decisions made by the classifier. It is defined in Equation (2)
where $N$ stands for the number of observations, $y_i$ is 1 if observation $i$ belongs to the positive class
and 0 otherwise, and $p_i$ is the predicted probability that observation $i$ belongs to the positive class.
As with logloss, a smaller value of MSE means better calibration.
$$ \mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j}) \qquad (1) $$

$$ \mathrm{MSE} = \frac{\sum_{i=1}^{N} (y_i - p_i)^2}{N} \qquad (2) $$
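For a binary problem, both metrics can be computed in a few lines. The sketch below is an illustration with hypothetical vector names ('p' for predicted positive-class probabilities, 'y' for the 0/1 labels); the clipping constant that guards against taking the logarithm of zero is an implementation detail not specified in the text.

```r
# Minimal sketch of the error metrics for a binary problem: 'p' holds predicted
# positive-class probabilities and 'y' the 0/1 labels (names are hypothetical).
eps     <- 1e-15                                 # guard against log(0)
p_clip  <- pmin(pmax(p, eps), 1 - eps)
logloss <- -mean(y * log(p_clip) + (1 - y) * log(1 - p_clip))   # Eq. (1), M = 2
mse     <- mean((y - p)^2)                                      # Eq. (2)
```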
To test the performance of each approach to calibration with each of the classifiers, the following
test sequence was run. Features were standardized to have zero mean and unit variance and
near zero variance features were deleted. Depending on the calibration scenario, the data set was
divided into two or three parts as in Figure 1. These were training and test data sets and in the
ENIR and Platt scenarios, a separate calibration data set was split off from the training data set.
In the Raw scenario, logloss and MSE were calculated on the raw prediction scores obtained with
each classifier from the separate test data set. In the ENIR calibration scenario, the slightly smaller
training data set was used to train each classifier and the prediction scores were calibrated using
the ENIR algorithm that was tuned with the separate calibration data set. In ENIR full scenario, the
whole training data set was used for both training the classifiers and tuning the ENIR algorithm.
Finally the prediction scores from predicting the test data points were calibrated and the error met-
rics calculated. In DG + ENIR and DGG + ENIR scenarios, the corresponding algorithm was used
to create a calibration data set that was then used to tune the ENIR algorithm. The whole training
data set was used to train the classifiers and the test data set prediction scores were calibrated
and the error metrics calculated. The threshold used for classification was selected using the calibrated
training data set so as to maximize the classification rate. In addition to
measuring the error metrics, each calibration scenario’s computation time was also measured.
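The preprocessing steps of this sequence can be written compactly in base R; the variance threshold in the sketch below is an assumed cut-off for illustration, as the exact near-zero-variance criterion is not specified here.

```r
# Illustrative preprocessing sketch: drop near-zero-variance features and
# standardize the rest ('x' is a numeric feature matrix; the threshold 1e-8
# is an assumed cut-off, not taken from the article).
keep    <- apply(x, 2, var) > 1e-8
x_clean <- scale(x[, keep, drop = FALSE])   # zero mean, unit variance
```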
To be able to test the differences between calibration scenarios, a stratified 10-fold cross valida-
tion was used to create the data samples. A 5 × 2 CV t-test [7] or a combined 5 × 2 CV F-test [2] has
been suggested for detecting differences in classifier performance because of their lower Type
I error. The lower Type I error, however, does not come without a compromise, namely a higher
Type II error (i.e. lower power). The lower power seems to be accentuated in our own experiments
with small data sets as the inherent variance between the results on different folds is quite high.
Therefore, cross validation was selected as the sampling method in our experiments and a Stu-
dent’s paired t-test with unequal variance assumption and the Welch modification to the degrees
of freedom [27] was used to determine if there was a difference between calibration scenarios.
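As one plausible reading of this comparison procedure, the per-fold error metrics of two calibration scenarios can be compared in R as follows; the vector names are hypothetical, and R's default two-sample t.test() applies the Welch degrees-of-freedom modification.

```r
# Illustrative sketch: compare two calibration scenarios using their per-fold
# logloss values from 10-fold CV ('loss_a', 'loss_b' are hypothetical vectors).
t.test(loss_a, loss_b, var.equal = FALSE)   # Welch-corrected t-test
```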
4 RESULTS
The synthetic data set was used to verify that the proposed approach does indeed improve prob-
ability estimates and not just calibration error metrics with discrete labels. Mean squared errors
with each classifier and calibration scenario are presented in Table 2. With the synthetic data, MSE
was calculated using the true probabilities, not discrete labels.
Results of the experiments on the real data sets are summarized here; the full results are given in the
Appendix. The average logarithmic loss of each classifier and calibration scenario combination is
depicted in Figure 2 and the average mean squared error in Figure 3. The training times of each
classifier and calibration scenario were measured on a computational server (Intel Xeon E5-2650 v2
@ 2.60GHz, 196GB RAM) and the results are shown in Table 3.

Table 2. Mean squared error of different classifiers and calibration scenarios on the synthetic data set.

Table 3. Average computation times of the different classifiers and calibration scenarios in seconds on the
experiment data sets. The times include hyperparameter tuning, training the classifier, and steps needed for
calibration. Prediction times are excluded as they can be considered negligible.
Fig. 2. Average logarithmic loss for different classifiers and calibration scenarios. Classifier specific (CS)
means Out-of-Bag samples with ENIR calibration for RF and Platt scaling with the full training set for SVM.
Lower value of logloss indicates better calibration performance.
to generate the calibration data for ENIR calibration. The calibration error was in these cases low-
ered to approximately the same level as is achieved with the Gaussian process classifier. SVM
achieved a comparable error level without calibration and no further improvement was achieved
with the proposed calibration approach but the calibration error did not increase either. Using
Platt scaling did increase the calibration error. The calibration error of naive Bayes was higher than
that of the best of the pack and remained unchanged with the proposed approach. This suggests that naive Bayes,
due to the model’s assumptions, was not flexible enough to catch the feature interactions in the
data and therefore no improvement was achievable with calibration, even with DG or DGG data
generation.
On the real data sets, with only one exception (the Biodegradation data set with the naive Bayes classifier),
using ENIR with a separate calibration data set split off from the training data fails to improve
calibration and actually makes the calibration worse, although the differences are not statistically
significant in every case. This obviously results from a very small calibration data set, and this
kind of result has also been noted in the literature before. This observation was the main moti-
vation behind this work. When using the same data for training both the classifier and the ENIR
calibration algorithm (ENIR full), we get mixed results. With the naive Bayes classifier, the calibration
improves statistically significantly over the uncalibrated control on four of the six data sets,
but the improvement does not reach statistical significance on the other two data sets. With the
other three classifiers, calibration tends to deteriorate compared to the uncalibrated control. The
decrease in calibration performance is statistically significant on three data sets with SVM and RF,
and four data sets with the NN classifier. What is interesting, and supports our hypothesis, is that
with the RF classifier ENIR full performs worse than ENIR with its tiny but separate calibration data
set, which indicates overfitting. However, overfitting of the calibration model did not happen with the
other classifiers.

Fig. 3. Average mean squared error for different classifiers and calibration scenarios. Classifier specific (CS)
means Out-of-Bag samples with ENIR calibration for RF and Platt scaling with the full training set for SVM.
Lower value of MSE indicates better calibration performance.
Of the classifier specific calibration scenarios, Platt scaling performed equally well, or insignificantly
better, whether the small but separate calibration data set or the whole training data set was used as the
calibration data, on all but one data set, for which using the full training data lowered logloss. Platt
scaling did better on average than ENIR full; however, it could not improve calibration over the
uncalibrated control on any of the tested data sets. The premise of using Out-of-Bag samples for
calibration with RF was that the whole training data could be used for calibration without biasing
the calibration model. Our results do not support that notion completely, at least on these small
data sets. When Out-of-Bag samples were used to tune the ENIR calibration algorithm, calibration
performance was worse than ENIR calibration with either a separate calibration data set or the
full training data set on four of the tested data sets and better on two, although one of the better
performances was not statistically significant. What is more important, though, is that ENIR tuned
with Out-of-Bag samples could improve calibration over the uncalibrated control on only one of
the data sets. In the four misbehaving cases mentioned above, calibration significantly decreased
instead.
Our DG algorithm coupled with ENIR was able to improve calibration over the uncalibrated
control with the naive Bayes classifier on five of the tested data sets. SVM calibration improved
slightly on two of the data sets with DG + ENIR and RF calibration performance decreased on
three of the data sets while it improved on one. With NN, DG + ENIR calibration performance
was not statistically significantly different from the uncalibrated control. It did, however, perform
equally well or better than ENIR or ENIR full. DGG with ENIR calibration on the other hand
improved calibration over uncalibrated control on all data sets with the naive Bayes classifier and
on five out of six data sets with the RF classifier, although one of the improvements did not reach
statistical significance. Calibration of SVM was improved on one of the data sets with DGG +
ENIR and unaffected on the others. With NN, performance was improved with DGG + ENIR on
one and decreased on one while being neutral on the other three data sets. DGG performed better
than ENIR full on all data sets with all classifiers although the differences were not statistically
significant in every case.
As a comparison, Bayesian logistic regression and Gaussian process classifiers were tested on
the same data sets because these classifiers are supposed to be well calibrated without separate
calibration. BLR calibration was better than the best non-Bayesian classifier with DGG + ENIR
calibration on one of the data sets but worse on all other data sets although one of the differences
was not statistically significant. Also, the classification rate of BLR was on average slightly lower
than that of the other classifiers except NB, although the difference was not statistically significant.
Logloss for GPC was lowest of all classifiers and calibration scenarios on five of the data sets
by a clear margin but higher on one of the data sets. MSE, however, was higher on three and
lower on one of the data sets than with the best of the calibrated non-Bayesian classifiers. This
discrepancy indicates that a higher proportion of mistakes made by GPC were truly uncertain and
high confidence predictions were more often correct with GPC than with the other classifiers. Thus,
it could be said that GPC is not overconfident in the way that classifiers calibrated with the ENIR algorithm are. This
is definitely an advantage in applications where good calibration is needed.
Using ENIR calibration with a separate calibration data set led to a slightly lower classification
rate with three classifiers because the calibration data cannot be used for training the classifier,
making the training data set smaller. NN, SVM, and GPC had the highest classification rate on these
data sets. A slightly lower classification rate was observed with RF and BLR classifiers. None of
these small differences were, however, statistically significant. Naive Bayes could not compete with
the other classifiers in accuracy.
Training and calibration of naive Bayes, SVM, RF, and BLR took on average only seconds. NN
and the EP approximation of GPC were clearly more computationally complex but still acceptable
on these small data sets with training times of a few minutes.
Table 4. Effect of class imbalance on MSE on the subsampled Letter data sets.
Table 5. Effect of class imbalance on logloss on the subsampled Letter data sets.
5 DISCUSSION
The choice of a classifier depends on the problem at hand. Accuracy, computational complexity
and memory requirements (e.g. wearable device vs. cloud server), and need for explainability are
some properties that need to be taken into account when choosing a classifier. One aspect of
explainability is classifier calibration, i.e. whether the posterior probability estimates of the classifier
can be trusted. Bayesian methods such as Bayesian logistic regression and Gaussian process classifiers
should be fairly well calibrated out of the box but may not be the most accurate on average when
tested on a wide array of problems. The top performing classifier groups have been shown to
be random forests, support vector machines, and neural network variations. Our results indicate
that SVM and NN calibration on the tested small data sets is fairly good but sometimes it can be
further improved by the DGG method coupled with ENIR calibration. RF on the other hand almost
always benefits from the DGG method coupled with ENIR. The Gaussian process classifier held on to
the premise of good calibration on most of the tested data sets, but RF with DGG coupled with
ENIR calibration produced probabilities whose average MSE over all the data sets was actually
lower than with GPC. The same is not true for logloss, which suggests that ENIR might produce
overconfident probability estimates. This discrepancy between performance on MSE and logloss
is more pronounced with the DG algorithm, as it does not use label smoothing the way DGG implicitly
does. This leads to clearly overconfident probability estimates with the DG approach coupled with
ENIR calibration. The proposed methods are not adversely affected by even severe class imbalance,
as demonstrated in the experiments. The improvements in calibration error metrics do indicate a
real improvement in the quality of the predicted probabilities, which was verified by the tests with
a synthetic data set where the true probabilities are known.
A slight drawback in DGG is that the number of samples and the group size parameter need to be
set. Also, the calibration data points generated with DG are not necessarily uniformly distributed
meaning that with a fixed bin size the bin width in DGG can vary. This can potentially affect
calibration resolution negatively with prediction scores that fall inside the widest bins. These cases
are rare, however, as otherwise the bins would be narrower. A possible drawback of the Gaussian process
classifier is that full GPCs, unlike e.g. SVMs, are not sparse out of the box but need additional
approximation approaches. This needs to be considered when training classifiers on large-scale
problems but might not pose a problem on small data sets.
On these small data sets, ENIR on its own, either with a separate calibration data set or with
the whole training data set, performs poorly with all classifiers except naive Bayes, which is known
for its poor calibration. The extra computation time from running DGG is negligible in the case of the small
data sets where it is most needed, and therefore its use is recommended when better calibration
is essential. This is especially true with classifiers such as random forest and naive Bayes.
ACKNOWLEDGMENTS
The authors would like to thank Infotech Oulu, Jenny and Antti Wihuri Foundation, Tauno Tön-
ning Foundation, and Walter Ahlström Foundation for financial support of this work.
REFERENCES
[1] Tuomo Alasalmi, Heli Koskimäki, Jaakko Suutala, and Juha Röning. 2018. Getting More Out of Small Data
Sets - Improving the Calibration Performance of Isotonic Regression by Generating More Data. In Proceedings
of the 10th International Conference on Agents and Artificial Intelligence (ICAART 2018). SCITEPRESS, 379–386.
https://doi.org/10.5220/0006576003790386
[2] Ethem Alpaydın. 1999. Combined 5 x 2 cv F Test for Comparing Supervised Classification Learning Algorithms. Neural
Computation 11, 8 (nov 1999), 1885–1892. https://doi.org/10.1162/089976699300016007
[3] Henrik Boström. 2008. Calibrating random forests. Proceedings - 7th International Conference on Machine Learning
and Applications, ICMLA 2008 (2008), 121–126. https://doi.org/10.1109/ICMLA.2008.107
[4] Barbara Caputo, K Sim, F Furesjo, and Alex Smola. 2002. Appearance-based object recognition using SVMs: which
kernel should I use?. In Proceedings of NIPS workshop on Statistical methods for computational experiments in visual
processing and computer vision, Whistler.
[5] Brian Connolly, K. Bretonnel Cohen, Daniel Santel, Ulya Bayram, and John Pestian. 2017. A nonparametric Bayesian
method of translating machine learning scores to probabilities in clinical decision support. BMC Bioinformatics 18, 1
[29] I-Cheng Yeh, King-Jang Yang, and Tao-Ming Ting. 2009. Knowledge discovery on RFM model using Bernoulli se-
quence. Expert Systems with Applications 36, 3, Part 2 (2009), 5866 – 5871. https://doi.org/10.1016/j.eswa.2008.07.018
[30] Bianca Zadrozny and Charles Elkan. 2001. Learning and making decisions when costs and probabilities are both
unknown. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
- KDD ’01. 204–213. https://doi.org/10.1145/502512.502540
[31] Bianca Zadrozny and Charles Elkan. 2001. Obtaining Calibrated Probability Estimates from Decision Trees and Naive
Bayesian Classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01). Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 609–616. http://dl.acm.org/citation.cfm?id=645530.655658
[32] Bianca Zadrozny and Charles Elkan. 2002. Transforming Classifier Scores into Accurate Multiclass Probability Esti-
mates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD ’02). ACM, 694–699. https://doi.org/10.1145/775047.775151
APPENDIX
Full results
Full results of our experiments are presented in Tables 7-12. The results in the tables are averages
and standard deviations over the folds of 10-fold cross validation. The statistical tests to
determine whether the differences between calibration conditions are statistically significant were done
with Student’s paired t-tests with an unequal variance assumption. Bayesian logistic regression and
Gaussian process classifiers were compared to the best performing of the other classifiers based on
logarithmic loss of that classifier after calibrating the classifier with ENIR using DGG generated
calibration data. Table 6 lists the abbreviations used in the result tables.
Abbreviation Description
BLR Bayesian logistic regression
CR Classification rate
DG Data Generation algorithm
DGG Data Generation and Grouping algorithm
ENIR Ensemble of near isotonic regressions
MSE Mean squared error
Logloss Logarithmic loss
NB Naive Bayes
NN Neural network
OOB Out-of-Bag
RF Random forest
SVM Support vector machine
Table 6. List of abbreviations used in the results.
Table 7. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration
scenarios on the Biodegradation data set.
Table 8. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration
scenarios on the Blood donation data set.
Table 9. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration
scenarios on the Contraceptive use data set.
Table 10. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration
scenarios on the Letter recognition data set.
Table 11. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration
scenarios on the Mammographic mass data set.
Table 12. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration
scenarios on the Titanic data set.