SMOTE for Regression
1 Introduction
Forecasting rare extreme values of a continuous variable is highly relevant for several real-world domains (e.g. finance, ecology, meteorology). This problem can be seen as equivalent to classification problems with imbalanced class distributions, which have been studied for a long time within machine learning (e.g. [1-4]). The main difference is the fact that we have a numeric target variable, i.e. a regression task. This type of problem is particularly difficult because: i) there are few examples with the rare target values; ii) the errors of the learned models are not equally relevant, because the user's main goal is predictive accuracy on the rare values; and iii) standard prediction error metrics are not adequate to measure the quality of the models given the preference bias of the user.
The existing approaches for the classification scenario can be cast into 3 main groups [5, 6]: i) change the evaluation metrics to better capture the application bias; ii) change the learning systems to bias their optimization process towards the goals of these domains; and iii) sampling approaches that manipulate the training data distribution so as to allow the use of standard learning systems.
2 Problem Formulation
The utility-based evaluation framework of [9, 12] defines the notion of event using the concept of utility. In this context, the ratios of the two metrics are also defined as functions of utility, finally leading to the following definitions of precision and recall for regression,
recall = \frac{\sum_{i:\hat{z}_i = 1, z_i = 1} (1 + u_i)}{\sum_{i: z_i = 1} (1 + \phi(y_i))}    (2)

and

precision = \frac{\sum_{i:\hat{z}_i = 1, z_i = 1} (1 + u_i)}{\sum_{i:\hat{z}_i = 1, z_i = 1} (1 + \phi(y_i)) + \sum_{i:\hat{z}_i = 1, z_i = 0} \left(2 - p\,(1 - \phi(y_i))\right)}    (3)
where p is a weight differentiating the types of errors, while ẑ and z are binary
properties associated with being in the presence of a rare extreme case.
In the experimental evaluation of our sampling approaches we have used as
main evaluation metric the F-measure that can be calculated with the values of
precision and recall,
F = \frac{(\beta^2 + 1) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}    (4)
where β is a parameter weighing the importance given to precision and recall
(we have used β = 1, which means equal importance to both factors).
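To make these definitions operational, the following Python sketch computes Equations (2)-(4) from per-case utilities and relevance values. The function name, argument layout, default value of p, and use of NumPy are illustrative assumptions; the paper's own implementation is in R.

import numpy as np

def utility_fbeta(u, phi_y, z_hat, z, p=0.5, beta=1.0):
    # u: utility u_i of each prediction; phi_y: relevance phi(y_i) of each true value
    # z_hat, z: binary indicators of predicted / true rare extreme cases
    # p: weight differentiating the types of errors (default is an assumption)
    u, phi_y = np.asarray(u, float), np.asarray(phi_y, float)
    z_hat, z = np.asarray(z_hat, bool), np.asarray(z, bool)
    tp = z_hat & z                          # signalled and truly rare
    fp = z_hat & ~z                         # signalled but not rare
    num = np.sum(1 + u[tp])                 # shared numerator of Eqs. (2) and (3)
    recall = num / np.sum(1 + phi_y[z])     # Eq. (2)
    precision = num / (np.sum(1 + phi_y[tp]) +
                       np.sum(2 - p * (1 - phi_y[fp])))  # Eq. (3)
    if precision + recall == 0:             # degenerate case: no signalled events
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)  # Eq. (4)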
3 Sampling Approaches
The basic motivation for sampling approaches is the assumption that the im-
balanced distribution of the given training sample will bias the learning systems
towards solutions that are not in accordance with the user’s preference goal.
This occurs because the goal is predictive accuracy on the data that is least represented in the sample. Most existing learning systems work by searching the space of possible models with the goal of optimizing some criterion, usually related to some form of average performance. Such metrics tend to reflect the performance on the most common cases, which are not the goal of the user. In this context, the goal of sampling approaches is to change the data distribution of the training sample so as to make the learners focus on the cases that are of interest to the user. The change that is carried out has the goal of balancing the distribution of the least represented (but more important) cases with the more frequent observations.
Many sampling approaches exist within the imbalanced classification literature. To the best of our knowledge, no attempt has been made to apply these strategies to the equivalent regression tasks, i.e. forecasting rare extreme values. In this section we describe the adaptation of two existing sampling approaches to these regression tasks.
The basic idea of under-sampling (e.g. [7]) is to decrease the number of observations with the most common target variable values, with the goal of better balancing the ratio between these observations and the ones with the interesting target values that are less frequent. Within classification this consists of obtaining a random sample from the training cases with the frequent (and less interesting) class values. This sample is then joined with the observations with the rare target class value to form the final training set that is used by the selected learning algorithm. This means that the training sample resulting from this approach will be smaller than the original (imbalanced) data set.
In regression we have a continuous target variable. As mentioned in Section 2.1, the notion of relevance can be used to specify the values of a continuous target variable that are more important for the user. We can also use the relevance function values to determine which are the observations with the common and uninteresting values that should be under-sampled. Namely, we propose the strategy of under-sampling observations whose target value has a relevance less than a user-defined parameter. This threshold defines the set of observations that are relevant according to the user preference bias, $D_r = \{\langle x, y \rangle \in D : \phi(y) \geq t\}$, where $t$ is the user-defined threshold on relevance. Under-sampling will be carried out on the remaining observations $D_i = D \setminus D_r$.
Regarding the amount of under-sampling to be carried out, the strategy is the following. For each of the relevant observations in $D_r$ we randomly select $n_u$ cases from the "normal" observations in $D_i$. The value of $n_u$ is another user-defined parameter that establishes the desired ratio between "normal" and relevant observations. Too large values of $n_u$ will result in a new training data set that is still too unbalanced, while too small values may result in a training set that is too small, particularly if there are few relevant observations. A sketch of this strategy is given below.
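For concreteness, here is a minimal Python sketch of this under-sampling step; the function name, the array-based data representation, and the default values of t and nu are assumptions made for the example.

import numpy as np

def undersample(X, y, phi_y, t=0.75, nu=1, rng=None):
    # phi_y: precomputed relevance phi(y_i) for each case; t: relevance threshold
    # nu: nr. of "normal" cases to keep per relevant observation
    rng = rng or np.random.default_rng()
    rare = np.asarray(phi_y) >= t                 # D_r: relevant observations
    norm_idx = np.flatnonzero(~rare)              # D_i = D \ D_r
    n_keep = min(nu * int(rare.sum()), norm_idx.size)
    keep = rng.choice(norm_idx, size=n_keep, replace=False)
    idx = np.concatenate([np.flatnonzero(rare), keep])
    return X[idx], y[idx]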
Algorithm 1. The main SmoteR function:

function SmoteR(D, tE, %o, %u, k)
    rareL ← {⟨x, y⟩ ∈ D : φ(y) > tE ∧ y < ỹ}      // ỹ is the median of the target Y
    newCasesL ← genSynthCases(rareL, %o, k)       // generate synthetic cases for rareL
    rareH ← {⟨x, y⟩ ∈ D : φ(y) > tE ∧ y > ỹ}
    newCasesH ← genSynthCases(rareH, %o, k)       // generate synthetic cases for rareH
    newCases ← newCasesL ∪ newCasesH
    nrNorm ← %u of |newCases|
    normCases ← sample of nrNorm cases ∈ D \ (rareL ∪ rareH)   // under-sampling
    return newCases ∪ normCases
end function
Algorithm 2. Generating the synthetic cases:

function genSynthCases(D, %o, k)
    newCases ← {}
    ng ← %o / 100                                 // nr. of new cases to generate for each existing case
    for all case ∈ D do
        nns ← kNN(k, case, D \ {case})            // k-nearest neighbours of case
        for i ← 1 to ng do
            x ← randomly choose one of the nns
            for all a ∈ attributes do             // generate attribute values
                if isNumeric(a) then
                    diff ← case[a] − x[a]
                    new[a] ← case[a] + random(0, 1) × diff
                else
                    new[a] ← randomly select among case[a] and x[a]
                end if
            end for
            d1 ← dist(new, case)                  // decide the target value by
            d2 ← dist(new, x)                     // inverse-distance weighting
            new[Target] ← (d2 × case[Target] + d1 × x[Target]) / (d1 + d2)
            newCases ← newCases ∪ {new}
        end for
    end for
    return newCases
end function
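The following Python sketch mirrors this case-generation step for numeric attributes only (the nominal-attribute branch of Algorithm 2 is omitted); the function name and the NumPy-based representation are assumptions made for illustration.

import numpy as np

def gen_synth_cases(X, y, pct_over=200, k=5, rng=None):
    # X, y: the rare cases (e.g. rareL or rareH); pct_over: the %o parameter
    rng = rng or np.random.default_rng()
    ng = pct_over // 100                          # new cases per existing case
    new_X, new_y = [], []
    for i, case in enumerate(X):
        d = np.linalg.norm(X - case, axis=1)      # distances to the other rare cases
        d[i] = np.inf                             # exclude the case itself
        nns = np.argsort(d)[:k]                   # its k nearest neighbours
        for _ in range(ng):
            j = rng.choice(nns)                   # pick one neighbour at random
            new = case + rng.random() * (X[j] - case)   # interpolate the attributes
            d1 = np.linalg.norm(new - case)       # target value decided by
            d2 = np.linalg.norm(new - X[j])       # inverse-distance weighting
            tot = d1 + d2
            t = y[i] if tot == 0 else (d2 * y[i] + d1 * y[j]) / tot
            new_X.append(new)
            new_y.append(t)
    return np.array(new_X), np.array(new_y)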
4 Experimental Evaluation
The goal of our experiments is to test the effectiveness of our proposed sampling approaches at predicting rare extreme values of a continuous target variable. For this purpose we have selected 17 regression data sets that can be obtained at the URL mentioned previously. Table 1 shows the main characteristics of these data sets. For each of these data sets we have obtained a relevance function using the automatic method proposed by Ribeiro [10]. This method produces relevance functions that assign higher relevance to rare extreme high and low values, which are the target of the work in this paper. As can be seen from the data in Table 1, for most data sets this results in an average of around 10% of the available cases having a rare extreme value.
In order to avoid any algorithm-dependent bias distorting our results, we have carried out our comparisons using a diverse set of standard regression algorithms. Moreover, for each algorithm we have considered several parameter variants. Table 2 summarizes the learning algorithms that were used and also the respective parameter variants. To ensure easy replication of our work we have used the implementations available in the free open source R environment, which is also the infrastructure used to implement our proposed sampling methods.
Each of the 20 learning approaches (8 MARS variants + 6 SVM variants + 6 random forest variants) was applied to each of the 17 regression problems using 7 different sampling approaches: i) carrying out no sampling at all (i.e. using the data set with the original imbalance); ii) 4 variants of our SmoteR method; and iii) 2 variants of under-sampling. The four SmoteR variants used 5 nearest neighbours for case generation, a relevance threshold of 0.75, and all combinations of {200, 300}% and {200, 500}% for the percentages of under- and over-sampling, respectively (cf. Algorithm 1). The two under-sampling variants used {200, 300}% for the percentage of under-sampling and the same 0.75 relevance threshold. Our goal was to compare the 6 (4 SmoteR + 2 under-sampling) sampling approaches against the default of using the given data, across 20 learning approaches and 17 data sets.
All alternatives we have described were evaluated according to the F-measure with β = 1, which means that the same importance was given to the precision and recall scores, calculated using the set-up described in Section 2.2. The values of the F-measure were estimated by means of 3 repetitions of a 10-fold cross-validation process, and the statistical significance of the observed paired differences was measured using the non-parametric Wilcoxon paired test, as sketched below.
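As an illustration of this protocol, the following sketch applies the paired Wilcoxon test to the per-fold F scores of one learner/data-set pair; the function name, the significance threshold argument, and the string labels are hypothetical.

import numpy as np
from scipy.stats import wilcoxon

def compare_to_baseline(f_sampling, f_baseline, alpha=0.05):
    # f_sampling, f_baseline: per-fold F1 estimates for the same learner
    # and data set (e.g. 30 values from 3 x 10-fold CV)
    stat, pval = wilcoxon(f_sampling, f_baseline)
    if pval >= alpha:
        return "insignificant difference"
    better = np.mean(f_sampling) > np.mean(f_baseline)
    return "significant win" if better else "significant loss"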
Table 3 summarizes the results of the paired comparison of each of the 6 sampling variants against the baseline of using the given imbalanced data set. Each sampling strategy was compared against the baseline 340 times (20 learning variants times 17 data sets). For each paired comparison we check the statistical significance of the difference in the average F score obtained with the respective sampling approach and with the baseline. These averages were estimated using a 3 × 10-fold CV process. We counted the number of significant wins and losses of each of the 6 sampling variants on these 340 paired comparisons, using two significance levels (99% and 95%).

Table 3. Wins and losses of each sampling variant against the no-sampling baseline

Sampling Strat.  Win (99%)  Win (95%)  Loss (99%)  Loss (95%)  Insignif. Diff.
S.o2.u2          164        32         5           6           99
S.o5.u2          152        38         5           1           110
S.o2.u3          155        41         1           8           101
S.o5.u3          146        41         5           4           110
U.2              136        39         6           4           121
U.3              123        44         5           4           130
The results of Table 3 show clear evidence for the advantage that sampling ap-
proaches provide, when the task is to predict rare extreme values of a continuous
target variable. In effect, we can observe an overwhelming advantage in terms of
number of statistically significant wins over the alternative of using the data set
as given (i.e. no sampling). For instance, the particular configuration of using
200% over-sampling and 200% under-sampling was significantly better than the
alternative of using the given data set on 57.6% of the 340 considered situations,
while only in 3.2% of the cases did sampling actually lead to a significantly worse model. The results also reveal that the SmoteR approaches obtain a slightly better outcome than the alternative of simply under-sampling the most frequent values.
Figure 1 shows the best scores obtained with any of the sampling and no-sampling variants that were considered, for each of the 17 data sets. As can be seen, with few exceptions the best score is clearly obtained with some sampling variant. As expected, the advantages decrease as the score of the baseline no-sampling approach increases, since it is more difficult to improve on results that are already good. Moreover, we should also mention that in our experiments we considered only a few of the possible parameter variants of the two sampling approaches (4 for SmoteR and 2 for under-sampling).
[Figure 1. Best F1 scores of the sampling and no-sampling variants on each of the 17 data sets (a1-a7, Abalone, Accel, dAiler, dElev, availPwr, bank8FM, boston, cpuSm, fuelCons, maxTorque).]
5 Conclusions
This paper has presented a general approach to tackle the problem of forecasting
rare extreme values of a continuous target variable using standard regression
tools. The key advantage of the described sampling approaches is their simplicity.
They allow the use of standard out-of-the-box regression tools on these particular
regression tasks by simply manipulating the available training data.
The key contributions of this paper are: i) showing that sampling approaches can be successfully applied to this type of regression tasks; and ii) adapting one of the most successful sampling methods (SMOTE [8]) to regression tasks.
The large set of experiments we have carried out on a diverse set of problems, using rather different learning algorithms, highlights the advantages of our proposals when compared to the alternative of simply applying the algorithms to the available data sets.
References
1. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive.
In: KDD 1999: Proceedings of the 5th International Conference on Knowledge
Discovery and Data Mining, pp. 155–164. ACM Press (1999)
2. Elkan, C.: The foundations of cost-sensitive learning. In: IJCAI 2001: Proc. of 17th
Int. Joint Conf. of Artificial Intelligence, vol. 1, pp. 973–978. Morgan Kaufmann
Publishers (2001)
3. Zadrozny, B.: One-benefit learning: cost-sensitive learning with restricted cost in-
formation. In: UBDM 2005: Proc. of the 1st Int. Workshop on Utility-Based Data
Mining, pp. 53–58. ACM Press (2005)
4. Chawla, N.V.: Data mining for imbalanced datasets: An overview. In: The Data
Mining and Knowledge Discovery Handbook. Springer (2005)
5. Zadrozny, B.: Policy mining: Learning decision policies from fixed sets of data.
PhD thesis, University of California, San Diego (2003)
6. Ling, C., Sheng, V.: Cost-sensitive learning and the class imbalance problem. In:
Encyclopedia of Machine Learning. Springer (2010)
7. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: One-
sided selection. In: Proc. of the 14th Int. Conf. on Machine Learning, pp. 179–186.
Morgan Kaufmann (1997)
8. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic
minority over-sampling technique. JAIR 16, 321–357 (2002)
9. Torgo, L., Ribeiro, R.: Precision and recall for regression. In: Gama, J., Costa,
V.S., Jorge, A.M., Brazdil, P.B. (eds.) DS 2009. LNCS, vol. 5808, pp. 332–346.
Springer, Heidelberg (2009)
10. Ribeiro, R.P.: Utility-based Regression. PhD thesis, Dep. Computer Science, Fac-
ulty of Sciences - University of Porto (2011)
11. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves.
In: ICML 2006: Proc. of the 23rd Int. Conf. on Machine Learning, pp. 233–240.
ACM ICPS, ACM (2006)
12. Torgo, L., Ribeiro, R.P.: Utility-based regression. In: Kok, J.N., Koronacki, J.,
Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007.
LNCS (LNAI), vol. 4702, pp. 597–604. Springer, Heidelberg (2007)
13. Milborrow, S.: earth: Multivariate Adaptive Regression Spline Models. Derived
from mda:mars by Trevor Hastie and Rob Tibshirani (2012)
14. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A.: e1071: Misc
Functions of the Department of Statistics (e1071), TU Wien (2011)
15. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3),
18–22 (2002)