SMOTE for Regression
1 Introduction
Forecasting rare extreme values of a continuous variable is highly relevant for several real-world domains (e.g. finance, ecology, meteorology). This problem can be seen as equivalent to classification problems with imbalanced class distributions, which have been studied for a long time within machine learning (e.g. [1-4]). The main difference is the fact that we have a numeric target variable, i.e. a regression task. This type of problem is particularly difficult because: i) there are few examples with the rare target values; ii) the errors of the learned models are not equally relevant, because the user's main goal is predictive accuracy on the rare values; and iii) standard prediction error metrics are not adequate to measure the quality of the models given the preference bias of the user.
The existing approaches for the classification scenario can be cast into 3 main groups [5, 6]: i) change the evaluation metrics to better capture the application bias; ii) change the learning systems to bias their optimization process towards the goals of these domains; and iii) sampling approaches that manipulate the training data distribution so as to allow the use of standard learning systems.
2 Problem Formulation
The utility-based evaluation framework of [9, 12] defines the notion of event using the concept of utility. In this context, the ratios of the two metrics are also defined as functions of utility, finally leading to the following definitions of precision and recall for regression,
recall = \frac{\sum_{i:\hat{z}_i = 1, z_i = 1} (1 + u_i)}{\sum_{i: z_i = 1} (1 + \phi(y_i))}    (2)

and

precision = \frac{\sum_{i:\hat{z}_i = 1, z_i = 1} (1 + u_i)}{\sum_{i:\hat{z}_i = 1, z_i = 1} (1 + \phi(y_i)) + \sum_{i:\hat{z}_i = 1, z_i = 0} \left(2 - p\,(1 - \phi(y_i))\right)}    (3)
where p is a weight differentiating the types of errors, while ẑ and z are binary
properties associated with being in the presence of a rare extreme case.
In the experimental evaluation of our sampling approaches we have used as
main evaluation metric the F-measure that can be calculated with the values of
precision and recall,
F = \frac{(\beta^2 + 1) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}    (4)
where β is a parameter weighing the importance given to precision and recall
(we have used β = 1, which means equal importance to both factors).
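To make these definitions operational, the following Python sketch computes Equations (2)-(4) from per-case utilities and relevance values. The function name, argument layout, default value of p, and use of NumPy are illustrative assumptions; the paper's own implementation is in R.

import numpy as np

def utility_fbeta(u, phi_y, z_hat, z, p=0.5, beta=1.0):
    # u: utility u_i of each prediction; phi_y: relevance phi(y_i) of each true value
    # z_hat, z: binary indicators of predicted / true rare extreme cases
    # p: weight differentiating the types of errors (default is an assumption)
    u, phi_y = np.asarray(u, float), np.asarray(phi_y, float)
    z_hat, z = np.asarray(z_hat, bool), np.asarray(z, bool)
    tp = z_hat & z                          # signalled and truly rare
    fp = z_hat & ~z                         # signalled but not rare
    num = np.sum(1 + u[tp])                 # shared numerator of Eqs. (2) and (3)
    recall = num / np.sum(1 + phi_y[z])     # Eq. (2)
    precision = num / (np.sum(1 + phi_y[tp]) +
                       np.sum(2 - p * (1 - phi_y[fp])))  # Eq. (3)
    if precision + recall == 0:             # degenerate case: no signalled events
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)  # Eq. (4)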
3 Sampling Approaches
The basic motivation for sampling approaches is the assumption that the im-
balanced distribution of the given training sample will bias the learning systems
towards solutions that are not in accordance with the user’s preference goal.
This occurs because the goal is predictive accuracy on the data that is least represented in the sample. Most existing learning systems work by searching the space of possible models with the goal of optimizing some criterion, usually related to some form of average performance. Such metrics tend to reflect the performance on the most common cases, which are not the goal of the user. In this context, the goal of sampling approaches is to change the data distribution of the training sample so as to make the learners focus on the cases that are of interest to the user. The change that is carried out has the goal of balancing the distribution of the least represented (but more important) cases with the more frequent observations.
Many sampling approaches exist within the imbalanced classification literature. To the best of our knowledge, no attempt has been made to apply these strategies to the equivalent regression tasks, i.e. forecasting rare extreme values. In this section we describe the adaptation of two existing sampling approaches to these regression tasks.
The basic idea of under-sampling (e.g. [7]) is to decrease the number of observations with the most common target variable values, with the goal of better balancing the ratio between these observations and the ones with the interesting target values that are less frequent. Within classification this consists of obtaining a random sample from the training cases with the frequent (and less interesting) class values. This sample is then joined with the observations with the rare target class value to form the final training set that is used by the selected learning algorithm. This means that the training sample resulting from this approach will be smaller than the original (imbalanced) data set.
In regression we have a continuous target variable. As mentioned in Section 2.1, the notion of relevance can be used to specify the values of a continuous target variable that are more important for the user. We can also use the relevance function values to determine which are the observations with the common and uninteresting values that should be under-sampled. Namely, we propose the strategy of under-sampling observations whose target value has a relevance less than a user-defined parameter. This threshold defines the set of observations that are relevant according to the user preference bias, $D_r = \{\langle x, y \rangle \in D : \phi(y) \geq t\}$, where $t$ is the user-defined threshold on relevance. Under-sampling will be carried out on the remaining observations $D_i = D \setminus D_r$.
Regarding the amount of under-sampling to be carried out, the strategy is the following. For each of the relevant observations in $D_r$ we randomly select $n_u$ cases from the "normal" observations in $D_i$. The value of $n_u$ is another user-defined parameter that establishes the desired ratio between "normal" and relevant observations. Too large values of $n_u$ will result in a new training data set that is still too unbalanced, while too small values may result in a training set that is too small, particularly if there are few relevant observations. A sketch of this strategy is given below.
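For concreteness, here is a minimal Python sketch of this under-sampling step; the function name, the array-based data representation, and the default values of t and nu are assumptions made for the example.

import numpy as np

def undersample(X, y, phi_y, t=0.75, nu=1, rng=None):
    # phi_y: precomputed relevance phi(y_i) for each case; t: relevance threshold
    # nu: nr. of "normal" cases to keep per relevant observation
    rng = rng or np.random.default_rng()
    rare = np.asarray(phi_y) >= t                 # D_r: relevant observations
    norm_idx = np.flatnonzero(~rare)              # D_i = D \ D_r
    n_keep = min(nu * int(rare.sum()), norm_idx.size)
    keep = rng.choice(norm_idx, size=n_keep, replace=False)
    idx = np.concatenate([np.flatnonzero(rare), keep])
    return X[idx], y[idx]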
Algorithm 1. The main SmoteR function:

function SmoteR(D, tE, %o, %u, k)
    rareL ← {⟨x, y⟩ ∈ D : φ(y) > tE ∧ y < ỹ}      // ỹ is the median of the target Y
    newCasesL ← genSynthCases(rareL, %o, k)       // generate synthetic cases for rareL
    rareH ← {⟨x, y⟩ ∈ D : φ(y) > tE ∧ y > ỹ}
    newCasesH ← genSynthCases(rareH, %o, k)       // generate synthetic cases for rareH
    newCases ← newCasesL ∪ newCasesH
    nrNorm ← %u of |newCases|
    normCases ← sample of nrNorm cases ∈ D \ (rareL ∪ rareH)   // under-sampling
    return newCases ∪ normCases
end function
Algorithm 2. Generating the synthetic cases:

function genSynthCases(D, %o, k)
    newCases ← {}
    ng ← %o / 100                                 // nr. of new cases to generate for each existing case
    for all case ∈ D do
        nns ← kNN(k, case, D \ {case})            // k-nearest neighbours of case
        for i ← 1 to ng do
            x ← randomly choose one of the nns
            for all a ∈ attributes do             // generate attribute values
                if isNumeric(a) then
                    diff ← case[a] − x[a]
                    new[a] ← case[a] + random(0, 1) × diff
                else
                    new[a] ← randomly select among case[a] and x[a]
                end if
            end for
            d1 ← dist(new, case)                  // decide the target value by
            d2 ← dist(new, x)                     // inverse-distance weighting
            new[Target] ← (d2 × case[Target] + d1 × x[Target]) / (d1 + d2)
            newCases ← newCases ∪ {new}
        end for
    end for
    return newCases
end function
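The following Python sketch mirrors this case-generation step for numeric attributes only (the nominal-attribute branch of Algorithm 2 is omitted); the function name and the NumPy-based representation are assumptions made for illustration.

import numpy as np

def gen_synth_cases(X, y, pct_over=200, k=5, rng=None):
    # X, y: the rare cases (e.g. rareL or rareH); pct_over: the %o parameter
    rng = rng or np.random.default_rng()
    ng = pct_over // 100                          # new cases per existing case
    new_X, new_y = [], []
    for i, case in enumerate(X):
        d = np.linalg.norm(X - case, axis=1)      # distances to the other rare cases
        d[i] = np.inf                             # exclude the case itself
        nns = np.argsort(d)[:k]                   # its k nearest neighbours
        for _ in range(ng):
            j = rng.choice(nns)                   # pick one neighbour at random
            new = case + rng.random() * (X[j] - case)   # interpolate the attributes
            d1 = np.linalg.norm(new - case)       # target value decided by
            d2 = np.linalg.norm(new - X[j])       # inverse-distance weighting
            tot = d1 + d2
            t = y[i] if tot == 0 else (d2 * y[i] + d1 * y[j]) / tot
            new_X.append(new)
            new_y.append(t)
    return np.array(new_X), np.array(new_y)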
4 Experimental Evaluation
The goal of our experiments is to test the effectiveness of our proposed sampling approaches at predicting rare extreme values of a continuous target variable. For this purpose we have selected 17 regression data sets that can be obtained at the URL mentioned previously. Table 1 shows the main characteristics of these data sets. For each of these data sets we have obtained a relevance function using the automatic method proposed by Ribeiro [10]. This method produces relevance functions that assign higher relevance to rare extreme high and low values, which are the target of the work in this paper. As can be seen from the data in Table 1, for most data sets this results in an average of around 10% of the available cases having a rare extreme value.
In order to avoid any algorithm-dependent bias distorting our results, we have carried out our comparisons using a diverse set of standard regression algorithms. Moreover, for each algorithm we have considered several parameter variants. Table 2 summarizes the learning algorithms that were used and also the respective parameter variants. To ensure easy replication of our work we have used the implementations available in the free open source R environment, which is also the infrastructure used to implement our proposed sampling methods.
Each of the 20 learning approaches (8 MARS variants + 6 SVM variants + 6 random forest variants) was applied to each of the 17 regression problems using 7 different sampling approaches: i) carrying out no sampling at all (i.e. using the data set with the original imbalance); ii) 4 variants of our SmoteR method; and iii) 2 variants of under-sampling. The four SmoteR variants used 5 nearest neighbours for case generation, a relevance threshold of 0.75, and all combinations of {200, 300}% and {200, 500}% for the percentages of under- and over-sampling, respectively (cf. Algorithm 1). The two under-sampling variants used {200, 300}% for the percentage of under-sampling and the same 0.75 relevance threshold. Our goal was to compare the 6 (4 SmoteR + 2 under-sampling) sampling approaches against the default of using the given data, across 20 learning approaches and 17 data sets.
All alternatives we have described were evaluated according to the F-measure with β = 1, which means that the same importance was given to the precision and recall scores, calculated using the set-up described in Section 2.2. The values of the F-measure were estimated by means of 3 repetitions of a 10-fold cross-validation process, and the statistical significance of the observed paired differences was measured using the non-parametric Wilcoxon paired test, as sketched below.
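As an illustration of this protocol, the following sketch applies the paired Wilcoxon test to the per-fold F scores of one learner/data-set pair; the function name, the significance threshold argument, and the string labels are hypothetical.

import numpy as np
from scipy.stats import wilcoxon

def compare_to_baseline(f_sampling, f_baseline, alpha=0.05):
    # f_sampling, f_baseline: per-fold F1 estimates for the same learner
    # and data set (e.g. 30 values from 3 x 10-fold CV)
    stat, pval = wilcoxon(f_sampling, f_baseline)
    if pval >= alpha:
        return "insignificant difference"
    better = np.mean(f_sampling) > np.mean(f_baseline)
    return "significant win" if better else "significant loss"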
Table 3 summarizes the results of the paired comparison of each of the 6 sampling variants against the baseline of using the given imbalanced data set. Each sampling strategy was compared against the baseline 340 times (20 learning variants times 17 data sets). For each paired comparison we check the statistical significance of the difference in the average F score obtained with the respective sampling approach and with the baseline. These averages were estimated using a 3 × 10-fold CV process. We counted the number of significant wins and losses of each of the 6 sampling variants on these 340 paired comparisons, using two significance levels (99% and 95%).

Table 3. Wins and losses of each sampling variant against the no-sampling baseline

Sampling Strat.  Win (99%)  Win (95%)  Loss (99%)  Loss (95%)  Insignif. Diff.
S.o2.u2          164        32         5           6           99
S.o5.u2          152        38         5           1           110
S.o2.u3          155        41         1           8           101
S.o5.u3          146        41         5           4           110
U.2              136        39         6           4           121
U.3              123        44         5           4           130
The results of Table 3 show clear evidence for the advantage that sampling ap-
proaches provide, when the task is to predict rare extreme values of a continuous
target variable. In effect, we can observe an overwhelming advantage in terms of
number of statistically significant wins over the alternative of using the data set
as given (i.e. no sampling). For instance, the particular configuration of using
200% over-sampling and 200% under-sampling was significantly better than the
alternative of using the given data set on 57.6% of the 340 considered situations,
while only in 3.2% of the cases did sampling actually lead to a significantly worse model. The results also reveal that the SmoteR approaches obtain a slightly better outcome than the alternative of simply under-sampling the most frequent values.
Figure 1 shows the best scores obtained with any of the sampling and no-sampling variants that were considered, for each of the 17 data sets. As can be seen, with few exceptions the best score is clearly obtained with some sampling variant. As expected, the advantages decrease as the score of the baseline no-sampling approach increases, since it is more difficult to improve on results that are already good. Moreover, we should also mention that in our experiments we considered only a few of the possible parameter variants of the two sampling approaches (4 for SmoteR and 2 for under-sampling).
[Figure 1. Best F1 scores of the sampling and no-sampling variants on each of the 17 data sets (a1-a7, Abalone, Accel, dAiler, dElev, availPwr, bank8FM, boston, cpuSm, fuelCons, maxTorque).]
5 Conclusions
This paper has presented a general approach to tackle the problem of forecasting
rare extreme values of a continuous target variable using standard regression
tools. The key advantage of the described sampling approaches is their simplicity.
They allow the use of standard out-of-the-box regression tools on these particular
regression tasks by simply manipulating the available training data.
The key contributions of this paper are: i) showing that sampling approaches can be successfully applied to this type of regression tasks; and ii) adapting one of the most successful sampling methods (SMOTE [8]) to regression tasks.
The large set of experiments we have carried out on a diverse set of problems, using rather different learning algorithms, highlights the advantages of our proposals when compared to the alternative of simply applying the algorithms to the available data sets.
References
1. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive.
In: KDD 1999: Proceedings of the 5th International Conference on Knowledge
Discovery and Data Mining, pp. 155–164. ACM Press (1999)
2. Elkan, C.: The foundations of cost-sensitive learning. In: IJCAI 2001: Proc. of 17th
Int. Joint Conf. of Artificial Intelligence, vol. 1, pp. 973–978. Morgan Kaufmann
Publishers (2001)
3. Zadrozny, B.: One-benefit learning: cost-sensitive learning with restricted cost in-
formation. In: UBDM 2005: Proc. of the 1st Int. Workshop on Utility-Based Data
Mining, pp. 53–58. ACM Press (2005)
4. Chawla, N.V.: Data mining for imbalanced datasets: An overview. In: The Data
Mining and Knowledge Discovery Handbook. Springer (2005)
5. Zadrozny, B.: Policy mining: Learning decision policies from fixed sets of data.
PhD thesis, University of California, San Diego (2003)
6. Ling, C., Sheng, V.: Cost-sensitive learning and the class imbalance problem. In:
Encyclopedia of Machine Learning. Springer (2010)
7. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: One-
sided selection. In: Proc. of the 14th Int. Conf. on Machine Learning, pp. 179–186.
Morgan Kaufmann (1997)
8. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic
minority over-sampling technique. JAIR 16, 321–357 (2002)
9. Torgo, L., Ribeiro, R.: Precision and recall for regression. In: Gama, J., Costa,
V.S., Jorge, A.M., Brazdil, P.B. (eds.) DS 2009. LNCS, vol. 5808, pp. 332–346.
Springer, Heidelberg (2009)
10. Ribeiro, R.P.: Utility-based Regression. PhD thesis, Dep. Computer Science, Fac-
ulty of Sciences - University of Porto (2011)
11. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves.
In: ICML 2006: Proc. of the 23rd Int. Conf. on Machine Learning, pp. 233–240.
ACM ICPS, ACM (2006)
12. Torgo, L., Ribeiro, R.P.: Utility-based regression. In: Kok, J.N., Koronacki, J.,
Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007.
LNCS (LNAI), vol. 4702, pp. 597–604. Springer, Heidelberg (2007)
13. Milborrow, S.: earth: Multivariate Adaptive Regression Spline Models. Derived
from mda:mars by Trevor Hastie and Rob Tibshirani (2012)
14. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A.: e1071: Misc
Functions of the Department of Statistics (e1071), TU Wien (2011)
15. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3),
18–22 (2002)