Oversampling techniques for imbalanced data in regression
Keywords: Data augmentation; Machine learning; AutoInflaters; Nearest neighbor; Imbalanced data

Our study addresses the challenge of imbalanced regression data in Machine Learning (ML) by introducing tailored methods for different data structures. We adapt K-Nearest Neighbor Oversampling-Regression (KNNOR-Reg), originally for imbalanced classification, to address imbalanced regression in low-population datasets, evolving to KNNOR-Deep Regression (KNNOR-DeepReg) for high-population datasets. For tabular data, we also present the Auto-Inflater neural network, utilizing an exponential loss function for Autoencoders. For image datasets, we employ Multi-Level Autoencoders, consisting of Convolutional and Fully Connected Autoencoders. For such high-dimension data our approach outperforms the Synthetic Minority Oversampling Technique for Regression (SMOTER) algorithm for the IMDB-WIKI and AgeDB image datasets. For tabular data we conducted a comprehensive experiment using various models trained on both augmented and non-augmented datasets, followed by performance comparisons on test data. The outcomes revealed a positive impact of data augmentation, with a success rate of 83.75% for Light Gradient Boosting Machine (LightGBM) and 71.57% for the 18 other regressors employed in the study. This success rate is determined by the frequency of instances where models performed better when augmented data was used compared to instances with no augmentation. Access to the comparative code can be found in GitHub.
∗ Corresponding author.
E-mail addresses: [email protected] (S.B. Belhaouari), [email protected] (A. Islam), [email protected] (K. Kassoul),
[email protected] (A. Al-Fuqaha), [email protected] (A. Bouzerdoum).
https://doi.org/10.1016/j.eswa.2024.124118
Received 31 August 2023; Received in revised form 3 March 2024; Accepted 24 April 2024
Available online 20 May 2024
0957-4174/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
S.B. Belhaouari et al. Expert Systems With Applications 252 (2024) 124118
Fig. 1. Types of Imbalance. In this paper we are focusing on the continuous data sets
where the target values are skewed to one side.
Fig. 3. Domain and Co-Domain relationship in imbalanced regression. The non-ordered
set of features is represented on the left. The right side shows the histogram of the
target values, increasing vertically upwards. As the regression relationship between the
features is unknown, we need first to generate new examples of features, ensuring
that it represents the minority data set. Then we compute the possible target value
corresponding to the newly created feature.
and their distribution is often skewed across the patient population. Other industries like finance, meteorology, and fault diagnosis are also plagued with imbalanced regression problems (Krawczyk, 2016). This article proposes several novel methods of oversampling Imbalanced Regression Data. The advantages of the proposed techniques in this study, when compared to those identified in the existing literature, can be summarized as follows: Firstly, we extend the K Nearest Neighbor Oversampling (KNNOR) method (Islam, Belhaouari, Rehman, & Bensmail, 2022b) to KNNOR-Regression (KNNOR-Reg), enabling the generation of target values for imbalanced regression problems. Additionally, we expand upon Class Aware Autoencoders by introducing Target Aware Auto Encoders, which are designed for estimating target values for new features. We also introduce a novel architecture known as Target Aware AutoInflaters, serving to extract features from low-dimensional data. Furthermore, our study involves the development of Multi-level Auto Encoders, which are adept at extracting features from images and generating new features and target values. To enhance prediction accuracy, we employ an exponential loss function within the AutoInflater, effectively highlighting the differences between predicted and actual target values. Lastly, our approach incorporates the use of the maximum test target value as a normalizer for calculating regression loss, providing a comprehensive and effective framework for addressing the identified challenges.

The contributions of this paper are summarized as below.

• Extending the KNNOR method to KNNOR-Regression (KNNOR-Reg) to generate target values for imbalanced regression problems (see Section 3.2);
• Extending Class Aware Autoencoders to Target Aware Auto Encoders for estimating target values for new features (see Section 3.3.1);
• Introducing a novel architecture called Target Aware AutoInflaters to extract features from low dimensional data (see Section 3.3);
• Developing Multi-level Auto Encoders for extracting features from images and creating new features and target values (see Section 3.4);
• Using an exponential loss function within the AutoInflater to better highlight the difference between predicted and actual target values (see Section 3.3.3);
• Calculating regression loss using the maximum test target value as a normalizer (see Section 4.1.1).

The paper is structured as follows: In Section 2, we introduce the issue of imbalanced linear regression, explain its significance, and provide a review of existing literature on the subject. In Section 3, we present our solutions to address this problem, which are categorized into two frameworks based on dataset size. For smaller datasets, we introduce an extended version of KNNOR, while for larger datasets, we propose a novel AutoEncoders implementation. Fig. 4 offers an overview of our proposed methods, organized by data type and structure. For tabular data, we propose different methods, such as KNNOR Regression (Section 3.2) and KNNOR DeepRegression with target-aware Auto Encoders/Inflaters (Sections 3.3.1 and 3.3.2), depending on dataset size and feature count. In the case of image datasets, we suggest a Multi-Level AutoEncoder scheme (Section 3.4). In Section 4, we outline the experimental design, where we assess the effectiveness of our methods on well-known imbalanced regression datasets and present the results and subsequent discussion. Section 5 encompasses the conclusion and outlines future work based on the results.

2. Presentation of the problem and related work

In this section, we begin by introducing the concept of imbalanced linear regression, elucidating its importance, and conducting a comprehensive examination of the existing body of literature pertaining to this topic.

2.1. Problem description

In a standard regression problem, the goal is to predict continuous values based on a set of examples and their corresponding target values. However, in the case of an imbalanced regression problem, the data distribution may be skewed, with only a few examples having target values within a specific range of interest, while the majority of examples have target values in a different range. This imbalance can result in a biased model that tends to predict values within the range of the majority data. This bias occurs because most regression algorithms are designed to minimize the average error across all data points (Gan et al., 2020; Liu et al., 2018). As a consequence, the model’s accuracy is typically higher for the majority class but lower for the minority class. This can be misleading when assessing the overall model accuracy. In fact, a classifier or regressor may predict the entire dataset to belong to the majority class or within the majority range, thus getting the minority class or rare data wrong, while still achieving a seemingly high accuracy due to the averaging effect. This phenomenon is important to consider when evaluating the performance of models in imbalanced regression scenarios (Fernández et al., 2018; Gan et al., 2020; Liu et al., 2018).

In the context of imbalanced regression, it is necessary to define the relevant terms. Let $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ denote the set of training data, where $\mathbf{x}_i \in \mathbb{R}^d$ represents the input features and $y_i \in \mathbb{R}$ represents the dependent value, which is continuous in nature. To further characterize the dependent value space, Branco, Torgo, and Ribeiro (2019) introduce a threshold value $t_r$ that divides the dataset into two complementary sets: the common data, represented by $D_N$, and the rare data, denoted by $D_R$, where $|y_i| < t_r$ indicates rare data and $|y_i| \geq t_r$ indicates common data. The imbalanced regression problem arises when the following conditions hold:

• Accurate prediction of $D_R$ is more crucial for determining the performance of the model;
• $|D_R| \ll |D_N|$, where $|D_R|$ and $|D_N|$ represent the cardinalities of the rare and common datasets, respectively.

Minority and Majority in Continuous data

In contrast to classification problems, labeling in regression problems can be more complex since the focus is on identifying rare events or valuable data points, such as fraudulent transactions, highly profitable stock market actions, or ecological catastrophes. Therefore, the identification of minority data points is of utmost importance. However, it is also essential to consider that misclassifications can have different costs. To address this, the utility theory is used to define a relevance function that assigns importance to each target value (Torgo & Ribeiro, 2007). The relevance function is a continuous, real-valued function that is dependent on the domain and maps each target value to a relevance scale. Eq. (1) defines a relevance function that takes into account the application-specific bias and maps each target value to a continuous scale of relevance ranging from 0 to 1, where 0 indicates minimum and 1 indicates maximum importance.

$\phi(Y) : \mathcal{Y} \rightarrow [0, 1]$ (1)

To obtain the relevance function, we use the box and whisker plot of the target value, where the median value is assigned an importance value of 0 and the upper adjacent and all higher values are assigned an importance value of 1. Similarly, all lower adjacent values are assigned an importance value of 1. To interpolate between these importance values and obtain a smooth relevance function, we use a piece-wise cubic Hermite interpolation method (Camacho, Douzas, & Bacao, 2022).

Fig. 5. Relevance function for the (a) compactiv and (b) bank8FM data sets. Figure a displays a histogram depicting the distribution of target values within the compactiv dataset, with the corresponding relevance values presented below. It is worth noting that there is a noticeable correlation between these relevance values and the frequency of target values. The interruptions in the plot can be attributed to gaps within the histogram. Figure b shows the histogram for the bank8FM dataset, along with the associated relevance values displayed below it.

The relevance values, calculated using the same method, are illustrated in Fig. 5. In Fig. 5a, the histogram represents the target values of the compactiv dataset, with the corresponding relevance values depicted below. Notably, there is a correlation between the relevance values and the frequency of the target values. The gap in the histogram is responsible for the discontinuity in the plot. Moving to Fig. 5b, it
shows the histogram for the bank8FM dataset, along with the associated
relevance values displayed beneath it. It is worth mentioning that
the positioning of the extremes is a crucial factor in determining the
relevance function. In this exercise, we considered datasets with either
high (Fig. 5b) or low (Fig. 5a) extremes but not both. The red and
blue sections in the lower half of the figure are also of significance.
The threshold of importance is manually defined, with target values
above this threshold considered critical. We adhere to established
practices (Camacho et al., 2022; Torgo & Ribeiro, 2007) to partition
the data into two subsets: 𝐷𝑁 and 𝐷𝑅 . For each dataset, a user-defined
coefficient is employed to determine the extent to which the whiskers
extend from the interquartile range in the box plot of the dependent
data. This threshold plays a pivotal role in segregating the data into
rare (𝐷𝑅 ) and common (𝐷𝑁 ) subsets. We rely on the methodology
proposed by Camacho et al. (2022) to derive these thresholds. Once the
data has been categorized into these two segments, our algorithm can
be applied. However, before delving into the specifics of our proposed
methods, it is essential to review recent work in the field of imbalanced regression.

Fig. 6. SMOTE: the fundamental augmentation algorithm. $x_i^{new}$ is the generated point at a random distance between $x_i$ and $x_i^{nearest}$.
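The relevance-based split into rare and common data described in this subsection can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the function names (`box_plot_stats`, `relevance`, `split_rare_common`) are hypothetical, the whisker coefficient of 1.5 is a conventional box-plot default rather than a value taken from the paper, and the interpolation is a cubic Hermite step with zero slopes at the control points, which may differ from the exact piece-wise cubic Hermite scheme of Camacho et al. (2022).

```python
import statistics


def box_plot_stats(targets, whisker_coef=1.5):
    """Return (lower adjacent, median, upper adjacent) of the target values."""
    q1, med, q3 = statistics.quantiles(targets, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_coef * iqr
    upper_fence = q3 + whisker_coef * iqr
    # Adjacent values: the most extreme observations still inside the fences.
    lower_adj = min(y for y in targets if y >= lower_fence)
    upper_adj = max(y for y in targets if y <= upper_fence)
    return lower_adj, med, upper_adj


def hermite_step(t):
    """Cubic Hermite basis 3t^2 - 2t^3, clipped to [0, 1] (zero end slopes)."""
    t = min(1.0, max(0.0, t))
    return t * t * (3.0 - 2.0 * t)


def relevance(y, lower_adj, med, upper_adj, extreme="upper"):
    """Relevance in [0, 1]: 0 at the median, 1 at and beyond the adjacent value."""
    if extreme in ("upper", "both") and y >= med and upper_adj > med:
        return hermite_step((y - med) / (upper_adj - med))
    if extreme in ("lower", "both") and y < med and med > lower_adj:
        return hermite_step((med - y) / (med - lower_adj))
    return 0.0


def split_rare_common(targets, threshold=0.8, whisker_coef=1.5, extreme="upper"):
    """Return index lists (rare, common) using the relevance threshold."""
    lo, med, hi = box_plot_stats(targets, whisker_coef)
    rare, common = [], []
    for i, y in enumerate(targets):
        bucket = rare if relevance(y, lo, med, hi, extreme) >= threshold else common
        bucket.append(i)
    return rare, common
```

The `extreme` argument mirrors the "Type of extreme" column in Table 1, and `threshold` plays the role of the relevance threshold listed there.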
Fig. 7. Application of SMOTER to generate new point as well as target value.

Fig. 8. Simulated dataset before augmentation. $p_0$ is the source point from where augmentation will start. $p_1$, $p_2$ and $p_3$ are its 3 nearest neighbors in increasing order of distance.
minority points, denoted as $x_0$, and repeating the following steps for each of its k nearest neighbors:

• generate a random point on a line between the start point and the next closest neighbor;
• make the point generated as the start point.

Fig. 8 shows an artificial imbalanced dataset and Fig. 9 gives a pictorial representation of the process of augmentation using three neighboring points. In a general case where k neighbors are used to create an artificial point, the process can be represented by the following.

$\forall i \in [0, 1, \ldots, k]$ the sequence is defined using Eq. (4):

$x^{new}_{i+1} = x^{new}_{i} + (p_i - x^{new}_{i}) \ast \alpha_i$ (4)

where $x^{new}_{0}$ is any safe point in the minority class, $p_i$ is the ith nearest neighbor of $x^{new}_{0}$ and $\alpha_i$ is a uniform random variable over $[0, M]$, where $M$ is any positive value less than 1.

At each iteration of the process, the new point generated at the preceding step becomes the starting point. A new point is synthesized at a distance $random(0, M)$ on the straight line joining the starting point and the $(i + 1)$th nearest neighbor of the origin point that started the exercise. The synthetic point obtained after the last iteration is considered the new augmented data point for the entire process.

2.2.3. Deep learning in imbalanced regression

In addition to statistical techniques, this paper proposes a deep learning approach, which is crucial to understanding imbalanced regression using deep neural networks. Neural networks are particularly useful for high-dimensional datasets, such as images, and have been

2.2.4. AutoEncoders

Autoencoder neural networks have the ability to generate output features that match the input features. They are composed of three main components: the encoder, the bottleneck, and the decoder (Rifai, Vincent, Muller, Glorot, & Bengio, 2011). Due to their common usage in image data, autoencoders typically have high-dimensional input features that are reduced to the bottleneck size by the encoder. The decoder is then trained to reconstruct the initial output by minimizing a cost function. Fig. 10 provides a block diagram of an autoencoder that includes the Encoder, Bottleneck, and Decoder. Autoencoders can be fully connected or can include Convolution and De-convolution layers (Zeiler, Krishnan, Taylor, & Fergus, 2010), with the bottleneck typically being a fully connected layer that extracts a one-dimensional feature representation. Autoencoders are used to reduce high-dimensional datasets like images, making them more suitable for statistical methods (Wang, Yao, & Zhao, 2016). This work employs an innovative form of autoencoder known as the Class-Aware Autoencoder (Islam, Belhaouari, Rehman, & Bensmail, 2022a), which is further explained below.

Fig. 10. Auto Encoder Block Diagram.

Class Aware AutoEncoders

Autoencoders aim to minimize the difference between input and output data. However, class-aware autoencoders take this a step further by incorporating the class label information into the output data. This means that the output of the class-aware Autoencoder includes both a close approximation of the input feature set and the corresponding class label (Islam et al., 2022a). Fig. 11 illustrates the concept of a class-aware Autoencoder. To match the dimensions, a random or constant feature is added to the input data, and the output is then matched with the actual class label. This approach has been primarily used in labeled datasets for classification tasks. However, it can also be extended to regression data and applied to predict the target value for new data points by modifying the loss function, as described in Section 3.3.1.

3. Material and methods

In this section, we outline our strategies for addressing this issue, which are classified into two frameworks based on dataset size. For smaller datasets, we introduce an extended version of KNNOR, while for larger datasets, we propose a novel AutoEncoders implementation. Fig. 4 provides an overview of our proposed methods, categorized by data type and structure. Regarding tabular data, we propose various methods, including KNNOR Regression (Section 3.2)
3.1. Methods
The process of generating artificial data points along with their corresponding target values is shown as flowcharts in Figs. 14 and 15. Fig. 14 outlines the methodology for creating new data points, while Fig. 15 illustrates the procedure for computing the target values associated with the newly generated data points.
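As a rough illustration of the point-generation step, the sequence in Eq. (4) can be written as a short loop over the ordered neighbors. This is a minimal sketch, not the authors' implementation: the helper names `k_nearest` and `knnor_walk` are hypothetical, the step bound `max_step` (the $M$ of Eq. (4)) defaults to an arbitrary 0.5, the loop runs over the k neighbors $p_1, \ldots, p_k$, and the sketch neither checks whether the origin is a "safe" point nor computes the target value of the new point.

```python
import math
import random


def k_nearest(origin, points, k):
    """Return the k points closest to `origin` (Euclidean), nearest first."""
    others = [p for p in points if p is not origin]
    return sorted(others, key=lambda p: math.dist(origin, p))[:k]


def knnor_walk(origin, neighbors, max_step=0.5, rng=random):
    """Apply x_{i+1} = x_i + (p_i - x_i) * alpha_i along the ordered neighbors."""
    current = list(origin)
    for p in neighbors:
        # alpha_i is drawn uniformly from [0, M], with M < 1 (assumed M = max_step).
        alpha = rng.uniform(0.0, max_step)
        current = [c + (pj - c) * alpha for c, pj in zip(current, p)]
    # The point reached after the last neighbor is the synthetic sample.
    return current


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (10.0, 10.0)]
    source = minority[0]
    print(knnor_walk(source, k_nearest(source, minority, k=3)))
```

Because every step is a convex combination of the current point and a neighbor, the synthetic point stays inside the region spanned by the source point and its k neighbors.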
Fig. 16. Target Aware Auto Encoder (Fully Connected). The neuron highlighted in red represents the additional value generated, matching the target value. The network is capable not only of extracting features but also of estimating the target value. The latter is generated by including an additional component in the loss function.
Fig. 17. Target Aware Inflater-Deflater (Fully Connected) for low dimension data. The neuron highlighted in red represents the additional value generated, matching the target
value. The bottleneck expands the feature set while the Network gives an additional output, the target value corresponding to the features. This is done by including the target
value in the output as well as the loss function.
learning the features of the rare dataset. In this context, the Autoencoder is defined as the function $F$, from $\mathbb{R}^d$ to $\mathbb{R}^{d+1}$ using Eq. (7) as follows:

$F(X) = (\hat{X}, \hat{y})$ (7)
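To make the shape of Eq. (7) concrete, the following sketch splits a (d+1)-dimensional network output into the reconstruction $\hat{X}$ and the estimated target $\hat{y}$, and combines both errors into one loss. It is an assumption-laden illustration rather than the paper's loss: the function names, the squared-error terms, and the `target_weight` factor are all hypothetical, and the exponential loss mentioned for the AutoInflater is not modeled here.

```python
def split_output(output, d):
    """Split F(X) in R^(d+1) into the reconstruction X_hat and the target y_hat."""
    return output[:d], output[d]


def target_aware_loss(output, x, y, target_weight=1.0):
    """Reconstruction error plus a weighted target error (assumed squared errors)."""
    x_hat, y_hat = split_output(output, len(x))
    reconstruction = sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)
    target_error = (y_hat - y) ** 2
    return reconstruction + target_weight * target_error


if __name__ == "__main__":
    # Three input features, so the network output has four components.
    print(target_aware_loss([1.0, 2.0, 3.0, 0.5], x=[1.0, 2.0, 2.0], y=1.0))
```

The extra output component corresponds to the neuron highlighted in red in Figs. 16 and 17.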
converts them into vectors at the bottleneck. At the second level lies a
target aware, fully connected Auto Encoder that uses the bottleneck of
the previous level Autoencoder as input and trains a target-aware Neu-
ral Network. Fig. 20 illustrates the process. The external Convolution
Auto Encoder is responsible for extracting the features from the image
dataset at the bottleneck (marked as BottleNeck1 in Fig. 20). These
features and the target values (provided externally) are used to train
the internal target-aware Auto-Encoder. The bottleneck of the internal
Auto-Encoder (marked as BottleNeck2) is used to reduce the feature
size of the dataset further, and KNNOR is applied to these extracted
features to generate new data points. This approach proves to be more
efficient than training a single target-aware autoencoder on the images
directly, as seen in Table 7.
Fig. 20. Multi-level Auto Encoders for Images. First, CNN Auto Encoders to reduce the dimension of images. Second, Fully Connected Target Aware Auto Encoders to learn the
target values of minority data set.
Fig. 22. Target generation steps for data with high population and low dimension.
Fig. 21. Target generation steps for data with high population and high dimension.
Fig. 23. Target generation steps for data with low population.
Fig. 24. Augmentation of data and target values on the laser dataset, using KNNOR-Regression and KNNOR-DeepRegression. Figures a and c show the scatter plot of data considered to be common and rare and also show the datapoints after augmentation. Figures b and d show the frequency of the labels of the common, rare and augmented data points.

Fig. 25. Augmentation of data and target values on the ele-2 dataset, using KNNOR-Regression and KNNOR-DeepRegression. Figures a and c display scatter plots representing data points categorized as common and rare, including the augmented data points. Meanwhile, Figures b and d illustrate the label distribution for common, rare, and augmented data points.
Fig. 26. Recommendation of techniques for varying structure of data. Regarding Tabular data, we offer different methods depending on the population and number of features.
We propose KNNOR-Regression (Section 3.2) for low population data. For data with a high population, we advocate KNNOR Deep Regression which has two flavors. For high-
population data with a high number of features, we use Target Aware AutoEncoders (Section 3.3.1), and for high-population data with a low number of features, we use Auto
Inflaters (Section 3.3.2). Finally, for Image datasets, we use a combination of Convolution and Fully Connected AutoEncoders called Multi-Level AutoEncoders (Section 3.4).
Table 1
Numerical datasets used in comparison.
Data set      Instances   Features   Relevance threshold   Rare   Rare (fraction)   Type of extreme
ANACALT       4052        7          0.8                   835    0.21              lower
bank8FM       4499        8          0.8                   285    0.06              upper
baseball      337         16         0.5                   50     0.15              upper
boston        506         13         0.8                   113    0.22              upper
compactiv     8192        21         0.8                   713    0.09              lower
concrete      1030        8          0.8                   52     0.05              upper
cpuSm         8192        12         0.8                   713    0.09              lower
ele-1         495         2          0.8                   43     0.09              upper
ele-2         1056        4          0.8                   110    0.1               upper
forestFires   517         12         0.8                   78     0.15              upper
friedman      1200        5          0.5                   48     0.04              upper
laser         993         4          0.8                   75     0.08              upper
machineCPU    209         6          0.8                   31     0.15              upper
mortgage      1049        15         0.8                   106    0.1               upper
quake         2178        3          0.8                   118    0.05              upper
stock         950         9          0.5                   63     0.07              upper
treasury      1049        15         0.8                   109    0.1               upper
wankara       321         9          0.5                   31     0.1               lower
Table 2
High dimension (image) datasets used in comparison.
Data set    Instances   Features   Rare   Rare (fraction)   Type of extreme
AgeDB       16,488      64 × 64    494    0.03              upper
IMDB-WIKI   213,553     64 × 64    2721   0.012             upper
subsequently applied to the training set of extracted features, creating additional artificial data points belonging to the rare class. The augmented features were then passed through the target-aware autoencoder to generate the corresponding target
Table 3
Hyperparameters of regressors.
Regressor                      Hyperparameters
Support Vector regression      kernel = ‘rbf’, degree = 3, gamma = ‘scale’, coef0 = 0.0, tol = 0.001, C = 1.0
Random Forest regression       n_estimators = 100, criterion = ‘squared_error’, max_depth = None, min_samples_split = 2, min_samples_leaf = 1
Gradient Boosting regression   loss = ‘squared_error’, learning_rate = 0.1, n_estimators = 100, subsample = 1.0, criterion = ‘friedman_mse’, min_samples_split = 2, min_samples_leaf = 1
Fully Connected Network        hidden_layer_sizes = (100,), activation = ‘relu’, solver = ‘adam’, alpha = 0.0001, batch_size = ‘auto’, learning_rate = ‘constant’, learning_rate_init = 0.001
where $\hat{y}_i$ is the predicted value and $y_i$ is the actual target value. The value $y_{max}$ is the maximum target value in the test set, and $n$ is the number of samples in the test set. The utility of the error metric is illustrated in Table 4, which shows the difference in error percentage even when the absolute differences are the same. Predicting 1.1 against a true value of 0.1 is worse than predicting 101 against a true value of 100. To expose this difference, we divide the absolute difference by the maximum true value in the test set. This helps magnify the error at lower values compared to the same error at higher values.

4.1.2. Augmentation framework

For each train–test split of every dataset, we conduct evaluations with varying levels of augmentation employing different regression techniques. The initial evaluation is conducted without any augmentation, and subsequent assessments involve the application of both state-of-the-art and our proposed augmentation methods.

4.2. Results

Tables 5, 6, and 7 provide the results of the conducted comparisons. Table 5 presents the mean of the best performance achieved by various regressors on different datasets, with the max-relative-score serving as the error metric. In Table 6, the mean Root Mean Squared Error (RMSE) score on different datasets is presented. Table 7 focuses specifically on the RMSE score for image datasets. Each column in these tables represents a different technique employed in the experiment, while the rows correspond to the various regression algorithms utilized. It is important to mention that in Table 6, the AutoInflaters have been trained with either the exponential or RMSE loss, corresponding to the columns in Table 5. In Tables 5 and 6, the columns are structured as follows.

• Column 1: Different Regressors used;
Table 5
Relative-max-error scores for different datasets.
Regression algorithm  No oversampling  SMOTE regression  Type I    Type II   Type III  Type IV   Type V    Type VI   Type VII
RFR                   1.1749           1.1397            1.1212    1.1263    1.135     1.2892    1.2589    1.2844    1.3244
GBR                   1.1479           1.1184            1.0924    1.1054    1.1055    1.3297    1.2951    1.3291    1.3526
SVR                   1.7765           1.7095            1.662     1.6663    1.6647    1.629     1.6316    1.6069    1.6902
LR                    2.0867           2.1257            2.1056    2.1047    2.1114    1.6612    0.4964    1.6502    1.5737
FCN                   0.3951           0.4886            0.4866    0.3883    0.3915    0.5244    0.5665    0.5187    0.5676
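The error reported in Table 5 is described in Section 4.1.1 as the absolute difference divided by the maximum true value in the test set. The sketch below follows that description literally and averages over the $n$ test samples; the averaging, the function name `max_relative_error`, and the absence of any percentage scaling are assumptions, since the defining equation is not reproduced in this text.

```python
def max_relative_error(y_true, y_pred):
    """Mean absolute error scaled by the largest true target in the test set."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    y_max = max(y_true)
    if y_max == 0:
        raise ValueError("the maximum true target must be non-zero")
    total = sum(abs(p - t) for t, p in zip(y_true, y_pred))
    return total / (len(y_true) * y_max)


if __name__ == "__main__":
    print(max_relative_error([0.1, 100.0], [1.1, 101.0]))
```

Under this reading the normalizer is shared by all samples of a test set, so scores are comparable across datasets whose targets live on different scales.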
Table 6
RMSE scores for different datasets.
Regression algorithm  No oversampling  SMOTE regression  Type I    Type II   Type III  Type IV   Type V    Type VI   Type VII
RFR                   0.0085           0.0075            0.0071    0.0074    0.0073    0.0097    0.0095    0.0097    0.0091
GBR                   0.0078           0.0074            0.0065    0.0065    0.0064    0.0098    0.0087    0.0097    0.0092
SVR                   0.013            0.0126            0.0119    0.0118    0.0117    0.0117    0.0117    0.0115    0.0116
LR                    0.0122           0.0121            0.0118    0.0119    0.0118    0.011     0.0092    0.0107    0.0095
FCN                   0.0066           0.006             0.0058    0.0064    0.0064    0.0121    0.0126    0.0121    0.0125
Fig. 27. Ranks of different methods on all datasets when the error metric was max-relative-error. Lines represent the different methods explained in Section 4.2. KNNOR-Regression, denoted by the green line, is the best-performing method, having ranked 1st the largest number of times.
Fig. 28. Ranks of different methods on all datasets when the error metric was RMSE. Lines represent the different methods explained in Section 4.2. KNNOR-DR-I is the best-performing method, followed by KNNOR-Regression, as they rank 1st the largest number of times.
Table 9
Models used for extensive testing. All of them were run through the PyCaret library in Python, which provides standard implementations of all the models (Ali, 2020).
Models                    Models                         Models
AdaBoost Regressor        Extreme Gradient Boosting      Least Angle Regression
Bayesian Ridge            Gradient Boosting Regressor    Light Gradient Boosting Machine
Decision Tree Regressor   Huber Regressor                Linear Regression
Dummy Regressor           K Neighbors Regressor          Orthogonal Matching Pursuit
Elastic Net               Lasso Least Angle Regression   Passive Aggressive Regressor
Extra Trees Regressor     Lasso Regression               Random Forest Regressor
4.4. Testing with additional models

In order to cement our claims, we have added a host of models to train on augmented as well as non-augmented data and then compare their performance on the test data. The datasets have been already mentioned in Table 1 and the models used are mentioned in Table 9. Default hyper-parameters were used for each model as mentioned in Ali (2020).

We conducted a comprehensive evaluation by applying each model to both augmented and non-augmented datasets. To assess their performance rigorously, we employed a set of six essential metrics: MAE, MSE, RMSE, R-squared (R2), Root Mean Squared Logarithmic Error (RMSLE), and Mean Absolute Percentage Error (MAPE). These metrics have been defined in Section 4.3 and provide a well-rounded perspective on how well the models make predictions.

Our analysis involved calculating how many times the models trained on augmented data outperformed those trained on non-augmented data across all six metrics. This cumulative count provides a consolidated measure of the augmentation’s impact on model performance. By considering multiple metrics and aggregating the results, we gain a holistic understanding of the benefits of using augmented data, offering a robust assessment of its efficacy in enhancing predictive models. A total of 1368 experiments were made, for 18 models on 18 datasets. Augmentation was performed for 3, 4 and 5 neighbors and the best result was selected.

The results in Table 10 provide a comparative overview of 18 regression models evaluated on both augmented and original imbalanced data. Following is a detailed summary of the findings:

1. Passive Aggressive Regressor, Linear Regression, and Light Gradient Boosting Machine are the top-performing models, with wins in approximately 83.33% of cases.
2. Ridge Regression and Bayesian Ridge follow closely, with wins in about 82.29% of cases.
3. Decision Tree Regressor, Random Forest Regressor, and AdaBoost Regressor also demonstrate strong performance, with wins in 79.17% to 75.15% of cases.
4. Several models, including Extreme Gradient Boosting, Least Angle Regression, and Lasso Regression, exhibit wins in approximately 75% of cases.
5. Models like Extra Trees Regressor, Huber Regressor, and Elastic Net have wins in the range of 68.75% to 66.67% of cases.
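The win percentages above come from counting how often the model trained on augmented data beats the one trained on the original data across metrics. A minimal sketch of such a count is given below; the function names and the tuple layout are hypothetical, ties are counted as no win (an assumption), and R2 is treated as the only metric where a higher score is better.

```python
HIGHER_IS_BETTER = {"R2"}  # every other metric here is an error: lower is better


def augmented_wins(metric, augmented_score, baseline_score):
    """True when the augmented model is strictly better on this metric."""
    if metric in HIGHER_IS_BETTER:
        return augmented_score > baseline_score
    return augmented_score < baseline_score


def success_rate(results):
    """results: iterable of (metric, augmented_score, baseline_score) tuples."""
    results = list(results)
    if not results:
        raise ValueError("no comparisons supplied")
    wins = sum(augmented_wins(m, a, b) for m, a, b in results)
    return 100.0 * wins / len(results)


if __name__ == "__main__":
    # Made-up scores for one model on one dataset, for illustration only.
    demo = [("MAE", 0.9, 1.0), ("MSE", 1.2, 1.0), ("R2", 0.8, 0.7), ("RMSE", 0.5, 0.6)]
    print(success_rate(demo))
```

Pooling the tuples of one model over all datasets and all six metrics gives a per-model percentage of the kind listed in the summary.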
Data availability

Links to the open-source codebase are provided in the paper.

Acknowledgments

Open Access funding provided by the Qatar National Library.

References

Agarap, A. F. (2018). Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375.
Ali, M. (2020). PyCaret: An open source, low-code machine learning library in Python. PyCaret version 1.0.
Barupal, D. K., & Fiehn, O. (2019). Generating the blood exposome database using a comprehensive text mining and database fusion approach. Environmental Health Perspectives, 127(9), 2825–2830.
Branco, P., Torgo, L., & Ribeiro, R. P. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR), 49(2), 1–50.
Branco, P., Torgo, L., & Ribeiro, R. P. (2019). Pre-processing approaches for imbalanced distributions in regression. Neurocomputing, 343, 76–99.
Camacho, L., Douzas, G., & Bacao, F. (2022). Geometric SMOTE for regression. Expert Systems with Applications, Article 116387.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Derrac, J., Garcia, S., Sanchez, L., & Herrera, F. (2015). Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17.
Kohler, M., & Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4), 2231–2249.
Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221–232.
Kubat, M., Matwin, S., et al. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In International conference on machine learning: vol. 97, (pp. 179–186). Morgan Kaufmann.
Laza, R., Pavón, R., Reboiro-Jato, M., & Fdez-Riverola, F. (2011). Evaluating the effect of unbalanced data in biomedical document classification. Journal of Integrative Bioinformatics, 8(3), 105–117.
Liu, N., Shen, J., Xu, M., Gan, D., Qi, E.-S., & Gao, B. (2018). Improved cost-sensitive support vector machine classifier for breast cancer diagnosis. Mathematical Problems in Engineering, 2018, 1–13.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 39(2), 539–550.
Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., & Zafeiriou, S. (2017). Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 51–59).
Natekin, A., & Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7, 21.
Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on international conference on machine learning (pp. 833–840). Omni Press.
Rothe, R., Timofte, R., & Van Gool, L. (2018). Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2–4), 144–157.
Segal, M. R. (2004). Machine learning benchmarks and random forest regression. eScholarship.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A
dos Santos Coelho, L., Hultmann Ayala, H. V., & Cocco Mariani, V. (2024). CO and review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04),
NOx emissions prediction in gas turbine using a novel modeling pipeline based on 687–719.
the combination of deep forest regressor and feature engineering. Fuel, 355, Article Thanathamathee, P., & Lursinsap, C. (2013). Handling imbalanced data sets with
129366. synthetic boundary data generation using bootstrap re-sampling and AdaBoost
Douzas, G., & Bacao, F. (2019). Geometric SMOTE a geometrically enhanced drop-in techniques. Pattern Recognition Letters, 34(12), 1339–1347.
replacement for SMOTE. Information Sciences, 501, 118–135. Torgo, L., Branco, P., Ribeiro, R. P., & Pfahringer, B. (2015). Resampling strategies for
Elhassan, T., & Aljurf, M. (2016). Classification of imbalance data using tomek link regression. Expert Systems, 32(3), 465–476.
(t-link) combined with random under-sampling (rus) as a data reduction method. Torgo, L., & Ribeiro, R. (2007). Utility-based regression. In PKDD 2007: 11th European
Global Journal of Technolology and Optimization S, 1, 2016. conference on principles and practice of knowledge discovery in databases: vol. 7, (pp.
Elor, Y., & Averbuch-Elor, H. (2022). To SMOTE, or not to SMOTE? arXiv preprint 597–604). Springer.
arXiv:2201.08528. Torgo, L., Ribeiro, R. P., Pfahringer, B., & Branco, P. (2013). Smote for regression.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). In Progress in artificial intelligence: 16th portuguese conference on artificial intelligence
Learning from imbalanced data sets: vol. 10, Springer. (pp. 378–389). Springer.
Gan, D., Shen, J., An, B., Xu, M., & Liu, N. (2020). Integrating TANBN with Tunçay, T., Alaboz, P., Dengiz, O., & Başkan, O. g. (2023). Application of regression
cost sensitive classification algorithm for imbalanced data in medical diagnosis. kriging and machine learning methods to estimate soil moisture constants in a
Computers & Industrial Engineering, 140, Article 106266. semi-arid terrestrial area. Computers and Electronics in Agriculture, 212, Article
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). 108118.
Learning from class-imbalanced data: Review of methods and applications. Expert Vapnik, V., & Vapnik, V. (1998). Statistical learning theory wiley. New York, 1(624),
Systems with Applications, 73, 220–239. 2.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Wang, Y., Yao, H., & Zhao, S. (2016). Auto-encoder based dimensionality reduction.
Knowledge and Data Engineering, 21(9), 1263–1284. Neurocomputing, 184, 232–242.
Islam, A., & Belhaouari, S. B. (2021). Class aware auto encoders for better feature Yang, Y., Zha, K., Chen, Y., Wang, H., & Katabi, D. (2021). Delving into deep
extraction. In 3rd International conference on electrical, communication, and computer imbalanced regression. In Proceedings of the 38th international conference on machine
engineering (pp. 1–5). IEEE. learning (pp. 11842–11851). MLR Press.
Islam, A., Belhaouari, S. B., Rehman, A. U., & Bensmail, H. (2022a). K nearest neighbor Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010). Deconvolutional
OveRsampling approach: An open source python package for data augmentation. networks. In 2010 IEEE computer society conference on computer vision and pattern
Software Impacts, 12, Article 100272. recognition (pp. 2528–2535). IEEE.
Islam, A., Belhaouari, S. B., Rehman, A. U., & Bensmail, H. (2022b). KNNOR: An Zhong, J., He, Z., Guan, K., & Jiang, T. (2023). Investigation on regression model
oversampling technique for imbalanced datasets. Applied Soft Computing, 115, for the force of small punch test using machine learning. International Journal of
Article 108288. Pressure Vessels and Piping, 206, Article 105031.
Johnson, J. M., & Khoshgoftaar, T. M. (2019). Survey on deep learning with class
imbalance. Journal of Big Data, 6(1), 1–54.
Juez-Gil, M., Arnaiz-González, Á., Rodríguez, J. J., & García-Osorio, C. (2021).
Experimental evaluation of ensemble classifiers for imbalance in big data. Applied
Soft Computing, 108, Article 107447.
19