
Expert Systems With Applications 252 (2024) 124118


Oversampling techniques for imbalanced data in regression


Samir Brahim Belhaouari a , Ashhadul Islam a , Khelil Kassoul b ,∗, Ala Al-Fuqaha a ,
Abdesselam Bouzerdoum a,c
a Division of Information and Computing Technology, Hamad Bin Khalifa University, Qatar
b Geneva School of Business Administration, University of Applied Sciences Western Switzerland, HES-SO, 1227 Geneva, Switzerland
c School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, Australia

ARTICLE INFO

Keywords:
Data augmentation
Machine learning
AutoInflaters
Nearest neighbor
Imbalanced data

ABSTRACT

Our study addresses the challenge of imbalanced regression data in Machine Learning (ML) by introducing tailored methods for different data structures. We adapt K-Nearest Neighbor Oversampling-Regression (KNNOR-Reg), originally for imbalanced classification, to address imbalanced regression in low population datasets, evolving to KNNOR-Deep Regression (KNNOR-DeepReg) for high-population datasets. For tabular data, we also present the Auto-Inflater neural network, utilizing an exponential loss function for Autoencoders. For image datasets, we employ Multi-Level Autoencoders, consisting of Convolutional and Fully Connected Autoencoders. For such high-dimension data our approach outperforms the Synthetic Minority Oversampling Technique for Regression (SMOTER) algorithm for the IMDB-WIKI and AgeDB image datasets. For tabular data we conducted a comprehensive experiment using various models trained on both augmented and non-augmented datasets, followed by performance comparisons on test data. The outcomes revealed a positive impact of data augmentation, with a success rate of 83.75% for Light Gradient Boosting Method (LightGBM) and 71.57% for the 18 other regressors employed in the study. This success rate is determined by the frequency of instances where models performed better when augmented data was used compared to instances with no augmentation. Access to the comparative code can be found in GitHub.

1. Introduction

The effectiveness of conventional machine learning (ML) techniques greatly depends on the underlying data distribution they are trained on. In scenarios involving classification or regression, a model trained on an imbalanced dataset can exhibit a bias towards the majority class (Gan, Shen, An, Xu, & Liu, 2020; Liu et al., 2018). This may result in seemingly high accuracy, while minority data points are frequently misclassified or mispredicted. Such a scenario can compromise the dependability of ML models, especially in critical domains like healthcare and finance, where rare, malignant, or suspicious data can hold substantial consequences. Recognizing a dataset as imbalanced depends on the specific problem, as illustrated in Fig. 1. This variability highlights the critical need to actively address data imbalance during machine learning model training. Neglecting this aspect can lead to skewed results and compromised model reliability, particularly in critical domains like healthcare and finance, where accurate predictions for rare or suspicious cases are crucial. Thus, acknowledging and mitigating data imbalance stands as a vital step in ensuring the robustness and effectiveness of machine learning applications.

In machine learning, models trained on imbalanced datasets can exhibit a bias towards the majority class, which can impact the reliability of predictions in critical areas such as healthcare and finance. Imbalanced datasets can be found in both classification and regression problems (Fernández et al., 2018). In classification, an imbalanced dataset can have a disproportionate representation of categories, with some categories having fewer samples than others, creating an imbalanced binary or multi-class problem. Binary classification is the primary focus of research in imbalanced learning, but imbalanced data can also arise in regression tasks (Sun, Wong, & Kamel, 2009). In regression, the dependent variable is a continuous value, and the imbalance occurs when a specific interval of the target variable has a reduced representation in the dataset (Branco, Torgo, & Ribeiro, 2016). The imbalanced regression problem is challenging as it requires the model not only to create artificial minority points but also to predict the dependent value for each new data point. Oversampling is a common technique used to address this problem. Fig. 2 illustrates oversampling in imbalanced regression problems, where the target values are used

∗ Corresponding author.
E-mail addresses: [email protected] (S.B. Belhaouari), [email protected] (A. Islam), [email protected] (K. Kassoul),
[email protected] (A. Al-Fuqaha), [email protected] (A. Bouzerdoum).

https://doi.org/10.1016/j.eswa.2024.124118
Received 31 August 2023; Received in revised form 3 March 2024; Accepted 24 April 2024
Available online 20 May 2024
0957-4174/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Fig. 1. Types of Imbalance. In this paper we are focusing on the continuous data sets where the target values are skewed to one side.

Fig. 2. Oversampling regression on Imbalanced data. a. Imbalanced data set. b. Dependent value distribution. The values in green correspond to the majority values, while the red bars represent the frequency of the values on the minority dataset. The process of augmentation has two parts. c. Adding more data points to the minority set. d. Setting target values corresponding to the newly created features. The blue section of the histogram in d shows the increased frequency of values between 0.5 and 0.6 due to the creation of new data points and corresponding target values.

Fig. 3. Domain and Co-Domain relationship in imbalanced regression. The non-ordered set of features is represented on the left. The right side shows the histogram of the target values, increasing vertically upwards. As the regression relationship between the features is unknown, we need first to generate new examples of features, ensuring that it represents the minority data set. Then we compute the possible target value corresponding to the newly created feature.

Fig. 4. Different approaches presented in this paper according to the type of dataset under scrutiny. Regarding Tabular data, we offer different methods depending on the population and number of features. We propose KNNOR-Regression (Section 3.2) for low population data. For data with a high population, we advocate KNNOR Deep Regression which has two flavors. For high-population data with a high number of features, we use Target Aware Autoencoders (Section 3.3.1), and for high-population data with a low number of features, we use AutoInflaters (Section 3.3.2). Finally, for Image datasets, we use a combination of Convolution and Fully Connected Autoencoders called Multi-Level Autoencoders (Section 3.4). We define high population data in terms of its volume wherein the entire data cannot be stored in a single machine (Juez-Gil, Arnaiz-González, Rodríguez, & García-Osorio, 2021).

to identify minority data points. The imbalanced regression dataset is visually depicted in Fig. 2a, where the majority and minority data points are represented in green and red, respectively. A histogram in Fig. 2b displays the distribution of dependent values, where the green bars represent the majority values, and the red bars indicate the frequency of minority values. Typically, the identification of imbalanced regression data begins with examining the target values and labeling the corresponding data points as minority.

In Fig. 2b, the histogram illustrates that target values falling within the range of 0 to 0.5 (highlighted in green) exhibit a higher frequency compared to the range from 0.5 to 0.6 (highlighted in red), which displays a lower frequency. Fig. 2a represents the independent variables, primarily constituting the majority dataset depicted in green, whereas the minority dataset is represented by red scatter points in Fig. 2a. Notably, these red scatter points correspond to the red segment of the histogram in Fig. 2b. The process of augmentation yields two key outcomes. The first outcome involves generating additional data points that resemble those in the minority dataset, symbolized by the blue points in Fig. 2c. The second outcome encompasses the target values linked to these newly generated data points, as depicted in the blue section of the histogram in Fig. 2d. This augmentation process serves to increase the frequency of values falling within the 0.5 to 0.6 range.

The process is further explained in Fig. 3, where a domain-range relationship is depicted. The left side shows the non-ordered set of features $\hat{X}$, while the target values $y$ are represented on the right in increasing order of their frequencies. The research question is to generate a new representation of $\hat{X}$ that increases the representation of the minority dataset, and then map the new X-value to a y-value that falls within the range of lower frequencies. This requires an approximation of the regression algorithm only for the minority or rare data, achieved by creating specialized functions using simple statistical methods as well as deep neural networks, which are elaborated on in the following pages.

The imbalanced regression problem is apparent in many real-world tasks like medical applications where the different health metrics like blood pressure, heart rate, and Oxygen saturation are continuous,

and their distribution is often skewed across the patient population. Other industries like finance, meteorology, and fault diagnosis are also plagued with imbalanced regression problems (Krawczyk, 2016). This article proposes several novel methods of oversampling Imbalanced Regression Data. The advantages of the proposed techniques in this study, when compared to those identified in the existing literature, can be summarized as follows: Firstly, we extend the K Nearest Neighbor Oversampling (KNNOR) method (Islam, Belhaouari, Rehman, & Bensmail, 2022b) to KNNOR-Regression (KNNOR-Reg), enabling the generation of target values for imbalanced regression problems. Additionally, we expand upon Class Aware Autoencoders by introducing Target Aware Auto Encoders, which are designed for estimating target values for new features. We also introduce a novel architecture known as Target Aware AutoInflaters, serving to extract features from low-dimensional data. Furthermore, our study involves the development of Multi-level Auto Encoders, which are adept at extracting features from images and generating new features and target values. To enhance prediction accuracy, we employ an exponential loss function within the AutoInflater, effectively highlighting the differences between predicted and actual target values. Lastly, our approach incorporates the use of the maximum test target value as a normalizer for calculating regression loss, providing a comprehensive and effective framework for addressing the identified challenges.

The contributions of this paper are summarized as below.

• Extending the KNNOR method to KNNOR-Regression (KNNOR-Reg) to generate target values for imbalanced regression problems (see Section 3.2);
• Extending Class Aware Autoencoders to Target Aware Auto Encoders for estimating target values for new features (see Section 3.3.1);
• Introducing a novel architecture called Target Aware AutoInflaters to extract features from low dimensional data (see Section 3.3);
• Developing Multi-level Auto Encoders for extracting features from images and creating new features and target values (see Section 3.4);
• Using an exponential loss function within the AutoInflater to better highlight the difference between predicted and actual target values (see Section 3.3.3);
• Calculating regression loss using the maximum test target value as a normalizer (see Section 4.1.1).

The paper is structured as follows: In Section 2, we introduce the issue of imbalanced linear regression, explain its significance, and provide a review of existing literature on the subject. In Section 3, we present our solutions to address this problem, which are categorized into two frameworks based on dataset size. For smaller datasets, we introduce an extended version of KNNOR, while for larger datasets, we propose a novel AutoEncoders implementation. Fig. 4 offers an overview of our proposed methods, organized by data type and structure. For tabular data, we propose different methods, such as KNNOR Regression (Section 3.2) and KNNOR DeepRegression with target-aware Auto Encoders/Inflaters (Sections 3.3.1 and 3.3.2), depending on dataset size and feature count. In the case of image datasets, we suggest a Multi-Level AutoEncoder scheme (Section 3.4). In Section 4, we outline the experimental design, where we assess the effectiveness of our methods on well-known imbalanced regression datasets and present the results and subsequent discussion. Section 5 encompasses the conclusion and outlines future work based on the results.

2. Presentation of the problem and related work

In this section, we begin by introducing the concept of imbalanced linear regression, elucidating its importance, and conducting a comprehensive examination of the existing body of literature pertaining to this topic.

2.1. Problem description

In a standard regression problem, the goal is to predict continuous values based on a set of examples and their corresponding target values. However, in the case of an imbalanced regression problem, the data distribution may be skewed, with only a few examples having target values within a specific range of interest, while the majority of examples have target values in a different range. This imbalance can result in a biased model that tends to predict values within the range of the majority data. This bias occurs because most regression algorithms are designed to minimize the average error across all data points (Gan et al., 2020; Liu et al., 2018). As a consequence, the model's accuracy is typically higher for the majority class but lower for the minority class. This can be misleading when assessing the overall model accuracy. In fact, a classifier or regressor may predict the entire dataset to belong to the majority class or within the majority range, thus getting the minority class or rare data wrong, while still achieving a seemingly high accuracy due to the averaging effect. This phenomenon is important to consider when evaluating the performance of models in imbalanced regression scenarios (Fernández et al., 2018; Gan et al., 2020; Liu et al., 2018).

In the context of imbalanced regression, it is necessary to define the relevant terms. Let $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ denote the set of training data, where $\mathbf{x}_i \in \mathbb{R}^d$ represents the input features and $y_i \in \mathbb{R}$ represents the dependent value, which is continuous in nature. To further characterize the dependent value space, Branco, Torgo, and Ribeiro (2019) introduce a threshold value $t_r$ that divides the dataset into two complementary sets: the common data, represented by $D_N$, and the rare data, denoted by $D_R$, where $|y_i| < t_r$ indicates rare data and $|y_i| \geq t_r$ indicates common data. The imbalanced regression problem arises when the following conditions hold:

• Accurate prediction of $D_R$ is more crucial for determining the performance of the model;
• $D_R \ll D_N$, where $D_R$ and $D_N$ represent the cardinalities of the rare and common datasets, respectively.

Minority and Majority in Continuous data

In contrast to classification problems, labeling in regression problems can be more complex since the focus is on identifying rare events or valuable data points, such as fraudulent transactions, highly profitable stock market actions, or ecological catastrophes. Therefore, the identification of minority data points is of utmost importance. However, it is also essential to consider that misclassifications can have different costs. To address this, the utility theory is used to define a relevance function that assigns importance to each target value (Torgo & Ribeiro, 2007). The relevance function is a continuous, real-valued function that is dependent on the domain and maps each target value to a relevance scale. Eq. (1) defines a relevance function that takes into account the application-specific bias and maps each target value to a continuous scale of relevance ranging from 0 to 1, where 0 indicates minimum and 1 indicates maximum importance.

$\phi(Y): \mathcal{Y} \rightarrow [0, 1]$  (1)

To obtain the relevance function, we use the box and whisker plot of the target value, where the median value is assigned an importance value of 0 and the upper adjacent and all higher values are assigned an importance value of 1. Similarly, all lower adjacent values are assigned an importance value of 1. To interpolate between these importance values and obtain a smooth relevance function, we use a piece-wise cubic Hermite interpolation method (Camacho, Douzas, & Bacao, 2022).

The relevance values, calculated using the same method, are illustrated in Fig. 5. In Fig. 5a, the histogram represents the target values of the compactiv dataset, with the corresponding relevance values depicted below. Notably, there is a correlation between the relevance values and the frequency of the target values. The gap in the histogram is responsible for the discontinuity in the plot. Moving to Fig. 5b, it


Fig. 5. Relevance function for the a. compactiv and b.bank8FM data set. Figure a displays a histogram depicting the distribution of target values within the compactiv dataset,
with the corresponding relevance values presented below. It is worth noting that there is a noticeable correlation between these relevance values and the frequency of target
values. The interruptions in the plot can be attributed to gaps within the histogram. Figure b shows the histogram for the bank8FM dataset, along with the associated relevance
values displayed below it.

shows the histogram for the bank8FM dataset, along with the associated relevance values displayed beneath it. It is worth mentioning that the positioning of the extremes is a crucial factor in determining the relevance function. In this exercise, we considered datasets with either high (Fig. 5b) or low (Fig. 5a) extremes but not both. The red and blue sections in the lower half of the figure are also of significance. The threshold of importance is manually defined, with target values above this threshold considered critical. We adhere to established practices (Camacho et al., 2022; Torgo & Ribeiro, 2007) to partition the data into two subsets: $D_N$ and $D_R$. For each dataset, a user-defined coefficient is employed to determine the extent to which the whiskers extend from the interquartile range in the box plot of the dependent data. This threshold plays a pivotal role in segregating the data into rare ($D_R$) and common ($D_N$) subsets. We rely on the methodology proposed by Camacho et al. (2022) to derive these thresholds. Once the data has been categorized into these two segments, our algorithm can be applied. However, before delving into the specifics of our proposed methods, it is essential to review recent work in the field of imbalanced regression.

Fig. 6. SMOTE — the fundamental augmentation algorithm. $x_i^{new}$ is the generated point at a random distance between $x_i$ and $x_i^{nearest}$.
𝑖 .

2.2. Literature review


In a comprehensive exploration of machine learning applications, particular emphasis is placed on addressing imbalanced data augmentation challenges. One study investigates the use of augmentation techniques to balance data, especially in the context of carbon oxides (CO) and nitrogen oxides (NOx) emissions prediction from a gas turbine (dos Santos Coelho, Hultmann Ayala, & Cocco Mariani, 2024). The research unveils the importance of hyperparameter tuning and feature engineering, particularly with the Deep Forest Regression (DFR) model, in enhancing predictive accuracy for these emissions. Additionally, the study delves into the small punch test (SPT) and employs machine learning to establish the correlation between SPT forces and material properties, offering new insights into predicting material tensile properties using SPT (Zhong, He, Guan, & Jiang, 2023). In a soil-related investigation, machine learning algorithms are harnessed to estimate soil properties, with artificial neural networks (ANN) emerging as the most effective predictor (Tunçay, Alaboz, Dengiz, & Başkan, 2023). These studies collectively underscore the significance of machine learning in addressing imbalanced data challenges and optimizing predictive capabilities across diverse domains.

In the realm of addressing imbalanced data in classification, particularly in the context of oversampling, there are primarily three fundamental approaches. These approaches are aimed at mitigating imbalanced data issues, particularly when dealing with classification


Fig. 7. Application of SMOTER to generate new point as well as target value.

Fig. 8. Simulated dataset — before augmentation. $p_0$ is the source point from where augmentation will start. $p_1$, $p_2$ and $p_3$ are its 3 nearest neighbors in increasing order of distance.

problems: (1) Data-Driven Techniques, as described by Laza, Pavón, Reboiro-Jato, and Fdez-Riverola (2011), involve sampling-based methods that focus on adjusting the distribution of each category within the dataset. Data-driven techniques are more universally applicable and widely employed due to their adaptability and effectiveness; (2) Algorithm-Based Methods, as exemplified by Elhassan and Aljurf (2016), Kubat et al. (1997), Thanathamathee and Lursinsap (2013), entail modifications to the training algorithms used in the classification process. These adjustments are specific to the classifiers employed and may not be as common as data-driven techniques. However, they can prove to be highly effective in specific cases; (3) Hybrid Approaches, as proposed by Johnson and Khoshgoftaar (2019), combine elements of both data-driven and algorithm-based techniques. This approach seeks to leverage the advantages of each method and is gaining attention for its potential to deliver comprehensive solutions to imbalanced data challenges.

The choice of which approach to employ depends on the specific characteristics of the dataset and the problem at hand. Data-driven techniques are often favored for their broad applicability, but algorithm-based methods and hybrid approaches can be valuable in situations where tailored adjustments to classification algorithms are required. Data-driven techniques for handling imbalanced classification problems often use oversampling or undersampling methods (He & Garcia, 2009; Liu, Wu, & Zhou, 2008). A more intelligent approach is the Synthetic Minority Oversampling Technique (SMOTE) (Chawla, Bowyer, Hall, & Kegelmeyer, 2002), which creates new data points belonging to the minority class by using existing minority data. As illustrated in Fig. 6, this algorithm selects a random minority point and places a new one at a random distance between the chosen point and one of its closest neighbors. Mathematically, SMOTE generates an artificial point $x_i^{new}$ according to Eq. (2):

$x_i^{new} = x_i + (x_i^{nearest} - x_i) \cdot \alpha_i$  (2)

where $x_i$ is a minority data point, its nearest neighbor of the same class is $x_i^{nearest}$, and $\alpha_i$ is an independent and identically distributed random number uniformly distributed on $[0, 1]$.

2.2.1. SMOTE based imbalanced regression

A fundamental solution to address imbalanced regression is the use of random under-sampling and the Synthetic Minority Oversampling Technique for Regression (SMOTER) (Torgo, Branco, Ribeiro, & Pfahringer, 2015). Under-sampling involves randomly selecting data points with relevance values below a specified threshold and removing them from the dataset. SMOTER is an extension of SMOTE (Chawla et al., 2002) that generates new data points for the minority class by using existing data. As shown in Fig. 6, the algorithm first selects a random minority point and then places a new point at a random distance between the chosen point and one of its closest neighbors. SMOTER then calculates the possible target value for the new point based on its distance to its parent points, as illustrated in Fig. 7. The new target value $y_i^{new}$ is computed using Eq. (3):

$y_i^{new} = \dfrac{\frac{y_i}{d_1} + \frac{y_i^{nearest}}{d_2}}{\frac{1}{d_1} + \frac{1}{d_2}}$  (3)

where $y_i$ is the target value of the $x_i$ data point, its nearest neighbor is $x_i^{nearest}$ whose target value is $y_i^{nearest}$, and $d_1$, $d_2$ are the distances from $x_i^{new}$ to $x_i$ and $x_i^{nearest}$, respectively. The target value of the generated point $x_i^{new}$ is proportional to the Euclidean distance of the new point from the parent points. Branco et al. (2019) propose three approaches to address imbalanced datasets: random oversampling, the introduction of Gaussian Noise, and WERCS. Random oversampling involves creating copies of the rare class data points to add balance to the dataset. Gaussian Noise is used to introduce small variations in the feature set and target values to generate new data points with dependent values. The WERCS technique combines over and undersampling by assigning a probability of duplication or deletion based on the relevance value of each data point, determined by a user-defined threshold. More recently, the Geometric SMOTE (G-SMOTE) (Douzas & Bacao, 2019) approach has been used in regression (Camacho et al., 2022). In G-SMOTE, data points are classified as rare or common based on the target value, and new data points are generated using the G-SMOTE method. The label of the new data point is the weighted average of the target values of the two instances used to create the new point (Camacho et al., 2022). In this paper, we use a recently published approach that enhances SMOTE with the KNNOR approach (Islam et al., 2022b). The KNNOR algorithm combines SMOTE with KNN to generate new synthetic data points and addresses the issue of the oversampling of noisy and ambiguous instances.
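As a concrete reading of Eqs. (2) and (3), the following NumPy sketch generates one SMOTER-style synthetic sample: the features are interpolated between a rare point and one of its nearest rare neighbors, and the target is the inverse-distance weighted average of the two parent targets. The function name smoter_point and the toy values are illustrative assumptions, not code from the authors' repository.

```python
import numpy as np

def smoter_point(x_i, y_i, x_nn, y_nn, rng=np.random.default_rng()):
    """Generate one synthetic sample between a rare point and its neighbor.

    Feature interpolation follows Eq. (2); the target follows the
    inverse-distance weighting of Eq. (3).
    """
    alpha = rng.uniform(0.0, 1.0)
    x_new = x_i + (x_nn - x_i) * alpha            # Eq. (2)

    # Distances from the synthetic point to its two parents.
    d1 = np.linalg.norm(x_new - x_i)
    d2 = np.linalg.norm(x_new - x_nn)
    if d1 == 0 or d2 == 0:                        # degenerate case: copy a parent
        return x_new, (y_i if d1 == 0 else y_nn)

    y_new = (y_i / d1 + y_nn / d2) / (1 / d1 + 1 / d2)   # Eq. (3)
    return x_new, y_new

# Example: one rare point and its nearest rare neighbor.
x_i, y_i = np.array([0.2, 0.9]), 0.55
x_nn, y_nn = np.array([0.3, 1.1]), 0.60
print(smoter_point(x_i, y_i, x_nn, y_nn))
```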
2.2.2. K Nearest Neighbor OveRsampling approach (KNNOR)

KNNOR has been proposed as a solution to the challenge of class imbalance in classification tasks. Compared to the popular SMOTE algorithm, KNNOR addresses issues such as noisy data, small disjuncts, and within-class imbalances, as demonstrated by Islam and Belhaouari (2021). One of the key features of KNNOR is its novel filtering method, which helps to identify minority data points that better represent the population.

To generate new synthetic data points, KNNOR uses multiple nearest neighbor points of these crucial minority points. The process of creating an artificial data point begins by selecting one of the crucial


minority points, denoted as $x_0$, and repeating the following steps for each of its $k$ nearest neighbors:

• generate a random point on a line between the start point and the next closest neighbor;
• make the generated point the new start point.

Fig. 9. The iterative process of creation of an augmented point $m_2$ with the help of three neighbors ($p_1$, $p_2$ and $p_3$), starting with $p_0$.

Fig. 8 shows an artificial imbalanced dataset and Fig. 9 gives a pictorial representation of the process of augmentation using three neighboring points. In a general case where $k$ neighbors are used to create an artificial point, the process can be represented by the following. $\forall i \in [0, 1, \ldots, k]$ the sequence is defined using Eq. (4):

$x_{i+1}^{new} = x_i^{new} + (p_i - x_i^{new}) \cdot \alpha_i$  (4)

where $x_0^{new}$ is any safe point in the minority class, $p_i$ is the $i$th nearest neighbor of $x_0^{new}$, and $\alpha_i$ is a uniform random variable over $[0, M]$, where $M$ is any positive value less than 1.

At each iteration of the process, the new point generated at the preceding step becomes the starting point. A new point is synthesized at a distance $random(0, M)$ on the straight line joining the starting point and the $(i+1)$th nearest neighbor of the origin point that started the exercise. The synthetic point obtained after the last iteration is considered the new augmented data point for the entire process.
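The iterative rule of Eq. (4) can be sketched in a few lines of NumPy, shown below. The function knnor_point and the toy coordinates are our own illustration; the full KNNOR algorithm additionally filters for safe minority points before this step, which is omitted here.

```python
import numpy as np

def knnor_point(x0, neighbors, M=0.5, rng=np.random.default_rng()):
    """Iteratively build one synthetic point following Eq. (4).

    Starting from a safe minority point x0, the candidate is pulled a random
    fraction alpha_i in [0, M] toward each of the k nearest neighbors in turn;
    the point left after the last step is the new augmented sample.
    """
    x_new = np.asarray(x0, dtype=float)
    for p_i in neighbors:                                   # k nearest neighbors of x0
        alpha_i = rng.uniform(0.0, M)
        x_new = x_new + (np.asarray(p_i) - x_new) * alpha_i  # Eq. (4)
    return x_new

# Example with three neighbors, as in Fig. 9.
x0 = [0.10, 0.20]
neighbors = [[0.15, 0.22], [0.12, 0.30], [0.20, 0.25]]
print(knnor_point(x0, neighbors, M=0.5))
```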
2.2.3. Deep learning in imbalanced regression

In addition to statistical techniques, this paper proposes a deep learning approach, which is crucial to understanding imbalanced regression using deep neural networks. Neural networks are particularly useful for high-dimensional datasets, such as images, and have been successfully applied in various areas, such as age detection from facial images, weather prediction, electricity consumption estimation, and autonomous vehicle trajectory projection, where rare data is present at the extremes. Recent research in imbalanced regression using deep neural networks has resulted in notable works, such as Label and Feature Distribution Smoothing (LDS and FDS, respectively) (Yang, Zha, Chen, Wang, & Katabi, 2021), where different kernel functions are applied to the labels and features to create a more balanced data distribution. To improve performance, the loss functions are re-weighted by multiplying them with the inverse of the estimated LDS. A recent study by Sharan et al. (2023) focuses on the domain of probabilistic forecasting, addressing the challenge of predicting long-tailed rare data, as discussed by Menon (2020). The authors of this study introduce novel concepts related to moment-based tailedness measurements to improve predictions. They propose two loss functions: the Kurtosis loss, which assesses the fourth moment around the distribution mean and is symmetric, and the Pareto loss, which evaluates the right-tailedness of the distribution and is asymmetric. Notably, this paper stands out for its innovative approach, as it combines deep learning and statistical methods to create artificial samples with precise target values. To achieve this, the authors extend the capabilities of a specialized neural network model known as AutoEncoders, which is employed to extract relevant features and generate accurate target values.

2.2.4. AutoEncoders

Autoencoder neural networks have the ability to generate output features that match the input features. They are composed of three main components: the encoder, the bottleneck, and the decoder (Rifai, Vincent, Muller, Glorot, & Bengio, 2011). Due to their common usage in image data, autoencoders typically have high-dimensional input features that are reduced to the bottleneck size by the encoder. The decoder is then trained to reconstruct the initial output by minimizing a cost function. Fig. 10 provides a block diagram of an autoencoder that includes the Encoder, Bottleneck, and Decoder. Autoencoders can be fully connected or can include Convolution and De-convolution layers (Zeiler, Krishnan, Taylor, & Fergus, 2010), with the bottleneck typically being a fully connected layer that extracts a one-dimensional feature representation. Autoencoders are used to reduce high-dimensional datasets like images, making them more suitable for statistical methods (Wang, Yao, & Zhao, 2016). This work employs an innovative form of autoencoder known as the Class-Aware Autoencoder (Islam, Belhaouari, Rehman, & Bensmail, 2022a), which is further explained below.

Fig. 10. Auto Encoder Block Diagram.

Class Aware AutoEncoders

Autoencoders aim to minimize the difference between input and output data. However, class-aware autoencoders take this a step further by incorporating the class label information into the output data. This means that the output of the class-aware Autoencoder includes both a close approximation of the input feature set and the corresponding class label (Islam et al., 2022a). Fig. 11 illustrates the concept of a class-aware Autoencoder. To match the dimensions, a random or constant feature is added to the input data, and the output is then matched with the actual class label. This approach has been primarily used in labeled datasets for classification tasks. However, it can also be extended to regression data and applied to predict the target value for new data points by modifying the loss function, as described in Section 3.3.1.

3. Material and methods

In this section, we outline our strategies for addressing this issue, which are classified into two frameworks based on dataset size. For smaller datasets, we introduce an extended version of KNNOR, while for larger datasets, we propose a novel AutoEncoders implementation. Fig. 4 provides an overview of our proposed methods, categorized by data type and structure. Regarding tabular data, we propose various methods, including KNNOR Regression (Section 3.2)


Fig. 11. Class Aware Auto Encoder.

and KNNOR DeepRegression with target-aware AutoEncoders/Inflaters


(Sections 3.3.1 and 3.3.2), depending on dataset size and feature count.
For image datasets, we recommend a Multi-Level AutoEncoder scheme
(Section 3.4).

3.1. Methods

This paper introduces three innovative techniques for creating tar-


get variables that cater to the structure and size of the dataset.

• Method 1 (For low population data): We propose the KNNOR


approach with an additional step of calculating the target vari-
able, which we refer to as KNNOR-Regression (or KNNOR-Reg)
in Section 3.2.
• Method 2 (For high-population data): We present KNNOR Deep
Regression or KNNOR-DeepReg, which has two variations:

– Method 2a (For high-population and high-dimensional data): Combination of KNNOR with Target Aware AutoEncoder (Section 3.3.1).
– Method 2b (For high-population and low-dimensional data): Combination of KNNOR with Target Aware AutoInflater (Section 3.3.2).

• Method 3 (For image data): We propose a Multi-Level AutoEncoder scheme for imbalanced image regression problems, which we discuss in Section 3.4.

Fig. 12. The extension of KNNOR to estimate the target value using the distances of the new point $x^{new}$ to the points associated with its creation — starting with $x_0$ followed by $x_1$, $x_2$ and $x_3$.

3.2. KNNOR - Regression (KNNOR-Reg) - [Low population data]

Fig. 13. The extension of KNNOR to estimate the target value using the origin point $x_1$ and two of its nearest neighbors, $x_{11}^{nearest}$ and $x_{12}^{nearest}$.

The approach described in 2.2.2 expands on the KNNOR approach. While KNNOR is primarily used for classification data, generating labels for the artificial minority data point is a straightforward task in that context. However, when dealing with regression, we maintain a record of each point involved in producing the new point. After creating the new point, we consider the distance between the artificial point and each of these points to determine the target value. The distance, denoted as $d(y_j, y_i) = \|x_j - x_i\|$ and represented as $d_j$, can be generalized as follows. If $d_i$ is the distance of the artificial point $x^{new}$ to the $i$th parent point $x_i$, and $y_i$ is the target value for this point, then the target value $y^{new}$ corresponding to the freshly created data point is expressed using Eq. (5) as follows:

$y_i^{new} = \sqrt[\alpha]{\dfrac{\sum_{j=1}^{k}\left(\frac{y_j}{d(x_j,\,x_i^{new})}\right)^{\alpha}}{\sum_{j=1}^{k}\left(\frac{1}{d(x_j,\,x_i^{new})}\right)^{\alpha}}}$  (5)

where $y_i = Reg(x_i)$, and $Reg(\cdot)$ is the function that we try to estimate better. The value $d(y_j, y_i) = \|x_j - x_i\| = d_j$ and $x_j$ is the nearest neighbor to $x_i$. The vectors $x_1$ to $x_k$ are the $k$ nearest neighbors of $x_i^{new}$. The value of $\alpha$ is 2 in the case of Euclidean distance or infinity in the case of higher dimension. In our experiments we have used the value of $\alpha$ as 2. The number of points or neighbors that participated in the creation of the point is denoted by $k$. When the value of $\alpha$ is 1 and $k$ is 2, it gives the original case of SMOTE-Regression. When the value of $k$ changes from 2 onward, it is an application of the KNNOR-Regression method. The


process is illustrated in Fig. 12 which uses a weighted sum of the target


values corresponding to 𝑥0 , 𝑥1 , 𝑥2 and 𝑥3 to generate the target value
for the new data point.
An alternative description of the process is shown in Fig. 13. The origin point is $x_1$ and two of its nearest neighbors are $x_{11}^{nearest}$ and $x_{12}^{nearest}$. $x_1^{new}$ is the artificial point created by using $x_1$ and its two nearest neighbors following the process of KNNOR (Islam et al., 2022b). While the new point $x_1^{new}$ is being generated, we keep track of the points and neighbors involved in the operation. In this case, they are $x_1$, $x_{11}^{nearest}$ and $x_{12}^{nearest}$. Their corresponding target values are represented in the vertical line on the right side of Fig. 13. The target value corresponding to $x_1$ is $y_1$, to $x_{11}^{nearest}$ is $y_{11}^{nearest}$, and to $x_{12}^{nearest}$ is $y_{12}^{nearest}$. We also measure the distance between the newly created point $x_1^{new}$ and the three points $x_1$, $x_{11}^{nearest}$ and $x_{12}^{nearest}$. The distances are $d_1$, $d_{11}$ and $d_{12}$ respectively. Finally, we utilize Eq. (5) in the following manner, as specified in Eq. (6):

$y_1^{new} = \dfrac{\frac{y_1}{d_1} + \frac{y_{11}^{nearest}}{d_{11}} + \frac{y_{12}^{nearest}}{d_{12}}}{\frac{1}{d_1} + \frac{1}{d_{11}} + \frac{1}{d_{12}}}$  (6)

where 𝛼 is set to 1. The process of generating artificial data points and


their corresponding target values using a combination of KNNOR and
SMOTE-Regression is outlined in Algorithm 1.

Fig. 14. Flowchart showing the process of generation of novel artificial data points using the k nearest neighbors.

Algorithm 1 KNNOR Regression (KNNOR-Reg)

Input: Training data set S_train;
       number of neighbors k;
       count of datapoints to be augmented aug_num;
Output: Artificial points with target values

new_points = generate artificial points using KNNOR
new_targets = empty array of size(new_points)
For each point p in new_points do
    k_nbrs = the k points that were used to create p
    weighted_sum_distance = 0
    weights = 0
    For each neighbor n in k_nbrs do
        dist = distance between n and p
        weighted_sum_distance += (target value of n)/dist
        weights += 1/dist
    End For
    final_target_value = weighted_sum_distance/weights
    add final_target_value to new_targets
End For
Return new_points, new_targets
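A compact Python reading of Algorithm 1's target-assignment step (the inverse-distance weighting of Eq. (5) with α = 1) is sketched below. The function signature and variable names are illustrative; generating the artificial points themselves is assumed to be done beforehand by a KNNOR implementation.

```python
import numpy as np

def knnor_reg_targets(new_points, parent_sets, parent_targets):
    """Inverse-distance weighted targets for KNNOR-Reg (Algorithm 1, alpha = 1).

    new_points:     (m, d) artificial points produced by KNNOR
    parent_sets:    list of (k, d) arrays, the points used to create each new point
    parent_targets: list of (k,) arrays, the targets of those parent points
    """
    new_targets = []
    for p, nbrs, ys in zip(new_points, parent_sets, parent_targets):
        dists = np.linalg.norm(np.asarray(nbrs) - p, axis=1)
        dists = np.maximum(dists, 1e-12)          # guard against zero distance
        weights = 1.0 / dists
        new_targets.append(np.sum(np.asarray(ys) * weights) / np.sum(weights))
    return np.asarray(new_targets)
```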

The process of generating artificial data points along with their cor-
responding target values is elucidated through a flowchart, as depicted
in Figs. 14 and 15. Fig. 14 outlines the methodology for creating new
data points, while Fig. 15 illustrates the procedure for computing target
values associated with the newly generated data points.

3.3. KNNOR-Deep Regression (KNNOR-DeepReg) — [High population data]

Although KNNOR-Reg is a powerful approach for data imputation


and the inclusion of multiple neighbors introduces non-linearity, it is
possible to enhance the method further by utilizing neural networks.
This concept draws inspiration from our previous research on Class
Aware Autoencoders (Islam et al., 2022a), as described in Section 2.2.4.
In this study, we expand upon Class Aware Autoencoders and intro-
duce Target Aware Autoencoders and Target Aware AutoInflaters, as
outlined below.

Fig. 15. After generation of the new data points (Fig. 14), this flowchart shows the process of computing the corresponding target values of each novel data point by using the target values of the k nearest neighbors.

3.3.1. Target Aware AutoEncoders - [High population, high-dimension data]

To delve deeper into the concept of Target Aware Autoencoders, we refer to Fig. 16, which represents our objective. Our aim is to train a neural network that can predict target values while simultaneously


Fig. 16. Target Aware Auto Encoder (Fully Connected). The neuron highlighted in red represents the additional value generated, matching the target value. The network is capable
not only to extract features but also estimate the target value. The latter is generated by including an additional component in the loss function.

Fig. 17. Target Aware Inflater-Deflater (Fully Connected) for low dimension data. The neuron highlighted in red represents the additional value generated, matching the target
value. The bottleneck expands the feature set while the Network gives an additional output, the target value corresponding to the features. This is done by including the target
value in the output as well as the loss function.

learning the features of the rare dataset. In this context, the Autoencoder is defined as the function $F$ from $\mathbb{R}^d$ to $\mathbb{R}^{d+1}$ using Eq. (7) as follows:

$F(X) = (\hat{X}, \hat{y})$  (7)

where $X$ is the input feature, $\hat{X}$ and $\hat{y}$ are the approximations of input


features and target values by the model. The model can thus mimic
the input features on the output side and also generates a target value
that it learns while learning the features. Thus, although on the input
side, we only provide the features, on the output, we enforce the target-
aware model to generate an estimated target value. This is achieved
by defining a new loss function that enables the model to learn from
the features as well as the targets. It is explained in the following
Section 3.3.3. Initially, the model has a high error in generating the
target value, gradually improving with multiple iterations. It is inter-
esting to note here that the target values, 𝑦 are not represented in the
input of the neural network unlike the class aware auto encoders (Islam
et al., 2022a) as tests have shown that removing the target from the
input achieves better results. The second loss term is fed to the model
externally during every training batch to gauge the features and the
target. The process is shown in Fig. 16 where the neuron highlighted in
red captures the additional value generated, matching the target value.
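As an illustration of the architecture in Fig. 16, the following PyTorch sketch defines a small fully connected target-aware autoencoder whose output carries one extra neuron for the target estimate, together with a single training step. The layer sizes, the weight W, and the dummy batch are assumptions made for the example; the authors' actual network and training code may differ.

```python
import torch
import torch.nn as nn

class TargetAwareAutoEncoder(nn.Module):
    """Fully connected autoencoder that outputs d reconstructed features plus
    one estimated target value (the extra neuron of Fig. 16)."""

    def __init__(self, d, bottleneck=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 32), nn.ReLU(),
            nn.Linear(32, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 32), nn.ReLU(),
            nn.Linear(32, d + 1),          # d features + 1 target estimate
        )

    def forward(self, x):
        out = self.decoder(self.encoder(x))
        return out[:, :-1], out[:, -1]     # x_hat, y_hat

# One training step: reconstruction loss on the features plus a weighted
# target loss (Section 3.3.3 replaces the target term with an exponential penalty).
model = TargetAwareAutoEncoder(d=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 10)          # rare-class features (dummy batch)
y = torch.rand(64)               # their target values
x_hat, y_hat = model(x)
W = 10.0                         # assumed weight balancing features vs. target
loss = nn.functional.mse_loss(x_hat, x) + W * torch.mean(torch.abs(y_hat - y))
opt.zero_grad()
loss.backward()
opt.step()
```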
Fig. 18. Role of Target Aware Auto Encoder in generating new data samples as well as estimating target value for the novel data points. Same can be applied to target aware auto-inflater.

3.3.2. Target Aware AutoInflaters - [high population, low-dimension data]

Autoencoders are commonly used on high-dimensional data to reduce the number of features and extract valuable information. This is why they are often applied to image datasets. However, in real-life tabular regression data, the number of features is typically not too large, and compressing the data may result in loss of information. To address this issue, we propose an expansion-decoder network specifically designed for datasets with a lower number of features. Fig. 17 illustrates this design, where the initial part of the network inflates the feature information into a basin and then deflates it back to both the features and the target value. This approach allows us to preserve the information in the data without distortion.

The ultimate objective of both the target-aware autoencoder and deflater is identical. They aim to learn dataset features in a manner that the bottleneck representation can be used to generate new data points while the decoder can be used to obtain target values for any data point. This process is illustrated in Fig. 18, depicting the sequential steps from 1 to 6. Initially, a target-aware autoencoder/inflater is trained. Step 1 represents passing the data as input to the system. Bottleneck features are extracted at step 2 and passed to the KNNOR Regression algorithm in steps 3 and 4 to generate new data points. The decoder is leveraged


at steps 5 and 6 to estimate the target value for the artificial data point created.

Fig. 19. Progression of error values with increasing distance between actual and predicted.

3.3.3. Exponential loss function [Applicable to Target Aware Auto Encoders/Inflaters]

To assess the accuracy of both the AutoInflaters and Target-Aware AutoEncoders, common regression loss functions such as Mean Absolute Error (MAE) or Root Mean Square Error (RMSE) can be employed. The overall loss is reiterated in Eq. (8) as follows:

$total\_loss = \mathcal{L}(\hat{X}, X) + W \cdot \hat{\mathcal{L}}(\hat{y}, y)$  (8)

where $total\_loss$ is the accumulated loss over the features and the target values. The $\hat{\mathcal{L}}$ function can be the same as the function $\mathcal{L}$ applied on the feature set; however, we propose the two functions to be separate. As the count of features outnumbers the target variable, the network is coerced to prioritize the target variable by:

• Using a penalty function $\hat{\mathcal{L}}$ on the target values such that the difference between the predicted and actual target value is magnified;
• Adding $W$ to the above loss value to balance for the disparity between the cardinality of the feature set (≫ 1) and the cardinality of the target values (usually one);
• By introducing a weight value and employing a distinct loss function $\hat{\mathcal{L}}$, we aim to incorporate a penalizing function that exhibits accelerated growth as the deviation from the original value increases. This function is depicted in Eq. (9) as follows:

$\hat{\mathcal{L}}(\hat{y}, y) = |\hat{y} - y| \cdot e^{|\hat{y} - y|}$  (9)

The efficacy of the function is described in Fig. 19. The orange line depicts 1000 randomly generated delta values or the absolute difference between the predicted and actual, while the blue line plots the penalty value $\hat{\mathcal{L}}$. This additional loss ($\hat{\mathcal{L}}$) calculation along with the weight factor ($W$) helps in balancing the influence of the input features over the target variable.
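Eqs. (8) and (9) translate directly into a loss function usable in the training step sketched in Section 3.3.1. In the snippet below, the batch averaging, the MSE choice for the feature term, and W = 10 are our own assumptions for illustration.

```python
import torch

def exponential_target_loss(y_hat, y):
    """Penalty of Eq. (9): |y_hat - y| * exp(|y_hat - y|), averaged over the batch."""
    delta = torch.abs(y_hat - y)
    return torch.mean(delta * torch.exp(delta))

def total_loss(x_hat, x, y_hat, y, W=10.0):
    """Combined objective of Eq. (8): feature reconstruction plus weighted target penalty."""
    feature_loss = torch.nn.functional.mse_loss(x_hat, x)
    return feature_loss + W * exponential_target_loss(y_hat, y)
```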
3.4. Multi-level Auto Encoder - [Image data]

In the case of image datasets, the approach was different as training a target-aware Auto Encoder directly on the images was not yielding good results. The reason was that the image features' loss function overwhelmed the target variables' loss function. The accuracy of the entire model was high despite the accuracy of target prediction being relatively low. In order to cope with this discrepancy, a multi-level Autoencoder scheme is proposed. The first-level auto-encoder is a simple Convolution Autoencoder that extracts features from the images and converts them into vectors at the bottleneck. At the second level lies a target aware, fully connected Auto Encoder that uses the bottleneck of the previous level Autoencoder as input and trains a target-aware Neural Network. Fig. 20 illustrates the process. The external Convolution Auto Encoder is responsible for extracting the features from the image dataset at the bottleneck (marked as BottleNeck1 in Fig. 20). These features and the target values (provided externally) are used to train the internal target-aware Auto-Encoder. The bottleneck of the internal Auto-Encoder (marked as BottleNeck2) is used to reduce the feature size of the dataset further, and KNNOR is applied to these extracted features to generate new data points. This approach proves to be more efficient than training a single target-aware autoencoder on the images directly, as seen in Table 7.
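A minimal sketch of the first level of such a multi-level scheme is given below: a convolutional autoencoder whose flattened code plays the role of BottleNeck1, on top of which a fully connected target-aware autoencoder (as in Section 3.3.1) would be trained. The 64 × 64 single-channel input, layer sizes, and code dimension are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """First level: convolutional autoencoder; BottleNeck1 is the flattened code."""
    def __init__(self, code_dim=128):
        super().__init__()
        self.enc = nn.Sequential(                     # 1x64x64 -> code_dim
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, code_dim),
        )
        self.dec = nn.Sequential(                     # code_dim -> 1x64x64
            nn.Linear(code_dim, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, img):
        code = self.enc(img)            # BottleNeck1
        return self.dec(code), code

# Second level (not shown): a fully connected target-aware autoencoder is trained
# on the BottleNeck1 features of the rare images, and KNNOR is applied to its own
# bottleneck (BottleNeck2) to create new feature vectors.
conv_ae = ConvAutoEncoder()
imgs = torch.randn(8, 1, 64, 64)        # dummy grayscale batch
recon, bottleneck1 = conv_ae(imgs)
print(bottleneck1.shape)                # torch.Size([8, 128])
```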
3.4.1. Approach summary

The approach for estimating the target value of artificial data points needs to adapt as the shape of the data changes. In this regard, two key characteristics are taken into account: the number of features in the data and the population of the minority dataset. These factors play a significant role in determining the appropriate method for estimating the target value in the context of generating artificial data points.

• Large Dataset with a high number of features. In this case, we employ a Target Aware autoencoder (depicted in Fig. 16) to extract the features and reduce their dimensionality. Subsequently, the KNNOR algorithm is applied to this feature set in order to upsample the minority dataset. Finally, we utilize the pre-trained Target Aware autoencoder to predict the target values for the generated artificial data points. The process is illustrated in Fig. 21. This approach is commonly utilized for numerous image datasets.
• Large Dataset with a low number of features. In this case, we employ a Target aware Inflate-Deflate architecture to expand the dataset and enhance its representation. The features are extracted from the basin and utilized to generate artificial data points using the KNNOR algorithm. Subsequently, these data points are fed into the Deflater part of the network to generate the corresponding target data points. Fig. 22 illustrates this process. This approach is particularly useful for large tabular datasets.
• Small Dataset with a high or low number of features. In case where the dataset is small, we employ the KNNOR algorithm to create artificial data points regardless of the number of features. Subsequently, we utilize the KNNOR-Reg method to predict the potential target values for these artificial data points. The steps involved in this process are illustrated in Fig. 23.

The result of the augmentation process for different datasets has been shown in Figs. 24 and 25. Figures a and b of each figure show a scatter plot of the data after doing a Principal Component Analysis (PCA) for representation in 2 dimensions.

Fig. 24(a and b) shows the augmentation efforts on the laser dataset using the KNNOR-Regression method (Section 3.2). Fig. 24(c and d) illustrate a variant of the KNNOR-DeepRegression method where KNNOR is used to oversample the data and then an AutoInflater is used to estimate the target values for the synthetic data. In Figs. 24a and 24c, the augmented points are the same, however since the method to obtain the target value is different in each case, Figs. 24b and 24d capture a different distribution of the augmented target values. Fig. 25 shows the same comparison for the ele-2 dataset using the same 2 methods. Here also, Fig. 25b shows the target values obtained using the target values of the k-nearest neighbors. In case of Fig. 25d, a Target Aware AutoInflater was trained on the training data. Consequently the new points generated using KNNOR were passed into the AutoInflater to obtain their target values. The difference is apparent in the histogram of distributions as shown in Figs. 25b and 25d, where, although the shapes of the histograms are similar, the frequency of the different ranges of values is different.


Fig. 20. Multi-level Auto Encoders for Images. First, CNN Auto Encoders to reduce the dimension of images. Second, Fully Connected Target Aware Auto Encoders to learn the
target values of minority data set.

Fig. 22. Target generation steps for data with high population and low dimension.

Fig. 21. Target generation steps for data with high population and high dimension.

3.5. Summary

Fig. 26 provides an overview of the various approaches mentioned in the previous subsections, allowing users to easily assess and select the most suitable method based on the data structure being considered. The figure showcases the approach used for image data, while for tabular data, specific recommendations are provided for scenarios involving high or low population and high or low dimensions. This comprehensive visualization aids in navigating the available techniques and making informed choices based on the characteristics of the data at hand.

4. Results and discussion

In this section, we detail our experimental design, where we evaluate the efficacy of our methods on established imbalanced regression datasets and present the results and discussions. We have delineated four distinct sets of experiments. The first experiment involves a performance comparison between our oversampling technique and SMOTER on tabular data, while the second experiment replicates this performance evaluation using image data. The third experiment assesses whether our augmentation method genuinely enhances the regression capabilities of three state-of-the-art regressors. Lastly, the fourth experiment gauges the efficiency of our oversampling algorithm in improving the predictive prowess of 18 regressors.

4.1. Experiment design

This section focuses on evaluating the effectiveness of oversampling techniques and target value predictors in the subsequent regression step. The experimental procedure is designed to compare the proposed methods in this paper against existing techniques, with a primary emphasis on SMOTER (Torgo, Ribeiro, Pfahringer, & Branco, 2013) for both tabular and high-dimensional (image) datasets. Tables 1 and 2 provide details on the datasets used, sourced from the data repository at https://paobranco.github.io/DataSets-IR/ (Branco et al., 2019), and the Keel repository (Derrac, Garcia, Sanchez, & Herrera, 2015). 33% of the data in each file was preserved for test while the rest was used for training and oversampling. For these tabular datasets, we applied the approaches illustrated in Figs. 22 and 23. To assess the performance on high-dimensional datasets, we utilized the Image-Age dataset obtained from the AgeDB (Moschoglou et al., 2017) and IMDB-WIKI (Rothe, Timofte, & Van Gool, 2018) repositories, employing the approach illustrated in Fig. 21. In the case of AgeDB, images with an age label exceeding 80 were considered rare, while for IMDB-WIKI, the


Fig. 23. Target generation steps for data with low population.

Fig. 24. Augmentation of data and target values on the laser dataset, using KNNOR-Regression and KNNOR-DeepRegression. Figures a and c show the scatter plot of data considered to be common and rare and also show the datapoints after augmentation, Figures b and d show the frequency of the labels of the common, rare and augmented data points.

Fig. 25. Augmentation of data and target values on the ele-2 dataset, using KNNOR-Regression and KNNOR-DeepRegression. Figures a and c display scatter plots representing data points categorized as common and rare, including the augmented data points. Meanwhile, Figures b and d illustrate the label distribution for common, rare, and augmented data points.

age threshold was set at 75. By conducting these experiments, we aim to provide a comprehensive comparison and evaluation of our proposed methods alongside existing techniques.
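The evaluation protocol described here (a held-out 33% test split, with only the training portion augmented, and models compared with and without augmentation) can be summarized by the sketch below. The choice of GradientBoostingRegressor and MAE is ours for illustration; the paper trains many regressors and reports a success rate across datasets rather than a single error figure.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def augmentation_gain(X, y, oversample, test_size=0.33, seed=0):
    """Compare a regressor trained with and without oversampled training data.

    `oversample` is any callable (X_train, y_train) -> (X_aug, y_aug), e.g. a
    KNNOR-Reg implementation; only the training split is ever augmented.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)

    base = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
    X_aug, y_aug = oversample(X_tr, y_tr)
    aug = GradientBoostingRegressor(random_state=seed).fit(X_aug, y_aug)

    return {
        "mae_no_aug": mean_absolute_error(y_te, base.predict(X_te)),
        "mae_with_aug": mean_absolute_error(y_te, aug.predict(X_te)),
    }
```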
the training data was split into rare and common classes. The
rare data was used to train target-aware autoencoders (depicted
4.1.1. Evaluation process and metrics in Figs. 16 and 17). The KNNOR method was then applied
The dataset was initially divided into training and test sets, with separately to create artificial data points for the rare class.
the test dataset being kept separate throughout the augmentation and These generated artificial points were passed through the trained
training process. The three methods employed are outlined as follows: autoencoders to obtain the target labels for the new data points.
The augmented data points, along with their labels, were incor-
1. The training data was split into two classes, rare and common, porated back into the training set, and regressors were trained
depending on the relevance threshold. Considering it to be a to assess performance on the test dataset;
classification dataset (rare and common), it was passed through 3. For image datasets, the process began by dividing the training
the KNNOR approach (Islam et al., 2022b) approach to create data into rare and common classes. A generic autoencoder was
artificial data points of the rare category. The target labels of trained on the images to learn and extract their features, thereby
points used in the creation of each artificial data point were then reducing the dimensionality of the data. The entire training
aggregated to calculate the label of the new point created. The image set was then fed through the autoencoder to extract
augmented data points with labels were added back to the train- features for both rare and common data. A target-aware autoen-
ing set, and regressors were trained to check the performance on coder was trained on the extracted features of the rare class to
the test dataset; learn the target values specific to those features. KNNOR was
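The label-aggregation step in the first procedure can be made concrete with a short sketch. The snippet below is only an illustration of the idea (an inverse-distance weighted average of the contributing neighbors' targets); the exact weighting scheme of KNNOR-Reg is the one defined in Section 3.2, and the function and variable names here are illustrative rather than part of the released package.

```python
import numpy as np

def estimate_target(new_point, neighbor_points, neighbor_targets, eps=1e-8):
    """Estimate the target of a synthetic point from the targets of the
    minority neighbors used to generate it (illustrative sketch only:
    closer neighbors receive larger weights)."""
    neighbor_points = np.asarray(neighbor_points, dtype=float)
    neighbor_targets = np.asarray(neighbor_targets, dtype=float)
    # Euclidean distance from the synthetic point to each contributing neighbor
    distances = np.linalg.norm(neighbor_points - np.asarray(new_point, dtype=float), axis=1)
    weights = 1.0 / (distances + eps)  # closer neighbor -> larger weight
    return float(np.sum(weights * neighbor_targets) / np.sum(weights))

# Example: a synthetic point generated from three rare neighbors
new_x = [0.4, 0.6]
neighbors = [[0.3, 0.5], [0.5, 0.7], [0.6, 0.4]]
targets = [12.0, 15.0, 20.0]
print(estimate_target(new_x, neighbors, targets))
```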


Fig. 26. Recommendation of techniques for varying structure of data. Regarding Tabular data, we offer different methods depending on the population and number of features.
We propose KNNOR-Regression (Section 3.2) for low population data. For data with a high population, we advocate KNNOR Deep Regression which has two flavors. For high-
population data with a high number of features, we use Target Aware AutoEncoders (Section 3.3.1), and for high-population data with a low number of features, we use Auto
Inflaters (Section 3.3.2). Finally, for Image datasets, we use a combination of Convolution and Fully Connected AutoEncoders called Multi-Level AutoEncoders (Section 3.4).

Table 1
Numerical datasets used in comparison.
Data set Instances Features Relevance threshold Rare Rare (percentage) Type of extreme
ANACALT 4052 7 0.8 835 0.21 lower
bank8FM 4499 8 0.8 285 0.06 upper
baseball 337 16 0.5 50 0.15 upper
boston 506 13 0.8 113 0.22 upper
compactiv 8192 21 0.8 713 0.09 lower
concrete 1030 8 0.8 52 0.05 upper
cpuSm 8192 12 0.8 713 0.09 lower
ele-1 495 2 0.8 43 0.09 upper
ele-2 1056 4 0.8 110 0.1 upper
forestFires 517 12 0.8 78 0.15 upper
friedman 1200 5 0.5 48 0.04 upper
laser 993 4 0.8 75 0.08 upper
machineCPU 209 6 0.8 31 0.15 upper
mortgage 1049 15 0.8 106 0.1 upper
quake 2178 3 0.8 118 0.05 upper
stock 950 9 0.5 63 0.07 upper
treasury 1049 15 0.8 109 0.1 upper
wankara 321 9 0.5 31 0.1 lower

Table 2
High dimension (image) datasets used in comparison.
Data set Instances Features Rare Rare (percentage) Type of extreme
AgeDB 16 488 64 × 64 494 0.03 upper
IMDB-WIKI 213 553 64 × 64 2721 0.012 upper
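For the image datasets in Table 2, the third procedure of Section 4.1.1 first compresses each 64 × 64 image into a low-dimensional feature vector with a convolutional autoencoder, and KNNOR then operates in that feature space. The PyTorch-style sketch below illustrates only this feature-extraction stage; the channel count, layer sizes and latent dimension are assumptions made for the example and do not reproduce the Multi-Level AutoEncoder architecture of Section 3.4.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Minimal convolutional autoencoder for 1-channel 64x64 images (sketch)."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = ConvAutoencoder()
images = torch.rand(8, 1, 64, 64)                        # stand-in batch of training images
reconstruction, features = model(images)
loss = nn.functional.mse_loss(reconstruction, images)    # reconstruction loss used for training
# 'features' (8 x 128) is what KNNOR would oversample in the image pipeline
```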



Table 3
Hyperparameters of regressors.
Regressor Hyperparameters
Support Vector regression kernel = ‘rbf’, degree = 3, gamma = ‘scale’, coef0 = 0.0, tol = 0.001, C = 1.0
Random Forest regression n_estimators = 100, criterion = ‘squared_error’, max_depth = None, min_samples_split = 2,
min_samples_leaf = 1
Gradient Boosting regression loss = ‘squared_error’, learning_rate = 0.1, n_estimators = 100, subsample = 1.0, criterion =
‘friedman_mse’, min_samples_split = 2, min_samples_leaf = 1
Fully Connected Network hidden_layer_sizes = (100,), activation = ‘relu’, solver = ‘adam’, alpha = 0.0001, batch_size
= ‘auto’, learning_rate = ‘constant’, learning_rate_init = 0.001

By employing these three methods, the performance and accuracy of each approach were assessed using the test dataset. It can be noted that the third method is an extension of the second method, as it contains an additional step of feature extraction from image datasets. The transformations applied to the training data were also applied to the test data to enable accuracy estimation. The machine learning algorithms used at the end include Linear Regression (Barupal & Fiehn, 2019), Support Vector Regression (SVR) (Vapnik & Vapnik, 1998), Random Forest regression (RF) (Segal, 2004), Gradient Boosting regression (GBR) (Natekin & Knoll, 2013) and a Fully Connected Network (FCN) (Kohler & Langer, 2021). The fully connected network consists of the input layer, followed by 5 hidden layers and the output layer. The activation function is Rectified Linear Units (ReLU) (Agarap, 2018) in each case. Table 3 defines the different hyperparameters for each regressor.

RMSE has been used in calculating the overall error percentage. Although RMSE is a ubiquitous error metric for regression, it does not lend itself easily to comparison for results on normalized data; the data needs to be denormalized before calculating RMSE. We therefore propose a relative error calculated on the normalized output itself in order to obtain more intuitive and easily comparable results. The relative max error (RMaxE) is defined in Eq. (10) as follows:

$$\mathrm{RMaxE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat{y}_i - y_i}{y_{\max}}\right)^{2}} \qquad (10)$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the actual target value. The value $y_{\max}$ is the maximum target value in the test set, and $n$ is the number of samples in the same. The utility of the error metric is illustrated in Table 4, which shows the difference in error percentage even when the absolute differences are the same: predicting 1.1 for a true value of 0.1 is a far worse error than predicting 101 for a true value of 100. To expose this, we divide the absolute difference by the maximum true value in the test set, which magnifies an error at lower values compared to the same error at higher values.

Table 4
Demonstration of relative error. The two rows represent two different distributions. For the first distribution, the maximum of the test target values is 0.1 while that for the second is 100. Error percentage is the absolute difference divided by the maximum of the true target values for each distribution.
True value Predicted value Absolute difference Error percentage
0.1 1.1 1 0.909
100 101 1 0.0099
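Eq. (10) translates directly into a few lines of code. The sketch below is a minimal implementation of the RMaxE metric as defined above; the function name is ours.

```python
import numpy as np

def relative_max_error(y_true, y_pred):
    """Relative max error (RMaxE) of Eq. (10): an RMSE in which every residual
    is scaled by the maximum true target value of the test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_max = np.max(y_true)  # maximum target value in the test set
    return float(np.sqrt(np.mean(((y_pred - y_true) / y_max) ** 2)))

# A prediction of 1.1 for a true value of 0.1 yields a much larger relative
# error than a prediction of 101 for a true value of 100, even though the
# absolute difference is 1 in both cases.
print(relative_max_error([0.1], [1.1]))
print(relative_max_error([100.0], [101.0]))
```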
4.1.2. Augmentation framework
For each train–test split of every dataset, we conduct evaluations with varying levels of augmentation employing different regression techniques. The initial evaluation is conducted without any augmentation, and subsequent assessments involve the application of both the state-of-the-art and our proposed augmentation methods.

• State-of-the-art: In this context, we utilize the SMOTE for Regression (SMOTER) technique introduced by Torgo et al. (2013) as our chosen augmentation method. The parameters configured for SMOTER are as follows:
  – Maximum distance: The maximum distance from existing minority points, with values ranging from 0.001 to 0.01 to 0.1. This determines the placement of new points relative to the existing minority points;
  – Addition proportion: A list of values ranging from 0.1 to 0.5 to 0.8, indicating the proportion of the population to be added to the minority dataset. For example, an addition proportion of 0.1 means that, after adding the minority points, the minority population would be 60% of the majority population. It is worth noting that, unlike Camacho et al. (2022), we did not aim to match the minority population with the majority population exactly, to save computational effort and showcase comparative improvement over the original data (Haixiang et al., 2017).
• Proposed Approach: In this case, we employ the KNNOR approach (Islam et al., 2022b) as the augmentation technique. The parameters used in KNNOR are as follows:
  – Number of neighbors: The number of neighboring points used to generate a new data point, with values ranging from 2 to 5 to 10;
  – Usable minority proportion: The proportion of the minority population used in generating the artificial data points, with values ranging from 0.2 to 0.6 to 0.9. For example, if the minority population is 100 and the usable minority proportion is 0.6, only 60 minority data points are utilized in producing the artificial data points. The selection of these 60 data points is based on a criticality estimate explained in Islam and Belhaouari (2021).

With the range of parameters mentioned, a total of 243 experiments are conducted on each file, resulting in a cumulative total of 53,217 experiments for the 18 numerical datasets.
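The augmentation framework therefore reduces to a grid of oversampling configurations evaluated around a fixed train–test protocol. The sketch below shows the shape of that loop; smoter_augment and knnor_augment are placeholder callables standing in for the actual SMOTER and KNNOR-Reg implementations (they are not real library calls), the regressor is fixed to a Random Forest for brevity, and the parameter grids mirror the values listed above.

```python
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def evaluate(regressor, X_train, y_train, X_test, y_test):
    """Fit on (possibly augmented) training data and score on the untouched test split."""
    regressor.fit(X_train, y_train)
    pred = regressor.predict(X_test)
    y_max = np.max(y_test)
    return np.sqrt(np.mean(((pred - y_test) / y_max) ** 2))  # RMaxE of Eq. (10)

def run_grid(X_train, y_train, X_test, y_test, smoter_augment, knnor_augment):
    results = {"none": evaluate(RandomForestRegressor(), X_train, y_train, X_test, y_test)}
    # SMOTER grid: maximum distance x addition proportion
    for max_dist, add_prop in product([0.001, 0.01, 0.1], [0.1, 0.5, 0.8]):
        Xa, ya = smoter_augment(X_train, y_train, max_dist, add_prop)
        results[("smoter", max_dist, add_prop)] = evaluate(RandomForestRegressor(), Xa, ya, X_test, y_test)
    # KNNOR grid: number of neighbors x usable minority proportion
    for k, prop in product([2, 5, 10], [0.2, 0.6, 0.9]):
        Xa, ya = knnor_augment(X_train, y_train, k, prop)
        results[("knnor", k, prop)] = evaluate(RandomForestRegressor(), Xa, ya, X_test, y_test)
    return results
```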
4.2. Results

Tables 5, 6, and 7 provide the results of the conducted comparisons. Table 5 presents the mean of the best performance achieved by various regressors on different datasets, with the relative-max-error serving as the error metric. In Table 6, the mean Root Mean Squared Error (RMSE) score on different datasets is presented. Table 7 focuses specifically on the RMSE score for image datasets. Each column in these tables represents a different technique employed in the experiment, while the rows correspond to the various regression algorithms utilized. It is important to mention that in Table 6, the AutoInflaters have been trained with either the exponential or RMSE loss, corresponding to the columns in Table 5. In Tables 5 and 6, the columns are structured as follows.


Table 5
Relative-max-error scores for different datasets.
Regression algorithm No oversampling SMOTE regression Type I Type II Type III Type IV Type V Type VI Type VII
RFR 1.1749 1.1397 1.1212 1.1263 1.135 1.2892 1.2589 1.2844 1.3244
GBR 1.1479 1.1184 1.0924 1.1054 1.1055 1.3297 1.2951 1.3291 1.3526
SVR 1.7765 1.7095 1.662 1.6663 1.6647 1.629 1.6316 1.6069 1.6902
LR 2.0867 2.1257 2.1056 2.1047 2.1114 1.6612 0.4964 1.6502 1.5737
FCN 0.3951 0.4886 0.4866 0.3883 0.3915 0.5244 0.5665 0.5187 0.5676

Table 6
RMSE scores for different datasets.
Regression algorithm No oversampling SMOTE regression Type I Type II Type III Type IV Type V Type VI Type VII
RFR 0.0085 0.0075 0.0071 0.0074 0.0073 0.0097 0.0095 0.0097 0.0091
GBR 0.0078 0.0074 0.0065 0.0065 0.0064 0.0098 0.0087 0.0097 0.0092
SVR 0.013 0.0126 0.0119 0.0118 0.0117 0.0117 0.0117 0.0115 0.0116
LR 0.0122 0.0121 0.0118 0.0119 0.0118 0.011 0.0092 0.0107 0.0095
FCN 0.0066 0.006 0.0058 0.0064 0.0064 0.0121 0.0126 0.0121 0.0125

Table 7
RMSE score for the image datasets using a fully connected regressor.
Dataset Error metric No augmentation SMOTER KNNOR-DeepReg
IMDB-WIKI RMSE 0.709 0.153 0.137
IMDB-WIKI Relative-Max 3.469 1.657 1.567
AgeDB RMSE 2.33 0.296 0.294
AgeDB Relative-Max 3.328 1.183 1.178

• Column 1: Different regressors used;
• Column 2: Regression error when no augmentation is applied;
• Column 3: Regression error when the state-of-the-art SMOTE-Regression is applied;
• Column 4 (Type I/KNNOR-Regression): Regression error after applying KNNOR to create synthetic datapoints and then applying KNNOR-Regression (Section 3.2) to determine the target values of the newly created points;
• Column 5 (Type II): Initially, KNNOR is used to generate artificial data points. Subsequently, the Target-Aware AutoInflaters (Section 3.3.2) trained on the training data are employed to estimate the target values for these newly created data points. The loss function used in these AutoInflaters is RMSE;
• Column 6 (Type III): This column represents the same process as the previous one, with the difference being that the loss function used in these AutoInflaters is the exponential loss function specified in Section 3.3.3;
• Column 7 (Type IV): In this scenario, KNNOR is applied to the expanded features of the training data obtained using the first part of the AutoInflaters. The target values are determined using the latter (Deflater) part of the same AutoInflaters. The loss function used in these AutoInflaters is RMSE;
• Column 8 (Type V): This column follows a similar process to the previous one, but the loss function used in the AutoInflaters is the exponential loss function specified in Section 3.3.3;
• Column 9 (Type VI): In this case, the AutoInflaters are employed to extract the features. Subsequently, KNNOR-Regression is applied to the inflated features to create the artificial dataset as well as the target data points. The loss function used in the AutoInflaters is RMSE;
• Column 10 (Type VII): This column is similar to the previous one, with the difference being that the loss function used in the AutoInflaters is the exponential loss function specified in Section 3.3.3.

The distribution of ranks achieved by non-augmented methods and augmentation methods, including SMOTER, KNNOR-Regression, and KNNOR-DeepReg, is visualized in Figs. 27 and 28. Fig. 27 displays the ranks obtained when the regression metric was relative max error, while Fig. 28 presents the ranks of the various regressors when RMSE was the evaluation metric. The variants of KNNOR-Regression have ranked better a higher number of times than the SMOTE-Regressor. In both Figs. 27 and 28, the lines represent different algorithms, and they are organized as follows:

• No Augmentation. Represents the frequency of ranks for regressors trained on non-augmented data;
• SMOTE-Regression. Represents the frequency of ranks for regressors trained on data augmented by the SMOTER method;
• KNNOR-Regression. Represents the frequency of ranks for regressors trained on data augmented by KNNOR, with target values calculated by the KNNOR-Reg method (Section 3.2);
• KNNOR-DR-I. Represents the frequency of ranks for regressors trained on data augmented by KNNOR, with target values calculated by using the target aware AutoInflaters (Section 3.3);
• KNNOR-DR-II. In this case KNNOR was applied on features extracted by the target aware AutoInflaters. The same AutoInflaters were used to calculate the target values;
• KNNOR-DR-III. In this case KNNOR was applied on features extracted by the target aware AutoInflaters. KNNOR-Reg was used on the artificial data points to calculate the target values.

KNNOR-Regression achieves the best rank among all, followed by different variations of KNNOR-DeepRegression (KNNOR-DR-X).

Table 7 shows the RMSE score for the two image datasets used in the experiment. As the number of datapoints and features is considerably high, we have only tested them on fully connected deep regressors. Data augmentation increases the accuracy of the models manifold; with KNNOR-DeepReg, the accuracy is enhanced further.

The KNNOR-Regression method demonstrates superior performance across the numerical datasets, followed by the KNNOR-DeepReg methods. We believe that as the dataset size increases, the KNNOR-DeepReg method, which utilizes Target Aware AutoInflaters, has the potential to outperform the KNNOR-Reg process.

The overall performance, as indicated by the RMSE metric, is consistent with the results obtained using the relative-max-error metric, confirming its validity. Furthermore, the range of scores in Table 5 (using the relative-max-error metric) is more pronounced and easier to compare than in Table 6, where the error values show differences only in the third or fourth decimal place.
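The rank distributions plotted in Figs. 27 and 28 are obtained by ranking, for every dataset–regressor combination, the test error of each method and then counting how often each method attains each rank. A sketch of that counting step is given below; the error values are placeholders, not results from the paper.

```python
from collections import Counter

# errors[method] holds one test error per (dataset, regressor) combination;
# the values below are placeholders, not results from the paper.
errors = {
    "No Augmentation":  [0.90, 0.85, 0.70],
    "SMOTE-Regression": [0.80, 0.83, 0.72],
    "KNNOR-Regression": [0.75, 0.80, 0.68],
}

rank_counts = {method: Counter() for method in errors}
n_cases = len(next(iter(errors.values())))
for i in range(n_cases):
    # lower error is better -> rank 1 is the best method for this case
    ordered = sorted(errors, key=lambda m: errors[m][i])
    for rank, method in enumerate(ordered, start=1):
        rank_counts[method][rank] += 1

for method, counts in rank_counts.items():
    print(method, dict(counts))
```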


Fig. 27. Ranks of different methods on all datasets when the error metric was max-relative-error. Lines represent the different methods explained in Section 4.2. KNNOR-Regression
denoted by the green line is the best performing method having ranked the maximum number of times.

4.3. Experimenting with strong and weak regressors

In a recent scientific paper titled ‘‘To SMOTE, or not to SMOTE?’’ (Elor & Averbuch-Elor, 2022), the benefits of balancing techniques are explored, particularly in relation to advanced classifiers. The study involves conducting extensive experiments using three state-of-the-art classifiers, comparing them to weaker learners used in previous studies. The results reveal that balancing techniques significantly improved the prediction accuracy of weak classifiers, including multi-layer perceptron, support vector machines, decision tree, and Adaptive Boosting (AdaBoost). However, no noticeable impact is observed on the performance of more robust classifiers like eXtreme Gradient Boosting (XGBoost) and categorical boosting (CatBoost).

To assess the efficacy of the methodology introduced in our paper, we extend our analysis to include the regression variants of the advanced predictors mentioned in the aforementioned publication. We compare their performance on both the original and augmented datasets, observing notable performance improvements following augmentation. In our assessment, we utilize a comprehensive set of regression metrics for evaluation, including the following:

• Mean Absolute Error (MAE). MAE measures the average absolute difference between predicted and actual values;
• Mean Squared Error (MSE). MSE calculates the average squared difference between predicted and actual values;
• R-squared (R2). R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1, where 1 indicates a perfect fit;
• Explained Variance Score. This metric quantifies the proportion of the variance in the target variable that is explained by the model. A score of 1 indicates a perfect fit;
• Median Absolute Error (MedAE). Quantifies the median magnitude of errors between predicted and actual values, offering robustness to outliers.

Table 8
Comparing strong regressors on Augmented vs. Original data. Column ‘‘Wins’’ indicates how many times the performance of the regressor on augmented data was better than the regressor on original imbalanced data.
Regressor Error metric Wins Losses
CatBoost Mean Absolute Error 9 7
CatBoost Mean Squared Error 11 5
CatBoost R-squared 11 5
CatBoost Explained Variance Score 11 5
CatBoost Median Absolute Error 10 6
CatBoost Total 52 28
XGBoost Mean Absolute Error 12 4
XGBoost Mean Squared Error 10 6
XGBoost R-squared 10 6
XGBoost Explained Variance Score 11 5
XGBoost Median Absolute Error 13 3
XGBoost Total 56 24
LightGBM Mean Absolute Error 13 3
LightGBM Mean Squared Error 14 2
LightGBM R-squared 14 2
LightGBM Explained Variance Score 14 2
LightGBM Median Absolute Error 12 4
LightGBM Total 67 13

In Table 8, the ‘‘Wins’’ column shows how many times the state-of-the-art (SOTA) regressor trained on augmented data outperformed the SOTA regressor trained on the original data. This count is significantly higher, implying that the augmentation of imbalanced data using our technique enhances the performance of strong regressors. For those interested in exploring the underlying code and details of this simulation, we have made the code repository available on GitHub at the following Github link.

Fig. 28. Ranks of different methods on all datasets when the error metric was RMSE. Lines represent the different methods explained in Section 4.2. KNNOR-DR-I is the best
performing method followed by the KNNOR-Regression method as they rank 1st maximum number of times.
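The comparison summarized in Table 8 trains each strong regressor once on the original imbalanced training set and once on the augmented one, and then counts, metric by metric, which version wins on the held-out test split. The sketch below illustrates that procedure for a single dataset; the augmented arrays are assumed to have been produced beforehand (for example by KNNOR-Reg), and the default model settings are illustrative.

```python
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             explained_variance_score, median_absolute_error)

METRICS = {
    "Mean Absolute Error": (mean_absolute_error, "lower"),
    "Mean Squared Error": (mean_squared_error, "lower"),
    "R-squared": (r2_score, "higher"),
    "Explained Variance Score": (explained_variance_score, "higher"),
    "Median Absolute Error": (median_absolute_error, "lower"),
}

def count_wins(model_cls, X_orig, y_orig, X_aug, y_aug, X_test, y_test):
    """Return {metric: True/False}, True meaning the augmented model wins."""
    pred_orig = model_cls().fit(X_orig, y_orig).predict(X_test)
    pred_aug = model_cls().fit(X_aug, y_aug).predict(X_test)
    wins = {}
    for name, (metric, direction) in METRICS.items():
        s_orig, s_aug = metric(y_test, pred_orig), metric(y_test, pred_aug)
        wins[name] = s_aug < s_orig if direction == "lower" else s_aug > s_orig
    return wins

# Repeating this per dataset and summing the booleans gives the Wins/Losses of Table 8
strong_models = {"CatBoost": CatBoostRegressor, "XGBoost": XGBRegressor, "LightGBM": LGBMRegressor}
```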

Table 9
Models used for extensive testing. All of them have been used by leveraging the Pycaret library in python which
gives standard implementation of all the models (Ali, 2020).
Models Models Models
AdaBoost Regressor Extreme Gradient Boosting Least Angle Regression
Bayesian Ridge Gradient Boosting Regressor Light Gradient Boosting Machine
Decision Tree Regressor Huber Regressor Linear Regression
Dummy Regressor K Neighbors Regressor Orthogonal Matching Pursuit
Elastic Net Lasso Least Angle Regression Passive Aggressive Regressor
Extra Trees Regressor Lasso Regression Random Forest Regressor

4.4. Testing with additional models

In order to cement our claims, we have added a host of models to train on augmented as well as non-augmented data and then compare their performance on the test data. The datasets have already been mentioned in Table 1 and the models used are listed in Table 9. Default hyper-parameters were used for each model as mentioned in Ali (2020).

We conducted a comprehensive evaluation by applying each model to both augmented and non-augmented datasets. To assess their performance rigorously, we employed a set of six essential metrics: MAE, MSE, RMSE, R-squared (R2), Root Mean Squared Logarithmic Error (RMSLE), and Mean Absolute Percentage Error (MAPE). These metrics have been defined in Section 4.3 and provide a well-rounded perspective on how well the models make predictions.

Our analysis involved calculating how many times the models trained on augmented data outperformed those trained on non-augmented data across all six metrics. This cumulative count provides a consolidated measure of the augmentation’s impact on model performance. By considering multiple metrics and aggregating the results, we gain a holistic understanding of the benefits of using augmented data, offering a robust assessment of its efficacy in enhancing predictive models. A total of 1368 experiments were made, for 18 models on 18 datasets. Augmentation was performed for 3, 4 and 5 neighbors and the best result was selected.

The results in Table 10 provide a comparative overview of 18 regression models evaluated on both augmented and original imbalanced data. Following is a detailed summary of the findings:

1. Passive Aggressive Regressor, Linear Regression, and Light Gradient Boosting Machine are the top-performing models, with wins in approximately 83.33% of cases.
2. Ridge Regression and Bayesian Ridge follow closely, with wins in about 82.29% of cases.
3. Decision Tree Regressor, Random Forest Regressor, and AdaBoost Regressor also demonstrate strong performance, with wins in 79.17% to 75.15% of cases.
4. Several models, including Extreme Gradient Boosting, Least Angle Regression, and Lasso Regression, exhibit wins in approximately 75% of cases.
5. Models like Extra Trees Regressor, Huber Regressor, and Elastic Net have wins in the range of 68.75% to 66.67% of cases.
6. Gradient Boosting Regressor, Orthogonal Matching Pursuit, and K Neighbors Regressor have relatively lower win percentages, ranging from 63.54% to 55.21%.
7. Lasso Least Angle Regression has the lowest win percentage at 39.58%.
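Aggregating the six metrics over all datasets yields the win/loss counts and win percentages reported in Table 10. A compact sketch of that bookkeeping is shown below; results is a placeholder structure (the models themselves would be the PyCaret defaults listed in Table 9), and ties are counted as losses for simplicity.

```python
# results[model][dataset] = {"augmented": {metric: score}, "original": {metric: score}}
# LOWER_IS_BETTER lists the error metrics; R2 is treated as higher-is-better.
LOWER_IS_BETTER = {"MAE", "MSE", "RMSE", "RMSLE", "MAPE"}

def win_table(results):
    summary = {}
    for model, per_dataset in results.items():
        wins = losses = 0
        for scores in per_dataset.values():
            for metric, aug_score in scores["augmented"].items():
                orig_score = scores["original"][metric]
                better = aug_score < orig_score if metric in LOWER_IS_BETTER else aug_score > orig_score
                wins += int(better)
                losses += int(not better)
        summary[model] = (wins, losses, round(100 * wins / (wins + losses), 2))
    return summary  # e.g. {"Linear Regression": (480, 96, 83.33), ...}
```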


Table 10
Comparing 18 regression models on Augmented vs. Original Data. Column ‘‘Wins’’ indicates how many times the performance of the regressor on augmented data was better than the regressor on original imbalanced data.
Regressor Wins Losses Percentage Wins (%)
Passive Aggressive Regressor 480 96 83.33
Linear Regression 480 96 83.33
Light Gradient Boosting Machine 480 96 83.33
Ridge Regression 474 102 82.29
Bayesian Ridge 474 102 82.29
Decision Tree Regressor 456 120 79.17
Random Forest Regressor 444 132 77.08
AdaBoost Regressor 517 171 75.15
Extreme Gradient Boosting 432 144 75
Least Angle Regression 432 144 75
Lasso Regression 414 162 71.88
Extra Trees Regressor 402 174 69.79
Huber Regressor 396 180 68.75
Elastic Net 384 192 66.67
Gradient Boosting Regressor 366 210 63.54
Orthogonal Matching Pursuit 324 252 56.25
K Neighbors Regressor 318 258 55.21
Lasso Least Angle Regression 228 348 39.58
Total 7501 2979 71.57

The overall trend indicates that several regression models benefit from using augmented data, resulting in a performance improvement. Passive Aggressive Regressor, Linear Regression, and Light Gradient Boosting Machine consistently outperform others, showcasing their robustness in handling imbalanced data when augmented. However, it is important to note that the degree of improvement varies among the models, and some models may not benefit significantly from augmentation. The aggregate win percentage across all models is 71.57%, underscoring the overall efficacy of data augmentation in enhancing regression model performance. The code base for comparison is available at the following Github link.

5. Conclusion and future work

This paper introduces innovative techniques for augmenting regression datasets, specifically addressing the challenge of imbalanced data. We extend the widely used K Nearest Neighbor OverSampling (KNNOR) approach, commonly employed in imbalanced classification datasets, to work effectively in regression datasets. Unlike classification data, regression datasets require estimating continuous target values for newly created data points. To tackle this, we enhance the KNNOR algorithm by keeping track of the neighboring data points used in generating new data points and accumulating their corresponding target values. The target value of the newly created point is then estimated based on the distances to these neighboring points.

Furthermore, we propose the use of Target Aware Autoencoders and Target Aware AutoInflaters as an alternative approach for estimating target values in regression datasets. This involves incorporating a weighted component of target estimation into the loss equation while training the autoencoders. By learning the features of the rare dataset and fine-tuning the weights, these models can predict continuous values of the target variable.

In addition to numerical datasets, we also introduce a two-tier autoencoder scheme tailored for imbalanced image datasets. This scheme enables the extraction of features and learning of target values at subsequent levels of autoencoders, allowing for effective augmentation of imbalanced image data.

Our experimental results clearly demonstrate the superior performance of the proposed approaches in comparison to previous state-of-the-art methods. In an effort to facilitate the adoption of these techniques, we provide the code as a user-friendly tool, accessible even to non-specialist users. These methods hold significant potential for improving critical areas of regression where imbalanced data presents a significant challenge.

To make our code readily accessible, we host it on GitHub and have created a Python package. This code primarily focuses on the core features of our concept, particularly in managing imbalanced numeric datasets by applying distribution-aware oversampling. The code base can be found on Github and the python package can be installed from pypi. A Jupyter notebook is also available to the reader as an end-to-end implementation of the algorithm.

5.1. Future research directions

Building on the findings of this study, the following research endeavors can be pursued:

1. Expansion to Other Data Types: While this paper focuses on numerical and image datasets, future work could explore the application of these augmentation techniques to other data types, such as time-series or audio data, where imbalance also poses significant challenges.
2. Integration with Deep Learning Models: Further research could investigate the integration of the proposed augmentation methods directly within deep learning training pipelines, potentially enhancing model performance by providing more balanced and informative training data.
3. Automated Feature and Target Value Adjustment: Developing algorithms that automatically adjust feature representations and target values based on the specific characteristics of the dataset could lead to more generalized and effective augmentation strategies.
4. Cross-Domain Application: Examining the applicability and effectiveness of the proposed techniques in diverse domains such as finance, healthcare, and environmental science could reveal insights into their versatility and adaptability to various types of regression problems.
5. Enhancing Autoencoder Architectures: Future studies might focus on optimizing the architecture of Target Aware Autoencoders and AutoInflaters, including exploring different neural network models and training procedures to further improve their accuracy in estimating target values.

By pursuing these future directions, the research community can build upon the foundation laid by this paper, driving forward the development of robust solutions to the persistent challenge of imbalanced data in regression analysis.

Code availability

The code base is available at https://github.com/ashhadulislam/augmentdatalib_reg_source/tree/main and the python package can be installed from https://pypi.org/project/knnor-reg/. The following Jupyter notebook contains an end-to-end example implementation of the algorithm: https://github.com/ashhadulislam/augmentdatalib_reg_source/blob/main/example/Example.ipynb. This information has been added to the conclusion section (Section 5) in the paper.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
18
S.B. Belhaouari et al. Expert Systems With Applications 252 (2024) 124118

Data availability

Open source links to the codebase have been mentioned in the paper.

Acknowledgments

Open Access funding provided by the Qatar National Library.

References

Agarap, A. F. (2018). Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375.
Ali, M. (2020). PyCaret: An open source, low-code machine learning library in Python. PyCaret version 1.0.
Barupal, D. K., & Fiehn, O. (2019). Generating the blood exposome database using a comprehensive text mining and database fusion approach. Environmental Health Perspectives, 127(9), 2825–2830.
Branco, P., Torgo, L., & Ribeiro, R. P. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR), 49(2), 1–50.
Branco, P., Torgo, L., & Ribeiro, R. P. (2019). Pre-processing approaches for imbalanced distributions in regression. Neurocomputing, 343, 76–99.
Camacho, L., Douzas, G., & Bacao, F. (2022). Geometric SMOTE for regression. Expert Systems with Applications, Article 116387.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Derrac, J., Garcia, S., Sanchez, L., & Herrera, F. (2015). Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17.
dos Santos Coelho, L., Hultmann Ayala, H. V., & Cocco Mariani, V. (2024). CO and NOx emissions prediction in gas turbine using a novel modeling pipeline based on the combination of deep forest regressor and feature engineering. Fuel, 355, Article 129366.
Douzas, G., & Bacao, F. (2019). Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Information Sciences, 501, 118–135.
Elhassan, T., & Aljurf, M. (2016). Classification of imbalance data using tomek link (t-link) combined with random under-sampling (rus) as a data reduction method. Global Journal of Technology and Optimization S, 1, 2016.
Elor, Y., & Averbuch-Elor, H. (2022). To SMOTE, or not to SMOTE? arXiv preprint arXiv:2201.08528.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from imbalanced data sets: vol. 10, Springer.
Gan, D., Shen, J., An, B., Xu, M., & Liu, N. (2020). Integrating TANBN with cost sensitive classification algorithm for imbalanced data in medical diagnosis. Computers & Industrial Engineering, 140, Article 106266.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Islam, A., & Belhaouari, S. B. (2021). Class aware auto encoders for better feature extraction. In 3rd International conference on electrical, communication, and computer engineering (pp. 1–5). IEEE.
Islam, A., Belhaouari, S. B., Rehman, A. U., & Bensmail, H. (2022a). K nearest neighbor OveRsampling approach: An open source python package for data augmentation. Software Impacts, 12, Article 100272.
Islam, A., Belhaouari, S. B., Rehman, A. U., & Bensmail, H. (2022b). KNNOR: An oversampling technique for imbalanced datasets. Applied Soft Computing, 115, Article 108288.
Johnson, J. M., & Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6(1), 1–54.
Juez-Gil, M., Arnaiz-González, Á., Rodríguez, J. J., & García-Osorio, C. (2021). Experimental evaluation of ensemble classifiers for imbalance in big data. Applied Soft Computing, 108, Article 107447.
Kohler, M., & Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4), 2231–2249.
Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221–232.
Kubat, M., Matwin, S., et al. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In International conference on machine learning: vol. 97, (pp. 179–186). Morgan Kaufmann.
Laza, R., Pavón, R., Reboiro-Jato, M., & Fdez-Riverola, F. (2011). Evaluating the effect of unbalanced data in biomedical document classification. Journal of Integrative Bioinformatics, 8(3), 105–117.
Liu, N., Shen, J., Xu, M., Gan, D., Qi, E.-S., & Gao, B. (2018). Improved cost-sensitive support vector machine classifier for breast cancer diagnosis. Mathematical Problems in Engineering, 2018, 1–13.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 39(2), 539–550.
Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., & Zafeiriou, S. (2017). Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 51–59).
Natekin, A., & Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7, 21.
Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on international conference on machine learning (pp. 833–840). Omni Press.
Rothe, R., Timofte, R., & Van Gool, L. (2018). Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2–4), 144–157.
Segal, M. R. (2004). Machine learning benchmarks and random forest regression. eScholarship.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04), 687–719.
Thanathamathee, P., & Lursinsap, C. (2013). Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques. Pattern Recognition Letters, 34(12), 1339–1347.
Torgo, L., Branco, P., Ribeiro, R. P., & Pfahringer, B. (2015). Resampling strategies for regression. Expert Systems, 32(3), 465–476.
Torgo, L., & Ribeiro, R. (2007). Utility-based regression. In PKDD 2007: 11th European conference on principles and practice of knowledge discovery in databases: vol. 7, (pp. 597–604). Springer.
Torgo, L., Ribeiro, R. P., Pfahringer, B., & Branco, P. (2013). Smote for regression. In Progress in artificial intelligence: 16th portuguese conference on artificial intelligence (pp. 378–389). Springer.
Tunçay, T., Alaboz, P., Dengiz, O., & Başkan, O. (2023). Application of regression kriging and machine learning methods to estimate soil moisture constants in a semi-arid terrestrial area. Computers and Electronics in Agriculture, 212, Article 108118.
Vapnik, V., & Vapnik, V. (1998). Statistical learning theory. Wiley, New York, 1(624), 2.
Wang, Y., Yao, H., & Zhao, S. (2016). Auto-encoder based dimensionality reduction. Neurocomputing, 184, 232–242.
Yang, Y., Zha, K., Chen, Y., Wang, H., & Katabi, D. (2021). Delving into deep imbalanced regression. In Proceedings of the 38th international conference on machine learning (pp. 11842–11851). MLR Press.
Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010). Deconvolutional networks. In 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 2528–2535). IEEE.
Zhong, J., He, Z., Guan, K., & Jiang, T. (2023). Investigation on regression model for the force of small punch test using machine learning. International Journal of Pressure Vessels and Piping, 206, Article 105031.
