Particle_Swarm_Optimization-Based_Random_Forest_Framework_for_the_Classification_of_Chronic_Diseases

This paper presents a hybrid machine learning approach combining Particle Swarm Optimization (PSO) and Random Forest (RF) for the classification of chronic diseases (CDs), addressing issues like misdiagnosis due to overlapping symptoms and data imbalances. The proposed PSORF framework improves data quality using SMOTE and EM Imputation techniques, and demonstrates superior performance in accuracy and other metrics compared to traditional classifiers across five chronic disease datasets. The study highlights the effectiveness of metaheuristic optimization in enhancing machine learning models for medical diagnosis.


Received 13 October 2023, accepted 13 November 2023, date of publication 28 November 2023, date of current version 1 December 2023.


Digital Object Identifier 10.1109/ACCESS.2023.3335314

Particle Swarm Optimization-Based Random Forest Framework for the Classification of Chronic Diseases
AKANSHA SINGH 1, NUPUR PRAKASH 2, AND ANURAG JAIN 1
1 University School of Information, Communication, and Technology, Guru Gobind Singh Indraprastha University, Delhi 110078, India
2 Department of Computer Science and Engineering, The Northcap University, Gurugram 122017, India

Corresponding author: Anurag Jain ([email protected])

ABSTRACT In this paper, a hybrid metaheuristic-based machine learning approach is proposed for the classification of various Chronic Diseases (CDs). CDs are often misdiagnosed because of issues such as similar and overlapping symptoms, sensitive devices, and a lack of clinical experts. To address these issues, this study uses a fusion of Particle Swarm Optimization with Random Forest (PSORF) for the automatic identification of CDs. PSORF comprises two main components: PSO, which obtains the minimal optimal feature set and thereby optimizes the performance of the RF classifier, and the RF classifier, which classifies multiple CDs. Five different CD datasets have been used, and a series of experiments has been conducted to identify the best approach for the classification of CDs. To address the imbalanced and incomplete data in these datasets, the Synthetic Minority Oversampling Technique (SMOTE) and Expectation-Maximization (EM) imputation have been applied before training the model, so that data quality is improved before analysis. Furthermore, the performance of the PSO and RF components has been compared with other metaheuristic and ML classifiers in terms of different performance metrics. For this purpose, Friedman's test has been employed to calculate the mean ranks of all the classifiers across all the datasets for different metrics. The results show that the proposed technique achieved the highest mean rank in terms of Accuracy, F-measure, and Receiver Operating Characteristic (ROC) across all five datasets.

INDEX TERMS Chronic diseases, machine learning, metaheuristic techniques, multi-classification, PSO,
SMOTE.

I. INTRODUCTION
Chronic diseases (CDs) are long-lasting diseases that cause millions of deaths and disabilities worldwide. CDs are on the rise especially after the pandemic, as the virus affects not only the lungs but also other parts of the body.1 Such diseases cannot be cured completely but can be controlled and treated only if detected early.2 With regard to the classification of CDs, this study has focused on three different domains of CDs as shown in Figure 1.
CDs such as heart disease, lung disease, cancer, and diabetes are the leading causes of death and disability worldwide. Their symptoms show up at later stages, which makes them even harder to treat. The most prevalent form of heart disease is Coronary Artery Disease (CAD), which occurs when a major artery (such as the Left Anterior Descending (LAD), Left Circumflex (LCX), or Right Coronary Artery (RCA)) becomes narrowed due to stenosis. The deaths resulting from CAD were reported as 382,820 in 2020 [1]. The symptoms of CAD include shortness of breath, chest pain, chest tightness, and sweating.

The associate editor coordinating the review of this manuscript and approving it for publication was Kostas Kolomvatsos.
1 Post-COVID symptoms and effects, accessed on 19/06/2023.
2 Early diagnosis of Chronic diseases, accessed on 19/06/23.

2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 11, 2023
A. Singh et al.: PSORF Framework for the Classification of CDs

FIGURE 1. Taxonomy of the chronic diseases utilized in this study.

The next group of CDs is respiratory diseases, consisting of Chronic Obstructive Pulmonary Disease (COPD), Asthma, Pneumonia, Tuberculosis (TB), etc. [2]. This group of diseases has been greatly affected by the COVID-19 pandemic, as the virus directly attacks the lungs, making them less immune to other respiratory diseases. The number of deaths reported due to COPD, Asthma, Lung cancer, and Pneumonia are 3.2, 260, 1.8, and 2.4 million respectively.3 Such chronic diseases give rise to a range of symptoms, such as shortness of breath, excessive mucus production, chest pain, tightness in the chest, coughing, and many more. Breast cancer is currently the most prevalent form of cancer, resulting in 685,000 fatalities worldwide.4 Similarly, Diabetes, a chronic metabolic disease, contributes to 1.5 million deaths each year.5 It is riskier as it can affect other major organs such as the heart and kidneys. Although there are various well-organized primary care approaches for treating CDs, such as the Spirometry pulmonary function test for COPD, X-rays and scans for other lung diseases, surgical removal, radiation therapy, and mammograms for Breast cancer [3], and Angiography for CAD, there are various issues related to such treatments, as follows:
• Overdiagnosis in case of mammograms, radiation injury during chemotherapy.6
• The miniature size of tumors and lung nodules cannot be read clearly by clinical experts.7
• Misclassification of diseases due to similar and overlapping symptoms.8
• Expensive medical procedures such as Angiography.9
• Reliance on manual labeling by doctors, which is time-consuming and laborious.10

3 Respiratory disease, Number of deaths reported, accessed on 19/06/23.
4 Breast cancer, factsheet for Breast Cancer, accessed on 20/06/23.
5 Diabetes, accessed on 20/09/23.
6 Overdiagnosis of mammograms, accessed on 22/06/23.
7 Missed detection of lung cancer, accessed on 22/06/23.
8 Misdiagnosis of lung diseases, accessed on 22/06/23.
9 Coronary artery Angiogram, accessed on 22/06/23.
10 Manual data labeling, accessed on 22/06/23.

Therefore, in response to the aforementioned issues, several researchers have proposed implementing computer-aided diagnosis, which can simplify the work of physicians and decrease the likelihood of misdiagnosis. Previous research on various CDs [4], [5], [6], [7], [8], [9], [10], [11] has demonstrated that it is feasible to detect and categorize CDs using Machine Learning (ML) technology. However, such approaches have shown lower performance and exhibit limitations such as imbalanced data, the accuracy paradox, missing data problems, and slow convergence of metaheuristic techniques like the Genetic Algorithm (GA). Some studies even achieved excellent performance. However, that might be a case of the accuracy paradox, a condition that occurs when models achieve excellent performance by training on biased or imbalanced data. Hence, this study aims at providing a metaheuristic-based framework, Particle Swarm Optimization based Random Forest (PSORF), that can diagnose CDs efficiently while dealing with the issues found in previous studies. In this study, the problems of imbalanced data and missing data have been rectified by utilizing the SMOTE filter and the EM Imputation method respectively. The slow convergence problem of GA has been resolved by using PSO. The ability of PSO to search larger spaces efficiently, its lower computational cost, and its faster convergence have made it an effective and efficient global search technique as compared to other techniques such as the Genetic, Bat, and Firefly algorithms. Similarly, Random Forest (RF) has great advantages over other ML techniques, such as its ability to deal with missing and imbalanced data and to reduce overfitting by using an ensemble of various decision trees. The performance of the proposed approach has been evaluated through a series of experiments on five different chronic disease datasets and compared to other benchmark metaheuristic and ML techniques. The results show that the proposed approach surpasses the other techniques. The major contributions of this study are listed as follows:
• A hybrid metaheuristic-based ML classifier (PSORF) has been proposed that can not only diagnose a disease but can also differentiate various similar CDs based on symptomatic information.
• EM Imputation and SMOTE techniques have been employed to fill in the missing values and treat imbalanced data problems respectively.
• Performance of different metaheuristic techniques such as PSO, GA, Bat, and Firefly Algorithm (FA) has been compared using radar charts.
• Friedman's Test has been utilized to corroborate the performance of the proposed approach with the other ML classifiers by comparing their mean ranks.
The remaining sections of this paper are structured as follows: Section II discusses the previous research and


identifies gaps in the use of ML techniques for detecting CDs. Section III outlines the materials and methods utilized in this study. In addition, it explains the benchmark feature selection and ML techniques in brief. Furthermore, it explains the proposed methodology and all its stages in detail. Section IV illustrates the experimental work carried out on different datasets using the proposed approach and shows the comparison of the proposed approach with other ML and metaheuristic techniques. It further explains the limitations and future work of the study. Section V concludes the study.

II. RELATED WORK AND RESEARCH GAPS
In the literature, several researchers have examined numerous ML models for the detection of various CDs to help clinical decision-making. In this regard, this section discusses multiple works done for the classification of Breast Cancer, heart, Diabetes, and respiratory diseases as shown in Tables 1, 2, 3, and 4 and also identifies the research gaps.
Upon review of prior research, it was found that many studies employed metaheuristic optimization algorithms for feature selection and different machine learning (ML) and deep learning (DL) models for disease classification, as outlined in Tables 1, 2, 3, and 4. While these previous studies have yielded promising outcomes, there are still some areas for further research and improvement, as described below.
• Imbalanced dataset: It must be acknowledged that previous research has frequently depended on imbalanced datasets to predict diseases, producing biased outcomes. However, a study conducted by Zhang et al. [21] resolved this issue by utilizing the SMOTE filter. It is crucial to meticulously scrutinize potential biases when interpreting research findings.
• Missing data: In this study, it was found that the Exasens dataset contains some missing values that must be addressed before being used in the training model. If left untreated, such values can significantly impact the accuracy of the classification model. Previous studies by Ramachandra and Murthy [27] and Gill and Pathwar [28] did not address these missing values. However, Amutha and Sekar [26] utilized the KNN Imputation method to address this issue. While this method is effective, adaptive, and flexible, it can be susceptible to outliers and is computationally expensive.
• Lower performances: Previous studies have clearly demonstrated that certain datasets exhibit lower performance levels due to missing data, lack of feature selection, and computationally heavy models [22], [29], [30]. It has been observed that studies that employed metaheuristic-based ML classifiers outperformed those using DL models when comparing studies that utilized the same dataset. While DL models are known for their automatic feature selection, it is important to note that tuning these features and the model's parameters can consume a significant amount of computational resources. On the other hand, utilizing metaheuristic optimization algorithms for feature selection with ML classifiers greatly reduces the computational power required. Therefore, it can be concluded that Metaheuristic Optimization (MHO) based ML classifiers have been shown to outperform DL models.
• Accuracy Paradox: Despite various issues and research gaps, previous studies achieved excellent performances. The accuracy paradox may be at play here. Even though the training model achieves high accuracy levels, it has low predictive value. This is especially true when handling an imbalanced Breast cancer dataset, where the accuracy rate can be over 97% in all cases [12], [13], [14], [15], [16], [17]. However, such a model trained on this data may not perform well in identifying cancer patients in real-life situations, despite producing accurate training results due to a high proportion of cancer patients' examples.
• No statistical testing: It is worth noting that only a few studies in the literature utilized statistical testing to validate their models and achieve optimal performance [13], [22]. Most studies instead compared various ML and DL models using different performance metrics to determine the top performer. However, these results were not adequately explained in those studies.
In order to create a reliable and effective model, this study has addressed all of the research gaps mentioned previously. The issue of imbalanced and missing data is tackled in Section III, while Section IV thoroughly explains and confirms the classification performance of the proposed model.

III. MATERIALS AND METHODS
In this section, the materials and methods utilized in this study are examined. It describes the different datasets employed in this study and then discusses the benchmark metaheuristic and ML classification techniques. It further showcases the different stages of the proposed methodology in detail.

A. DATASETS
This study has employed five publicly available datasets as evaluation benchmarks: the International Conference on Biomedical Health Informatics (ICBHI) lung sound database [35], Wisconsin Breast Cancer Dataset (WBCD) [36], Z-Alizadehsani dataset [37], Exasens dataset [38], and Diabetes dataset [39], collected from UCI library, Kaggle, and dataworld. For ease of reference, datasets ICBHI, WBCD, Z-Alizadehsani, Exasens, and Diabetes have been specified as D1, D2, D3, D4, and D5 respectively. Detailed information regarding each dataset has been presented in Table 5. In this study, structured data consisting of symptomatic information in accordance with the respective diseases has been considered for the evaluation of ML classifiers for classifying Chronic diseases. The distribution of instances


TABLE 1. Previous works done for the detection of breast cancer using different feature selection and ML approaches on the Wisconsin breast cancer dataset.

TABLE 2. Previous works done for the detection of Coronary artery disease using different feature selection and ML approaches on the Z-Alizadehsani dataset.

TABLE 3. Previous works done for the detection of Diabetes using different feature selection and ML approaches on the Vanderbilt diabetes dataset.

into different classes corresponding to different diseases is shown in Figure 2. Further details regarding the datasets are mentioned as follows:

TABLE 4. Previous works done for the detection of Respiratory Disease using different feature selection and ML approaches on the ICBHI and Exasens datasets.

TABLE 5. Deployment of datasets for the identification of chronic diseases.

FIGURE 2. Distribution of instances into the number of target classes corresponding to datasets a) ICBHI dataset D1, b) WBCD dataset D2, c) Z-Alizadehsani dataset D3, d) Exasens dataset D4, e) Diabetes dataset D5.

• ICBHI Respiratory Sound Database: The dataset was collected by two research teams in Portugal and Greece. It consists of 920 annotated recordings of length ranging from 10s to 90s making it a total of 5.5 hours of recordings. The recordings are collected from 126 patients. It contains 6898 respiratory cycles wherein 1864, 886, and 506 contain crackles, wheezes and both crackles and wheeze respectively [40], [41].
• WBCD: The dataset was created at the University of Wisconsin Hospitals in 1992. The attribute "diagnosis" has been denoted as the class label that classifies the tumor as Malignant (M) and Benign (B). In the literature, the majority of the papers worked on unstructured data for Breast cancer like mammograms [42], [43].
• Z-Alizadehsani Dataset: The data was collected from heart disease patients at Shaheed Rajaei Cardiovascular, Medical, and Research Center, Tehran, Iran. This dataset is an extension of the Z-Alizadehsani dataset and was collected from the UCI library. In this dataset, the information about the major three arteries has been added increasing the total number of attributes to 59. The attributes are grouped into four categories: demographic information, symptoms and examination, ECG, and laboratory and echo features [40], [44].
• Exasens Dataset: The dataset was collected at Research Center Borstel, Germany. It contains information regarding the four groups of saliva samples namely, COPD, Asthma, Infected, and healthy [40].
• Diabetes Dataset: The dataset utilized in this study is a modified version of the original Vanderbilt Diabetes dataset [45] originated from a study conducted on rural African Americans. The original dataset consisted of


patients with several missing values. Before deploying the dataset into this study, 13 patients with heavily missing data were excluded.11

11 Diabetes dataset, Modified dataset by Robert Hoyt, accessed on 20/09/23.

B. BENCHMARK TECHNIQUES
This section discusses the benchmark techniques utilized in this study for comparing and validating the performance of the proposed approach. As mentioned earlier, the proposed approach comprises two components, i.e., PSO and RF. Hence, for comparison purposes, two sets of benchmark techniques have been utilized. One set is for comparing feature selection techniques and the other set is for comparing the proposed approach with state-of-the-art classifiers.

1) FEATURE SELECTION
In this study, to compare and validate the performance of PSO, three benchmark metaheuristic optimization techniques, GA [15], [16], [19], Bat [19], and FA [19], have been employed. These are population-based algorithms where the agents perform both local and global searches. They are iterative in nature. They generally start from a randomly chosen solution and move forward. The goal is to find an optimal solution at each iteration until no further improvements can be made. It is not advisable to use the Firefly algorithm as one of the benchmark techniques due to its "center bias operator" problem [46], because this operator enables the algorithm to optimize its function in a way that places its respective optima in the center of the feasible set. Despite this, numerous studies in the literature have utilized this algorithm for feature selection and tuning of hyper-parameters of ML classifiers. For comparison purposes, this study has incorporated both types of MHO algorithms, one with and others without a center bias operator problem.

2) ML CLASSIFIERS
This section discusses the cutting-edge classifiers that were employed to assess and verify the effectiveness of the proposed method.
• Naïve Bayes: This supervised learning classifier is an amalgamation of two terms: The term "naive" indicates that the algorithm assumes conditional independence between all features, given the value of the class variable. On the other hand, the term "Bayes" indicates that the method is based on the Bayes theorem [12], [27]. This theorem describes the relationship between the class variable (denoted as z) and the dependent feature vectors (y1 through yn), as shown in (1).
$$P(z \mid y_1, \ldots, y_n) = \frac{P(z)\, P(y_1, \ldots, y_n \mid z)}{P(y_1, \ldots, y_n)} \tag{1}$$
There are different versions of Naïve Bayes which differ only in terms of the assumption they make regarding the distribution $P(y_i \mid z)$ [32].
• Multilayer Perceptron (MLP): It is the simplest form of neural network that learns a function $f(\cdot) : R^p \rightarrow R^q$ by training on a dataset where p is the number of dimensions for the input and q is the number of dimensions for the output [13], [16]. The model consists of three layers: the "Input layer" consists of a set of neurons $\{x_i \mid x_1, x_2, \ldots, x_n\}$ indicating the input features, the middle layer is the "Hidden layer" consisting of one or more layers containing neurons that transform the previous layer values into a weighted linear summation and then apply a non-linear activation function $g(\cdot) : R^p \rightarrow R^q$, and the last layer is the "Output layer" that receives the input from the hidden layer and transforms it into the output values [25].
• Sequential Minimal Optimization (SMO): A supervised learning algorithm designed for the training of SVM, as its training requires solving large complicated Quadratic Programming (QP) optimization problems. This problem becomes more cumbersome when dealing with large datasets, leading to a running time of $O(N^3)$ [47]. SMO breaks these large QP problems into small QP problems which can then be solved analytically. As a result, SMO scales between linear and quadratic in the training set size, hence making it faster than SVM.
• Bagging: It is an averaging ensemble classifier that builds several estimators independently and then averages their predictions. The idea is that the combined estimators perform better than single estimators due to the reduction in variance. It works best with strong and complex models as it reduces overfitting [23].

C. PROPOSED METHODOLOGY
This section introduces the details of the proposed approach PSO-RF for the multiclassification of Chronic Diseases. Additionally, various stages of the proposed approach have been exhibited in Figure 3. The key elements of each stage are briefly elaborated on as follows:

1) STAGE 1: DATA PREPROCESSING
In this stage, the original raw data has been treated in terms of quantity and quality by having it pass through different sub-stages to enhance the performance of the proposed approach. The various sub-stages are shown in Figure 4. The datasets were first checked for their types. Among all the datasets, dataset D1 was unstructured and needed to be converted into structured data using Python programming. Hence, the .csv file containing the patient id and disease has been aligned with the .txt file of different .wav files to get a structured file. In addition, it has been observed from Table 5 that the Exasens dataset suffers from a missingness problem: 33.36% of all values in the dataset are missing. In this regard, this study has deployed


FIGURE 3. Overview of the proposed PSO-RF approach. The PSO-RF consists of a preprocessing module, a metaheuristic feature selector, and an ensemble Random Forest classifier.

FIGURE 4. Representation of multiple stages of preprocessor module for treating raw unstructured data.

TABLE 6. Increment in the number of instances after applying SMOTE across all the datasets.

TABLE 7. Balancing the weights of an imbalanced ICBHI dataset D1 using classbalancer.

Expectation-Maximization (EM) Imputation technique to fill in the missing values. Furthermore, all the datasets employed in this study have an imbalanced distribution of instances among different classes. To tackle this problem, the SMOTE [19], [21] technique has been utilized for datasets D2, D3, D4, and D5. It creates synthetic examples of the minority class instances using the K-nearest neighbor. After the application of the SMOTE filter, the rise in the number of instances can be seen in Table 6.
However, in the case of dataset D1, the distribution of instances is highly skewed. The majority class (COPD) has 8723 instances and the minority class (Asthma) has 11 instances. Similarly, for other classes, the number of instances is much less as compared to the majority class. Increasing the number of instances through oversampling using SMOTE would escalate the total number of instances to approximately 70k, which is too high to be handled by the model. With such a large number of instances, the probability of getting highly noisy data is also high. Therefore, the authors utilized the Class Balancer filter to equally assign weights to all the classes as shown in Table 7.
The Class Balancer filter has reassigned equal weights to different class instances in such a way that the total sum of the instance weights, i.e., 10120, remains the same even after balancing them. This tells the classifier that each class holds equal importance and must not be ignored.

2) STAGE 2: PARTICLE SWARM OPTIMIZATION
The second phase is the Feature selection (FS) phase, which deals with selecting the best feature subset that can aid in achieving optimal results. This is an optional step as it is not always required. However, FS is crucial when dealing with


Algorithm 1 Particle Swarm Optimization Based Random Forest Approach

Require: A training set S = (p1, q1), ..., (pn, qm), feature set F, number of trees in forest B, generation counter t = 1, maximum generations T
Ensure: An optimal feature set F_i, output H: predicted disease

PSO(F)
  {Initialization of PSO parameters}
  for each particle i ∈ 1, ..., N do
    Position = Xi(0)
    Velocity = Vi(0)
    pbest_i = Xi(0)
    gbest ← best of pbest
  end
  while t < T do
    {Update pbest and gbest of each particle}
    if f(Xi) < f(pbest_i) then
      pbest_i(t) = Xi(t)
      gbest_i(t) ∈ {pbest_1(t), ..., pbest_m(t)} | f(gbest_i(t)) = min{f(pbest_1(t)), ..., f(pbest_m(t))}
    end
    for i = 1; i ≤ N; i++ do
      {Update Velocity and Position}
      Vi(t + 1) = w Vi(t) + c1 r1 (pbest_i(t) − Xi(t)) + c2 r2 (gbest_i(t) − Xi(t))
      Xi(t + 1) = Xi(t) + Vi(t + 1)
      Evaluate fitness function of Xi(t + 1)
    end
    t = t + 1
  end
  return F_i

RandomForest(S_Fi)
  O ← ∅
  for i = 1; i ≤ B; i++ do
    S_i ← a random sample from S
    o_i ← RandTree(S_Fi)
    O ← O ∪ {o_i}
  end
  return O

RandTree(S_F)
  for each node do
    sf ← a small subset of F_i
    Split on best features of sf
  end
  return H
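
The RandomForest and RandTree procedures of Algorithm 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: each tree is simplified to a one-split decision stump, and the function names (`majority`, `train_stump`, `random_forest`, `predict`), the toy data, and the parameter values are assumptions introduced purely for illustration. The sketch shows the three ingredients the algorithm relies on: a bootstrap sample per tree, a random feature subset (sf) for the split, and a majority vote over all trees.

```python
import random
from collections import Counter

def majority(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, feature_ids):
    """Fit a one-split tree: pick the (feature, threshold) pair with the fewest training errors."""
    fallback = majority(labels)
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            left_label = majority(left)  # never empty: t is an observed value
            right_label = majority(right) if right else fallback
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def random_forest(rows, labels, n_trees, n_sub, rng):
    """Train n_trees stumps, each on a bootstrap sample and a random subset of n_sub features."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap: sample with replacement
        feats = rng.sample(range(len(rows[0])), n_sub)   # random feature subset (sf)
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Toy, made-up data: two features, two well-separated classes.
rows = [(float(x), x + 0.5) for x in range(5)] + [(float(x), x + 0.5) for x in range(10, 15)]
labels = [0] * 5 + [1] * 5
forest = random_forest(rows, labels, n_trees=15, n_sub=1, rng=random.Random(0))
```

In a full Random Forest the trees are grown deeper and a fresh feature subset is drawn at every node; with a single-node stump the per-tree and per-node subsets coincide, which keeps the sketch short.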

Chronic disease metadata as the diagnosis of a disease is done using the differential diagnosis method where the idea is to rule out the non-related diseases. Hence, a lot of tests such as laboratory tests, scans, X-rays, and blood tests were done, all of which are not really required, and also may not be related to the actual disease. And this unrelated existence of these tests might cause an overfitting problem [34]. Therefore, FS is essential before training the classification model as it will lead to a faster, more accurate, and cost-effective model.
For this purpose, this study has utilized a metaheuristic approach PSO introduced by Kennedy and Eberhart [48]. It is a stochastic population-based approach influenced by fish schooling or bird flocking behavior. It is different from other optimization algorithms like Differential Evolution in terms that it does not depend on any gradient or differential gradient. It simply explores and exploits the search space using the particle's position and velocity information. There are various advantages of PSO including being computationally inexpensive, having low system requirements, faster convergence, easy implementation, etc. [49]. It is mostly used for finding the maxima or minima of a function defined over a multidimensional vector space. It performs feature selection by considering the features as


particles in a high dimensional space where each particle in the swarm is a candidate solution. The fitness function is calculated for each particle in the swarm based on its position [13], [16], [19]. Each particle's position is represented as $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, where d denotes the dimension. Likewise, every particle has an associated velocity, denoted by $V_i = (v_{i1}, v_{i2}, \ldots, v_{id})$. After each iteration, the velocity and position values at any time instant t and t + 1 for each particle are updated as shown in (2) and (3) respectively.
$$V_i(t+1) = w V_i(t) + c_1 r_1 \left(pbest_i(t) - X_i(t)\right) + c_2 r_2 \left(gbest_i(t) - X_i(t)\right) \tag{2}$$
$$X_i(t+1) = X_i(t) + V_i(t+1) \tag{3}$$

TABLE 8. Description of different evaluation metrics utilized in this study.
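
A single application of updates (2) and (3) to one particle can be sketched in Python as follows. The function name `pso_step` and the default values `w=0.7`, `c1=1.5`, and `c2=1.5` are assumptions chosen for illustration and are not taken from the paper; `r1` and `r2` are drawn uniformly from [0, 1) once per update.

```python
import random

def pso_step(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Apply one velocity update (2) and one position update (3) to a single particle.

    position, velocity, pbest, and gbest are equal-length lists of floats.
    The default coefficients are illustrative assumptions, not the paper's settings.
    """
    r1, r2 = rng.random(), rng.random()  # random factors in [0, 1)
    new_velocity = [
        w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
        for x, v, pb, gb in zip(position, velocity, pbest, gbest)
    ]
    new_position = [x + nv for x, nv in zip(position, new_velocity)]
    return new_position, new_velocity
```

A particle that already sits on both its personal best and the global best is moved only by its inertia term: with zero velocity it stays where it is, and with a non-zero velocity it drifts by `w` times that velocity.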
In the above equations, $w$ is the inertia constant, with values between 0 and 1. It determines how much of its previous velocity each particle retains. Similarly, $r_1$ and $r_2$ are random numbers ranging from 0 to 1. Meanwhile, $c_1$ and $c_2$ are coefficients linked to the cognitive and social aspects. They control the trade-off between exploration and exploitation, as $c_1$ helps in finding the local minima and $c_2$ helps in finding the global minimum among the local minima. The optimal local and global values are determined by the variables $pbest$ and $gbest$, respectively. These variables depend on the position of the particle $X_i(t)$ as shown in (4) and (5). To determine the $pbest$ and $gbest$ values, the fitness function ($f$) of a particle at instant $t + 1$ is compared with its fitness at instant $t$.

$$pbest_i(t) = X_i(t) \quad \text{if } f(X_i) < f(pbest_i) \tag{4}$$

$$gbest_i(t) \in \{pbest_1(t), \ldots, pbest_m(t)\} \;\big|\; f(gbest_i(t)) = \min\{f(pbest_1(t)), \ldots, f(pbest_m(t))\} \tag{5}$$

The complete procedure for the proposed approach is illustrated in Algorithm 1, where the preprocessed training set $S = (p_1, q_1), \ldots, (p_n, q_m)$ consists of n rows and m columns, and S ∈ D, i.e., S could be any of the five datasets D. The selected optimal feature set $F_i$ was then passed to the Random Forest training model. The goal was to select the feature set that maximizes the classification accuracy and minimizes the number of features. To achieve this goal, the fitness function ($f$) set for PSO is shown in (6).

$$\text{Fitness}(f) = \theta \cdot acc(f) + (1 - \theta) \cdot \left(1 - \frac{N_s}{N_f}\right) \tag{6}$$

where $N_s$ and $N_f$ denote the number of selected features and the total number of features, respectively. The classification accuracy is denoted by $acc(f)$, and $\theta$ is the weighting factor between the classification accuracy and the number of selected features.

3) STAGE 3: TRAINING ON RANDOM FOREST CLASSIFIER
This study has utilized a Random Forest classifier, an ensemble technique, for the classification of CDs as shown in Algorithm 1. The basic idea of RF is to form a single strong classifier by combining multiple decision trees, either by taking the average of their outputs or by taking the majority vote. In previous works, RF has shown excellent performance as compared to other classifiers [12], [15]. The reason is that it uses bagging for the ensemble process, which reduces the correlation between the trees. Also, the variance and overfitting of the classifier are reduced [20], [31]. Moreover, by restricting the features, the decision trees can learn faster and hence can be built in a small amount of time.
Algorithm 1 also considers a forest L comprising various small decision trees l, wherein for each l belonging to L, it selects a bootstrap sample S* from S. Furthermore, for each node of the tree, a very small feature set sf is obtained from F, which is then used for node splitting.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
The experimental work conducted on the five chronic disease datasets, namely D1, D2, D3, D4, and D5, is discussed in this section. The experiments illustrate the efficacy of the components of the proposed model by comparing them with conventional feature selection and classification methods. Moreover, Friedman's test has also been employed as a statistical test for validating the performance of the proposed approach against previous methods.

A. EXPERIMENTAL SETUP
All experiments were run on a Windows 11 machine with an AMD Ryzen 5 4600H processor with Radeon Graphics and 24 GB of RAM. All the computations in this study have been done using three different software tools. The preprocessing and classification have been done using Weka and Jupyter Notebook. In addition, for statistical testing, the SPSS tool has been utilized.

B. EVALUATION METRICS
The various evaluation metrics utilized in this study for the classification of CDs have been described in Table 8.


TABLE 9. Value of parameters set for Genetic, PSO, Firefly, and Bat algorithm across all datasets.

FIGURE 5. Representation of a minimal optimal number of features selected by PSO, GA, Bat, and FA techniques corresponding to datasets D1, D2, D3, D4, and D5.

FIGURE 6. Classification output parameters of Naïve Bayes corresponding to different FS algorithms for datasets a) D1 (ICBHI), b) D2 (WBCD), c) D3 (Z-Alizadehsani), d) D4 (Exasens), and e) D5 (Diabetes).

In Table 8, TP, TN, FP, and FN denote True Positive, True Negative, False Positive, and False Negative, respectively. Similarly, TPR and FPR indicate the True Positive Rate and the False Positive Rate, respectively. Furthermore, for MAE, n is the total number of samples, Ei is the expected or actual value, and Oi is the observed value, i.e., the predicted value of the i-th data sample obtained by the classifier. For Kappa statistics, Pr(a) and Pr(e) denote the observed accuracy and the accuracy expected by chance, respectively.
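The quantities named above (TP, TN, FP, FN, TPR, FPR, MAE, and Kappa) can be computed as in the following Python sketch. Table 8 is not reproduced in this text, so the formulas below are the standard textbook definitions and are an assumption about what the table contains; the function names are hypothetical, and RMSE is included because it is reported alongside MAE later in the paper.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Standard binary metrics from confusion-matrix counts (assumed definitions)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total          # also Pr(a), the observed agreement
    tpr = tp / (tp + fn)                  # true positive rate (recall)
    fpr = fp / (fp + tn)                  # false positive rate
    # Pr(e): agreement expected by chance, from the marginal totals.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((tn + fn) / total) * ((tn + fp) / total)
    pr_e = p_pos + p_neg
    kappa = (accuracy - pr_e) / (1 - pr_e)
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr, "kappa": kappa}


def mae(expected, observed):
    """Mean absolute error between actual values E_i and predicted values O_i."""
    return sum(abs(e - o) for e, o in zip(expected, observed)) / len(expected)


def rmse(expected, observed):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(expected, observed)) / len(expected))


if __name__ == "__main__":
    # Made-up counts for illustration only; not results from the paper.
    print(confusion_metrics(tp=40, tn=40, fp=10, fn=10))
    print(mae([1, 0, 1, 1], [1, 1, 1, 0]), rmse([1, 0, 1, 1], [1, 1, 1, 0]))
```

Note that Kappa is undefined when Pr(e) equals 1, for example when every sample belongs to one class and every prediction is that class; the sketch does not guard against that case.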
C. COMPARISON OF PSO WITH OTHER OPTIMIZATION TECHNIQUES
This section discusses the effectiveness of the PSO optimization technique by comparing its performance with other state-of-the-art optimization-based feature selection methods, namely Genetic Algorithm (GA), Bat, and Firefly Algorithm (FA). In this regard, the parameters corresponding to PSO, GA, Bat, and FA have been set across all five datasets for determining the minimal optimal feature subset, as shown in Table 9.
The number of minimal attributes resulting from all four optimization techniques is shown in Figure 5.
Different techniques provided the minimal set of features across all the datasets except D1, as it already contained the minimal attributes. As a rule of thumb, neither too many nor too few features should be utilized for the prediction [25]. This study utilized the original set of features for D1, as the resulting optimal features were too few. For dataset D3, PSO and FA provided the minimal set of features, and for dataset D2, FA provided the minimal set. However, for dataset D4, all three techniques resulted in a minimal subset of features. Also, for dataset D5, PSO obtained a minimal optimal feature set of 9 attributes, which is larger than the feature sets provided by the other FS techniques. Furthermore, to validate the performance of PSO over the GA, FA, and Bat algorithms, radar charts have been


drawn across all five datasets for different ML classifiers. Each chart evaluates the performance of all optimization techniques for different evaluation metrics. Firstly, the FS techniques have been compared for the Naïve Bayes classifier as shown in Figure 6.
It can be observed from Figure 6 that for datasets D2, D3, and D5, PSO has shown the best performance. However, in the case of datasets D1 and D4, a similar number of attributes has been obtained by all the FS techniques, consequently leading to overlapping charts. Similarly, for MLP, SMO, Bagging, and RF, different charts have been obtained as shown in Figures 7, 8, 9, and 10, respectively.

FIGURE 7. Classification output parameters of MLP corresponding to different FS algorithms for datasets a) D1 (ICBHI), b) D2 (WBCD), c) D3 (Z-Alizadehsani), d) D4 (Exasens), and e) D5 (Diabetes).

FIGURE 8. Classification output parameters of SMO corresponding to different FS algorithms for datasets a) D1 (ICBHI), b) D2 (WBCD), c) D3 (Z-Alizadehsani), d) D4 (Exasens), and e) D5 (Diabetes).

FIGURE 9. Classification output parameters of Bagging corresponding to different FS algorithms for datasets a) D1 (ICBHI), b) D2 (WBCD), c) D3 (Z-Alizadehsani), d) D4 (Exasens), and e) D5 (Diabetes).

It is clear from the figures above that the minimal optimal feature set obtained from PSO has helped all the classifiers achieve their highest performance as compared to other FS techniques.

D. COMPARISON OF RF CLASSIFIER WITH BENCHMARK ML CLASSIFIERS
This section benchmarks the performance of the ensemble RF classifier against other state-of-the-art classifiers, i.e., NB, MLP, SMO, and Bagging. In this regard, the hyperparameters corresponding to all these classifiers have been set across all five datasets as shown in Table 10.
Moreover, this study has employed 10-fold cross-validation for splitting the dataset into training and testing sets. Thereafter, the training set and selected feature set were passed to all the classifiers for the classification of CDs. The resulting classification performance of all the classifiers has been compared across all the datasets for different evaluation metrics as shown in Tables 11 and 12.
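The 10-fold cross-validation split described above can be sketched in plain Python as follows. This is a generic, unstratified illustration; the function name `k_fold_indices`, the shuffling, and the `seed` parameter are assumptions, and the study's actual tooling may split the data differently (for example, with stratification).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Every sample appears in exactly one test fold and in the training set of
    the remaining k - 1 folds. Shuffling with a fixed seed is an assumption
    made here for reproducibility.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for fold_no, (train_idx, test_idx) in enumerate(k_fold_indices(25), start=1):
        print(fold_no, len(train_idx), len(test_idx))
```

A classifier would be trained on `train_idx` and scored on `test_idx` in each round, with the reported metric being the average over the ten rounds.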


FIGURE 10. Classification output parameters of RF corresponding to different FS algorithms for datasets a) D1 (ICBHI), b) D2 (WBCD), c) D3 (Z-Alizadehsani), d) D4 (Exasens), and e) D5 (Diabetes).

TABLE 10. Values of hyperparameters across all the classifiers.

FIGURE 11. Comparison of Accuracy, ROC, and F-measure across all classifiers in terms of mean rank calculated by Friedman's test.

FIGURE 12. Comparison of MAE and RMSE across all classifiers in terms of mean rank calculated by Friedman's test.

The results obtained from the experimentation work illustrated two important observations.
• Firstly, an accuracy paradox arose for dataset D1. The performance of all the classifiers for different metrics across dataset D1 is ideal, which is highly implausible. This is due to the presence of a high imbalance across the classes of dataset D1. These biased outcomes have resulted from the biased data.
• Secondly, there are cases where multiple classifiers have shown similar results corresponding to the same metric. For example, for dataset D3, SMO, Bagging, and RF have shown similar performance in terms of Accuracy.
Hence, to further assess the classification performance of the proposed approach against other state-of-the-art classifiers, some statistical tests are required, which are discussed in the subsection below.

E. STATISTICAL TESTING
In this section, a thorough comparison has been conducted between the proposed approach and other benchmark classifiers, utilizing Friedman's statistical test to determine the results [19]. This test, with the associated p-value, has been performed for multiple comparisons. It has been undertaken to detect the performance difference between PSO-RF and the other classifiers. The null hypothesis considered for this study, with threshold value p = 0.05, was that there is no significant difference between PSO-RF and other classifiers. A significant difference is indicated by p < 0.05. The test statistics for Friedman's test are shown in Table 13.
It is worth mentioning that the performance difference between PSO-RF and other classifiers is statistically significant (p < 0.05) for Accuracy, F-measure, and RMSE. Hence, for these parameters, the null hypothesis that there is no significant difference between PSO-RF and other classifiers is rejected.
The Friedman mean ranks obtained from the above experimental results for different classifiers across different evaluation metrics are shown in Figures 11 and 12. In terms of Accuracy, ROC, and F-measure, the higher the rank of the classifier, the better the classifier, whereas for MAE and RMSE, the lower the error rank, the better the classifier.
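The mean ranks and the Friedman statistic discussed above can be computed as in the following Python sketch. It treats each dataset as a block and each classifier as a treatment, ranks values in ascending order so that a higher rank means a larger metric value, and uses the textbook chi-square form of the statistic without a tie correction. The function names and the example scores are made up for illustration; the study itself reports using SPSS, whose output may differ when ties are present.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def friedman(scores):
    """scores[d][c] is the metric value of classifier c on dataset d.

    Returns (mean_ranks, chi_square) where chi_square is
    12 / (n k (k + 1)) * sum(R_c ** 2) - 3 n (k + 1), with R_c the rank sums.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for c, r in enumerate(average_ranks(row)):
            rank_sums[c] += r
    mean_ranks = [s / n for s in rank_sums]
    chi_square = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    return mean_ranks, chi_square


if __name__ == "__main__":
    # Hypothetical accuracies of three classifiers on three datasets.
    example = [[0.90, 0.80, 0.95], [0.85, 0.80, 0.99], [0.70, 0.60, 0.90]]
    print(friedman(example))
```

The p-value is obtained by comparing the statistic against a chi-square distribution with k - 1 degrees of freedom; where SciPy is available, `scipy.stats.friedmanchisquare` performs the whole test, including the tie correction.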


TABLE 11. Comparison of performance of classifiers across all five datasets (D1, D2, D3, D4, and D5) in terms of accuracy (in %), ROC (in %), F-measure
(in %).

TABLE 12. Comparison of performance of classifiers across all five datasets (D1, D2, D3, D4, and D5) in terms of MAE and RMSE.

TABLE 13. Values of different test statistics across different performance metrics during Friedman's test.

FIGURE 13. Comparison of the proposed approach with previous studies with respect to the ICBHI dataset in terms of Accuracy. Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM), Visual Geometry Group (VGG).

The results obtained from Friedman's test showed that out of all the classifiers, PSO-RF obtained the highest mean rank in terms of accuracy, ROC, and F-measure. Also, the mean of MAE is similar for PSO-Bagging and PSO-RF. Similarly, the mean RMSE is lowest for PSO-RF. Hence, it is evident from Figs. 11 and 12 that the proposed model (PSO-RF) has emerged as the best model, as it exhibits the highest ranks among all the classifiers.

F. COMPARISON OF PROPOSED APPROACH WITH PREVIOUS METHODS
For the sake of universality and comprehensiveness, this study further contrasts the proposed approach with other existing studies. A state-of-the-art comparison with the proposed approach for dataset D1 is shown in Figure 13.
From Fig. 13, it is evident that for dataset D1, the proposed approach, i.e., PSORF, has outperformed the previous studies' results by obtaining the highest accuracy of 100%. The accuracies obtained by studies [33], [34], [50] were much lower for dataset D1 as compared to the proposed approach. The second highest accuracy was obtained in [50], wherein the authors utilized an ensemble of Support Vector Machines (SVM). However, the study could not identify the feature importance as it utilized the radial basis function


to derive the best-performing model. Hence, it could be said that in terms of feature importance and classification performance, the proposed approach performed the best for dataset D1.

FIGURE 14. Comparison of the proposed approach with previous studies with respect to the WBCD dataset in terms of Accuracy. Multilayer Perceptron+Open Source Development Model Algorithm (MLP+ODMA), Support Vector Machine-Wolf Optimization Algorithm (SVM+WOA), Particle Swarm Optimization+Artificial Neural Network (PSO+ANN).

FIGURE 15. Comparison of the proposed approach with previous studies with respect to the Z-Alizadehsani dataset in terms of Accuracy. Support Vector Machine+Q-learning based Bee Swarm Optimization (SVM+QBSO).

FIGURE 16. Comparison of the proposed approach with previous studies with respect to the Exasens dataset D4 in terms of Accuracy. Deep Convolution Neural Network (DCNN).

FIGURE 17. Comparison of the proposed approach with previous studies with respect to the Vanderbilt Diabetes dataset D5 in terms of Accuracy.

For dataset D2, as can be seen in Figure 14, the proposed approach obtained results similar to those of study [13] in terms of accuracy.
However, this might be due to overfitting, as the dataset was left imbalanced in study [13], and the researchers also utilized a computationally expensive deep learning model for obtaining high accuracy. Similarly, for studies [16], [17], the dataset was left imbalanced, and no statistical tests were performed to support the classification performances obtained by their respective models. Therefore, in terms of computational power, the proposed approach for dataset D2 is better than all previous studies. Furthermore, for dataset D3, it is evident from Figure 15 that the proposed approach obtained the highest accuracy of 99.7% as compared to studies [20], [21], [23].
The second highest accuracy has been obtained by study [21], wherein the authors utilized the LightGBM model for the detection of CAD. However, previous studies related to dataset D3 had some limitations, such as increased time complexity for study [20], no feature selection in study [21], and the imbalanced data problem in study [23]. All these limitations have been overcome in the proposed approach, consequently leading to better accuracy as compared to the previous studies. Similarly, in the case of dataset D4, as shown in Figure 16, the proposed approach obtained the second-highest accuracy of 99.05%, which is 0.5% less than the accuracy obtained by researcher [31].
The proposed approach has rectified the problem of missing values by utilizing EM Imputation, whereas the problem still persists in studies [31], [32]. Lastly, for dataset D5, the proposed approach obtained the third highest accuracy of 93.5%, as shown in Figure 17.
The other studies [26], [27] obtained an accuracy of 98.7% and 93.9%, respectively. However, the problem with these approaches is that the dataset had missing values and was left imbalanced. In addition, study [26] utilized GWO and WOA, which should not be used as these algorithms exhibit a center bias problem, as the base feature selection techniques in its proposed model. The technique proposed in this study is free from the center bias problem, and the problems of missingness and imbalanced data have also been rectified. Hence, it could be said that across all the datasets except D5, the proposed approach, i.e., PSORF, has performed the best. It has not only detected chronic diseases but also multi-classified symptomatically similar diseases. This study also has some limitations. Firstly, for dataset D1, the proposed approach obtained almost ideal results, which might be a


result of the presence of a high imbalance in the dataset. The ML classifiers utilized in this study are not complex enough to deal with such highly imbalanced data. Secondly, to tackle the imbalanced data problem, this study has utilized the SMOTE filter, which might result in the generation of some noisy data. Therefore, in the future, this study aims to provide a suitably complex AI-based predictive model for the multi-classification of diseases in dataset D1. Furthermore, for the problem of imbalanced datasets, different variants of SMOTE can be applied in future studies.

V. CONCLUSION
This study aimed to provide an efficient machine learning framework, PSORF, that can not only detect but also classify similar chronic diseases such as COPD, Asthma, Bronchiectasis, etc. For this purpose, this study considered five different datasets across which a series of experiments have been performed. The datasets obtained from public repositories suffered from missing values and imbalanced data problems that were rectified through EM Imputation and SMOTE techniques. The processed data was then passed through the PSO-RF framework, which provided the best optimal feature set and efficient classification results on all the datasets. In addition, to validate the classification performance of the PSORF framework, PSO and RF were compared with different metaheuristic techniques and ML classifiers, respectively. The performance of PSO and other metaheuristic techniques, namely Firefly, Bat, and Genetic search, was compared through radar graphs on the basis of various evaluation metrics. It was evident from the graphs that across all the datasets, PSO provided the best results. Hence, for further evaluation, five different PSO-based classifiers were compared by using various performance metrics. The results showed that among all the classifiers, the PSO-based RF classifier outperformed the other classifiers in terms of Accuracy, F-measure, and ROC. However, there were some classifiers whose performances were similar across all the datasets. Therefore, for further clarification on the classification performance of the classifiers, Friedman's test was performed. The test results showed that among all the classifiers, PSO-RF achieved the highest rank, indicating that it has outperformed the other classifiers. The proposed PSO-RF framework not only classified binary chronic diseases such as Breast cancer, Diabetes, and Heart disease but also classified multiple chronic diseases that were symptomatically similar, such as COPD, Asthma, Pneumonia, Bronchiectasis, etc.

REFERENCES
[1] C. W. Tsao, A. W. Aday, Z. I. Almarzooq, A. Alonso, A. Z. Beaton, M. S. Bittencourt, A. K. Boehme, A. E. Buxton, A. P. Carson, Y. Commodore-Mensah, and M. S. Elkind, "Heart disease and stroke statistics, 2022 update: A report from the American Heart Association," Circulation, vol. 145, no. 8, pp. e153–e639, 2022.
[2] A. Singh, N. Prakash, and A. Jain, "A review on prevalence of worldwide COPD situation," in Proceedings of Data Analytics and Management (Lecture Notes in Networks and Systems), vol. 572, A. Khanna, Z. Polkowski, and O. Castillo, Eds. Singapore: Springer, 2023.
[3] L. J. Grimm, C. S. Avery, E. Hendrick, and J. A. Baker, "Benefits and risks of mammography screening in women ages 40 to 49 years," J. Primary Care Community Health, vol. 13, Jan. 2022, Art. no. 215013272110583.
[4] S. Selvakani, K. Vasumathi, and V. Aadhiseshan, "Application of machine learning in predicting heart disease," Asian Basic Appl. Res. J., vol. 5, pp. 61–68, Apr. 2023.
[5] A. Chaurasia, "Ensemble technique to predict heart disease using machine learning classifiers," Netw. Biol., vol. 13, no. 1, p. 1, 2023.
[6] G. N. Ahamad, Shafiullah, H. Fatima, Imdadullah, S. M. Zakariya, M. Abbas, M. S. Alqahtani, and M. Usman, "Influence of optimal hyperparameters on the performance of machine learning algorithms for predicting heart disease," Processes, vol. 11, no. 3, p. 734, Mar. 2023.
[7] A. Singh and N. Prakash, "A review of AI models for prediction and detecting heart disease for improved wellbeing," Vivekananda J. Res., vol. 10, pp. 14–25, Oct. 2021.
[8] S. W. Ali, M. Asif, M. Rashid, S. Tanvir, S. Shams, and S. Abid, "Detection of crackle and wheeze in lung sound using machine learning technique for clinical decision support system," Vawkum Trans. Comput. Sci., vol. 11, no. 1, pp. 67–78, 2023.
[9] M. A. Elsadig, A. Altigani, and H. T. Elshoush, "Breast cancer detection using machine learning approaches: A comparative study," Int. J. Electr. Comput. Eng., vol. 13, no. 1, p. 736, Feb. 2023.
[10] V. R. Allugunti, "Breast cancer detection based on thermographic images using machine learning and deep learning algorithms," Int. J. Eng. Comput. Sci., vol. 4, no. 1, pp. 49–56, Jan. 2022.
[11] B. S. Abunasser, M. R. J. Al-Hiealy, I. S. Zaqout, and S. S. Abu-Naser, "Breast cancer detection and classification using deep learning Xception algorithm," Int. J. Adv. Comput. Sci. Appl., vol. 13, no. 7, pp. 223–228, 2022.
[12] T. O. Oladele, B. J. Olorunsola, T. O. Aro, H. B. Akande, and O. A. Olukiran, "Nature-inspired meta-heuristic optimization algorithms for breast cancer diagnostic model: A comparative study," FUOYE J. Eng. Technol., vol. 6, no. 1, pp. 26–29, Mar. 2021.
[13] R. O. Ogundokun, S. Misra, M. Douglas, R. Damaševičius, and R. Maskeliūnas, "Medical Internet-of-Things based breast cancer diagnosis using hyperparameter-optimized neural networks," Future Internet, vol. 14, no. 5, p. 153, May 2022.
[14] B. Sahu, S. Mohanty, and S. Rout, "A hybrid approach for breast cancer classification and diagnosis," ICST Trans. Scalable Inf. Syst., vol. 6, no. 20, Jul. 2018, Art. no. 156086.
[15] B. J. Olorunsola, T. O. Oladele, T. O. Aro, H. Babalola, and O. A. Olukiran, "Performance comparison of selected swarm intelligence algorithms on breast cancer diagnosis," Afr. J. MIS, vol. 3, no. 1, pp. 5–21, 2021.
[16] Z. Guo, L. Xu, and N. A. Asgharzadeholiaee, "A homogeneous ensemble classifier for breast cancer detection using parameters tuning of MLP neural network," Appl. Artif. Intell., vol. 36, no. 1, Dec. 2022, Art. no. 2031820.
[17] X. Jia, X. Sun, and X. Zhang, "Breast cancer identification using machine learning," Math. Problems Eng., vol. 2022, pp. 1–8, Oct. 2022.
[18] H. Huang, X. Feng, S. Zhou, J. Jiang, H. Chen, Y. Li, and C. Li, "A new fruit fly optimization algorithm enhanced support vector machine for diagnosis of breast cancer based on high-level features," BMC Bioinf., vol. 20, no. S8, pp. 1–14, Jun. 2019.
[19] A. Gupta, R. Kumar, H. S. Arora, and B. Raman, "C-CADZ: Computational intelligence system for coronary artery disease detection using Z-Alizadeh Sani dataset," Int. J. Speech Technol., vol. 52, no. 3, pp. 2436–2464, Feb. 2022.
[20] Y. A. Z. A. Fajri, W. Wiharto, and E. Suryani, "Hybrid model feature selection with the bee swarm optimization method and Q-learning on the diagnosis of coronary heart disease," Information, vol. 14, no. 1, p. 15, Dec. 2022.
[21] S. Zhang, Y. Yuan, Z. Yao, J. Yang, X. Wang, and J. Tian, "Coronary artery disease detection model based on class balancing methods and LightGBM algorithm," Electronics, vol. 11, no. 9, p. 1495, May 2022.
[22] J. Hassannataj Joloudari, F. Azizi, M. A. Nematollahi, R. Alizadehsani, E. Hassannatajjeloudari, I. Nodehi, and A. Mosavi, "GSVMA: A genetic support vector machine ANOVA method for CAD diagnosis," Frontiers Cardiovascular Med., vol. 8, p. 2178, Feb. 2022.
[23] B. Kolukisa and B. Bakir-Gungor, "Ensemble feature selection and classification methods for machine learning-based coronary artery disease diagnosis," Comput. Standards Interfaces, vol. 84, Mar. 2023, Art. no. 103706.


[24] B. Kolukisa, L. Yavuz, A. Soran, B.-G. Burcu, D. Tuncer, A. Onen, and V. C. Gungor, "Coronary artery disease diagnosis using optimized adaptive ensemble machine learning algorithm," Int. J. Bioscience, Biochemistry Bioinf., vol. 10, no. 1, pp. 58–65, 2020.
[25] A. Singh and A. Payal, "CAD diagnosis by predicting stenosis in arteries using data mining process," Intell. Decis. Technol., vol. 15, no. 1, pp. 59–68, Mar. 2021.
[26] S. Amutha and J. R. Sekar, "An optimized framework for diabetes mellitus diagnosis using grid search based support vector machine," in Proc. Int. Conf. Comput., Commun., Signal Process. Cham, Switzerland: Springer, Jan. 2023, pp. 153–167.
[27] A. C. Ramachandra and D. Murthy, "Diabetes prediction using machine learning approach," Strad Res., vol. 10, no. 8, 2023.
[28] S. Gill and P. Pathwar, "Prediction of diabetes using various feature selection and machine learning paradigms," in Modern Approaches in Machine Learning & Cognitive Science: A Walkthrough. Cham, Switzerland: Springer, 2022, pp. 133–146.
[29] P. Rajendra and S. Latifi, "Prediction of diabetes using logistic regression and ensemble techniques," Comput. Methods Programs Biomed. Update, vol. 1, Jan. 2021, Art. no. 100032.
[30] J. Dhar, "Multistage ensemble learning model with weighted voting and genetic algorithm optimization strategy for detecting chronic obstructive pulmonary disease," IEEE Access, vol. 9, pp. 48640–48657, 2021.
[31] R. R. Irshad, S. Hussain, S. S. Sohail, A. S. Zamani, D. Ø. Madsen, A. A. Alattab, A. A. A. Ahmed, K. A. A. Norain, and O. A. S. Alsaiari, "A novel IoT-enabled healthcare monitoring framework and improved grey wolf optimization algorithm-based deep convolution neural network model for early diagnosis of lung cancer," Sensors, vol. 23, no. 6, p. 2932, Mar. 2023.
[32] P. S. Zarrin, N. Roeckendorf, and C. Wenger, "In-vitro classification of saliva samples of COPD patients and healthy controls using machine learning tools," IEEE Access, vol. 8, pp. 168053–168060, 2020.
[33] G. Petmezas, G.-A. Cheimariotis, L. Stefanopoulos, B. Rocha, R. P. Paiva, A. K. Katsaggelos, and N. Maglaveras, "Automated lung sound classification using a hybrid CNN-LSTM network and focal loss function," Sensors, vol. 22, no. 3, p. 1232, Feb. 2022.
[34] S. W. Ali, M. Asif, M. Rashid, S. Tanvir, S. Shams, and S. Abid, "Detection of crackle and wheeze in lung sound using machine learning technique for clinical decision support system," Vawkum Trans. Comput. Sci., vol. 11, no. 1, pp. 67–78, 2023.
[35] ICBHI Dataset. Accessed: Jun. 20, 2023. [Online]. Available: https://paperswithcode.com/dataset/icbhi-respiratory-sound-database
[36] WBCD. Accessed: Jun. 20, 2023. [Online]. Available: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
[37] Z-Alizadehsani Dataset. Accessed: Jun. 20, 2023. [Online]. Available: https://archive.ics.uci.edu/dataset/extention-of-z-alizadehsani-dataset
[38] EXASENS. Accessed: Jun. 20, 2023. [Online]. Available: https://archive.ics.uci.edu/dataset/523/exasens
[39] Diabetes Prediction Dataset. Accessed: Sep. 20, 2023. [Online]. Available: https://data.world/informatics-edu/diabetes-prediction
[40] M. Zhang, M. Li, L. Guo, and J. Liu, "A low-cost AI-empowered stethoscope and a lightweight model for detecting cardiac and respiratory diseases from lung and heart auscultation sounds," Sensors, vol. 23, no. 5, p. 2591, Feb. 2023.
[41] C. Wall, L. Zhang, Y. Yu, A. Kumar, and R. Gao, "A deep ensemble neural network with attention mechanisms for lung abnormality classification using audio inputs," Sensors, vol. 22, no. 15, p. 5566, Jul. 2022.
[42] A. Mohamed, E. Amer, S. N. Eldin, J. Khaled, and M. Hossam, "The impact of data processing and ensemble on breast cancer detection using deep learning," J. Comput. Commun., vol. 1, no. 1, pp. 27–37, Feb. 2022.
[43] X. Wang, I. Ahmad, D. Javeed, S. Zaidi, F. Alotaibi, M. Ghoneim, Y. Daradkeh, J. Asghar, and E. Eldin, "Intelligent hybrid deep learning model for breast cancer detection," Electronics, vol. 11, no. 17, p. 2767, Sep. 2022.
[44] H. Mohammedqasim, R. Mohammedqasem, O. Ata, and E. I. Alyasin, "Diagnosing coronary artery disease on the basis of hard ensemble voting optimization," Medicina, vol. 58, no. 12, p. 1745, Nov. 2022.
[45] Vanderbilt Diabetes Datasets. Accessed: Sep. 20, 2023. [Online]. Available: https://hbiostat.org/data/
[46] J. Kudela, "The evolutionary computation methods no one should use," 2023, arXiv:2301.01984.
[47] Y. Wan, Z. Wang, and T.-Y. Lee, "Incorporating support vector machine with sequential minimal optimization to identify anticancer peptides," BMC Bioinf., vol. 22, no. 1, p. 286, May 2021.
[48] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proc. IEEE Int. Conf. Neural Netw. (ICNN), vol. 4, Aug. 2002, pp. 1942–1948.
[49] A. Singh and A. Jain, "Financial fraud detection using bio-inspired key optimization and machine learning technique," Int. J. Secur. Appl., vol. 13, no. 4, pp. 75–90, Dec. 2019, doi: 10.33832/ijsia.2019.13.4.08.
[50] J. S. Park, K. Kim, J. H. Kim, Y. J. Choi, K. Kim, and D. I. Suh, "A machine learning approach to the development and prospective evaluation of a pediatric lung sound classification model," Sci. Rep., vol. 13, no. 1, p. 1289, Jan. 2023.

AKANSHA SINGH received the B.Tech. and M.Tech. degrees in computer science and engineering from Guru Gobind Singh Indraprastha University (GGSIPU), Delhi, India, in 2017 and 2020, respectively, where she is currently pursuing the Ph.D. degree in computer science and engineering. Her research interests include machine learning, computational metaheuristic models, deep learning, bioinformatics, and data mining. She was a recipient of two best paper awards at an international and national conference respectively. Her honors include the Short Term Research Fellowship (STRF) from GGSIPU and the STEM Fellowship.

NUPUR PRAKASH received the B.E. degree in electronics and communication and the M.E. degree in computer science and technology from the University of Roorkee (now IIT Roorkee), in 1981 and 1986, respectively, and the Ph.D. degree from Punjab University, in 1998. She is currently a Professor with the Department of Computer Science and Engineering and holds the position of the Vice Chancellor of The NorthCap University, Gurgaon, India. Prior to joining The NorthCap University, she was the Vice-Chancellor of Indira Gandhi Delhi Technical University for Women; the Principal of the Indira Gandhi Institute of Technology, Delhi; the Dean of the School of Engineering and Technology; and the Dean of the School of ICT, Guru Gobind Singh Indraprastha University, Government of Delhi. She has been a strong propagator of STEM education among girls and has won many awards and accolades. She has guided 12 Ph.D. scholars and authored more than 100 research papers and articles in various national and international journals/conferences of repute. Her H-index and i10 index are 17 and 30, respectively, with 1844 citations. Her research interests include artificial neural networks, natural language processing, mobile communication, secure wireless networks, and machine learning algorithms. She is a Life Member of the Computer Society of India (CSI) and a Former Member of the IEEE Women in Engineering (WIE), USA. She has chaired various expert committees of UGC, NBA, and NAAC.

ANURAG JAIN received the M.Tech. degree from IIT Kharagpur and the Ph.D. degree from Guru Gobind Singh Indraprastha University, Delhi, India.
He is currently a Professor with Guru Gobind Singh Indraprastha University. He is doing research in the areas of healthcare, cybersecurity, and speech processing. He has also been involved in identifying the importance of ML and data science in his research domain. He has published many national and international research papers in many reputed journals and conferences. His i10 index is 14 with nearly 675 citations. His research interests include speech processing, natural language processing, artificial intelligence, machine learning, and data mining in the healthcare domain. Prof. Jain is a Life Member of the Computer Society of India (CSI).
