



A Novel Approach for Polycystic Ovary Syndrome Prediction Using Machine Learning in Bioinformatics

SHAZIA NASIM1, MUBARAK ALMUTAIRI2,*, KASHIF MUNIR3,*, ALI RAZA1,*, AND FAIZAN YOUNAS1
1 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, 64200, Pakistan (e-mail: [email protected], [email protected])
2 College of Computer Science and Engineering, University of Hafr Al Batin, Hafr Al Batin, 31991, Saudi Arabia
3 Faculty of Computer Science and IT, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, 64200, Pakistan
Corresponding authors: Mubarak Almutairi (e-mail: [email protected]), Kashif Munir (e-mail: [email protected]), and Ali Raza (e-mail: [email protected]).
This work was supported by the University of Hafr Al Batin, Saudi Arabia.

ABSTRACT Polycystic ovary syndrome (PCOS) is a critical disorder in women during their reproductive phase. PCOS is commonly caused by excess male hormone (androgen) levels. The ovaries develop collections of fluid, called follicles, and may fail to release eggs regularly. PCOS results in miscarriage, infertility issues, and complications during pregnancy. According to a recent report, PCOS is diagnosed in 31.3% of women in Asia. Studies show that 69% to 70% of women with PCOS remain undiagnosed and untreated. A research study is needed to save women from critical complications by identifying PCOS early. The main aim of our research is to predict PCOS using advanced machine learning techniques. A dataset based on the clinical and physical parameters of women is utilized for building the study models. A novel feature selection approach is proposed based on an optimized chi-squared (CS-PCOS) mechanism. Ten hyper-parameterized machine learning models are applied in comparison. Using the novel CS-PCOS approach, the gaussian naive bayes (GNB) outperformed the other machine learning models and state-of-the-art studies. The GNB achieved 100% accuracy, precision, recall, and f1-scores with a minimal training time of 0.002 seconds. The k-fold cross-validation of GNB also achieved a 100% accuracy score. The proposed GNB model achieved accurate results for critical PCOS prediction. Our study reveals that the dataset features prolactin (PRL), systolic blood pressure, diastolic blood pressure, thyroid stimulating hormone (TSH), respiratory rate (RR, breaths/min), and pregnancy status are the prominent factors with high involvement in PCOS prediction. Our research study helps the medical community reduce the miscarriage rate and provide timely treatment to women through the early detection of PCOS.

INDEX TERMS Bioinformatics, Data Analysis, Infertility, Machine Learning, Pregnancy Complications,
Polycystic Ovary Syndrome, PCOS Prediction, Syndrome Classification.

I. INTRODUCTION
PCOS is a medical ailment [1] which is the main reason for hormonal disorder in women during their reproductive phase. PCOS arises due to a disorder in hormones [2]. The hormone disorder results in the ovaries growing small collections of fluid called follicles (cysts). The ovaries are unable to produce eggs due to PCOS, which is the prominent problem. Women with PCOS have critical complications in pregnancy [3]. PCOS disease is usually inherited and presents as an unexpected critical situation. The time and cost of countless medical tests are a burden for patients and doctors. A machine learning-based platform must be built for efficient and early prediction of PCOS.
The common indications of PCOS are a higher ratio of androgen (heightened male hormone) levels [4], an unbalanced menstrual cycle, polycystic ovaries, and metabolism problems. Early detection of PCOS-related symptoms helps to adopt essential lifestyle changes. During pregnancy, the chances of miscarriage in women with PCOS are more than three times those of women without PCOS.


Women with PCOS undergo infertility, which can result in gynaecological cancer [5]. Early detection of PCOS helps prevent miscarriage. PCOS affects many women at an early age; however, they are not diagnosed. Numerous studies show that 69% to 70% of affected women do not receive a diagnosis or treatment [6]. According to a recent report, PCOS is diagnosed in 4.8% of white American women, 8% of African American women, 6.8% of women in Spain, and 31.3% of women in Asia [7]. Due to these complications and statistics, early diagnosis of PCOS is crucial.
The PCOS treatment [8] consists of modification in lifestyle, weight reduction, and an appropriate healthy diet plan. Everyday workouts by women result in a minimized free androgen index and reduced biochemical hyperandrogenism [9], [10], [11], [12], [13]. Studies show that with the increase in age, the PCOS symptoms become less extreme, and women reach menopause [14], [15], [16].
Machine learning (ML) is a core area of computer science. Nowadays, ML allows computers to learn from their environment without explicit programming. ML performs an essential role in the healthcare department [17]. ML deals with obscure, enormous datasets. ML analyses the data, transforms it into a usable form for clinical procedures, and assists in identifying the nature of different diseases. Three main types of machine learning applications are used in the medical field [18]: medical image processing, NLP in medical documentation, and statistical analysis of genetic material. Our primary research contributions are as follows:
• A novel CS-PCOS feature selection approach is proposed based on the optimized chi-squared mechanism. The twenty dataset features with a high importance value are selected using the CS-PCOS approach for building machine learning models. By using the CS-PCOS approach, our proposed model outperformed machine learning techniques and past proposed state-of-the-art studies;
• The PCOS exploratory data analysis (PEDA) is conducted to find the data patterns that are the primary cause of PCOS disease. The PEDA is based on graphs, charts, and statistical data analysis;
• Ten advanced machine learning models are applied in comparison to predict PCOS. The applied machine learning techniques are stochastic gradient descent (SGD), linear regression (LIR), random forest (RF), bayesian ridge (BR), support vector machine (SVM), k-neighbors classifier (KNC), multi-layer perceptron (MLP), logistic regression (LOR), gaussian naive bayes (GNB), and gradient boosting classifier (GBC). The GNB model is our proposed model;
• The k-fold cross-validation is applied to validate overfitting in our applied machine learning models. Ten folds of the research data are used during the k-fold analysis. The machine learning models are generalized and give accurate performance scores for unseen test data.
The remainder of the research study is as follows: Section II is based on the related literature analysis of PCOS. Our research methodology analysis is conducted in Section III. The employed machine learning models for PCOS prediction are examined in Section IV. The scientific results validation and evaluations of our research approaches are analyzed in Section V. The research study concluding remarks are described in Section VI.

II. RELATED WORK
The literature related to our proposed research study is examined in this section. The past applied state-of-the-art studies for PCOS prediction are analyzed, and the related research findings and proposed techniques are examined.
One of the most common health problems [19] in early-age women is PCOS disease. PCOS is a complicated health dilemma distressing women of childbearing age, which can be identified based on different medical indicators and signs. Accurate identification and detection of PCOS is the essential baseline for appropriate treatment. For this purpose, researchers applied different machine learning approaches such as SVM, random forest, CART, logistic regression, and naive bayes classification to identify PCOS patients. After comparing the results, the random forest algorithm gave a high performance with 96% accuracy in PCOS diagnostics on a given dataset [20].
Machine learning algorithms were implemented on a dataset of 541 patients, of which 177 have PCOS disease. The dataset consists of 43 features. As all features did not have equal importance, researchers used a feature selection model, called the univariate feature selection model, to rank them according to their value. This model was implemented to get ten high-ranked features that can be used to predict the PCOS disease. After splitting the dataset into train and test portions, different algorithms were implemented. These models include gradient boosting classifiers [21], logistic regression classifiers, random forest classifiers, and RFLR, an abbreviation of random forest and logistic regression. As a result, the proposed RFLR algorithm achieved a 90.01% accuracy score in classifying the PCOS patients with ten highly ranked features [22].
A new technique was proposed for the early detection and identification of PCOS disease in 2021. The proposed model was based on XGBRF and catBoost. After preprocessing the data, the top 10 attributes were selected by the univariate feature selection method. The classifiers implemented to compare the accuracy results are MLP, decision tree, SVM, HRFLR, random forest, logistic regression, and gradient boosting. Results showed that XGBRF performed with an 89% accuracy score while catBoost outperformed with a 95% accuracy score. The accuracy scores of the other classifiers lay between 76% and 85%. The catBoost technique was the best model for the early detection of PCOS disease [23].
Researchers have demonstrated that PCOS identification depends on morphological, biological, and clinical processes [24] and methods [25]. Due to advanced technology such as ultrasonography, the surplus follicle has become a critical indicator of polycystic ovarian morphology (PCOM).

FIGURE 1. The methodological architecture analysis of the proposed research study in predicting the PCOS syndrome (PCOS dataset → feature engineering by CS-PCOS → PCOS exploratory data analysis (PEDA) → 80%/20% dataset splitting → machine learning model training and testing → PCOS prediction).

Since 2003, most researchers have used the criterion of twelve follicles (measuring 2-9 mm in diameter) per complete ovary; however, that now appears to be outdated [26]. The variations in ovarian volume or space may also be acknowledged as accurate indicators of PCOS morphology. However, their effectiveness compared with overweight and extra follicles remains unclear.
For the first time, researchers analyzed attributes and characteristics of women's genes involved in PCOS with a specific pattern and order. The 233 patients with PCOS participated in the prediction process. Researchers used machine learning algorithms such as decision trees and SVM with various kernel functions (linear, polynomial, RBF) and k-nearest neighbor (KNN) to predict PCOS by identifying new genes. From these classifiers, SVM (linear) gave the best accuracy performance at 80%, and the KNN accuracy score was between 57% and 79% [3].
According to a statistic, 3 to 4 of every 10 women are presently distressed by PCOS. To detect and predict PCOS in its first phase, the authors proposed an automated system which can detect and predict PCOS disease for medical treatment. The authors applied five machine learning models: gaussian naive bayes, SVM, k-neighbours, random forest, and logistic regression. They applied the models on a dataset with 41 attributes. The top 30 features were selected by a statistical method. After comparing the results of all five models, it was observed that the accuracy of the random forest model is 90%, while the results of the other models were between 86% and 89%. The random forest model was the proposed approach to detect and predict the PCOS patient [27].
The gene expression classification in bioinformatics using a hybrid machine learning framework was proposed [28]. The proposed genetic model is based on a cuckoo search algorithm using an artificial bee colony (ABC). Six benchmark gene expression datasets were utilized for building a naive bayes classifier. The study contributes high accuracy performance compared to previously published feature selection techniques. The classification of cancer based on gene expression using a novel framework was proposed [29]. The ABC-based modified metaheuristics optimization technique was applied for the classification task.
The identification of PCOS using novel immune infiltration and candidate biomarkers was proposed in this study [30]. The proposed approach used machine learning-based logistic regression and support vector machine models. Five datasets were utilized for training and testing the models. The proposed model achieved a 91% accuracy score for PCOS identification. The study contributes by presenting a novel framework for analysis. The mutational landscape screening-based analysis of modified PCOS-related genes was proposed in this study [31]. The PCOS-related gene data of 27 nsSNPs were selected for analysis.

III. METHODOLOGY
Our research study uses the PCOS-related clinical and physical features dataset for machine learning model building. The dataset feature engineering is done by using the novel proposed CS-PCOS approach. The PCOS exploratory data analysis (PEDA) is applied to figure out the data patterns and factors that are the primary cause of PCOS disease.

TABLE 1. The PCOS dataset descriptive feature analysis

Sr no. Feature Non-Null Count Data Type Sr no. Feature Non-Null Count Data Type
1 Sl. No 541 int64 22 TSH (mIU/L) 541 float64
2 Patient File No. 541 int64 23 AMH(ng/mL) 540 float64
3 PCOS (Y/N) 541 int64 24 PRL(ng/mL) 541 float64
4 Age (yrs) 541 int64 25 Vit D3 (ng/mL) 541 float64
5 Weight (Kg) 541 float64 26 PRG(ng/mL) 541 float64
6 Height(Cm) 541 float64 27 RBS(mg/dl) 541 float64
7 BMI 541 float64 28 Weight gain(Y/N) 541 int64
8 Blood Group 541 int64 29 hair growth(Y/N) 541 int64
9 Pulse rate(bpm) 541 int64 30 Skin darkening (Y/N) 541 int64
10 RR (breaths/min) 541 int64 31 Hair loss(Y/N) 541 int64
11 Cycle(R/I) 541 int64 32 Pimples(Y/N) 541 int64
12 Cycle length(days) 541 int64 33 Fast food (Y/N) 540 float64
13 Marraige Status (Yrs) 540 float64 34 Reg. Exercise(Y/N) 541 int64
14 Pregnant(Y/N) 541 int64 35 BP _Systolic (mmHg) 541 int64
15 No. of absorptions 541 int64 36 BP _Diastolic (mmHg) 541 int64
16 FSH(mIU/mL) 541 float64 37 Follicle No. (L) 541 int64
17 LH(mIU/mL) 541 float64 38 Follicle No. (R) 541 int64
18 FSH/LH 541 float64 39 Avg. F size (L) (mm) 541 float64
19 Hip(inch) 541 int64 40 Avg. F size (R) (mm) 541 float64
20 Waist(inch) 541 int64 41 Endometrium (mm) 541 float64
21 Waist: Hip Ratio 541 float64

FIGURE 2. The CS-PCOS approach operational flow of feature selection from the original dataset (39 original features F1...F39 reduced to 20 selected features F1...F20).

The dataset is fully preprocessed during feature engineering. The preprocessed dataset is split into two portions, train and test. The split ratio used is 80% for training and 20% for the model's evaluation on unseen test data. The hyper-parametrized model is completely trained and tested. The proposed model is then ready to predict the PCOS disease in deployment. The research methodology working flow is examined in Figure 1.

A. POLYCYSTIC OVARY SYNDROME DATASET
The PCOS dataset [32] is utilized in our research study. The clinical and physical parameters of 541 patients are used to create the dataset. The PCOS dataset features are analyzed in Table 1. The dataset contains a total of 41 features. We have filled the null values in our dataset with zero to preprocess the dataset. We have dropped the dataset columns 'Sl. No' and 'Patient File No.' because they contain no useful information. The dataset was collected from ten different hospitals across Kerala in India. The memory usage size of the dataset is 177.6 KB.

B. NOVEL CS-PCOS FEATURE ENGINEERING TECHNIQUE
The feature engineering techniques are applied to transform the dataset features into the best fit for a predictive model with high accuracy. A novel CS-PCOS feature selection approach is proposed based on the optimized chi-squared mechanism. The operational flow of feature selection by the CS-PCOS approach is visualized in Figure 2. The proposed CS-PCOS technique checks the independence by comparing the observed frequencies (categorical data) with the expected frequencies (target data). The proposed CS-PCOS technique extracts the vital value statistics based on goodness of fit. The 39 features are input to our proposed feature selection technique, which determines the importance value for each feature.
The feature importance values analysis is demonstrated in Table 2. The essential features have the highest values, near one. Furthermore, a feature which has a zero value is non-vital. The Waist:Hip Ratio is the most important feature in this analysis. The features having zero importance values are dropped.
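To make this step concrete, the following is a minimal sketch of chi-squared-based selection of the twenty highest-scoring features with scikit-learn. The file name, the use of SelectKBest with min-max scaling, and the assumption that the remaining columns are numeric are illustrative assumptions; the paper describes its own optimized chi-squared (CS-PCOS) mechanism, and the importance values in Table 2 appear to be normalized, whereas this sketch only reports the selected columns.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Load the PCOS dataset (file name assumed; adjust to the actual CSV path).
data = pd.read_csv("PCOS_data.csv")

# Preprocessing described in the paper: fill nulls with zero, drop index columns.
data = data.fillna(0)
data = data.drop(columns=["Sl. No", "Patient File No."])

X = data.drop(columns=["PCOS (Y/N)"])   # candidate clinical/physical features (assumed numeric)
y = data["PCOS (Y/N)"]                  # binary target (0 = No, 1 = Yes)

# Chi-squared requires non-negative inputs, so scale the features to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

# Keep the 20 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=20)
X_selected = selector.fit_transform(X_scaled, y)
print(list(X.columns[selector.get_support()]))
```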

TABLE 2. The CS-PCOS approach features importance values analysis.

Sr no. Feature Importance value Sr no. Feature Importance value


1 Age (yrs) 0.00 21 AMH(ng/mL) 0.00
2 Weight (Kg) 0.00 22 PRL(ng/mL) 0.72
3 Height(Cm) 0.44 23 Vit D3 (ng/mL) 0.00
4 BMI 0.00 24 PRG(ng/mL) 0.00
5 Blood Group 0.67 25 RBS(mg/dl) 0.03
6 Pulse rate(bpm) 0.27 26 Weight gain(Y/N) 0.00
7 RR (breaths/min) 0.74 27 hair growth(Y/N) 0.00
8 Hb(g/dl) 0.60 28 Skin darkening (Y/N) 0.00
9 Cycle(R/I) 0.00 29 Hair loss(Y/N) 0.00
10 Cycle length(days) 0.01 30 Pimples(Y/N) 0.00
11 Marraige Status (Yrs) 0.00 31 Fast food (Y/N) 0.00
12 Pregnant(Y/N) 0.61 32 Reg. Exercise(Y/N) 0.19
13 No. of absorptions 0.09 33 BP _Systolic (mmHg) 0.90
14 FSH(mIU/mL) 0.00 34 BP _Diastolic (mmHg) 0.57
15 LH(mIU/mL) 0.00 35 Follicle No. (L) 0.00
16 FSH/LH 0.00 36 Follicle No. (R) 0.00
17 Hip(inch) 0.02 37 Avg. F size (L) (mm) 0.00
18 Waist(inch) 0.02 38 Avg. F size (R) (mm) 0.06
19 Waist: Hip Ratio 0.99 39 Endometrium (mm) 0.07
20 TSH (mIU/L) 0.61

FIGURE 3. The correlation analysis of selected dataset features by the proposed CS-PCOS techniques.

The dropped features are Age (yrs), Weight (Kg), BMI, Cycle(R/I), Cycle length(days), Marraige Status (Yrs), FSH(mIU/mL), LH(mIU/mL), FSH/LH, AMH(ng/mL), Vit D3 (ng/mL), PRG(ng/mL), Weight gain(Y/N), hair growth(Y/N), Skin darkening (Y/N), Hair loss(Y/N), Pimples(Y/N), Fast food (Y/N), Follicle No. (L), and Follicle No. (R). The twenty most prominent features are selected by our proposed technique and used for PCOS prediction in our research study. The selected feature correlation analysis is conducted in Figure 3. The correlation analysis demonstrates that all selected features have a positive correlation.


FIGURE 4. The PCOS patients' data distribution analysis by class. (a) The count plot shows the number of instances of both classes in the dataset. (b) The pie chart shows the distribution of the PCOS class in percentage.

FIGURE 5. The 3D analysis of feature distribution by class. (a) The Waist:Hip Ratio, PRL(ng/mL), and PCOS features are plotted in a 3D scatter plot to visualize the data points. (b) The 3D scatter plot is drawn on the TSH(mIU/L), BP_Systolic(mmHg), and PCOS(Y/N) features.

C. PCOS EXPLORATORY DATA ANALYSIS (PEDA)
This section analyses the PCOS data and the dataset's different patterns to understand the cause of PCOS. The analysis focuses on the 20 significant features selected by the proposed CS-PCOS technique that are used to train the machine learning models. These features are analyzed from different angles using different graphs. The seaborn, pandas, and matplotlib libraries of Python are used to visualize the charts.
The count plots are drawn to see the number of instances of both classes in the PCOS dataset. In Figure 4(a), the count plot shows the number of instances of both categories. The No category has 364 instances, and the Yes category has 177 instances in the dataset. The dataset is binary class: 0 indicates No PCOS, and 1 represents Yes PCOS. In Figure 4(b), the pie chart shows the percentage of each class in the dataset. 67.3% of the data belongs to the PCOS No class, and 32.7% of the data belongs to the Yes class.
The 3D scatter plot is used to visualize and analyze the most critical feature data points in 3D. It plots data points on three axes to show the relationship between three features. When the value of PRL(ng/mL) is more than 40 and the Waist:Hip Ratio is less than 0.90, PCOS occurs, as shown in Figure 5(a). No PCOS occurs when TSH(mIU/L) is less than 50 and BP_Systolic is above 80. Figure 5(b) demonstrates that when the value of TSH(mIU/L) is above 50 and the BP_Systolic value is less than 80, PCOS occurs.
The lmplot is drawn on the dataset's high-value features to represent the PCOS regression described in Figure 6. The lmplot is a two-dimensional plot that combines regplot and FacetGrid. The FacetGrid class helps visualize the distribution of one variable and the relationship between multiple variables separately within subsets of a dataset using numerous panels. The lmplot is more computationally intensive and is intended as a convenient interface to fit regression models across conditional subsets of a dataset.
In Figure 6(A), an lmplot is drawn between Hip(inch) and Waist(inch) to visualize the PCOS regression. As the waist and hip size increase, the chance of PCOS increases. In Figure 6(B), the Waist:Hip Ratio and Hb(g/dl) subset is used to analyze the PCOS regression. When the value of Hb(g/dl) is greater than 14 or less than 9, there is more chance of PCOS. Figure 6(C) plots the lmplot between Pregnant(Y/N) and BP_Systolic(mmHg). This plot shows that if the value of BP_Systolic(mmHg) is 140, PCOS does not occur whether the patient is pregnant or not.
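As a hedged illustration of the class-distribution views described above (Figure 4-style count plot and pie chart), the following sketch uses seaborn and matplotlib with the preprocessed dataframe `data` assumed from the earlier feature-engineering sketch; it is not the authors' exact plotting code.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Count plot of the two PCOS classes (Figure 4(a)-style view).
sns.countplot(x="PCOS (Y/N)", data=data, ax=axes[0])
axes[0].set_title("PCOS class counts (0 = No, 1 = Yes)")

# Pie chart of the class percentages (Figure 4(b)-style view).
class_counts = data["PCOS (Y/N)"].value_counts()
axes[1].pie(class_counts, labels=["No PCOS", "PCOS"], autopct="%1.1f%%")
axes[1].set_title("PCOS class distribution")

plt.tight_layout()
plt.show()
```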

FIGURE 6. The lmplot regression graph analysis of high-value features with the PCOS class. (A) Hip(inch) vs. Waist(inch). (B) Hb(g/dl) vs. Waist:Hip Ratio. (C) Pregnant(Y/N) vs. BP_Systolic(mmHg). (D) Blood Group vs. RR(breaths/min). (E) TSH(mIU/L) vs. BP_Diastolic(mmHg). (F) Reg. Exercise(Y/N) vs. No. of abortions.

In Figure 6(D), the Blood Group and RR(breaths/min) features are taken from the dataset to visualize the regression plot. When the value of RR(breaths/min) is greater than 25, no PCOS occurs. In Figure 6(E), the lmplot is plotted between the TSH(mIU/L) and BP_Diastolic(mmHg) features. When the value of TSH(mIU/L) is between 0 and 20 and BP_Diastolic(mmHg) is 80, there is more chance of PCOS. In Figure 6(F), No. of abortions and Reg. Exercise(Y/N) are taken to visualize the lmplot. When the number of abortions is above three and the patient does not exercise regularly, PCOS does not occur.
The histogram is plotted to analyze the frequency distribution of PCOS Yes or No on the selected features in Figure 7. Figure 7(A) plots Hip(inch) to identify the frequency distribution. The frequency of both classes is highest between 35 and 40. Figure 7(B) plots the histogram on Hb(g/dl). The PCOS Yes class has the highest count of 60 at Hb(g/dl) 11. Furthermore, the PCOS No class has a maximum count of 140 before the value of 11. In Figure 7(C), the Pregnant(Y/N) feature is used to plot the histogram. This graph presents the highest value of both classes at no pregnancy. In Figure 7(D), BP_Diastolic(mmHg) is taken to plot the histogram. The highest frequency of class 0 is 250 at 80 BP_Diastolic. In Figure 7(E), the maximum frequency of RR(breaths/min) is at a value of 10, which is above 175 for No PCOS and 75 for Yes PCOS. In Figure 7(F), the feature TSH(mIU/L) has a frequency between zero and ten; the Yes PCOS class has the highest frequency of 90 at 0, and approximately 340 for No PCOS at a value of 5. In Figure 7(G), the BP_Systolic(mmHg) feature is taken to analyze the frequency distribution; the highest frequency is 175 at 100 for No PCOS. In Figure 7(H), PRL(ng/mL) has the highest frequency at 20, gradually decreasing. In Figure 7(I), the frequency of the Waist:Hip Ratio ranges from 0.75 to 0.95, with the highest frequency for No PCOS at 0.95.

D. DATASET SPLITTING
The data splitting is applied to prevent model overfitting and to evaluate the trained model on the unseen test portion of the dataset. The PCOS dataset is split into two portions for the training and testing of the employed machine learning models. The 80:20 ratio is used for dataset splitting. The 80% portion of the dataset is used for model training, and the 20% portion of the dataset is used for the employed models' results evaluations on unseen data. Our research models are trained and evaluated with high accuracy results.

IV. EMPLOYED MACHINE LEARNING TECHNIQUES
The employed machine learning techniques are examined for PCOS prediction in this section. The working mechanisms and mathematical notations for the machine learning models are described. The ten predictive machine learning models are under examination for PCOS prediction in our research study.
The stochastic gradient descent (SGD) classifier [33] uses loss functions based on the SGD learning routine for classification. The SGD is used for large-scale learning. The SGD is easy to build and has good efficiency. The SGD optimization model is utilized to minimize a loss function by finding the optimal parameter values of the function.
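The following is a minimal sketch of the 80:20 split described in Section III-D and of an SGD classifier configured with the hyperparameters listed in Table 3. The `random_state` value and the variables `X_selected` and `y` carried over from the earlier sketches are assumptions for illustration, not the authors' exact setup.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 80:20 train/test split described in Section III-D.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.20, random_state=0
)

# SGD classifier with the hyperparameters listed in Table 3.
sgd = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001, l1_ratio=0.15,
                    max_iter=1000, tol=1e-3, learning_rate="optimal")
sgd.fit(X_train, y_train)
print("SGD test accuracy:", accuracy_score(y_test, sgd.predict(X_test)))
```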

FIGURE 7. The histogram analysis of the frequency distribution of PCOS Yes or No for the selected features. (A) Hip(inch). (B) Hb(g/dl). (C) Pregnant(Y/N). (D) BP_Diastolic(mmHg). (E) RR(breaths/min). (F) TSH(mIU/L). (G) BP_Systolic(mmHg). (H) PRL(ng/mL). (I) Waist:Hip Ratio.

The performance of SGD is based on the loss function. The logistic cost function is expressed in equation 1.

$$Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \quad (1)$$

Linear regression (LIR) [34] is a statistical method used for classification that finds the linear relationship between the dependent variable (y) and the independent variables (x). A linear relationship analyses how the dependent variable values change according to the independent variable values. The LIR model [35] provides a straight line separating the data points. The regression line in the LIR model minimizes the sum of the squares of residuals, known as ordinary least squares (OLS). The mathematical notation to express the LIR model is analyzed in equation 2.

$$Y = mX + b \quad (2)$$

The random forest (RF) [36] is a supervised classification model that creates a forest of multiple decision trees. The decision trees are created randomly based on the data samples. Decision nodes represent the features, and tree leaf nodes represent the target output. The majority-voting prediction of the decision trees is selected as the final prediction. The gini index and entropy are used for data splitting in tree nodes, as expressed in equations 3 and 4.

$$Gini\ index = 1 - \sum_{i=1}^{n} (P_i)^2 \quad (3)$$

$$Entropy(S) = -p(+) \log p(+) - p(-) \log p(-) \quad (4)$$

The bayesian ridge (BR) [37] algorithm uses probability computations for the classification task. The BR model is suitable for real-world problems where the data is insufficient and poorly distributed. The BR model formulates a linear regression model by using probability distributions. The BR model predicts the target (y) by calculating it from a probability distribution instead of estimating a single value. The mathematical notation to find the y target using the BR model is expressed in equation 5.

$$p(y \mid X, w, \alpha) = \mathcal{N}(y \mid Xw, \alpha) \quad (5)$$
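For illustration, the node-splitting criteria in equations (3) and (4) can be computed directly from a node's class labels; the following helper functions are a sketch for clarity, not code taken from the paper's implementation.

```python
import numpy as np

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (equation 3)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: negative sum of p * log2(p) over the classes present (equation 4)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini_index([0, 0, 1, 1]))  # 0.5 for a perfectly mixed node
print(entropy([0, 0, 1, 1]))     # 1.0 bit for a perfectly mixed node
```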

TABLE 3. The best-fit hyperparameters analysis of all employed machine learning models.

Technique Hyperparameters
SGD loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, max_iter=1000, tol=1e-3, learning_rate=’optimal’.
LIR copy_X=True, fit_intercept=True, positive=False, normalize=False.
RF max_depth=20, random_state=0, n_estimators=100, criterion=’gini’, max_features=’sqrt’, bootstrap=True.
BR tol=1e-3, n_iter=300, alpha_1=1e-6, lambda_1=1e-6, alpha_2=1e-6, lambda_2=1e-6.
SVM kernel=’linear’, C=1.0, degree=3, gamma=’scale’, tol=1e-3, cache_size=200, decision_function_shape=’ovr’.
KNC n_neighbors=5, weights=’uniform’, algorithm=’auto’, metric=’minkowski’, leaf_size=30, p=2.
MLP hidden_layer_sizes=(100,), activation=’relu’, solver=’adam’, alpha=0.0001, learning_rate=’constant’.
LOR penalty=’l2’, tol=1e-4, C=1.0, solver=’lbfgs’.
GNB var_smoothing=1e-9.
GBC loss=’log_loss’, max_depth=3, learning_rate=0.1, criterion=’friedman_mse’, n_estimators=100.

The support vector machine (SVM) [38] is a supervised machine learning model. The SVM is mainly used for classification and regression problems. The primary aim of the SVM model is to find the best decision boundary that separates the data points into their relevant categories in n-dimensional feature space. The best decision boundary in SVM is known as the hyperplane. The SVM model selects the extreme vectors to create the hyperplane; these vectors are called support vectors. The hyperplane in SVM is used for prediction as expressed in equation 6.

$$h(x_i) = \begin{cases} +1 & \text{if } w \cdot x + b \geq 0 \\ -1 & \text{if } w \cdot x + b < 0 \end{cases} \quad (6)$$

The k-neighbors classifier (KNC) [39] is the simplest, non-parametric machine learning model for classification problems. The KNC model calculates the similarity between data points and places new input points into the category they are most similar to. The KNC model saves the available data and predicts the category of new data based on similarity. The KNC is a lazy learner model because it performs its computation at the time of classification; it does not learn immediately from the training data. Its time computations are high and its efficiency is low. The euclidean distance between data points is found as expressed in equation 7.

$$E(A_1, A_2) = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \quad (7)$$

The MLP classifier (MLP) [40] is a feedforward artificial neural network-based supervised machine learning model. The MLP model [41] is based on types of network layers. The types of layers are the input, output, and hidden layers. The input layer in the network handles the input data points, and the output layer is responsible for the prediction task. The hidden layer processes the data within the neural network. The MLP passes data in the forward direction, and the neurons present in the MLP network are trained with the back-propagation technique. The neurons use nonlinear activation functions between the input and output layers. The weighted sum of the input features in MLP is calculated as analyzed in equation 8.

$$u(x) = \sum_{i=1}^{n} w_i x_i \quad (8)$$

The logistic regression (LOR) [42] is a supervised machine learning model for binary classification. The LOR model [43] forecasts the categorical dependent variable using training data of independent variables. The target class must be in the form of a discrete value. The LOR model gives probabilistic values; the output values lie between 1 and 0. The LOR is similar to the LIR model, with the only difference being in their use. The logistic function is s-shaped in the LOR model, which forecasts the values 1 or 0. The logistic function is analyzed in equation 9.

$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n \quad (9)$$

The gaussian naive bayes (GNB) [44] is a supervised machine learning model. The GNB model is based on the naive bayes methods and theorem. The GNB technique [45] has the strong assumption that all the predictors are independent of each other: one feature in a class is independent of another feature in the same class. The GNB utilizes a gaussian distribution and the naive assumption to predict the target class. The target feature prediction by the GNB model is expressed in equation 10.

$$P(Y \mid features) = \frac{P(Y)\,P(features \mid Y)}{P(features)} \quad (10)$$

The gradient boosting classifier (GBC) [46] is an ensemble learning-based boosting model mainly used for classification and regression tasks. The GBC [47] models work incrementally. The principle of GBC is to build models sequentially by training each base model. The motive is to make a robust model: several models combine to turn a weak learner into a robust model. Several gradient-boosted trees are involved in making a GBC. The final powerful model has the corrected prediction values. The three main components of GBC are the loss function, the weak learner, and the additive model. For classification, the GBC model prediction for the target class is expressed in equation 11.

$$y = \frac{\sum Residual}{\sum \left[Prev\ probability \times (1 - Prev\ probability)\right]} \quad (11)$$
8.

TABLE 4. The comparative performance evaluation of employed machine learning models for unseen test data without using the proposed technique.

Technique Training time(second) Accuracy (%) Precision (%) Recall (%) F1-score (%)
SGD 0.006 79 82 79 79
LIR 0.034 84 85 84 85
RF 0.193 89 89 89 89
BR 0.014 84 85 84 84
SVM 0.666 88 88 88 88
KNC 0.002 70 68 70 68
MLP 0.472 83 83 83 83
LOR 0.042 80 80 80 80
GNB 0.003 81 81 81 80
GBC 0.259 89 89 89 89

A. HYPERPARAMETER TUNING
The iterative training and testing process selects the best-fit hyperparameters [48] for all applied machine learning techniques. The hyper-parameters are selected as final when a machine learning model gives accurate prediction results. The hyperparameter tuning [49] of our research models is analyzed in Table 3. The analysis demonstrates the parameters utilized to achieve the high performance metrics scores. The hyper-parameters proved very beneficial for our employed machine learning models in this research study.

V. RESULTS AND DISCUSSIONS
The results and scientific evaluations of our proposed research study are examined in this section. The Python programming tool and the scikit-learn library are used for building the employed machine learning models. The performance metrics used are the accuracy score, precision score, recall score, and f1-score. The performance metrics are evaluated for the scientific validation of our research models. The following are the essential components of the evaluation metrics:
• The predicted value and actual value are both positive, known as true positive (TP).
• The predicted value and actual value are both negative, known as true negative (TN).
• The actual value is negative, and the predicted value is positive, referred to as false positive (FP).
• The actual value is positive, and the predicted value is negative, referred to as false negative (FN).
The employed model's accuracy score shows how good the model is at prediction. The accuracy is also related to the error rate of a model: the higher the accuracy, the lower the error rate. The accuracy is determined by dividing the number of correct predictions by the total number of predictions. The accuracy score of our proposed model is 100%. Mathematically, the accuracy score is demonstrated as:

$$Accuracy\ score = \frac{TP + TN}{TP + TN + FP + FN} \quad (12)$$

The precision score of a learning model is also known as the positive predictive value. The precision is measured as the proportion of positively predicted labels that are actually positive. The precision, in general, reflects how accurately the employed model predicts a data sample as positive. The precision score of our proposed model is 100%. The mathematical notation to express the precision score is as follows:

$$Precision\ score = \frac{TP}{TP + FP} \quad (13)$$

The recall score of the employed models is the measure of how many of the TP were recalled (found) correctly. The recall is also called the sensitivity of a learning model. The recall score of our proposed model is 100%. The mathematical notation to define the recall is as follows:

$$Recall\ score = \frac{TP}{TP + FN} \quad (14)$$

The f1 score is the statistical measure that sums up a predictive model's performance by combining the precision and recall values. The f1 measure is the harmonic mean of the recall and precision. The f1 score of our proposed model is 100%. The mathematical equation to calculate the f1 score is expressed as:

$$F1\text{-}score = \frac{2 \times (Precision \times Recall)}{Precision + Recall} \quad (15)$$

The comparative performance metrics analysis of the applied learning models is conducted in Table 4. The time complexity computations and performance metrics results are calculated without using our proposed approach. The analysis demonstrated that all applied learning models achieved average scores in predicting PCOS. From the analysis and Figure 8, the highest accuracy, precision, recall, and f1 score is 89%, achieved by the RF and GBC techniques. The minimum accuracy score is 70%, the precision score is 68%, the recall score is 70%, and the f1 score is 68%, achieved by the KNC technique. The time complexity analysis describes that KNC has the lowest training time of 0.002 seconds; however, it also has low performance metrics scores.
The performance metrics comparative analysis of the applied learning models is conducted in Table 5. The performance metrics results and time complexity computations are calculated using our proposed approach. The analysis demonstrated that all applied learning models achieved the highest performance metrics scores in predicting PCOS.

FIGURE 8. The accuracy scores comparative evaluation of employed machine learning models for unseen test data without using the proposed
technique.

TABLE 5. Using the proposed technique, the comparative performance evaluation of the employed machine learning model for unseen test data.

Technique Training time(second) Accuracy (%) Precision (%) Recall (%) F1-score (%)
SGD 0.004 69 68 69 68
LIR 0.024 100 100 100 100
RF 0.147 100 100 100 100
BR 0.004 100 100 100 100
SVM 0.842 100 100 100 100
KNC 0.002 56 53 56 54
MLP 0.592 99 99 99 99
LOR 0.025 100 100 100 100
GNB 0.002 100 100 100 100
GBC 0.071 100 100 100 100

FIGURE 9. Using the proposed technique, the accuracy scores comparative evaluation of employed machine learning models for unseen test data.

From the analysis and Figure 9, the highest accuracy, precision, recall, and f1 score is 100%, achieved by the LIR, RF, BR, SVM, LOR, GNB, and GBC techniques. The minimum accuracy score is 56%, the precision score is 53%, the recall score is 56%, and the f1 score is 54%, achieved by the KNC technique.


FIGURE 10. The accuracy scores comparative analysis of the K-Fold technique to validate the overfitting of the employed learning techniques.

The time complexity analysis describes that the GNB has the lowest training time of 0.002 seconds while also achieving the highest performance metrics scores. The GNB is our proposed model for predicting PCOS.
The classification report analysis by individual target class for each employed learning model is examined in Table 6. The classification report values are calculated for the models using the proposed approach. The analysis demonstrates that the KNC and SGD have low accuracy scores in the class-wise metrics evaluations. The outperformed GNB model achieved 100% scores in the classification report analysis.

TABLE 6. The classification report analysis of employed learning models by using the proposed technique.

Technique  Target Category  Precision  Recall  F1-score  Support
SGD        0                0.72       0.83    0.77      70
SGD        1                0.59       0.44    0.50      39
LIR        0                1.00       1.00    1.00      70
LIR        1                1.00       1.00    1.00      39
RF         0                1.00       1.00    1.00      70
RF         1                1.00       1.00    1.00      39
SVM        0                1.00       1.00    1.00      70
SVM        1                1.00       1.00    1.00      39
BR         0                1.00       1.00    1.00      70
BR         1                1.00       1.00    1.00      39
KNC        0                0.63       0.74    0.68      70
KNC        1                0.33       0.23    0.27      39
MLP        0                1.00       0.99    0.99      70
MLP        1                0.97       1.00    0.99      39
LOR        0                1.00       1.00    1.00      70
LOR        1                1.00       1.00    1.00      39
GNB        0                1.00       1.00    1.00      70
GNB        1                1.00       1.00    1.00      39
GBC        0                1.00       1.00    1.00      70
GBC        1                1.00       1.00    1.00      39

TABLE 7. The K-Fold cross-validation analysis to validate the overfitting of the employed learning techniques.

Sr no  K-Fold  Technique  Accuracy Score (%)
1      10      SGD        60
2      10      LIR        100
3      10      RF         100
4      10      BR         100
5      10      SVM        100
6      10      KNC        60
7      10      MLP        98
8      10      LOR        100
9      10      GNB        100
10     10      GBC        100

To validate the overfitting of the employed machine learning models, we have applied the k-fold cross-validation technique, as analyzed in Table 7. Ten folds of the dataset are used for validation. The analysis demonstrates that the techniques that achieved 100% scores using our proposed approach also achieved 100% accuracy using the k-fold technique. Figure 10 shows the accuracy comparative analysis of the employed models using the k-fold validation. The visualized analysis demonstrates that the MLP model achieved 99% on unseen test data and 98% accuracy using k-fold. The SGD and KNC models achieve the lowest accuracy scores in this analysis. In conclusion, all employed models are validated using the k-fold technique. The k-fold analysis demonstrates that our employed machine learning models are not overfitted; the models are in generalized form and give accurate results on unseen test data.
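As a sketch of the ten-fold validation reported in Table 7, scikit-learn's cross_val_score can be applied to the CS-PCOS-selected features; the variables `X_selected` and `y` are assumed from the earlier sketches, and this is an illustration rather than the authors' exact validation protocol.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Ten-fold cross-validation on the CS-PCOS-selected features (Table 7-style check).
scores = cross_val_score(GaussianNB(var_smoothing=1e-9),
                         X_selected, y, cv=10, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())
```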

TABLE 8. The performance validation comparative analysis with the past applied state-of-the-art approaches.

Literature Year Learning Type Proposed Technique Accuracy (%) Recall (%) Precision (%)
[22] 2020 Machine Learning RFLR 91 90 89
Proposed 2022 Machine Learning CS-PCOS + GNB 100 100 100

pregnancy are the most prominent factors having high in-


volvement in PCOS prediction. The study limitations and in
future work, we will enhance the dataset by collecting more
data on PCOS-related patients and applying data balancing
techniques. Also, the deep learning-based will be applied for
PCOS prediction.

REFERENCES
[1] I. Kyrou, E. Karteris, T. Robbins, K. Chatha, F. Drenos, and H. S. Randeva,
“Polycystic ovary syndrome (PCOS) and COVID-19: An overlooked
female patient population at potentially higher risk during the COVID-19
pandemic,” BMC Medicine, vol. 18, pp. 1–10, jul 2020.
[2] B. J. Sherman, N. L. Baker, K. T. Brady, J. E. Joseph, L. M. Nunn, and
A. McRae-Clark, “The effect of oxytocin, gender, and ovarian hormones
on stress reactivity in individuals with cocaine use disorder,” Psychophar-
macology 2020 237:7, vol. 237, pp. 2031–2042, may 2020.
FIGURE 11. The confusion matrix validation analysis of our proposed
[3] X. Z. Zhang, Y. L. Pang, X. Wang, and Y. H. Li, “Computational charac-
model.
terization and identification of human polycystic ovary syndrome genes,”
Scientific Reports, vol. 8, p. 12949, dec 2018.
[4] E. Khashchenko, E. Uvarova, M. Vysokikh, T. Ivanets, L. Krechetova,
The comparative analysis of past applied state-of-the-art N. Tarasova, I. Sukhanova, F. Mamedova, P. Borovikov, I. Balashov, and
G. Sukhikh, “The Relevant Hormonal Levels and Diagnostic Features of
studies is examined in Table 8. The comparison parameters Polycystic Ovary Syndrome in Adolescents,” Journal of Clinical Medicine
are the year, learning type, proposed technique, accuracy 2020, Vol. 9, Page 1831, vol. 9, p. 1831, jun 2020.
score, recall score, and precision score. The analysis demon- [5] M. Woźniak, R. Krajewski, S. Makuch, and S. Agrawal, “Phytochemicals
in Gynecological Cancer Prevention,” International Journal of Molecular
strates that using our novel proposed CS-PCOS technique,
Sciences 2021, Vol. 22, Page 1219, vol. 22, p. 1219, jan 2021.
the outperformed GNB model achieved the highest scores [6] D. Dewailly, M. E. Lujan, E. Carmina, M. I. Cedars, J. Laven, R. J.
compared with the past proposed techniques. Our proposed Norman, and H. F. Escobar-morreale, “Definition and significance of
model outperformed the state of art studies. polycystic ovarian morphology: a task force report from the Androgen
Excess and Polycystic Ovary Syndrome Society,” Human reproduction
The confusion matrix analysis is conducted to validate our update, vol. 20, no. 3, pp. 334–352, 2014.
performance metrics scorers as analyzed in Figure 11. The [7] A. S. Prapty and T. T. Shitu, “An Efficient Decision Tree Establishment
analyzed confusion matrix is for outperformed GNB model. and Performance Analysis with Different Machine Learning Approaches
on Polycystic Ovary Syndrome,” ICCIT 2020 - 23rd International Confer-
The analysis demonstrates that 70 samples are found as TP, ence on Computer and Information Technology, Proceedings, dec 2020.
and 39 samples are found as TN. The 0 samples are found for [8] E. C. Costa, J. C. F. De Sá, N. K. Stepto, I. B. B. Costa, L. F. Farias-Junior,
FN and FP in this analysis. The confusion matrix validates S. D. N. T. Moreira, E. M. M. Soares, T. M. A. M. Lemos, R. A. V. Browne,
and G. D. Azevedo, “Aerobic Training Improves Quality of Life in Women
our proposed model for achieving the 100% accuracy score with Polycystic Ovary Syndrome,” Medicine and science in sports and
in predicting the PCOS. exercise, vol. 50, pp. 1357–1366, jul 2018.
[9] M. A. Karimzadeh and M. Javedani, “An assessment of lifestyle mod-
ification versus medical treatment with clomiphene citrate, metformin,
VI. CONCLUSIONS
This research study proposes the prediction of PCOS disease from the data of 541 patients using machine learning, together with a novel CS-PCOS feature selection technique. Ten machine learning techniques, SGD, LIR, RF, BR, SVM, KNC, MLP, LOR, GNB, and GBC, are applied in comparison. Using the proposed CS-PCOS feature selection technique, the GNB model outperforms the others with a 100% accuracy score and a computation time of 0.002 seconds. The comparison with state-of-the-art studies shows that the proposed model outperforms them, and the absence of overfitting is validated using a ten-fold cross-validation technique. Our research study concludes that the dataset features prolactin (PRL), systolic blood pressure, diastolic blood pressure, thyroid stimulating hormone (TSH), relative risk (RR-breaths), and pregnancy history are the prominent factors for predicting PCOS.
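For readers who wish to reproduce a comparable pipeline, the following is a minimal sketch, not the exact CS-PCOS implementation, of chi-squared-based feature selection combined with a Gaussian naive Bayes classifier and ten-fold cross-validation in scikit-learn. The file name, target column label, and the choice of k = 10 selected features are illustrative assumptions rather than values taken from this study.

# Minimal sketch (not the original CS-PCOS code): chi-squared feature selection
# followed by a Gaussian naive Bayes classifier with ten-fold cross-validation.
# The file name "pcos.csv", the target column "PCOS (Y/N)", and k=10 are
# illustrative assumptions.
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

df = pd.read_csv("pcos.csv")                 # hypothetical path to the PCOS dataset
X = df.drop(columns=["PCOS (Y/N)"])          # clinical and physical parameters
y = df["PCOS (Y/N)"]                         # binary PCOS label

pipeline = Pipeline([
    ("scale", MinMaxScaler()),               # chi2 requires non-negative inputs
    ("select", SelectKBest(chi2, k=10)),     # keep the k highest-scoring features
    ("gnb", GaussianNB()),                   # Gaussian naive Bayes classifier
])

scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

Under this setup, the GaussianNB step can be swapped for any of the other nine classifiers listed above to reproduce a model comparison of the same shape.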


SHAZIA NASIM is pursuing her M.S. degree in Computer Science at the Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan. She received the Master of Computer Science degree in 2012 from Bahauddin Zakariya University, Multan. Her current research interests include data mining and machine learning.

MUBARAK S. ALMUTAIRI is currently the Dean of the Applied College, University of Hafr Al-Batin (UHB). He received the B.Sc. degree in systems engineering from King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, in 1997, the M.Sc. degree in industrial and systems engineering from the University of Florida, Gainesville, Florida, USA, in 2003, and the Ph.D. degree in systems design engineering from the University of Waterloo, Waterloo, Canada, in 2007. From 1997 to 2000, he was an industrial engineer with the Saudi Arabian Oil Company (Aramco). He is currently an Associate Professor with the Computer Science and Engineering Department, University of Hafr Al-Batin, Hafr Al-Batin, Saudi Arabia. His research interests include decision analysis, expert systems, risk assessment, information security, fuzzy logic, and mobile government applications.

KASHIF MUNIR has been in the field of higher education since 2002. After an initial teaching experience with courses at Binary College, Malaysia, for one semester and at Stamford College, Malaysia, for around four years, he relocated to Saudi Arabia. He worked with King Fahd University of Petroleum and Minerals, KSA, from September 2006 till December 2014, and moved to the University of Hafr Al-Batin, KSA, in January 2015. In July 2021, he joined the Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, as an Assistant Professor in the IT Department. He received the B.Sc. degree in Mathematics and Physics from Islamia University Bahawalpur, Pakistan, in 1999, the M.Sc. degree in Information Technology from Universiti Sains Malaysia in 2001, another M.S. degree in Software Engineering from the University of Malaya, Malaysia, in 2005, and the Ph.D. degree in Informatics from the Malaysia University of Science and Technology, Malaysia, in 2015. He has published journal papers, conference papers, a book, and book chapters, and has served on the technical program committees of many peer-reviewed conferences and journals, where he has reviewed many research papers. His research interests are in the areas of cloud computing security, software engineering, and project management.

ALI RAZA is pursuing his M.S. degree in Computer Science at the Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan. He received the Bachelor of Science degree in Computer Science in 2021 from the Department of Computer Science, KFUEIT. His current research interests include data science, artificial intelligence, data mining, natural language processing, machine learning, deep learning, and image processing.

FAIZAN YOUNAS was born in Pakistan in 1999. He received the Bachelor of Science degree in Computer Science from the Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan, in 2021, and is pursuing his M.S. degree in Computer Science, also from KFUEIT. His main areas of research interest are natural language processing (NLP), machine learning, and deep learning.
