A Novel Approach For Polycystic Ovary Syndrome Prediction Using Machine Learning in Bioinformatics
5 authors, including:
Faizan Younas
Khwaja Fareed University of Engineering & Information Technology
All content following this page was uploaded by Kashif Munir on 15 September 2022.
ABSTRACT Polycystic ovary syndrome (PCOS) is a critical disorder affecting women during their reproductive years. PCOS is commonly associated with excess levels of androgens (male hormones). The follicles, which are collections of fluid developed by the ovaries, may fail to release eggs regularly. PCOS can result in miscarriage, infertility, and complications during pregnancy. According to a recent report, PCOS is diagnosed in 31.3% of women in Asia. Studies show that 69% to 70% of women did not avail themselves of PCOS detection and treatment. A research study is therefore needed to protect women from critical complications by identifying PCOS early. The main aim of our research is to predict PCOS using advanced machine learning techniques. A dataset based on the clinical and physical parameters of women is used to build the study models. A novel feature selection approach (CS-PCOS) is proposed based on an optimized chi-squared mechanism. Ten hyperparameter-tuned machine learning models are compared. Using the novel CS-PCOS approach, gaussian naive bayes (GNB) outperformed the other machine learning models and prior state-of-the-art studies. The GNB achieved 100% accuracy, precision, recall, and f1-score with a minimal computation time of 0.002 seconds. The k-fold cross-validation of GNB also achieved a 100% accuracy score. Our study reveals that the dataset features prolactin (PRL), systolic blood pressure, diastolic blood pressure, thyroid stimulating hormone (TSH), respiratory rate (RR, breaths/min), and pregnancy are the prominent factors with high involvement in PCOS prediction. Our research can help the medical community reduce the miscarriage rate and provide treatment to women through the early detection of PCOS.
INDEX TERMS Bioinformatics, Data Analysis, Infertility, Machine Learning, Pregnancy Complications,
Polycystic Ovary Syndrome, PCOS Prediction, Syndrome Classification.
VOLUME 4, 2016 1
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3205587
PCOS undergo infertility, resulting in gynaecological cancer [5]. Early detection of PCOS helps prevent miscarriage. PCOS affects many women at an early age; however, many of them are not diagnosed. Numerous studies show that 69% to 70% of women did not avail themselves of detection and treatment [6]. According to a recent report, PCOS is diagnosed in 4.8% of white American women, 8% of African American women, 6.8% of women in Spain, and 31.3% of women in Asia [7]. Given these complications and statistics, early diagnosis of PCOS is crucial.
PCOS treatment [8] consists of lifestyle modification, weight reduction, and an appropriate healthy diet plan. Daily exercise results in a lower free androgen index and reduced biochemical hyperandrogenism [9], [10], [11], [12], [13]. Studies show that PCOS symptoms become less extreme with increasing age, as women reach menopause [14], [15], [16].
Machine learning (ML) is a core area of computer science. ML allows computers to learn from data without being explicitly programmed. ML plays an essential role in healthcare [17]. ML can handle large and complex datasets: it analyses the data, transforms it into a usable form for clinical procedures, and assists in identifying the nature of different diseases. Three main types of machine learning applications are used in the medical field [18]: medical image processing, natural language processing (NLP) of medical documentation, and statistical analysis of genetic data. Our primary research contributions are as follows:
• A novel CS-PCOS feature selection approach is proposed based on the optimized chi-squared mechanism. The twenty dataset features with a high importance value are selected using the CS-PCOS approach for building machine learning models. By using the CS-PCOS approach, our proposed model outperformed the other machine learning techniques and previously proposed state-of-the-art studies;
• The PCOS exploratory data analysis (PEDA) is conducted to find the data patterns that are the primary cause of PCOS disease. The PEDA is based on graphs, charts, and statistical data analysis;
• Ten advanced machine learning models are compared for predicting PCOS. The applied machine learning techniques are stochastic gradient descent (SGD), linear regression (LIR), random forest (RF), bayesian ridge (BR), support vector machine (SVM), k-neighbors classifier (KNC), multi-layer perceptron (MLP), logistic regression (LOR), gaussian naive bayes (GNB), and gradient boosting classifier (GBC). The GNB model is our proposed model;
• The k-fold cross-validation is applied to check for overfitting in our applied machine learning models. Ten folds of the research data are used during the k-fold analysis. The machine learning models are generalized and give accurate performance scores for unseen test data.
The remainder of the research study is organized as follows: Section II covers the related literature analysis of PCOS. Our research methodology is presented in Section III. The employed machine learning models for PCOS prediction are examined in Section IV. The validation and evaluation of the results of our research approaches are analyzed in Section V. The concluding remarks are given in Section VI.

II. RELATED WORK
The literature related to our proposed research study is examined in this section. The previously applied state-of-the-art studies for PCOS prediction are analyzed. The related research findings and proposed techniques are examined.
PCOS is one of the most common health problems [19] among young women. It is a complicated health condition affecting women of childbearing age, which can be identified based on different medical indicators and signs. Accurate identification and detection of PCOS is the essential baseline for appropriate treatment. For this purpose, researchers applied different machine learning approaches such as SVM, random forest, CART, logistic regression, and naive bayes classification to identify PCOS patients. After comparing the results, the random forest algorithm gave the highest performance with 96% accuracy in PCOS diagnostics on a given dataset [20].
Machine learning algorithms were implemented on a dataset of 541 patients, of whom 177 have PCOS disease. The dataset consists of 43 features. As not all features had equal importance, the researchers used a univariate feature selection model to rank them according to their value. This model was used to obtain ten high-ranked features that can be used to predict PCOS disease. After splitting the dataset into train and test portions, different algorithms were implemented. These models include gradient boosting classifiers [21], logistic regression classifiers, random forest classifiers, and RFLR, a hybrid of random forest and logistic regression. As a result, the proposed RFLR algorithm achieved a 90.01% accuracy score in classifying the PCOS patients with ten highly ranked features [22].
A new technique was proposed for the early detection and identification of PCOS disease in 2021. The proposed model was based on XGBRF and catBoost. After preprocessing the data, the top 10 attributes were selected by the univariate feature selection method. The classifiers implemented to compare the accuracy results are MLP, decision tree, SVM, HRFLR, random forest, logistic regression, and gradient boosting. Results showed that XGBRF achieved an 89% accuracy score, while catBoost outperformed it with a 95% accuracy score. The accuracy scores of the other classifiers lay between 76% and 85%. The catBoost technique was the best model for the early detection of PCOS disease [23].
Researchers have demonstrated that PCOS identification depends on morphological, biological, clinical processes [24] and methods [25]. Due to advanced technology such as ultrasonography, the surplus follicle has become a critical indicator of polycystic ovarian morphology (PCOM). Since
FIGURE 1. The methodological architecture analysis of the proposed research study in predicting the PCOS syndrome.
Sr no. Feature Non-Null Count Data Type Sr no. Feature Non-Null Count Data Type
1 Sl. No 541 int64 22 TSH (mIU/L) 541 float64
2 Patient File No. 541 int64 23 AMH(ng/mL) 540 float64
3 PCOS (Y/N) 541 int64 24 PRL(ng/mL) 541 float64
4 Age (yrs) 541 int64 25 Vit D3 (ng/mL) 541 float64
5 Weight (Kg) 541 float64 26 PRG(ng/mL) 541 float64
6 Height(Cm) 541 float64 27 RBS(mg/dl) 541 float64
7 BMI 541 float64 28 Weight gain(Y/N) 541 int64
8 Blood Group 541 int64 29 hair growth(Y/N) 541 int64
9 Pulse rate(bpm) 541 int64 30 Skin darkening (Y/N) 541 int64
10 RR (breaths/min) 541 int64 31 Hair loss(Y/N) 541 int64
11 Cycle(R/I) 541 int64 32 Pimples(Y/N) 541 int64
12 Cycle length(days) 541 int64 33 Fast food (Y/N) 540 float64
13 Marraige Status (Yrs) 540 float64 34 Reg. Exercise(Y/N) 541 int64
14 Pregnant(Y/N) 541 int64 35 BP _Systolic (mmHg) 541 int64
15 No. of absorptions 541 int64 36 BP _Diastolic (mmHg) 541 int64
16 FSH(mIU/mL) 541 float64 37 Follicle No. (L) 541 int64
17 LH(mIU/mL) 541 float64 38 Follicle No. (R) 541 int64
18 FSH/LH 541 float64 39 Avg. F size (L) (mm) 541 float64
19 Hip(inch) 541 int64 40 Avg. F size (R) (mm) 541 float64
20 Waist(inch) 541 int64 41 Endometrium (mm) 541 float64
21 Waist: Hip Ratio 541 float64
FIGURE 2. The CS-PCOS approach operational flow of feature selection from the original dataset.
and factors that are the primary cause of PCOS disease. The dataset is fully preprocessed during feature engineering. The preprocessed dataset is split into two portions, train and test. The split ratio is 80% for training and 20% for evaluating the model on unseen test data. The hyperparameter-tuned model is then fully trained and tested, after which the proposed model is ready to predict PCOS in deployment. The workflow of the research methodology is shown in Figure 1.

A. POLYCYSTIC OVARY SYNDROME DATASET
The PCOS dataset [32] is utilized in our research study. The clinical and physical parameters of 541 patients are used to create the dataset. The PCOS dataset features are analyzed in Table 1. The dataset contains a total of 41 features. To preprocess the dataset, we filled its null values with zero. We dropped the dataset columns 'Sl. No' and 'Patient File No.' because they contain unnecessary information. The dataset was collected from ten different hospitals across Kerala in India. The memory usage size of the dataset is 177.6 KB.

B. NOVEL CS-PCOS FEATURE ENGINEERING TECHNIQUE
Feature engineering techniques are applied to transform the dataset features into the best fit for a predictive model with high accuracy. A novel CS-PCOS feature selection approach is proposed based on the optimized chi-squared mechanism. The operational flow of feature selection by the CS-PCOS approach is visualized in Figure 2. The proposed CS-PCOS technique checks independence by comparing the observed frequencies (categorical data) with the expected frequencies (target data). The proposed CS-PCOS technique derives its importance statistics based on goodness of fit. The 39 features are input to our proposed feature selection technique, which determines an importance value for each feature.
The feature importance values are reported in Table 2. The essential features have the highest values, near one, while a feature with a zero value is non-vital. The Waist Hip Ratio is the most important feature. Features with zero importance values are dropped. The dropped features are Age (yrs), Weight (Kg), BMI, Cycle(R/I), Cycle length(days), Marraige Status (Yrs), FSH(mIU/mL), LH(mIU/mL), FSH/LH, AMH(ng/mL), Vit D3 (ng/mL), PRG(ng/mL), Weight gain(Y/N), hair growth(Y/N), Skin darkening (Y/N), Hair loss(Y/N), Pimples(Y/N), Fast food (Y/N), Follicle No. (L), and Follicle No. (R). The twenty most prominent features are selected by our proposed technique and used for PCOS prediction in our research study. The correlation analysis of the selected features is shown in Figure 3. The correlation analysis demonstrates that all selected features have a positive correlation.
FIGURE 3. The correlation analysis of selected dataset features by the proposed CS-PCOS techniques.
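The chi-squared scoring that CS-PCOS builds on can be sketched in plain Python. The optimization specific to CS-PCOS is not detailed in this section, so the sketch below shows only the standard chi-squared statistic for non-negative features, with null values filled with zero as described for the dataset. Scaling the scores so that the top feature has value one is an assumption made to mirror "highest value near one"; the feature names and toy values are hypothetical.

```python
def fill_missing(rows, fill_value=0.0):
    """Replace missing entries (None) with a constant, here zero."""
    return [[fill_value if v is None else float(v) for v in row] for row in rows]

def chi2_scores(rows, labels):
    """Chi-squared statistic of each non-negative feature against the class label."""
    n_samples, n_features = len(rows), len(rows[0])
    classes = sorted(set(labels))
    class_size = {c: labels.count(c) for c in classes}
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            # Observed: feature mass inside class c.
            observed = sum(row[j] for row, y in zip(rows, labels) if y == c)
            # Expected under independence: total mass times the class share.
            expected = total * class_size[c] / n_samples
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Hypothetical toy data: three features, six samples, one missing value.
names = ["flag", "const", "noise"]
raw_rows = [[0, 1, 1], [0, 1, None], [0, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]
labels = [0, 0, 0, 1, 1, 1]

rows = fill_missing(raw_rows)
scores = chi2_scores(rows, labels)
# Assumed normalization: divide by the largest score so the top feature equals one.
importance = [s / max(scores) for s in scores]
# Drop features whose importance is zero.
selected = [name for name, value in zip(names, importance) if value > 0.0]
print(importance, selected)
```

In this toy case the constant feature is independent of the label, so its score is zero and it is dropped, which is the same rule the text applies to the zero-importance dataset features.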
Figure 4(a): The count plot shows the number of instances of both classes in the dataset. Figure 4(b): The pie chart shows the distribution of the PCOS class in percentage.
Figure 5(a): The Waist Hip Ratio, PRL(ng/mL), and PCOS features are plotted in a 3D scatter plot to visualize the data points. Figure 5(b): The 3D scatter plot is drawn on the TSH(mIU/L), BP_Systolic(mmHg), and PCOS(Y/N) features.
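As a quick arithmetic check of the class distribution summarized by the count plot and pie chart, the percentages can be recomputed from the class counts reported for this dataset (364 No PCOS, 177 Yes PCOS). The variable names are illustrative only.

```python
# Class counts reported for the PCOS dataset (0 = No PCOS, 1 = Yes PCOS).
class_counts = {0: 364, 1: 177}
total = sum(class_counts.values())

# Share of each class in percent, rounded to one decimal as in the pie chart.
class_share = {label: round(100 * count / total, 1) for label, count in class_counts.items()}
print(total, class_share)
```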
C. PCOS EXPLORATORY DATA ANALYSIS (PEDA)
This section analyses the PCOS data and the dataset's different patterns to understand the cause of PCOS. The analysis focuses on the 20 features with significant importance values selected by the proposed CS-PCOS technique, which are used to train the machine learning models. These features are analyzed from several angles using different graphs. The seaborn, pandas, and matplotlib libraries of Python are used to draw the charts.
Count plots are drawn to see the number of instances of both classes in the PCOS dataset. In Figure 4(a), the count plot shows the number of instances of both categories. The No category has 364 instances, and the Yes category has 177 instances in the dataset. The dataset is binary: 0 indicates No PCOS, and 1 represents Yes PCOS. In Figure 4(b), the pie chart shows the percentage of each class in the dataset: 67.3% of the data belong to the No class, and 32.7% belong to the Yes class.
The 3D scatter plot is used to visualize and analyze the most critical feature data points in 3D. It plots data points on three axes to show the relationship between three features. When the value of PRL(ng/mL) is more than 40 and the Waist Hip Ratio is less than 0.90, PCOS occurs, as shown in Figure 5(a). No PCOS occurs when TSH(mIU/L) is less than 50 and BP_Systolic is above 80. Figure 5(b) demonstrates that PCOS occurs when the value of TSH(mIU/L) is above 50 and the BP_Systolic value is less than 80.
The lmplot is drawn on the dataset's high-value features to represent the PCOS regression, as shown in Figure 6. The lmplot is a two-dimensional plot that combines regplot and FacetGrid. The FacetGrid class helps visualize the distribution of one variable and the relationship between multiple variables separately within subsets of the dataset using multiple panels. The lmplot is more computationally intensive and is intended as a convenient interface to fit regression models across conditional subsets of a dataset.
In Figure 6(A), an lmplot is drawn between Hip(inch) and Waist(inch) to visualize the PCOS regression. As the waist and hip sizes increase, the chance of PCOS increases. In Figure 6(B), the Waist: Hip Ratio and Hb(g/dl) subset is used to analyze the PCOS regression. When the value of Hb(g/dl) is greater than 14 or less than 9, there is a higher chance of PCOS. Figure 6(C) plots the lmplot between Pregnant(Y/N) and BP_Systolic. This plot shows that if the value of BP_Systolic(mmHg) is 140, PCOS does not occur whether or not the patient is pregnant. In Figure 6(D), the Blood Group and RR(breaths/min) features are taken from the dataset to visualize the regression plot. When the value of RR(breaths/min) is greater than 25, no PCOS occurs. In Figure 6(E), the lmplot is plotted between the TSH(mIU/L) and BP_Diastolic(mmHg) features. When the value of TSH(mIU/L) is between 0 and 20 and BP_Diastolic(mmHg) is 80, there is a higher chance of PCOS. In Figure 6(F), the number of abortions and Reg. Exercise(Y/N) are taken to visualize the lmplot. When the number of abortions is above three and the patient does not exercise regularly, PCOS does not occur.
Figure A: The lmplot is drawn on features Hip(inch) & Waist(inch) to visualize the regression of PCOS.
Figure B: The lmplot is drawn on features Hb(g/dl) & Waist:Hip Ratio to visualize the regression of PCOS.
Figure C: The lmplot is drawn on features Pregnant(Y/N) & BP_Systolic(mmHg) to visualize the regression of PCOS.
Figure D: The lmplot is drawn on features Blood Group & RR(breaths/min) to visualize the regression of PCOS.
Figure E: The lmplot is drawn on features TSH(mIU/L) & BP_Diastolic(mmHg) to visualize the regression of PCOS.
Figure F: The lmplot is drawn on features Reg. Exercise(Y/N) & No. of abortions to visualize the regression of PCOS.
FIGURE 6. The lmplot regression graph analysis of values features with the PCOS class.
The histogram is plotted to analyze the frequency distribution of PCOS Yes or No on the selected features in Figure 7. Figure 7(A) plots Hip(inch) to identify the frequency distribution. The frequency of both classes is highest between 35 and 40. Figure 7(B) plots the histogram on Hb(g/dl). The PCOS Yes class has its highest count of 60 at Hb(g/dl) 11, and the No class has a maximum count of 140 before the value of 11. In Figure 7(C), the Pregnant(Y/N) feature is used to plot the histogram. This graph shows the highest value of both classes at no pregnancy. In Figure 7(D), BP_Diastolic(mmHg) is used to plot the histogram. The highest frequency of class 0 is 250 at a BP_Diastolic of 80. In Figure 7(E), the maximum frequency of RR(breaths/min) is at a value of 10, which is above 175 for No PCOS and 75 for Yes PCOS. In Figure 7(F), the feature TSH(mIU/L) has its frequencies between zero and ten. The Yes PCOS class has its highest frequency, 90, at 0, and the No PCOS class reaches approximately 340 at a value of 5. In Figure 7(G), the BP_Systolic(mmHg) feature is taken to analyze the frequency distribution. The highest frequency of BP_Systolic(mmHg) is 175 at 100 for No PCOS. In Figure 7(H), PRL(ng/mL) has its highest frequency at 20, gradually decreasing afterwards. In Figure 7(I), the Waist Hip Ratio values range from 0.75 to 0.95. The highest frequency for both the Yes and No PCOS classes is at 0.95.

D. DATASET SPLITTING
Data splitting is applied to prevent model overfitting and to evaluate the trained model on the unseen test portion of the dataset. The PCOS dataset is split into two portions for training and testing the employed machine learning models. An 80:20 ratio is used for dataset splitting: 80% of the dataset is used for model training, and 20% is used to evaluate the employed models on unseen data. Our research models are trained and evaluated with high accuracy results.

IV. EMPLOYED MACHINE LEARNING TECHNIQUES
The machine learning techniques employed for PCOS prediction are examined in this section. The working mechanism and mathematical notation of the machine learning models are described. Ten predictive machine learning models are examined for PCOS prediction in our research study.
The stochastic gradient descent (SGD) classifier [33] uses loss functions based on the SGD learning routine for classification. SGD is used for large-scale learning, is easy to build, and has good efficiency. SGD is an efficient optimization method that minimizes a loss function by finding the optimal parameter values of the function.
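The 80:20 split described under Dataset Splitting and the per-sample updates that SGD performs can be sketched together in plain Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the data are synthetic (541 one-feature samples, matching only the dataset size), the loss is the logistic loss, the held-out count is rounded up, and the helper names `split_indices` and `sgd_fit` are hypothetical.

```python
import math
import random

def split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle indices and hold out ceil(test_ratio * n_samples) for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_ratio)  # assumption: round the test size up
    return indices[n_test:], indices[:n_test]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_fit(xs, ys, lr=0.1, epochs=50, seed=0):
    """Per-sample gradient steps on the logistic loss for a single feature."""
    w, b = 0.0, 0.0
    rng = random.Random(seed)
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            error = sigmoid(w * xs[i] + b) - ys[i]  # loss gradient w.r.t. the logit
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b

# Synthetic stand-in data: two well-separated classes, NOT the PCOS dataset.
data_rng = random.Random(42)
labels = [i % 2 for i in range(541)]
features = [(1.0 if y == 1 else -1.0) + data_rng.gauss(0.0, 0.3) for y in labels]

train_idx, test_idx = split_indices(len(labels))
w, b = sgd_fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
predictions = [1 if sigmoid(w * features[i] + b) >= 0.5 else 0 for i in test_idx]
accuracy = sum(p == labels[i] for p, i in zip(predictions, test_idx)) / len(test_idx)
print(len(train_idx), len(test_idx), round(accuracy, 2))
```

With 541 samples, rounding 20% up gives 109 test samples, which agrees with the support total (70 + 39) reported in Table 6.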
Figure A: The histogram is plotted on the Hip(inch) feature of the dataset to examine the cause of PCOS.
Figure B: The histogram is plotted on the Hb(g/dl) feature of the dataset to examine the cause of PCOS.
Figure C: The histogram is plotted on the Pregnant(Y/N) feature of the dataset to examine the cause of PCOS.
Figure D: The histogram is plotted on the BP_Diastolic(mmHg) feature of the dataset to examine the cause of PCOS.
Figure E: The histogram is plotted on the RR(breaths/min) feature of the dataset to examine the cause of PCOS.
Figure F: The histogram is plotted on the TSH(mIU/L) feature of the dataset to examine the cause of PCOS.
Figure G: The histogram is plotted on the BP_Systolic(mmHg) feature of the dataset to examine the cause of PCOS.
Figure H: The histogram is plotted on the PRL(ng/mL) feature of the dataset to examine the cause of PCOS.
Figure I: The histogram is plotted on the Waist Hip Ratio feature of the dataset to examine the cause of PCOS.
FIGURE 7. The histogram analysis of the frequency distribution of PCOS Yes or No for selected features.
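The class-wise frequency distributions shown in these histograms amount to counting, per class, how many samples fall into each bin. A minimal sketch follows, using made-up hip measurements (hypothetical values, not taken from the dataset), an assumed bin width of 5, and a hypothetical helper name `class_histogram`.

```python
def class_histogram(values, labels, start, width, n_bins):
    """Count samples per bin separately for each class label."""
    counts = {label: [0] * n_bins for label in sorted(set(labels))}
    for value, label in zip(values, labels):
        bin_index = int((value - start) // width)
        if 0 <= bin_index < n_bins:  # ignore values outside the binned range
            counts[label][bin_index] += 1
    return counts

# Hypothetical measurements; label 0 = No PCOS, 1 = Yes PCOS.
hip = [34, 36, 38, 39, 41, 36, 37, 44, 38, 33]
pcos = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0]

# Bins: [30, 35), [35, 40), [40, 45)
hist = class_histogram(hip, pcos, start=30, width=5, n_bins=3)
print(hist)
```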
The performance of SGD is based on the loss function. The logistic cost function is expressed in equation 1.

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \tag{1}$$

Linear regression (LIR) [34] is a statistical method, used here for classification, that finds the linear relationship between the dependent variable (y) and the independent variables (x). A linear relationship describes how the dependent variable values change according to the independent variable values. The LIR model [35] provides a straight line separating the data points. The regression line in the LIR model minimizes the sum of the squares of the residuals, known as ordinary least squares (OLS). The mathematical notation of the LIR model is given in equation 2.

$$Y = mX + b \tag{2}$$

The random forest (RF) [36] is a supervised classification model that creates a forest of multiple decision trees. The decision trees are created randomly based on the data samples. Decision nodes represent the features, and tree leaf nodes represent the target output. The majority vote of the decision trees is selected as the final prediction. The gini index and entropy are used for data splitting in tree nodes, as expressed in equations 3 and 4.

$$\text{Gini index} = 1 - \sum_{i=1}^{n} (P_i)^2 \tag{3}$$

$$\mathrm{Entropy}(S) = -p_{(+)} \log p_{(+)} - p_{(-)} \log p_{(-)} \tag{4}$$

The bayesian ridge (BR) [37] algorithm uses probability computations for the classification task. The BR model is suitable for real-world problems where the data is insufficient and poorly distributed. The BR model formulates a linear regression model by using probability distributions. The BR model predicts the target (y) by drawing it from a probability distribution instead of estimating a single value. The mathematical notation for the target y under the BR model is expressed in equation 5.

$$p(y \mid X, w, \alpha) = \mathcal{N}(y \mid Xw, \alpha) \tag{5}$$
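Equations 1, 3, and 4 can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and the base-2 logarithm in the entropy is an assumption, since the equation does not state a base.

```python
import math

def logistic_cost(h, y):
    """Equation 1: cost of predicting probability h when the true label is y."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def gini_index(proportions):
    """Equation 3: one minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(p_pos, p_neg):
    """Equation 4 with log base 2 (assumed); terms with zero probability contribute 0."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

# A confident wrong prediction costs more than an uncertain one.
print(logistic_cost(0.5, 1), logistic_cost(0.1, 1))

# Impurity of a node holding the test-set class mix reported in Table 6 (70 vs 39).
p0, p1 = 70 / 109, 39 / 109
print(gini_index([p0, p1]), entropy(p1, p0))
```

Both impurity measures are zero for a pure node and largest for an even class mix, which is why a split that lowers them is preferred.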
TABLE 3. The best-fit hyperparameters analysis of all employed machine learning models.
Technique Hyperparameters
SGD loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, max_iter=1000, tol=1e-3, learning_rate='optimal'.
LIR copy_X=True, fit_intercept=True, positive=False, normalize=False.
RF max_depth=20, random_state=0, n_estimators=100, criterion='gini', max_features='sqrt', bootstrap=True.
BR tol=1e-3, n_iter=300, alpha_1=1e-6, lambda_1=1e-6, alpha_2=1e-6, lambda_2=1e-6.
SVM kernel='linear', C=1.0, degree=3, gamma='scale', tol=1e-3, cache_size=200, decision_function_shape='ovr'.
KNC n_neighbors=5, weights='uniform', algorithm='auto', metric='minkowski', leaf_size=30, p=2.
MLP hidden_layer_sizes=(100,), activation='relu', solver='adam', alpha=0.0001, learning_rate='constant'.
LOR penalty='l2', tol=1e-4, C=1.0, solver='lbfgs'.
GNB var_smoothing=1e-9.
GBC loss='log_loss', max_depth=3, learning_rate=0.1, criterion='friedman_mse', n_estimators=100.
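For reuse in code, a few rows of Table 3 can be kept as keyword-argument dictionaries. The parameter names coincide with constructor arguments of scikit-learn estimators, but the source does not name the library here, so that mapping is an assumption. The dictionary name `HYPERPARAMETERS` and the helper `describe` are hypothetical.

```python
# Selected rows of Table 3 as keyword arguments (values copied from the table).
HYPERPARAMETERS = {
    "SGD": {"loss": "hinge", "penalty": "l2", "alpha": 0.0001, "l1_ratio": 0.15,
            "max_iter": 1000, "tol": 1e-3, "learning_rate": "optimal"},
    "KNC": {"n_neighbors": 5, "weights": "uniform", "algorithm": "auto",
            "metric": "minkowski", "leaf_size": 30, "p": 2},
    "LOR": {"penalty": "l2", "tol": 1e-4, "C": 1.0, "solver": "lbfgs"},
    "GNB": {"var_smoothing": 1e-9},
}

def describe(technique):
    """Render one row back into 'name=value' form, e.g. for experiment logs."""
    params = HYPERPARAMETERS[technique]
    return ", ".join(f"{name}={value!r}" for name, value in params.items())

print(describe("GNB"))
```

If scikit-learn is the library in use, a row could be unpacked into the matching estimator, for example `GaussianNB(**HYPERPARAMETERS["GNB"])`. Some names in the full table (such as `normalize` and `n_iter`) depend on the installed version and may not be accepted by newer releases.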
TABLE 4. The comparative performance evaluation of employed machine learning models for unseen test data without using the proposed technique.
Technique Training time(second) Accuracy (%) Precision (%) Recall (%) F1-score (%)
SGD 0.006 79 82 79 79
LIR 0.034 84 85 84 85
RF 0.193 89 89 89 89
BR 0.014 84 85 84 84
SVM 0.666 88 88 88 88
KNC 0.002 70 68 70 68
MLP 0.472 83 83 83 83
LOR 0.042 80 80 80 80
GNB 0.003 81 81 81 80
GBC 0.259 89 89 89 89
FIGURE 8. The accuracy scores comparative evaluation of employed machine learning models for unseen test data without using the proposed
technique.
TABLE 5. Using the proposed technique, the comparative performance evaluation of the employed machine learning model for unseen test data.
Technique Training time(second) Accuracy (%) Precision (%) Recall (%) F1-score (%)
SGD 0.004 69 68 69 68
LIR 0.024 100 100 100 100
RF 0.147 100 100 100 100
BR 0.004 100 100 100 100
SVM 0.842 100 100 100 100
KNC 0.002 56 53 56 54
MLP 0.592 99 99 99 99
LOR 0.025 100 100 100 100
GNB 0.002 100 100 100 100
GBC 0.071 100 100 100 100
FIGURE 9. Using the proposed technique, the accuracy scores comparative evaluation of employed machine learning models for unseen test data.
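The single precision, recall, and f1 values per technique in Table 5 appear consistent with support-weighted averages of the per-class values in Table 6; this is an inference from the numbers, not something the source states. The sketch below recomputes the SGD row from its per-class report (class 0: 0.72, 0.83, 0.77 with support 70; class 1: 0.59, 0.44, 0.50 with support 39). The helper name is hypothetical.

```python
def weighted_average(per_class_values, supports):
    """Average per-class metric values, weighting each class by its support."""
    total = sum(supports)
    return sum(v * s for v, s in zip(per_class_values, supports)) / total

supports = [70, 39]  # test samples per class (No PCOS, Yes PCOS)
sgd_report = {
    "precision": [0.72, 0.59],
    "recall": [0.83, 0.44],
    "f1": [0.77, 0.50],
}

sgd_weighted = {name: weighted_average(values, supports)
                for name, values in sgd_report.items()}
print({name: round(100 * value, 1) for name, value in sgd_weighted.items()})
```

The weighted recall reproduces the 69% reported for SGD, while precision and f1 come out within about one percentage point of the reported 68%, a gap that rounding of the two-decimal per-class inputs could plausibly explain.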
recall, and f1 score is 100%, achieved by LIR, RF, BR, SVM, score is 56%, the precision score is 53%, the recall score
LOR, GNB, and GBC techniques. The minimum accuracy is 56%, and the f1 score is 54%, achieved by the KNC
VOLUME 4, 2016 11
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3205587
FIGURE 10. The accuracy scores comparative analysis of the K-Fold technique to validate the overfitting of the employed learning techniques.
TABLE 6. The classification report analysis of employed learning models for each employed learning model is examined in Table 6.
by using the proposed technique.
The classification report values are calculated for the models
Technique  Target Category  Precision  Recall  F1-score  Support Score
SGD        0                0.72       0.83    0.77      70
SGD        1                0.59       0.44    0.50      39
LIR        0                1.00       1.00    1.00      70
LIR        1                1.00       1.00    1.00      39
RF         0                1.00       1.00    1.00      70
RF         1                1.00       1.00    1.00      39
SVM        0                1.00       1.00    1.00      70
SVM        1                1.00       1.00    1.00      39
BR         0                1.00       1.00    1.00      70
BR         1                1.00       1.00    1.00      39
KNC        0                0.63       0.74    0.68      70
KNC        1                0.33       0.23    0.27      39
MLP        0                1.00       0.99    0.99      70
MLP        1                0.97       1.00    0.99      39
LOR        0                1.00       1.00    1.00      70
LOR        1                1.00       1.00    1.00      39
GNB        0                1.00       1.00    1.00      70
GNB        1                1.00       1.00    1.00      39
GBC        0                1.00       1.00    1.00      70
GBC        1                1.00       1.00    1.00      39

technique. The time complexity analysis shows that GNB has a minimal training time of 0.002 seconds while still achieving high performance metric scores. GNB is therefore our proposed model for predicting PCOS.

The classification report analysis by individual target class using the proposed approach demonstrates that KNC and SGD have low scores in the class-wise metric evaluations. The best-performing GNB model achieved 100% scores in the classification report analysis.

TABLE 7. The k-fold cross-validation analysis to check the employed learning techniques for overfitting.

Sr no  K-Fold  Technique  Accuracy Score (%)
1      10      SGD        60
2      10      LIR        100
3      10      RF         100
4      10      BR         100
5      10      SVM        100
6      10      KNC        60
7      10      MLP        98
8      10      LOR        100
9      10      GNB        100
10     10      GBC        100

To check the employed machine learning models for overfitting, we applied the k-fold cross-validation technique, as analyzed in Table 7. Ten folds of the dataset are used for validation. The analysis demonstrates that the LIR, RF, BR, SVM, LOR, GNB, and GBC techniques achieved 100% scores using our proposed approach and 100% accuracy using k-fold validation. Figure 10 shows a comparative analysis of the accuracy of the employed models using k-fold validation. The visualized analysis demonstrates that the MLP model achieved 99% accuracy with the proposed approach and 98% accuracy with k-fold validation. The SGD and KNC models achieve the lowest accuracy scores in this analysis, 60% each. In conclusion, all employed models are validated using the k-fold technique. The k-fold analysis demonstrates that our employed machine learning models are not overfitted. The models generalize and give accurate results on unseen test data.
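The ten-fold validation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (541 rows to mirror the dataset size, with three made-up Gaussian features), the Gaussian naive Bayes classifier is written from scratch with only the Python standard library, and the CS-PCOS feature selection step is omitted. The function names `fit_gnb`, `predict_gnb`, and `k_fold_accuracy` are hypothetical.

```python
import math
import random
from statistics import mean


def fit_gnb(rows, labels):
    """Estimate per-class priors, feature means, and feature variances."""
    model = {}
    for cls in set(labels):
        members = [r for r, l in zip(rows, labels) if l == cls]
        columns = list(zip(*members))
        means = [mean(c) for c in columns]
        # Small constant keeps variances strictly positive (smoothing assumption).
        variances = [
            mean((v - m) ** 2 for v in c) + 1e-9 for c, m in zip(columns, means)
        ]
        model[cls] = (len(members) / len(rows), means, variances)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            for x, mu, var in zip(row, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)


def k_fold_accuracy(rows, labels, k=10, seed=0):
    """Shuffle once, split into k folds, and score each held-out fold."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit_gnb([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict_gnb(model, rows[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores


# Synthetic two-class data (assumption): 541 rows, 3 class-shifted Gaussian features.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(541)]
rows = [[rng.gauss(3.0 * y, 1.0) for _ in range(3)] for y in labels]

fold_scores = k_fold_accuracy(rows, labels, k=10)
print("per-fold accuracy:", [round(s, 3) for s in fold_scores])
print("mean accuracy:", round(mean(fold_scores), 3))
```

Comparing the per-fold scores with the single train/test score is what supports the overfitting check: a model that scores highly on one split but much lower across folds would be suspect, whereas similar scores across all ten folds suggest the result does not depend on a particular split.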
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3205587
TABLE 8. The performance validation comparative analysis with the past applied state-of-the-art approaches.

Literature  Year  Learning Type     Proposed Technique  Accuracy (%)  Recall (%)  Precision (%)
[22]        2020  Machine Learning  RFLR                91            90          89
Proposed    2022  Machine Learning  CS-PCOS + GNB       100           100         100
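The accuracy, recall, and precision columns in Table 8 follow from confusion-matrix counts. As a minimal sketch (the helper names `f1_score` and `metrics_from_counts` are ours, not from the paper), the counts reported for GNB in this section, 70 TP, 39 TN, and no FP or FN, reproduce the 100% scores, and the same F1 formula reproduces the class-wise SGD rows of the classification report:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)


# Counts the paper reports for GNB: 70 TP, 39 TN, no FP or FN.
gnb = metrics_from_counts(tp=70, tn=39, fp=0, fn=0)
print(gnb)  # (1.0, 1.0, 1.0, 1.0)

# Cross-check against the class-wise SGD rows of the classification report.
print(round(f1_score(0.72, 0.83), 2))  # 0.77
print(round(f1_score(0.59, 0.44), 2))  # 0.5
```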
FIGURE 11. The confusion matrix validation analysis of our proposed model.

The comparative analysis with previously applied state-of-the-art studies is examined in Table 8. The comparison parameters are the year, learning type, proposed technique, accuracy score, recall score, and precision score. The analysis demonstrates that, using our novel CS-PCOS technique, the best-performing GNB model achieved the highest scores compared with the previously proposed techniques. Our proposed model outperformed the state-of-the-art studies.

The confusion matrix analysis is conducted to validate our performance metric scores, as shown in Figure 11. The confusion matrix is for the best-performing GNB model. The analysis demonstrates that 70 samples are found as TP and 39 samples are found as TN. No samples are found as FN or FP in this analysis. The confusion matrix validates that our proposed model achieves a 100% accuracy score in predicting PCOS.

VI. CONCLUSIONS
The prediction of PCOS using data from 541 patients through machine learning is proposed in this research study. A novel CS-PCOS feature selection technique is proposed. Ten machine learning techniques, SGD, LIR, RF, BR, SVM, KNC, MLP, LOR, GNB, and GBC, are applied in comparison. The proposed GNB outperformed the others with a 100% accuracy score and a computation time of 0.002 seconds using the proposed CS-PCOS feature selection technique. The comparison with state-of-the-art studies shows that the proposed model outperformed them. The proposed model is checked for overfitting using a ten-fold cross-validation technique. Our research study concludes that the dataset features prolactin (PRL), blood pressure systolic, blood pressure diastolic, thyroid stimulating hormone (TSH), respiratory rate (RR-breaths), and

REFERENCES
[1] I. Kyrou, E. Karteris, T. Robbins, K. Chatha, F. Drenos, and H. S. Randeva, “Polycystic ovary syndrome (PCOS) and COVID-19: An overlooked female patient population at potentially higher risk during the COVID-19 pandemic,” BMC Medicine, vol. 18, pp. 1–10, Jul. 2020.
[2] B. J. Sherman, N. L. Baker, K. T. Brady, J. E. Joseph, L. M. Nunn, and A. McRae-Clark, “The effect of oxytocin, gender, and ovarian hormones on stress reactivity in individuals with cocaine use disorder,” Psychopharmacology, vol. 237, pp. 2031–2042, May 2020.
[3] X. Z. Zhang, Y. L. Pang, X. Wang, and Y. H. Li, “Computational characterization and identification of human polycystic ovary syndrome genes,” Scientific Reports, vol. 8, p. 12949, Dec. 2018.
[4] E. Khashchenko, E. Uvarova, M. Vysokikh, T. Ivanets, L. Krechetova, N. Tarasova, I. Sukhanova, F. Mamedova, P. Borovikov, I. Balashov, and G. Sukhikh, “The Relevant Hormonal Levels and Diagnostic Features of Polycystic Ovary Syndrome in Adolescents,” Journal of Clinical Medicine, vol. 9, p. 1831, Jun. 2020.
[5] M. Woźniak, R. Krajewski, S. Makuch, and S. Agrawal, “Phytochemicals in Gynecological Cancer Prevention,” International Journal of Molecular Sciences, vol. 22, p. 1219, Jan. 2021.
[6] D. Dewailly, M. E. Lujan, E. Carmina, M. I. Cedars, J. Laven, R. J. Norman, and H. F. Escobar-Morreale, “Definition and significance of polycystic ovarian morphology: a task force report from the Androgen Excess and Polycystic Ovary Syndrome Society,” Human Reproduction Update, vol. 20, no. 3, pp. 334–352, 2014.
[7] A. S. Prapty and T. T. Shitu, “An Efficient Decision Tree Establishment and Performance Analysis with Different Machine Learning Approaches on Polycystic Ovary Syndrome,” ICCIT 2020 - 23rd International Conference on Computer and Information Technology, Proceedings, Dec. 2020.
[8] E. C. Costa, J. C. F. De Sá, N. K. Stepto, I. B. B. Costa, L. F. Farias-Junior, S. D. N. T. Moreira, E. M. M. Soares, T. M. A. M. Lemos, R. A. V. Browne, and G. D. Azevedo, “Aerobic Training Improves Quality of Life in Women with Polycystic Ovary Syndrome,” Medicine and Science in Sports and Exercise, vol. 50, pp. 1357–1366, Jul. 2018.
[9] M. A. Karimzadeh and M. Javedani, “An assessment of lifestyle modification versus medical treatment with clomiphene citrate, metformin, and clomiphene citrate–metformin in patients with polycystic ovary syndrome,” Fertility and Sterility, vol. 94, pp. 216–220, Jun. 2010.
[10] I. Almenning, A. Rieber-Mohn, K. M. Lundgren, T. S. Løvvik, K. K. Garnæs, and T. Moholdt, “Effects of High Intensity Interval Training and Strength Training on Metabolic, Cardiovascular and Hormonal Outcomes in Women with Polycystic Ovary Syndrome: A Pilot Study,” PLOS ONE, vol. 10, p. e0138793, Sep. 2015.
[11] D. Chizen, S. Serrao, J. Rooke, L. McBreairty, R. Pierson, P. Chilibeck, and G. Zello, “The “pulse” diet PCOS,” Fertility and Sterility, vol. 102, p. e267, Sep. 2014.
[12] H. H. Mehrabani, S. Salehpour, B. J. Meyer, and F. Tahbaz, “Beneficial effects of a high-protein, low-glycemic-load hypocaloric diet in overweight and obese women with polycystic ovary syndrome: a randomized controlled intervention study,” Journal of the American College of Nutrition, vol. 31, pp. 117–125, Apr. 2012.
[13] F. Giallauria, S. Palomba, L. Maresca, L. Vuolo, D. Tafuri, G. Lombardi, A. Colao, C. Vigorito, and F. Orio, “Exercise training improves autonomic function and inflammatory pattern in women with polycystic ovary syndrome (PCOS),” Clinical Endocrinology, vol. 69, pp. 792–798, Nov. 2008.
[14] F. Saleem and S. W. Rizvi, “New Therapeutic Approaches in Obesity and Metabolic Syndrome Associated with Polycystic Ovary Syndrome,” Cureus, Nov. 2017.
[15] G. Ladson, W. C. Dodson, S. D. Sweet, A. E. Archibong, A. R. Kunselman, L. M. Demers, N. I. Williams, P. Coney, and R. S. Legro, “The effects of metformin with lifestyle therapy in polycystic ovary syndrome: a randomized double-blind study,” Fertility and Sterility, vol. 95, Mar. 2011.
[16] A. Gambineri, L. Patton, A. Vaccina, M. Cacciari, A. M. Morselli-Labate, C. Cavazza, U. Pagotto, and R. Pasquali, “Treatment with flutamide, metformin, and their combination added to a hypocaloric diet in overweight-obese women with polycystic ovary syndrome: a randomized, 12-month, placebo-controlled study,” The Journal of Clinical Endocrinology and Metabolism, vol. 91, no. 10, pp. 3970–3980, 2006.
[17] A. Qayyum, J. Qadir, M. Bilal, and A. Al-Fuqaha, “Secure and Robust Machine Learning for Healthcare: A Survey,” IEEE Reviews in Biomedical Engineering, vol. 14, pp. 156–180, 2021.
[18] A. Garg and V. Mago, “Role of machine learning in medical research: A survey,” Computer Science Review, vol. 40, p. 100370, May 2021.
[19] D. Hu, W. Dong, X. Lu, H. Duan, K. He, and Z. Huang, “Evidential MACE prediction of acute coronary syndrome using electronic health records,” BMC Medical Informatics and Decision Making, vol. 19, no. 2, pp. 9–17, 2019.
[20] M. Mubasher Hassan and T. Mirza, “Comparative Analysis of Machine Learning Algorithms in Diagnosis of Polycystic Ovarian Syndrome,” International Journal of Computer Applications, vol. 175, pp. 42–53, Sep. 2020.
[21] G. Du, L. Ma, J.-S. Hu, J. Zhang, Y. Xiang, D. Shao, and H. Wang, “Prediction of 30-day readmission: an improved gradient boosting decision tree approach,” Journal of Medical Imaging and Health Informatics, vol. 9, no. 3, pp. 620–627, 2019.
[22] S. Bharati, P. Podder, and M. R. Hossain Mondal, “Diagnosis of Polycystic Ovary Syndrome Using Machine Learning Algorithms,” 2020 IEEE Region 10 Symposium, TENSYMP 2020, pp. 1486–1489, Jun. 2020.
[23] S. A. Bhat, Detection of Polycystic Ovary Syndrome using Machine Learning Algorithms. PhD thesis, Dublin, National College of Ireland, 2021.
[24] S. Yang, X. Zhu, L. Zhang, L. Wang, and X. Wang, “Classification and prediction of Tibetan medical syndrome based on the improved BP neural network,” IEEE Access, vol. 8, pp. 31114–31125, 2020.
[25] D. Dewailly, M. E. Lujan, E. Carmina, M. I. Cedars, J. Laven, R. J. Norman, and H. F. Escobar-Morreale, “Definition and significance of polycystic ovarian morphology: a task force report from the Androgen Excess and Polycystic Ovary Syndrome Society,” Human Reproduction Update, vol. 20, no. 3, pp. 334–352, 2014.
[26] A. Saravanan and S. Sathiamoorthy, “Detection of Polycystic Ovarian Syndrome: A Literature Survey,” Asian Journal of Engineering and Applied Technology, vol. 7, pp. 46–51, Nov. 2018.
[27] V. Thakre, S. Vedpathak, K. Thakre, and S. Sonawani, “PCOcare: PCOS Detection and Prediction using Machine Learning Algorithms,” Bioscience Biotechnology Research Communications, vol. 13, pp. 240–244, Dec. 2020.
[28] R. M. Aziz, “Nature-inspired metaheuristics model for gene selection and classification of biomedical microarray data,” Medical & Biological Engineering & Computing, vol. 60, no. 6, pp. 1627–1646, 2022.
[29] R. M. Aziz, “Application of nature inspired soft computing techniques for gene selection: a novel frame work for classification of cancer,” Soft Computing, pp. 1–18, 2022.
[30] Z. Na, W. Guo, J. Song, D. Feng, Y. Fang, and D. Li, “Identification of novel candidate biomarkers and immune infiltration in polycystic ovary syndrome,” Journal of Ovarian Research, vol. 15, no. 1, pp. 1–13, 2022.
[31] S. Dhar, S. Mridha, and P. Bhattacharjee, “Mutational landscape screening through comprehensive in silico analysis for polycystic ovarian syndrome–related genes,” Reproductive Sciences, vol. 29, no. 2, pp. 480–496, 2022.
[32] Prasoon Kottarathil, “Polycystic ovary syndrome (PCOS) | Kaggle.”
[33] J. Huang, S. Ling, X. Wu, and R. Deng, “GIS-Based Comparative Study of the Bayesian Network, Decision Table, Radial Basis Function Network and Stochastic Gradient Descent for the Spatial Prediction of Landslide Susceptibility,” Land, vol. 11, p. 436, Mar. 2022.
[34] S. Ghosal, S. Sengupta, M. Majumder, and B. Sinha, “Linear Regression Analysis to predict the number of deaths in India due to SARS-CoV-2 at 6 weeks from day 0 (100 cases - March 14th 2020),” Diabetes & Metabolic Syndrome: Clinical Research & Reviews, vol. 14, pp. 311–315, Jul. 2020.
[35] H. Lee, J. Wang, and B. Leblon, “Using Linear Regression, Random Forests, and Support Vector Machine with Unmanned Aerial Vehicle Multispectral Images to Predict Canopy Nitrogen Weight in Corn,” Remote Sensing, vol. 12, p. 2071, Jun. 2020.
[36] M. A. Khan, S. A. Memon, F. Farooq, M. F. Javed, F. Aslam, and R. Alyousef, “Compressive Strength of Fly-Ash-Based Geopolymer Concrete by Gene Expression Programming and Random Forest,” Advances in Civil Engineering, vol. 2021, 2021.
[37] M. H. Na, W. H. Cho, S. K. Kim, and I. S. Na, “Automatic Weight Prediction System for Korean Cattle Using Bayesian Ridge Algorithm on RGB-D Image,” Electronics, vol. 11, p. 1663, May 2022.
[38] S. Shabani, S. Samadianfard, M. T. Sattari, A. Mosavi, S. Shamshirband, T. Kmet, and A. R. Várkonyi-Kóczy, “Modeling Pan Evaporation Using Gaussian Process Regression K-Nearest Neighbors Random Forest and Support Vector Machines; Comparative Analysis,” Atmosphere, vol. 11, p. 66, Jan. 2020.
[39] Mohebbanaaz, L. V. Rajani Kumari, and Y. Padma Sai, “Classification of Arrhythmia Beats Using Optimized K-Nearest Neighbor Classifier,” Lecture Notes in Networks and Systems, vol. 185 LNNS, pp. 349–359, 2021.
[40] R. Pahuja and A. Kumar, “Sound-spectrogram based automatic bird species recognition using MLP classifier,” Applied Acoustics, vol. 180, p. 108077, Sep. 2021.
[41] U. Azmat, Y. Y. Ghadi, T. Al Shloul, S. A. Alsuhibany, A. Jalal, and J. Park, “Smartphone Sensor-Based Human Locomotion Surveillance System Using Multilayer Perceptron,” Applied Sciences, vol. 12, p. 2550, Feb. 2022.
[42] A. M. Almeshal, A. I. Almazrouee, M. R. Alenizi, and S. N. Alhajeri, “Forecasting the Spread of COVID-19 in Kuwait Using Compartmental and Logistic Regression Models,” Applied Sciences, vol. 10, p. 3402, May 2020.
[43] K. Shah, H. Patel, D. Sanghvi, and M. Shah, “A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification,” Augmented Human Research, vol. 5, pp. 1–16, Mar. 2020.
[44] D. T. Barus, R. Elfarizy, F. Masri, and P. H. Gunawan, “Parallel Programming of Churn Prediction Using Gaussian Naïve Bayes,” 2020 8th International Conference on Information and Communication Technology, ICoICT 2020, Jun. 2020.
[45] L. Cataldi, L. Tiberi, and G. Costa, “Estimation of MCS intensity for Italy from high quality accelerometric data, using GMICEs and Gaussian Naïve Bayes Classifiers,” Bulletin of Earthquake Engineering, vol. 19, pp. 2325–2342, Apr. 2021.
[46] D. D. Rufo, T. G. Debelee, A. Ibenthal, and W. G. Negera, “Diagnosis of Diabetes Mellitus Using Gradient Boosting Machine (LightGBM),” Diagnostics, vol. 11, p. 1714, Sep. 2021.
[47] C. Bowd, A. Belghith, J. A. Proudfoot, L. M. Zangwill, M. Christopher, M. H. Goldbaum, H. Hou, R. C. Penteado, S. Moghimi, and R. N. Weinreb, “Gradient-Boosting Classifiers Combining Vessel Density and Tissue Thickness Measurements for Classifying Early to Moderate Glaucoma,” American Journal of Ophthalmology, vol. 217, pp. 131–139, Sep. 2020.
[48] J. Isabona, A. L. Imoize, and Y. Kim, “Machine Learning-Based Boosted Regression Ensemble Combined with Hyperparameter Tuning for Optimal Adaptive Learning,” Sensors, vol. 22, p. 3776, May 2022.
[49] E. Elgeldawi, A. Sayed, A. R. Galal, and A. M. Zaki, “Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis,” Informatics, vol. 8, p. 79, Nov. 2021.
SHAZIA NASIM is pursuing her MS in Computer Science at the Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan. She received a Master of Computer Science degree in 2012 from the Bahauddin Zakariya University, Multan. Her current research interests include data mining and machine learning.

FAIZAN YOUNAS was born in Pakistan in 1999. He received the Bachelor of Science in Computer Science degree from the Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan, in 2021, and is pursuing his MS in Computer Science, also at KFUEIT. His main areas of research interest are Natural Language Processing (NLP), Machine Learning, and Deep Learning.