
Prediction of PCOS Using Machine Learning Algorithm

Uploaded by

ANMOL AGGARWAL
Copyright
© © All Rights Reserved

PROJECT REPORT

ON
Prediction of Polycystic Ovary Syndrome (PCOS) using Machine
Learning Algorithms

SUBMITTED TO

UNIVERSITY SCHOOL OF MANAGEMENT STUDIES


Guru Gobind Singh Indraprastha University

IN FULFILLMENT OF
POST GRADUATE DIPLOMA IN DATA ANALYTICS
(SECOND SEMESTER-2023-24)

UNDER THE SUPERVISION OF MR. SANJAY KUMAR

SUBMITTED BY:
ANISHA (00116640623)

DECLARATION BY THE STUDENT

I, Anisha, a student of PGDDA, hereby declare that the project titled “Prediction of
Polycystic Ovary Syndrome (PCOS) using Machine Learning Algorithms”, which is
submitted by me to PGDDA, GGSIPU (Main Campus), Dwarka Sector-14, in
partial fulfilment of the requirement for the award of the degree of PGDDA, has
not previously formed the basis for the award of any degree, diploma or
other similar title or recognition.

The Author attests that permission has been obtained for the use of any copyrighted
material appearing in the Dissertation / Project report other than brief
excerpts requiring only proper acknowledgement in scholarly writing and all
such use is acknowledged.

Name of the student: Anisha (00116640623)


Date: MAY 12TH, 2024

CERTIFICATE OF ORIGINALITY

Based on the declaration submitted by Anisha, student of PGDDA.

I hereby certify that the project titled “Prediction of Polycystic Ovary
Syndrome (PCOS) using Machine Learning Algorithms”, which is submitted
to PGDDA, GGSIPU (Main Campus), Dwarka Sector-14, in partial
fulfilment of the requirement for the award of the degree of PGDDA, is an
original contribution with existing knowledge and a faithful record of work
carried out by the student under my guidance and supervision.

To the best of my knowledge this work has not been submitted in part or
full for any Degree or Diploma to this University or elsewhere.

Date: May 6th, 2024

Signature

ACKNOWLEDGEMENT

It gives me immense pleasure to present this project report on “Prediction
of Polycystic Ovary Syndrome (PCOS) using Machine Learning Algorithms”,
carried out at GGSIPU (Main Campus), Dwarka Sector-14, PGDDA.

I am gratified to take this opportunity to express my gratitude to those who
have been helpful to me in completing this project report.

At the outset, I would like to thank my mentor, Mr. Sanjay Kumar, who
allowed me to undertake this project and helped me at every point
throughout its tenure. He patiently listened to my
difficulties, tried to sort them out, and gave valuable suggestions and
remarks to make this project a more meaningful one. His guidance has
taught me a lot about the technical domain.

I am grateful for the time he spent on this project out of his busy
schedule.

Lastly, I would like to thank my parents, friends, and well-wishers who
encouraged me to do this research work, and all those who contributed
directly or indirectly to completing this project, to whom I am obliged
even if they remain anonymous.

Abstract

Disease analysis is the study of medical records to assess whether a disease can be predicted
from diagnostic measures. In this major project, we analyse survey records to assess whether
it is possible to predict the onset of PCOS. The focus of this study is to use Python to
classify and predict PCOS follow-up control satisfaction data. The dataset consists of several
symptomatic predictors along with some lifestyle variables, which help make the prediction
more accurate. Variables include age, knowledge about PCOS, family history, consumption of
junk food and alcohol, exercise, sleep cycle, high blood pressure, period monitoring, pain
during menstruation, uneven hair growth, oily or acne-prone skin, hair fall or thinning of hair,
BMI, pain in the pelvic area, anxiety or depression, and so on.
In this study, we use machine learning approaches to forecast the likelihood of developing
PCOS and to consider how its effect can be mitigated.
• The goal of this project is to create a predictive machine learning model that can predict
whether a patient has PCOS or is prone to PCOS based on the diagnostic measures.
• This programme saves time, since it allows a quick assessment of whether a woman may
have PCOS.

Table of Contents

S. No.  Particulars

1   Candidate’s declaration
2   Certificate of originality
3   Acknowledgement
4   Abstract
5   Table of contents
6   Chapter 1: Introduction
7   Chapter 2: Literature Review
8   Chapter 3: Research Methodology
    3.1: Data Collection
    3.2: Data Cleaning
    3.3: Data Transformation
9   Chapter 4: Data Analysis and Results
10  Discussion
11  Conclusion
12  References

Chapter 1
Introduction
Polycystic Ovary Syndrome (PCOS) is a medical condition that causes a hormonal disorder
in women in their childbearing years. The hormonal imbalance leads to a delayed or even
absent menstrual cycle. Women with PCOS commonly suffer from excessive weight gain, facial
hair growth, acne, hair loss and irregular periods, leading to infertility in rare cases. In this
disorder, the ovaries develop small collections of fluid called follicles (cysts) and fail to
release eggs, which is why women suffering from PCOS tend to have complications in
conceiving [Zhang, 2018]. Many women have PCOS but are not diagnosed at an early stage.
In one study, 69 to 70 percent of women did not have a pre-existing diagnosis [Dewailly, 2013].
While the actual causes of PCOS remain a mystery, studies say that it is generally inherited.
Common symptoms include irregular periods, excessive androgen levels (male hormones), and
polycystic ovaries. Irregular periods, cysts on the ovaries, and excess unwanted facial hair
growth are the three top concerns reported by women. A woman’s lifestyle, eating habits, and
the stress and strain placed on the body are considered major contributing factors. PCOS is
like a sign which women generally ignore, thinking it is very common, and they do not seek
proper treatment. It is a common problem, but it can affect women’s health later in their
life. Around 50.86 percent of women are working in a multinational company which is 9 to 5
job and it reduces the time to focus on the diagnosed at the age of 20.8 years (SD 4.8).
Though it is not a life-threatening problem, it can be a cause of other serious
diseases. It is a very unpredictable condition, as the cure is uncertain and there is no
observable trend for this medical condition. The time and cost of innumerable medical
tests and scans is a burden for patients and doctors alike. Hence, early diagnosis and
treatment are important, as long-term health risks like type 2 diabetes and cardiovascular
diseases can be reduced by simple changes in lifestyle. Early detection can help make
necessary lifestyle changes beforehand and hence reduce the risks of the condition, as women
with PCOS are three times more likely to undergo miscarriages in the early stages of pregnancy,
suffer from infertility and, in rare cases, gynaecological cancer. In the past few decades,
technology has revolutionized our world and affected our lives, making them easier day by
day. Emerging technologies are reshaping mankind in many ways. These days, machine
learning, a field of study that gives computers the ability to learn without being explicitly
programmed, is playing a key role in the healthcare sector. Parameters such as
Follicle-Stimulating Hormone (FSH), Luteinizing Hormone (LH), Human Chorionic Gonadotropin
(HCG), number of follicles, Thyroid Stimulating Hormone (TSH), age, cycle length, cycle
regularity, etc. are considered to formulate the feature vector for our machine learning models.
Machine learning can deal with very large datasets, convert analysed data into clinical insights
and help in the diagnosis of various ailments. The existing methodologies and treatments are
insufficient for early-stage detection and prediction.
To deal with this problem, we propose a system which can help in the early detection and
prediction of PCOS from an optimal and minimal set of parameters. To detect whether a woman
is suffering from PCOS, five machine learning classifiers have been used: Random Forest,
SVM, Logistic Regression, Gaussian Naïve Bayes, and K-Nearest Neighbours.
Prediction of Polycystic Ovary Syndrome (PCOS) is regarded as one of the most important
subjects in clinical data analysis for women. With this project, we
hope to get a clearer picture of whether some women are prone to PCOS. A predictive
model like this will aid in identifying women who are at high risk of developing this
problem. Early detection enables early intervention and personalized care for women
who have or may develop PCOS. The dataset that we work on consists of
categorical data in the form of Yes and No, which is later converted into 1 and 0
for the analysis. Before moving further, cleaning and filtering are done on these records to
remove irrelevant data. This project helps in determining patterns and
relationships associated with PCOS. These datasets help enhance our understanding of PCOS
through data-driven approaches. Through this form, some women learned that
they are prone to PCOS in the near future, which was helpful because they can now seek
clinical guidance and make changes in their lifestyle. Comprising a diverse set of clinical
and demographic attributes, this dataset enables the development and evaluation of machine
learning models for PCOS diagnosis. Attributes include age, sex, and various questions about
lifestyle, eating habits and issues faced during the menstrual cycle, which women commonly
take lightly but which might be a sign of something bigger. The binary
target variable, indicating the presence or absence of PCOS, transforms the dataset into a
powerful tool for classification tasks. By using this dataset, we can explore the intricate
relationships between different health indicators and the likelihood of PCOS occurrence.

1.1 Basics of Polycystic Ovary Syndrome (PCOS)


Polycystic ovary syndrome (PCOS) is a common hormonal condition that affects women
of reproductive age. It usually starts during adolescence, but symptoms may fluctuate
over time. PCOS can cause hormonal imbalances, irregular periods, excess androgen
levels and cysts in the ovaries. Irregular periods, usually with a lack of ovulation, can
make it difficult to become pregnant. PCOS is a leading cause of infertility. PCOS is a
chronic condition and cannot be cured. However, some symptoms can be improved
through lifestyle changes, medications, and fertility treatments. The cause of PCOS is
unknown but women with a family history or type 2 diabetes are at higher risk.
PCOS is a significant public health problem and is one of the commonest hormonal
disturbances affecting women of reproductive age. The condition affects an estimated
8–13% of women of reproductive age, and up to 70% of cases are undiagnosed. The
prevalence of PCOS is higher among some ethnicities and these groups often experience
more complications, in particular related to metabolic problems. The biological and
psychological effects of PCOS, particularly those related to obesity, body image and
infertility, can lead to mental health challenges and social stigma.

Symptoms

• heavy, long, intermittent, unpredictable, or absent periods


• infertility
• acne or oily skin
• excessive hair on the face or body
• hair thinning or hair loss
• weight gain, especially around the belly.

People with PCOS are more likely to have other health conditions including:

• type 2 diabetes
• hypertension (high blood pressure)
• high cholesterol
• heart disease
• endometrial cancer (cancer of the inner lining of the uterus).

PCOS can also cause anxiety, depression, and a negative body image. Some symptoms
such as infertility, obesity and unwanted hair growth can lead to social stigma. This can
affect other life areas such as family, relationships, work, and involvement in the
community.

Types of PCOS

Some scientists propose dividing PCOS into types based on symptoms and hormone
levels:

• Non-hyperandrogenic PCOS (type D): You have problems with ovulation
(which can lead to irregular periods or loss of periods) and cysts on your ovaries.
But your levels of androgens (male hormones) are normal.
• Ovulatory PCOS (type C): You have increased levels of androgens along with
cysts on your ovaries.
• Non-PCO PCOS (type B): You have high levels of androgens as well as
problems with ovulation.
• Full-blown PCOS (type A): You have high levels of androgens, problems with
ovulation, as well as cysts on your ovaries.

Is PCOS genetic?

The genetic link to PCOS is not clear, but you are more likely to get it if your close
relatives also have it. Some 20%-40% of those with PCOS have a mother or sister with
the condition. This may be related to similar lifestyles as well as to genes.

Chapter 2
Literature Review

The reported share of women suffering from PCOS worldwide has risen from 1 in 10 to
currently 3-4 in 10, and PCOS is increasing rapidly among women due to unhealthy lifestyles.
The literature says that 1 in every 5 women in India suffers from PCOS. It is estimated that
105 million women aged 15 to 49 suffer from this syndrome worldwide [Nasiri Amiri, Ramezani
Tehrani, Simbar, & Mohammadpour Thamtan, 2013]. Women with PCOS are at risk of
fertility problems (menstrual disorders, failure to ovulate, late menopause, endometrial cancer,
and infertility), metabolic problems (insulin resistance, type 2 diabetes, dyslipidemia,
hypertension, and cardiovascular diseases), physical problems (central obesity, acne,
hirsutism, hair loss and baldness), and psychological problems (depression, stress, and
anxiety) [Bozdag & Yildiz, 2013; Balakrishnan, 2013; Lass, Kleber, Winkel, Wunsch, &
Reinehr, 2011].
PCOS symptoms differ in every patient. The major diagnosis includes scanning for follicles,
their number and sizes using Ultrasound imaging. We need to refer to the categories of PCOS
standards to gain complete understanding of what PCOS is. Even though it is called
Polycystic Ovary Syndrome, it is not essentially described by ovarian cysts. The condition
affects an estimated 8–13% of women of reproductive age, and up to 70% of cases are
undiagnosed. The prevalence of PCOS is higher among some ethnicities and these groups
often experience more complications, in particular related to metabolic problems. The
biological and psychological effects of PCOS, particularly those related to obesity, body
image and infertility, can lead to mental health challenges and social stigma. Almost 50% of
the women with polycystic ovary syndrome (PCOS) are obese. Obesity in PCOS affects
reproduction via various mechanisms. Hyperandrogenism, increased luteinizing hormone
(LH) and insulin resistance play a pivotal role. Several substances produced by the adipose
tissue including leptin, adiponectin, resistin and visfatin may play a role in the
pathophysiology of PCOS [Fertil Steril (2009)]. Women with PCOS are often overweight
(35–80%; body mass index (BMI) above 25 kg/m2) or obese (20–69%; BMI above 30 kg/m2)
[Fertil Steril (2009)], and the rate is affected by various parameters including ethnicity. The
prevalence of PCOS is 4.3% among women with a body mass index (BMI) less than or equal
to 25 kg/m2 and 14% among women with a BMI above 30 kg/m2. Moreover, it has been
reported that the risk of obesity is four times higher among patients with PCOS than among
healthy controls. Obesity in women with PCOS not only involves the peripheral tissue but
also a significant increase occurs in the intra-abdominal fat, which is independent of obesity
[Trends Endocrinol Metab(2007)]. Previous studies have shown that a high BMI causes
metabolic abnormalities in patients with PCOS, such as increased insulin resistance and
exacerbation of hyperandrogenemia. Increased body weight and insulin resistance are the
underlying causes of symptoms in PCOS patients with obesity. Hence, international
evidence-based guidelines emphasize the importance of pre-pregnancy weight management
among PCOS patients. Good nutrition also plays an important role in dealing with PCOS. If
a woman consumes a properly balanced, highly nutritious diet on a daily basis, it can help
reduce the risk of PCOS. An average woman should consume a proper amount of calcium,
vitamins, and minerals daily to keep herself fit. Exercising and keeping oneself fit is another
very important factor in dealing with PCOS. When a woman exercises daily, many
hormonal activities get a boost. The body starts to release “happy hormones”, and
metabolic activity is also maintained, which directly helps in managing PCOS.
PCOS can also have a significant negative impact on women’s health-related quality of life
and psychological function. It has been reported that women with PCOS have higher levels of
depression than other women, a finding replicated in a variety of populations. Likewise,
women with PCOS typically report higher levels of anxiety compared with healthy women.
High anxiety levels have also been reported in adolescent girls with PCOS. In PCOS,
androgens increase sebum production, causing acne on the face and oily skin, followed by
brown, velvety, moist, verrucous hyperpigmentation of the skin, usually
seen on the back of the neck and in intertriginous areas such as the armpits, groin, underneath
the breasts, and inner thighs. Women with PCOS usually have anovulatory cycles, which
predispose to endometrial hyperplasia and later carcinoma. So, women with irregular cycles
should be advised to take OCPs or progesterone withdrawal at least every 3–4 months. In
women with abnormal uterine bleeding or thickened endometrium, endometrial biopsy
should be considered to rule out hyperplasia or malignancy [Royal College of Obstetricians
and Gynaecologists; 2007].
Hyperprolactinemia, Cushing’s syndrome, and non-classic congenital adrenal hyperplasia are
a few examples. [Zhang, 2018] used different machine learning algorithms such as K-nearest
neighbour (KNN), decision tree and SVM with different kernel functions to predict PCOS
from the identification of new genes. [P. Mehrotra, 2011] used machine learning
algorithms such as Bayes and Logistic Regression (LR) to develop an automated system that
acts as an assistive tool for the doctor, saving considerable time in examining patients
and hence reducing the delay in diagnosing the risk of PCOS, by using metabolic and clinical
factors in a feature vector. [Norman, 2007] carried out a comprehensive study of the disorder
and its three diagnostic criteria, giving insights into not just PCOS but also
abnormalities of insulin, gonadotropin and folliculogenesis. [Essah, 2006] discussed
the overlap between the metabolic syndrome and polycystic ovary
syndrome (PCOS). That article discusses the existing data regarding the familiarity,
characteristics, and treatment of the metabolic syndrome in women with PCOS. [Amsy
Denny, 2011] proposed a system for the early detection and prediction of PCOS from
an optimal and minimal but promising set of clinical and metabolic parameters, which act as
early markers for this disease. [Dewailly, 2013] illustrated how the diagnosis
of PCOS depends on biological, clinical, and morphological criteria. As ultrasonography has
advanced technologically, the excess follicle count has become the primary criterion of
polycystic ovarian morphology (PCOM). Since 2003, most investigators have used a threshold
of 12 follicles (measuring 2–9 mm in diameter) per whole ovary, but that now seems obsolete
[A. Saravanan, 2018]. Ovarian volume or area may also be
considered accurate markers of PCOS morphology, yet their utility compared with the excess
follicle count remains unclear.

Summary of the Literature Review
PCOS, a condition affecting 3-4 in 10 women globally, is increasing due to unhealthy
lifestyles. It affects 15-49-year-old women worldwide and is linked to fertility, metabolic,
physical, and psychological issues. India has 1 in 5 women affected. Polycystic Ovary
Syndrome (PCOS) affects 8-13% of women of reproductive age, with 70% undiagnosed. Its
symptoms vary, and diagnosis involves ultrasound imaging for follicle numbers and sizes.
Ethnicity-specific complications are higher, leading to metabolic issues. Obesity affects
reproduction and can lead to mental health challenges and social stigma. Hyperandrogenism,
increased luteinizing hormone, and insulin resistance are key factors in the pathophysiology
of polycystic ovary syndrome (PCOS). Women with PCOS are overweight or obese, with a
higher risk of obesity compared to healthy controls. Obesity not only affects peripheral tissue
but also intra-abdominal fat. High BMI causes metabolic abnormalities, such as increased
insulin resistance and hyperandrogenemia exacerbation. Pre-pregnancy weight management
and good nutrition are crucial for managing PCOS, with a balanced diet and adequate
calcium, vitamins, and minerals. Exercise is crucial for managing PCOS, as it stimulates
hormonal activities and maintains metabolic activities. However, PCOS can negatively
impact women's health and psychological function, leading to higher levels of depression and
anxiety. Androgens increase sebum production, causing acne, oily skin, and
hyperpigmentation. Women with irregular cycles are at risk for endometrial hyperplasia and
carcinoma. Regular OCPs or progesterone withdrawal is advised, and endometrial biopsy is
considered for abnormal uterine bleeding or thickened endometrium. Machine learning
algorithms have been used to predict polycystic ovary syndrome (PCOS) using new genes
and developing automated systems for doctors. Studies have explored the disorder's
diagnostic criteria, including insulin, gonadotropin, and folliculogenesis abnormalities. The
metabolic syndrome and PCOS overlap, and a system for early detection and prediction of
PCOS has been proposed. The diagnosis of PCOS depends on biological, clinical, and
morphological criteria. Excess follicles have become the primary criterion of polycystic
ovarian morphology, but fluctuations in ovarian volume or area may also be accurate
markers.

Chapter 3
Research Methodology

This is a two-class classification project using machine learning methods:

[Figure: Patient’s medical predictor variables → Classifier model → PCOS / No PCOS]

The working process of our application is as follows: we collected data from respondents
using a Google Form, and all experiments were performed on this data. We have used the
Support Vector Machine learning algorithm. Finally, we conclude where this research can be
used in the management arena.

Health related independent variables


1. Irregular periods or no periods
2. Uneven hair growth like on face, butt, or chest
3. Hair fall and thinning of hair
4. Body Mass Index
5. Pain in Pelvic area

Health related dependent variables

Polycystic Ovary Syndrome (PCOS)

3.1 Data Collection
We prepared a Google Form containing separate questions.
Total number of observations: 57
The survey contains around 18 questions, and the answers are categorical,
i.e. Yes and No, which are later transformed into 1 and 0 for the analysis work.

3.2 Data Cleaning

The collected data had no missing values and no duplicates, because all women have
different body types and lifestyles.
The responses are stored in a Google Sheet, which holds the whole data of the survey.

The data was then downloaded to Microsoft Excel.

Columns of no use for the analysis, such as the Timestamp column and the Email
Address column, were then removed.
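
The cleaning steps described above can be sketched in pandas. This is a minimal illustration, not the notebook's actual code: apart from Timestamp and Email Address, the column names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of survey responses; only "Timestamp" and
# "Email Address" are column names taken from the report.
responses = pd.DataFrame({
    "Timestamp": ["2024-04-01 10:00", "2024-04-01 10:05", "2024-04-02 09:30"],
    "Email Address": ["a@example.com", "b@example.com", "c@example.com"],
    "Irregular periods": ["Yes", "No", "Yes"],
    "PCOS": ["Yes", "No", "No"],
})

# Count missing values and fully duplicated rows.
missing_count = int(responses.isnull().sum().sum())
duplicate_count = int(responses.duplicated().sum())

# Drop the columns that carry no information for the analysis.
cleaned = responses.drop(columns=["Timestamp", "Email Address"])

print(missing_count, duplicate_count, list(cleaned.columns))
```

Checking for duplicates before dropping the identifying columns matters: once Timestamp and Email Address are gone, two different respondents with identical answers would look like duplicate rows.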

3.3 Data Transformation

8.1 The collected data is in the form of Yes and No.
Dataset Before Transformation

We converted the categorical Yes and No values as follows:
1 for Yes
0 for No
Assigning 0 and 1 values to the dataset makes the analysis easier.
Dataset After Transformation
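
One possible way to perform this Yes/No conversion in pandas is sketched below. The column names and rows are hypothetical, and the report does not say whether the conversion was done in Excel or in Python; columns that are not Yes/No (for example age or BMI) would need to be left out of the mapping.

```python
import pandas as pd

# Hypothetical Yes/No answers; the column names are illustrative only.
survey = pd.DataFrame({
    "Irregular periods": ["Yes", "No", "Yes"],
    "Hair fall": ["No", "No", "Yes"],
    "PCOS": ["Yes", "No", "Yes"],
})

# Map every Yes to 1 and every No to 0, column by column.
yes_no_to_int = {"Yes": 1, "No": 0}
encoded = survey.apply(lambda column: column.map(yes_no_to_int))

print(encoded)
```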

8.2 Reading the data in Jupyter Notebook

8.3 Dimensions of the dataset
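
Reading the data and checking its dimensions might look like the sketch below. The report does not give the file name, so an in-memory CSV with made-up rows stands in for the exported survey file; in a notebook, a file path would be passed to `pd.read_csv` (or `pd.read_excel`) instead.

```python
import io

import pandas as pd

# Stand-in for the exported survey file (hypothetical columns and rows).
csv_text = io.StringIO(
    "Irregular periods,Hair fall,PCOS\n"
    "1,0,1\n"
    "0,0,0\n"
    "1,1,1\n"
)

data = pd.read_csv(csv_text)

# shape is a (rows, columns) tuple.
n_rows, n_cols = data.shape
print(n_rows, n_cols)
```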

Chapter 4

4.1 Data Analysis


Statistical modelling refers to the process of using statistical techniques to describe and
quantify relationships within data. It involves creating mathematical representations,
or models, that capture the underlying patterns or structures in the data. Statistical
models are used for making predictions, understanding the relationships between
variables, and drawing inferences about the population from which the data was
sampled.
Common types of statistical models include linear regression, logistic regression,
decision trees, random forests, support vector machines, and many others. The choice
of model depends on the specific characteristics of the data and the goals of the
analysis. Statistical modelling is widely used in various fields such as economics,
biology, psychology, finance, and more to gain insights from data and make informed
decisions.
In our research, we have used the Support Vector Machine model.
For visual representation of the variables and the relationships between them,
we have used scatter plots.

4.1.1 About Support Vector Machine

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
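
The idea can also be tried on a tiny toy dataset. The sketch below is illustrative only (the four points are made up and are not the survey data): it fits a linear SVM with scikit-learn and inspects which points were chosen as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (illustrative only, not the survey data).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# The fitted model exposes the support vectors and, for a linear kernel,
# the hyperplane parameters w (coef_) and b (intercept_).
print("Support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])

# New points are assigned to the side of the hyperplane they fall on.
new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
predictions = clf.predict(new_points)
print(predictions)
```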

4.1.2 Support Vector Machine Terminology
1. Hyperplane: The hyperplane is the decision boundary used to separate the data
points of different classes in a feature space. In the case of linear classification, it is
given by a linear equation, i.e. w·x + b = 0.
2. Support Vectors: Support vectors are the data points closest to the hyperplane, which
play a critical role in deciding the hyperplane and margin.
3. Margin: The margin is the distance between the support vectors and the hyperplane. The
main objective of the support vector machine algorithm is to maximize the margin. A
wider margin indicates better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the
original input data points into high-dimensional feature spaces, so, that the hyperplane
can be easily found out even if the data points are not linearly separable in the original
input space. Some of the common kernel functions are linear, polynomial, radial basis
function (RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories without any
misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM
permits a soft margin technique. Each data point has a slack variable introduced by
the soft-margin SVM formulation, which softens the strict margin requirement and
permits certain misclassifications or violations. It discovers a compromise between
increasing the margin and reducing violations.
7. C: Margin maximisation and misclassification penalties are balanced by the regularisation
parameter C in SVM. It decides the penalty for violating the margin or misclassifying data
items. A stricter penalty is imposed with a greater value of C, which
results in a smaller margin and perhaps fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently
formed by combining it with the regularisation term.
9. Dual Problem: The SVM optimisation problem can be solved through its dual problem,
which requires finding the Lagrange multipliers associated with the support vectors. The
dual formulation enables the use of the kernel trick and more efficient computation.
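
For reference, the soft-margin objective that items 6 to 8 describe is usually written as below. This is the standard textbook formulation rather than anything specific to this project; it assumes n training points x_i with labels y_i in {-1, +1} and slack variables ξ_i, which are notation introduced here.

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

At the optimum each slack variable equals the hinge loss of its point, ξ_i = max(0, 1 − y_i (w·x_i + b)), which is how the hinge loss and the regularisation parameter C combine into a single objective.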

4.1.3 SVM model on our dataset

➢ Importing libraries in the Jupyter notebook

Here is a breakdown of what each import does:

1. pandas (pd):
Provides data structures and data analysis tools. Useful for loading data from various file
formats (CSV, Excel, etc.), manipulating data frames, and performing exploratory data
analysis.
2. NumPy (np):
Offers efficient numerical computing functionalities. Used for creating and working with
numerical arrays, performing mathematical operations, and linear algebra tasks.
3. matplotlib.pyplot (plt):
A library for creating static, animated, and interactive visualizations. Used for generating plots
like scatter plots, histograms, bar charts, etc., to explore and understand data visually.
4. sklearn.svm (SVC):
Part of the scikit-learn library for machine learning. Specifically imports the SVC class used
for Support Vector Classification (SVC), a popular algorithm for classification tasks.
5. sklearn.model_selection (train_test_split):
Provides utilities for splitting data into training and testing sets. The train_test_split function is
commonly used to create training and testing sets from your data for model evaluation.
6. sklearn.preprocessing (StandardScaler):
Offers tools for data preprocessing, including scaling features. The StandardScaler class is
used to standardize features by removing the mean and scaling to unit variance. This can be
helpful for improving the performance of machine learning models.
7. sklearn.metrics (accuracy_score, classification_report):
Contains functions for evaluating the performance of machine learning models.
accuracy_score calculates the accuracy of a classification model. classification_report
provides a detailed report on the model's performance, including precision, recall, F1-score,
and support for each class.
8. sklearn.metrics (metrics):
This import might be redundant as accuracy_score and classification_report are already
imported from sklearn.metrics. You can remove this line if you are only using those two
functions.
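
The import cell itself is not reproduced in this text. Based on the breakdown above, it probably resembled the following sketch; the exact order and form of the original lines are an assumption.

```python
# Data handling and numerical computing
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt

# Model, data splitting, feature scaling and evaluation from scikit-learn
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Possibly redundant, as noted in item 8 above
from sklearn import metrics
```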

➢ Splitting our dataset into a test set and a training set

This line splits your data into four variables:

• X_Train: This will contain the training data for your features (independent
variables).
• X_Test: This will contain the testing data for your features.
• Y_Train: This will contain the training data for your target variable (dependent
variable).
• Y_Test: This will contain the testing data for your target variable.
Parameters:

• X: This represents your entire feature data (independent variables).


• Y: This represents your entire target variable data (dependent variable).
• test_size: This parameter specifies the proportion of data to be included in the
testing set. In this case, 0.25 means 25% of the data will be allocated to the
testing set, and the remaining 75% will be used for training. You can adjust this
value based on your needs.
• random_state: This parameter sets a seed for the random number generator,
ensuring reproducibility if you run the code multiple times. Setting it to a specific
value (like 0 here) will guarantee the same split each time. For different random
splits in future runs, you can set random_state to a different value or remove it
altogether.
By using train_test_split, you can ensure that your machine learning model is trained
and evaluated on separate sets of data, leading to more reliable performance
estimates.
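
The split described above can be sketched as follows. The arrays X and Y here are randomly generated stand-ins (an assumption for illustration, not the PCOS dataset); only the train_test_split call with test_size=0.25 and random_state=0 follows the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 48 samples, 5 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(48, 5))
Y = rng.integers(0, 2, size=48)

# 25% of the rows go to the test set; random_state fixes the shuffle.
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

print("Training set shape:", X_Train.shape)
print("Testing set shape:", X_Test.shape)
```

Running the same call again with the same random_state returns an identical split, which is what makes the reported accuracy reproducible.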

➢ Accuracy of Linear SVM Model

• The SVC object is created with the kernel type set to 'linear', which gives a linear SVM
model. This model finds the hyperplane that separates the data points belonging to different
classes with the maximum margin.
• random_state=0 sets a seed for the random number generator. In SVC this seed is used only
when probability estimates are enabled (probability=True); otherwise it is ignored.
• X_Train contains the features (independent variables) from your training set, and
Y_Train contains the corresponding target variable (dependent variable) labels.
• X_Test contains the features from your testing set.
• The predict method of the classifier predicts the class labels (0 or 1 in this case) for
each data point in X_Test. These predictions are stored in Y_Pred.
• The performance of the model on the testing data is evaluated using the accuracy_score
function from scikit-learn's metrics module.
• Y_Test contains the actual class labels for the testing data.
• Y_Pred contains the predicted class labels by the model.
• accuracy_score calculates the accuracy, which is the proportion of correctly predicted
labels. The result is printed as "Accuracy:".
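
The steps listed above can be sketched as below. This is a minimal illustration under stated assumptions: the data comes from make_classification as a hypothetical stand-in for the PCOS dataset, and the scaling step is assumed from the StandardScaler import rather than confirmed by the report.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data: 48 samples, 5 features, 2 classes.
X, Y = make_classification(n_samples=48, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Assumed preprocessing: fit the scaler on the training set only.
scaler = StandardScaler()
X_Train = scaler.fit_transform(X_Train)
X_Test = scaler.transform(X_Test)

# Linear-kernel SVM, trained on the training set.
classifier = SVC(kernel="linear", random_state=0)
classifier.fit(X_Train, Y_Train)

# Predict the test labels and compare them with the true labels.
Y_Pred = classifier.predict(X_Test)
accuracy = accuracy_score(Y_Test, Y_Pred)
print("Accuracy:", accuracy)
```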
➢ Accuracy of RBF (Radial Basis Function) SVM Model

• The line creates an SVC object and specifies the kernel type as 'rbf' (Radial Basis
Function). This creates an RBF SVM model.
• Custom values are set for the following parameters:
• gamma (gamma=15): This parameter controls the influence of single training
examples. A higher gamma puts more emphasis on local variations, potentially
leading to overfitting. You can adjust this value based on your data and experiment to
find the optimal setting.
• C (C=7): This parameter controls the trade-off between fitting the training data and
keeping the model generalizable. A higher C allows for more complex decision
boundaries but might also lead to overfitting. Experiment with different C values to
find a good balance.
• random_state=0: Sets a seed for the random number generator. SVC uses this seed only when
probability estimates are enabled (probability=True).
• X_Train contains the features (independent variables) from your training set, and
Y_Train contains the corresponding target variable (dependent variable) labels.
• During training, the model learns the patterns and relationships between the features
and the target variable in the training data.
• X_Test contains the features from your testing set.
• The predict method predicts the class labels (0 or 1 in this case) for each data point in
X_Test. These predictions are stored in Y_Pred.
• It evaluates the model's performance on the testing data using accuracy.
• Y_Test contains the actual class labels for the testing data.
• Y_Pred contains the predicted class labels by the model.
• metrics.accuracy_score calculates the accuracy (proportion of correctly predicted
labels). The score is printed along with a message indicating it's for the custom RBF
kernel.
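
A minimal sketch of the RBF model with the custom values gamma=15 and C=7 is given below. The data is a hypothetical stand-in generated with make_classification, and the scaling step is an assumption. Comparing the training and testing accuracy is one simple way to check for the overfitting risk that a high gamma brings.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data: 48 samples, 5 features, 2 classes.
X, Y = make_classification(n_samples=48, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Assumed preprocessing: standardize using training-set statistics.
scaler = StandardScaler()
X_Train = scaler.fit_transform(X_Train)
X_Test = scaler.transform(X_Test)

# RBF-kernel SVM with the custom gamma and C values from the text.
classifier = SVC(kernel="rbf", gamma=15, C=7, random_state=0)
classifier.fit(X_Train, Y_Train)

Y_Pred = classifier.predict(X_Test)
train_accuracy = metrics.accuracy_score(Y_Train, classifier.predict(X_Train))
test_accuracy = metrics.accuracy_score(Y_Test, Y_Pred)

# A large gap between the two values suggests overfitting.
print("Training accuracy (custom RBF kernel):", train_accuracy)
print("Testing accuracy (custom RBF kernel):", test_accuracy)
```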
➢ Accuracy of Polynomial SVM Model

• It creates an SVC object and specifies the kernel type as 'poly' (polynomial). This
creates a polynomial SVM model.
• The degree of the polynomial kernel is set to 6. This means the decision boundary can depend
on products of features up to the sixth degree. A higher degree allows for more complex
decision boundaries but can also lead to overfitting if not chosen carefully.
• X_Train contains the features (independent variables) from your training set, and
Y_Train contains the corresponding target variable (dependent variable) labels.
• During training, the model learns the patterns and relationships between the features
and the target variable in the training data.

• X_Test contains the features from your testing set.
• The predict method predicts the class labels (0 or 1 in this case) for each data point in
X_Test. These predictions are stored in Y_Pred.
• Y_Test contains the actual class labels for the testing data.
• Y_Pred contains the predicted class labels by the model.
• metrics.accuracy_score calculates the accuracy (proportion of correctly predicted
labels). The score is printed along with a message indicating it's for the poly kernel
with degree 6.
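
The polynomial model can be sketched in the same way. The data is again a hypothetical stand-in and the scaling step is an assumption; only kernel='poly' and degree=6 come from the text. All other SVC parameters are left at their scikit-learn defaults.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data: 48 samples, 5 features, 2 classes.
X, Y = make_classification(n_samples=48, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Assumed preprocessing: standardize using training-set statistics.
scaler = StandardScaler()
X_Train = scaler.fit_transform(X_Train)
X_Test = scaler.transform(X_Test)

# Polynomial kernel: K(x, x') = (gamma * <x, x'> + coef0) ** degree.
classifier = SVC(kernel="poly", degree=6)
classifier.fit(X_Train, Y_Train)

Y_Pred = classifier.predict(X_Test)
poly_accuracy = metrics.accuracy_score(Y_Test, Y_Pred)
print("Accuracy (poly kernel, degree 6):", poly_accuracy)
```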
4.2 Result
After running the different types of SVM model, we compute the precision, recall, F1-score,
and support.

• The classification_report function is imported from scikit-learn's metrics module. This
function is used to generate a detailed report on the performance of a classification
model.
• Y_Test contains the actual class labels for the testing data.
• Y_Pred contains the predicted class labels for the testing data by your model.
• classification_report(Y_Test, Y_Pred) calculates various metrics that evaluate the
model's performance on a classification task. The resulting report is then printed using
print.
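
A minimal sketch of the classification_report call is shown below. The label arrays are small hypothetical examples, not the model's actual predictions. The output_dict=True option is an addition for illustration; it returns the same metrics as a dictionary so that individual values can be read programmatically.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels (not the report's actual results).
Y_Test = [0, 0, 1, 1, 1, 0, 1, 0]
Y_Pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Text report with precision, recall, F1-score, and support per class.
print(classification_report(Y_Test, Y_Pred))

# Same metrics as a nested dictionary, keyed by class label.
report_dict = classification_report(Y_Test, Y_Pred, output_dict=True)
print("Class 1 precision:", report_dict["1"]["precision"])
print("Overall accuracy:", report_dict["accuracy"])
```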
What the Report Contains:
The classification report typically includes the following metrics for each class:
• Precision: Proportion of positive predictions that were correct (out of all positive
predictions). Here, out of all data points predicted as class 1, 88% were actually class 1
(the remaining 12% were false positives).
• Recall: Proportion of actual positives that were correctly identified (out of all actual
positive cases). The model correctly identified all (100%) of the actual class 1
instances in the testing data (no misses).
• F1-score: Harmonic mean of precision and recall, combining both metrics into a
single score. The value of 0.93 indicates good performance for class 1.
• Support: Number of actual instances of the class in the testing data. There were 7
actual class 1 instances in the testing data.
By analysing the precision, recall, F1-score, and support for each class, you can gain
valuable insights into your model's performance.

• A high precision for a class indicates that most of the positive predictions for that
class were correct.
• A high recall for a class indicates that the model identified most of the positive cases
in the testing data.
• A balanced F1-score indicates a good balance between precision and recall.
Overall Performance:
• Accuracy: 0.92 - The model correctly predicted 92% of the data points in the testing
set.
• Macro Average: 0.94 (precision) & 0.90 (recall) - These are the unweighted means of precision
and recall across both classes, indicating a generally good balance.
• Weighted Average: 0.93 (precision) & 0.92 (recall) - These averages weight each class by its
support, so the class with more instances (class 1 in this case) counts for more.
Interpretation:
• This report suggests the model performs well overall, with an accuracy of 92%.
• It identifies all actual class 1 instances (100% recall) but misses some class
0 instances (80% recall).
• There is a slight class imbalance, with more class 1 instances (7) than class
0 (5) in the testing data. With only 12 test samples in total, a single misclassification
changes the accuracy by about 8 percentage points.
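
The reported figures can be cross-checked with a short worked calculation. The confusion matrix below is inferred from the supports (5 and 7) and recalls (80% and 100%) quoted above; it is not taken from the original output. The sketch computes each metric from its definition.

```python
import numpy as np

# Inferred confusion matrix: rows = actual class, columns = predicted class.
# Class 0: 4 correct, 1 predicted as class 1. Class 1: all 7 correct.
cm = np.array([[4, 1],
               [0, 7]])

true_positives = np.diag(cm)   # correct predictions per class
support = cm.sum(axis=1)       # actual instances per class
predicted = cm.sum(axis=0)     # predictions made per class

precision = true_positives / predicted
recall = true_positives / support
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

# Macro average: plain mean. Weighted average: mean weighted by support.
macro_precision = precision.mean()
macro_recall = recall.mean()
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)

print("Precision per class:", precision)
print("Recall per class:", recall)
print("F1-score per class:", f1)
print("Accuracy:", accuracy)
print("Macro precision / recall:", macro_precision, macro_recall)
print("Weighted precision / recall:", weighted_precision, weighted_recall)
```

After rounding to two decimals, these values agree with the figures in the report, which suggests that the model made one error on the test set: a class 0 sample predicted as class 1.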

Discussion

Now that the accuracy of the model is known, let us discuss its utility in various areas.
Machine learning (ML) algorithms have the potential to be a valuable tool in predicting
PCOS (Polycystic Ovary Syndrome) but they should not be solely relied upon for diagnosis.
Here are some potential uses of ML for PCOS prediction:
1. Aiding Early Detection:
• ML models can analyse data from a woman's medical history, blood tests, and
potentially ultrasound scans (if image analysis is incorporated) to identify patterns
associated with PCOS.
• Early detection can lead to earlier intervention and management of symptoms,
potentially improving long-term health outcomes.
• This can be particularly helpful for women with mild PCOS symptoms that might be
overlooked initially.
2. Risk Stratification:
• ML models might help stratify women based on their risk of developing PCOS. This
can be based on factors like family history, weight, and hormonal profiles.
• Risk stratification can guide further investigations or interventions for women at
higher risk.
3. Personalized Treatment Recommendations:
• By analysing a woman's individual data and medical history, ML models might assist
healthcare professionals in tailoring treatment plans for PCOS.
• This could involve recommendations for lifestyle changes, medications, or fertility
treatments, depending on the specific needs of the patient.
4. Clinical Trial Design:
• ML can be used to analyse large datasets and identify potential new therapeutic
targets for PCOS.
• This can inform the design of clinical trials to test the effectiveness of new
medications or treatment strategies.
5. HR Strategies for Employee Wellbeing
An HR manager can use this model to find out which female employees in the company may be
suffering from this syndrome. The company can then offer them discounted gym memberships or
consultations with a dietician, which can make female employees feel good about the company
in which they are working.
For example: Deloitte provides a CULT Fit membership at a 25% discount, and its employees
make good use of this offer. This has helped many of its employees become fit and healthy.

6. Food Chain Entrepreneur
A food chain entrepreneur can use this research to learn which women near their outlet are
affected by these problems. This can help them design a menu that offers nutritious,
home-cooked food rich in fibre for those women. They can also develop healthy alternatives for
women who love to eat junk food. During premenstrual syndrome (PMS), high hormonal activity
causes many cravings, for example for cakes and muffins, but these are not recommended in
PCOS. This research can help small food chain entrepreneurs create bakery products that women
with PCOS can consume.
For example: The Cinnamon Kitchen, founded in 2018 by Priyasha Saluja, aims to redefine
guilt-free indulgence with its range of spreads, snacks, cookies, cakes, and breads. Priyasha
emphasised the focus on packaged goods, citing increased customer repeat percentage and
average order value. The journey of The Cinnamon Kitchen, fuelled by Priyasha's personal
battle with PCOS and a passion for healthy living, has been a remarkable one. She recently
participated in Shark Tank India Season 3 and secured a Rs 60 lakh investment in exchange for
a five per cent equity stake from Aman Gupta, co-founder of boAt.
7. School Nutritionist
In schools, girls in the 15-17 age group have a high chance of PCOS because of the stressful
environment created by their studies. For a school nutritionist, this research can be helpful
in identifying which girls may be suffering from PCOS. The school can hold physical exercise
classes more often and run workshops on how to deal with PCOS at an early stage. Also, if
the school provides a midday meal or has a canteen on the premises, it can offer healthy yet
tasty alternatives.
For example: Salwan Public School, Rajinder Nagar is a coeducational school. Its
nutritionist makes sure that the canteen serves healthy meals, which eases parents' worries:
even though their child is eating outside the home, the food is not junk food, as it is
prepared fresh on a daily basis. The school has a special sports activity in the
morning in which every student participates and devotes a full 45 minutes to
physical movement. This has really helped the girl students who are prone to PCOS.
There are many other areas where this research can be helpful. It is based on a very basic
symptom survey, covering symptoms that women generally ignore in their day-to-day lives because
of their busy lifestyles. If the problem is identified at an early stage, PCOS can be managed
much more easily just by making some essential lifestyle changes.

Conclusion
In conclusion, the development and implementation of a predictive model for PCOS risk
assessment represent a crucial stride towards proactive healthcare management. The
integration of advanced technologies, such as machine learning algorithms, enables us to
harness vast datasets and extract meaningful insights, empowering healthcare professionals to
identify women at higher risk of PCOS.
Traditionally, gynaecologists have evaluated whether a woman is suffering from PCOS with
the help of symptom-based questions and a full blood profile test. The menstrual cycle is
one of the most important factors they consider. Body mass index has also played an
important role in the diagnosis of PCOS. Other important factors that cannot be ignored are
pain in the pelvic area and uneven hair growth. Uneven hair growth signifies an increase in
hormones such as testosterone in the blood, which can be a main cause of PCOS.
This paper surveyed and studied the existing state of the art in order to predict and detect
PCOS. PCOS is not that harmful in itself but can be the cause of various other diseases if not
treated at an early stage. Various machine learning algorithms were applied, such as Logistic
Regression, Classification, KNN and SVM. In conclusion, the Support Vector Machine has the
potential to detect PCOS at an early symptomatic stage. It is clear that this model
improves the accuracy and precision of PCOS prediction.
This study provides a wide overview of the relative performance of different machine learning
algorithms for the prediction of PCOS. This information on relative performance can
be used by PCOS researchers in selecting an appropriate machine learning algorithm for
their studies. According to the research paper, the Support Vector Machine has better accuracy
and is applied frequently.
The predictive model not only serves as a valuable tool for early detection but also aids in the
formulation of personalized prevention and intervention strategies. By leveraging factors
such as lifestyle choices, and symptoms, the model can provide a nuanced and tailored
approach to risk assessment, allowing for more precise and targeted interventions.
Furthermore, the continuous refinement of predictive models through ongoing research and
updates ensures their adaptability to evolving medical knowledge and changing
demographics. As we strive for a more patient-centric and preventative healthcare paradigm,
these predictive models play a pivotal role in shifting the focus from reactive treatments to
proactive risk management.
However, it is essential to acknowledge the ethical considerations and potential biases
associated with predictive models. Ensuring transparency, fairness, and accountability in
model development and deployment is paramount to fostering trust among both healthcare
professionals and the individuals being assessed. Through a collaborative effort between
healthcare providers, researchers, and technology experts, we can harness the power of data-
driven insights to pave the way for a healthier and more informed future.

References
1. Polycystic ovary syndrome -
https://www.who.int/news-room/fact-sheets/detail/polycystic-ovary-syndrome
2. Early identification of PCOS with commonly known diseases: Obesity, diabetes, high
blood pressure and heart disease using machine learning techniques, Shivani
Aggarwal [2023]
https://www.sciencedirect.com/science/article/abs/pii/S0957417423000337
3. A Global Survey of Ethnic Indian Women Living with Polycystic Ovary Syndrome:
Co-Morbidities, Concerns, Diagnosis Experiences, Quality of Life, and Use of
Treatment Methods [2024] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9740300/
4. PCOS Forum: research in polycystic ovary syndrome today and tomorrow [2010] -
Pasquali - 2011 - Clinical Endocrinology - Wiley Online Library
