Risk Assessment of Pregnancy-Induced Hypertension Using A Machine Learning Approach
The 6th International Conference on Digital Arts, Media and Technology (DAMT) and
4th ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (NCON)
Abstract - This research aimed to develop a predictive model for the risk assessment of pregnancy-induced hypertension using a machine learning approach. Pregnancy-induced hypertension is a complication that has a serious impact on pregnant women and fetuses. It is among the world's top three causes of death among pregnant women [1]. The exact cause of pregnancy-induced hypertension is currently unknown, so the condition cannot be prevented. Early detection and treatment can reduce its severity and danger. A public dataset of Logan (2020) was used in this research [2]. The dataset was collected from a case-control study on the determinants of 83 pre-eclampsia and five eclampsia cases among 352 pregnant women delivering in county hospitals in Nairobi, Kenya. According to the dataset, 75 percent of the pregnant women were healthy and only 25 percent had pre-eclampsia or eclampsia. This results in an imbalanced classification problem, in which one of the two classes has more data than the other. The problem was resolved with the Synthetic Minority Over-sampling Technique (SMOTE). Risk assessment of pregnancy-induced hypertension was performed with seven machine learning algorithms: logistic regression (LR), K-nearest neighbor (KNN), decision tree (DT), random forest (RF), multilayer perceptron neural network (MLP), support vector machines (SVM), and naive Bayes (NB). In the experimental results, RF had the highest accuracy, at 89.62 percent, compared to the other machine learning algorithms.
Keywords - Pregnancy-induced hypertension, machine learning, imbalanced classification, synthetic minority over-sampling technique
I. Introduction
Pregnancy-induced hypertension is a complication that has a serious impact on pregnant women and fetuses [2]. It mostly occurs after the twentieth week of pregnancy. In addition to death, there is a risk of serious complications, such as premature placental separation, pulmonary edema, abnormal blood clotting, temporary or permanent blindness, and bleeding in the brain. The exact cause of pregnancy-induced hypertension is currently unknown, so the condition cannot be prevented. Early detection and treatment can reduce its severity and danger. Data from the World Health Organization (WHO) in 2010 showed that 99 percent of the pregnant women who died lived in the countryside and were impoverished [3]. Furthermore, in rural areas of Thailand, there is a shortage of doctors and medical personnel. Several agencies have established projects to improve the distribution of medical services to the countryside, but the problem persists. According to statistics for Thailand in 2017, the Medical Council found that one doctor must take care of an average of 1,143 patients per year [4-5]. Thus, the ultimate objective of this research was to develop a predictive model for the risk assessment of pregnancy-induced hypertension using a machine learning approach. The purpose was to support doctors in early patient assessment, treatment planning, and recovery care.
The rest of the paper is arranged as follows. First, a brief introduction is given. The related work is explained in Section II. Section III contains an overview of the proposed method. Experimental results are included in Section IV. Finally, in Section V, the conclusions are discussed.
II. Related Work
Several studies have used machine learning to create predictive models for the risk of pregnancy-induced hypertension, such as eclampsia and pre-eclampsia. Tahir et al. [6] proposed a prediction of the pre-eclampsia risk level in pregnant women during pregnancy using neural network (NN) and deep learning (DL) algorithms. The number of attributes was reduced from 17 to nine using particle swarm optimization. DL provided the highest accuracy, at 95.68 percent. Neocleous et al. [7] developed an early-stage indicator for the risk of pre-eclampsia. The dataset was composed of 6,838 cases of pregnant women in the UK and involved 24 variables. It contained only 116 cases of pregnant women with pre-eclampsia. A multi-slab neural network provided the most accurate result, at 93.8 percent. Moreover, Tahir et al. [8] proposed the detection of pre-eclampsia by using a neural network compared with other algorithms. The pre-eclampsia dataset was taken from the Haji General Hospital Surabaya, Indonesia. The neural network algorithm and leave-one-out (LOO) validation provided the most accuracy
with 96.66 percent. Likewise, Leemaqz et al. [9] presented a tiered pre-eclampsia predictive model focusing on the convergence of multiple models. They found that Bayes' theorem could be used to integrate multiple models. The integrated model provided the highest accuracy, where 81 percent were accurately identified at 20 weeks of gestation.
In addition to the above research, machine learning has been used to create predictive models for the risk of other diseases. Khalilia et al. [10] proposed the random forest (RF) for predicting the risk of eight chronic diseases by using the Nationwide Inpatient Sample database, drawn from samples of hospitals in the United States. An ensemble learning approach was used to solve the problem of imbalanced data. The testing accuracy was 88.79 percent. In addition, Pattekari and Parveen [11] proposed a prediction system for heart disease. They used data mining techniques consisting of DT, naive Bayes (NB), and NN. The system could intelligently answer complex questions in diagnosing heart disease and help medical practitioners. It further assisted in enhancing treatment and reducing its costs. Moreover, Akhil Jabbar et al. [12] proposed combining K-nearest neighbor (KNN) and genetic algorithms to analyze heart disease using six datasets from the UCI Machine Learning Repository and one dataset from various hospitals in Andhra Pradesh, India. To enhance accuracy, KNN was used to collect all cases and classify new patients, with a genetic algorithm used to calculate the similarities. The results showed that using both methods together gave an accuracy of 95.73 percent. Nayeem et al. [13] proposed a multilayer perceptron neural network (MLP) to predict heart disease, liver disease, and lung cancer, using a feed-forward backpropagation neural network algorithm and MLP to distinguish between infected and uninfected individuals. They used the MIT-BIH Arrhythmia dataset, and the results showed that the prediction accuracy was 82 percent for heart disease, 82 percent for liver disease, and 91 percent for lung cancer.
III. The Proposed Method
This section describes the development of the risk prediction model for pregnancy-induced hypertension, as shown in Figure 1. The proposed method consisted of five processes: (1) pregnancy data, (2) imbalanced data, (3) data preparation, (4) feature selection, and (5) evaluation of machine learning algorithms.
A. Pregnancy data
The researcher explored the acquired public dataset relating to the risk of pregnancy-induced hypertension. The public dataset of Logan (2020) was used in this research [2]. The dataset was collected from a case-control study on the determinants of 83 pre-eclampsia and five eclampsia cases among 352 pregnant women delivering in county hospitals in Nairobi, Kenya. The researcher selected 17 attributes that matched the attributes of the maternal and child health handbook in Thailand, which is used to record mother and child health data [14]. The 17 attributes comprised maternal age, age of the first pregnancy, diabetes family history, diabetes personal history, hypertension family history, hypertension personal history, first antenatal care visit, number of antenatal care visits, distance of pregnancy, cesarean section delivery, multifetal pregnancy, gravidity, parity, province, residence, alcohol use, and tobacco use. The researcher then converted the data into a form suitable for the imbalanced data process.
Fig. 2. (a) Imbalanced data. (b) Balanced data.
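As a quick arithmetic check of the class proportions implied by the counts in Section III-A, the short sketch below derives the minority and majority shares. It assumes that the 352 women include the 83 pre-eclampsia and five eclampsia cases, so that the remainder are healthy controls; the variable names are illustrative only.

```python
# Counts quoted from the dataset description in Section III-A.
total_women = 352
pre_eclampsia_cases = 83
eclampsia_cases = 5

# Assumption: the 352 women include the cases, so the rest are healthy controls.
cases = pre_eclampsia_cases + eclampsia_cases  # minority class
controls = total_women - cases                 # majority class

case_share = 100 * cases / total_women
control_share = 100 * controls / total_women
imbalance_ratio = controls / cases

print(f"cases: {cases} ({case_share:.0f}%)")
print(f"controls: {controls} ({control_share:.0f}%)")
print(f"majority-to-minority ratio: {imbalance_ratio:.1f}:1")
```

Under that assumption the counts agree with the 75 percent and 25 percent figures quoted in the text.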
B. Imbalanced data
In the pregnancy data [2], 75 percent of pregnant women were healthy and only 25 percent had pre-eclampsia or eclampsia. An imbalanced classification problem arises when one of the two classes has more data than the other, which degrades the classification of the minority class. Typical classifiers are effective when each class has a similar number of records; otherwise, the predictions are inclined toward the majority class.
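To illustrate why a bias toward the majority class is a problem, the sketch below scores a trivial classifier that always predicts the most frequent label. The label vector is hypothetical and only reproduces the 75/25 split described above; the helper names are assumptions, not part of the paper's code.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return the most frequent label, i.e. what a trivial classifier always predicts."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive records that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical labels with the 75/25 split: 0 = healthy, 1 = pre-eclampsia or eclampsia.
y_true = [0] * 264 + [1] * 88
constant_label = majority_class_baseline(y_true)
y_pred = [constant_label] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
print(f"accuracy: {baseline_accuracy:.2%}, minority recall: {minority_recall:.2%}")
```

The baseline reaches 75 percent accuracy without detecting a single minority case, which is why accuracy alone can look acceptable on imbalanced data.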
Fig. 1. Proposed method development processes (panel labels include D. Feature selection and E. Evaluate machine learning algorithms).
To compare the performance of the imbalanced-data and balanced-data approaches, the researcher applied the Synthetic Minority Over-sampling Technique (SMOTE) algorithm to the dataset [15]. The SMOTE works by utilizing
the KNN algorithm to construct synthetic data. SMOTE begins by selecting a random record from the minority class and finding its K nearest neighbors within that class. The selected record and one randomly chosen neighbor are then combined to construct a synthetic record. The process is repeated until the minority class is the same size as the majority class.
The researcher plotted the class distribution to show the imbalanced data in the selected dataset, as shown in Figure 2(a). Figure 2(b) shows the data after balancing with SMOTE.
C. Data preparation
In the data preparation process, the researcher used data transformation to convert the data so that it better reveals the structure of the classification problem for the risk of pregnancy-induced hypertension. Python and the scikit-learn library were used to develop the predictive model in this research. The 17 attributes were rescaled using three preprocessing methods: MinMaxScaler, StandardScaler, and Normalizer.
D. Feature selection
Feature selection for machine learning is the process of choosing relevant features so that only the features necessary to develop a predictive model are retained. Principal component analysis (PCA) is a data reduction technique that uses linear algebra to reduce the dimensionality of a dataset by compressing its attributes while preserving most of the information in the dataset. PCA was used to extract three principal components from the 17 attributes of both the imbalanced data and the balanced data.
E. Evaluate machine learning algorithms
... took a 77:33 ratio, keeping 77 percent of the dataset for training and the remaining 33 percent for testing. The researcher trained a machine learning algorithm on the first part and then evaluated the predictions on the test set against the expected results.
IV. Experimental Results
To obtain the predictive model for the risk assessment of pregnancy-induced hypertension, there were three main steps. First, the three preprocessing methods (MinMaxScaler, StandardScaler, and Normalizer) were applied to the imbalanced data and the balanced data using scikit-learn. Next, PCA was used to extract three principal components from the imbalanced data and the balanced data. Finally, the three principal components were used as input to seven machine learning algorithms: logistic regression (LR), K-nearest neighbor (KNN), decision tree (DT), random forest (RF), multilayer perceptron neural network (MLP), support vector machines (SVM), and naive Bayes (NB). The evaluation results are shown in Tables I-VI.
TABLE I. Results of the Imbalanced Data with StandardScaler.
Machine Learning          Feature Selection: PCA (accuracy, %)
Logistic Regression       69.01
K-Nearest Neighbor        64.79
Decision Tree             61.97
Random Forests            70.42
Multilayer Perceptron     71.83
Naive Bayes               28.17
Support Vector Machines   70.42
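Tables I-VI differ in the rescaling method applied before PCA. As a dependency-free illustration of what the three methods named in Section III-C compute, the sketch below re-implements them in plain Python. The formulas follow the default behavior of scikit-learn's MinMaxScaler (column-wise scaling to [0, 1]), StandardScaler (column-wise zero mean and unit variance), and Normalizer (row-wise L2 normalization); the function names and the sample records are hypothetical and are not taken from the paper.

```python
import math


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # A constant column has zero span; divide by 1.0 instead to avoid division by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def standard_scale(rows):
    """Shift every column to zero mean and scale it to unit (population) variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def l2_normalize(rows):
    """Scale every row (record) to unit Euclidean length."""
    result = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        result.append([v / norm for v in row])
    return result


# Hypothetical records: [maternal age, number of antenatal care visits].
records = [[20.0, 2.0], [30.0, 4.0], [40.0, 6.0]]
scaled_min_max = min_max_scale(records)
scaled_standard = standard_scale(records)
scaled_l2 = l2_normalize(records)
print(scaled_min_max)
print(scaled_standard)
print(scaled_l2)
```

Note that the first two methods operate on columns (attributes) while the third operates on rows (records), so they change the geometry seen by distance-based classifiers such as KNN in different ways.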
TABLE II. Results of the Imbalanced Data with MinMaxScaler.
Machine Learning          Feature Selection: PCA (accuracy, %)
Logistic Regression       71.83
K-Nearest Neighbor        67.61
Decision Tree             60.56
Random Forests            71.83
Multilayer Perceptron     71.83
Naive Bayes               28.17
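The balanced-data results depend on the SMOTE step described in Section III-B. The following is a minimal sketch of that procedure in plain Python, written from the description in this paper rather than from the implementation the authors ran, which is not stated. The function name `smote`, its parameters, and the toy data are assumptions for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic records by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority records")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Step 1: pick a random record from the minority class.
        base = rng.choice(minority)
        # Step 2: find its k nearest minority neighbors (excluding the record itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Step 3: pick one neighbor and place a new record on the segment between them.
        neighbor = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy data: six majority records and three minority records.
majority = [[float(i), 0.0] for i in range(6)]
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]

# Generate enough synthetic records for the minority class to match the majority class.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic record lies on a line segment between two existing minority records, oversampling adds no information from outside the region the minority class already occupies.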
TABLE III. Results of the Imbalanced Data with Normalizer.
Machine Learning          Feature Selection: PCA (accuracy, %)
Logistic Regression       71.83
K-Nearest Neighbor        73.24
Decision Tree             59.15
Random Forests            67.61
Multilayer Perceptron     71.83
Naive Bayes               28.17
Support Vector Machines   71.83
TABLE IV. Results of the Balanced Data with StandardScaler.
Machine Learning          Feature Selection: PCA (accuracy, %)
Logistic Regression       69.81
K-Nearest Neighbor        72.64
Decision Tree             82.08
Random Forests            89.62
Multilayer Perceptron     45.28
Naive Bayes               48.11
Support Vector Machines   78.30
TABLE V. Results of the Balanced Data with MinMaxScaler.
Machine Learning          Feature Selection: PCA (accuracy, %)
Logistic Regression       66.04
K-Nearest Neighbor        68.87
Decision Tree             79.25
Random Forests            85.85
Multilayer Perceptron     45.28
Naive Bayes               50.00
Support Vector Machines   59.43
The results can be seen in Table V. The highest accuracy rate was 85.85 percent, which belonged to the RF algorithm.
TABLE VI. Results of the Balanced Data with Normalizer.
Machine Learning          Feature Selection: PCA (accuracy, %)
Logistic Regression       47.17
K-Nearest Neighbor        69.81
Decision Tree             73.58
Random Forests            85.85
Multilayer Perceptron     45.28
Naive Bayes               48.11
Support Vector Machines   45.28
As shown in Table VI, the highest accuracy rate was 85.85 percent, which belonged to the RF algorithm.
According to the experiments conducted with the dataset, the RF algorithm yielded the best result on the balanced data. The highest accuracy rate was 89.62 percent.
V. Conclusion
This paper presented the risk assessment of pregnancy-induced hypertension using a machine learning approach. A public dataset of Logan (2020) was used in this research. To deal with the imbalanced data, the researcher applied the SMOTE algorithm to the dataset. In the data preparation process, the researcher used data transformation, which consisted of MinMaxScaler, StandardScaler, and Normalizer, to convert the data so that it better reveals the structure of the prediction problem. PCA was used to extract three principal components from the imbalanced data and the balanced data. In the experimental results, the RF algorithm achieved the best performance compared to the other machine learning algorithms. The prediction model yielded an accuracy rate of up to 89.62 percent on the balanced data. In future work, the predictive model should be applied to the pregnancy-induced hypertension dataset from the Chaophraya Abhaibhubejhr Hospital, Prachin Buri.
Acknowledgment
This research was supported by the Department of Computer Engineering, Faculty of Engineering, Mahidol University. This research was also supported by the National Research Council of Thailand for the project “Mom’s buddy: AI chatbot for pregnancy health information”. Furthermore, this research was supported by Chaophraya Abhaibhubejhr Hospital on medical knowledge.
References
[1] Thanomrat Prasith-thimet and Kasem Wetsutthanon, “Causes of maternal deaths in Regional Health 4 during Fiscal Year 2014-2016,” Journal of Health Science, vol. 26, p. 5, 2017.
[2] Logan Gorbee, “Replication Data for: Determinants of preeclampsia and eclampsia among women delivering in county hospitals in Nairobi, Kenya,” Harvard Dataverse, version 1.0, 2020. [Online]. Available: http://www.doi.org/10.7910/DVN/BYFL3J. [Accessed: Nov. 18, 2020].
[3] WHO, UNICEF, UNFPA, and The World Bank, “Trends in maternal mortality: 1990 to 2010,” 2010. [Online]. Available: https://www.who.int/reproductivehealth/publications/monitoring/9789241503631/en/. [Accessed: Jan. 1, 2019].
[4] The Medical Council of Thailand, “Medical statistics,” 2017. [Online]. Available: http://www.tmc.or.th/pdf/01_stat_med2560.pdf. [Accessed: Jan. 1, 2019].
[5] Official Statistics Registration Systems, “The population in each age, Nationwide,” 2017. [Online]. Available: http://stat.dopa.go.th/stat/statnew/upstat_age_disp.php. [Accessed: Jan. 1, 2019].
[6] Muhlis Tahir, Tessy Badriyah, and Iwan Syarif, “Classification Algorithms of Maternal Risk Detection For Preeclampsia With Hypertension During Pregnancy Using Particle Swarm Optimization,” EMITTER International Journal of Engineering Technology, vol. 6, pp. 236-250, 2018.
[7] Costas K. Neocleous, Panagiotis Anastasopoulos, Kypros H. Nikolaides, Christos N. Schizas, and Kleanthis C. Neokleous, “Neural networks to estimate the risk for preeclampsia occurrence,” in Proc. IEEE International Joint Conference on Neural Networks, 2009, pp. 2221-2224.
[8] Muhlis Tahir, Tessy Badriyah, and Iwan Syarif, “Neural Networks Algorithm to Inquire Previous Preeclampsia Factors in Women with Chronic Hypertension During Pregnancy in Childbirth Process,” in Proc. IEEE International Electronics Symposium on Knowledge Creation and Intelligent Computing (IES-KCIC), 2018, pp. 51-55.
[9] S. Y. Leemaqz, G. A. Dekker, and C. T. Roberts, “Tiered Prediction System for Preeclampsia: an integrative application of multiple models,” in Proc. 20th International Congress on Modelling and Simulation (MODSIM), 2013, pp. 2041-2045.
[10] Mohammed Khalilia, Sounak Chakraborty, and Mihail Popescu, “Predicting disease risks from highly imbalanced data using random forest,” BMC Medical Informatics and Decision Making, vol. 11, pp. 1-13, 2011.
[11] Shadab Adam Pattekari and Asma Parveen, “Prediction system for heart disease using naive Bayes,” Biomedical Research, vol. 29, pp. 2646-2649, 2018.
[12] M. Akhil Jabbar, B. L. Deekshatulu, and Priti Chandra, “Classification of Heart Disease Using K-Nearest Neighbor and Genetic Algorithm,” International Conference on Computational Intelligence: Modeling Techniques and Applications, CIMTA, Kalyani, Kolkata, India, September 27, 2013, pp. 85-94.
[13] Md. Osman Goni Nayeem, Maung Ning Wan, and Md. Kamrul Hasan, “Prediction of Disease Level Using Multilayer Perceptron of Artificial Neural Network for Patient Monitoring,” International Journal of Soft Computing and Engineering (IJSCE), vol. 5, pp. 17-23, September 2015.
[14] Department of Health and National Health Security Office (NHSO) Thailand, “Mother and child health records or pink notebooks of general hospitals in Thailand,” 2018. [Online]. Available: http://www.oic.go.th/FILEWEB/CABINF0CENTER17/DRAWER002/GENERAL/DATA0001/00001375.PDF. [Accessed: Jan. 1, 2019].
[15] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.