Cervical Cancer Prediction Through Different Screening Methods Using Data Mining

Talha Mahboob Alam1, Muhammad Milhan Afzal Khan2, Muhammad Atif Iqbal3, Abdul Wahab4, Mubbashar Mushtaq5
Computer Science and Engineering Department, University of Engineering and Technology Lahore, Pakistan1,2,3,5
School of Systems and Technology, University of Management and Technology Lahore, Pakistan4
Abstract—Cervical cancer remains an important reason of occurrence is abundant in low and middle income countries
deaths worldwide because effective access to cervical screening [9]. The important task of cervical cancer is screening. An
ev
methods is a big challenge. Data mining techniques including ideal screening test is the one that is least incursive, easy to
decision tree algorithms are used in biomedical research for achieve, acceptable to subject, cheap and effective in
predictive analysis. The imbalanced dataset was obtained from diagnosing the disease process in its early incursive stage
the dataset archive belongs to the University of California, when the treatment is easy for ailment. There are four
Irvine. Synthetic Minority Oversampling Technique (SMOTE) screening methods including cervical cytology also called Pap
r
has been used to balance the dataset in which the number of smear test, biopsy, Schiller and Hinslemann [10]. Cytology
instances has increased. The dataset consists of patient age,
screening method is a microscopic analysis of cells scratched
number of pregnancies, contraceptives usage, smoking patterns
and chronological records of sexually transmitted diseases
from the cervix and is used to detect cancerous or pre-
(STDs). Microsoft azure machine learning tool was used for
simulation of results. This paper mainly focuses on cervical
cancer prediction through different screening methods using
data mining techniques like Boosted decision tree, decision forest
er
cancerous conditions of the cervix [11]. Biopsy method is a
surgical process which includes finding of a living tissue
sample for performing diagnosis [12]. The solution of iodine
has applied for visual inspection of cervix known as
Hinslemann test. Lugol's iodine is used for visual inspection
pe
and decision jungle algorithms as well performance evaluation
has done on the basis of AUROC (Area under Receiver operating of cervix after smearing Lugol's iodine detection rate of
characteristic) curve, accuracy, specificity and sensitivity. 10-fold doubtful region over the cervix, this is also known as Schiller
cross-validation method was utilized to authenticate the results test [13].
and Boosted decision tree has given the best results. Boosted
decision tree provided very high prediction with 0.978 on The size of data is increasing gradually. Expansive,
AUROC curve while Hinslemann screening method has used. complex and useful datasets have now expanded in all the
The results obtained by other classifiers were significantly worse different fields of science, business and especially in
ot
than boosted decision tree. healthcare domain. With these larger data sets, the capacity to
mine beneficial hidden knowledge in these huge volume of
Keywords—Boosted decision tree; cervical cancer; data mining; data is gradually significant in today’s economical world. The
dcision trees; decision forest; decision jungle; screening methods method of applying novel techniques for discovering
tn
control [1]. Each year around 8.2 million people die from mathematics and other domains, it is now possible to extract
cancer which is 13% of total deaths worldwide. In 2017, only the meaningful information from raw data. Data mining is
26% of under developing countries reported having screening helpful where large collections of healthcare data are available
services available for public. In 90% developed countries [15]. Several data mining techniques like support vector
treatment services are available compared to less than 26% of machine (SVM), kernel learning methods as well as clustering
techniques were used in healthcare [16]. With the rise of
ep
uterus from the vagina where cervical cancer occurs [4]. measures have proved to be ineffective because the number of
Sexually transmitted human papillomavirus (HPV) is the parameters for screening of cervical cancer are still debatable
important cause of cervical cancer [5-8]. Cervical cancer [4, 8, 10]. The methods and techniques have been used for
388 | P a g e
www.ijacsa.thesai.org
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=3474371
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 10, No. 2, 2019
screening of cervical cancer are limited to a small number of parameters. The available literature for screening of cervical cancer explores mainly the Papanicolaou (Pap) smear test [17], hormonal status, FIGO stage [18] and cervical intraepithelial neoplasia (CIN) [19], but only a single parameter was used for screening prediction of cervical cancer. The available data mining techniques using a large number of parameters [20-23] did not give effective results. A comparison of studies for screening prediction of cervical cancer, along with their approaches, is presented in Table 1. Effective results were not found in screening prediction of cervical cancer while using a huge number of parameters with the help of data mining techniques. As the current techniques are not sufficient, it is necessary to explore all parameters or symptoms for screening prediction of cervical cancer. Decision tree methods have been used to predict cervical cancer, but the demographic and medical attributes were different in previous studies. The aim of this study was to predict cervical cancer based on demographic information, tumor-related parameters, sexually transmitted disease (STD) related parameters and important medical records.
TABLE I. COMPARISON OF STUDIES FOR SCREENING PREDICTION OF CERVICAL CANCER

Reference | Data Set Repository | Attributes | Instances | Technique | Results
[20] | Universitario de Caracas Hospital patients | 28 | 858 | Hybrid method using deep learning | AUC = 0.6875
[24] | NCBI | 61 | 160 | CART algorithm | Accuracy = 83.87%
[25] | Chung Shan Medical University Hospital Tumor Registry | 38 | 75 | SVM | —
[18] | State Hospital in Rzeszow | 10 | 107 | GEP / MLP / PNN / RBFNN | AUROC = 0.72 / 0.67 / 0.56 / 0.48
— | — | — | — | PNN / MLP / GEP / RBFNN / k-Means | AUROC = 0.818 / 0.659 / 0.651 / 0.640 / 0.406
II. RELATED WORK

Kelwin Fernandes et al. [20] presented an automated method for predicting the outcome of a patient's biopsy for the diagnosis of cervical cancer using the medical history of patients. Their technique allows a joint and fully supervised optimization method for high-dimensional reduction and classification. They discovered certain medical results from the embedding spaces and confirmed them through the medical literature. R. Vidya and G. M. Nasira [24] predicted cervical cancer using random forest with K-means learning and implemented the techniques in the MATLAB tool. These experiments were performed with the help of the NCBI dataset to construct decision trees using classification methods. Yulia et al. [25] predicted cervical cancer using Pap smear test results. The Pap smear test results were divided into two categories: cancerous and non-cancerous patients. Three classification methods, Naïve Bayes, support vector machine and random forest, were used to compute the results, of which random forest gave the better results. Jimin Kahng et al. [21] predicted cervical cancer development using SVM. Weka was used to train and test the data set as well as to analyze relationships between attributes. Chang et al. [17] predicted the recurrence of cervical cancer in patients using MARS (Multivariate Adaptive Regression Splines) and the C5.0 algorithm. MARS powerfully estimated the relationship between a dependent variable and a set of descriptive variables in a pairwise regression. C5.0 used a greedy method in which a top-down approach was used to build the decision tree and then train the data with the help of significant attributes. Maciej Kusy et al. [18] presented neural networks to predict adverse events in cervical cancer patients. The MLP is a type of neural network where the input signal is fed forward through a number of layers; it contains an input layer, hidden layers and an output layer. The GEP classifier delivered efficient results in the prediction of adverse events in cervical cancer as compared to the other methods. Kelwin Fernandes et al. [26] used a transfer learning technique for cervical cancer screening. Their study consists of linear predictive models. Positive results were obtained in most experiments as compared to other methods.

III. METHODOLOGY

Our methodology consists of three main steps. The first step is data set selection. The second step includes preprocessing, in which the original data is prepared for classification. The last step contains building an effective classification-based model for prediction.

A. Dataset

A publicly available dataset has been utilized [28] in this research, obtained from the UCI repository. The dataset contains 858 patients and 36 attributes, which include patient age, number of pregnancies, contraceptive usage, smoking patterns and chronological records of sexually transmitted diseases (STDs).

B. Data Preprocessing

Data mining fundamentally depends on the quality of data. Raw data is generally vulnerable to noise, missing values, outliers and inconsistency, so it is vital for the selected data to be processed before being mined. Preprocessing the data is an essential step to enhance data efficiency; it is one of the most vital data mining steps, dealing with preparation and transformation of the dataset to make knowledge discovery more efficient. The following steps were used to preprocess the data in this study for the experiments.

Step 1: Ignore some instances and attributes with a high ratio of missing values, which makes the data consistent. This method is very effective because there were several instances and attributes with missing values in the dataset used. Some attributes in this dataset, like STDs: Time since first diagnosis and STDs: Time since last diagnosis, had more than 80% of their data missing, so these attributes were deleted. Two attributes, STDs: cervical condylomatosis and STDs: AIDS, had constant values, so these were also deleted.

Step 2: There were many attributes with missing values, like number of pregnancies, hormonal contraceptives etc., where missing values were denoted in the data as "?". These values were replaced with the median value of the respective class. The median value was computed as follows [29]: for sorted values x(1) ≤ … ≤ x(n),

Median = x((n+1)/2) if n is odd; Median = (x(n/2) + x(n/2+1)) / 2 if n is even.

Step 3: The other important task was outlier detection in the data. An outlier is a data object that deviates significantly from the rest of the objects. In this study, two attributes, age and number of partners, contain outliers. To solve this issue, lower and upper threshold limits were defined and these outliers were replaced with the median value.

Step 4: Normalization is a scaling technique of data preprocessing. There are several methods of normalization, i.e. Min-Max, Z-score and decimal scaling normalization [30]. Decimal scaling normalization was applied using the standard rule v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1. Numeric attributes like number of pregnancies, hormonal contraceptives etc. are scaled between [0-10], and Boolean attributes like smokes, HPV, STD etc. are scaled to [0,1].

Step 5: After data cleaning, the cervical cancer data set consists of 734 instances and 32 attributes. This data is imbalanced, because only 70 instances are cancerous and 663 are non-cancerous diagnosed patients. To overcome this problem of imbalanced data, the Synthetic Minority Oversampling Technique (SMOTE) has been used. This is a statistical method for increasing the number of instances in a dataset in a balanced way. The module works by producing new instances from the existing minority cases supplied as input. By using SMOTE, majority instances do not change. The new instances are not just copies of existing minority
classes, because the algorithm takes samples of the feature space for each target class and its nearest neighbors, generating new instances that combine the features of the target class. This method makes the samples more generic [32]. x_i is a minority class sample; its nearest neighbors are searched and one neighbor x_zi is randomly selected, then a random number δ in [0,1] is drawn. The new sample is created as:

x_new = x_i + δ · (x_zi − x_i)

SMOTE outperforms the random oversampling method because it also avoids the overfitting problem [33]. Using the SMOTE function, the total number of instances increased: after SMOTE, the minority class was oversampled from 70 to 563 instances.

C. Classification Models

A supervised method for classification is the decision tree, which is very popular because most biomedical data mining tasks have already used decision trees for efficient prediction [18]. Three decision tree methods were used in this study, as follows.

1) Boosted decision tree: The transformation of a weak classifier into a vigorous or strong classifier is the key role of boosting. A weak classifier is generally a poor-performance prediction model which leads to low accuracy due to a high misclassification rate. The boosted method works best when the majority votes of all weak learners for each prediction are combined in such a way that the final prediction results are effective. At each iteration a weak learner is added to the base learner and trained with respect to the error of the whole ensemble. When weak learners are added iteratively to an ensemble, it delivers precise classification. A learning method that consecutively tries new models to provide extra accuracy on the class variable leads to gradient boosting. The negative gradient of the loss function is correlated with each new model, which tends to minimize the error. Friedman [34] presented complete details of the boosted decision tree.

Step 1: Fit a decision tree h_m(x) to the pseudo-residuals. J_m represents the number of leaves; the input space is divided into disjoint regions R_1m, …, R_Jmm, and a constant value is predicted in each region. The output can be written as:

h_m(x) = Σ_{j=1}^{J_m} b_jm · 1(x ∈ R_jm)

where b_jm denotes the predicted value in region R_jm.

Step 2: The model is then updated and the previous value is discarded. The new function is written as:

F_m(x) = F_{m−1}(x) + γ_m h_m(x),  with  γ_m = argmin_γ Σ_{i=1}^{n} L(y_i, F_{m−1}(x_i) + γ h_m(x_i))

Terminal nodes or leaves are denoted by J in the tree. The accuracy of the boosted decision tree improves as the number of leaves and the size of the tree increase, but overfitting and longer processing times may occur.

2) Decision forest: The other algorithm that performs classification by utilizing an ensemble learning method is known as decision forest. Ensemble methods generalize rather than depend on a single model: a generalized model generates multiple associated models and merges them, which gives better results. Mostly, ensemble models offer better accuracy as compared to a single decision tree. Decision forest differs from the random forest method, in which the individual decision trees might only use some randomized portion of the data or features. There are many methods to ensemble decision trees, but voting is one of the most effective for producing results in an ensemble model [35]. Decision forest works by constructing multiple decision trees and then voting on the most popular output class. By utilizing the whole data set and different starting points, a set of classification trees is constructed. Decision forest outputs a non-normalized frequency histogram of labels for each decision tree. The probability of each label is determined by an aggregation method which sums the histograms and then normalizes the results. The final decision of the ensemble is based on the trees, in which high prediction confidence depends on high weight. Criminisi [36] presented complete details of decision forests.

Step 1: Forest training is done by optimizing the parameters θ_j of the weak learner at each split node j, where S_j denotes the parent set and θ_j the split parameters:

θ_j* = argmax_{θ_j} I(S_j, θ_j)

Step 2: The objective (or loss) function, denoted I, takes the value of the information gain, described as:

I(S_j, θ_j) = H(S_j) − Σ_{i∈{L,R}} (|S_j^i| / |S_j|) H(S_j^i)

where H(S_j) is the entropy of the example set at the parent node, |S_j^i|/|S_j| denotes the weighting of the left/right children, and H(S_j^i) represents the entropy of the corresponding child set.

Step 3: The entropy of a generic set S of training points is:

H(S) = − Σ_{c∈C} p(c) log p(c)
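The split-selection rule of Steps 1-3 can be sketched for a single node. The following is an illustrative reimplementation, not the internals of the Azure ML modules used in the study; the candidate thresholds and toy data are hypothetical:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) = -sum_c p(c) log2 p(c) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(x, y):
    """Pick the threshold on feature x that maximizes the information gain I."""
    best_gain, best_t = -1.0, None
    for t in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= t], y[x > t]
        # I = H(S) - sum_i (|S_i|/|S|) H(S_i) over the two children
        gain = entropy(y) - (len(left) / len(y)) * entropy(left) \
                          - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

For example, with feature values [1, 2, 3, 4] and labels [0, 0, 1, 1], the threshold t = 2 separates the classes perfectly, so the gain equals the full parent entropy of 1 bit.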
3) Decision jungle: This method contrasts with the random forest method.
With a large amount of data, the number of nodes in decision trees grows exponentially with depth. The decision jungles method compares two new node-merging algorithms that jointly optimize both the features and the structure of directed acyclic graphs (DAGs) powerfully. DAGs have the same structure as decision trees, except that nodes can have multiple parents. Node splitting and node merging are determined by the objective function and the entropies of the weighted sums at the leaves. The training of DAGs is done level by level, combining the objective function over both the structure of the DAG and the split functions. At each level, the algorithm jointly learns the features and the branching structure of the nodes. This is done by minimizing an objective function defined over the predictions. Decision jungles require radically less memory while considerably improving generalization. Shotton [37] presented comprehensive details of decision jungles.

Step 1: The sets of parent and child nodes are denoted by N_p and N_c. θ_j denotes the parameters of the split feature function for parent node j, and S_i denotes the set of labeled instances that reach node i. The set of instances that reach any child node j is:

S_j({θ_i}, {l_i}, {r_i}) = [⋃_{i∈N_p : l_i=j} S_i^L(θ_i)] ∪ [⋃_{i∈N_p : r_i=j} S_i^R(θ_i)]

Step 2: The objective function E related to the current level of the DAG is a function of {θ_j}, {l_j}, {r_j}. The difficulty of learning the parameters of the decision DAG is resolved as a joint minimization of the objective over the split parameters {θ_j} and the child assignments {l_j}, {r_j}. The task of learning the current level of a DAG can be written as the minimization of the total weighted entropy of instances, defined as:

E({θ_j}, {l_j}, {r_j}) = Σ_{j∈N_c} |S_j| H(S_j)

{θ_j}, {l_j}, {r_j} present the features and branches for all parent nodes, the sum runs over the child nodes, |S_j| is the number of examples at node j, and H(S_j) denotes the entropy of the examples that reach child node j.

Step 4: To solve the minimization problem, a cluster search method was used, which alternates between optimizing the branching variables and the split parameters but optimizes the branching variables more globally.

IV. RESULTS AND DISCUSSION

In this study numerous methods were examined, and the three methods with the best performance are presented. The 10-fold cross-validation method was used in the evaluation of the proposed methods. Cross-validation was used because it uses the entire training dataset for both training and evaluation, instead of some portion [38]. Among the 858 patients, 124 had a huge number of missing values due to privacy concerns, and the remaining 734 were considered. Using the SMOTE method, the imbalanced dataset problem was overcome and the number of instances was increased. The new balanced dataset consists of 32 attributes and 1226 patients, of which 563 were cancer patients and 663 non-cancer patients, as shown in the confusion matrix of Fig. 1. The median age of the patients was 26 years (range, 13-84). The median number of sexual partners was 2 (range, 1-10). The median age at first sexual intercourse was 17 (range, 10-32), and the median number of pregnancies was 2 (range, 0-10).
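The SMOTE step recapped above (x_new = x_i + δ · (x_zi − x_i)) can be illustrated with a minimal reimplementation. This sketch is not the Azure ML SMOTE module used in the study, and the toy minority samples are hypothetical:

```python
import numpy as np

def smote_sample(X_minority, k=3, rng=None):
    """Generate one synthetic minority sample: x_new = x_i + delta * (x_zi - x_i)."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(X_minority))
    x_i = X_minority[i]
    # distances from x_i to every minority sample
    d = np.linalg.norm(X_minority - x_i, axis=1)
    neighbors = np.argsort(d)[1:k + 1]          # k nearest neighbors, skipping x_i itself
    x_zi = X_minority[rng.choice(neighbors)]    # pick one neighbor at random
    delta = rng.random()                        # delta drawn from [0, 1)
    return x_i + delta * (x_zi - x_i)           # a point on the segment between x_i and x_zi
```

Because the synthetic point lies on the line segment between two real minority samples, it is bounded by the feature ranges of its parents, which is what makes the oversampled class "more generic" than plain duplication.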
There were four screening methods (target attributes) in the data set, labeled as biopsy, cytology, Schiller and Hinselmann. These four screening methods have been used to diagnose cancer, and each screening method was trained with the same dataset but individually. The boosted decision tree outperformed all other methods, as shown in Table 2. The Hinselmann screening method also outperformed the other methods, with an AUROC of 0.978, which was slightly higher than biopsy but significantly higher than cytology and Schiller. The AUROC also gave better results with the boosted decision tree: 0.974 on biopsy, 0.959 on cytology and 0.943 on the Schiller target attribute. The complete performance of the proposed models is given in Fig. 3, and the performance on the AUROC curve is shown in Fig. 2.

The boosted decision tree, decision forest and decision jungle algorithms were used to determine the prediction ability of the tested models by computing the accuracy, sensitivity, specificity and AUROC. The AUROC is a good measure for evaluating the performance of classification models [39-42]. The AUROC performance of the proposed models is shown in Fig. 2. The AUROC is a summary measure of performance that indicates whether, on average, a true positive is ranked higher than a false positive. The AUROC was also used for the evaluation of different techniques [18, 27] in biomedical data mining.
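The ranking interpretation of the AUROC given above can be checked directly: counting the fraction of positive-negative pairs ranked correctly reproduces scikit-learn's roc_auc_score. The labels and scores below are hypothetical, for illustration only:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores, only to illustrate the metric.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUROC = P(score of a random positive > score of a random negative);
# with no tied scores, this pairwise count matches roc_auc_score exactly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
manual_auc = np.mean([p > n for p in pos for n in neg])

print(manual_auc, roc_auc_score(y_true, y_score))  # both 8/9 here
```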
TABLE II. AUROC OBTAINED BY THE ML TECHNIQUES ON THE RISK PREDICTION TASK WITH MULTIPLE SCREENING METHODS: BIOPSY, CYTOLOGY, SCHILLER AND HINSELMANN. PERFORMANCE WAS ALSO EVALUATED IN TERMS OF ACCURACY, SENSITIVITY AND SPECIFICITY

Screening Method | Technique | Accuracy | Sensitivity | Specificity | AUROC
Biopsy | Boosted Decision Tree | — | — | — | 0.974
Biopsy | Decision Forest | — | — | — | —
Biopsy | Decision Jungle | 0.863 | 0.733 | 0.968 | 0.929
Cytology | Boosted Decision Tree | 0.934 | 0.893 | 0.965 | 0.959
Cytology | Decision Forest | 0.888 | 0.790 | 0.963 | 0.935
Cytology | Decision Jungle | 0.879 | 0.735 | 0.989 | 0.929
Schiller | Boosted Decision Tree | 0.909 | 0.870 | 0.942 | 0.943
Schiller | Decision Forest | 0.865 | 0.766 | 0.948 | 0.918
Schiller | Decision Jungle | 0.863 | 0.726 | 0.978 | 0.908
Hinselmann | Boosted Decision Tree | 0.941 | 0.896 | 0.974 | 0.978
Hinselmann | Decision Forest | 0.892 | 0.793 | 0.965 | 0.945
Hinselmann | Decision Jungle | 0.879 | 0.730 | 0.991 | 0.934
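Results like those tabulated above come from 10-fold cross-validated AUROC estimates. A rough scikit-learn analogue of that protocol might look as follows; the study itself used Azure ML modules, and the random data here merely stands in for the real preprocessed attributes:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))            # stand-in for the 32 preprocessed attributes
y = (X[:, 0] - X[:, 2] > 0).astype(int)  # stand-in for one screening target

# Boosted trees with a cap on leaves per tree (the study used 10-20),
# scored by AUROC over stratified 10-fold cross-validation.
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                   max_leaf_nodes=20)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(round(scores.mean(), 3))  # mean AUROC over the 10 folds
```

Stratified folds keep each fold's class ratio close to that of the full (SMOTE-balanced) dataset, so every fold's AUROC is estimated on a comparable class mix.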
Fig. 2. Comparison of the Area under Receiver Operating Characteristic (AUROC) Curve between the Boosted Decision Tree (Blue Line) and the Decision Forest (Red Line), as these Models Give the Best Results. Plots are Shown for the Models with Threshold = 5.
Fig. 3. The Results in Terms of Accuracy, Sensitivity, Specificity and AUROC Curve in the Prediction of Cervical Cancer.
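The accuracy, sensitivity and specificity reported in Fig. 3 and Table 2 are all derived from the confusion matrix. A small illustrative computation follows, with hypothetical labels rather than the study's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predictions for one screening target, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                   # true positive rate (recall)
specificity = tn / (tn + fp)                   # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)      # all 0.75 for this toy data
```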
About 50% of cervical cancer identifications are in females aged 35-54, around 20% are diagnosed at more than 65 years old, and around 15% between the ages of 20-30. The median age at diagnosis of cervical cancer is 48 years. Cervical cancer is significantly unusual in females younger than 20. In any case, several young females end up infected with different sorts of human papillomavirus (HPV), which can increase their danger of getting cervical cancer in the future. Young females with early abnormal changes who do not have regular checkups are at high risk of cervical cancer when they reach the age of 40 [43-45]. The main risk factor for cervical cancer growth is HPV. Sexual relations with infected persons are another risk factor for HPV. Different parameters with respect to sexual relations, such as relations with multiple persons, are also danger factors for females which lead to cervical cancer. Sexually active females have never been in as much danger of cervical cancer as those who have multiple sexual partners [46, 47]. Smoking is related to a higher risk of precancerous changes in the cervix and of progression to invasive cervical cancer, particularly for women infected with HPV. Women with a weak immune system are more prone to getting HPV [48].

This study exploited late advancements in statistical learning for handling high-dimensional data with numerous features. Other promising areas of research in these […] developments in technology [53]. Generally, the problem of large-dimensional data modelling has been solved by variable reduction methods in the preprocessing and in the post-processing stage. Several data mining methods, like artificial neural networks, support vector machines and the k-nearest neighbor method, were also used to resolve the high-dimensional classification problem [54]. In this study, the high-dimensional classification problem was resolved by using decision tree methods, because only those attributes were considered which showed the highest relevance to the screening method (target class). The Hinselmann screening method showed high performance because Hinselmann is also a traditional method of screening for cervical cancer which is effective [55-57]. The performance of the biopsy screening method was slightly lower than the Hinselmann screening method. From various studies, it was also found that biopsy screening has a huge impact on cervical cancer detection [58, 59]. The use of the boosted decision tree was preferred because it focuses on misclassified instances and has a tendency to increase accuracy. Boosting is one way to decrease the misclassification rate because, inside boosting, iteration is introduced [60]. In general, this increases the degree of accuracy in classification. The boosted decision tree is an ensemble model in which the results from various models are consolidated; the outcome acquired from an ensemble model is normally superior to the outcome from any individual model. In this study, the maximum number of leaves per tree was 20 and the minimum number of leaves per tree was 10. The learning rate was set to 0.1, but processing time slightly increases […] slow when a large number of trees are made. These algorithms are fast to train but quite slow to create predictions once they are trained. The accuracy may increase when the number of
trees is also increased [64], but this also leads to a slower model for prediction. In most real-world applications the decision forest is fast enough, but in some situations run-time performance is important and other methods would be chosen. Decision forest was also used to understand protein interactions and to make predictions based on all the protein domains [65]. Other applications of decision forest were the prediction of different types of liver diseases, including alcoholic liver damage and liver cirrhosis [66]. Other than biomedical classification, the decision forest method was applied for academic data analysis [67] as well as classification and forecasting of chronic kidney disease [68]. Decision jungles were used for feature selection for images, with some modification to achieve efficient results with modest training time [69].

V. CONCLUSION

Nowadays, cervical cancer is a common disease and its screening often involves very time-consuming clinical tests. In this perspective, machine learning can deliver efficient methods to speed up the diagnosis procedure. In this research work, data mining methods, especially tree-based algorithms, enable sound prediction for cervical cancer patients. The imbalanced data set problem, in which cancerous patients were too few compared to non-cancerous patients, was resolved by using the SMOTE method. The prediction ability of the boosted decision tree, measured by the AUROC value, outperformed decision forest and decision jungle. The low AUROC values for the decision forest and decision jungle methods disqualified them as the best predictive classifiers. We believe that with the growing collection of cervical cancer patient data and the rapidly advancing methods for analyzing this data, we will begin to be able to identify the best screening method for cervical cancer patients, which will be informative for patient care. In future, this study can be used as a prototype to develop a healthcare system for cervical cancer patients.

REFERENCES

[9] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal, "Global cancer statistics 2018: GLOBOCAN estimates of cancer incidence and mortality worldwide for 36 cancers in 185 countries," CA: A Cancer Journal for Clinicians, vol. 68, pp. 394-424, 2018.
[10] R. A. Kerkar, "Screening for cervical cancer: an overview."
[11] G. Guvenc, A. Akyuz, and C. H. Açikel, "Health belief model scale for cervical cancer and Pap smear test: psychometric testing," Journal of Advanced Nursing, vol. 67, pp. 428-437, 2011.
[12] M. T. Galgano, P. E. Castle, K. A. Atkins, W. K. Brix, S. R. Nassau, and M. H. Stoler, "Using biomarkers as objective standards in the diagnosis of cervical biopsies," The American Journal of Surgical Pathology, vol. 34, p. 1077, 2010.
[13] H. Ramaraju, Y. Nagaveni, and A. Khazi, "Use of Schiller's test versus Pap smear to increase detection rate of cervical dysplasias," International Journal of Reproduction, Contraception, Obstetrics and Gynecology, vol. 5, pp. 1446-1450, 2017.
[14] N. Jothi and W. Husain, "Data mining in healthcare - a review," Procedia Computer Science, vol. 72, pp. 306-313, 2015.
[15] P. Ahmad, S. Qamar, and S. Q. A. Rizvi, "Techniques of data mining in healthcare: a review," International Journal of Computer Applications, vol. 120, 2015.
[16] T. M. Alam and M. J. Awan, "Domain analysis of information extraction techniques," International Journal of Multidisciplinary Sciences and Engineering, vol. 9, pp. 1-9, 2018.
[17] C.-C. Chang, S.-L. Cheng, C.-J. Lu, and K.-H. Liao, "Prediction of recurrence in patients with cervical cancer using MARS and classification," International Journal of Machine Learning and Computing, vol. 3, p. 75, 2013.
[18] M. Kusy, B. Obrzut, and J. Kluska, "Application of gene expression programming and neural networks to predict adverse events of radical hysterectomy in cervical cancer patients," Medical & Biological Engineering & Computing, vol. 51, pp. 1357-1365, 2013.
[19] J. M. Yamal, M. Guillaud, E. N. Atkinson, M. Follen, C. MacAulay, S. B. Cantor, et al., "Prediction using hierarchical data: Applications for automated detection of cervical cancer," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 8, pp. 65-74, 2015.
[20] K. Fernandes, D. Chicco, J. S. Cardoso, and J. Fernandes, "Supervised deep learning embeddings for the prediction of cervical cancer diagnosis," PeerJ Computer Science, vol. 4, p. e154, 2018.
[21] J. Kahng, E.-H. Kim, H.-G. Kim, and W. Lee, "Development of a cervical cancer progress prediction tool for human papillomavirus-…
[30] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Data preprocessing for supervised learning," International Journal of Computer Science, vol. 1, pp. 111-117, 2006.
[31] S. Patro and K. K. Sahu, "Normalization: A preprocessing stage," arXiv preprint arXiv:1503.06462, 2015.
[32] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[33] Z. Zheng, Y. Cai, and Y. Li, "Oversampling method for imbalanced classification," Computing and Informatics, vol. 34, pp. 1017-1037, 2016.
[34] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001.
[35] L. Rokach, "Decision forest: Twenty years of research," Information Fusion, vol. 27, pp. 111-125, 2016.
[36] A. Criminisi and J. Shotton, Decision Forests for Computer Vision and Medical Image Analysis: Springer Science & Business Media, 2013.
[37] J. Shotton, T. Sharp, P. Kohli, S. Nowozin, J. Winn, and A. Criminisi, "Decision jungles: Compact and rich models for classification," in Advances in Neural Information Processing Systems, 2013, pp. 234-242.
[38] D. Krstajic, L. J. Buturovic, D. E. Leahy, and S. Thomas, "Cross-validation pitfalls when selecting and assessing regression and classification models," Journal of Cheminformatics, vol. 6, p. 10, 2014.
[39] F. Garrido, W. Verbeke, and C. Bravo, "A robust profit measure for binary classification model evaluation," Expert Systems with Applications, vol. 92, pp. 154-160, 2018.
[40] M. Vihinen, "How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis," in BMC Genomics, 2012, p. S2.
[41] D. J. Hand, "Measuring classifier performance: a coherent alternative to the area under the ROC curve," Machine Learning, vol. 77, pp. 103-123, 2009.
[42] K. Hajian-Tilaki, "Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation," Caspian Journal of Internal Medicine, vol. 4, p. 627, 2013.
[43] C. Sun, A. J. Brown, A. Jhingran, M. Frumovitz, L. Ramondetta, and D. C. Bodurka, "Patient preferences for side effects associated with cervical cancer treatment," International Journal of Gynecological Cancer: Official Journal of the International Gynecological Cancer Society, vol. 24, p. 1077, 2014.
[44] International Collaboration of Epidemiological Studies of Cervical Cancer, "Cervical cancer and hormonal contraceptives: collaborative reanalysis of individual data for 16,573 women with cervical cancer and 35,509 women without cervical cancer from 24 epidemiological studies," The Lancet, vol. 370, pp. 1609-1621, 2007.
[45] G. Danaei, S. Vander Hoorn, A. D. Lopez, C. J. Murray, M. Ezzati, and
[47] S. de Sanjosé, M. Brotons, and M. A. Pavón, "The natural history of human papillomavirus infection," Best Practice & Research Clinical Obstetrics & Gynaecology, vol. 47, pp. 2-13, 2018.
[48] E. Mazarico, R. Gómez, L. Guirado, N. Lorente, and E. Gonzalez-Bosquet, "Relationship between smoking, HPV infection, and risk of cervical cancer," European Journal of Gynaecological Oncology, vol. 392, p. 2936, 2015.
[49] L. Rokach, "Ensemble-based classifiers," Artificial Intelligence Review, vol. 33, pp. 1-39, 2010.
[50] A. Franco-Arcega, L. Flores-Flores, and R. F. Gabbasov, "Application of decision trees for classifying astronomical objects," in 2013 12th Mexican International Conference on Artificial Intelligence (MICAI), 2013, pp. 181-186.
[51] K. Chitra and B. Subashini, "Data mining techniques and its applications in banking sector," International Journal of Emerging Technology and Advanced Engineering, vol. 3, pp. 219-226, 2013.
[52] N. Öcal, M. K. Ercan, and E. Kadıoğlu, "Predicting financial failure using decision tree algorithms: An empirical test on the manufacturing industry at Borsa Istanbul," International Journal of Economics and Finance, vol. 7, 2015.
[53] V. Pappu and P. M. Pardalos, "High-dimensional data classification," in Clusters, Orders, and Trees: Methods and Applications: In Honor of Boris Mirkin's 70th Birthday, F. Aleskerov, B. Goldengorin, and P. M. Pardalos, Eds. New York, NY: Springer, 2014, pp. 119-150.
[54] M. Zekić-Sušac, S. Pfeifer, and N. Šarlija, "A comparison of machine learning methods in a high-dimensional classification problem," Business Systems Research Journal, vol. 5, pp. 82-96, 2014.
[55] Y. Eraso, "Migrating techniques, multiplying diagnoses: the contribution of Argentina and Brazil to early 'detection policy' in cervical cancer," História, Ciências, Saúde-Manguinhos, vol. 17, pp. 33-51, 2010.
[56] M. Aref-Adib and T. Freeman-Wang, "Cervical cancer prevention and screening: the role of human papillomavirus testing," The Obstetrician & Gynaecologist, vol. 18, pp. 251-263, 2016.
[57] I. Löwy, "Cancer, women, and public health: the history of screening for cervical cancer," História, Ciências, Saúde-Manguinhos, vol. 17, pp. 53-67, 2010.
[58] P. Ghosh, G. Gandhi, P. Kochhar, V. Zutshi, and S. Batra, "Visual inspection of cervix with Lugol's iodine for early detection of premalignant & malignant lesions of cervix," The Indian Journal of Medical Research, vol. 136, p. 265, 2012.
[59] K. Petry, J. Horn, A. Luyten, and R. Mikolajczyk, "Punch biopsies shorten time to clearance of high-risk human papillomavirus infections of the uterine cervix," BMC Cancer, vol. 18, p. 318, 2018.
[60] A. Niculescu-Mizil and R. Caruana, "Obtaining calibrated probabilities from boosting."
[61] V. Athanasiou and M. Maragoudakis, "A novel, gradient boosting framework for sentiment analysis in languages where NLP resources are not plentiful: a case study for modern Greek," Algorithms, vol. 10, p. 34, 2017.
[62] S. F. Weng, J. Reps, J. Kai, J. M. Garibaldi, and N. Qureshi, "Can machine-learning improve cardiovascular risk prediction using routine clinical data?," PLoS ONE, vol. 12, p. e0174944, 2017.
[63] Z. Wei, W. Wang, J. Bradfield, J. Li, C. Cardinale, E. Frackelton, et al., "Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease," The American Journal of Human Genetics, vol. 92, pp. 1008-1012, 2013.
[64] S. Fong, W. Song, R. Wong, C. Bhatt, and D. Korzun, "Framework of temporal data stream mining by using incrementally optimized very fast decision forest," in Internet of Things and Big Data Analytics Toward Next-Generation Intelligence, Springer, 2018, pp. 483-502.
Primary Hepatoma, Liver Cirrhosis, and Cholelithiasis," Journal of Healthcare Engineering, vol. 2018, 2018.
[67] A. J. Fernández-García, L. Iribarne, A. Corral, and J. Criado, "A comparison of feature selection methods to optimize predictive models based on decision forest algorithms for academic data analysis," in World Conference on Information Systems and Technologies, 2018, pp. 338-347.
[68] W. Gunarathne, K. Perera, and K. Kahandawaarachchi, "Performance evaluation on machine learning classification techniques for disease classification and forecasting through data analytics for chronic kidney disease (CKD)," in 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), 2017, pp. 291-296.
[69] S. Baek, K. I. Kim, and T.-K. Kim, "Deep convolutional decision jungle for image classification," arXiv preprint arXiv:1706.02003, 2017.