
Journal of Theoretical and Applied Information Technology

30th April 2023. Vol.101. No 8


© 2023 Little Lion Scientific

ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195

PREDICTION OF COMORBID MALIGNANCY PATIENT
SURVIVABILITY – EMPIRICAL PERSPECTIVE

DR Y PADMA1, DR NIDAMANURU SRINIVASA RAO2, MR. PAVAN KUMAR KOLLURU3,
DR C ASHOK KUMAR4, MS. SHAIK SALMA BEGUM5, DR. SURESH CHANDANAPALLI6,
KODEPOGU KOTESWARA RAO*

1 Asst. Professor, IT Dept., PVP Siddhartha Institute of Technology, Vijayawada
2 Associate Professor, Dept. of CSE, Narasimha Reddy Engineering College, Secunderabad
3 Asst. Professor, CSE Dept., VFSTR Deemed to be University, Guntur
4 Asst. Professor, Dept. of Computing Technologies, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai
5 Asst. Professor, CSE Dept., SR Gudlavalleru Engineering College, Gudlavalleru
6 Professor, IT Dept., SR Gudlavalleru Engineering College, Gudlavalleru
* Dept. of CSE, PVP Siddhartha Institute of Technology, Vijayawada, India

E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT

Modeling the survivability of comorbid cancer patients has both theoretical and practical implications.
Cancer is one of the leading causes of death worldwide. Stomach, Liver, Thyroid, Lung and Skin Cancers
are some of the most frequent cancers. The detection and prevention of these malignancies are crucial
goals. According to recent discoveries, some people have cancer comorbidity. A number of studies have
shown poorer survival among cancer patients with comorbidity. Several mechanisms may underlie this
finding. The majority of studies found that cancer patients with comorbidity had a lower 5-year survival
rate than those without, with hazard ratios ranging from 1.1 to 5.8. Only a few studies looked into the
impact of specific chronic illnesses. Comorbidity does not appear to be linked to more aggressive cancers
or other abnormalities in tumor biology in general. Another conclusion was that patients with comorbidity
are less likely to obtain standard cancer therapies such as surgery, chemotherapy, and radiation therapy, and
their chances of completing a course of treatment are reduced. Predicting cancer survival may help with
clinical decision-making and tailored therapy. Large data sets appropriate for machine learning analysis are
available through the Surveillance, Epidemiology, and End Results (SEER) program. In our study, we regard
survival prediction as a two-stage problem. The first stage forecasts a patient's five-year survival; the
second stage estimates the remaining survival time for individuals whose anticipated outcome is 'death.'
The SEER database was used to identify and label male and female comorbid cancer cases (Stomach,
Lung, Liver, Thyroid and Skin Cancers). During the classification stage, the dataset was processed with
CHI2-based feature selection together with data balancing; these two measures addressed the problems of
a skewed data set.
Keywords: CHI2, SEER, Comorbid, Survivability, Empirical Study

1. INTRODUCTION

Cancer prognosis has improved dramatically as a result of increased cancer screening, advances in medical knowledge, and improvements in supportive care. In 2016, the 5-year cancer survival rate was double that in 1950. Cancer survivors have a higher risk of having a secondary cancer, which is estimated to be 14% higher than the risk of developing a primary cancer in persons who have never had cancer. Multiple primary cancer (MPC) patients are on the rise as a result of an increasing number of cancer survivors and an ageing population. Cancer comorbidity refers to the presence of numerous cancers at the same time [1-5].

Cancer survival prediction is a popular topic of study. Predicting patients' chances of survival accurately could help doctors give better medical advice and prescribe more tailored medications.


Survivability refers to a patient's ability to live for more than five years after being diagnosed with cancer. It is a medical metric for assessing treatment outcomes. The majority of cancer survival studies try to forecast patients' five-year survival rates. These studies only provide a small quantity of data to help doctors make decisions. If a patient's prognosis is 'death,' the patient's survival time is unclear. To provide more exact information for medical decision-making, survival time prediction should be investigated [6].

The paucity of large-scale medical data available to the public makes cancer survival research difficult. The SEER program (Surveillance, Epidemiology, and End Results) is an open-source database that provides de-identified, coded, and annotated data on cancer statistics in the United States. Because the database is large enough, machine learning techniques can be used to analyze it.

The goal of this paper is to forecast survival time on a monthly basis. When one-stage regression models are applied, however, substantial generalization errors frequently occur, making survival time prediction difficult. A two-stage prediction model is offered as a solution to this problem. A classifier is used in the first stage to estimate whether a patient will live for more than five years. A regression model is employed in the second stage to forecast the survival time of patients predicted not to reach five-year survival. CHI2 feature selection, feature selection using eigenvector centrality (ECFS), and mutual information-based feature selection are the methodologies compared for the two-stage classifiers; these feature selection methods are publicly available. Because the predicted outcome is continuous, the foregoing enhancements cannot be applied during the regression stage. However, without data pre-treatment, the error rate is significant and the training time is considerable [7-9].

2. LITERATURE SURVEY

[10] Y. Wang et al. proposed a tree-ensemble-based two-stage model for advanced-stage colorectal cancer survival prediction. The majority of existing data-driven cancer survival prediction studies use classification to predict whether a patient will live for more than five years. The prediction results obtained in this manner, however, are not precise enough to support medical decision-making. For example, in five-year survivability classification, the exact outcome (survival time) of patients classified as negative (unable to survive more than five years) is unknown, which deserves more attention, particularly for high-mortality cancers. Survival time prediction can be used to make more precise predictions, which is more difficult but also more meaningful for medical doctors. Traditional studies commonly use statistical tools to build prediction models based on survival-related factors such as the palliative prognostic score, the palliative performance index, the intra-hospital cancer mortality risk model, and the cancer prognostic score. Keep in mind, however, that the above statistically-based prediction models target terminal cancer patients whose survival time is less than one month, in order to provide proper support [10]. The goal of that work was to use machine learning methods to predict survival time on a monthly basis, which can aid in making effective treatment decisions. Predicting survival times has been shown to be extremely difficult because large generalization errors frequently occur when one-stage regression models are used. To address this challenge, the authors propose a two-stage model based on tree ensembles for cancer survival prediction, in which an effective classifier is used in the first stage to predict whether patients can survive for five years, and a novel regression tree ensemble is used in the second stage to predict the specific survival time for patients who are predicted to be unable to survive for five years.

[11] Kaviarasi, R. et al. proposed an accuracy-enhanced lung cancer prognosis system for improving patient survivability using a proposed Gaussian classifier. Measurable classifiers and high precision are a fundamental part of research in clinical data mining. Accurate prediction of lung cancer is an essential step toward making effective clinical choices, because once lung cancer is recognized only limited treatment options remain for patient survival. Patients' survival periods vary with hemoglobin level and TNM stage: for some groups the survival period is minimal, while for others it is extended. The study set out to develop a prediction model with new clinical factors to forecast lung cancer patients' survival, based on a modified eighth-edition analysis of TNM staging in lung cancer. The new attributes were gathered from SEER data sets and from Indian cancer hospitals and research centers, and were classified using supervised machine learning algorithms.
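As an aside, the Gaussian classifier family referenced above fits a per-class normal distribution to each feature. A minimal scikit-learn sketch on synthetic hemoglobin-like values (the numbers, the single feature, and the settings are illustrative assumptions, not the cited study's data or code):

```python
# Toy Gaussian Naive Bayes classification on a synthetic hemoglobin feature.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
# two synthetic groups: 'low hemoglobin' vs 'normal hemoglobin' (values assumed)
X0 = rng.normal(loc=9.0, scale=0.5, size=(50, 1))    # low HB values
X1 = rng.normal(loc=14.0, scale=0.5, size=(50, 1))   # normal HB values
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
print(model.predict([[9.2], [13.8]]))   # -> [0 1]
```

GaussianNB only stores per-class means and variances, so both training and prediction are cheap, which is one reason such classifiers are popular in clinical data mining studies.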


These algorithms comprised linear regression, the Naïve Bayes classifier, and the proposed Gaussian K-Base NB classifier. Specifically, for the TNM stage 1 group with a normal hemoglobin level (NHBL), the quality of life of lung cancer patients was greatly improved, as demonstrated using the supervised machine learning algorithms. The proposed algorithm grouped the data set in terms of tumor size and HB level, and the results were confirmed in the R environment. The continuous-attribute classification technique showed that a first-stage TNM lung cancer patient who maintains a standard hemoglobin level has a higher survivability rate than a patient with a lower hemoglobin level. The Gaussian K-Base NB classifier was more effective than existing machine learning algorithms for the lung cancer prediction model, and the accuracy of the proposed classification was measured using ROC methods.

[12] Ryu, Sung Mo, et al. proposed predicting the survival of patients with spinal ependymoma using machine learning algorithms with the SEER database. The purpose of the study was to learn about the clinical and demographic factors that influence the overall survival (OS) of patients with spinal ependymoma and to predict the OS using machine learning (ML) algorithms. The Surveillance, Epidemiology, and End Results (SEER) registry was used to compile cases of spinal ependymoma diagnosed between 1973 and 2014. Statistical analyses were performed using the Kaplan-Meier method and the Cox proportional hazards regression model to identify the factors influencing survival, and machine learning algorithms were additionally used to predict the survival of patients with spinal ependymoma. Age 65 years, histologic subtype, extraneural metastasis, multiple lesions, surgery, radiation therapy, and gross total resection (GTR) were found to be independent predictors of OS in the multivariate analysis model. The ML model predicted the 5-year OS of spinal ependymoma with an area under the receiver operating characteristic curve (AUC) of 0.74 (95% confidence interval [CI], 0.72-0.75) and the 10-year OS with an AUC of 0.81 (95% CI, 0.80-0.83). The stepwise logistic regression model performed worse, with an AUC of 0.71 (95% CI, 0.70-0.72) for predicting the 5-year OS and an AUC of 0.75 (95% CI, 0.73-0.77) for predicting the 10-year OS. SEER data confirmed that therapeutic factors such as surgery and GTR were associated with improved overall survival. ML techniques outperformed statistical methods in predicting OS; however, the dataset was heterogeneous and complex, with numerous missing values.

[13] David Riaño, Ricardo, and Kleinlein examined the persistence of data-driven knowledge used to forecast survival from breast cancer. By adapting machine learning prediction models to the stage of the cancer at the time of diagnosis, breast cancer survival prediction can be improved. However, the predictive capability of these models, as well as the importance of the clinical characteristics in that prediction, may change with time. The authors investigated whether results about the performance of machine learning models and the effect of clinical factors in the prediction of breast cancer survival are temporary or permanent, and, if temporary, how long newly acquired knowledge remains valid. On the application of machine learning techniques to predict breast cancer survival, there have been fifteen recent publications with pertinent conclusions. Several data-driven models were subsequently developed over time to estimate the five-year survival of breast cancer using the breast cancer data in the SEER database. Three different machine learning techniques were used, and both stage-specific models and joint models were considered for each stage. The predictive capability of the models and the significance of clinical indicators were submitted to a persistence study over time in order to establish the validity and long-term viability of these fifteen results. Only 53% of the conclusions held for the SEER cases from 1988 to 2009, and only 75% of those held across time. Relevant conclusions, such as the inability to increase survival prediction accuracy for the most frequent stages with more data, or the significance of cancer grade in predicting breast cancer survival for patients with distant metastasis, were found to be false when subjected to a temporal analysis. The study concluded that, before being used in clinical and professional settings, data-driven knowledge generated through machine learning techniques has to be evaluated over time.

A model developed by Narges Habibi, Majid, and Naghizadeh employs an ensemble learning method to predict the prognosis of cancer comorbidity. Cancer is one of the leading causes of death worldwide. Breast and vaginal cancer in women, as well as prostate cancer in men, are some of the most common malignancies, and the early detection and prevention of these cancers are crucial goals. Patients with comorbid conditions have a worse chance of survival than those with just one type of cancer. The significance of concurrent chronic illnesses during cancer therapy is assessed using a range of machine-learning approaches on SEER data.
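The ensemble-driven feature selection used in approaches of this kind can be illustrated with gradient boosting's impurity-based feature importances; the toy data and parameter choices below are assumptions for demonstration only, not the cited pipeline:

```python
# Rank features by gradient boosting importance; column 0 carries the signal.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
n = 300
signal = rng.integers(0, 2, size=n)          # one informative binary feature
noise = rng.normal(size=(n, 3))              # three unrelated features
X = np.column_stack([signal, noise])
y = signal                                   # label driven entirely by column 0

gbm = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X, y)
ranked = np.argsort(gbm.feature_importances_)[::-1]
print(ranked[0])   # -> 0 (the informative column dominates)
```

In a real pipeline, the lowest-ranked features would be dropped and the model refit, with the split repeated on held-out data to avoid selection bias.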


The gradient boosting ensemble technique is used for feature selection. According to recent investigations, some people have concurrent cancers, and modeling improves the accuracy of estimating survival rates for cancer patients with related illnesses. This technique shows a significant improvement in prediction accuracy when compared to prior proposed models and suggests an increase in the predicted survival rates for comorbid cancer. An ensemble-based technique is recommended for forecasting the survival rate in patients with cancer comorbidity. The initial stage of the strategy was to combine the necessary SEER data sets in order to locate the targeted comorbid patients. After each record is classified as either living or dead, the data are preprocessed (for example, by handling missing values) and the resulting data set is balanced; the important input features are then determined using ensemble methodologies. Several prediction methods are tested using a train-test split, and Gradient Boosting is finally chosen as the best predictor because of its improved performance. According to the findings of the studies, the suggested model performs better than the other approaches in terms of precision, error, sensitivity, and specificity when predicting survival in cancer comorbidity.

[14] J. A. Bartholomai et al. proposed supervised machine learning classification methods for predicting lung cancer patient survival. Recently, outcomes for cancer patients have been assessed using a variety of machine learning techniques on significant datasets such as the Surveillance, Epidemiology, and End Results (SEER) program data set. Particularly for lung cancer, it is uncertain which procedures would produce more accurate predictions and which data attributes should be employed to establish them. The study uses a number of supervised learning approaches, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble, to group lung cancer patients according to survival; using these strategies, the essential data attributes are identified. As a first step toward improving survival prediction, the target is treated as continuous rather than as a classification. The results show that the predicted values match the actual values for low to moderate survival durations, which make up the majority of the data. The custom ensemble performed the best, with a Root Mean Square Error (RMSE) of 15.05. Within the custom ensemble, GBM was the most effective model, with Decision Trees perhaps being of little use since they provided too few discrete outputs. With an RMSE value of 15.32, the findings also show that GBM was the most reliable model among the five created separately, while the SVM failed to match predictions despite having an RMSE of 15.82. The outcomes of the models are comparable when a conventional Cox proportional hazards model is used as a reference approach. The authors believe that measuring patient survival time, with the explicit goal of informing patient care decisions, could be aided by applying these supervised learning strategies to the SEER data set's lung cancer data, and that the performance of these procedures on this specific dataset may be comparable to that of conventional methods.

3. PROBLEM STATEMENT

The prediction of cancer survivability has been a well-known research topic. The majority of disease survivability research focuses on trying to forecast patients' five-year survival rates. Such studies provide a constrained amount of information for making clinical decisions: if the patient's prognosis is "death," the patient's remaining survival time is unknown. To provide more precise information for clinical decision-making, survival time prediction should be investigated. In this project, the survival time is predicted on a month-to-month basis, and the suggested forecasting model has two stages [15].

Objective

Comorbidity concerns illnesses that already coexist. Examining actual disease cases reveals that some diseases have stronger correlations than others. The prediction of disease survival has been a well-known scientific area. Accurately predicting a patient's chance of survival might help professionals with therapeutic advice and pharmaceutical recommendations. The likelihood that a patient will survive a long period after the diagnosis of their illness is known as survivability; it is a clinical marker for evaluating the effects of treatment. The majority of disease survivability research focuses on strategies to predict patients' five-year survivorship, which provides a constrained amount of information for making clinical decisions. To provide more precise information for clinical decision-making, survival time prediction should be taken into account [16]. The focus of comorbidity is on diseases that previously coexisted, and some diseases have stronger associations than others.
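The two-stage formulation (classify five-year survival first, then regress survival months for predicted non-survivors) can be wired up roughly as follows; the synthetic data, feature count, and model settings are illustrative assumptions, not the actual SEER pipeline:

```python
# Sketch of a two-stage survival predictor on synthetic one-hot style data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 12)).astype(float)  # one-hot style features
months = rng.integers(1, 120, size=500)               # survival time in months
five_year = (months > 60).astype(int)                 # stage-1 label

X_tr, X_te, m_tr, m_te, y_tr, y_te = train_test_split(
    X, months, five_year, test_size=0.3, random_state=0)

# Stage 1: classify five-year survivability
clf = LinearSVC(dual=False).fit(X_tr, y_tr)
pred_cls = clf.predict(X_te)

# Stage 2: regressor trained on non-survivors only, applied to
# the test patients that stage 1 predicts as 'death'
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_tr[y_tr == 0], m_tr[y_tr == 0])
death_idx = pred_cls == 0
pred_months = reg.predict(X_te[death_idx]) if death_idx.any() else np.array([])
```

The guard on the last line only matters for toy data; on a real cohort the classifier will predict 'death' for some patients, and only those rows reach the regressor.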


This is shown by an examination of real disease cases. The prediction of disease survival has been a well-known scientific field. The ability to accurately anticipate a patient's likelihood of survival might aid specialists in making therapeutic suggestions and medication recommendations. The probability that a patient will live a significant amount of time following the diagnosis of their condition is referred to as survivability; it serves as a clinical indicator for assessing the outcomes of treatment. The majority of research on disease survivability concentrates on methods to forecast patients' five-year survival rates, and the information from such studies supports only limited clinical judgments. The forecasting of survival time should therefore be taken into consideration to offer more exact information for clinical decision-making [17].

The purpose of this article is to predict the survival time on a month-to-month basis. However, predicting survival time has been shown to be challenging, because significant generalization errors typically occur when one-stage regression models are used. A two-stage prediction approach is suggested to solve this problem. In the first stage, classification, a classifier is used to determine whether a patient will be able to survive for more than five years. In the second stage, regression, a regression model is used to predict the survival time of patients who have been identified as unable to survive that long.

Poor classification performance is a problem that arises during the classification step. The bias issue is illustrated with a survival time histogram in the section that follows, and the classification performance of SVM and Naive Bayes is determined. It is suggested that CHI2 feature selection be used in cascade with the support vector machine and naive bayes classifiers to improve classification performance. For two-stage classifiers, the CHI2 feature selection approach is used; this feature selection process is publicly available. The aforementioned enhancements cannot be utilized at the regression step, since the predicted outcome is continuous; however, without data pretreatment the error rate is large and training takes a long time. The suggested two-stage framework outperforms the one-stage strategy in both classification and regression tasks. In the classification stage, the original linear support vector machine (Linear-SVM) and logistic regression have higher prediction accuracy than the naïve bayes classifier. In the second stage, the RMSE of the enhanced random forests (RF) approach is lower than the RMSE of the first-generation RF method and other feature selection techniques [18].

The main goal of this study was to investigate the survival problem from a different angle. Instead of the survival rate of a cohort at a given time point after diagnosis, as in conventional survival analysis, we tried to answer how long a specific patient would survive after diagnosis. A sequence of standard experiments demonstrated that this could be achieved using ordinary machine learning techniques [19].

4. PROPOSED WORK

The training and testing datasets each had 10985 instances. When the characteristics of the numerous primary malignancies were pooled, several characteristics were the same. After removing duplicate features from the merged feature pool, features were chosen and translated using Label Encoding, consisting entirely of zeros and ones. In the classification step, CHI2 feature selection decreased the data dimensionality, while splitting the dataset decreased the number of training cases. The linear SVM classifier and the Naive Bayes classifier were employed as classifiers, and the classification stage uses the CHI2 feature selection approach. During the regression step, patients who lived for more than 60 months were excluded from the total dataset. The random forest regressor was employed because it is by nature well suited to the regression process. The element-wise feature-dropping RMSE scores are also compared using these techniques: the top 10 characteristics are kept, and their RMSE ratings drop as additional characteristics are taken out of the pool. In every iteration, the training set trains the classifier, and the accuracy score on the testing set is recorded for comparison.

4.1 Data Preprocessing
Two types of preprocessing are used to balance and clean the data:
1) Data balancing:
The class imbalance problem, which is common in supervised learning methods, is characterized by a large discrepancy in sample counts between classes. Unbalanced data sets are a problem because learning algorithms are typically biased towards large classes and perform badly on smaller classes. As a result, stratified sampling is employed in this work to balance samples prior to modeling. Understanding the distribution of the training data across the classes to be forecast, and making the necessary modifications, are essential.


These steps are key to creating a high-quality classification model. Imbalanced datasets are especially likely to occur when trying to predict something infrequent, such as rare fraudulent transactions or odd equipment breakdowns. The distribution of the target classes should always be taken into account, regardless of the domain [20-21].
2) Data cleaning:
Missing values must be handled properly because the SEER data set includes certain fields with blank values. These fields can make it more difficult to create models during the learning phase and can decrease prediction accuracy and processing speed. Features with more than 50% missing values are not included in this scenario, and for characteristics with fewer than 50% missing data, the missing entries are replaced with the median values. Due to the length of the entire list, only a portion of the SEER variables, and the variables that were excluded from the models, are included along with descriptions of those variables.

Data cleaning is the process of preparing data for analysis by removing or altering data that is inaccurate, lacking, irrelevant, duplicated, or formatted incorrectly. Such data is typically not needed or useful for analysis because it might slow down the procedure or lead to erroneous findings. There are several techniques for cleaning data, depending on how it is stored and the questions asked. Data cleaning involves finding ways to optimize a data set's correctness without necessarily losing information; it goes beyond just eliminating data to make room for new data. In addition to deleting data, data cleaning also involves addressing spelling and grammar problems, standardizing data sets, resolving errors such as empty fields and missing codes, and locating duplicate data points. Because it is essential to the analytical process and the identification of trustworthy solutions, data cleaning is regarded as a fundamental component of data science [22-23].

4.2 Approach: Two-Stage Prediction
Biased datasets and subpar classification performance are two problems that come up during the classification step. The bias issue is shown using a survival time histogram as an illustration, and the classification performance of the support vector machine and naïve bayes classifier is calculated. Improved CHI2 feature selection is suggested in cascade with the Support Vector Machine, Logistic Regression, and Naive Bayes classifiers in order to overcome poor classification performance. For two-stage classifiers, the CHI2 feature selection approach is applied; these feature selection techniques are publicly available.

Fig 4.1: Biased data before over-sampling

The aforementioned improvements cannot be utilized during the regression stage because the predicted outcome is continuous. However, without data preprocessing the error rate is high and training takes a long time. Regression is carried out using a random forest regressor [24-25].
1. Assume the survival prediction problem has two stages.
2. Build cancer comorbid datasets using the SEER database.
3. Use CHI2 feature selection during the classification phase.
4. Apply SVM to the classification process.
5. Employ the random forest regressor during the regression phase.
6. Compare and contrast the two-stage classification and regression model with the one-stage regression model.
The suggested two-stage framework outperforms the one-stage strategy in both classification and regression tasks. In the classification stage, the Naive Bayes classifier's prediction accuracy is inferior to that of the original Linear-SVM and Logistic Regression. In the second stage, the RMSE of the enhanced random forests (RF) approach is lower than the RMSE of the first-generation RF method and other feature selection techniques [26-27].

4.3 Methodology
The majority of cancer prediction studies are limited to determining whether a patient will live for a specific amount of time; the patient is then labeled as "survived" or "dead." Most cases of liver cancer would be labeled "dead" because of the high fatality rate.
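The CHI2 feature selection cascade described above can be sketched with scikit-learn's SelectKBest; the synthetic one-hot data and the choice of k are illustrative assumptions:

```python
# Keep the k features with the highest chi-squared score w.r.t. the label.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                 # survived-five-years label
informative = y.reshape(-1, 1)                   # perfectly class-aligned bit
noise = rng.integers(0, 2, size=(200, 4))        # unrelated one-hot noise
X = np.hstack([informative, noise]).astype(float)

selector = SelectKBest(chi2, k=2).fit(X, y)
X_reduced = selector.transform(X)
print(selector.get_support())                    # column 0 is kept
```

The chi-squared test requires non-negative feature values, which is why it pairs naturally with the one-hot (zeros and ones) encoding used in the classification stage.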


incidence. These patients' endurance duration is yet 4. 4. To assess the accuracy of the
unknown. Then, we provide a two-stage order predictions, consider the root mean
model that consists of a characterization model that squared error (RMSE), mean absolute
forecasts the patient's likelihood of survival and a error (MAE), and R2 score.
regression model that forecasts the remaining life expectancy of patients whose predicted result is "dead." With the exception of the underlying machine-learning models, the two phases use similar methodologies. In the classification step, linear SVM, Naive Bayes, and RF classifiers are employed to predict the survival status. Regressors are used to predict the survival months during the regression step. Two problems are encountered throughout the classification process. The main problem is that a biased classifier would result from a biased training set: cases from the minority class would be incorrectly categorized as belonging to the larger class. Data balancing is necessary to address this problem. The next problem is that the feature pool is quite large and the classification outcome is subpar. CHI2 feature selection is cascaded with a support vector machine classifier and a Naive Bayes classifier to select a subset of features from the pool. The cascaded framework outperformed the original classifier in classification performance.[28]
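The two-stage flow described above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `MajorityClassifier` and `MeanRegressor` are hypothetical stand-ins for the linear-SVM/Naive Bayes/logistic-regression classifiers and the RF regressor, and the random arrays stand in for the encoded SEER features.

```python
import numpy as np

# Stage 1 labels each case 0 (< 5-year survival) or 1; stage 2 then
# estimates remaining survival months only for the cases labelled 0.
# The two stand-in models below are hypothetical placeholders.

class MajorityClassifier:
    """Placeholder classifier: predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = int(np.bincount(y).argmax())
        return self
    def predict(self, X):
        return np.full(len(X), self.label_)

class MeanRegressor:
    """Placeholder regressor: predicts the mean training target."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mean_)

def two_stage_predict(clf, reg, X_new):
    """Classify 5-year survival; regress months for cases predicted 'dead'."""
    labels = clf.predict(X_new)
    months = np.full(len(X_new), np.nan)   # undefined for predicted survivors
    dead = labels == 0
    if dead.any():
        months[dead] = reg.predict(X_new[dead])
    return labels, months

# Toy stand-in for the encoded SEER feature matrix and targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y_label = (rng.random(100) > 0.7).astype(int)    # imbalanced 0/1 labels
y_months = rng.integers(1, 60, size=100)         # survival months

clf = MajorityClassifier().fit(X, y_label)
reg = MeanRegressor().fit(X[y_label == 0], y_months[y_label == 0])
labels, months = two_stage_predict(clf, reg, X[:10])
```

In the paper's setting, the classifier slot would hold the SMOTE-balanced, CHI2-reduced classifiers and the regressor slot the cascaded RF regressor; only the control flow is shown here.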
Figure 4.2: Methodology
The steps of the classification framework are as follows:
1. Consult the SEER database for statistics on MPCs such as liver, lung, stomach, thyroid, and skin malignancies.
2. Combine the data and shuffle its order.
3. Divide the data into training and testing sets.
4. To balance the dataset, employ SMOTE (Synthetic Minority Oversampling Technique).
5. Select the top features for modelling using CHI2 feature selection.
6. Use the linear-SVM, Naive Bayes, and Logistic Regression classifiers for prediction.
7. Evaluate the predicted outcomes using metrics such as accuracy and F1-score.
The steps in the regression framework are as follows:
1. Remove instances with a survival month greater than 60 from the classification data.
2. Separate the data into training and testing sets.
3. Apply the RF regressor for the forecast.
4. Assess the accuracy of the predictions using the root mean squared error (RMSE), mean absolute error (MAE), and R2 score.
Linear SVM and Naive Bayes were the classifiers employed in the classification. The SVM builds a dividing hyperplane between two classes before categorizing samples based on distance. The data has been one-hot encoded, containing only zeros and ones, and linear SVM and Naive Bayes were able to separate the one-hot encoded data as needed. The framework employed a chain of CHI2 feature selection and random undersampling. In the regression phase, the RF regressor was applied; as a typical bagging regressor, it was cascaded with feature selection based on contribution score.
4.4 Materials and Methods
Data about malignancies in the United States are deidentified, categorized, and annotated in the free and open-access SEER database. The database is big enough to provide machine learning algorithms with plenty of examples to learn from. The clinical or microscopic confirmation of a cancer diagnosis in the SEER cancer registries was performed by a licensed medical professional.
The majority of cancer prognosis studies merely estimate how long a patient will live. The patient is then classified as having "survived" or "passed away." The majority of liver cancer patients would be considered "dead" because of the disease's high fatality rate. The lifespans of these patients remain
unknown. We thus suggest a two-stage categorization methodology. It contains a classification model that forecasts the patient's likelihood of survival and a regression model that forecasts the life expectancy of patients whose forecasted result is "dead."
Both phases adhere to identical techniques, with the exception of the underlying machine learning types. The survival condition is predicted using linear-SVM, Naive Bayes, and logistic regression classifiers in the classification stage, and the survival months are predicted using the RF regressor and Decision Tree regressor in the regression stage. During the classification phase, two problems emerge. The first problem is that a biased training set would produce a biased classifier: minority-class cases would be incorrectly categorised as belonging to the dominant class. To address this issue, data balancing is required. The second problem is the size of the feature pool, which leads to a subpar classification outcome.
A support vector machine classifier and a Naive Bayes classifier are used in a cascade with CHI2 feature selection to select a subset of features from the pool. In terms of classification performance, the cascaded system outperforms the original classifier.[29]
4.5 Algorithms Applied
Naive Bayes and linear SVM were used as the classifiers in the classification. The SVM builds a separating hyperplane between two classes before categorizing samples according to their distance. One-hot encoding was used to prepare the data, which solely contains zeros and ones. The need for separating one-hot encoded data was met using the linear SVM and Naive Bayes classifiers, cascaded with CHI2 feature selection. RF served as the regressor in the regression step; as a typical bagging regressor, it was cascaded with contribution-score-based feature selection.
Linear SVM:
When a dataset can be divided into two classes by a single straight line, it is said to be linearly separable, and the Linear SVM classifier is used to separate the dataset into its two groups. Depending on the dataset, we employ different machine learning techniques to forecast and categorize data. A linear model called the SVM, or Support Vector Machine, can be utilized to address classification and regression issues. It has several practical uses and may be applied to both linear and nonlinear situations. The basic idea behind SVM is simple: to categorize the data, the algorithm creates a line or a hyperplane. SVMs initially identify a line (or hyperplane) that divides the data of two classes. The SVM algorithm takes data as input and produces, if possible, a line that divides those classes.[30]
Naïve Bayes:
A collection of classification methods founded on Bayes' Theorem are referred to as "Naive Bayes classifiers." It is a group of algorithms that are all based on the idea that every pair of characteristics used to classify something is independent of the others. Naive Bayes algorithms are frequently employed in applications such as sentiment analysis, spam filtering, and recommendation systems. Although they are quick and easy to implement, their primary drawback is the requirement for independent predictors. The predictors are often dependent in real-world scenarios, which hinders the effectiveness of the classifier. The Naive Bayes algorithm is a supervised learning method that addresses classification issues by applying the Bayes theorem. With a sizable training dataset, it is primarily used for text classification. The Naive Bayes classifier is a rapid and efficient classification technique that helps create machine learning models that can learn quickly and anticipate outcomes. Being a probabilistic classifier, it makes predictions based on the likelihood of an item. Popular uses of the Naive Bayes algorithm include spam filtration, sentiment analysis, and article categorization.
Random Forest:
A well-known machine learning method from the supervised learning approach is Random Forest. It may be used to solve machine learning challenges including classification and regression. It is based on the idea of ensemble learning, a method that combines several classifiers to solve a challenging problem and enhance the performance of the model. Random Forest is a classifier that uses a number of decision trees on different subsets of the provided dataset and, as the name suggests, takes the average to increase the forecast accuracy. Rather than depending on just one decision tree, the random forest uses the forecasts from each decision tree to anticipate the ultimate result based on the majority vote of predictions. The more trees there are in the forest, the higher the accuracy and the lower the risk of overfitting. One of Decision Trees' biggest drawbacks, variance, is addressed with the machine learning method Random Forests.
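The bootstrap-and-majority-vote idea described above can be illustrated with a deliberately minimal sketch. This is not the paper's implementation: real random forests (for example scikit-learn's `RandomForestClassifier`) grow full decision trees and also randomize feature subsets, whereas this toy ensemble uses depth-1 "stumps" and synthetic data.

```python
import numpy as np

# Each "tree" here is a depth-1 stump: one feature, one threshold,
# optionally flipped. Fitting each stump on a bootstrap resample and
# combining them by majority vote is the bagging idea RF builds on.

def fit_stump(X, y):
    """Pick the (feature, threshold, flip) with the best training accuracy."""
    best, best_acc = None, -1.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            for flip in (False, True):
                p = 1 - pred if flip else pred
                acc = (p == y).mean()
                if acc > best_acc:
                    best, best_acc = (j, t, flip), acc
    return best

def stump_predict(stump, X):
    j, t, flip = stump
    pred = (X[:, j] > t).astype(int)
    return 1 - pred if flip else pred

def bagged_stumps(X, y, n_trees=25, seed=0):
    """Fit each stump on a bootstrap resample of the training data."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def majority_vote(stumps, X):
    votes = np.stack([stump_predict(s, X) for s in stumps])
    return (votes.mean(axis=0) > 0.5).astype(int)    # majority vote

# Toy data: the class is 1 exactly when the first feature is positive.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
stumps = bagged_stumps(X, y)
acc = (majority_vote(stumps, X) == y).mean()   # near-perfect on this toy set
```

Because each stump sees a different bootstrap sample, their individual errors differ, and the vote averages them out; this variance reduction is exactly the drawback of single decision trees that the text says random forests address.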

Decision Trees are a greedy algorithm in spite of their versatility and simplicity: instead of considering how a split impacts the entire tree, the algorithm concentrates on optimizing the current node split. The greedy strategy makes Decision Trees fast but also vulnerable to overfitting; an overfit tree is highly optimized for forecasting the values in the training dataset, which produces a high-variance learning model.[31]
Logistic regression:
We employ the logistic regression statistical modeling method when the result is binary. When the outcome variable is binary, logistic regression modeling may be used to predict the outcome whether the independent variables are continuous or categorical. Logistic regression estimates the likelihood of a discrete outcome from an input variable. The majority of logistic regression models feature a binary result that can be true or false, yes or no, or another pair of values; situations with more than two discrete outcomes may be modeled using multinomial logistic regression. Logistic regression is a helpful analysis technique for classification issues, which may be used to determine whether a new sample belongs to a particular category; for the same reason it is a helpful analytical method for classification problems in cyber security, such as attack detection. Logistic regression is an easy and effective solution for binary and linear classification problems. It is a classification model for linearly separable classes that is straightforward to use and produces outstanding results, and it is frequently used in business. The logistic regression model is a statistical technique for binary classification that can be extended to multiclass classification, just like the Adaline and Perceptron; multiclass classification tasks can be handled by the highly optimized logistic regression implementation in Scikit-learn.
Decision tree:
The decision tree method belongs to the family of supervised machine learning algorithms. Both classification and regression issues may be solved with it. The objective of this approach is to build a model that predicts the value of a target variable. To do this, a decision tree is used, which represents the problem as a tree with attributes on the internal nodes and a class label on each leaf node. In contrast to some other supervised learning algorithms, the decision tree technique may be utilized to address both classification and regression issues. Building a training model that can learn straightforward decision rules from historical (training) data and anticipate a target variable's class or value is the purpose of employing a decision tree. To forecast a record's class label, we start at the tree's root and compare the root attribute with the record's attribute values; based on the comparison, we follow the branch corresponding to that value and go on to the next node. The correctness of a tree is strongly influenced by the choice of strategic splits, and regression and classification trees have different decision criteria. Decision trees use a variety of algorithms to decide whether to divide a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resulting sub-nodes; in other words, the purity of the node with respect to the target variable rises. The decision tree considers splits on all of the available variables and then selects the split that results in the most homogeneous sub-nodes.[32]
Oversampling and undersampling:
A considerable skew in the class distribution can be seen in imbalanced datasets, such as 1:100 or 1:1000 samples in the minority class relative to the majority class. Many machine learning algorithms may be affected by this bias in the training dataset, and some may totally ignore the minority class. This is a concern because minority-class forecasts are often the most crucial. Randomly resampling the training dataset is one way to address class imbalance. Undersampling, or removing examples from the majority class, and oversampling, or duplicating examples from the minority class, are the two main techniques for randomly resampling an unbalanced dataset: random oversampling duplicates randomly selected samples from the minority class, while random undersampling randomly removes instances from the majority class. The technique of randomly choosing instances from the minority class and adding copies of them to the training dataset is known as random oversampling; the act of randomly picking instances from the majority class and eliminating them from the training dataset is known as random undersampling. Both methods can be used repeatedly until the training dataset achieves the desired class distribution, such as an equal split across the classes.
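Assuming a binary label, the two naive resampling strategies just described can be sketched in plain NumPy (the imbalanced-learn library offers equivalent `RandomOverSampler` and `RandomUnderSampler` utilities):

```python
import numpy as np

# Naive random resampling for a binary-label dataset (illustrative only).

def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    shortfall = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=shortfall, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[counts.argmin()], classes[counts.argmax()]
    keep_major = rng.choice(np.flatnonzero(y == majority), size=counts.min(),
                            replace=False)
    idx = np.concatenate([np.flatnonzero(y == minority), keep_major])
    return X[idx], y[idx]

# A 1:9 imbalance, echoing the skewed survival-label distribution above.
X = np.arange(100).reshape(100, 1)
y = np.array([1] * 10 + [0] * 90)
Xo, yo = random_oversample(X, y)
Xu, yu = random_undersample(X, y)
print(np.bincount(yo), np.bincount(yu))   # [90 90] and [10 10]
```

As the text notes, only the training split should be resampled this way; the test or holdout data keeps its natural distribution.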

They are referred to as "naive resampling" techniques since they employ neither heuristics nor assumptions about the data. Because of this, they are simple to use and quick to carry out, which makes them well suited to really big and complicated datasets. Both methods may be applied to classification problems with two classes (binary) or with many classes involving one or more majority or minority classes. Importantly, the training dataset is the sole one to which the class-distribution modification is applied; the intention is to alter the model's fit, so it is not necessary to resample the test or holdout datasets used to assess a model's performance. These simplistic techniques may work in general, but this also depends on the particulars of the dataset and the models being used. The practice of random oversampling involves adding duplicates of minority-class samples to the training dataset. Machine learning algorithms that are impacted by skewed distributions, and models for which duplicate examples of a given class can affect the fit, may benefit from this strategy. This includes techniques that iteratively learn coefficients, such as artificial neural networks trained with stochastic gradient descent; support vector machines and decision trees are two further examples of models that might be affected.[32]
4.6 Chi-Square Feature Selection
The process of selecting the most pertinent features from a dataset before applying machine learning algorithms, in order to boost the performance of the model, is known as feature selection (also called attribute selection). A large number of irrelevant features makes overfitting more likely and increases training time dramatically.
Chi-Square Feature Extraction:
To extract categorical characteristics from a dataset, the Chi-square test is used. The Chi-square test is performed between each feature and the target, and the features with the highest Chi-square scores are chosen. It determines whether the relationship between two categorical variables in the sample accurately reflects their relationship in the population. A well-liked technique for choosing features from text data is Chi-square feature selection. The χ2 test in statistics is used to establish the independence of two events; in feature selection, it determines whether the occurrence of a specific term and the occurrence of a specific class are independent. Two distributions are compared using the Chi-square test to see how comparable their relative variances are. Its null hypothesis is the supposition that the provided distributions are independent. Thus, by identifying which features are most dependent on the output class label, this test may be used to identify the optimal features for a given dataset. Each feature in the dataset has its χ2 value determined, and the features are then sorted in decreasing order by χ2 value: the higher the χ2 value, the more dependent the output label is on the feature and the more crucial the feature is in determining the output.
The application of the Chi-square test in machine learning, and its effect, is frequently asked about. Because we will have several candidate features and must choose the best ones to build the model, feature selection is a crucial issue in machine learning, and the Chi-square test assists in feature selection by analyzing the relationship between the characteristics. The Chi-square test in statistics is used to examine whether two occurrences are independent of one another. From the data of two variables, we can obtain the observed count O and the expected count E; the difference between the observed count O and the expected count E is calculated using the Chi-square formula. When two features are independent, the observed count is close to the expected count, and the Chi-square value is therefore low. A large value of the Chi-square statistic suggests that the independence hypothesis is untrue. Simply put, the more dependent a feature is on the response, the higher the Chi-square value, and the more suitable the feature is for model training.[32]
Limitations:
Chi-square is sensitive to low frequencies in table cells. In general, chi-square can produce false results when the expected value in a table cell is less than 5.

5. RESULTS
SMOTE (Synthetic Minority Oversampling Technique) oversampling and CHI2 feature selection make up the enhancement of the classification stage. The classification performance metrics for SVC, Gaussian Naive Bayes, and Logistic Regression are listed in the tables below; F1 score, accuracy, and the confusion matrix are the performance metrics used for comparison. The R2 score, RMSE, and MAE are the performance indicators used in regression.
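For concreteness, the metrics named above can be written out directly in NumPy; scikit-learn's `metrics` module (`accuracy_score`, `f1_score`, `mean_squared_error`, `mean_absolute_error`, `r2_score`) computes the same quantities, and the toy vectors below are illustrative only.

```python
import numpy as np

# The evaluation metrics used in this study, written out explicitly.

def accuracy(y, p):
    return (y == p).mean()

def f1(y, p):                      # binary F1 for the positive class
    tp = ((p == 1) & (y == 1)).sum()
    fp = ((p == 1) & (y == 0)).sum()
    fn = ((p == 0) & (y == 1)).sum()
    return 2 * tp / (2 * tp + fp + fn)

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

def mae(y, p):
    return np.mean(np.abs(y - p))

def r2(y, p):                      # 1 - SS_res / SS_tot
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Toy classification labels and regression targets (illustrative only).
y_cls = np.array([1, 0, 1, 1, 0]); p_cls = np.array([1, 0, 0, 1, 0])
print(accuracy(y_cls, p_cls))        # 0.8
y_reg = np.array([10.0, 20.0, 30.0]); p_reg = np.array([12.0, 18.0, 33.0])
print(round(mae(y_reg, p_reg), 3))   # 2.333
```

Accuracy and F1 score grade the stage-one classifiers, while RMSE, MAE, and R2 grade the stage-two regressors on the predicted survival months.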

5.1 Classification Stage:
To transform text input into numerical data, we employed label encoding. The issue of class imbalance has been covered in the sections above: the "less than 5 years of survival" class dominates the other class, as can be seen in Fig. 1. One of the most popular approaches to address class imbalance is oversampling. Given the small size of the dataset, we used the SMOTE oversampling method in this case. The dataset's size increased from 10,985 to 15,439 cases after SMOTE was applied.

Figure 5.1 Classification Stage

Results after applying SMOTE: 0 - cases with less than 5 years of survival, 1 - cases with more than 5 years of survival.
Using CHI2-based feature selection, we chose the top six features out of a total of 16 characteristics. The six features chosen by CHI2 feature selection were:
1. Age recode.
2. Derived AJCC T, 6th edition (2004-2015).
3. Derived AJCC N, 6th edition (2004-2015).
4. Derived AJCC M, 6th edition (2004-2015).
5. Labelled Primary Site.
6. Derived AJCC Stage Group, 6th edition.
The data were divided one to three: 11,759 training cases and 3,860 test cases. The classifiers assigned a label of 0 to patients whose expected survival time is less than 60 months, and a label of 1 to patients whose expected survival time is more than five years. With an F1 score of 0.788, the SVC has the highest score among the three models and is more accurate than the other two.
The accuracy and F1 score of the three models, together with the results of their confusion matrices, are listed in the tables below.

Table 1: Accuracy and F1 score of the three classification models.

MODEL | ACCURACY | F1 SCORE
Gaussian Naïve Bayes | 74.63 | 0.769
Logistic Regression | 77.74 | 0.763
Support Vector Classifier | 78.54 | 0.788

Table 2: Results from the confusion matrices of the three classification models.

MODEL | PREDICTED 0 | PREDICTED 1 | ACTUAL
Gaussian Naïve Bayes | 1251 | 728 | 0
Gaussian Naïve Bayes | 251 | 1630 | 1
Logistic Regression | 1613 | 366 | 0
Logistic Regression | 493 | 1388 | 1
Support Vector Classifier | 1488 | 491 | 0
Support Vector Classifier | 337 | 1544 | 1

5.2 Regression Stage:
The classification stage's output was filtered to include only instances with a predicted label of 0 (less than five years of survival time). Decision tree and random forest regression models are used; R2, RMSE, and MAE are the comparison metrics for these two models. The random forest regressor has the highest R2 and the lowest RMSE and MAE of the two.

MODEL | R2 SCORE | RMSE | MAE
Random Forest Regressor | 0.42 | 32.03 | 21.60
Decision Tree Regressor | 0.41 | 32.29 | 21.69

6. CONCLUSION & FUTURE SCOPE:
The bulk of current survival analyses concentrate on the relationships between the characteristics and patients' chances of surviving five years. The specific question of how long a patient with comorbid cancer would live is still mostly unanswered. In this experiment, the patient-specific survival time of cancer patients with comorbid conditions was predicted. The personalized query is split into two machine learning problems. The

distinction between patients who will live longer than five years and those who will not is the first problem. The second is to develop a regression model that forecasts the remaining survival time of patients who are not expected to survive five years.
Cancers of the lung, liver, stomach, thyroid, and skin are among the most prevalent. Predicting the prognosis of cancer patients can be beneficial for doctors, patients, and families. The suggested two-stage approach predicts not only survival but also the number of months a patient will live: the first stage foretells whether or not a patient will survive for more than five years, and the second stage estimates the patient's remaining months of life if the prediction is death. Scaling of features is used in the classification stage during feature selection, and the Random Forest regressor is used during the regression phase. Applying feature selection during the regression stage could further increase accuracy, and investigating inter- and intra-disciplinary distributions can improve the feature selection process even further. We will keep looking at feature selection techniques that might boost the present prediction performance in the future. Second primary breast cancer is another MPC that may be investigated.

REFERENCES

[1] N. Howlader. (Apr. 2019). Seer Cancer Statistics Review, 1975–2016. SEER Data Submission, Posted to the SEER Web Site. Accessed: Nov. 2018. [Online]. Available: https://seer.cancer.gov/csr/1975_2016/
[2] R. E. Curtis, New Malignancies Among Cancer Survivors: SEER Cancer Registries, 1973–2000, no. 5. Washington, DC, USA: US Department of Health and Human Services, National Institutes of Health, 2006.
[3] C. Diederichs, K. Berger, and D. B. Bartels, "The measurement of multiple chronic diseases–A systematic review on existing multimorbidity indices," J. Gerontology Ser. A, Biol. Sci. Med. Sci., vol. 66A, no. 3, pp. 301–311, Mar. 2011.
[4] B. K. Edwards, A.-M. Noone, A. B. Mariotto, E. P. Simard, F. P. Boscoe, S. J. Henley, A. Jemal, H. Cho, R. N. Anderson, B. A. Kohler, C. R. Eheman, and E. M. Ward, "Annual report to the nation on the status of cancer, 1975–2010, featuring prevalence of comorbidity and impact on survival among persons with lung, colorectal, breast, or prostate cancer," Cancer, vol. 120, no. 9, pp. 1290–1314, May 2014.
[5] H. M. Zolbanin, D. Delen, and A. Hassan Zadeh, "Predicting overall survivability in comorbidity of cancers: A data mining approach," Decis. Support Syst., vol. 74, pp. 150–161, Jun. 2015.
[6] Y. Wang, D. Wang, X. Ye, Y. Wang, Y. Yin, and Y. Jin, "A tree ensemble-based two-stage model for advanced-stage colorectal cancer survival prediction," Inf. Sci., vol. 474, pp. 106–124, Feb. 2019.
[7] C. M. Lynch, B. Abdollahi, J. D. Fuqua, A. R. de Carlo, J. A. Bartholomai, R. N. Balgemann, V. H. van Berkel, and H. B. Frieboes, "Prediction of lung cancer patient survival via supervised machine learning classification techniques," Int. J. Med. Informat., vol. 108, pp. 1–8, Dec. 2017.
[8] NCI SEER Overview. (2015). Overview of the Seer Program. Surveillance Epidemiology and End Results. [Online]. Available: http://seer.cancer.gov/about/
[9] P. Liu, L. Li, C. Yu, and S. Fei, "Two staged prediction of gastric cancer patient's survival via machine learning techniques," in Proc. 7th Int. Conf. Artif. Intell. Appl., 2020, pp. 105–116, doi: 10.5121/csit.2020.100308.
[10] B. Garzín, K. E. Emblem, K. Mouridsen, B. Nedregaard, P. Due-Tønnessen, T. Nome, J. K. Hald, A. Bjørnerud, A. K. Håberg, and Y. Kvinnsland, "Multiparametric analysis of magnetic resonance images for glioma grading and patient survival time prediction," Acta Radiologica, vol. 52, no. 9, pp. 1052–1060, Nov. 2011.
[11] I. H. E. A. T. Magome and A. Haga, "TH-E-BRF-05: Comparison of survival-time prediction models after radiotherapy for high-grade glioma patients based on clinical and DVH features," Med. Phys., vol. 41, no. 33, p. 570, 2014.
[12] G. Roffo, S. Melzi, U. Castellani, and A. Vinciarelli, "Infinite latent feature selection: A probabilistic latent graph-based ranking approach," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1398–1406.
[13] Giorgio. (Jun. 24, 2020). Feature Selection Library. MATLAB Central File Exchange. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexchange/56937-feature-selectio%n-library
[14] L. J. E. A. Z. Li and Y. Yang, "Unsupervised feature selection using nonnegative spectral

analysis," in Proc. 26th AAAI Conf. Artif. Intell., Jul. 2012, pp. 1026–1032.
[15] Y. Yang, H. T. Shen, and Z. Ma, "ℓ2,1-norm regularized discriminative feature selection for unsupervised learning," in Proc. 2nd Int. Joint Conf. Artif. Intell., 2011, pp. 1–6.
[16] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, "Feature selection: A data perspective," ACM Comput. Surv., vol. 50, no. 6, p. 94, 2016.
[17] F. Song, Z. Guo, and D. Mei, "Feature selection using principal component analysis," in Proc. Int. Conf. Syst. Sci., Eng. Design Manuf. Informatization, Nov. 2010, pp. 27–30.
[18] Y.-Q. Liu, C. Wang, and L. Zhang, "Decision tree based predictive models for breast cancer survivability on imbalanced data," in Proc. 3rd Int. Conf. Bioinf. Biomed. Eng., Jun. 2009, pp. 1–4.
[19] J. Thongkam, G. Xu, Y. Zhang, and F. Huang, "Breast cancer survivability via adaboost algorithms," in Proc. 2nd Australas. Workshop Health Data Knowl. Manage., vol. 80, 2008, pp. 55–64.
[20] K. Park, A. Ali, D. Kim, Y. An, M. Kim, and H. Shin, "Robust predictive model for evaluating breast cancer survivability," Eng. Appl. Artif. Intell., vol. 26, no. 9, pp. 2194–2205, Oct. 2013.
[21] R. Kaviarasi, "Accuracy enhanced lung cancer prognosis for improving patient survivability using proposed Gaussian classifier system," J. Med. Syst., vol. 43, no. 7, p. 201, Jul. 2019.
[22] H. Liu, Z. Su, and S. Liu, "Improved chi text feature selection based on word frequency information," Comput. Eng. Appl., vol. 49, no. 22, pp. 110–114, 2013.
[23] S. M. Ryu, S.-H. Lee, E.-S. Kim, and W. Eoh, "Predicting survival of patients with spinal ependymoma using machine learning algorithms with the SEER database," World Neurosurg., vol. 124, pp. e331–e339, Apr. 2019.
[24] R. Kleinlein and D. Riaño, "Persistence of data-driven knowledge to predict breast cancer survival," Int. J. Med. Informat., vol. 129, pp. 303–311, Sep. 2019.
[25] M. Naghizadeh and N. Habibi, "A model to predict the survivability of cancer comorbidity through ensemble learning approach," Expert Syst., vol. 36, no. 3, Jun. 2019, Art. no. e12392.
[26] N. M. Donin, L. Kwan, A. T. Lenis, A. Drakaki, and K. Chamie, "Second primary lung cancer in United States cancer survivors, 1992–2008," Cancer Causes Control, vol. 30, no. 5, pp. 465–475, May 2019.
[27] R. J. M. Adamo and L. Dickie, "SEER program coding and staging manual," in U.S. Department of Health and Human Services National Institutes of Health National Cancer Institute. Bethesda, MD, USA: National Cancer Institute, 2018, Art. no. 20892.
[28] G. Roffo, S. Melzi, and M. Cristani, "Infinite feature selection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4202–4210.
[29] N. Japkowicz, "The class imbalance problem: Significance and strategies," in Proc. Int. Conf. Artif. Intell., 2000, pp. 111–117.
[30] L. Ali, C. Zhu, N. A. Golilarz, A. Javeed, M. Zhou, and Y. Liu, "Reliable Parkinson's disease detection by analyzing handwritten drawings: Construction of an unbiased cascaded learning system based on feature selection and adaptive boosting model," IEEE Access, vol. 7, pp. 116480–116489, 2019.
[31] G. Roffo and S. Melzi, "Features selection via eigenvector centrality," in Proc. New Frontiers Mining Complex Patterns (NFMCP), Oct. 2016, pp. 1–12.
[32] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
