
2019 1st International Conference on Advances in Information Technology

Integrated Machine Learning Model for Prediction of Lung Cancer Stages from Textual Data using Ensemble Method

Emmidi Naga Hemanth Kumar, Dept. of CSE, DSCE, Bengaluru, India, [email protected]
Dendi Gayathri Reddy, Dept. of CSE, DSCE, Bengaluru, India, [email protected]
Desireddy Lohith Sai Charan Reddy, Dept. of CSE, DSCE, Bengaluru, India, [email protected]
Monika P, Dept. of CSE, DSCE, Bengaluru, India, [email protected]
Abstract— Research and development on cancer detection focuses more on imaging than on textual data. With the help of documented symptoms in the form of text and Machine Learning (ML) techniques, it is possible to predict lung cancer stages effectively. This paper presents a model for predicting the stages of lung carcinoma by applying ML algorithms. The proposed model combines K-Nearest Neighbours, Decision Tree and Neural Network models with a bagging ensemble method to enhance the accuracy of the overall prediction. The results of the proposed model show better accuracy compared to the individual algorithms.

Keywords: k-nearest neighbours, decision trees, neural networks, bagging, ensemble, lung cancer, textual data, machine learning algorithms.

I. INTRODUCTION

In 2012, according to the American Cancer Society, one in four deaths was due to cancer, with an overall survival ratio of 10-15%. The World Health Organisation (WHO) states that cancer is a leading cause of death in France, responsible for 150,000 deaths every year [1]. Lung cancer is among the most frequent cancers and causes high mortality rates, largely due to pollution and smoking habits. Although prostate and breast cancer occur in males and females respectively, the mortality rate caused by lung cancer is higher. There are many prevention and treatment methods such as chemotherapy, radiotherapy and surgery to remove the tumour, yet most patients across the world are diagnosed at an advanced stage. It has become difficult for doctors to diagnose the disease early, as the symptoms are not very noticeable. Using the initially documented signs as important factors together with machine learning techniques, it is possible to predict the various stages to a certain extent.

Machine learning, a field of artificial intelligence, works on programmed algorithms that use previously learnt experience to draw new conclusions with better accuracy. There are many learning techniques, such as classification, regression and association, that can be applied as the application requires to bring prediction accuracy as close as possible to human predictions. The choice of algorithm is based on the type of data being operated on, and many built-in libraries are available in ML tools and scripting languages for realizing such proposals.

The prediction accuracy of any model depends on its learning capability, and the learning ability of any algorithm depends on the correctness of the training set. Combining ML algorithms helps overcome these restrictions and produces better decision output. Ensemble learning methods are generally used to increase the performance of base classifiers by creating an ensemble of multiple classifiers and combining their results. The objective of the proposed work is to predict the stages of lung cancer from textual data using ML techniques such as k-nearest neighbours, decision trees and neural networks with ensemble learners.

The rest of the paper is organized as follows: Section II describes the related work carried out by various researchers on lung cancer prediction. Section III presents the proposed model architecture, data pre-processing and implementation details. Section IV discusses the performance analysis and comparison. Section V concludes the paper, followed by the references.

II. RELATED WORK
R. Kaviarasi et al. [2] state that cancer needs to be prevented and stopped as early as possible. The Decision Tree algorithm helps assign risk scores to the attributes of the data set, and the K-means clustering algorithm can then be used to differentiate cancerous and non-cancerous data based on those risk scores.

Muhammad Imran Faisal et al. [3] evaluated the classifiers Multilayer Perceptron [MLP], Naïve Bayes [NB], Support Vector Machine [SVM], Gradient Boosted Tree [GBT], Neural Network [NN] and the ensemble classifier Random Forest [RF]. Based on the performance metrics, GBT outperformed the others with an accuracy of 0.90, compared to 0.78 for MLP, 0.85 for NB, 0.79 for SVM, 0.71 for NN and 0.79 for RF.

For diagnosing lung cancer, [4] reports that a tumour can be identified at an early phase by performing tests on DNA and proteins. Gene Expression Programming [GEP] is used to construct three explicit predictive models considering different lung cancer characteristics. Experimental results conclude that the model built from carcino-embryonic antigen [CEA], neurone specific enolase [NSE] and Cyfra21-1 gives the best accuracy of 88.9% when compared to the other models.

Jennifer Cabrera et al. [5] proposed a system that utilizes gene expression data from oligonucleotide microarrays to predict the type of cancer and the genes attributable to that specific kind of cancer. Combinations of pre-processing methods such as Decimal Scale, Quantile, Z-Score and Min-Max are used for processing the data, and Support Vector Machines [SVM] are used to classify the stages of cancer. Experimental results conclude that the combination of Quantile and Z-Score with SVM gives the highest accuracy of 0.93.

Ching-Hsien Hsu et al. [6] used backpropagation neural networks and decision trees on 50 tumorous and non-tumorous data samples with 30 attributes. The results showed that the decision tree is efficient, with an accuracy of 0.95.

Zhuqing Cai et al. [7] used an improved synthetic over-sampling technique (Borderline-SMOTE) to expand the data samples in the unbalanced category of a cancer data set and thereby improve prediction accuracy. Support Vector Machine [SVM] and Cox proportional hazards regression [COX] algorithms are used for training the model and calculating the accuracy. Studies conducted by labelling the dataset for 5-year survival time show an accuracy of 0.85 for SVM and 0.72 for COX; for 2-year survival time, the accuracy is 0.89 for SVM and 0.81 for COX. SVM therefore gives the better performance.
III. PROPOSED MODEL

The model proposed in this paper uses a combination of three algorithms, namely k-nearest neighbours, decision trees and neural networks, integrated with the concept of bagging for predicting lung cancer stages from textual data.

A. K-Nearest Neighbour
The k-nearest neighbours algorithm (k-NN) is a non-parametric technique used in regression and classification problems. The k closest training examples in the feature space act as the input, and the predicted class of an object depends on the classes of the neighbours around it. K-NN is a lazy learning algorithm: it does not build a model from the training data but simply memorizes it, which makes it sensitive to the local structure of the data. The Euclidean distance, as in (1), is used as the distance function.

\text{Euclidean distance} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}   (1)

The implementation steps of the K-NN algorithm are shown in Fig. 1.

Fig. 1. K-NN algorithm
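As a minimal sketch of the K-NN classifier described above (not the authors' implementation), the following Python snippet classifies a test sample by the majority class among its k nearest neighbours under the Euclidean distance of (1); the variable names, the random data and k = 5 are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Predict the class of x_test by a majority vote of its k nearest
    training samples, using the Euclidean distance of (1)."""
    # Euclidean distance from the test sample to every training sample
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative usage with random data
X_train = np.random.rand(100, 23)             # 23 attributes, as in the dataset
y_train = np.random.randint(0, 3, size=100)   # e.g. three stage labels
print(knn_predict(X_train, y_train, np.random.rand(23)))
```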

B. Neural Networks

The backpropagation technique is predominantly used in neural networks for making predictions. For each new training instance, the error is backpropagated to correct the edge weights. A neural network consists of an input layer, one or more hidden layers and an output layer. Each layer is made up of units, each unit is connected to the units in the adjacent layer through edges, and each edge is associated with a weight. Every layer has its own activation function. The inputs are fed into the input layer, the output of the input layer is fed into the first hidden layer, and so on until the output layer is reached. Training and adjustment of the weights is done by backpropagation. Fig. 2 shows the implementation steps of the backpropagation algorithm.

Fig. 2 Backpropagation
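The sketch below illustrates one forward pass and one backpropagation update for a single-hidden-layer network with sigmoid activations, in the spirit of the steps summarised in Fig. 2; the layer sizes, learning rate and random data are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 23, 16, 4, 0.1   # assumed sizes and learning rate
W1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)

x = rng.random(n_in)          # one input sample (23 attributes)
t = np.eye(n_out)[1]          # one-hot target, e.g. the second stage

# Forward pass: input layer -> hidden layer -> output layer
h = sigmoid(x @ W1 + b1)
y = sigmoid(h @ W2 + b2)

# Backward pass: propagate the output error back through the edges
delta_out = (y - t) * y * (1 - y)              # error at the output units
delta_hid = (delta_out @ W2.T) * h * (1 - h)   # error at the hidden units

# Adjust every edge weight against its error gradient
W2 -= lr * np.outer(h, delta_out); b2 -= lr * delta_out
W1 -= lr * np.outer(x, delta_hid); b1 -= lr * delta_hid
```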
C. Decision Tree

Classification and Regression Trees (CART) is a well-known binary decision tree algorithm used extensively for classification and regression. Every internal node represents a single variable and a split on that variable, while the leaf nodes hold the decision output. The algorithm evaluates all input variables and possible split points and chooses among them greedily based on a cost function; the Gini index, as in (2), is used as the cost function for computing the split value.

\text{Gini}(D) = 1 - \sum_{i=1}^{m} p(i)^2   (2)

where p(i) is the probability that a tuple in D belongs to class Ci, m is the total number of classes, D is the data partition and Ci is an individual class.
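As a hedged illustration of how a CART split can be scored with the Gini index of (2), the short Python sketch below computes Gini(D) for a label array and the weighted Gini cost of one candidate split; the function names and sample values are my own, not the authors' code.

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_i p(i)^2 over the classes present in D, as in (2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_cost(feature, labels, threshold):
    """Weighted Gini cost of splitting on `feature <= threshold`;
    CART greedily picks the (variable, split point) minimising this."""
    left, right = labels[feature <= threshold], labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Illustrative usage: score one candidate split on a single attribute
labels = np.array([0, 0, 1, 1, 2, 2])
smoking_level = np.array([1, 2, 5, 6, 7, 8])
print(gini(labels), split_cost(smoking_level, labels, threshold=4))
```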
D. Bootstrap Aggregating

Bagging is an ensemble method used to reduce the variance of an estimate by averaging together multiple estimates. It uses bootstrap sampling to obtain the data subsets for training the base learners. To aggregate the outputs of the base learners, bagging uses voting for classification and averaging for regression. Fig. 3 gives the summarised implementation procedure of the bagging algorithm.

Fig. 3. Bagging algorithm
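A minimal sketch of bootstrap aggregating as described above, using scikit-learn's BaggingClassifier around a decision tree base learner; the choice of base estimator, the number of estimators and the random data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(1000, 23)               # 1000 samples, 23 attributes
y = np.random.randint(0, 4, size=1000)     # illustrative stage labels

# Each base tree is trained on a bootstrap sample; the class predictions
# are then aggregated by voting, as described above for classification.
bagged_tree = BaggingClassifier(
    estimator=DecisionTreeClassifier(),    # `estimator=` in scikit-learn >= 1.2
    n_estimators=10,                       # (older versions use `base_estimator=`)
    bootstrap=True,
    random_state=0,
)
bagged_tree.fit(X, y)
print(bagged_tree.predict(X[:5]))
```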
E. Architecture

The work presented here is a combination of the three algorithms above along with the bagging ensemble method. The objective of the proposed work is to predict the stages of lung cancer from textual data. The architecture of the model is shown in Fig. 4.

Fig. 4. Proposed architecture

The process of data gathering, pre-processing, model implementation, testing and verification is as follows:

a) Data Gathering

The dataset is collected from the Data World source. It consists of 1000 data samples with 23 attributes and the associated prediction. The main attributes contributing to lung cancer prediction include smoking, gender, air pollution, chronic lung disease, chest pain, wheezing, dry cough, snoring, swallowing difficulty and clubbing.

b) Data Pre-processing

The data is interpolated for missing values and then normalized using the linear transformation algorithm [8] as in (3).

Y = \frac{X - \min(X)}{\max(X) - \min(X)}   (3)

The normalized data is split into training (80%) and testing (20%) datasets. The training dataset is further split into n parts in order to train n models using the concept of bagging.
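The following Python sketch mirrors the pre-processing steps described above: interpolating missing values, min-max normalizing each attribute as in (3), and making the 80/20 train/test split. The file name and the "Level" target column are illustrative assumptions about the Data World export, not details stated in the paper; the n-part bootstrap split is handled later by the bagging step.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative file/column names; the actual Data World export may differ.
df = pd.read_csv("lung_cancer.csv")
X, y = df.drop(columns=["Level"]), df["Level"]

# Fill missing values by interpolation, then min-max normalize as in (3)
X = X.interpolate()
X = (X - X.min()) / (X.max() - X.min())

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```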

c) Model Implementation

The n parts of the training dataset are fed as input to the Decision Tree, K-Nearest Neighbour and Neural Network algorithms to create n models of each algorithm. The group of n models of each algorithm forms a bagging model. The final bagging models generated from the three algorithms are together termed the integrated model and are used for testing and future predictions.
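As a hedged sketch of this step, the snippet below builds one bagging model per base algorithm (decision tree, K-NN and a neural network), each with n estimators, reusing X_train and y_train from the pre-processing sketch; the value of n and the estimator settings are illustrative assumptions, since the paper does not state them.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

n = 10  # number of bootstrap parts / models per algorithm (assumed)

base_learners = {
    "DTb":  DecisionTreeClassifier(),
    "KNNb": KNeighborsClassifier(n_neighbors=5),
    "NNb":  MLPClassifier(hidden_layer_sizes=(16,), max_iter=500),
}

# One bagging model per algorithm; together they form the integrated model.
integrated_model = {
    name: BaggingClassifier(estimator=est, n_estimators=n, random_state=0)
    for name, est in base_learners.items()     # `estimator=` needs scikit-learn >= 1.2
}

for name, model in integrated_model.items():
    model.fit(X_train, y_train)   # X_train, y_train from the pre-processing step
```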
d) Validation of the integrated model

The test dataset is fed as input to the integrated model. Each model in the integrated system predicts an output, and the majority poll of these predictions is taken as the final prediction (as in (4)) for each test sample; accuracy scores are then computed. The proposed system has been validated on five different test data sets, and for all of the sets the accuracy scores were observed to match to a close extent.

Final_prediction = Max{KNNb(t), NNb(t), DTb(t)}   (4)

where KNNb is the bagging model of KNN, NNb is the bagging model of Neural Networks, DTb is the bagging model of Decision Trees, and t is the test tuple.
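A minimal sketch of the final majority poll of (4) over the three bagging models, reusing the illustrative integrated_model, X_test and y_test names from the earlier sketches (not the authors' code):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def final_prediction(integrated_model, X):
    """Majority vote over the KNN, NN and DT bagging models, as in (4)."""
    votes = np.array([m.predict(X) for m in integrated_model.values()])

    def majority(col):
        # Most frequent predicted class for one test tuple
        vals, counts = np.unique(col, return_counts=True)
        return vals[np.argmax(counts)]

    return np.apply_along_axis(majority, 0, votes)

y_pred = final_prediction(integrated_model, X_test)
print("Integrated model accuracy:", accuracy_score(y_test, y_pred))
```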

IV. PERFORMANCE ANALYSIS

The accuracy score is considered as the performance metric, and the observed values are tabulated in Table 4.1.

TABLE 4.1 Accuracy Scores

The recorded accuracy scores for each of the algorithms with and without bagging show that the bagging technique enhances the performance of the individual models, with accuracy readings of 0.97 (Decision Tree), 0.94 (K-NN) and 0.96 (Neural Networks). The integrated model reaches an accuracy of 0.98, which is better than the individual algorithms' scores both with and without bagging (Fig. 5).

Fig. 5. Accuracy scores comparison

The accuracy of K-NN depends on the number of neighbours 'k'. The best choice of 'k' depends on the data; in general, larger values of k reduce the effect of noise on the classification but make the boundaries between classes less distinct. The special case where the predicted class is the class of the single closest training sample is known as the nearest-neighbour algorithm. The presence of noisy data degrades the accuracy of the K-NN algorithm. Fig. 6 shows the variation of accuracy when different 'k' values are chosen.

Fig. 6. Accuracy variations for different 'k' values
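A short sketch of how the accuracy-versus-k behaviour behind Fig. 6 could be reproduced, sweeping a few candidate k values on the held-out test set from the earlier sketches; the range of k values is an illustrative assumption.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Accuracy of plain K-NN for several candidate neighbourhood sizes
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test))
    print(f"k = {k:2d}  accuracy = {acc:.3f}")
```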
The accuracy of the neural networks depends on the number of iterations, where an iteration is a combination of a forward pass and a backward pass. Fig. 7 shows the variation of accuracy when different numbers of iterations are chosen.

Fig. 7. Accuracy variations for different iterations

V. CONCLUSION

Machine learning algorithms are being used extensively in many applications these days. The proposed model combines K-Nearest Neighbour, Decision Tree and Neural Network models with the bagging ensemble method to predict the stages of lung cancer from textual data, enhancing the accuracy of the overall prediction. Conclusions are drawn by comparing the models with and without bagging. It is observed that the bootstrap aggregating technique enhances the performance of the individual models, with accuracy scores of 0.97 (Decision Tree), 0.94 (K-NN) and 0.96 (Neural Networks). The accuracy score of the integrated model is 0.98, so the integrated model enhances the accuracy by 3.33%. The proposed model can be used in future for predicting other chronic diseases in healthcare and related domains. Further, the model can be tuned to work with clinical observations along with additional symptoms, if recorded during the diagnosis phase.

REFERENCES

[1] R. Zemouri, N. Omri, C. Devalland, L. Arnould, B. Morello, N. Zerhouni, F. Fnaiech, "Breast cancer diagnosis based on joint variable selection and Constructive Deep Neural Network", IEEE 4th Middle East Conference on Biomedical Engineering (MECBME), 2018.

[2] R. Kaviarasi, A. Valarmathi, "Recognition and Anticipation of Cancer and Non Cancer Prophecy using Data Mining Approach", IEEE International Conference on Emerging Trends in Engineering, Technology and Science (ICETETS), 2016.

[3] Muhammad Imran Faisal, Saba Bashir, Zain Sikandar Khan, Farhan Hassan Khan, "An Evaluation of Machine Learning Classifiers and Ensembles for Early Stage Prediction of Lung Cancer", IEEE 3rd International Conference on Emerging Trends in Engineering Sciences and Technology (ICEEST), 2018.

[4] Yu, Z., Chen, X. Z., Cui, L. H., Si, H. Z., Lu, H. J., & Liu, S. H., "Prediction of lung cancer based on serum biomarkers by gene expression programming methods", Asian Pacific Journal of Cancer Prevention (APJCP), 2014.

[5] Jennifer Cabrera, Abigaile Dionisio, Geoffrey Solano, "Lung cancer classification tool using microarray data and support vector machines", IEEE 6th International Conference on Information, Intelligence, Systems and Applications (IISA), 2015.

[6] Ching-Hsien Hsu, Gunasekaran Manogaran, Parthasarathy Panchatcharam, Vivekanandan S., "A New Approach For Prediction of Lung Carcinoma Using Back Propagation Neural Network with Decision Tree Classifiers", IEEE 8th International Symposium on Cloud and Service Computing (SC2), 2018.

[7] Zhuqing Cai, Zhuliang Yu, Haiyu Zhou, Zhenghui Gu, "The Early Stage Lung Cancer Prognosis Prediction Model based on Support Vector Machine", IEEE 23rd International Conference on Digital Signal Processing (DSP), 2018.

[8] Shuting Shen, Ziqiang Fan, Qi Guo, "Design and Application of Tumour prediction model based on statistical method", ISSN: 2469-9322.

