International Journal of Computer Science Trends and Technology (IJCST) – Volume 11 Issue 3, May-Jun 2023
RESEARCH ARTICLE OPEN ACCESS
Breast cancer survival prediction using support vector machine
Dr M Narendra [1], A Nandini [2], T Kamal Raj [3], V Sai Sowmya [4], Ch Brahma Reddy [5]
[1] Professor, Department of Computer Science and Engineering
[2], [3], [4], [5] Student, Department of Computer Science and Engineering
[1], [2], [3], [4], [5] QIS College of Engineering and Technology, Ongole
ABSTRACT
Breast cancer (BC) is one of the most common cancers among women worldwide, accounting for a large share of
new cancer cases and cancer-related deaths according to global statistics, which makes it a significant public
health problem. Because manual diagnosis of this disease takes long hours and few diagnostic systems are
available, there is a need for an automatic diagnosis system for early detection of cancer. In recent years,
machine learning has been widely used in detection and has achieved favourable performance. In this paper,
we propose a system that predicts the stages of breast cancer. Rather than applying a machine learning
technique directly, we first perform K-fold cross validation on the dataset to find which technique is most
suitable for it.
I. INTRODUCTION
Cancer is the second leading cause of death worldwide and accounted for roughly 9.6 million deaths in 2018. Globally, about 1 in 6 deaths is caused by cancer. Almost 70 percent of cancer deaths occur in low- and middle-income countries [1]. The most common cancer types among women are breast, lung and colorectal, which together account for half of all cancer cases. Breast cancer alone is responsible for thirty percent of all new cancer diagnoses in women [2].
Machine learning (ML) methods analyze data and extract key characteristics, relationships and information from a dataset, and they build a computational model that best describes the data. According to research on cancer, ML techniques can be applied to the early detection and prognosis of cancer [3].
Asri et al. [4] compared several machine learning algorithms for the risk prediction and diagnosis of breast cancer. Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Naive Bayes (NB) and Decision Tree (C4.5) were applied to the Wisconsin Breast Cancer (Original) dataset. The SVM classifier gave the highest accuracy (97.13 %) with the lowest error rate when the experimental results were compared.
In another study, a breast cancer dataset was used with Weka software as the machine learning tool. The key performance parameters of the machine learning classifiers were compared in terms of accuracy, recall, precision and ROC area. The authors suggested that Bayesian Network (BN) has the best performance according to recall and precision, and that Random Forest (RF) has the best performance in terms of ROC area [5].
Ahmad et al. applied machine learning algorithms to predict the rate of two-year recurrence of breast cancer. The dataset was obtained from the Iranian Center of Breast Cancer (ICBC) program and was collected over the period 1997-2008. It consists of population characteristics and 22 input variables, with cases collected from 1189 women diagnosed with breast cancer. Artificial Neural Network (ANN), Support Vector Machine (SVM) and Decision Tree (DT) were applied, and SVM showed the best performance, with the highest accuracy and the lowest error rate.
Machine learning involves predicting and
ISSN: 2347-8578 www.ijcstjournal.org Page 9
classifying data, and to do so we employ various machine learning algorithms according to the dataset. SVM, or Support Vector Machine, is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The idea of SVM is simple: the algorithm creates a line or a hyperplane that separates the data into classes. In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification. As a simple example, for a classification task with only two features, you can think of a hyperplane as a line that linearly separates and classifies a set of data. Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.
II. EXISTING SYSTEM
Breast cancer is one of the most commonly diagnosed female disorders globally. Numerous studies have been conducted to predict survival markers, although the majority of these analyses were conducted using simple statistical techniques. Instead, the existing system employed machine learning approaches to develop models for identifying and visualizing relevant prognostic indicators of breast cancer survival rates. A comprehensive hospital-based breast cancer dataset was collected from the National Cancer Institute's SEER Program's November 2017 update, which offers population-based cancer statistics. The dataset included female patients diagnosed between 2006 and 2010 with infiltrating duct and lobular carcinoma breast cancer (SEER primary site recode NOS histology code 8522/3). The dataset included nine predictor factors and one outcome variable linked to the patients' survival status (alive or dead).
III. PROPOSED SYSTEM
We propose a system that predicts the stages of breast cancer. Rather than applying a machine learning technique directly, we first perform K-fold cross validation on the dataset to find which technique is most suitable for it. The candidate techniques are Classification and Regression Trees (CART), Linear Support Vector Machines (SVM), Gaussian Naïve Bayes (NB) and k-Nearest Neighbors (KNN). The techniques are validated, and SVM is found to perform best on the breast cancer dataset. Our proposed model is then constructed using SVM. Using machine learning methods for diagnosis can significantly increase processing speed and, on a large scale, can make diagnosis significantly cheaper. Our proposed system has an accuracy of 99%.
IV. IMPLEMENTATION
Feature selection is an important part of machine learning that reduces data dimensionality, and extensive research has been carried out to find a reliable feature selection method. For feature selection, a filter method and a wrapper method have been used. In the filter method, features are selected on the basis of their scores in various statistical tests that measure the relevance of features by their correlation with the dependent (outcome) variable. The wrapper method finds a subset of features by measuring the usefulness of that subset with respect to the dependent variable. Hence filter methods are independent of any machine learning algorithm, whereas in the wrapper method the best feature subset depends on the machine learning algorithm used to train the model. In the wrapper method, a subset evaluator uses all possible subsets and then uses a classification algorithm to induce classifiers from the features in each subset. It keeps the subset of features with which the classification algorithm performs best. To find the subset, the evaluator uses different search techniques such as depth-first search, random search, breadth-first search or hybrid search. The filter method uses an attribute evaluator along with a ranker to rank all the features in the dataset. Here the lowest-ranked features are omitted one at a time, and the predictive accuracy of the classification algorithm is observed after each omission. The weights or ranks assigned by the ranker algorithms differ from those assigned by the classification algorithm. The wrapper method is useful for machine learning tasks, whereas the filter method is suitable for data mining tasks, because data mining can involve thousands to millions of features.
V. CONCLUSION
Breast cancer is the most frequent cancer type among women. Therefore, any advance in the diagnosis and prediction of cancer is of capital importance for a healthy life. In this paper, we have discussed two popular machine learning techniques for Wisconsin Breast Cancer classification. Artificial Neural Network and Support Vector Machine are used as ML techniques for the classification of the WBC (Original) dataset in the WEKA tool. The effectiveness of the applied ML techniques is compared in terms of key performance metrics such as accuracy, precision, recall and ROC area. Based on these metrics, SVM (Sequential Minimal Optimization algorithm) showed the best performance, with an accuracy of 96.9957 % for diagnosis and prediction on the WBC dataset.
REFERENCES
[1] Cancer, https://www.who.int/en/news-room/fact-sheets/detail/cancer. Last Access: 25.01.2019.
[2] Siegel, R. L., Miller, K. D., & Jemal, A. (2018). Cancer statistics, 2018. CA: A Cancer Journal for Clinicians, 68 (1), pp. 7-30.
[3] Maity, N. G., & Das, S. (2017). Machine learning for improved diagnosis and prognosis in healthcare. In 2017 IEEE Aerospace Conference, pp. 1-9.
[4] Asri, H., Mousannif, H., Al Moatassime, H., & Noel, T. (2016). Using machine learning algorithms for breast cancer risk prediction and diagnosis. Procedia Computer Science, 83, pp. 1064-1069.
[5] Bazazeh, D., & Shubair, R. (2016). Comparative study of machine learning algorithms for breast cancer detection and diagnosis. In 2016 5th International Conference on Electronic Devices, Systems and Applications, pp. 1-4.
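The model-selection step described in Section III, in which several classifiers are scored with K-fold cross validation before one is chosen, can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation. The synthetic two-class data, every function name, and the choice of 5 folds are assumptions made for illustration. To keep the sketch short, only two of the four candidates named in the paper (KNN and Gaussian NB) are implemented; CART and the linear SVM are omitted.

```python
import math
import random
from collections import Counter
from statistics import mean, pvariance


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train_X, train_y, x, k=3):
    """k-Nearest Neighbors: majority label among the k closest training points."""
    pairs = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))
    return Counter(label for _, label in pairs[:k]).most_common(1)[0][0]


def gaussian_nb_predict(train_X, train_y, x):
    """Gaussian Naive Bayes: pick the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label in sorted(set(train_y)):
        rows = [row for row, y in zip(train_X, train_y) if y == label]
        score = math.log(len(rows) / len(train_X))  # log prior of the class
        for j, value in enumerate(x):
            column = [row[j] for row in rows]
            mu = mean(column)
            var = pvariance(column, mu) + 1e-9  # small floor avoids division by zero
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_val_accuracy(predict, X, y, k=5, seed=0):
    """Mean accuracy of `predict` over k folds; each fold is held out once."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        correct = sum(predict(train_X, train_y, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)


def make_toy_data(n_per_class=60, seed=1):
    """Hypothetical stand-in data: two Gaussian clusters with two features each."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, (0.0, 0.0)), (1, (4.0, 4.0))):
        for _ in range(n_per_class):
            X.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
            y.append(label)
    return X, y


X, y = make_toy_data()
candidates = {"KNN": knn_predict, "GaussianNB": gaussian_nb_predict}
results = {name: cross_val_accuracy(fn, X, y, k=5) for name, fn in candidates.items()}
best_model = max(results, key=results.get)  # candidate with the highest mean accuracy
print(results, "-> selected:", best_model)
```

In this sketch the candidate with the highest mean fold accuracy is selected, which mirrors how the paper reports choosing SVM after validating all four techniques. On a real dataset the toy data would be replaced by the actual feature vectors and labels.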