Early Stage Breast Cancer Predicting Base Paper
Measurement: Sensors
journal homepage: www.sciencedirect.com/journal/measurement-sensors
A R T I C L E I N F O

Keywords: Machine learning; Early prediction; Breast cancer; Sensors; Disease

A B S T R A C T

Breast cancer is one of the leading medical problems in the healthcare field among women. The cancer-related death rate is a major global health problem, particularly in developing countries. Early diagnosis of breast cancer is the only effective way to deal with this mortality factor. Although there are many methods for preventing cancer, there are still some types that have unknown cures. Breast cancer is extremely common and can be effectively treated if caught in its early stages. A correct diagnosis of breast cancer is a vital first step in treatment. Predicting the subtype of breast cancer is an active area of study. In this article, an attempt has been made to present a framework to predict breast cancer using machine learning techniques at an early stage to get the best treatment for the patient. The proposed method uses the machine learning classifiers Logistic Regression and Support Vector Machine to classify breast cancer patients into benign (noncancerous tumor) and malignant (cancerous tumor) categories. The proposed mechanism is implemented using Python and Jupyter Notebook on the real dataset, which is generated through a sensing device and collected from the UCI repository. The performance is analyzed using performance metrics such as accuracy, precision, F-measure, etc. The Logistic Regression (LR) model achieves an accuracy of 97.14%, whereas the SVM model obtains an accuracy of 96%. The performance of the proposed framework was evaluated and compared with the other algorithms, and the results indicated that the proposed framework achieved better performance than other models.
1. Introduction

In the 21st century, breast cancer is the most significant disease among women. It is the most diagnosed cancer among women all over the world. Early detection and treatment are the most effective way to deal with breast cancer. Artificial Intelligence based health monitoring and screening methods are a promising and emerging field in early breast cancer detection. AI-based methods learn from previous knowledge, work on the current conditions, and identify new patterns in the data. Death from cancer is a key problem in the healthcare system [1,2]. Breast cancer is more common in women with dense breast tissue because of the way the disease develops physiologically. According to data from Globocan 2018 [3,4], one in every four cancer diagnoses in women globally is due to breast cancer, which is the sixth leading cause of mortality worldwide. There were 23.7 newly diagnosed cases of age-related breast cancer per 100,000 persons and 6.8 fatalities per 100,000 people worldwide in 2018 [5]. Breast cancer is the second biggest cause of death for women [6], behind lung cancer. Breast cancer originates in the lobules and ducts of the breast, which provide nourishment to the inner linings of the milk ducts [7]. It is crucial to get a correct diagnosis of the tumor. Although most breast tumors result from benign (noncancerous) alterations, incorrectly labeling a malignant tumor as benign can have fatal effects [8,9]. Early detection and access to novel treatment options can substantially lower breast cancer mortality rates [10].

1.1. Background

Typically, women notice a lump in their breasts or under their arms. BSE, or monthly breast self-examination, is a good way to detect it early since it familiarizes the observer with the breast's texture, change in size, skin condition, and any other changes that may help make a diagnosis [11,12]. There are two broad categories of breast cancer symptoms: early warning signs and late-stage symptoms. Breast lumps or masses, nipple discharge, armpit edema involving lymph nodes, nipple pain, nipple scaliness, and inverted nipples are all potential signs of breast cancer. Weight loss and decreased appetite in the liver, neurological pain or weakness, and bone pain in the advanced stage are
* Corresponding author.
E-mail addresses: [email protected] (D. Sharma), [email protected] (R. Kumar), [email protected] (A. Jain).
https://doi.org/10.1016/j.measen.2023.100901
Received 25 April 2023; Received in revised form 22 June 2023; Accepted 24 September 2023
Available online 25 September 2023
2665-9174/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
D. Sharma et al. Measurement: Sensors 30 (2023) 100901
all considered [13]. The study also found that glandular tissue and stromal tissue are the two types of breast tissue. The ductal tissue that carries milk away from the glandular tissue is also made up of cells that contribute to milk production. Stromal tissue consists of adipose connective tissue with fibrous elements [14].

1.1.1. Stages

All cancers are evaluated based on their findings to determine their stages. Doctors identify cancer cases with codes of letters and numbers. Stage 0 is followed by Stages 1, 2, 3, and 4. The greater the stage number, the further along the cancer is in its progression. Fig. 1 shows the stages of cancer.

The stage of the cancer helps the doctor to decide the correct line of treatment and to proceed with the prognosis along the correct path. The most effective methods to identify the stage of cancer are mammography, MRI (Magnetic Resonance Imaging), ultrasound, etc. Machine learning-based methods have had a significant success rate in the detection of tumors compared to traditional methods. The proposed work uses the machine learning classifiers LR and SVM on a pre-trained model to diagnose the type of tumor in breast cancer on the Wisconsin Breast Cancer Dataset. An improved and efficient framework for cancer classification to help medical practitioners is proposed. The main contribution of this paper is described below. The performance of the proposed model is computed on performance metrics such as accuracy, AUC, and the ROC curve.

The paper is organized into four sections. Section 1 describes the role of AI and ML in predicting breast cancer at an early stage and in categorizing breast cancer patients as high risk or low risk, so that high-risk patients may get early and better treatment. It also presents the various stages of breast cancer and the background of breast cancer prediction. Section 2 reviews the literature with respect to parameters such as early prediction and detection of breast cancer, with performance measured using accuracy metrics. Section 3 presents the proposed framework, which depicts the role of SVM and LR in the early prediction of breast cancer. Section 4 presents the results and analysis of the proposed framework; finally, the conclusion and future research directions are discussed in section 4.

Contribution of paper:

• The authors present a machine learning-based intelligent framework for the early prediction of breast cancer.
• A literature review of various existing techniques for breast cancer prediction is presented.
• The authors present different stages of breast cancer, its types, and various symptoms.
• The visualization of breast cancer data using various plot functions and performance metrics is presented.
• A comparison of machine learning models using performance metrics such as accuracy is calculated and presented.

1.2. Motivation

Breast cancer is the most prevalent cancer among women worldwide. The rate of mortality from breast cancer is rising in developing countries day by day. Early diagnosis and prognosis of the disease is the most reliable way to cure the disease. In the last few years, the use of Artificial Intelligence based methods for disease prediction has grown at an astonishing pace. Demographic, laboratory, and mammographic risk indicators are all necessary for accurate breast cancer prediction. Using some distinct model considerations, this research attempts to use machine learning to predict breast cancer.

2. Literature review

This section discusses the various approaches taken by researchers in the study of breast cancer. The healthcare business is one of the most suitable fields for data science applications because of the abundance of data it holds and the compatibility of the data kinds. In hospital environments, information flows continuously and typically takes the form of numbers. The healthcare system is open to reform based on findings from data mining and machine learning studies. Breast cancer risk can be estimated with the help of age and risk variables, as given by Gail et al. (1989) [15]. Age at menarche, age at first birth, the presence of a first-degree relative with breast cancer, and other similar risk variables were taken into account. Similarly, breast ductal carcinoma in situ was the subject of research by Burstein HJ et al. (2004) [16]. All invasive breast cancers are thought to develop from this cancer. This breast cancer did not spread beyond the duct's lumen and did not involve the duct's epithelial lining. The goal of treatment was to stop the disease from spreading locally again before it became the more aggressive and deadly invasive form of breast cancer. Treatment protocols typically involve imaging, surgery, pathology, and oncology services. Surgeons give treatment, such as partial or complete mastectomy, if an early diagnosis is made with the help of a radiologist and pathologist. Around 98% of ductal carcinoma in situ cases can be managed with a mastectomy, making it a very successful treatment. Furthermore, Evans et al. (2007) [17] investigated the two primary drivers of risk variables for breast cancer and breast cancer death. Initially, it was thought that possessing a mutation in a gene like BRCA1 or BRCA2 would be to blame. The second factor was an increased risk of cancer during the study period or the person's lifetime. The author compared the models based on the two primary criteria listed above. Not all models satisfactorily addressed both requirements. Cell-related alleles provide an additional level of complexity to the models. However, Amir et al. [18] compiled all the data on the risks of breast cancer and the likelihood of
detecting a BRCA1 or BRCA2 mutation. They discussed the various risk models and how they classify women into high or low-risk groups for breast cancer. Moreover, research conducted by Kourou et al. (2014) [19] examined various ML models for the prediction of breast cancer. The author concludes that multi-dimensional heterogeneous data can be used for feature selection and classification and that this should be done in conjunction with the many illness prediction technologies currently available and under development. Similarly, Cancer Association Map Animation (CAMA) is a tool proposed by Iqbal et al. (2016) [20] to examine the interconnectedness of cancer and other disorders. Cancer can be cured, but only if we take the necessary safeguards at the right times. The author used a database containing information on 782 million people to calculate the risk of nine different cancers in people who had other medical conditions. CAMA provides a dynamic time-lapse visualization tool and an animated perspective on the correlation between cancer and other disorders in patients of varying disease stages and sex. The author has effectively constructed a novel early-stage cancer approach hypothesis and made predictions about new risk factors associated with the cancer hypothesis model. Similarly, the machine learning algorithms currently in use were evaluated by Asri and colleagues (2016) [21]. The constructed model is evaluated in comparison to other machine learning techniques. The precision of the data classification was tested in a simulated setting using the Weka data mining tool. The support vector machine was shown to produce the most accurate outputs with the fewest inaccuracies.

Furthermore, whole slide image categorization and tumor localization were both proposed by Wang et al. (2016) [22]. Both the area under the curve for the problem vs the practicing pathologist and the area under the curve for the result demonstrated that the framework made accurate predictions. When a pathologist's diagnosis is combined with that of a deep learning system, both AUC and mistake rates improve by as much as 85%. It argues that deep learning can help enhance results in pathological diagnosis.

Similarly, the risk for breast cancer can be evaluated using the BCRAM model, which was proposed by Li et al. [23]. The algorithm classifies patients into high-risk and low-risk categories based on the prevalence of certain risk variables. In this context, high-risk patients are individuals who share several characteristics with people who get breast cancer. Three more models were compared to BCRAM after it was built. Furthermore, Nickson et al. [24] used data from a survey taken online by over 40,000 women in their investigation. They looked at how the Gail model correlated with prospects for future invasive patients. The author used the chi-square test of statistical significance by ranks of the Gail score to compare the estimated value with the observed value. Machine learning is used to rank the Gail model's input variables, and the increased benefit in risk prediction is evaluated when more factors are added to the model. According to the author, the Gail model successfully classifies women into risk categories for developing invasive breast cancer in the future. Patients profit because they are spared the hassle of screening, and they also receive a more advantageous overall deal. Moreover, the aberrant changes in bodily function caused by disease and the side effects of surgery and medical therapy were discussed by Fu et al. (2018) [25]. Lymphedema can be detected sooner with the help of an ML-based decision support system. Lymphedema is a permanent threat for breast cancer sufferers. It can happen right after cancer surgery, or it can happen 20 years later. Lymphedema can be detected more easily with the aid of machine learning's real-time detection. To detect lymphedema, a machine-learning system was evaluated using real-time data for its accuracy, sensitivity, and specificity. The capabilities of every available machine learning algorithm are evaluated and contrasted. The greatest results were obtained using ANN, which was able to diagnose lymphedema with a sensitivity of 93.75, a specificity of 96.65, and an accuracy of 91.03%. Similarly, Patricio et al. [26] conducted an exploratory investigation with 166 participants, taking into account age, body mass index, and some clinical variables such as glucose, insulin, HOMA, leptin, resistin, etc. Breast cancer risk prediction was accomplished with the help of machine learning methods like logistic regression, random forest, and support vector machine (SVM). The 95% confidence intervals for sensitivity, specificity, and AUC are all estimated with the Monte Carlo cross-validation method. Breast cancer risk assessment using SVM showed a sensitivity of 82–88% and a specificity of 85–90%. An economically viable biomarker for breast cancer has been made available. Similarly, Bonus et al. (2018) [27] proposed a procedure for a low-cost and effective breast cancer prediction tool. The name "Prospero" has been given to the design. The research data comes from an extensive online search. Furthermore, K Shailja et al. (2018) [28] put forth a framework for the development of large data analysis software. K-nearest-neighbor (KNN) is also utilized for classification, and the R programming language is used for its implementation. Among all types of cancer, breast cancer affects more women than any other. Hence, early disease prediction is preferable to curing the disease to save lives. The author drew on the UCI machine learning repository. Tumors are classified as benign or malignant using the KNN method. The end product demonstrates an improvement in reliability metrics like F-measures and accuracies. When applied to survival data, the deep learning-based prediction method provided by Kim et al. (2019) [29] improves accuracy. The author explains how patients can benefit from risk stratification and treatment options to protect the patient's life and avoid futile, ineffective care.

Similarly, the highest-risk subset of women with breast cancer has been identified by a risk prediction model provided by Lee et al. (2019) [30]. The Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA) is a hypothetical program developed by the author. Lifestyle, hormonal, and reproductive factors, as well as mammographic density, are all influenced by one's genetic and familial makeup. The author's use of the aforementioned algorithm and model facilitated the stratification of women at high risk for breast cancer, allowing for more effective screening and treatment choices. Mammographic density enhances the accuracy of breast cancer prediction over existing clinical breast cancer models, according to research by Yala et al. (2019) [31]. The model predicts breast cancer using mammography data. Using a dataset of roughly 40,000 women, the author conducts the actual testing, validation, and training. Patients' responses to questionnaires and electronic health records (EHR) were used to determine their risk factors. The results of comparing all of the models to the Tyrer-Cuzick (TC) model showed that the hybrid deep learning (DL) model was the most effective at risk discrimination. When comparing the DL hybrid to TC, the former yields 31% better results.

Furthermore, existing breast cancer that has a likelihood of relapsing after 5 years is the focus of a model proposed by Nicolo et al. (2019) [32]. The author uses machine learning to predict metastatic relapse within 5 years based on tumor size at diagnosis. When compared to the existing Cox regression model and Random Forest method, the model's accuracy was superior.

Similarly, age-related differences in cell and metabolic development are a key factor to consider when training a disease detection model, as highlighted by Feng et al. (2019) [33]. The model's features and characteristics are crucial to its creation. The study presents a methodology for analyzing features for selection and classification. The results of this study demonstrate the importance of including age in all stages of disease diagnosis.

Similarly, Aruna et al. [34] used naive Bayes, support vector machines, and decision trees to classify a Wisconsin breast cancer dataset. The highest accuracy was achieved by support vector machines (SVM), at 96.99%. Delen et al. [35] studied breast cancer prediction based on an examination of 202,932 patient records, which were separated in the dataset into those who survived (93,273) and those who did not (109,659). Similarly, diabetes classification was evaluated using naive Bayes, decision trees, and random trees by Ou et al. [36]. Results showed that naive Bayes performed best among the classifiers tested (with an overall success rate of 76.3%). Moreover, Srinivas et al. [37] studied a one-dependency augmented naive Bayes classifier for the prediction of heart attacks using medical profiles such as age, sex, blood pressure, and blood
Fig. 2. Proposed framework for breast cancer prediction for the smart healthcare system.
sugar. According to the results of the investigation, naive Bayes performed better. Similarly, Bernal et al. [38] used clinical data from medical ICUs in their study. Machine learning methods like logistic regression, neural networks, decision trees, and k-nearest neighbors were utilized to predict the decrease in hospital admissions.

Similarly, Pratiwi et al. [39] report that among women, breast cancer is the main cause of mortality. It was suggested that machine learning methods be used to aid in the detection of breast cancer. Similarly, using the feature ensemble technique, D. Sharma et al. [40] introduced an NN-ET breast cancer prediction model based on neural networks and an extra-trees classifier. When used for patient prediction and categorization, the model achieves an impressive 99.74% accuracy.

Inferences drawn from the literature are discussed below:

• Machine learning approaches have recently been applied to medical datasets for the classification of patients due to their significant results and performance.
• Despite the encouraging results achieved by machine learning methods in detecting breast cancer, a large number of barriers to improved results remain.
• The major challenge in predicting breast cancer accurately is the dependence on the size and type of the dataset used.
• Pre-processing of the dataset needs to be handled carefully, as the accuracies of algorithms vary with this step.
• In the proposed model, an improved method is proposed for the classification of breast cancer, which may help medical professionals in the diagnosis of breast cancer at an early stage.

3. Proposed work

In this section, a proposed framework is presented for breast cancer prediction in the smart healthcare system. In a smart healthcare system, sensing devices are used to sense and collect data, and machine learning models are used to process the data to provide better treatment to the patient at an early stage. In the proposed framework, we apply two machine learning models, SVM and LR, to a real dataset to test the probability of predicting breast cancer at an early stage. Fig. 2 depicts the proposed framework. The framework is based on a 5-step process, whose steps are discussed below:

A. Phase 1: Import Data: In the first phase, the datasets are imported from the UCI repository and the data acquisition is performed. The proposed model has been applied to the WBCD, and multiple data augmentation procedures are applied to improve the quality of data.
B. Phase 2: Data Preprocessing: In the data preprocessing phase, the noise in the data and the imbalance of data ranges are removed. The normalization of data is performed to predict the classes accurately. Multiple data cleaning procedures along with feature selection and extraction methods are implemented to reduce overfitting of the proposed model.
C. Phase 3: Model Building and learning phase: The datasets were enhanced and divided into training and testing sets. Multiple data augmentation procedures are applied to improve the quality of the training set.
D. Phase 4: Result Generation: The model generation is performed in this phase from the datasets, using the machine learning classifiers SVM and LR to classify the patients.
E. Phase 5: Result and performance evaluation phase: Breast cancer classification is achieved by categorizing the patients into the Benign tumor class and the Malignant tumor class. Accuracy, ROC, and AUC are the metrics used to evaluate the performance of the proposed model.

Finally, an analysis of the results is conducted to validate the performance of the model. The architecture of the proposed model is shown in Fig. 2.

After gathering relevant information, our framework moves on to
pre-processing, which consists of four stages: data cleansing, attribute selection, role setting, and feature extraction. The cleaned data is then utilized to train machine learning algorithms to make breast cancer predictions. We feed the model fresh data for which labels are available to assess the efficiency of the algorithm. The train_test_split technique is commonly used to accomplish this by dividing the labeled data into two parts. 80% of the information is used to train our machine learning model. 20% of the information is designated as test data or test sets and will be used to evaluate the performance of the model. After putting several models to the test, we evaluate the findings to determine which one is the most accurate and which one is the best predictor of breast cancer.

3.1. Support vector machine (SVM) model

This study employed the SVM algorithm to predict the probability of breast cancer. Nonlinear classification challenges can be tackled by adapting SVM for use when the raw data cannot be divided linearly. Given training vectors belonging to two classes, a kernel function maps the raw data into a high-dimensional feature space, where instances of one class can be separated from the rest [41,42].

$$(a_i, b_i), \quad a_i \in \mathbb{R}^n, \quad b_i \in \{+1, -1\}, \quad i = 1, \ldots, n \tag{i}$$

In this case, $b_i$ is the training vector's associated class label, and $a_i$ is an n-dimensional input vector containing real values. Separating points on a plane are located using an orthogonal vector $w$ and a bias $b$.

$$w \cdot a + b = 0 \tag{ii}$$

Hence, SVM's technique for classifying data may be stated as

$$\max_{\alpha} \left[ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j b_i b_j \, k(a_i, a_j) \right] \tag{iii}$$

Here, $a_i$ is the support vector classifier's input vector, and $C$ is a penalty factor used to regulate the error rate.

3.2. Logistic regression model

LR is a preferred and trustworthy statistical technique for making informed decisions because of its ability to predict the probability of an occurrence by fitting data to the logistic function. Clinical, demographic, and other data are employed in real time to provide predictions about a patient's clinical outcome [43]. LR is also multivariate: it seeks to establish a relationship between multiple independent factors and a single dependent variable. In this experiment, we used binary LR to predict one of two possible classes of results. The predicted odds or probability of a binary event is the most obvious result of the LR model, but there is more information to be gleaned from the model that may and should be used in making decisions. Probabilities greater than 50% in a two-class situation are placed in the "1" category. The value "0" is used for everything else [44].

Consecutive iterations of the independent variable selections and coefficient computations were performed. With a0, a1, …, an as the forecast variables and α0, α1, …, αn as the coefficients, the probability P(Z) of the occurrence of breast cancer is given in equation (v). We used LR along with the iterative parameter selection techniques to estimate the prevalence of breast cancer [45,46].

$$P(Z) = \frac{1}{1 + e^{-(\alpha_0 + \alpha_1 a_1 + \alpha_2 a_2 + \ldots + \alpha_n a_n)}} = \frac{1}{1 + e^{-(b^{T} X)}} \tag{v}$$

4. Results and discussion

In this proposed framework, the Python NumPy, Matplotlib, Seaborn, Pandas, Scikit-Learn, and Plotly modules are used for data acquisition and visualization. As the initial step in this study, the data was imported into Python as a data frame, and it was subsequently
Fig. 4. (a) Accuracy Plot, (b) ROC Curve Plot, and (c) AUC Plot for LR with all features.
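As a rough illustration of the workflow described so far (the logistic model of equation (v), the 0.5 decision threshold, the 80/20 train/test split, and accuracy computed from confusion-matrix counts), the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the synthetic two-feature data is a hypothetical stand-in for the WBCD attributes, and batch gradient descent, the learning rate, and the helper names `logistic`, `predict_proba`, and `train_lr` are assumptions made for the example.

```python
import math
import random


def logistic(z):
    """Logistic function 1 / (1 + e^(-z)), as in equation (v)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """P(Z) for one sample; bias stands for alpha_0, weights for alpha_1..alpha_n."""
    return logistic(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_lr(X, y, lr=0.1, epochs=300):
    """Fit the coefficients by batch gradient descent on the log-loss (assumed optimizer)."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            grad_b += err
            for j in range(n_features):
                grad_w[j] += err * xi[j]
        bias -= lr * grad_b / len(X)
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
    return weights, bias


# Synthetic data (hypothetical, not the WBCD): class 0 ("benign") is centred
# at (-1, -1) and class 1 ("malignant") at (+1, +1).
rng = random.Random(0)
data = [([rng.gauss(c, 0.6), rng.gauss(c, 0.6)], label)
        for label, c in ((0, -1.0), (1, 1.0)) for _ in range(100)]
rng.shuffle(data)

# 80% of the samples train the model; 20% are held out for testing.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
weights, bias = train_lr([x for x, _ in train], [y for _, y in train])

# Probabilities above 0.5 are assigned to class 1, everything else to class 0.
y_true = [y for _, y in test]
y_pred = [1 if predict_proba(weights, bias, x) > 0.5 else 0 for x, _ in test]

# Confusion-matrix counts and the accuracy derived from them.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy:.3f}")
```

In practice, the Scikit-Learn module named in the paper would typically replace the hand-written training loop; the sketch only makes the individual steps explicit, and its numbers are unrelated to the accuracies reported for the WBCD.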
may be shown that the dataset was not properly balanced. Fig. 3(a) depicts the results of this process, a heat map depicting the correlation between all of the attributes.

Some of the characteristics in the dataset had bigger numeric values, whereas others had much lower values. Relationships between traits with imbalanced numerical values cannot be mapped effectively. After normalizing the features' numerical values, a boxplot was created to cleanse the data.

Fig. 3(b) represents a boxplot of the characteristics. It shows that some characteristics were discarded because too many outliers existed. This means that, based on the box plot, there was insufficient information to categorize using these qualities. Before the introduction of ML techniques, data were classified as positively correlated, negatively correlated, or uncorrelated.

In this study, accuracy ratings for the classification approaches of LR and SVM were determined. Three distinct datasets, each with a distinct feature set, received a distinct analysis approach.

In the first dataset, all of the characteristics were included, regardless of how they were related to each other. In the second dataset, all of the characteristics that were highly correlated with each other were included. In the third dataset, all of the characteristics that were least correlated with each other were included. Each machine learning method was used on its own with the three different datasets, and accuracy results were collected so that they could be compared.

Logistic regression is one of the most commonly used techniques for handling classification challenges. A sigmoid function is used to analyze the correlation between categorical variables that are dependent and independent. Accuracy scores of 97.14%, 92.6%, and 96.0%, ROCs for all three datasets of 99.32, 91.1, and 90%, and AUCs of 99.60, 98.60, and 95.50% were achieved by using the logistic regression technique with the optimal cut-off value or threshold, utilizing a variety of libraries.

SVM was the second model used. There were three different datasets used. The applications produced accuracy ratings of 96% with all dataset characteristics, 92.57% with highly correlated features, and 96.0% with less correlated dataset features. For each dataset, the ROC is 98.90%, 91.0%, and 91%, while the AUC is 98.90%, 98.20%, and 99.90%.

458 of the tumours in the dataset were benign, compared to 241 that were malignant. With benign tumours occurring more frequently than malignant ones, this distribution demonstrates the asymmetrical character of the data. On the heat map, the association between each attribute is displayed individually. Lighter blue hues represent a negative relationship and an uncorrelated benign breast mass, whereas deeper blue hues represent a clear and positive link between these characteristics. Similarly, darker red colors suggest a distinct and positive link between these characteristics, whereas lighter
Fig. 5. (a) Accuracy Plot, (b) ROC Curve Plot, and (c) AUC Plot for SVM with all features.
red hues denote a negative correlation.

4.1. Tools and library used

The proposed mechanism is implemented using Python and Jupyter Notebook on the standard breast cancer dataset available publicly at the UCI repository. The libraries used to obtain the desired results and for the data acquisition are the Python NumPy, Matplotlib, Seaborn, Pandas, Scikit-Learn, and Plotly modules.

4.2. Dataset used

The parameters used in the dataset are described below. The data contains 10 attributes and 699 instances. This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. All parameters can be useful to classify cancer; if these parameters have relatively large values, it can be a sign of malignant tissue. Parameter (1) is the ID, a number used for identification [30]. Parameter (2) is the clump thickness, which indicates the grouping of cancer cells in multiple layers. The next parameter is the uniformity of cell size, indicating metastasis to lymph nodes. The uniformity of cell shape identifies the varying size of cancerous cells. Parameter (4) is marginal adhesion: loss of adhesion is a marker of malignancy, and since malignant cells lose this characteristic, the persistence of adhesion serves as a cue. The remaining parameters are (5) the size of a single epithelial cell (SECS), where an increase in SECS suggests the cell may be malignant; (6) the presence of naked nuclei in benign tumors; (7) the presence of bland chromatin, which is typically found in benign cells; (8) the presence of normal nucleoli, which are typically very small in benign cells; and (9) mitoses, the process of cell division in which the nucleus divides. The train/test split technique is used to train and test the model: 80% of the data is used to train the model and 20% is used to test it.

4.3. Performance metrics

The performance of the proposed framework is evaluated on three parameters: Accuracy, ROC (Receiver Operating Characteristic curve), and AUC (Area Under the Curve). The values of these metrics can be calculated from the values of the confusion matrix: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The equations for the calculation of Accuracy, ROC, and AUC are:

$$\text{Accuracy (A)} = \frac{TP + TN}{TP + TN + FN + FP}$$

A graph that shows the performance of a classification model at all classification thresholds is called a ROC curve. It has two parameters:

TPR = True Positive Rate

$$TPR = \frac{TP}{TP + FN}$$

FPR = False Positive Rate

$$FPR = \frac{FP}{FP + TN}$$

The ROC curve plots the TPR and FPR at the different classification
Fig. 6. (a) Accuracy Plot, (b) ROC Curve Plot, and (c) AUC Plot for LR with highly correlated features.
Fig. 7. (a) Accuracy Plot and (b) ROC Curve Plot for SVM with highly correlated features.
thresholds of a classification model. AUC is the Area Under the ROC Curve,
which provides an aggregate measure of performance across all possible
classification thresholds.

4.4. Discussion

The visualization of the data is organized into three categories: all features,
highly correlated features, and less correlated features. Fig. 4(a) depicts the
accuracy of LR with all features, while Fig. 4(b) and (c) display the ROC curve
and AUC curve of logistic regression with all features. Fig. 5(a) exhibits the
accuracy of SVM with all features, whereas Fig. 5(b) illustrates the ROC for
SVM. Fig. 5(c) depicts the comparison of LR and SVM in the AUC curve.

Fig. 6(a) depicts the accuracy of LR with highly correlated characteristics,
while Fig. 6(b) and (c) exhibit the ROC curve and AUC curve of LR with highly
correlated features, respectively. Fig. 7(a) exhibits the
Fig. 8. (a) Accuracy Plot, (b) ROC Curve Plot, and (c) AUC Plot for LR with less correlated features.
Fig. 9. (a) Accuracy Plot and (b) ROC Curve Plot for SVM with less correlated features.
accuracy of SVM with strongly correlated features, while Fig. 7(b) illustrates
the ROC for SVM. Fig. 7(c) depicts the comparison of LR and SVM in the AUC
curve.

Fig. 8(a) depicts the accuracy of LR with less correlated features, whereas
Fig. 8(b) and (c) display the ROC curve and AUC curve of LR with less
correlated features, respectively. Fig. 9(a) represents the accuracy of SVM
with less correlated features, while Fig. 9(b) illustrates the ROC for SVM.
Fig. 10(a) depicts the comparison between LR and SVM in the AUC curve with less
correlated features, and Fig. 10(b) shows the comparison curve plot of LR and
SVM with highly correlated features.

Plots for all features, strongly correlated attributes, and less strongly
correlated attributes were produced after the dataset cleaning step. The
characteristics of bland chromatin, bare nuclei, marginal adhesion, and clump
thickness were aggregated and plotted under the strongly correlated category.
The features with the lowest correlation were grouped in the same way, and
their graphs were presented. The diagnostic accuracy of logistic regression for
breast cancer was high for malignant tumor types and accurate for benign tumor
types. The average accuracy was 97.14%. Figs. 5 and 10 provide a comparison
between the machine learning methods and their AUC outcomes. The figures above
present all accuracy values for the three feature categories and both ML
methods.
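The metrics defined in Section 4.3 can be illustrated with a small, dependency-free sketch. The function names (`confusion_counts`, `roc_points`, `auc_trapezoid`) and the toy labels and scores are illustrative assumptions, not the authors' code or data; with the libraries listed in Section 4.1 these values would more likely be obtained from Scikit-Learn.

```python
from typing import List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN), with 1 = malignant (positive) and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FN + FP)."""
    return (tp + tn) / (tp + tn + fn + fp)


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs from sweeping the decision threshold from high to low.

    Assumes both classes are present; otherwise a rate would divide by zero.
    """
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, tn, fp, fn = confusion_counts(y_true, y_pred)
        tpr = tp / (tp + fn)  # True Positive Rate
        fpr = fp / (fp + tn)  # False Positive Rate
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy example (hypothetical values): true labels and classifier scores.
    y_true = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print("accuracy:", accuracy(*confusion_counts(y_true, y_pred)))
    print("AUC:", auc_trapezoid(roc_points(y_true, scores)))
```

Accuracy depends on a single threshold (0.5 in the toy example), whereas the AUC summarizes the classifier over all thresholds, which is why the two can rank LR and SVM differently.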
Fig. 10. (a) Comparison curve plot of LR and SVM with less correlated features, (b) Comparison of the accuracy of LR and SVM with highly correlated features.
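The three feature sets used in the experiments (all features, highly correlated features, and less correlated features) could be derived along the following lines. This is only a sketch under stated assumptions: the paper does not say which correlation or cut-off was used, so the choice of Pearson correlation with the class label, the `threshold` value of 0.7, the function names, and the toy columns are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def group_features(
    columns: Dict[str, Sequence[float]],
    labels: Sequence[int],
    threshold: float = 0.7,  # assumed cut-off, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Split feature names into (highly correlated, less correlated) groups.

    A feature is "highly correlated" here if the absolute value of its
    correlation with the class label is at least `threshold`.
    """
    high: List[str] = []
    low: List[str] = []
    for name, values in columns.items():
        r = abs(pearson(values, labels))
        (high if r >= threshold else low).append(name)
    return high, low


if __name__ == "__main__":
    # Toy data (hypothetical): 0 = benign, 1 = malignant.
    labels = [0, 0, 0, 1, 1, 1]
    columns = {
        "clump_thickness": [1, 2, 1, 8, 9, 10],
        "mitoses": [1, 1, 2, 1, 2, 1],
    }
    high, low = group_features(columns, labels)
    print("highly correlated:", high)
    print("less correlated:", low)
```

Each group, together with the full feature set, would then be passed to the same LR and SVM training procedure so that the accuracies are comparable across the three datasets.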
4.5. Comparison with the other algorithms

This section compares the results of the proposed model with the algorithms and
models proposed by other authors, to show the effectiveness and accuracy of the
proposed model. It was compared with other state-of-the-art algorithms and
hybrid algorithms from the literature. For a fair comparison, only models and
algorithms that used the same dataset, the Wisconsin Breast Cancer Dataset
(WBCD) with 699 instances and 10 attributes, are considered. Table 1 presents
the comparison between the proposed method and other recent studies by
different authors:

Table 1
Comparison of the proposed method and other recent studies.

References               Year  Dataset  Accuracy
[43] Serel-Ozman et al.  2022  WBCD     94.78%
[44] Z Zhang et al.      2022  WBCD     95.52%
[45] Nassar et al.       2023  WBCD     95.27%
[46] JM Wu et al.        2023  WBCD     94.9%
Proposed                 2023  WBCD     97.14%

5. Conclusion and future work

There are a variety of individual factors that can contribute to the death of
people with breast cancer, and any number of factors can lower the risk of
breast cancer-related death. Yet, these data underline the need for early
breast cancer diagnosis for both present and former patients. Our research
emphasizes the importance of early detection of breast cancer in female
patients. The purpose of this paper is to provide a framework based on
machine-learning models to find the most accurate model for predicting breast
cancer on real-world data, so that patients can get the best treatment and the
death rate can be lowered. In this paper, we tested the proposed framework on a
real dataset of breast cancer. First, the dataset was pre-processed and
cleaned, and each numeric value was normalized in preparation for analysis. The
first dataset includes all the features; the second dataset consists of
features with strong correlation; and features with low correlation were
included in the third dataset. For tumor classification, the ML techniques LR
and SVM were utilized. In comparison to alternative methodologies, LR provided
a more accurate classification. The key advantage of LR is its fast training
time. Results show that the LR prediction model reaches an accuracy of 97.14%,
compared with 96% for SVM.

In the future, other datasets with multiple instances and attributes will be
used to evaluate the performance of the proposed model. Other machine learning
classifiers such as Decision Tree, Naïve Bayes, and KNN will also be used to
classify breast cancer patients, and hybrid or ensemble methods will be used
for model training. In addition, the proposed model can be applied to other
diseases. The authors also intend to propose a mechanism based on deep learning
to predict breast cancer at its early stages from an image dataset of the
breast. Moreover, applying deep learning methods to the electronic health
records of patients' mammograms for early detection is proposed as future work.

Declaration of competing interest

We confirm that this work is original and has not been published elsewhere, nor
is it currently under consideration for publication elsewhere.
There is no financial support from any organization.
We have no conflicts of interest to disclose.

Data availability

Data will be made available on request.

References

[1] Harald Weedon-Fekjær, Bo H. Lindqvist, Lars J. Vatten, Odd O. Aalen, Tretli Steinar, Breast cancer tumor growth estimated through mammography screening data, Breast Cancer Res. 10 (3) (2008) 1–13.
[2] Emad A. Rakha, Jorge S. Reis-Filho, Frederick Baehner, David J. Dabbs, Thomas Decker, Vincenzo Eusebi, Stephen B. Fox, et al., Breast cancer prognostic classification in the molecular era: the role of histological grade, Breast Cancer Res. 12 (4) (2010) 1–12.
[3] J. Laurance, Breast Cancer Cases Rise 80% since the Seventies; BREAST CANCER, The Independent, London, 2006, pp. 1–6.
[4] Cintolo-Gonzalez, A. Jessica, Braun Danielle, Amanda L. Blackford, Emanuele Mazzola, Ahmet Acar, Jennifer K. Plichta, Molly Griffin, Kevin S. Hughes, Breast cancer risk models: a comprehensive overview of existing models, validation, and clinical applications, Breast Cancer Res. Treat. 164 (2) (2017) 263–284.
[5] Cintolo-Gonzalez, A. Jessica, Braun Danielle, Amanda L. Blackford, Mazzola Emanuele, Ahmet Acar, Jennifer K. Plichta, Molly Griffin, Kevin S. Hughes, Breast cancer risk models: a comprehensive overview of existing models, validation, and clinical applications, Breast Cancer Res. Treat. 164 (2) (2017) 263–284.
[6] Eitan Amir, Orit C. Freedman, Bostjan Seruga, D. Gareth Evans, Assessing women at high risk of breast cancer: a review of risk assessment models, JNCI (J. Natl. Cancer Inst.) 102 (10) (2010) 680–691.
[7] A.J. Cruz, D.S. Wishart, Applications of machine learning in cancer prediction and prognosis, Cancer Inf. 2 (2006) 59–77.
[8] Min Chen, Kai Hwang YixueHao, Lu Wang, Lin Wang, Disease prediction by machine learning over big data from healthcare communities, IEEE Access 5 (2017) 8869–8879.
[9] Habib Dhahri, Eslam Al Maghayreh, Awais Mahmood, Wail Elkilani, Mohammed Faisal Nagi, Automated breast cancer diagnosis based on machine learning algorithms, J. Healthcare Eng. (2019) 1–11.
[10] https://www.cancer.org/content/dam/cance-org/research/cancer-facts-and-statistics/breastcancer-facts-and-figures/breast-cancer-factsand-figures-2019-2020.
[11] Nikita Pilnenskiy, Ivan Smetannikov, Feature selection algorithms as one of the Python data analytical tools, Future Internet 12 (3) (2020) 1–14.
[12] Sara Alghunaim, Heyam H. Al-Baity, On the scalability of machine-learning algorithms for breast cancer prediction in big data context, IEEE Access 7 (2019) 91535–91546.
[13] Ali Li, Rui Wang, Liyuan Liu, Lei Xu, Fei Wang, Fei Chang, Lixiang Yu, Yujuan Xiang, Fei Zhou, Zhigang Yu, BCRAM: a social-network-inspired breast cancer risk assessment model, IEEE Trans. Ind. Inf. 15 (1) (2018) 366–376.
[14] Agarap, M. Abien Fred, On breast cancer detection: an application of machine learning algorithms on the Wisconsin diagnostic dataset, in: Proceedings of the 2nd International Conference on Machine Learning and Soft Computing, ACM, 2018, pp. 5–9.
[15] Mitchell H. Gail, Louise A. Brinton, David P. Byar, K. Donald, Corle Sylvan, B. Green, Schairer Catherine, John J. Mulvihill, Projecting individualized probabilities of developing breast cancer for white females who are being examined annually, JNCI (J. Natl. Cancer Inst.) 81 (24) (1989) 1879–1886.
[16] Harold J. Burstein, Polyak Kornelia, Julia S. Wong, Susan C. Lester, Carolyn M. Kaelin, Ductal carcinoma in situ of the breast, N. Engl. J. Med. 350 (14) (2004) 1430–1441.
[17] D. Gareth R. Evans, Howell Anthony, Breast cancer risk-assessment models, Breast Cancer Res. 9 (5) (2007) 213.
[18] Eitan Amir, Orit C. Freedman, Seruga Bostjan, D. Gareth Evans, Assessing women at high risk of breast cancer: a review of risk assessment models, J. Natl. Cancer Inst. 102 (10) (2010) 680–691.
[19] K. Kourou, T.P. Exarchos, K.P. Exarchos, M.V. Karamouzis, D.I. Fotiadis, Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. 13 (2015) 8–17.
[20] Usman Iqbal, Chun-Kung Hsu, PhungAnh Alex Nguyen, Daniel LiviusClinciu, Richard Lu, Shabbir Syed-Abdul, Hsuan-Chia Yang, et al., Cancer-disease associations: a visualization and animation through medical big data, Comput. Methods Progr. Biomed. 127 (2016) 44–51.
[21] Asri Hiba, Hajar Mousannif, Hassan Al Moatassime, Thomas Noel, Using machine learning algorithms for breast cancer risk prediction and diagnosis, Proc. Comput. Sci. 83 (2016) 1064–1069.
[22] Dayong Wang, Aditya Khosla, Rishab Gargeya, Humayun Irshad, Andrew H. Beck, Deep learning for identifying metastatic breast cancer (2016) arXiv preprint arXiv:1606.05718.
[23] Ali Li, Rui Wang, Liyuan Liu, Lei Xu, Fei Wang, Fei Chang, Lixiang Yu, Yujuan Xiang, Fei Zhou, Zhigang Yu, BCRAM: a social-network-inspired breast cancer risk assessment model, IEEE Trans. Ind. Inf. 15 (1) (2018) 366–376.
[24] Carolyn Nickson, Pietro Procopio, Louiza S. Velentzis, Sarah Carr, Lisa Devereux, Gregory Bruce Mann, Paul James, Grant Lee, Wellard Cameron, Ian Campbell, Prospective validation of the NCI breast cancer risk assessment tool (Gail model) on 40,000 Australian women, Breast Cancer Res. 20 (1) (2018) 155.
[25] Mei R. Fu, Yao Wang, Chenge Li, Zeyuan Qiu, Deborah Axelrod, Amber A. Guth, Joan Scagliola, et al., Machine learning for detection of lymphedema among breast cancer survivors, mHealth 4 (2018).
[26] Miguel Patrício, José Pereira, Joana Crisóstomo, Paulo Matafome, Manuel Gomes, Raquel Seiça, Francisco Caramelo, Using Resistin, glucose, age and BMI to predict the presence of breast cancer, BMC Cancer 18 (1) (2018).
[27] AdwoaBemah Bonsu, Ncama Busisiwe Purity, Evidence of promoting prevention and the early detection of breast cancer among women, a hospital-based education and screening interventions in low-and middle-income countries: a systematic review protocol, Syst. Rev. 7 (1) (2018) 234–235.
[28] K. Shailaja, B. Seetharamulu, M.A. Jabbar, Machine learning in healthcare: a review, in: 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), IEEE, 2018, pp. 910–914.
[29] Dong Wook Kim, Sanghoon Lee, Sunmo Kwon, Woong Nam, In-Ho Cha, Hyung Jun Kim, Deep learning-based survival prediction of oral cancer patients, Sci. Rep. 9 (1) (2019) 6994.
[30] Andrew Lee, Nasim Mavaddat, Amber N. Wilcox, Alexander Cunningham, Tim Carver, Simon Hartley, Chantal Babb de Villiers, et al., BOADICEA: a Comprehensive Breast Cancer Risk Prediction Model Incorporating Genetic and Nongenetic Risk Factors, 2019.
[31] Adam Yala, Constance Lehman, Tal Schuster, Tally Portnoi, Regina Barzilay, A deep learning mammography-based model for improved breast cancer risk prediction, Radiology (2019), 182716.
[32] Chiara Nicolo, Cynthia Perier, Melanie Prague, Gregoire MacGrogan, Olivier Saut, Sebastien Benzekry, Machine Learning versus Mechanistic Modeling for Prediction of Metastatic Relapse in Breast Cancer, bioRxiv, 2019, 634428.
[33] Xin Feng, Jialiang Li, Li Han, Hang Chen, Fei Li, Quewang Liu, Zhu-Hong You, Fengfeng Zhou, Age is important for the early-stage detection of breast cancer on both transcriptomic and methylomic biomarkers, Front. Genet. 10 (2019) 212.
[34] S. Aruna, S. Rajagopalan, L. Nandakishore, Knowledge based analysis of various statistical tools in detecting breast cancer, Comput. Sci. Inf. Technol. 2 (2011) 37–45.
[35] D. Delen, G. Walker, A. Kadam, Predicting breast cancer survivability: a comparison of three data mining methods, Artif. Intell. Med. 34 (2005) 113–127.
[36] Z. Qu, Predicting diabetes mellitus with machine learning techniques, Front. Genet. 9 (2011) 515.
[37] K. Srinivas, Analysis of coronary heart disease and prediction of heart attack in coal mining regions using data mining techniques, in: Proceedings of the 5th International Conference on Computer Science & Education, Hefei, China, 24–27, August 2010, pp. 1344–1349.
[38] J.L. Bernal, S. Cummins, A. Gasparrini, Interrupted time series regression for the evaluation of public health interventions: a tutorial, Int. J. Epidemiol. 46 (2017) 348–355.
[39] P.S. Pratiwi, Development of intelligent breast cancer prediction using extreme learning machine in Java, Int. J. Comput. Commun. Instrum. Eng. 3 (2016).
[40] Deepti Sharma, Rajneesh Kumar, Anurag Jain, Breast cancer prediction based on neural networks and extra tree classifier using feature ensemble learning, Measurement: Sensors 24 (2022), 100560.
[41] Budi Juarto, Breast cancer classification using outlier detection and variance inflation factor, Eng. Math. Comput. Sci. (EMACS) J. 5 (1) (2023) 17–23.
[42] A. Salcedo-Bernal, M.P. Villamil-Giraldo, A.D. Moreno-Barbosa, Clinical data analysis: an opportunity to compare machine learning methods, Procedia Comput. Sci. 100 (2016) 731–738.
[43] Özmen-Akyol, Serel, Estimating breast cancer class using artificial neural network and logistic regression methods, Eskişehir Türk Dünyası Uygulama ve Araştırma Merkezi Bilişim Dergisi 3 (1) (2022) 26–31.
[44] Zirui Zhang, Zixuan Li, Evaluation methods for breast cancer prediction in machine learning field, in: SHS Web of Conferences, vol. 144, EDP Sciences, 2022, 03010.
[45] Hana Babiker Nassar, Classification for imbalanced breast cancer dataset using resampling methods, IJCSNS 23 (1) (2023) 89.
[46] Jiann-Ming Wu, Chao-Yuan Tien, Mobile-aided breast cancer diagnosis by deep convolutional neural networks, in: Research Anthology on Medical Informatics in Breast and Cervical Cancer, IGI Global, 2023, pp. 844–858.