Classification of Breast Cancer Using A Novel Neural Network-Based Architecture
Abstract—Breast cancer is currently one of the deadliest types of cancer, and its death rate has grown considerably because of a lack of knowledge about the disease, its symptoms, and prevention techniques. Detection at an early stage is therefore crucial to limit the progression of the cancer. Breast tumors fall into two classes, malignant and benign. To classify them, an automated system based on logistic regression and a neural network is proposed. The classification of the breast cancer data uses a DNN with several levels of processing. The Digital Database for Screening Mammography (DDSM) dataset was used in the proposed study on the Kaggle platform. The dataset was divided into various train-test splits. The system's performance is evaluated on the basis of accuracy and the ROC AUC curve. The results show that proposed-2 outperforms the compared models, with a training accuracy of 98.99% and a testing accuracy of 98.83%.

Index Terms—CNN, Deep learning, Breast-Net, Stacking, Mammogram, Ultrasound

I. INTRODUCTION

Breast cancer is one of the most severe diseases in India and is responsible for a significant number of deaths [1]. Dietary and lifestyle changes are increasing the incidence of cancer in women [2]. It is the second leading cause of death for women around the world. Breast cancer is a significant problem worldwide, and southern India is no exception. According to recent studies, breast cancer is the most common cancer among women, accounting for approximately 27% of all cancers diagnosed in women worldwide. Additionally, the incidence of breast cancer has been increasing over the years, with a significant rise reported in urban areas [3].

Machine learning (ML) and deep learning (DL) can be used to predict breast cancer on the basis of collected data. The progression of cancer through its various stages is caused by the spread of cancer cells within a tissue, which in turn is caused by the aberrant proliferation of fatty and fibrous components [4]. According to a government survey, breast cancer can be treated more effectively than other tumors. With proper care and attention, it is possible to distinguish between the various stages of breast cancer. If the appropriate treatment is not administered, patients will not survive. Breast cancer can be detected accurately using a variety of methods.

Despite their lower accuracy, early data classification algorithms are still effective for categorization and prediction. DL and ML algorithms are applied to many different datasets to extract features and hidden features [5]. A CNN produces accurate results for the dataset used, and the size of the convolution output is determined by the stride, which allows features to be extracted from images of varying sizes. The "gold standard" consists of three tests (clinical assessment, radiological imaging, and pathology testing) and is the foundation of the vast majority of cancer screening protocols. Regression is an integral part of the conventional method for cancer detection, whereas model construction is the focus of contemporary machine learning approaches and algorithms. The model is designed to accurately predict unseen data during the training and testing phases. The three fundamental stages of a machine learning pipeline are preprocessing, feature selection or extraction, and classification [6]. To correctly identify and forecast cancer, a machine learning system must extract features. Such a method can distinguish between benign and malignant tumors.

The remainder of the paper is organized as follows: Section II reviews the literature on breast cancer, Section III presents the methodology of the proposed work, Section IV describes the dataset, Section V deals with the performance measures, Section VI presents the result analysis, and Section VII concludes.

II. LITERATURE SURVEY

Researchers have been building automated breast cancer detection systems alongside the development of automated medical applications. Machine learning techniques have been used in numerous areas of medicine [7], [8]. According to reports, machine learning is used to decide what kind of treatment cancer patients need to receive [9]–[12]. Several writers have claimed that the early
computer-aided systems for breast cancer screening that were created did not significantly increase accuracy [5].

The authors of [8] extracted statistical texture features (Entropy, Angular Second Moment, Contrast, Mean, and Difference Moment from each density function) from mammograms using the grey level co-occurrence matrix (GLCM). Numerous methods for detecting breast cancer have been investigated, with a focus on machine learning. SVM, Artificial Neural Network (ANN), K Nearest Neighbour (KNN), and Decision Tree (DT) algorithms were used to create a hybrid model [13]. Despite the continual development of machine learning algorithms, no appreciable progress has been made in the performance of these applications. Meanwhile, deep learning has made visual object recognition and classification successful in many areas; it learns representations from data through successive layers of increasingly relevant representations [14].

A thorough and in-depth analysis of deep learning and machine learning methods for breast cancer detection and classification using medical imaging was presented by Houssein et al. They surveyed the latest diagnostic tools used in medicine as well as the quick adoption of deep learning and machine learning in the medical industry.

The literature reviewed in this section suggests that deep and transfer learning are not yet very effective for early cancer detection [15]. According to the cited research, the task is hard because of a lack of resources or of competent and experienced personnel. Researchers from the medical field have already put a lot of work into this area, but their conclusions are not always accurate. We attempt to address these issues by improving the classification of breast cancer through hyperparameter tuning of a neural network, allowing for accurate detection of breast cancer in its early stages. The key goals of this study are summarized below:

• Apply pre-processing to the data to obtain more accurate classification results.
• Develop an enhanced neural network model with hyperparameter tuning to obtain an accurate model for tumor classification.

III. METHODOLOGY

The proposed method consists of data preprocessing, exploratory data analysis (EDA), model selection, training and testing of the model, and result analysis and accuracy assessment, as shown in Fig. 1.

A. Preprocessing

The Wisconsin Breast Cancer (WBC) tabular dataset contains 33 parameters that directly or indirectly contribute to the breast cancer diagnosis. To identify the parameters that play a significant role in the diagnosis, we performed thorough univariate, bivariate, and multivariate analyses. The target class distribution of the dataset was found to be imbalanced, with a 63:37 ratio of benign to malignant cases. We performed a correlation analysis of the mean, standard error, and worst features with the diagnosis (target variable) to identify the parameters with a significant correlation. The mean features were found to have a significant correlation with the target variable, except for the fractal_dimension_mean parameter. The standard error features were found to have a significant correlation with the target variable, except for texture_se, smoothness_se, symmetry_se, and fractal_dimension_se. All the worst features were found to have a significant correlation with the target variable.

Furthermore, we checked for multicollinearity, which is the presence of almost perfectly linear relationships between attributes. The radius, perimeter, and area attributes, as well as the concavity, concave points, and compactness attributes, showed signs of multicollinearity. We therefore consider only one column from each set of highly correlated variables for further analysis. In conclusion, the analysis of the WBC dataset revealed the significant parameters that contribute to breast cancer diagnosis and the presence of multicollinearity among some of the attributes. This analysis helps in developing accurate and efficient models for breast cancer diagnosis.

The preprocessing step prepares the data for analysis by performing the necessary transformations and cleaning procedures. In this study, the following preprocessing steps were applied:

1) Cleaning the dataset: The column labeled "Unnamed: 32" was removed from the dataset as it contained only empty values and offered no useful information. This was done using the drop function with axis=1. The column labeled "id" was also dropped as it was not relevant to the analysis. After these column removals, the dataset was checked for remaining null values and was confirmed to be free of them.

2) Data Exploratory Analysis: Exploratory analysis was conducted to gain insights into the dataset and understand the relationships between different variables. The following steps were performed:

• Outlier Detection
To identify outliers in the dataset, boxplots were created for each attribute. The boxplots revealed the presence of outliers in certain attributes, indicating potential extreme values or errors in the data.
• Bivariate Analysis
Bivariate analysis was performed to explore the relationships between variables. Specifically, the mean features, standard error features, and worst features were analyzed in relation to the diagnosis. The analysis indicated significant correlations between the diagnosis and most mean and worst features, suggesting their potential as predictors. However, certain standard error features exhibited lower correlations with the diagnosis.
• Correlation with Diagnosis
The correlation between each feature and the diagnosis was calculated to assess their predictive potential. The analysis revealed the following correlations:
– Mean Features: The majority of mean features demonstrated significant correlations with the diagnosis, indicating their importance in distinguishing between benign and malignant tumors.
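The cleaning and correlation checks described in this subsection can be sketched as follows. This is a minimal illustration using pandas, not the authors' code: the column names follow the Kaggle CSV layout as an assumption, the diagnosis codes "M"/"B" are assumed, and the 0.95 collinearity threshold is a hypothetical value chosen for the example.

```python
import numpy as np
import pandas as pd


def clean(df):
    """Drop the empty and identifier columns, then verify no nulls remain."""
    df = df.drop(["Unnamed: 32", "id"], axis=1, errors="ignore")
    if df.isnull().values.any():
        raise ValueError("dataset still contains null values")
    return df


def correlation_with_target(df, target="diagnosis"):
    """Correlation of each feature with the diagnosis (assumed codes: M=1, B=0)."""
    y = df[target].map({"M": 1, "B": 0})
    features = df.drop(columns=[target])
    return features.corrwith(y).sort_values(ascending=False)


def drop_collinear(features, threshold=0.95):
    """Keep one column from each group of highly correlated features.

    threshold is a hypothetical cut-off; the paper does not state one.
    """
    corr = features.corr().abs()
    # Look only above the diagonal so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop
```

With this approach the first column of each correlated group (for example the radius attribute ahead of perimeter and area) is the one retained, simply because it appears first in the column order.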
... values for each feature. The testing feature set, x_test, was scaled using the transform method, applying the same scaling parameters as the training set. Feature scaling ensured that all features were on a similar scale, avoiding bias due to varying ranges and enabling fair comparisons and unbiased model training and testing.

7) Proposed Model: The proposed work was carried out with two methods. Proposed-1 is based on a machine learning method, logistic regression, and proposed-2 is based on a deep learning method, a neural network.

• Proposed-1
Logistic regression is a widely used classification algorithm that models the probability of an instance belonging to a particular class [16]. The logistic function, also known as the sigmoid function, is employed to transform the output of a linear equation into a probability value. The formula for logistic regression is as follows:

$$p(y = 1 \mid x) = \frac{1}{1 + e^{-z}} \tag{1}$$

Here, z represents the linear combination of the feature values x.

• Proposed-2
To develop an effective classification model, we implemented a multilayer perceptron neural network using the Keras library. The neural network architecture comprised multiple hidden layers, with each layer consisting of densely connected neurons. We used a total of 4 hidden layers, each containing 120 neurons. This choice was based on empirical observations and the complexity of the classification task. We applied the ReLU (Rectified Linear Unit) activation function in each hidden layer to introduce non-linearity to the model. Additionally, we incorporated batch normalization after each hidden layer to improve training convergence and overall performance. During training, we applied dropout regularization after each hidden layer, with dropout rates of 0.8, 0.6, 0.625, 0.63, and 0.64, respectively. These dropout rates were chosen to prevent overfitting and promote robust feature learning. The output layer consisted of a single neuron with a sigmoid activation function. This configuration allowed the model to predict binary classifications, with values ranging between 0 and 1 representing the probability of belonging to a particular class. The model was optimized using the Adam optimizer with its default learning rate. The loss function employed was binary cross-entropy, which is appropriate for our binary classification task. To evaluate the model's performance, we used accuracy as the metric of interest. The model was trained using a training dataset, x_train, and corresponding target values, y_train. We then split the training data into training and validation subsets using a 75:25 ratio. During training, we used a batch size of 28, which means that the model was updated after processing each batch of 28 training samples. The model was trained for a total of 325 epochs, iterating over the entire training dataset multiple times.

Fig. 3. Neural Network with Hyperparameter Tuning.

IV. DATASET

The Wisconsin Breast Cancer (WBC) tabular dataset contains 33 parameters that directly or indirectly contribute to the breast cancer diagnosis. To identify the parameters that play a significant role in the diagnosis, we performed thorough univariate, bivariate, and multivariate analyses. The target class distribution of the dataset was found to be imbalanced, with a 63:37 ratio of benign to malignant cases. We performed a correlation analysis of the mean, standard error, and worst features with the diagnosis (target variable) to identify the parameters with a significant correlation. The mean features were found to have a significant correlation with the target variable, except for the fractal_dimension_mean parameter. The standard error features were found to have a significant correlation with the target variable, except for texture_se, smoothness_se, symmetry_se, and fractal_dimension_se. All the worst features were found to have a significant correlation with the target variable.

Furthermore, we checked for multicollinearity, which is the presence of almost perfectly linear relationships between attributes. The radius, perimeter, and area attributes, as well as the concavity, concave points, and compactness attributes, showed signs of multicollinearity. Thus, we should consider only one column from each set of highly correlated variables for further analysis.

In conclusion, the analysis of the WBC dataset revealed the significant parameters that contribute to breast cancer diagnosis and the presence of multicollinearity among some of the attributes. This analysis helps in developing accurate and efficient models for breast cancer diagnosis.

V. PERFORMANCE MEASURE

We evaluated the effectiveness of the classification techniques using several performance measures: accuracy, the ROC curve, and the area under the curve (AUC). The proposed model's accuracy was assessed against the class labels. Accuracy is calculated by comparing the number of benign and malignant samples correctly identified with the total number of samples. The accuracy metric is defined below:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
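As a worked illustration of Eq. (2) and of the AUC measure named in this section, the sketch below computes both in plain Python. The function names and the example counts are hypothetical and are not taken from the paper's experiments; the AUC is computed through its rank interpretation (the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign one), which is one standard way to obtain it.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (2): fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    sample is scored higher; ties count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical confusion-matrix counts for a 171-sample test split.
example_accuracy = accuracy(tp=60, tn=105, fp=2, fn=4)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a dataset of this size; a sort-based implementation would be preferable for large test sets.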
VI. RESULT AND ANALYSIS

The exploratory analysis provided valuable insights into the dataset, including outlier detection, attribute distributions, correlations with the diagnosis, and multicollinearity, as shown in Fig. 4. These findings guided subsequent steps in the analysis and modeling process, contributing to a more comprehensive understanding of the data.

Fig. 5. ROC AUC Graph.
TABLE I
WISCONSIN DATASET ACCURACIES ON DIFFERENT MODELS

| Model | Train Accuracy | Test Accuracy |
|---|---|---|
| Logistic Regression | 98.60% | 98.04% |
| Decision Tree Classifier | 98.01% | 92.98% |
| RandomForestClassifier | 97.8% | 97.07% |
| KNeighborsClassifier(n_neighbors=19) | 95.97% | 94.73% |
| Support Vector Machine Classifier | 98.74% | 97.66% |
| GaussianNB() | 94.22% | 93.35% |
| Neural Network Classifier | 97.93% | 97.66% |
| Neural Network Architecture with multiple hidden layers | 99.24% | 97.66% |
| Neural Network Architecture with 4 hidden layers, batch size 28 | 98.79% | 98.24% |
| LogisticRegression(C=1, max_iter=50, tol=1e-05) (Proposed-1) | 98.74% | 98.24% |
| Neural Network Architecture with 4 hidden layers, batch size 26 (Proposed-2) | 98.99% | 98.83% |
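The neural network rows of Table I correspond to the multilayer perceptron described in Section III. A minimal Keras sketch of that architecture is given below; it is an illustration under stated assumptions, not the authors' implementation. The text lists five dropout rates for four hidden layers, so using the first four is an assumption, as is the default of 30 input features (33 columns minus the id, empty, and diagnosis columns). The batch size follows Section III (28), while Table I reports the Proposed-2 result with a batch size of 26.

```python
# Values stated in the text of Section III.
HIDDEN_LAYERS = 4
UNITS_PER_LAYER = 120
BATCH_SIZE = 28           # Table I reports Proposed-2 with batch size 26
EPOCHS = 325
VALIDATION_SPLIT = 0.25   # 75:25 training/validation split
# Assumption: first four of the five listed rates, one per hidden layer.
DROPOUT_RATES = [0.8, 0.6, 0.625, 0.63]


def build_model(n_features=30):
    """Build the MLP: 4 x (Dense-ReLU, BatchNorm, Dropout) + sigmoid output.

    n_features=30 is an assumption; pass the actual number of columns kept
    after preprocessing.
    """
    from tensorflow import keras  # imported lazily: heavy dependency

    model = keras.Sequential([keras.Input(shape=(n_features,))])
    for rate in DROPOUT_RATES[:HIDDEN_LAYERS]:
        model.add(keras.layers.Dense(UNITS_PER_LAYER, activation="relu"))
        model.add(keras.layers.BatchNormalization())
        model.add(keras.layers.Dropout(rate))
    # Single sigmoid unit: output is the predicted class probability.
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    # "adam" uses the optimizer's default learning rate.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


def train(model, x_train, y_train):
    """Fit with the batch size, epochs, and validation split from the text."""
    return model.fit(x_train, y_train,
                     validation_split=VALIDATION_SPLIT,
                     batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=0)
```

Here `x_train` is expected to be the scaled training feature matrix and `y_train` the binary diagnosis labels, matching the names used in Section III.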
Fig. 7. Model Accuracy.

Fig. 8. Confusion Matrix of proposed methodology.

VII. CONCLUSION AND FUTURE SCOPE

Breast cancer is one of the most serious diseases on the rise globally and is becoming more and more prevalent. The primary factors contributing to higher death rates are a lack of awareness and late identification of the disease. Computer-aided diagnosis (CAD) is a practical way to support reliable diagnoses for all types of patients. While a CAD system will not completely replace a doctor's skill, it can greatly improve decision-making by supporting practitioners in reading patient data and making the best possible choices, since practitioners can make mistakes due to inexperience or inadequate report analysis. Among the models compared, proposed-2 performs best, with a training accuracy of 98.99% and a testing accuracy of 98.83%. A limitation of the proposed work is its increased number of layers, which leads to a longer execution time. In future, we plan to expand the capability of the proposed model and increase the sensitivity of breast cancer detection by using additional reliable datasets and image datasets.

REFERENCES

[1] M. Yusoff, T. Haryanto, H. Suhartanto, W. A. Mustafa, J. M. Zain, and K. Kusmardi, “Accuracy analysis of deep learning methods in breast cancer classification: A structured review,” Diagnostics, vol. 13, no. 4, p. 683, 2023.
[2] D. B. Taylor, S. Burrows, C. M. Saunders, P. M. Parizel, and A. Ives, “Contrast-enhanced mammography (cem) versus mri for breast cancer staging: detection of additional malignant lesions not seen on conventional imaging,” European Radiology Experimental, vol. 7, no. 1, p. 8, 2023.
[3] E. A. Rakha, G. M. Tse, and C. M. Quinn, “An update on the pathological classification of breast cancer,” Histopathology, vol. 82, no. 1, pp. 5–16, 2023.
[4] I. O. Ellis, E. A. Rakha, G. M. Tse, and P. H. Tan, “An international unified approach to reporting and grading invasive breast cancer. an overview of the international collaboration on cancer reporting (iccr) initiative,” Histopathology, vol. 82, no. 1, pp. 189–197, 2023.
[5] R. Haarika, T. Babu, R. R. Nair, and T. Rajesh, “Breast cancer prediction using feature selection and classification with xgboost,” in 2023 International Conference on Recent Trends in Electronics and Communication (ICRTEC). IEEE, 2023, pp. 1–6.
[6] R. Gayathri, P. B. Pati, T. Singh, and R. R. Nair, “A framework for the prediction of diabetes mellitus using hyper-parameter tuned xgboost classifier,” in 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2022, pp. 1–5.
[7] T. Babu, T. Singh, and D. Gupta, “Colon cancer prediction using 2dreca segmentation and hybrid features on histopathology images,” IET Image Processing, vol. 14, 12 2020.
[8] T. Babu, T. Singh, D. Gupta, and S. Hameed, “Colon cancer prediction on histological images using deep learning features and bayesian optimized svm,” Journal of Intelligent & Fuzzy Systems, vol. 41, no. 5, pp. 5275–5286, 2021.
[9] R. R. Nair and T. Singh, “Multi-sensor medical image fusion using pyramid-based dwt: a multi-resolution approach,” IET Image Processing, vol. 13, no. 9, pp. 1447–1459, 2019.
[10] R. R. Nair, T. Singh, R. Sankar, and K. Gunndu, “Multi-modal medical image fusion using lmf-gan-a maximum parameter infusion technique,” Journal of Intelligent & Fuzzy Systems, vol. 41, no. 5, pp. 5375–5386, 2021.
[11] T. Babu, T. Singh, D. Gupta, and S. Hameed, “Optimized cancer detection on various magnified histopathological colon images based on dwt features and fcm clustering,” Turkish Journal of Electrical Engineering and Computer Sciences, vol. 30, no. 1, pp. 1–17, 2022.
[12] R. R. Nair, T. Singh, A. Basavapattana, and M. M. Pawar, “Multi-layer, multi-modal medical image intelligent fusion,” Multimedia Tools and Applications, pp. 1–27, 2022.
[13] A. Tripathi, A. Basavapattana, R. R. Nair, and T. Singh, “Visualization of covid bimodal scan using dnn,” in 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2021, pp. 01–07.
[14] K. Jabeen, M. A. Khan, J. Balili, M. Alhaisoni, N. A. Almujally, H. Alrashidi, U. Tariq, and J.-H. Cha, “Bc2netrf: breast cancer classification from mammogram images using enhanced deep learning features and equilibrium-jaya controlled regula falsi-based features selection,” Diagnostics, vol. 13, no. 7, p. 1238, 2023.
[15] T. Babu and R. R. Nair, “Colon cancer prediction with transfer learning and k-means clustering,” in Frontiers of ICT in Healthcare: Proceedings of EAIT 2022. Springer, 2023, pp. 191–200.
[16] Y. Xu, B. Klein, G. Li, and B. Gopaluni, “Evaluation of logistic regression and support vector machine approaches for xrf based particle sorting for a copper ore,” Minerals Engineering, vol. 192, p. 108003, 2023.