An Intelligent Algorithm for Lung Cancer Diagnosis Using Extracted Features
Healthcare Analytics
journal homepage: www.elsevier.com/locate/health
∗ Corresponding author.
E-mail addresses: [email protected] (N. Maleki), [email protected], [email protected] (S.T.A. Niaki).
https://doi.org/10.1016/j.health.2023.100150
Received 5 December 2022; Received in revised form 9 February 2023; Accepted 17 February 2023
2772-4425/© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

According to the World Health Organization (WHO), cancer is a leading cause of death worldwide. In 2020, nearly 10 million people died of various cancers, and about 70% of cancer deaths occur in low- and middle-income countries. The economic impact of cancer is significant and increasing, with the total annual cost of cancer treatment in 2010 being about $1.16 trillion. According to WHO statistics, the most common causes of cancer death in 2020 were lung (1.8 million deaths), colon and rectum (935,000 deaths), liver (830,000 deaths), stomach (769,000 deaths), and breast (686,000 deaths) [1].

Given that lung cancer is at the top of this list, we review what causes the disease in order to obtain a checklist of the factors that affect it. Diagnosing the disease in its early stages is complicated: its symptoms resemble those of respiratory infections, and there may be no symptoms at all at first. Although the disease can affect anyone, lung cancer is more likely to occur in smokers. Diagnosis in the early stages can save a patient's life, because lung cancer cells can spread to other organs before a doctor diagnoses them, and metastasis makes treatment much more difficult.

Research on lung cancer cases indicates that smoking is the most important cause of this disease, which is more common in women today than in the past. Historically, because women smoked less, a lower incidence rate of this type of disease was recorded in women than in men [2,3]. Other contributing factors include age, gender, race, socioeconomic status, exposure to occupational and environmental factors, chronic lung disease, air pollution, individual genetics, obesity, exposure to secondhand smoke, dry cough, alcohol consumption, and diet. Even people's lifestyles contribute to the disease [2,4]. Considering these factors, one can take a big step toward the early detection of this disease.

With the increase in health data, management, analysis, and decision-making have become very challenging in recent years. Moreover, as the population increases, the medical community faces many problems in dealing with and diagnosing various diseases, and conducting experiments imposes enormous costs on the relevant organizations [5]. Given the sheer volume of data and the many demands on physicians' time, the possibility of errors in their decisions is high. Data mining algorithms can therefore greatly help the medical community and patients. However, it should be noted that these methods confirm the doctor's opinion and have little reliability on their own [6]. Many tests used to diagnose disease have damaging effects on the patient's body and cost a lot of money, which is a solid reason to use data mining techniques to diagnose the disease [7]. Besides, data mining methods can reveal hidden inter-dependencies in the data that would otherwise take years to uncover with classical methods.

While an increasing body of work in the literature investigates lung cancer, this paper is distinguished, firstly, in that the Computerized
Tomography (CT) scan images are collected from a hospital in Tehran, Iran, considering a critical difference. In previous studies [8–13], researchers used healthy-lung versus cancerous-lung CT images to diagnose lung cancer. However, we gather our data from a different perspective. In our dataset, CT scan images are divided into two main categories: cancerous CT images and noncancerous ones. As the name of the cancerous category shows, patients who struggle with lung cancer are placed in this category. The noncancerous class, however, includes healthy lungs and lungs with diseases other than lung cancer, e.g., COVID-19.

Second, we implement three approaches to determine which one would best help physicians diagnose the disease early. In the first approach, the images are processed using a Convolutional Neural Network (CNN), and an Artificial Neural Network (ANN) is employed to classify the images. In the second approach, the images are pre-processed and segmented before utilizing the CNN and ANN. In the third method, all the pre-processed and segmented images are converted to numerical data via specific feature extraction algorithms. In addition, dimensional reduction and feature selection algorithms are employed before classification with three machine learning techniques, i.e., Gradient Boosting (GB), Random Forest (RF), and Support Vector Machine (SVM). To our knowledge, this is the first time CT images are converted to numerical features and dimensional reduction and feature selection are applied to them. The results show that our proposed framework outperforms the alternatives and reaches 95% accuracy in diagnosing lung cancer.

The rest of this paper is organized as follows. Section 2 reviews what has been done so far to diagnose diseases with machine learning algorithms and establishes the importance of this field in medical science. Section 3 explains the three methodologies and their phases in detail. In Section 4, the methods are implemented step by step to illustrate their results. The comparison of the methods, their sensitivity analyses, and the strengths and weaknesses of each technique are discussed in Section 5. Finally, the best approach is identified, and future works are presented in Section 6.

2. Literature review

Previous research works in this field are divided into two subsections for better insight into the difference between machine learning and deep learning algorithms applied to lung and other cancers. Moreover, this division shows the importance of data mining in medical science and helps determine the research gap.

2.1. Cancer diagnosis using machine learning algorithms

Maleki et al. [14] used the k-Nearest Neighbor (kNN) algorithm on a lung cancer dataset. They applied feature selection to the dataset and came up with the six most essential features among the dataset features. This paper uses the same feature selection algorithms in our proposed framework to derive the important features [14,15]. While an increasing body of work has applied SVM and decision tree algorithms to diagnose different types of cancer [5,16–19], little work has used the Gradient Boosting algorithm for this purpose. To this end, we use two traditional algorithms from the literature (SVM and Decision Tree) along with GB in our framework.

The healthcare system is quite different from other industries, so it has a higher priority and different customers. In other words, regardless of its costs, patients expect the highest level of treatment and service. Since machine and deep learning have been successful in different areas and have provided precise solutions, they are considered fundamental methods for solving health problems [20].

2.2. Cancer diagnosis using deep learning algorithms

Rapid progress in using machine learning algorithms in medicine has been seen in the literature. Compared to deep learning algorithms, machine learning algorithms are time-consuming and require expert knowledge to adjust their characteristics. However, deep learning algorithms can capture raw data, automatically adjust the features, and analyze and learn the data more quickly [20]. Although deep learning algorithms are powerful, they need an adequate amount of data to show their power. While there is no specific rule on how much data is enough to achieve good results, a rule of thumb says the more data, the better the result. The literature shows that, depending on the data source, researchers utilized different amounts of data in their analyses. For example, KL et al. [21] used 100 CT images from an online source, Chaunzwa et al. [22] employed 331 CT images from Massachusetts General Hospital (MGH), Khan et al. [23] utilized 2101 CT images from the Kaggle website, and Toğaçar et al. [24] used 100 CT images from the Cancer Imaging Archive (CIA). Collecting real-life data is therefore not an easy task and requires a lot of effort. In this paper, we collect 364 real-life CT images from a hospital in Iran; comparing the number of images we analyzed with previous works concerning the data source, we would say we used medium- to large-scale data in our analysis.

Radiologists take pulmonary CT images to diagnose and evaluate tumor growth. Visual interpretation of these data, however, often identifies the tumor only in the final stages of its growth, and treatment at that stage only increases the mortality rate in this type of cancer. As a result, diagnosing the tumor in the early stages of the disease is vital, which can be done by analyzing the images [12]. In [8], 1018 pulmonary CT images were examined. In that study, researchers used a convolutional neural network method without processing or segmenting images, with a sensitivity of 78.9% with twenty false positives per scan and 71.2% with ten false positives per scan. Therefore, in this paper, we considered pursuing this approach and comparing the result with our framework. The result shows that our framework reaches 95% accuracy.

There is a large body of work using a combination of CNN with other machine learning or deep learning algorithms. For instance, Bonavita et al. [25] and Moitra and Mandal [26] used CNN alone to detect lung cancer, Saleh et al. [27] and Nanglia et al. [28] employed CNN and SVM, and Onishi et al. [29] and Huang et al. [30] applied Deep CNN (DCNN) to detect or classify lung cancer in CT images. Hence, CNNs play a massive role in detecting or classifying lung cancer. Based on the previous studies' limitations, there are still opportunities to develop combinations of CNN with other machine learning or deep learning algorithms. In this paper, we use a combination of CNN with ANN as a deep learning method and GB, RF, and SVM as machine learning methods.

Despite problems in the image segmentation phase, many researchers [9–11,13,31] used image segmentation after the pre-processing step to identify the touching objects in the image. One of these segmentation algorithms is watershed segmentation, which was used in [32] after reviewing many different approaches. Therefore, in this study, we use the watershed segmentation algorithm to identify the most critical objects in the CT images.

According to the literature shown in Table 1, a small number of studies used pre-processing methods and data dimensional reduction. However, using these techniques can significantly improve the efficiency of diagnostic and predictive algorithms. Moreover, powerful algorithms such as GB or RF have rarely been used for classifying diseases, even though these methods are potent and may produce better results.

One of the most critical gaps in the literature is the inability to compare the methods used to diagnose lung cancer. The most reliable way to compare methods is to use the same data. Therefore, this paper compares three different methods and shows that our proposed framework outperforms the other two in diagnosing lung cancer. In the first method, unprocessed CT images are considered input to the CNN and ANN structural architecture. In the second method, the same tasks are applied with the difference that pre-processed and segmented CT images are used as input. Finally, in the third method, the proposed
Table 1
The comparison table of literature review.
Authors | Year | Pre-processing | Feature selection | Dimensional reduction | Techniques | Disease | Data type
(The "Techniques" column spans the following methods: Gradient boosting, Random forest, Deep learning, Decision tree, GA-SVM, K-means, ANN, KNN, CNN, SVM, C4.5, LDA, PCA, NBs, GA.)
Chen and Yang [16] 2013 ✓ Breast Cancer Numerical
Zheng et al. [19] 2013 ✓ ✓ Breast Cancer Numerical
Odajima and Pawlovsky [33] 2014 ✓ Breast Cancer Numerical
Lynch et al. [5] 2017 ✓ ✓ Lung Cancer Numerical
Septiani et al. [34] 2017 ✓ ✓ Breast Cancer Numerical
Cherif [17] 2018 ✓ ✓ ✓ ✓ Breast Cancer Numerical
Kr and Aradhya [18] 2018 ✓ ✓ ✓ Lung Cancer Numerical
Maleki et al. [14] 2021 ✓ ✓ Lung Cancer Numerical
Zayed and Elnemr [13] 2015 ✓ Breast Cancer Image
Miah and Yousuf [10] 2015 ✓ ✓ ✓ Lung Cancer Image
Golan et al. [8] 2016 ✓ ✓ Lung Cancer Image
Kaucha et al. [9] 2017 ✓ ✓ ✓ ✓ Lung Cancer Image
Makaju et al. [32] 2018 ✓ ✓ ✓ Lung Cancer Image
Shakeel et al. [11] 2019 ✓ ✓ ✓ Lung Cancer Image
Onishi et al. [29] 2020 ✓ ✓ ✓ Lung Cancer Image
Saleh et al. [27] 2021 ✓ ✓ ✓ Lung Cancer Image
Nanglia et al. [28] 2021 ✓ ✓ ✓ Lung Cancer Image
Huang et al. [30] 2022 ✓ ✓ ✓ Lung Cancer Image
This paper ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Lung Cancer Image/
Numerical
framework, we extract numerical data with the help of 40 features from each pixel in the segmented CT images to form a data frame, and then apply two dimensional-reduction algorithms (Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)), a feature selection algorithm (Genetic Algorithm (GA)), and machine learning algorithms (GB, RF, and SVM).

3. The proposed lung cancer diagnosis methodologies

The methodology adopted in this paper is carried out in three different methods, shown in Fig. 1. This study's problem is diagnosing whether the patient has cancer in the early stage. As is clear from the research purpose, the target variable is discrete, so classification algorithms are needed to identify it.

According to Fig. 1, the first method comprises one fundamental building phase called image classification. In this method, the raw CT images are given to the CNN followed by the ANN without any pre-processing (raw CT images go through the third phase – the blue rectangle – immediately). The second method includes three primary building phases: image pre-processing, image segmentation, and image classification (according to Fig. 1, raw CT images go through the first (yellow rectangle), second (green rectangle), and third (blue rectangle) phases, respectively). Finally, the third method comprises seven fundamental phases: image pre-processing, image segmentation, image feature extraction, building a numerical dataset, dimensional reduction, feature selection, and classification (according to Fig. 1, raw CT images go through the first to eighth phases, respectively). Although these methods have fundamental phases in common, they are entirely different methods implemented on the same lung CT scan images.

The image pre-processing phase of the study itself is composed of two parts: image resizing and image denoising. Initially, raw lung CT images are resized, and subsequently the median filter is applied to denoise them. In the image segmentation phase, the watershed segmentation algorithm identifies the most critical objects in the CT images to make the following steps more reliable. In the image classification phase, raw CT images in method one and segmented CT images in method two are used as input. The CNN and ANN algorithms are applied to the CT images to classify whether the images belong to a cancerous or a noncancerous patient.

So far, the fundamental phases implemented in methods one and two have been described. In the third method, the image feature extraction phase is applied to the segmented lung CT images to extract as many numerical features as possible from each image's pixels. The extracted statistical data for each image are stored in a dataset, so the building of the numerical dataset phase is completed. We obtain a vast dataset, as we extract every possible mathematical feature from the CT images to improve the distinction of cancerous patients from noncancerous ones. Therefore, the dimensional reduction phase, composed of two different algorithms, PCA and LDA, is utilized to reduce the dataset dimensions and produce two separate datasets, one from the PCA algorithm and the other from the LDA algorithm. From here on, the experiment continues in two parts. In part one, the classification algorithms – GB, RF, and SVM – are applied to the LDA and PCA datasets separately to classify the data into two groups based on the extracted features. In part two, the genetic algorithm is first applied in the feature selection phase to select the essential components of the PCA dataset. Subsequently, the same classification algorithms are used to classify the data into two groups based on the selected features. In the following subsections, we describe each fundamental phase in detail.

3.1. Image pre-processing

The term pre-processing refers to a series of tasks needed to enhance the quality of the raw images and to increase the performance of the subsequent phases, such as image segmentation, image feature extraction, and classification. The primary objectives in this phase of the study are image resizing and denoising.

3.1.1. Image resizing

In the image resizing step, pixels are either added to the image or removed. Since medical images have many details that may be informative, no pixel removal is performed in the current research. On the other hand, the CNN algorithm's input images should preferably be square for a better diagnostic process. Therefore, all images used in this study have dimensions of 512 × 512 pixels.

3.1.2. Image denoising

Image denoising is the process of applying filter(s) to reduce image noise. It should be noted that CT images have the lowest noise level among medical images, so it is practically unnecessary to perform this step. In any case, to prevent image distortion, the median filter is applied to cancel any possible noise in the lung CT images.

3.2. Image segmentation

Image segmentation tries to help the algorithm diagnose as well as it can by removing unnecessary parts. In fact, at this point, the algorithm can focus only on the lung region, which will improve the classification performance. Watershed segmentation is used in the
Fig. 1. The proposed framework in comparison to other methods. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
suggested model [32]. The watershed technique is utilized when segmenting complicated pictures, since basic thresholding and contour detection will not produce accurate results. The watershed method is built on capturing specific background and foreground information. Markers are then used to run the watershed and determine the precise borders. Markers can be defined by users, e.g., manually, or by some algorithm, e.g., a thresholding operation; we used a thresholding operation in our analysis.

3.3. Image classification

Image classification is the primary domain in which deep neural networks play the most critical role in medical image analysis. Image classification accepts the given input images and produces an output classification to identify whether the disease is present [35]. The image classification phase is composed of two parts: CNN and ANN.

3.4. Image feature extraction

The following phases are performed in the third proposed method. After the raw CT scan images are processed and segmented, several numerical features are extracted from the images in the image feature extraction phase. Each feature is obtained from each pixel in a single image and then stored in a dataset.

3.4.1. Gabor features
The Gabor feature is a linear filter used to analyze texture in image processing. It indicates whether the content of a particular frequency in the image lies in specific directions in a local area around a point [36].

3.4.2. Sobel filter
The Sobel operator is used in image processing and computer vision, especially in edge recognition algorithms. It creates an edge-emphasized image [37].

3.4.3. Scharr filter
The Scharr filter is used to identify and highlight edge/slope features using the first derivative. It works like the Sobel filter and is used to detect edges/changes in pixel intensity [38].

3.4.4. Prewitt filter
The Prewitt filter is similar to the Sobel filter in that it uses two kernels, one for changes in the horizontal direction and the other for changes in the vertical direction. The two kernels are convolved with the original image to approximate the derivatives [38].

3.4.5. Gaussian filter
The Gaussian filter is a linear filter. It is usually used to blur the image or reduce noise [39].
3.4.6. Roberts edge
The Roberts cross operator performs a simple, fast, two-dimensional spatial gradient measurement on an image. It therefore highlights regions of high spatial frequency, which often correspond to edges [40].

3.5. Building a numerical dataset

In the previous step, many features were extracted from each pixel in a CT image. For example, an image with 256 × 256 dimensions has 65,536 pixels, so if we extract 40 features from each pixel and store them in a dataset, we will have 1 row for the image and 40 × 65,536 columns. In addition, a target column must be added to record whether the input image belongs to a cancerous or a noncancerous patient. This procedure continues until all the images' features are extracted and stored in a dataset, as sketched below.
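A minimal Python sketch of this per-pixel feature extraction, using a small bank of standard scikit-image filters, is given below. The exact 40 features used in the paper are not listed in the text, so the filter choices, Gabor frequencies, and orientations here are illustrative assumptions:

```python
# Hypothetical per-pixel feature extraction with a small bank of standard
# filters (Gabor, Sobel, Scharr, Prewitt, Roberts, Gaussian).
import numpy as np
from skimage import filters

def extract_pixel_features(image: np.ndarray) -> np.ndarray:
    """Return one flat feature row for a single segmented CT image."""
    responses = []
    # Gabor responses at a few assumed frequencies/orientations
    for frequency in (0.1, 0.3):
        for theta in (0.0, np.pi / 4):
            real, _imag = filters.gabor(image, frequency=frequency, theta=theta)
            responses.append(real)
    # Edge/gradient filters
    responses.append(filters.sobel(image))
    responses.append(filters.scharr(image))
    responses.append(filters.prewitt(image))
    responses.append(filters.roberts(image))
    # Smoothing filter
    responses.append(filters.gaussian(image, sigma=1.0))
    # One column per (filter, pixel) pair, one row per image
    return np.concatenate([r.ravel() for r in responses])

def build_dataset(images, labels):
    """Stack per-image rows and keep the cancerous/noncancerous target."""
    X = np.vstack([extract_pixel_features(img) for img in images])
    y = np.asarray(labels)  # 1 = cancerous, 0 = noncancerous
    return X, y
```

Stacking one such row per image, together with the cancerous/noncancerous target column, yields the numerical dataset used in the following phases.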
3.6. Dimensional reduction

In the previous phase, an extensive dataset consisting of many features was obtained. However, implementing classification on this extensive dataset is time-consuming and inefficient. Implementing dimensional reduction algorithms on large data is therefore one of the most critical steps. PCA and LDA are the two dimensional-reduction algorithms used in this paper.

3.7. Feature selection

The objective function to be minimized is a weighted sum of the misclassification rate (mcr) and the number of selected features (nf), defined as:

Min Z = w1 * mcr + w2 * nf    (4)

Dividing both sides of Eq. (4) by w1, we have:

Min Z / w1 = mcr + (w2 / w1) * nf    (5)

Assuming w2 / w1 = W, the objective function becomes:

Min Z = mcr + W * nf    (6)

Now, W is defined as proportional to mcr:

W ∝ mcr → W = β * mcr → Min Z = mcr + β * mcr * nf    (7)

Finally, the objective function is:

Min Z = mcr (1 + β * nf)    (8)

where β is a penalty for having an additional feature (0 ≤ β ≤ 1). Using this objective function, the GA finds the combination of features with the minimum number of features that minimizes both the cost and the misclassification rate. Here, the stopping criterion that ends the GA iterations is a predefined number of iterations. The GA-based feature selection algorithm follows the usual loop of initialization, fitness evaluation, selection, crossover, and mutation.
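A minimal Python sketch of such a GA-based selector, which minimizes the Eq. (8) cost Z = mcr(1 + β * nf), is shown below. The inner classifier (a random forest), the hold-out split, and the GA settings (population size, one-point crossover, bit-flip mutation) are illustrative assumptions, not the authors' exact implementation:

```python
# Hypothetical GA-based feature selection minimizing Z = mcr * (1 + beta * nf).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def fitness(mask, X, y, beta=0.001, seed=0):
    """Eq. (8) cost of a binary feature mask (lower is better)."""
    if mask.sum() == 0:                       # empty subsets are invalid
        return np.inf
    Xtr, Xte, ytr, yte = train_test_split(
        X[:, mask.astype(bool)], y, test_size=0.11, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(Xtr, ytr)
    mcr = 1.0 - clf.score(Xte, yte)           # misclassification rate
    return mcr * (1.0 + beta * mask.sum())

def ga_feature_selection(X, y, pop_size=20, n_iter=10,
                         p_mut=0.3, p_cross=0.7, beta=0.001, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))   # random initial masks
    for _ in range(n_iter):                   # fixed iteration count stops the GA
        scores = np.array([fitness(ind, X, y, beta) for ind in pop])
        parents = pop[np.argsort(scores)[: pop_size // 2]]   # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            if rng.random() < p_cross:        # one-point crossover
                cut = rng.integers(1, n_feat)
                child = np.concatenate([a[:cut], b[cut:]])
            else:
                child = a.copy()
            if rng.random() < p_mut:          # bit-flip mutation
                j = rng.integers(n_feat)
                child[j] = 1 - child[j]
            children.append(child)
        pop = np.vstack([parents] + children)
    scores = np.array([fitness(ind, X, y, beta) for ind in pop])
    return pop[int(np.argmin(scores))]        # best feature mask found
```

In the experiments reported later (Table 8), the population size, mutation rate, crossover rate, and number of iterations of such a loop are the hyperparameters being tuned.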
Fig. 2. (a) medical CT images of lung cancer patients, (b) medical CT images of lung patients other than lung cancer.
4.1. Data collection

Images are collected from a hospital situated in Tehran, Iran. The images used in this study are provided at https://data.mendeley.com/datasets. Part of these lung CT scan images belongs to lung cancer patients and is classified as cancerous images. The rest belongs to other lung diseases, such as patients who caught COVID-19, and is classified as noncancerous images. As early lung cancer symptoms are rare, all the other possible lung diseases are considered noncancerous images to improve lung cancer diagnosis efficiency. For this reason, lung cancer is often not detected in its early stages: at the early stages of tumor growth, most physicians diagnose a disease other than cancer, which allows this type of cancer to progress in the affected person.

The total number of CT scan images used in this paper is 364, of which 238 are cancerous images and the rest (126) are noncancerous images. All these images are collected with the help of a pulmonologist to avoid any probable error in classifying the images. Some of the CT images of the lungs acquired from the hospital database are shown in Fig. 2.

4.2. The implementation results of the first method

As seen in Fig. 2, no pre-processing or segmentation is applied to the raw images when implementing the CNN and ANN. In other words, the raw lung CT scan images are fed as inputs to the CNN and ANN architecture in the first method. Several different structures are evaluated to obtain the best structure to distinguish cancerous CT images from noncancerous ones. The best structure consists of three convolution layers with 64, 64, and 128 feature maps, respectively, in the first, second, and third layers of the convolutional neural network section. The artificial neural network also contains two hidden layers, each containing 128 neurons. This study uses max pooling with dimensions of 2 × 2 after each convolution layer to retain the important features while reducing dimensionality. Fig. 3 shows the graphical structure of the best approach.

The structural model has 63,109,441 trainable parameters, which would have been quadrupled if the max-pooling layers had not been used; this would also have increased the algorithm's execution time. Another point is the number of images used to train and evaluate the algorithm: 324 images were used for training and 40 images for testing. Finally, implementing this structure on the raw (unprocessed) images, an accuracy of 65.81% was obtained with a training loss value of 5.4368, and an accuracy of 62.50% with a testing loss value of 4.0295. The small difference between the two accuracies shows that the model is not overfitted.

4.3. The implementation results of the second method

As shown in Fig. 1, image pre-processing and segmentation are used in the second method before running the CNN and the ANN algorithms. This method aims to determine whether performing image pre-processing and image segmentation affects the performance of the first method. In the first step of image pre-processing, all the image sizes are set to 512 × 512 so that all the pixels in an image remain intact. The next step applies the median filter to remove any possible noise from the resized lung CT images. Fig. 4 shows the images before and after the median filter is performed on both the cancerous and noncancerous lung CT scans. As seen in this figure, the images after the filter are not much different from the images without the filter, which is a characteristic of CT scan images.

Image segmentation tries to help the algorithm diagnose as well as it can by removing unnecessary parts, so that the algorithm can focus only on the lung, which will improve the algorithm's performance. For better understanding, all the steps are applied to the two filtered images in Fig. 4. The masks that cover unnecessary parts of the images are shown in Fig. 5. Then, by placing these masks on the filtered images, only the lung remains visible, as in Fig. 6. As shown in Fig. 6, unnecessary parts are removed. After completing all the above steps, the 364 processed and segmented images are given to the CNN and ANN algorithms. All the first method's steps are repeated from this point on, except that the images are not raw. Similar to the first method, several different structures are evaluated in the second method. The architecture that produced the best result consists of three convolution layers with 64, 64, and 128 feature maps, respectively, in the first, second, and third layers of the CNN section. The ANN also includes two hidden layers, each containing 128 neurons. As mentioned, max pooling with dimensions of 2 × 2 is used after each convolution layer to retain the important image features. Fig. 7 shows the graphical structure of the best structure.

The second method's model has 63,109,441 trainable parameters. It would have quadrupled the number of parameters and increased the algorithm's execution time if the max-pooling layers had not been applied. In this method, as in the first method, 324 lung CT images are used for training and 40 lung CT images are used to evaluate the algorithm's performance. Finally, by implementing this structure on the processed and segmented images, a 100% accuracy with a loss of 3.0576 × 10^-5 for the training set and a 93% accuracy with a loss of 1.096 × 10^-7 for the test set are obtained. The difference between the two accuracies shows that the model is not overfitted. Besides, performing image pre-processing and segmentation affects the first method's performance significantly.

4.4. The implementation results of the third method

As masks are placed on the filtered images in the image segmentation phase, the images' unnecessary parts are covered. Therefore, to reduce the number of columns, the segmented images are resized to 256 × 256. With these dimensions, 2,621,440 data values are generated for each segmented image when the features are extracted.

The filters/features are applied one by one to each image's pixels, and each pixel's calculations are stored in a data frame. To create the numeric dataset, the number of pixels in each image is 65,536, and the total number of filters executed on each pixel is 40. The data values of the original
Fig. 4. (a) The cancerous lung CT image before the median filter (left) and after the median filter (right); (b) the noncancerous lung CT image before the median filter (left) and after the median filter (right).
Fig. 5. The image on the right is a mask for a cancerous lung, and the image on the left is a mask for a noncancerous lung.
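A minimal Python sketch of the pre-processing and lung-masking steps illustrated in Figs. 4–6 (resizing to 512 × 512, median filtering, and a marker-based watershed segmentation whose markers come from a thresholding operation) is given below; the Otsu threshold, the marker construction, and the use of OpenCV/scikit-image are assumptions for illustration:

```python
# Hypothetical pre-processing and lung-mask sketch: resize, median filter,
# threshold-derived markers, watershed, and mask application.
import numpy as np
import cv2
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.segmentation import watershed

def preprocess_and_mask(ct_image: np.ndarray) -> np.ndarray:
    # 1) Resize so every image is 512 x 512 (no detail deliberately dropped)
    img = cv2.resize(ct_image, (512, 512), interpolation=cv2.INTER_AREA).astype(np.uint8)
    # 2) Median filter to cancel any residual noise (Fig. 4)
    img = cv2.medianBlur(img, 3)
    # 3) Thresholding gives the foreground used to seed the markers
    binary = img < threshold_otsu(img)          # lung regions appear dark on CT
    # 4) Marker-based watershed, with markers taken from the thresholded image
    distance = ndi.distance_transform_edt(binary)
    markers, _ = ndi.label(binary)
    labels = watershed(-distance, markers, mask=binary)   # Fig. 5-style mask
    mask = labels > 0
    # 5) Keep only the lung region (Fig. 6)
    return np.where(mask, img, 0)
```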
Fig. 6. The image on the right shows lung cancer after applying the mask, and the image on the left shows noncancerous lung after using the mask.
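The raw images (method one) and the segmented images of Fig. 6 (method two) are fed to the CNN + ANN classifier described in Sections 4.2 and 4.3. A minimal Keras sketch of that structure (three convolution layers with 64, 64, and 128 feature maps, 2 × 2 max pooling after each, and two hidden dense layers of 128 neurons) is given below; kernel sizes, activations, the optimizer, and the output layer are assumptions here, so the trainable-parameter count will not match the reported 63,109,441 exactly:

```python
# Hypothetical CNN + ANN sketch for binary lung CT classification.
from tensorflow.keras import layers, models

def build_cnn_ann(input_shape=(512, 512, 1)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),   # ANN part: two hidden layers
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # cancerous vs. noncancerous
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage with the 324/40 train/test split used in the paper:
# model = build_cnn_ann()
# model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test))
```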
Table 3
Part of the number-of-components output (cumulative proportion of variance explained by the first n components).
n_components Variance explained n_components Variance explained n_components Variance explained
332 0.99108 343 0.99486 354 0.99810
333 0.99145 344 0.99518 355 0.99836
334 0.99181 345 0.99549 356 0.99861
335 0.99217 346 0.99580 357 0.99886
336 0.99252 347 0.99610 358 0.99910
337 0.99287 348 0.99640 359 0.99934
338 0.99322 349 0.99670 360 0.99956
339 0.99356 350 0.99700 361 0.99974
340 0.99389 351 0.99728 362 0.99989
341 0.99422 352 0.99756 363 1
342 0.99454 353 0.99784 364 1
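A minimal scikit-learn sketch of the dimensional-reduction and classification phase is given below; the number of retained PCA components is chosen here by an explained-variance threshold in the spirit of Table 3, and the 40-image test split and default classifier settings are illustrative assumptions:

```python
# Hypothetical PCA/LDA reduction followed by GBC, RFC, and SVC classification.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def reduce_and_classify(X, y, variance=0.99, seed=0):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=40, random_state=seed)
    # PCA keeps as many components as needed to reach the requested variance
    pca = PCA(n_components=variance).fit(Xtr)
    Xtr_p, Xte_p = pca.transform(Xtr), pca.transform(Xte)
    # LDA projects onto at most (n_classes - 1) = 1 discriminant direction
    lda = LinearDiscriminantAnalysis().fit(Xtr, ytr)
    Xtr_l, Xte_l = lda.transform(Xtr), lda.transform(Xte)
    results = {}
    for name, clf in [("GBC", GradientBoostingClassifier(random_state=seed)),
                      ("RFC", RandomForestClassifier(random_state=seed)),
                      ("SVC", SVC())]:
        results[("PCA", name)] = accuracy_score(yte, clf.fit(Xtr_p, ytr).predict(Xte_p))
        results[("LDA", name)] = accuracy_score(yte, clf.fit(Xtr_l, ytr).predict(Xte_l))
    return results
```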
Table 5
Confusion matrix of GBC, RFC, and SVC on the PCA dataset.
(rows: predicted value, columns: actual value)
GBC: [[9, 2], [0, 29]]   RFC: [[4, 7], [0, 29]]   SVC: [[0, 11], [0, 29]]
Fig. 9. The performance of GBC, RFC, and SVC on the PCA dataset.
Fig. 10. The performance of GBC, RFC, and SVC on the LDA dataset.
Table 6
Performance measurements of GBC, RFC, and SVC on the LDA dataset.
Methods Accuracy Precision Recall F1-score
GBC 0.78 0.84 0.78 0.79
RFC 0.78 0.84 0.78 0.79
SVC 0.95 0.96 0.95 0.95

As shown in Table 9, the RFC algorithm achieves the highest accuracy when executed on the dataset obtained from the GA. As is clear from its confusion matrix in Table 10, this algorithm correctly classified 34 of the 40 test images.

5. Sensitivity analyses and comparative study
Fig. 11. ROC curves for three classifiers on the PCA dataset.
Fig. 12. ROC curves for three classifiers on the LDA dataset.
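The evaluation numbers reported in Tables 5–7 and 9 and plotted in Figs. 9–15 (accuracy, precision, recall, F1-score, confusion matrices, and ROC curves) can be computed with scikit-learn along the following lines; y_true, y_pred, and y_score are placeholders for the test labels, the predicted labels, and the predicted scores of a fitted classifier:

```python
# Hypothetical evaluation sketch for the reported performance measurements.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_curve, auc)

def evaluate(y_true, y_pred, y_score):
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
    fpr, tpr, _ = roc_curve(y_true, y_score)  # y_score: probability of "cancerous"
    metrics["auc"] = auc(fpr, tpr)
    return metrics
```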
Table 7
Confusion matrix of GBC, RFC, and SVC on the LDA dataset.
(rows: predicted value, columns: actual value)
GBC: [[10, 1], [8, 21]]   RFC: [[10, 1], [8, 21]]   SVC: [[11, 0], [2, 27]]
or its complexity does not necessarily affect the result. However, the difference between implementing complex structures and simple ones is that the smaller the number of parameters, the greater the effort needed to achieve the best possible response; on the other hand, a lower number of parameters implies a shorter execution time for the algorithm.

Tables 11–14 show the sensitivity analyses of the first two methods: the convolutional and artificial neural networks on raw images, and the convolutional and artificial neural networks on the pre-processed and segmented images. As mentioned, the complexity of the structures was increased up to the point where no further improvement was observed. It is necessary to consider the accuracy and loss function of the test and train sets simultaneously to achieve the desired result.

As shown in Table 11, various convolution layers with different detection filters are utilized. In each of them, multiple nodes in different hidden layers are also used to build a structural architecture. For example, as seen in the last row of Table 11, three convolution layers with 64, 64, and 128 detection filters are used to build the CNN structural architecture, and the ANN structural architecture uses three hidden layers containing 128, 128, and 256 neurons. In conclusion, as the accuracy rate decreased and the loss function for both train and
Table 8
Hyperparameters tuning process (the bold values show the best hyperparameters concerning the cost function value, and the underlining
values show the second best hyperparameters).
Max iteration Population % Mutation % Crossover Time (s) Cost function # of selected features
10 20 0.3 0.7 48,235.780 0.12890 172
10 20 0.4 0.7 27,149.952 0.16745 194
10 20 0.5 0.7 21,853.446 0.16745 194
10 20 0.3 0.8 22,557.235 0.14843 187
10 20 0.4 0.8 32,168.301 0.16738 191
10 20 0.5 0.8 21,063.200 0.16745 194
10 20 0.3 0.9 21,245.173 0.14843 187
10 20 0.4 0.9 47,362.354 0.16745 194
10 20 0.5 0.9 42,417.132 0.16745 194
10 50 0.3 0.7 40,130.765 0.12464 172
10 50 0.4 0.7 85,919.442 0.16738 191
10 50 0.5 0.7 103,398.083 0.16738 191
10 50 0.3 0.8 52,323.353 0.16738 191
10 50 0.4 0.8 91,157.160 0.16738 191
10 50 0.5 0.8 63,885.141 0.16738 191
10 50 0.3 0.9 66,239.779 0.13741 178
10 50 0.4 0.9 109,868.125 0.15745 187
10 50 0.5 0.9 106,193.759 0.15745 187
10 80 0.3 0.7 86,121.910 0.16738 191
10 80 0.4 0.7 91,011.980 0.16738 191
10 80 0.5 0.7 120,447.483 0.16738 191
10 80 0.3 0.8 74,341.422 0.16738 191
10 80 0.4 0.8 150,442.208 0.16738 191
10 80 0.5 0.8 104,947.371 0.16738 191
10 80 0.3 0.9 111,308.229 0.16738 191
10 80 0.4 0.9 114,145.831 0.16738 191
10 80 0.5 0.9 122,187.983 0.16738 191
test data increased, the implementation of more complex structures stopped. Consequently, the fifth structure in Table 11 is determined as the best answer to this problem.

As presented in Table 12, various convolution layers with different detection filters are utilized. In each of them, multiple neurons in a different number of hidden layers are also used to build a structural architecture. In the end, the implementation of a more complex structure is canceled, and the fifth row of Table 12 is determined as the best solution to the problem at hand.

Regarding the third method's sensitivity analysis, the whole process of this method is described in Section 4. Still, a brief explanation of this method's sensitivity analysis is not without merit here. The goal of the third method was to analyze the numerical data obtained after extracting features from the images. After extracting the features, the two dimensional-reduction methods (PCA and LDA) are used to examine which one performs better.
Fig. 14. The performance measurements criteria of GBC, RFC, and SVC on the derived dataset from the GA.
Fig. 15. ROC curves for three classifiers on the derived dataset from the GA.
Table 9
Performance measurements of GBC, RFC, and SVC on the dataset derived from the GA.
Methods Accuracy Precision Recall F1-score
GBC 0.82 0.83 0.82 0.83
RFC 0.85 0.88 0.85 0.83
SVC 0.73 0.53 0.72 0.61

Following each of them, the supervised machine learning algorithms (GBC, RFC, and SVC) are implemented separately to see how performing only these three feature-extraction steps reduces the dimensions and what results are obtained using the machine learning algorithms. Table 13 presents the results of these three steps and the accuracy of each algorithm. It should be noted that other evaluation criteria are given in Tables 4–7, discussed previously in Section 4.

As shown in Table 13, applying the above three steps, the SVC on the LDA dataset and the GBC on the PCA dataset offer the best performance in diagnosing lung cancer. However, in the third method, we did not settle for these results and tried different ways to improve the lung cancer diagnosis. Thus, the genetic feature selection algorithm is implemented to see if there is an improvement in the diagnostic determination of the disease. As mentioned in Section 4, it is impossible to implement the genetic feature selection algorithm on the LDA dimensional reduction method; one can only use it on the dataset obtained from the PCA dimensional reduction method. Table 14
Table 10
Confusion matrix of GBC, RFC, and SVC on the dataset derived from the GA.
(rows: predicted value, columns: actual value)
GBC: [[8, 3], [4, 25]]   RFC: [[5, 6], [0, 29]]   SVC: [[0, 11], [0, 29]]
Table 11
Sensitivity analysis of performing CNN and ANN methods on raw images.
Num. of convolution layers | Num. of filter detections (a) | Num. of hidden layers | Num. of nodes in hidden layers (b) | Num. of total params | Train accuracy (%) | Train loss function | Test accuracy (%) | Test loss function | Time (s)
1 64 1 128 532,686,849 65.71 5.5623 62.5 4.0295 312,418.58
2 64-64 1 128 130,095,169 65.75 5.5452 62.5 4.0295 76,300.26
2 64-64 2 128-128 130,111,681 65.77 5.5204 62.5 2.0148 76,309.95
3 64-64-128 1 128 63,092,929 65.8 5.4398 62.5 4.0295 37,003.74
3 64-64-128 2 128-128 63,109,441 65.81 5.4368 62.5 4.0295 37,013.42
3 64-64-128 3 128-128-256 63,142,593 65.8 5.5018 62.5 12.0886 37,032.86
a The number of feature detectors is listed in layers.
b
The number of neurons is listed in layers.
Table 12
Sensitivity analysis of performing CNN and ANN methods on processed and segmented images.
Num. of convolution layers | Num. of filter detections (a) | Num. of hidden layers | Num. of nodes in hidden layers (b) | Num. of total params | Train accuracy (%) | Train loss function | Test accuracy (%) | Test loss function | Time (s)
1 64 1 128 532,686,849 99.52 0.019 92.5 0.754 307,915.06
2 64-64 1 128 130,095,169 99.58 0.023 92.5 1.5115 75,200.39
2 64-64 2 128-128 130,111,681 99.92 0.0024 93 1.12E−07 75,209.94
3 64-64-128 1 128 63,092,929 99.95 0.0012 93 1.84E−07 36,470.32
3 64-64-128 2 128-128 63,109,441 100 0.00003 93 1.09E−07 36,479.87
3 64-64-128 3 128-128-256 63,142,593 100 0.00004 93 5.13E−07 36,499.03
a
The number of feature detectors is listed in layers.
b The number of neurons is listed in layers.
Table 13
Sensitivity analysis of the two dimensional-reduction methods (PCA and LDA).
Feature extraction
LDA dimensional reduction | PCA dimensional reduction | Algorithms
78% | 95% | GBC
78% | 82% | RFC
95% | 73% | SVC

Table 14
Sensitivity analysis of the implementation of machine learning algorithms before and after the implementation of the genetic feature selection algorithm.
PCA dimensional reduction
After GA feature selection | Before GA feature selection | Algorithms
82% | 95% | GBC
85% | 82% | RFC
73% | 73% | SVC
5.2.3. Method 3: Numerical features of the pre-processed and segmented images and employing machine learning algorithms

In this method, all the pre-processing and segmentation phases are the same as in the second method. After these two phases, the numerical features are extracted from the images and converted to numerical data in a data frame. This dataset's dimensions are 364 × 2,621,442, which is referred to as large or big data. As the size of this dataset is large, dimensional reduction algorithms are first used, and then a genetic feature selection algorithm is utilized to increase the processing speed. Next, machine learning algorithms are employed to classify cancerous and noncancerous images. The GBC algorithm obtains the best results with 95% accuracy when implemented on the PCA dataset, the SVC algorithm with 95% accuracy when implemented on the LDA dataset, and the RFC algorithm with 85% accuracy when performed on the dataset obtained from the genetic feature selection.

6. Discussion and conclusion

In this study, three different methods were used to diagnose lung cancer in its early stages. One of the most significant achievements of this research was the comparability of all three methods. Methods are comparable only when all input images are the same in each method, as in this paper. Another comparable feature was the use of the same amount of data for the training and testing sets: 324 images for training and the remaining 40 images for testing in all investigations. Another contribution of this research is the implementation of the third method. This method extracts numerical features from the pre-processed and segmented lung CT images. The dimensional reduction algorithms (PCA and LDA) are applied to the obtained dataset, GA feature selection is utilized, and supervised machine learning algorithms (GB, RF, SVC) are performed.

The comparison analysis showed that the third method had the best performance (95% accuracy for the testing set). Since the third method is one of the main contributions of this paper, it is notable that it achieves the best accuracy with two different machine learning classification algorithms. The results illustrated that separately applying GB to the PCA dataset and SVC to the LDA dataset gives the best performance, with 95% accuracy in both.

Future work can improve the accuracy of the cancer diagnosis by executing GA feature selection before dimensional reduction. Using different algorithms for feature selection can probably improve the accuracy of cancer detection. Moreover, extracting more and different features in the feature extraction step may positively impact the system's accuracy. Furthermore, other cancers kill countless people every year. Therefore, given that this research method's implementation has yielded promising results, we will try to apply the best approach to various cancers and diseases in future work.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] WHO, Cancer, 2021, Retrieved from https://www.who.int/news-room/fact-sheets/detail/cancer.
[2] B.C. Bade, C.S.D. Cruz, Lung cancer 2020: epidemiology, etiology, and prevention, Clin. Chest Med. 41 (1) (2020) 1–24.
[3] C.R. MacRosty, M.P. Rivera, Lung cancer in women: A modern epidemic, Clin. Chest Med. 41 (1) (2020) 53–65.
[4] A.S. Ahmad, A.M. Mayya, A new tool to predict lung cancer based on risk factors, Heliyon 6 (2) (2020) e03402.
[5] C. Lynch, B. Abdollahi, J. Fuqua, A. deCarlo, J. Bartholomai, R. Balgemann, H. Frieboes, Prediction of lung cancer patient survival via supervised machine learning classification techniques, Int. J. Med. Inform. 108 (2017) 1–8, http://dx.doi.org/10.1016/j.ijmedinf.2017.09.013.
[6] W. Raghupathi, Data mining in health care, Healthc. Inform.: Improv. Effic. Prod. 211 (2010) 223.
[7] D. Tomar, A survey on data mining approaches for healthcare, Int. J. Bio-Sci. Bio-Technol. 5 (2013) 241–266, http://dx.doi.org/10.14257/ijbsbt.2013.5.5.25.
[8] R. Golan, C. Jacob, J. Denzinger, Lung nodule detection in CT images using deep convolutional neural networks, in: Paper presented at the 2016 International Joint Conference on Neural Networks, IJCNN, 2016.
[9] D.P. Kaucha, P.W. Prasad, A. Alsadoon, A. Elchouemi, S. Sreedharan, Early detection of lung cancer using SVM classifier in biomedical image processing, in: Paper presented at the 2017 IEEE International Conference on Power, Control, Signals and Instrumentation Engineering, ICPCSI, 2017.
[10] M.B. Miah, M. Yousuf, Detection of lung cancer from CT image using image processing and neural network, in: Paper presented at the International Conference on Electrical Engineering and Information Communication Technology, ICEEICT, JU, Savar, Dhaka, Bangladesh, 2015.
[11] P.M. Shakeel, M.A. Burhanuddin, M.I. Desa, Lung cancer detection from CT image using improved profuse clustering and deep learning instantaneously trained neural networks, Measurement 145 (2019) 702–712.
[12] M. Vas, A. Dessai, Lung cancer detection system using lung CT image processing, in: Paper presented at the 2017 International Conference on Computing, Communication, Control and Automation, ICCUBEA, 2017.
[13] N. Zayed, H. Elnemr, Statistical analysis of haralick texture features to discriminate lung abnormalities, Int. J. Biomed. Imaging 2015 (2015) 1–7, http://dx.doi.org/10.1155/2015/267807.
[14] N. Maleki, Y. Zeinali, S.T.A. Niaki, A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection, Expert Syst. Appl. 164 (2021) 113981.
[15] Y. Zeinali, S.T.A. Niaki, Heart sound classification using signal processing and machine learning algorithms, Mach. Learn. Appl. 7 (2022) 100206, http://dx.doi.org/10.1016/j.mlwa.2021.100206.
[16] A.H. Chen, C. Yang, The improvement of breast cancer prognosis accuracy from integrated gene expression and clinical data, Expert Syst. Appl. 39 (5) (2012) 4785–4795, http://dx.doi.org/10.1016/j.eswa.2011.09.144.
[17] W. Cherif, Optimization of K-NN algorithm by clustering and reliability coefficients: application to breast-cancer diagnosis, Procedia Comput. Sci. 127 (2018) 293–299, http://dx.doi.org/10.1016/j.procs.2018.01.125.
[18] P. Kr, N. Aradhya, Lung cancer survivability prediction based on performance using classification techniques of support vector machines, C4.5 and naive Bayes algorithms for healthcare analytics, Procedia Comput. Sci. 132 (2018) 412–420, http://dx.doi.org/10.1016/j.procs.2018.05.162.
[19] B. Zheng, S.W. Yoon, S. Lam, Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms, Expert Syst. Appl. 41 (4) (2013) 1476–1482, http://dx.doi.org/10.1016/j.eswa.2013.08.044.
[20] M.I. Razzak, S. Naz, A. Zaib, Deep learning for medical image processing: Overview, challenges and the future, in: Classification in BioApps, Springer, 2018, pp. 323–350.
[21] S. KL, S.N. Mohanty, K. S, N. A, G. Ramirez, Optimal deep learning model for classification of lung cancer on CT images, Future Gener. Comput. Syst. 92 (2019) 374–382, http://dx.doi.org/10.1016/j.future.2018.10.009.
[22] T.L. Chaunzwa, A. Hosny, Y. Xu, A. Shafer, N. Diao, M. Lanuti, H.J. Aerts, Deep learning classification of lung cancer histology using CT images, Sci. Rep. 11 (1) (2021) 1–12.
[23] M.A. Khan, S. Rubab, A. Kashif, M.I. Sharif, N. Muhammad, J.H. Shah, S.C. Satapathy, Lungs cancer classification from CT images: An integrated design of contrast based classical features fusion and selection, Pattern Recognit. Lett. 129 (2020) 77–85, http://dx.doi.org/10.1016/j.patrec.2019.11.014.
[24] M. Toğaçar, B. Ergen, Z. Cömert, Detection of lung cancer on chest CT images using minimum redundancy maximum relevance feature selection method with convolutional neural networks, Biocybern. Biomed. Eng. 40 (1) (2020) 23–39, http://dx.doi.org/10.1016/j.bbe.2019.11.004.
[25] I. Bonavita, X. Rafael-Palou, M. Ceresa, G. Piella, V. Ribas, M.A.G. Ballester, Integration of convolutional neural networks for pulmonary nodule malignancy assessment in a lung cancer classification pipeline, Comput. Methods Programs Biomed. 185 (2020) 105172.
[26] D. Moitra, R.K. Mandal, Classification of non-small cell lung cancer using one-dimensional convolutional neural network, Expert Syst. Appl. 159 (2020) 113564.
[27] A.Y. Saleh, C.K. Chin, V. Penshie, H.R.H. Al-Absi, Lung cancer medical images classification using hybrid CNN-SVM, Int. J. Adv. Intell. Inform. 7 (2) (2021) 151–162.
[28] P. Nanglia, S. Kumar, A.N. Mahajan, P. Singh, D. Rathee, A hybrid algorithm for lung cancer classification using SVM and neural networks, ICT Express 7 (3) (2021) 335–341.
[29] Y. Onishi, A. Teramoto, M. Tsujimoto, T. Tsukamoto, K. Saito, H. Toyama, H. Fujita, Multiplanar analysis for pulmonary nodule classification in CT images using deep convolutional neural network and generative adversarial networks, Int. J. Comput. Assist. Radiol. Surg. 15 (2020) 173–178.
[30] Y.-S. Huang, P.-R. Chou, H.-M. Chen, Y.-C. Chang, R.-F. Chang, One-stage pulmonary nodule detection using 3-D DCNN with feature fusion and attention mechanism in CT image, Comput. Methods Programs Biomed. 220 (2022) 106786.
[31] C. Chen, R. Xiao, T. Zhang, Y. Lu, X. Guo, J. Wang, Z. Wang, Pathological lung segmentation in chest CT images based on improved random walker, Comput. Methods Programs Biomed. 200 (2021) 105864.
[32] S. Makaju, P.W.C. Prasad, A. Alsadoon, A.K. Singh, A. Elchouemi, Lung cancer detection using CT scan images, Procedia Comput. Sci. 125 (2018) 107–114, http://dx.doi.org/10.1016/j.procs.2017.12.016.
[33] K. Odajima, A. Pawlovsky, A detailed description of the use of the kNN method for breast cancer diagnosis, in: Paper presented at the 2014 7th International Conference on Biomedical Engineering and Informatics, 2014.
[34] N.W.P. Septiani, R. Wulan, M. Lestari, Breast cancer detection using data mining classification methods, Proc. ICMETA 1 (1) (2017) 185–191.
[35] K. Balaji, K. Lavanya, Chapter 5 - Medical image analysis with deep neural networks, in: A.K. Sangaiah (Ed.), Deep Learning and Parallel Computing Environment for Bioengineering Systems, Academic Press, 2019, pp. 75–97.
[36] S. Marčelja, Mathematical description of the response of simple cortical cells, J. Opt. Soc. Amer. 70 (1980) 1297–1300, http://dx.doi.org/10.1364/JOSA.70.001297.
[37] I. Sobel, G. Feldman, A 3×3 isotropic gradient operator for image processing, Pattern Classif. Scene Anal. 27 (1973) 1–272.
[38] I. Sobel, An isotropic 3×3 image gradient operator, Mach. Vis. Three-Dimens. Scenes (1990) 376–379.
[39] R.A. Haddad, A.N. Akansu, A class of fast Gaussian binomial filters for speech and image processing, IEEE Trans. Signal Process. 39 (3) (1991) 723–727.
[40] L.S. Davis, A survey of edge detection techniques, Comput. Graph. Image Process. 4 (3) (1975) 248–270, http://dx.doi.org/10.1016/0146-664X(75)90012-X.