
Healthcare Analytics 3 (2023) 100150

Contents lists available at ScienceDirect

Healthcare Analytics
journal homepage: www.elsevier.com/locate/health

An intelligent algorithm for lung cancer diagnosis using extracted features from Computerized Tomography images

Negar Maleki a, Seyed Taghi Akhavan Niaki b,∗

a School of Information Systems and Management, Muma College of Business, University of South Florida, Tampa, FL, USA
b Department of Industrial Engineering, Sharif University of Technology, PO Box 11155-9414, Azadi Ave., Tehran 1458889694, Iran

ARTICLE INFO

Keywords: Machine learning; Dimension reduction; Genetic algorithm; Lung cancer; Image processing; Feature extraction

ABSTRACT

According to the World Health Organization, lung cancer is a leading cause of death worldwide. This research aims to process the Computerized Tomography (CT) images of lung cancer patients for the early diagnosis of the disease. The images are processed using a Convolutional Neural Network (CNN) in the first approach, where an Artificial Neural Network (ANN) is employed to classify the images. In the second approach, the images are pre-processed and segmented before utilizing CNN and ANN. In the third method, all the pre-processed and segmented images are converted to numerical data via specific feature extraction algorithms in the last step. Besides, dimensional reduction and feature selection algorithms are employed to classify with three machine learning techniques, i.e., Gradient Boosting (GB), Random Forest (RF), and Support Vector Machine (SVM). An extensive comparative analysis is made to come up with the best technique. The comparisons are made by evaluating the methodologies on a set of lung CT scan images collected from a medical center. The results show that when either SVM or RF classification techniques are used, a 95% accuracy is obtained in diagnosing lung cancer.

1. Introduction

According to the World Health Organization (WHO), cancer is a leading cause of death worldwide. In 2020, nearly 10 million people died of various cancers, and about 70% of cancer deaths occur in low- and middle-income countries. The economic impact of cancer is significant and increasing, with the total annual cost of cancer treatment in 2010 being about $1.16 trillion. According to WHO statistics, the most common causes of cancer death in 2020 were lung (1.8 million deaths), colon and rectum (935,000 deaths), liver (830,000 deaths), stomach (769,000 deaths), and breast (686,000 deaths) [1].

Given that lung cancer is at the top of this list, we reviewed its causes to compile a checklist of the factors that affect it. Diagnosing the disease in its early stages is complicated: its symptoms resemble those of respiratory infections, and there may be no symptoms at all at first. Although the disease can affect anyone, lung cancer is more likely to occur in smokers. Diagnosis in the early stages can save a patient's life, as lung cancer cells can spread to other organs before a doctor can diagnose them, and cancer metastasis makes treatment much more difficult.

Research on lung cancer cases indicates that smoking is the most crucial cause of this disease, and the disease is more common in women today than in the past. Historically, because women smoked less, a lower incidence rate of this disease was recorded in women than in men [2,3]. Other contributing factors include age, gender, race, socioeconomic status, occupational and environmental exposures, chronic lung disease, air pollution, individual genetics, obesity, exposure to secondhand smoke, dry cough, alcohol consumption, and diet. Even people's lifestyles help to spread the disease [2,4]. Considering these factors, one can take a big step toward the early detection of this disease.

With the growth of health data, its management, analysis, and the resulting decision-making have become very challenging in recent years. Moreover, as the population increases, the medical community faces many problems in dealing with and diagnosing various diseases. Thus, conducting experiments imposes enormous costs on the relevant organizations [5].

Given the sheer volume of data and the various occupations of physicians, the possibility of errors in their decisions is very high. Thus, data mining algorithms can greatly help the medical community and patients. However, it should be noted that these methods confirm the doctor's opinion and have little reliability alone [6]. Many diagnostic tests have devastating effects on the patient's body and cost a lot of money, which is a solid reason to use data mining techniques to diagnose the disease [7]. Besides, data mining methods reveal hidden inter-dependencies in the data that can take years to find with classical methods.

While an increasing body of work in the literature investigates lung cancer, this paper is distinguished, firstly, in that the Computerized

∗ Corresponding author.
E-mail addresses: [email protected] (N. Maleki), [email protected], [email protected] (S.T.A. Niaki).

https://doi.org/10.1016/j.health.2023.100150
Received 5 December 2022; Received in revised form 9 February 2023; Accepted 17 February 2023

2772-4425/© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

Tomography (CT) scan images are collected from a hospital in Tehran, Iran, with a critical difference. In previous studies [8–13], researchers used healthy-lung versus cancerous-lung CT images to diagnose lung cancer. However, we gather our data from a different perspective. In our dataset, CT scan images are divided into two main categories: cancerous CT images and noncancerous ones. As the name of the cancerous category shows, patients who struggle with lung cancer are placed in this category. The noncancerous class, however, includes healthy lungs and lungs with diseases other than lung cancer, e.g., COVID-19.

Second, we implement three approaches to determine which would help physicians diagnose the disease early. The images are processed using a Convolutional Neural Network (CNN) in the first approach, where an Artificial Neural Network (ANN) is employed to classify the images. In the second approach, the images are pre-processed and segmented before utilizing CNN and ANN. In the third method, all the pre-processed and segmented images are converted to numerical data via specific feature extraction algorithms in the last step. Besides, dimensional reduction and feature selection algorithms are employed to classify with three machine learning techniques, i.e., Gradient Boosting (GB), Random Forest (RF), and Support Vector Machine (SVM). To our knowledge, this is the first time CT images are converted to numerical features with dimensional reduction and feature selection applied to them. The results show that our proposed framework outperforms the alternatives and reaches 95% accuracy in diagnosing lung cancer.

The rest of this paper is organized as follows. Section 2 reviews what has been done so far to diagnose diseases with machine learning algorithms and establishes the importance of this field in medical science. Section 3 explains the three methodologies and their phases in detail. In Section 4, the methods are implemented step by step to illustrate their results. The methods' comparison, their sensitivity analysis, and the strengths and weaknesses of each technique are discussed in Section 5. Finally, the best approach is introduced, and future works are presented in Section 6.

2. Literature review

Previous research works in this field are divided into two subsections for better insight into the difference between machine learning and deep learning algorithms on lung and other cancers. This review also shows the importance of data mining in medical science and determines the research gap.

2.1. Cancer diagnosis using machine learning algorithms

Maleki et al. [14] used the k-Nearest Neighbor (kNN) algorithm on a lung cancer dataset. They applied feature selection to the dataset and came up with the six most essential features among the dataset features. This paper uses the same feature selection algorithms in our proposed framework to derive the important features [14,15]. While an increasing body of work has applied SVM and decision tree algorithms to diagnose different types of cancer [5,16–19], little work has used the Gradient Boosting algorithm for this purpose. To this end, we use two traditional algorithms from the literature (SVM and Decision Tree) along with GB in our framework.

The healthcare system is quite different from other industries, so it has a higher priority and different customers. In other words, regardless of the costs, patients expect the highest level of treatment and service. Since machine and deep learning have been successful in different areas and have provided precise solutions, they are considered fundamental methods for solving health problems [20].

2.2. Cancer diagnosis using deep learning algorithms

Rapid progress in using machine learning algorithms in medicine has been seen in the literature. Compared to deep learning algorithms, machine learning algorithms are time-consuming and require expert knowledge to adjust their characteristics. However, deep learning algorithms can capture raw data, automatically adjust the features, and analyze and learn the data more quickly [20]. Although deep learning algorithms are powerful, they need an adequate amount of data to show their power. While there is no specific rule saying how much data is enough to achieve good results, a rule of thumb says the more data, the better the result. The literature shows that, depending on the data source, researchers utilized different amounts of data in their analyses. For example, KL et al. [21] used 100 CT images from an online source, Chaunzwa et al. [22] employed 331 CT images from Massachusetts General Hospital (MGH), Khan et al. [23] utilized 2101 CT images from the Kaggle website, and Toğaçar et al. [24] used 100 CT images from the Cancer Imaging Archive (CIA). Collecting real-life data is therefore not an easy task, and it requires a lot of effort. In this paper, we collect 364 real-life CT images from a hospital in Iran, so comparing the number of images we analyzed with previous works on similar data sources, we used a medium- to large-scale dataset in our analysis.

Radiologists take pulmonary CT images to diagnose and evaluate tumor growth. Visual interpretation of these data, however, identifies the tumor only in the final stages of tumor growth, and treatment at that stage only increases the mortality rate in this type of cancer. As a result, diagnosing the tumor in the early stages of the disease is vital, which can be done by analyzing the images [12]. In [8], 1018 pulmonary CT images were examined. In that study, researchers used a convolutional neural network method without processing or segmenting images, obtaining a sensitivity of 78.9% with twenty false positives per scan and 71.2% with ten false positives per scan. Therefore, in this paper, we considered pursuing this approach and comparing the result with our framework. The result shows that our framework reaches 95% accuracy.

There is a large body of work using a combination of CNN with other machine learning or deep learning algorithms. For instance, Bonavita et al. [25] and Moitra and Mandal [26] used CNN alone to detect lung cancer, Saleh et al. [27] and Nanglia et al. [28] employed CNN and SVM, and Onishi et al. [29] and Huang et al. [30] applied Deep CNN (DCNN) to detect or classify lung cancer in CT images. Hence, CNNs play a massive role in detecting or classifying lung cancer. Based on the previous studies' limitations, there are still opportunities to develop combinations of CNN with other machine learning or deep learning algorithms. In this paper, we use a combination of CNN with ANN as a deep learning method and GB, RF, and SVM as machine learning methods.

Despite problems in the image segmentation phase, many researchers [9–11,13,31] used image segmentation after the pre-processing step to identify the touching objects in the image. One of these segmentation algorithms is watershed segmentation, which was used in [32] after reviewing many different approaches. Therefore, in this study, we use the watershed segmentation algorithm to identify the most critical objects in the CT images.

According to the literature shown in Table 1, a small number of studies used pre-processing methods and data dimensional reduction. However, using these techniques can significantly improve the efficiency of diagnostic and predictive algorithms. Moreover, powerful algorithms such as GB or RF have rarely been used for classifying diseases, even though these methods are potent and may produce better results.

One of the most critical gaps in the literature is the inability to compare the methods used to diagnose lung cancer. The most reliable way to compare methods is to use the same data. Therefore, this paper compares three different methods and shows that our proposed framework outperforms the other two in diagnosing lung cancer. In the first method, unprocessed CT images are considered input to the CNN and ANN structural architecture. In the second method, the same tasks are applied, with the difference that pre-processed and segmented CT images are used as input. Finally, in the third method, the proposed


Table 1
The comparison table of the literature review. Columns: Authors; Year; Pre-processing; Feature selection; Dimensional reduction; Techniques (GA-SVM, K-means, ANN, KNN, CNN, SVM, C4.5, LDA, PCA, NBs, GA, Gradient boosting, Random forest, Deep learning, Decision tree); Disease; Data type.
Chen and Yang [16] 2013 ✓ Breast Cancer Numerical
Zheng et al. [19] 2013 ✓ ✓ Breast Cancer Numerical
Odajima and Pawlovsky [33] 2014 ✓ Breast Cancer Numerical
Lynch et al. [5] 2017 ✓ ✓ Lung Cancer Numerical
Septiani et al. [34] 2017 ✓ ✓ Breast Cancer Numerical
Cherif [17] 2018 ✓ ✓ ✓ ✓ Breast Cancer Numerical
Kr and Aradhya [18] 2018 ✓ ✓ ✓ Lung Cancer Numerical
Maleki et al. [14] 2021 ✓ ✓ Lung Cancer Numerical
Zayed and Elnemr [13] 2015 ✓ Breast Cancer Image
Miah and Yousuf [10] 2015 ✓ ✓ ✓ Lung Cancer Image
Golan et al. [8] 2016 ✓ ✓ Lung Cancer Image
Kaucha et al. [9] 2017 ✓ ✓ ✓ ✓ Lung Cancer Image
Makaju et al. [32] 2018 ✓ ✓ ✓ Lung Cancer Image
Shakeel et al. [11] 2019 ✓ ✓ ✓ Lung Cancer Image
Onishi et al. [29] 2020 ✓ ✓ ✓ Lung Cancer Image
Saleh et al. [27] 2021 ✓ ✓ ✓ Lung Cancer Image
Nanglia et al. [28] 2021 ✓ ✓ ✓ Lung Cancer Image
Huang et al. [30] 2022 ✓ ✓ ✓ Lung Cancer Image
This paper ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Lung Cancer Image/Numerical

framework, we extract numerical data with the help of 40 features from each pixel in the segmented CT images to form a data frame, to which we apply two different dimensional reduction algorithms (Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)), a feature selection algorithm (Genetic Algorithm (GA)), and machine learning algorithms (GB, RF, and SVM).

3. The proposed lung cancer diagnosis methodologies

The methodology adopted in this paper is carried out in three different methods, shown in Fig. 1. This study's problem is diagnosing whether the patient has cancer at an early stage. As is clear from the research purpose, the target variable is discrete, so classification algorithms are needed to identify it.

According to Fig. 1, the first method comprises one fundamental building phase called image classification: the raw CT images are given to a CNN followed by an ANN without any preprocessing (raw CT images go through the third phase, the blue rectangle, immediately). The second method includes three primary building phases: image pre-processing, image segmentation, and image classification (according to Fig. 1, raw CT images go through the first (yellow rectangle), second (green rectangle), and third (blue rectangle) phases, respectively). Finally, the third method comprises seven fundamental phases: image pre-processing, image segmentation, image feature extraction, building a numerical dataset, dimensional reduction, feature selection, and classification (according to Fig. 1, raw CT images go through the first to eighth phases, respectively). Although these methods have fundamental phases in common, they are entirely different methods implemented on the same lung CT scan images.

The image pre-processing phase is itself composed of two parts: image resizing and image denoising. Initially, raw lung CT images are resized, and subsequently, the median filter is applied to denoise them. In the image segmentation phase, the watershed segmentation algorithm identifies the most critical objects in the CT images to make the following steps more reliable. In the image classification phase, raw CT images in method one and segmented CT images in method two are used as input. The CNN and ANN algorithms are applied to the CT images to classify whether the images belong to a cancerous or noncancerous patient.

So far, the fundamental phases implemented in methods one and two have been described. In the third method, the image feature extraction phase is applied to the segmented lung CT images to extract the possible numerical features from each image's pixels. The extracted statistical data for each image is stored in a dataset, completing the building-a-numerical-dataset phase. We obtained a vast dataset, as we extracted every possible mathematical feature from the CT images to improve the separation of cancerous patients from noncancerous ones. Therefore, the dimensional reduction phase, composed of two different algorithms, PCA and LDA, is utilized to reduce the dataset dimensions and produce two separate datasets, one from the PCA algorithm and the other from the LDA algorithm. From this point on, the experiment continues in two parts. In part one, classification algorithms – GB, RF, and SVM – are applied to both the LDA and PCA datasets separately to classify the data into two groups based on the extracted features. In part two, the genetic algorithm is first applied in the feature selection phase to select the essential components in the PCA dataset. Subsequently, the same classification algorithms are used to classify the data into two groups based on the selected features. The following subsections describe each fundamental phase in detail.

3.1. Image pre-processing

The term pre-processing refers to a series of tasks needed to enhance the quality of raw images, increasing the performance of the subsequent phases such as image segmentation, image feature extraction, and classification. The primary objectives of this phase in our study are image resizing and denoising.

3.1.1. Image resizing

In the image resizing step, pixels are either added to or removed from the image. Since medical images have many details that may be informative, no pixel removal is performed in the current research. On the other hand, the CNN algorithm's input images should preferably be square for a better diagnostic process. Therefore, all images used in this study have dimensions of 512 × 512 pixels.

3.1.2. Image denoising

Image denoising is the process of applying filter(s) to reduce image noise. It should be noted that CT images have the lowest noise level among medical images, making this step practically unnecessary. In any case, to prevent image distortion, the median filter is applied to cancel any possible noise in the lung CT images.

3.2. Image segmentation

Image segmentation tries to help the algorithm diagnose as well as it can by removing unnecessary parts. In fact, at this point, the algorithm can focus only on the lung region, which improves the classification performance. Watershed segmentation is used in the

3
N. Maleki and S.T.A. Niaki Healthcare Analytics 3 (2023) 100150

Fig. 1. The proposed framework in comparison to other methods. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

suggested model [32]. The watershed technique is utilized when segmenting complicated pictures, since basic thresholding and contour detection will not produce accurate results. The watershed method is built on capturing specific background and foreground information. Markers are then used to run the watershed and determine the precise borders. Markers can be defined by users, e.g., manually, or by some algorithm, e.g., a thresholding operation — we used a thresholding operation in our analysis.

3.3. Image classification

Image classification is the primary domain in which deep neural networks play the most critical role in medical image analysis. Image classification accepts the given input images and produces an output classification to identify whether the disease is present [35]. The image classification phase is composed of two parts: CNN and ANN.

3.4. Image feature extraction

The following phases are performed in the third proposed method. After the raw CT scan images are processed and segmented, several numerical features are extracted from the images in the image feature extraction phase. Each feature is obtained from each pixel in a single image and then stored in a dataset.

3.4.1. Gabor features

The Gabor feature is based on a linear filter used for texture analysis in image processing. It describes whether the content at a particular frequency in the image lies in specific directions in a local area around the point of analysis [36].

3.4.2. Sobel filter

The Sobel operator is used in image processing and computer vision, especially in edge recognition algorithms. It creates an image emphasizing the edges [37].

3.4.3. Scharr filter

The Scharr filter is used to identify and highlight edges/slope features using the first derivative. It functions like the Sobel filter and is used to detect edges/changes in pixel intensity [38].

3.4.4. Prewitt filter

The Prewitt filter is like the Sobel filter in that it uses two kernels: one for changes in the horizontal direction and the other for changes in the vertical direction. The two kernels are convolved with the original image to approximate the derivatives [38].

3.4.5. Gaussian filter

The Gaussian filter is a linear filter, usually used to blur the image or reduce noise [39].


3.4.6. Roberts edge

The Roberts cross operator performs a simple, fast two-dimensional spatial gradient measurement on an image. It therefore highlights regions of high spatial frequency, which often correspond to edges [40].

3.5. Building a numerical dataset

In the previous step, many features were extracted from each pixel in a CT image. For example, an image with 256 × 256 dimensions has 65,536 pixels, so if we extract 40 features from each pixel and store them in a dataset, we will have 1 row for the image and 40 × 65,536 columns. In addition, a target column must be added to determine whether the input image belongs to a cancerous patient or a noncancerous one. This procedure is continued until all the images' features are extracted and stored in the dataset.

3.6. Dimensional reduction

In the previous phase, an extensive dataset consisting of many features was obtained. However, implementing classification on this extensive dataset is time-consuming and inefficient, so applying dimensional reduction algorithms to the large dataset is one of the most critical steps. PCA and LDA are the two dimensional-reduction algorithms used in this paper.

3.7. Feature selection

Feature selection methods have become an unavoidable part of the machine learning process when dealing with high-dimensional data. Feature selection can identify related features and eliminate unrelated and repetitive ones to obtain a subset of attributes that best describes the problem.

The first goal of the proposed feature selection method is to reach the same accuracy rate as with the exclusive features. The second goal is to improve the accuracy rate. Here, gathering extensive information on the features costs too much, both in time and money, and redundant information is wasted in classification and diagnosis. Reducing the dimension in terms of the number of features is recommended to get a better response and find a better correlation between the features and the outcomes.

A GA is the technique used to select the best features. This technique generates a binary random vector over the features using Eq. (1):

\mathrm{Vector}(s_j):\ s_j = Y_i;\quad Y_i = \begin{cases} 1 & \text{if vector } s_j \text{ contains feature } i \\ 0 & \text{otherwise} \end{cases} \quad (1)

An objective function based on the misclassification performance criterion is defined for any selected combination of the features. This objective function is a penalty function that should be minimized to find the best features. Here, the misclassification rate is simply obtained using Eq. (2):

mcr = \Bigl[\sum_{i,j} a_{ij} - \sum_{i=j} a_{ij}\Bigr] \Big/ \sum_{i,j} a_{ij};\quad i, j = 1, 2, \ldots, m \quad (2)

where m is the number of classification targets and a_{ij} is the number of cases of class i classified into class j by the classification method. The elements a_{ij} construct the matrix in (3), the so-called confusion matrix, which depends on the problem as well as the dataset:

\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mm} \end{pmatrix}_{m \times m} \quad (3)

Now, the objective function to be minimized is a weighted sum of the mcr and nf (the number of selected features), defined as:

\mathrm{Min}\ Z = w_1 \cdot mcr + w_2 \cdot nf \quad (4)

Dividing both sides of Eq. (4) by w_1, we have:

\mathrm{Min}\ Z = mcr + (w_2 / w_1) \cdot nf \quad (5)

Assuming w_2 / w_1 = W, the objective function becomes:

\mathrm{Min}\ Z = mcr + W \cdot nf \quad (6)

Now, W is defined as proportional to mcr:

W \propto mcr \ \rightarrow\ W = \beta \cdot mcr \ \rightarrow\ \mathrm{Min}\ Z = mcr + \beta \cdot mcr \cdot nf \quad (7)

Finally, the objective function is:

\mathrm{Min}\ Z = mcr \, (1 + \beta \cdot nf), \quad (8)

where \beta is a penalty for having an additional feature (0 \le \beta \le 1). Using this objective function, the GA finds the best combination of the features, with the minimum number of features, that minimizes both the cost and the misclassification rate. Here, the stopping criterion to end the iterations in the GA is a predefined number of iterations.

The pseudocode of the GA-based feature selection algorithm is:

3.8. Classification

Three different classifiers – GB, RF, and SVM – are implemented on three different datasets separately. The first dataset, the PCA dataset, is derived from the PCA dimensional reduction algorithm. The second, called the LDA dataset, is obtained using the LDA dimensional reduction algorithm. The third is constructed by applying the GA to the PCA dataset. Each dataset has one row per image and a varying number of columns depending on the method of execution, alongside a binary target feature that defines the samples as noncancerous or cancerous (0 or 1).

4. Experimental results and analysis

This section discusses the way the data is collected, the implementation results of the proposed three methods on the data, and the analysis of the results.
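The misclassification rate of Eq. (2) and the GA objective of Eq. (8) above can be made concrete with a short sketch. This is illustrative only; the confusion-matrix values, the feature mask, and β = 0.05 are made-up numbers, and the classifier that would produce the confusion matrix is not shown.

```python
import numpy as np

def mcr_from_confusion(a):
    """Misclassification rate, Eq. (2): (sum of all a_ij - sum of the diagonal) / sum of all a_ij."""
    a = np.asarray(a, dtype=float)
    return (a.sum() - np.trace(a)) / a.sum()

def ga_fitness(mask, mcr, beta):
    """GA objective, Eq. (8): Min Z = mcr * (1 + beta * nf), with nf = number of selected features."""
    nf = int(sum(mask))
    if nf == 0:                    # selecting no features is infeasible
        return float("inf")
    return mcr * (1 + beta * nf)

cm = [[45, 5], [3, 47]]            # made-up 2x2 confusion matrix (m = 2 classes)
mcr = mcr_from_confusion(cm)       # (100 - 92) / 100 = 0.08
z = ga_fitness([1, 0, 1, 1], mcr, beta=0.05)  # nf = 3 selected features
print(mcr, round(z, 4))  # 0.08 0.092
```

With β > 0, adding a feature must reduce mcr enough to offset the (1 + β · nf) penalty, which is exactly the trade-off Eq. (8) encodes.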


Fig. 2. (a) Medical CT images of lung cancer patients; (b) medical CT images of patients with lung diseases other than lung cancer.

4.1. Data collection

Images are collected from a hospital situated in Tehran, Iran. The images used in this study are provided at https://data.mendeley.com/datasets. Part of these CT scan images of lungs belongs to lung cancer patients and is classified as cancerous images. The rest belongs to other lung diseases, such as patients who caught COVID-19, and is classified as noncancerous images. As lung cancer symptoms are rare, all the other possible lung diseases are considered noncancerous images to improve lung cancer diagnosis efficiency. For this reason, lung cancer is hard to detect in its early stages: most physicians, in the early stages of tumor growth, diagnose a disease other than cancer, which allows this type of cancer to progress in the infected person.

The total number of CT scan images used in this paper is 364, of which 238 are cancerous images and the rest (126) are noncancerous images. All these images were collected with the help of a pulmonologist to avoid any probable error in classifying the images. Some of the CT images of the lungs acquired from the hospital database are shown in Fig. 2.

4.2. The implementation results of the first method

As seen in Fig. 2, no pre-processing or segmentation is applied to the raw images when implementing CNN and ANN. In other words, the raw lung CT scan images are fed as inputs to the CNN and ANN architecture in the first method. Several different structures are evaluated to obtain the best structure to distinguish cancerous CT images from noncancerous ones. The best structure consists of three convolution layers with 64, 64, and 128 feature maps, respectively, in the first, second, and third layers of the convolutional neural network section. The artificial neural network contains two hidden layers, each with 128 neurons. This study uses max pooling with dimensions of 2 × 2 after each convolution layer to maintain the feature maps. Fig. 3 shows the graphical structure of the best approach.

The structural model has 63,109,441 trainable parameters, which would have quadrupled if the max-pooling layers had not been used; this would also have increased the algorithm's execution time. Of the available images, 324 were used for training and 40 for testing. Finally, implementing this structure on raw (unprocessed) images, a training accuracy of 65.81% with a loss value of 5.4368 and a testing accuracy of 62.50% with a loss value of 4.0295 were obtained. The difference between the two accuracies shows that the model is not overfitted.

4.3. The implementation results of the second method

As shown in Fig. 1, image pre-processing and segmentation are added to the first method. In the first step of image pre-processing, all the image sizes are set to 512 × 512 so that all the pixels in an image remain intact. The next step applies the median filter to remove any possible noise from the resized lung CT images. Fig. 4 shows the image before and after the median filter is applied to both the cancerous and noncancerous lung CT scans. As seen in this figure, the filtered images are not much different from the unfiltered ones, which is a characteristic of CT scan images.

Image segmentation tries to help the algorithm diagnose as well as it can by removing unnecessary parts, so that the algorithm can focus only on the lung, improving its performance. For better understanding, all the steps are applied to the two filtered images in Fig. 4. The masks that cover unnecessary parts of the images are shown in Fig. 5. Then, by placing these masks on the filtered images, only the lung remains visible, as shown in Fig. 6, where the unnecessary parts are removed. After completing all the above steps, the 364 processed and segmented images are given to the CNN and ANN algorithms. All the first-method steps are repeated from this point on, except that the images are not raw. Similar to the first method, several different structures are evaluated in the second method. The architecture producing the best result consists of three convolution layers with 64, 64, and 128 feature maps, respectively, in the first, second, and third layers of the CNN section. The ANN also includes two hidden layers, each containing 128 neurons. As mentioned, max pooling with dimensions of 2 × 2 is used after each convolution layer to maintain the image features. Fig. 7 shows the graphical structure of the best structure.

The second method's model has 63,109,441 trainable parameters. The number of parameters, and hence the execution time, would have quadrupled if max-pooling layers were not applied. In this method, as in the first, 324 lung CT images are used for training and 40 lung CT images are used to evaluate the algorithm's performance. Finally, by implementing this structure on the processed and segmented images, a 100% accuracy with a loss of 3.0576 × 10-5 on the training set and a 93% accuracy with a loss of 1.096 × 10-7 on the test set are obtained. The difference between the two accuracies shows that the model is not overfitted. Besides, performing image pre-processing and segmentation improves the first method's performance significantly.

4.4. The implementation results of the third method

As masks are placed on the filtered images in the image segmentation phase, the images' unnecessary parts are covered. Therefore, to reduce the number of columns, the segmented images are resized to 256 × 256. With these dimensions, 2,621,440 data values (40 features × 65,536 pixels) are generated for each segmented image when the features are extracted.

The filters/features are applied alone to each image's pixel and store
used in the second method before running the CNN and the ANN each pixel’s calculations in a data frame. To create a numeric dataset,
algorithms. This method aims to determine whether performing image the number of pixels in each image is 65,536, and the total number
pre-processing and image segmentation affects the performance of the of filters executed on each pixel is 40. The data values of the original

N. Maleki and S.T.A. Niaki Healthcare Analytics 3 (2023) 100150

Fig. 3. Best result graphical structure of the first method.
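The parameter counts reported for this structure can be sanity-checked with a short sketch. The 3 × 3 kernel size, 'same' padding, and grayscale 256 × 256 input are assumptions (the paper does not state them), so only the differences between the hidden-layer configurations, which do not depend on those choices, are compared with the reported totals (63,092,929; 63,109,441; 63,142,593):

```python
def conv_params(in_ch, filters, k=3):
    # weights + biases of one convolution layer with k x k kernels
    return (k * k * in_ch + 1) * filters

def count_params(conv_filters, hidden_sizes, input_hw=256, in_ch=1, k=3):
    total, ch, hw = 0, in_ch, input_hw
    for f in conv_filters:            # conv layer followed by 2x2 max pooling
        total += conv_params(ch, f, k)
        ch, hw = f, hw // 2
    width = hw * hw * ch              # flattened feature vector
    for h in hidden_sizes:            # fully connected hidden layers
        total += width * h + h
        width = h
    return total + width + 1          # single output neuron

one_hidden   = count_params([64, 64, 128], [128])
two_hidden   = count_params([64, 64, 128], [128, 128])
three_hidden = count_params([64, 64, 128], [128, 128, 256])

print(two_hidden - one_hidden)    # -> 16512 (= 63,109,441 - 63,092,929)
print(three_hidden - two_hidden)  # -> 33152 (= 63,142,593 - 63,109,441)
```

Both deltas agree with the totals listed for the three ANN configurations, which supports the internal consistency of the reported parameter counts.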

Fig. 4. (a) A cancerous lung CT image before (left) and after (right) applying the median filter; (b) a noncancerous lung CT image before (left) and after (right) applying the median filter.
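The median filtering illustrated in Fig. 4 replaces each pixel with the median of its neighborhood, suppressing salt-and-pepper noise while preserving edges. A minimal pure-Python stand-in (3 × 3 window, borders left untouched):

```python
from statistics import median

def median_filter(img):
    """Replace each interior pixel by the median of its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            window = [img[i + di][j + dj]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            out[i][j] = median(window)
    return out

noisy = [[10, 10, 10],
         [10, 255, 10],   # isolated noise spike
         [10, 10, 10]]
print(median_filter(noisy)[1][1])   # -> 10 (the spike is removed)
```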

Fig. 5. The image on the right is a mask for a cancerous lung, and the image on the left is a mask for a noncancerous lung.


Fig. 6. The image on the right shows lung cancer after applying the mask, and the image on the left shows noncancerous lung after using the mask.
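The segmentation step in Figs. 5–6 amounts to an element-wise product of a binary lung mask with the filtered image, zeroing everything outside the lung. A minimal sketch:

```python
def apply_mask(img, mask):
    """Keep pixels where mask == 1 (lung region); zero out the rest."""
    return [[p * m for p, m in zip(irow, mrow)]
            for irow, mrow in zip(img, mask)]

img  = [[12, 34], [56, 78]]
mask = [[1, 0], [0, 1]]          # 1 = lung region, 0 = unnecessary parts
print(apply_mask(img, mask))     # -> [[12, 0], [0, 78]]
```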

Fig. 7. Best result graphical structure of the second method.
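The 2 × 2 max pooling used after each convolution layer in both architectures keeps the strongest response in every non-overlapping 2 × 2 block, halving the spatial size. A plain-Python stand-in:

```python
def max_pool_2x2(fmap):
    """Reduce each non-overlapping 2x2 block of a feature map to its maximum."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 0],
        [7, 2, 9, 8],
        [1, 0, 3, 2]]
print(max_pool_2x2(fmap))   # -> [[6, 4], [7, 9]]
```

Halving the spatial size at each of the three layers is what keeps the flattened feature vector, and hence the dense-layer parameter count, manageable.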

image's pixels and the label of being cancerous (1) or noncancerous (0) are also stored in this dataset. Thus, each picture contributes one row with 40 × 65,536 feature columns plus the original pixel values and a label column. The dimensions of the dataset, considering all the images, are shown in Table 2.

Table 2
Dimension of the created dataset.
                     Number of rows   Number of columns
Dataset dimensions   364              2,621,442

PCA and LDA were separately applied to this dataset to produce the PCA and LDA datasets, respectively. Given that all the following steps must be performed on the entire dataset at once, a robust system is required to read it. Therefore, all subsequent steps are performed on a High-Performance Computing (HPC) cluster with 80 cores and 500 GB of RAM.

Before PCA and LDA can be applied in the dimension reduction phase, it is necessary to determine the number of components required for each algorithm. Therefore, a loop is used to obtain the optimal number of components. Some of the results of this loop for the PCA method are shown in Table 3, which reports how many components the 2,621,441 feature columns must be converted into to explain a given share of the total data variance.

According to Table 3, with 363 components, the PCA algorithm fully explains the total data variance. In other words, PCA can combine the 2,621,441 feature columns into 363 columns that explain 100% of the variance of the entire dataset. Fig. 8 plots the number of components from 1 to 364 against the corresponding percentage of explained variance.

The same procedure is done for the LDA algorithm, where the number of components is bounded by the minimum of the number of rows and the number of classes (objective functions). Since there are 364 rows and two classes (0 for noncancerous and 1 for cancerous), a maximum of 2 components is possible; the HPC output sets this number to 1 component.

After reducing the dimensions, it is time to perform the feature selection phase using the genetic algorithm. However, before implementing the genetic algorithm, all the following steps are first taken once without feature selection so that one can finally compare whether the feature selection improves the performance of the third method.

As mentioned above, the LDA algorithm reduces the dataset to 364 rows and two columns, named the LDA dataset; the first column condenses the 2,621,441 feature columns, and the second column is the objective function (label) column. The PCA algorithm reduces the dataset to 364 rows and 325 columns, named the PCA dataset. The first 324 columns are


Table 3
Part of the number-of-components output for the PCA method.
n_components Variance description n_components Variance description n_components Variance description
332 0.99108 343 0.99486 354 0.99810
333 0.99145 344 0.99518 355 0.99836
334 0.99181 345 0.99549 356 0.99861
335 0.99217 346 0.99580 357 0.99886
336 0.99252 347 0.99610 358 0.99910
337 0.99287 348 0.99640 359 0.99934
338 0.99322 349 0.99670 360 0.99956
339 0.99356 350 0.99700 361 0.99974
340 0.99389 351 0.99728 362 0.99989
341 0.99422 352 0.99756 363 1
342 0.99454 353 0.99784 364 1

Fig. 8. The number of components corresponding to their variance explanation in PCA.
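The component-count loop behind Table 3 and Fig. 8 can be sketched by accumulating explained-variance ratios until a target is reached. The SVD-based computation below is an illustrative stand-in (the paper's exact implementation is not given), run here on toy rank-3 data:

```python
import numpy as np

def explained_variance_ratio(X):
    """Per-component share of total variance, from the SVD of centered data."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values
    var = s ** 2
    return var / var.sum()

def components_for(X, target):
    """Smallest number of components whose cumulative variance >= target."""
    cum = np.cumsum(explained_variance_ratio(X))
    return int(np.searchsorted(cum, target) + 1)

rng = np.random.default_rng(0)
# 8 samples lying in a 3-dimensional subspace of a 10-dimensional space
X = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 10))
n = components_for(X, 0.9999)
print(n)   # at most 3 components are needed for rank-3 data
```

For LDA, standard implementations additionally cap the number of components at n_classes − 1, which is consistent with the single discriminant component reported in the text for the two-class problem.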

Table 4
Performance measurements of GBC, RFC, and SVC on the PCA dataset.
Methods   Accuracy   Precision   Recall   F1-score
GBC       0.95       0.95        0.95     0.95
RFC       0.82       0.86        0.82     0.79
SVC       0.73       0.53        0.72     0.61

related to the combination of the 2,621,441 feature columns, and the last column is the objective function column. With these two datasets (PCA and LDA), the supervised machine learning algorithms, namely GB, RF, and SVM classification, are implemented. Tables 4 and 6 show the performance of these three classification algorithms on the PCA and LDA datasets, respectively, and Tables 5 and 7 present their confusion matrices. In addition, Figs. 9 and 10 compare the performance of the three classification algorithms on the PCA and LDA datasets, respectively.

As shown in Table 4, the GB classification (GBC) algorithm achieves the highest accuracy on the PCA dataset. The confusion matrix in Table 5 shows that the GBC algorithm correctly recognizes 38 of the 40 test data. In all the employed machine learning algorithms, 40 data are considered for the test and 324 data for the train.

The results in Table 6 show that the SVC algorithm achieves the highest accuracy on the LDA dataset. As demonstrated by its confusion matrix in Table 7, the SVC algorithm correctly identifies 38 of the 40 test data.

A receiver operating characteristic (ROC) curve is a graphical tool for examining binary classification performance in statistical analysis. The information contained in ROC curves is beneficial in choosing an appropriate classifier under specific criteria. The curve plots the true-positive rate (TPR), also known as sensitivity, against the false-positive rate (FPR), equal to one minus the specificity, at various cut-off points of a parameter. High TPR values and low FPR values indicate better ROC curves; they move the points towards the upper-left corner of the ROC plot and thus towards a desirable decision. The area under the curve is calculated to evaluate a given classifier's performance on a dataset and to analyze classifier consistency. In this research, the three classifiers' ROC curves are depicted in Figs. 11 and 12 for the PCA and LDA datasets, respectively. The graphs demonstrate that the classification results of the GBC technique in Fig. 11 and the SVC technique in Fig. 12 are more accurate and reliable than those of the other algorithms. In Fig. 12, the ROC graphs of GBC and RFC coincide.

Having completed the dimension reduction step, the next step is to perform the genetic feature selection algorithm and then the machine learning algorithms. As mentioned earlier, the LDA dataset consists of only two columns, so it is virtually impossible to run the GA on it; therefore, the feature selection algorithm can be executed only on the PCA dataset. While the maximum number of iterations was set to 10, the cost function value converges to 0.12464 after the sixth iteration, as shown in Fig. 13, and the rest of the iterations remain the


Table 5
Confusion matrix of GBC, RFC, and SVC on the PCA dataset (rows: predicted value; columns: actual value).
GBC:  9   2        RFC:  4   7        SVC:  0   11
      0   29             0   29             0   29

Fig. 9. The performance of GBC, RFC, and SVC on the PCA dataset.

Fig. 10. The performance of GBC, RFC, and SVC on the LDA dataset.

Table 6
Performance measurements of GBC, RFC, and SVC on the LDA dataset.
Methods   Accuracy   Precision   Recall   F1-score
GBC       0.78       0.84        0.78     0.79
RFC       0.78       0.84        0.78     0.79
SVC       0.95       0.96        0.95     0.95

same. The population size was 50, the crossover operator probability was 0.7, and the mutation probability was 0.3. Moreover, the roulette wheel method is used to select the parents in all operations. Table 8 shows the GA hyperparameter tuning results along with the corresponding cost function values.

After selecting 172 columns as the practical ones, the remaining columns are deleted, and the three previously mentioned machine learning algorithms are executed. Tables 9 and 10 show the performance and the confusion matrices of the three classification algorithms on the dataset derived from the feature selection algorithm, respectively. Besides, Figs. 14 and 15 demonstrate a performance comparison among these three classification algorithms and the ROC curves of the three classifiers on the dataset derived from the GA, respectively.

As shown in Table 9, the RFC algorithm achieves the highest accuracy on the dataset obtained from the GA. As is clear from its confusion matrix in Table 10, this algorithm correctly recognizes 34 of the 40 test data.

5. Sensitivity analyses and comparative study

In the previous section, the three proposed methods were examined, and the best results of each method were described. In this section, the sensitivity of each method to its parameters is first analyzed. Then, the methods are compared to shed light on their strengths and weaknesses.

5.1. Sensitivity analysis

According to the previous section's explanations of CNN and ANN, each network includes parameters that must be adjusted to achieve the optimal result. To this end, different structures, from the simplest to the most complex, were examined until no further improvement was observed. It should be noted that the simplicity of the structure


Fig. 11. ROC curves for three classifiers on the PCA dataset.

Fig. 12. ROC curves for three classifiers on the LDA dataset.
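The TPR and FPR values that make up these ROC curves come straight from the confusion matrices. Reading the matrices as rows = predicted and columns = actual (an assumption about their orientation), with class 1 = cancerous:

```python
def tpr_fpr(cm):
    """TPR (sensitivity) and FPR (1 - specificity) from a 2x2 confusion matrix
    laid out as [[TP, FP], [FN, TN]] (rows: predicted, columns: actual)."""
    (tp, fp), (fn, tn) = cm
    return tp / (tp + fn), fp / (fp + tn)

# GBC on the PCA dataset (Table 5): 9 + 29 of the 40 test images are correct.
tpr, fpr = tpr_fpr([[9, 2], [0, 29]])
print(tpr)             # -> 1.0
print(round(fpr, 3))   # -> 0.065
```

This classifier detects every cancerous test image (TPR = 1.0) at the cost of a small false-positive rate, placing its ROC point near the upper-left corner.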

Table 7
Confusion matrix of GBC, RFC, and SVC on the LDA dataset (rows: predicted value; columns: actual value).
GBC:  10   1        RFC:  10   1        SVC:  11   0
       8   21             8   21             2   27

or its complexity does not necessarily determine the result. However, the difference between complex and simple structures is that the smaller the number of parameters, the lower the computational effort needed to achieve the best possible response, implying a shorter execution time.

Tables 11–14 show the sensitivity analyses of the first two methods: the convolutional and artificial neural networks on raw images and on the pre-processed and segmented images. As mentioned, the complexity of the structures was increased until no further improvement appeared. It is necessary to consider the accuracy and loss function of the test and train sets simultaneously to achieve the desired result.

As shown in Table 11, various numbers of convolution layers with different filter detectors are utilized, and in each of them, various numbers of nodes in different hidden layers are used to build a structural architecture. For example, as seen in the last row of Table 11, three convolution layers with 64, 64, and 128 detection filters build the CNN structural architecture, and the ANN structural architecture uses three hidden layers containing 128, 128, and 256 neurons. In conclusion, as the accuracy rate decreased and the loss function for both train and


Fig. 13. The GA cost function for each iteration.
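The roulette-wheel parent selection named in the text draws each chromosome with probability proportional to its fitness (for the minimized cost function, the fitness would first be inverted; that step is omitted here). A minimal sketch:

```python
import random

def roulette_select(fitness, rng=random):
    """Return an index drawn with probability proportional to its fitness."""
    total = sum(fitness)
    r = rng.uniform(0, total)
    acc = 0.0
    for i, f in enumerate(fitness):
        acc += f
        if r < acc:
            return i
    return len(fitness) - 1   # r landed exactly on the upper bound

# A chromosome holding all the fitness mass is always selected.
print(roulette_select([0.0, 0.0, 5.0]))   # -> 2
```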

Table 8
Hyperparameter tuning process. The best setting with respect to the cost function value (0.12464) is max iteration 10, population 50, mutation 0.3, crossover 0.7; the second best (0.12890) is max iteration 10, population 20, mutation 0.3, crossover 0.7.
Max iteration Population % Mutation % Crossover Time (s) Cost function # of selected features
10 20 0.3 0.7 48,235.780 0.12890 172
10 20 0.4 0.7 27,149.952 0.16745 194
10 20 0.5 0.7 21,853.446 0.16745 194
10 20 0.3 0.8 22,557.235 0.14843 187
10 20 0.4 0.8 32,168.301 0.16738 191
10 20 0.5 0.8 21,063.200 0.16745 194
10 20 0.3 0.9 21,245.173 0.14843 187
10 20 0.4 0.9 47,362.354 0.16745 194
10 20 0.5 0.9 42,417.132 0.16745 194
10 50 0.3 0.7 40,130.765 0.12464 172
10 50 0.4 0.7 85,919.442 0.16738 191
10 50 0.5 0.7 103,398.083 0.16738 191
10 50 0.3 0.8 52,323.353 0.16738 191
10 50 0.4 0.8 91,157.160 0.16738 191
10 50 0.5 0.8 63,885.141 0.16738 191
10 50 0.3 0.9 66,239.779 0.13741 178
10 50 0.4 0.9 109,868.125 0.15745 187
10 50 0.5 0.9 106,193.759 0.15745 187
10 80 0.3 0.7 86,121.910 0.16738 191
10 80 0.4 0.7 91,011.980 0.16738 191
10 80 0.5 0.7 120,447.483 0.16738 191
10 80 0.3 0.8 74,341.422 0.16738 191
10 80 0.4 0.8 150,442.208 0.16738 191
10 80 0.5 0.8 104,947.371 0.16738 191
10 80 0.3 0.9 111,308.229 0.16738 191
10 80 0.4 0.9 114,145.831 0.16738 191
10 80 0.5 0.9 122,187.983 0.16738 191
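The variation operators behind Table 8 can be sketched with the probabilities of the best row (crossover 0.7, mutation 0.3). One-point crossover and a single-bit-flip mutation are assumptions — the paper states only the probabilities and the selection scheme — and chromosomes are taken as binary masks over the PCA columns:

```python
import random

def crossover(p1, p2, pc, rng):
    """One-point crossover applied with probability pc."""
    if rng.random() < pc:
        cut = rng.randrange(1, len(p1))
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
    return p1[:], p2[:]

def mutate(chrom, pm, rng):
    """Flip one random gene with probability pm."""
    chrom = chrom[:]
    if rng.random() < pm:
        i = rng.randrange(len(chrom))
        chrom[i] ^= 1
    return chrom

rng = random.Random(42)
p1, p2 = [1] * 8, [0] * 8
c1, c2 = crossover(p1, p2, pc=0.7, rng=rng)
c1 = mutate(c1, pm=0.3, rng=rng)
# Offspring remain valid binary feature masks of the same length.
print(len(c1), set(c1) <= {0, 1})
```

In the paper's setting, a chromosome of length 363 would mark which PCA columns are kept; the run reported in the text ends with 172 selected columns.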

test data increased, the implementation of more complex structures was stopped. Consequently, the fifth structure in Table 11 is determined as the best answer to this problem.

As presented in Table 12, various numbers of convolution layers with different filter detectors are likewise utilized, and in each of them, various numbers of neurons in different numbers of hidden layers are used to build a structural architecture. In the end, the implementation of a more complex structure is abandoned, and the fifth row of Table 12 is determined as the best solution to the problem at hand.

Regarding the third method's sensitivity analysis, the whole process of this method is described in Section 4; still, a brief explanation of its sensitivity analysis is worthwhile here. The goal of the third method was to analyze the numerical data obtained after extracting features from the images. After extracting the features, two-dimensional


Fig. 14. The performance measures of GBC, RFC, and SVC on the dataset derived from the GA.

Fig. 15. ROC curves for three classifiers on the derived dataset from the GA.

Table 9
Performance measurements of GBC, RFC, and SVC on the dataset derived from the GA.
Methods   Accuracy   Precision   Recall   F1-score
GBC       0.82       0.83        0.82     0.83
RFC       0.85       0.88        0.85     0.83
SVC       0.73       0.53        0.72     0.61

reduction methods (PCA and LDA) are used to examine which one performs better. Following each of them, the supervised machine learning algorithms (GBC, RFC, and SVC) are implemented separately to see how performing only the feature extraction and dimension reduction steps (without feature selection) works and what results are obtained with the machine learning algorithms. Table 13 presents the results of these three steps and the accuracy of each algorithm; the other evaluation criteria are given in Tables 4–7, discussed previously in Section 4.

As shown in Table 13, applying these three steps, the SVC on the LDA dataset and the GBC on the PCA dataset offer the best performance in diagnosing lung cancer. However, in the third method, we did not settle for these results and tried different ways to improve lung cancer diagnosis. Thus, the genetic feature selection algorithm is implemented to see whether it improves the diagnostic determination of the disease. As mentioned in Section 4, it is impossible to implement the genetic feature selection algorithm on the output of the LDA dimension reduction method; one can only use it on the dataset obtained from the PCA dimension reduction method. Table 14


Table 10
Confusion matrix of GBC, RFC, and SVC on the dataset derived from the GA (rows: predicted value; columns: actual value).
GBC:  8   3        RFC:  5   6        SVC:  0   11
      4   25             0   29             0   29
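Table 9's accuracies follow directly from Table 10's confusion matrices: the diagonal holds the correctly classified test images, so the matrix orientation does not matter for accuracy. For example, for RFC:

```python
def accuracy(cm):
    """Correct count, total count, and accuracy from a 2x2 confusion matrix."""
    correct = cm[0][0] + cm[1][1]
    total = sum(cm[0]) + sum(cm[1])
    return correct, total, correct / total

# RFC on the GA-selected features: 34 of the 40 test images are correct.
correct, total, acc = accuracy([[5, 6], [0, 29]])
print(correct, total, acc)   # -> 34 40 0.85
```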

Table 11
Sensitivity analysis of performing CNN and ANN methods on raw images.
Conv. layers   Filter detections^a   Hidden layers   Nodes in hidden L.^b   Total params   Train acc. (%)   Train loss   Test acc. (%)   Test loss   Time (s)
1              64                    1               128                    532,686,849    65.71            5.5623       62.5            4.0295      312,418.58
2              64-64                 1               128                    130,095,169    65.75            5.5452       62.5            4.0295      76,300.26
2              64-64                 2               128-128                130,111,681    65.77            5.5204       62.5            2.0148      76,309.95
3              64-64-128             1               128                    63,092,929     65.8             5.4398       62.5            4.0295      37,003.74
3              64-64-128             2               128-128                63,109,441     65.81            5.4368       62.5            4.0295      37,013.42
3              64-64-128             3               128-128-256            63,142,593     65.8             5.5018       62.5            12.0886     37,032.86
^a The number of feature detectors is listed per layer.
^b The number of neurons is listed per layer.

Table 12
Sensitivity analysis of performing CNN and ANN methods on processed and segmented images.
Conv. layers   Filter detections^a   Hidden layers   Nodes in hidden L.^b   Total params   Train acc. (%)   Train loss   Test acc. (%)   Test loss   Time (s)
1              64                    1               128                    532,686,849    99.52            0.019        92.5            0.754       307,915.06
2              64-64                 1               128                    130,095,169    99.58            0.023        92.5            1.5115      75,200.39
2              64-64                 2               128-128                130,111,681    99.92            0.0024       93              1.12E−07    75,209.94
3              64-64-128             1               128                    63,092,929     99.95            0.0012       93              1.84E−07    36,470.32
3              64-64-128             2               128-128                63,109,441     100              0.00003      93              1.09E−07    36,479.87
3              64-64-128             3               128-128-256            63,142,593     100              0.00004      93              5.13E−07    36,499.03
^a The number of feature detectors is listed per layer.
^b The number of neurons is listed per layer.

Table 13
Sensitivity analysis of the two dimension reduction methods (PCA and LDA) after feature extraction.
Algorithms   LDA dimension reduction   PCA dimension reduction
GBC          78%                       95%
RFC          78%                       82%
SVC          95%                       73%

Table 14
Sensitivity analysis of the implementation of machine learning algorithms before and after the implementation of the genetic feature selection algorithm (PCA dimension reduction).
Algorithms   Before GA feature selection   After GA feature selection
GBC          95%                           82%
RFC          82%                           85%
SVC          73%                           73%

contains the accuracy of the algorithms after execution on the dataset of the genetic feature selection algorithm and before its implementation. The reader is referred to Tables 4, 5, 9, and 10 for the other evaluation criteria.

According to the results in Table 14, implementing the machine learning algorithms on the genetic feature selection algorithm's dataset has improved the RFC method's results. However, this improvement does not hold for GBC, whose accuracy is severely reduced, while the performance of the SVC method did not change.

5.2. Comparison

This section is devoted to the comparison of the three methods.

5.2.1. Method 1: CNN and ANN on raw images

As the name implies, all CT scans enter the CNN system without pre-processing or modification. After passing this stage, they enter the artificial neural network to classify the images into cancerous and noncancerous categories. Implementing this method on raw (unprocessed) images resulted in an accuracy of 65.81% with a 5.4368 loss function value for the train and 62.50% accuracy with a 4.0295 loss function value for the test.

5.2.2. Method 2: CNN and ANN on the pre-processed and segmented images

In this method, the images are pre-processed and segmented before the CT images enter the CNN system. The image size is changed in the pre-processing phase, and the median filter is applied to reduce possible image noise. By implementing this structure on the pre-processed and segmented images, an accuracy of 100% is achieved with a loss function value of 3.0576 × 10−5 for the training set and 93% accuracy with a loss function value of 1.096 × 10−7 for the test set.


5.2.3. Method 3: Numerical features of the pre-processed and segmented images and employing machine learning algorithms

In this method, all the pre-processing and segmentation phases are the same as in the second method. After these two phases, the numerical features are extracted from the images and converted to numerical data in a data frame. This dataset's dimensions are 364 × 2,621,442, which qualifies as big data. As the dataset is large, dimension reduction algorithms are applied first, and then a genetic feature selection algorithm is utilized to increase the processing speed. Next, machine learning algorithms are employed to classify cancerous and noncancerous images. The GBC algorithm obtains the best results with 95% accuracy when implemented on the PCA dataset, the SVC algorithm reaches 95% accuracy on the LDA dataset, and the RFC algorithm reaches 85% accuracy on the dataset obtained from the genetic feature selection.

6. Discussion and conclusion

In this study, three different methods were used to diagnose lung cancer in its early stages. One of the most significant achievements of this research is the comparability of all three methods: the methods are comparable because all input images were the same in each method and the same amounts of data were used for the training and testing sets, namely 324 images for training and the remaining 40 images for testing in all investigations. Another contribution of this research is the third method itself, which extracts numerical features from the pre-processed and segmented lung CT images; the dimension reduction algorithms (PCA and LDA) are applied to the obtained dataset, GA feature selection is utilized, and the supervised machine learning algorithms (GB, RF, and SVC) are performed.

The comparative analysis showed that the third method has the best performance (95% accuracy on the testing set). Since the third method is one of the main contributions of this paper, it is notable that it achieves this best accuracy with two different machine learning classification algorithms: separately applying GB on the PCA dataset and SVC on the LDA dataset yields the best performance, with 95% accuracy in both.

Future work can improve the accuracy of cancer diagnosis by executing GA feature selection before dimension reduction. Using different feature selection algorithms can probably improve the accuracy of cancer detection, and extracting more diverse features in the feature extraction step may also positively impact the system's accuracy. Moreover, since other cancers kill countless people every year and this research's implementation has yielded promising results, we will try to apply the best approach to various cancers and diseases in future work.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.


