Prediction of Lung Cancer Using ML (1) - 2
A Major Project Report Submitted in Partial Fulfillment for the Award of the Degree
of Bachelor of Technology in Information Technology
To
Submitted by:
Shivam Malaviya(1900100130086)
Assistant Professor
MAY 2023
CANDIDATE DECLARATION
We hereby certify that the project entitled “Prediction Of Lung Cancer By Using
Machine Learning”, submitted by us in partial fulfillment of the requirement for the
award of the degree of B. Tech. (Information Technology) to Dr. A.P.J.
Abdul Kalam Technical University, Lucknow at United College of Engineering and
Research, Prayagraj, is an authentic record of our own work carried out during the
period from June 2022 to May 2023 under the guidance of Assistant Prof. Vivek
Pandey (Department of Computer Science & Engineering). The matter presented in
this project has not formed the basis for the award of any other degree, diploma,
fellowship or any other similar title.
Place:Prayagraj
Date:
CERTIFICATE
This is to certify that the project titled “Prediction Of Lung Cancer By Using
Machine Learning” is the bonafide work carried out by Ved Prakash Srivastava
(1900100130106), Saurabh Mishra (1900100130081), Mohd. Shazan
(1900100130060) and Shivam Malaviya (1900100130086) in partial
fulfillment of the requirement for the award of the degree of B. Tech. (Information
Technology), submitted to Dr. A.P.J Abdul Kalam Technical University, Lucknow at
United College of Engineering and Research, Prayagraj, and is an authentic record of their
own work carried out during the period from June 2022 to May 2023 under the
guidance of Assistant Prof. Vivek Pandey (Department of Computer Science &
Engineering). The Major Project Viva-Voce Examination has been held on
__________________.
Place:
Date:
ABSTRACT
The lungs are the centre of breath control and ensure that every cell in the body
receives oxygen. At the same time, they filter the air to prevent the entry of useless
substances and germs into the body. The human body has specially designed defence
mechanisms that protect the lungs. However, they are not enough to completely
eliminate the risk of various diseases that affect the lungs. Infections, inflammation or
even more serious complications, such as the growth of a cancerous tumour, can
affect the lungs. Lung cancer occurs in both men and women due to the
uncontrolled growth of cells in the lungs. This causes serious breathing problems
during both inhalation and exhalation. According to the World Health Organization,
cigarette smoking and passive smoking are the principal contributors to lung cancer.
The mortality rate due to lung cancer is increasing day by day among the young as well
as the old, compared to other cancers. Despite the availability of high-tech
medical facilities for careful diagnosis and effective treatment, the mortality
rate is not yet well controlled. Therefore, it is highly necessary to take
precautions at the initial stage so that symptoms and effects can be found
early for better diagnosis. Machine learning now has a great influence on the
health care sector because of its high computational capability for early prediction of
diseases with accurate data analysis. In this work, we used machine learning (ML)
methods to build efficient models for identifying individuals at high risk of developing
lung cancer, so that earlier interventions can be made to avoid long-term complications.
The model proposed in this work is the Rotation Forest, which achieves high performance
and is evaluated with well-known metrics such as precision, recall, F-Measure,
accuracy and area under the curve (AUC). More specifically, the
experiments showed that the proposed model prevailed with an AUC of 99.3% and an
F-Measure, precision, recall and accuracy of 97.1%.
ACKNOWLEDGEMENT
We express our sincere gratitude to Dr. A.P.J Abdul Kalam Technical University,
Lucknow for giving us the opportunity to work on the Major Project during our final
year of B.Tech. (IT), which is an important aspect in the field of engineering. We would like
to thank Dr. H.P. Shukla, Principal and Dr. Vijay Kumar Dwivedi, Head of
Department, CSE at United College of Engineering and Research, Prayagraj for their
kind support. We also owe our sincerest gratitude towards Assistant Prof. (Mr. Vivek
Pandey) for his valuable advice and healthy criticism throughout our project which
helped us immensely to complete our work successfully. We would also like to thank
everyone who has knowingly and unknowingly helped us throughout our work. Last
but not the least, a word of thanks for the authors of all those books and papers which
we have consulted during our project work as well as for preparing the report.
List of Figures
Figure 1. CT Scan image for lung Cancer
Figure 2. Distribution of participants among the age groups in the balanced data
List of Tables
Table of Contents
Title Page
Certificate
Abstract
Acknowledgement
List of Figures
List of Tables
1. INTRODUCTION
1.1.4 Mesothelioma
1.2.1 Smoking
1.2.3 Radon
1.3.1 Diet
1.5 What are the first signs of lung cancer?
1.8.2 Imaging
1.8.3 Biopsy
2. Literature Survey
2.2 ANALYSIS
2.3.5 Conclusion
4.2 Evaluation
4.3 Discussion
5. Conclusions
Future Work
References
1. INTRODUCTION
When the body's cells proliferate unchecked, the resulting condition is called cancer.
When cancer develops in the lungs, it is referred to as lung cancer. Lung cancer begins
in the lungs and may spread to lymph nodes or other organs, such as the brain; cancer
from other organs may also spread to the lungs. Cancer cells that have spread from one
organ to another are called metastases. Lung cancers are commonly separated into two
main groups: small cell and non-small cell lung cancers, the latter including
adenocarcinoma and squamous cell carcinoma. These types of lung cancer have distinctive
patterns of development and therapeutic responses [1]. Non-small cell lung cancer is
more common than small cell lung cancer. Lung cancer is among the deadliest diseases
and is thought to be a main factor in high mortality in the modern world.
Compared to other cancers, lung cancer has a greater impact on people; it currently
occupies position seven in the fatality rate index, contributing 1.6% of deaths
worldwide [2]. Lung cancer can affect the brain once it has spread beyond the lung.
Patients may experience acute chest pain, a dry cough, shortness of breath, weight
loss, and other symptoms [3]. Doctors who study the causes and progression of cancer
emphasise that lung cancer is primarily caused by smoking and passive smoking. Lung
cancer is treated with surgery, chemotherapy, radiation, immunotherapy, and other
procedures. Despite this, doctors can often diagnose lung cancer only once it has
advanced, which makes the prognosis relatively poor [4]. To lower the mortality rate
quickly and effectively, early prediction before the final stage is essential. With the
right diagnosis and treatment, the prognosis for lung cancer can be quite encouraging [5].
The prognosis varies depending on the patient's age, gender, race, and health status.
The American Cancer Society calculates that a patient's likelihood of surviving lung
cancer is 47% if it is identified at an early stage. It is extremely improbable that
lung cancer in its early stages will be discovered accidentally on an X-ray image [6].
Spherical lesions with a diameter of 5–10 millimetres or less are notoriously difficult
to find. Figure 1 displays a CT scan of a patient with lung cancer.
Non-small cell lung cancer is the most common type of lung cancer. It grows and
spreads more slowly than small cell lung cancer [10]. According to the kind of cells
that make up the tumour, non-small cell lung cancer is divided into three main types.
● Large cell carcinomas are a group of cancers with large, abnormal-looking
cells [11]. These tumours often grow quickly and can start anywhere in the
lungs.
● Small cell carcinoma (oat cell cancer), which accounts for the majority of small
cell lung cancers;
● Combined small cell carcinoma
1.1.4 Mesothelioma
Mesothelioma is an uncommon cancer of the lining of the chest, most often caused by
asbestos exposure. It accounts for around 5% of lung cancer cases. Mesothelioma
takes between 30 and 50 years to appear after exposure to asbestos [14]. Most people
who develop mesothelioma worked in environments where they breathed in asbestos
fibres. When mesothelioma is found, it is staged, which tells the patient and the
treating doctor how big the tumour is and how far it has spread from the initial site.
Surgery, radiation therapy, and chemotherapy are the available treatments for
mesothelioma [15]. Combined strategies are currently being studied, including the
use of chemotherapy before surgery and novel medications that precisely target
mesothelioma cells.
1.1.5 Chest Wall Tumours
Chest wall tumours are uncommon. Like other tumours, those found in the chest wall can
be benign or malignant [16]. Malignant tumours must be treated. Benign tumours are
treated depending on their location and the symptoms they cause.
For instance, a tumour needs to be treated if it presses against a lung
and prevents the patient from breathing.
Examples: cancers of the bladder, breast, colon, kidney, liver, neuroblastoma, prostate,
sarcoma, and Wilms' tumour
1.2.1 Smoking
Smoking is the leading risk factor for lung cancer. In the US, cigarette smoking is
linked to 80% to 90% of lung cancer deaths. Smoking tobacco in any form,
including cigarettes, cigars, and pipes, raises the risk of developing lung cancer.
Tobacco smoke contains about 7,000 compounds and is therefore highly toxic.
Many of them are poisons. At least 70 of them have
been linked to cancer in humans or animals [19].
Smokers are 15 to 30 times more likely than non-smokers to get lung cancer or
die from it. Even light or occasional smoking raises the risk of lung cancer.
Smoking more frequently and for longer periods of time raises the risk further.
People who quit smoking have a lower risk of lung cancer than they would
have had otherwise, but their risk is still higher than that of non-smokers [20].
Quitting smoking at any age can lower the risk of lung cancer.
Smoking increases the risk of cancer in almost every part of the body.
It raises the risk of a number of cancers, including those of the mouth, throat,
oesophagus, stomach, colon, rectum, liver, pancreas, voice box (larynx), trachea,
and bronchus.
1.2.3 Radon
In the US, smoking and radon are the two leading causes of lung cancer. Radon is a
naturally occurring gas that comes from rocks, soil, and water. It cannot be seen,
tasted, or smelled. When radon enters homes or other buildings through cracks or
holes, it can become trapped and build up in the air inside [22]. People who
live or work in these homes and buildings are exposed to high levels
of radon. Over a long period of time, radon exposure can cause lung cancer.
The US Environmental Protection Agency (EPA) estimates that
radon causes about 21,000 lung cancer deaths each year. The risk of lung
cancer from radon exposure is higher for smokers than for
non-smokers [23]. However, the EPA estimates that more than 10% of radon-related
lung cancer deaths occur among people who have never smoked
cigarettes. Nearly one in every fifteen homes in the US has excessive radon levels.
Homes can be tested for radon, and radon levels can be reduced if they are
excessive.
1.2.4 Various Substances
Examples of substances found in some workplaces that increase risk include
asbestos, arsenic, diesel exhaust, and some forms of silica and
chromium [24]. For many of these substances, the risk of getting lung cancer is
notably higher for smokers. If someone lives in an area with higher air pollution
levels, their risk of developing lung cancer may increase.
Lung cancer risk is increased for cancer patients who underwent chest radiation
therapy.
1.3.1 Diet
Researchers are examining a wide range of foods and dietary
supplements to see whether they affect the risk of getting lung cancer. There is still
much to discover. We do know that smokers who take beta-carotene supplements
have an increased risk of lung cancer [25].
Additionally, drinking-water contaminants such as radon and arsenic
(mostly from private wells) can raise the risk of lung cancer.
Other changes sometimes associated with lung cancer include repeated bouts of
pneumonia and swollen or enlarged lymph nodes (glands) inside the chest, close to the
lungs.
Depending on where in the lungs the cancer first develops, some of these signs may
appear early (in stage I or II), but they typically do not appear until the disease has
reached a later stage. For this reason, it is important to get screened if you have a
greater than average risk of developing lung cancer.
There are numerous size and spread combinations for each stage that can fit into that
group. For example, the main tumour in a Stage III cancer may be smaller than one
in a Stage II cancer, but other factors may have raised the cancer to that stage.
Lung cancer is generally staged as follows:
● Stage 0 (in-situ): Cancer is in the top lining of the lung or bronchus. It has
not spread to other parts of the lung or outside the lung.
● Stage I: Cancer has not spread outside the lung.
● Stage II: Cancer is larger than Stage I, has spread to lymph nodes inside the
lung, or there is more than one tumour in the same lobe of the lung.
● Stage III: Cancer is larger than Stage II, has spread to nearby lymph nodes
or structures, or there is more than one tumour in a different lobe of the same lung.
● Stage IV: Cancer has spread to the other lung, the fluid surrounding the
lung, the fluid surrounding the heart, or distant organs.
1.7 Limited vs. extensive stage
Although doctors now refer to small cell lung cancer as being in stages I through IV,
you may also hear the terms limited or extensive stage used. This is based on whether
the area can be treated with a single radiation field.
● Limited stage SCLC is confined to one lung and can sometimes be in the
lymph nodes in the middle of the chest or above the collar bone on the
same side.
● Extensive stage SCLC is widespread throughout one lung or has spread to the
other lung, lymph nodes on the opposite side of the lung, or other parts of the
body.
1.8.2 Imaging
Images from chest X-rays and CT scans might show your doctor changes in your
lungs. In order to assess a troubling CT scan finding or to ascertain whether cancer
has spread following a cancer diagnosis, PET/CT scans are frequently performed.
1.8.3 Biopsy
Your doctor may use a number of procedures to get a closer look at what is happening
inside your chest. During the same procedures, your doctor may perform a biopsy to
take samples of tissue or fluid, which can be examined under a microscope to look for
cancer cells and identify the kind of cancer. Samples may also be tested for genetic
changes that might affect your treatment.
with surgery, chemotherapy, radiation, targeted therapy, or a combination of these [27].
Small cell lung cancer patients usually receive both chemotherapy and radiation
therapy.
Targeted therapy: Drug treatment to halt the growth and spread of cancer cells.
The drugs can be given as tablets or intravenously. Before treatment begins, tests
are performed to establish whether targeted therapy is suitable for your particular
type of cancer [27].
● Adverse effects must be disclosed, investigated in human clinical studies, and
mentioned in patient information leaflets (PILs). The PIL is offered with the
sale of medications and medical supplies to the general population.
Examples comprise:
2. LITERATURE SURVEY
But instead of considering the individual stages [Stage 0 - Stage IV], the images were
simply categorised as "abnormal" or "normal", which this study seeks
to fix.
Although a maximum accuracy of 96.33% was reached in the classification, the
authors assert that two of the five classes could perform significantly better with
additional examination. The study uses a histopathology dataset, which comes from the
microscopic examination of an invasive biopsy. Our strategy prioritises the use of CT
scans, a minimally invasive technique for cancer screening.
Sajja T, Devarapalli R, et al. (2019) carried out a study that tested the Google-Net
pre-trained CNN model for lung cancer detection. To avoid overfitting, 60 percent of
all neurons are dropped in the dropout layers, which results in a simpler
and sparser network for classifying a CT scan image as benign or malignant. To
determine whether it can work more accurately, the model needs to be tested using
various dropout rates. Our strategy seeks to build a compact CNN model to classify
cancer and provide medical details.
Tripathi P, Tyagi S, et al. (2019) published a study in which they try to identify lung
cancer using five distinct image segmentation methods.
They conclude that marker-controlled watershed segmentation produces the most
accurate results. The comparative study reveals that
CT scans typically offer the best chance of cancer detection and ought to be the
preferred method. Therefore, we will classify the various stages
using Deep Learning on CT images.
Siddharth Bhatia et al. (2019) describe a technique that uses deep residual learning
to detect lung cancer. They present a number of preprocessing techniques to identify
lung regions that are vulnerable to cancer using UNet and ResNet models.
To calculate the likelihood that a CT image is malignant, they
assess classifiers such as Random Forest and XGBoost. The authors' accuracy is
highest, 84%, when they combine the two classifiers. The
limitation here is that a higher accuracy may have been achievable than was actually
obtained.
After pre-processing the data with the Median and Gaussian filters, the Watershed
algorithm was used to segment the data. With a 92% accuracy rate, the improved
model outscored the prior best model by 5.4%. The model's only shortcoming is that
it does not distinguish between the various stages of cancer (I through IV).
Inspired by the AlphaGo system, Ali I et al. (2018) created a deep learning method to
classify the presence of a malignant nodule by treating a CT image as a collection of
states. They use reinforcement learning, which gets better over
time and with more data.
According to their research, the model reaches a high accuracy of 99.1% on the
training data, while accuracy on the validation data is only 64.4%. As a result,
the model appears to be overfitted.
2.2 ANALYSIS
3    Sajja T, Devarapalli R, Kalluri H (2018)    Lung Cancer Detection Based on CT Scan Images by Using Deep Transfer Learning [31]    A deep neural network based on Google-Net    Need for maximum dropout ratio due to overfitted data
various stages of
cancer.
● Appropriate technique for representing noisy and unbalanced data;
2.3.4 Introduction to Django Framework
Django is a high-level Python web framework that encourages rapid development and
clean, pragmatic design. It follows the model-template-view (MTV) architectural
pattern, Django's variant of model-view-controller (MVC), and provides a robust set
of tools and features for building web applications.
This documentation serves as a comprehensive guide for getting started with Django
and covers various aspects of web development using the framework.
Table of Contents
1. Installation
2. Project Setup
3. Models
4. Views
5. Templates
6. URL Routing
7. Forms
8. Middleware
9. Authentication and Authorization
10. Database Integration
11. Testing
12. Deployment
● Installation To install Django, you can use pip, the package manager for
Python. Open your terminal or command prompt and run the following
command: ``` pip install Django ```
● Project Setup Once Django is installed, you can create a new project using the
`django-admin` command-line utility. Navigate to the desired directory and
run the following command: ``` django-admin startproject project_name ```
This will create a new directory `project_name` containing the basic project
structure.
● Models Models in Django are Python classes that define the structure and
behavior of data. They are used to interact with the database and represent the
application's data objects. In your project directory, open the `models.py` file
in your desired app and define your models using Django's ORM
(Object-Relational Mapping).
● Views Views are Python functions or classes that handle HTTP requests and
return responses. They define the logic behind different pages or endpoints of
your web application. Create views by defining functions or classes in your
app's `views.py` file.
● Templates Templates are used to generate HTML dynamically based on data
and user input. Django's template engine allows you to create reusable
templates that can be populated with data from your views. Templates are
typically stored in a `templates` directory within your app.
● URL Routing URL routing maps URLs to views in Django. You can define
URL patterns using regular expressions or simpler route patterns. URL
patterns are defined in the `urls.py` file of your app or project.
● Forms Django provides a forms API for handling form data input and
validation. Forms can be created using Python classes and can be rendered in
templates. Forms are defined in the `forms.py` file of your app.
● Middleware Middleware is a Django feature that allows you to process
requests and responses globally. It sits between the web server and Django's
view processing. You can use middleware to perform tasks such as
authentication, session management, and request/response modification.
● Authentication and Authorization Django provides built-in support for user
authentication and authorization. It includes features such as user registration,
login, logout, and permission-based access control. Django's authentication
system is customizable and can be extended to suit your application's needs.
● Database Integration Django supports various databases including
PostgreSQL, MySQL, SQLite, and Oracle. It provides an abstraction layer
called the Object-Relational Mapping (ORM) that allows you to interact with
the database using Python code. Django takes care of generating SQL queries
and managing database connections.
● Testing Django includes a testing framework that allows you to write unit
tests for your application. You can create test cases to verify the behavior of
your models, views, forms, and other components. Django's test framework supports
"test-driven development" (TDD) and helps ensure the
stability and correctness of your code.
● Deployment Django applications can be deployed to various web servers and
hosting platforms. The process typically involves configuring the web server,
setting up a production database, and deploying your code. Django provides
guidelines and best practices for deployment, including using tools like
Gunicorn, Nginx, and Docker.
2.3.5 Conclusion
In conclusion, Django is a powerful and popular web framework for Python that
enables developers to build robust, scalable, and secure web applications with ease.
Its rich set of features, such as its ORM, template engine, URL routing, forms,
authentication, and testing framework, makes it a preferred choice for web
development. Django follows the model-template-view (MTV) pattern, its variant of
MVC, promoting clean and
maintainable code. It emphasizes DRY (Don't Repeat Yourself) principles, providing
a high level of abstraction and automating common tasks, which leads to increased
productivity and faster development. With Django, you can integrate various
databases, handle user authentication and authorization, manage forms and validation,
and handle HTTP requests and responses efficiently. It also provides excellent
documentation and an active community, making it easy to find resources, tutorials,
and support. Furthermore, Django supports deployment to different web servers and
platforms, allowing you to scale your application and handle high traffic scenarios
effectively. Its deployment guidelines and best practices ensure a smooth transition
from development to production. Whether you are a beginner or an experienced
developer, Django offers a flexible and intuitive framework for web development in
Python. It empowers you to focus on building your application's logic and
functionality while handling common web development tasks behind the scenes. By
choosing Django, you can benefit from its stability, security, and the extensive
ecosystem of packages and libraries available. Start exploring Django today and
unlock the potential to create dynamic and sophisticated web applications.
3. SYSTEM ANALYSIS & DESIGN
In this section, we describe the dataset we used as well as the two key
components of the methodology we followed to predict the risk of lung cancer: class
balancing and feature ranking in the balanced data. Additionally, we note the
nominal features' occurrence frequencies in relation to the lung cancer class.
ML models and performance metrics are also included.
• Age (years) [38]: This feature records the participant's age.
• Smoking [39]: This feature indicates whether or not the participant smokes.
• Yellow fingers [40]: This feature indicates whether or not the participant has
yellow fingers.
• Anxiety [41]: This feature indicates whether or not the participant shows anxiety.
• Peer pressure [42]: This feature records whether or not the participant experiences
peer pressure.
• Chronic disease [43]: This feature indicates whether or not the participant has a
chronic illness.
• Fatigue [44]: This feature indicates whether or not the participant experiences
fatigue.
• Allergy [45]: This feature indicates whether or not the participant has an
allergy.
• Wheezing [46]: This feature indicates whether or not the participant has wheezing.
• Alcohol [47]: This feature indicates whether or not the participant consumes alcohol.
• Coughing [48]: This feature indicates whether or not the participant coughs.
• Shortness of breath [49]: This feature indicates whether or not the participant
experiences shortness of breath.
• Swallowing difficulty [50]: This feature indicates whether or not the
participant has trouble swallowing.
• Chest pain [51]: This feature records whether or not the participant is experiencing
chest pain.
• Lung Cancer: This feature indicates whether or not the participant has been
diagnosed with lung cancer.
With the exception of age, which is a number, all the attributes are nominal.
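As a rough illustration of how the attributes listed above could be turned into numeric input for a classifier, the sketch below keeps age as a number and maps each nominal attribute to 1 (Yes) or 0 (No). The field names and the "Yes"/"No" string coding are assumptions made for this example; the actual dataset's column names and coding may differ.

```python
# Nominal (Yes/No) attributes from the list above; the names are illustrative.
NOMINAL_FEATURES = [
    "smoking", "yellow_fingers", "anxiety", "peer_pressure",
    "chronic_disease", "fatigue", "allergy", "wheezing", "alcohol",
    "coughing", "shortness_of_breath", "swallowing_difficulty", "chest_pain",
]


def encode_record(record):
    """Turn one participant record (a dict) into a numeric feature vector.

    Age stays numeric; every nominal attribute becomes 1.0 (Yes) or 0.0 (No).
    """
    vector = [float(record["age"])]
    for name in NOMINAL_FEATURES:
        value = str(record[name]).strip().lower()
        if value not in ("yes", "no"):
            raise ValueError(f"{name!r} must be 'Yes' or 'No', got {record[name]!r}")
        vector.append(1.0 if value == "yes" else 0.0)
    return vector


if __name__ == "__main__":
    # Hypothetical participant: a 61-year-old smoker with no other findings.
    example = {name: "No" for name in NOMINAL_FEATURES}
    example.update({"age": 61, "smoking": "Yes"})
    print(encode_record(example))
```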
ratio (GR) method [58], which assigns a score based on

$$GR(f_i) = \frac{H(c) - H(c \mid f_i)}{H(f_i)}$$

where $H(c)$ is the entropy of the variable that captures the class values, and
$H(c \mid f_i)$ and $H(f_i)$ are the conditional entropy of the class given the feature
and the entropy of the feature $f_i$ ($i = 1, 2, 3, \ldots, 15$), respectively. To
evaluate a feature's capacity to best distinguish between instances in the two classes,
Random Forest computes the Gini impurity [59]. Table 1 displays the ranking scores in
descending order. We can observe that five out of fourteen features were placed in
the same order of importance by both approaches based on the calculated scores,
while some of the other features were placed in adjacent or reverse order. Values
that are close to 0 and/or negative indicate features that are of low or no
importance. All of the features will be taken into account while training and
validating the models because they are necessary predictors of lung cancer
development and guide medical professionals' assessment of it.
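The gain ratio score defined above can be computed directly from value counts. The following is a minimal sketch on toy data, not the tool used in the study; the function names and the two example features are made up for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def conditional_entropy(labels, feature):
    """H(c | f): class entropy within each feature value, weighted by value frequency."""
    total = len(labels)
    groups = {}
    for f, c in zip(feature, labels):
        groups.setdefault(f, []).append(c)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def gain_ratio(labels, feature):
    """GR(f) = (H(c) - H(c | f)) / H(f); defined as 0 when the feature is constant."""
    h_f = entropy(feature)
    if h_f == 0:
        return 0.0
    return (entropy(labels) - conditional_entropy(labels, feature)) / h_f


if __name__ == "__main__":
    # Toy data: four participants, two classes.
    labels = ["yes", "yes", "no", "no"]
    informative = ["y", "y", "n", "n"]    # perfectly aligned with the class
    uninformative = ["y", "n", "y", "n"]  # independent of the class
    print(gain_ratio(labels, informative), gain_ratio(labels, uninformative))
```

A feature that fully determines the class scores highest, while a feature that is independent of the class scores zero, which matches the reading of low scores as low importance.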
The breakdown of participants per age group is also shown in Figure 2. We note that
the age range 60–64 has the highest frequency of lung cancer cases, with those 50–79
years old being the most commonly affected.
Figure 2. Distribution of participants among the age groups in the balanced data.
Table 2 displays the distribution of the features in each class. Men and women are
almost equally likely to be diagnosed with lung cancer.
Additionally, based on this table, we can see that each of the features we
examined is present in 27% to 36% of lung cancer patients, although a
significant share of participants reported these symptoms without having been
diagnosed with lung cancer. Even when the disease has not occurred, monitoring risk
factors, warning signs, and follow-up clinical examinations may help to rule out or
reduce the disease's unfavourable outcome.
Table 2. The breakdown of participants in the balanced data by feature values and
class label

Feature            Lung Cancer          Feature                 Lung Cancer
                   No       Yes                                 No       Yes
Sex                                     Allergy
  Women            26.12%   23.15%        No                    49.07%   19.05%
  Men              23.88%   26.85%        Yes                    0.92%   30.94%
Smoking                                 Wheezing
  No               30.01%   21.30%        No                    47.45%   19.82%
  Yes              20.01%   28.70%        Yes                    2.59%   30.18%
Yellow fingers                          Alcohol
  No               29.82%   19.81%        No                    48.73%   19.46%
  Yes              20.18%   30.19%        Yes                    1.31%   30.52%
Anxiety                                 Coughing
  No               33.51%   23.70%        No                    45.01%   18.71%
  Yes              16.49%   26.30%        Yes                    5.10%   31.31%
Peer pressure                           Shortness of breath
  No               48.16%   23.15%        No                    11.77%   17.40%
  Yes               1.87%   26.85%        Yes                   38.34%   32.58%
Chronic disease                         Swallowing difficulty
  No               41.84%   23.70%        No                    49.06%   24.06%
  Yes               8.14%   26.30%        Yes                    0.93%   25.92%
Fatigue                                 Chest pain
  No               15.92%   15.00%        No                    32.58%   20.38%
  Yes              34.06%   35.00%        Yes                   17.42%   29.64%
Classification performance is evaluated using accuracy, which measures the number of
instances, out of all the data, that were correctly predicted. We also looked at
recall, which measures a model's sensitivity: the proportion of patients who
genuinely had lung cancer and were correctly classified as positive, relative to all
positive participants. While recall is a measure of quantity, precision (the share of
predicted positives that are truly positive) is a measure of quality. The F-Measure,
which combines precision and recall into a single score, enables the evaluation of
models. Finally, the AUC, which ranges from 0 to 1, is used to identify the ML model
that performs best at differentiating cases of lung cancer from cases of non-lung
cancer. The AUC measures separability: when it reaches one, the model is fully
capable of differentiating between the two class distributions.
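The metrics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in this project: the function names and the toy labels and scores are hypothetical, and the AUC is computed through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive instance receives the higher score, with ties counted as half).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-Measure from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def auc_score(y_true, scores, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = Lung Cancer, 0 = Non-Lung Cancer.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

print(classification_metrics(y_true, y_pred))
print(auc_score(y_true, scores))
```

An AUC of 1 in this sketch means that every positive instance is scored above every negative one, which matches the notion of complete separability used in the text.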
4. Results and Discussion
Models   Variables
SVM      eps=0.002
         gamma=0.0
         kerneltype: linear
         loss=0.2
KNN      K=3.1
         SearchAlgorithm: LinearNNSearch with Euclidean
RF       maxDepth=0
         numIterations=100
         numFeatures=0
4.2 Evaluation
Several machine learning models, including SVM, KNN, and RF, are assessed in this
study in order to identify the model with the greatest predictive performance in
terms of accuracy, precision, recall, F-Measure, and AUC. The performance
evaluation of the models following SMOTE with 10-fold cross-validation is provided
in Table 4. All of our suggested models achieve percentages greater than 93.4% (RT).
The best-performing model attains an AUC of 99.4% together with 97.2% accuracy,
precision, recall, and F-Measure. It should also be noted that RF, with 99.2%, and
AdaBoostM1, with 98.6%, which uses RF as its base classifier, both obtain high AUC
percentages. Finally, the AUC ROC curves of the proposed machine learning models
are plotted in Figure 2 for reference.
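The evaluation protocol named above (class balancing with SMOTE followed by 10-fold cross-validation) can be illustrated with a small, self-contained sketch. It is not the pipeline used to produce the reported figures: the toy two-dimensional data, the function names, and the plain-Python KNN classifier are assumptions made for illustration only. SMOTE is sketched in its basic form, where each synthetic point is interpolated between a minority sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic minority points by linear interpolation."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: euclidean(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (linear search)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def k_fold_indices(n, k=10, rng=None):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    rng = rng or random.Random(0)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k_folds=10, k_neighbours=3, rng=None):
    """Accuracy pooled over all held-out folds."""
    correct = 0
    for fold in k_fold_indices(len(X), k_folds, rng):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        for i in fold:
            if knn_predict(train_X, train_y, X[i], k_neighbours) == y[i]:
                correct += 1
    return correct / len(X)


# Hypothetical imbalanced toy data: 30 majority and 10 minority points.
rng = random.Random(42)
majority = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(30)]
minority = [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(10)]

# Balance the classes first, then cross-validate, following the order in the text.
synthetic = smote(minority, n_new=len(majority) - len(minority), k=5, rng=rng)
X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))

accuracy = cross_validate(X, y, k_folds=10, k_neighbours=3, rng=rng)
print(f"10-fold accuracy after balancing: {accuracy:.3f}")
```

The sketch balances the whole dataset before splitting, mirroring the order stated in the text. A common variant oversamples only the training folds, so that synthetic points derived from a held-out instance never reach the training data.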
Table 5. Evaluation of performance following SMOTE with 10-fold cross validation
Model   Accuracy   Precision   Recall   F-Measure   AUC
SVM     0.954      0.954       0.954    0.954       0.954
KNN     0.96       0.959       0.959    0.959       0.978
RF      0.952      0.952       0.952    0.952       0.991
Additionally, Table 5 compares the accuracy, precision, and recall of our models
with those reported in [38], whose authors used a dataset [39] with the same number
of attributes as ours. The models' results were obtained after 10-fold
cross-validation. Compared with the models in that study, our suggested models
performed better on all three metrics.
Model   Proposed Models   [38]    Proposed Models   [38]    Proposed Models   [38]
SVM     0.954             0.909   0.954             0.916   0.954             0.903
KNN     0.952             0.855   0.952             0.874   0.952             0.847
Figure 3. Models Evaluation Based on AUC ROC Curves
4.3 Discussion
The suggested approach is based on a dataset whose features capture human behaviours
(such as smoking and alcohol consumption) that are typically encountered as risk
factors, together with symptoms and signs that lung cancer patients exhibit. However,
as we noted in the feature analysis part of the materials and methods section, these
symptoms are not always connected to lung cancer, since the signs of lung cancer,
which are not visible to the naked eye, are frequently confused with those of other
illnesses. Asthma, coughing, shortness of breath, and allergies are the most prevalent
symptoms [25]. In this study, we chose to train different classifiers on several risk
factors linked to these signs in order to recognise precisely the class label
(Lung Cancer or Non-Lung Cancer) of an unknown instance, and consequently the risk
related to it. Even if lung cancer has not yet appeared, risk factor surveillance and
subsequent clinical evaluation are appropriate practices. By enabling early
diagnosis, these practices may help to prevent or reduce the disease's unfavourable
effects. An X-ray, CT, PET-CT, and MRI scan of the patient's chest is typically
conducted in order to perform a clinical assessment and identify lung cancer [77].
Combining the dataset considered here with features generated from lung image data
would therefore be highly helpful for the quick identification of the illness and its
stage. Recall that the purpose of this study is to determine whether lung cancer
occurs or not. A binary classification problem was therefore investigated. From a
machine learning standpoint, the challenge of identifying the cancer stage could be
addressed with a multi-class classification scheme, such as one vs. one (OVO) or
one vs. all (OVA) [33]. However, the dataset under discussion does not allow us to
tackle the problem in this manner.
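To make the OVO and OVA schemes mentioned above concrete, the following sketch shows how a multi-class problem can be reduced to several binary ones. Everything in it is an assumption for illustration: the stage labels "I", "II", and "III" and the toy points are hypothetical (the dataset used in this study has no stage labels), and the binary learner is a simple nearest-centroid scorer rather than any model evaluated in this report.

```python
import math
from itertools import combinations


def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def train_binary(pos, neg):
    """Binary scorer: a positive score means x is closer to the 'pos' centroid."""
    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(x):
        return math.dist(x, c_neg) - math.dist(x, c_pos)

    return score


def train_ova(X, y):
    """One vs. all: one binary model per class (that class against the rest)."""
    models = {}
    for label in sorted(set(y)):
        pos = [x for x, t in zip(X, y) if t == label]
        neg = [x for x, t in zip(X, y) if t != label]
        models[label] = train_binary(pos, neg)
    return models


def predict_ova(models, x):
    # The class whose binary model is most confident wins.
    return max(models, key=lambda label: models[label](x))


def train_ovo(X, y):
    """One vs. one: one binary model per unordered pair of classes."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        pos = [x for x, t in zip(X, y) if t == a]
        neg = [x for x, t in zip(X, y) if t == b]
        models[(a, b)] = train_binary(pos, neg)
    return models


def predict_ovo(models, x):
    # Each pairwise model casts one vote; the class with most votes wins.
    votes = {}
    for (a, b), score in models.items():
        winner = a if score(x) >= 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)


# Hypothetical stage-labelled toy data (three well-separated groups).
X = [[0, 0], [0, 1], [1, 0],
     [5, 0], [5, 1], [6, 0],
     [0, 5], [1, 5], [0, 6]]
y = ["I", "I", "I", "II", "II", "II", "III", "III", "III"]

ova = train_ova(X, y)
ovo = train_ovo(X, y)
for query in ([0.5, 0.5], [5.5, 0.5], [0.5, 5.5]):
    print(query, predict_ova(ova, query), predict_ovo(ovo, query))
```

With c classes, OVA trains c binary models while OVO trains c(c-1)/2, so OVO involves more but smaller training problems.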
These two examples demonstrate how machine learning may be applied in healthcare
flexibly and in a variety of ways. Regardless of the data or the associated disease,
all models showed good performance across the board following class balancing with
SMOTE. Additionally, stacking and voting ensemble models, which were not
examined here, produced encouraging results, as demonstrated in [32]. For both lung
cancer and chronic kidney disease, the predominance of the rotation forest classifier
among tree-based models is confirmed. As we wrap up the results and discussion
section, we must highlight a limitation of our study. This experiment relied on a
publicly accessible dataset [36], not one from a hospital department or research
centre that might have provided us with a more comprehensive dataset with a wider
range of attributes. Accessing sensitive medical data is also challenging due to
privacy concerns. However, the dataset we used has useful qualities that helped us
arrive at trustworthy and precise research findings.
5. Conclusions
The primary respiratory organs are the lungs. Because the lungs supply the blood
with oxygen, which is necessary for human life, humans never stop breathing until
they die. Lung cancer is the most common cause of cancer-related death in people of
both genders. The patient's life expectancy depends on how advanced the cancer is:
the earlier the diagnosis, the longer the life expectancy.
Future Work
The current study will be extended in two directions. First, the machine learning
framework will be enhanced by applying deep learning techniques, particularly long
short-term memory (LSTM) and convolutional neural networks (CNN), and by assessing
the accuracy of the results against related studies. Second, for the evaluation of
classification models on the same dataset, a separate data-splitting technique termed
bootstrapping, which applies resampling with replacement to the original data, will
be utilised in addition to the present 10-fold cross-validation.
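As an illustration of the bootstrapping idea outlined above, the sketch below draws a training sample with replacement and evaluates on the instances that were never drawn (the out-of-bag instances). It is only a sketch under stated assumptions: the function names, the toy data, and the 1-nearest-neighbour classifier are hypothetical and are not part of the present study.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; untouched indices form the test set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def nearest_neighbour_predict(train_X, train_y, x):
    """Label of the single closest training point (Euclidean distance)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def bootstrap_accuracy(X, y, n_rounds=100, seed=0):
    """Mean out-of-bag accuracy over n_rounds bootstrap resamples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        in_bag, out_of_bag = bootstrap_split(len(X), rng)
        if not out_of_bag:
            continue  # nothing left to test on in this round
        train_X = [X[i] for i in in_bag]
        train_y = [y[i] for i in in_bag]
        hits = sum(nearest_neighbour_predict(train_X, train_y, X[i]) == y[i]
                   for i in out_of_bag)
        scores.append(hits / len(out_of_bag))
    if not scores:
        raise ValueError("no round produced out-of-bag instances")
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups on a line.
X = [[i * 0.1, 0.0] for i in range(10)] + [[5 + i * 0.1, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10

print(f"mean out-of-bag accuracy: {bootstrap_accuracy(X, y):.3f}")
```

Because sampling is done with replacement, roughly 63% of the distinct instances appear in each training sample on average, and the remaining instances serve as that round's test set.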
SNAPSHOTS OF PROJECT
References
1. Schiller, H.B.; Montoro, D.T.; Simon, L.M.; Rawlins, E.L.; Meyer, K.B.; Strunz,
M.; Vieira Braga, F.A.; Timens, W.; Koppelman, G.H.; Budinger, G.S.; et al. The
human lung cell atlas: A high-resolution reference map of the human lung in
health and disease. Am. J. Respir. Cell Mol. Biol. 2019, 61, 31–41. [CrossRef]
[PubMed]
2. Hervier, B.; Russick, J.; Cremer, I.; Vieillard, V. NK cells in the human lungs.
Front. Immunol. 2019, 10, 1263. [CrossRef] [PubMed]
3. Barroso, A.T.; Martín, E.M.; Romero, L.M.R.; Ruiz, F.O. Factors affecting lung
function: A review of the literature. Arch. De Bronconeumol. 2018, 54, 327–332.
[CrossRef]
4. Mirza, S.; Clay, R.D.; Koslow, M.A.; Scanlon, P.D. COPD guidelines: A review
of the 2018 GOLD report. In Mayo Clinic Proceedings; Elsevier: Amsterdam, The
Netherlands, 2018; Volume 93, pp. 1488–1502.
5. Dotan, Y.; So, J.Y.; Kim, V. Chronic bronchitis: Where are we now? Chronic
Obstr. Pulm. Dis. J. COPD Found. 2019, 6, 178. [CrossRef]
6. Stern, J.; Pier, J.; Litonjua, A.A. Asthma epidemiology and risk factors. In
Seminars in Immunopathology; Springer: Berlin/Heidelberg, Germany, 2020;
Volume 42, pp. 5–15.
7. Bell, S.C.; Mall, M.A.; Gutierrez, H.; Macek, M.; Madge, S.; Davies, J.C.; Burgel,
P.R.; Tullis, E.; Castaños, C.; Castellani, C.; et al. The future of cystic fibrosis care:
A global perspective. Lancet Respir. Med. 2020, 8, 65–124. [CrossRef]
8. Mandell, L.A.; Niederman, M.S. Aspiration pneumonia. N. Engl. J. Med. 2019,
380, 651–663. [CrossRef]
9. Barta, J.A.; Powell, C.A.; Wisnivesky, J.P. Global epidemiology of lung cancer.
Ann. Glob. Health 2019, 85, 8. [CrossRef]
10. Bradley, S.H.; Kennedy, M.; Neal, R.D. Recognising lung cancer in primary care.
Adv. Ther. 2019, 36, 19–30. [CrossRef]
11. Athey, V.L.; Walters, S.J.; Rogers, T.K. Symptoms at lung cancer diagnosis are
associated with major differences in prognosis. Thorax 2018, 73, 1177–1181.
[CrossRef]
12. Duma, N.; Santana-Davila, R.; Molina, J.R. Non–small cell lung cancer:
Epidemiology, screening, diagnosis, and treatment. In Mayo Clinic Proceedings;
Elsevier: Amsterdam, The Netherlands, 2019; Volume 94, pp. 1623–1640.
13. Romaszko, A.M.; Doboszyńska, A. Multiple primary lung cancer: A literature
review. Adv. Clin. Exp. Med. 2018, 27, 725–730. [CrossRef]
14. No Tobacco ’22. Available online:
https://www.lung.org/media/press-releases/no-tobacco-%E2%80%9922 (accessed
on 6 August 2022).
15. Wadowska, K.; Bil-Lula, I.; Trembecki, Ł.; Śliwińska-Mossoń, M. Genetic
markers in lung cancer diagnosis: A review. Int. J. Mol. Sci. 2020, 21, 4569.
[CrossRef] [PubMed]
16. Thakur, S.K.; Singh, D.P.; Choudhary, J. Lung cancer identification: A review
on detection and classification. Cancer Metastasis Rev. 2020, 39, 989–998.
[CrossRef] [PubMed]
17. Yang, G.; Xiao, Z.; Tang, C.; Deng, Y.; Huang, H.; He, Z. Recent advances in
biosensor for detection of lung cancer biomarkers. Biosens. Bioelectron. 2019,
141, 111416. [CrossRef] [PubMed]
18. Artificial Intelligence/Machine Learning (AI/ML)-Based: Software as a Medical
Device (SaMD) Action Plan. Available online:
https://www.fda.gov/media/145022/download (accessed on 30 July 2022).
19. Mahler, M.; Auza, C.; Albesa, R.; Melus, C.; Wu, J.A. Regulatory aspects of
artificial intelligence and machine learning-enabled software as medical devices
(SaMD). In Precision Medicine and Artificial Intelligence; Elsevier: Amsterdam,
The Netherlands, 2021; pp. 237–265.
20. Dritsas, E.; Trigka, M. Data-Driven Machine-Learning Methods for Diabetes Risk
Prediction. Sensors 2022, 22, 5304. [CrossRef] [PubMed]
21. Dritsas, E.; Alexiou, S.; Konstantoulas, I.; Moustakas, K. Short-term Glucose
Prediction based on Oral Glucose Tolerance Test Values. In Proceedings of the
International Joint Conference on Biomedical Engineering Systems and
Technologies—HEALTHINF, Vienna, Austria, 9–11 February 2022; Volume 5,
pp. 249–255.
22. Dritsas, E.; Fazakis, N.; Kocsis, O.; Fakotakis, N.; Moustakas, K. Long-Term
Hypertension Risk Prediction with ML Techniques in ELSA Database. In
Proceedings of the International Conference on Learning and Intelligent
Optimization, Athens, Greece, 20–25 June 2021; Springer: Berlin/Heidelberg,
Germany, 2021; pp. 113–120.
23. De Felice, F.; Polimeni, A. Coronavirus disease (COVID-19): A machine learning
bibliometric analysis. In Vivo 2020, 34, 1613–1617. [CrossRef] [PubMed]
24. Dritsas, E.; Trigka, M. Machine Learning Methods for Hypercholesterolemia
Long-Term Risk Prediction. Sensors 2022, 22, 5365. [CrossRef]
25. Dritsas, E.; Alexiou, S.; Moustakas, K. COPD Severity Prediction in Elderly with
ML Techniques. In Proceedings of the 15th International Conference on PErvasive
Technologies Related to Assistive Environments, Corfu, Greece, 29 June–1 July
2022; pp. 185–189.
26. Dritsas, E.; Trigka, M. Stroke Risk Prediction with Machine Learning Techniques.
Sensors 2022, 22, 4670. [CrossRef] [PubMed]
27. Dritsas, E.; Alexiou, S.; Moustakas, K. Cardiovascular Disease Risk Prediction
with Supervised Machine Learning Techniques. In Proceedings of the ICT4AWE,
Online, 23–25 April 2022; pp. 315–321.
28. Spann, A.; Yasodhara, A.; Kang, J.; Watt, K.; Wang, B.; Goldenberg, A.; Bhat, M.
Applying machine learning in liver disease and transplantation: A comprehensive
review. Hepatology 2020, 71, 1093–1105. [CrossRef]
29. Muthazhagan B, Ravi T, Rajinigirinath D. An enhanced computer-assisted lung
cancer detection method using content-based image retrieval and data mining
techniques. Journal of Ambient Intelligence and Humanized Computing. 2020 Jun
2:1-9.
30. Masud M, Sikder N, Nahid AA, Bairagi AK, AlZain MA. A machine learning
approach to diagnosing lung and colon cancer using a deep learning-based
classification framework. Sensors. 2021 Jan;21(3):748.
31. Sajja T, Devarapalli R, Kalluri H. Lung Cancer Detection Based on CT Scan
Images by Using Deep Transfer Learning. Traitement du Signal. 2019
Oct;36(4):339-44.
32. Tripathi P, Tyagi S, Nath M. A comparative analysis of segmentation techniques
for lung cancer detection. Pattern Recognition and Image Analysis. 2019
Jan;29(1):167-73.
33. Nasrullah N, Sang J, Alam MS, Mateen M, Cai B, Hu H. Automated lung nodule
detection and classification using deep learning combined with multiple strategies.
Sensors. 2019 Jan;19(17):3722.
34. Bhatia S, Sinha Y, Goel L. Lung cancer detection: a deep learning approach.
In Soft Computing for Problem Solving 2019 (pp. 699-705). Springer, Singapore.
35. Makaju S, Prasad PW, Alsadoon A, Singh AK, Elchouemi A. Lung cancer
detection using CT scan images. Procedia Computer Science. 2018 Jan
1;125:107-14.
36. Ali I, Hart GR, Gunabushanam G, Liang Y, Muhammad W, Nartowt B, Kane M,
Ma X, Deng J. Lung nodule detection via deep reinforcement learning. Frontiers
in oncology. 2018 Apr 16;8:108.
37. Stapelfeld, C.; Dammann, C.; Maser, E. Sex-specificity in lung cancer risk. Int. J.
Cancer 2020, 146, 2376–2382. [CrossRef] [PubMed] 42. de Groot, P.M.; Wu,
C.C.; Carter, B.W.; Munden, R.F. The epidemiology of lung cancer. Transl.
Lung Cancer Res. 2018, 7, 220. [CrossRef] [PubMed]
38. O’Keeffe, L.M.; Taylor, G.; Huxley, R.R.; Mitchell, P.; Woodward, M.; Peters,
S.A. Smoking as a risk factor for lung cancer in women and men: A systematic
review and meta-analysis. BMJ Open 2018, 8, e021611. [CrossRef] [PubMed]
39. Al-Bander, B.; Fadil, Y.A.; Mahdi, H. Multi-Criteria Decision Support System for
Lung Cancer Prediction; IOP Conference Series: Materials Science and
Engineering; IOP Publishing: Bristol, UK, 2021; Volume 1076, p. 012036.
40. Hu, T.; Xiao, J.; Peng, J.; Kuang, X.; He, B.; et al. Relationship between
resilience, social support as well as anxiety/depression of lung cancer patients: A
cross-sectional observation study. J. Cancer Res. Ther. 2018, 14, 72.
41. Leshargie, C.T.; Alebel, A.; Kibret, G.D.; Birhanu, M.Y.; Mulugeta, H.; Malloy,
P.; Wagnew, F.; Ewunetie, A.A.; Ketema, D.B.; Aderaw, A.; et al. The impact of
peer pressure on cigarette smoking among high school and university students in
Ethiopia: A systemic review and meta-analysis. PLoS ONE 2019, 14, e0222572.
[CrossRef]
42. Schabath, M.B.; Cote, M.L. Cancer progress and priorities: Lung cancer. Cancer
Epidemiol. Prev. Biomarkers 2019, 28, 1563–1579. [CrossRef]
43. Avancini, A.; Sartori, G.; Gkountakos, A.; Casali, M.; Trestini, I.; Tregnago, D.;
Bria, E.; Jones, L.W.; Milella, M.; Lanza, M.; et al. Physical activity and exercise
in lung cancer care: Will promises be fulfilled? Oncologist 2020, 25, e555–e569.
[CrossRef]
44. Kantor, E.D.; Hsu, M.; Du, M.; Signorello, L.B. Allergies and asthma in relation
to cancer risk. Cancer Epidemiol. Prev. Biomarkers 2019, 28, 1395–1403.
[CrossRef] [PubMed]
45. Alsharairi, N.A. The effects of dietary supplements on asthma and lung cancer
risk in smokers and non-smokers: A review of the literature. Nutrients 2019, 11,
725. [CrossRef] [PubMed]
46. Brenner, D.R.; Fehringer, G.; Zhang, Z.F.; Lee, Y.C.A.; Meyers, T.; Matsuo, K.;
Ito, H.; Vineis, P.; Stucker, I.; Boffetta, P.; et al. Alcohol consumption and lung
cancer risk: A pooled analysis from the International Lung Cancer Consortium
and the SYNERGY study. Cancer Epidemiol. 2019, 58, 25–32. [CrossRef]
[PubMed]
47. Harle, A.S.; Blackhall, F.H.; Molassiotis, A.; Yorke, J.; Dockry, R.; Holt, K.J.;
Yuill, D.; Baker, K.; Smith, J.A. Cough in patients with lung cancer: A
longitudinal observational study of characterization and clinical associations.
Chest 2019, 155, 103–113. [CrossRef] [PubMed]
48. Phillips, M.; Bauer, T.L.; Pass, H.I. A volatile biomarker in breath predicts lung
cancer and pulmonary nodules. J. Breath Res. 2019, 13, 036013. [CrossRef]
49. Brady, G.C.; Roe, J.W.; O’Brien, M.; Boaz, A.; Shaw, C. An investigation of the
prevalence of swallowing difficulties and impact on quality of life in patients with
advanced lung cancer. Support. Care Cancer 2018, 26, 515–519. [CrossRef]
50. Malinowska, K. The relationship between chest pain and level of perioperative
anxiety in patients with lung cancer. Pol. J. Surg. 2018, 90, 23–27. [CrossRef]
51. Mirza, S.; Clay, R.D.; Koslow, M.A.; Scanlon, P.D. COPD guidelines: A review
of the 2018 GOLD report. In Mayo Clinic Proceedings; Elsevier: Amsterdam, The
Netherlands, 2018; Volume 93, pp. 1488–1502.
52. Alsharairi, N.A. The effects of dietary supplements on asthma and lung cancer
risk in smokers and non-smokers: A review of the literature. Nutrients 2019, 11,
725. [CrossRef] [PubMed]
53. Leshargie, C.T.; Alebel, A.; Kibret, G.D.; Birhanu, M.Y.; Mulugeta, H.; Malloy,
P.; Wagnew, F.; Ewunetie, A.A.; Ketema, D.B.; Aderaw, A.; et al. The impact of
peer pressure on cigarette smoking among high school and university students in
Ethiopia: A systemic review and meta-analysis. PLoS ONE 2019, 14, e0222572.
[CrossRef]
54. Ali I, Hart GR, Gunabushanam G, Liang Y, Muhammad W, Nartowt B, Kane M,
Ma X, Deng J. Lung nodule detection via deep reinforcement learning. Frontiers
in oncology. 2018 Apr 16;8:108.
55. Alsharairi, N.A. The effects of dietary supplements on asthma and lung cancer
risk in smokers and non-smokers: A review of the literature. Nutrients 2019, 11,
725. [CrossRef] [PubMed]
56. Hervier, B.; Russick, J.; Cremer, I.; Vieillard, V. NK cells in the human lungs.
Front. Immunol. 2019, 10, 1263. [CrossRef] [PubMed]
57. Phillips, M.; Bauer, T.L.; Pass, H.I. A volatile biomarker in breath predicts lung
cancer and pulmonary nodules. J. Breath Res. 2019, 13, 036013. [CrossRef]
58. Kantor, E.D.; Hsu, M.; Du, M.; Signorello, L.B. Allergies and asthma in relation
to cancer risk. Cancer Epidemiol. Prev. Biomarkers 2019, 28, 1395–1403.
[CrossRef] [PubMed]