Report of Mini Project
On
LUNG CANCER DETECTION
Submitted to
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR,
ANANTHAPURAM
In Partial Fulfillment of the Requirements for the Award of the Degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE & TECHNOLOGY
Submitted By
K. KAVYA - (18691A2813)
K.V. POOJITHA - (18691A2831)
D.V. PRATHYUSHA - (18691A2833)
K. REESHPA - (18691A2838)
Under the Guidance of
Dr. Rajakumar R
Associate Professor
Department of Computer Science & Technology
BONAFIDE CERTIFICATE
This is to certify that the mini project work entitled “LUNG CANCER DETECTION” is a
bonafide work carried out by
K. KAVYA - (18691A2813)
K.V. POOJITHA - (18691A2831)
D.V. PRATHYUSHA - (18691A2833)
K. REESHPA - (18691A2838)
Submitted in partial fulfillment of the requirements for the award of degree Bachelor of
Technology in the stream of Computer Science & Technology in Madanapalle Institute of
Technology and Science, Madanapalle, affiliated to Jawaharlal Nehru Technological
University Anantapur, Anantapur during the academic year 2021-2022.
ACKNOWLEDGEMENT
We sincerely thank Dr. C.Yuvaraj, M.E., Ph.D., Principal for guiding and providing
facilities for the successful completion of our mini project at Madanapalle Institute of
Technology and Science, Madanapalle.
We express our deep sense of gratitude to Dr. M. Sreedevi, Ph.D., Professor and Head,
Department of CST for her valuable guidance and constant encouragement given to us during this
work.
We also wish to place on record our gratefulness to the other faculty of the CST Department and
to our friends and our parents for their help and cooperation during our mini project work.
DECLARATION
We hereby declare that the results embodied in this project “LUNG CANCER
DETECTION” were carried out by us under the guidance of Dr. Rajakumar R, Associate Professor, Dept. of
Computer Science & Technology, in partial fulfillment of the requirements for the award of Bachelor of Technology in
Computer Science & Technology from Jawaharlal Nehru Technological University
Anantapur, Anantapur, and that we have not submitted the same to any other university/institute for the
award of any other degree.
Date :
Place :
PROJECT ASSOCIATES:
K. KAVYA - (18691A2813)
K.V. POOJITHA - (18691A2831)
D.V. PRATHYUSHA - (18691A2833)
K. REESHPA - (18691A2838)
I certify that the above statement made by the students is correct to the best of my knowledge.
Table of Contents
ACKNOWLEDGEMENT III
DECLARATION IV
ABSTRACT VIII
1 INTRODUCTION 1
1.1 Motivation 2
1.2 Existing System 2
1.2.1 Limitations of existing system 3
1.3 Objectives 3
1.4 Outcomes 3
1.5 Applications 3
1.6 Structure of Project (System Analysis) 3
1.6.1 Requisites Accumulating and Analysis 4
1.6.2 System Design 4
1.6.3 Implementation 4
1.6.4 Testing 4
1.6.5 Deployment of System and Maintenance 4
1.7 Functional Requirements 5
1.8 Nonfunctional Requirements 5
1.8.1 Examples 6
1.8.2 Advantages 6
1.8.3 Disadvantages 6
2 LITERATURE SURVEY 7
2.1 Literature Survey Conclusion 12
3 PROBLEM ANALYSIS 13
3.1 Existing Approach 13
3.1.1 Drawbacks 13
3.2 Proposed System 13
3.2.1 Advantages 13
3.3 Software and Hardware Requirements 14
3.4 About dataset 14
3.5 Algorithms 14
3.6 Flow chart 16
4 SYSTEM METHODOLOGY 17
4.1 CT scan 17
4.2 CAD 18
4.3 Feature Detection 18
5 IMPLEMENTATION 20
5.1 Code 20
6 EXPERIMENTAL RESULTS 40
7 TESTING 43
7.1 Software testing 43
7.1.1 Types of testing 43
8 CONCLUSION AND FUTURE SCOPE 44
8.1 Conclusion 44
8.2 Future scope 44
REFERENCES 45
LIST OF FIGURES
1 Lung CT 1
3 Project SDLC 3
4 Flow chart 16
5 Output 40, 41
ABSTRACT
Cancer is treacherous and appalling, and if it is not detected at an early stage it poses a huge risk to health. Hence the early detection of cancer is vital, and it can be accomplished through deep learning, machine learning and computer vision, which have significant potential with their recent advancements and developments and promise great value in early detection; this can be achieved by applying a well-honed deep learning algorithm to CT scans. Because early identification is exceedingly demanding, Low Dose Computed Tomography (LDCT) is used, and for the best results different types of approaches are explored and compared. The methods employed are segmentation techniques, 3D Convolutional Neural Networks (CNN) and U-Net building. The aim of this venture is to assess the information (slices of CT scans) furnished by numerous pre-processing strategies and to examine it using machine learning algorithms, in this case 3D Convolutional Neural Networks, to train and validate the model, in order to create an accurate model which may be used to decide whether or not someone has cancer. This will substantially assist in the identification and removal of cancer cells in the early stages.
Keywords: Lung Cancer, Deep Learning, Convolutional Neural Networks, Low Dose Computed Tomography, Computer Aided Detection, Feature Extraction, CT scan, Watershed Algorithm, U-Net, Image Segmentation, VGG model, False Positives.
CHAPTER 1
INTRODUCTION
Lung cancer is a disease in which the cells of the lung tissues grow uncontrollably and form tumors.
It is the leading cause of death from cancer among both men and women worldwide, accounting for over
1.8 million deaths according to the World Health Organization (WHO) and the GLOBOCAN 2020 global cancer survey.
The changes that cause this arise mainly from the interaction between a person's genetic factors and three
categories of external agents: physical carcinogens, such as ultraviolet and ionizing radiation; chemical
carcinogens, such as asbestos, components of tobacco smoke, aflatoxin (a food contaminant) and arsenic
(a drinking water contaminant); and biological carcinogens, such as infections from certain viruses, bacteria or parasites.
Fig. 1. Lung cancer shown on an X-ray as small blurred dots.
According to estimates from the World Health Organization (WHO) in 2019, cancer is the first or
second leading cause of death before the age of 70 years in 112 of 183 countries and ranks third or
fourth in a further 23 countries.
Cancer's rising prominence as a leading cause of death partly reflects marked declines in mortality
rates of stroke and coronary heart disease, relative to cancer, in many countries. The extent to which
the position of cancer as a cause of premature death reflects national levels of social and economic
development can be seen by comparing the maps in Figure 1 and Figure 2A, the latter depicting the
4-tier Human Development Index (HDI) based on the United Nations' 2019 Human Development
Report.
Fig. 2. (A) The 4-Tier Human Development Index (HDI) and (B) 20 Areas of the World. The sizes of
the respective populations are included in the legend. Source: United Nations Procurement
Division/United Nations Development Program.
Hence, early detection of cancer is an immediate need for action; the cancer burden can also be
reduced through early detection and through appropriate treatment and care of patients who
develop cancer. Many cancers have a high chance of cure if diagnosed early and treated
appropriately. Early detection of lung cancer is important because it allows for timely treatment and
has potential to reduce deaths. The way lung cancer is diagnosed is by inspecting a patient's CT
scan images, looking for small blobs in the lungs called nodules. Finding a nodule is not in itself
indicative of cancer; the nodules have to have certain characteristics (shape, size, etc.) to support a
cancer diagnosis.
1.1 MOTIVATION
From our investigation and analysis, we have found that there are multiple algorithms and
techniques used for detection and segmentation, but each has one drawback or another that
makes it fall behind. Hence, we compare the best approaches to give the best results, thereby
making early detection efficient and less time consuming.
1.2.1 Limitations of existing system
The recent emergence of deep convolutional neural networks has provided some attractive solutions
for domain transfer, mainly through the use of 2D and 3D networks.
1.3 OBJECTIVES
The paramount objective is to detect the location of cancerous lung nodules, to classify the
lung cancer and its severity, and to use the best method for early detection by comparing
machine learning and deep learning methods such as flattening, pooling, U-Net, etc.
1.4 OUTCOMES
We describe the methods, implementation steps, and results of lung cancer detection. Within the
approaches explored, we used three different methods to handle the labels: averaging nodule encodings per
patient, labeling nodules with the same label given to the patient, or, finally, not doing nodule
detection at all and opting instead for a 3D model on the raw images. Of these, the first method
gave the best score.
1.5 APPLICATIONS
This strategy is used for the early detection of cancer, for large datasets, and for locating cancer in the lung.
Practical Implementation
Deployment of Application of System
Project Maintenance
1.6.3 Implementation
Implementation is the phase where we endeavor to produce the practical output of the work done in the
design stage. Most of the coding of the business logic comes into action in this stage, and it is the
main and crucial part of the project.
1.6.4 Testing
UNIT TESTING
It is done by the developer at every stage of the project; fine-tuning of bugs and module
dependencies is also done by the developer, but here we only fix the runtime mistakes.
MANUAL TESTING
Because our project is at an academic level, we are unable to conduct any automated testing;
therefore, we rely on manual testing using trial-and-error methods.
1.6.5 Deployment of System and Maintenance
Once the project is complete, we will deploy the client system in the real world. Throughout our
academic break, we solely launched the client system in our college lab with all appropriate
equipment and a Windows OS. Our project's maintenance is a one-time procedure.
1.7 FUNCTIONAL REQUIREMENTS
1. Data Collection
2. Data Pre-processing
3. Training and Testing
4. Modelling
5. Predicting
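A minimal sketch of this functional flow is given below, with random stand-in data in place of the real pre-processed CT features and labels (steps 1 and 2), so that the train/test split (step 3) and the later fit/predict calls (steps 4 and 5) can be seen end to end; the array shapes are illustrative assumptions only.

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples of 32 pre-processed features and binary labels
X = np.random.rand(100, 32)          # placeholder for pre-processed CT features
y = np.random.randint(0, 2, 100)     # placeholder for cancer / no-cancer labels

# Step 3: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Steps 4 and 5 would then be: model.fit(X_train, y_train) and model.predict(X_test)
print(X_train.shape, X_test.shape)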
1.8 NONFUNCTIONAL REQUIREMENTS
Non-functional requirements (NFRs) also keep functional requirements in line, so to speak. Attributes that make the product affordable, easy to use, and accessible, for example, come from NFRs. Some actual examples: each page must load within 2 seconds; the process must finish within 3 hours so data is available by 8 a.m. local time after an overnight update; the system must meet the Web Content Accessibility Guidelines (WCAG) 2.1; database security must meet HIPAA requirements; users shall be prompted to provide an electronic signature before loading a new page. Describing non-functional requirements is just as critical as describing functional requirements. Types of non-functional requirements include:
Usability requirement
Serviceability requirement
Manageability requirement
Recoverability requirement
Security requirement
Data Integrity requirement
Capacity requirement
Availability requirement
Scalability requirement
Interoperability requirement
Reliability requirement
Maintainability requirement
Regulatory requirement
Environmental requirement
1.8.1 Examples
Here are some examples of non-functional requirements:
The date format must be as follows: month.date.year.
The web dashboard must be available to US users 98 percent of the time every month
during business hours EST.
1.8.2 Advantages
They ensure a positive user experience and software that is simple to use.
1.8.3 Disadvantages
Cons/drawbacks of non-functional requirements are:
Non-functional requirements may affect various high-level software subsystems.
They require special attention during the software architecture/high-level design phase, which
increases costs.
Their implementation does not usually map to a specific software subsystem.
It is hard to modify non-functional requirements once you pass the architecture phase.
CHAPTER 2
LITERATURE SURVEY
In 2004, Aristofanes C. Silva et al. [1] published a paper on diagnosing lung nodules by
means of the dice coefficient, where their major contribution was improving the dice coefficient
index and providing an optimized ROC curve and skeletonization. The algorithms employed here are texture
processing algorithms, mainly statistical, spectral and structural ones, and the proposed methods are a
variety, i.e., the Spatial Gray Level Dependence Method (SGLDM), the Gray Level Difference Method (GLDM),
and Gray Level Run Length Matrices (GLRLM). Simulated mainly with MATLAB and achieving an accuracy of
79.33%, its major drawback is that no dataset was used.
In 2019, M. Attique Khan et al. [2] published a paper regarding lung cancer detection in
which their major contribution was developing a novel design of contrast-stretching-based
classical feature fusion processing for localizing the cancer classification. The algorithm utilized here is a
CNN along with a gamma-correction maximum-intensity-weight approach and an entropy-based approach
with NCA (Neighborhood Component Analysis); their program was simulated with tools such as Jupyter
Notebook and CAD, which is an important notion. The Lung Data Science Bowl is the dataset used,
providing a maximum accuracy of 99.4%; however, difficulty in locating small lesions has been a
limitation for them.
In 2012, Maxine Tan et al. [3] published a paper whose main idea is a feature-deselective
neuroevolutionary classifier, with the novel Feature-Deselective NeuroEvolution (FD-NEAT) format
as their major goal and contribution. A genetic algorithm has been used along with the proposed
FD-NEAT classifier, and an ANN and an SVM respectively. The LIDC database has been utilized,
which gives an accuracy of about 83.93%, the challenge being the complexity in the number of nodes.
In 2015, Shuang Feng Dai et al. [4] published a paper on neurocomputing whose major
focus and contribution was providing the most robust method for lung segmentation so as to
minimize the post-processing time as much as possible. The algorithms used here are GMMs and the
EM algorithm, with minimum-cut theory as one of the methods for improved graph cuts alongside
morphological calculations. The dataset was derived from the General Hospital of Ningxia
Medical University and simulated with a Python tool; it achieved an accuracy of 85% and a sensitivity of
86%. Despite all this, it has drawbacks such as very high time consumption and high expense.
In 2019, Marjolein Heuvelmans et al. [5] published a paper whose main focus and successful major
contribution was to identify early-stage lung cancer while preventing unnecessary workup for
benign nodules, using a deep LR algorithm, with Lung-RADS and VDT (Volume Doubling Time) as
the proposed methods. With the NLST dataset and Python as the simulation tool, it
achieved an accuracy and sensitivity of 94.5% and 99% respectively, yet it is limited by the challenge of
high time consumption for asymmetric lungs.
In 2019, Yuan Huang et al. [6] published a paper regarding lung segmentation using a fully
convolutional network, providing as their major contribution a new weakly supervised training
scheme, with EWT GAN as the proposed solution and using two datasets, LOBE and LOLA 2011,
respectively. Even though they achieved an accuracy of 98% compared to the other methods
mentioned in their paper, with CAD and Python as the simulation tools, the major challenge
or drawback has been difficulty in detection if any other moderate lung diseases are present.
In 2018, Zhengwe Hui et al. [7] published a paper on pulmonary nodule detection, where
their major contribution was to improve nodule candidate detection using deep neural
networks; a multi-level network treats the nodule detection task as a pixel-level
segmentation problem, and the proposed methods are KNN and NMS. The dataset utilized is LUNA16,
achieving a 94.03% sensitivity score and one third fewer false positives with CAD and Python tools, the
limitation being heavily unbalanced positives and negatives. In 2019, Nadas El-Askary et al. [8]
published a paper on lung nodule feature extraction where their major contribution and goal was
to improve the early detection of nodules; a five-stage model and the random forest algorithm are used, the
proposed methods are SVM and KNN, and the LIDC database is the dataset used. Simulated with CAD and
Python, it achieved 90.73% accuracy, 90.67% sensitivity and 90.08% specificity, the
major limitation being its complex nature and multiple steps, which make it time consuming.
In 2016, Xinyan Li et al. [9] published a paper on enhanced lung segmentation which aims to
propose an efficient and accurate lung segmentation. The improved method used in this paper
combines mathematical morphology and kernel graph cuts, and KMC and OTSC are the algorithms
used. The absence of a dataset is the major drawback, the method is slower in nature, and it was simulated
with the Jupyter Notebook IDE and CAD.
In 2020, Bijaya Kumar et al. [10] published a paper regarding benign nodule segmentation;
its main contribution is that it considered benign tissue, adenocarcinoma and squamous cell
carcinoma. A CNN is the algorithm used, and the proposed solutions are machine learning, data
acquisition, data formatting and testing, with the dataset being RGB color histopathology images
in .jpeg format. Simulated with Python and CAD, it achieved 91% accuracy and 94% sensitivity,
but the challenge is the evidence of more false negatives.
In 2017, Jong Won Kim et al. [11] published a paper regarding the diagnosis of lung cancer using deep
neural networks, where their major contribution has been the segmentation of chest CT data and its
storage in 3D arrays. The algorithm used here is a 2D CNN, and DST, SVM, ANN and VGG are the proposed
methods. Simulated with Python and MATLAB, it achieved a varying accuracy between 40% and 70%,
and the limitation is that no dataset is used.
In 2021, Peter M. et al. [13] published a paper regarding nodule detection where their major
focus has been gaining information on nodule characteristics and more detailed information on
nodules, and their major contribution is that they were able to do so and improve the accuracy. LCP-CNN
is the algorithm used here, along with area-under-the-ROC-curve (AUC) analysis as the proposed
work; the US National Lung Screening Trial is the dataset used, and the simulation tools are CAD
and Python, which provided an accuracy of 97%, but the drawback is the imbalance between
malignant and benign cases.
In 2020, Wariya Chintanapakdee et al. [14] published a paper regarding early cancer detection;
their major contribution was detecting early-stage cancerous cells, their algorithm is the
LCS (Lung Cancer Screening) program with CNN, and the proposed work used here is radiology. The
simulation tools used are CAD and Jupyter Notebook, the dataset used is the Medicare
conversion factor, an accuracy of 74.7% was achieved, and the limitation is that it is time
consuming. In 2020, Ying Su et al. [15] published a paper regarding lung cancer detection in which
their major contribution was providing automatic detection of lung nodules; the
algorithm used here is the R-CNN algorithm, and the proposed method is an artificial intelligence
technique. Simulated with CAD and Python and achieving an accuracy of 91.2%, the dataset used
here is the international public database (LIDC-IDRI), and the drawback is that the optimization of
the R-CNN is not up to the mark.
In 2019, Onur Ozdemir et al. [16] published a paper regarding lung cancer detection and
segmentation where their major contribution was the characterization of model uncertainty in their
deep learning system and well-calibrated classification for CT analysis. The algorithm used here is an
end-to-end probabilistic diagnostic system for lung cancer built on deep 3D CNNs, and the proposed work
is an end-to-end automated diagnostic tool to diagnose lung cancer. The simulation tools are CADe and
CADx, and the datasets used here are LUNA16 and Kaggle, which gave results of 0.921 and
0.869 coefficient indices, the challenge being the lack of nodule detection in LUNA16.
In 2020, Wadood Abdul et al. [18] published a paper regarding lung cancer segmentation
where their major contribution was providing a CNN model: the ALCDC system is built to detect and
classify whether tumors found in the lungs are malignant or benign, and the ALCDC system is validated using
images from LIDC-IDRI. The comparison shows that the proposed ALCDC system
performs better than existing state-of-the-art systems, with SVM and KNN as the proposed works.
The simulation tools are a CAD system and Python, and the dataset is the LIDC-IDRI database, which gave
a result of 97.02% accuracy, the drawback being that it is time consuming.
In 2018, Bohdon Chapaliuk et al. [19] published a paper regarding lung cancer segmentation
where their major contribution was providing trained C3D and 3D DenseNet networks for whole-image
classification; 3D CNNs are the algorithms used, alongside U-Net and ANN as the proposed works.
Simulated with CAD and Jupyter Notebook, with the LIDC-IDRI dataset, it achieved 85% accuracy,
the challenge being the evidence of anomalies. In 2018, Ruchita Tekade et al. [20] published a paper
regarding lung nodule detection where the major contribution is the improved efficiency of lung
nodule detection. 2D and 3D CNN algorithms have been employed here, the proposed method is the
U-Net architecture, LUNA16 and LIDC-IDRI are the datasets used, the simulation tool used here is
Python 3.5, and the reported result is an accuracy of 95.66%, the major drawback and challenge
being excessive complexity.
In 2018, Goran Jakimovski et al. [21] published a paper regarding lung nodule detection and
lung feature extraction where the major contribution is a DNN that includes convolutional layers
that can search for anomalies. The algorithm used here is a 3D DNN, which is used to test the
deep neural networks, and the proposed works are PBR, LBP and various feature extractions. The
simulation tools are CAD along with Jupyter Notebook, the accuracy achieved is 90.1%, and the absence of
a dataset is the main drawback. In 2017, Lei Fan et al. [22] published a paper regarding lung
segmentation where the major contributions are image segmentation, pooling and 3D CNN; the
algorithm used here is a 3D CNN, which is efficient for lung detection, and the proposed works are
histograms and Zernike. The database used here is a CT image dataset, the simulation tool is CAD,
the accuracy is 67.7%, and the major drawbacks are that it is inefficient and slow.
In 2018, Guobin Zhang et al. [23] published a paper regarding lung nodule detection where
the major contribution is a detailed report of the five major components in a computer-aided
detection system, which are: data acquisition, pre-processing, lung segmentation, nodule detection
and false positive detection. A CNN algorithm and K-means clustering are the algorithm and the
proposed method used respectively, the dataset utilized here is the LIDC-IDRI dataset,
and the simulation tools are CAD and Python, which gave results of 89.49% sensitivity
and 84% accuracy, with time consumption and complex nature being the major cons and
disadvantages.
CHAPTER 3
PROBLEM ANALYSIS
3.1 EXISTING APPROACH
Existing methods are used to identify cancerous (malignant) and noncancerous (benign) nodules, which are
small growths of cells inside the lung. We all know that we need to detect malignant lung
nodules at an early stage to cure lung cancer, which is crucial for the prognosis. In CT scan images they
present differently on the basis of slight morphological changes, locations and clinical
biomarkers, from which we need to identify the nodules that are cancerous. Here we measure the
probability of malignancy for the early detection of cancerous lung nodules. Several diagnostic
procedures are used by physicians for the early diagnosis of malignant lung nodules, such as
clinical settings, computed tomography (CT) scan analysis, positron emission tomography (PET)
and needle-prick biopsy analysis. Mostly, for investigation purposes, we need computed
tomography (CT) images.
3.1.1 Drawbacks
In this, there is no significant difference in detection sensitivities between low-dose and standard-
dose CT images.
3.2.1 Advantages
Sequential_2 and VGG_16 show good behavior when training the model, reaching high levels of
test accuracy.
3.3 SOFTWARE AND HARDWARE REQUIREMENTS
Software Requirements
The software requirements of the model used here are the following:
Python IDLE 3.7 (or)
Jupyter Notebook
Hardware Requirements
Minimum hardware requirements are needed to run the model. A dataset that needs to store large
data/arrays in memory will require more RAM; images are used in this model to find the nodules.
Operating system: Windows 10
Processor: Intel i3
RAM: 4 GB
Hard disk: 250 GB
3.5 ALGORITHMS
3D-CNN
First, we need to know about the CNN: it is a convolutional neural network, a deep learning algorithm that
can take in an input image, assign importance to various aspects/objects in the image, and
differentiate one from the other. The pre-processing required in a ConvNet is much lower compared
to other classification algorithms. In this architecture there are two types of CNN:
segmentation CNN and classification CNN. Firstly, a segmentation CNN identifies the regions in an image
belonging to one or more classes of semantically interpretable objects. Secondly, a classification CNN
assigns each pixel (or image) to one or more of a given set of real-world object categories. A ConvNet can
successfully capture the spatial and temporal dependencies in an image through the application of
relevant filters. We have also used the 3D-CNN algorithm, which is the best among all techniques that
were used in this model.
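The following is a minimal sketch of such a 3D CNN in Keras; the input volume size, filter counts and layer arrangement are illustrative assumptions, not the exact model used in this project.

from tensorflow.keras import layers, models

# A 3D CNN convolves over a whole stack of CT slices at once, so the kernels
# can learn the 3D shape of a nodule instead of looking at one 2D slice at a time.
model = models.Sequential([
    layers.Conv3D(16, (3, 3, 3), activation='relu',
                  input_shape=(64, 64, 64, 1)),      # depth x height x width x channels
    layers.MaxPooling3D((2, 2, 2)),
    layers.Conv3D(32, (3, 3, 3), activation='relu'),
    layers.MaxPooling3D((2, 2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),           # cancer / no cancer
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])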
WATERSHED
This is a region-based technique that utilizes image morphology. It requires the selection of at least one
marker interior to each object of the image, including the background as a separate object. It is a
classical segmentation algorithm used for separating different objects in an image, and its
transform is defined on a grayscale image. For better segmentation we integrate a Sobel filter with the
watershed algorithm. It also removes the external layers of the lungs, where a lung filter with
morphological operations and morphological gradients provides better segmented lungs. We tried
different models like Sequential_1, Sequential_2 and VGG16-Net. VGG-Net gives more appreciable
results in object classification than any other model. It follows an arrangement of convolution and max
pooling layers consistently throughout the whole architecture, where the pooling operation calculates the
maximum value in each patch of each feature map. The VGG net learns to extract the features that can
distinguish the objects and is used to classify unseen objects.
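A minimal sketch of this idea is shown below, assuming the slice is given in Hounsfield units; the marker thresholds and structuring element are simplified assumptions rather than the exact values used in the implementation chapter.

import numpy as np
from scipy import ndimage
from skimage import filters, segmentation

def watershed_lungs(ct_slice):
    # The Sobel gradient acts as the relief map that the watershed floods
    gradient = filters.sobel(ct_slice)

    # Crude markers: very dark voxels are treated as lung/air, bright ones as body
    markers = np.zeros_like(ct_slice, dtype=np.int32)
    markers[ct_slice < -400] = 1      # inside the lungs (assumed HU threshold)
    markers[ct_slice > 0] = 2         # surrounding tissue

    labels = segmentation.watershed(gradient, markers)
    lung_mask = labels == 1

    # Morphological closing smooths the lung filter, as described above
    return ndimage.binary_closing(lung_mask, structure=np.ones((5, 5)))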
LBP
Another algorithm we used is LBP (local binary pattern), a simple yet very efficient texture
operator which labels the pixels of an image by thresholding the neighborhood of each pixel and
considering the result as a binary number. The LBP operator transforms an image into an array or an image
of integer labels describing the small-scale appearance of the image.
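A minimal sketch with scikit-image is given below; the radius, number of neighbours and the random stand-in slice are illustrative assumptions.

import numpy as np
from skimage.feature import local_binary_pattern

radius = 1
n_points = 8 * radius

image = (np.random.rand(64, 64) * 255).astype(np.uint8)   # stand-in for one grayscale CT slice

# Each pixel is labelled by thresholding its circular neighbourhood against
# the centre pixel; the result is an image of integer texture codes.
lbp = local_binary_pattern(image, n_points, radius, method='uniform')

# The histogram of the codes is the texture feature vector for this slice
hist, _ = np.histogram(lbp.ravel(), bins=np.arange(0, n_points + 3), density=True)
print(hist)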
AUTO ENCODERS
Next, we use the autoencoder: an autoencoder is a type of artificial neural network used
to learn efficient codings of unlabeled data. The encoding is validated and refined by attempting to
regenerate the input from the encoding. Autoencoders can be used for image denoising, image
compression, and in some cases even the generation of image data.
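A minimal sketch of a dense autoencoder is shown below; the flattened 64x64 input size and layer widths are assumptions for illustration.

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(64 * 64,))
encoded = layers.Dense(128, activation='relu')(inputs)            # encoder: compress to a small code
decoded = layers.Dense(64 * 64, activation='sigmoid')(encoded)    # decoder: rebuild the input

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# Training uses the input as its own target, e.g.
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=32)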
FLATTENING
Flattening converts the data into a 1D array for input to the next layer. We flatten the
output of the convolutional layers to create a single long feature vector, which is connected to the final
classification model, called the fully connected layer.
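The sketch below illustrates this in Keras; the 64x64 input and filter count are assumed values, chosen only to show how the stacked feature maps become one long vector.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    layers.MaxPooling2D((2, 2)),             # 31 x 31 x 32 feature maps at this point
    layers.Flatten(),                        # -> one vector of 31*31*32 = 30752 values
    layers.Dense(1, activation='sigmoid'),   # fully connected classification head
])
model.summary()                              # the summary shows the flattened shape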
POOLING
A pooling layer is another building block of a CNN. Its function is to progressively reduce the spatial
size of the representation in order to reduce the number of parameters and the computation in the network. The
pooling layer operates on each feature map independently; the most common approach is max pooling. It is
used in CNNs to consolidate the features learned by the convolutional layer's feature maps, and it
basically helps reduce overfitting during model training by compressing the
features in the feature map.
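A tiny numeric sketch of 2x2 max pooling is given below; the 4x4 feature map is made up for illustration.

import numpy as np
import tensorflow as tf

# Each 2x2 patch of the feature map is reduced to its maximum value,
# halving the spatial size (4x4 -> 2x2) while keeping the strongest activations.
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 2, 9, 1],
                        [3, 1, 4, 8]], dtype=np.float32).reshape(1, 4, 4, 1)

pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(feature_map)
print(pooled.numpy().reshape(2, 2))
# [[6. 5.]
#  [7. 9.]]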
U-NET
U-Net is a CNN that was developed for biomedical image segmentation. The network is based
on the fully convolutional network, and its architecture was modified and extended to work with
fewer training images and to yield more precise segmentations. The main idea is to supplement a
usual contracting network with successive layers in which pooling operations are replaced by
upsampling operators. One important modification in U-Net is that there are a large number of feature
channels in the upsampling part, which allow the network to propagate context information to higher
resolution layers.
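The sketch below shows the U-Net idea in miniature; the sizes and the single skip connection are illustrative assumptions, not the full architecture used here.

from tensorflow.keras import layers, models

def tiny_unet(input_shape=(128, 128, 1)):
    inputs = layers.Input(input_shape)

    # Contracting path: extract features and reduce resolution
    c1 = layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, padding='same', activation='relu')(p1)

    # Expanding path: upsample and concatenate the high-resolution encoder features
    u1 = layers.UpSampling2D()(c2)
    u1 = layers.concatenate([u1, c1])              # skip connection
    c3 = layers.Conv2D(16, 3, padding='same', activation='relu')(u1)

    outputs = layers.Conv2D(1, 1, activation='sigmoid')(c3)   # per-pixel lung/nodule mask
    return models.Model(inputs, outputs)

model = tiny_unet()
model.compile(optimizer='adam', loss='binary_crossentropy')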
CHAPTER 4
SYSTEM METHODOLOGY
4.1 CT SCAN IMAGES
A computerized tomography (CT) scan uses computers and rotating X-ray machines to create
cross-sectional images of the body. These images provide more detailed information than normal X-
ray images. They can show the soft tissues, blood vessels, and bones in various parts of the body. A
CT scan may be used to visualize the head, shoulders, spine, heart, abdomen, knee, chest.
During a CT scan, you lie in a tunnel-like machine while the inside of the machine rotates and takes
a series of X-rays from different angles. These pictures are then sent to a computer, where they’re
combined to create images of slices, or cross-sections, of the body. They may also be combined to
produce a 3D image of a particular area of the body.
It’s very important to stay still while CT images are being taken because movement can result in
blurry pictures. Your doctor may also ask to hold your breath for a short period during the test to
prevent your chest from moving up and down.
4.1.3 What do CT Scan Results Mean?
CT scan results are considered normal if the radiologist didn’t see any tumors, blood clots,
fractures, or other abnormalities in the images. If any abnormalities are detected during the CT scan,
you may need further tests or treatments, depending on the type of abnormality found.
4.3 FEATURE DETECTION
Feature extraction is a general term for methods of constructing combinations of the variables to get
around the problems that arise when analyzing data with a large number of variables, while still describing
the data with sufficient accuracy. Many machine
learning practitioners believe that properly optimized feature extraction is the key to effective
model construction. The biggest advantage of Deep Learning is that we do not need to manually
extract features from the image. The network learns to extract features while training. You just feed
the image to the network (pixel values). When the input data to an algorithm is too large to be
processed and it is suspected to be redundant (e.g., the same measurement in both feet and meters,
or the repetitiveness of images presented as pixels), then it can be transformed into a reduced set of
features (also named a feature vector). Determining a subset of the initial features is called feature
selection.[2] The selected features are expected to contain the relevant information from the input
data, so that the desired task can be performed by using this reduced representation instead of the
complete initial data.
Feature extraction involves reducing the number of resources required to describe a large set of
data. When performing analysis of complex data, one of the major problems stems from the number
of variables involved. Analysis with a large number of variables generally requires a large amount
of memory and computation power; it may also cause a classification algorithm to overfit the training
samples and generalize poorly to new samples.
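As a minimal sketch of this idea, the activations of an intermediate layer of a trained network can be reused as an automatically learned feature vector; the small model below is built from scratch for illustration, so its layer names, sizes and the stand-in slices are assumptions rather than the project's actual classifier.

import numpy as np
from tensorflow.keras import layers, models

# A small stand-in CNN; in practice this would be the trained classifier from Chapter 5
cnn = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(name='features'),
    layers.Dense(1, activation='sigmoid'),
])

# Reuse everything up to the Flatten layer as an automatic feature extractor
feature_extractor = models.Model(inputs=cnn.inputs,
                                 outputs=cnn.get_layer('features').output)

slices = np.random.rand(8, 64, 64, 1)          # stand-in for 8 CT slices
features = feature_extractor.predict(slices)   # learned feature vectors
print(features.shape)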
IMAGE PROCESSING
Algorithms are used to detect features such as shapes, edges, or motion in a digital image or
video.
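For example, a simple Sobel edge filter highlights shape boundaries such as the lung wall in a CT slice; this is a minimal sketch, and 'slice.png' is a placeholder file name.

import numpy as np
from skimage import filters, io

ct_slice = io.imread('slice.png', as_gray=True)   # placeholder input image
edges = filters.sobel(ct_slice)                   # edge-strength map
io.imsave('edges.png', (edges * 255 / edges.max()).astype(np.uint8))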
CHAPTER 5
IMPLEMENTATION
5.1 CODE
Lung Cancer Convolutional Network:
from google.colab import drive
drive.mount('/content/drive')
import SimpleITK as sitk
import numpy as np
import pandas as pd
import os
import glob
from PIL import Image  # used later to save extracted nodule patches
%matplotlib inline
from IPython.display import clear_output
pd.options.mode.chained_assignment = None
annotations = pd.read_csv("/content/drive/MyDrive/annotations.csv")
candidates =pd.read_csv("/content/drive/MyDrive/candidates.csv")
annotations.head()
candidates.info()
print(len(candidates[candidates['class'] == 1]))
print(len(candidates[candidates['class'] == 0]))
import multiprocessing
num_cores = multiprocessing.cpu_count()
print(num_cores)
class CTScan(object):
    def __init__(self, filename=None, coords=None):
        self.filename = filename
        self.coords = coords
        self.ds = None
        self.image = None

    def reset_coords(self, coords):
        self.coords = coords

    def read_mhd_image(self):
        # Read the .mhd scan referenced by this candidate from Google Drive
        path = glob.glob("/content/drive/MyDrive/seg-lungs-LUNA16-20220103T180646Z-001/"
                         + self.filename + '.mhd')
        self.ds = sitk.ReadImage(path[0])
        self.image = sitk.GetArrayFromImage(self.ds)
    def get_resolution(self):
        return self.ds.GetSpacing()

    def get_origin(self):
        return self.ds.GetOrigin()

    def get_ds(self):
        return self.ds

    def get_voxel_coords(self):
        # Convert the world (mm) coordinates into voxel indices using origin and spacing
        origin = self.get_origin()
        resolution = self.get_resolution()
        voxel_coords = [np.absolute(self.coords[j] - origin[j]) / resolution[j]
                        for j in range(len(self.coords))]
        return tuple(voxel_coords)

    def get_image(self):
        return self.image

    def get_subimage(self, width):
        # Crop a width x width patch around the candidate location on its axial slice
        self.read_mhd_image()
        x, y, z = self.get_voxel_coords()
        x, y, z = int(x), int(y), int(z)
        subImage = self.image[z, y - width // 2:y + width // 2, x - width // 2:x + width // 2]
        return subImage

    def normalizePlanes(self, npzarray):
        # Clip Hounsfield units to [-1000, 400] and rescale to the [0, 1] range
        maxHU = 400.
        minHU = -1000.
        npzarray = (npzarray - minHU) / (maxHU - minHU)
        npzarray[npzarray > 1] = 1.
        npzarray[npzarray < 0] = 0.
        return npzarray

    def save_image(self, filename, width):
        # Save the normalized patch as an 8-bit grayscale image
        image = self.get_subimage(width)
        image = self.normalizePlanes(image)
        Image.fromarray(image * 255).convert('L').save(filename)
positives = candidates[candidates['class']==1].index
negatives = candidates[candidates['class']==0].index
scan = CTScan(np.asarray(candidates.iloc[negatives[600]])[0], \
np.asarray(candidates.iloc[negatives[600]])[1:-1])
scan.read_mhd_image()
x, y, z = scan.get_voxel_coords()
image = scan.get_image()
dx, dy, dz = scan.get_resolution()
x0, y0, z0 = scan.get_origin()
filename = '1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793540579077826395208'
coords = (70.19, -140.93, 877.68)#[877.68, -140.93, 70.19]
scan = CTScan(filename, coords)
scan.read_mhd_image()
x, y, z = scan.get_voxel_coords()
image = scan.get_image()
dx, dy, dz = scan.get_resolution()
x0, y0, z0 = scan.get_origin()
positives
np.random.seed(42)
negIndexes =np.random.choice(negatives,len(positives)*5,replace=False)
candidatesDf = candidates.iloc[list(positives)+list(negIndexes)]
from sklearn.model_selection import train_test_split
X = candidatesDf.iloc[:,:-1]
y = candidatesDf.iloc[:,-1]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state = 42)
X_train.size
y_train
y_test
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.20, random_state=42)
X_train.size
X_train
y_train
len(X_train)
X_train.to_pickle('/content/drive/My Drive/preprocessed_data/traindata')
X_test.to_pickle('/content/drive/My Drive/preprocessed_data/testdata')
X_val.to_pickle('/content/drive/My Drive/preprocessed_data/valdata')
def normalizePlanes(npzarray):
    maxHU = 400.
    minHU = -1000.
    npzarray = (npzarray - minHU) / (maxHU - minHU)
    npzarray[npzarray > 1] = 1.
    npzarray[npzarray < 0] = 0.
    return npzarray
print('number of positive cases are ' + str(y_train.sum()))
print('total set size is ' + str(len(y_train)))
print('percentage of positive cases are'+str(y_train.sum()*1.0/len(y_train)))
# Duplicate the positive (nodule) rows twice with shifted indices to reduce class imbalance
tempDf = X_train[y_train == 1]
tempDf = tempDf.set_index(X_train[y_train == 1].index + 1000000)
X_train_new = X_train.append(tempDf)
tempDf = tempDf.set_index(X_train[y_train == 1].index + 2000000)
X_train_new = X_train_new.append(tempDf)
5.2 LUNG_CANCER_2
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pydicom as dicom
import os
import scipy.ndimage
import matplotlib.pyplot as plt
from skimage import measure, morphology, segmentation
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import scipy.ndimage as ndimage
from google.colab import drive
drive.mount('/content/drive')
# Some constants
INPUT_FOLDER = '/content/drive/My Drive/lungcancer/stage1_train'
patients = os.listdir(INPUT_FOLDER)
patients.sort()
print(len(patients))
#patients.remove('.DS_Store')
# Load the scans in given folder path
def load_scan(path):
    # Read every DICOM slice in the folder, skipping macOS metadata files
    ds = []
    for s in os.listdir(path):
        if s != '.DS_Store':
            ds.append(s)
    slices = [dicom.read_file(path + '/' + s, force=True) for s in ds]
    slices.sort(key=lambda x: int(x.InstanceNumber))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
    for s in slices:
        s.SliceThickness = slice_thickness
    return slices
def get_pixels_hu(scans):
    image = np.stack([s.pixel_array for s in scans])
    # Convert to int16 (from sometimes int16),
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)
    # Convert the raw pixel values to Hounsfield units using the DICOM rescale tags
    intercept = scans[0].RescaleIntercept
    slope = scans[0].RescaleSlope
    if slope != 1:
        image = slope * image.astype(np.float64)
        image = image.astype(np.int16)
    image += np.int16(intercept)
    return np.array(image, dtype=np.int16)

def plot_3d(p, threshold=400):
    # 3D rendering of the resampled lung volume (mesh-building body omitted here)
    ...
    ax.set_xlim(0, p.shape[0])
    ax.set_ylim(0, p.shape[1])
    ax.set_zlim(0, p.shape[2])
    return plt.show()

plot_3d(pix_resampled, 400)
def largest_label_volume(im, bg=-1):
    # Return the label (excluding the background) that occupies the most voxels
    vals, counts = np.unique(im, return_counts=True)
    counts = counts[vals != bg]
    vals = vals[vals != bg]
    if len(counts) > 0:
        return vals[np.argmax(counts)]
    return None

# Pick the pixel in the very corner to determine which label is air.
# Improvement: Pick multiple background labels from around the patient
# More resistant to "trays" on which the patient lays cutting the air
# around the person in half
background_label = labels[0, 0, 0]
Watershed Algorithm
#Watershed algorithm
watershed = segmentation.watershed(sobel_gradient, marker_watershed)
blackhat_struct = ndimage.iterate_structure(blackhat_struct, 8)
#Perform the Black-Hat
outline += ndimage.black_tophat(outline, structure=blackhat_struct)
#Use the internal marker and the Outline that was just created to generate the lungfilter
lungfilter = np.bitwise_or(marker_internal, outline)
#Close holes in the lungfilter
#fill_holes is not used here, since in some slices the heart would be reincluded by accident
lungfilter = ndimage.morphology.binary_closing(lungfilter, structure=np.ones((7,7)), iterations=3)
#Apply the lungfilter (note the filtered areas being assigned -2000 HU)
segmented = np.where(lungfilter == 1, image, -2000*np.ones((512, 512)))
#### nodule
lung_nodule_1 = np.bitwise_or(marker_internal, image)
lung_nodule = np.where(lungfilter == 1, lung_nodule_1, np.zeros((512, 512)))
#Some Testcode:
(test_segmented, lung_nodule, test_lungfilter, test_outline, test_watershed, test_sobel_gradient,
 test_marker_internal, test_marker_external, test_marker_watershed) = seperate_lungs(test_patient_images[100])
# Step 1 - Convolution
classifier.add(Conv2D(32, (3, 3), input_shape = (512, 512, 1), activation = 'relu'))
# Step 2 - Pooling
classifier.add(MaxPooling2D(pool_size = (2, 2)))
# Adding Convolution
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
# Step 3 - Flattening
classifier.add(Flatten())
# Step 4 - Full connection
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dropout(0.5))
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 1, activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.summary()
import keras as k
import time
NAME = "test_1-{}".format(int(time.time()))
callbacks = [
# k.callbacks.EarlyStopping(patience=3, monitor='val_loss'),
k.callbacks.TensorBoard(log_dir='logs\{}'.format(NAME)),
k.callbacks.ModelCheckpoint('test_model_1.h5', save_best_only=True)]
hist_1 = classifier.fit(aug_train.flow(trainX, trainY, batch_size=32),
                        steps_per_epoch=100, epochs=50, verbose=1,
                        validation_data=(testX, testY), callbacks=callbacks)
aug_train.fit(trainX)
classifier_2 = Sequential()
# Step 1 - Convolution
classifier_2.add(Conv2D(32, (3, 3), input_shape = (512, 512, 1), activation = 'relu'))
# Step 2 - Pooling
classifier_2.add(MaxPooling2D(pool_size = (2, 2)))
# Step 3 - Flattening
classifier_2.add(Flatten())
data_preview.shape
test_image = data_preview[50]
test_image.shape
test_image.dtype
plt.imshow(test_image)
plt.imshow(test_image, cmap = 'gray')
import skimage
image = skimage.color.gray2rgb(test_image)
plt.imshow(image)
image.shape
import cv2
import numpy as np
img = np.array(test_image, dtype=np.uint8)
color_img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
plt.imshow(color_img)
color_img.shape
plt.imshow(data_preview[50], cmap='Accent')
labels[50]
plt.imshow(data_preview[100], cmap='Accent')
CHAPTER 6
EXPERIMENTAL RESULTS
Lung Cancer Convolution Outputs
Lung_Cancer2 Output
CHAPTER 7
TESTING
7.1 SOFTWARE TESTING
Testing: Testing is a manner of executing software with the purpose of locating errors. To
make our software perform properly it must be error free. If testing is carried out
efficiently, it will eliminate all of the mistakes from the software.
Integration Testing
Alpha Testing
Beta Testing
Alpha Testing
A type of testing of a software product or system performed at the developer's site. Usually, it is
carried out by the end users.
Beta Testing
Final testing before releasing the application for commercial purposes. It is usually carried out
by end users or others.
Performance Testing
Testing carried out to assess the compliance of a system or component with specified
performance requirements. It is commonly carried out by the performance
engineer.
CHAPTER 8
CONCLUSION AND FUTURE SCOPE
8.1 CONCLUSION
We evaluated different approaches that used various machine learning techniques. Within the
approaches described above, we explored three different methods to handle the labels: averaging
nodule encodings per patient, labeling nodules with the same label given to the patient, or, finally,
to not do nodule detection at all and opt instead for a 3D model on the raw images. Thus, we have
concluded from the explored approaches that the best one is the first approach, with the most
efficiency and the fewest drawbacks.
the model and minimizing the losses effectively. We can move ahead with the deployment
process, making this software ground-breaking for the medical industry and enhancing the
health care system, thus causing a massively beneficial effect on the lives of many individuals.
REFERENCES
[1] D Shiloh, Elizabeth, H. Khanna, C. Sunil A novel segmentation approach for improving diagnostic
accuracy of CAD systems for detecting lung cancer from chest computed tomography images,
ACM.2012.
[2] Jong won Kim, Hojun Lee, Taeson Yoon. Automated Diagnosis of Lung Cancer with the Use of
Deep Convolutional Neural Networks on Chest CT ACM.2017.
[3] Nadas El-Askary, Mohammed AM. Saleem, Mohammed I. Roushdy Feature Extraction and Analysis
for Lung Nodule Classification using Random Forest. ACM.2019.
[4] Xinyan li, Shaeting Feng, Daru pan. Enhanced lung segmentation in chest CT images based on kernel
graph cuts, ACM.2016.
[5] Shuang feng Dai, Ke Lu a , Jiayang Dong, Yifei Zhang , Yong Chen. A novel approach of lung
segmentation on chest CT images using graph cuts. ACM.2015.
[6] Yuan Huang, Fugen Zhu Lung Segmentation Using a Fully Convolutional Neural Network with
Weekly Supervision, ACM.2018
[7] Zhengwe Hui, Ajim Muhammad, Ming Zhu. Pulmonary Nodule Detection in CT Images via Deep
Neural Network: Nodule Candidate Detection. ACM.2018
[8] Paulo Cezar P. Carvalho, Aristfanes c silva, Marcelo Gattas, Diagnosis of Lung Nodule Using Gini
Coefficient and Skeletonization in Computerized Tomography Images, ACM.2004
[9] Maxine Tan, Rudi Deckles, Jan P Cornelis, Bart Jarens. Analysis of a Feature- Deselective
Neuroevolutionary Classifier (FD-NEAT) in a Computer-Aided Lung Nodule Detection System for
CT Images. ACM.2012.
[10] Ying Su , Dan Li , Xiaodong Chen , Lung Nodule Detection based on Faster R-CNN Framework,
Computer Methods and Programs in Biomedicine (2020).
[11] Lakshmanaprabu S.K., et al., Optimal deep learning model for classification of lung cancer on CT
images, Future Generation Computer Systems (2018).
[12] M. Attique Khan, S. Rubab, Asifa Kashif, Muhammad Imran Sharif, Nazeer Muhammad, Jamal
Hussain Shah, Yu-Dong Zhang, Suresh Chandra Satapathy. Lungs cancer classification
from CT images: An integrated design of contrast based classical features fusion and selection. (2019)
[13] K. Mya, M. Tun, and A. S. Khaing, “Feature Extraction and Classification of Lung Cancer Nodule
using Image Processing Technique,” Int. J. Eng. Res. Technol., vol. 3, no. 3, pp. 2204–2211, 2014.
[14] Wariya Chintanapakdee, Dexter P. Mendoza, Eric W. Zhang, Ariel Botwin, Matthew D. Gilman,
Justin F. Gainor, Jo-Anne O. Shepard, Subba R. Digumarthy. Detection of Extrapulmonary Malignancy
During Lung Cancer Screening: 5-Year Analysis at a Tertiary Hospital. (2020)
[15] Marjolein A Heuvelmans, Matthijs Oudkerk Deep learning to stratify lung nodules on annual
follow-up CT.(2019)
[16] Marjolein A. Heuvelmans, Peter M.A. van Ooijen, Sarim Ather, Carlos Francisco Silva, Daiwei Han,
Claus Peter Heussel, William Hickes, Hans-Ulrich Kauczor, Petr Novotny, Heiko Peschl, Mieneke Rook,
Roman Rubtsov, Oyunbileg von Stackelberg, Maria T. Tsakok, Carlos Arteta, Jerome Declerck, Timor Kadir,
Lyndsey Pickup, Fergus Gleeson, Matthijs Oudkerk. (2021)
[17] Anna Meldo, Lev Utkin, Maxim Kovalev, Ernest Kasimov. The natural language
explanation algorithms for the lung cancer computer-aided diagnosis system. (2020)
[18] Lei Fan, Zhaoqiang Xia, Xiaobiao Zhang, Xiaoyi Feng. Lung Nodule Detection Based on 3D
Convolutional Neural Networks LCD. (2019).
[19] Goran Jakimovski Danco Davcev Lung cancer medical image recognition using Deep Neural
Networks LCD. (2018).
[20] Brahim Ait Skourt, Nikola S. Nikolov. Feature-Extraction Methods for Lung-Nodule Detection:
A Comparative Deep Learning Study. IEEE. (2019).
[21] Ruchita tekade, Rajeshwari K. Lung Cancer detection and Classification using Deep Learning 2018
Fourth International Conference on Computing Communication Control and Automation
(ICCUBEA)
[22] Bohdon Chapaliuk, Yuriy Zaychenko. Deep learning approach in a computer-aided detection system
for lung cancer. IEEE. (2018)
[23] Wadood Abdul, An Automatic Lung Cancer Detection and Classification (ALCDC) System Using
Convolutional Neural Network 2020 IEEE 13th International Conference on Developments in
eSystems Engineering (DeSE)
[24] Hongyang Jiang, He Ma, Wei Qian, Mengdi Gao and Yan Li. An Automatic Detection System of
Lung Nodule Based on Multi-Group Patch-Based Deep Learning Network.
DOI: 10.1109/JBHI.2017.2725903, IEEE Journal of Biomedical and Health Informatics
[25] Onur Ozdemir, Rebecca L. Russell, Andrew A. Berlin. A 3D Probabilistic Deep Learning System for
Detection and Diagnosis of Lung Cancer Using Low-Dose CT Scans. IEEE Transactions on Medical
Imaging, vol. 39, no. 5, May 2020.