Lung Cancer Detection
We would also like to thank Dr. Aman Kumar for giving us the opportunity to work under their guidance and blessing.
Lastly, we would like to thank our families and friends for their kind support. We are grateful to the Lord Almighty, who has
showered His grace upon us during this period.
Table of Contents
1. Introduction
2. Visualization of Dataset
3. Watershed Algorithm
4. Proposed Models
5. Transfer Learning: VGG16-Net
6. Conclusion and Results
7. Future Work
8. References
INTRODUCTION
Lung cancer is one of the deadliest cancers worldwide, but early detection significantly improves the survival rate.
Pulmonary nodules are small growths of cells inside the lung and may be cancerous (malignant) or noncancerous (benign).
Detecting malignant lung nodules at an early stage is crucial for the prognosis.
Early-stage cancerous lung nodules closely resemble noncancerous nodules, so a differential diagnosis has to rely on
slight morphological changes, locations, and clinical biomarkers. The challenging task is to estimate the probability of
malignancy for early-stage cancerous lung nodules. Physicians combine several diagnostic procedures for the early
diagnosis of malignant lung nodules, such as clinical settings, computed tomography (CT) scan analysis (morphological
assessment), positron emission tomography (PET) (metabolic assessment), and needle prick biopsy analysis.
The input to the models consists of lung nodule CT images, which are used at every step of the project. The source of the
images is the LUNA16 dataset.
The LUNA16 dataset is a subset of the LIDC-IDRI dataset in which the heterogeneous scans are filtered by several criteria.
Since pulmonary nodules can be very small, thin slices are required. Scans with a slice thickness greater than 2.5 mm were
therefore discarded.
VISUALIZATION OF DATASET
Visualizing the dataset is an important part of training because it gives a better understanding of the data. CT scan images,
however, are stored as DICOM files, which an ordinary PC image viewer or web browser cannot display. We therefore use the
pydicom library, which returns the image array together with the metadata stored in each CT image, such as the patient's name,
patient's ID, patient's birth date, image position, image number and doctor's name.
(Fig. 3. Small sample of the metadata contained in a single DICOM slice)
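The reading step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names (`load_slice`, `slice_metadata`, `to_hounsfield`) and the choice of header fields are assumptions, and the file path in the usage note is hypothetical. `pydicom.dcmread`, `pixel_array` and the DICOM keywords used are standard.

```python
import numpy as np


def load_slice(path):
    """Read one DICOM file from disk (requires the pydicom package)."""
    import pydicom  # imported here so the helpers below also work without pydicom
    return pydicom.dcmread(path)


def slice_metadata(ds, fields=("PatientID", "PatientBirthDate",
                               "ImagePositionPatient", "InstanceNumber",
                               "SliceThickness")):
    """Collect a few header fields; fields missing from the file map to None."""
    return {name: getattr(ds, name, None) for name in fields}


def to_hounsfield(ds):
    """Convert stored pixel values to Hounsfield units with the rescale tags."""
    # Assumption: a missing slope/intercept means the data is already in HU.
    slope = float(getattr(ds, "RescaleSlope", 1.0))
    intercept = float(getattr(ds, "RescaleIntercept", 0.0))
    return ds.pixel_array.astype(np.float32) * slope + intercept
```

With a real file, `ds = load_slice("path/to/slice.dcm")` followed by `slice_metadata(ds)` would list the header fields, and the array returned by `to_hounsfield(ds)` could be shown with any plotting library using a grey colour map.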
WATERSHED ALGORITHM
The watershed is a classical algorithm used for segmentation, that is, for separating different objects in an image.
Starting from user-defined markers, the watershed algorithm treats pixel values as a local topography (elevation). The
algorithm floods basins from the markers until basins attributed to different markers meet on watershed lines. In many cases,
markers are chosen as local minima of the image, from which basins are flooded.
First, we extract internal and external markers from the CT scan images with the help of binary dilations and combine them on
a completely dark image to form the watershed marker. This removes external noise from the image and gives a watershed marker
for the lungs and cancer cells. As the figure below shows, the watershed marker removes external noise and applies a binary
mask to the image; black pixels inside the lungs represent cancer cells.
(Fig. 4. Different markers extracted from a CT scan image using the watershed algorithm; Fig. 5. Visualization of the image segmentation process)
Fig. 4 shows the image segmented using the watershed algorithm.
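The marker extraction described above can be sketched with NumPy and SciPy. This is a minimal sketch under stated assumptions, not the project's exact code: the air threshold of -400 HU, the dilation counts, the rule of keeping the two largest regions and the marker values 255 and 128 are all illustrative choices, and the input is assumed to be a 2D slice in Hounsfield units.

```python
import numpy as np
from scipy import ndimage as ndi


def generate_markers(hu_slice, air_threshold=-400, inner_iter=10, outer_iter=55):
    """Return (internal, external, combined) watershed markers for one CT slice."""
    # Internal marker: air-like pixels, since the lungs are mostly air.
    internal = hu_slice < air_threshold

    # Remove regions touching the image border (air surrounding the patient).
    labels, _ = ndi.label(internal)
    border = np.unique(np.concatenate(
        [labels[0], labels[-1], labels[:, 0], labels[:, -1]]))
    internal &= ~np.isin(labels, border[border > 0])

    # Keep only the two largest remaining regions (assumed to be the lungs).
    labels, count = ndi.label(internal)
    if count > 2:
        sizes = np.bincount(labels.ravel())[1:]
        internal = np.isin(labels, np.argsort(sizes)[-2:] + 1)

    # External marker: a ring between two binary dilations of the internal marker.
    near = ndi.binary_dilation(internal, iterations=inner_iter)
    far = ndi.binary_dilation(internal, iterations=outer_iter)
    external = far ^ near

    # Combine both markers on an all-zero (dark) image.
    combined = np.zeros(hu_slice.shape, dtype=np.int32)
    combined[internal] = 255
    combined[external] = 128
    return internal, external, combined
```

The ring-shaped external marker leaves an unlabeled band around the lungs, which is the region the watershed flooding then has to divide between the two markers.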
For better segmentation we combine a Sobel filter with the watershed algorithm, which removes the external layer of the
lungs. After removing the outer layer, we use the internal marker and the outline that was just created to generate the lung
filter with NumPy's bitwise_or operation. This also removes the heart from the CT scan images. The next step is to close off
the lung filter with morphological operations and morphological gradients. This gives better-segmented lungs than the previous
process, as shown in Fig. 5 above.
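The refinement described above can be sketched as follows, assuming scikit-image is available for the flooding step. The function name `lung_filter`, the 3×3 gradient window, the 5×5 closing element and the iteration count are assumptions made for illustration. The inputs are a slice in Hounsfield units, a boolean internal marker and an integer marker image whose non-zero values label the seeds.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed


def lung_filter(hu_slice, internal, markers, closing_iter=3):
    """Build a boolean lung mask from a Sobel gradient and marker-based watershed."""
    # Sobel gradient magnitude acts as the elevation map for the flooding.
    image = hu_slice.astype(float)
    gradient = np.hypot(ndi.sobel(image, axis=1), ndi.sobel(image, axis=0))

    # Flood from the markers; zero-valued pixels in `markers` are unlabeled.
    flooded = watershed(gradient, markers)

    # The morphological gradient of the label image is non-zero only on
    # the boundaries between flooded regions, which gives the outline.
    outline = ndi.morphological_gradient(flooded, size=(3, 3)).astype(bool)

    # Combine the internal marker with the outline, then close small holes.
    mask = np.bitwise_or(internal, outline)
    return ndi.binary_closing(mask, structure=np.ones((5, 5)),
                              iterations=closing_iter)
```

A segmented slice can then be obtained with `np.where(mask, hu_slice, background)`, where `background` is any value chosen to represent masked-out pixels.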
PROPOSED MODELS
The proposed model is a convolutional neural network approach based on lung segmentation of CT scan images. We first
preprocess the LUNA16 dataset and then segment the lungs with the watershed algorithm. The watershed algorithm highlights
the lung region and produces binary masks for the lungs using a semantic segmentation approach.
We tried three convolutional neural network models, chosen on the basis of a comparative study of how each type of model
performs on different datasets and on different classification problems.
Our first model, “Sequential_1”, is a basic approach that uses convolution layers with max pooling and dropout in the middle
layers, followed by flatten and fully connected layers. This kind of model performs well on the number classification problem.
Summary of model is given below:
Our second model, “Sequential_2”, is a deep convolutional neural network with max pooling and fully connected layers at
the end. Models with this configuration of layers and units have performed well in many research papers on different
datasets.
Summary of model is given below:
Our third and last approach uses transfer learning on VGG-16, with some changes to the last three layers, which are fully
connected. This model gives good results in object classification.
Summary of model is given below:
Important information regarding model training:
After generating the binary lung segmentation masks, we train the models on the segmented lungs with a batch size of 32 for
the image data generator, using 100 images in each epoch for 30 epochs, except for VGG-16, which uses 500 images in each
epoch. The input images have the shape (512,512,1) for the first two models and (512,512,3) for VGG-16. For better results,
data augmentation is applied during training, including shear range, zoom range, horizontal flip, rotation range and centre
shift. The output layer is a single node for binary classification, since we want to distinguish between cancerous and
non-cancerous lungs.
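The augmentation setup described above can be sketched with the Keras image data generator. The numeric ranges below are assumptions chosen for illustration, since the report does not state the values used, and `make_augmenter` is a hypothetical helper name. Note that `ImageDataGenerator` is marked as deprecated in recent TensorFlow releases, although it matches the workflow described here.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def make_augmenter():
    """Create a generator applying the augmentations named in the text."""
    # All numeric values are illustrative assumptions, not the project's settings.
    return ImageDataGenerator(
        shear_range=0.2,         # random shear
        zoom_range=0.2,          # random zoom in or out by up to 20%
        horizontal_flip=True,    # random left-right flip
        rotation_range=10,       # random rotation in degrees
        width_shift_range=0.1,   # random horizontal shift of the image centre
        height_shift_range=0.1,  # random vertical shift of the image centre
    )
```

Given image and label arrays `x` (shape `(n, 512, 512, 1)`) and `y`, batches of 32 could then be drawn with `make_augmenter().flow(x, y, batch_size=32)`.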
We also used callbacks from TensorFlow Keras to save the model with the best accuracy, so that we can run a complete 50-epoch
training session to plot the comparison graphs.
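Saving the best model during training can be sketched with the Keras `ModelCheckpoint` callback. The file name and the monitored metric (`val_accuracy`) are assumptions, since the report only says that the model with the best accuracy is saved.

```python
from tensorflow.keras.callbacks import ModelCheckpoint


def best_model_checkpoint(path="best_model.keras", monitor="val_accuracy"):
    """Return a callback that keeps only the best model seen so far."""
    # `path` and `monitor` are illustrative defaults, not the project's values.
    return ModelCheckpoint(
        filepath=path,
        monitor=monitor,
        mode="max",            # higher accuracy is better
        save_best_only=True,   # overwrite only when the metric improves
        verbose=1,
    )
```

The callback is passed to training as `model.fit(..., callbacks=[best_model_checkpoint()])`, so the full run can continue while the best weights stay on disk.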
TRANSFER LEARNING : VGG16-NET
VGG Net is a convolutional neural network (CNN) architecture introduced by Simonyan and Zisserman of the Visual Geometry
Group (VGG) at the University of Oxford in 2014. It was the first runner-up in the classification task of ILSVRC (ImageNet
Large Scale Visual Recognition Competition) 2014. VGG Net was trained on the ImageNet ILSVRC dataset, which contains images
of 1000 classes split into three sets: 1.3 million training images, 100,000 testing images and 50,000 validation images.
The model obtained 92.7% top-5 test accuracy on ImageNet. VGG Net has been successful in many real-world applications, such
as estimating heart rate from body motion and pavement distress detection.
VGG Net has learned to extract features that distinguish objects (it acts as a feature extractor), and it can be used to
classify unseen objects. VGG was designed to improve classification accuracy by increasing the depth of the CNN. VGG 16 and
VGG 19, with 16 and 19 weight layers respectively, have been used for object recognition. VGG Net takes 224×224 RGB images
as input and passes them through a stack of convolutional layers with a fixed filter size of 3×3 and a stride of 1. Five
max-pooling layers are interleaved with the convolutional layers to down-sample the input representation (image,
hidden-layer output matrix, etc.). The stack of convolutional layers is followed by three fully connected layers with 4096,
4096 and 1000 channels, respectively. The last layer is a soft-max layer. The figure below shows the VGG network structure.
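The effect of the five pooling stages on the input size can be checked with a short calculation. The sketch assumes the standard VGG16 settings: 3×3 convolutions padded so that they preserve the spatial size, 2×2 max pooling with stride 2, and 512 channels in the last convolutional block. The function names are illustrative.

```python
def feature_map_side(input_side, num_pools=5):
    """Spatial side length after the max-pooling stages (each halves the size)."""
    side = input_side
    for _ in range(num_pools):
        side //= 2
    return side


def flattened_features(input_side, channels=512):
    """Number of values fed to the first fully connected layer."""
    side = feature_map_side(input_side)
    return side * side * channels


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


print(feature_map_side(224), flattened_features(224))  # 7 25088
print(feature_map_side(512), flattened_features(512))  # 16 131072
print(dense_params(flattened_features(224), 4096))     # 102764544
```

A 512×512 input therefore reaches the fully connected layers with about five times as many features as a 224×224 input, so the original 4096-unit layer cannot be reused unchanged.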
In our approach, however, the images have the shape (512,512), so we build our own model using the VGG16 architecture.
We compile the model with the Adam optimizer at a learning rate of 0.0001, binary_crossentropy as the loss and accuracy
as the metric. The figure below shows the model summary with its convolution layers, max-pooling layers and parameter counts.
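One way to set up such a model is sketched below with the VGG16 convolutional base from `tensorflow.keras.applications` and a new fully connected head. This is an assumption-laden sketch rather than the project's model: the head sizes (256 and 64 units), the frozen base and the function name are illustrative. The optimizer, learning rate, loss and metric follow the text above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16


def build_vgg16_classifier(input_shape=(512, 512, 3), weights="imagenet",
                           learning_rate=1e-4):
    """VGG16 convolutional base with a new three-layer fully connected head."""
    base = VGG16(include_top=False, weights=weights, input_shape=input_shape)
    base.trainable = False  # assumption: only the new head is trained

    x = layers.Flatten()(base.output)
    x = layers.Dense(256, activation="relu")(x)         # assumed size
    x = layers.Dense(64, activation="relu")(x)          # assumed size
    outputs = layers.Dense(1, activation="sigmoid")(x)  # single binary output node

    model = models.Model(base.input, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Passing `weights=None` builds the same architecture with random initialization, which corresponds to training the VGG16 layout from scratch instead of transferring ImageNet weights.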
CONCLUSION AND RESULTS
After training the proposed models on the lung cancer dataset, we plot accuracy and loss against epochs and compare the
proposed models in a table on the basis of accuracy and loss.
3. ACO_SVM: 93.2% (link to paper)
4. ACO_ANN: 98.40% (link to paper)
From the study above we conclude that the model is not working as expected with the given dataset.
To increase the accuracy, more efficient data-preprocessing techniques need to be implemented both before and after the
image segmentation process. These should focus on efficiently dividing the data into cancerous and non-cancerous classes
and on making the dataset compatible with Python's computer vision library; otherwise, the algorithms will have to be
implemented on the dataset with self-defined functions.
A new data processing, training and classification pipeline also needs to be proposed to help the models predict the
data more accurately.
Current suggestions include using other ImageNet-pretrained transfer learning models available in Keras in addition to the
one proposed above, implementing feature extraction algorithms such as BRISK and SIFT from the computer vision library,
and integrating ML training methods.
REFERENCES
1. Bjerager M., Palshof T., Dahl R., Vedsted P., Olesen F. Delay in diagnosis of lung cancer in general practice. Br. J. Gen. Pract. 2006;56:863–868.
2. Nair M., Sandhu S.S., Sharma A.K. Cancer molecular markers: A guide to cancer detection and management. Semin. Cancer Biol. 2018;52:39–55. doi: 10.1016/j.semcancer.2018.02.002.
3. Silvestri G.A., Tanner N.T., Kearney P., Vachani A., Massion P.P., Porter A., Springmeyer S.C., Fang K.C., Midthun D., Mazzone P.J. Assessment of plasma proteomics biomarker’s ability to distinguish benign from malignant lung nodules: Results of the PANOPTIC (Pulmonary Nodule Plasma Proteomic Classifier) trial. Chest. 2018;154:491–500. doi: 10.1016/j.chest.2018.02.012.
4. Shi Z., Zhao J., Han X., Pei B., Ji G., Qiang Y. A new method of detecting pulmonary nodules with PET/CT based on an improved watershed algorithm. PLoS ONE. 2015;10:e0123694.
5. Lee K.S., Mayo J.R., Mehta A.C., Powell C.A., Rubin G.D., Prokop C.M.S., Travis W.D. Incidental Pulmonary Nodules Detected on CT Images: Fleischner 2017. Radiology. 2017;284:228–243.
6. Diederich S., Heindel W., Beyer F., Ludwig K., Wormanns D. Detection of pulmonary nodules at multirow-detector CT: Effectiveness of double reading to improve sensitivity at standard-dose and low-dose chest CT. Eur. Radiol. 2004;15:14–22.
7. Demir Ö., Çamurcu A.Y. Computer-aided detection of lung nodules using outer surface features. Bio-Med. Mater. Eng. 2015;26:S1213–S1222. doi: 10.3233/BME-151418.
8. Bogoni L., Ko J.P., Alpert J., Anand V., Fantauzzi J., Florin C.H., Koo C.W., Mason D., Rom W., Shiau M., et al. Impact of a computer-aided detection (CAD) system integrated into a picture archiving and communication system (PACS) on reader sensitivity and efficiency for the detection of lung nodules in thoracic CT exams. J. Digit. Imaging. 2012;25:771–781. doi: 10.1007/s10278-012-9496-0.
9. Al Mohammad B., Brennan P.C., Mello-Thoms C. A review of lung cancer screening and the role of computer-aided detection. Clin. Radiol. 2017;72:433–442. doi: 10.1016/j.crad.2017.01.002.
10. Nasrullah N., Sang J., Alam M.S., Mateen M., Cai B., Hu H. Automated lung nodule detection and classification using deep learning combined with multiple strategies.
11. Setio A.A.A., Traverso A., de Bel T., Berens M.S.N., van den Bogaard C., Cerello P., Chen H., Dou Q., Fantacci M.E., Geurts B., et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med. Image Anal. 2017;42:1–13. doi: 10.1016/j.media.2017.06.015.
12. Zhu W., Liu C., Fan W., Xie X. DeepLung: Deep 3D dual path nets for automated pulmonary nodule detection and classification; Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV); Lake Tahoe, NV, USA. 12–15 March 2018; pp. 673–681.
13. Ren S., He K., Girshick R., Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017;39:1137–1149. doi: 10.1109/TPAMI.2016.2577031.
14. Jiang H., Ma H., Qian W., Gao M., Li Y. An automatic detection system of lung nodules based on a multigroup patch-based deep learning network. IEEE J. Biomed. Heal. Inform. 2018;22:1227–1237. doi: 10.1109/JBHI.2017.2725903.
15. Masood A., Sheng B., Li P., Hou X., Wei X., Qin J., Feng D. Computer-Assisted Decision Support System in Pulmonary Cancer detection and stage classification on CT images. J. Biomed. Inform. 2018;79:117–128. doi: 10.1016/j.jbi.2018.01.005.
16. Gu Y., Lu X., Yang L., Zhang B., Yu D., Zhao Y., Gao L., Wu L., Zhou T. Automatic lung nodule detection using a 3D deep convolutional neural network combined with a multi-scale prediction strategy in chest CTs. Comput. Biol. Med. 2018;103:220–231. doi: 10.1016/j.compbiomed.2018.10.011.
17. Yu L., Dou Q., Chen H., Heng P.-A., Qin J. Multilevel contextual 3-D CNNs for false positive reduction in pulmonary nodule detection. IEEE Trans. Biomed. Eng. 2016;64:1558–1567.
18. Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q. Densely connected convolutional networks; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 2261–2269.
19. Chen Y., Li J., Xiao H., Jin X., Yan S., Feng J. Advances in Neural Information Processing Systems. NIPS; San Diego, CA, USA: 2017. Dual path networks; pp. 4467–4475.
20. Wang W., Li X., Lu T., Yang J. Mixed link networks. arXiv. 2018. arXiv:1802.01808.
21. Nasrullah N., Sang J., Alam M.S., Xiang H. Pattern Recognition and Tracking XXX. International Society for Optics and Photonics; Bellingham, WA, USA: 2019. Automated detection and classification for early stage lung cancer on CT images using deep learning; p. 27.
Link to the project folder on Google Drive:
Project_lung_cancer_binary_classification
Note: the link can be shared only with NITH-related Google Drive accounts.
THANK YOU