Classification of Lung Diseases Using Deep Learning Models
Matthew Zak
A Thesis
in
The Department
of
Computer Science and Software Engineering
Concordia University
September 2019
This is to certify that the thesis complies with the regulations of this University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:
Dr. Andrew Delong, Chair
Dr. Thomas Fevens, External Examiner
Dr. Ching Yee Suen, Examiner
Dr. Adam Krzyzak, Supervisor
Approved by
Lata Narayanan, Chair
Department of Computer Science and Software Engineering

Amir Asif, Dean
Faculty of Engineering and Computer Science

2019
Abstract

Classification of Lung Diseases Using Deep Learning Models

Master of Computer Science

by Matthew Zak

Although deep neural networks have proven successful in the medical field, they require large volumes of data, which are problematic to collect in this domain. Considering the task of pulmonary disease detection in chest X-Ray images using small datasets, we adapted three pre-trained deep learning models (VGG16, ResNet-50, and InceptionV3) and assessed them in the lung disease classification tasks. We also implemented class activation maps for our system. The analysis of class activation maps shows that not only does segmentation improve the results in terms of accuracy, but it also focuses the models on medically relevant areas of the lungs.
“Everything has been, everything has happened. And everything has already been
written about.”
Vysogota de Corvo
"There is nothing noble in being superior to your fellow man; true nobility is being superior to your former self."

Ernest Hemingway
Acknowledgements
I want to thank everyone who has had the tiniest input in this thesis. If it had
not been for you, I would probably have never made it.
Contents

Declaration of Authorship
Abstract
Acknowledgements

1 Introduction
  1.3 Applications
  1.4 Contributions
  1.5 Thesis structure

2 Related Work
  [...] images analysis
  2.4 Deep Learning Approaches in Chest X-Ray analysis
  2.5 Dataset
  2.6 Image Data Augmentation
  2.7 Evaluation
  2.8 Convolutional Neural Networks
  2.9 Summary

3 Transfer Learning in Lung Diseases Classification
  3.1.2 ImageNet
  3.2 VGG16
  3.4 Inception
  3.4.5 Architecture
  3.5 Experiments
  3.5.1 Dataset
  3.5.2 Models
  3.6 Results and analysis
  3.6.1 VGG results
  3.7 Summary

4 Transfer Learning Models Accuracy Improvement in Lung Diseases Classification Using Segmented X-Ray Images
  4.2.3 Results
  4.3 Training Deep Learning Models On Segmented Images
  4.3.1 Dataset
  4.4 Results and analysis

5 Conclusions and Future Work
  5.2 Contributions

Bibliography
List of Figures

2.3 [...] detection [1]
3.11 "Auxiliary classifier on top of the last 17x17 layer. Batch normalization of the layers in the side head results in a 0.4% absolute gain in top-1 accuracy. The lower axis shows the number of iterations performed, each with batch size 32." [8]
3.12 "Two alternative ways of reducing the grid size. The solution on the left violates principle 1 of not introducing a representational bottleneck from Section 2. The version on the right is 3 times more expensive computationally; the diagram on the right represents the same solution from the perspective of grid sizes." [8]
[...] with the model which reached the highest accuracy during the training. Image B shows its per-class ROC curves.
4.1 "U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The x-y-size is provided at the lower left edge of the box."
[...] (extracted) lungs.
4.5 VGG16 based model training and validation loss change.
[...] with the model which reached the highest accuracy during the training. Image B shows its per-class ROC curves.
4.17 Two pairs of correctly classified images with their class activation maps. The left column shows the non-segmented images and the right the segmented ones.
List of Tables

4.1 [...] data.
4.2 Comparison of different deep learning based solutions trained on the Shenzhen dataset [11] against results from the literature.
4.3 Comparison of different deep learning based solutions trained on the Montgomery dataset [11]. Our average performance is [...]
Chapter 1
Introduction
Computer vision supported by deep neural networks finds its usefulness in many areas of life, from facial emotion recognition to disease detection. The newest WHO (World Health Organisation) report states that in the United States alone over 1 million citizens seek care due to pneumonia, and there are nearly ten million cases of tuberculosis worldwide. Some of these cases prove lethal due to a lack of medical staff or human error.
1.2 Motivation
Previous approaches required both a large volume of data and substantial computational power [4]. Instead of following the popular trend of creating new architectures from scratch, we build on existing solutions [13]. The motivation of this project was to create a pipeline allowing us to detect pulmonary diseases using tiny datasets (< 10^3 images per class; see Chapter 2) and limited computational resources. Also, we want to show the positive impact of lung segmentation on labeling accuracy.
1.3 Applications
Our approach is meant to provide relatively good results even if the resources are limited (both data and computational). We believe that hospitals could find this solution useful.
1.4 Contributions
This work brings a new outlook on pre-trained deep neural networks and their application to small data sets [4] (see Chapter 3), and demonstrates the positive impact of segmentation on labeling accuracy (see Chapter 4).
The first chapter serves as an introduction and motivation to the topic of lung disease classification. We discuss the usefulness of the task, yet the real reason is about making the world a better place.

The second chapter describes the essential background and significant related work.
Chapter 2
Related Work
Deep learning methods applied in the field of lung diagnosis led to the usage of GPU-based platforms able to process a large volume of high-resolution images. Prediction accuracy can get higher than that of common classifiers [10], as this approach avoids errors caused by inaccurate segmentation and feature extraction.
2.1.1 Overview

Neural networks have long been applied to medical image processing, for example in edge-preserving noise reduction in digital pictures and radiographs [23] and in the enhancement of subjective edges traced by a physician in cardiac ventriculograms. Convolutional neural networks have been applied to classification tasks such as false-positive reduction in CAD schemes for the detection of lung nodules [30] and the distinction between malignant nodules and benign ones. This class of models has also been used for separating bone tissue from soft tissue in CXR [35] and for lung nodule enhancement.
Chest X-Ray is one of the most frequently used diagnostic tools when examining patients. However, studies showed that 82-95% of missed lung cancer cases were at least partially obscured by ribs or a clavicle. To address this issue, researchers examined dual-energy imaging, a technique which can produce images of two tissue types, a "soft-tissue" image and a "bone" one [14]. This technique has many drawbacks, but undoubtedly one of the most important ones is the exposure to radiation.
The MTANN models have been developed to address this problem and serve as a technique for rib/soft-tissue separation. The idea behind training those algorithms is to provide them with pairs of bone and soft-tissue images, as shown in Figure 2.1. Figure 2.2 shows the performance of the model on unseen image data. The rib contrast is visibly suppressed in the resulting image.
(A) Original input. (B) Boneless image.

(A) Test sample. (B) Generated result.
A different approach was used in [37]: a technique often used in texture analysis. The proposed method, named Spatial Interdependence Matrix or SIM, makes use of co-occurrence statistics to analyze structural information based on the way the human visual system tries to interpret scenes. In this context, an image is defined as a function

\[ I : D \subset \mathbb{Z}^2 \rightarrow \{0, 1, 2, 3, \ldots, N\} \tag{2.1} \]

where N represents the number of grey levels. The transitions of pixel intensities are represented by a two-dimensional matrix of size NxN, where each element $M_{i,j}$ is defined as

\[ M_{i,j} = \#\{\, p \in D : I(p) = i \wedge J(p) = j \,\} \tag{2.2} \]
where the # operator represents the set cardinality, and I(p) and J(p) are the intensities of pixel p in the two compared images. The chi-square statistic (Chi), the inverse difference moment (Idm), and the correlation (Corr) represent the degradation level in three different perspectives: structural independence, structural degradation, and structural similarity, respectively.
To extract the structural attributes Idm and Corr, the authors in [37] used a symmetrized version $M_S = \frac{M + M^T}{2}$, where $M^T$ is simply the transposed matrix. The attributes are defined as:

\[ \mathrm{Corr} = \sum_{i,j=0}^{N-1} \frac{(i - \mu_i)(j - \mu_j)\, M_{i,j}}{\sqrt{\sigma_i^2 \sigma_j^2}} \in [-1, 1] \tag{2.3} \]

\[ \mathrm{Idm} = \sum_{i,j=0}^{N-1} \frac{M_{i,j}}{1 + |i - j|} \in [0, 1] \tag{2.4} \]

\[ \mathrm{Chi} = \sum_{i=0}^{N-1} \frac{(O_i - E_i)^2}{E_i} \tag{2.5} \]

\[ \mu_i = \sum_{i=0}^{N} \frac{i}{N} \tag{2.6} \]
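A minimal sketch of how these attributes can be computed in Python, assuming horizontal-neighbor transitions, N = 64 grey levels as in [37], and the standard co-occurrence means and variances (the function name and the normalization are our own illustration, not the exact procedure of [37]):

import numpy as np

def sim_attributes(img, n_levels=64):
    # Quantize the image to n_levels grey levels.
    q = (img.astype(np.float64) / 256.0 * n_levels).astype(int)
    q = np.clip(q, 0, n_levels - 1)
    # Count intensity transitions between horizontally adjacent pixels.
    M = np.zeros((n_levels, n_levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        M[a, b] += 1
    M = (M + M.T) / 2.0          # symmetrized matrix M_S
    M /= M.sum()                 # normalize to joint frequencies
    i, j = np.indices(M.shape)
    mu_i, mu_j = (i * M).sum(), (j * M).sum()
    var_i = ((i - mu_i) ** 2 * M).sum()
    var_j = ((j - mu_j) ** 2 * M).sum()
    # Correlation (Eq. 2.3) and inverse difference moment (Eq. 2.4).
    corr = ((i - mu_i) * (j - mu_j) * M).sum() / np.sqrt(var_i * var_j)
    idm = (M / (1.0 + np.abs(i - j))).sum()
    return corr, idm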
In well-structured images, the entries are distributed near the diagonal of the matrix; otherwise, different patterns appear depending on the degradation of the structure. When relating to lung diseases, the SIM pattern of fibrosis varies from one image to an image of healthy lungs. The structures of fibrosis are spread throughout the lung area, whereas in healthy lungs they are sparse and small. Because radiographic images have blurred structures, the authors in [37] convolve the inputs with a Gaussian kernel. They set the number of grey levels to 64 to compute the Spatial Interdependence Matrix.
The models were trained for 100 epochs and reached accuracy levels above 96%.
Another approach is the framework developed by researchers in [1] (see Figure 2.3), which first segments lung images and then, by combining texture and shape features, tries to predict the disease presented on CXRs. The algorithm uses intuition similar to that of radiologists during lung examination, namely the comparison of the right and left lung fields. The texture features describe the inside of the lung fields, and the shape features describe their geometry. The first stage is composed of three phases. The first one is content-based image retrieval using the Bhattacharyya shape similarity measure [38] and a partial Radon transform [39]. Then, a patient-specific anatomic model of lung shape is built based on SIFT-flow [40]. Finally, the lung boundaries are extracted with a graph cut optimization approach. The next stage is texture feature extraction from the segmented lungs, using features such as intensity histograms and gradient-based descriptors.
This section briefs the SSL algorithm proposed in [2] for the classification of pulmonary disease, which uses the ensemble method called CST-Voting [42]. The idea behind this algorithm is to generate a classifier by applying multiple semi-supervised learning methods to one dataset. In self-training, a single classifier is iteratively retrained on the labeled set L, which is extended with its most confident predictions on unlabeled data. We repeat this procedure until the set U is empty, or we meet some stopping criterion.

Co-training assumes that we have two weak algorithms that are trained on the set of labeled instances (L). Later, the two algorithms classify instances of the unlabeled set U' (a fixed-size subset of all unlabeled data U) and move to L those where the prediction was the most confident. After removing samples from U', we refill this set with new instances from U. We repeat this procedure until the set U is empty, or we meet some stopping criterion.
In tri-training, the number of classifiers is increased to three. Each classifier is trained on the labeled data and predicts the class for instances in the unlabeled set. The majority makes the final decision, and the classified sample is added to the labeled data. One way of looking at this method is that the majority teaches the minority. CST-Voting combines these methods (co-training, self-training, and tri-training), which build the ensemble using the same unlabelled set U and labeled set L. Afterwards, the decision on an unlabelled test sample is made by majority voting.
MODS culture allows tuberculosis diagnosis within a one-week period with minimal effort and high effectiveness. In spite of its advantages, MODS is still limited in remote, low-resource settings, since it requires permanent and trained technical staff for the image-based diagnostics. Hence, it is essential to create alternative solutions. To this end, researchers in [3] trained and then assessed a deep convolutional neural network (see Section 2.8) for MODS digital image interpretation and diagnostics. The model was based on the VGG network [43] (also see Section 3.2): a 15-layer deep model comprising fully-connected, max-pooling, and convolutional layers organized into five blocks.
2.4 Deep Learning Approaches in Chest X-Ray analysis

The usefulness of preprocessing algorithms such as lung segmentation was demonstrated in Chest X-Ray image analysis [46]. Researchers in [4] aimed at improving tuberculosis detection on relatively small data sets (< 10^3 images per class) by incorporating deep learning segmentation and classification methods from [46]. This approach inspired the experiments presented in Chapters 3 and 4.
2.5 Dataset

This work combines two relatively small datasets (< 10^3 images per class): 306 images with marks of tuberculosis, 306 with marks of pneumonia, and 306 of healthy patients, contributing to a set of 918 samples coming from different patients. Sample images coming from both datasets are presented in Figure 2.8.
The Shenzhen Hospital dataset (SH) [13][47] containing CXR images was created by People's Hospital in Shenzhen (China). It includes both abnormal CXR images (containing marks of tuberculosis) and normal ones.

(A) Tuberculosis case. (B) Pneumonia case.

Unfortunately, the dataset is not evenly distributed with respect to disease, gender, or age, as presented in Figure 2.9. Here, we extracted only 153 samples of healthy patients (the other 153 healthy samples come from the second dataset) and 306 of those labeled with marks of tuberculosis. Selecting information about one class from different resources ensures that the model does not learn features typical for a single data source.
The Labeled Optical Coherence Tomography and Chest X-Ray Images for Classification dataset [48] includes selected images of patients from the Medical Center in Guangzhou. It consists of data with two classes: normal images and those containing marks of pneumonia. All data comes from the patients' routine clinical care. The complete dataset includes thousands of validated OCT and X-Ray images, yet for our analysis we wanted to keep the dataset tiny and evenly distributed; thus only 153 images labeled as healthy and 306 labeled as pneumonia were taken from these resources, both chosen randomly (another 153 healthy images come from the tuberculosis dataset). The exact dataset class distribution is presented in Table 2.2.
Excluding regions that carry no lung information (bones, internal organs, etc.; see Fig. 2.10) was proven to be effective in gaining better prediction accuracy [4]. To extract the lung information and exclude outside regions, we used the manually prepared lung masks shown in Figure 2.11.
2.6 Image Data Augmentation

Image data augmentation allows us to artificially increase the volume of the training set. Researchers in [44] incorporated three techniques to augment the training dataset size.
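A hedged illustration of this kind of augmentation in Keras follows; the transformation set and parameter ranges below are our own assumptions, not the exact techniques used in [44]:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative settings; each training batch is perturbed with small
# random transformations, so the model rarely sees the same image twice.
augmenter = ImageDataGenerator(
    rotation_range=10,        # small random rotations (degrees)
    width_shift_range=0.1,    # random horizontal translations
    height_shift_range=0.1,   # random vertical translations
    zoom_range=0.1,           # random zoom in/out
    horizontal_flip=True,     # left-right mirroring
)
# x_train and y_train are hypothetical arrays of images and labels:
# train_flow = augmenter.flow(x_train, y_train, batch_size=32)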
2.7 Evaluation

There are many ways to assess the performance of classifiers, and thus in this thesis we will use the following metrics: accuracy, F1-score, precision, sensitivity, specificity, and a graphical performance measure, the ROC (receiver operating characteristic) curve together with the AUC (area under the ROC curve). In this section, we define metrics for binary problems. To extend them to a three-class problem, we calculate them using the approach "one versus the others" for every label [49].

Now, let us consider the situation given in Table 2.3. We see that the data is completely imbalanced; therefore, predicting just the class neutral provides us with an accuracy of 86%, which is considered high. However, we should not assume that the model is valid when its decisions are driven by class imbalance alone.

Recall measures how many truly positive samples were retrieved with respect to all the relevant ones. It calculates the ratio of true positives Tp to the sum of true positives and false negatives Fn:
\[ R = \frac{T_p}{T_p + F_n} \tag{2.8} \]

Precision is defined analogously as the ratio of true positives to all samples predicted as positive:

\[ P = \frac{T_p}{T_p + F_p} \tag{2.9} \]
That is to say, recall tells us how many positive samples we missed (a low recall means many positives were not detected), while precision tells us what proportion of the samples classified as positive was correct.

If one sample out of 50 tuberculosis cases was diagnosed as tuberculosis and the rest as other, then the precision P equals 100%; however, the recall would be as low as 2%. The F1 score is the harmonic mean of precision and recall and provides a balance between them:

\[ F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{2.11} \]
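The following sketch computes precision, recall, and the F1 score per class from a confusion matrix, using the "one versus the others" extension described above (the toy matrix is illustrative only):

import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and F1 per class, one-versus-the-rest."""
    cm = np.asarray(cm, dtype=float)   # cm[i, j]: true class i, predicted j
    tp = np.diag(cm)                   # correctly labeled samples per class
    fp = cm.sum(axis=0) - tp           # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp           # samples of the class that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-class example (healthy, tuberculosis, pneumonia):
print(per_class_metrics([[25, 3, 2], [4, 24, 2], [1, 2, 27]]))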
Another way to show a classification model's performance is a graph called the ROC (receiver operating characteristic) curve, which plots the True Positive Rate against the False Positive Rate.

True Positive Rate (TPR) is just another name for recall (R). False Positive Rate (FPR) is defined as:

\[ FPR = \frac{F_p}{F_p + T_n} \tag{2.12} \]

A ROC graph is a plot of the True Positive Rate vs. the False Positive Rate using various classification thresholds, and AUC measures the area underneath this two-dimensional curve from (0, 0) to (1, 1), computed separately for each class.
The area defined by AUC varies from 0 to 1. A model that predicts the correct class in 100% of cases has an AUC of 1.0, and one whose predictions are always wrong has an AUC of 0.0. AUC is desirable for two reasons:

classification-threshold-invariance
scale-invariance

The first one measures the quality of the model's predictions irrespective of the chosen classification threshold. The second one assesses how well the predictions are ranked, rather than their absolute values.
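A short sketch of how a per-class ROC curve and its AUC can be obtained with scikit-learn; the variable names are illustrative, with y_true holding integer labels and y_prob the softmax outputs:

import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_for_class(y_true, y_prob, k):
    # Binarize the labels one-versus-the-rest for class k, then compute
    # the ROC curve over all thresholds and the area under it.
    y_bin = (np.asarray(y_true) == k).astype(int)
    fpr, tpr, _ = roc_curve(y_bin, y_prob[:, k])
    return fpr, tpr, auc(fpr, tpr)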
The success of convolutional architectures renewed the interest of the research community: CNNs and subsequent deep models are suited much better to capturing different features than traditional approaches.

The convolution layers are made of kernels: small tensors, comparable to windows, which process input and produce output information. Those operators can successfully extract local patterns; the first layers thus learn different local features like straight lines (horizontal or vertical) and curves, while upper (hidden) layers can perform detection of more sophisticated structures. In a color image, we can distinguish three color layers: red, green, and blue, respectively. The role of a convolution layer is to act as a feature extractor which extracts valuable input features and processes this information to the next level, while reducing the dimensionality. A single output unit is computed as

\[ o_j = f(W x_j + b) \tag{2.13} \]

where $W \in \mathbb{R}^{p \times q}$ is a weight matrix (kernel), p is the output size of the convolution, q is the window size, $x_j$ is the input window, f represents the non-linearity, and b is the bias. Both parameters b and W are shared across all inputs.
Similarly to the convolution layer, the pooling operator is also responsible for spatial size reduction and thus decreases the computational resources used in processing data, albeit the pooling layer contains no parameters: there is nothing to learn, and a fixed operation is applied over each pooling window. The two most common such operations are max pooling and average pooling. Figure 2.18 presents the extraction of dominant, rotation-invariant information by max pooling.
The ConvNet effectively learns the relations between surrounding pixels throughout an image. Thanks to the convolution, the input is mapped into a condensed feature representation.
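A minimal Keras sketch of the two operations described above, a convolution layer implementing Eq. (2.13) with a ReLU non-linearity f, followed by a parameter-free 2x2 max-pooling layer (the filter count and input shape are illustrative):

from tensorflow.keras import layers, models

net = models.Sequential([
    # 32 kernels of size 3x3; the weights W and bias b are shared across
    # all input windows, as in Eq. (2.13).
    layers.Conv2D(32, kernel_size=3, activation="relu",
                  input_shape=(256, 256, 3)),
    # Max pooling halves the spatial size and has no trainable parameters.
    layers.MaxPooling2D(pool_size=2),
])
net.summary()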
2.9 Summary

This chapter briefly reviewed the work related to the problem, from Extreme Learning Machines and semi-supervised ensembles to deep-learning-based lung segmentation and classification. We briefed deep convolutional neural networks and explained the operations they conduct in image data analysis.
Chapter 3

Transfer Learning in Lung Diseases Classification

"Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned."
Rather than training a network from scratch as the starting point for a new task, we incorporate pre-trained models skilled in solving similar problems. This method is crucial in medical image processing due to the shortage of sample volume. A common practice is to use deep learning models such as ResNet as feature extractors, where the last layer's information serves as input features to a new classifier. There are several options for doing so:
Reuse Model
Tune Model
Select Source Model

The first option, Reuse Model, states that a pre-trained model can serve as a starting point for another model used in a different task. This involves the incorporation of the whole model or its parts. The second option, Tune Model, additionally refines the pre-trained weights on the new data. The third option considers selecting one of the available models; it is very common to choose a model pre-trained on a large benchmark dataset.
3.1.2 ImageNet

ImageNet helps researchers in image classification and detection tasks by providing them with a large image dataset. This database contains roughly 14 million different images from over 20,000 classes. ImageNet also provides bounding boxes with annotations for over 1 million images, which are used in object localization problems.
In this work, we will focus on three models, VGG, ResNet, and Inception, pre-trained on the ImageNet dataset.
3.2 VGG16

The VGG16 network [43] has over 138 million parameters. This model was able to achieve a 7.4% top-5 error rate on the ImageNet dataset (see Section 3.1.2). It improved upon the AlexNet [44] network by changing the kernel sizes: instead of 11x11 and 5x5 filters in the first two layers, it implements multiple smaller 3x3 filters one after another.
The network accepts fixed-size images. The input is processed through a set of convolution layers which use small kernels with a receptive field of 3x3, the smallest size allowing us to capture the notion of up, down, right, left, and center. The architecture also incorporates 1x1 kernels, which may be interpreted as a linear transformation of the input (followed by a non-linearity; see Section 2.8). The stride is fixed to 1 pixel, and the padding is chosen so that the spatial resolution stays the same after processing an input through a layer, e.g., the padding is fixed to 1 for 3x3 kernels. Spatial downsizing is performed by five consecutive pooling layers. This pile of convolutional layers ends with three fully-connected (FC) layers, where the first two consist of 4096 channels each and the third one of 1000, as it performs the 1000-way classification using softmax. All hidden layers use the same non-linearity, ReLU (rectification) [44].
Figure 3.1 visualises the architecture of the VGG model with 16 layers.
3.3 ResNet-50

The network classifies an input image into one of 1000 object classes like car, airplane, horse, or mouse. The network has learned a plentiful set of features thanks to the diversity of the training images and achieves a 6.71% top-5 error rate on the ImageNet dataset (see Section 3.1.2).

When plain networks get deeper, an opposite effect to the expected one appears: accuracy saturates and eventually degrades. This, however, is not caused by overfitting but by vanishing gradients [45]. Before residual connections were introduced, researchers were not able to benefit from building deeper networks, as these did not perform better than their shallower counterparts (see Figure 3.2).
Let us consider a deep neural network block whose desired output is denoted as H(x), a transformation of the input x. The following formula defines the difference, or residue, between those arguments:

\[ F(x) = H(x) - x \]

The block tries to learn the correct output H(x), and since there is an identity connection skipping x to the output, the block effectively only needs to learn the residual F(x). Therefore, those blocks are called residual blocks.
The overall structures of ResNet-34 and ResNet-50 are the same. The only difference is in the residual blocks: unlike those in ResNet-34 (Figure 3.3 A), ResNet-50 replaces every two layers in a residual block with a three-layer bottleneck block using 1x1 convolutions, which reduce and eventually restore the channel depth. This reduces the computational load.

The model input is first processed through a layer with 64 filters, each 7x7 with stride 2, and downsized by a max-pooling operation, which is carried over a fixed 2x2 pixel window with a stride of 2 pixels. The second stage applies the first group of convolutions and skip connections. The third group of convolutions and skip connections starts with a dotted line (Figure 3.4), as there is a change in the dimensions caused by increasing the stride of the first convolution block from 1 to 2 pixels. The fourth and fifth groups of convolutions and skip connections follow the pattern presented in the third stem of input processing, yet they change the number of filters (kernels) to 256 and 512, respectively [45]. This model has over 25 million parameters.
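A sketch of such a bottleneck block in Keras; the projection convolution on the shortcut and the filter counts are simplifying assumptions:

from tensorflow.keras import layers

def bottleneck_block(x, filters):
    shortcut = x
    y = layers.Conv2D(filters, 1, activation="relu")(x)   # reduce depth
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(4 * filters, 1)(y)                  # restore depth
    if shortcut.shape[-1] != 4 * filters:
        # Project the identity path when the channel depths do not match.
        shortcut = layers.Conv2D(4 * filters, 1)(shortcut)
    # The block outputs F(x) + x, so the layers only learn the residual.
    return layers.Activation("relu")(layers.Add()([y, shortcut]))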
(A) ResNet-34 residual block. (B) ResNet-50 residual block.
3.4 Inception

Researchers from Google introduced the first Inception network (InceptionV1) [7] in 2014 during the ImageNet competition (see Subsection 3.1.2). The model consists of blocks called "inception cells" that conduct convolutions using filters of different scales and afterward aggregate the results as one. Thanks to 1x1 convolutions, which reduce the input channel depth, the model saves computations. Using a set of 1x1, 3x3, and 5x5 filters, an inception cell learns to extract features of different scales from the input image. Although inception cells use the max-pooling operator, the dimensions of the processed data are preserved due to "same" padding.
A follow-up paper was released not long after, introducing a more efficient version of the inception cell. Large filters sized 5x5 and 7x7 are useful for extracting extensive spatial features, yet their disadvantage lies in the number of parameters and the resulting computational disproportion. The researchers from Google found a way to save computations and reduce the number of parameters without decreasing the model's efficiency. In the proposed architecture [8], all 5x5 convolutions (Figure 3.6 (A)) were factorized into two consecutive 3x3 operations (Figure 3.6 (B)), improving the computational efficiency: a 5x5 filter costs 25 weights, whereas two stacked 3x3 filters need only 18 (2x3x3), a 28% reduction.

The researchers went even further in decreasing the number of filter parameters by introducing double asymmetric convolutions, 3x1 and 1x3 (see Figure 3.7). These reduce the parameter count by 33%, since instead of 9 (3x3 = 9) weights we need only 6 (3x1 plus 1x3 = 6). The application of asymmetric convolution can be seen in Figure 3.9. To achieve the results of the block presented in Figure 3.14, n needs to be set to 3.
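Both factorizations can be written down directly in Keras; this sketch mirrors the parameter arithmetic above, with illustrative filter counts:

from tensorflow.keras import layers

def factorized_5x5(x, filters):
    # Two stacked 3x3 convolutions cover a 5x5 receptive field with
    # 18 weights per filter instead of 25.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def asymmetric_3x3(x, filters):
    # A 1x3 followed by a 3x1 convolution replaces one 3x3 convolution,
    # using 6 weights instead of 9.
    x = layers.Conv2D(filters, (1, 3), padding="same", activation="relu")(x)
    return layers.Conv2D(filters, (3, 1), padding="same", activation="relu")(x)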
The filter banks were furthermore expanded, making them wider rather than deeper, to avoid a representational bottleneck. Therefore, models incorporating such techniques are less prone to overfitting and can consequently get deeper.
The first auxiliary classifiers (see Figure 3.11) were proposed along with the InceptionV1 model [7]. Although the new models use the intuition behind them, they are slightly modified: instead of using two auxiliary classifiers [7], only one is used, on top of the 17x17 pixels layer (see Figure 3.14). The reason for introducing this difference lies in their purpose: the first version of the Inception deep neural network used auxiliary classifiers to make the gradients flow to the earlier layers, whereas in InceptionV3 they mainly act as a regularizer [8].

The grid-size reduction produces two sets of feature maps, together having 640 channels. The first set, having 320 feature maps, is an output of a convolution block with stride equal to 2. The second set, constituting another 320 channels, is obtained by max pooling.
This effective grid size reduction is an equally effective operation, although less computationally expensive. Figure 3.13 shows an inception module reducing the grid size.
3.4.5 Architecture

The InceptionV3 model [8] contains over 23 million parameters. The architecture can be divided into 5 modules, as presented in Figure 3.14. The first module consists of inception cells with factorized convolutions, as in Figure 3.6. Then, information is passed through the effective grid size reduction (see 3.4.4) and processed through four consecutive inception cells with asymmetric convolutions (see 3.4.3) and another effective grid-size-reduction block. Finally, data progresses through a series of two blocks with wider filter banks (see Figure 3.10) and consequently gets to a fully-connected layer ended with a softmax classifier.
3.5 Experiments

3.5.1 Dataset

The first part of the experiments compares three modified versions of the neural networks introduced earlier, VGG16, ResNet-50, and InceptionV3, described in 3.2, 3.3, and 3.4, respectively. We train those models on the database containing X-Ray lung images introduced in Section 2.5. All images (918 samples, 306 per class; see Table 2.2) were resized to the same shape, 256x256 pixels, before training. The dataset, consisting of X-Ray images and one-hot encoded labels, was partitioned into three categories: 80% for training, 10% for validation, and 10% for testing (the same approach was used in [4]). The class distribution was kept in even proportion, e.g., a third of the samples were labeled as healthy, a third contained marks of tuberculosis, and the remaining part came from patients suffering from pneumonia. During the training process, the input data is augmented [51][52][53] by randomly selecting one of several transformations (see Section 2.6).
3.5.2 Models

The analysis in this chapter compares three models using the different transfer learning bases described in 3.2, 3.3, and 3.4 for lung disease classification. The models were expanded with the same neural-network-based classifier consisting of a global average pooling layer, three fully connected layers having 1024, 512, and 256 neurons, and a softmax classifier. The number of trainable parameters in the deep neural networks is 1,182,211 for VGG16 and 2,755,075 for both ResNet-50 and InceptionV3. Before passing input images through the deep neural networks which serve as feature extractors, the batches were adjusted to the same formats (batch size, input scale, etc.) the mentioned models were originally trained with, and the pre-trained weights were kept frozen during the training process. The output from the softmax classifier is a vector of probabilities with which an input image belongs to each of the classes. The final label is the class corresponding to the highest probability.
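A sketch of this setup in Keras with InceptionV3 as the frozen feature extractor (VGG16 and ResNet-50 are wired identically); the ReLU activations in the dense layers and the optimizer are assumptions, since the text does not specify them:

from tensorflow.keras import Model, layers
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(256, 256, 3))
base.trainable = False                      # only the new head is trained

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(1024, activation="relu")(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dense(256, activation="relu")(x)
out = layers.Dense(3, activation="softmax")(x)  # healthy / TB / pneumonia

model = Model(base.input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])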
Our models were implemented in Keras, the Python deep learning API. Our algorithms were trained on servers equipped with nodes each having eight Nvidia K20 GPUs and additionally six computing nodes with eight Nvidia K80 boards each. Every K80 board includes two GPUs, giving a total of 216 GPUs in the cluster.
3.6 Results and analysis

Training the three models, each repeated ten times for 150 epochs, took roughly one day. This relatively short period results from setting the parameters of the pre-trained networks to non-trainable; thus the gradient caused by misclassification flows only through the appended layers. The initial image preprocessing was also advantageous for the duration, since the real-size images were downscaled to 256x256 pixels. The goal was to find a relatively good epoch after which the models were overfitting the training data. The training process was then stopped, and the final results were measured as an average of all results obtained at that step. The last step was to show the performance of the selected models on the unseen data (test set).
3.6.1 VGG results

The following results were generated for ten independent training runs to observe a similar training pattern. Each of the ten training and validation curves (see Figures 3.15 and 3.16) was plotted on the same charts based on its type (training or validation). To maintain a high level of readability, the results were separated. The wider, dotted curve is an averaged result of all ten runs at the particular epoch. The red dots in Figure 3.15 represent the lowest loss values on the training and validation data sets, whereas in Figure 3.16 they mark the highest accuracy. The model seems to plateau on the validation dataset around the 90th epoch, yet then the validation error falls again and eventually, after 150 epochs, achieves the best average results. Similar behavior can be observed in the accuracy curves. The AUC scores for the classes healthy, tuberculosis, and pneumonia were equal to 0.68, 0.82, and 0.80, respectively. Additionally, we present sample classification results with their class activation maps.
(A) Confusion matrix. (B) Per-class ROC curves.
3.6.2 ResNet-50 results

The following results were generated for ten independent training runs in Figures 3.19 and 3.20. By splitting the results by type (separately training and validation), we maintained a high level of visibility, allowing us to simplify the analysis. The wider, dotted curve is an average of all results obtained at the corresponding epoch. The red dots in Figures 3.19 and 3.20 stand for the minimum loss and maximum accuracy values, respectively. Figure 3.19 shows that the model decreases its loss throughout the whole training, yet it seems to reach the validation accuracy plateau around the 120th epoch (see Figure 3.20 B).
Finally, after 150 epochs, all the ResNet-50 based models were evaluated on the test set and scored an average accuracy of 72.22%, which is almost a ten-point improvement over 3.6.1. In order to visualize the results, we selected the network which achieved the best accuracy score. The confusion matrix in Figure 3.21 A) shows that, similarly to 3.6.1, the model had problems with correctly labelling 'tuberculosis' images, although the majority of images were correctly classified. Image B) shows that the AUC scores for healthy, tuberculosis, and pneumonia were equal to 0.84, 0.76, and 0.84, respectively. This is an improvement in comparison to the results in the previous subsection (3.6.1).

Similarly to 3.6.1, both images A) and C) were correctly classified, yet the overlapping class activation maps (images B) and D)) show that the determinative regions were not related to the lungs: areas like the collarbones or the heart decided the final label.
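The thesis does not spell out the exact class-activation-map variant; the following Grad-CAM-style sketch shows one common way to obtain such maps for any of the models above (the convolutional layer name is an assumption to be adapted per model):

import numpy as np
import tensorflow as tf

def grad_cam(model, conv_layer_name, image, class_idx):
    # Probe model exposing the chosen feature maps and the final output.
    probe = tf.keras.Model(model.input,
                           [model.get_layer(conv_layer_name).output,
                            model.output])
    with tf.GradientTape() as tape:
        fmaps, preds = probe(image[np.newaxis].astype("float32"))
        score = preds[0, class_idx]
    grads = tape.gradient(score, fmaps)              # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # one weight per channel
    cam = tf.nn.relu(tf.reduce_sum(fmaps[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # heatmap in [0, 1]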
(A) Confusion matrix. (B) Per-class ROC curves.
3.6.3 InceptionV3 results

The following results were generated for ten independent training runs in order to observe a similar training pattern. Each of the ten training and validation curves (see Figures 3.23 and 3.24) was plotted on the same charts. The wider, dotted curve is an averaged result of all ten runs at the particular epoch. The red dots in Figures 3.23 and 3.24 stand for the lowest loss value and the highest accuracy on the training and validation datasets, respectively. The model starts overfitting the dataset around the 20th epoch, which can be witnessed by examining both the validation error and accuracy change: the validation error curve slowly increases its value, whereas the accuracy level remains similar (see Figure 3.24).

Finally, we took all InceptionV3 based models at the 20th epoch, evaluated them on the test set, and scored an average accuracy of 80.55%, the best result among the models considered so far. To visualize the results, we selected the model which achieved the best accuracy score after the 20th epoch. The confusion matrix in Figure 3.25 A) shows that the new model improved the number of true positives, yet the network still faced minor problems with classifying all 'tuberculosis' images to their corresponding label. Image B) shows that the AUC scores for healthy, tuberculosis, and pneumonia were equal to 0.89, 0.87, and 0.92, respectively.

Both images A) and C) were correctly classified, yet the overlapping class activation maps (images B) and D)) show that the determinative regions were not necessarily related to the lungs: areas like the armpits and internal organs decided the final label.
(A) Confusion matrix. (B) Per-class ROC curves.
3.6.4 Comparison

After comparing the results obtained in Sections 3.6.1, 3.6.2, and 3.6.3, we can conclude that the InceptionV3-based model performed best: a random instance containing marks of either tuberculosis or pneumonia has a high probability of being correctly classified. There was no prior work on a similar dataset (a Chest X-Ray multi-classification problem with a small dataset); therefore, we did not compare ourselves against other published results.
3.7 Summary

This chapter introduced three models which achieved the highest scores in the ImageNet competition: VGG16, ResNet-50, and InceptionV3. We also presented our initial work on lung disease classification using those pre-trained deep neural networks as feature extractors for a simple 3-layer deep neural network classifier. The significant and promising results on small datasets show that there is no need to build sophisticated, multiple-layer networks in order to obtain good classification performance.
Chapter 4

Transfer Learning Models Accuracy Improvement in Lung Diseases Classification Using Segmented X-Ray Images

4.1 Image Segmentation Using Deep Neural Networks
Many vision-related tasks, especially those from the field of medical image processing, expect to have a class assigned per pixel, i.e., every pixel is associated with a corresponding class. To conduct this process, we propose a u-shaped network in which a contracting path captures context while reducing the output resolution. High-resolution features are combined with the upsampled output to enable precise localization.
4.1.1 Architecture

The contracting path consists of repeated convolutions, each followed by a non-linearity, here a rectified linear unit (ReLU), and 2x2 pooling with stride 2. Each downsampling operation doubles the number of resulting feature maps.

All expansive path operations are made of upsampling of the feature channels followed by a 2x2 deconvolution (or "up-convolution") which halves the number of feature maps. The result is then concatenated with the corresponding feature layer from the contracting path, convolved with 3x3 kernels, and passed through a ReLU. The final layer applies a 1x1 convolution to map each feature vector to the desired class.
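A sketch of one contracting and one expansive step in Keras; "same" padding is assumed here, whereas the original U-Net paper uses unpadded convolutions, and the channel counts are illustrative:

from tensorflow.keras import layers

def down_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x, layers.MaxPooling2D(2)(x)      # keep x for the skip connection

def up_block(x, skip, filters):
    x = layers.Conv2DTranspose(filters, 2, strides=2)(x)  # "up-convolution"
    x = layers.Concatenate()([x, skip])      # merge contracting-path features
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

# The final layer maps each pixel's feature vector to a lung/background mask:
# mask = layers.Conv2D(1, 1, activation="sigmoid")(features)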
Such u-shaped networks outperform the majority of CNN-based models and achieve excellent results by easily capturing spatial information about the lungs. As an outcome, we propose a pipeline that consists of two stages: first segmentation and then classification.
4.2.1 Dataset

We used the u-shaped model presented in 4.1. Our algorithms trained for 500 epochs on an extension of the SH dataset described in 2.5. The input to our u-shaped deep neural network is a regular Chest X-Ray image, whereas the output is a manually prepared binary mask of the lung shape, matching the input. Figure 4.2 A) shows a sample input with its corresponding mask.
4.2.2 Training

As mentioned before, our model was trained for 500 epochs using a dataset divided into 80%, 10%, and 10% parts for training, validation, and testing, respectively, on the same machines introduced in Section 3.5.2. As one can easily notice, the validation error falls slowly throughout the whole training, with no major change after the 100th epoch. The final error on the validation set is right below 0.05 and slightly above 0.06 on the test set.
4.2.3 Results

Our algorithm learns shape-related features typical for lungs and generalizes well over unseen data. Figure 4.4 shows the results of our U-Net trained models. It is clear that the network was able to learn chest shape features and exclude regions containing internal organs such as the heart. The incredibly promising results allowed us to process the whole dataset presented in Subsection 3.5.1 and continue our analysis on the newly processed images.
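The processing itself amounts to masking each X-Ray with its predicted lung mask before classification; a minimal sketch, where the 0.5 threshold is an assumption:

import numpy as np

def apply_lung_mask(image, predicted_mask, threshold=0.5):
    # Zero out everything outside the predicted lung regions.
    binary = (predicted_mask > threshold).astype(image.dtype)
    return image * binary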
4.3 Training Deep Learning Models On Segmented Images

The models and training procedure are the same as shown in 3.5.2.
4.3.1 Dataset

Here, we conduct our experiments using the same data as in Chapter 3. The only difference is that the images were processed by our segmentation network to keep only the relevant information, the lungs. Figure 4.4 shows the training samples; the left and right columns correspond to input and output, respectively.
4.4 Results and analysis

As in the previous chapter, training all three models repeatedly, ten times for 150 epochs, took about one day. This short time is a result of setting all parameters of the pre-trained models as non-trainable; therefore, the gradients flow only through the appended layers. The segmented images were resized, which also beneficially influenced the execution duration. The biggest problem related to training our models was the maximum platform usage limit. As before, the goal was to find a relatively good number of epochs after which the models were overfitting the training data. The training process was then stopped, and the final results were measured as an average of all results obtained at that step.
4.4.1 VGG results

The following results were generated for ten independent training runs to observe a similar training pattern. Each of the ten training and validation curves (see Figures 4.5 and 4.6) was plotted on the same charts based on its type (training or validation). To maintain readability, the results were separated. The wider, dotted curve is an averaged result of all ten runs at the particular epoch. The red dots in Figure 4.5 represent the lowest loss values on the training and validation data sets, whereas in Figure 4.6 they mark the highest accuracy. The model reaches a stable validation error around the 70th epoch and maintains its level throughout the remaining roughly 80 epochs. Similar behavior is seen when examining Figure 4.6: the average validation accuracy slows down and only slightly increases.
Eventually, the models were evaluated on the test set and scored an average accuracy of 69.99%, which is over 6 percentage points of improvement compared to the non-segmented results in 3.6.1. We selected the model which obtained the best accuracy score after 150 epochs. The confusion matrix in Figure 4.7 A) shows that the model had the biggest problems with classifying 'pneumonia' images to the corresponding class and tended to mistake them for the other labels. Image B) shows that the AUC scores for healthy, pneumonia, and tuberculosis were equal to 0.75, 0.81, and 0.90, respectively. This is also an improvement over the results received in 3.6.1.

The class activation maps show that the decisive regions were relevant to the problem, unlike in 3.6.1: the network investigated the lung regions and made the final decision based on features extracted there.
(A) Confusion matrix. (B) Per-class ROC curves.
4.4.2 ResNet-50 results

The following results were generated for ten independent training runs in Figures 4.9 and 4.10. By splitting the results by type (separately training and validation), we maintained a high level of visibility, allowing us to simplify the analysis. The wider, dotted curve is an average of all results obtained at the corresponding epoch. The red dots in Figures 4.9 and 4.10 stand for the minimum loss and maximum accuracy values, respectively. Figure 4.9 shows that the model decreases its loss throughout the whole training.
Finally, after 150 epochs, all the ResNet-50 based models were evaluated on the test set and scored an average accuracy of 74.99%, which is almost a 5-percentage-point improvement over 4.4.1 and more than two points over 3.6.2. In order to visualize the results, we selected the network which achieved the best accuracy score. The confusion matrix in Figure 4.11 A) shows that, similarly to 3.6.2, the model had problems with correctly labelling 'tuberculosis' images, although the majority of images were correctly classified. Image B) shows that the AUC scores for healthy, pneumonia, and tuberculosis were equal to 0.77, 0.91, and 0.82, respectively. Segmentation was able to improve the results for healthy and pneumonia image classification: when looking at the AUC scores in 3.6.2, we scored better in labeling pneumonia.
Similarly to 4.4.1, both images A) and C) were correctly classified, and the overlapping class activation maps (images B) and D)) show that the determinative regions were related to the lungs.
(A) Confusion matrix. (B) Per-class ROC curves.
4.4.3 InceptionV3 results

The following results were generated for ten independent training runs in order to observe a similar training pattern. Each of the ten training and validation curves (see Figures 4.13 and 4.14) was plotted on the same charts. The wider, dotted curve is an averaged result of all ten runs at the particular epoch. The red dots in Figures 4.13 and 4.14 stand for the lowest loss value and the highest accuracy on the training and validation datasets, respectively. The validation loss reaches its minimum around the 40th epoch, although we do not witness any noticeable drop in terms of accuracy afterwards: the validation error curve slowly increases its value, whereas the accuracy level remains similar (see Figure 4.14). Finally, we took all InceptionV3 based models at the 40th epoch, evaluated them on the test set, and scored an average accuracy of 82.22%, which is a small improvement over 3.6.3 and the best accuracy across all the models we trained using different techniques. In order to visualize the final results, similarly to the previous chapter, we selected the model which achieved the best accuracy score after the 40th epoch. The confusion matrix in Figure 4.15 A) shows that the new model improved the number of true positives (TP) in all classes compared to the previous algorithms. Image B) shows that the AUC scores for healthy, tuberculosis, and pneumonia were equal to 0.90, 0.93, and 0.99, respectively. This is a slight drop for the healthy class compared to the results in 4.4.1 and 3.6.3, albeit the scores for both disease classes improved significantly.
(A) Confusion matrix. (B) Per-class ROC curves.
4.4.4 Comparison

After comparing the results obtained in Subsections 4.4.1, 4.4.2, and 4.4.3, we can observe that transfer-learning models perform well in lung disease classification on segmented images, even when the data resources are limited. Not only is their accuracy improved, but the class activation maps also support our conclusion. Table 4.1 shows a comparison of results for all trained algorithms, both using segmented and non-segmented Chest X-Ray images.

The algorithm that scored best on the majority of metrics was InceptionV3 trained on the segmented images. What is more, it produced incredibly high scores for the "diseased" classes, showing that a random instance containing marks of tuberculosis or pneumonia has over a 90% probability of being classified into the correct class. Although the scores of the healthy class are worse than those of the diseased ones, their real cost is indeed lower, as it is always worse to classify a sick patient as healthy than the other way around. Segmented images, which keep only the lungs, force the network to explore them and thus make decisions based on observed changes. That behavior was expected and additionally improved the results.
In this section, we would like to compare our models to the results achieved in the literature. We trained our models on the Shenzhen and Montgomery datasets [11] ten times, generated the results for all the models, and averaged their scores: accuracy, precision, sensitivity, specificity, F1 score, and AUC (see Section 2.7).

Table 4.2 shows the comparison of different deep learning models trained on the Shenzhen dataset [11]. Although our approach does not guarantee the best performance, it is close to the highest one, yet less complex. Researchers in [...] achieved an accuracy higher by one percent and an equal AUC, which means that our method gives an equally reliable performance at a lower cost. On the Montgomery dataset (Table 4.3), they reached accuracies of 76% and 73%, respectively, which is roughly 3 and [...] percentage points below our average result.
Chapter 5

Conclusions and Future Work

5.1 Summary
In this thesis, we studied the classification of lung diseases using deep neural networks preceded by segmentation, under the constraint of a small-size dataset (less than 10^3 examples per class). Moreover, we examined class activation maps to explore the reasoning of our models.

In Chapter 3, we compared three pre-trained networks used as feature extractors for our shallow algorithms. The results are summarized in Section 3.6.4; there we only use accuracy to evaluate the performance, since the test set is class-balanced. In Chapter 4, we compared the results with those obtained after learning features from non-segmented images. Here we also showed how our solutions outperform deeper models trained on the same data. After comparing class activation maps in Section 4.5, we conclude that segmentation not only improved the accuracy score but also the reasoning behind the classification. Preprocessed Chest X-Ray images with only the lungs remaining force networks to explore only those areas which are medically relevant.
5.2 Contributions

This work brings a new outlook on pre-trained deep neural networks in the classification task. Even though the results are promising, other networks could be explored and examined in terms of class activation maps. Furthermore, we only used one classifier (a 3-layer deep neural network) due to computational and time limitations. Another direction would be the application of the introduced solutions to much bigger datasets.
Our models label Chest X-Ray images based on features extracted from segmented data, yet it is beyond our expertise to decide whether the determinative regions truly contain marks of disease.
This work presents only a small part of the research on lung disease classification. Considering the promising results we achieved and the relatively recent adoption of deep neural network techniques in the medical field, there is plenty of room for improvement in other biomedical applications. Furthermore, we hope that one day computers will accelerate and assist radiological examinations and save the lives of millions.
Bibliography
[5] Google, “Machine learning crash course with tensorflow apis.” https:
//developers.google.com/machine-learning/crash-course/.
[6] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks. Cambridge, MA, USA: MIT Press, 1998.
[11] S. Jaeger, S. Candemir, S. Antani, Y.-X. Wáng, P.-X. Lu, and G. Thoma, "Two public chest X-ray datasets for computer-aided screening of pulmonary diseases," Quantitative Imaging in Medicine and Surgery, vol. 4, no. 6, pp. 475-477, 2014.
[...] "network architectures for fast chest x-ray tuberculosis screening and visualization."
[...] Heidelberg, 2007.
vector classification,” Medical Physics, vol. 38, pp. 1844–58, April 2011.
detector,” IEEE Trans. Med. Imaging, vol. 23, no. 3, pp. 330–339, 2004.
[21] K. Suzuki, I. Horiba, and N. Sugie, "Neural edge enhancer for supervised edge enhancement from noisy images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1582-1596, 2003.
[22] K. Suzuki, "Neural filter with selection of input features and its application to image quality improvement of medical image sequences," IEICE Transactions on Information and Systems.
[26] S.-C. Lo, S.-L. Lou, J.-S. Lin, M. T. Freedman, M. V. Chien, and S. Mun, "Artificial convolution neural network techniques and applications for lung nodule detection," IEEE Transactions on Medical Imaging, vol. 14, no. 4, pp. 711-718, 1995.
[27] S.-C. B. Lo, H.-P. Chan, J.-S. Lin, H. Li, M. T. Freedman, and S. K. Mun, "Artificial convolution neural network for medical image pattern recognition," Neural Networks, vol. 8, no. 7-8, pp. 1201-1214, 1995.
[37] G. Ramalho and L. Bezerra, "Lung disease detection using feature extraction and extreme learning machine," Revista Brasileira de Engenharia Biomédica.
[40] C. Liu, J. Yuen, and A. Torralba, "Sift flow: Dense correspondence across scenes and its applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 978-994, 2011.
[...] Journal of Intelligent and Fuzzy Systems, vol. 35, pp. 99-109, 2018.
[45] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[46] Y. Gordienko et al., "Deep learning with lung segmentation and bone shadow exclusion techniques for chest x-ray analysis of lung cancer," Computing Research Repository, vol. abs/1712.07632, 2017.
February 2014.
[48] D. Kermany, K. Zhang, and M. Goldbaum, "Labeled optical coherence tomography (OCT) and chest X-ray images for classification," Cell, pp. 1122-1131, 2018.
[49] M. Hossin and S. M.N, "A review on evaluation metrics for data classification evaluations," International Journal of Data Mining and Knowledge Management Process, vol. 5, no. 2, pp. 1-11, 2015.
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
[52] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun, "Deep image: Scaling up image recognition," arXiv preprint arXiv:1501.02876, 2015.
[53] D. Han, Q. Liu, and W. Fan, "A new image classification method using CNN transfer learning and web data augmentation," Expert Systems with Applications, vol. 95, pp. 43-56, 2018.
[54] Q. Xu, Y.-Z. Liang, and Y.-P. Du, “Monte carlo cross-validation for se-
lecting a model and estimating the prediction error in multivariate cali-
bration,” Journal of Chemometrics, vol. 18, pp. 112 – 120, February 2004.
[58] A. Rohilla, R. Hooda, and A. Mittal, "TB detection in chest radiograph using deep learning architecture," International Journal of Advance Research, 2016.