
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2018

Object Detection Using Convolutional Neural Network Trained on Synthetic Images

Margareta Vi

LiTH-ISY-EX--18/5180--SE

Supervisor: Mikael Persson, ISY, Linköpings universitet
            Alexander Poole, Company

Examiner: Michael Felsberg, ISY, Linköpings universitet

Computer Vision Laboratory


Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2018 Margareta Vi


Abstract
Training data is the bottleneck for training Convolutional Neural Networks. A
larger dataset gives better accuracy but also requires longer training time. It is
shown that finetuning neural networks on synthetically rendered images increases
the mean average precision. This method was applied to two different datasets
with five distinctive objects in each. The first dataset consisted of random objects
with different geometric shapes. The second dataset contained objects used to
assemble IKEA furniture. The neural network with the best performance, trained
on 5400 images, achieved a mean average precision of 0.81 on a test set sampled
from a video sequence. The impact of dataset size, batch size, number of training
epochs, and different network architectures was analyzed. Using synthetic images
to train CNNs is a promising path for object detection where access to large
amounts of annotated image data is hard to come by.

Acknowledgments
I would like to thank my supervisor at my company, Alexander Poole, for always
being helpful and coming up with interesting ideas. I would also like to thank my
supervisor at the university, Mikael Persson, for helping me with the report, and
my examiner Michael Felsberg.
Additionally, I would like to give my thanks to IKEA for providing the CAD
models. Lastly, I would like to thank my family and boyfriend for supporting me
through all the hard times.

Linköping, November 2018


Margareta Vi

Contents

Notation ix

1 Introduction 1
1.1 Neural network/convolutional neural network in brief . . . . . . . 2
1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Related work 5
2.1 Using synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Finetuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Object classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Summary: Related Work . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Method and Experiments 11


3.1 Generating the Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Rendering Images . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Video Recording . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.3 Creation of Ground Truth Data . . . . . . . . . . . . . . . . 12
3.2 Dataset Distribution and Network Pairings . . . . . . . . . . . . . . 12
3.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.1 Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.2 PASCAL Mean Average Precision . . . . . . . . . . . . . . . 14
3.4 Parameters to tune . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Results 17
4.1 Testing Different Network Configuration . . . . . . . . . . . . . . . 17
4.1.1 Faster R-CNN and Inception . . . . . . . . . . . . . . . . . . 17
4.1.2 SSD and Inception . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.3 SSD and MobileNet . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.4 Summary: Single-Shot Multibox Detector . . . . . . . . . . 41


4.1.5 Summary: Different Network Architecture And Batch Sizes 43


4.2 Epochs Versus Dataset Size . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Testing on real images . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Automatic and Manual Annotations . . . . . . . . . . . . . . . . . . 50

5 Discussion 57
5.1 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.1 Single-Shot Multibox Detector . . . . . . . . . . . . . . . . . 57
5.1.2 Faster R-CNN and Inception . . . . . . . . . . . . . . . . . . 57
5.2 Epochs versus Batches . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Testing On Real Images, Video Sequence . . . . . . . . . . . . . . . 58
5.4 Annotation: Manual vs Automatic . . . . . . . . . . . . . . . . . . . 58

6 Conclusions And Future Work 61


6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

A Datasets 65

Bibliography 67
Notation

Abbreviations

Abbreviation   Description
CAD            Computer Aided Design
ILSVRC         ImageNet Large Scale Visual Recognition Challenge
CNN            Convolutional Neural Network
SVM            Support Vector Machine
mAP            Mean Average Precision
R-CNN          Regional Convolutional Neural Network
SSD            Single Shot Multibox Detector
IoU            Intersection over Union

1 Introduction
Almost everyone has at least once assembled furniture from IKEA. The furniture is
relatively cheap and comes in flat packages; the key point is that you need to build
it yourself with the help of a booklet. Assembly starts by laying out all the pieces in
front of you. The large pieces are easy to recognize, but the screws and plugs
might cause problems. These items are small and look alike, which makes them
harder to distinguish from each other. Therefore, when building IKEA furniture,
one could say the hardest part is finding the correct piece to use.

We humans solve this problem by first doing a coarse filtering, localizing
objects with the same form as the one we seek. The next step is a fine search
through the remaining items, with a specific image of the component in mind.
The problem boils down to having a scene full of items in which we want to localize
a specific object, i.e. object detection. Object detection is finding where an object is
and what type of object it is.

Manual feature matching is costly, so it is desirable to computerize the task of
finding such features. Two common techniques within object detection are
handcrafted features combined with machine learning approaches such as the
Support Vector Machine, and artificial neural networks. Handcrafted features are
image properties derived using different algorithms. These features include,
among others, SIFT (Scale Invariant Feature Transform), SURF (Speeded Up
Robust Features), and BRIEF (Binary Robust Independent Elementary Features).
An artificial neural network, on the other hand, is a complicated model inspired
by how the human brain is structured. The difference between handcrafted
features and artificial neural networks is that the neural network tries to learn
the patterns itself.


Neural networks have great potential to solve problems which involve detec-
tion of patterns or trends. Scientists have created neural networks which can
solve tasks such as digit or word recognition, image classifications, face recogni-
tion, and object detection to name a few. Examples of neural networks which
solve such tasks are Watson and AlphaGo. Watson played and won against Jeop-
ardy champions [12] and AlphaGo was the first computer program which won
against a Go world champion player [11].

1.1 Neural network/convolutional neural network in brief

In a human brain, neurons are connected to each other via synapses, while in an
artificial neural network, neurons are functions and synapses are weights. The
model is shown in Figure 1.0.

Figure 1.0: A biological neuron and its mathematical representation. Image acquired from [22].

From here on, artificial neural networks will be referred to as neural networks
(NNs). A neural network consists of several different layers: the input layer, the
hidden layers, and the output layer. The input layer contains the images, and the
output layer is the result of the task the NN is trying to solve, i.e. object
detection. The hidden part consists of many different layers. In each layer,
different mathematical operations occur, such as pooling, normalization, and
convolution.

Figure 1.1: Neural network with input layer, hidden layers, and output layer

A specific type of neural network which focuses on object detection is the
convolutional neural network (CNN). A CNN is a neural network which uses the
convolution operation in at least one of its layers [15].
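To make the convolution operation concrete, the following is a minimal NumPy sketch (not part of the thesis code) of the sliding-window operation a convolutional layer applies; the edge-detecting kernel is just an illustrative choice.

import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as used in CNN layers)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Multiply the kernel with the image patch element-wise and sum.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Example: a Sobel-like kernel that responds to vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])
print(conv2d(np.random.rand(8, 8), kernel).shape)  # (6, 6)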

One of the big disadvantages of NNs is that they need a large amount of train-
ing data to have adequate performance. Therefore, getting access to data is the
bottleneck for neural networks. On the internet, many 3D models of different
objects are available for free in various formats such as Computer Aided Design
(CAD). From a CAD model, it is possible to generate thousands of different
synthetic images by varying the background and adding texture to the objects.

It is possible to decrease the amount of training data needed by using a method


called finetuning. This method is described further in chapter 2.

Many different types of neural networks exist, where the difference lies in
the combination of hidden layers. In this thesis, the networks used are Faster R-
CNN, Inception, Single-Shot Multibox Detector, and MobileNet, all described in
chapter 2.

1.2 Problem Formulation


This thesis will investigate if neural networks can be fine-tuned with synthetic
images for the task of object detection on a video sequence. To optimize the
development, the network will first be tested on images before being tested on a
video sequence.

1.2.1 Limitation
To reduce training time, fine-tuning will be used. There will also be limitations
on what types of objects the network will be able to detect.

There will be two different datasets: dataset A and dataset B. Dataset A consists
of objects such as screws and plugs provided by IKEA. Background and tex-
ture combinations in dataset A were realistic since the purpose was to test if
the network could differentiate objects in the real world. Dataset B is a video
sequence taken of the real objects in dataset A. Dataset A and B are shown in
Appendix A.

A computer with an Intel Core i7-7700 CPU and an NVIDIA GTX 1080 Ti GPU was
used. A HoverCam web camera was used to capture the video sequence. No new
neural network architecture will be created. Instead, the Tensorflow Object
Detection API (version 1.7) [18] will be used, together with OpenCV (version
3.4.1) [8], Python (3.5), and Blender [7].

1.3 Thesis Outline


In chapter 2 the related work is presented. The method used is described in
chapter 3. The experiment is presented in section 3.5. The results are shown in
chapter 4 and discussed in chapter 5. The conclusions and future work of this
thesis are presented in chapter 6.
2 Related work
Four topics are addressed in this chapter: neural networks trained on synthetic
data, the concept of finetuning, object classification, and object detection using
convolutional neural networks. The chapter ends with a summary, describing a
solution to these problems.

2.1 Using synthetic data


The time consuming parts of NNs are the training time and the gathering of train-
ing data. For the detection and classification of the object, the training data con-
sist of two parts: the images and the corresponding annotations for each image.
For this thesis, annotation means the creation of ground truth data, i.e. the bound-
ing box for each object in the images. To get access to lots of training data, one can
use synthetic data since it is possible to generate them automatically. Synthetic
images, in this thesis, are images generated by sampling CAD models unless oth-
erwise stated. By generating data automatically the ground truth is always acces-
sible.

Annotating training data is a problem for scientists since it takes a long time
and good accuracy is needed. Richter et al. were creative with annotating their
data. Using the video game engine from Grand Theft Auto they could get access
to both scenes with realistic appearances and labels at pixel level [32].
By using these realistic images, they showed that the work needed for annota-
tion could be notably reduced. By combining the semantic segmentation dataset
with real-world images, the accuracy increased even more.

Successful attempts have been made to train NNs using synthetic data to solve
classification problems. In this case, success means having the best result on a
specific type of benchmark. The neural network created by Jaderberg et al. was
trained for scene text classification [20], to classify whole words. The training
images were computer generated with different fonts, shadows, and color. Distor-
tion and noise were added to the rendered images to simulate the real world. It
outperformed previous state-of-the-art methods for scene word classifications in
the benchmarks ICDAR 2003, Street View Text, and IIIT5k-dataset. ICDAR 2003
is a competition in robust reading [25]. The amount of training data used was
between 4 million and 9 million images, depending on the benchmark used.

Jaderberg et al. also created another neural network for text spotting, mean-
ing detection and recognition of words. They created an end-to-end system for
text spotting [21]. For the word detection part, they used a region proposal based
mechanism and a CNN for the word recognition task. Their dataset was created
in the same way as in the work [20]. The dataset contained 9 million images,
32x100 pixels. They used 900000 for testing, the same amount for validation
and the rest for training. For the task of text recognition, their method had the
best accuracy compared to the previous state of the art methods. Jaderberg et
al. had good performance in the text spotting task, outperforming the previous
state-of-the-art method [21].

Georgakis et al. trained their network with a combination of both real images
and synthetic images [13]. The synthetic images were real images augmented.
Objects with different scales and positions had been superimposed onto these im-
ages. The task for the network was to do object detection in a cluttered indoor
environment.

Another work which trains a convolutional neural network with synthetic im-
ages is [30].
The network’s task was to predict a bounding box and the object class category
for each object of interest on RGB images captured inside a refrigerator. Training
the neural network with 4000 synthetic images, the network scored a mean Av-
erage Precision (mAP) of 24% on a test set. By adding 400 real images the mAP
increased by 12%. In this paper, they used IoU (Intersection over Union) for
evaluating the bounding box predictions, see section 3.3.1 for a description of
IoU.

2.2 Finetuning
The concept of finetuning refers to reusing trained weights. These weights come
from another neural network that has been created for another task, and they are
used to initialize the training [28]. For example, a neural network trained to
classify cats can be fine-tuned to classify dogs. This method has resulted in
state-of-the-art performance for several tasks. Examples of such tasks are object
detection [33], [26], [27], tracking [36], segmentation [4], and human pose
estimation [9]. With finetuning, the training time can also be reduced [6].
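As an illustration only (the thesis itself uses pre-trained models from the Tensorflow Object Detection API rather than the code below), a minimal classification-style sketch of fine-tuning with tf.keras could look as follows; the choice of MobileNetV2, the input size, and the five-class head are assumptions made for the example.

import tensorflow as tf

# Load a backbone pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(weights="imagenet",
                                         include_top=False,
                                         input_shape=(300, 300, 3))
base.trainable = False  # keep the pre-trained weights frozen initially

# Attach a new head for five object classes and train only that head.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=24, epochs=...)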

2.3 Object classification


Object classification is identification of the object class in an image. Automatic
classification, where no human is involved in the classification step, can be done
using machine learning. Support Vector Machines (SVM) are methods used for
classification. The Support Vector Machine was first invented for the binary
classification problem [10]. An SVM tries to find a function which can separate the
input data into categories, by mapping the input data non-linearly to a high di-
mensional vector space. In, for example, [14], [17], and [29], SVMs were used for
the task of classifying land cover images.

More recent progress in object classification has been achieved by neural net-
works. Two state-of-the-art object classification networks are ResNet [5] and In-
ception net [34].

ResNet is a deep residual network, hence the name ResNet, and consists of
152 layers. Due to its large depth, it managed to achieve a 3.6% error rate (top-5
error) in the 2015 edition of the ImageNet Large Scale Visual Recognition Competi-
tion (ILSVRC) and thus won the classification task in the 2015 edition of ILSVRC
[23]. A human has an error rate between 5 − 10%, meaning ResNet outperforms
humans on this task [5].

The other neural network, Inception net, is a network that consists of incep-
tion modules. An inception module is a block of multiple parallel convolutional
and max-pooling layers with different kernel sizes. The inception module makes
the Inception net different from the traditional networks, which stack up convo-
lutional and max-pooling layers [34]. It won the classification and detection task
of ILSVRC in 2014 [23].
Neural networks are computationally heavy, requiring capable hardware to
do the calculations. However, MobileNet is a neural network developed
specifically for mobile vision applications. Instead of both filtering and
combining the output signal in one go, MobileNet divides this step into two
layers, one for filtering and one for combining. This two-layer separation greatly
reduces the computation and model size [16].

2.4 Object detection


Object detection includes object classification, since object detection is about find-
ing the object’s location and its category. The object’s location is mostly repre-
sented as a bounding box, shown in Figure 2.1 and Figure 2.2.

Figure 2.1: Input image Figure 2.2: Result

Two of the recent state-of-the-art methods for object detection are Faster R-
CNN and Single-Shot Multibox Detector (SSD).

A Faster Region-based Convolutional Network (Faster R-CNN) consists of two
modules. One module is a deep, fully convolutional network, a Region Proposal
Network (RPN). An RPN takes an image as input and outputs a set of rectangular
regions. Each rectangle has a score indicating whether the region is an object or
background. The second module is a Fast R-CNN detector, which applies object
detection on the regions proposed from the RPN. Faster R-CNN achieved a state-
of-the-art accuracy on the dataset PASCAL VOC 2007 [31].

Single-Shot Multibox Detector is a feed-forward CNN. It produces a collec-


tion of bounding boxes with fixed size, and the probability for the presence of
the object class in each box. To get the final detection, it has a non-maximum
suppression step. SSD achieved an increase in accuracy and speed, compared to
Faster R-CNN, when tested on the PASCAL VOC 2007 dataset [24].

The networks in section 2.4 used PASCAL VOC 2007 as a benchmark. Their
datasets were divided as follows: 50% for training/validation and 50% for testing,
i.e. images the network had not seen before, with a total of 9963 images [3].
Since they had roughly double the amount of images, 9963 versus 4830, this
thesis used 10% of the dataset for testing and the rest for training to compensate.
The remaining 90% was divided between training and validation, 70% for
training and 30% for validation.

2.5 Summary: Related Work


Getting access to a large dataset is a limiting factor for neural networks; it takes
time and it is costly. This thesis will investigate how well NNs perform after
finetuning with synthetic images.
Hyper-parameters connected to the images are the batch size (and the image
size) and the number of batches/epochs to run the training. The effect of these
parameters was the focus of this thesis, and therefore pre-trained networks were
used. The Support Vector Machine is an older technique and newer methods with
better accuracy have surfaced; thus, only deep learning will be used. The main
interest was to compare the Single-Shot Multibox Detector against the Faster
R-CNN. Combining the object detection networks with the object classification
networks is also an interesting aspect, since all networks are state-of-the-art
methods. Section 3.2 states all combinations this thesis will use. The residual
network was not used due to computer limitations.
Inspired by [32], a comparison of the time needed to do manual and automatic
annotation on a dataset was done. It is also investigated how the network
performs on automatically annotated datasets versus manually annotated ones.
The procedure is described in subsection 3.1.3.
3 Method and Experiments
First presented in this chapter is the rendering of synthetic data, followed by
combinations of neural networks and the evaluation method. In Figure 3.1, a flow
chart of the workflow is shown.

Figure 3.1: Flow chart of the workflow. CAD models, background images, and object textures are rendered into synthetic images; together with web camera recordings, these form the training, validation, and test data fed to the neural network, which produces the output.

3.1 Generating the Datasets


This section describes the creation of the two datasets, A and B.
Dataset A consists of synthetic images of five different objects: attachment, shelf
plug, dowel, expandable plug, and screw. This dataset was used to train the neural
networks. Dataset B consists of images sampled from a video sequence containing
the physical objects and was used to evaluate the networks. Examples from the
two datasets are shown in Appendix A.

3.1.1 Rendering Images


All computer generated data were created from CAD models. The models were
either provided by the furniture company IKEA or found on the website GrabCad
[2]. The images were rendered by the open source 3D creation suite program
Blender [7]. To generate a large variety of data, different backgrounds, object
textures, object rotations, and camera locations were used. The background
images were taken from the website Pexels [1].
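A minimal sketch of how such automated rendering can be scripted in Blender's Python API (bpy) is shown below; the object name "part", the camera ranges, and the output path are assumptions for illustration, not the exact script used in this thesis.

import math
import random
import bpy

obj = bpy.data.objects["part"]        # the imported CAD model (assumed name)
camera = bpy.context.scene.camera

for i in range(10):
    # Randomize the object's orientation and the camera position.
    obj.rotation_euler = [random.uniform(0.0, 2.0 * math.pi) for _ in range(3)]
    camera.location = (random.uniform(-1.0, 1.0),
                       random.uniform(-1.0, 1.0),
                       random.uniform(2.0, 4.0))
    # Render the current scene to an image file.
    bpy.context.scene.render.filepath = "/tmp/render_%04d.png" % i
    bpy.ops.render.render(write_still=True)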

3.1.2 Video Recording


To create dataset B, physical versions of the objects in dataset A were acquired.
While recording with a web camera, the objects were introduced into the scene one by one.

3.1.3 Creation of Ground Truth Data


Ground truth data was created using either the open-source program LabelImg
[35] or Blender. LabelImg allows the user to draw a bounding box around each
object and save the data as an .xml (Extensible Markup Language) file, which can
then be converted to other formats. The same information was created when
rendering the synthetic images with Blender. In this thesis, annotation means
creating ground truth data.
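As an illustrative sketch (assuming the PASCAL VOC-style layout that LabelImg writes, with a hypothetical file name), the bounding boxes in such an .xml file can be read back for conversion like this:

import xml.etree.ElementTree as ET

def read_boxes(xml_path):
    """Return (class name, (xmin, ymin, xmax, ymax)) pairs from a VOC-style file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bndbox = obj.find("bndbox")
        box = tuple(int(float(bndbox.find(tag).text))
                    for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, box))
    return boxes

# Example: read_boxes("frame_0001.xml") -> [("screw", (12, 40, 96, 180)), ...]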

3.2 Dataset Distribution and Network Pairings


Different network architectures were compared and evaluated against each other.
The method is described in section 3.3.
The network pairings used are stated below:
• Faster R-CNN + Inception
• SSD + Inception
• SSD + MobileNet
The networks were chosen based on the literature study in chapter 2 and the
availability of pre-trained models. A pre-trained model means the network has
already been trained on another dataset. Since no pre-trained model exists for
Faster R-CNN + MobileNet, this pairing was not used.

3.3 Evaluation Metrics


To evaluate the networks, two different losses were used: classification loss and
localization loss. They were calculated for the three different stages: training,
validation, and testing. These losses were provided by the Tensorflow Object
Detection API. PASCAL mAP was also used on the validation and testing dataset.

3.3.1 Losses
Due to the use of fine-tuning, this thesis used the same losses as the networks
stated in section 3.2. The loss for categorizing a detected object into categories,
object vs background, is the binary classification loss and is described by a
sigmoid function, shown in (3.1). The localization loss is the loss of the bounding
box regression and is represented by a smooth L1 loss (the Huber loss), see (3.2).

L_c(x) = \frac{1}{1 + e^{-x}}    (3.1)

L_R(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}    (3.2)

The lower the losses, the better the network performs.
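As a small, illustrative Python sketch (not taken from the thesis code), the two functions in (3.1) and (3.2) can be evaluated element-wise as follows:

import numpy as np

def classification_loss(x):
    """Sigmoid function used for the binary classification loss, Eq. (3.1)."""
    return 1.0 / (1.0 + np.exp(-x))

def smooth_l1(x):
    """Smooth L1 (Huber-style) localization loss, Eq. (3.2)."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

print(smooth_l1(np.array([-2.0, 0.3, 1.5])))  # [1.5, 0.045, 1.0]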

Intersection Over Union


Intersection over Union (IoU) is a measurement for the overlap of two bounding
boxes, A and B. In this case it is the overlap between the ground truth and the
network’s output. The IoU is the quotient of the intersection and the area of
union [19]. In both [30] and [21], IoU was used as an evaluation metric. Due to
the simplicity of interpreting IoU, this metric will be used for evaluation within
this thesis.

Figure 3.2: Area of intersection

Figure 3.3: Area of union
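As an illustrative sketch (not the Tensorflow Object Detection API implementation), the IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax) can be computed as:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14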

3.3.2 PASCAL Mean Average Precision


To describe PASCAL mAP we need five terms:
• True Positive (TP)
• False Positive (FP)
• False Negative (FN)

• Precision
• Recall
The true positives are the correct detections. False negatives are missed
detections. False positives occur when there are multiple detections of the same
object; all detections other than the first correct one are counted as false.
Recall is defined as the proportion of all positive detections with an IoU equal to
or greater than a certain value, in this case 0.5 [3].
Precision is the proportion of all detections that are true positives [3].
PASCAL mAP is defined as the mean precision at a set of eleven equally
spaced recall levels [0, 0.1, ..., 1] [3], see (3.3).

AP = \frac{1}{11} \sum_{\text{Recall}_i} \text{Precision}(\text{Recall}_i)    (3.3)

The higher the mAP value is, the better the network performs.
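A minimal sketch of the eleven-point computation in (3.3), using the interpolated precision (the maximum precision at recall levels at or above each point, as in the VOC protocol); the recall/precision values below are made up purely for illustration.

import numpy as np

def pascal_11_point_ap(recalls, precisions):
    """Eleven-point interpolated average precision, Eq. (3.3)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        # Interpolated precision: the best precision achieved at recall >= r.
        p = np.max(precisions[mask]) if mask.any() else 0.0
        ap += p / 11.0
    return ap

recalls = np.array([0.1, 0.2, 0.4, 0.4, 0.5])
precisions = np.array([1.0, 1.0, 0.75, 0.6, 0.6])
print(round(pascal_11_point_ap(recalls, precisions), 3))  # ≈ 0.464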

3.4 Parameters to tune


When training a neural network several parameters can be tuned to give better
performance. The ones evaluated in this thesis are:
• Batch size: number of images in one batch.
• Number of epochs: number of times all of the training data has gone through
the network.
• Total numbers of images used in training
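These quantities are linked: one epoch corresponds to (number of training images) / (batch size) batches, so the number of epochs equals (number of batches × batch size) / (number of training images). As a made-up example (not the exact counts from the experiments), with 2400 training images and a batch size of 24, 100 batches make up one epoch.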

3.5 Experiments
As stated in chapter 3, the parameters tuned were batch size, number of epochs
and the total number of images used in training. Three main experiments were
executed: experiment 1, experiment 2, and experiment 3. 100000 batches were
used for all runs: training, validation, and testing.

Experiment 1 only used synthetic data and consists of the following sub-experiments:

1. Testing different network configurations


2. Batch size vs epochs
3. Largest image size manageable

Sub-experiment 1 was carried out together with sub-experiment 2. Table 3.1
specifies how the batch size and image size were varied in sub-experiment 2.

In sub-experiment 3, the batch size needs to be small due to the large images.
An image size of 600x1040 with a batch size of 1 was used because of hardware
limitations. This experiment was done with the network architecture that had the
best performance when testing on dataset B, which would later be shown to be
Faster R-CNN + Inception net. Sub-experiments 2 and 3 used the network that
had the best performance in experiment 1.

Table 3.1: Experiment 1: testing different batch sizes and epochs

Test          Baseline    #1         #2         #3
Batch Size    1           24         35         1
Image Size    300x300     300x300    240x240    600x1040
Epochs        100000      4166       2857       100000

In experiment 2 five different networks were tested, which are listed below:

• Faster R-CNN + Inception: 10 percent


• Faster R-CNN + Inception: 50 percent
• Faster R-CNN + Inception: 100 percent
• SSD + Inception
• SSD + MobileNet.

These networks were trained on dataset A and then validated on dataset B.
Table 3.2 shows how many images were used to train the different Faster R-CNN
+ Inception networks. The reason there are three different versions of Faster
R-CNN + Inception is that it had the best mAP when testing on dataset B, see
Figure 4.44.

Table 3.2: Experiment 2: testing different dataset sizes. The percentage is in terms of the total number of images in the dataset.

Test                #1           #2            #3
Number of images    540 (10%)    2686 (50%)    5392 (100%)
Batch size          24           24            24

For experiment 2, the only interesting evaluation metric is the mean average
precision, since no parameters are tuned; thus only the mAP will be plotted.
Experiment 3 was to compare automatic annotation with manual annotation. In
this experiment, SSD + MobileNet was used due to its short training time, shown
in Figure 4.40. The validation was done by comparing the IoU between the
automatically generated ground truth and the manually created one, and by
comparing the classification and localization losses.
4 Results

In this chapter, the results of the different network configurations stated in
section 3.5 are presented. The chapter also includes the evaluation of manual
versus automatic annotation.
In all figures where the mean average precision is plotted for the whole dataset,
only results from the validation and the testing are shown. This is because the
Tensorflow API only calculates the mAP for the validation and testing sets.

4.1 Testing Different Network Configuration


In this section, the results of the different network architectures are presented.
The classification and localization losses, and mean average precision are plotted
for each subset: training, validation, and testing.

4.1.1 Faster R-CNN and Inception


Here the results of the different configurations with Faster R-CNN + Inception
net are presented. The dataset used is dataset B.

Baseline
Time needed for training: 3.3 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.1 - Figure 4.3.

Figure 4.1: Classification loss, Faster R-CNN + Inception, batch size 1

Figure 4.2: Localization loss, Faster R-CNN + Inception, batch size 1

Figure 4.3: Mean average precision, Faster R-CNN + Inception, batch size 1

Configuration 1, #1
Time needed for training: 22 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.4 - Figure 4.6.

Figure 4.4: Classification loss, Faster R-CNN + Inception, batch size 24

Figure 4.5: Localization loss, Faster R-CNN + Inception, batch size 24

Figure 4.6: Mean average precision, Faster R-CNN + Inception, batch size 24

Configuration 1, #2
Time needed for training: 31.5 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.7 - Figure 4.9.

Figure 4.7: Classification loss, Faster R-CNN + Inception, batch size 35

Figure 4.8: Localization loss, Faster R-CNN + Inception, batch size 35

Figure 4.9: Mean average precision, Faster R-CNN + Inception, batch size 35

Batch of size 1, Image size 600x1040


Time needed for training: 5.3 hours
Image size: 600x1040
Batch size: 1
Results are shown in Figure 4.10 - Figure 4.12.

Figure 4.10: Classification loss, Faster R-CNN + Inception, image size 600x1040, batch size 1

Figure 4.11: Localization loss, Faster R-CNN + Inception, image size 600x1040, batch size 1

Figure 4.12: Mean average precision, Faster R-CNN + Inception, image size 600x1040, batch size 1

Summary: Faster R-CNN and Inception

The results on the testing dataset with all three different batch sizes, image size
300x300, are plotted together in Figure 4.13 to Figure 4.15.

Figure 4.13: Classification loss, Faster R-CNN + Inception

Figure 4.14: Localization loss, Faster R-CNN + Inception

Figure 4.15: Mean average precision, Faster R-CNN + Inception

4.1.2 SSD and Inception


In the following sections, results from different SSD + Inception runs are pre-
sented.

Baseline
Time needed for training: 2.5 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.16 - Figure 4.18.

Figure 4.16: Classification loss, SSD + Inception, batch size 1

Figure 4.17: Localization loss, SSD + Inception, batch size 1

Figure 4.18: Mean average precision, SSD + Inception, batch size 1

Figure 4.17 has some missing values for the validation run; those values were
NaN and therefore not plotted.

Configuration 1, #1
Time needed for training: 14 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.19 - Figure 4.21.

Figure 4.19: Classification loss, SSD + Inception, batch size 24

Figure 4.20: Localization loss, SSD + Inception, batch size 24

Figure 4.21: Mean average precision, SSD + Inception, batch size 24

Configuration 1, #2
Time needed for training: 8.4 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.22 - Figure 4.24

Figure 4.22: Classification loss, SSD + Inception, batch size 35

Figure 4.23: Localization loss, SSD + Inception, batch size 35

Figure 4.24: Mean average precision, SSD + Inception, batch size 35

4.1.3 SSD and MobileNet


In this section, the results when using SSD together with MobileNet are presented.

Baseline
Time needed for training: 1.7 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.25 - Figure 4.27

Figure 4.25: Classification loss, SSD + MobileNet, batch size 1

Figure 4.26: Localization loss, SSD + MobileNet, batch size 1

Figure 4.27: Mean average precision, SSD + MobileNet, batch size 1

Configuration 1, #1
Time needed for training: 14 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.28 - Figure 4.30.

Figure 4.28: Classification loss, SSD + MobileNet, batch size 24

Figure 4.29: Localization loss, SSD + MobileNet, batch size 24

Figure 4.30: Mean average precision, SSD + MobileNet, batch size 24

Configuration 1, #2
Time needed for training: 19.5 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.31 - Figure 4.33.

Figure 4.31: Classification loss, SSD + MobileNet, batch size 35

Figure 4.32: Localization loss, SSD + MobileNet, batch size 35

Figure 4.33: Mean average precision, SSD + MobileNet, batch size 35

4.1.4 Summary: Single-Shot Multibox Detector


In this section, all results for the Single-Shot Multibox Detector are combined and
plotted in Figure 4.34 to Figure 4.36. The networks with the best performance are
the SSD variants with a batch size of 24, irrespective of whether MobileNet or
Inception net is used. SSD + Inception with a batch size of 35 performed almost
as well as batch size 24 in the loss categories, and its mean average precision
converged after the same number of epochs as batch size 24. Using a batch size of
1 gave a high loss and a low mAP.

Figure 4.34: Classification loss, SSD

Figure 4.35: Localization loss, SSD

Figure 4.36: Mean average precision, SSD

4.1.5 Summary: Different Network Architecture And Batch Sizes


In Figure 4.37 and Figure 4.38, the classification loss and the localization loss are
plotted for all the different network architectures, and the mean average precision
is shown in Figure 4.39. The results are from the testing dataset. Training time is
summarized in Figure 4.40.

Figure 4.37: Classification loss of testing dataset

Figure 4.38: Localization loss of testing dataset

Figure 4.39: Mean average precision of testing dataset

Figure 4.40: Time to train (hours) for each network configuration

4.2 Epochs Versus Dataset Size


In this section, the results of varying the training dataset size are shown in
Figure 4.41 to Figure 4.43. It is shown in subsection 4.1.5 that Faster R-CNN
+ Inception had the best performance in all three categories. Therefore, this
network configuration with a batch size of 24 was chosen. The training ran for 400
epochs and the dataset used was the testing data from dataset B.

Figure 4.41: Classification loss, varying dataset size

Figure 4.42: Localization loss, varying dataset size

Figure 4.43: Mean average precision, varying dataset size

4.3 Testing on real images


The interesting metric to look at in this case is the mean average precision, which
is shown in Figure 4.44.

Figure 4.44: Mean average precision on real data for Faster R-CNN trained on 10%, 50%, and 100% of the dataset, SSD + Inception, and SSD + MobileNet

4.4 Automatic and Manual Annotations


The time needed to generate the dataset via manual annotation and automatic
annotation is shown in Table 4.1.

Table 4.1: Time needed to create the dataset via manual and automatic annotation

                    Manual    Automatic
Number of images    3561      7806
Time                6 h       38 min

Figure 4.45 shows the histogram over the IoU for the ground truth data between
the manual and automatic method.

Figure 4.45: Intersection over union, manual vs automatic annotation ground truth data

A comparison of manually and automatically annotated data with SSD +
MobileNet was done, and the result is shown in Figure 4.46 to Figure 4.51. A
batch size of 24 with 3561 images was used.
Figure 4.46: Classification loss comparison, training dataset

Figure 4.47: Classification loss comparison, validation dataset

Figure 4.48: Classification loss comparison, testing dataset

Figure 4.49: Localization loss comparison, training dataset

Figure 4.50: Localization loss comparison, validation dataset

Figure 4.51: Localization loss comparison, testing dataset
5 Discussion

In this chapter, the results from chapter 4 are analyzed.

5.1 Networks
In the following sections, the different results from each network configuration
are discussed. First the Single-Shot multibox Detector, followed by Faster R-
CNN.

5.1.1 Single-Shot Multibox Detector


It is shown in Figure 4.34 and Figure 4.35 that increasing the batch size to a
moderate size for SSD networks gives better results. However, a very large batch
size yields a poorer outcome. The batch size of 35 had worse results for both SSD
+ Inception and SSD + MobileNet in the loss category. For the mean average
precision, see Figure 4.36, a batch size of 35 gave the same result as a batch size
of 24 for the SSD + Inception network; it converged to 1. The training loss for all
network architectures and batch sizes was unstable in each run, which could be a
sign of overfitting. A high learning rate or the regularization could also cause this
pattern.

5.1.2 Faster R-CNN and Inception


A batch size of 1 had an mAP converging to roughly 0.58, a batch size of 35
converged to 0.6, and a batch size of 24 converged to approximately 0.95. One
reason why a batch size of 1 performed worst could be that the network had not
yet learned enough features. The network suffered from overfitting when using
a batch size of 35. It is also shown in Figure 4.39 that the outcome of using an
image size of 600x1040 with batch size 1 is as good as using a batch size of 24
with an image size of 300x300.

5.2 Epochs versus Batches


It is shown in section 4.2 that using the smallest dataset size, 10% of the total
amount of images, gave the worst result in the classification/localization loss.
In addition, it required longer training time for the mean average precision to
converge. Using 50% or 100% of the dataset gave approximately the same results
when comparing the classification/localization loss. The mean average precision
converged the fastest when using 50% of the dataset.

5.3 Testing On Real Images, Video Sequence


The mAP decreased for all the networks when comparing the metrics between
synthetic data and real images. The networks' mAP converges to 1 on synthetic
data (see Figure 4.39), while Figure 4.44 shows that when testing on real images
after 100000 batches the best result obtained is 0.81. The decrease in performance
might be explained by the scale of the objects in each image. An example of a
frame can be seen in Figure 5.1, where the distance to the camera is larger
compared to the training images shown in Appendix A.

Figure 5.1: Frame from video sequence

Another reason for worse performance could be the sharpness of the images.
Even though some of the training images had blur added to them as a pre-processing
step, the network had trouble with the object being out of focus.
It is shown in Figure 4.44 that all the networks performed approximately the
same except for SSD + MobileNet.

5.4 Annotation: Manual vs Automatic


As seen in Table 4.1, the time necessary to manually annotate the images was
much longer. The manual annotation covered a total of 3561 images due to a time
limitation; annotating those 3561 images took 6 hours.

It is shown in Figure 4.46 - Figure 4.51 that the two losses are higher in every
case for the manually annotated data, implying that the automatically generated
ground truth yields higher accuracy. The reason for this, human error, can be
seen in Figure 4.45.
6 Conclusions And Future Work
In this chapter, the conclusions and future works are presented.

6.1 Conclusions
There are several conclusions that can be drawn from this thesis. A neural
network intended for object detection can be fine-tuned using synthetic data to
detect other objects. Out of the three different network architectures used, the
Faster R-CNN + Inception network had the best accuracy, while also taking the
longest time to train.

The results further show that longer training time does not necessarily give
the best result; what mattered was the size of the dataset and the batch size.
The larger the dataset, the higher the accuracy. Yet too large a batch size results
in overfitting.

A large dataset requires a lot of labeling. If automatically generated ground
truth data can both increase the accuracy and reduce the amount of manual labor,
then large datasets would no longer be a problem. Less manual labor also
decreases the chance of human errors.

6.2 Future Work


Easy access to ground truth data is achievable by generating synthetic data
automatically. Instead of saving the bounding box of an object, an object mask
could also be used. An object mask means that only the pixels of the object are
marked. The reason one would want to save an object mask instead of the
bounding box is that the bounding box contains background noise, while an
object mask would only contain the interesting pixels, i.e. the object.
It would be interesting to verify whether this improves the accuracy further.
Also, in this thesis the networks were finetuned. An interesting aspect would
be to train a neural network from scratch using only computer-generated images
in order to verify that synthetic data is suitable for learning from scratch, too.
Appendix A

Datasets

Two different datasets were used throughout this thesis; dataset B was only used
for testing purposes.

Dataset A
Dataset A contains five different objects and consists of 5392 images. The objects
can be seen in Figure A.6 to Figure A.10.
Dataset B
Dataset B is a video of the physical objects, recorded with a web camera.

Figure A.1: handle Figure A.2: car Figure A.3: eStop


Figure A.4: cabelProtetor    Figure A.5: Turn knob

Figure A.6: Attachment Figure A.7: Shelf plug

Figure A.8: Dowel Figure A.9: Expandable plug

Figure A.10: Screw


Bibliography

[1] Pexels. https://www.pexels.com/. Accessed: 2018-04-17. Cited on page 12.

[2] GrabCAD. https://grabcad.com. Accessed: 2018-02-28. Cited on page 12.
[3] The pascal visual object classes challenge 2007. http://host.robots.
ox.ac.uk/pascal/VOC/voc2007/, 2007. Accessed: 2018-04-18. Cited
on pages 8 and 14.
[4] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation. 2013. URL
https://arxiv.org/abs/1311.2524. Cited on page 6.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), page 770, 2016. ISBN 978-1-4673-8851-1. URL
https://arxiv.org/abs/1512.03385. Cited on page 7.
[6] Nicholas Becherer, John Pecarina, Scott Nykl, and Kenneth Hopkinson. Im-
proving optimization of convolutional neural networks through parameter
fine-tuning. Neural Computing and Applications, Nov 2017. ISSN 1433-
3058. doi: 10.1007/s00521-017-3285-0. URL https://doi.org/10.
1007/s00521-017-3285-0. Cited on page 7.
[7] Blender Online Community. Blender. https://www.blender.org/. Ac-
cessed: 2018-04-17. Cited on pages 4 and 12.
[8] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools,
2000. Cited on page 4.
[9] X. Chu, W. Ouyang, W. Yang, and X. Wang. Multi-task recur-
rent neural network for immediacy prediction. 2015 IEEE Interna-
tional Conference on Computer Vision (ICCV), Computer Vision (ICCV),


2015 IEEE International Conference on, Computer Vision, IEEE In-


ternational Conference on, page 3352, 2015. ISSN 978-1-4673-8391-
2. URL http://www.ee.cuhk.edu.hk/~wlouyang/Papers/Chu_
Multi-Task_Recurrent_Neural_ICCV_2015_paper.pdf. Cited on
page 6.
[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine
Learning, 20(3):273–297, Sep 1995. ISSN 1573-0565. doi: 10.1023/A:
1022627411411. URL https://doi.org/10.1023/A:1022627411411.
Cited on page 7.
[11] DeepMind. Alphago. https://deepmind.com/, 2010. Accessed: 2018-
04-24. Cited on page 2.
[12] D. A. Ferrucci. Introduction to "this is watson";. IBM Journal of Research and
Development, 56(3.4):1:1–1:15, May 2012. ISSN 0018-8646. URL https:
//ieeexplore.ieee.org/document/6177724. Cited on page 2.
[13] Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, and Jana
Kosecka. Synthesizing training data for object detection in indoor scenes.
CoRR, abs/1702.07836, 2017. URL http://arxiv.org/abs/1702.
07836. Cited on page 6.
[14] Anthony Gidudu, Greg Hulley, and Tshilidzi Marwala. Classification of im-
ages using support vector machines. CoRR, abs/0709.3967, 2007. URL
http://arxiv.org/abs/0709.3967. Cited on page 7.
[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT
Press, 2016. http://www.deeplearningbook.org. Cited on page 3.
[16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Wei-
jun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mo-
bilenets: Efficient convolutional neural networks for mobile vision appli-
cations. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/
1704.04861. Cited on page 7.
[17] C. Huang, J. R. G. Townshend, and L. S. Davis. An assessment of support vec-
tor machines for land cover classification. International Journal of Remote
Sensing, 23:725–749, February 2002. doi: 10.1080/01431160110040323.
URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=
10.1.1.134.4958&rep=rep1&type=pdf. Cited on page 7.
[18] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Ko-
rattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio
Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern con-
volutional object detectors. 2016. URL https://arxiv.org/abs/1611.
10012. Cited on page 4.
[19] Paul Jaccard. The distribution of the flora in the alpine zone.
The New Phytologist, (2):37, 1912. ISSN 0028646X. URL

https://nph.onlinelibrary.wiley.com/doi/pdf/10.1111/
j.1469-8137.1912.tb05611.x. Cited on page 13.
[20] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman.
Synthetic data and artificial neural networks for natural scene text recogni-
tion. CoRR, abs/1406.2227, 2014. URL http://arxiv.org/abs/1406.
2227. Cited on page 6.
[21] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman.
Reading text in the wild with convolutional neural networks. International
Journal of Computer Vision, 116(1):1 – 20, 2016. ISSN 09205691. URL
https://arxiv.org/abs/1412.1842. Cited on pages 6 and 13.
[22] Andrej Karpathy. Neural networks. http://cs231n.github.io/
neural-networks-1/, 2018. Accessed: 2018-04-17. Cited on page 2.
[23] Stanford Visual Lab. Imagenet large scale visual recognition challange.
http://www.image-net.org, 2010. Accessed: 2018-02-26. Cited on
page 7.
[24] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E.
Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox
detector. CoRR, abs/1512.02325, 2015. URL http://arxiv.org/abs/
1512.02325. Cited on page 8.
[25] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. Icdar
2003 robust reading competitions. In Seventh International Conference
on Document Analysis and Recognition, 2003. Proceedings., pages 682–
687, Aug 2003. URL https://ieeexplore.ieee.org/document/
1227617. Cited on page 6.
[26] W. Ouyang, X. Wang, X. Zeng, Shi Qiu, P. Luo, Y. Tian, H. Li, Shuo Yang,
Zhe Wang, Chen-Change Loy, and X. Tang. DeepID-Net: Deformable Deep
Convolutional Neural Networks for Object Detection. 2014. URL https:
//arxiv.org/abs/1409.3505. Cited on page 6.
[27] W. Ouyang, H. Li, X. Zeng, and X. Wang. Learning deep representation with
large-scale attributes. 2015 IEEE International Conference on Computer Vi-
sion (ICCV), Computer Vision (ICCV), 2015 IEEE International Conference
on, Computer Vision, IEEE International Conference on, page 1895, 2015.
ISSN 978-1-4673-8391-2. URL https://www.cv-foundation.org/
openaccess/content_iccv_2015/papers/Ouyang_Learning_
Deep_Representation_ICCV_2015_paper.pdf. Cited on page 6.
[28] W. Ouyang, X. Wang, C. Zhang, and X. Yang. Factors in Finetuning Deep
Model for object detection. ArXiv e-prints, January 2016. URL https:
//arxiv.org/abs/1601.05150. Cited on page 6.
[29] Mahesh Pal and Paul M. Mather. Support vector classifiers for land cover
classification. CoRR, abs/0802.2138, 2008. URL http://arxiv.org/
abs/0802.2138. Cited on page 7.

[30] Param S. Rajpura, Ravi S. Hegde, and Hristo Bojinov. Object detection using
deep cnns trained on synthetic images. CoRR, abs/1706.06782, 2017. URL
http://arxiv.org/abs/1706.06782. Cited on pages 6 and 13.
[31] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object
detection with region proposal networks. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 39(6):1137–1149, 2017. ISSN 01628828.
URL https://login.e.bibl.liu.se/login?url=https:
//search-ebscohost-com.e.bibl.liu.se/login.aspx?
direct=true&AuthType=ip,uid&db=edselc&AN=edselc.2-52.
0-85019258369&lang=sv&site=eds-live&scope=site. Cited on
page 8.
[32] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing
for data: Ground truth from computer games. CoRR, abs/1608.02192, 2016.
URL http://arxiv.org/abs/1608.02192. Cited on pages 5 and 9.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. ArXiv
e-prints, September 2014. URL https://arxiv.org/abs/1409.4842.
Cited on page 6.

[34] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Ra-
binovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
URL http://arxiv.org/abs/1409.4842. Cited on page 7.

[35] Tzutalin. Labelimg, git code. https://github.com/tzutalin/


labelImg, 2015. Cited on page 12.
[36] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully con-
volutional networks. 2015 IEEE International Conference on Computer
Vision (ICCV), Computer Vision (ICCV), 2015 IEEE International Confer-
ence on, Computer Vision, IEEE International Conference on, page 3119,
2015. ISSN 978-1-4673-8391-2. URL https://ieeexplore.ieee.org/
document/7410714. Cited on page 6.
