Object Detection Using Convolutional Neural Network Trained on Synthetic Images

Master of Science Thesis in Electrical Engineering

Margareta Vi

LiTH-ISY-EX--18/5180--SE
Acknowledgments
I would like to thank my supervisor at my company Alexander Poole, for always
being helpful and coming with interesting ideas. I would also like to thank my
supervisor at the university, Mikael Persson for helping me with the report and
my examiner Michael Felsberg.
Additionally, I would like to give my thanks to IKEA for providing the CAD
models. Lastly, I would like to thank my family and boyfriend for supporting me
through all the hard times.
Contents
Notation ix
1 Introduction 1
1.1 Neural network/convolutional neural network in brief . . . . . . . 2
1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related work 5
2.1 Using synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Finetuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Object classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Summary: Related Work . . . . . . . . . . . . . . . . . . . . . . . . 9
4 Results 17
4.1 Testing Different Network Configuration . . . . . . . . . . . . . . . 17
4.1.1 Faster R-CNN and Inception . . . . . . . . . . . . . . . . . . 17
4.1.2 SSD and Inception . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.3 SSD and MobileNet . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.4 Summary: Single-Shot Multibox Detector . . . . . . . . . . 41
5 Discussion 57
5.1 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.1 Single-Shot Multibox Detector . . . . . . . . . . . . . . . . . 57
5.1.2 Faster R-CNN and Inception . . . . . . . . . . . . . . . . . . 57
5.2 Epochs versus Batches . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Testing On Real Images, Video Sequence . . . . . . . . . . . . . . . 58
5.4 Annotation: Manual vs Automatic . . . . . . . . . . . . . . . . . . . 58
A Datasets 65
Bibliography 67
Notation

Abbreviations

Abbreviation   Description
CAD            Computer Aided Design
ILSVRC         ImageNet Large Scale Visual Recognition Challenge
CNN            Convolutional Neural Network
SVM            Support Vector Machine
mAP            Mean Average Precision
R-CNN          Regional Convolutional Neural Network
SSD            Single Shot Multibox Detector
IoU            Intersection over Union
1 Introduction
Almost everyone has at some point assembled furniture from IKEA. The furniture is relatively cheap and comes in flat packages; the key point is that you need to build it yourself with the help of a booklet. The assembly starts by laying out all the pieces in front of you. The large pieces are easy to recognize, but the screws and plugs might cause problems. These items are small and look alike, which makes them difficult to distinguish from each other. Therefore, when building IKEA furniture, one could say the hardest part is finding the correct piece to use.
Neural networks have great potential for solving problems which involve the detection of patterns or trends. Scientists have created neural networks which can solve tasks such as digit or word recognition, image classification, face recognition, and object detection, to name a few. Examples of neural networks which solve such tasks are Watson and AlphaGo: Watson played and won against Jeopardy champions [12], and AlphaGo was the first computer program to win against a Go world champion [11].
One of the big disadvantages of neural networks is that they need a large amount of training data to reach adequate performance. Getting access to data is therefore the bottleneck for neural networks. On the internet, many 3D models of different objects are available for free in various formats, such as Computer Aided Design (CAD). From a CAD model it is possible to generate thousands of different synthetic images by varying the background and adding texture to the objects.
Many different types of neural networks exist, where the difference lies in the combination of hidden layers. In this thesis, the networks used are Faster R-CNN, Inception, the Single-Shot Multibox Detector, and MobileNet, all described in chapter 2.
1.2.1 Limitation

To reduce training time, fine-tuning will be used. There will also be limitations on what types of objects the network will be able to detect.
There will be two different datasets: dataset A and dataset B. Dataset A consists of objects such as screws and plugs, provided by IKEA. The background and texture combinations in dataset A were realistic, since the purpose was to test whether the network could differentiate the objects in the real world. Dataset B is a video sequence of the real objects in dataset A. Datasets A and B are shown in Appendix A.
A computer with an Intel Core i7-7700 CPU and an NVIDIA GTX 1080 Ti GPU was used. A HoverCam web camera was used to capture the video sequence. No new neural network architecture will be created. Instead, the TensorFlow Object Detection API (version 1.7) [18] will be used, together with OpenCV (version 3.4.1) [8], Python (3.5), and Blender [7].
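To give a concrete picture of how a detector exported by this API is typically used, the following sketch runs inference with a frozen graph in the TensorFlow 1.x style. The file names are placeholders, and the tensor names are the ones conventionally exported by the Object Detection API; this is an illustrative sketch, not code from the thesis.

import numpy as np
import tensorflow as tf
import cv2

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

# The exported model expects an RGB uint8 batch of shape [1, height, width, 3].
image = cv2.cvtColor(cv2.imread("test_image.png"), cv2.COLOR_BGR2RGB)

with tf.Session(graph=graph) as sess:
    boxes, scores, classes, num = sess.run(
        ["detection_boxes:0", "detection_scores:0",
         "detection_classes:0", "num_detections:0"],
        feed_dict={"image_tensor:0": image[np.newaxis, ...]})

keep = scores[0] > 0.5          # keep only confident detections
print(boxes[0][keep], classes[0][keep])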
2 Related work

2.1 Using synthetic data

Annotating training data is a problem for scientists, since it takes a long time and good accuracy is needed. Richter et al. were creative with annotating their data: using the video game engine from Grand Theft Auto, they could get access to both scenes with realistic appearance and labels at pixel level [32]. By using these realistic images, they showed that the work needed for annotation could be notably reduced. By combining the semantic segmentation dataset with real-world images, the accuracy increased even more.
Successful attempts have been made to train neural networks using synthetic data to solve classification problems. In this case, success means having the best result for a
specific type of benchmark. The neural network created by Jaderberg et al. was trained for scene text classification [20], that is, to classify whole words. The training images were computer generated with different fonts, shadows, and colors. Distortion and noise were added to the rendered images to simulate the real world. It outperformed previous state-of-the-art methods for scene word classification on the benchmarks ICDAR 2003, Street View Text, and the IIIT5K dataset. ICDAR 2003 is a competition in robust reading [25]. The amount of training data used was between 4 million and 9 million images, depending on the benchmark.
Jaderberg et al. also created another neural network for text spotting, meaning detection and recognition of words. They created an end-to-end system for text spotting [21]. For the word detection part they used a region proposal based mechanism, and a CNN for the word recognition task. Their dataset was created in the same way as in their earlier work [20]. The dataset contained 9 million images of 32x100 pixels. They used 900,000 images for testing, the same amount for validation, and the rest for training. For the task of text recognition, their method had the best accuracy compared to the previous state-of-the-art methods, and they also performed well in the text spotting task, outperforming the previous state-of-the-art method [21].
Georgakis et al. trained their network with a combination of real and synthetic images [13]. The synthetic images were augmented real images, onto which objects had been superimposed at different scales and positions. The task for the network was object detection in a cluttered indoor environment.
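As a rough illustration of this kind of augmentation (a sketch under assumed inputs, not the pipeline of Georgakis et al.), an object crop with a binary mask can be pasted onto a background at a random scale and position using OpenCV and NumPy; the file names below are placeholders.

import random
import cv2
import numpy as np

background = cv2.imread("background.jpg")
obj = cv2.imread("object_crop.png")                          # tight crop of the object
mask = cv2.imread("object_mask.png", cv2.IMREAD_GRAYSCALE)   # 255 where the object is

# Random scale and position (assumes the scaled object still fits in the background).
scale = random.uniform(0.5, 1.5)
obj = cv2.resize(obj, None, fx=scale, fy=scale)
mask = cv2.resize(mask, None, fx=scale, fy=scale)

h, w = obj.shape[:2]
H, W = background.shape[:2]
y, x = random.randint(0, H - h), random.randint(0, W - w)

# Paste the object only where the mask is set; the paste region (x, y, w, h)
# directly gives the bounding-box annotation for free.
roi = background[y:y + h, x:x + w]
m = (mask > 127)[..., None]
background[y:y + h, x:x + w] = np.where(m, obj, roi)
cv2.imwrite("synthetic.jpg", background)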
Another work which trains a convolutional neural network with synthetic images is [30]. The network's task was to predict a bounding box and the object class category for each object of interest in RGB images captured inside a refrigerator. When the neural network was trained with 4000 synthetic images, it scored a mean Average Precision (mAP) of 24% on a test set. By adding 400 real images, the mAP increased by 12%. In this paper, IoU (Intersection over Union) was used for evaluating the bounding box predictions; see section 3.3.1 for a description of IoU.
2.2 Finetuning
The concept of fine-tuning refers to reusing training weights. These weights come from another neural network that has been created for another task, and they are used to initialize the training [28]. For example, a neural network trained to classify cats can be fine-tuned to classify dogs. This method has resulted in state-of-the-art performance for several tasks, for example object detection [33], [26], [27], tracking [36], segmentation [4], and human pose estimation [9]. With fine-tuning, the training time can also be reduced [6].
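The idea can be illustrated with a small tf.keras sketch (for illustration only; the thesis itself fine-tunes detection models through the TensorFlow Object Detection API). An ImageNet-pretrained backbone is reused with its weights frozen, and only a new classification head is trained; the input size and the five-class head are placeholder choices.

import tensorflow as tf

# ImageNet-pretrained backbone; only the new head is trained.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(300, 300, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),   # e.g. five object classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=..., batch_size=...)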
2.3 Object classification

More recent progress in object classification has been achieved by neural networks. Two state-of-the-art object classification networks are ResNet [5] and Inception net [34].
ResNet is a deep residual network, hence the name ResNet, and consists of 152 layers. Due to its large depth, it managed to achieve a 3.6% error rate (top-5 error) and thus won the classification task in the 2015 edition of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [23]. A human has an error rate between 5% and 10%, meaning ResNet outperforms humans on this task [5].
The other neural network, Inception net, consists of inception modules. An inception module is a block of multiple parallel convolutional and max-pooling layers with different kernel sizes. The inception module distinguishes Inception net from traditional networks, which stack convolutional and max-pooling layers sequentially [34]. Inception net won the classification and detection tasks of ILSVRC in 2014 [23].
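A minimal sketch of such a module in tf.keras is shown below; the filter counts are arbitrary and chosen only for illustration.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3=128, f5=32, fp=32):
    # Parallel branches with different receptive fields, concatenated channel-wise.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

inputs = tf.keras.Input(shape=(300, 300, 3))
outputs = inception_module(inputs)                 # 64 + 128 + 32 + 32 channels
model = tf.keras.Model(inputs, outputs)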
Neural networks are computationally heavy and require capable hardware to do the calculations. MobileNet, however, is a neural network developed for mobile vision applications. Instead of both filtering and combining the output signal in one go, MobileNet divides this step into two layers, one for filtering and one for combining. This two-layer separation greatly reduces the computation and the model size [16].
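This factorization can be sketched in tf.keras as a depthwise convolution followed by a 1x1 pointwise convolution; the shapes and filter counts below are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(300, 300, 32))
# Step 1: filter each input channel separately (depthwise convolution).
x = layers.DepthwiseConv2D(kernel_size=3, padding="same", activation="relu")(inputs)
# Step 2: combine the filtered channels with a 1x1 (pointwise) convolution.
x = layers.Conv2D(64, kernel_size=1, padding="same", activation="relu")(x)
model = tf.keras.Model(inputs, x)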
2.4 Object detection

Two of the recent state-of-the-art methods for object detection are Faster R-CNN [31] and the Single-Shot Multibox Detector (SSD) [24].
The networks in section 2.4 used PASCAL VOC 2007 as a benchmark. Their datasets were divided as follows: 50% for training/validation and 50% for testing, i.e. images the network had not seen before, with a total of 9963 images [3]. Since they had roughly double the number of images, 9963 versus 4830, this thesis instead used 10% of the dataset for testing and the rest for training to compensate. The remaining 90% was divided between training and validation: 70% for training and 30% for validation.
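For concreteness, a minimal sketch of this split, assuming a flat list of image file names (the actual file handling used in the thesis is not shown here):

import random

images = ["img_%05d.png" % i for i in range(5392)]   # placeholder file names
random.shuffle(images)

n_test = int(0.10 * len(images))                      # 10% held out for testing
test = images[:n_test]
rest = images[n_test:]
n_train = int(0.70 * len(rest))                       # 70/30 train/validation split
train, val = rest[:n_train], rest[n_train:]
print(len(train), len(val), len(test))                # roughly 3397 / 1456 / 539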
2.5 Summary: Related Work

[Figure: overview of the pipeline, with images from a web camera and synthetic images as input to the neural network, which produces the output.]
3 Method and Experiments
3.3.1 Losses
Due to the use of fine-tuning, this thesis used the same losses as the networks stated in section 3.2. The loss for categorizing a detected object, object versus background, is the binary classification loss and is based on the sigmoid function shown in (3.1). The localization loss is the loss of the bounding box regression and is represented by a smooth L1 loss, the Huber loss, see (3.2).
L_c(x) = 1 / (1 + e^(-x))                                (3.1)

L_R(x) = 0.5 x^2     if |x| < 1
         |x| - 0.5   otherwise                           (3.2)
The lower the losses, the better the network performs.
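A plain NumPy sketch of the two expressions above, for illustration:

import numpy as np

def sigmoid(x):
    # Equation (3.1): L_c(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def smooth_l1(x):
    # Equation (3.2): 0.5 x^2 when |x| < 1, |x| - 0.5 otherwise
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

print(sigmoid(0.0))                              # 0.5
print(smooth_l1(np.array([-2.0, 0.3, 1.0])))     # [1.5, 0.045, 0.5]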
[Figure: illustration of Intersection over Union (IoU) as the ratio between the intersection area and the union area of two bounding boxes.]

The detections are evaluated using the following metrics:

• Precision
• Recall
A detection counts as a true positive when it is correct, i.e. its IoU with the ground truth is equal to or greater than a certain threshold, in this case 0.5 [3]. False negatives are missed detections. False positives occur when the same object is detected multiple times; all detections other than the first correct one count as false. Recall is the proportion of ground-truth objects that are correctly detected, and precision is the proportion of all detections that are true positives [3].
PASCAL mAP is defined as the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1] [3], see (3.3).

AP = (1/11) * sum over Recall_i in {0, 0.1, ..., 1} of Precision(Recall_i)        (3.3)
The higher the mAP value is, the better the network performs.
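The sketch below illustrates these quantities: IoU for two axis-aligned boxes given as (x1, y1, x2, y2), and the eleven-point average precision of (3.3), using the PASCAL interpolated precision (the maximum precision at recall levels at or above each of the eleven points):

import numpy as np

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2); IoU = intersection area / union area.
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def pascal_ap(recall, precision):
    # Equation (3.3): mean of the (interpolated) precision at the eleven
    # recall levels 0.0, 0.1, ..., 1.0.
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))       # 25 / 175, approximately 0.143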
3.5 Experiments
As stated in chapter 3, the parameters tuned were the batch size, the number of epochs, and the total number of images used in training. Three main experiments were executed: experiment 1, experiment 2, and experiment 3. For all runs (training, validation, and testing), 100000 batches were used.
Experiment 1 only used synthetic data and consists of the following sub-experiments:
Test          Baseline    #1        #2        #3
Batch size    1           24        35        1
Image size    300x300     300x300   240x240   600x1040
Epochs        100000      4166      2857      100000
In experiment 2, five different networks were tested, listed below:

• Faster R-CNN + Inception, trained on 10% of the images
• Faster R-CNN + Inception, trained on 50% of the images
• Faster R-CNN + Inception, trained on 100% of the images
• SSD + Inception
• SSD + MobileNet

These networks were trained on dataset A and then validated on dataset B. Table 3.2 shows how many images were used to train the different Faster R-CNN + Inception networks. The reason why there are three different versions of Faster R-CNN + Inception is that it had the best mAP when tested on dataset B, see Figure 4.44.
Table 3.2: Number of images used for the different Faster R-CNN + Inception networks.

Test                     #1          #2           #3
Number of real images    540 (10%)   2686 (50%)   5392 (100%)
Batch size               24          24           24
For experiment 2 the only interesting evaluation metric is the mean average precision, since no parameters are tuned; thus only the mAP will be plotted.
Experiment 3 compared automatic annotation with manual annotation. In this experiment SSD + MobileNet was used, due to its short training time, shown in Figure 4.40. The validation was done by comparing the IoU between the automatically generated and the manually annotated ground truth, and by comparing the classification and localization losses.
4 Results
In this chapter, the results of the different network configurations stated in section 3.5 are presented. The chapter also includes an evaluation of manual versus automatic annotation.

In all figures where the mean average precision is plotted for the whole dataset, only results from the validation and the testing are shown. This is because the TensorFlow API only calculates the mAP for the validation and test sets.
4.1 Testing Different Network Configuration

4.1.1 Faster R-CNN and Inception

Baseline
Time needed for training: 3.3 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.1 - Figure 4.3.
[Figures 4.1-4.3: loss curves and mAP for the baseline configuration.]
Configuration 1, #1
Time needed for training: 22 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.4 - Figure 4.6.
[Figures 4.4-4.6: loss curves and mAP (validation and testing) for configuration 1, #1.]
Configuration 1, #2
Time needed for training: 31.5 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.7 - Figure 4.9.
[Figures 4.7-4.9: loss curves and mAP versus batches for configuration 1, #2.]
[Figures: classification loss, localization loss, and mean average precision, with training, evaluation, and testing curves.]
[Comparison plots: classification loss, localization loss, and mAP versus batches for Faster R-CNN with batch sizes 1, 24, and 35.]
4.1.2 SSD and Inception

Baseline
Time needed for training: 2.5 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.16 - Figure 4.18.
[Figures 4.16-4.18: loss curves and mAP versus batches for the baseline configuration.]
Figure 4.17 has some incomplete values for the validation run; the values were NaN and are therefore not plotted.
Configuration 1, #1
Time needed for training: 14 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.19 - Figure 4.21.
[Figures 4.19-4.21: loss curves and mAP (validation and testing) versus batches for configuration 1, #1.]
Configuration 1, #2
Time needed for training: 8.4 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.22 - Figure 4.24
[Figures 4.22-4.24: loss curves and mAP (validation and testing) versus batches for configuration 1, #2.]
4.1.3 SSD and MobileNet

Baseline
Time needed for training: 1.7 hours
Image size: 300x300
Batch size: 1
Results are shown in Figure 4.25 - Figure 4.27
[Figures 4.25-4.27: loss curves and mAP for the baseline configuration.]
Configuration 1, #1
Time needed for training: 14 hours
Image size: 300x300
Batch size: 24
Results are shown in Figure 4.28 - Figure 4.29.
[Figures 4.28-4.29: loss curves and mAP (validation and testing) versus batches for configuration 1, #1.]
Configuration 1, #2
Time needed for training: 19.5 hours
Image size: 300x300
Batch size: 35
Results are shown in Figure 4.31 - Figure 4.32.
[Figures 4.31-4.32: loss curves and mAP versus batches for configuration 1, #2.]
4.1.4 Summary: Single-Shot Multibox Detector

[Comparison plots: classification and localization losses for SSD + Inception and SSD + MobileNet with batch sizes 1, 24, and 35.]
[Comparison plots: classification loss, localization loss, and mAP versus batches for all configurations: Faster R-CNN + Inception (batch sizes 1, 24, 35, and image size 600x1040), SSD + Inception (batch sizes 1, 24, 35), and SSD + MobileNet (batch sizes 1, 24, 35).]
[Bar chart: training time in hours for each configuration of Faster R-CNN + Inception (batch sizes 1 at 300x300, 1 at 600x1040, 24, and 35), SSD + Inception (batch sizes 1, 24, and 35), and SSD + MobileNet (batch sizes 1, 24, and 35).]
[Plots: classification loss, localization loss, and mAP versus epochs for the networks trained with 10 percent, 50 percent, and the whole dataset.]
[Bar chart: mAP for Faster R-CNN trained with 10, 50, and 100 percent of the images, SSD + Inception, and SSD + MobileNet.]
                      Manual   Automatic
Number of images      3561     7806
Annotation time       6 h      38 min
Figure 4.45 shows the histogram of the IoU between the ground truth annotations produced by the manual and the automatic method.
[Figure 4.45: histogram of the IoU between the manual and automatic annotation ground truth, with IoU values ranging from 0.60 to 1.00.]
A comparison of manually and automatically annotated data with SSD + MobileNet was done, and the results are shown in Figure 4.46 to Figure 4.51. A batch size of 24 with 3561 images was used.
[Figures 4.46-4.51: classification and localization losses versus batches for the manually and automatically annotated data.]
5 Discussion

5.1 Networks
In the following sections, the results from each network configuration are discussed: first the Single-Shot Multibox Detector, followed by Faster R-CNN.
a batch size of 35. It is also shown in Figure 4.39 that using an image size of 600x1040 with a batch size of 1 gives results as good as using a batch size of 24 with an image size of 300x300.
Another reason for the worse performance could be the sharpness of the images. Even though some of the training images had blur added to them as a pre-processing step, the network had trouble with objects being out of focus.

It is shown in Figure 4.44 that all the networks performed approximately the same, except for SSD + MobileNet.
6 Conclusions and Future Work

6.1 Conclusions
There are several conclusions that can be drawn from this thesis. A neural network built for object detection can be fine-tuned using synthetic data to detect other objects. The Faster R-CNN + Inception network had the best accuracy out of the three network architectures used, while also taking the longest time to train.

The results further show that a longer training time does not necessarily give the best result; what mattered was the size of the dataset and the batch size. The larger the dataset, the higher the accuracy, yet too large a batch size results in overfitting.
is due to the bounding box containing background noise, while an object mask would only contain the interesting pixels, i.e. the object. It would be interesting to verify whether this improves the accuracy further.

Also, in this thesis the networks were fine-tuned. An interesting aspect would be to train a neural network from scratch using only computer-generated images, in order to verify that synthetic data is suitable for training from scratch as well.
Appendix A

Datasets
Two different datasets were used throughout the thesis; dataset B was only used for testing purposes.

Dataset A

Dataset A consists of five different objects and contains 5392 images. These can be seen in Figure A.6 and Figure A.10.

Dataset B

Dataset B is a video of the real objects, recorded with a web camera.
Bibliography
https://nph.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1469-8137.1912.tb05611.x. Cited on page 13.
[20] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. CoRR, abs/1406.2227, 2014. URL http://arxiv.org/abs/1406.2227. Cited on page 6.
[21] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman.
Reading text in the wild with convolutional neural networks. International
Journal of Computer Vision, 116(1):1 – 20, 2016. ISSN 09205691. URL
https://arxiv.org/abs/1412.1842. Cited on pages 6 and 13.
[22] Andrej Karpathy. Neural network. http://cs231n.github.io/neural-networks-1/, 2018. Accessed: 2018-04-17. Cited on page 2.
[23] Stanford Visual Lab. ImageNet Large Scale Visual Recognition Challenge. http://www.image-net.org, 2010. Accessed: 2018-02-26. Cited on page 7.
[24] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015. URL http://arxiv.org/abs/1512.02325. Cited on page 8.
[25] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., pages 682-687, Aug 2003. URL https://ieeexplore.ieee.org/document/1227617. Cited on page 6.
[26] W. Ouyang, X. Wang, X. Zeng, Shi Qiu, P. Luo, Y. Tian, H. Li, Shuo Yang, Zhe Wang, Chen-Change Loy, and X. Tang. DeepID-Net: Deformable deep convolutional neural networks for object detection. 2014. URL https://arxiv.org/abs/1409.3505. Cited on page 6.
[27] W. Ouyang, H. Li, X. Zeng, and X. Wang. Learning deep representation with large-scale attributes. In 2015 IEEE International Conference on Computer Vision (ICCV), page 1895, 2015. ISBN 978-1-4673-8391-2. URL https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Ouyang_Learning_Deep_Representation_ICCV_2015_paper.pdf. Cited on page 6.
[28] W. Ouyang, X. Wang, C. Zhang, and X. Yang. Factors in finetuning deep model for object detection. ArXiv e-prints, January 2016. URL https://arxiv.org/abs/1601.05150. Cited on page 6.
[29] Mahesh Pal and Paul M. Mather. Support vector classifiers for land cover classification. CoRR, abs/0802.2138, 2008. URL http://arxiv.org/abs/0802.2138. Cited on page 7.
[30] Param S. Rajpura, Ravi S. Hegde, and Hristo Bojinov. Object detection using deep CNNs trained on synthetic images. CoRR, abs/1706.06782, 2017. URL http://arxiv.org/abs/1706.06782. Cited on pages 6 and 13.
[31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137-1149, 2017. ISSN 01628828. URL https://login.e.bibl.liu.se/login?url=https://search-ebscohost-com.e.bibl.liu.se/login.aspx?direct=true&AuthType=ip,uid&db=edselc&AN=edselc.2-52.0-85019258369&lang=sv&site=eds-live&scope=site. Cited on page 8.
[32] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing
for data: Ground truth from computer games. CoRR, abs/1608.02192, 2016.
URL http://arxiv.org/abs/1608.02192. Cited on pages 5 and 9.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. ArXiv
e-prints, September 2014. URL https://arxiv.org/abs/1409.4842.
Cited on page 6.
[34] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Ra-
binovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
URL http://arxiv.org/abs/1409.4842. Cited on page 7.