OBJECT DETECTION
MA498 Project I
by
RISHON DSOUZA and MANISH KUMAR
(Roll No. 190123049 and 190123067)
to the
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI
GUWAHATI - 781039, INDIA
November 2022
CERTIFICATE
This is to certify that the work contained in this project report entitled
“Faster R-CNN for Real Time Object Detection” submitted by Rishon Dsouza
and Manish Kumar (Roll No.: 190123049 and 190123067) to the Department
of Mathematics, Indian Institute of Technology Guwahati towards partial
requirement of Bachelor of Technology in Mathematics and Computing has
been carried out by them under my supervision.
It is also certified that this report is a survey work based on the references
in the bibliography and that, along with the literature survey, computational
implementations have been carried out by the students under the project.
ABSTRACT
The main aim of this project is to understand the most widely used state-of-the-art object detection network, Faster R-CNN. It is the third iteration of the R-CNN family and introduces the novel Region Proposal Network (RPN). The RPN is a fully convolutional neural network that shares convolution layers with Fast R-CNN during training and brings the number of proposals down from about 2,000 to 300 per image while maintaining, if not improving, the evaluation index, mean average precision (mAP). It is therefore computationally efficient compared with region proposal methods such as Selective Search (SS) and EdgeBoxes, which were used in earlier versions of the R-CNN family.
Contents

1 Introduction
    1.1 Object Detection
    1.2 Review of Prior Works
        1.2.1 R-CNN
        1.2.2 Fast R-CNN
2 Faster R-CNN
    2.1 CNN Backbone
    2.2 Region Proposal Network
        2.2.1 Anchor Boxes
        2.2.2 Non-maximal Suppression
        2.2.3 Objectness Score
        2.2.4 Loss Function
    2.3 Detection Network
        2.3.1 Classification Layer
        2.3.2 Regression Layer
3 Implementation Details and Applications
    3.1 Feature Sharing for Backbones in RPN and Fast R-CNN
    3.2 Algorithm for Faster R-CNN
    3.3 Applications
        3.3.1 Object Detection in Action
        3.3.2 Attendance Counting
Bibliography
Chapter 1
Introduction
during testing (Faster R-CNN takes approximately 200 milliseconds per image).
The complexity of OD arises from the fact that it involves both object
classification and object localization. We will now review some of the prior
work in the field of OD, specifically the two earlier versions of the R-CNN
family, to fully understand which challenges Faster R-CNN overcomes.
1.2.1 R-CNN

R-CNN was one of the first breakthroughs in the use of convolutional neural networks (CNNs) for object detection. It consists of three modules.

1. The first module generates approximately 2,000 category-independent region proposals per image using Selective Search.

2. Each proposal is warped to a fixed pre-defined size and then fed into the second module, a deep CNN, which extracts a feature vector of fixed length 4,096.

3. The feature vector is then passed to the third module, which consists of

(a) a trained SVM layer to classify the region proposal into one of the known classes, and

(b) a linear regressor to find the object’s bounding box, if one exists.
In other words, we first propose regions, then extract features, and then classify those regions using their features. This is a simple and intuitive application of CNNs, but the approach is slow and cannot be used for real-time object detection, since each of the roughly 2,000 proposals needs to pass through the ConvNet independently for feature extraction.
1.2.2 Fast R-CNN

Fast R-CNN improves upon R-CNN as follows.

1. Fast R-CNN computes the feature map of the entire image before proposing regions, thus sharing computation across all proposals instead of passing each proposal through the ConvNet independently. This is done by the new RoI pooling layer, which extracts a fixed-size feature for any given input candidate region.
Remark 1.2.1. Despite the advantages of the Fast R-CNN model, one big bottleneck still remains: the use of Selective Search for region proposals. Selective Search cannot be customized for a specific OD task and hence may not be accurate enough to detect all target objects in the dataset.
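To make the RoI pooling idea concrete, the following is a minimal sketch using torchvision.ops.roi_pool; the feature map, the candidate boxes, and the 800x800 image size are invented for illustration.

```python
import torch
from torchvision.ops import roi_pool

# Dummy shared feature map: 1 image, 256 channels, 50x50 spatial grid
feature_map = torch.randn(1, 256, 50, 50)

# Two candidate regions as (batch_index, x1, y1, x2, y2),
# given in input-image coordinates (assume an 800x800 image)
rois = torch.tensor([[0., 100., 100., 300., 300.],
                     [0., 50., 200., 250., 600.]])

# spatial_scale maps image coordinates onto the feature map (50 / 800)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- one fixed-size feature per proposal
```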
Figure 1.3: Summary of the Fast R-CNN architecture
In the next chapter, we will dive deep into the latest version of the R-CNN family, Faster R-CNN. It is a two-stage object detection system. The first stage comprises the RPN, which tells the detection network where to look by providing accurate proposals with objectness scores. The second stage is the detection network, comprising Fast R-CNN, which reports the bounding boxes and prediction scores.
Chapter 2
Faster R-CNN
Faster R-CNN consists of three modules.

1. CNN Backbone - extracts feature maps from the input image.

2. Region Proposal Network - proposes candidate object regions from the feature maps.

3. Detection Network - Fast R-CNN outputs class scores and locations of objects.
2.1 CNN Backbone

The backbone CNN takes the input image and extracts feature maps. The feature map extracted is shared by the Detection Network and the Region Proposal Network.
2.2 Region Proposal Network

Each bounding box is identified by the pixel coordinates of two diagonal corners, as well as a label: 1 indicates an object is in the bounding box, 0 indicates it is not, and -1 indicates the box can be ignored.
2.2.1 Anchor Boxes

The network must determine whether an object is present at each point in the output feature map and estimate its size. To achieve this, for each position on the output feature map from the backbone network, a set of “anchors” is placed on the input image. These anchors represent potential objects of different sizes and aspect ratios that might be present.
Figure 2.3: Anchor boxes: 3 different aspect ratios and 3 different scales
As the network progresses through each position in the output feature map, it must determine whether the k corresponding anchors spanning the input image genuinely contain objects. It must then refine the coordinates of these anchors to produce bounding boxes as “object proposals”, or regions of interest.
The k proposals are parameterized relative to the k reference boxes, the anchors. Each anchor is centred at the sliding window in question and is associated with a scale and an aspect ratio. Since we employ three scales and three aspect ratios, each pixel/position has k = 9 anchors.
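As a rough sketch, the following generates the k = 9 anchors for one feature-map position; the scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) follow the common choice in the Faster R-CNN paper, though implementations may differ.

```python
import torch

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchors centred at (cx, cy)
    as (x1, y1, x2, y2) boxes in input-image coordinates."""
    boxes = []
    for s in scales:
        for r in ratios:       # r is the width-to-height ratio
            w = s * r ** 0.5   # keep the anchor area close to s * s
            h = s / r ** 0.5
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(boxes)

print(anchors_at(400.0, 300.0).shape)  # torch.Size([9, 4]) -- k = 9 anchors
```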
2.2.3 Objectness Score

The classification layer outputs two elements for each proposal. When the first and second elements are 1 and 0 respectively, the region is classified as background; when the first element is 0 and the second is 1, the region represents an object. While training the RPN, each anchor is given a negative or positive objectness label. This label is based on the Intersection-over-Union (IoU):

\[
\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}.
\]
Each box is given a label from {+1, 0, -1} based on the following conditions.

1. +1 is given to any anchor box with IoU greater than 0.7 with any ground-truth box.

2. 0 is given to all anchors with IoU less than 0.3 with every ground-truth box; these are taken as background.

3. -1 is given to the remaining anchors (IoU between 0.3 and 0.7); consistent with the convention above, these are ignored during training.
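A short sketch of this labelling rule, using torchvision.ops.box_iou for the IoU computation; the anchor and ground-truth boxes below are illustrative.

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Assign +1 (object), 0 (background), or -1 (ignored) to each anchor
    based on its best IoU with any ground-truth box."""
    best_iou, _ = box_iou(anchors, gt_boxes).max(dim=1)  # best overlap per anchor
    labels = torch.full((len(anchors),), -1)  # default: ignored
    labels[best_iou > hi] = 1                 # positive anchors
    labels[best_iou < lo] = 0                 # background anchors
    return labels

anchors = torch.tensor([[0., 0., 100., 100.],
                        [0., 0., 50., 100.],
                        [200., 200., 300., 300.]])
gt = torch.tensor([[0., 0., 100., 100.]])
print(label_anchors(anchors, gt))  # tensor([ 1, -1,  0])
```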
2.2.4 Loss Function

\[
\mathrm{Loss}\left(\{p_i\},\{C_i\}\right) = \frac{1}{N_{cls}} \sum_i \mathrm{Loss}_{cls}\left(p_i, p_i^{*}\right) + \lambda\,\frac{1}{N_{reg}} \sum_i p_i^{*}\, \mathrm{Loss}_{reg}\left(C_i, C_i^{*}\right). \tag{2.1}
\]

For the bounding box regression, we adopt the parameterization of the four coordinates as follows:

\[
\begin{aligned}
C_x &= (x - x_{anc})/w_{anc}, & C_y &= (y - y_{anc})/h_{anc},\\
C_w &= \log\left(w/w_{anc}\right), & C_h &= \log\left(h/h_{anc}\right),\\
C_x^{*} &= (x^{*} - x_{anc})/w_{anc}, & C_y^{*} &= (y^{*} - y_{anc})/h_{anc},\\
C_w^{*} &= \log\left(w^{*}/w_{anc}\right), & C_h^{*} &= \log\left(h^{*}/h_{anc}\right).
\end{aligned}
\]
Here \(x\), \(y\), \(w\), and \(h\) denote the bounding box’s centre coordinates and its width and height. In these equations, \(x\) refers to the predicted box, \(x_{anc}\) to the anchor box, and \(x^{*}\) to the ground-truth box (and analogously for \(y\), \(w\), and \(h\)).
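The following sketch implements this parameterization; the boxes and anchors are assumed to be (N, 4) tensors in (x_center, y_center, width, height) format.

```python
import torch

def encode_boxes(boxes, anchors):
    """Regression targets C = (Cx, Cy, Cw, Ch) of `boxes` relative to `anchors`."""
    cx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    cy = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    cw = torch.log(boxes[:, 2] / anchors[:, 2])
    ch = torch.log(boxes[:, 3] / anchors[:, 3])
    return torch.stack([cx, cy, cw, ch], dim=1)

def decode_boxes(codes, anchors):
    """Invert the encoding: recover (cx, cy, w, h) boxes from the codes."""
    x = codes[:, 0] * anchors[:, 2] + anchors[:, 0]
    y = codes[:, 1] * anchors[:, 3] + anchors[:, 1]
    w = anchors[:, 2] * torch.exp(codes[:, 2])
    h = anchors[:, 3] * torch.exp(codes[:, 3])
    return torch.stack([x, y, w, h], dim=1)

# Round-trip check on one dummy box/anchor pair
boxes = torch.tensor([[55., 48., 110., 95.]])
anchors = torch.tensor([[50., 50., 100., 100.]])
print(decode_boxes(encode_boxes(boxes, anchors), anchors))  # recovers `boxes`
```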
2.3.2 Regression Layer

This layer calculates the regression coefficients, i.e. the offsets for the bounding box coordinates. Since the background class does not have a box, we ignore this class. For this task we use the loss

\[
\mathrm{Loss}_{loc}\left(C^{u}, v\right) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(C_i^{u} - v_i\right)
\]
in which

\[
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1,\\
|x| - 0.5 & \text{otherwise.}
\end{cases}
\]
The L1 loss is less sensitive to outliers than the L2 loss used in SPPnet and R-CNN; however, it is not smooth at the origin. To overcome this issue we use the smoothed form stated above.
The parameter λ controls the relative emphasis on classification versus regression. In most cases we use λ = 1, as we want good performance on both tasks.
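As a quick sketch, the smooth L1 loss above coincides with torch.nn.functional.smooth_l1_loss at its default beta = 1.0; the coefficient values below are made up.

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([0.3, 2.0, -0.1])    # predicted regression coefficients
target = torch.tensor([0.5, 0.0, -0.1])  # regression targets

# Elementwise: 0.5 * x^2 when |x| < 1, |x| - 0.5 otherwise, then summed
loss = F.smooth_l1_loss(pred, target, reduction="sum")
print(loss)  # 0.5 * 0.2^2 + (2.0 - 0.5) + 0 = 1.52
```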
Chapter 3

Implementation Details and Applications

3.1 Feature Sharing for Backbones in RPN and Fast R-CNN
Since the RPN and the detector share a backbone, the two networks are trained with a four-step alternating scheme. In the first step, the RPN is initialized with an ImageNet-pre-trained model and fine-tuned for the region proposal task. In the second step, we consider a separate Fast R-CNN network and initialize it with the weights of an ImageNet-pre-trained model. This network is trained using the proposals generated in step 1 and fine-tuned for the object detection task. In the third step, we begin by initializing the RPN with the detector network’s weights and fine-tune it; in this step, only the layers unique to the RPN are fine-tuned, while the shared layers are kept fixed. At this stage the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the layers unique to Fast R-CNN.
An alternative is approximate joint training, where the RPN and Fast R-CNN are merged into a single network during training:

(a) In the forward pass, region proposals are generated and used for training the Fast R-CNN detector.

(b) In the backward propagation step, the losses from both networks are combined for the shared convolution layers. Other steps remain as usual.
This approach does not take the derivative with respect to the proposal boxes’ coordinates into account, which reduces its accuracy. In practice, however, it has produced close results while reducing training time by about 25-50%.
3.2 Algorithm for Faster R-CNN
3.3 Applications
We use the Faster R-CNN model for object detection and classification on real-life images. The model has a ResNet-50 backbone and has been trained on the COCO dataset. The MS COCO (Microsoft Common Objects in Context) dataset is an image recognition dataset released in 2014 and updated in 2017. It has been used in multiple competitions and is a benchmark used to test various models.
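A minimal sketch of this setup, assuming torchvision’s COCO-pre-trained Faster R-CNN (the ResNet-50 FPN variant) and a placeholder image path:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("campus_photo.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

keep = prediction["scores"] > 0.8  # confidence threshold
print(prediction["boxes"][keep])   # (x1, y1, x2, y2) per detection
print(prediction["labels"][keep])  # COCO category ids
```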
3.3.1 Object Detection in Action

Figure 3.1: Photographs taken on a smartphone in IIT Guwahati
The images in Fig. 3.1 were taken in IIT Guwahati and shot on a smartphone camera. They capture a variety of scenes that one encounters while travelling around the campus.
3.3.2 Attendance Counting

We pass a group photograph through the Faster R-CNN model to get an estimate of the number of people present.
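A sketch of the counting step, reusing the prediction dictionary from the inference snippet above; in the COCO label map, category id 1 is “person”.

```python
def count_people(prediction, score_threshold=0.8):
    """Count detections labelled 'person' (COCO category id 1)
    above a confidence threshold."""
    is_person = prediction["labels"] == 1
    confident = prediction["scores"] > score_threshold
    return int((is_person & confident).sum())

print(count_people(prediction))  # estimated number of people in the photo
```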
Chapter 4

Conclusion and Future Work
4.1 Conclusion
We have reviewed some of the most widely used state-of-the-art object detection networks, with the main emphasis on the Faster R-CNN model. The RPN is able to deliver good-quality proposals to the Fast R-CNN network, and the region proposal step is nearly cost-free.
4.2.1 Active Learning

This implies that many green data points will be mistakenly categorized as red, and vice versa. This skew arises because the data points selected for labelling were chosen poorly. For the image on the right, we instead choose a small subset of points for the logistic regression using an active learning query strategy. The new decision boundary is vastly superior because it can clearly distinguish between the two clusters. The classifier was able to produce an effective decision boundary thanks to the superior data points that were chosen for labelling.
Table 4.1: A scenario to understand the active learning strategies
where \( \hat{y} = \arg\max_{y} \mathrm{Prob}(y \mid x) \), i.e. the class label with the highest probability.
The difference between object1’s first and second most likely labels is 0.05 (0.4 - 0.35), whereas it is 0.77 for object2 (0.85 - 0.08). As a result, this margin-based sampling strategy will choose object1, the instance with the smaller margin, for labelling once again.
The instance with the highest entropy is sent for labelling. In our example, object1 has an entropy value of 0.469 and object2 a value of 0.228 (computed with base-10 logarithms), hence the learner will choose object1 once again.
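A small sketch reproducing these numbers; the full probability vectors (0.4, 0.35, 0.25) for object1 and (0.85, 0.08, 0.07) for object2 are assumed so as to complete the values quoted in the text, and the entropies use base-10 logarithms.

```python
import math

def margin(probs):
    """Margin sampling score: gap between the two most likely labels."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def entropy(probs):
    """Shannon entropy with base-10 logarithms."""
    return -sum(p * math.log10(p) for p in probs if p > 0)

object1 = [0.40, 0.35, 0.25]  # assumed full distribution
object2 = [0.85, 0.08, 0.07]  # assumed full distribution

print(margin(object1), margin(object2))    # 0.05, 0.77
print(entropy(object1), entropy(object2))  # ~0.469, ~0.229 (quoted as 0.228)
```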
In the next phase, we will see some of these AL techniques in action and explore the challenges remaining in this area.
Bibliography
[2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich
feature hierarchies for accurate object detection and semantic segmen-
tation. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 580–587, 2014.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyra-
mid pooling in deep convolutional networks for visual recognition. IEEE
transactions on pattern analysis and machine intelligence, 37(9):1904–
1916, 2015.
[5] Ying Li, Binbin Fan, Weiping Zhang, Weiping Ding, and Jianwei
Yin. Deep active learning for object detection. Information Sciences,
579:418–433, 2021.
[6] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[7] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You
only look once: Unified, real-time object detection. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages
779–788, 2016.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
Michael Bernstein, et al. Imagenet large scale visual recognition chal-
lenge. International journal of computer vision, 115(3):211–252, 2015.
[10] Karen Simonyan and Andrew Zisserman. Very deep convolutional net-
works for large-scale image recognition. arXiv preprint arXiv:1409.1556,
2014.