OBJECT DETECTION
MA498 Project I
by
RISHON DSOUZA and MANISH KUMAR
(Roll No. 190123049 and 190123067)
to the
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI
GUWAHATI - 781039, INDIA
November 2022
CERTIFICATE
This is to certify that the work contained in this project report entitled
“Faster R-CNN for Real Time Object Detection” submitted by Rishon Dsouza
and Manish Kumar (Roll No.: 190123049 and 190123067) to the Department
of Mathematics, Indian Institute of Technology Guwahati towards partial
requirement of Bachelor of Technology in Mathematics and Computing has
been carried out by them under my supervision.
It is also certified that this report is a survey work based on the references
in the bibliography and that, along with the literature survey, computational
implementations have been carried out by the students under the project.
ABSTRACT
The main aim of this project is to understand the most widely used state-of-the-art object detection network, Faster R-CNN. It is the third iteration of the R-CNN family and introduces the novel Region Proposal Network (RPN). The RPN is a fully convolutional neural network that shares convolution layers with Fast R-CNN during training and brings the number of proposals down from about 2,000 to 300 per image while maintaining, if not improving, the evaluation index, mean average precision (mAP). It is therefore computationally efficient compared with region proposal methods such as Selective Search (SS) and EdgeBoxes, which were used in earlier versions of the R-CNN family.
Contents

1 Introduction
    1.1 Object Detection
    1.2 Review of Prior Works
        1.2.1 R-CNN
        1.2.2 Fast R-CNN
2 Faster R-CNN
    2.1 CNN Backbone
    2.2 Region Proposal Network
        2.2.1 Anchor Boxes
        2.2.2 Non-maximal Suppression
        2.2.3 Objectness Score
        2.2.4 Loss Function
    2.3 Detection Network
        2.3.1 Classification Layer
        2.3.2 Regression Layer
3 Implementation Details and Applications
    3.1 Feature Sharing for Backbones in RPN and Fast R-CNN
    3.2 Algorithm for Faster R-CNN
    3.3 Applications
        3.3.1 Object Detection in Action
        3.3.2 Attendance Counting
Bibliography
Chapter 1
Introduction
during testing (Faster R-CNN takes approximately 200 milliseconds per image).
The complexity of OD arises from the fact that it involves both object
classification and object localization. We will now review some of the prior
work in the field of OD, specifically the two earlier versions of the R-CNN
family, to fully understand which challenges Faster R-CNN overcomes.
1.2.1 R-CNN

R-CNN was one of the first breakthroughs in the use of convolutional neural networks (CNNs) for object detection. It consists of three modules.

1. The first module generates approximately 2,000 category-independent region proposals per image using Selective Search.

2. Each proposal is warped to a fixed pre-defined size and then fed into the second module, a deep CNN, which extracts a feature vector of fixed length 4,096.

3. The feature vector is then passed to the third module, which consists of

(a) a trained SVM layer to classify the region proposal into one of the known classes, and

(b) a linear regressor to find the object’s bounding box, if one exists.
In other words, we first propose regions, then extract features, and then classify those regions using their features. This is a simple and intuitive application of CNNs, but the approach is slow and cannot be used for real-time object detection, since each of the roughly 2,000 proposals needs to pass through the ConvNet independently for feature extraction.
1.2.2 Fast R-CNN

Fast R-CNN improves upon R-CNN as follows.

1. Fast R-CNN computes the feature map of the entire image before proposing regions, thus sharing computation across all proposals instead of passing each proposal through the ConvNet independently. This is done by the new RoI pooling layer, which extracts a fixed-size feature for any given input candidate region.
Remark 1.2.1. Despite the advantages of the Fast R-CNN model, one big bottleneck still remains: the use of Selective Search for region proposals. Selective Search cannot be customized for a specific OD task and hence may not be accurate enough to detect all target objects in the dataset.
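To make the RoI pooling idea concrete, the following is a minimal sketch using torchvision.ops.roi_pool; the feature map, the candidate boxes, and the 800x800 image size are invented for illustration.

```python
import torch
from torchvision.ops import roi_pool

# Dummy shared feature map: 1 image, 256 channels, 50x50 spatial grid
feature_map = torch.randn(1, 256, 50, 50)

# Two candidate regions as (batch_index, x1, y1, x2, y2),
# given in input-image coordinates (assume an 800x800 image)
rois = torch.tensor([[0., 100., 100., 300., 300.],
                     [0., 50., 200., 250., 600.]])

# spatial_scale maps image coordinates onto the feature map (50 / 800)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- one fixed-size feature per proposal
```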
Figure 1.3: Summary of the Fast R-CNN architecture
In the next chapter, we will dive deep into the latest version of the R-CNN family, Faster R-CNN. It is a two-stage object detection system. The first stage comprises the RPN, which tells the detection network where to look by providing accurate proposals with objectness scores. The second stage is the detection network, comprising Fast R-CNN, which reports the bounding boxes and prediction scores.
Chapter 2
Faster R-CNN
Faster R-CNN consists of three modules.

1. CNN Backbone - extracts feature maps from the input image.

2. Region Proposal Network - proposes candidate object regions from the feature maps.

3. Detection Network - Fast R-CNN outputs class scores and locations of objects.
2.1 CNN Backbone

The backbone CNN takes the input image and extracts feature maps. The feature map extracted is shared by the Detection Network and the Region Proposal Network.
2.2 Region Proposal Network

Each bounding box is identified by the pixel coordinates of two diagonal corners, as well as a label: 1 indicates an object is in the bounding box, 0 indicates it is not, and -1 indicates the box can be ignored.
2.2.1 Anchor Boxes

The network must determine whether an object is present at each point in the output feature map and estimate its size. To achieve this, for each position on the output feature map from the backbone network, a set of “anchors” is placed on the input image. These anchors represent potential objects of different sizes and aspect ratios that might be present.
Figure 2.3: Anchor boxes: 3 different aspect ratios and 3 different scales
As the network progresses through each position in the output feature map, it must determine whether the k corresponding anchors spanning the input image genuinely contain objects. It must then refine the coordinates of these anchors to produce bounding boxes as “object proposals”, or regions of interest.
The k proposals are parameterized relative to the k reference boxes, the anchors. Each anchor is centred at the sliding window in question and is associated with a scale and an aspect ratio. Since we employ three scales and three aspect ratios, each pixel/position has k = 9 anchors.
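As a rough sketch, the following generates the k = 9 anchors for one feature-map position; the scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) follow the common choice in the Faster R-CNN paper, though implementations may differ.

```python
import torch

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchors centred at (cx, cy)
    as (x1, y1, x2, y2) boxes in input-image coordinates."""
    boxes = []
    for s in scales:
        for r in ratios:       # r is the width-to-height ratio
            w = s * r ** 0.5   # keep the anchor area close to s * s
            h = s / r ** 0.5
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(boxes)

print(anchors_at(400.0, 300.0).shape)  # torch.Size([9, 4]) -- k = 9 anchors
```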
2.2.3 Objectness Score

The classification layer outputs two elements for each proposal. When the first and second elements are 1 and 0 respectively, the region is classified as background; when the first element is 0 and the second is 1, the region represents an object. While training the RPN, each anchor is given a negative or positive objectness label. This label is based on the Intersection-over-Union (IoU):

\[
\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}.
\]
Each box is given a label from {+1, 0, -1} based on the following conditions.

1. +1 is given to any anchor box with IoU greater than 0.7 with any ground-truth box.

2. 0 is given to all anchors with IoU less than 0.3 with every ground-truth box; these are taken as background.

3. -1 is given to the remaining anchors (IoU between 0.3 and 0.7); consistent with the convention above, these are ignored during training.
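A short sketch of this labelling rule, using torchvision.ops.box_iou for the IoU computation; the anchor and ground-truth boxes below are illustrative.

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Assign +1 (object), 0 (background), or -1 (ignored) to each anchor
    based on its best IoU with any ground-truth box."""
    best_iou, _ = box_iou(anchors, gt_boxes).max(dim=1)  # best overlap per anchor
    labels = torch.full((len(anchors),), -1)  # default: ignored
    labels[best_iou > hi] = 1                 # positive anchors
    labels[best_iou < lo] = 0                 # background anchors
    return labels

anchors = torch.tensor([[0., 0., 100., 100.],
                        [0., 0., 50., 100.],
                        [200., 200., 300., 300.]])
gt = torch.tensor([[0., 0., 100., 100.]])
print(label_anchors(anchors, gt))  # tensor([ 1, -1,  0])
```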
2.2.4 Loss Function

\[
\mathrm{Loss}\left(\{p_i\},\{C_i\}\right) = \frac{1}{N_{cls}} \sum_i \mathrm{Loss}_{cls}\left(p_i, p_i^{*}\right) + \lambda\,\frac{1}{N_{reg}} \sum_i p_i^{*}\, \mathrm{Loss}_{reg}\left(C_i, C_i^{*}\right). \tag{2.1}
\]

For the bounding box regression, we adopt the parameterization of the four coordinates as follows:

\[
\begin{aligned}
C_x &= (x - x_{anc})/w_{anc}, & C_y &= (y - y_{anc})/h_{anc},\\
C_w &= \log\left(w/w_{anc}\right), & C_h &= \log\left(h/h_{anc}\right),\\
C_x^{*} &= (x^{*} - x_{anc})/w_{anc}, & C_y^{*} &= (y^{*} - y_{anc})/h_{anc},\\
C_w^{*} &= \log\left(w^{*}/w_{anc}\right), & C_h^{*} &= \log\left(h^{*}/h_{anc}\right).
\end{aligned}
\]
Here \(x\), \(y\), \(w\), and \(h\) denote the bounding box’s centre coordinates and its width and height. In these equations, \(x\) refers to the predicted box, \(x_{anc}\) to the anchor box, and \(x^{*}\) to the ground-truth box (and analogously for \(y\), \(w\), and \(h\)).
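The following sketch implements this parameterization; the boxes and anchors are assumed to be (N, 4) tensors in (x_center, y_center, width, height) format.

```python
import torch

def encode_boxes(boxes, anchors):
    """Regression targets C = (Cx, Cy, Cw, Ch) of `boxes` relative to `anchors`."""
    cx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    cy = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    cw = torch.log(boxes[:, 2] / anchors[:, 2])
    ch = torch.log(boxes[:, 3] / anchors[:, 3])
    return torch.stack([cx, cy, cw, ch], dim=1)

def decode_boxes(codes, anchors):
    """Invert the encoding: recover (cx, cy, w, h) boxes from the codes."""
    x = codes[:, 0] * anchors[:, 2] + anchors[:, 0]
    y = codes[:, 1] * anchors[:, 3] + anchors[:, 1]
    w = anchors[:, 2] * torch.exp(codes[:, 2])
    h = anchors[:, 3] * torch.exp(codes[:, 3])
    return torch.stack([x, y, w, h], dim=1)

# Round-trip check on one dummy box/anchor pair
boxes = torch.tensor([[55., 48., 110., 95.]])
anchors = torch.tensor([[50., 50., 100., 100.]])
print(decode_boxes(encode_boxes(boxes, anchors), anchors))  # recovers `boxes`
```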
2.3.2 Regression Layer

This layer calculates the regression coefficients, i.e. the offsets for the bounding box coordinates. Since the background class does not have a box, we ignore this class. For this task we use the loss

\[
\mathrm{Loss}_{loc}\left(C^{u}, v\right) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(C_i^{u} - v_i\right)
\]
in which

\[
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1,\\
|x| - 0.5 & \text{otherwise.}
\end{cases}
\]
The L1 loss is less sensitive to outliers than the L2 loss used in SPPnet and R-CNN; however, it is not smooth at the origin. To overcome this issue we use the smoothed form stated above.
The parameter λ controls the relative emphasis on classification versus regression. In most cases we use λ = 1, as we want good performance on both tasks.
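As a quick sketch, the smooth L1 loss above coincides with torch.nn.functional.smooth_l1_loss at its default beta = 1.0; the coefficient values below are made up.

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([0.3, 2.0, -0.1])    # predicted regression coefficients
target = torch.tensor([0.5, 0.0, -0.1])  # regression targets

# Elementwise: 0.5 * x^2 when |x| < 1, |x| - 0.5 otherwise, then summed
loss = F.smooth_l1_loss(pred, target, reduction="sum")
print(loss)  # 0.5 * 0.2^2 + (2.0 - 0.5) + 0 = 1.52
```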
Chapter 3

Implementation Details and Applications

3.1 Feature Sharing for Backbones in RPN and Fast R-CNN
Since the RPN and the detector share a backbone, the two networks are trained with a four-step alternating scheme. In the first step, the RPN is initialized with an ImageNet-pre-trained model and fine-tuned for the region proposal task. In the second step, we consider a separate Fast R-CNN network and initialize it with the weights of an ImageNet-pre-trained model. This network is trained using the proposals generated in step 1 and fine-tuned for the object detection task. In the third step, we begin by initializing the RPN with the detector network’s weights and fine-tune it; in this step, only the layers unique to the RPN are fine-tuned, while the shared layers are kept fixed. At this stage the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the layers unique to Fast R-CNN.
An alternative is approximate joint training, where the RPN and Fast R-CNN are merged into a single network during training:

(a) In the forward pass, region proposals are generated and used for training the Fast R-CNN detector.

(b) In the backward propagation step, the losses from both networks are combined for the shared convolution layers. Other steps remain as usual.
This approach does not take the derivative with respect to the proposal boxes’ coordinates into account, which reduces its accuracy. In practice, however, it has produced close results while reducing training time by about 25-50%.
3.2 Algorithm for Faster R-CNN
3.3 Applications
We use the Faster R-CNN model for object detection and classification on real-life images. The model has a ResNet-50 backbone and has been trained on the COCO dataset. The MS COCO (Microsoft Common Objects in Context) dataset is an image recognition dataset released in 2014 and updated in 2017. It has been used in multiple competitions and is a benchmark used to test various models.
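A minimal sketch of this setup, assuming torchvision’s COCO-pre-trained Faster R-CNN (the ResNet-50 FPN variant) and a placeholder image path:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("campus_photo.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

keep = prediction["scores"] > 0.8  # confidence threshold
print(prediction["boxes"][keep])   # (x1, y1, x2, y2) per detection
print(prediction["labels"][keep])  # COCO category ids
```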
3.3.1 Object Detection in Action

Figure 3.1: Photographs taken on a smartphone in IIT Guwahati
The images in Fig. 3.1 were taken in IIT Guwahati and shot on a smartphone camera. They capture a variety of scenes that one encounters while travelling around the campus.
3.3.2 Attendance Counting

We pass a group photograph through the Faster R-CNN model to get an estimate of the number of people present.
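A sketch of the counting step, reusing the prediction dictionary from the inference snippet above; in the COCO label map, category id 1 is “person”.

```python
def count_people(prediction, score_threshold=0.8):
    """Count detections labelled 'person' (COCO category id 1)
    above a confidence threshold."""
    is_person = prediction["labels"] == 1
    confident = prediction["scores"] > score_threshold
    return int((is_person & confident).sum())

print(count_people(prediction))  # estimated number of people in the photo
```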
Chapter 4

Conclusion and Future Work
4.1 Conclusion
We have reviewed some of the most widely used state-of-the-art object detection networks, with the main emphasis on the Faster R-CNN model. The RPN is able to deliver good-quality proposals to the Fast R-CNN network, and the region proposal step is nearly cost-free.
4.2.1 Active Learning

This implies that many green data points will be mistakenly categorized as red, and vice versa. This skew arises because the data points selected for labelling were chosen poorly. For the image on the right, we instead choose a small subset of points for the logistic regression using an active learning query strategy. The new decision boundary is vastly superior because it can clearly distinguish between the two clusters. The classifier was able to produce an effective decision boundary thanks to the superior data points that were chosen for labelling.
Table 4.1: A scenario to understand the active learning strategies
where \( \hat{y} = \arg\max_{y} \mathrm{Prob}(y \mid x) \), i.e. the class label with the highest probability.
The difference between object1’s first and second most likely labels is 0.05 (0.4 - 0.35), whereas it is 0.77 for object2 (0.85 - 0.08). As a result, this margin-based sampling strategy will choose object1, the instance with the smaller margin, for labelling once again.
The instance with the highest entropy is sent for labelling. In our example, object1 has an entropy value of 0.469 and object2 a value of 0.228 (computed with base-10 logarithms), hence the learner will choose object1 once again.
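A small sketch reproducing these numbers; the full probability vectors (0.4, 0.35, 0.25) for object1 and (0.85, 0.08, 0.07) for object2 are assumed so as to complete the values quoted in the text, and the entropies use base-10 logarithms.

```python
import math

def margin(probs):
    """Margin sampling score: gap between the two most likely labels."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def entropy(probs):
    """Shannon entropy with base-10 logarithms."""
    return -sum(p * math.log10(p) for p in probs if p > 0)

object1 = [0.40, 0.35, 0.25]  # assumed full distribution
object2 = [0.85, 0.08, 0.07]  # assumed full distribution

print(margin(object1), margin(object2))    # 0.05, 0.77
print(entropy(object1), entropy(object2))  # ~0.469, ~0.229 (quoted as 0.228)
```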
In the next phase, we will see some of these AL techniques in action and explore the challenges remaining in this area.
Bibliography
[2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich
feature hierarchies for accurate object detection and semantic segmen-
tation. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 580–587, 2014.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyra-
mid pooling in deep convolutional networks for visual recognition. IEEE
transactions on pattern analysis and machine intelligence, 37(9):1904–
1916, 2015.
[5] Ying Li, Binbin Fan, Weiping Zhang, Weiping Ding, and Jianwei
Yin. Deep active learning for object detection. Information Sciences,
579:418–433, 2021.
[6] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[7] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You
only look once: Unified, real-time object detection. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages
779–788, 2016.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
Michael Bernstein, et al. Imagenet large scale visual recognition chal-
lenge. International journal of computer vision, 115(3):211–252, 2015.
[10] Karen Simonyan and Andrew Zisserman. Very deep convolutional net-
works for large-scale image recognition. arXiv preprint arXiv:1409.1556,
2014.