
OBJECT DETECTION USING SSD (Single Shot Detector)

ALGORITHM
University Project – II Report submitted to Presidency University, Bengaluru in
partial fulfillment of the requirements for the award of the Degree of

BACHELOR OF TECHNOLOGY

in

ELECTRONICS AND COMMUNICATION


ENGINEERING

by

Mr. Amarnath.M ID No.: 20171ECE0001


Ms. A.Keerthana ID No.: 20171ECE0002
Ms. Biragani Gurukeerthi ID No.: 20171ECE0045
Mr. Birru Avinash ID No.: 20171ECE0046
Ms. Kalavakuri Reeta Rose Sapphire ID No.: 20171ECE0115

Under the Guidance of

Ms. Diana Steffi.D D


Assistant Professor

DEPARTMENT OF ELECTRONICS AND COMMUNICATION


PRESIDENCY UNIVERSITY
(Private University established in Karnataka State by Act No. 41 of 2013)
Itgalpur, Rajanakunte, Yelahanka, Bengaluru – 560064
Website: www.presidencyuniversity.in

May, 2021
PRESIDENCY UNIVERSITY
Bengaluru
Department of Electronics and Communication Engineering

Certificate

This is to certify that the University Project – II work entitled “OBJECT

DETECTION USING SSD (Single Shot Detector) ALGORITHM” was carried out by
Mr. Amarnath.M (ID No. 20171ECE0001), Ms. A.Keerthana (ID No.
20171ECE0002), Ms. Biragani Gurukeerthi (ID No. 20171ECE0045), Mr. Birru
Avinash (ID No. 20171ECE0046) and Ms. Kalavakuri Reeta Rose Sapphire (ID No.
20171ECE0115), who are bonafide students of VIII Semester B.Tech. Electronics and
Communication Engineering at Presidency University. This is in partial fulfillment of
the course work in place of Professional Practice – II of the Bachelor of Technology at
Presidency University, Bengaluru, during the year 2020-2021.

Ms. Diana Steffi.D D Dr. Rajiv Ranjan Singh


Supervisor & Assistant Professor Professor & Head
Department of Electronics and Department of Electronics and
Communication Engineering Communication Engineering

Examiner-1 ______________________

Examiner-2 ______________________
DECLARATION

We do hereby declare that the Project Report entitled “Object Detection Using
SSD (Single Shot Detector) Algorithm” is a record of original work done by us
under the guidance of Ms. Diana Steffi.D D, Assistant Professor in the Department of
Electronics and Communication Engineering, Presidency University, Bengaluru. This
report is submitted by us in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in Electronics and Communication Engineering to
Presidency University, Bengaluru in the month of May, 2021. The results embodied in
this report have not been submitted to any other University or Institute for the award
of any degree or diploma.

Mr. Amarnath.M
ID No.: 20171ECE0001

Ms. A.Keerthana
ID No.: 20171ECE0002

Ms. Biragani Gurukeerthi


ID No.: 20171ECE0045

Mr. Birru Avinash


ID No.: 20171ECE0046

Ms. Kalavakuri Reeta Rose Sapphire
ID No.: 20171ECE0115
ACKNOWLEDGEMENT

We would like to express our sincere thanks to our guide, Ms. Diana Steffi.D D,
Assistant Professor, Department of Electronics and Communication Engineering,
for her morale-boosting, meticulous guidance, co-operation and supervision
throughout this project work.

We owe our heartiest gratitude to Dr. Rajiv Ranjan Singh, Head of the
Department of Electronics and Communication Engineering, for his
encouragement during the progress of this project work.

We would also like to pay our sincere thanks to Dr. Abdul Sharief, Dean, School of
Engineering for sharing his valuable experience in completing project work.

We would like to convey our sincere thanks to the Management of our university for
providing the required infrastructure within the college campus.

We would also like to thank all of our juniors, classmates and friends for their
valuable suggestions to complete our project work in time.

Last but not least, we would like to thank our parents for always standing beside us
and encouraging us all the time.

Date: 15-05-2021
Place: Presidency University, Bengaluru

Mr. Amarnath.M ID No.: 20171ECE0001
Ms. A.Keerthana ID No.: 20171ECE0002
Ms. Biragani Gurukeerthi ID No.: 20171ECE0045
Mr. Birru Avinash ID No.: 20171ECE0046
Ms. Kalavakuri Reeta Rose Sapphire ID No.: 20171ECE0115
ABSTRACT

A few years ago, the creation of software and hardware image processing systems
was mainly limited to the development of the user interface, which most of the
programmers of each firm were engaged in. The situation changed significantly with
the advent of the Windows operating system, when the majority of developers
switched to solving the problems of image processing itself. However, this has not
yet led to cardinal progress in solving typical tasks of recognizing faces, car
numbers, road signs, analyzing remote and medical images, etc. Each of these
"eternal" problems is solved by trial and error through the efforts of numerous groups
of engineers and scientists. As modern technical solutions turn out to be excessively
expensive, the task of automating the creation of software tools for solving
intellectual problems has been formulated and is being intensively pursued abroad. In
the field of image processing, the required toolkit should support the analysis and
recognition of images of previously unknown content and ensure the effective
development of applications by ordinary programmers, just as the Windows toolkit
supports the creation of interfaces for solving various applied problems.

Object recognition describes a collection of related computer vision tasks that
involve identifying objects in digital photographs. Image classification
involves predicting the class of one object in an image. Object
localization refers to identifying the location of one or more objects in an image and
drawing a bounding box around their extent. Object detection combines these two
tasks: it localizes and classifies one or more objects in an image.
When a user or practitioner refers to the term “object recognition”, they often mean
“object detection”. It may be challenging for beginners to distinguish between the
different related computer vision tasks.
CONTENTS
Title Page No.
CERTIFICATE
DECLARATION I
ACKNOWLEDGEMENT II
ABSTRACT III
CONTENTS V - XII
LIST OF TABLES XIII - IX
LIST OF FIGURES X - XI
CHAPTER 1 INTRODUCTION TO OBJECT DETECTION 1-7
USING SSD ALGORITHM
1.1 Project Introduction 1
1.2 Single shot Detector 1-2
1.2.1 Single Shot 2
1.2.2 Detector 2-3
1.3 Non-Neural and Neural Approaches 3
1.4 Objective 3-4
1.5 Working 5-6
1.6 Performance 6-7
1.7 Motivation 7
CHAPTER 2 LITERATURE SURVEY 8-14
CHAPTER 3 PROPOSED METHODOLOGY 15 - 20
3.1 ResNet 15
3.2 R-CNN 15-16
3.2.1 Problems with R-CNN 16
3.3 Fast R-CNN 17
3.4 Faster R-CNN 17-18
3.5 YOLO 18-19
3.6 SSD
CHAPTER 4 SOFTWARE REQUIREMENTS 21-22
4.1 Installation Of Tensorflow 21
4.2 Installation Of Numpy 21-22
4.3 Installation Of OpenCV 22
4.4 Installation Of Matplotlib 22
CHAPTER 5 RESULTS AND DISCUSSIONS 23-32
5.1 Python Code for Object detection in Images 23-26
5.2 Python Code for Object detection in Videos 27-31
5.3 Python Code for Object detection using 32
Webcam
CHAPTER 6 CONCLUSIONS AND FUTURE SCOPE 33-34
REFERENCES 35-36
LIST OF FIGURES

Figure No. Figure Caption Page No.

Fig. 1.1 Objects detected with OpenCV’s Deep Neural Network by using 2
a YOLOv3 model trained on the COCO dataset, capable of detecting
objects of 80 common classes

Fig. 1.2 Architecture of a convolutional neural network with a SSD 4


detector

Fig. 1.3 Feature Extractor of Yolo v2, Yolo v3, SSD 5

Fig. 3.1 R-CNN-Regions with CNN features 15


Fig. 3.2 R-CNN 16

Fig. 3.3 Fast R-CNN 17

Fig. 3.4 Comparison of Object Detection Algorithms 17

Fig. 3.5 Faster R-CNN 18

Fig. 3.6 Comparison of test-time speed of object detection algorithm 18

Fig. 3.7 YOLO Object Detection 19

Fig. 3.8 SSD(Single Shot Detector) Architecture 19

Fig. 3.9 SSD Object Detection 20

Fig 5.1.1 Reading the Image-1 24

Fig 5.1.2 Detection and Labelling of Image-1 24

Fig 5.1.3 Reading the Image-2 24

Fig 5.1.4 Detection and Labelling of Image-2 25

Fig 5.1.5 Reading the Image-3 25

Fig 5.1.6 Detection and Labelling Of Image-3 25

Fig 5.1.7 Reading the Image-4 26

Fig 5.1.8 Detection and Labelling Of Image-4 26

Fig 5.2.1 Detection and Labelling of video part-1 28


Fig 5.2.2 Detection and Labelling of video part-2 29

Fig 5.2.3 Detection and Labelling of video part-3 29

Fig 5.2.4 Detection and Labelling of video part-4 30

Fig 5.2.5 Detection and Labelling of video part-5 30

Fig 5.2.6 Detection and Labelling of video part-6 31

Fig 5.3 Object detection using webcam 32


Chapter 1: INTRODUCTION TO OBJECT DETECTION
USING SSD (Single Shot Detector) ALGORITHM

1.1 Project Introduction

Object detection is a computer technology related to computer vision and image


processing that deals with detecting instances of semantic objects of a certain class
(such as humans, buildings, or cars) in digital images and videos. Well-researched
domains of object detection include face detection and pedestrian detection. Object
detection has applications in many areas of computer vision, including image retrieval
and video surveillance.

Deep learning has achieved great success in image classification, object detection,

semantic segmentation and natural language processing. A few years ago, by
exploiting some of the leaps made possible in computer vision via CNNs, researchers
developed R-CNNs to deal with the tasks of object detection, localization and
classification. An R-CNN is a special type of CNN that is able to locate and detect
objects in images; the output is generally a set of bounding boxes that closely match
each of the detected objects, as well as a class output for each detected object. Many
improved methods based on the R-CNN, such as Fast R-CNN and Faster R-CNN,
emerged in the object detection area. These methods achieve high accuracy, but their
network structures are relatively complex.

1.2 Single Shot Detector (SSD)

SSD is also part of the family of networks that predict the bounding boxes of
objects in a given image. It is a simple, end-to-end single network that, at the time of
its publication, removed many steps involved in other networks attempting the same
task. It works better than the then state-of-the-art Faster R-CNN on
higher-dimensional images.

The fundamental concept of SSD is based on the feedforward convolutional

network. It discretizes the output space of bounding boxes into a set of default
boxes over different aspect ratios and scales per feature map location. It then
generates scores for the presence of each object class in each default box and
produces adjustments to the boxes to better match the object shape.

The SSD model is comprised of mainly two structures: a base network and an
auxiliary network. The base network is the early part of the model, based on a
standard architecture used for high-quality image classification. The auxiliary
network has features mainly focused on objects with different scales or aspect ratios,
offering three useful capabilities: multi-scale feature maps for detection,
convolutional predictors for detection, and six aspect ratios of detection boxes. SSD
was released at the end of November 2016 and reached new records in terms of
performance and precision for object detection tasks, scoring over 74% mAP (mean
Average Precision) at 59 frames per second on standard datasets such as Pascal VOC
and COCO. To better understand SSD, let’s start by explaining where the name of
this architecture comes from:

1.2.1 Single Shot

This means that the tasks of object localization and classification are done in a single
forward pass of the network.

1.2.2 Detector

The network is an object detector that also classifies the detected objects. The
bounding box regression technique of SSD is inspired by Szegedy’s work on
MultiBox, a method for fast class-agnostic bounding box coordinate proposals.
Interestingly, the work done on MultiBox uses an Inception-style convolutional
network with 1x1 convolutions, which help with dimensionality reduction: the
number of channels goes down, but width and height remain the same.
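
To make the 1x1 trick concrete, here is a minimal sketch (assumed shapes; TensorFlow is used since the project installs it): the channel count drops while the spatial size is untouched.

import tensorflow as tf

x = tf.random.normal((1, 38, 38, 512))                     # feature map: 38x38 cells, 512 channels
squeeze = tf.keras.layers.Conv2D(128, kernel_size=1)(x)    # 1x1 convolution
print(squeeze.shape)                                       # (1, 38, 38, 128): width/height unchanged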

Fig 1.1 Objects detected with OpenCV's Deep Neural Network module (dnn) by
using a YOLOv3 model trained on the COCO dataset, capable of detecting objects
of 80 common classes.
Object detection is widely used in computer vision tasks such as image annotation,
activity recognition, face detection, face recognition, and video object
co-segmentation. It is also used in tracking objects, for example tracking a ball
during a football match, tracking the movement of a cricket bat, or tracking a person
in a video. Every object class has its own special features that help in classifying the
class – for example, all circles are round. Object class detection uses these special
features. For example, when looking for circles, objects that are at a particular
distance from a point (i.e. the center) are sought. Similarly, when looking for squares,
objects that are perpendicular at corners and have equal side lengths are needed. A
similar approach is used for face identification, where eyes, nose, and lips can be
found, and features like skin color and the distance between the eyes can be used.

Methods for object detection generally fall into either neural network-based or
non-neural approaches. Non-neural approaches make it necessary to first define
features using one of the methods below and then use a technique such as a
support vector machine (SVM) to do the classification. Neural techniques, on the
other hand, are able to do end-to-end object detection without specifically defining
features, and are typically based on convolutional neural networks (CNNs).
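
As an illustration of the non-neural pipeline, here is a minimal sketch (an assumption for illustration, not part of this project's code) using scikit-image's HOG features and a scikit-learn SVM:

import numpy as np
from skimage.feature import hog          # pip install scikit-image
from sklearn.svm import LinearSVC        # pip install scikit-learn

def extract_hog(images):
    # images: iterable of equally sized grayscale arrays (e.g. 64x128 crops)
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

# X_train: list of grayscale crops, y_train: 1 = object present, 0 = background
# clf = LinearSVC().fit(extract_hog(X_train), y_train)
# prediction = clf.predict(extract_hog([candidate_window]))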

1.3 Non-Neural and Neural Approaches

Non-neural approaches:

Viola–Jones object detection framework based on Haar features
Scale-invariant feature transform (SIFT)
Histogram of oriented gradients (HOG) features

Neural network approaches:

Region proposals (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN)
Single Shot MultiBox Detector (SSD)
You Only Look Once (YOLO)
Single-Shot Refinement Neural Network for Object Detection (RefineDet)
RetinaNet
Deformable convolutional networks

1.4 Objective:

SSD is designed for object detection in real time. Faster R-CNN uses a region
proposal network to create boundary boxes and utilizes those boxes to classify
objects. While it is considered the state-of-the-art in accuracy, the whole process runs
at 7 frames per second, far below what real-time processing needs. SSD speeds up
the process by eliminating the need for the region proposal network. To recover the
drop in accuracy, SSD applies a few improvements, including multi-scale features
and default boxes. These improvements allow SSD to match Faster R-CNN’s
accuracy using lower resolution images, which further pushes the speed higher.
According to the following comparison, it achieves real-time processing speed and
even beats the accuracy of Faster R-CNN. (Accuracy is measured as the mean
average precision mAP: the precision of the predictions.)

The SSD object detection is composed of 2 parts:
Extract feature maps, and
Apply convolution filters to detect objects.

SSD does not use a delegated region proposal network. Instead, it resorts to a very
simple method. It computes both the location and class scores using small
convolution filters. After extracting the feature maps, SSD applies 3 × 3 convolution
filters for each cell to make predictions. (These filters compute the results just like
regular CNN filters.) Each filter outputs 25 channels: 21 scores for the classes plus
four values for one boundary box (detail on the boundary box later).
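
A minimal sketch of such a prediction head (assumed shapes; TensorFlow used for illustration): a single 3 × 3 convolution over the feature map emits, for each cell and each of k default boxes, 21 class scores plus 4 box values.

import tensorflow as tf

num_classes, boxes_per_cell = 21, 4
feature_map = tf.random.normal((1, 38, 38, 512))   # e.g. a 38x38 feature map
head = tf.keras.layers.Conv2D(boxes_per_cell * (num_classes + 4),
                              kernel_size=3, padding='same')
preds = head(feature_map)
print(preds.shape)   # (1, 38, 38, 100): 38*38*4 = 5776 box predictions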

Fig 1.2 Architecture of a convolutional neural network with an SSD detector

Image classification in computer vision takes an image and predicts the object in the
image, while object detection not only predicts the objects but also finds their
locations in terms of bounding boxes. For example, when we build a swimming pool
classifier, we take an input image and predict whether it contains a pool, while an
object detection model would also tell us the location of the pool.

For illustrative purposes, assuming there is at most one class and one object in an
image, the output of an object detection model should include:
Probability that there is an object,
Height of the bounding box,
Width of the bounding box,
Horizontal coordinate of the center point of the bounding box,
Vertical coordinate of the center point of the bounding box.
This is just one convention for specifying the output. Different models and
implementations may use different formats, but the idea is the same: to output the
probability and the location of the object.
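
For instance, such an output might look like the following (all values are hypothetical and purely illustrative):

# One detection, following the five quantities listed above
detection = {
    "objectness": 0.92,   # probability that an object is present
    "h": 0.41,            # box height, normalized to image height
    "w": 0.30,            # box width, normalized to image width
    "cx": 0.48,           # horizontal center (normalized)
    "cy": 0.55,           # vertical center (normalized)
}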

1.5 Working

SSD has two components: a backbone model and an SSD head. The backbone model
is usually a pre-trained image classification network used as a feature extractor. This
is typically a network like ResNet trained on ImageNet, from which the final fully
connected classification layer has been removed. We are thus left with a deep neural
network that is able to extract semantic meaning from the input image while
preserving its spatial structure, albeit at a lower resolution. For ResNet34, the
backbone produces 256 7x7 feature maps for an input image. We will explain what a
feature and a feature map are later on. The SSD head is just one or more
convolutional layers added to this backbone; its outputs are interpreted as the
bounding boxes and classes of objects in the spatial locations of the final layer's
activations.
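
A minimal sketch of the backbone idea (an illustration under assumptions: ResNet50 is used here because it ships with Keras, whereas the ResNet34 mentioned above does not):

import tensorflow as tf

# ImageNet classifier with its classification head removed = feature extractor
backbone = tf.keras.applications.ResNet50(include_top=False,
                                          input_shape=(224, 224, 3))
features = backbone(tf.random.normal((1, 224, 224, 3)))
print(features.shape)   # (1, 7, 7, 2048): spatial structure kept at low resolution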

The SSD network ran faster and performed better than YOLO. As mentioned, the
increased speed in comparison to the Faster R-CNN model was due to the elimination
of bounding box proposals and subsampling of the image. SSD also uses anchor
boxes at various aspect ratios, similar to Faster R-CNN, and learns offsets rather than
learning the box itself. To handle scale, SSD predicts bounding boxes after multiple
convolutional layers. We present a method for detecting objects in images using a
single deep neural network. Our approach, named SSD, discretizes the output space
of bounding boxes into a set of default boxes over different aspect ratios and scales
per feature map location.

Fig 1.3 Feature extractors of YOLO v2, YOLO v3, and SSD

SSD object detection extracts feature maps using a base deep learning network (a
CNN-based classifier) and applies convolution filters to finally detect objects. Our
implementation uses MobileNet as the base network (others might include VGGNet,
ResNet, DenseNet).

As wireless network technology continues to evolve, it has brought great
convenience to people's life and work with its powerful technical capabilities.
Wireless networks have gradually become the mainstream of people's online life. At
the same time, the advent of 5G networks will further enable the greater development
and more advanced applications of wireless network technology. Future generations
of wireless networks will provide strong support for related applications such as the
Internet of Things (IoT) and virtual reality (VR). Many of these applications connect
to each other and transmit information within networks based on the detection of
specific target objects. In order to achieve a comprehensive network connection
between people and people, things and people, and things and things, one of the key
tasks of future applications is to identify targets in real time in wireless networks.

1.6 Performance

Based on recent advances in deep learning for image processing, the proposed
scheme uses multiple images and detects the objects in these images, labeling them
with their respective class labels. These images can come from videos which are fed
into the model we prepared, and the training of the model continues until the error
rate is reduced to an acceptable level. To speed up the computational performance of
the object detection technique, we have used an improved single shot multibox
detector (SSD) algorithm along with the faster region convolutional neural network.
We also conduct experiments to check the accuracy of our proposed method in
detecting objects with different parameters, including loss function, mean average
precision (mAP), and frames per second. The experimental results demonstrate that
the proposed model achieves high performance in accurately detecting objects for
real-time applications.


Specifically, this research contributes to the existing literature by improving the
accuracy of the SSD algorithm for detecting smaller objects. The SSD algorithm
works well in detecting large objects but is less accurate in detecting smaller objects.
Hence, we modify the SSD algorithm to achieve acceptable accuracy for detecting
smaller objects. The images or scenes are taken from web cameras, and we have used
the Pascal Visual Object Classes (VOC) and Common Objects in Context (COCO)
datasets to carry out experiments. We capture object detection (OD) datasets from
our center's image processing lab. We make use of different libraries to form the
network and use TensorFlow-GPU.

So SSD is designed for object detection in real time. Faster R-CNN uses a region

proposal network to create boundary boxes and utilizes those boxes to classify
objects, as mentioned above. SSD is a single-stage object detection method that
discretizes the output space of bounding boxes into a set of default boxes over
different aspect ratios and scales per feature map location. At prediction time, the
network generates scores for the presence of each object category in each default box
and produces adjustments to the box to better match the object shape. Additionally,
the network combines predictions from multiple feature maps with different
resolutions to naturally handle objects of various sizes.

The fundamental improvement in speed comes from eliminating bounding box


proposals and the subsequent pixel or feature resampling stage. Improvements over
competing single-stage methods include using a small convolutional filter to predict
object categories and offsets in bounding box locations, using separate predictors
(filters) for different aspect ratio detections, and applying these filters to multiple
feature maps from the later stages of a network in order to perform detection at
multiple scales.

1.7 Motivation

Object detection differs from image classification in a few ways. First, while a
classifier outputs a single category per image, an object detector must be able to
recognize multiple objects in a single image. Technically, this task is called multiple
object detection, but most research in the area addresses the multiple object setting,
so we will abuse terminology just a little. Second, while classifiers need only output
probabilities over classes, object detectors must output both probabilities of class
membership and the coordinates that identify the locations of the objects.

Chapter 2: LITERATURE SURVEY

An Artificial Neural Network (ANN) is a type of artificial intelligence that attempts

to simulate the way a human brain works. Rather than using a digital model, in which
all computations manipulate zeros and ones, a neural network works by creating
connections between processing elements, the computer equivalent of neurons. An
ANN is configured for a specific application, such as pattern recognition or data
classification, through a learning process. Learning in biological systems involves
adjustments to the synaptic connections that exist between the neurons. This is true
for ANNs as well.

Why Artificial Neural Networks?


1. Adaptive Learning: An ability to learn how to do tasks based on the data given for
training or initial experience.
2. Self-Organisation: An ANN can create its own organisation or representation of
the information it receives during learning time.
3. Real-time Operation: ANN computations may be carried out in parallel, and
special hardware devices are being designed and manufactured which take advantage
of this capability.
4. Fault Tolerance via Redundant Information Coding: Partial destruction of a
network leads to a corresponding degradation of performance. However, some
network capabilities may be retained even with major network damage.

OBJECT DETECTION TECHNIQUES:-


Images of objects from a particular class are highly variable. One source of variation
is the actual imaging process. Changes in illumination, changes in camera position, as
well as digitization artifacts, all produce significant variations in image appearance,
even in a static scene. The second source of variation is the intrinsic appearance
variability of objects within a class, even assuming no variation in the imaging
process. Object detection involves detecting instances of objects from a particular
class in an image.

Object detection in images using artificial neural networks and improved binary
gravitational search algorithm:-
In this paper, an Artificial Neural Network (ANN) and the Improved Binary
Gravitational Search Algorithm (IBGSA) have been used to detect objects in images.
The watershed algorithm is used to segment images, and colour, texture and
geometric features are extracted from each object. IBGSA is utilized as a search
technique to locate the best subset of features for recognizing the desired items. The
purpose of using IBGSA is to reduce complexity by selecting salient features.

Object recognition is difficult against cluttered backgrounds, where objects can

appear in various poses and lighting conditions. Part-based techniques encode
structure by utilizing an arrangement of patches covering essential parts of an object.

In 3D ECDS, the edges of different objects are segregated and the spatial relations
within the same object are kept as well. Another method of object detection combines
the feature reduction and feature extraction of PCA and AdaBoost.

Object detection is the identification of an object in an image along with its
localisation and classification. It has widespread applications and is a critical
component for vision-based software systems. This paper performs a rigorous survey
of modern object detection algorithms that use deep learning. As part of the survey,
the topics explored include various algorithms, quality metrics, speed/size trade-offs
and training methodologies. The paper focuses on two types of object detection
algorithms: the SSD class of single-step detectors and the Faster R-CNN class of
two-step detectors. Techniques to construct detectors that are portable and fast on
low-powered devices are also addressed by exploring new lightweight convolutional
base architectures. Ultimately, a rigorous review of the strengths and weaknesses of
each detector leads us to the present state of the art.

Object detection can also be done using deep neural networks, especially
convolutional neural networks. Object detection was previously done using only
conventional deep convolutional neural networks, whereas using region-based
convolutional networks increases the accuracy and also decreases the time required to
complete the task. The dataset used is PASCAL VOC 2012, which contains 20 labels.
The dataset is very popular in image recognition, object detection and other image
processing problems. Supervised learning is also possible for this problem using
decision trees or, more likely, SVMs. But neural networks work best in image
processing because they can handle images well.

Focal Loss for Dense Object Detection:-


The highest accuracy object detectors to date are based on a two-stage approach
popularized by R-CNN, where a classifier is applied to a sparse set of candidate
object locations. In contrast, one-stage detectors that are applied over a regular, dense
sampling of possible object locations have the potential to be faster and simpler, but
have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate
why this is the case. We discover that the extreme foreground-background class
imbalance encountered during training of dense detectors is the central cause. We
propose to address this class imbalance by reshaping the standard cross entropy loss

such that it down-weights the loss assigned to well-classified examples. Our novel
Focal Loss focuses training on a sparse set of hard examples and prevents the vast
number of easy negatives from overwhelming the detector during training. To
evaluate the effectiveness of our loss, we design and train a simple dense detector we
call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is
able to match the speed of previous one-stage detectors while surpassing the accuracy
of all existing state-of-the-art two-stage detectors. This paper pushes the envelope
further: we present a one-stage object detector that, for the first time, matches the
state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature
Pyramid Network (FPN) or Mask R-CNN variants of Faster R-CNN. To achieve this
result, we identify class imbalance during training as the main obstacle impeding
one-stage detectors from achieving state-of-the-art accuracy and propose a new loss
function that eliminates this barrier. SSD is also part of the family of networks which
predict the bounding boxes of objects in a given image. It is a simple, end-to-end
single network, removing many steps involved in other networks attempting the
same task at the time of its publishing. It works better than the state-of-the-art
Faster R-CNN in cases of higher-dimensional images.
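
Returning to the focal loss described above, a minimal NumPy sketch of the idea (an illustration, not the paper's implementation) is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), which shrinks the loss contribution of well-classified examples:

import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # p: predicted probability of the positive class, y: label in {0, 1}
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)

# An easy example (p_t = 0.95) contributes far less than a hard one (p_t = 0.3)
print(focal_loss(np.array([0.95, 0.3]), np.array([1, 1])))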

The fundamental concept of SSD is based on the feedforward convolutional network.
It discretizes the output space of bounding boxes into a set of default boxes
over different aspect ratios and scales per feature map location. It then generates
scores for the presence of each object class in each default box and produces
adjustments to better match the object shape.

The SSD model consists of mainly two structures: a base network and an auxiliary
network. The base network is the early part of the model, based on a standard
architecture used for high-quality image classification. The auxiliary network has
features mainly focused on objects with different scales or aspect ratios, offering
multi-scale feature maps for detection, convolutional predictors for detection, and six
aspect ratios of detection boxes. SSD was released at the end of November 2016 and
reached new records in terms of performance and precision for object detection
tasks, scoring over 74% mAP (mean Average Precision) at 59 frames per second on
standard datasets such as Pascal VOC and COCO. To better understand SSD,
let’s start by explaining where the name of this architecture comes from:
● Single Shot
This means that the tasks of object localization and classification are done in a single
forward pass of the network
● Multi-Box
This is the name of a technique for bounding box regression developed by Szegedy et
al. (we will briefly cover it shortly)
● Detector

The network is an object detector that also classifies those detected objects.

In Multi-Box, the bounding box regression technique of SSD is inspired by

Szegedy’s work on MultiBox, a method for fast class-agnostic bounding box
coordinate proposals. Interestingly, the work done on Multi-Box uses an
Inception-style convolutional network with 1x1 convolutions, which help with
dimensionality reduction: the number of channels goes down, but width and height
remain the same.

Some fields where SSD is combined with transfer learning include ship detection
using Chinese Gaofen-3 images and SAR target detection based on SSD with data
augmentation. For the automatic detection of objects in real-time surveillance videos,
a robust, scalable framework has been proposed for automatic detection of
abandoned, stationary objects that can pose a security threat. A background
modelling method is used to generate a long-term and a short-term background
model to extract foreground objects. Subsequently, a pixel-based FSM detects
stationary candidate objects based on the temporal transition of code patterns. In
order to classify the stationary candidate objects, the deep learning method (SSD) is
used to suppress false alarms, to remove stationary candidates other than the
suspected stationary objects, and to check that no person is near the suspected
detected objects for a particular time.

Accordingly, another method uses an Inception block to replace the extra layers in
SSD; this method, called Inception SSD (I-SSD), helps capture more information
without increasing the complexity. The I-SSD network architecture uses batch
normalization (BN) and residual structures. The proposed I-SSD algorithm achieves
78.6% mAP on the Pascal VOC2007 test set, and an Outdoor Object Detection
(OOD) dataset is used to testify to the effectiveness of the proposed I-SSD on the
platform of unmanned vehicles.

Object detection applications are easier to develop than ever before. Besides
significant performance improvements, these techniques have also been leveraging
massive image datasets to reduce the need for large datasets. In addition, with current
approaches focusing on full end-to-end pipelines, performance has also improved
significantly, enabling real-time use cases. SSD is widely used for different types of
applications. Some applications need accuracy and speed at the same time in order to
achieve the main objective. This can be done by getting more data, creating more
data, re-scaling the data, transforming the data or by feature selection. Further, these
models can be optimized by tuning the algorithms. In addition, the system will be
able to detect objects in the pathway.

The above-mentioned process is done while maintaining real-time speed, in addition
to producing promising detection results on small objects. The evaluation code and
trained models are publicly available and can be customized as required.
The work focuses on the following objectives:
1) To create new datasets consisting of real-time images and videos, and to train the
model developed using SSD for object detection.
2) To try our best to improve the accuracy when it comes to detecting small objects.
3) To maintain speed without sacrificing accuracy.
4) To use the model in video simulation of self-driving cars to detect pedestrians and
road curvature, and alert the driver regarding the turns.

Many surveys have been published recently over different aspects of the IoT:
middleware for IoT, event processing, multimedia big data, processing of multimedia
using deep learning, object detection models, comparison of image processing
datasets, etc. Before proceeding towards our review, it is necessary to analyze
existing surveys to understand the need for the presented work. The Internet of
Things is a well-understood field in the literature, continually growing and analyzed
as well as summarized in many surveys from time to time. Event recognition in
multimedia is another domain which is sometimes realized with or without IoT.
Deep learning-based surveys also incorporate many image recognition methods (like
object detection models) along the dimensions of their technical implementation,
while leaving out their performance in real-time applications. A few recent reviews
also focus on particular object detection models and visual datasets, but none of them
gives a comprehensive review of applying them in IoT or multimedia streaming,
which is covered by our work.

A survey on deep learning for IoT big data and streaming analytics focuses on
reviewing a wide range of deep neural network-based architectures and exploring
IoT based applications that benefit from DL algorithms. It serves as a guide to match
IoT applications with appropriate deep learning models. However, it does not go into
specific details of any deep learning models (like the CNNs in object detection
models) and their performance in different applications of smart cities. Considering
the importance of IoT in various domains of smart cities, another survey provides a
holistic view of middleware for IoT. It highlights the requirements of IoT
middleware and evaluates existing middleware solutions against those requirements.
Although it presents specific approaches of middleware for handling IoT data, it does
not consider the processing of multimedia data generated in smart cities. Event-based
middleware is one of the design approaches of existing middleware solutions for
IoT, which is currently applied in many application domains, including smart cities,
finance, medical services, telecommunications, entertainment, etc. A taxonomy of
distributed event-based programming systems is presented in the paper. The
taxonomy identifies a set of fundamental properties of event-based programming
systems and categorizes them according to the service structure and supported event
model.

Event services are further classified according to their organization and interaction
models, as well as other functional and non-functional features. These properties are
used in a hierarchical manner to define relationships between event systems, event
services, and event models. The introduced taxonomy is extensive, but more recent
event processing models need to be incorporated. An overview of the existing event
recognition methods is presented in a paper with a focus on deep learning
architectures in multimedia. Multimedia event-based analysis is categorized into four
groups, i.e. event recognition in single images, event recognition in personal photo
collections, event recognition in videos, and event recognition in audio recordings.
Specifically, it provides an extensive review of deep learning-based frameworks in
the context of event recognition. Moreover, it also emphasizes benchmark datasets in
order to validate event recognition methods. Thus it is the most relevant survey for
image recognition, but performance-based evaluations of deep learning models in
real-time IoT applications are still required.

The first research that initiates the vision of the Internet of Multimedia Things
(IoMT) and its standardization is presented in the literature. The paper establishes a
novel paradigm, “IoMT”, where multimedia things can interact with one another and
connect to the IoT to facilitate multimedia-based services while having applications
and users in the loop. It introduces an architecture for IoMT and presents its possible
use-cases and applications. It also identifies various requirements and challenges by
reviewing already existing technologies. Since this provides a global view, it is
necessary to narrow it down towards specific popular applications of IoMT like
image recognition in smart cities, and survey existing technologies with more
specific challenges and requirements. Wireless Multimedia Sensor Networks
(WMSNs) is the most common terminology used in the literature for processing
multimedia events to extract surrounding environmental information. Existing
hardware and communication protocol layer technologies have been reviewed, for
achieving the objectives of WMSNs, with associated technical challenges. The most
relevant work in the direction of multimedia stream processing appears in a survey
that discusses issues of multimedia big data (MMBD) computing in the context of
IoT and develops a comprehensive taxonomy. It also reviews literature covering
challenges associated with scalability, accessibility, data reliability, heterogeneity,
and Quality of Service (QoS). However, the concept of IoMT is realized only for data
acquisition and replaced with MMBD computing in IoT, and thus it does not cover
recent approaches of processing multimedia utilizing IoMT. Moreover, the presented
case study also needs to be evaluated on existing solutions.

As the Internet of Things (IoT) is a well-established term, many comprehensive
surveys have appeared in the literature. All of these are complete in their aspects in
covering primary elements of IoT, enabling technologies, challenges, vision, and
future directions. It is necessary to investigate existing multimedia-based methods to
benefit from existing IoT based technologies. However, multimedia processing is
still not well associated with IoT, and its surveys are either application-specific or
method-specific. As this work uses a case study of image recognition (i.e. object
detection) for processing multimedia using IoMT, we have analyzed recent surveys
related to object detection models and available datasets.

In the domain of image recognition, a comparative study of object detection

algorithms is presented. It attempts to find the best possible speed with maximum
accuracy by comparing different convolutional neural network-based object detection
models. It discusses three models, i.e. the Single Shot Detector (SSD), Faster R-CNN
(Region-based Convolutional Neural Networks), and R-FCN (Region-based Fully
Convolutional Networks), while leaving out modern object detection models like
YOLO, RetinaNet, etc. Similarly, another recent study also compares the
performance of only three object detection algorithms (SSD, Faster R-CNN, and
R-FCN) with different feature extractors (Inception V2, ResNet-101, and
MobileNet-V1). An article reviews the recent literature related to deep convolutional
neural network (CNN) based object detection models. It covers not only the design
decisions of CNN-based models but also provides the challenges in going forward in
object detection and future directions to extend these detectors. Moreover, it also
presents a good overview of existing datasets along with classification and
state-of-the-art algorithms. Another similar study provides a detailed analysis of deep
learning-based object detection models, handling different sub-problems like
occlusion, clutter, and low resolution. This paper also briefly reviews three common
tasks: salient object detection, face detection, and pedestrian detection, and finally
concludes with promising future directions for understanding the object detection
landscape.

To achieve a balance of speed/memory/accuracy in modern convolutional object

detectors, a unified implementation of Faster R-CNN, R-FCN, and SSD is shown,
obtained by changing feature extractors and critical hyperparameters. Most of the
experiments in this study were conducted on the Microsoft COCO dataset. This paper
can be of great help for practitioners in choosing an appropriate method for
deploying object detection models in real-world scenarios. A detailed comparison of
visual datasets for machine learning concerning size, location, and contextual
information is also presented. Although it gives a new approach to create datasets, its
main focus is on seven object detection datasets, namely PASCAL VOC, ImageNet,
SUN, INRIA, KITTI, Caltech, and Microsoft COCO.

Chapter 3: PROPOSED METHODOLOGY

3.1 ResNet

To train the network model in a more effective manner, we herein adopt the same
strategy as that used for DSSD (the performance of the residual network is better
than that of the VGG network). The goal is to improve accuracy. The first
modification implemented was the replacement of the VGG network used in the
original SSD with ResNet. We also add a series of convolutional feature layers at the
end of the underlying network. These feature layers gradually decrease in size,
allowing prediction of detection results at multiple scales. However, when the input
size is 300 or 320, although ResNet-101 is deeper than VGG-16, experiments show
that replacing SSD's underlying convolutional network with it does not improve
accuracy but rather decreases it.

3.2 R-CNN

To circumvent the problem of selecting a huge number of regions, Ross Girshick et
al. proposed a method that uses selective search to extract just 2000 regions from the
image, which he called region proposals. Therefore, instead of trying to classify a
huge number of regions, you can work with just 2000. These 2000 region proposals
are generated using the selective search algorithm, which is outlined below (a sketch
using OpenCV's implementation follows the list).

1. Generate the initial sub-segmentation, producing many candidate regions.
2. Use a greedy algorithm to recursively combine similar regions into larger ones.
3. Use the generated regions to produce the final candidate region proposals.
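
A minimal sketch using OpenCV's contrib implementation of selective search (an assumption for illustration: it requires the opencv-contrib-python package, and 'image.jpg' is a placeholder path):

import cv2   # pip install opencv-contrib-python

img = cv2.imread('image.jpg')
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
rects = ss.process()              # candidate boxes as (x, y, w, h)
print(len(rects), 'region proposals')
proposals = rects[:2000]          # R-CNN keeps roughly the top 2000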

Fig 3.1 R-CNN: Regions with CNN features
These 2000 candidate region proposals are warped into a square and fed into a
convolutional neural network that produces a 4096-dimensional feature vector as
output. The CNN plays the role of feature extractor; the output dense layer consists
of the features extracted from the image, and the extracted features are fed into an
SVM to classify the presence of the object within the candidate region proposal. In
addition to predicting the presence of an object within the region proposals, the
algorithm also predicts four values, which are offset values for increasing the
precision of the bounding box.

For example, given a region proposal, the algorithm might have predicted the
presence of a person, but the face of that person within that region proposal could
have been cut in half. Therefore, the predicted offset values help in adjusting the
bounding box of the region proposal.

Fig 3.2 R-CNN

3.2.1 Problems with R-CNN

It still takes a huge amount of time to train the network, as you would have to
classify 2000 region proposals per image. It cannot be implemented in real time, as it
takes around 47 seconds for each test image. The selective search algorithm is a fixed
algorithm, so no learning is happening at that stage. This could lead to the generation
of bad candidate region proposals.

3.3 Fast R-CNN

Fig 3.3 Fast R-CNN

The same author of the previous paper (R-CNN) solved some of the drawbacks of
R-CNN to build a faster object detection algorithm, called Fast R-CNN.
The approach is similar to the R-CNN algorithm. But instead of feeding the region
proposals to the CNN, we feed the input image to the CNN to generate a
convolutional feature map. From the convolutional feature map, we identify the
region proposals and warp them into squares, and by using an RoI pooling layer we
reshape them into a fixed size so that they can be fed into a fully connected layer.
From the RoI feature vector, we use a softmax layer to predict the class of the
proposed region and also the offset values for the bounding box.
The reason “Fast R-CNN” is faster than R-CNN is that you don't have to feed 2000
region proposals to the convolutional neural network every time. Instead, the
convolution operation is done only once per image and a feature map is generated
from it.
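
A minimal sketch of the RoI pooling step (an assumption: torchvision is used purely for illustration, since the report does not name a library): arbitrarily sized regions on a shared feature map are pooled to one fixed size.

import torch
from torchvision.ops import roi_pool   # pip install torch torchvision

feature_map = torch.randn(1, 256, 50, 50)            # one image, 256 channels
rois = torch.tensor([[0, 10.0, 10.0, 40.0, 40.0]])   # (batch_idx, x1, y1, x2, y2)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)   # torch.Size([1, 256, 7, 7]): fixed size for the FC layer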

Fig 3.4 Comparison of Object Detection Algorithms

3.4 Faster R-CNN

Both of the above algorithms (R-CNN and Fast R-CNN) use selective search to find
the region proposals. Selective search is a slow and time-consuming process which
affects the performance of the network.

Fig 3.5 Faster R-CNN

Similar to Fast R-CNN, the image is provided as an input to a convolutional network

which produces a convolutional feature map. Instead of using the selective search
algorithm on the feature map to identify the region proposals, a separate network is
used to predict them. The predicted region proposals are then reshaped using an RoI
pooling layer, which is used to classify the image within the proposed region and
predict the offset values for the bounding boxes.

Presidency University, Bengaluru / Report on University Project - II / B.Tech. Electronics and Communication Engineering /
May, 2021

Page 18 of 40
Fig 3.6 Comparison of test-time speed of object detection algorithms

3.5 YOLO — You Only Look Once

All the previous object detection algorithms use regions to localize the object within
the image. The network does not look at the complete image, but only at parts of the
image which have high probabilities of containing an object. YOLO, or You Only
Look Once, is an object detection algorithm quite different from the region-based
algorithms seen above. In YOLO, a single convolutional network predicts the
bounding boxes and the class probabilities for these boxes.

YOLO works by taking an image and splitting it into an SxS grid; within each grid
cell we take m bounding boxes. For each bounding box, the network outputs a class
probability and offset values for the box. The bounding boxes with a class
probability above a threshold value are selected and used to locate the object within
the image.
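
As a back-of-the-envelope check of this output layout (grid and box counts taken from the original YOLO paper; the sketch itself is only illustrative):

S, m, C = 7, 2, 20          # grid size, boxes per cell, classes (original YOLO)
per_cell = m * 5 + C        # each box: 4 coordinates + 1 confidence; plus C class probs
print(S * S, 'cells x', per_cell, 'values =', S * S * per_cell)   # 49 x 30 = 1470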

Fig 3.7 YOLO Object Detection

YOLO is orders of magnitude faster (45 frames per second) than other object
detection algorithms. The limitation of the YOLO algorithm is that it struggles with
small objects within the image; for example, it might have difficulty identifying a
flock of birds. This is due to the spatial constraints of the algorithm.

3.6 SSD

The SSD object detection is composed of 2 parts:

1. Extract feature maps, and
2. Apply convolution filters to detect objects.

SSD uses VGG16 to extract feature maps. Then it detects objects using the Conv4_3
layer. For illustration, we draw Conv4_3 as 8 × 8 spatially (it should be 38 × 38). For
each cell in the image (also called a location), it makes 4 object predictions.

Fig 3.8 SSD (Single Shot Detector) Architecture


Each prediction is composed of a boundary box and 21 scores, one for each class
(one extra class for no object); we pick the highest score as the class for the bounded
object. Conv4_3 makes a total of 38 × 38 × 4 predictions: four predictions per cell
regardless of the depth of the feature maps. SSD reserves class “0” to indicate no
object (background).

SSD does not use a delegated region proposal network. Instead, it resorts to a very
simple method. It computes both the location and class scores using small
convolution filters. After extracting the feature maps, SSD applies 3 × 3 convolution
filters for each cell to make predictions. (These filters compute the results just like
regular CNN filters.) Each filter outputs 25 channels: 21 class scores plus four values
for one boundary box.

At the beginning, we described SSD detecting objects from a single layer. Actually,
it uses multiple layers (multi-scale feature maps) to detect objects independently. As
the CNN reduces the spatial dimension gradually, the resolution of the feature maps
also decreases. SSD uses lower-resolution layers to detect larger-scale objects. For
example, the 4 × 4 feature maps are used for larger-scale objects.

SSD adds 6 more auxiliary convolutional layers after VGG16. Five of them are used
for object detection, and in three of those layers we make 6 predictions per cell
instead of 4. In total, SSD makes 8732 predictions using 6 convolutional layers.
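
The 8732 figure can be checked directly from the per-layer grid sizes and boxes-per-cell counts of the original SSD300 (Liu et al., 2016):

# (grid size, default boxes per cell) for the six prediction layers of SSD300
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(s * s * k for s, k in layers)
print(total)   # 5776 + 2166 + 600 + 150 + 36 + 4 = 8732 predictions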

Fig 3.9 SSD Object Detection

Chapter 4: SOFTWARE REQUIREMENTS

1. Install Python on your computer system.
2. Install ImageAI and its dependencies like TensorFlow, NumPy, OpenCV, etc.
3. Download the object detection model file (RetinaNet).

Steps to be followed:
1) Download and install Python version 3 from the official Python Language website
https://python.org
2) Install the following dependencies via pip:

4.1 Installation Of Tensorflow:

TensorFlow is an open-source software library for dataflow and differentiable

programming across a range of tasks. It is a symbolic math library and is also used
for machine learning applications such as neural networks. It is used for both
research and production at Google. TensorFlow was developed by the Google Brain
team for internal Google use and was released under the Apache License 2.0 on
November 9, 2015. TensorFlow is Google Brain's second-generation system; version
1.0 of TensorFlow was released on February 11, 2017. While the reference
implementation runs on single devices, TensorFlow can run on multiple CPUs and
GPUs (with optional CUDA and SYCL extensions for general-purpose computing on
graphics processing units). TensorFlow is available on various platforms such as
64-bit Linux, macOS, Windows, and mobile computing platforms including Android
and iOS. The architecture of TensorFlow allows easy deployment of computation
across a variety of platforms (CPUs, GPUs, TPUs), from desktops and clusters of
servers to mobile and edge devices.

TensorFlow computations are expressed as stateful dataflow graphs. The name
TensorFlow derives from the operations that such neural networks perform on
multidimensional data arrays, which are referred to as tensors.

Command: pip install tensorflow
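
A quick illustration of TensorFlow's tensor operations (a minimal sketch, assuming
TensorFlow 2.x is installed):

import tensorflow as tf

# Operations on multidimensional arrays ("tensors") form a computation graph
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # a 2x2 tensor
b = tf.constant([[1.0], [1.0]])            # a 2x1 tensor
y = tf.matmul(a, b)                        # matrix-multiplication operation
print(y.numpy())                           # [[3.], [7.]]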

4.2 Installation Of NumPy:

NumPy is a library for the Python programming language that adds support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays. The ancestor of
NumPy, Numeric, was originally created by Jim Hugunin with contributions from
several other developers. In 2005, Travis Oliphant created NumPy by incorporating
features of the competing Numarray into Numeric, with extensive modifications.
NumPy is open-source software and has many contributors.

Command: pip install numpy
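
A small usage sketch (illustrative only) of NumPy's multi-dimensional arrays:

import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 matrix
print(m.shape)                        # (2, 3)
print(m * 2)                          # element-wise scaling
print(m.mean(axis=0))                 # column means: [2.5 3.5 4.5]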

4.3 Installation Of OpenCV:

OpenCV is a library of programming functions mainly aimed at real-time computer
vision. Originally developed by Intel, it was later supported by Willow Garage and
then Itseez. The library is cross-platform and free to use under the open-source
BSD license.

Command: pip install opencv-python
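
A minimal usage sketch (the file name is illustrative):

import cv2

img = cv2.imread('sample.jpg')                # load an image from disk
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # OpenCV stores images as BGR
cv2.imwrite('sample_gray.jpg', gray)          # save the grayscale version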

4.4 Installation Of Matplotlib:

Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits such as
Tkinter, wxPython, Qt, or GTK+.

Command: pip install matplotlib
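
A minimal usage sketch:

import matplotlib.pyplot as plt

plt.plot([0, 1, 2, 3], [0, 1, 4, 9])  # plot x against x squared
plt.xlabel('x')
plt.ylabel('x^2')
plt.savefig('plot.png')               # write the figure to a file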

4.5 Installation Of Keras:

Keras is an open-source neural-network library written in Python. It is capable of
running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML.
Designed to enable fast experimentation with deep neural networks, it focuses on
being user-friendly, modular, and extensible.

Command: pip install keras
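
A minimal usage sketch of Keras's user-friendly API (the layer sizes are arbitrary):

from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense

# A tiny feed-forward network: 10 inputs -> 32 hidden units -> 1 output
model = Sequential([
    Input(shape=(10,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()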

Chapter 5: RESULTS

5.1. Python Code for Object Detection in Images

import cv2                        # pip install opencv-python
import matplotlib.pyplot as plt   # pip install matplotlib

# Load the pre-trained SSD MobileNet v3 (COCO) model from its config and weights
config_file = 'ssd_mobilenet_v3_large_coco_2020_01_14.pbtxt'
frozen_model = 'frozen_inference_graph.pb'
model = cv2.dnn_DetectionModel(frozen_model, config_file)

# Read the COCO class labels, one per line
classLabels = []
file_name = 'labels.txt'
with open(file_name, 'rt') as fpt:
    classLabels = fpt.read().rstrip('\n').split('\n')
print(classLabels)
print(len(classLabels))

# Input pre-processing expected by this model
model.setInputSize(320, 320)
model.setInputScale(1.0 / 127.5)
model.setInputMean((127.5, 127.5, 127.5))
model.setInputSwapRB(True)

# Read the test image (.jpg/.jpeg/.png/.jfif formats are supported)
img = cv2.imread('image.jpg')
plt.imshow(img)

# Detect objects and draw labelled bounding boxes
classIndex, confidence, bbox = model.detect(img, confThreshold=0.5)
print(classIndex)
font_scale = 3
font = cv2.FONT_HERSHEY_PLAIN
for classInd, conf, boxes in zip(classIndex.flatten(), confidence.flatten(), bbox):
    cv2.rectangle(img, boxes, (255, 0, 0), 2)
    # detect() returns 1-based class indices, hence classInd - 1
    cv2.putText(img, classLabels[classInd - 1], (boxes[0] + 10, boxes[1] + 40),
                font, fontScale=font_scale, color=(0, 255, 0), thickness=3)
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

Output Images:

Fig: 5.1.1 Reading the Image-1

Fig: 5.1.2 Detection and labelling of Image-1

Fig: 5.1.3 Reading the Image-2

Fig: 5.1.4 Detection and labelling of Image-2

Fig: 5.1.5 Reading the Image-3

Fig: 5.1.6 Detection and labelling of Image-3

Fig: 5.1.7 Reading the Image-4

Fig: 5.1.8 Detection and labelling of Image-4

5.2. Python Code for Object Detection in Videos

import cv2                        # pip install opencv-python
import matplotlib.pyplot as plt   # pip install matplotlib

# Load the pre-trained SSD MobileNet v3 (COCO) model
config_file = 'ssd_mobilenet_v3_large_coco_2020_01_14.pbtxt'
frozen_model = 'frozen_inference_graph.pb'
model = cv2.dnn_DetectionModel(frozen_model, config_file)

# Read the COCO class labels, one per line
classLabels = []
file_name = 'labels.txt'
with open(file_name, 'rt') as fpt:
    classLabels = fpt.read().rstrip('\n').split('\n')
print(classLabels)
print(len(classLabels))

# Input pre-processing expected by this model
model.setInputSize(320, 320)
model.setInputScale(1.0 / 127.5)
model.setInputMean((127.5, 127.5, 127.5))
model.setInputSwapRB(True)

# Open the video file (.webm/.mp4); fall back to the default camera
cap = cv2.VideoCapture('video.mp4')
if not cap.isOpened():
    cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise IOError("Cannot open video")

font_scale = 3
font = cv2.FONT_HERSHEY_PLAIN

while True:
    ret, frame = cap.read()
    classIndex, confidence, bbox = model.detect(frame, confThreshold=0.5)
    print(classIndex)
    if len(classIndex) != 0:
        for classInd, conf, boxes in zip(classIndex.flatten(), confidence.flatten(), bbox):
            if classInd < 80:  # keep only valid COCO class indices
                cv2.rectangle(frame, boxes, (255, 0, 0), 2)
                cv2.putText(frame, classLabels[classInd - 1], (boxes[0] + 10, boxes[1] + 40),
                            font, fontScale=font_scale, color=(0, 255, 0), thickness=3)
    cv2.imshow('Object Detection Tutorial', frame)
    if cv2.waitKey(2) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Outputs:

Detection and labelling in video

Fig: 5.2.1 Detection and labelling of video part-1

Fig: 5.2.2 Detection and labelling of video part-2

Fig: 5.2.3 Detection and labelling of video part-3

Fig: 5.2.4 Detection and labelling of video part-4

Fig: 5.2.5 Detection and labelling of video part-5

Fig: 5.2.6 Detection and labelling of video part-6

5.3. Python Code for Object Detection Using a Webcam

import cv2                        # pip install opencv-python
import matplotlib.pyplot as plt   # pip install matplotlib

# Load the pre-trained SSD MobileNet v3 (COCO) model
config_file = 'ssd_mobilenet_v3_large_coco_2020_01_14.pbtxt'
frozen_model = 'frozen_inference_graph.pb'
model = cv2.dnn_DetectionModel(frozen_model, config_file)

# Read the COCO class labels, one per line
classLabels = []
file_name = 'labels.txt'
with open(file_name, 'rt') as fpt:
    classLabels = fpt.read().rstrip('\n').split('\n')
print(classLabels)
print(len(classLabels))

# Input pre-processing expected by this model
model.setInputSize(320, 320)
model.setInputScale(1.0 / 127.5)
model.setInputMean((127.5, 127.5, 127.5))
model.setInputSwapRB(True)

# Try an external webcam (index 1) first; fall back to the built-in camera
cap = cv2.VideoCapture(1)
if not cap.isOpened():
    cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise IOError("Cannot open webcam")

font_scale = 3
font = cv2.FONT_HERSHEY_PLAIN

while True:
    ret, frame = cap.read()
    classIndex, confidence, bbox = model.detect(frame, confThreshold=0.5)
    print(classIndex)
    if len(classIndex) != 0:
        for classInd, conf, boxes in zip(classIndex.flatten(), confidence.flatten(), bbox):
            if classInd < 80:  # keep only valid COCO class indices
                cv2.rectangle(frame, boxes, (255, 0, 0), 2)
                cv2.putText(frame, classLabels[classInd - 1], (boxes[0] + 10, boxes[1] + 40),
                            font, fontScale=font_scale, color=(0, 255, 0), thickness=3)
    cv2.imshow('Object Detection Tutorial', frame)
    if cv2.waitKey(2) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Fig: 5.3 Object Detection using webcam

Chapter 6: CONCLUSION AND FUTURE SCOPE

6.1 Conclusion:

We have seen object detection using the SSD model and OpenCV with the VGG
network, and we have understood how exactly SSD works with OpenCV. We also
implemented simple object detection using a pre-trained model on the COCO
dataset for images, videos, and a live webcam feed, with high accuracy.

6.2 Future Scope:

The object recognition system can be applied in areas such as surveillance, face
recognition, fault detection, and character recognition. The objective of this thesis is
to develop an object recognition system to recognize 2D and 3D objects in an
image. The performance of the object recognition system depends on the features
used and the classifier employed for recognition. This research work attempts to
propose a novel feature extraction method for extracting global features and
obtaining local features from the region of interest. It also attempts to hybridize
traditional classifiers to recognize the object. The object recognition system
developed in this research was tested with benchmark datasets such as COIL100,
Caltech 101, ETH80, and MNIST, and is implemented in MATLAB 7.5.

It is important to mention the difficulties observed during experimentation with the
object recognition system due to the several features present in an image. The
research work suggests that the image be preprocessed and reduced to a size of
128 x 128. The proposed feature extraction method helps to select the important
features; to improve the efficiency of the classifier, the number of features should be
kept small. Specifically, the contributions of this research work are as follows.

An object recognition system is developed that recognizes two-dimensional and
three-dimensional objects. The features extracted are sufficient for recognizing the
object and marking its location. The proposed classifier is able to recognize the
object at a lower computational cost. The proposed global feature extraction
requires less time compared to the traditional feature extraction method. The
performance of the SVM-kNN is greater and promising when compared with the
BPN and SVM. The performance of the one-against-one classifier is efficient.
Global features are extracted from the local parts of the image. The local feature
PCA-SIFT is computed from the blobs detected by the Hessian-Laplace detector.
Along with the local features, the width and height of the object, computed through
the projection method, are used. The methods presented for feature extraction and
recognition are general and can be applied to any application relevant to object
recognition. The proposed object recognition method combines the state-of-the-art
classifiers SVM and k-NN to recognize objects in the image: the multiclass SVM is
hybridized with k-NN for recognition. The feature extraction method proposed in
this research work is efficient and provides unique information to the classifier. The
image is segmented into 16 parts; from each part, Hu's moment invariants are
computed and converted into Eigen components. The local features of the image are
obtained using the Hessian-Laplace detector. This helps to obtain the object's
features easily and mark the object's location without much difficulty. As scope for
future enhancement, the features (local or global) used for recognition can be
increased to improve the efficiency of the object recognition system, and geometric
properties of the image can be included in the feature vector for recognition.
An unsupervised classifier could also be used instead of a supervised classifier for
recognition of the object. The proposed object recognition system uses grey-scale
images and discards the colour information; the colour information in the image can
also be used for recognition, and colour-based object recognition plays a vital role in
robotics. Although the visual tracking algorithm proposed here is robust under many
conditions, it can be made more robust by eliminating some of the limitations listed
below:

1. In single visual tracking, the size of the template remains fixed. If the size of the
object reduces with time, the background becomes more dominant than the object
being tracked, and the object may no longer be tracked.
2. A fully occluded object cannot be tracked and is considered a new object in the
next frame.
3. Foreground object extraction depends on binary segmentation, which is carried
out by applying threshold techniques; blob extraction and tracking therefore depend
on the threshold value.
4. Splitting and merging cannot be handled well in all conditions using a single
camera, due to the loss of information when a 3D object is projected onto 2D
images.
For night-time visual tracking, a night-vision mode should be available as a built-in
feature of the CCTV camera. To make the system fully automatic, and to overcome
the above limitations, multi-view tracking can be implemented in the future using
multiple cameras. Multi-view tracking has an obvious advantage over single-view
tracking because of its wide coverage range, with different viewing angles for the
objects to be tracked. In this thesis, an effort has been made to develop an algorithm
that provides a base for future applications such as those listed below. In this
research work, object identification and visual tracking have been done using an
ordinary camera. The concept is well extendable to applications such as intelligent
robots, automatic guided vehicles, enhancement of security systems to detect
suspicious behaviour along with detection of weapons, and identification of
suspicious movements of enemies on borders with the help of night-vision cameras,
among many other applications. In the proposed method, a background subtraction
technique has been used that is simple and fast. This technique is applicable where
there is no movement of the camera. For robotic applications or automated vehicle
assistance systems, the background changes continuously due to camera movement,
requiring different segmentation techniques such as single-Gaussian or
Gaussian-mixture models. Object identification with motion estimation needs to be
fast enough to be implemented in a real-time system, so there is still scope for
developing faster algorithms.
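
To illustrate the background subtraction step mentioned above, here is a minimal
OpenCV sketch (the input file name is illustrative, and a static camera is assumed,
as noted above):

import cv2

cap = cv2.VideoCapture('traffic.mp4')              # illustrative input video
subtractor = cv2.createBackgroundSubtractorMOG2()  # Gaussian-mixture background model

while True:
    ret, frame = cap.read()
    if not ret:
        break
    mask = subtractor.apply(frame)                 # per-frame foreground mask
    cv2.imshow('Foreground mask', mask)
    if cv2.waitKey(30) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()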

