

School of Computer Science and Engineering

Fast-Track Semester 2022


Single Model Multiple Object Detection


Submitted By

Varun Varatharajan 19BCE0538 (Team Leader)

Kush Dang 19BCE0573
Mihika Hingar 19BCE0948
Vrushali Deshmukh 19BCE0033
Nimish K. Aggarwal 19BCE0014
Rohan Sirohia 19BCE0984

Submitted to






VIT enrolls a large number of new students every year, and library usage
grows with the intake: many students enter and leave the VIT library every
day. As members of this institution, we have observed that wearing an ID
card and a mask is compulsory when entering the library. At present, a few
staff members check that nobody enters without them. Our solution reduces
this dependence on manual checking and addresses a genuine problem around
us that has received little research attention. We want to create a single
model that detects multiple objects - the ID card and the mask - and
confirms the student's entry inside

YOLO is an acronym for You Only Look Once. We employ Version 5
(YOLOv5), one of the most widely used real-time object detection models
at the time of writing. It is a convolutional neural network (CNN) that
detects objects in real time with high accuracy. The approach applies a
single neural network to the entire picture, divides the picture into
regions, and predicts bounding boxes and probabilities for each region.
These bounding boxes are weighted by the expected

Detecting a single object with the YOLOv5 architecture is a
straightforward job. One could also consider multi-class classification to
classify multiple objects. In real-time scenarios, however, an input frame
often contains several objects at once, so a single image needs multiple
labels. Moreover, the cost function for classifying into many object
classes is also high. Another option would be to train multiple models,
one for each object that we want to detect.

Although this does not compromise accuracy, it drastically affects speed
and memory, because every input frame must be passed sequentially through
each model so that each can detect its own object. Thus, there is a need
for an approach that compromises none of speed, accuracy and memory.

1) To create a dataset of students wearing a mask and/or an ID card.
2) To train a model that detects the mask and the ID card.
3) To run inference and check real-time scenarios.


| SNo. | Year | Title | Author | Insights |
|---|---|---|---|---|
| 1 | 2022 | A Study on Real Time Object Detection using Deep Learning | Pradyuman Tomar, Sameer Haider, Sagar | The paper introduces deep learning and well-known object detection systems such as CNN (Convolutional Neural Network), R-CNN, RNN (Recurrent Neural Network), and Faster RNN. It proposes an improved algorithm in which a convolutional network predicts the bounding boxes and the class probabilities for these boxes. The resources needed for deep learning are hard to accommodate, which prompts the need for such an algorithm. |
| 2 | 2022 | Real Time Multiple-Object Detection Based On Enhanced SSD [IEEE] | Tarun Jaiswal; Manju Pandey; Priyanka Tripathi | In this paper, a multiple-image detection approach is used to identify and locate objects with respect to their class labels, and an enhanced feature map is used for object detection. All this is possible due to recent advances in DL together with image processing. The images fed to the model come from a video source. To tackle the problem related to the aspect ratio, the model uses a multi-scale feature map together with a distinct filter for diverse default |
| 3 | 2021 | Implementation of YOLOv4 Algorithm for Multiple Object Detection in Image and Video Dataset using Deep Learning and Artificial Intelligence for Urban Traffic Video Surveillance | Umesh Parameshwar Naik; Varadi Rajesh; Rohith Kumar R; Mohana | YOLO, a SOTA object detector, is considered a smart convolutional neural network capable of detecting objects in an image, classifying them accordingly and localizing them with annotations. YOLOv4 is employed in this work for multiple object detection in images and video for traffic surveillance applications, trained using a custom dataset created from Indian road traffic images. |
| 4 | 2021 | Cross-Domain Multi-task Learning for Object Detection and Saliency Estimation [IEEE] | Apoorv Khattar, Srinidhi Hegde, Ramya Hebbalaguppe | This paper proposes a novel MTL framework, in the absence of a common annotated dataset, for joint estimation of two important downstream tasks in computer vision: object detection and saliency estimation. Unlike many state-of-the-art methods that rely on common annotated datasets for training, this paper uses annotations from different datasets to jointly train different tasks, calling this setting cross-domain MTL. The paper also shows that the proposed MTL network offers a 13% reduction in memory footprint due to parameter sharing between the related tasks. |
| 5 | 2020 | An improved object detection algorithm based on multi-scaled and deformable convolutional neural network | Danyang Cao, Zhixin Chen, Lei Gao | In this study, the authors compare and analyze mainstream object detection algorithms and propose a multi-scaled deformable convolutional object detection network to deal with the challenges faced by current methods. Their analysis demonstrates strong performance on par with, or even better than, state-of-the-art methods. They use deep convolutional networks to obtain multi-scaled features, and add deformable convolutional structures to overcome geometric |
The network architecture of YOLOv5 is as follows:

Image source: ResearchGate - A Forest Fire Detection System Based on Ensemble Learning

It consists of three parts:

1) Backbone: CSPDarknet
2) Neck: PANet
3) Head: Yolo Layer

The data are first input to CSPDarknet for feature extraction, and then fed
to PANet for feature fusion. Finally, Yolo Layer outputs detection results
(class, score, location, size).


● We will use the YOLOv5 object detection architecture in PyTorch.
● One-stage detectors are better suited to real-time applications.
● We also plan to address the speed, memory and accuracy problem
with the help of multi-task learning.
● In multi-class classification, a single label is assigned to an input
image, whereas in multi-task learning each input image is checked
separately for object 1, object 2, … object n.
● With multi-task learning, multiple objects can appear in the same image,
and hence one image has multiple labels (one for each object that we are
going to detect).
● With multi-task learning, we train a single neural network to look at
each image and solve N different classification problems. This results in
better performance than training N completely separate neural networks to
do N tasks separately.
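
The difference between assigning one label per image and asking a separate yes/no question per object can be illustrated with a small sketch. This is not the project's code: the class names and the scores are hypothetical, and the per-object formulation shown here (independent sigmoid outputs) is only one common way to realise the idea described in the points above.

```python
import math

CLASSES = ["mask", "id_card"]  # hypothetical label names, for illustration only


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (one winner per image)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Turn one raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def multi_class_prediction(logits):
    """Multi-class view: exactly one label is chosen for the whole image."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best]


def per_object_prediction(logits, threshold=0.5):
    """Per-object view: each object gets its own yes/no decision."""
    return [name for name, z in zip(CLASSES, logits) if sigmoid(z) >= threshold]


scores = [2.0, 1.5]  # made-up raw scores for an image showing both objects
print(multi_class_prediction(scores))  # a single label
print(per_object_prediction(scores))   # possibly several labels
```

With the per-object view, an image showing both a mask and an ID card can carry both labels, while the multi-class view has to pick one of them.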

Roboflow is a web-based computer vision tool that allows us to create
custom datasets. First, we create a workspace and then add our project,
Mask&IDCard detection. Then we add the photographs we took. The
software runs a preprocessing step to make the dimensions of all images
uniform. It then allows us to add annotations. We annotate every image by
drawing boxes around the masks and ID cards in it. The software also
splits the dataset into train, validation and test sets.

This file/module clones the publicly available yolov5 repo. It also reads
the custom dataset that we create with Roboflow. The number-of-classes
parameter in the predefined yolov5s.yaml file is edited to suit our needs.
The module then calls train.py to train the neural network for a specific
number of epochs and save the weights to the best.pt file. We then plot
the various performance metrics (mAP, precision, recall, loss) using
TensorBoard. Finally, we run inference on a test dataset to verify the
results.
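
As a rough illustration of the call to train.py described above, the sketch below only assembles the command line; it does not run training. The file paths, epoch count, batch size and run name are hypothetical placeholders, and the flag names follow the commonly documented yolov5 training command, so they should be checked against the cloned version of the repo.

```python
import shlex


def build_train_command(data_yaml, cfg_yaml, epochs, img_size=640,
                        batch_size=16, weights="yolov5s.pt",
                        run_name="mask_idcard"):
    """Assemble the argument list for yolov5's train.py (all values are examples)."""
    return [
        "python", "train.py",
        "--img", str(img_size),      # training image size
        "--batch", str(batch_size),  # batch size
        "--epochs", str(epochs),     # number of training epochs
        "--data", data_yaml,         # dataset description exported for the project
        "--cfg", cfg_yaml,           # model config with the edited number of classes
        "--weights", weights,        # starting weights
        "--name", run_name,          # name of the output run folder
    ]


# Hypothetical paths; substitute the ones used in your own workspace.
cmd = build_train_command("dataset/data.yaml", "models/custom_yolov5s.yaml", epochs=100)
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run from inside the cloned repo directory, or printed and pasted into a notebook cell.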

This module is responsible for training our model. It initializes other
classes such as the Model class and the Dataloader class. It trains the
model for a specific number of epochs and calculates all the performance
metrics. Finally, it stores the latest and the best weights (the latter in
the best.pt file) and saves the model. It also has the option to train the
model on the GPU of the device.

This file deals with the model configuration. It defines the anchors.
Anchor boxes are a set of predefined bounding boxes of a certain height
and width. These boxes are defined to capture the scale and aspect ratio of
the specific object classes you want to detect and are typically chosen
based on object sizes in your training dataset. Then comes the structure of
the backbone, which is a convolutional neural network that aggregates image
pixels to form features at different granularities. Then comes the YOLOv5
head, which consists of Conv and C3 layers. The YOLOv5s model displayed in
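
To make the anchor and class settings concrete, the sketch below lists the default anchor sizes as they appear in the public yolov5s.yaml (worth verifying against your own copy) and computes how many output channels each detection scale needs. The assumption is the usual YOLOv5 layout, in which every anchor predicts four box values, one objectness score and one score per class; the class count of 2 stands for mask and ID card.

```python
# Default anchors (width, height) in pixels, keyed by the stride of each detection scale.
ANCHORS = {
    8: [(10, 13), (16, 30), (33, 23)],
    16: [(30, 61), (62, 45), (59, 119)],
    32: [(116, 90), (156, 198), (373, 326)],
}


def outputs_per_anchor(num_classes):
    """4 box coordinates + 1 objectness score + one score per class."""
    return num_classes + 5


def head_channels(num_classes, anchors_per_scale=3):
    """Number of output channels of the detection layer at one scale."""
    return anchors_per_scale * outputs_per_anchor(num_classes)


for stride, boxes in ANCHORS.items():
    aspect_ratios = [round(w / h, 2) for w, h in boxes]
    print(f"stride {stride}: anchors {boxes}, width/height ratios {aspect_ratios}")

# With two classes (mask, ID card): 3 anchors * (2 + 5) = 21 channels per scale.
print(head_channels(2))
```

Changing the number of classes in the configuration file therefore changes the width of the final detection layer, which is why the parameter has to be edited before training on a custom dataset.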

Module LetterBox
This module resizes and pads an image while meeting stride-multiple
constraints. We first get the current shape of the image and compute the
ratio needed to fit it into our desired shape without changing its aspect
ratio. Optionally, the image is only scaled down and never scaled up, which
gives a better mAP score. Next, we compute the padding required to surround
the resized image. Finally, OpenCV's copyMakeBorder() method adds a border
of the specified colour around the image.
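
The resize-and-pad arithmetic can be followed without any image library. The sketch below computes only the geometry (scale ratio, resized size and padding on each side); it is modelled on the usual letterbox logic but is not a copy of the project's module, and the function name and default values are assumptions made for illustration.

```python
def letterbox_geometry(shape, new_shape=(640, 640), stride=32,
                       auto=True, scaleup=True):
    """Return (ratio, (new_w, new_h), (top, bottom, left, right)) for an image.

    shape and new_shape are (height, width). With auto=True the padding is
    reduced to the minimum that keeps both sides a multiple of the stride.
    """
    height, width = shape
    ratio = min(new_shape[0] / height, new_shape[1] / width)
    if not scaleup:
        ratio = min(ratio, 1.0)  # only shrink, never enlarge

    new_unpad = (round(width * ratio), round(height * ratio))  # (width, height)
    pad_w = new_shape[1] - new_unpad[0]
    pad_h = new_shape[0] - new_unpad[1]
    if auto:
        pad_w, pad_h = pad_w % stride, pad_h % stride

    pad_w, pad_h = pad_w / 2, pad_h / 2  # split the padding between both sides
    top, bottom = round(pad_h - 0.1), round(pad_h + 0.1)
    left, right = round(pad_w - 0.1), round(pad_w + 0.1)
    return ratio, new_unpad, (top, bottom, left, right)


# A 1280x720 frame resized towards 640x640.
print(letterbox_geometry((720, 1280)))
print(letterbox_geometry((720, 1280), auto=False))
```

The four padding values are the ones that would be passed to OpenCV's copyMakeBorder() as top, bottom, left and right.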

It initializes all the parameters, such as:
● Source of the video stream or image.
● Source of the weights file.
● Image size.
● Device: CPU / GPU.
● Confidence threshold.

It loads the model and opens the video stream. It then continuously reads
frames, preprocesses each frame using the letterbox module, runs inference,
and draws boxes around the detected objects.
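
A minimal sketch of how the parameters listed above could be exposed on the command line is given below. The option names mirror those commonly used by yolov5's detection script, but the defaults and help texts are assumptions chosen for illustration, not values taken from the project.

```python
import argparse


def build_parser():
    """Command-line options matching the parameters listed above (defaults are examples)."""
    parser = argparse.ArgumentParser(description="Mask and ID card detection (sketch)")
    parser.add_argument("--source", default="0",
                        help="video stream index, video file or image path")
    parser.add_argument("--weights", default="best.pt",
                        help="path to the trained weights file")
    parser.add_argument("--imgsz", type=int, default=640,
                        help="inference image size in pixels")
    parser.add_argument("--device", default="cpu",
                        help="'cpu' or a GPU index such as '0'")
    parser.add_argument("--conf-thres", type=float, default=0.25,
                        help="minimum confidence for a detection to be kept")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
opt = build_parser().parse_args(["--source", "0", "--conf-thres", "0.4"])
print(opt)
```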


Our dataset was created by our team members. We took pictures of people
in four different cases: without a mask or ID card, with only an ID card,
with only a mask, and with both an ID card and a mask. Since our model
applies to the VIT library, we photographed VIT students wearing masks and
VIT ID cards. Our dataset comprises 324 pictures.




F1 Curve

Confusion Matrix
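
For readers unfamiliar with the quantities behind these plots, the sketch below shows how precision, recall and the F1 score are derived from confusion-matrix counts for one class. The counts in the example are made up and are not results from this project.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 for one class from raw counts."""
    predicted = true_pos + false_pos  # everything the model flagged
    actual = true_pos + false_neg     # everything that was really there
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for a single class, for illustration only.
print(precision_recall_f1(true_pos=8, false_pos=2, false_neg=2))
```

An F1 curve plots this score against the confidence threshold, since raising the threshold usually trades recall for precision.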




We observed that inference with our single model ran smoothly for real-time
object detection on a CPU, whereas the multiple-model approach needed a GPU
to make up for the delay in the inference process. There was also a
significant difference in the memory used during inference: the single
model used 500 MB of RAM, 200 MB less than the multiple-model approach
(700 MB).
Also, our multi-task learning approach trains a single neural network to
detect all objects simultaneously, thereby reducing the cost function (log
loss) significantly compared to the other approaches.
These are the reasons why the MTL approach is very beneficial for real-
time object detection where inference time and device memory are
In this way, we aim to solve a simple problem existing in VIT. Our project
confirms or rejects the entry of students depending on whether they are
wearing their ID cards and masks.
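
The entry decision itself can be expressed in a few lines once the detections for a frame are available. The sketch below assumes that each detection is a (class name, confidence) pair and that the classes are called "mask" and "id_card"; both the data layout and the names are hypothetical, as is the threshold value.

```python
REQUIRED_CLASSES = {"mask", "id_card"}  # hypothetical class names


def entry_allowed(detections, conf_thres=0.25):
    """Confirm entry only if every required object is detected confidently.

    detections is an iterable of (class_name, confidence) pairs for one frame.
    """
    seen = {name for name, conf in detections if conf >= conf_thres}
    return REQUIRED_CLASSES.issubset(seen)


print(entry_allowed([("mask", 0.91), ("id_card", 0.78)]))  # both present
print(entry_allowed([("mask", 0.91)]))                     # ID card missing
print(entry_allowed([("mask", 0.91), ("id_card", 0.10)]))  # ID card below threshold
```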



We have proposed a solution to the problem of students not wearing ID
cards and masks in the VIT library; wearing both is compulsory, but
students often bend the rule. Our solution automates this check and
lessens the dependency on the library staff.
Entry is confirmed only for students who are wearing both a mask and an
ID card.
We have employed the YOLOv5 object detection architecture in PyTorch
for this purpose.
The available research work is based on multi-class learning, whereas our
work solves the problem using multi-task learning, wherein we train a
single neural network to look at each image and solve N different
classification problems (i.e., mask AND ID card).
This results in better performance than training N completely separate
neural networks to do N tasks separately, which demonstrates the efficiency
of our model.
Thus, we conclude that with this project our team has solved a simple but
genuine problem at VIT, one with a large but unnoticed research gap.


[1] Tomar, Haider, Sagar (May 2022). A Study on Real Time Object Detection using Deep Learning.
International Journal of Engineering Research & Technology (IJERT) Volume 11, Issue 05

[2] Jaiswal, T., Pandey, M., & Tripathi, P. (2022, March). Real Time Multiple-Object Detection Based
On Enhanced SSD. In 2022 Second International Conference on Power, Control and Computing
Technologies (ICPC2T) (pp. 1-5). IEEE.

[3] Xu, Renjie & Lin, Haifeng & Lu, Kangjie & Cao, Lin & Liu, Yunfei. (2021). A Forest Fire Detection
System Based on Ensemble Learning. Forests. 12. 217. 10.3390/f12020217.

[4] Naik, U. P., Rajesh, V., & Kumar, R. (2021, September). Implementation of YOLOv4 algorithm for
multiple object detection in image and video dataset using deep learning and artificial intelligence for
urban traffic video surveillance application. In 2021 Fourth International Conference on Electrical,
Computer and Communication Technologies (ICECCT) (pp. 1-6). IEEE.

[5] Khattar, A., Hegde, S., & Hebbalaguppe, R. (2021). Cross-domain multi-task learning for object
detection and saliency estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 3639-3648).

[6] Cao, D., Chen, Z., & Gao, L. (2020). An improved object detection algorithm based on multi-scaled
and deformable convolutional neural networks. Human-centric Computing and Information Sciences,
10(1), 1-22.

[7] U. Subbiah, D. K. Kumar, S. K. Thangavel and L. Parameswaran. An Extensive Study and
Comparison of the Various Approaches to Object Detection using Deep Learning. 2020 Third
International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020, pp. 1018-1030,
doi: 10.1109/ICSSIT48917.2020.9214185.

[8] L. Jiao et al., "A Survey of Deep Learning-Based Object Detection," in IEEE Access, vol. 7, pp.
128837-128868, 2019, doi: 10.1109/ACCESS.2019.2939201.

[9] Zhao, Zhong-Qiu & Zheng, Peng & Xu, Shou-Tao & Wu, Xindong. (2019). Object Detection With
Deep Learning: A Review. IEEE Transactions on Neural Networks and Learning Systems. PP. 1-21.

[10] S., Manjula & Krishnamurthy, Lakshmi & Ravichandran, Manjula. (2016). A Study On Object

