Video Stream Analytics
Abstract—Object detection and classification are the basic tasks in video analytics and become the starting point for other complex applications. Traditional video analytics approaches are manual and time consuming, and are subjective due to the involvement of the human factor. We present a cloud based video analytics framework for scalable and robust analysis of video streams. The framework empowers an operator by automating the object detection and classification process from recorded video streams. An operator only specifies an analysis criteria and the duration of video streams to analyse. The streams are then fetched from a cloud storage, decoded and analysed on the cloud. The framework assigns the compute intensive parts of the analysis to GPU powered servers in the cloud. Vehicle and face detection are presented as two case studies for evaluating the framework, with one month of data and a 15 node cloud. The framework reliably performed object detection and classification on the data, comprising 21,600 video streams and 175 GB in size, in 6.52 hours. The GPU enabled deployment of the framework took 3 hours to perform the analysis on the same number of video streams, thus making it at least twice as fast as the cloud deployment without GPUs.

Index Terms—Cloud Computing, Video Stream Analytics, Object Detection, Object Classification, High Performance

The paper was first submitted on May 02, 2015. The revised version of the paper was submitted on October 01, 2015. This research is jointly supported by the Technology Support Board, UK and XAD Communications, Bristol under Knowledge Transfer Partnership grant number KTP008832.

Ashiq Anjum is with the University of Derby, College of Engineering and Technology, Kedleston Road campus, DE22 1GB, Derby. He can be contacted at a.anjum@derby.ac.uk.
Tariq Abdullah is with the University of Derby, College of Engineering and Technology, Kedleston Road campus, DE22 1GB, Derby. He can be contacted at t.abdullah@derby.ac.uk.
M Fahim Tariq is with XAD Communications, Bristol and can be contacted at m.f.tariq@xadco.com.
Yusuf Baltaci is with XAD Communications and can be contacted at yusuf.baltaci@xadco.com.
Nikos Antonopoulos is with the University of Derby, College of Engineering and Technology, Kedleston Road campus, DE22 1GB, Derby and can be contacted at n.antonopoulos@derby.ac.uk.

I. INTRODUCTION

The recent past has observed a rapid increase in the availability of inexpensive video cameras producing good quality videos. This has led to a widespread use of these video cameras for security and monitoring purposes. The video streams coming from these cameras need to be analysed for extracting useful information such as object detection and object classification. Object detection from these video streams is one of the important applications of video analysis and becomes a starting point for other complex video analytics applications. Video analysis is a resource intensive process and needs massive compute, network and data resources to deal with the computational, transmission and storage challenges of video streams coming from thousands of cameras deployed to protect utilities and assist law enforcement agencies.

There are approximately 6 million cameras in the UK alone [1]. Camera based traffic monitoring and enforcement of speed restrictions have increased from just over 300,000 in 1996 to over 2 million in 2004 [2]. In a traditional video analysis approach, a video stream coming from a monitoring camera is either viewed live or is recorded on a bank of DVRs or a computer HDD for later processing. Depending upon the needs, the recorded video stream is retrospectively analysed by the operators. Manual analysis of the recorded video streams is an expensive undertaking. It is not only time consuming, but also requires a large number of staff, office work space and resources. A human operator loses concentration from video monitors after only 20 minutes [3], making it impractical to go through the recorded videos in a time constrained scenario. In real life, an operator may have to juggle between viewing live and recorded video contents while searching for an object of interest, making the situation a lot worse, especially when resources are scarce and decisions need to be made relatively quickly.

Traditional video analysis approaches for object detection and classification such as colour based [4], statistical background suppression [5], adaptive background [6], template matching [7] and Gaussian [8] approaches are subjective, inaccurate and at times may provide incomplete monitoring results. There is also a lack of object classification in these approaches [4], [5], [8]. These approaches do not automatically produce colour, size and object type information [5], [6]. Moreover, these approaches are costly and time consuming to such an extent that their
[Figure 1: System architecture of the video analysis framework. Analogue and IP camera sources feed the Video Stream Acquisition component, which records the streams into the cloud storage; the Analytics Processing Server (APS Server) is a Hadoop cluster of GPU mounted compute nodes (Node 1 to Node N, each with a GPU) running OpenCV and Mahout and writing results to the analytics database; operators connect through a network switch from the APS Client, which contains the authentication, client controller and FTP client components.]
Video analytics have also been the focus of commercial vendors. Vi-System [22] offers an intelligent surveillance system with real time monitoring, tracking of an object within a crowd using analytical rules, and alerts for different users on defined parameters. Vi-System does not work for recorded videos, its analytics rules are limited and they need to be defined in advance. SmartCCTV [23] provides optical based survey solutions, video incident detection systems and high end digital CCTV, and is mainly used in the UK transportation system. Project BESAFE [24] aimed at automatic surveillance of people, tracking their abnormal behaviour and detecting their activities using a trajectories approach for distinguishing the state of the objects. The main limitation of SmartCCTV and Project BESAFE is the lack of scalability to a large number of streams and a requirement of high bandwidth for video stream transmission.

IVA 5.60 [25] is an embedded video analysis system and is capable of detecting, tracking and analysing moving objects in a video stream. It can detect idle and removed objects as well as loitering, multiple line crossing, and trajectories of an object. EptaCloud [26] extends the functionality provided by IVA 5.60 and implements the system in a scalable environment. Intelligent Vision [27] is a tool for performing intelligent video analysis and for fully automated video monitoring of premises with a rich set of features. The video analysis system of IVA 5.60 is built into the cameras, which increases its installation cost. Intelligent Vision is not scalable and does not serve our requirements.

Because of their abundant computational power and extensive support for multi-threading, GPUs have become an active research area for improving the performance of video processing algorithms. For example, Lui et al. [28] proposed a hybrid parallel computing framework based on the MapReduce [29] programming model. The results suggest that such a model will be hugely beneficial for video processing and real time video analytics systems. We aim to use a similar approach in this research.

Existing cloud based video analytics approaches do not support recorded video streams [22] and lack scalability [23], [24]. GPU based approaches are still experimental [28]. IVA 5.60 [25] supports only embedded video analytics and Intelligent Vision [27] is not scalable; otherwise, their approaches are close to the approach presented in this research. The framework reported in this paper uses GPU mounted servers in the cloud to capture and record video streams and to analyse the recorded video streams using a cascade of classifiers for object detection.

Supported Video Formats

CIF, QCIF, 4CIF and Full HD video formats are supported for video stream recording in the presented framework. The resolution (number of pixels present in one frame) of a video stream in CIF format is 352x288 and each video frame has 99k pixels. QCIF (Quarter CIF) is a low resolution video format and is used in setups with limited network bandwidth. The video stream resolution in QCIF format is 176x144 and each video frame has 24.8k pixels. The 4CIF video format has 4 times higher resolution (704x576) than the CIF format and captures more details in each video frame. The CIF and 4CIF formats have been used for acquiring video streams from the camera sources for traffic/object monitoring in our framework. The Full HD (Full High Definition) video format captures video streams at 1920x1080 resolution and contains 24 times more detail in a video stream than the CIF format. It is used for high resolution video recording with the availability of abundant disk storage and a high speed internet connection.
A higher resolution video stream presents a clearer image of the scene and captures more details. However, it also requires more network bandwidth to transmit the video stream and occupies more disk storage. Other factors that may affect the video stream quality are the video bit rate and the frames per second. The video bit rate represents the number of bits transmitted from a video stream source to the destination over a set period of time and is a combination of the video stream itself and meta-data about the video stream. Frames per second (fps) represents the number of video frames contained in a video stream in one second and determines the smoothness of the video stream. The video streams have been captured with a constant bit rate of 200kbps and at 25 fps in the results reported in this paper. Table I summarizes the supported video formats and their parameters.
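As an illustration, the constant 200kbps bit rate and the 120 second clip length used in this paper are consistent with the minimum clip size reported later in Table II:

    200 kbit/s x 120 s = 24,000 kbit = 3 MB

By the same arithmetic, the 120MB maximum clip size in Table II corresponds to a stream of roughly 8 Mbit/s.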
III. VIDEO ANALYSIS FRAMEWORK

This section outlines the proposed framework, its different components and the interaction between them (Figure 1). The proposed framework provides a scalable and automated solution for video stream analysis with minimum latencies and user intervention. It also provides the capability for video stream capture, storage and retrieval. The framework makes the video stream analysis process efficient and reduces the processing latencies by using GPU mounted servers in the cloud. It empowers a user by automating the process of identifying and finding objects and events of interest. Video streams are captured and stored in a local storage from a cluster of cameras that have been installed on roads/buildings for the experiments reported in this paper. The video streams are then transferred to a cloud storage for further analysis and processing. The system architecture of the video analysis framework is depicted in Figure 1 and the video stream analysis process on an individual compute node is depicted in Figure 2a. We explain the framework components and the video stream analysis process in the remainder of this section.

Automated Video Analysis: The framework automates the video stream analysis by reducing the user interaction during this process. An operator/user initiates the video stream analysis by defining an "Analysis Request" from the APS Client component (Figure 1) of the framework. The analysis request is sent to the cloud data center for analysis and no more operator interaction is required during the video stream analysis. The video streams specified in the analysis request are fetched from the cloud storage. These video streams are analysed according to the analysis criteria and the analysis results are stored in the analytics database. The operator is notified of the completion of the analysis process and can then access the analysis results from the database.

An "Analysis Request" comprises the defined region of interest, an analysis criteria and the analysis time interval. The operator defines a region of interest in a video stream for an analysis. The analysis criteria defines the parameters for detecting objects of interest (face, car, van or truck) and the size/colour based classification of the detected objects. The time interval represents the duration of analysis from the recorded video streams, as the analysis of all the recorded video streams might not be required.

Framework Components

Our framework employs a modular approach in its design. At the top level, it is divided into client and server components (Figure 1). The server component runs as a daemon on the cloud machines and performs the main task of video stream analysis, whereas the client component supports a multi-user environment and runs on the client machines (operators in our case). The control/data flow in the framework is divided into the following three stages:
• Video stream acquisition and storage
• Video stream analysis
• Storing analysis results and informing operators

The deployment of the client and server components is as follows: the Video Stream Acquisition is deployed at the video stream sources and is connected to the Storage Server through a 1/10 Gbps LAN connection. The cloud based storage and processing servers are deployed collectively in the cloud based data center. The APS Client is deployed at the end-user sites. We explain the details of the framework components in the remainder of this section.

A. Video Stream Acquisition

The Video Stream Acquisition component captures video streams from the monitoring cameras and transmits them to the requesting clients for relaying in a control room and/or for storing these video streams in the cloud data center. The captured video streams are encoded using the H.264 encoder [30]. Encoded video streams are transmitted using RTSP (Real Time Streaming Protocol) [31] in conjunction with the RTP/RTCP protocols [32]. The transmission of video streams is initiated on a client's request. A client connects to the video stream acquisition component by establishing a session with the RTSP server.
[Figure 2: (a) Video stream analysis on a single compute node of the APS Server: video frames are extracted from the video streams into an image buffer and processed by parallel analysis threads that detect objects of interest using the vehicle/face classifier and store the results in the analytics database; (b) the procedure of creating a sequence file, uploading the sequence file to the cloud storage and creating input splits from the sequence file.]
The client is authenticated using the CHAP protocol before the video streams are transmitted. The RTSP server sends a challenge message to the client, and the client responds with a value obtained by a one-way hash function. The server compares this value with the value calculated by its own hash function. The RTSP server starts the video stream transmission after a match of the hash values; the connection is terminated in case of a mismatch of the hash values.
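The challenge and response check described above can be sketched as follows. This is a minimal illustration only: the hash choice (MD5 over an identifier, a shared secret and the challenge, as in classic CHAP) and the message layout are assumptions, and the framework's RTSP server may implement the exchange differently.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.SecureRandom;

    // Server-side sketch of a CHAP-style challenge/response verification.
    final class ChapAuthenticator {
        private final byte[] sharedSecret;

        ChapAuthenticator(String secret) {
            this.sharedSecret = secret.getBytes(StandardCharsets.UTF_8);
        }

        /** Generate a fresh random challenge to send to the client. */
        byte[] newChallenge() {
            byte[] challenge = new byte[16];
            new SecureRandom().nextBytes(challenge);
            return challenge;
        }

        /** Recompute the hash locally and compare it with the client's response. */
        boolean verify(byte identifier, byte[] challenge, byte[] clientResponse) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(identifier);
            md5.update(sharedSecret);
            md5.update(challenge);
            byte[] expected = md5.digest();
            return MessageDigest.isEqual(expected, clientResponse); // mismatch => terminate the session
        }
    }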
The video stream is transmitted continuously for one RTSP session. The video stream transmission stops only when the connection is dropped or the client disconnects itself. A new session will be established after a dropped connection and the client will be re-authenticated. The client is responsible for recording the transmitted video stream into video files and storing them in the cloud data center. Administrators in the framework are authorized to change the quality of the captured video streams. Video streams are captured at 25 fps in the experimental results reported in this paper.

B. Storage Server

The scale and management of the data coming from hundreds or thousands of cameras will be in exabytes, let alone all of the more than 4 million cameras in the UK. Therefore, storage of these video streams is a real challenge. To address this issue, H.264 encoded video streams received from the video sources, via video stream acquisition, are recorded as MP4 files on storage servers in the cloud. The storage server has RL300 recorders for real time recording of video streams. It stores video streams on disk drives and meta-data about the video streams is recorded in a database (Figure 1). The received video streams are stored as 120 seconds long video files. These files can be stored in QCIF, CIF, 4CIF or in Full HD video format. The supported video formats are explained in Section II. The average size, frame rate, pixels per frame and the video resolution of each recorded video file for the supported video formats are summarized in Table I.

The length of a video file plays an important role in the storage, transmission and analysis of a recorded video stream. The 120 seconds length of a video file was decided after considering the network bandwidth, performance and fault tolerance requirements of the presented framework. A smaller video file is transmitted more quickly than a large video file. Secondly, it is easier and quicker to re-transmit a smaller file after a failed transmission due to a network failure or any other reason. Thirdly, the analysis results of a small video file are available more quickly than those of a large video file.

The average minimum and maximum size of a 120 seconds long video stream, from one monitoring camera, is 3MB and 120MB respectively. One month of continuous recording from one camera requires 59.06GB and 2.419TB of minimum and maximum disk storage respectively. The storage capacity required for storing the video streams of one month duration from one camera is summarized in Table II.

    Duration                    Minimum Size    Maximum Size
    2 Minutes (120 Seconds)     3MB             120MB
    1 Hour (60 Minutes)         90MB            3.6GB
    1 Day (24 Hours)            2.11GB          86.4GB
    1 Week (168 Hours)          14.77GB         604.8GB
    4 Weeks (672 Hours)         59.06GB         2.419TB

Table II: Disk Space Requirements for One Month of Recorded Video Streams

C. Analytics Processing Server (APS)

The APS server sits at the core of our framework and performs the video stream analysis. It uses the cloud storage for retrieving the recorded video streams and implements the processing server as compute nodes in a Hadoop cluster in the cloud data center (as shown in Figure 1). The analysis of the recorded video streams is performed on the compute nodes by applying the selected video analysis approach. The selection of a video analysis approach varies according to the intended video analysis purpose. The analytics results and the meta-data about the video streams are stored in the Analytics Database.

The overall working of the framework is depicted in Figure 1 and the internal working of a compute node for analysing the video streams is depicted in Figure 2a.
An individual processing server starts analysing the video streams on receiving an analysis request from an operator. It fetches the recorded video streams from the cloud storage. The H.264 encoded video streams are decoded using the FFmpeg library and individual video frames are extracted. The analysis process is started on these frames by selecting features in an individual frame and matching these features with the features stored in the cascade classifier. The functionality of the classifier is explained in Section IV. The information about the detected objects is stored in the analytics database. The user is notified after completion of the analysis process. Figure 2b summarizes the procedure of analysing the stored video streams on a compute node.
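The decoding step mentioned above can be sketched as below, assuming JavaCV's FFmpegFrameGrabber as the interface to the FFmpeg library. The class and method names follow current JavaCV releases and are an assumption about the framework's actual decoding code; the amount of buffered frames is a configurable parameter in the framework.

    import java.util.ArrayList;
    import java.util.List;
    import org.bytedeco.javacv.FFmpegFrameGrabber;
    import org.bytedeco.javacv.Frame;

    // Sketch: decode a recorded 120-second H.264 clip and buffer its frames for analysis.
    final class ClipDecoder {
        static List<Frame> decode(String mp4Path, int maxBufferedFrames) throws Exception {
            List<Frame> buffer = new ArrayList<>();
            FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(mp4Path);
            grabber.start();
            Frame frame;
            while ((frame = grabber.grabImage()) != null && buffer.size() < maxBufferedFrames) {
                buffer.add(frame.clone()); // clone: the grabber reuses its internal frame buffer
            }
            grabber.stop();
            grabber.release();
            return buffer;
        }
    }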
D. APS Client

The APS Client is responsible for the end-user/operator interaction with the APS Server. It is deployed at the client sites, such as police traffic control rooms or city council monitoring centers, and supports multi-user interaction. Different users may initiate the analysis process for their specific requirements, such as object identification, object classification, or region of interest analysis. These operators can select the duration of recorded video streams for analysis and can specify the analysis parameters. The analysis results are presented to the end-users after an analysis is completed. The analysed video streams, along with the analysis results, are accessible to the operator over a 1/10 Gbps LAN connection from the cloud storage.

A user from a client site connects to the camera sources through the video stream acquisition module. The video stream acquisition module transmits video streams over the network from the camera sources. The acquired video streams are viewed live or are stored for analysis. In this video stream acquisition/transmission model, neither changes are required in the existing camera deployments nor is any additional hardware needed.

E. Porting the Video Analytics Framework to a Public Cloud

The presented video analysis framework is evaluated on the private cloud at the University of Derby. Porting the framework to a public cloud such as Amazon EC2, Google Compute Engine or Microsoft Azure will be a straightforward process. We explain the steps/phases for porting the framework to Amazon EC2 in the remainder of this section.

The main phases in porting the framework to Amazon EC2 are data migration, application migration and performance optimization. The data migration phase involves uploading all the stored video streams from the local cloud storage to the Amazon S3 cloud storage. The AWS SDK for Java will be used for uploading the video streams and the AWS Management Console to verify the upload. The "Analytics Database" will be created in Amazon RDS for MySQL. The APS is moved to Amazon EC2 in the application migration phase. Custom Amazon Machine Images (AMIs) will be created on Amazon EC2 instances for hosting and running the APS components. These custom AMIs can be added or removed according to the varying workloads during the video stream analysis process. Determining the right EC2 instance size will be a challenging task in this migration. It is dependent on the processing workload, performance requirements and the desired concurrency for video stream analysis. The health and performance of the instances can be monitored and managed using the AWS Management Console.

IV. VIDEO ANALYSIS APPROACH

The video analysis approach detects objects of interest from the recorded video streams and classifies the detected objects according to their distinctive properties. The AdaBoost based cascade classifier [14] algorithm is applied for detecting objects from the recorded video streams, whereas size based classification is performed for the detected vehicles in the second case study.

In the AdaBoost based object detection algorithm, a cascade classifier combines weak classifiers into a strong classifier. The cascade classifier does not operate on individual image/video frame pixels. It rather uses an integral image for detecting object features from the individual frames in a video stream. All the object features not representing an object of interest are discarded early in the object detection process. The cascade classifier based object detection increases detection accuracy, uses less computational resources and improves the overall detection performance. This algorithm is applied in two stages as explained below.

Training a Cascade Classifier

A cascade classifier combining a set of weak classifiers using the real AdaBoost [17] algorithm is trained in multiple boosting stages. In the training process, a weak classifier learns about the object of interest by selecting a subset of rectangular features that efficiently distinguish the positive and negative image classes in the training data. This classifier forms the first level of the cascade classifier. Initially, equal weights are attached to each training example. The weights are raised for the training examples misclassified by the current weak classifier in each boosting stage. All of these weak classifiers determine the optimal threshold function such that mis-classification is minimized. The optimal threshold function is mathematically represented [17] as follows:

    h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}

where x is the window, f_j is the value of the rectangular feature, p_j is the parity and \theta_j is the threshold. A weak classifier with the lowest weighted training error on the training examples is selected in each boosting stage. The final strong classifier is a linear combination of all the weak classifiers that have gone through all the boosting stages. The weight of each classifier in the final classifier is directly proportional to its accuracy.
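The threshold function above and the weighted vote of the boosted weak classifiers can be written compactly as follows. This is an illustrative sketch of the standard real AdaBoost decision rule rather than the framework's training code; the field names are hypothetical.

    // A weak (threshold) classifier h_j: the feature value f_j(x) is the
    // integral-image rectangle sum of feature j on window x; the parity p_j
    // (+1 or -1) selects the direction of the inequality.
    final class WeakClassifier {
        final double threshold; // theta_j, learned during boosting
        final int parity;       // p_j
        final double alpha;     // weight of this classifier in the strong classifier

        WeakClassifier(double threshold, int parity, double alpha) {
            this.threshold = threshold;
            this.parity = parity;
            this.alpha = alpha;
        }

        int classify(double featureValue) {
            return (parity * featureValue < parity * threshold) ? 1 : 0;
        }
    }

    // The strong classifier is the weighted vote of all selected weak classifiers.
    final class StrongClassifier {
        static int classify(WeakClassifier[] weak, double[] featureValues) {
            double vote = 0, halfTotalWeight = 0;
            for (int j = 0; j < weak.length; j++) {
                vote += weak[j].alpha * weak[j].classify(featureValues[j]);
                halfTotalWeight += 0.5 * weak[j].alpha;
            }
            return vote >= halfTotalWeight ? 1 : 0; // 1 = object, 0 = non-object
        }
    }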
The cascade classifier is developed in a hierarchical fashion and consists of cascade stages, trees, features and thresholds. Each stage is composed of a set of trees, trees are composed of features, and features are composed of rectangles.
The features are specified as rectangles with their x, y, height and width values and a tilted field for each rectangular feature. The tilted field specifies whether the feature is rotated or not. The threshold field specifies the branching threshold for the feature. The left branch is taken when the value of the feature is less than the adjusted threshold and the right branch is taken otherwise. The tree values within a stage are accumulated and compared with the stage threshold. The accumulated value is used to decide object detection at a stage. If this value is higher than the stage threshold, the sub-window is classified as an object and is passed to the next stage for further classification.

A cascade of classifiers increases detection performance and reduces the computational power required during the detection process. The cascade training process aims to build a cascade classifier with more features for achieving higher detection rates and a lower false positive rate. However, a cascade classifier with more features will require more computational power. The objective of the cascade classifier training process is therefore to train a classifier with the minimum number of features for achieving the expected detection rate and false positive rate. Furthermore, these features can encode ad hoc domain knowledge that is difficult to learn from a finite quantity of training data. We used the opencv_haartraining utility provided with OpenCV [33] to train the cascade classifier.

Detecting Objects of Interest from Video Streams Using a Cascade Classifier: Object detection with a cascade classifier starts by scanning a video frame for all the rectangular features in it. This scanning starts from the top-left corner of the video frame and finishes at the bottom-right corner. All the identified rectangular features are evaluated against the cascade. Instead of evaluating all the pixels of a rectangular feature, the algorithm applies an integral image approach and calculates the pixel sum of all the pixels inside a rectangular feature by using only 4 corner values of the integral image, as depicted in Figure 3. The integral image results in faster feature evaluation than the pixel based evaluation. Scanning a video frame and constructing an integral image are computationally expensive tasks and always need optimization.

The integral image is used to calculate the value of the detected features. The identified features consist of small rectangular regions of white and shaded areas. These features are evaluated against all the stages of the cascade classifier. The value of any given feature is always the sum of the pixels within the white rectangles subtracted from the sum of the pixels within the shaded rectangles. The evaluated image regions are sorted into positive and negative images (i.e. objects and non-objects).

    Original Image        Integral Image
    10 20 10 20           10  30  40  60
    20 10 10 10           30  60  80 110
    30 10 10 20           60 100 130 180
    10 20 30 20           70 130 190 260

Figure 3: Integral Image Representation
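The integral image construction and the four-corner rectangle sum can be sketched as below. OpenCV provides an equivalent operation (for example Imgproc.integral() in the Java binding); the explicit loops are shown only to illustrate the idea. Applied to the 4x4 example of Figure 3, rectSum over the whole image returns 260, the bottom-right entry of the integral image.

    // Integral image with an extra zero row/column so that rectangle sums
    // need no bounds checks: ii[y][x] = sum of all pixels above and to the left.
    final class IntegralImage {
        static long[][] build(int[][] pixels) {
            int h = pixels.length, w = pixels[0].length;
            long[][] ii = new long[h + 1][w + 1];
            for (int y = 1; y <= h; y++) {
                for (int x = 1; x <= w; x++) {
                    ii[y][x] = pixels[y - 1][x - 1]
                             + ii[y - 1][x] + ii[y][x - 1] - ii[y - 1][x - 1];
                }
            }
            return ii;
        }

        /** Sum of the pixels in the rectangle [x, x+w) x [y, y+h) using 4 corner look-ups. */
        static long rectSum(long[][] ii, int x, int y, int w, int h) {
            return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
        }
    }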
For reducing the processing time, each video frame is scanned in two passes. A video frame is divided into sub-windows in the first pass. These sub-windows are evaluated against the first two stages of the cascade of classifiers (containing weak classifiers). A sub-window is not evaluated against the remaining stages of the cascade classifier if it is eliminated in stage zero or evaluates to an object in stage one. The second pass evaluates only those sub-windows that were neither eliminated nor marked as objects in the first pass.

An attentional cascade is applied for reducing the detection time. In the attentional cascade, simple classifiers are applied earlier in the detection process and a candidate rectangular feature is rejected at any stage for a negative response from the classifier. The stronger classifiers are applied later in the detection stages to reduce false positive detections. A positive rectangular feature from a simple classifier is further evaluated by a second, more complex classifier from the cascade classifier. The detection process rejects many of the negative rectangular features and detects all of the positive rectangular features. This process continues for all the classifiers in the cascade classifier.

It is important to mention that the use of a cascade classifier for detecting objects reduces the time and resources required in the object detection process. However, object detection using a cascade of classifiers is still time and resource consuming. It can be further optimized by porting the parallel parts of the object detection process to GPUs.

It is also important to note that the machine learning based real AdaBoost algorithm for training the cascade classifier is not run on the cloud. The cascade classifier is trained once on a single compute node. The trained cascade classifiers are used for detecting objects from the recorded video streams on the compute cloud.

Object Classification: The objects detected from the video streams can be classified according to their features. The colour, size, shape or a combination of these features can be used for the object classification.

In the second case study for vehicle detection, the vehicles detected from the video streams are classified into cars, vans or trucks according to their size. As explained above, the vehicles in the video streams are detected as they pass through the region of interest defined in the analysis request. Each detected vehicle is represented by a bounding box (a rectangle) that encloses the detected object. The height and the width of the bounding box are treated as the size of the detected vehicle. All the vehicles are detected at the same point in the video streams. Hence, the location of the detected vehicles in a video frame becomes irrelevant and the bounding box represents the size of the detected object.

The detected vehicles are classified into cars, vans and trucks by profiling the size of their bounding boxes as follows. The vehicles with a bounding box smaller than (100x100) are classified as cars, those smaller than (150x150) are classified as vans and all
the detected vehicles above this size are classified as trucks. The object detection and classification results are explained in the experimental results section.
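A minimal sketch of this size based profiling is given below. Comparing both dimensions of the bounding box against the (100x100) and (150x150) thresholds is an assumption about how the stated rule is applied; the thresholds themselves are those quoted in the text.

    import org.opencv.core.Rect;

    // Size-based classification of a detected vehicle from its bounding box.
    enum VehicleClass { CAR, VAN, TRUCK }

    final class VehicleSizeClassifier {
        static VehicleClass classify(Rect box) {
            if (box.width < 100 && box.height < 100) return VehicleClass.CAR;
            if (box.width < 150 && box.height < 150) return VehicleClass.VAN;
            return VehicleClass.TRUCK;
        }
    }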
The faces detected from the video streams, in the second case study, can be classified according to their gender or age. However, no classification is performed on the detected faces in this work. The object detection and classification results are explained in the experimental results section.

V. EXPERIMENTAL SETUP

This section explains the implementation and experimental details for evaluating the video analysis framework. The results focus on the accuracy, performance and scalability of the presented framework. The experiments are executed in two configurations: cloud deployment and cloud deployment with Nvidia GPUs.

The cloud deployment evaluates the scalability and robustness of the framework by analysing different aspects of the framework including (i) video stream decoding time, (ii) video data transfer time to the cloud, (iii) video data analysis time on the cloud nodes and (iv) collecting the results after completion of the analysis.

The experiments on the cloud nodes with GPUs evaluate the accuracy and performance of the video analysis approach on state of the art compute nodes with two GPUs each. These experiments also evaluate the video stream decoding and the video stream data transfer between CPU and GPU during the video stream analysis. The energy implications of the framework at different stages of the video analytics life cycle are also discussed towards the end of this section.

A. Compute Cloud

The framework is evaluated on the cloud resources available at the University of Derby. The cloud instance is running OpenStack Icehouse [34] with Ubuntu LTS 14.04.1. It consists of six server machines with 12 cores each. Each server has two 6-core Intel® Xeon® processors running at 2.4GHz with 32GB RAM and 2 Terabytes of storage capacity. The cloud instance has a total of 72 processing cores with 192GB of RAM and 12TB of storage capacity. OpenStack Icehouse provides a management and control layer for the cloud. It has a single dashboard for controlling the pool of computers, storage, network and other resources. The iSCSI solution using the Logical Volume Manager (LVM) is implemented as the default OpenStack Block Storage service.

We configured a cluster of 15 nodes on the cloud instance. Each of these 15 nodes has 100GB of storage space, 8GB RAM and 4 VCPUs running at 2.4 GHz. This 15 node cluster is used for evaluating the framework. These experiments focus on different aspects of the framework such as the analysis time of the framework, the effect of task parallelism on each node, the block size, the block replication factor and the number of compute/data nodes in the cloud. The purpose of these experiments is to test the performance, scalability and reliability of the framework with varying cloud configurations. The conclusions from these experiments can then be used for deploying the framework on a production cloud with as many nodes as required by the data size of a user application.

The framework is executed on Hadoop MapReduce [29] for the evaluations in the cloud. JavaCV, a Java wrapper of OpenCV [33], is used as the image/video processing library. Hadoop YARN schedules the jobs and manages resources for the running processes on Hadoop. There is a NameNode for managing nodes and load balancing among the nodes, a DataNode/ComputeNode for storing and processing the data, and a JobTracker for executing and tracking jobs. A DataNode stores video stream data as well as executes the video stream analysis tasks. This setting allows us to schedule analysis tasks on all the available nodes in parallel and on the nodes close to the data. Figure 4 summarizes the flow of analysing video streams on the cloud.

B. Compute Node with GPUs

The detection accuracy and performance of the framework is evaluated on cloud nodes with 2 Nvidia GPUs. The compute nodes used in these experiments have Nvidia Tesla K20C and Nvidia Quadro 600 GPUs. The Nvidia Tesla K20 has 5GB DDR5 RAM, a 208 GBytes/sec data transfer rate, 13 multiprocessor units and 2496 processing cores. The Nvidia Quadro 600 has 1GB DDR3 RAM, a 25.6 GBytes/sec data transfer rate, 2 multiprocessor units and 96 processing cores.

CUDA is used for implementing and executing the compute intensive parts of the object detection algorithm on a GPU. It is an SDK that uses the SIMD (Single Instruction Multiple Data) parallel programming model. It provides fine-grained data parallelism and thread parallelism nested within coarse-grained data and task parallelism [35], [36]. The CUDA program executing the compute intensive parts of the object detection algorithm is called a CUDA kernel. A CUDA program starts its execution on the CPU (called the host), processes the data with CUDA kernels on a GPU (called the device) and transfers the results back to the host.

Challenges in porting a CPU application to the GPU: The main challenge in porting a host application (a CPU based application) to a CUDA program is in identifying the parts of the host application that can be executed in parallel and isolating the data to be used by the parallel parts of the application. After porting the parallel parts of the host application to CUDA kernels, the program and data are transferred to the GPU memory and the processed results are transferred back to the host with CUDA API function calls.

The second challenge is faced while transferring the program data for kernel execution from the CPU to the GPU. This transfer is usually limited by the data transfer rates between CPU and GPU and the amount of available GPU memory.

The third challenge relates to the global memory access in a CUDA application. The global memory access on a GPU
takes between 400 and 600 clock cycles, as compared to 2 clock cycles for a GPU register memory access. The speed of memory access is also affected by the thread memory access pattern. The execution speed of a CUDA kernel will be considerably higher for coalesced memory access (all the threads in the same multiprocessor access consecutive memory locations) than for non-coalesced memory access.

The above challenges are taken into account while porting our CPU based object detection implementation to the GPU. The way we tackled these challenges is detailed below.

What is Ported on the GPU and Why

A video stream consists of individual video frames. All of these video frames are independent of each other from an object detection perspective and can be processed in parallel. The Nvidia GPUs use the SIMD model for executing CUDA kernels. Hence, video stream processing becomes an ideal application for porting to GPUs as the same processing logic is executed on every video frame.

We profiled the CPU execution of the video analysis approach for detecting objects from the video streams and identified the compute intensive parts in it. Scanning a video frame, constructing an integral image, and deciding the feature detection are the compute intensive tasks and consumed most of the processing resources and time in the CPU based implementation. These functions are ported to the GPU by writing CUDA kernels in our GPU implementation.

In the GPU implementation, the object detection process executes partially on the CPU and partially on the GPU. The CPU decodes a video stream and extracts video frames from it. These video frames and the cascade classifier data are transferred to a GPU for object detection. The CUDA kernel processes a video frame and the object detection results are transferred back to the CPU.

We used OpenCV [33], an image/video processing library, and its GPU component for implementing the analysis algorithms as explained in Section IV. JavaCV, a Java wrapper of OpenCV, is used in the MapReduce implementation. Some primitive image operations like converting the image colour space, image thresholding and image masking are used from the OpenCV library in addition to the Haar cascade classifier algorithm.
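For illustration, the per-frame detection step can be sketched with the OpenCV Java binding as below (the framework itself uses JavaCV, whose wrapper classes differ slightly, and the OpenCV GPU module for the CUDA path). The scale factor of 1.2 and the 20x20 minimum window follow the training parameters in Table III; the OpenCV native library must be loaded before use.

    import org.opencv.core.Mat;
    import org.opencv.core.MatOfRect;
    import org.opencv.core.Rect;
    import org.opencv.core.Size;
    import org.opencv.imgproc.Imgproc;
    import org.opencv.objdetect.CascadeClassifier;

    // Run a trained Haar cascade over one extracted video frame.
    final class FrameDetector {
        private final CascadeClassifier cascade;

        FrameDetector(String cascadeXmlPath) {
            this.cascade = new CascadeClassifier(cascadeXmlPath);
        }

        Rect[] detect(Mat bgrFrame) {
            Mat gray = new Mat();
            Imgproc.cvtColor(bgrFrame, gray, Imgproc.COLOR_BGR2GRAY);
            Imgproc.equalizeHist(gray, gray);
            MatOfRect detections = new MatOfRect();
            cascade.detectMultiScale(gray, detections, 1.2, 3, 0,
                                     new Size(20, 20), new Size());
            return detections.toArray();
        }
    }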
C. Image Datasets for Cascade Classifier Training and Testing

The UIUC image database [37] and the FERET image database [38] are used for training the cascade classifiers for detecting vehicles and faces from the recorded video streams respectively. The images in the UIUC database are gray scaled and contain front, side and rear views of cars. There are 550 single-scale car images and 500 non-car images in the training database. The training database contains two test image data sets. The first test set of 170 single-scale test images contains 200 cars at roughly the same scale as the training images, whereas the second test set has 108 multi-scale test images containing 139 cars at various scales.

The FERET image database [38] is used for training and creating the cascade classifier for detecting faces from the recorded video streams. A total of 5000 positive frontal face images and 3000 non-face images are used for training the face cascade classifier. The classifier is trained for frontal faces only with an input image size of 20x20. It has 21 boosting stages.

The input images used for training both cascade classifiers have a fixed size of 20x20. We used the opencv_haartraining utility provided with OpenCV for training both cascade classifiers. These datasets are summarized in Table III.

              Image Size   Positive Images   Negative Images   Boosting Stages   Scale Factor
    Vehicle   20x20        550               550               12                1.2
    Face      20x20        5000              3000              21                1.2

Table III: Image Dataset with Cascade Classifier Parameters

D. Energy Implications of the Framework

The energy consumed in a cloud based system is an important aspect of its performance evaluation. The following three are the major areas where energy savings can be made, leading to energy efficient video stream analysis:
1) Energy consumption on camera posts
2) Energy consumption in video stream transmission
3) Energy consumption during video stream analysis

Energy consumed on camera posts: Modern state of the art cameras are employed which consume as little energy as possible. Solar powered, network enabled digital cameras
are replacing the traditional power hungry cameras. These cameras behave like sensors and only activate themselves when an object appears in front of the camera, leading to considerable energy savings.

Energy consumed in video stream transmission: There are two ways we can reduce the energy consumption during the video stream transmission. The first is by using compression techniques at the camera sources. We are using H.264 encoding in our framework, which compresses the data up to 20 times before streaming it to a cloud data center. The second approach is by using on-source analysis to reduce the amount of data and thereby reduce the energy consumed in the streaming process. The on-source analysis will have in-built rules on the camera sources that will detect important information, such as motion, and will only stream the data that is necessary for analysis. We will employ a combination of these two approaches to considerably minimize the energy consumption in the video transmission process.

Energy consumption during video stream analysis: The energy consumed during the analysis of video data accounts for a large portion of the energy consumption in the whole life cycle of a video stream. We have employed GPUs for efficient processing of the data. GPUs have more lightweight processing cores than CPUs and consume less energy. Nvidia's Fermi (Quadro 600) and Kepler (Tesla K20) GPUs are at least 10 times more energy efficient than the latest x86 based CPUs [39]. We used both of these GPUs in the reported results.

The GPUs execute the compute intensive parts of the video analysis algorithms efficiently. We achieved a speed up of at least two times on cloud nodes with GPUs as compared to the cloud nodes without GPUs. In this way the energy use is reduced by half for the same amount of data. Ensuring the availability of video data closer to the cloud nodes also reduced the unnecessary transfer of the data during the analysis. We are also experimenting with the use of in-memory analytics that will further reduce the time to analyse, leading to considerable energy savings.

VI. EXPERIMENTAL RESULTS

We present and discuss the results obtained from the two configurations detailed in Section V. These results focus on evaluating the framework for object detection accuracy, performance and scalability. The execution of the framework on the cloud nodes with GPUs evaluates the performance and detection accuracy of the video analysis approach for object detection and classification. It also evaluates the performance of the framework for video stream decoding, video stream data transfer between CPU and GPU and the performance gains from porting the compute intensive parts of the algorithm to the GPUs.

The cloud deployment without GPUs evaluates the scalability and robustness of the framework by analysing different components of the framework such as video stream decoding, video data transfer from local storage to the cloud nodes, video data analysis on the cloud nodes, fault-tolerance and collecting the results after completion of the analysis. The object detection and classification results for the vehicle/face detection and vehicle classification case studies are summarized towards the end of this section.

A. Performance of the Trained Cascade Classifiers

The performance of the trained cascade classifiers is evaluated for the two case studies presented in this paper, i.e. vehicle and face detection from the recorded video streams. It is evaluated by the detection accuracy of the trained cascade classifiers and the time taken to detect the objects of interest from the recorded video streams. The training part of the real AdaBoost algorithm is not executed on the cloud resources. The cascade classifiers for vehicle and face detection are trained once on a single compute node and are used for detecting objects from the recorded video streams on the cloud resources.

Cascade Classifier for Vehicle Detection: The UIUC image database [37] is used for training the cascade classifier for vehicle detection. The details of this data set are explained in Section V. The minimum detection rate was set to 0.999 and 0.5 was set as the maximum false positive rate during the training. The test images data set varied in lighting conditions and background scheme.

The input images used in the classifier training have a fixed size of 20x20 for vehicles. Only those vehicles will be detected that have a similar size to the training images. The recorded video streams have varying resolutions (Section II) and capture objects at different scales. Objects of sizes other than 20x20 can be detected from the recorded video streams by re-scaling the video frames. A scale factor of 1.1 means decreasing the video frame size by 10%. It increases the chance of matching the size of the training model; however, re-scaling the image is computationally expensive and increases the computation time during the object detection process.

[Table IV: Detection Rate with Varying Scaling Factor for the Supported Video Formats]
[Figure 5: (a) Frame decode, transfer and analysis times for the supported video formats; (b) total analysis time of one video stream for the supported video formats on CPU and GPU.]
    Video     Video Stream   Frame Decode   Frame Data   Frame Transfer     Single Frame Analysis Time    Single Video Stream Analysis Time
    Format    Resolution     Time           Size         Time (CPU-GPU)     CPU          GPU              CPU           GPU
    QCIF      176 x 144      0.11 msec      65 KB        0.02 msec          3.03 msec    1.09 msec        9.39 sec      3.65 sec
    CIF       352 x 288      0.28 msec      273 KB       0.12 msec          9.49 msec    4.17 msec        29.31 sec     13.71 sec
    4CIF      704 x 576      0.62 msec      1.06 MB      0.59 msec          34.28 msec   10.17 msec       104.69 sec    34.13 sec
    Full HD   1920 x 1080    2.78 msec      2.64 MB      0.89 msec          44.79 msec   30.38 msec       142.71 sec    105.14 sec

Table VI: Single Video Stream Analysis Time for Supported Video Formats
and the video stream processing time. It is important to note that no data transfer is required in the CPU implementation as the video frame data is processed by the same CPU. The time taken for all of these steps in the CPU and GPU executions is explained in the rest of this section.

Decoding a Video Stream: Video stream analysis starts by decoding a video stream. The video stream decoding is performed using the FFmpeg library and involves reading a video stream from the hard disk or from the cloud data storage and extracting video frames from it. It is an I/O bound process and can potentially make the whole video stream analysis very slow if not handled properly. We performed buffered video stream decoding to avoid any delays caused by the I/O process. The recorded video stream of 120 seconds duration is read into buffers for further processing. The buffered video stream decoding is also dependent on the available amount of RAM on a compute node. The amount of memory used for buffered reading is a configurable parameter in our framework.

The video stream decoding time for extracting one video frame for the supported video formats (QCIF, CIF, 4CIF, Full HD) varied between 0.11 and 2.78 milliseconds. The total time for decoding a video stream of 120 seconds duration varied between 330 milliseconds and 8.34 seconds for the supported video formats. It can be observed from Figure 5a that less time is taken to decode a lower resolution video format and more time to decode the higher resolution video formats. The video stream decoding time is the same for both the CPU and GPU implementations as the video stream decoding is only done on the CPU.

Transferring Video Frame Data from CPU to GPU Memory: Video frame processing on a GPU requires the transfer of the video frame and other required data from the CPU memory to the GPU memory. This transfer is limited by the data transfer rates between CPU and GPU and the amount of available memory on the GPU. High end GPUs such as the Nvidia Tesla and Nvidia Quadro provide better data transfer rates and have more available memory. The Nvidia Tesla K20 has 5GB DDR5 RAM and a 208 GBytes/sec data transfer rate and the Quadro 600 has 1GB DDR3 RAM and a data transfer rate of 25.6 GBytes/sec. Lower end consumer GPUs have limited on-board memory and are not suitable for video analytics.

The size of the data transfer from CPU memory to GPU memory depends on the video format. The individual video frame data size for the supported video formats varies from 65KB to 2.64 MB (the least for QCIF and the highest for the Full HD video format). The data transfer from CPU memory to the GPU memory took between 0.02 and 0.89 milliseconds for an individual video frame. The total transfer time from CPU to GPU for a QCIF video stream took only 60 milliseconds and 2.67 seconds for a Full HD video stream. Transferring the processed data back to CPU memory from GPU memory consumed almost the same time. The data size of a video frame for the supported video formats and the time taken to transfer this data from CPU to GPU are summarized in Table VI. An individual video frame reading time, the transfer time from CPU to GPU and the processing time on the CPU and on the GPU are graphically depicted in Figure 5a. No data transfer is required for the CPU implementation as the CPU processes a video frame directly from the CPU memory.

Processing Video Frame Data: The processing of data for object detection on a GPU is started after all the required data is transferred from CPU to GPU. The data processing on a GPU is dependent on the available CUDA processing cores and the number of simultaneous processing threads supported by the GPU. The processing of an individual video frame means processing all of its pixels for detecting objects from it using the cascade classifier algorithm. The processing time for an individual video frame of the supported video formats varied between 1.09 milliseconds and 30.38 milliseconds. The total processing time of a video stream on a GPU varied between 3.65 seconds and 105.14 seconds.

The total video stream analysis time on a GPU includes the video stream decoding time, the data transfer time from CPU to GPU, the video stream processing time, and transferring the processed data back to the CPU memory from the GPU memory. The analysis time for a QCIF video stream of 120 seconds duration is 3.65 seconds and the analysis time for a Full HD video stream of the same duration is 105.14 seconds.
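These totals are consistent with the per-frame figures in Table VI. As an illustrative check for a 120 second QCIF clip (3000 frames):

    (0.11 + 0.02 + 1.09) ms x 3000 frames ≈ 3.66 s

which agrees with the reported 3.65 seconds to within the rounding of the per-frame values (the copy of the small result data back to the CPU is not itemised separately in the table).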
The processing of a video frame on a CPU does not involve any data transfer and is quite straightforward. The video frame data is already available in the CPU memory. The CPU reads the individual frame data and applies the algorithm to it. The CPU has fewer processing cores than a GPU and takes more time to process an individual video frame than a GPU. It took 3.03 milliseconds to process a QCIF video frame and 44.79 milliseconds to process a Full HD video frame. The total processing time of a video stream for the supported video formats varied between 9.09 seconds and 134.37 seconds. Table VI summarizes the individual video frame processing times for the supported video formats.

The total video stream analysis time on the CPU includes the video stream decoding time and the video stream processing time. The total analysis time for a QCIF video stream is 9.39 seconds and the total analysis time for a Full HD video stream of 120 seconds duration is 142.71 seconds. It is obvious that the processing of a Full HD video stream on the CPU is slower and takes more time than the length of a Full HD video stream. Each recorded video stream took 25% of the CPU processing power of the compute node. We were limited to analysing only three video streams in parallel on one CPU. The system was
The system crashed with the simultaneous analysis of more than three video streams.
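These per-stream totals can be cross-checked against the per-frame figures reported above. The sketch below (in Java, with the measured per-frame times hard-coded and a 120-second, 25 fps stream assumed) simply multiplies the per-frame cost by the 3000 frames of one stream; the exact decomposition is an approximation, and the small residual difference from the measured totals is due to per-stream overheads, such as copying results back from the GPU, that are not broken out here.

/**
 * Back-of-the-envelope composition of the reported per-frame times into
 * per-stream totals. Values are taken from the measurements in this section;
 * the split of the residual overheads is an assumption.
 */
public class AnalysisTimeEstimate {

    static double streamSeconds = 120.0;                 // duration of one recorded stream
    static double fps = 25.0;                            // frame rate of the recorded streams
    static double frames = streamSeconds * fps;          // 3000 frames per stream

    static double gpuEstimateSeconds(double decodeMs, double uploadMs, double processMs) {
        // decode on CPU + host-to-device transfer + GPU processing, per frame
        return frames * (decodeMs + uploadMs + processMs) / 1000.0;
    }

    static double cpuEstimateSeconds(double decodeMs, double processMs) {
        // the CPU path has no transfer step
        return frames * (decodeMs + processMs) / 1000.0;
    }

    public static void main(String[] args) {
        // QCIF: 0.11 ms decode, 0.02 ms upload, 1.09 ms GPU processing
        System.out.printf("QCIF on GPU    ~ %.2f s (measured 3.65 s)%n",
                gpuEstimateSeconds(0.11, 0.02, 1.09));
        // Full HD: 2.78 ms decode, 0.89 ms upload, 30.38 ms GPU processing
        System.out.printf("Full HD on GPU ~ %.2f s (measured 105.14 s)%n",
                gpuEstimateSeconds(2.78, 0.89, 30.38));
        // CPU path: 3.03 ms processing (QCIF) and 44.79 ms processing (Full HD)
        System.out.printf("QCIF on CPU    ~ %.2f s (measured 9.39 s)%n",
                cpuEstimateSeconds(0.11, 3.03));
        System.out.printf("Full HD on CPU ~ %.2f s (measured 142.71 s)%n",
                cpuEstimateSeconds(2.78, 44.79));
    }
}

Running this reproduces the measured CPU figure for Full HD (142.71 seconds) exactly and comes within a few percent of the measured GPU figures.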
In the GPU execution, we observe less speed up for the QCIF and CIF video formats as compared to the 4CIF video format. QCIF and CIF are low resolution video formats, and a part of the processing speed up is overshadowed by the data transfer overhead from the CPU memory to the GPU memory. The highest speed up of 3.07 times is observed for the 4CIF video format, which is the least affected by the data transfer overhead, as can be observed in Figure 5b.

We can analyse more video streams by processing them in parallel. As mentioned above, we could only analyse 3 video streams in parallel on a single CPU and were constrained by the availability of CPU processing power. We therefore spawned multiple video stream processing threads from the CPU to the GPU. In this way, multiple video streams are processed in parallel on a GPU, with the video frames of each video stream processed in their own thread. We analysed four parallel video streams on the two Nvidia GPUs, namely the Tesla K20 and the Quadro 600, each GPU analysing 2 video streams in parallel.
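A minimal sketch of this per-stream threading is given below. The fixed thread pool mirrors the "one thread per stream" scheme described above, but the class and method names (in particular GpuFrameProcessor) are illustrative assumptions rather than the framework's actual code.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Dispatches each video stream to its own worker thread so that several
// streams are analysed concurrently on one GPU.
public class ParallelStreamAnalysis {

    interface GpuFrameProcessor {                 // assumed wrapper around the GPU cascade classifier
        void processStream(String streamPath);
    }

    public static void analyse(List<String> streamPaths, GpuFrameProcessor gpu,
                               int streamsPerGpu) throws InterruptedException {
        // two streams per GPU worked well in our experiments
        ExecutorService pool = Executors.newFixedThreadPool(streamsPerGpu);
        for (String path : streamPaths) {
            // all frames of one stream stay in one thread
            pool.submit(() -> gpu.processStream(path));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}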
C. Analysing Video Streams on Cloud

We explain the evaluation results of the framework on the cloud resources in this section. These results focus on the evaluation of the scalability of the framework. The analysis of a video stream on the cloud using Hadoop [29] is evaluated in three distinct phases:
1) Transferring the recorded video stream data from the storage server to the cloud nodes
2) Analysing the video stream data on the cloud nodes
3) Collecting results from the cloud nodes

The Hadoop MapReduce framework is used for processing the video frame data in parallel on the cloud nodes. The input video stream data is transferred into the Hadoop file storage (HDFS). This video stream data is analysed for object detection and classification using the MapReduce framework. The meta-data produced is then collected and stored in the Analytics database for later use (as depicted in Figure 4).

Each cloud node executes one or more "analysis tasks". An analysis task is a combination of map and reduce tasks and is generated from the analysis request submitted by a user. A map task in our framework is used for processing the video frames for object detection and classification and for generating the analytics meta-data. The reduce task writes the meta-data back into the output sequence file. A MapReduce job splits the input sequence file into independent data chunks, and each data chunk becomes the input to an individual map task. The output sequence file is downloaded from the cloud data storage and the results are stored in the Analytics database. It is important to mention that the MapReduce framework takes care of scheduling the map and reduce tasks, monitoring their execution and re-scheduling the failed tasks.
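The sketch below illustrates this map and reduce structure. It assumes the input sequence file is keyed by a frame identifier with the PNG bytes of the frame as the value, and uses the OpenCV Java binding of the cascade classifier; the class names, the cascade file name and the meta-data record format are illustrative assumptions, not the framework's actual code.

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.opencv.core.Mat;
import org.opencv.core.MatOfByte;
import org.opencv.core.MatOfRect;
import org.opencv.core.Rect;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.objdetect.CascadeClassifier;

// Map task: run the cascade classifier on one decoded frame and emit
// detection meta-data (frame id -> "x,y,w,h" for each detected object).
class DetectionMapper extends Mapper<Text, BytesWritable, Text, Text> {
    private CascadeClassifier classifier;

    @Override
    protected void setup(Context context) {
        // assumes the OpenCV native library is already loaded and the cascade
        // file has been shipped with the job (e.g. via the distributed cache)
        classifier = new CascadeClassifier("cascade.xml");
    }

    @Override
    protected void map(Text frameId, BytesWritable png, Context context)
            throws IOException, InterruptedException {
        Mat frame = Imgcodecs.imdecode(new MatOfByte(png.copyBytes()),
                                       Imgcodecs.IMREAD_GRAYSCALE);
        MatOfRect detections = new MatOfRect();
        classifier.detectMultiScale(frame, detections);
        for (Rect r : detections.toArray()) {
            context.write(frameId, new Text(r.x + "," + r.y + "," + r.width + "," + r.height));
        }
    }
}

// Reduce task: gather the meta-data of one frame into a single record that is
// written back to the output sequence file.
class MetaDataReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text frameId, Iterable<Text> detections, Context context)
            throws IOException, InterruptedException {
        StringBuilder record = new StringBuilder();
        for (Text d : detections) {
            record.append(d.toString()).append(';');
        }
        context.write(frameId, new Text(record.toString()));
    }
}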
Creating Sequence File from Recorded Video Streams: The video files have a frame rate of 25 frames per second. Therefore, each 120 second video file has 3000 (=120*25) individual video frames. The individual video frames are extracted as images and these images are saved as PNG files before being transferred into the cloud data storage. The number of 120 second video files in one hour is 30 and in 24 hours is 720. The total number of image files for 24 hours of recorded video streams from one camera source is therefore 216,000 (=720*3000).

These small image files are not suitable for direct transfer to the cloud data storage and processing with the MapReduce framework, as the MapReduce framework is designed for processing large files and processing many small files decreases the overall performance. The files are therefore first converted into a large file which is suitable for storing in the cloud data storage and processing with the MapReduce framework. The process of converting these small files into a large file and transferring it to the cloud nodes is explained below.

The default block size for data storage on the cloud nodes is 128MB. Any file smaller than this size will still occupy one block and will thus decrease the performance. All the video files recorded on the storage server are smaller than the 128MB block size. These small files require lots of disk seeks, an inefficient data access pattern and increased hopping from DataNode to DataNode due to the distribution of the files across the nodes. These factors result in reduced overall performance of the whole system. When a large number of small files is stored in the cloud data storage (a minimum of 216,000 per camera per day), the meta-data of these files occupies a large portion of the namespace. Every file/directory/block in the cloud data storage is represented as an object in the NameNode's memory and occupies namespace, and the namespace capacity is limited by the physical memory of the NameNode. Each of these objects is 150 bytes. If the video files are transferred to the cloud data storage without converting them into a large file, one month of recorded video stream data would require around 10GB of cloud data storage for storing the meta-data only, and this is from one camera source alone. It results in reduced efficiency of the data storage in particular and of the whole cloud system in general.

These small files can either be stored as Hadoop Archive (HAR) files or as Hadoop sequence files. HAR file data access is slower, as it requires two index file reads in addition to the data file read, and is therefore not suitable for our framework.
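As a rough illustration of this conversion (the directory layout, file names and key format are assumptions made for the example, not the framework's actual layout), the extracted PNG frames of one day can be packed into a single Hadoop sequence file as follows:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs the small PNG frame images of one recorded day into a single Hadoop
// sequence file, so that HDFS stores a few large blocks instead of ~216,000
// tiny files.
public class FramePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        org.apache.hadoop.fs.Path out =
                new org.apache.hadoop.fs.Path("hdfs:///video/camera1/day01.seq"); // assumed layout
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            try (Stream<java.nio.file.Path> frames = Files.list(Paths.get("/data/camera1/day01"))) {
                List<java.nio.file.Path> pngs = frames.sorted().collect(Collectors.toList());
                for (java.nio.file.Path png : pngs) {
                    byte[] bytes = Files.readAllBytes(png);
                    // key: frame file name, value: raw PNG bytes of one video frame
                    writer.append(new Text(png.getFileName().toString()),
                                  new BytesWritable(bytes));
                }
            }
        }
    }
}

With Text keys and BytesWritable values, the same file can later be read back by the map tasks through the standard sequence file input format.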
[Figure: Sequence File Creation Time; y-axis: File Creation Time (Hours)]

Table VII: Analysis Time for Varying Data Set Sizes on the Cloud (in Hours)
Figure 9: Analysis Time with Varying Number of the Cloud Nodes (analysis time in hours for the 128MB, 192MB and 256MB block sizes, with 3 to 15 nodes)

With increasing data set size, an increasing trend is observed for the map/reduce task execution time (Figure 8). Varying the block size has no major effect on the execution time, and all the data sets consumed almost the same time for each block size. The execution time varied between 6.38 minutes and 5.83 hours for the 5GB and 175GB data sets respectively. The analysis time for the varying data sets and block sizes on the cloud is summarized in Table VII.

All the block sizes consumed almost the same amount of memory except the 64MB block size, which required more physical memory to complete the execution as compared to the other block sizes (Figure 10). The 64MB block size is smaller than the default cloud storage block size (128MB) and produces more data blocks to be processed by the cloud nodes. These smaller blocks cause management and memory overhead; the map tasks become inefficient with the smaller block sizes and more memory is needed to process them. The system crashed when less memory was allocated to a container with the 64MB block size, which therefore required compute nodes with 16GB of RAM. The total memory consumed by the varying data sets is summarized in Figure 10.

Robustness with Changing Cluster Size: The objective of this set of experiments is to measure the robustness of the framework. It is measured by the total analysis time and the speedup achieved with a varying number of cloud nodes. The number of cloud nodes is varied from 3 to 15 and the data set is varied from 5GB to 175GB.

Figure 10: Total Memory Consumed for Analysis on the Cloud (memory in terabytes against data set size in GB, for the 64MB, 128MB, 192MB and 256MB block sizes)

We measured the time taken by one analysis task and the total time taken for analysing the whole data set with a varying number of cloud nodes. Each analysis task takes a minimum time that cannot be reduced beyond a certain limit; the inter-process communication and the data reads and writes from the cloud storage are the limiting factors in reducing the execution time. However, the total analysis time decreases with an increasing number of nodes. The time taken by a single map/reduce task and the analysis time of the 175GB data set are summarized in Table VIII.

The execution time for the 175GB data set with a varying number of nodes and 3 different cloud storage block sizes is depicted in Figure 9. The execution time shows a decreasing trend with an increasing number of nodes. When the framework is executing with 3 nodes, it takes about 27.80 hours to analyse the 175GB data set, whereas the analysis of the same data set with 15 nodes is completed in 5.83 hours.

Task Parallelism on Compute Nodes: The total number of analysis tasks is equal to the number of input splits. The number of analysis tasks running in parallel on a cloud node depends on the input data set, the available physical resources and the cloud data storage block size. For the data set size of 175GB and the default cloud storage block size of 128MB, 1400 map/reduce tasks are launched. Each node has 8GB of RAM, of which 2GB is reserved for the operating system and the Hadoop framework. Each container with one processing core and 6GB of RAM provided the best performance results.
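For reference, this container sizing can be expressed with the standard Hadoop/YARN memory and vcore properties. The values below simply restate the 6GB, one-core containers used in our experiments; they are not the only workable configuration, and the node-level property is normally set in yarn-site.xml on each node rather than per job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Restates the container sizing used in these experiments: 8GB nodes with
// ~2GB reserved for the OS and Hadoop services, one core and 6GB per container.
public class AnalysisJobConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // cluster-side setting: memory YARN may hand out on each 8GB node
        conf.setInt("yarn.nodemanager.resource.memory-mb", 6144);
        conf.setInt("mapreduce.map.memory.mb", 6144);      // one 6GB container per map task
        conf.setInt("mapreduce.map.cpu.vcores", 1);        // one processing core per container
        conf.setInt("mapreduce.reduce.memory.mb", 6144);
        conf.set("mapreduce.map.java.opts", "-Xmx5120m");  // leave headroom inside the container
        return Job.getInstance(conf, "video-stream-analysis");
    }
}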
We varied the number of nodes from 3 to 15 in this set of experiments. The number of analysis tasks on each node increases with a decreasing number of nodes. The increased number of tasks per node reduces the performance of the whole framework: each task has to wait longer to get scheduled and executed due to the over-occupied physical resources on the compute nodes. Table VIII summarizes the framework performance with a varying number of nodes. It also shows the number of tasks executed on each compute node.

The analysis time for the 5GB to 175GB data sets with varying block sizes is summarized in Table VII and is graphically depicted in Figure 8. It is observed that the analysis time of a data set with a larger block size is lower than with a smaller block size. A smaller number of map tasks, obtained with a large block size, is better suited to processing a data set, as the reduced number of tasks reduces the memory and management overhead on the compute nodes. However, input splits of 512MB and 1024MB did not fit in the cloud nodes with 8GB RAM and required compute nodes with a larger memory of 16GB. The variation in the block size did not affect the execution time of the map task; the 175GB data set with a 512MB block size took the same time as with 128MB or the other block sizes. However, the larger block sizes required more time to transfer the data, and larger compute nodes are needed to process it.

Effect of Data Locality on Video Stream Analysis: Data locality in a cloud based video analytics system can increase its performance. Data locality means that the video stream data is available at the node responsible for analysing that data. The HDFS block replication factor determines the number of replicas of each data block and ensures the data locality and fault tolerance of the input data.
In this set of experiments, we measured the time taken by the framework for analysing each data set with the block replication factor varied under different block sizes. The block replication factor ensures data locality and fault tolerance of the input data. The data sets varied from 5GB to 175GB, the block replication factor is varied from 1 to 5 and the block size is varied from 64MB to 1024MB. The effect of the varying replication factor on data locality is shown in Figure 11: the x-axis represents the data replication factor for the 175GB data set and the y-axis represents the percentage of data local tasks as compared to the rack local tasks. It is evident from Figure 11 that increasing the replication factor increases the data locality of the map/reduce tasks. However, replication factors of 4 and 5 resulted in over-replication of the cloud data storage. Over-replication consumes more disk space and network bandwidth to transfer the replicated data to the cloud nodes.

Figure 11: Effect of Cloud Data Storage Replication Factor
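For completeness, the replication factor varied in these experiments corresponds to the standard HDFS setting, which can be fixed cluster-wide or adjusted per file; the sketch below shows both, with an illustrative path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sets the HDFS replication factor that the data-locality experiments vary
// between 1 and 5. A higher factor gives the scheduler more choices of a
// DataNode that already holds a block, at the cost of disk space and network use.
public class ReplicationSetting {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                      // default for newly written files
        try (FileSystem fs = FileSystem.get(conf)) {
            // ... or adjust an already stored sequence file (path is illustrative)
            fs.setReplication(new Path("/video/camera1/day01.seq"), (short) 3);
        }
    }
}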
2) Storing the Results in Analytics Database: The reducers in the map/reduce tasks write the analysis results to one output sequence file. This output sequence file is processed separately, by a utility, to extract the analysis meta-data from it. The meta-data is stored in the Analytics database.
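A minimal sketch of such a utility is given below; the JDBC URL, the table layout and the record format are assumptions made for illustration, not the actual Analytics database schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Reads the output sequence file produced by the reducers and stores each
// frame's detection meta-data in the Analytics database.
public class ResultLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                     SequenceFile.Reader.file(new Path("results/part-r-00000")));
             Connection db = DriverManager.getConnection("jdbc:postgresql://analytics/db"); // assumed
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO detections(frame_id, objects) VALUES (?, ?)")) {
            Text frameId = new Text();
            Text objects = new Text();
            while (reader.next(frameId, objects)) {      // one record per analysed frame
                insert.setString(1, frameId.toString());
                insert.setString(2, objects.toString());
                insert.executeUpdate();
            }
        }
    }
}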
D. Object Detection and Classification

The object detection performance of the framework, its scalability and robustness are evaluated in the above set of experiments. This evaluation is important for the technical viability and acceptance of the framework. Another important evaluation aspect of the framework is the count of detected and classified objects after completing the video stream analysis. The count of detected faces in the region of interest is reported for the face detection case study, while the classification of the vehicles is based on their size. A total of 127,243 cars, 26,261 vans, and 7,441 trucks pass through the defined region of interest. These counts of detected faces and vehicles are largely dependent on the input data set used for these case studies and may vary depending on the video stream duration.

VII. CONCLUSIONS & FUTURE RESEARCH DIRECTIONS

The cloud based video analytics framework for automated object detection and classification is presented and evaluated in this paper. The framework automated the video stream analysis process by using a cascade classifier and laid the foundation for the experimentation of a wide variety of video analytics algorithms.

The video analytics framework is robust and can cope with a varying number of nodes or increased volumes of data. The time to analyse one month of video data depicted a decreasing trend with the increasing number of nodes in the cloud, as summarized in Figure 9. The analysis time of the recorded video streams decreased from 27.80 hours to 5.83 hours when the number of nodes in the cloud varied from 3 to 15. The analysis time would decrease further when more nodes are added to the cloud.

The larger volumes of video streams required more time to perform object detection and classification. The analysis time varied from 6.38 minutes to 5.83 hours, with the video stream data increasing from 5GB to 175GB.

The time taken to analyse one month of recorded video stream data on a cloud with GPUs is shown in Figure 12. The speed up gain for the supported video formats varied according to the data transfer overheads; however, the maximum speed up is observed for the 4CIF video format. CIF and 4CIF video formats are mostly used for recording video streams from cameras. The analysis time for one month of recorded video stream data is summarized in Table IX.

Table IX: Time for Analyzing One Month of Recorded Video Streams for the Supported Video Formats
               QCIF    CIF      4CIF    Full HD
  GPU   Hours  5.47    20.57    51.2    157.71
        Days   0.23    0.86     2.13    6.57

A cloud node with two GPUs mounted on it took 51 hours to analyse one month of the recorded video streams in the 4CIF format, whereas the analysis of the same data on the 15 node cloud took a maximum of 6.52 hours. The analysis of these video streams on 15 cloud nodes with GPUs took 3 hours. The cloud nodes with GPUs therefore yield a speed up of 2.17 times as compared to the cloud nodes without GPUs.

Figure 12: Analysis Time of One Month of Video Streams on GPU for the Supported Video Formats
In this work, we did not empirically measure the energy used in capturing, streaming and processing the video data. We did not measure the energy used while acquiring video streams from the camera posts and streaming them to the cloud data centre. It is generally perceived that GPUs are energy efficient and will save considerable energy when executing the video analytics algorithms; however, we did not empirically verify this in our experiments. This is one of the future directions of this work.

In future, we would also like to extend our framework for processing live data coming directly from the camera sources. This data will be written directly into the data pipeline by converting it into sequence files. We would also extend our framework by making it more subjective. This will enable us to perform logical queries, such as "How many cars of a specific colour passed yesterday", on video streams. More sophisticated queries, such as "How many cars of a specific colour entered the parking lot between 9 AM and 5 PM on a specific date", will also be included.

Instead of using sequence files, in future we would also like to use a NoSQL database such as HBase for achieving scalability in data writes. Furthermore, Hadoop does not meet low latency requirements; we plan to use Spark [40] for this purpose in future.

VIII. ACKNOWLEDGMENTS

This research is jointly supported by the Technology Support Board, UK and XAD Communications, Bristol under Knowledge Transfer Partnership grant number KTP008832. The authors would like to thank Tulasi Vamshi Mohan and Gokhan Koch from the video group of XAD Communications for their support in developing the software platform used in this research. We would also like to thank Erol Kantardzhiev and Ahsan Ikram from the data group of XAD Communications for their support in developing the file client library. We are especially thankful to Erol for his expert opinion and continued support in resolving all the network related issues.
REFERENCES

[1] "The picture is not clear: How many surveillance cameras are there in the UK?" Research Report, July 2013.
[2] K. Ball, D. Lyon, D. M. Wood, C. Norris, and C. Raab, "A report on the surveillance society," Report, September 2006.
[3] M. Gill and A. Spriggs, "Assessing the impact of CCTV," London Home Office Research, Development and Statistics Directorate, February 2005.
[4] S. J. McKenna and S. Gong, "Tracking colour objects using adaptive mixture models," Image and Vision Computing, vol. 17, pp. 225–231, 1999.
[5] N. Ohta, "A statistical approach to background suppression for surveillance systems," in International Conference on Computer Vision, 2001, pp. 481–486.
[6] D. Koller, J. W. W. Haung, J. Malik, G. Ogasawara, B. Rao, and S. Russel, "Towards robust automatic traffic scene analysis in real-time," in International Conference on Pattern Recognition, 1994, pp. 126–131.
[7] J. S. Bae and T. L. Song, "Image tracking algorithm using template matching and PSNF-m," International Journal of Control, Automation, and Systems, vol. 6, no. 3, pp. 413–423, June 2008.
[8] J. Hsieh, W. Hu, C. Chang, and Y. Chen, "Shadow elimination for effective moving object detection by Gaussian shadow modeling," Image and Vision Computing, vol. 21, no. 3, pp. 505–516, 2003.
[9] S. Mantri and D. Bullock, "Analysis of feedforward-backpropagation neural networks used in vehicle detection," Transportation Research Part C: Emerging Technologies, vol. 3, no. 3, pp. 161–174, June 1995.
[10] T. Abdullah, A. Anjum, M. Tariq, Y. Baltaci, and N. Antonopoulos, "Traffic monitoring using video analytics in clouds," in 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), 2014, pp. 39–48.
[11] K. F. MacDorman, H. Nobuta, S. Koizumi, and H. Ishiguro, "Memory-based attention control for activity recognition at a subway station," IEEE MultiMedia, vol. 14, no. 2, pp. 38–49, April 2007.
[12] C. Stauffer and W. E. L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, August 2000.
[13] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999, pp. 246–252.
[14] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in IEEE Conference on Computer Vision and Pattern Recognition, 2001, pp. 511–518.
[15] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang, "Large-scale image classification: Fast feature extraction and SVM training," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[16] V. Nikam and B. B. Meshram, "Parallel and scalable rules based classifier using map-reduce paradigm on Hadoop cloud," International Journal of Advanced Technology in Engineering and Science, vol. 02, no. 08, pp. 558–568, 2014.
[17] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297–336, December 1999.
[18] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2010.
[19] A. Ishii and T. Suzumura, "Elastic stream computing with clouds," in 4th IEEE International Conference on Cloud Computing, 2011, pp. 195–202.
[20] Y. Wu, C. Wu, B. Li, X. Qiu, and F. Lau, "CloudMedia: When cloud on demand meets video on demand," in 31st International Conference on Distributed Computing Systems, 2011, pp. 268–277.
[21] J. Feng, P. Wen, J. Liu, and H. Li, "Elastic Stream Cloud (ESC): A stream-oriented cloud computing platform for rich internet application," in International Conference on High Performance Computing and Simulation, 2010.
[22] "Vi-System," http://www.agentvi.com/.
[23] "SmartCCTV," http://www.smartcctvltd.com/.
[24] "Project BESAFE," http://imagelab.ing.unimore.it/besafe/.
[25] Bosch Security Systems, "IVA 5.60 intelligent video analysis," Tech. Rep., 2014.
[26] "EPTACloud," http://www.eptascape.com/products/eptaCloud.html.
[27] "Intelligent Vision," http://www.intelli-vision.com/products/intelligent-video-analytics.
[28] K.-Y. Liu, T. Zhang, and L. Wang, "A new parallel video understanding and retrieval system," in IEEE International Conference on Multimedia and Expo (ICME), July 2010, pp. 679–684.
[29] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, January 2008.
[30] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.
[31] H. Schulzrinne, A. Rao, and R. Lanphier, "Real Time Streaming Protocol (RTSP)," Internet RFC 2326, April 1998.
[32] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications," Internet RFC 3550, July 2003.
[33] "OpenCV," http://opencv.org/.
[34] "OpenStack Icehouse," http://www.openstack.org/software/icehouse/.
[35] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, April 2008.
[36] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st ed. Addison-Wesley Professional, 2010.
[37] http://cogcomp.cs.illinois.edu/Data/Car/.
[38] http://www.itl.nist.gov/iad/humanid/feret/.
[39] http://www.nvidia.com/object/gcr-energy-efficiency.html.
[40] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.