Fast CNN-Based Object Tracking Using Localization Layers and Deep Features Interpolation
Abstract—Object trackers based on Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on recent tracking benchmarks, while they suffer from slow computational speed. The high computational load arises from the extraction of the feature maps of the candidate and training patches in every video frame. The candidate and training patches are typically placed randomly around the previous target location and the estimated target location respectively. In this paper, we propose novel schemes to speed up the processing of CNN-based trackers. We input the whole region of interest once to the CNN to eliminate the redundant computations of the random candidate patches. In addition to classifying each candidate patch as object or background, we adapt the CNN to classify the target location inside the object patches as a coarse localization step, and we employ bilinear interpolation of the CNN feature maps as a fine localization step. Moreover, bilinear interpolation is exploited to generate the CNN feature maps of the training patches without actually forwarding them through the network, which achieves a significant reduction of the required computations. Our tracker does not rely on offline video training. It achieves competitive performance on the OTB benchmark with an 8x speed improvement compared to the equivalent tracker.

Keywords—object tracking, CNN, computer vision, video processing, bilinear interpolation, classification-based trackers

I. INTRODUCTION

Visual object tracking is a classical problem in the computer vision domain where the location of the target is estimated in every video frame. The tracking research field has remained active for a long period because of the several variations imposed on the tracking process, such as occlusion, appearance changes, illumination changes and cluttered background. It is challenging for a tracker to handle all these variations in a single framework. Therefore, numerous algorithms and schemes exist in the literature aiming to tackle the tracking challenges and improve the overall tracking performance [1]-[3].

A typical tracking system consists of two main models, a motion model and an appearance model. The motion model is employed to predict the target location in the next frame, for example using a Kalman filter [4] or a particle filter [5] to model the target motion. The motion model can also be as simple as constraining the search space to a small search window around the previous target location and assuming the target motion is small. On the other hand, the appearance model is used to represent the target and verify the predicted location of the target in every frame [6]. Appearance models can be classified into generative and discriminative methods. In generative methods, tracking is performed by searching for the region most similar to the object [6]. In discriminative methods, a classifier is used to distinguish the object from the background. In general, the appearance model can be updated online to account for target appearance variations during tracking.

Traditionally, tracking algorithms employed hand-crafted features such as pixel intensity, color and Histogram of Oriented Gradients (HOG) [7] to represent the target in either generative or discriminative appearance models. Although hand-crafted features achieve satisfactory performance in constrained environments, they are not robust to severe appearance changes [8]. Deep learning using Convolutional Neural Networks (CNNs) has recently brought a significant performance boost to various computer vision applications. Visual object tracking has been affected by this popular trend in order to overcome the tracking challenges and obtain better performance than that obtained with hand-crafted features. In pure CNN-based trackers, the appearance model is learned by a CNN and a classifier is used to label each image patch as object or background. CNN-based trackers [8]-[10] achieved state-of-the-art performance in the latest benchmarks [11], [12] even with simple motion models and no offline training. However, CNN-based trackers typically suffer from high computational loads because of the large number of candidate patches and training patches required in the tracking phase and the training phase respectively.

In this paper, we address the speed limitations of CNN-based trackers. We adapt the CNN not only as a two-label classifier (object and background), but also as a five-position classifier for the object position inside the candidate patch. This scheme achieves coarse object localization with a smaller number of candidate patches. In addition, we exploit a bilinear interpolation scheme on the CNN feature maps already extracted in the coarse localization step for two purposes: first, the fine object localization, and second, the CNN feature extraction of the training patches. The computation of the bilinear interpolation is significantly less than that of extracting a new feature map, which speeds up the required processing time. Moreover, we did not perform offline training on any tracking dataset for our tracker.
This paper is organized as follows: Section II gives an overview of CNN-based trackers and the speed bottlenecks in these systems. Our proposed schemes are presented in Section III. Section IV demonstrates the experimental results on the OTB benchmark, and finally, Section V concludes our work.

II. OVERVIEW OF CNN-BASED TRACKERS

Following the huge success of deep CNNs in image classification [13], [14] and object detection applications [15], [16], many recent works in the object tracking domain have adopted deep CNNs and achieved state-of-the-art performance. There exist different use cases of CNNs in the tracking field. References [17]-[19] employed CNNs with Discriminative Correlation Filters (DCF), where the regression models of these DCF-based trackers are trained on the feature maps extracted by the deep CNNs. References [20]-[22] adopted a Siamese structure, where two identical CNN branches are used to generate feature maps for two patches simultaneously, either from the same frame or from successive frames. The outputs of both branches are then correlated to localize the target. References [8]-[10], [23], [24] are pure CNN-based trackers, where fully-connected layers are added after generating the feature maps to classify the input patches as object or background. A softmax layer is typically used at the end to score the candidate patches, and the candidate with the highest object score is taken as the new target location. These pure CNN-based trackers achieved state-of-the-art performance in the latest benchmarks, and we focus on this type of tracker in the rest of the paper.

Fig. 1 shows a typical CNN-based tracker. In each frame, candidate patches are generated with different translations and scales sampled from a Gaussian distribution. The mean of the Gaussian distribution is the previous location and scale of the target. The deep features are extracted by the convolution layers for each patch and scored by the fully connected (fc) layers.

Fig. 1. A typical CNN-based tracker

For the training of CNN-based trackers, transfer learning is typically exploited, where the network parameters are initialized from another network pre-trained on a large-scale classification dataset such as ImageNet [25]. References [8]-[10] adopted offline training models to update the network parameters before tracking. It is difficult, however, to collect a large training dataset for visual tracking. Therefore, recent works [23], [24] dispensed with offline training steps and still achieved state-of-the-art performance. These techniques depend on increasing the number of training iterations in the initial frame, where the target location is known and accurate. On the other hand, online training is necessary to cope with the potential appearance changes of the target. It is typical to update the parameters of the fully-connected layers only and keep those of the convolution layers fixed throughout the whole tracking process, because the convolutional layers hold generic tracking information while the fully-connected layers hold target-background specific information. The short-term and long-term model updates proposed by [8] have been employed in other CNN-based trackers as well [9], [10], [23], [24]. Long-term updates are carried out at regular intervals, while short-term updates are carried out when the object score drops severely during tracking. The training data required for the online training is collected every frame, where deep features for positive and negative patches are generated and stored. The positive and negative patches have an Intersection over Union (IoU) overlap with the estimated target location larger and smaller than certain thresholds respectively. When a model update is required, the stored positive and negative feature maps are sampled randomly to update the parameters.

The main computation steps in CNN-based trackers can be categorized into candidate evaluation, collecting training data and model update. The model update is typically performed at fixed intervals and has less effect on the computation time than the candidate and training data processing. CNN-based trackers mainly suffer from slow speed because of the computation in the convolutional layers to obtain the deep features of the candidate and training patches in every frame. However, many of these computations are redundant, because the candidate and training patches are generated randomly with large potential overlaps. Hence, we propose novel schemes in the next section to mitigate the redundant computations and speed up the required processing time of CNN-based trackers.

III. PROPOSED CNN-BASED TRACKER

A. Target localization

Although CNNs typically have local max-pooling layers to make them spatially invariant to the input data, the intermediate feature maps are not actually invariant to large transformations of the input data [26]. Hence, we exploit this typical behavior of the network such that we not only classify each patch into object or background, but also classify the location of the object inside the patch. Having four classes, up, down, right and left, to represent the target location inside the patch, we can localize the target with a smaller number of candidates. In addition, we do not generate random candidate patches to cover the Region of Interest (ROI) as in previous works; instead, we generate fixed-spacing patches to cover the whole ROI, as shown in Fig. 2. This scheme prevents the potential redundant computations of random patch generation and reduces the risk of missing the target.
We also propose to forward the whole ROI through the convolution layers, instead of forwarding each patch separately, to save redundant computations. This idea is similar to what was proposed in [16], [27] in the object detection field, where the whole image is forwarded through the network instead of the proposal regions.

Fig. 2. Random patches and fixed-spacing patches

It is common in CNN-based trackers that the target localization is carried out by taking the mean location of the candidate patches with the top object scores. In our scheme, instead, the patches which are classified as objects are first moved based on the localization network. The patch with the highest overlap with the other object patches is then selected as input to a fine localization step, where we utilize bilinear interpolation of the feature maps. Bilinear interpolation was first proposed by [26] for the implementation of a spatial transformer network, and it was later employed by [28] in the ROI align scheme for object detection applications. Let us assume the target is represented by a 3×3×d feature map, as shown in Fig. 3 (a), where d is the feature depth, and that we extract feature maps for a region larger than the target size such that we get a 5×5×d feature map, as shown in Fig. 3 (b). We would then have nine 3×3 grids in total, each displaced by dx and/or dy from its neighbors, where the values of dx and dy depend on the network structure. Accordingly, we can obtain the feature maps of all image patches with displacements ranging from 0 to dx or dy, measured from the center, by bilinear interpolation, without forwarding these image patches through the convolution layers. Any point value is calculated by bilinear interpolation from the four nearby points in the feature maps, such as point * in Fig. 3 (c).

Fig. 3. Interpolation of feature maps
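To make this concrete, the following NumPy sketch reproduces the interpolation for a 5×5×d map (our implementation is in Matlab/MatConvNet; the function name and the cell-displacement interface here are illustrative):

import numpy as np

def interp_window(F, dy, dx):
    # F: (d, 5, 5) feature maps of the enlarged region.
    # (dy, dx): requested displacement in feature-map cells, 0 <= dy, dx < 2.
    # Returns the (d, 3, 3) feature map of the patch displaced by (dy, dx)
    # from the top-left 3x3 grid; every point is bilinearly interpolated
    # from its four integer-grid neighbors (cf. point * in Fig. 3 (c)).
    iy, ix = int(np.floor(dy)), int(np.floor(dx))
    fy, fx = dy - iy, dx - ix
    w00 = F[:, iy:iy+3, ix:ix+3]        # four neighboring 3x3 grids
    w01 = F[:, iy:iy+3, ix+1:ix+4]
    w10 = F[:, iy+1:iy+4, ix:ix+3]
    w11 = F[:, iy+1:iy+4, ix+1:ix+4]
    return ((1 - fy) * (1 - fx) * w00 + (1 - fy) * fx * w01
            + fy * (1 - fx) * w10 + fy * fx * w11)

For example, interp_window(F, 0.25, 1.6) approximates the features of a patch shifted by 0.25 and 1.6 cells vertically and horizontally, at the cost of a few multiply-adds per feature value instead of a full forward pass.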
B. Network training

We reuse the feature maps obtained in the localization phase to extract the feature maps of the positive and negative training patches by applying bilinear interpolation. The positive patches are actually sub-divided into localization patches. Although we add more classification classes to the network for localization, the required computation does not increase much, because the localization patches are not forwarded through the whole stack of convolutional layers; bilinear interpolation is exploited instead.

C. Scale variation

Reference [8] handled scale variation by generating training and candidate patches with random scales drawn from a Gaussian distribution and forwarding these patches through the whole network to obtain the feature maps. In our proposed scheme, however, we extract feature maps at three fixed scales only: {1, max_scale_up, max_scale_down}. We then obtain the feature map of any required scale in that range, for either a candidate or a training patch, by applying linear interpolation on two scales. Hence, instead of forwarding image patches generated randomly in the spatial and scale domains through the convolution layers, we extract feature maps for a larger image patch at three fixed scales, then perform bilinear interpolation to obtain the feature map at the required displacement and linear interpolation to obtain the feature map at the required scale. Fig. 4 illustrates our scheme of obtaining feature maps of image patches at different scales.

Fig. 4. Interpolation of fixed-scale feature maps
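A corresponding sketch for the scale domain, assuming the two fixed-scale maps bracket the requested scale; the log-domain weighting is our illustrative choice, since the scheme only requires a linear interpolation on two scales:

import numpy as np

def interp_scale(F_lo, F_hi, s_lo, s_hi, s):
    # F_lo, F_hi: feature maps of equal spatial size extracted at the two
    # fixed scales s_lo < s_hi that bracket the requested scale s.
    t = (np.log(s) - np.log(s_lo)) / (np.log(s_hi) - np.log(s_lo))
    return (1.0 - t) * F_lo + t * F_hi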
IV. IMPLEMENTATION DETAILS

A. Network structure

We start with the MDNet_N implementation as a baseline for our work. MDNet_N is the same as MDNet [8] but without offline training and bounding box regression. The parameters of the convolutional layers (conv1-3) are initialized from the VGG-M [29] model, and the fully connected layers are initialized with random values. In [8], the object patch (of size h×w) is cropped and padded to the network input size, 107×107, such that this fixed size corresponds to an image patch of (107÷75)×(h×w). The spatial size of the feature maps generated by conv3 is 3×3 for a network input of 107×107. Our network, shown in Fig. 5, is similar to MDNet, but we add fc7-9 as a localization network and allow different input sizes to get feature maps of sizes 3×3, 5×5, 7×7, etc. when needed.
The localization layer classifies the positive patches into five classes based on the position of the object inside the patch (up, down, right, left and middle).

Fig. 5. Proposed network structure
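As an illustration only, a rule of the kind that could assign these five labels from the object's displacement inside a patch is sketched below; the threshold and the sign convention are assumptions, not the paper's specification (the actual training samples are drawn from five class-specific Gaussian distributions, see Section IV.B):

def loc_label(dx, dy, thr):
    # (dx, dy): object displacement from the patch center; thr: assumed
    # margin below which the object counts as centered.
    if abs(dx) <= thr and abs(dy) <= thr:
        return 'middle'
    if abs(dx) >= abs(dy):
        return 'right' if dx > 0 else 'left'
    return 'down' if dy > 0 else 'up'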
B. Initial frame training

In order to generate training data for the object and localization layers in the initial frame, we generate feature maps for an input of size 139×139 at three fixed scales: 1, 1.2 and 1.2⁻¹. The output of conv3 would then be 5×5 at the three fixed scales. The initial object is actually represented by the inner 3×3 feature maps at scale 1. Accordingly, we can exploit bilinear interpolation and generate feature maps for any patch with a displacement ranging from 0 to (16÷75)×w and (16÷75)×h in the x and y directions respectively, and with different scales ranging from 1.2⁻¹ to 1.2. The object training samples are generated from a Gaussian distribution in the same way as in MDNet_N, such that the IoU with the initial target location is larger than 0.7. The localization training samples are generated from five Gaussian distributions, one per localization class, and their IoU should be larger than 0.7 as well. To generate training data for the background in the initial frame, we divide the background training data into two types, close and far samples. The close samples are those close to the initial target location; hence, we can apply the same interpolation scheme used for the object and localization training samples. For the far background samples, we generate feature maps as normal by forwarding the samples through all the convolutional layers. All background training samples should have an IoU less than 0.5 with the initial target. Our network is trained by Stochastic Gradient Descent (SGD) with mini-batch sizes of 128 and 65 for fc4-6 and fc7-9 respectively, and 90 iterations in the initial frame.
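The IoU test that accepts object and localization samples (IoU > 0.7) and bounds the background samples (IoU < 0.5) is the standard rectangle overlap, sketched here for completeness:

def iou(a, b):
    # Intersection over Union of two boxes given as (x, y, w, h).
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0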
C. Object tracking

We forward the whole ROI of size (4w×4h), centered on the previous target location, through the convolution layers. We crop this ROI to 299×299 before entering the network and accordingly obtain a 15×15 conv3 feature map. As the object is represented by a 3×3 conv3 feature map, we obtain 169 feature maps of spatial size 3×3. These 169 feature maps represent image patches displaced from the center by [k × (16÷75) × w] and [k × (16÷75) × h] in the x and y directions respectively, where k is an integer in [-5:5]. The object score of each 3×3 feature map is checked, and if it is larger than 0.5, the new location of the equivalent patch is obtained based on the localization network. The patch with the highest overlap with the other object patches is chosen for the next fine localization step.
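The coarse step thus reduces to slicing and scoring 169 overlapping 3×3 windows of the single ROI feature map. A NumPy sketch under the geometry above, where the scoring callable stands in for the fc4-6 object head (an assumed interface):

import numpy as np

def coarse_candidates(roi_feat, object_score, thr=0.5):
    # roi_feat: (d, 15, 15) conv3 feature map of the whole cropped ROI.
    # object_score: maps a (d, 3, 3) window to an object probability.
    # Returns the cell offsets (i, j) of the windows scored as objects;
    # one cell step corresponds to (16/75)*h and (16/75)*w in the image.
    d, H, W = roi_feat.shape
    hits = []
    for i in range(H - 2):              # 13 x 13 = 169 windows
        for j in range(W - 2):
            if object_score(roi_feat[:, i:i+3, j:j+3]) > thr:
                hits.append((i, j))
    return hits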
In the fine localization step, we need to find a finer location and an updated scale for the object. We calculate new 5×5 feature maps centered on the coarse location at two scales, 1.05 and 1.05⁻¹. We generate 100 fine samples displaced by fixed values in the x and y directions and with different scales drawn from a Gaussian distribution. The feature maps of these fine samples are calculated by bilinear interpolation in the spatial and scale domains. Then, we check the object scores of the fine samples and average the three samples with the highest object scores.

D. Online network update

We adopt the long-term and short-term network update schemes proposed in [8]. A long-term update is carried out every 10 frames using the collected training samples, while a short-term update is carried out when the object score drops below 0.5. We generate training samples for the object, background and localization layers in each frame whose object score is larger than 0.5, similar to [8]. However, we reuse the feature maps generated in the tracking stage to obtain the feature maps of the training samples by bilinear interpolation in the spatial and scale domains. In addition, we employ hard minibatch mining for the negative training samples, similar to [8], where the 96 negative samples with the highest positive scores are selected out of 1024 negative samples. The numbers of training samples for the object, localization and background layers are 30, 30 and 100 per frame respectively, and 10 iterations are used for the online update.
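A minimal sketch of this hard minibatch mining step (NumPy; names are illustrative):

import numpy as np

def hard_negative_indices(pos_scores, k=96):
    # pos_scores: positive-class scores of the 1024 candidate negatives.
    # The k negatives scored most confidently as 'object' are the hardest
    # examples and are kept for the update minibatch.
    return np.argsort(np.asarray(pos_scores))[::-1][:k]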
E. Experimental results

We evaluate our tracker on the Object Tracking Benchmark (OTB-100) [11], which contains 100 fully annotated videos. Our tracker is implemented in Matlab using MatConvNet and runs on an Intel i7-3520M CPU system. We ran MDNET_N on the same system as a reference. The tracking performance is measured by performing a One-Pass Evaluation (OPE) on two metrics: the center location error, and the IoU between the estimated target location and the ground truth. Fig. 6 shows the precision plot and the success plot of our tracker on the 100 videos of [11] compared with MDNET_N. The precision plot shows the percentage of frames whose estimated target location is within the error threshold (x-axis) of the ground truth, while the success plot shows the percentage of frames whose IoU is larger than the overlap threshold (x-axis). The legend values in the precision and success plots are the precision score at an error threshold of 20 pixels and the area under the curve (AUC) of the success plot respectively. It can be seen from Fig. 6 that our tracker, which is based on an Interpolation and Localization Network (ILNET), has the same AUC as MDNET_N and slightly lower precision.
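Both OPE curves follow directly from the per-frame center errors and IoU values; a NumPy sketch with commonly used threshold ranges (the exact ranges are an assumption), where precision[20] is the legend's precision score:

import numpy as np

def ope_curves(center_err, ious):
    # center_err: per-frame center location error in pixels.
    # ious: per-frame IoU between estimated and ground-truth boxes.
    center_err, ious = np.asarray(center_err), np.asarray(ious)
    err_thr = np.arange(0, 51)                       # precision x-axis
    precision = np.array([(center_err <= t).mean() for t in err_thr])
    iou_thr = np.linspace(0.0, 1.0, 21)              # success x-axis
    success = np.array([(ious > t).mean() for t in iou_thr])
    auc = np.trapz(success, iou_thr)                 # legend AUC value
    return precision, success, auc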
Fig. 7 demonstrates the effectiveness of our tracker in handling all kinds of tracking challenges. It can be seen that our tracker achieves almost the same or better performance compared to the baseline tracker, MDNET_N. Table I shows the breakdown of the processing time savings achieved by our tracker. Both the tracking and the training speeds have increased despite adding a localization network and increasing the number of training iterations. This speed-up is due to the bilinear interpolation on the feature maps and the use of fixed-spacing candidates.

TABLE I. AVERAGE COMPUTATION TIME IN SECONDS PER FRAME

                                             MDNET_N* [8]   Our work (ILNET)   Speed-up factor
Candidate processing                         3.4            0.36               9.4x
Training processing                          3.3            0.21               15.7x
Network update (@10th frame for long-term)   2.3            2.3                1x
First frame training                         90             52                 1.72x
Frame processing without first frame         7              0.8                8.8x

* MDNET_N: MDNET without offline training and bounding box regression