Salient point region covariance descriptor for target tracking

Serdar Cakir
TÜBİTAK BİLGEM İLTAREN, Şehit Mu. Yzb. İlhan Tan Kışlası, 2432. cad., 2489. sok., TR-06800, Ümitköy, Ankara, Turkey
and Bilkent University, Department of Electrical and Electronics Engineering, TR-06800, Ankara, Turkey
E-mail: serdar.cakir@tubitak.gov.tr

Tayfun Aytaç
Alper Yildirim
TÜBİTAK BİLGEM İLTAREN, Şehit Mu. Yzb. İlhan Tan Kışlası, 2432. cad., 2489. sok., TR-06800, Ümitköy, Ankara, Turkey

Soosan Beheshti
Ryerson University, Department of Electrical and Computer Engineering, Toronto, Ontario, Canada

Ö. Nezih Gerek
Anadolu University, Department of Electrical and Electronics Engineering, İki Eylül Kampüsü, TR-26470, Eskişehir, Turkey

A. Enis Cetin
Bilkent University, Department of Electrical and Electronics Engineering, TR-06800, Ankara, Turkey

Optical Engineering 52(2), 027207 (February 2013)

Abstract. Features extracted at salient points are used to construct a region covariance descriptor (RCD) for target tracking. In the classical approach, the RCD is computed by using the features at each pixel location, which increases the computational cost in many cases. This approach is redundant because image statistics do not change significantly between neighboring image pixels. Furthermore, this redundancy may decrease tracking accuracy when tracking large targets, because the statistics of flat regions dominate the region covariance matrix. In the proposed approach, salient points are extracted via Shi and Tomasi's minimum eigenvalue method over a Hessian matrix, and the RCD features extracted only at these salient points are used in target tracking. Experimental results indicate that the salient point RCD scheme provides comparable and even better tracking results than the classical RCD-based approach and scale-invariant feature transform- and speeded-up robust features-based trackers, while providing a computationally more efficient structure. © 2013 Society of Photo-Optical Instrumentation Engineers (SPIE) [DOI: 10.1117/1.OE.52.2.027207]

Subject terms: salient points; feature selection; feature extraction; region covariance descriptor; covariance tracker.

Paper 121317 received Sep. 12, 2012; revised manuscript received Jan. 24, 2013; accepted for publication Jan. 25, 2013; published online Feb. 22, 2013.
1 Introduction
In target tracking, it is important to extract features from the target region that are highly discriminative and invariant to scale and rotation. Features should also be robust to noise and partially invariant to affine transformations, intensity changes, and occlusion.1,2 Another issue in target tracking is to estimate and predict the target location in subsequent frames based on the observations.3 A fundamentally important requirement comes from video processing: to process video frames under real-time constraints, features must be extracted in a computationally efficient manner.4 Features may be the color, raw pixel intensities or statistics extracted from these values, edges, displacement vectors in optic flow-based approaches, textures, and their combinations, depending on the target model (appearance and motion) and imaging
system. A detailed evaluation of point-of-interest detectors
and feature descriptors for visual tracking can be found in
Refs. 5 and 6.
Features obtained by scale-invariant feature transform
(SIFT)7 are independent of scale, rotation, and intensity
change and robust against affine transformation. As a feature
detector, SIFT uses difference of Gaussians. SIFT is widely
used in applications for target detection,8,9 tracking,9,10 classification,11 image matching,12–14 and constructing mosaic
images.15 When compared to other point-of-interest detectors such as Moravec16 and Harris,17 SIFT features are
more robust to background clutter, noise, and occlusion.
Unfortunately, despite the distinctive properties of SIFT, the feature extraction process is time-consuming, and the method is rarely used in real-time applications. Inspired by previous feature descriptor schemes, the authors of the speeded-up robust features (SURF) descriptor claimed that the SURF scheme approximates, and even outperforms, previously published techniques in a more computationally
efficient manner.18 In SURF, the detector is based on the
efficient computation of a Hessian matrix at different scales.
There are other feature descriptors, such as features from accelerated segment test,19 and keypoint classification with randomized trees20 and ferns.21 A detailed performance comparison of the above-mentioned methods over a common database is provided in Ref. 6.
The covariance descriptor proposed in Ref. 22 provides an efficient signature set in object detection and classification problems, and the descriptor has been successfully used in applications such as indoor and outdoor target tracking,23 fire and flame detection,24 sea-surface and aerial target tracking,25 pedestrian detection,26 and face recognition.27
In our earlier work,25 we proposed an offline feature
selection and evaluation mechanism for robust visual
tracking of sea-surface and aerial targets based on region
covariance descriptor (RCD). In the feature extraction phase,
features were constructed via the RCD, and feature sets
resulting in the best target/background classification were
used for tracking. The same feature set is used in Ref. 28
for the performance comparison of classifiers in maritime applications. The previously proposed target tracking scheme25 outperformed correlation,29 Kanade-Lucas-Tomasi (KLT),30–32 and SIFT-based7 trackers in both air and sea surveillance scenarios. In that work, gradient-based features, together with the pixel locations and intensity values, were observed to be the most powerful features. However, the proposed tracking scheme needs to be significantly accelerated for real-time applications. The main reason for the high computational cost is the need to extract features from all pixels in the target region, together with the accompanying rules of the target update strategy, which takes scale changes in different search regions into account. Motivated by these
observations, a computationally efficient technique is proposed for the calculation of the RCD. This alternative
descriptor is named salient point region covariance descriptor (SPRCD), and the descriptor provides a computationally
efficient approach without losing the classical RCD’s representative power. We compared the performance of the
SPRCD with the classical RCD-based approach25 and
SIFT-7 and SURF-based18 trackers.
In the literature, various researchers have attempted to
develop algorithms in order to construct RCD in an efficient
way.22–34 The “integral image” concept is proposed in
Ref. 22 to construct RCD in a computationally efficient
manner. The region codifference method33,34 enables further
reduction in the computational complexity of the RCD by
replacing the multiplication operators with an addition/
subtraction-based operator. The covariance descriptor within
visually salient regions is computed in Ref. 35 for duplicated image and video copy detection. In that paper, the authors use a maximization type of information-theoretic approach to calculate visual saliency maps by employing a data-independent Hadamard transform. Then, they calculate the RCD using the features extracted from local windows centered at the pixels whose saliency scores exceed a predefined threshold. In Ref. 36, subsets of the image feature space are used together with the means of the image features in a computationally efficient manner for the human detection problem. In Ref. 37, the characteristics of
the eigenvalues of the weighted covariance matrix are used
for the position correction task. The weighted covariance
matrix proposed in that work is based on the pixel-wise
intensity statistics of the reference image and the scene
image. The eigenvalues of this matrix are analyzed to determine whether the pixel contains detailed information.
Although this technique is not an RCD type of scheme, the
local complexity is taken into account to relate the local
information with target characteristics. To the best of our
knowledge, no attempts for computing RCD at salient points
have been made previously for target tracking purposes. In
this paper, we propose the utilization of salient points and the
RCD approach together to develop a computationally efficient descriptor scheme for target tracking. We investigate
the relation between the RCDs computed at each and every
pixel and at only salient points and observe that RCD computation can be decreased when the pixel characteristics are
taken into account before covariance computation, i.e., the
autocorrelation of the pixel with its neighborhood.
The paper is organized as follows: In Sec. 2, the SPRCD is briefly described. Feature selection for the descriptor calculation is explained in Sec. 3. In Sec. 4, the target tracking framework is briefly described. Experimental work and results, including performance comparisons over different performance measures and target loss indications, are provided in Sec. 5. Concluding remarks are presented and directions for future research are provided in Sec. 6.
2 Salient Point Region Covariance Descriptor
The RCD is widely used in various image representation
problems due to its low computational complexity and
robustness to partial occlusion. It also enables one to add
or remove features in a simple manner to adapt the tracker
for different target types and imaging systems. However, the
cost of computing RCD significantly increases as the image
region used for the descriptor calculation grows. This is
especially the case when large targets need to be tracked.
In order to determine an upper limit to the descriptor computation cost and to satisfy the real-time requirements, the
SPRCDs are proposed.
The calculation of the classical RCD starts by stacking the feature matrices ($f_i$, $i = 1, 2, \ldots, D$) extracted from an H × W dimensional image in order to construct an H × W × D dimensional feature tensor, as given in Fig. 1. A detailed discussion of the extraction of the feature matrices ($f_i$'s) is provided in Sec. 3. In the feature tensor, the elements in each layer with the index (m, n) are collected to construct the feature vector $\underline{S}_t$ [Eq. (1)]. In the classical RCD, a total of H × W feature vectors $\underline{S}_t$ are constructed:

$$\underline{S}_t = [\, f_1(m, n) \;\; f_2(m, n) \;\; \cdots \;\; f_D(m, n) \,], \tag{1}$$

where $m = 1, 2, \ldots, W$, $n = 1, 2, \ldots, H$, $t = 1, 2, \ldots, k$, and $k = H \times W$.
The computation procedure of the SPRCD is the same as the classical RCD computation22,25 up to this point. The main and crucial difference in the calculation of the SPRCD is that only the feature vectors corresponding to salient point locations are used instead of the feature vectors at all pixel positions. We tried two different point extractors in the experiments, namely the Harris corner detector17 and the Shi-Tomasi32 detector. The covariance descriptors calculated over the corners extracted by the Harris method did not provide satisfactory tracking performance, especially in scenarios where the target template changes rapidly. Therefore, the salient points are determined by the minimum eigenvalue method introduced by Shi and Tomasi.

Fig. 1 The illustration of determining salient points in the feature tensor.
In this method, the corner points are determined by analyzing the eigenvalues of the Hessian matrix (H). The method relates the image point characteristics to the values of the two eigenvalues of the matrix H. At this point, instead of recalculating the Hessian matrix directly, the available features used in the SPRCD calculation are gathered in order to construct the Hessian matrix. In this way, no additional effort is spent to calculate the Hessian matrix. As a reminder, the structure of the Hessian matrix is provided in Eq. (2):

$$H = \begin{bmatrix} \dfrac{\partial^2 I}{\partial x^2} & \dfrac{\partial^2 I}{\partial x \partial y} \\[2ex] \dfrac{\partial^2 I}{\partial y \partial x} & \dfrac{\partial^2 I}{\partial y^2} \end{bmatrix}, \tag{2}$$

where $\frac{\partial^2 I}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial I}{\partial x}\right)$ and $\frac{\partial^2 I}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial I}{\partial y}\right)$ are the second derivatives along the horizontal and vertical axes, respectively, and $\frac{\partial^2 I}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{\partial I}{\partial y}\right) = \frac{\partial}{\partial y}\left(\frac{\partial I}{\partial x}\right)$ is the mixed derivative along the horizontal and vertical axes.
Two small eigenvalues of the matrix H indicate a roughly constant region, whereas two large eigenvalues indicate a "busy" structure. Such busy regions can correspond to noise, salt-and-pepper texture, or any pattern that can be tracked reliably.32 Therefore, a thresholding type of approach on the minimum eigenvalue of the matrix was developed in Ref. 32 to determine the representative points for tracking.
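As a concrete illustration, the salient point extraction step can be written as follows. This is a minimal sketch, not the authors' implementation: it uses OpenCV's goodFeaturesToTrack(), which implements the Shi-Tomasi minimum eigenvalue criterion,32 and the parameter values shown are illustrative assumptions.

```python
# Minimal sketch of Shi-Tomasi salient point extraction (not the authors'
# code). cv2.goodFeaturesToTrack() applies the minimum eigenvalue criterion
# of Ref. 32; the quality and min_dist values below are illustrative.
import cv2
import numpy as np

def extract_salient_points(gray, max_points=25, quality=0.01, min_dist=3):
    """Return (row, col) coordinates of up to max_points corners in a
    single-channel image, selected by the minimum eigenvalue method."""
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_points,
                                      qualityLevel=quality,
                                      minDistance=min_dist,
                                      useHarrisDetector=False)
    if corners is None:                       # flat region: no corners found
        return np.empty((0, 2), dtype=int)
    pts = corners.reshape(-1, 2)              # rows of (x, y) coordinates
    return pts[:, ::-1].astype(int)           # reorder as (row, col)
```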
The main idea behind the descriptor calculation approach using salient points is finding the relational variances between the features located at important corners instead of considering the variances of features calculated at each and every image pixel location. In this way, a representative and computationally efficient feature descriptor is developed. Moreover, the proposed descriptor scheme is not affected by the partial occlusion that causes the KLT tracker to fail in target-tracking scenarios.25 Since the proposed descriptor scheme depends on the spatial relations of the features calculated at corner points rather than a simple corner matching type of approach, it is not affected by the destructive effects of partial occlusion.25 The illustration of utilizing the feature vectors corresponding to the salient points is given in Fig. 1. In Fig. 1, instead of displaying a generic implementation, the depth of the feature tensor is selected as five in order to obtain a reasonable visualization. Suppose that there exist ε salient points extracted within a given region; then the covariance descriptor calculation procedure can be rewritten as
$$M_{SPR}(p, q) = \frac{1}{\varepsilon - 1}\left[\sum_{t=1}^{\varepsilon} \underline{S}_t(p)\,\underline{S}_t(q) - \frac{1}{\varepsilon}\sum_{t=1}^{\varepsilon} \underline{S}_t(p) \sum_{t=1}^{\varepsilon} \underline{S}_t(q)\right], \tag{3}$$
where $\underline{S}_t$, $t = 1, 2, \ldots, \varepsilon$, denote the feature vectors evaluated only at salient points. Since ε is naturally less than the number of pixels in the target region (k), the SPRCD is computationally more efficient than the classical region covariance method. Depending on the scenario, the number of salient points (ε) may vary from tens to hundreds. An upper limit ϖ for ε is determined via extensive experimental work using the relation presented in Eq. (4):

$$\varepsilon = \begin{cases} \varepsilon & \text{if } \varepsilon < \varpi \\ \varpi & \text{if } \varepsilon \geq \varpi \end{cases}. \tag{4}$$
This strategy prevents the descriptor computation cost from growing without bound. In the experiments, the target region is represented with an SPRCD calculated using at most ϖ = 25 salient points, which provides satisfactory tracking accuracy. Although the upper limit ϖ is selected as 25 after large-scale experiments, it can further be adjusted adaptively by defining a certain ratio between ϖ and the number of image pixels, k.
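A minimal NumPy sketch of Eqs. (3) and (4) is given below. It assumes a precomputed H × W × D feature tensor and a list of salient point coordinates; the function and variable names are ours, not the authors'.

```python
# Minimal sketch of Eqs. (3) and (4): the SPRCD is the sample covariance of
# the D-dimensional feature vectors taken only at the salient points, with
# the point count capped at varpi. Helper names are illustrative.
import numpy as np

def sprcd(feature_tensor, salient_points, varpi=25):
    """feature_tensor: H x W x D array; salient_points: list of (row, col)."""
    pts = list(salient_points)[:varpi]      # Eq. (4): keep at most varpi points
    S = np.stack([feature_tensor[r, c] for r, c in pts])   # eps x D matrix
    eps = S.shape[0]
    mean = S.sum(axis=0) / eps
    # Eq. (3): sum of products minus the product of sums, divided by eps - 1
    return (S.T @ S - eps * np.outer(mean, mean)) / (eps - 1)
```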
The RCD can be calculated using the "integral image" concept22 rather than the classical formulation [Eq. (3)]. The integral image method introduces a significant reduction in the computational complexity of the RCD. The SPRCD feature extraction scheme proposed herein is implemented over the integral image concept rather than the classical covariance computation formulation. In this way, a further reduction in the computational complexity is achieved.
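For completeness, the following is a sketch of the integral image construction of Ref. 22 under our own naming: one cumulative sum of the feature layers and one of their pairwise products let the covariance of any axis-aligned rectangle be read off in O(D²) time, independent of the rectangle's area.

```python
# Sketch of covariance computation via integral images (after Ref. 22; our
# own code). P accumulates the feature vectors, Q their outer products.
import numpy as np

def build_integral_images(F):
    """F: H x W x D feature tensor. Returns cumulative sums P (H x W x D)
    and Q (H x W x D x D)."""
    P = F.cumsum(axis=0).cumsum(axis=1)
    Q = np.einsum('hwd,hwe->hwde', F, F).cumsum(axis=0).cumsum(axis=1)
    return P, Q

def region_covariance(P, Q, r0, c0, r1, c1):
    """Covariance of the inclusive rectangle with corners (r0, c0), (r1, c1)."""
    def rect_sum(A):
        s = A[r1, c1].copy()
        if r0 > 0:
            s = s - A[r0 - 1, c1]
        if c0 > 0:
            s = s - A[r1, c0 - 1]
        if r0 > 0 and c0 > 0:
            s = s + A[r0 - 1, c0 - 1]
        return s
    n = (r1 - r0 + 1) * (c1 - c0 + 1)
    p, q = rect_sum(P), rect_sum(Q)
    return (q - np.outer(p, p) / n) / (n - 1)
```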
In the next section, a brief discussion about the feature set
used in the descriptor computation is provided.
3 Feature Selection
The feature set used in SPRCD calculation is determined
by using the experimental results obtained in our previous
work.25

Fig. 2 The flow diagram and TT update strategy of the proposed SPRCD tracker.

The gradient-based feature set (I, x, y, GM, GO), which provided plausible and robust tracking results, is used
in the feature extraction phase of the proposed descriptor
scheme. Here, I denotes the image intensity, x and y denote
the horizontal and vertical pixel locations, and GM and GO
stand for the gradient magnitude and orientation, respectively. The GM and GO features are calculated using the first partial derivatives along the horizontal axis ($\partial_{1,x} = \frac{\partial I}{\partial x}$) and the vertical axis ($\partial_{1,y} = \frac{\partial I}{\partial y}$) as in Eq. (5). Note that the first partial derivatives $\partial_{1,x}$ and $\partial_{1,y}$ are calculated using the filter $[-1, 0, 1]$:

$$GM = \sqrt{\partial_{1,x}^2 + \partial_{1,y}^2}, \qquad GO = \tan^{-1}\!\left(\frac{\partial_{1,y}}{\partial_{1,x}}\right). \tag{5}$$
The feature set (I, x, y, GM, GO) is illustrated in Fig. 1, where f_1, f_2, f_3, f_4, and f_5 denote the features I, x, y, GM, and GO, respectively. All of the features used in the descriptor computations are normalized to the [0, 1] range.
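As an illustration, a minimal sketch of this feature extraction step is given below (our own code, not the authors'). Note that np.gradient applies the central-difference filter, i.e., [−1, 0, 1] up to a factor of 1/2 that is absorbed by the normalization.

```python
# Minimal sketch of the (I, x, y, GM, GO) feature tensor of Sec. 3 (our own
# code); every layer is normalized to the [0, 1] range as in the text.
import numpy as np

def build_feature_tensor(I):
    """I: H x W gray-level image as float. Returns an H x W x 5 tensor."""
    H, W = I.shape
    dx = np.gradient(I, axis=1)       # first partial along the horizontal axis
    dy = np.gradient(I, axis=0)       # first partial along the vertical axis
    gm = np.sqrt(dx ** 2 + dy ** 2)   # gradient magnitude, Eq. (5)
    go = np.arctan2(dy, dx)           # gradient orientation, Eq. (5)
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    layers = [I, xs, ys, gm, go]      # f_1, ..., f_5
    normed = [(f - f.min()) / (f.max() - f.min() + 1e-12) for f in layers]
    return np.stack(normed, axis=2)
```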
4 SPRCD-Based Tracker
The general framework of the proposed SPRCD-based
tracking scheme is presented in Fig. 2. The proposed tracker
is initialized as soon as the target region is determined. After
initialization, the determined target gate and the next image
frame are passed through a preprocessing step. The preprocessing step includes deinterlacing and gray-scale conversion for visual band images. In surveillance applications, the target region is generally determined automatically or manually. In our case, the target region is selected manually by an operator. As soon as the target template (TT) is determined, the target is searched within a search region (SR). The SR is taken as the smallest rectangle surrounding the TT-sized rectangles located at each pixel position within a τ-pixel neighborhood of the target center. As a result, a (2τ + H) × (2τ + W) dimensional SR is obtained. The illustration of the SR is given in Fig. 3.
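The SR geometry reduces to a few lines of code; the following helper is our own sketch with illustrative names:

```python
# Sketch of the SR construction: TT-sized rectangles placed at every pixel
# within a tau-pixel neighborhood of the target center are enclosed by a
# (2*tau + H) x (2*tau + W) rectangle, clipped to the frame.
def search_region(center, tt_size, tau, frame_size):
    """center: (row, col) of the target center; tt_size: (H, W);
    frame_size: (rows, cols). Returns inclusive (top, left, bottom, right)."""
    (cr, cc), (H, W) = center, tt_size
    top    = max(cr - H // 2 - tau, 0)
    left   = max(cc - W // 2 - tau, 0)
    bottom = min(cr + H // 2 + tau, frame_size[0] - 1)
    right  = min(cc + W // 2 + tau, frame_size[1] - 1)
    return top, left, bottom, right
```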
Fig. 3 The illustration of the search region SR. O(x, y) is the target center, and W and H are the target width and height, respectively.

After the determination of the SR, the SPRCDs belonging to the TT and to the TT-sized subregions within the SR are computed. A descriptor-matching type of approach is performed in order to locate the target in the current frame. In Ref. 22, the descriptor-matching process is carried out by the eigenvalue-based metric defined in Ref. 38. However, in this study, we prefer to use a computationally efficient metric based on the normalized L1 distance34 presented in Eq. (6):
$$\rho(\hat{M}_{TT}, \hat{M}_R) = \sum_{i=1}^{D} \sum_{j=1}^{D} \frac{\left| \hat{M}_{TT}(i, j) - \hat{M}_R(i, j) \right|}{\hat{M}_{TT}(i, i) + \hat{M}_R(i, i)}, \tag{6}$$

where $\hat{M}_{TT}$ and $\hat{M}_R$ are the SPRCDs extracted from the TT and from the region used for comparison ($M_R$), respectively.
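In code, Eq. (6) is a direct element-wise computation; the following NumPy transcription (our own) assumes two D × D descriptor matrices:

```python
# Direct transcription of the normalized L1 metric in Eq. (6).
import numpy as np

def descriptor_distance(M_tt, M_r):
    """rho(M_tt, M_r): sum of absolute differences, each entry normalized by
    the corresponding diagonal entries M_tt(i, i) + M_r(i, i)."""
    norm = (np.diag(M_tt) + np.diag(M_r))[:, None]
    return float(np.sum(np.abs(M_tt - M_r) / norm))
```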
As visualized in Fig. 2, the tracker algorithm checks the value of ρ to decide which search mode is used in the next video frame. If ρ is larger than a predefined threshold e_0, the target is searched at different scales (indicating camera zoom or target approach/departure). In that case, the SR approach (illustrated in Fig. 3) is modified by increasing or decreasing the target template size rather than fixing it. In this way, rectangles of different scales centered at each pixel of the SR are taken as candidate regions. The dimensions of the differently scaled rectangles are determined by multiplying the dimensions of the target template of the previous frame by the scale coefficient κ. The tracker contains two shrinkage (κ = {0.8, 0.9}) and two growth (κ = {1.1, 1.2}) scale coefficients. In this way, the target is searched within the SR using four different scales, considering target dimension changes in both positive and negative directions. This approach is similar to the Monte Carlo-based target update strategy presented in Ref. 39. The candidate region resulting in the smallest ρ value with respect to the current TT is selected as M_{R,Best}, and the TT is updated using M_{R,Best}.
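The multi-scale search can be outlined as follows. This is a schematic sketch, not the authors' implementation: compute_sprcd() and windows_in_sr() are assumed helpers, and descriptor_distance() is the metric sketched after Eq. (6).

```python
# Schematic multi-scale candidate search: score TT-sized windows at the four
# scale coefficients inside the SR and keep the one with the smallest rho.
SCALES = (0.8, 0.9, 1.1, 1.2)   # two shrinkage and two growth coefficients

def best_candidate(tt_descriptor, sr_image, tt_size, compute_sprcd, windows_in_sr):
    best_rho, best_win, best_kappa = float('inf'), None, None
    for kappa in SCALES:
        size = (round(tt_size[0] * kappa), round(tt_size[1] * kappa))
        for window in windows_in_sr(sr_image, size):   # candidate rectangles
            rho = descriptor_distance(tt_descriptor, compute_sprcd(window))
            if rho < best_rho:
                best_rho, best_win, best_kappa = rho, window, kappa
    return best_rho, best_win, best_kappa              # best_win is M_R_Best
```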
In case of a scale change, unlike the classical RCD computation, the salient points must be relocated in the scaled TTs. The relocation of salient points is performed using the ratio of the differences between the salient point locations and the location of the center of the TT. The illustration and formulation of the salient point relocation are given in Fig. 4 and Eq. (7), respectively:

$$(p, q) \rightarrow (\tilde{p}, \tilde{q}), \qquad \tilde{p} = \tilde{X}_c - \operatorname{sgn}(X_c - p)\,|X_c - p|\,\kappa, \qquad \tilde{q} = \tilde{Y}_c - \operatorname{sgn}(Y_c - q)\,|Y_c - q|\,\kappa. \tag{7}$$

Here, $(p, q)$ and $(\tilde{p}, \tilde{q})$ denote the locations of a certain salient point and the corresponding relocated salient point, respectively. Also note that $(X_c, Y_c)$ and $(\tilde{X}_c, \tilde{Y}_c)$ correspond to the center locations of the TT and the scaled TT.
Fig. 4 The illustration of the relocation of the salient points in case of a scale change. The illustration is exaggerated (κ = 4) for better visualization of the relocation structure.
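Eq. (7) transcribes directly into code; the helper below is our own sketch:

```python
# Direct transcription of Eq. (7): each salient point is moved relative to
# the new template center by its old offset scaled with kappa.
import numpy as np

def relocate_point(p, q, center, new_center, kappa):
    """(p, q): salient point location; center = (Xc, Yc) of the TT;
    new_center = (Xc_new, Yc_new) of the scaled TT."""
    Xc, Yc = center
    Xc_new, Yc_new = new_center
    p_new = Xc_new - np.sign(Xc - p) * abs(Xc - p) * kappa
    q_new = Yc_new - np.sign(Yc - q) * abs(Yc - q) * kappa
    return p_new, q_new
```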
After the determination of M_{R,Best}, the TT is updated using a strategy based on ρ and the Euclidean distance-based measure (α) defined in Eq. (8):

$$\alpha = \frac{\left\| M_{R,Best} - TT \right\|_2}{\text{number of pixels}(M_{R,Best})}. \tag{8}$$
As can be seen from Fig. 2, the ρ and α terms are used together with their predefined thresholds e_2 and e_3 in the TT update mechanism. If ρ is smaller than e_2, a strong match criterion is satisfied, and the TT is taken directly as M_{R,Best}. Otherwise, the TT is updated according to the α value. In this case, the template change counter (TCC), which is defined to indicate the number of similar (α < e_3) TT's and M_{R,Best}'s in consecutive frames, is altered. If the α value defined in Eq. (8) is less than e_3, the TCC value is incremented by one and the TT is updated according to Eq. (9):

$$TT_{Next} = \alpha\,M_{R,Best} + (1 - \alpha)\,TT. \tag{9}$$

In Eq. (9), since α takes small values, the previous TT value is more emphasized in the updated TT.

When the TCC reaches a predefined value (N), the existing TT is updated with the same strategy, but M_{R,Best} is more emphasized in the TT update. Therefore, the update in Eq. (9) is modified as follows:

$$TT_{Next} = (1 - \alpha)\,M_{R,Best} + \alpha\,TT. \tag{10}$$

In this case, after the TT is updated, the TCC is reset to zero. The same zero-resetting is also applied if the α value is larger than the threshold e_3.
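The update logic of Eqs. (8) to (10) can be summarized in a short routine. This is a schematic sketch under our own naming; the thresholds e_2 and e_3 and the counter limit N follow the text.

```python
# Schematic sketch of the TT update strategy of Eqs. (8)-(10).
import numpy as np

def update_template(tt, m_best, rho, tcc, e2, e3, N):
    """tt, m_best: template and best candidate as pixel arrays of equal size.
    Returns (next template, updated template change counter)."""
    if rho < e2:                      # strong match: adopt M_R_Best directly
        return m_best, tcc
    alpha = np.linalg.norm(m_best - tt) / m_best.size      # Eq. (8)
    if alpha < e3:
        tcc += 1
        if tcc >= N:                  # emphasize M_R_Best, Eq. (10), reset TCC
            return (1 - alpha) * m_best + alpha * tt, 0
        return alpha * m_best + (1 - alpha) * tt, tcc      # Eq. (9)
    return tt, 0                      # dissimilar: keep the TT, reset the TCC
```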
In the SPRCD-based tracker framework, if the TT is significantly different from M_{R,Best}, the value of ρ becomes greater than its value in a normal match. In this case, the algorithm assumes that the target has undergone a scale change and initiates a target search with varying scales. This property enables the tracker to follow targets with varying scale and shape. It also provides robustness to abrupt camera movements, camera vibrations, and sudden displacements.

In the aerial target tracking case, if ρ is larger than the threshold e_1, the tracker assumes that there is a significant change in the target model, and a target detection strategy is initiated in order to adapt the TT to the rapid changes in the target model. The target detection algorithm used in the air surveillance case is a simple intensity thresholding-based technique that takes advantage of the contrast difference between the aerial target and the sky background. The reason for using a simple target detection algorithm is to meet the real-time requirements. The detection algorithm is tested on numerous air surveillance videos, and satisfactory detection performance is achieved.
To sum up, the main difference between the proposed tracking scheme and the one in Ref. 25 is the feature extraction structure. The proposed SPRCD enables a computationally more efficient feature extraction mechanism without losing the representative power of the classical RCD.
5 Experimental Work and Results
In the experiments, the proposed SPRCD-based tracker is
tested in different scenarios. In this paper, tracking scenarios
including sea-surface and aerial targets captured using a visual band camera and a ground target captured using an
infrared (IR) camera are provided. The tracking results
obtained by the proposed scheme are compared with the tracker structure developed in Ref. 25, which is known, after large-scale experimental verification, to outperform classical tracking algorithms, including correlation-, KLT-, and SIFT-based trackers. The proposed tracking scheme is also compared with SIFT- and SURF-based tracking techniques40 in an appropriate tracking scenario.
The SPRCD-based tracker naturally has different tracking parameters than the classical RCD-based tracker. Since the SPRCD structure depends on fewer pixel-wise features, it becomes more sensitive to changes in the target model. Therefore, the threshold e_0 regarding the descriptor matching result (ρ) must be selected to be larger than the one used in the classical RCD-based structure.
In Sec. 5.1, the performance measures to evaluate the
tracking performance are mentioned, and in Sec. 5.2, the
tracking results for each tracking scenario are presented.
5.1 Performance Measures
In order to evaluate the tracking performance in a quantitative manner, four different morphological similarity measures (PM_i, where i = 1, 2, 3, 4) proposed in Ref. 25 are used. PM_1 and PM_2 are pixel-wise overlapping and nonoverlapping area-based measures, and PM_3 and PM_4 are L2- and L1-norm-based measures, respectively. A more detailed analysis of these measures, as well as a naive performance measure fusion strategy, is provided in Ref. 25. By using these performance measures and the fusion mechanism, a final evaluation of the tracking performance is established.
In addition to the PM_i's, a statistical method based on a confidence interval type of approach41 is proposed for target loss detection. The target loss detection algorithm is based on an object signature function g(z, v), where v denotes the observations of a random variable V with finite variance; v is the sample of this random variable for any possible value of z. The mean value (E{g(z, V)} = Γ(z)) and the variance (Var{g(z, V)}) of the target signature function are used in order to obtain proper confidence intervals with a certain high probability, since the standard deviation of the signature function is naturally less than its mean value. The mean value of the signature function is the cumulative distribution function (CDF) of the function, and the CDF and variance of the signature function can be estimated using the target parameters of the previous frame. In this way, a target loss evaluation mechanism for the currently processed image frame can be determined using the mean- and variance-based confidence intervals. Let Γ(z) denote the mean values of the target signature function, where z = 0, 1, ..., 255 is the value set that a pixel can possess. Then, a lower bound L(z) and an upper bound U(z) can be determined around the mean Γ(z) as in Eq. (11), using the Gaussianity assumption for the target signature function due to the central limit theorem:41
$$L(z) = \Gamma(z) - \lambda \sqrt{\operatorname{Var}\{g(z, V)\}}, \qquad U(z) = \Gamma(z) + \lambda \sqrt{\operatorname{Var}\{g(z, V)\}}. \tag{11}$$
The parameter λ in L(z) and U(z) is determined according to the three-sigma (empirical) rule and the six-sigma approach. Consequently, 3 ≤ λ ≤ 5 becomes a proper interval for the target loss detection problem. As an example, the bounds on g(z, V) using the three-sigma rule (λ = 3) for a sea-surface target are illustrated in Fig. 5. Note that the bounds on g(z, V) for aerial and IR targets are determined via the same three-sigma approach.

Fig. 5 The bounds on g(z, V) when λ = 3.
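A minimal sketch of the resulting test is given below (our own code; flagging any excursion outside the band as a loss is our assumption about the decision rule):

```python
# Sketch of the confidence-interval test of Eq. (11): the signature observed
# in the current frame is checked against the band Gamma(z) +/- lambda*sigma.
import numpy as np

def track_lost(signature, gamma, var_g, lam=3.0):
    """signature, gamma, var_g: arrays indexed by z = 0..255; lam in [3, 5]."""
    lower = gamma - lam * np.sqrt(var_g)     # L(z), Eq. (11)
    upper = gamma + lam * np.sqrt(var_g)     # U(z), Eq. (11)
    return bool(np.any((signature < lower) | (signature > upper)))
```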
In the experimental results, the average calculation times for the RCD and SPRCD blocks and for the overall method are also provided. The average processing times for both blocks are obtained by averaging the total sum of elapsed times at each visit to the unoptimized descriptor computation block. The proposed tracker is implemented in the C++ programming language on a computer with a 2.5-GHz Core(TM)2 Quad CPU and 2 GB of RAM running the Microsoft Windows XP operating system.
5.2 Tracking Scenarios
In the first experiment, the RCD- and SPRCD-based trackers are tested in a sea surveillance scenario. The experiment is carried out using a visual band camera that captures 640 × 480 interlaced video frames. In the preprocessing step, a "line doubling" type of approach is used for deinterlacing: the odd-numbered (or even-numbered) rows of each frame are retained, and the interpolation of two consecutive retained rows is placed between them. As a result, a reasonably deinterlaced video frame with the same dimensions as the original video frame is obtained.
The video contains 1000 frames of a moving sea-surface target. The target is occluded
by other target-like structures, such as a speed boat and a sail boat. The speed boat moves quickly in front of the target of interest (in frames 1 to 500) and causes the "white cap" effect (sea foam), which changes the target environment and contrast rapidly. The sail boat, which has low-intensity pixel values, moves to the right of the image and occludes the target in frames 850 to 930. The mast of the sail boat causes a sudden intensity change in the target. Consequently, the white-cap effect and the mast of the sail boat are potential locations that may contain strong corners. The tracker parameters τ, e_0, e_2, e_3, and N for the sea surveillance scenario are selected as 7, 1, 0.1, 0.0019, and 10, respectively, which are experimentally obtained considering a wide range of cues for sea scenarios. The evaluation of the tracking performances of the classical RCD-based tracker and the proposed SPRCD-based tracker is given in Table 1. In the same table, the average computation time for a descriptor is provided in order to observe the computational efficiency of the proposed SPRCD. As seen in the table, both trackers result in similar tracking accuracies. The proposed SPRCD approach is 35% faster than the classical one while preserving the track quality. Sample images of the sea surveillance scenario are provided in Fig. 6. According to the target loss detection measure, only four and five out of 1000 frames are determined as frames that exhibit target losses for the RCD- and SPRCD-based trackers, respectively.
The aerial surveillance scenario is also considered in the experimental studies. The experiments are carried out using the same capture device mentioned above. The video contains 187 frames of a moving helicopter in a cloudy environment. Moreover, the video was captured on a windy day, causing stabilization problems. Therefore, there are some vibrations and sudden movements that reduce the quality of the captured video and make the target tracking task more complicated. The tracker parameters τ, e_0, e_2, e_3, and N for the air surveillance scenario are selected as 8, 1, 0.1, 0.0019, and 3, respectively. The performance of the classical RCD- and proposed SPRCD-based trackers is provided in Table 2. The computation times for the RCD and SPRCD blocks are also presented in order to give an idea about the computational complexity of the approaches. In this case, the target is a point-like structure. Therefore, very few salient points are extracted from the target region. Consequently, the SPRCD tracker is not able to outperform the classical RCD tracker. Although the SPRCD tracker has lower PM_i values than the classical RCD tracker, the target is tracked with only four target losses until the end of the video. In the same video, the classical RCD-based tracker has two frames containing target losses. The processing time of the proposed approach is more or less the same as that of the classical RCD, as stated before. It is therefore reasonable to conclude that the proposed SPRCD approach is most suitable for large targets, where the SPRCD takes advantage of its computational efficiency. The sample images for the tracking of the aerial target are provided in Fig. 7.
The proposed SPRCD-based tracking scheme is also
tested in an IR surveillance scenario. The IR video used
Table 1 The performance of trackers in the visual sea-surface target tracking scenario.

Tracker type | PM_1  | PM_2  | PM_3 | PM_4 | Track score | Track loss | Block computation time (ms)
RCD          | 0.066 | 0.908 | 0.99 | 1.12 | 0.8375      | 4/1000     | 0.1130
SPRCD        | 0.021 | 0.849 | 0.82 | 0.94 | 0.8224      | 5/1000     | 0.0737
Fig. 6 The sample images of a sea-surface target tracking scenario.
Table 2 The performance of trackers in the visual aerial target tracking scenario.

Tracker type | PM_1  | PM_2  | PM_3 | PM_4 | Track score | Track loss | Block computation time (ms)
RCD          | 0.085 | 0.666 | 0.87 | 1.05 | 0.5998      | 2/187      | 0.0665
SPRCD        | 0.212 | 0.434 | 1.71 | 2.08 | 0.3230      | 4/187      | 0.0719
in this experiment includes 210 frames of a moving vehicle in a complex background that includes stationary objects, buildings, trees, and moving vehicles, and is captured with a longwave IR camera having a frame size of 320 × 240. The target is also exposed to partial occlusion in certain frames. The sample frames of the tracking results of the SPRCD-based tracker are presented in Fig. 8. The performance of the proposed SPRCD-based tracking scheme is compared with the classical RCD-based framework. In addition, unlike the tracking scenarios presented above, the IR tracking scenario contains a more detailed analysis by introducing SIFT- and SURF-based trackers to the comparison of the tracking results (Table 3). The comparison with SIFT- and SURF-based trackers is not included in the air and sea-surface scenarios because these scenarios include small targets that yield an insufficient set of features in the feature extraction phase. The insufficient feature set due to small targets may degrade the performance of SIFT- and SURF-based trackers; therefore, for a fair comparison, these results are not provided for sea-surface and aerial target tracking. In the IR tracking scenario, the parameters of the SIFT and SURF trackers are determined after performing an experimental framework. For the SIFT-based tracker, the number of octave layers is three, the contrast and edge thresholds are 1000, and σ is 1. Similarly, for the SURF tracker, the number of octave layers is five, and the threshold for the Hessian matrix is 1. The length of the feature descriptor is 128.
From Table 3, it may be concluded that the SPRCD-based scheme outperforms the classical RCD-, SIFT-, and SURF-based tracking schemes. The classical RCD-, SIFT-, and SURF-based tracking techniques fail to track the target when most of the target is occluded by another object in certain frames.

Fig. 7 The sample images of an aerial target tracking scenario.

The occlusion also causes the extraction of
the SIFT- and SURF-based features to be blocked over the
regions overlapped by the occluding object. The proposed
SPRCD can handle such situations by considering the covariance type of relations between the extracted corner points. In this way, the weak corners that are not considered strong SIFT and SURF keypoints play an important role in target representation. The classical RCD-based trackers fail to track the target
when most of the target region is occluded by another target-like structure in certain frames. The target loss indication algorithm verifies the track failure by detecting losses in 27 out of 210 frames in this scenario. However, the SPRCD deals with such types of occlusion by taking advantage of the covariance type of relation between the salient points. In that case, only 11 out of 210 frames are detected as frames that contain target losses. Moreover, the SPRCD enables an efficient implementation by reducing the average time of the descriptor calculation block in the IR surveillance case.

Although the target loss indication scheme gives a track loss decision in certain frames of each surveillance scenario, the targets continue to be tracked. The target loss indication mechanism, in fact, measures the track quality rather than the loss of target presence. Sudden changes in the target model, abrupt movements, and vibrations of the capturing device may be the main reasons for the low track quality.
For the comparison of computational time, the average execution times for a classical RCD and the proposed SPRCD computed over different-sized W × W regions are examined. As can be seen from Fig. 9, the experiment is carried out by selecting a reference point in a visual band video, and W × W target regions are located at this reference point. Each time, the W value is changed and the corresponding elapsed time for the calculation of a descriptor is computed. The computation times for the RCD and SPRCD corresponding to each computation region are visualized in Fig. 10. Note that both the classical RCD- and the proposed SPRCD-based trackers track the W × W sized targets without any track loss. From Fig. 10, one can conclude that the computation time of the classical RCD grows rapidly as the dimensions of the descriptor calculation region increase. However, the increasing size of the calculation region does not have a significant effect on the computation time of the proposed SPRCD, since ϖ is fixed to be at most 25. The upper limit for the number of salient points is determined through experimental studies for each tracking scenario. Obviously, one can determine more salient points depending on the scenario by considering the trade-off between tracking accuracy and computational cost. Another concern may be the cost of the initial salient point extraction procedure in the case of tracking larger targets. However, this initial cost is not high compared to the inclusion of all pixels in the descriptor computation in the classical RCD approach. Therefore, the proposed SPRCD
is computationally more efficient than the classical RCD, especially when dealing with relatively large objects occupying large regions of the image.

Fig. 8 The sample images of a ground target tracking scenario in the IR band.

Table 3 The performance of trackers in the IR ground target tracking scenario.

Tracker type | PM_1  | PM_2  | PM_3 | PM_4 | Track score | Track loss | Block computation time (ms)
RCD          | 0.519 | 0.621 | 4.94 | 5.75 | 0.245       | 27/210     | 0.083
SIFT         | 0.474 | 0.664 | 3.60 | 4.57 | 0.309       | 19/210     | 4.815
SURF         | 0.057 | 0.389 | 5.85 | 7.74 | 0.299       | 14/210     | 1.118
SPRCD        | 0.338 | 0.895 | 3.37 | 3.95 | 0.556       | 11/210     | 0.078
In this work, our main aim is to develop a computationally efficient descriptor extraction scheme. Thus, the salient point extraction scheme is employed to modify the classical RCD technique so as to keep the computational cost as small as possible. However, for more complicated tracking problems, the proposed point selection mechanism can be further expanded by introducing additional points into the descriptor computation. As an additional design, the salient point set is expanded by locating a rectangle of predetermined size at the center of mass of the salient points. The features located at the points in this rectangle are additionally used in the descriptor computation. Hence, the descriptor calculated over these extended salient points provides better tracking accuracies and also captures the characteristics of the smooth regions. Although this extended scheme is still computationally more efficient than the classical RCD technique, it does not provide the most economical design in terms of computational cost. Since the main concern addressed in this work is the reduction of the computational cost, only the tracking accuracies obtained via the most computationally efficient scheme are included in Sec. 5.
Fig. 9 The illustration of the W × W computation region located at a reference point. The values for the target size W are selected as follows: W = {5, 8, 10, 12, 16, 20, 30, 40, 50, 60, 80, 100, 125, 150, 200}.

Fig. 10 The computation times of a single classical RCD and the proposed SPRCD over the W × W computation region.

6 Conclusion
In this paper, a new descriptor based on salient points and the RCD is proposed. The proposed descriptor scheme enables robust target tracking as well as a computationally efficient structure by using only salient pixels, which may have more discriminative power than the other pixels of a region. The classical RCD has been widely used in many feature extraction problems, but the computational cost of this technique increases excessively when the target region (descriptor calculation region) grows. Hence, the classical RCD scheme may not be implementable in real time using digital signal processors. By considering only salient points over a region, it is possible to put an upper bound on the computational cost while preserving the RCD's power to represent targets. It is experimentally observed that the proposed descriptor even outperforms the classical RCD in some partial occlusion cases by taking advantage of the variational relations between the salient points. Moreover, the proposed tracking scheme achieves better tracking accuracies than the well-known SIFT- and SURF-based tracking techniques.

As future work, we plan to fuse features obtained using IR cameras operating at different wavelengths and/or visual band cameras. We will investigate the relation of the features at different salient points between images recorded in different bands for robust feature selection. The target loss indication algorithm is intended to be incorporated into the decision mechanism of the tracker in order to weaken the dependency of the tracker on the direct regional matching metric. In this way, an alternative online control mechanism over the tracker will be introduced.
Acknowledgments
This study is supported by the project numbered 109A001 in the framework of the TÜBİTAK 1007 Program. The authors
would like to thank A. Onur Karali for his efforts in video
capture and helpful discussions and Dr. M. Alper Kutay for
his support in this study.
References
1. A. Yilmaz, O. Javed, and M. Shah, “Object tracking: a survey,” ACM
Comput. Surveys 38(4), 1–45 (2006).
2. H. Yang et al., “Recent advances and trends in visual tracking: a review,”
Neurocomputing 74(18), 3823–3831 (2011).
3. S. Y. Chen, “Kalman filter for robot vision: a survey,” IEEE Trans.
Industrial Electron. 59(11), 4409–4420 (2012).
4. X. Zhang et al., “Robust object tracking for resource-limited hardware
systems,” in Lecture Notes in Computer Sci., 4th Int. Conf. on Intelligent Robotics and Applications, H. L. S. Jeschke and D. Schilberg,
Eds., Vol. 7102, pp. 85–94, Springer Berlin Heidelberg, Germany
(2011).
5. C. Schmid, R. Mohr, and C. Bauckhage, “Evaluation of interest point
detectors,” Int. J. Comput. Vision 37(2), 151–172 (2000).
6. S. Gauglitz, T. Höllerer, and M. Turk, “Evaluation of interest point
detectors and feature descriptors for visual tracking,” Int. J. Comput.
Vision 94(3), 335–360 (2011).
7. D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision 60(2), 91–110 (2004).
8. C. Park, K. Baea, and J.-H. Jung, “Object recognition in infrared image
sequences using scale invariant feature transform,” Proc. SPIE
6968, 69681P (2008).
9. T. Can, A. O. Karalı, and T. Aytaç, “Detection and tracking of sea-surface targets in infrared and visual band videos using the bag-of-features
technique with scale-invariant feature transform,” Appl. Opt. 50(33),
6203–6212 (2011).
10. H. Lee et al., “Scale-invariant object tracking method using strong
corners in the scale domain,” Opt. Eng. 48(1), 017204 (2010).
11. P. B. W. Schwering et al., “Application of heterogeneous multiple camera system with panoramic capabilities in a harbor environment,” Proc.
SPIE 7481, 74810C (2009).
12. L. Jing-zheng et al., “Automatic matching of infrared image sequences
based on rotation invariant,” in Proc. IEEE Int. Conf. Environmental
Sci. Info. Application Technol., pp. 365–368, IEEE, China (2009).
13. Y. Pang et al., “Scale invariant image matching using triplewise constraint and weighted voting,” Neurocomputing 83, 64–71 (2012).
14. Y. Pang et al., “Fully affine invariant SURF for image matching,”
Neurocomputing 85, 6–10 (2012).
15. Y. Wang, “Image mosaicking from uncooled thermal IR video captured
by a small UAV,” in Proc. IEEE Southwest Sympos. Image Anal.
Interpret., pp. 161–164, IEEE, New Mexico (2008).
16. H. P. Moravec, “Visual mapping by a robot rover,” in Int. Joint Conf.
Artificial Intell., pp. 598–600, Morgan Kaufmann Publishers Inc., Japan
(1979).
17. C. Harris and M. Stephens, “A combined corner and edge detector,” in
Alvey Vision Conf., pp. 147–152, University of Sheffield Printing Unit,
England (1988).
18. H. Bay et al., “SURF: speeded up robust features,” Comput. Vis. Image
Understand. 110(3), 346–359 (2008).
19. E. Rosten and T. Drummond, “Machine learning for high-speed corner
detection,” in Proc. 9th European Conf. Computer Vision–Volume
Part I, pp. 430–443, Springer-Verlag, Austria (2006).
20. V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,”
IEEE Trans. Pattern Anal. Mach. Intell. 28(9), 1465–1479 (2006).
21. M. Ozuysal et al., “Fast keypoint recognition using random ferns,” IEEE
Trans. Pattern Anal. Mach. Intell. 32(3), 448–461 (2010).
22. O. Tuzel, F. Porikli, and P. Meer, “Region covariance: a fast descriptor
for detection and classification,” in Proc. IEEE European Conf.
Computer Vision, pp. 589–600, Springer-Verlag, Austria (2006).
23. F. Porikli, O. Tuzel, and P. Meer, “Covariance tracking using model
update based on Lie algebra,” in Proc. IEEE Int. Conf. Computer
Vision Pattern Recog. Vol. 1, pp. 728–735, IEEE, New York (2006).
24. Y. H. Habiboğlu, O. Günay, and A. E. Çetin, “Covariance matrix-based
fire and flame detection method in video,” Mach. Vis. Appl. 23(6),
1103–1113 (2011).
25. S. Cakir et al., “Classifier based offline feature selection and evaluation
for visual tracking of sea-surface and aerial targets,” Opt. Eng. 50(10),
107205 (2011).
26. S. Paisitkriangkrai, C. Shen, and J. Zhang, “Fast pedestrian detection
using a cascade of boosted covariance features,” IEEE Trans. Circ.
Syst. Video Technol. 18(8), 1140–1151 (2008).
27. Y. Pang, Y. Yuan, and X. Li, “Gabor-based region covariance matrices
for face recognition,” IEEE Trans. Circ. Syst. Video Technol. 18(7),
989–993 (2008).
28. M. Hartemink, “Robust automatic object detection in a maritime environment: polynomial background estimation and the reduction of false
detections by means of classification,” Master’s Thesis, Delft University
of Technology, The Netherlands (2012).
29. S. M. A. Bhuiyan, M. S. Alam, and M. Alkanhal, “New two-stage correlation-based approach for target detection and tracking in forwardlooking infrared imagery using filters based on extended maximum
average correlation height and polynomial distance classifier correlation,” Opt. Eng. 46(8), 086401–14 (2007).
30. C. Tomasi and T. Kanade, “Detection and tracking of point features,”
Technical Report, Carnegie Mellon University (1991).
31. B. D. Lucas and T. Kanade, “An iterative image registration technique
with an application to stereo vision,” in Proc. 7th Int. Joint Conf.
Artificial Intell., pp. 674–679, Morgan Kaufmann Publishers Inc.,
BC, Canada (1981).
32. J. Shi and C. Tomasi, “Good features to track,” in Proc. IEEE Conf. Computer Vision and Pattern Recog., pp. 593–600, IEEE, Washington (1994).
33. H. Tuna, İ. Onaran, and A. E. Çetin, “Image description using a multiplier-less operator,” IEEE Signal Process. Lett. 16(9), 751–753 (2009).
34. K. Duman, “Methods for target detection in SAR images,” Master’s
Thesis, Bilkent University, Department of Electrical and Electronics
Engineering, Ankara, Turkey (2009).
35. L. Zheng et al., “Salient covariance for near-duplicate image and video
detection,” in Proc. IEEE Int. Conf. Image Processing, pp. 2585–2588,
IEEE, Belgium (2011).
36. J. Yao and J.-M. Odobez, “Fast human detection from videos using
covariance features,” in Proc. European Conf. Computer Vision,
Visual Surveillance Workshop, France (2008).
37. J. Ling et al., “Infrared target tracking with kernel-based performance
metric and eigenvalue-based similarity measure,” Appl. Opt. 46(16),
3239–3252 (2007).
38. W. Forstner and B. Moonen, “A metric for covariance matrices,”
Technical Report, Department of Geodesy and Geoinformatics,
Stuttgart University (1999).
39. X. Ding et al., “Region covariance based object tracking using Monte
Carlo method,” in Proc. IEEE Int. Conf. Control and Automation,
pp. 1802–1805, IEEE, India (2010).
40. A. Vedaldi and B. Fulkerson, VLFeat: An Open and Portable Library of
Computer Vision Algorithms, http://www.vlfeat.org/ (2008).
41. S. Beheshti et al., “Noise invalidation denoising,” IEEE Trans. Signal
Process. 58(12), 6007–6016 (2010).
Serdar Cakir received his BSc from Eskişehir Osmangazi University in 2008. Immediately after graduation, he joined Bilkent University, where he received his MSc in electrical engineering in 2010. He joined the Scientific and Technological Research Council of Turkey in 2010, where he is currently a research scientist. He also continues his PhD studies at Bilkent University, Department of Electrical Engineering. His main research interests are image/video processing, computer vision, and pattern recognition.
Tayfun Aytaç received his BSc in electrical
engineering from Gazi University, Ankara,
Turkey, in 2000 and his MS and PhD in electrical engineering from Bilkent University,
Ankara, Turkey, in 2002 and 2006, respectively. He joined the Scientific and Technological Research Council of Turkey in 2006,
where he is currently a chief research scientist. His current research interests include
imaging systems, automatic target recognition, target tracking and classification, and
electronic warfare in the infrared band.
Alper Yildirim received a BSc degree in
electrical engineering from Bilkent University, Ankara, Turkey, in 1996, an MSc degree
in digital and computer systems from Tampere University of Technology, Tampere,
Finland, in 2001, and a PhD degree in electronics engineering from Ankara University,
Ankara, in 2007. He was a design engineer
with Nokia Mobile Phones, Tampere. He is
currently a chief research scientist with the
Scientific and Technological Research
Council of Turkey, Ankara. His research interests include digital signal
processing, optimization, and radar systems.
Soosan Beheshti received a BSc degree
from Isfahan University of Technology,
Isfahan, Iran, and MSc and PhD degrees
from the Massachusetts Institute of Technology (MIT), Cambridge, in 1996 and 2002,
respectively, all in electrical engineering.
From September 2002 to June 2005, she
was a postdoctoral associate and a lecturer
at MIT. Since July 2005, she has been with
the Department of Electrical and Computer
Engineering, Ryerson University, Toronto,
Ontario, Canada, where she is currently an assistant professor and
director of Signal and Information Processing Laboratory. Her
research interests include statistical signal processing, hyperspectral
imaging, and system dynamics and modeling.
Ö. Nezih Gerek received BSc, MSc, and
PhD degrees in electrical engineering from
Bilkent University, Ankara, Turkey, in 1991,
1993, and 1998, respectively. During his
PhD studies, he spent a semester at the
University of Minnesota as an exchange
researcher in an NSF project. Following his
PhD degree, he spent one year as a research
associate at EPFL, Lausanne, Switzerland.
Currently, he is a full professor of electrical
engineering at Anadolu University, Eskisehir.
He is also a member of the Electrical, Electronics and Informatics
Research Fund Group of the Scientific and Technological Research
Council of Turkey. He is on the editorial board of Turkish Journal of
Electrical Engineering and Computer Science and Elsevier: Digital
Signal Processing. His research areas include signal analysis,
image processing, and signal coding.
A. Enis Cetin received his PhD from University of Pennsylvania in 1987. Between
1987 and 1989, he was an assistant professor of electrical engineering at University of
Toronto. He has been with Bilkent University,
Ankara, Turkey, since 1989. He was an associate editor of the IEEE Transactions on
Image Processing between 1999 and 2003.
Currently, he is on the editorial boards of
Signal Processing and Journal of Advances
in Signal Processing and Journal of Machine
Vision and Applications, Springer. He is a Fellow of IEEE. His
research interests include signal and image processing, human-computer interaction using vision and speech, and audiovisual
multimedia databases.