0% found this document useful (0 votes)
42 views16 pages

Generalized Canonical Time Warping

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 16

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO.

2, FEBRUARY 2016 279

Generalized Canonical Time Warping


Feng Zhou and Fernando De la Torre

Abstract—Temporal alignment of human motion has been of recent interest due to its applications in animation, tele-rehabilitation
and activity recognition. This paper presents generalized canonical time warping (GCTW), an extension of dynamic time warping
(DTW) and canonical correlation analysis (CCA) for temporally aligning multi-modal sequences from multiple subjects performing
similar activities. GCTW extends previous work on DTW and CCA in several ways: (1) it combines CCA with DTW to align
multi-modal data (e.g., video and motion capture data); (2) it extends DTW by using a linear combination of monotonic functions
to represent the warping path, providing a more flexible temporal warp. Unlike exact DTW, which has quadratic complexity,
we propose a linear time algorithm to minimize GCTW. (3) GCTW allows simultaneous alignment of multiple sequences.
Experimental results on aligning multi-modal data, facial expressions, motion capture data and video illustrate the benefits
of GCTW. The code is available at http://humansensing.cs.cmu.edu/ctw.

Index Terms—Multi-modal sequence alignment, canonical correlation analysis, dynamic time warping

1 INTRODUCTION

T EMPORAL alignment of multiple time series is an impor-


tant problem with applications in many areas such as
speech recognition [1], computer graphics [2], computer
warping (DTW) or Bayesian networks are not capable
of aligning align multi-modal data.
To address these problems, this paper proposes gener-
vision [3], and bio-informatics [4]. In particular, align- alized canonical time warping (GCTW), a technique to
ment of human motion from sensory data has recently temporally align multi-modal time series of different
received increasing attention in computer vision and com- subjects performing similar activities. GCTW is a spatio-
puter graphics to solve problems such as curve matching temporal alignment method that temporally aligns two
[5], temporal clustering [6], tele-rehabilitation [7], activity or more multi-dimensional and multi-modal time series
recognition [8] and motion synthesis [9], [10]. While algo- by maximizing the correlation across them. GCTW can
rithms for aligning time series have been commonly used be seen as an extension of DTW or canonical correlation
in computer vision and computer graphics, a relatively analysis (CCA). To accommodate for subject variability
unexplored problem has been the alignment of multi- and take into account the difference in the dimensional-
dimensional and multi-modal time series that encode ity of the signals, GCTW uses CCA as a measure of spa-
human motion. Fig. 1 illustrates the main problem tial correlation. GCTW extends DTW by incorporating a
addressed by this paper: how can we efficiently find the feature weighting mechanism to align signals of different
temporal correspondence between (1) frames of a video, dimensionality and provide higher weights to the
(2) samples of motion capture data and (3) samples of features that make both signals more correlated. It also
three-axis accelerometers of different subjects performing extends DTW by parameterizing the warping path as a
a similar action? combination of monotonic functions, providing more
The alignment of multi-dimensional, multi-modal accurate alignment and faster optimization strategies.
time series of human motion poses several challenges. Unlike exact approaches based on DTW, which have
First, there is typically a large variation in the subjects’ quadratic cost, GCTW uses a Gauss-Newton (GN) algo-
physical characteristics, motion style and speed per- rithm that has linear complexity in the length of the
forming the activity. Second, large changes in view sequence. Preliminary versions of this paper were pub-
point also complicate the correspondence problem [11]. lished in [13], [14].
Third, it is unclear how existing techniques can be
used to align sets of time series that have different
2 PREVIOUS WORK
modalities (e.g., video and motion capture data). While
there is extensive literature on time series alignment 2.1 Temporal Alignment
(e.g., [12]), standard extensions of dynamic time This section discusses prior work on alignment of time
series in the context of computer graphics, computer vision
and data mining.
In the literature of computer graphics, temporal
 The authors are with the Robotics Institute, Carnegie Mellon University. alignment of human motion has been commonly applied
E-mail: zhfe99@gmail.com, ftorre@cs.cmu.edu. to solve problems such as content modeling [15], and
Manuscript received 15 Dec. 2013; revised 13 Feb. 2015; accepted 10 Mar. motion blending [2]. Hsu et al. [9] proposed iterative
2015. Date of publication 17 Mar. 2015; date of current version 13 Jan. 2016. motion warping (IMW), a robust method that finds a
Recommended for acceptance by C.H. Lampert. spatio-temporal warping between two instances of
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below. motion captured data. Shapiro et al. [16] used indepen-
Digital Object Identifier no. 10.1109/TPAMI.2015.2414429 dent component analysis to separate motion data
0162-8828 ß 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Authorized licensed use limited to:http://www.ieee.org/publications_standards/publications/rights/index.html
See Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTCinformation.
for more from IEEE Xplore. Restrictions apply.
280 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016

footnote for the notation1), X ¼ ½x1 ; . . . ; xn  2 Rdx n and


Y ¼ ½y1 ; . . . ; yn  2 Rdy n , CCA finds the linear combinations
of the variables in X that are most correlated with the linear
combinations
P of
P the variables in Y. Assuming zero-mean
data ( i xi ¼ j yj ¼ 0), CCA finds a combination of the
original features that minimizes the sum of the distances
between samples:
 2
min Jcca ¼ VTx X  VTy YF þ fðVx Þ þ fðVy Þ; (1)
fVx ;Vy g2F

Fig. 1. Temporal alignment of sequences recorded with different sensors where Vx 2 Rdx d and Vy 2 Rdy d denote the low-dimen-
(from top to bottom: video, motion capture and accelerometers) of three sional embeddings (d  minðdx ; dy Þ) for X and Y respec-
subjects kicking a ball. tively. fðÞ is a regularization function that penalizes the
high-frequency components of the embedding matrices:
into visually meaningful components called style compo-

nents. In comparison, our method solves a more general fðVÞ ¼ kVk2F ; (2)
problem of aligning human motion from multi-modal 1
time series.
In the context of computer vision, temporal alignment In order to avoid the trivial solution of VTx X and VTy Y
of video captured with different cameras and view being zero, CCA decorrelates the canonical variates (col-
points has been a topic of interest. Rao et al. [17] aligned umns of VTx X and VTy Y) by imposing the orthogonal con-
trajectories of two moving points using constraints from
straint as:
the fundamental matrix. Junejo et al. [8] adopted DTW
n   
for synchronizing human actions with view changes 
based on a view-invariant description. In comparison, F ¼ fVx ; Vy g  VTx ð1  ÞXXT þ I Vx ¼ I;
  o (3)
our method simultaneously estimates the optimal spatial
VTy ð1  ÞYYT þ I Vy ¼ I :
transformation and temporal correspondence to align
video sequences. Recently, Gong and Medioni [18]
where  2 ½0; 1 is a weight to trade-off between the least-
extended CTW approach to incorporate more complex
squared error and the regularization terms. In the case of
spatial transformations through manifold learning.
insufficient samples where the covariance matrices (XXT or
Nicolaou et al. [19] proposed a probabilistic extension of
CTW for fusing multiple continuous expert annotations YYT ) are singular, a positive regularization term I is nec-
in tasks related to affective behavior. These works show essary to avoid over-fitting and for numerical stability. Opti-
the flexibility of our framework for aligning various mizing (1) has a closed-form solution in terms of a
types of time series. generalized eigenvalue problem, i.e., ½Vx ; Vy  ¼ eigd ðA; BÞ,
In the field of data mining, there have been several where:
extensions of DTW to align time series that differ in the    
0 XYT XXT 0
temporal and spatial domain. Keogh and Pazzani [20], A¼ ; B ¼ ð1  Þ þ I:
for example, used derivatives of the original signal to YXT 0 0 YYT
improve alignment with DTW. Listgarten et al. [21] pro- See [25] for a unification of several component analysis
posed continuous profile models, a probabilistic method methods and a review of numerical techniques to efficiently
for simultaneously aligning and normalizing sets of time solve generalized eigenvalue problems.
series in bio-informatics. Unlike these works, which were In computer vision, CCA has been used for matching sets
originally designed for aligning 1-D time series, our work of images in problems such as activity recognition from
addresses the more challenging problem of aligning video [26] and activity correlation from cameras [27].
multi-modal and multi-dimensional time series. Recently, Fisher et al. [28] proposed an extension of CCA
In the literature of manifold alignment, Ham et al. [22] with parameterized warping functions to align protein
aligned manifolds of images in a semi-supervised manner. expressions. The learned warping function is a linear combi-
The prior knowledge of pairwise correspondences between nation of hyperbolic tangent functions with non-negative
two sets was used to guide the graph embedding. Wang
and Mahadevan [23] aligned manifolds based on an exten-
1. Bold capital letters denote a matrix X, bold lower-case letters a
sion of the Procrustes analysis (PA). The main benefit of this column vector x. xi and xðiÞ represent the ith column and ith row of the
approach is that PA learns a mapping generalized for out- matrix X respectively. xij denotes the scalar in the ith row and jth col-
of-sample cases. However, these models lack a mechanism umn of the matrix X. All non-bold letters represent scalars.
1mn ; 0mn 2 Rmn are matrices of ones and zeros. In 2 Rnn isqanffiffiffiffiffiffiffiffiffiffiffiffiffiffi
iden-
to enforce temporal continuity. pP ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P 2ffi
tity matrix. kxkp ¼ p i jxi jp denotes the p-norm. kXkF ¼ ij xij
represents the Frobenious norm. vecðXÞ denotes the vectorization of
2.2 Canonical Correlation Analysis (CCA) matrix X. X  Y is the Hadamard product. ½X1 ;    ; Xn  denotes the
vertical concatenation of sub-matrices Xi . eigd ðA; BÞ denotes the top d
CCA [24] is a technique to extract common features from a eigenvectors V that solve the generalized eigenvalue problem AV =
pair of multi-variate data. Given two sets of n variables (see BVL.

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 281

review on several DTW’s modifications to control the warp-


ing paths.
Although the number of possible ways to align X and Y
is exponential in nx and ny , dynamic programming (DP)

[30] offers an efficient approach with complexity of O nx ny


to minimize Jdtw using Bellman’s equation:

J
pxt ; pyt ¼ miny kxpxt  ypy k22 þ J
pxtþ1 ; pytþ1 ;
pðpxt ;pt Þ t

where the cost-to-go value function, J


ðpxt ; pyt Þ, represents
the cost remaining starting at the tth step using the optimal
policy p
. The policy function, pð; Þ : f1 : nx g  f1 : ny g
! f½1; 0; ½0; 1; ½1; 1g, defines the deterministic transition
between consecutive steps, ½pxtþ1 ; pytþ1  ¼ ½pxt ; pyt  þ pðpxt ; pyt Þ.
Once the policy queue is known, the alignment steps can be
Fig. 2. An example of DTW for aligning time series. (a) Two 1-D time
recursively selected by backtracking, pxl ¼ nx and pyl ¼ ny .
series (nx ¼ 7 and ny ¼ 8) and the optimal alignment between samples Fig. 2a shows an example of DTW for aligning two 1-D
computed by DTW. (b) Euclidean distances between samples, where time series. Fig. 2b illustrates the Euclidean distance of each
the red curve denotes the optimal warping path (l ¼ 9). (c) DP policy at pair of samples. To compute the optimal warping path, DP
each pair of samples, where the three arrow directions, #; &; !, denote
the policy, pð; Þ 2 f½1; 0; ½1; 1; ½0; 1g, respectively. (d) A matrix-form efficiently enumerates all possible steps as in Fig. 2c from
interpretation of DTW as stretching the two time series in matrix prod- the upper-left corner to the bottom-right one. At the end,
ucts. (e) Warping matrix Wx . (f) Warping matrix Wy . the optimal alignment (red curve) can be computed by itera-
tively tracing back along the arrows.
coefficients, ensuring monotonicity. Similarly, Shariat and Given two sequences of length nx and ny , exact DTW has
Pavlovic [29] imposed monotonic constraints on CCA using a computational cost in space and time of Oðnx ny Þ. In prac-
non-negative least squares for activity recognition tasks. tice, various modifications [1] on the step size, local weights
However, these methods were unable to deal with feature and global constraints (e.g., the Sakoe-Chiba and Itakura
weighting. Parallelogram bands [1]) have been proposed to speed up
the DTW computation as well as to better control the possi-
2.3 Dynamic Time Warping (DTW) ble routes of the warping paths. In recent work [31], [32],
Given two time series, X ¼ ½x1 ; . . . ; xnx  2 Rdnx and [33], a multi-scale searching scheme has been shown to
effectively generate a speedup from one to three orders of
Y ¼ ½y1 ; . . . ; yny  2 Rdny , DTW [1] aligns X and Y such
magnitude, compared to the classic DTW algorithm. More
that the sum of the distances between the aligned samples recently, Rakthanmanon et al. [34] have shown that DTW
is minimized: for mining 1-D sub-sequences can be scaled up to very large
Xl
datasets using early-abandoning and cascading lower
min Jdtw ¼ kxpxt  ypy k22 ; (4)
fpx ;py g2C t bounds. However, most of these works are originally
t¼1
designed for 1-D time series. In comparison, our method
where l maxðnx ; ny Þ is the number of indices used to align can be applied to deal with more general multi-dimensional
the samples. The optimal l is automatically selected by the sequences and align signals of different dimensionality.
DTW algorithm. The warping paths, px 2 f1 : nx gl and
py 2 f1 : ny gl , denote the composition of alignment in 3 CANONICAL TIME WARPING (CTW)
frames. The ith frame in X and the jth frame in Y are DTW lacks a feature weighting mechanism and thus it can-
aligned if there exist pxt ¼ i and pyt ¼ j at some step t. not be directly used for aligning multi-modal sequences
In order to find a polynomial time solution, the warping (e.g., video and motion capture) with different features. To
paths must satisfy three constraints: address this issue, this section presents CTW, a unified
n  framework that combines DTW with CCA.

C ¼ fpx ; py g  px 2 f1 : nx gl and py 2 f1 : ny gl ;
3.1 Least-Squares Formulation for DTW
Boundary : ½px1 ; py1  ¼ ½1; 1 and ½pxl ; pyl  ¼ ½nx ; ny ; In order to have a compact and compressible energy
Monotonicity : t1 t2 ) pxt1 pxt2 and pyt1 pyt2 ; function for CTW, it is important to note that the original
o objective of DTW (4) can be reformulated in matrix form as:
Continuity : ½pxt ; pyt   ½pxt1 ; pyt1  2 f½0; 1; ½1; 0; ½1; 1g :
(5) min Jdtw ¼ kXWx  YWy k2F ; (6)
fpx ;py g2C

The choice of step size in the continuity constraint is not


where X 2 Rdnx and Y 2 Rdny denote the two time series
unique. For instance, replacing the step size by
f½2; 1; ½1; 2; ½1; 1g can avoid the degenerate case in which a to be aligned. Wx ¼ Wðpx Þ 2 f0; 1gnx l and Wy ¼ Wðpy Þ 2
single frame of one sequence is assigned to many consecu- f0; 1gny l are two binary warping matrices (Figs. 2e and 2f)
tive frames in the other sequence. See [1] for an extensive associated with the warping paths by a non-linear mapping,

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
282 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016

" #
WðpÞ : f1 : ngl ! f0; 1gnl ; (7) 0 XWx WTy YT
A¼ ;
The value of the elements of W is always 0 or 1. wpt ;t is 1 for YWy WTx XT 0
any step t 2 f1 : lg, and zero otherwise. The matrix Wx X " #
XWx WTx XT 0
has the effect to replicate (possibly multiple times) samples B ¼ ð1  Þ þ I:
of the X. Similarly for Wy . Fig. 2d shows that the DTW 0 YWy WTy YT
alignment in Fig. 2a can be equivalently interpreted as
stretching the two time series X and Y by multiplying them
with the warping matrices Wx and Wy , respectively as In most experiments, we initialized CTW by setting Wx
and Wy a uniform warping that aligns the sequences. In
shown in Figs. 2e and 2f. Note that (6) is very similar to
CCA’s objective (1). CCA applies a linear transformation to this case, the warping paths are computed as
combine the rows (features), while DTW applies binary
px ¼ roundðlinspaceð1; nx ; lÞÞ0 ; (9)
transformations to replicate the columns (time). It is impor-
tant to notice that reformulating DTW as a least-squared where roundðÞ and linspaceðÞ are MATLAB functions. But
optimization function allows easy generalizations, and this CTW can be initialized by any other time warping method
is one of the important contributions of this paper. such as DTW or iterative motion warping [9]. The dimen-
sion d can be selected to preserve a certain amount (e.g.,
3.2 Objective Function of CTW 90 percent) of the total correlation. Once the spatial transfor-
In order to accommodate for differences in style and subject mation is computed, the temporal alignment is computed
variability, add a feature selection mechanism, and reduce using standard approaches for DTW. Alternating between
the dimensionality of the signals, CTW adds a linear trans- these two steps (spatial and temporal alignment) monotoni-
formation (Vx 2 Rdx d and Vy 2 Rdy d ) to the least-squared cally decreases Jctw . Jctw is bounded below, so the proposed
form of DTW (6), as in the case of CCA. Moreover, this algorithm will converge to a critical point.
transformation allows alignment of temporal signals with
different dimensionality (e.g., video and motion capture). In 4 GENERALIZED CANONICAL TIME WARPING
a nutshell, CTW combines DTW and CCA by minimizing:
(GCTW)
 2
min Jctw ¼ VTx XWx  VTy YWy F In the previous section, we described CTW, a technique
fVx ;Vy g2F;fpx ;py g2C that is able to align two multi-modal sequences with dif-
(8)
þ fðVx Þ þ fðVy Þ; ferent features. However, CTW has three main limitations
inherited from DTW: (1) The exact computational com-
where Vx 2 Rdx d and Vy 2 Rdy d parameterize the spatial plexity of DTW for multi-dimensional sequences is qua-
transformation and project the sequences into the same dratic both in space and time; (2) CTW and its extensions
low-dimensional coordinate system. Constrained by (5), address the problem of aligning two sequences, but it is
Wx and Wy warp the signal in time to maximize the tempo- unclear how to extend it to the alignment of multiple
ral correlation. Similar to CCA, fðÞ is a regularization term sequences; (3) The temporal alignment is computed using
(2) for Vx and Vy . In addition, the projections satisfy the DTW, which relies on DP to find the optimal path. In
constraints, some problems (e.g., sub-sequence alignment) the warp-
n    ing path provided by DP is too rigid (e.g., the first and

F ¼ fVx ; Vy g  VTx ð1  ÞXWx WTx XT þ I Vx ¼ I; the last samples have to match).
  o To address these issues, this section proposes GCTW, an
VTy ð1  ÞYWy WTy YT þ I Vy ¼ I ; efficient technique for spatio-temporal alignment of multi-
ple time series. To accommodate for subject variability and
where  2 ½0; 1 is to trade-off between the least-squared to take into account the difference in the dimensionality of
error and the regularization term. the signals, GCTW uses multi-set CCA (mCCA). To
Equation (8) is the main contribution of this paper. CTW compensate for temporal changes, GCTW extends DTW by
is a clean extension of CCA and DTW to align two signals in incorporating a more efficient and flexible temporal warp-
space and time. It extends previous work on CCA by adding ing parameterized by a set of monotonic basis functions.
temporal alignment and on DTW by allowing a feature Unlike existing approaches based on DP with quadratic
selection and dimensionality reduction mechanism for complexity, GCTW efficiently optimizes the time warping
aligning signals of different dimensions. function using a Gauss-Newton algorithm, which has linear
complexity in the length of the sequence.
3.3 Optimization of CTW
Optimizing Jctw is a non-convex optimization problem with 4.1 Objective Function of GCTW
respect to the warping matrices and projection matrices. We Given a collection of m time series, fXi gm i¼1 , GCTW aims to
take a coordinate-descent approach that alternates between seek for each Xi ¼ ½xi1 ; . . . ; xini  2 Rdi ni , a low-dimensional
solving the temporal alignment using DTW, and computing
spatial embedding Vi 2 Rdi d and a non-linear temporal
the spatial projections using CCA.
Given the warping matrices, the optimal projection transformation Wi ¼ Wðpi Þ 2 f0; 1gni l parameterized by
matrices are the leading d generalized eigenvectors, i.e., pi 2 f1 : ni gl , such that the resulting sequence VTi Xi Wi 2
½Vx ; Vy  ¼ eigd ðA; BÞ, where: Rdl is well aligned with the others in the least-squared

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 283

Fig. 3. Approximating temporal warping using monotonic bases. (a) Five


common choices for monotonic bases. (b) An example of time warping
XWðQaÞ 2 R170 of 1-D time series X 2 R150 . (c) The warping matrix.
(d) The warping function Qa is a linear combination of three basis func-
tions including a constant function (q1 ) and two monotonic functions (q2
and q3 ).

sense. In a nutshell, GCTW minimizes the sum of pairwise Fig. 4. An example of using Gauss-Newton for solving the sub-sequence
alignment problem. (a) Two 1-D sequences. (b) The contour of the
distances: objective function (Ja as defined in (13)) with respect to the weights of
two bases. (c) The Gauss-Newton optimization procedure, the longer
Xm X m
1  red sequence is warped to match the shorter blue sequence. (d) Warp-
min Jgctw ¼ VT Xi Wi  VT Xj Wj 2 ing function (p) as a combination of a linear function (q1 ) and a constant
i j F
fVi gi 2F;fpi gi 2C
i¼1 j¼1
2 function (q2 ) used for scaling and translation respectively.
X
m
þ ðfðVi Þ þ cðpi ÞÞ;
i¼1 (axb ), (2) exponential (expðax þ bÞ), (3) logarithm (logðaxþ bÞ),
(10) (4) hyperbolic tangent (tanhðax þ bÞ) and (5) I-spline [36].
Similar work by Fisher et al. [28] also used hyperbolic tangent
where fðVi Þ ¼ m=ð1  ÞkVi k2F is the regularization func- functions as temporal bases, and the weights were optimized
tion penalizing the irregularity of the spatial transformation using a non-negative least squares algorithm. However,
Vi .  2 ½0; 1 is a trade-off parameter between the least- GCTW differs in three aspects: (1) GCTW allows aligning
squared error and the regularization term. Following the multi-dimensional time series that have different features,
multi-set CCA (mCCA) [35], GCTW constrains the spatial while [28] can only align one-dimensional time-series;
embeddings as: (2) GCTW uses a more efficient eigen decomposition to solve
( ) mCCA and quadratic programming for optimizing the
 X
m

 weights; and (3) GCTW uses a family of monotonic functions


F¼ fVi gi  Vi ð1  ÞXi Wi Wi Xi þ I Vi ¼ I :
T T T
that allow for a more general warping (e.g., sub-sequence
i¼1
alignment), and constraints to regularize the solution.
cðÞ and CðÞ, defined in the following sections, are used to To approximate the DTW constraints (5) on the warping
respectively regularize and constrain the temporal transfor- path p, we alternatively impose the following constraints
mation pi . on the weights a.
Let us consider a single sequence X 2 Rdn and its tempo- Boundary conditions. We enforce the position of the first
ral warping, p 2 f1 : ngl . While the possible composition of frame, p1 ¼ qð1Þ a 1, and the last frame, pl ¼ qðlÞ a  n,
the temporal warping path, p, is locally enforced by the origi- where qð1Þ 2 R1k and qðlÞ 2 R1k are the first and last rows
nal DTW constraints (5), the global shape of any valid p must of the basis matrix Q respectively. In contrast to DTW, which
correspond to a monotonic and continuous trajectory in imposes a tight boundary (i.e., p1 ¼ 1 and pl ¼ n), GCTW
matrix W 2 f0; 1gnl starting from the upper-left corner and allows p to index a sub-part of X. This relaxation is useful in
ending at the bottom-right one. Recall that any nonnegative solving the more general problem of sub-sequence align-
combination of monotonic trajectories is guaranteed to be ment. For instance, Fig. 4 illustrates an example of matching
monotonic. GCTW parameterizes the warping path p as a lin- a shorter 1-D sequence (blue) to a sub-sequence of the longer
ear combination of monotonic functions: one (red). In this sub-sequence alignment problem, GCTW
models the time warping p as a combination of a linear basis
X
k q1 and a constant basis q2 .
p ac qc ¼ Qa; (11) Monotonicity. We enforce t1  t2 ) pt1  pt2 by constrain-
c¼1 ing the sign of the weight: a 0. Note that constraining the
weights is a sufficient condition to ensure monotonicity but
where a 2 Rk ; a 0 is the non-negative weight vector it is not necessary. See [37], [38], [39] for in-depth discus-
and Q ¼ ½q1 ; . . . ; qk  2 ½1; nlk is the basis set composed of k sions on monotonic functions.
pre-defined monotonically increasing functions. Fig. 3a illus- Continuity. To approximate the hard constraint on the
trates five common choices for qc , including (1) polynomial step size, GCTW penalizes the curvature of the warping

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
284 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016

P
path using a temporal regularization term, lt¼1 krqðtÞ ak22 A direct optimization of Ja is difficult due to the non-lin-
kFl Qak22 where Fl 2 Rll is the 1st order differential ear function WðÞ. Inspired by the Lucas-Kanade frame-
operator. work for image alignment [40], we approximate the
In summary, we constrain the warping path in (10) by temporal alignment problem using a Gauss-Newton
adding the following constraints on a, method. More specifically, GN iteratively updates the
weights a ^i ai þ di by minimizing a series of first-order
cðaÞ ¼ hkFl Qak22 ; C ¼ fa j La  bg; Taylor approximations of Ja centered at each term Zðai Þ
ð1Þ ðlÞ
(12) given the initial ai , where di 2 Rk denotes the increment of
where L ¼ ½Ik ; q ; q  and b ¼ ½0k ; 1; n:
the weight ai .
To better understand the approximation, let us first focus
Therefore, given a basis set of k monotone functions, all fea-
on, zt ðai Þ 2 Rd , the tth column of Zðai Þ ¼ ½z1 ðai Þ; . . . ; zl ðai Þ,
sible weights belong to a polyhedron in Rk parameterized
which can be rewritten as,
by L 2 Rðkþ2Þk and b 2 Rkþ2 . For instance, Fig. 3d illus-
trates an example of a warping function (solid line) as a
combination of three monotone functions (dotted lines). zt ðai Þ ¼ VTi Xi WðQai Þ t ¼ VTi Xi ½WðQai Þt ; (14)

4.2 Optimization of GCTW w.r.t. Spatial Basis


where ½t denotes the tth column of a matrix. According to
Minimizing Jgctw (10) is a non-convex optimization problem
with respect to the temporal transformation and the spatial the definition of WðÞ in (7), ½WðQai Þt 2 f0; 1gn is a binary
projection. We optimize GCTW by alternating between vector with only one non-zero element located at qðtÞ ai ,
solving for the time warping using an efficient Gauss-New- where qðtÞ 2 R1k is the tth row of Q. In other words, zt ðai Þ
ton algorithm (see Section 4.3), and computing the spatial is a replication of qðtÞ ath T
i column of the signal Vi Xi , i.e.,
transformation using mCCA. zt ðai Þ ¼ ½Vi Xi qðtÞ ai . Following [40], we approximate
T

Assuming the time warping is fixed, mCCA computes zt ðai þ di Þ as,


the optimal fVi gi using a generalized eigen decomposition
as ½V1 ; . . . ; Vm  ¼ eigd ðA; BÞ, where:

@qðtÞ ai
2 3 2 3 zt ðai þ di Þ zt ðai Þ þ r VTi Xi jqðtÞ ai di ; (15)
Y1 YT1  Y1 YTm Y1 YT1  0 @ai
6 7 6 7
A¼6
4
..
.
..
.
..
.
76
5 4
..
.
..
.
..
.
7;
5
where rðVTi Xi ÞjqðtÞ ai 2 Rd denotes the row-wise gradient2
Ym YT1    Ym YTm 0  Ym YmT
2 3 of the signal VTi Xi around column qðtÞ ai . The term,
Y1 YT1  0
6 . 7 @qðtÞ ai =@ai ¼ qðtÞ 2 R1k is the Jacobian of the time warping.
B ¼ ð1  Þ6
4 ..
..
.
..
.
7 þ I; Yi , Xi Wi :
5 Putting together the approximations of each column of
Zðai þ di Þ using (15) yields:
0    Ym YTm

vecðZðai þ di ÞÞ vi þ Gi di ;
These steps monotonically decrease Jgctw , and because 2 T
3
the function is bounded below, the alternating scheme will r Vi Xi jqð1Þ ai qð1Þ
6 7 (16)
converge to a critical point. 6 .. 7
where vi ¼ vecðZðai ÞÞ; Gi ¼ 6 . 7:
4
5
ðlÞ
4.3 Optimization of GCTW w.r.t. Temporal Weights r Vi Xi jqðlÞ ai q
T

This section describes how GCTW relaxes the warping path


to be a linear combination of monotonic paths, 11 provides
Plugging (16) in (13) yields the approximation:
a new model for temporal alignment and new methods for
optimizing it. Given k bases, Q 2 Rlk , optimizing (10) with
respect to the warping paths fpi gi can be written as: X
m X
m
1 X
m
kvi þ Gi di  vj  Gj dj k22 þ hkFi Qðai þ di Þk22 :
2
Xm X m
1  i¼1 j¼1 i¼1
min Ja ¼  VT Xi WðQai Þ  VT Xj WðQaj Þ 2
fai gi
i¼1 j¼1
2 |fflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflffl}
i j
|fflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflffl} F
Zðai Þ Zðaj Þ Minimizing it with respect to all the weight increments
X
m d ¼ ½d1 ; . . . ; dm  2 Rmk yields a quadratic programming
þ hkFl Qai k22 ; s:t: Li ai  bi ; 8i; problem:
i¼1
(13)
1 T
where WðÞ is a non-linear mapping function defined in (7). min d Hd þ f T d; s:t: Ld  b  La; (17)
d 2
Li and bi are used to constrain the monotonicity and bound-
ary of the time warping for each sequence. To shorten in
notation, we denote each term VTi Xi WðQai Þ 2 Rdl as
Zðai Þ. 2. The gradient is independently computed for each row of VTi Xi .

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 285

Fig. 5. Comparison between Gauss-Newton and variants of DTW for temporal alignment. (a) An example of two 1-D time series and the alignment
results calculated using DTW, DTW constrained in the Sakoe-Chiba band (DTW-SC) and the Itakura Parallelogram band (DTW-IP), DTW optimized
in a multi-level scheme (FastDTW) and Gauss-Newton (GN). (b) Comparison of different warping paths. GN-Init denotes the initial warping used for
GN. SC-Bound and IP-Bound denote the boundaries of SC band and IP band respectively. (c) Alignment errors. (d) Computational costs.

whose components are computed as follows: However, using a narrow band (a small b) cuts off potential
2 3 warping space, and may lead to a sub-optimal solution. For
mGT1 G1 þ hQT FT1 F1 Q    GT1 Gm instance, Fig. 5a shows an example of two 1-D time series
6 7
H¼6
4
..
.
..
.
..
.
7;
5
and the alignment results calculated by different algo-
rithms. The results computed by DTW-SC and DTW-IP are
Gm G1
T
   mGm Gm þ hQ Fm Fm Q
T T T
less accurate than the ones computed using our proposed
2 P 3
GT1 ðmv1  i vi Þ þ hQT FT1 F1 Qa1 Gauss-Newton. This is because both the SC and IP bands
6 7
f ¼6
4
..
.
7;
5
are over-constrained (Fig. 5b). Alternatively, instead of
P constraining the warping path, exhaustive DTW search can
G ðmvm  i vi Þ þ hQ Fm Fm Qam
T T T
be approximated in a multi-level scheme. For instance,
2 m3 2 3 2 3
a1 b1 L1    0 Salvador and Chan [33] introduced FastDTW by recursively
6 . 7 6 . 7 6 . . .. 7
a¼6 7 6 7 6
4 .. 5; b ¼ 4 .. 5; L ¼ 4 .. .. 7
. 5:
projecting a solution from a coarser resolution and refining
the projected solution in a higher resolution. Although the
am bm 0    Lm
coarse-to-fine framework could largely reduce the
search space, the solution is not exact. See Fig. 6 for a
Note that the objective function of (17) is convex. Fig. 4 detailed comparison.
illustrates an example of aligning two 1-D time series To provide a quantitative evaluation, we synthetically
(Fig. 4a) using this approach. To achieve sub-sequence generated 1-D sequences at 15 scales. For DTW-SC, we
alignment, we model the time warping path p as a combina- set the bandwidth to be b ¼ 0:1, which is common in
tion of a linear basis q1 and a constant one q2 (Fig. 4d). As practical applications. For FastDTW, we recursively
shown in Fig. 4c, Gauss-Newton takes three steps to find shrink the sequence length in half from the finest level
the optimal warping parameter in a 2-D space (Fig. 4b). to the coarsest one. We then propagated the DTW solu-
In most experiments, we initialized ai by uniformly tion from coarse to fine with radius r ¼ 5. For GN, we
aligning the sequences (see GN-Init curve in Fig. 5b). How- varied k to be 5; 8; 12 to investigate the effect of the num-
ever, better results can be achieved by using a more sophis- ber of bases. For each scale, we randomly generated 100
ticated initialization method. The length of the warping pairs of sequences. The error was computed using (19)
path l is usually set to be l ¼ 1:1maxm i¼1 ni . The computa- and shown in Figs. 5c and 5d. DTW obtains the lowest
tional complexity of the algorithm is Oðdlmk þ m3 k3 Þ. error but takes the most time to compute. This is because
DTW exhaustively searches the entire parameter space to
4.4 Comparison with Other DTW Techniques find the global optima. DTW-SC, DTW-IP and FastDTW
As discussed in [1], [33], there are various techniques that all need less time than DTW because they search a
have been proposed to accelerate DTW. For instance, the smaller space. Empirically, DTW-SC is the least accurate
Sakoe-Chiba band (DTW-SC) and the Itakura Parallelogram compared to DTW-IP and FastDTW in our synthetic
band (DTW-IP) reduce the complexity of the original DTW dataset. GN is most computationally efficient because it
algorithm to Oðbnx ny Þ by constraining the warping path to has linear complexity in the sequence length. Moreover,
be in a band of a certain shape, where b < 1 is the size ratio increasing the number of bases monotonically reduces
between the band and the original search space of DTW. the error.

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
286 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016

m X
X m
1
min Jmdtw ¼ kXi Wi  Xj Wj k2F
fpi 2Cgi
i¼1 j¼1
2
m  2 (18)
X  1X m

¼m Xi Wi  Xj Wj  :
i¼1
m j¼1
F

mDTW alternates between independently solving for each


pi using an asymmetrical DTW and updating the mean
P
sequence m1 m j¼1 Xj Wj by averaging fXi Wi gi .
DDTW and mDDTW. In order to make DTW invariant to
translation, derivative DTW (DDTW) [20] uses the deriva-
tives of the original features and minimizes:

min Jddtw ¼ kXFTnx Wx  YFTny Wy k2F ;


fpx ;py g2C

Fig. 6. Comparison of temporal alignment algorithms as a function of where Fnx and Fny are the 1st order differential operators.
degrees-of-freedom and complexity. l is the length of warping path.
LSðnÞ and eigðnÞ denote the complexity of solving a least-squares of n
To align multiple sequences, multi-sequence DDTW
variables and a generalized eigenvalue problem with two n-by-n matri- (mDDTW) extends DDTW in the Procrustes framework
ces, respectively. similar to (18).
IMW and mIMW. Similar to CTW, iterative motion warp-
ing [9] alternates between time warping and spatial trans-
5 EXPERIMENTS formation to align two sequences. Assuming the same
This section compares CTW and GCTW against state-of-the- number of spatial features between X 2 Rdnx and
art methods for temporal alignment of time series in seven Y 2 Rdny , IMW translates and re-scales each feature in X
experiments. In the first experiment, we compared the per- independently to match with Y. Written in a simple matrix
formance of CTW and GCTW against DTW, DDTW [20], form, IMW minimizes:
and IMW [9] in the problem of aligning synthetic time series
of varying complexity. In the second experiment, we min Jimw ¼ kðX  Ax þ Bx ÞWx  Yk2F
px 2C;Ax ;Bx
aligned videos of different subjects performing a similar  2  2
activity; each video is represented using different types of þ a Ax FTnx F þ b Bx FTnx F ;
visual features. In the third experiment, we showed how
where Ax ; Bx 2 Rdnx are the scaling and translation
GCTW and CTW can be applied to provide a metric useful
parameters respectively. a and b are the weights for the
for activity recognition. In the fourth experiment, we
least-squared error and the regularization terms. The regu-
aligned facial expressions across subjects on videos with
larization terms are used to enforce a smooth change in the
naturally occurring facial behavior. In the fifth experiment,
columns of Ax and Bx . In the experiments, we set them to
we showed how GCTW can be applied to large-scale align-
be a ¼ b ¼ 1. IMW takes a coordinate-descent approach
ment. We aligned approximately 50;000 frames of motion
to optimize the time warping, scaling and translation. Given
capture data of two subjects cooking the same recipe. In the
the warping matrix Wx , the optimal spatial transformation
sixth experiment, we showed how GCTW can be used to
can be computed in closed-form. To align multiple sequen-
localize common sub-sequences between two time series.
ces, we extended IMW to multi-sequence IMW (mIMW) in
The last experiment shows how GCTW is able to align three
the Procrustes framework similar to mDTW (18).
sequences of different subjects performing a similar action
mCTW. CTW was originally proposed to align two multi-
recorded with different sensors (motion capture data, accel-
modal sequences. We extended CTW to multi-sequence
erometers and video). In the first four experiments, the
CTW (mCTW) for aligning multiple time series using the
ground truth was known and we provided quantitative
Procrustes analysis framework. mCTW optimizes the same
evaluation of the performance. In the others, we evaluated
objective (10) as GCTW does, however, the temporal
the quality of the alignment visually.
alignment step is different. mCTW alternates between
warping each time series using asymmetric DTW and
5.1 Evaluation Methods updating the mean sequence, while GCTW uses Gauss-
In the experiments, we compared CTW and GCTW with Newton for jointly optimizing over all weights of the bases.
several state-of-the-art methods for temporal alignment of Fig. 6 compares different temporal alignment methods
time series. Below, we provide a brief description of the as a function of the number of variables to optimize and
techniques that we used for comparison. their computational complexity. The comparison reports
DTW and mDTW. DTW is solved using a standard two cases: (1) alignment of two sequences (top of the
dynamic programming algorithm that minimizes (4). To table) and more than two sequences. Given two time
evaluate the performance of temporal alignment of multiple series, X 2 Rdnx and Y 2 Rdny , DTW and DDTW have
sequences, we extended the concept of Procrustes analysis the same complexity Oðnx ny Þ for finding the optimal
[41] to time series. That is, given m ð> 2Þ time series, multi- l-length warping path. IMW additionally solves d least-
sequence DTW (mDTW) seeks for a set of warping paths squared problems for each row of Ax and Bx indepen-
fpi gi that minimizes: dently. Similarly, CTW relies on DTW to optimize the

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 287

Fig. 7. Comparison of temporal alignment algorithms on the synthetic dataset. (a) An example of three synthetic time series generated by performing
a random spatio-temporal transformation of a 2-D latent sequence Z and adding Gaussian noise in the third dimension. (b) The alignment results. (c)
Mean and variance of the alignment errors. (d) Mean and variance of the computational cost (time in seconds).

time warping, resulting in a complexity of Oðnx ny Þ in palg


i 2R
lalg
is the time warping path for the ith sequence. To
both space and time. However, CTW uses CCA to accom- evaluate the error of the time warping paths given by differ-
modate the difference in the number of features by solv- ent methods, we computed their difference from the
ing a generalized eigen-decomposition of two ltru m
ground-truth path, Ptru ¼ ½ptru 1 ; . . . ; pm  2 R
tru
, where
ðdx þ dy Þ-by-ðdx þ dy Þ matrices. CTW has fewer variables the number of warping steps (lalg and ltru ) could be differ-
than IMW and thus is less likely to overfit the data. ent. To better understand the error, let us consider each
Compared to CTW, GCTW has the same complexity for warping path P 2 Rlm as a curve in Rm with l points (rows
computing the spatial embedding. The main computa- of P). For instance, Figs. 8c and 10c compare the warping
tional advantage of GCTW is using Gauss-Newton, which paths as 3-D and 2-D curves respectively. The error can be
optimizes a small-scale QP with 2k variables for the time hence defined as the normalized distance between the
warping. Given m sequences, fXi 2 Rdi ni gm i¼1 , a direct curves Palg and Ptru ,
generalization of the DTW is computationally infeasible
1 
due to the combinatorial explosion
Q of possible warpings, error ¼ distðPalg ; Ptru Þ þ distðPtru ; Palg Þ ;
incurring a complexity of Oð m 2
i¼1 i Þ. In the experiment,
n
1 X  ðiÞ ðjÞ 
(19)
mDTW is used as an approximation of the exact DTW where distðP1 ; P2 Þ ¼ min p1  p2 2 :
optimization. However, mDTW and other DTW-based l1 l2 i¼1l j¼1l2
1
methods (mDDTW, mIMW and mCTW) still have qua-
ðiÞ ðjÞ
dratic complexity. Instead, GCTW approximates the com- The term, minj¼1l2 kp1  p2 k2 , measures the shortest dis-
binatorial problem of time warping as a continuous ðiÞ
tance between the point p1 and any point on the curve P2 ,
optimization that can be more efficiently optimized by ðiÞ ðjÞ
solving a small-scale QP with mk variables. where p12 R1m and 2 R1m are the ith row of P1 and
p2
jth row of P2 respectively.

5.2 Evaluation Metric 5.3 Aligning Synthetic Sequences


For all experiments, we used the normalized distance from In the first experiment, we synthetically generated spatio-
the ground-truth as an error metric. More specifically, let us temporal signals (3-D in space and 1-D in time) to evalu-
denote the alignment result of m sequences by a set of time ate the performance of mCTW and GCTW. As shown in
lalg m
warping paths, Palg ¼ ½palg 1 ; . . . ; pm  2 R
alg
, where Fig. 7a, the signals were randomly generated by spatially

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
288 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016

Fig. 8. Comparison of temporal alignment algorithms for aligning multi-feature video data. (a) An example of three aligned videos by GCTW. The left
three sequences are the original frames after background subtraction, while the right three are the binary images, the Euclidean distance transforms
and the solutions of the Poisson equation. (b) The alignment results. (c) Comparison of time warping paths. (d) Mean and variance of the alignment
errors.

and temporally transforming a latent 2-D spiral, generated time series. Both mDTW and mDDTW per-
Z 2 R2l ; l ¼ 300 as X ¼ ½UT ðZ þ b1T ÞM; eT  2 R3n , formed poorly in this case since they do not have a fea-
where U 2 R 22
and b 2 R2 were a randomly generated ture weight mechanism to adapt the spatial
projection matrix and translation vector respectively. To transformation of the sequences. mIMW warps sequences
synthesize the temporal distortion, a binary selection towards others by translating and re-scaling each frame
matrix M 2 f0; 1gln was generated by randomly choos- in eachP dimension. Moreover, mIMW has P more parame-
ing n  l columns from the identity matrix Il . The third ters (2l i di ) than mCTW and GCTW (d i di ), and hence
spatial dimension e 2 Rn was added with a zero-mean mIMW is more prone to over-fitting. Furthermore, mIMW
Gaussian noise. In this experiment, mCTW and GCTW tries to fit the noisy dimension (third spatial component),
were compared with mDTW, mDDTW and mIMW for biasing alignment in time, whereas both mCTW and
aligning three time series. The ground-truth alignment GCTW had a feature selection mechanism that effectively
was known and the performance of each method was canceled out the third dimension. Among all initializa-
evaluated in terms of the alignment errors defined in tions, the uniform alignment (mCTW-U and GCTW-U)
(19). We repeated the above process 100 times with ran- and mIMW (mCTW-I and GCTW-I) achieved the best
dom numbers. In each trial, we studied three different results. It is important to notice that mCTW and GTW
initialization methods for mCTW and GCTW: uniform always improved the initial error, that is, mCTW-D and
alignment for mCTW-U and GCTW-U as in (9), mDTW mCTW-I obtained lower errors than their initializations
for mCTW-D and GCTW-D, and mIMW for mCTW-I and given by mDTW and mIMW respectively. Compared to
GCTW-I. The subspace dimensionality for CCA d was mCTW, GCTW achieved better performance when align-
selected to preserve 90 percent of the total correlation. In ing more than two sequences because GCTW jointly opti-
this case, we have sufficient samples (l ¼ 300) in 3-D mizes over all the possible time warpings for each time
space, and the regularization weight  was set to zero. series, while mCTW takes a greedy approach by warping
For GCTW, we selected three hyperbolic tangent and each sequence towards the mean sequence independently.
three polynomial functions as monotonic bases (upper- Fig. 7d evaluates the computational cost of each
right corner in Fig. 7b). method with respect to the average sequence length.
Figs. 7 illustrates a comparison of the previously mIMW was the most computationally intensive method
described methods for aligning multiple time series. because it solves a least-squared problem for each feature
Fig. 7b shows the spatio-temporal warping estimated by dimension. mCTW was more expensive than DTW-based
mDTW, mDDTW, mIMW, mCTW-U and GCTW-U. methods because of the additional computation to solve
Fig. 7c shows the alignment errors (19) for 100 randomly CCA. As expected, GCTW was the most efficient.

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 289

5.4 Aligning Videos with Different Features


In the second experiment, we used mCTW and GCTW to
align video sequences of different people performing a simi-
lar action. Each video was encoded using different visual
features. The video sequences were taken from the
Weizmann database [42], which contains nine people per-
forming 10 actions. To represent dynamic videos, we sub-
tracted the background (the left three rows in Fig. 8a) and
computed three popular shape features (the right three
rows of Fig. 8a) for each 70-by-35 re-scaled mask image,
including (1) binary image, (2) Euclidean distance transform
[43], and (3) solution of Poisson equation [44]. In order to
reduce the dimension of the feature space (2;450), we picked
the top 123 principal components that preserved 99 percent
of the total energy. We randomly selected three sequences
and manually labeled their temporal correspondence as Fig. 9. Activity classification. (a) DTW (dis)similarity matrices computed
ground-truth. We repeated the process 10 times. All meth- between videos using different features: binary images fXi gi and
Euclidean distance transforms fYj gj . A darker color indicates a smaller
ods were initialized with uniform alignment. For GCTW,
distance. (b) GCTW distance matrices. (c) Classification errors.
we used five hyperbolic tangent and five polynomial func-
tions as the monotonic bases.
Fig. 8d shows the error for 10 randomly generated sets of were concatenated in two matrices Xtr and Ytr , and the pro-
videos. Neither mDTW nor mDDTW was able to align the jections Vx and Vy were computed by CCA optimizing (1).
videos because they were not able to handle alignment of Recall that the projection matrices were fixed for DTW,
signals of different dimensions. mIMW registered the top DDTW and IMW during testing, but for CTW and GCTW
three components well in space; however, it overfitted they were optimized by the algorithms. Then, the (dis)simi-
the data and also computed a biased time warping path. In larity between each pair of sequences was then computed as
contrast, mCTW and GCTW warped the sequences accu-
rately in both space and time. 1  
VT Xi Walg  VT Yj Walg  ;
distðXi ; Yj Þ ¼ x xi y yj F (20)
lalg

5.5 Activity Classification where the alignment step lalg was used to normalize the
This section explores the use of CTW and GCTW as a (dis) (dis)similarity. Given the (dis)similarity computed in (20),
similarity measure between videos with different features, classification for a test sequence was done by finding the
and between videos and motion capture data. This (dis)sim- closest video in the training sequence. The overall classifica-
ilarity is used in combination with a nearest neighbor tion error was averaged over all testing sequences.
classifier. Figs. 9a and 9b display the 12-by-12 (dis)similarity matri-
Given a training set of videos of a subject performing an ces computed by DTW and GCTW respectively. Each
activity recorded with different visual features fXi gi , our element in the matrices encodes the (dis)similarity between
goal is to find the testing video fYj gj that contains similar a training sequence (row) and a testing sequence (column).
activity. We took 24 sequences from the Weizmann dataset We re-ordered the rows and columns of the matrices so that
[42]: eight people performed three actions (walking, run- the sequences containing the same activities were grouped
ning and jumping). We repeated our experiments 10 times. in consecutive rows and columns. We then divided the
In each trial, we randomly split the 24 sequences into two matrix into nine 4-by-4 blocks (yellow lines), where the
disjoint sets of 12 sequences used for training and testing block in the ith row and jth column contains the (dis)simi-
respectively. The training sequences fXi gi were represented larity between the training sequences of the ith action and
using the binary silhouette (see left columns of Figs. 9a and the testing data of the jth action, where i; j 2 fwalk;
9b as examples) while the testing ones fYj gj were encoded run; jumpg. Darker color denotes smaller (dis)similarity.
with the Euclidean distance features (see top rows of Ideally, if the (dis)similarity would be able to capture per-
Figs. 9a and 9b as examples). Given a pair of sequences, fectly the activity, the matrix will be perfect with zeros
(black color) in the diagonal blocks and higher values (white
Xi 2 Rdx nxi and Yj 2 Rdy nyj , we computed the (dis)simi-
color) elsewhere.
larity between videos using different methods: DTW,
From Fig. 9a, we can observe that the DTW (dis)similar-
DDTW, IMW, CTW and GCTW. The warping path for each
nxi l nyj l ity values in the first and last row of blocks were smaller
method is denoted as Walg xi 2 R yj 2 R
and Walg . than the blocks in the middle row. This is because, DTW
In order to provide a fair comparison with DTW, DDTW does not have a feature adaptation mechanism and fails to
and mIMW (that cannot compute (dis)similarity between provide semantic similarity of videos. In comparison,
different features), we computed a pair of projection Fig. 9b shows that GCTW captured better the (dis)similarity
matrices, Vx 2 Rdx d and Vy 2 Rdy d for them. To do that, between actions. Fig. 9c shows the nearest-neighbor classifi-
we took a subset of 1=3 of the training sequences encoded cation error using different (dis)similarity. Overall, CTW
with both features fXtri g and fYi g, and uniformaly aligned
tr
and GCTW achieved lower errors than others due to their
each pair of sequences of the same label. The aligned frames feature selection in aligning videos with different features.

Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
290 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016

Fig. 10. Comparison of algorithms for aligning facial expression across different subjects. (a) An example of two smiling expression sequences
aligned by GCTW. The features of the two sequences are computed as the 18 landmark coordinates of the mouth given by a face tracker. (b) Align-
ment results. The position that corresponds to the peak of the expression is indicated by points on the curves in the top row. (c) Comparison of time
warping paths. The position that corresponds to the peak of the expression is indicated by the intersection of the dashed lines. (d) Alignment errors.

In the second example, we recognized actions from the face. For the alignment of AU12, we used only the 18
motion capture sequences fYj gj given a training set con- landmarks that correspond to the outline of the mouth. See
taining videos with different visual features fXi gi . From the Fig. 10a for example frames where the mouth outlines are
Weizmann dataset [42] we selected 24 videos containing plotted.
three actions (walking, running and jumping). From the The performance of CTW and GCTW were compared
CMU motion capture database we selected 30 sequences of with DTW, DDTW and IMW. We initialized IMW, CTW
subjects performing the same three actions (walking, run- and GCTW using the same uniform warping. Fig. 10b
ning and jumping). For the mocap data Y, we extracted shows the alignment result obtained by different methods,
quaternions on 20 joints [45] and use them as features. Each where the three dimensions correspond to the first three
video frame X was encoded with a binary silhouette fea- principal components of the original signals. As an approxi-
ture. The experimental setting was similar to previous mate ground-truth, the position of the peak frame of each
experiments. We divided all sequences into two disjoint AU12 event is indicated as the red and blue points on the
sets used for training and testing respectively. Each test curves in Fig. 10b and the intersection of the two dashed
motion capture sequence was classified with the label of the lines in Fig. 10c. As we can see from Figs. 10b and 10c, the
video sequence training sequence X with smallest (dis)simi- two peaks in the low-dimensional projection found by CTW
larity (20). As shown in Fig. 9f, GCTW achieved the lowest and GCTW are closer to the manually labeled peak than the
classification error compared to other methods. This was ones in the original space used for DTW and DDTW.
because our GCTW (dis)similarity captures between multi- Finally, the distance between the peak point and the warp-
modal similarity between actions in videos. ing path is computed to quantitatively measure the perfor-
5.6 Aligning Facial Expression Sequences
In the fourth experiment, we compared CTW and GCTW in the task of aligning unscripted facial expression sequences. The facial videos were taken from the RU-FACS database [46], which contains digitized videos of 29 young adults. The action units (AUs) in this database have been manually coded, and we randomly cropped video segments containing AU12 (smiling) for our experiments. Each event of AU12 is coded at its peak position. We used a person-specific active appearance model [47] to track 66 landmarks on the face. For the alignment of AU12, we used only the 18 landmarks that correspond to the outline of the mouth. See Fig. 10a for example frames where the mouth outlines are plotted.

The performance of CTW and GCTW was compared with DTW, DDTW and IMW. We initialized IMW, CTW and GCTW using the same uniform warping. Fig. 10b shows the alignment result obtained by the different methods, where the three dimensions correspond to the first three principal components of the original signals. As an approximate ground-truth, the position of the peak frame of each AU12 event is indicated by the red and blue points on the curves in Fig. 10b and by the intersection of the two dashed lines in Fig. 10c. As we can see from Figs. 10b and 10c, the two peaks in the low-dimensional projection found by CTW and GCTW are closer to the manually labeled peak than the ones in the original space used for DTW and DDTW. Finally, the distance between the peak point and the warping path is computed to quantitatively measure the performance. Fig. 10d shows the average error, computed as the distance normalized by the sequence lengths, over 20 random repetitions, where CTW and GCTW achieved better performance.
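To make this evaluation reproducible, the sketch below computes the distance from an annotated peak pair to the nearest point of a discrete warping path, normalized by the sequence lengths; the exact normalization used in the paper may differ, so treat this as an assumption.

```python
import numpy as np

def peak_to_path_error(path, peak, lengths):
    """path: (m, 2) array of aligned frame index pairs (i, j).
    peak: (peak_i, peak_j), the manually labeled peak frames.
    lengths: (len_x, len_y), used to normalize each axis.
    Returns the smallest normalized distance from the peak to the path."""
    path = np.asarray(path, dtype=float)
    diffs = (path - np.asarray(peak, dtype=float)) / np.asarray(lengths, dtype=float)
    return float(np.min(np.linalg.norm(diffs, axis=1)))
```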


Fig. 11. Aligning two large-scale motion capture sequences using GCTW. (a) A coarse-to-fine strategy for improving the optimization performance of GCTW. (b) The first row shows the first principal components of the original sequences for three levels of the temporal pyramids. The second row corresponds to the aligned sequences using GCTW. (c) Key frames of similar body poses aligned by GCTW.

5.7 Aligning Large-Scale Motion Capture Sequences
This experiment illustrates the benefits of using GCTW for aligning two large-scale motion capture sequences. The two sequences were taken from the CMU multimodal activity dataset [48], which contains multi-sensor recordings (video, audio, motion capture data and accelerometers) of the naturalistic behavior of 40 subjects cooking five different recipes. The two motion capture sequences used in this experiment contain 44,387 and 48,724 frames respectively, from two subjects cooking brownies. See Fig. 11c for several key-frames of these two sequences. For each motion capture frame, we computed quaternions for four joints on the right hand, resulting in a 12-D feature vector. In this experiment, we only optimized the temporal component of GCTW, not the spatial component. We used five polynomial and five tanh(·) functions as monotonic bases for the time warping function.
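To illustrate what such a basis set looks like, the following sketch builds a warping function as a non-negative combination of monotonically increasing bases (powers of normalized time plus shifted tanh steps). The basis placement, scaling and normalization are our assumptions for illustration, not the settings of the released implementation.

```python
import numpy as np

def monotonic_bases(n_frames, n_poly=5, n_tanh=5, steepness=10.0):
    """Each column is a monotonically non-decreasing function of
    normalized time t in [0, 1]."""
    t = np.linspace(0.0, 1.0, n_frames)
    poly = np.stack([t ** (k + 1) for k in range(n_poly)], axis=1)
    centers = np.linspace(0.1, 0.9, n_tanh)               # assumed placement
    tanh = np.stack([np.tanh(steepness * (t - c)) for c in centers], axis=1)
    return np.hstack([poly, tanh - tanh[:1]])             # start every basis at 0

def warping_function(weights, n_frames, target_len):
    """Non-negative weights keep q = Q @ w monotonic; here q is rescaled
    to span the whole target sequence (the full-boundary case)."""
    q = monotonic_bases(n_frames) @ np.asarray(weights, dtype=float)
    q = (q - q[0]) / max(q[-1] - q[0], 1e-12)
    return q * (target_len - 1)
```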
To avoid local minima in the alignment, we used a temporal coarse-to-fine strategy for the Gauss-Newton optimization in GCTW. As shown in Fig. 11a, the coarse-to-fine strategy proceeds in two steps. (1) In the pre-processing step, we obtained a three-level pyramid for each time series by recursively applying Gaussian smoothing with σ = 200. For instance, the first row of Fig. 11b illustrates the two sequences at the three levels, where the ones at the first level correspond to the original signals, while the ones at the third level contain less detailed but much smoother signals. (2) In the optimization step, GCTW was first used to align the two sequences at the third level instead of the first level. The computed time warping result was then used to initialize GCTW at the second level. We repeated the same procedure to compute the final time warping result of the original sequences at the first level.
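A minimal version of this pre-processing step is sketched below, assuming SciPy is available. The pyramid simply stores increasingly smoothed copies of the feature matrix; the coarse-to-fine loop is shown with align_gctw as a placeholder for the actual solver, not the released API.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def temporal_pyramid(X, levels=3, sigma=200.0):
    """X: (n_frames, n_dims) feature matrix. Level 0 is the original signal;
    each further level is smoothed again along the time axis."""
    pyramid = [np.asarray(X, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(gaussian_filter1d(pyramid[-1], sigma=sigma, axis=0))
    return pyramid  # [finest, ..., coarsest]

# Coarse-to-fine alignment (align_gctw is a placeholder):
# warp = None
# for Xl, Yl in zip(temporal_pyramid(X)[::-1], temporal_pyramid(Y)[::-1]):
#     warp = align_gctw(Xl, Yl, init=warp)
```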
For this large-scale example, DTW was slow and computationally expensive. However, GCTW was able to efficiently find the temporal correspondence between the sequences in just a few seconds using Matlab on a regular laptop with a 2.5 GHz Intel CPU. Since the ground-truth is unknown, we qualitatively evaluated GCTW by showing the aligned key frames in Fig. 11c. Although the two subjects spent different amounts of time and followed different procedures in cooking, GCTW was able to align similar right-hand actions.

5.8 Detection and Alignment of Similar Sub-Sequences
A major problem of DTW-type techniques is that they require an exact matching between the first frame and the last one, see (5). These boundary conditions are impractical and very restrictive when only a subset of the input sequence is similar to the sequence to be aligned. Fig. 12 illustrates this problem: how can we align four motion capture signals composed of different numbers of walking cycles? This problem is related to sub-sequence DTW [49] and temporal commonality discovery [50]. A major limitation of these methods is their inability to handle multiple sequences. This experiment shows how GCTW can be used for multiple sub-sequence alignment.

We selected four walking sequences from the CMU motion capture database [51]. For each motion capture frame, we computed the quaternions for 14 joints on the body, resulting in a 42-D feature vector that describes the human pose. Fig. 12a illustrates the first three principal components of the walking sequences. To allow for sub-sequence alignment, the warping path in GCTW is represented by a combination of a constant function and a linear one as the monotonic bases (see the upper-left corner of Fig. 12c). Both GCTW and the baseline mDTW method were initialized by uniformly aligning the sequences.
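The role of the constant basis can be made concrete with a two-basis warp: with a constant plus a linear function, each sequence is mapped onto an affine segment of a common timeline, so a sequence containing fewer cycles can occupy only part of that timeline instead of being stretched end to end. The sketch below uses our own notation and toy numbers, not the paper's parameter values.

```python
import numpy as np

def affine_warp(n_frames, offset, slope, target_len):
    """Warping path q(t) = offset + slope * t built from a constant and a
    linear basis. A positive slope keeps the path monotonic, and the offset
    lets the sequence start (and stop) strictly inside the target timeline,
    which is what enables sub-sequence alignment."""
    t = np.linspace(0.0, 1.0, n_frames)
    q = offset + slope * t
    return np.clip(q * (target_len - 1), 0, target_len - 1)

# Example: a 120-frame clip mapped onto roughly frames 60-180 of a
# 400-frame sequence (values chosen only for illustration).
path = affine_warp(120, offset=0.15, slope=0.30, target_len=400)
```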

Fig. 12. Aligning similar sub-sequences of four walking motion capture signals. (a) Original features of four mocap walking sequences. (b) Alignment achieved by mDTW. mDTW aligns the sequences end-to-end and hence it has to stretch some parts of the sequences (flat lines indicated by arrows). (c) Alignment using GCTW. GCTW efficiently aligns the sub-sequences and also finds the boundaries of the sub-sequences containing similar motions. (d) Key frames aligned by GCTW.


Fig. 13. Example of aligning multi-modal sequences. (a) Accelerometer. (b) Projection onto the first principal component for the motion capture data, video and accelerometers respectively. (c) GCTW. (d) Key frames aligned by GCTW. Notice that similar hand gestures have been aligned. From top to bottom, we show mocap data, video, and accelerometer data respectively.

A visual comparison between mDTW and GCTW is illustrated in Fig. 12. Without any manual cropping, most conventional DTW-based methods, such as mDTW, aligned the sequences by matching the first and the last frames, which results in incorrect alignments (see Fig. 12b). Some parts (noted by arrows) of the sequences with fewer cycles have to be stretched into flat lines in order to match the other sequences with more cycles. Unlike conventional DTW-based methods built on dynamic programming, GCTW uses the Gauss-Newton method, which allows for a more flexible time warping. By incorporating a constant function in the set of bases, GCTW can naturally be generalized to deal with the sub-sequence alignment problem across multiple sequences. As shown in Figs. 12c and 12d, GCTW is not only able to align the sequences in time, but also to locate the boundaries of the sub-sequences that contain similar motions. This experiment demonstrates the benefits of GCTW in controlling the warping path.
5.9 Aligning Multi-Modal Sequences
The last experiment evaluates CTW and GCTW in the task of aligning multi-modal sequences recorded with different sensors. We selected one motion capture sequence from the CMU motion capture database, one video sequence from the Weizmann database [42], and we collected an accelerometer signal of a subject performing jumping jacks. Some instances of the multi-modal data can be seen in Fig. 13d. Note that, to make the problem more challenging, the two subjects in the mocap (top row) and video (middle row) sequences are performing the same activity, but in the accelerometer sequence (bottom row) the subject only moves one hand and not the legs. Even in this challenging scenario, GCTW is able to solve for the temporal correspondence that maximizes the correlation between the signals. For the mocap data, we computed the quaternions for the 20 joints. In the case of the Weizmann dataset, we computed the Euclidean distance transform as described earlier. The X, Y and Z axis accelerometer data (40 Hz) was collected using an X6-2 mini USB accelerometer (Fig. 13a). Fig. 13b shows the first components of the three sequences projected separately by PCA. As shown in Fig. 13c, GCTW found an accurate temporal correspondence between the three sequences. Unfortunately, we do not have ground-truth for this experiment. However, visual inspection of the video suggests that the results are consistent with human labeling. Fig. 13d shows several frames that have been aligned by GCTW.
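For readers who want to reproduce the video features, a standard Euclidean distance transform of the binary silhouette can be computed with SciPy as sketched below; the downsampling resolution is our assumption, and the post-processing in the released code may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_feature(mask, out_shape=(32, 24)):
    """mask: binary silhouette image (H, W). Returns the Euclidean distance
    transform outside the silhouette, subsampled and flattened into a
    per-frame descriptor."""
    dist = distance_transform_edt(~mask.astype(bool))
    rows = np.linspace(0, mask.shape[0] - 1, out_shape[0]).astype(int)
    cols = np.linspace(0, mask.shape[1] - 1, out_shape[1]).astype(int)
    return dist[np.ix_(rows, cols)].ravel()
```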
6 CONCLUSIONS
This paper proposes CTW and GCTW, two new techniques for spatio-temporal alignment of multiple multi-modal time series. CTW extends DTW by adding a feature selection mechanism and enabling alignment of signals with different dimensionality. CTW extends CCA by adding temporal alignment and allowing temporally local projections. To improve the efficiency of CTW, allow a more flexible time warping, and align multiple sequences, GCTW extends CTW by parameterizing the warping path as a combination of monotonic functions. Inspired by existing work on image alignment, GCTW is optimized using coarse-to-fine Gauss-Newton updates, which allows for efficient alignment of long sequences.

Although CTW and GCTW have shown promising preliminary results, there are still unresolved issues. First, the Gauss-Newton algorithm used in GCTW for time warping converges poorly in areas where the objective function is non-smooth. Second, both CTW and GCTW are subject to local minima. The effect of local minima can be partially alleviated using a temporal coarse-to-fine approach, as in the case of image alignment. In future work, we also plan to explore better initialization strategies. Third, although the experiments have shown good results using manually designed bases, an obvious extension is to learn a set of monotonic bases from training data, so that the basis is adapted to a particular alignment problem. Finally, kernelization of CTW and GCTW might lead to improvements in alignment error.

ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation under Grant No. EEEC-0540865, RI-1116583 and CPS-0931999. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. F. Zhou is the corresponding author.

REFERENCES
[1] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[2] A. Bruderlin and L. Williams, "Motion signal processing," in Proc. ACM 22nd Annu. Conf. Comput. Graphics Interactive Techn., 1995, pp. 97–104.
[3] Y. Caspi and M. Irani, "Aligning non-overlapping sequences," Int. J. Comput. Vis., vol. 48, no. 1, pp. 39–51, 2002.
[4] J. Aach and G. M. Church, "Aligning gene expression time series with time warping algorithms," Bioinformatics, vol. 17, no. 6, pp. 495–508, 2001.
[5] T. B. Sebastian, P. N. Klein, and B. B. Kimia, "On aligning curves," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 1, pp. 116–125, Jan. 2003.


[6] F. Zhou, F. De la Torre, and J. K. Hodgins, "Hierarchical aligned cluster analysis for temporal clustering of human motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 582–596, Mar. 2013.
[7] J. M. Winters and Y. Wang, "Wearable sensors and tele-rehabilitation," IEEE Eng. Med. Biol. Mag., vol. 22, no. 3, pp. 56–65, May-Jun. 2003.
[8] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez, "View-independent action recognition from temporal self-similarities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 172–185, Jan. 2011.
[9] E. Hsu, K. Pulli, and J. Popovic, "Style translation for human motion," ACM Trans. Graphics, vol. 24, no. 3, pp. 1082–1089, 2005.
[10] W. Pan and L. Torresani, "Unsupervised hierarchical modeling of locomotion styles," in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 785–792.
[11] A. Veeraraghavan, A. Srivastava, A. K. R. Chowdhury, and R. Chellappa, "Rate-invariant recognition of humans and their activities," IEEE Trans. Image Process., vol. 18, no. 6, pp. 1326–1339, Jun. 2009.
[12] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[13] F. Zhou and F. De la Torre, "Canonical time warping for alignment of human behavior," in Proc. Neural Inf. Process. Syst., 2009, pp. 2286–2294.
[14] F. Zhou and F. De la Torre, "Generalized time warping for alignment of human motion," in Proc. IEEE Conf. Comput. Vision Pattern Recog., 2012, pp. 1282–1289.
[15] M. Brand and A. Hertzmann, "Style machines," in Proc. ACM 27th Annu. Conf. Comput. Graphics Interactive Techn., 2000, pp. 183–192.
[16] A. Shapiro, Y. Cao, and P. Faloutsos, "Style components," in Proc. Graphics Interface, 2006, pp. 33–39.
[17] C. Rao, A. Gritai, M. Shah, and T. Fathima, "View-invariant alignment and matching of video sequences," in Proc. IEEE Int. Conf. Comput. Vis., 2003, pp. 939–945.
[18] D. Gong and G. G. Medioni, "Dynamic manifold warping for view invariant action recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 571–578.
[19] M. A. Nicolaou, V. Pavlovic, and M. Pantic, "Dynamic probabilistic CCA for analysis of affective behavior and fusion of continuous annotations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1299–1311, Jul. 2014.
[20] E. J. Keogh and M. J. Pazzani, "Derivative dynamic time warping," in Proc. SIAM Int. Conf. Data Mining, 2001, pp. 5–7.
[21] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili, "Multiple alignment of continuous time series," in Proc. Neural Inf. Process. Syst., 2005, pp. 817–824.
[22] J. Ham, D. Lee, and L. Saul, "Semisupervised alignment of manifolds," in Proc. Int. Conf. Artif. Intell. Statist., 2005, pp. 120–127.
[23] C. Wang and S. Mahadevan, "Manifold alignment using Procrustes analysis," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1120–1127.
[24] T. W. Anderson, An Introduction to Multivariate Statistical Analysis. Hoboken, NJ, USA: Wiley, 2003.
[25] F. De la Torre, "A least-squares framework for component analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1041–1055, Jun. 2012.
[26] T. K. Kim and R. Cipolla, "Canonical correlation analysis of video volume tensors for action categorization and detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 8, pp. 1415–1428, Aug. 2009.
[27] C. C. Loy, T. Xiang, and S. Gong, "Time-delayed correlation analysis for multi-camera activity understanding," Int. J. Comput. Vis., vol. 90, no. 1, pp. 106–129, 2010.
[28] B. Fischer, V. Roth, and J. Buhmann, "Time-series alignment by non-negative multiple generalized canonical correlation analysis," BMC Bioinformatics, vol. 8, no. 10, pp. 1–10, 2007.
[29] S. Shariat and V. Pavlovic, "Isotonic CCA for sequence alignment and activity recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 2572–2578.
[30] D. P. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA, USA: Athena Scientific, 1995.
[31] S. Chu, E. Keogh, D. Hart, and M. Pazzani, "Iterative deepening dynamic time warping for time series," in Proc. SIAM Int. Conf. Data Mining, 2002, pp. 195–212.
[32] A. Heloir, N. Courty, S. Gibet, and F. Multon, "Temporal alignment of communicative gesture sequences," J. Vis. Comput. Animation, vol. 17, no. 3-4, pp. 347–357, 2006.
[33] S. Salvador and P. Chan, "Toward accurate dynamic time warping in linear time and space," Intell. Data Anal., vol. 11, no. 5, pp. 561–580, 2007.
[34] T. Rakthanmanon, B. J. L. Campana, A. Mueen, G. E. A. P. A. Batista, M. B. Westover, Q. Zhu, J. Zakaria, and E. J. Keogh, "Searching and mining trillions of time series subsequences under dynamic time warping," in Proc. 18th ACM Conf. Knowl. Discovery Data Mining, 2012, pp. 262–270.
[35] M. A. Hasan, "On multi-set canonical correlation analysis," in Proc. Int. Joint Conf. Neural Netw., 2009, pp. 1128–1133.
[36] J. O. Ramsay and B. W. Silverman, Functional Data Analysis, 2nd ed. New York, NY, USA: Springer, 2005.
[37] J. O. Ramsay, "Estimating smooth monotone functions," J. Roy. Statist. Soc.: Series B Statist. Methodol., vol. 60, pp. 365–375, 1998.
[38] T. Robertson, F. T. Wright, and R. L. Dykstra, Order Restricted Statistical Inference. New York, NY, USA: Wiley, 1988.
[39] I. W. Wright and E. J. Wegman, "Isotonic, convex and related splines," Ann. Statist., vol. 8, no. 5, pp. 1023–1035, 1980.
[40] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. Int. Joint Conf. Artif. Intell., 1981, pp. 674–679.
[41] I. L. Dryden and K. V. Mardia, Statistical Shape Analysis. New York, NY, USA: Wiley, 1998.
[42] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2247–2253, Dec. 2007.
[43] C. R. Maurer, R. Qi, and V. V. Raghavan, "A linear time algorithm for computing exact Euclidean distance transforms of binary images in arbitrary dimensions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 2, pp. 265–270, Feb. 2003.
[44] L. Gorelick, M. Galun, E. Sharon, R. Basri, and A. Brandt, "Shape representation and classification using the Poisson equation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 1991–2005, Dec. 2006.
[45] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hodgins, and N. S. Pollard, "Segmenting motion capture data into distinct behaviors," in Proc. Graphics Interface, 2004, pp. 185–194.
[46] M. S. Bartlett, G. C. Littlewort, M. G. Frank, C. Lainscsek, I. Fasel, and J. R. Movellan, "Automatic recognition of facial actions in spontaneous expressions," J. Multimedia, vol. 1, no. 6, pp. 22–35, 2006.
[47] I. Matthews and S. Baker, "Active appearance models revisited," Int. J. Comput. Vis., vol. 60, no. 2, pp. 135–164, 2004.
[48] F. De la Torre, J. K. Hodgins, J. Montano, and S. Valcarcel, "Detailed human data acquisition of kitchen activities: The CMU-multimodal activity database (CMU-MMAC)," Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. RI-TR-08-22, 2009.
[49] P. Tormene, T. Giorgino, S. Quaglini, and M. Stefanelli, "Matching incomplete time series with dynamic time warping: An algorithm and an application to post-stroke rehabilitation," Artif. Intell. Med., vol. 45, no. 1, pp. 11–34, 2009.
[50] W.-S. Chu, F. Zhou, and F. De la Torre, "Unsupervised temporal commonality discovery," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 373–387.
[51] Carnegie Mellon University Motion Capture Database. (2015). [Online]. Available: http://mocap.cs.cmu.edu


Feng Zhou received the BS degree in computer science from Zhejiang University in 2005, the MS degree in computer science from Shanghai Jiao Tong University in 2008, and the PhD degree in robotics from Carnegie Mellon University in 2014. He is now a researcher in the Media Analytics group at NEC Laboratories America. His research interests include machine learning and computer vision.

Fernando de la Torre received the BSc degree in telecommunications, as well as the MSc and PhD degrees in electronic engineering from the La Salle School of Engineering at Ramon Llull University, Barcelona, Spain, in 1994, 1996, and 2002, respectively. He is an associate research professor in the Robotics Institute at Carnegie Mellon University. His research interests are in the fields of computer vision and machine learning. Currently, he is directing the Component Analysis Laboratory (http://ca.cs.cmu.edu) and the Human Sensing Laboratory (http://humansensing.cs.cmu.edu) at Carnegie Mellon University. He has more than 150 publications in refereed journals and conferences. He has organized and co-organized several workshops and has given tutorials at international conferences
on the use and extensions of Component Analysis.
