Generalized Canonical Time Warping
Generalized Canonical Time Warping
Generalized Canonical Time Warping
Abstract—Temporal alignment of human motion has been of recent interest due to its applications in animation, tele-rehabilitation
and activity recognition. This paper presents generalized canonical time warping (GCTW), an extension of dynamic time warping
(DTW) and canonical correlation analysis (CCA) for temporally aligning multi-modal sequences from multiple subjects performing
similar activities. GCTW extends previous work on DTW and CCA in several ways: (1) it combines CCA with DTW to align
multi-modal data (e.g., video and motion capture data); (2) it extends DTW by using a linear combination of monotonic functions
to represent the warping path, providing a more flexible temporal warp. Unlike exact DTW, which has quadratic complexity,
we propose a linear time algorithm to minimize GCTW. (3) GCTW allows simultaneous alignment of multiple sequences.
Experimental results on aligning multi-modal data, facial expressions, motion capture data and video illustrate the benefits
of GCTW. The code is available at http://humansensing.cs.cmu.edu/ctw.
Index Terms—Multi-modal sequence alignment, canonical correlation analysis, dynamic time warping
1 INTRODUCTION
Fig. 1. Temporal alignment of sequences recorded with different sensors where Vx 2 Rdx d and Vy 2 Rdy d denote the low-dimen-
(from top to bottom: video, motion capture and accelerometers) of three sional embeddings (d minðdx ; dy Þ) for X and Y respec-
subjects kicking a ball. tively. fðÞ is a regularization function that penalizes the
high-frequency components of the embedding matrices:
into visually meaningful components called style compo-
nents. In comparison, our method solves a more general fðVÞ ¼ kVk2F ; (2)
problem of aligning human motion from multi-modal 1
time series.
In the context of computer vision, temporal alignment In order to avoid the trivial solution of VTx X and VTy Y
of video captured with different cameras and view being zero, CCA decorrelates the canonical variates (col-
points has been a topic of interest. Rao et al. [17] aligned umns of VTx X and VTy Y) by imposing the orthogonal con-
trajectories of two moving points using constraints from
straint as:
the fundamental matrix. Junejo et al. [8] adopted DTW
n
for synchronizing human actions with view changes
based on a view-invariant description. In comparison, F ¼ fVx ; Vy g VTx ð1 ÞXXT þ I Vx ¼ I;
o (3)
our method simultaneously estimates the optimal spatial
VTy ð1 ÞYYT þ I Vy ¼ I :
transformation and temporal correspondence to align
video sequences. Recently, Gong and Medioni [18]
where 2 ½0; 1 is a weight to trade-off between the least-
extended CTW approach to incorporate more complex
squared error and the regularization terms. In the case of
spatial transformations through manifold learning.
insufficient samples where the covariance matrices (XXT or
Nicolaou et al. [19] proposed a probabilistic extension of
CTW for fusing multiple continuous expert annotations YYT ) are singular, a positive regularization term I is nec-
in tasks related to affective behavior. These works show essary to avoid over-fitting and for numerical stability. Opti-
the flexibility of our framework for aligning various mizing (1) has a closed-form solution in terms of a
types of time series. generalized eigenvalue problem, i.e., ½Vx ; Vy ¼ eigd ðA; BÞ,
In the field of data mining, there have been several where:
extensions of DTW to align time series that differ in the
0 XYT XXT 0
temporal and spatial domain. Keogh and Pazzani [20], A¼ ; B ¼ ð1 Þ þ I:
for example, used derivatives of the original signal to YXT 0 0 YYT
improve alignment with DTW. Listgarten et al. [21] pro- See [25] for a unification of several component analysis
posed continuous profile models, a probabilistic method methods and a review of numerical techniques to efficiently
for simultaneously aligning and normalizing sets of time solve generalized eigenvalue problems.
series in bio-informatics. Unlike these works, which were In computer vision, CCA has been used for matching sets
originally designed for aligning 1-D time series, our work of images in problems such as activity recognition from
addresses the more challenging problem of aligning video [26] and activity correlation from cameras [27].
multi-modal and multi-dimensional time series. Recently, Fisher et al. [28] proposed an extension of CCA
In the literature of manifold alignment, Ham et al. [22] with parameterized warping functions to align protein
aligned manifolds of images in a semi-supervised manner. expressions. The learned warping function is a linear combi-
The prior knowledge of pairwise correspondences between nation of hyperbolic tangent functions with non-negative
two sets was used to guide the graph embedding. Wang
and Mahadevan [23] aligned manifolds based on an exten-
1. Bold capital letters denote a matrix X, bold lower-case letters a
sion of the Procrustes analysis (PA). The main benefit of this column vector x. xi and xðiÞ represent the ith column and ith row of the
approach is that PA learns a mapping generalized for out- matrix X respectively. xij denotes the scalar in the ith row and jth col-
of-sample cases. However, these models lack a mechanism umn of the matrix X. All non-bold letters represent scalars.
1mn ; 0mn 2 Rmn are matrices of ones and zeros. In 2 Rnn isqanffiffiffiffiffiffiffiffiffiffiffiffiffiffi
iden-
to enforce temporal continuity. pP ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P 2ffi
tity matrix. kxkp ¼ p i jxi jp denotes the p-norm. kXkF ¼ ij xij
represents the Frobenious norm. vecðXÞ denotes the vectorization of
2.2 Canonical Correlation Analysis (CCA) matrix X. X Y is the Hadamard product. ½X1 ; ; Xn denotes the
vertical concatenation of sub-matrices Xi . eigd ðA; BÞ denotes the top d
CCA [24] is a technique to extract common features from a eigenvectors V that solve the generalized eigenvalue problem AV =
pair of multi-variate data. Given two sets of n variables (see BVL.
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 281
J
pxt ; pyt ¼ miny kxpxt ypy k22 þ J
pxtþ1 ; pytþ1 ;
pðpxt ;pt Þ t
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
282 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016
" #
WðpÞ : f1 : ngl ! f0; 1gnl ; (7) 0 XWx WTy YT
A¼ ;
The value of the elements of W is always 0 or 1. wpt ;t is 1 for YWy WTx XT 0
any step t 2 f1 : lg, and zero otherwise. The matrix Wx X " #
XWx WTx XT 0
has the effect to replicate (possibly multiple times) samples B ¼ ð1 Þ þ I:
of the X. Similarly for Wy . Fig. 2d shows that the DTW 0 YWy WTy YT
alignment in Fig. 2a can be equivalently interpreted as
stretching the two time series X and Y by multiplying them
with the warping matrices Wx and Wy , respectively as In most experiments, we initialized CTW by setting Wx
and Wy a uniform warping that aligns the sequences. In
shown in Figs. 2e and 2f. Note that (6) is very similar to
CCA’s objective (1). CCA applies a linear transformation to this case, the warping paths are computed as
combine the rows (features), while DTW applies binary
px ¼ roundðlinspaceð1; nx ; lÞÞ0 ; (9)
transformations to replicate the columns (time). It is impor-
tant to notice that reformulating DTW as a least-squared where roundðÞ and linspaceðÞ are MATLAB functions. But
optimization function allows easy generalizations, and this CTW can be initialized by any other time warping method
is one of the important contributions of this paper. such as DTW or iterative motion warping [9]. The dimen-
sion d can be selected to preserve a certain amount (e.g.,
3.2 Objective Function of CTW 90 percent) of the total correlation. Once the spatial transfor-
In order to accommodate for differences in style and subject mation is computed, the temporal alignment is computed
variability, add a feature selection mechanism, and reduce using standard approaches for DTW. Alternating between
the dimensionality of the signals, CTW adds a linear trans- these two steps (spatial and temporal alignment) monotoni-
formation (Vx 2 Rdx d and Vy 2 Rdy d ) to the least-squared cally decreases Jctw . Jctw is bounded below, so the proposed
form of DTW (6), as in the case of CCA. Moreover, this algorithm will converge to a critical point.
transformation allows alignment of temporal signals with
different dimensionality (e.g., video and motion capture). In 4 GENERALIZED CANONICAL TIME WARPING
a nutshell, CTW combines DTW and CCA by minimizing:
(GCTW)
2
min Jctw ¼ VTx XWx VTy YWy F In the previous section, we described CTW, a technique
fVx ;Vy g2F;fpx ;py g2C that is able to align two multi-modal sequences with dif-
(8)
þ fðVx Þ þ fðVy Þ; ferent features. However, CTW has three main limitations
inherited from DTW: (1) The exact computational com-
where Vx 2 Rdx d and Vy 2 Rdy d parameterize the spatial plexity of DTW for multi-dimensional sequences is qua-
transformation and project the sequences into the same dratic both in space and time; (2) CTW and its extensions
low-dimensional coordinate system. Constrained by (5), address the problem of aligning two sequences, but it is
Wx and Wy warp the signal in time to maximize the tempo- unclear how to extend it to the alignment of multiple
ral correlation. Similar to CCA, fðÞ is a regularization term sequences; (3) The temporal alignment is computed using
(2) for Vx and Vy . In addition, the projections satisfy the DTW, which relies on DP to find the optimal path. In
constraints, some problems (e.g., sub-sequence alignment) the warp-
n ing path provided by DP is too rigid (e.g., the first and
F ¼ fVx ; Vy g VTx ð1 ÞXWx WTx XT þ I Vx ¼ I; the last samples have to match).
o To address these issues, this section proposes GCTW, an
VTy ð1 ÞYWy WTy YT þ I Vy ¼ I ; efficient technique for spatio-temporal alignment of multi-
ple time series. To accommodate for subject variability and
where 2 ½0; 1 is to trade-off between the least-squared to take into account the difference in the dimensionality of
error and the regularization term. the signals, GCTW uses multi-set CCA (mCCA). To
Equation (8) is the main contribution of this paper. CTW compensate for temporal changes, GCTW extends DTW by
is a clean extension of CCA and DTW to align two signals in incorporating a more efficient and flexible temporal warp-
space and time. It extends previous work on CCA by adding ing parameterized by a set of monotonic basis functions.
temporal alignment and on DTW by allowing a feature Unlike existing approaches based on DP with quadratic
selection and dimensionality reduction mechanism for complexity, GCTW efficiently optimizes the time warping
aligning signals of different dimensions. function using a Gauss-Newton algorithm, which has linear
complexity in the length of the sequence.
3.3 Optimization of CTW
Optimizing Jctw is a non-convex optimization problem with 4.1 Objective Function of GCTW
respect to the warping matrices and projection matrices. We Given a collection of m time series, fXi gm i¼1 , GCTW aims to
take a coordinate-descent approach that alternates between seek for each Xi ¼ ½xi1 ; . . . ; xini 2 Rdi ni , a low-dimensional
solving the temporal alignment using DTW, and computing
spatial embedding Vi 2 Rdi d and a non-linear temporal
the spatial projections using CCA.
Given the warping matrices, the optimal projection transformation Wi ¼ Wðpi Þ 2 f0; 1gni l parameterized by
matrices are the leading d generalized eigenvectors, i.e., pi 2 f1 : ni gl , such that the resulting sequence VTi Xi Wi 2
½Vx ; Vy ¼ eigd ðA; BÞ, where: Rdl is well aligned with the others in the least-squared
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 283
sense. In a nutshell, GCTW minimizes the sum of pairwise Fig. 4. An example of using Gauss-Newton for solving the sub-sequence
alignment problem. (a) Two 1-D sequences. (b) The contour of the
distances: objective function (Ja as defined in (13)) with respect to the weights of
two bases. (c) The Gauss-Newton optimization procedure, the longer
Xm X m
1 red sequence is warped to match the shorter blue sequence. (d) Warp-
min Jgctw ¼ VT Xi Wi VT Xj Wj 2 ing function (p) as a combination of a linear function (q1 ) and a constant
i j F
fVi gi 2F;fpi gi 2C
i¼1 j¼1
2 function (q2 ) used for scaling and translation respectively.
X
m
þ ðfðVi Þ þ cðpi ÞÞ;
i¼1 (axb ), (2) exponential (expðax þ bÞ), (3) logarithm (logðaxþ bÞ),
(10) (4) hyperbolic tangent (tanhðax þ bÞ) and (5) I-spline [36].
Similar work by Fisher et al. [28] also used hyperbolic tangent
where fðVi Þ ¼ m=ð1 ÞkVi k2F is the regularization func- functions as temporal bases, and the weights were optimized
tion penalizing the irregularity of the spatial transformation using a non-negative least squares algorithm. However,
Vi . 2 ½0; 1 is a trade-off parameter between the least- GCTW differs in three aspects: (1) GCTW allows aligning
squared error and the regularization term. Following the multi-dimensional time series that have different features,
multi-set CCA (mCCA) [35], GCTW constrains the spatial while [28] can only align one-dimensional time-series;
embeddings as: (2) GCTW uses a more efficient eigen decomposition to solve
( ) mCCA and quadratic programming for optimizing the
X
m
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
284 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016
P
path using a temporal regularization term, lt¼1 krqðtÞ ak22 A direct optimization of Ja is difficult due to the non-lin-
kFl Qak22 where Fl 2 Rll is the 1st order differential ear function WðÞ. Inspired by the Lucas-Kanade frame-
operator. work for image alignment [40], we approximate the
In summary, we constrain the warping path in (10) by temporal alignment problem using a Gauss-Newton
adding the following constraints on a, method. More specifically, GN iteratively updates the
weights a ^i ai þ di by minimizing a series of first-order
cðaÞ ¼ hkFl Qak22 ; C ¼ fa j La bg; Taylor approximations of Ja centered at each term Zðai Þ
ð1Þ ðlÞ
(12) given the initial ai , where di 2 Rk denotes the increment of
where L ¼ ½Ik ; q ; q and b ¼ ½0k ; 1; n:
the weight ai .
To better understand the approximation, let us first focus
Therefore, given a basis set of k monotone functions, all fea-
on, zt ðai Þ 2 Rd , the tth column of Zðai Þ ¼ ½z1 ðai Þ; . . . ; zl ðai Þ,
sible weights belong to a polyhedron in Rk parameterized
which can be rewritten as,
by L 2 Rðkþ2Þk and b 2 Rkþ2 . For instance, Fig. 3d illus-
trates an example of a warping function (solid line) as a
combination of three monotone functions (dotted lines). zt ðai Þ ¼ VTi Xi WðQai Þ t ¼ VTi Xi ½WðQai Þt ; (14)
vecðZðai þ di ÞÞ vi þ Gi di ;
These steps monotonically decrease Jgctw , and because 2 T
3
the function is bounded below, the alternating scheme will r Vi Xi jqð1Þ ai qð1Þ
6 7 (16)
converge to a critical point. 6 .. 7
where vi ¼ vecðZðai ÞÞ; Gi ¼ 6 . 7:
4
5
ðlÞ
4.3 Optimization of GCTW w.r.t. Temporal Weights r Vi Xi jqðlÞ ai q
T
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 285
Fig. 5. Comparison between Gauss-Newton and variants of DTW for temporal alignment. (a) An example of two 1-D time series and the alignment
results calculated using DTW, DTW constrained in the Sakoe-Chiba band (DTW-SC) and the Itakura Parallelogram band (DTW-IP), DTW optimized
in a multi-level scheme (FastDTW) and Gauss-Newton (GN). (b) Comparison of different warping paths. GN-Init denotes the initial warping used for
GN. SC-Bound and IP-Bound denote the boundaries of SC band and IP band respectively. (c) Alignment errors. (d) Computational costs.
whose components are computed as follows: However, using a narrow band (a small b) cuts off potential
2 3 warping space, and may lead to a sub-optimal solution. For
mGT1 G1 þ hQT FT1 F1 Q GT1 Gm instance, Fig. 5a shows an example of two 1-D time series
6 7
H¼6
4
..
.
..
.
..
.
7;
5
and the alignment results calculated by different algo-
rithms. The results computed by DTW-SC and DTW-IP are
Gm G1
T
mGm Gm þ hQ Fm Fm Q
T T T
less accurate than the ones computed using our proposed
2 P 3
GT1 ðmv1 i vi Þ þ hQT FT1 F1 Qa1 Gauss-Newton. This is because both the SC and IP bands
6 7
f ¼6
4
..
.
7;
5
are over-constrained (Fig. 5b). Alternatively, instead of
P constraining the warping path, exhaustive DTW search can
G ðmvm i vi Þ þ hQ Fm Fm Qam
T T T
be approximated in a multi-level scheme. For instance,
2 m3 2 3 2 3
a1 b1 L1 0 Salvador and Chan [33] introduced FastDTW by recursively
6 . 7 6 . 7 6 . . .. 7
a¼6 7 6 7 6
4 .. 5; b ¼ 4 .. 5; L ¼ 4 .. .. 7
. 5:
projecting a solution from a coarser resolution and refining
the projected solution in a higher resolution. Although the
am bm 0 Lm
coarse-to-fine framework could largely reduce the
search space, the solution is not exact. See Fig. 6 for a
Note that the objective function of (17) is convex. Fig. 4 detailed comparison.
illustrates an example of aligning two 1-D time series To provide a quantitative evaluation, we synthetically
(Fig. 4a) using this approach. To achieve sub-sequence generated 1-D sequences at 15 scales. For DTW-SC, we
alignment, we model the time warping path p as a combina- set the bandwidth to be b ¼ 0:1, which is common in
tion of a linear basis q1 and a constant one q2 (Fig. 4d). As practical applications. For FastDTW, we recursively
shown in Fig. 4c, Gauss-Newton takes three steps to find shrink the sequence length in half from the finest level
the optimal warping parameter in a 2-D space (Fig. 4b). to the coarsest one. We then propagated the DTW solu-
In most experiments, we initialized ai by uniformly tion from coarse to fine with radius r ¼ 5. For GN, we
aligning the sequences (see GN-Init curve in Fig. 5b). How- varied k to be 5; 8; 12 to investigate the effect of the num-
ever, better results can be achieved by using a more sophis- ber of bases. For each scale, we randomly generated 100
ticated initialization method. The length of the warping pairs of sequences. The error was computed using (19)
path l is usually set to be l ¼ 1:1maxm i¼1 ni . The computa- and shown in Figs. 5c and 5d. DTW obtains the lowest
tional complexity of the algorithm is Oðdlmk þ m3 k3 Þ. error but takes the most time to compute. This is because
DTW exhaustively searches the entire parameter space to
4.4 Comparison with Other DTW Techniques find the global optima. DTW-SC, DTW-IP and FastDTW
As discussed in [1], [33], there are various techniques that all need less time than DTW because they search a
have been proposed to accelerate DTW. For instance, the smaller space. Empirically, DTW-SC is the least accurate
Sakoe-Chiba band (DTW-SC) and the Itakura Parallelogram compared to DTW-IP and FastDTW in our synthetic
band (DTW-IP) reduce the complexity of the original DTW dataset. GN is most computationally efficient because it
algorithm to Oðbnx ny Þ by constraining the warping path to has linear complexity in the sequence length. Moreover,
be in a band of a certain shape, where b < 1 is the size ratio increasing the number of bases monotonically reduces
between the band and the original search space of DTW. the error.
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
286 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016
m X
X m
1
min Jmdtw ¼ kXi Wi Xj Wj k2F
fpi 2Cgi
i¼1 j¼1
2
m 2 (18)
X 1X m
¼m Xi Wi Xj Wj :
i¼1
m j¼1
F
Fig. 6. Comparison of temporal alignment algorithms as a function of where Fnx and Fny are the 1st order differential operators.
degrees-of-freedom and complexity. l is the length of warping path.
LSðnÞ and eigðnÞ denote the complexity of solving a least-squares of n
To align multiple sequences, multi-sequence DDTW
variables and a generalized eigenvalue problem with two n-by-n matri- (mDDTW) extends DDTW in the Procrustes framework
ces, respectively. similar to (18).
IMW and mIMW. Similar to CTW, iterative motion warp-
ing [9] alternates between time warping and spatial trans-
5 EXPERIMENTS formation to align two sequences. Assuming the same
This section compares CTW and GCTW against state-of-the- number of spatial features between X 2 Rdnx and
art methods for temporal alignment of time series in seven Y 2 Rdny , IMW translates and re-scales each feature in X
experiments. In the first experiment, we compared the per- independently to match with Y. Written in a simple matrix
formance of CTW and GCTW against DTW, DDTW [20], form, IMW minimizes:
and IMW [9] in the problem of aligning synthetic time series
of varying complexity. In the second experiment, we min Jimw ¼ kðX Ax þ Bx ÞWx Yk2F
px 2C;Ax ;Bx
aligned videos of different subjects performing a similar 2 2
activity; each video is represented using different types of þ a Ax FTnx F þ b Bx FTnx F ;
visual features. In the third experiment, we showed how
where Ax ; Bx 2 Rdnx are the scaling and translation
GCTW and CTW can be applied to provide a metric useful
parameters respectively. a and b are the weights for the
for activity recognition. In the fourth experiment, we
least-squared error and the regularization terms. The regu-
aligned facial expressions across subjects on videos with
larization terms are used to enforce a smooth change in the
naturally occurring facial behavior. In the fifth experiment,
columns of Ax and Bx . In the experiments, we set them to
we showed how GCTW can be applied to large-scale align-
be a ¼ b ¼ 1. IMW takes a coordinate-descent approach
ment. We aligned approximately 50;000 frames of motion
to optimize the time warping, scaling and translation. Given
capture data of two subjects cooking the same recipe. In the
the warping matrix Wx , the optimal spatial transformation
sixth experiment, we showed how GCTW can be used to
can be computed in closed-form. To align multiple sequen-
localize common sub-sequences between two time series.
ces, we extended IMW to multi-sequence IMW (mIMW) in
The last experiment shows how GCTW is able to align three
the Procrustes framework similar to mDTW (18).
sequences of different subjects performing a similar action
mCTW. CTW was originally proposed to align two multi-
recorded with different sensors (motion capture data, accel-
modal sequences. We extended CTW to multi-sequence
erometers and video). In the first four experiments, the
CTW (mCTW) for aligning multiple time series using the
ground truth was known and we provided quantitative
Procrustes analysis framework. mCTW optimizes the same
evaluation of the performance. In the others, we evaluated
objective (10) as GCTW does, however, the temporal
the quality of the alignment visually.
alignment step is different. mCTW alternates between
warping each time series using asymmetric DTW and
5.1 Evaluation Methods updating the mean sequence, while GCTW uses Gauss-
In the experiments, we compared CTW and GCTW with Newton for jointly optimizing over all weights of the bases.
several state-of-the-art methods for temporal alignment of Fig. 6 compares different temporal alignment methods
time series. Below, we provide a brief description of the as a function of the number of variables to optimize and
techniques that we used for comparison. their computational complexity. The comparison reports
DTW and mDTW. DTW is solved using a standard two cases: (1) alignment of two sequences (top of the
dynamic programming algorithm that minimizes (4). To table) and more than two sequences. Given two time
evaluate the performance of temporal alignment of multiple series, X 2 Rdnx and Y 2 Rdny , DTW and DDTW have
sequences, we extended the concept of Procrustes analysis the same complexity Oðnx ny Þ for finding the optimal
[41] to time series. That is, given m ð> 2Þ time series, multi- l-length warping path. IMW additionally solves d least-
sequence DTW (mDTW) seeks for a set of warping paths squared problems for each row of Ax and Bx indepen-
fpi gi that minimizes: dently. Similarly, CTW relies on DTW to optimize the
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 287
Fig. 7. Comparison of temporal alignment algorithms on the synthetic dataset. (a) An example of three synthetic time series generated by performing
a random spatio-temporal transformation of a 2-D latent sequence Z and adding Gaussian noise in the third dimension. (b) The alignment results. (c)
Mean and variance of the alignment errors. (d) Mean and variance of the computational cost (time in seconds).
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
288 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016
Fig. 8. Comparison of temporal alignment algorithms for aligning multi-feature video data. (a) An example of three aligned videos by GCTW. The left
three sequences are the original frames after background subtraction, while the right three are the binary images, the Euclidean distance transforms
and the solutions of the Poisson equation. (b) The alignment results. (c) Comparison of time warping paths. (d) Mean and variance of the alignment
errors.
and temporally transforming a latent 2-D spiral, generated time series. Both mDTW and mDDTW per-
Z 2 R2l ; l ¼ 300 as X ¼ ½UT ðZ þ b1T ÞM; eT 2 R3n , formed poorly in this case since they do not have a fea-
where U 2 R 22
and b 2 R2 were a randomly generated ture weight mechanism to adapt the spatial
projection matrix and translation vector respectively. To transformation of the sequences. mIMW warps sequences
synthesize the temporal distortion, a binary selection towards others by translating and re-scaling each frame
matrix M 2 f0; 1gln was generated by randomly choos- in eachP dimension. Moreover, mIMW has P more parame-
ing n l columns from the identity matrix Il . The third ters (2l i di ) than mCTW and GCTW (d i di ), and hence
spatial dimension e 2 Rn was added with a zero-mean mIMW is more prone to over-fitting. Furthermore, mIMW
Gaussian noise. In this experiment, mCTW and GCTW tries to fit the noisy dimension (third spatial component),
were compared with mDTW, mDDTW and mIMW for biasing alignment in time, whereas both mCTW and
aligning three time series. The ground-truth alignment GCTW had a feature selection mechanism that effectively
was known and the performance of each method was canceled out the third dimension. Among all initializa-
evaluated in terms of the alignment errors defined in tions, the uniform alignment (mCTW-U and GCTW-U)
(19). We repeated the above process 100 times with ran- and mIMW (mCTW-I and GCTW-I) achieved the best
dom numbers. In each trial, we studied three different results. It is important to notice that mCTW and GTW
initialization methods for mCTW and GCTW: uniform always improved the initial error, that is, mCTW-D and
alignment for mCTW-U and GCTW-U as in (9), mDTW mCTW-I obtained lower errors than their initializations
for mCTW-D and GCTW-D, and mIMW for mCTW-I and given by mDTW and mIMW respectively. Compared to
GCTW-I. The subspace dimensionality for CCA d was mCTW, GCTW achieved better performance when align-
selected to preserve 90 percent of the total correlation. In ing more than two sequences because GCTW jointly opti-
this case, we have sufficient samples (l ¼ 300) in 3-D mizes over all the possible time warpings for each time
space, and the regularization weight was set to zero. series, while mCTW takes a greedy approach by warping
For GCTW, we selected three hyperbolic tangent and each sequence towards the mean sequence independently.
three polynomial functions as monotonic bases (upper- Fig. 7d evaluates the computational cost of each
right corner in Fig. 7b). method with respect to the average sequence length.
Figs. 7 illustrates a comparison of the previously mIMW was the most computationally intensive method
described methods for aligning multiple time series. because it solves a least-squared problem for each feature
Fig. 7b shows the spatio-temporal warping estimated by dimension. mCTW was more expensive than DTW-based
mDTW, mDDTW, mIMW, mCTW-U and GCTW-U. methods because of the additional computation to solve
Fig. 7c shows the alignment errors (19) for 100 randomly CCA. As expected, GCTW was the most efficient.
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 289
5.5 Activity Classification where the alignment step lalg was used to normalize the
This section explores the use of CTW and GCTW as a (dis) (dis)similarity. Given the (dis)similarity computed in (20),
similarity measure between videos with different features, classification for a test sequence was done by finding the
and between videos and motion capture data. This (dis)sim- closest video in the training sequence. The overall classifica-
ilarity is used in combination with a nearest neighbor tion error was averaged over all testing sequences.
classifier. Figs. 9a and 9b display the 12-by-12 (dis)similarity matri-
Given a training set of videos of a subject performing an ces computed by DTW and GCTW respectively. Each
activity recorded with different visual features fXi gi , our element in the matrices encodes the (dis)similarity between
goal is to find the testing video fYj gj that contains similar a training sequence (row) and a testing sequence (column).
activity. We took 24 sequences from the Weizmann dataset We re-ordered the rows and columns of the matrices so that
[42]: eight people performed three actions (walking, run- the sequences containing the same activities were grouped
ning and jumping). We repeated our experiments 10 times. in consecutive rows and columns. We then divided the
In each trial, we randomly split the 24 sequences into two matrix into nine 4-by-4 blocks (yellow lines), where the
disjoint sets of 12 sequences used for training and testing block in the ith row and jth column contains the (dis)simi-
respectively. The training sequences fXi gi were represented larity between the training sequences of the ith action and
using the binary silhouette (see left columns of Figs. 9a and the testing data of the jth action, where i; j 2 fwalk;
9b as examples) while the testing ones fYj gj were encoded run; jumpg. Darker color denotes smaller (dis)similarity.
with the Euclidean distance features (see top rows of Ideally, if the (dis)similarity would be able to capture per-
Figs. 9a and 9b as examples). Given a pair of sequences, fectly the activity, the matrix will be perfect with zeros
(black color) in the diagonal blocks and higher values (white
Xi 2 Rdx nxi and Yj 2 Rdy nyj , we computed the (dis)simi-
color) elsewhere.
larity between videos using different methods: DTW,
From Fig. 9a, we can observe that the DTW (dis)similar-
DDTW, IMW, CTW and GCTW. The warping path for each
nxi l nyj l ity values in the first and last row of blocks were smaller
method is denoted as Walg xi 2 R yj 2 R
and Walg . than the blocks in the middle row. This is because, DTW
In order to provide a fair comparison with DTW, DDTW does not have a feature adaptation mechanism and fails to
and mIMW (that cannot compute (dis)similarity between provide semantic similarity of videos. In comparison,
different features), we computed a pair of projection Fig. 9b shows that GCTW captured better the (dis)similarity
matrices, Vx 2 Rdx d and Vy 2 Rdy d for them. To do that, between actions. Fig. 9c shows the nearest-neighbor classifi-
we took a subset of 1=3 of the training sequences encoded cation error using different (dis)similarity. Overall, CTW
with both features fXtri g and fYi g, and uniformaly aligned
tr
and GCTW achieved lower errors than others due to their
each pair of sequences of the same label. The aligned frames feature selection in aligning videos with different features.
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
290 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016
Fig. 10. Comparison of algorithms for aligning facial expression across different subjects. (a) An example of two smiling expression sequences
aligned by GCTW. The features of the two sequences are computed as the 18 landmark coordinates of the mouth given by a face tracker. (b) Align-
ment results. The position that corresponds to the peak of the expression is indicated by points on the curves in the top row. (c) Comparison of time
warping paths. The position that corresponds to the peak of the expression is indicated by the intersection of the dashed lines. (d) Alignment errors.
In the second example, we recognized actions from the face. For the alignment of AU12, we used only the 18
motion capture sequences fYj gj given a training set con- landmarks that correspond to the outline of the mouth. See
taining videos with different visual features fXi gi . From the Fig. 10a for example frames where the mouth outlines are
Weizmann dataset [42] we selected 24 videos containing plotted.
three actions (walking, running and jumping). From the The performance of CTW and GCTW were compared
CMU motion capture database we selected 30 sequences of with DTW, DDTW and IMW. We initialized IMW, CTW
subjects performing the same three actions (walking, run- and GCTW using the same uniform warping. Fig. 10b
ning and jumping). For the mocap data Y, we extracted shows the alignment result obtained by different methods,
quaternions on 20 joints [45] and use them as features. Each where the three dimensions correspond to the first three
video frame X was encoded with a binary silhouette fea- principal components of the original signals. As an approxi-
ture. The experimental setting was similar to previous mate ground-truth, the position of the peak frame of each
experiments. We divided all sequences into two disjoint AU12 event is indicated as the red and blue points on the
sets used for training and testing respectively. Each test curves in Fig. 10b and the intersection of the two dashed
motion capture sequence was classified with the label of the lines in Fig. 10c. As we can see from Figs. 10b and 10c, the
video sequence training sequence X with smallest (dis)simi- two peaks in the low-dimensional projection found by CTW
larity (20). As shown in Fig. 9f, GCTW achieved the lowest and GCTW are closer to the manually labeled peak than the
classification error compared to other methods. This was ones in the original space used for DTW and DDTW.
because our GCTW (dis)similarity captures between multi- Finally, the distance between the peak point and the warp-
modal similarity between actions in videos. ing path is computed to quantitatively measure the perfor-
mance. Fig. 10d shows the average error as the distance
5.6 Aligning Facial Expression Sequences normalized by the sequence lengths over 20 random repeti-
In the fourth experiment, we compared CTW and GCTW in tions, where CTW and GCTW achieved better performance.
the task of aligning unscripted facial expression sequences.
The facial videos were taken from the RU-FACS database 5.7 Aligning Large-Scale Motion Capture
[46], which contains digitized videos of 29 young adults. Sequences
The action units (AUs) in this database have been manually This experiment illustrates the benefits of using GCTW for
coded, and we randomly cropped video segments contain- aligning two large-scale motion capture sequences. The two
ing AU12 (smiling) for our experiments. Each event of sequences were taken from the CMU multimodal activity
AU12 is coded at its peak position. We used a person-spe- dataset [48], which contains multi-sensor recordings (video,
cific active appearance model [47] to track 66 landmarks on audio, motion capture data and accelerometers) of
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 291
Fig. 11. Aligning two large-scale motion capture sequences using GCTW. (a) A coarse-to-fine strategy for improving the optimization performance of
GCTW. (b) The first row shows the first principal components of the original sequences for three levels of the temporal pyramids. The second row
corresponds to the aligned sequences using GCTW. (c) Key frames of similar body poses aligned by GCTW.
naturalistic behavior of 40 subjects cooking five different the aligned key frames in Fig. 11c. Although the two sub-
recipes. The two motion capture sequences used in this jects spent different amounts of time and followed different
experiment contain 44;387 and 48;724 frames respectively procedures in cooking, GCTW was able to align similar
from two subjects cooking brownies. See Fig. 11c for several right hand actions.
key-frames of these two sequences. For each motion capture
frame, we computed quaternions in four joints on the right 5.8 Detection and Alignment of Similar
hand, resulting in a 12-D feature vector. In this experiment, Sub-Sequences
we only optimized the temporal component of GCTW, not A major problem of DTW-type techniques is that they
the spatial component. We used five polynomial and five require an exact matching between the first frame and the
tanhðÞ functions as monotonic bases for the time warping last one, see (5). These boundary conditions are impractical
function. and very restrictive when only a subset of the input
To avoid local minima in the alignment, we used a tem- sequence is similar to the sequence to be aligned. Fig. 12
poral coarse-to-fine strategy for the Gauss-Newton optimi- illustrates this problem: how can we align four motion cap-
zation in GCTW. As shown in Fig. 11a, the coarse-to-fine ture signals composed by different walking cycles? This
strategy proceeds in two steps: (1) In the pre-processing problem is related to sub-sequence DTW [49] and temporal
step, we obtained a three-level pyramid for each time series commonality discovery [50]. A major limitation of these
by recursively applying Gaussian smoothing with s ¼ 200. methods is its inability to handle multiple sequences. This
For instance, the first row of Fig. 11b illustrates the two experiment shows how can we use GCTW for multiple sub-
sequences in three levels, where the ones in the first level sequence alignment.
correspond to the original signals, while the ones in the We selected four walking sequences from the CMU
third level contain less detailed but much smoother signals. motion capture database [51]. For each motion capture
(2) In the optimization step, GCTW was first used to align frame, we computed the quaternions for 14 joints on the
the two sequences on the third level instead of the first level. body, resulting in a 42-D feature vector that describes the
The computed time warping result was then used to human pose. Fig. 12a illustrates the first three principal com-
initialize GCTW on the second level. We repeated the same ponents of the walking sequences. To allow for sub-sequence
procedure to compute the final time warping result of the alignment, the warping path in GCTW is represented by a
original sequences on the first level. combination of a constant function and a linear one as the
For this large-scale example, DTW was slow and compu- monotonic bases (see upper-left corner of Fig. 12c). Both
tationally expensive. However, GCTW was able to effi- GCTW and the baseline mDTW method are initialized by
ciently find the temporal correspondence between the uniformly aligning the sequences.
sequences in just a few seconds using Matlab on a regular A visual comparison between mDTW and GCTW is illus-
laptop with a 2.5 GHz Intel CPU. Since the ground-truth is trated in Fig. 12. Without any manual cropping, most of the
unknown, we qualitatively evaluated GCTW by showing conventional DTW-based methods, such as mDTW, aligned
Fig. 12. Aligning similar sub-sequences of four walking motion capture signals. (a) Original features of four mocap walking sequences. (b) Alignment
achieved by mDTW. mDTW alings the sequences end-to-end and hence it has to stretch some parts of the sequences (flat lines indicated by arrows).
(c) Alignment using GCTW. GCTW efficiently aligns the sub-sequences and also finds the boundaries of the sub-sequences containing similar
motions. (d) Key frames aligned by GCTW.
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
292 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016
Fig. 13. Example of aligning multi-modal sequences. (a) Accelerometer. (b) Projection onto the first principal component for the motion capture data,
video and accelerometers respectively. (c) GCTW. (d) Key frames aligned by GCTW. Notice that similar hand gestures have been aligned. From the
top to bottom, we show mocap data, video, and accelerometer data respectively.
the sequences by matching the first and the last frame, series. CTW extends DTW by adding a feature selection
which results in incorrect alignments (see Fig. 12b). Some mechanism and enabling alignment of signals with different
parts (noted by arrows) of the sequences with fewer dimensionality. CTW extends CCA by adding temporal
cycles have to be stretched into flat lines in order to match alignment and allowing temporally local projections. To
the other sequences with more cycles. Unlike conven- improve the efficiency of CTW, allow a more flexible time-
tional DTW-based methods built on dynamic program- warping, and align multiple sequences, GCTW extends
ming, GCTW uses the Gauss-Newton method, which CTW by parameterizing the warping path as a combination
allows for a more flexible time warping. By incorporating of monotonic functions. Inspired by existing work on image
a constant function in the set of bases, GCTW can natu- alignment, GCTW is optimized using coarse-to-fine Gauss-
rally be generalized to deal with the sub-sequence align- Newton updates, which allows for efficient alignment of
ment problem across multiple sequences. As shown in long sequences.
Figs. 12c and 12d, GCTW is not only able to align the Although CTW and GCTW have shown promising pre-
sequences in time, but also locate the boundaries of the liminary results, there are still unresolved issues. First,
sub-sequences that contain similar motions. This experi- the Gauss-Newton algorithm used in GCTW for time
ment demonstrates the benefits of GCTW in controlling warping converges poorly in areas where the objective
the warping path. function is non-smooth. Second, both CTW and GCTW
are subject to local minima. The effect of local minima
5.9 Aligning Multi-Modal Sequences can be partially alleviated using a temporal coarse-to-fine
The last experiment evaluates CTW and GCTW in the task approach as in the case of image alignment. In future
of aligning multi-modal sequences recorded with different work, we also plan to explore better initialization strate-
sensors. We selected one motion capture sequence from the gies. Third, although the experiments have shown good
CMU motion capture database, one video sequence from results using manually designed bases, an obvious exten-
the Weizmann database [42], and we collected an acceler- sion is to learn a set of monotonic bases from training
ometer signal of a subject performing jumping jacks. Some data, so the basis is adapted to a particular alignment
instances of the multi-modal data can be seen in Fig. 13d. problem. Finally, kernelization of CTW and GCTW might
Note that, to make the problem more challenging, the two lead to improvements in alignment error.
subjects in the mocap (top row) and video (middle row)
sequences are performing the same activity, but in the accel- ACKNOWLEDGMENTS
erometer sequence (bottom row) the subject only moves one This work was partially supported by the National Sci-
hand and not the legs. Even in this challenging scenario, ence Foundation under Grant No. EEEC-0540865, RI-
GCTW is able to solve for the temporal correspondence that 1116583 and CPS-0931999. Any opinions, findings, and
maximizes the correlation between signals. For the mocap conclusions or recommendations expressed in this mate-
data, we computed the quaternions for the 20 joints. In the rial are those of the author(s) and do not necessarily
case of the Weizmann dataset, we computed the Euclidean reflect the views of the National Science Foundation. F.
distance transform as described earlier. The X, Y and Z axis Zhou is the corresponding author.
accelerometer data (40 Hz) was collected using an X6-2 mini
USB accelerometer (Fig. 13a). Fig. 13b shows the first com- REFERENCES
ponents of the three sequences projected separately by
[1] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Engle-
PCA. As shown in Fig. 13c, GCTW found an accurate wood Cliffs, NJ, USA: Prentice-Hall, 1993.
temporal correspondence between the three sequences. [2] A. Bruderlin and L. Williams, “Motion signal processing,” in
Unfortunately, we do not have ground-truth for this experi- Proc. ACM 22nd Annu. Conf. Comput. Graphics Interactive Techn.,
ment. However, visual inspection of the video suggests that 1995, pp. 97–104.
[3] Y. Caspi and M. Irani, “Aligning non-overlapping sequences,” Int.
the results are consistent with human labeling. Fig. 13d J. Comput. Vis., vol. 48, no. 1, pp. 39–51, 2002.
shows several frames that have aligned by GCTW. [4] J. Aach and G. M. Church, “Aligning gene expression time series
with time warping algorithms,” Bioinformatics, vol. 17, no. 6,
pp. 495–508, 2001.
6 CONCLUSIONS [5] T. B. Sebastian, P. N. Klein, and B. B. Kimia, “On aligning
curves,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 1,
This paper proposes CTW and GCTW, two new techniques pp. 116–125, Jan. 2003.
for spatio-temporal alignment of multiple multi-modal time
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
ZHOU AND DE LA TORRE: GENERALIZED CANONICAL TIME WARPING 293
[6] F. Zhou, F. De la Torre, and J. K. Hodgins, “Hierarchical aligned [32] A. Heloir, N. Courty, S. Gibet, and F. Multon, “Temporal align-
cluster analysis for temporal clustering of human motion,” IEEE ment of communicative gesture sequences,” J. Vis. Comput. Anima-
Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 582–596, Mar. tion, vol. 17, no. 3-4, pp. 347–357, 2006.
2013. [33] S. Salvador and P. Chan, “Toward accurate dynamic time warp-
[7] J. M. Winters and Y. Wang, “Wearable sensors and tele- ing in linear time and space,” Intell. Data Anal., vol. 11, no. 5,
rehabilitation,” IEEE Eng. Med. Biol. Mag., vol. 22, no. 3, pp. 56–65, pp. 561–580, 2007.
May-Jun. 2003. [34] T. Rakthanmanon, B. J. L. Campana, A. Mueen, G. E. A. P. A.
[8] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez, “View-independent Batista, M. B. Westover, Q. Zhu, J. Zakaria, and E. J. Keogh,
action recognition from temporal self-similarities,” IEEE Trans. “Searching and mining trillions of time series subsequences under
Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 172–185, Jan. 2011. dynamic time warping,” in Proc. 18th ACM Conf. Knowl. Discovery
[9] E. Hsu, K. Pulli, and J. Popovic, “Style translation for human Data Mining, 2012, pp. 262–270.
motion,” ACM Trans. Graphics, vol. 24, no. 3, pp. 1082–1089, 2005. [35] M. A. Hasan, “On multi-set canonical correlation analysis,” in
[10] W. Pan and L. Torresani, “Unsupervised hierarchical modeling of Proc. Int. Joint Conf. Neural Netw., 2009, pp. 1128–1133.
locomotion styles,” in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. [36] J. O. Ramsay and B. W. Silverman, Functional Data Analysis, 2nd
785–792. ed. New York, NY, USA: Springer, 2005.
[11] A. Veeraraghavan, A. Srivastava, A. K. R. Chowdhury, and R. [37] J. O. Ramsay, “Estimating smooth monotone functions,” J. Roy.
Chellappa, “Rate-invariant recognition of humans and their Statist. Soc.: Series B Statist. Methodol., vol. 60, pp. 365–375, 1998.
activities,” IEEE Trans. Image Process., vol. 18, no. 6, pp. 1326–1339, [38] T. Robertson, F. T. Wright, and R. L. Dykstra, Order Restricted Sta-
Jun. 2009. tistical Inference. New York, NY, USA: Wiley, 1988.
[12] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer [39] I. W. Wright and E. J. Wegman, “Isotonic, convex and related
Science and Computational Biology. Cambridge, U.K.: Cambridge splines,” Ann. Statist., vol. 8, no. 5, pp. 1023–1035, 1980.
Univ. Press, 1997. [40] B. Lucas and T. Kanade, “An iterative image registration tech-
[13] F. Zhou and F. De la Torre, “Canonical time warping for align- nique with an application to stereo vision,” in Proc. Int. Joint Conf.
ment of human behavior,” in Proc. Neural Inf. Process. Syst., 2009, Artif. Intell., 1981, pp. 674–679.
pp. 2286–2294. [41] I. L. Dryden and K. V. Mardia, Statistical Shape Analysis. New
[14] F. Zhou and F. De la Torre, “Generalized time warping for align- York, NY, USA: Wiley, 1998.
ment of human motion,” in Proc. IEEE Conf. Comput. Vision Pattern [42] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri,
Recog., 2012, pp. 1282–1289. “Actions as space-time shapes,” IEEE Trans. Pattern Anal. Mach.
[15] M. Brand and A. Hertzmann, “Style machines,” in Proc. ACM 27th Intell., vol. 29, no. 12, pp. 2247–2253, Dec. 2007.
Annu. Conf. Comput. Graphics Interactive Techn., 2000, pp. 183–192. [43] C. R. Maurer, R. Qi, and V. V. Raghavan, “A linear time algorithm
[16] A. Shapiro, Y. Cao, and P. Faloutsos, “Style components,” in Proc. for computing exact Euclidean distance transforms of binary
Graphics Interface, 2006, pp. 33–39. images in arbitrary dimensions,” IEEE Trans. Pattern Anal. Mach.
[17] C. Rao, A. Gritai, M. Shah, and T. Fathima, “View-invariant align- Intell., vol. 25, no. 2, pp. 265–270, Feb. 2003.
ment and matching of video sequences,” in Proc. IEEE Int. Conf. [44] L. Gorelick, M. Galun, E. Sharon, R. Basri, and A. Brandt, “Shape
Comput. Vis., 2003, pp. 939–945. representation and classification using the Poisson equation,”
[18] D. Gong and G. G. Medioni, “Dynamic manifold warping for IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 1991–
view invariant action recognition,” in Proc. IEEE Int. Conf. Comput. 2005, Dec. 2006.
Vis., 2011, pp. 571–578. [45] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hodgins, and
[19] M. A. Nicolaou, V. Pavlovic, and M. Pantic, “Dynamic probabilis- N. S. Pollard, “Segmenting motion capture data into distinct
tic CCA for analysis of affective behavior and fusion of continuous behaviors,” in Proc. Graphics Interface, 2004, pp. 185–194.
annotations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, [46] M. S. Bartlett, G. C. Littlewort, M. G. Frank, C. Lainscsek, I. Fasel,
pp. 1299–1311, Jul. 2014. and J. R. Movellan, “Automatic recognition of facial actions in
[20] E. J. Keogh and M. J. Pazzani, “Derivative dynamic time spontaneous expressions,” J. Multimedia, vol. 1, no. 6, pp. 22–35,
warping,” in Proc. SIAM Int. Conf. Data Mining, 2001, pp. 5–7. 2006.
[21] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili, “Multiple [47] I. Matthews and S. Baker, “Active appearance models revisited,”
alignment of continuous time series,” in Proc. Neural Inf. Process. Int. J. Comput. Vis., vol. 60, no. 2, pp. 135–164, 2004.
Syst., 2005, pp. 817–824. [48] F. De la Torre, J. K. Hodgins, J. Montano, and S. Valcarcel,
[22] J. Ham, D. Lee, and L. Saul, “Semisupervised alignment of man- “Detailed human data acquisition of kitchen activities: The CMU-
ifolds,” in Proc. Int. Conf. Artif. Intell. Statist., 2005, pp. 120–127. multimodal activity database (CMU-MMAC),” Carnegie Mellon
[23] C. Wang and S. Mahadevan, “Manifold alignment using Procrus- Univ., Pittsburgh, PA, USA, Tech. Rep. RI-TR-08-22, 2009.
tes analysis,” in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1120– [49] P. Tormene, T. Giorgino, S. Quaglini, and M. Stefanelli, “Matching
1127. incomplete time series with dynamic time warping: An algorithm
[24] T. W. Anderson, An Introduction to Multivariate Statistical Analysis. and an application to post-stroke rehabilitation,” Artif. Intell. Med.,
Hoboken, NJ, USA: Wiley, 2003. vol. 45, no. 1, pp. 11–34, 2009.
[25] F. De la Torre, “A least-squares framework for component [50] W.-S. Chu, F. Zhou, and F. De la Torre, “Unsupervised temporal
analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, commonality discovery,” in Proc. Eur. Conf. Comput. Vis., 2012,
pp. 1041–1055, Jun. 2012. pp. 373–387.
[26] T. K. Kim and R. Cipolla, “Canonical correlation analysis of video [51] (2015). Carnegie Mellon University Motion Capture Database.
volume tensors for action categorization and detection,” IEEE [Online]. Available: http://mocap.cs.cmu.edu
Trans. Pattern Anal. Mach. Intell., vol. 31, no. 8, pp. 1415–1428,
Aug. 2009.
[27] C. C. Loy, T. Xiang, and S. Gong, “Time-delayed correlation analy-
sis for multi-camera activity understanding,” Int. J. Comput. Vis.,
vol. 90, no. 1, pp. 106–129, 2010.
[28] B. Fischer, V. Roth, and J. Buhmann, “Time-series alignment by
non-negative multiple generalized canonical correlation analysis,”
BMC Bioinformatics, vol. 8, no. 10, pp. 1–10, 2007.
[29] S. Shariat and V. Pavlovic, “Isotonic CCA for sequence alignment
and activity recognition,” in Proc. IEEE Int. Conf. Comput. Vis.,
2011, pp. 2572–2578.
[30] D. P. Bertsekas, Dynamic Programming and Optimal Control.
Belmont, MA, USA: Athena Scientific, 1995.
[31] S. Chu, E. Keogh, D. Hart, and M. Pazzani, “Iterative deepening
dynamic time warping for time series,” in Proc. SIAM Int. Conf.
Data Mining, 2002, pp. 195–212.
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.
294 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016
Feng Zhou received the BS degree in computer Fernando de la Torre received the BSc degree
science from Zhejiang University in 2005, the MS in telecommunications, as well as the MSc and
degree in computer science from Shanghai Jiao PhD degrees in electronic engineering from the
Tong University in 2008, and the PhD degree in La Salle School of Engineering at Ramon Llull
robotics from Carnegie Mellon University in 2014. University, Barcelona, Spain in 1994, 1996, and
He is now a researcher in the Media Analytics 2002, respectively. He is an associate research
group at NEC Laboratories America. His professor in the Robotics Institute at Carnegie
research interests include machine learning and Mellon University. His research interests are in
computer vision. the fields of computer vision and machine
learning. Currently, he is directing the Compo-
nent Analysis Laboratory (http://ca.cs.cmu.edu)
and the Human Sensing Laboratory (http://humansensing.cs.cmu.edu)
at Carnegie Mellon University. He has more than 150 publications in
referred journals and conferences. He has organized and co-organized
several workshops and has given tutorials at international conferences
on the use and extensions of Component Analysis.
Authorized licensed use limited to: Zhejiang University. Downloaded on March 24,2022 at 15:46:09 UTC from IEEE Xplore. Restrictions apply.