
Pattern Recognition 64 (2017) 399–406


A dynamic framework based on local Zernike moment and motion history image for facial expression recognition

Xijian Fan, Tardi Tjahjadi
School of Engineering, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom

A R T I C L E   I N F O

Keywords:
Zernike moment
Facial expression
Motion history image
Entropy
Feature extraction

A B S T R A C T

A dynamic descriptor facilitates robust recognition of facial expressions in video sequences. The two main current approaches to the recognition are basic emotion recognition and recognition based on facial action coding system (FACS) action units. In this paper we focus on basic emotion recognition and propose a spatio-temporal feature based on the local Zernike moment in the spatial domain using motion change frequency. We also design a dynamic feature comprising motion history image and entropy. To recognise a facial expression, a weighting strategy based on the latter feature and sub-division of the image frame is applied to the former to enhance the dynamic information of facial expression, followed by the application of the classical support vector machine. Experiments on the CK+ and MMI datasets using a leave-one-out cross-validation scheme demonstrate that the integrated framework achieves a better performance than using each descriptor separately. Compared with six state-of-the-art methods, the proposed framework demonstrates a superior performance.

1. Introduction

In recent years facial expression recognition has become a popular research topic [1–3]. With the recent advances in robotics, and as robots interact more and more with humans and become a part of human living and work spaces, there is an increasing requirement that robots are able to understand human emotions via a facial expression recognition system [4]. A facial expression recognition system also plays a significant role in Human-Computer Interaction (HCI) [5], where it has helped to create meaningful and responsive HCI interfaces. It has also been widely used in behavioural studies, video games, animations, safety mechanisms in automobiles, etc. [6].

Discriminative and robust features that represent facial expressions are important for effective recognition of facial expressions, and how to obtain them is still a challenging problem. Recent methods that address this problem can be categorised into global-based methods and local-based methods. It has been shown that local-based methods (e.g., based on Gabor wavelets using grid points) achieve better performance than the global-based ones (e.g., based on eigenfaces, Fisher's discriminant analysis, etc.) [7]. The Gabor wavelet gives good performance due to its locality and orientation selectivity, but its high computational cost makes it unsuitable for real-time applications. The Local Binary Pattern (LBP) descriptor, which is based on the histogram of local patterns, also achieves a promising performance [8].

Shape as a geometric-based representation is crucial for interpreting facial expressions. However, current state-of-the-art methods only focus on a small subset of possible shape representations, e.g., point-based methods that represent a face using the locations of several discrete points. Noting that image moments can describe simple properties of a shape, e.g., its area (or total intensity), its centre and its orientation, Zernike moments (ZMs) have been used to represent a face and facial expressions in [9,10]. Zernike moments are rotation-invariant features, which can be used to address in-plane head pose variation. In the field of facial expression recognition, rotation-invariant LBP and uniform LBP [11] have also been used to overcome the rotation problem. In [12], the Quantised Local Zernike Moment (QLZM) is used to describe the neighbourhood of a face sub-region. Local Zernike moments have more discriminant power than other image features, e.g., the local phase-magnitude histogram (H-LZM), the cascaded LZM transformation (H-LZM^2) and the local binary pattern (LBP) [13].

Since a facial expression involves a dynamic process, and the dynamics contain information that represents a facial expression more effectively, it is important to capture such dynamic information so as to recognise facial expressions over the entire video sequence. Recently, there has been more effort on modelling the dynamics of a facial expression sequence. However, the modelling is still a challenging problem. Thus, in this paper, we focus on analysing the dynamics of facial expression sequences.



First, we extend the spatial domain QLZM descriptor into the spatio-temporal domain, i.e., Motion Change Frequency based QLZM (QLZM_MCF), which enables the representation of the temporal variation of expressions. Second, we apply optical flow to the Motion History Image (MHI) [14], i.e., optical flow based MHI (MHI_OF), to represent spatio-temporal dynamic information (i.e., velocity).

We utilise two types of features: a spatio-temporal shape representation, QLZM_MCF, to enhance the local spatial and dynamic information, and a dynamic appearance representation, MHI_OF. We also introduce an entropy-based method to provide the spatial relationship of different parts of a face by computing the entropy value of different sub-regions of a face. The main contributions of this paper are: (a) QLZM_MCF; (b) MHI_OF; (c) an entropy-based method for MHI_OF to capture the motion information; and (d) a strategy integrating QLZM_MCF and entropy to enhance spatial information.

The rest of the paper is organised as follows. Previous related work is presented in Section 2. Section 3 presents QLZM_MCF, the method using MHI_OF and entropy, and the integration of the two dynamic features. The framework and the experimental results are respectively presented in Sections 4 and 5. Finally, Section 6 concludes the paper.

2. Related work

The two main focuses in the current research on facial expression are basic emotion recognition and recognition based on facial action coding system (FACS) action units (AUs). The most widely used facial expression descriptors for recognition and analysis are the six prototypical expressions of Anger, Disgust, Fear, Happiness, Sadness and Surprise [15]. The most widely used facial muscle action descriptors are AUs [1]. With regard to basic emotion recognition, geometric-based features and appearance-based features are the most widely used.

Geometric-based methods rely on the locations of a set of fiducial facial points [16,17], a connected face mesh [18,19], or the shapes of face components [20]. The most commonly used geometric representation is facial points, which represent a face by concatenating the x and y coordinates of a number of fiducial points. Alternative shape representations include the distances between facial landmarks, distances and angles that represent the opening/closing of the eyes and mouth, and groups of points that describe the state of the cheeks. Although it has been shown that shape representations play a vital role in analysing facial expressions, they have not been exploited to their full potential [12].

Image moments can be categorised into geometric moments, complex moments and orthogonal moments. Although easy to use, the large values of geometric moments are their main limitation, leading to numerical instabilities and sensitivity to noise. Complex moments are defined similarly to geometric moments and have been used to describe the shape of a probability density function and to measure the mass distribution of a body. Hu moments exhibit translation, rotation and scaling invariance, and have been applied in many areas [21]. Orthogonal moments are projections of a function onto a polynomial basis. ZMs employ complex Zernike polynomials as their moment basis set [22], and have been used to recognise facial expressions [23]. The rotation invariance of Zernike-based facial features is discussed in [9,10]. QLZM is used in [12] for recognising facial expressions. However, ZM has its shortcomings, namely it is a low-level histogram representation which ignores the spatial relations (i.e., configural information) among the different facial parts. Also, ZMs only describe the texture information in each frame of an image sequence, and do not capture any dynamic information. In this paper, we address these two limitations by extending QLZM to the spatio-temporal domain in order to extract dynamic information, and by introducing an entropy measure to incorporate spatial relations.

The appearance-based methods try to find a more effective and robust way to represent appearance features, including skin motion and texture changes (i.e., deformation of the skin) such as bulges, wrinkles and furrows. Transformations and statistical methods are used to determine the feature vectors that represent textures and are thus simple to implement. Gabor wavelets [24] and LBPs [25] are two representative feature vectors of such an approach that describe the local appearance models of facial expressions. Gabor magnitudes are robust to misalignment of corresponding image features. However, computing Gabor filters has a high computational cost, and the dimensionality of the output can be large, especially if they are applied to a wide range of frequencies, scales and orientations of the image features. LBP is a histogram where each bin corresponds to one of the different possible binary patterns representing a facial feature, resulting in a 256-dimensional descriptor. The most popular LBP is the uniform LBP [26]. LBP has been extended to the spatio-temporal domain so as to utilise the dynamics information, which results in a significant improvement in recognition rate [27]. One drawback of the appearance-based approach is that it is difficult to generalise appearance features across different persons.

A Dynamic Texture (DT) is a spatially repetitive, time-varying visual pattern that forms an image sequence with certain temporal stationarity [28]. MHI applied to the recognition of DT can be used to address the problem of facial expression recognition [29]. MHI decomposes motion-based recognition by first describing where there is motion (i.e., the spatial pattern) and then describing how the object is moving [14], where the temporal information is retained by eliminating one dimension. One of the advantages of MHI is that a range of times may be encoded in a single frame, and in this way the MHI spans the time scale of the human motion. In MHI, the intensity value of each image pixel denotes the recent movement, ignoring the speed of the movement. However, speed can be used to distinguish the movement of some facial parts (e.g., opening of the mouth and raising of the eyebrows) from the movements caused by changes of in-plane head pose or by relatively stable facial parts (e.g., cheek, nose, forehead, etc.) during facial expressions. Optical flow has been used to capture the velocity of movement at pixels in an image, but because it computes the changes in pixel intensities between two consecutive frames it does not accurately describe the entire video sequence. We address the limitations of MHI and optical flow by combining them so as to incorporate speed and to enable more distinct representations of the movement of different facial parts.

Entropy-based methods extract intensity information of image pixels, and have been applied to face recognition. For example, Cament et al. [30] combined entropy-like weighted Gabor features with the local normalisation of Gabor features. Chai et al. [31] introduced the entropy of a facial region, where a low entropy value means the probabilities of different intensities are different, and a high value means the probabilities are the same. They used the entropy of each of the equal-size blocks of a face image to determine the number of sub-blocks within each block. Inspired by [31], we use entropy in the proposed MHI_OF as follows. Since the intensity value of each pixel in MHI represents a movement, high intensity values denoting large movement will result in a high entropy value, and vice versa.

3. Feature extraction

3.1. Motion history image

MHI can be considered as a two-component temporal template, a vector-valued image where each component of each pixel is some function of the motion at that pixel location. The MHI H_τ(x, y, t) is computed from an update function Ψ(x, y, t), i.e.,

H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } \Psi(x, y, t) = 1 \\ \max(0,\, H_\tau(x, y, t-1) - \delta), & \text{otherwise} \end{cases}    (1)

where (x, y, t) denotes the spatial coordinates (x, y) of an image pixel at time t (in terms of image frame number), the duration τ determines the temporal extent of the movement in terms of frames, and δ is the decay parameter.


Ψ(x, y, t) is defined as

\Psi(x, y, t) = \begin{cases} 1, & \text{if } D(x, y, t) = 1 \\ 0, & \text{otherwise} \end{cases}    (2)

where D(x, y, t) is a binary image comprising pixel intensity differences of frames separated by temporal distance Δ, i.e.,

D(x, y, t) = |I(x, y, t) - I(x, y, t \pm \Delta)|    (3)

and I(x, y, t) is the intensity value of the pixel with coordinates (x, y) at the t-th frame of the image sequence. The duration τ and the decay parameter δ have an impact on the MHI image. If τ is smaller than the number of frames, then the prior information of the motion in its MHI will be lost. For example, when τ = 10 for a sequence with 19 frames, the motion information of the first 9 frames will be lost if δ = 1. On the other hand, if the temporal duration is set at a very high value compared to the number of frames, then the changes of pixel values in the MHI are less significant. The MHI of a sequence from the Extended CK dataset (CK+) [32] is shown in Fig. 1.

Fig. 1. Example of images from sequences (left and middle) and its MHI (right).
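As a concrete illustration of Eqs. (1)-(3), the minimal Python/NumPy sketch below (our own illustrative implementation, not the authors' code) accumulates an MHI over a grey-scale sequence; the binarisation threshold applied to the frame difference is an assumption, since the text does not specify one.

import numpy as np

def motion_history_image(frames, tau=10.0, delta=1.0, diff_thresh=25, delta_t=1):
    """Accumulate an MHI over a grey-scale sequence (Eqs. (1)-(3)).

    frames      : sequence of 2D uint8 images of identical size
    tau         : temporal duration assigned to newly moving pixels
    delta       : decay subtracted from non-moving pixels at each step
    diff_thresh : threshold turning |I_t - I_{t-delta_t}| into the binary map D
    delta_t     : temporal distance between the differenced frames
    """
    H = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(delta_t, len(frames)):
        # Eq. (3): absolute frame difference, binarised to give D(x, y, t)
        D = np.abs(frames[t].astype(np.int16)
                   - frames[t - delta_t].astype(np.int16)) > diff_thresh
        # Eq. (1): moving pixels are reset to tau, the others decay towards zero
        H = np.where(D, tau, np.maximum(0.0, H - delta))
    return H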

3.2. Optical flow algorithm

The optical flow descriptor can represent the velocity of a set of individual pixels in an image, which captures their dynamic information. We employ the optical flow descriptor in our framework to exploit velocity information.

The Lucas-Kanade method is one of the most widely used methods for optical flow computation [33]; it solves the basic optical flow equation for all pixels in a local neighbourhood using the least squares criterion. Given two consecutive image frames I_{t-1} and I_t, for a point p = (x, y)^T in I_{t-1}, if the optical flow is d = (u, v)^T then the corresponding point in I_t is p + d, where T is the transpose operator. The algorithm finds the d which minimises the match error between the local appearances of the two corresponding points. A cost function is defined for the local area R(p), i.e., [33]

e(d) = \sum_{x \in R(p)} w(x)\,(I_t(x + d) - I_{t-1}(x))^2,    (4)

where w(x) is a weighting window, which assigns a larger weight to pixels that are closer to the central pixel, as these pixels are considered to contain more important information than those further away.
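For reference, a single-window Lucas-Kanade step corresponding to Eq. (4) can be sketched as below; it linearises the cost around d = 0 and solves the resulting 2x2 weighted least-squares system. This is the generic textbook formulation rather than necessarily the authors' exact implementation; the window size and Gaussian weighting are illustrative choices.

import numpy as np

def lucas_kanade_flow_at(prev, curr, cx, cy, win=7, sigma=2.0):
    """Estimate the flow d = (u, v) of pixel (cx, cy) between two grey-scale
    frames via the weighted least-squares solution of the linearised Eq. (4).
    Assumes the window around (cx, cy) lies fully inside the image."""
    r = win // 2
    # spatial derivatives of the first frame and the temporal derivative
    Iy, Ix = np.gradient(prev.astype(np.float64))
    It = curr.astype(np.float64) - prev.astype(np.float64)
    sl = (slice(cy - r, cy + r + 1), slice(cx - r, cx + r + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    # Gaussian weights w(x): pixels nearer the centre count more
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    w = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)).ravel()
    A = np.stack([ix, iy], axis=1)
    AtW = (A * w[:, None]).T
    u, v = np.linalg.solve(AtW @ A, -AtW @ it)   # 2x2 normal equations
    return u, v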
3.3. Optical flow based MHI (MHI_OF)

In [34], Tsai et al. proposed a representation that incorporates both optical flow and a revised MHI for action recognition, which can better describe local movements. Since a video sequence of a facial expression involves local movements of different facial parts, we apply this representation to spatio-temporal facial expression recognition. Following [34], we compute the optical flow between two consecutive frames and obtain an optical flow image in which the intensity of each pixel represents the magnitude of the optical flow descriptor; higher values denote faster movement of facial points. We define the MHI_OF of a sequence as

M(x, y, t) = d(x, y, t) + M(x, y, t-1) \cdot \tau    (5)

where τ is another decay parameter, and

d(x, y, t) = \begin{cases} a \cdot d(x, y, t) + b, & \text{if } d(x, y, t) > T \\ 0, & \text{otherwise}, \end{cases}    (6)

where a and b are scale factors, and T is a threshold which is used to remove small movements while retaining the large movements of some fiducial points (e.g., eyebrows, lips, etc.). The scale factors are used because the optical flow descriptor is not significantly large for the movements of points in two consecutive frames. In our experiments, the original optical flow d(x, y, t) is magnified by a scale factor a of 10 with a starting value b of 20, and the threshold T is set to 1. A large value of the decay parameter τ creates a slow decrement of the accumulated motion strength, and the long-term history of the motion is recorded in the resulting MHI_OF image. A small value of τ gives an accelerated decrement of the motion strength, and only the recent short-term movements are retained in the MHI_OF image. Fig. 2 illustrates the optical flow based MHI for some facial expressions.

Fig. 2. Optical flow based MHI for Anger, Happiness and Surprise (from left to right).
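A compact sketch of the MHI_OF accumulation of Eqs. (5)-(6) is given below. It assumes the per-pixel optical-flow magnitude for each consecutive frame pair has already been computed (for example with a routine such as the one above), and uses the parameter values quoted in the text (a = 10, b = 20, T = 1); the decay value is illustrative.

import numpy as np

def mhi_of(flow_magnitudes, a=10.0, b=20.0, T=1.0, tau=0.9):
    """Accumulate MHI_OF (Eqs. (5)-(6)) from per-frame optical-flow magnitudes.

    flow_magnitudes : iterable of 2D arrays, |flow| between consecutive frames
    a, b, T         : scale, offset and threshold applied to the raw flow
    tau             : decay factor applied to the accumulated history
    """
    M = None
    for mag in flow_magnitudes:
        # Eq. (6): suppress small motions, magnify and offset the rest
        d = np.where(mag > T, a * mag + b, 0.0)
        # Eq. (5): add the current motion to the decayed history
        M = d if M is None else d + M * tau
    return M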


3.4. Entropy

The entropy of a discrete random variable X with possible values {x_0, x_1, ..., x_N} can be defined as [35]

E(X) = -\sum_{i=0}^{N} p(x_i) \log_2 p(x_i),    (7)

where p(.) is the probability function. For a grey-level face image, the intensity value of each pixel varies from 0 to 255, and the probability of a particular value occurring is random and varies depending on the pattern of different face images. Considering a face image of dimension H × W having a total of M = H × W pixels, the probability of a particular intensity value x_i occurring in the image is p(x_i) = n_i / M, where n_i is the number of occurrences of x_i among the M pixels. In this case, considering \sum_i n_i = M, the entropy of the image can be expressed as

E(X) = \log_2 M - \frac{1}{M} \sum_{i=0}^{255} n_i \log_2 n_i.    (8)

It is shown in [37] that certain facial regions contain more important information for recognising facial expressions than others. For example, the regions of the mouth and eyes, which produce more changes than those of the nose and forehead during an expression, contribute more towards the recognition. Also, as can be seen from the leftmost and middle columns of Fig. 3, different facial regions in MHI have different intensity levels due to the distance and speed of the movements during an expression. Thus, introducing a weight function which allocates different weights to different facial regions will improve recognition. Instead of setting the weights empirically based on observation, we utilise entropy to determine the weights, as it is expected that the entropy at different facial regions will differ significantly due to the pixel intensity variation at these regions.

The size of the training set in practice is often not large enough to cover all the possible values of pixels in MHI. To address this sparseness problem, we divide the possible 256 intensity levels into several sections to form intensity divisions. For a 2-dimensional (2D) matrix X = (x_{ij})_{H×W}, let χ = {t_1, t_2, ..., t_K} be the sorted set of all K possible intensity values that exist in X, where t_1 < t_2 < ... < t_K and K is the number of distinct intensity values. The process of division is

x_{ij} = \begin{cases} x_{t_1}, & t_1 \leq x_{ij} \leq t_2 \\ x_{t_2}, & t_2 \leq x_{ij} \leq t_3 \\ x_{t_3}, & t_3 \leq x_{ij} \leq t_4 \\ \vdots \\ x_{t_{K-1}}, & t_{K-1} \leq x_{ij} \leq t_K. \end{cases}    (9)

To compute the weight function, we divide the MHIs of size H × W into several non-overlapping sub-regions. The 2D spatial histogram of the intensity values x_{t_k} on each sub-region of X is

h_k = \{h_k(p, q) \mid 1 \leq p \leq P,\ 1 \leq q \leq Q\},    (10)

where p, q ∈ ℕ⁺, the image is divided into P × Q sub-regions, and h_k(p, q) ∈ [0, +∞) is the number of occurrences of the intensity value x_{t_k} in the spatial grid located on the image sub-region [(p-1)H/P, pH/P] × [(q-1)W/Q, qW/Q]. In forming the 2D spatial histogram h_k of intensity values x_{t_k}, the aspect ratio of the original image is maintained on the spatial grids. In this way, the spatial characteristics of pixels are retained when forming the 2D spatial histogram.

The entropy value on each sub-region of the 2D spatial histogram is computed for the intensity values x_{t_k} using

S(p, q) = -\sum_{k=1}^{K} p(h_k(p, q)) \log_2 p(h_k(p, q)),    (11)

where p(h_k(p, q)) is the probability of the particular intensity value x_{t_k} in the spatial grid located on the image sub-region [(p-1)H/P, pH/P] × [(q-1)W/Q, qW/Q]. The normalisation process is implemented using

\omega(p, q) = (S(p, q) - s_{min}) / (s_{max} - s_{min})    (12)

to convert the range of the weights into [0, 1], where s_{min} and s_{max} are respectively the minimum and the maximum of the entropy values over all sub-regions. The computed weights of each sub-region on MHI_OF are taken as the final weight features

enMHI_OF = \{\omega(1, 1), \omega(1, 2), \ldots, \omega(1, q), \ldots, \omega(p, q)\}.    (13)

The MHI_OF using the entropy representation is shown in the rightmost column of Fig. 3.

Fig. 3. Example image entropies: (left column) neutral image and its entropy; (middle column) surprise image and its entropy; and (right column) MHI of surprise image and its entropy. Lighter shades denote larger entropy values.
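The following sketch turns Eqs. (9)-(13) into code: it quantises an MHI_OF image into K intensity divisions, splits it into a P × Q grid of sub-regions, computes the entropy of each sub-region and min-max normalises the values into weights. The grid size and K are illustrative (Section 5 explores several settings).

import numpy as np

def entropy_weights(mhi_of, P=10, Q=10, K=10):
    """Per-sub-region entropy weights enMHI_OF (Eqs. (9)-(13))."""
    H, W = mhi_of.shape
    # Eq. (9): quantise the observed intensity range into K divisions
    levels = np.unique(mhi_of)
    edges = np.linspace(levels.min(), levels.max(), K + 1)
    quantised = np.digitize(mhi_of, edges[1:-1])          # values in 0..K-1
    S = np.zeros((P, Q))
    for p in range(P):
        for q in range(Q):
            block = quantised[p * H // P:(p + 1) * H // P,
                              q * W // Q:(q + 1) * W // Q]
            # Eqs. (10)-(11): histogram of the divisions, then its entropy
            counts = np.bincount(block.ravel(), minlength=K).astype(np.float64)
            prob = counts / counts.sum()
            prob = prob[prob > 0]
            S[p, q] = -np.sum(prob * np.log2(prob))
    # Eq. (12): min-max normalisation into [0, 1]
    w = (S - S.min()) / (S.max() - S.min() + 1e-12)
    return w.ravel()                                       # Eq. (13)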
3.5. Local Zernike moment

The ZMs of an image are computed by decomposing the image onto a set of complex orthogonal basis functions on the unit disc x² + y² ≤ 1 called Zernike polynomials. The Zernike polynomials are defined as [12]

V_{nm}(\rho, \theta) = V_{nm}(\rho\cos\theta, \rho\sin\theta) = R_{nm}(\rho)\, e^{jm\theta},    (14)

where n is the order of the polynomial and m is the repetition such that |m| ≤ n and n − |m| is even. R_{nm} are the radial polynomials, i.e.,

R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^s\,(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{\,n-2s},    (15)

where ρ and θ are the radial coordinates. A ZM of a face image I(x, y), consisting of a real and an imaginary component, is [12]

Z_{nm} = \frac{n+1}{\pi} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} I(x, y)\, V^{*}_{nm}(\rho_{xy}, \theta_{xy})\, \Delta x\, \Delta y,    (16)

where x and y are the image coordinates mapped to the range [-1, +1], ρ_{xy} = \sqrt{x^2 + y^2}, θ_{xy} = \tan^{-1}(y/x) and Δx = Δy = 2/N².

Since a local descriptor represents the discontinuities and texture of an image effectively, QLZM is proposed in [12] using non-linear encoding and pooling, where non-linear encoding facilitates the relevance of low-level features by increasing their robustness against image noise, while pooling is exploited to deal with the problem of small geometric variations. Non-linear encoding is carried out on the complex-valued local ZMs using binary quantisation, which converts the real and imaginary parts of each ZM coefficient into binary values using signum functions. Such coarse quantisation increases compression and encodes each local block with a single integer. Since features along borders may fall out of the local histogram, they are down-weighted in pooling using a Gaussian window peaked at the centre of each subregion. A second partitioning is also applied to account for the down-weighted features, where a higher emphasis is placed on features down-weighted at the first partitioning. The final QLZM feature is constructed by concatenating all local histograms, and the length of the extracted feature depends on two parameters, the number of moment coefficients K_1 and the size of the grid M, through

2^{2K_1} \times [M^2 + (M+1)^2],    (17)

where for moment order n, K_1 is computed using

K_1(n) = \begin{cases} \frac{n(n+2)}{4}, & \text{if } n \text{ is even} \\ \frac{(n+1)^2}{4}, & \text{if } n \text{ is odd}. \end{cases}    (18)

The process of generating QLZM is illustrated in Fig. 4.

Fig. 4. QLZM based facial representation framework.
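To make Eqs. (14)-(16) concrete, the sketch below computes a single complex Zernike moment Z_nm of a square image patch using the standard radial-polynomial formulation; it is not the optimised, filter-based QLZM implementation of [12], and the discretisation of the area element ΔxΔy follows a common convention rather than the exact normalisation quoted above.

import numpy as np
from math import factorial

def zernike_moment(patch, n, m):
    """Complex Zernike moment Z_nm of a square grey-scale patch (Eqs. (14)-(16))."""
    N = patch.shape[0]
    # map pixel coordinates to the square [-1, 1] x [-1, 1]
    coords = (2.0 * np.arange(N) - N + 1) / N
    x, y = np.meshgrid(coords, coords)
    rho = np.sqrt(x**2 + y**2)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0                      # moments are defined on the unit disc
    # Eq. (15): radial polynomial R_nm(rho)
    R = np.zeros_like(rho)
    for s in range((n - abs(m)) // 2 + 1):
        c = ((-1) ** s * factorial(n - s) /
             (factorial(s) * factorial((n + abs(m)) // 2 - s)
              * factorial((n - abs(m)) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    # Eqs. (14), (16): project the patch onto the conjugate basis function
    V_conj = R * np.exp(-1j * m * theta)
    dxdy = (2.0 / N) ** 2                    # pixel area on the unit square
    return (n + 1) / np.pi * np.sum(patch[inside] * V_conj[inside]) * dxdy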


3.6. Extension to spatio-temporal

QLZM of a 2D image, incorporating local spatial textural information, has been shown to achieve a good facial expression recognition rate [12]. In this paper, we incorporate dynamic information by applying a Motion Change Frequency (MCF) to the spatial QLZM, and propose the spatio-temporal descriptor QLZM_MCF. Suppose we have a QLZM sequence where each image frame has been transformed using QLZM, and the subregions of each QLZM frame are denoted as Q_{p,q}(i, t), where t is the image frame number in the sequence and i denotes the local pattern from subregion (p, q) of each QLZM image. For each pattern i in subregion (p, q), its positive change sequence pos_{p,q}(i, t), t = 1, 2, ..., T-1, is defined as

pos_{p,q}(i, t) = \begin{cases} 1, & Q_{p,q}(i, t+1) - Q_{p,q}(i, t) > T_s \cdot Q_{p,q}(i, t) \\ 0, & \text{otherwise}, \end{cases}    (19)

where T_s is a threshold. Similarly, its negative change sequence is defined as

neg_{p,q}(i, t) = \begin{cases} 1, & Q_{p,q}(i, t+1) - Q_{p,q}(i, t) < -T_s \cdot Q_{p,q}(i, t) \\ 0, & \text{otherwise}. \end{cases}    (20)

Also, we define the unchanged sequence as

unc_{p,q}(i, t) = \begin{cases} 1, & |Q_{p,q}(i, t+1) - Q_{p,q}(i, t)| \leq T_s \cdot Q_{p,q}(i, t) \\ 0, & \text{otherwise}. \end{cases}    (21)

T_s is an adjustable parameter which affects the performance of the proposed framework. If T_s is set too large then some movements between two consecutive frames might be ignored, while if T_s is set too small then small movements, e.g., due to subtle head pose changes, are detected. In our experiments, T_s is set to 0.1. The QLZM_MCF on each subregion (p, q) is the combination of the three change frequencies of pattern i, i.e.,

QLZM\_MCF_{p,q} = \{QLZM\_MCF_{p,q}(i, 1),\ QLZM\_MCF_{p,q}(i, 2),\ QLZM\_MCF_{p,q}(i, 3)\}    (22)

where

QLZM\_MCF_{p,q}(i, 1) = \sum_{t=1}^{T-1} pos(i, t) / (T - 1),
QLZM\_MCF_{p,q}(i, 2) = \sum_{t=1}^{T-1} neg(i, t) / (T - 1),    (23)
QLZM\_MCF_{p,q}(i, 3) = \sum_{t=1}^{T-1} unc(i, t) / (T - 1).

The final QLZM_MCF feature is obtained by concatenating all QLZM_MCF_{p,q} over the subregions.
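A minimal reading of Eqs. (19)-(23) is sketched below: for a sequence of per-sub-region QLZM histograms it computes, per bin, the fraction of frame transitions that are positive, negative or unchanged with respect to the threshold T_s = 0.1 quoted above. The input layout (frames × bins for one sub-region) is our own illustrative convention.

import numpy as np

def qlzm_mcf(Q, Ts=0.1):
    """Motion Change Frequency for one subregion (Eqs. (19)-(23)).

    Q  : array of shape (T, n_bins), the QLZM histogram of the subregion
         for each of the T frames of the sequence
    Ts : relative-change threshold
    Returns a vector of length 3 * n_bins: the positive, negative and
    unchanged change frequencies of every histogram bin.
    """
    diff = Q[1:] - Q[:-1]                        # Q(i, t+1) - Q(i, t)
    margin = Ts * Q[:-1]
    pos = (diff > margin).mean(axis=0)           # Eq. (19) averaged over t, Eq. (23)
    neg = (diff < -margin).mean(axis=0)          # Eq. (20)
    unc = (np.abs(diff) <= margin).mean(axis=0)  # Eq. (21)
    return np.concatenate([pos, neg, unc])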
3.7. Fusion using weighting function

Given two different types of facial features, an efficient way to combine them is to concatenate the two features to give

f_{FUSION} = (enMHI\_OF,\ QLZM\_MCF),    (24)

where enMHI_OF and QLZM_MCF are the two features.

Another combination scheme is also introduced to combine the two features by applying the enMHI_OF feature as a weight function in pooling during the generation of QLZM. Specifically, we use the same strategy of subregion division on the input image of MHI and QLZM, and a threshold based on enMHI_OF is introduced for each subregion of the QLZM image to determine which subregions are removed or retained to compute the spatio-temporal QLZM_MCF. If the enMHI_OF value of a subregion is larger than the threshold, the subregion at the same location in the QLZM image is retained for further processing, otherwise the subregion is removed. The threshold function is defined as

R_{p,q} = \begin{cases} R_{p,q}, & enMHI\_OF_{p,q} > T_{en} \\ \text{removed}, & \text{otherwise}, \end{cases}    (25)

where T_{en} is the threshold to be set and enMHI_OF_{p,q} is the value of enMHI_OF in subregion (p, q). This scheme is required because subregions with a larger enMHI_OF value, indicating more significant motion (and thus making a larger contribution to recognition), should be allocated larger weights, while subregions with a smaller enMHI_OF, indicating little motion (and thus making little or no contribution to recognition), should be allocated smaller weights or be removed. The integrated feature is f_{WeightedFUSION}, and the dimension of the feature is 3 × 2^{2K_1} × N_s, where N_s is the number of selected subregions obtained by the thresholding.
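The sub-region selection of Eq. (25) amounts to masking the QLZM_MCF blocks by their entropy weights; a small sketch is given below, with an illustrative threshold value T_en.

import numpy as np

def weighted_fusion(en_mhi_of, qlzm_mcf_blocks, T_en=0.3):
    """Keep only the QLZM_MCF subregions whose entropy weight exceeds T_en (Eq. (25)).

    en_mhi_of       : 1D array of entropy weights, one per subregion
    qlzm_mcf_blocks : list of per-subregion QLZM_MCF vectors (same order)
    """
    keep = en_mhi_of > T_en
    selected = [blk for blk, k in zip(qlzm_mcf_blocks, keep) if k]
    return np.concatenate(selected) if selected else np.array([])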
3.8. Dimensionality reduction using 2D PCA

Principal Component Analysis (PCA) is widely used in facial expression recognition for reducing the dimensionality of the feature space. It aims to extract decorrelated features out of possibly correlated features using a linear mapping function. Under controlled head-pose and imaging conditions, these features capture the statistical structure of facial expressions. 2D PCA has been shown to be superior to PCA in terms of more accurate estimation of the covariance matrices and reduced computational complexity for feature extraction by operating directly on 2D matrices instead of 1-dimensional vectors [39], i.e., it is not necessary to convert the 2D image into a 1D feature prior to feature extraction. Given L training samples G_1, G_2, ..., G_L, the scatter matrix S is [39]

S = \frac{1}{L} \sum_{i=1}^{L} (G_i - M)^T (G_i - M),    (26)

where M = (1/L) \sum_{i=1}^{L} G_i. Since there are at most L − 1 eigenvectors of S with non-zero eigenvalues, N eigenvectors (where N < L − 1) with the largest eigenvalues are chosen from the set of L − 1 eigenvectors (e_1, e_2, ..., e_{L-1}) to construct the projection subspace. Eigenvectors with zero eigenvalues are discarded in order to reduce the dimensionality of the feature space while preserving the discriminatory information. Thus, 2D PCA is adopted in this paper.
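A compact 2D PCA sketch following Eq. (26) is shown below; it treats each training sample as a 2D feature matrix, builds the scatter matrix and projects onto the N leading eigenvectors. This is a generic illustration of the technique of [39], not the authors' exact configuration.

import numpy as np

def fit_2dpca(samples, n_components):
    """samples: array of shape (L, H, W) of 2D feature matrices."""
    mean = samples.mean(axis=0)
    centred = samples - mean
    # Eq. (26): image scatter (covariance) matrix of size W x W
    S = sum(g.T @ g for g in centred) / len(samples)
    eigvals, eigvecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    proj = eigvecs[:, ::-1][:, :n_components]          # N leading eigenvectors
    return mean, proj

def transform_2dpca(sample, mean, proj):
    """Project one 2D sample onto the learned subspace: (H, W) -> (H, N)."""
    return (sample - mean) @ proj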


4. Facial expression recognition framework

Fig. 5 outlines the proposed framework, which comprises pre-processing, feature extraction and classification. The pre-processing includes facial landmark detection and face alignment, where face alignment is applied to reduce the effects of variation in head pose and scene illumination. We use local evidence aggregated regression [38] to detect facial landmarks in each frame, where the locations of the detected eyes and nose are used for face alignment, including scaling and cropping. The aligned face images are of size 200×200, where the x coordinate of the centre of the two eyes is the centre in the horizontal direction, while the y coordinate of the nose tip locates the lower third in the vertical direction. Since the dimensionality of the features is high, following the feature extraction of Section 3 a dimensionality reduction technique is applied to obtain a more compact representation. Different classifiers may lead to different recognition performance. We use the support vector machine (SVM), which has been widely used and shown to be effective in recognising facial expressions.

Fig. 5. The proposed framework.
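As an illustration of the classification stage, the sketch below assembles per-sequence feature vectors and evaluates a support vector machine with leave-one-out cross-validation, mirroring the protocol described in Section 5. The use of scikit-learn and the RBF kernel with default-style hyper-parameters are assumptions; the paper does not specify an implementation or kernel.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def evaluate(features, labels):
    """features: (n_sequences, n_dims) array of fused descriptors;
    labels: expression class of each sequence."""
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # illustrative hyper-parameters
    scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
    return scores.mean()                            # average recognition rate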
5. Experiments

5.1. Facial expression datasets

We use the Extended CK dataset (CK+) as it is widely used for evaluating the performance of facial expression recognition methods and thus facilitates the comparison of performances. The dataset includes 327 image sequences of six basic expressions (namely Anger, Disgust, Fear, Happiness, Sadness and Surprise) and a non-basic emotion expression (namely Contempt), performed by 118 subjects. Each image sequence from this dataset has a varying number of frames, and starts with the neutral state and ends with the peak phase of a facial expression. We use the standard leave-one-out cross-validation scheme to evaluate the performance of the proposed framework by computing the average recognition rate. One sequence corresponding to an expression is chosen for testing and the remaining sequences of the same expression are used for training. We run the proposed recognition system 327 times on the selected image sequences, and average all recognition rates to obtain the final rates.

We also use MMI [36], a publicly available dataset which includes both posed and spontaneous facial expression sequences. 203 sequences labelled as one of the six basic expressions are selected, and all selected sequences are converted into 8-bit grey-scale images with only the sub-sequences from the start frame to the frame with the peak expression phase included.

5.2. Experimental results

The first experiment aims to investigate the effectiveness of the enMHI_OF feature, and is conducted on the CK+ dataset. As the performance of enMHI_OF might rely on the size of the sub-regions and the number of grey levels represented by K, we conducted our experiment using different sizes and K. Table 1 shows that better performances are achieved using divided grey levels (i.e., K=4, 10, 20) than using the entire 256 grey levels. Also, using sub-regions of size 20 × 20 gives the best performances.

Table 1
Recognition rate of enMHI_OF using several combinations of different grey levels and block sizes on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

          20 × 20    10 × 10    8 × 8     5 × 5
K=4       74.31      70.33      71.55     67.28
K=10      75.84      75.84      70.63     72.78
K=20      76.14      75.53      72.48     74.92
K=256     73.40      70.94      71.55     73.09

The second experiment compares the performance difference between the spatial and spatio-temporal features. The recognition rates using the spatial QLZM and the spatio-temporal QLZM_MCF, which employs dynamic information, are summarised in Fig. 6. We also compare the use of MHI_OF and MHI. Since using the MHI image as input to the classifier may lead to a high dimensionality, we use a histogram computation to represent MHI and MHI_OF. The recognition rates using MHI and MHI_OF are shown in Fig. 7. As can be seen from Fig. 7, the performance using MHI_OF is better than that using MHI. These two figures show that QLZM_MCF and MHI_OF outperform the spatial QLZM and MHI, respectively, although the performance of both MHI and MHI_OF is less than satisfactory.

Fig. 6. Recognition rates of all expressions using QLZM and QLZM_MCF on the CK+ dataset.

Fig. 7. Recognition rates of all expressions using MHI and MHI_OF on the CK+ dataset.

The third experiment investigates the effectiveness of concatenating QLZM_MCF with enMHI_OF in the proposed framework. Tables 2–5 respectively show the results using the two individual features separately, the simple fusion strategy f_FUSION and the proposed fusion strategy f_WeightedFUSION. The overall recognition rates using all four features (QLZM_MCF, enMHI_OF, the feature using the simple fusion strategy and the feature using the proposed weighting fusion strategy) are shown in Table 6. The tables show that the framework using the simple fusion strategy of two features performs better than using each individual feature separately, and that the proposed fusion strategy achieves the best performance. In Table 6, we also compare the proposed feature with the method of Eskil et al. [43], the static method of Lucey et al. [32] and our previous work [44]; the fused feature achieves an average recognition rate of 88.30% for all seven facial expressions and outperforms the other methods. Thus, we can also conclude that the combination of the two dynamic features improves the recognition rate.

We also conducted an experiment on the MMI dataset, comparing the proposed framework with the method that uses LBP and SVM [37], and the methods in [44,45], all evaluated using the same classification strategy of 10-fold cross-validation. The average recognition rates are shown in Table 7. The table shows that the proposed framework outperforms all the other five methods. The result for LBP was obtained by using different samples to those used in [37], and by using the same classification strategy introduced in [45], which is also used in [44] and in the proposed method.


Table 2
Recognition rate (in percentage) of QLZM_MCF on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                A      D      F      H      Sa     Su     C
Anger(A)        89.9   2.2    0      0      4.4    0      4.4
Disgust(D)      1      94.9   1.7    0      0      0      1.7
Fear(F)         4.0    8.0    64.0   4.0    12.0   0      4.0
Happiness(H)    1.4    0      0      97.1   0      1.4    0
Sadness(Sa)     0      3.6    7.1    0      78.6   3.6    7.1
Surprise(Su)    0      0      1.2    2.4    1.2    94.0   1.2
Contempt(C)     0      5.6    16.7   11.1   5.6    0      61.1

Table 3
Recognition rate (in percentage) of enMHI_OF on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                A      D      F      H      Sa     Su     C
Anger(A)        73.3   4.4    6.7    6.7    6.7    0      2.2
Disgust(D)      5.1    84.8   3.4    0      3.4    0      3.4
Fear(F)         8.0    0      40.0   16.0   20.0   4.0    12.0
Happiness(H)    0      0      4.3    89.9   2.9    0      2.9
Sadness(Sa)     10.7   0      21.4   3.6    42.9   7.1    14.3
Surprise(Su)    1.2    0      2.4    2.4    1.2    90.4   2.4
Contempt(C)     11.1   0      27.8   5.6    11.1   5.6    38.9

Table 4
Recognition rate (in percentage) of the simple fusion strategy on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                A      D      F      H      Sa     Su     C
Anger(A)        86.7   2.2    2.2    2.2    6.7    0      0
Disgust(D)      3.4    94.9   1.7    0      0      0      0
Fear(F)         8.0    4.0    72.0   0      4.0    4.0    8.0
Happiness(H)    0      0      0      95.7   8.0    4.0    0
Sadness(Sa)     3.6    0      10.7   0      78.6   0      7.1
Surprise(Su)    0      0      1.2    2.4    1.2    95.2   0
Contempt(C)     5.6    0      22.2   5.6    5.6    5.6    55.6

Table 5
Recognition rate (in percentage) of the proposed fusion strategy on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                A      D      F      H      Sa     Su     C
Anger(A)        91.1   2.2    6.7    0      0      0      0
Disgust(D)      3.4    96.7   0      0      0      0      0
Fear(F)         4.0    4.0    80.0   4.0    0      4.0    4.0
Happiness(H)    0      1.4    0      98.6   0      0      0
Sadness(Sa)     3.6    0      0      3.6    89.3   3.6    0
Surprise(Su)    0      0      0      0      1.2    97.6   1.2
Contempt(C)     5.6    0      11.1   5.6    5.6    0      72.2

Table 6
The overall recognition rates of the four spatio-temporal features on the CK+ dataset.

Feature                                  Recognition rate (%)
Lucey et al. [32]                        50.4
Eskil et al. [43]                        76.8
Our previous work [44]                   83.7
QLZM_MCF                                 82.6
enMHI_OF                                 65.7
Simple fusion strategy                   82.6
Proposed weighting fusion strategy       88.3

Table 7
Comparative evaluation of the proposed framework on the MMI dataset.

Study                                    Recognition rate (%)
LBP [37]                                 54.5
AAM [45]                                 62.4
ASM [45]                                 64.4
Fang et al. [45]                         71.6
Our previous work [44]                   74.3
Proposed weighting fusion strategy       79.8

Although CK+ and MMI are two of the most widely used datasets for evaluating facial expression recognition methods, they are both collected in strictly controlled settings with near frontal poses, consistent illumination and posed expressions. The recent and more challenging datasets AFEW and SFEW [46] provide platforms for researchers to create, extend and test their methods on common benchmark data. Since the proposed framework recognises facial expressions on video sequences and treats a sequence as an entity, we use the AFEW data employed for EmotiW 2014 in our experiments [47]. AFEW is a dynamic temporal facial expression data corpus extracted from movies with realistic real-world environments. It was collected on the basis of Subtitles for the Deaf and Hearing impaired (SDH) and Closed Captions (CC), by searching expression-related content and extracting time stamps corresponding to video clips which represent some meaningful facial motion. The database covers a large age range of subjects from 1 to 70 years, and the subjects in the clips have been annotated with attributes such as Name, Age of Actor, Age of Character, Pose, Gender, Expression of Person and the overall Clip Expression. There are a total of 957 video clips in the database labelled with the six basic expressions anger, disgust, fear, happy, sad and surprise, and the neutral expression. To compare with the baseline method of EmotiW 2014 [47], we modified the proposed framework slightly, using the pre-processing methods (face detection and alignment) provided by the baseline method.

We used the training samples for training, and the validation samples for performance evaluation. Table 8 shows the recognition rate using the proposed framework on the AFEW dataset. The overall recognition rate of the proposed framework on the validation set is 37.63%, which is higher than the 33.15% achieved by the video-only baseline method. Unlike in the experiments on the CK+ dataset, the surprise expression is much more difficult to recognise. This is because the surprise expression might not always be acted in an exaggerated way (i.e., the openness of the mouth) in real situations. Also, the overall recognition rate is much lower than on the CK+ and MMI datasets. This is because numerous frames from the AFEW sequences were captured under poor lighting conditions or with large pose variations or occlusions, and the expressions do not always evolve from neutral to peak expression.

Table 8
Recognition rate (in percentage) of the proposed strategy on classification of six basic facial expressions and the neutral expression of the AFEW dataset.

                A      D      F      H      N      Sa     Su
Anger(A)        65.6   7.8    4.7    1.6    9.4    6.3    4.7
Disgust(D)      15.0   22.5   7.5    12.5   17.5   10.0   15.0
Fear(F)         21.7   13.0   20.6   15.2   15.2   8.7    6.5
Happiness(H)    3.2    7.9    7.9    63.5   9.5    6.3    1.6
Neutral(N)      3.2    6.3    14.3   9.5    49.2   9.5    7.9
Sadness(Sa)     8.2    11.5   14.8   14.8   26.2   21.3   3.3
Surprise(Su)    13.0   10.9   17.4   13.0   19.6   4.3    21.7


6. Conclusion

This paper presents a facial expression recognition framework using enMHI_OF and QLZM_MCF. The framework, which comprises pre-processing, feature extraction followed by 2D PCA, and SVM classification, achieves better performance than most of the state-of-the-art methods on the CK+ and MMI datasets. Our main contributions are three-fold. First, we proposed a spatio-temporal feature based on QLZM. Second, we applied optical flow in MHI to obtain the MHI_OF feature, which incorporates velocity information. Third, we introduced entropy to employ the spatial relation of different facial parts, and designed a strategy based on entropy to integrate enMHI_OF and QLZM_MCF. The proposed framework performs slightly worse in distinguishing the three expressions of Fear, Sadness and Contempt; how to design a better feature to represent these expressions will be part of our future work. Also, since an expression usually occurs along with the movement of the shoulders and hands, it might be useful to exploit this information in our recognition system.

When applying a facial expression recognition framework in real situations, computation speed might be a factor to be considered. In some cases, an increase in speed may result in a decrease in recognition performance. How to design a framework for facial expression recognition which increases the computational speed without any degradation in the recognition rate remains a challenge.

Acknowledgements

The authors would like to thank the China Scholarship Council / Warwick Joint Scholarship (Grant no. 201206710046) for providing the funds for this research.

References

[1] M. Pantic, L. Rothkrantz, Automatic analysis of facial expressions: the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000) 1424–1445.
[2] M. Pantic, L. Rothkrantz, Toward an affect-sensitive multimodal human-computer interaction, Proceedings of the IEEE 91 (2003) 1370–1390.
[3] Y. Tian, T. Kanade, J. Cohn, Handbook of Face Recognition, Springer, 2005 (Chapter 11: Facial expression recognition).
[4] T. Tojo, Y. Matsusaka, T. Ishii, T. Kobayashi, A conversational robot utilizing facial and body expressions, in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, 2000, pp. 858–863.
[5] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, J.G. Taylor, Emotion recognition in human-computer interaction, IEEE Signal Process. Mag. 18 (1) (2001) 32–80.
[6] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognit. 23 (2003) 259–275.
[7] B. Heisele, P. Ho, J. Wu, T. Poggio, Face recognition: component-based versus global approaches, Comput. Vis. Image Underst. 91 (2003) 6–21.
[8] T. Ahonen, A. Hadid, M. Pietikainen, Face recognition with local binary patterns, in: Proceedings of the European Conference on Computer Vision (ECCV), 2004.
[9] A. Ono, Face recognition with Zernike moments, Syst. Comput. Jpn. 34 (10) (2003) 26–35.
[10] C. Singh, N. Mittal, E. Walia, Face recognition using Zernike and complex Zernike moment features, Pattern Recognit. Image Anal. 21 (2011) 71–81.
[11] S. Moore, R. Bowden, Local binary patterns for multi-view facial expression recognition, Comput. Vis. Image Underst. 115 (4) (2011) 541–558.
[12] E. Sariyanidi, H. Gunes, M. Gokmen, A. Cavallaro, Local Zernike Moments representations for facial affect recognition, in: Proceedings of the IEEE International Conference on Image Processing, 2012, pp. 585–588.
[13] E. Sariyanidi, V. Dal, S.C. Tek, B. Tunc, M. Gokmen, Local Zernike Moments: a new representation for face recognition, in: Proceedings of the 19th IEEE International Conference on Image Processing, 2012, pp. 585–588.
[14] A.F. Bobick, J.W. Davis, The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. 23 (3) (2001) 257–267.
[15] P. Ekman, W. Friesen, Constants across cultures in the face and emotion, J. Pers. Soc. Psychol. 17 (2) (1971) 124–129.
[16] M. Pantic, L. Rothkrantz, Facial action recognition for facial expression analysis from static face images, IEEE Trans. Syst. Man Cybern. 34 (3) (2004) 1449–1461.
[17] M. Pantic, I. Patras, Dynamics of facial expressions - recognition of facial actions and their temporal segments from face profile image sequences, IEEE Trans. Syst. Man Cybern. 36 (2) (2006) 433–449.
[18] S. Gokturk, J. Bouguet, C. Tomasi, B. Girod, Model-based face tracking for view-independent facial expression recognition, in: Proceedings of the IEEE Conference on Face and Gesture Recognition, 2002, pp. 272–278.
[19] I. Cohen, N. Sebe, F. Cozman, M. Cirelo, T. Huang, Learning Bayesian network classifiers for facial expression recognition from both labeled and unlabeled data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2003, pp. 595–601.
[20] Y. Tian, T. Kanade, J. Cohn, Recognizing action units for facial expression analysis, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 1–19.
[21] M.K. Hu, Visual pattern recognition by moment invariants, IRE Trans. Inf. Theory 8 (2) (1962) 179–187.
[22] M.R. Teague, Image analysis via the general theory of moments, J. Opt. Soc. Am. 70 (1980) 920–930.
[23] A. Khotanzad, Y.H. Hong, Rotation invariant image recognition using features selected via a systematic method, Pattern Recognit. 23 (1990) 1089–1101.
[24] Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu, Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multilayer perceptron, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1998, pp. 454–459.
[25] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expression, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 915–928.
[26] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[27] G.Y. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expression, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 915–928.
[28] D. Chetverikov, R. Peteri, A brief survey of dynamic texture description and recognition, in: Proceedings of the Conference on Computer Recognition Systems, vol. 5, 2005, pp. 17–26.
[29] S. Koelstra, M. Pantic, I. Patras, A dynamic texture-based approach to recognition of facial actions and their temporal models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (2010) 1940–1954.
[30] L.A. Cament, L.E. Castillo, J.P. Perez, F.J. Galdames, C.A. Perez, Fusion of local normalization and Gabor entropy weighted features for face identification, Pattern Recognit. 47 (2014) 568–577.
[31] Z. Chai, H. Mendez-Vazquez, R. He, Z. Sun, T. Tan, Semantic pixel sets based local binary patterns for face recognition, in: Proceedings of the Asian Conference on Computer Vision (ACCV), 2012, pp. 639–651.
[32] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The Extended Cohn-Kanade Dataset (CK+): a complete facial expression dataset for action unit and emotion-specified expression, in: Proceedings of the Third IEEE Workshop on CVPR for Human Communicative Behaviour Analysis, 2010, pp. 94–101.
[33] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, vol. 2, 1981, pp. 674–679.
[34] D.M. Tsai, W.Y. Chiu, M.H. Lee, Optical flow-motion history image (OF-MHI) for action recognition, Signal Image Video Process. 9 (8) (2015) 1897–1906.
[35] R. Balian, Entropy, a Protean concept, in: J. Dalibard (Ed.), Poincaré Seminar 2003: Bose-Einstein Condensation - Entropy, Birkhäuser, Basel, 2004, pp. 119–144.
[36] M. Pantic, M.F. Valstar, R. Rademaker, L. Maat, Web-based database for facial expression analysis, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 2005, pp. 317–321.
[37] C. Shan, S. Gong, P. McOwan, Facial expression recognition based on local binary patterns: a comprehensive study, Image Vis. Comput. 27 (2009).
[38] B. Martinez, M.F. Valstar, X. Binefa, M. Pantic, Local evidence aggregation for regression based facial point detection, IEEE Trans. Pattern Anal. Mach. Intell. 35 (5) (2013) 1149–1163.
[39] J. Yang, D. Zhang, A.F. Frangi, J.Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
[43] M.T. Eskil, K. Benli, Facial expression recognition based on anatomy, Comput. Vis. Image Underst. 119 (2014) 1–14.
[44] X. Fan, T. Tjahjadi, A spatial-temporal framework based on histogram of gradients and optical flow for facial expression recognition in video sequences, Pattern Recognit. 48 (11) (2015) 3407–3416.
[45] H. Fang, N.M. Parthalain, J. Aubrey, K.L. Tam, R. Borgo, L. Rosin, W. Grant, D. Marshall, M. Chen, Facial expression recognition in dynamic sequences: an integrated approach, Pattern Recognit. 47 (3) (2014) 1271–1281.
[46] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Collecting large, richly annotated facial-expression databases from movies, IEEE Multimed. (2012) 34–41.
[47] A. Dhall, R. Goecke, J. Joshi, K. Sikka, T. Gedeon, Emotion Recognition In The Wild Challenge 2014: baseline, data and protocol, in: Proceedings of the 16th International Conference on Multimodal Interaction, 2014, pp. 461–466.

Xijian Fan received the B.Sc. in Information and Communication Technology from Nanjing University of Posts and Telecommunications, China, and the M.Sc. in Computer Information and Science from Hohai University, China, in 2008 and 2012, respectively. He is currently pursuing a PhD in Engineering at the University of Warwick, U.K. His research interests include image processing and facial expression recognition.

Tardi Tjahjadi received the B.Sc. in Mechanical Engineering from University College London in 1980, the M.Sc. in Management Sciences in 1981 and the Ph.D. in Total Technology in 1984 from UMIST, U.K. He has been an associate professor at Warwick University since 2000 and a reader since 2014. His research interests include image processing and computer vision.

