Pattern Recognition 64 (2017) 399–406
A dynamic framework based on local Zernike moment and motion history image for facial expression recognition
Xijian Fan, Tardi Tjahjadi⁎
School of Engineering, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom
ARTICLE INFO

Keywords:
Zernike moment
Facial expression
Motion history image
Entropy
Feature extraction

ABSTRACT

A dynamic descriptor facilitates robust recognition of facial expressions in video sequences. The current two main approaches to the recognition are basic emotion recognition and recognition based on facial action coding system (FACS) action units. In this paper we focus on basic emotion recognition and propose a spatio-temporal feature based on local Zernike moment in the spatial domain using motion change frequency. We also design a dynamic feature comprising motion history image and entropy. To recognise a facial expression, a weighting strategy based on the latter feature and sub-division of the image frame is applied to the former to enhance the dynamic information of facial expression, followed by the application of the classical support vector machine. Experiments on the CK+ and MMI datasets using the leave-one-out cross-validation scheme demonstrate that the integrated framework achieves a better performance than using each descriptor separately. Compared with six state-of-the-art methods, the proposed framework demonstrates a superior performance.
⁎ Corresponding author.
E-mail address: t.tjahjadi@warwick.ac.uk (T. Tjahjadi).
http://dx.doi.org/10.1016/j.patcog.2016.12.002
Received 29 February 2016; Received in revised form 1 December 2016; Accepted 2 December 2016
Available online 05 December 2016
0031-3203/ © 2016 Elsevier Ltd. All rights reserved.
based QLZM (QLZM_MCF), which enables the representation of temporal variation of expressions. Second, we apply optical flow to the Motion History Image (MHI) [14], i.e., optical flow based MHI (MHI_OF), to represent spatio-temporal dynamic information (i.e., velocity).

We utilise two types of features: a spatio-temporal shape representation, QLZM_MCF, to enhance the local spatial and dynamic information, and a dynamic appearance representation, MHI_OF. We also introduce an entropy-based method to provide the spatial relationship between different parts of a face by computing the entropy value of different sub-regions of the face. The main contributions of this paper are: (a) QLZM_MCF; (b) MHI_OF; (c) an entropy-based method for MHI_OF to capture the motion information; and (d) a strategy integrating QLZM_MCF and entropy to enhance spatial information.

The rest of the paper is organised as follows. Previous related work is presented in Section 2. Section 3 presents QLZM_MCF, the method using MHI_OF and entropy, and the integration of the two dynamic features. The framework and the experimental results are respectively presented in Sections 4 and 5. Finally, Section 6 concludes the paper.

2. Related work

The two main focuses in current research on facial expression are basic emotion recognition and recognition based on facial action coding system (FACS) action units (AUs). The most widely used facial expression descriptors for recognition and analysis are the six prototypical expressions of Anger, Disgust, Fear, Happiness, Sadness and Surprise [15]. The most widely used facial muscle action descriptors are AUs [1]. With regard to basic emotion recognition, geometric-based features and appearance-based features are the most widely used.

Geometric-based methods rely on the locations of a set of fiducial facial points [16,17], a connected face mesh [18,19], or the shapes of face components [20]. The most commonly used geometric representation is facial points, which represent a face by concatenating the x and y coordinates of a number of fiducial points. Alternative shape representations include the distances between facial landmarks, the distance and angle that represent the opening/closing of the eyes and mouth, and groups of points that describe the state of the cheeks. Although it has been shown that shape representations play a vital role in analysing facial expressions, they have not been exploited to their full potential [12].

Image moments can be categorised into geometric moments, complex moments and orthogonal moments. Although easy to use, the large values of geometric moments are their main limitation, leading to numerical instabilities and sensitivity to noise. Complex moments are defined similarly to geometric moments, and have been used to describe the shape of a probability density function and to measure the mass distribution of a body. Hu moments exhibit translation, rotation and scaling invariance, and have been applied in many areas [21]. Orthogonal moments are projections of a function onto a polynomial basis. ZMs employ complex Zernike polynomials as their moment basis set [22], and have been used to recognise facial expressions [23]. The rotation invariance of Zernike-based facial features is discussed in [9,10]. QLZM is used in [12] for recognising facial expressions. However, ZM has its shortcomings, namely that it is a low-level histogram representation which ignores the spatial relations (i.e., configural information) among the different facial parts. Also, ZMs only describe the texture information in each frame of an image sequence, and do not capture any dynamic information. In this paper, we address these two limitations by extending QLZM to the spatio-temporal domain in order to extract dynamic information, and by introducing entropy to incorporate spatial relations.

The appearance-based methods try to find a more effective and robust way to represent appearance features, including skin motion and texture changes (i.e., deformation of skin) such as bulges, wrinkles and furrows. Transformations and statistical methods are used to determine the feature vectors that represent textures and are thus simple to implement. Gabor wavelets [24] and LBPs [25] are two representative feature vectors of such an approach that describe the local appearance models of facial expressions. Gabor magnitudes are robust to misalignment of corresponding image features. However, computing Gabor filters has a high computational cost, and the dimensionality of the output can be large, especially if they are applied to a wide range of frequencies, scales and orientations of the image features. LBP is a histogram where each bin corresponds to one of the different possible binary patterns representing a facial feature, resulting in a 256-dimensional descriptor. The most popular LBP is the uniform LBP [26]. LBP has been extended to the spatio-temporal domain so as to utilise the dynamics information, which results in a significant improvement in recognition rate [27]. One drawback of the appearance-based approach is that it is difficult to generalise appearance features across different persons.

A Dynamic Texture (DT) is a spatially repetitive, time-varying visual pattern that forms an image sequence with certain temporal stationarity [28]. MHI applied to the recognition of DT can be used to address the problem of facial expression recognition [29]. MHI decomposes motion-based recognition by first describing where there is motion (i.e., the spatial pattern) and then describing how the object is moving [14], where the temporal information is retained by eliminating one dimension. One of the advantages of MHI is that a range of times may be encoded in a single frame, and in this way the MHI spans the time scale of the human motion. In MHI, the intensity value of each image pixel denotes the recent movement, ignoring the speed of the movement. However, speed can be used to distinguish the movement of some facial parts (e.g., opening of the mouth and raising of the eyebrows) from the movements caused by changes of in-plane head pose or by relatively stable facial parts (e.g., cheek, nose, forehead, etc.) during facial expressions. Optical flow has been used to capture the velocity of movement at pixels in an image, but because it computes the changes in pixel intensities between two consecutive frames, it does not accurately describe the entire video sequence. We address the limitations of MHI and optical flow by combining them so as to incorporate speed and to enable more distinct representations of the movement of different facial parts.

Entropy-based methods extract the intensity information of image pixels, and have been applied to face recognition. For example, Cament et al. [30] combined entropy-like weighted Gabor features with the local normalisation of Gabor features. Chai et al. [31] introduced the entropy of a facial region, where a low entropy value means the probabilities of different intensities are different, and a high value means the probabilities are the same. They used the entropy of each of the equal-size blocks of a face image to determine the number of sub-blocks within each block. Inspired by [31], we use entropy in the proposed MHI_OF as follows. Since the intensity value of each pixel in MHI represents a movement, high intensity values denoting large movement will result in a high entropy value, and vice versa.

3. Feature extraction

3.1. Motion history image

MHI can be considered as a two-component temporal template, a vector-valued image where each component of each pixel is some function of the motion at that pixel location. The MHI H_\tau(x, y, t) is computed from an update function \Psi(x, y, t), i.e.,

H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } \Psi(x, y, t) = 1 \\ \max(0,\, H_\tau(x, y, t-1) - \delta), & \text{otherwise} \end{cases}    (1)

where (x, y, t) denotes the spatial coordinates (x, y) of an image pixel at time t (in terms of image frame number), the duration \tau determines the temporal extent of the movement in terms of frames, and \delta is the decay parameter.
Fig. 1. Example of images from sequences (left and middle) and its MHI (right).
Fig. 2. Optical flow based MHI for Anger, Happiness and Surprise (from left to right).
\Psi(x, y, t) is defined as

\Psi(x, y, t) = \begin{cases} 1, & \text{if } D(x, y, t) = 1 \\ 0, & \text{otherwise} \end{cases}    (2)

where D(x, y, t) is a binary image comprising pixel intensity differences of frames separated by a temporal distance \Delta, i.e.,

D(x, y, t) = |I(x, y, t) - I(x, y, t \pm \Delta)|    (3)

and I(x, y, t) is the intensity value of the pixel with coordinates (x, y) in the t-th frame of the image sequence. The duration \tau and the decay parameter \delta have an impact on the MHI image. If \tau is smaller than the number of frames, then the prior information of the motion in its MHI will be lost. For example, when \tau = 10 for a sequence with 19 frames, the motion information of the first 9 frames will be lost if \delta = 1. On the other hand, if the temporal duration is set at a very high value compared to the number of frames, then the changes of pixel values in the MHI are less significant. The MHI of a sequence from the Extended CK dataset (CK+) [32] is shown in Fig. 1.
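To make Eqs. (1)–(3) concrete, the following Python/NumPy sketch (an illustration, not the authors' implementation) accumulates an MHI from a list of grey-level frames; the difference threshold used to binarise D(x, y, t) is an assumed value, since the paper only states that D is a binary difference image.

```python
import numpy as np

def motion_history_image(frames, tau=10, delta=1, diff_thresh=25, delta_t=1):
    """Accumulate an MHI over a sequence of grey-level frames (2-D arrays).

    Eq. (3): D is the absolute difference of frames separated by delta_t.
    Eq. (2): Psi = 1 where the (binarised) difference indicates motion.
    Eq. (1): H is set to tau at moving pixels and decays by delta elsewhere.
    """
    h = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(delta_t, len(frames)):
        d = np.abs(frames[t].astype(np.int16) - frames[t - delta_t].astype(np.int16))
        psi = d > diff_thresh                                  # assumed binarisation threshold
        h = np.where(psi, float(tau), np.maximum(0.0, h - delta))
    return h
```

With tau smaller than the sequence length and delta = 1, the motion of the earliest frames fades out of the template, which is the behaviour discussed above.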
The optical flow descriptor can represent the velocity of a set of individual pixels in an image, which captures their dynamic information. We employ the optical flow descriptor in our framework to exploit velocity information.

The Lucas-Kanade method is one of the most widely used methods for optical flow computation [33]; it solves the basic optical flow equation for all pixels in a local neighbourhood using the least squares criterion. Given two consecutive image frames I_{t-1} and I_t, for a point p = (x, y)^T in I_{t-1}, if the optical flow is d = (u, v)^T then the corresponding point in I_t is p + d, where T is the transpose operator. The algorithm finds the d which minimises the match error between the local appearances of the two corresponding points. A cost function is defined for the local area R(p), i.e., [33]

e(d) = \sum_{x \in R(p)} w(x) \left( I_t(x + d) - I_{t-1}(x) \right)^2,    (4)

where w(x) is a weighting window, which assigns larger weights to pixels that are closer to the central pixel, as these pixels are considered to contain more important information than those further away.
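As a rough illustration of Eq. (4), the sketch below estimates a dense Lucas-Kanade flow field by solving the weighted least-squares problem, linearised around d = 0, independently at every pixel; the Gaussian weighting w(x) and its width are assumptions, and in practice a library implementation could be used instead.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lucas_kanade_flow(prev, curr, sigma=2.0):
    """Dense Lucas-Kanade optical flow between two grey-level frames."""
    prev = prev.astype(np.float32)
    curr = curr.astype(np.float32)
    iy, ix = np.gradient(prev)                       # spatial derivatives
    it = curr - prev                                 # temporal derivative
    w = lambda a: gaussian_filter(a, sigma)          # w(x): weights over R(p)
    ixx, iyy, ixy = w(ix * ix), w(iy * iy), w(ix * iy)
    ixt, iyt = w(ix * it), w(iy * it)
    det = ixx * iyy - ixy ** 2                       # determinant of the 2x2 normal matrix
    det = np.where(np.abs(det) < 1e-9, 1e-9, det)    # guard against flat regions
    u = (-iyy * ixt + ixy * iyt) / det               # closed-form least-squares solution
    v = (ixy * ixt - ixx * iyt) / det
    return u, v
```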
3.3. Optical flow based MHI (MHI_OF)

In [34], Tsai et al. proposed a representation that incorporates both optical flow and a revised MHI for action recognition, which can better describe local movements. Since a video sequence of a facial expression involves local movements of different facial parts, we apply this representation to spatio-temporal facial expression recognition. Following [34], we compute the optical flow between two consecutive frames and obtain the optical flow image, where the intensity of each pixel represents the magnitude of the optical flow descriptor. Higher values denote faster movement of facial points. We define the MHI_OF of a sequence as

M(x, y, t) = d(x, y, t) + M(x, y, t-1) \cdot \tau    (5)

where \tau is another decay parameter, and

d(x, y, t) = \begin{cases} a \cdot d(x, y, t) + b, & d(x, y, t) > T \\ 0, & \text{otherwise} \end{cases}    (6)

a and b are scale factors, and T is a threshold which is used to remove small movements while retaining large movements of some fiducial points (e.g., eyebrows, lips, etc.). Scale factors are used because the optical flow descriptor is not significantly large for the movements of points in two consecutive frames. In our experiments, the original optical flow d(x, y, t) is magnified by a scale factor a of 10 with a starting value b of 20, and the threshold T is set to 1. A large value of the decay parameter \tau creates a slow decrement of the accumulated motion strength, and the long-term history of the motion is recorded in the resulting MHI_OF image. A small value of \tau gives an accelerated decrement of the motion strength, and only the recent short-term movements are retained in the MHI_OF image. Fig. 2 illustrates the optical flow based MHI for some facial expressions.
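A minimal sketch of the MHI_OF accumulation of Eqs. (5) and (6) is given below, reusing the flow function from the previous sketch; a = 10, b = 20 and T = 1 follow the values quoted in the text, whereas the decay value tau is an illustrative assumption.

```python
import numpy as np

def mhi_of(frames, a=10.0, b=20.0, T=1.0, tau=0.9):
    """Optical flow based MHI (Eqs. (5)-(6)) over a sequence of grey-level frames."""
    m = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(1, len(frames)):
        u, v = lucas_kanade_flow(frames[t - 1], frames[t])   # from the earlier sketch
        mag = np.sqrt(u ** 2 + v ** 2)                       # flow magnitude d(x, y, t)
        d = np.where(mag > T, a * mag + b, 0.0)              # Eq. (6): threshold and rescale
        m = d + m * tau                                      # Eq. (5): decayed accumulation
    return m
```

A tau close to 1 keeps the long-term motion history, while a small tau retains only recent movements, as described above.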
The entropy of a discrete random variable X with possible values \{x_0, x_1, \ldots, x_N\} can be defined as [35]

E(X) = -\sum_{i=0}^{N} p(x_i) \log_2 p(x_i),    (7)

where p(\cdot) is the probability function. For a grey-level face image, the intensity value of each pixel varies from 0 to 255, and the probability of a particular value occurring is random and varies depending on the pattern of different face images. Considering a face image of dimension H \times W having a total of M = H \times W pixels, the probability of a particular intensity value x_i occurring in the image is p(x_i) = n_i / M, where n_i is the number of occurrences of x_i among the M pixels. In this case, considering \sum_i n_i = M, the entropy of the image can be expressed as

E(X) = \log_2 M - \frac{1}{M} \sum_{i=0}^{255} n_i \log_2 n_i.    (8)
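For completeness, Eq. (8) follows from Eq. (7) by substituting p(x_i) = n_i / M and using \sum_i n_i = M:

E(X) = -\sum_{i=0}^{255} \frac{n_i}{M} \log_2 \frac{n_i}{M} = -\frac{1}{M} \sum_{i=0}^{255} n_i \left( \log_2 n_i - \log_2 M \right) = \log_2 M - \frac{1}{M} \sum_{i=0}^{255} n_i \log_2 n_i.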
It is shown in [37] that certain facial regions contain more important information for recognising facial expressions than others. For example, the regions of the mouth and eyes, which produce more changes than those of the nose and forehead during an expression, contribute more towards the recognition. Also, as can be seen from the leftmost and middle columns of Fig. 3, different facial regions in MHI have different intensity levels due to the distance and speed of movements during an expression. Thus, introducing a weight function which allocates different weights to different facial regions will improve recognition. Instead of setting weights empirically based on observation, we utilise entropy to determine the weights, as it is expected that the entropy at different facial regions will differ significantly due to the pixel intensity variation at these regions.

The size of the training samples in practice is often not large enough to cover all the possible values of pixels in MHI. To address this sparse

to convert the range of weights into 0–1, where s_min and s_max are respectively the minimum and the maximum of the entropy values over all sub-regions. The computed weights of all sub-regions of MHI_OF form the final weight feature

enMHI_OF = \{\omega(1, 1), \omega(1, 2), \ldots, \omega(1, q), \ldots, \omega(p, q)\}.    (13)

Fig. 4. QLZM based facial representation framework.
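The sub-region weighting can be sketched as follows: the MHI_OF image is divided into p × q blocks, the entropy of Eq. (8) is computed per block, and the entropies are rescaled to [0, 1] to form the weights of Eq. (13). The block grid and the grey-level quantisation below are illustrative assumptions; the actual choices of grey levels and block sizes are the ones evaluated in Table 1.

```python
import numpy as np

def block_entropy(block, levels=256):
    """Shannon entropy of one block (Eq. (8)) from its grey-level histogram."""
    hist, _ = np.histogram(block, bins=levels, range=(0, levels))
    prob = hist[hist > 0] / block.size
    return -np.sum(prob * np.log2(prob))

def entropy_weights(mhi_of_img, p=4, q=4, levels=256):
    """Per-block entropies of an MHI_OF image rescaled to [0, 1] (Eq. (13))."""
    img = mhi_of_img.astype(np.float32)
    img = np.uint8((levels - 1) * (img - img.min()) / (img.ptp() + 1e-9))  # quantise
    h, w = img.shape
    s = np.array([[block_entropy(img[i * h // p:(i + 1) * h // p,
                                     j * w // q:(j + 1) * w // q], levels)
                   for j in range(q)] for i in range(p)])
    return (s - s.min()) / (s.max() - s.min() + 1e-9)   # omega(i, j)
```

As described in the abstract, these weights are then used to emphasise the corresponding sub-regions of the QLZM_MCF descriptor before SVM classification.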
Fig. 6. Recognition rates of all expressions using QLZM and QLZM_MCF on the CK+ dataset.
Table 1
Recognition rate of enMHI_OF using several combinations of different grey levels and block sizes on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                 20 × 20    10 × 10    8 × 8    5 × 5
Table 2
Recognition rate (in terms of percentage of true positives, true negatives, etc.) of QLZM_MCF on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                 A      D      F      H      Sa     Su     C
Anger(A)         89.9   2.2    0      0      4.4    0      4.4
Disgust(D)       1      94.9   1.7    0      0      0      1.7
Fear(F)          4.0    8.0    64.0   4.0    12.0   0      4.0
Happiness(H)     1.4    0      0      97.1   0      1.4    0
Sadness(Sa)      0      3.6    7.1    0      78.6   3.6    7.1
Surprise(Su)     0      0      1.2    2.4    1.2    94.0   1.2
Contempt(C)      0      5.6    16.7   11.1   5.6    0      61.1

Table 3
Recognition rate (in terms of percentage of true positives, true negatives, etc.) of enMHI_OF on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                 A      D      F      H      Sa     Su     C
Anger(A)         73.3   4.4    6.7    6.7    6.7    0      2.2
Disgust(D)       5.1    84.8   3.4    0      3.4    0      3.4
Fear(F)          8.0    0      40.0   16.0   20.0   4.0    12.0
Happiness(H)     0      0      4.3    89.9   2.9    0      2.9
Sadness(Sa)      10.7   0      21.4   3.6    42.9   7.1    14.3
Surprise(Su)     1.2    0      2.4    2.4    1.2    90.4   2.4
Contempt(C)      11.1   0      27.8   5.6    11.1   5.6    38.9

Table 4
Recognition rate (in terms of percentage of true positives, true negatives, etc.) of the simple fusion strategy on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                 A      D      F      H      Sa     Su     C
Anger(A)         86.7   2.2    2.2    2.2    6.7    0      0
Disgust(D)       3.4    94.9   1.7    0      0      0      0
Fear(F)          8.0    4.0    72.0   0      4.0    4.0    8.0
Happiness(H)     0      0      0      95.7   8.0    4.0    0
Sadness(Sa)      3.6    0      10.7   0      78.6   0      7.1
Surprise(Su)     0      0      1.2    2.4    1.2    95.2   0
Contempt(C)      5.6    0      22.2   5.6    5.6    5.6    55.6

Table 5
Recognition rate (in terms of percentage of true positives, true negatives, etc.) of the proposed fusion strategy on classification of six facial expressions and contempt of the CK+ dataset with leave-sequence-out cross-validation.

                 A      D      F      H      Sa     Su     C
Anger(A)         91.1   2.2    6.7    0      0      0      0
Disgust(D)       3.4    96.7   0      0      0      0      0
Fear(F)          4.0    4.0    80.0   4.0    0      4.0    4.0
Happiness(H)     0      1.4    0      98.6   0      0      0
Sadness(Sa)      3.6    0      0      3.6    89.3   3.6    0
Surprise(Su)     0      0      0      0      1.2    97.6   1.2
Contempt(C)      5.6    0      11.1   5.6    5.6    0      72.2

Table 6
The overall recognition rates of the four spatio-temporal features on the CK+ dataset.

Feature                                Recognition rate
Lucey et al. [32]                      50.4
Eskil et al. [43]                      76.8
Our previous work [44]                 83.7
QLZM_MCF                               82.6
enMHI_OF                               65.7
Simple fusion strategy                 82.6
Proposed weighting fusion strategy     88.3

Table 7
Comparative evaluation of the proposed framework on the MMI dataset.

Study                                  Recognition rate
LBP [37]                               54.5
AAM [45]                               62.4
ASM [45]                               64.4
Fang et al. [45]                       71.6
Our previous work [44]                 74.3
Proposed weighting fusion strategy     79.8

Although CK+ and MMI are two of the most widely used datasets for evaluating facial expression recognition methods, they are both collected in strictly controlled settings with near frontal poses, consistent illumination and posed expressions. The recent and more challenging AFEW and SFEW [46] datasets provide platforms for researchers to create, extend and test their methods on common benchmark data. Since the proposed framework recognises facial expressions on video sequences, treating a sequence as an entity, we use AFEW, as used for EmotiW 2014, in our experiments [47]. AFEW is a dynamic temporal facial expression data corpus extracted from movies with realistic real-world environments. It was collected on the basis of Subtitles for the Deaf and Hearing impaired (SDH) and Closed Captions (CC) for searching expression-related content and extracting time stamps corresponding to video clips which represent some meaningful facial motion. The database contains a large age range of subjects, from 1 to 70 years, and the subjects in the clips have been annotated with attributes such as Name, Age of Actor, Age of Character, Pose, Gender, Expression of Person and the overall Clip Expression. There are a total of 957 video clips in the database, labelled with the six basic expressions (anger, disgust, fear, happy, sad, surprise) and neutral. To compare with the baseline method of EmotiW 2014 [47], we modified the proposed framework slightly, using the pre-processing methods (face detection and alignment) provided by the baseline method.

We used the training samples for training, and the validation samples for performance evaluation. Table 8 shows the recognition rate of the proposed framework on the AFEW dataset. The overall recognition rate of the proposed framework on the validation set is 37.63%, which is higher than the 33.15% achieved by the video-only baseline method. Unlike the experiments on the CK+ dataset, the surprise expression is much more difficult to recognise. This is because the surprise expression might not be acted exaggeratedly (i.e., the openness of the mouth) in real situations. Also, the overall recognition rate is much lower than on the CK+ and MMI datasets. This is because numerous frames from the AFEW sequences were captured under poor lighting conditions, have large pose variations or occlusions, and the expressions are not always from neutral to peak expression.

Table 8
Recognition rate (in terms of percentage of true positives, true negatives, etc.) of the proposed strategy on classification of six basic facial expressions and the neutral expression of the AFEW dataset.

                 A      D      F      H      N      Sa     Su
Anger(A)         65.6   7.8    4.7    1.6    9.4    6.3    4.7
Disgust(D)       15.0   22.5   7.5    12.5   17.5   10.0   15.0
Fear(F)          21.7   13.0   20.6   15.2   15.2   8.7    6.5
Happiness(H)     3.2    7.9    7.9    63.5   9.5    6.3    1.6
Neutral(N)       3.2    6.3    14.3   9.5    49.2   9.5    7.9
Sadness(Sa)      8.2    11.5   14.8   14.8   26.2   21.3   3.3
Surprise(Su)     13.0   10.9   17.4   13.0   19.6   4.3    21.7

6. Conclusion

This paper presents a facial expression recognition framework
using enMHI_OF and QLZM_MCF. The framework, which comprises pre-processing, feature extraction followed by 2D PCA, and SVM classification, achieves better performance than most of the state-of-the-art methods on the CK+ and MMI datasets. Our main contributions are threefold. First, we proposed a spatio-temporal feature based on QLZM. Second, we applied optical flow in MHI to obtain the MHI_OF feature, which incorporates velocity information. Third, we introduced entropy to exploit the spatial relation of different facial parts, and designed a strategy based on entropy to integrate enMHI_OF and QLZM_MCF. The proposed framework performs slightly worse in distinguishing the three expressions of Fear, Sadness and Contempt; how to design a better feature to represent these expressions will be part of our future work. Also, since an expression usually occurs along with the movement of the shoulders and hands, it might be useful to exploit this information in our recognition system.

When applying a facial expression recognition framework in real situations, computation speed might be a factor to be considered. In some cases, an increase in speed may result in a decrease in recognition performance. How to design a framework for facial expression recognition which increases the computational speed without any degradation in the recognition rate remains a challenge.

Acknowledgements

The authors would like to thank the China Scholarship Council / Warwick Joint Scholarship (Grant no. 201206710046) for providing the funds for this research.

References

[1] M. Pantic, L. Rothkrantz, Automatic analysis of facial expressions: the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000) 1424–1445.
[2] M. Pantic, L. Rothkrantz, Toward an affect-sensitive multimodal human-computer interaction, Proceedings of the IEEE 91 (2003) 1370–1390.
[3] Y. Tian, T. Kanade, J. Cohn, Handbook of Face Recognition, Springer, 2005 (Chapter 11: Facial expression recognition).
[4] T. Tojo, Y. Matsusaka, T. Ishii, T. Kobayashi, A conversational robot utilizing facial and body expressions, in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, 2000, pp. 858–863.
[5] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, J.G. Taylor, Emotion recognition in human-computer interaction, IEEE Signal Process. Mag. 18 (1) (2001) 32–80.
[6] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognit. 36 (2003) 259–275.
[7] B. Heisele, P. Ho, J. Wu, T. Poggio, Face recognition: component-based versus global approaches, Comput. Vis. Image Underst. 91 (2003) 6–21.
[8] T. Ahonen, A. Hadid, M. Pietikainen, Face recognition with local binary patterns, in: Proceedings of the European Conference on Computer Vision (ECCV), 2004.
[9] A. Ono, Face recognition with Zernike moments, Syst. Comput. Jpn. 34 (10) (2003) 26–35.
[10] C. Singh, N. Mittal, E. Walia, Face recognition using Zernike and complex Zernike moment features, Pattern Recognit. Image Anal. 21 (2011) 71–81.
[11] S. Moore, R. Bowden, Local binary patterns for multi-view facial expression recognition, Comput. Vis. Image Underst. 115 (4) (2011) 541–558.
[12] E. Sariyanidi, H. Gunes, M. Gokmen, A. Cavallaro, Local Zernike Moments representations for facial affect recognition, in: Proceedings of the IEEE International Conference on Image Processing, 2012, pp. 585–588.
[13] E. Sariyanidi, V. Dal, S.C. Tek, B. Tunc, M. Gökmen, Local Zernike Moments: a new representation for face recognition, in: Proceedings of the 19th IEEE International Conference on Image Processing, 2012, pp. 585–588.
[14] A.F. Bobick, J.W. Davis, The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. 23 (3) (2001) 257–267.
[15] P. Ekman, W. Friesen, Constants across cultures in the face and emotion, J. Pers. Soc. Psychol. 17 (2) (1971) 124–129.
[16] M. Pantic, L. Rothkrantz, Facial action recognition for facial expression analysis from static face images, IEEE Trans. Syst., Man Cybern. 34 (3) (2004) 1449–1461.
[17] M. Pantic, I. Patras, Dynamics of facial expressions - recognition of facial actions and their temporal segments from face profile image sequences, IEEE Trans. Syst., Man Cybern. 36 (2) (2006) 433–449.
[18] S. Gokturk, J. Bouguet, C. Tomasi, B. Girod, Model-based face tracking for view-independent facial expression recognition, in: Proceedings of the IEEE Conference on Face and Gesture Recognition, 2002, pp. 272–278.
[19] I. Cohen, N. Sebe, F. Cozman, M. Cirelo, T. Huang, Learning Bayesian network classifiers for facial expression recognition using both labeled and unlabeled data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2003, pp. 595–601.
[20] Y. Tian, T. Kanade, J. Cohn, Recognizing action units for facial expression analysis, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 1–19.
[21] M.K. Hu, Visual pattern recognition by moment invariants, IRE Transactions on Information Theory 8 (2) (1962) 179–187.
[22] M.R. Teague, Image analysis via the general theory of moments, J. Opt. Soc. Am. 70 (1980) 920–930.
[23] A. Khotanzad, Y.H. Hong, Rotation invariant image recognition using features selected via a systematic method, Pattern Recognit. 23 (1990) 1089–1101.
[24] Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu, Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1998, pp. 454–459.
[25] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 915–928.
[26] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[27] G.Y. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 915–928.
[28] D. Chetverikov, R. Peteri, A brief survey of dynamic texture description and recognition, in: Proceedings of the Conference on Computer Recognition Systems, vol. 5, 2005, pp. 17–26.
[29] S. Koelstra, M. Pantic, I. Patras, A dynamic texture-based approach to recognition of facial actions and their temporal models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (2010) 1940–1954.
[30] L.A. Cament, L.E. Castillo, J.P. Perez, F.J. Galdames, C.A. Perez, Fusion of local normalization and Gabor entropy weighted features for face identification, Pattern Recognit. 47 (2014) 568–577.
[31] Z. Chai, H. Mendez-Vazquez, R. He, Z. Sun, T. Tan, Semantic pixel sets based local binary patterns for face recognition, in: Proceedings of the Asian Conference on Computer Vision (ACCV), 2012, pp. 639–651.
[32] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The Extended Cohn-Kanade Dataset (CK+): a complete facial expression dataset for action unit and emotion-specified expression, in: Proceedings of the Third IEEE Workshop on CVPR for Human Communicative Behaviour Analysis, 2010, pp. 94–101.
[33] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, vol. 2, 1981, pp. 674–679.
[34] D.M. Tsai, W.Y. Chiu, M.H. Lee, Optical flow-motion history image (OF-MHI) for action recognition, Signal, Image and Video Processing 9 (8) (2015) 1897–1906.
[35] R. Balian, Entropy, a protean concept, in: J. Dalibard (Ed.), Poincaré Seminar 2003: Bose-Einstein Condensation - Entropy, Birkhäuser, Basel, 2004, pp. 119–144.
[36] M. Pantic, M.F. Valstar, R. Rademaker, L. Maat, Web-based database for facial expression analysis, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 2005, pp. 317–321.
[37] C. Shan, S. Gong, P. McOwan, Facial expression recognition based on local binary patterns: a comprehensive study, Image Vis. Comput. 27 (2009).
[38] B. Martinez, M.F. Valstar, X. Binefa, M. Pantic, Local evidence aggregation for regression based facial point detection, IEEE Trans. Pattern Anal. Mach. Intell. 35 (5) (2013) 1149–1163.
[39] J. Yang, D. Zhang, A.F. Frangi, J.Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
[43] M.T. Eskil, K. Benli, Facial expression recognition based on anatomy, Comput. Vis. Image Underst. 119 (2014) 1–14.
[44] X. Fan, T. Tjahjadi, A spatial-temporal framework based on histogram of gradients and optical flow for facial expression recognition in video sequences, Pattern Recognit. 48 (11) (2015) 3407–3416.
[45] H. Fang, N.M. Parthalain, J. Aubrey, K.L. Tam, R. Borgo, L. Rosin, W. Grant, D. Marshall, M. Chen, Facial expression recognition in dynamic sequences: an integrated approach, Pattern Recognit. 47 (3) (2014) 1271–1281.
[46] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Collecting large, richly annotated facial-expression databases from movies, IEEE Multimed. (2012) 34–41.
[47] A. Dhall, R. Goecke, J. Joshi, K. Sikka, T. Gedeon, Emotion Recognition In The Wild Challenge 2014: baseline, data and protocol, in: Proceedings of the 16th International Conference on Multimodal Interaction, 2014, pp. 461–466.

Xijian Fan received the B.Sc. in Information and Communication Technology from Nanjing University of Posts and Telecommunications, China, and the M.Sc. in Computer Information and Science from Hohai University, China, in 2008 and 2012, respectively. He is currently pursuing a PhD in Engineering at the University of Warwick, U.K. His research interests include image processing and facial expression recognition.

Tardi Tjahjadi received the B.Sc. in Mechanical Engineering from University College London in 1980, and the M.Sc. in Management Sciences in 1981 and the Ph.D. in Total Technology in 1984 from UMIST, U.K. He has been an associate professor at Warwick University since 2000 and a reader since 2014. His research interests include image processing and computer vision.