Novel Metrics of Speech Rhythm for the Assessment of Emotion
Fabien Ringeval¹,³, Mohamed Chetouani², Björn Schuller³
¹ DIVA group, Department of Informatics, University of Fribourg, Switzerland
² Institut des Systèmes Intelligents et de Robotique, Université Pierre et Marie Curie, Paris, France
³ Institute for Human-Machine Communication, Technische Universität München, Germany
fabien.ringeval@unifr.ch, mohamed.chetouani@upmc.fr, schuller@tum.de
Abstract
Although rhythmic speech analysis is known to bear great potential for the recognition of emotion, it is often omitted or reduced to the speaking rate or segmental durations. An obvious explanation is that the characterisation of speech rhythm is not an easy task in itself, and many types of rhythmic information exist. In this paper, we study advanced methods to define novel metrics of speech rhythm. Their ability to characterise spontaneous emotions is demonstrated on the recent Audio/Visual Emotion Challenge task, comprising 3.6 hours of natural human affective conversational speech. Emotion is assessed for the four dimensions Activation, Expectation, Power and Valence as binary classification tasks at the word level. We compare our new rhythmic feature types to the official 2k brute-force acoustic baseline feature set of the Audio Sub-Challenge. In the results, the rhythmic features achieve a promising relative improvement of 16% for Valence, whereas the performance is more contrasted for the other three dimensions.
Index Terms: speech rhythm, prosodic features, emotion
recognition
1. Introduction
Most definitions in the literature consider rhythm as being conveyed by the information perceived during the alternation (or repetition) of events spaced over time. However, this definition covers events of various origins, such as: (i) biological (e.g., feeding, heart beat, respiration), (ii) bodily (e.g., choreography) or (iii) speech [1]. Rhythm thus refers to the notion of dynamic movement in the perception of different types of phenomena. Because these phenomena are very diverse and clearly entangled in speech [2], and since the mechanisms of human perception are themselves very complex [5], identifying the characteristics of rhythm in concrete and simple terms is particularly difficult.
This paper investigates different metrics of speech rhythm, with the aim of studying their relevance for the characterisation of spontaneous emotions in natural human conversational speech. The goal of this work is to help provide new relevant features for the analysis of affective spontaneous interactions.
In the remainder of this paper we briefly introduce the rhythmic phenomena that have been identified in the literature (Sec. 2) and the metrics of speech rhythm (Sec. 3), including the taxonomical models (Sec. 3.1) and the new advanced ones (Sec. 3.2). We then describe the experimental setup (Sec. 4), including the database (Sec. 4.1) and the rhythmic feature set (Sec. 4.2), and present the experimental results (Sec. 5) before concluding (Sec. 6).
2. Speech Rhythm
A literature survey on rhythm shows how difficult it is to give a precise definition of what it refers to, since many conceptual and terminological inventories are available [3]. However, a set of fairly well-established phenomena has been identified so far, such as: (i) the duality between form and structure [4], (ii) certain temporal distortions [5] and (iii) preferential anchoring points depending on the language [1].
The majority of studies conducted on speech rhythm were guided by a taxonomical spirit, i.e., they aimed at a classification of languages [1]. The emergence of new metrics of rhythm in the last decade brought a revival of interest in the taxonomical community. Temporal properties of consonants and vowels were used to argue for the existence of a rhythmic continuum between stress (e.g., English and German) and syllabic (e.g., Spanish and French) languages [6].
Because emotions clearly rely on dynamic processes in both their production and their perception [7], new advanced models of speech rhythm should help provide relevant features for their characterisation. Indeed, the automatic processing of spontaneous and natural emotions remains a challenge, especially when it comes to the analysis of real data.
3. Metrics of Speech Rhythm
In the following paragraphs we describe several techniques that have been proposed to characterise the rhythmic information conveyed by speech. We first present the techniques used by the taxonomical community and then describe our novel metrics of speech rhythm.
3.1. Taxonomical Models
The taxonomical models of speech rhythm use measurements of segmental duration or of the number of segments of a given speech unit type (e.g., vowels, consonants, words). The speech rate, for example, is often used as the sole rhythm feature in emotion recognition systems [8], although many studies have shown that it is only one component of rhythm [9]. In addition, there are many other metrics whose use for the characterisation of affective correlates of speech deserves to be studied.
3.1.1. Vocalic and Consonantal Variability Phenomena.
Ramus et al. proposed a measure of speech rhythm based on the percentage of vocalic intervals (%V) and the standard deviation of consonantal intervals (∆C), with the aim of quantifying a rhythmic continuum between stress and syllabic languages [10]. However, these measures are only relevant for the study of corpora in which the speech rate is strictly controlled.
3.1.2. Compensatory Phenomena (Oscillatory Mechanisms).
Brady et al. used circular statistics to study the cognitive processing of speech rhythm in Japanese [11]. After detecting the onsets of voiced syllables, a sinusoidal waveform is generated whose period is set to the average interval duration between the segments. The position of each segment in time is thus represented by a phase value θi in the generated sinusoid. The periodicity of the segments, R̄, is then quantified from the sum of the vectors corresponding to the phase values θi, normalised by the number N of segments:
\bar{R} = \frac{1}{N}\left[\left(\sum_{i=1}^{N} \sin 2\pi\theta_i\right)^2 + \left(\sum_{i=1}^{N} \cos 2\pi\theta_i\right)^2\right]^{1/2}    (1)
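For illustration, here is a minimal sketch of this periodicity measure (Eq. 1), assuming that syllable onset times (in seconds) are already available; the onset-detection step of [11] is not reproduced, and the phase assignment is a simplification of the original procedure.

```python
import numpy as np

def periodicity(onsets):
    """Periodicity R-bar of segment onsets (Eq. 1): onsets are mapped to
    phases of a sinusoid whose period is the mean inter-onset interval,
    and the mean resultant length of the phase vectors is returned
    (1 = perfectly periodic, 0 = uniformly spread phases)."""
    onsets = np.asarray(onsets, dtype=float)
    period = np.diff(onsets).mean()          # period of the reference sinusoid
    theta = (onsets - onsets[0]) / period    # position of each onset in cycles
    n = len(onsets)
    return np.hypot(np.sin(2 * np.pi * theta).sum(),
                    np.cos(2 * np.pi * theta).sum()) / n

# Example: perfectly regular onsets give R-bar = 1.0
print(periodicity([0.0, 0.25, 0.50, 0.75, 1.00]))
```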
3.1.3. Variation Coefficient of Segmental Duration.
The coefficient of variation (Varco), defined as the ratio between the standard deviation σ and the mean µ of a given distribution, was applied to vocalic and consonantal intervals in [12]. Combined with %V, this measure allowed syllabic languages to be discriminated from stress languages. However, the differences were found to be at least as large between dialects of these languages as between the languages themselves.
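As an illustration of these interval-based measures (Secs. 3.1.1 and 3.1.3), the following minimal sketch computes %V, ∆C and Varco from already segmented vocalic and consonantal interval durations; the segmentation itself is assumed and the toy durations are ours.

```python
import numpy as np

def interval_metrics(vowel_dur, cons_dur):
    """%V and Delta-C [10] and Varco [12] from vocalic and consonantal
    interval durations (seconds)."""
    v = np.asarray(vowel_dur, dtype=float)
    c = np.asarray(cons_dur, dtype=float)
    percent_v = 100.0 * v.sum() / (v.sum() + c.sum())   # %V
    delta_c = c.std()                                    # Delta-C
    varco_c = c.std() / c.mean()                         # Varco = sigma / mu
    varco_v = v.std() / v.mean()
    return percent_v, delta_c, varco_c, varco_v

# Toy example with hand-picked durations (seconds)
print(interval_metrics([0.08, 0.12, 0.10], [0.06, 0.15, 0.09, 0.07]))
```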
3.1.4. Pair-wise Variability Index.
Grabe and Low proposed to measure the temporal variability of pairs of successive phonetic intervals (Ik and Ik+1) with the rPVI [13] (2); a version normalised for speaking rate, the nPVI (3), has also been proposed. These measures helped to strengthen the theory of rhythmic language classes outlined above.
rPVI(k) = I_{k+1} - I_k    (2)

nPVI(k) = 2 \, \frac{I_{k+1} - I_k}{I_k + I_{k+1}}    (3)

Other studies suggested that the comparison of time intervals between phonetic units could be achieved by using their ratio instead of their difference [6]. The rhythm ratio (RR) measure (4) provided results close to the nPVI on corpora of languages.

RR(k) = \frac{I_k}{I_{k+1}}    (4)
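A minimal sketch of the three pair-wise measures above, applied to a sequence of interval durations following Eqs. (2)-(4); note that utterance-level values are commonly obtained by averaging the absolute per-pair values, which the example below also shows.

```python
import numpy as np

def pairwise_metrics(intervals):
    """Per-pair rPVI (Eq. 2), nPVI (Eq. 3) and RR (Eq. 4) for a sequence
    of successive interval durations I_1 ... I_N."""
    I = np.asarray(intervals, dtype=float)
    rpvi = I[1:] - I[:-1]                              # Eq. (2)
    npvi = 2.0 * (I[1:] - I[:-1]) / (I[1:] + I[:-1])   # Eq. (3)
    rr = I[:-1] / I[1:]                                # Eq. (4)
    return rpvi, npvi, rr

rpvi, npvi, rr = pairwise_metrics([0.10, 0.14, 0.09, 0.11])
# Utterance-level values: mean of absolute per-pair values
print(np.abs(rpvi).mean(), np.abs(npvi).mean(), rr.mean())
```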
3.2. Advanced Models
Some authors have proposed to expand the definitions of the metrics used in the taxonomical models [14]. It has been suggested that rhythmic phenomena could be generated by the dynamics of prosody. Indeed, Lerdahl et al. showed that pitch serves to distinguish three kinds of accent in music: (i) metrical, (ii) phenomenal and (iii) structural [15]. Moreover, prosodic particularities seem to be related to the strong beats of rhythm, since they fall at: (i) important changes or low values in the pitch, (ii) changes in the harmonics or (iii) changes in rate.
In the following, we first present two advanced methods of feature extraction that use the distribution of speech units over time to quantify rhythm, whereas the final two techniques use both temporal information and changes in the prosodic shape of consecutive speech units to characterise speech rhythm.
3.2.1. Low Fourier Frequency Analysis.
Tilsen et al. proposed a method to extract the rhythmic envelope of the speech signal [16]. A low-frequency (LF) signal is computed from the speech signal through several filtering steps that are intended to mimic the perceptual processing of rhythm. Since the waveform of the LF signal is stationary, a Fourier transform can be used to estimate entropy, centroid and mean frequency descriptors of the rhythmic envelope.
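The following is a rough sketch of this kind of LF envelope analysis; the band limits, cut-offs and descriptor definitions are our own illustrative choices and do not reproduce the exact processing chain of [16].

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lf_rhythm_descriptors(x, fs, lf_fs=80):
    """Band-pass the speech signal, rectify, low-pass, downsample, then
    describe the spectrum of the resulting low-frequency envelope."""
    b, a = butter(4, [700 / (fs / 2), 1300 / (fs / 2)], btype='band')
    band = filtfilt(b, a, x)                  # vowel-energy band (assumed)
    env = np.abs(band)                        # rectification
    b, a = butter(4, 10 / (fs / 2), btype='low')
    env = filtfilt(b, a, env)                 # keep rhythm-rate modulations
    env = env[::int(fs // lf_fs)]             # downsample to ~lf_fs Hz
    amp = np.abs(np.fft.rfft(env - env.mean()))
    power = amp ** 2
    freqs = np.fft.rfftfreq(len(env), d=1.0 / lf_fs)
    p = power / power.sum()                   # normalised power spectrum
    entropy = -np.sum(p * np.log2(p + 1e-12))             # spectral entropy
    centroid = np.sum(freqs * amp) / np.sum(amp)          # spectral centroid
    mean_freq = np.sum(freqs * power) / np.sum(power)     # mean frequency
    return entropy, centroid, mean_freq
```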
3.2.2. Instantaneous Frequency and Envelope.
We proposed in [17] to use the Hilbert-Huang Transform (HHT) to characterise speech rhythm. A speech unit interval (SUI) signal is first generated by a resampling process (cubic splines, Fs = 32 Hz) applied to the interval durations of speech units. Because the interval duration between phonemes is known to vary from 60 ms to 1 second (i.e., from 1 Hz to 16 Hz) [18], all frequencies of speech rhythm can be captured with a sampling frequency of 32 Hz. Empirical mode decomposition (EMD) is then applied to the SUI signal to extract the HHT derived features: the instantaneous amplitude and frequency of the sum of the first three intrinsic mode functions (provided by the EMD), and the mean instantaneous frequency (MIF) obtained by the calculation proposed in [19].
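A minimal sketch of this pipeline is given below, assuming the PyEMD package for the empirical mode decomposition; unit onset times and interval durations are taken as given, and the amplitude-weighted MIF is a simple stand-in for the exact computation of [19].

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import hilbert
from PyEMD import EMD   # assumes the PyEMD package is installed

FS = 32  # Hz; covers rhythm frequencies up to 16 Hz (Nyquist)

def hht_rhythm_features(unit_times, unit_intervals):
    """SUI/HHT analysis in the spirit of [17]: interval durations between
    speech units are resampled to a uniform 32 Hz signal with cubic splines,
    decomposed with EMD, and the instantaneous amplitude/frequency of the
    sum of the first three IMFs is obtained via the Hilbert transform."""
    t = np.asarray(unit_times, dtype=float)
    d = np.asarray(unit_intervals, dtype=float)
    t_uniform = np.arange(t[0], t[-1], 1.0 / FS)
    sui = CubicSpline(t, d)(t_uniform)        # speech unit interval signal
    imfs = EMD().emd(sui)
    comp = imfs[:3].sum(axis=0)               # sum of the first three IMFs
    analytic = hilbert(comp)
    inst_amp = np.abs(analytic)
    inst_freq = np.diff(np.unwrap(np.angle(analytic))) / (2 * np.pi) * FS
    # MIF: energy-weighted average of the instantaneous frequency
    mif = np.sum(inst_freq * inst_amp[:-1] ** 2) / np.sum(inst_amp[:-1] ** 2)
    return inst_amp, inst_freq, mif
```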
3.2.3. Prosodic Pair-wise Variability Index.
In order to characterise rhythm through the dynamics of prosody, we propose to extend the PVI [13]. The time interval measurement Ik is replaced by the Varco coefficient of a given prosodic low-level descriptor (LLD), such as pitch, loudness or spectral flux [20]. A normalisation factor α is used to take into account the durations d of the consecutive segments k and k + 1 and their interval duration Ik in the pPVI metric:
pPVI(k) = \alpha \, (Varco_{k+1} - Varco_k)    (5)

with \alpha = \log \frac{d_k \, d_{k+1} \, I_k}{d_k + d_{k+1} + I_k} and Varco = \frac{\sigma}{\mu}.
The value of this feature is equal to zero if the dispersion of the LLD is identical on two consecutive speech segments, which indicates monotony in the prosodic component. Otherwise, the values depend on the amount of change in the Varco of the LLD between the speech segments. Due to the normalisation factor α, the values of the pPVI also depend on both the durations and the interval of the two consecutive segments; these effects are cumulative. The logarithm of the duration ratio is taken to reduce its variability. Finally, a local maximum (or minimum) in the pPVI defines a prominence in a given prosodic LLD.
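A minimal sketch of the pPVI computation of Eq. (5) for a single LLD is given below; the per-segment LLD frame values, segment durations and inter-segment intervals are assumed to be available, and we use milliseconds for durations (which keeps α positive for typical values).

```python
import numpy as np

def ppvi(lld_segments, durations, intervals):
    """pPVI (Eq. 5) over consecutive speech segments. Each entry of
    lld_segments holds the frame values of one prosodic LLD within one
    segment; durations d_k and inter-segment intervals I_k are in
    milliseconds (len(intervals) == len(durations) - 1)."""
    varco = np.array([np.std(s) / np.mean(s) for s in lld_segments])
    d = np.asarray(durations, dtype=float)
    I = np.asarray(intervals, dtype=float)
    alpha = np.log(d[:-1] * d[1:] * I / (d[:-1] + d[1:] + I))
    return alpha * (varco[1:] - varco[:-1])

# Toy example: three segments of pitch values (Hz)
segs = [np.array([110., 115., 120.]), np.array([180., 150., 130.]),
        np.array([120., 121., 119.])]
print(ppvi(segs, durations=[300., 250., 400.], intervals=[100., 200.]))
```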
3.2.4. Prosodic Hotelling Distance.
The Hotelling distance (HD) is a measure for comparing the statistical distributions of two data sets via a calculation similar to the Mahalanobis distance. In particular, it involves a normalisation factor based on the duration of the two compared speech segments. However, as the interval duration between these segments is not included in the HD calculation, we added this value through the normalisation coefficient α. This new metric is termed the prosodic Hotelling distance (PHD):
PHD(k) = \alpha \left[ (\mu_k - \mu_{k+1})^T \, \Sigma_{k \cup k+1}^{-1} \, (\mu_k - \mu_{k+1}) \right]    (6)
where µk , µk+1 denote the means of a prosodic LLD on the
past (k) and new (k + 1) speech segments, and Σk∪k+1 the
covariance matrix of the joint events k and k + 1.
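For illustration, here is a minimal sketch of the PHD computation, assuming per-segment matrices of LLD frame values and durations in milliseconds; the two options for Σ_{k∪k+1} described in the following paragraph are exposed via a flag, and all names are ours.

```python
import numpy as np

def phd(X_k, X_k1, d_k, d_k1, I_k, diagonal=False):
    """PHD (Eq. 6) between two consecutive speech segments. X_k and X_k1
    are (frames x LLDs) matrices of prosodic LLD values; d_k, d_k1 are the
    segment durations and I_k their interval duration (milliseconds).
    With diagonal=True, Sigma keeps only the per-LLD standard deviations
    on its diagonal (variant (i)); otherwise the full covariance matrix of
    the joint frames is used (variant (ii))."""
    mu_k, mu_k1 = X_k.mean(axis=0), X_k1.mean(axis=0)
    joint = np.vstack([X_k, X_k1])
    if diagonal:
        sigma = np.diag(joint.std(axis=0))
    else:
        sigma = np.cov(joint, rowvar=False)
    alpha = np.log(d_k * d_k1 * I_k / (d_k + d_k1 + I_k))
    diff = mu_k - mu_k1
    return alpha * diff @ np.linalg.solve(sigma, diff)

# Toy example with two LLDs (e.g., pitch and loudness) per frame
X_k  = np.array([[110., 0.40], [115., 0.50], [120., 0.45]])
X_k1 = np.array([[180., 0.80], [150., 0.70], [130., 0.75]])
print(phd(X_k, X_k1, d_k=300., d_k1=250., I_k=100.))
```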
This measure can be used for one or several LLDs, and two different techniques are available to define the matrix Σk∪k+1: (i) the first consists of filling the diagonal with the standard-deviation values of each LLD, and (ii) the second exploits all values of the covariance matrix.
The value of the PHD is equal to zero when the distributions of the prosodic LLD(s) are identical between pairs of consecutive segments, and positive in all other cases. It varies proportionally with the amount of change in the statistical distribution of the LLD(s), and the normalisation factor α also influences the values of the PHD, as for the pPVI.

Table 1: Overview of the AVEC dataset per partition.
                       Train     Develop   Test
# Sessions             31        32        32
# Words                20 183    16 311    13 856
Avg. word dur. [ms]    262       276       249

Table 2: Overview of class balance: fraction of positive instances over total instances of words in the training and development partitions.
Ratio      Act.     Exp.     Power    Valence
Train      0.496    0.409    0.560    0.554
Develop    0.581    0.334    0.670    0.654
4. Experimental Setup
In this section we describe the methodology we used for the
AVEC emotion recognition Audio Sub-Challenge. Our approach consists of comparing the relevance of our rhythmic features with that of the official 2k brute-force acoustic baseline feature set.
4.1. Database
The Solid Sensitive Artificial Listener (Solid-SAL) part of the SEMAINE corpus was used for the AVEC challenge [21]. In this database, participants were asked to talk in turn to four emotionally stereotyped human operators. The language used was English, and the sessions were split into three partitions: train, develop and test. Table 1 shows the distribution of data in sessions and words for each partition. Activation, Expectation, Power and Valence are the annotated emotion dimensions of the Solid-SAL corpus. The binary labels of each affective dimension were obtained at the word level by thresholding the continuous values that were rated with the Feeltrace tool. Table 2 lists the fraction of positive instances per partition and per dimension of the Solid-SAL corpus.
4.2. Rhythmic Features
The taxonomical models of speech rhythm provide 6 LLDs at the word level: duration, interval, rPVI, nPVI and RR (for both duration and interval). Since functionals are computed on these LLDs for each word, their values were first resampled by cubic splines (Fs = 32 Hz). 30 functionals (cf. Table 1 in [17], plus the second-order regression coefficient, the mean and standard deviation of rising / falling values, and the mean number of rising / falling values) were then computed on the 6 LLDs. 4 additional features (word rate, periodicity, and the Varco of durations and of intervals) were merged, so that 184 features of speech rhythm are returned in total by the taxonomical models.
The new advanced models provide many more features. The pPVI and PHD metrics were computed at the word level on each LLD returned by the openSMILE feature extraction toolkit [22]: 25 energy and spectral related LLDs plus 6 voicing related LLDs, each with delta coefficients. The obtained values were resampled by cubic splines (Fs = 32 Hz) and the 30 functionals were computed at the word level. Finally, the 30 functionals were also applied to the HHT derived features (instantaneous envelope and frequency), and 4 additional features (the MIF, plus the entropy, centroid and mean frequency of the LF-filtered speech signal) were merged into the feature vector; 3784 features of speech rhythm are returned in total by the new advanced models.
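As a small sketch of how a single LLD trajectory is turned into word-level features in this kind of setup: the values are resampled to a uniform 32 Hz contour and functionals are applied. Only a handful of the 30 functionals are shown here, and the helper names are ours.

```python
import numpy as np
from scipy.interpolate import CubicSpline

FS = 32  # Hz, as used for all rhythmic LLDs in this work

def resample_and_functionals(times, values):
    """Resample one word-level LLD trajectory to 32 Hz with cubic splines
    and compute a few illustrative functionals (the full set of 30
    functionals follows [17] and is not reproduced here)."""
    t = np.asarray(times, dtype=float)
    v = np.asarray(values, dtype=float)
    t_uni = np.arange(t[0], t[-1], 1.0 / FS)
    contour = CubicSpline(t, v)(t_uni)
    slope = np.polyfit(t_uni, contour, 1)[0]      # 1st-order regression coeff.
    curvature = np.polyfit(t_uni, contour, 2)[0]  # 2nd-order regression coeff.
    return {"mean": contour.mean(), "std": contour.std(),
            "slope": slope, "curvature": curvature}
```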
5. Results
The goal of our experiments is to evaluate the relevance of both taxonomical and new advanced models of speech rhythm for emotion recognition. In order to compare these rhythmic feature sets with the baseline one, we used the same classification strategy as for the challenge baseline: Support Vector Machine (SVM) classification with a linear kernel, Sequential Minimal Optimisation (SMO) for learning, and the complexity parameter optimised on the development partition. The SMO implementation in the WEKA toolkit was used. Because the rhythmic features are numerous and include some redundancy, especially for the new advanced models, we used a correlation-based feature selection method (CFS). This technique reduces the feature space by keeping the features that are highly correlated with the emotional classes while having low inter-correlation.
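For concreteness, below is a simplified sketch of this kind of correlation-based selection; it greedily maximises the standard CFS merit using plain Pearson correlations and is only an approximation of, not a substitute for, the exact WEKA implementation used here.

```python
import numpy as np

def cfs_greedy(X, y, max_features=20):
    """Greedy forward selection maximising the CFS merit
    k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the mean absolute
    feature-class correlation and r_ff the mean absolute feature-feature
    correlation of the current subset."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_feat = X.shape[1]
    r_cf = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    r_ff = np.abs(np.corrcoef(X, rowvar=False))
    selected, best_merit = [], -np.inf
    while len(selected) < max_features:
        best_j, best_new = None, best_merit
        for j in set(range(n_feat)) - set(selected):
            subset = selected + [j]
            k = len(subset)
            rcf = r_cf[subset].mean()
            rff = (r_ff[np.ix_(subset, subset)][np.triu_indices(k, 1)].mean()
                   if k > 1 else 0.0)
            merit = k * rcf / np.sqrt(k + k * (k - 1) * rff)
            if merit > best_new:
                best_new, best_j = merit, j
        if best_j is None:       # no candidate improves the merit: stop
            break
        selected.append(best_j)
        best_merit = best_new
    return selected
```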
When CFS is performed on all available features of speech rhythm, only 11 of them are kept for Activation and Valence, 8 for Expectation and 6 for Power. On average, 45% of these features are derived from the pPVI and PHD metrics, 33% from the LF signal, 12% from the HHT and the remaining 10% from the taxonomical models. This result shows the relevance of using dynamics to characterise emotion from speech on two different levels: local dynamics (pPVI and PHD) and global dynamics (LF). Indeed, local information on both temporal and prosodic shape changes over consecutive words is captured by the pPVI and PHD metrics, whereas the spectrum of the LF signal provides global information on the envelope of speech rhythm. The results also show that using local information (i.e., interval durations between words) to estimate the envelope and the frequency of speech rhythm (HHT) appears to be much less relevant than using global information (i.e., the spectrum of the LF signal). Results obtained by the rhythmic features in the emotion recognition task, as well as those of the baseline (2k brute-force acoustic features [21]), are given in Table 3 for each affective dimension.
In the results, the speech rhythm features outperform the baseline for Valence. The relative improvement obtained by the taxonomical features is equal to 27% for the develop partition and 16% for the test partition (UA measure); 28% and 4%, respectively, for the advanced features. On the other hand, results are much more contrasted for the other three dimensions. The rhythmic features achieve a lower performance than the baseline for Activation on both the develop and test partitions, except with the WA measure and the advanced models on test. The best scores for Expectation are achieved by the rhythmic features on the develop partition, whereas the baseline performs best on the test partition. Concerning Power, the speech rhythm features provide the best performance in all cases except for the test partition with the WA measure. On average, the baseline achieves the best performance on the test partition, and the speech rhythm features perform best on the develop partition with the UA measure.
Table 3: Results on the AVEC 2011 Audio Sub-Challenge by the competition measure accuracy for speech rhythm and baseline features. WA stands for weighted accuracy, UA for unweighted accuracy. All values in [%].

Accuracy [%]            Activation    Expectation   Power         Valence       All
                        WA     UA     WA     UA     WA     UA     WA     UA     WA     UA
Taxonomical features
  Develop               43.5   52.4   66.7   66.8   67.7   67.2   66.9   66.9   61.2   63.3
  Test                  54.8   53.1   47.9   50.0   26.2   41.4   50.7   54.8   44.9   49.8
Advanced features
  Develop               45.2   51.0   67.3   67.3   66.9   67.1   67.2   67.9   61.7   63.3
  Test                  57.6   54.1   51.1   53.0   20.0   50.0   46.5   49.2   43.8   51.6
Baseline features
  Develop               63.7   64.0   63.2   52.7   65.6   55.8   58.1   52.9   62.7   56.4
  Test                  55.0   57.0   52.9   54.5   28.0   49.1   44.3   47.2   45.1   52.0
In conclusion, the features of speech rhythm outperform the prosodic features for emotion recognition on Valence, and achieve much more contrasted results on the other dimensions. While there are differences in the classification results between the two types of features, more detailed investigation needs to be carried out in the future to understand the relationship between these two types of measures and each emotional dimension. Also note that even though the taxonomical models of speech rhythm were proposed with a rather different goal in mind, i.e., to quantify cross-linguistic differences rather than intra-language variation due to emotion, they still achieve good performance, with the best scores in some specific cases.
6. Conclusion
Both taxonomical and novel advanced models of speech rhythm were used for spontaneous emotion recognition on the recent Audio/Visual Emotion Challenge task, which includes 3.6 hours of natural human affective conversational speech. The performance of these feature sets was compared with that of the usual prosodic feature set. The rhythmic features achieve a promising relative improvement of 16% for Valence, whereas the performance is more contrasted for the other three dimensions. This study thus shows for the first time the relevance of using advanced models of speech rhythm for the characterisation of emotional correlates of speech, especially for Valence. Future work will use specific acoustic anchors of speech (e.g., automatically detected pseudo-phonemes [23]) to provide different structural bases for the metrics of speech rhythm, as well as fusion techniques to estimate their complementarity in the emotion recognition task.
7. References
[1] F. Cummins, “Speech rhythm and rhythmic taxonomy,” in Speech
Prosody, Aix-en-Provence, France, 2002, pp. 121–126.
[2] S. Tilsen, “Multitimescale dynamical interactions between speech
rhythm and gesture,” Cogn. Sci., vol. 33, pp. 839–879, 2009.
[3] J. R. Evans and M. Clynes, Rhythm in psychological, linguistic,
and musical processes. Springfield, Charles C. Thomas, 1986.
[4] P. Fraisse, Les structures rythmiques : étude psychologique. Publications Universitaires de Louvain, 1956.
[5] E. Zwicker and H. Fastl, Psychoacoustics: facts and models.
Springer-Verlag, Heidelberg, 1990.
[6] D. Gibbon and U. Gut, “Measuring speech rhythm,” in Eurospeech, Aalborg, Denmark, 2001, pp. 95–98.
[7] K. R. Scherer, “Psychological models of emotion,” The neuropsychology of emotion, pp. 137–162, 2000.
[8] J. Ang, R. Dhillon, E. Shriberg, and A. Stolcke, “Prosody-based automatic detection of annoyance and frustration in human-computer dialog,” in Interspeech, 7th ICSLP, Denver (CO), USA, 2002, pp. 67–79.
[9] V. Dellwo, “The role of speech rate in perceiving speech rhythm,”
in Speech Prosody, Campinas, Brasil, 2008, pp. 375–378.
[10] F. Ramus, M. Nespor, and J. Mehler, “Correlates of linguistic
rhythm in the speech signal,” Cognition - International Journal
of Cognitive Science, pp. 265–292, 1999.
[11] M. C. Brady and R. F. Port, “Speech rhythm and rhythmic taxonomy,” in 16th ICPhS, Saarbrücken, Germany, 2006, pp. 337–342.
[12] V. Dellwo, “Rhythm and speech rate: A variation coefficient for
∆C,” in Lang. and Lang. Proc., 38th Ling. Colloq., Piliscsaba,
Hungary, 2006, pp. 231–241.
[13] E. Grabe and E. Low, “Durational variability in speech and the rhythm class hypothesis,” Papers in Laboratory Phonology VII, vol. 7, pp. 515–546, 2002.
[14] L. M. Smith, “A multiresolution time-frequency analysis and interpretation of musical rhythm,” Ph.D. dissertation, University of
Western Australia, 2000.
[15] F. Lerdahl and R. Jackendoff, A generative theory of tonal music.
MIT Press, 1996.
[16] S. Tilsen and K. Johnson, “Low-frequency Fourier analysis of
speech rhythm,” J. of Acoust. Soc. of Amer., vol. 124, no. 2, pp.
34–39, 2008.
[17] F. Ringeval and M. Chetouani, “Hilbert-Huang transform for nonlinear characterization of speech rhythm,” in NOLISP, Vic, Spain,
2009.
[18] R. Drullman, J. M. Festen, and R. Plomp, “Effect of temporal envelope smearing on speech reception,” J. of Acous. Soc. of America, vol. 95, pp. 1053–1064, 1994.
[19] H. Xie and Z. Wang, “Mean frequency derived via Hilbert-Huang
transform with application to fatigue EMG signal analysis,” Computer Methods and Programs in Biomedicine, vol. 82, no. 2, pp.
114–120, 2006.
[20] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner,
L. Devillers, L. Vidrascu, L. Noam, A. Kessous, and V. Aharonson, “The relevance of feature type for the automatic classification
of emotional user states: low level descriptors and functionals,” in
Interspeech, Antwerp, Belgium, 2007, pp. 2253–2256.
[21] B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and
M. Pantic, “AVEC 2011 — The first international audio/visual
emotion challenge,” D’Mello, S. et al. (eds.), ACII 2011, LNCS
6975, pp. 415–424, 2011.
[22] F. Eyben, M. Wöllmer, and B. Schuller, “The Munich versatile
and fast open-source audio feature extractor,” in ACM Multimedia
(MM), Florence, Italy, 2010, pp. 1459–1462.
[23] F. Ringeval and M. Chetouani, “A vowel based approach for acted
emotion recognition,” in Interspeech, Brisbane, Australia, 2008,
pp. 2763–2766.