IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 2, FEBRUARY 2011

Automatic Detection of Pathological Voices Using Complexity Measures, Noise Parameters, and Mel-Cepstral Coefficients

Julián D. Arias-Londoño*, Student Member, IEEE, Juan I. Godino-Llorente, Senior Member, IEEE, Nicolás Sáenz-Lechón, Víctor Osma-Ruiz, and Germán Castellanos-Domínguez

Abstract—This paper proposes a new approach to improve the amount of information extracted from speech, aiming to increase the accuracy of a system developed for the automatic detection of pathological voices. The paper addresses the discrimination capabilities of 11 features extracted using nonlinear analysis of time series. Two of these features are based on conventional nonlinear statistics (largest Lyapunov exponent and correlation dimension), two are based on recurrence and fractal-scaling analysis, and the remaining are based on different estimations of the entropy. Moreover, this paper uses a strategy based on combining classifiers for fusing the nonlinear analysis with the information provided by classic parameterization approaches found in the literature (noise parameters and mel-frequency cepstral coefficients). The classification was carried out in two steps using, first, a generative and, later, a discriminative approach. Combining both classifiers, the best accuracy obtained is 98.23% ± 0.001.

Index Terms—Combining classifiers, Gaussian mixture models (GMMs), nonlinear analysis, pathological voices, support vector machines (SVMs).

Manuscript received February 23, 2010; revised June 8, 2010 and September 27, 2010; accepted August 22, 2010. Date of publication October 21, 2010; date of current version January 21, 2011. This work was supported by the Spanish Ministry of Education under Grant TEC2006-12887-C02 and by the Convocatoria de apoyo a doctorados nacionales 2007, COLCIENCIAS. Asterisk indicates corresponding author.

*J. D. Arias-Londoño is with the Department ICS, Universidad Politécnica de Madrid, Madrid 28031, Spain, and also with GC&PDS, Universidad Nacional de Colombia, Manizales, Colombia (e-mail: jdariasl@unal.edu.co).

J. I. Godino-Llorente, N. Sáenz-Lechón, and V. Osma-Ruiz are with the Department ICS, Universidad Politécnica de Madrid, Madrid 28031, Spain (e-mail: igodino@ics.upm.es; nslechon@ics.upm.es; vosma@ics.upm.es).

G. Castellanos-Domínguez is with the Department of Electrical, Electronic, and Computational Engineering, Universidad Nacional de Colombia, Manizales, Colombia (e-mail: cgcastellanosd@unal.edu.co).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TBME.2010.2089052

I. INTRODUCTION

Research on automatic systems to assess voice disorders has received considerable attention in the past few years due to its objectivity and noninvasive nature. Much of the work done in this area is based on the use of acoustic parameters: amplitude and frequency perturbation parameters, noise parameters, and mel-frequency cepstral coefficients (MFCC) [3]. However, several researchers have pointed out that voice production involves some nonlinear processes that cannot be characterized by the aforementioned measures. Such behavior is due to the nonlinear pressure-flow relationship in the glottis, the delayed feedback of the mucosal wave, the nonlinear stress-strain curves of vocal fold tissues, and the nonlinearities associated with vocal fold collision [4]. Titze et al. [5] introduced a qualitative classification for speech sounds corresponding to sustained vowels, taking into account their nonlinear dynamics. The authors established three classes: Type I sounds are nearly periodic, Type II are aperiodic or do not have a dominant period, and Type III are irregular and aperiodic. Normal voices can usually be classified as Type I and, sometimes, as Type II, whereas voice disorders commonly lead to any of these three classes [6]. Besides, the conventional perturbation parameters (such as shimmer and jitter) are defined only for nearly periodic voice signals, and thus, their usefulness for Type II and Type III signals is limited [7]. In this sense, some researchers have been interested in applying nonlinear time series analysis to disordered speech, attempting to characterize the nonlinear phenomena and to evaluate the discriminative capabilities of these measures for the detection of pathological voices (see [7] and the references cited therein).

The nonlinear analysis of time series is derived from the theory of dynamical systems, and is usually carried out using two statistics: the largest Lyapunov exponent (LLE) and the correlation dimension (CD). LLE is a measure that attempts to quantify the sensitivity to initial conditions of the underlying system [8]. CD is a measure developed for quantifying the geometry (self-similarity) of the state space of the underlying system [8]. Previous works have investigated the behavior of LLE and CD for the characterization of pathological voices. In [9], CD was used to describe the complexity of the speech signals uttered by normal speakers and by patients with vocal polyps. From each utterance, the CD was estimated using frames of 200 ms. The authors demonstrated that CD values from normal and pathological speakers have statistically significant differences, concluding that nonlinear analysis can be used as a supplementary method to evaluate and detect laryngeal pathologies. Zhang and Jiang [10] used the CD to discriminate between three types of speech signals according to the aforementioned definition by Titze et al. [5]. The database contained different types of pathologies, but the speech signals with strong glottal pulse noise were excluded. Unlike the previous work, the estimation of the CD was carried out over frames of 500 ms. Again, the authors concluded that CD tends to increase from Type I to Type III signals, but a classification rate was not presented.


Similar studies [11]-[15] used CD to characterize pathological voices before and after a clinical treatment, leading to similar conclusions. In [16], CD and LLE, along with other complexity measures, were used to characterize speech recordings extracted from the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database [17]. The authors performed different classification experiments using support vector machines (SVM) with a polynomial kernel. Eighty percent of the speech recordings were employed to train the SVM and the remaining to validate the system. Only one run of the training was used to estimate the accuracy. The authors obtained an accuracy of up to 94.4% using only CD. In [1], the LLE was used to differentiate between normal voices and patients with unilateral laryngeal paralysis. The authors found statistically significant differences between both groups.

Although LLE and CD have shown certain discrimination capabilities, such nonlinear statistics require the dynamics of speech to be purely deterministic, and this assumption is inadequate, since randomness due to turbulence is an inherent part of speech production [6]. Besides, these measures have been used under the assumption of the presence of chaos or of a completely random behavior. However, in [18], Pincus demonstrated that there exist stochastic processes with CD equal to zero and that, in general, it is not valid to infer the presence of an underlying deterministic system from the convergence of the algorithms designed to estimate these measures. In the case of LLE, many analyses are based on the fact that, generally, a system containing at least one positive Lyapunov exponent is defined as chaotic, whereas a system with no positive exponents is regular (as in the case of dynamical systems). From this assumption, other authors concluded that an irregular phonation presents a chaotic dynamic [7]. Nevertheless, the sign of the Lyapunov exponent does not present statistically significant differences [19] and, furthermore, using LLE to differentiate between normal and pathological voices leads to positive values for both classes [1], [7]. There are also numerical and algorithmic problems associated with the calculation of nonlinear measures for speech signals, casting doubts on the reliability of such tools to develop systems for pathological voice detection [6].

To overcome these restrictions, the literature reports a set of features based on information theory. Such measures attempt to quantify the signal complexity in a way such that there is no need to make assumptions about the nature of the signal (i.e., deterministic or stochastic). This idea is in concordance with the fact that the time series generated by biological systems most likely contain both deterministic and stochastic components; therefore, both approaches may provide complementary information about the underlying dynamics [20]. The most common measure used in this context is the approximate entropy (AE) [18], [21], together with other measures derived from it, such as the sample entropy (SE) [22] and the Gaussian kernel approximate entropy (GAE) [23]. AE is a regularity statistic that quantifies the unpredictability of the fluctuations in a time series, and reflects the likelihood that similar patterns of observations will not be followed by additional similar observations [24]. This class of measures provides a better parameterization of the nonlinear behavior [25], but its use in the context of pathological speech has not been extensively explored. In [26] and [27], AE was applied to quantify the effects of radiotherapy in patients with laryngeal cancer, concluding that AE can be used to differentiate healthy speakers from patients that underwent radiotherapy. Again, in [28], AE was used jointly with a scaling parameter to classify between normal and pathological voices. The authors concluded that AE is an effective tool to classify vocal fold disorders, but no results were provided in terms of performance. In [29], another set of entropy-based measures was used for the automatic detection of pathological voices. The study used two voice disorders databases. Among the features used are the Shannon entropy, the first- and second-order Renyi entropies, the correlation entropy, and CD. The results showed a very high classification accuracy using the MEEI database (99.6%). However, the accuracy obtained casts some doubts due to a possible bias in the estimation of the Shannon entropy. Each single feature provided a detection error above 40%, except for the CD and the Shannon entropy, evidencing an important contribution of these two features to the final accuracy. However, for the second database used in that study, the Shannon entropy reveals a detection error around 43%, which seems to lack coherence. In addition, theoretically, the Shannon entropy must be equal to the first-order Renyi entropy; therefore, their combination becomes redundant. Moreover, and again using the MEEI database, the accuracy obtained by the first-order Renyi entropy is extremely different from that obtained with the Shannon entropy, but such a difference does not appear with the second database. Additionally, the Shannon entropy for normal voices and patients with nodules was also estimated in [30] and [31], and the values of the parameters and the classification accuracy obtained were very different from those obtained in [29]. On the other hand, the lengths of the normal and pathological recordings in the MEEI database are very different (normal voices are around 3 s long, whereas pathological ones are 1 s long). Having in mind that the Shannon entropy is the only feature in [29] estimated using the whole recording, the conclusion is that the results could be biased by the different lengths of the recordings.

In [6], Little et al. characterize the deterministic and stochastic dynamics of speech. The deterministic behavior is characterized by a measure from recurrence analysis, and the stochastic components by means of fractal-scaling analysis. This approach reached an accuracy of 91.8% in detecting pathological voices. No comparison with approximate entropy measures was given. In [32], Little et al. applied the same analysis to healthy speakers and to pathological speakers with unilateral vocal fold paralysis evaluated pre- and postsurgery. The measures based on recurrence and fractal-scaling analysis were compared with CD and conventional perturbation parameters. The study used a small database (17 pathological and 11 normal speakers). No classification results were given, but the authors concluded that the nonlinear methods are more stable and reproducible than conventional perturbation parameters.

On the other hand, the estimation of complexity measures requires the reconstruction of a state space (i.e., the embedding attractor) from a time series. From a pattern recognition point of view, complexity measures such as AE, SE, and GAE use a nonparametric estimate of the probability mass function of the embedding attractor obtained with a Parzen-window method with a Gaussian or rectangular kernel [33]. They only attempt to quantify the divergence of the trajectories of the attractor, but do not take into account the directions of divergence.
In this paper, we used a discrete hidden Markov model (DHMM) to estimate a nonparametric density function of the attractor. The aim is to characterize the divergence of the trajectories, and its directions in the state space, in terms of the transitions between regions provided by the DHMM. This scheme was used to estimate two empirical entropy (EE) measures [34]. The discriminative capabilities of the proposed features are studied and compared along this paper, both individually and complementing different nonlinear features.

The results are compared with those obtained using noise parameters and MFCC [35] together. Noise parameters have proven to be reliable for detecting the presence of voice disorders, since most voices present some degree of noise in the presence of pathology. The harmonics-to-noise ratio (HNR) [36], the normalized noise energy (NNE) [37], and the glottal-to-noise excitation ratio (GNE) [38] have been widely used both to evaluate voice quality and for the detection of voice disorders.

Moreover, in order to improve the accuracy of the automatic detection of pathological voices, a two-step strategy has been followed, combining generative and discriminative classifiers fed with the aforementioned nonlinear measures and with a classic analysis based on fusing noise parameters and MFCC.

II. METHODOLOGY

Fig. 1. General scheme of the system developed for the automatic detection of pathological voices.

Fig. 1 depicts a block diagram with the overall scheme of the system used in this paper. Prior to classification, the speech signal was divided into frames.

On one hand, each window was parameterized following a classic approach based on the NNE, GNE, HNR, and 12 MFCC. These features were used before in [38] for the same task, leading to good results, and, in this paper, they are used as a baseline for comparison. In this case, the speech signal was framed and windowed using 40 ms Hamming windows with a 50% frame shift. Therefore, the feature vector extracted for each frame is built by concatenating 12 fast Fourier transform (FFT)-based MFCC and three noise parameters: HNR, NNE, and GNE.

On the other hand, the embedding attractor extracted from the speech frames was parameterized using nonlinear analysis. For each embedding attractor, a set of 11 complexity measures was estimated. The values of the points in the attractor were normalized into the [0, 1] interval; because the complexity measures are based on the distance among different points of the attractor, the normalization allows controlling the bound of these measures.

A first decision about the presence or absence of pathology for each speech signal was taken from the outputs given by a generative classifier. Such a classifier was based on Gaussian mixture models (GMM), which have previously been used for the same task with good results [39]. Each voice record was characterized with the same number of vectors as frames extracted from each recording. Following the scheme depicted in Fig. 1, one set of GMM (i.e., one GMM for pathological and another for normal voices) was trained using the complexity measures, and another set of GMM was trained using the aforementioned combination of noise parameters and MFCC. The first decision about the presence or absence of pathology for each speaker was taken by establishing a decision threshold over the average of the scores given to each speech signal by each classifier. Finally, for each speaker, the outputs of both classifiers were combined using a discriminative approach based on SVM. The final decision was taken by establishing a threshold over the overall output given by the SVM.

In order to allow comparisons using different feature subsets, the evaluation has been carried out following the methodology presented in [3].

A. Embedding

Prior to the estimation of the nonlinear features, an embedding attractor has to be reconstructed. The embedding attractor is the starting point needed to estimate the nonlinear measures. The state-space reconstruction is based on the time-delay embedding theorem [8], which can be stated as follows: given a dynamic system with a d-dimensional solution space and an evolving solution h(t), let x be some observation x(h(t)). Let us also define the lag vector (with dimension m, m > 2d + 1, and common time lag τ) $\mathbf{x}(t) \equiv (x_t, x_{t-\tau}, \ldots, x_{t-(m-1)\tau})$. Then, under very general conditions, the space of vectors x(t) generated by the dynamics contains all the information of the space of solution vectors h(t). The mapping between them is smooth and invertible. This property is referred to as diffeomorphism, and this kind of mapping is referred to as an embedding. The embedding theorem establishes that, when there is only a single sampled quantity from a dynamical system, it is possible to reconstruct a state space that is equivalent to the original (but unknown) state space composed of all the dynamical variables [8]. The points in the state space form trajectories, and the set of trajectories obtained from a time series is known as the attractor.

In this paper, the embedding dimension m was chosen using the improved version of the false neighbors method proposed in [40], and the time delay τ was chosen as the first minimum of the auto mutual information function [8]. Fig. 2 shows four examples of 3-dimensional embedding attractors of speech signals belonging to the MEEI database [17]. Significant differences can be observed between the attractors in Fig. 2(a) and Fig. 2(d), but the differences are not so clear between the attractors in Fig. 2(b) and Fig. 2(c).

Fig. 2. 3-dimensional state spaces reconstructed by using the time-delay embedding theorem. The attractors were reconstructed using frames of 200 ms. (a) Normal voice (file DMA1NAL.NSP) with close trajectories. (b) Normal voice (file PCA1NAL.NSP) with separate trajectories. (c) Pathological voice (file JRF30AN.NSP) with separate trajectories. (d) Pathological voice (file EED07AN.NSP) with no clear dynamic behavior.
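As an illustration of the reconstruction step, the following minimal Python sketch builds a delay-embedded attractor from a speech frame and normalizes it into the [0, 1] interval, as described above. The values of m and τ used here are placeholders (in the paper they are obtained with the false-neighbors method [40] and the first minimum of the auto mutual information), and the function names are illustrative only.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Build the m-dimensional delay vectors used to reconstruct the attractor."""
    n_vectors = len(x) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("frame too short for the requested m and tau")
    # each row is one point of the reconstructed state space
    return np.column_stack([x[i * tau: i * tau + n_vectors] for i in range(m)])

def normalize_attractor(points):
    """Scale every coordinate into [0, 1], as done before computing the measures."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (points - lo) / (hi - lo + 1e-12)

# usage on a 55 ms frame sampled at 25 kHz (m and tau are assumed example values)
frame = np.random.randn(1375)          # placeholder for a real speech frame
attractor = normalize_attractor(delay_embed(frame, m=7, tau=12))
print(attractor.shape)                 # (number of points, m)
```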
B. Parameterization

In order to take into account possible changes in the nonlinear dynamics of the speech, the signal was parameterized following a short-time procedure. With this approach, the dynamic information in the speech signal can be characterized by the evolution along time of the estimated complexity measures. This is quite important because, in a real physiological system, changes in the nonlinear dynamics may indicate states of pathophysiological dysfunction [7]. In this framework, the window length is an important variable to set, because it is linked to the number of points used to reconstruct the state space. In order to provide a reliable estimation of the nonlinear measures, we must use a large enough number of points.

In the context of nonlinear analysis of time series, the number of points used to reconstruct the attractor has been established around $10^{CD}$ [41], [42]. In this paper, the length of the frame is based on previous experiments with complexity measures using the same database. In [43], the frame size was selected taking into account that the mean value of the CD was estimated as 3.1. Thus, the number of points used to reconstruct the attractor must be around 1500, corresponding to 60 ms. The final window length was selected as the size that reported the best detection accuracy between normal and pathological voices over a set of experiments. The frames used are 55 ms long with an overlapping of 50%, and were extracted using rectangular windows instead of more complex ones, since complexity measures do not suffer from the spectral leakage problems present in FFT-based parameters. This kind of window has been used in other works employing complexity measures for the same task [16].

From each frame, LLE, CD, and nine entropy-based complexity measures were estimated.

1) Largest Lyapunov Exponent: LLE is a measure of the separation rate of infinitesimally close trajectories of the attractor [1]. In other words, LLE measures the sensitivity to initial conditions of the underlying system. Considering two trajectories of the state space with an initial separation δx0, the divergence is [8]

$$|\delta x(t)| \approx e^{\lambda t}\, |\delta x_0| \qquad (1)$$

being λ the Lyapunov exponent. The LLE can be defined as follows:

$$\lambda = \lim_{t \to \infty} \frac{1}{t} \ln \frac{|\delta x(t)|}{|\delta x_0|}. \qquad (2)$$

There exist different algorithms to calculate the LLE. In order to allow comparisons, two algorithms widely used in the literature have been used throughout this paper. The first one is described in [1]; it is based on the Wolf algorithm [44], but adjusted to speech signals. The second one was proposed in [2], with theoretically better results. Hereafter, the LLE estimated using the Wolf-based algorithm will be called LLE1, and LLE2 when estimated with the second approach.

2) Correlation Dimension: CD is a measure of the dimensionality of the space occupied by a set of random points, or of its geometry. In order to characterize the CD, it is necessary to define the correlation sum (CS) for a set of points x ∈ Ψ, where Ψ is the embedding space. The CS is the fraction of all possible pairs of points that are closer than a given distance r in a particular norm. The CS is given by [8]

$$C(r) = \sum_{i=1}^{N} C_i^m(r) \qquad (3)$$

where

$$C_i^m(r) = \frac{2}{N(N-1)} \sum_{j=i+1}^{N} \Theta\left(r - \|\mathbf{x}_i - \mathbf{x}_j\|\right) \qquad (4)$$

being N the number of points in Ψ, Θ the Heaviside function, and ‖·‖ a norm defined in any consistent metric space. CD is defined in the limit of an infinite amount of data (N → ∞) and for small r, and can be expressed as follows:

$$CD = \lim_{r \to 0}\, \lim_{N \to \infty} d(N, r), \qquad d(N, r) = \frac{\partial \ln C(r, N)}{\partial \ln r}. \qquad (5)$$

The CD is commonly calculated using the Grassberger-Procaccia algorithm [45]. However, in this paper, the CD was calculated using the Takens estimator, since it is computationally more efficient and obtains estimates closer to the real values than the Grassberger-Procaccia algorithm [46].
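For illustration, a small Python sketch of the correlation sum (3)-(4) and of the Takens maximum-likelihood estimator of the CD is given below. The Takens estimator shown is the standard form based on the mean logarithm of inter-point distances below a cut-off radius; the paper does not detail its implementation, so the cut-off value is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_sum(attractor, r):
    """Fraction of point pairs closer than r, i.e., C(r) in (3)-(4)."""
    d = pdist(attractor)                 # all pairwise distances (i < j)
    return np.mean(d < r)

def takens_cd(attractor, r0):
    """Takens estimator of the correlation dimension with cut-off radius r0."""
    d = pdist(attractor)
    d = d[(d > 0) & (d < r0)]            # keep only distances below the cut-off
    # maximum-likelihood estimate: CD ~ -1 / mean(log(d / r0))
    return -1.0 / np.mean(np.log(d / r0))

# usage with the normalized attractor of the previous sketch (r0 is an assumed value)
# cd_value = takens_cd(attractor, r0=0.1)
```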
3) Entropy-Based Complexity Measures: The entropy is a measure of the uncertainty of a random variable. Let X be a discrete random variable with alphabet χ and probability mass function p(x) = Pr{X = x}, x ∈ χ. The Shannon entropy H(X) is defined by [47]

$$H(X) = -\sum_{x \in \chi} p(x) \log p(x). \qquad (6)$$

If instead of a random variable we have a sequence of n random variables (i.e., a stochastic process), the process can be characterized by a joint probability mass function Pr{X1 = x1, . . . , Xn = xn} = p(x1, x2, . . . , xn). Under the assumption of existence of the limit, the rate at which the joint entropy grows with n is defined by [47]

$$H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n \to \infty} \frac{1}{n} H_n. \qquad (7)$$

If the set of random variables is independent, but not identically distributed, the entropy rate is given by

$$H(X) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(X_i). \qquad (8)$$

On the other hand, let the state space be partitioned into hypercubes of content ε^m, and let the state of the system be measured at intervals of time δ. Besides, let p(k1, . . . , kn) denote the joint probability that the state of the system is in the hypercube k1 at t = δ, k2 at t = 2δ, and so on. The Kolmogorov-Sinai entropy (HKS) is as follows [20]:

$$H_{KS} = -\lim_{\delta \to 0}\, \lim_{\varepsilon \to 0}\, \lim_{n \to \infty} \frac{1}{n\delta} \sum_{k_1, \ldots, k_n} p(k_1, \ldots, k_n) \log p(k_1, \ldots, k_n) \qquad (9)$$

measuring the mean rate of creation of information [20]. For stationary processes, it can be shown that [20]

$$H_{KS} = \lim_{\delta \to 0}\, \lim_{\varepsilon \to 0}\, \lim_{n \to \infty} \left(H_{n+1} - H_n\right). \qquad (10)$$

Numerically, only entropies of finite order n can be computed. However, some methods have been proposed in an attempt to estimate HKS [20]. One of them is the AE. AE is a measure of the average conditional information generated by diverging points on the trajectory [20], [22]. AE is defined as a function of the CS (4). For fixed m and r, AE is given by

$$AE(m, r) = \lim_{N \to \infty} \left[\Phi^{m}(r) - \Phi^{m+1}(r)\right] \qquad (11)$$

where

$$\Phi^{m}(r) = \frac{1}{N - m + 1} \sum_{i=1}^{N - m + 1} \ln C_i^m(r). \qquad (12)$$

A first modification of AE, presented in [22] and called SE, was developed to obtain a measure less dependent on the signal length than AE. SE is given by

$$SE(m, r) = \lim_{N \to \infty} -\ln \frac{\Gamma^{m+1}(r)}{\Gamma^{m}(r)}. \qquad (13)$$

The difference between Γ and Φ is that the first one does not compare the embedding vectors with themselves (it excludes self-matches). The advantage of this fact is that the estimator is unbiased [20].

Another modification of AE, presented in [23] and called Gaussian kernel approximate entropy (GAE), replaces the Heaviside function by a Gaussian-kernel-based function with the aim of suppressing the discontinuity of the auxiliary function over the CS (rectangular kernel); in this way, nearby points have greater weight than distant ones. In this case, the Heaviside function is replaced by

$$d_G(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{10 r^2}\right). \qquad (14)$$

By using (14), the estimation of GAE is carried out in the same way as for AE [see (11) and (12)]. Moreover, in order to evaluate its behavior, the SE estimation was also modified using a Gaussian kernel. From now on, this modification is called Gaussian kernel sample entropy (GSE).

Besides the number of points used to estimate this class of measures, it is necessary to set the threshold r. The value of r was fixed to r = rc · std(signal), being std(·) the standard deviation operator [22]. The parameter rc was chosen equal to 0.35 according to the previous experiments reported in [43].
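As a concrete reference for the estimators in (11)-(13), the following sketch computes approximate entropy and sample entropy directly from their template-matching definitions (Heaviside kernel, Chebyshev norm, finite N, no limit, unit lag). It follows the standard Pincus and Richman-Moorman formulations rather than the attractor-based computation used in the paper, and r is set as r = 0.35 · std(signal), as described above; m = 2 is an assumed example value.

```python
import numpy as np

def _templates(x, m):
    """All length-m subsequences of the series x, one per row."""
    return np.array([x[i:i + m] for i in range(len(x) - m + 1)])

def approximate_entropy(x, m, r):
    """AE(m, r) = Phi^m(r) - Phi^(m+1)(r), self-matches included."""
    def phi(mm):
        t = _templates(x, mm)
        # Chebyshev distance between every pair of templates
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=-1)
        c = np.mean(d <= r, axis=1)          # fraction of templates within r
        return np.mean(np.log(c))
    return phi(m) - phi(m + 1)

def sample_entropy(x, m, r):
    """SE(m, r) = -ln(A/B), self-matches excluded."""
    def count(mm):
        t = _templates(x, mm)
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude self-matches
        return np.sum(d <= r)
    return -np.log(count(m + 1) / count(m))

# usage on one speech frame
# r = 0.35 * np.std(frame)
# print(approximate_entropy(frame, 2, r), sample_entropy(frame, 2, r))
```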
4) Measures Based on Recurrence and Fractal-Scaling Analysis: Considering that there exists a combination of both deterministic and stochastic components in speech [6], the deterministic component can be characterized by a recurrence measure. Let B(x(ti), r) be a closed ball with radius r > 0 containing an embedded data point x(ti). Excluding temporal correlations, let tr = tj − ti be the recurrence time, where tj is the instant at which the trajectory first returns to the same ball. Let R(t) be the normalized histogram of the recurrence times estimated for all embedded points of a reconstructed attractor; the recurrence probability density entropy (RPDE) can then be expressed as follows:

$$RPDE = \frac{-\sum_{i=1}^{t_{\max}} R(i) \ln R(i)}{\ln t_{\max}} \qquad (15)$$

where tmax is the maximum recurrence time in the attractor.

On the other side, the stochastic component can be characterized by means of a detrended fluctuation analysis (DFA) [6], which calculates the scaling exponent of a nonstationary time series. First, the time series x(t) is integrated

$$y(n) = \sum_{t=1}^{n} x(t) \qquad (16)$$

for n = 1, 2, . . . , N, where N is the number of samples in the signal. Then, y(n) is divided into windows of length L samples. A least-squares straight-line approximation is carried out in each window, and the root-mean-squared error is calculated for every window at every time scale

$$F(L) = \left[\frac{1}{L} \sum_{n=1}^{L} \left(y(n) - a n - b\right)^2\right]^{1/2} \qquad (17)$$

where a and b correspond to the straight-line parameters. This process is repeated over the whole signal for different window sizes L, and a log-log graph of L against F(L) is constructed. A straight line on this graph indicates self-similarity, expressed as F(L) ∝ L^β. Then, the DFA measure corresponds to a sigmoidal normalization of the scaling exponent β [6].

5) Hidden Markov Entropy Measures: A Markov chain is a random process {X(t)} that can take a finite number k of values at certain moments of time (t0 ≤ t1 ≤ t2 ≤ · · ·). The values of the stochastic process change with known probabilities called transition probabilities. The particularity of this stochastic process is that the probability of changing to another state depends only on the current state of the process; this is known as the Markov condition. When such probabilities do not change with time and the initial probability of each state is also constant, the Markov chain is stationary. Let {X(t)} be a stationary Markov chain with initial distribution π and transition matrix A. Then, the entropy rate is given by [47]

$$H(X) = -\sum_{ij} \pi_i A_{ij} \log A_{ij}. \qquad (18)$$

In view of (18), it is possible to observe that the entropy measure is a sum of the individual Shannon entropies of the transition probability distribution of each state, weighted with respect to the initial probability of its corresponding state.

There exist some processes that can be seen as a Markov chain whose outputs are random variables generated from probability functions associated with each state. Such processes are called hidden Markov processes (HMP), since the states of the Markov process cannot be identified from its output (the states are "hidden"). In this case, it is not possible to obtain a closed form for the entropy rate [47], [48]. An HMP can also be understood as a Markov process with noisy observations [48]. Therefore, in the same way as in (18), it is possible to establish an entropy measure of the HMP as the entropy of the Markov process plus the entropy generated by the noise in each state of the process. We call this measure EE. If we use a DHMM to represent a stochastic process, the noise is modeled by means of discrete distributions, and, finally, it is possible to obtain a probability mass function for the noise in each state.

Denoting the actual state of the process at time t as St, a DHMM can be characterized by the following parameters [49]:

1) π = {πi}, i = 1, 2, . . . , k: the initial state distribution, where πi = p(S0 = i) is the probability of starting at the ith state;

2) A = {Aij}, 1 ≤ i, j ≤ k: the set of transition probabilities among states, where Aij = p(St+1 = j|St = i) is the probability of reaching the jth state at time t + 1, coming from the ith state at time t;

3) B = {Bij}, i = 1, 2, . . . , k, j = 1, 2, . . . , ϕ: the probability distribution of the observation symbols, where Bij = p(ot = υj|St = i), ot is the output at time t, υj are the symbols that can be associated with the output, and ϕ is the total number of symbols. All parameters are subject to the standard stochastic constraints [49].

Using this definition, the EE, HE, can be defined as follows:

$$H_E = H_{MC} + H_g \qquad (19)$$

where HMC is the entropy due to the Markov process (18), and Hg is the Shannon entropy due to the noise. By replacing both entropies, HE can be written as follows:

$$H_{E_S} = -\sum_{i=1}^{k} \pi_i \left(\sum_{j=1}^{k} A_{ij} \log A_{ij} + \sum_{j=1}^{\varphi} B_{ij} \log B_{ij}\right). \qquad (20)$$

If, instead of the Shannon entropy, we use the Renyi entropy [42], (20) becomes

$$H_{E_R} = \sum_{i=1}^{k} \frac{\pi_i}{1-\alpha} \log \sum_{j=1}^{k} A_{ij}^{\alpha} + \sum_{i=1}^{k} \frac{1}{1-\alpha} \log \sum_{j=1}^{\varphi} B_{ij}^{\alpha} \qquad (21)$$

where α > 0 and α ≠ 1 is the entropy order. In this paper, we use α = 2, since it is the most common Renyi entropy [47].

The uncertainty is maximum (and equal to $k^{-1} \log k + k \log \varphi$) if all states in the Markov chain have equal likelihood of being reached from any other state (all directions are equally probable) and all observation symbols are equally probable in each state of the process. In other words, no behavior in the state space can be likened to a trajectory. The minimum value (zero) corresponds to the case where only one state can be reached from each other state in the Markov chain (excluding self-transitions) and only one observation symbol is likely to be emitted in each state. Therefore, there exists one evident trajectory in the state space, and its dispersion is zero. In this case, the initial probability does not influence the EE, since only one state sequence is probable in the Markov chain.

The hidden Markov entropy measures used in this paper were estimated by using a DHMM with six states and a codebook of 32 words. These values were set after different experiments changing the values in the ranges [5, 10] and [16, 256], respectively.
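Once a DHMM has been trained on the quantized attractor of a frame, the two empirical entropies follow directly from its parameters. The sketch below evaluates (20) and (21) from given (π, A, B) matrices; training the six-state, 32-symbol DHMM itself (e.g., with Baum-Welch) is outside the scope of this fragment, and the toy parameter values are assumptions used only to make the snippet runnable.

```python
import numpy as np

def shannon_empirical_entropy(pi, A, B, eps=1e-12):
    """H_ES of eq. (20): transition plus emission entropy per state, weighted by pi."""
    trans = np.sum(A * np.log(A + eps), axis=1)     # sum_j A_ij log A_ij, per state i
    emis = np.sum(B * np.log(B + eps), axis=1)      # sum_j B_ij log B_ij, per state i
    return -np.sum(pi * (trans + emis))

def renyi_empirical_entropy(pi, A, B, alpha=2.0):
    """H_ER of eq. (21); the emission term is not weighted by pi, as in the equation."""
    trans = np.log(np.sum(A ** alpha, axis=1)) / (1.0 - alpha)
    emis = np.log(np.sum(B ** alpha, axis=1)) / (1.0 - alpha)
    return np.sum(pi * trans) + np.sum(emis)

# toy 6-state, 32-symbol model: rows of A and B are valid probability distributions
k, n_symbols = 6, 32
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(k), size=k)
B = rng.dirichlet(np.ones(n_symbols), size=k)
pi = np.full(k, 1.0 / k)
print(shannon_empirical_entropy(pi, A, B), renyi_empirical_entropy(pi, A, B))
```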
C. Classification

1) Gaussian Mixture Models: The central idea of the GMM is to estimate the probability density function of a dataset by means of a set of weighted Gaussian functions. The model can be expressed as follows [50]:

$$p(\mathbf{x}|\zeta) = \sum_{i=1}^{M} \alpha_i\, p_i(\mathbf{x}) \qquad (22)$$

where ζ = {ζn, ζp} indicates the class to model, pi(x), i = 1, . . . , M, are the component densities, and αi, i = 1, . . . , M, are the component weights. Each component density is an n-variate Gaussian function; therefore, the different components act together to model the overall pdf. The most common estimation method for the parameters of the model is the expectation-maximization (EM) algorithm.

For each class to be recognized, the parameters of a different GMM are estimated. Thus, the evaluation is made by calculating, for each GMM, the a posteriori probability of an observation. The score given to each sequence is obtained by calculating the logarithm of the ratio between the likelihoods given by both models (called the log-likelihood ratio).

2) Support Vector Machines: An SVM is a two-class classifier. The problem approached by an SVM [51] is analogous to finding a linear function that satisfies

$$f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b, \quad \text{with } \mathbf{w} \in \chi,\ b \in \mathbb{R} \qquad (23)$$

where χ corresponds to the space of the input patterns x. The function f(·) is calculated by solving an optimization problem over sums of a kernel function. The support vector algorithm looks for the hyperplane that separates the two classes with the largest margin of separation. Considering a radial basis function kernel, the training implies adjusting the aperture of the kernel γ and a penalty parameter C.

The output given by the SVM for each speech sample can be interpreted as the likelihood that the sample belongs to a specific class. Henceforth, the log-likelihood (likelihood in the log domain) will be called "score."

3) Fusing Generative and Discriminative Classifiers: The classification of normal/pathological voices was carried out in two steps, following first a generative approach based on GMM and, later, a discriminative approach based on SVM (see Fig. 1). As commented earlier, the SVM was fed with the scores given by two GMM-based classifiers supplied with two parameterization approaches: 1) nonlinear measures and 2) noise parameters combined with MFCC.

For each speech signal, and for each parameterization approach, both the likelihood of being normal and the likelihood of being pathological were calculated using GMM. A score was obtained by subtracting the log-likelihood (likelihood in the logarithmic domain) of being normal from the log-likelihood of being pathological. Later, the intermediate decisions about the presence or absence of pathology given by both statistical models were used as a new feature space, and an SVM-based classifier was trained to detect the presence of pathology.

The SVM and GMM classifiers were chosen on the basis of the modeling capabilities they present. The nonlinear mapping carried out by the SVM maximizes the generalization capabilities of the classifier. In addition, the possibility of choosing different basis functions, from a priori knowledge of the problem domain, allows this system to better adapt to a given problem [52]. On the other hand, the GMM fits the distribution of the observed data by means of a set of weighted Gaussian functions. The advantages of using GMM are that they are computationally inexpensive and capable of modeling complex statistical distributions [53]. These structures have been used independently for the detection of pathological voices, and have proven to be very reliable in comparison to others used in the state of the art [39].

The advantage of combining the outputs of different classifiers instead of fusing features is that the structure of the feature space used to feed each classifier is much simpler. Furthermore, although one of the classifiers may yield a better performance, the sets of speech recordings misclassified by each would not necessarily overlap; therefore, the combination of their outputs could improve the overall performance [54].

On the other hand, the parameterization using complexity measures and MFCC requires different window lengths (55 and 40 ms, respectively) and different windows (rectangular and Hamming, respectively), which complicates the use of the whole set of features in a single feature space, encouraging the use of a classifier combination strategy.
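The two-stage scheme of this section can be sketched as follows: frame-level GMM are trained per class for each parameterization approach, every recording is scored with the average frame log-likelihood ratio, and the per-speaker score pairs feed an RBF-kernel SVM. The sketch uses scikit-learn for illustration; the number of Gaussian components, the values of C and γ, and the use of decision_function as the final output are assumptions rather than details reported in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_class_gmms(frames_pathological, frames_normal, n_components=4):
    """One GMM per class, trained with EM on frame-level feature vectors."""
    gmm_p = GaussianMixture(n_components=n_components, covariance_type="full",
                            random_state=0).fit(frames_pathological)
    gmm_n = GaussianMixture(n_components=n_components, covariance_type="full",
                            random_state=0).fit(frames_normal)
    return gmm_p, gmm_n

def recording_score(gmm_p, gmm_n, frames):
    """Average log-likelihood ratio of one recording (pathological minus normal)."""
    return np.mean(gmm_p.score_samples(frames) - gmm_n.score_samples(frames))

def train_fusion_svm(scores_complexity, scores_noise_mfcc, labels, C=10.0, gamma=1.0):
    """RBF-kernel SVM trained on the 2-D space of per-speaker GMM scores."""
    X = np.column_stack([scores_complexity, scores_noise_mfcc])
    return SVC(kernel="rbf", C=C, gamma=gamma).fit(X, labels)

# final decision for one speaker, thresholding the SVM output at an assumed value of 0
# svm_out = fusion_svm.decision_function([[s_complexity, s_noise_mfcc]])[0]
# is_pathological = svm_out > 0.0
```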
III. EXPERIMENTS AND RESULTS

A. Corpus of Speakers

Testing has been carried out with the MEEI voice disorders database [17]. Due to the different sampling rates of the recordings stored in the database, a downsampling with a previous half-band filtering was carried out, when needed, adjusting every utterance to a 25 kHz sampling rate. A resolution of 16 bits was used for all the recordings. The recordings contain the sustained phonation of the /ah/ vowel from patients with a variety of voice pathologies; they were previously edited to remove the beginning and ending of each utterance, removing the onset and offset effects in these parts of each utterance. A subset of 173 pathological and 53 normal speakers was taken according to those enumerated in [55].

B. Experimental Setup

The methodology proposed in [3] has been used to evaluate the system. The generalization abilities have been tested following a cross-validation scheme with ten different sets for training and validation. The results are presented giving the following rates: true positive rate (tp) (or sensitivity, the ratio between the pathological files correctly classified and the total number of pathological files), false negative rate (fn) (the ratio between the pathological files wrongly classified and the total number of pathological files), true negative rate (tn) (or specificity, the ratio between the normal files correctly classified and the total number of normal files), and false positive rate (fp) (the ratio between the normal files wrongly classified and the total number of normal files). The overall accuracy is the ratio between the hits of the system and the total number of files.

Receiver operating characteristic (ROC) curves were used to represent graphically the performance of the proposed architecture. The ROC curve [56] reveals the diagnostic accuracy expressed in terms of sensitivity and 1-specificity (i.e., fp). In addition, the area under the ROC curve (AUC) was calculated, representing an estimation of the expected performance of the system in a single scalar [56].
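For reference, the evaluation rates defined above map directly onto a few lines of code. The sketch below computes sensitivity, specificity, accuracy, and the AUC from per-file labels and scores; using scikit-learn's roc_auc_score for the AUC is an implementation choice rather than something specified in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_rates(labels, scores, threshold=0.0):
    """Per-file sensitivity, specificity, accuracy, and AUC.

    labels: 1 for pathological, 0 for normal; scores: larger means more pathological.
    """
    labels = np.asarray(labels).astype(bool)
    pred = np.asarray(scores) > threshold
    tp = np.mean(pred[labels])            # sensitivity: pathological files detected
    tn = np.mean(~pred[~labels])          # specificity: normal files detected
    return {"sensitivity": tp,
            "specificity": tn,
            "accuracy": np.mean(pred == labels),
            "auc": roc_auc_score(labels, scores)}
```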
C. Results

TABLE I. STATISTICS OF THE NONLINEAR FEATURES (WITH 200 MS FRAMES)

TABLE II. ACCURACY USING COMPLEXITY MEASURES AND A GMM DETECTOR

TABLE III. ACCURACY USING DIFFERENT FEATURE SETS AND A GMM DETECTOR

Fig. 3. Distributions of CD and two different estimations of LLE for normal and pathological voices. (a) LLE estimated by using the algorithm in [1]. (b) LLE estimated by using the algorithm in [2]. (c) CD calculated by using the Takens estimator.

Fig. 4. ROC curves using different feature sets: (a) with the feature sets enumerated in Table II and (b) using the noise parameters combined with MFCC, the complexity measurements, and fusing both parameterization approaches.

Table I shows a statistical analysis of the nonlinear features described in Section II. Initially, and in order to compare with previous results in the state of the art, they were estimated using frames of 200 ms. Table I shows that the maximum embedding dimension for this database has been estimated as 7. This is a high value in comparison with the values found in previous works (usually 2 or 3) [1], [26]. However, in [1], the embedding dimension was 3 for all voices, because the mean value of the correlation dimension was around 3. In [26], the embedding dimension was simply assumed to be 2. Nevertheless, other works that used algorithms to estimate the embedding dimension have reported high values of m (equal to 11), even for normal recordings [57].

On the other hand, Table I shows large differences between the values obtained with the two algorithms used to calculate the LLE. The first one delivered positive and negative values around zero, while the second provided only positive values. Since many normal and some pathological voices present attractors with close trajectories, one might expect the LLE to be zero for these voices, but in the second case the algorithm does not estimate values equal to zero. Fig. 3 shows the distributions for CD and both estimates of LLE. The line inside each box marks the median, the whiskers mark 1.5 times the interquartile range from the ends of the box, and "+" symbols mark the outlying points. If the notches in the box plot do not overlap, we can conclude with 95% confidence that the true medians differ; hence, the medians are statistically different for normal and pathological voices. These results are in concordance with those found in the literature using different databases [7]. The values of CD present clearer differences between normal and pathological voices. However, since the main interest of this paper is to establish the discriminative capabilities of the nonlinear features, we performed the experiments calculating the features over short-time windows to train the GMM-based detectors.

Table II shows the sensitivity, specificity, accuracy, and AUC obtained independently for each of the nonlinear measures. The experiments were carried out using different numbers of Gaussians for the GMM (from 2 to 6), and the results shown in Table II are the best obtained for each feature. Although Fig. 3 showed that the medians are statistically different, Table II shows that the LLE does not provide good discrimination capabilities. Moreover, the best classification accuracy is obtained with HES. Table III shows the classification accuracy obtained using different sets of features. The first set corresponds to the more classical nonlinear features (LLE and CD); only the algorithm for LLE1 was used, because it presented a better behavior than LLE2. The second set is formed by the entropy measures based on AE; the third set corresponds to the recurrence and fractal-scaling analysis features together; the fourth, to the hidden Markov entropy measures; and, for the sake of comparison, the fifth set corresponds to the noise parameters and MFCC.

Fig. 4(a) plots the ROC curves for the configurations reported in Table III. In all cases, the classification was carried out with a GMM. The best performances are those obtained using: 1) the hidden Markov entropy measures and 2) the noise parameters combined with MFCC. Fusing the whole set of nonlinear features, the performance diminished with respect to the performance obtained using the hidden Markov entropy measures alone. Due to this fact, the features based on hidden Markov entropy measures were preferred for the further analysis combining classifiers. Again, for the sake of comparison, Fig. 4(b) shows the ROC curves for the system trained using the noise parameters combined with MFCC, the complexity measures, and the fusion of both parameterization approaches. There is a clear improvement in the performance of the system when complementing the nonlinear measures with the noise parameters combined with MFCC.

TABLE IV. CLASSIFICATION ACCURACY OBTAINED USING A CLASSIFIER COMBINATION STRATEGY. RESULTS WITH TRAINING AND TESTING SUBSETS

Fig. 5. Probability density functions and cumulative distributions for true and false scores obtained with: (a) the training set and (b) the testing set.

Table IV shows the best classification accuracy obtained by fusing both classifiers, each fed with a different feature set. The GMM provide a new feature space that was used, for each speaker, to take the final decision about the presence or absence of pathology. The table shows the best classification accuracy. Combining both classifiers with an SVM, the error decreased by up to 44.47% with respect to the minimum error obtained using only complexity measures, representing an absolute reduction of 2.21% in the final error. Besides, for the sake of comparison, Table IV explores the possibility of using a third GMM classifier instead of the proposed SVM: although the results are similar in terms of accuracy, the confidence intervals of the results using the SVM are close to zero, evidencing a better stability.

Fig. 5 shows the distributions of the normal and pathological scores given by the final SVM-based detector, as well as the false positive rate and false negative rate curves, which correspond to the cumulative sums of the distributions of normal and pathological scores, respectively. Fig. 5(a) shows the graphics obtained with the training set, and Fig. 5(b) shows the graphics obtained with the testing set. The similarity of the score distributions in both cases confirms the stability of the system given by the confidence interval shown in Table III.

IV. CONCLUSION

The methodology for nonlinear analysis of speech signals proposed in this paper reveals valuable and complementary information to detect pathological voices. The use of complexity measures in this context showed reliable results, especially using the hidden Markov entropy measures. The characterization of the embedding space carried out by the DHMM takes into account additional information about the transitions between different regions of the state space followed by the trajectories of the attractor, improving the discrimination capabilities of the EEs.

The use of a classifier combination strategy has demonstrated to be a valuable alternative for fusing information from the different phenomena involved in speech production. The use of an SVM as the final detector allows a classification accuracy of 98.23% with a very narrow confidence interval (approximately 0), representing an improvement with respect to other works found in the state of the art.

Although the conventional nonlinear statistics (CD and LLE) showed statistically significant differences between normal and pathological voices, their usefulness for the automatic detection of pathologies still remains unclear. Experiments with new data must be carried out to clarify this aspect.

Regarding future work, the proposed methodology could be fused with additional features that could provide complementary information, e.g., features based on biomechanical parameters or extracted from the characterization of the mucosal wave.
REFERENCES

[1] A. Giovanni, M. Ouaknine, and J.-M. Triglia, "Determination of largest Lyapunov exponents of vocal signal: Application to unilateral laryngeal paralysis," J. Voice, vol. 13, no. 3, pp. 341–454, 1999.
[2] R. Hegger, H. Kantz, and T. Schreiber, "Practical implementation of nonlinear time series methods: The TISEAN package," Chaos, vol. 9, no. 2, pp. 413–439, 1999.
[3] N. Sáenz-Lechón, J. I. Godino-Llorente, V. Osma-Ruiz, and P. Gómez-Vilda, "Methodological issues in the development of automatic systems for voice pathology detection," Biomed. Signal Process. Control, vol. 1, no. 2, pp. 120–128, 2006.
[4] I. R. Titze, The Myoelastic Aerodynamic Theory of Phonation. Iowa, IA: National Center for Voice and Speech, 2006.
[5] I. R. Titze, R. J. Baken, and H. Herzel, "Evidence of chaos in vocal fold vibration," in Frontiers in Basic Science, I. R. Titze, Ed. San Diego, CA: Singular Publishing Group, 1993, pp. 143–188.
[6] M. A. Little, P. E. McSharry, S. J. Roberts, D. A. Costello, and I. M. Moroz, "Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection," Biomed. Eng. Online, vol. 6, no. 23, 2007.
[7] J. J. Jiang, Y. Zhang, and C. McGilligan, "Chaos in voice, from modeling to measurement," J. Voice, vol. 20, no. 1, pp. 2–17, 2006.
[8] H. Kantz and T. Schreiber, Nonlinear Time Series Analysis, 2nd ed. Cambridge, U.K.: Cambridge University Press, 2004.
[9] J. J. Jiang and Y. Zhang, "Nonlinear dynamic analysis of speech from pathological subjects," Electron. Lett., vol. 38, no. 6, pp. 294–295, 2002.
[10] Y. Zhang and J. J. Jiang, "Nonlinear dynamic analysis in signal typing of pathological human voices," Electron. Lett., vol. 39, no. 13, pp. 1021–1023, 2003.
[11] J. MacCallum, L. Cai, L. Zhou, Y. Zhang, and J. Jiang, "Acoustic analysis of aperiodic voice: Perturbation and nonlinear dynamic properties in esophageal phonation," J. Voice, vol. 23, no. 3, pp. 283–290, 2009.
[12] M. L. Meredith, S. M. Theis, J. S. McMurray, Y. Zhang, and J. Jiang, "Describing pediatric dysphonia with nonlinear dynamic parameters," Int. J. Pediatr. Otorhinolaryngol., vol. 72, no. 12, pp. 1829–1836, 2008.
[13] Y. Zhang, J. Jiang, L. Biazzo, and M. Jorgensen, "Perturbation and nonlinear dynamic analysis of voices from patients with laryngeal paralysis," J. Voice, vol. 19, no. 4, pp. 519–528, 2004.
[14] Y. Zhang, C. McGilligan, L. Zhou, M. Vig, and J. Jiang, "Nonlinear dynamic analysis of voices before and after surgical excision of vocal polyps," J. Acoust. Soc. Am., vol. 115, no. 5, pp. 2270–2277, 2008.
[15] Y. Zhang and J. J. Jiang, "Acoustic analysis of sustained and running voices from patients with laryngeal pathologies," J. Voice, vol. 22, no. 1, pp. 1–9, 2008.
[16] G. Vaziri, F. Almasganj, and R. Behroozmand, "Pathological assessment of patients' speech signals using nonlinear dynamical analysis," Comput. Biol. Med., vol. 40, no. 1, pp. 54–63, 2010.
[17] Massachusetts Eye and Ear Infirmary, Voice Disorders Database, Version 1.03. Lincoln Park, NJ: Kay Elemetrics Corp., 1994 [CD-ROM].
[18] S. M. Pincus, "Approximate entropy as a measure of system complexity," Proc. Natl. Acad. Sci. USA, vol. 88, pp. 2297–2301, 1991.
[19] A. Serletis, A. Shahmordi, and D. Serletis, "Effect of noise on estimation of Lyapunov exponents from a time series," Chaos, Solitons Fractals, vol. 32, no. 2, pp. 883–887, 2007.
[20] M. Costa, A. Goldberger, and C. Peng, "Multiscale entropy analysis of biological signals," Phys. Rev. E, vol. 71, pp. 021906-1–021906-18, 2005.
[21] I. A. Rezek and S. J. Roberts, "Stochastic complexity measures for physiological signal analysis," IEEE Trans. Biomed. Eng., vol. 45, no. 9, pp. 1186–1191, Sep. 1998.
[22] J. S. Richman and J. R. Moorman, "Physiological time-series analysis using approximate entropy and sample entropy," Am. J. Physiol. Heart Circ. Physiol., vol. 278, pp. H2039–H2049, 2000.
[23] L.-S. Xu, K.-Q. Wang, and L. Wang, "Gaussian kernel approximate entropy algorithm for analyzing irregularity of time series," in Proc. 4th Int. Conf. Mach. Learn. Cybern., 2005, pp. 5605–5608.
[24] K. K. L. Ho, G. B. Moody, C.-K. Peng, J. E. Mietus, M. G. Larson, D. Levy, and A. L. Goldberger, "Predicting survival in heart failure case and control subjects by use of fully automated methods for deriving nonlinear and conventional indices of heart rate dynamics," Circulation, vol. 96, pp. 842–848, 1997.
[25] L. A. Fleisher, S. M. Pincus, and S. H. Rosenbaum, "Approximate entropy of heart rate as a correlate of postoperative ventricular dysfunction," Anesthesiology, vol. 78, no. 4, pp. 683–692, 1993.
[26] K. Manickam, C. Moore, T. Willard, and N. Slevin, "Quantifying aberrant phonation using approximate entropy in electrolaryngography," Speech Commun., vol. 47, no. 3, pp. 312–321, 2005.
[27] C. Moore, K. Manickam, T. Willard, S. Jones, N. Slevin, and S. Shalet, "Spectral pattern complexity analysis and the quantification of voice normality in healthy and radiotherapy patient groups," Med. Eng. Phys., vol. 26, no. 4, pp. 291–301, 2004.
[28] B. S. Aghazadeh, H. Khadivi, and M. Nikkhah-Bahrami, "Nonlinear analysis and classification of vocal disorders," in Proc. 29th Int. IEEE EMBS Conf., 2007, pp. 6199–6202.
[29] P. Henríquez, J. B. Alonso, M. A. Ferrer, C. M. Travieso, J. I. Godino-Llorente, and F. Díaz-de-María, "Characterization of healthy and pathological voice through measures based on nonlinear dynamics," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1186–1195, Aug. 2009.
[30] C. Maciel, J. Pereira, and D. Stewart, "Identifying healthy and pathologically affected voice signals," IEEE Signal Process. Mag., vol. 27, no. 1, pp. 120–123, Jan. 2010.
[31] P. Scalassara, M. Dajer, J. Marrara, C. Maciel, and J. Pereira, "Analysis of voice pathology evolution using entropy rate," in Proc. 10th IEEE Int. Symp. Multimedia, 2008, pp. 580–585.
[32] M. A. Little, D. A. Costello, and M. L. Harries, "Objective dysphonia quantification in vocal fold paralysis: Comparing nonlinear with classical measures," J. Voice, to be published, DOI: 10.1016/j.jvoice.2009.04.004.
[33] D. Woodcock and I. T. Nabney, A New Measure Based on the Renyi Entropy Rate Using Gaussian Kernels. Aston University, U.K., 2006.
[34] J. D. Arias-Londoño, J. I. Godino-Llorente, G. Castellanos-Domínguez, N. Sáenz-Lechón, and V. Osma-Ruiz, "Complexity analysis of pathological voices by means of hidden Markov entropy measures," in Proc. 31st Int. IEEE EMBS Conf., 2009, pp. 2248–2251.
[35] X. Huang, A. Acero, and H. W. Hon, Spoken Language Processing. Englewood Cliffs, NJ: Prentice Hall PTR, 2001.
[36] G. de Krom, "A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals," J. Speech Hear. Res., vol. 36, no. 2, pp. 254–266, 1993.
[37] H. Kasuya, S. Ogawa, K. Mashima, and S. Ebihara, "Normalized noise energy as an acoustic measure to evaluate pathologic voice," J. Acoust. Soc. Am., vol. 80, no. 5, pp. 1329–1334, 1986.
[38] D. Michaelis, T. Gramss, and H. W. Strube, "Glottal-to-noise excitation ratio: A new measure for describing pathological voices," Acustica/Acta Acustica, vol. 83, pp. 700–706, 1997.
[39] J. I. Godino-Llorente, P. Gómez-Vilda, and M. Blanco-Velasco, "Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters," IEEE Trans. Biomed. Eng., vol. 53, no. 10, pp. 1943–1953, Oct. 2006.
[40] L. Cao, "Practical method for determining the minimum embedding dimension of a scalar time series," Physica D, vol. 110, no. 1–2, pp. 43–50, 1997.
[41] R. Carvajal, N. Wessel, M. Vallverdú, P. Caminal, and A. Voss, "Correlation dimension analysis of heart rate variability in patients with dilated cardiomyopathy," Comput. Meth. Programs Biomed., vol. 78, no. 2, pp. 133–140, 2005.
[42] M. Ding, C. Grebogi, E. Ott, T. Sauer, and J. A. Yorke, "Estimating correlation dimension from chaotic time series: When does plateau occur?" Physica D, vol. 69, no. 3–4, pp. 404–424, 1993.
[43] J. D. Arias-Londoño, J. I. Godino-Llorente, and G. Castellanos-Domínguez, "Short time analysis of pathological voices using complexity measures," in Proc. 3rd Adv. Voice Funct. Assess. Int. Workshop, 2009, pp. 93–96.
[44] A. Wolf, J. Swift, H. Swinney, and J. Vastano, "Determining Lyapunov exponents from a time series," Physica D, vol. 16, no. 3, pp. 285–317, 1985.
[45] P. Grassberger and I. Procaccia, "Characterization of strange attractors," Phys. Rev. Lett., vol. 50, no. 5, pp. 346–349, 1983.
[46] B. Borovkova, R. Burton, and H. Dehling, "Consistency of the Takens estimator for the correlation dimension," Ann. Appl. Probab., vol. 9, no. 2, pp. 376–390, 1999.
[47] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2006.
[48] M. Rezaeian, "Hidden Markov process: A new representation, entropy rate and estimation entropy," arXiv:cs/0606114v2, 2006.
[49] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[50] D. Reynolds, T. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digit. Signal Process., vol. 10, pp. 19–41, 2000.
[51] V. Vapnik, "An overview of statistical learning theory," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–1000, Sep. 1999.
[52] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2000.
[53] D. A. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digit. Signal Process., vol. 10, pp. 19–41, 2000.
[54] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[55] V. Parsa and D. Jamieson, "Identification of pathological voices using glottal noise measures," J. Speech, Lang., Hear. Res., vol. 43, no. 2, pp. 469–485, 2000.
[56] T. Fawcett, "ROC graphs: Notes and practical considerations for researchers," HP Laboratories, Palo Alto, CA, 2004.
[57] R. Nicollas, R. Garrel, M. Ouaknine, A. Giovanni, B. Nazarian, and J.-M. Triglia, "Normal voice in children between 6 and 12 years of age: Database and nonlinear analysis," J. Voice, vol. 22, no. 6, pp. 671–675, 2008.

Authors' photographs and biographies not available at the time of publication.
