
Pathological Likelihood Index as a Measurement of the Degree of Voice Normality and Perceived Hoarseness
*Juan Ignacio Godino-Llorente, *Pedro Gómez-Vilda, †Fernando Cruz-Roldán, †Manuel Blanco-Velasco, and *Rubén Fraile, *†Madrid, Spain
Summary: This article introduces a new index to measure the degree of normality in speech. The proposed parameter has been shown to correlate with perceived hoarseness, giving an indication of the degree of normality. Its calculation is based on a statistical model developed to represent normal and pathological voices. The modeling is built around Gaussian mixture models and Mel frequency cepstral coefficients. The proposed index has been named pathological likelihood index (PLI). PLI is compared with aperiodicity features (such as jitter and shimmer) and with measurements sensitive to additive noise (such as harmonics-to-noise ratio (HNR), cepstrum-based HNR, normalized noise energy, and glottal-to-noise excitation ratio). The proposed parameter proves to be a good estimator of the presence of pathology, showing lower correlation with noise, frequency, and amplitude perturbation parameters than these classical features show among themselves.
Key Words: Screening voice disorders–Short-term analysis–Cepstral parameters–Gaussian mixture models–Voice quality.
Accepted for publication April 20, 2009.
From the *Universidad Politécnica de Madrid, Circuits & Systems Engineering Department, Ctra. de Valencia, Madrid, Spain; and the †Universidad de Alcalá, Escuela Politécnica, Signal Theory and Communications Department, Ctra. de Madrid-Barcelona, Alcalá de Henares, Madrid, Spain.
Address correspondence and reprint requests to Juan Ignacio Godino-Llorente, Universidad Politécnica de Madrid, Ctra. de Valencia, km. 7, 28031, Madrid, Spain. E-mail: igodino@ics.upm.es
Journal of Voice, Vol. 24, No. 6, pp. 667-677
0892-1997/$36.00
© 2010 The Voice Foundation
doi:10.1016/j.jvoice.2009.04.003

INTRODUCTION
The usefulness of acoustic features in the objective evaluation of pathological voices has been tested under various contexts and with a variety of goals. State-of-the-art acoustic analysis enables a wide range of long-term acoustic parameters to be estimated by averaging short-term perturbations. These features are grouped according to the characteristics they measure.1,2 It is usual to divide these parameters into three main groups: the “amplitude perturbation features” (designed to capture the various forms of periodicity disturbances in the amplitude of the acoustic signal), such as shimmer,3 amplitude perturbation quotient (APQ),3 or smoothed APQ (sAPQ)3; the “frequency perturbation features” (developed to measure the periodicity disturbances in the acoustic signal), such as jitter,3 pitch perturbation quotient (PPQ),3 or smoothed PPQ (sPPQ)3; and the “noise features” (designed to measure the relative noise component in the speech signal), such as signal-to-noise ratio (SNR),4 harmonics-to-noise ratio (HNR),5 zero-phase HNR (zHNR),6 frequency domain HNR (fHNR),7 cepstrum-based HNR (CHNR),8 normalized noise energy (NNE),9 voice turbulence index (VTI),10 soft phonation index (SPI),10 pitch amplitude of the linear prediction model residual (PA), spectral flatness ratio (SFR),11 and glottal-to-noise excitation ratio (GNE).12 The common framework for these parameters is that they were conceived to measure the quality and “degree of normality” of voice records.2,3,8–10,12–14 Former studies show that the screening of voice pathologies can be carried out by means of the aforementioned long-term averaged acoustic parameters.15,16
Apart from the physical interpretation of each acoustic parameter and its ability to measure different aspects of voice quality for screening purposes, it is very important to know how likely a voice record is to be normal or not, given each of the previously mentioned acoustic parameters. In the context of screening, no single feature is capable of differentiating between normal and pathological voices, because voice pathologies tend to combine different kinds of perturbations in a multifaceted way.
On the other hand, to date, very few studies have evaluated the discriminative capabilities of the acoustic parameters for screening purposes. Parsa and Jamieson evaluated the discriminative capabilities of several noise features:17 NNE, SNR, zHNR, fHNR, PA, and SPR, reporting accuracies equal to 79.8%, 82.5%, 83.3%, 88.6%, 94.7%, and 98.7%, respectively. Yumoto et al5 proposed the HNR parameter for acoustic discrimination of voice disorders, reporting an error rate of 16.7%. Kasuya et al9 obtained a screening accuracy of 78.6% for NNE and 74.1% for HNR. Other works15–19 indicate that an accurate screening can be carried out using a combination of several of the aforementioned acoustic parameters, enabling each individual voice utterance to be quantified by a single set of one-dimensional parameters (similar to those enumerated earlier). Although the multidimensional studies reported a good efficiency in screening (obtaining accuracies up to 96%),18,19 such analysis is not always easy to interpret from the perspective of a human evaluator, and it is usually carried out by methods based on complex pattern-recognition techniques. In this sense, the best acoustic features for screening would be those with the lowest correlation with respect to the others; those with a good correlation against the perceptual evaluations used in the clinic by the specialists; those providing the best discrimination capabilities; and those that can be easily interpreted by a human expert.
The basic aim of the present research is to evaluate a new index developed to measure the likelihood that a voice record is normal or pathological. This index is calculated based on the
framework proposed in our previous work.20 The modeling is conducted by means of Gaussian mixture models (GMM) using nonparametric short-term Mel frequency cepstral coefficients (MFCC).21 Each voice record is characterized by as many feature vectors as frames are extracted from the recording. These feature vectors are used to build two different statistical models: one for normal and the other for pathological voices. For each frame, both the likelihood to be normal and the likelihood to be pathological are calculated. An index (called log likelihood ratio [log-LR]) is obtained by subtracting the log-likelihood (likelihood in the logarithmic domain) to be normal from the log-likelihood to be pathological. The next step averages over time the index calculated for each frame to obtain a final score. Finally, the proposed pathological likelihood index (PLI) is calculated by normalizing the LR by means of a logistic function. The decision about normality or abnormality is taken by establishing a threshold over the normalized LR. This threshold corresponds to the equal error rate (EER) point.22,23 A more detailed description of the process is given in sections methodology and pathological likelihood index.
To compare the proposed PLI with the acoustic parameters in the existing literature, in this work, a simple correlation analysis was applied to a set of 16 “classical” acoustic measurements (Table 1). The goal is to determine interdependencies between PLI and the other features. Furthermore, the correlation of the proposed parameter with the ratings given by the specialists through perceptual evaluation has been analyzed. The statistic applied was the nonparametric Spearman rank-order correlation coefficient,24 which is independent of the shape of the underlying data distribution. The significance of the difference between the correlation coefficients was also calculated. Moreover, the discrimination capabilities of the proposed parameter are compared with those parameters that have been reported in the literature to be the best indicators of the presence of pathology.
The new index improves the screening accuracy with respect to the existing parameters, maintaining a medium–high correlation with the perceptual judgments about quality given by the specialists, and simultaneously reducing the correlation with respect to the classical parameters found by the state-of-the-art technique.
The article is organized as follows: section methodology introduces the block diagram used to build the statistical model; section pathological likelihood index describes the calculation of the PLI parameter; section results and discussion
TABLE 1. “Classical” Acoustic Measurements Used in the Study to Compare Results With the PLI
Type | Parameter | Description
Frequency perturbation | Jitta | Absolute jitter gives an evaluation of the period-to-period variability of the pitch period (μs).2,3
Frequency perturbation | Jitter | Jitter percent represents the relative period-to-period (in short-term) variability of the pitch (%).2,3
Frequency perturbation | RAP | Relative average perturbation gives an evaluation of the variability of the pitch period with a smoothing factor of 3 periods (%).2,3
Frequency perturbation | PPQ | Pitch period perturbation quotient gives an evaluation in percent of the variability of the pitch period with a smoothing factor of 5 periods (%).2,3
Frequency perturbation | sPPQ | Smoothed pitch period perturbation quotient gives an evaluation in percent of the long-term variability of the pitch period with a user-selected number of periods (usually 55) (%).2,3
Amplitude perturbation | ShdB | Shimmer in dB gives an evaluation of the period-to-period variability of the peak-to-peak amplitude (dB).2,3
Amplitude perturbation | Shim | Shimmer percent gives an evaluation in percent of the variability of the peak-to-peak amplitude. It represents the relative period-to-period (in short-term) variability of the peak-to-peak amplitude (%).2,3
Amplitude perturbation | sAPQ | Smoothed amplitude perturbation quotient gives an evaluation in percent of the long-term variability of the peak-to-peak amplitude with a smoothing factor of a user-selected number of periods (usually 55) (%).2,3
Noise | HNR | Harmonics-to-noise ratio is an average ratio of energy of the inharmonic components in the range 1500–4500 Hz to the harmonic-component energy in the range 70–4500 Hz (%).10,15
Noise | CHNR | Cepstrum-based HNR is an average ratio based on calculating the ratio of the energy of the harmonics related to the noise energy present in the voice (both measured in dB) in the range 70–4500 Hz (dB).5,8
Noise | NNE | Normalized noise energy is a ratio of the energy of the noise present (in the range 1–5 kHz) in the vocalization to the total energy of the signal (dB).9,44
Noise | GNE | Glottal-to-noise excitation ratio is the maximum of the cross-correlation between Hilbert envelopes calculated for different frequency channels and extracted from the inverse filtering of the speech signal. The bandwidth of the envelopes used is 1 kHz, and frequency bands are separated by 300 Hz.12,42
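The short-term perturbation measures in Table 1 are described only verbally. As a rough illustration, the sketch below computes absolute jitter, jitter percent, shimmer in dB, and shimmer percent from per-cycle pitch periods and peak-to-peak amplitudes. The formulas are the commonly cited textbook definitions and are an assumption here, as are the function name `perturbation_measures` and its inputs; the article does not state which exact implementation produced its values.

```python
import numpy as np

def perturbation_measures(periods_s, amplitudes):
    """Jitter and shimmer from per-cycle data (hypothetical helper).

    periods_s  : consecutive pitch periods, in seconds
    amplitudes : peak-to-peak amplitude of each cycle (linear units, > 0)
    Formulas are the commonly cited definitions, assumed for illustration.
    """
    T = np.asarray(periods_s, dtype=float)
    A = np.asarray(amplitudes, dtype=float)

    mean_abs_dT = np.mean(np.abs(np.diff(T)))   # period-to-period variability
    mean_abs_dA = np.mean(np.abs(np.diff(A)))   # amplitude variability

    return {
        "Jitta_us": mean_abs_dT * 1e6,                    # absolute jitter, microseconds
        "Jitter_pct": 100.0 * mean_abs_dT / np.mean(T),   # relative jitter, percent
        "ShdB": np.mean(np.abs(20.0 * np.log10(A[1:] / A[:-1]))),  # shimmer, dB
        "Shim_pct": 100.0 * mean_abs_dA / np.mean(A),     # relative shimmer, percent
    }
```

A perfectly periodic signal with constant amplitude yields zero for all four measures; any cycle-to-cycle irregularity raises them.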
summarizes the results and comparisons with other acoustic parameters, paying special attention to the discrimination capabilities of the parameter; and section conclusions presents some conclusions.

METHODOLOGY
Figure 1 shows a block diagram describing the process set up for the modeling. The features used to build the statistical models are calculated from short-time windows extracted from the recordings of the sustained phonation of a vowel. The window length was selected to contain, in the worst case, at least two consecutive pitch periods (2T0);25 hence, the feature extraction was performed using 40-millisecond Hamming windows with an overlap of 50% between adjacent frames. Consequently, the frame rate obtained is 50 frames/s.
The feature extraction module calculates, for each frame, the MFCC vector of parameters complemented with other features developed to measure their speed of variation. In the third stage, a pattern classification block models the statistical distribution for normal and pathological voices. Two different statistical models are computed: one for normal and another for pathological voices. The last step is a logistic transformation to obtain a final score normalized into the interval (0,1).
The screening accuracy of the statistical modeling stage is estimated using a k-fold cross-validation scheme. Nine repetitions were used to estimate the performance figures, averaging the results obtained from each data set. For each set, data files were randomly split into two subsets: the first one to train (70% of the files), and the second (30% of the files) to simulate and validate the results, keeping the same proportion for each class. The division into training and evaluation data sets is carried out on a file (rather than frame) basis to prevent the system from learning speaker-related features. Both male and female voices have been mixed together in the training and validation sets.
The accuracy of the parameter has been evaluated with the detection error tradeoff (DET)26 and compared with other acoustic parameters using the relative operating characteristic (ROC)27 plots. These plots allow a comparison of the performance and accuracy with other parameters found in the literature. The ROC27,28 curve is a popular tool in medical decision making. It displays the diagnostic accuracy expressed in terms of sensitivity (or true positive rate) against 1 − specificity (or false acceptance rate) at all possible threshold values in a convenient way. The ROC is analyzed by calculating the area under the curve (AUR) and its standard error (SE), as suggested in the work by Hanley and McNeil (1983).27 On the other hand, the DET plot26 is also widely used for the assessment of detection performance in speaker-verification tasks. A DET curve plots error rates (false positive vs false negative) on both axes, giving uniform treatment to both types of errors. It uses a normal deviate scale for both axes, which spreads out the plot and better distinguishes different well-performing parameters. The closer the plot is to the origin, the better the screening accuracy of the parameter.

Database
As it is the only commercially available voice-disorder database, and to allow reproducibility, the tests have been carried out using the database developed by the Massachusetts Eye and Ear Infirmary Voice and Speech Labs.29 The speech samples were collected in a controlled environment and sampled at a 16-bit resolution. When necessary, downsampling preceded by half-band filtering is applied to adjust every utterance to a sampling rate of 25 kHz. The material available in the database contains the sustained phonation (1- to 3-second long) of the vowel /ah/, and continuous speech recordings of the “rainbow passage.” The samples were obtained from patients (males and females) with normal voices and a wide variety of organic, neurological, traumatic, and psychogenic voice disorders.
The database has been segmented according to the criteria explained in the work by Parsa and Jamieson (2000),17 used earlier in other studies.20 These criteria ensure that gender and age are uniformly distributed among the samples belonging to both classes. The subset taken from the database contains 53 normal and 173 pathological talkers. The larger number of recordings belonging to the pathological set allows a better modeling of a class that has a larger inherent variability. This fact does not imply a bias of the proposed methodology toward the pathological class because, typically, the intra- and interspeaker variability in the feature space of the pathological voices is much greater than in the control group.
Perceptual labeling. To study the correlation between the proposed PLI and the perceptual judgments used in clinical practice, the database has been evaluated by an experienced speech and language therapist (SALT) following the GRBAS scale. This perceptual scale was proposed by Hirano30 and accepted as standard by the Japanese Society of Logopedics and Phoniatrics and the European Group on the Larynx.31 The GRBAS scale comprises five qualitative parameters: grade of dysphony (G), roughness (R), breathiness (B), asthenicity (A), and strain (S). For each parameter, a value in the range {0 ≤ v ≤ 3; v ∈ Z} is considered, where 0 corresponds to healthy voice, 1 to light disease, 2 to moderate disease, and 3 to severe disease. The sum of the partial scores gives an overall GRBAS score in the range {0 ≤ v ≤ 12; v ∈ Z}. Despite some limitations, GRBAS is simple and fast, and has good correlation with some acoustic parameters. In the work
FIGURE 1. Block diagram of the statistical speech-modeling detector: preprocessing front end, feature extraction, and statistical module. [Blocks: Speech signal → Pre-processing → Feature Extraction (MFCC) → Pattern Classification (GMM) → Logistic → PLI]
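The framing step of the preprocessing front end described in the methodology (40-millisecond Hamming windows, 50% overlap, 25 kHz sampling rate) can be sketched as follows. The function name `frame_signal` and its default arguments are illustrative assumptions, not the authors' code. With these settings each window holds 1000 samples and the hop is 500 samples (20 ms), which gives the stated rate of 50 frames/s.

```python
import numpy as np

def frame_signal(x, fs=25_000, win_ms=40.0, overlap=0.5):
    """Split a signal into overlapping Hamming-windowed frames.

    Hypothetical helper: returns an array of shape (n_frames, win_len).
    Trailing samples that do not fill a whole window are discarded.
    """
    x = np.asarray(x, dtype=float)
    win_len = int(round(fs * win_ms / 1000.0))      # 1000 samples at 25 kHz
    hop = int(round(win_len * (1.0 - overlap)))     # 500 samples -> 50 frames/s
    if len(x) < win_len:
        return np.empty((0, win_len))
    n_frames = 1 + (len(x) - win_len) // hop
    # Row i selects samples [i*hop, i*hop + win_len)
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win_len)
```

Each row of the result would then be passed to the MFCC extraction stage; a 1-second recording yields 49 complete frames under these assumptions.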
TABLE 2. Distribution of Voice Samples According to Gender and GRBAS Scale
Gender | GRBAS 0–3 (normal) | GRBAS 3–6 | GRBAS 6–9 | GRBAS >9
Female | 68 | 38 | 26 | 3
Male | 40 | 28 | 19 | 4
Total | 108 | 66 | 45 | 7

FIGURE 2. Histogram of a single cepstral coefficient and its approximation by means of a Gaussian mixture (solid line).
FIGURE 3. Average histogram of the log-likelihood ratio for normal and pathological voices. [Horizontal axis: log-likelihood ratio]
FIGURE 4. Normalized cumulative false-positive and false-negative plots. The normalization is carried out with a logistic transformation. [Horizontal axis: PLI; curves: pathological, normal; marked point: EER]

by Hirano et al,32 it is demonstrated that there is a strong correlation between psychoacoustic perceptual evaluations and acoustic parameters, such as APQ, PPQ, and NNE. To achieve this evaluation, it is usual to work with continuous speech recordings. However, sometimes, it is approached by means of sustained vowels, although there are studies33 demonstrating that the results might differ depending on the material used.
The PLI was obtained using the sustained vowel phonations available in the database, but the perceptual judgments were carried out listening to both the continuous speech and the sustained vowel recordings. The GRBAS evaluation has been carried out following a random and blind procedure, hearing the vowel and the continuous speech of each patient. Table 2 shows the final distribution of the voice samples stored in the database once they have been segmented according to gender and the GRBAS scale.

Parameterization
Through this approach, the modeling is conducted by means of short-time features. As proposed in our former work,20 the following parameters were calculated for each frame extracted from the sustained phonation of the /ah/ vowel: (1) 16 MFCCs and (2) the first temporal derivatives (Δ) of the previous features. Thus, the dimension of the final feature vector is N = 32 (16 MFCCs and their respective Δs). A brief description of these features is given in the following sections.
The Mel frequency cepstral coefficient features. This family of parameters has been widely used in speech recognition and speaker verification. In this work, the MFCCs were calculated following a nonparametric modeling method based on the human auditory perception system. The mapping between the real frequency scale (Hz) and the perceived frequency scale (mels) is approximately linear below 1 kHz and logarithmic for higher frequencies. The bandwidth of the critical band varies according to the perceived frequency.21 Such mapping converts real frequency into perceived frequency, and matches the idea that a well-trained speech therapist is able to detect in most cases the presence of a disorder simply by listening to the speech sample.
MFCCs can be estimated using a parametric approach derived from linear prediction coefficients (LPC), or using a nonparametric fast Fourier transform (FFT)-based approach. However, FFT-based MFCCs typically encode more information from excitation, whereas LPC-based MFCCs remove the excitation.34 FFT-based MFCC parameters are obtained by calculating the discrete cosine transform (DCT)21 over the logarithm of the energy in several frequency bands. Each band in the frequency domain is dependent on the bandwidth of the
filter central frequency. The higher the frequency, the wider the bandwidth is.
Temporal derivatives. The MFCC features have been extended to include the first temporal derivatives among the neighboring frames. The first temporal derivatives (Δ) are good estimators of the speed of variation, giving relevant information about the dynamics and short-time variability. These features have been considered significant because, in the presence of voice disorders, a lower degree of stationarity might be expected (ie, larger temporal variation of the MFCC vector of parameters).35 Another reason to complement the feature vectors with the Δ is that the generative approaches used to model the normal and pathological classes (the GMMs) do not consider any temporal dependence by themselves. The calculation of Δ is achieved by means of antisymmetric finite impulse response (FIR) filters to avoid phase distortion of the temporal sequence.36

FIGURE 5. DET plot to show the false alarm (or false positive) versus miss probability (or false negative). The closer the plot is to the origin, the better is the performance. [Horizontal axis: false alarm probability (in %); vertical axis: miss probability (in %)]

Statistical modeling
The motivation to use GMM is attributed to its ability to represent a great variety of distributions.22,23 Each class (normal or pathological) is modeled as a random process whose probability density function (pdf) is characterized by a mixture of Q Gaussians (Figure 2). Let $x \in \mathbb{R}^N$ be an N-dimensional random vector with an arbitrary distribution representing the parameters extracted from each frame. Therefore, let us suppose that $x$ is the evidence measured from the speech, $\Theta_n$ is the hypothesis that the speech sample corresponds to a normal voice, and $\Theta_p$ is the hypothesis that the speech sample belongs to a pathological voice. Thus, for each frame, both the likelihood to be normal, $p(x \mid \Theta_n)$, and the likelihood to be pathological, $p(x \mid \Theta_p)$, could be obtained as a result of the score assigned to each model modeling the distribution density of $x$ with a Gaussian mixture density. A mixture of Q component densities is given by Equation 1:37

$$P(x \mid \Theta) = \sum_{i=1}^{Q} c_i \, p_i(x); \qquad \sum_{i=1}^{Q} c_i = 1; \quad c_i \ge 0 \tag{1}$$

where $\Theta \in \{\Theta_n, \Theta_p\}$ indicates the class (normal or pathological) to model, $p_i(x)$, $i = 1, \ldots, Q$ are the component densities, and $c_i$, $i = 1, \ldots, Q$ are the component weights. Each component density is an N-variate Gaussian function; hence, the different components act together to model the overall pdf.
The model is trained with an expectation-maximization (EM) algorithm.38 The EM algorithm is an iterative method for learning maximum likelihood parameters of a generative model where some of the random variables are observed and some are hidden.
In this framework, the optimum test(a) to decide between two hypotheses is to establish a threshold $\theta$ over the LR given by Equation 2:

$$LR = \frac{P(x \mid \Theta_p)}{P(x \mid \Theta_n)} \; \begin{cases} > \theta, & \text{accept } \Theta_p \\ < \theta, & \text{accept } \Theta_n \end{cases} \tag{2}$$

(a) Strictly speaking, the LR test is only optimal when the likelihood functions are exactly known. In practice, this is not the case.

A new index, log LR, is obtained by subtracting the log-likelihood (likelihood in the logarithmic domain) to be normal from the log-likelihood to be pathological. This is represented in Equation 3:

$$LR = \frac{P(x \mid \Theta_p)}{P(x \mid \Theta_n)} \;\Rightarrow\; \log LR = \log\big(P(x \mid \Theta_p)\big) - \log\big(P(x \mid \Theta_n)\big) \tag{3}$$

If the a priori probability of a speech sample to be normal, $P(\Theta_n)$, or pathological, $P(\Theta_p)$, were known, the LR could be calculated including this a priori knowledge as in Equation 4. This a priori knowledge would represent the expertise of the person evaluating the voice.

$$LR' = \frac{P(\Theta_p \mid x)}{P(\Theta_n \mid x)} = \frac{P(\Theta_p) \cdot P(x \mid \Theta_p)}{P(\Theta_n) \cdot P(x \mid \Theta_n)} \tag{4}$$

In this work, the a priori probabilities to be normal or pathological were assumed to be equal (ie, no a priori information, or equal probability to be normal/pathological), although in practice, these probabilities could be adjusted according to the primary judgment given by the evaluator, and/or combined with epidemiological data about the prevalence of speech disorders in the population.
For the sake of simplicity, and to make its computation easier, Equation 4 can also be expressed in the logarithmic domain. Thus, the final score, z, assigned to the whole utterance is computed assuming independence between observations by adding the frame-level log-LR values obtained over a patient’s utterance and dividing by the number of frames evaluated.
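To make the scoring chain of Equations 1 to 3 concrete, the following sketch evaluates the log-likelihood of each frame under a full-covariance Gaussian mixture and then averages the frame-level log-LR into the utterance score z. It assumes both mixtures have already been trained (the EM step is not shown); the names `gmm_logpdf` and `utterance_score` and the `(weights, means, covs)` tuple layout are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def gmm_logpdf(X, weights, means, covs):
    """log P(x | Theta) for every row of X under a full-covariance GMM (Eq. 1)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n_dim = X.shape[1]
    log_terms = []
    for c, mu, S in zip(weights, means, covs):
        diff = X - np.asarray(mu, dtype=float)
        L = np.linalg.cholesky(np.asarray(S, dtype=float))
        y = np.linalg.solve(L, diff.T)               # whitened residuals, (n_dim, n_frames)
        maha = np.sum(y ** 2, axis=0)                # squared Mahalanobis distances
        log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log|S| from the Cholesky factor
        log_terms.append(np.log(c) - 0.5 * (n_dim * np.log(2.0 * np.pi) + log_det + maha))
    log_terms = np.stack(log_terms)                  # (Q, n_frames)
    m = log_terms.max(axis=0)                        # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_terms - m).sum(axis=0))

def utterance_score(X, gmm_normal, gmm_pathological):
    """Average frame-level log-LR (Eq. 3) over all frames of one utterance."""
    log_lr = gmm_logpdf(X, *gmm_pathological) - gmm_logpdf(X, *gmm_normal)
    return float(log_lr.mean())
```

Under this convention a positive z means the frames are, on average, better explained by the pathological model, and a negative z favors the normal model.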
TABLE 3. Rank Correlation Coefficients of Different Parameters Extracted From Pathological (n = 173) and Normal Voices (n = 53)
Column groups: frequency perturbation parameters (Jitta, Jitter, RAP, PPQ, sPPQ); amplitude perturbation parameters (ShdB, Shim, sAPQ); noise features (HNR, CHNR, NNE, GNE); PLI.

Pathological
Parameter | Jitta | Jitter | RAP | PPQ | sPPQ | ShdB | Shim | sAPQ | HNR | CHNR | NNE | GNE | PLI
Jitta | 1.00 | 0.90 | 0.89 | 0.88 | 0.75 | 0.72 | 0.73 | 0.49 | 0.70 | 0.72 | 0.79 | 0.67 | 0.60
Jitter | – | 1.00 | 0.99 | 0.99 | 0.81 | 0.69 | 0.69 | 0.48 | 0.69 | 0.56 | 0.66 | 0.62 | 0.59
RAP | – | – | 1.00 | 0.98 | 0.80 | 0.68 | 0.68 | 0.47 | 0.68 | 0.55 | 0.66 | 0.61 | 0.58
PPQ | – | – | – | 1.00 | 0.81 | 0.68 | 0.68 | 0.49 | 0.69 | 0.55 | 0.64 | 0.60 | 0.57
sPPQ | – | – | – | – | 1.00 | 0.55 | 0.55 | 0.64 | 0.60 | 0.47 | 0.53 | 0.57 | 0.47
ShdB | – | – | – | – | – | 1.00 | 0.99 | 0.75 | 0.76 | 0.86 | 0.74 | 0.72 | 0.39
Shim | – | – | – | – | – | – | 1.00 | 0.74 | 0.76 | 0.86 | 0.75 | 0.72 | 0.39
sAPQ | – | – | – | – | – | – | – | 1.00 | 0.60 | 0.63 | 0.48 | 0.56 | 0.24
HNR | – | – | – | – | – | – | – | – | 1.00 | 0.80 | 0.60 | 0.50 | 0.36
CHNR | – | – | – | – | – | – | – | – | – | 1.00 | 0.78 | 0.65 | 0.35
NNE | – | – | – | – | – | – | – | – | – | – | 1.00 | 0.78 | 0.56
GNE | – | – | – | – | – | – | – | – | – | – | – | 1.00 | 0.51
PLI | – | – | – | – | – | – | – | – | – | – | – | – | 1.00

Normal
Parameter | Jitta | Jitter | RAP | PPQ | sPPQ | ShdB | Shim | sAPQ | HNR | CHNR | NNE | GNE | PLI
Jitta | 1.00 | 0.71 | 0.67 | 0.67 | 0.72 | 0.61 | 0.61 | 0.61 | 0.61 | 0.44 | 0.64 | 0.58 | 0.09*
Jitter | – | 1.00 | 0.99 | 0.99 | 0.77 | 0.40 | 0.40 | 0.29 | 0.22* | 0.21* | 0.37 | 0.35 | 0.007*
RAP | – | – | 1.00 | 0.99 | 0.73 | 0.36 | 0.34 | 0.18* | 0.17* | 0.17* | 0.31 | 0.31 | 0.013*
PPQ | – | – | – | 1.00 | 0.73 | 0.40 | 0.38 | 0.19* | 0.18* | 0.20* | 0.34 | 0.32 | 0.007*
sPPQ | – | – | – | – | 1.00 | 0.44 | 0.43 | 0.46 | 0.50 | 0.36 | 0.49 | 0.5 | 0.02*
ShdB | – | – | – | – | – | 1.00 | 0.99 | 0.80 | 0.50 | 0.75 | 0.66 | 0.52 | 0.11*
Shim | – | – | – | – | – | – | 1.00 | 0.96 | 0.48 | 0.74 | 0.66 | 0.55 | 0.10*
sAPQ | – | – | – | – | – | – | – | 1.00 | 0.60 | 0.60 | 0.50 | 0.43 | 0.09*
HNR | – | – | – | – | – | – | – | – | 1.00 | 0.50 | 0.37 | 0.4 | 0.09*
CHNR | – | – | – | – | – | – | – | – | – | 1.00 | 0.47 | 0.33 | 0.26*
NNE | – | – | – | – | – | – | – | – | – | – | 1.00 | 0.58 | 0.01*
GNE | – | – | – | – | – | – | – | – | – | – | – | 1.00 | 0.19*
PLI | – | – | – | – | – | – | – | – | – | – | – | – | 1.00

Abbreviations: RAP, relative average perturbation; ShdB, shimmer in dB.
* Insignificant correlations. Significant correlations are with P < 0.05 (95% confidence interval).
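The coefficients in Table 3 are Spearman rank-order correlations. A minimal sketch of that statistic is given below: each variable is replaced by its ranks (ties receive the average of the ranks they span) and the Pearson correlation of the ranks is returned. The helper names `average_ranks` and `spearman_rho` are hypothetical, and the significance test reported in the table is not reproduced here.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):            # average the ranks inside each tie group
        tie = v == val
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman_rho(a, b):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    return float(np.corrcoef(ra, rb)[0, 1])
```

Because only ranks enter the computation, any monotonic relationship between two parameters gives a coefficient of magnitude 1 regardless of the shape of the underlying distributions.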
The statistical modeling has been built around a GMM with Q = 8 mixtures trained with 16 MFCC + Δ parameters (N = 32), as proposed and discussed in the work by Godino-Llorente et al (2006).20 As the number of Gaussians is small, the linear combination of full covariance Gaussians is able to model the correlation among feature elements.
Figure 3 depicts the average histogram of the log-LR obtained for normal and pathological voices. The averaging was carried out by means of the results obtained using a k-fold cross-validation scheme (K = 11).

PATHOLOGICAL LIKELIHOOD INDEX
The proposed PLI is calculated by normalizing the log-LR obtained by means of a sigmoid (or logistic) transformation. Assuming Gaussianity and equal variances in the distribution of the scores given by the GMM for both normal and pathological classes, the transformation gives a new score within the range (0,1) that can be considered an estimation of the a posteriori probability39,40 that the speech utterance belongs to the normal or pathological class. Equation 5 represents the logistic transformation:

$$PLI = f(z) = \frac{1}{1 + e^{(w_o + w_T \cdot z)}}; \qquad PLI \in (0,1) \tag{5}$$

where z is the log-LR obtained from the GMM model, and $w_o$ and $w_T$ are the parameters that define the transformation, which are calculated as:41

$$w_o = \frac{\mu_n^2 - \mu_p^2}{2\sigma^2}; \qquad w_T = \frac{\mu_n - \mu_p}{2\sigma^2} \tag{6}$$

where $\mu_n$ and $\mu_p$ are the means of the distributions corresponding to normal and pathological voices, and $\sigma^2$ is their variance. As the variances of both distributions do not necessarily have to be common, $\sigma$ is calculated by averaging over both distributions (Equation 7):

$$\sigma = 0.5\,(\sigma_n + \sigma_p) \tag{7}$$

The parameters of the sigmoid derived from the normal and pathological scores are fixed with the data set used in the training phase. Under these premises, the score obtained is a good estimator of the a posteriori probability, and it is usual to obtain values close to 1 for pathological voices and close to 0 for almost all normal voices (following a nearly deterministic distribution). This is not desirable: the nonlinear part of the logistic function would be expected to span the whole range of the distribution of the normal and pathological scores. This problem is easily solved by increasing the standard deviation to be equal to $2\sigma$.

RESULTS AND DISCUSSION
Figure 4 shows the cumulative false acceptance and false rejection plot after the normalization using the logistic transformation. Both lines cross at the threshold that corresponds to the EER point. Because of the normalization process, the scores are in the range [0,1], which is a good estimator of the a posteriori probability that the speech record is normal or pathological. The discrimination capability of the PLI parameter is represented in Figure 5, where the average DET plot (averaging the k runs of the k-fold cross-validation scheme) shows the ability of the proposed parameter to separate normal and pathological voices. The plot depicts the false alarm versus the miss probability at different decision thresholds. The closer the plot is to the origin, the better the detector is.
Based on the definition of the PLI parameter, a score close to 0 is expected for voices with low perturbations and noise (closer to the normality criterion) and a PLI score close to 1 for those voices with a great amount of perturbations. Figure 6 represents the box plots of the PLI corresponding to the control group and the pathological subset. The values corresponding to pathological voices are more widely spread than those belonging to normal records. As usual in medical decision making, some overlapping might be expected. Because the notches in the box plots do not overlap, we can conclude, with 95% confidence, that the true medians do differ.

FIGURE 6. Box plot of the PLI for normal and pathological voices. The box has lines at the lower quartile, median, and upper quartile values. The notches represent a robust estimate of the uncertainty about the medians for box-to-box comparison. The notches do not overlap; it means that the medians of the two groups differ at the 5% significance level. [Vertical axis: PLI; groups: normal, pathological]

Table 3 shows the correlation of the PLI with the different acoustic parameters enumerated before (Table 1). The analysis was done by separating the normal and pathological subsets as in the work by Michaelis et al (1998).42 Although the goal is just to show and compare the correlation between the PLI and the rest of the acoustic parameters, both tables represent the correlation among all extracted parameters. The correlation found among the “classical” parameters (Table 1) is very similar to that found in the literature.32,42,43 Conclusions about it may
FIGURE 7. A. Scatter plot of the CHNR versus PLI. B. Scatter plot of the HNR versus PLI. C. Scatter plot of the NNE versus PLI. D. Scatter plot of the GNE versus PLI. [Each panel plots the PLI on the vertical axis against the corresponding noise measure on the horizontal axis, with separate markers for normal and pathological voices.]
be found in the works by Michaelis et al (1998) and Hirano et al (1988), [32,42] and hence, will not be discussed here. Our goal is to pay attention to the correlation results that correspond to the PLI and the rest of the parameters (last column in Table 3).

For the pathological subset, all correlations were found to be significant (Table 3). In general terms, the new parameter showed lower correlation with the classical parameters than the classical parameters showed among themselves. In this sense, PLI showed its highest correlation with the frequency perturbation features (below r = 0.6), but this was slightly lower than the correlation of the frequency perturbation parameters with the rest of the features studied. Regarding the amplitude perturbation, the PLI showed lower correlation (below r = 0.39) than the other features. With respect to noise, once again, the PLI demonstrated lower correlation (below r = 0.56) than the rest of the parameters. The correlation study means that the new parameter integrates some information about the noise and frequency perturbation, but is less sensitive to amplitude perturbations. These results are logical considering that the noise caused by turbulence increases the energy at the different frequency bands, and this phenomenon is better reflected in the MFCC parameters used to build the model than the disturbances caused by amplitude perturbations. Moreover, those disturbances owing to aperiodicities tend to partially increase the interharmonic energy because of uncertainty about the right position of the harmonics; hence, this phenomenon is also reflected in the MFCC parameters and explains the existing correlation with the frequency and amplitude perturbation parameters. Nevertheless, the most interesting finding is that the correlation is slightly lower than that found among the classical parameters.

For the normal voices, the correlations are generally lower (below r = 0.26) than those for the pathological group, and are not significant (Table 3). Furthermore, as the voice quality of normal voices is generally regarded to possess many independent degrees of freedom, an important correlation should not be expected for the normal subset. [42]

Figure 7 depicts the scatter plots between the PLI and four different noise measurements: HNR, CHNR, NNE, and GNE. They are represented to show the normal and pathological clusters. For these four features, the correlation found is significant, but below r = 0.56. The plots in Figure 7 show some correlation, but also a large dispersion. As commented earlier, this means that some of the aspects measured by the noise features are lightly reflected in PLI, because noise is present in most of the pathological voices.
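The per-subset correlation analysis described above can be sketched as follows. This is only an illustration: it assumes Pearson's coefficient (the estimator behind Table 3 is not stated in this section), and the names `pearson_r` and `correlation_by_group` and the record layout are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def correlation_by_group(records):
    """Correlate PLI with one acoustic feature separately per subset.

    records: iterable of (group, pli, feature) tuples, where group is a
    label such as "normal" or "pathological" (hypothetical layout).
    Returns a dict mapping each group label to its Pearson r.
    """
    groups = {}
    for group, pli, feature in records:
        pli_values, feature_values = groups.setdefault(group, ([], []))
        pli_values.append(pli)
        feature_values.append(feature)
    return {g: pearson_r(xs, ys) for g, (xs, ys) in groups.items()}
```

Computing the coefficient inside each subset, rather than on the pooled data, avoids the inflated correlation that the separation between the two clusters would otherwise introduce.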
FIGURE 8. ROC plot to show and compare the ability to discriminate between normal and pathological voices (true positive rate versus false positive rate). A. PLI and frequency perturbation parameters. B. PLI and amplitude perturbation parameters. C. PLI and noise parameters.
Legend, panel A: jitta AUR=0.89; jitter AUR=0.83; RAP AUR=0.816; PPQ AUR=0.813; sPPQ AUR=0.817; PLI AUR=0.987.
Legend, panel B: ShdB AUR=0.922; Shimm AUR=0.928; sAPQ AUR=0.888; PLI AUR=0.987.
Legend, panel C: CHNR AUR=0.97; HNR AUR=0.95; VTI AUR=0.84; NNE AUR=0.96; GNE AUR=0.97; PLI AUR=0.99.
TABLE 4.
Screening Accuracy, AUR, and SE of the Area for a Set of Acoustic Parameters

Types                    Parameters              Screening Accuracy (%)   AUR    SE
Frequency perturbation   Absolute jitter (ms)    82.24                    0.86   0.025
                         Relative jitter (%)     73.83                    0.80   0.031
                         RAP (%)                 74.30                    0.79   0.032
                         PPQ (%)                 73.40                    0.79   0.032
                         sPPQ (%)                75.70                    0.82   0.030
Amplitude perturbation   Absolute shimmer (dB)   82.71                    0.92   0.02
                         Relative shimmer (%)    83.64                    0.92   0.02
                         sAPQ (%)                80.84                    0.90   0.02
Noise                    HNR (dB)                84.57                    0.95   0.03
                         CHNR (dB)               90.01                    0.97   0.02
                         NNE (dB)                89.71                    0.96   0.01
                         GNE                     89.91                    0.97   0.02
                         PLI                     94.07                    0.99   0.01

The screening accuracy is obtained from the probability distribution (histogram) at the equal error rate threshold (ie, the point for which the false-positive rate equals the false-negative rate).
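The three quantities reported in Table 4 can be illustrated with a short sketch. It is not the authors' implementation: the area is computed here with the rank (Mann-Whitney) formulation, the standard error uses the approximation of Hanley and McNeil (1982), and the equal error rate threshold is searched over the observed scores. All function names are hypothetical, and higher scores are assumed to indicate pathology (a feature such as HNR, which decreases with pathology, would need its sign flipped first).

```python
import math


def auc_mann_whitney(neg, pos):
    """Area under the ROC curve as P(pos score > neg score); ties count 1/2."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the area (Hanley and McNeil, 1982 approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc * auc)
           + (n_neg - 1) * (q2 - auc * auc)) / (n_pos * n_neg)
    return math.sqrt(max(var, 0.0))


def equal_error_rate_threshold(neg, pos):
    """Return (threshold, fpr, fnr) minimizing |fpr - fnr|.

    A record is called positive when its score is >= threshold.
    """
    best = None
    for t in sorted(set(neg) | set(pos)):
        fpr = sum(1 for n in neg if n >= t) / len(neg)
        fnr = sum(1 for p in pos if p < t) / len(pos)
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, t, fpr, fnr)
    return best[1], best[2], best[3]


def accuracy_at(threshold, neg, pos):
    """Fraction of records classified correctly at the given threshold."""
    correct = (sum(1 for n in neg if n < threshold)
               + sum(1 for p in pos if p >= threshold))
    return correct / (len(neg) + len(pos))
```

With finite samples the false-positive and false-negative rates rarely coincide exactly, so the sketch takes the threshold where they are closest.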
FIGURE 9. Scatter plot of the PLI and the overall score assigned according to the GRBAS scale (horizontal axis: PLI; vertical axis: overall GRBAS score).

Figure 8 illustrates graphically the discrimination ability between normal and pathological voices of the PLI, compared with the frequency perturbation (Figure 8A), amplitude perturbation (Figure 8B), and noise parameters (Figure 8C). The plots show that PLI has better discrimination ability than the other measurements represented. The area under the ROC curve (AUR) corresponding to the PLI parameter numerically supports this statement (Table 4).

Finally, Figure 9 and Table 5 illustrate the correlation between the PLI and the perceptual evaluations. To allow comparisons, Table 5 shows the correlation between several noise features and the GRBAS parameters. Although it is known that perceptual judgments have a wide intra- and interevaluator variability that could bias the evaluations, the results demonstrate that PLI correlates well with the GRBAS overall score (r = 0.63), with the G (r = 0.65) and with the R (r = 0.65) features. As the severity of hoarseness is quantified under the parameter G (grade), which integrates two main components, breathiness (B) and roughness (R), we can conclude that the PLI can be used as a predictor of the severity of hoarseness and of the overall perceived quality, just as the noise features studied, but with the advantage of better discrimination capabilities for screening. Moreover, Table 5 illustrates that the results found in terms of correlation between PLI and the G and R ratings are very similar to those obtained with the noise parameters.

CONCLUSIONS
As with other classic parameters, such as jitter, shimmer, HNR, NNE, and others, PLI is a long-term feature calculated by averaging short-time measurements. The proposed parameter may be used for the one-dimensional study and characterization of pathological voices. In this study, the proposed methodology has been applied to sustained vowels, but could be easily extended to running speech using an appropriate voiced-unvoiced detector.

From the definition of the PLI parameter, the results show that voices with a smaller amount of perturbations and noise have a low score (near 0), whereas pathological voices with a greater amount of perturbations have a high score (near 1). Moreover, in view of the ROC plots and their AUR, this new parameter alone has demonstrated better screening accuracy than the most important parameters reported in the literature.

In general terms, the new parameter showed lower correlation with the classical parameters than the classical parameters showed among themselves. The PLI shows a medium correlation with the frequency perturbation features and with some of the noise parameters, and low correlation with the amplitude perturbation parameters. Thus, we can say that PLI integrates some information about noise and frequency perturbation, but measures another complementary aspect of the phenomenon (or different voice characteristics) than the "classical" acoustic measures. The correlations found mean that this parameter represents a good complement to the "classical" acoustic parameters (especially the amplitude perturbation parameters), giving an indication of the probability that a voice record is either normal or pathological (ie, the "degree of normality"). Moreover, the proposed parameter is demonstrated to be correlated with the G, R, and overall GRBAS scores; hence, we can conclude that the proposed feature gives an indication of the presence or absence of pathology and an evaluation of the perceived hoarseness.

The PLI is based on a previous computation of the MFCCs and their first derivatives. The main advantage of the MFCCs is that they are very robust in their calculation, but the drawback is that their individual physical interpretation is not completely clear. Despite this, the output of the statistical model presented might be understood as a quality measurement, giving an estimation of the likelihood that a speech utterance is normal or pathological. Hence, the final index could be easily understood by a speech therapist.

For this work, the a priori probabilities of being normal or pathological were supposed to be equal but, in practice, these probabilities could be adjusted according to the primary judgment given by the evaluator, and/or by combining it with epidemiological data about the prevalence of speech disorders in the population. Including this a priori information, the efficiency of the proposed methodology might be improved.
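How unequal a priori probabilities could enter the score can be sketched with Bayes' rule written in log-odds form. This is an assumption-laden illustration, not the fitted transformation of Equation 5: it treats the input as an ideal log-likelihood ratio with the convention log p(x | pathological) - log p(x | normal), and the function name and parameters are hypothetical.

```python
import math


def posterior_pathological(log_lr, prior_pathological=0.5):
    """Posterior P(pathological | x) from a log-likelihood ratio and a prior.

    Assumed convention (not taken from the source):
    log_lr = log p(x | pathological) - log p(x | normal).
    Posterior log-odds = log_lr + log(prior odds); the logistic function
    maps the log-odds back to a probability in (0, 1).
    """
    if not 0.0 < prior_pathological < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_prior_odds = math.log(prior_pathological / (1.0 - prior_pathological))
    a = log_lr + log_prior_odds
    # Numerically stable logistic: avoid overflow in exp for large |a|.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)
```

With equal priors the prior term vanishes and the score depends on the likelihood ratio alone; a lower assumed prevalence shifts every score toward 0 without changing the ranking of the records.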
TABLE 5.
Rank Correlation Coefficients (r) of PLI and Noise Parameters With Respect to the GRBAS Perceptual Scale

Parameters   G      R      B      A      S       GRBAS
PLI          0.65   0.65   0.38   0.51   0.24    0.63
CHNR         0.65   0.67   0.41   0.53   0.06*   0.60
NNE          0.63   0.75   0.38   0.52   0.14*   0.56
GNE          0.58   0.65   0.45   0.37   0.15*   0.54

* Nonsignificant correlations. Correlations are considered significant at P < 0.05.
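A rank correlation of the kind reported in Table 5 can be sketched as below. The sketch assumes Spearman's coefficient (the table says only "rank correlation") and assigns average ranks to ties, which matters here because perceptual ratings take few distinct values. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks of values; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Because only the ordering of the values is used, the coefficient is unaffected by any monotonic rescaling of the PLI or of the perceptual score.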
In conclusion, PLI seems to be a promising feature for the screening and assessment of voice quality, showing better discrimination ability than other noise or perturbation measurements found in the literature, demonstrating to be a good predictor of the perceived hoarseness, and emerging as a clear complement to improve the multidimensional analysis based on the classical acoustic parameters.

Acknowledgments
This research was carried out under grant TEC2006-12887-C02 from the Ministry of Education of Spain. The authors would like to thank Janaína Mendes-Laureano for her support in the perceptual evaluation of the voice samples.

REFERENCES
1. Smits I, Ceuppens P, De Bodt M. A comparative study of acoustic voice measurements by means of Dr. Speech and Computerized Speech Lab. J Voice. 2005;19:187–196.
2. Baken RJ, Orlikoff R. Clinical Measurement of Speech and Voice. (2nd ed). San Diego, CA: Singular Publishing Group; 2000.
3. Feijoo S, Hernández-Espinosa C. Short-term stability measures for the evaluation of vocal quality. J Speech Hear Res. 1990;33:324–334.
4. Klingholtz F, Martin F. The measurement of the signal-to-noise ratio (SNR) in continuous speech. Speech Commun. 1987;6:15–26.
5. Yumoto E, Gould WJ, Baer T. Harmonics-to-noise ratio as an index of the degree of hoarseness. J Acoust Soc Am. 1982;71:1544–1550.
6. Qi Y, Weinberg B, Bi N, Hess W. Minimizing the effect of period determination on the computation of amplitude perturbation of voice. J Acoust Soc Am. 1995;97:2525–2532.
7. Qi Y, Hillman RE. Temporal and spectral estimations of harmonics-to-noise ratio in human voice signals. J Acoust Soc Am. 1997;102:537–543.
8. de Krom G. A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J Speech Hear Res. 1993;36:254–266.
9. Kasuya H, Ogawa S, Mashima K, Ebihara S. Normalized noise energy as an acoustic measure to evaluate pathologic voice. J Acoust Soc Am. 1986;80:1329–1334.
10. Deliyski D. Acoustic model and evaluation of pathological voice production. In: Proceedings of Eurospeech '93, Vol. 3 1993:1969–1972. Berlin, Germany.
11. Prosek RA, Montgomery AA, Walden BE, Hawkins DB. An evaluation of residue features as correlates of voice disorders. J Commun Disord. 1987;20:105–117.
12. Michaelis D, Gramss T, Strube HW. Glottal-to-noise excitation ratio: a new measure for describing pathological voices. Acustica/Acta Acustica. 1997;83:700–706.
13. Winholtz W. Vocal tremor analysis with the vocal demodulator. J Speech Hear Res. 1992;35:562–563.
14. Boyanov B, Hadjitodorov S. Acoustic analysis of pathological voices. A voice analysis system for the screening of laryngeal diseases. IEEE Eng Med Biol Mag. 1997;16:74–82.
15. Yumoto E, Sasaki Y, Okamura H. Harmonics-to-noise ratio and psychophysical measurement of the degree of hoarseness. J Speech Hear Res. 1984;27:2–6.
16. Hadjitodorov S, Boyanov B, Teston B. Laryngeal pathology detection by means of class-specific neural maps. IEEE Trans Inf Technol Biomed. 2000;4:68–73.
17. Parsa V, Jamieson DG. Identification of pathological voices using glottal noise measures. J Speech Lang Hear Res. 2000;43:469–485.
18. Hadjitodorov S, Mitev P. A computer system for acoustic analysis of pathological voices and laryngeal disease screening. Med Eng Phys. 2002;24:419–429.
19. Ritchings RT, McGillion MA, Moore CJ. Pathological voice quality assessment using artificial neural networks. Med Eng Phys. 2002;24:561–564.
20. Godino-Llorente JI, Gómez-Vilda P, Blanco-Velasco M. Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. IEEE Trans Biomed Eng. 2006;53:1943–1953.
21. Deller JR, Proakis JG, Hansen JHL. Discrete-Time Processing of Speech Signals. New York: Macmillan Series for Prentice Hall; 1993.
22. Reynolds DA, Rose RC. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans Speech Audio Processing. 1995;3:72–83.
23. Reynolds DA. Speaker identification using Gaussian mixture speaker models. Speech Commun. 1995;17:91–108.
24. Weiss NA. Introductory Statistics. (6th ed). Reading, MA: Addison Wesley; 2000.
25. Manfredi C, D'Aniello M, Bruscaglioni P, Ismaelli A. A comparative analysis of fundamental frequency estimation methods with application to pathological voices. Med Eng Phys. 2000;22:135–147.
26. Martin AF, Doddington GR, Kamm T, Ordowski M, Przybocki MA. The DET curve in assessment of detection task performance. In: Proceedings of Eurospeech '97, Vol. IV 1997;1895–1898. Rhodes, Crete.
27. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristics curves derived from the same cases. Radiology. 1983;148:839–843.
28. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
29. Kay Elemetrics Corp. Disordered Voice Database. Version 1.03. Lincoln Park, NJ: Kay Elemetrics Corp; 1994 [CD-ROM].
30. Hirano M. Psycho-Acoustic Evaluation of Voice. New York: Springer-Verlag; 1981.
31. Dejonkere PH, Remacle M, Fresnel-Elbaz E, Woisard W, Crevier-Euchman L, Millet B. Differentiated perceptual evaluation of pathological voice quality: reliability and correlations with acoustic measurements. Rev Laringol Otol Rhinol. 1996;117:219–224.
32. Hirano M, Hibi S, Yoshida T, Hirade Y, Kasuya H, Kikuchi Y. Acoustic analysis of pathological voice. Acta Otolaringol. 1988;105(2):432–438.
33. Revis J, Giovanni A, Wuyts F. Comparison of different types of vowel fragments for the evaluation of voice quality. In: Proceedings of Voicedata'98; 80–85.
34. Bou-Ghazale SE, Hansen JHL. A comparative study of traditional and newly proposed features for recognition of speech under stress. IEEE Trans Speech Audio Processing. 2000;8:429–442.
35. Childers DG, Sung-Bae K. Detection of laryngeal function using speech and electroglottographic data. IEEE Trans Biomed Eng. 1992;39:19–25.
36. Oppenheim AV, Schafer RW, Buck JR. Discrete-Time Signal Processing. (2nd ed). New Jersey: Prentice Hall; 1999.
37. Schalkoff RJ. Pattern Recognition: Statistical, Structural and Neural Approaches. New York: John Wiley & Sons; 1991.
38. Moon TK. The expectation-maximization algorithm. IEEE Signal Processing Mag. 1996;13(6):47–60.
39. Tax DMJ, van Breukelen M, Duin RPW, Kittler J. Combining multiple classifiers by averaging or by multiplying. Pattern Recognit. 2000;33:1475–1485.
40. Kittler J, Hatef M, Duin RPW, Matas J. On combining classifiers. IEEE Trans Pattern Anal Mach Intell. 1998;20:226–239.
41. Bishop CM. Neural Networks for Pattern Recognition. (2nd ed). Oxford, UK: Oxford University Press; 1995.
42. Michaelis D, Fröhlich M, Strube HW. Selection and combination of acoustic features for the description of pathologic voices. J Acoust Soc Am. 1998;103:1628–1639.
43. Fröhlich M, Michaelis D, Strube HW. Acoustic "breathiness measures" in the description of pathological voices. In: Proceedings of ICASSP'98; 937–940. Seattle, WA, USA.
44. Kasuya H, Ogawa S, Kikuchi Y, Ebihara S. An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology. Speech Commun. 1986;5:171–181.
