1 s2.0 S0892199709000563 Main - 2
Summary: This article introduces a new index to measure the degree of normality in speech. The proposed parameter has been shown to correlate with perceived hoarseness, giving an indication of the degree of normality. Its calculation is based on a statistical model developed to represent normal and pathological voices. The modeling is built around Gaussian mixture models and Mel frequency cepstral coefficients. The proposed index has been named pathological likelihood index (PLI). PLI is compared with other aperiodicity features (such as jitter and shimmer) and with measurements sensitive to additive noise (such as harmonics-to-noise ratio (HNR), cepstrum-based HNR, normalized noise energy, and glottal-to-noise excitation ratio). The proposed parameter proves to be a good estimator of the presence of pathology, showing lower correlation with noise, frequency, and amplitude perturbation parameters than these classical features show among themselves.
Key Words: Screening voice disorders–Short-term analysis–Cepstral parameters–Gaussian mixture models–Voice
quality.
framework proposed in our previous work.20 The modeling is conducted by means of Gaussian mixture models (GMM) using nonparametric short-term Mel frequency cepstral coefficients (MFCC).21 Each voice record is characterized by as many feature vectors as frames are extracted from the recording. These feature vectors are used to build two different statistical models: one for normal and the other for pathological voices. For each frame, both the likelihood to be normal and the likelihood to be pathological are calculated. An index (called log likelihood ratio [log-LR]) is obtained by subtracting the log-likelihood (likelihood in the logarithmic domain) to be normal from the log-likelihood to be pathological. The next step averages over time the index calculated for each frame to obtain a final score. Finally, the proposed pathological likelihood index (PLI) is calculated by normalizing the LR by means of a logistic function. The decision about normality or abnormality is taken by establishing a threshold over the normalized LR. This threshold corresponds to the equal error rate (EER) point.22,23 A more detailed description of the process is given in the Methodology and Pathological Likelihood Index sections.
To compare the proposed PLI with the acoustic parameters in existing literature, in this work, a simple correlation analysis was applied to a set of 16 “classical” acoustic measurements (Table 1). The goal is to determine interdependencies between PLI and the other features. Furthermore, the correlation of the proposed parameter with the ratings given by the specialists through perceptual evaluation has been analyzed. The statistic applied was the nonparametric Spearman rank-order correlation coefficient,24 which is independent of the shape of the underlying data distribution. The significance of the difference between the correlation coefficients was also calculated. Moreover, the discrimination capabilities of the proposed parameter are compared with those parameters that have been reported in the literature to be the best indicators of the presence of pathology.
The new index improves the screening accuracy with respect to the existing parameters while maintaining a medium–high correlation with the perceptual judgments about quality given by the specialists, and simultaneously reducing the correlation with respect to the classical parameters found by the state-of-the-art technique.
The article is organized as follows: the Methodology section introduces the block diagram used to build the statistical model; the Pathological Likelihood Index section describes the calculation of the PLI parameter; the Results and Discussion section
TABLE 1.
“Classical” Acoustic Measurements Used in the Study to Compare Results With the PLI
Type | Parameter | Description | Refs.
Frequency perturbation | Jitta | Absolute jitter gives an evaluation of the period-to-period variability of the pitch period (ms) | 2,3
Frequency perturbation | Jitter | Jitter percent represents the relative period-to-period (in short-term) variability of the pitch (%) | 2,3
Frequency perturbation | RAP | Relative average perturbation gives an evaluation of the variability of the pitch period with a smoothing factor of 3 periods (%) | 2,3
Frequency perturbation | PPQ | Pitch period perturbation quotient gives an evaluation in percent of the variability of the pitch period with a smoothing factor of 5 periods (%) | 2,3
Frequency perturbation | sPPQ | Smoothed pitch period perturbation quotient gives an evaluation in percent of the long-term variability of the pitch period with a user-selected number of periods (usually 55) (%) | 2,3
Amplitude perturbation | ShdB | Shimmer in dB gives an evaluation of the period-to-period variability of the peak-to-peak amplitude (dB) | 2,3
Amplitude perturbation | Shim | Shimmer percent gives an evaluation in percent of the variability of the peak-to-peak amplitude. It represents the relative period-to-period (in short-term) variability of the peak-to-peak amplitude (%) | 2,3
Amplitude perturbation | sAPQ | Smoothed amplitude perturbation quotient gives an evaluation in percent of the long-term variability of the peak-to-peak amplitude with a smoothing factor of a user-selected number of periods (usually 55) (%) | 2,3
Noise | HNR | Harmonics-to-noise ratio is an average ratio of energy of the inharmonic components in the range 1500–4500 Hz to the harmonic-component energy in the range 70–4500 Hz (%) | 10,15
Noise | CHNR | Cepstrum-based HNR is an average ratio based on calculating the ratio of the energy of the harmonics related to the noise energy present in the voice (both measured in dB) in the range 70–4500 Hz (dB) | 5,8
Noise | NNE | Normalized noise energy is a ratio of the energy of the noise present (in the range 1–5 kHz) in the vocalization to the total energy of the signal (dB) | 9,44
Noise | GNE | Glottal-to-noise excitation ratio is the maximum of the cross-correlation between Hilbert envelopes calculated for different frequency channels and extracted from the inverse filtering of the speech signal. The bandwidth of the envelopes used is 1 kHz, and frequency bands are separated by 300 Hz | 12,42
Juan Ignacio Godino-Llorente, et al Pathological Likelihood Index 669
summarizes the results and comparisons with other acoustic parameters, paying special attention to the discrimination capabilities of the parameter; and the Conclusions section presents some conclusions.

METHODOLOGY
Figure 1 shows a block diagram describing the process setup for the modeling. The features used to build the statistical models are calculated from short-time windows extracted from the recordings of the sustained phonation of a vowel. The window length was selected to contain, in the worst case, at least two consecutive pitch periods (2T0);25 hence, the feature extraction was performed using 40-millisecond Hamming windows with an overlap of 50% between adjacent frames. Consequently, the frame rate obtained is 50 frames/s.
The feature extraction module calculates, for each frame, the MFCC vector of parameters complemented with other features developed to measure their speed of variation. In the third stage, a pattern classification block models the statistical distribution for normal and pathological voices. Two different statistical models are computed: one for normal and another for pathological voices. The last step is a logistic transformation to obtain a final score normalized into the interval (0,1).
The screening accuracy of the statistical modeling stage is estimated using a k-fold cross-validation scheme. Nine repetitions were used to estimate the performance figures, averaging the results obtained from each data set. For each set, data files were randomly split into two subsets: the first one to train (70% of the files), and the second (30% of the files) to simulate and validate the results, keeping the same proportion for each class. The division into training and evaluation data sets is carried out on a file (rather than frame) basis to prevent the system from learning speaker-related features. Both male and female voices have been mixed together in the training and validation sets.
The accuracy of the parameter has been evaluated with the detection error tradeoff (DET)26 and compared with other acoustic parameters using the relative operating characteristic (ROC)27 plots. These plots allow a comparison of the performance and accuracy with other parameters found in the literature. The ROC27,28 curve is a popular tool in medical decision making. It displays the diagnostic accuracy expressed in terms of sensitivity (or true positive rate) against 1 − specificity (or false acceptance rate) at all possible threshold values in a convenient way. The ROC is analyzed by calculating the area under the curve (AUR) and its standard error (SE), as suggested in the work by Hanley and McNeil (1983).27 On the other hand, the DET plot26 is also widely used for the assessment of detection performance in speaker-verification tasks. A DET curve plots error rates (false positive vs false negative) on both axes, giving uniform treatment to both types of errors. It uses a normal deviate scale for both axes, which spreads out the plot and better distinguishes different well-performing parameters. The closer the plot is to the origin, the better the screening accuracy of the parameter.

Database
As it is the only commercially available voice-disorder database, and to allow reproducibility, the tests have been carried out using the database developed by the Massachusetts Eye and Ear Infirmary Voice and Speech Labs.29 The speech samples were collected in a controlled environment and sampled at a 16-bit resolution. When necessary, a downsampling with a previous half-band filtering is done to adjust every utterance to the sampling rate of 25 kHz. The material available in the database contains the sustained phonation (1- to 3-second long) of the vowel /ah/, and continuous speech recordings of the “rainbow passage.” The samples were obtained from patients (males and females) with normal voices and a wide variety of organic, neurological, traumatic, and psychogenic voice disorders.
The database has been segmented according to the criteria explained in the work by Parsa and Jamieson (2000),17 used earlier in other studies.20 The criteria used in the work by Parsa and Jamieson (2000)17 ensure that gender and age are uniformly distributed among the samples belonging to both classes. The subset taken from the database contains 53 normal and 173 pathological talkers. The larger number of recordings belonging to the pathological set allows a better modeling of a class that has a larger inherent variability. This fact does not imply a slant of the proposed methodology toward the pathological class because, typically, the intra- and interspeaker variability in the feature space of the pathological voices is much greater than in the control group.
Perceptual labeling. To study the correlation between the proposed PLI and the perceptual judgments used in clinical practice, the database has been evaluated by an experienced speech and language therapist (SALT) following the GRBAS scale. This perceptual scale was proposed by Hirano30 and accepted as standard by the Japanese Society of Logopedics and Phoniatrics and the European Group on the Larynx.31 The GRBAS scale comprises five qualitative parameters: grade of dysphony (G), roughness (R), breathiness (B), asthenicity (A), and strain (S). For each parameter, a value in the range {0 ≤ v ≤ 3; v ∈ ℤ} is considered, where 0 corresponds to healthy voice, 1 to light disease, 2 to moderate disease, and 3 to severe disease. The sum of the partial scores gives an overall GRBAS score in the range {0 ≤ v ≤ 12; v ∈ ℤ}. Despite some limitations, GRBAS is simple and fast, and has good correlation with some acoustic parameters. In the work
[Figure 1 block diagram: Speech signal → Pre-processing → Feature extraction (MFCC) → Pattern classification (GMM) → Logistic → PLI]
FIGURE 1. Block diagram of the statistical speech-modeling detector: preprocessing front end, feature extraction, and statistical module.
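The framing step described in the Methodology section can be illustrated with a minimal sketch. It assumes the 25-kHz sampling rate and the 40-millisecond Hamming windows with 50% overlap stated in the text; the function names `hamming` and `frame_signal` and the synthetic 120-Hz tone are hypothetical, introduced only for illustration, and the MFCC computation itself is not shown.

```python
import math

def hamming(n):
    """Hamming window of length n (standard 0.54 - 0.46*cos form)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def frame_signal(samples, fs_hz, win_ms=40.0, overlap=0.5):
    """Split a signal into overlapping, Hamming-weighted frames.

    Returns the list of frames and the hop size in samples.
    """
    win_len = int(round(fs_hz * win_ms / 1000.0))
    hop = int(round(win_len * (1.0 - overlap)))
    window = hamming(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        segment = samples[start:start + win_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames, hop

# Hypothetical input: 1 second of a 120-Hz tone sampled at 25 kHz.
fs = 25000
signal = [math.sin(2.0 * math.pi * 120.0 * t / fs) for t in range(fs)]
frames, hop = frame_signal(signal, fs)
frame_rate = fs / hop  # frames per second implied by the hop size
```

With these settings each frame holds 1000 samples and the hop is 500 samples, which reproduces the 50 frames/s quoted in the text.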
670 Journal of Voice, Vol. 24, No. 6, 2010
FIGURE 2. Histogram of a single cepstral coefficient and its approximation by means of a Gaussian mixture (solid line).
[Figure 4 plot: horizontal axis PLI (0 to 1); curves for pathological and normal voices, with the EER point marked]
FIGURE 4. Normalized cumulative false-positive and false-negative plots. The normalization is carried out with a logistic transformation.
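As a rough illustration of how an EER operating point can be located on normalized scores such as those in Figure 4, the sketch below sweeps candidate thresholds and keeps the one where the false-positive and false-negative rates are closest. The scores and the function names are hypothetical, and the procedure actually used by the authors may differ.

```python
def error_rates(normal_scores, patho_scores, threshold):
    """Return (false_positive_rate, false_negative_rate).

    A recording is called pathological when its score is >= threshold.
    """
    fp = sum(1 for s in normal_scores if s >= threshold)
    fn = sum(1 for s in patho_scores if s < threshold)
    return fp / len(normal_scores), fn / len(patho_scores)

def eer_threshold(normal_scores, patho_scores):
    """Pick the observed score where |FPR - FNR| is smallest."""
    def gap(t):
        fpr, fnr = error_rates(normal_scores, patho_scores, t)
        return abs(fpr - fnr)

    candidates = sorted(set(normal_scores) | set(patho_scores))
    best = min(candidates, key=gap)
    fpr, fnr = error_rates(normal_scores, patho_scores, best)
    return best, (fpr + fnr) / 2.0

# Hypothetical normalized scores in (0, 1).
normal = [0.05, 0.10, 0.20, 0.35, 0.60]
patho = [0.30, 0.55, 0.70, 0.85, 0.95]
thr, eer = eer_threshold(normal, patho)
```

On finite data the two error rates rarely cross exactly, so the sketch reports the average of the two rates at the closest point.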
\[ P(x \mid \Theta) = \sum_{i=1}^{Q} c_i \, p_i(x), \qquad \sum_{i=1}^{Q} c_i = 1, \quad c_i \geq 0 \tag{1} \]
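Equation 1 can be evaluated numerically as in the following sketch. It assumes diagonal covariance matrices for the Gaussian components, a simplification made here for brevity (the covariance structure is not stated in this section), and it works in the log domain with the log-sum-exp trick for numerical stability. The function names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log P(x | model) = log sum_i c_i p_i(x), computed via log-sum-exp."""
    if abs(sum(weights) - 1.0) > 1e-9 or any(c <= 0.0 for c in weights):
        raise ValueError("weights must be positive and sum to 1")
    terms = [
        math.log(c) + log_gaussian_diag(x, m, v)
        for c, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-component model over 2-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [2.0, 2.0]]
frame_loglik = gmm_log_likelihood([0.5, -0.2], weights, means, variances)
```

Evaluating the same frame under a normal-voice model and a pathological-voice model yields the two per-frame log-likelihoods that the likelihood ratio compares.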
filter central frequency. The higher the frequency, the wider the bandwidth is.

Temporal derivatives. The MFCC features have been extended to include the first temporal derivatives among the neighboring frames. The first temporal derivatives (Δ) are good estimators of the speed of variation, giving relevant information about the dynamics and short-time variability. These features have been considered significant because, in the presence of voice disorders, a lower degree of stationarity might be expected (ie, larger temporal variation of the MFCC vector of parameters).35 Another reason to complement the feature vectors with the Δ is that the generative approaches used to model the normal and pathological classes (the GMMs) do not consider any temporal dependence by themselves. The calculation of Δ is achieved by means of antisymmetric finite impulse response (FIR) filters to avoid phase distortion of the temporal sequence.36

Statistical modeling
The motivation to use GMM is attributed to its ability to represent a great amount of distributions.22,23 Each class (normal or pathological) is modeled as a random process whose probability density function (pdf) is characterized by a mixture of Q Gaussians (Figure 2). Let x ∈ ℝ^N be an N-dimensional random vector with an arbitrary distribution representing the parameters extracted from each frame. Therefore, let us suppose that x is the evidence measured from the speech, Θn is the hypothesis that the speech sample corresponds to a normal voice, and Θp is the hypothesis that the speech sample belongs to a pathological voice. Thus, for each frame, both the likelihood to be normal, p(x|Θn), and the likelihood to be pathological, p(x|Θp), could be obtained as a result of the score assigned to each model modelling the distribution density of x with a Gauss-

A new index, log LR, is obtained by subtracting the log-likelihood (likelihood in the logarithmic domain) to be normal from the log-likelihood to be pathological. This is represented in Equation 3:

\[ LR = \frac{P(x \mid \Theta_p)}{P(x \mid \Theta_n)} \;\Rightarrow\; \log LR = \log\big(P(x \mid \Theta_p)\big) - \log\big(P(x \mid \Theta_n)\big) \tag{3} \]

If the a priori probability of a speech sample to be normal, P(Θn), or pathological, P(Θp), were known, the LR could be calculated including this a priori knowledge as in Equation 4. This a priori knowledge would represent the expertise of the person evaluating the voice.

\[ LR = \frac{P(\Theta_p \mid x)}{P(\Theta_n \mid x)} = \frac{P(\Theta_p) \cdot P(x \mid \Theta_p)}{P(\Theta_n) \cdot P(x \mid \Theta_n)} \tag{4} \]

In this work, the a priori probability to be normal or pathological was supposed to be equal (ie, no a priori information, or equal probability to be normal/pathological), although in practice, these probabilities could be adjusted according to the primary judgment given by the evaluator, and/or combined with epidemiological data about the prevalence of speech disorders in the population.
For the sake of simplicity, and to make its computation easier, Equation 4 can also be expressed in the logarithmic domain. Thus, the final score, z, assigned to the whole utterance is computed assuming independence between observations by adding the log-likelihoods obtained for each frame over a patient’s utterance and dividing by the number of frames evaluated.

a Strictly speaking, the LR test is only optimal when the likelihood functions are exactly known. In practice, this is not the case.
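The scoring chain described in this section (per-frame log-likelihood difference, averaging over the frames of an utterance, logistic normalization into (0,1)) can be sketched as follows. The per-frame log-likelihoods are hypothetical numbers, and the `slope` and `offset` parameters of the logistic function are assumptions added for illustration, since the exact form of the logistic transformation is not given in this section.

```python
import math

def utterance_score(frame_loglik_patho, frame_loglik_normal):
    """Average over frames of log P(x|patho) - log P(x|normal)."""
    if len(frame_loglik_patho) != len(frame_loglik_normal):
        raise ValueError("both models must score the same frames")
    diffs = [lp - ln for lp, ln in zip(frame_loglik_patho, frame_loglik_normal)]
    return sum(diffs) / len(diffs)

def logistic(z, slope=1.0, offset=0.0):
    """Map a real-valued score into (0, 1).

    slope and offset are hypothetical tuning parameters; the branch on the
    sign of t avoids overflow in math.exp for large negative scores.
    """
    t = slope * z + offset
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

# Hypothetical per-frame log-likelihoods for a three-frame utterance.
ll_patho = [-10.0, -12.0, -11.0]
ll_normal = [-12.0, -13.0, -14.0]
z = utterance_score(ll_patho, ll_normal)
pli = logistic(z)
```

A score z of 0 maps to 0.5, so with equal priors values above 0.5 lean toward the pathological hypothesis and values below 0.5 toward the normal one.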
TABLE 3.
Rank Correlation Coefficients of Different Parameters Extracted From Pathological (n = 173) and Normal Voices (n = 53)
Frequency Perturbation Parameters Amplitude Perturbation Parameters Noise Features
Parameter Jitta Jitter RAP PPQ sPPQ ShdB Shim sAPQ HNR CHNR NNE GNE PLI
Pathological
Jitta 1.00 0.90 0.89 0.88 0.75 0.72 0.73 0.49 0.70 0.72 0.79 0.67 0.60
Jitter – 1.00 0.99 0.99 0.81 0.69 0.69 0.48 0.69 0.56 0.66 0.62 0.59
RAP – – 1.00 0.98 0.80 0.68 0.68 0.47 0.68 0.55 0.66 0.61 0.58
PPQ – – – 1.00 0.81 0.68 0.68 0.49 0.69 0.55 0.64 0.60 0.57
sPPQ – – – – 1.00 0.55 0.55 0.64 0.60 0.47 0.53 0.57 0.47
ShdB – – – – – 1.00 0.99 0.75 0.76 0.86 0.74 0.72 0.39
Shim – – – – – – 1.00 0.74 0.76 0.86 0.75 0.72 0.39
sAPQ – – – – – – – 1.00 0.60 0.63 0.48 0.56 0.24
HNR – – – – – – – – 1.00 0.80 0.60 0.50 0.36
CHNR – – – – – – – – – 1.00 0.78 0.65 0.35
NNE – – – – – – – – – – 1.00 0.78 0.56
GNE – – – – – – – – – – – 1.00 0.51
PLI – – – – – – – – – – – – 1.00
Normal
Jitta 1.00 0.71 0.67 0.67 0.72 0.61 0.61 0.61 0.61 0.44 0.64 0.58 0.09*
Jitter – 1.00 0.99 0.99 0.77 0.40 0.40 0.29 0.22* 0.21* 0.37 0.35 0.007*
RAP – – 1.00 0.99 0.73 0.36 0.34 0.18* 0.17* 0.17* 0.31 0.31 0.013*
PPQ – – – 1.00 0.73 0.40 0.38 0.19* 0.18* 0.20* 0.34 0.32 0.007*
sPPQ – – – – 1.00 0.44 0.43 0.46 0.50 0.36 0.49 0.5 0.02*
ShdB – – – – – 1.00 0.99 0.80 0.50 0.75 0.66 0.52 0.11*
Shim – – – – – – 1.00 0.96 0.48 0.74 0.66 0.55 0.10*
[Figure 7: scatter plots of the PLI (vertical axis) against four noise measurements, with normal and pathological voices marked separately. A. CHNR vs. PLI. B. HNR vs. PLI. C. NNE vs. PLI. D. GNE vs. PLI.]
be found in the works by Michaelis et al (1998) and Hirano et al (1988),32,42 and hence, will not be discussed here. Our goal is to pay attention to the correlation results that correspond to the PLI and the rest of the parameters (last column in Table 3).
For the pathological subset, all correlations were found to be significant (Table 3). In general terms, the new parameter showed lower correlation with the classical parameters than the classical parameters show among themselves. In this sense, PLI showed its highest correlation with the frequency perturbation features (below r = 0.6), but this is still slightly lower than the correlation of the frequency perturbation parameters with the rest of the features studied. Regarding the amplitude perturbation, the PLI showed lower correlation (below r = 0.39) than the other features. With respect to noise, once again, the PLI demonstrated lower correlation (below r = 0.56) than the rest of the parameters. The correlation study means that the new parameter integrates some information about the noise and frequency perturbation, but is less sensitive to amplitude perturbations. These results are logical considering that the noise caused by turbulence increases the energy in the different frequency bands, and this phenomenon is better reflected in the MFCC parameters used to build the model than the disturbances caused by amplitude perturbations. Moreover, those disturbances owing to aperiodicities tend to partially increase the interharmonic energy because of uncertainty about the exact position of the harmonics; hence, this phenomenon is also reflected in the MFCC parameters and explains the existing correlation with the frequency and amplitude perturbation parameters. Nevertheless, the most interesting finding is that the correlation is slightly lower than that found among the classical parameters.
For the normal voices, the correlations are generally lower (below r = 0.26) than those for the pathological group, and are not significant (Table 3). Furthermore, as the voice quality of normal voices is generally regarded to possess many independent degrees of freedom, an important correlation should not be expected for the normal subset.42
Figure 7 depicts the scatter plots between the PLI and four different noise measurements: HNR, CHNR, NNE, and GNE. They are represented to show the normal and pathological clusters. For these four features, the correlation found is significant, but below r = 0.56. The plots in Figure 7 show some correlation, but also a large dispersion. As commented earlier, this means that some of the aspects measured by the noise features are only slightly reflected in PLI, because noise is present in most of the pathological voices.
[Figure 8 plots: horizontal axis false positive rate. Legend of the noise-parameter panel: CHNR AUC = 0.97; HNR AUC = 0.95; VTI AUC = 0.84; NNE AUC = 0.96; GNE AUC = 0.97; PLI AUR = 0.99]
FIGURE 8. ROC plot to show and compare the ability to discriminate between normal and pathological voices. A. PLI and frequency perturbation parameters. B. PLI and amplitude perturbation parameters. C. PLI and noise parameters.
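The area under an ROC curve and its standard error can be approximated without drawing the curve, as in the sketch below. The area is computed as the Mann–Whitney statistic (the fraction of pathological/normal pairs ranked correctly, with ties counted as one half), and the standard error uses the widely quoted Hanley and McNeil approximation; whether the authors used exactly this form is an assumption. The function names and the small score lists are hypothetical.

```python
import math

def auc_mann_whitney(patho_scores, normal_scores):
    """Fraction of (pathological, normal) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in patho_scores:
        for n in normal_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(patho_scores) * len(normal_scores))

def hanley_mcneil_se(auc, n_patho, n_normal):
    """Standard error of the area using the Hanley-McNeil approximation."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    numerator = (
        auc * (1.0 - auc)
        + (n_patho - 1) * (q1 - auc ** 2)
        + (n_normal - 1) * (q2 - auc ** 2)
    )
    return math.sqrt(numerator / (n_patho * n_normal))

# Hypothetical scores, only to exercise the area computation.
area = auc_mann_whitney([0.9, 0.8, 0.4], [0.1, 0.5])

# Class sizes taken from the database description (173 pathological, 53 normal).
se_example = hanley_mcneil_se(0.86, 173, 53)
```

For an area of 0.86 with 173 pathological and 53 normal voices, the approximation gives a standard error of about 0.025.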
TABLE 4.
Screening Accuracy, AUC, and SE of the Area for a Set of Acoustic Parameters
Types Parameters Screening Accuracy (%) AUR SE
Frequency perturbation Absolute jitter (ms) 82.24 0.86 0.025
Relative jitter (%) 73.83 0.80 0.031
RAP (%) 74.30 0.79 0.032
PPQ (%) 73.40 0.79 0.032
sPPQ (%) 75.70 0.82 0.030
FIGURE 9. Scatter plot of the PLI and the overall score assigned according to the GRBAS scale.

Figure 8 illustrates graphically the discrimination ability between normal and pathological voices of the PLI, compared with the frequency perturbation (Figure 8A), amplitude perturbation (Figure 8B), and noise parameters (Figure 8C). The plots show that PLI has better discrimination ability than the other measurements represented. The area under the ROC curve (AUR) corresponding to the PLI parameter numerically supports this statement (Table 4).
Finally, Figure 9 and Table 5 illustrate the correlation between the PLI and the perceptual evaluations. To allow comparisons, Table 5 shows the correlation between several noise features and the GRBAS parameters. Although it is known that the perceptual judgments have a wide intra- and interevaluator variability that could bias the evaluations, the results demonstrate that PLI correlates well with the GRBAS overall score (r = 0.63), with the G (r = 0.65) and with R (r = 0.65) features. As the severity of hoarseness is quantified under the parameter G (grade), which integrates two main components, breathiness (B) and roughness (R), we can conclude that the PLI can be used as a predictor of the severity of hoarseness and the overall perceived quality in the same way as the noise features studied, but with the advantage of better discrimination capabilities for screening. Moreover, Table 5 illustrates that the results found in terms of correlation between PLI and the G and R ratings are very similar to those obtained with the noise parameters.

CONCLUSIONS
As with other classic parameters, such as jitter, shimmer, HNR, NNE, and others, PLI is a long-term feature calculated by averaging short-time measurements. The proposed parameter may

From the definition of the PLI parameter, the results show that voices with a smaller amount of perturbations and noise have a low score (near 0), whereas pathological voices with a greater amount of perturbations have a high score (near 1). Moreover, in view of the ROC plots and their AUR, this new parameter alone has demonstrated better screening accuracy than the most important parameters reported in the literature.
In general terms, the new parameter showed lower correlation with the classical parameters than the classical parameters show among themselves. The PLI shows a medium correlation with the frequency perturbation features and with some of the noise parameters, and low correlation with the amplitude perturbation parameters. Thus, we can say that PLI integrates some information about noise and frequency perturbation, but measures another complementary aspect of the phenomenon (or different voice characteristics) than the “classical” acoustic measures. The correlations found mean that this parameter represents a good complement to the “classical” acoustic parameters, especially for the amplitude perturbation parameters, giving an indication of the probability that a voice record is either normal or pathological (ie, the “degree of normality”). Moreover, the proposed parameter is shown to be correlated with the G, R, and overall GRBAS scores; hence, we can conclude that the proposed feature gives an indication of the presence or absence of pathology and an evaluation of the perceived hoarseness.
The PLI is based on a previous computation of the MFCCs and their first derivatives. The main advantage of the MFCCs is that they are very robust in their calculation, but the drawback is that their individual physical interpretation is not completely clear. Despite this, the output of the statistical model presented might be understood as a quality measurement, giving an estimation of the likelihood that a speech utterance is normal or pathological. Hence, the final index could be easily understood by a speech therapist.
For this work, the a priori probability to be normal or pathological was supposed to be equal but, in practice, these probabilities could be adjusted according to the primary judgment given by the evaluator, and/or combined with epidemiological data about the prevalence of speech disorders in the population. Including this a priori information, the efficiency of the proposed methodology might be improved.
TABLE 5.
Rank Correlation Coefficients (r) of PLI and Noise Parameters With Respect to the GRBAS Perceptual Scale
Parameters G R B A S GRBAS
PLI 0.65 0.65 0.38 0.51 0.24 0.63
CHNR 0.65 0.67 0.41 0.53 0.06* 0.60
NNE 0.63 0.75 0.38 0.52 0.14* 0.56
GNE 0.58 0.65 0.45 0.37 0.15* 0.54
* Insignificant correlations. Significant correlations are with P < 0.05 (95% confidence interval).
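The rank correlations reported in Tables 3 and 5 are Spearman coefficients, which can be computed as the Pearson correlation of the ranks. The following sketch uses average ranks for ties, which matters for integer-valued perceptual scores such as GRBAS. The function names and the sample PLI and GRBAS values are hypothetical, and the significance test is not shown.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Hypothetical PLI values and overall GRBAS scores for five recordings.
pli_values = [0.10, 0.40, 0.35, 0.80, 0.90]
grbas_scores = [0, 3, 2, 7, 7]
rho = spearman_rho(pli_values, grbas_scores)
```

Because only ranks enter the computation, any monotonic rescaling of the PLI (for instance a different logistic slope) leaves the coefficient unchanged.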
In conclusion, PLI seems to be a promising feature for the screening and assessment of voice quality, showing better discrimination ability than other noise or perturbation measurements found in the literature, proving to be a good predictor of the perceived hoarseness, and emerging as a clear complement to improve the multidimensional analysis based on the classical acoustic parameters.

Acknowledgments
This research was carried out under grant TEC2006-12887-C02 from the Ministry of Education of Spain. The authors would like to thank Janaína Mendes-Laureano for her support in the perceptual evaluation of the voice samples.

REFERENCES
1. Smits I, Ceuppens P, De Bodt M. A comparative study of acoustic voice measurements by means of Dr. Speech and Computerized Speech Lab. J Voice. 2005;19:187–196.
2. Baken RJ, Orlikoff R. Clinical Measurement of Speech and Voice. (2nd ed). San Diego, CA: Singular Publishing Group; 2000.
3. Feijoo S, Hernández-Espinosa C. Short-term stability measures for the evaluation of vocal quality. J Speech Hear Res. 1990;33:324–334.
4. Klingholtz F, Martin F. The measurement of the signal-to-noise ratio (SNR) in continuous speech. Speech Commun. 1987;6:15–26.
5. Yumoto E, Gould WJ, Baer T. Harmonics-to-noise ratio as an index of the degree of hoarseness. J Acoust Soc Am. 1982;71:1544–1550.
6. Qi Y, Weinberg B, Bi N, Hess W. Minimizing the effect of period determination on the computation of amplitude perturbation of voice. J Acoust Soc Am. 1995;97:2525–2532.
7. Qi Y, Hillman RE. Temporal and spectral estimations of harmonics-to-noise ratio in human voice signals. J Acoust Soc Am. 1997;102:537–543.
8. de Krom G. A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J Speech Hear Res. 1993;36:254–266.
9. Kasuya H, Ogawa S, Mashima K, Ebihara S. Normalized noise energy as an acoustic measure to evaluate pathologic voice. J Acoust Soc Am. 1986;80:1329–1334.
10. Deliyski D. Acoustic model and evaluation of pathological voice production. In: Proceedings of Eurospeech ’93, Vol. 3. 1993:1969–1972. Berlin, Germany.
11. Prosek RA, Montgomery AA, Walden BE, Hawkins DB. An evaluation of residue features as correlates of voice disorders. J Commun Disord. 1987;20:105–117.
12. Michaelis D, Gramss T, Strube HW. Glottal-to-noise excitation ratio – a new measure for describing pathological voices. Acustica/Acta Acustica. 1997;83:700–706.
13. Winholtz W. Vocal tremor analysis with the vocal demodulator. J Speech Hear Res. 1992;35:562–563.
14. Boyanov B, Hadjitodorov S. Acoustic analysis of pathological voices. A voice analysis system for the screening of laryngeal diseases. IEEE Eng Med Biol Mag. 1997;16:74–82.
15. Yumoto E, Sasaki Y, Okamura H. Harmonics-to-noise ratio and psychophysical measurement of the degree of hoarseness. J Speech Hear Res. 1984;27:2–6.
16. Hadjitodorov S, Boyanov B, Teston B. Laryngeal pathology detection by means of class-specific neural maps. IEEE Trans Inf Technol Biomed. 2000;4:68–73.
17. Parsa V, Jamieson DG. Identification of pathological voices using glottal noise measures. J Speech Lang Hear Res. 2000;43:469–485.
18. Hadjitodorov S, Mitev P. A computer system for acoustic analysis of pathological voices and laryngeal disease screening. Med Eng Phys. 2002;24:419–429.
19. Ritchings RT, McGillion MA, Moore CJ. Pathological voice quality assessment using artificial neural networks. Med Eng Phys. 2002;24:561–564.
20. Godino-Llorente JI, Gómez-Vilda P, Blanco-Velasco M. Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. IEEE Trans Biomed Eng. 2006;53:1943–1953.
21. Deller JR, Proakis JG, Hansen JHL. Discrete-Time Processing of Speech Signals. New York: Macmillan Series for Prentice Hall; 1993.
22. Reynolds DA, Rose RC. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans Speech Audio Processing. 1995;3:72–83.
23. Reynolds DA. Speaker identification using Gaussian mixture speaker models. Speech Commun. 1995;17:91–108.
24. Weiss NA. Introductory Statistics. (6th ed). Reading, MA: Addison Wesley; 2000.
25. Manfredi C, D’Aniello M, Bruscaglioni P, Ismaelli A. A comparative analysis of fundamental frequency estimation methods with application to pathological voices. Med Eng Phys. 2000;22:135–147.
26. Martin AF, Doddington GR, Kamm T, Ordowski M, Przybocki MA. The DET curve in assessment of detection task performance. In: Proceedings of Eurospeech ’97, Vol. IV. 1997:1895–1898. Rhodes, Crete.
27. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristics curves derived from the same cases. Radiology. 1983;148:839–843.
28. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
29. Kay Elemetrics Corp. Disordered Voice Database. Version 1.03. Lincoln Park, NJ: Kay Elemetrics Corp; 1994 [CD-ROM].
30. Hirano M. Psycho-Acoustic Evaluation of Voice. New York: Springer-Verlag; 1981.
31. Dejonkere PH, Remacle M, Fresnel-Elbaz E, Woisard W, Crevier-Euchman L, Millet B. Differentiated perceptual evaluation of pathological voice quality: reliability and correlations with acoustic measurements. Rev Laringol Otol Rhinol. 1996;117:219–224.
32. Hirano M, Hibi S, Yoshida T, Hirade Y, Kasuya H, Kikuchi Y. Acoustic analysis of pathological voice. Acta Otolaringol. 1988;105(2):432–438.
33. Revis J, Giovanni A, Wuyts F. Comparison of different types of vowel fragments for the evaluation of voice quality. In: Proceedings of Voicedata’98; 80–85.
34. Bou-Ghazale SE, Hansen JHL. A comparative study of traditional and newly proposed features for recognition of speech under stress. IEEE Trans Speech Audio Processing. 2000;8:429–442.
35. Childers DG, Sung-Bae K. Detection of laryngeal function using speech and electroglottographic data. IEEE Trans Biomed Eng. 1992;39:19–25.
36. Oppenheim AV, Schafer RW, Buck JR. Discrete-Time Signal Processing. (2nd ed). New Jersey: Prentice Hall; 1999.
37. Schalkoff RJ. Pattern Recognition: Statistical, Structural and Neural Approaches. New York: John Wiley & Sons; 1991.
38. Moon TK. The expectation-maximization algorithm. IEEE Signal Processing Mag. 1996;13(6):47–60.
39. Tax DMJ, van Breukelen M, Duin RPW, Kittler J. Combining multiple classifiers by averaging or by multiplying. Pattern Recognit. 2000;33:1475–1485.
40. Kittler J, Hatef M, Duin RPW, Matas J. On combining classifiers. IEEE Trans Pattern Anal Mach Intell. 1998;20:226–239.
41. Bishop CM. Neural Networks for Pattern Recognition. (2nd ed). Oxford, UK: Oxford University Press; 1995.
42. Michaelis D, Fröhlich M, Strube HW. Selection and combination of acoustic features for the description of pathologic voices. J Acoust Soc Am. 1998;103:1628–1639.
43. Fröhlich M, Michaelis D, Strube HW. Acoustic “breathiness measures” in the description of pathological voices. In: Proceedings of ICASSP’98; 937–940. Seattle, WA, USA.
44. Kasuya H, Ogawa S, Kikuchi Y, Ebihara S. An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology. Speech Commun. 1986;5:171–181.