The IMA Volumes in Mathematics
and its Applications
Volume 138
Series Editors
Douglas N. Arnold   Fadil Santosa
IMA ANNUAL PROGRAMS
Mathematical Foundations
of Speech and
Language Processing
With 56 Illustrations
Springer
Mark Johnson
Dept. of Cognitive and Linguistic Studies
Brown University
Providence, RI 02912
USA

Sanjeev P. Khudanpur
Dept. of ECE and Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218
USA

Mari Ostendorf
Dept. of Electrical Engineering
University of Washington
Seattle, WA 98195
USA
Mathematics Subject Classification (2000): 68T10, 68T50, 94A99, 68-06, 94A12, 94A40,
60J22, 60J20, 68U99, 94-06
springeronline.com
FOREWORD
Series Editors
Douglas N. Arnold, Director of the IMA
Fadil Santosa, Deputy Director of the IMA
PREFACE
criteria) towards full Bayesian modeling in both the acoustic and linguistic
domains.
The role of mathematics and statistics in speech and language tech-
nologies cannot be overestimated. The rate at which we continue to ac-
cumulate speech and language training data is far greater than the rate
at which our understanding of the speech and language phenomena grows.
As a result, the relative advantage of data driven techniques continues to
grow with time, and with it, the importance of mathematical and statistical
methods that make use of such data.
In this volume, we have compiled papers representing some original
contributions presented by participants during the two workshops. More
information about the various workshop presentations and discussions can
be found online, at http://www.ima.umn.edu/multimedia/. In this vol-
ume, chapters are organized starting with four contributions related to
language processing, moving from more general work to specific advances
in structure and topic representations in language modeling. The fifth pa-
per on prosody modeling provides a nice transition, since prosody can be
seen as an important link between acoustic and language modeling. The
next five papers relate primarily to acoustic modeling, starting with work
that is motivated by speech production models and acoustic-phonetic stud-
ies, and then moving toward more general work on new models. The book
concludes with two contributions from the statistics community that we
believe will impact speech and language processing in the future.
Finally, we would like to express our gratitude to the National Science
Foundation for making these workshops possible via its funding of the IMA
and its activities; to the IMA's staff for so ably organizing and adminis-
tering the workshops and related events; and to all the participants for
contributing to the success of these workshops in particular and "the Year
of Mathematics in Multimedia" in general.
Mark Johnson
Department of Cognitive and Linguistic Sciences
Brown University
Sanjeev P. Khudanpur
Department of Electrical and Computer Engineering and Department of
Computer Science
Johns Hopkins University
Mari Ostendorf
Signal and Image Processing
University of Washington
Roni Rosenfeld
School of Computer Science
Carnegie Mellon University
CONTENTS
Foreword .......................................................... v
Preface ............................................................ vii

Probability and statistics in computational linguistics,
a brief review ..................................................... 1
      Stuart Geman and Mark Johnson
¹A rigorous definition requires a proper introduction to Turing machines. Again, we recommend [Hopcroft and Ullman, 1979].
[Parse trees ψ_1 and ψ_2:
    ψ_1 = [S [NP rice] [VP grows]]
    ψ_2 = [S [NP wheat] [VP grows]] ]
[Parse tree 1 for "I saw the man with the telescope" (PP attached to the VP):
    [S [NP I] [VP [V saw] [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N telescope]]]]] ]
[Parse tree 2 for "I saw the man with the telescope" (PP attached to the object NP):
    [S [NP I] [VP [V saw] [NP [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N telescope]]]]]] ]
start state, and the double circle indicates that the state labelled A is a
final state. The parse tree for 'aaba$' with respect to G3 translates immediately
into the sequence of states that the machine transitions through
when accepting 'aaba'.
[Parse tree for 'aaba$' under G3: S → a S, S → a S, S → b A, A → a A, A → $.]
S → a S   p = .80
S → b S   p = .01
S → b A   p = .19
A → b A   p = .90
A → b     p = .10 .
The language is the set of strings ending with a sequence of at least two
b's. The grammar is ambiguous: in general, a sequence of terminal states
does not uniquely identify a sequence of productions. The sentence aabbbb
has three parses (determined by the placement of the production S → b A),
but the most likely parse, by far, is S → a S, S → a S, S → b A, A → b A,
A → b A, A → b (P = .8 · .8 · .19 · .9 · .9 · .1), which has a posterior probability
of nearly .99. The corresponding parse tree is shown below.
[Parse tree for 'aabbbb' (productions S → a S, S → a S, S → b A, A → b A, A → b A, A → b), together with the corresponding state diagram over the states S, A, and the final state F, with outputs a, b.]
where the first row and column represent S, the next represent A, and the
last represent F; and the output probabilities are based on state-to-state
pairs,
(3)
(S, S) → { a: prob = 80/81,  b: prob = 1/81,  otherwise: prob = 0 }
(S, A) → { b: prob = 1 }
(A, F) → { a: prob = 0,  b: prob = 1 } .
If a nonterminal A does not appear in the sample, then the numerator
and denominator are zero, and p(A → α), α ∈ (N ∪ T)+, can be assigned
arbitrarily, provided it is consistent with (1).
The problem of estimating p from sentences ("unsupervised learning")
is more interesting, and more important for applications. Recall that Y(ψ)
is the yield of ψ, i.e. the sequence of terminals in ψ. Given a sentence w ∈
T+, let Ψ_w be the set of parses which yield w: Ψ_w = {ψ ∈ Ψ : Y(ψ) = w}.
The likelihood of a sentence w ∈ T+ is the sum of the likelihoods of its
possible parses:
Imagine now a sequence ψ_1, ..., ψ_n, iid according to P, for which only the
corresponding yields, w_i = Y(ψ_i), 1 ≤ i ≤ n, are observed. The likelihood
function is

(4)   L = L(p; w_1, ..., w_n) = ∏_{i=1}^{n} ∑_{ψ ∈ Ψ_{w_i}} ∏_{A→α ∈ R} p(A→α)^{f(A→α; ψ)} .
(5)
(6)
(7)
more-or-less the same as the issue of computing E_p[f(A → α; ψ) | ψ ∈ Ψ_w]
that came up in our discussion of inference. The basic structure and cost
of the computational algorithm is the same for each of the four problems:
compute the probability of w, compute the set of parses, compute the
best parse, compute E_p. For regular grammars, there is a simple dynamic
programming solution to each of these problems, and in each case the
complexity is of the order n · |R|, where n is the length of w, and |R| is the
number of productions in G.
Consider the representative problem of producing the most likely
parse, (8). Let w = (b_1, ..., b_n) ∈ T^n. There are n − 1 productions of
the form A_k → b_{k+1} A_{k+1} for k = 0, ..., n − 2, with A_0 = S, followed by
a single terminating production A_{n−1} → b_n. The most likely sequence of
productions can be computed by a dynamic-programming type iteration:
for every A ∈ N initialize with

A_1(A) = S
V_1(A) = p(S → b_1 A) .
Finally, let
Consider the most likely sequence of productions from S at "time 0" to
A at "time k," given b_1, ..., b_k, k = 1, ..., n − 1. A_k(A) is the state at time
k − 1 along this sequence, and V_k(A) is the likelihood of this sequence.
Therefore, Â_{n−1} ≜ A_n is the state at time n − 1 associated with the
most likely parse, and working backwards, the best state sequence overall
is Â_0, Â_1, ..., Â_{n−1}, where
There can be ties when more than one sequence achieves the optimum.
In fact, the procedure generalizes easily to produce the best l parses, for
any l > 1. Another modification produces all parses, while still another
computes expectations E_p of the kind that appear in the EM iteration (7)
or probabilities such as P{Y(ψ) = w} (these last two are, essentially, just
a matter of replacing argmax by summation).
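This iteration is easy to state concretely. The following is a minimal Python sketch (not from the text; the dictionaries encode the example probabilistic regular grammar given earlier, and all function and variable names are illustrative):

```python
# Most likely parse (Viterbi) for a probabilistic regular grammar.
# Productions are either A -> b B (read terminal b, move to nonterminal B)
# or a terminating production A -> b; probabilities follow the example above.
binary = {            # p(A -> b B)
    ("S", "a", "S"): 0.80,
    ("S", "b", "S"): 0.01,
    ("S", "b", "A"): 0.19,
    ("A", "b", "A"): 0.90,
}
final = {("A", "b"): 0.10}    # p(A -> b)

def best_parse(w, start="S"):
    """Return (probability, state sequence A_0 ... A_{n-1}) of the best parse of w."""
    n = len(w)
    V = [dict() for _ in range(n - 1)]     # V[k][B]: best probability of reaching B
    back = [dict() for _ in range(n - 1)]  # after reading w[0..k]; back-pointers
    for k in range(n - 1):
        prev = {start: 1.0} if k == 0 else V[k - 1]
        for (A, b, B), p in binary.items():
            if b == w[k] and A in prev:
                cand = prev[A] * p
                if cand > V[k].get(B, 0.0):
                    V[k][B] = cand
                    back[k][B] = A
    # terminating production for the last symbol
    prob, last = max(((V[n - 2].get(A, 0.0) * p, A)
                      for (A, b), p in final.items() if b == w[-1]),
                     default=(0.0, None))
    states = [last]                        # walk the back-pointers
    for k in range(n - 2, -1, -1):
        states.append(back[k][states[-1]])
    return prob, list(reversed(states))

print(best_parse("aabbbb"))   # prob = .8*.8*.19*.9*.9*.1 ~ 0.00985, states S S S A A A
```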
Every ψ ∈ Ψ defines a sentence w ∈ T+: read the labels off of the terminal
nodes of ψ from left to right. Consistent with the notation of §3.1, we will
write Y(ψ) = w. Conversely, every sentence w ∈ T+ defines a subset of
Ψ, which we denote by Ψ_w, consisting of all ψ with yield w (Y(ψ) = w).
A context-free grammar G defines a subset of Ψ, Ψ_G, whose collection of
yields is the language, L_G, of G. We seek a probability distribution P on
Ψ which concentrates on Ψ_G.
The time-honored approach to probabilistic context-free grammars is
through the production probabilities p : R → [0, 1], with

∑_{α : (A→α) ∈ R} p(A → α) = 1   for every A ∈ N.
S → S S
S → a .

Let p(S → S S) = q and p(S → a) = 1 − q, and let s_h be the total
probability of all trees with depth less than or equal to h. Then s_2 = 1 − q
(corresponding to S → a) and s_3 = (1 − q) + q(1 − q)² (corresponding to
S → a, or S → S S followed by S → a, S → a). In general, s_{h+1} = 1 − q + q s_h²,
(" maximum a post eriad ' or "opt imal" parsing); compute expectations
of t he form E p, (J(A ---+ a ; 'l/J)I'l/J E ww ] th at arise in iterative est ima-
t ion schemes like (7). Th e four comput ations turn out to be more-or-
less t he same, as was the case for regular grammars (§3.1.3), and there
is a common dynamic-programmin g-like solution [Lari and Young , 1990,
Lari and Young , 1991].
We illustrate with the problem of finding the probability of a string
(sentence) w, under a grammar G, and under a probability distribution
P concentrating on Ψ_G. For PCFGs, the dynamic-programming algorithm
involves a recursion over substrings of the string w to be parsed.
If w = w_1 ... w_m is the string to be parsed, then let w_{i,j} = w_i ... w_j be
the substring consisting of terminals i through j, with the convention that
w_{i,i} = w_i. The dynamic-programming algorithm works from smaller to
larger substrings w_{i,j}, calculating the probability that A ⇒* w_{i,j} for each
nonterminal A ∈ N. Because a substring of length 1 can only be generated
by a unary production of the form A → x, for each i = 1, ..., m,
P(A ⇒* w_{i,i}) = p(A → w_i). Now consider a substring w_{i,j} of length 2
or greater. Consider any derivation A ⇒* w_{i,j}. The first production used
must be a binary production of the form A → B C, with A, B, C ∈ N.
That is, there must be a k between i and j such that B ⇒* w_{i,k} and
C ⇒* w_{k+1,j}. Thus the dynamic programming step involves iterating from
smaller to larger substrings w_{i,j}, 1 ≤ i, j ≤ m, calculating:

P(A ⇒* w_{i,j}) = ∑_{(A→B C) ∈ R} ∑_{k=i}^{j−1} p(A → B C) P(B ⇒* w_{i,k}) P(C ⇒* w_{k+1,j}) .

At the end of this iteration, P(w) = P(S ⇒* w_{1,m}). This calculation
involves applying each production once for each triple of "string positions"
0 < i ≤ k < j ≤ m, so the calculation takes O(|R| m³) time.
3.3. Gibbs distributions. There are many ways to generalize. The
coverage of a context-free grammar may be inadequate, and we may hope,
therefore, to find a workable scheme for placing probabilities on context-
sensitive grammars, or perhaps even more general grammars. Or, it may be
preferable to maintain the structure of a context-free grammar, especially
because of its dynamic programming principle, and instead generalize the
class of probability distributions away from those induced (parameterized)
by production probabilities. But nothing comes for free. Most efforts
to generalize run into nearly intractable computational problems when it
comes time to parse or to estimate parameters.
Many computational linguists have experimented with using Gibbs
distributions, popular in statistical physics, to go beyond production-based
probabilities, while nevertheless preserving the basic context-free structure.
We shall take a brief look at this particular formulation, in order to illus-
trate the various challenges that accompany efforts to generalize the more
standard probabilistic grammars.
3.3.1. Probabilities. The sample space is the same: Ψ is the set
of finite trees, rooted at S, with leaf nodes labeled from elements of T
and interior nodes labeled from elements of N. For convenience we will
stick to Chomsky normal form, and we can therefore assume that every
nonterminal node has either two children labeled from N or a single child
labeled from T. Given a particular context-free grammar G, we will be
interested in measures concentrating on the subset Ψ_G of Ψ. The sample
space, then, is effectively Ψ_G rather than Ψ.
Gibbs measures are built from sums of more-or-less simple functions,
known as "potentials" in statistical physics, defined on the sample space.
In linguistics, it is more natural to call these features rather than potentials.
Let us suppose, then, that we have identified M linguistically salient
(11)   P_θ(ψ) = (1/Z) exp{ ∑_{i=1}^{M} θ_i f_i(ψ) }

where θ_1, ..., θ_M are parameters, to be adjusted "by hand" or inferred from
data, θ = (θ_1, ..., θ_M), and where Z = Z(θ) (known as the "partition
function") normalizes so that ∑_{ψ∈Ψ_G} P_θ(ψ) = 1. Evidently, we need to assume or
ensure that ∑_{ψ∈Ψ_G} exp{∑_{i=1}^{M} θ_i f_i(ψ)} < ∞. For instance, we had better
require that θ_1 < 0 if M = 1 and f_1(ψ) = |Y(ψ)| (the number of words in
a sentence), unless of course |Ψ_G| < ∞.
Relation to Probabilistic Context-Free Grammars. The feature
set {f(A → α; ψ)}_{A→α∈R} represents a particularly important special case:
The Gibbs distribution (11) takes on the form
Then P̂ = P when q ≤ .5. In any case, P̂ is Gibbs of the form (12), with
θ_{S→S S} = log_e q, θ_{S→a} = log_e(1 − q), and Z_θ = min(1, (1−q)/q). Accordingly,
P̂ is also a PCFG with production probabilities

p̂(S → S S) = q · [min(1, (1−q)/q) · min(1, (1−q)/q)] / min(1, (1−q)/q) = q min(1, (1−q)/q)

and

p̂(S → a) = (1 − q) · 1 / min(1, (1−q)/q) .
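The recursion s_{h+1} = 1 − q + q s_h² introduced earlier converges to exactly this normalizer; a quick numerical check (illustrative code, not from the text) shows that the total probability of finite trees tends to min(1, (1−q)/q), so probability mass escapes to infinite trees when q > 1/2:

```python
def total_tree_probability(q, iterations=2000):
    """Iterate s_{h+1} = 1 - q + q * s_h**2 starting from s_2 = 1 - q."""
    s = 1.0 - q
    for _ in range(iterations):
        s = 1.0 - q + q * s * s
    return s

for q in (0.3, 0.5, 0.7, 0.9):
    limit = min(1.0, (1.0 - q) / q)
    print(f"q = {q}: iterated total = {total_tree_probability(q):.6f}, "
          f"min(1, (1-q)/q) = {limit:.6f}")
```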
3.3.2. Inference. The feature set {f_i}_{i=1,...,M} can accommodate arbitrary
linguistic attributes and constraints, and the Gibbs model (11),
therefore, has great promise as an accurate measure of linguistic fitness.
But the model depends critically on the parameters {θ_i}_{i=1,...,M}, and the
associated estimation problem is, unfortunately, very hard. Indeed, the
problem of unsupervised learning appears to be all but intractable.
Of course if the features are simply production frequencies, then (11)
is just a PCFG, with Z = 1, and the parameters are just log production
probabilities (θ_{A→α} = log_e p(A → α)) and easy to estimate (see §3.2.2).
More generally, let θ = (θ_1, ..., θ_M) and suppose that we observe a sample
ψ_1, ..., ψ_n ∈ Ψ_G ("supervised learning"). Writing Z as Z(θ), to emphasize
the dependency of the normalizing constant on θ, the likelihood function is
which leads to the likelihood equations (by setting ∂ log L / ∂θ_i to 0):
(13)
(14)
asks for too much, or even the wrong thing. It might be more relevant
to maximize the likelihood of the observed parses, given the yields
Y(ψ_1), ..., Y(ψ_n) [Johnson et al., 1999]:

(15)   ∏_{i=1}^{n} P_θ(ψ_i | Y(ψ_i)) .
One way to compare these criteria is to do some (loose) asymptotics.
Let P(ψ) denote the "true" distribution on Ψ_G (from which ψ_1, ..., ψ_n
are presumably drawn, iid), and in each case (14 and 15) compute the
large-sample-size average of the log likelihood:

(16)   (1/n) log ∏_{i=1}^{n} P_θ(ψ_i) = (1/n) ∑_{i=1}^{n} log P_θ(ψ_i) ≈ ∑_{ψ∈Ψ_G} P(ψ) log P_θ(ψ)
REFERENCES
[Abney et al., 1999] Abney, Steven, David McAllester, and Fernando Pereira. 1999. Relating probabilistic grammars and automata. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 542-549, San Francisco. Morgan Kaufmann.
[Abney, 1997] Abney, Steven P. 1997. Stochastic Attribute-Value Grammars. Computational Linguistics, 23(4):597-617.
[Baum, 1972] Baum, L.E. 1972. An inequality and associated maximization techniques in statistical estimation of probabilistic functions of Markov processes. Inequalities, 3:1-8.
[Besag, 1974] Besag, J. 1974. Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B, 36:192-236.
[Besag, 1975] Besag, J. 1975. Statistical analysis of non-lattice data. The Statistician, 24:179-195.
[Charniak, 1996] Charniak, Eugene. 1996. Tree-bank grammars. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1031-1036, Menlo Park. AAAI Press/MIT Press.
[Charniak, 1997] Charniak, Eugene. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, Menlo Park. AAAI Press/MIT Press.
[Chi, 1998] Chi, Zhiyi. 1998. Probability Models for Complex Systems. PhD thesis, Brown University.
[Chi, 1999] Chi, Zhiyi. 1999. Statistical properties of probabilistic context-free grammars. Computational Linguistics, 25(1):131-160.
[Chi and Geman, 1998] Chi, Zhiyi and Stuart Geman. 1998. Estimation of probabilistic context-free grammars. Computational Linguistics, 24(2):299-305.
[Chomsky, 1957] Chomsky, Noam. 1957. Syntactic Structures. Mouton, The Hague.
[Collins, 1996] Collins, M.J. 1996. A new statistical parser based on bigram lexical dependencies. In The Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184-191, San Francisco. The Association for Computational Linguistics, Morgan Kaufmann.
[Culy, 1985] Culy, Christopher. 1985. The complexity of the vocabulary of Bambara. Linguistics and Philosophy, 8(3):345-352.
[Dempster et al., 1977] Dempster, A., N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.
[Eisner and Satta, 1999] Eisner, Jason and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head automaton grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 457-464.
[Fu, 1974] Fu, K.S. 1974. Syntactic Methods in Pattern Recognition. Academic Press.
[Fu, 1982] Fu, K.S. 1982. Syntactic Pattern Recognition and Applications. Prentice-Hall.
[Garey and Johnson, 1979] Garey, Michael R. and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York.
[Gazdar et al., 1985] Gazdar, Gerald, Ewan Klein, Geoffrey Pullum, and Ivan Sag. 1985. Generalized Phrase Structure Grammar. Basil Blackwell, Oxford.
[Grenander, 1976] Grenander, Ulf. 1976. Lectures in Pattern Theory. Volume 1: Pattern Synthesis. Springer, Berlin.
[Harris, 1963] Harris, T.E. 1963. The Theory of Branching Processes. Springer, Berlin.
[Hopcroft and Ullman, 1979] Hopcroft, John E. and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley.
[Jelinek, 1997] Jelinek, Frederick. 1997. Statistical Methods for Speech Recognition. The MIT Press, Cambridge, Massachusetts.
[Johnson et al., 1999] Johnson, Mark, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic "unification-based" grammars. In The Proceedings of the 37th Annual Conference of the Association for Computational Linguistics, pages 535-541, San Francisco. Morgan Kaufmann.
[Kaplan and Bresnan, 1982] Kaplan, Ronald M. and Joan Bresnan. 1982. Lexical-Functional Grammar: A formal system for grammatical representation. In Joan Bresnan, editor, The Mental Representation of Grammatical Relations, Chapter 4, pages 173-281. The MIT Press.
[Kay et al., 1994] Kay, Martin, Jean Mark Gawron, and Peter Norvig. 1994. Verbmobil: a translation system for face-to-face dialog. CSLI Press, Stanford, California.
[Lari and Young, 1990] Lari, K. and S.J. Young. 1990. The estimation of Stochastic Context-Free Grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35-56.
[Lari and Young, 1991] Lari, K. and S.J. Young. 1991. Applications of Stochastic Context-Free Grammars using the Inside-Outside algorithm. Computer Speech and Language, 5:237-257.
[Marcus et al., 1993] Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
[Pollard and Sag, 1987] Pollard, Carl and Ivan A. Sag. 1987. Information-based Syntax and Semantics. Number 13 in CSLI Lecture Notes Series. Chicago University Press, Chicago.
Abstract. In this paper we discuss three issues in modern language modeling. The
first one is the question of a quality measure for language models, the second is language
model smoothing and the third is the question of how to build good long-range language
models. In all three cases some results are given indicating possible directions of further
research.

Key words. Language models, quality measures, perplexity, smoothing, long-range
correlations.
where N_test(w, h) and N_test(h) are frequencies on a test corpus for word w
following a history of words h and P_LM(w|h) is the probability assigned to
that sequence by the language model.
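Written out, the perplexity computation this refers to is just an exponentiated average log probability. A minimal sketch (the toy model and counts are made-up placeholders; `lm_prob` stands in for whatever P_LM(w|h) the language model supplies):

```python
import math

def perplexity(test_counts, lm_prob):
    """Perplexity of a language model on a test corpus.

    test_counts: dict mapping (history h, word w) -> N_test(w, h)
    lm_prob:     function (w, h) -> P_LM(w | h)
    PP = exp( - sum_{h,w} N_test(w, h) * log P_LM(w | h) / N ),  N = total count.
    """
    total = sum(test_counts.values())
    log_sum = sum(n * math.log(lm_prob(w, h))
                  for (h, w), n in test_counts.items())
    return math.exp(-log_sum / total)

# Made-up example: a tiny "corpus" and a toy conditional model.
toy_model = {("the", "cat"): 0.2, ("the", "dog"): 0.1, ("cat", "sat"): 0.3}
counts = {("the", "cat"): 3, ("cat", "sat"): 2}
print(perplexity(counts, lambda w, h: toy_model[(h, w)]))
```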
The possible correlation of word-error-rate (WER, the fraction of
words of a text mis-recognized by a speech recognizer) and perplexity
has been an issue in the literature for quite some time now, but

*Philips GmbH Forschungslaboratorien, Weisshausstr. 2, D-52066 Aachen, Germany
(dietrich.klakow@philips.com).
[Figure: WER (%) versus perplexity on a logarithmic perplexity axis, with a power-law fit.]
FIG. 1. Correlation of WER and perplexity tested on data from the DARPA 1994
evaluation (spoke 4). Only a small fraction of the error bars are shown, to keep the
power law fit visible.
2000 words) with special topics (like "Jackie Kennedy") were provided.
The task was to optimize and adapt language models to the particular
topics.
The results of our experiments are given in Fig. 1. Each point cor-
responds to one of the 450 language models. Only a small fraction of the
error bars are shown to keep the power law fit visible. We observe that
the power law fit nicely runs through the error-bars. The optimal fit pa-
rameters in (2) are a = 0.270 ± 0.002 and b = 6.0 ± 0.01. Those are not
universal values but they depend on the corpus!
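Since equation (2) is not reproduced in this excerpt, the exact parameterization below is an assumption: a power law WER ≈ b · PP^a, fit by least squares in log-log space. The data in the sketch are synthetic placeholders, not the DARPA evaluation results:

```python
import numpy as np

def fit_power_law(perplexity, wer):
    """Fit WER = b * PP**a by linear regression on log WER vs. log PP."""
    log_pp, log_wer = np.log(perplexity), np.log(wer)
    a, log_b = np.polyfit(log_pp, log_wer, deg=1)   # slope, intercept
    return a, np.exp(log_b)

# Made-up (perplexity, WER) pairs standing in for the 450 language models.
pp = np.array([300.0, 500.0, 800.0, 1200.0, 2000.0])
wer = 6.0 * pp ** 0.27 * np.random.default_rng(0).normal(1.0, 0.01, pp.size)
a, b = fit_power_law(pp, wer)
print(f"a = {a:.3f}, b = {b:.2f}")    # close to the generating a = 0.27, b = 6.0
```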
The correlation coefficient is given in Tab. 1. In addition we now also
show results for the 1996 and 1997 DARPA Hub4-evaluation data. For a
perfect correlation, the correlation coefficient r should be one. We observe
that r = 1 is always within the error-bars and hence we have no indication
that the power-law relation (2) is not valid. Please note that the fact that
both values (and all correlation-coefficient values given in the literature)
are smaller than one is not the result of a systematic deviation but a fact
coming from the definition of the correlation coefficient.
In summary: we have observed no indication that WER and perplexity
are not perfectly correlated. However, only two data-sets have been
investigated so far. We will perform the same analysis on all future speech recog-
TABLE 1
Measured correlation coefficients r and their error bars.

Data                         | Correlation r
Hub4: 96 + 97                | 0.978 ± 0.073
DARPA Eval 1994: "Kennedy"   | 0.993 ± 0.048
nition tasks we are going to work on, but collecting data for a huge number
of really different language models is a time-consuming endeavor. We would
like to invite others to join the undertaking.
3. Smoothing of language models. Smoothing of language models
attracted much attention for a very long time. However, for backing-off
language models the discussion has calmed down during the last few years
as most people started to think that there is very little room for further
improvement.
A well established method is absolute discounting with marginal
backing-off [12]. It is defined by a very simple structure:

p(w|h_N) = { (Count(h_N, w) − d)/Count(h_N) + α(h_N) · β(w|h_{N−1})   if Count(h_N, w) > 0,
           { α(h_N) · β(w|h_{N−1})                                    if Count(h_N, w) = 0,

with the discounting parameter d (0 ≤ d ≤ 1) and the dedicated backing-off
distribution β(w|h), which is normalized: ∑_w β(w|h) = 1.
Absolute discounting refers to the fact that d is subtracted from the
observed counts. Marginal backing-off means that special backing-off distributions
are used rather than smoothed relative frequencies. How to calculate
the optimal backing-off distributions was described by Kneser and
Ney [12].
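A minimal sketch of this backing-off structure in Python, assuming the counts and the backing-off distribution β are supplied as plain data (the Kneser-Ney optimization of β itself is not shown, and all names are illustrative):

```python
def backoff_prob(w, h, counts, history_totals, beta, d=0.5):
    """Absolute discounting with a dedicated backing-off distribution.

    counts:         dict (h, w) -> Count(h, w)
    history_totals: dict h -> Count(h)
    beta:           function (w, shorter_history) -> beta(w | h_{N-1}), normalized
    d:              discounting parameter, 0 <= d <= 1
    """
    shorter = h[1:]                       # drop the oldest word of the history
    c = counts.get((h, w), 0)
    # alpha(h) redistributes exactly the probability mass removed by discounting
    n_types = sum(1 for (hh, _) in counts if hh == h)
    alpha = d * n_types / history_totals[h]
    if c > 0:
        return (c - d) / history_totals[h] + alpha * beta(w, shorter)
    return alpha * beta(w, shorter)

# Toy usage with a uniform backing-off distribution over three words.
counts = {(("the",), "cat"): 4, (("the",), "dog"): 1}
totals = {("the",): 5}
uniform_beta = lambda w, h: 1.0 / 3
print(backoff_prob("cat", ("the",), counts, totals, uniform_beta))
```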
Can we do better smoothing? To answer this question we want to
first turn to a very basic observation: Zipf's law [13, 14]. To observe this
behavior on a corpus, the frequency of the words in the corpus is counted,
this list is sorted by frequency and then for each word the position in the
list (the rank) is plotted versus the frequency on a doubly logarithmic scale.
The result is shown in Fig. 2. We looked at two different "texts". One is
the novel "Crime and Punishment" by the Russian novelist Dostoevsky. The
other is the Philips research speech recognizer, a typical piece of C-code.
The observation is that in both cases the statistics are nearly a power law,
even though the exponents differ.
This was observed for the first time by Zipf and since then several
models have been developed, the best known being that of Mandelbrot [15]. He
used the model of a "string of letters" chopped randomly into pieces. This
behavior is very general and can be observed very often in nature. Some
examples of systems with power law distributions:
[Figure 2: Zipf's law. Relative frequency versus rank on a doubly logarithmic scale for two "texts": a C program (the Philips speech recognizer) and the Russian novel "Crime and Punishment", each with a power-law fit.]
(3)   f(w) = Count(w) / TotalCount ≈ μ / (c + r(w))^B

where r(w) is the rank of the word in the sorted list and f(w) the corresponding
frequency. This function has two parameters: B and c. Here μ
serves as a scaling but can also be used to normalize the distribution. For
all the experiments described above only these parameters vary.
We can and should use this observation to create a new smoothed LM
type. There is no need to actually estimate probabilities or frequencies.
All we have to do is to estimate the three parameters and the rank of the
words. This has to be compared with the task of estimating probabilities
for about a hundred thousand words. Hence we have very much simplified
the estimation problem.
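One way to realize such a rank-based model, assuming the parameterization f(w) ≈ μ/(c + r(w))^B from (3) with μ used for normalization, is sketched below (parameter values and names are placeholders; in practice B and c would be fit to the training data):

```python
from collections import Counter

def rank_based_unigram(corpus_tokens, B=1.0, c=2.7):
    """Unigram model that stores only word ranks plus the parameters mu, c, B.

    Words are ranked by training frequency; probabilities follow
    p(w) = mu / (c + r(w))**B with mu chosen so the model normalizes.
    """
    freq = Counter(corpus_tokens)
    ranked = [w for w, _ in freq.most_common()]        # rank 1 = most frequent
    rank = {w: r for r, w in enumerate(ranked, start=1)}
    mu = 1.0 / sum(1.0 / (c + r) ** B for r in rank.values())
    return lambda w: mu / (c + rank[w]) ** B

tokens = "the cat sat on the mat the cat".split()
p = rank_based_unigram(tokens)
print(p("the"), p("mat"))
```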
To actually do language modeling we proceed as follows:
• Estimate the rank of each word: All words are sorted ac-
cording to their frequency in the training corpus. In case of equal
[Figure 3: the correlation C_d(w) as a function of the distance d for the four example words AND, SEVEN, PRESIDENT, and HE.]
(4)   C_d(w) = P_d(w, w) / p(w)²

is given for four example words. Here, P_d(w, w) is the probability that the
word w occurs now and is observed again after skipping d words and p(w)
is the unigram distribution. We have the obvious property C_d(w) → 1 for d → ∞.
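The correlation (4) can be estimated directly from a token sequence; the sketch below is an illustrative reading of the definition (names are not from the text, and the convention that "skipping d words" means an offset of d + 1 positions is an assumption):

```python
def distance_correlation(tokens, word, d):
    """Estimate C_d(word) = P_d(word, word) / p(word)**2 from a token list.

    P_d(w, w) is the probability that w occurs at position t and again at
    position t + d + 1 (i.e. after skipping d words); p(w) is the unigram
    relative frequency.
    """
    n = len(tokens)
    p_w = tokens.count(word) / n
    pairs = sum(1 for t in range(n - d - 1)
                if tokens[t] == word and tokens[t + d + 1] == word)
    p_d = pairs / (n - d - 1)
    return p_d / (p_w ** 2)
```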
For four different words Fig. 3 shows that it is indeed the case that after
skipping about a thousand words in between the value 1 is reached.
However each word has its individual pattern as to how it approaches this
limit. A short function word like "and" shows at short distances a strong
anti-correlation and then approaches this limit rapidly. "President" shows
a broad bump stemming from purely semantic relations within one newspaper
article. The other two examples show mixed behavior. The very short
range positive correlation for "seven" comes from patterns like 7x7 where x
is another digit, which relates to references to Boeing airplanes. In general,
described in the previous section but the test data is a closed vocabulary
task again from Wall Street Journal with a vocabulary of 5000 words. We
observe a steady improvement when increasing the context. The bigram
and trigram are traditional backing-off models. The other models follow
the formula (7); only the upper bounds of the products and the range of the
base 5-gram may be adjusted to the effective language model range. The
10-gram has a perplexity 30% lower than the trigram. The corresponding
speech recognition experiments are described in [22].
TABLE 3
Perplexities for an increasing language model range.

LM-Range |   2   |  3   |  4   |  6   |  10
PP       | 112.7 | 60.4 | 50.4 | 45.4 | 43.3
REFERENCES
[1] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, 1997.
[2] H. Sawaf, K. Schütz, and H. Ney, On the Use of Grammar Based Language Models for Statistical Machine Translation, Proc. 6th Intl. Workshop on Parsing Technologies, 2000, pp. 231.
[3] J. Ponte and W. Croft, A Language Modeling Approach to Information Retrieval, Research and Development in Information Retrieval, 1998, pp. 275.
*Center for Language and Speech Processing, Johns Hopkins University, 3400 N.
Charles St., Baltimore, MD 21218.
¹Additional results related to the original SLM formulation can be found in the
following references: [11-19].
³The actions adjoin right* and adjoin left* are necessary to assure that the trigram
language model be a special case of the SSLM. This case will result from a degenerate
choice of the constructor statistics: Q(null | h_{-2} = v, h_{-1} = v') = 1 for all v, v' ∈ V.
⁴Aiming down from the apex.
⁵I.e., if either h_{-1} = v, or h_{-1} = v', then the newly created apex will be marked
by h'_{-1} = v*.
⁶It will turn out that the operation and statistical parameter values of the SSLM
are such that the only possible headword of a complete tree whose leaves are <s>,
w_1, w_2, ..., w_n, </s> is <s>.
• If h_{-1} = v*, v ∈ V then

(2)   Q(a | h_{-2} = <s>, h_{-1}) = { 1  if a = adjoin left
                                     { 0  otherwise

• If h_{-2} = v, v ∈ V then

(3)   Q(a | h_{-2}, h_{-1} = </s>) = { 1  if a = adjoin left*
                                      { 0  otherwise
[Figure: a complete parse spanning <s> w_1 w_2 ... w_n </s>.]
of W, that is, one whose sole exposed headword is <s>. Further, let
W^i = <s>, w_1, ..., w_i denote a prefix of the complete sentence W, and let
T^i denote a partial parse structure built by the SSLM constructor on top
of W^i. Clearly, W = W^{n+1} and W^{i+1} = W^i, w_{i+1}. Finally, let h_{-j}(T^i)
denote the j-th exposed headword of the structure T^i, j = 1, 2, ..., k where
h_{-k}(T^i) = <s>.
We will be interested in such quantities as P(W), P(T, W), P(W^i),
P(T^i, W^i), P(w_{i+1} | W^i), etc.
Because of the nature of the SSLM operation specified in Section 2, the
computation of P(T, W) or P(T^i, W^i) is straightforward. In fact, given
a "legal" parse structure T^i of a prefix W^i, there is a unique sequence
of constructor and predictor moves that results in the pair T^i, W^i. For
instance, for the parse of Figure 1, the sub-parse T^6 corresponding to the
prefix <s> A FLEMISH GAME SHOW HAS AS results from the following
sequence of SSLM moves: pred(A), null, pred(FLEMISH), null, pred(GAME),
null, pred(SHOW), adjoin right, adjoin right, adjoin right, null, pred(HAS),
null, pred(AS), null. P(T^i, W^i) is simply equal to the product of the
probabilities of the SSLM moves that result in T^i, W^i. More formally,

(5)   P(T^i, W^i) = ∏_{j=1}^{i} P(w_j | h_{-2}(T^{j-1}), h_{-1}(T^{j-1}))
                       × ∏_{l=1}^{m(j)} Q(a_{j,l} | T^{j-1}, w_j, a_{j,1}, ..., a_{j,l-1})
where a_{j,m(j)} = null, a_{j,l} ∈ {adjoin left, adjoin right}, l = 1, 2, ...,
m(j) − 1 are the actions taken by the constructor after w_j (and before
w_{j+1}) has been generated by the predictor. T^j results from T^{j-1} after
actions a_{j,1}, ..., a_{j,m(j)} have been performed by the constructor, and T^{j-1}
is the built-up structure just before control passes to the predictor which
will generate w_j. Furthermore, in (5) the actions a_{j,1}, ..., a_{j,l-1} performed
in succession on the structure [T^{j-1}, w_j] result in a structure having some
particular pair of last two exposed headwords h(l−1)_{-2}, h(l−1)_{-1}, and
we define

Q(a_{j,l} | T^{j-1}, w_j, a_{j,1}, ..., a_{j,l-1}) ≜ Q(a_{j,l} | h(l−1)_{-2}, h(l−1)_{-1}) .

Strictly speaking, (5) applies to i < n + 1 only. It may not reflect the
"end moves" that complete the parse.
4. A chart parsing algorithm. We will now develop a chart parsing
algorithm [4-6] that will enable us to calculate P(W) when the word string
was generated by the SSLM. The results of this algorithm will facilitate
the calculation of P(W^i) and thus allow an implementation of the SSLM
language model that is an alternative to the one in [1]. Furthermore, it will
lead to a Viterbi-like determination of the most probable parse
⁷So can the famous CKY algorithm [4-6] that will turn out to be similar to ours. As
a matter of fact, it will be obvious from formula (9) below that the presented algorithm
can also be run from bottom up, just as the CKY algorithm usually is, but such a
direction would be computationally wasteful because it could not take full advantage of
the thresholding suggested in Section 8.
⁸The span <i,j> refers to the words w_i, w_{i+1}, ..., w_j.
FIG. 3. Diagram illustrating inside probability recursions.
P(w_{i+1}^j, y[i,j] | w_i, x) ≜ P(w_{i+1}, ..., w_j, h(w_i, w_{i+1}, ..., w_j) = y | w_i, h_{-1}(T^{i-1}) = x)

where h(w_i, w_{i+1}, ..., w_j) = e (empty symbol) if w_i, w_{i+1}, ..., w_j do not form a phrase,
and

P(w_{i+1}^i, y[i,i] | w_i, x) = { 1  if y = w_i
                                 { 0  otherwise
where
It may be useful to illustrate the carrying out of the above recursion by
a simple example. We will create the corresponding chart (also referred to
as the parse triangle) for the sentence fragment <s> FRESHMAN BASKETBALL
PLAYER </s>. To simplify the presentation we will abbreviate the
preceding by <s> F B P </s>. To fit into Table 1 we will further
simplify the probabilities P(w_{i+1}^l, y[i,l] | w_i, x) by omitting the redundant
w_{i+1}^l, thus obtaining P(y[i,l] | w_i, x), one of the entries in the i-th row and
l-th column of the parse triangle. The parse triangle of Table 1 contains all
the cases that the recursion (9) generates. As a further illustration, the
probability P(P[2,3] | B, F) would be computed as follows:
TABLE 1
Parse triangle of the SSLM.
[Chart entries P(y[i,l] | w_i, x) for the fragment <s> freshman (F) basketball (B) player (P) </s>.]
4.2. Computing P(w_{i+1} | W^i). We can now use the concepts and
notation developed in Section 4.1 to compute left-to-right probabilities of
word generation by the SSLM. Let P(W^{i+1}, x) denote the probability that
the sequence w_0, w_1, w_2, ..., w_i, w_{i+1} is generated in a manner resulting in
(13)   P(W^{l+1}, x) = ∑_{i=1}^{l} ∑_{y ∈ W^{i-1}} P(W^i, y) P(w_{i+1}^l, x[i,l] | w_i, y) P*(w_{l+1} | y, x)

for x ∈ W^l, with the initial condition

and therefore

(14)   R(w_{i+1}^j, y[i,j] | w_i, x) =

where

R(w_{i+1}^j, y[i,j] | w_i, x) = 0

if x ∉ {w_0, ..., w_{i-1}} or y ∉ {w_i, ..., w_j} or i > j.
¹³In a given parse tree T, every node corresponds to some particular phrase span
<i,j> and is therefore uniquely identified by it.
¹⁴The preceding exposed headword does not change, so it must still be x.
(18)   P(W, x, y[i,j]) = P(w_0^i, w_{j+1}^{n+1}, x[i−1] ; y[i,j]) P(w_{i+1}^j, y[i,j] | w_i, x).

¹⁵In P(w_0^i, w_{j+1}^{n+1}, x[i−1] ; y[i,j]) we use a semicolon rather than a vertical slash to
indicate that this is a product of probabilities and not a probability itself. The semicolon
avoids a possible problem in equation (18).
We will obtain a recursion for P(w_0^i, w_{j+1}^{n+1}, x[i−1]; y[i,j]) based on
the four cases presented in Figure 6 that illustrate what the situation of
Figure 5 (that pertains to P(w_0^i, w_{j+1}^{n+1}, x[i−1]; y[i,j])) may lead to. The
four cases correspond to the four double sums on the right-hand side of the
following equation valid for x ∈ W^{i-1}, y ∈ {w_i, ..., w_j}:¹⁶

    + ∑_{l=1}^{i-1} ∑_{z ∈ W^{i-l-1}} [ P(w_0^{i-l}, w_{j+1}^{n+1}, z[i−l−1]; y[i−l, j])

    + ∑_{m=1}^{n-j+1} ∑_{u ∈ W_{j+1}^{j+m}} [ P(w_0^i, w_{j+m+1}^{n+1}, x[i−1]; y[i, j+m])
          × P(w_{j+2}^{j+m}, u[j+1, j+m] | w_{j+1}, y) P*(w_{j+1} | x, y) Q(left | y, u) ]

    + ∑_{m=1}^{n-j+1} ∑_{u ∈ W_{j+1}^{j+m}} [ P(w_0^i, w_{j+m+1}^{n+1}, x[i−1]; u[i, j+m])
          × P(w_{j+2}^{j+m}, u[j+1, j+m] | w_{j+1}, y) P*(w_{j+1} | x, y) Q(right | y, u) ]
which reflects the requirement, pointed out in Section 2, that the final parse
has the appearance of Figure 2.
FIG. 5. Diagram illustrating the parse situation P(w_0^i, w_{j+1}^{n+1}, x[i−1] ; y[i,j]).
(22)   P(F; B[2,2])
     = P(<s>; B[1,2]) P(F[1,1] | F, <s>) Q(null | <s>, F) × P(B | <s>, F) Q(right | F, B)
     + P(<s>; F[1,2]) P(F[1,1] | F, <s>) Q(null | <s>, F) × P(B | <s>, F) Q(left | F, B)
     + P(F; B[2,3]) P(P[3,3] | P, B) Q(null | F, B) P(P | F, B) Q(left | B, P)
     + P(F; P[2,3]) P(P[3,3] | P, B) Q(null | F, B) P(P | F, B) Q(right | B, P).
(25)   C_K(x, y, i, j, null) ≜ (1/P(W_K)) P(w_{i+1}^j, y[i,j] | w_i, x)
        × { ∑_u ∑_{m=1}^{n-j+1} [ P(w_0^i, w_{j+m+1}^{n+1}, x[i−1]; y[i, j+m])
              × P(w_{j+2}^{j+m}, u[j+1, j+m] | w_{j+1}, y) P*(w_{j+1} | x, y) Q(left | y, u) ]
          + ∑_u ∑_{m=1}^{n-j+1} [ P(w_0^i, w_{j+m+1}^{n+1}, x[i−1]; u[i, j+m])

(26)   Q̂(a | h_{-2} = x, h_{-1} = y) = CC(x, y, a) / ∑_{a'} CC(x, y, a')
We can similarly use the quantities (25) for re-estimating
P(v | h_{-2}, h_{-1}). In fact, let

(27)   CC(x, y, v) ≜ ∑_{K=1}^{M} ∑_{i=1}^{n_K} ∑_{j=i}^{n_K} C_K(x, y, i, j, null) δ(v, w_{j+1})

then

(28)   P̂(v | h_{-2} = x, h_{-1} = y) = CC(x, y, v) / ∑_{v'} CC(x, y, v').
(29)
¹⁸We will be brief, basing our exposition on the assumption that the reader is by now
familiar with the operation of the SSLM as described in Section 2.
h'_{-i} = h_{-i-1}, i = 2, 3, ...
    - (right*|γ) means create an apex with downward connections
      to h_{-2} and h_{-1}. Label the apex by h'_{-1} = (h_{-1}¹, γ)*. Let
      h'_{-i} = h_{-i-1}, i = 2, 3, ...
    - (left|γ) means create an apex with downward connections
      to h_{-2} and h_{-1}. Label the apex by h'_{-1} = (h_{-2}¹, γ). Let
      h'_{-i} = h_{-i-1}, i = 2, 3, ...
    - (left*|γ) means create an apex with downward connections
      to h_{-2} and h_{-1}. Label the apex by h'_{-1} = (h_{-2}¹, γ)*. Let
      h'_{-i} = h_{-i-1}, i = 2, 3, ...
    - (up|γ) means create an apex with a downward connection
      to h_{-1} only. Label the apex by h'_{-1} = (h_{-1}¹, γ). Let h'_{-i} =
      h_{-i}, i = 2, 3, ...
    - null means pass control to the predictor.
The operation of the SLM ends when the parser marks its apex by the
head <s>.¹⁹

Start of operation: The predictor generates the first word w_1 with
probability P_1(w_1 = v) = P(w_1 = v | <s>), v ∈ V. The
tagger then tags w_1 by the part of speech g with probability
P(g | w_1, <s>), g ∈ 𝒢. The initial heads (both exposed) become
h_{-2} = (<s>, <s>), h_{-1} = (w_1, g). Control is passed to the
constructor.

Restriction: For all h_{-2}, h_{-1}, γ_0 and j = 0, 1, ...²⁰

(30)   Q((up|γ_0) | h_{-2}, (h_{-1}, γ_j)) ∏_{i=1}^{j} Q((up|γ_i) | h_{-2}, (h_{-1}, γ_{i-1})) = 0.

(31)   Q(a | h_{-2} = (<s>, <s>), h_{-1}) = { 1  if a = null
                                             { 0  otherwise.

¹⁹Formally, this head is h = (<s>, <s>), but we may sometimes omit writing
the second component. Similarly, the head corresponding to the end of sentence symbol
is h = (</s>, </s>). This, of course, means that when the tagger is called upon
to tag </s>, it tags it with probability one by the part of speech </s>.
²⁰I.e., up actions cannot cycle.
[Figure: a partial SLM parse with the heads (<s>, <s>) and (has, S)*.]
• If h_{-2} = (v, γ), v ∈ V then

(33)   Q(a | h_{-2}, h_{-1} = </s>) = { 1  if a = (left*|γ)
                                       { 0  otherwise.
²¹Adjustment of Section 4.3 is left to the reader since it is very similar to that of
Section 4.1. In fact, the only difference between formula (9) and (15) is that sums in
the former are replaced by maxima in the latter.
and

A certain subtlety must be observed in evaluating (36): For every pair
(x, y¹) the probabilities P(w_{i+1}^j, (y¹, γ)[i,j] | w_i, x) must be evaluated for
different non-terminals γ, in a particular order. Because of the restriction (30)
such an order exists, assuring that the values of P(w_{i+1}^j, (y¹, γ)[i,j] | w_i, x)
for γ ∈ Γ(x, y) are fully known before P(w_{i+1}^j, y[i,j] | w_i, x) is computed.
This completes the adjustment of the formulas of Section 4.1.
Before proceeding further we can return to the example fragment
<s> FRESHMAN BASKETBALL PLAYER </s> and perform for it
the operations (36) when the basic production probabilities are as
given in Table 2 (these are realistic quantities obtained after training on
TABLE 2
Probabilities needed to compute inside-outside probabilities.
[Predictor, tagger, and constructor probability values for heads built from F(reshman), B(asketball), and P(layer) with the non-terminals NN, NNP, and NP.]
the U Penn Treebank [3]). Table 3 then represents the parse triangle
chart containing the inside probabilities (36) for all the relevant spans.
Finally, the following is the detailed calculation for the inside
probability P(B P, (P,NP)[1,3] | F, (<s>, <s>)) which we abbreviate as
P((P,NP)[1,3] | F, (<s>, <s>)):
TABLE 3
Inside probability table.
[Inside probabilities (36) for the spans of <s> freshman (F) basketball (B) player (P) </s>.]
(41)   P(W^{l+1}, x) = ∑_{i=1}^{l} ∑_{y¹ ∈ W^{i-1}} ∑_{γ} [ P(W^i, (y¹, γ))
            × P(w_{i+1}^l, x[i,l] | w_i, (y¹, γ)) P*(w_{l+1} | (y¹, γ), x) ]

for x¹ ∈ W^l = {w_1, w_2, ..., w_l}

(44)   P(w_0^i, w_{j+1}^{n+1}, x[i−1]; y[i,j]) ≜ 0.

²³This definition is made use of in the summations ∑_z and ∑_u in (43) and elsewhere,
thus simplifying notation. Consequently ∑_z could also have been written
∑_{z¹ ∈ W^{i-l-1}} ∑_{γ ∈ Γ}, etc.
(47)   × ∑_z ∑_{l=1}^{i-1} [ P(w_0^{i-l}, w_{j+1}^{n+1}, z[i−l−1]; (x¹, γ)[i−l, j])

(48)   × ∑_z ∑_{l=1}^{i-1} [ P(w_0^{i-l}, w_{j+1}^{n+1}, z[i−l−1]; (y¹, γ)[i−l, j])
C_K(x, y, i, j, null) ≜ (1/P(W_K)) P(w_{i+1}^j, y[i,j] | w_i, x)
    × { ∑_u ∑_{m=1}^{n-j+1} ∑_γ [ P(w_0^i, w_{j+m+1}^{n+1}, x[i−1]; (y¹, γ)[i, j+m])

CC(x, y, (left|γ)) ≜ ∑_{K=1}^{M} ∑_{i=1}^{n_K+1} ∑_{j=i}^{n_K+1} C_K(x, y, i, j, (left|γ))

CC(x, y, (right|γ)) ≜ ∑_{K=1}^{M} ∑_{i=1}^{n_K+1} ∑_{j=i}^{n_K+1} C_K(x, y, i, j, (right|γ))

CC(x, y, (up|γ)) ≜ ∑_{K=1}^{M} ∑_{i=1}^{n_K+1} ∑_{j=i}^{n_K+1} C_K(x, y, i, j, (up|γ))

CC(x, y, null) ≜ ∑_{K=1}^{M} ∑_{i=1}^{n_K} ∑_{j=i}^{n_K} C_K(x, y, i, j, null)
we get the re-estimates

(51)   Q̂(a | h_{-2} = x, h_{-1} = y) = CC(x, y, a) / ∑_{a'} CC(x, y, a')
We can similarly use the quantities (47) through (50) for re-estimating
P(v | h_{-2}, h_{-1}). In fact, let

CC(x, y, v) ≜ ∑_{K=1}^{M} ∑_{i=1}^{n_K} ∑_{j=i}^{n_K} C_K(x, y, i, j, null) δ(v, w_{j+1})

then

(52)   P̂(v | h_{-2} = x, h_{-1} = y) = CC(x, y, v) / ∑_{v'} CC(x, y, v').
Finally, the re-estimation of tagger probabilities is given by the formula²⁴

(53)   P̂(g | y, h_{-1} = x) = CC(x, (y, g)) / ∑_{g' ∈ 𝒢} CC(x, (y, g'))

where

CC(x, y) = CC(x, y, null).
Observe next²⁵ that for a fixed span <i,j> the products P(W^i, x)
P(w_{i+1}^j, y[i,j] | w_i, x) are comparable to each other regardless of the identity
of x¹ ∈ {w_0, w_1, ..., w_{i-1}} and y¹ ∈ {w_i, ..., w_j}. They can thus be
thresholded with respect to max_{v,z} P(W^i, v) P(w_{i+1}^j, z[i,j] | w_i, v). That
is, for further computation, the inside probabilities P(w_{i+1}^j, y[i,j] | w_i, x) can
be set to 0 if²⁶
²⁵Again, anything relating to the quantities P(w_{i+1}^j, y[i,j] | w_i, x) applies equally to
the quantities R(w_{i+1}^j, y[i,j] | w_i, x).
²⁶The author is indebted to Mark Johnson who pointed out this improvement of the
thresholding regime.
²⁷Allowed are x = (x¹, γ) and y = (y¹, γ) where x¹ ∈ {w_0, ..., w_{i-1}} and y¹ ∈
{w_i, ..., w_j}.
²⁸Zeroing is carried out in step 4 that follows.
2. For each span <k, l+1> just computed, set to 0 all probabilities
P(w_{k+1}^{l+1}, y[k, l+1] | w_k, x) satisfying

P(W^k, x) P(w_{k+1}^{l+1}, y[k, l+1] | w_k, x)
    ≪ max_{v,z} P(W^k, v) P(w_{k+1}^{l+1}, v[k, l+1] | w_k, z).
P(w_2^n, y[1,n] | w_1, <s>) for various values of y) find and mark the sub-parse
pairs P(w_2^j, x[1,j] | w_1, <s>), P(w_{j+2}^n, z[j+1,n] | w_{j+1}, x) that resulted
in P(w_2^n, y[1,n] | w_1, <s>). Perform this recursively. When the
process is completed, eliminate from the chart (i.e., set to 0) all the sub-parses
that are not marked. Computations P(w_0^i, w_{j+1}^{n+1}, x[i−1] ; y[i,j])
will thus be performed only for those positions which remain in the chart.
8.3. Limiting non-terminal productions. The straight-forward
way of training the Structured Language Model is to initialize the statistics
from reliable parses taken either from an appropriate treebank [3], or from
a corpus parsed by some automatic parser [8-10]. This is what was done
in previous work on the SLM [1, 2].
If this initialization is based on a sufficiently large corpus, then it is
reasonable to assume that all allowable reductions γ_1, γ_2 → γ and γ_0 → γ
have taken place in it. This can be used to limit the effort arising from the
sums ∑_γ appearing in (36) and (43).
If we assume that the initialization corpus is identical to the training
corpus then the problem of out-of-vocabulary words does not arise. Nevertheless,
we need to smooth the initial statistics. This can be accomplished
by letting words have all part of speech tags the dictionary allows, and
by assigning a positive probability only to reductions (x¹, γ_1), (x², γ_2) →
(x¹, γ) that correspond to reductions γ_1, γ_2 → γ that were actually found
in the initialization corpus.
9. Smoothing of statistical parameters. Just as in a trigram lan-
guage model, the parameter values extracted in training will suffer from
sparseness. In order to use them on test data, they will have to be sub-
jected to smoothing. Let us, for instance, consider the predictor in the
SSLM setting. The re-estimation formulas specify its values in equation
(28) which we repeat in a slightly altered form
(55)   f(v | h_{-2} = x, h_{-1} = y) = CC(x, y, v) / ∑_{v'} CC(x, y, v')
(56)   P̂(v | h_{-2} = x, h_{-1} = y)
            = λ f(v | h_{-2} = x, h_{-1} = y) + (1 − λ) P̂(v | h_{-1} = y)

where P̂(v | h_{-1} = y) denotes a bigram probability smoothed according
to the same principles being described here. The value of λ in (56) is
a function of the "bucket" that it belongs to. Buckets would normally
CC(x, y, v) · λ(k) f(v | h_{-2} = x, h_{-1} = y)
    / [ λ(k) f(v | h_{-2} = x, h_{-1} = y) + (1 − λ(k)) P̂(v | h_{-1} = y) ]
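The interpolation (56), together with the usual EM-style update for bucketed λ's that the surrounding discussion alludes to, can be sketched as follows (a simplified, assumption-laden illustration with one λ per bucket and held-out counts supplied as a dictionary; not the exact procedure of the text):

```python
def interpolate(v, x, y, lam, f, bigram):
    """P_hat(v | h_-2=x, h_-1=y) = lam*f(v|x,y) + (1-lam)*P_hat(v|y), cf. (56)."""
    return lam * f(v, x, y) + (1.0 - lam) * bigram(v, y)

def reestimate_lambda(heldout_counts, lam, f, bigram, bucket_of):
    """One EM pass over held-out counts; returns a new lambda for each bucket.

    heldout_counts: dict (x, y, v) -> count observed on held-out data
    bucket_of:      function (x, y) -> bucket index k
    """
    num, den = {}, {}
    for (x, y, v), c in heldout_counts.items():
        k = bucket_of(x, y)
        top = lam[k] * f(v, x, y)
        frac = top / (top + (1.0 - lam[k]) * bigram(v, y))
        num[k] = num.get(k, 0.0) + c * frac
        den[k] = den.get(k, 0.0) + c
    return {k: num[k] / den[k] for k in num}
```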
³¹Even though the buckets are two-dimensional, they can be numbered in sequence.
³²Values depend on the probabilities P̂(v | h_{-2} = x, h_{-1} = y) and Q̂(a | h_{-2} =
x, h_{-1} = y) which change with each iteration as the values of the λ-parameters change.
³³We assume iterative re-estimation.
REFERENCES
Abstract. Statistical language models used in large vocabulary speech recognition
must properly capture the various constraints, both local and global, present in the language.
While n-gram modeling readily accounts for the former, it has been more difficult
to handle the latter, and in particular long-term semantic dependencies, within a suitable
data-driven formalism. This paper focuses on the use of latent semantic analysis (LSA)
for this purpose. The LSA paradigm automatically uncovers meaningful associations in
the language based on word-document co-occurrences in a given corpus. The resulting
semantic knowledge is encapsulated in a (continuous) vector space of comparatively low
dimension, in which all (discrete) words and documents considered are mapped. Comparison
in this space is done through a simple similarity measure, so familiar clustering
techniques can be applied. This leads to a powerful framework for both automatic semantic
classification and semantic language modeling. In the latter case, the large-span
nature of LSA models makes them particularly well suited to complement conventional
n-grams. This synergy can be harnessed through an integrative formulation, in which
latent semantic knowledge is exploited to judiciously adjust the usual n-gram probability.
The paper concludes with a discussion of intrinsic trade-offs, such as the influence
of training data selection on the resulting performance enhancement.

Key words. Statistical language modeling, multi-span integration, n-grams, latent
semantic analysis, speech recognition.
the problem of predicting the word "fell" from the word "stocks." In (1.1),
the prediction can be done with the help of a bigram language model
(n = 2). This is straightforward with the kind of resources currently
available [50]. In (1.2), however, the value n = 9 would be necessary, a
rather unrealistic proposition at the present time. In large part because of
this inability to reliably capture large-span behavior, the performance of
conventional n-gram technology has essentially reached a plateau [52].
This observation has sparked interest in a variety of research direc-
tions, mostly relying on either information aggregation or span extension
[5]. Information aggregation increases the reliability of the parameter esti-
mation by taking advantage of exemplars of other words that behave "like"
this word in the particular context considered. The trade-off, typically, is
higher robustness at the expense of a loss in resolution . This paper is
more closely aligned with span extension, which extends and/or comple-
ments the n-gram paradigm with information extracted from large-span
units (i.e., comprising a large number of words). The trade-off here is in
the choice of units considered, which has a direct effect on the type of long
distance dependencies modeled. These units tend to be either syntactic or
semantic in nature. We now expand on these two choices.
1.2. Syntactically-driven span extension. Assuming a suitable
parser is available for the domain considered, syntactic information can be
used to incorporate large-span constraints into the recognition . How these
constraints are incorporated varies from estimating n-gram probabilities
from grammar-generated data [61] to computing a linear interpolation of
the two models [36] . Most recently, syntactic information has been used
specifically to determine equivalence classes on the n-gram history, resulting
in so-called dependency language models [13, 48], sometimes also referred
to as structured language models [14, 35, 57].
In that framework, each unit is in the form of the headword of the
phrase spanned by the associated parse sub-tree. The standard n-gram
language model is then modified to operate given the last (n -1) headwords
as opposed to the last (n - 1) words. Said another way, the structure of
the model is no longer pre-determined: which words serve as predictors
depends on the dependency graph, which is a hidden variable [52] . In the
example above, the top two headwords in the dependency graph would be
"stocks" and "fell" in both cases, thereby solving the problem .
The main caveat in such modeling is the reliance on the parser, and
particularly the implicit assumption that the correct parse will in fact be as-
signed a high probability [60]. The basic framework was recently extended
can be used for word and document clustering [6, 28, 30], as well as for
language modeling [2, 18]. In all cases, it was found to be suitable to
capture some of the global semantic constraints present in the language. In
fact, hybrid n-gram+LSA language models, constructed by embedding LSA
into the standard n-gram formulation, were shown to result in a substantial
reduction in average word error rate [3, 4].
(2.1)
(2.2)
(2.3)   W ≈ Ŵ = U S V^T

where T denotes matrix transposition. As is well known, both left and right singular
matrices U and V are column-orthonormal, i.e., U^T U = V^T V = I_R (the
identity matrix of order R). Thus, the column vectors of U and V each
define an orthonormal basis for the space of dimension R spanned by the
(R-dimensional) u_i's and v_j's. Furthermore, the matrix Ŵ is the best
rank-R approximation to the word-document matrix W, for any unitarily
invariant norm (cf., e.g., [19]). This entails, for any matrix A of rank R:

(2.4)   min_{A : rank(A)=R} ||W − A|| = ||W − Ŵ|| = s_{R+1},

where || · || refers to the L2 norm, and s_{R+1} is the smallest singular value
retained in the order-(R+1) SVD of W. Obviously, s_{R+1} = 0 if R is equal
to the rank of W.
Upon projecting the row vectors of W (i.e., words) onto the orthonormal
basis formed by the column vectors of V, the row vector u_i S characterizes
the position of word w_i in the underlying R-dimensional space, for
1 ≤ i ≤ M. Similarly, upon projecting the column vectors of W (i.e., documents)
onto the orthonormal basis formed by the column vectors of U, the
row vector v_j S characterizes the position of document d_j in the same space,
for 1 ≤ j ≤ N. We refer to each of the M scaled vectors ū_i = u_i S as a word
vector, uniquely associated with word w_i in the vocabulary, and each of the
N scaled vectors v̄_j = v_j S as a document vector, uniquely associated with
document d_j in the corpus. Thus, (2.3) defines a transformation between
high-dimensional discrete entities (V and T) and a low-dimensional continuous
vector space S, the R-dimensional (LSA) space spanned by the u_i's
and v_j's. The dimension R is bounded from above by the (unknown) rank
of the matrix W, and from below by the amount of distortion tolerable in
the decomposition. It is desirable to select R so that Ŵ captures the major
structural associations in W, and ignores higher order effects.
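The construction of the word and document vectors ū_i = u_i S and v̄_j = v_j S can be written compactly with a standard SVD routine. A minimal sketch (the toy matrix is a placeholder, and no count weighting such as (2.1) is applied):

```python
import numpy as np

def lsa_space(W, R):
    """Rank-R LSA decomposition of an M x N word-document matrix W.

    Returns word vectors (rows u_i S) and document vectors (rows v_j S).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    S = np.diag(s[:R])
    word_vectors = U[:, :R] @ S          # M x R, one row per word
    doc_vectors = Vt[:R, :].T @ S        # N x R, one row per document
    return word_vectors, doc_vectors

def closeness(a, b):
    """Cosine-style closeness between two vectors in the LSA space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-word x 3-document count matrix (placeholder data).
W = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
words, docs = lsa_space(W, R=2)
print(closeness(words[0], words[1]), closeness(docs[1], docs[2]))
```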
2.3. Properties. By construction, the "closeness" of vectors in the
LSA space S is determined by the overall pattern of the language used in
T, as opposed to specific constructs. Hence, two words whose representa-
T , as opposed to specific constructs. Hence, two words whose representa-
tions are "close" (in some suitable metric) tend to appear in the same kind
of documents, whether or not they actually occur within identical word
contexts in those documents. Conversely, two documents whose represen-
tations are "close" tend to convey the same semantic meaning, whether
or not they contain the same word constructs. In the same manner, from
the bidiagonalization process inherent in the SVD, we can expect that the
respective representations of words and documents that are semantically
linked would also be "close" in the LSA space S.
Of course, the optimality of this framework can be debated, since the
L2 norm may not be the best choice when it comes to linguistic phenomena.
For example, the Kullback-Leibler divergence provides a more elegant
(probabilistic) interpretation of (2.3) [31], albeit at the expense of requiring
a conditional independence assumption on the words and the documents
¹The relevant definition for this quantity will be discussed in detail shortly, cf. Section 3.3.
(3.1)
(3.2)  K(w_i, w_j) = cos(u_i S, u_j S) = (u_i S^2 u_j^T) / (||u_i S|| ||u_j S||),

for any 1 ≤ i, j ≤ M. A value of K(w_i, w_j) = 1 means the two words
always occur in the same semantic context, while a value of K(w_i, w_j) < 1
means the two words are used in increasingly different semantic contexts.
[Example word clusters referred to below as Cluster 1 and Cluster 2; the word lists themselves are not recoverable from this copy.]
While (3.2) does not define a bona fide distance measure in the space S, it
easily leads to one. For example, over the interval [0, π], the measure:
(3.3)  D(w_i, w_j) = cos^{-1} K(w_i, w_j)
the present tense verb "rule" from cluster 2. This is an instance of a phe-
nomenon called polysemy: "drawing" and "rule" are more likely to appear in
the training text with their alternative meanings (as in "drawing a conclu-
sion" and "breaking a rule," respectively), thus resulting in different cluster
assignments. Finally, some words seem to contribute only marginally to the
clusters: for example, "hysteria" from cluster 1 and "here" from cluster 2.
These are the unavoidable outliers at the periphery of the clusters.
3.3. Document clustering. Proceeding as above, the SVD expres-
sion (2.3) also yields:

(3.4)

(3.5)  K(d_i, d_j) = cos(v_i S, v_j S) = (v_i S^2 v_j^T) / (||v_i S|| ||v_j S||),

for any 1 ≤ i, j ≤ N. This has the same functional form as (3.2); thus,
the distance (3.3) is equally valid for both word and document clustering.²
The resulting set of clusters D_ℓ, 1 ≤ ℓ ≤ L, can be viewed as revealing
another layer of semantic knowledge in the space S.
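To make the clustering machinery concrete, here is a minimal sketch (not from the original text) of the closeness and distance computations, assuming the cosine form given in (3.2)/(3.5) and the arccosine distance of (3.3); the example vectors are invented.

    import numpy as np

    def closeness(x, y):
        """Cosine of the angle between two scaled LSA vectors, cf. (3.2)/(3.5)."""
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

    def distance(x, y):
        """Distance over [0, pi] derived from the closeness measure, cf. (3.3)."""
        return float(np.arccos(np.clip(closeness(x, y), -1.0, 1.0)))

    # Illustrative scaled word vectors u_i S in a 2-dimensional LSA space.
    u1, u2 = np.array([0.9, 0.1]), np.array([0.7, 0.4])
    print(closeness(u1, u2), distance(u1, u2))   # small angle: semantically "close"

Word vectors u_i S (or document vectors v_j S) equipped with this distance can then be fed to any standard clustering procedure, e.g. K-means, to obtain the cluster sets discussed above.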
3.4. Document cluster example. An early document clustering ex-
periment using the above measure was documented in [30]. This work was
conducted on the British National Corpus (BNC), a heterogeneous corpus
which contains a variety of hand-labelled topics. Using the LSA framework
as above, it is possible to partition BNC into distinct clusters, and compare
the sub-domains so obtained with the hand-labelled topics provided with
the corpus. This comparison was conducted by evaluating two different
mixture trigram language models: one built using the LSA sub-domains,
and one built using the hand-labelled topics. As the perplexities
obtained were very similar [30], this validates the automatic partitioning
performed using LSA.
Some evidence of this behavior is provided in Figure 3, which plots
the distributions of four of the hand-labelled BNC topics against the ten
document sub-domains automatically derived using LSA. While clearly not
matching the hand-labeling, LSA document clustering in this example still
seems reasonable. In particular, as one would expect, the distribution
for the natural science topic is relatively close to the distribution for the
applied science topic (cf. the two solid lines), but quite different from the
two other topic distributions (in dashed lines). From that standpoint, the
data-driven LSA clusters appear to adequately cover the semantic space.
²In fact, the measure (3.3) is precisely the one used in the study reported in Figure 1.
Thus, the distances on the x-axis of Figure 1 are V(d_i, d_j) expressed in radians.
FIG. 3. Probability distributions of four BNC topics (Natural Science, Applied Science, Social Science, Imaginative) against the LSA document sub-domains (x-axis: sub-domain (cluster) index).
(4.1)

The vector ṽ_p, which is indeed seen to be functionally similar to a document vector,
corresponds to the representation of the new document in the space S.
To convey the fact that it was not part of the SVD extraction, the
new document d_p is referred to as a pseudo-document. Recall that the
(truncated) SVD provides, by definition, a parsimonious description of the
linear space spanned by W. As a result, if the new document contains
language patterns which are inconsistent with those extracted from W, the
SVD expansion (2.3) will no longer apply. Similarly, if the addition of d_p
causes the major structural associations in W to shift in some substantial
manner,³ the parsimonious description will become inadequate. Then U
and S will no longer be valid, in which case it would be necessary to recompute
(2.3) to find a proper representation for d_p. If, on the other hand,
the new document generally conforms to the rest of the corpus T, then the
pseudo-document vector ṽ_p in (4.2) will be a reasonable representation for
d_p.
Once the representation (4.2) is obtained, the "closeness" between the
new document d_p and any document cluster D_ℓ can then be expressed as
V(d_p, D_ℓ), calculated from (3.5) in the previous section.
4.2. Semantic inference. This can be readily exploited in such com-
mand-and-control tasks as desktop user interface control [7] or automated
call routing [12]. Suppose that each document cluster D_ℓ can be uniquely
associated with a particular action in the task. Then the centroid of each
cluster can be viewed as the semantic anchor of this action in the LSA
space . An unknown word sequence (treated as a new "document") can
thus be mapped onto an action by evaluating the distance (3.3) between
that "document" and each semantic anchor. We refer to this approach
as semantic inference [7, 8]. In contrast with usual inference engines (cf.
3For example, suppose training was carried out for a banking application involving
the word "bank" taken in a financial context. Now suppose dp is germane to a fishing
application, where "bank" is referred to in the context of a river or a lake. Clearly,
the closeness of "bank" to, e.g., "money" and "account," would be irrelevant to dp .
Conversely, adding dp to W would likely cause such structural associations to shift
substantially, and perhaps even disappear altogether.
FIG. 4. Two-dimensional illustration of the LSA space for the four-command example (x-axis: first SVD dimension); the plotted points include the words "the," "meeting," and "cancel," the four commands, and the new utterance "when is the meeting."
[20]), semantic inference thus defined does not rely on formal behavioral
principles extracted from a knowledge base. Instead, the domain knowledge
is automatically encapsulated in the LSA space in a data-driven fashion.
To illustrate, consider an application with N = 4 actions (documents),
each associated with a unique command: (i) "what is the time," (ii) "what
is the day," (iii) "what time is the meeting," and (iv) "cancel the meeting."
In this simple example, there are only M = 7 words in the vocabulary, with
some interesting patterns: "what" and "is" always co-occur, "the" appears
in all four commands, only (ii) and (iv) contain a unique word, and (i) is
a proper subset of (iii). Constructing the (7 × 4) word-document matrix
as described above, and performing the SVD, we obtain the 2-dimensional
space depicted in Figure 4.
This figure shows how each word and each command is represented
in the space S. Note that the two words which each uniquely identify a
command ("day" for (ii) and "cancel" for (iv)) each have a high coordinate
on a different axis. Conversely, the word "the," which conveys no information
about the identity of a command, is located at the origin. On the
other hand, the semantic anchors for (ii) and (iv) fall "close" to the words
which predict them best, "day" and "cancel," respectively. Similarly, the
semantic anchors for (i) and (iii) fall in the vicinity of their meaningful
components ("what"-"is" and "time" for (i), and "time" and "meeting" for
(iii)), with the word "time," which occurs in both, indeed appearing "close"
to both.
Now suppose that a user says something outside of the training setup,
such as "when is the meeting" rather than "what time is the meeting."
This new word string turns out to have a representation in the space S
indicated by the hollow triangle in Figure 4. Observe that this point is
closest to the representation of command (iii). Thus , the new word string
can be considered semantically most related to (iii), and the correct action
can be automatically inferred . This can be thought of as a way to perform
"bottom-up" natural language understanding.
By replacing the traditional rule-based mapping between utterance
and action by such data-driven classification, semantic inference makes it
possible to relax some of the typical command-and-control interaction con-
straints. For example, it obviates the need to specify rigid language con-
structs through a domain-specific (and thus typically hand-crafted) finite
state grammar. This in turn allows the end user more flexibility in expressing
the desired command/query, which tends to reduce the associated
cognitive load and thereby enhance user satisfaction [12].
4.3. Caveats. Recall that LSA is an instance of the "bag-of-words"
paradigm, which pays no attention to the order of words in the sentence.
This is what makes it well-suited to capture semantic relationships between
words. By the same token, however, it is inherently unable to capitalize
on the local (syntactic, pragmatic) constraints present in the language.
For tasks such as call routing, where only the broad topic of a message is
to be identified, this limitation is probably inconsequential. For general
command and control tasks, however, it may be more deleterious.
Imagine two commands that differ only in the presence of the word
"not" in a crucial place. The respective vector representations could con-
ceivably be relatively close in the LSA space, and yet have vastly differ-
ent intended consequences. Worse yet, some commands may differ only
through word order. Consider, for instance, two MacOS 9 commands that
contain exactly the same words in a different order: they are mapped onto
the exact same point in the LSA space, which makes them obviously impossible
to disambiguate.
As it turns out, it is possible to handle such cases through an extension
of the basic LSA framework using word agglomeration. The idea is to move
from words and documents to word n-tuples and n-tuple documents, where
each word n-tuple is the agglomeration of n successive words, and each
(n-tuple) document is now expressed in terms of all the word n-tuples it
contains. Despite the resulting increase in computational complexity, this
extension is practical in the context of semantic classification because of
the relatively modest dimensions involved (as compared to large vocabulary
recognition) . Further details would be beyond the scope of this manuscript,
but the reader is referred to [8] for a complete description.
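A minimal sketch of the word-agglomeration idea follows. The two commands below are invented stand-ins with identical bags of words (they are not the MacOS 9 commands of the original example): each document is re-expressed in terms of the word n-tuples it contains, so that word order is reflected in the resulting matrix.

    import numpy as np

    def ngram_tuples(text, n=2):
        words = text.split()
        return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    commands = ["make the window the active selection",   # invented, same bag of words
                "make the selection the active window"]
    tuples = sorted({t for c in commands for t in ngram_tuples(c)})
    index = {t: i for i, t in enumerate(tuples)}

    W = np.zeros((len(tuples), len(commands)))            # n-tuple-document matrix
    for j, c in enumerate(commands):
        for t in ngram_tuples(c):
            W[index[t], j] += 1.0

    # The two columns now differ, whereas a plain bag-of-words matrix would
    # have made the two commands indistinguishable.
    print(np.array_equal(W[:, 0], W[:, 1]))               # False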
where the conditioning on S reflects the fact that the probability depends
on the particular vector space arising from the SVD representation. In this
expression, Pr(w_q | d_{q-1}) is computed directly from the representations of
w_q and d_{q-1} in the space S, i.e., it is inferred from the "closeness" between
the associated word vector and (pseudo-)document vector in S. We therefore
have to specify both the appropriate pseudo-document representation
and the relevant probability measure.
5.1.1. Pseudo-document representation. To come up with a pseudo-document
representation, we leverage the results of Section 4.1, with some
slight modifications due to the time-varying nature of the span considered.
From (4.2), the context d_{q-1} has a representation in the space S given by:

(5.2)  v̄_{q-1} = ṽ_{q-1} S = d̃_{q-1}^T U,
where the "1" in the above vector appears at coordinate i. This is turn
implies, from (5.2):
(5.4)
(5.5)
4Not surprisingly, this difference in scaling exactly mirrors the squ are root relation-
ship between the singular values of Wand the eigenvalues of the (square) matrices
WTWand WWT .
(5.7)  Pr(w_q | H_{q-1}^{(n+l)}) = Pr(w_q, H_{q-1}^{(l)} | H_{q-1}^{(n)}) / Σ_{w_i ∈ V} Pr(w_i, H_{q-1}^{(l)} | H_{q-1}^{(n)}),

where the summation in the denominator extends over all words in V.
Expanding and re-arranging, the numerator of (5.7) is seen to be:
Now we make the assumption that the probability of the document history
given the current word is not affected by the immediate context preceding
it. This is clearly an approximation, since w_q w_{q-1} w_{q-2} ... w_{q-n+1} may
well reveal more information about the semantic fabric of the document
than w_q alone. This remark notwithstanding, for content words at least,
different syntactic constructs (immediate context) can generally be used to
carry the same meaning (document history). Thus the assumption seems
to be reasonably well motivated for content words. How much it matters
for function words is less clear [37], but we conjecture that if the document
history is long enough, the semantic anchoring is sufficiently strong for the
assumption to hold. As a result, the integrated probability becomes:
(5.9)  Pr(w_q | H_{q-1}^{(n+l)}) = Pr(w_q | w_{q-1} w_{q-2} ... w_{q-n+1}) Pr(d_{q-1} | w_q) / Σ_{w_i ∈ V} Pr(w_i | w_{q-1} w_{q-2} ... w_{q-n+1}) Pr(d_{q-1} | w_i).
(5.10)  Pr(w_q | H_{q-1}^{(n+l)}) = [Pr(w_q | w_{q-1} ... w_{q-n+1}) Pr(w_q | d_{q-1}) / Pr(w_q)] / Σ_{w_i ∈ V} [Pr(w_i | w_{q-1} ... w_{q-n+1}) Pr(w_i | d_{q-1}) / Pr(w_i)],

where Pr(w_q) is simply the standard unigram probability. Note that this
expression is meaningful⁶ for any n > 1.
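A minimal sketch of the integrated probability follows, assuming the form of (5.10) as given above (the n-gram probability is modulated by Pr(w | d_{q-1})/Pr(w) and renormalized over the vocabulary); all probability values are invented toy numbers, not estimates from data.

    def integrated_probability(ngram_prob, lsa_prob, unigram_prob, vocab):
        """Pr(w_q | combined history), cf. (5.10), for toy dictionaries keyed by word."""
        num = {w: ngram_prob[w] * lsa_prob[w] / unigram_prob[w] for w in vocab}
        z = sum(num.values())
        return {w: num[w] / z for w in vocab}

    vocab = ["stocks", "fell", "bank"]
    ngram_prob = {"stocks": 0.2, "fell": 0.5, "bank": 0.3}      # Pr(w | n-gram history)
    lsa_prob = {"stocks": 0.5, "fell": 0.2, "bank": 0.3}        # Pr(w | d_{q-1}) from the LSA space
    unigram_prob = {"stocks": 0.3, "fell": 0.4, "bank": 0.3}    # Pr(w)

    print(integrated_probability(ngram_prob, lsa_prob, unigram_prob, vocab))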
5.3. Context scope selection. In practice, expressions like (5.9)-
(5.10) are often slightly modified so that a relative weight can be placed on
each contribution (here, the n-gram and LSA probabilities) . Usually, this is
done via empirically determined weighting coefficients. In the present case,
such weighting is motivated by the fact that in (5.9) the "prior" probability
Pr (dq_1Iw q) could change substantially as the current document unfolds.
Thus, rather than using arbitrary weights, an alternative approach is to
dynamically tailor the document history d_{q-1} so that the n-gram and LSA
contributions remain empirically balanced.
⁶Observe that with n = 1, the right hand side of (5.10) degenerates to the LSA
probability alone, as expected.
(5.11)
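The body of (5.11) is not reproduced in this copy. As a rough sketch only, assume (consistently with the later remark that with λ = 0.975 a word occurring 60 positions back is discounted to about 0.2) that exponential forgetting decays the contribution of a word occurring i positions in the past by λ^i before the pseudo-document is projected into the LSA space; the function arguments below are hypothetical.

    import numpy as np

    def decayed_pseudo_doc(history, index, U, lam=0.975):
        """Context representation with exponential forgetting (assumed form)."""
        d = np.zeros(U.shape[0])
        for age, w in enumerate(reversed(history)):   # age 0 = most recent word
            if w in index:
                d[index[w]] += lam ** age
        return d @ U                                  # representation in the LSA space

    print(round(0.975 ** 60, 2))                      # 0.22: a word 60 positions back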
(6.1)  Pr(w_q | d_{q-1}) = Σ_{k=1}^{K} Pr(w_q | C_k) Pr(C_k | d_{q-1}),
which carries over to (5.10) in a straightforward manner. In (6.1), the
probability Pr(C_k | d_{q-1}) is qualitatively similar to (5.1) and can therefore
be obtained with the help of (5.5), by simply replacing the representation
of the word w_q by that of the centroid of word cluster C_k. In contrast,
the probability Pr(w_q | C_k) depends on the "closeness" of w_q relative to this
(word) centroid. To derive it, we therefore have to rely on the empirical
multivariate distribution induced not by the distance obtained from (5.5),
but by that obtained from the measure (3.2) mentioned in Section 3.1.
Note that a distinct distribution can be inferred on each of the clusters C_k,
thus allowing us to compute all quantities Pr(w_i | C_k) for 1 ≤ i ≤ M and
1 ≤ k ≤ K.
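A minimal sketch of the clustered expansion (6.1) follows: the smoothed LSA probability sums, over word clusters C_k, the probability of the word given the cluster times the probability of the cluster given the history. The two toy distributions below merely stand in for the empirically induced ones described above.

    def smoothed_lsa_prob(word, clusters, p_word_given_cluster, p_cluster_given_history):
        """Pr(w_q | d_{q-1}) expanded over word clusters, cf. (6.1)."""
        return sum(p_word_given_cluster[k].get(word, 0.0) * p_cluster_given_history[k]
                   for k in clusters)

    clusters = ["C1", "C2"]
    p_word_given_cluster = {"C1": {"stocks": 0.6, "fell": 0.4},
                            "C2": {"bank": 0.7, "fell": 0.3}}     # Pr(w | C_k), toy values
    p_cluster_given_history = {"C1": 0.8, "C2": 0.2}              # Pr(C_k | d_{q-1}), toy values

    print(smoothed_lsa_prob("fell", clusters,
                            p_word_given_cluster, p_cluster_given_history))   # 0.38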
The behavior of the model (6.1) depends on the number of word clusters
defined in the space S. Two special cases arise at the extremes of
the cluster range. If there are as many classes as words in the vocabulary
(K = M), then with the convention that Pr(w_i | C_j) = δ_ij, (6.1) simply
reduces to (5.1). No smoothing is introduced, so the predictive power of the
model stays the same as before. Conversely, if all the words are in a single
class (K = 1), the model becomes maximally smooth: the influence of specific
semantic events disappears, leaving only a broad (and therefore weak)
vocabulary effect to take into account. The effect on predictive power is,
accordingly, limited. Between these two extremes, smoothness gradually
increases, and it is reasonable to postulate that predictive power evolves in
a concave fashion.
The intuition behind this conjecture is as follows. Generally speaking,
as the number of word classes C_k increases, the contribution of Pr(w_q | C_k)
tends to increase, because the clusters become more and more semantically
meaningful. By the same token, however, the contribution of Pr(C_k | d_{q-1})
for a given d_{q-1} tends to decrease, because the clusters eventually become
too specific and fail to reflect the overall semantic fabric of d_{q-1}. Thus,
there must exist a cluster set size where the degree of smoothing (and
therefore the associated predictive power) is optimal for the task considered.
This has indeed been verified experimentally, cf. [2].
6.2. Document smoothing. Exploiting instead the set of document
clusters D_ℓ, 1 ≤ ℓ ≤ L, produced in Section 3.3 leads to document-based
smoothing. The expansion is similar:

(6.2)  Pr(w_q | d_{q-1}) = Σ_{ℓ=1}^{L} Pr(w_q | D_ℓ) Pr(D_ℓ | d_{q-1}),

except that the document clusters D_ℓ now replace the word clusters C_k.
This time, it is the probability Pr(w_q | D_ℓ) which is qualitatively similar to
(5.1), and can therefore be obtained with the help of (5.5). As for the
probability Pr(D_ℓ | d_{q-1}), it depends on the "closeness" of d_{q-1} relative to
the centroid of document cluster D_ℓ. Thus, it can be obtained through the
empirical multivariate distribution induced by the distance derived from
(3.5) in Section 3.3.
Again, the behavior of the model (6.2) depends on the number of
document clusters defined in the space S. Compared to (6.1), however,
(6.2) is more difficult to interpret at the extremes of the cluster range (i.e.,
L = 1 and L = N). If L = N, for example, (6.2) does not reduce to (5.1),
because d_{q-1} has not been seen in the training data, and therefore cannot
be identified with any of the existing clusters. Similarly, the fact that all
the documents are in a single cluster (L = 1) does not imply the degree
of degeneracy observed previously, because the cluster itself is strongly
indicative of the general discourse domain (which was not generally true
of the "vocabulary cluster" above). Hence, depending on the size and
structure of the corpus, the model may still be adequate to capture general
discourse effects.
To see that, we apply L = 1 in (6.2), whereby the expression (5.10)
becomes:
(6.3)
⁷The reader is referred to [4] for additional results in this application, and to [8] for
experiments involving semantic inference.
TABLE 1
Word Error Rate (WER) Results Using Hybrid Bi-LSA and Tri-LSA Models.
likely related to the greater predictive power of the trigram compared to the
bigram, which makes the LSA contribution of the hybrid language model
comparatively smaller. This is consistent with the fact that the latent se-
mantic information delivered by the LSA component would (eventually)
be subsumed by an n-gram with a large enough n . As it turns out, how-
ever, in both cases the average WER reduction is far from constant across
individual sessions, reflecting the varying role played by global semantic
constraints from one set of spoken utterances to another.
Of course, this kind of fluctuation can also be observed with the
conventional n-gram models, reflecting the varying predictive power of the
local context across the test set. Anecdotally, the leverage brought about
by the hybrid n-LSA models appears to be greater when the fluctuations
due to the respective components move in opposite directions. So, at least
for n ::; 3, there is indeed evidence of a certain complementarity between
the two paradigms.
7.3. Context scope selection. It is important to emphasize that
the recognition task chosen above represents a severe test of the LSA com-
ponent of the hybrid language model. By design, the test corpus is con-
structed with no more than 3 or 4 consecutive sentences extracted from
a single article. Overall, it comprises 140 distinct document fragments,
which means that each speaker speaks, on the average, about 12 different
"mini-documents." As a result, the context effectively changes every 60
words or so, which makes it somewhat challenging to build a very accurate
pseudo-document representation. This is a situation where it is critical
for the LSA component to appropriately forget the context as it unfolds,
to avoid relying on an obsolete representation. To obtain the results of
Table 1, we used the exponential forgetting setup of (5.11) with a value
λ = 0.975.⁸
In order to assess the influence of this selection, we also performed
recognition with different values of the parameter λ ranging from λ = 1
to λ = 0.95, in decrements of 0.01. Recall from Section 5 that the value
λ = 1 corresponds to an unbounded context (as would be appropriate for
a very homogeneous session), while decreasing values of λ correspond to
increasingly more restrictive contexts (as required for a more heterogeneous
session). Said another way, the gap between λ and 1 tracks the expected
heterogeneity of the current session.
Table 2 presents the corresponding recognition results, in the case of
the best bi-LSA framework (i.e., with word smoothing). It can be seen
that, with no forgetting, the overall performance is substantially less than
the comparable one observed in Table 1 (13% compared to 23% WER
reduction). This is consistent with the characteristics of the task, and
underscores the role of discounting as a suitable counterbalance to frequent
8To fix ideas, this means that the word which occurred 60 words ago is discounted
through a weight of about 0.2.
TABLE 2
Influence of Context Scope Selection on Word Error Rate.
context changes. Performance rapidly improves as λ decreases from λ =
1 to λ = 0.97, presumably because the pseudo-document representation
gets less and less contaminated with obsolete data. If forgetting becomes
too aggressive, however, the performance starts degrading, as the effective
context no longer has an equivalent length which is sufficient for the task
at hand. Here, this happens for λ < 0.97.
8. Inherent trade-offs. In the previous section, the LSA component
of the hybrid language model was trained on exactly the same data as its
n-gram component. This is not a requirement, however, which raises the
question of how critical the selection of the LSA training data is to the
performance of the recognizer. This is particularly interesting since LSA is
known to be weaker on heterogeneous corpora (see, for example, [30]).
8.1. Cross-domain training. To ascertain the matter, we went back
to calculating the LSA component using the original, unsmoothed model
(5.1). We kept the same underlying vocabulary V, left the bigram component
unchanged, and repeated the LSA training on non-WSJ data from
the same general period. Three corpora of increasing size were considered,
all corresponding to Associated Press (AP) data: (i) T_1, composed
of N_1 = 84,000 documents from 1989, comprising approximately 44 million
words; (ii) T_2, composed of N_2 = 155,000 documents from 1988 and
1989, comprising approximately 80 million words; and (iii) T_3, composed
of N_3 = 224,000 documents from 1988-1990, comprising approximately
117 million words. In each case we proceeded with the LSA training as
described in Section 2. The results are reported in Table 3.
Two things are immediately apparent. First, the performance improvement
in all cases is much smaller than previously observed (recall
the corresponding reduction of 14% in Table 1). Larger training set sizes
notwithstanding, on the average the hybrid model trained on AP data is
about 4 times less effective than that trained on WSJ data. This suggests a
relatively high sensitivity of the LSA component to the domain considered.
TABLE 3
Model Sensitivity to LSA Training Data.
To put this observation into perspective, recall that: (i) by definition, con-
tent words are what characterize a domain; and (ii) LSA inherently relies
on content words, since, in contrast with n-grams, it cannot take advantage
of the structural aspects of the sentence. It therefore makes sense to expect
a higher sensitivity for the LSA component than for the usual n-gram.
Second, the overall performance does not improve appreciably with
more training data, a fact already observed in [2] using a perplexity mea-
sure . This supports the conjecture that, no matter the amount of data
involved, LSA still detects a substantial mismatch between AP and WSJ
data from the same general period . This in turn suggests that the LSA
component is sensitive not just to the general training domain, but also to
the particular style of composition, as might be reflected, for example, in
the choice of content words and/or word co-occurrences. On the positive
side, this bodes well for rapid adaptation to cross-domain data, provided a
suitable adaptation framework can be derived.
8.2. Discussion. The fact that the hybrid n-gram+LSA approach is
sensitive to composition style underscores the relatively narrow semantic
specificity of the LSA paradigm. While n-grams also suffer from a possi-
ble mismatch between training and recognition, LSA leads to a potentially
more severe exposure because the space S reflects even less of the prag-
matic characteristics for the task considered. Perhaps what is required is
to explicitly include an "authorship style" component into the LSA frame-
work. 9 In any event, one has to be cognizant of this intrinsic limitation,
and mitigate it through careful attention to the expected domain of use.
Perhaps more importantly, we pointed out earlier that LSA is inher-
ently more adept at handling content words than function words. But, as
is well-known, a substantial proportion of speech recognition errors come
from function words, because of their tendency to be shorter, not well ar-
ticulated, and acoustically confusable. In general, the LSA component will
not be able to help fix these problems. This suggests that, even within a
well-specified domain, syntactically-driven span extension techniques may
⁹In [47], for example, it has been suggested to define an M × M stochastic matrix (a
matrix with non-negative entries and row sums equal to 1) to account for the way style
modifies the frequency of words. This solution, however, makes the assumption, not
always valid, that this influence is independent of the underlying subject matter.
REFERENCES
[54] S. ROUKOS, Language Representation, Chapter 6 in Survey of the State of the Art
in Human Language Technology, R. Cole (Ed.) , Cambridge University Press,
Cambridge, MA, 1997.
[55] R . SCHWARTZ , T . IMAI , F . KUBALA , L. NGUYEN, AND J . MAKHOUL, A Maximum
Likelihood Model for Topic Classification of Broadcast News, in Proc. Fifth
Euro. Conf . Speech Comm. Technol., Rhodes, Greece, Vol. 3, pp . 1455-1458,
September 1997.
[56] R.E . STORY, An Explanation of the Effectiveness of Latent Semantic Indexing by
Means of a Bayesian Regression Model, Inform. Processing & Management,
Vol. 32, No.3, pp. 329-344, 1996.
[57] J. WU AND S. KHUDANPUR, Combining Nonlocal, Syntactic and N-Gram Dependencies
in Language Modeling, in Proc. Sixth Euro. Conf. Speech Comm.
Technol., Budapest, Hungary, Vol. 5, pp. 2179-2182, September 1999.
[58] D.H. YOUNGER, Recognition and Parsing of Context-Free Languages in Time n³,
Inform. & Control, Vol. 10, pp. 198-208, 1967.
[59] R . ZHANG , E . BLACK , AND A. FINCH, Using Detailed Linguistic Structure in Lan-
guage Modeling , in Proc. Sixth Euro. Conf. Speech Comm. Technol., Bu-
dapest, Hungary, Vol. 4 , pp . 1815-1818, September 1999.
[60] X.J . ZHU, S.F . CHEN, AND R. ROSENFELD, Linguistic Features for Whole Sen-
tence Maximum Entropy Language Models, in Proc. Sixth Euro. Conf. Speech
Comm. Technol. , Budapest, Hungary, Vol. 4 , pp . 1807-1810, September 1999.
[61] V . ZUE, J. GLASS, D. GOODINE, H. LEUNG , M. PHILLIPS, J. POLIFRONI , AND S.
SENEFF, Integration of Speech Recognition and Natural Language Processing
in the MIT Voyager System , in Proc. 1991 IEEE Int . Conf. Acoust., Speech,
Signal Processing, Toronto, Canada, pp . 713-716, May 1991.
PROSODY MODELING FOR AUTOMATIC SPEECH
RECOGNITION AND UNDERSTANDING*
ELIZABETH SHRIBERG† AND ANDREAS STOLCKE†
Abstract. This paper summarizes statistical modeling approaches for the use of
prosody (the rhythm and melody of speech) in automatic recognition and understanding
of speech. We outline effective prosodic feature extraction, model architectures, and
techniques to combine prosodic with lexical (word-based) information. We then survey
a number of applications of the framework, and give results for automatic sentence
segmentation and disfluency detection, topic segmentation, dialog act labeling, and word
recognition.
2.4. Lexical models. Our target classes are typically cued by both
lexical and prosodic information; we are therefore interested in optimal
modeling and combination of both feature types . Although in principle
one could add words directly as input features to a prosodic classifier, in
practice this is often not feasible since it results in too large a feature
space for most classifiers. Approaches for cardinality reduction (such as
inferring word classes via unsupervised clustering [4]) offer promise and are
an area we are interested in investigating. To date, however, we have used
statistical language models (LMs) familiar from speech recognition. One
or more LMs are used to effectively model the joint distribution of target
classes S and words W, P(W, S) . With labeled training data, such models
can usually be estimated in a straightforward manner. During testing on
unlabeled data, we compute P(SIW) to predict the possible classes and
their posterior probabilities, or simply to recover the most likely target
class given the words.
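As a minimal illustration (not the authors' implementation) of how lexical and prosodic information can be combined at a single word boundary, the sketch below multiplies the LM posterior P(S|W) for each event class by a prosodic likelihood and renormalizes; all values are invented.

    def combined_posterior(lm_posterior, prosodic_likelihood):
        scores = {s: lm_posterior[s] * prosodic_likelihood[s] for s in lm_posterior}
        z = sum(scores.values())
        return {s: scores[s] / z for s in scores}

    lm_posterior = {"boundary": 0.3, "no_boundary": 0.7}          # P(S | W) from the LM
    prosodic_likelihood = {"boundary": 0.9, "no_boundary": 0.4}   # from the prosodic classifier

    print(combined_posterior(lm_posterior, prosodic_likelihood))

In the full framework the combination is carried out within an HMM over the entire sequence of boundary events rather than independently per boundary; this per-boundary version only illustrates how the two knowledge sources are multiplied together.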
2By equating the class distributions for classifier training, as advocated above, we
obtain posterior estimates that are proportional to likelihoods, and can therefore be used
directly in the HMM.
TABLE 1
Sentence boundary and disfluency event tagging error rates for the Switchboard
corpus. The higher chance error rate for recognized words is due to incorrect word
boundary hypotheses.
TABLE 2
Sentence boundary tagging error rates for two different speech corpora: Switchboard
(SWB) and Broadcast News (BN).

                     SWB                        BN
Model          True words  Rec. words    True words  Rec. words
LM only            4.3        22.8           4.1        11.8
Prosody only       6.7        22.9           3.6        10.9
Combined           4.0        22.2           3.3        10.8
Chance            11.0        25.8           6.2        13.3
TABLE 3
Topic segmentation weighted error on Broadcast News data. The evaluation metric
used is a weighted combination of false alarm and miss errors [8].
TABLE 4
Dialog act classification error on highly ambiguous DA pairs in the Switchboard
corpus.
We have had some success using the hidden event N-gram model (pre-
viously introduced for sentence segmentation and disfluency detection) for
word recognition [13]. As before, we computed prosodic likelihoods for
each event type at each word boundary, and conditioned the word portion
of the N-gram on those events . The result was a small , but significant 2%
relative reduction in Switchboard word recognition error. This improve-
ment was surprising given that the prosodic model had not been optimized
for word recognition. We expect that more sophisticated and more tightly
integrated prosodic models will ultimately make substantive contributions
to word recognition accuracy.
3.5. Other corpora and tasks. We have recently started applying
the framework described here to new types of data, including multiparty
face-to-face meetings. We have found that speech in multiparty meetings
seems to have properties more similar to Switchboard than to Broadcast
News, with respect to automatic detection of target events [8] . Such data
also offers an opportunity to apply prosody to tasks that have not been
widely studied in a computational framework. One nice example is the
modeling of turn-taking in meetings. In a first venture into this area, we
have found that prosody correlates with the location and form of overlap-
ping speech [8].
We also studied disfluency detection and sentence segmentation in the
meeting domain, and obtained results that are qualitatively similar to those
reported earlier on the Switchboard corpus [1]. A noteworthy result was
that event detection accuracy on recognized words improved slightly when
the models were trained on recognized rather than true words . This indi-
cates that there is systematicity to recognition errors that can be partially
captured in event models.
4. Conclusions. We have briefly summarized a framework for com-
putational prosody modeling for a variety of tasks. The approach is based
on modeling of directly measurable prosodic features and combination with
lexical (statistical language) models . Results show that prosodic informa-
tion can significantly enhance accuracy on several classification and tagging
tasks, including sentence segmentation, disfluency detection, topic segmen-
tation, dialog act tagging, and overlap modeling. Finally, results so far show
that speech recognition accuracy can also benefit from prosody, by con-
straining word hypotheses through a combined prosody/language model.
More information about individual research projects is available
at http://www.speech.sri.com/projects/hidden-events.html,
http://www.speech.sri.com/projects/sleve/, and
http://www.clsp.jhu.edu/ws97/discourse/.
REFERENCES
Abstract. A statistical generative model for the speech process is described that
embeds a substantially richer structure than the HMM currently in predominant use for
automatic speech recognition. This switching dynamic-system model generalizes and
integrates the HMM and the piece-wise stationary nonlinear dynamic system (state-
space) model. Depending on the level and the nature of the switching in the model
design , various key properties of the speech dynamics can be naturally represented in
the model. Such properties include the temporal structure of the speech acoustics, its
causal articulatory movements, and the control of such movements by the multidimen-
sional targets correlated with the phonological (symbolic) units of speech in terms of
overlapping articulatory features.
One main challenge of using this multi-level switching dynamic-system model for suc-
cessful speech recognition is the computationally intractable inference (decoding with
confidence measure) on the posterior probabilities of the hidden states. This leads to
computationally intractable optimal parameter learning (training) also . Several versions
of BayesNets have been devised with detailed dependency implementation specified to
represent the switching dynamic-system model of speech. We discuss the variational
technique developed for general Bayesian networks as an efficient approximate algo-
rithm for the decoding and learning problems. Some common operations of estimating
phonological states' switching times have been shared between the variational technique
and the human auditory function that uses neural transient responses to detect temporal
landmarks associated with phonological features. This suggests that the variational-style
learning may be related to human speech perception under an encoding-decoding the-
ory of speech communication, which highlights the critical roles of modeling articulatory
dynamics for speech recognition and which forms a main motivation for the switching
dynamic system model for speech articulation and acoustics described in this chapter.
today still produce errors in more than one quarter of the words in natu-
ral conversational speech in spite of many hours of speech material used as
training data. The current methodology has been primarily founded on the
principle of statistical "ignorance" modeling. This fundamental philosophy
is unlikely to bridge the performance gap between human and machine
speech recognition. A potentially promising approach is to build into the
statistical speech model most crucial mechanisms in human speech com-
munication for use in machine speech recognition. Since speech recognition
or perception in humans is one integrative component in the entire closed-
loop speech communication chain, the mechanisms to be modeled need to
be sufficiently broad - including mechanisms in both speech production
and auditory perception as well as in their interactions.
Some recent work on speech recognition has been pursued along this
direction [6, 18, 13, 17, 46, 47]. The approaches proposed and described in
[1, 5, 49] have incorporated the mechanisms in the human auditory process
in speech recognizer design. The approaches reported in [18, 21, 19, 44, 3,
54] have advocated the use of the articulatory feature-based phonological
units which control human speech production and are typical of human
lexical representation, breaking away from the prevailing use of the phone-
sized, "beads-on-a-string" linear phonological units in the current speech
recognition technology. The approaches outlined in [35, 12, 11, 14, 13] have
emphasized the functional significance of the abstract, "task" dynamics in
speech production and recognition . The task variables in the task dynamics
are the quantities (such as vocal tract constriction locations and degrees)
that are closely linked to the goal of speech production, and are nonlinearly
related to the physical variables in speech production. Work reported and
surveyed in [10, 15, 38, 47] have also focused on the dynamic aspects in
the speech process, but the dynamic object being modeled is in the space
of speech acoustics, rather than in the space of the production-affiliated
variables.
Although dynamic modeling has been a central focus of much recent
work in speech recognition, the dynamic object being modeled, whether in
the space of "task" variables or of acoustic variables, does not, and potentially
cannot, directly take into account the many important
properties of true articulatory dynamics. Some earlier work used [16, 22]
either quantized articulatory features or articulatory data to design speech
recognizers, employing highly simplistic models for the underlying artic-
ulatory dynamics . Some other earlier proposals and empirical methods
exploited pseudo-articulatory dynamics or abstract hidden dynamics for
the purpose of speech recognition [2, 4, 23, 45], where the dynamics of a
set of pseudo-articulators is realized either by FIR filtering from sequen-
tially placed, phoneme-specific target positions or by applying trajectory-
smoothness constraints. Such approaches relied on a rather simplistic treatment
of the pseudo-articulators. As a result, compensatory articulation,
which is a key property of human speech production and which requires
¹While it is not universally accepted that listeners actually do analysis-by-synthesis
in speech perception, it would be useful to use such a framework to interpret the roles
There are many ways of choosing the static nonlinear function for h[z].
Let us take the example of a multi-layer perceptron (MLP) with three layers
(input, hidden and output). Let w_jl be the MLP weights from input to
hidden units and W_ij be the MLP weights from hidden to output units,
where l is the input node index, j the hidden node index and i the output
node index. Then the output signal at node i can be expressed as a (nonlinear)
function h(·) of all the input nodes (making up the input vector)
according to
(3.3)  h_i(z) = Σ_{j=1}^{J} W_{ij} · s( Σ_{l=1}^{L} w_{jl} z_l ),   1 ≤ i ≤ I,

where I, J and L are the numbers of nodes at the output, hidden and input
layers, respectively. s(·) is the hidden unit's nonlinear activation function,
taken as the standard sigmoid function of

(3.4)  s(z) = 1 / (1 + exp(-z)).
The derivative of this sigmoid function has the following concise form: s'(z) = s(z)(1 - s(z)).
As an example, for the MLP nonlinearity of Eqn. 3.3, the (i, l)-th element
of the Jacobian matrix is

(3.8)  Σ_{j=1}^{J} W_{ij} · s_j(y) · (1 - s_j(y)) · w_{jl},   1 ≤ i ≤ I, 1 ≤ l ≤ L,
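A minimal NumPy sketch of the MLP output of Eqn. 3.3 and the Jacobian elements of Eqn. 3.8 follows; the weight matrices and input are random stand-ins, and the analytic Jacobian is checked against a finite-difference approximation.

    import numpy as np

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))              # Eqn. 3.4

    def mlp_output(z, W_hidden, W_out):
        """h_i(z) = sum_j W_ij s(sum_l w_jl z_l), cf. Eqn. 3.3."""
        return W_out @ sigmoid(W_hidden @ z)

    def mlp_jacobian(z, W_hidden, W_out):
        """(i, l)-th element: sum_j W_ij s_j (1 - s_j) w_jl, cf. Eqn. 3.8."""
        s = sigmoid(W_hidden @ z)                    # hidden activations s_j(y)
        return W_out @ (np.diag(s * (1.0 - s)) @ W_hidden)

    rng = np.random.default_rng(0)
    L_in, J, I = 4, 6, 3                             # input, hidden, output layer sizes
    W_hidden, W_out = rng.normal(size=(J, L_in)), rng.normal(size=(I, J))
    z = rng.normal(size=L_in)

    # Check the analytic Jacobian against central finite differences.
    eps, numeric = 1e-6, np.zeros((I, L_in))
    for l in range(L_in):
        dz = np.zeros(L_in); dz[l] = eps
        numeric[:, l] = (mlp_output(z + dz, W_hidden, W_out) -
                         mlp_output(z - dz, W_hidden, W_out)) / (2 * eps)
    print(np.allclose(mlp_jacobian(z, W_hidden, W_out), numeric, atol=1e-6))   # True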
3.5. Switching state space model. Eqns. 3.1 and 3.2 form a spe-
cial version of the switching state-space model appropriate for describing
multi-level speech dynamics. The top-level dynamics occurs at the discrete-
state phonology, represented by the state transitions of S with a relatively
long time scale. The next level is the target (t) dynamics ; it has the same
time scale and provides systematic randomness at the segmental level. At
the level of articulatory dynamics, the time scale is significantly shortened.
This is continuous-state dynamics driven by the target process as input,
which follows HMM statistics. The state equation 3.1 explicitly describes
this dynamics in z, with index of S (which takes discrete values) implic-
itly representing the switching process. At the lowest level of acoustic
dynamics, there is no switching process. Since the observation equation
3.2 is static, this simplifying speech model assumes that acoustic dynamics
results solely from articulatory dynamics.
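Eqns. 3.1 and 3.2 themselves are not reproduced in this excerpt; as a rough simulation sketch only, assume a target-directed state equation of the form z(k+1) = A_s z(k) + (I - A_s) t_s + w(k) and a static observation equation o(k) = h(z(k)) + v(k), with the phonological state s switching the pair (A_s, t_s). All parameter values below are invented.

    import numpy as np

    rng = np.random.default_rng(1)
    dim = 2
    A = {0: 0.9 * np.eye(dim), 1: 0.8 * np.eye(dim)}        # per-state articulatory dynamics
    targets = {0: np.array([1.0, -0.5]), 1: np.array([-0.8, 0.7])}
    h = np.tanh                                             # stand-in static nonlinearity

    state_seq = [0] * 30 + [1] * 30                         # slow phonological switching
    z, observations = np.zeros(dim), []
    for s_k in state_seq:
        z = A[s_k] @ z + (np.eye(dim) - A[s_k]) @ targets[s_k] + 0.01 * rng.normal(size=dim)
        observations.append(h(z) + 0.01 * rng.normal(size=dim))

    # z drifts toward the target of the active phonological state, and the
    # acoustics follow z through the static mapping h.
    print(np.round(observations[29], 3), np.round(observations[-1], 3))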
4. BayesNet representation of the segmental switching dy-
namic speech model. Developed traditionally by machine-learning re-
searchers, BayesNets have found many useful applications. A BayesNet
is a graphical model that describes dependencies in the probability dis-
tributions defined over a set of variables. A most interesting class of the
BayesNet, as relevant to speech modeling, is dynamic BayesNets that are
specifically aimed at modeling time series statistics. For time series data
such as speech vector sequences, there are causal dependencies between
random variables in time . The causal dependencies give some specific,
left-to-right BayesNet structures. Such specific structures either permit
development of highly efficient algorithms (e.g., for the HMM) for the prob-
abilistic inference (i.e., computation of conditional probabilities for hidden
variables) and for learning (i.e., model parameter estimation), or enable
the use of approximate techniques (such as variational techniques) to solve
the inference and learning problems .
The HMM and the stationary (i.e., no switching) dynamic system
model are two of the simplest examples of a dynamic BayesNet, for
which the efficient algorithms developed already in statistics and in speech
processing [51, 38, 20] turn out to be identical to those based on the more
FIG. 2. Dynamic BayesNet for a basic version of the switching dynamic system
model of speech. The random variables on Row 1 are discrete, hidden linguistic states
with the Markov-chain temporal structure. Those on Row 2 are continuous, hidden
articulatory targets as ideal articulation. Those on Row 3 are continuous, hidden states
representing physical articulation with the Markov temporal structure also. Those on
Row 4 are continuous, observed acoustic/auditory vectors.
(4.1)
The vertical (level)2 dependency for the target random variables is specified
by the following conditional density function:
(4.2)
Eqns. 4.1, 4.2, 4.4, and 4.5 then completely specify the switching dynamic
model in Fig. 2 since they define all possible dependencies in its BayesNet
representation. Note that while the phonological state Sk and its associ-
ated target t(k) in principle are at a different time scale than the phonetic
variables z(k) and o(k), for simplicity purposes and as one possible imple-
mentation, Eqns. 4.1-4.5 have placed them at the same time scale.
Note also that in Eqn. 4.5 the "forward" conditional probability for
the observation vector (when the corresponding articulatory vector z(k) is
known) is Gaussian, as is the measurement noise vector's distribution. The
FIG. 3. Dynamic BayesNet for an expanded version of the switching dynamic system
model of speech. Parallel streams of the overlapping phonological features and their
associated articulatory dimensions are explicitly represented. The articulators from the
parallel streams are ultimately combined to jointly determine the acoustic vectors.
This adds the new dependency of the random vector t(k) on S_{k-1} and t(k-1),
in addition to the existing S_k as in Fig. 2. The modified BayesNet
incorporating this new dependency is shown in Fig. 4.
FIG. 4. Dynamic BayesNet for the switching dynamic system model of speech
incorporating the segmental constraint for the target random variables.
REFERENCES
[1] J . ALLEN. "How do humans process and recognize speech," IEEE Trans . Speech
Audio Proc., Vol. 2, 1994, pp . 567-577.
[2] R. BAKIS. "Coarticulation modeling with continuous-state HMMs," Proc. IEEE
Workshop Automatic Speech Recognition, Harriman, New York, 1991, pp .
20-21.
[3] N. BITAR AND C. ESPY-WILSON. "Speech parameterization based on phonetic fea-
tures: Application to speech recognition," Proc. Eurospeech, Vol. 2 , 1995, pp.
1411-1414.
[4] C. BLACKBURN AND S. YOUNG . "Towards improved speech recognition using a
speech production model ," Proc. Eurospeech ; Vol. 2 , 1995, pp . 1623-1626.
[5] H. BOURLARD AND S. DUPONT. "A new ASR approach based on independent pro-
cessing and recombination of partial frequency bands," Proc. ICSLP, 1996,
pp . 426-429.
[6] H. BOURLARD , H. HERMANSKY, AND N. MORGAN . "Towards increasing speech
recognition error rates," Speech Communication, Vol. 18 , 1996, pp . 205-231.
[7] C. BROWMAN AND L. GOLDSTEIN. "Articulatory phonology: An overview," Pho-
netica, Vol. 49 , pp . 155-180, 1992.
[8] N. CHOMSKY AND M. HALLE. The Sound Pattern of English, New York: Harper
and Row, 1968.
[9] N. CLEMENTS. "The geometry of phonological features," Phonology Yearbook, Vol.
2, 1985, pp . 225-252 .
[10] L. DENG . "A generalized hidden Markov model with state-conditioned trend func-
tions of time for th e speech signal," Signal Processing, Vol. 27, 1992, pp . 65-78 .
[11] L. DENG . "A computational model of the phonology-phonetics interface for auto-
matic speech recognition," Summary Report, SLS-LCS , Massachusetts Insti-
tute of Technology, 1992-1993 .
[12] L. DENG . "Design of a feature-based speech recognizer aiming at integration of
auditory processing, signal modeling, and phonological structure of speech."
J. Acoust . Soc. Am., Vol. 93, 1993, pp . 2318.
[13] L. DENG . "Computational models for speech production," in Computational Mod-
els of Speech Pattern Processing (NATO ASI), Springer-Verlag, 1999, pp .
67-77.
[14] L. DENG. "A dynamic, feature-based approach to the interface between phonology
and phonetics for speech modeling and recognition," Speech Communication ,
Vol. 24 , No.4, 1998, pp. 299-323.
[15] L. DENG, M. AKSMANOVIC, D. SUN, AND J. WU. "Speech recognition using hidden
Markov models with polynomial regression functions as nonstationary states,"
IEEE Trans. Speech Audio Proc., Vol. 2, 1994, pp . 507-520.
[16] L. DENG AND K. ERLER. "Structural design of a hidden Markov model based
speech recognizer using multi-valued phonetic features: Comparison with seg-
mental speech units," J. Acoust. Soc. Am. , Vol. 92 , 1992, pp . 3058-3067.
[17] L. DENG AND Z. MA. "Spont aneous speech recognition using a statistical coarticu-
latory model for th e hidden vocal-tract-resonance dynamics," J. Acoust. Soc.
Am., Vol. 108, No.6, 2000, pp . 3036-3048.
[18] L. DENG, G. RAMSAY, AND D. SUN . "Production models as a structural basis for
automatic speech recognition," Speech Communication, Vol. 22 , No.2, 1997,
pp. 93-111.
[19] L. DENG AND H. SAMETI. "Transitional speech units and their representation by the
regressive Markov states: Applications to speech recognition," IEEE Trans .
Speech Audio Proc., Vol. 4, No.4, July 1996, pp . 301-306 .
[20] L. DENG AND X. SHEN. "Maximum likelihood in statistical estimation of dynamic
systems: Decomposition algorithm and simulation results" , Signal Processing,
Vol. 57, 1997, pp . 65-79 .
[21] L. DENG AND D. SUN . "A statistical approach to automatic speech recognition
using the atomic speech units constructed from overlapping articulatory fea-
tures," J. Acoust . Soc. Am ., Vol. 95, 1994, pp . 2702-2719.
[22] J . FRANKEL AND S. KING . "ASR - Articulatory speech recognition" , Proc. Eu-
rospeech, Vol. 1, 2001, pp . 599-602 .
Abstract. The motivation underlying the development of segmental hidden Markov
models (SHMMs) is to overcome important speech-modeling limitations of conventional
HMMs by representing sequences (or 'segments') of features and incorporating the concept
of a trajectory to describe how features change over time. This paper presents an
overview of investigations that have been carried out into the properties and recognition
performance of various SHMMs, highlighting some of the issues that have been identified
in using these models successfully. Recognition results are presented showing that the
best recognition performance was obtained when combining a trajectory model with a
formant representation, in comparison both with a conventional cepstrum-based HMM
system and with systems that incorporated either of the developments individually.
An attractive characteristic of a formant-based trajectory model is that it applies
easily to speech synthesis as well as to speech recognition, and thus can provide the
basis for a 'unified' approach to both recognition and synthesis and to speech modeling
in general. One practical application is in very low bit-rate speech coding, for which
a formant trajectory description provides a compact means of coding an utterance. A
demonstration system has been developed that typically codes speech at 600-1000 bits/s
with good intelligibility, whilst preserving speaker characteristics.
Key words. Segmental HMM, dynamics, trajectory, formant, unified model, low
bit-rate speech coding.
"20/ 20 Speech Lim ited , Malvern Hills Science Park , Gera ldine Road , Malvern ,
Wores., WR1 4 3SZ, UK.
the vocabulary size is much smaller. For example, on the CallHome cor-
pus of telephone conversations between family members, typical word error
rates exceed 30% [7] . Although there are many aspects to the problems
associated with this type of recognition task, the extent of the drop in per-
formance when moving beyond constrained domains suggests that there
may be inherent deficiencies in the acoustic modeling paradigm.
HMMs provide a framework which is broadly appropriate for mod-
eling speech patterns, accommodating both variability in timescale and
short-term spectral variability. However, these models are simply general
statistical pattern matchers: they do not take advantage of the constraints
inherent in the speech production process, and they make certain assumptions
that conflict with what is known about the nature of speech production
and its relationship with acoustic realization. In particular, the follow-
ing three assumptions which are made by the HMM formalism are clearly
inappropriate for modeling speech patterns:
• Piece-wise stationarity. It is assumed that a speech pattern is
produced by a piece-wise stationary process, with instantaneous
transitions between stationary states.
• The independence assumption. The probability of a given acoustic
vector corresponding to a given state depends only on the vec-
tor and the state, and is otherwise independent of the sequence
of acoustic vectors preceding and following the current vector and
state. The model therefore takes no account of the dynamic con-
straints of the physical system which has generated a particular
sequence of acoustic data, except inasmuch as these can be incor-
porated in the feature vector associated with a state. In a typical
speaker-independent HMM recognizer where each modeling unit is
represented by a multi-modal Gaussian distribution to include all
speakers, the model in effect treats each frame of data as if it may
have been spoken by a different speaker.
• State duration distribution. A consequence of the independence
assumption is that the probability of a model staying in the same
state for several frames is determined only by the 'self-loop' transition
probability. Thus the state duration in an HMM conforms to
a geometric probability distribution which assigns maximum probability
to a duration of one frame and exponentially decreasing
probabilities to longer durations (see the short illustration after this list).
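For instance, with an illustrative self-loop probability a = 0.9, the implied duration distribution P(d) = a^(d-1)(1 - a) is already maximal at a single frame:

    a = 0.9                                       # illustrative self-loop probability
    duration_prob = [a ** (d - 1) * (1 - a) for d in range(1, 6)]
    print([round(p, 3) for p in duration_prob])   # [0.1, 0.09, 0.081, 0.073, 0.066]
    print(1 / (1 - a))                            # mean duration: 10 frames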
The impact of these inappropriate assumptions can be reduced by, for ex-
ample, using a generous allocation of states which allows a sequence of
piece-wise stationary segments to better approximate the dynamics and
also makes a duration of one frame per state more appropriate. A popu-
lar way of mitigating the effects of the independence assumption is to use
an acoustic feature vector which includes information over a time span of
several frames . This is most usually achieved by including the first and
sometimes also the second time-derivative of the original static features.
However, although such techniques have been shown to be of practical
benefit in improving recognition performance, the independence assumption
actually becomes even less valid because the observed data for any one
frame are used to contribute to a time span of several feature vectors.
Rather than trying to modify the data to fit the model, it should be
better to adopt a model that is more appropriate for the characteristics
of speech signals. The inappropriate assumptions of HMMs are linked
with their frame-synchronous characteristic, whereby states are associated
with single acoustic feature vectors. In order to improve the underlying
model, it is necessary somehow to incorporate the concept of modeling
frame sequences rather than individual frames, with the aim of providing
an accurate model of speech dynamics across an utterance in a way which
takes into account predictable factors such as speaker continuity. At the
same time, it is important to retain the many desirable attributes of the
HMM framework, such as rigorous data-driven training, optimal decoding
and delayed decision making. The desire to overcome the above limitations
of HMMs by modeling frame sequences has motivated the development of a
variety of extensions, modifications and alternatives to HMMs (see [20] for
a review). Models of this type are usually referred to as "segment models"
[20], or as "segmental HMMs" [16].
In [14] it was shown that the integral in the above expression can be
evaluated to give a tractable expression that does not depend on the values
of m or c, thus:

(3.3)
(3.6)
The model means for the standard HMMs were initialized based on
a small quantity of hand-annotated training data, and all model variances
were initialized to the same arbitrary value. The model means and vari-
ances were then trained with five iterations of Baum-Welch re-estimation.
After five training iterations, these standard HMMs gave a word error rate
of 8.6% on the test data. These models provided the starting point for train-
ing different sets of segmental HMMs, all of which were trained with five
iterations of the appropriate Baum-Welch type re-estimation procedure.
For performance comparisons, the standard HMMs were also subjected to
a further five iterations of re-estimation.
3.2.3. Segmental HMM training. The simplest segmental models
were those represented in Equation (3.4), which used the standard-HMM
probability calculation but incorporated the constraint on maximum du-
ration. These simple static segmental HMMs and the linear FTSHMMs
(shown in Equation (3.5)) were both initialized directly from the trained
standard-HMM parameters, with the intra-segment variances being copied
from the standard-HMM variances and the extra-segment variances being
set to zero. The slope means of the FTSHMMs were initialized to zero, so
that the starting point was the same as for the simplest segmental HMMs,
with the difference being that when training the FTSHMMs the slope mean
was allowed to deviate away from its initial value.
Initial estimates for the different sets of PTSHMMs were computed
by first using the trained standard HMMs to segment the complete set
of training data, and using the statistics of these data to estimate the
various parameters of each set of PTSHMMs according to the relevant
modeling assumptions. In all cases the means and variances of the mid-
points were initialized from the distributions of the sample means for the
individual segments . The initialization strategies that were used for the
other parameters were different for the different types of PTSHMM and are
summarized in Table 1. Each set of segmental models was trained with five
iterations of the appropriate Baum-Welch-type re-estimation procedure
[14] [16] .
3.2.4. Results. The connected-digit recognition results are shown in
Table 2 for the different sets of segmental models, compared with the base-
line HMMs after both five and 10 training iterations. The main findings
are summarized in the following paragraphs.
The simple segmental HMMs with a maximum segment duration of 10
frames gave an error rate of 6.6%, which is lower than that of the conven-
tional HMMs even when further training had been applied (8.4%). Thus,
in these experiments there were considerable advantages in constraining
the maximum segment duration, which acts to prevent unrealistically long
occupancies for the speech model states.
The lowest word error rate achieved with the static PTSHMMs was
7.5%, which is not quite as good as the 6.6% obtained with the simplest
TABLE 1
Initialization strategy for various model parameters of different sets of PTSHMMs.

  static PTSHMM (Eq. (3.3)): slope mean and variance both fixed at 0; variance of observations about the segment mean.
  linear PTSHMM, flexible slope (Eq. (3.1)): mean and variance of the slopes of best-fit linear trajectories; variance of observations about the best-fit linear trajectories.
  linear PTSHMM, constrained slope (Eq. (3.6)): slope mean and variance both set to 0, but the mean can vary in training; variance of observations about the segment mean (i.e. a line with zero slope).
TABLE 2
Connected-digit recognition results for different sets of segmental HMMs compared with conventional HMMs.
speech database of American English [10], for which all the utterances have been phonetically transcribed, segmented and labeled. TIMIT was designed to provide broad phonetic coverage, and is therefore particularly appropriate for comparing approaches to acoustic-phonetic modeling.
When classifying the data segments, all phones were treated as equally likely (no language model was used to constrain the allowed phone sequences). This approach was considered appropriate for investigating improved acoustic-phonetic modeling, but it does make the task very difficult. As with the digit experiments, the emphasis here is on the relative performance of the different segmental HMMs.
3.3.1. Speech data and model sets. The experiments reported here used the TIMIT designated training set and the core test set, using data only from the male speakers. The available data had been analyzed by applying a 20 ms Hamming window to the 16 kHz-sampled speech at a rate of 100 frames/s and computing a fast Fourier transform. The output had been converted to a mel scale with 20 channels, and a cosine transform had been applied. The first 12 cosine coefficients together with an average amplitude feature formed the basic feature set, but some experiments also included time-derivative features computed for each frame by applying linear regression over a five-frame window centred on the current frame.
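A rough sketch of this analysis front end is given below (Python/NumPy). The window length, frame rate, channel count, number of cepstral coefficients and the five-frame regression window follow the description above; the particular mel filterbank design, the handling of c0, and the use of log frame energy as the "average amplitude" feature are assumptions of the sketch rather than details given in the chapter.

    import numpy as np

    def mel_filterbank(n_chan, n_fft, fs):
        # Triangular filters equally spaced on a mel scale (a common design,
        # assumed here; the chapter does not specify the filter shapes).
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = inv(np.linspace(mel(0.0), mel(fs / 2.0), n_chan + 2))
        bins = np.floor((n_fft + 1) * edges / fs).astype(int)
        fb = np.zeros((n_chan, n_fft // 2 + 1))
        for m in range(1, n_chan + 1):
            lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
            fb[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
            fb[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
        return fb

    def analysis(speech, fs=16000, n_cep=12, n_chan=20, n_fft=512):
        frame_len = int(0.020 * fs)    # 20 ms Hamming window
        hop = fs // 100                # 100 frames/s
        win = np.hamming(frame_len)
        fb = mel_filterbank(n_chan, n_fft, fs)
        # DCT-II basis, keeping cosine coefficients 1..n_cep; an amplitude
        # feature (log frame energy) stands in for the "average amplitude".
        k = np.arange(1, n_cep + 1)[:, None]
        n = np.arange(n_chan)[None, :]
        dct = np.cos(np.pi * k * (2 * n + 1) / (2.0 * n_chan))
        feats = []
        for s in range(0, len(speech) - frame_len + 1, hop):
            frame = speech[s:s + frame_len] * win
            power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
            logmel = np.log(fb @ power + 1e-10)
            amp = np.log(np.sum(frame ** 2) + 1e-10)
            feats.append(np.concatenate(([amp], dct @ logmel)))
        return np.array(feats)

    def deltas(feats, half_win=2):
        # Time derivatives by linear regression over a five-frame window.
        pad = np.pad(feats, ((half_win, half_win), (0, 0)), mode="edge")
        num = sum(k * (pad[half_win + k:len(feats) + half_win + k]
                       - pad[half_win - k:len(feats) + half_win - k])
                  for k in range(1, half_win + 1))
        return num / (2.0 * sum(k * k for k in range(1, half_win + 1)))

    # Example on one second of synthetic audio.
    base = analysis(np.random.default_rng(0).normal(size=16000))
    full = np.hstack([base, deltas(base)])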
The inventory of model units was defined as the set of 61 symbols which are used in the time-aligned phonetic transcriptions provided with TIMIT. However, the two different silence symbols used in these transcriptions were represented by a single silence model, to give 60 model units. In common with most other work using TIMIT, when scoring recognition output the 60-symbol set was reduced to the 39-category scoring set given in [18]. Experiments were carried out with context-independent (monophone) models, and also with right-context-dependent biphones, which depend on only the immediately-following phoneme context.
The basic model structure for both the conventional and the segmental HMMs was the same as the one used for the digit experiments, with three states per speech model and single-state non-speech models. However, this structure imposes a minimum duration of three frames for every speech unit, whereas some of the labeled phone segments are shorter than three frames. In order to accommodate these very short utterances, the structure of all the speech models was extended to allow transitions from the initial state to all emitting states with a low probability.
3.3.2. Training procedure. First, a set of standard-HMM monophone models was initialized and trained with five iterations of Baum-Welch re-estimation. These models were then used to initialize different sets of monophone segmental HMMs, using the same approach as the one adopted in the digit-recognition experiments. The discussion here will focus on comparing the performance of linear FTSHMMs and constrained-slope linear PTSHMMs with that of simple segmental HMMs (models us-
The relative performance of the linear PTSHMMs and the simple seg-
mental HMMs was analyzed as a function of phone class, in order to de-
termine whether the segmental model of dynamics was more beneficial for
some types of sound than others. The results of this analysis are shown in
Table 4 for a selection of sound classes. The linear PTSHMMs improved
performance for all the phone classes, but were most beneficial for the diph-
thongs and the semivowels and glides. These sounds are characterized by
highly dynamic continuous changes, for which the trajectory model should
be particularly advantageous. The performance gain was smallest for the
stops, which have rather different characteristics involving abrupt changes
between relatively steady-state regions. Some model of dynamics across
segment boundaries may be the best way to represent these changes.
3.4. Discussion. The experiments described above have shown some
improvements in recognition performance through modeling trajectories of
mel-cepstrum features, although for best performance it was necessary to
also include time-derivatives in the feature set. These experiments all used
a fixed model structure with three segments per phone. There are various
aspects of the acoustic modeling which could be developed to take greater
account of the characteristics of speech patterns. For example , rather than
using three segments to model every phone, it would seem more appropriate
to represent each phone with the minimum number of segments to describe
typical trajectories in order to maximize the benefit from the constraints
TABLE 4
Classification performance of linear PTSHMMs relative to that of simple segmental HMMs, shown for different phone classes (using biphone models with mel-cepstrum features).
FIG. 1. Spectrogram of an utterance of the words "nine one two", with superimposed formant tracks showing alternative formant allocations offered by the analyzer for F1, F2 and F3. Tracks are not plotted when there is no confidence in their accuracy.
1 These results are not directly comparable with the earlier results that were shown in Table 2 for a number of reasons, including the use of a different set of features and also some other minor differences in the experimental framework.
TABLE 5
Connected-digit recognition results for different sets of standard and segmental HMMs. Word error rates for a feature set comprising the first eight MFCCs (and an overall energy feature) are compared with those for a feature set in which the three highest MFCCs are replaced by three formant features.
[Figure: block diagram of a speech coding scheme based on recognition and synthesis. A message is recognized in terms of production mechanisms; at intermediate levels, detailed time-evolving production parameters for the utterance are coded at high quality and a relatively high bit-rate; at the surface level, frame-by-frame synthesizer control parameters are transmitted and used to produce the resynthesized speech waveform.]
Abstract. Most speech recognizers use an observation space which is based on a temporal sequence of spectral "frames." There is another class of recognizer which further processes these frames to produce a segment-based network, and represents each segment by a fixed-dimensional "feature." In such feature-based recognizers the observation space takes the form of a temporal graph of feature vectors, so that any single segmentation of an utterance will use a subset of all possible feature vectors. In this work we describe a maximum a posteriori decoding strategy for feature-based recognizers and derive two normalization criteria useful for a segment-based Viterbi or A* search. We show how a segment-based recognizer is able to obtain good results on the tasks of phonetic and word recognition.
*MIT Laboratory for Computer Science, 200 Technology Square, Cambridge, MA 02139, USA (glass@mit.edu).
[Figure 1: part of a phonetic segment network, showing for one hypothesized segment the anti-phone-normalized log likelihood ratio scores of candidate phone labels (e.g., iy, ey, ih, n); only the best-scoring hypotheses have positive scores.]
which could be called feature-based. Examples in this area include the FEATURE system [4], the SUMMIT system described in this paper, as well as the LAFF system [31].
The SUMMIT segment-based system developed in our group uses a segment- and landmark-based framework to represent the speech signal. Acoustic or probabilistic landmarks form the basis for a phonetic network, or graph, as shown in Figure 1. Acoustic features are extracted over hypothesized phonetic segments and at important acoustic landmarks. Gaussian mixture classifiers are then used to provide phonetic-level probabilities for both segmental and landmark features. Words are represented as pronunciation graphs whereby phonemic baseforms are expanded by a set of phonological rules [14]. Phone probabilities are determined from training data. A probabilistic MAP decoding framework is used. A modified Viterbi beam search finds the best path through both acoustic-phonetic and pronunciation graphs while incorporating language constraints. All graph representations are based on weighted finite-state transducers [8, 21]. One of the properties of the segment-based framework is that, as will be described later, the models must incorporate both positive and negative examples of lexical units. Finally, a secondary A* search can provide word-graph or N-best sentence outputs for further processing.
FIG. 2. Two segmentations (in solid lines) through a simple five-segment graph with acoustic observations {a1, ..., a5}. The top segmentation is associated with observations {a1, a3, a5}, while the bottom segmentation is associated with {a1, a2, a4, a5}.
the ASR acoustic modelling component, and is the subject of this paper. For simplicity we will assume that acoustic likelihoods are conditionally independent of W, so that P(A|S, U, W) = P(A|S, U). This assumption is also standard in conventional HMM-based systems.
In conventional speech recognizers, the acoustic observation space, A, corresponds to a temporal sequence of acoustic frames (e.g., spectral slices). Each hypothesized segment, s_i, is represented by the series of frames computed between segment start and end times. Thus, the acoustic likelihood P(A|S, U) is derived from the same observation space for all word hypotheses. In feature-based recognition, as illustrated in Figure 2, each segment, s_i, is represented by a single feature vector, a_i. Given a particular segmentation, S, (a contiguous sequence of segments spanning the entire utterance), A consists of X, the feature vectors associated with the segments in S, as well as Y, the feature vectors associated with segments not in S, such that A = X ∪ Y and X ∩ Y = ∅. In order to compare different segmentations it is necessary to predict the likelihood of both X and Y, since P(A|S, U) = P(X, Y|S, U). Thus, in addition to the observations X associated with the segments in S, we must consider all other possible observations in the space Y, corresponding to the set of all other possible segments, R. In the top path in Figure 2, X = {a1, a3, a5} and Y = {a2, a4}. In the bottom path, X = {a1, a2, a4, a5} and Y = {a3}. The total observation space A contains both X and Y, so for MAP decoding it is necessary to estimate P(X, Y|S, U). Note that since X implies S we can say P(X, Y|S, U) = P(X, Y|U). The following sections describe two methods we have developed to account for the entire observation space in our segment-based recognizer.
(3.5)   P(X, Y|U) = P(X|U) P(Y|ᾱ) · P(X|ᾱ)/P(X|ᾱ) ∝ P(X|U)/P(X|ᾱ)

(3.6)   P(X|U)/P(X|ᾱ) = ∏_i P(x_i|u_i)/P(x_i|ᾱ),

where ᾱ denotes the anti-phone model, u_i is the linguistic unit assigned to segment s_i, and x_i is its associated feature vector observation.
The advantage of the anti-phone normalization framework for decoding is that it models the entire observation space, using both positive and negative examples. Log likelihood scores are normalized by the anti-phone so that good scores are all positive, while bad scores will be negative. In Figure 1, for example, only the top two log likelihood ratios for the /iy/ segment were positive. Invalid segments, which do not correspond to a valid phone, are likely to have negative scores for all phonetic hypotheses. Note that the anti-phone is not used for lexical access, but purely for normalization purposes. In general, it is useful for pattern matching problems with graph-based observation spaces.
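As a concrete illustration of this normalization, the sketch below (Python; segment names, units and scores are hypothetical, not taken from the chapter) scores two competing segmentations by summing per-segment log likelihood ratios against the anti-phone, as in (3.6); under this scoring, observations off the hypothesized segmentation need not be evaluated explicitly, since they cancel through the anti-phone term.

    def antiphone_normalized_score(segmentation, phone_logprob, antiphone_logprob):
        """Score a segmentation as the sum over its segments of
        log P(x_i | u_i) - log P(x_i | anti-phone), i.e. the log of Eq. (3.6).
        `segmentation` is a list of (segment, unit) pairs; the dictionaries hold
        hypothetical per-segment log probabilities."""
        return sum(phone_logprob[(seg, unit)] - antiphone_logprob[seg]
                   for seg, unit in segmentation)

    # Hypothetical scores for the two segmentations of Figure 2.
    phone_logprob = {("a1", "u1"): -10.0, ("a3", "u2"): -8.0, ("a5", "u3"): -9.5,
                     ("a2", "u4"): -12.5, ("a4", "u5"): -15.0}
    antiphone_logprob = {"a1": -11.0, "a2": -11.5, "a3": -9.5, "a4": -14.0, "a5": -10.0}

    top = [("a1", "u1"), ("a3", "u2"), ("a5", "u3")]            # X = {a1, a3, a5}
    bottom = [("a1", "u1"), ("a2", "u4"), ("a4", "u5"), ("a5", "u3")]
    print(antiphone_normalized_score(top, phone_logprob, antiphone_logprob))
    print(antiphone_normalized_score(bottom, phone_logprob, antiphone_logprob))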
3.2. Near-miss modelling. Anti-phone modelling divides the observation space into essentially two parts: subsets of segments which are either on or off a hypothesized segmentation S. The anti-phone model is quite general since it must model all examples of observations which are not phones. A larger inventory of anti-phones might better model segments which are near misses of particular phones. One such method, called near-miss modelling, was developed by Chang [1, 2]. Near-miss modelling partitions the observation space into a set of near-miss subsets, where there is one near-miss subset, A_i, associated with each segment, s_i, in the segment network (although A_i could be empty). During recognition, observations in a near-miss subset are mapped to the near-miss model of the hypothesized phone. The net result can be represented as:
(3.7)   W* = arg max_{S,U,W}  [ ∏_{i=1}^{n} P(x_i|u_i) P(A_i|u_i) P(s_i|u_i) ]  P(U|W) P(W)
Method                                        % Error
Triphone CDHMM [16]                           27.1
Recurrent Neural Network [27]                 26.1
Bayesian Triphone HMM [20]                    25.6
Anti-Phone, Heterogeneous classifiers [12]    24.4

Method              % Error
Segment models      9.6
Landmark models     7.6
Combined            6.1
REFERENCES
[23] M. Ostendorf and S. Roucos. A stochastic segment model for phoneme-based continuous speech recognition. IEEE Trans. ASSP, 37(12):1857-1869, December 1989.
[24] K. Ponting and S. Peeling. The use of variable frame rate analysis in speech recognition. Computer Speech and Language, 5:169-179, 1991.
[25] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989.
[26] M. Riley and A. Ljolje. Lexical access with a statistically-derived phonetic network. In Proc. Eurospeech, pages 585-588, Genoa, Italy, September 1991.
[27] A. Robinson. An application of recurrent nets to phone probability estimation. IEEE Trans. Neural Networks, 5(2):298-305, March 1994.
[28] J. Rohlicek, W. Russell, S. Roucos, and H. Gish. Continuous hidden Markov modelling for speaker-independent word spotting. In Proc. ICASSP, pages 627-630, Glasgow, Scotland, May 1989.
[29] R. Rose and D. Paul. A hidden Markov model based keyword recognition system. In Proc. ICASSP, pages 129-132, Albuquerque, NM, April 1990.
[30] M. Russell. A segmental HMM for speech pattern modelling. In Proc. ICASSP, pages 499-502, Minneapolis, MN, 1993.
[31] K. Stevens. Lexical access from features. In Workshop on speech technology for man-machine interaction, Bombay, India, 1990.
[32] N. Strom, L. Hetherington, T. Hazen, E. Sandness, and J. Glass. Acoustic modelling improvements in a segment-based speech recognizer. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pages 139-142, Keystone, CO, 1999.
[33] J. Wilpon, L. Rabiner, C.H. Lee, and E. Goldman. Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. ASSP, 38(11):1870-1878, November 1990.
[34] V. Zue, S. Seneff, J. Glass, J. Polifroni, C. Pao, T. Hazen, and L. Hetherington. Jupiter: A telephone-based conversational interface for weather information. IEEE Trans. Speech and Audio Proc., 8(1):85-96, January 2000.
TOWARDS ROBUST AND ADAPTIVE SPEECH RECOGNITION MODELS
HERVE BOURLARD*, SAMY BENGIO†, AND KATRIN WEBER*‡
Key words. Robust speech recognition, hidden Markov models, subband processing, multi-stream processing.
*Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP), and Swiss Federal Institute of Technology at Lausanne (EPFL), 4, Rue du Simplon, CH-1920 Martigny, Switzerland (bourlard@idiap.ch).
†Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP), 4, Rue du Simplon, CH-1920 Martigny, Switzerland (bengio@idiap.ch).
‡weber@idiap.ch.
1 This problem is not specific to the fact that phone models are generally used. Whole
word models , or syllable models, built up as sequences of HMM states will suffer from
exactly the same drawbacks, the only potential advantage of moving towards "larger"
units being that one can then have more word (or syllable) specific distributions (usually
resulting in more parameters and an increased risk of undersampled training data) .
Consequently, building an ASR system simply based on syllabic HMMs will not alleviate
the limitations of the current recognizers since those models will still be based on the
short-term piecewise stationary assumptions mentioned above.
(2.1)   e(q_j|x^1, x^2) = e(q_j|x^1) · e(q_j|x^2)
Although this conclusion is often questioned by the scientific community4, it is probably not worth arguing too long about it since it is pretty clear that (2.1) is anyway the optimal rule to obtain the best performance out of a possibly noisy multi-stream system (but it requires perfect knowledge about which stream, if any, is noisy). Moreover, a similar rule can usually explain some of the empirical observations in audio-visual processing (see, e.g., [24] and [20]).
Although pretty simple, rule (2.1) is not always easy to interpret (and even less so for engineers!). So let us have a closer look at it. Since the probability of being correct whenever we assign a particular observation x to a class q is equal to the a posteriori probability P(q|x) (i.e., the probability of error is equal to 1 - P(q|x), see [8], page 12)5, rule (2.1) can also be written as:
(2.2)   e(q_j|x^1, x^2) = 1 - Σ_{k=1}^{2} P(q_j|x^k) + ∏_{k=1}^{2} P(q_j|x^k),

where P(q_j|x^k) denotes the class posterior probability obtained for the k-th input stream. Rewriting (2.2) in terms of the (total) correct recognition probability (P(q_j|x^1, x^2) = 1 - e(q_j|x^1, x^2)), we have:

(2.3)   P(q_j|x^1, x^2) = Σ_{k=1}^{2} P(q_j|x^k) - ∏_{k=1}^{2} P(q_j|x^k).
3 Since we decided not to deal with the temporal constraints, this notation is over-simplified. In the case of temporal sequences, x^1 and x^2 will be sequences (possibly of different lengths and different rates) of features, and q_j will be a sequence of HMM states.
4 Since the relevant Fletcher experiments were done (i) with nonsense syllables only, and (ii) using high-pass or low-pass filters (i.e., two streams) only, it is not clear whether or not this is an accurate statement for disparate bands in continuous speech.
5 See Section 3 in this article for further evidence.
6 The probability of union of two events P(A or B) = P(A) + P(B) - P(A, B), which is also equal to P(A) + P(B) - P(A)P(B) if events A and B are independent. Indeed, in estimating the proportions of a sequence of trials in which A and B occur, respectively, one counts twice those trials in which both occur.
This conclusion remains valid for more than two streams. Actually, it can even be shown that the area above a given (multi-dimensional) equal error rate surface grows exponentially with the number of streams. To make the link easier with what follows in the remainder of this article, observe that, in the case of three input streams, (2.3) becomes:

(2.4)   P(q_j|x^1, x^2, x^3) = Σ_{k=1}^{3} P(q_j|x^k) + ∏_{k=1}^{3} P(q_j|x^k) - Σ_{l>k} P(q_j|x^k) P(q_j|x^l).
Figure 1 does not change, but the position in the space depends on Θ, as well as on the different stream features. Ideally, robust training and adaptation should be performed in the Θ space to guarantee that P(q_j|x^1, x^2, Θ) is always above a certain threshold, or to directly maximize (2.5). In the following, we discuss approaches going in that direction.
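To make the combination rule concrete, the following sketch (Python; the posterior values are hypothetical) computes the combined probability of correct classification from per-stream posteriors as one minus the product of the per-stream error probabilities, which expands to (2.3) for two streams and (2.4) for three.

    def combined_correct_probability(stream_posteriors):
        """Product-of-errors combination: the probability of correctly
        classifying q_j equals 1 minus the product of the per-stream error
        probabilities (1 - P(q_j|x^k))."""
        err = 1.0
        for p in stream_posteriors:
            err *= (1.0 - p)
        return 1.0 - err

    # Two streams, as in Eq. (2.3): 0.7 + 0.6 - 0.7 * 0.6 = 0.88
    print(combined_correct_probability([0.7, 0.6]))
    # Three streams, as in Eq. (2.4).
    print(combined_correct_probability([0.7, 0.6, 0.5]))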
2.2. Discussion. The above analysis allows us to draw a few conclusions and to design the features of an "optimal" ASR system:
1. Human hearing performs combination of frequency streams according to the product of errors rule discussed above. In this case (and assuming that the subbands are independent, which is false), correct classification of any subband is empirically equivalent to correct full-band classification. In subband-based ASR systems, this means that we should design the system and the training criterion to maximize the classification performance on subbands, while also making sure that the subband errors are independent.
2. As a direct consequence of the above, it is also obvious that the more subbands we use, the higher the full-band correct classification rate will be. As done in human hearing, ASR systems should thus use a large number of subbands to have a better chance to increase recognition rates. It is interesting to note here that this trend has recently been followed in [15].
computed from features in different spaces, possibly of different dimensions (since likelihoods, as usually computed (assuming Gaussian densities with diagonal covariance matrices), are "dimensional", i.e., dependent on the dimension of the feature space).
(3.1)   P(M|X) / P(M) = P(X|M) / P(X)
8 Which are known, anyway, to yield the minimum error rate solution.
9 As discussed later, the initial multi-stream approach (Section 5.1) was not using strictly exhaustive experts since they did not cover all possible stream combinations. The full combination approach, as discussed in Section 5.3, will actually use all possible combinations.
P(M|X) = Σ_{k=1}^{K} P(M, E_k|X)

(4.1)          = Σ_{k=1}^{K} P(M|E_k, X) P(E_k|X)

               ≈ Σ_{k=1}^{K} P(M_k|X_k) P(E_k|X)
[Figure: multiband ASR architecture. The acoustic signal is split into K frequency bands; each band k is passed through its own acoustic processing module, and the band-level results are then recombined to yield the recognized result.]
(5.1)   P(q_j|x_n) = Σ_{k=1}^{K} w_k P(q_j|x_n^k, Θ_k)
13 It could however be argued that, in this case, the multiband approach boils down to a regular full-band recognizer in which several likelihoods of (assumed) independent features are estimated and multiplied together to yield local likelihoods (since, in likelihood-based systems, the expected value for the full band is the same as the concatenation of the expected values of the subbands). This is however not true when using posterior-based systems (such as hybrid HMM/ANN systems) where the subbands are presented to different nets that are independently trained in a discriminant way on each individual subband. Finally, as discussed in this paper, we also believe that the combination criterion should be different than a simple product of (scaled) likelihoods or posteriors.
P(q_j|x_n, Θ) = Σ_{e=1}^{L} P(q_j, E_e|x_n, Θ)
15 Which would correspond to the case where all the bands are unreliable. In this case, the best posterior estimate is the prior probability P(q_j), and one of the L terms in the following equations will contain only this prior information.
16 This amounts to assuming that the position of the noise or, in other words, the position of the reliable frequency bands, is a hidden (latent) variable on which we will integrate to maximize the posterior probabilities (in the spirit of the EM algorithm).
L
(5.2) = LP(qjIEe ,xn,8)P(EeI Xn)
e=l
L
= L P(qjIS~ , 8e)P(Eel xn)
e=l
where Θ represents the whole parameter space, while Θ_e denotes the set of (ANN) parameters used to compute the subband posteriors. Of course, implementation of (5.2) requires the training of L neural networks to estimate all the posteriors P(q_j|x_n^e, Θ_e) that have to be combined according to a weighted sum, with each weight representing the relative reliability of a specific set of subbands. In the case of stationary interference, this reliability could be estimated on the basis of the average (local) SNR in the considered set. Alternatively, it could also be estimated as the probability that the local SNR is above a certain threshold, where the threshold has been estimated to guarantee a prescribed recognition rate (e.g., lying above a certain equal recognition rate curve in Figure 1) [3].
Typically, training of the L neural nets would be done once and for all on clean data, and the recognizer would then be adapted online simply by adjusting the weights P(E_e|x_n) (still representing a limited set of L weights) to increase the global posteriors. This adaptation could be performed by online estimation of the SNR or by an online version of the EM (deleted-interpolation) algorithm. Although this approach is not really tractable, it has the advantage of avoiding the independence assumption between the subbands of a same set, as well as allowing any DCT transformation of the combination before further processing. Consequently, this combination, referred to as Full Combination, was actually implemented [10] for the case of four frequency subbands (each containing several critical bands), thus requiring the training of 16 neural nets, and used as an "optimal" reference point.
An interesting approximation to this "optimal" solution, though, consists in simply training one neural network per subband, for a total of K models, and approximating all the other subband-combination probabilities directly from these. In other words, re-introducing the independence assumption17 between subbands, subband-combination posteriors would be estimated as [10, 11]:

(5.3)   P(q_j|x_n^e, Θ_e) ≈ (1/C_e) ∏_{k ∈ E_e} P(q_j|x_n^k, Θ_k),
17 Actually, it is shown in [10, 11] that we only need to introduce a weak (conditional) independence assumption.
with C_e = P(q_j)^{(n_e - 1)}, and n_e being the number of subbands in E_e. In [10], it is shown that this normalization factor is important to achieve good performance. This Full Combination rule thus takes exactly the same form as the product of errors rule [such as (2.4) or (2.5)], apart from the fact that the weighting factors are different. In (5.4), the weighting factors can be interpreted as (scaled) probabilities estimating the relative reliability of each combination, while in the product of errors rule these are simply equal to +1 or -1. Another difference is that the product of errors rule involves 2^K - 1 terms while the Full Combination rule involves 2^K terms, one of them representing the contribution of the prior probability.
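A minimal sketch of this approximated Full Combination rule is given below (Python/NumPy; the class posteriors, priors and reliability weights are hypothetical, and uniform subset weights are assumed for simplicity). Each subset of subbands contributes the product of its per-subband posteriors divided by P(q_j)^(n_e - 1), as in (5.3), and the subsets, including the empty one carrying only the prior, are mixed with weights playing the role of P(E_e|x_n).

    import itertools
    import numpy as np

    def full_combination_posterior(subband_post, prior, weights):
        """Approximate Full Combination posterior, as in (5.2)/(5.3): for every
        subset E_e of the K subbands, the combination posterior is approximated
        as prod_k P(q|x^k) / P(q)^(n_e - 1) (renormalized over classes), and the
        2^K subsets are mixed with reliability weights."""
        K = len(subband_post)
        subsets = [s for r in range(K + 1)
                   for s in itertools.combinations(range(K), r)]
        assert len(weights) == len(subsets)
        total = np.zeros_like(prior)
        for w, subset in zip(weights, subsets):
            p = prior.copy()                 # empty subset: prior alone
            for k in subset:
                p = p * subband_post[k] / prior
            if len(subset) > 0:
                p = p / p.sum()              # renormalize over classes
            total += w * p
        return total

    # Hypothetical example: 3 subbands, 4 classes, uniform subset reliabilities.
    prior = np.array([0.25, 0.25, 0.25, 0.25])
    subband_post = [np.array([0.7, 0.1, 0.1, 0.1]),
                    np.array([0.4, 0.3, 0.2, 0.1]),
                    np.array([0.1, 0.1, 0.1, 0.7])]   # e.g., a noisy band
    weights = np.ones(2 ** 3) / 2 ** 3
    print(full_combination_posterior(subband_post, prior, weights))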
In the next section, we discuss a further extension of this approach where the segmentation into subbands is no longer done explicitly, but is achieved dynamically over time, and where the integration over all possible frequency segmentations is part of the same formalism.
6. HMM2: Mixture of HMMs. All HMM emission probabilities discussed in the previous models are typically modeled through Gaussian mixtures or artificial neural networks. Also, in the multiband-based recognizers discussed above, we have to decide a priori the number and position of the subbands being considered. As also briefly discussed above, it is not always clear what the "optimal" recombination criterion should be. In the following, we introduce a new approach, referred to as HMM2, where the emission probabilities of the HMM (now referred to as "temporal HMMs") are estimated through a secondary, state-dependent, HMM (referred to as "feature HMMs") specifically working along the feature vector. As briefly discussed below (see references such as [2] and [31] for further detail), this
FIG. 6. HMM2: the emission distributions of the temporal HMM are estimated by secondary, state-specific, feature HMMs.
model will then allow for dynamic (time and state dependent) subband (frequency) segmentation as well as "optimal" recombination according to a standard maximum likelihood criterion (although other criteria used in standard HMMs could also be used).
In HMM2, as illustrated in Figure 6, each temporal feature vector X_n is considered as a fixed-length sequence of S components X_n = (x_n^1, ..., x_n^S), which is supposed to have been generated at time n by a specific feature HMM associated with a specific state q_j of the temporal HMM. Each feature HMM state r_i is thus emitting individual feature components x_n^s, whose distributions are modeled by, e.g., one-dimensional Gaussian mixtures. The feature HMM thus looks at all possible subband segmentations and automatically performs the combination of the likelihoods to yield a single emission probability. The resulting emission probability can then be used as the emission probability of the temporal HMM. As an alternative, we can also use the resulting feature segmentation in multiband systems or as additional acoustic features in a standard ASR system. Indeed, if HMM2 is applied to the spectral domain, it is expected that the feature HMM will "segment" the feature vector into piecewise stationary spectral regions, which could thus follow spectral peaks (formants) and/or spectral valley regions.
where q_j is the temporal HMM state at time n, r_l the feature HMM state at feature s, R the set of all possible paths through the feature HMM, P(r_0|q_j) the initial state probability of the feature HMM, p(x_n^s|r_l, q_j) the probability of emitting feature component x_n^s while in feature HMM state r_l of temporal state q_j, and P(r_l|r_{l-1}, q_j) the transition probabilities of the feature HMM in temporal state q_j.
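The sketch below (Python/NumPy; all parameter values are hypothetical, and single Gaussians stand in for the one-dimensional mixtures) illustrates how such a state-specific feature HMM can produce the emission probability of one temporal-HMM state by running the forward algorithm along the S components of a feature vector.

    import numpy as np

    def gauss_1d(x, mean, var):
        return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    def hmm2_emission(x, init, trans, means, variances):
        """Likelihood of one temporal-HMM observation vector x = (x^1,...,x^S)
        under the feature HMM of a single temporal state, computed with the
        forward algorithm over the feature-HMM states r_1..r_R."""
        alpha = init * gauss_1d(x[0], means, variances)            # shape (R,)
        for s in range(1, len(x)):
            alpha = (alpha @ trans) * gauss_1d(x[s], means, variances)
        return alpha.sum()

    # Hypothetical 3-state, left-to-right feature HMM over an 8-component
    # spectral vector; each feature-HMM state models one spectral region.
    init = np.array([1.0, 0.0, 0.0])
    trans = np.array([[0.6, 0.4, 0.0],
                      [0.0, 0.6, 0.4],
                      [0.0, 0.0, 1.0]])
    means = np.array([0.0, 2.0, -1.0])
    variances = np.array([1.0, 1.0, 1.0])
    x_n = np.array([0.1, -0.2, 1.8, 2.2, 2.1, -0.9, -1.1, -1.0])
    print(hmm2_emission(x_n, init, trans, means, variances))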
We believe that HMM2 (which includes the classical mixture-of-Gaussian HMMs as a particular case) has several potential advantages, including:
1. Better feature correlation modeling through the feature HMM topology (e.g., working in the frequency domain). Also, the com-
19 Probably because the advantages of subband-based ASR can outweigh the slight problem due to independent processing of subbands.
20 This conclusion is very similar to what is proved mathematically in [4], p. 369, para. 1 (also p. 424).
REFERENCES
[1] ALLEN J., "How do humans process and recognize speech?," IEEE Trans. on Speech and Audio Processing, Vol. 2, no. 4, pp. 567-577, 1994.
[2] BENGIO S., BOURLARD H., AND WEBER K., "An EM Algorithm for HMMs with Emission Distributions Represented by HMMs," IDIAP Research Report, IDIAP-RR-00-11, 2000.
[3] BERTHOMMIER F. AND GLOTIN H., "A new SNR-feature mapping for robust multi-stream speech recognition," Intl. Conf. of Phonetic Sciences (ICPhS'99) (San Francisco), to appear, August 1999.
[4] BISHOP C.M., Neural Networks for Pattern Recognition, Clarendon Press (Oxford), 1995.
[5] BOURLARD H. AND MORGAN N., Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, 1994.
[6] BOURLARD H. AND DUPONT S., "A new ASR approach based on independent processing and combination of partial frequency bands," Proc. of Intl. Conf. on Spoken Language Processing (Philadelphia), pp. 422-425, October 1996.
[7] DE VETH J., DE WET F., CRANEN B., AND BOVES L., "Missing feature theory in ASR: make sure you miss the right type of features," Proceedings of the ESCA Workshop on Robust Speech Recognition (Tampere, Finland), May 25-26, 1999.
[8] DUDA R.O. AND HART P.E., Pattern Classification and Scene Analysis, John Wiley, 1973.
[9] GREENBERG S., "On the origins of speech intelligibility in the real world," Proc. of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 23-32, ESCA, April 1997.
[10] HAGEN A., MORRIS A., AND BOURLARD H., "Subband-based speech recognition in noisy conditions: The full combination approach," IDIAP Research Report no. IDIAP-RR-98-15, 1998.
[11] HAGEN A., MORRIS A., AND BOURLARD H., "Different weighting schemes in the full combination subbands approach for noise robust ASR," Proceedings of the Workshop on Robust Methods for Speech Recognition in Adverse Conditions (Tampere, Finland), May 25-26, 1999.
[12] HENNEBERT J., RIS C., BOURLARD H., RENALS S., AND MORGAN N. (1997), "Estimation of Global Posteriors and Forward-Backward Training of Hybrid Systems," Proceedings of EUROSPEECH'97 (Rhodes, Greece, Sep. 1997), pp. 1951-1954.
[13] HERMANSKY H. AND MORGAN N., "RASTA processing of speech," IEEE Trans. on Speech and Audio Processing, Vol. 2, no. 4, pp. 578-589, October 1994.
[14] HERMANSKY H., PAVEL M., AND TRIBEWALA S., "Towards ASR using partially corrupted speech," Proc. of Intl. Conf. on Spoken Language Processing (Philadelphia), pp. 458-461, October 1996.
[15] HERMANSKY H. AND SHARMA S., "Temporal patterns (TRAPS) in ASR of noisy speech," Proc. of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (Phoenix, AZ), pp. 289-292, March 1999.
[16] HOUTGAST T. AND STEENEKEN H.J.M., "A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria," J. Acoust. Soc. Am., Vol. 77, no. 3, pp. 1069-1077, March 1985.
[17] IKBAL S., BOURLARD H., BENGIO S., AND WEBER K., "IDIAP HMM/HMM2 System: Theoretical Basis and Software Specifications," IDIAP Research Report, IDIAP-RR-01-27, 2001.
[18] KINGSBURY B., MORGAN N., AND GREENBERG S., "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, nos. 1-3, pp. 117-132, 1998.
[19] LIPPMANN R.P. AND CARLSON B.A., "Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise," Proc. Eurospeech'97 (Rhodes, Greece, September 1997), pp. KN37-40.
[20] MCGURK H. AND MCDONALD J., "Hearing lips and seeing voices," Nature, no. 264, pp. 746-748, 1976.
[21] MIRGHAFORI N. AND MORGAN N., "Transmissions and transitions: A study of two common assumptions in multi-band ASR," Intl. IEEE Conf. on Acoustics, Speech, and Signal Processing (Seattle, WA, May 1997), pp. 713-716.
[22] MORRIS A.C., COOKE M.P., AND GREEN P.D., "Some solutions to the missing features problem in data classification, with application to noise robust ASR," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 737-740, 1998.
[23] MORRIS A.C., HAGEN A., AND BOURLARD H., "The full combination subbands approach to noise robust HMM/ANN-based ASR," Proc. of Eurospeech'99 (Budapest, Sep. 99), to appear.
[24] MOORE B.C.J., An Introduction to the Psychology of Hearing (4th edition), Academic Press, 1997.
[25] NADEU C., HERNANDO J., AND GORRICHO M., "On the decorrelation of filter-bank energies in speech recognition," Proc. of Eurospeech'95 (Madrid, Spain), pp. 1381-1384, 1995.
[26] OKAWA S., BOCCHIERI E., AND POTAMIANOS A., "Multi-band speech recognition in noisy environment," Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1998.
[27] RAO S. AND PEARLMAN W.A., "Analysis of linear prediction, coding, and spectral estimation from subbands," IEEE Trans. on Information Theory, Vol. 42, pp. 1160-1178, July 1996.
[28] TOMLINSON J., RUSSEL M.J., AND BROOKE N.M., "Integrating audio and visual information to provide highly robust speech recognition," Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (Atlanta), May 1996.
[29] TOMLINSON M.J., RUSSEL M.J., MOORE R.K., BUCKLAN A.P., AND FAWLEY M.A., "Modelling asynchrony in speech using elementary single-signal decomposition," Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (Munich), pp. 1247-1250, April 1997.
[30] VARGA A. AND MOORE R., "Hidden Markov model decomposition of speech and noise," Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 845-848, 1990.
[31] WEBER K., BENGIO S., AND BOURLARD H., "HMM2 - Extraction of Formant Features and their Use for Robust ASR," Proc. of Eurospeech, pp. 607-610, 2001.
[32] WELLEKENS C.J., KANGASHARJU J., AND MILESI C., "The use of meta-HMM in multistream HMM training for automatic speech recognition," Proc. of Intl. Conference on Spoken Language Processing (Sydney), pp. 2991-2994, December 1998.
[33] WU S.-L., KINGSBURY B.E., MORGAN N., AND GREENBERG S., "Performance improvements through combining phone and syllable-scale information in automatic speech recognition," Proc. Intl. Conf. on Spoken Language Processing (Sydney), pp. 459-462, Dec. 1998.
GRAPHICAL MODELS AND AUTOMATIC SPEECH RECOGNITION*
JEFFREY A. BILMES†
*This material is based upon work supported in part by the National Science Foundation under Grant No. 0093430.
†Department of Electrical Engineering, University of Washington, Seattle, Washington 98195-2500.
1 Note that the name "Bayesian network" does not imply Bayesian statistical inference. In fact, both Bayesian and non-Bayesian Bayesian networks may exist.
p(X_{1:N}) = ∏_i p(X_i | X_{1:i-1}) = ∏_i p(X_i | X_{π_i})
The first equality is the probabilistic chain rule, and the second equality holds under a particular BN, where π_i designates node i's parents according to the BN. A Dynamic Bayesian Network (DBN) [38, 66, 56] has exactly the same semantics as a BN, but is structured to have a sequence of clusters of connected vertices, where edges between clusters point in the direction of increasing time. DBNs are particularly useful to describe time signals such as speech, and as can be seen from Figure 2 many techniques for ASR fall under this or the BN category.
Several equivalent schemata exist that formally define a BN's conditional independence relationships [103, 122, 84]. The idea of d-separation (or directed separation) is perhaps the most widely known: a set of variables A is conditionally independent of a set B given a set C if A is d-separated from B by C. D-separation holds if and only if all paths that connect any node in A and any other node in B are blocked. A path is blocked if it has a node v with either: 1) the arrows along the path do not converge at v (i.e., serial or diverging at v) and v ∈ C; or 2) the arrows along the path do converge at v, and neither v nor any descendant of v is in C. Note that C can be the empty set, in which case d-separation encodes standard statistical independence.
From d-separation, one may compute a list of conditional independence statements made by a graph. The set of probability distributions for which this list of statements is true is precisely the set of distributions represented by the graph. Graph properties equivalent to d-separation include the directed local Markov property [103] (a variable is conditionally independent of its non-descendants given its parents), and the Bayes-ball procedure [143], which is a simple algorithm that one can use to read conditional independence statements from graphs, and which is arguably simpler than d-separation. It is assumed henceforth that the reader is familiar with either d-separation or some equivalent rule.
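The following sketch (Python; the example DAG is hypothetical) checks d-separation for small graphs exactly as defined above, by enumerating the undirected paths between two nodes and testing each intermediate node against the two blocking conditions.

    def descendants(dag, v):
        """All descendants of v (dag maps each node to its list of children)."""
        out, stack = set(), list(dag.get(v, []))
        while stack:
            u = stack.pop()
            if u not in out:
                out.add(u)
                stack.extend(dag.get(u, []))
        return out

    def d_separated(dag, a, b, given):
        """True if node a is d-separated from node b by the set `given`."""
        parents = {v: [u for u in dag if v in dag[u]] for v in dag}
        neighbors = {v: set(dag[v]) | set(parents[v]) for v in dag}

        def blocked(path):
            for i in range(1, len(path) - 1):
                prev, v, nxt = path[i - 1], path[i], path[i + 1]
                converging = v in dag.get(prev, []) and v in dag.get(nxt, [])
                if converging:
                    # Blocked if neither v nor any descendant of v is given.
                    if v not in given and not (descendants(dag, v) & set(given)):
                        return True
                else:
                    # Serial or diverging: blocked if v is given.
                    if v in given:
                        return True
            return False

        def all_paths(cur, target, visited):
            if cur == target:
                yield list(visited)
                return
            for nxt in neighbors[cur]:
                if nxt not in visited:
                    yield from all_paths(nxt, target, visited + [nxt])

        return all(blocked(p) for p in all_paths(a, b, [a]))

    # Hypothetical DAG: A -> B -> C plus the collider A -> D <- C.
    dag = {"A": ["B", "D"], "B": ["C"], "C": ["D"], "D": []}
    print(d_separated(dag, "A", "C", {"B"}))        # True
    print(d_separated(dag, "A", "C", {"B", "D"}))   # False: conditioning on D opens the collider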
Conditional independence properties in undirected graphical models (UGMs) are much simpler than for BNs, and are specified using graph separation. For example, assuming that X_A, X_B, and X_C are disjoint sets of nodes in a UGM, X_A ⊥ X_B | X_C is true when all paths from any
FIG. 1. This figure shows four BNs with different arrow directions over the same random variables, A, B, and C. On the left side, the variables form a three-variable first-order Markov chain A → B → C. In the middle graph, the same conditional independence statement is realized even though one of the arrow directions has been reversed. Both these networks state that A ⊥ C | B. The right network corresponds to the property A ⊥ C but not A ⊥ C | B.
Certain GMs allow for what are called switching dependencies [65, 115, 16]. In this case, edges in a GM can change as a function of other variables in the network. An important advantage of switching dependencies is the reduction in the required number of parameters needed by the model. A related construct allows GMs to have optimized local probability implementations [55] using, for example, decision trees.
It is sometimes the case that certain observed variables are used only as conditional variables. For example, consider the graph B → A, which implies a factorization of the joint distribution P(A, B) = P(A|B)P(B). In many cases, it is not necessary to represent the marginal distribution over B. In such cases B is a "conditional-only" variable, meaning it is always and only to the right of the conditioning bar. In this case, the graph represents P(A|B). This can be useful in a number of applications including classification (or discriminative modeling), where we might only be interested in posterior distributions over the class random variable, or in situations where additional observations (say Z) exist that are marginally independent of a class variable (say C) but that are dependent conditioned on other observations (say X). This can be depicted by the graph C → X ← Z, where it is assumed that the distribution over Z is not represented.
Often, the true (or the best) structure for a given task is unknown. This can mean that either some of the edges or nodes (which can be hidden) or both can be unknown. This has motivated research on learning the structure of the model from the data, with the general goal to produce a structure that accurately reflects the important statistical properties in the data set. These can take a Bayesian [75, 77] or frequentist point of view [25, 99, 75]. Structure learning is akin to both statistical model selection [107, 26] and data mining [36]. Several good reviews of structure learning are presented in [25, 99, 75]. Structure learning from a discriminative perspective, thereby producing what is called discriminative generative models, was proposed in [10].
Figure 2 depicts a topological hierarchy of both the semantics and structure of GMs, and shows where different models fit in, including several ASR components to be described in Section 3.
[Figure 2: a topological hierarchy of the semantics and structure of GMs (e.g., causal models, dependency networks, and other semantics), showing where different models and several ASR components fit in.]
Supposing that each variable has K possible values, this computation requires O(K^6) operations, a quantity that is exponential in the number of variables in the joint distribution. If, on the other hand, it was possible to factor the joint distribution into factors containing fewer variables, it would be possible to reduce computation significantly. For example, under the graph in Figure 3, the above distribution may be factored into a product of lower-dimensional factors, inside which the sums can be moved.
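The specific factorization for Figure 3's graph is not shown in this extract, so the sketch below (Python/NumPy) uses an assumed chain-structured factorization over six variables merely to illustrate the point: pushing the sums inside the factors gives the same answer as the naive O(K^6) summation at a fraction of the cost.

    import itertools
    import numpy as np

    K = 4   # number of values per variable
    rng = np.random.default_rng(0)
    # Pairwise factors of a 6-variable chain: f1(x1,x2) f2(x2,x3) ... f5(x5,x6).
    # (An illustrative factorization; not the graph of Figure 3.)
    factors = [rng.random((K, K)) for _ in range(5)]

    # Naive summation over all K**6 joint configurations.
    naive = sum(np.prod([factors[i][x[i], x[i + 1]] for i in range(5)])
                for x in itertools.product(range(K), repeat=6))

    # Moving the sums inside the factors: eliminate x6, then x5, and so on,
    # with O(K^2) work per elimination.
    msg = np.ones(K)
    for f in reversed(factors):
        msg = f @ msg
    efficient = msg.sum()

    print(naive, efficient)   # identical up to floating-point rounding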
FIG. 3. The graph's independence properties are used to move sums inside of factors .
and Gaussian, having a "conditional" mean and covariance [111, 4]. Specifically, the distribution for X_A given X_B is a conditional Gaussian with an X_B-dependent mean vector

    K_AA = ( K_AA^aa   K_AA^ab
             K_AA^ba   K_AA^bb ).
FIG. 4. A Gaussian viewed as a UGM. On the left, there are no independence assumptions. On the right, X_2 ⊥ X_3 | {X_1, X_4}.
A Gaussian may also be viewed as a BN, and in fact as many BNs. Unlike with a UGM, to form a Gaussian BN a specific variable ordering must first be chosen, the same ordering used to factor the joint distribution with the chain rule of probability. A Gaussian can be factored
both of which are unique for a given ordering (these are an application of the conditional Gaussian formulas above, but with A and B set to the specific values {i} and {(i+1):N} respectively). Therefore, the chain rule expansion can be written:
The unit triangular matrices, however, can be "brought" inside the squared linear terms by considering the argument within the exponential
When this is equated with Equation (3.1), and note is taken of the uniqueness of both transformations, it is the case that
and that μ̃_i = μ_i - B_{i,(i+1):N} μ_{(i+1):N}. This implies that the regression coefficients within B are a simple function of the original covariance matrix. Since the quantities in the exponents are identical for each factor (which are each an appropriately normalized Gaussian), the variance terms D_i must satisfy:
where π_i ⊆ {(i+1):N} are the parents of the variable X_i. When this is considered in terms of Equation (3.2), it implies that the non-zero entries of B_{i,(i+1):N} correspond to the set of parents of node i, and the zero entries correspond to missing edges. In other words (under a given variable ordering) the B matrix determines the conditional independence statements for a Gaussian when viewed as a DGM, namely X_i ⊥ X_{{(i+1):N}\π_i} | X_{π_i} if and only if the entries B_{i,{(i+1):N}\π_i} are zero.2
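A small numerical sketch of this correspondence is given below (Python/NumPy; the covariance matrix is hypothetical). For a chosen ordering, it computes the regression of each X_i on X_{(i+1):N}, giving the rows of B and the conditional variances D_i; zero (or near-zero) entries of B then mark the missing parents.

    import numpy as np

    def directed_gaussian_params(sigma):
        """For a zero-mean Gaussian with covariance `sigma`, under the ordering
        X_1,...,X_N with each variable conditioned on the later ones, return the
        strictly upper-triangular regression matrix B and conditional variances
        D, so that X_i = B[i, i+1:] @ X[i+1:] + e_i with var(e_i) = D[i]."""
        N = sigma.shape[0]
        B = np.zeros((N, N))
        D = np.zeros(N)
        for i in range(N):
            rest = slice(i + 1, N)
            if i + 1 < N:
                coeffs = np.linalg.solve(sigma[rest, rest], sigma[rest, i])
                B[i, rest] = coeffs
                D[i] = sigma[i, i] - sigma[i, rest] @ coeffs
            else:
                D[i] = sigma[i, i]
        return B, D

    # Hypothetical 4x4 covariance; near-zero entries of B indicate that the
    # corresponding later variable is not a parent of X_i in the directed model.
    sigma = np.array([[2.0, 0.8, 0.3, 0.5],
                      [0.8, 1.5, 0.6, 0.2],
                      [0.3, 0.6, 1.2, 0.4],
                      [0.5, 0.2, 0.4, 1.0]])
    B, D = directed_gaussian_params(sigma)
    print(np.round(B, 3))
    print(np.round(D, 3))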
It is important to realize that these results depend on a particular ordering of the variables X_{1:N}. A different ordering might yield a different
2 Standard notation is used here, where if A and B are sets, A\B is the set of elements in A that are not in B.
Consider, for example, a symmetric 4 × 4 concentration matrix K whose only zero entries are K_{13} = K_{31} = 0 and K_{24} = K_{42} = 0.
This concentration matrix states that X_1 ⊥ X_3 | {X_2, X_4} and X_2 ⊥ X_4 | {X_1, X_3}, but the corresponding B matrix has only a single zero in its upper portion, reflecting only the first independence statement.
It was mentioned earlier that UGMs and DGMs represent different families of probability distributions, and this is reflected in the Gaussian case above by a reduction in sparsity when moving between certain B and K matrices. It is interesting to note that Gaussians are able to represent any of the dependency structures captured either in a DGM (via an appropriate order of the variables and zeros in the B matrix) or a UGM (with appropriately placed zeros in the concentration matrix K). Therefore, Gaussians, along with many other interesting and desirable theoretical properties, are quite general in terms of their ability to possess conditional independence relationships.
The question then becomes what form of Gaussian should be used, a DGM or a UGM, and if a DGM, in what variable order. A common goal is to minimize the total number of free parameters. If this is the case, the Gaussian should be represented in a "natural" domain [10], where the least degree of parameter redundancy exists. Sparse matrices often provide the answer, assuming no additional cost exists to represent sparse matrices, since the sparse pattern itself might be considered a parameter needing a representation. This was exploited in [17], where the natural directed Gaussian representation was solicited from data, and where a negligible
X = ΓY + μ.
Slightly abusing notation, one can say that X ∼ N(ΓY + μ, 0), meaning that X, conditioned on Y, is a linear-conditional constant "Gaussian" - i.e., a conditional-Gaussian random variable with mean ΓY + μ and zero
variance.3 In this view of PCA, Y consists of the latent or hidden "causes" of the observed vector X, where Y ∼ N(0, Λ), or if the C-transformation is used above, Y ∼ N(0, I) where I is the identity matrix. In either case, the variance in X is entirely explained by the variance within Y, as X is simply a linear transformation of these underlying causes. PCA transforms a given X to the most likely values of the hidden causes. This is equal to the conditional mean E[Y|X] = Γ^T(X - μ) since p(y|x) ∼ N(Γ^T(x - μ), 0).
The two left graphs of Figure 5 show the probabilistic interpretations of PCA as a GM, where the dependency implementations are all linear. The left graph corresponds to the case where Y is spherically distributed. The hidden causes Y are called the "principal components" of X. It is often the case that only the components (i.e., elements of Y) corresponding to the largest eigenvalues of Σ are used in the model; the other elements of Y are removed, so that Y is k-dimensional with k < d. There are many properties of PCA [111] - for example, using the principal k elements of Y leads to the smallest reconstruction error of X in a mean-squared sense. Another notable property (which motivates factor analysis below) is that PCA is not scale invariant - if the scale of X changes (say by converting from inches to centimeters), both Γ and Λ will also change, leading to different components Y. In this sense, PCA explains the variance in X using only variances found in the hidden causes Y.
FIG. 5. Left two graphs: two views of principal components analysis (PCA); middle: probabilistic PCA; right: factor analysis (FA). In general, the graph corresponds to the equation X = CY + μ + ε, where Y ∼ N(0, Λ) and ε ∼ N(0, Ψ). X is a random conditional Gaussian with mean CY + μ and variance CΛC^T + Ψ. With PCA, Ψ = 0 so that ε = 0 with probability 1. Also, either (far left) Λ = I is the identity matrix and C is general, or (second from left) Λ is diagonal and C = Γ is orthonormal. With PPCA, Ψ = σ²I is a spherical covariance matrix, with diagonal terms σ². With FA, Ψ is diagonal. Other generalizations are possible, but they can lead to an indeterminacy of the parameters.
X = CY + μ + ε
where Y ∼ N(0, I), and ε ∼ N(0, Ψ) with Ψ a non-negative diagonal matrix. In factor analysis, C is the factor loading matrix and Y the common factor vector. Elements of the residual term ε = X - CY - μ are called the specific factors, and account both for noise in the model and for the underlying variance in X. In other words, X possesses a non-zero variance, even conditional on Y, and Y is constrained to be unable to explain the variance in X since Y is forced to have I as a covariance matrix. C, on the other hand, is compelled to represent just the correlation between elements of X irrespective of its individual variance terms, since correlation cannot be represented by ε. Therefore, unlike PCA, if the scale of an element of X changes, the resulting Y will not change, as it is ε that will absorb the change in X's variance. As in PCA, it can be seen that in FA, X is being explained by underlying hidden causes Y, and the same graph (Figure 5) can describe both PCA and FA.
Probabilistic PCA (PPCA) (second from the right in Figure 5) [147, 140], while not widely used in ASR, is only a simple modification to FA, where Ψ = σ²I is constrained so that ε is a spherically-distributed Gaussian.
In all of the models above, the hidden causes Y are uncorrelated Gaussians, and therefore are marginally independent. Any statistical dependence between elements of X exists only in how they are jointly dependent on one or more of the hidden causes Y. It is possible to use a GM to make this marginal Y dependence explicit, as is provided on the left in Figure 6, where all nodes are now scalars. In this case, Y_j ∼ N(0, 1), ε_i ∼ N(0, ψ_i), and p(X_i) ∼ N(Σ_j C_ij Y_j + μ_i, ψ_i), where ψ_i = 0 for PCA.
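As an illustration of this family of linear-Gaussian latent-variable models, the sketch below (Python/NumPy; dimensions and parameter values are arbitrary) samples from the factor-analysis model X = CY + μ + ε and computes the posterior mean E[Y|X] = C^T(CC^T + Ψ)^{-1}(X - μ); setting Ψ = σ²I gives PPCA, and letting Ψ tend to zero recovers the probabilistic view of PCA.

    import numpy as np

    rng = np.random.default_rng(1)
    d, k = 5, 2                          # observed and latent dimensions
    C = rng.normal(size=(d, k))          # factor loading matrix
    mu = rng.normal(size=d)
    psi = np.diag(rng.uniform(0.1, 0.5, size=d))   # diagonal specific variances

    def sample_fa(n):
        """Draw n samples from the FA generative model X = C Y + mu + eps."""
        Y = rng.normal(size=(n, k))
        eps = rng.multivariate_normal(np.zeros(d), psi, size=n)
        return Y @ C.T + mu + eps

    def posterior_mean(x):
        """E[Y | X = x] = C^T (C C^T + Psi)^{-1} (x - mu)."""
        cov_x = C @ C.T + psi
        return C.T @ np.linalg.solve(cov_x, x - mu)

    X = sample_fa(3)
    print(posterior_mean(X[0]))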
FIG. 6. Left: A graph showing the explicit scalar variables (and therefore their statistical dependencies) for PCA, PPCA, and FA. The graph also shows the parameters for these models. In this case, the dependencies are linear and the random variables are all Gaussian. Right: The graph for PCA/PPCA/FA (parameters not shown), which is the same as the graph for ICA. For ICA, the implementation of the dependencies and the random variable distributions can be arbitrary; different implementations lead to different ICA algorithms. The key goal in all cases is to explain the observed vector X with a set of statistically independent causes Y.
general model for independent component analysis (ICA) [7, 92], another method that explains data vectors X with independent hidden causes. Like PCA and FA, a goal of ICA is to first learn the parameters of the model that explain X. Once done, it is possible to find Y, the causes of X, that are as statistically independent as possible. Unlike PCA and FA, however, dependency implementations in ICA need be neither linear nor Gaussian. Since the graph on the right in Figure 6 does not depict implementations, the vector Y can be any non-linear and/or non-Gaussian causes of X. The graph insists only that the elements of Y are marginally independent, leaving alone the operations needed to compute E[Y|X]. Therefore, ICA can be seen simply as supplying the mechanism for different implementations of the dependencies used to infer E[Y|X]. Inference can still be done using the standard graphical-model inference machinery, described in Section 2.5.
Further generalizations of PCA/FA/ICA can be obtained simply by using different implementations of the basic graph given in Figure 6. Independent factor analysis [6] occurs when the hidden causes Y are described by a mixture of Gaussians. Moreover, a multi-level factor analysis algorithm, shown in Figure 7, can easily be described where the middle hidden layer is a possibly non-independent explanation for the final marginally independent components. The goal again is to train parameters to explain X, and to compute E[Z|X]. With graphs it is therefore easy to understand all of these techniques, and simple structural or implementation changes can lead to dramatically different statistical procedures.
where μ_i is the class conditional mean and μ_y is the global mean in the transformed space. This is a multi-dimensional generalization of Fisher's original linear discriminant analysis (LDA) [49].
LDA can also be seen as a particular statistical modeling assumption about the way in which observation samples X are generated. In this case, it is assumed that the class conditional distributions in the transformed space P(Y|C = i) are Gaussians having priors P(C = i). Therefore, Y is a mixture model p(y) = Σ_i p(C = i) p(y|C = i), and classification of y is optimally performed using the posterior:

p(C = i|y) = p(y|i) p(i) / Σ_j p(y|j) p(j).

For standard LDA, it is assumed that the Gaussian components p(y|j) = N(y; μ_j, Σ) all have the same covariance matrix, and are distinguished only by their different means. Finally, it is assumed that there is a linear transform relating X to Y. The goal of LDA is to find the linear transformation that maximizes the likelihood P(X) under the assumptions given by the model above. The statistical model behind LDA can therefore be graphically described as shown on the far left in Figure 8.
Gaussian component in a Gaussian mixture can use one each from a shared
pool of As and I's, leading to what are called semi-tied covariance matrices
[62, 165]. Once again, this form of tying can be described by a GM as
shown by the far right graph in Figure 8.
3.4. Acoustic modeling: Mixture models. In speech recognition, hidden Markov model observation distributions rarely use only single-component Gaussian distributions. Much more commonly, mixtures of such Gaussians are used. A general mixture distribution for p(x) assumes the existence of a hidden variable C that determines the active mixture component, as in

p(x) = Σ_c p(C = c) p(x|C = c).

Many texts such as [148, 112] describe the properties of mixture distributions, most of which can be described using graphs in this way.
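A minimal sketch of such a mixture density (Python/NumPy; the component weights, means and covariances are hypothetical, with Gaussian components) is:

    import numpy as np

    def gaussian_pdf(x, mean, cov):
        d = len(mean)
        diff = x - mean
        expo = -0.5 * diff @ np.linalg.solve(cov, diff)
        norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
        return np.exp(expo) / norm

    def mixture_density(x, weights, means, covs):
        """p(x) = sum_c P(C = c) p(x | C = c)."""
        return sum(w * gaussian_pdf(x, m, S)
                   for w, m, S in zip(weights, means, covs))

    # Hypothetical two-component mixture in two dimensions.
    weights = [0.3, 0.7]
    means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
    print(mixture_density(np.array([1.0, 0.5]), weights, means, covs))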
3.5. Acoustic modeling: Acoustic classifier combination. It
has often been found that when multiple separately trained classifiers are
used together to make a classification decision, the resulting classification
error rate decreases. This has been found in many instances, both
empirically and theoretically [82, 100, 124, 160, 19]. The theoretical results
often make assumptions about the statistical dependencies amongst the
various classifiers, such as that their errors are statistically
independent. The empirical results for ASR have found that combination
is useful at the acoustic feature level [12, 94, 72, 95], the HMM state level
[96], the sub-word or word level [163], and even at the utterance level [48].
Assume that p_i(c|x) is a probability distribution corresponding to the
i-th classifier, where c is the class for feature set x. A number of classifier
combination rules exist, such as the sum rule [97], where p(c|x) =
Σ_i p(i) p_i(c|x), or the product rule, where p(c|x) ∝ Π_i p_i(c|x). Each of these
schemes can be explained statistically by assuming a statistical model that
leads to the particular combination rule. Ideally, the combination rule that
performs best will correspond to the model that best matches the data. For
example, the sum rule corresponds to a mixture model described above, and
the product rule can be derived from the independence assumptions corresponding
to a naive Bayes classifier [18]. Additional combination schemes,
moreover, can be defined under the assumption of different models, some
of which might not require the errors to be statistically independent.
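The two rules can be made concrete with a few invented numbers; the classifier posteriors below are hypothetical and serve only to show the arithmetic of the sum and product rules.

```python
# Hypothetical posteriors p_i(c|x) from three classifiers over four classes.
import numpy as np

P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.40, 0.30, 0.20, 0.10],
              [0.55, 0.05, 0.30, 0.10]])
prior = np.array([1/3, 1/3, 1/3])                 # p(i): weight of each classifier

sum_rule = prior @ P                              # p(c|x) = sum_i p(i) p_i(c|x)
product_rule = P.prod(axis=0)
product_rule /= product_rule.sum()                # p(c|x) proportional to prod_i p_i(c|x)

print("sum rule     :", np.round(sum_rule, 3))
print("product rule :", np.round(product_rule, 3))
```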
FIG. 11. A simple first-order Markov chain. This graph encodes the relationship
Q_t ⊥⊥ Q_{1:t-2} | Q_{t-1}.
In a normal decision tree, only one decision is made at each level in the
tree. A GM can represent such a "crisp" tree by insisting that the distributions
at each level D_ℓ (and the final decision O) of the tree are Dirac-delta
functions, such as P(D_ℓ = d_ℓ | I) = δ_{d_ℓ, f_ℓ(I, d_{1:ℓ-1})}, where f_ℓ(I, d_{1:ℓ-1}) is a
deterministic function of the input I and previously made decisions d_{1:ℓ-1},
so that d_ℓ = f_ℓ(I, d_{1:ℓ-1}). Therefore, with the appropriate implementation
of dependencies, it can be seen that the GM view is a probabilistic generalization
of normal decision trees.
n-gram, it does not portray how the parameters of such a model are typically
obtained, a procedure that can be quite involved [29]. In fact, much
research regarding n-grams involves methods to cope with data sparsity:
because of insufficient training data, "smoothing" methodology must
be employed, whereby a K-th order model is forced to provide probability
for length K + 1 strings of words that did not occur in the training data. If
a purely maximum-likelihood procedure were used, these strings would be
given zero probability.
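The zero-probability problem, and the interpolation remedy discussed further below, can be seen in a toy sketch; the corpus and mixing weight here are invented.

```python
# A bigram (K = 1) model interpolated with a unigram model so that unseen
# word pairs still receive probability.
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(w):
    return unigrams[w] / N

def p_bigram_ml(w, prev):
    # Maximum-likelihood bigram estimate; zero for unseen pairs.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interpolated(w, prev, lam=0.7):
    # Mixing a higher-order and a lower-order sub-model.
    return lam * p_bigram_ml(w, prev) + (1 - lam) * p_unigram(w)

print(p_bigram_ml("dog", "the"))        # seen pair: nonzero ML estimate
print(p_bigram_ml("cat", "dog"))        # unseen pair: ML gives zero
print(p_interpolated("cat", "dog"))     # smoothing assigns it some probability
```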
FIG. 13. A GM view of a LM. The dashed arcs indicate that the parents are
switching. The hidden switching parents S_t switch between the word variables W_t, forming
either a zeroth (S_t = 1), first (S_t = 2), or second (S_t = 3) order Markov chain.
The switching parents also possess previous words as parents, so that the probability of
the Markov-chain order is itself context dependent.
Often, smoothing takes the form of mixing together higher- and lower-
order sub-models, with mixing weights determined from data not used for
training any of the sub-models [83, 29]. In such a case, a language model
mixture can be described by the following equation:
FIG. 14. A class-based language model. Here, a Markov chain is used to model the
dynamics of word classes rather than the words themselves.
6Thanks to John Henderson who first posed the problem to me of how to represent
this construct using a graphical model, in the context of building a word-tagger.
    Pml(w) = N(w)/N,

where N(w) is the number of times word w occurs in the training set,
and N is the total number of words in the training set. This means, for
example, that N(w) = 1 for all w ∈ S.
One possible assumption is to force the probability of unk to be 0.5
times the probability of the entire singleton set, i.e., p(unk) = 0.5 Pml(S) =
0.5 Σ_{w∈S} Pml(w). This requires that probability be taken away from
tokens that do occur in the training set. In this case probability is removed
from the singleton words, leading to the following desired probability model
Pd(w):

(3.3)    Pd(w) = { 0.5 Pml(S)   if w = unk
                   0.5 Pml(w)   if w ∈ S
                   Pml(w)       otherwise.
(3.4)    PM(w|c) = { Pml(w|c) / p(M|c)   if w ∈ M
                     0                   else,

where p(M|c) = Σ_{w∈M} Pml(w|c), and

(3.5)    PS(w|c) = { Pml(w|c) / p(S|c)   if w ∈ S
                     0                   else,

where p(S|c) = Σ_{w∈S} Pml(w|c). Note that both PM and PS are valid normalized
distributions over all words. Also, p(S|c) + p(M|c) = 1 since these
two quantities together use up all the probability mass contained in the
maximum likelihood distribution.
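A toy numerical check of Equation 3.3 follows; the vocabulary and counts are invented, and the point is simply that mass moved to unk is exactly compensated by halving the singleton probabilities.

```python
# unk absorbs half of the singleton mass, singletons keep the other half,
# and all other words keep their ML probability.
from collections import Counter

counts = Counter({"the": 5, "cat": 3, "sat": 2, "mat": 1, "rug": 1})  # toy counts
N = sum(counts.values())
S = {w for w, n in counts.items() if n == 1}          # singleton words

def p_ml(w):
    return counts[w] / N

def p_d(w):
    if w == "unk":
        return 0.5 * sum(p_ml(v) for v in S)
    if w in S:
        return 0.5 * p_ml(w)
    return p_ml(w)

vocab = list(counts) + ["unk"]
print({w: round(p_d(w), 3) for w in vocab})
print("total:", sum(p_d(w) for w in vocab))            # still sums to 1
```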
FIG. 15. A class-based language model that forces the probability of unk to be α
times the probability of all singleton words. The V_t variables are shaded, indicating that
they are observed, and have value V_t = 1 for all t. The K_t and B_t variables are switching
parents of W_t.
(3.8)    p(w_t | k_t, b_t, c_t) = { PM(w_t|c_t)       if k_t = 0
                                    PS(w_t|c_t)       if k_t = 1 and b_t = 1
                                    δ_{w_t = unk}     if k_t = 1 and b_t = 0.
(3.9)
This model correctly produces the probabilities that are given in Equa-
tion 3.3. First, when Wt = unk:
as desired. This follows because the other terms, for different values of the
hidden variables, are all zero. Next, when W_t ∈ S,
and
(3.11)
for each t ∈ 1:T. Q_{¬t} refers to all variables Q_τ except for the one at time
τ = t. The length T of these sequences is itself an integer-valued random
variable having a complex distribution. An HMM consists of a hidden
Markov chain of random variables (the unshaded nodes) and a collection
of nodes corresponding to the speech utterance (the shaded nodes). In
most ASR systems, the hidden chain corresponds to sequences of words,
phones, and sub-phones.
This set of properties can be concisely described using the GM shown
in Figure 16. The figure shows two equivalent representations of an HMM,
one as a BN and another as an UGM. They are equivalent because moral-
izing the BN introduces no edges, and because the moralized HMM graph
is already triangulated and therefore decomposable . The UGM on the
right is the result of moralizing the BN on the left. Interestingly, the same
graph describes the structure of a Kalman filter [70] where all variables are
continuous and Gaussian and all dependency implementations are linear.
Kalman filter operations are simply applications of the formulas for con-
ditional Gaussians (Section 3.1), used in order to infer conditional means
and covariances (the sufficient statistics for Gaussians).
FIG. 16. A hidden Markov model (HMM), viewed as a graphical model. Note that
an HMM may be equivalently viewed either as a directed (left) or an undirected (right)
model, as in this case the conditional independence properties are the same.
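Exact inference on this chain-structured graph reduces to the familiar forward recursion. The sketch below, with invented transition, initial and observation distributions, computes p(x_{1:T}) with rescaling to avoid underflow; it is an illustration of the generic graphical-model inference specialized to the HMM graph, not code from any ASR system.

```python
# Forward recursion for p(x_{1:T}) on the HMM graph of Figure 16.
import numpy as np

A = np.array([[0.9, 0.1],        # p(q_t | q_{t-1}); rows sum to one
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])        # p(q_1)
B = np.array([[0.7, 0.3],        # p(x_t | q_t) for a binary observation
              [0.1, 0.9]])

def forward_loglik(obs):
    alpha = pi * B[:, obs[0]]
    log_c = np.log(alpha.sum())
    alpha /= alpha.sum()                     # rescale to avoid underflow
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
        log_c += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_c                             # log p(x_{1:T})

print(forward_loglik([0, 0, 1, 1, 1, 0]))
```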
In other words, each state uses a mixture with components from this glob-
ally shared set of distributions. The GM for such an HMM loses an edge
between Q and X as shown on the right in Figure 17. In this case, all of
the represented dependence occurs via the hidden mixture variable at each
time.
FIG. 17. An HMM with mixture observation distributions (left) and a semi-
continuous HMM (right) .
only the variable Qt but also the variables Xt-l for l = 1,2, . .. , K for some
K . The case where K = 1 is shown in Figure 18. When the additional de-
pendencies are linear and Gaussian, these are sometimes called conditional
Gaussian HMMs [120].
    ẋ_t = ( Σ_{k=-K}^{K} k x_{t+k} ) / ( Σ_{k=-K}^{K} k² ),

where K in this case is the number of points used to fit the regression. This
can be viewed as a regression because

    x_t = Σ_{k=-K}^{K} a_k x_{t-k} + e.
This last equation follows because, observing the process to generate delta
features, ẋ_t is independent of everything else given its parents. The parents
of Ẋ_t are a subset of X_{1:T} and they do not include the hidden variables
Q_t. This leads to the GM on the left in Figure 19, a generative model
for HMMs augmented with delta features. Note that the edges between
the feature stream X_t and the delta feature stream Ẋ_t correspond to deterministic
linear implementations. In this view, delta features appear to
be similar to fixed-dependency auto-regressive HMMs (Figure 18), where
each child feature has additional parents both from the past and from the
future. In this figure, however, there are no edges between Ẋ_t and Q_t,
because Ẋ_t ⊥⊥ Q_t | parents(Ẋ_t). This means that parents(Ẋ_t) contain all the
information about Ẋ_t, and Q_t is irrelevant.
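The delta computation above is easy to state in code. The following sketch, with an arbitrary window length K and a random toy feature stream, applies the regression formula frame by frame.

```python
# Regression-based delta features: dx_t = sum_k k*x_{t+k} / sum_k k^2.
import numpy as np

def delta(features, K=2):
    T = len(features)
    denom = sum(k * k for k in range(-K, K + 1))
    padded = np.pad(features, ((K, K), (0, 0)), mode="edge")   # repeat endpoints
    return np.array([
        sum(k * padded[t + K + k] for k in range(-K, K + 1)) / denom
        for t in range(T)
    ])

x = np.random.default_rng(3).normal(size=(10, 4))   # 10 frames, 4-dim features
print(delta(x).shape)                                # (10, 4): one delta per frame
```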
It is often asked why delta features help ASR performance as much
as they do. The left of Figure 19 does not portray the model typically
used with delta features. A goal of speech recognition is for the features
to contain as much information as possible about the underlying word
sequence as represented via the vector Q_{1:T}. The generative model on the
left in Figure 19 shows, however, that there is zero information between
Ẋ_t and Q_t. When the edges between Ẋ_t and its parents parents(Ẋ_t) are
FIG. 19. A GM-based explanation of why delta features work in HMM-based ASR
systems. The left figure gives a GM that shows the generative process of HMMs with
delta features. The right figure shows how delta features are typically used in an HMM
system, where the information between Ẋ_t and Q_t is greatly increased relative to the
left figure.
FIG. 20. An input-output HMM. The input X_{1:T} is transformed via integration over
a Markov chain Q_{1:T} into the output Y_{1:T}.
FIG. 21. A factorial HMM where there are multiple hidden Markov chains.
(4.1)    p(X_{1:T} = x_{1:T}) =
         Σ_τ Σ_{q_{1:τ}} Σ_{ℓ_{1:τ}} Π_{i=1}^{τ} p(x_{t(i,1)}, x_{t(i,2)}, ..., x_{t(i,ℓ_i)}, ℓ_i | q_i, τ) p(q_i | q_{i-1}, τ) p(τ),

where p(x_1, x_2, ..., x_ℓ | ℓ, q) is the length-ℓ segment distribution under hidden
Markov state q, and p(ℓ|q) is the explicit duration model for state q.
for a state specific distribution p(x|q). One of the first segment models [121]
is a generalization that allows observations in a segment to be additionally
dependent on a region within a segment:

    p(x_1, x_2, ..., x_ℓ | ℓ, q) = Π_{j=1}^{ℓ} p(x_j | r_j, q).
FIG. 24. A BN used to explicitly represent parameter tying. In this figure, the
straight edges correspond to deterministic implementations and the rippled edges corre-
spond to stochastic implementations.
phone). S_t does not determine the identity of the phone. Often, S_t will
be a monotonically increasing sequence of successive integers, where either
S_{t+1} = S_t (the value stays the same) or S_{t+1} = S_t + 1 (an increment occurs).
An increment occurs only if R_t = 1. R_t is a binary indicator variable that
has unity value only when a transition between successive phone positions
occurs. R_t is a true random variable and, depending on the phone (Q_t),
R_t will have a different binary distribution, thereby yielding the normal
geometric duration distributions found in HMMs. Q_t is a deterministic
function of the position S_t. A particular word might use a phone multiple
times (consider the phone /aa/ in the word "yamaha"). The variable S_t
sequences, say, from 1 through to 6 (the number of phones in "yamaha"),
and Q_t then gets the identity of the phone via a deterministic mapping
from S_t to Q_t for each position in the word (e.g., 1 maps to /y/, 2 maps to
/aa/, 3 maps to /m/, and so on). This general approach can be extended
to multiple hidden Markov chains, and to continuous speech recognition,
to provide graph structures that explicitly represent the control structures
needed for an ASR system [167, 13, 47].
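The S_t, R_t, Q_t construction can be sketched as follows; the pronunciations and phone-dependent transition probabilities are invented, and the sketch simply shows the position advancing only when the indicator fires, with the phone identity read off deterministically from the position.

```python
# Position S_t advances only when R_t = 1; Q_t is a deterministic lookup.
import numpy as np

word_phones = ["y", "aa", "m", "aa", "hh", "aa"]        # the word "yamaha"
p_advance = {"y": 0.4, "aa": 0.3, "m": 0.5, "hh": 0.6}  # phone-dependent p(R_t = 1)

def sample_alignment(rng=np.random.default_rng(4)):
    s, states = 0, []
    while s < len(word_phones):
        q = word_phones[s]                               # Q_t = deterministic f(S_t)
        states.append((s + 1, q))
        r = rng.random() < p_advance[q]                  # R_t ~ Bernoulli, depends on Q_t
        s += int(r)                                      # S_{t+1} = S_t or S_t + 1
    return states

for s, q in sample_alignment():
    print(f"position S_t={s}  phone Q_t=/{q}/")
```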
As mentioned above, factorial HMMs require a large expansion of the
state space and therefore a large number of parameters. A recently proposed
system that can model dependencies in a factorial HMM using many
fewer parameters is the mixed-memory Markov model [142]. Viewed
as a GM as in Figure 25, this model uses an additional hidden variable for
each time frame and chain. Each normal hidden variable possesses an additional
switching parent (as depicted by dotted edges in the figure, and as
described in Section 2.2). The switching conditional independence assumptions
for one time slice are that Q_t ⊥⊥ R_{t-1} | {S_t = 0}, Q_t ⊥⊥ Q_{t-1} | {S_t = 1}, and
the symmetric relations for R_t. This leads to the following distributional
simplification:
FIG. 25. A mixed-memory hidden Markov model. The dashed edges indicate that
the S and the W nodes are switching parents.
A Buried Markov model (BMM) [16, 15, 14] is another recently pro-
posed GM-based approach to speech recognition . A BMM is based on the
idea that one can quantitatively measure where the conditional indepen-
dence properties of a particular HMM are poorly representing a corpus of
data. Wherever the model is found to be most lacking, additional edges
are added (i.e., conditional independence properties are removed) relative
to the original HMM. The BMM is formed to include only those data-
derived, sparse, hidden-variable specific, and discriminative dependencies
(between observation vectors) that are most lacking in the original model.
In general, the degree to which X_{t-1} ⊥⊥ X_t | Q_t is true can be measured using
conditional mutual information I(X_{t-1}; X_t | Q_t) [33]. If this quantity
is zero, the model needs no extension, but if it is greater than zero, there
is a modeling inaccuracy. Ideally, however, edges should be added dis-
criminatively, to produce a discriminative generative model, and when the
structure is formed discriminatively, the notion has been termed structural
discriminability [16, 47, 166, 47]. For this purpose, the "EAR" (explaining-away
residual) measure has been defined, which measures the discriminative
mutual information between a variable X and its potential set of parents
Z as

    EAR(X, Z) = I(X; Z | Q) − I(X; Z).
FIG. 26. A Buried Markov Model (BMM) with two hidden Markov chain assignments,
Q_{1:T} = q_{1:T} on the left, and Q_{1:T} = q'_{1:T} on the right.
It seems apparent at this point that the set of models that can be
described using a graph is enormous. With the options that are available
in choosing hidden variables, the different sets of dependencies between
those hidden variables, the dependencies between observations, choosing
switching dependencies, and considering the variety of different possible
implementations of those dependencies and the various learning techniques,
it is obvious that the space of possible models is practically unlimited.
Moreover, each of these modeling possibilities, if seen outside of the GM
paradigm, requires a large software development effort before evaluation
is possible with a large ASR system. This effort must be spent without
having any guarantees as to the model's success.
In answer to these issues, a new flexible GM-based software toolkit has
been developed (GMTK) [13]. GMTK is a graphical models toolkit that has
been optimized for ASR and other time-series processing tasks. It supports
EM and GEM parameter training, sparse linear and non-linear dependencies
between observations, arbitrary parameter sharing, Gaussian vanishing
and splitting, decision-tree implementations of dependencies, sampling,
switching parent functionality, exact and log-space inference, multi-rate
and multi-stream processing, and a textual graph programming language.
The toolkit supports structural discriminability and arbitrary model selection,
and makes it much easier to begin to experiment with GM-based ASR
systems.
6. Conclusion. This paper has provided an introductory survey of
graphical models, and then has provided a number of examples of how
many existing ASR techniques can be viewed as instances of GMs. It is
hoped that this paper will help to fuel the use of GMs for further speech
recognition research. While the number of ASR models described in this
document is large, it is of course the case that many existing ASR tech-
niques have not even been given a mention. Nevertheless, it is apparent
that ASR collectively occupies a relatively minor portion of the space of
models representable by a graph. It therefore seems quite improbable that
a thorough exploration of the space of graphical models would not ulti-
mately yield a model th at performs better than the HMM. The search
for such a novel model should ideally occur on multiple fronts: on the
one hand guided by our high-level domain knowledge about speech and
thereby utilize phonetics, linguistics, psycho-acoustics , and so on. On the
other hand, the data should have a strong say, so there should be signifi-
cant data-driven model selection procedures to determine the appropriate
natural graph structure [10] . And since ASR is inherently an instance of
pattern classification, the notion of discriminability (parameter training)
and structural discriminability (structure learning) might playa key role
in this search. All in all, graphical models opens many doors to novel
speech recognition research.
REFERENCES
[1] A.V. AHO, R. SETHI, AND J.D. ULLMAN. Compilers: Principles, Techniques and Tools. Addison-Wesley, Inc., Reading, Mass., 1986.
[2] S.M. AJI AND R.J. MCELIECE. The generalized distributive law. IEEE Transactions on Information Theory, 46:325-343, March 2000.
[3] T. ANASTASAKOS, J. MCDONOUGH, R. SCHWARTZ, AND J. MAKHOUL. A compact model for speaker adaptive training. In Proc. Int. Conf. on Spoken Language Processing, pp. 1137-1140, 1996.
[4] T.W. ANDERSON. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics, 1974.
[5] J.J. ATICK. Could information theory provide an ecological theory of sensory processing? Network, 3:213-251, 1992.
[6] H. ATTIAS. Independent Factor Analysis. Neural Computation, 11(4):803-851, 1999.
[7] A.J. BELL AND T.J. SEJNOWSKI. An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.
[8] Y. BENGIO. Markovian models for sequential data. Neural Computing Surveys, 2:129-162, 1999.
[9] A.L. BERGER, S.A. DELLA PIETRA, AND V.J. DELLA PIETRA. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[10] J. BILMES. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999.
[34] R.G. COWELL, A.P. DAWID, S.L. LAURITZEN, AND D.J. SPIEGELHALTER. Probabilistic Networks and Expert Systems. Springer-Verlag, 1999.
[35] P. DAGUM AND M. LUBY. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141-153, 1993.
[36] Data mining and knowledge discovery. Kluwer Academic Publishers. Maritime Institute of Technology, Maryland.
[37] A.P. DAWID. Conditional independence in statistical theory. Journal of the Royal Statistical Society B, 41(1):1-31, 1989.
[38] T. DEAN AND K. KANAZAWA. Probabilistic temporal reasoning. AAAI, pp. 524-528, 1988.
[39] J.R. DELLER, J.G. PROAKIS, AND J.H.L. HANSEN. Discrete-time Processing of Speech Signals. MacMillan, 1993.
[40] A.P. DEMPSTER, N.M. LAIRD, AND D.B. RUBIN. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39, 1977.
[41] L. DENG, M. AKSMANOVIC, D. SUN, AND J. WU. Speech recognition using hidden Markov models with polynomial regression functions as non-stationary states. IEEE Trans. on Speech and Audio Proc., 2(4):101-119, 1994.
[42] L. DENG AND C. RATHINAVELU. A Markov model containing state-conditioned second-order non-stationarity: application to speech recognition. Computer Speech and Language, 9(1):63-86, January 1995.
[43] M. DEVIREN AND K. DAOUDI. Structure learning of dynamic Bayesian networks in speech recognition. In European Conf. on Speech Communication and Technology (Eurospeech), 2001.
[44] R.O. DUDA, P.E. HART, AND D.G. STORK. Pattern Classification. John Wiley and Sons, Inc., 2000.
[45] E. EIDE. Automatic modeling of pronunciation variations. In European Conf. on Speech Communication and Technology (Eurospeech), 6th, 1999.
[46] K. ELENIUS AND M. BLOMBERG. Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 535-538, 1982.
[47] J. BILMES et al. Discriminatively structured graphical models for speech recognition: JHU-WS-2001 final workshop report. Technical report, CLSP, Johns Hopkins University, Baltimore MD, 2001.
[48] J.G. FISCUS. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, California, 1997.
[49] R.A. FISHER. The use of multiple measurements in taxonomic problems. Ann. Eugen., 7:179-188, 1936.
[50] R. FLETCHER. Practical Methods of Optimization. John Wiley & Sons, New York, NY, 1980.
[51] V. FONTAINE, C. RIS, AND J.-M. BOITE. Nonlinear discriminant analysis for improved speech recognition. In European Conf. on Speech Communication and Technology (Eurospeech), 5th, pp. 2071-2074, 1997.
[52] E. FOSLER-LUSSIER. Dynamic Pronunciation Models for Automatic Speech Recognition. PhD thesis, University of California, Berkeley, 1999.
[53] B. FREY. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.
[54] J.H. FRIEDMAN. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-141, 1991.
[55] N. FRIEDMAN AND M. GOLDSZMIDT. Learning in Graphical Models, chapter Learning Bayesian Networks with Local Structure. Kluwer Academic Publishers, 1998.
[78] H. HERMANSKY, D. ELLIS, AND S. SHARMA. Tandem connectionist feature stream extraction for conventional HMM systems. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000.
[79] J. HERTZ, A. KROGH, AND R.G. PALMER. Introduction to the Theory of Neural Computation. Allan M. Wylde, 1991.
[80] X.D. HUANG, A. ACERO, AND H.-W. HON. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, 2001.
[81] T.S. JAAKKOLA AND M.I. JORDAN. Learning in Graphical Models, chapter Improving the Mean Field Approximations via the use of Mixture Distributions. Kluwer Academic Publishers, 1998.
[82] R.A. JACOBS. Methods for combining experts' probability assessments. Neural Computation, 7:867-888, 1995.
[83] F. JELINEK. Statistical Methods for Speech Recognition. MIT Press, 1997.
[84] F.V. JENSEN. An Introduction to Bayesian Networks. Springer-Verlag, 1996.
[85] M.I. JORDAN AND C.M. BISHOP, eds. An Introduction to Graphical Models. To be published, 200x.
[86] M.I. JORDAN, Z. GHAHRAMANI, T.S. JAAKKOLA, AND L.K. SAUL. Learning in Graphical Models, chapter An Introduction to Variational Methods for Graphical Models. Kluwer Academic Publishers, 1998.
[87] M.I. JORDAN AND R. JACOBS. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[88] B.-H. JUANG AND L.R. RABINER. Mixture autoregressive hidden Markov models for speech signals. IEEE Trans. Acoustics, Speech, and Signal Processing, 33(6):1404-1413, December 1985.
[89] D. JURAFSKY AND J.H. MARTIN. Speech and Language Processing. Prentice Hall, 2000.
[90] M. KADIRKAMANATHAN AND A.P. VARGA. Simultaneous model re-estimation from contaminated data by composed hidden Markov modeling. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 897-900, 1991.
[91] T. KAMM, G. ANDREOU, AND J. COHEN. Vocal tract normalization in speech recognition compensating for systematic speaker variability. In Proc. of the 15th Annual Speech Research Symposium, pp. 175-178. CLSP, Johns Hopkins University, 1995.
[92] J. KARHUNEN. Neural approaches to independent component analysis and source separation. In Proc. 4th European Symposium on Artificial Neural Networks (ESANN '96), 1996.
[93] P. KENNY, M. LENNIG, AND P. MERMELSTEIN. A linear predictive HMM for vector-valued observations with applications to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(2):220-225, February 1990.
[94] B.E.D. KINGSBURY AND N. MORGAN. Recognizing reverberant speech with RASTA-PLP. Proceedings ICASSP-97, 1997.
[95] K. KIRCHHOFF. Combining acoustic and articulatory information for speech recognition in noisy and reverberant environments. In Proceedings of the International Conference on Spoken Language Processing, 1998.
[96] K. KIRCHHOFF AND J. BILMES. Dynamic classifier combination in hybrid speech recognition systems using utterance-level confidence values. Proceedings ICASSP-99, pp. 693-696, 1999.
[97] J. KITTLER, M. HATEF, R.P.W. DUIN, AND J. MATAS. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-239, 1998.
[98] U. KJAERULFF. Triangulation of graphs - algorithms giving small total space. Technical Report R90-09, Department of Mathematics and Computer Science, Aalborg University, 1990.
[99] P. KRAUSE. Learning probabilistic networks. Philips Research Labs Tech. Report, 1998.
[100] A. KROGH AND J. VEDELSBY. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems 7. MIT Press, 1995.
[101] F.R. KSCHISCHANG, B. FREY, AND H.-A. LOELIGER. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory, 47(2):498-519, 2001.
[102] N. KUMAR. Investigation of Silicon Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition. PhD thesis, Johns Hopkins University, 1997.
[103] S.L. LAURITZEN. Graphical Models. Oxford Science Publications, 1996.
[104] C.J. LEGGETTER AND P.C. WOODLAND. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9:171-185, 1995.
[105] E. LEVIN. Word recognition using hidden control neural architecture. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 433-436. IEEE, 1990.
[106] E. LEVIN. Hidden control neural architecture modeling of nonlinear time varying systems and its applications. IEEE Trans. on Neural Networks, 4(1):109-116, January 1992.
[107] H. LINHART AND W. ZUCCHINI. Model Selection. Wiley, 1986.
[108] B.T. LOGAN AND P.J. MORENO. Factorial HMMs for acoustic modeling. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1998.
[109] D.J.C. MACKAY. Learning in Graphical Models, chapter Introduction to Monte Carlo Methods. Kluwer Academic Publishers, 1998.
[110] J. MAKHOUL. Linear prediction: A tutorial review. Proc. IEEE, 63:561-580, April 1975.
[111] K.V. MARDIA, J.T. KENT, AND J.M. BIBBY. Multivariate Analysis. Academic Press, 1979.
[112] G.J. MCLACHLAN. Finite Mixture Models. Wiley Series in Probability and Statistics, 2000.
[113] G.J. MCLACHLAN AND T. KRISHNAN. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics, 1997.
[114] C. MEEK. Causal inference and causal explanation with background knowledge. In Besnard, Philippe and Steve Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI'95), pp. 403-410, San Francisco, CA, USA, August 1995. Morgan Kaufmann Publishers.
[115] M. MEILA. Learning with Mixtures of Trees. PhD thesis, MIT, 1999.
[116] M. MOHRI, F.C.N. PEREIRA, AND M. RILEY. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231(1):17-32, 2000.
[117] N. MORGAN AND B. GOLD. Speech and Audio Signal Processing. John Wiley and Sons, 1999.
[118] H. NEY, U. ESSEN, AND R. KNESER. On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language, 8:1-38, 1994.
[119] H.J. NOCK AND S.J. YOUNG. Loosely-coupled HMMs for ASR. In Proc. Int. Conf. on Spoken Language Processing, Beijing, China, 2000.
[120] M. OSTENDORF, V. DIGALAKIS, AND O. KIMBALL. From HMM's to segment models: A unified view of stochastic modeling for speech recognition. IEEE Trans. Speech and Audio Proc., 4(5), September 1996.
[121] M. OSTENDORF, A. KANNAN, O. KIMBALL, AND J. ROHLICEK. Continuous word recognition based on the stochastic segment model. Proc. DARPA Workshop CSR, 1992.
[122] J. PEARL. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988.
[123] J. PEARL. Causality. Cambridge, 2000.
[124] M.P. PERRONE AND L.N. COOPER. When networks disagree: ensemble methods for hybrid neural networks. In R.J. Mammone, editor, Neural Networks for Speech and Image Processing, Chapter 10, 1993.
[125] J. PICONE, S. PIKE, R. REGAN, T. KAMM, J. BRIDLE, L. DENG, Z. MA, H. RICHARDS, AND M. SCHUSTER. Initial evaluation of hidden dynamic models on conversational speech. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1999.
[126] S.D. PIETRA, V.D. PIETRA, AND J. LAFFERTY. Inducing features of random fields. Technical Report CMU-CS-95-144, CMU, May 1995.
[127] A.B. PORITZ. Linear predictive hidden Markov models and the speech signal. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 1291-1294, 1982.
[128] A.B. PORITZ. Hidden Markov models: A guided tour. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 7-13, 1988.
[129] L.R. RABINER AND B.-H. JUANG. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, 1993.
[130] L.R. RABINER AND B.-H. JUANG. An introduction to hidden Markov models. IEEE ASSP Magazine, 1986.
[131] M. RICHARDSON, J. BILMES, AND C. DIORIO. Hidden-articulator Markov models for speech recognition. In Proc. of the ISCA ITRW ASR2000 Workshop, Paris, France, 2000. LIMSI-CNRS.
[132] M. RICHARDSON, J. BILMES, AND C. DIORIO. Hidden-articulator Markov models: Performance improvements and robustness to noise. In Proc. Int. Conf. on Spoken Language Processing, Beijing, China, 2000.
[133] T.S. RICHARDSON. Learning in Graphical Models, chapter Chain Graphs and Symmetric Associations. Kluwer Academic Publishers, 1998.
[134] M.D. RILEY. A statistical model for generating pronunciation networks. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 737-740, 1991.
[135] R.T. ROCKAFELLAR. Convex Analysis. Princeton, 1970.
[136] R. ROSENFELD. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, School of Computer Science, CMU, Pittsburgh, PA, April 1994.
[137] R. ROSENFELD. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8), 2000.
[138] R. ROSENFELD, S.F. CHEN, AND X. ZHU. Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. Computer Speech and Language, 15(1), 2001.
[139] D.B. ROWE. Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. CRC Press, Boca Raton, FL, 2002.
[140] S. ROWEIS AND Z. GHAHRAMANI. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999.
[141] L.K. SAUL, T. JAAKKOLA, AND M.I. JORDAN. Mean field theory for sigmoid belief networks. JAIR, 4:61-76, 1996.
[142] L.K. SAUL AND M.I. JORDAN. Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 1999.
[143] R.D. SHACHTER. Bayes-ball: The rational pastime for determining irrelevance and requisite information in belief networks and influence diagrams. In Uncertainty in Artificial Intelligence, 1998.
[144] P. SMYTH, D. HECKERMAN, AND M.I. JORDAN. Probabilistic independence networks for hidden Markov probability models. Technical Report A.I. Memo No. 1565, C.B.C.L. Memo No. 132, MIT AI Lab and CBCL, 1996.
[145] T. STEPHENSON, H. BOURLARD, S. BENGIO, AND A. MORRIS. Automatic speech recognition using dynamic Bayesian networks with both acoustic and artic-
(2)    c = Σ_{x∈S} h(x)

is not known in closed form and that S is too unwieldy for c to be found
numerically from the sum in (2). Nevertheless, our goal is to compute
expectations of particular functions g under π; that is, we require

(3)    E_π g = Σ_{x∈S} g(x) π(x)

for any relevant g, where again the summation in equation (3) cannot be
evaluated directly.
As an especially important special case, note that (3) includes the
probability of any particular event concerning X. Explicitly, for any relevant
subset B of S,

(4)    Pr(X ∈ B) = E_π 1[X ∈ B] = Σ_{x∈B} π(x).
(5)    ḡ = (1/m) Σ_{t=1}^{m} g(x^(t)).
This is unbiased for E_π g and its sampling variance can be assessed in the
usual way.
Thinking ahead, we remark that (5) may provide an approximation
to E_π g even when (x^(1), ..., x^(m)) is not a random sample from π. In
particular, this occurs when m is sufficiently large and x^(1), x^(2), ..., seeded
by some x^(0) ∈ S, are successive observations from a Markov chain with
(finite) state space S and limiting distribution π. This extension provides
the basis of MCMC when random sampling from π is no longer feasible.
It requires that useful general recipes exist for constructing appropriate
Markov chains, as in Section 3.
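A toy illustration of the estimate (5) follows; the small finite distribution is invented so that the exact expectation is available for comparison.

```python
# Ordinary Monte Carlo: average g over draws from pi and compare with the
# exact expectation, computable here only because S is small.
import numpy as np

rng = np.random.default_rng(5)
S = np.arange(6)                       # a small finite state space
h = np.array([1.0, 2.0, 4.0, 2.0, 1.0, 0.5])
pi = h / h.sum()                       # pi(x) = h(x)/c with c known here

g = lambda x: x ** 2
exact = np.sum(g(S) * pi)

m = 100_000
sample = rng.choice(S, size=m, p=pi)   # random sample from pi
estimate = g(sample).mean()            # equation (5)

print(exact, estimate)
```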
2.2. Bayesian computation. For a description of parametric Bayes-
ian inference, see e.g. Gelman et al. (1995). Here we begin in the simplest
possible context. Thus, let x now denote a meaningful constant whose value
we wish to estimate. Suppose we know that this parameter x lies in a finite
space S and that our initial beliefs about its value can be represented by
a prior p.m.f. {p(x) : x ∈ S}. If y denotes relevant discrete data, then the
probability of y given x, viewed as a function of x, is called the likelihood
L(y|x). In the Bayesian paradigm, the prior information and the likelihood
are combined via Bayes theorem to produce the posterior p.m.f.

(6)    π(x|y) ∝ L(y|x) p(x)

for x given y. All inferences are based on (6) and require the evaluation
of corresponding expectations. In terms of (1), (2), (3) and (4), we replace
π(x) by π(x|y) and let h(x) be proportional to L(y|x)p(x). Note that
L(y|x) and p(x) need be known only up to scale.
Unfortunately, it is rarely possible to calculate the expectations directly.
A potential remedy is to generate a large random sample x^(1), ...,
x^(m) from the p.m.f. (6) and then use (5) to approximate the corresponding
posterior mean and variance, to evaluate posterior probabilities about
x and to construct corresponding credible intervals. In principle, this approach
extends to multicomponent parameters, but it is generally impossible
to implement because of the difficulty of sampling directly from complex
multivariate p.m.f.'s. This hindered applied Bayesian inference until it was
recognized that ordinary Monte Carlo could be replaced by MCMC.
As an example of how the availability of random samples from π(x|y)
would permit trivial solutions to ostensibly very complicated problems,
consider a clinical, industrial or agricultural trial in which the aim is to
compare different treatment effects θ_i. Then x = (θ, φ), where θ is the vector
of θ_i's and φ is a vector of other, possibly uninteresting, parameters in
the posterior distribution. A natural quantity of interest from a Bayesian
perspective is the posterior probability that any particular treatment effect
is best or is among the best three, say, where here we suppose best to mean
having the largest effect. Such calculations are usually far beyond the capabilities
of conventional numerical methods, because they involve sums (or
integrals) of non-standard functions over awkward regions of the parameter
space S. However, in a sampling approach, we can closely approximate the
probability that treatment i is best, simply by the proportion of simulated
θ^(t)'s among which θ_i^(t) is the largest component; and the probability that
treatment i is one of the best three by the proportion of θ^(t)'s for which
θ_i^(t) is one of the largest three components. Note that the values obtained
for components that are not of immediate interest are simply ignored and
that this procedure is entirely rigorous.
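The calculation is immediate once posterior draws are available. In the sketch below the draws are simulated from an invented distribution, standing in for genuine posterior samples; the ranking probabilities are then just sample proportions.

```python
# Estimate P(treatment i is best | y) and P(treatment i in best three | y)
# directly from sampled theta vectors.
import numpy as np

rng = np.random.default_rng(6)
m, k = 10_000, 5                                     # draws, number of treatments
true_means = np.array([0.0, 0.3, 0.5, 0.45, 0.1])
theta = rng.normal(true_means, 0.4, size=(m, k))     # stand-in posterior sample

best = theta.argmax(axis=1)
p_best = np.bincount(best, minlength=k) / m          # P(treatment i is best | y)

ranks = (-theta).argsort(axis=1)
in_top3 = np.array([(ranks[:, :3] == i).any(axis=1).mean() for i in range(k)])

print("P(best)      :", np.round(p_best, 3))
print("P(in top 3)  :", np.round(in_top3, 3))
```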
Ranking and selection is just one area in which the availability of
random samples from posterior distributions would have had a profound
influence on applied Bayesian inference. Not only does MCMC deliver
what is beyond ordinary Monte Carlo methods but also it encourages the
investigator to build and analyze more realistic statistical models. Indeed,
one must sometimes resist the temptation to build representations whose
complexity cannot be justified by the underlying scientific problem or by
the available data.
2.2.1. Hidden Markov models. Basic hidden Markov models
(HMM's) are among the most complex formulations for which ordinary
Monte Carlo can be implemented. In addition to their popularity in speech
recognition (e.g. Rabiner, 1989; Juang and Rabiner, 1991), HMM's also
occur in neurophysiology (e.g. Fredkin and Rice, 1992), in computational
biology (e.g. Haussler et al., 1993; Eddy et al., 1995; Liu et al., 1995), in cli-
matology (e.g. Hughes et al., 1999), in epidemiologic surveillance (Le Strat
and Carrat, 1999), and elsewhere. For a useful survey, see MacDonald and
Zucchini (1997). Here, we briefly describe HMM's and the corresponding
role of the Baum et al. (1970) algorithm.
given x. Now suppose x_1, x_2, ... is (modeled as) the output of a Markov
chain, with transition probability q(x_i, x_{i+1}) of the i-th component x_i being
followed by x_{i+1}, so that the prior probability for x = (x_1, ..., x_n) is

(8)    p(x) = q(x_1) Π_{i=1}^{n-1} q(x_i, x_{i+1}),    x ∈ S = {0, 1, ..., s}^n,

where q(·) is the p.m.f. of x_1. Then the posterior probability of x, given y,
is

(9)    π(x|y) ∝ q(x_1) f(x_1, y_1) Π_{i=2}^{n} q(x_{i-1}, x_i) f(x_i, y_i),    x ∈ S.
The goal is to make inferences about the true x and perhaps about some
future x_i's. For example, we might require the marginal posterior modes
(MPM) estimate x* of x, defined by

(10)    x*_i = arg max_{x_i} π(x_i|y),    i = 1, ..., n,

where π(x_i|y) is the posterior marginal p.m.f. for x_i. Note here that x*
is generally distinct from the maximum a posteriori (MAP) estimate x̂,
defined by

(11)    x̂ = arg max_{x∈S} π(x|y).
than a physical model of the underlying process; and more important here,
a sampling approach is viable under almost any modification of the basic
HMM formulation, but with ordinary Monte Carlo replaced by MCMC,
for which the Baum algorithm retains its relevance if an HMM lives within
the overall formulation. We comment further at the end of Section 3.7; see
also Robert et al. (2000).
The Baum et al. (1970) recursions for (9) exploit the fact that π(x) ≡
π(x|y) inherits the Markov property, though its transition probabilities are
functions of y and therefore nonhomogeneous. Specifically,

(12)    π(x|y) = π(x_1|y) Π_{i=2}^{n} π(x_i | x_{i-1}, y_{≥i}),

where y_{≥i} = (y_i, ..., y_n), a form of notation we use freely below. The
factorization (12) provides access to the calculation of expected values and
to sampling from {π(x|y) : x ∈ S}, except that the conditional probabilities
in (12) are not immediately available. It can easily be shown that

(13)    π(x_{≥i} | x_{i-1}, y_{≥i}) ∝ Π_{k=i}^{n} q(x_{k-1}, x_k) f(x_k, y_k),    i = 2, ..., n,
which determines π(x_i | x_{i-1}, y_{≥i}) by summing over x_{>i}, but the summations
are impracticable unless n is tiny. The Baum algorithm avoids the problem
by using the results,
Then (14) and (15) are used forwards to calculate expectations or to sample
from π(x|y). Some care is needed in using (16) because the probabilities
quickly become vanishingly small. However, as they are required only up
to scale in (14) and (15), a dummy normalization can be carried out at
each stage to remedy the problem.
Example. Noisy binary channel. The noisy binary channel provides a
convenient numerical illustration of sampling via the Baum algorithm. We
additionally impose symmetry and stationarity, merely to ease the notation.
The noisy binary channel is not only the simplest HMM but is also a rather
special case of the autologistic distribution, which we consider in Section
3.4 and which is not generally amenable to the Baum algorithm.
Thus, suppose that both the hidden x_i's and the observed y_i's are
binary and that the posterior probability (9) of a true signal x ∈ S given
data y ∈ S, where S = {0, 1}^n, is

(17)    π(x|y) ∝ exp( α Σ_{i=1}^{n} 1[x_i = y_i] + β Σ_{i=1}^{n-1} 1[x_i = x_{i+1}] ),    x ∈ S,

where 1[·] again denotes the usual indicator function. Here α is the log-odds
of correct to incorrect transmission of each x_i and β is the log-odds
in favor of x_{i+1} = x_i. In particular, we set α = ln 4, corresponding to a
corruption probability of 0.2, and β = ln 3, so that x_{i+1} is a repeat of x_i
with probability 0.75.
As a trite example, suppose y = 11101100000100010111, so that |S| =
2^20 = 1048576. For such a tiny space, we can calculate expected values
simply by enumeration, but we also applied the Baum algorithm to generate
a random sample of size 10000 from π(x|y). First, consider the posterior
marginal probabilities for the x_i's. We obtained x_1 = 1 in 8989 of the
samples x^(t) and hence our estimate of the corresponding probability is
0.899, versus the exact value 0.896; for x_2 = 1, we obtained 0.927 versus
0.924; and so on. Hence, the MPM estimate x* of x, defined by (10),
is correctly identified as x* = 11111100000000010111. Clearly, x* is a
smoothed version of the data, with two fewer isolated bits. The x_i's for
positions i = 4, 12, 16 and 17 are the most doubtful, with estimated (exact)
probabilities of x_i = 1 equal to 0.530 (0.541), 0.421 (0.425), 0.570 (0.570)
and 0.434 (0.432). Note that neither component 16 nor 17 flips in the
MPM estimate but that, if we examine them as a single unit, the posterior
probabilities of 00, 10, 01 and 11 are 0.362 (0.360), 0.203 (0.207), 0.068
(0.070) and 0.366 (0.362), respectively. Thus, there is a preference for 00
or 11, rather than the 10 obtained in x*.
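For readers who wish to reproduce such numbers, the following sketch computes the exact posterior marginals by a Baum-type forward-backward recursion, under the assumption that the posterior takes the symmetric pairwise form displayed in (17) above; it is an illustration, not the author's program, and its output can be compared with the exact values quoted in the text.

```python
# Exact posterior marginals P(x_i = 1 | y) for the noisy binary channel by
# sum-product (forward-backward) message passing on the chain.
import numpy as np

y = np.array([int(b) for b in "11101100000100010111"])
alpha, beta = np.log(4.0), np.log(3.0)
n = len(y)

unary = np.exp(alpha * (np.arange(2)[None, :] == y[:, None]))   # shape (n, 2)
pair = np.exp(beta * np.eye(2))                                  # phi(x_i, x_{i+1})

fwd = np.ones((n, 2))
for i in range(1, n):
    fwd[i] = pair.T @ (fwd[i - 1] * unary[i - 1])
    fwd[i] /= fwd[i].sum()                                       # rescale for stability
bwd = np.ones((n, 2))
for i in range(n - 2, -1, -1):
    bwd[i] = pair @ (bwd[i + 1] * unary[i + 1])
    bwd[i] /= bwd[i].sum()

marg = unary * fwd * bwd
marg /= marg.sum(axis=1, keepdims=True)
print(np.round(marg[:, 1], 3))                                   # P(x_i = 1 | y)
print("".join(str(int(p > 0.5)) for p in marg[:, 1]))            # MPM estimate x*
```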
The previous point illustrates how an estimate may be affected by
choosing either a marginal or a multivariate criterion. Indeed, at the opposite
extreme to MPM is the MAP estimate (11), which here is equally
11111100000000011111 or 11111100000000000111, both of which are easily
seen to have the same posterior probability. In our random sample, they
were indeed the two most frequent configurations, occurring 288 and 323
times, respectively, compared with the exact probability 0.0304. Note that
x* and y itself occurred 138 and 25 times, compared with the exact probabilities
0.0135 and 0.0027. If one requires a single-shot estimate of x, then
the choice of a particular criterion, ultimately in the form of a loss function,
should depend on the practical goals of the analysis. For example,
the MAP estimate corresponds to zero loss for the correct x and unit loss
for any incorrect estimate, regardless of the number of errors among its
components; whereas MPM arises from a componentwise loss function and
minimizes the expected total number of errors among all the components.
The writer's own view is that a major benefit of a sampling approach is
Suppose we have a random sample x^(1), ..., x^(m) from π(x) = h(x)/c >
0 for x ∈ S but that our real interest lies in E_{π*} g, for some specific g, where

    π*(x) = h*(x)/c* > 0,    x ∈ S*,

so that, as our eventual approximation to E_{π*} g, we adopt the ratio estimate,

(19)    Σ_{t=1}^{m} w(x^(t)) g(x^(t)),

where

    w(x^(t)) = { h*(x^(t)) / h(x^(t)) } / Σ_{s=1}^{m} { h*(x^(s)) / h(x^(s)) }.

Note that the w(x^(t))'s are independent of g and are well defined because
S* ⊆ S. The estimate (19) is satisfactory if (5) is adequate for E_π g and
there are no large weights among the w(x^(t))'s. In practice, the latter condition
requires that h and h* are not too far apart. There are modifications
of the basic method described here that can extend its range (e.g. umbrella
sampling).
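A toy version of the ratio estimate (19) is sketched below; the target, the alternative distribution and the sample size are all invented, and the exact answer is computed directly for comparison.

```python
# Reweight draws from pi to approximate an expectation under pi*.
import numpy as np

rng = np.random.default_rng(7)
S = np.arange(10)
h = np.exp(-0.5 * (S - 4.0) ** 2)            # unnormalized pi
h_star = np.exp(-0.5 * (S - 5.0) ** 2)       # unnormalized pi*, same support

m = 50_000
x = rng.choice(S, size=m, p=h / h.sum())     # random sample from pi
g = x.astype(float) ** 2

w = h_star[x] / h[x]
w /= w.sum()                                  # self-normalized weights
estimate = np.sum(w * g)                      # equation (19)

exact = np.sum(S**2 * h_star / h_star.sum())
print(exact, estimate)
```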
2.4. Monte Carlo maximum likelihood estimation. Let x^(0) denote
an observation, generally a vector, from a p.m.f. π(x; θ) = h(x; θ)/c(θ),
with unknown parameter θ ∈ Θ, so that

(20)    sup_{θ∈Θ} π(x^(0); θ) / π(x^(0); θ̃) = sup_{θ∈Θ} { h(x^(0); θ) / h(x^(0); θ̃) } { c(θ̃) / c(θ) }

for a fixed reference value θ̃ ∈ Θ. The first quotient on the right-hand side of (20) is known and the second
can be approximated using (18), where c(θ), c(θ̃), h(x^(0); θ) and h(x^(0); θ̃)
play the roles of c*, c, h* and h, respectively. That is,

    c(θ) / c(θ̃) ≈ (1/m) Σ_{t=1}^{m} h(x^(t); θ) / h(x^(t); θ̃),

where x^(1), ..., x^(m) is a random sample from π(x; θ̃).
12, 16, 20, 21, 22, 23, 24, 25, there were 22290, 11928, 6791, 3826, 442, 30, 14,
0, 0, 2, 0, 0 discrepancies, respectively. Although still a toy example,
π(y|y) ≈ 5 × 10^{-324}, so the task was not entirely trivial from a sampling
perspective.
Of course, in the real world, it is typical that, if x cannot be found
directly, then nor can we generate draws from π_k(x). In that case, we
must implement an MCMC version in which successive π_k's in a single run
of the algorithm are sampled approximately rather than exactly. This requires
some care in selecting a "schedule" for how the m_k's in (21) should
increase, because the observation attributed to π_k must also serve as an
approximate draw from π_{k+1}. It is typical that eventually the m_k's must
increase extremely slowly, at a rate closer to logarithmic than to linear. Simulated
annealing can also be extended to continuous functions via Langevin
diffusion; see Geman and Hwang (1986).
3. Markov chain Monte Carlo calculations.
3.1. Markov chains, stationary distributions and ergodicity.
In ordinary Monte Carlo calculations, we require perfect draws from the
(22) t = 0,1, . . . ,
(23)    π P = π;
(24)    P = P_1 P_2 ⋯ P_n

(25)    P = (P_1 + P_2 + ⋯ + P_n)/n
(26)    Σ_{x∈S} π(x) P(x, x') = π(x')

for all x' ∈ S. However, we also need to avoid the generally intractable
summation over the state space S. We can achieve this by demanding
a much more stringent condition than general balance, namely detailed
balance,
(27)    π(x) P_k(x, x') = π(x') P_k(x', x)
for all x, x' ∈ S. Summing both sides of (27) over x ∈ S implies that
general balance is satisfied; moreover, detailed balance is much simpler to
confirm, particularly if we insist that P_k(x, x') = 0 = P_k(x', x) for the vast
majority of x, x' ∈ S. Also note that (27) need only be checked for x' ≠ x,
which is helpful in practice because the diagonal elements of P_k are often
quite complicated. The physical significance of (27) is that, if a stationary
Markov chain ..., X^(-1), X^(0), X^(1), ... satisfies detailed balance, then it is
time reversible, which means that it is impossible to tell whether a film of
a sample path is being shown forwards or backwards.
It is clear that, if P_1, ..., P_n individually satisfy detailed balance with
respect to π, then so does P in (25). Time reversibility is not inherited in
the same way by P in (24), but it can easily be resurrected by assembling
the P_k's as a random rather than as a fixed permutation at each stage.
The maintenance of time reversibility has some theoretical advantages (e.g.
the Central Limit Theorem of Kipnis and Varadhan, 1986, and the Initial
Sequence Estimators of Geyer, 1992) and is worthwhile in practice if it adds
a negligible computational burden.
3.3. Hastings algorithms. Hastings (1970) provides a remarkably
simple general construction of t.p.m.'s P_k satisfying detailed balance (27)
with respect to π. Thus, let R_k be any Markov t.p.m. having state space S
and elements R_k(x, x'), say. Now define the off-diagonal elements of P_k by

(28)    P_k(x, x') = R_k(x, x') A_k(x, x'),    x' ≠ x ∈ S,

where A_k(x, x') = min{1, π(x') R_k(x', x) / (π(x) R_k(x, x'))} is an acceptance
probability, with P_k(x, x) obtained by subtraction to ensure that P_k has unit row sums,
which is achievable since R_k is itself a t.p.m. Then, to verify that detailed
balance (27) is satisfied for x' ≠ x, either P_k(x, x') = 0 = P_k(x', x)
and there is nothing to prove, or else direct substitution of (28) produces
min{π(x) R_k(x, x'), π(x') R_k(x', x)} on both sides of the equation. Thus,
π is a stationary distribution for P_k, despite the arbitrary choice of R_k,
though note that we might as well have insisted that zeros in R_k occur
symmetrically. Note also that P_k depends on π only through h(x) in (1)
and that the usually unknown and problematic normalizing constant c cancels
out. Of course, that is not quite the end of the story: it is necessary to
check that P, obtained via an amalgamation of different P_k's, is sufficiently
rich to guarantee irreducibility with respect to π, but usually this is simple
to ensure in any particular case.
Operationally, any P_k is applied as follows. When in state x, a proposal
x* for the subsequent state x' is generated with probability R_k(x, x*).
This requires calculating the non-zero elements in row x of R_k on the fly,
rather than storing any matrices. Then either x' = x*, with the acceptance
probability A_k(x, x*), or else x' = x is retained as the next state of the
chain. Note that (28) does not apply to the diagonal elements of P_k: two
successive states x and x' can be the same either because x happens to be
proposed as the new state or because some other state x* is proposed but is
not accepted. Also note that the procedure differs from ordinary rejection
sampling, where proposals x* are made until one is accepted, which is not
valid in MCMC.
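The mechanics of a Hastings update can be sketched in a few lines. The target and the symmetric random-walk proposal below are invented; because the proposal is symmetric, the R_k terms cancel and the acceptance probability reduces to a ratio of h values, so the unknown constant c never appears.

```python
# Hastings sampler on a small finite state space, using only h(x).
import numpy as np

rng = np.random.default_rng(8)
S = np.arange(8)
h = np.array([1, 3, 6, 10, 6, 3, 1, 0.5], dtype=float)   # unnormalized target

def propose(x):
    # Simple symmetric random-walk proposal R(x, x*) on {0,...,7}.
    return (x + rng.choice([-1, 1])) % len(S)

def hastings_step(x):
    x_star = propose(x)
    accept = min(1.0, h[x_star] / h[x])   # symmetric R, so the R terms cancel
    return x_star if rng.random() < accept else x

x, draws = 0, []
for t in range(200_000):
    x = hastings_step(x)
    if t >= 10_000:                       # discard a burn-in period
        draws.append(x)

print(np.round(np.bincount(draws, minlength=len(S)) / len(draws), 3))
print(np.round(h / h.sum(), 3))           # should be close
```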
3.4. Componentwise Hastings algorithms. In practice, we still
need to choose a particular set of R_k's. It is important that proposals and
decisions on their acceptance are simple and fast to make. We now openly
acknowledge that X has many components and write X = (X_1, ..., X_n),
where each X_i is univariate (though this is not essential). Then, the most
common approach is to devise an algorithm in which a proposal matrix R_i is
assigned to each individual component X_i. That is, if x is the current state,
then R_i proposes replacing the i-th component x_i by x_i*, while leaving the
remainder x_{-i} of x unaltered. Note that we can also allow some continuous
components: then the corresponding R_i's and P_i's become transition kernels
rather than matrices and have elements that are conditional densities
rather than probabilities. Although the underlying Markov chain theory
which identifies the crucial role played by the full conditionals π(x_i|x_{-i}).
Note that these n univariate distributions comprise the basic building
blocks of Markov random field formulations in spatial statistics (Besag,
1974), where formerly they were called the local characteristics of X.
The full conditionals for any particular π(x) follow from the trivial
but, at first sight, slightly strange-looking result,

(31)    π(x_i | x_{-i}) = π(x) / Σ_{x_i'} π(x_1, ..., x_{i-1}, x_i', x_{i+1}, ..., x_n),

where the normalizing constant involves only a one-dimensional summation
over x_i. Even this drops out in the ratio (30) and, usually, so do many other
terms, because likelihoods, priors and posteriors are typically formed from
products and then only those factors in (31) that involve x_i itself need to
be retained. Such cancelations imply enormous computational savings.
In terms of Markov random fields, the neighbors ∂i of i comprise the
minimal subset of -i such that π(x_i|x_{-i}) = π(x_i|x_{∂i}). Under a mild positivity
condition (see Section 3.5), it can be shown that, if j ∈ ∂i, then
i ∈ ∂j, so that the n neighborhoods define an undirected graph in which
there is an edge between i and j if they are neighbors. Similar considerations
arise in graphical models (e.g. Lauritzen, 1996) and Bayesian networks
(e.g. Pearl, 2000) in constructing the implied undirected graph from a directed
acyclic graph or from a chain graph, for example. Note that conventional
dynamic simulation makes use of directed graphs, whereas MCMC is
based on undirected representations or a mix of the two, as in space-time
(chain graph) models, for example. Generally speaking, dynamic simulation
should be used and componentwise MCMC avoided wherever possible.
Example. Autologistic and related distributions. The autologistic distribution
(Besag, 1974) is a pairwise-interaction Markov random field for
dependent binary r.v.'s. It includes binary Markov chains, noisy binary
channels and finite-lattice Ising models as special cases, so that simulation
without MCMC can range from trivial to taxing to (as yet) impossible.
We define X = (X_1, ..., X_n) to have an autologistic distribution if its
p.m.f. is
where the indices i and j run from 1 to n and the β_ij's control the dependence
in the system. The simplification with respect to a saturated
binary model is that no terms involve interactions between three or more
components. The autologistic model also appears under other names: in
Cox and Wermuth (1994), as the quadratic exponential binary distribution
and, in Jordan et al. (1998), as the Boltzmann distribution, after Hinton
and Sejnowski (1986).
It is convenient to define β_ij = β_ji for i > j. Then, as regards the special
cases mentioned above, the r.v.'s X_1, ..., X_n form a simple symmetric
Markov chain if α_i = 0 and β_ij = β for |i - j| = 1, with β_ij = 0 otherwise.
The noisy binary channel (17) is obtained when π(x) becomes π(x|y) with
y fixed, α_i = (2y_i - 1)α and β_ij = β whenever |i - j| = 1, with β_ij = 0
otherwise. For the symmetric Ising model, the indices i are identified with
the sites of a finite d-dimensional regular array, α_i = 0 for all i and β_ij = β
for each pair of adjacent sites, with β_ij = 0 otherwise. In each case, the
asymmetric version is also catered for by (32).
It follows from (31) that the full conditional of x_i in the distribution
(32) is
so that i and j are neighbors if and only if β_ij ≠ 0. For example, in the
noisy binary channel (17)
(35)
so that the quotient in (30) has value 1 and the proposals are always accepted.
The n individual P_i's can then be combined as in (24), producing
a systematic scan of all n components, or as in (25), giving a random scan,
or otherwise. Systematic and random scan Gibbs samplers are aperiodic,
because R_i(x, x) > 0 for any x ∈ S; and they are irreducible under the positivity
condition that the minimal support S of X is the Cartesian product
of the minimal supports of the individual X_i's. Positivity holds in most
practical applications and can be relaxed somewhat (Besag, 1994b) to cater
for exceptions. To see its relevance, consider the trite example in which
X = (X_1, X_2) and S = {00, 11}, so that no movement is possible using a
componentwise updating algorithm. On the other hand, if S = {00, 01, 11},
then positivity is violated but both the systematic and random scan Gibbs
samplers are irreducible. Severe problems occur most frequently in constrained
formulations and can be tackled by using block updates of more
than one component at a time, to which we return in Section 3.7, or by
augmenting S by dummy states (e.g. {01} in the above example).
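A systematic-scan Gibbs sampler for the noisy binary channel is easily written down. The sketch below uses the full conditional implied by the pairwise form of (17) given earlier (the chapter's own equation (34) is not reproduced here), and estimates the posterior marginals for the example data.

```python
# Systematic-scan Gibbs sampler for the noisy binary channel posterior.
import numpy as np

rng = np.random.default_rng(9)
y = np.array([int(b) for b in "11101100000100010111"])
alpha, beta = np.log(4.0), np.log(3.0)
n = len(y)

def gibbs_sweep(x):
    # Update each x_i from its full conditional, which involves only y_i and
    # the neighboring components x_{i-1}, x_{i+1}.
    for i in range(n):
        logits = np.zeros(2)
        for v in (0, 1):
            s = alpha * (v == y[i])
            if i > 0:
                s += beta * (v == x[i - 1])
            if i < n - 1:
                s += beta * (v == x[i + 1])
            logits[v] = s
        p1 = 1.0 / (1.0 + np.exp(logits[0] - logits[1]))
        x[i] = int(rng.random() < p1)
    return x

x = y.copy()                       # initialize at the data
burn_in, m = 1000, 10000
for _ in range(burn_in):
    gibbs_sweep(x)
counts = np.zeros(n)
for _ in range(m):
    gibbs_sweep(x)
    counts += x
print(np.round(counts / m, 3))     # estimated P(x_i = 1 | y)
```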
Although the validity of the Gibbs sampler is ensured by the general
theory for Hastings algorithms, there is a more direct justification, which
formalizes the argument that, if X has distribution π and any of its components
is replaced by one sampled from the corresponding full conditional
induced by π, to produce a new vector X', then X' must also have distribution
π. That is, if x' differs from x in its i-th component at most, so that
x'_{-i} = x_{-i}, then
(36)
larly, if π(x_A | x_{-A}) is a hidden Markov chain, then the Baum algorithm can
be exploited when updating X_A. It may also be feasible to manufacture
block Gibbs updates by using special recursions (Bartolucci and Besag,
2002).
3.8. Perfect MCMC simulation. We have noted that it may be difficult
to determine a suitable burn-in period for MCMC. Perfect MCMC
simulation solves the problem by producing an initial draw that is exactly
from the target distribution π. The original method, called monotone coupling
from the past (Propp and Wilson, 1996), in effect runs the chain from
the infinite past and samples it at time zero, so that complete convergence
is assured. This sounds bizarre but can be achieved in several important
special cases, including the Ising model, even at its critical temperature on
very large arrays (e.g. 2000 × 2000). Perfect MCMC simulation is a very
active research area; see

http://research.microsoft.com/~dbwilson/exact/

Here we focus on coupling from the past (CFTP) and its implementation
when π is the posterior distribution (17) for the noisy binary channel.
We assume that π has finite support S. We can interpret burn-in
of fixed length m_0 as running our ergodic t.p.m. P forwards from time
-m_0 and ignoring the output until time 0. Now imagine that, instead of
doing this from a single state at time -m_0, we do it from every state in
S, using the identical stream of random numbers in every case, with the
consequence that, if any two paths ever enter the same state, then they
coalesce permanently. In fact, since S is finite, we can be certain that, if
m_0 is large enough, coalescence of all paths will occur by time 0 and we
obtain the same state x(0) regardless of x(-m_0). It also ensures that we will
obtain x(0) if we run the chain from any state arbitrarily far back in time,
so long as we use the identical random number stream during the final m_0
steps. Hence x(0) is a random draw from π. It is crucial in this argument
that the timing of the eventual draw is fixed. If we run the chain forwards
from every possible initialization at time 0 and wait for all the paths to
coalesce, we obtain a random stopping time and a corresponding bias in
the eventual state. As an extreme example, suppose that P(x', x'') = 1 for
two particular states x' and x'' but that P(x, x'') = 0 for all x ≠ x'. Then
π(x'') = π(x') but the state at coalescence is never x''.
At first sight, useful implementation of the above idea seems hopeless. Unless S is tiny, it is not feasible to run the chain from every state in S even for m_0 = 1. However, we can sometimes find a monotonicity in the paths which allows us to conclude that coalescence from certain extremal states implies coalescence from everywhere. We discuss this here merely for the noisy binary channel but the reasoning is identical to that in Propp and Wilson (1996) for the ostensibly much harder Ising model and also extends immediately to the general autologistic model (32), provided the β_ij's are non-negative.
Thus, again consider the posterior distribution (17), with α and β > 0 known. We have seen already that it is easy to implement a systematic scan Gibbs sampler based on (34). We presume that the usual inverse distribution function method is used at every stage: that is, when addressing component X_i, we generate a uniform deviate on the unit interval and, if its value exceeds the probability for X_i = 0 implied by (34), we set the new x_i = 1, else x_i = 0.
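To make the update rule concrete, here is a minimal Python sketch of one systematic scan; since (17) is not reproduced above, it assumes the standard form of the noisy binary channel posterior, π(x | y) ∝ exp{α Σ_i 1[x_i = y_i] + β Σ_i 1[x_i = x_{i+1}]}, together with the inverse distribution function rule just described (set the new x_i = 1 exactly when the uniform deviate exceeds the full-conditional probability that x_i = 0).

import numpy as np

def gibbs_sweep(x, y, alpha, beta, u):
    """One systematic scan of the Gibbs sampler for the assumed posterior
    pi(x | y) proportional to exp{alpha * #{i: x_i = y_i} + beta * #{i: x_i = x_{i+1}}},
    updating x in place.  u holds one Uniform(0,1) deviate per component and
    the inverse-cdf rule is: x_i = 1 iff u_i exceeds P(x_i = 0 | rest, y)."""
    n = len(x)
    for i in range(n):
        w = [0.0, 0.0]
        for v in (0, 1):                     # unnormalized full-conditional weight of x_i = v
            s = alpha * (y[i] == v)
            if i > 0:
                s += beta * (x[i - 1] == v)
            if i < n - 1:
                s += beta * (x[i + 1] == v)
            w[v] = np.exp(s)
        p0 = w[0] / (w[0] + w[1])            # full-conditional probability that x_i = 0
        x[i] = 1 if u[i] > p0 else 0
    return x

rng = np.random.default_rng(1)
y = np.array([1, 1, 1, 0, 0] * 20, dtype=int)   # short illustrative record with the 11100 pattern
x = y.copy()
for _ in range(100):
    x = gibbs_sweep(x, y, np.log(4), np.log(3), rng.random(len(y)))

Because β > 0, this update rule is monotone: two current states with x′ ≤ x″ componentwise, driven by the same uniforms, preserve the inequality, which is exactly the property exploited next.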
Now imagine that, using a single stream of random numbers, we run the chain as above from each of two states x′ and x″ ∈ S such that x′ ≤ x″ componentwise. Then the corresponding inequality is inherited by the new pair of states obtained at each iteration, because β > 0. Similarly, consider three initializations, 0 (all zeros), 1 (all ones) and any other x ∈ S. Because 0 ≤ x ≤ 1 elementwise, it follows that the corresponding inequality holds at every subsequent stage and so all paths must coalesce by the time the two extreme ones do so. Hence, we need only monitor the two extremal paths. Note that coalescence occurs much faster than one might expect, because of the commonalities in the simulation method.
However, we must still determine how far back we need to go to ensure that coalescence occurs by time 0. A basic method is as follows. We begin by running simulations from time -1, initialized by x(-1) = 0 and 1, respectively. If the paths do not coalesce at time 0, we repeat the procedure from time -2, ensuring that the previous random numbers are used again between times -1 and 0. If the paths do not coalesce by time 0, we repeat from time -3, ensuring that the previous random numbers are used between times -2 and 0; and so on. We terminate the process when coalescence by time 0 first occurs and take the corresponding x(0) as our random draw from π. We say coalescence "by" rather than "at" time 0 because, in the final run, this may occur before time 0. In practice, it is generally more efficient to use increasing increments between the starting times of successive runs, again with duplication of the random numbers during the common intervals. There is no need to identify the smallest m for which coalescence occurs by time zero.
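Putting the pieces together, monotone CFTP for this sampler can be organized as in the sketch below; it reuses the hypothetical gibbs_sweep from the previous sketch, doubles the starting time rather than stepping back one unit at a time (one form of the increasing increments just mentioned), and reuses the same random numbers over the common intervals.

import numpy as np

def monotone_cftp(y, alpha, beta, seed=0):
    """Monotone coupling from the past for the assumed noisy-binary-channel
    posterior.  Returns an exact draw x(0), obtained once the all-zeros and
    all-ones chains, driven by identical random numbers, coalesce by time 0."""
    n = len(y)
    rng = np.random.default_rng(seed)
    uniforms = []                       # uniforms[t] drives the sweep from time -(t+1) to -t
    m = 1
    while True:
        while len(uniforms) < m:        # extend the fixed random-number stream as we go further back
            uniforms.append(rng.random(n))
        lo = np.zeros(n, dtype=int)     # extremal initial states at time -m
        hi = np.ones(n, dtype=int)
        for t in range(m - 1, -1, -1):  # sweeps from time -m up to time 0, reusing old numbers near 0
            lo = gibbs_sweep(lo, y, alpha, beta, uniforms[t])
            hi = gibbs_sweep(hi, y, alpha, beta, uniforms[t])
        if np.array_equal(lo, hi):      # coalescence by time 0: x(0) is an exact draw
            return lo, m
        m *= 2                          # otherwise restart from further back in the past

y = np.array([1, 1, 1, 0, 0] * 20, dtype=int)
x0, m_used = monotone_cftp(y, np.log(4), np.log(3))
print(m_used, (x0 == y).mean())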
For a numerical illustration, we again chose α = ln 4 and β = ln 3 in (17), with y = 111001110011100..., a vector of length 100000. Thus, the state space has 2^100000 elements. Moving back one step at a time, coalescence by time 0 first occurred when running from time -15, with an approximate halving of the discrepancies between each pair of paths, generation by generation, though not even a decrease is guaranteed. Coalescence itself occurred at time -2. There were 77759 matches between the CFTP sample x(0) and the MPM and MAP estimates, which recall are both equal to y in this case. Of course, the performance of CFTP becomes hopeless if β is too large but, in such cases, it may be possible to adopt algorithms that converge faster but still preserve monotonicity. Indeed, for the Ising model, Propp and Wilson (1996) use Sweeny's cluster algorithm rather than the Gibbs sampler. An alternative would be to use
REFERENCES
BESAG J .E ., GREEN P .J ., HIGDON D.M ., AND MENGERSEN K.L. (1995) . Bayesian com-
putation and stochastic systems (with Discussion) . Statistical Science, 10, 3-66.
CHEN M .-H ., SHAO Q .-M ., AND IBRAHIM J.G . (2000) . Monte Carlo Methods in Bayesian
Computation. Springer: New York .
Cox D .R. AND WERMUTH N. (1994). A note on the quadratic exponential binary dis-
tribution. Biometrika, 81, 403-408.
DOUCET A. , DE FREITAS N. , AND GORDON N. (2001) . Sequential Monte Carlo Methods
in Practice . Springer: New York.
EDDY S.R., MITCHISON G., AND DURBIN R. (1995). Maximum discrimination hidden
Markov models of sequence consensus. Journal of Computational Biology, 2, 9-24.
FISHMAN G .S . (1996) . Monte Carlo: Concepts, Algorithms, and Applications. Springer
Verlag: New York.
FREDKIN D .R. AND RICE J.A . (1992). Maximum likelihood estimation and identification
directly from single-channel recordings. Proceedings of the Royal Society of London
B, 249, 125-132.
GELMAN A ., CARLIN J .B ., STERN H .S., AND RUBIN D .B . (1995) . Bayesian Data Anal-
ysis. Chapman and Hall/CRC: Boca Raton .
GEMAN S . AND GEMAN D. (1984) . Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images. Institute of Electrical and Electronics Engineers ,
Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
GEMAN S. AND HWANG C .-R. (1986) . Diffusions for global optimization. SIAM Journal
on Control and Optimization, 24, 1031-1043.
GEYER C.J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Sci-
ence and Statistics: Proceedings of the 23rd Symposium on the Interface (ed. E.M.
Keramidas) , 156-163. Interface Foundation of North America, Fairfax Station, VA.
GEYER C .J . (1992) . Practical Markov chain Monte Carlo (with Discussion) . Statistical
Science, 7, 473-511.
GEYER C.J. AND THOMPSON E.A. (1992). Constrained Monte Carlo maximum likelihood
for dependent data (with Discussion) . Journal of the Royal Statistical Society B,
54, 657-699.
GEYER C.J. AND THOMPSON E .A. (1995) . Annealing Markov chain Monte Carlo with
applications to ancestral inference. Journal of the American Statistical Association,
90, 909-920.
GILKS W.R. (1992) . Derivative-free adaptive rejection sampling for Gibbs sampling. In
Bayesian Statistics 4 (eds . J .O . Berger, J .M . Bernardo, A.P . Dawid, and A.F .M .
Smith), 641-649. Oxford University Press.
GILKS W .R. , RICHARDSON S ., AND SPIEGELHALTER D . (eds.) (1996). Markov Chain
Monte Carlo in Practice . Chapman and Hall : London.
GREEN P .J . (1995). Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika, 82, 711-732.
HASTINGS W.K . (1970) . Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57,97-109.
HAUSSLER D ., KROGH A ., MIAN S., AND SJOLANDER K . (1993) . Protein modeling us-
ing hidden Markov models: analysis of globins. In Proceedings of the Hawaii In-
ternational Conference on System Sciences . IEEE Computer Science Press: Los
Alamitos, CA .
HINTON G .E . AND SEJNOWSKI T . (1986) . Learning and relearning in Boltzmann ma-
chines. In Parallel Distributed Processing (eds . D.E . Rumelhart and J .L. McClel-
land). M.I.T. Press.
HUGHES J .P ., GUTTORP P ., AND CHARLES S.P. (1999) . A nonhomogeneous hidden
Markov model for precipitation. Applied Statistics, 48 , 15-30.
HURN M ., HUSBY 0 ., AND RUE H . (2003) . Advances in Bayesian image analysis. In
Highly Structured Stochastic Systems (eds. P.J . Green, N.L. Hjort, and S. Richard-
son). Oxford University Press.
JORDAN M.I., GHAHRAMANI Z., JAAKKOLA T.S., AND SAUL L.K. (1998). An introduction
to variational methods for graphical models. In Learning in Graphical Models
(ed. M.I. Jordan). Kluwer Academic Publishers.
JUANG B .H. AND RABINER L.R . (1991). Hidden Markov models for speech recognition.
Technometrics, 33, 251-272 .
KIPNIS C . AND VARADHAN S.R.S. (1986). Central limit theorem for additive functionals
of reversible Markov processes and applications to simple exclusions. Communica-
tions in Mathematical Physics, 104, 1-19 .
KIRKPATRICK S., GELATT C .D ., AND VECCHI M.P. (1983). Optimization by simulated
annealing. Science, 220, 671-680.
LAURITZEN S.L . (1996). Graphical Models. Clarendon Press: Oxford.
LE STRAT Y. AND CARRAT F . (1999). Monitoring epidemiologic surveillance data using
hidden Markov models. Statistics in Medicine , 18, 3463-3478.
LIU J .S. (1996). Peskun's theorem and a modified discrete-state Gibbs sampler. Bio-
metrika, 83, 681-682.
LIU J .S. (2001). Monte Carlo Strategies in Scientific Computing. Springer: New York.
LIU J.S ., NEUWALD A.F. , AND LAWRENCE C .E . (1995). Bayesian models for multiple
local sequence alignment and Gibbs sampling strategies. Journal of the American
Statistical Association, 90, 1156-1170.
MACCORMICK J . (2002). Stochastic Algorithms for Visual Tracking. Springer: New
York .
MACDONALD I.L. AND ZUCCHINI W. (1997). Hidden Markov and Other Models for
Discrete-valued Time Series . Chapman and Hall : London.
MARINARI E . AND PARISI G . (1992). Simulated tempering: a new Monte Carlo scheme.
Europhysics Letters, 19, 451-458 .
METROPOLIS N., ROSENBLUTH A.W., ROSENBLUTH M.N ., TELLER A.H. , AND TELLER
E. (1953). Equation of state calculations by fast computing machines. Journal of
Chemical Physics, 21, 1087-1092 .
NEWMAN M.E.J . AND BARKEMA G .T . (1999). Monte Carlo Methods in Statistical
Physics . Clarendon Press: Oxford.
NUMMELIN E . (1984). General Irreducible Markov Chains and Non-Negative Operators.
Cambridge University Press.
PEARL J. (2000). Causality. Cambridge University Press.
PENTTINEN A. (1984). Modeling interaction in spatial point patterns: parameter estima-
tion by the maximum likelihood method. Jyväskylä Studies in Computer Science,
Economics and Statistics, 7.
PESKUN P.H . (1973). Optimum Monte Carlo sampling using Markov chains. Biometri-
ka, 60, 607-612.
PROPP J.G. AND WILSON D.B. (1996). Exact sampling with coupled Markov chains
and applications to statistical mechanics. Random Structures and Algorithms, 9,
223-252.
RABINER L.R. (1989). A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the Institute of Electrical and Electronics
Engineers, 77, 257-284 .
ROBERT C.P. AND CASELLA G . (1999). Monte Carlo Statistical Methods. Springer: New
York.
ROBERT C.P ., RYDEN T ., AND TITTERINGTON D.M. (2000). Bayesian inference in hidden
Markov models through the reversible jump Markov chain Monte Carlo method.
Journal of the Royal Statistical Society B, 62, 57-75.
SOKAL A.D . (1989). Monte Carlo methods in statistical mechanics: foundations and
new algorithms. Cours de Troisieme Cycle de la Physique en Suisse Romande,
Lausanne.
SWENDSEN R.H. AND WANG J .-S . (1987). Non-universal critical dynamics in Monte
Carlo simulations. Physical Review Letters, 58, 86-88.
TIERNEY L.J. (1994). Markov chains for exploring posterior distributions (with Discus-
sion). Annals of Statistics, 22, 1701-1762.
TJELMELAND H. AND BESAG J. (1998). Markov random fields with higher-order inter-
actions. Scandinavian Journal of Statistics, 25, 415-433 .
SEMIPARAMETRIC FILTERING IN SPEECH PROCESSING

BENJAMIN KEDEM AND KONSTANTINOS FOKIANOS
Abstract. We consider m data sets where the first m - 1 are obtained by sampling from multiplicative exponential distortions of the mth distribution, which serves as the reference. The combined data from m samples, one from each distribution, are used in the semiparametric large sample problem of estimating each distortion and the reference distribution, and testing the hypothesis that the distributions are identical. Possible applications to speech processing are mentioned.
their distribution and deviations from them in some sense. The theory will be illustrated in terms of autoregressive signals akin to speech.

The present formulation of the general scheme follows closely the recent development in Fokianos et al. (2001), which extends Fokianos et al. (1998) and Qin and Zhang (1997). Qin and Lawless (1994) is the predecessor to all this work. Related references dealing with more general tilting or bias are the pioneering papers of Vardi (1982), (1986).
2. Mathematical formulation of source combination. In our formalism, "sources" are identified with "data". Deviations are formulated in terms of deviations from a reference distribution. Thus, a data set deviates from a reference set in the sense that its distribution is a distortion of a reference distribution.

To motivate this, consider the classical one-way analysis of variance with m = q + 1 independent normal random samples, for which

(1)    g_j(x) / g_m(x) = exp(α_j + β_j x),    j = 1, ..., q,

where g_j denotes the probability density of the jth sample. More generally, we assume that the first q distributions are exponential distortions of an unknown reference density g ≡ g_m,

(2)    g_j(x) / g_m(x) = exp(α_j + β_j h(x)),    j = 1, ..., q,

for a known function h. The m samples are combined into a single vector t = (t_1, ..., t_n)′ = (x_1′, ..., x_q′, x_m′)′, where x_j = (x_{j1}, ..., x_{jn_j})′ is the jth sample of length n_j, and n = n_1 + ··· + n_q + n_m.
The statistical semiparametric estimation/testing problems using the combined data t are:
1. Nonparametric estimation of G(x), the cdf corresponding to g(x).
2. Estimation of the parameters α = (α_1, ..., α_q)′, β = (β_1, ..., β_q)′, and the study of the large sample properties of the estimators.
3. Testing of the hypothesis H_0 : β_1 = ··· = β_q = 0.
Evidently, the general construction does not require normality or even symmetry of the distributions, the variances need not be the same, and the model does not require knowledge of the reference distribution. The main assumption is the form of the distortion of the reference distribution.
2.1. Estimation and large sample results. A maximum likelihood estimator of G(x) can be obtained by maximizing the likelihood over the class of step cdf's with jumps at the observed values t_1, ..., t_n. Accordingly, if p_i = dG(t_i), i = 1, ..., n, the likelihood becomes

(3)    L(α, β, G) = ∏_{i=1}^n p_i ∏_{j=1}^{n_1} exp(α_1 + β_1 h(x_{1j})) ··· ∏_{j=1}^{n_q} exp(α_q + β_q h(x_{qj})).

Writing w_j(t) = exp(α_j + β_j h(t)) and ρ_j = n_j / n_m, the p_i are profiled out by maximizing subject to the constraints

(4)    Σ_{i=1}^n p_i = 1,    Σ_{i=1}^n p_i [w_1(t_i) - 1] = 0, ...,    Σ_{i=1}^n p_i [w_q(t_i) - 1] = 0,

which gives, up to a constant, the profile log-likelihood

(5)    l = - Σ_{i=1}^n log[1 + ρ_1 w_1(t_i) + ··· + ρ_q w_q(t_i)]
            + Σ_{j=1}^{n_1} [α_1 + β_1 h(x_{1j})] + ··· + Σ_{j=1}^{n_q} [α_q + β_q h(x_{qj})].

The score equations for j = 1, ..., q, are therefore

(6)    ∂l/∂α_j = - Σ_{i=1}^n ρ_j w_j(t_i) / [1 + ρ_1 w_1(t_i) + ··· + ρ_q w_q(t_i)] + n_j = 0,

        ∂l/∂β_j = - Σ_{i=1}^n ρ_j h(t_i) w_j(t_i) / [1 + ρ_1 w_1(t_i) + ··· + ρ_q w_q(t_i)] + Σ_{i=1}^{n_j} h(x_{ji}) = 0.
The solution of the score equations gives the maximum likelihood estimators α̂, β̂, and consequently, by substitution, also the estimates

(7)    p̂_i = 1 / { n_m [1 + ρ_1 ŵ_1(t_i) + ··· + ρ_q ŵ_q(t_i)] },    i = 1, ..., n,

(8)    Ĝ(t) = Σ_{i=1}^n p̂_i I(t_i ≤ t),

where ŵ_j(t) = exp(α̂_j + β̂_j h(t)). It can be shown that the estimators α̂, β̂ are asymptotically normal,

(9)    √n ( α̂ - α_0, β̂ - β_0 )′ → N(0, Σ),

where α_0, β_0 denote the true parameter values. Throughout,

E(t^k) ≡ ∫ h^k(t) dG(t),    Var(t) ≡ E(t²) - E²(t),

and the form of the covariance matrix Σ is derived in the Appendix.
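As an illustrative sketch only (not code from the paper), the two-sample case q = 1 with h(x) = x² can be fitted by maximizing the profile log-likelihood (5) numerically rather than by solving the score equations (6) directly; the simulated data, the use of scipy's BFGS optimizer, and all variable names below are assumptions made for the example.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Two samples: x1 plays the role of a distorted sample, xm the reference.
x1 = rng.normal(0.0, 1.0 / np.sqrt(1 - 0.5 ** 2), size=300)  # sd of an AR(1) marginal with phi = 0.5
xm = rng.normal(0.0, 1.0, size=300)                          # reference, roughly white noise
t = np.concatenate([x1, xm])                                 # combined data
rho = len(x1) / len(xm)
h = lambda x: x ** 2                                         # assumed distortion carrier h(x) = x^2

def neg_profile_loglik(theta):
    """Minus the profile log-likelihood (5) for q = 1."""
    a, b = theta
    w = np.exp(a + b * h(t))                                 # w(t) = exp(alpha + beta h(t))
    return np.sum(np.log1p(rho * w)) - np.sum(a + b * h(x1))

fit = minimize(neg_profile_loglik, x0=np.zeros(2), method="BFGS")
alpha_hat, beta_hat = fit.x

# Fitted jump sizes and reference cdf estimate, in the spirit of (7)-(8) as reconstructed above.
p = 1.0 / (len(xm) * (1.0 + rho * np.exp(alpha_hat + beta_hat * h(t))))
G_hat = lambda x: np.sum(p[t <= x])
print(alpha_hat, beta_hat, G_hat(0.0))

Under the simulation design above the implied true values are roughly α ≈ -0.14 and β = 0.125, so the printed estimates should land near those numbers for moderate sample sizes.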
2.2.1. The χ² test. Under H_0 : β = 0, so that all the moments of h(t) are taken with respect to g, consider the q x q matrix A_11 whose jth diagonal element is ρ_j (1 + Σ_{k≠j} ρ_k) / (1 + Σ_{k=1}^q ρ_k)² and whose (j, j′) off-diagonal element is -ρ_j ρ_{j′} / (1 + Σ_{k=1}^q ρ_k)²; for q = 2, for example,

A_11 = (1 + ρ_1 + ρ_2)^{-2} diag(ρ_1, ρ_2) [ 1 + ρ_2, -ρ_2 ; -ρ_1, 1 + ρ_1 ],

and the eigenvalues of the matrix on the right are 1, 1 + ρ_1 + ρ_2. The elements are bounded by 1 and the matrix is nonsingular,

|A_11| = ∏_{k=1}^q ρ_k / (1 + Σ_{k=1}^q ρ_k)^m > 0.

Under H_0 the relevant matrices take the product forms

S = [ 1, E(t) ; E(t), E(t²) ] ⊗ A_11,        S^{-1} = (1/Var(t)) [ E(t²), -E(t) ; -E(t), 1 ] ⊗ A_11^{-1},

V = Var(t) [ 0, 0 ; 0, A_11 ],

where ⊗ denotes the Kronecker product and semicolons separate matrix rows, and

(10)    Σ = S^{-1} V S^{-1} = (1/Var(t)) [ E²(t), -E(t) ; -E(t), 1 ] ⊗ A_11^{-1}.
It follows that, under H_0 : β = 0,

(11)    χ² = n Var(t) β̂′ A_11 β̂

(12)
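For illustration, and assuming A_11 has the form written above (equivalently A_11 = diag(π) - ππ′ with π_j = ρ_j / (1 + Σ_k ρ_k), which reproduces the determinant formula), the statistic in (11) could be computed as in the following sketch; the function name and arguments are illustrative.

import numpy as np

def chi_square_stat(beta_hat, rho, t, h=lambda x: x ** 2):
    """Assumed form of (11): n * Var(h(t)) * beta' A_11 beta, with
    A_11 = diag(pi) - pi pi', pi_j = rho_j / (1 + sum(rho)).  As a Wald-type
    statistic it is referred to a chi-square distribution with q degrees of
    freedom under H0: beta = 0."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    rho = np.asarray(rho, dtype=float)
    pi = rho / (1.0 + rho.sum())
    A11 = np.diag(pi) - np.outer(pi, pi)
    t = np.asarray(t, dtype=float)
    return len(t) * np.var(h(t)) * (beta_hat @ A11 @ beta_hat)

# e.g. chi_square_stat([0.12, 0.04], rho=[1.0, 1.0], t=combined_sample)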
(14)    f_i(x_t) = [√(1 - φ_i²) / (σ√(2π))] exp{ -(1 - φ_i²) x_t² / (2σ²) },    i = 1, 2, 3.

With this, (2) becomes

(15)    f_i(x_t) = exp(α_i + β_i x_t²) f_3(x_t),    i = 1, 2,

where

(16)    α_i = (1/2) log[ (1 - φ_i²) / (1 - φ_3²) ],    β_i = (φ_i² - φ_3²) / (2σ²).
The last three equations show that the exponential distortion model (2) holds for Gaussian AR(1) data with h(x) = x². Thus we expect our method to produce consistent estimates for independent samples from sufficiently long time series. We show that this is indeed the case by resorting to the following limited simulation study.
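As a quick arithmetic check of (16) against the parameter values quoted below in the caption of Table 1 (taking σ² = 1):

import numpy as np

phi = {1: -0.5, 2: -0.3, 3: 0.1}   # AR(1) coefficients used in the simulation study
sigma2 = 1.0
for i in (1, 2):
    alpha_i = 0.5 * np.log((1 - phi[i] ** 2) / (1 - phi[3] ** 2))
    beta_i = (phi[i] ** 2 - phi[3] ** 2) / (2 * sigma2)
    print(i, round(alpha_i, 3), round(beta_i, 3))
# prints 1 -0.139 0.12 and 2 -0.042 0.04, matching the values quoted in the caption of Table 1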
We assume throughout that ε_t follows the normal distribution with mean 0 and variance 1, and that N_i denotes the length of time series i. Practically independent samples were obtained by sampling at random from long time series. Each entry in Table 1 was obtained from 100 independent time series.
In Table 1 the reference φ_3 = 0.1 is rather small, meaning that the reference signal is close to being white noise. The table shows that there is good agreement between the true parameter values and their estimates. To demonstrate the asymptotic normality, Figure 1 displays the Q-Q plots of the estimators when N_1 = N_2 = N_3 = 200. The plots indicate that the asymptotic distribution of the estimators is close to normal.
3.2. Conditional analysis. The exponential tilt model (2) also holds for the conditional one-step transition probability density function of the AR(1) model (13). First observe that the conditional density of X_t given X_{t-1} is

(17)    f_i(x_t | x_{t-1}) = [1/(σ√(2π))] exp{ -(x_t - φ_i x_{t-1})² / (2σ²) },    i = 1, 2, 3.
TABLE 1
Parameter estimates and their standard errors for different sample sizes for model (15). The data were generated according to (13) with φ_1 = -0.5, φ_2 = -0.3, and φ_3 = 0.1, so that α_1 = -0.139, β_1 = 0.12, α_2 = -0.042, β_2 = 0.04.
FIG. 1. Q-Q plots of the parameter estimates under model (15). The data were generated according to (13) with φ_1 = -0.5, φ_2 = -0.3, φ_3 = 0.1, and N_1 = N_2 = N_3 = 200. The number of runs is 100. (1) α̂_1 (2) β̂_1 (3) α̂_2 (4) β̂_2.
Write f_3(x_t | x_{t-1}) for the reference conditional distribution. Then a simple calculation shows that the conditional ratio f_i(x_t | x_{t-1}) / f_3(x_t | x_{t-1}) is again an exponential tilt, whose coefficient of x_{t-1}² is

(20)    γ_i = (φ_3² - φ_i²) / (2σ²),

for i = 1, 2. A theory similar to the marginal case can be developed for the conditional case as well, but this requires further research.
4. Summary. We have outlined a method for combining data from several sources. This can be used in testing the similarity of m signals given a reference signal, and in the estimation of the marginal distribution of the reference signal from the combined data as in (8). The mathematical development assumes exponential distortions or deviations of the form (2), but the theory can be developed for other types of distortions. The analysis was illustrated in terms of Gaussian AR(1) data where h(x) = x². In general, h(x) is not known and must be estimated. The choice/estimation of h(x) can be approached in several ways. One possibility is to approximate h(x) by a polynomial or by B-splines. Another possibility is to employ some type of kernel estimation applied to the log-ratio of probability densities. For skewed geophysical data h(x) = log(x) or even h(x) = log²(x) may be helpful.
APPENDIX
We prove the asserted asymptotic normality (9). First define, for j, j′ = 1, ..., q,

A_0(j, j′) ≡ ∫ w_j(t) w_{j′}(t) / [1 + Σ_{k=1}^q ρ_k w_k(t)] dG(t),

A_1(j, j′) ≡ ∫ h(t) w_j(t) w_{j′}(t) / [1 + Σ_{k=1}^q ρ_k w_k(t)] dG(t),

A_2(j, j′) ≡ ∫ h²(t) w_j(t) w_{j′}(t) / [1 + Σ_{k=1}^q ρ_k w_k(t)] dG(t),

and write E_j(t), Var_j(t) for the mean and variance of h(t) under g_j. Then the entries of (1/n) Var(∇l(α, β)) can be expressed in terms of these quantities; in particular,

(1/n) Var(∂l/∂α_j) = [ρ_j / (1 + Σ_{k=1}^q ρ_k)] [ A_0(j, j) - Σ_{r=1}^q ρ_r A_0²(j, r) ],

(1/n) Var(∂l/∂β_j) = [ρ_j / (1 + Σ_{k=1}^q ρ_k)] [ -A_2(j, j) + 2 A_1(j, j) E_j(t) - Σ_{r=1}^q ρ_r A_1²(j, r) + Var_j(t) ],

with the covariances between the two sets of scores involving the analogous terms -Σ_{r=1}^q ρ_r A_0(j, r) A_1(j′, r) and -Σ_{r=1}^q ρ_r A_1(j, r) A_1(j′, r). These limits define the matrix V.
Next, as n → ∞,

-(1/n) ∇∇′ l(α, β) → S,

where S is a 2q x 2q matrix with entries corresponding to j, j′ = 1, ..., q; for instance,

-(1/n) ∂²l/∂α_j² → [ρ_j / (1 + Σ_{k=1}^q ρ_k)] ∫ [1 + Σ_{k≠j} ρ_k w_k(t)] w_j(t) / [1 + Σ_{k=1}^q ρ_k w_k(t)] dG(t).
(A.1)    √n ( α̂ - α_0, β̂ - β_0 )′ ⇒ N(0, Σ),

where Σ = S^{-1} V S^{-1}.
Acknowledgments. Research supported by NASA contract
NAG52783. The authors are very grateful to the referee for a number of
helpful suggestions which were incorporated in the paper.
REFERENCES

[1] FOKIANOS K., B. KEDEM, J. QIN, AND D.A. SHORT (2001), "A semiparametric approach to the one-way layout," Technometrics, 43, 56-64.
[2] FOKIANOS K., B. KEDEM, J. QIN, J. HAFERMAN, AND D.A. SHORT (1998), "On combining instruments," Journal of Applied Meteorology, 37, 220-226.
[3] QIN J. AND J.F. LAWLESS (1994), "Empirical likelihood and general estimating equations," The Annals of Statistics, 22, 300-325.
[4] QIN J. AND B. ZHANG (1997), "A goodness of fit test for logistic regression models based on case-control data," Biometrika, 84, 609-618.
[5] SEN P.K. AND J.M. SINGER (1993), Large Sample Methods in Statistics: An Introduction with Applications, London: Chapman & Hall.
[6] VARDI Y. (1982), "Nonparametric estimation in the presence of length bias," Annals of Statistics, 10, 616-620.
[7] VARDI Y. (1986), "Empirical distributions in selection bias models," Annals of Statistics, 13, 178-203.
LIST OF WORKSHOP PARTICIPANTS
IMA SUMMER PROGRAMS
1987 Robotics
1988 Signal Processing
1989 Robust Statistics and Diagnostics
1990 Radar and Sonar (June 18-29)
New Directions in Time Series Analysis (July 2-27)
1991 Semiconductors
1992 Environmental Studies : Mathematical, Computational, and
Statistical Analysis
1993 Modeling, Mesh Generation, and Adaptive Numerical Methods
for Partial Differential Equations
1994 Molecular Biology
1995 Large Scale Optimizations with Applications to Inverse Problems,
Optimal Control and Design, and Molecular and Structural
Optimization
1996 Emerging Applications of Number Theory (July 15-26)
Theory of Random Sets (August 22-24)
1997 Statistics in the Health Sciences
1998 Coding and Cryptography (July 6-18)
Mathematical Modeling in Industry (July 22-31)
1999 Codes, Systems, and Graphical Models (August 2-13, 1999)
2000 Mathematical Modeling in Industry: A Workshop for Graduate
Students (July 19-28)
2001 Geometric Methods in Inverse Problems and PDE Control
(July 16-27)
2002 Special Functions in the Digital Age (July 22-August 2)
2003 Probability and Partial Differential Equations in Modern
Applied Mathematics (July 21-August 1)
2004 n-Categories: Foundations and Applications (June 7-18)