Probabilistic Latent Semantic Analysis

Thomas Hofmann
EECS Department, Computer Science Division, University of California, Berkeley &
International Computer Science Institute, Berkeley, CA
hofmann@cs.berkeley.edu

Abstract

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.

1 Introduction

Learning from text and natural language is one of the great challenges of Artificial Intelligence and Machine Learning. Any substantial progress in this domain has strong impact on many applications ranging from information retrieval, information filtering, and intelligent interfaces, to speech recognition, natural language processing, and machine translation. One of the fundamental problems is to learn the meaning and usage of words in a data-driven fashion, i.e., from some given text corpus, possibly without further linguistic prior knowledge.

The main challenge a machine learning system has to address roots in the distinction between the lexical level of "what actually has been said or written" and the semantical level of "what was intended" or "what was referred to" in a text or an utterance. The resulting problems are twofold: (i) polysemes, i.e., a word may have multiple senses and multiple types of usage in different contexts, and (ii) synonyms and semantically related words, i.e., different words may have a similar meaning; they may at least in certain contexts denote the same concept or - in a weaker sense - refer to the same topic.

Latent semantic analysis (LSA) [3] is a well-known technique which partially addresses these questions. The key idea is to map high-dimensional count vectors, such as the ones arising in vector space representations of text documents [12], to a lower dimensional representation in a so-called latent semantic space. As the name suggests, the goal of LSA is to find a data mapping which provides information well beyond the lexical level and reveals semantical relations between the entities of interest. Due to its generality, LSA has proven to be a valuable analysis tool with a wide range of applications (e.g. [3, 5, 8, 1]). Yet its theoretical foundation remains to a large extent unsatisfactory and incomplete.

This paper presents a statistical view on LSA which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, its probabilistic variant has a sound statistical foundation and defines a proper generative model of the data. A detailed discussion of the numerous advantages of PLSA can be found in subsequent sections.

2 Latent Semantic Analysis

2.1 Count Data and Co-occurrence Tables

LSA can in principle be applied to any type of count data over a discrete dyadic domain (cf. [7]). However, since the most prominent application of LSA is in the analysis and retrieval of text documents, we focus on this setting for sake of concreteness.
Suppose therefore we have given a collection of text documents D = {d_1, ..., d_N} with terms from a vocabulary W = {w_1, ..., w_M}. By ignoring the sequential order in which words occur in a document, one may summarize the data in an N × M co-occurrence table of counts N = (n(d_i, w_j))_{ij}, where n(d, w) ∈ ℕ denotes how often the term w occurred in document d. In this particular case, N is also called the term-document matrix and the rows/columns of N are referred to as document/term vectors, respectively. The key assumption is that the simplified 'bag-of-words' or vector-space representation [12] of documents will in many cases preserve most of the relevant information, e.g., for tasks like text retrieval based on keywords.
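
As a small illustration (not part of the original paper; the three toy 'documents' are invented), the following Python snippet builds such a term-document count matrix from a bag-of-words representation.

    import numpy as np

    # Toy corpus: each 'document' is just a list of tokens (order will be ignored).
    docs = [
        "latent semantic analysis maps documents to a latent space".split(),
        "probabilistic models of text use latent variables".split(),
        "the vector space model represents documents as count vectors".split(),
    ]

    # Vocabulary W = {w_1, ..., w_M} and an index lookup for its terms.
    vocab = sorted({w for doc in docs for w in doc})
    w_index = {w: j for j, w in enumerate(vocab)}

    # N x M co-occurrence table: n(d_i, w_j) = how often term w_j occurs in document d_i.
    n = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, doc in enumerate(docs):
        for w in doc:
            n[i, w_index[w]] += 1

    print(n.shape)        # (3, M): one row (document vector) per document
    print(n[0].sum())     # total number of tokens in the first document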

2.2 Latent Semantic Analysis by SVD

As mentioned in the introduction, the key idea of LSA is to map documents (and by symmetry terms) to a vector space of reduced dimensionality, the latent semantic space [3]. The mapping is restricted to be linear and is based on a Singular Value Decomposition (SVD) of the co-occurrence table. One thus starts with the standard SVD given by N = U \Sigma V^t, where U and V are orthogonal matrices with U^t U = V^t V = I and the diagonal matrix \Sigma contains the singular values of N. The LSA approximation of N is computed by setting all but the largest K singular values in \Sigma to zero (= \tilde{\Sigma}), which is rank K optimal in the sense of the L_2-matrix norm. One obtains the approximation \tilde{N} = U \tilde{\Sigma} V^t \approx U \Sigma V^t = N. Notice that the document-to-document inner products based on this approximation are given by \tilde{N} \tilde{N}^t = U \tilde{\Sigma}^2 U^t, and hence one might think of the rows of U \tilde{\Sigma} as defining coordinates for documents in the latent space. While the original high-dimensional vectors are sparse, the corresponding low-dimensional latent vectors will typically not be sparse. This implies that it is possible to compute meaningful association values between pairs of documents, even if the documents do not have any terms in common. The hope is that terms having a common meaning, in particular synonyms, are roughly mapped to the same direction in the latent space.

3 Probabilistic LSA

3.1 The Aspect Model

The starting point for Probabilistic Latent Semantic Analysis is a statistical model which has been called the aspect model [7]. The aspect model is a latent variable model for co-occurrence data which associates an unobserved class variable z ∈ Z = {z_1, ..., z_K} with each observation. A joint probability model over D × W is defined by the mixture

P(d,w) = P(d) P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z) P(z|d).   (1)

Like virtually all statistical latent variable models, the aspect model introduces a conditional independence assumption, namely that d and w are independent conditioned on the state of the associated latent variable (the corresponding graphical model representation is depicted in Figure 1 (a)). Since the cardinality of z is smaller than the number of documents/words in the collection, z acts as a bottleneck variable in predicting words. It is worth noticing that the model can be equivalently parameterized by (cf. Figure 1 (b))

P(d,w) = \sum_{z \in Z} P(z) P(d|z) P(w|z),   (2)

which is perfectly symmetric in both entities, documents and words.

Figure 1: Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization.
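
To illustrate the generative view, the following sketch (Python/NumPy; the toy parameter tables and sample size are invented for illustration and are not taken from the paper) draws observations (d, w) according to the symmetric parameterization (2): first an aspect z is drawn from P(z), then a document and a word are drawn independently from P(d|z) and P(w|z).

    import numpy as np

    rng = np.random.default_rng(0)

    K, N_docs, M_words = 2, 3, 4              # aspects, documents, vocabulary size
    P_z = np.array([0.6, 0.4])                # P(z)
    P_d_given_z = np.array([[0.5, 0.3, 0.2],  # P(d|z) for the first aspect
                            [0.1, 0.2, 0.7]]) # P(d|z) for the second aspect
    P_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],
                            [0.1, 0.1, 0.3, 0.5]])

    def sample_pair():
        """Generate one (d, w) observation from the aspect model, eq. (2)."""
        z = rng.choice(K, p=P_z)
        d = rng.choice(N_docs, p=P_d_given_z[z])
        w = rng.choice(M_words, p=P_w_given_z[z])
        return d, w

    # The implied joint distribution P(d, w) = sum_z P(z) P(d|z) P(w|z):
    P_dw = np.einsum('z,zd,zw->dw', P_z, P_d_given_z, P_w_given_z)
    assert np.isclose(P_dw.sum(), 1.0)

    # Empirical counts from the generative process approach P(d, w).
    counts = np.zeros((N_docs, M_words))
    for _ in range(10000):
        d, w = sample_pair()
        counts[d, w] += 1
    print(np.round(counts / counts.sum(), 3))
    print(np.round(P_dw, 3))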

3.2 Model Fitting with the EM Algorithm

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm [4]. EM alternates two coupled steps: (i) an expectation (E) step where posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step, where parameters are updated. Standard calculations (cf. [7, 13]) yield the E-step equation

P(z|d,w) = \frac{P(z) P(d|z) P(w|z)}{\sum_{z' \in Z} P(z') P(d|z') P(w|z')},   (3)

as well as the following M-step formulae

P(w|z) \propto \sum_{d \in D} n(d,w) P(z|d,w),   (4)
P(d|z) \propto \sum_{w \in W} n(d,w) P(z|d,w),   (5)
P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d,w) P(z|d,w).   (6)
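
For concreteness, here is a minimal NumPy sketch of one EM iteration, implementing the E-step (3) and the M-steps (4)-(6) in vectorized form (this is not code from the paper; the array layout, random initialization, and fixed iteration count are illustrative choices).

    import numpy as np

    def em_step(n_dw, P_z, P_d_z, P_w_z, eps=1e-12):
        """One EM iteration for the aspect model.

        n_dw  : (N, M) co-occurrence counts n(d, w)
        P_z   : (K,)   aspect priors P(z)
        P_d_z : (K, N) document probabilities P(d|z)
        P_w_z : (K, M) word probabilities P(w|z)
        """
        # E-step, eq. (3): posterior P(z|d,w), shape (K, N, M).
        joint = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)

        # M-steps, eqs. (4)-(6): reweight the posteriors by the observed counts.
        weighted = n_dw[None, :, :] * post          # n(d,w) P(z|d,w), shape (K, N, M)
        P_w_z = weighted.sum(axis=1)                # sum over documents, eq. (4)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + eps
        P_d_z = weighted.sum(axis=2)                # sum over words, eq. (5)
        P_d_z /= P_d_z.sum(axis=1, keepdims=True) + eps
        P_z = weighted.sum(axis=(1, 2))             # sum over both, eq. (6)
        P_z /= P_z.sum() + eps
        return P_z, P_d_z, P_w_z

    # Usage sketch: random initialization followed by a fixed number of iterations.
    rng = np.random.default_rng(0)
    N, M, K = 5, 8, 2
    n_dw = rng.integers(0, 4, size=(N, M)).astype(float)
    P_z = np.full(K, 1.0 / K)
    P_d_z = rng.dirichlet(np.ones(N), size=K)
    P_w_z = rng.dirichlet(np.ones(M), size=K)
    for _ in range(50):
        P_z, P_d_z, P_w_z = em_step(n_dw, P_z, P_d_z, P_w_z)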

Before discussing further algorithmic refinements, we will study the relationship between the proposed model and LSA in more detail.

Figure 2: Sketch of the probability sub-simplex spanned by the aspect model.

3.3 Probabilistic Latent Semantic Space

Consider the class-conditional multinomial distributions P(·|z) over the vocabulary, which we call factors. They can be represented as points on the M-1 dimensional simplex of all possible multinomials. Via its convex hull, this set of K points defines a L ≤ K-1 dimensional sub-simplex. The modeling assumption expressed by (1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of factors P(w|z), where the mixing weights P(z|d) uniquely define a point on the spanned sub-simplex. A simple sketch of this situation is shown in Figure 2. Despite the discreteness of the introduced latent variables, a continuous latent space is obtained within the space of all multinomial distributions. Since the dimensionality of the sub-simplex is ≤ K-1 as opposed to a maximum of M-1 for the complete probability simplex, this performs a dimensionality reduction in the space of multinomial distributions, and the spanned sub-simplex can be identified with a probabilistic latent semantic space.

To stress this point and to clarify the relation to LSA, let us rewrite the aspect model as parameterized by (2) in matrix notation. Hence define matrices \hat{U} = (P(d_i|z_k))_{i,k}, \hat{V} = (P(w_j|z_k))_{j,k}, and \hat{\Sigma} = diag(P(z_k))_k. The joint probability model P can then be written as a matrix product P = \hat{U} \hat{\Sigma} \hat{V}^t. Comparing this with SVD, one can make the following observations: (i) outer products between rows of \hat{U} and \hat{V} reflect conditional independence in PLSA, (ii) the K factors correspond to the mixture components in the aspect model, (iii) the mixing proportions in PLSA substitute the singular values. The crucial difference between PLSA and LSA, however, is the objective function utilized to determine the optimal decomposition/approximation. In LSA, this is the L_2- or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on (possibly transformed) counts. In contrast, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side this offers important advantages; for example, the mixture approximation P of the co-occurrence table is a well-defined probability distribution and factors have a clear probabilistic meaning. In contrast, LSA does not define a properly normalized probability distribution and \tilde{N} may even contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.

A comparison of the computational complexity might suggest some advantages for LSA: ignoring potential problems of numerical stability, the SVD can be computed exactly, while the EM algorithm is an iterative method which is only guaranteed to find a local maximum of the likelihood function. However, in all our experiments the computing time of EM has not been significantly worse than performing an SVD on the co-occurrence matrix. There is also a large potential for improving the run-time performance of EM by on-line update schemes, which has not been explored so far.

3.4 Topic Decomposition and Polysemy

Let us briefly discuss some elucidating examples at this point which will also reveal a further advantage of PLSA over LSA in the context of polysemous words. We have generated a dataset (CLUSTER) with abstracts of 1568 documents on clustering and trained an aspect model with 128 latent classes. Four pairs of factors are visualized in Figure 3. These pairs have been selected as the two factors that have the highest probability to generate the words "segment", "matrix", "line", and "power", respectively. The sketchy characterization of the factors by their 10 most probable words already reveals interesting topics. In particular, notice that the term used to select a particular pair has a different meaning in either topic factor: (i) 'Segment' refers to an image region in the first and to a phonetic segment in the second factor. (ii) 'Matrix' denotes a rectangular table of numbers in one factor and a material in which something is embedded or enclosed in the other. (iii) 'Line' can refer to a line in an image, but also to a line in a spectrum. (iv) 'Power' is used in the context of radiating objects in astronomy, but also in electrical engineering.

    segment 1    segment 2     matrix 1      matrix 2      line 1       line 2      power 1     power 2
    imag         speaker       robust        manufactur    constraint   alpha       POWER       load
    SEGMENT      speech        MATRIX        cell          LINE         redshift    spectrum    memon
    texture      recogni       eigenvalu     part          match        LINE        omega       vlsi
    color        signal        uncertainti   MATRIX        locat        galaxi      mpc         POWER
    tissue       train         plane         cellular      imag         quasar      hsup        systolic
    brain        hmm           linear        famili        geometr      absorp      larg        input
    slice        source        condition     design        impos        high        redshift    complex
    cluster      speakerind.   perturb       machinepart   segment      ssup        galaxi      array
    mn           SEGMENT       root          format        fundament    densiti     standard    present
    volume       sound         suffici       group         recogn       veloc       model       implement

Figure 3: Eight selected factors from a 128 factor decomposition. The displayed word stems are the 10 most probable words in the class-conditional distribution P(w|z), from top to bottom in descending order.

Document 1, P(z_k|d_1, w = 'segment') = (0.951, 0.0001, ...), P(w = 'segment'|d_1) = 0.06:

    SEGMENT medic imag challeng problem field imag analysi diagnost base proper SEGMENT digit imag SEGMENT medic imag need SEGMENT despit exist techniqu SEGMENT applic involv estim boundari object classif tissu abnorm shape analysi contour detec textur specif medic imag remain crucial problem [...]

Document 2, P(z_k|d_2, w = 'segment') = (0.025, 0.867, ...), P(w = 'segment'|d_2) = 0.010:

    consid signal origin sequenc sourc specif problem SEGMENT signal relat SEGMENT sourc address issu wide applic field report describ [...] resolu method ergod hidden markov model hmm hmm state correspond signal sourc signal sourc sequenc determin decod procedur viterbi algorithm forward algorithm observ sequenc baumwelch train estim hmm paramet train materi applic multipl signal sourc identif problem experi perform unknown speaker identif [...]

Figure 4: Abstracts of 2 exemplary documents from the CLUSTER collection along with latent class posterior probabilities P(z|d, w = 'segment') and word probabilities P(w = 'segment'|d).

Figure 4 shows the abstracts of two exemplary documents which have been pre-processed by a standard stop-word list and a stemmer. The posterior probabilities for the classes given the different occurrences of 'segment' indicate how likely it is for each of the factors in the first pair of Figure 3 to have generated this observation. We have also displayed the estimates of the conditional word probabilities P(w = 'segment'|d_{1,2}). One can see that the correct meaning of the word 'segment' is identified in both cases. This implies that although 'segment' occurs frequently in both documents, the overlap in the factored representation is low, since 'segment' is identified as a polysemous word (relative to the chosen resolution level) which - dependent on the context - is explained by different factors.

3.5 Aspects versus Clusters

It is worth comparing the aspect model with statistical clustering models (cf. also [7]). In clustering models for documents, one typically associates a latent class variable with each document in the collection. Most closely related to our approach is the distributional clustering model [10, 7], which can be thought of as an unsupervised version of a naive Bayes' classifier. It can be shown that the conditional word probability of a probabilistic clustering model is given by

P(w|d) = \sum_{z \in Z} P\{c(d) = z\} P(w|z),   (7)

where P{c(d) = z} is the posterior probability of document d having latent class z. It is a simple implication of Bayes' rule that these posterior probabilities will concentrate their probability mass on a certain value z with an increasing number of observations (i.e., with the length of the document). This means that although (1) and (7) are algebraically equivalent, they are conceptually very different and yield in fact different results. The aspect model assumes that document-specific distributions are a convex combination of aspects, while the clustering model assumes there is just one cluster-specific distribution which is inherited by all documents in the cluster.¹ Thus in clustering models the class-conditionals P(w|z) have to capture the complete vocabulary of a subset (cluster) of documents, while factors can focus on certain aspects of the vocabulary of a subset of documents. For example, a factor can be very well used to explain some fraction of the words occurring in a document, although it might not explain other words at all (e.g., even assign zero probability), because these other words can be taken care of by other factors.

¹ In the distributional clustering model it is only the posterior uncertainty of the cluster assignments that induces some averaging over the class-conditional word distributions P(w|z).

Figure 5: Perplexity results as a function of the latent space dimensionality for (a) the MED data (rank 1033) and (b) the LOB data (rank 1674). Plotted results are for LSA (dashed-dotted curve) and PLSA (trained by TEM = solid curve, trained by early stopping EM = dotted curve). The upper baseline is the unigram model corresponding to marginal independence. The star at the right end of the PLSA curve denotes the perplexity of the largest trained aspect models (K = 2048).

3.6 Model Fitting Revisited: Improving Generalization by Tempered EM

So far we have focused on maximum likelihood estimation to fit a model from a given document collection. Although the likelihood or, equivalently, the perplexity² is the quantity we believe to be crucial in assessing the quality of a model, one clearly has to distinguish between the performance on the training data and on unseen test data. To derive conditions under which generalization on unseen data can be guaranteed is actually the fundamental problem of statistical learning theory. Here, we propose a generalization of maximum likelihood for mixture models which is known as annealing and is based on an entropic regularization term. The resulting method is called Tempered Expectation Maximization (TEM) and is closely related to deterministic annealing [11].

² Perplexity refers to the log-averaged inverse probability on unseen data.

The starting point of TEM is a derivation of the E-step based on an optimization principle. As has been pointed out in [9], the EM procedure in latent variable models can be obtained by minimizing a common objective function - the (Helmholtz) free energy - which for the aspect model is given by

F_\beta = -\beta \sum_{d,w} n(d,w) \sum_{z} \tilde{P}(z; d,w) \log P(d,w|z) P(z) + \sum_{d,w} n(d,w) \sum_{z} \tilde{P}(z; d,w) \log \tilde{P}(z; d,w).   (8)

Here \tilde{P}(z; d,w) are variational parameters which define a conditional distribution over Z, and β is a parameter which - in analogy to physical systems - is called the inverse computational temperature. Notice that the first contribution in (8) is the negative expected log-likelihood scaled by β. Thus in the case of \tilde{P}(z; d,w) = P(z|d,w), minimizing F w.r.t. the parameters defining P(d,w|z)P(z) amounts to the standard M-step in EM. In fact, it is straightforward to verify that the posteriors are obtained by minimizing F w.r.t. \tilde{P} at β = 1. In general \tilde{P} is determined by

\tilde{P}(z; d,w) = \frac{[P(z) P(d|z) P(w|z)]^\beta}{\sum_{z'} [P(z') P(d|z') P(w|z')]^\beta}.   (9)

This shows that the effect of the entropy at β < 1 is to dampen the posterior probabilities such that they will become closer to the uniform distribution with decreasing β.

Somewhat contrary to the spirit of annealing as a continuation method, we propose an 'inverse' annealing strategy which first performs EM iterations and then decreases β until performance on held-out data deteriorates. Compared to annealing this may accelerate the model fitting procedure significantly (e.g., by a factor of ≈ 10-50), and we have not found the test set performance of 'heated' models to be worse than the one achieved by carefully 'annealed' models. The TEM algorithm can thus be implemented in the following way:

1. Set β ← 1 and perform EM with early stopping.
2. Decrease β ← ηβ (with η < 1) and perform one TEM iteration.
3. As long as the performance on held-out data improves (non-negligibly), continue TEM iterations at this value of β; otherwise goto step 2.
4. Perform stopping on β, i.e., stop when decreasing β does not yield further improvements.
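
The tempered E-step (9) and the schedule above can be sketched as follows (Python/NumPy; this is a rough, simplified rendering rather than the authors' implementation: the decay factor eta, the patience rule, the improvement tolerance, and the use of held-out counts for the same documents are illustrative assumptions).

    import numpy as np

    def tempered_posterior(P_z, P_d_z, P_w_z, beta, eps=1e-12):
        """Tempered E-step, eq. (9): posteriors from terms raised to the power beta."""
        joint = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]  # (K, N, M)
        powered = joint ** beta
        return powered / (powered.sum(axis=0, keepdims=True) + eps)

    def m_step(n_dw, post, eps=1e-12):
        """M-step, eqs. (4)-(6), shared by EM (beta = 1) and TEM (beta < 1)."""
        weighted = n_dw[None, :, :] * post
        P_w_z = weighted.sum(axis=1)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + eps
        P_d_z = weighted.sum(axis=2)
        P_d_z /= P_d_z.sum(axis=1, keepdims=True) + eps
        P_z = weighted.sum(axis=(1, 2))
        P_z /= P_z.sum() + eps
        return P_z, P_d_z, P_w_z

    def perplexity(n_dw, P_z, P_d_z, P_w_z, eps=1e-12):
        """Perplexity (log-averaged inverse probability) of counts n(d, w)."""
        P_dw = np.einsum('z,zd,zw->dw', P_z, P_d_z, P_w_z)
        return np.exp(-(n_dw * np.log(P_dw + eps)).sum() / n_dw.sum())

    def tem_fit(n_train, n_heldout, P_z, P_d_z, P_w_z, eta=0.9, max_iter=500):
        """Rough rendering of the 'inverse annealing' schedule (steps 1-4): iterate at
        beta = 1 first, and lower beta by a factor eta whenever the perplexity on
        held-out counts (for the same documents) stops improving."""
        beta, best, stale = 1.0, np.inf, 0
        for _ in range(max_iter):
            post = tempered_posterior(P_z, P_d_z, P_w_z, beta)
            P_z, P_d_z, P_w_z = m_step(n_train, post)
            ppl = perplexity(n_heldout, P_z, P_d_z, P_w_z)
            if ppl < best - 1e-6:
                best, stale = ppl, 0      # step 3: keep iterating at this beta
            else:
                beta *= eta               # step 2: decrease beta
                stale += 1
                if stale > 3:             # step 4: lowering beta no longer helps
                    break
        return P_z, P_d_z, P_w_z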

4 Experimental Results

In the experimental evaluation, we focus on two tasks: (i) perplexity minimization for a document-specific unigram model and noun-adjective pairs, and (ii) automated indexing of documents. The evaluation of LSA and PLSA on the first task will demonstrate the advantages of explicitly minimizing perplexity by TEM; the second task will show that the solid statistical foundation of PLSA pays off even in applications which are not directly related to perplexity reduction.

Table 1: Average precision results and relative improvement w.r.t. the baseline method cos+tf for the 4 standard test collections. Compared are LSI, PLSI, as well as results obtained by combining PLSI models (PLSI*). An asterisk for LSI indicates that no performance gain could be achieved over the baseline; the result at 256 dimensions with λ = 2/3 is reported in this case.

             MED              CRAN             CACM             CISI
             prec.   impr.    prec.   impr.    prec.   impr.    prec.   impr.
    cos+tf   44.3    -        29.9    -        17.9    -        12.7    -
    LSI      51.7    +16.7    *28.7   -4.0     *16.0   -11.6    12.8    +0.8
    PLSI     63.9    +44.2    35.1    +17.4    22.9    +27.9    18.8    +48.0
    PLSI*    66.3    +49.7    37.5    +25.4    26.8    +49.7    20.1    +58.3

4.1 Perplexity Evaluation

In order to compare the predictive performance of PLSA and LSA one has to specify how to extract probabilities from an LSA decomposition. This problem is not trivial, since negative entries prohibit a simple re-normalization of the approximating matrix \tilde{N}. We have followed the approach of [2] to derive LSA probabilities.

Two data sets have been used to evaluate the perplexity performance: (i) a standard information retrieval test collection MED with 1033 documents, and (ii) a dataset with noun-adjective pairs generated from a tagged version of the LOB corpus. In the first case, the goal was to predict word occurrences based on (parts of) the words in a document. In the second case, nouns have to be predicted conditioned on an associated adjective. Figure 5 reports perplexity results for LSA and PLSA on the MED (a) and LOB (b) datasets in dependence on the number of dimensions of the (probabilistic) latent semantic space. PLSA outperforms the statistical model derived from standard LSA by far. On the MED collection PLSA reduces perplexity relative to the unigram baseline by more than a factor of three (3073/936 ≈ 3.3), while LSA achieves less than a factor of two in reduction (3073/1647 ≈ 1.9). On the less sparse LOB data the PLSA reduction in perplexity is 1316/547 ≈ 2.41, while the reduction achieved by LSA is only 1316/632 ≈ 2.08. In order to demonstrate the advantages of TEM, we have also trained aspect models on the MED data by standard EM with early stopping. As can be seen from the curves in Figure 5 (a), the difference between EM and TEM model fitting is significant. Although both strategies - tempering and early stopping - are successful in controlling the model complexity, EM training performs worse, since it makes a very inefficient use of the available degrees of freedom. Notice that with both methods it is possible to train high-dimensional models with a continuous improvement in performance. The number of latent space dimensions may even exceed the rank of the co-occurrence matrix N, and the choice of the number of dimensions becomes merely an issue of possible limitations of computational resources.

4.2 Information Retrieval

One of the key problems in information retrieval is automatic indexing, which has its main application in query-based retrieval. The most popular family of information retrieval techniques is based on the Vector Space Model (VSM) for documents [12]. Here, we have utilized a rather straightforward representation based on the (untransformed) term frequencies n(d, w) together with the standard cosine matching function; a more detailed experimental analysis can be found in [6]. The same representation applies to queries q, so that the matching function for the baseline term matching method can be written as

s(d,q) = \frac{\sum_w n(d,w)\, n(q,w)}{\sqrt{\sum_w n(d,w)^2}\, \sqrt{\sum_w n(q,w)^2}}.   (10)
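
A minimal sketch of this baseline matching function (Python/NumPy; the toy count vectors are invented for illustration):

    import numpy as np

    def cosine_match(n_d, n_q, eps=1e-12):
        """Baseline term-matching score s(d, q) of eq. (10):
        cosine similarity between raw term-frequency vectors."""
        return float(n_d @ n_q / (np.linalg.norm(n_d) * np.linalg.norm(n_q) + eps))

    # Toy term-frequency vectors over the same vocabulary.
    n_d = np.array([2., 0., 1., 3.])   # n(d, w) for one document
    n_q = np.array([1., 0., 0., 1.])   # n(q, w) for a query
    print(round(cosine_match(n_d, n_q), 3))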

In Latent Semantic Indexing (LSI), the original vector space representation of documents is replaced by a representation in the low-dimensional latent space and the similarity is computed based on that representation. Queries or documents which were not part of the original collection can be folded in by a simple matrix multiplication (cf. [3] for details). In our experiments, we have actually considered linear combinations of the original similarity score (10) (weight λ) and the one derived from the latent space representation (weight 1-λ).

The same ideas have been applied in Probabilistic Latent Semantic Indexing (PLSI) in conjunction with the PLSA model. More precisely, the low-dimensional representation in the factor space, P(z|d) and P(z|q), has been utilized to evaluate similarities. To achieve this, queries have to be folded in, which is done in PLSA by fixing the P(w|z) parameters and calculating weights P(z|q) by TEM.
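
A sketch of this folding-in step is given below (Python/NumPy; for simplicity it uses plain EM updates at fixed β = 1 to estimate the query weights, whereas the paper computes them by TEM; all names and the iteration count are illustrative).

    import numpy as np

    def fold_in_query(n_q, P_z, P_w_z, iters=50, eps=1e-12):
        """Estimate mixing weights P(z|q) for an unseen query q with term counts n_q,
        keeping the trained word distributions P(w|z) fixed.

        n_q   : (M,)   term counts of the query
        P_z   : (K,)   trained aspect priors, used only for initialization
        P_w_z : (K, M) trained (and frozen) word probabilities P(w|z)
        """
        P_z_q = P_z.copy()                              # initial guess for P(z|q)
        for _ in range(iters):
            # E-step restricted to the query: P(z|q, w) for every word w.
            joint = P_z_q[:, None] * P_w_z              # (K, M)
            post = joint / (joint.sum(axis=0, keepdims=True) + eps)
            # M-step updates only the query weights; P(w|z) stays fixed.
            P_z_q = (n_q[None, :] * post).sum(axis=1)
            P_z_q /= P_z_q.sum() + eps
        return P_z_q                                    # P(z|q), comparable to P(z|d)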

Figure 6: Precision-recall curves for term matching, LSI, and PLSI* on the 4 test collections.

One advantage of using statistical models vs. SVD techniques is that it allows us to systematically combine different models. While this should optimally be done according to a Bayesian model combination scheme, we have utilized a much simpler approach in our experiments which has nevertheless shown excellent performance and robustness. Namely, we have simply combined the cosine scores of all models with a uniform weight. The resulting method is referred to as PLSI*. Empirically we have found the performance to be very robust w.r.t. different (non-uniform) weights and also w.r.t. the λ-weight used in combination with the original cosine score. This is due to the noise reducing benefits of (model) averaging. Notice that LSA representations for different K form a nested sequence, which is not true for the statistical models, which are expected to capture a larger variety of reasonable decompositions.

Figure 7: Perplexity and average precision as a function of the inverse temperature β for an aspect model with K = 48 (left) and K = 128 (right).

We have utilized the following four medium-sized standard document collections with relevance assessment: (i) MED (1033 document abstracts from the National Library of Medicine), (ii) CRAN (1400 document abstracts on aeronautics from the Cranfield Institute of Technology), (iii) CACM (3204 abstracts from the CACM Journal), and (iv) CISI (1460 abstracts in library science from the Institute for Scientific Information). The condensed results in terms of average precision-recall (at the 9 recall levels 10%-90%) are summarized in Table 1, while the corresponding precision-recall curves can be found in Figure 6. Here are some additional details of the experimental setup: PLSA models at K = 32, 48, 64, 80, 128 have been trained by TEM for each data set with 10% held-out data. For PLSI we report the best result obtained by any of these models; for LSI we report the best result obtained for the optimal dimension (exploring 32-512 dimensions at a step size of 8). The combination weight λ with the cosine baseline score has been coarsely optimized by hand: MED, CRAN: λ = 1/2; CACM, CISI: λ = 2/3.

The experiments consistently validate the advantages of PLSI over LSI. Substantial performance gains have been achieved for all 4 data sets. Notice that the relative precision gain compared to the baseline method is typically around 100% in the most interesting intermediate regime of recall! In particular, PLSI works well even in cases where LSI fails completely (these problems of LSI are in accordance with the original results reported in [3]). The benefits of model combination are also very substantial. In all cases the (uniformly) combined model performed better than the best single model. As a side effect, model averaging also liberated us from selecting the correct model dimensionality. These experiments demonstrate that the advantages of PLSA over standard LSA are not restricted to applications with performance criteria directly depending on the perplexity. Statistical objective functions like the perplexity (log-likelihood) may thus provide a general yardstick for analysis methods in text learning and information retrieval. To stress this point we ran an experiment on the MED data, where both perplexity and average precision have been monitored simultaneously as a function of β. The resulting curves, which show a striking correlation, are plotted in Figure 7.

5 Conclusion

We have proposed a novel method for unsupervised learning, called Probabilistic Latent Semantic Analysis, which is based on a statistical latent class model. We have argued that this approach is more principled than standard Latent Semantic Analysis, since it possesses a sound statistical foundation. Tempered Expectation Maximization has been presented as a powerful fitting procedure. We have experimentally verified the claimed advantages, achieving substantial performance gains. Probabilistic Latent Semantic Analysis has thus to be considered as a promising novel unsupervised learning method with a wide range of applications in text learning and information retrieval.

Acknowledgments

The author would like to thank Jan Puzicha, Andrew McCallum, Mike Jordan, Joachim Buhmann, Tali Tishby, Nelson Morgan, Jerry Feldman, Dan Gildea, Andrew Ng, Sebastian Thrun, and Tom Mitchell for stimulating discussions and helpful hints. This work has been supported by a DAAD fellowship.
cessing. In Proceedings of the 2nd International
References

[1] J.R. Bellegarda. Exploiting both local and global constraints for multi-span statistical language modeling. In Proceedings of ICASSP'98, volume 2, pages 677-680, 1998.

[2] N. Coccaro and D. Jurafsky. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of ICSLP-98, 1998. To appear.

[3] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 1990.

[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39:1-38, 1977.

[5] P.W. Foltz and S.T. Dumais. An analysis of information filtering methods. Communications of the ACM, 35(12):51-60, 1992.

[6] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of SIGIR'99, 1999.

[7] T. Hofmann, J. Puzicha, and M.I. Jordan. Unsupervised learning from dyadic data. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999.

[8] T.K. Landauer and S.T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240, 1997.

[9] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental and other variants. In M.I. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, 1998.

[10] F.C.N. Pereira, N.Z. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the ACL, pages 183-190, 1993.

[11] K. Rose, E. Gurewitz, and G. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(11):589-594, 1990.

[12] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[13] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing, 1997.
