Probabilistic Latent Semantic Analysis
Thomas Hofmann
EECS Department, Computer Science Division, University of California, Berkeley &
International Computer Science Institute, Berkeley, CA
hofmann@cs.berkeley.edu
1 Introduction

Learning from text and natural language is one of the great challenges of Artificial Intelligence and Machine Learning. Any substantial progress in this domain has strong impact on many applications ranging from information retrieval, information filtering, and intelligent interfaces, to speech recognition, natural language processing, and machine translation. One of the fundamental problems is to learn the meaning and usage of words in a data-driven fashion, i.e., from some given text corpus. [...] In contrast to LSA, its probabilistic variant has a sound statistical foundation and defines a proper generative model of the data. A detailed discussion of the numerous advantages of PLSA can be found in subsequent sections.

2 Latent Semantic Analysis

2.1 Count Data and Co-occurrence Tables

Suppose we are given a collection of text documents D = {d_1, ..., d_N} with terms from a vocabulary W = {w_1, ..., w_M}. By ignoring the sequential order in which words occur in a document, one may summarize the data in an N x M co-occurrence table of counts N = (n(d_i, w_j))_{ij}, where n(d, w) ∈ ℕ denotes how often the term w occurred in document d.
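As a concrete illustration (not part of the original paper), the following Python/NumPy sketch builds such a count table from already tokenized documents; the function and variable names are our own.

    import numpy as np

    def count_table(docs, vocab):
        """Build the N x M co-occurrence table with entries n(d, w).

        docs  : list of N documents, each given as a list of word tokens
        vocab : list of M vocabulary words w_1, ..., w_M
        """
        index = {w: j for j, w in enumerate(vocab)}
        counts = np.zeros((len(docs), len(vocab)), dtype=int)
        for i, doc in enumerate(docs):
            for token in doc:
                j = index.get(token)
                if j is not None:        # tokens outside the vocabulary are ignored
                    counts[i, j] += 1    # word order is deliberately discarded
        return counts

Only the frequencies n(d, w) are retained; this bag-of-words summary is exactly the input assumed by both LSA and the probabilistic model below.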
2.2 Latent Semantic Analysis by SVD

As mentioned in the introduction, the key idea of LSA is to map documents (and by symmetry terms) to a vector space of reduced dimensionality, the latent semantic space [3]. The mapping is restricted to be linear and is based on a Singular Value Decomposition (SVD) of the co-occurrence table. One thus starts from the SVD of the co-occurrence matrix N and retains only the K largest singular values; one might think of the rows of U\Sigma as defining coordinates for documents (respectively terms) in the latent space. The rationale is that terms having a common meaning, in particular synonyms, are roughly mapped to the same direction in the latent space.

3 Probabilistic LSA

3.1 The Aspect Model

The starting point for Probabilistic Latent Semantic Analysis is a statistical model which has been called the aspect model [7]. The aspect model is a latent variable model for co-occurrence data which associates an unobserved class variable z \in Z = {z_1, ..., z_K} with each observation. A joint probability model over D x W is defined by the mixture

P(d, w) = P(d) P(w|d),   P(w|d) = \sum_{z \in Z} P(w|z) P(z|d),   (1)

or, equivalently, by the symmetric parameterization

P(d, w) = \sum_{z \in Z} P(z) P(d|z) P(w|z).   (2)

Like virtually all statistical latent variable models, the aspect model introduces a conditional independence assumption, namely that d and w are independent conditioned on the state of the associated latent variable (the corresponding graphical model representation is depicted in Figure 1 (a)). Since the cardinality of z is smaller than the number of documents/words in the collection, z acts as a bottleneck variable in predicting words.

3.2 Model Fitting with the EM Algorithm

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm [4]. For the aspect model one obtains the E-step equation

P(z|d,w) = \frac{P(z) P(d|z) P(w|z)}{\sum_{z' \in Z} P(z') P(d|z') P(w|z')},   (3)

as well as the following M-step formulae

P(w|z) \propto \sum_{d \in D} n(d,w) P(z|d,w),   (4)

P(d|z) \propto \sum_{w \in W} n(d,w) P(z|d,w),   (5)

P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d,w) P(z|d,w).   (6)

Before discussing further algorithmic refinements, we will study the relationship between the proposed model and LSA in more detail.
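To make the fitting procedure concrete, here is a minimal NumPy sketch (our own illustration, not the author's implementation) that iterates the E-step (3) and the M-steps (4)-(6) on a count table n(d, w); it stores P(z), P(d|z), and P(w|z) as dense arrays.

    import numpy as np

    def plsa_em(counts, K, iters=50, seed=0):
        """Fit the aspect model P(d,w) = sum_z P(z) P(d|z) P(w|z) by EM.

        counts : (N, M) array with entries n(d, w)
        K      : number of latent classes (aspects)
        """
        rng = np.random.default_rng(seed)
        N, M = counts.shape
        Pz = np.full(K, 1.0 / K)                                        # P(z)
        Pd_z = rng.random((K, N)); Pd_z /= Pd_z.sum(1, keepdims=True)   # P(d|z)
        Pw_z = rng.random((K, M)); Pw_z /= Pw_z.sum(1, keepdims=True)   # P(w|z)

        for _ in range(iters):
            # E-step (3): posteriors P(z|d,w), shape (K, N, M)
            joint = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]
            post = joint / joint.sum(axis=0, keepdims=True)

            # M-steps (4)-(6): expected counts n(d,w) P(z|d,w), then renormalize
            weighted = counts[None, :, :] * post
            Pw_z = weighted.sum(axis=1)                  # (4)
            Pw_z /= Pw_z.sum(axis=1, keepdims=True)
            Pd_z = weighted.sum(axis=2)                  # (5)
            Pd_z /= Pd_z.sum(axis=1, keepdims=True)
            Pz = weighted.sum(axis=(1, 2))               # (6)
            Pz /= Pz.sum()
        return Pz, Pd_z, Pw_z

For realistic collections one would of course exploit the sparsity of the count table rather than form the full (K, N, M) posterior array, but the updates are exactly those of (3)-(6).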
3.3 Probabilistic Latent Semantic Space

[Figure 2: Sketch of the probability sub-simplex spanned by the aspect model.]

In contrast to LSA, the model fitting in PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side this offers important advantages; for example, the mixture approximation of the co-occurrence table is a well-defined probability distribution, whereas the LSA approximation is not properly normalized and may even contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control.
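As a small numerical complement (our own sketch, not from the paper), the quantity maximized by the EM procedure above, and hence the cross entropy being minimized up to an additive constant, can be evaluated directly from the fitted parameters:

    import numpy as np

    def log_likelihood(counts, Pz, Pd_z, Pw_z):
        """L = sum_{d,w} n(d,w) log P(d,w) for the aspect model."""
        Pdw = np.einsum('z,zd,zw->dw', Pz, Pd_z, Pw_z)   # model P(d,w), shape (N, M)
        mask = counts > 0                                # only observed pairs contribute
        return float((counts[mask] * np.log(Pdw[mask])).sum())

    def cross_entropy(counts, Pz, Pd_z, Pw_z):
        """Cross entropy between the empirical distribution n(d,w)/sum(n) and the
        model; minimizing it is equivalent to maximizing the log-likelihood."""
        return -log_likelihood(counts, Pz, Pd_z, Pw_z) / counts.sum()

Monitoring this quantity over the iterations is also a convenient sanity check, since the log-likelihood must increase (the cross entropy decrease) monotonically under EM.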
"segment 1" "segment 2" "matrix 1" "matrix 2" "line 1" "line 2" "power 1" power 2"
consid signal origin sequenc sourc specif problem SEGMENT signal relat SEGMENT sourc address issu wide applic field report describ
[ .]
resolu method ergod hidden markov model hmm hmm state correspond signal sourc signal sourc sequenc determin decod procedur viterbi
algorithm forward algorithm observ sequenc baumwelch train estim hmm paramet train materi applic multipl signal sourc identif problem
experi perform unknown speaker identif . .
Figure 4: Abstracts of 2 exemplary documents from the CLUSTER collection along with latent class posterior
probabilities P{zJd, w= 'segment'} and word probabilities P{w = 'segment'Jd}.
3.4 Topic Decomposition and Polysemy

[...]

(iv) 'Power' is used in the context of radiating objects in astronomy, but also in electrical engineering.

Figure 4 shows the abstracts of two exemplary documents which have been pre-processed by a standard stop-word list and a stemmer. The posterior probabilities for the classes given the different occurrences of 'segment' indicate how likely it is for each of the factors in the first pair of Figure 3 to have generated this observation. We have also displayed the estimates of the conditional word probabilities P{w = 'segment'|d_{1,2}}. One can see that the correct meaning of the word 'segment' is identified in both cases. This implies that although 'segment' occurs frequently in both documents, the overlap in the factored representation is low, since 'segment' is identified as a polysemous word (relative to the chosen resolution level) which, dependent on the context, is explained by different factors.

3.5 Aspects versus Clusters

It is worth comparing the aspect model with statistical clustering models (cf. also [7]). In clustering models for documents, one typically associates a latent class variable with each document in the collection. Most closely related to our approach is the distributional clustering model [10, 7], which can be thought of as an unsupervised version of a naive Bayes classifier. It can be shown that the conditional word probability of a probabilistic clustering model is given by

P(w|d) = \sum_{z \in Z} P{c(d) = z} P(w|z),   (7)

where P{c(d) = z} is the posterior probability of document d having latent class z. It is a simple implication of Bayes' rule that these posterior probabilities will concentrate their probability mass on a certain value z with an increasing number of observations (i.e., with the length of the document). This means that although (1) and (7) are algebraically equivalent, they are conceptually very different and yield in fact different results. The aspect model assumes that document-specific distributions are a convex combination of aspects, while the clustering model assumes there is just one cluster-specific distribution which is inherited by all documents in the cluster.¹

¹ In the distributional clustering model it is only the posterior uncertainty of the cluster assignments that induces some averaging over the class-conditional word distributions P(w|z).
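The concentration effect can be illustrated numerically. The following sketch (our own, with invented numbers) computes the clustering-model posterior P{c(d)=z}, which is proportional to P(z) times the product of P(w|z)^{n(d,w)} over words, for a toy document and for the same document with all counts multiplied by ten; the posterior of the longer document is far more sharply peaked, whereas the aspect model instead fits a separate mixing proportion P(z|d) and a posterior P(z|d,w) per observed word.

    import numpy as np

    def cluster_posterior(doc_counts, Pz, Pw_z):
        """Posterior P{c(d)=z} of a probabilistic clustering (naive Bayes) model.

        doc_counts : (M,) word counts n(d, w) of one document
        Pz         : (K,) class priors
        Pw_z       : (K, M) class-conditional word distributions
        """
        # log P(z) + sum_w n(d,w) log P(w|z), computed in log space for stability
        log_post = np.log(Pz) + doc_counts @ np.log(Pw_z).T
        log_post -= log_post.max()
        post = np.exp(log_post)
        return post / post.sum()

    # toy example with K = 2 classes and M = 3 words (numbers are invented)
    Pz = np.array([0.5, 0.5])
    Pw_z = np.array([[0.7, 0.2, 0.1],
                     [0.2, 0.2, 0.6]])
    doc = np.array([3, 1, 1])
    print(cluster_posterior(doc, Pz, Pw_z))        # approx. [0.88, 0.12]
    print(cluster_posterior(10 * doc, Pz, Pw_z))   # essentially [1.0, 0.0]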
Thus in clustering models the class-conditionals P(w|z) have to capture the complete vocabulary of a subset (cluster) of documents, while factors can focus on certain aspects of the vocabulary of a subset of documents. For example, a factor can very well be used to explain some fraction of the words occurring in a document, although it might not explain other words at all (e.g., even assign zero probability), because these other words can be taken care of by other factors.

[Figure 5: Perplexity as a function of the number of latent space dimensions on the MED (a) and LOB (b) data; curves for LSA, PLSA, EM, and TEM.]
3.6 Model Fitting Revisited: Improving Generalization by Tempered EM

The approach we propose is a form of annealing and is based on an entropic regularization term. The resulting method is called Tempered Expectation Maximization (TEM) and is closely related to deterministic annealing [11].

The starting point of TEM is a derivation of the E-step in which a control parameter β is introduced:

P_\beta(z|d,w) = \frac{[P(z) P(d|z) P(w|z)]^\beta}{\sum_{z' \in Z} [P(z') P(d|z') P(w|z')]^\beta}.   (9)

For β = 1 this is the standard E-step (3), while the posteriors (9) become closer to the uniform distribution with decreasing β. Somewhat contrary to the spirit of annealing as a continuation method, we propose an 'inverse' annealing strategy which first performs EM iterations and then decreases β.
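A minimal sketch of the tempered E-step (9) and of a held-out perplexity that could be used to control β follows (our own illustration; the specific β schedule and the held-out split are assumptions, not the exact recipe of the paper):

    import numpy as np

    def tempered_posterior(Pz, Pd_z, Pw_z, beta):
        """Tempered E-step (9): exponentiate the joint by beta and renormalize.
        beta = 1 recovers the ordinary E-step (3); as beta decreases, the
        posteriors move towards the uniform distribution over z."""
        joint = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]
        tempered = joint ** beta
        return tempered / tempered.sum(axis=0, keepdims=True)

    def perplexity(heldout_counts, Pz, Pd_z, Pw_z):
        """Perplexity, i.e. the log-averaged inverse probability P(w|d)
        evaluated on held-out counts."""
        Pdw = np.einsum('z,zd,zw->dw', Pz, Pd_z, Pw_z)
        Pw_d = Pdw / Pdw.sum(axis=1, keepdims=True)      # conditional P(w|d)
        mask = heldout_counts > 0
        ll = (heldout_counts[mask] * np.log(Pw_d[mask])).sum()
        return float(np.exp(-ll / heldout_counts.sum()))

In an inverse annealing loop one would first run EM at β = 1 and then shrink β by some factor (an assumed choice, e.g. 0.95) as long as the held-out perplexity keeps improving.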
Table 1: Average precision results and relative improvement w.r.t. the baseline method cos+tf for the 4 standard test collections. Compared are LSI, PLSI, as well as results obtained by combining PLSI models (PLSI*). An asterisk for LSI indicates that no performance gain could be achieved over the baseline; the result at 256 dimensions with λ = 2/3 is reported in this case.

             MED                CRAN               CACM               CISI
             prec.   impr.(%)   prec.   impr.(%)   prec.   impr.(%)   prec.   impr.(%)
    cos+tf   44.3    -          29.9    -          17.9    -          12.7    -
    LSI      51.7    +16.7      *28.7   -4.0       *16.0   -11.6      12.8    +0.8
    PLSI     63.9    +44.2      35.1    +17.4      22.9    +27.9      18.8    +48.0
    PLSI*    66.3    +49.7      37.5    +25.4      26.8    +49.7      20.1    +58.3
4 Experimental Results

In the experimental evaluation, we focus on two tasks: (i) perplexity² minimization for a document-specific unigram model and for noun-adjective pairs, and (ii) automated indexing of documents. The evaluation of LSA and PLSA on the first task will demonstrate the advantages of explicitly minimizing perplexity by TEM; the second task will show that the solid statistical foundation of PLSA pays off even in applications which are not directly related to perplexity reduction.

² Perplexity refers to the log-averaged inverse probability on unseen data.

4.1 Perplexity Evaluation

In order to compare the predictive performance of PLSA and LSA one has to specify how to extract probabilities from an LSA decomposition. This problem is not trivial, since the LSA approximation does not directly define a probability distribution. [...] In the noun-adjective data, nouns have to be predicted conditioned on an associated adjective. Figure 5 reports perplexity results for LSA and PLSA on the MED (a) and LOB (b) datasets as a function of the number of dimensions of the (probabilistic) latent semantic space. PLSA outperforms the statistical model derived from standard LSA by far. On the MED collection PLSA reduces the perplexity relative to the unigram baseline by more than a factor of three (3073/936 ≈ 3.3), while LSA achieves less than a factor of two in reduction (3073/1647 ≈ 1.9). On the less sparse LOB data the PLSA reduction in perplexity is 1316/547 ≈ 2.41, while the reduction achieved by LSA is only 1316/632 ≈ 2.08. In order to demonstrate the advantages of TEM, we have also trained aspect models on the MED data by standard EM with early stopping. As can be seen from the curves in Figure 5 (a), the difference between EM and TEM model fitting is significant. Although both strategies, tempering and early stopping, control overfitting, early stopping is clearly inferior since it makes a very inefficient use of the available degrees of freedom. Notice that with both methods it is possible to train high-dimensional models with a continuous improvement in performance. The number of latent space dimensions may even exceed the rank of the co-occurrence matrix N, and the choice of the number of dimensions becomes merely an issue of possible limitations of computational resources.

4.2 Information Retrieval

The baseline method, cos+tf in Table 1, represents documents d and queries q by their term frequencies n(d, w) and n(q, w) and measures their similarity by the cosine score

s(d, q) = \frac{\sum_w n(d,w) n(q,w)}{\sqrt{\sum_w n(d,w)^2} \sqrt{\sum_w n(q,w)^2}}.   (10)

In Latent Semantic Indexing (LSI), the original vector space representation of documents is replaced by a representation in the low-dimensional latent space and the similarity is computed based on that representation. Queries or documents which were not part of the original collection can be folded in by a simple matrix multiplication (cf. [3] for details). In our experiments, we have actually considered linear combinations of the original similarity score (10) (weight λ) and the one derived from the latent space representation (weight 1 - λ).

The same ideas have been applied in Probabilistic Latent Semantic Indexing (PLSI) in conjunction with the aspect model.
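For concreteness, here is a small sketch of the baseline cosine score (10) and of the λ-weighted combination used in the experiments (our own illustration; how the latent-space score itself is obtained, e.g. from folded-in LSI or PLSI representations, is left abstract here):

    import numpy as np

    def cosine_score(d_counts, q_counts):
        """Baseline similarity (10) between term-frequency vectors n(d,.) and n(q,.)."""
        num = float(np.dot(d_counts, q_counts))
        den = np.sqrt(np.dot(d_counts, d_counts)) * np.sqrt(np.dot(q_counts, q_counts))
        return num / den if den > 0 else 0.0

    def combined_score(d_counts, q_counts, latent_score, lam):
        """Linear combination of the original score (10) (weight lambda) and a
        score computed in the latent space (weight 1 - lambda)."""
        return lam * cosine_score(d_counts, q_counts) + (1.0 - lam) * latent_score

As noted below, λ was coarsely tuned by hand per collection (MED, CRAN: λ = 1/2; CACM, CISI: λ = 2/3).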
[Figure 6: Precision-recall curves for term matching, LSI, and PLSI* on the 4 test collections.]
An advantage of using statistical models instead of SVD techniques is that this allows us to systematically combine different models. [...] The weight λ of the cosine baseline score has been coarsely optimized by hand (MED, CRAN: λ = 1/2; CACM, CISI: λ = 2/3).

The experiments consistently validate the advantages of PLSI over LSI. Substantial performance gains have been achieved for all 4 data sets. Notice that the relative precision gain compared to the baseline method is typically around 100% in the most interesting intermediate regime of recall! In particular, PLSI works well even in cases where LSI fails completely (these problems of LSI are in accordance with the original results reported in [3]). The benefits of model combination are also very substantial. In all cases the (uniformly) combined model performed better than the best single model. As a side effect, model averaging also liberated us from selecting the correct model dimensionality. These experiments demonstrate that the advantages of PLSA over standard LSA are not restricted to applications with performance criteria directly depending on the perplexity. Statistical objective functions like the perplexity (log-likelihood) may thus provide a general yardstick for analysis methods in text learning and information retrieval. To stress this point we ran an experiment on the MED data, where both perplexity and average precision have been monitored simultaneously as a function of β. The resulting curves, which show a striking correlation, are plotted in Figure 7.

References

[1] [...] modeling. In Proceedings of ICASSP'98, volume 2, pages 677-80, 1998.

[2] N. Coccaro and D. Jurafsky. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of ICSLP-98, 1998. To appear.

[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 1990.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39:1-38, 1977.

[5] P. W. Foltz and S. T. Dumais. An analysis of information filtering methods. Communications of the ACM, 35(12):51-60, 1992.

[6] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of SIGIR'99, 1999.

[7] T. Hofmann, J. Puzicha, and M. I. Jordan. Unsupervised learning from dyadic data. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999.