Probabilistic Latent Semantic Analysis

Thomas Hofmann
EECS Department, Computer Science Division, University of California, Berkeley &
International Computer Science Institute, Berkeley, CA
hofmann@cs.berkeley.edu

Abstract

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.

1 Introduction

Learning from text and natural language is one of the great challenges of Artificial Intelligence and Machine Learning. Any substantial progress in this domain has strong impact on many applications ranging from information retrieval, information filtering, and intelligent interfaces, to speech recognition, natural language processing, and machine translation. One of the fundamental problems is to learn the meaning and usage of words in a data-driven fashion, i.e., from some given text corpus, possibly without further linguistic prior knowledge.

The main challenge a machine learning system has to address roots in the distinction between the lexical level of "what actually has been said or written" and the semantical level of "what was intended" or "what was referred to" in a text or an utterance. The resulting problems are twofold: (i) polysemes, i.e., a word may have multiple senses and multiple types of usage in different contexts, and (ii) synonyms and semantically related words, i.e., different words may have a similar meaning; they may at least in certain contexts denote the same concept or - in a weaker sense - refer to the same topic.

Latent semantic analysis (LSA) [3] is a well-known technique which partially addresses these questions. The key idea is to map high-dimensional count vectors, such as the ones arising in vector space representations of text documents [12], to a lower dimensional representation in a so-called latent semantic space. As the name suggests, the goal of LSA is to find a data mapping which provides information well beyond the lexical level and reveals semantical relations between the entities of interest. Due to its generality, LSA has proven to be a valuable analysis tool with a wide range of applications (e.g. [3, 5, 8, 1]). Yet its theoretical foundation remains to a large extent unsatisfactory and incomplete.

This paper presents a statistical view on LSA which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, its probabilistic variant has a sound statistical foundation and defines a proper generative model of the data. A detailed discussion of the numerous advantages of PLSA can be found in subsequent sections.

2 Latent Semantic Analysis

2.1 Count Data and Co-occurrence Tables

LSA can in principle be applied to any type of count data over a discrete dyadic domain (cf. [7]). However, since the most prominent application of LSA is in the analysis and retrieval of text documents, we focus on this setting for sake of concreteness.
Suppose therefore we have given a collection of text documents D = {d_1, ..., d_N} with terms from a vocabulary W = {w_1, ..., w_M}. By ignoring the sequential order in which words occur in a document, one may summarize the data in an N × M co-occurrence table of counts N = (n(d_i, w_j))_{ij}, where n(d, w) ∈ ℕ denotes how often the term w occurred in document d. In this particular case, N is also called the term-document matrix and the rows/columns of N are referred to as document/term vectors, respectively. The key assumption is that the simplified 'bag-of-words' or vector-space representation [12] of documents will in many cases preserve most of the relevant information, e.g., for tasks like text retrieval based on keywords.
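
As a small illustration (not part of the original paper; the three toy 'documents' are invented), the following Python snippet builds such a term-document count matrix from a bag-of-words representation.

    import numpy as np

    # Toy corpus: each 'document' is just a list of tokens (order will be ignored).
    docs = [
        "latent semantic analysis maps documents to a latent space".split(),
        "probabilistic models of text use latent variables".split(),
        "the vector space model represents documents as count vectors".split(),
    ]

    # Vocabulary W = {w_1, ..., w_M} and an index lookup for its terms.
    vocab = sorted({w for doc in docs for w in doc})
    w_index = {w: j for j, w in enumerate(vocab)}

    # N x M co-occurrence table: n(d_i, w_j) = how often term w_j occurs in document d_i.
    n = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, doc in enumerate(docs):
        for w in doc:
            n[i, w_index[w]] += 1

    print(n.shape)        # (3, M): one row (document vector) per document
    print(n[0].sum())     # total number of tokens in the first document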

2.2 Latent Semantic Analysis by SVD

As mentioned in the introduction, the key idea of LSA is to map documents (and by symmetry terms) to a vector space of reduced dimensionality, the latent semantic space [3]. The mapping is restricted to be linear and is based on a Singular Value Decomposition (SVD) of the co-occurrence table. One thus starts with the standard SVD given by N = U \Sigma V^t, where U and V are orthogonal matrices with U^t U = V^t V = I and the diagonal matrix \Sigma contains the singular values of N. The LSA approximation of N is computed by setting all but the largest K singular values in \Sigma to zero (= \tilde{\Sigma}), which is rank K optimal in the sense of the L_2-matrix norm. One obtains the approximation \tilde{N} = U \tilde{\Sigma} V^t \approx U \Sigma V^t = N. Notice that the document-to-document inner products based on this approximation are given by \tilde{N} \tilde{N}^t = U \tilde{\Sigma}^2 U^t, and hence one might think of the rows of U \tilde{\Sigma} as defining coordinates for documents in the latent space. While the original high-dimensional vectors are sparse, the corresponding low-dimensional latent vectors will typically not be sparse. This implies that it is possible to compute meaningful association values between pairs of documents, even if the documents do not have any terms in common. The hope is that terms having a common meaning, in particular synonyms, are roughly mapped to the same direction in the latent space.

3 Probabilistic LSA

3.1 The Aspect Model

The starting point for Probabilistic Latent Semantic Analysis is a statistical model which has been called the aspect model [7]. The aspect model is a latent variable model for co-occurrence data which associates an unobserved class variable z ∈ Z = {z_1, ..., z_K} with each observation. A joint probability model over D × W is defined by the mixture

P(d,w) = P(d) P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z) P(z|d).   (1)

Like virtually all statistical latent variable models, the aspect model introduces a conditional independence assumption, namely that d and w are independent conditioned on the state of the associated latent variable (the corresponding graphical model representation is depicted in Figure 1 (a)). Since the cardinality of z is smaller than the number of documents/words in the collection, z acts as a bottleneck variable in predicting words. It is worth noticing that the model can be equivalently parameterized by (cf. Figure 1 (b))

P(d,w) = \sum_{z \in Z} P(z) P(d|z) P(w|z),   (2)

which is perfectly symmetric in both entities, documents and words.

Figure 1: Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization.
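
To illustrate the generative view, the following sketch (Python/NumPy; the toy parameter tables and sample size are invented for illustration and are not taken from the paper) draws observations (d, w) according to the symmetric parameterization (2): first an aspect z is drawn from P(z), then a document and a word are drawn independently from P(d|z) and P(w|z).

    import numpy as np

    rng = np.random.default_rng(0)

    K, N_docs, M_words = 2, 3, 4              # aspects, documents, vocabulary size
    P_z = np.array([0.6, 0.4])                # P(z)
    P_d_given_z = np.array([[0.5, 0.3, 0.2],  # P(d|z) for the first aspect
                            [0.1, 0.2, 0.7]]) # P(d|z) for the second aspect
    P_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],
                            [0.1, 0.1, 0.3, 0.5]])

    def sample_pair():
        """Generate one (d, w) observation from the aspect model, eq. (2)."""
        z = rng.choice(K, p=P_z)
        d = rng.choice(N_docs, p=P_d_given_z[z])
        w = rng.choice(M_words, p=P_w_given_z[z])
        return d, w

    # The implied joint distribution P(d, w) = sum_z P(z) P(d|z) P(w|z):
    P_dw = np.einsum('z,zd,zw->dw', P_z, P_d_given_z, P_w_given_z)
    assert np.isclose(P_dw.sum(), 1.0)

    # Empirical counts from the generative process approach P(d, w).
    counts = np.zeros((N_docs, M_words))
    for _ in range(10000):
        d, w = sample_pair()
        counts[d, w] += 1
    print(np.round(counts / counts.sum(), 3))
    print(np.round(P_dw, 3))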

3.2 Model Fitting with the EM Algorithm

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm [4]. EM alternates two coupled steps: (i) an expectation (E) step where posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step, where parameters are updated. Standard calculations (cf. [7, 13]) yield the E-step equation

P(z|d,w) = \frac{P(z) P(d|z) P(w|z)}{\sum_{z' \in Z} P(z') P(d|z') P(w|z')},   (3)

as well as the following M-step formulae

P(w|z) \propto \sum_{d \in D} n(d,w) P(z|d,w),   (4)
P(d|z) \propto \sum_{w \in W} n(d,w) P(z|d,w),   (5)
P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d,w) P(z|d,w).   (6)
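
For concreteness, here is a minimal NumPy sketch of one EM iteration, implementing the E-step (3) and the M-steps (4)-(6) in vectorized form (this is not code from the paper; the array layout, random initialization, and fixed iteration count are illustrative choices).

    import numpy as np

    def em_step(n_dw, P_z, P_d_z, P_w_z, eps=1e-12):
        """One EM iteration for the aspect model.

        n_dw  : (N, M) co-occurrence counts n(d, w)
        P_z   : (K,)   aspect priors P(z)
        P_d_z : (K, N) document probabilities P(d|z)
        P_w_z : (K, M) word probabilities P(w|z)
        """
        # E-step, eq. (3): posterior P(z|d,w), shape (K, N, M).
        joint = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)

        # M-steps, eqs. (4)-(6): reweight the posteriors by the observed counts.
        weighted = n_dw[None, :, :] * post          # n(d,w) P(z|d,w), shape (K, N, M)
        P_w_z = weighted.sum(axis=1)                # sum over documents, eq. (4)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + eps
        P_d_z = weighted.sum(axis=2)                # sum over words, eq. (5)
        P_d_z /= P_d_z.sum(axis=1, keepdims=True) + eps
        P_z = weighted.sum(axis=(1, 2))             # sum over both, eq. (6)
        P_z /= P_z.sum() + eps
        return P_z, P_d_z, P_w_z

    # Usage sketch: random initialization followed by a fixed number of iterations.
    rng = np.random.default_rng(0)
    N, M, K = 5, 8, 2
    n_dw = rng.integers(0, 4, size=(N, M)).astype(float)
    P_z = np.full(K, 1.0 / K)
    P_d_z = rng.dirichlet(np.ones(N), size=K)
    P_w_z = rng.dirichlet(np.ones(M), size=K)
    for _ in range(50):
        P_z, P_d_z, P_w_z = em_step(n_dw, P_z, P_d_z, P_w_z)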

Before discussing further algorithmic refinements, we will study the relationship between the proposed model and LSA in more detail.

Figure 2: Sketch of the probability sub-simplex spanned by the aspect model.

3.3 Probabilistic Latent Semantic Space

Consider the class-conditional multinomial distributions P(·|z) over the vocabulary, which we call factors. They can be represented as points on the M-1 dimensional simplex of all possible multinomials. Via its convex hull, this set of K points defines a L ≤ K-1 dimensional sub-simplex. The modeling assumption expressed by (1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of factors P(w|z), where the mixing weights P(z|d) uniquely define a point on the spanned sub-simplex. A simple sketch of this situation is shown in Figure 2. Despite the discreteness of the introduced latent variables, a continuous latent space is obtained within the space of all multinomial distributions. Since the dimensionality of the sub-simplex is ≤ K-1 as opposed to a maximum of M-1 for the complete probability simplex, this performs a dimensionality reduction in the space of multinomial distributions, and the spanned sub-simplex can be identified with a probabilistic latent semantic space.

To stress this point and to clarify the relation to LSA, let us rewrite the aspect model as parameterized by (2) in matrix notation. Hence define matrices \hat{U} = (P(d_i|z_k))_{i,k}, \hat{V} = (P(w_j|z_k))_{j,k}, and \hat{\Sigma} = diag(P(z_k))_k. The joint probability model P can then be written as a matrix product P = \hat{U} \hat{\Sigma} \hat{V}^t. Comparing this with SVD, one can make the following observations: (i) outer products between rows of \hat{U} and \hat{V} reflect conditional independence in PLSA, (ii) the K factors correspond to the mixture components in the aspect model, (iii) the mixing proportions in PLSA substitute the singular values. The crucial difference between PLSA and LSA, however, is the objective function utilized to determine the optimal decomposition/approximation. In LSA, this is the L_2- or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on (possibly transformed) counts. In contrast, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side this offers important advantages; for example, the mixture approximation P of the co-occurrence table is a well-defined probability distribution and factors have a clear probabilistic meaning. In contrast, LSA does not define a properly normalized probability distribution and \tilde{N} may even contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.

A comparison of the computational complexity might suggest some advantages for LSA: ignoring potential problems of numerical stability, the SVD can be computed exactly, while the EM algorithm is an iterative method which is only guaranteed to find a local maximum of the likelihood function. However, in all our experiments the computing time of EM has not been significantly worse than performing an SVD on the co-occurrence matrix. There is also a large potential for improving the run-time performance of EM by on-line update schemes, which has not been explored so far.

3.4 Topic Decomposition and Polysemy

Let us briefly discuss some elucidating examples at this point which will also reveal a further advantage of PLSA over LSA in the context of polysemous words. We have generated a dataset (CLUSTER) with abstracts of 1568 documents on clustering and trained an aspect model with 128 latent classes. Four pairs of factors are visualized in Figure 3. These pairs have been selected as the two factors that have the highest probability to generate the words "segment", "matrix", "line", and "power", respectively. The sketchy characterization of the factors by their 10 most probable words already reveals interesting topics. In particular, notice that the term used to select a particular pair has a different meaning in either topic factor: (i) 'Segment' refers to an image region in the first and to a phonetic segment in the second factor. (ii) 'Matrix' denotes a rectangular table of numbers in one factor and a material in which something is embedded or enclosed in the other. (iii) 'Line' can refer to a line in an image, but also to a line in a spectrum. (iv) 'Power' is used in the context of radiating objects in astronomy, but also in electrical engineering.

    segment 1    segment 2     matrix 1      matrix 2      line 1       line 2      power 1     power 2
    imag         speaker       robust        manufactur    constraint   alpha       POWER       load
    SEGMENT      speech        MATRIX        cell          LINE         redshift    spectrum    memon
    texture      recogni       eigenvalu     part          match        LINE        omega       vlsi
    color        signal        uncertainti   MATRIX        locat        galaxi      mpc         POWER
    tissue       train         plane         cellular      imag         quasar      hsup        systolic
    brain        hmm           linear        famili        geometr      absorp      larg        input
    slice        source        condition     design        impos        high        redshift    complex
    cluster      speakerind.   perturb       machinepart   segment      ssup        galaxi      array
    mn           SEGMENT       root          format        fundament    densiti     standard    present
    volume       sound         suffici       group         recogn       veloc       model       implement

Figure 3: Eight selected factors from a 128 factor decomposition. The displayed word stems are the 10 most probable words in the class-conditional distribution P(w|z), from top to bottom in descending order.

Document 1, P(z_k|d_1, w = 'segment') = (0.951, 0.0001, ...), P(w = 'segment'|d_1) = 0.06:

    SEGMENT medic imag challeng problem field imag analysi diagnost base proper SEGMENT digit imag SEGMENT medic imag need SEGMENT despit exist techniqu SEGMENT applic involv estim boundari object classif tissu abnorm shape analysi contour detec textur specif medic imag remain crucial problem [...]

Document 2, P(z_k|d_2, w = 'segment') = (0.025, 0.867, ...), P(w = 'segment'|d_2) = 0.010:

    consid signal origin sequenc sourc specif problem SEGMENT signal relat SEGMENT sourc address issu wide applic field report describ [...] resolu method ergod hidden markov model hmm hmm state correspond signal sourc signal sourc sequenc determin decod procedur viterbi algorithm forward algorithm observ sequenc baumwelch train estim hmm paramet train materi applic multipl signal sourc identif problem experi perform unknown speaker identif [...]

Figure 4: Abstracts of 2 exemplary documents from the CLUSTER collection along with latent class posterior probabilities P(z|d, w = 'segment') and word probabilities P(w = 'segment'|d).

Figure 4 shows the abstracts of two exemplary documents which have been pre-processed by a standard stop-word list and a stemmer. The posterior probabilities for the classes given the different occurrences of 'segment' indicate how likely it is for each of the factors in the first pair of Figure 3 to have generated this observation. We have also displayed the estimates of the conditional word probabilities P(w = 'segment'|d_{1,2}). One can see that the correct meaning of the word 'segment' is identified in both cases. This implies that although 'segment' occurs frequently in both documents, the overlap in the factored representation is low, since 'segment' is identified as a polysemous word (relative to the chosen resolution level) which - dependent on the context - is explained by different factors.

3.5 Aspects versus Clusters

It is worth comparing the aspect model with statistical clustering models (cf. also [7]). In clustering models for documents, one typically associates a latent class variable with each document in the collection. Most closely related to our approach is the distributional clustering model [10, 7], which can be thought of as an unsupervised version of a naive Bayes' classifier. It can be shown that the conditional word probability of a probabilistic clustering model is given by

P(w|d) = \sum_{z \in Z} P\{c(d) = z\} P(w|z),   (7)

where P{c(d) = z} is the posterior probability of document d having latent class z. It is a simple implication of Bayes' rule that these posterior probabilities will concentrate their probability mass on a certain value z with an increasing number of observations (i.e., with the length of the document). This means that although (1) and (7) are algebraically equivalent, they are conceptually very different and yield in fact different results. The aspect model assumes that document-specific distributions are a convex combination of aspects, while the clustering model assumes there is just one cluster-specific distribution which is inherited by all documents in the cluster.¹ Thus in clustering models the class-conditionals P(w|z) have to capture the complete vocabulary of a subset (cluster) of documents, while factors can focus on certain aspects of the vocabulary of a subset of documents. For example, a factor can be very well used to explain some fraction of the words occurring in a document, although it might not explain other words at all (e.g., even assign zero probability), because these other words can be taken care of by other factors.

¹ In the distributional clustering model it is only the posterior uncertainty of the cluster assignments that induces some averaging over the class-conditional word distributions P(w|z).

Figure 5: Perplexity results as a function of the latent space dimensionality for (a) the MED data (rank 1033) and (b) the LOB data (rank 1674). Plotted results are for LSA (dashed-dotted curve) and PLSA (trained by TEM = solid curve, trained by early stopping EM = dotted curve). The upper baseline is the unigram model corresponding to marginal independence. The star at the right end of the PLSA curve denotes the perplexity of the largest trained aspect models (K = 2048).

3.6 Model Fitting Revisited: Improving Generalization by Tempered EM

So far we have focused on maximum likelihood estimation to fit a model from a given document collection. Although the likelihood or, equivalently, the perplexity² is the quantity we believe to be crucial in assessing the quality of a model, one clearly has to distinguish between the performance on the training data and on unseen test data. To derive conditions under which generalization on unseen data can be guaranteed is actually the fundamental problem of statistical learning theory. Here, we propose a generalization of maximum likelihood for mixture models which is known as annealing and is based on an entropic regularization term. The resulting method is called Tempered Expectation Maximization (TEM) and is closely related to deterministic annealing [11].

² Perplexity refers to the log-averaged inverse probability on unseen data.

The starting point of TEM is a derivation of the E-step based on an optimization principle. As has been pointed out in [9], the EM procedure in latent variable models can be obtained by minimizing a common objective function - the (Helmholtz) free energy - which for the aspect model is given by

F_\beta = -\beta \sum_{d,w} n(d,w) \sum_{z} \tilde{P}(z; d,w) \log P(d,w|z) P(z) + \sum_{d,w} n(d,w) \sum_{z} \tilde{P}(z; d,w) \log \tilde{P}(z; d,w).   (8)

Here \tilde{P}(z; d,w) are variational parameters which define a conditional distribution over Z, and β is a parameter which - in analogy to physical systems - is called the inverse computational temperature. Notice that the first contribution in (8) is the negative expected log-likelihood scaled by β. Thus in the case of \tilde{P}(z; d,w) = P(z|d,w), minimizing F w.r.t. the parameters defining P(d,w|z)P(z) amounts to the standard M-step in EM. In fact, it is straightforward to verify that the posteriors are obtained by minimizing F w.r.t. \tilde{P} at β = 1. In general \tilde{P} is determined by

\tilde{P}(z; d,w) = \frac{[P(z) P(d|z) P(w|z)]^\beta}{\sum_{z'} [P(z') P(d|z') P(w|z')]^\beta}.   (9)

This shows that the effect of the entropy at β < 1 is to dampen the posterior probabilities such that they will become closer to the uniform distribution with decreasing β.

Somewhat contrary to the spirit of annealing as a continuation method, we propose an 'inverse' annealing strategy which first performs EM iterations and then decreases β until performance on held-out data deteriorates. Compared to annealing this may accelerate the model fitting procedure significantly (e.g., by a factor of ≈ 10-50), and we have not found the test set performance of 'heated' models to be worse than the one achieved by carefully 'annealed' models. The TEM algorithm can thus be implemented in the following way:

1. Set β ← 1 and perform EM with early stopping.
2. Decrease β ← ηβ (with η < 1) and perform one TEM iteration.
3. As long as the performance on held-out data improves (non-negligibly), continue TEM iterations at this value of β; otherwise goto step 2.
4. Perform stopping on β, i.e., stop when decreasing β does not yield further improvements.
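
The tempered E-step (9) and the schedule above can be sketched as follows (Python/NumPy; this is a rough, simplified rendering rather than the authors' implementation: the decay factor eta, the patience rule, the improvement tolerance, and the use of held-out counts for the same documents are illustrative assumptions).

    import numpy as np

    def tempered_posterior(P_z, P_d_z, P_w_z, beta, eps=1e-12):
        """Tempered E-step, eq. (9): posteriors from terms raised to the power beta."""
        joint = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]  # (K, N, M)
        powered = joint ** beta
        return powered / (powered.sum(axis=0, keepdims=True) + eps)

    def m_step(n_dw, post, eps=1e-12):
        """M-step, eqs. (4)-(6), shared by EM (beta = 1) and TEM (beta < 1)."""
        weighted = n_dw[None, :, :] * post
        P_w_z = weighted.sum(axis=1)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + eps
        P_d_z = weighted.sum(axis=2)
        P_d_z /= P_d_z.sum(axis=1, keepdims=True) + eps
        P_z = weighted.sum(axis=(1, 2))
        P_z /= P_z.sum() + eps
        return P_z, P_d_z, P_w_z

    def perplexity(n_dw, P_z, P_d_z, P_w_z, eps=1e-12):
        """Perplexity (log-averaged inverse probability) of counts n(d, w)."""
        P_dw = np.einsum('z,zd,zw->dw', P_z, P_d_z, P_w_z)
        return np.exp(-(n_dw * np.log(P_dw + eps)).sum() / n_dw.sum())

    def tem_fit(n_train, n_heldout, P_z, P_d_z, P_w_z, eta=0.9, max_iter=500):
        """Rough rendering of the 'inverse annealing' schedule (steps 1-4): iterate at
        beta = 1 first, and lower beta by a factor eta whenever the perplexity on
        held-out counts (for the same documents) stops improving."""
        beta, best, stale = 1.0, np.inf, 0
        for _ in range(max_iter):
            post = tempered_posterior(P_z, P_d_z, P_w_z, beta)
            P_z, P_d_z, P_w_z = m_step(n_train, post)
            ppl = perplexity(n_heldout, P_z, P_d_z, P_w_z)
            if ppl < best - 1e-6:
                best, stale = ppl, 0      # step 3: keep iterating at this beta
            else:
                beta *= eta               # step 2: decrease beta
                stale += 1
                if stale > 3:             # step 4: lowering beta no longer helps
                    break
        return P_z, P_d_z, P_w_z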

4 Experimental Results

In the experimental evaluation, we focus on two tasks: (i) perplexity minimization for a document-specific unigram model and noun-adjective pairs, and (ii) automated indexing of documents. The evaluation of LSA and PLSA on the first task will demonstrate the advantages of explicitly minimizing perplexity by TEM; the second task will show that the solid statistical foundation of PLSA pays off even in applications which are not directly related to perplexity reduction.

Table 1: Average precision results and relative improvement w.r.t. the baseline method cos+tf for the 4 standard test collections. Compared are LSI, PLSI, as well as results obtained by combining PLSI models (PLSI*). An asterisk for LSI indicates that no performance gain could be achieved over the baseline; the result at 256 dimensions with λ = 2/3 is reported in this case.

             MED              CRAN             CACM             CISI
             prec.   impr.    prec.   impr.    prec.   impr.    prec.   impr.
    cos+tf   44.3    -        29.9    -        17.9    -        12.7    -
    LSI      51.7    +16.7    *28.7   -4.0     *16.0   -11.6    12.8    +0.8
    PLSI     63.9    +44.2    35.1    +17.4    22.9    +27.9    18.8    +48.0
    PLSI*    66.3    +49.7    37.5    +25.4    26.8    +49.7    20.1    +58.3

4.1 Perplexity Evaluation

In order to compare the predictive performance of PLSA and LSA one has to specify how to extract probabilities from an LSA decomposition. This problem is not trivial, since negative entries prohibit a simple re-normalization of the approximating matrix \tilde{N}. We have followed the approach of [2] to derive LSA probabilities.

Two data sets have been used to evaluate the perplexity performance: (i) a standard information retrieval test collection MED with 1033 documents, and (ii) a dataset with noun-adjective pairs generated from a tagged version of the LOB corpus. In the first case, the goal was to predict word occurrences based on (parts of) the words in a document. In the second case, nouns have to be predicted conditioned on an associated adjective. Figure 5 reports perplexity results for LSA and PLSA on the MED (a) and LOB (b) datasets in dependence on the number of dimensions of the (probabilistic) latent semantic space. PLSA outperforms the statistical model derived from standard LSA by far. On the MED collection PLSA reduces perplexity relative to the unigram baseline by more than a factor of three (3073/936 ≈ 3.3), while LSA achieves less than a factor of two in reduction (3073/1647 ≈ 1.9). On the less sparse LOB data the PLSA reduction in perplexity is 1316/547 ≈ 2.41, while the reduction achieved by LSA is only 1316/632 ≈ 2.08. In order to demonstrate the advantages of TEM, we have also trained aspect models on the MED data by standard EM with early stopping. As can be seen from the curves in Figure 5 (a), the difference between EM and TEM model fitting is significant. Although both strategies - tempering and early stopping - are successful in controlling the model complexity, EM training performs worse, since it makes a very inefficient use of the available degrees of freedom. Notice that with both methods it is possible to train high-dimensional models with a continuous improvement in performance. The number of latent space dimensions may even exceed the rank of the co-occurrence matrix N, and the choice of the number of dimensions becomes merely an issue of possible limitations of computational resources.

4.2 Information Retrieval

One of the key problems in information retrieval is automatic indexing, which has its main application in query-based retrieval. The most popular family of information retrieval techniques is based on the Vector Space Model (VSM) for documents [12]. Here, we have utilized a rather straightforward representation based on the (untransformed) term frequencies n(d, w) together with the standard cosine matching function; a more detailed experimental analysis can be found in [6]. The same representation applies to queries q, so that the matching function for the baseline term matching method can be written as

s(d,q) = \frac{\sum_w n(d,w)\, n(q,w)}{\sqrt{\sum_w n(d,w)^2}\, \sqrt{\sum_w n(q,w)^2}}.   (10)
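
A minimal sketch of this baseline matching function (Python/NumPy; the toy count vectors are invented for illustration):

    import numpy as np

    def cosine_match(n_d, n_q, eps=1e-12):
        """Baseline term-matching score s(d, q) of eq. (10):
        cosine similarity between raw term-frequency vectors."""
        return float(n_d @ n_q / (np.linalg.norm(n_d) * np.linalg.norm(n_q) + eps))

    # Toy term-frequency vectors over the same vocabulary.
    n_d = np.array([2., 0., 1., 3.])   # n(d, w) for one document
    n_q = np.array([1., 0., 0., 1.])   # n(q, w) for a query
    print(round(cosine_match(n_d, n_q), 3))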

In Latent Semantic Indexing (LSI), the original vector space representation of documents is replaced by a representation in the low-dimensional latent space and the similarity is computed based on that representation. Queries or documents which were not part of the original collection can be folded in by a simple matrix multiplication (cf. [3] for details). In our experiments, we have actually considered linear combinations of the original similarity score (10) (weight λ) and the one derived from the latent space representation (weight 1-λ).

The same ideas have been applied in Probabilistic Latent Semantic Indexing (PLSI) in conjunction with the PLSA model. More precisely, the low-dimensional representation in the factor space, P(z|d) and P(z|q), has been utilized to evaluate similarities. To achieve this, queries have to be folded in, which is done in PLSA by fixing the P(w|z) parameters and calculating weights P(z|q) by TEM.
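
A sketch of this folding-in step is given below (Python/NumPy; for simplicity it uses plain EM updates at fixed β = 1 to estimate the query weights, whereas the paper computes them by TEM; all names and the iteration count are illustrative).

    import numpy as np

    def fold_in_query(n_q, P_z, P_w_z, iters=50, eps=1e-12):
        """Estimate mixing weights P(z|q) for an unseen query q with term counts n_q,
        keeping the trained word distributions P(w|z) fixed.

        n_q   : (M,)   term counts of the query
        P_z   : (K,)   trained aspect priors, used only for initialization
        P_w_z : (K, M) trained (and frozen) word probabilities P(w|z)
        """
        P_z_q = P_z.copy()                              # initial guess for P(z|q)
        for _ in range(iters):
            # E-step restricted to the query: P(z|q, w) for every word w.
            joint = P_z_q[:, None] * P_w_z              # (K, M)
            post = joint / (joint.sum(axis=0, keepdims=True) + eps)
            # M-step updates only the query weights; P(w|z) stays fixed.
            P_z_q = (n_q[None, :] * post).sum(axis=1)
            P_z_q /= P_z_q.sum() + eps
        return P_z_q                                    # P(z|q), comparable to P(z|d)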

Figure 6: Precision-recall curves for term matching, LSI, and PLSI* on the 4 test collections.

One advantage of using statistical models vs. SVD techniques is that it allows us to systematically combine different models. While this should optimally be done according to a Bayesian model combination scheme, we have utilized a much simpler approach in our experiments which has nevertheless shown excellent performance and robustness. Namely, we have simply combined the cosine scores of all models with a uniform weight. The resulting method is referred to as PLSI*. Empirically we have found the performance to be very robust w.r.t. different (non-uniform) weights and also w.r.t. the λ-weight used in combination with the original cosine score. This is due to the noise reducing benefits of (model) averaging. Notice that LSA representations for different K form a nested sequence, which is not true for the statistical models, which are expected to capture a larger variety of reasonable decompositions.

Figure 7: Perplexity and average precision as a function of the inverse temperature β for an aspect model with K = 48 (left) and K = 128 (right).

We have utilized the following four medium-sized standard document collections with relevance assessment: (i) MED (1033 document abstracts from the National Library of Medicine), (ii) CRAN (1400 document abstracts on aeronautics from the Cranfield Institute of Technology), (iii) CACM (3204 abstracts from the CACM Journal), and (iv) CISI (1460 abstracts in library science from the Institute for Scientific Information). The condensed results in terms of average precision-recall (at the 9 recall levels 10%-90%) are summarized in Table 1, while the corresponding precision-recall curves can be found in Figure 6. Here are some additional details of the experimental setup: PLSA models at K = 32, 48, 64, 80, 128 have been trained by TEM for each data set with 10% held-out data. For PLSI we report the best result obtained by any of these models; for LSI we report the best result obtained for the optimal dimension (exploring 32-512 dimensions at a step size of 8). The combination weight λ with the cosine baseline score has been coarsely optimized by hand: MED, CRAN: λ = 1/2; CACM, CISI: λ = 2/3.

The experiments consistently validate the advantages of PLSI over LSI. Substantial performance gains have been achieved for all 4 data sets. Notice that the relative precision gain compared to the baseline method is typically around 100% in the most interesting intermediate regime of recall! In particular, PLSI works well even in cases where LSI fails completely (these problems of LSI are in accordance with the original results reported in [3]). The benefits of model combination are also very substantial. In all cases the (uniformly) combined model performed better than the best single model. As a side effect, model averaging also liberated us from selecting the correct model dimensionality. These experiments demonstrate that the advantages of PLSA over standard LSA are not restricted to applications with performance criteria directly depending on the perplexity. Statistical objective functions like the perplexity (log-likelihood) may thus provide a general yardstick for analysis methods in text learning and information retrieval. To stress this point we ran an experiment on the MED data, where both perplexity and average precision have been monitored simultaneously as a function of β. The resulting curves, which show a striking correlation, are plotted in Figure 7.

5 Conclusion

We have proposed a novel method for unsupervised learning, called Probabilistic Latent Semantic Analysis, which is based on a statistical latent class model. We have argued that this approach is more principled than standard Latent Semantic Analysis, since it possesses a sound statistical foundation. Tempered Expectation Maximization has been presented as a powerful fitting procedure. We have experimentally verified the claimed advantages, achieving substantial performance gains. Probabilistic Latent Semantic Analysis has thus to be considered as a promising novel unsupervised learning method with a wide range of applications in text learning and information retrieval.

Acknowledgments

The author would like to thank Jan Puzicha, Andrew McCallum, Mike Jordan, Joachim Buhmann, Tali Tishby, Nelson Morgan, Jerry Feldman, Dan Gildea, Andrew Ng, Sebastian Thrun, and Tom Mitchell for stimulating discussions and helpful hints. This work has been supported by a DAAD fellowship.
cessing. In Proceedings of the 2nd International
References

[1] J.R. Bellegarda. Exploiting both local and global constraints for multi-span statistical language modeling. In Proceedings of ICASSP'98, volume 2, pages 677-680, 1998.

[2] N. Coccaro and D. Jurafsky. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of ICSLP-98, 1998. To appear.

[3] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 1990.

[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39:1-38, 1977.

[5] P.W. Foltz and S.T. Dumais. An analysis of information filtering methods. Communications of the ACM, 35(12):51-60, 1992.

[6] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of SIGIR'99, 1999.

[7] T. Hofmann, J. Puzicha, and M.I. Jordan. Unsupervised learning from dyadic data. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999.

[8] T.K. Landauer and S.T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240, 1997.

[9] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental and other variants. In M.I. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, 1998.

[10] F.C.N. Pereira, N.Z. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the ACL, pages 183-190, 1993.

[11] K. Rose, E. Gurewitz, and G. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(11):589-594, 1990.

[12] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[13] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing, 1997.
