Inex03 Pto
The next section provides background on language modeling in information retrieval. In Section 3 we present our approach to modeling structured documents. Section 4 describes querying the tree-based language models presented in the previous section. In Section 5 we describe the indexes required to support retrieval and the retrieval algorithms. We describe the experiment setup and indexes used for INEX 2003 in Section 6. Section 7 describes experimental results. We discuss related work in Section 8 and offer closing remarks in Section 9.

2.2 The Maximum-Likelihood Estimate of a Language Model
The most direct way to estimate a language model given some observed text is to use the maximum-likelihood estimate, assuming an underlying multinomial model. In this case, the maximum-likelihood estimate is also the empirical distribution. An advantage of this estimate is that it is easy to compute. It is very good at estimating the probability distribution for the language model when the size of the observed text is very large. It is given by:

P(w | θ_T) = freq(w, T) / |T|

where freq(w, T) is the number of times the word w occurs in the text T and |T| is the length of T in words.
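As a small illustration of the estimate above (the function name and whitespace tokenization are ours, not part of any system described in this paper):

```python
from collections import Counter

def mle_language_model(text):
    """Maximum-likelihood unigram model: P(w | theta_T) = freq(w, T) / |T|."""
    tokens = text.lower().split()  # naive whitespace tokenization
    total = len(tokens)
    return {w: count / total for w, count in Counter(tokens).items()}

model = mle_language_model("to be or not to be")
# "to" and "be" each occur twice among six tokens, so each has probability 2/6
```

As the text notes, the estimate is trivial to compute, but it assigns zero probability to unseen words, which motivates the smoothing discussed next.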
2.3 Linear Interpolation
A simple way to combine several language models θ_1, …, θ_k is to take a linear interpolation of the models:

P(w | θ) = Σ_{i=1}^{k} λ_i P(w | θ_i)

where k is the number of language models we are combining, and λ_i is the weight on the model θ_i. To ensure that this is a valid probability distribution, we must place these constraints on the lambdas:

Σ_{i=1}^{k} λ_i = 1 and λ_i ≥ 0 for 1 ≤ i ≤ k

One use of linear interpolation is to smooth a document's language model with a collection language model. This new model would then be used as the smoothed document language model in either the generative or KL-divergence ranking approach.

2.4 Another Characterization
When we take a simple linear interpolation of the maximum-likelihood model estimated from text and a collection model, we can also characterize the probability estimates as:

P(w | θ_T) = Pseen(w | θ_T) if w ∈ T, Punseen(w | θ_T) otherwise

3. MODELING STRUCTURED DOCUMENTS

[Figure: an example document tree, with sibling nodes for section 1, section 2, and the references.]

Nodes in the document tree correspond directly to XML tags in the document. For each document node in the tree, we estimate a language model. The language models for leaf nodes with no children can be estimated from the text of the node. The language models for other nodes are estimated by taking a linear interpolation of a language model formed from the text in the node (but not in any of its children) and the language models formed from the children.

We have not specified how the linear interpolation parameters for combining language models in the document tree should be chosen. This could be task specific, and training may be required. The approach we adopt in this paper is to set the weight on a child node to the accumulated length of the text in the child divided by the accumulated length of the node. By accumulated length we mean the number of words directly in the node plus the accumulated lengths of the node's children. Setting the parameters in this manner assumes that a word in one node type is no more important than a word in any other node type; it is the accumulated length of the text in the node that determines how much information is contained in the node.

We also wish to smooth the maximum-likelihood models that are estimated directly from the text with a collection language model. In this work, we combine the maximum-likelihood models with the collection model using a linear interpolation with fixed weights. The collection model may be specific to the node type, giving context sensitive smoothing, or the collection model may be one large model estimated from everything in the corpus, giving a larger sample size.

When the λ parameters are set proportional to the text length and a single collection model is used, this results in a special case that is very similar to the models used in [5][9]. The tree-based language model estimated using these parameter settings will be identical to a language model estimated by taking a simple linear interpolation of a maximum-likelihood estimate from the text in the node and its ancestors and the collection model.
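The tree estimation procedure can be sketched as follows (the Node class and its methods are our illustration, not code from Lemur, and collection smoothing is omitted). With the accumulated-length weighting, the interpolated model of a node is identical to a maximum-likelihood estimate pooled over all text in the node's subtree:

```python
from collections import Counter

class Node:
    """A document tree node holding its own text and child nodes."""
    def __init__(self, text="", children=None):
        self.tokens = text.split()
        self.children = children or []

    def accumulated_length(self):
        # words directly in the node plus the accumulated lengths of the children
        return len(self.tokens) + sum(c.accumulated_length() for c in self.children)

    def language_model(self):
        """Linear interpolation of the node's own maximum-likelihood model and
        the children's models, weighted by accumulated text length."""
        acc = self.accumulated_length()
        model = Counter()
        if self.tokens:
            lam_self = len(self.tokens) / acc
            for w, count in Counter(self.tokens).items():
                model[w] += lam_self * count / len(self.tokens)
        for child in self.children:
            lam_child = child.accumulated_length() / acc
            for w, p in child.language_model().items():
                model[w] += lam_child * p
        return dict(model)

article = Node("xml", [Node("xml retrieval"), Node("language models models")])
# accumulated length of the article is 6; P("models") = 2/6, P("xml") = 2/6
```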
4. RANKING THE TREE MODELS
In a retrieval environment for structured documents, it is desirable to provide support for both structured queries and unstructured, free-text queries. It is easier to adapt the generative language model to structured documents, so we only consider that model in this paper. It is simpler to support unstructured queries, so we describe retrieval for them first.

4.1 Unstructured Queries
To rank document components for unstructured queries, we use the generative language modeling approach for IR described in Section 2. For full document retrieval, we need only compute the probability that the document language model generated the query. If we wish to return arbitrary document components, we need to compute the probability that each component generated the query.

Allowing the system to return arbitrary document components may result in the system stuffing the results list with many components from a single document. This behavior is undesirable, so a filter on the results is necessary.

One filter we employ takes a greedy approach to preventing overlap among components in the results list. A result is thrown out of the results list if any component higher in the ranking is an ancestor or descendant of the document component under consideration.

4.2 Structured Queries
Our previous paper on this subject [11] discusses how some structural query operators could be included in the model. We do not currently support any of these operators in our system, so we will not discuss them in depth here. However, we note that the retrieval framework can support most desired structural query operators as relatively easy to implement query nodes.

4.3 Prior Probabilities
Given relevance assessments from past topics, we can estimate prior probabilities of a document component being relevant given its type. Another example prior may depend on the length of the text in the node. A way to incorporate this information is to rank by the probability of the document node given the query. Using Bayes' rule, this allows us to incorporate the priors on the nodes. The prior for only the node being ranked would be used, and the system would multiply the probability that the node generated the query by the prior:

P(N | Q) = P(Q | θ_N) P(N) / P(Q) ∝ P(Q | θ_N) P(N)

This results in ranking by the probability of the document component node given the query, rather than the other way around.

5. STORAGE AND ALGORITHMS
This section describes how we support structured retrieval in the Lemur toolkit. We first describe the indexes built to support retrieval. Then we describe how the indexes are used by the retrieval algorithm. We also present formulas for the computation of the generative probabilities we estimate for retrieval.

5.1 Index Support
There are two main storage structures in Lemur that provide the support necessary for the retrieval algorithm. Lemur stores inverted indexes containing document and node occurrences, and document structure information.

5.1.1 Inverted Indexes
The basic idea behind storing structured documents in Lemur for retrieval is to use a modified inverted list. Similar to storing term locations for a document entry in an inverted list, we store the nodes and the term frequencies of the term in the nodes in the document entries of the inverted list. The current implementation of the structured document index does not store term locations, but could be adapted to store them in the future.

The inverted lists are keyed by term, and each list contains the following:
• document frequency of the term
• a list of document entries, each entry containing
  o document id
  o term frequency (count of term in document)
  o number of nodes the term occurs in
  o a list of node entries, each entry containing
    § node id
    § term frequency (count of term in node)
When read into memory, the inverted lists are stored in an array of integers. The lists are stored on disk using restricted variable-length compression, and delta encoding is applied to document ids and node ids. In the document entry lists, the document entries are stored in ascending document id order. The node entry lists are similarly stored in increasing node id order. Document entries and node entries are only stored in the list when the term frequency is greater than zero. Access to the lists on disk is facilitated with an in-memory lookup table for vocabulary terms.

There is also an analogous set of inverted lists for attribute name/value pairs associated with tags. For example, if the document contained the text <date calendar="Gregorian">, the index would have an inverted list keyed by the triple date/calendar/Gregorian. The structure and information stored in the inverted lists for the attribute name/value pairs is identical to those in the inverted lists for terms.
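A minimal sketch of delta encoding over ascending ids follows; here a standard variable-byte scheme stands in for Lemur's restricted variable-length compression, whose exact byte layout is not specified in this paper. The high bit marks the final byte of each gap:

```python
def delta_vbyte_encode(ids):
    """Delta-encode a strictly ascending id list, then variable-byte code the gaps."""
    out = bytearray()
    previous = 0
    for i in ids:
        gap = i - previous
        previous = i
        chunks = []
        while True:
            chunks.append(gap & 0x7F)  # low 7 bits of the remaining gap
            gap >>= 7
            if gap == 0:
                break
        for b in reversed(chunks[1:]):  # high-order bytes first, high bit clear
            out.append(b)
        out.append(chunks[0] | 0x80)    # final byte of the gap, high bit set
    return bytes(out)

def delta_vbyte_decode(data):
    ids, previous, value = [], 0, 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        if b & 0x80:                    # final byte: emit the accumulated gap
            previous += value
            ids.append(previous)
            value = 0
    return ids

encoded = delta_vbyte_encode([5, 130, 430])
# gaps 5, 125, 300 take 1, 1, and 2 bytes respectively
```

Small gaps between nearby ids compress to single bytes, which is why the entries are kept in ascending id order.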
5.1.2 Document Structure
The document structure is stored compressed in memory using restricted variable-length compression. A lookup table keyed by document id provides quick access to the block of compressed memory for a document. We choose to store the document structure in memory because it will be requested often during retrieval. For each document, a list of information about the document nodes is stored. For each node, we store:
• parent of the node
• type of the node
• length of the node (number of words)

Since this list of information about the document structure is compressed using a variable-length encoding, we must decompress the memory to provide efficient access to information about the nodes. When the document structure for a document is being decompressed, we also compute:
• accumulated length of the node (length of text directly in the node + accumulated lengths of children)
• number of children of the node
• a list of the node's children

This decompression and computation of other useful information about the document structure is done in time linear in the number of nodes in the document being decompressed.
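The derived quantities can be sketched as two linear passes over the stored arrays; the pass order relies on the same assumption the index makes, that a parent id is always smaller than its children's ids (the function and argument names are ours):

```python
def expand_structure(parents, lengths):
    """parents[i] is the parent id of node i (-1 for the root, and always < i);
    lengths[i] is the number of words directly in node i.
    Returns (accumulated lengths, child lists), each computed in linear time."""
    n = len(parents)
    accumulated = list(lengths)
    children = [[] for _ in range(n)]
    for node in range(n - 1, 0, -1):   # descending ids: visit children before parents
        accumulated[parents[node]] += accumulated[node]
    for node in range(1, n):
        children[parents[node]].append(node)
    return accumulated, children

# root (1 word) with children 1 (2 words) and 2 (3 words); node 3 (4 words) under node 2
acc, kids = expand_structure([-1, 0, 0, 2], [1, 2, 3, 4])
```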
5.2 Retrieval
We construct a query tree to process and rank document components. A typical query tree is illustrated below. The leaf nodes of the query tree are term nodes, which read the inverted lists for a term off of disk and create result objects for document components containing the term. The term nodes are also responsible for propagating the term scores up the document tree. The sum node merges the result lists returned by each of the term nodes, combining the score estimates. The score adjuster node adjusts the score estimates to get the generation probabilities and also applies any priors. The heap node maintains a list of the top n ranked objects and returns a sorted result list. Efficient retrieval is achieved using a document-at-a-time approach. This requires that the query tree be walked many times during the evaluation of a query, but results in a large saving of memory, as only the result objects for a document and the top n result objects in the heap must be stored at any point in time.

[Figure: an example query tree for the query "gregorian chant" — a heap node over a score adjuster node, over a sum node, over term nodes for "gregorian" and "chant".]

A more detailed description of each of the query nodes follows. When a query node is called, it is passed a document id to evaluate. In order to know which document should be processed next, the term nodes pass up the next document id in the inverted list. For other query nodes, the minimum next document id among a node's children gets passed up the query tree with the results list. We describe the query nodes bottom up, as that is how the scores are computed.

We first note that we can rewrite the log of the probability that the document node generated the query as

log P(Q | θ_node) = Σ_{w ∈ Q, node} qtf(w) log [Pseen(w | θ_node) / Punseen(w | θ_node)] + Σ_{w ∈ Q} qtf(w) log Punseen(w | θ_node)

as shown in [16]. This allows us to easily compute the items in the first sum using term nodes, combine these components of the score using a sum node, and then add on the rest using a score adjustment node.

5.2.1 Term Node
The term nodes read in the inverted lists for a term w and create results where the score for a result is initialized to

qtf(w) · log [Pseen(w | θ_node) / Punseen(w | θ_node)]

The term node assumes that the parent id of a node is smaller than the node's id. It also assumes that the document entries in inverted lists are organized in increasing document id order and the node entries are organized in increasing node id order. The structured document index we built is organized this way. In the following algorithm description, indentation is used to denote the body of a loop.

1 Seek to the next entry in the inverted list where the document id is at least as large as the requested document
2 If the document id of the next entry is the requested document
  3 Decompress the document structure information for the document
  4 Read in the node entries from the inverted list
  5 Create the result objects for the leaf nodes. For each node that contains the term:
    6 Initialize the score for the result to the seen probability part for the node
      seen(node) = (1 − ω) · (freq(w, node) / length(node)) · λ(node, node)
      where
      λ(node, node) = length(node) / accumulated length(node)
      and ω will be used to set the influence of the collection models.
    7 Push the node id onto the candidate node heap
    8 Store the result object in an array indexed by node id for fast access
  9 While the candidate node heap isn't empty:
    10 Pop the top node id off of the heap (the largest node id), set it to the current node id
    11 Look up the result from the result array
    12 Look up the node id for the parent of the current node
    13 Look up the parent node's result
    14 If the parent node's result object is NULL:
      15 Create a new result object for the parent node and put it in the result array, initializing the score to 0
      16 Push the parent node's id onto the candidate node heap
    17 Propagate the seen part of the score from the current node to the parent node, setting the parent node's seen part to
      seen(parent) + seen(node) · λ(node, parent)
      where
      λ(node, parent) = accumulated length(node) / accumulated length(parent)
    18 Push the result onto the front of the results list
    19 Set the result in the result array for the node to NULL (initializing the result array for the next document)
  [Now each document node that contains the query term (or has a child containing the term) has a result in the results list where the score is the seen probability part for the query term]
    22 Set the score for the result to
      qtf(w) · log [(seen(node) + unseen(w, node)) / unseen(w, node)]
  23 Return the result list and the next document id in the inverted list

The result list now contains results for a single document, where the score is

qtf(w) · log [Pseen(w | θ_node) / Punseen(w | θ_node)]

and the list is ordered by increasing node id.

5.2.2 Sum Node
The sum node maintains an array of result lists, with one result list for each of the children. It seeks to the next entry in each of the child result lists where the document id is at least as large as the requested document. If necessary, it calls the children nodes to get their next result lists. For the requested document, the sum node merges results from the result lists of the children, setting the score of the new result equal to the sum of the children's results with the same document and node id. This node assumes that results in a result list are ordered by increasing document id, then increasing node id. The results returned by this component have the score

Σ_{w ∈ Q, node} qtf(w) log [Pseen(w | θ_node) / Punseen(w | θ_node)]

and the minimum document id returned by the children is returned.

5.2.3 Score Adjustment Node
The score adjustment node adds

Σ_{w ∈ Q} qtf(w) log Punseen(w | θ_node)

to each of the results, where

Punseen(w | θ_node) = unseen(w, node)

as defined for the term node. If there is a prior probability for the node, the score adjustment node also adds on the log of the prior. The results in the list now have the score log P(Q | θ_node), plus the log of the prior for the node when one is used.
We submitted three official runs as described in Table 2. All of our runs used the title, description, and keyword fields of the topics. Unfortunately, two of our runs performed rather poorly. This is due to either an error in our path filter or a problem with the component type priors. We would also like to evaluate the additional runs corresponding to the dashes in the table, but we have not been able to do these experiments yet.

The LM_context_TDK run has good performance across all measures. This is our basic language modeling system using context sensitive smoothing. The strong performance of the context sensitive language modeling approach speaks well for the flexibility of language modeling.

For the content only topics, context sensitive smoothing does not help. The node type priors also do not consistently help. There was a significant problem with the path filters we used.

With regard to context sensitive smoothing, it may not make much difference for content only tasks, as they typically search for textual components such as paragraphs, sections, and articles. The characteristics of the text in these components tend to be very similar, so the context sensitive smoothing may not be helpful.

With regard to component type priors, we have observed similar puzzling behavior in [12]. We discovered that the distributions observed in the rankings after applying the prior probabilities are not the desired distributions. We are actively working on new techniques to incorporate this information in a way that will provide the desired distributions of results in the rankings.

8. RELATED WORK
There exists a large and growing body of work on retrieving information from XML documents. Some work is described in our previous paper [11], and much of the more recent work is also described in the INEX 2002 proceedings [14]. With that in mind, we will focus our discussion of related work on language modeling approaches for structured document retrieval.

In [5] a generative language modeling approach for content only queries is described where a document component's language model is estimated by taking a linear interpolation of the maximum-likelihood model from the text of the node and its ancestors and a collection model. This corresponds to a special case of our approach. Our model is more flexible in that it allows context sensitive smoothing and different weighting of text in children nodes.

The authors of [9] also present a generative language model for content only queries in structured document retrieval. They estimate the collection model in a different way, using document frequencies instead of collection term frequencies. As with [5], this model can be viewed as a special case of the language modeling approach presented here.

9. CLOSING REMARKS
We presented experiments using a hierarchical language model. The strong performance of language modeling algorithms demonstrates the flexibility and ease of adapting language models to the problem. In our preliminary experiments with standard text queries, context sensitive smoothing did not give much different performance than using a single collection model.

We described data structures and retrieval algorithms to support retrieval of arbitrary XML document components within the Lemur toolkit. We are reasonably pleased with the efficiency of the algorithms for a research system, but we will strive to improve the algorithms and data structures to reduce retrieval times even further.

In our future work, we would like to compare component retrieval to standard document retrieval. We would also like to investigate query expansion using XML document components. Additionally, we would like to explore different ways of setting the λ weights on the nodes' language models, as we believe that words in some components may convey more useful information than words in other components.
10. ACKNOWLEDGMENTS
This work was sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors, and do not necessarily reflect those of the sponsor.

11. REFERENCES
[1] Fuhr, N. and K. Großjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001), ACM Press, 172-180.

[2] Grabs, T. and H.J. Schek. Generating vector spaces on-the-fly for flexible XML retrieval. In Proceedings of the 25th Annual International ACM SIGIR Workshop on XML Information Retrieval (2002), ACM.

[3] Hatano, K., H. Kinutani, M. Yoshikawa, and S. Uemura. Information retrieval system for XML documents. In Proceedings of Database and Expert Systems Applications (DEXA 2002), Springer, 758-767.

[4] Hiemstra, D. Using language models for information retrieval, Ph.D. Thesis (2001), University of Twente.

[5] Hiemstra, D. A database approach to context-based XML retrieval. In [14], 111-118.

[6] Kazai, G., M. Lalmas, and T. Rölleke. A model for the representation and focused retrieval of structured documents based on fuzzy aggregation. In The 8th Symposium on String Processing and Information Retrieval (SPIRE 2001), IEEE, 123-135.

[7] Kazai, G., M. Lalmas, and T. Rölleke. Focussed structured document retrieval. In Proceedings of the 9th Symposium on String Processing and Information Retrieval (SPIRE 2002), Springer, 241-247.

[8] Lafferty, J., and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001), ACM Press, 111-119.

[9] List, J., and A.P. de Vries. CWI at INEX 2002. In [14], 133-140.

[10] Myaeng, S.H., D.H. Jang, M.S. Kim, and Z.C. Zhoo. A flexible model for retrieval of SGML documents. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), ACM Press, 138-145.

[11] Ogilvie, P. and J. Callan. Language models and structured document retrieval. In [14], 33-40.

[12] Ogilvie, P. and J. Callan. Combining structural information and the use of priors in mixed named-page and homepage finding. To appear in Proceedings of the Twelfth Text REtrieval Conference (TREC 2003).

[13] Ponte, J., and W.B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), ACM Press, 275-281.

[14] Proceedings of the First Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), 2003, DELOS.

[15] Westerveld, T., W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, URLs, and anchors. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001), NIST Special Publication 500-250 (2002), 663-672.

[16] Zhai, C. and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001), ACM Press, 334-342.