Inex03 Pto
The next section provides background on language modeling in information retrieval. In Section 3 we present our approach to modeling structured documents. Section 4 describes querying the tree-based language models presented in the previous section. In Section 5 we describe the indexes required to support retrieval and the retrieval algorithms. We describe the experiment setup and indexes used for INEX 2003 in Section 6. Section 7 describes experimental results. We discuss related work in Section 8 and offer closing remarks in Section 9.

2.2 The Maximum-Likelihood Estimate of a Language Model
The most direct way to estimate a language model given some observed text is to use the maximum-likelihood estimate, assuming an underlying multinomial model. In this case, the maximum-likelihood estimate is also the empirical distribution. An advantage of this estimate is that it is easy to compute. It is very good at estimating the probability distribution for the language model when the size of the observed text is very large. It is given by:

P(w | θ_T) = freq(w, T) / |T|

where freq(w, T) is the number of times the word w occurs in the text T and |T| is the length of T in words.
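As a small illustration of the estimate above (the function name and whitespace tokenization are ours, not part of any system described in this paper):

```python
from collections import Counter

def mle_language_model(text):
    """Maximum-likelihood unigram model: P(w | theta_T) = freq(w, T) / |T|."""
    tokens = text.lower().split()  # naive whitespace tokenization
    total = len(tokens)
    return {w: count / total for w, count in Counter(tokens).items()}

model = mle_language_model("to be or not to be")
# "to" and "be" each occur twice among six tokens, so each has probability 2/6
```

As the text notes, the estimate is trivial to compute, but it assigns zero probability to unseen words, which motivates the smoothing discussed next.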
2.3 Linear Interpolation
A simple way to combine several language models θ_1, …, θ_k is to take a linear interpolation of the models:

P(w | θ) = Σ_{i=1}^{k} λ_i P(w | θ_i)

where k is the number of language models we are combining, and λ_i is the weight on the model θ_i. To ensure that this is a valid probability distribution, we must place these constraints on the lambdas:

Σ_{i=1}^{k} λ_i = 1 and λ_i ≥ 0 for 1 ≤ i ≤ k

One use of linear interpolation is to smooth a document's language model with a collection language model. This new model would then be used as the smoothed document language model in either the generative or KL-divergence ranking approach.

2.4 Another Characterization
When we take a simple linear interpolation of the maximum-likelihood model estimated from text and a collection model, we can also characterize the probability estimates as:

P(w | θ_T) = Pseen(w | θ_T) if w ∈ T, Punseen(w | θ_T) otherwise

3. MODELING STRUCTURED DOCUMENTS

[Figure: an example document tree, with sibling nodes for section 1, section 2, and the references.]

Nodes in the document tree correspond directly to XML tags in the document. For each document node in the tree, we estimate a language model. The language models for leaf nodes with no children can be estimated from the text of the node. The language models for other nodes are estimated by taking a linear interpolation of a language model formed from the text in the node (but not in any of its children) and the language models formed from the children.

We have not specified how the linear interpolation parameters for combining language models in the document tree should be chosen. This could be task specific, and training may be required. The approach we adopt in this paper is to set the weight on a child node to the accumulated length of the text in the child divided by the accumulated length of the node. By accumulated length we mean the number of words directly in the node plus the accumulated lengths of the node's children. Setting the parameters in this manner assumes that a word in one node type is no more important than a word in any other node type; it is the accumulated length of the text in the node that determines how much information is contained in the node.

We also wish to smooth the maximum-likelihood models that are estimated directly from the text with a collection language model. In this work, we combine the maximum-likelihood models with the collection model using a linear interpolation with fixed weights. The collection model may be specific to the node type, giving context sensitive smoothing, or the collection model may be one large model estimated from everything in the corpus, giving a larger sample size.

When the λ parameters are set proportional to the text length and a single collection model is used, this results in a special case that is very similar to the models used in [5][9]. The tree-based language model estimated using these parameter settings will be identical to a language model estimated by taking a simple linear interpolation of a maximum-likelihood estimate from the text in the node and its ancestors and the collection model.
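The tree estimation procedure can be sketched as follows (the Node class and its methods are our illustration, not code from Lemur, and collection smoothing is omitted). With the accumulated-length weighting, the interpolated model of a node is identical to a maximum-likelihood estimate pooled over all text in the node's subtree:

```python
from collections import Counter

class Node:
    """A document tree node holding its own text and child nodes."""
    def __init__(self, text="", children=None):
        self.tokens = text.split()
        self.children = children or []

    def accumulated_length(self):
        # words directly in the node plus the accumulated lengths of the children
        return len(self.tokens) + sum(c.accumulated_length() for c in self.children)

    def language_model(self):
        """Linear interpolation of the node's own maximum-likelihood model and
        the children's models, weighted by accumulated text length."""
        acc = self.accumulated_length()
        model = Counter()
        if self.tokens:
            lam_self = len(self.tokens) / acc
            for w, count in Counter(self.tokens).items():
                model[w] += lam_self * count / len(self.tokens)
        for child in self.children:
            lam_child = child.accumulated_length() / acc
            for w, p in child.language_model().items():
                model[w] += lam_child * p
        return dict(model)

article = Node("xml", [Node("xml retrieval"), Node("language models models")])
# accumulated length of the article is 6; P("models") = 2/6, P("xml") = 2/6
```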
4. RANKING THE TREE MODELS
In a retrieval environment for structured documents, it is desirable to provide support for both structured queries and unstructured, free-text queries. It is easier to adapt the generative language model to structured documents, so we only consider that model in this paper. It is simpler to support unstructured queries, so we describe retrieval for them first.

4.1 Unstructured Queries
To rank document components for unstructured queries, we use the generative language modeling approach for IR described in Section 2. For full document retrieval, we need only compute the probability that the document language model generated the query. If we wish to return arbitrary document components, we need to compute the probability that each component generated the query.

Allowing the system to return arbitrary document components may result in the system stuffing the results list with many components from a single document. This behavior is undesirable, so a filter on the results is necessary.

One filter we employ takes a greedy approach to preventing overlap among components in the results list. A result is thrown out of the results list if any component higher in the ranking is an ancestor or descendant of the document component under consideration.

4.2 Structured Queries
Our previous paper on this subject [11] discusses how some structural query operators could be included in the model. We do not currently support any of these operators in our system, so we will not discuss them in depth here. However, we note that the retrieval framework can support most desired structural query operators as relatively easy to implement query nodes.

4.3 Prior Probabilities
Given relevance assessments from past topics, we can estimate prior probabilities of a document component being relevant given its type. Another example prior may depend on the length of the text in the node. A way to incorporate this information is to rank by the probability of the document node given the query. Using Bayes' rule, this allows us to incorporate the priors on the nodes. The prior for only the node being ranked would be used, and the system would multiply the probability that the node generated the query by the prior:

P(N | Q) = P(Q | θ_N) P(N) / P(Q) ∝ P(Q | θ_N) P(N)

This results in ranking by the probability of the document component node given the query, rather than the other way around.

5. STORAGE AND ALGORITHMS
This section describes how we support structured retrieval in the Lemur toolkit. We first describe the indexes built to support retrieval. Then we describe how the indexes are used by the retrieval algorithm. We also present formulas for the computation of the generative probabilities we estimate for retrieval.

5.1 Index Support
There are two main storage structures in Lemur that provide the support necessary for the retrieval algorithm. Lemur stores inverted indexes containing document and node occurrences, and document structure information.

5.1.1 Inverted Indexes
The basic idea behind storing structured documents in Lemur for retrieval is to use a modified inverted list. Similar to storing term locations for a document entry in an inverted list, we store the nodes and the term frequencies of the term in the nodes in the document entries of the inverted list. The current implementation of the structured document index does not store term locations, but could be adapted to store them in the future.

The inverted lists are keyed by term, and each list contains the following:
• document frequency of the term
• a list of document entries, each entry containing
  o document id
  o term frequency (count of term in document)
  o number of nodes the term occurs in
  o a list of node entries, each entry containing
    § node id
    § term frequency (count of term in node)
When read into memory, the inverted lists are stored in an array of integers. The lists are stored on disk using restricted variable-length compression, and delta encoding is applied to document ids and node ids. In the document entry lists, the document entries are stored in ascending document id order. The node entry lists are similarly stored in increasing node id order. Document entries and node entries are only stored in the list when the term frequency is greater than zero. Access to the lists on disk is facilitated with an in-memory lookup table for vocabulary terms.

There is also an analogous set of inverted lists for attribute name/value pairs associated with tags. For example, if the document contained the text <date calendar="Gregorian">, the index would have an inverted list keyed by the triple date/calendar/Gregorian. The structure and information stored in the inverted lists for the attribute name/value pairs is identical to those in the inverted lists for terms.
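A minimal sketch of delta encoding over ascending ids follows; here a standard variable-byte scheme stands in for Lemur's restricted variable-length compression, whose exact byte layout is not specified in this paper. The high bit marks the final byte of each gap:

```python
def delta_vbyte_encode(ids):
    """Delta-encode a strictly ascending id list, then variable-byte code the gaps."""
    out = bytearray()
    previous = 0
    for i in ids:
        gap = i - previous
        previous = i
        chunks = []
        while True:
            chunks.append(gap & 0x7F)  # low 7 bits of the remaining gap
            gap >>= 7
            if gap == 0:
                break
        for b in reversed(chunks[1:]):  # high-order bytes first, high bit clear
            out.append(b)
        out.append(chunks[0] | 0x80)    # final byte of the gap, high bit set
    return bytes(out)

def delta_vbyte_decode(data):
    ids, previous, value = [], 0, 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        if b & 0x80:                    # final byte: emit the accumulated gap
            previous += value
            ids.append(previous)
            value = 0
    return ids

encoded = delta_vbyte_encode([5, 130, 430])
# gaps 5, 125, 300 take 1, 1, and 2 bytes respectively
```

Small gaps between nearby ids compress to single bytes, which is why the entries are kept in ascending id order.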
5.1.2 Document Structure
The document structure is stored compressed in memory using restricted variable-length compression. A lookup table keyed by document id provides quick access to the block of compressed memory for a document. We choose to store the document structure in memory because it will be requested often during retrieval. For each document, a list of information about the document nodes is stored. For each node, we store:
• parent of the node
• type of the node
• length of the node (number of words)

Since this list of information about the document structure is compressed using a variable-length encoding, we must decompress the memory to provide efficient access to information about the nodes. When the document structure for a document is being decompressed, we also compute:
• accumulated length of the node (length of text directly in the node + accumulated lengths of children)
• number of children of the node
• a list of the node's children

This decompression and computation of other useful information about the document structure is done in time linear in the number of nodes in the document being decompressed.
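The derived quantities can be sketched as two linear passes over the stored arrays; the pass order relies on the same assumption the index makes, that a parent id is always smaller than its children's ids (the function and argument names are ours):

```python
def expand_structure(parents, lengths):
    """parents[i] is the parent id of node i (-1 for the root, and always < i);
    lengths[i] is the number of words directly in node i.
    Returns (accumulated lengths, child lists), each computed in linear time."""
    n = len(parents)
    accumulated = list(lengths)
    children = [[] for _ in range(n)]
    for node in range(n - 1, 0, -1):   # descending ids: visit children before parents
        accumulated[parents[node]] += accumulated[node]
    for node in range(1, n):
        children[parents[node]].append(node)
    return accumulated, children

# root (1 word) with children 1 (2 words) and 2 (3 words); node 3 (4 words) under node 2
acc, kids = expand_structure([-1, 0, 0, 2], [1, 2, 3, 4])
```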
5.2 Retrieval
We construct a query tree to process and rank document components. A typical query tree is illustrated below. The leaf nodes of the query tree are term nodes, which read the inverted lists for a term off of disk and create result objects for document components containing the term. The term nodes are also responsible for propagating the term scores up the document tree. The sum node merges the result lists returned by each of the term nodes, combining the score estimates. The score adjuster node adjusts the score estimates to get the generation probabilities and also applies any priors. The heap node maintains a list of the top n ranked objects and returns a sorted result list. Efficient retrieval is achieved using a document-at-a-time approach. This requires that the query tree be walked many times during the evaluation of a query, but results in a large saving of memory, as only the result objects for a document and the top n result objects in the heap must be stored at any point in time.

[Figure: an example query tree for the query "gregorian chant" — a heap node over a score adjuster node, over a sum node, over term nodes for "gregorian" and "chant".]

A more detailed description of each of the query nodes follows. When a query node is called, it is passed a document id to evaluate. In order to know which document should be processed next, the term nodes pass up the next document id in the inverted list. For other query nodes, the minimum next document id among a node's children gets passed up the query tree with the results list. We describe the query nodes bottom up, as that is how the scores are computed.

We first note that we can rewrite the log of the probability that the document node generated the query as

log P(Q | θ_node) = Σ_{w ∈ Q, node} qtf(w) log [Pseen(w | θ_node) / Punseen(w | θ_node)] + Σ_{w ∈ Q} qtf(w) log Punseen(w | θ_node)

as shown in [16]. This allows us to easily compute the items in the first sum using term nodes, combine these components of the score using a sum node, and then add on the rest using a score adjustment node.

5.2.1 Term Node
The term nodes read in the inverted lists for a term w and create results where the score for a result is initialized to

qtf(w) · log [Pseen(w | θ_node) / Punseen(w | θ_node)]

The term node assumes that the parent id of a node is smaller than the node's id. It also assumes that the document entries in inverted lists are organized in increasing document id order and the node entries are organized in increasing node id order. The structured document index we built is organized this way. In the following algorithm description, indentation is used to denote the body of a loop.

1 Seek to the next entry in the inverted list where the document id is at least as large as the requested document
2 If the document id of the next entry is the requested document
  3 Decompress the document structure information for the document
  4 Read in the node entries from the inverted list
  5 Create the result objects for the leaf nodes. For each node that contains the term:
    6 Initialize the score for the result to the seen probability part for the node
      seen(node) = (1 − ω) · (freq(w, node) / length(node)) · λ(node, node)
      where
      λ(node, node) = length(node) / accumulated length(node)
      and ω will be used to set the influence of the collection models.
    7 Push the node id onto the candidate node heap
    8 Store the result object in an array indexed by node id for fast access
  9 While the candidate node heap isn't empty:
    10 Pop the top node id off of the heap (the largest node id), set it to the current node id
    11 Look up the result from the result array
    12 Look up the node id for the parent of the current node
    13 Look up the parent node's result
    14 If the parent node's result object is NULL:
      15 Create a new result object for the parent node and put it in the result array, initializing the score to 0
      16 Push the parent node's id onto the candidate node heap
    17 Propagate the seen part of the score from the current node to the parent node, setting the parent node's seen part to
      seen(parent) + seen(node) · λ(node, parent)
      where
      λ(node, parent) = accumulated length(node) / accumulated length(parent)
    18 Push the result onto the front of the results list
    19 Set the result in the result array for the node to NULL (initializing the result array for the next document)
  [Now each document node that contains the query term (or has a child containing the term) has a result in the results list where the score is the seen probability part for the query term]
    22 Set the score for the result to
      qtf(w) · log [(seen(node) + unseen(w, node)) / unseen(w, node)]
  23 Return the result list and the next document id in the inverted list

The result list now contains results for a single document, where the score is

qtf(w) · log [Pseen(w | θ_node) / Punseen(w | θ_node)]

and the list is ordered by increasing node id.

5.2.2 Sum Node
The sum node maintains an array of result lists, with one result list for each of the children. It seeks to the next entry in each of the child result lists where the document id is at least as large as the requested document. If necessary, it calls the children nodes to get their next result lists. For the requested document, the sum node merges results from the result lists of the children, setting the score of the new result equal to the sum of the children's results with the same document and node id. This node assumes that results in a result list are ordered by increasing document id, then increasing node id. The results returned by this component have the score

Σ_{w ∈ Q, node} qtf(w) log [Pseen(w | θ_node) / Punseen(w | θ_node)]

and the minimum document id returned by the children is returned.

5.2.3 Score Adjustment Node
The score adjustment node adds

Σ_{w ∈ Q} qtf(w) log Punseen(w | θ_node)

to each of the results, where

Punseen(w | θ_node) = unseen(w, node)

as defined for the term node. If there is a prior probability for the node, the score adjustment node also adds on the log of the prior. The results in the list now have the score log P(Q | θ_node), plus the log of the prior for the node when one is used.
We submitted three official runs as described in Table 2. All of our runs used the title, description, and keyword fields of the topics. Unfortunately, two of our runs performed rather poorly. This is due to either an error in our path filter or a problem with the component type priors. We would also like to evaluate the additional runs corresponding to the dashes in the table, but we have not been able to do these experiments yet.

The LM_context_TDK run has good performance across all measures. This is our basic language modeling system using context sensitive smoothing. The strong performance of the context sensitive language modeling approach speaks well for the flexibility of language modeling.

For the content only topics, context sensitive smoothing does not help. The node type priors also do not consistently help. There was a significant problem with the path filters we used.

With regard to context sensitive smoothing, it may not make much difference for content only tasks, as they typically search for textual components such as paragraphs, sections, and articles. The characteristics of the text in these components tend to be very similar, so the context sensitive smoothing may not be helpful.

With regard to component type priors, we have observed similar puzzling behavior in [12]. We discovered that the distributions observed in the rankings after applying the prior probabilities are not the desired distributions. We are actively working on new techniques to incorporate this information in a way that will provide the desired distributions of results in the rankings.

8. RELATED WORK
There exists a large and growing body of work on retrieving information from XML documents. Some work is described in our previous paper [11], and much of the more recent work is also described in the INEX 2002 proceedings [14]. With that in mind, we will focus our discussion of related work on language modeling approaches for structured document retrieval.

In [5] a generative language modeling approach for content only queries is described where a document component's language model is estimated by taking a linear interpolation of the maximum-likelihood model from the text of the node and its ancestors and a collection model. This corresponds to a special case of our approach. Our model is more flexible in that it allows context sensitive smoothing and different weighting of text in children nodes.

The authors of [9] also present a generative language model for content only queries in structured document retrieval. They estimate the collection model in a different way, using document frequencies instead of collection term frequencies. As with [5], this model can be viewed as a special case of the language modeling approach presented here.

9. CLOSING REMARKS
We presented experiments using a hierarchical language model. The strong performance of language modeling algorithms demonstrates the flexibility and ease of adapting language models to the problem. In our preliminary experiments with standard text queries, context sensitive smoothing did not give much different performance than using a single collection model.

We described data structures and retrieval algorithms to support retrieval of arbitrary XML document components within the Lemur toolkit. We are reasonably pleased with the efficiency of the algorithms for a research system, but we will strive to improve the algorithms and data structures to reduce retrieval times even further.

In our future work, we would like to compare component retrieval to standard document retrieval. We would also like to investigate query expansion using XML document components. Additionally, we would like to explore different ways of setting the λ weights on the nodes' language models, as we believe that words in some components may convey more useful information than words in other components.
10. ACKNOWLEDGMENTS
This work was sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors, and do not necessarily reflect those of the sponsor.

11. REFERENCES
[1] Fuhr, N. and K. Großjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001), ACM Press, 172-180.

[2] Grabs, T. and H.J. Schek. Generating vector spaces on-the-fly for flexible XML retrieval. In Proceedings of the 25th Annual International ACM SIGIR Workshop on XML Information Retrieval (2002), ACM.

[3] Hatano, K., H. Kinutani, M. Yoshikawa, and S. Uemura. Information retrieval system for XML documents. In Proceedings of Database and Expert Systems Applications (DEXA 2002), Springer, 758-767.

[4] Hiemstra, D. Using language models for information retrieval, Ph.D. Thesis (2001), University of Twente.

[5] Hiemstra, D. A database approach to context-based XML retrieval. In [14], 111-118.

[6] Kazai, G., M. Lalmas, and T. Rölleke. A model for the representation and focused retrieval of structured documents based on fuzzy aggregation. In The 8th Symposium on String Processing and Information Retrieval (SPIRE 2001), IEEE, 123-135.

[7] Kazai, G., M. Lalmas, and T. Rölleke. Focussed structured document retrieval. In Proceedings of the 9th Symposium on String Processing and Information Retrieval (SPIRE 2002), Springer, 241-247.

[8] Lafferty, J., and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001), ACM Press, 111-119.

[9] List, J., and A.P. de Vries. CWI at INEX 2002. In [14], 133-140.

[10] Myaeng, S.H., D.H. Jang, M.S. Kim, and Z.C. Zhoo. A flexible model for retrieval of SGML documents. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), ACM Press, 138-145.

[11] Ogilvie, P. and J. Callan. Language models and structured document retrieval. In [14], 33-40.

[12] Ogilvie, P. and J. Callan. Combining structural information and the use of priors in mixed named-page and homepage finding. To appear in Proceedings of the Twelfth Text REtrieval Conference (TREC 2003).

[13] Ponte, J., and W.B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), ACM Press, 275-281.

[14] Proceedings of the First Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), 2003, DELOS.

[15] Westerveld, T., W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, URLs, and anchors. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001), NIST Special Publication 500-250 (2002), 663-672.

[16] Zhai, C. and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001), ACM Press, 334-342.