Lucene Lecture at Pisa

Lucene lecture at Pisa http://lucene.sourceforge.net/talks/pisa/
Lucene
Doug Cutting
cutting@apache.org
November 24 2004
University of Pisa
Prelude
my background..
please interrupt with questions
blog this talk now so that we can search for it later
(using a Lucene-based blog search engine, of course)
In this course, Paolo and Antonio have presented many techniques.
I present real software that uses many of these techniques.
Lucene is
software library for search
open source
not a complete application
set of java classes
active user and developer communities
widely used, e.g., by IBM and Microsoft.
Lucene Architecture
[draw on whiteboard for reference throughout talk]
Lucene API
org.apache.lucene.document
org.apache.lucene.analysis
org.apache.lucene.index
org.apache.lucene.search
Package: org.apache.lucene.document
A Document is a sequence of Fields.
A Field is a <name, value> pair.
name is the name of the field, e.g., title, body, subject, date, etc.
value is text.
Field values may be stored, indexed or analyzed (and, now, vectored).
Example
// Store and Index are nested classes of Field; Resolution is nested in DateTools.
public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();
  // path: stored, indexed and analyzed
  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));
  // modified: stored and indexed as a single un-analyzed term
  doc.add(new Field("modified",
      DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
      Store.YES, Index.UN_TOKENIZED));
  // Reader implies Store.NO and Index.TOKENIZED
  doc.add(new Field("contents", new FileReader(f)));
  return doc;
}
Example (continued)
field     stored  indexed  analyzed
path      yes     yes      yes
modified  yes     yes      no
contents  no      yes      yes
Package: org.apache.lucene.analysis
An Analyzer is a TokenStream factory.
A TokenStream is an iterator over Tokens.
input is a character iterator (Reader)
A Token is a tuple <text, type, start, length, positionIncrement>
text (e.g., “pisa”).
type (e.g., “word”, “sent”, “para”).
start & length offsets, in characters (e.g., <5,4>)
positionIncrement (normally 1)
standard TokenStream implementations are
Tokenizers, which divide characters into tokens and
TokenFilters, e.g., stop lists, stemmers, etc.
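The token tuple above can be illustrated without Lucene itself. The following is a minimal sketch, assuming a whitespace tokenizer followed by lower-casing; `TokenSketch` and `Tok` are hypothetical names for illustration, not Lucene classes, and the type field is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenSketch {
    /** Hypothetical stand-in for a token: text, character offsets, position increment. */
    static final class Tok {
        final String text;
        final int start;
        final int length;
        final int positionIncrement;

        Tok(String text, int start, int length, int positionIncrement) {
            this.text = text;
            this.start = start;
            this.length = length;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Splits on whitespace and lower-cases each token, recording offsets into the input. */
    static List<Tok> tokenize(String input) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip the whitespace run before the next token.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            // Consume the token itself.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) {
                String text = input.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Tok(text, start, i - start, 1));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("Ciao Pisa")) {
            System.out.println(t.text + " <" + t.start + "," + t.length + "> +" + t.positionIncrement);
        }
    }
}
```

For the input "Ciao Pisa" the second token is "pisa" with offsets <5,4>, matching the offsets example in the list above.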
Example
public class ItalianAnalyzer extends Analyzer {
  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // split on whitespace, lower-case, drop stop words, then stem
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    result = new SnowballFilter(result, "Italian");
    return result;
  }
}
Package: org.apache.lucene.index
Term is <fieldName, text>
index maps Term → <df, <docNum, <position>* >*>
e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
new: term vectors!
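To make the Term → <df, <docNum, <position>*>*> mapping concrete, here is a minimal in-memory sketch. It is not Lucene's on-disk format; `TinyIndex` and its methods are hypothetical names, and documents are assumed to arrive already tokenized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class TinyIndex {
    // term ("field:text") -> docNum -> positions of the term in that doc
    private final Map<String, SortedMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Records every token of one field of one document, with its position. */
    void addDocument(int docNum, String field, String[] tokens) {
        for (int position = 0; position < tokens.length; position++) {
            String term = field + ":" + tokens[position];
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docNum, d -> new ArrayList<>())
                    .add(position);
        }
    }

    /** df: the number of documents containing the term. */
    int docFreq(String term) {
        SortedMap<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** Positions of the term within one document (empty if absent). */
    List<Integer> positions(String term, int docNum) {
        SortedMap<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docNum)) return new ArrayList<>();
        return docs.get(docNum);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument(2, "content", new String[] {"torre", "di", "pisa"});
        index.addDocument(4, "content", new String[] {"pisa", "e", "pisa"});
        System.out.println(index.docFreq("content:pisa"));
        System.out.println(index.positions("content:pisa", 4));
    }
}
```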
Example
IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
  writer.addDocument(makeDocument(files[i]));
}
writer.close();
Some Inverted Index Strategies

1. batch-based: use file-sorting algorithms (textbook)
+ fastest to build
+ fastest to search
- slow to update
2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
+ fast to search
- update/build does not scale
- complex implementation
3. segment based: lots of small indexes (Verity)
+ fast to build
+ fast to update
- slower to search
Lucene's Index Algorithm

two basic algorithms:
1. make an index for a single document
2. merge a set of indices
incremental algorithm:
maintain a stack of segment indices
create index for each incoming document
push new indexes onto the stack
let b=10 be the merge factor; M=∞
for (size = 1; size < M; size *= b) {
  if (there are b indexes with size docs on top of the stack) {
    pop them off the stack;
    merge them into a single index;
    push the merged index onto the stack;
  } else {
    break;
  }
}
optimization: single-doc indexes kept in RAM, saves system calls

notes:
average b*log_b(N)/2 indexes
N=1M, b=2 gives just 20 indexes
fast to update and not too slow to search
batch indexing w/ M=∞, merge all at end
equivalent to external merge sort, optimal
segment indexing w/ M<∞
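The incremental algorithm can be simulated by tracking only segment sizes. This is a sketch under simplifying assumptions: each segment is represented by its document count, merging is modelled as adding counts, and `SegmentStack` is a hypothetical name rather than a Lucene class.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SegmentStack {
    private final Deque<Long> stack = new ArrayDeque<>(); // head = top of stack
    private final int b;               // merge factor
    private final long maxMergeDocs;   // M; Long.MAX_VALUE stands in for infinity
    private int merges = 0;

    SegmentStack(int mergeFactor, long maxMergeDocs) {
        if (mergeFactor < 2) throw new IllegalArgumentException("merge factor must be >= 2");
        this.b = mergeFactor;
        this.maxMergeDocs = maxMergeDocs;
    }

    /** Pushes a single-document index, then merges as long as b equal-sized indexes are on top. */
    void addDocument() {
        stack.push(1L);
        for (long size = 1; size < maxMergeDocs; size *= b) {
            if (!topHasFullRun(size)) break;
            long merged = 0;
            for (int i = 0; i < b; i++) merged += stack.pop();
            stack.push(merged);
            merges++;
        }
    }

    /** True if the top b entries of the stack all hold exactly `size` docs. */
    private boolean topHasFullRun(long size) {
        int count = 0;
        for (long s : stack) { // iterates from the top of the stack downwards
            if (s != size) break;
            if (++count == b) return true;
        }
        return false;
    }

    int segmentCount() { return stack.size(); }
    int mergeCount() { return merges; }

    public static void main(String[] args) {
        SegmentStack s = new SegmentStack(10, Long.MAX_VALUE);
        for (int i = 0; i < 1000; i++) s.addDocument();
        System.out.println(s.segmentCount() + " segments after " + s.mergeCount() + " merges");
    }
}
```

With b=2 and seven documents, the simulated stack ends up holding segments of 4, 2 and 1 documents, so the segment count follows the base-b digits of N.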
Indexing Diagram
b=3
11 documents indexed
stack has four indexes
grayed indexes have been deleted
5 merges have occurred
Index Compression
For keys in Term -> ... map, use technique from Paolo's slides:
For values in Term -> ... map, use technique from Paolo's slides:
VInt Encoding Example
Value   First byte  Second byte  Third byte
0       00000000
1       00000001
2       00000010
...
127     01111111
128     10000000    00000001
129     10000001    00000001
130     10000010    00000001
...
16,383  11111111    01111111
16,384  10000000    10000000     00000001
16,385  10000001    10000000     00000001
...
This provides compression while still being efficient to decode.
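The table corresponds to a variable-length encoding that stores seven bits per byte, low-order bits first, with the high bit set on every byte except the last. A minimal sketch of an encoder and decoder for non-negative ints follows; `VIntSketch` is a hypothetical helper, not Lucene's own I/O code.

```java
import java.io.ByteArrayOutputStream;

class VIntSketch {
    /** Encodes a non-negative int, seven bits per byte, low-order group first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte has its high bit clear
        return out.toByteArray();
    }

    /** Decodes one value from the start of the array. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break; // high bit clear: last byte
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16383, 16384}) {
            StringBuilder bits = new StringBuilder();
            for (byte b : encode(v)) {
                String s = Integer.toBinaryString(b & 0xFF);
                bits.append("00000000".substring(s.length())).append(s).append(' ');
            }
            System.out.println(v + " -> " + bits.toString().trim());
        }
    }
}
```

Decoding needs only a mask, a shift and one branch per byte, which is why the scheme stays cheap to read.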
Package: org.apache.lucene.search
primitive queries:
TermQuery: match docs containing a Term
PhraseQuery: match docs w/ sequence of Terms
BooleanQuery: match docs matching other queries.
e.g., +path:pisa +content:“Doug Cutting” -path:nutch
new: SpanQuery
derived queries:
PrefixQuery, WildcardQuery, etc.
Example
Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);
Search Algorithms
From Paolo's slides:
Lucene's Disjunctive Search Algorithm

described in http://lucene.sf.net/papers/riao97.ps
since all postings must be processed
goal is to minimize per-posting computation
merges postings through a fixed-size array of accumulator buckets
performs boolean logic with bit masks
scales well with large queries
[ draw a diagram to illustrate? ]
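The bucket idea can be sketched as follows. This is a simplified illustration, not Lucene's actual scorer: it ignores scores, uses a deliberately tiny window, and assumes each clause's postings are a sorted array of doc numbers below maxDoc. `BucketSketch`, `WINDOW` and the mask parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BucketSketch {
    static final int WINDOW = 8; // assumption: tiny bucket array, for illustration only

    /**
     * postings[c] is the sorted doc list of clause c. Bit c of requiredMask or
     * prohibitedMask marks clause c as required or prohibited.
     */
    static List<Integer> search(int[][] postings, int requiredMask, int prohibitedMask, int maxDoc) {
        List<Integer> hits = new ArrayList<>();
        int[] bits = new int[WINDOW];            // one bucket per doc in the current window
        int[] cursor = new int[postings.length]; // read position in each clause's postings
        for (int base = 0; base < maxDoc; base += WINDOW) {
            Arrays.fill(bits, 0);
            int end = base + WINDOW;
            // Per-posting work is one array index and one OR.
            for (int c = 0; c < postings.length; c++) {
                int[] docs = postings[c];
                while (cursor[c] < docs.length && docs[cursor[c]] < end) {
                    bits[docs[cursor[c]] - base] |= 1 << c;
                    cursor[c]++;
                }
            }
            // Boolean logic is applied once per bucket with two mask tests.
            for (int i = 0; i < WINDOW; i++) {
                int seen = bits[i];
                if (seen != 0
                        && (seen & requiredMask) == requiredMask
                        && (seen & prohibitedMask) == 0) {
                    hits.add(base + i);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[][] postings = {
            {1, 3, 9, 12}, // clause 0: required
            {3, 4, 12},    // clause 1: optional
            {9, 12}        // clause 2: prohibited
        };
        System.out.println(search(postings, 0b001, 0b100, 16));
    }
}
```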
Lucene's Conjunctive Search Algorithm

Algorithm
use linked list of pointers to doc list
initially sorted by doc
loop
  if all are at same doc, record hit
  skip first to-or-past last and move to end of list
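The loop above can be sketched over in-memory sorted arrays. This is an illustration under assumptions: `ConjunctionSketch` and `Cursor` are hypothetical names, and skipping is a linear scan here where a real index would use a faster skip.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;

class ConjunctionSketch {
    static final int NO_MORE = Integer.MAX_VALUE; // sentinel for an exhausted list

    /** Pointer into one sorted doc list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) { this.docs = docs; }

        int doc() { return pos < docs.length ? docs[pos] : NO_MORE; }

        /** Advances to the first doc that is at or past target. */
        void skipTo(int target) { while (doc() < target) pos++; }
    }

    static List<Integer> intersect(int[][] postings) {
        List<Integer> hits = new ArrayList<>();
        LinkedList<Cursor> list = new LinkedList<>();
        for (int[] p : postings) list.add(new Cursor(p));
        if (list.isEmpty()) return hits;
        list.sort(Comparator.comparingInt(Cursor::doc)); // initially sorted by doc
        while (list.getLast().doc() != NO_MORE) {
            Cursor first = list.getFirst();
            Cursor last = list.getLast();
            if (first.doc() == last.doc()) {
                // The list is sorted, so first == last means all cursors agree.
                hits.add(first.doc());
                first.skipTo(first.doc() + 1);
            } else {
                first.skipTo(last.doc()); // skip first to-or-past last
            }
            list.addLast(list.removeFirst()); // it now holds the largest doc
        }
        return hits;
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 3, 5, 9, 12}, {3, 4, 9, 12, 20}, {2, 3, 9, 10, 12} };
        System.out.println(intersect(postings));
    }
}
```

Moving the skipped cursor to the end keeps the list ordered without re-sorting, since it is now at or past every other cursor.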
Scoring
Is very much like Lucene's Similarity.
Lucene's Phrase Scoring

approximate phrase IDF with sum of terms
compute actual tf of phrase
slop penalizes slight mismatches by edit-distance
Thanks!
And there's lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.
Finally, search for this talk on Technorati.