Lucene Lecture at Pisa
Lucene
Doug Cutting
cutting@apache.org
November 24 2004
University of Pisa
Prelude
my background..
please interrupt with questions
blog this talk now so that we can search for it later
(using a Lucene-based blog search engine, of course)
In this course, Paolo and Antonio have presented many techniques.
I present real software that uses many of these techniques.
Lucene is
software library for search
open source
not a complete application
set of java classes
active user and developer communities
widely used, e.g., by IBM and Microsoft.
Lucene Architecture
Lucene lecture at Pisa http://lucene.sourceforge.net/talks/pisa/
Lucene API
org.apache.lucene.document
org.apache.lucene.analysis
org.apache.lucene.index
org.apache.lucene.search
Package: org.apache.lucene.document
A Document is a sequence of Fields.
A Field is a <name, value> pair.
name is the name of the field, e.g., title, body, subject, date, etc.
value is text.
Field values may be stored, indexed or analyzed (and, now, vectored).
Example
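public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();
  // path: stored, indexed and analyzed
  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));
  // modified: stored and indexed as a single un-analyzed term
  doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
                    Store.YES, Index.UN_TOKENIZED));
  // content: a Reader-valued field is indexed and analyzed, but not stored
  doc.add(new Field("content", new FileReader(f)));
  return doc;
}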
Example (continued)
field     stored  indexed  analyzed
path      yes     yes      yes
modified  yes     yes      no
content   no      yes      yes
Package: org.apache.lucene.analysis
An Analyzer is a TokenStream factory.
A TokenStream is an iterator over Tokens.
input is a character iterator (Reader)
A Token is a tuple <text, type, start, length, positionIncrement>
text (e.g., “pisa”).
type (e.g., “word”, “sent”, “para”).
start & length offsets, in characters (e.g., <5,4>)
positionIncrement (normally 1)
standard TokenStream implementations are
Tokenizers, which divide characters into tokens and
TokenFilters, e.g., stop lists, stemmers, etc.
Example
public class ItalianAnalyzer extends Analyzer {
  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});
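The listing above breaks off after the stop-word set. As a library-free illustration of the tokenizer-plus-filters idea, here is a minimal sketch in plain Java. It does not use Lucene's classes: SketchToken and SketchAnalyzer are hypothetical names, the chain (whitespace split, lower-casing, stop list) is an assumed one, and raising positionIncrement when a stop word is dropped is a design choice of this sketch rather than a statement about Lucene's StopFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical token type mirroring <text, start, length, positionIncrement>.
final class SketchToken {
    final String text;
    final int start;              // character offset of the token in the input
    final int length;             // token length in characters
    final int positionIncrement;  // 1 normally; larger if stop words were skipped

    SketchToken(String text, int start, int length, int positionIncrement) {
        this.text = text;
        this.start = start;
        this.length = length;
        this.positionIncrement = positionIncrement;
    }
}

// Hypothetical analyzer: one tokenizer step followed by two filter steps.
final class SketchAnalyzer {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("il", "la", "in"));

    static List<SketchToken> analyze(String input) {
        List<SketchToken> tokens = new ArrayList<>();
        int skipped = 0; // stop words dropped since the last emitted token
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            // Tokenizer step: a token is a maximal run of non-whitespace characters.
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            // Filter step 1: lower-case the token text.
            String text = input.substring(start, i).toLowerCase(Locale.ROOT);
            // Filter step 2: drop stop words, remembering how many were skipped.
            if (STOP_WORDS.contains(text)) {
                skipped++;
            } else {
                tokens.add(new SketchToken(text, start, i - start, skipped + 1));
                skipped = 0;
            }
        }
        return tokens;
    }
}
```

Calling SketchAnalyzer.analyze("La torre in Pisa") yields two tokens, "torre" and "pisa", each with a position increment of 2 because one stop word precedes it.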
Package: org.apache.lucene.index
Term is <fieldName, text>
index maps Term → <df, <docNum, <position>* >*>
e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
new: term vectors!
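The Term → postings mapping can be pictured with a small in-memory structure. The following is a minimal sketch in plain Java, not Lucene's on-disk index: SketchIndex and its methods are hypothetical names, and terms are keyed as "field:text" strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory inverted index: term -> docNum -> positions.
final class SketchIndex {
    // Both levels are sorted maps so terms and document numbers stay ordered.
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    // Record one occurrence of <field, text> in a document at a token position.
    void add(String field, String text, int docNum, int position) {
        postings.computeIfAbsent(field + ":" + text, k -> new TreeMap<>())
                .computeIfAbsent(docNum, k -> new ArrayList<>())
                .add(position);
    }

    // Document frequency: the number of documents containing the term.
    int df(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Render one entry in the <df, <docNum, <position>*>*> notation used above.
    String describe(String term) {
        StringBuilder sb = new StringBuilder("<" + df(term));
        TreeMap<Integer, List<Integer>> docs =
            postings.getOrDefault(term, new TreeMap<>());
        for (Map.Entry<Integer, List<Integer>> e : docs.entrySet()) {
            sb.append(", <").append(e.getKey()).append(", <");
            List<Integer> positions = e.getValue();
            for (int i = 0; i < positions.size(); i++) {
                if (i > 0) sb.append(", ");
                sb.append(positions.get(i));
            }
            sb.append(">>");
        }
        return sb.append(">").toString();
    }
}
```

Adding "pisa" in field "content" at position 14 of document 2 and at positions 2 and 9 of document 4 makes describe("content:pisa") return <2, <2, <14>>, <4, <2, 9>>>, the same entry as in the example line.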
Example
IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
  writer.addDocument(makeDocument(files[i]));
}
writer.close();
incremental algorithm:
maintain a stack of segment indices
create index for each incoming document
push new indexes onto the stack
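The stack discipline can be simulated by tracking segment sizes only. This is a minimal sketch, not Lucene's IndexWriter: SegmentStackSketch is a hypothetical name, and the merge rule (merge whenever the top b segments have the same size) is an assumption of this sketch, since the steps listed above stop at the push. Actual merge policies may differ, so the counts it produces need not match any particular diagram.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation of incremental indexing with a stack of segments.
final class SegmentStackSketch {
    private final int mergeFactor;                            // "b"; must be at least 2
    private final Deque<Integer> stack = new ArrayDeque<>();  // segment sizes, top first
    private int merges = 0;

    SegmentStackSketch(int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        this.mergeFactor = mergeFactor;
    }

    // Index one document: push a one-document segment, then merge while possible.
    void addDocument() {
        stack.push(1);
        // Assumed rule: merge when the top b segments all have the same size.
        while (topSegmentsHaveEqualSize()) {
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += stack.pop();
            }
            stack.push(merged);
            merges++;
        }
    }

    private boolean topSegmentsHaveEqualSize() {
        if (stack.size() < mergeFactor) return false;
        int top = stack.peek();
        int seen = 0;
        for (int size : stack) {          // iterates from the top of the stack down
            if (seen == mergeFactor) break;
            if (size != top) return false;
            seen++;
        }
        return true;
    }

    List<Integer> sizesBottomToTop() {
        List<Integer> sizes = new ArrayList<>(stack);
        Collections.reverse(sizes);
        return sizes;
    }

    int merges() {
        return merges;
    }
}
```

With b=3 and seven calls to addDocument(), the stack holds segments of sizes 3, 3 and 1 after two merges; two more documents trigger a cascade that leaves a single nine-document segment.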
Indexing Diagram
b=3
11 documents indexed
stack has four indexes
grayed indexes have been deleted
5 merges have occurred
Index Compression
For keys in Term -> ... map, use technique from Paolo's slides:
For values in Term -> ... map, use technique from Paolo's slides:
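The referenced slides are not reproduced here. For the keys, Lucene's file-format documentation describes storing each term as the length of the prefix it shares with the previous term plus the remaining suffix. The following is a minimal sketch of that idea in plain Java, assuming that this is the technique meant; FrontCodingSketch and its "length|suffix" string form are hypothetical and chosen only for readability.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shared-prefix ("front") coding of a sorted term list.
final class FrontCodingSketch {
    // Encode each term as "<sharedPrefixLength>|<suffix>" relative to its predecessor.
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(previous.length(), term.length());
            while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(shared + "|" + term.substring(shared));
            previous = term;
        }
        return out;
    }

    // Rebuild the terms by reusing the shared prefix of the previous term.
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (String entry : encoded) {
            int bar = entry.indexOf('|');
            int shared = Integer.parseInt(entry.substring(0, bar));
            String term = previous.substring(0, shared) + entry.substring(bar + 1);
            out.add(term);
            previous = term;
        }
        return out;
    }
}
```

Encoding the sorted terms pisa, pisano, pizza gives 0|pisa, 4|no, 2|zza. The savings grow with the length of the prefixes that neighbouring terms share, which is why the terms must be kept sorted.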
0 00000000
1 00000001
2 00000010
...
127 01111111
...
...
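The table above, in which values up to 127 fit in a single byte with a clear high bit, matches the variable-length integer format described in Lucene's file-format documentation: seven data bits per byte, low-order group first, with the high bit set on every byte except the last. Here is a minimal sketch of that encoding, combined with storing sorted document numbers as gaps so that most values stay small. VIntSketch and its method names are hypothetical, and pairing the two techniques here is an assumption about what the missing slides showed.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical variable-length integer codec plus gap (delta) encoding.
final class VIntSketch {
    // Write 7 bits per byte, low-order group first; high bit means "more bytes follow".
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Read bytes until one with a clear high bit, reassembling the 7-bit groups.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) break;
        }
        return value;
    }

    // Replace each sorted document number by its difference from the previous one.
    static int[] toGaps(int[] sortedDocNums) {
        int[] gaps = new int[sortedDocNums.length];
        int previous = 0;
        for (int i = 0; i < sortedDocNums.length; i++) {
            gaps[i] = sortedDocNums[i] - previous;
            previous = sortedDocNums[i];
        }
        return gaps;
    }
}
```

In this sketch encode(127) produces the single byte 01111111, while encode(128) needs two bytes, 10000000 followed by 00000001. Gaps between neighbouring document numbers are usually far smaller than the numbers themselves, so most of them fit in one byte.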
Package: org.apache.lucene.search
primitive queries:
TermQuery: match docs containing a Term
PhraseQuery: match docs w/ sequence of Terms
BooleanQuery: match docs matching other queries.
e.g., +path:pisa +content:“Doug Cutting” -path:nutch
new: SpanQuery
derived queries:
PrefixQuery, WildcardQuery, etc.
Example
Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));
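To make the + (required) and - (prohibited) clauses of a boolean query concrete, here is a minimal sketch in plain Java that combines per-term document sets. It is an illustration only, not how Lucene's BooleanQuery is implemented: BooleanMatchSketch is a hypothetical name, scoring is ignored, and a query with no required clause matches nothing in this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical boolean matcher over per-term sets of document numbers.
final class BooleanMatchSketch {
    private final Map<String, Set<Integer>> docsByTerm = new HashMap<>();

    // Register the documents that contain a term such as "content:pisa".
    void add(String term, Integer... docNums) {
        docsByTerm.computeIfAbsent(term, k -> new TreeSet<>())
                  .addAll(Arrays.asList(docNums));
    }

    // Documents containing every required term and none of the prohibited ones.
    Set<Integer> match(List<String> required, List<String> prohibited) {
        Set<Integer> result = null;
        for (String term : required) {
            Set<Integer> docs = docsByTerm.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first required clause seeds the result
            } else {
                result.retainAll(docs);         // later ones intersect with it
            }
        }
        if (result == null) {
            return new TreeSet<>();             // no required clause: match nothing
        }
        for (String term : prohibited) {
            result.removeAll(docsByTerm.getOrDefault(term, Collections.emptySet()));
        }
        return result;
    }
}
```

If content:pisa occurs in documents 1, 2 and 4, content:babel in 2, 3 and 4, and path:nutch in 4, then requiring both content terms while prohibiting path:nutch leaves only document 2.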
Search Algorithms
From Paolo's slides:
Algorithm
Scoring
From Paolo's slides:
Thanks!
And there's lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.