Lucene (Doug Cutting November 24 2004 University of Pisa)[reference]

 

Lucene

 

Doug Cutting
cutting@apache.org

November 24, 2004
University of Pisa

 

 

Prelude

  • my background…
  • please interrupt with questions
  • blog this talk now so that we can search for it later
  • (using a Lucene-based blog search engine, of course)
  • In this course, Paolo and Antonio have presented many techniques.
  • I present real software that uses many of these techniques.

 

Lucene is

 

Lucene Architecture
[draw on whiteboard for reference throughout talk]

Lucene API

  • org.apache.lucene.document
  • org.apache.lucene.analysis
  • org.apache.lucene.index
  • org.apache.lucene.search

 

 

Package: org.apache.lucene.document

  • A Document is a sequence of Fields.
  • A Field is a <name, value> pair.

name is the name of the field, e.g., title, body, subject, date, etc.

value is text.

  • Field values may be stored, indexed or analyzed (and, now, vectored).

 

Example

public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();

  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

  doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
                    Store.YES, Index.UN_TOKENIZED));

  // Reader implies Store.NO and Index.TOKENIZED
  doc.add(new Field("contents", new FileReader(f)));

  return doc;
}

 

 

Example (continued)

field      stored   indexed   analyzed
path       yes      yes       yes
modified   yes      yes       no
contents   no       yes       yes

 

 

Package: org.apache.lucene.analysis

  • An Analyzer is a TokenStream factory.
  • A TokenStream is an iterator over Tokens.

input is a character iterator (Reader)

  • A Token is a tuple <text, type, start, length, positionIncrement>.

text (e.g., "pisa")

type (e.g., "word", "sent", "para")

start & length offsets, in characters (e.g., <5,4>)

positionIncrement (normally 1)

  • standard TokenStream implementations are

Tokenizers, which divide characters into tokens, and

TokenFilters, e.g., stop lists, stemmers, etc.

 

Example

public class ItalianAnalyzer extends Analyzer {

  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    result = new SnowballFilter(result, "Italian");
    return result;
  }
}

 

Package: org.apache.lucene.index

  • a Term is <fieldName, text>
  • the index maps Term → <df, <docNum, <position>*>*>
  • e.g., "content:pisa" → <2, <2, <14>>, <4, <2, 9>>>
  • new: term vectors!
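As an illustration of the mapping above, here is a minimal in-memory sketch. The class and method names are hypothetical; Lucene's real index is a compressed, sorted on-disk structure, not a hash map.

```java
import java.util.*;

// Minimal in-memory sketch of the Term -> <df, <docNum, <position>*>*> map.
// A <field, text> pair maps to a posting list: for each matching document,
// the positions at which the term occurs.
public class TinyIndex {
    // key is "field:text"; value maps docNum -> positions
    private final Map<String, SortedMap<Integer, List<Integer>>> terms = new HashMap<>();

    public void add(String field, String text, int doc, int position) {
        terms.computeIfAbsent(field + ":" + text, k -> new TreeMap<>())
             .computeIfAbsent(doc, d -> new ArrayList<>())
             .add(position);
    }

    // document frequency: number of documents containing the term
    public int df(String field, String text) {
        SortedMap<Integer, List<Integer>> p = terms.get(field + ":" + text);
        return p == null ? 0 : p.size();
    }

    public SortedMap<Integer, List<Integer>> postings(String field, String text) {
        return terms.getOrDefault(field + ":" + text, new TreeMap<>());
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        // reproduces the slide's example: "content:pisa" -> <2, <2,<14>>, <4,<2,9>>>
        idx.add("content", "pisa", 2, 14);
        idx.add("content", "pisa", 4, 2);
        idx.add("content", "pisa", 4, 9);
        System.out.println(idx.df("content", "pisa"));       // df = 2
        System.out.println(idx.postings("content", "pisa")); // {2=[14], 4=[2, 9]}
    }
}
```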

 

Example

IndexWriter?writer?=?new?IndexWriter(“index”,?new?ItalianAnalyzer());
File[]?files?=?directory.listFiles();
for?(int?i?=?0;?i?<?files.length;?i++)?{
writer.addDocument(makeDocument(files[i]));
}
writer.close();

 

Some Inverted Index Strategies

1. batch-based: use file-sorting algorithms (textbook)

   + fastest to build
   + fastest to search
   - slow to update

2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)

   + fast to search
   - update/build does not scale
   - complex implementation

3. segment based: lots of small indexes (Verity)

   + fast to build
   + fast to update
   - slower to search

 

Lucene’s Index Algorithm

  • two basic algorithms:

make an index for a single document

merge a set of indexes

  • incremental algorithm:

maintain a stack of segment indexes

create an index for each incoming document

push new indexes onto the stack

let b=10 be the merge factor; M=∞

for (size = 1; size < M; size *= b) {
  if (there are b indexes with size docs on top of the stack) {
    pop them off the stack;
    merge them into a single index;
    push the merged index onto the stack;
  } else {
    break;
  }
}

  • optimization: single-doc indexes are kept in RAM, which saves system calls
  • notes:

average of b·logb(N)/2 indexes

N=1M, b=2 gives just 20 indexes

fast to update and not too slow to search

batch indexing: w/ M=∞, merge all at the end

equivalent to external merge sort, optimal

segment indexing: w/ M<∞
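The incremental algorithm above can be simulated with a stack of segment sizes. This is only a sketch with hypothetical names: real Lucene merges actual index files, while this tracks document counts.

```java
import java.util.*;

// Simulation of the merge policy: segment sizes live on a stack; whenever
// b segments of the same size accumulate on top, they are merged into one
// segment b times as large (M is taken as infinity here).
public class MergeSim {
    static Deque<Integer> stack = new ArrayDeque<>();
    static int b = 10; // merge factor from the slides

    static void addDoc() {
        stack.push(1); // a new single-document index
        for (long size = 1; ; size *= b) {
            if (!topHas(b, size)) break;
            for (int i = 0; i < b; i++) stack.pop(); // pop the b equal-size indexes
            stack.push((int) (size * b));            // push the merged index
        }
    }

    // true if the top n stack entries all have the given size
    static boolean topHas(int n, long size) {
        int count = 0;
        for (int s : stack) {         // ArrayDeque iterates top-down
            if (s != size) break;
            if (++count == n) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 57; i++) addDoc();
        // 57 docs with b=10: five size-10 segments plus seven singletons
        System.out.println(stack);
    }
}
```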

 

Indexing Diagram

  • b = 3
  • 11 documents indexed
  • stack has four indexes
  • grayed indexes have been deleted
  • 5 merges have occurred

 

 

Index Compression

For keys in the Term -> … map, use the technique from Paolo’s slides:

For values in the Term -> … map, use the technique from Paolo’s slides:

 

VInt Encoding Example

Value    First byte   Second byte   Third byte
0        00000000
1        00000001
2        00000010
127      01111111
128      10000000     00000001
129      10000001     00000001
130      10000010     00000001
16,383   11111111     01111111
16,384   10000000     10000000      00000001
16,385   10000001     10000000      00000001

This provides compression while still being efficient to decode.
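The table above can be reproduced with a few lines of Java. This is a sketch of the encoding rule (seven low-order value bits per byte, high bit set when more bytes follow); the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// VInt: variable-length integer encoding, low-order groups of 7 bits first.
public class VInt {
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {        // more than 7 bits remain
            out.write((value & 0x7F) | 0x80); // low 7 bits + continuation flag
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    static int decode(byte[] bytes) {
        int value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift; // strip the flag bit, reassemble
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // matches the table's row: 16,384 -> 10000000 10000000 00000001
        for (byte b : encode(16384))
            System.out.print(Integer.toBinaryString((b & 0xFF) | 0x100).substring(1) + " ");
        System.out.println();
        System.out.println(decode(encode(129))); // 129
    }
}
```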

 

 

 

Package: org.apache.lucene.search

  • primitive queries:

TermQuery: match docs containing a Term

PhraseQuery: match docs w/ a sequence of Terms

BooleanQuery: match docs matching other queries,
e.g., +path:pisa +content:"Doug Cutting" -path:nutch

  • new: SpanQuery
  • derived queries:

PrefixQuery, WildcardQuery, etc.

 

Example

Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);

 

Search Algorithms

From Paolo’s slides:

Lucene’s Disjunctive Search Algorithm

goal is to minimize per-posting computation

  • merges postings through a fixed-size array of accumulator buckets
  • performs boolean logic with bit masks
  • scales well with large queries
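A simplified version of the bucket strategy can be sketched as follows. All names are hypothetical, and this is a reduction: Lucene's BooleanScorer additionally tracks required/prohibited clauses with per-bucket bit masks, which are omitted here.

```java
import java.util.*;

// Disjunctive (OR) scoring through accumulator buckets: docs are processed
// in fixed-size windows; each posting adds its term's weight into
// bucket[doc - base], so no heap or per-doc merge of posting lists is needed.
public class BucketScorer {
    static final int WINDOW = 1024;

    // postings[t]: sorted doc ids matching query term t; weights[t]: its score
    static Map<Integer, Float> score(int[][] postings, float[] weights, int maxDoc) {
        Map<Integer, Float> hits = new TreeMap<>();
        float[] bucket = new float[WINDOW];
        int[] cursor = new int[postings.length]; // per-term read position
        for (int base = 0; base < maxDoc; base += WINDOW) {
            Arrays.fill(bucket, 0f);
            // drain every posting list up to the end of this window
            for (int t = 0; t < postings.length; t++)
                while (cursor[t] < postings[t].length
                        && postings[t][cursor[t]] < base + WINDOW)
                    bucket[postings[t][cursor[t]++] - base] += weights[t];
            for (int i = 0; i < WINDOW; i++)
                if (bucket[i] > 0f) hits.put(base + i, bucket[i]);
        }
        return hits;
    }

    public static void main(String[] args) {
        int[][] postings = { {2, 4, 7}, {4, 9} };
        float[] weights = { 1.0f, 2.0f };
        // doc 4 matches both terms and accumulates both weights
        System.out.println(score(postings, weights, 16));
    }
}
```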

 

[draw a diagram to illustrate]

 

Lucene’s Conjunctive Search Algorithm

From Paolo’s slides:

Algorithm

  • use a linked list of pointers to doc lists
  • initially sorted by doc
  • loop:

if all are at the same doc, record a hit

else skip the first to-or-past the last, and move it to the end of the list
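The loop above can be sketched as follows. Names are hypothetical; a round-robin cursor plays the role of moving a list to the end, and all posting lists are assumed non-empty and sorted.

```java
import java.util.*;

// Conjunctive (AND) intersection: repeatedly advance the cursor that is
// furthest behind to or past the current candidate doc; when all cursors
// agree on a doc, it matches every term.
public class ConjunctiveSearch {
    static List<Integer> intersect(int[][] lists) {
        List<Integer> hits = new ArrayList<>();
        int n = lists.length;
        int[] pos = new int[n];          // current read position per list
        int target = lists[0][0];        // candidate doc id
        int agreed = 0;                  // lists currently sitting on target
        int i = 0;                       // round-robin over the lists
        while (true) {
            // skip list i forward to or past the target
            while (pos[i] < lists[i].length && lists[i][pos[i]] < target) pos[i]++;
            if (pos[i] == lists[i].length) break; // a list is exhausted: done
            if (lists[i][pos[i]] == target) {
                if (++agreed == n) {     // every list at the same doc: a hit
                    hits.add(target);
                    target++;            // look for the next match
                    agreed = 0;
                }
            } else {
                target = lists[i][pos[i]]; // overshoot becomes the new candidate
                agreed = 1;
            }
            i = (i + 1) % n;
        }
        return hits;
    }

    public static void main(String[] args) {
        int[][] lists = { {1, 3, 5, 7}, {3, 4, 5, 8}, {2, 3, 5, 9} };
        System.out.println(intersect(lists)); // [3, 5]
    }
}
```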

 

Scoring

From Paolo’s slides:

This is very much like Lucene’s Similarity.

 

Lucene’s Phrase Scoring

  • approximate the phrase IDF with the sum of its terms’ IDFs
  • compute the actual tf of the phrase
  • slop penalizes slight mismatches by edit distance
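A minimal sketch of the last point, assuming the 1/(distance+1) weighting of Lucene's default sloppy-frequency formula, where distance measures how far the matched positions deviate from an exact phrase. Names are hypothetical, and the real scorer additionally enforces a slop limit and sums these weights into the phrase tf.

```java
// Slop-based weight for one candidate phrase occurrence.
public class SloppyWeight {
    // termPositions[i] is the position of the phrase's i-th term in the doc.
    // Subtracting the term's offset i aligns an exact match to one position,
    // so the spread of adjusted positions is the edit distance from the phrase.
    static float matchWeight(int[] termPositions) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (int i = 0; i < termPositions.length; i++) {
            int adjusted = termPositions[i] - i;
            min = Math.min(min, adjusted);
            max = Math.max(max, adjusted);
        }
        int distance = max - min; // 0 for an exact occurrence
        return 1f / (distance + 1);
    }

    public static void main(String[] args) {
        // "leaning tower" at adjacent positions: exact match
        System.out.println(matchWeight(new int[] {5, 6})); // 1.0
        // one extra word in between: penalized
        System.out.println(matchWeight(new int[] {5, 7})); // 0.5
    }
}
```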

 

Thanks!

And there’s lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

Finally, search for this talk on Technorati.
