views:

396

answers:

2

Hello,

I read some document about Lucene; also I read the document in this link (http://lucene.sourceforge.net/talks/pisa).

I don't really understand how Lucene indexes documents and don't understand which algorithms Lucene uses for indexing?

On the above link, it says Lucene uses this algorithm for indexing:

  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=8

for (size = 1; size < M; size *= b) {
    if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
    } else {
        break;
    }
}

How does this algorithm provide optimized indexing?

Does Lucene use B-tree algorithm or any other algorithm like that for indexing - or does it have a particular algorithm?

Thank you for reading my post.

+1  A: 

It is inverted index, but that does not specify which structure it uses. Index format in lucene has complete information.
Start with 'Summary of File Extensions'.

You'll first notice that it talks about various different indexes. As far as I could notice none of these use strictly speaking a B-tree, but there are similarities - the above structures do resemble trees.

Unreason
+3  A: 

There's a fairly good article here: http://www.ibm.com/developerworks/library/wa-lucene/

In a nutshell, when lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.

Terms are generated using an analyzer which stems each word to its root form. The most popular stemming algorithm for the english language is the Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/

When a query is issued it is processed through the same analyzer that was used to build the index and then used to look up the matching term(s) in the index. That provides a list of documents that match the query.

Darren
Thanks for your answer and links .But i heard that Lucene project has a special stemmer named "Snowball"?Do you heard anything about that ?
Mehdi Amrollahi
This is a different question: See http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Flucene-java%2FConceptsAndDefinitionsOther than that, seeing your question pattern I suggest you read the 'Lucene in Action' book: http://www.manning.com/hatcher2/(First edition is a bit dated, but can be found in a dead tree version. The second edition can be bought as an e-book).
Yuval F