ansaurus

Question

Answer 1

+1 A:

It is inverted index, but that does not specify which structure it uses. Index format in lucene has complete information.
Start with 'Summary of File Extensions'.

You'll first notice that it talks about various different indexes. As far as I could notice none of these use strictly speaking a B-tree, but there are similarities - the above structures do resemble trees.

Unreason 2010-04-09 08:36:30

Answer 2

+3 A:

There's a fairly good article here: http://www.ibm.com/developerworks/library/wa-lucene/

In a nutshell, when lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.

Terms are generated using an analyzer which stems each word to its root form. The most popular stemming algorithm for the english language is the Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/

When a query is issued it is processed through the same analyzer that was used to build the index and then used to look up the matching term(s) in the index. That provides a list of documents that match the query.

Darren 2010-04-09 13:22:30

Thanks for your answer and links .But i heard that Lucene project has a special stemmer named "Snowball"?Do you heard anything about that ?

Mehdi Amrollahi 2010-04-09 17:44:11

This is a different question: See http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Flucene-java%2FConceptsAndDefinitionsOther than that, seeing your question pattern I suggest you read the 'Lucene in Action' book: http://www.manning.com/hatcher2/(First edition is a bit dated, but can be found in a dead tree version. The second edition can be bought as an e-book).

Yuval F 2010-04-11 14:01:43

ansaurus

tags:

views:

answers:

How does lucene index documents?

related questions