I'm looking for documentation on how information retrieval systems (e.g., Lucene) store their indexes for speedy "relevancy" lookups. My Google-fu is failing me: I've found a page that describes Lucene's file format, but it focuses more on how many bits each number takes than on how the database is used to produce speedy query results.

Surely someone has some useful bookmarks lying around that they can refer me to.

Thanks!

+2  A: 

The Lucene index is an inverted index, so any search on this topic should be relevant, like:

Pascal Dimassimo
True, it's an inverted index, but if I have a 10-term query, is Lucene really looking up each term in the inverted index, intersecting the results, and ranking them?
jemfinch
In essence, yes. If you look at the Lucene scoring formula (http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/Similarity.html), you'll see that each query term is used to build a vector that is then used to search the index.
Pascal Dimassimo
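
To make the "look up each term, intersect, rank" idea from the comments concrete, here is a toy sketch in Python. It is not Lucene's implementation: the names `build_index` and `search` are made up for illustration, the AND-style intersection is an assumption about the query semantics, and the TF-IDF weighting is a simplified stand-in for the real scoring formula linked above.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Return doc ids containing every query term, best score first."""
    terms = query.lower().split()
    # One posting list per query term; unknown terms get an empty list.
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []

    # Intersect starting from the shortest posting list to keep the
    # candidate set small.
    postings.sort(key=len)
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= posting.keys()

    # Simplified TF-IDF: sum over terms of tf * log(1 + N / df).
    scores = {
        doc_id: sum(p[doc_id] * math.log(1 + num_docs / len(p)) for p in postings)
        for doc_id in candidates
    }
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    1: "lucene stores an inverted index",
    2: "an inverted index maps terms to documents",
    3: "index index inverted lookups are fast",
}
index = build_index(docs)
results = search(index, len(docs), "inverted index")
```

The point is that each term lookup is a single dictionary access, and only documents that appear in the posting lists are ever scored, so the cost depends on the posting list sizes rather than on the size of the whole collection.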