views: 1089
answers: 3
Is there a known formula I can use to estimate the size of a new Lucene index? I know how many fields I want indexed and the size of each field, and I know how many items will be indexed. Once these are processed by Lucene, how does that translate into bytes?

A: 

I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same term should be much smaller than an index of 10,000 wholly unique terms).

Also, there's probably a small dependency on whether you're using Term Vectors, and certainly on whether you're storing fields. Can you provide more details? Can you analyze the term frequency of your source data?
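If it helps, here is a minimal sketch of what such an analysis might look like, assuming a reasonably recent Lucene (package locations have moved between versions). It runs sample text through the same Analyzer you plan to index with and compares total versus unique terms:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TermStats {
        public static void main(String[] args) throws Exception {
            // Stand-in sample text; replace with a slice of your real data.
            String sample = "the quick brown fox jumps over the lazy dog the fox";
            try (Analyzer analyzer = new StandardAnalyzer()) {
                Set<String> unique = new HashSet<>();
                long total = 0;
                try (TokenStream ts = analyzer.tokenStream("body", sample)) {
                    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                    ts.reset();
                    while (ts.incrementToken()) {
                        total++;
                        unique.add(term.toString());
                    }
                    ts.end();
                }
                // A high unique/total ratio means more distinct postings,
                // and therefore a larger term dictionary in the index.
                System.out.printf("total=%d unique=%d ratio=%.2f%n",
                        total, unique.size(), (double) unique.size() / total);
            }
        }
    }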

Bob King
A: 

Here is the Lucene index format documentation. The major file is the compound index (.cfs file). If you have term statistics, you can probably get an estimate for the .cfs file size. Note that this varies greatly based on the Analyzer you use and on the field types you define.
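As a rough aid, here is a small sketch using the standard Directory API (the command-line argument is a placeholder for your index path) that lists the on-disk size of each file in an existing index, which is a handy way to see where the bytes actually go:

    import java.nio.file.Paths;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexFileSizes {
        public static void main(String[] args) throws Exception {
            // args[0] is the path to an existing Lucene index directory.
            try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
                long total = 0;
                for (String name : dir.listAll()) {
                    long len = dir.fileLength(name);
                    total += len;
                    System.out.printf("%-20s %,12d bytes%n", name, len);
                }
                System.out.printf("%-20s %,12d bytes%n", "TOTAL", total);
            }
        }
    }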

Yuval F
A: 

The index stores each "token" or text field etc. only once, so the size depends on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample, index it, and use that to extrapolate out for the complete source collection. However, the ratio of index size to source size decreases as more documents are added, since most words are already in the index, so you might want to make the sample a decent percentage of the original.
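A minimal sketch of that sample-and-extrapolate approach, assuming a recent Lucene; sampleDocs() and the totalDocs figure are hypothetical placeholders for your own data:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SampleIndexEstimate {
        public static void main(String[] args) throws Exception {
            Path indexPath = Files.createTempDirectory("lucene-sample");
            long sampleCount = 0;
            try (Directory dir = FSDirectory.open(indexPath);
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                for (String text : sampleDocs()) {     // hypothetical data source
                    Document doc = new Document();
                    doc.add(new TextField("body", text, Field.Store.YES));
                    writer.addDocument(doc);
                    sampleCount++;
                }
                writer.forceMerge(1);                  // one segment, stable size
            }
            // Sum the on-disk size of the sample index.
            long sampleBytes = 0;
            try (Directory dir = FSDirectory.open(indexPath)) {
                for (String name : dir.listAll()) {
                    sampleBytes += dir.fileLength(name);
                }
            }
            long totalDocs = 1_000_000;                // your full collection size
            // Linear extrapolation; treat it as an upper bound, since the
            // unique-term dictionary grows sublinearly with document count.
            System.out.printf("sample: %d docs, %d bytes -> estimate ~%d bytes%n",
                    sampleCount, sampleBytes, sampleBytes * totalDocs / sampleCount);
        }

        private static Iterable<String> sampleDocs() {
            return java.util.List.of("first sample document", "second sample document");
        }
    }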

alchemical