Hello

I would like to store data retrieved hourly from RSS feeds in a database or in Lucene so that the text can be easily indexed for word counts.

I need to get the text from the title and description elements of RSS items.

Ideally, for each hourly retrieval from a given feed, I would add a row to a table in a dataset made up of the following columns:

feed_url, title_element_text, description_element_text, polling_date_time

From this, I can look up any element in a feed and calculate keyword counts over whatever time window is required.

This can be done with a database table, using hashmaps to calculate the counts. But can Lucene achieve this degree of granularity at all? If so, would each feed form a Lucene document, or would each 'row' from the database table form one?
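The hashmap counting I have in mind would look something like this in plain Java (a rough sketch only; the class name is made up and the whitespace split is a naive stand-in for real tokenization):

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordCounter {
    // term -> total number of occurrences across all polled title/description texts
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Add the text of one title or description element to the running counts.
    // Splitting on whitespace is a naive stand-in for a real analyzer.
    public void addText(String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.length() == 0) {
                continue;
            }
            Integer current = counts.get(token);
            counts.put(token, current == null ? 1 : current + 1);
        }
    }

    public int countOf(String term) {
        Integer c = counts.get(term.toLowerCase());
        return c == null ? 0 : c;
    }
}
```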

Can anyone advise?

Thanks

Martin O'Shea.

A: 

My parsing of your question is:

for each item in feed:
    calculate term frequency of item, then add to feed's frequency list

This is not something that Lucene excels at, so CouchDB or another database might be as good a choice, if not better (as larsmans suggests). However, it can be done in Lucene, in a way that is probably slightly easier than with other databases:

// Lucene 3.x: enumerate every unique term in the index and
// record how many documents each term appears in.
HashMap<String, Integer> terms =
    new HashMap<String, Integer>((int) indexReader.getUniqueTermCount());
TermEnum tEnum = indexReader.terms();
while (tEnum.next())
{
    terms.put(tEnum.term().text(), tEnum.docFreq());
}
tEnum.close();

All Lucene is saving you is the difficulty of calculating the docfreq, and it will probably be a bit faster than looping through all the rows yourself. But I'd be surprised if the performance difference is noticeable for reasonably small data sets.
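To make the comparison concrete, here is roughly what "looping through all the rows yourself" means for document frequency. Each row contributes at most once per term, which matches what docFreq() gives you. This is a plain-Java sketch with a made-up class name and naive whitespace tokenization:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DocFreqCounter {
    // term -> number of rows containing the term at least once
    public static Map<String, Integer> docFreq(List<String> rows) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String row : rows) {
            // A set ensures each row counts at most once per term,
            // matching Lucene's notion of document frequency.
            Set<String> seen =
                new HashSet<String>(Arrays.asList(row.toLowerCase().split("\\s+")));
            for (String term : seen) {
                Integer current = freq.get(term);
                freq.put(term, current == null ? 1 : current + 1);
            }
        }
        return freq;
    }
}
```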

Xodarap