Hello

I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.

Various search options are needed: some will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So could I have Lucene use query strings to return hit counts for manually entered keywords, and TermEnums in the automated cases?

The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.

Now, I could do much or all of this using hashmaps in Java to work out counts, but if I use Lucene, my question concerns the best way to store the words for counting. To take a single RSS feed, is it wise to have Lucene create a temporary index in memory, and pass the words and hit counts out so other programs can write them to a database?

Or is it better to create a Lucene document per feed and add new feed data to it at polling time, so that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries, which I'm not sure of yet.

Hope this makes sense.

Mr Morgan.

A: 

Hello,

I can recommend you check out Apache Solr. In a nutshell, Solr is a web enabled front end to Lucene that simplifies integration and also provides value added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.

Further, for the word counting feature you are asking about, Solr has a concept of "faceting" which should fit the problem you are describing.

If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/

Andre
One of my issues is that I do not have time to really learn either Lucene or Solr in detail, so I'm tempted towards quick, simple solutions. Hence my comment about using Lucene only to count words; most of the data would then be stored in a database for other parts of the application.
Mr Morgan
A: 

Solr is definitely the way to go, although I would caution against using it with Apache Tomcat on Windows, as the install process is a bloody nightmare. I'm more than happy to guide you through it if you like, as I have it working perfectly now.

You might also consider the full text indexing capabilities of MySQL, which are far easier than Lucene.

Regards

Alan Simes
Thanks for the advice and I am using Tomcat on Windows. I have Java hashmap programs which will do wordcounts from RSS feeds and they're fast. But I've come to Lucene from having heard bad things about fulltext indexing in MySQL (which I'm also using). But I see myself using a hybrid of the two; Lucene to drive indexing of words and counts from RSS feeds which are then written to MySQL (I don't like the idea of caching large Lucene indexes to file system and updating them), and fulltext indexing through MySQL for search queries of tags.
Mr Morgan
@Morgan: Why don't you want to store Lucene indexes? The Lucene index will not take more space than the database created using MySQL. Moreover, retrieval (i.e. searching) is much faster in Lucene. You can also take advantage of the various scoring mechanisms in Lucene to answer user queries better. Unless you have a very specific reason to use MySQL, I think you should use a Lucene index.
athena
+1  A: 

From the description you have given in the question, I think Lucene alone will be sufficient (no need for MySQL or Solr). The Lucene API is also easy to use, and you won't need to change your frontend code.

From every RSS feed, you can create a Document having three fields: title, description and date. The date should preferably be a NumericField. You can then append every document to the Lucene index as the feeds arrive.

How frequently do you want the system to automatically generate the popular terms? For example, do you want to show users the "most popular terms last week", etc.? If so, then you can use the NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can then find the document frequency of each term in the retrieved documents to find the most popular terms. (Do not forget to remove the stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be the stopwords.)
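To make the counting step concrete, here is a rough sketch of the logic in plain Java. It does not use the Lucene API at all; in Lucene the date restriction would be handled by the NumericRangeFilter and the stopword removal by the analyzer, as described above. The names PopularTerms, FeedItem, documentFrequencies and topTerms, and the tiny stopword list, are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only: no Lucene classes are used here.
final class PopularTerms {

    // Hypothetical stand-in for a document with title, description and date fields.
    static final class FeedItem {
        final String title;
        final String description;
        final long publishedMillis; // date kept as a number so it can be range-checked

        FeedItem(String title, String description, long publishedMillis) {
            this.title = title;
            this.description = description;
            this.publishedMillis = publishedMillis;
        }
    }

    // Tiny illustrative stopword list (an assumption); a real one would be much longer.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "in", "to", "is", "for", "on"));

    // Lower-cases the text, splits on anything that is not a letter or digit,
    // drops stopwords, and returns each remaining term once per item.
    static Set<String> distinctTerms(FeedItem item) {
        Set<String> terms = new HashSet<>();
        String text = (item.title + " " + item.description).toLowerCase(Locale.ROOT);
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Document frequency of each term over items dated within [fromMillis, toMillis].
    static Map<String, Integer> documentFrequencies(List<FeedItem> items,
                                                    long fromMillis, long toMillis) {
        Map<String, Integer> df = new HashMap<>();
        for (FeedItem item : items) {
            if (item.publishedMillis < fromMillis || item.publishedMillis > toMillis) {
                continue; // outside the requested date range
            }
            for (String term : distinctTerms(item)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // The n terms with the highest document frequency; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Integer> df, int n) {
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort((x, y) -> {
            int byCount = Integer.compare(df.get(y), df.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        List<FeedItem> items = new ArrayList<>();
        items.add(new FeedItem("Lucene released", "The Lucene index is faster", 1_000L));
        items.add(new FeedItem("Solr and Lucene", "Faceting in Solr", 2_000L));
        items.add(new FeedItem("MySQL news", "Fulltext search in MySQL", 9_000L));

        // Only the first two items fall inside the range 0..5000.
        Map<String, Integer> df = documentFrequencies(items, 0L, 5_000L);
        System.out.println(topTerms(df, 2));
    }
}
```

Note that this counts each term once per item (document frequency), not once per occurrence, so a word repeated many times in a single description does not dominate the ranking.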

athena
@Athena: the issue is open at the moment because I'll be playing with Lucene or MySQL for a few days yet. But your advice is good.
Mr Morgan
@Athena: Thanks.
Mr Morgan