Hi all, I am looking for a keyword indexing library for Java. A Google search turned up Lucene. It seems very popular, but I am wondering whether it is also the best indexing library in terms of speed (this is subjective, of course, but your opinion should be good enough for a beginner like me). Is the example at http://snippets.dzone.com/posts/show/4020 good enough, or do you have a better recommendation? Thanks in advance.
We tested Lucene (the .NET version) against MSSQL's Full Text Search. The comparison is rather difficult, since the two systems index in ways that are not directly comparable, but we did it for a well-defined task: index products with multiple text fields (so that fields carry different weights in the search results) and let users search those products.
Lucene won because we had full control over how queries are composed, could decide which indexes live in memory and which are stored on the filesystem, and were not restricted by language packs (MSSQL FTS supports only a limited list of languages). Lucene also let us use a non-static noise-word dictionary (we used a different set of noise words for different product categories).
So it is hard to talk about pure performance, but Lucene's rich functionality opens up many ways to optimize.
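To illustrate what a per-category noise-word dictionary means, here is a minimal sketch in plain Java. It does not use Lucene's API; the class name `CategoryNoiseFilter`, the category names, and the splitting rule are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: selects a noise-word set per product category.
class CategoryNoiseFilter {
    private final Map<String, Set<String>> noiseByCategory;

    CategoryNoiseFilter(Map<String, Set<String>> noiseByCategory) {
        this.noiseByCategory = noiseByCategory;
    }

    // Lowercases the text, splits on runs of non-letter/non-digit characters,
    // and drops the tokens listed as noise for the given category.
    List<String> tokenize(String category, String text) {
        Set<String> noise = noiseByCategory.getOrDefault(category, Set.of());
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !noise.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }
}
```

The same word can then be noise in one category and a meaningful term in another, for example "edition" for books versus tools. In Lucene itself this kind of filtering would live in the analyzer used for indexing and querying.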
The content management software Alfresco has to ingest tons of documents as fast as possible, so I guess the indexer they use is amongst the fastest they could find.
Yes, they use Lucene.
Databases like MySQL have an integrated full-text index (see: MySQL Index creation) that you can use. It is quite fast but not as easy to configure as Lucene. I tried it once and did not get the results I intended, especially since the included tokenizer cannot be exchanged as easily as with Lucene.
Another alternative would be a simple database table with one column holding the index terms and another pointing to the postings list (all documents containing the term). A colleague of mine does it that way and says he evaluated its performance against Lucene and found the database to be much faster.
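The term-to-postings idea can be sketched in a few lines of Java. This is an in-memory stand-in for the two-column table, not my colleague's actual schema; the class name `TinyInvertedIndex` and the naive tokenization are assumptions, and it says nothing about speed either way.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// In-memory stand-in for the table: index term -> postings list (ids of documents containing it).
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits the text into lowercase terms and records docId under each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents containing the term, in ascending order.
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(hits);
    }
}
```

In a real database the map would become a table with an index on the term column, and a lookup would be a single query on that column.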
In conclusion, however, I must say that whenever I tried a different technology I was back at Lucene quite quickly. The documentation is one of the best I have ever read, and the configuration is as easy as it is extensive.
Lucene is an awesome search tool, but I would also urge you to take a look at Apache Solr, a full-fledged search server built on Lucene and exposed through a RESTful HTTP interface.