I'm working on a system that matches large sets of records on strings, numeric ranges, and date ranges. As far as I can tell, the string matches are mostly exact, as opposed to the fuzzier full-text-search results that I understand Lucene is generally designed for. Numeric precision is important because the data concerns prices.

I noticed that Lucene recently added some support for numeric range searching, but that is not something it was originally designed for.

Currently the system uses procedural SQL to do the matching, and it is reaching the limits of its scalability. I'm researching ways to scale the system horizontally, and search engine technology seems like a possibility, given that some of these technologies scale to very large data sets while still returning results very quickly. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching against the Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. I would eventually like to aim for near-real-time results, although we are a long way from that at this point.

My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?

A: 

At its heart, and in its simplest form, Lucene is a word density search engine. Lucene can scale to handle extremely large data sets and, when indexed correctly, return results at blistering speed. For text-based searching it is very probable that results will come back quicker from Lucene than from SQL Server, Oracle, or MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS, as the two have completely different uses.

Kane
I am considering offloading only the text searches to Lucene, but first I will have to find out what proportion of the load is attributable to text searches in order to justify the investment. I'm sure it's a lot faster at this. Technology comparison aside, I'm considering adding Lucene as a whole-system optimisation rather than as an either/or option.
barrymac
I would definitely recommend using a combination of Lucene and an RDBMS; you'll be really surprised by the performance.
Kane
Well, I implemented a trivial Hibernate Search setup once. Although that particular setup is actually quite limited, the speed and power of Lucene on large data sets would be enough to make you giddy :-)
barrymac
+1  A: 

I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".

A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
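
To make the sharding idea a bit more concrete, here is a rough Python sketch of the general pattern: route each document to one shard by a hash of its id, then send every query to all shards and merge the hits. This is only an illustration of the concept, not how Solr implements it; `NUM_SHARDS`, `shard_for`, `index_documents`, and `search` are made-up names, and the "shards" are plain in-memory dicts.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count, chosen only for the example


def shard_for(doc_id: str) -> int:
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def index_documents(docs):
    """Distribute documents across shards; each shard holds a subset."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for doc_id, fields in docs.items():
        shards[shard_for(doc_id)][doc_id] = fields
    return shards


def search(shards, predicate):
    """Scatter the query to every shard, then gather and merge the hits."""
    hits = []
    for shard in shards:  # a real deployment would query the shards in parallel
        hits.extend(doc_id for doc_id, fields in shard.items() if predicate(fields))
    return sorted(hits)
```

Replication is the complementary idea: each shard is copied to several machines so that query load can be spread across the copies.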

Yuval F
Thank you very much for those helpful links!
barrymac
+1  A: 
  1. Lucene stores its numeric data as a trie; a SQL implementation will probably store it as a B-tree or an R-tree. Lucene's trie and a SQL database's tree index work in pretty similar ways, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes from Solr).
  2. On the general question of Lucene vs. SQL full-text performance, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. “An Empirical Study on Performance Comparison of Lucene and Relational Database.” In Communication Software and Networks, 2009. ICCSN'09. International Conference on, 336-340. IEEE, 2009.

First, when executing exact query, the performance of Lucene is much better than that of unindexed-RDB, while is almost same as that of indexed-RDB. Second, when the wildcard query is a prefix query, then the indexed-RDB and Lucene both perform very well still by leveraging the index... Third, for combinational query, Lucene performs smoothly and usually costs little time, while the query time of RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven’t been indexed, search will cost much more time. Fourth, the query time of Lucene and unindexed-RDB has relations with the record complexity, but the indexed-RDB is nearly independent of it.

In short, if you are doing a search like "select * where x = y" on an indexed column, it doesn't much matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...), the better Lucene looks by comparison.
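
As a rough illustration of why combined exact-match clauses are cheap in an inverted index, and of the "match first, fetch full records later" pattern from the question, here is a small Python sketch. It is a toy, not Lucene: the index is a dict of sets, the table, field names, and sample rows are invented, and sqlite3 stands in for the real database.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (id, brand, category, price_cents)
RECORDS = [
    (1, "acme", "widget", 1999),
    (2, "acme", "gadget", 2499),
    (3, "zenith", "widget", 1999),
    (4, "zenith", "gadget", 999),
]


def build_index(records):
    """Map each (field, value) pair to the set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, brand, category, _price in records:
        index[("brand", brand)].add(rec_id)
        index[("category", category)].add(rec_id)
    return index


def fetch_full_records(conn, ids):
    """Hit the database only for ids that already passed the matching rules."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = (
        "SELECT id, brand, category, price_cents FROM records "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, sorted(ids)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records "
        "(id INTEGER PRIMARY KEY, brand TEXT, category TEXT, price_cents INTEGER)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", RECORDS)
    index = build_index(RECORDS)
    # brand = acme OR (brand = zenith AND category = widget)
    ids = index[("brand", "acme")] | (
        index[("brand", "zenith")] & index[("category", "widget")]
    )
    return fetch_full_records(conn, ids)
```

Each clause is a set lookup, and AND/OR are set intersection and union, so adding clauses does not require a new index per combination of columns. Only the final id set touches the database.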

They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing etc.
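
Going back to point 1, the trie idea for numeric ranges can be sketched roughly as follows. Each value is indexed under several prefixes of decreasing precision, so a range query becomes a union of a handful of prefix terms rather than one term per distinct value. This is a simplified Python illustration of the concept, not Lucene's actual code. The function names, the 4-bit step, and the 32-bit width are my own choices, and I'm assuming prices are stored as non-negative integer cents so that precision is exact.

```python
from collections import defaultdict


def index_terms(value, step=4, bits=32):
    """All (shift, prefix) terms under which a non-negative integer is indexed."""
    return [(shift, value >> shift) for shift in range(0, bits, step)]


def range_terms(lo, hi, step=4, bits=32):
    """Cover the inclusive range [lo, hi] with aligned prefix blocks."""
    terms = []
    while lo <= hi:
        shift = 0
        # Grow the block while it stays aligned at lo and inside the range.
        while shift + step < bits:
            size = 1 << (shift + step)
            if lo % size != 0 or lo + size - 1 > hi:
                break
            shift += step
        terms.append((shift, lo >> shift))
        lo += 1 << shift
    return terms


def build_numeric_index(prices):
    """prices maps record id -> price in integer cents (assumed representation)."""
    index = defaultdict(set)
    for rec_id, cents in prices.items():
        for term in index_terms(cents):
            index[term].add(rec_id)
    return index


def range_query(index, lo, hi):
    """Union the postings of the few prefix terms that cover [lo, hi]."""
    hits = set()
    for term in range_terms(lo, hi):
        hits |= index.get(term, set())
    return hits
```

For example, `range_terms(0, 255)` collapses to the single term `(8, 0)`, because every value from 0 to 255 shares the same prefix once the low 8 bits are dropped. The trade-off is a larger index (several terms per value) in exchange for fewer terms per range query.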

Xodarap
Great answer! That really sheds some light. This problem has some cases where tens of rules may run sequentially on the dataset, so Lucene may be able to improve things a lot here. I'm an aspiring empiricist, so I really appreciate the reference. I'm also thinking that, besides the nice functionality, there's an architectural advantage to separating out the indexing load.
barrymac
@barrymac: You may also be interested in http://philosophyforprogrammers.blogspot.com/2010/09/lucene-performance.html, and the papers cited therein. You can plug in some values of sample searches and see what your expected performance gain might be (assuming you know the metrics for your current implementation, of course).
Xodarap