views: 3313 · answers: 7

Has anyone used Lucene.NET rather than the full-text search that comes with SQL Server?

If so, I would be interested in how you implemented it.

Did you, for example, write a Windows service that queried the database every hour and then saved the results to the Lucene.NET index?

+2  A: 

I think this article is a good starting point:

http://www.aspfree.com/c/a/BrainDump/Working-with-Lucene-dot-Net/1/

Espo
+1  A: 

I have not done it against a database yet, and your question is somewhat open-ended.

If you want to search a database and can choose to use Lucene, I'd also guess that you can control when data is inserted into the database. If so, there is little reason to poll the database to find out whether you need to reindex: just index as you insert, or create a queue table that tells Lucene what to index.

I don't think we need another indexer that is ignorant of what it is doing, reindexing everything every time and wasting resources.

neslekkiM
A: 

@neslekkiM Yes, my question is a bit vague, I guess; I hope people don't mind, but I am interested in the different ways something like this could be implemented.

I quite like your idea of a queue table that gets used to populate/update the Lucene index. Perhaps a trigger might be useful here.
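The queue-table-plus-trigger idea could be sketched like this. This is only a minimal sketch: SQLite (via Python's `sqlite3`) stands in for the real database, and the `documents`/`index_queue` table and trigger names are made up for illustration. A separate indexer process would periodically drain `index_queue` into the Lucene index and delete the rows it has handled.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT);

-- queue of document ids the indexer still has to process
CREATE TABLE index_queue (doc_id INTEGER NOT NULL);

-- the trigger keeps the queue populated on every insert
CREATE TRIGGER documents_after_insert AFTER INSERT ON documents
BEGIN
    INSERT INTO index_queue (doc_id) VALUES (NEW.id);
END;
""")

con.execute("INSERT INTO documents (body) VALUES ('hello lucene')")
con.execute("INSERT INTO documents (body) VALUES ('hello sphinx')")

# the indexer would read this queue on each pass, index the rows, then clear it
pending = [row[0] for row in con.execute("SELECT doc_id FROM index_queue")]
print(pending)  # -> [1, 2]
```

The same pattern works for updates and deletes by adding the corresponding triggers, so the indexer never has to scan the whole table to find out what changed.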

solrev
+25  A: 

Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.

I've used Lucene.NET as a general database indexer, so what I got back was basically DB ids (pointing to indexed email messages), and I've also used it to bring back enough info to populate search results without touching the database at all. It worked great in both cases, though the SQL side can get a little slow, as you pretty much have to get an ID, then select that ID, and so on. We got around this by making a temp table (with just the ID column in it) and bulk-inserting from a file (which was the output from Lucene), then joining to the message table. It was a lot quicker.
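The hit-list join described above could be sketched as follows, assuming hypothetical table names and using SQLite via Python's `sqlite3` in place of SQL Server. The idea is the same: load the ids returned by Lucene into a one-column temp table in a single bulk pass, then do one set-based join rather than one SELECT per id.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT)")
con.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [(i, "message %d" % i) for i in range(1, 1001)],
)

# pretend these ids came back from a Lucene search
hit_ids = [7, 42, 999]

# one-column temp table holding the hits, bulk-loaded in one pass
con.execute("CREATE TEMP TABLE hits (id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO hits VALUES (?)", [(i,) for i in hit_ids])

# a single join instead of one round trip per id
rows = con.execute(
    "SELECT m.id, m.subject FROM hits h JOIN messages m ON m.id = h.id "
    "ORDER BY m.id"
).fetchall()
print(rows)  # -> [(7, 'message 7'), (42, 'message 42'), (999, 'message 999')]
```

With SQL Server the bulk load would typically be a `BULK INSERT` from the file Lucene wrote, but the join itself is the same shape.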

Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very, very good at what it does. It's worth a look, and, I'm told, it doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's full-text indexing does.

BTW, we were dealing with 20-50 million emails (and around 1 million unique attachments), totalling about 20GB of Lucene index, I think, and 250+GB of SQL database plus attachments.

Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (which control when Lucene merges index segments). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which each have 1 million items in them, and you have a watcher thread which kills the process if it takes too long... (yes, that kicked our arse for a while). So keep the maximum number of documents per segment LOW (i.e., don't set it to maxint like we did!)

EDIT (Corey Trager): I documented how I use Lucene.NET (in my app BugTracker.NET) here:
http://www.ifdefined.com/blog/post/2009/02/Full-Text-Search-in-ASPNET-using-LuceneNET.aspx

Nic Wise
Ashley Tate
+1  A: 

I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudo code it looks like this:

  • Store record:

    insert text and other data into the table
    get the latest inserted ID
    create a lucene document
    put (ID, text) into the lucene document
    update the lucene index

  • Querying:

    search the lucene index
    for each lucene doc in the result set, load the data from the DB by the stored record's ID
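The steps above can be sketched as a runnable toy. Note the assumptions: a tiny in-memory inverted index stands in for the Lucene index (the real code would use Lucene.NET's Document/Field/IndexWriter equivalents), and SQLite via Python's `sqlite3` stands in for MySQL. The point is the pattern itself: the full-text index stores only (id, text), and the row is loaded from the primary table by id.

```python
import sqlite3
from collections import defaultdict

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, text TEXT)")

# toy stand-in for the Lucene index: term -> set of stored record ids
index = defaultdict(set)

def store_record(text):
    # insert the text into the table and get the latest inserted id
    cur = con.execute("INSERT INTO records (text) VALUES (?)", (text,))
    rec_id = cur.lastrowid
    # "create a lucene document" holding (id, text) and update the index
    for term in text.lower().split():
        index[term].add(rec_id)
    return rec_id

def query(term):
    # search the index, then load each hit from the DB by its stored id
    ids = sorted(index.get(term.lower(), ()))
    return [con.execute("SELECT id, text FROM records WHERE id = ?",
                        (i,)).fetchone() for i in ids]

store_record("Lucene works with MySQL")
store_record("Sphinx is fast")
print(query("lucene"))  # -> [(1, 'Lucene works with MySQL')]
```

Updating a record then means deleting its document from the index and re-adding it with the same stored id, which is the approach described in the comments below.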

Just to note, I later switched from Lucene to Sphinx due to its superb performance.

aku
Aku, with this implementation, how do you manage updates to the data? Do you delete and re-create a file for the Lucene index? Do you schedule the re-index, or is it real-time? I ask these questions in hopes of better understanding whether I should use it or not. Thanks, deadbug.
deadbug
@deadbug, my system was read-intensive; I deleted the old Lucene doc and added a new one. I experimented with scheduling, but decided to keep the system simple. There were no performance problems with this approach.
aku
+1  A: 

Thanks guys, that's exactly the sort of info I am after. Food for thought!

@nic - 50 million indexed records and the performance was still fantastic. That's heartening to hear; you don't really see that kind of perf info in tutorials!

@aku - I'll check out Sphinx. It looks like an interesting alternative.

solrev
You should delete these answers and use the comment feature to respond.
Ashley Tate
+2  A: 

I have also used Lucene.NET as a storage engine, because it's easier to distribute and set up alternate machines with an index than with a database: it's just a filesystem copy. You can index on one machine and simply copy the new files to the other machines to distribute the index. All the searches and details are served from the Lucene index, and the database is only used for editing. This setup has proven to be a very scalable solution for our needs.

Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full-text search is that the service is decoupled from the relational engine, so joins, ordering, aggregates, and filters between the full-text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating the full-text search into the relational engine, but I haven't tested it. They also made the whole full-text search much more transparent: in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, whereas in the new version it is easier to see how they work.

In my experience, if SQL Server meets your requirements, it will be the easiest way. If you expect a lot of growth, complex queries, or need a great deal of control over the full-text search, you might consider working with Lucene from the start, because it will be easier to scale and customise.

Jokin