views:

156

answers:

8

We run an email service hosting close to 10,000 domains, and we store the headers of messages in a SQL Server database.

I need to implement an application that will search the message body for keywords. The messages are stored as files on a NAS storage system.

As a proof of concept, I implemented a SQL Server-based search system where I would parse each message and store all of its words in a database table along with the memberid and the messageid. This database was on a separate server from the headers database.

The problem with that system was that I ended up with a table of 600 million rows after processing the messages of just one domain. Obviously this is not a very scalable solution.

Since the headers are stored in a SQL Server table, I am going to need to join the messageIDs from the search application to the header table to display the messages that contain the searched-for keywords.

Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day. We are a small company with limited resources with respect to servers, maintenance etc.

Thanks

+3  A: 

lucene.net will help you a lot, but no matter how you approach this, it's going to be a lot of work.

John Sheehan
Lucene isn't going to help with the scaling, which I think is the central question, but it probably is the right tool for tokenizing, stemming, and searching.
jshen
+2  A: 

Consider not using SQL for this. It isn't helping.

GREP and other flat-file techniques for searching the text of the headers is MUCH faster and much simpler.

S.Lott
With 20 million messages a day, and guessing at a header length of 100 bytes, that's 2 GB of data a day, or 60 GB after a month. Would a flat-file search [byte 1, byte 2, byte 3...] on that much data really be faster and simpler?
Philip Kelley
Yes. Flat files are physically smaller. Less I/O. Faster. GREP pattern matching can be very simple.
S.Lott
+1  A: 

You can also check out the Java Lucene ecosystem, which might be useful to you. Both Katta, a distributed Lucene index, and Solr, which can use rsync for index syncing, might be useful. While I don't consider either to be very elegant, it is often better to use something that is already built and known to work before embarking on actual development. Without knowing more details, it's hard to make a more specific recommendation.

Steve
+4  A: 

Have a look at Hadoop. It's a complete "map-reduce" framework for working with huge datasets, inspired by Google. I think (but I could be wrong) that Rackspace uses it for email search for their clients.
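The map/reduce split that Hadoop provides can be illustrated in miniature in Python (a toy sketch, not Hadoop's actual API; the message ids and bodies are made up):

```python
from collections import defaultdict

def map_phase(message_id, body):
    """Mapper: emit (word, message_id) pairs, one per distinct word,
    like a Hadoop streaming mapper would over each input record."""
    for word in set(body.lower().split()):
        yield word, message_id

def reduce_phase(pairs):
    """Reducer: group message ids by word, producing an inverted index."""
    index = defaultdict(set)
    for word, message_id in pairs:
        index[word].add(message_id)
    return index

messages = {"m1": "meeting at noon", "m2": "lunch meeting moved"}
pairs = [p for mid, body in messages.items() for p in map_phase(mid, body)]
index = reduce_phase(pairs)
```

In a real cluster the mapper runs in parallel over the NAS files and the framework handles the shuffle between the two phases; the point is that the index build parallelizes cleanly.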

lubos hasko
A: 

I wonder if BigTable (http://en.wikipedia.org/wiki/BigTable) does searching.

Dustin Getz
No, but MapReduce tasks create inverted indexes from which they do the search. 'They' being the only ones that have BigTable installations :-)
Javier
A: 

Look into SQL Server's full-text search services/functionality. I haven't used it myself, but I once read that Stack Overflow uses it.

Philip Kelley
A: 

Three solutions:

  1. Use an already-existing text search engine (Lucene is the most mentioned; there are several more).
  2. Store the whole message in the SQL database and use its built-in full-text search (most DBs have it these days).
  3. Don't create a new record for each word occurrence; just append a new value to a big field in the word's record. Even better, don't use SQL for this table; use a key-value store where the key is the word and the value is the list of occurrences. Check some inverted-index bibliography for inspiration.

But to be honest, I think the only reasonable approach is #1.
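Option 3 above can be sketched in Python as an in-memory inverted index with set-intersection queries (the key-value store is simulated with a dict; all names are hypothetical):

```python
from collections import defaultdict

# key: word, value: set of (member_id, message_id) occurrences --
# one record per word, not one row per occurrence
index = defaultdict(set)

def add_message(member_id, message_id, body):
    """Append this message's id to each word's posting set."""
    for word in set(body.lower().split()):
        index[word].add((member_id, message_id))

def search(*keywords):
    """Intersect posting sets; returns the ids matching all keywords."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()
```

Swapping the dict for an on-disk key-value store keeps the row count at one per distinct word instead of one per occurrence, which is exactly what blew up the 600-million-row table.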

Javier
+1  A: 

If you can break up your 600 million rows, look into database sharding. Any query across all rows is going to be slow. At the very least you could break up by language. If they're all English, find some way to split the data that makes sense based on common searches. I'm just guessing here, but maybe domains could be grouped by TLD (.com, .net, .org, etc.).
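A hypothetical sketch of the routing side of sharding (the shard count and the hash-by-domain scheme are assumptions, chosen so all of a customer's messages land on one database):

```python
import hashlib

SHARDS = 4  # hypothetical number of database shards

def shard_for_domain(domain):
    """Deterministically map a hosted domain to a shard id, so that a
    search for one customer only ever touches a single database."""
    digest = hashlib.md5(domain.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARDS
```

Any stable hash works; the important property is that the same domain always resolves to the same shard, so both the indexer and the search front end agree on where the rows live.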

For full-text search, compare SQL Server vs. Lucene.NET vs. CLucene vs. MySQL vs. PostgreSQL. Note that full-text search will be faster if you don't need to rank the results. If a database is still slow, look into performance tuning, and if that fails, look into a Linux-based DB.

http://incubator.apache.org/lucene.net/

http://sourceforge.net/projects/clucene/

Axl