I have 30 million distinct phrases (not documents), ranging from one word to a 10-word sentence, and I need to support word/phrase searching. Basically, what WHERE CONTAINS(phrase, '"book" OR "stack overflow"') offers.

I have an instance of SQL Server 2005 (32-bit, 4 proc, 4 GB) going against several full-text catalogs, and performance is awful for word searches with high cardinality.

Here are my thoughts on speeding things up; perhaps someone can offer guidance:

1) Upgrade to 2008 iFTS, 64-bit. SQL Server 2005 FTS's Windows service never uses more than 50 MB. From what I have gathered, it uses the file system cache for looking up catalog indexes. My populated catalogs are only around 300 MB on disk, so why can't this all be in memory? Might iFTS's new memory architecture, which is part of the sqlserver process, help here?

2) Scale out the catalogs to several servers. Will the queries to the linked FTS servers run in parallel?

3) Since I'm searching phrases here and not documents, maybe SQL Server's Full Text Search isn't the answer. Lucene.NET? Put the catalog index on a RAM drive?

+1  A: 

I'm slightly surprised that FTS is creaking under this sort of load. However, if this proves to be the case, the classic approach (Gary Kildall developed it for searching CDs!) would be to use an inverted index, also called an inversion index (see http://en.wikipedia.org/wiki/Search_engine_indexing#Inverted_indices). I've used this technique for a long time with a succession of applications. It scales very well, and I've tested it indexing up to 8 million documents. Even when searching through eight million documents, it gets results within three seconds if the indexes are right. Often it is a lot quicker than this.

I use an inversion index to get a pool of likely candidates (capped at a bearable number via TOP x), and then do a brute-force search of these with a regex. It works very well.
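
The two-step approach described above (index lookup for candidates, then regex verification) can be sketched roughly as follows. This is only an in-memory Python illustration, not the implementation used in the answer: the function names, the `\w+` tokenizer, and the `limit` parameter (standing in for TOP x) are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(phrases):
    """Inverted index: map each word to the set of phrase ids containing it."""
    index = defaultdict(set)
    for phrase_id, phrase in enumerate(phrases):
        for word in tokenize(phrase):
            index[word].add(phrase_id)
    return index

def search(phrases, index, query, limit=1000):
    """Return ids of phrases that contain `query` as a whole word or phrase."""
    words = tokenize(query)
    if not words:
        return []
    # Step 1: intersect the posting sets to get a pool of candidates.
    postings = [index.get(word, set()) for word in words]
    candidates = set.intersection(*postings)
    # Cap the pool at a bearable size (hypothetical stand-in for TOP x).
    pool = sorted(candidates)[:limit]
    # Step 2: brute-force check of the candidates with a regex, which
    # enforces that the words appear adjacent and in the right order.
    pattern = re.compile(
        r"\b" + r"\W+".join(re.escape(word) for word in words) + r"\b",
        re.IGNORECASE,
    )
    return [i for i in pool if pattern.search(phrases[i])]

phrases = [
    "Stack Overflow is great",
    "a book about stacks",
    "overflow the stack",
    "the book stack overflow wrote",
]
index = build_index(phrases)
print(search(phrases, index, "stack overflow"))  # [0, 3]
print(search(phrases, index, "book"))            # [1, 3]
```

Note that phrase 2 contains both words and so survives the index lookup, but the regex step rejects it because the words are in the wrong order. Capping the pool before verification keeps the regex pass cheap, at the cost of possibly missing matches beyond the cap.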

Phil Factor
Interesting, thanks for the suggestion. I read up on Inverted Indexes and found an article that explains that FTS uses this technique. http://tinyurl.com/byvxqf
jfrantzen
A: 

As an out-of-the-box solution, I would prefer using "Microsoft Office SharePoint Server" for indexing and searching within the content of documents. A free alternative is the Lucene.Net library, if you want to write your own service for indexing and searching. Writing your own full-text search service with Lucene.Net will give you all the flexibility you need (yes, you can store the index on external storage if you want to).

Alexander
+1  A: 

Lucene.Net can offer very high performance for this kind of application along with a pretty simple API. Release 2.3.2 is nearing completion, which offers additional performance increases over release 2.1. While putting the Lucene index in a RAMDirectory (Lucene's memory-based index structure) will offer even better performance, we see great results even with the FSDirectory (a disk-based index).

Sean Carpenter
A: 

Take a look at Apache Solr. It's a search server that wraps Lucene with an HTTP interface. Each of your phrases would map to a Solr document. 30M documents is not a lot for Solr since your documents would be very short. The final performance would also depend on how many queries/sec you need.
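
To give a rough idea of what querying over HTTP might look like, here is a small Python sketch that only builds the request URL for a word-or-phrase search. The host, port, core name ("phrases"), and field name ("phrase") are assumptions for illustration; the actual path and schema depend on your Solr version and setup.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr instance with a hypothetical core "phrases".
SOLR_SELECT_URL = "http://localhost:8983/solr/phrases/select"

def quote_term(term):
    """Wrap a word or phrase in double quotes, escaping backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def build_query_url(terms, field="phrase", rows=10):
    """Build a select URL that matches any of the given words or phrases."""
    # `field` is a hypothetical text field holding one phrase per document.
    clauses = " OR ".join(f"{field}:{quote_term(term)}" for term in terms)
    params = {"q": clauses, "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"

url = build_query_url(["book", "stack overflow"])
print(url)
```

Fetching that URL with any HTTP client would return the matching documents, so the application code stays thin and the search tier can be scaled separately from the database.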

Mauricio Scheffer