views: 45
answers: 2

At work we had a discussion about adding a fulltext search engine to our product. One proposal was to use a MSSQL database table as a fulltext search index, like this:

Keyword  Document
------------------
This     Doc1
Is       Doc1
A        Doc1
Test     Doc1
And      Doc2
This     Doc2
Too      Doc2

A search would get all rows containing the word grouped by document.
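The proposed table is essentially an inverted index: each keyword maps to the documents containing it, and an AND-search intersects the per-word document sets. A minimal in-memory sketch of the idea (illustrative only, not the actual MSSQL schema):

```python
from collections import defaultdict

index = defaultdict(set)  # keyword -> set of document ids

def add_document(doc_id, text):
    # Naive tokenization: lowercase and split on whitespace.
    for word in text.lower().split():
        index[word].add(doc_id)

def search_and(*words):
    """Return documents containing ALL of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

add_document("Doc1", "This is a test")
add_document("Doc2", "And this too")
print(sorted(search_and("this")))         # ['Doc1', 'Doc2']
print(sorted(search_and("this", "too")))  # ['Doc2']
```

In SQL this intersection would be expressed with a GROUP BY on the document column and a HAVING COUNT(*) equal to the number of search terms.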

EDIT: This table would contain more than 50 million rows.

Just to clarify: We don't want to use the fulltext features of MSSQL.

To be blunt: I don't like this and I'd rather use Lucene, but as our product is written in VB6 it would require some work. So currently we will do it as I described, unless I come up with some hard facts against it. Or maybe you can convince me that this isn't such a bad idea after all.

+1  A: 

If you're using VB6 and you want to use Lucene, get yourself CLucene, build it as a DLL (maybe with a C wrapper to expose AddDocument and SearchFiles functions) and call them from your VB6 code. (Or create a simple COM object that does the same, and call that.)

The data can then be parsed and presented to your users as easily as with your DB 'hack', and it will be a lot less trouble, especially when it comes to keeping your DB table up to date with new additions.

What will you do with duplicate words? It will be dreadfully slow to update the DB with every word of every document (if you're expecting 50 million uniques!). And how will you handle updating files without simply re-adding every single word? Unless you diff each document and insert only the added words and delete the removed ones (what a nuisance!).

You might as well take the day or so it'll take to get clucene running.

gbjbaanb
+1  A: 

What you describe is actually the way many forums work (phpBB, for example). So yes, this is a way to do it that has been employed by other people and it works pretty well. The drawbacks:

  • You can't do things like "word NEAR word", only "word AND/OR word"
  • The tables tend to grow quickly. Full text search engines compress words by stripping plural-s, folding common prefixes, etc., to shrink the search space.
  • It's relatively simple to do a denial of service attack against these things since searches pull in lots of data (at least, for many forums, this was a big issue).
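The word compression mentioned in the second point could look like this, as a very rough sketch. (Real engines use proper stemmers such as the Porter algorithm; this only strips a plural -s.)

```python
def normalize(word):
    """Crude stemming sketch: lowercase, then strip a plural -s.
    Words ending in -ss (e.g. 'class') are left alone."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

print(normalize("Documents"))  # document
print(normalize("Tests"))      # test
print(normalize("class"))     # class (unchanged)
```

Normalizing words before inserting them into the keyword table both shrinks it and lets a search for "test" match documents containing "tests".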

One small optimization: Create a table of words and assign each word an ID (a 32-bit int; English has only about 2 million words, including names etc.). Then create an M:N mapping of word-id to doc-id. This will reduce the DB size a lot.
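The optimization above can be sketched with in-memory stand-ins for the two tables (names are illustrative; in MSSQL these would be a words table and a word-doc junction table):

```python
word_ids = {}     # stand-in for the words table: word -> int id
postings = set()  # stand-in for the M:N table: (word_id, doc_id) pairs

def word_id(word):
    # Assign the next free integer ID on first sight of a word.
    return word_ids.setdefault(word, len(word_ids))

def add_document(doc_id, text):
    # De-duplicate words per document so each (word, doc) pair is stored once.
    for w in set(text.lower().split()):
        postings.add((word_id(w), doc_id))

add_document(1, "This is a test")
add_document(2, "And this too")
# 'this' is stored once in word_ids; the postings table holds only
# pairs of small integers instead of repeated strings.
```

Each keyword string is stored exactly once, and the large table holds two integers per row instead of a string per row, which also makes the index on it much smaller.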

Aaron Digulla