I have an application that needs to have .doc files uploaded to it. These documents should then be indexed, and the whole collection of documents should be searchable. This will run on a Windows Server, without Word installed, using IIS and SQL Server, but I'd rather not be tied to SQL Server's full-text indexing.

I was thinking of using Lucene.Net for the indexing part and was wondering what the best way to get the text out of the .doc files would be. I could probably extract the text by reading in the whole stream and then using a regex to pull out any regular characters, but that seems heavy-handed and error-prone.
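To make the regex idea concrete, here is a minimal sketch in Python (not the .NET stack described above) of pulling runs of printable characters out of a raw byte stream. The function name and the minimum run length are made up for illustration. It assumes text is stored either as single-byte ASCII or as UTF-16LE, which is a simplification of the real .doc format.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Naively pull printable runs out of raw bytes (hypothetical helper)."""
    # Runs of at least min_len printable ASCII bytes (space through tilde).
    ascii_pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Runs of printable ASCII characters encoded as UTF-16LE (char + NUL byte).
    utf16_pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    runs = [m.decode("ascii") for m in ascii_pattern.findall(data)]
    runs += [m.decode("utf-16-le") for m in utf16_pattern.findall(data)]
    return runs
```

Even this small version shows why the approach is fragile: it knows nothing about the file's structure, so it cannot tell body text from field codes, style names, or other metadata, and it drops anything outside plain ASCII.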

I saw an article on using IFilters that sounds promising, but I thought I'd put this out there since it's not something I'm familiar with.

P.S. If it matters, these .doc files will have mail-merge fields in them and there's no other current alternative for the .doc format.

A: 

Maybe you'd like to check out Solr.

Sinan Taifour
Looks like that needs Apache. We're running IIS.
Jared
You can run it separately and communicate with it using its API.
Sinan Taifour
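As a rough illustration of talking to a separately running Solr over HTTP, the sketch below only builds a query URL for Solr's select handler. The host, port, core name ("documents") and field name ("body") are assumptions for the example, and the exact path depends on the Solr version and how it is set up.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url: str, core: str, query: str, rows: int = 10) -> str:
    """Build a select-handler URL for a Solr core (names are illustrative)."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"
```

The application on IIS would issue an HTTP GET to a URL like this and parse the JSON response, so Solr itself can live on whatever server suits it.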
+1  A: 

In our PHP based applications we always used external programs similar to this one: doc2txt. Then we took the text and saved it into the database. If you search on Google for "doc2txt" you will find many different programs doing exactly the same thing. Just take the one that suits you best.
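A minimal sketch of that workflow, in Python rather than PHP: run an external converter, capture its output, and store the text. It assumes the converter takes the file path as its last argument and writes plain text to stdout, which you would need to check for whichever tool you pick. SQLite stands in for the real database, and the table and function names are made up for the example.

```python
import sqlite3
import subprocess


def convert_to_text(command: list[str], doc_path: str) -> str:
    """Run an external converter and return whatever it prints to stdout."""
    result = subprocess.run(
        command + [doc_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the converter fails
    )
    return result.stdout


def store_text(conn: sqlite3.Connection, doc_path: str, text: str) -> None:
    """Save the extracted text, replacing any earlier version of the document."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (path TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO documents (path, body) VALUES (?, ?)",
        (doc_path, text),
    )
    conn.commit()
```

Usage would look like `store_text(conn, path, convert_to_text(["doc2txt"], path))`, where the command list is whatever invocation your chosen converter expects.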

Raffael Luthiger
+1  A: 

For a solution that doesn't require an external program, the IFilter approach looks like the way to go (though you might count an IFilter as an external program too).

Here's a simple CodeProject article, with code, on how it can be done: http://www.codeproject.com/KB/cs/IFilter.aspx

Jared