Currently we are saving files (PDF, DOC) into the database as BLOB fields. I would like to retrieve the raw text of each file so that I can use it for hit-highlighting and other functions.
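
To make the hit-highlighting goal concrete, here is a minimal sketch of what could be done once the plain text is available. It is written in Python only for brevity (the same logic ports to .NET); the function name `highlight` and the `<mark>` tags are illustrative assumptions, not part of any existing solution.

```python
import html
import re


def highlight(text, terms, open_tag="<mark>", close_tag="</mark>"):
    """Wrap each case-insensitive occurrence of any term in highlight tags.

    The surrounding text is HTML-escaped so the result can be rendered safely.
    """
    terms = [t for t in terms if t]
    if not terms:
        return html.escape(text)
    # Longest terms first so a longer term wins over its own prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(open_tag + html.escape(match.group(0)) + close_tag)
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

This only works on text that has already been extracted, which is why the extraction step is the real question here.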

Does anyone know of a simple way to parse the files and store the raw text when they are saved, via either SQL or .NET code? I have found that Adobe has a filtdump utility that will convert a PDF to text. Filtdump seems to be a command-line tool, and I don't see a way to feed it a file stream. And what would the extractor be for Office documents and other file types?
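
One possible workaround for a command-line-only extractor is to write the BLOB to a temporary file, run the tool against that path, and capture its standard output. The sketch below is in Python for brevity; the helper name `extract_text_via_cli` is hypothetical, and the `command` argument is left to the caller because the exact filtdump invocation and its flags are not assumed here. It also assumes the tool prints the extracted text to stdout.

```python
import os
import subprocess
import tempfile


def extract_text_via_cli(blob, suffix, command):
    """Write blob to a temp file, run command + [path], return stdout as text.

    command is a list such as ["some-extractor.exe"] (hypothetical); the temp
    file path is appended as the last argument.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Close the file before the tool runs so it can be opened on Windows.
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        result = subprocess.run(command + [path], capture_output=True, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.remove(path)
```

The temp file is removed even if the tool fails, and `check=True` turns a non-zero exit code into an exception rather than silently storing empty text.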

-or-

Is there a way to pull the raw text out of the SQL Server full-text index without using third-party filters?

Note: I am trying to build a .NET and MSSQL solution without having to use a third-party tool such as Lucene.

+3  A: 

If it isn't absolutely necessary to stream directly from SQL Server into your app, the hard part is parsing the PDF or DOC file formats.

The iTextSharp library will give you access to the innards of a PDF file:

http://itextsharp.sourceforge.net/

Here's a commercial product that claims to parse Word docs:

Aspose.Words

Edited to add:

I think you're also asking if there are ways to make SQL Server Full-text Indexing do the work for you by adding IFilters. This sounds like a good idea. I haven't done this myself, but MS has apparently supported a Word filter for a long time, and now Adobe has released a (free) PDF filter. There's a lot of information here:

Filter Central

10 Ways to Optimize SQL Server Full-text Indexing

SQL Server Full Text Search: Language Features - a little out of date but easy to understand.

egrunin
Since SQL is already pulling out the text through its own filters, why do other tools need to be used?
Glennular
Thanks for the clarifications.
egrunin
+1  A: 

From your C# application, you could open the .doc file, save it as text, and store both the text and the .doc document in the database.
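
As a rough sketch of that "store both" layout: the example below uses Python with an in-memory-capable SQLite database purely as a stand-in for SQL Server, and the table name `documents`, its columns, and the helper functions are all illustrative assumptions. In MSSQL the BLOB column would typically be varbinary(max) and the text column nvarchar(max).

```python
import sqlite3


def create_store(conn):
    """Create a table holding the original file and its extracted text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, "
        "file_name TEXT NOT NULL, "
        "content BLOB NOT NULL, "      # original file bytes, unchanged
        "plain_text TEXT NOT NULL)"    # text extracted at save time
    )


def save_document(conn, file_name, content, plain_text):
    """Insert the binary file and its plain text in one row; return the id."""
    cur = conn.execute(
        "INSERT INTO documents (file_name, content, plain_text) VALUES (?, ?, ?)",
        (file_name, content, plain_text),
    )
    conn.commit()
    return cur.lastrowid


def load_plain_text(conn, doc_id):
    """Return the stored plain text, or None if the id does not exist."""
    row = conn.execute(
        "SELECT plain_text FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None
```

Keeping the text in its own column means later features such as highlighting never need to re-parse the binary file.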

Tom Groszko
This would only help for the .doc format. Is there any more universal method?
Glennular
+1  A: 

If you are using SQL Server 2008, you could consider using the new FILESTREAM feature.

Your data is stored in a varbinary(max) column, but you can also access the raw data via a regular Win32 handle.

Here's some sample code showing how to get the handle.

David Gardiner
FILESTREAM handles the raw file, which we are already streaming to and from SQL in its original format (binary or text). I would like to get at the indexed text of the binary file, the text that the indexer is indexing.
Glennular
+1  A: 

I had this same issue... I solved it by adding the following to my application:

I use these to grab the plain text and then store it in the database alongside the binary data. Keep in mind that I am certainly not an expert, so there may be a better way to do this, but this works for everything but "Quick Save" pre-2007 Word Documents, which apparently aren't read by iFilters. I just have my users resave the document if that error occurs, and everything works fine.

Let me know if you'd like some sample code... I would post it right now, but it's a bit long.

emmilely
Since SQL is already pulling out the text through its own filters, why do other tools need to be used? Do you find that these two filters combined solve the majority of file formats that would be indexed?
Glennular
I believe that SQL Server uses iFilters to read text, so the EPocalipse dll uses the same filters that SQL Server does. I agree, it would be much easier to just have SQL Server return the plain text, but I could not find a way to do so. The iFilters should be able to read the text of anything that Microsoft can index, and I even recall seeing something about using them to read text in images, but I only needed to deal with .doc, .docx, and .pdf files, so I cannot verify this.
emmilely
+1  A: 

The SQL Server Full-Text Search feature uses IFilters to extract plain text from PDF and Office file formats. You can install IFilters on your server, or, if your code is running on the same machine as SQL Server, you already have them.

Here is an article which shows how to use IFilters from .NET: http://www.codeproject.com/KB/cs/IFilter.aspx

Yaroslav