Currently we are saving files (PDF, DOC) into the database as BLOB fields. I would like to retrieve the raw text of each file so that I can manipulate it for hit-highlighting and other functions.
Does anyone know of a simple way to parse the files and store the raw text on save, either via SQL or .NET code? I have found that Adobe has a filtdump utility that will convert a PDF to text. However, filtdump seems to be a command-line tool, and I don't see a way to pass it a file stream. And what would the extractor be for Office documents and other file types?
-or-
Is there a way to pull the raw text out of the SQL Server full-text index without using third-party filters?
Note: I am trying to build a .NET and MS SQL solution without having to use a third-party tool such as Lucene.
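
For reference, the only workaround I can see for a command-line extractor that cannot take a stream is to write the BLOB to a temporary file, run the tool against that file, and capture its standard output. Below is a rough sketch of that idea, not a tested solution. The class name `BlobTextExtractor`, the `toolPath` parameter, and the `argumentFormat` parameter are placeholders of my own; I have not confirmed filtdump's actual switches, and the sketch assumes the tool writes the extracted text to standard output.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class BlobTextExtractor
{
    // toolPath and argumentFormat are hypothetical placeholders for whatever
    // command-line extractor is used; "{0}" in argumentFormat is replaced
    // with the path of the temporary file.
    public static string ExtractText(byte[] blob, string extension,
                                     string toolPath, string argumentFormat)
    {
        // Build the temp file name ourselves so it carries the original
        // extension (e.g. ".pdf"), which some extractors rely on.
        string tempFile = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
        File.WriteAllBytes(tempFile, blob);

        try
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = toolPath,
                Arguments = string.Format(argumentFormat, tempFile),
                RedirectStandardOutput = true,
                UseShellExecute = false, // required to redirect output
                CreateNoWindow = true
            };

            using (Process process = Process.Start(startInfo))
            {
                // Read the output before waiting for exit so the child
                // process cannot block on a full output buffer.
                string text = process.StandardOutput.ReadToEnd();
                process.WaitForExit();

                if (process.ExitCode != 0)
                {
                    throw new InvalidOperationException(
                        "Extractor exited with code " + process.ExitCode);
                }
                return text;
            }
        }
        finally
        {
            // Always remove the temporary copy of the document.
            File.Delete(tempFile);
        }
    }
}
```

The returned string could then be saved to a text column next to the BLOB at upload time. This still spawns one process per document, which is exactly the overhead I would like to avoid if there is a stream-based alternative.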