I need to implement a service to search PDFs. Initially I started using SQL Server 2008 FTS, but soon realized that my PDFs would have to be stored in the DB itself. I was then pointed to Indexing Services as well as to the SQL 2008 FILESTREAM data type, so that I can store PDFs in the file system. So how do these three (Indexing Services, FTS, and the FILESTREAM option) relate to each other? Do I need to use all three together to implement my search?

Also, do hosting services like DiscountASP typically have these enabled? Or should I consider switching to Lucene.NET?

A: 

If you know in advance what you want to find (e.g. you get hundreds of PDFs a day and will need to find the ones containing certain "known-before-reception" strings), then you could make a text version on reception, create index entries for the PDF file, and then throw away the text.
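Roughly, the "index on reception" idea could look like the sketch below. It assumes the PDF has already been converted to plain text by some other tool (not shown), and the term list, file names, and the `index_known_terms` function are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical list of strings known before any PDF arrives.
KNOWN_TERMS = {"invoice", "purchase order", "overdue"}


def index_known_terms(filename, text, known_terms, index):
    """Record which known terms occur in `text`, keyed by term.

    `text` is assumed to be the plain-text version of the PDF, produced
    elsewhere. It is not stored, so it can be discarded afterwards.
    """
    lowered = text.lower()
    for term in known_terms:
        if term.lower() in lowered:
            index[term].add(filename)
    return index


# term -> set of PDF file names that contain it
index = defaultdict(set)
index_known_terms("a.pdf", "Invoice 42 is overdue", KNOWN_TERMS, index)
index_known_terms("b.pdf", "Purchase order received", KNOWN_TERMS, index)
```

Looking up `index["invoice"]` then gives the files to open, without keeping any extracted text around. The match here is a plain case-insensitive substring check, which is only one possible choice.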

If you do not know the search terms in advance, life becomes much slower :( There is a program called PDF Search that claims to do full-text search in PDF files. I haven't needed to use it, so I can't say how it is, but it's here: http://www.getpdf.com/.

Hope this helps

dcpking
A: 

We used to use a PDF iFilter, which allows you to store the PDF in the DB and then perform FTS against it. However, we now convert our PDFs to text and store the text in the full-text index. This allows us to keep all our docs (.doc, .pdf, etc.) in the same index.
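To illustrate why converting everything to text first makes a single index possible, here is a minimal in-memory sketch. It is a stand-in, not SQL Server's full-text index: the `TextIndex` class and the file names are hypothetical, and the text is assumed to have been extracted from each document beforehand.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class TextIndex:
    """Toy inverted index: word -> names of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_name, text):
        # The original file type no longer matters once we have text.
        for word in tokenize(text):
            self._postings[word].add(doc_name)

    def search(self, query):
        """Return the documents that contain every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        hits = [self._postings.get(word, set()) for word in words]
        return set.intersection(*hits)


idx = TextIndex()
idx.add("report.pdf", "Quarterly sales report")
idx.add("memo.doc", "Sales memo for the quarter")
```

Here `idx.search("sales")` matches both the .pdf and the .doc, while `idx.search("sales report")` matches only `report.pdf`, since every query word has to be present.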

DiscountASP does allow FTS/iFTS on the hosted database.

Coolcoder