I'm trying to create an application that will basically be a catalogue of my PDF collection. We are talking about 15-20 GB containing tens of thousands of PDFs. I am also planning to include a full-text search mechanism. I will be using Lucene.NET for search (actually, NHibernate.Search), and a library for PDF-to-text conversion. Which would be the best choice? I was considering these:

  • PDFBox
  • pdftotext (from xpdf) via a C# wrapper
  • iTextSharp

Edit: Another good option seems to be IFilters. How well (in speed and quality) would the Foxit and Adobe ones perform in comparison to these libraries?

Commercial libraries are probably out of the question, as it is my private project and I don't really have a budget for commercial solutions - although PDFTextStream looks really nice.

From what I've read, pdftotext is a lot faster than PDFBox. How well does iTextSharp perform in comparison to pdftotext? Or maybe someone can recommend other good solutions?

+3  A: 

If it is for a private project, is this going to be an ongoing conversion process? E.g. after you've converted the 15-20 GB, are you still going to be converting?

The reason I ask is that I'm trying to work out whether speed is your primary issue. If it were me, for example, converting a library of books, my primary concern would be the quality of the conversion, not the speed. I could always leave the conversion running overnight or over a weekend if necessary!

Ray Hayes
OK, you're right. The quality comes first. But I still need the performance because I will probably be adding batches (in the hundreds) of documents later. Ease of use would also be nice: writing a wrapper for a console program is definitely worse than just having a C# library (like iTextSharp, for example).
n0e
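As a rough illustration of what wrapping the console tool involves, here is a minimal sketch in Python rather than C#. It assumes a `pdftotext` binary (from xpdf or poppler) is on the PATH; the function names `build_pdftotext_command` and `extract_text` and the 60-second timeout are made up for this example, not part of any library.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_pdftotext_command(pdf_path: Path, layout: bool = False) -> List[str]:
    """Build the argument list for one pdftotext run that writes to stdout."""
    # -q suppresses messages, -enc sets the output text encoding.
    cmd = ["pdftotext", "-q", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    # A text-file argument of "-" sends the extracted text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path: Path, timeout_s: float = 60.0) -> Optional[str]:
    """Return the text of one PDF, or None if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # assumption: the binary must be installed separately
    try:
        result = subprocess.run(
            build_pdftotext_command(pdf_path),
            capture_output=True,
            timeout=timeout_s,
            check=True,  # a non-zero exit code raises CalledProcessError
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Each call starts a new process, so for batches of hundreds of documents the per-file startup cost adds up. That is one reason an in-process library can be more convenient, even if the wrapper itself is short.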
A: 

I guess any library is fine, but do you want to search through all 20 GB of files at search time?

For full-text search, the best approach is to create a database, something like SQLite or any other local database on the client machine, then read each PDF, convert it to plain text, and store the text in the database when the file is first added.

Your database can be as simple as the following:

Table: PDFFiles
PDFFileID
PDFFilePath
PDFTitle
PDFAuthor
PDFKeywords
PDFFullText....

You can then search this table when you need to. This way your search will be extremely fast regardless of the type of PDF, and the conversion from PDF to database is needed only when a PDF is added to your collection or modified.
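A minimal sketch of that idea, using Python's built-in `sqlite3` module and the column names from the table above. The helper names (`open_catalogue`, `add_or_update`, `search`) are hypothetical, and the `LIKE` query is only a stand-in: it scans every row, so a real full-text index (Lucene, or SQLite's FTS extension where available) would be needed to get fast search on a collection this size.

```python
import sqlite3
from typing import List

# Columns follow the PDFFiles table described above; the types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS PDFFiles (
    PDFFileID   INTEGER PRIMARY KEY,
    PDFFilePath TEXT NOT NULL UNIQUE,
    PDFTitle    TEXT,
    PDFAuthor   TEXT,
    PDFKeywords TEXT,
    PDFFullText TEXT
)
"""


def open_catalogue(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalogue database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def add_or_update(conn: sqlite3.Connection, path: str, title: str,
                  author: str, keywords: str, full_text: str) -> None:
    """Store the converted text once, when a PDF is added or modified."""
    cur = conn.execute(
        "UPDATE PDFFiles SET PDFTitle = ?, PDFAuthor = ?, PDFKeywords = ?, "
        "PDFFullText = ? WHERE PDFFilePath = ?",
        (title, author, keywords, full_text, path),
    )
    if cur.rowcount == 0:  # no existing row for this path, so insert one
        conn.execute(
            "INSERT INTO PDFFiles (PDFFilePath, PDFTitle, PDFAuthor, "
            "PDFKeywords, PDFFullText) VALUES (?, ?, ?, ?, ?)",
            (path, title, author, keywords, full_text),
        )
    conn.commit()


def search(conn: sqlite3.Connection, term: str) -> List[str]:
    """Return paths whose stored text contains term (substring match).

    Note: '%' and '_' inside term act as LIKE wildcards in this sketch.
    """
    rows = conn.execute(
        "SELECT PDFFilePath FROM PDFFiles WHERE PDFFullText LIKE ? "
        "ORDER BY PDFFilePath",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

A call such as `search(conn, "lucene")` only touches the stored text, so the PDFs themselves are never reopened at query time.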

Akash Kava
Yes, I plan to search the whole collection full-text at once. But what you describe is already taken care of: I will be using NHibernate.Search, which will create the tables in the DB for me and do the full-text searching. What I need right now are ways of converting documents in different formats, mostly PDFs but also DjVu files, to plain text, so I can feed that to NHibernate.Search.
n0e
A: 

The desktop version of Foxit's PDF IFilter is free:

http://www.foxitsoftware.com/pdf/ifilter/

It will automatically do the indexing and searching, but perhaps their index is available for you to use as well. If you are planning to use it in an application you sell or distribute, then I guess it won't be a good choice, but if it's just for yourself, then it might work.

The Foxit code is at the core of my company's PDF Reader/Text Extraction library, which wouldn't be appropriate for your project, but I can vouch for the speed and quality of the results of the underlying Foxit engine.

Lou Franco
$3000 for the SDK doesn't sound free or cheap to me...
Ray Hayes
I'm pretty sure that the Foxit IFilter is free for desktop usage. My point is that it's used in commercial software that I have built and that it's fast and good.
Lou Franco