ansaurus

Question

Fastest PDF->text library for .NET project

Answer 1

+3 A:

If it is for a private project, is this going to be an ongoing conversion process? E.g., after you've converted the 15-20 GB, are you still going to be converting?

The reason I ask is that I'm trying to work out whether speed is your primary issue. If it were me, for example, converting a library of books, my primary concern would be the quality of the conversion, not the speed. I could always leave the conversion running overnight or over the weekend if necessary!

Ray Hayes 2010-07-22 10:40:48

OK, you're right. Quality comes first. But I still need the performance, because I will probably be adding batches (hundreds) of documents later. Ease of use would also be nice: writing a wrapper for a console program is definitely worse than just having a C# library (like iTextSharp, for example).

n0e 2010-07-22 10:48:01

Answer 2

A:

I guess using any library is fine, but do you want to search through all 20 GB of files at search time?

For full-text search, the best approach is to create a database, something like SQLite or any local database on the client machine. Read each PDF, convert it to plain text, and store the text in the database when the file is first added.

Your database can simply be as follows:

Table: PDFFiles
PDFFileID
PDFFilePath
PDFTitle
PDFAuthor
PDFKeywords
PDFFullText....

You can then search this table when you need to. This way your search will be extremely fast regardless of the type of PDF, and the conversion from PDF to database is needed only when a PDF is added to your collection or modified.
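
A minimal sketch of this idea, assuming Python's built-in sqlite3 module and that the text has already been extracted from each PDF by whatever library you choose. The function names (open_index, add_pdf, search) are hypothetical and only illustrate the table layout above; the column types and the UNIQUE constraint on the path are also assumptions.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the index database with the PDFFiles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS PDFFiles (
               PDFFileID   INTEGER PRIMARY KEY,
               PDFFilePath TEXT UNIQUE NOT NULL,
               PDFTitle    TEXT,
               PDFAuthor   TEXT,
               PDFKeywords TEXT,
               PDFFullText TEXT
           )"""
    )
    return conn


def add_pdf(conn, path, full_text, title=None, author=None, keywords=None):
    """Store already-extracted text; re-adding a path replaces its old row."""
    conn.execute(
        "INSERT OR REPLACE INTO PDFFiles "
        "(PDFFilePath, PDFTitle, PDFAuthor, PDFKeywords, PDFFullText) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, title, author, keywords, full_text),
    )
    conn.commit()


def search(conn, term):
    """Return paths of files whose stored text contains the term."""
    rows = conn.execute(
        "SELECT PDFFilePath FROM PDFFiles "
        "WHERE PDFFullText LIKE ? ORDER BY PDFFilePath",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

The substring match with LIKE scans every row, which is the simplest thing that works; for a large collection a proper full-text index over the PDFFullText column would be the natural next step.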

Akash Kava 2010-07-22 10:55:35

Yes, I plan to search the whole collection full-text at once. But what you describe is already taken care of: I will be using NHibernate.Search, which will create the tables in the DB for me and do the full-text searching. What I need right now are ways of converting documents in different formats, mostly PDFs but also DjVu files, to plain text, so I can feed that to NHibernate.Search.

n0e 2010-07-22 11:08:00

Answer 3

A:

The desktop version of Foxit's PDF IFilter is free:

http://www.foxitsoftware.com/pdf/ifilter/

It will automatically do the indexing and searching, but perhaps their index is available for you to use as well. If you are planning to use it in an application you sell or distribute, then I guess it won't be a good choice, but if it's just for yourself, then it might work.

The Foxit code is at the core of my company's PDF Reader/Text Extraction library, which wouldn't be appropriate for your project, but I can vouch for the speed and quality of the results of the underlying Foxit engine.

Lou Franco 2010-07-22 12:59:30

$3000 for the SDK doesn't sound free or cheap to me...

Ray Hayes 2010-07-22 15:15:54

I'm pretty sure that the Foxit IFilter is free for desktop usage. My point is that it's used in commercial software that I have built and that it's fast and good.

Lou Franco 2010-07-22 16:15:06
