I'm trying to create an application that will basically be a catalogue of my PDF collection. We are talking about 15-20 GB, containing tens of thousands of PDFs. I am also planning to include a full-text search mechanism. I will be using Lucene.NET for search (actually, NHibernate.Search), and I need a library for PDF-to-text conversion. Which would be the best choice? I was considering these:
- PDFBox
- pdftotext (from Xpdf) via a C# wrapper
- iTextSharp
Edit: Another good option seems to be using IFilters. How well would they (Foxit/Adobe) perform, in terms of speed and extraction quality, in comparison to these libraries?
Commercial libraries are probably out of the question, as it is my private project and I don't really have a budget for commercial solutions - although PDFTextStream looks really nice.
From what I've read, pdftotext is a lot faster than PDFBox. How well does iTextSharp perform in comparison to pdftotext? Or maybe someone can recommend other good solutions?
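To make the pdftotext option concrete, here is a rough sketch of what I mean by a "wrapper": spawn the command-line tool, capture its output, and time the call so different extractors can be compared on the same files. It is written in Python only for brevity; the real thing would be C#. The helper names (`build_pdftotext_command`, `extract_text`) are made up for illustration, and it assumes the `pdftotext` executable is on the PATH.

```python
import subprocess
import time
from typing import List, Tuple


def build_pdftotext_command(pdf_path: str, exe: str = "pdftotext") -> List[str]:
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    # "-enc UTF-8" selects the output encoding; the trailing "-" tells
    # pdftotext to write the text to stdout instead of a .txt file.
    return [exe, "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str, exe: str = "pdftotext") -> Tuple[str, float]:
    """Return (extracted text, elapsed seconds) for a single PDF."""
    start = time.perf_counter()
    result = subprocess.run(
        build_pdftotext_command(pdf_path, exe),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    elapsed = time.perf_counter() - start
    # Replace undecodable bytes rather than failing on a single bad PDF.
    return result.stdout.decode("utf-8", errors="replace"), elapsed
```

Running `extract_text` over a sample of a few hundred files and summing the elapsed times is the kind of speed comparison I have in mind; the returned text would then be fed to the Lucene.NET indexer.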