What I want to do is pretty simple: given a PDF, PS, or DjVu file containing a paper or book, find the authors and title (any other metadata would be welcome, but matters less). The recognition doesn't have to be perfect, but I'd like to make it as good as I can. I am looking for open-source .NET and/or Java libraries (preferably .NET) that provide access to the metadata and contents of these files.

For PDF, I've found PDFBox (.NET/Java) and PDF Library (.NET), but there may be better alternatives I am not aware of. For PostScript and DjVu, I haven't found anything.

+1  A: 

For most PDF manipulation we use iTextSharp, a .NET port of the original Java library, iText.

Douglas Anderson
+1  A: 

Another PDF library is PDFsharp. It has pretty decent read/parse capabilities.

ijprest
+1  A: 

For DjVu, you can use the commercial SDK from CamiNova or the open-source library DjVuLibre.

msr