I am working on an online portal where researchers can upload their research papers. One requirement is that all PDFs are stored in PDF/A format. Since I can't rely on users to produce PDF/A-conforming documents, I need a tool to check standard PDFs and convert them to PDF/A.

What is the best tool you know of? My criteria are:

  • Price
  • Quality
  • Speed
  • Available APIs

Open-source tools would be preferred, but a search turned up none. iText can create PDF/A, but converting an existing document isn't easy: you have to read every page and copy it into a new document, losing all bookmarks and annotations in the process. (At least as far as I know; if you know of an easier solution, let me know.)

An API should be available for PHP or Java, or a command-line tool should be provided. Please do not list GUI-only or online-only solutions.

A: 

I am not sure about PDF/A documents, but have you looked at JODConverter? It can convert between many different formats, and it is open source. We use it quite extensively in our project.

http://www.artofsolving.com/opensource/jodconverter

Shervin
+1  A: 

The OpenOffice API project might be what you're looking for. As of version 2.4, OpenOffice supports PDF/A documents. The website has a code example, in Java, showing how to convert documents.

Mark Robinson
+4  A: 

I am not sure all your goals can be satisfied at the same time. The story around PDF/A is a lot more complex than format conversions such as TIFF to PNG.

  • The base format is PDF 1.4: what should happen to documents in later PDF versions that use features introduced after 1.4? Information might be lost.
  • In both PDF/A-1a and PDF/A-1b, metadata in XMP/RDF format is mandatory. If the original document has no metadata, you'll have to get it from somewhere and add it. At least iText can do that.
  • There are lots of little details to get right, from embedding fonts to making sure spaces are present instead of only horizontal movement commands.
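
The metadata requirement in the list above also gives a cheap pre-check. As a minimal sketch (not a validator), the Python below scans raw PDF bytes for the `pdfaid:part` and `pdfaid:conformance` properties of the PDF/A identification schema and reads the `%PDF-x.y` header. The function names are made up for illustration, and the sketch assumes the XMP packet is stored uncompressed, so it can only tell whether a file claims PDF/A conformance, not whether the claim is true.

```python
import re

# XMP properties from the PDF/A identification schema. They may be written
# either as XML attributes (pdfaid:part="1") or as child elements
# (<pdfaid:part>1</pdfaid:part>), so both forms are accepted.
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')
_HEADER = re.compile(rb'%PDF-(\d)\.(\d)')


def claimed_pdfa_level(data: bytes):
    """Return e.g. '1b' if the XMP metadata claims PDF/A, else None.

    Hypothetical helper: it only reports the claim, it validates nothing.
    """
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return part.group(1).decode("ascii") + level


def header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For an uploaded file, `claimed_pdfa_level(open(path, "rb").read())` returning `None` means the file makes no PDF/A claim at all. A file that claims part 1 but whose `header_version` is above `(1, 4)` deserves a closer look. Any other result still has to go through a real validator.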

To sum it all up: I think you are better off placing some or all of the responsibility for compliance on the producers of the PDFs. Of course, that doesn't mean you can't help them: if you figure out which tools the majority use to create their papers, you can point them to documentation about producing PDF/A with those specific tools. (As a somewhat extreme example of such documentation, have a look at this.)

Good luck with your efforts.

Bart Schuller
+3  A: 

For the identification part, you could try the DROID tool (Digital Record Object Identification), which provides access to the PRONOM technical registry (which includes PDF/A).

Fabian Steeg
+1  A: 

I used to work for the French National Library, building an archive system that did this kind of thing. Like most of the top ten libraries in the world, we used JHOVE to recognize file formats.

JHOVE can tell whether files are PDF/A or not, and it can even validate them. It also recognizes 7 other kinds of PDF (see the details).

JHOVE is open source and is maintained by JSTOR and the Harvard University Library. It is rather simple to use.
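
To give an idea of how this could be wired into an upload pipeline, here is a rough Python sketch that runs the JHOVE command-line tool with its `PDF-hul` module and inspects the report. It assumes a `jhove` launcher is on the PATH and that the default text report contains `Status:` and `Profile:` lines. The helper names are hypothetical, and the exact report layout should be checked against your JHOVE version.

```python
import subprocess


def jhove_command(pdf_path: str, jhove_bin: str = "jhove"):
    """Build the argument list for running JHOVE's PDF module on one file."""
    return [jhove_bin, "-m", "PDF-hul", pdf_path]


def parse_report(report: str):
    """Pull the Status and Profile values out of a text report.

    Assumes 'Key: value' lines; either value is None when its line is missing.
    """
    status, profile = None, None
    for line in report.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "Status":
            status = value.strip()
        elif key == "Profile":
            profile = value.strip()
    return status, profile


def looks_like_valid_pdfa(report: str) -> bool:
    """True if the report says the file is valid and lists a PDF/A profile."""
    status, profile = parse_report(report)
    return (
        status == "Well-Formed and valid"
        and profile is not None
        and "PDF/A" in profile
    )


def check_file(pdf_path: str) -> bool:
    """Run JHOVE on one file and interpret its text output."""
    result = subprocess.run(
        jhove_command(pdf_path), capture_output=True, text=True, check=False
    )
    return looks_like_valid_pdfa(result.stdout)
```

Keeping the parsing separate from the subprocess call means the decision logic can be tested without JHOVE installed, and `check_file` stays a thin wrapper that is easy to swap for another validator.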

Nicolas Raoul