Hi.

I am trying to convert PDF to PDF/A. Currently I can do this using the OpenOffice PDF viewer plugin together with JODConverter 2, but this is pretty cumbersome.

Does anybody know of any open source / free Java libraries I can use to do this?

I have found these open source libraries so far, but none of them supports converting PDF to PDF/A:

iText
gnujpdf
PDF Box
FOP
JFreeReport
PJX
JPedal
PDFjet
jPod
PDF Renderer

UPDATE

It seems Apache FOP can convert a document (though not a PDF document) to PDF/A.

+1  A: 

You mention Apache FOP in your list of APIs, and this page (http://xmlgraphics.apache.org/fop/trunk/pdfa.html) says that there is some support for PDF/A:

PDF/A-1b is implemented to the degree that FOP supports the creation of the elements described in ISO 19005-1.

PDF/A-1a is based on PDF-A-1b and adds accessibility features (such as Tagged PDF). This format is available within the limitation described on the Accessibility page.

It doesn't specifically mention anything about PDF to PDF/A, but it might be an open source alternative.

Liggy
The question still remains: can I take a PDF from a file/byte[] and convert it to PDF/A using FOP?
Shervin
+3  A: 

Seam PDF is just a convenience for projects that are using Seam. There is nothing that stops you from using Apache FOP with Seam in order to generate PDF files.

I have personally used Apache FOP to generate PDF/A files in a Web application and it works fine. As the link already given by Liggy says, it is as simple as:

userAgent.getRendererOptions().put("pdf-a-mode", "PDF/A-1b");

So my suggestion is to use Apache FOP directly instead of dealing with conversion (which also has performance issues).

Update:

The Apache FOP website contains a list of examples on how to use it via Java code. http://xmlgraphics.apache.org/fop/0.95/embedding.html

Here is a minimal command line application that converts XML to PDF

Another approach which deals specifically with XHTML (and not just XML) is to use the xhtml2fo stylesheet from Antenna.

This is an example: http://blog.platinumsolutions.com/node/216

Just add the following two lines before the creation of the "FOP" object and you are good to go.

FOUserAgent foUserAgent = fopFactory.newFOUserAgent(); 
foUserAgent.getRendererOptions().put("pdf-a-mode","PDF/A-1b");
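
To give an idea of the XHTML part: FOP itself only consumes XSL-FO, so the XHTML first goes through an XSLT transform. The sketch below shows only that step, using the XSLT support built into the JDK. The inline stylesheet is a tiny made-up stand-in that copies paragraph text into fo:block elements; in practice you would load the xhtml2fo stylesheet from Antenna instead. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XhtmlToFoSketch {

    // Hypothetical stand-in for xhtml2fo.xsl: one A4 page master, and every
    // XHTML <p> element becomes an <fo:block> holding its text.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
        + " xmlns:h='http://www.w3.org/1999/xhtml'"
        + " exclude-result-prefixes='h'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='page'"
        + " page-height='297mm' page-width='210mm' margin='20mm'>"
        + "<fo:region-body/>"
        + "</fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='page'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<xsl:for-each select='//h:p'>"
        + "<fo:block><xsl:value-of select='.'/></fo:block>"
        + "</xsl:for-each>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XHTML string and returns the XSL-FO markup.
    static String toFo(String xhtml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                     + "<body><p>Hello PDF/A</p></body></html>";
        System.out.println(toFo(xhtml));
    }
}
```

In a real setup the transform result would go to FOP rather than to a string, e.g. by passing a SAXResult built from fop.getDefaultHandler(), as in the FOP embedding examples linked above.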
kazanaki
Also, as far as I know, Seam PDF is just a wrapper over iText. It is not a different library.
kazanaki
Without diving into Apache FOP, can we use FOP to read a rendered XHTML page and generate PDF/A from that page without too much hassle? Maybe you can show a few lines of example code? That would help greatly.
Shervin
It seems this requires creating an XSL stylesheet, which is a little work, but not impossible. I will consider this approach if we decide to scrap OpenOffice and JODConverter. I will consider accepting this as an appropriate answer when the bounty period is about to end, if no one comes up with a better answer.
Shervin
The stylesheet is already offered by Antenna. No need to create it again. It can be found in various places. For example http://github.com/jeffrafter/xhtml2fo
kazanaki
@Shervin, I am confused... I thought you wanted to convert PDF to PDF/A, not XML to PDF/A?
vladr
Yes, I do. However, I create the PDF file from an XHTML page, so if I could do it like that, it's fine with me. But as an answer to the question of converting PDF to PDF/A, it isn't really correct.
Shervin
+3  A: 

Converting from PDF to PDF/A

This is the answer to your question as originally phrased.

For a solution that does not involve potentially lossy re-rendering, it appears that Foris Zoltan was able to get something going using iText (not exhaustive, but possibly sufficient for most PDFs) without the overkill of re-rendering; see http://www.opensubscriber.com/message/[email protected]/8027900.html

If Zoltan's solution is not acceptable/sufficient for your requirements, then you are stuck with re-rendering. You could stick with OpenOffice/JODConverter, or go for less overhead by using Ghostscript (the mother of them all), piping pdf2ps back into a PDF/A-enabled ps2pdf.
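
A rough sketch of driving Ghostscript from Java, for what it's worth. It assumes a gs binary on the PATH and a PDFA_def.ps prologue (shipped in Ghostscript's lib directory) that has already been edited to point at an ICC profile. ps2pdf is a thin wrapper around the pdfwrite device, and gs reads PDF input directly, so the sketch calls gs in a single step rather than piping pdf2ps into ps2pdf. The class and method names are made up for illustration, and the exact flags vary between Ghostscript versions, so check them against the documentation for the version you have installed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class GhostscriptPdfaSketch {

    // Builds the argument list for a PDF -> PDF/A-1 re-rendering run.
    // All paths are supplied by the caller; nothing is hard-coded.
    static List<String> buildCommand(String gsExecutable, String pdfaDefPs,
                                     String inputPdf, String outputPdf) {
        List<String> cmd = new ArrayList<>();
        cmd.add(gsExecutable);
        cmd.add("-dPDFA=1");                      // request PDF/A-1 output
        cmd.add("-dBATCH");                       // exit after the last file
        cmd.add("-dNOPAUSE");                     // do not prompt between pages
        cmd.add("-sColorConversionStrategy=RGB"); // convert colors to RGB
        cmd.add("-dPDFACompatibilityPolicy=1");   // drop features PDF/A forbids
        cmd.add("-sDEVICE=pdfwrite");             // the device ps2pdf wraps
        cmd.add("-sOutputFile=" + outputPdf);
        cmd.add(pdfaDefPs);                       // prologue must precede input
        cmd.add(inputPdf);
        return cmd;
    }

    // Runs the command and returns Ghostscript's exit code (0 on success).
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.out.println("usage: GhostscriptPdfaSketch <PDFA_def.ps> <in.pdf> <out.pdf>");
            return;
        }
        int exit = run(buildCommand("gs", args[0], args[1], args[2]));
        System.out.println("gs exited with code " + exit);
    }
}
```

This is still a re-rendering approach, so it is worth checking the output with a PDF/A validator before relying on it.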

Apache FOP

Other respondents have suggested Apache FOP, which in the context of PDF to PDF/A conversion has the following advantages and disadvantages:

  • advantage: fewer "moving parts" than an OpenOffice/JODConverter combination (e.g. comparing in-process FOP with daemonized OO)
  • disadvantage: you are responsible for converting from PDF to XSL-FO or otherwise rendering to FOP (more coding and/or integration work required of you), whereas OpenOffice/JODConverter and Ghostscript can require less additional coding.

However, if I am not mistaken, it appears that you are using PDF as an intermediate format, i.e. that what you are trying to achieve is XHTML to PDF to PDF/A conversion. By converting directly from XHTML to PDF/A, the process will be faster, will use fewer resources (e.g. memory), and will not needlessly degrade output quality (as re-rendering solutions can) or require intimate knowledge of the PDF format (as Zoltan's solution does).

In this case, converting directly from XHTML to PDF/A would be the ideal solution, either using iText directly (the example uses iTextSharp, a .NET port of iText, but it's the same for Java) or using Apache FOP as others have suggested. FOP relies on its own PDF library rather than on iText, and it is more bloated, inefficient and complicated to set up than using iText directly, but it might produce better results than the iText example. The only way to settle that is to try both on a few of your XHTML files as samples. :)

vladr
Unfortunately, iText only supports PDF/A when creating a PDF, not when converting an existing one.
Shervin
Shervin, if Seam PDF is just a wrapper for iText as kazanaki said and iText supports PDF/A creation (as you said), why don't you create PDF/A documents in the first place?
Tomislav Nakic-Alfirevic
@Shervin, did you read Zoltan's post? He uses iText to read in the original PDF, then write it back to the converted PDF/A but **without doing any rendering** (only dictionary, font etc. manipulation.) So no, iText does not offer a `convert(from, to)` function, obviously, but that never meant that conversion is not possible because of that. :)
vladr
@Vlad Yes, I read it, but I could not see any code examples, nor whether this is part of the API or which version of iText it is.
Shervin
@Shervin, try contacting Zoltan. The functionality he's using (dictionary manipulation) has been available in iText since time immemorial, so if he's willing to make publicly available the code he used you're in business. **BTW, re. the FOP discussion, you should rephrase your original question. Any PDF-to-PDF/A conversion solution is overkill compared to converting directly from XHTML to PDF/A, yet XHTML to PDF/A is irrelevant to someone browsing SO in the future looking for a PDF-to-PDF/A solution (where there is no non-PDF original format available.) :)**
vladr
This is a great answer. Thanks a lot!
Shervin