Does anybody know of an open source Java library that will do robust diffing of the text parts of PDF files?

Ideally I would like something that would produce a diff in the form of a patch.

+1  A: 

Extract the PDF text with PDFBox (http://incubator.apache.org/pdfbox/) and create a diff with google-diff-match-patch (http://code.google.com/p/google-diff-match-patch).
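To give a rough idea of the diff step, here is a minimal sketch in plain Java. It assumes the text of each PDF has already been extracted (for example with PDFBox's PDFTextStripper) and split into lines. It does not use google-diff-match-patch; it swaps in a simple line-based longest-common-subsequence diff so that it runs without extra dependencies. The class and method names (TextDiffSketch, diffLines) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a line-based diff over text that was already extracted from two PDFs.
class TextDiffSketch {

    // Returns one entry per line: " " = unchanged, "-" = only in oldLines, "+" = only in newLines.
    static List<String> diffLines(List<String> oldLines, List<String> newLines) {
        int n = oldLines.size();
        int m = newLines.size();

        // lcs[i][j] = length of the longest common subsequence of
        // oldLines[i..] and newLines[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                if (oldLines.get(i).equals(newLines.get(j))) {
                    lcs[i][j] = lcs[i + 1][j + 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i + 1][j], lcs[i][j + 1]);
                }
            }
        }

        // Walk the table from the start, emitting patch-style lines.
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (oldLines.get(i).equals(newLines.get(j))) {
                out.add(" " + oldLines.get(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("-" + oldLines.get(i));
                i++;
            } else {
                out.add("+" + newLines.get(j));
                j++;
            }
        }
        while (i < n) {
            out.add("-" + oldLines.get(i++));
        }
        while (j < m) {
            out.add("+" + newLines.get(j++));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-ins for text extracted from two versions of a PDF.
        List<String> oldText = Arrays.asList("Title", "First paragraph.", "Second paragraph.");
        List<String> newText = Arrays.asList("Title", "First paragraph, revised.", "Second paragraph.");
        diffLines(oldText, newText).forEach(System.out::println);
    }
}
```

The output resembles the body of a patch but is not a full unified diff (no hunk headers). For real use, google-diff-match-patch would probably be the better choice, since it also handles character-level changes and can produce and apply patches.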

bwalliser
A: 

If the PDFs differ only in text, you could also rasterize the pages and compare the resulting images. We use that approach for regression testing the output of our PDF code.
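As a rough illustration of the comparison step, here is a minimal Java sketch that counts the pixels that differ between two page images. It assumes each page has already been rendered to a BufferedImage of the same size by whatever PDF renderer you use; the rendering itself is not shown. The names PageImageDiff and countDifferingPixels are made up for this example and are not taken from any library.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: compares two already-rasterized PDF pages pixel by pixel.
class PageImageDiff {

    // Returns the number of pixels whose ARGB value differs between the two images.
    static int countDifferingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Page images must have the same dimensions");
        }
        int differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Stand-ins for two rendered pages; a real test would render them from the PDFs.
        BufferedImage expected = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage actual = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        actual.setRGB(1, 1, 0xFF0000); // simulate one changed pixel

        System.out.println("Differing pixels: " + countDifferingPixels(expected, actual));
    }
}
```

In a regression test you would typically fail when the count is above zero, or above a small threshold if the renderer's anti-aliasing is not fully deterministic.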