I have a set of images that I run through an OCR application. This produces an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, so that the PDF becomes searchable. Is there an easy and free way to do this?
Some details:
I don't want to use Acrobat's OCR functionality;
The OCR process results in an XML file which contains elements like:
<line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
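To make the goal concrete, here is a minimal Python sketch of the mapping involved, using only the standard library. It rests on several assumptions that the XML above does not confirm: that `l`, `t`, `r`, `b` and `baseline` are pixel coordinates with the origin at the top-left of the page image, that the scan resolution is 300 DPI, and that the page height in pixels is known. The function name `line_to_pdf_ops` and the font resource name `/F1` are hypothetical; the font would still have to be declared in the page's resources by whatever tool writes the PDF. The only PDF-specific fact relied on is that text rendering mode 3 (`3 Tr`) draws text without fill or stroke, which makes it invisible but still searchable.

```python
import xml.etree.ElementTree as ET


def line_to_pdf_ops(line_xml, page_height_px, dpi=300):
    """Turn one OCR <line> element into a PDF content-stream fragment.

    Assumes pixel coordinates with a top-left origin (not confirmed by
    the OCR format) and a hypothetical font resource named /F1.
    """
    el = ET.fromstring(line_xml)
    scale = 72.0 / dpi  # PDF user space uses 72 units per inch

    # PDF's origin is bottom-left, so the vertical axis must be flipped.
    x = int(el.get("l")) * scale
    y = (page_height_px - int(el.get("baseline"))) * scale

    # Rough approximation: use the bounding-box height as the font size.
    font_size = (int(el.get("b")) - int(el.get("t"))) * scale

    # Escape the characters that are special inside a PDF literal string.
    text = el.text or ""
    escaped = (
        text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )

    # "3 Tr" selects invisible text rendering.
    return (
        f"BT 3 Tr /F1 {font_size:.2f} Tf "
        f"{x:.2f} {y:.2f} Td ({escaped}) Tj ET"
    )


sample = (
    '<line baseline="1049" l="158" t="1012" r="1196" b="1060">'
    "This is a sample line of text from an image</line>"
)
# 3300 px is the height of an 11-inch page at the assumed 300 DPI.
print(line_to_pdf_ops(sample, page_height_px=3300))
```

This sketch does not stretch the text to match the width given by `l` and `r`, and it does not write a PDF by itself; the fragment would have to be appended to each page's content stream by some PDF library.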
Update: it may be possible to do what I want in a different way. Suppose there is already a PDF file, generated from a set of images, that already contains OCRed text. Would it be possible to access just the image of each page (perhaps programmatically), process it (e.g., convert it to monochrome), and save it back into the PDF file? If so, the OCRed text would not be lost.
[Should I put this update into a separate question?]