I have a set of images that I run through an OCR application. This produces an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, in order to get a searchable PDF. Is there an easy and free way to do this?

Some details:

  • I don't want to use Acrobat's OCR functionality;

  • The OCR process results in an XML file which contains elements like:

    <line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
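To make the data concrete, here is a rough sketch of how one such element could be read and mapped to a position on a PDF page. It rests on assumptions that are not confirmed by the OCR tool: that `l`, `t`, `r`, `b` and `baseline` are pixel coordinates measured from the top-left corner of the image, and that the scan resolution (300 dpi here) and page height in pixels (3300 here) are known. The function names `parse_line` and `to_pdf_points` are made up for illustration. PDF user space uses points (1/72 inch) with the origin at the bottom-left, so the vertical axis has to be flipped.

```python
import xml.etree.ElementTree as ET


def parse_line(xml_text):
    """Parse one <line> element into its text and integer coordinates."""
    elem = ET.fromstring(xml_text)
    keys = ("l", "t", "r", "b", "baseline")
    box = {key: int(elem.attrib[key]) for key in keys}
    return elem.text, box


def to_pdf_points(box, page_height_px, dpi):
    """Convert a top-left-origin pixel box to bottom-left-origin PDF points.

    Assumes the coordinates are pixels at the given dpi (an assumption,
    not something the XML itself states).
    """
    scale = 72.0 / dpi  # PDF points per pixel
    return {
        "x": box["l"] * scale,
        # Flip the vertical axis and place the text on its baseline.
        "y": (page_height_px - box["baseline"]) * scale,
        "width": (box["r"] - box["l"]) * scale,
        "height": (box["b"] - box["t"]) * scale,
    }


sample = (
    '<line baseline="1049" l="158" t="1012" r="1196" b="1060">'
    "This is a sample line of text from an image</line>"
)
text, box = parse_line(sample)
# page_height_px and dpi are assumed values for illustration only.
placement = to_pdf_points(box, page_height_px=3300, dpi=300)
```

The resulting `x`, `y`, `width` and `height` would be what a PDF library needs in order to draw the text invisibly over the page image; which library to use for that step is left open here.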

Update: it may be possible to do what I want in a different way. Suppose there is already a PDF file, generated from a set of images, that already contains OCRed text. Would it be possible to (maybe programmatically) access just the image of each page, process it (e.g., convert it to monochrome), and save it back to the PDF file? If so, the OCRed text would not be lost.

[Should I put this update into a separate question?]

A: 

If all you want to do is convert an existing PDF to grayscale, try ImageMagick:

convert foo.pdf -colorspace Gray -compress zip gray.pdf

I don't think this will change any other attributes in your PDF.

DaveParillo
This does not seem to retain the hidden text layer in the PDF. (Tried with ImageMagick 6.4.5.)
Jukka Matilainen
Odd, because ImageMagick uses Ghostscript to do its image conversion...
DaveParillo
I also tried it, and also lost the text layer. I used ImageMagick 6.4.5, too.
kepler
+1  A: 

For your follow-up question about processing PDF files without losing the hidden layers: I believe Ghostscript is able to do this. For example, the following command should convert a PDF to grayscale:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dColorConversionStrategy=/Gray -dProcessColorModel=/DeviceGray -sOutputFile=output.pdf input.pdf
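
If you need to run this over many files, the same command can be driven from a script. The following is only a sketch: it assumes `gs` is on the PATH, and the helper names `build_gs_gray_command` and `convert_to_gray` are made up for illustration. The flags are exactly the ones from the command above.

```python
import subprocess


def build_gs_gray_command(input_pdf, output_pdf):
    """Build the Ghostscript argument list for a grayscale conversion."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        "-dColorConversionStrategy=/Gray",
        "-dProcessColorModel=/DeviceGray",
        f"-sOutputFile={output_pdf}",
        input_pdf,
    ]


def convert_to_gray(input_pdf, output_pdf):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_gs_gray_command(input_pdf, output_pdf), check=True)
```

Calling `convert_to_gray("input.pdf", "output.pdf")` would then be equivalent to the command line above. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.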
Jukka Matilainen
Nice, it worked. But the output is not as clean as I wanted. If ImageMagick could convert the PDF without losing the text layer, I would like to process each page with something like: convert \( -white-threshold 50% \) -monochrome ... Maybe there is a way of telling IM how to use GS, like DaveParillo said. I'll check on this later.
kepler