I am experimenting with a system to scan letters and convert the scanned bitmaps to PDF, with the goal of getting a high resolution and a small PDF file size.

I am prototyping with a scanner, GIMP for bitmap manipulation and ImageMagick for bitmap-to-PDF conversion.

My process looks as follows:

  • Scan in 3x8-bit color at 600 DPI. The LZW-compressed true-color TIFF file is around 8 MB.

  • Use GIMP to convert the bitmap to an indexed image with a typical color table of 4 to 8 colors. That makes the image more compressible.

  • Use ImageMagick to convert the LZW-compressed indexed TIFF file to PDF, at around 500 KB per page.

To make the image compress even better, I could make the bitmap itself more compression-friendly. Before experimenting here, I would like to know how PS/PDF stores bitmaps.

Are bitmaps in PS/PDF run-length encoded? If so, I would gain compression by removing single pixels from bitmap rows.
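To make the idea concrete, here is a minimal Python sketch of what "removing single pixels" would buy under a plain run-length scheme. It is only an illustration: PDF does define a run-length filter (RunLengthDecode), but nothing here says the PDF writer actually uses it, and the helper names `run_lengths` and `remove_isolated_pixels` are made up for this example.

```python
from itertools import groupby

def run_lengths(row):
    """Return (value, length) pairs for consecutive equal pixels in one row."""
    return [(value, len(list(group))) for value, group in groupby(row)]

def remove_isolated_pixels(row):
    """Overwrite a single pixel whose left and right neighbours agree with
    each other but not with it (a crude one-dimensional despeckle)."""
    cleaned = list(row)
    for i in range(1, len(row) - 1):
        if row[i - 1] == row[i + 1] != row[i]:
            cleaned[i] = row[i - 1]
    return cleaned

# One row of palette indices with a stray pixel at position 3.
row = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
runs_before = run_lengths(row)
runs_after = run_lengths(remove_isolated_pixels(row))
print(len(runs_before), "runs before,", len(runs_after), "runs after")
```

Fewer runs means fewer (value, length) pairs to store, so the gain depends entirely on how speckled the scan is and on which filter ends up in the PDF.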

Do you have ideas for further optimizing here?

Do you know of any references on the bitmap storage format in PS/PDF?

A: 

PDF supports many types of image compression, see: http://en.wikipedia.org/wiki/Pdf#Raster_images

I think you can specify which one to use with the ImageMagick -compress option: http://www.imagemagick.org/script/command-line-options.php#compress

gromgull
A: 

For bitmaps, IIRC, PDF uses deflate. But PDF can also store images with more specific image compression algorithms, such as JPEG (lossy), CCITT (lossless), JBIG2 (lossy and lossless) and JPX (i.e. JPEG 2000, lossy and lossless).
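If the image does end up deflate-compressed, the effect of cleaning up the bitmap can be estimated outside the PDF with Python's `zlib` module, which implements the same algorithm. The sketch below is an assumption-laden toy: the page size, the banded pattern and the 2% speckle rate are invented, and a real PDF writer may apply predictors or other settings that change the numbers.

```python
import random
import zlib

WIDTH, HEIGHT = 200, 100  # hypothetical page size in pixels

def make_page(speckle_fraction, seed=0):
    """Build one byte per pixel: horizontal bands of palette index 0 and 1,
    with a given fraction of pixels replaced by a stray index 3."""
    rng = random.Random(seed)
    pixels = bytearray()
    for y in range(HEIGHT):
        for x in range(WIDTH):
            value = 1 if (y // 10) % 2 else 0
            if rng.random() < speckle_fraction:
                value = 3  # simulated scanner noise
            pixels.append(value)
    return bytes(pixels)

clean = make_page(0.0)
noisy = make_page(0.02)

clean_size = len(zlib.compress(clean, 9))
noisy_size = len(zlib.compress(noisy, 9))
print("clean:", clean_size, "bytes; noisy:", noisy_size, "bytes")
```

Running the same comparison on the real indexed bitmap, before and after despeckling, gives a rough idea of whether the cleanup is worth the effort.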

vartec
A: 

Adobe's PDF reference might be a good place to start. From a very cursory look, it looks like images are stored uncompressed, but that doesn't feel right at all. It can also link to external images, in JPEG for instance.

unwind
A: 

The compression method is generally selected by the tool creating the PDF and you may have limited control over that.

If you have Acrobat 9.0 there is a really nice 'hidden' feature which allows you to see the object tree inside a PDF (you are interested in the XObjects under Resources). There is a short blog on using it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects
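Without Acrobat, a rough first look is possible by scanning the raw PDF bytes for `/Filter` entries. This is a crude sketch, not a PDF parser: it counts filters on every stream (not only images), misses anything hidden inside compressed object streams, and the function name `list_filters` is made up for this example.

```python
import re
from collections import Counter

# Matches "/Filter /Name" or "/Filter [/Name1 /Name2]".
FILTER_ENTRY = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
FILTER_NAME = re.compile(rb"/([A-Za-z0-9]+)")

def list_filters(pdf_bytes):
    """Count the filter names that appear after /Filter keys in raw PDF data."""
    counts = Counter()
    for entry in FILTER_ENTRY.finditer(pdf_bytes):
        for name in FILTER_NAME.findall(entry.group(1)):
            counts[name.decode("ascii")] += 1
    return counts

# Synthetic fragment standing in for a real file's contents.
sample = (
    b"<< /Type /XObject /Subtype /Image /Filter /FlateDecode >>\n"
    b"<< /Type /XObject /Subtype /Image /Filter [/ASCII85Decode /DCTDecode] >>\n"
)
print(list_filters(sample))
```

For a real file, pass the result of `open(path, "rb").read()`. Names such as FlateDecode, DCTDecode, CCITTFaxDecode, JBIG2Decode and JPXDecode correspond to the compression methods mentioned in the other answers.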

A: 

A few companies (Luratech and CamiNova are the only ones I know of) offer a "Mixed Raster Content" model in PDF. The files are viewable in the standard Adobe Reader but are very, very small, comparable to DjVu.

"Mixed Raster Content" means they segment the image into a high resolution B&W mask (hard edges, lines, letters) and lower resolution smooth tone image (background pictures). The mask gets stored using a bitonal compression algorithm (probably JBIG2) and the smooth tone image gets compressed using JP2K (probably).
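The segmentation step can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Luratech or CamiNova do it: the fixed threshold, the 2x downsampling factor and the function name `split_mrc` are all invented for illustration, and real products use far more careful segmentation (and usually a separate foreground colour layer).

```python
def split_mrc(gray, threshold=128, factor=2):
    """Split a grayscale image (list of rows, values 0..255) into a
    full-resolution 1-bit mask and a downsampled background.

    Pixels darker than `threshold` go into the mask. Each background
    pixel is the average of the non-mask pixels in a factor x factor
    block, or white (255) if the whole block is covered by the mask."""
    height, width = len(gray), len(gray[0])
    mask = [[1 if px < threshold else 0 for px in row] for row in gray]
    background = []
    for y in range(0, height, factor):
        out_row = []
        for x in range(0, width, factor):
            block = [gray[yy][xx]
                     for yy in range(y, min(y + factor, height))
                     for xx in range(x, min(x + factor, width))
                     if not mask[yy][xx]]
            out_row.append(sum(block) // len(block) if block else 255)
        background.append(out_row)
    return mask, background

# Tiny 4x4 example: dark "ink" on a light, slightly uneven background.
page = [
    [250, 250,   0,   0],
    [250, 240,   0,   0],
    [200, 200, 200, 200],
    [200, 200,  10, 200],
]
mask, background = split_mrc(page)
```

The mask would then go to a bitonal codec and the background to a smooth-tone codec, which is where the size savings come from.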

msr