I have a large collection of documents scanned into PDF format, and I wish to write a shell script that will convert each document to DjVu format. Some documents were scanned at 200dpi, some at 300dpi, and some at 600dpi. Since DjVu is a pixel-based format, I want to be sure I use the same resolution in the target DjVu file as was used for the scan.

Does anyone know what program I can run, or how I can write a program, to determine what resolution was used to produce a scanned PDF? (Number of pixels might work too as almost all documents are 8.5 by 11 inches.)


Clarification after responses: I'm aware of the difficulties highlighted by Breton, and I'm willing to concede that the problem in general is ill-posed, but I'm not asking about general PDF documents. My particular documents came out of a scanner. They contain one scanned image per page, same resolution each page. If I convert the PDF to PostScript I can poke around by hand and find pixel dimensions easily; I could probably find image sizes with more work. And if in desperate need I could modify the dictionary stack that gs is using; long ago, I wrote an interpreter for PostScript Level 1.

All of that is what I'm trying to avoid.


Thanks to help received, I've posted an answer below:

  1. Extract the bounding box from the PDF using identify, taking only the output for the first page, and understanding that the units will be PostScript points, of which there are 72 to an inch.
  2. Extract images from the first page using pdfimages.
  3. Get height and width of image. This time identify will give number of pixels.
  4. Add the total areas of the images to get the number of dots squared.
  5. To get resolution, compute the area of the bounding box in inches squared, divide dots squared by inches squared, take the square root, and round to the nearest multiple of 10.
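The arithmetic in steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration, not the script posted in the answer: the function name `estimate_dpi` and its parameters are made up here, and it assumes the page size (in points) and the image sizes (in pixels) have already been obtained from identify and pdfimages.

```python
import math

POINTS_PER_INCH = 72.0  # PostScript points per inch


def estimate_dpi(page_w_pt, page_h_pt, image_sizes_px, round_to=10):
    """Estimate scan resolution from the page size and the image sizes.

    page_w_pt, page_h_pt: page bounding box in points.
    image_sizes_px: list of (width, height) pairs in pixels, one per image.
    """
    # Area of the page in square inches.
    square_inches = (page_w_pt / POINTS_PER_INCH) * (page_h_pt / POINTS_PER_INCH)
    # Total area of all images in square dots (handles pages split into tiles).
    square_dots = sum(w * h for w, h in image_sizes_px)
    exact = math.sqrt(square_dots / square_inches)
    # Round to the nearest multiple of round_to.
    return round_to * round(exact / round_to)


if __name__ == "__main__":
    # A letter-size page (612 x 792 points) holding one 5100 x 6600 image.
    print(estimate_dpi(612, 792, [(5100, 6600)]))
```

Because the areas are summed, a page stored as two 5100 x 3300 strips gives the same result as a single 5100 x 6600 image.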

Full answer with script is below. I'm using it in live fire and it works great. Thanks Harlequin for pdfimages and Spiffeah for the alert about multiple images per page (it's rare, but I've found some).

+1  A: 

Too long to put into a comment, but neither ImageMagick nor GraphicsMagick is up to the job; every answer is wrong:

: nr@yorkie 1932 ; gm identify -format "x=%x y=%y w=%w h=%h" drh*rec*pdf
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792
x=0 y=0 w=612 h=792

: nr@yorkie 1933 ; identify -format "x=%x y=%y w=%w h=%h" drh*rec*pdf   
x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined     w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined     y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612     h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72     Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792
: nr@yorkie 1934 ;

The correct parameters for this document are that each scanned page is 5100 pixels wide and 6600 pixels high, unsurprising since this was an 8.5-by-11-inch page scanned at 600dpi. The output from ImageMagick is astoundingly unprofessional.
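For reference, here is the arithmetic behind those numbers as a small Python sketch. The helper names `expected_pixels` and `dpi_from_width` are made up for illustration, and the second one assumes the page really is 8.5 inches wide.

```python
def expected_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def dpi_from_width(width_px, width_in=8.5):
    """Resolution implied by a pixel width, assuming a known paper width."""
    return width_px / width_in


# The three scan resolutions mentioned in the question, for letter-size paper.
for dpi in (200, 300, 600):
    print(dpi, expected_pixels(8.5, 11, dpi))
```

So a letter-size page should come out as 1700 x 2200, 2550 x 3300, or 5100 x 6600 pixels, and anything reporting 612 x 792 is giving points, not pixels.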

No downvotes because you were trying to be helpful, but *Magick don't work.

Norman Ramsey
The output from ImageMagick is exactly what you asked for; I hardly see how it's "unprofessional". (There's no filename in the output string, for instance, because you didn't ask it for one). "Undefined" is clear: The file format is such that those attributes don't make sense.
Charles Duffy
erm, I meant what you asked *it* for by running the command, not what you asked for in your question. :)
Charles Duffy
I guess that ImageMagick converts PDF by rendering it to some default resolution, and then works on that. You will have to extract the image data first (see my answer).
Svante
@Charles: The output is *not* what I asked for: the numbers are *wrong*. I find it unprofessional to report a %y value of '72 Undefined'. This value is neither useful nor reasonable. I was also annoyed by the lack of newlines, but perhaps it was my responsibility to put \n in the format string.
Norman Ramsey
@Harlequin: Looks right. 72dpi by fiat, regardless of truth.
Norman Ramsey
@Norman Ramsey: you were supposed to run `identify` against the **output of `pdfimages`**, not against the **original PDF**. If you run it against the original PDF, identify will call Ghostscript for help (it can't handle PDFs natively). And Ghostscript will convert to an image format for identify (PPM?), using its default resolution of 72dpi.
pipitas
A: 

PDF is a resolution-independent format, so it's a nonsensical question. You may have scanned some bitmaps at a particular resolution, and those bitmaps are individually embedded inside the PDF, but the PDF itself may contain images at multiple resolutions, as well as resolution-independent vector graphics. There's no way to know without cracking open the PDF and examining every object inside it.

Editing to continue expounding on the problem:

You may have gotten lucky, and the software you used to scan the documents embedded some metadata about this, but don't bet on it. Such metadata is unlikely to be standard. As far as parsing the pdf, you'd want a prewritten library - such as ghostscript. The problem is that PDF isn't really a format so much as it is a specified subset of the PostScript programming language, and an agreed upon way of compressing/compiling this subset along with some binaries. Thus reading a PDF is more complicated than other types of image formats, as it involves writing a language interpreter - not so straightforward.

The best approach is to either throw up your hands and give up, or really look hard at ghostscript and see if you can get that to tell you the answer.

Breton
@Breton: PDF's native imaging operators may be resolution independent. But PDF as a container format is able to embed all sorts of image types -- and here is where resolution **IS** important. The statement of this being a *'nonsensical question'* is therefore wrong.
pipitas
@pipitas you are committing fallacy of composition. It is still a nonsensical question, just as, say, "What is the hair color of stanford university?" is nonsensical.
Breton
@pipitas You could make it sensical by stating it as "what is the average hair color of people attending stanford university", and the original author could have made his question sensical by asking instead "How do I determine the average resolution of all the images within a pdf, via a shell script?"
Breton
@Breton: Not being a native English speaker, I don't understand the term 'fallacy of composition'. However, in the face of your intellectual superiority, I won't argue with you any more about this topic. -- I, for one, a devoted slave of pixels and vectors, will continue to take into account the resolution dependency of the various graphical objects inside PDFs. May you continue to live happy and unbothered in your resolution-independent PDF world...
pipitas
@pipitas "Fallacy of composition" is a logical error that you commit when asserting that the properties of the whole are inherited from the properties of its parts. So a machine doesn't have the same properties as a gear, a university doesn't have the same properties as a student or faculty member, and a PDF doesn't have the "resolution" property of an image. (An image does have one, and a PDF can contain an image, but the resolution property does not transfer to the PDF.)
Breton
@pipitas You might think I am being overly pedantic, but keep in mind that this is a programming site. If I were so imprecise with my words in my job, serious mistakes would get made.
Breton
+2  A: 

I guess that the scans are included as images in the PDF, so you could use pdfimages to extract them first. Then, identify should be able to find the correct data.
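A rough Python sketch of that two-step approach follows. It assumes `pdfimages` (from xpdf or poppler) and ImageMagick's `identify` are on the PATH; the function names are made up for illustration, and only the first page is extracted.

```python
import glob
import os
import subprocess
import tempfile


def parse_dimensions(identify_output):
    """Parse 'WIDTH HEIGHT' lines, one per image, into (width, height) pairs."""
    sizes = []
    for line in identify_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            sizes.append((int(parts[0]), int(parts[1])))
    return sizes


def first_page_image_sizes(pdf_path):
    """Extract the images on page 1 with pdfimages, then measure each in pixels."""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "img")
        # -f 1 -l 1 restricts extraction to the first page.
        subprocess.run(["pdfimages", "-f", "1", "-l", "1", pdf_path, root],
                       check=True)
        files = sorted(glob.glob(root + "*"))
        if not files:
            return []
        # identify expands the \n escape in the format string itself.
        result = subprocess.run(["identify", "-format", "%w %h\\n"] + files,
                                check=True, capture_output=True, text=True)
        return parse_dimensions(result.stdout)
```

Running identify on the extracted files rather than on the PDF matters: on the PDF it reports the rendered page size, not the scanned pixel dimensions.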

Svante
Nice idea! It's a 90% solution---it's *very* quick and gives me accurate width and height in pixels. I need to see how to extract and combine bounding-box information.
Norman Ramsey
Great help. Complete solution is now posted.
Norman Ramsey
+3  A: 

If a PDF has been created by scanning, then there should be only one image associated with each page. You can easily find the resolution of each page image by parsing the PDF using the iText (Java) or iTextSharp (the .NET port) libraries.

If you want to roll your own utility to do this, do something like the following in iTextSharp :

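using System.Windows.Forms;   // MessageBox
using iTextSharp.text.pdf;    // PdfReader, PdfDictionary, PdfName, PdfNumber

PdfReader reader = new PdfReader(filename);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    PdfDictionary pg = reader.GetPageN(i);
    PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
    if (res == null) continue;    // page has no resource dictionary of its own
    PdfDictionary xobjs = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
    if (xobjs == null) continue;  // page has no XObjects
    foreach (PdfName xObjectKey in xobjs.Keys)
    {
        // Resolve the indirect reference to the XObject's dictionary.
        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(xobjs.Get(xObjectKey));
        PdfName subtype = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
        if (PdfName.IMAGE.Equals(subtype))  // also skips XObjects with no subtype
        {
            // Width and Height of an image XObject are in pixels.
            PdfNumber width = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.WIDTH));
            PdfNumber height = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.HEIGHT));
            MessageBox.Show("image on page [" + i + "] resolution=[" + width + "x" + height + "]");
        }
    }
}
reader.Close();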

Here, for each page, we read through each XObject of subtype Image and get the WIDTH and HEIGHT values. These are the pixel dimensions of the image that the scanner has embedded in the PDF.

Note that the scaling of this image to match the page size (as in the size of the page rendered in Acrobat: A4, Letter, etc.) is performed separately in the page content stream, which is represented as a subset of PostScript, and is much harder to find without parsing the PostScript.
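For the simplest scanner output, the scaling can sometimes be read straight from the content stream. The Python sketch below is an illustration only, under several assumptions: the stream has already been decompressed to text, each image is placed by a single `a b c d e f cm` operator immediately followed by `/Name Do`, there is no rotation or nested transformation, and user space is the default 1/72 inch. The function names are made up.

```python
import re

NUM = r"(-?\d*\.?\d+)"
# Six numbers, the cm operator, then an image name and the Do operator.
CM_DO = re.compile(r"\s+".join([NUM] * 6) + r"\s+cm\s+/(\S+)\s+Do")


def placed_sizes_pt(content):
    """Map image name -> (width, height) in points for unrotated placements."""
    sizes = {}
    for m in CM_DO.finditer(content):
        a, b, c, d = (float(m.group(i)) for i in range(1, 5))
        if b == 0 and c == 0:  # pure scale and translate, no rotation or skew
            sizes[m.group(7)] = (abs(a), abs(d))
    return sizes


def dpi_from_placement(width_px, width_pt):
    """Resolution implied by an image's pixel width and its placed width."""
    return width_px / (width_pt / 72.0)


if __name__ == "__main__":
    stream = "q 612 0 0 792 0 0 cm /Im0 Do Q"
    print(placed_sizes_pt(stream))
    print(dpi_from_placement(5100, 612))
```

Anything more elaborate (several stacked `cm` operators, rotated pages, inline images) needs a real content-stream parser, which is the hard part this answer warns about.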

Be aware that there are some scanners that will embed the scanned image as a grid of smaller images (for some kind of size optimization I assume). So if you see something like 50 small images popping up for each page, that could be why.

Hope this helps in some way if you have to roll your own utility.

Spiffeah
Thanks for the suggestion. These libraries are clearly powerful. From the code example, what are the *units* of the `width` and `height` variables?
Norman Ramsey
The units are pixels when talking about the xobject image Width and Height.
Spiffeah
Alert of multiple images per page was useful. Thanks kindly. Complete solution now posted.
Norman Ramsey
+2  A: 

Here are the elements to this answer:

  • pdfimages will extract images so that the number of dots can be discovered.
  • identify, run on the PDF itself, will give the size of the page in units of PostScript points (72 to the inch); run on an extracted image, it will give the size in pixels.
  • Because some scanners may split a single page into multiple images of varying sizes and shapes, the key is to add up the areas of all the images. Dividing square dots by square inches and taking the square root produces the answer.

Below is a Lua script that solves the problem. I probably could have used a plain shell, but capturing the width and height would have been a greater nuisance.

#!/usr/bin/env lua

require 'osutil'
require 'posixutil'
require 'mathutil'

local function runf(...) return os.execute(string.format(...)) end

assert(arg[1], "no file on command line")

local function dimens(filename)
  local cmd = [[identify -format "return %w, %h\n" $file | sed 1q]]
  cmd = cmd:gsub('$file', os.quote(filename))
  local w, h = assert(loadstring(os.capture(cmd)))()
  assert(w and h)
  return w, h
end

assert(#arg == 1, "dpi of just one file")

for _, pdf in ipairs(arg) do
  local w, h = dimens(pdf)  -- units are points
  local insquared = w * h / (72.00 * 72.00)
  local imagedir = os.capture 'mktemp -d'
  assert(posix.isdir(imagedir))
  runf('pdfimages -f 1 -l 1 %s %s 1>&2', os.quote(pdf),
                                         os.quote(imagedir .. '/img'))
  local dotsquared = 0
  for file in posix.glob(imagedir .. '/img*') do
    local w, h = dimens(file)  -- units are pixels
    dotsquared = dotsquared + w * h
  end
  os.execute('rm -rf ' .. os.quote(imagedir))
  local dpi = math.sqrt(dotsquared / insquared)

  if true then
    io.stderr:write(insquared, " square inches\n")
    io.stderr:write(dotsquared, " square dots\n")
    io.stderr:write(dpi, " exact dpi\n")
    io.stderr:write(math.round(dpi, 10), " rounded dpi\n")
  end
  print(math.round(dpi, 10))
end
Norman Ramsey
A: 

Apago's PDF Spy will tell you the actual resolution of images in a PDF along with lots of other stuff. It's a commercial product but has a 10-day demo.

Dwight Kelly