Extracting text from PDF with Poppler (C++)

You should be able to set the selection rectangle to the page size (the page's MediaBox) and get all of the text.

I say "should" because, before you are surprised by the output of poppler_page_get_text, you need to be aware of how text gets laid out on a page. All graphics are laid out on a page using a program expressed in postfix notation. To render the page, this program is executed on a blank page.

Operations in the program can include changing colors, position, or the current transformation matrix, and drawing lines, Bézier curves, and so on. Text is laid out by a series of text operators that are always bracketed by BT (begin text) and ET (end text). How and where text is placed on a page is at the sole discretion of the software that generates the PDF. For print drivers, for example, the code responds to GDI calls for DrawString and translates them into text drawing operations.
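To make the postfix idea concrete, here is a minimal sketch of a toy interpreter for a heavily simplified content stream. It is not how Poppler works internally. The names TextRun and extractRuns are made up for illustration, and it assumes whitespace-separated tokens, one-word strings in parentheses, and only the BT, ET, Td and Tj operators (everything else is skipped).

```cpp
#include <sstream>
#include <string>
#include <vector>

// One run of text plus the position it was placed at (hypothetical type).
struct TextRun {
    double x;
    double y;
    std::string text;
};

// An operand is a number, a (string) or a /Name; anything else is an operator.
static bool isOperand(const std::string& tok) {
    const char c = tok[0];
    return c == '(' || c == '/' || c == '-' || c == '+' || c == '.' ||
           (c >= '0' && c <= '9');
}

// Toy interpreter: operands are pushed on a stack and each operator
// consumes the operands that precede it (postfix notation).
// Assumes well-formed input; a malformed number would make std::stod throw.
std::vector<TextRun> extractRuns(const std::string& stream) {
    std::vector<TextRun> runs;
    std::vector<std::string> operands;
    bool inText = false;
    double x = 0.0, y = 0.0;

    std::istringstream in(stream);
    std::string tok;
    while (in >> tok) {
        if (isOperand(tok)) {
            operands.push_back(tok);
            continue;
        }
        if (tok == "BT") {            // begin text: reset the text position
            inText = true;
            x = 0.0;
            y = 0.0;
        } else if (tok == "ET") {     // end text
            inText = false;
        } else if (tok == "Td" && inText && operands.size() >= 2) {
            // "tx ty Td": move by an offset from the current line start
            x += std::stod(operands[operands.size() - 2]);
            y += std::stod(operands.back());
        } else if (tok == "Tj" && inText && !operands.empty()) {
            // "(string) Tj": show a string at the current position
            const std::string& s = operands.back();
            if (s.size() >= 2 && s.front() == '(' && s.back() == ')') {
                runs.push_back({x, y, s.substr(1, s.size() - 2)});
            }
        }
        // Every operator, handled or not, consumes its operands.
        operands.clear();
    }
    return runs;
}
```

Feeding it a stream such as `BT 72 700 Td (world) Tj 0 20 Td (Hello) Tj ET` yields "world" before "Hello", even though "Hello" sits higher on the page. The runs come back in the order the generator emitted them, which need not be reading order.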

If you are lucky, the text on the page is laid out in a sane order with sane font usage, but many programs that generate PDF aren't so kind. Psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts may be re-encoded so that 'a' maps to '{' or whatever. Then you might have ligatures, where multiple characters are replaced by a single glyph; the most common ones are ae, oe, fi, fl, and ffl.
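Two of the clean-up steps implied above can be sketched as follows. This is only an illustration under strong assumptions: fragments already carry positions in PDF user space (origin at the bottom left), the page has a single column, and the text is already UTF-8 Unicode (a re-encoded font would first need its encoding undone). Fragment, readingOrder and expandLigatures are hypothetical names, not part of Poppler.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// A piece of text and the position of its start (hypothetical type).
struct Fragment {
    double x;
    double y;
    std::string text;
};

// Reorder fragments top-to-bottom, then left-to-right within a line.
// Fragments whose y values differ by at most lineTolerance from their
// neighbour are treated as being on the same line.
std::vector<Fragment> readingOrder(std::vector<Fragment> frags,
                                   double lineTolerance) {
    // Higher y is further up the page, so sort by y descending first.
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& b) { return a.y > b.y; });

    std::size_t lineStart = 0;
    for (std::size_t i = 1; i <= frags.size(); ++i) {
        const bool lineEnds =
            i == frags.size() ||
            frags[i - 1].y - frags[i].y > lineTolerance;
        if (lineEnds) {
            // Sort the fragments of the finished line by x ascending.
            std::sort(frags.begin() + lineStart, frags.begin() + i,
                      [](const Fragment& a, const Fragment& b) {
                          return a.x < b.x;
                      });
            lineStart = i;
        }
    }
    return frags;
}

// Replace Unicode f-ligature code points (U+FB00..U+FB04, given here as
// UTF-8 bytes) with the letters they stand for.
std::string expandLigatures(std::string text) {
    static const std::pair<const char*, const char*> table[] = {
        {"\xEF\xAC\x80", "ff"},  {"\xEF\xAC\x81", "fi"},
        {"\xEF\xAC\x82", "fl"},  {"\xEF\xAC\x83", "ffi"},
        {"\xEF\xAC\x84", "ffl"},
    };
    for (const auto& entry : table) {
        const std::string from = entry.first;
        const std::string to = entry.second;
        std::size_t pos = 0;
        while ((pos = text.find(from, pos)) != std::string::npos) {
            text.replace(pos, from.size(), to);
            pos += to.size();
        }
    }
    return text;
}
```

Grouping by line before sorting by x avoids a comparator with a fuzzy tolerance, which would not be a valid ordering for std::sort. Real documents with columns, rotated text or tables need considerably more than this.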

With all of this in place, the process of extracting text is decidedly non-trivial, so don't be surprised if you see poor quality results from text extraction.

I used to work on the text extraction tools in Acrobat 1.0 and 2.0 - it's a real challenge to get right.
