+1  A: 

I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.

pdftotext usually recognises non-ASCII characters fine. Is it possible that it's extracting them correctly, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
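
To illustrate the encoding mismatch described above, here is a minimal Perl sketch using only the core Encode module. The byte string is a made-up stand-in for extractor output (the UTF-8 encoding of a single "é"); nothing here calls pdftotext itself. Decoding the same bytes as Latin-1 instead of UTF-8 yields two junk characters, which is what a viewer with the wrong encoding would show.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample data: the two UTF-8 bytes for U+00E9 (e with acute accent),
# standing in for what a text extractor might write to its output file.
my $bytes = "\xC3\xA9";

# Decoded with the encoding the file was actually written in.
my $right = decode('UTF-8', $bytes);

# Decoded with the wrong encoding: each byte becomes its own character
# (U+00C3 followed by U+00A9), the classic mojibake pattern.
my $wrong = decode('ISO-8859-1', $bytes);

printf "as UTF-8:   %d character(s)\n", length $right;
printf "as Latin-1: %d character(s)\n", length $wrong;
```

If the text file looks wrong in a viewer but decodes cleanly as UTF-8 like this, the extraction is probably fine and only the viewer's encoding setting needs changing.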

James Healy
+1  A: 

There is getpdftext.pl, which is part of CAM::PDF.

Sinan Ünür
Yeah, but it's not very good (I'm the author)
Chris Dolan
@Chris Dolan It is not *that* bad either ;-)
Sinan Ünür
A: 

Take a look at PDFBox. It is a library, but I think it also comes with a tool for text extraction.

Per Arneng
+3  A: 

You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).

Andrew Barnett
It's worse than this: text need not be laid out on the page in reading order. It need not be laid out rectilinearly. Writing a simple find-word command for Acrobat 1.0 took me 5 months, and that's with the people who created all the support libraries and designed the format in adjacent offices. Extracting text is a subset of that problem.
plinth
Letters not being represented by character codes, but instead by bitmaps or vector graphics, is really pathological these days. Text not being laid out in reading order is kind of normal, but usually the results are intelligible.
Charles Stewart
+2  A: 

You can extract text from a PDF with these modules:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From CPAN

   use CAM::PDF;
   use CAM::PDF::PageText;
   my $pdf = CAM::PDF->new($filename);  # $filename: path to the PDF
   my $pageone_tree = $pdf->getPageContentTree(1);  # content tree of page 1
   print CAM::PDF::PageText->render($pageone_tree);

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
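
To give a feel for the kind of heuristic the quoted documentation mentions, here is a rough, self-contained Perl sketch. It is not CAM::PDF's actual algorithm. The fragment list, its `[x, y, text]` layout, and the `$tolerance` value are all assumptions made up for illustration. It groups fragments whose y coordinates are close together into lines, then orders each line left to right.

```perl
use strict;
use warnings;

# Hypothetical fragments as [x, y, text], listed in the arbitrary order
# they might appear in a page's content stream.
my @fragments = (
    [ 200, 700, 'world'  ],
    [  72, 680, 'Second' ],
    [  72, 700, 'Hello'  ],
    [ 130, 680, 'line'   ],
);

my $tolerance = 2;  # assumed: y values within 2 units share a line

# PDF y coordinates grow upward, so sort top of page first.
my @by_y = sort { $b->[1] <=> $a->[1] } @fragments;

# Start a new line whenever y drops by more than the tolerance.
my (@lines, $line_y);
for my $f (@by_y) {
    if (!defined $line_y || $line_y - $f->[1] > $tolerance) {
        push @lines, [];
        $line_y = $f->[1];
    }
    push @{ $lines[-1] }, $f;
}

# Within each line, order fragments left to right and join their text.
my @out;
for my $line (@lines) {
    my @in_order = sort { $a->[0] <=> $b->[0] } @$line;
    push @out, join ' ', map { $_->[2] } @in_order;
}
my $text = join "\n", @out;
print "$text\n";
```

Something this simple would be fooled by exactly the cases listed in the disclaimer: a subscript sits a few units lower and gets split off as its own line, and rotated text does not fit the top-to-bottom, left-to-right assumption at all.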

joe
I'm the CAM::PDF author and I agree with the disclaimers. I built the text extraction on a whim and it turned out to be a lot harder than I anticipated.
Chris Dolan