ansaurus

Question

How do I get character offset information from a pdf document?

Answer 1

A:

I think you can do this with the Adobe Acrobat SDK; a Linux version can be downloaded for free from Adobe. Use it to extract the text from the PDF and then work out the offsets. To highlight the PDF, use an Acrobat XML highlight file, which specifies which words at which positions are to be highlighted. It is passed to Acrobat by appending it to the document URL:

http://example.com/a.pdf#xml=http://example.com/highlightfile.xml

msanders 2008-10-14 11:20:29

Answer 2

+3 A:

CAM::PDF can do the geometry part quite nicely, but it sometimes has trouble with the string matching. The technique would be something like the following lightly-tested code:

use strict;
use warnings;
use CAM::PDF;
my $pdf = CAM::PDF->new('my.pdf') or die $CAM::PDF::errstr;
for my $pagenum (1 .. $pdf->numPages) {
   my $pagetree = $pdf->getPageContentTree($pagenum) or die;
   my @text = $pagetree->traverse('MyRenderer')->getTextBlocks;
   for my $textblock (@text) {
      print "text '$textblock->{str}' at ",
            "($textblock->{left},$textblock->{bottom})\n";
   }
}

package MyRenderer;
use base 'CAM::PDF::GS';

sub new {
   my ($pkg, @args) = @_;
   my $self = $pkg->SUPER::new(@args);
   $self->{refs}->{text} = [];
   return $self;
}
sub getTextBlocks {
   my ($self) = @_;
   return @{$self->{refs}->{text}};
}
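# Record each rendered string together with the device coordinates
# of its text origin (left edge, baseline).
sub renderText {
   my ($self, $string, $width) = @_;
   my ($x, $y) = $self->textToDevice(0,0);
   push @{$self->{refs}->{text}}, {
      str => $string,
      left => $x,
      bottom => $y,
      right => $x + $width,
      #top => $y + ???,
   };
   return;
}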

where the output looks something like this:

text 'E' at (52.08,704.16)
text 'm' at (73.62096,704.16)
text 'p' at (113.58936,704.16)
text 'lo' at (140.49648,704.16)
text 'y' at (181.19904,704.16)
text 'e' at (204.43584,704.16)
text 'e' at (230.93808,704.16)
text ' N' at (257.44032,704.16)
text 'a' at (294.6504,704.16)
text 'm' at (320.772,704.16)
text 'e' at (360.7416,704.16)
text 'Employee Name' at (56.4,124.56)
text 'Employee Title' at (56.4,114.24)
text 'Company Name' at (56.4,103.92)

As you can see from that output, the string matching will be a little tedious, but the geometry is straightforward (except maybe for the font height).
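One way to make the string matching less tedious is to merge the blocks that share a baseline into a single line of text and remember the character offset at which each block starts. The following is only a rough sketch under some assumptions: build_lines and locate are hypothetical helper names (not part of CAM::PDF), blocks on one line are assumed to have exactly equal bottom values, and the sample data is copied from the output above. Real documents will probably need a tolerance on the baseline and some handling for missing spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: merge text blocks that share a baseline into lines,
# recording the character offset at which each block starts.
# Assumes blocks on the same line have exactly the same 'bottom' value.
sub build_lines {
    my (@blocks) = @_;
    my %by_baseline;
    push @{ $by_baseline{ $_->{bottom} } }, $_ for @blocks;

    my @lines;
    for my $bottom (sort { $b <=> $a } keys %by_baseline) {   # top of page first
        my @sorted = sort { $a->{left} <=> $b->{left} } @{ $by_baseline{$bottom} };
        my $text = '';
        my @spans;
        for my $block (@sorted) {
            push @spans, { start => length($text), block => $block };
            $text .= $block->{str};
        }
        push @lines, { bottom => $bottom, text => $text, spans => \@spans };
    }
    return @lines;
}

# Hypothetical helper: find every occurrence of $needle in the merged lines.
# Each hit carries the character offset within its line and the left edge of
# the block that contains the first matched character (an approximation when
# the match starts in the middle of a block).
sub locate {
    my ($needle, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        my $pos = 0;
        while (($pos = index($line->{text}, $needle, $pos)) >= 0) {
            # Last block whose start offset is at or before the match
            my ($first) = grep { $_->{start} <= $pos } reverse @{ $line->{spans} };
            push @hits, {
                offset => $pos,
                left   => $first->{block}{left},
                bottom => $line->{bottom},
            };
            $pos++;
        }
    }
    return @hits;
}

# Sample blocks taken from the output shown above
my @blocks = (
    { str => 'E',  left => 52.08,     bottom => 704.16 },
    { str => 'm',  left => 73.62096,  bottom => 704.16 },
    { str => 'p',  left => 113.58936, bottom => 704.16 },
    { str => 'lo', left => 140.49648, bottom => 704.16 },
    { str => 'y',  left => 181.19904, bottom => 704.16 },
    { str => 'e',  left => 204.43584, bottom => 704.16 },
    { str => 'e',  left => 230.93808, bottom => 704.16 },
    { str => ' N', left => 257.44032, bottom => 704.16 },
    { str => 'a',  left => 294.6504,  bottom => 704.16 },
    { str => 'm',  left => 320.772,   bottom => 704.16 },
    { str => 'e',  left => 360.7416,  bottom => 704.16 },
    { str => 'Employee Name', left => 56.4, bottom => 124.56 },
);

my @lines = build_lines(@blocks);
my @hits  = locate('Name', @lines);
for my $hit (@hits) {
    print "found at offset $hit->{offset}, near ($hit->{left},$hit->{bottom})\n";
}
```

In the real script, the list returned by getTextBlocks for a page would presumably be passed to build_lines in place of the sample data. Because a match can begin partway through a block, the reported left edge is only the start of the containing block; a finer position would need per-character widths.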

Chris Dolan 2008-10-15 02:39:35

Answer 3

+1 A:

Take a look at PDFlib TET: http://www.pdflib.com/products/tet/

(it's not free)

Fabrizio

Fabrizio 2008-10-15 10:39:46
