I'm trying to implement search result highlighting for PDFs in a web app. I have the original PDFs, and small PNG versions that are used in search results. Essentially I'm looking for an API like:

pdf_document.find_offsets('somestring')
# => { top: 501, left: 100, bottom: 520, right: 150 }, { ... another box ... }, ...

I know it's possible to get this information out of a PDF because Apple's Preview.app implements this.

I need something that runs on Linux and is ideally open source. I'm aware you can do this with Acrobat on Windows.

A: 

I think you can do this using the Adobe Acrobat SDK, a Linux version of which can be downloaded for free from Adobe. You can use it to extract text from PDFs and then work out the offsets. The PDF can then be highlighted using an Acrobat XML highlight file, which specifies which words at which positions are to be highlighted. It is passed to Acrobat in the URL as follows:

http://example.com/a.pdf#xml=http://example.com/highlightfile.xml

msanders
+3  A: 

CAM::PDF can do the geometry part quite nicely, but has some trouble with the string matching sometimes. The technique would be something like the following lightly-tested code:

use CAM::PDF;
my $pdf = CAM::PDF->new('my.pdf') or die $CAM::PDF::errstr;
for my $pagenum (1 .. $pdf->numPages) {
   my $pagetree = $pdf->getPageContentTree($pagenum) or die;
   my @text = $pagetree->traverse('MyRenderer')->getTextBlocks;
   for my $textblock (@text) {
      print "text '$textblock->{str}' at ",
            "($textblock->{left},$textblock->{bottom})\n";
   }
}

package MyRenderer;
use base 'CAM::PDF::GS';

sub new {
   my ($pkg, @args) = @_;
   my $self = $pkg->SUPER::new(@args);
   $self->{refs}->{text} = [];
   return $self;
}
sub getTextBlocks {
   my ($self) = @_;
   return @{$self->{refs}->{text}};
}
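sub renderText {
   my ($self, $string, $width) = @_;
   # Map the text-space origin to device space: the start of the baseline
   my ($x, $y) = $self->textToDevice(0,0);
   # Record one box per rendered string fragment
   push @{$self->{refs}->{text}}, {
      str => $string,
      left => $x,
      bottom => $y,
      right => $x + $width,
      #top => $y + ???,
   };
   return;
}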

where the output looks something like this:

text 'E' at (52.08,704.16)
text 'm' at (73.62096,704.16)
text 'p' at (113.58936,704.16)
text 'lo' at (140.49648,704.16)
text 'y' at (181.19904,704.16)
text 'e' at (204.43584,704.16)
text 'e' at (230.93808,704.16)
text ' N' at (257.44032,704.16)
text 'a' at (294.6504,704.16)
text 'm' at (320.772,704.16)
text 'e' at (360.7416,704.16)
text 'Employee Name' at (56.4,124.56)
text 'Employee Title' at (56.4,114.24)
text 'Company Name' at (56.4,103.92)

As you can see from that output, the string matching will be a little tedious, but the geometry is straightforward (except maybe for the font height).
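
One way to handle the matching step is sketched below. This is a rough illustration rather than anything CAM::PDF provides: find_offsets is a hypothetical helper that takes the search string plus a list of hashes shaped like the ones MyRenderer collects (str, left, bottom, right). It assumes that fragments on one line share the same baseline and that sorting by left gives reading order. The boxes are only as fine as the fragments, since per-character widths are not known here, and top is still left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): locate $needle in text blocks
# shaped like the ones MyRenderer collects (str, left, bottom, right).
sub find_offsets {
   my ($needle, @blocks) = @_;
   return unless length $needle;

   # Group fragments by baseline; rounding y to a string key absorbs float noise.
   my %lines;
   for my $block (@blocks) {
      push @{ $lines{ sprintf('%.2f', $block->{bottom}) } }, $block;
   }

   my @boxes;
   for my $baseline (sort { $b <=> $a } keys %lines) {   # largest y first
      my @sorted = sort { $a->{left} <=> $b->{left} } @{ $lines{$baseline} };

      # Rebuild the line and remember which fragment produced each character.
      my $line = q{};
      my @owner;
      for my $block (@sorted) {
         $line .= $block->{str};
         push @owner, ($block) x length($block->{str});
      }

      # Report every occurrence, spanning from the first to the last fragment hit.
      my $pos = 0;
      while (($pos = index($line, $needle, $pos)) >= 0) {
         my $first = $owner[$pos];
         my $last  = $owner[$pos + length($needle) - 1];
         push @boxes, {
            left   => $first->{left},
            right  => $last->{right},
            bottom => $baseline + 0,
         };
         $pos++;
      }
   }
   return @boxes;
}
```

Called as find_offsets('Employee Name', @text) on the blocks from the loop above, this should find both the heading that was emitted glyph by glyph (the fragments concatenate back to 'Employee Name', with the space carried by the ' N' fragment) and the copy that arrived as a single string. A match that wraps across two lines is not found by this sketch.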

Chris Dolan
+1  A: 

Take a look at PDFlib TET: http://www.pdflib.com/products/tet/

(it's not free)

Fabrizio
