I use the following to read a PDF file and get text strings of a page:

my $pdf = CAM::PDF->new($pdf_file);
my $pagetree = $pdf->getPageContentTree($page_no);

# Get all text strings of the page
# MyRenderer is a separate package which implements getTextBlocks and
# renderText methods

my @text = $pagetree->traverse('MyRenderer')->getTextBlocks;

Now @text holds every text string on the page, along with the starting x,y position of each one.

How can I get the width (and possibly the height) of each string?

MyRenderer package is as follows:

package MyRenderer;
use base 'CAM::PDF::GS';
sub new {
    my ($pkg, @args) = @_;
    my $self = $pkg->SUPER::new(@args);
    $self->{refs}->{text} = [];
    return $self;
}

sub getTextBlocks {
    my ($self) = @_;
    return @{$self->{refs}->{text}};
}

sub renderText {
    my ($self, $string, $width) = @_;
    my ($x, $y) = $self->textToDevice(0,0);
    push @{$self->{refs}->{text}}, {
                                    str => $string,
                                    left => $x,
                                    bottom => $y,
                                    right =>$x + $width,
                                   };
    return;
}

Update 1: There's a function getStringWidth($fontmetrics, $string) in CAM::PDF. Although that function takes a $fontmetrics parameter, it returns the same value for a given string irrespective of what I pass for that parameter.

Also, I am not sure of the unit of measure the returned value uses.

Update 2: I changed the renderText function to following:

sub renderText {
    my ($self, $string, $width) = @_;
    my ($x, $y) = $self->textToDevice(0,0);
    push @{$self->{refs}->{text}}, {
                                str => $string,
                                left => $x,
                                bottom => $y,
                                right =>$x + ($width * $self->{Tfs}),
                                font => $self->{Tf},
                                font_size => $self->{Tfs},
                               };
    return;
}

Note that in addition to recording font and font_size, I multiplied $width by the font size to get the real width of the string.

Now the only thing missing is the height.
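
For the height, one rough approach is to scale the font's ascent and descent by the font size, the same way the width is scaled above. The sketch below is only an illustration under stated assumptions: `estimate_text_box` and its named parameters are hypothetical (not part of CAM::PDF), the ascent and descent values are made up and would in practice come from the font's descriptor in glyph-space units (1/1000 of the font size), and any extra scaling or rotation in the text matrix is ignored. Note that the y value returned by textToDevice(0,0) is treated here as the baseline, so the bottom of the box sits below it by the descent.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CAM::PDF: estimate a text block's box.
# ascent/descent are in glyph-space units (1/1000 of the font size);
# descent is normally negative. width is the unscaled width passed to
# renderText, which is multiplied by the font size as in Update 2.
sub estimate_text_box {
    my (%arg) = @_;
    my $size = $arg{font_size};
    return {
        left   => $arg{left},
        right  => $arg{left} + $arg{width} * $size,
        bottom => $arg{baseline} + $arg{descent} / 1000 * $size,
        top    => $arg{baseline} + $arg{ascent}  / 1000 * $size,
    };
}

my $box = estimate_text_box(
    left      => 72,
    baseline  => 700,
    width     => 2.5,
    font_size => 12,
    ascent    => 750,    # assumed value for illustration
    descent   => -250,   # assumed value for illustration
);

printf "width=%.2f height=%.2f\n",
    $box->{right} - $box->{left},
    $box->{top} - $box->{bottom};
```

If the font provides no usable ascent or descent, using the font size itself as the height is a common approximation, though it tends to overestimate the visible glyph height.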

+1  A: 

getStringWidth() depends heavily on the font metrics you provide. If it can't find the character widths in that data structure, then it falls back to the following code:

   if ($width == 0)
   {
      # HACK!!!
      #warn "Using klugy width!\n";
      $width = 0.2 * length $string;
   }

which may be what you're seeing. When I wrote that, I thought it was better than returning 0. If your font metrics seem good and you think there's a bug in CAM::PDF, feel free to post more details and I'll take a look.
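
To see why that fallback would produce the same number no matter what is passed as the metrics, here is a small stand-alone sketch. It is not CAM::PDF's actual implementation: `sketch_string_width` and the plain `{ char => width }` table are hypothetical stand-ins for the real font metrics structure, and only the final `if` mirrors the snippet above.

```perl
use strict;
use warnings;

# Illustrative stand-in, NOT CAM::PDF's implementation: sum per-character
# widths from a hypothetical { char => width } table, then apply the same
# fallback as the snippet above when no widths were found.
sub sketch_string_width {
    my ($widths, $string) = @_;
    my $width = 0;
    for my $char (split //, $string) {
        $width += $widths->{$char} // 0;   # unknown characters add nothing
    }
    if ($width == 0) {
        # Fallback: depends only on the string length, not on the metrics
        $width = 0.2 * length $string;
    }
    return $width;
}

# Two different but unusable tables give the same fallback result.
my $empty = sketch_string_width({}, 'Hello');
my $wrong = sketch_string_width({ bogus => 1 }, 'Hello');

# A table that actually covers the characters gives a metrics-based width.
my $real = sketch_string_width(
    { H => 0.7, e => 0.5, l => 0.2, o => 0.5 }, 'Hello');

print "empty=$empty wrong=$wrong real=$real\n";
```

So if every metrics argument yields 0.2 times the string length, that suggests the character widths are not being found in the structure being passed in.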

Chris Dolan
Thanks for the feedback, Chris. Check my Update 2 in the OP. I hope what I did to get the width was right.
Thushan