views:

409

answers:

3

Real quick background : We have a PDFMaker (HTMLDoc) that converts html into a pdf. HTMLDoc doesn't consistently pick up the styles that we need from the html that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to height=80 width=90.

My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.

I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!

NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).

Here's what I've got so far:

#!/usr/bin/perl
if (!@ARGV[0] || !@ARGV[1])
{
  print "Usage: converter.pl [input file] [output file] \n";
  exit;
}
open FILE, "<", @ARGV[0] or die $!;
open OUTFILE, ">", @ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
  $line = $_ ;
  $line =~ /style=\"(.+)\"/;
  $guts = $1;
  $guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
  $name = $1;
  $value = $2;
  $guts = $name."=".$value;
  $line =~ s/style=\"(.+)\"/$guts/g;
  print OUTFILE $line ;
}

exit;

Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.

UPDATE

For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...

#!/usr/bin/perl

## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values. 
# All of these can be modified in the second foreach loop, which checks for height and width. 

if (!@ARGV[0] || !@ARGV[1])
{
  print "Usage: quickvert.pl [input file] [output file] \n";
  exit;
}
open FILE, "<", @ARGV[0] or die $!;
open OUTFILE, ">", @ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
  $line = $_ ;
  my (@match) = $line =~ /style=\"(.+?)\"/g;
  my $guts;
  my $newguts;
  foreach (@match) {
    #print $_ ."\n";
    $guts = $_;
    $guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
    $newguts = "";
    foreach my $style (split(/;/,$guts)) {
      my ($name, $value) = split(/:/,$style);
      $value =~ s/px//g;
      if ( $name =~ m/height/g || $name =~ m/width/g ) {
      $newguts .= "$name='$value' ";
      } else {
      $newguts .= "";
      }
    }
    #print "replacing $guts with $newguts on line $count \n";
  $line =~ s/style=\"$guts\"/$newguts/i;
  }

  #print $newguts;



  print OUTFILE $line ;
  $count++;
}

exit;
+5  A: 

You will have a very difficult time with this, for a few reasons:

  • Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
  • Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
  • Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...

This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.

EDIT: Given your very specific feature set, I can give you a little advice on your implementation:

You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:

$line =~ /style=\"(.+?)\"/i;

So that you only find stuff up to the very next double-quote, not the entire content of the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:

next unless ($line =~ /style=\"(.+?)\"/i);

For parsing the guts, I'd use split instead of regex:

my $newguts;
foreach my $style (split(/;/,$guts)) {
    my ($name, $value) = split(/:/,$style);
    $newguts .= "$name='$value' ";
}
$line =~ s/style=\"$guts\"/$newguts/i;

Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.

Adam Bellaire
I should probably specify that the only attributes I want to convert are border, height, and width. Everything else is unimportant and we do by hand anyways.
NateDSaint
Also, I should probably comment that I'm not trying to create a CSS rendering engine that is fully aware of what tags have which style attributes and be aware of the way to deal with them. In the particular case that I am provided, the tool they use only stipulates the size of individual table cells, everything else is given an ID or class and styled with an external sheet which is ignored by the pdfmaker in any case.
NateDSaint
The split was where I was having a hard time, making sure to hit them all. As far as using the proper formats for perl, I was originally trying to do it as shorthand as possible for practice, but streaming an in file and an out file proved tricky, so I just did it the quick and dirty way. Thanks for your help!
NateDSaint
+3  A: 

Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.

Below is quick example using HTML::TreeBuilder which adds border="1" attribute to any tag that has style attribute with border content:

use strict;
use warnings;
use HTML::TreeBuilder;

my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};

my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );

for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
    my $prop = $style->attr( 'style' );
    $style->attr( 'border', 1 ) if $prop =~ m/border/;
}

say $tree->as_HTML;

Which will reproduce the HTML but with border="1" added just to the span tag.

In unison to these modules you can also have a look at CSS and CSS::DOM to help parse the CSS bit.

/I3az/

draegtun
Thanks for the input, and it looks awesome! I'll see if we can get this set up on our server, but for right now I've got to get a quick script method in there. But that's a much better long-term solution, thanks!
NateDSaint
+2  A: 

I don't know your stance on proprietary software, but PrinceXML is the best HTML to PDF converter available.

Adam Flott
I have no problem with proprietary software, but my boss does. We typically work with as much open-source as possible. Not necessarily to save money (since the labor costs of integration end up making it a wash in most cases) but to integrate it with our insanely customized Apache installation.
NateDSaint