Real quick background : We have a PDFMaker (HTMLDoc) that converts html into a pdf. HTMLDoc doesn't consistently pick up the styles that we need from the html that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to height=80 width=90.
My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.
I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!
NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).
Here's what I've got so far:
#!/usr/bin/perl
if (!@ARGV[0] || !@ARGV[1])
{
print "Usage: converter.pl [input file] [output file] \n";
exit;
}
open FILE, "<", @ARGV[0] or die $!;
open OUTFILE, ">", @ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
$line = $_ ;
$line =~ /style=\"(.+)\"/;
$guts = $1;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$name = $1;
$value = $2;
$guts = $name."=".$value;
$line =~ s/style=\"(.+)\"/$guts/g;
print OUTFILE $line ;
}
exit;
Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.
UPDATE
For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...
#!/usr/bin/perl
## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values.
# All of these can be modified in the second foreach loop, which checks for height and width.
if (!@ARGV[0] || !@ARGV[1])
{
print "Usage: quickvert.pl [input file] [output file] \n";
exit;
}
open FILE, "<", @ARGV[0] or die $!;
open OUTFILE, ">", @ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
$line = $_ ;
my (@match) = $line =~ /style=\"(.+?)\"/g;
my $guts;
my $newguts;
foreach (@match) {
#print $_ ."\n";
$guts = $_;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$newguts = "";
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$value =~ s/px//g;
if ( $name =~ m/height/g || $name =~ m/width/g ) {
$newguts .= "$name='$value' ";
} else {
$newguts .= "";
}
}
#print "replacing $guts with $newguts on line $count \n";
$line =~ s/style=\"$guts\"/$newguts/i;
}
#print $newguts;
print OUTFILE $line ;
$count++;
}
exit;