ansaurus

Question

Replace specific inline CSS with HTML counterpart in Perl

Answer 1

+3 A:

From the HTML::Element docs, it appears that look_down() returns a list of HTML::Element objects. Perl objects are typically references to hashes (although they need not be) -- which is why you're getting HASH when you print $span.

At any rate, inside your for-loop, you should be able to call

 $span->method()

where method is any method of HTML::Element. For your purposes, the methods all_attr(), as_text(), and replace_with() look fairly promising.

I tried to link to each of the methods but SO didn't like the gnarly CPAN anchored links, so here's one quick link to the main doc page for convenience:

http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Element.pm
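As a rough sketch of how those three methods could fit together (the sample markup and the choice of a strong element are made up for illustration, not taken from the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample markup, invented for this illustration.
my $html = '<p><span style="font-weight:bold">bold text</span> and plain text</p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

for my $span ($tree->look_down(_tag => 'span')) {
    # ref() gives the class name instead of the HASH(0x...) you get from print.
    print ref($span), "\n";

    # all_attr() returns key/value pairs, including internal keys such as _tag.
    my %attr = $span->all_attr();
    print "style: $attr{style}\n" if defined $attr{style};

    print "text: ", $span->as_text, "\n";

    # Move the span's children into a new <strong> and swap it into the tree.
    my $strong = HTML::Element->new('strong');
    $strong->push_content($span->detach_content);
    $span->replace_with($strong)->delete;
}

# The empty hashref asks as_HTML to emit every end tag.
my $out = $tree->as_HTML(undef, "  ", {});
print $out, "\n";
```

replace_with() returns the element that was replaced, which is why delete can be chained onto it to free the old span.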

Ben Dunlap 2009-11-10 05:17:23

You're right, it only linked to one page, but I think I got the idea. I'll take a look, thanks.

Mike 2009-11-10 05:20:30

"Perl objects are just hashes internally..." Not true. Perl objects are blessed references. `bless {}, $class` works as well as `bless [], $class` or `bless do{ \(my $o = "") }, $class` do.

Chris Lutz 2009-11-10 05:25:57
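To illustrate the point about blessed references, here is a minimal sketch using only core modules. The class name My::Demo is hypothetical and has no methods; it only serves as something to bless into.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

my $class = 'My::Demo';    # hypothetical class name, for illustration only

# The same class can be attached to different kinds of references.
my %objects = (
    hash   => bless({}, $class),
    array  => bless([], $class),
    scalar => bless(do { \(my $o = "") }, $class),
);

for my $kind (sort keys %objects) {
    my $obj = $objects{$kind};
    # blessed() reports the class, reftype() the underlying data type.
    printf "%-6s blessed=%s reftype=%s\n", $kind, blessed($obj), reftype($obj);
}
```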

OK, I give. Edited accordingly.

Ben Dunlap 2009-11-10 05:33:43

Should I edit my original question with the new code I've come up with, or is there some better way to do it? Adding it in a comment won't be very nice, and it might get eaten by the system.

Mike 2009-11-10 06:13:23

Answer 2

+2 A:

Mike,
The problem is that Perl's debugger unfortunately does not show you the type of each element, because the object system is just a wrapper around the standard data types. You therefore cannot find the relevant attributes and methods without looking at the documentation and/or the code. About Objects gives you more details about this.
Every $span will be an HTML::Element object; Ben's answer covers this part. I guess you will just change some attributes inside the tree and then save the tree to a new file.
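For what it's worth, a few core tools can narrow things down at runtime even when the documentation is not at hand. The sketch below uses a hypothetical stand-in class, My::Node, so that it runs without any CPAN modules; the same calls should work on an HTML::Element object.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use Data::Dumper;

# Hypothetical stand-in class, defined here only for the example.
{
    package My::Node;
    sub new     { my ($class, %args) = @_; return bless {%args}, $class }
    sub as_text { return $_[0]{text} }
}

my $node = My::Node->new(text => 'hello');

# blessed() returns the class the reference was blessed into.
print "class: ", blessed($node), "\n";

# can() and isa() answer "does it have this method?" and "is it this class?".
print "has as_text\n"          if $node->can('as_text');
print "lacks replace_with\n"   unless $node->can('replace_with');
print "is a My::Node\n"        if $node->isa('My::Node');

# Data::Dumper shows the underlying data along with the class name.
print Dumper($node);
```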

weismat 2009-11-10 05:19:42

+1 for link to About Objects doc.

Ben Dunlap 2009-11-10 05:26:21

Thanks for that. I had kinda guessed that was why I couldn't just print `$span`. That's a good article.

Mike 2009-11-10 06:15:46

Answer 3

+1 A:

By using HTML::TreeBuilder you are definitely on the right track; for parsing CSS, I've just found CSS::DOM. It is a really interesting module, which allows you to access properties with little effort.

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;
use CSS::DOM::Style;

my $html = <<HTML;
<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What's here doesn't matter so much as what needs to ha>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);


my @replacements = (
    { property => 'font-style', value => 'italic', replacement => 'em' },
    { property => 'font-weight', value => 'bold', replacement => 'strong' },
    { property => 'text-align', value => 'center', replacement => 'center' },
);

# build a sensible list of tag names (or just use sub { 1 })
my @nodes = $tb->look_down(sub { $_[0]->tag =~ /^(p|span)$/ });

for my $el (@nodes) {
    if ($el->attr('style')) {
        my $st = CSS::DOM::Style::parse($el->attr('style'));
        if ($st) {
            foreach my $h (@replacements) {
                if ($st->getPropertyValue($h->{property}) eq $h->{value}) {
                    $st->removeProperty($h->{property});
                    my $new = HTML::Element->new($h->{replacement});
                    foreach my $inner ($el->detach_content) {
                        $new->push_content($inner);
                    }
                    $el->push_content($new);
                }
            }
            $el->attr('style', $st->cssText ? $st->cssText : undef);
        }
    }
}

print $tb->as_HTML(undef, "\t");

Leonardo Herrera 2009-11-10 19:06:45

I had originally discarded CSS::DOM because the CPAN page I read made it out to be more for external CSS than inline (or even internal CSS, at the top of the page). I'll give your code a test as soon as I install CSS::DOM. Thanks!

Mike 2009-11-10 19:29:42

Awesome! It worked, and I ran a bit of regex to clean up the errant `span` that was still hanging around: `Some random text here. What's here doesn't matter so much as what needs to happen around it. And sometimes not all the text is styled the same.` Now I just need to figure out how to make this work for the `p` tags as well, and we'll be golden.

Mike 2009-11-10 19:44:48

Check it out now. I had a problem with the way I used 'detach_content'. Also, take a look at how to build a list of all allowed nodes to parse.

Leonardo Herrera 2009-11-10 20:06:26

Excellent! I'll post my slightly tweaked version as a new answer, so you can see what I've done with it. This is exactly what I needed! Thank you, thank you, thank you. Also, I don't know if you noticed, but `as_HTML` seems to lop off the ending `p` tag. I fixed it by adding an empty hashref (`{}`) as the third param (as per the HTML::TreeBuilder docs).
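A small sketch of that as_HTML difference, assuming HTML::TreeBuilder is installed. The two-paragraph input is made up, and the exact whitespace of the output may vary between HTML::Tree versions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample input with two paragraphs.
my $tree = HTML::TreeBuilder->new_from_content('<p>first</p><p>second</p>');

# With no third argument, end tags that HTML treats as optional may be omitted.
my $default = $tree->as_HTML(undef, "\t");

# An empty hashref means no end tag is treated as optional.
my $all_close = $tree->as_HTML(undef, "\t", {});

print "default:\n$default\n";
print "with {}:\n$all_close\n";

$tree->delete;
```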

Mike 2009-11-10 20:34:26

Scratch that. It doesn't want me to answer my own question. :P Here's the code: http://pastebin.com/f75bfd1a5

Mike 2009-11-10 20:40:34

Glad to be of help.

Leonardo Herrera 2009-11-11 12:29:37
