ansaurus

Question

How can I modify HTML files in Perl?

Answer 1

+3 A:

If tags matter in your search and replace, you'll need to use HTML::Parser.

This tutorial looks a bit easier to understand than the documentation with the module.

bemace 2010-10-10 15:50:13

Can I use HTML::TreeBuilder instead? I'm asking because I've never used either of them.

soulSurfer2010 2010-10-10 15:58:29

@soulSurfer2010, yes HTML::TreeBuilder can help you do that. (It's built on top of HTML::Parser.)

cjm 2010-10-10 16:07:50

@soulSurfer2010 Yeah, that looks like it would work too. The real point I was making is that you'll need to actually parse the HTML, not just apply regexes to the source, which is what I'm guessing you're doing based on what little info you provided.

bemace 2010-10-10 16:09:33

Yes, I tried using regexes and everything worked fine until I hit something like this: 'From Argum bay in love', which was already inside an href. My script then wrapped it in an href again, which is not what I'm looking for. I only want to replace the text with my href (hyperlink) if it is NOT already href'ed.

soulSurfer2010 2010-10-10 16:19:11

Well, I can use HTML::TreeBuilder or HTML::TokeParser to find out whether a keyword is already href'ed. My problem at the moment is: if it's not, how do I replace it with my href? I'm parsing the file through the module rather than working directly on a list of lines that I can edit and then print to a file. Any suggestions?

soulSurfer2010 2010-10-10 16:37:12

Can you edit your question to include the code you've got so far?

bemace 2010-10-10 17:22:36

The code doesn't actually matter (I also don't have it here at the moment). What matters is how to edit the HTML files (or create new ones) in the way I described above; see also the example. Thanks!

soulSurfer2010 2010-10-10 19:00:35

Answer 2

+5 A:

To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.

A common idiom with HTML-Tree is to use a recursive function that modifies the tree:

use strict;
use warnings;
use 5.008;

use File::Slurp 'read_file';
use HTML::TreeBuilder;

sub replace_keyword
{
  my $elt = shift;

  return if $elt->is_empty;

  $elt->normalize_content;      # Make sure text is contiguous

  my $content = $elt->content_array_ref;

  for (my $i = 0; $i < @$content; ++$i) {
    if (ref $content->[$i]) {
      # It's a child element, process it recursively:
      replace_keyword($content->[$i])
          unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
    } else {
      # It's text:
      if ($content->[$i] =~ /here/) { # your keyword or regexp here
        $elt->splice_content(
          $i, 1, # Replace this text element with...
          substr($content->[$i], 0, $-[0]), # the pre-match text
          # A hyperlink with the keyword itself:
          [ a => { href => 'http://example.com' },
            substr($content->[$i], $-[0], $+[0] - $-[0]) ],
          substr($content->[$i], $+[0])   # the post-match text
        );
      } # end if text contains keyword
    } # end else text
  } # end for $i in content index
} # end replace_keyword


my $content = read_file('foo.shtml');

# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
$html->eof;                   # Tell the parser that the input is complete

my $body = $html->look_down(qw(_tag body));
replace_keyword($body);

# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;

print STDOUT $content; # Replace STDOUT with a suitable filehandle

The output from as_HTML will be syntactically correct HTML, but it is not necessarily formatted nicely for people who read the source. If you want readable output, you can use HTML::PrettyPrinter to write out the file.
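The three substr calls in replace_keyword rely on Perl's built-in @- and @+ arrays, which hold the start and end offsets of the last successful match. The following is only a minimal sketch of that splitting step in isolation; the helper name split_around_match and the sample string are made up for illustration and are not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): split a string into the text
# before the first match, the matched text, and the text after it.
sub split_around_match {
    my ($text, $regex) = @_;

    return unless $text =~ $regex;        # no match: return nothing

    my ($start, $end) = ($-[0], $+[0]);   # offsets of the whole match

    return (
        substr($text, 0, $start),              # the pre-match text
        substr($text, $start, $end - $start),  # the keyword itself
        substr($text, $end),                   # the post-match text
    );
}

my ($pre, $match, $post) = split_around_match('click here for more', qr/here/);
print "[$pre][$match][$post]\n";   # prints "[click ][here][ for more]"
```

In replace_keyword, the middle piece becomes the content of the new <a> element, while the first and last pieces stay behind as ordinary text nodes.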

cjm 2010-10-11 00:17:45

WOOOOOOOOOOOOOOOOOOOOOOOOW! Seriously man, where did you come from? I couldn't ask for a better solution! Amazing. It works perfectly, but now I will need a few hours to understand what you did there (-: Thanks a lot!

soulSurfer2010 2010-10-11 08:39:34

cjm 2010-10-11 16:19:22

You might also search for other StackOverflow questions that ask the same thing (and often have the same answer). HTML::TreeBuilder makes frequent appearances here.

brian d foy 2010-10-11 17:44:51

Hi, I believe it's a simple thing, but it seems to erase all the comments found in the HTML/SHTML file (the main issue is that it erases the SSIs in SHTML files). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I am missing here?

soulSurfer2010 2010-10-14 10:00:55

@soulSurfer2010, you can't use `new_from_file` if you want to keep comments, because you have to call `store_comments` _before_ loading the file. Instead, call `new`, then `store_comments`, then `parse_file`.

cjm 2010-10-14 15:17:09

Well, it works, but the HTML tree structure is acting weird: it takes all the comments and moves them to the bottom of the page. This is a sample SHTML file: http://uploading.com/files/73dc4545/0000394092.shtml/ Is this a bug?

soulSurfer2010 2010-10-14 19:00:47

This is exactly what happens to me: http://programming.itags.org/perl/40337/

soulSurfer2010 2010-10-14 19:05:17

@soulSurfer2010, that's a current limitation of HTML::TreeBuilder. Everything is a child of the <html> node, even comments that appeared before or after it.

cjm 2010-10-14 21:50:03

)-: So basically this type of thing can't be done so easily, I guess? Weird that conceptually it seems so simple, but technically... \-:

soulSurfer2010 2010-10-14 22:29:58

@soulSurfer2010, I've added a workaround to my example. Since your example isn't a complete HTML document, you can wrap it in a `<body>` tag, and the comments won't get rearranged.

cjm 2010-10-14 22:48:21

I still haven't checked it, and I even forgot that I set a bounty. You got it, and frankly, you deserve much more! Anyway, I will check it later when I get home. By the way, how do I become an expert like you? (-:

soulSurfer2010 2010-10-15 05:53:42

Answer 3

A:

If you want to use a regular-expression-only approach, you must be prepared to accept the following provisos:

this will not work correctly within HTML comments
this will not work where the < or > character is used within a tag
this will not work where the < or > character is used and not part of a tag
this will not work where a tag spans multiple lines (if you're processing one line at a time)

If any of the above conditions exist, you will have to use one of the HTML/XML parsing strategies outlined in the other answers.

Otherwise:

my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";

1 while $html =~ s/
  \A             # beginning of string
  (              # group all non-searchfor text
    (            # sub group non-tag followed by tag
      [^<]*?     # non-tags (non-greedy)
      <[^>]*>    # whole tags
    )*?          # zero or more (non-greedy)
    [^<]*?       # non-tag text between the last tag and the search text
  )
  \Q$searchfor\E # search text
/$1$replacewith/sx;

Note that this will NOT work if $replacewith contains $searchfor, because the loop would keep matching its own output and never terminate (so don't put "From Argumbay" back into the replacement text).
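For reference, here is a minimal self-contained sketch of this approach. The helper name link_keyword, the sample input, and the example.com URL are all made up for illustration. It also assumes one addition to the pattern: a trailing [^<]*? inside the first group, so that plain text may sit between the last tag and the search text. The same provisos apply, and the pattern only skips text inside tags; it does not check whether the keyword already sits between <a> and </a>.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only) wrapping the substitution loop.
sub link_keyword {
    my ($html, $searchfor, $replacewith) = @_;

    # Guard against the endless loop mentioned in the note above.
    die "replacement must not contain the search text\n"
        if index($replacewith, $searchfor) >= 0;

    1 while $html =~ s/
      \A               # beginning of string
      (                # everything before the search text
        (?:
          [^<]*?       # text outside tags (non-greedy)
          <[^>]*>      # a whole tag
        )*?            # zero or more (non-greedy)
        [^<]*?         # assumed addition, text after the last tag
      )
      \Q$searchfor\E   # the search text itself
    /$1$replacewith/sx;

    return $html;
}

my $sample = q{<img alt="From Argumbay"> Greetings From Argumbay in love};
print link_keyword(
    $sample,
    'From Argumbay',
    q{<a href='http://example.com/'>Argumbay</a>},
), "\n";
```

With this sample, the occurrence inside the alt attribute is left alone and only the occurrence in the text is wrapped in a link.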

PP 2010-10-11 08:08:41

I had already come up with a similar solution a few minutes before visiting this site today, and I couldn't accept these provisos. Thanks!

soulSurfer2010 2010-10-11 08:40:28
