I have a bunch of HTML files, and I want to look in each one for the keyword 'From Argumbay' and replace it with a link (href) that I have. I thought it was very simple at first: I opened each HTML file, loaded its content into an array (list), replaced each keyword with s///, and dumped the contents back to the file. What's the problem? Sometimes the keyword also appears inside an existing href, in which case I don't want it replaced, or it appears inside some tags and such.

An EXAMPLE: http://www.astrosociety.org/education/surf.html

I would like my script to replace each occurrence of the word 'here' with the link I have in $href, but as you can see, there is another 'here' which is already linked, and I don't want it linked again. In this case there aren't any other 'here's besides the one in the href, but let's assume that there are.

I want to replace the keyword only if it's plain text. Any ideas?

BOUNTY EDIT: Hi, I believe it's a simple thing, but it seems to erase all the comments found in the HTML/SHTML file (the main issue is that it erases the SSIs in SHTML files). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I'm missing here?

+3  A: 

If tags matter in your search and replace, you'll need to use HTML::Parser.

This tutorial looks a bit easier to understand than the documentation with the module.
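
As a rough illustration of the idea (a sketch only, not necessarily how the tutorial does it), HTML::Parser can stream the document through handlers, rewrite only the text that is not inside an <a> element, and pass everything else, comments and SSI directives included, through unchanged. The keyword, the example.com link, and the sample markup are made up for the demonstration:

```perl
use strict;
use warnings;
use HTML::Parser;

my $keyword = 'here';                  # text to turn into a link
my $href    = 'http://example.com/';   # hypothetical link target
my $in_link = 0;                       # nesting depth inside <a>...</a>
my $out     = '';                      # rewritten document

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $text) = @_;
        $in_link++ if $tag eq 'a';
        $out .= $text;                 # copy the tag verbatim
    }, 'tagname, text' ],
    end_h => [ sub {
        my ($tag, $text) = @_;
        $in_link-- if $tag eq 'a' && $in_link > 0;
        $out .= $text;
    }, 'tagname, text' ],
    text_h => [ sub {
        my ($text, $is_cdata) = @_;
        # Leave existing links and <script>/<style> bodies untouched
        $text =~ s{(\Q$keyword\E)}{<a href="$href">$1</a>}g
            unless $in_link || $is_cdata;
        $out .= $text;
    }, 'text, is_cdata' ],
    # Comments, declarations, etc. are copied through as-is
    default_h => [ sub { $out .= shift }, 'text' ],
);
$parser->unbroken_text(1);             # deliver each text run in one piece

$parser->parse('<!--#include virtual="/head.html" --><p>Click <a href="/x">here</a> or here.</p>');
$parser->eof;

print "$out\n";
```

Because nothing is rebuilt into a tree, the original markup is echoed as written and only the matched text changes.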

bemace
Can I use HTML::TreeBuilder instead? I'm asking because I've never used either of them.
soulSurfer2010
@soulSurfer2010, yes HTML::TreeBuilder can help you do that. (It's built on top of HTML::Parser.)
cjm
@soulSurfer2010 Yeah, that looks like it would work too. The real point I was making is that you'll need to actually parse the HTML, not just apply regexes to the source, which is what I'm guessing you're doing based on what little info you provided.
bemace
Yes, I tried using regexes and everything worked fine until I hit something like 'From Argum bay in love' that was already inside a link. My script then linked it again, which is not what I'm looking for. I want to replace the text with my href (= hyperlink) only if it is NOT already linked.
soulSurfer2010
Well, I can use HTML::TreeBuilder or HTML::TokeParser to find out whether a keyword is already linked, but my problem at the moment is: if it's not, how do I replace it with my href? I'm parsing the file with the module rather than working directly on a list where I can replace things and then print to a file. Any suggestions?
soulSurfer2010
Can you edit your question to include the code you've got so far?
bemace
The code doesn't actually matter (I also don't have it here at the moment). What matters is how to edit the HTML files (or create new ones) the way I described above; see also the example. Thanks!
soulSurfer2010
+5  A: 

To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.

A common idiom with HTML-Tree is to use a recursive function that modifies the tree:

use strict;
use warnings;
use 5.008;

use File::Slurp 'read_file';
use HTML::TreeBuilder;

sub replace_keyword
{
  my $elt = shift;

  return if $elt->is_empty;

  $elt->normalize_content;      # Make sure text is contiguous

  my $content = $elt->content_array_ref;

  for (my $i = 0; $i < @$content; ++$i) {
    if (ref $content->[$i]) {
      # It's a child element, process it recursively:
      replace_keyword($content->[$i])
          unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
    } else {
      # It's text:
      if ($content->[$i] =~ /here/) { # your keyword or regexp here
        $elt->splice_content(
          $i, 1, # Replace this text element with...
          substr($content->[$i], 0, $-[0]), # the pre-match text
          # A hyperlink with the keyword itself:
          [ a => { href => 'http://example.com' },
            substr($content->[$i], $-[0], $+[0] - $-[0]) ],
          substr($content->[$i], $+[0])   # the post-match text
        );
      } # end if text contains keyword
    } # end else text
  } # end for $i in content index
} # end replace_keyword


my $content = read_file('foo.shtml');

# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
$html->eof;                     # Tell the parser the input is complete

my $body = $html->look_down(qw(_tag body));
replace_keyword($body);

# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;

print STDOUT $content; # Replace STDOUT with a suitable filehandle

The output from as_HTML will be syntactically correct HTML, but not necessarily nicely-formatted HTML for people to view the source of. You can use HTML::PrettyPrinter to write out the file if you want that.
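
The three substr calls in replace_keyword rely on Perl's @- and @+ arrays: after a successful match, $-[0] is the offset where the match starts and $+[0] is the offset just past its end. A minimal standalone sketch of that split, using a made-up sample string:

```perl
use strict;
use warnings;

my $text = 'Click here to continue';   # sample text node (made up)

my ($before, $match, $after);
if ($text =~ /here/) {
    # $-[0] = start offset of the match, $+[0] = offset just past its end
    $before = substr($text, 0, $-[0]);               # pre-match text
    $match  = substr($text, $-[0], $+[0] - $-[0]);   # the keyword itself
    $after  = substr($text, $+[0]);                  # post-match text
    print "[$before][$match][$after]\n";             # [Click ][here][ to continue]
}
```

In the tree version, the middle piece becomes the content of the new <a> element, and the post-match text is examined again on a later pass of the loop, so repeated keywords in one text node are all linked.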

cjm
WOOOOOOOOOOOOOOOOOOOOOOOOW! Seriously man, where did you come from? I couldn't ask for a better solution! Amazing. It works perfectly, but now I will need a few hours to understand what you did there (-: Thanks a lot!
soulSurfer2010
You might also search for other StackOverflow questions that ask the same thing (and often have the same answer). HTML::TreeBuilder makes frequent appearances here.
brian d foy
Hi, I believe it's a simple thing, but it seems to erase all the comments found in the HTML/SHTML file (the main issue is that it erases the SSIs in SHTML files). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I'm missing here?
soulSurfer2010
@soulSurfer2010, you can't use `new_from_file` if you want to keep comments, because you have to call `store_comments` _before_ loading the file. Instead, call `new`, then `store_comments`, then `parse_file`.
cjm
Well, it works, but the HTML tree structure is acting weird. It takes all the comments and moves them to the bottom of the page. This is a sample SHTML file: http://uploading.com/files/73dc4545/0000394092.shtml/ Is this a bug?
soulSurfer2010
This is exactly what happens to me: http://programming.itags.org/perl/40337/
soulSurfer2010
@soulSurfer2010, that's a current limitation of HTML::TreeBuilder. Everything is a child of the <html> node, even comments that appeared before or after it.
cjm
)-: So basically this type of thing can't be done so easily, I guess? Weird that conceptually it seems so simple, but technically... \-:
soulSurfer2010
@soulSurfer2010, I've added a workaround to my example. Since your example isn't a complete HTML document, you can wrap it in a `<body>` tag, and the comments won't get rearranged.
cjm
I still haven't checked it, and I even forgot that I set a bounty. You got it, and frankly, you deserve much more!!! Anyway, I will check it later when I get home. BTW, how do I become an expert like you? (-:
soulSurfer2010
A: 

You can use a regular-expression-only method if you're prepared to accept the following provisos:

  • this will not work correctly within HTML comments
  • this will not work where the < or > character is used within a tag
  • this will not work where the < or > character is used and not part of a tag
  • this will not work where a tag spans multiple lines (if you're processing one line at a time)

If any of the above conditions apply to your files, you will have to use one of the HTML/XML parsing strategies outlined in the other answers.

Otherwise:

my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";

1 while $html =~ s/
  \A             # beginning of string
  (              # group all non-searchfor text
    (            # sub group non-tag followed by tag
      [^<]*?     # non-tags (non-greedy)
      <[^>]*>    # whole tags
    )*?          # zero or more (non-greedy)
    [^<]*?       # text between the last tag and the search text
  )
  \Q$searchfor\E # search text
/$1$replacewith/sx;

Note that this will NOT work if $searchfor matches $replacewith, because the loop would keep matching its own output (so don't put "From Argumbay" back into the replacement text).
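
A self-contained run of this idea on a sample string might look like the sketch below. The sample markup and the example.com link are made up. Note the `[^<]*?` just before the keyword, which lets plain text sit between the last tag and the keyword:

```perl
use strict;
use warnings;

my $searchfor   = 'From Argumbay';
my $replacewith = q{<a href="http://example.com/">From_Argumbay</a>};  # hypothetical link

# The keyword appears once inside a tag (must be skipped) and once as text
my $html = '<p title="From Argumbay">Greetings From Argumbay!</p>';

1 while $html =~ s/
  \A
  (
    (?: [^<]*? <[^>]*> )*?   # text followed by a whole tag, repeated
    [^<]*?                   # text between the last tag and the keyword
  )
  \Q$searchfor\E             # the keyword, outside any tag
/$1$replacewith/sx;

print "$html\n";
```

The title attribute is left alone because a match can never begin inside `<...>`. Text inside an existing <a>...</a> element would still be replaced, though, which is the case the parser-based answers handle.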

PP
I had already come up with a similar solution a few minutes before visiting this site today, and I couldn't accept these provisos. Thanks!
soulSurfer2010