What is a Perl regex that can replace selected text that is not part of an anchor tag? For example, I would like to replace only the last "text" in the following code.

blah <a href="http://www.text.com"> blah text blah </a> blah text blah.

Thanks.

A: 

Don't use regexps for this kind of thing. Use a proper HTML parser, and apply a plain regexp only to the parts of the HTML that you're interested in.

depesz
+8  A: 

You don't want to try to parse HTML with a regex. Try HTML::TreeBuilder instead.

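use strict;
use warnings;
use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_file('file.html');
# or some other method, depending on where your HTML is

doReplace($html);

# Recursively replace "text" in every text node that is not inside an <a> element
sub doReplace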
{
  my $elt = shift;

  foreach my $node ($elt->content_refs_list) {
    if (ref $$node) {
      doReplace($$node) unless $$node->tag eq 'a';
    } else {
      $$node =~ s/text/replacement/g;
    } # end else this is a text node
  } # end foreach $node

} # end doReplace
cjm
+1  A: 

I have temporarily prevailed:

$html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;

but I was dispirited, dismayed, and enervated by the seminal text; and so shall pursue Treebuilder in subsequent endeavors.

zylstra
Use of regex html parsers will cause you to wind up like Charles Dexter Ward.
daotoad
Your regex will also replace the "text" inside `<a><i>text</i></a>`, because it only looks at the first end tag.
cjm
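To make the point above concrete, here is a small check of that substitution on two inputs. It is only a sketch: the sub name `replace_outside_anchor` and both sample strings are made up for illustration. The question's sample is wrapped in `<p>...</p>` here on the assumption that more markup follows the text, since the pattern only matches when a `<` comes after it.

```perl
use strict;
use warnings;

# The substitution from the answer above, wrapped in a sub.
# (The sub name is hypothetical, chosen for this illustration.)
sub replace_outside_anchor {
    my ($html) = @_;
    $html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;
    return $html;
}

# Case 1: "text" directly inside the anchor is skipped, the later one is replaced.
my $simple = replace_outside_anchor(
    '<p>blah <a href="http://www.text.com"> blah text blah </a> blah text blah.</p>');

# Case 2: "text" nested one level deeper inside the anchor is replaced anyway,
# because the next end tag after it is </i>, not </a>.
my $nested = replace_outside_anchor('<p><a href="#"><i>text</i></a> tail</p>');

print "$simple\n";
# <p>blah <a href="http://www.text.com"> blah text blah </a> blah replacement blah.</p>
print "$nested\n";
# <p><a href="#"><i>replacement</i></a> tail</p>
```

The second result shows the link text being rewritten, which is exactly what the question wanted to avoid.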
It depends on what you're parsing. If the input is small, regular lines of HTML output by another process, for example, then a regex might be appropriate. If it is actual full HTML pages, then a proper HTML parser makes sense.
plusplus
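For the small, regular case described above, one possible sketch is to split the line on whole anchor elements and substitute only in the pieces between them. The helper name `replace_outside_anchors` and its arguments are hypothetical, and the approach assumes machine-generated markup where every `<a>` is closed, anchors are not nested, and nothing that looks like an anchor appears inside comments or attribute values. For anything less predictable, the parser-based answer is the safer route.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $from with $to everywhere except inside <a>...</a>.
# Assumes simple, regular markup: every anchor is closed and anchors are not nested.
sub replace_outside_anchors {
    my ($html, $from, $to) = @_;

    # The capturing group makes split return the anchors as well,
    # so non-anchor chunks sit at even indexes and anchors at odd ones.
    my @parts = split /(<a\b[^>]*>.*?<\/a>)/is, $html;

    for my $i (grep { $_ % 2 == 0 } 0 .. $#parts) {
        $parts[$i] =~ s/\Q$from\E/$to/g;   # \Q...\E treats $from as a literal string
    }
    return join '', @parts;
}

my $line = 'blah <a href="http://www.text.com"> blah text blah </a> blah text blah.';
my $out  = replace_outside_anchors($line, 'text', 'replacement');
print "$out\n";
# blah <a href="http://www.text.com"> blah text blah </a> blah replacement blah.
```

Because the whole anchor element is set aside before any substitution happens, text nested deeper inside the link (for example in an `<i>` within the `<a>`) is left alone as well.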