ansaurus

Question

How can I replace text that is not part of an anchor tag in Perl?

Answer 1

A:

Don't use regexps to parse the HTML itself. Use a proper HTML parser to find the parts of the HTML that you're interested in, and apply a plain regexp only to those parts.
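
As a rough illustration of that split (parser for structure, regexp for the text), here is a minimal sketch using HTML::TokeParser from the HTML-Parser distribution. The helper name replace_outside_anchors and its arguments are made up for this example; it assumes anchors are written as ordinary <a>...</a> pairs and simply counts how deep inside them the current token is.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the thread): replace every literal $from
# with $to in text that is not inside an <a>...</a> element.
sub replace_outside_anchors {
    my ($html, $from, $to) = @_;

    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";

    my $anchor_depth = 0;   # > 0 while we are inside an <a> element
    my $out          = '';

    while (my $token = $parser->get_token) {
        my $type = $token->[0];

        if ($type eq 'S') {            # start tag: [S, tag, attr, attrseq, text]
            $anchor_depth++ if $token->[1] eq 'a';
            $out .= $token->[4];
        }
        elsif ($type eq 'E') {         # end tag: [E, tag, text]
            $anchor_depth-- if $token->[1] eq 'a' && $anchor_depth > 0;
            $out .= $token->[2];
        }
        elsif ($type eq 'T') {         # text: [T, text, is_data]
            my $text = $token->[1];
            $text =~ s/\Q$from\E/$to/g unless $anchor_depth;
            $out .= $text;
        }
        else {                         # comment, declaration, PI: raw text is last
            $out .= $token->[-1];
        }
    }

    return $out;
}

my $page = '<p>text <a href="#">text</a> text</p>';
print replace_outside_anchors($page, 'text', 'replacement'), "\n";
```

Because the markup is copied through from the original token text, everything except the replaced words should come out unchanged. Malformed input, such as an unclosed <a>, would need extra handling.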

depesz 2010-01-25 10:24:50

Answer 2

+8 A:

You don't want to try to parse HTML with a regex. Try HTML::TreeBuilder instead.

use strict;
use warnings;
use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_file('file.html');
# or some other method, depending on where your HTML is

doReplace($html);

sub doReplace
{
  my $elt = shift;

  foreach my $node ($elt->content_refs_list) {
    if (ref $$node) {
      doReplace($$node) unless $$node->tag eq 'a';
    } else {
      $$node =~ s/text/replacement/g;
    } # end else this is a text node
  } # end foreach $node

} # end doReplace
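
To try the traversal on an in-memory string rather than a file, something like the following sketch may help. The wrapper replace_in_string is a hypothetical name introduced here; doReplace is repeated only so the snippet runs on its own. Note that as_HTML serializes the whole parsed tree, so the result will normally include html, head, and body elements even if the input was just a fragment.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Same traversal as doReplace above, repeated so this snippet is standalone.
sub doReplace {
    my $elt = shift;

    foreach my $node ($elt->content_refs_list) {
        if (ref $$node) {
            # Element node: recurse, but never descend into <a>
            doReplace($$node) unless $$node->tag eq 'a';
        } else {
            # Plain text node: safe to edit in place
            $$node =~ s/text/replacement/g;
        }
    }
}

# Hypothetical wrapper: parse a string, run the replacement, serialize again.
sub replace_in_string {
    my ($source) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($source);
    doReplace($tree);

    my $out = $tree->as_HTML;
    $tree->delete;          # release the tree explicitly
    return $out;
}

print replace_in_string('<p>text <a href="x">text</a> more text</p>'), "\n";
```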

cjm 2010-01-25 10:33:02

Answer 3

+1 A:

I have temporarily prevailed:

$html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;

but I was dispirited, dismayed, and enervated by the seminal text; and so shall pursue HTML::TreeBuilder in subsequent endeavors.

zylstra 2010-01-25 10:55:39

Use of regex html parsers will cause you to wind up like Charles Dexter Ward.

daotoad 2010-01-25 18:28:50

Your regex will also replace the "text" inside `<a><i>text</i></a>`, because it only looks at the first end tag.
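
A small sketch that reproduces this with core Perl only; regex_replace is just a hypothetical wrapper around the substitution from the answer above, applied to a copy of the input:

```perl
use strict;
use warnings;

# Hypothetical wrapper: run the answer's substitution on a copy of the input.
sub regex_replace {
    my ($html) = @_;
    $html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;
    return $html;
}

# The lookahead only inspects the first tag after the match, so the
# second input slips through even though the text is inside an anchor.
for my $input ('<a>text</a>', '<a><i>text</i></a>') {
    printf "%-20s => %s\n", $input, regex_replace($input);
}
# <a>text</a>          => <a>text</a>
# <a><i>text</i></a>   => <a><i>replacement</i></a>
```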

cjm 2010-01-25 19:41:51

It depends on what you're parsing. If the input is small, regular lines of HTML output by another process, for example, then a regex might be appropriate. If it is actual full HTML pages, then a proper HTML parser makes sense.

plusplus 2010-01-26 11:01:01
