ansaurus

Question

Replace text if it's not inside certain specified HTML tags

Answer 1

+6 A:

This has been said many times, but I may as well repeat it here... You really don't want to use regex for HTML parsing. It's simply not suited to the complexities of HTML (it's a lot harder to parse with regex than it may first seem).

The best option for .NET is the HTML Agility Pack, a very robust library that copes even with malformed HTML "soup". It's also a lot easier to manipulate, since it exposes a DOM structure. You can simply traverse the DOM and check each node's parent/ancestor nodes, so that the replacements are performed by changing the InnerText property of the appropriate element. When you're all finished, it's a simple call to output the raw HTML from the modified DOM object.
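A rough sketch of that traversal, assuming the HtmlAgilityPack NuGet package is referenced. The class name `TextReplacer`, the method `ReplaceOutsideTags` and its parameters are made up for illustration. It writes through `HtmlTextNode.Text` and serializes with `HtmlDocument.Save`; check both against the version of the library you install.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class TextReplacer
{
    // Hypothetical helper: replaces `target` with `replacement` in every text
    // node that has no ancestor element named in `protectedTags`.
    public static string ReplaceOutsideTags(
        string html, string target, string replacement, params string[] protectedTags)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null) return html;

        foreach (var node in textNodes.OfType<HtmlTextNode>())
        {
            // Skip text that sits anywhere below one of the protected elements.
            bool isProtected = node.Ancestors().Any(a =>
                protectedTags.Contains(a.Name, StringComparer.OrdinalIgnoreCase));

            if (!isProtected)
                node.Text = node.Text.Replace(target, replacement);
        }

        // Serialize the modified DOM back to markup.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

Calling `TextReplacer.ReplaceOutsideTags("<p>XXX <b>XXX</b></p>", "XXX", "YYY", "a", "b", "i")` should replace only the XXX that sits directly in the paragraph and leave the one inside the b element alone. Because the check walks all ancestors, nesting depth doesn't matter.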

Noldorin 2009-06-06 22:40:16

I agree that regex is not suitable for parsing. But this is a slightly different situation: all that's needed is to replace text in a text file. I looked at the HTML Agility Pack and found it very interesting for HTML parsing and transformation, but not for text replacement.

2009-06-06 22:46:30

Oh, when it comes to replacing the actual text, you just modify the value of the element.InnerText property - for this you can probably just get away with `string.Replace`. If you really must, use regex in conjunction with the HTML Agility Pack. (Please correct me if I've misunderstood you.)

Noldorin 2009-06-06 22:50:32

Answer 2

+2 A:

You could use a MatchEvaluator. The idea is that you match either a complete element of one of the types on your list, or the target string. If you match a complete element, you just plug it back in--you don't care if it contains the target string. Otherwise, you insert the replacement text.

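// Requires: using System.Text.RegularExpressions;

// Group 1 succeeds when a whole <a>, <b> or <i> element was matched:
// put it back unchanged. Otherwise the match was the bare target string.
public string GetReplacement(Match m) {
    return m.Groups[1].Success ? m.Groups[1].Value : "YYY";
}

// (?is): ignore case, and let '.' match newlines as well.
Regex r = new Regex( @"(?is)(<([abi]\b)[^<>]*>.*?</\2>)|XXX" );
string newString = r.Replace(oldString,
                   new MatchEvaluator(GetReplacement));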

But be aware that there are many circumstances where this code would fail, even in valid (X)HTML. For example, an element could be nested inside another element of the same kind, like this:

<i>blah <i>blah</i> XXX</i>

Or a start or end tag inside a comment could trip you up:

<b>blah <!-- </b> --> XXX</b>

You could handle many of the potential problems by making the regex and the MatchEvaluator code more complicated, but eventually you either have to accept a few flaws, or switch to a dedicated HTML parser like the one Noldorin recommended.
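As one illustration of "more complicated", here is a sketch that makes the pattern treat HTML comments as opaque, both on their own and inside a protected element. The class name `CommentAwareReplacer` is made up for the example, and the XXX/YYY strings are the same placeholders as above. It still does not cope with an element nested inside another element of the same kind.

```csharp
using System.Text.RegularExpressions;

public static class CommentAwareReplacer
{
    // Group 1 is either a whole comment, or a whole <a>/<b>/<i> element whose
    // body may contain comments. The atomic group (?>...) consumes a comment
    // as a single unit, so a "</b>" inside a comment cannot close the element.
    private static readonly Regex Pattern = new Regex(
        @"(?is)(<!--.*?-->|<([abi])\b[^<>]*>(?><!--.*?-->|.)*?</\2>)|XXX");

    public static string Replace(string input)
    {
        // Anything captured in group 1 is copied back untouched;
        // a bare XXX is swapped for the replacement text.
        return Pattern.Replace(input,
            m => m.Groups[1].Success ? m.Groups[1].Value : "YYY");
    }
}
```

With this pattern, the XXX in `<b>blah <!-- </b> --> XXX</b>` stays as it is, whereas the original pattern would end the element at the commented-out `</b>` and replace the XXX that follows.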

Alan Moore 2009-06-07 04:15:51
