Hi,

I have a list of words that should be replaced on an HTML page, but only where the word is not inside one of a list of tags (such as A, B, I).

So if the text is:

<p> some text and XXX term <a href="http://some-XXX-bla.com">good morning XXX world</a> other text and XXX term <b>another XXX inside other sentence</b> </p>

and XXX should be replaced with YYY, then the final text should be:

<p> some text and YYY term <a href="http://some-XXX-bla.com">good morning XXX world</a> other text and YYY term <b>another XXX inside other sentence</b> </p>

YYY replaces XXX only where XXX is not inside one of the restricted tags (A, I, B).

This should be done with a C# regex, if possible.

Thanks a lot for help :)

+6  A: 

This has been said many times, but I may as well repeat it here... You really don't want to use regex for HTML parsing. It's simply not suited to the complexities of HTML (it's a lot harder to parse with regex than it may first seem).

The best option for .NET is the HTML Agility Pack, which is a very robust library that can parse any form of HTML "soup" correctly. It's also a lot easier to manipulate, since it exposes a DOM structure. This would enable you to simply traverse the DOM and easily check the parent/ancestor nodes so that the replacements can be performed by changing the InnerText property of the appropriate element. When you're all finished, it's a simple call to output the raw HTML from the modified DOM object.
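
As a rough illustration of the traversal described above, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The names `HapSketch` and `ReplaceOutsideTags` and the restricted-tag set are made up for this example. It sets `HtmlTextNode.Text` on each text node rather than `InnerText` on an element, which is one way to do the replacement and not necessarily the only one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed to be installed

public static class HapSketch
{
    // Tags whose contents must be left alone (taken from the question).
    static readonly HashSet<string> Restricted =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "b", "i" };

    // Hypothetical helper: replaces target with replacement in every text
    // node that has no restricted element among its ancestors.
    public static string ReplaceOutsideTags(string html, string target, string replacement)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the text nodes first so the tree is not walked while edited.
        var textNodes = doc.DocumentNode.Descendants()
                           .OfType<HtmlTextNode>()
                           .ToList();

        foreach (HtmlTextNode node in textNodes)
        {
            bool insideRestricted =
                node.Ancestors().Any(a => Restricted.Contains(a.Name));
            if (!insideRestricted)
                node.Text = node.Text.Replace(target, replacement);
        }

        // Serialize the modified DOM back to markup.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Because only text nodes are touched, attribute values such as the `href` in the question are never considered for replacement.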

Noldorin
I agree, regex is not suitable for parsing. But this is a slightly different situation: all that is needed is to replace text in a text file. I looked at the HTML Agility Pack and found it very interesting for HTML parsing and transforming, but not for text replacement.
Oh, when it comes to replacing the actual text, you just modify the value of the element.InnerText property; for this you can probably just get away with `string.Replace`. If you really must, use regex in conjunction with the HTML Agility Pack. (Correct me if I've misunderstood you please.)
Noldorin
+2  A: 

You could use a MatchEvaluator. The idea is that you match either a complete element of one of the types on your list, or the target string. If you match a complete element, you just plug it back in--you don't care if it contains the target string. Otherwise, you insert the replacement text.

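// Requires: using System.Text.RegularExpressions;

// If group 1 matched, the match is a whole <a>, <b> or <i> element: put it back unchanged.
// Otherwise the match is a bare XXX: substitute the replacement text.
public static string GetReplacement(Match m) {
    return m.Groups[1].Success ? m.Groups[1].Value : "YYY";
}

// (?i) ignores case, (?s) lets . match newlines; \2 refers to the captured tag name.
Regex r = new Regex( @"(?is)(<([abi]\b)[^<>]*>.*?</\2>)|XXX" );
string newString = r.Replace(oldString,
                   new MatchEvaluator(GetReplacement));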

But be aware that there are many circumstances where this code would fail, even in valid (X)HTML. For example, an element could be nested inside another element of the same kind, like this:

<i>blah <i>blah</i> XXX</i>

Or a start or end tag inside a comment could trip you up:

<b>blah <!-- </b> --> XXX</b>
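
To make those two failure modes concrete, here is a small console sketch that runs the same pattern over a simple input and over both problem inputs. The class and method names (`RegexFailureDemo`, `Replace`) are just for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexFailureDemo
{
    // Same pattern as in the answer: a whole a/b/i element, or the bare target.
    static readonly Regex Pattern =
        new Regex(@"(?is)(<([abi]\b)[^<>]*>.*?</\2>)|XXX");

    public static string Replace(string html)
    {
        // Keep protected elements as they are, replace everything else that matched.
        return Pattern.Replace(html,
            m => m.Groups[1].Success ? m.Groups[1].Value : "YYY");
    }

    public static void Main()
    {
        // Simple case: works as intended.
        Console.WriteLine(Replace("some XXX <b>keep XXX</b>"));
        // prints: some YYY <b>keep XXX</b>

        // Nested element: the lazy .*? stops at the first </i>,
        // so the XXX that follows is wrongly treated as unprotected.
        Console.WriteLine(Replace("<i>blah <i>blah</i> XXX</i>"));
        // prints: <i>blah <i>blah</i> YYY</i>

        // End tag inside a comment: the element match ends too early.
        Console.WriteLine(Replace("<b>blah <!-- </b> --> XXX</b>"));
        // prints: <b>blah <!-- </b> --> YYY</b>
    }
}
```

In both problem cases the XXX is still inside the outer element, yet it gets replaced.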

You could handle many of the potential problems by making the regex and the MatchEvaluator code more complicated, but eventually you either have to accept a few flaws, or switch to a dedicated HTML parser like the one Noldorin recommended.
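
As one example of what "more complicated" might look like, the sketch below extends the pattern so that comments are consumed as a unit, both at the top level and inside a protected element. It assumes you also want text inside comments left untouched, which the question does not actually specify. The class name `CommentAwareDemo` is made up for the example. Nested elements of the same kind still defeat it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentAwareDemo
{
    // Group 1 is either a whole comment, or a whole a/b/i element whose
    // body is scanned comment-by-comment or character-by-character, so an
    // end tag hidden inside a comment no longer closes the element.
    static readonly Regex Pattern = new Regex(
        @"(?is)(<!--.*?-->|<([abi]\b)[^<>]*>(?:<!--.*?-->|.)*?</\2>)|XXX");

    public static string Replace(string html)
    {
        return Pattern.Replace(html,
            m => m.Groups[1].Success ? m.Groups[1].Value : "YYY");
    }

    public static void Main()
    {
        // The comment example from above is now left alone.
        Console.WriteLine(Replace("<b>blah <!-- </b> --> XXX</b>"));
        // prints: <b>blah <!-- </b> --> XXX</b>

        // Comments outside any element are skipped as well.
        Console.WriteLine(Replace("<!-- XXX --> XXX"));
        // prints: <!-- XXX --> YYY
    }
}
```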

Alan Moore