I have a string containing HTML and I need to turn some words into links. I do this with the following code:

string lNewHTML = Regex.Replace(lOldHTML, @"(\bword1\b|\bword2|word3\b)", "<a href=\"page.aspx#$1\">$1</a>", RegexOptions.IgnoreCase);

The code works, but I need to add some exceptions to the replacement: nothing inside an img, li or a tag should be replaced (including link text and attributes such as href and title), while replacements should still be allowed inside p, td and div tags.

Can anyone figure this one out?

+1  A: 

You need to use the Replace overload that takes a MatchEvaluator parameter, so that you can examine each match and decide whether or not to replace it.
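A minimal sketch of one way to use that overload, under assumptions that are not from the question: the pattern also matches the things to skip (whole a and li elements, and any other tag, which covers img and attribute values), and the evaluator returns those matches unchanged. The class name LinkInserter, the method AddLinks and the full \b boundaries around every word are illustrative choices. Nested li elements are not handled by this sketch.

```csharp
using System.Text.RegularExpressions;

public static class LinkInserter
{
    // Alternatives are tried left to right at each position, so skipped
    // regions are consumed before the words inside them can be reached.
    private static readonly Regex Pattern = new Regex(
        @"<(a|li)\b[^>]*>.*?</\1>"                  // whole a/li element: left untouched
      + @"|<[^>]+>"                                 // any other tag, including its attributes
      + @"|(?<word>\b(?:word1|word2|word3)\b)",     // a word that should become a link
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string AddLinks(string html)
    {
        // The MatchEvaluator decides per match: only the "word" group is rewritten.
        return Pattern.Replace(html, m =>
            m.Groups["word"].Success
                ? "<a href=\"page.aspx#" + m.Value + "\">" + m.Value + "</a>"
                : m.Value);
    }
}
```

With this sketch, a word in the text of a p, td or div element is linked, while the same word inside an href, a title, an existing link text or a list item is returned as it was.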

logicnp
I have been experimenting with MatchEvaluator but cannot figure out how to solve my issue with it. One thing I did solve, without using MatchEvaluator, is how to avoid replacing inside attributes: string lNewHTML = Regex.Replace(lOldHTML, @"(?<!<[^>]*)(\bword1\b|\bword2|word3\b)", "<a href=\"page.aspx#$1\">$1</a>", RegexOptions.IgnoreCase); But with this I still need to make exceptions for the content of a few specific tags (e.g. a, so that my link text is not replaced). Any ideas or suggestions?
keysersoze
+1  A: 

OK, after spending some time trying to construct a fitting regex, here is my attempt. It might need additional work, but it should point you in the right direction.

The pattern matches the words "word1" and "word2" wherever they are not inside a "tag1" or "tag2" tag. You will need to adjust this to your needs, of course. Enable RegexOptions.IgnorePatternWhitespace if you would like to keep my formatting.

Unfortunately, I have not come up with a regex you can simply plug into Regex.Replace: each match of this regex covers the whole string from the end of the previous match onwards, and the word you are interested in is in the first group. That group gives you the index and length of the word, so you can easily replace it using String.Substring.

(?:
    \G
    (?:
        (?>
             <tag1(?<N>)
            |<tag2(?<N>)
            |</tag1(?<-N>)
            |</tag2(?<-N>)
            |.)*?
        (?(N)(?!))
    )*
 )
(word1|word2)
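The replacement loop described above could look roughly like the following sketch. Several things here are assumptions rather than part of the answer: the class name OutsideTagReplacer, the method Linkify, the link format (taken from the question), and the pattern itself, which is a flattened variant of the regex above with a single lazy loop instead of nested quantifiers, since nested quantifiers may backtrack heavily when no further match is found.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class OutsideTagReplacer
{
    // Flattened variant of the pattern above (an assumption of this sketch).
    // N counts currently open tag1/tag2 elements; the word may only match
    // when that counter is empty.
    private static readonly Regex WordOutsideTags = new Regex(@"
        \G
        (?>
              <tag1  (?<N>)
            | <tag2  (?<N>)
            | </tag1 (?<-N>)
            | </tag2 (?<-N>)
            | .
        )*?
        (?(N)(?!))
        (word1|word2)",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string html)
    {
        var result = new StringBuilder();
        int position = 0; // start of the text not yet copied to the result

        foreach (Match match in WordOutsideTags.Matches(html))
        {
            Group word = match.Groups[1]; // the word; the rest of the match is skipped text

            // Copy everything before the word unchanged, then the link.
            result.Append(html, position, word.Index - position);
            result.Append("<a href=\"page.aspx#").Append(word.Value).Append("\">")
                  .Append(word.Value).Append("</a>");
            position = word.Index + word.Length;
        }

        // Copy whatever follows the last replaced word.
        result.Append(html, position, html.Length - position);
        return result.ToString();
    }
}
```

RegexOptions.Singleline is included so that "." also consumes line breaks; without it the \G-anchored scan stops at the first newline.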
Jens
The regex seems to work perfectly - well, almost... Besides my own application, I am using RegEx Tracer to test the regex, and if I enter a long string to be replaced (somewhere between about 500 and 2000 characters) it crashes. I have tried to rework the regex, but with no luck - any ideas why it crashes and what to do about it?
keysersoze
Sorry, I do not know RegEx Tracer.
Jens
I experience the exact same problem when I use the regex in my own custom winform/webform application - the application crashes when my string, lOldHTML, is more than a few characters.
keysersoze
Sorry, I cannot reproduce the crash. My test works up to strings of length 4000 chars.
Jens
Can I ask how you test the regex, and could you maybe give a full .NET example? If I copy/paste your regex into Regex Tracer together with my HTML, nothing happens (no replacement is made) - if I then remove the newlines and spaces from your example, the application crashes immediately. This is my current C# code: string lNewHTML = Regex.Replace(lOldHTML, "(?:\\G(?:(?><a(?<N>)|<tag2(?<N>)|</a(?<-N>)|</tag2(?<-N>)|.)*?(?(N)(?!)))*)(\\bword1\\b|\\bword2\\b)", "REPLACED$1REPLACED", RegexOptions.IgnoreCase|RegexOptions.Singleline);
keysersoze
I tested it via http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx. Try my regex with a string like "word1...<tag1>...word2...</tag1>... word1", switch on IgnorePatternWhitespace and look at the results. As I said, the result given by my regex is not meant to be used with Regex.Replace. You would have to do the replacing in a loop over the matches.
Jens
I don't doubt that your code works, but I could not build a complete, functional application using it - instead I found HTML Agility Pack, which made it a lot easier. But thanks for your work!
keysersoze