Consider this blob of text:

@"
I want to match  the word 'highlight' in a string. But I don't want to match
highlight when it is contained in an HTML anchor element. The expression
should not match highlight in the following text: <a href='#'>highlight</a>
"

Here's what the output should look like (matches are in bold):

I want to match the word "**highlight**" in a string. But I don't want to match **highlight** when it is contained in an HTML anchor element. The expression should not match **highlight** in the following text: <a href='#'>highlight</a>

How would you construct an expression that matches all occurrences of X, excluding matches inside HTML anchor elements?

+2  A: 

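I know you asked for a regex, but I won't write one. Instead, here's a solution using Html Agility Pack: it walks the nodes of the fragment and keeps the text nodes that contain "highlight" and whose parent is not an anchor element.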

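// Requires the HtmlAgilityPack NuGet package; both methods belong inside a class.
using System;
using System.Linq;
using HtmlAgilityPack;

public static void Parse()
{
    string htmlFragment =
        @"
    I want to match  the word 'highlight' in a string. But I don't want to match
    highlight when it is contained in an HTML anchor element. The expression
    should not match highlight in the following text: <a href='#'>highlight</a> more
    ";
    HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
    htmlDocument.LoadHtml(htmlFragment);
    // "//." selects every node in the document; the filter keeps only the wanted text nodes.
    foreach (HtmlNode node in htmlDocument.DocumentNode.SelectNodes("//.").Where(FilterTextNodes()))
    {
        Console.WriteLine(node.OuterHtml);
    }
}

// Keeps text nodes that contain "highlight" and whose direct parent is not an <a> element.
// Text nested deeper inside an anchor (e.g. <a><b>...</b></a>) is not excluded by this check.
private static Func<HtmlNode, bool> FilterTextNodes()
{
    return node => node.NodeType == HtmlNodeType.Text
        && node.ParentNode != null
        && node.ParentNode.Name != "a"
        && node.OuterHtml.Contains("highlight");
}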
Mikael Svenson
I went with a JavaScript-based approach. So I'm gonna accept this answer in the name of being pragmatic :)
roosteronacid
Regex in JavaScript then ;)
Mikael Svenson
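
For anyone who does want the regex route mentioned in the comments, one common trick is to let the pattern consume whole anchor elements in its first branch and capture the word only in its second branch, so matches inside anchors never reach the capture group. Below is a minimal sketch in C# (to stay with the answer's language). It assumes anchors are well-formed, lowercase, and not nested, and it will still match the word inside other tags' attributes. The `HighlightMatcher` class and `FindOutsideAnchors` method are made-up names for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class HighlightMatcher
{
    // Returns the start index of every occurrence of `word` that is not
    // inside an <a>...</a> element. Assumes well-formed, non-nested anchors.
    public static List<int> FindOutsideAnchors(string html, string word)
    {
        // Branch 1 swallows a whole anchor element; branch 2 captures the word.
        string pattern = @"<a\b[^>]*>.*?</a>|(" + Regex.Escape(word) + ")";

        // Singleline lets "." cross line breaks inside an anchor.
        var positions = new List<int>();
        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            // Group 1 succeeds only when the second branch matched.
            if (match.Groups[1].Success)
            {
                positions.Add(match.Groups[1].Index);
            }
        }
        return positions;
    }
}
```

Calling `HighlightMatcher.FindOutsideAnchors(text, "highlight")` on the sample text from the question should return three positions, one for each occurrence outside the anchor. For anything beyond simple fragments, the parser-based approach in the answer is the more robust choice.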