I am writing some C# code to parse RSS feeds and highlight specific whole words in the content. However, I need to highlight only words that are outside HTML tags. So far I have:

string contentToReplace = "This is <a href=\"test.aspx\" alt=\"This is test content\">test</a> content";

string pattern = @"\b(this|the|test|content)\b";

string output = Regex.Replace(contentToReplace, pattern, "<span style=\"background:yellow;\">$1</span>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

This works fine, except that it also highlights the word "test" inside the alt attribute. I could easily write a function that strips the HTML and then does the replace, but I need to keep the HTML to display the content.

+2  A: 

If the input is valid XHTML/XML you could parse it to a tree structure (DOM/XLinq), recursively walk through the tree, replace all keyword occurrences in text nodes and finally serialize the tree structure back to a string.

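Untested sketch:

// Requires: using System.Collections.Generic; using System.Linq;
//           using System.Text.RegularExpressions; using System.Xml.Linq;
XElement Highlight(XElement element, List<string> keywords)
{
    // Match any keyword as a whole word, ignoring case.
    var pattern = new Regex(
        @"\b(" + string.Join("|", keywords.Select(Regex.Escape)) + @")\b",
        RegexOptions.IgnoreCase);

    // Copy the element name and attributes; attribute values are never highlighted.
    var result = new XElement(element.Name, element.Attributes());

    foreach (var node in element.Nodes())
    {
        if (node is XText text)
        {
            var value = text.Value;
            var position = 0;
            foreach (Match match in pattern.Matches(value))
            {
                // Add the text before the keyword, then the highlighted keyword.
                result.Add(value.Substring(position, match.Index - position));
                result.Add(new XElement(element.Name.Namespace + "span",
                    new XAttribute("style", "background:yellow;"), match.Value));
                position = match.Index + match.Length;
            }
            // Add whatever text remains after the last keyword.
            result.Add(value.Substring(position));
        }
        else if (node is XElement child)
        {
            result.Add(Highlight(child, keywords));
        }
        else
        {
            // Comments, processing instructions, etc. are copied unchanged.
            result.Add(node);
        }
    }

    return result;
}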

var output = Highlight(XElement.Parse(input), new List<string> {...}).ToString();
dtb
holy snikey, that's going to take some time to wrap my head around. I'll mark it as the answer, since it seems to have gotten the most votes.
This will not work if the HTML document is not well formed. HTML does not require every tag to be closed. Take the td tag, for example: an unclosed td tag is valid HTML but invalid XML. This would work if the documents were XHTML, but the question does not mention that detail.
Steve
That's why my answer starts with "**If** the input is valid XHTML/XML"
dtb
Well, since the content is coming from random sites, it most likely won't be valid XHTML. Sounds like this isn't going to be easy. For shame.
A: 

Another solution if you have valid XML but don't want to parse it: first split the input string into parts such that each part contains only a tag or only text, but not both. For example:

"This is ",
"<a href=\"test.aspx\" alt=\"This is test content\">",
"test",
"</a>",
" content"

Then iterate through the parts and apply your regex only to strings that don't start with '<'. Finally, join all parts back into a single string.
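
The split-and-join steps above could be sketched roughly as follows. This is only an illustration, not a tested implementation: the class name `TagSplitHighlighter`, the method `Highlight`, and the parameter `keywordPattern` are made up for this example, and the tag pattern `<[^>]*>` assumes that no attribute value, comment, or script contains a `>` character.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagSplitHighlighter
{
    // Matches a single tag. Assumption: '>' never appears inside a tag,
    // e.g. in an attribute value. Real-world HTML can break this.
    private static readonly Regex Tag = new Regex(@"(<[^>]*>)");

    public static string Highlight(string input, string keywordPattern)
    {
        var keywords = new Regex(keywordPattern, RegexOptions.IgnoreCase);

        // Because the tag pattern is a capturing group, Split keeps the
        // tags in the result, alternating with the text between them.
        var parts = Tag.Split(input);

        // Leave tags alone; run the keyword replacement on text only.
        var processed = parts.Select(part =>
            part.StartsWith("<", StringComparison.Ordinal)
                ? part
                : keywords.Replace(part, "<span style=\"background:yellow;\">$1</span>"));

        return string.Concat(processed);
    }
}
```

Calling `TagSplitHighlighter.Highlight(contentToReplace, @"\b(this|the|test|content)\b")` with the string from the question should wrap "This", "test" and "content" in the text while leaving the `alt` attribute untouched.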

dtb
A: 

Here's a basic one.

private void Form1_Load(object sender, EventArgs e)
{
    string contentToReplace = "This is <a href=\"test.aspx\" alt=\"This is test content\"> hello test world</a> content";

    // Match "test" only when it sits between a '>' and a following '<'.
    string pattern = @"(>{1}.*)(test)(.*<{1})";

    string output = Regex.Replace(contentToReplace, pattern, "$1<span>$2</span>$3", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // output is:
    // This is <a href="test.aspx" alt="This is test content"> hello <span>test</span> world</a> content

    MessageBox.Show(output);
    Close();
}
Steve
What happens if the input is `"...> hello test test world <..."`? Does the regex replace both occurrences of `"test"` or just the first one?
dtb
This would match something like `<img alt="> test <" src="..." />` (which it shouldn't match) and fail to match `test` (which it should match).
strager
@dtb - yeah, just tested your case. It fails.
Steve
@strager - yeah, your case would fail too.
Steve
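
For anyone who wants to reproduce the cases raised in these comments, here is a small sketch that applies the same pattern and replacement to each input. The class name `TagRegexCheck` and the method `Apply` are made up for this example; the pattern, replacement, and options are copied from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRegexCheck
{
    // Same pattern, replacement and options as in the answer above.
    public static string Apply(string input)
    {
        return Regex.Replace(
            input,
            @"(>{1}.*)(test)(.*<{1})",
            "$1<span>$2</span>$3",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}

// Two occurrences between tags: the greedy ".*" swallows the first one,
// so only the last "test" is wrapped.
//   Apply("<b> hello test test world </b>")
//   -> "<b> hello test <span>test</span> world </b>"
//
// A '>' and '<' inside an attribute value: the attribute gets modified.
//   Apply("<img alt=\"> test <\" src=\"...\" />")
//   -> "<img alt=\"> <span>test</span> <\" src=\"...\" />"
//
// Plain text with no surrounding tags: nothing is matched.
//   Apply("test")
//   -> "test"
```

The first case comes from the greedy `.*` in the first group: a single match consumes everything from the first `>` to the last `<`, so at most one keyword per match is wrapped.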