ansaurus

Question

Answer 1

+2 A:

If the input is valid XHTML/XML you could parse it to a tree structure (DOM/XLinq), recursively walk through the tree, replace all keyword occurrences in text nodes and finally serialize the tree structure back to a string.

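Untested sketch:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XElement Highlight(XElement element, List<string> keywords)
{
    // Copy the element name and attributes; child nodes are rebuilt below.
    var result = new XElement(element.Name, element.Attributes());

    foreach (var node in element.Nodes())
    {
        if (node is XText text)
        {
            var value = text.Value;
            while (true)
            {
                // Find the earliest occurrence of any keyword in the remaining text.
                // Case-insensitive substring matching is an assumption here.
                var hit = keywords
                    .Where(k => k.Length > 0)
                    .Select(k => new { Keyword = k, Index = value.IndexOf(k, StringComparison.OrdinalIgnoreCase) })
                    .Where(m => m.Index >= 0)
                    .OrderBy(m => m.Index)
                    .FirstOrDefault();
                if (hit == null) break;

                // Add the substring before the keyword to the result.
                if (hit.Index > 0) result.Add(new XText(value.Substring(0, hit.Index)));

                // Add the highlighted keyword; wrapping it in a span is an assumption.
                result.Add(new XElement(element.Name.Namespace + "span",
                    value.Substring(hit.Index, hit.Keyword.Length)));

                // Remove the consumed substring from value.
                value = value.Substring(hit.Index + hit.Keyword.Length);
            }
            if (value.Length > 0) result.Add(new XText(value));
        }
        else if (node is XElement child)
        {
            result.Add(Highlight(child, keywords));
        }
        else
        {
            // Comments, processing instructions, etc. are copied unchanged.
            result.Add(node);
        }
    }

    return result;
}

var output = Highlight(XElement.Parse(input), new List<string> {...}).ToString();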

dtb 2009-08-23 18:58:27

holy snikey, that's going to take some time to wrap my head around. I'll mark it as the answer, since it seems to have gotten the most votes.

2009-08-24 02:59:10

This will not work if the HTML document is not well formed. HTML does not require every tag to be closed. Take the td tag, for example: an unclosed td tag is valid HTML but invalid XML. This would work if the documents were XHTML, but the question does not mention that detail.

Steve 2009-08-24 14:25:57
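The point about unclosed tags can be checked before committing to the tree-based approach. A minimal sketch, assuming LINQ to XML is the parser in use; the class and method names (`WellFormedCheck`, `IsWellFormedXml`) are made up for illustration:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class WellFormedCheck
{
    // Returns true when the markup parses as XML,
    // false when the XML parser rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for unclosed or mismatched tags, among other problems.
            return false;
        }
    }
}
```

With this, `IsWellFormedXml("<table><tr><td>one<td>two</tr></table>")` returns false even though omitting `</td>` is allowed in HTML, while the same markup with both `</td>` tags present returns true.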

That's why my answer starts with "**If** the input is valid XHTML/XML"

dtb 2009-08-24 14:32:20

Well, since the content is coming from random sites, it most likely won't be valid XHTML. Sounds like this isn't going to be easy. For shame.

2009-08-24 18:52:34

Answer 2

A:

Another solution if you have valid XML but don't want to parse it: First split the input string into parts such that each part contains only a tag or text but not both. For example:

"This is ",
"<a href=\"test.aspx\" alt=\"This is test content\">",
"test",
"</a>",
" content"

Then iterate through the parts and apply your regex only to strings that don't start with '<'. Finally, join all parts back into a single string.
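The split-and-join steps above can be sketched as follows. This is only an illustration under some assumptions: a tag is taken to be anything matching `<[^>]*>` (which breaks if an attribute value contains a literal `>`, and ignores comments and CDATA), the keyword is matched as a whole word regardless of case, and the highlight is a plain `<span>`. The names `TagSplitHighlighter`, `SplitParts` and `Highlight` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagSplitHighlighter
{
    // Splits the input so that each part is either a tag or plain text.
    // Regex.Split keeps the delimiters because the pattern is a capture group.
    public static string[] SplitParts(string input)
    {
        return Regex.Split(input, @"(<[^>]*>)")
                    .Where(part => part.Length > 0)
                    .ToArray();
    }

    // Wraps whole-word occurrences of the keyword in a span,
    // but only inside parts that are not tags.
    public static string Highlight(string input, string keyword)
    {
        var word = new Regex(@"\b" + Regex.Escape(keyword) + @"\b",
                             RegexOptions.IgnoreCase);

        var parts = SplitParts(input)
            .Select(part => part[0] == '<'
                ? part                                   // tag: leave untouched
                : word.Replace(part, "<span>$0</span>")); // text: highlight

        return string.Concat(parts);
    }
}
```

For the example input, `SplitParts` yields the five parts listed above, and `Highlight(input, "test")` wraps only the link text, leaving the `href` and `alt` attribute values alone.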

dtb 2009-08-23 19:15:26

Answer 3

A:

Here's a basic one.

private void Form1_Load(object sender, EventArgs e)
{
    string contentToReplace = "This is <a href=\"test.aspx\" alt=\"This is test content\"> hello test world</a> content";

    string pattern = @"(>{1}.*)(test)(.*<{1})";

    string output = Regex.Replace(contentToReplace, pattern, "$1<span>$2</span>$3", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    //output is :
    //This is <a href="test.aspx" alt="This is test content"> hello <span>test</span> world</a> content

    MessageBox.Show(output);
    Close();
}

Steve 2009-08-23 19:52:13

What happens if the input is `"...> hello test test world <..."` ? Does the regex replace both occurrences of `"test"` or just the first one?

dtb 2009-08-23 20:02:38

This would match something like `<img alt="> test <" src="..." />` (which it shouldn't match) and fail to match `test` (which it should match).

strager 2009-08-23 23:33:43

@dtb - yeah just tested your case. it fails

Steve 2009-08-24 02:45:39

@strager - yeah, your case would fail too.

Steve 2009-08-24 02:48:54
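The cases raised in the comments above can be reproduced with a small harness around the same pattern and options as in the answer. A sketch only; the class and method names (`RegexPitfalls`, `Apply`) are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfalls
{
    // Applies the pattern from the answer, with the same replacement and options.
    public static string Apply(string input)
    {
        const string pattern = @"(>{1}.*)(test)(.*<{1})";
        return Regex.Replace(input, pattern, "$1<span>$2</span>$3",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

`Apply("<p> hello test test world </p>")` gives `<p> hello test <span>test</span> world </p>`: the greedy `.*` in the first group swallows the first "test", so only the last occurrence is wrapped. `Apply("<img alt=\"> test <\" src=\"...\" />")` inserts a span inside the attribute value, and `Apply("test")` returns the input unchanged because the pattern requires a `>` before and a `<` after the keyword.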


Highlight whole words, omit HTML
