I am writing code for a search results page that needs to highlight search terms. The terms occur within table cells (the app iterates through GridView row cells), and these table cells may contain HTML.

Currently, my code looks like this (relevant hunks shown below):

const string highlightPattern = @"<span class=""Highlight"">$0</span>";
DataBoundLiteralControl litCustomerComments = (DataBoundLiteralControl)e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Controls[0];

// Turn "term1 term2" into "(term1|term2)"
string spaceDelimited = txtTextFilter.Text.Trim();
string pipeDelimited = string.Join("|", spaceDelimited.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries));
string searchPattern = "(" + pipeDelimited + ")";

// Highlight search terms in Customer - Comments column
e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Text = Regex.Replace(litCustomerComments.Text, searchPattern, highlightPattern, RegexOptions.IgnoreCase);

Amazingly it works. BUT, sometimes the text I am matching on is HTML that looks like this:

<span class="CustomerName">Fred</span> was a classy individual.

And if you search for "class" I want the highlight code to wrap the "class" in "classy" but of course not the HTML attribute "class" that happens to be in there! If you search for "Fred", that should be highlighted.

So what's a good regex that will make sure matches happen only OUTSIDE the html tags? It doesn't have to be super hardcore. Simply making sure the match is not between < and > would work fine, I think.

+2  A: 

You can use a regex with balancing groups and backreferences, but I strongly recommend that you use a parser here.
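
For what it's worth, here is a minimal sketch of what the parser route could look like using only System.Xml.Linq. It rests on a big assumption: the cell content must be a well-formed XML fragment (every tag closed, no bare & or HTML-only entities such as &nbsp;). Real-world HTML often is not, and would need a dedicated HTML parser instead. The names ParserHighlighter and Highlight are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class ParserHighlighter
{
    // Hypothetical helper. ASSUMPTION: 'fragment' is well-formed XML.
    public static string Highlight(string fragment, string searchTerms)
    {
        string[] terms = searchTerms.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (terms.Length == 0) return fragment;

        // Escape each term so user input is treated as literal text.
        var pattern = new Regex(
            string.Join("|", terms.Select(t => Regex.Escape(t))),
            RegexOptions.IgnoreCase);

        // Wrap in a dummy root so a fragment with several top-level nodes parses.
        XElement root = XElement.Parse("<root>" + fragment + "</root>", LoadOptions.PreserveWhitespace);

        // Snapshot the text nodes first, because the tree is modified in the loop.
        // CDATA sections are skipped on purpose.
        List<XText> textNodes = root.DescendantNodes()
                                    .OfType<XText>()
                                    .Where(t => !(t is XCData))
                                    .ToList();

        foreach (XText text in textNodes)
        {
            string value = text.Value;
            var pieces = new List<object>();
            int last = 0;
            foreach (Match m in pattern.Matches(value))
            {
                if (m.Index > last)
                    pieces.Add(new XText(value.Substring(last, m.Index - last)));
                pieces.Add(new XElement("span", new XAttribute("class", "Highlight"), m.Value));
                last = m.Index + m.Length;
            }
            if (pieces.Count == 0) continue;   // no term found in this text node
            if (last < value.Length)
                pieces.Add(new XText(value.Substring(last)));
            text.ReplaceWith(pieces.ToArray());
        }

        // Serialize the children of the dummy root back to a string.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Because only text nodes are ever searched, tag names and attribute values (such as class="CustomerName") cannot be matched. The trade-off is that XElement.Parse throws an XmlException on anything that is not well-formed.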

Santiago Palladino
A: 

Hmm, I'm not a C# programmer, so I don't know the flavor of regex it uses, but (?!<.+?>) should ignore anything inside of tags. It will force you to use &#60; and &#62; for literal angle brackets in your HTML code, but you should be doing that anyway.

WolfmanDragon
To match "class" as I described in my example, where would the word "class" go in your regex? I don't understand how to use your regex. On its own, it appears to match every char position in the whole phrase.
Chris
The regex "(?!<.+?>)" is just a negative lookahead; it says, "from this position, we're not looking at something that looks vaguely like a tag." It won't match anything, nor will it prevent matching anything, inside a tag or out.
Alan Moore
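
To make the point in the comment above concrete, here is a small check against the sample string from the question. The class name LookaheadDemo and the Count helper are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    public const string Html =
        "<span class=\"CustomerName\">Fred</span> was a classy individual.";

    // Counts how many times 'pattern' matches in the sample HTML.
    public static int Count(string pattern)
    {
        return Regex.Matches(Html, pattern, RegexOptions.IgnoreCase).Count;
    }

    static void Main()
    {
        Console.WriteLine(Count(@"class"));           // prints 2: the attribute name and "classy"
        Console.WriteLine(Count(@"(?!<.+?>)class"));  // prints 2: the lookahead filters nothing
    }
}
```

The lookahead is evaluated at the position where "class" would start, and the text there begins with "c" rather than "<", so the negative assertion always succeeds and the attribute name is still matched.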
+3  A: 

This regex should do the job: (?<!<[^>]*)(regex you want to check, e.g. Fred|span). It is a negative lookbehind: a match is accepted only if <[^>]* cannot be matched going backward from the start of the matched string, i.e. only if there is no unclosed < before it.

Modified code below:

const string notInsideBracketsRegex = @"(?<!<[^>]*)";
const string highlightPattern = @"<span class=""Highlight"">$0</span>";
DataBoundLiteralControl litCustomerComments = (DataBoundLiteralControl)e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Controls[0];

// Turn "term1 term2" into "(term1|term2)"
string spaceDelimited = txtTextFilter.Text.Trim();
string pipeDelimited = string.Join("|", spaceDelimited.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries));
string searchPattern = "(" + pipeDelimited + ")";
searchPattern = notInsideBracketsRegex + searchPattern;

// Highlight search terms in Customer - Comments column
e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Text = Regex.Replace(litCustomerComments.Text, searchPattern, highlightPattern, RegexOptions.IgnoreCase);
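
For anyone who wants to try this outside a GridView, below is a self-contained sketch of the same idea. The class and method names (TagAwareHighlighter, Highlight) are made up for the example. It also makes two changes that are not in the code above and are my own additions: each term is passed through Regex.Escape so that characters like ( or + in the search box are treated literally, and an empty search box returns the input unchanged.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TagAwareHighlighter
{
    // Negative lookbehind: reject a match if an unclosed '<' precedes it.
    // Variable-length lookbehind like this is supported by the .NET regex engine.
    const string NotInsideBrackets = @"(?<!<[^>]*)";
    const string HighlightPattern = @"<span class=""Highlight"">$0</span>";

    // Hypothetical helper: highlights space-separated terms outside of tags.
    public static string Highlight(string html, string searchTerms)
    {
        string[] terms = searchTerms.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (terms.Length == 0) return html;   // avoid building the empty pattern "()"

        // Turn "term1 term2" into "(term1|term2)", escaping each term.
        string alternation = string.Join("|", terms.Select(t => Regex.Escape(t)));
        string pattern = NotInsideBrackets + "(" + alternation + ")";

        return Regex.Replace(html, pattern, HighlightPattern, RegexOptions.IgnoreCase);
    }
}
```

With the sample from the question, Highlight(html, "class") wraps only the "class" in "classy", Highlight(html, "Fred") wraps "Fred", and Highlight(html, "span") changes nothing. It still assumes that every < in the text opens a tag and that no > appears inside an attribute value.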
madgnome
Found through Google, helped out a lot, thanks! :)
Helgi Hrafn Gunnarsson
A: 

Writing a regex that can handle CDATA sections is going to be hard. You may no longer assume that > closes a tag.

For instance, "<span class="CustomerName">Fred.</span> is a good customer (<![CDATA[ >10000$ ]]> )"

The solution is (as noted earlier) a parser. Parsers are much better at dealing with the kind of mess you find in a CDATA section. madgnome's backwards check cannot be used to find the starting <![CDATA from a ]]>, as a CDATA section may include the literal <![CDATA.
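
A quick way to see the problem, as a sketch: run the lookbehind pattern from madgnome's answer against a string like the one above. CdataDemo and MatchesOutsideTags are names made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CdataDemo
{
    public const string Html =
        "<span class=\"CustomerName\">Fred.</span> is a good customer (<![CDATA[ >10000$ ]]> )";

    // True if the lookbehind-based pattern considers 'term' to be outside any tag.
    public static bool MatchesOutsideTags(string term)
    {
        return Regex.IsMatch(Html, @"(?<!<[^>]*)" + Regex.Escape(term));
    }

    static void Main()
    {
        // "10000" sits inside the CDATA section, but the '>' just before it
        // makes the pattern believe the section has already been closed.
        Console.WriteLine(MatchesOutsideTags("10000"));  // prints True
        Console.WriteLine(MatchesOutsideTags("CDATA"));  // prints False
    }
}
```

So a highlight span would be injected into the CDATA content, where it is literal text rather than markup.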

MSalters
Good point, I hadn't thought of that.
madgnome
I know the solution isn't perfect, but weighing all the ups and downs, it's the best one I've found thus far.
Chris