ansaurus

Question

What regex will match text excluding what lies within HTML tags?

Answer 1

+2 A:

You can use a regex with balancing groups and backreferences, but I strongly recommend that you use a parser here.

Santiago Palladino 2008-10-07 18:47:43

Answer 2

A:

Hmm, I'm not a C# programmer, so I don't know the flavor of regex it uses, but (?!<.+?>) should ignore anything inside of tags. It will force you to write literal angle brackets as &#60; and &#62; in your HTML code, but you should be doing that anyway.

WolfmanDragon 2008-10-07 20:24:36

To match "class" as I described in my example, where would the word "class" go in your regex? I don't understand how to use your regex. On its own, it appears to match every char position in the whole phrase.

Chris 2008-10-08 07:14:45

The regex "(?!<.+?>)" is just a negative lookahead; it says, "from this position, we're not looking at something that looks vaguely like a tag." It won't match anything, nor will it prevent matching anything, inside a tag or out.
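The point about the lookahead can be checked directly. The following is a minimal sketch, assuming a made-up input string and a hypothetical helper named Count; it compares how many times "class" is found with and without the lookahead in front of it.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Hypothetical helper: number of matches of pattern in input.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        // Made-up sample: "class" occurs once inside a tag and once in the text.
        string html = "<span class=\"x\">class</span>";

        // The lookahead only asserts that the current position does not start a tag.
        // The position where "class" begins never starts with '<', so nothing is filtered.
        Console.WriteLine(Count(html, @"(?!<.+?>)class")); // prints 2
        Console.WriteLine(Count(html, @"class"));          // prints 2
    }
}
```

Both counts are the same, so the occurrence inside the tag is still matched.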

Alan Moore 2008-10-09 13:36:59

Answer 3

+3 A:

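This regex should do the job: (?<!<[^>]*)(Fred|span), where the second group is whatever pattern you want to match. The negative lookbehind checks that it is impossible to match <[^>]* going backward from the start of the match, in other words that the match is not preceded by a < that has not yet been closed by a >. This relies on variable-length lookbehind, which the .NET regex engine supports.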

Modified code below:

const string notInsideBracketsRegex = @"(?<!<[^>]*)";
const string highlightPattern = @"<span class=""Highlight"">$0</span>";
DataBoundLiteralControl litCustomerComments = (DataBoundLiteralControl)e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Controls[0];

// Turn "term1 term2" into "(term1|term2)", escaping regex metacharacters in each term
string spaceDelimited = txtTextFilter.Text.Trim();
string[] terms = spaceDelimited.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries);
string pipeDelimited = string.Join("|", Array.ConvertAll<string, string>(terms, Regex.Escape));
string searchPattern = "(" + pipeDelimited + ")";
searchPattern = notInsideBracketsRegex + searchPattern;

// Highlight search terms in Customer - Comments column
e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Text = Regex.Replace(litCustomerComments.Text, searchPattern, highlightPattern, RegexOptions.IgnoreCase);
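To try the lookbehind outside of a GridView, here is a small console sketch. The class name HighlightDemo, the Highlight helper, the sample input, and the [$0] replacement (used instead of the span markup so the output is easy to read) are all assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class HighlightDemo
{
    // Same lookbehind as above: fail if an unclosed '<' precedes the match.
    const string NotInsideBrackets = @"(?<!<[^>]*)";

    // Hypothetical helper: wraps each term found outside a tag in square brackets.
    public static string Highlight(string html, params string[] terms)
    {
        // Escape each term so characters like '.' or '(' are matched literally.
        string[] escaped = Array.ConvertAll<string, string>(terms, Regex.Escape);
        string pattern = NotInsideBrackets + "(" + string.Join("|", escaped) + ")";
        return Regex.Replace(html, pattern, "[$0]", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        // Made-up sample: "class" appears both as an attribute name and in the text.
        string html = "<span class=\"CustomerName\">Fred</span> is in a class of his own";
        Console.WriteLine(Highlight(html, "class", "Fred"));
        // prints: <span class="CustomerName">[Fred]</span> is in a [class] of his own
    }
}
```

The "class" inside the opening tag is left alone because a < with no closing > precedes it, while the occurrences in the text are wrapped.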

madgnome 2008-10-08 08:56:52

Found through Google, helped out a lot, thanks! :)

Helgi Hrafn Gunnarsson 2010-09-29 21:18:09

Answer 4

A:

Writing a regex that can handle CDATA sections is going to be hard. You may no longer assume that > closes a tag.

For instance, "<span class="CustomerName">Fred.</span> is a good customer (<![CDATA[ >10000$ ]]> )"

The solution is (as noted earlier) a parser. Parsers are much better at dealing with the kind of mess you find in a CDATA section. madgnome's backwards check cannot be used to find the starting <![CDATA from a ]]>, as a CDATA section may include the literal <![CDATA.
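As a rough illustration of the parser route, here is a sketch using LINQ to XML. It assumes the markup is well-formed XML (real-world HTML often is not, and would need an HTML parser instead), and the wrapping <p> element, the class name TextNodeDemo, and the PlainText helper are made up for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TextNodeDemo
{
    // Hypothetical helper: values of ordinary text nodes, skipping CDATA sections.
    public static string[] PlainText(string xml)
    {
        return XElement.Parse(xml)
            .DescendantNodes()
            .OfType<XText>()              // text nodes, including CDATA
            .Where(t => !(t is XCData))   // XCData derives from XText, so filter it out
            .Select(t => t.Value)
            .ToArray();
    }

    static void Main()
    {
        // The example above, wrapped in a <p> so that it has a single root element.
        string xml = "<p><span class=\"CustomerName\">Fred.</span>"
                   + " is a good customer (<![CDATA[ >10000$ ]]> )</p>";

        foreach (string text in PlainText(xml))
        {
            Console.WriteLine("[" + text + "]");
        }
    }
}
```

The parser already knows where tags and CDATA sections begin and end, so the search terms only need to be matched against the text node values, and the CDATA content does not appear among them.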

MSalters 2008-10-08 09:09:13

Good point, I hadn't thought of that.

madgnome 2008-10-08 09:20:58

I know the solution isn't perfect, but weighing all the ups and downs, it's the best one I've found thus far.

Chris 2008-10-08 17:44:13
