I have a string, and I would like to find all greater-than characters that are not part of an HTML tag.

Ignoring CDATA, etc., this should be easy: find any ">" character that either has no "<" before it, or has another ">" between it and the nearest preceding "<".

Here's the first attempted solution I came up with:

 (?<=(^|>)[^<]*)>

I think this should look for any ">" where there are no "<" characters to the left of it, either back to the beginning of the string, or back to the previous ">".

I tried phrasing it negatively as well:

 (?<!<[^>]*)>

I.e., a ">" that is not preceded by a "<" followed only by non-">" characters.

I suspect I'm just twisted up in my head about how lookbehinds work.

Unit Tests:

 No match in: <foo>
 No match in: <foo bar>
 Match in: <foo> bar>
 Match in: foo> bar
 Match in: >foo
 Two matches in: foo>>
 Two matches in: <foo> >bar>

Use case: I'm scrubbing HTML from a wiki-like form field that accepts some HTML tags, but the users are not terribly HTML-savvy and sometimes enter unescaped ">" and "<" literals when they mean greater-than and less-than. My intent is to replace these with HTML entities, but only if they aren't part of an HTML tag. I know they might enter text like "Height is < 10 and > 5", which would break this, but that's an edge case I can work around or live with.

A: 

Get Expresso; it's a great tool for writing and testing regexes.

To be honest though, I don't know if you can write one to do what you need.
Don't forget that some HTML tags don't need to be closed to be valid HTML, and some are self-closing in XHTML.

E.g. <hr>, <br/>, <p>, <li>, <img> or <img />, etc.

You might be better off keeping a list of valid tags and changing every < and > that isn't part of one of those tags to &lt; and &gt;.
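A rough sketch of that whitelist idea, assuming .NET (the class name, the tag list, and the pattern are all made up for illustration, not anything from the question): match anything shaped like a tag, keep it only if its name is on the list, and escape every other angle bracket.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagWhitelist
{
    // Hypothetical whitelist; replace with the tags the form actually accepts.
    static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "b", "i", "p", "br", "hr", "li", "img" };

    // Either something shaped like a tag (its name captured in "name"),
    // or a single stray angle bracket.
    static readonly Regex TagOrBracket =
        new Regex(@"</?(?<name>[A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*)?/?>|[<>]");

    public static string Scrub(string input)
    {
        return TagOrBracket.Replace(input, m =>
        {
            Group name = m.Groups["name"];
            if (name.Success && AllowedTags.Contains(name.Value))
                return m.Value; // whitelisted tag: keep as-is

            // Stray bracket or non-whitelisted tag: escape every bracket in it.
            return m.Value.Replace("<", "&lt;").Replace(">", "&gt;");
        });
    }
}
```

With this sketch, TagWhitelist.Scrub("<p>a > b</p>") should give "<p>a &gt; b</p>", while a tag that is not on the list, like <script>, comes out escaped. Text such as "a <b and c> d" would still be mistaken for a tag, so it doesn't get around the edge case mentioned in the question.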

Chad
A: 

This is a lot trickier than it seems at first (as you're discovering). It's much easier to come at it from the other direction: use one regex to match an HTML tag OR an angle bracket. If it's a tag you found, you plug it back in; otherwise you convert it. The Replace method with a MatchEvaluator parameter is good for that:

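using System.Text.RegularExpressions;

// Matches either a simple tag (<foo> or </foo>) or a lone angle bracket.
static string ScrubInput(string input)
{
  return Regex.Replace(input, @"</?\w+>|[<>]", GetReplacement);
}

// Escapes lone angle brackets; anything else (a tag) is returned unchanged.
static string GetReplacement(Match m)
{
  switch (m.Value)
  {
    case "<":
      return "&lt;";
    case ">":
      return "&gt;";
    default:
      return m.Value;
  }
}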

You'll notice that my tag regex -- </?\w+> -- is more restrictive than yours. I don't know if mine is exactly right for your needs, but I would advise against using <[^<>]+> -- it would find a match in something like "if (x<3||x>9)".
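To see what that restriction means in practice, here is a small console sketch that runs the same pattern over the test strings from the question plus the "if" example. The ScrubDemo class and its Main method are just scaffolding added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScrubDemo
{
    // Same pattern and evaluator as in the answer above.
    public static string ScrubInput(string input)
    {
        return Regex.Replace(input, @"</?\w+>|[<>]", GetReplacement);
    }

    static string GetReplacement(Match m)
    {
        switch (m.Value)
        {
            case "<": return "&lt;";
            case ">": return "&gt;";
            default:  return m.Value; // a recognized tag: keep it
        }
    }

    static void Main()
    {
        string[] samples =
        {
            "<foo>",         // stays <foo>
            "<foo bar>",     // becomes &lt;foo bar&gt; (\w+ does not match attributes)
            "<foo> bar>",    // becomes <foo> bar&gt;
            "foo> bar",      // becomes foo&gt; bar
            ">foo",          // becomes &gt;foo
            "foo>>",         // becomes foo&gt;&gt;
            "<foo> >bar>",   // becomes <foo> &gt;bar&gt;
            "if (x<3||x>9)"  // becomes if (x&lt;3||x&gt;9)
        };

        foreach (string s in samples)
            Console.WriteLine("{0,-16} => {1}", s, ScrubInput(s));
    }
}
```

Note the second sample: because the tag part of the pattern only allows word characters, a tag with attributes such as <foo bar> is not recognized and gets escaped. If the form accepts tags with attributes, the tag pattern would need to be loosened to allow for them.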

Alan Moore