I have a string, and I would like to find all greater-than characters that are not part of an HTML tag.
Ignoring CDATA, etc., this should be easy: find any ">" character that either has no "<" before it, or has another ">" between it and the nearest preceding "<".
Here's the first attempted solution I came up with:
(?<=(^|>)[^<]*)>
I think this should look for any ">" where there are no "<" characters to the left of it, either back to the beginning of the string, or back to the previous ">".
I tried phrasing it negatively as well:
(?<!<[^>]*)>
I.e., a ">" that is not preceded by a "<" followed only by non-">" characters.
I suspect I'm just twisted up in my head about how lookbehinds work.
Unit Tests:
No match in: <foo>
No match in: <foo bar>
Match in: <foo> bar>
Match in: foo> bar
Match in: >foo
Two matches in: foo>>
Two matches in: <foo> >bar>
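For reference, the intended behavior can be written without any lookbehind as a single left-to-right scan. This is only a sketch in Python, chosen for illustration since no particular regex engine is named here. The function name bare_gt_positions is hypothetical. It remembers whether a "<" is still waiting for its ">" and reports every ">" that closes nothing:

```python
def bare_gt_positions(text):
    """Return the indices of every ">" that does not close a preceding "<"."""
    positions = []
    tag_open = False  # True once a "<" has been seen with no ">" after it yet
    for index, char in enumerate(text):
        if char == "<":
            tag_open = True
        elif char == ">":
            if tag_open:
                tag_open = False  # this ">" closes the pending tag
            else:
                positions.append(index)  # stray ">" outside any tag
    return positions


# Run the unit-test strings listed above and show the matched positions.
samples = ["<foo>", "<foo bar>", "<foo> bar>", "foo> bar", ">foo", "foo>>", "<foo> >bar>"]
for sample in samples:
    print(repr(sample), bare_gt_positions(sample))
```

Each listed case should yield the stated number of matches, which makes this a convenient reference to compare a regex against.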
Use case: I'm scrubbing HTML from a wiki-like form field that accepts some HTML tags, but the users are not terribly HTML-savvy and sometimes enter unescaped ">" and "<" literals where they mean actual greater-than and less-than signs. My intent is to replace these with HTML entities, but only if they aren't part of an HTML tag. I know there's the possibility of them entering text like "Height is < 10 and > 5", which would break this, but that's an edge case I can work around or live with.
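One possible way to do that replacement without lookbehinds is to match whole tag-like runs and lone ">" characters with one alternation, then rewrite only the lone ones. The sketch below assumes Python's standard re module. The names TAG_OR_GT and escape_bare_gt are made up for illustration, and only ">" is handled, not stray "<":

```python
import re

# Either a whole tag-like run ("<" up to the next ">") or a lone ">".
# The tag alternative is tried first, so a ">" that closes a tag is consumed
# as part of that tag and never reaches the second alternative.
TAG_OR_GT = re.compile(r"<[^>]*>|>")


def escape_bare_gt(text):
    """Replace each ">" that is not part of a tag-like run with "&gt;"."""
    def replace(match):
        token = match.group(0)
        # Leave tag-like runs untouched; escape only a lone ">".
        return "&gt;" if token == ">" else token

    return TAG_OR_GT.sub(replace, text)


print(escape_bare_gt("<b>x</b> > y"))
print(escape_bare_gt("Height is < 10 and > 5"))
```

The second call leaves its input unchanged, because "< 10 and >" looks like a tag to this pattern. That is the edge case mentioned above.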