I want to match a closing HTML tag like </EM> only if there is no opening <EM> tag anywhere before it, i.e. before any preceding tags or text. My sample string is:

ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>

In this string the output should be only the second </EM>, because it lacks a starting <EM>. I have tried:

(?!=<EM>.*)</EM>

but it doesn't seem to work. Please help, thanks.

A: 

I am not sure regex is best suited for this kind of task, since tags can always be nested.

Anyhow, a C# regex like:

(?<!<EM>[^<]+)</EM>

would match only the second </EM> tag.

Note that:

  • ?! is a negative look*ahead*, which explains why both </EM> tags are found.
    So (?!=<EM>.*)xxx actually means "match xxx provided the text starting at that position is not =<EM>.*". Since </EM> never starts with =, that check always passes. I am not sure you wanted to include an = in there.
  • ?<! is a negative look*behind*, better suited to what you wanted to do, but it would not work with the Java regex engine, since this look-behind pattern does not have an obvious maximum length.

However, with a .NET regex engine, as tested on RETester, it does work.
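
The two notes above can be checked outside .NET as well. The following is a minimal sketch in Python, chosen here only for illustration; the variable names are hypothetical. It assumes Python's standard `re` module, which, like Java, rejects look-behind patterns without a fixed width, so it shows the original look-ahead matching both tags and the look-behind pattern failing to compile on such an engine.

```python
import re

sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"

# The pattern from the question: (?!=<EM>.*) is a negative lookahead for the
# literal text "=<EM>...", checked at the spot where "</EM>" begins.
# "</EM>" never begins with "=", so the lookahead never rejects anything.
lookahead_hits = re.findall(r"(?!=<EM>.*)</EM>", sample)
print(len(lookahead_hits))  # both closing tags are matched

# The look-behind pattern from the answer contains [^<]+, which has no fixed
# width. Python's re module refuses to compile it, similar to the Java
# limitation mentioned above; .NET accepts it.
try:
    re.compile(r"(?<!<EM>[^<]+)</EM>")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False
print(lookbehind_supported)
```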

VonC
I tried this and it isn't working; it returns exactly the same matches as mine. Thanks anyway.
shabby
A: 

You need a pushdown automaton here. Regular expressions aren't powerful enough to capture this concept, since they are equivalent to finite-state automata, so a regex solution is strictly speaking a no-go.

That said, .NET regular expressions do have a pushdown automaton behind them so they can theoretically cope with such cases. If you really feel you need to do this with regular expressions rather than a formal HTML parser, take a glimpse here.
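
To make the stack idea concrete, here is a rough sketch in Python rather than with .NET balancing groups. The function name `unmatched_closers` and the `TAG` pattern are hypothetical, and the sketch assumes simple markup only: no comments, no void elements such as `<br>`, and no `>` inside attribute values. A real HTML parser remains the more robust choice.

```python
import re

# Matches an opening or closing tag; group 1 is "/" for a closing tag,
# group 2 is the tag name. Attributes are skipped, not parsed.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def unmatched_closers(html):
    """Return (offset, tag_text) for each closing tag that has no open tag
    of the same name before it."""
    open_tags = []  # stack of currently open tag names
    orphans = []
    for m in TAG.finditer(html):
        is_close = m.group(1) == "/"
        name = m.group(2).upper()
        if not is_close:
            open_tags.append(name)
        elif name in open_tags:
            # Pop back to the matching opener, discarding unclosed inner tags.
            while open_tags.pop() != name:
                pass
        else:
            orphans.append((m.start(), m.group(0)))
    return orphans

sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"
print(unmatched_closers(sample))
```

On the sample string from the question, only the second </EM> is reported, because the first one pops the <EM> that was pushed just before it.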

Konrad Rudolph
Interesting. Isn't that an advanced form of "forward reference"? (http://www.regular-expressions.info/brackets.html)
VonC
Not really: forward references work similarly to back references, i.e. it's enough to store their content in an array. However, for balancing groups to work, the content of these groups has to be stored on a stack (which is the “pushdown” part).
Konrad Rudolph