I am writing a regex that should match only malformed tags like: <p> *some text here, some other tags may be here as well but no ending 'p' tag* </p>

 <P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>

In the sample text above, I want the result to be <P>(of the western circuit)<P>, and nothing else should be captured. I'm using this, but it's not working:

<P>[^\(</P>\)]*<P>

Please help.

A: 

Rather than using * for a maximal (greedy) match, use *? for a minimal (lazy) one.

You should be able to make a start with:

<P>((?!</P>).)*?<P>

This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.

EDIT: Corrected to put the assertion before the dot (thanks to the commenter).
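
As a quick check, here is a minimal Python sketch that runs this pattern against the sample text from the question. Python's re module and the names SAMPLE and PATTERN are assumptions for illustration only; the question does not say which regex engine is in use.

```python
import re

# Sample text from the question.
SAMPLE = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
          "<P>(of the western circuit)<P>PREFACE</P>")

# Lookahead first, then the dot: a character is consumed only if it
# does not begin a closing </P> tag.
PATTERN = re.compile(r"<P>((?!</P>).)*?<P>")

matches = [m.group(0) for m in PATTERN.finditer(SAMPLE)]
print(matches)  # ['<P>(of the western circuit)<P>']
```

The three properly closed paragraphs are skipped because the lookahead refuses to step over their </P> tags.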

Richard
This will match </P> of course... need zero width forward assertion in there...
Richard
That still matches </P>. You need to put the dot _after_ the lookahead, not before it. The way you're doing it, the very first character after the start tag is consumed by the dot and the lookahead never sees it.
Alan Moore
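
To make the ordering issue concrete, the following Python sketch compares the two arrangements. DOT_FIRST is a hypothetical reconstruction of the mistake described in the comment (the earlier revision of the answer is not shown here), and the test string is an assumed example, not text from the question.

```python
import re

# Hypothetical variant with the dot BEFORE the lookahead: the dot consumes
# a character, and only then does the lookahead check what follows it.
DOT_FIRST = re.compile(r"<P>(.(?!</P>))*?<P>")

# Lookahead before the dot, as in the corrected answer.
LOOKAHEAD_FIRST = re.compile(r"<P>((?!</P>).)*?<P>")

# An empty, properly closed paragraph followed by another opening tag.
text = "<P></P><P>"

dot_first_match = DOT_FIRST.search(text)
lookahead_first_match = LOOKAHEAD_FIRST.search(text)

print(dot_first_match.group(0))   # <P></P><P>
print(lookahead_first_match)      # None
```

With the dot first, the "<" of the closing tag is consumed before the lookahead ever inspects it, so the closed paragraph is wrongly reported.
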
+5  A: 

Regex is not always a good choice for XML/HTML-type data. In particular, attributes, case sensitivity, comments, etc. all have a big impact.

For xhtml, I'd use XmlDocument/XDocument and an xpath query.

For "non-x" html, I'd look at the HTML Agility Pack and the same.
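
Neither XmlDocument nor the HTML Agility Pack is shown here. As a loosely analogous sketch of the parser-based approach, the following uses Python's standard-library html.parser instead, which is a swapped-in technique and not an XPath query. The class UnclosedPFinder and the function find_unclosed_p are hypothetical names, and the rule applied (a <p> counts as unclosed if another <p> or the end of input arrives before its </p>) is an assumption about what "wrong" means.

```python
from html.parser import HTMLParser


class UnclosedPFinder(HTMLParser):
    """Record the position of every <p> that is never explicitly closed."""

    def __init__(self):
        super().__init__()
        self.open_p = None   # (line, column) of the <p> currently open
        self.unclosed = []   # positions of <p> tags that were never closed

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <P> and <p> are treated alike.
        if tag == "p":
            if self.open_p is not None:
                self.unclosed.append(self.open_p)
            self.open_p = self.getpos()

    def handle_endtag(self, tag):
        if tag == "p":
            self.open_p = None


def find_unclosed_p(html):
    finder = UnclosedPFinder()
    finder.feed(html)
    finder.close()
    if finder.open_p is not None:   # still open at end of input
        finder.unclosed.append(finder.open_p)
    return finder.unclosed


SAMPLE = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
          "<P>(of the western circuit)<P>PREFACE</P>")
print(find_unclosed_p(SAMPLE))
```

Each reported position is a (line, column) pair pointing at the opening tag, so attributes and letter case need no special handling in the matching logic.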

Marc Gravell
The agility pack is great for any xpath-like search inside html (even not well formed!)
Dror
+1  A: 

Match group one of the following (with case-insensitive matching enabled):

(?:<p>(?:(?!<\/?p>).?)+)(<p>)

matches the second <p> in:

<P>(of the western circuit)<P>PREFACE</P>

Note: I'm usually one of those who say: "Don't do HTML with regex, use a parser instead". But I don't think this specific problem can be solved with a parser, which would probably just ignore or transparently deal with the invalid markup.
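
A minimal Python sketch of reading group one from this pattern follows. Using Python's re module is an assumption (the thread does not name an engine), and re.IGNORECASE is added because the pattern is written with <p> while the sample text uses <P>.

```python
import re

# Pattern from the answer, unchanged; only the case-insensitive flag is added.
PATTERN = re.compile(r"(?:<p>(?:(?!<\/?p>).?)+)(<p>)", re.IGNORECASE)

text = "<P>(of the western circuit)<P>PREFACE</P>"

m = PATTERN.search(text)
whole_match = m.group(0)       # first <P>, its text, and the second <P>
second_p = m.group(1)          # only the offending second <P>
second_p_offset = m.start(1)   # where that tag begins in the input

print(whole_match)
print(second_p, second_p_offset)
```

Group one gives the location of the stray tag, which is useful if the goal is to insert the missing </P> just before it.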

Tomalak
thanks man.....it worked gr8 u rock!!!
shabby
so why not accept the answer?
hometoast
there you go man, answer accepted ;)
shabby
A: 

I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like

<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>

I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
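
That suspicion can be checked with a short Python sketch. The two patterns are the ones posted in the earlier answers; running them through Python's re module with re.IGNORECASE is an assumption made for illustration.

```python
import re

nested = "<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"

# The two patterns from the earlier answers, matched case-insensitively.
patterns = {
    "tempered dot": re.compile(r"<P>((?!</P>).)*?<P>", re.IGNORECASE),
    "group capture": re.compile(r"(?:<p>(?:(?!<\/?p>).?)+)(<p>)", re.IGNORECASE),
}

results = {name: p.search(nested).group(0) for name, p in patterns.items()}

for name, matched in results.items():
    print(name, "->", matched)
```

Both patterns report "<p>OUTER BEFORE<p>", so neither can tell a nested <p> from a genuinely unclosed one.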

David Dean
A: 

All of the solutions offered so far consume the second <P> as part of the match, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:

@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"

As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
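
The following Python sketch illustrates the lookahead idea on consecutive unclosed paragraphs. It is an adapted port and not the pattern above verbatim: the atomic group is left out because Python's re module only supports it from version 3.11, and [^<] is matched one character at a time instead, which is an assumed substitute to keep backtracking in check. The names UNCLOSED_P and unclosed_paragraphs are hypothetical.

```python
import re

# Assumption: simplified port of the pattern above without the atomic group.
# The trailing lookahead checks for the next <p (or end of input) without
# consuming it, so the next match can start at that tag.
UNCLOSED_P = re.compile(r"<p\b(?:[^<]|<(?!/?p>))*(?=<p\b|$)", re.IGNORECASE)


def unclosed_paragraphs(html):
    """Return each <p> element that runs into another <p> or the end of input."""
    return [m.group(0) for m in UNCLOSED_P.finditer(html)]


consecutive = unclosed_paragraphs("<P>one<P>two<P>three</P>")
print(consecutive)  # ['<P>one', '<P>two']
```

Because the second opening tag is only looked at and never consumed, both unclosed paragraphs are reported, while the properly closed third one is not.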

Alan Moore