<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>

I'd like to create a regexp that safely matches these:

<br>Aggie<br>
<br>John<br>
<br>Mary<br>
<br><b>Peter</b><br>

It is possible that there are other tags (e.g. <i>, <strike>, etc.) between each pair of <br>, and they have to be captured as well, just like in <br><b>Peter</b><br>.

What should the regexp look like?

+6  A: 

If you learn one thing on SO, let it be: "Do not parse HTML with a regex". Use an HTML parser.
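
As a rough illustration of the parser approach, here is a minimal sketch using Python's standard-library `html.parser`. The names `BrPairCollector` and `collect_br_pairs`, and the pairing rule (every odd `<br>` opens a segment and the next `<br>` closes it), are assumptions made for this example; the answer itself does not prescribe a particular parser or rule.

```python
from html.parser import HTMLParser


class BrPairCollector(HTMLParser):
    """Collects the markup found between each opening/closing pair of <br> tags.

    Assumption: <br> tags are counted in document order; odd-numbered ones
    open a segment and even-numbered ones close it.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.br_count = 0
        self.current = []    # pieces of the segment being built
        self.segments = []   # finished segments, without the <br> tags

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.br_count += 1
            if self.br_count % 2 == 0:
                # Closing <br> of a pair: store what was collected.
                self.segments.append("".join(self.current))
            # Either way, start collecting from scratch.
            self.current = []
        else:
            # Keep other tags (e.g. <b>, <i>) exactly as written.
            self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "br":
            self.current.append(f"</{tag}>")

    def handle_data(self, data):
        self.current.append(data)


def collect_br_pairs(html):
    parser = BrPairCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.segments


sample = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"
print(collect_br_pairs(sample))
```

With the sample string from the question, this collects `Aggie`, `John`, `Mary` and `<b>Peter</b>`, and skips `<p>Hello world</p>` because it sits between two pairs rather than inside one.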

RC
To anyone pointing automatically to this one, quoting from the very same blog post: "Many programs will neither need to, nor should, anticipate the entire universe of HTML when parsing." It's absolutely OK to parse an HTML-like input if you keep this in mind.
candiru
This question is missing the obligatory bobince reference: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
intgr
@candiru: The asker explicitly asked for a regexp that is **"safe"**. Regexps are fine for one-off hacks, but they are certainly not safe.
intgr
intgr: It's linked from Jeff's post I linked in the comment to the question. It's just another pointer to dereference :-)
Joey
A: 

Split the string at (<br>)+. You'll get empty strings at the beginning and the end of the result, so you need to remove them, too.
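
A minimal sketch of the split in Python, assuming the sample string from the question. The variable names are illustrative. A non-capturing group `(?:<br>)+` is used because Python's `re.split` also returns the text of capturing groups, which would put `<br>` entries into the result.

```python
import re

html = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"

# Split on one or more consecutive <br> tags.
parts = re.split(r"(?:<br>)+", html)

# The string starts and ends with <br>, so the first and last
# entries are empty strings; drop them.
items = [p for p in parts if p]
print(items)
```

Note that the result also contains `<p>Hello world</p>`, which the question does not list among the wanted matches.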

If you want to preserve the <br>, then this is not possible unless you know that there is one before and after each element in the result.

Aaron Digulla
Sorry, I misread the question.
Aaron Digulla
You can still prepend and append a `<br>` to each result, though. Not nice, but if the OP *requires* the `<br>` ...
Joey
+1  A: 
<br>.*?<br>

will match anything from one <br> tag to the closest following one.

The main problem with parsing HTML using regexes is that regexes can't handle arbitrarily nested structures. This is not a problem in your example.
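
To see the difference the `?` makes, here is a small sketch in Python, assuming the sample string from the question (the variable names are illustrative):

```python
import re

html = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"

# Non-greedy: each match stops at the closest following <br>.
lazy_matches = re.findall(r"<br>.*?<br>", html)
for m in lazy_matches:
    print(m)

# Greedy: .* runs to the last <br>, so the whole string is one match.
greedy_matches = re.findall(r"<br>.*<br>", html)
print(len(greedy_matches))
```

Matches do not overlap, so each `<br>` is consumed by exactly one match. That is why `<p>Hello world</p>` is skipped here: the `<br>` tags on either side of it already belong to the `John` and `Mary` matches. If the text between two `<br>` tags can contain line breaks, the `re.DOTALL` flag would be needed so that `.` also matches newlines.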

Tim Pietzcker
You are right, I need a non-greedy match.
bobo