I have some broken HTML that I would like to fix with a regex.

The html might be something like this:

<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>

But there can be many more paragraphs, and other HTML elements too.

I want to turn it into:

<p>text1</p>
<p>text2</p>
<p>text3</p>
<p>text4</p>
<p>text5</p>

Is this possible with a regex? I'm using PHP, if that matters.

+1  A: 

Could http://htmlpurifier.org/ help you?

Knarf
Ah, that would probably have been overkill, since I only need to solve this specific problem, but I will use HTML Purifier another time :)
Martin
+3  A: 

No, this is generally a bad idea. Regexes don't do stateful parsing, and HTML needs it: tags nest, some tags are implicit, and a parser has to keep track of which elements are currently open.

HTML also has lots of quirks. Writing an HTML parser is hard because you have to keep track not only of how things should be, but also of the broken markup seen in the wild.

Regexes are the wrong tool for this job.
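
To make the "state" point concrete, here is a minimal sketch of the parser-based approach. It is written in Python with the standard library's html.parser, purely as an illustration. The names BareTextWrapper and wrap_bare_text are made up for this example, and the rule "wrap any text that sits outside every element" is an assumption about what the question needs. It does not treat script or style content specially.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BareTextWrapper(HTMLParser):
    """Re-emit the markup, wrapping top-level text in <p>...</p>."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.depth = 0      # the "state": how many elements are open
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, depth is unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)  # tolerate stray closers
        self.out.append(f"</{tag}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_data(self, data):
        core = data.strip()
        if self.depth > 0 or not core:
            # Inside an element, or whitespace only: pass through re-escaped.
            self.out.append(escape(data, quote=False))
            return
        # Text outside every element: wrap it, keeping surrounding whitespace.
        lead_len = len(data) - len(data.lstrip())
        lead, trail = data[:lead_len], data[lead_len + len(core):]
        self.out.append(f"{lead}<p>{escape(core, quote=False)}</p>{trail}")


def wrap_bare_text(html_fragment):
    parser = BareTextWrapper()
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.out)


broken = "<p>text1</p>\n<p>text2</p>\ntext3\n<p>text4</p>\n<p>text5</p>"
print(wrap_bare_text(broken))
```

The depth counter is exactly the kind of bookkeeping a single regex cannot do: text3 is wrapped only because no element is open when it appears, while text inside a div or another p is left alone. Each run of top-level text between two tags is wrapped separately, so real-world input may need a smarter rule.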

szbalint
You're right. That's because HTML is NOT a regular language.
David Brunelle
I see. I wrote a parser for it instead, and it works well. Thanks :)
Martin
+1  A: 
toefel
Thanks, I followed the advice not to use a regex for this, but thanks a lot anyway!
Martin