I've run into this problem several times before when scraping HTML with PHP and the preg_* functions.

Most of the time I have to capture structures like this:

<!-- comment -->
<tag1>lorem ipsum</tag1>

<p>just more text with several html tags in it, sometimes CDATA encapsulated…</p>
<!-- /comment -->

In particular I want something like this:

/<tag1>(.*?)<\/tag1>\n\n<p>(.*?)<\/p>/mi

but the \n\n part doesn't look like it will work.

Is there a general switch for matching line breaks?

+3  A: 

I think you could replace the \n\n with (\r?\n){2}. That way each line break matches either a CRLF pair or a bare LF character.
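
A minimal sketch of that suggestion, in case it helps. The sample string and the variable names are made up for illustration. Two things differ from the pattern in the question: the group is non-capturing, (?:...), so the match indexes stay at 1 and 2, and the s modifier is added so that . can also match newlines inside the <p> body.

```php
<?php
// Hypothetical sample input with Windows-style (CRLF) line endings.
$html = "<!-- comment -->\r\n<tag1>lorem ipsum</tag1>\r\n\r\n"
      . "<p>just more\r\ntext</p>\r\n<!-- /comment -->";

// (?:\r?\n){2} matches two line breaks, each either LF or CRLF.
// s: . also matches newlines; i: case-insensitive tag names.
$pattern = '/<tag1>(.*?)<\/tag1>(?:\r?\n){2}<p>(.*?)<\/p>/si';

$title = null;
$body = null;
if (preg_match($pattern, $html, $m)) {
    $title = $m[1]; // text between <tag1> and </tag1>
    $body = $m[2];  // text between <p> and </p>, line breaks included
}
```

PCRE also has the \R escape, which matches any newline sequence, so \R{2} may work in place of (?:\r?\n){2} if your PHP build supports it.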

Paulo Santos
+1  A: 

Are you sure you want to parse HTML using regexps? HTML isn't a regular language, and there are too many corner cases.

I would investigate some form of HTML parser (perhaps this one?), and then identify the pattern you're interested in via the returned HTML data structure.

Brian Agnew
Scraping with regex has worked fine for me so far. But thanks for the link.
DASKAjA
A: 

Or you could look at the DOM extension for PHP. It can load HTML from a string or from a file. You can then use the PHP DOM methods to traverse the document and find the data you are interested in.
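
Roughly, that could look like the sketch below. It assumes the DOM extension is enabled; the sample string, the variable names, and the choice of XPath for the lookup are illustrative only. Since <tag1> is not a real HTML element, libxml reports parse errors for it, so they are collected internally rather than printed.

```php
<?php
// Hypothetical sample input modeled on the structure in the question.
$html = "<!-- comment -->\n<tag1>lorem ipsum</tag1>\n\n"
      . "<p>just more <b>text</b></p>\n<!-- /comment -->";

// Collect libxml parse errors (unknown tag <tag1>) instead of emitting warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html); // use loadHTMLFile() to read from a file instead
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The first <tag1>, and the first <p> that follows it in document order.
$tagNode = $xpath->query('//tag1')->item(0);
$pNode = $xpath->query('//tag1/following::p[1]')->item(0);

// textContent returns the text with nested tags such as <b> stripped.
$title = $tagNode ? $tagNode->textContent : null;
$body = $pNode ? $pNode->textContent : null;
```

Line endings no longer matter with this approach, because the parser deals with the whitespace between the elements.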

timmow