How can I extract the content between tags with several line breaks?

I'm new to regex and would like to know how to handle an unknown number of line breaks in the text I'm trying to match.

Task: Extract content between <div class="test"> and the first closing </div> tag.

Original source:

<div class="test">optional text<br/>
content<br/>
<br/>
content<br/>
...
content<br/><a href="/url/">Hyperlink</a></div></div></div>

I've come up with the regex below:

/<div class=\"test\">(.*?)<br\/>(.*?)<\/div>/

I just wonder how to match several line breaks using regex.

I know DOM is an option, but I am not familiar with it.

+2  A: 

You should not parse (x)html with regular expressions. Use DOM.

I'm a beginner in XPath, but an expression like this should work:

//div[@class='test']

This selects all divs whose class attribute is exactly 'test'. You will need to load your HTML into a DOMDocument object, then create a DOMXPath object for it, and call its query() method to get the results. It will return a DOMNodeList object.
Final code looks something like this:

$domd = new DOMDocument();
$domd->loadHTML($your_html_code);
$domx = new DOMXPath($domd);
$items = $domx->query("//div[@class='test']");

After this, your div is in $items->item(0).

This is untested code, but if I remember correctly, it should work.

Update: I forgot that you need the content.

If you need the text content (no tags), you can simply read $items->item(0)->textContent. If you also need the tags, here's the equivalent of JavaScript's innerHTML for PHP DOM:

function innerHTML($node){
  $doc = new DOMDocument();
  foreach ($node->childNodes as $child)
    $doc->appendChild($doc->importNode($child, true));

  return $doc->saveHTML();
}

Call it with $items->item(0) as the parameter.
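
Putting the pieces together, a minimal end-to-end sketch might look like the following. The sample markup is modelled on the question, and the outer wrapper divs are an assumption. Instead of the innerHTML() helper above, it serialises each child with DOMDocument::saveHTML($child) (available since PHP 5.3.6), which is just one possible variation.

```php
<?php
// Sample markup modelled on the question; the two wrapper divs are assumed.
$html = '<div><div><div class="test">optional text<br/>' . "\n"
      . 'content<br/>' . "\n"
      . 'content<br/><a href="/url/">Hyperlink</a></div></div></div>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every div whose class attribute is exactly "test".
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@class='test']");
$div   = $nodes->item(0);

// Text only: all tags are dropped.
$text = $div->textContent;

// Inner markup: serialise each child node and concatenate the results.
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $text, "\n";
echo $inner, "\n";
```

Because the parser tracks nesting, this keeps working when the target div contains further divs, which is where a regex starts to break down.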

Maerlyn
I'll study your method seriously later on. I'm currently reading some DOM tutorials.
Iron
@John XPath takes some getting used to, but once you start to see its power, you'll find it is awesome and a lot more useful than regexps. I recently started rewriting one of my data miner classes from regexps to DOM and XPath, and I was surprised how much shorter it got. It is also quite readable, unlike the regex version.
Maerlyn
A: 

You could use preg_match_all('/<div class="test">(.*?)<\/div>/si', $html, $matches);. The s modifier makes . match newlines as well, so the pattern spans any number of line breaks; i makes it case-insensitive. But remember that the lazy (.*?) stops at the first closing </div> within the HTML. I.e., if the HTML looks like <div class="test">...aaa...<div>...bbb...</div>...ccc...</div> then you would get ...aaa...<div>...bbb... as the result in $matches.

So in the end, using a DOM parser would indeed be a better solution.
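
For the flat markup in the question, the regex route could be sketched as below. The $html string is a shortened copy of the question's sample, and the follow-up split on <br/> is an extra step I am assuming is wanted, not something the question asks for.

```php
<?php
// Shortened copy of the sample from the question, including its line breaks.
$html = "<div class=\"test\">optional text<br/>\n"
      . "content<br/>\n"
      . "<br/>\n"
      . "content<br/><a href=\"/url/\">Hyperlink</a></div></div></div>";

// Without the s modifier, . does not match "\n", so nothing is found.
$withoutS = preg_match('/<div class="test">(.*?)<\/div>/', $html);

// With s, . also matches newlines; the lazy (.*?) stops at the first </div>.
$parts = [];
if (preg_match('/<div class="test">(.*?)<\/div>/s', $html, $m)) {
    $content = $m[1];

    // Optional: split on <br>, <br/> or <br /> and drop empty pieces.
    $parts = preg_split('/<br\s*\/?>/i', $content);
    $parts = array_values(array_filter(array_map('trim', $parts), 'strlen'));
}

print_r($parts);
```

This only holds while the target div contains no nested div; with nesting, the match ends at the inner closing tag, as described above.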

wimvds