ansaurus

Question

Capture content inside html tags with regex

Answer 1

A:

"This obviously doesn't work because the . character will not match space characters."

It should match spaces and tabs; what . does not match by default is a line break. If the content spans several lines, we can get around that with a character class that matches any character at all, line breaks included:

<div\s*class="intro-content">([\s\S]*)</div>

Note that a . inside square brackets is a literal dot, so a class such as [ \t\r\n.] would match only whitespace and dots, not arbitrary text.

You then need to make it lazy, so it captures everything up to the first </div> and not the last. We do this by adding a question mark:

<div\s*class="intro-content">([\s\S]*?)</div>

There. Give that a shot.
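To see what the question mark changes, here is a minimal PHP sketch. The sample markup in $html is made up for illustration, [\s\S] is used as an "any character, line breaks included" class, and # is chosen as the pattern delimiter so the slash in </div> needs no escaping.

```php
<?php
// Hypothetical sample input: the target div followed by an unrelated div.
$html = "<div class=\"intro-content\">\n  Hello\n</div>\n<div>Footer</div>";

// Greedy: [\s\S]* runs on to the LAST </div> in the input.
preg_match('#<div\s*class="intro-content">([\s\S]*)</div>#', $html, $greedy);

// Lazy: [\s\S]*? stops at the FIRST </div>.
preg_match('#<div\s*class="intro-content">([\s\S]*?)</div>#', $html, $lazy);

echo trim($lazy[1]), "\n";                          // Hello
var_dump(strlen($greedy[1]) > strlen($lazy[1]));    // bool(true)
```

The greedy capture swallows the first closing tag and the start of the second div, which is rarely what you want.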

Samir Talwar 2009-11-04 22:27:57

Answer 2

A:

It sounds like you need to enable the "dot all" (s) flag. This will make . match all characters including line breaks. For example:

preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $matches);
// $matches[1] now holds the captured content
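As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without s. The sample markup and the variable names ($pattern, $withFlag, $withoutFlag) are assumptions made up for this example.

```php
<?php
// Hypothetical sample input with line breaks inside the div.
$html = "<div class=\"intro-content\">\nline one\nline two\n</div>";
$pattern = '<div\s*class="intro-content">(.*)<\/div>';

// Without the s flag, . cannot cross a line break, so nothing matches.
$withoutFlag = preg_match('/' . $pattern . '/', $html, $m1);

// With the s flag, . also matches line breaks and the capture succeeds.
$withFlag = preg_match('/' . $pattern . '/s', $html, $m2);

var_dump($withoutFlag); // int(0)
var_dump($withFlag);    // int(1)
```

With the flag set, $m2[1] contains the text between the tags, line breaks included.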

Phil Ross 2009-11-04 22:28:53

Ugh, can't believe I forgot that. Thank you.

meder 2009-11-04 22:38:50

Answer 3

+2 A:

You should not use regexes to parse HTML like this. div tags can be nested, and since a regex has no notion of context, it cannot tell which </div> closes which <div>. Use an HTML parser instead. For example:

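$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMDocument has no getElementsByClassName(); select by tag name, then check the class.
foreach ($doc->getElementsByTagName("div") as $div) {
  if ($div->getAttribute("class") === "intro-content") {
    var_dump($div->textContent);
  }
}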

See: DOMDocument
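If you want the markup of that one div rather than a dump of every div, something along these lines may work. It is only a sketch: the sample $html is invented, and the XPath expression assumes the class attribute is exactly "intro-content" (a div carrying several classes would need a contains() test instead).

```php
<?php
// Hypothetical sample input with a nested div inside the target div.
$html = '<div class="intro-content"><p>Hi</p><div>nested</div></div><div>other</div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Exact match on the class attribute value.
$nodes = $xpath->query('//div[@class="intro-content"]');

foreach ($nodes as $div) {
    // saveHTML($node) serialises the node together with its nested tags.
    echo $doc->saveHTML($div), "\n";
}
```

Because the parser tracks nesting, the inner div is kept as part of the result instead of cutting it short.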

Edit:

And then I saw your note:

I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality

Well, at least make sure the match is non-greedy. That way it will match correctly as long as there are no nested div tags:

preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $matches);
// $matches[1] now holds the captured content
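To make the nested-tag caveat concrete, here is a small sketch with invented sample markup in which the target div contains another div.

```php
<?php
// Hypothetical sample input: the target div contains another div.
$html = '<div class="intro-content">before <div>inner</div> after</div>';

preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $matches);

// The lazy match stops at the inner </div>, so " after" is lost.
echo $matches[1], "\n"; // before <div>inner
```

The capture ends at the first closing tag it meets, so this pattern is only safe when the stored content is known not to contain nested divs.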

troelskn 2009-11-04 22:40:39
