First off, I'm aware this is bad practice, and I have even answered many questions saying so. To be clear, though, I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality.

Now that we've got that out of the way: because I always use DOM methods, I'm not used to doing this with regular expressions.

I want to capture everything inside the intro-content div, up to the first closing div tag. I don't care if the regex fails on nested divs. I need to capture whitespace (newline) characters too.

<div class="intro-content">
<p>blah</p>
<br/>
<strong>test</strong>
</div>

Regex so far:

<div\s*class="intro-content">(.*)</div>

This obviously doesn't work because the . character will not match space characters.

I do realize hundreds of questions like this have been asked, but the ones I visited only had relatively simple answers (excluding the DOM suggestion answers) where a (.*) would not suffice because it doesn't account for newlines, and some regexes were too greedy.

I'm not looking for a perfect, clean solution that will account for every possibility (like that's even possible). I just want a quick fix that will work for this situation so I can move on and work on more modern applications that aren't so horribly coded.

A: 

This obviously doesn't work because the . character will not match space characters.

The dot does match spaces; what it skips by default is newlines. Putting the dot inside a character class won't help, because there it is just a literal dot. Use a class that matches any character instead:

<div\s*class="intro-content">([\s\S]*)</div>

You then need to make it lazy, so it captures everything up to the first </div> and not the last. We do this by adding a question mark:

<div\s*class="intro-content">([\s\S]*?)</div>

There. Give that a shot. [\s\S] means "whitespace or non-whitespace", which covers every character, newlines included.
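To see the greedy/lazy difference in isolation, here is a minimal sketch. It assumes PHP's preg_match and uses [\s\S] as the any-character class; the sample markup with a second div is made up purely for illustration.

```php
<?php
// Hypothetical input: the target div followed by an unrelated second div.
$html = <<<'HTML'
<div class="intro-content">
<p>blah</p>
</div>
<div class="other">later</div>
HTML;

// [\s\S] matches any character, newlines included, without needing a flag.
// "~" is used as the delimiter so the "/" in </div> needs no escaping.
$greedy = '~<div\s*class="intro-content">([\s\S]*)</div>~';
$lazy   = '~<div\s*class="intro-content">([\s\S]*?)</div>~';

preg_match($greedy, $html, $g);
preg_match($lazy, $html, $l);

var_dump($g[1]); // runs on to the last </div> in the input
var_dump($l[1]); // stops at the first </div>
```

The greedy capture swallows the first closing tag and part of the second div, while the lazy one holds only the body of the intro-content div.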

Samir Talwar
A: 

It sounds like you need to enable the "dot all" (s) flag. This will make . match all characters including line breaks. For example:

preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $matches);
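As a rough check of what the s flag changes, the sketch below runs the same pattern with and without it. The $html string is an assumed stand-in for the markup in the question.

```php
<?php
// Assumed sample input: the div body spans several lines.
$html = "<div class=\"intro-content\">\n<p>blah</p>\n</div>";

// Without the s flag, . cannot cross the newline after the opening tag,
// so the pattern fails to match.
$without = preg_match('/<div\s*class="intro-content">(.*)<\/div>/', $html, $m1);

// With the s flag, . also matches newlines and the group captures the body.
$with = preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)
var_dump($m2[1]);
```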
Phil Ross
Ugh, can't believe I forgot that. Thank you.
meder
+2  A: 

You should not use regexps to parse HTML like this. div tags can be nested, and since a regexp has no notion of context, it cannot reliably parse that. Use an HTML parser instead. For example:

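$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMDocument has no getElementsByClassName(), so fetch the divs by tag name
// and filter on the class attribute.
foreach ($doc->getElementsByTagName("div") as $div) {
  if ($div->getAttribute("class") === "intro-content") {
    var_dump($div);
  }
}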

See: DomDocument

Edit:

And then I saw your note:

I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality

Well. At least make sure that you match non-greedily. That way it'll match correctly as long as there are no nested tags:

preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $matches);
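To make the nested-tag caveat concrete, here is a small sketch of what happens when the div does contain another div. The nested markup is a made-up example, not taken from the question.

```php
<?php
// Hypothetical input with a div nested inside the intro-content div.
$html = <<<'HTML'
<div class="intro-content">
<div class="inner">nested</div>
<p>after the inner div</p>
</div>
HTML;

preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $m);

// The lazy match ends at the inner </div>, so the <p> that follows is lost.
var_dump($m[1]);
```

The capture is cut short at the first closing tag, which is the failure mode the question says it can live with.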
troelskn