I'm trying to find everything inside a div using a regexp. I'm aware that there is probably a smarter way to do this, but I've chosen regexps.

Currently my regexp pattern looks like this:
$gallery_pattern = '/([\s\S]*)<\/div>/';
And it does the trick, somewhat. The problem appears if I have two divs after each other, like this:

<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>

I want to extract the information from both divs, but my problem, when testing, is that I'm not getting the text of each div as a separate result. Instead I get:
"text to extract here
text to extract from here as well"

So, to sum up: it skips the first end of the div and continues on to the next. The text inside the div can contain "<", "/" and line breaks, just so you know!

Does anyone have a simple solution to this problem? I'm still a regexp novice.

+3  A: 

Hi,

What about something like this:

$str = <<<HTML
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
HTML;

$matches = array();
// The s modifier lets . match line breaks, which the question says can occur in the text.
preg_match_all('#<div[^>]*>(.*?)</div>#s', $str, $matches);

var_dump($matches[1]);

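Note the '?' after '.*' in the regex: it makes the quantifier "not greedy", so the match stops at the first </div> instead of running on to the last one.

Which will get you: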

array
  0 => string 'text to extract here' (length=20)
  1 => string 'text to extract from here as well' (length=33)

This should work fine... if you don't have nested divs. If you do... well... actually: are you really sure you want to use regular expressions to parse HTML, which is not that regular itself?
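To make the greedy versus "not greedy" difference and the nested-div caveat concrete, here is a small sketch. It is not part of the original answer; the sample strings and the variable names $greedy, $lazy, $nested and $cut are made up for illustration.

```php
<?php
// Illustration only: $html and $nested are made-up sample inputs.
$html = "<div class=\"gallery\">first</div>\n<div class=\"gallery\">second\nline</div>";

// Greedy: (.*) runs to the LAST </div>, so both divs collapse into one match.
preg_match_all('#<div[^>]*>(.*)</div>#s', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </div>; the s modifier lets . match newlines.
preg_match_all('#<div[^>]*>(.*?)</div>#s', $html, $lazy);

// Nested divs: the lazy match ends at the inner </div>, cutting the outer div short.
$nested = '<div>outer <div>inner</div> tail</div>';
preg_match_all('#<div[^>]*>(.*?)</div>#s', $nested, $cut);

var_dump(count($greedy[1])); // int(1)
var_dump($lazy[1]);          // two entries: "first" and "second\nline"
var_dump($cut[1]);           // one entry: "outer <div>inner"
```

The last case is why the lazy pattern is only safe when the input is known not to contain divs inside divs.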

Pascal MARTIN
@downvoter: could you please explain what is wrong with an answer when you downvote it? It would benefit everyone: the guy who answered (me), so he doesn't make the same mistake again; and people reading the answer, so they know there is something wrong in it, and what... (If it's because I used a regex: well, the OP stated he knows there are better ways, but he said he wants a regex...)
Pascal MARTIN
+1 for the "not greedy" trick and mentioning that it won't work correctly for nested <div>s. I would strongly recommend going with meder's solution though.
Filip Navara
@Filip: I would recommend using DOM and loadHTML too, actually. I did several times in other answers (see http://stackoverflow.com/questions/1274020/extract-form-fields-using-regex/1274074#1274074 for instance): HTML is not something that can be properly parsed with regexes... not regular enough, I suppose ^^
Pascal MARTIN
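For readers who want to see what the DOM and loadHTML route mentioned in the comment above might look like, here is a minimal sketch. It is an illustration under assumptions, not code posted in this thread: the second sample div with a nested div is made up, and the XPath test assumes the class attribute is exactly "gallery".

```php
<?php
// Illustration only: the first div comes from the question, the second is made up.
$html = '<div class="gallery">text to extract here</div>'
      . '<div class="gallery">outer <div>nested</div> tail</div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$texts = array();
// Select every div whose class attribute is exactly "gallery".
foreach ($xpath->query('//div[@class="gallery"]') as $div) {
    // textContent returns the text of the div and all of its descendants.
    $texts[] = $div->textContent;
}

var_dump($texts);
```

Unlike the regex, this keeps the outer div whole even when another div is nested inside it.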
You're great! It was that ? I needed! I just inserted it into my already existing expression and it worked like a charm. You're probably right that I shouldn't use regexps, but I'll have control of the input and I just need it to do this, so this will have to do! Thanks again.
Fifth-Edition
You're welcome :-) If you have control over the input and know you will always get the same kind of data, well, in this case I suppose regexes are OK ^^
Pascal MARTIN
+7  A: 
meder
+1 for showing the correct way to do this, even though it doesn't use a regular expression.
Filip Navara
Yes. Thanks for showing this option too, although I want to solve this using regular expressions. I might take a look at this in a bit, as I suspect that this is the way to go. But I'll accept Pascal MARTIN's solution!
Fifth-Edition
A: 

A possible answer to this problem can be found at http://simplehtmldom.sourceforge.net/. That class helped me solve a similar problem quickly.

dhee