Hi guys,

I need to get the full contents of the div with class = "parent" using preg_match:

<div class = "parent">

    <div id = "child1">
    </div>

    <div id = "child2">
    </div>

</div>

Anyone?

+2  A: 

The right way to do this is using the DOM and xpath to target the specific elements and attributes you are attempting to extract. However, as this is homework, let's educate your instructor.

Given that exact string, this regex will work: !<div class = "parent">(.+)</div>!s

The key is the "s" modifier. It turns the "." character from "anything except a newline" to "anything including a newline."
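As a rough illustration of what the "s" modifier changes, the sketch below runs the same pattern with and without it against markup shaped like the question's. The variable names are only illustrative.

```php
<?php
// Sample input shaped like the markup in the question (nowdoc, no interpolation).
$html = <<<'HTML'
<div class = "parent">

    <div id = "child1">
    </div>

</div>
HTML;

$pattern = '!<div class = "parent">(.+)</div>!';

// Without "s", "." cannot cross the newline right after the opening tag,
// so the pattern finds nothing.
$withoutS = preg_match($pattern, $html);

// With "s", "." also matches newlines, so the match can span several lines.
$withS = preg_match($pattern . 's', $html, $m);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```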

However, if the spaces were removed around the =, this would break. If there were more attributes, it would break. If there were more class names, this would break. In other words, it's the worst way to deal with HTML ever.

Hell, if the HTML looked like this, it would break:

<div>
    <div class = "parent">
        My spoon is too big!
        <div>
            I am a banana!
        </div><!-- Matches when un-greedy -->
    </div>
</div><!-- Matches when greedy -->

Why? Because .+ is what's called "greedy": it matches as much as it possibly can while still letting the rest of the pattern match. Here that means it matches everything from div.parent to the greedy comment. It can be made un-greedy by adding a question mark (.+?), but then it stops at the first place the rest of the pattern can match rather than the last, so it matches everything from div.parent to the un-greedy comment. Neither one stops at the closing tag that actually belongs to div.parent.
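To make the difference concrete, here is a small sketch that runs both variants against the nested example and counts how many closing tags each capture swallowed. The correct inner content of div.parent contains exactly one closing tag, and neither variant gets there.

```php
<?php
$html = <<<'HTML'
<div>
    <div class = "parent">
        My spoon is too big!
        <div>
            I am a banana!
        </div><!-- Matches when un-greedy -->
    </div>
</div><!-- Matches when greedy -->
HTML;

preg_match('!<div class = "parent">(.+)</div>!s', $html, $greedy);
preg_match('!<div class = "parent">(.+?)</div>!s', $html, $lazy);

// Count the closing tags inside each captured group.
$greedyCount = substr_count($greedy[1], '</div>');
$lazyCount = substr_count($lazy[1], '</div>');

echo $greedyCount, "\n"; // 2: ran past div.parent's own closing tag
echo $lazyCount, "\n";   // 0: stopped at the inner div's closing tag
```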

Because of the nesting issues, regular expressions are a very poor tool to parse HTML. The problems I've shown you here only touch the surface of the h̨̜̜̟̬̭͍̀o̶̻̹̲̥̻ͧ́̆͆̊̉̍r̟͓ͨ́͆ͨͅr̪̖̠̖̤̊̾ͣͦ̀o̡̬͉͈͚̙͙ͯ͑ͨ͒ͥͩ̇ȓ̵̥̙͈̟͂̃s̠̏̊̃͠ that await you.

Please, when possible, use a real HTML/XML parser and work with the resulting DOM. It will save your sanity.
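For comparison, a minimal sketch of the DOM and XPath route might look like the following. It assumes PHP's DOM extension is available; the XPath expression and variable names are my own choices for illustration, not the only way to write this.

```php
<?php
$html = <<<'HTML'
<div class = "parent">
    <div id = "child1">
    </div>
    <div id = "child2">
    </div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Treat "parent" as one of possibly several space-separated class names.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " parent ")]';
$parent = $xpath->query($query)->item(0);

// Serialize each child node to rebuild the inner HTML of the parent div.
$inner = '';
if ($parent !== null) {
    foreach ($parent->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner;
```

Because the parser builds a tree, nesting depth, attribute order, and spacing around the = stop mattering.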

Charles
+1 for explaining `s` modifier. That has been an issue for me a few times lately, and I'd forgotten how to solve it!
JGB146
OK, thanks a lot.
Karl
A: 

For your purposes, this will probably do, though it's not without problems (as noted in the links):

preg_match('/<div class = "parent">(.*)<\/div>/s', $input, $matches);

After which, $matches[0] will contain the matching text (including the parent div) and $matches[1] will contain the inner items only.
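A quick sketch of what ends up in each slot, using a one-line input and a pattern quoted to match the double-quoted attribute from the question's markup:

```php
<?php
$input = '<div class = "parent"><div id = "child1"></div></div>';

// Pattern written for the double-quoted class attribute used in the question.
preg_match('/<div class = "parent">(.*)<\/div>/s', $input, $matches);

echo $matches[0], "\n"; // whole match, including the parent div
echo $matches[1], "\n"; // captured group: <div id = "child1"></div>
```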

JGB146
Thanks a lot, guys.
Karl
A: 

You end up with something barbaric like this:

/<div[^>]+class ?= ?"parent"[^>]*>(\s*(?:<div.*<\/div>\s*)*)<\/div>/Us

First, searching within the opening div tag for the desired class: I like using [^>], a negated character class that matches anything but a ">" character. Then allowing for optional spaces around the "=".
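To see what that opening-tag portion tolerates, the sketch below isolates it and tries it against a few tag variants. The sample tags are made up for illustration.

```php
<?php
// Opening-tag portion of the pattern above, isolated for illustration.
$open = '/<div[^>]+class ?= ?"parent"[^>]*>/';

$tags = [
    '<div class = "parent">',                // spaces around "="
    '<div class="parent">',                  // no spaces
    '<div id="x" class="parent" title="y">', // extra attributes
    '<div class="parent wide">',             // second class name
];

$results = [];
foreach ($tags as $tag) {
    $results[] = preg_match($open, $tag) === 1;
    printf("%-40s %s\n", $tag, end($results) ? 'match' : 'no match');
}
```

The first three variants match; the last one does not, because the pattern expects the closing quote immediately after "parent".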

Then the basic idea is to pair up each subsequent opening div tag with its closing mate so as to be able to stop at the right spot. This is done with a non-capturing subpattern that can repeat 0 or more times. Note this only works with one level of nesting. To deal with deeper nesting you'll need recursion, and that gets hard to conceptualize.

The recursive version would look something like this:

/<div[^>]+class ?= ?"parent"[^>]*>(\s*(<div.*(?2).*<\/div>\s*)*)<\/div>/Us

Overall, if I couldn't do the sane thing and use the DOM, I'd prefer to walk through the string (starting each time from the previous match), incrementing a counter for every opening div tag I encountered and decrementing it for every closing tag.
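A rough sketch of that counter approach could look like this. The function name innerHtmlOfDiv and its parameters are hypothetical, and it assumes div tags never appear inside comments, scripts, or attribute values.

```php
<?php
/**
 * Hypothetical helper: return the inner HTML of the first div whose opening
 * tag matches $openPattern, by counting nested div tags.
 */
function innerHtmlOfDiv(string $html, string $openPattern): ?string
{
    if (!preg_match($openPattern, $html, $m, PREG_OFFSET_CAPTURE)) {
        return null; // no such opening tag
    }
    $start = $m[0][1] + strlen($m[0][0]); // first byte after the opening tag
    $offset = $start;
    $depth = 1;

    // Each pass finds the next opening or closing div tag at or after $offset.
    while (preg_match('/<(\/?)div\b[^>]*>/i', $html, $t, PREG_OFFSET_CAPTURE, $offset)) {
        $depth += ($t[1][0] === '/') ? -1 : 1;
        if ($depth === 0) {
            // This closing tag balances the target div.
            return substr($html, $start, $t[0][1] - $start);
        }
        $offset = $t[0][1] + strlen($t[0][0]);
    }
    return null; // ran out of tags before the div was closed
}

$html = '<div><div class = "parent">A<div>B</div>C</div></div>';
echo innerHtmlOfDiv($html, '/<div[^>]+class ?= ?"parent"[^>]*>/'), "\n";
// prints: A<div>B</div>C
```

The regexes here only locate individual tags; the balancing is done by the counter, which handles any nesting depth without recursion in the pattern.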

Note these are off the top of my head and posted for the sake of learning regex, not with the idea that parsing HTML with regular expressions is sane. Also, I'd hate to see a log of the calisthenics the regex engine has to go through to balance all those wildcards.

Eli