ansaurus

Question

PHP RegExp for nested Div tags

Answer 1

+1 A:

I think it would be better to use a DOM parser for this.

Dmitry Merkushin 2009-11-12 10:16:18

Thanks, but my document contains errors, so it would be easier for me to use a regex.

2009-11-12 10:30:56

+1. If it contains errors, it will be MORE difficult to use a regex.

thephpdeveloper 2009-11-12 10:32:01

Answer 2

A:

As I recently found out, regex can't do that.

http://stackoverflow.com/questions/1692830/matching-pair-tag-with-regex

I ended up using XPath, and it works like a charm.
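
For reference, here is a minimal sketch of what the XPath route might look like with PHP's bundled DOM extension. The helper name findNumberedDivs(), the sample markup, and the "t followed by digits" id rule are assumptions made for illustration; they are not taken from the answer above.

```php
<?php
/**
 * Hypothetical helper: return the text of every div whose id looks like
 * "t1", "t2", ... keyed by that id. Uses DOMDocument + DOMXPath.
 */
function findNumberedDivs(string $html): array
{
    // Collect libxml warnings quietly; the input is only a fragment.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    // XPath 1.0 has no regex support, so select every div that has an id
    // and apply the stricter id check in PHP.
    foreach ($xpath->query('//div[@id]') as $div) {
        $id = $div->getAttribute('id');
        if (preg_match('/^t\d+$/', $id)) {
            // textContent includes the text of nested divs as well.
            $result[$id] = $div->textContent;
        }
    }
    return $result;
}

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar '
      . '<div>even more</div> baz <div id="t2">yes</div>';
print_r(findNumberedDivs($text));
```

Nesting needs no special handling here: the parser has already built the tree, so the nested div is simply a child of the matched element.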

Andrei Serdeliuc 2009-11-12 10:16:47

Thanks for your answer. The reason I'm not using XPath/DOM parsing is that the document contains errors. I found this regex: '#<div[^>]*>(?:(?:(?!</?div).)*|(?R))*</div>#si', but I can't adapt it to my needs.

2009-11-12 10:30:22

Your persistence in the face of multiple answers telling you ‘no’ is to be applauded, but **you really can't parse HTML with regex**.

bobince 2009-11-12 10:40:44

Answer 3

+1 A:

Don't use regex to parse html.

Amarghosh 2009-11-12 10:17:40

You could have provided an alternative instead.

thephpdeveloper 2009-11-12 10:31:21

Answer 4

+1 A:

Try a parser instead:

require_once "simple_html_dom.php";
$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz  <div id="t2">yes</div>';
$html = str_get_html($text);
foreach($html->find('div') as $e) {
    if(isset($e->attr['id']) && preg_match('/^t\d++/', $e->attr['id'])) {
        echo $e->outertext . "\n";
    }
}

Output:

<div id="t1">Content <div>more stuff</div></div>
<div id="t2">yes</div>

Download the parser here: http://simplehtmldom.sourceforge.net/

Edit: Mostly for my own amusement, I tried to do it with a regex. Here's what I came up with:

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div>
      baz <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>';
if(preg_match_all('#<div\s+id="t\d+">[^<>]*(<div[^>]*>(?:[^<>]*|(?1))*</div>)[^<>]*</div>#si', $text, $matches)) {
    print_r($matches[0]);
}

Output:

Array
(
    [0] => <div id="t1">Content <div>more stuff</div></div>
    [1] => <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>
)

And a small explanation:

<div\s+id="t\d+">  # match an opening 'div' whose id is 't' followed by one or more digits
[^<>]*             # match zero or more chars other than '<' and '>'
(                  # open group 1
  <div[^>]*>       #   match an opening 'div'
  (?:              #   open a non-capturing group
    [^<>]*         #     match zero or more chars other than '<' and '>'
    |              #     OR
    (?1)           #     recursively match what is defined by group 1
  )*               #   close the non-capturing group and repeat it zero or more times
  </div>           #   match a closing 'div'
)                  # close group 1
[^<>]*             # match zero or more chars other than '<' and '>'
</div>             # match a closing 'div'
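
The breakdown above can also be kept inside the pattern itself by using PCRE's x (extended) flag, which ignores whitespace in the pattern and treats # as the start of a comment. The sketch below is the same regex in that form; the only changes are the ~ delimiter (chosen here because # would otherwise clash with the comments) and a sample string kept on one line.

```php
<?php
// Nowdoc keeps backslashes and quotes literal, so the pattern reads as-is.
$pattern = <<<'REGEX'
~
<div\s+id="t\d+">   # opening div whose id is t followed by digits
[^<>]*              # text that contains no tags
(                   # group 1: one nested div
  <div[^>]*>        #   opening div
  (?:               #   non-capturing group
    [^<>]*          #     text that contains no tags
    |               #     or
    (?1)            #     recurse into group 1
  )*                #   repeated zero or more times
  </div>            #   closing div
)                   # end of group 1
[^<>]*              # text that contains no tags
</div>              # closing div of the outer element
~six
REGEX;

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar '
      . '<div>even more</div> baz <div id="t2">yes '
      . '<div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>';

if (preg_match_all($pattern, $text, $matches)) {
    print_r($matches[0]);
}

// Group 1 is mandatory, so an outer div without a nested div is skipped.
var_dump(preg_match($pattern, '<div id="t3">no child</div>'));
```

Note that group 1 is not optional, so an element such as <div id="t3">no child</div> is not matched by this pattern at all.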

Now perhaps you understand why people try to dissuade you from using regex for this. As already noted, it will not help if the HTML is improperly formed: the regex will make a bigger mess of the output than an HTML parser, I assure you. Also, the regex will probably make your eyes bleed, and your colleagues (or the people who will maintain your software) may come looking for you after seeing what you did. :)

Your best bet is to first clean up your input (using HTML Tidy or similar), and then use a parser to get the info you want.
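
As a rough illustration of the "clean up, then parse" idea, the sketch below does not call Tidy; it relies on the error recovery built into PHP's DOMDocument (libxml), which closes unclosed tags while building the tree. The helper name extractDivText() and the broken sample markup are assumptions for the example.

```php
<?php
/**
 * Hypothetical helper: return the trimmed text of the div with the given
 * id, or null if there is none. Malformed markup is repaired by libxml
 * while loading, so no separate clean-up pass is run here.
 */
function extractDivText(string $html, string $id): ?string
{
    $previous = libxml_use_internal_errors(true); // keep warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@id]') as $div) {
        // Compare in PHP rather than building the id into the XPath string.
        if ($div->getAttribute('id') === $id) {
            return trim($div->textContent);
        }
    }
    return null;
}

// Neither the <b> element nor the outer div is ever closed.
$broken = '<div id="t1">Content <div>more <b>stuff</div>';
var_dump(extractDivText($broken, 't1'));
```

Whether the parser's repairs match what the author of the markup intended depends on the kind of errors in the document, so it is worth checking the result on a few real samples.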

Bart Kiers 2009-11-12 10:35:00

Thank you thank you thank you! :)

2009-11-12 15:01:34

By the look of your answer, I get the impression that you're actually going to use that regex: in which case I pity your co-workers! :)

Bart Kiers 2009-11-12 15:09:18
