Say I had this piece of HTML for example:

<div id="gallery2" class="galleryElement">
  <h2>My Photos</h2>
  <div class = "imageElement">
    <h3>@Embassy - VIP </h3>
    <p><b>Image URL:</b>
      <a href = "http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg" target = "_blank">http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</a></p>
      <a href = "http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg" title = "open image" class = "open"></a>
      <img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg" class = "full"/>
      <img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_887303260m.jpg" class = "thumbnail"/>
  </div>
  <div class = "imageElement">
    <h3>@Embassy - VIP </h3>
    <p><b>Image URL:</b>
      <a href = "http://photos-p.friendster.com/photos/78/86/774534426887/1_119466535.jpg" target = "_blank">http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</a></p>
      <a href = "http://photos-p.friendster.com/photos/78/86/774534426887/1_119466535.jpg" title = "open image" class = "open"></a>
      <img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_119466535.jpg" class = "full"/>
      <img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_887303260m.jpg" class = "thumbnail"/>
  </div>
</div>

I need to build a regular expression that matches each div with class imageElement and stores its contents (as text) in an array, starting from the opening <div class = "imageElement"> through its closing </div>. Also, there really are spaces in class = "imageElement". So far I have this expression:

\<div class = "imageElement">[\s\S\d\D]*</div>

but it only matches the whole set of elements as a single match. Thanks in advance :p

+4  A: 

This is a pretty common question here ("How do I parse this XML/HTML with a regular expression?") and I'll give you the same answer: don't.

Regular expressions are notoriously bad at this kind of thing. HTML/XML is not "regular" in the regex sense.

PHP comes with at least 3 XML parsers (SimpleXML, DOMDocument and XMLReader spring to mind) that will do this reliably. Use one of those.

Take a look at Parse HTML With PHP And DOM as an example.
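As a rough sketch of what the DOM route could look like for the markup in the question: the function name extractImageElements is made up for illustration, and the XPath class test is just one common way to match a class token, not the only one.

```php
<?php
// Hypothetical helper: returns the outer HTML of every <div> whose class
// list contains "imageElement". Nested divs inside a match are kept intact.
function extractImageElements($html)
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them;
    // real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad the class attribute with spaces so "imageElement" is matched
    // as a whole class name rather than as a substring.
    $nodes = $xpath->query(
        '//div[contains(concat(" ", normalize-space(@class), " "), " imageElement ")]'
    );

    $results = array();
    foreach ($nodes as $node) {
        // saveHTML($node) serializes only this element and its children.
        $results[] = $doc->saveHTML($node);
    }
    return $results;
}
```

Calling extractImageElements($body) on the gallery HTML above should give one array entry per imageElement div. The serialized markup is normalized by the parser (for example class="imageElement" without the spaces), so it will not be byte-for-byte identical to the input.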

cletus
:-) It's a fun wheel to re-invent if you want to sharpen your regex skills, but the answer is correct: whatever you build will fail on some case, and there is a reason people build parser libraries.
Devin Ceartas
+1  A: 

Sounds like the trouble you're having is that the * is greedy, i.e. it matches as much as possible, whereas you want it to match as little as possible.
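To make the greedy vs. lazy difference concrete, here is a small sketch. The $body string is made-up sample input (two sibling blocks, no nested divs), and the # delimiters and s modifier are just my choices.

```php
<?php
// Hypothetical sample input: two sibling blocks, no nested divs.
$body = '<div class = "imageElement"><h3>A</h3></div>'
      . '<div class = "imageElement"><h3>B</h3></div>';

// Greedy: ".*" runs to the LAST "</div>", so both blocks come back as one match.
$greedyCount = preg_match_all('#<div class = "imageElement">.*</div>#s', $body, $greedy);

// Lazy: ".*?" stops at the FIRST "</div>", giving one match per block.
// The "s" modifier lets "." match newlines as well.
$lazyCount = preg_match_all('#<div class = "imageElement">.*?</div>#s', $body, $lazy);

// $greedyCount is 1, $lazyCount is 2; the individual blocks are in $lazy[0].
```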

If the data inside your divs does not contain "</div>" then you can keep the parsing pretty simple. If it can contain arbitrary HTML data (specifically nested divs), you'll need to parse it more.
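Here is a quick illustration of why nested divs break the simple approach. The $nested string is invented for the example; the point is only that a lazy match ends at the first closing tag it sees, whichever div that tag belongs to.

```php
<?php
// Hypothetical input: an imageElement that contains a nested <div>.
$nested = '<div class = "imageElement"><div class = "caption">hi</div><p>tail</p></div>';

// The lazy group stops at the inner div's "</div>", not the outer one.
preg_match('#<div class = "imageElement">(.*?)</div>#s', $nested, $m);
$captured = $m[1];

// $captured is '<div class = "caption">hi'; the "<p>tail</p>" part is lost.
```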

If it stays basic, you could do the whole thing without regex. It's a little hackish, but as long as your data stays simple and predictable, it should work and be really fast:

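// explode() takes the delimiter first, then the string to split
$chunks = explode('<div class = "imageElement">', $body);
array_shift($chunks); // drop everything before the first opening tag
$matches = array();
foreach ($chunks as $chunk) {
    // strpos() takes the haystack first, then the needle
    $pos = strpos($chunk, '</div>');
    // compare against false explicitly: a match at offset 0 is falsy
    if ($pos !== false) {
        $matches[] = substr($chunk, 0, $pos);
    }
}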

If you need something more flexible, use a real html parser.

JasonWoof