I've spent a good few hours trying to get this regular expression to work, and all I've got so far is one big headache!

I'm using cURL to load a page into variable $o. Now somewhere in this page is the following:

  <tr valign="top">
   <td>value1</td>
   <td>value2</td>
   <td align="right">value3</td>
  </tr>

This block is repeated 3 or so times. Naturally, I'd like to grab value1, value2, and value3 from each one and store them in an array. Here's my attempt:

  preg_match_all('/<tr valign="top"><td>(.*)<\/td><td>(.*)<\/td><td align="right">(.*)<\/td><\/tr>/',
                        $o,
                        $out);

But all this seems to output is an empty array. Can anyone spot where I've gone wrong?

+5  A: 

Don't use regular expressions to parse HTML. Use an HTML parser.
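
For instance, PHP's bundled DOM extension can handle this. The following is only a minimal sketch: the sample string is made up and stands in for the cURL response in $o, and the variable names ($doc, $xpath, $rows, $cells) are just for illustration.

```php
<?php
// Hypothetical sample standing in for the cURL response held in $o.
$o = <<<HTML
<html><body><table>
  <tr valign="top">
   <td>value1</td>
   <td>value2</td>
   <td align="right">value3</td>
  </tr>
  <tr valign="top">
   <td>value4</td>
   <td>value5</td>
   <td align="right">value6</td>
  </tr>
</table></body></html>
HTML;

// Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($o);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];

// Find every <tr valign="top"> at any depth, then read its <td> children.
foreach ($xpath->query('//tr[@valign="top"]') as $tr) {
    $cells = [];
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Each entry in $rows is then one table row with its cell values in document order. Unlike a regex, this does not care about whitespace, attribute order, or newlines between the tags.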

Andy Lester
It's about time we get a template for that ...
Jörg W Mittag
+1  A: 

Insert Obligatory Link to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 here.

Lotus Notes
I'd like it if that link stopped being obligatory. Yeah, we experienced people get it, ha ha, parsing HTML with regexes makes you crazy. But that answer is more interested in being clever than helpful.
Andy Lester
@andy: What you say is true to an extent, but that post gives essentially the same advice you just gave, below all `teh funney`, so I fail to see how your answer is any different. Not to mention, if you are gung-ho about parsing with regex, there are a lot of good tips on that question.
prodigitalson
+1  A: 

Just make your life easier:

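// Assumes the response is well-formed XML; SimpleXMLElement throws an exception otherwise.
$dom = new SimpleXMLElement($curlResponse);
// The leading // finds matching rows at any depth, not only directly under the root element.
$candidates = $dom->xpath("//tr[@valign='top']");

foreach ($candidates as $tr)
{
   // Keep only rows with exactly three cells whose third cell is right-aligned.
   if (count($tr->td) == 3 && isset($tr->td[2]['align']) && (string) $tr->td[2]['align'] == 'right')
   {
      foreach ($tr->td as $td)
      {
          // do something with value $td, e.g. (string) $td
      }
   }
}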

You could probably even simplify that by moving some of the tests directly into the XPath expression to find a unique td signature within the structure, then going back up to the parent tr and iterating over the tds... but I'm far from an XPath guru, so I keep it simple :-)
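
One way that simplification might look, as a rough sketch only: the sample fragment is made up, it has to be well-formed XML for SimpleXMLElement to accept it, and the names $matches and $values are just for illustration. The three tests from the if statement become XPath predicates on the tr itself, so there is no need to walk back up from a td.

```php
<?php
// Hypothetical well-formed fragment standing in for the cURL response.
$curlResponse = <<<XML
<table>
  <tr valign="top">
   <td>value1</td>
   <td>value2</td>
   <td align="right">value3</td>
  </tr>
  <tr>
   <td>skipped</td>
  </tr>
</table>
XML;

$dom = new SimpleXMLElement($curlResponse);

// Row must have valign="top", exactly three td children, and a right-aligned third td.
$matches = $dom->xpath('//tr[@valign="top"][count(td)=3][td[3]/@align="right"]');

$values = [];
foreach ($matches as $tr) {
    $row = [];
    foreach ($tr->td as $td) {
        $row[] = (string) $td;   // cast the element to get its text
    }
    $values[] = $row;
}

print_r($values);
```

Note that XPath positions are 1-based, so td[3] here is the same cell as $tr->td[2] in the PHP code above.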

prodigitalson
A: 

Looks like your pattern doesn't account for the newlines between the tags. Try

  preg_match_all('/<tr valign="top">.*<td>(.*)<\/td>.*<td>(.*)<\/td>.*<td align="right">(.*)<\/td>.*<\/tr>/s',
                    $o,
                    $out);

The /s modifier makes the dot match all characters (normally it doesn't match newlines). If you run into problems, it might be because there are other tds or trs in the output. You can fix that by making the stars lazy by appending a ? after each one (.*? instead of .*).
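
Here is a sketch of the lazy version, run against a made-up sample that stands in for the page in $o. The PREG_SET_ORDER flag is my addition, not part of the original attempt; it groups the captures per row instead of per capture group.

```php
<?php
// Hypothetical sample standing in for the cURL response held in $o.
$o = <<<HTML
<table>
        <tr valign="top">
   <td>value1</td>
   <td>value2</td>
   <td align="right">value3</td>
  </tr>
        <tr valign="top">
   <td>value4</td>
   <td>value5</td>
   <td align="right">value6</td>
  </tr>
</table>
HTML;

// Every star is lazy (.*?), so each match stops at the first </tr> it can reach
// instead of running from the first row through to the last one.
preg_match_all(
    '/<tr valign="top">.*?<td>(.*?)<\/td>.*?<td>(.*?)<\/td>.*?<td align="right">(.*?)<\/td>.*?<\/tr>/s',
    $o,
    $out,
    PREG_SET_ORDER
);

foreach ($out as $match) {
    // $match[0] is the whole row; $match[1] to $match[3] are the three cell values.
    echo $match[1], ', ', $match[2], ', ', $match[3], "\n";
}
```

With the greedy stars, the same input produces a single match that spans both rows, which is why only some of the values come back.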

mwhite