Hey guys, I'm trying to write a regex to pull some data out of an HTML table.

The markup I've got now is:

<table>
   <tr>
     <td>quote1</td>
     <td>have you trying it off and on again ?</td>
   </tr>
   <tr>
     <td>quote65</td>
     <td>You wouldn't steal a helmet of a policeman</td>
   </tr>
</table>

I want to replace this with:

quote1:have you trying it off and on again ?

quote65:You wouldn't steal a helmet of a policeman

The regex I have written so far is:

%<td>((?s).*?)</td>%

But now I'm stuck.

A: 

As usual, extracting text from HTML and other non-regular languages should be done with a parser - regexes can cause problems here. But if you're certain of your data's structure, you could use

%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%

to find the two pieces of text. \1:\2 would then be the replacement.

If the text cannot span more than one line, you'd be safer dropping the (?s) bits...
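A minimal sketch of that replacement in PHP, assuming the table from the question is held in a variable called $html (a name chosen here for illustration). Note that the replacement only rewrites the two cells, so the surrounding <table> and <tr> tags are still there afterwards; the sketch removes them with strip_tags, which is one possible way to finish the job, not the only one.

```php
<?php
// $html is an assumed variable name holding the table from the question.
$html = <<<HTML
<table>
   <tr>
     <td>quote1</td>
     <td>have you trying it off and on again ?</td>
   </tr>
   <tr>
     <td>quote65</td>
     <td>You wouldn't steal a helmet of a policeman</td>
   </tr>
</table>
HTML;

// Pattern from the answer above; \1 and \2 refer to the two captured cells.
$pattern  = '%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%';
$replaced = preg_replace($pattern, '\1:\2', $html);

// The <table> and <tr> tags are untouched by the replacement, so strip them,
// then trim each line and drop the empty ones.
$lines = array_values(array_filter(
    array_map('trim', explode("\n", strip_tags($replaced))),
    'strlen'
));

echo implode("\n", $lines), "\n";
```

This relies on each cell sitting on its own line, as in the question; markup laid out differently would need a different clean-up step.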

Tim Pietzcker
A: 

Don't use regex; use an HTML parser, such as the PHP Simple HTML DOM Parser.

Peter Boughton
+3  A: 

Tim's regex probably works, but you may want to consider using the DOM functionality of PHP instead of regex, as it may be more reliable in dealing with minor changes in the markup.

See the DOMDocument::loadHTML method.
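For illustration, here is a rough sketch of the DOM approach using PHP's built-in DOMDocument class. The variable names ($html, $doc, $lines) are assumptions made for this example, and it presumes every row of interest has at least two <td> cells.

```php
<?php
// $html is an assumed variable name holding the table from the question.
$html = <<<HTML
<table>
   <tr>
     <td>quote1</td>
     <td>have you trying it off and on again ?</td>
   </tr>
   <tr>
     <td>quote65</td>
     <td>You wouldn't steal a helmet of a policeman</td>
   </tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$lines = array();
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    // Skip rows that do not have the two expected cells.
    if ($cells->length >= 2) {
        $lines[] = trim($cells->item(0)->textContent)
                 . ':'
                 . trim($cells->item(1)->textContent);
    }
}

echo implode("\n", $lines), "\n";
```

Because the parser works on the element tree rather than on the raw text, extra attributes or different whitespace in the markup should not matter here.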

Jani Hartikainen
+1  A: 

If you really want to use regexes (which might be OK if you are absolutely sure your string will always be formatted like that), what about something like this, in your case:

$str = <<<A
<table>
   <tr>
     <td>quote1</td>
     <td>have you trying it off and on again ?</td>
   </tr>
   <tr>
     <td>quote65</td>
     <td>You wouldn't steal a helmet of a policeman</td>
   </tr>
</table>
A;

$matches = array();
preg_match_all('#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#', $str, $matches);

var_dump($matches);

A few words about the regex :

  • <tr>
  • then one or more whitespace characters (\s+?)
  • then <td>
  • then what you want to capture
  • then </td>
  • and the same again
  • and finally, </tr>

And I use :

  • ? in the regex to match in non-greedy mode
  • preg_match_all to get all the matches
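The breakdown above can also be written into the pattern itself. This is just a sketch of the same regex using the x (extended) modifier, which lets you spread the pattern over several lines and comment each part; the ~ delimiter is used instead of # because # starts a comment in extended mode.

```php
<?php
$str = <<<A
<table>
   <tr>
     <td>quote1</td>
     <td>have you trying it off and on again ?</td>
   </tr>
   <tr>
     <td>quote65</td>
     <td>You wouldn't steal a helmet of a policeman</td>
   </tr>
</table>
A;

// Same pattern as above, annotated piece by piece (x modifier).
$pattern = '~
    <tr> \s+?             # opening row tag, then one or more whitespace chars
    <td> (.*?) </td> \s+? # first cell, captured lazily as group 1
    <td> (.*?) </td> \s+? # second cell, captured lazily as group 2
    </tr>                 # closing row tag
~x';

$matches = array();
$count = preg_match_all($pattern, $str, $matches);

// Group 1 holds the first cells, group 2 the second cells.
print_r($matches[1]);
print_r($matches[2]);
```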

You then get the results you want in $matches[1] and $matches[2] (not $matches[0], which holds the full matches). Here's the output of the var_dump I used (I've removed entry 0 to make it shorter):

array
  0 => 
    ...
  1 => 
    array
      0 => string 'quote1' (length=6)
      1 => string 'quote65' (length=7)
  2 => 
    array
      0 => string 'have you trying it off and on again ?' (length=37)
      1 => string 'You wouldn't steal a helmet of a policeman' (length=42)

You then just need to process this array with some string concatenation or the like; for instance, like this:

$num = count($matches[1]);
for ($i=0 ; $i<$num ; $i++) {
    echo $matches[1][$i] . ':' . $matches[2][$i] . '<br />';
}

And you get :

quote1:have you trying it off and on again ?
quote65:You wouldn't steal a helmet of a policeman

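Note: you should add some safety checks. preg_match_all returns the number of full matches (which can be 0) or false on error, so check that it did not return false and that there is at least one match before using the results.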
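A possible way to add those checks, as a sketch only: extractQuotes is a hypothetical helper name invented for this example, and PREG_SET_ORDER is used so that each match arrives as one row, which is a different ordering from the var_dump shown earlier.

```php
<?php
// extractQuotes() is a hypothetical helper, not part of PHP or the answer above.
function extractQuotes($html)
{
    $pattern = '#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#';
    $matches = array();

    // PREG_SET_ORDER groups the captures per match: $m[1] and $m[2] per row.
    $count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    // false means the regex engine itself failed, not "no matches".
    if ($count === false) {
        throw new RuntimeException(
            'preg_match_all failed, error code ' . preg_last_error()
        );
    }

    $lines = array();
    foreach ($matches as $m) {
        $lines[] = $m[1] . ':' . $m[2];
    }
    return $lines; // empty array when nothing matched
}

$str = <<<A
<table>
   <tr>
     <td>quote1</td>
     <td>have you trying it off and on again ?</td>
   </tr>
   <tr>
     <td>quote65</td>
     <td>You wouldn't steal a helmet of a policeman</td>
   </tr>
</table>
A;

$lines = extractQuotes($str);
if (count($lines) === 0) {
    echo "No rows matched\n";
} else {
    echo implode("\n", $lines), "\n";
}
```

Returning an empty array for "no match" and throwing only on a real engine error keeps the two cases apart for the caller.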

As a side note : using regex to parse HTML is generally not a really good idea ; if you can use a real parser, it should be way safer...

Pascal MARTIN