ansaurus

Question

php regex for html

Answer 1

A:

As usual, extracting text from HTML and other non-regular languages should be done with a parser; regexes break easily on nested tags, attributes, or unexpected whitespace. But if you're certain of your data's structure, you could use

%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%

to find the two pieces of text. \1:\2 would then be the replacement.

If the text cannot span more than one line, you'd be safer dropping the (?s) bits...
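
A minimal sketch of how that pattern and replacement might be applied in PHP with preg_replace. The sample markup and variable names are assumptions made for illustration; PHP accepts $1:$2 as an equivalent spelling of \1:\2.

```php
<?php
// Hypothetical sample markup: one row with exactly two cells (an assumption).
$html = "<tr>\n  <td>quote1</td>\n  <td>have you tried turning it off and on again?</td>\n</tr>";

// (?s) inside each group lets "." match newlines within that group only.
$pattern = '%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%';

// '$1:$2' joins the two captured cell texts with a colon.
$joined = preg_replace($pattern, '$1:$2', $html);

echo $joined, "\n";
```

Only the matched pair of cells is rewritten; the surrounding <tr> tags are left in place, so strip them separately if they are not wanted.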

Tim Pietzcker 2009-07-19 20:20:39

Answer 2

A:

Don't use regex; use an HTML parser, such as the PHP Simple HTML DOM Parser.

Peter Boughton 2009-07-19 20:28:10

Answer 3

+3 A:

Tim's regex probably works, but you may want to consider using the DOM functionality of PHP instead of regex, as it may be more reliable in dealing with minor changes in the markup.

See the DOMDocument::loadHTML method.
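
A rough sketch of what the DOM approach could look like, assuming the markup is a table whose rows each hold two cells. The sample data and the $pairs variable are illustrative only, and the libxml error handling is just one way to keep parser warnings quiet.

```php
<?php
// Assumed sample markup: each <tr> holds a label cell and a text cell.
$html = '<table>
  <tr><td>quote1</td><td>have you trying it off and on again ?</td></tr>
  <tr><td>quote65</td><td>You wouldn\'t steal a helmet of a policeman</td></tr>
</table>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$pairs = array();
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    // Skip rows that do not have at least two cells.
    if ($cells->length >= 2) {
        $pairs[] = trim($cells->item(0)->textContent) . ':' . trim($cells->item(1)->textContent);
    }
}

print_r($pairs);
```

Because the parser works on the element tree, extra attributes on the cells or different whitespace between tags should not change the result.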

Jani Hartikainen 2009-07-19 20:30:32

Answer 4

+1 A:

If you really want to use regexes (which might be OK if you are absolutely sure your string will always be formatted like that), what about something like this, in your case:

$str = <<<A
<table>
   <tr>
     <td>quote1</td>
     <td>have you trying it off and on again ?</td>
   </tr>
   <tr>
     <td>quote65</td>
     <td>You wouldn't steal a helmet of a policeman</td>
   </tr>
</table>
A;

$matches = array();
preg_match_all('#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#', $str, $matches);

var_dump($matches);

A few words about the regex :

<tr>
then one or more whitespace characters
then <td>
then what you want to capture
then </td>
and the same again
and finally, </tr>

And I use :

? in the regex to match in non-greedy mode
preg_match_all to get all the matches
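
As a small aside, the effect of the non-greedy ? can be seen by running the same capture with and without it. This is only an illustrative sketch; the $row string and the $greedy and $lazy variable names are made up for the comparison.

```php
<?php
// Hypothetical single-line row with two cells.
$row = '<td>quote1</td><td>text</td>';

// Greedy: .* runs to the LAST </td>, swallowing both cells.
preg_match('#<td>(.*)</td>#', $row, $greedy);

// Non-greedy: .*? stops at the FIRST </td>.
preg_match('#<td>(.*?)</td>#', $row, $lazy);

echo $greedy[1], "\n"; // quote1</td><td>text
echo $lazy[1], "\n";   // quote1
```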

You then get the results you want in $matches[1] and $matches[2] (not $matches[0], which holds the full matches); here's the output of the var_dump I used (I've removed entry 0 to make it shorter):

array
  0 => 
    ...
  1 => 
    array
      0 => string 'quote1' (length=6)
      1 => string 'quote65' (length=7)
  2 => 
    array
      0 => string 'have you trying it off and on again ?' (length=37)
      1 => string 'You wouldn't steal a helmet of a policeman' (length=42)

You then just need to manipulate this array, with some string concatenation or the like; for instance, like this:

$num = count($matches[1]);
for ($i=0 ; $i<$num ; $i++) {
    echo $matches[1][$i] . ':' . $matches[2][$i] . '<br />';
}

And you get :

quote1:have you trying it off and on again ?
quote65:You wouldn't steal a helmet of a policeman

Note: you should add some sanity checks (for example, preg_match_all returns the number of matches or false on error, so it must not return false, and the count must be at least 1).
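
One way those checks might look, as a sketch only: extractPairs() is a hypothetical helper name, and returning an empty array on failure is an assumption about how the caller wants errors handled.

```php
<?php
// extractPairs() is a hypothetical helper, not part of PHP.
function extractPairs($html)
{
    $pattern = '#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#';
    $count = preg_match_all($pattern, $html, $matches);

    // false signals a regex error; 0 means no row matched.
    if ($count === false || $count === 0) {
        return array();
    }

    $pairs = array();
    for ($i = 0; $i < $count; $i++) {
        $pairs[] = $matches[1][$i] . ':' . $matches[2][$i];
    }
    return $pairs;
}

// Usage with made-up inputs: one matching row, then no table at all.
$ok = extractPairs("<tr>\n<td>quote1</td>\n<td>text</td>\n</tr>");
$none = extractPairs('<p>no table here</p>');

print_r($ok);
print_r($none);
```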

As a side note: using regex to parse HTML is generally not a good idea; if you can use a real parser, it should be much safer.

Pascal MARTIN 2009-07-19 20:31:03
