I'm trying to use a regular expression as below:

preg_match_all('|<table.*</table>|',$html,$matches, PREG_SET_ORDER);

But this is not working, and I think the problem is the newlines inside the string $html.
Could someone suggest a workaround?


EDIT: I've realized that it's not right to use regex to parse HTML. Thanks to those who told me. :)

+1  A: 
preg_match_all('|<table.*?</table>|ms',$html,$matches, PREG_SET_ORDER);
Matt S
+3  A: 

Before making a decision on what to do next, I'd read this first: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

In general, it's not a good idea to parse HTML with regex.

I recommend using DOM.
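As a rough illustration of the DOM route, here is a minimal sketch that uses PHP's bundled DOM extension to pull out every table. The sample $html string and the $tables variable are made up for this example, and it assumes the extension is enabled.

```php
<?php
// Sample input invented for this illustration: two tables, one spanning several lines.
$html = "<html><body><p>intro</p>\n<table id=\"a\">\n<tr><td>ABC</td></tr>\n</table>\n"
      . "<table id=\"b\"><tr><td>DEF</td></tr></table></body></html>";

$doc = new DOMDocument();
// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$tables = [];
foreach ($doc->getElementsByTagName('table') as $table) {
    // saveHTML($node) serializes just that element, including its children.
    $tables[] = $doc->saveHTML($table);
}

echo count($tables), " table(s) found\n";
```

Newlines inside the markup make no difference here, because the parser works on elements rather than on lines of text.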

You can check out the PHP Simple HTML DOM Parser as an alternative.

Main Features:

  • An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.
Robert Greiner
+1  A: 

Did you try the multiline modifier m?

preg_match_all('|<table.*</table>|m',$html,$matches, PREG_SET_ORDER);
prodigitalson
@prodigitalson: Did you?
SilentGhost
+3  A: 

The dot does not match newlines unless the s pattern modifier is used.

preg_match_all('|<table.*?</table>|s',$html,$matches, PREG_SET_ORDER);
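To see what each modifier actually changes, here is a small sketch that runs the patterns from this thread against the same input. The sample $html and the variable names are invented for the illustration.

```php
<?php
// Sample input invented for this illustration: two tables, each spanning several lines.
$html = "<table>\n<tr><td>A</td></tr>\n</table>\n<p>between</p>\n"
      . "<table>\n<tr><td>B</td></tr>\n</table>";

// No modifier: "." stops at newlines, so nothing matches.
$none = preg_match_all('|<table.*</table>|', $html, $m0, PREG_SET_ORDER);

// "m" only changes how ^ and $ behave; "." still does not cross newlines.
$multi = preg_match_all('|<table.*</table>|m', $html, $m1, PREG_SET_ORDER);

// "s" with a greedy ".*": one match running from the first <table to the last </table>.
$greedy = preg_match_all('|<table.*</table>|s', $html, $m2, PREG_SET_ORDER);

// "s" with a lazy ".*?": one match per table.
$lazy = preg_match_all('|<table.*?</table>|s', $html, $m3, PREG_SET_ORDER);

printf("none=%d m=%d greedy=%d lazy=%d\n", $none, $multi, $greedy, $lazy);
// prints "none=0 m=0 greedy=1 lazy=2"
```

Note that the lazy version still breaks on nested tables, since it stops at the first closing tag it finds.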

(Be aware that using regex to parse HTML ranks among the worst capital sins here on SO.)

leonbloy
A: 

Use the /s flag to make '.' match newline characters as well, or match newline characters explicitly, usually with '[\n\r]'. I haven't read it myself yet, but there is more information on the PCRE library at http://www.pcre.org/pcre.txt

Careful how you form your pattern though - long input strings with newlines mixed with misunderstood patterns can cause unexplained script failures and connection resets.
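One way to make such failures less mysterious is to check PCRE's error state after the call. The sketch below is only an illustration: find_tables() is a hypothetical helper name, not something from this thread or from PHP itself, while preg_last_error() and the PREG_* constants are standard.

```php
<?php
// Hypothetical helper (invented for this example): run the pattern and
// raise an exception instead of silently returning nothing on a PCRE failure.
function find_tables($html)
{
    $count = preg_match_all('|<table.*?</table>|s', $html, $matches, PREG_SET_ORDER);
    if ($count === false || preg_last_error() !== PREG_NO_ERROR) {
        // For example PREG_BACKTRACK_LIMIT_ERROR when pcre.backtrack_limit is exceeded.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches;
}

// Sample input invented for this illustration.
$found = find_tables("<table>\n<tr><td>A</td></tr>\n</table>");
echo count($found), "\n";
```

With a check like this, an oversized or pathological input shows up as an explicit error code rather than as an empty result.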

In your case, PCRE functions don't seem to be needed, and they could cause unexpected results anyway. If you're just looking to extract the contents of a single table on a page, why not just do the most basic...

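// Search for "<table" so that opening tags with attributes are found too.
$start = stripos($input, "<table");
$end = stripos($input, "</table>", $start);
// substr() takes a length, not an end offset; include the closing tag.
$my_table = substr($input, $start, $end - $start + strlen("</table>"));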
bob_the_destroyer
A: 

EDIT: I've realized that it's not right to use regex to parse HTML.

Better: You can read $html into a SimpleXML object and query it with SimpleXML's XPath support. (Powerful and much easier to use than the DOM extension IMHO.)

Like this:

$html = "<html><body><table id=\"mytbl\"><tr><td>ABC</td></tr><tr><td>DEF</td></tr></table></body></html>";

$xml = simplexml_load_string($html);

if ($xml) {
    foreach ($xml->xpath("/html/body/*") as $item) {
        echo $item["id"] . "<br>"; // mytbl
        foreach ($item->tr as $tr) {
            echo $tr->td . "<br>"; // first ABC, then DEF
        }
    }
}
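One caveat: simplexml_load_string() expects well-formed XML, and real-world HTML often is not. A possible workaround, sketched here under that assumption, is to let DOMDocument's more forgiving HTML parser read the page and then hand the result to SimpleXML with simplexml_import_dom(). The sample $html (with deliberately unclosed tags) and the $ids variable are invented for this illustration.

```php
<?php
// Sample input invented for this illustration: note the unclosed <br> and second <td>.
$html = "<html><body><table id=\"mytbl\"><tr><td>ABC<br></td></tr>"
      . "<tr><td>DEF</tr></table></body></html>";

// Collect parser warnings about the sloppy markup instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Convert the repaired DOM tree into a SimpleXMLElement rooted at <html>.
$xml = simplexml_import_dom($doc);

$ids = [];
foreach ($xml->xpath('//table') as $table) {
    $ids[] = (string) $table['id'];
}

echo implode(', ', $ids), "\n";
```

This keeps the convenient SimpleXML and XPath syntax while leaving the cleanup of invalid markup to libxml.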
Kristoffer Bohmann