ansaurus

Question

Answer 1

+1 A:

preg_match_all('|<table.*?</table>|ms',$html,$matches, PREG_SET_ORDER);

Matt S 2010-05-26 15:22:59

Answer 2

+3 A:

Before making a decision on what to do next, I'd read this first: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

In general, it's not a good idea to parse HTML with regex.

I recommend using DOM.

You can check out the PHP Simple HTML DOM Parser as an alternative.

Main Features:

An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!

Requires PHP 5+.

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.
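As a rough sketch of the DOM approach, the snippet below uses PHP's built-in DOM extension (DOMDocument) rather than the Simple HTML DOM Parser library. The sample $html string and the helper name extract_tables are made up for illustration and are not part of the original question.

```php
<?php
// Hypothetical sample input; in practice $html would be the fetched page.
$html = "<html><body><p>intro</p>\n<table id=\"a\">\n<tr><td>ABC</td></tr>\n</table>\n"
      . "<table id=\"b\"><tr><td>DEF</td></tr></table></body></html>";

// Collect the markup of every <table> element using the DOM extension.
// extract_tables is an illustrative helper name, not a library function.
function extract_tables(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often invalid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tables = [];
    foreach ($doc->getElementsByTagName('table') as $table) {
        // saveHTML($node) serializes just that node and its children.
        $tables[] = $doc->saveHTML($table);
    }
    return $tables;
}

$tables = extract_tables($html);
echo count($tables), " table(s) found\n";
```

Newlines inside the tables need no special handling here. Note that getElementsByTagName also returns nested tables as separate entries, which may or may not be what you want.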

Robert Greiner 2010-05-26 15:23:34

Answer 3

+1 A:

Did you try the multiline modifier m?

preg_match_all('|<table.*</table>|m',$html,$matches, PREG_SET_ORDER);

prodigitalson 2010-05-26 15:23:50

@prodigitalson: Did you?

SilentGhost 2010-05-26 15:25:10

Answer 4

+3 A:

The dot does not match newlines unless the s pattern modifier is used.

preg_match_all('|<table.*?</table>|s',$html,$matches, PREG_SET_ORDER);

(Be aware that using regex to parse HTML ranks among the worst capital sins here on SO.)
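To make the difference between the modifiers concrete, here is a small comparison on a made-up multi-line input (the $html string is an assumption, not the asker's data). The m modifier only changes how ^ and $ anchor; it is the s modifier that lets the dot cross newlines.

```php
<?php
// Hypothetical input: a table whose markup spans several lines.
$html = "<p>before</p>\n<table>\n<tr><td>ABC</td></tr>\n</table>\n<p>after</p>";

// No modifier: '.' stops at newlines, so nothing matches.
$plain = preg_match_all('|<table.*?</table>|', $html, $m1, PREG_SET_ORDER);

// 'm' only changes how ^ and $ behave; '.' still stops at newlines.
$multi = preg_match_all('|<table.*?</table>|m', $html, $m2, PREG_SET_ORDER);

// 's' (dot-all) lets '.' match newlines, so the whole table is captured.
$dotall = preg_match_all('|<table.*?</table>|s', $html, $m3, PREG_SET_ORDER);

echo "none: $plain, m: $multi, s: $dotall\n"; // none: 0, m: 0, s: 1
```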

leonbloy 2010-05-26 15:23:58

Answer 5

A:

Use the s modifier so that '.' also matches newline characters, or match newlines explicitly, usually with '[\n\r]'. I haven't read it myself yet, but more information on the PCRE library is available at http://www.pcre.org/pcre.txt

Careful how you form your pattern though - long input strings with newlines mixed with misunderstood patterns can cause unexplained script failures and connection resets.
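One possible cause of such silent failures is PCRE hitting its backtracking or recursion limits on long inputs; that is an assumption about the failures described above, not something confirmed here. A minimal sketch of surfacing the problem: check the return value of preg_match_all and report preg_last_error(). The helper name find_tables and the sample input are hypothetical.

```php
<?php
// Run the pattern and surface PCRE failures instead of silently continuing.
// find_tables is an illustrative helper name, not a built-in function.
function find_tables(string $html): array
{
    $count = preg_match_all('|<table.*?</table>|s', $html, $matches, PREG_SET_ORDER);
    if ($count === false) {
        // preg_last_error() returns a PREG_*_ERROR constant,
        // for example PREG_BACKTRACK_LIMIT_ERROR.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    // With PREG_SET_ORDER, index 0 of each set is the full match.
    return array_map(function ($set) {
        return $set[0];
    }, $matches);
}

// Hypothetical sample input.
$found = find_tables("<table>\n<tr><td>ABC</td></tr>\n</table>");
echo count($found), "\n";
```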

In your case, PCRE functions don't seem to be needed here, and could cause unexpected results anyway. If you're just looking to extract contents of a single table on a page, why not just do the most basic...

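// Find the first opening tag; "<table" also matches tags with attributes.
$start = stripos($input, "<table");
$end = stripos($input, "</table>", $start);
// substr() takes a length, not an end offset; include the closing tag.
$my_table = substr($input, $start, $end - $start + strlen("</table>"));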

bob_the_destroyer 2010-05-26 20:16:12

Answer 6

A:

EDIT: I've realized that it's not right to use regex to parse HTML.

Better: You can read $html into a SimpleXML object and query it with SimpleXML's XPath support, provided the markup is well-formed XML. (Powerful and much easier to use than the DOM extension, IMHO.)

Like this:

$html = "<html><body><table id=\"mytbl\"><tr><td>ABC</td></tr><tr><td>DEF</td></tr></table></body></html>";

$xml = simplexml_load_string($html);

if ($xml) {
    foreach ($xml->xpath("/html/body/*") as $item) {
        echo $item["id"] . "<br>"; // mytbl
        foreach ($item->tr as $tr) {
            echo $tr->td . "<br>"; // 1:ABC, 2:DEF
        }
    }
}
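simplexml_load_string only accepts well-formed XML, so it fails on much real-world HTML. One way around that, sketched here on a made-up input, is to let DOMDocument::loadHTML repair the markup first and then hand the tree to SimpleXML with simplexml_import_dom.

```php
<?php
// Hypothetical input that is not well-formed XML (unclosed <br>, unquoted attribute).
$html = "<html><body><table id=mytbl><tr><td>ABC<br></td></tr>"
      . "<tr><td>DEF</td></tr></table></body></html>";

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // collect parser warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Convert the repaired DOM tree into a SimpleXML object.
$xml = simplexml_import_dom($doc);

$cells = [];
// '//td' is used under the table so the query works whatever sits between table and td.
foreach ($xml->xpath('//table[@id="mytbl"]//td') as $td) {
    $cells[] = trim((string) $td);
}
print_r($cells);
```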

Kristoffer Bohmann 2010-06-24 06:58:44


regular expression breaking on new line
