ansaurus

Question

preg_match a part of an HTML file : find X and (maybe X or not) until the end of HTML file

Answer 1

+1 A:

Use a DOM library and do something like this:

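// Real-world HTML is rarely valid, so collect libxml warnings quietly instead of printing them.
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($htmlString);
libxml_clear_errors();
$content = $d->getElementById('content');

// getElementById() returns null when no element has that id.
$inside = $content !== null ? innerHTML($content) : '';
var_dump($inside);

// Serialize the children of $node (its "inner HTML") as a string.
// Passing a node to saveHTML() needs PHP 5.3.6 or later.
function innerHTML(DOMNode $node) {
  $html = '';
  foreach ($node->childNodes as $child) {
    $html .= $node->ownerDocument->saveHTML($child);
  }
  return $html;
}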

meder 2010-08-21 05:45:45

Thanks for your answer. I don't know if it would work for what I need. Basically, I have a script with a list of "commands" in an array (for example: get everything between the FIRST <h1> of the HTML page and the 3rd </h2> of the page).

Rock 2010-08-21 14:18:02

Then I keep the result in a variable or a database. I don't care if the code inside is not perfect or anything; I just need the information inside.

Rock 2010-08-21 14:21:21
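For what it's worth, a "command" like the one described above can be sketched with DOM and XPath rather than a regex. This is only a rough sketch under one big assumption: the first <h1> and the three <h2> elements are siblings under the same parent. The function name sliceBetweenHeadings is made up for illustration, and if a third <h2> is never found the loop simply returns everything up to the last sibling.

```php
<?php
// Hypothetical helper (not from the thread): returns the markup from the
// first <h1> through the third <h2> that follows it, inclusive.
// Assumption: all of these headings are siblings under one parent element.
function sliceBetweenHeadings(string $html): string {
    libxml_use_internal_errors(true);   // tolerate sloppy markup quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // (//h1)[1] is the first <h1> in document order.
    $start = $xpath->query('(//h1)[1]')->item(0);
    if ($start === null) {
        return '';
    }

    $out = '';
    $h2Seen = 0;
    // Walk the siblings in order and stop after the third <h2>.
    for ($node = $start; $node !== null; $node = $node->nextSibling) {
        $out .= $doc->saveHTML($node);
        if ($node->nodeName === 'h2' && ++$h2Seen === 3) {
            break;
        }
    }
    return $out;
}
```

The result can then be stored in a variable or a database as-is. If the headings live at different nesting levels, this sibling walk is not enough and you would need to compare nodes in document order instead.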

Answer 2

+1 A:

Regular expressions are not, and never will be, the right tool for this job, and "I have to use regular expressions" is not true. There is computer science theory to explain this: classical regular expressions can only match regular languages, but HTML (or XML) lets elements nest to arbitrary depth, which a regular language cannot describe.
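To make the nesting problem concrete, here is a small sketch (the sample markup is invented for illustration). The non-greedy pattern stops at the first </div> it sees, which closes the inner element, while a parser keeps track of which closing tag belongs to which opening tag.

```php
<?php
// Invented sample markup: a div nested inside the div we want.
$html = '<div id="content"><div class="inner">nested</div><p>tail</p></div>';

// Regex attempt: (.*?) ends at the FIRST </div>, so the match is cut short.
preg_match('~<div id="content">(.*?)</div>~s', $html, $m);
$viaRegex = $m[1];

// DOM attempt: the parser knows where #content really ends.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$viaDom = '';
$content = $doc->getElementById('content');
foreach ($content->childNodes as $child) {
    $viaDom .= $doc->saveHTML($child);
}

var_dump($viaRegex); // missing the <p>tail</p> part
var_dump($viaDom);   // contains both the inner div and the paragraph
```

Making the pattern greedy would happen to work on this sample, but it then overshoots as soon as another </div> appears later in the page.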

Another solution for you, besides the DOM approach in @meder's answer, is XSLTProcessor. XSLT is a declarative pattern-matching language, like regular expressions, but it is capable of matching the hierarchical structure of XHTML or XML.
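A minimal sketch of what that might look like, assuming PHP's xsl extension is enabled. The function name headingsViaXslt and the stylesheet are made up for illustration (this is not the example from the linked answer); it just pulls the text of every <h2> out of a page, one per line.

```php
<?php
// Hypothetical helper: list the text of all <h2> elements using XSLT.
// Assumption: the xsl extension (XSLTProcessor) is installed and enabled.
function headingsViaXslt(string $html): string {
    $stylesheet = <<<'XSL'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//h2">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

    // Parse the (possibly sloppy) HTML into a DOM tree.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // The stylesheet itself is well-formed XML.
    $xslDoc = new DOMDocument();
    $xslDoc->loadXML($stylesheet);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xslDoc);
    return (string) $proc->transformToXml($doc);
}
```

The select="//h2" part is an XPath expression, so the same idea extends to matching by position or by ancestor, which is where XSLT has the edge over a flat pattern.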

See the answers to "Simple XML parsing on PHP" for more solutions, including an example of XSLTProcessor in my answer.

If you want to learn all about HTML scraping techniques in PHP, there's a book on the subject by Matthew Turland, titled "php|architect's Guide to Web Scraping with PHP". It's available in digital form now, and should be in print soon.

If you can pry yourself away from PHP for a moment, try a Python package called Beautiful Soup. This package has one huge advantage: unlike DOM/XSLT parsers, Beautiful Soup doesn't choke if you direct it to parse an HTML page that has some bad markup. Since most web sites you will be scraping probably contain some mistakes, this is a pretty important advantage.

Bill Karwin 2010-08-24 15:21:23

Thank you very much for your answers, really appreciated! I'll be digging into both Meder's and Bill's answers, and I guess I'll do some learning (don't we ever reach the end of that, so we can just sit back and relax?) ;) Thanks again!

Rock 2010-08-25 13:29:10

Heh! Name any career where you can just sit back and relax without continually learning. Okay, FOX News morning host. But besides that one.

Bill Karwin 2010-08-25 18:15:24
