Hi to all,

This is driving me nuts! A little piece of code that I can't seem to debug :( Basically, I have an HTML file in a string, and I want to grab everything from X up to the next X (same value) if there is another one. If there isn't, I want to grab everything from X to the end of the file.

The code that doesn't work:

$contents = '< div id="main" class="clearfix">    < div id="col-1">< div id="content">< div id="p19601634">< h1>< span id="ppt19601634">';
$regex = '!<div id="content">(.*?)(?:<div id="content">)!s';
preg_match_all($regex, $contents, $matches);

Please note that I added spaces before each DIV for display purposes, and that the match also has to work across NEW LINES and TABS inside the HTML (basically, there is a line break after the first DIV).

Right now, my code works if it finds several occurrences of my search, and it returns the matches. But if only one occurrence is found, it doesn't work.

Does anyone know how to fix this?

Thanks a bunch

+1  A: 

Use a DOM library and do something like this:

// Parse the HTML string and look up the element with id="content"
$d = new DOMDocument();
$d->loadHTML($htmlString);
$content = $d->getElementById('content');

$inside = innerHTML( $content );
var_dump($inside);

// Serialize the children of $node by copying them into a fresh document
function innerHTML($node){
  $doc = new DOMDocument();
  foreach ($node->childNodes as $child)
    $doc->appendChild($doc->importNode($child, true));

  return $doc->saveHTML();
}
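
The same idea can be sketched with a few guards added. This is only an illustration, not part of the original answer: the helper name innerHtmlById and the sample markup are made up, and it assumes that libxml's parse warnings for sloppy markup should be silenced and that a missing id should yield null rather than an error.

```php
<?php
// Hypothetical helper: returns the markup inside the element with the given id,
// or null when the document has no such element.
function innerHtmlById(string $html, string $id): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementById($id);
    if ($node === null) {
        return null;
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() accepts a node and serializes only that subtree.
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Made-up sample input with an unclosed <p>, to mimic imperfect markup.
$html = '<div id="main"><div id="content"><h1><span>Title</span></h1><p>Body</div></div>';
var_dump(innerHtmlById($html, 'content'));
var_dump(innerHtmlById($html, 'missing'));
```

Passing the child node to saveHTML() avoids building a second document, which is the main difference from the function above.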
meder
Thanks for your answer. I don't know if it would work for what I need. Basically, I have a script with a list of "commands" in an array (example: get everything between the FIRST <h1> of the HTML page and the 3rd </h2> of the page).
Rock
Then I keep the result in a variable or a database. I don't care if the code inside is not perfect or anything, I just need the information inside.
Rock
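
For the marker-to-marker extraction described in these comments and in the question, the original pattern can only match when a second marker exists, because it requires one after the lazy group. One possible sketch, which is not from the thread, is to end the match with a lookahead that accepts either the next marker or the end of the string. The function name segmentsAfter and the sample strings are hypothetical.

```php
<?php
// Hypothetical helper: returns the text that follows each occurrence of
// $marker, up to the next occurrence or the end of the string.
function segmentsAfter(string $html, string $marker): array
{
    $m = preg_quote($marker, '!');
    // (?=...|\z) is a lookahead: it stops the lazy match at the next marker
    // without consuming it, or at the very end of the input.
    // The s modifier lets "." match new lines as well.
    $regex = '!' . $m . '(.*?)(?=' . $m . '|\z)!s';
    preg_match_all($regex, $html, $matches);
    return $matches[1];
}

// Made-up inputs: one with a single marker, one with two.
$one = "<div id=\"content\">\n\t<h1>Only one</h1>";
$two = '<div id="content">A<div id="content">B';

var_dump(segmentsAfter($one, '<div id="content">'));
var_dump(segmentsAfter($two, '<div id="content">'));
```

This treats the HTML as plain text, so it knows nothing about nesting or closing tags. That may be acceptable when only the raw text between two markers is needed.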
+1  A: 

Regular expressions are not and never will be the right tool for this job. "I have to use regular expressions" is not true. There is computer science theory to explain this: regular expressions can only match regular languages, and HTML (or XML), with its arbitrarily nested tags, is a more sophisticated language than that.

Another solution for you, besides the DOM approach in @meder's answer, is XSLTProcessor. XSLT is a declarative pattern-matching language, like regular expressions, but it is capable of matching the hierarchical structure of XHTML or XML.
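
A minimal sketch of what that could look like, assuming PHP's xsl extension is installed and that the input is well-formed XHTML (many real pages are not). The stylesheet and the sample document below are made up for illustration.

```php
<?php
// Made-up, well-formed sample document.
$xhtml = <<<XML
<html><body>
<div id="main"><div id="content"><h1>Title</h1><p>Body</p></div></div>
</body></html>
XML;

// Illustrative stylesheet: select the div with id="content", copy its children.
$xsl = <<<XSL
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <!-- Start at the root and visit only the element with id="content" -->
  <xsl:template match="/">
    <xsl:apply-templates select="//div[@id='content']"/>
  </xsl:template>
  <!-- Copy that element's child nodes verbatim -->
  <xsl:template match="div[@id='content']">
    <xsl:copy-of select="node()"/>
  </xsl:template>
</xsl:stylesheet>
XSL;

$doc = new DOMDocument();
$doc->loadXML($xhtml);

$style = new DOMDocument();
$style->loadXML($xsl);

$proc = new XSLTProcessor();
$proc->importStylesheet($style);

// transformToXml() returns the transformation result as a string.
$result = $proc->transformToXml($doc);
var_dump($result);
```

The selection rule lives in the XPath expression //div[@id='content'], so a different "command" would mostly mean a different select or match attribute.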

See the answers in "Simple XML parsing on PHP" for more solutions, including an example of XSLTProcessor in my answer.

If you want to learn all about HTML scraping techniques in PHP, there's a book on the subject by Matthew Turland, titled "php|architect's Guide to Web Scraping with PHP". It's available in digital form now, and should be in print soon.

If you can pry yourself away from PHP for a moment, try a package called Beautiful Soup. This package has one huge advantage: unlike DOM/XSLT parsers, Beautiful Soup doesn't choke if you direct it to parse an HTML page that has some bad markup. Since most web sites you will be scraping probably contain some mistakes, this is a pretty important advantage.

Bill Karwin
Thank you very much for your answers, really appreciated! I'll be digging into both Meder's and Bill's answers, and I guess I'll do some learning (do we ever reach the end of that, so we can just sit back and relax?) ;) Thanks again!
Rock
Heh! Name any career where you can just sit back and relax without continually learning. Okay, FOX News morning host. But besides that one.
Bill Karwin