How can I store in a string everything a web page contains between its <body> and </body> tags?

  • the page can be any HTML/XHTML document
  • it can use any encoding (ISO-8859-x, UTF-8, an Asian charset)
  • the <body> tag can carry attributes (which may trip up a naive parser)

I've heard about DOMDocument, but I'm a complete beginner, so a code sample would help!

+1  A: 
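$d = new DOMDocument();
// Collect libxml parse errors internally instead of emitting warnings on malformed HTML.
libxml_use_internal_errors(true);
$d->loadHTMLFile("http://stackoverflow.com");
$b = $d->getElementsByTagName("body")->item(0);
if ($b !== null) {
    // Serialize the <body> element, including its own tags.
    echo simplexml_import_dom($b)->asXML();
}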

This will also include the <body> tag, and the content will have been modified to be well-formed XML.

To drop the body tags (the output then has no single root element, so it is no longer well-formed XML):

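$d = new DOMDocument();
// Collect libxml parse errors internally instead of emitting warnings on malformed HTML.
libxml_use_internal_errors(true);
$d->loadHTMLFile("http://stackoverflow.com");
$b = $d->getElementsByTagName("body")->item(0);
if ($b !== null) {
    // Walk the direct children of <body> and serialize each one.
    // Caveat: simplexml_import_dom() accepts only element nodes, so text or
    // comment children raise "Invalid Nodetype to import".
    for ($n = $b->firstChild; $n !== null; $n = $n->nextSibling) {
        echo simplexml_import_dom($n)->asXML();
    }
}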
Artefacto
How about stripping off body tags?
Riccardo
@Ric I've edited.
Artefacto
Great! Now testing. Thanks!
Riccardo
Artefacto, try the code you suggested with this page: http://www.temple.edu/cs/web/sampleweb.html I got some errors: Warning: simplexml_import_dom() [function.simplexml-import-dom]: Invalid Nodetype to import
Riccardo
@Ric Good point, the second snippet may not work. The question is, what do you need this for? Manipulating the DOM structure doesn't suit your needs?
Artefacto
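One way to sidestep the "Invalid Nodetype to import" warning reported above is to skip SimpleXML and let DOMDocument serialize each child of <body> itself. The following is only a minimal sketch, not something posted in the thread: bodyInnerHtml() is a hypothetical helper name, it takes the HTML as a string rather than fetching a URL, and it assumes PHP 7.1 or later (DOMDocument::saveHTML() has accepted a node argument since PHP 5.3.6).

```php
<?php
// Hypothetical helper (not from the thread): returns the markup found
// between <body> and </body>, or null when no <body> element is present.
function bodyInnerHtml(string $html): ?string
{
    $doc = new DOMDocument();
    // Keep libxml quiet about malformed markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // saveHTML($node) serializes any node type (elements, text, comments),
    // unlike simplexml_import_dom(), which accepts element nodes only.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Sample input: a <body> with attributes, mixed elements, bare text and a comment.
$sample = '<html><body class="x"><p>Hello <b>world</b></p>plain text<!-- note --></body></html>';
echo bodyInnerHtml($sample), "\n";
```

The HTML tags inside the body are kept and the <body> tags themselves are dropped, because only the child nodes are serialized. The parser may still normalize malformed markup, so the result is not guaranteed to be byte-identical to the source.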
Basically, I'm trying to load ANY possible web page, well-formed or malformed. Once it's loaded, extracting the content between the BODY tags into a string would do.
Riccardo
Forgot to mention: keep all the HTML tags in the body!
Riccardo
Basically: wipe out what's before and after the BODY tags, and the tags themselves as well, while keeping the correct charset and also handling Japanese/Chinese charsets.
Riccardo
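For the charset requirement, one commonly used workaround is to convert the page to UTF-8 before parsing and to prefix it with an XML encoding declaration so that libxml reads the bytes as UTF-8. This is a sketch under assumptions, not the thread's solution: loadHtmlAsUtf8() is a hypothetical helper, the caller is assumed to already know the page's charset (for example from the HTTP Content-Type header), the iconv extension is assumed to be available, and a conflicting <meta> charset inside the page is not handled here.

```php
<?php
// Hypothetical helper (not from the thread): parses $html, whose bytes are
// encoded in $charset, and returns the document, or null if conversion fails.
function loadHtmlAsUtf8(string $html, string $charset): ?DOMDocument
{
    // Re-encode to UTF-8; iconv() returns false when the conversion fails.
    $utf8 = iconv($charset, 'UTF-8', $html);
    if ($utf8 === false) {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Workaround: the leading declaration tells libxml the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// "caf\xE9" is the word "café" encoded as ISO-8859-1.
$doc = loadHtmlAsUtf8("<html><body><p>caf\xE9</p></body></html>", 'ISO-8859-1');
if ($doc !== null) {
    $body = $doc->getElementsByTagName('body')->item(0);
    // DOM strings are UTF-8 internally, whatever the source encoding was.
    echo $body->textContent, "\n";
}
```

The same call should work for Japanese or Chinese pages by passing the matching iconv charset name (for example Shift_JIS, EUC-JP or GB2312), provided the local iconv build supports it.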
A: 

Found this solves the problem!

Riccardo