ansaurus

Question

PHP DOMDocument - get html source of BODY

Answer 1

+2 A:

In your case, you do not want to work with an HTML document, but with an HTML fragment (a portion of HTML code), which means DOMDocument is not quite what you need.

Instead, I would rather use something like HTMLPurifier (quoting) :

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

And, if you try your portion of code :

<div><p>Hello World

Using the demo page of HTMLPurifier, you get this clean HTML as an output :

<div><p>Hello World</p></div>

Much better, isn't it ? ;-)

(Note that HTMLPurifier supports a wide range of options, and that taking a look at its documentation might not hurt.)

Pascal MARTIN 2010-02-27 00:21:12

There's good information here, but I'd argue that DOMDocument is still a legitimate tool for this. The existence of a "loadHTML" method implies that DOMDocument is meant for parsing HTML documents as well as XML documents. HTMLPurifier or other "true" HTML parsers written in PHP are great, but their performance is always going to pale when compared to built-in PHP objects.

Alan Storm 2010-02-27 00:56:18

@Alan: I agree that DOMDocument is great when it comes to parsing HTML documents, but for HTML fragments, especially **user-submitted** ones, I believe HTMLPurifier is a better tool: it was created exactly for the purpose of filtering user-submitted HTML, including from a security point of view *(for instance, DOMDocument doesn't care about XSS, while HTMLPurifier does; DOMDocument doesn't allow you to specify which tags/attributes should be allowed, while HTMLPurifier does)*.
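
To illustrate the XSS point, here is a minimal sketch using only the built-in DOM extension. The input string and variable names are made up for the example. It shows that DOMDocument repairs the markup but does not filter it, so a script element and an inline event handler are still present after a round trip:

```php
<?php
// Hypothetical user-submitted fragment with an inline handler and a script.
$dirty = '<div onclick="alert(1)"><script>alert(2)</script><p>Hello World';

// Collect parser warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($dirty);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serialize the <body> element that loadHTML() wrapped around the fragment.
$body = $dom->getElementsByTagName('body')->item(0);
$html = $body !== null ? $dom->saveHTML($body) : '';

// Check whether the unsafe parts are still present in the serialized output.
$keepsScript  = strpos($html, '<script') !== false;
$keepsHandler = strpos($html, 'onclick') !== false;
var_dump($keepsScript, $keepsHandler);
```

Anything that must be safe to echo back to other users would still need a separate filtering step.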

Pascal MARTIN 2010-02-27 09:45:46

Answer 2

+1 A:

The quick solution to your problem is to use an XPath expression to grab the body.

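$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
// Select the <body> element that loadHTML() wraps around the fragment
$body = $xpath->query('/html/body');
// Passing a node to saveXML() serializes only that node, <body> tags included
echo $dom->saveXML($body->item(0));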

A word of warning here. Sometimes loadHTML will throw a warning when it encounters certain poorly formed HTML documents. If you're parsing those kinds of HTML documents, you'll need to find a better HTML parser [self link warning].
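
If the warnings are the only obstacle, one possible approach is to have libxml collect them instead of emitting them, and to serialize the children of the body rather than the body itself. This is a sketch under a few assumptions: it uses `libxml_use_internal_errors()`, and `saveHTML()` with a node argument, which needs PHP 5.3.6 or later; the variable names are just for illustration.

```php
<?php
// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');

// The collected errors can be inspected here, then discarded.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serialize each child of <body>, so the <body> tags themselves are omitted.
$body = $dom->getElementsByTagName('body')->item(0);
$innerHtml = '';
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $innerHtml .= $dom->saveHTML($child);
    }
}

echo $innerHtml, PHP_EOL;
```

Using `saveHTML()` rather than `saveXML()` keeps HTML serialization rules, for example for void elements such as `<br>`.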

Alan Storm 2010-02-27 00:52:34

Answer 3

A:

Faced with the same problem, I've created a wrapper around DOMDocument called SmartDOMDocument to overcome this and some other shortcomings (such as encoding problems).

You can find it here: http://beerpla.net/projects/smartdomdocument

Artem Russakovskii 2010-03-12 10:01:18
