ansaurus

Question

Answer 1

A:

HTML Tidy should be capable of "correcting" broken and fragmented HTML documents, turning them into something that can be parsed with other tools.

http://devzone.zend.com/article/761

The Tidy extension is new in PHP 5, and is available from PHP version 5.0b3 upward. It is based on the TidyLib library, and allows the developer to validate, repair, and parse HTML, XHTML and XML documents from within PHP.
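As a rough sketch of what that could look like, the snippet below passes a fragment through tidy_repair_string. The fragment and the variable names are made up for illustration, and the code assumes the tidy extension may not be installed, so it checks for it first.

```php
<?php
// Hypothetical fragment with an unclosed <div>; not taken from the question.
$fragment = '<div id="double"><img src="double.gif" alt="">';

$repaired = null;
if (extension_loaded('tidy')) {
    // show-body-only: return just the repaired fragment, not a full document.
    // output-xhtml: emit well-formed XHTML so stricter parsers can read it.
    $repaired = tidy_repair_string(
        $fragment,
        ['show-body-only' => true, 'output-xhtml' => true],
        'utf8'
    );
    echo $repaired;
} else {
    echo "tidy extension not available\n";
}
```

The repaired string can then be handed to DOMDocument or any other parser.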

skaffman 2009-12-19 17:50:51

Answer 2

+1 A:

You cannot use getElementById on HTML fragments. You could try SimpleHTML instead.

From DomDocument::getElementById

For this function to work, you will need either to set some ID attributes with DOMElement::setIdAttribute or a DTD which defines an attribute to be of type ID. In the latter case, you will need to validate your document with DOMDocument::validate or DOMDocument->validateOnParse before using this function.
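To make the quoted requirement concrete, here is a minimal sketch using a small XML document without a DTD. The markup and variable names are invented for the example; it only illustrates the setIdAttribute route, not the DTD route.

```php
<?php
// In an XML document without a DTD, "id" is an ordinary attribute,
// so getElementById has nothing registered to look up.
$dom = new DOMDocument();
$dom->loadXML('<root><div id="double">content</div></root>');

$before = $dom->getElementById('double');

// Declare the "id" attribute of this <div> to be of type ID.
$div = $dom->documentElement->firstChild;
$div->setIdAttribute('id', true);

$after = $dom->getElementById('double');

var_dump($before === null);
echo $after->nodeName, "\n";
```

Doing this by hand means you already have to locate the element some other way, which is why it rarely helps for fragments.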

And since someone will mention doing it with a regular expression sooner or later, here is a pattern you could use (it accepts either quote style around the id value): /<div id=["']double["']>(.*)<\/div>/simU

In addition, you could just use regular string functions to extract the div part, e.g.

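// Keep everything from the opening tag onward, then cut after the first </div>.
// This assumes the div exists and contains no nested div elements.
$div = strstr($html, '<div id="double">');
$div = substr($div, 0, strpos($div, '</div>') + strlen('</div>'));
echo $div;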

While I agree that you should not use regular expressions or string functions to parse HTML or XML in general, I find it absolutely okay to do so as long as your only concern is getting this single div out of the fragment. Keep it simple.
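A small sketch of the regular-expression route, with a hypothetical helper name and made-up sample markup. It also shows where the approach stops being reliable: with a nested div, the lazy match ends at the first closing tag.

```php
<?php
// Hypothetical helper; the pattern accepts single or double quotes around the id.
function extractDoubleDivWithRegex(string $html): ?string
{
    // s: dot also matches newlines, U: quantifiers are lazy by default.
    $pattern = '/<div id=["\']double["\']>(.*)<\/div>/sU';
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

$flat   = '<p>a</p><div id="double"><img src="double.gif"></div><p>b</p>';
$nested = '<div id="double"><div>inner</div>tail</div>';

echo extractDoubleDivWithRegex($flat), "\n";
// prints: <div id="double"><img src="double.gif"></div>

echo extractDoubleDivWithRegex($nested), "\n";
// prints: <div id="double"><div>inner</div>   (truncated at the first </div>)
```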

Gordon 2009-12-19 17:51:51

Unless there are nested div tags of course. Regular expressions are *not* for parsing html.

troelskn 2009-12-19 18:38:01

I would agree if he were actually *parsing* that fragment, but he just wants to extract one clearly defined piece out of it. It's not like he is traversing the DOM, so I guess it's OK to treat the fragment as a string.

Gordon 2009-12-19 18:43:31

Besides, I already pointed him to SimpleHTML in the first sentence.

Gordon 2009-12-19 19:06:12

Answer 3

+3 A:

I think DOMDocument::getElementById will not work in your case. Quoting the manual:

For this function to work, you will need either to set some ID attributes with DOMElement::setIdAttribute or a DTD which defines an attribute to be of type ID.
In the latter case, you will need to validate your document with DOMDocument::validate or DOMDocument->validateOnParse before using this function.

A solution that might work is using some XPath query to extract the element you are looking for.

First of all, let's load the HTML portion, like you first did:

// Load the fragment; loadHTML wraps it in html/body elements as needed.
$dom = new DOMDocument();
$dom->loadHTML($html);
var_dump($dom->saveHTML());

The var_dump is here only to prove that the HTML portion has been loaded successfully -- judging from its output, it has.

Then, instantiate the DOMXPath class, and use it to query for the element you want to get:

$xpath = new DOMXpath($dom);
$result = $xpath->query("//*[@id = 'double']");
$keepme = $result->item(0);

We now have the element you want ;-)

But, in order to inject its HTML content in another HTML segment, we must first get its HTML content.

I don't remember any "easy" way to do that, but something like this should do the trick:

$tempDom = new DOMDocument();
$tempImported = $tempDom->importNode($keepme, true);
$tempDom->appendChild($tempImported);
$newHtml = $tempDom->saveHTML();
var_dump($newHtml);

And... we have the HTML content of your double <div>:

string '<div id="double">
<img src="http://images.example.com/double.gif" width="300" height="27" border="0" alt="" title="">
</div>
' (length=125)

Now, you just have to do whatever you want with it ;-)
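For reference, the steps above can be folded into one function. This is only a sketch: the function name and sample markup are invented, and it relies on DOMDocument::saveHTML accepting a node argument (available since PHP 5.3.6), which replaces the temporary document used above.

```php
<?php
// Hypothetical helper: returns the outer HTML of the element with the given id,
// or null if no such element exists.
function extractElementHtmlById(string $html, string $id): ?string
{
    // Collect parser complaints about sloppy markup instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: $id contains no single quote, so it can be inlined safely.
    $nodes = $xpath->query("//*[@id = '" . $id . "']");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize just this node rather than the whole document.
    return $dom->saveHTML($nodes->item(0));
}

$html = '<p>before</p><div id="double"><img src="double.gif" alt=""></div><p>after</p>';
$extracted = extractElementHtmlById($html, 'double');
echo $extracted, "\n";
```

Because XPath works on the parsed tree, nested div elements inside the target are kept intact, unlike with the string-based approaches.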

Pascal MARTIN 2009-12-19 18:14:41

Answer 4

A:

~~An XML document can only have one element at the root level. Probably, the HTML parser has a similar requirement. Try wrapping the content in a <body/> tag.~~

Seems it's something else. This page describes what may be the cause. I'd recommend that you use XPath to get the element.

troelskn 2009-12-19 18:21:25

Answer 5

A:

The fragment is HTML, but to be parsed through DOM it should be XHTML. Every open tag must be closed.

In your case it means you should replace <br> with <br /> and <img ... > with <img ... />

filippo 2009-12-19 18:26:10

That is actually not true. $dom->loadHTML("<div><img src='foo' width=150><br></div>") works just fine and doesn't fail parsing. In fact, $dom->saveXML() will show you output with properly closed tags.
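A quick way to check this yourself is to feed the same markup to both parsers. The variable names are only illustrative; the sketch assumes nothing beyond the standard DOM extension.

```php
<?php
// Collect parser errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

$markup = "<div><img src='foo' width=150><br></div>";

// The HTML parser is lenient about unclosed tags and unquoted attributes.
$htmlDoc = new DOMDocument();
$htmlOk  = $htmlDoc->loadHTML($markup);
$asXml   = $htmlDoc->saveXML();

// The XML parser is strict and rejects the same input.
$xmlDoc = new DOMDocument();
$xmlOk  = $xmlDoc->loadXML($markup);

var_dump($htmlOk, $xmlOk);
echo $asXml;
```

So the XHTML requirement applies to loadXML, not to loadHTML.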

Artem Russakovskii 2010-02-18 22:50:56

It all depends on the library you use. In Python, xml.dom.minidom.parseString("<br>") raises an exception, while xml.dom.minidom.parseString("<br />") works. I'd prefer to have the input in the correct format in the first place rather than rely on a library to parse incorrect input the way I expect.

filippo 2010-02-19 08:55:55

How do I parse partial HTML?