I asked this question yesterday (http://stackoverflow.com/questions/2571232/parse-html-with-phps-html-domdocument), and at the time the answer was just what I needed, but while working with some live data I discovered that it wasn't quite doing what I expected.

It gets the data from the HTML page, but it also strips out all the HTML tags inside the captured block of text, which isn't what I want. (I might want to take some of the tags out, but not all, and this can be done later.)

+1  A: 

That's a common problem with DOM: nodeValue only gives you the text, so you have to do a bit more work if you want the content of a tag together with the markup of all its children.

Basically, you have to loop over the child nodes of the one you've matched with your XPath query, to get their contents.
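
To make the difference concrete, here is a minimal sketch. The one-paragraph sample HTML and the variable names are my own assumptions, not the markup from the question. It shows that nodeValue flattens the matched element to plain text, while its child nodes are still real DOM nodes that can be serialized with their tags.

```php
<?php
// Hypothetical sample markup, chosen only for this illustration
$dom = new DOMDocument();
$dom->loadHTML('<div class="text"><p>Capture this <strong>text</strong></p></div>');

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="text"]')->item(0);

// nodeValue concatenates the text of all descendants, so the markup is lost
$textOnly = $div->nodeValue;                              // "Capture this text"

// The children are still DOM nodes; here the only child is the <p> element
$firstChildName = $div->childNodes->item(0)->nodeName;    // "p"

var_dump($textOnly, $firstChildName);
```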

There is a solution proposed in one of the user notes on the manual page of the DOMElement class -- see this note (its URL is in the code comment below).


Integrating this solution into the code you already have should give you something like the following. First, the declaration of the HTML string, with sub-tags:

$html = <<<HTML
<div class="main">
    <div class="text">
        <p>
            Capture this <strong>text</strong> <em>1</em>
        </p>
        <p>
            And some other <strong>text</strong>
        </p>
    </div>
</div>
HTML;


And, to extract the data from that HTML string, you can use something like this:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

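    // see http://fr.php.net/manual/en/class.domelement.php#86803
    // Serialize each child node (elements and text nodes alike) on its own
    $children = $tag->childNodes;
    foreach ($children as $child) {
        // A node belongs to one document, so import a deep copy into a scratch document
        $tmp_doc = new DOMDocument();
        $tmp_doc->appendChild($tmp_doc->importNode($child, true));
        $innerHTML .= $tmp_doc->saveHTML();
    }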

    var_dump(trim($innerHTML));
}

The only thing that has changed is the body of the foreach loop: instead of just using $tag->nodeValue, you have to iterate over the child nodes and serialize each one.


Which gives me the following output:

string '<p>
            Capture this <strong>text</strong> <em>1</em>
        </p>


<p>
            And some other <strong>text</strong>
        </p>' (length=150)

Which is the full content of the <div> tag that was matched, and all its children -- including the tags.
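
If you are on PHP 5.3.6 or later, DOMDocument::saveHTML() accepts an optional node argument, so the temporary document is not strictly needed. The following is only a sketch of that variant: the helper name innerHTML and the shortened sample markup are my own assumptions, and the whitespace in the result may differ slightly from the output above.

```php
<?php
/**
 * Hypothetical helper (the name is mine, not part of the DOM API):
 * returns the serialized children of $node, markup included.
 */
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Passing a node to saveHTML() serializes just that node and its subtree
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical sample markup, shortened from the example above
$dom = new DOMDocument();
$dom->loadHTML('<div class="text"><p>Capture this <strong>text</strong></p><p>And <em>more</em></p></div>');

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[@class="text"]') as $tag) {
    var_dump(trim(innerHTML($tag)));
}
```

Wrapping the loop in a function keeps the foreach over the XPath results as short as the original nodeValue version.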


Note: there are often interesting ideas and solutions in the user notes of the manual ;-)

Pascal MARTIN
Oh damn, and I read some of the user notes too. Thanks again.
Mint
You're welcome :-)
Pascal MARTIN