ansaurus

Question

Removing inline elements when importing HTML into DOMDocument or SimpleXML?

Answer 1

+1 A:

Please read the first answer to this before parsing HTML with a regex, if only for amusement's sake. XPath is the answer: get the text of the td instead of continuing to parse it. Just search for something like //td and take each result whole, instead of continuing to build the tree until you have leaves that say strong or whatever.
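
A minimal sketch of that idea, assuming SimpleXML is the parser in use and that the sample markup below (hypothetical, modelled on the kind of table in the question) stands in for the real input. It selects every td with XPath and reads each cell's text in one step through dom_import_simplexml(), rather than walking down to the strong children:

```php
<?php
// Hypothetical sample markup; the real table comes from the asker's document.
$table = simplexml_load_string(
    '<table>
        <tr><td>Thing 1</td><td><strong>Thing 2</strong></td></tr>
    </table>'
);

$texts = [];
foreach ($table->xpath('//td') as $td) {
    // dom_import_simplexml() exposes the same node as a DOMElement, whose
    // textContent is the concatenated text of the node and its descendants.
    $texts[] = dom_import_simplexml($td)->textContent;
}

print_r($texts);
```

The $texts array is just one way to hold the results; the point is that the inline markup never has to be visited separately.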

Chuck Vose 2010-01-17 08:19:42

Answer 2

+1 A:

Can't you just use strip_tags() to remove the extra markup?

$table = simplexml_load_string(
    '<table>
        <tr><td>Thing 1</td><td>Thing 2</td></tr>
        <tr><td>Thing 3</td><td>Thing 4</td></tr>
        <tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
    </table>'
);

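// Select every cell, wherever it sits in the table.
foreach ($table->xpath('//td') as $td)
{
    // asXML() serializes the cell, inline tags included; strip_tags() then removes the markup.
    $content = strip_tags($td->asXML());
    echo $content, "\n";
}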
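
One thing to watch with this approach: asXML() returns serialized markup, so character entities in a cell are still escaped after strip_tags() runs. The sketch below uses a hypothetical cell containing an ampersand and shows one possible way to get the plain text back with html_entity_decode(); the UTF-8 charset is an assumption about the input.

```php
<?php
// Hypothetical cell content containing an escaped ampersand.
$table = simplexml_load_string(
    '<table><tr><td><strong>Fish &amp; Chips</strong></td></tr></table>'
);

$cells = $table->xpath('//td');
$td = $cells[0];

// The tags are gone after strip_tags(), but the entity is still escaped.
$stripped = strip_tags($td->asXML());

// Decode entities to recover the plain text (charset assumed to be UTF-8).
$decoded = html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n";
echo $decoded, "\n";
```

If the cells never contain entities, the extra decode step is unnecessary.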

Josh Davis 2010-01-17 08:20:15

I'm not sure if this is the BEST solution, but I'm accepting it based not so much on the `strip_tags` suggestion (which is clever), but on the asXML() suggestion, which didn't occur to me to use BEFORE dealing with moving the contents to an array. Very nice.

Anthony 2010-01-17 08:44:56

Answer 3

A:

If you're using DOMDocument, once you've selected a DOMNode, its textContent property should contain only the text of that node and all its children, which is exactly what you asked for.

$table = '<table>
        <tr><td>Thing 1</td><td>Thing 2</td></tr>
        <tr><td>Thing 3</td><td>Thing 4</td></tr>
        <tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
    </table>';

$dom = new DOMDocument;
$dom->loadHTML($table);
$xpath = new DOMXPath($dom);

$els = $xpath->query('//td');
echo $els->item(4)->textContent; //Thing 5

Alternatively, depending on the type of node, you can check nodeValue as well. I can't recall exactly the difference, but textContent is what you want.
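
For what it's worth, a quick check suggests that for an element node such as a td, PHP's DOM extension gives the same string for both properties. This is only a sketch for element nodes with hypothetical sample markup, not a full account of how the two differ on other node types:

```php
<?php
$dom = new DOMDocument;
// Hypothetical markup: one cell mixing inline markup and bare text.
$dom->loadHTML('<table><tr><td><strong>Thing 5</strong> and more</td></tr></table>');
$xpath = new DOMXPath($dom);

$td = $xpath->query('//td')->item(0);

// Both properties are read from the same element node.
var_dump($td->textContent);
var_dump($td->nodeValue);
var_dump($td->textContent === $td->nodeValue);
```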

seanmonstar 2010-01-17 09:22:02
