After learning how to "correctly" unset a node, I noticed that using PHP's unset() function leaves the tabs and spaces behind. So now I have this big chunk of white space in between nodes at times. I'm wondering if PHP iterates through blank spaces/returns/tabs and whether it would eventually slow down the system.

I'm also asking whether there's an easy way to remove the space unset leaves behind?

Thanks, Ryan

ADDED NOTE:

This is how I removed the whitespace after unsetting a node, and it worked for me:

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->load($xmlPath);
$dom->save($xmlPath);
+3  A: 

Whether it slows down the process: probably too little to care about.

And SimpleXML is just that, simple. If you require a 'pretty' output, DOM is your friend:

<?php
$xml = '
<xml>
        <node>foo </node>
        <other>bar</other>
</xml>';
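// SimpleXML: unset() removes <other> but leaves the surrounding whitespace in place
$x = new SimpleXMLElement($xml);
unset($x->other);
echo $x->asXML();

// DOM: set preserveWhiteSpace before loading so whitespace-only text nodes are dropped
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadXML($xml);
// With the whitespace nodes gone, lastChild is the <other> element
$dom->documentElement->removeChild($dom->documentElement->lastChild);
echo $dom->saveXML();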
Wrikken
Replaced "proper" with "pretty" to avoid any misunderstanding about the properness of a "messy" XML document.
Josh Davis
Is this the equivalent of unset? `$dom->documentElement->removeChild($dom->documentElement->lastChild);` Can I omit this line if I just want to format the output? Thanks, Ryan
Ryan S.
Wrikken
I got this error when trying the above-mentioned approach: `Warning: DOMDocument::loadXML() [domdocument.loadxml]: Start tag expected, '<' not found in Entity, line: 1`. In my research I found this http://bugs.php.net/bug.php?id=45996 but I'm running 2.7.6, so I'm not sure if this still applies.
Ryan S.
I did not say `import` for nothing ;). Try to use the function here: http://www.php.net/manual/en/function.dom-import-simplexml.php If that doesn't work, post / pastebin some actual code you're trying.
Wrikken
OK :). Here's what I did and I'm not getting any errors now, but the document still has whitespace: `$dom = dom_import_simplexml($xml)->ownerDocument; $dom->preserveWhiteSpace = false; $dom->formatOutput = true; $dom->saveXML();`
Ryan S.
Wrikken
OK, got it working. See working code up top. Is that the right way to do it? Thanks, Ryan
Ryan S.
That's a good way if you already have the XML as a string. If you have a SimpleXMLElement, you might get better performance by importing it with `$elm = dom_import_simplexml($simplexml); $dom = new DOMDocument(); $dom->preserveWhiteSpace = false; $dom->formatOutput = true; $domelement = $dom->importNode($elm, true); $dom->appendChild($domelement);` (the second argument to `importNode` requests a deep copy, so the child nodes come along). I'd test the two options in case there's a significant performance difference, but I suspect it's minor.
Wrikken
I will - thank you!
Ryan S.
+3  A: 

Whitespace in XML is stored as text nodes, e.g.

<foo>
    <bar>baz</bar>
</foo>

is really

<foo><- whitespace node
    -><bar>baz</bar><- whitespace node
-></foo>

If you remove the <bar> node, you get

<foo><- whitespace node
    -><- whitespace node
-></foo>

I think SimpleXML won't let you access the text nodes easily (maybe via XPath), but DOM does. See Wrikken's answer for details. Now that you know that whitespace is a node, you can also imagine that parsing it into a node takes up some CPU cycles. However, I'd say the speed impact is negligible. When in doubt, run a benchmark with some real-world data.


EDIT: Proof that whitespace really is stored as nodes

$xml = <<< XML
<foo>
    <bar>baz</bar>
</foo>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
foreach($dom->documentElement->childNodes as $node) {
    var_dump($node);
}

gives

object(DOMText)#4 (0) {}
object(DOMElement)#6 (0) {}
object(DOMText)#4 (0) {}
Gordon
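To tie this to the `preserveWhiteSpace` setting from the other answer, here is a minimal sketch that counts the root element's child nodes with and without that setting. The sample document mirrors the one above; the variable names are only illustrative.

```php
<?php
$xml = "<foo>\n    <bar>baz</bar>\n</foo>";

// Default behaviour: whitespace-only text nodes are kept.
$default = new DOMDocument();
$default->loadXML($xml);

// preserveWhiteSpace must be set before loading to have any effect.
$stripped = new DOMDocument();
$stripped->preserveWhiteSpace = false;
$stripped->loadXML($xml);

$countDefault  = $default->documentElement->childNodes->length;
$countStripped = $stripped->documentElement->childNodes->length;

echo "$countDefault vs $countStripped\n"; // prints "3 vs 1"
```

The three nodes in the first case are the two whitespace text nodes plus `<bar>`; in the second case only `<bar>` remains.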
You're not implying that whitespace is a tag ?! ;-) Also, Libxml can distinguish whitespace nodes from text, in fact XMLReader has 2 types of whitespace.
Robin
XPath would of course be `//text()[normalize-space()='']`, but those will be removed on loading if preserveWhiteSpace is false.
Wrikken
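For a document that was already loaded with whitespace preserved, that XPath expression can be used to strip the leftover nodes afterwards. A rough sketch, assuming a made-up sample document (the `<qux>` element is only there to show that real content survives):

```php
<?php
$xml = "<foo>\n    <bar>baz</bar>\n    <qux>keep me</qux>\n</foo>";

$dom = new DOMDocument();
$dom->loadXML($xml); // preserveWhiteSpace is true by default, so whitespace nodes are kept

// Remove <bar>; the whitespace text nodes around it stay behind.
$root = $dom->documentElement;
$root->removeChild($root->getElementsByTagName('bar')->item(0));

// Select every text node that contains nothing but whitespace.
$xpath = new DOMXPath($dom);
$blank = iterator_to_array($xpath->query("//text()[normalize-space()='']"));

// Work on an array copy so the loop does not depend on the node list while nodes are removed.
foreach ($blank as $node) {
    $node->parentNode->removeChild($node);
}

$result = $dom->saveXML($root);
echo $result, "\n"; // prints "<foo><qux>keep me</qux></foo>"
```

Note that this removes every whitespace-only text node in the document, so it is not suitable where such whitespace is meaningful (mixed content, for instance).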
@Robin: I think Gordon meant they're actual nodes.
Wrikken
@Robin The `<whitespace>` wasn't meant to imply a tag, just that there is a node between the tags. Sorry if that was misleading. Changed it to a hopefully less ambiguous marker.
Gordon
A: 

It's actually libxml that does the XML parsing; whitespace is read by the parser the same as every other character in the input stream (or file). Most of the PHP XML APIs use libxml under the hood (XMLReader, XMLWriter, SimpleXML, XSLT, DOM...). Some of them give you access to whitespace (e.g. DOM, XMLReader), some don't (e.g. SimpleXML).

Robin
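As an illustration of an API that does expose whitespace, here is a small sketch that walks a document with XMLReader and records where whitespace nodes appear. Which of the two whitespace node types gets reported can depend on the document and the libxml version, so the sketch checks for both; the sample XML is made up.

```php
<?php
$xml = "<foo>\n    <bar>baz</bar>\n</foo>";

$reader = new XMLReader();
$reader->XML($xml);

$seen = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::WHITESPACE
        || $reader->nodeType === XMLReader::SIGNIFICANT_WHITESPACE) {
        // Whitespace-only node between tags
        $seen[] = 'whitespace';
    } elseif ($reader->nodeType === XMLReader::ELEMENT) {
        $seen[] = 'element:' . $reader->name;
    }
}
$reader->close();

echo implode(', ', $seen), "\n";
```

The whitespace entries show up on either side of `<bar>`, which matches the node picture in Gordon's answer.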
A: 

Quick answers to the questions asked:

I'm wondering if PHP iterates through blank spaces/returns/tabs and whether it would eventually slow down the system.

No, PHP (or libxml) doesn't really iterate over it. Having more whitespace theoretically slows down parsing, but the effect is so small that it can't realistically be measured. You can test that yourself by removing all whitespace from your XML: it won't get noticeably faster.
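If you want numbers for your own setup, a rough timing sketch could look like the following. The generated document, the run count and the helper name `averageParseTime` are all assumptions for illustration; real-world data will give more meaningful results.

```php
<?php
// Hypothetical helper: average time in seconds to parse $xml over $runs runs.
function averageParseTime(string $xml, int $runs = 20): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        simplexml_load_string($xml);
    }
    return (microtime(true) - $start) / $runs;
}

// Build two equivalent documents: one indented, one without whitespace between elements.
$items = [];
for ($i = 0; $i < 5000; $i++) {
    $items[] = "<item id=\"$i\">value $i</item>";
}
$indented = "<root>\n        " . implode("\n        ", $items) . "\n</root>";
$compact  = '<root>' . implode('', $items) . '</root>';

$tIndented = averageParseTime($indented);
$tCompact  = averageParseTime($compact);

printf("indented: %.6f s, compact: %.6f s\n", $tIndented, $tCompact);
```

Timings vary between runs and machines, so compare several runs before drawing conclusions.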

I'm also asking whether there's an easy way to remove the space unset leaves behind?

No easy way I'm afraid. You can import your SimpleXML stuff to DOM and use formatOutput to completely remodel the whitespace, as suggested in another answer, or you can use a third party library that will do it for you, but you won't find an easy, built-in way to do that.

Josh Davis
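For reference, the SimpleXML-to-DOM route can be sketched as below. This version round-trips through a string with `asXML()` and `loadXML()`, which is one possible way to hand the document over to DOM, not necessarily the fastest; the sample XML is made up.

```php
<?php
$sx = new SimpleXMLElement("<xml>\n    <node>foo</node>\n    <other>bar</other>\n</xml>");
unset($sx->other); // leaves a blank, indented line where <other> used to be

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // drop whitespace-only nodes while loading
$dom->formatOutput = true;        // re-indent on output
$dom->loadXML($sx->asXML());

$pretty = $dom->saveXML();
echo $pretty;
```

Both properties have to be set before `loadXML()` is called; setting `preserveWhiteSpace` on an already loaded document does not remove existing whitespace nodes.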