Isn't that an inconsistent behavior? (PHP 5.2.6)

<?php

$a = new SimpleXMLElement('<a/>');

$a->addAttribute('b', 'One & Two');
//$a->addChild('c', 'Three & Four'); -- results in "unterminated entity reference" warning!
$a->addChild('c', 'Three &amp; Four');
$a->d = 'Five & Six';

print($a->asXML());

Renders:

<?xml version="1.0"?>
<a b="One &amp; Two">
    <c>Three &amp; Four</c>
    <d>Five &amp; Six</d>
</a>

At bugs.php.net they reject every report about this, saying it's a feature. Why could that possibly be? BTW, the docs say nothing about this discrepancy in how SimpleXMLElement escapes text values.

Can anyone convince me it's the best API design decision possible?

A: 

I believe this is caused by the Attribute-Value Normalization that the XML spec requires.

Michael Borgwardt
A: 

The requirement for escaping the characters & and < is provided in the section Character Data and Markup and not in the section Attribute-Value Normalization, as the previous answer states.

To quote the XML Spec.:

"The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively."

Dimitre Novatchev
A: 

Just to make sure we're on the same page, you have three situations.

  1. The insertion of an ampersand into an attribute using addAttribute

  2. The insertion of an ampersand into an element using addChild

  3. The insertion of an ampersand into an element by property overloading

It's the discrepancy between 2 and 3 that has you flummoxed. Why does addChild not automatically escape the ampersand, whereas adding a property to the object and setting its value does escape the ampersand automatically?

Based on my instincts, and buoyed by this bug, this was a deliberate design decision. The property overloading ($a->d = 'Five & Six';) is intended to be the "escape ampersands for me" way of doing things. The addChild method is meant to be the "add exactly what I tell you to add" method. So, whichever behavior you need, SimpleXML can accommodate you.

Let's say you had a database of text where all the ampersands were already escaped. The auto-escaping wouldn't work for you here. That's where you'd use addChild. Or let's say you needed to insert an entity in your document:

$a = simplexml_load_string('<root></root>');
$a->b = 'This is a non-breaking space &nbsp;';
$a->addChild('c','This is a non-breaking space &nbsp;'); 
print $a->asXML();

That's what the PHP developer in that bug is advocating. The behavior of addChild is meant to provide "less simple, more robust" support for when you need to insert an ampersand into the document without it being escaped.
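If you want addChild to accept raw text anyway, one workaround is to escape the value yourself first. The following is only a minimal sketch, assuming the text is UTF-8 and that escaping &, < and > is all you need for element content; the variable names are made up for illustration.

```php
<?php
// Sketch: escape raw text by hand before passing it to addChild().
$a = new SimpleXMLElement('<a/>');

$raw = 'Three & Four';

// htmlspecialchars() turns & < > into &amp; &lt; &gt;.
// ENT_NOQUOTES leaves quotes alone, which is fine inside element content.
$a->addChild('c', htmlspecialchars($raw, ENT_NOQUOTES, 'UTF-8'));

// Property overloading escapes on its own, so the raw string goes in as-is.
$a->d = $raw;

$xml = $a->asXML();
print $xml;
```

Both elements should then serialize the same way, and reading either one back as a string should give the original, unescaped text.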

Of course, this does leave us with the first situation I mentioned, the addAttribute method. The addAttribute method does escape ampersands. So, we might now state the inconsistency as

  1. The addAttribute method escapes ampersands
  2. The addChild method does not escape ampersands
  3. This behavior is somewhat inconsistent. It's reasonable that a user would expect the methods on SimpleXML to escape things in a consistent way

This then exposes the real problem with the SimpleXML api. The ideal situation here would be

  1. Property Overloading on Element Objects escapes ampersands
  2. Property Overloading on Attribute Objects escapes ampersands
  3. The addChild method does not escape ampersands
  4. The addAttribute method does not escape ampersands

This is impossible though, because SimpleXML has no concept of an Attribute Object. The addAttribute method is (appears to be?) the only way to add an attribute. Because of that, it turns out (seems?) SimpleXML is incapable of creating attributes with entities.

All of this reveals the paradox of SimpleXML. The idea behind this API was to provide a simple way of interacting with something that turns out to be complex.

The team could have added a SimpleXMLAttribute Object, but that's an added layer of complexity. If you want a multiple object hierarchy, use DOMDocument.
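For comparison, here is a rough sketch of what the same kind of document looks like with the DOM extension, where text nodes and entity references are separate objects. It is an illustration only; the element and attribute names are borrowed from the question, and the 'nbsp' entity is assumed to be meaningful to whatever consumes the document, since no DTD declares it here.

```php
<?php
// Sketch: build the document with DOM, where each kind of node is explicit.
$doc = new DOMDocument('1.0');
$root = $doc->createElement('a');
$doc->appendChild($root);

// setAttribute() takes the literal value and escapes it on output.
$root->setAttribute('b', 'One & Two');

// A text node holds raw text; the ampersand is escaped when serialized.
$c = $doc->createElement('c');
$c->appendChild($doc->createTextNode('Three & Four'));
$root->appendChild($c);

// An entity reference is its own node type, so it is written out unescaped.
$d = $doc->createElement('d');
$d->appendChild($doc->createTextNode('This is a non-breaking space '));
$d->appendChild($doc->createEntityReference('nbsp'));
$root->appendChild($d);

$xml = $doc->saveXML();
print $xml;
```

The trade-off is the one described above: nothing is ambiguous about what gets escaped, but you pay for it with more objects and more calls.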

The team could have added flags to the addAttribute and addChild methods, but flags make the API more complex.

The real lesson here? Maybe it's that simple is hard, and simple on a deadline is even harder. I don't know if this was the case or not, but with SimpleXML it seems like someone started with a simple idea (use property overloading to make the creation of XML documents easy), and then adjusted as the problems/feature requests came in.

Actually, I think the real lesson here is to just use JSON ;)

Alan Storm
A: 

"Let's say you had a database of text where all the ampersands were already escaped."

If you're doing this, you're doing it wrong. Data should be stored in its most accurate form, not munged for whatever type of output you're currently using. This is even worse if you actually store blobs of (valid) HTML in the database. Using addChild() and grabbing the data out again will destroy your HTML; no sensible library exhibits such horrible asymmetry.

addChild() not encoding your text for you is completely counter-intuitive. What is the point in an API that doesn't protect you from this? It's like json_encode() barfing if you use a double quote in one of your values.

Anyway, to answer the original question: Obviously, I too think it's not a good decision. I do think it's consistent with a lot of PHP's design decisions, which is to fulfill someone's idea of what is "quicker", rather than being correct.

Daniel