I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another Stack Overflow question, line breaks in attribute values are valid XML (even though far from ideal!).

Why are they lost? [edit] And how can I preserve them? [/edit]

Here is a demo script (note that when the line breaks are not in an attribute, they are preserved).

PHP File with embedded XML

$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
    <data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
    <data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;

$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';

Output from print_r

SimpleXMLElement Object
(
    [data] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [Title] => Data Title
                            [Remarks] => First line of the row. Followed by the second line. Even a third!
                        )

                )

            [1] => First line of the row.
Followed by the second line.
Even a third!
        )

)
A: 

The character reference for a new line is &#10;. I played with your code until I found something that did the trick. It's not very elegant, I warn you:

//First, remove any indentation:
$xml = str_replace("     ","", $xml);
$xml = str_replace("\t","", $xml);

//Next, unify all new lines into Unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);

//Next, replace all new lines with the character reference:
$xml = str_replace("\n","&#10;", $xml);

//Finally, replace any new-line references between >< with a real new line:
$xml = str_replace(">&#10;<",">\n<", $xml);

The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.

This of course would fail if your next line had some text that was wrapped in a line-level element.
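
As a rough check, the replacement steps can be run against a tiny document and the attribute read back. The sample XML and the variable names here are assumptions made for illustration, and the indentation-stripping step is left out because the sample has no indentation:

```php
<?php
// Assumed sample input: a raw line break inside an attribute value.
$xml = "<Rows>\n<data Remarks='First line.\nSecond line.' />\n</Rows>";

// Unify line endings, then turn every new line into a character reference.
$xml = str_replace("\r", "\n", $xml);
$xml = str_replace("\n\n", "\n", $xml);
$xml = str_replace("\n", "&#10;", $xml);

// Restore real new lines where the reference sits between two tags.
$xml = str_replace(">&#10;<", ">\n<", $xml);

$rows = new SimpleXMLElement($xml);
$remarks = (string) $rows->data[0]['Remarks'];

// True when the line break inside the attribute survived parsing.
var_dump($remarks === "First line.\nSecond line.");
```

Note that a new line between two attributes of the same tag would also be turned into &#10; by these steps, which is not allowed there, so the input format needs to be predictable.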

Anthony
Very Clever!!! The only catch is that I'm working with massive SOAP-enveloped XML spewing from SharePoint web services, so it makes me a bit nervous to do something so brute force. Based on bobince's post though, it looks like I might have to go this direction. I wonder if there is any more elegant way to pull it off.
Joshua
+3  A: 

Using SimpleXML, the line breaks seem to be lost.

Yes, that is expected... in fact it is required of any conformant XML parser that newlines in attribute values represent simple spaces. See attribute value normalisation in the XML spec.

If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
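
A small sketch of the difference, using two made-up one-element documents that differ only in how the line break is written:

```php
<?php
// Hypothetical inputs: the same attribute, once with a raw new line
// and once with the &#10; character reference.
$raw     = new SimpleXMLElement("<data Remarks='one\ntwo' />");
$encoded = new SimpleXMLElement("<data Remarks='one&#10;two' />");

// The raw new line is normalised to a single space by the parser.
var_dump((string) $raw['Remarks'] === "one two");

// The character reference is kept as a real new-line character.
var_dump((string) $encoded['Remarks'] === "one\ntwo");
```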

bobince
To clarify just a little bit: the newlines are *VALID*, but the XML parser (in order to be compliant with the spec) **MUST** reduce them down to a single space character (see item 3 of bobince's link).
TML
Thanks for the link bobince, and the clarification TML. So I suppose my question now becomes: how can I retain those line breaks? I am receiving this data from a SharePoint web service, so I can't change the XML to include `&#10;`. Is there a way to override the parser compliance in this regard?
Joshua
Unfortunately no, XML is quite inflexible on this point; if the web service is producing `\n` when it means `&#10;` it's a bug. (And a surprising one as this is a fundamental feature that any XML serialiser would be expected to get right... unless of course the service is mucking around with regex or string templating instead of using a proper XML library!)
bobince
Unless you have access to subclass or monkey-patch your XML parser it's not something you're going to be able to change... and I think SimpleXML uses libxml, which you've no hope of fiddling with from PHP. Pre-processing general XML input to put the `&#10;`s in is also a bit of a non-starter, as you'd have to write most of an XML parser already to be able to tell the difference between a newline in an attribute value and one directly inside a tag (where `&#10;` would be illegal). Hacks like Anthony's could work as a temporary fix if the exact formatting is very locked down at the moment.
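
To illustrate the kind of pre-processing hack meant here, below is a minimal sketch that only touches new lines inside quoted attribute values. The function name `encodeAttributeNewlines` and both regular expressions are assumptions made for this example, not part of SimpleXML or libxml. It is not a real XML tokenizer: tag-like text inside comments, CDATA sections, or a DOCTYPE is not handled, so it only suits input whose shape is well known.

```php
<?php
// Hypothetical helper: rewrites raw line breaks inside quoted attribute
// values as &#10; and leaves element content and inter-attribute
// whitespace untouched.
function encodeAttributeNewlines(string $xml): string
{
    // A start or empty-element tag: a name, then zero or more quoted attributes.
    $tag = '/<[A-Za-z_:][^\s\/>]*(?:\s+[^\s=]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))*\s*\/?>/';
    // A single quoted attribute value.
    $value = '/"[^"]*"|\'[^\']*\'/';

    $result = preg_replace_callback($tag, function (array $m) use ($value) {
        return preg_replace_callback($value, function (array $v) {
            // Unify CRLF and lone CR to LF, then encode each LF.
            return str_replace(["\r\n", "\r", "\n"], ["\n", "\n", '&#10;'], $v[0]);
        }, $m[0]);
    }, $xml);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $xml;
}

// Usage on an assumed sample: the attribute keeps its line break,
// and the element text is left exactly as it was.
$fixed = encodeAttributeNewlines("<data Remarks='one\ntwo'>body\ntext</data>");
$el = new SimpleXMLElement($fixed);
var_dump((string) $el['Remarks'] === "one\ntwo");
var_dump((string) $el === "body\ntext");
```

Restricting the replacement to matched tags avoids producing `&#10;` between attributes or outside the root element, at the cost of silently skipping any tag the pattern does not recognise.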
bobince
(sorry about the `code` there, seems to be a flaw in SO's markup around `&#10;` or something...)
bobince
A: 

Anthony, that was a really smart workaround; thanks! Looking at the code, you have to say "duh, why didn't I think of that?!" so thanks so much for cluing me in :)

Robbie
A: 

This is what worked for me:

First, get the xml as a string:

    $xml = file_get_contents($urlXml);

Then do the replacement:

    $xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);

The "." and "<as:eol/>" are there because that is where I needed to add breaks in my case. The bytes "\xe2\x80\xa9" are the UTF-8 encoding of U+2029 (paragraph separator), and the new lines "\n" can be replaced with whatever you like.

After replacing, just load the xml-string as a SimpleXMLElement object:

    $xmlo = new SimpleXMLElement( $xml );

Et Voilà

German