I'm using SimpleXML and XPath to read elements from an external UTF-8 XHTML document. I loop over the elements returned by an XPath selector and echo the output of SimpleXML's asXML() for each one. However, the XML carriage return entity (&#13;) is inserted at the end of every line of the output. There aren't any extra characters in the XHTML document. What is causing this? It seems to be related to the way I iterate over the array of elements returned by xpath(): I don't get the entities when I output a single element directly with asXML(), without using xpath().

<?php
$content = new DOMDocument();
$content->loadHTMLFile('CONTENT.html');
$story = simplexml_import_dom($content->getElementById('story'));
$topics = $story->xpath('div[@class="topic"]');
foreach ($topics as $topic) {
    $topicContents = $topic->xpath('div/child::node()'); // Array of elements within 'content'.
    foreach ($topicContents as $contentElement) {
        echo $contentElement->asXML();
    }
}
?>

Excerpt from outputted XHTML code with auto-generated XML carriage returns:

<div class="content">&#13;
<p>Lorem ipsum dolor sit amet</p>&#13;
<h2>Lorem ipsum</h2>&#13;
<p>Lorem ipsum dolor sit amet</p>&#13;
<ul>
 <li>Lorem ipsum</li>&#13;
 <li>Lorem ipsum</li>&#13;
 <li>Lorem ipsum</li>&#13;
A:

That's how libxml treats \r in text nodes: on output, each carriage return is written as the character reference &#13;. For example:

<?php
$xml = <<< XML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
    <head>
     <title>...</title>
    </head>
    <body><pre>a\nb\r\nc</pre></body>
</html>
XML;
$content = new DOMDocument();
$content->loadHTML($xml);
$content = simplexml_import_dom($content);
echo $content->asXML();

prints:

<html lang="en"><head><title>...</title></head><body><pre>a
b&#13;
c</pre></body></html>
(The \n characters are left alone, while \r\n comes out as &#13;\n.)

I'm not an XML expert, but according to http://www.w3.org/TR/REC-xml/#sec-line-ends:

"To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character."

So I think it should treat the \r\n as a single \n, but it doesn't.

If it doesn't cause you serious trouble, just live with it...
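
If the entities do get in the way, one possible workaround is to normalize the line endings yourself before the markup reaches the parser. This is only a minimal sketch: normalizeLineEndings() is a hypothetical helper, not part of PHP or libxml, and the sample markup is made up for illustration. A likely reason the normalization quoted above does not happen automatically is that loadHTML() and loadHTMLFile() go through libxml's HTML parser rather than its XML parser, but treat that as an assumption.

```php
<?php
// Hypothetical helper (not a PHP built-in): convert CRLF pairs and lone CR
// characters to LF, mirroring the line-end rule quoted from the XML spec.
function normalizeLineEndings(string $markup): string
{
    // "\r\n" is replaced first so a CRLF pair becomes one "\n", not two.
    return str_replace(["\r\n", "\r"], "\n", $markup);
}

// Sample input containing both a bare "\n" and a "\r\n" line break.
$html = "<html><body><pre>a\nb\r\nc</pre></body></html>";

$doc = new DOMDocument();
$doc->loadHTML(normalizeLineEndings($html));

// With no "\r" left in the text nodes, there is nothing to emit as &#13;.
$xml = simplexml_import_dom($doc)->asXML();
echo $xml;
```

For a document on disk, the same idea would mean reading it with file_get_contents() and passing the normalized string to loadHTML() instead of calling loadHTMLFile() directly.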

VolkerK