Hey everyone,

our customer supplied us with XML data that needs to be processed using PHP. They chose to abuse attributes by using them for big chunks of text (containing line breaks). The XML parser replaces the line breaks with spaces, as the W3C XML specification requires when it normalizes attribute values.

To make sure we do not lose our line breaks, I want to read in the file as a string, then translate all line breaks that are between double quotes with &#13;. I think I need a regular expression for that, but I am having trouble coming up with one.

This is my test code (PHP 5) so far, using a look-ahead and look-behind, but it does not work:

$xml = '<tag attribute="Header\r\rFirst paragraph.">\r</tag>';
$pattern = '/(?<=")([^"]+?)\r([^"]+?)(?=")/';

print_r( preg_replace($pattern, "$1&#13;$2", $xml) );

Can anyone help me get this right? Should be easy for a seasoned regexp master :)

+1  A: 

The best method would be to search character-by-character instead. Set a boolean to true if you encounter a quote mark, then to false when you find the matching quote.

If you find a newline character while you are inside the quotes (i.e. your variable is true), then "translate with &#13;", whatever you mean by that. Otherwise leave it alone.
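
The flag-based scan described above can be sketched roughly as follows. This is only a minimal sketch, not a full XML tokenizer: the function name encode_breaks_in_quotes is made up for illustration, it assumes attribute values are always delimited by double quotes and that no stray double quotes appear in text content, comments or CDATA sections, and it treats \r\n as a single line break.

```php
<?php
// Hypothetical helper (the name is an assumption): walks the string one
// character at a time and tracks whether we are inside double quotes.
function encode_breaks_in_quotes($xml)
{
    $out = '';
    $inQuotes = false;
    $length = strlen($xml);

    for ($i = 0; $i < $length; $i++) {
        $char = $xml[$i];

        if ($char === '"') {
            // every double quote toggles the state
            $inQuotes = !$inQuotes;
            $out .= $char;
        } elseif ($inQuotes && ($char === "\r" || $char === "\n")) {
            // assumption: \r\n counts as one line break, so skip the \n
            if ($char === "\r" && $i + 1 < $length && $xml[$i + 1] === "\n") {
                $i++;
            }
            $out .= '&#13;';
        } else {
            // everything else, including line breaks outside quotes, is copied as-is
            $out .= $char;
        }
    }

    return $out;
}

// double-quoted PHP string, so that \r is a real carriage return
$xml = "<tag attribute=\"Header\r\rFirst paragraph.\">\r</tag>";
echo encode_breaks_in_quotes($xml);
```

Scanning byte by byte is safe for UTF-8 input here, because the quote and line-break characters are plain ASCII and never occur inside a multi-byte sequence.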

DisgruntledGoat
+1  A: 

Exactly, that is what I ended up with. For future reference I will post the working code here:

<?php
    header("Content-Type: text/plain");

    // double quotes, so that \r is a real carriage return and not a backslash followed by "r"
    $xml = "<tag attribute=\"Header\r\rFirst paragraph.\">\r</tag>";

    // split the contents at every quote; explode() keeps empty parts,
    // so an empty attribute="" does not upset the odd/even order
    $array = explode('"', $xml);

    // replace line breaks in each of the odd parts (the text between quotes)
    $breaks = array("\r\n", "\n\r", "\r", "\n");
    for ($i = 1; $i < count($array); $i += 2) {
        $array[$i] = str_replace($breaks, '&#13;', $array[$i]);
    }

    // reconstruct the original string
    $xml = implode('"', $array);

    print_r( $xml );
?>
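
If you would rather stay with regular expressions, as the question originally intended, roughly the same thing can be done with preg_replace_callback: match each quoted chunk as a whole, then replace the line breaks inside that match. This is a minimal sketch that assumes attribute values are always in double quotes and that no stray double quotes appear elsewhere in the file; encode_breaks_callback is a hypothetical name.

```php
<?php
// Hypothetical callback (the name is an assumption): receives one quoted
// chunk, quotes included, and encodes the line breaks inside it.
function encode_breaks_callback($match)
{
    // \r\n and \n\r come first so that a pair becomes a single &#13;
    return preg_replace('/\r\n|\n\r|\r|\n/', '&#13;', $match[0]);
}

// double-quoted PHP string, so that \r is a real carriage return
$xml = "<tag attribute=\"Header\r\rFirst paragraph.\">\r</tag>";

// "[^"]*" matches an opening quote, anything but a quote, and the closing quote
$xml = preg_replace_callback('/"[^"]*"/', 'encode_breaks_callback', $xml);

echo $xml;
```

Matching the whole quoted chunk and post-processing it in a callback avoids the look-behind and look-ahead, which cannot tell an opening quote from a closing one.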

Thanks for replying and supporting this solution :)

Droozle
Maybe it would be enough to simply replace *any* line breaks you encounter with `&#13;`? I mean, newlines are (should be) insignificant in XML, so you could replace the ones between tags as well without breaking anything. Don't forget to replace TAB characters also.
Tomalak