I want to remove all the line breaks and carriage returns from an XML file so all tags fit on one line each.

XML Source example:

<resources>
  <resource>
    <id>001</id>
    <name>Resource name 1</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>002</id>
    <name>Resource name 2</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.
</desc>
  </resource>
  <resource>
    <id>003</id>
    <name>Resource name 3</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor.
Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.
</desc>
  </resource>
</resources>

My take at it:

$pattern = "#(\t\t<[^>]*>[^<>]*)[\r\n]+([^<>]*</.*>)#";
$replacement = "$1$2";
$data = preg_replace($pattern, $replacement, $data);

This pattern fixes the 2nd resource and puts it back on one line. However, it removes only one of the two line breaks in the 3rd resource. The result is this:

<resources>
  <resource>
    <id>001</id>
    <name>Resource name 1</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>002</id>
    <name>Resource name 2</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>003</id>
    <name>Resource name 3</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor.
Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
</resources>

What's wrong with my pattern?

A: 

What's wrong with my pattern?

It's a pattern, not an XML parser.

Try using the DOM, or one of the many, many real XML parsers available to PHP. It should be a simple matter of going through all of the text nodes and trimming them.
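
As a rough sketch of that approach, assuming the DOM extension is enabled: the helper name collapseTextNodes() below is made up for illustration, and replacing each run of line breaks with a single space is my assumption about the desired result, not something the question specifies.

```php
<?php
// Sketch only: collapseTextNodes() is a hypothetical helper, not a PHP built-in.
function collapseTextNodes(string $xml): string
{
    $doc = new DOMDocument();
    // Must be set before loading: drops whitespace-only text nodes between elements.
    $doc->preserveWhiteSpace = false;
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $node) {
        // Assumption: each run of CR/LF becomes one space, then the ends are trimmed.
        $node->data = trim(preg_replace('/[\r\n]+/', ' ', $node->data));
    }

    // Re-indent on output so every element lands on its own line.
    $doc->formatOutput = true;
    return $doc->saveXML();
}
```

Calling collapseTextNodes($data) returns the document as a string. Note that saveXML() prepends an XML declaration and that the indentation is whatever libxml produces, so the layout may differ slightly from the original file.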

Charles
A: 

Unless there's a lot more to what you're trying to do than you describe, I think you're making this far too complicated. You don't need a regex nearly as complex as the one you have. Try just using /\r?\n/. This worked for me with your data:

$data = preg_replace("/\r?\n/", "", $data);
JGB146
A: 

The first [^<>]* in your regex initially gobbles up all of the remaining text, and then has to backtrack a ways so the rest of the regex can match. It only backtracks as far as it has to, i.e., to the last line break in the text. The rest of the regex is able to match what's left, so that's that.

But your regex would only match one line break in any case, because it consumes the whole text. It should consume only the part you want to remove. Check this out:

$data = preg_replace('#[\r\n]+(?=[^<>]*</desc>)#', ' ', $data);

After the line break is found, the lookahead confirms that it sits inside a <desc> element. But the lookahead doesn't consume anything, so the next line break (if there is one) is still there to be matched when the search resumes.

You can't have the lookahead match just any end tag (</\w+>) because that would let it match line breaks between elements as well as inside them. You can, however, enumerate the elements you want to work on:

</(?:desc|name|id)>
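
Putting the two pieces together, here is a minimal sketch. The function name joinBrokenLines() and its $tags parameter are hypothetical, and the default list simply mirrors the element names in the question's sample.

```php
<?php
// Sketch only: joinBrokenLines() and $tags are illustrative names, not an existing API.
function joinBrokenLines(string $data, array $tags = ['desc', 'name', 'id']): string
{
    // Build the alternation (e.g. desc|name|id), escaping each tag name for the regex.
    $alternation = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $tags));

    // Match a run of CR/LF only if one of the listed closing tags follows
    // with no other tag in between. The lookahead itself consumes nothing.
    $pattern = '#[\r\n]+(?=[^<>]*</(?:' . $alternation . ')>)#';

    return preg_replace($pattern, ' ', $data);
}
```

Usage would be $data = joinBrokenLines($data);. Because each line break is replaced by a space, a break that sits directly before the closing tag leaves a trailing space inside the element; pass the result through a further trim step if that matters.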
Alan Moore