I'm trying to write a regular expression using the PCRE library in PHP.

I need a regex that matches only the &, > and < chars that appear in the text content of an XML node, not the ones that are part of the tags themselves.

Input XML:

<pnode>
  <cnode>This string contains > and < and & chars.</cnode>
</pnode>

The idea is to do a search and replace on these chars, converting them to their XML entity equivalents.

If I were to convert the entire XML to entities, it would look like this:

Entire XML converted to entities

&lt;pnode&gt;
  &lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
&lt;/pnode&gt;

I need it to look like this:

Correct XML

<pnode>
  <cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
</pnode>

I have tried to write a regular expression to match these chars using look-ahead, but I don't know enough to get it to work. My attempt (currently only trying to match > symbols):

/>(?=[^<]*<)/g

Just to make it clear: the XML I'm trying to fix comes from a 3rd party, and they seem unable to fix it at their end, hence my attempt to fix it.

+1  A: 

Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's outside the scope of your problem. Sounds like you might have to manually parse the XML, run htmlentities() on the contents, then put the XML tags back.
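
A minimal sketch of the escaping step only, assuming the text content has already been separated from the tags (the sample string is the one from the question). htmlspecialchars() with ENT_NOQUOTES touches just &, < and >, and passing false as the fourth argument (double_encode) leaves entities that are already present alone:

```php
<?php
// Text content only; the surrounding tags are assumed to have been split off already.
$text = 'This string contains > and < and & chars.';

// ENT_XML1 selects XML 1.0 entity rules; ENT_NOQUOTES leaves " and ' untouched.
// The final false turns off double encoding, so an existing &amp; is not re-escaped.
$escaped = htmlspecialchars($text, ENT_NOQUOTES | ENT_XML1, 'UTF-8', false);

echo $escaped, "\n"; // This string contains &gt; and &lt; and &amp; chars.
```

The hard part, deciding which < and > belong to tags, is not addressed here.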

TravisO
or htmlspecialchars() if you just want to convert the mentioned characters.
jeroen
The XML is provided by a 3rd party and I have no control over the data. Also, there are fewer character entities in XML than in HTML, so htmlentities() would over-entitise! ;-)
Camsoft
The problem with parsing it as an object is that the actual XML document I want to fix is 5MB and 42,000 lines. I hoped that a regex would quickly search and replace the invalid chars.
Camsoft
A: 

Would it be possible to intercept the text before it tries to become part of your XML? A few ounces of prevention might be worth pounds of cure.

No Refunds No Returns
I'm not the author of the XML, I'm just the one trying to use it.
Camsoft
A: 

This should do it for ampersands:

/(\s+)(&)(\s+)/gim

This means you're only looking for those characters when they have whitespace characters on both sides.

Just make sure the replacement expression is "$1$2amp;$3" ($2 is the matched &, so the result is &amp;).

The others would go like this, with their replacement expressions on the right:

/(\s+)(>)(\s+)/gim   "$1&gt;$3"
/(\s+)(<)(\s+)/gim   "$1&lt;$3"
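
For PHP specifically, a rough sketch of the same three expressions with preg_replace(). PCRE in PHP has no g flag (preg_replace() already replaces every match), so it is dropped here, along with i and m, which have no effect on these patterns. The sample string and variable names are made up for illustration:

```php
<?php
// Hypothetical sample text: two chars with whitespace around them, one without.
$text = 'fish & chips, a > b and c < d, but x<y is missed';

// Same three-group patterns as above; group 1 and group 3 hold the whitespace.
$patterns = array('/(\s+)(&)(\s+)/', '/(\s+)(>)(\s+)/', '/(\s+)(<)(\s+)/');
$replacements = array('$1&amp;$3', '$1&gt;$3', '$1&lt;$3');

// With arrays, preg_replace() applies each pattern in turn to the result of the previous one.
$fixed = preg_replace($patterns, $replacements, $text);

echo $fixed, "\n"; // fish &amp; chips, a &gt; b and c &lt; d, but x<y is missed
```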
Robusto
This is close but it does not work when the chars don't have spaces surrounding them.
Camsoft
Robusto
Yep I know. Hence my original question.
Camsoft
+2  A: 

I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.

There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:

    <tag>Text containing < and > characters</tag>

you and I can probably guess that the result should be: ...containing &lt; and &gt;... but I'm pretty sure the XML specification allows the extra whitespace, so officially "< and >" should be treated as a tag. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.
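
To make the "anything that doesn't look like a tag isn't a tag" guess concrete, here is a heuristic sketch, not a real parser. It assumes a tag always has a name directly after the < (which is what the start-tag grammar in the XML spec requires), splits the document on anything of that shape, and escapes whatever is left over. The function name and the tag pattern are my own assumptions, and text such as a<b>c would still be misread as containing a tag:

```php
<?php
// Heuristic: escape &, < and > in everything that does not look like markup.
function escapeTextNodes(string $xml): string
{
    // Processing instructions, comments, CDATA sections, or start/end/empty tags.
    $markup = '~(<\?.*?\?>|<!--.*?-->|<!\[CDATA\[.*?\]\]>'
            . '|</?[A-Za-z_][\w:.-]*(?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))*\s*/?>)~s';

    // With DELIM_CAPTURE the result alternates: text, markup, text, markup, ...
    $parts = preg_split($markup, $xml, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even indexes are text between pieces of markup
            // Escape & unless it already starts one of the XML entities or a numeric reference.
            $part = preg_replace('/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)/', '&amp;', $part);
            $parts[$i] = str_replace(array('<', '>'), array('&lt;', '&gt;'), $part);
        }
    }

    return implode('', $parts);
}

$broken = "<pnode>\n  <cnode>This string contains > and < and & chars.</cnode>\n</pnode>";
echo escapeTextNodes($broken), "\n";
```

It never tracks nesting, so it only works to the extent that the stray characters don't happen to form something tag-shaped.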

Jerry Coffin
Yeah, I was starting to think that. The more I look at the problem, the more complicated it seems to get. I would just love to avoid using an XML parser, as it's a huge XML file I'm trying to fix.
Camsoft
A: 

As stated by others, regular expressions don't do well with hierarchical data. Besides, if the data is improperly formatted, you can't guarantee that you'll get it right. Consider:

<xml>
    <tag>Something<br/>Something Else</tag>
</xml>

Is that <br/> supposed to read &lt;br/&gt;? There's no way to know because it's validly formatted XML.

If you have arbitrary data that you wish to include in your XML tree, consider using a <![CDATA[ ... ]]> block instead. It's treated the same as a text node, and the only thing you still have to watch for is the character sequence ]]>, which ends the block and so can't appear inside it.
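
For whoever generates the XML, a small sketch of wrapping text in CDATA. The helper name toCdata is made up; the one case it has to handle is a ]]> inside the data, which it splits across two adjacent CDATA sections:

```php
<?php
// Wrap arbitrary text in a CDATA section.
// A literal "]]>" would end the section early, so it is split in two:
// "]]" closes the first section and ">" opens the next one.
function toCdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

echo '<cnode>', toCdata('This string contains > and < and & chars.'), '</cnode>', "\n";
```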

MightyE
A: 
Max
+1  A: 

In the end I've opted to use the Tidy library in PHP. The code I used is shown below:

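  // Specify configuration: treat input and output as XML
  // and write entities in numeric form
  $config = array(
    'input-xml'  => true,
    'show-warnings' => false,
    'numeric-entities' => true,
    'output-xml' => true);

  // Parse the feed as Latin-1, then repair it
  $tidy = new tidy();
  $tidy->parseFile('feed.xml', $config, 'latin1');
  $tidy->cleanRepair();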

This works perfectly, correcting all the encoding errors and converting invalid characters to XML entities.

Camsoft
Don't forget to accept your answer. Even though it's your own, it has answered your question and will save others trawling through the other answers.
Lazarus
Will do. Can't accept it till tomorrow.
Camsoft
A: 

RegEx is NOT the right tool to parse XML. Use an XML parser instead.
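
One caveat, sketched below under the assumption that the SimpleXML/libxml extension is available: a conforming parser will reject input like the sample in the question rather than repair it, so on its own it can tell you where the file is broken but not fix it.

```php
<?php
// The malformed sample from the question.
$xml = "<pnode>\n  <cnode>This string contains > and < and & chars.</cnode>\n</pnode>";

// Collect parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = simplexml_load_string($xml);

if ($doc === false) {
    // Report each problem the parser found, with its line number.
    foreach (libxml_get_errors() as $error) {
        printf("line %d: %s\n", $error->line, trim($error->message));
    }
    libxml_clear_errors();
}
```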

Salman A
I agree but I was trying to avoid parsing an XML file that contains 42,000 lines of XML and weighs in at 5MB.
Camsoft