ansaurus

Question

Regular expression to match ">", "<", "&" chars that appear inside XML nodes

Answer 1

+1 A:

Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's outside the scope of your problem. Sounds like you might have to manually parse the XML, run htmlentities() on the contents, then put the XML tags back.

TravisO 2010-02-17 16:59:29

or htmlspecialchars() if you just want to convert the mentioned characters.

jeroen 2010-02-17 17:01:17

The XML is provided by a 3rd party and I have no control over the data. Also, XML defines far fewer character entities than HTML, so htmlentities() would over-entitise! ;-)

Camsoft 2010-02-17 17:03:47
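
To make the difference between the two functions concrete, here is a small sketch. The sample string is made up, and the flags are my own choice rather than anything from the thread: ENT_NOQUOTES leaves quotes alone, ENT_HTML401 selects the HTML entity table, and ENT_XML1 selects the XML one.

```php
<?php
// Made-up sample text containing a non-ASCII letter plus the three problem characters.
$text = 'Café & Bar <1>';

// htmlentities() with the HTML 4.01 table also converts "é" to a named
// entity, which an XML parser does not know without a DTD.
$asHtml = htmlentities($text, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');

// htmlspecialchars() only touches &, < and > here (quotes are skipped by ENT_NOQUOTES).
$asXml = htmlspecialchars($text, ENT_NOQUOTES | ENT_XML1, 'UTF-8');

echo $asHtml, "\n"; // Caf&eacute; &amp; Bar &lt;1&gt;
echo $asXml, "\n";  // Café &amp; Bar &lt;1&gt;
```

Either way, these functions only help once the text has been separated from the tags, since they would escape the tags as well.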

The problem with parsing it as an object is that the actual XML document I want to fix is 5MB and 42,000 lines. I hoped that a regex would quickly search and replace the invalid chars.

Camsoft 2010-02-17 17:14:21

Answer 2

A:

Would it be possible to intercept the text before it tries to become part of your XML? A few ounces of prevention might be worth pounds of cure.

No Refunds No Returns 2010-02-17 17:01:17

I'm not the author of the XML, I'm just the one trying to use it.

Camsoft 2010-02-17 17:12:31

Answer 3

A:

This should do it for ampersands:

/(\s+)(&)(\s+)/gim

This means you're only looking for those characters when they have whitespace characters on both sides.

Just make sure the replacement expression is "$1$2amp;$3";

The others would go like this, with their replacement expressions on the right

/(\s+)(>)(\s+)/gim   "$1&gt;$3"
/(\s+)(<)(\s+)/gim   "$1&lt;$3"

Robusto 2010-02-17 17:11:50
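
For reference, a minimal sketch of how these patterns might be applied in PHP. The function name escapeSpacedChars is hypothetical, and each replacement keeps group 3 (the trailing whitespace). PHP's preg_replace() replaces every match, so the "g" flag is dropped, and "i" and "m" make no difference to these patterns.

```php
<?php
// Hypothetical helper: escapes &, > and < only when they have whitespace on both sides.
function escapeSpacedChars(string $text): string
{
    $patterns     = ['/(\s+)(&)(\s+)/', '/(\s+)(>)(\s+)/', '/(\s+)(<)(\s+)/'];
    // $1 is the leading whitespace, $3 the trailing whitespace.
    $replacements = ['$1&amp;$3', '$1&gt;$3', '$1&lt;$3'];

    // With arrays, preg_replace() applies the pairs one after another.
    return preg_replace($patterns, $replacements, $text) ?? $text;
}

echo escapeSpacedChars('<a>Tom & Jerry, 1 < 2 > 0</a>'), "\n";
// <a>Tom &amp; Jerry, 1 &lt; 2 &gt; 0</a>

echo escapeSpacedChars('<a>AT&T</a>'), "\n";
// <a>AT&T</a>  (no surrounding whitespace, so nothing is escaped)
```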

This is close but it does not work when the chars don't have spaces surrounding them.

Camsoft 2010-02-17 17:19:50
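
For the ampersand at least, whitespace is not needed as a hint. A common trick is a negative lookahead that skips anything already shaped like an entity. This is only a sketch: the function name is hypothetical, and it assumes the feed uses nothing beyond the five predefined XML entities and numeric references. It does nothing for < and >.

```php
<?php
// Hypothetical helper: escapes every "&" that does not already start
// one of the predefined XML entities or a numeric character reference.
function escapeBareAmpersands(string $text): string
{
    $pattern = '/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/';

    return preg_replace($pattern, '&amp;', $text) ?? $text;
}

echo escapeBareAmpersands('AT&T &amp; R&D &#38; x'), "\n";
// AT&amp;T &amp; R&amp;D &#38; x
```

The guess can still be wrong: text that was meant to show the literal characters "&lt;" stays as it is, so the parser will read it as an entity.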

Robusto 2010-02-17 17:21:57

Yep I know. Hence my original question.

Camsoft 2010-02-17 17:24:01

Answer 4

+2 A:

I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.

There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:

    <tag>Text containing < and > characters</tag>

you and I can probably guess that the result should be: ...containing &lt; and &gt;... but a parser can't guess: to it, a bare < always starts markup, so "< and >" gets read as a (malformed) tag rather than as text. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.

Jerry Coffin 2010-02-17 17:18:27
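
A rough sketch of that last idea, in PHP: split the document on anything that has the shape of a well-formed tag, comment, CDATA section, processing instruction or simple DOCTYPE, and escape whatever is left over. The function name and the tag pattern are my own assumptions. It is a heuristic and not a parser: it does not check nesting, and it does not handle a DOCTYPE with an internal subset.

```php
<?php
// Heuristic sketch: treat only well-shaped tags as markup, escape the rest.
function escapeTextBetweenTags(string $xml): string
{
    // One capturing group, so preg_split() returns text and tag-like tokens alternately.
    $tagLike = '~(<\?.*?\?>|<!--.*?-->|<!\[CDATA\[.*?\]\]>|<!DOCTYPE[^>]*>'
             . '|</?[A-Za-z_][\w:.-]*(?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))*\s*/?>)~s';

    $parts = preg_split($tagLike, $xml, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $xml; // regex engine error: leave the input untouched
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // odd indexes hold the tag-like tokens
        }
        // Escape ampersands that do not already start an entity, then the brackets.
        $part = preg_replace('/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/', '&amp;', $part) ?? $part;
        $parts[$i] = str_replace(['<', '>'], ['&lt;', '&gt;'], $part);
    }

    return implode('', $parts);
}

echo escapeTextBetweenTags('<tag>Text containing < and > characters</tag>'), "\n";
// <tag>Text containing &lt; and &gt; characters</tag>
```

Anything in the text that happens to look like a real tag is still kept as markup, so this narrows the guessing without removing it.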

Yeah, I was starting to think that. The more I look at the problem, the more complicated it seems to get. I would just love to avoid using an XML parser, as it's a huge XML file I'm trying to fix.

Camsoft 2010-02-17 17:22:56

Answer 5

A:

As stated by others, regular expressions don't do well with hierarchical data. Besides, if the data is improperly formatted, you can't guarantee that you'll get it right. Consider:

<xml>
    <tag>Something<br/>Something Else</tag>
</xml>

Is that <br/> supposed to read &lt;br/&gt;? There's no way to know because it's validly formatted XML.

If you have arbitrary data that you wish to include in your XML tree, consider using a <![CDATA[ ... ]]> block instead. It's treated the same as a text node, and the only thing you have to watch out for is the character sequence ]]>, which ends the block.

MightyE 2010-02-17 18:08:20
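
For whoever generates the XML, a minimal sketch of wrapping text in a CDATA block. The function name toCdata is hypothetical. A "]]>" inside the text is handled by closing the section after "]]" and opening a new one before ">", which is the usual workaround.

```php
<?php
// Hypothetical helper: wraps arbitrary text in a CDATA section.
function toCdata(string $text): string
{
    // "]]>" would end the section early, so split it across two sections.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return '<![CDATA[' . $safe . ']]>';
}

echo toCdata('1 < 2 & 3 > 2'), "\n";
// <![CDATA[1 < 2 & 3 > 2]]>

echo toCdata('x]]>y'), "\n";
// <![CDATA[x]]]]><![CDATA[>y]]>
```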

Answer 6

A:

Max 2010-02-17 20:33:25

Answer 7

+1 A:

In the end I've opted to use the Tidy library in PHP. The code I used is shown below:

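  // Specify configuration: read and write XML, and emit entities
  // other than the basic ones in numeric form
  $config = array(
    'input-xml'  => true,
    'show-warnings' => false,
    'numeric-entities' => true,
    'output-xml' => true);

  // Parse the feed with the 'latin1' encoding, then repair it
  $tidy = new tidy();
  $tidy->parseFile('feed.xml', $config, 'latin1');
  $tidy->cleanRepair();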

This works perfectly correcting all the encoding errors and converting invalid characters to XML entities.

Camsoft 2010-02-18 09:21:31
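
Whichever repair route is taken, it may be worth confirming that the result really is well-formed before using it. A minimal sketch, assuming PHP's DOM extension is available. The function name isWellFormedXml is hypothetical.

```php
<?php
// Hypothetical helper: returns true if libxml can parse the string as XML.
function isWellFormedXml(string $xml): bool
{
    if ($xml === '') {
        return false; // loadXML() rejects an empty string
    }

    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok  = $doc->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    return (bool) $ok;
}

var_dump(isWellFormedXml('<a>1 &lt; 2</a>')); // bool(true)
var_dump(isWellFormedXml('<a>1 < 2</a>'));    // bool(false)
```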

Don't forget to accept your answer. Even though it's your own, it has answered your question and will save others trawling through the other answers.

Lazarus 2010-02-18 09:51:03

Will do. Can't accept it till tomorrow.

Camsoft 2010-02-18 13:13:52

Answer 8

A:

RegEx is NOT the right tool to parse XML. Use an XML parser instead.

Salman A 2010-02-18 09:23:40

I agree but I was trying to avoid parsing an XML file that contains 42,000 lines of XML and weighs in at 5MB.

Camsoft 2010-02-18 13:15:26
