Hiya,

I'm trying to parse an XML file and one of the fields looks like the following:

<link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>

This seems to break the parser. I think it might be something to do with the & in the link?

My code is quite simple:

<?php

$xml = simplexml_load_file("files/this.xml");

echo $xml->getName() . "<br />";

foreach($xml->children() as $child) {
  echo $child->getName() . ": " . $child . "<br />";
}
?>

Any ideas how I can resolve this?

+2  A: 

It breaks the parser because your XML is invalid - & should be encoded as &amp;.
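
To see what the parser actually objects to, a minimal sketch like the following may help. It assumes the libxml extension that SimpleXML is built on; the `$bad` string is an illustrative stand-in for the feed, not the real file.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Illustrative input with an unescaped '&' in the URL (not the real feed).
$bad = '<xml><link>http://foo.com/click.php?var_a=a&var_b=b</link></xml>';

$xml = simplexml_load_string($bad);
if ($xml === false) {
    // Print each parser error with the line it was reported on.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), " (line {$error->line})\n";
    }
    libxml_clear_errors();
}
```

The reported errors should point at the bare ampersand, which confirms that the feed itself is malformed rather than the PHP code.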

Greg
mjv
I can't change the feed, unfortunately, so is there no way other than regex to do this? Would this fix it: <link><![CDATA[link]]></link> (if I can change the file)?
Shadi Almosri
mjv - if you want to post your comment as an answer, I will accept it, as it has made life easier now and my XML "valid"...
Shadi Almosri
+3  A: 

The XML snippet you posted is not valid. Ampersands have to be escaped; this is why the parser complains.

Malax
I can't change the feed, unfortunately, so is there no way other than regex to do this?
Shadi Almosri
+3  A: 

Your XML feed is not valid XML: the & should be escaped as &amp;

This means you cannot use an XML parser on it as it stands :-(

A possible "solution" (feels wrong, but should work) would be to replace every '&' that is not part of an entity with '&amp;', to get a valid XML string before loading it with an XML parser.


In your case, considering this:

$str = <<<STR
<xml>
  <link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>
</xml>
STR;

You might use a simple call to str_replace, like this:

$str = str_replace('&', '&amp;', $str);

And then parse the string (now valid XML) that's in $str:

$xml = simplexml_load_string($str);
var_dump($xml);

In this case, it should work...


But note that you must take care with entities: if you already have an entity like '&gt;', you must not turn it into '&amp;gt;'!

Which means that such a simple call to str_replace is not the right solution: it will probably break stuff on many XML feeds!

Up to you to find the right way to do that replacement -- maybe with some kind of regex...
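
One way that regex could look, as a rough sketch rather than a definitive fix: use a negative lookahead so that only ampersands not already starting an entity or character reference get escaped. The function name `escape_bare_ampersands` and the shortened sample URL are made up for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer): escape only the
// ampersands that do not already start an entity or character reference.
function escape_bare_ampersands(string $xml): string
{
    // The negative lookahead skips &name;, &#123; and &#x1F; forms.
    return preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

// Illustrative input mixing a bare '&' with an already escaped '&amp;'.
$str = <<<STR
<xml>
  <link>http://foo.com/click.php?var_a=a&var_b=b&amp;var_c=c</link>
</xml>
STR;

$xml = simplexml_load_string(escape_bare_ampersands($str));
echo $xml->link, "\n";
```

This is still a heuristic: anything that merely looks like an entity (for example '&foo;' in a query string) is left alone and may still upset the parser, so fixing the feed at the source remains the better option when that is possible.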

Pascal MARTIN
A: 

The comment by mjv resolved it:

As an alternative to using &amp;, you may consider putting the URLs and other XML-unfriendly content in <![CDATA[ ... ]]>, i.e. a Character Data block
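
For anyone landing here later, this is roughly what that looks like. It is a sketch with a shortened, made-up URL, and it assumes you are able to change how the file is written.

```php
<?php
// Illustrative feed fragment: a raw '&' is legal inside a CDATA section.
$str = <<<STR
<xml>
  <link><![CDATA[http://foo.com/click.php?var_a=a&var_b=b]]></link>
</xml>
STR;

$xml = simplexml_load_string($str);

// Casting the element to a string returns the CDATA text unchanged.
echo (string) $xml->link, "\n";
```

The only thing the wrapped content must not contain is the sequence ]]>, since that closes the CDATA section.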

Shadi Almosri