Hi all,

I'm having some trouble parsing malformed XML in PHP. In particular, I'm querying a third-party web service that returns XML without encoding the XML entities in the actual data. For example, one of the elements contains an ASCII heart, '<3' (without the quotes), which the XML parser sees as an opening tag. It should be '&lt;3'.

Right now I'm simply passing the XML string into a SimpleXMLElement which, predictably, fails on these instances. I've done some looking around and it seems like the PHP Tidy package might be able to help me, but the amount of configuration you can do is overwhelming :(

Thus, I'm just wondering if anyone else has had a problem like this and, if so, how they were able to solve it.

Thanks!

+4  A: 

Try tidy.repairString:

php > $tidy = new tidy();
php > $repaired = $tidy->repairString("<foo>I <3 Philadelphia</foo>", array("input-xml"=>1));
php > print($repaired);
<foo>I &lt;3 Philadelphia</foo>
php > $el = new SimpleXMLElement($repaired);
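
Outside the interactive shell, the same idea might be wrapped up roughly as in the sketch below. It assumes the tidy and SimpleXML extensions are loaded. The helper name repairAndParse() is made up for illustration, and the extra "output-xml" and "wrap" options and the "utf8" encoding argument are my additions rather than part of the session above.

```php
<?php
// Sketch only: assumes the tidy and SimpleXML extensions are loaded.
// repairAndParse() is a hypothetical helper name, not part of either extension.
function repairAndParse(string $rawXml): SimpleXMLElement
{
    $tidy = new tidy();
    $repaired = $tidy->repairString($rawXml, [
        'input-xml'  => true, // treat the input as XML rather than HTML
        'output-xml' => true, // ask tidy to emit well-formed XML
        'wrap'       => 0,    // do not re-wrap long text lines
    ], 'utf8');

    if ($repaired === false) {
        throw new RuntimeException('tidy could not repair the document');
    }

    // Throws an Exception if the repaired string is still not well-formed.
    return new SimpleXMLElement($repaired);
}

$el = repairAndParse('<foo>I <3 Philadelphia</foo>');
echo (string) $el, PHP_EOL;
```

If the service sends a different encoding, the third argument would need to change accordingly.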
Matthew Flaschen
Perfect, thank you :)! I feel kind of silly for not just trying that configuration option now.
jszwedko
A: 
  1. Read the content as a string.
  2. htmlspecialchars(preg_replace('/[\x-\x8\xb-\xc\xe-\x1f]/','',$string))
  3. Load the transformed string in SimpleXMLElement

It has worked for me so far.

rpSetzer
That doesn't work: new SimpleXMLElement(htmlspecialchars(preg_replace('/[\x-\x8\xb-\xc\xe-\x1f]/','', "<foo>I <3 Philadelphia</foo>"))); will throw, because you're over-escaping.
Matthew Flaschen
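
To make the over-escaping concrete, here is a small sketch that assumes only the SimpleXML extension. It shows that htmlspecialchars() escapes the real tags along with the stray '<', so no root element is left to parse. The escapeStrayLessThan() helper is a hypothetical regex heuristic of mine for cases where tidy is not available, not something proposed in the thread.

```php
<?php
// Sketch only: assumes the SimpleXML extension is loaded.
$raw = '<foo>I <3 Philadelphia</foo>';

// htmlspecialchars() escapes every "<" and ">", including the real tags.
$escaped = htmlspecialchars($raw);
echo $escaped, PHP_EOL; // &lt;foo&gt;I &lt;3 Philadelphia&lt;/foo&gt;

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
try {
    new SimpleXMLElement($escaped);
    $overEscapedParses = true;
} catch (Exception $e) {
    $overEscapedParses = false; // no root element is left to parse
}

// Hypothetical heuristic: escape only a "<" that cannot start a tag,
// closing tag, comment, CDATA section, or processing instruction.
function escapeStrayLessThan(string $xml): string
{
    return preg_replace('/<(?![A-Za-z_:\/!?])/', '&lt;', $xml) ?? $xml;
}

$fixed = escapeStrayLessThan($raw);
$el = new SimpleXMLElement($fixed);
echo $fixed, PHP_EOL;       // <foo>I &lt;3 Philadelphia</foo>
echo (string) $el, PHP_EOL; // I <3 Philadelphia
```

The heuristic only catches a '<' followed by something that cannot begin markup (a digit, a space, and so on). Text such as 'a<b' would still be read as the start of a tag, so it is a narrow workaround, not a general repair.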