views: 401 · answers: 2
I've been trying to parse this feed. If you click the link, you'll notice that even the browser can't parse it correctly.

Anyway, my hosting service won't let me use simplexml_load_file, so I've been using cURL to fetch the feed and then loading the string into the DOM, like this:

$dom = new DOMDocument;
// loadXML() returns false on failure, so check its return value,
// not the $dom object itself
if (!$dom->loadXML($rawXML)) {
    echo 'Error while parsing the document';
    exit;
}
$xml = simplexml_import_dom($dom);

But I get errors ("DOMDocument::loadXML() [domdocument.loadxml]: Entity 'nbsp' not defined in Entity"). I then tried using SimpleXMLElement, without luck; it shows the same error ("parser error : Entity 'nbsp' not defined", etc.) because of the HTML in that one element:

$xml = new SimpleXMLElement($rawXML);

So my question is, how do I skip/ignore/remove that element so I can parse the rest of the data?


Edit: Thanks to mjv for the solution! I just did this (for others who have the same trouble):

$rawXML = str_replace('<description>','<description><![CDATA[',$rawXML);
$rawXML = str_replace('</description>',']]></description>',$rawXML);
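Putting the pieces together, the whole flow can be sketched like this (parseFeed is a hypothetical name; fetching the raw string with cURL is left out, as in the question):

```php
<?php
// Sketch of the full flow: wrap the HTML-laden <description> element in
// CDATA so undefined entities like &nbsp; can't break the XML parser,
// then hand the cleaned string to SimpleXMLElement.
function parseFeed($rawXML) {
    $rawXML = str_replace('<description>', '<description><![CDATA[', $rawXML);
    $rawXML = str_replace('</description>', ']]></description>', $rawXML);
    return new SimpleXMLElement($rawXML);
}

$sample = '<rss><channel><description>blah &nbsp; blah</description></channel></rss>';
$xml = parseFeed($sample);
echo (string)$xml->channel->description; // blah &nbsp; blah
```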
+5  A: 

You're probably going to need to manipulate the raw feed with something like:

$xml = @file_get_contents('http://www.wow-europe.com/realmstatus/index.xml');
if ( $xml ) {
    // neutralize the undefined entity: &nbsp; becomes the literal &amp;nbsp;
    $xml = preg_replace( '/&nbsp/', '&amp;nbsp', $xml );
    $xml = new SimpleXMLElement($xml);
    var_dump($xml);
}

before feeding it to an XML parser. I'd love to recommend some other way, but AFAIK this is the only one.

Edit: I think you can actually replace <description> with <description><![CDATA[ and so forth:

<?php
$xml = @file_get_contents('http://www.wow-europe.com/realmstatus/index.xml');
$xml = preg_replace( '/<description>/', '<description><![CDATA[', $xml );
$xml = preg_replace( '/<\/description>/', ']]></description>', $xml );
$xml = new SimpleXMLElement($xml);
var_dump($xml);

You'd need to do this for each element which contains character data.
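A hypothetical helper generalizing this to a list of tag names (it assumes the tags carry no attributes, as in the feed discussed here, and that they don't already contain CDATA sections):

```php
<?php
// Wrap the contents of every listed element in CDATA so embedded HTML
// and undefined entities can't trip the XML parser.
function wrapInCdata($xml, array $tags) {
    foreach ($tags as $tag) {
        $xml = str_replace("<$tag>", "<$tag><![CDATA[", $xml);
        $xml = str_replace("</$tag>", "]]></$tag>", $xml);
    }
    return $xml;
}

$raw = '<item><title>a &nbsp; b</title><description>c<br>d</description></item>';
$xml = new SimpleXMLElement(wrapInCdata($raw, array('title', 'description')));
echo (string)$xml->title; // a &nbsp; b
```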

meder
updated solution because I did the wrong replacement :p
meder
He he thanks, +1, but I think mjv beat ya to it :)
fudgey
yah, took a break to watch some anime. it's all good.
meder
LOL yeah, I was watching Adult Swim too LOL... so which is better, preg_replace or str_replace, or does it even matter?
fudgey
actually str_replace would probably be more efficient since there aren't any real patterns; preg_replace was just the first thing I thought of.
meder
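A quick sanity check of that point: both calls below perform a literal replacement, so their output is identical; str_replace simply skips compiling and running a regex, which is why it tends to be cheaper.

```php
<?php
// Both replacements are literal here, so the results are identical;
// str_replace just avoids the regex machinery.
$in = '<description>x</description>';
$a  = str_replace('<description>', '<description><![CDATA[', $in);
$b  = preg_replace('/<description>/', '<description><![CDATA[', $in);
var_dump($a === $b); // bool(true)
```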
+3  A: 

You may need to introduce a pre-parsing step which would add

<![CDATA[

after each <description> tag, and add

]]>

before each </description> tag.
Specifically, (see meder's response for corresponding PHP snippet)

<description>blah <br />&nbsp; blah, blah...</description>

should become

<description><![CDATA[blah <br />&nbsp; blah, blah...]]></description>

In this fashion, the complete content of the 'description' element is 'escaped', so that any HTML (or even XHTML) construct found in this element that might throw off the XML parsing logic is ignored. This takes care of the &nbsp; problem you mention, but also of many other common issues.
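A sketch of that escaping effect (the libxml_use_internal_errors call just keeps the failing parse quiet): the raw element fails to parse because of the undefined &nbsp; entity, while the CDATA-wrapped version parses fine and keeps the markup as plain text.

```php
<?php
// Collect parse errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

$bad  = '<r><description>blah <br />&nbsp; blah</description></r>';
$good = '<r><description><![CDATA[blah <br />&nbsp; blah]]></description></r>';

var_dump(simplexml_load_string($bad));  // bool(false): &nbsp; is undefined
$xml = simplexml_load_string($good);    // succeeds
echo (string)$xml->description;         // blah <br />&nbsp; blah
```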

mjv
This worked perfectly!! Thanks!!
fudgey
+upvote, you thought of it before I did :)
meder
Glad it worked. Interestingly, it took me a while to get my response right, because I had to escape some "xml-like" characters in my text, lest they be handled in an undesirable fashion within SO's response display. ;-)
mjv
@meder: thanks to you, you got the PHP part; you seem much more fluent in this language than I am. Teamwork!
mjv