I'm trying to parse an XML string that contains the characters &, < and > in its text data. Normally those characters would be escaped as entities (&amp;, &lt;, &gt;), but in my case they aren't, so I get the following warnings:

Warning: DOMDocument::loadXML() [function.loadXML]: error parsing attribute name in Entity ...
Warning: DOMDocument::loadXML() [function.loadXML]: Couldn't find end of Start Tag ...

I can use str_replace to encode all the & characters, but if I do the same with < or >, I also encode the ones that belong to valid XML tags.

Does anyone know a workaround for this problem?

Thank you!

A: 

I often put @ in front of calls to load() on DOMDocument, mainly because you can never be absolutely sure that what you load is what you expected.

Using @ suppresses the warnings. It does not repair the markup, so the load can still fail.

@$dom->loadXml($myXml);
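
As an alternative to @, libxml can collect the parse errors instead of emitting warnings, so you can still inspect them. This is only a minimal sketch: the sample string in $myXml and the $messages array are made up for illustration.

```php
<?php
// Hypothetical sample input: & and < appear unescaped in the text.
$myXml = '<blah>x & y < 3</blah>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadXML($myXml); // false when the string is not well-formed

// Gather the collected errors so they can be logged or shown.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Checking $loaded tells you whether parsing succeeded, and $messages keeps the details that @ would throw away.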
jakenoble
+4  A: 

If you have a bare < inside the text of an XML document, it is not well-formed XML. Try to encode it as &lt;, or enclose the text in a <![CDATA[ ... ]]> section.

If that's not possible (because you aren't the one producing this "XML"), I'd suggest trying an HTML parsing library (I haven't used them, but they exist), because they are less strict than XML parsers.

But I'd really try to get valid XML before trying anything else!

helios
Thanks for the tip. I'll first see if it's possible to change the incoming XML flow, and if not, I'll try out the HTML parser...
nikola
A: 

Put all your text inside CDATA sections?

<!-- Old -->
<blah>
    x & y < 3
</blah>

<!-- New -->
<blah><![CDATA[
    x & y < 3
]]></blah>
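
If you control the code that writes the XML, a rough sketch of the same idea with DOMDocument could look like this. The element name blah and the sample text come from the example above; the variable names are made up.

```php
<?php
// Raw text containing characters that are special in XML.
$raw = 'x & y < 3';

// Writing: let DOMDocument wrap the text in a CDATA section.
$doc = new DOMDocument();
$el = $doc->createElement('blah');
$el->appendChild($doc->createCDATASection($raw));
$doc->appendChild($el);
$out = $doc->saveXML($el); // serialize only the <blah> element

// Reading: the CDATA content comes back as plain text, unescaped.
$dom = new DOMDocument();
$parsed = $dom->loadXML($out);
$text = $dom->documentElement->textContent;
```

Here $out is <blah><![CDATA[x & y < 3]]></blah> and $text is the original string again. Note that a single CDATA section cannot contain the sequence ]]>, so text containing it needs different handling.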
nickf
A: 

"I can use the str_replace to encode all the &, but if I do that with < or > I'm doing it for valid XML tags too."

As a strictly temporary fixup measure, you can replace the ones that aren't part of what looks like a tag or entity reference, e.g.:

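// Escape any < that does not look like the start of a tag, closing tag, comment or PI.
$str = preg_replace('~<(?![a-zA-Z_!?/])~', '&lt;', $str);
// Escape any & that is not already the start of an entity or character reference.
$str = preg_replace('~&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)~', '&amp;', $str);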

However, this isn't watertight, and in the longer term you need to fix whatever is generating this bogus markup, or shout at the person who needs to fix it until they get a clue. Grossly non-well-formed XML like this is simply not XML by definition.
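
To see the fixup end to end, here is a small sketch that wraps the two replacements in a function and feeds the result to loadXML. The function name escapeStrayMarkup and the sample string are made up. The patterns use ~ as the regex delimiter and also allow / after <, so closing tags are left alone; both choices are assumptions of this sketch.

```php
<?php
// Hypothetical helper: escapes < and & that do not look like markup.
function escapeStrayMarkup(string $str): string
{
    // A < not followed by a name character, !, ? or / is treated as text.
    $str = preg_replace('~<(?![a-zA-Z_!?/])~', '&lt;', $str);
    // A & that does not start an entity or character reference is treated as text.
    $str = preg_replace('~&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)~', '&amp;', $str);
    return $str;
}

// Sample input mixing stray characters with an already-escaped entity.
$broken = '<blah>x & y < 3 &amp; z</blah>';
$fixed = escapeStrayMarkup($broken);

$dom = new DOMDocument();
$ok = $dom->loadXML($fixed);
$text = $dom->documentElement->textContent;
```

With this input, $fixed is <blah>x &amp; y &lt; 3 &amp; z</blah> and $text is "x & y < 3 & z". Text such as "a <b" would still be mistaken for a tag, which is one reason the approach isn't watertight.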

bobince