views:

1056

answers:

4

I'm trying to parse an XML file, but when loading it SimpleXML prints the following warning:

Warning: simplexml_load_file() [function.simplexml-load-file]: gpr_545.xml:55: parser error : Entity 'Oslash' not defined in import.php on line 35

This is that line:

<forenames>B&Oslash;IE</forenames><x> </x>

As it is a warning, I might ignore it, but I'd like to understand what is happening.

+2  A: 

I think this is an encoding problem. PHP, or SimpleXML in this particular case, does not like the Danish Ø you've got in that forenames tag. You could try re-encoding the whole file as UTF-8 and, in the process, replacing the escaped version in the tag with the actual character. Afterwards you can read a file that is free of escaped characters into SimpleXML.

KB22
not sure what you mean. This xml file is encoded as ISO-8859-1 (<?xml version="1.0" encoding="iso-8859-1"?>).
Maarten
Right: use utf-8 instead of iso-8859-1
Nerdling
yepp, and make use of utf8_encode() for the actual encoding of the text.
KB22
that'd make sense if I were the author, but I'm on the parsing end so to say ;-)
Maarten
You've got the file, so you can read it line by line and encode it, can't you? I happened to write an XML filter application once for a Japanese customer. And believe me, doing this extra step before the actual parsing paid off... ;)
KB22
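
A minimal sketch of the re-encoding step suggested in the comments above, assuming the input really is ISO-8859-1 as its declaration says. The helper name `latin1XmlToUtf8` is hypothetical, and it uses `iconv()` rather than the `utf8_encode()` mentioned above. Note that re-encoding only changes how literal characters are stored; a named entity such as `&Oslash;` stays in the text as it is.

```php
<?php
// Hypothetical helper: convert an ISO-8859-1 XML string to UTF-8 and
// update the encoding named in the XML declaration to match.
function latin1XmlToUtf8(string $xml): string
{
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $xml);
    if ($utf8 === false) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }

    // Rewrite only the first declaration, whatever quote style it uses.
    return preg_replace(
        '/^(<\?xml[^>]*encoding=["\'])iso-8859-1(["\'])/i',
        '${1}UTF-8${2}',
        $utf8,
        1
    ) ?? $utf8;
}

// Assumed sample input: a literal Latin-1 "Ø" (byte 0xD8), not an entity.
$raw = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<forenames>B\xD8IE</forenames>";
echo latin1XmlToUtf8($raw), "\n";
```
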
+1  A: 

HTML encoding of Latin-1 characters (like Ø, which is what that entity stands for) is what has broken the XML parser. If you're in control of the data, you need to escape it using XML-style numeric character references (Ø just happens to be &#216;).

squeeks
thanks. So this is a broken XML file actually?
Maarten
Yes, unforgiving XML parsers break when they are expecting XML-style encoding of non-ASCII characters and are given HTML-style encoding instead.
squeeks
ok. So I'm just parsing this. I looked at the table from Björn's answer, and it works for my first example, but the next problem is this entity which is not in that table: . Is there a more stable solution?
Maarten
XSLT transforming the document before you pass it off to an XML parser would be one solution.
squeeks
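
If you are only on the parsing end, the numeric-reference fix described in this answer can be applied to the raw text before it reaches SimpleXML. The following is a rough sketch under a few assumptions: the helper name `htmlEntitiesToNumeric` is made up for illustration, the mbstring extension is available for `mb_ord()` (PHP 7.2+), and only the HTML 4.01 entity names known to `html_entity_decode()` are handled. Names it does not recognise are left alone, so the parser will still report them.

```php
<?php
// Hypothetical helper: rewrite HTML named entities as numeric character
// references, which an XML parser accepts without any DTD.
function htmlEntitiesToNumeric(string $xml): string
{
    // The five entities predefined by XML itself must stay untouched.
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0];
            }
            // Decode the single entity to its UTF-8 character.
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // unknown name: leave it for the parser to report
            }
            return '&#' . mb_ord($char, 'UTF-8') . ';';
        },
        $xml
    ) ?? $xml;
}

$line  = '<forenames>B&Oslash;IE</forenames><x> </x>';
$fixed = htmlEntitiesToNumeric($line);
echo $fixed, "\n";

// The rewritten line can now be parsed; <row> is just a wrapper element
// added here so the fragment has a single root.
$row = simplexml_load_string('<row>' . $fixed . '</row>');
echo $row->forenames, "\n";
```

One caveat with this approach: the regular expression does not know about CDATA sections or comments, so entity-like text inside those would be rewritten too.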
+3  A: 
Björn
Thanks so much for the table Björn, saved my ass!
FFish
A: 

Try to use this line:

<forenames><![CDATA[B&Oslash;IE]]></forenames><x> </x>

and read this about CDATA

lg
ok, but this is not my XML, I'm just parsing it.
Maarten
Before parsing, you should wrap every piece of text that contains an entity for a "strange" character in a CDATA section.
lg
If it's got this error in it, then it's not valid XML to begin with. It's up to you whether to tell the original authors to fix it, or to do this sort of check prior to parsing and wrap the invalid chunks.
Nerdling
just send them an email to discuss this indeed..
Maarten
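
For the "check prior to parsing" idea in the comments above, one possible sketch is to let libxml collect its errors instead of printing warnings, so the script can decide what to do with a broken file. The helper name `xmlParseErrors` is hypothetical; the libxml functions it calls are the standard PHP ones.

```php
<?php
// Hypothetical helper: return libxml's error messages for an XML string,
// or an empty array when the string parses cleanly.
function xmlParseErrors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$errors = xmlParseErrors('<forenames>B&Oslash;IE</forenames>');
foreach ($errors as $message) {
    echo $message, "\n";
}
```

An empty result means the file can be handed to `simplexml_load_file()` as usual; a non-empty one gives you concrete messages to include in that email to the authors.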