I'm starting out with some XML that looks like this (simplified):

<?xml version="1.0" encoding="UTF-8"?>
<alldata>
   <data name="Forsetì" />
</alldata>

But after I've parsed it with simplexml_load_string, the special character (the i) becomes Ã¬, which is obviously pretty mangled.

Is there a way to prevent this from happening?

I know for a fact the XML is fine: when saved as .txt and viewed in the browser, the characters are fine. When I use simplexml_load_string on the XML and then save the values to a text file or to the database, they're mangled.

A: 

It's very likely that the XML is fine, but the character gets mangled when stored or output.
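One way to check that claim is to look at the raw bytes SimpleXML hands back. This is a minimal sketch, assuming the document is declared as UTF-8. The helper name `attributeHex` is made up for illustration, and the ì is written as the escape `\xC3\xAC` so the example does not depend on how the script file itself is saved:

```php
<?php
// Hypothetical helper: parse the XML and return the bytes of the first
// <data> element's "name" attribute as a hex string.
function attributeHex(string $xml): string
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException('XML could not be parsed');
    }
    // Casting a SimpleXML attribute to string yields UTF-8 bytes.
    return bin2hex((string) $doc->data['name']);
}

$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<alldata>\n"
     . "   <data name=\"Forset\xC3\xAC\" />\n"
     . "</alldata>";

echo attributeHex($xml), "\n"; // the last two bytes, c3 ac, are ì in UTF-8
```

If the output ends in c3ac, the parser did its job and the damage happens at a later step.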

If you're outputting data on an HTML page: make sure the page is encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.

If you're storing the data in MySQL, you need to have UTF-8 selected as the encoding all the way through: as the connection's encoding, in the table, and in the column(s) you insert the data into.
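For the HTML case, a sketch of what "encoded in UTF-8 as well" could look like in practice. The function name `renderPage` and the page skeleton are assumptions for illustration, not anything from the question:

```php
<?php
// Hypothetical helper: build a small HTML page that declares UTF-8.
function renderPage(string $utf8Text): string
{
    // Escape markup characters only; ì passes through unchanged.
    $safe = htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n"
         . "<html><head><meta charset=\"UTF-8\"></head>\n"
         . "<body><p>{$safe}</p></body></html>\n";
}

// Tell the browser which encoding the bytes are in. Skipped on the
// command line, where there are no HTTP headers to send.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo renderPage("Forset\xC3\xAC");
```

Declaring the charset in both the HTTP header and the meta tag is redundant on purpose: the header wins when the page is served, and the meta tag still applies if the page is saved to disk.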

Pekka
I know for a fact the XML is fine: when saved as .txt and viewed in the browser, the characters are fine. When I use simplexml_load_string on the XML and then save the values to a text file or to the database, they're mangled.
Stomped
A: 

I've also had some problems with this, and they came from the encoding of the PHP script file itself. Make sure it's saved as UTF-8. If it's still not good, try printing the variable using utf8_encode or utf8_decode.
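Before reaching for a conversion function, it may help to find out which encoding a string is actually in. A rough sketch, where `isValidUtf8` is a made-up helper name; it relies on preg_match() refusing subjects that are not well-formed UTF-8 when the /u modifier is set:

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
// With the /u modifier, preg_match() returns false on invalid UTF-8 subjects.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$fromUtf8Script   = "Forset\xC3\xAC"; // ì as saved by a UTF-8 editor
$fromLatin1Script = "Forset\xEC";     // ì as saved by an ISO-8859-1 editor

var_dump(isValidUtf8($fromUtf8Script));   // bool(true)
var_dump(isValidUtf8($fromLatin1Script)); // bool(false)
```

A string that already passes this check should not be run through utf8_encode again, since that would encode it a second time.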

Daan
A: 

XML is strict when it comes to entities: & should be &amp; and ì should be &igrave;

So you will need a translation table.

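function xml_entity_decode($_string) {
    // Set up XML translation table: numeric entity (e.g. &#236;) => character
    $_xml = array();
    $_xl8 = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8');
    // each() was removed in PHP 8, so iterate with foreach instead
    foreach (array_keys($_xl8) as $_key) {
        // mb_ord() (mbstring extension) returns the full code point;
        // ord() would only look at the first byte of a UTF-8 character
        $_xml['&#' . mb_ord($_key, 'UTF-8') . ';'] = $_key;
    }
    return strtr($_string, $_xml);
}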
stillstanding
The only characters that are *required* to be replaced with entities in XML are the basic five markup characters: ampersand, apostrophe, quotation mark, and the angle brackets. Others may need to be replaced if the document's encoding doesn't support them, but that's not an issue with UTF-8.
Alan Moore
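To illustrate the point in the comment above: in a UTF-8 document only the five markup characters need escaping, and PHP's htmlspecialchars() can do that when given the ENT_XML1 flag. A small sketch; the wrapper name `xmlEscape` is invented for this example:

```php
<?php
// Hypothetical helper: escape a UTF-8 string for use in XML text or
// attribute values. ENT_XML1 makes the apostrophe come out as &apos;.
function xmlEscape(string $s): string
{
    return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo xmlEscape("Forset\xC3\xAC & <'co'>"), "\n";
```

The ì is left alone, while &, <, > and the quotes are replaced with their predefined XML entities.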
+1  A: 

It looks like SimpleXML is creating a UTF-8 string, which is then rendered as ISO-8859-1 (latin-1) or something close to it, like CP-1252.

When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.

Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, e.g. ISO-8859-1 (latin-1).

This is easily done with iconv():

    $xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
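iconv() returns false when the conversion fails, so it can be worth wrapping the call. The sketch below uses a made-up helper name, `utf8ToLatin1`, and deliberately leaves out //TRANSLIT so that characters with no ISO-8859-1 equivalent are reported instead of approximated; how //TRANSLIT behaves depends on the iconv library PHP was built against:

```php
<?php
// Hypothetical helper: convert UTF-8 to ISO-8859-1, failing loudly.
function utf8ToLatin1(string $utf8): string
{
    $out = iconv('UTF-8', 'ISO-8859-1', $utf8);
    if ($out === false) {
        throw new RuntimeException('String cannot be represented in ISO-8859-1');
    }
    return $out;
}

echo bin2hex(utf8ToLatin1("Forset\xC3\xAC")), "\n"; // ì is now the single byte ec
```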

Saving to database
Your database column is apparently not using a UTF-8 character set, so you should use iconv to convert the string to the character set that your database uses.

Assuming your database character set is the same as the encoding that you render in, you will not have to do anything when reading from the database.

Explanation
In UTF-8, the characters of the "Latin-1 Supplement" block (U+0080 to U+00FF) are encoded as two bytes. A 0xC2 lead byte covers the lower half (U+0080 to U+00BF), which includes the non-breaking space, currency symbols, fractions, superscript 2 and 3, and the copyright and registered trademark symbols. A 0xC3 lead byte covers the upper half (U+00C0 to U+00FF), which holds the accented letters; ì (U+00EC) is encoded as 0xC3 0xAC.

However, in ISO-8859-1 the byte 0xC2 represents Â and the byte 0xC3 represents Ã. So when your UTF-8 string is misinterpreted as ISO-8859-1 or CP-1252, you get Â or Ã followed by some other nonsense character; ì shows up as Ã¬.
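The misreading can be reproduced on purpose, which is a handy way to recognise it. This sketch assumes the iconv extension is available; `misreadAsLatin1` is a hypothetical name, not a PHP function:

```php
<?php
// Hypothetical helper: show what a UTF-8 string looks like when its bytes
// are wrongly treated as ISO-8859-1 and then converted to UTF-8 for display.
function misreadAsLatin1(string $utf8): string
{
    return iconv('ISO-8859-1', 'UTF-8', $utf8);
}

echo misreadAsLatin1("\xC3\xAC"), "\n"; // input is ì (U+00EC), prints Ã¬
echo misreadAsLatin1("\xC2\xA9"), "\n"; // input is © (U+00A9), prints Â©
```

Each byte of the original two-byte sequence is turned into its own character, which is why one accented letter becomes two symbols.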

Lachlan Roche