I have the following XML code.

<firstname>
 <default length="6">Örwin</default>
 <short>Örwin</short>
 <shorter>Örwin</shorter>
 <shortest>�.</shortest>
</firstname>

Why does the content of the "shortest" node break? It should be a simple "Ö" instead of the tedious �. The XML is UTF-8 encoded, and the function that processes the output of that node also writes the content of "short" and "shorter", where the "Ö" is clearly visible.

+9  A: 

My guess is that the XML isn't properly UTF-8 encoded. Please show the bytes within the <shortest> element in the raw file... I suspect you'll find they're not a validly encoded character. If you could show a short but complete program which generates this XML from valid input, that would be very helpful. (Preferably saying which platform it is, too :)

EDIT: Something very odd is going on in this file. Here are the hex values for the "shorter" and "shortest" values:

Shorter: C3 96 72 77 69 63

Shortest: EF BF BD 2E

Now "C3 96" is the valid UTF-8 encoding for U+00D6 which is "Latin capital letter O with diaeresis" as you want.

However, EF BF BD is the UTF-8 encoding for U+FFFD, the "replacement character" that decoders substitute for input they cannot decode - definitely not what you want. (The 2E is just the ASCII dot.)

So, this is actually valid UTF-8 - but it doesn't contain the characters you want. Again, you should examine what created the file...
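
To check byte sequences like these yourself, here is a minimal Python sketch. It is only an illustration: the helper name `describe` is made up for this example, and the byte values are copied from the hex dump above. It decodes each sequence as UTF-8 and looks up the Unicode name of every resulting code point.

```python
import unicodedata

# Byte values copied from the hex dump above.
o_umlaut = bytes.fromhex("C3 96")        # first two bytes of the "shorter" value
shortest = bytes.fromhex("EF BF BD 2E")  # the whole "shortest" value


def describe(raw):
    """Decode UTF-8 bytes strictly and name each resulting code point."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]


for label, raw in (("C3 96", o_umlaut), ("EF BF BD 2E", shortest)):
    print(label, "->", describe(raw))
```

Both sequences decode without an error, which is the point: the "shortest" bytes are well-formed UTF-8, they just spell a replacement character followed by a full stop.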

Jon Skeet
You took typing lessons in school didn't you? :)
Kevin
You are toooooooooooooo fast
rahul
Hi Jon, here's the file (saved from Firefox): http://clipboard.i8network.de/ged2xml.xml The XML is generated by PHP's SimpleXML in a Linux environment.
individual8
Or better: save the XML as UTF-16... Then those strange characters won't be any problem. :-)
Workshop Alex
@Alex, you should probably read http://www.joelonsoftware.com/articles/Unicode.html . UTF-16 is not really a solution here: it encodes the same Unicode characters as UTF-8, so switching encodings won't repair data that is already wrong. As Chuck Norris... I mean Jon Skeet.. pointed out, the problem likely lies with what is generating the XML.
Jonathan Fingland
Just to add: I'm inclined to suspect that the problem is a non-multibyte-aware way of getting the first character. Getting the first _byte_ of a multi-byte character string is not the same as getting the first character of said string.
Jonathan Fingland
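
A small Python sketch of the byte-versus-character difference described in the comment above. This is a guess at the failure mode, not the asker's actual code: it assumes the producer cut the name after one byte and that some later step (the serializer, or the browser that saved the file) replaced the dangling lead byte with U+FFFD. The variable names are hypothetical.

```python
name = "\u00D6rwin"        # "Örwin" with a precomposed U+00D6
raw = name.encode("utf-8")  # b'\xc3\x96rwin'


def hex_bytes(data):
    """Format bytes as space-separated uppercase hex."""
    return " ".join("%02X" % b for b in data)


# Byte-wise truncation keeps only 0xC3, the lead byte of a two-byte sequence.
first_byte = raw[:1]
# A lenient decoder substitutes U+FFFD for the incomplete sequence.
broken = first_byte.decode("utf-8", errors="replace") + "."

# Character-wise truncation keeps the whole code point.
correct = name[:1] + "."

print(hex_bytes(broken.encode("utf-8")))   # EF BF BD 2E
print(hex_bytes(correct.encode("utf-8")))  # C3 96 2E
```

The first line of output matches the bytes found in the "shortest" element. In PHP, `substr` counts bytes while `mb_substr` counts characters, so that is one place worth checking in the generating script.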
I just downloaded and checked your XML in a hex editor, and indeed the <shortest> element contains the garbage byte sequence 0xEF 0xBF 0xBD 0x2E. The problem is obviously in the producer.
Lars Haugseth
@Lars: That's not actually garbage as such - it's valid UTF-8, but not the desired character data.
Jon Skeet
A: 

An XML parser parses the content between the tags, because any element can contain nested elements. Thus your "Ö" might break the parsing.

Put your data inside a CDATA section, for example: http://www.w3schools.com/XML/xml_cdata.asp

rasjani
I thought of that already. But then why do the other umlauts not break?
individual8
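
A quick way to test that objection, sketched in Python with the standard library's ElementTree (the sample document is a cut-down, hand-written version of the XML in the question, not the real file): a correctly encoded "Ö" in ordinary element text parses without any CDATA section.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modelled on the question's XML, encoded as UTF-8.
doc = (
    "<firstname>"
    "<short>\u00D6rwin</short>"
    "<shortest>\u00D6.</shortest>"
    "</firstname>"
).encode("utf-8")

root = ET.fromstring(doc)  # no encoding declaration, so UTF-8 is assumed
short_text = root.findtext("short")
shortest_text = root.findtext("shortest")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(short_text), ascii(shortest_text))
```

If the parser accepts this, the umlaut itself is not what breaks "shortest"; the bytes written into that element are.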