ansaurus

Question

Answer 1

+9 A:

My guess is that the XML isn't properly UTF-8 encoded. Please show the bytes within the <shortest> element in the raw file... I suspect you'll find they're not a validly encoded character. If you could show a short but complete program which generates this XML from valid input, that would be very helpful. (Preferably saying which platform it is, too :)

EDIT: Something very odd is going on in this file. Here are the hex values for the "shorter" and "shortest" values:

Shorter: C3 96 72 77 69 63

Shortest: EF BF BD 2E

Now "C3 96" is the valid UTF-8 encoding for U+00D6 which is "Latin capital letter O with diaeresis" as you want.

However, EF BF BD is the UTF-8 encoding for U+FFFD, which is the "replacement character" that decoders emit in place of data they could not decode - definitely not what you want. (The 2E is just the ASCII dot.)

So, this is actually valid UTF-8 - but it doesn't contain the characters you want. Again, you should examine what created the file...

Jon Skeet 2009-06-24 12:31:33
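To check the hex values quoted in the answer, here is a minimal Python sketch that decodes both byte sequences as strict UTF-8 and names each resulting code point. The helper name `describe` is made up for illustration. The byte values are the ones listed above.

```python
import unicodedata

# Hex values copied from the answer above.
shorter = bytes.fromhex("C3 96 72 77 69 63")
shortest = bytes.fromhex("EF BF BD 2E")


def describe(raw: bytes) -> list[str]:
    """Decode raw as strict UTF-8 and name every code point (hypothetical helper)."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for label, raw in (("Shorter", shorter), ("Shortest", shortest)):
        print(label, describe(raw))
```

Both sequences decode without an exception, so both are valid UTF-8. The first starts with U+00D6, while the second is U+FFFD followed by a full stop.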

You took typing lessons in school didn't you? :)

Kevin 2009-06-24 12:33:43

You are toooooooooooooo fast

rahul 2009-06-24 12:34:02

Hi Jon, here's the file (saved from Firefox): http://clipboard.i8network.de/ged2xml.xml
The XML is generated by PHP's SimpleXML on a Linux environment.

individual8 2009-06-24 12:50:21

Or better: save the XML as UTF-16... Then those strange characters won't be any problem. :-)

Workshop Alex 2009-06-24 12:51:30

@Alex, you should probably read http://www.joelonsoftware.com/articles/Unicode.html . UTF-16 is not really a good solution here (doesn't cover all language possibilities covered by utf-8). As Chuck Norris... I mean Jon Skeet.. pointed out, the problem likely lies with what is generating the xml.

Jonathan Fingland 2009-06-24 12:56:57

Just to add, I'm inclined to suspect that the problem is a non-multibyte-aware way of getting the first character. Getting the first _byte_ of a multi-byte character string is not the same as getting the first character of said string.

Jonathan Fingland 2009-06-24 13:01:05
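The byte-versus-character distinction in the comment above can be sketched in Python. This is only one plausible way the bytes EF BF BD 2E could come about; the actual PHP code that produced the file is not shown in this thread, so the truncate-then-lenient-decode sequence below is an assumption, not a diagnosis.

```python
name = "Örwic"                    # the "shorter" value from the hex dump
encoded = name.encode("utf-8")    # b'\xc3\x96rwic': 'Ö' takes two bytes

first_char = name[0]              # 'Ö', a whole character
first_byte = encoded[:1]          # b'\xc3', only the lead byte of 'Ö'

# A lone lead byte is not valid UTF-8. A lenient decoder replaces it
# with U+FFFD instead of failing (assumed behaviour of the producer).
abbreviated = first_byte.decode("utf-8", errors="replace") + "."
shortest = abbreviated.encode("utf-8")

if __name__ == "__main__":
    print(first_char.encode("utf-8").hex(" "))
    print(shortest.hex(" "))
```

Under these assumptions the character-based abbreviation keeps C3 96, while the byte-based one ends up as the same four bytes found in the file.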

I just downloaded and checked your XML in a hex editor, and indeed the <shortest> element contains the garbage byte sequence 0xEF 0xBF 0xBD 0x2E. The problem is obviously in the producer.

Lars Haugseth 2009-06-24 13:20:24

@Lars: That's not actually garbage as such - it's valid UTF-8, but not the desired character data.

Jon Skeet 2009-06-24 13:29:45

Answer 2

A:

An XML parser parses the content inside the tags, since any element can contain nested elements. Thus your "ö" might break the parsing.

Put your data inside a CDATA section; for an example, see http://www.w3schools.com/XML/xml_cdata.asp

rasjani 2009-06-24 12:35:40

I thought of that already. But then why do the other umlauts not break?

individual8 2009-06-24 12:51:15
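On the question of why the other umlauts do not break: a small Python sketch suggests that correctly encoded non-ASCII text parses without CDATA, and that only bytes which are not valid UTF-8 are rejected. The `<name>` wrapper element and the `try_parse` helper are made up for illustration. The parser is Python's standard `xml.etree.ElementTree`, not the reader from the original question.

```python
import xml.etree.ElementTree as ET

# Hypothetical <name> wrapper; the inner element names follow the thread.
good = "<name><shorter>Örwic</shorter></name>".encode("utf-8")
odd = b"<name><shortest>\xef\xbf\xbd.</shortest></name>"   # U+FFFD, valid UTF-8
broken = b"<name><shortest>\xc3.</shortest></name>"        # lone lead byte


def try_parse(raw: bytes):
    """Return the text of the first child element, or None if parsing fails."""
    try:
        return ET.fromstring(raw)[0].text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    for label, raw in (("good", good), ("odd", odd), ("broken", broken)):
        print(label, repr(try_parse(raw)))
```

In this sketch the first two documents parse and only the third fails, which matches the point made in the first answer: the bytes in the file are valid UTF-8, just not the intended character.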

What causes my XML to break?
