I have a few simple questions, because I got confused reading all the different responses.

1) Say I have an XML document with the prolog <?xml version="1.0" encoding="utf-8" ?> and I'm going to unmarshal it in Java (for example with JAXB). I suppose that I can't put CROSS OF LORRAINE (http://www.fileformat.info/info/unicode/char/2628/index.htm) inside it directly, but I can put "\u2628". Is that correct?

2) I've also heard that UTF-8 doesn't contain it, but as far as I understand anything in Unicode can be encoded as UTF-8 (or UTF-16), and that page gives this example:

UTF-8 (hex) 0xE2 0x98 0xA8 (e298a8)

Is my reasoning correct? Can I use this form and put it in an XML document with UTF-8 encoding?

+1  A: 

It should be absolutely fine - UTF-8 can encode any Unicode character.

XML 1.0 has some restrictions around control characters (U+0000 to U+001F, of which only tab, line feed and carriage return are allowed), but U+2628 should be fine.

(Personally I prefer to go to unicode.org for definitive code charts, but U+2628 definitely appears here.)

You shouldn't need to worry about the UTF-8 side of things - you should be able to put the character in your data directly, and let JAXB do the encoding.
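
As a quick sanity check of the UTF-8 bytes quoted in the question, here is a minimal sketch using only the standard library. The class and method names (`CrossOfLorraineBytes`, `utf8Hex`) are made up for this illustration and have nothing to do with JAXB.

```java
import java.nio.charset.StandardCharsets;

public class CrossOfLorraineBytes {
    // Returns the UTF-8 bytes of the given text as lowercase hex.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // In Java source this escape denotes the single character U+2628.
        String cross = "\u2628";
        System.out.println(utf8Hex(cross)); // prints e298a8
    }
}
```

The output matches the 0xE2 0x98 0xA8 sequence from the question, so the character is representable in UTF-8 like any other Unicode code point.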

Jon Skeet
The restricted characters in XML are a royal pain, but I have a Regex for that: `[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]`
Toby
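For what it's worth, that character class could be used from Java roughly like this. It is only a sketch: the pattern is copied from the comment above and not checked against the XML specification, and the names `XmlRestrictedChars` and `strip` are hypothetical.

```java
import java.util.regex.Pattern;

public class XmlRestrictedChars {
    // Character class quoted from the comment above, with backslashes doubled for Java.
    static final Pattern RESTRICTED =
        Pattern.compile("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f\\x7f-\\x84\\x86-\\x9f]");

    // Removes every character matched by the pattern and keeps everything else.
    static String strip(String input) {
        return RESTRICTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+0001 is removed; the tab and U+2628 are kept.
        String dirty = "a\u0001b\tc\u2628";
        System.out.println(strip(dirty).equals("ab\tc\u2628")); // prints true
    }
}
```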
OK, so I can send one of these: \u2628 or &#9768; in the XML file and JAXB should create an object from this XML, right?
Sergio Morieca
@Sergio The \u version won't work - that's a Java escape code, effectively. I believe the &#...; version should work, but do you have any reason not to just include the character itself?
Jon Skeet
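To illustrate the difference between the two forms, here is a small sketch that uses the JDK's built-in DOM parser rather than JAXB (an assumption made so that it runs without extra dependencies; the text handling should be the same underneath). The helper `rootText` and the `<root>` element are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class EscapeForms {
    // Wraps the body in a tiny UTF-8 document, parses it, and returns the root element's text.
    static String rootText(String body) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><root>" + body + "</root>";
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A numeric character reference is resolved by the XML parser to one character.
        System.out.println(rootText("&#9768;").length()); // prints 1
        // Decimal 9768 and hexadecimal 2628 name the same code point.
        System.out.println(rootText("&#x2628;").equals(rootText("&#9768;"))); // prints true
        // A backslash escape means nothing to XML: it stays as six literal characters.
        System.out.println(rootText("\\u2628").length()); // prints 6
    }
}
```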
No, I don't. I was curious - I cannot try it now, that's the reason for asking. Thanks for help!
Sergio Morieca
A: 

If your prolog specifies UTF-8 encoding for the XML:

<?xml version="1.0" encoding="utf-8" ?>

then you can use the characters directly, encoded as UTF-8, or you can write them as numeric character references such as &#9768;

Eugene Kuleshov
Could you give me an example of using UTF-8 directly in that case?
Sergio Morieca
UTF-8 is a multi-byte encoding, so the symbol with code point 9768 (U+2628) can be written directly in the file, without a character reference.
Eugene Kuleshov
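A rough sketch of what "directly" means at the byte level: the document below is assembled by hand with the three bytes 0xE2 0x98 0xA8 from the question in the element content, then parsed with the JDK's DOM parser (used here instead of JAXB purely to keep the example dependency-free). `RawUtf8Bytes` and `parseWithRawBytes` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class RawUtf8Bytes {
    // Builds the document byte by byte and returns the parsed text of the root element.
    static String parseWithRawBytes() {
        byte[] head = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><root>"
            .getBytes(StandardCharsets.US_ASCII);
        byte[] cross = {(byte) 0xE2, (byte) 0x98, (byte) 0xA8}; // UTF-8 form of U+2628
        byte[] tail = "</root>".getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        doc.write(head, 0, head.length);
        doc.write(cross, 0, cross.length);
        doc.write(tail, 0, tail.length);

        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(doc.toByteArray()))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The three raw bytes decode to the single character U+2628.
        System.out.println(parseWithRawBytes().equals("\u2628")); // prints true
    }
}
```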
A: 

One more addition:

Just specifying the encoding in the prolog is not sufficient. You also need to make sure the content is actually serialized using that encoding.

Pangea
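A minimal sketch of that pitfall, assuming the document is held in a Java `String` and turned into bytes by hand. The names `SerializeWithEncoding` and `survives` are illustrative only. Writing the bytes as ISO-8859-1 while the prolog claims utf-8 silently loses the character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SerializeWithEncoding {
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"utf-8\" ?><root>\u2628</root>";

    // Serializes the document with the given charset, then reads it back as UTF-8
    // (which is what the prolog promises) and reports whether U+2628 is still there.
    static boolean survives(Charset writtenAs) {
        byte[] bytes = XML.getBytes(writtenAs);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.contains("\u2628");
    }

    public static void main(String[] args) {
        System.out.println(survives(StandardCharsets.UTF_8));      // prints true
        // ISO-8859-1 cannot represent U+2628, so it is replaced during encoding.
        System.out.println(survives(StandardCharsets.ISO_8859_1)); // prints false
    }
}
```

The same concern applies when writing through a `Writer` or an output stream: the charset actually used for the bytes has to agree with the one declared in the prolog.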