Hi,

I have an XML document that may contain Shift-JIS-encoded data, and I'm trying to parse it with an NSXMLParser object.

Ordinarily I assume the document is UTF-8 encoded and all is well. Does anyone know if/how I can determine whether an element is Shift-JIS encoded, and then how to decode it?

Thanks

+1  A: 

An XML document with no byte order mark and no encoding supplied by the transport is UTF-8 encoded unless it has an XML declaration stating otherwise, for example:

<?xml version="1.0" encoding="shift_jis"?>

or:

<?xml version="1.0" encoding="cp932"?>

Any XML parser should detect the encoding given in the XML declaration. (Some parsers don't support every CJK codec and will complain, but as I understand it NSXMLParser should be fine.)
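If you want to check what a document claims before handing it to the parser, you can sniff the declaration yourself. This is only a rough sketch in Python, not anything NSXMLParser does or exposes; the helper name `declared_encoding` is made up for illustration, and it assumes the document is in an ASCII-compatible encoding (so it won't work for UTF-16).

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration that
# sits at the very start of the document. Assumes an ASCII-compatible
# encoding, so the declaration itself can be read as plain bytes.
_DECL_ENCODING = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _DECL_ENCODING.match(data)
    return match.group(1).decode("ascii") if match else None
```

A result of None means there is no declared encoding, so a conforming parser will treat the bytes as UTF-8 (or UTF-16 if there's a byte order mark).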

An encoding applies to the whole document, not to individual elements. If you've got a file with Shift-JIS byte sequences that does not have such a stated encoding, or which contains Shift-JIS byte sequences in some elements and UTF-8 in others, what you have is not well-formed; it's not an XML document at all and no conforming parser will read it.

If you've just got a missing encoding declaration, you really need to fix it at the source end. In the meantime, either hacking in a suitable XML declaration or manually transcoding the bytes from Shift-JIS to UTF-8 before feeding them into the parser should help.
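For what it's worth, the transcoding workaround might look something like this. It's a Python sketch of the idea rather than Cocoa code, and it assumes the whole file really is Shift-JIS; the function name `shift_jis_to_utf8_xml` is hypothetical, and decoding as `cp932` (the Windows superset of Shift-JIS) is my assumption about what the data most likely is.

```python
import re

# An XML declaration at the very start of the text. The \s after "xml"
# keeps this from matching other processing instructions such as
# <?xml-stylesheet ...?>.
_XML_DECL = re.compile(r'^<\?xml\s[^>]*\?>')


def shift_jis_to_utf8_xml(raw: bytes) -> bytes:
    """Transcode Shift-JIS XML bytes to UTF-8 with a matching declaration."""
    # Assumption: every byte in the file is Shift-JIS (cp932 flavour).
    # Raises UnicodeDecodeError if the bytes are not valid cp932.
    text = raw.decode("cp932")
    # Drop any existing declaration so it can't contradict the new bytes.
    text = _XML_DECL.sub("", text, count=1)
    return b'<?xml version="1.0" encoding="UTF-8"?>' + text.encode("utf-8")
```

The output can go straight into any XML parser as UTF-8. If the decode step raises, the file probably isn't all Shift-JIS and you're in the mixed-encoding case, which can't be fixed mechanically.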

bobince