I have an XML page with some elements in various languages: Arabic, English, Chinese, Japanese. Which encoding should I choose for that? If I try to render the XML with an XSL (using utf-8 or ISO-8859-6 or ISO-2022-JP), I get this error:

An invalid character was found in text content.

How shall I solve this?

Thanks.

+1  A: 

Where exactly is the error found? It sounds like the XML itself may have an invalid character (e.g. a control character between U+0000 and U+001F other than \r, \t and \n IIRC). You'd probably see this when loading the XML into any decent XML editor (or programmatically).
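As a rough illustration of the check described above, here is a minimal Python sketch that scans already-decoded text for characters XML 1.0 does not allow. The function name `find_invalid_xml_chars` is made up for this example, and the pattern only covers the C0 controls plus U+FFFE/U+FFFF; lone surrogates are not checked.

```python
import re

# Characters XML 1.0 forbids in text content: the C0 controls other than
# tab (U+0009), LF (U+000A) and CR (U+000D), plus U+FFFE and U+FFFF.
# Lone surrogates are also forbidden but are not checked in this sketch.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def find_invalid_xml_chars(text):
    """Return a list of (line, column, code point) for each forbidden character.

    Lines and columns are 1-based; lines are split on "\n" only.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            code_point = "U+%04X" % ord(match.group())
            hits.append((line_no, match.start() + 1, code_point))
    return hits
```

An empty list means none of the characters covered by the pattern were found; it does not prove the document is well-formed in other respects.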

Aside from that, UTF-8 is generally a good choice of encoding - it's less efficient than UTF-16 for Far East characters, mind you. Both UTF-16 and UTF-8 allow all Unicode characters to be represented (using surrogate pairs in UTF-16 for characters outside the basic multilingual plane).
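To make the size trade-off concrete, a small Python sketch can compare byte counts for the same text in both encodings. The helper name `encoded_sizes` is hypothetical, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in UTF-8 and UTF-16."""
    # "utf-16-le" avoids the 2-byte byte order mark that plain "utf-16" adds.
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}


# ASCII text is smaller in UTF-8, CJK text is smaller in UTF-16, and a
# character outside the basic multilingual plane takes 4 bytes in both
# (a surrogate pair in UTF-16).
for sample in ("hello", "日本語", "\U00020000"):
    print(repr(sample), encoded_sizes(sample))
```

For a report that mixes markup (mostly ASCII) with Arabic and Far East text, the difference tends to even out, which is one reason UTF-8 is still a reasonable default.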

Jon Skeet
Thanks Jon. This article helped me in understanding that. http://www.joelonsoftware.com/articles/Unicode.html
bdhar
A: 

UTF-8 covers all of the UCS2 (which is what most people are referring to when they say Unicode) characters, and as such should be appropriate. You still have to make sure there aren't any embedded characters that shouldn't appear in XML, such as < or > or non-printable characters.

Rowland Shaw
UTF-8 covers the entirety of Unicode, including the astral planes, not just UCS2.
bobince
Some UTF-8 Parsers fall over if you give them UCS4 though :)
Rowland Shaw
+1  A: 

Of the encodings you tried, UTF-8 is the only one that can handle all those alphabets. It's also the default encoding for XML, and the only encoding that makes sense for a modern application. (For storage/on-the-wire, anyway; for internal processing your language's string type would be more likely to be UTF-16 or UTF-32.)

It would seem from the error that you have a problem in the input file, rather than an issue with your choice of output encoding. Maybe it's encoded in something other than UTF-8 but has forgotten to include an <?xml encoding?> declaration to say so. Or maybe there's an invalid ISO-2022-JP escape sequence? (This is a horror of an encoding.)

You should try to load the input file into something that parses XML (eg. Firefox or IE) and see what errors, if any, it comes up with.
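If a browser isn't handy, the same check can be sketched in Python with the standard library's `xml.etree.ElementTree`. The function name `check_well_formed` is invented for this example; it takes the raw bytes of the file so that the parser, not the caller, decides how to decode them.

```python
import xml.etree.ElementTree as ET


def check_well_formed(data):
    """Parse raw XML bytes; return None on success, else (line, column, message)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # err.position is the (line, column) pair reported by the parser.
        line, column = err.position
        return (line, column, str(err))
    return None
```

A typical use would be `check_well_formed(open("report.xml", "rb").read())`, where `report.xml` is a placeholder file name. Reading in binary mode matters: decoding the file yourself first would hide exactly the kind of encoding problem you are looking for.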

(You can't mix encodings in a single XML file. If you've spat out byte strings from different sources into XML, you've already lost. How is this XML generated?)

bobince
Actually our application supports multiple languages. This XML is used for reporting and contains data in all the languages from the DB. I am not able to choose a generic encoding for the report!
bdhar
Multiple languages, does that mean multiple encodings? It is impossible to create an XML file with content in different encodings; if you need to output XML from different encoding sources, the program that creates that XML *must* transcode all data to a single encoding (typically UTF-8) before including it in XML. An XML file that includes invalid UTF-8 byte sequences due to lack of transcoding is not well-formed, and thus by definition not an XML file.
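A minimal Python sketch of that transcoding step, assuming each value arrives as raw bytes together with the name of the encoding it was stored in. The function `build_report` and the element names `report` and `item` are invented for illustration, and `xml_declaration=True` needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def build_report(rows):
    """Build one UTF-8 XML document from (raw_bytes, source_encoding) pairs."""
    root = ET.Element("report")
    for raw, source_encoding in rows:
        # Decode each value with the encoding it was actually stored in,
        # so everything in the tree is plain Unicode text.
        ET.SubElement(root, "item").text = raw.decode(source_encoding)
    # Serialize the whole tree once, in a single encoding.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The point of the design is that bytes from different sources never meet: they are all turned into Unicode text first, and only the finished document is encoded.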
bobince
How can I find out whether there are any bad character sequences in my XML file? Is there a tool for that?
bdhar
@bdhar Any editor that lets you view the hex representation should let you analyse your output and allow you to cross reference with the spec for UTF-8.
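As a complement to a hex editor, a few lines of Python can point at the first byte sequence that is not valid UTF-8. This is only a sketch; the function name `first_invalid_utf8` is made up for the example.

```python
def first_invalid_utf8(data):
    """Return (byte offset, hex of the offending bytes), or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the decoder rejected.
        return (err.start, data[err.start:err.end].hex())
    return None


# 0xE9 is "é" in several single-byte encodings, but on its own it is not
# a valid UTF-8 sequence.
print(first_invalid_utf8(b"abc\xe9def"))
```

The reported offset can then be looked up in the hex view to see what the surrounding bytes were meant to be.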
Rowland Shaw
Also if you drop such a file into IE, it will give you a “The XML page cannot be displayed” error page that quotes the line/column number where the bad input occurs.
bobince
I have pasted the XML file with utf-8 encoding here: http://pastebin.com/kqMR4Tm6. When I try to open this in Internet Explorer, I get the error. If I change the encoding from utf-8 to iso-8859-6, it works fine.
bdhar
Well, the pasting service removes any character set information we could have seen in the original file (pastebin.com transcodes to UTF-8), but if ISO-8859-6 gives you the right results, fine. This encoding can't cope with Chinese or Japanese characters though.
bobince
Correct. I even tried to create an XML file by copy-pasting the pastebin text. It threw the same error when I tried to open it in Internet Explorer.
bdhar
Depends how you save it. If your Windows system default codepage (misleadingly known in Windows as “ANSI”) is cp1256, Notepad will save it as this codepage by default and you would need to set the `encoding` declaration to `cp1256` or the similar `ISO-8859-6`. From the Save As dialog you can also choose to save it as UTF-8 to avoid this.
bobince
Thanks a lot. It works. Is there a programmatic way to change the encoding of the file to UTF-8? I am using VB 6.
bdhar
Not familiar with old-school VB myself, but this page suggests you can do that with an `ADODB.Stream`: http://www.nonhostile.com/howto-convert-byte-array-utf8-string-vb6.asp
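For comparison only (this is Python, not VB 6), the conversion itself comes down to decoding with the source codepage and re-encoding as UTF-8. The function name `transcode_to_utf8` is hypothetical, and the `cp1256` default is just an assumption taken from the Notepad scenario discussed in this thread.

```python
def transcode_to_utf8(data, source_encoding="cp1256"):
    """Re-encode raw bytes from source_encoding to UTF-8."""
    # Decoding fails loudly (UnicodeDecodeError) if the bytes are not
    # actually in source_encoding, which is preferable to silent corruption.
    text = data.decode(source_encoding)
    return text.encode("utf-8")
```

To convert a file, read it in binary mode, pass the bytes through the function, and write the result back in binary mode. If the XML declaration names the old encoding, it has to be updated to `utf-8` as well, otherwise the parser will still decode the file with the wrong codepage.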
bobince
Thanks bob, this one is also good.. http://www.vbforums.com/showthread.php?t=537249
bdhar