Invalid token error while parsing an XML file with UTF-8 encoding

The error occurs when the parser encounters the extended ASCII character 'â'.

When I change the encoding from UTF-8 to ISO-8859-1, parsing succeeds. But my application needs to support UTF-8, ASCII, and extended ASCII characters. How can I achieve this?

Any ideas are welcome.

Thanks in advance for your time and solution.

A: 

Telling a parser that a Latin-1 file is UTF-8, by setting the encoding attribute of the XML declaration, will result in an error similar to the one you report.

If the 'â' character (U+00E2) appears in a UTF-8 encoded file, it is stored as the two-byte sequence 0xC3 0xA2. In a Latin-1 file it is the single byte 0xE2, which on its own is not a valid UTF-8 sequence. So if you are not changing the bytes in the file when you say you are changing the encoding, you are not changing the encoding of the file at all. You are only telling the parser that a non-UTF-8 file is UTF-8.

Pete Kirkham
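To make the byte-level point above concrete, here is a minimal Python sketch. It assumes the standard library's `xml.etree.ElementTree` (backed by expat) as the parser, which may not be the parser used in the question; the helper name `try_parse` and the `<root>` element are made up for illustration.

```python
import xml.etree.ElementTree as ET

CHAR = "\u00e2"  # the character 'â' from the question

utf8_bytes = CHAR.encode("utf-8")      # two bytes: 0xC3 0xA2
latin1_bytes = CHAR.encode("latin-1")  # one byte:  0xE2

DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>'


def try_parse(body: bytes):
    """Hypothetical helper: return the root element's text, or None on a parse error."""
    try:
        return ET.fromstring(DECLARATION + b"<root>" + body + b"</root>").text
    except ET.ParseError:
        return None


# Bytes really are UTF-8 and the declaration says UTF-8: parses fine.
good = try_parse(utf8_bytes)

# Bytes are Latin-1 but the declaration still claims UTF-8: the lone 0xE2
# is not a valid UTF-8 sequence, so the parser rejects the document.
bad = try_parse(latin1_bytes)

print("correctly encoded parsed:", good is not None)
print("mislabelled parsed:", bad is not None)
```

Under these assumptions only the first document parses, which matches the behaviour described in the answer: changing the label without changing the bytes does not change the file's encoding.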
Thank you, Pete, for your response. What can I do to make the XML file generic, so that it will not throw any errors for UTF-8 as well as extended ASCII characters?
Niranjan
You can't. An XML file that's using an encoding other than the default UTF-8 (or UTF-16 with a BOM) **must** specify its own encoding; there is no woolly “guess UTF-8 unless it can't be, in which case guess something else” mode. You need to go have harsh words with whoever is responsible for producing the “XML” file that has a non-UTF-8 byte sequence without specifying a `<?xml encoding?>` for it, because what they've produced is in no way valid.
bobince
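As a rough illustration of the comment above, the following Python sketch writes the same character in two encodings, each time declaring the encoding that was actually used for the bytes. It assumes `xml.etree.ElementTree` as the parser; `make_document` is a hypothetical helper, not part of any library, and it does no escaping, so it only suits plain text without markup characters.

```python
import xml.etree.ElementTree as ET


def make_document(text: str, encoding: str) -> bytes:
    """Hypothetical helper: build a tiny document whose declaration
    names the same encoding that is used to produce the bytes."""
    declaration = '<?xml version="1.0" encoding="{}"?>'.format(encoding)
    return (declaration + "<root>" + text + "</root>").encode(encoding)


# Same character, two different byte representations, each labelled honestly.
latin1_doc = make_document("\u00e2", "ISO-8859-1")
utf8_doc = make_document("\u00e2", "UTF-8")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Plain ASCII content is already valid UTF-8, so it needs no special handling.
ascii_text = ET.fromstring(make_document("abc", "UTF-8")).text

print("same text from both encodings:", latin1_text == utf8_text)
```

The two documents differ on disk but yield the same text once parsed, so a parser can handle UTF-8, ASCII, and Latin-1 input as long as each file declares its encoding truthfully.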
Thank you bobince for the info.
Niranjan