XML parse error on illegal character

The parser is correct; whatever produced the serialisation is wrong. As with most of the C0 control characters, it is invalid (actually, worse than that: not well-formed) to put a U+001A SUBSTITUTE into an XML 1.0 file(*), even if encoded as a character reference such as `&#x1A;`.
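To make the rule concrete, here is a minimal sketch in Python (not the ASP.NET stack from the question) that checks a character against the `Char` production of XML 1.0 and shows a stock parser rejecting U+001A both raw and as a character reference. The helper names `is_xml10_char` and `parses` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch: str) -> bool:
    """Return True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)        # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(is_xml10_char("\x1a"))                 # U+001A is outside Char
print(parses("<name>Smith\x1a</name>"))      # raw control character
print(parses("<name>Smith&#x1A;</name>"))    # same character as a reference
print(parses("<name>Smith&#x9;</name>"))     # tab is allowed
```

Only the last document is accepted; the first three lines print `False`.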

No conformant XML parser will read this, nor should it. Whilst you could put some horrific hack in to try to filter out `&#x1A;` sequences before passing them to the parser, such crude hacks wouldn't work for the general case. The serialiser should be fixed to stop producing them.

I have no idea how that character (often used to mark end-of-file in ancient, horrible operating systems) would get into the dataset used by an ASP.NET app, but it doesn't seem to play any valid role in a name, address or e-mail. Perhaps what you really need to look at is cleaning your data.
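If you do control the producing side, one way to clean the data is to strip anything outside the XML 1.0 character range from each value before it is handed to the serialiser. The following is a rough Python sketch of that idea, not the ASP.NET code from the question; `clean_text`, `build_person` and the `person`/`name` element names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches every character outside the XML 1.0 Char production.
_ILLEGAL_XML10 = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text(value: str, replacement: str = "") -> str:
    """Drop (or replace) characters that XML 1.0 cannot represent."""
    return _ILLEGAL_XML10.sub(replacement, value)


def build_person(name: str) -> bytes:
    """Serialise a tiny document, cleaning the value on the way in."""
    person = ET.Element("person")
    ET.SubElement(person, "name").text = clean_text(name)
    return ET.tostring(person, encoding="utf-8")


xml_bytes = build_person("Smith\x1a")
# The cleaned document round-trips through the parser.
print(ET.fromstring(xml_bytes).find("name").text)
```

The point of the design is that the cleaning happens on the data values, before any markup exists, so there is no risk of mangling the XML itself.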

(*: It would be legal if encoded as a character reference in an XML 1.1 document. If you absolutely must round-trip control characters through XML, you will have to use XML 1.1. Though that may lead to compatibility issues with older XML parsers, and you still can't use the U+0000 NULL character, so you're never going to be completely binary-safe.)

@bobince, thank you for your detailed answer. I am presuming the data was entered by copy-pasting from a Word file or something of that sort.

bushman 2010-06-29 12:58:16

Yeah, that would be common for the C1 control codes in the range 0x80-0x9F (typically coming from code page 1252 smart quotes mis-interpreted as ISO-8859-1), but the 0x1A control code isn't used for anything by Word, or any other common modern Windows app I can think of.

bobince 2010-06-29 13:20:43
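For the smart-quote case mentioned in that comment, here is a small Python sketch of what happens when code page 1252 bytes are decoded as ISO-8859-1, and one possible repair. The function name `repair_c1` is made up for illustration. Note that it leaves U+001A alone, since that character has nothing to do with this kind of mis-decoding.

```python
def repair_c1(text: str) -> str:
    """Map C1 control characters back to what cp1252 meant by those bytes."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte is undefined in cp1252; leave it unchanged
        out.append(ch)
    return "".join(out)


raw = b"\x93quoted\x94"            # cp1252 curly quotes around a word
wrong = raw.decode("iso-8859-1")   # mis-decoding yields U+0093 and U+0094
print([hex(ord(c)) for c in wrong if ord(c) >= 0x80])
print(repair_c1(wrong) == "\u201cquoted\u201d")
```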

So bob, I have no control over how the data comes to me. Is that horrific hack of removing it from the string the only way, or is there another way to handle it, for example checking before the serialization whether the string is legal UTF-8?

bushman 2010-06-29 13:44:40

It's not an encoding issue: character U+001A is equally invalid in UTF-8, ISO-8859-1 or plain old 7-bit ASCII. You can remove the string `&#x1A;` with a simple string replace, but all attempts to handle XML with string/regex hacking risk breaking cases where it is not markup, such as in a `<!-- comment -->`, `<?pi?>` or `<![CDATA section]]>`. But you can't handle this input as XML, because with this control character in it, **it simply isn't XML**. If it is *supposed to be* XML, you need to find the party responsible for generating it and complain vociferously until they fix it.

bobince 2010-06-29 14:04:16
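A short Python sketch of the two points in that comment, with made-up sample data: U+001A is the same single byte in all three encodings, and a blind string replace silently changes content inside a CDATA section, where the reference is literal text and the document was already well-formed.

```python
import xml.etree.ElementTree as ET

# U+001A encodes to the same byte whichever of these encodings is used.
for encoding in ("utf-8", "iso-8859-1", "ascii"):
    print(encoding, "\x1a".encode(encoding))


def naive_strip(document: str) -> str:
    """The 'horrific hack': remove the reference with a plain replace."""
    return document.replace("&#x1A;", "")


# Inside CDATA the reference is just text, so this document parses fine.
doc = "<note><![CDATA[The reference &#x1A; is not allowed]]></note>"
before = ET.fromstring(doc).text
after = ET.fromstring(naive_strip(doc)).text

print(before)
print(after)
print(before == after)
```

The last line prints `False`: the hack altered data that was never a problem.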

:) got it. thank you

bushman 2010-06-29 14:29:53
