SO, I am asking as a last resort, as I am completely out of ideas.

I have a Windows ASP.NET ASMX web services app that returns a serialized Person object with a name, address, email, etc.

But some attributes in the XML are encoded very weirdly, for instance &#x1a (I don't know where the encoding takes place; I assume in the serialization process).

Googling those characters, I see that it is "Windows-1252" encoding.

The problem occurs while parsing the XML: I get a parse error of "invalid unicode character" at the position of the 1252 encoding.

How can I successfully parse it? What solutions do you suggest?

+1  A: 

The parser is correct; whatever produced the serialisation is wrong. As with most of the C0 control characters, it is invalid (actually, worse than that: not well-formed) to put a U+001A SUBSTITUTE into an XML 1.0 file(*), even if encoded as a character reference such as `&#x1A;`.
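To see the parser's side of this, here is a minimal sketch in Python (not the asker's ASP.NET stack; the `Person` payload and the `is_well_formed` helper are made up for illustration). It only shows that a conforming parser refuses a document containing the `&#x1A;` reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: an attribute carrying U+001A as a character reference.
payload = '<Person name="Ann&#x1A;Lee" />'


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # The parser stops at the reference to a character XML 1.0 forbids.
        return False


print(is_well_formed(payload))                      # False
print(is_well_formed('<Person name="Ann Lee" />'))  # True
```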

No XML parser will read this, nor should it. Whilst you could put some horrific hack in to try to filter out `&#x1A;` sequences before passing them to the parser, such crude hacks wouldn't work for the general case. The serialiser should be fixed to stop producing them.
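For what it's worth, the kind of pre-parse filter described above might look roughly like this. It is a sketch in Python rather than .NET, every name in it (`strip_bad_char_refs`, the sample strings) is hypothetical, and the last lines show why it is not safe in general: inside a CDATA section the same characters are literal text, not markup, and the filter changes them anyway.

```python
import re
import xml.etree.ElementTree as ET

# Numeric character references such as &#x1A; (hex) or &#26; (decimal).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _allowed_in_xml10(cp: int) -> bool:
    """Code points matched by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def strip_bad_char_refs(text: str) -> str:
    """Drop references to forbidden code points; keep all other references."""
    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _allowed_in_xml10(cp) else ""
    return _CHAR_REF.sub(repl, text)


broken = '<Person name="Ann&#x1A;Lee" email="a&#64;b.example" />'
person = ET.fromstring(strip_bad_char_refs(broken))
print(person.get("name"))   # AnnLee
print(person.get("email"))  # a@b.example

# The weakness: in CDATA the sequence is plain text, yet it is removed too.
cdata = "<note><![CDATA[literal &#x1A; text]]></note>"
print(strip_bad_char_refs(cdata) == cdata)  # False
```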

Actually I have no idea how the character (often used to mark end-of-file in ancient horrible operating systems) would get into the dataset used by an ASP.NET app, but it wouldn't seem to play any valid role in a name, address or e-mail. Perhaps really you need to be looking at cleaning your data.
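If cleaning the data is an option, one possible approach is to strip characters that XML 1.0 does not allow from each string before it ever reaches the serialiser. The sketch below is Python, not the ASP.NET code in question, and `clean_for_xml` is a name invented here; silently dropping the characters (rather than rejecting the record) is also just one assumed policy.

```python
import re

# Anything outside the XML 1.0 Char production: most C0 controls,
# lone surrogates, U+FFFE and U+FFFF.
_ILLEGAL_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_for_xml(value: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _ILLEGAL_XML10.sub("", value)


print(clean_for_xml("Ann\x1aLee"))         # AnnLee
print(clean_for_xml("12 Main St\nApt 4"))  # newline is legal, so unchanged
```

Applied to the name, address and e-mail fields before serialisation, this keeps the U+001A out of the output without touching the XML text itself.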

(*: It would be legal if encoded as a character reference in an XML 1.1 document. If you absolutely must round-trip control characters through XML, you will have to use XML 1.1. Though that may lead to compatibility issues with older XML parsers, and you still can't use the U+0000 NULL character, so you're never going to be completely binary-safe.)

bobince
@bobince, thank you for your detailed answer. I am presuming the data was entered as a copy-paste from a Word file or something of that sort.
bushman
Yeah, that would be common for the C1 control codes in the range 0x80-0x9F (typically coming from code page 1252 smart quotes mis-interpreted as ISO-8859-1), but the 0x1A control code isn't used for anything by Word, or any other common modern Windows app I can think of.
bobince
So bob, I have no control over how the data comes to me. Is the only way to use that horrific hack and remove it from the string, or is there another way to represent it? For example, before the serialization, check whether the string is legal UTF-8.
bushman
It's not an encoding issue: character U+001A is equally invalid in UTF-8, ISO-8859-1 or plain old 7-bit ASCII. You can remove the string `&#x1A;` with a simple string replace, but all attempts to handle XML with string/regex hacking risk breaking cases where it is not markup, such as in a `<!-- comment -->`, `<?pi?>` or `<![CDATA section]]>`. But you can't handle this input as XML, because with this control character in it, **it simply isn't XML**. If it is *supposed to be* XML, you need to find the party responsible for generating it and complain vociferously until they fix it.
bobince
:) got it. thank you
bushman