ansaurus

Question

.NET DataSet.GetXml() - what's the default encoding?

Answer 1

A:

I believe your approach should be to use WriteXml instead of GetXml. That should allow you to specify the encoding.

However, note that you will have to write through an intermediate stream: if you output directly to a string, it will always use UTF-16. Since you are using a TEXT column, that UTF-16 string can carry characters that are not valid for TEXT.

John Saunders 2009-12-09 19:06:57

What's wrong with doing it as per my example, i.e. concatenating [the xml encoding] + DataSet.GetXml()?

joedotnot 2009-12-09 21:56:29

1) Don't use string concatenation to manipulate XML. There are differences in the rules between XML and strings. 2) Your method only declares what the encoding is - it does not change the encoding at all.

John Saunders 2009-12-09 22:14:47

Answer 2

A:

DataSet.GetXml() returns a string. In .NET, strings are internally encoded using UTF-16, but that is not really relevant here.

The reason why there's no <?xml encoding=...> declaration in the string is because that declaration is only useful or needed to parse XML in a byte stream. A .NET string is not a byte stream, it's just text with well-defined codepoint semantics (which is Unicode), so it is not needed there.

If there is no XML encoding declaration, UTF-8 is to be assumed by the XML parser in the absence of BOM. In your case, however, it is also entirely irrelevant since the problem is not with an XML parser (XML isn't parsed by SQL Server when it's stored in a TEXT column). The problem is that your XML contains some Unicode characters, and TEXT is a non-Unicode SQL type.

You can encode a string to any encoding using Encoding.GetBytes() method.

Pavel Minaev 2009-12-09 23:34:51
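
To make the point about strings versus bytes concrete, here is a minimal sketch in Python rather than .NET. The thread contains no code of its own, so the language, the sample text and the helper name `is_valid_utf8` are all assumptions made for illustration. It shows that one piece of text yields different bytes under different encodings, and that prepending a declaration only labels the bytes without re-encoding them.

```python
# The same text, serialized three ways. U+2019 is the curly apostrophe.
text = "<note>it\u2019s</note>"

as_utf8 = text.encode("utf-8")        # apostrophe becomes 3 bytes: E2 80 99
as_utf16 = text.encode("utf-16-le")   # every character here becomes 2 bytes
as_cp1252 = text.encode("cp1252")     # apostrophe becomes 1 byte: 0x92 (146)

# Prepending a declaration only labels the bytes; it does not re-encode them.
mislabeled = b'<?xml version="1.0" encoding="utf-8"?>' + as_cp1252


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(as_utf8), len(as_utf16), len(as_cp1252))
print(is_valid_utf8(as_utf8), is_valid_utf8(mislabeled))
```

In .NET terms, `Encoding.GetBytes()` plays the role that `str.encode` plays here.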

Wrong assumption: the column is not TEXT; only a parameter of type TEXT is being used to accept the XML string. TEXT is used because varchar(8000) has a restriction on length. The problem *is* with the parser on SQL Server:

Server: Msg 6603, Level 16, State 1, Procedure sp_xml_preparedocument, Line 40
XML parsing error: An invalid character was found in text content.

As I said, when I declare the XML string as ISO-8859-1, no error occurs in the sproc, so the parser is treating ASCII 146 as acceptable.

joedotnot 2009-12-09 23:59:35

The problem is still `TEXT`, actually. Specifically, when you pass a Unicode `string` to your sproc, it has to be converted to a non-Unicode encoding to match `TEXT`; the result is of course not encoded using any UTF encoding, and which encoding the conversion will use isn't easy to determine. If you have control over the sproc, just replace `TEXT` with `NTEXT`, and don't bother with encodings.

Pavel Minaev 2009-12-10 00:28:29

I arrived at the same conclusion to use NTEXT just prior to reading your last comment; then I won't need to declare `<?xml version="1.0" encoding="ISO-8859-1"?>` to make it work (or won't need to bother with encodings, as you have said). Can you please just clarify a few things: Are you saying that if I use NTEXT, the XML string that I pass will be interpreted by the XML parser as UTF-16? Why does keeping TEXT and declaring the XML string as ISO-8859-1 work?

joedotnot 2009-12-10 00:46:22

Apparently, when the ADO.NET MSSQL provider does the conversion from a .NET Unicode string to `TEXT`, it uses ISO-8859-1 as the encoding (I suspect it either uses the current system locale, or the codepage specified in your database). Hence the string, once it arrives in SQL Server, is encoded using ISO-8859-1 (insofar as that can represent characters from the original string), and then the XML parser in MSSQL treats it as a sequence of bytes and presumes UTF-8 (or picks up your explicit encoding declaration). With NTEXT, I'd expect it to treat it as Unicode text rather than raw bytes.

Pavel Minaev 2009-12-10 01:52:59
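
The behaviour described in this last comment can be reproduced with any XML parser that works on raw bytes. Below is a hedged sketch in Python, using its standard-library parser as a stand-in for the one behind sp_xml_preparedocument. That is an assumption: the two parsers are different implementations, and the sample document and the `try_parse` helper are hypothetical. It also assumes that the character reported as "ASCII 146" arrives at the parser as the single byte 0x92.

```python
import xml.etree.ElementTree as ET

# Assumption: after conversion to a single-byte code page, the curly
# apostrophe reaches the parser as the single byte 0x92 (decimal 146).
body = b"<note>it\x92s</note>"


def try_parse(data: bytes):
    """Return the root element's text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# No declaration: the parser presumes UTF-8, where a lone 0x92 is invalid.
undeclared = try_parse(body)

# Explicit declaration: every byte maps to a character, so parsing succeeds.
declared = try_parse(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body)

print(undeclared)
print(repr(declared))
```

Note that the declared variant parses without error but yields the code point U+0092 rather than the curly apostrophe, so "no error" here does not mean the original character survived the round trip.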
