An existing app passes XML to a sproc in SQL Server 2000; the input parameter's data type is TEXT. The XML is derived from DataSet.GetXml(), but I notice it doesn't specify an encoding.

So when a user sneaks an inappropriate character into the dataset, specifically ASCII 146 (which appears to be an apostrophe) instead of ASCII 39 (single quote), the sproc fails.

One approach is to prefix the result of GetXML with

<?xml version="1.0" encoding="ISO-8859-1"?>

It works in this case, but what would be a more correct approach to ensure the sproc does not crash (if other unforeseen characters pop up)?

PS. I suspect the user is typing text into MS-Word or similar editor, and copy & pasting into the input fields of the app; I would probably want to allow the user to continue working this way, just need to prevent the crashes.

EDIT: I am looking for answers that confirm or deny a few aspects. For example:
- As per the title, what's the default encoding if none is specified in the XML?
- Is ISO-8859-1 the right encoding to use?
- Is there a better encoding that would encompass more characters in the English-speaking world and thus be less likely to cause an error in the sproc?
- Would you filter at the app's UI level for standard ASCII (0 to 127 only), and not allow extended ASCII?
- Any other pertinent details.

A: 

I believe your approach should be to use WriteXml instead of GetXml. That should allow you to specify the encoding.

However, note that you will have to write through an intermediate stream - if you output directly to a string, it will always use UTF-16. Since you are using a TEXT column, that will permit characters not valid for TEXT.

John Saunders
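To illustrate the "write through an intermediate byte stream" idea outside .NET, here is a minimal Python sketch. It is an analogy only, not the DataSet.WriteXml API: the `serialize` helper and the `row` element name are hypothetical, and Python's ElementTree stands in for the .NET writer. The point it tries to show is that when the serializer itself produces the bytes, the declaration and the actual encoding agree, and characters the target encoding cannot hold are written as numeric character references.

```python
import io
import xml.etree.ElementTree as ET

def serialize(root, encoding):
    """Write the tree to a byte buffer so the declaration and the bytes agree."""
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding=encoding, xml_declaration=True)
    return buf.getvalue()

root = ET.Element("row")      # hypothetical element name
root.text = "it\u2019s"       # U+2019, the curly apostrophe that Word tends to insert

latin1_bytes = serialize(root, "ISO-8859-1")   # U+2019 is emitted as &#8217;
utf8_bytes = serialize(root, "UTF-8")          # U+2019 is emitted as three raw bytes

# Both byte strings parse back to the same text.
roundtrip_latin1 = ET.fromstring(latin1_bytes).text
roundtrip_utf8 = ET.fromstring(utf8_bytes).text
```

In this sketch the ISO-8859-1 output contains only single-byte characters, which is the property a non-Unicode parameter would need.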
what's wrong with doing it as per my example, concatenate [the xml encoding] + DataSet.GetXml() ?
joedotnot
1) Don't use string concatenation to manipulate XML. There are differences in the rules between XML and strings. 2) Your method only declares what the encoding is - it does not change the encoding at all.
John Saunders
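The second point in the comment above (declaring an encoding does not change the encoding) can be sketched in Python. This is an illustration under assumptions, not the asker's .NET code: the `row` payload is hypothetical, and Python's ElementTree plays the role of the XML parser.

```python
import xml.etree.ElementTree as ET

declaration = '<?xml version="1.0" encoding="ISO-8859-1"?>'
document = declaration + "<row>caf\u00e9</row>"   # hypothetical payload

# The declaration says ISO-8859-1, but these bytes are produced as UTF-8.
mismatched = document.encode("utf-8")
# Here the bytes really are ISO-8859-1, matching the declaration.
matched = document.encode("iso-8859-1")

garbled = ET.fromstring(mismatched).text   # the two UTF-8 bytes of "é" are read as two characters
intact = ET.fromstring(matched).text       # the text survives unchanged
```

The parser trusts the declaration in both cases; only the step that turns the string into bytes decides whether that trust is justified.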
A: 

DataSet.GetXml() returns a string. In .NET, strings are internally encoded using UTF-16, but that is not really relevant here.

The reason why there's no <?xml encoding=...> declaration in the string is because that declaration is only useful or needed to parse XML in a byte stream. A .NET string is not a byte stream, it's just text with well-defined codepoint semantics (which is Unicode), so it is not needed there.

If there is no XML encoding declaration, UTF-8 is to be assumed by the XML parser in the absence of BOM. In your case, however, it is also entirely irrelevant since the problem is not with an XML parser (XML isn't parsed by SQL Server when it's stored in a TEXT column). The problem is that your XML contains some Unicode characters, and TEXT is a non-Unicode SQL type.

You can encode a string to any encoding using Encoding.GetBytes() method.

Pavel Minaev
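As a rough Python counterpart to encoding a string with Encoding.GetBytes(), the sketch below encodes the curly apostrophe U+2019 in several encodings. It assumes the problem character is U+2019, which the question does not state outright; the value 146 reported there is consistent with the Windows-1252 byte for that character.

```python
apostrophe = "\u2019"   # RIGHT SINGLE QUOTATION MARK (assumed to be the pasted character)

as_cp1252 = apostrophe.encode("cp1252")     # one byte with value 146 (0x92)
as_utf8 = apostrophe.encode("utf-8")        # three bytes: 0xE2 0x80 0x99
as_utf16 = apostrophe.encode("utf-16-le")   # two bytes: 0x19 0x20

# Strict ISO-8859-1 has no slot for U+2019, so encoding it fails.
try:
    apostrophe.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

So the same character becomes one, two, or three bytes depending on the encoding, and cannot be represented in strict ISO-8859-1 at all.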
Wrong assumption: the column is not TEXT; only a parameter of type TEXT is being used to accept the XML string. TEXT is used because varchar(8000) has a restriction on length. The problem *is* with the parser on SQL Server: "Server: Msg 6603, Level 16, State 1, Procedure sp_xml_preparedocument, Line 40 XML parsing error: An invalid character was found in text content." As I said, when I declare the XML string as ISO-8859-1, no error occurs in the sproc, so the parser is treating ASCII 146 as acceptable.
joedotnot
The problem is still `TEXT`, actually. Specifically, when you pass a Unicode `string` to your sproc, it has to be converted to a non-Unicode encoding to match `TEXT`; the result is of course not in any UTF encoding, and which encoding the conversion uses isn't easy to determine. If you have control over the sproc, just replace `TEXT` with `NTEXT`, and don't bother with encodings.
Pavel Minaev
I arrived at the same conclusion to use NTEXT just prior to reading your last comment, so I won't need to declare <?xml version="1.0" encoding="ISO-8859-1"?> to make it work (or won't need to bother with encodings, as you have said). Can you please just clarify a few things: (1) Are you saying that if I use NTEXT, the XML string that I pass will be interpreted by the XML parser as UTF-16? (2) Why does keeping TEXT and declaring the XML string as ISO-8859-1 work?
joedotnot
Apparently, when the ADO.NET MSSQL provider does the conversion from a .NET Unicode string to `TEXT`, it uses ISO-8859-1 as the encoding (I suspect it either uses the current system locale, or the codepage specified in your database). Hence the string, once it arrives in SQL Server, is encoded using ISO-8859-1 (insofar as it can represent characters from the original string), and then the XML parser in MSSQL treats it as a sequence of bytes, and presumes UTF-8 (or picks up your explicit encoding declaration). With NTEXT, I'd expect it to treat it as Unicode text rather than raw bytes.
Pavel Minaev
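The behaviour described in the last comment (raw bytes, UTF-8 presumed unless a declaration says otherwise) can be reproduced with a different parser. The Python sketch below uses ElementTree, not SQL Server's sp_xml_preparedocument, so it only suggests why the declaration makes the error go away; the `row` payload is hypothetical.

```python
import xml.etree.ElementTree as ET

# Byte 146 (0x92) as it might arrive in a non-Unicode TEXT parameter.
body = b"<row>it\x92s</row>"
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

# Without a declaration the parser assumes UTF-8, where a lone 0x92 is invalid.
try:
    ET.fromstring(body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# With the declaration every byte is a valid character, so parsing succeeds.
text = ET.fromstring(declared).text
```

Note that under strict ISO-8859-1 the byte 0x92 decodes to the control character U+0092 rather than to the apostrophe U+2019, so in this sketch the declaration prevents the parse error without recovering the original character.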