views:

576

answers:

2

I have some XML that has creating using an XmlDocument object in C#.

I am then inserting the xml data into an XML column in SQL 2005 using the XmlDocument.OuterXml method to get the raw xml.

The problem I have is that the xml contains some special characters. Namely:  This is because the xml is built up from user input from various sources.

When I am performing the insert I get a SQL error message saying XML parsing: illegal xml character.

CREATE TABLE #MyTable (MyColumn XML)
INSERT INTO #MyTable VALUES ('<Element>&#x1F;&#x1C;</Element>')

Msg 9420, Level 16, State 1, Line 2

XML parsing: line 1, character 15, illegal xml character

Can someone point me in a direction as to how I can solve this?

It will be extremely hard for me to change anything on the XmlDocument object or original xml itself, however, I can change the mechanism I use to get at the RAW XML. I can also change my database settings if required.

+1  A: 

If you've got an XmlDocument with invalid characters in, you basically haven't got valid XML. There's no way of representing anything below U+0020 other than tab, carriage return and linefeed in XML. I'm not entirely sure why, but it's invalid XML according to the specification.

You could store the raw text of the invalid XML in the database, but I assume there's some reason why you wanted it to be an XML column in the first place :)

Do you actually want to preserve this invalid XML? Do you know where it comes from and whether it's required? I'd personally try to cleanse the data wherever possible.

(Btw, the first Stack Overflow data dump had a very similar problem, so it's not just you :)

Jon Skeet
Is there a way to make the XmlDocument object enforce the same validation rules? These characters came from a user typing/pasting them into an html textbox and this TextBox.Text was then assigned to XmlNode.InnerText. If it's invalid XML, I'm surprised it didn't crash at this point.
Robin Day
Have a look at http://msdn.microsoft.com/en-us/library/56xykaw6.aspx
Jon Skeet
I don't need to preserve the invalid data. It is all user input from web pages and so not sure how any escape characters got in there anyway. I would like to cleanse it, I just need an elegant solution. I've found code that strips invalid characters by looping through the whole string. This just seems wrong to me, I shouldn't be coding which chars are valid and which aren't.
Robin Day
Agreed - it's a bit of a mess. I don't know of an elegant solution unfortunately :(
Jon Skeet
Well if Jon Skeet doesn't know, there's no hope for the rest of us ;) I'll just strip the characters before I throw the XML in the database. I have millions of these XML files of which I'm sure only a handful will have invalid characters, it's going to be a mess running through them all and ensuring they never become "bad" again. Thanks for your help.
Robin Day
A: 

This is possible INSERT INTO #MyTable VALUES ('&#x1F;')

I add spaces now for clarity

INSERT INTO #MyTable VALUES (' < E l e m e n t > & a m p ; # x 1 F ; < / E l e m e n t > ')

guille