ansaurus

Question

How to prevent illegal characters from appearing in my XML when retrieving it from SQL Server

Answer 1

+1 A:

Define the allowed characters and block everything else, e.g.:

// only lowercase letters and digits
if(Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
    // allowed
}

But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and don't apply a proper ISerializable implementation for your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want to be serialized.

PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, xml) and be specific. If you are not, the default encoding is used, which can be different from system to system.

Edit: possible solution

Based on new information (see thread under original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with typographic dashes like "&mdash;" (—) when the text passes through some fancy editing environment. Since there seems to be some uncertainty about how to get SQL Server to accept properly encoded strings, you can also solve this in your XML.

When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numerical entities. When you deserialize, this will be properly parsed in your strings without further ado. Something along these lines:

Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header

But be careful when using StringBuilder or StringWriter: they are fixed to UTF-16, so the XmlWriter will always write in that encoding (more info on that issue at my blog), which is not compatible with SQL Server.

Note: when using the ASCII encoding, any character higher than 0x7F will be written as a numeric character reference. So, é will look like &#xE9; and the dash may look like &#x2014;, but this means exactly the same and you should not worry about it. Every XML-capable tool will properly interpret this input.

Note 2: the place to change the way the XML is written is the Web Service you mention, the one that receives the XML and then stores it in the SQL Server database. The change must be applied before storing into SQL Server. Applying it earlier in the chain is useless.
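A fuller sketch of the ASCII-encoding approach described above, in case it helps. The class name, the element name "value" and the sample text are made up for illustration and are not from the OP's code. It writes a string containing non-ASCII characters with Encoding.ASCII and then reads it back with an XmlReader to check that nothing is lost:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class AsciiXmlSketch
{
    // Writes a single <value> element as ASCII bytes. Characters outside
    // ASCII are emitted by the XmlWriter as numeric character references.
    public static byte[] Write(string text)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();   // emits the XML declaration
                writer.WriteElementString("value", text);
                writer.WriteEndDocument();
            }
            return stream.ToArray();
        }
    }

    // Parses the bytes again and returns the text of the <value> element.
    public static string Read(byte[] xml)
    {
        using (var stream = new MemoryStream(xml))
        using (XmlReader reader = XmlReader.Create(stream))
        {
            reader.ReadToFollowing("value");
            return reader.ReadElementContentAsString();
        }
    }

    static void Main()
    {
        // Hypothetical sample: e-acute, an em dash and two non-breaking hyphens.
        string original = "caf\u00E9 \u2014 123\u201145\u20116789";
        byte[] xml = Write(original);
        Console.WriteLine(Encoding.ASCII.GetString(xml));
        Console.WriteLine(Read(xml) == original);
    }
}
```

The point of the round trip is that the bytes sent to the database contain only ASCII, while the deserialized string still holds the original characters.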

Abel 2010-08-24 12:58:33

@downvoters: care to comment so I can improve and learn?

Abel 2010-08-24 13:06:00

@Abel: I downvoted on your answer as I didn't think Regex was the correct solution. I noticed you changed your answer, however, I cannot remove the downvote until you edit your question :)

James 2010-08-24 13:12:47

Downvote removed :)

James 2010-08-24 13:13:32

@James: I agree that a regex is a workaround, not a solution and may not even help at all. Thanks for commenting. I actually just fixed that very regex ;-)

Abel 2010-08-24 13:13:37

Answer 2

+3 A:

Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.

Verify that the XML document itself is stored using an encoding that supports the characters you need to store. Then verify that when you read the file in, you are using the same encoding as the document, i.e. if your XML document is stored as UTF-8 then you need to make sure that you are decoding it as UTF-8 when you read it in.
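To illustrate what a mismatch looks like, here is a small sketch. It assumes the data was stored as UTF-8 and that the reading side wrongly uses ISO-8859-1; both the sample string and that choice of wrong encoding are just assumptions for the demo:

```csharp
using System;
using System.Text;

static class EncodingMismatchSketch
{
    // Decodes the same bytes with whatever encoding the reader assumes.
    public static string Decode(byte[] bytes, Encoding assumed)
    {
        return assumed.GetString(bytes);
    }

    static void Main()
    {
        // Hypothetical value containing U+2011 (non-breaking hyphen).
        string original = "123\u201145\u20116789";
        byte[] stored = Encoding.UTF8.GetBytes(original);

        string same = Decode(stored, Encoding.UTF8);
        string mismatched = Decode(stored, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(same == original);        // True
        Console.WriteLine(mismatched == original);  // False
    }
}
```

With the wrong encoding each hyphen turns into three unrelated characters, which is the typical symptom of this kind of bug.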

James 2010-08-24 12:59:29

Answer 3

+1 A:

Take a deeper look at the characters themselves: what are the actual char values?

When a character shows up as a square it means it can't be represented visually. This is either because it's a non-visual character, or because it's outside of your current char set.

edit, nope

In your example I'd venture a guess that you're seeing embedded newline characters.

asawyer 2010-08-24 13:00:32

If I check the value in the debugger it reads "123-45-6789"

mint 2010-08-24 13:01:46

Have a look at this char entity table. Your issue might be mdash - the longer dash. Encoding can often cause issues. http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

cofiem 2010-08-24 13:15:35

Answer 4

+3 A:

The first thing to remember is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, there are non-characters, but there are no generally "special characters" or "illegal characters".

What you have here is either:

Perfectly normal characters for which your font doesn't have a glyph.
Perfectly normal characters that aren't printable (e.g. control characters).
An artefact of how the debugger works.

The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
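A quick way to do that in C# is to cast each char to an int. This is only a sketch; the class name and sample string are made up, and it lists UTF-16 code units, which is the same as the code point for anything outside the surrogate range:

```csharp
using System;

static class CodeUnitSketch
{
    // Returns the UTF-16 code unit at the given index as an integer.
    public static int ValueAt(string s, int index)
    {
        return (int)s[index];
    }

    static void Main()
    {
        // Hypothetical sample containing U+2011 instead of a plain hyphen.
        string sample = "12\u201134\u20115678";
        for (int i = 0; i < sample.Length; i++)
        {
            // Prints the index, the value in hex (U+XXXX) and in decimal.
            Console.WriteLine("[{0}] U+{1:X4} ({1})", i, ValueAt(sample, i));
        }
    }
}
```

Note that Char.GetNumericValue is not a substitute here: it returns the numeric value of digit-like characters and -1 for everything else, not the code point.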

An important one to look out for is U+FFFD (�), as it is sometimes used when a decoder has received a bunch of bytes that make no sense in the context of the encoding it is trying to use. For example, 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker. Other possible responses are throwing an error, silently ignoring the error, or trying to guess at intent, though those last two bring security issues.
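As a rough illustration in .NET terms (names here are made up for the sketch): the default Encoding.UTF8 substitutes U+FFFD for invalid bytes, while a UTF8Encoding constructed with throwOnInvalidBytes set to true raises an exception instead.

```csharp
using System;
using System.Text;

static class ReplacementCharSketch
{
    // Lenient decoding: invalid byte sequences become U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Strict decoding: returns false if the bytes are not valid UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        byte[] bad = { 0x80, 0x20 }; // 0x80 cannot start a UTF-8 sequence
        string lenient = DecodeLenient(bad);
        Console.WriteLine((int)lenient[0] == 0xFFFD); // True
        Console.WriteLine(IsValidUtf8(bad));          // False
    }
}
```

Which behaviour you want depends on the situation; the strict variant makes a decoding problem visible at the point where it happens rather than later on.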

Once you've figured this out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an encoding issue (the charset written in is not the charset read in)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.

Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing, it's fine", possibly something simple or something hard. Can't say yet.

Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.

Jon Hanna 2010-08-24 13:15:59

Well the XML in the database reads fine as "123-45-6789" but when I deserialize the XML back into an object on the client it reads "123[]45[]6789", is this all back to encoding?

mint 2010-08-24 13:20:36

Possibly. What is the value of (int)"123[]45[]6789"[3] here? That'll tell us a lot. For that matter, are you sure that the fourth character in the first string is - and not ‐, ‑, –, ‒ or — (indeed, ‒ would be better typography than - though not as useful in machine-readable contexts)? That'll tell us something too.

Jon Hanna 2010-08-24 13:35:02

Well, when I ran this in the debugger, ?Char.GetNumericValue("12‑34‑5678", 2) returned -1.0

mint 2010-08-24 13:48:18

No, the actual code point of the third character, that you get if you convert it to an int directly.

Jon Hanna 2010-08-24 14:15:49

It returned 8209

mint 2010-08-24 14:25:38

That's "non-breaking hyphen". This is an encoding problem.

Hans Passant 2010-08-24 14:49:10

Yep. Next question is what value you get if you do the same thing on the other string, the one with the boxes.

Jon Hanna 2010-08-24 14:53:24

(Unless that was from the one with the boxes, in which case the data is fine and it's just that the font used to display it doesn't deal with it).

Jon Hanna 2010-08-24 14:54:16

@Hans Where do I need to set the encoding for it? When I serialize to XML? I don't recall setting encoding anywhere (which is probably the problem).

mint 2010-08-24 14:54:22

@Jon that was the one with the boxes, in the database it reads as '-' but when pulled down gets turned to those boxes

mint 2010-08-24 14:55:05

The XML should already have an encoding declared in the XML declaration (first line of the document). This may go wrong when you read the web service response and convert it to a string.

Hans Passant 2010-08-24 14:58:17

If that's 8209, then I think the problem isn't with the serialisation at all. 8209 is U+2011, which is ‑, a perfectly sensible character to have there. However, if whatever you are using to look at this data doesn't have a glyph for it, then it's being corrupted at that very final stage where you look at it, not in the code in between. If that code is test code, then it's your test that's broken. If that code also has real-use value, then that's where you need to fix it.

Jon Hanna 2010-08-24 14:58:36
