Sometimes the string values of Properties in my Classes become odd. They contain illegal characters and are displayed like this (with boxes):

123[]45[]6789

I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via Web Service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?

Is there any way I can check whether the string contains any of these unrecognized characters, via regex or something?

+1  A: 

Define the allowed characters and block everything else, e.g.:

// only lowercase letters and digits
if(Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
    // allowed
}

But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and that you don't apply a proper ISerializable implementation for your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want to be serialized.

PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, xml) and be specific. If you are not, the default encoding is used, which can be different from system to system.


Edit: possible solution

Based on new information (see the thread under the original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with pretty dashes like "&mdash;" (—) when typed in some fancy editing environment. Since there seems to be some uncertainty about how to get SQL Server to accept properly encoded strings, you can also solve this in your XML.

When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numerical entities. When you deserialize, this will be properly parsed in your strings without further ado. Something along these lines:

Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header

But be wary of using StringBuilder or StringWriter: they are fixed to UTF-16, so the XmlWriter will always write in that encoding (more info on that issue at my blog), and that is not compatible with SQL Server.

Note: when using the ASCII encoding, any character higher than 0x7F will be written as a numeric character reference. So, é will look like &#xE9; and the dash may look like &#x2014;, but this means just the same and you should not worry about that. Every XML-capable tool will properly interpret this input.
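To make the round trip concrete, here is a minimal sketch of the idea. The class name AsciiXmlSketch, the helper names Write and Read, and the "value" element are made up for illustration; only the XmlWriter/XmlReader calls are standard .NET. It writes a string containing U+2011 as US-ASCII XML and reads it back:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class AsciiXmlSketch
{
    // Writes <value>text</value> as US-ASCII XML and returns the raw bytes.
    // "value" is a hypothetical element name used only for this example.
    public static byte[] Write(string text)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();           // emits the XML declaration
                writer.WriteElementString("value", text);
                writer.WriteEndDocument();
            }
            return stream.ToArray();
        }
    }

    // Parses the bytes back and returns the element's text content.
    public static string Read(byte[] xml)
    {
        using (var stream = new MemoryStream(xml))
        using (XmlReader reader = XmlReader.Create(stream))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    public static void Main()
    {
        string original = "123\u201145\u20116789"; // U+2011 is the non-breaking hyphen
        byte[] xml = Write(original);
        Console.WriteLine(Encoding.ASCII.GetString(xml)); // shows the escaped form
        Console.WriteLine(Read(xml) == original);         // compares the round-tripped text
    }
}
```

The point of the design is that every byte on the wire stays below 0x80, so an intermediate hop that guesses the wrong encoding has nothing to mangle.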

Note 2: the location where you want to change the way XML is written is the Web Service you talk of, that receives XML and then stores it into the SQL Server database. Before storing into SQL Server, the change must be applied. Earlier on in the chain is useless.

Abel
@downvoters: care to comment so I can improve and learn?
Abel
@Abel: I downvoted on your answer as I didn't think Regex was the correct solution. I noticed you changed your answer, however, I cannot remove the downvote until you edit your question :)
James
Downvote removed :)
James
@James: I agree that a regex is a workaround, not a solution and may not even help at all. Thanks for commenting. I actually just fixed that very regex ;-)
Abel
+3  A: 

Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.

Verify that the XML document itself is stored using an encoding that supports the characters you need to store. Then verify that when you read the file in, you are using the same encoding as the document, i.e. if your XML document is stored as UTF-8 then you need to make sure you are decoding it as UTF-8 when you read it.
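As a rough illustration of what a mismatch does (not necessarily what is happening in the OP's service), the sketch below encodes a non-breaking hyphen as UTF-8 and then decodes the same bytes as ISO-8859-1. The class and method names are hypothetical:

```csharp
using System;
using System.Text;

public static class EncodingMismatchSketch
{
    // Encodes text as UTF-8, then decodes the bytes with the given encoding.
    public static string DecodeUtf8BytesAs(string text, Encoding decoder)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return decoder.GetString(bytes);
    }

    public static void Main()
    {
        string hyphen = "\u2011"; // non-breaking hyphen, three bytes in UTF-8

        // Wrong decoder: each UTF-8 byte becomes its own character.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string garbled = DecodeUtf8BytesAs(hyphen, latin1);
        foreach (char c in garbled)
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }

        // Matching decoder: the original character comes back unchanged.
        string intact = DecodeUtf8BytesAs(hyphen, Encoding.UTF8);
        Console.WriteLine(intact == hyphen);
    }
}
```

With the wrong decoder, the single hyphen turns into three characters, two of which are control characters that many fonts draw as boxes.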

James
+1  A: 

Take a deeper look at the characters themselves. What are the actual char values?

When a character shows up as a square it means you can't represent it visually. This is either because it's a non-visual character, or it's outside of your current char set.

edit, nope

In your example I'd venture a guess that you're seeing embedded newline characters.

asawyer
If I check the value in the debugger it reads "123-45-6789"
mint
Have a look at this char entity table. Your issue might be mdash - the longer dash. Encoding can often cause issues. http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
cofiem
+3  A: 

The first thing to remember is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, there are non-characters, but there are no generally "special characters" or "illegal characters".

What you have here is either:

  1. Perfectly normal characters for which your font doesn't have a glyph.
  2. Perfectly normal characters that aren't printable (e.g. control characters).
  3. An artefact of how the debugger works.

The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
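A small helper for that lookup step might look like the following. It is only a sketch: CodePointDump and Describe are made-up names, and it lists UTF-16 code units rather than full code points, which is enough for anything in the Basic Multilingual Plane:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CodePointDump
{
    // Returns one line per UTF-16 code unit: index, U+XXXX value, Unicode category.
    public static List<string> Describe(string s)
    {
        var lines = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s[i]);
            lines.Add(string.Format("[{0}] U+{1:X4} {2}", i, (int)s[i], category));
        }
        return lines;
    }

    public static void Main()
    {
        // Sample input: digits separated by U+2011 (non-breaking hyphen).
        foreach (string line in Describe("123\u201145\u20116789"))
        {
            Console.WriteLine(line);
        }
    }
}
```

The U+XXXX value is what you then look up in a Unicode chart or character table.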

An important one to look out for is U+FFFD (�), as it is sometimes used when a decoder has received a bunch of bytes that make no sense in the context of the encoding it is trying to use. For example, 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker. Other possible responses are throwing an error, silently ignoring the error, or trying to guess at intent, though those last two bring security issues.
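In .NET you can see two of those responses with the same pair of bytes. This is a sketch with hypothetical class and method names; Encoding.UTF8 substitutes U+FFFD by default, while a UTF8Encoding constructed with throwOnInvalidBytes set to true raises an exception instead:

```csharp
using System;
using System.Text;

public static class ReplacementCharSketch
{
    // Default UTF-8 decoder: invalid bytes are replaced with U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Strict UTF-8 decoder: invalid bytes cause a DecoderFallbackException.
    public static string DecodeStrict(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes
        return strict.GetString(bytes);
    }

    public static void Main()
    {
        byte[] bad = { 0x80, 0x20 }; // lone continuation byte, then a space

        string lenient = DecodeLenient(bad);
        Console.WriteLine(lenient.IndexOf('\uFFFD') >= 0);

        try
        {
            DecodeStrict(bad);
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("strict decoder rejected the bytes");
        }
    }
}
```

If your deserialized strings contain U+FFFD, the damage happened at a decoding step; if they contain something else, such as U+2011, the data may be fine.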

Once you've figured this out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an encoding issue (the charset written in is not the charset read in)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.

Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing, it's fine", possibly something simple or something hard. Can't say yet.

Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.

Jon Hanna
Well the XML in the database reads fine as "123-45-6789" but when I deserialize the XML back into an object on the client it reads "123[]45[]6789", is this all back to encoding?
mint
Possibly. What is the value of (int)"123[]45[]6789"[3] here? That'll tell us a lot. For that matter, are you sure that the fourth character in the first string is - and not ‐,‑,–,‒ or — (indeed, ‒ would be better typography than - though not as useful in machine-readable contexts). That'll tell us something too.
Jon Hanna
Well, when I ran this in the debugger, ?Char.GetNumericValue("12‑34‑5678", 2) returned -1.0
mint
No, the actual code point of the third character, that you get if you convert it to an int directly.
Jon Hanna
It returned 8209
mint
That's "non-breaking hyphen". This is an encoding problem.
Hans Passant
Yep. Next question is what value you get if you do the same thing on the other string, the one with the boxes.
Jon Hanna
(Unless that was from the one with the boxes, in which case the data is fine and it's just that the font used to display it doesn't deal with it).
Jon Hanna
@Hans Where do I need to set the encoding for it? When I serialize to XML? I don't recall setting encoding anywhere (which is probably the problem).
mint
@Jon that was the one with the boxes, in the database it reads as '-' but when pulled down gets turned to those boxes
mint
The XML should already have an encoding declared in the XML declaration (first line of the document). This may go wrong when you read the web service response and convert it to a string.
Hans Passant
If that's 8209, then I think the problem isn't with the serialisation at all. 8209 is U+2011, which is ‑, which is a perfectly sensible character to have there. However, if whatever you are using to look at this data doesn't have a glyph for it, then it's going to be corrupted at that very final stage where you look at it, not in the code in between. If that code is test code, then it's your test that's broken. If that code also has real-use value, then that's where you need to fix it.
Jon Hanna