From my database I am getting a very long string that is basically XML. I need to convert it to a byte array.

I can't get my head around the potential encoding issues.

What do I need to be careful of when doing this conversion?

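    // Both methods require: using System.Text;

    // ASCII is 7-bit: any character above U+007F is replaced with '?'.
    public static byte[] StringToByteArray1(string str)
    {
        return Encoding.ASCII.GetBytes(str);
    }

    // UTF-8 can represent every Unicode character, using 1 to 4 bytes each.
    public static byte[] StringToByteArray2(string str)
    {
        return Encoding.UTF8.GetBytes(str);
    }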

Encoding.ASCII.GetBytes vs Encoding.UTF8.GetBytes

A: 

You should use ASCII only for compatibility with legacy code, where the data truly is ASCII. Note that ASCII is a 7-bit encoding, so it cannot represent accented letters or any other character above U+007F.
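If you want to find out whether the data "truly is ASCII" instead of silently losing characters, one option is an ASCII encoder that throws on anything it cannot represent. This is a minimal sketch; the class and method names (StrictAsciiExample, ToAsciiStrict) are made up for illustration, and only the Encoding.GetEncoding overload with explicit fallbacks comes from the framework.

```csharp
using System;
using System.Text;

public static class StrictAsciiExample
{
    // Hypothetical helper: returns the ASCII bytes of str, or throws
    // EncoderFallbackException if str contains a non-ASCII character.
    public static byte[] ToAsciiStrict(string str)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        return strictAscii.GetBytes(str);
    }

    public static void Main()
    {
        // Pure ASCII input encodes to one byte per character.
        Console.WriteLine(ToAsciiStrict("<a/>").Length);

        try
        {
            // U+03C0 (Greek small letter pi) is not ASCII, so this throws.
            ToAsciiStrict("<a>\u03C0</a>");
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("Input is not pure ASCII");
        }
    }
}
```

Failing loudly like this is usually easier to debug than discovering "?" characters in the output later.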

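UTF-8 is generally OK (others may disagree); it uses 8-bit code units, with one to four bytes per character. I prefer Unicode (UTF-16, exposed in .NET as Encoding.Unicode), since it matches how .NET stores strings in memory.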

What you're doing with the byte array will impact whether you want ASCII, UTF-8, or Unicode.
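To make the size difference concrete, here is a small sketch that counts the bytes each of the three encodings produces for the same string. The class name, method name, and sample XML are assumptions for illustration only.

```csharp
using System;
using System.Text;

public static class EncodingSizeExample
{
    // Hypothetical helper: byte counts for ASCII, UTF-8 and UTF-16 (in that order).
    public static int[] ByteCounts(string str)
    {
        return new[]
        {
            Encoding.ASCII.GetByteCount(str),   // 1 byte per char, non-ASCII becomes '?'
            Encoding.UTF8.GetByteCount(str),    // 1 to 4 bytes per character
            Encoding.Unicode.GetByteCount(str)  // UTF-16: 2 bytes per char, no BOM counted
        };
    }

    public static void Main()
    {
        // Sample contains one non-ASCII character: U+00E9 (e with acute accent).
        int[] counts = ByteCounts("<name>Jos\u00E9</name>");
        Console.WriteLine("ASCII:  " + counts[0]);  // 17
        Console.WriteLine("UTF-8:  " + counts[1]);  // 18
        Console.WriteLine("UTF-16: " + counts[2]);  // 34
    }
}
```

For mostly-ASCII text such as typical XML markup, UTF-8 ends up close to the ASCII size, while UTF-16 is roughly double.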

Here's a nice reference.

Jon B
+2  A: 

If the string contains only ASCII characters, then the two methods produce identical results.

On the other hand, if you have non-ASCII characters in your string (for example π), then:

in ASCII encoding, each of them will be replaced by "?";

in UTF-8, each of them will be represented by a sequence of two to four bytes.
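A quick round trip shows the difference. This is only a sketch with made-up class and method names; it encodes the single character π (U+03C0) with each encoding and decodes it again.

```csharp
using System;
using System.Text;

public static class PiRoundTripExample
{
    // Hypothetical helper: encode with the given encoding, then decode again.
    public static string RoundTrip(string str, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(str);
        return encoding.GetString(bytes);
    }

    // Hypothetical helper: hex dump of the encoded bytes, e.g. "CF-80".
    public static string Hex(string str, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(str));
    }

    public static void Main()
    {
        string pi = "\u03C0";

        Console.WriteLine(Hex(pi, Encoding.ASCII));  // 3F, the byte for '?'
        Console.WriteLine(Hex(pi, Encoding.UTF8));   // CF-80, a two-byte sequence

        // ASCII loses the character; UTF-8 preserves it.
        Console.WriteLine(RoundTrip(pi, Encoding.ASCII) == "?");  // True
        Console.WriteLine(RoundTrip(pi, Encoding.UTF8) == pi);    // True
    }
}
```

Note that the ASCII replacement is silent: no exception is thrown, and the original character cannot be recovered from the bytes.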

It is probably worth pointing out that .NET uses UTF-16 internally to encode its strings.

In general, though, you are probably best off using UTF-8 unless you have a specific reason not to.

Oliver Hallam
+1  A: 

What encoding to use, when converting strings to bytes and exporting them from your application, depends 100% on the program that is going to be reading these bytes and interpreting them as strings.

For example, if you are writing a file that is to be read by a program that requires ASCII-encoded files, then you have to use ASCII. If the reading program requires code page 850, then you need to use that encoding; if it requires UTF-8, then you use UTF-8; and so on.

However, if you are writing to a file that is going to be read by your own program, I would suggest that you use UTF-8, because that encoding seems to be becoming the de facto standard.
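As a rough sketch of that last case, the writer and the reader simply name the same encoding explicitly rather than relying on defaults. The class and method names are hypothetical, and the temporary file stands in for whatever file your program really uses.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8FileExample
{
    // Hypothetical helper: writes text as UTF-8 and reads it back as UTF-8.
    public static string RoundTrip(string text)
    {
        string path = Path.GetTempFileName();
        try
        {
            // UTF8Encoding(false) writes no byte order mark; this is a choice,
            // not a requirement.
            File.WriteAllText(path, text, new UTF8Encoding(false));
            return File.ReadAllText(path, Encoding.UTF8);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string xml = "<value>\u03C0</value>";
        Console.WriteLine(RoundTrip(xml) == xml);  // True
    }
}
```

The important part is that both sides agree; which encoding they agree on matters less than making the agreement explicit in the code.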

Finally, you should know what encoding is about and how to use it. So if you haven't read it yet, you have to read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". It is a very good article. Read it! Yes, you have to.

Hope this helps!

Alfred B. Thordarson