views:

24458

answers:

14

How do I convert a string to a byte array in .NET (C#)?

Update: Also please explain why encoding should be taken into consideration. Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?!!!

+4  A: 
byte[] strToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}
gkrogers
But, why should encoding be taken into consideration? Why can't I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn't the String object itself know what encoding is being used and simply dump what is in memory?
Vulcan Eager
This doesn't always work. Some special characters can get lost in using such a method I've found the hard way.
JB King
+53  A: 

it depends on the encoding of your string (ascii, utf8, ...)

e.g.:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

Update: A small sample why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //will print 1
Console.WriteLine (utf8.Length); //will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //will print '?'

Ascii simply isn't equipped to deal with special characters

internally, the .NET framework uses UTF16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...)

see msdn for more information

bmotmans
+1 FGW (Fastest Gun in the West).
AnthonyWJones
But, why should encoding be taken into consideration? Why can't I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn't the String object itself know what encoding is being used and simply dump what is in memory?
Vulcan Eager
A .NET strings are always encoded as Unicode. So use System.Text.Encoding.Unicode.GetBytes(); to get the set of bytes that .NET would using to represent the characters. However why would you want that? I recommend UTF-8 especially when most characters are in the western latin set.
AnthonyWJones
There's also System.Text.Encoding.Default
Joel Coehoorn
Also: the exact bytes used internally in the string _don't matter_ if the system that retrieves them doesn't handle that encoding or handles it as the wrong encoding. If it's all within .Net, why convert to an array of bytes at all. Otherwise, it's better to be explicit with your encoding
Joel Coehoorn
@Joel, Be careful with System.Text.Encoding.Default as it could be different on each machine it is run. That's why it's recommended to always specify an encoding, such as UTF-8.
Ash
+1  A: 
// C# to convert a string to a byte array.
public static byte[] StrToByteArray(string str)
{
    System.Text.ASCIIEncoding  encoding=new System.Text.ASCIIEncoding();
    return encoding.GetBytes(str);
}


// C# to convert a byte array to a string.
byte [] dBytes = ...
string str;
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
str = enc.GetString(dBytes);
cyberbobcat
1) That will lose data due to using ASCII as the encoding. 2) There's no point in creating a new ASCIIEncoding - just use the Encoding.ASCII property.
Jon Skeet
+13  A: 

You need to take the encoding into account, because 1 character could be represented by 1 or more bytes (up to about 6), and different encodings will treat these bytes differently.

Joel has a posting on this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Zhaph - Ben Duguid
"1 character could be represented by 1 or more bytes"I agree. I just want those bytes regardless of what encoding the string is in. The only way a string can be stored in memory is in bytes. Even characters are stored as 1 or more bytes. I merely want to get my hands on them bytes.
Vulcan Eager
+7  A: 

Have a look at Jon Skeet's answer in a post with the exact question. It will explain why you depend on encoding.

hmemcpy
+2  A: 

The key issue is that a glyph in a string takes 32 bits (16 bits for a character code) but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[], you need to pick one that avoids loss of information and that is easy to use by your client when she needs to map the byte[] back to a string.

Utf8 is a popular encoding, it is compact and not lossy.

Hans Passant
UTF-8 is compact only if the majority of your characters are in the English (ASCII) character set. If you had a long string of Chinese characters, UTF-16 would be a more compact encoding than UTF-8 for that string. This is because UTF-8 uses one byte to encode ASCII, and 3 (or maybe 4) otherwise.
Joel Mueller
True. But, how can you not know about encoding if you're familiar with handling Chinese text?
Hans Passant
+2  A: 

I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient with bytes. Specifically, the definition of a Char is "Represents a Unicode character".

take this example sample:

String str = "asdf éß";
String str2 = "asdf gh";
EncodingInfo[] info =  Encoding.GetEncodings();
foreach (EncodingInfo enc in info)
{
    System.Console.WriteLine(enc.Name + " - " 
      + enc.GetEncoding().GetByteCount(str)
      + enc.GetEncoding().GetByteCount(str2));
}

Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.

So if you just want the bytes used by the string, simply use Encoding.Unicode, but it will be inefficient with storage space.

Ed Marty
+4  A: 

The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?

The answer is that the bytes used internally by the string class don't matter.

If your program is entirely within the .Net world then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to worry about transmitting the data. You don't worry about the actual bytes any more: the Serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. But that's just not important compared to making sure that your output is understood at the other end, and to guarantee that you must be explicit with your encoding. If you really want to match your internal bytes, just use the Unicode encoding.

Joel Coehoorn
There are areas in .NET where you do have to get byte arrays for strings. Many of the .NET Cryptrography classes contain methods such as ComputeHash() that accept byte array or stream. You have no alternative but to convert a string to a byte array first (choosing an Encoding) and then optionally wrap it in a stream. However as long as you choose an encoding (ie UTF8) an stick with it there are no problems with this.
Ash
A: 

If it is C/C++, you can cast the pointer to (unsigned char *). But since you are using C#, the best way to fake contiguous memory representation of string(unicode or otherwise) to array of bytes is to write the string to file. Then read it back as byte[].

Or research something about serializing the string to MemoryStream. Then deserialize it to byte array

Hao
+15  A: 
BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());
Michael Buen
THIS IS AWESOME!!
Vulcan Eager
You could use the same BinaryFormatter instance for all of those operations
Joel Coehoorn
Excellent, I was always scared when using Encoding.Default or similar. +1
Pierre-Alain Vigeant
A: 

Two ways:

public static byte[] StrToByteArray(this string s)
{
    List<byte> value = new List<byte>();
    foreach (char c in s.ToCharArray())
        value.Add(c.ToByte());
    return value.ToArray();
}

And,

public static byte[] StrToByteArray(this string s)
{
    s = s.Replace(" ", string.Empty);
    byte[] buffer = new byte[s.Length / 2];
    for (int i = 0; i < s.Length; i += 2)
        buffer[i / 2] = (byte)Convert.ToByte(s.Substring(i, 2), 16);
    return buffer;
}

I tend to use the bottom one more often than the top, haven't benchmarked them for speed.

What about multibyte characters?
Vulcan Eager
A: 

Also please explain why encoding should be taken into consideration. Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?!!!

Because there is no such thing as "the bytes of the string".

A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters to bytes. How do you do that? Here's where encodings come to the scene.

An encoding is nothing but a convention to translate logical characters to phyisical bytes. The simplest and best known encoding is ASCII, and it is all you need if you write in english. For other languages you will need more complete encodings, being any of the Unicode flavours the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

Konamiman
Allow me to clarify: An encoding has been used to translate "hello world" to physical bytes. Since the string is stored on my computer, I am sure that it must be stored in bytes. I merely want to access those bytes to save them on disk or for any other reason. I do not want to interpret these bytes. Since I do not want to interpret these bytes, the need for an encoding at this point is as misplaced as requiring a phone line to call printf.
Vulcan Eager
But again, there is no concept of text-to-physical-bytes-translation unless yo use an encoding. Sure, the compiler stores the strings somehow in memory - but it is just using an internal encoding, which you (or anyone except the compiler developer) do not know. So, whatever you do, you need an encoding to get physical bytes from a string.
Konamiman
A: 

By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom:

http://www.joelonsoftware.com/articles/Unicode.html

Konamiman
A: 

Fastest way

public static byte[] GetBytes(string text)
{
    return ASCIIEncoding.UTF8.GetBytes(text);
}
Sunrising