How many bits is a .NET string that's 10 characters in length? (.NET strings are UTF-16, right?)

+1  A: 

If you are talking about pure 16-bit Unicode, then:

10 characters = 20 bytes = 160 bits

This really needs a context in order to be answered properly.

John Gietzen
Keep in mind that there are multiple 16-bit encodings for Unicode. What you're talking about here is UCS-2, which always uses 2 bytes per character. UTF-16, on the other hand, uses one or two 16-bit code units (2 or 4 bytes) per code point, so it could take more than 20 bytes to store 10 characters. Then again, it depends on your definition of a character.
Trillian
God dammit! Why is Unicode so complicated!
John Gietzen
+7  A: 

On 32-bit systems:

4 bytes          = Type pointer (Every object has one of these)
4 bytes          = Lock         (One of these too!)
4 bytes          = Length       (Need the length)
2 * Length bytes = Data         (And the chars themselves)
=======================
12 + 2*Length bytes
=======================
96 + 16*Length bits

So 10 chars would = 256 bits = 32 bytes

I am not sure if the Lock grows to 64-bit on 64-bit systems. I kinda hope not, but you never know. The 64-bit structure overhead is therefore anywhere from 16-20 bytes (as opposed to the 12 bytes on 32-bit).
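To make the arithmetic above easy to re-check, here is a minimal sketch that just encodes the 32-bit breakdown as a formula. The class and method names (StringSizeEstimate, EstimateBytes32, EstimateBits32) are made up for illustration, and the numbers are the estimates from this answer, not something read from the runtime; any null terminator or alignment padding is ignored.

```csharp
using System;

public static class StringSizeEstimate
{
    // Header from the breakdown above: type pointer + lock + length field.
    // These are the answer's 32-bit estimates, not values queried from the CLR.
    public const int HeaderBytes32 = 4 + 4 + 4;

    // Estimated size in bytes of a string with `length` chars (2 bytes each).
    public static int EstimateBytes32(int length)
    {
        return HeaderBytes32 + 2 * length;
    }

    // Same estimate expressed in bits.
    public static int EstimateBits32(int length)
    {
        return 8 * EstimateBytes32(length);
    }

    public static void Main()
    {
        Console.WriteLine(EstimateBytes32(10)); // 32
        Console.WriteLine(EstimateBits32(10));  // 256
    }
}
```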

Frank Krueger
Oh, if you want to go that route, there's no lock, but a vtable instead.
Gonzalo
Well, this is pedantic, but... isn't there string interning going on?
John Gietzen
@Gonzalo, The vtable is the first "Type pointer" (a lot more than just a vtable). Are you sure there's no lock?
Frank Krueger
I'm sorry, but most people who count bits **are** pedantic. :-)
Frank Krueger
I was looking at the Mono implementation. Actually the first 8 bytes are the vtable *and* the lock. You were right.
Gonzalo
Btw, it's (2 * length + 2). We all forgot about the two 0 bytes at the end.
Gonzalo
@Gonzalo: Why would they store a null terminator? First of all, it's redundant when you have a length, and second .NET strings can have embedded nulls ('\0'), so even when present a null termination doesn't necessarily indicate the end of the string.
280Z28
Looking at the Mono code, the unmanaged allocation takes the extra space for the NUL. But everything that touches the string uses the length stored in the string object.
Gonzalo
+4  A: 

Every char in the string is two bytes in size, so if you are just converting the chars directly and not using any particular encoding, the answer is string.Length * 2 * 8

Otherwise the result depends on the encoding. For a 10-character ASCII string str, you can write:

int numbits = System.Text.Encoding.UTF8.GetByteCount(str) * 8; // returns 80

or

int numbits = System.Text.Encoding.Unicode.GetByteCount(str) * 8; // returns 160
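The 80 and 160 figures only hold for ASCII content. As a rough sketch of how the count moves with the encoding and with the text itself, the helper below (EncodedBitCount.Bits, a name invented here) wraps the same GetByteCount call; the sample strings are assumptions chosen for illustration.

```csharp
using System;
using System.Text;

public static class EncodedBitCount
{
    // Number of bits the string occupies once encoded (no BOM, no object overhead).
    public static int Bits(string s, Encoding encoding)
    {
        return encoding.GetByteCount(s) * 8;
    }

    public static void Main()
    {
        string ascii = "0123456789";   // 10 ASCII characters
        string accented = "caf\u00E9"; // 4 chars; the last one (é) is not ASCII

        Console.WriteLine(Bits(ascii, Encoding.UTF8));       // 80
        Console.WriteLine(Bits(ascii, Encoding.Unicode));    // 160
        Console.WriteLine(Bits(accented, Encoding.UTF8));    // 40 (é takes 2 bytes in UTF-8)
        Console.WriteLine(Bits(accented, Encoding.Unicode)); // 64
    }
}
```

Note that Encoding.Unicode is UTF-16 little-endian in .NET, so its count is always 2 bytes per char in the string.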
RRUZ
A: 

It all comes down to how you define a character and how you store the data.

If you define a character as a single letter from the user's point of view, it can be more than 2 bytes. For example, the character Å can be written as two Unicode code points (U+0041 U+030A, Latin Capital Letter A + Combining Ring Above), in which case it requires two .NET chars, or 4 bytes in UTF-16.

Now, even if you are talking about 10 .NET Char elements, if the string is in memory you have some object overhead (already mentioned) and a bit of alignment overhead (on a 32-bit system everything has to be aligned to a 4-byte boundary; on 64-bit the rules are more complicated), so you may have some empty bytes at the end.

If you are talking about databases or files, then each database and file system has its own overhead.
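To illustrate the "definition of a character" point, here is a small sketch comparing the number of .NET chars with the number of user-perceived characters, using StringInfo.LengthInTextElements from System.Globalization. The class name CharCounts and the sample strings are made up for this example.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CharCounts
{
    // Number of text elements (roughly, user-perceived characters) in the string.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string decomposed = "A\u030A"; // Latin Capital Letter A + Combining Ring Above
        string emoji = "\U0001F600";   // one code point outside the BMP (surrogate pair)

        Console.WriteLine(decomposed.Length);                         // 2 chars
        Console.WriteLine(TextElements(decomposed));                  // 1 text element
        Console.WriteLine(Encoding.Unicode.GetByteCount(decomposed)); // 4 bytes

        Console.WriteLine(emoji.Length);                              // 2 chars
        Console.WriteLine(TextElements(emoji));                       // 1 text element
        Console.WriteLine(Encoding.Unicode.GetByteCount(emoji));      // 4 bytes
    }
}
```

So a string of 10 such "characters" would have a Length of 20 and take 40 bytes of UTF-16 data, before any object overhead.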

Nir