ansaurus

Question

Bit/byte conversion

Answer 1

+1 A:

If you are talking about pure 16-bit Unicode (2 bytes per character), then:

10 characters = 20 bytes = 160 bits

This really needs a context in order to be answered properly.

John Gietzen 2009-11-11 05:50:15

Keep in mind that there are multiple 16-bit encodings for Unicode. What you're talking about here is UCS-2, which always uses 2 bytes per character. UTF-16, on the other hand, uses one or two 16-bit code units (2 or 4 bytes) to encode each code point, so it could take more than 20 bytes to store 10 characters. Then again, it depends on your definition of a character.

Trillian 2009-11-11 15:52:33
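The UCS-2 versus UTF-16 difference described in the comment above can be checked with a short sketch. It is written in Python rather than C# purely as an illustration; the helper name `utf16_byte_count` and the sample strings are assumptions, not part of the thread. The codec `utf-16-le` is used so that no byte order mark is counted.

```python
def utf16_byte_count(text: str) -> int:
    """Bytes needed to store text as UTF-16, without a byte order mark."""
    return len(text.encode("utf-16-le"))


# Ten code points inside the Basic Multilingual Plane: one 16-bit unit each.
bmp_text = "A" * 10

# Ten copies of U+1D11E (musical symbol G clef), which lies outside the BMP
# and therefore needs a surrogate pair (two 16-bit units) per code point.
astral_text = "\U0001D11E" * 10

print(utf16_byte_count(bmp_text))     # 20
print(utf16_byte_count(astral_text))  # 40
```

Both strings hold 10 code points, yet the second one takes twice the space, which is why "10 characters = 20 bytes" only holds when no surrogate pairs are involved.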

God dammit! Why is Unicode so complicated!

John Gietzen 2009-11-11 20:23:27

Answer 2

+7 A:

On 32-bit systems:

4 bytes          = Type pointer (Every object has one of these)
4 bytes          = Lock         (One of these too!)
4 bytes          = Length       (Need the length)
2 * Length bytes = Data         (And the chars themselves)
=======================
12 + 2*Length bytes
=======================
96 + 16*Length bits

So 10 chars = 12 + 20 = 32 bytes = 256 bits

I am not sure if the Lock grows to 64-bit on 64-bit systems. I kinda hope not, but you never know. The 64-bit structure overhead is therefore anywhere from 16-20 bytes (as opposed to the 12 bytes on 32-bit).

Frank Krueger 2009-11-11 05:53:29
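The arithmetic in the answer above can be written as a small sketch. This only restates the estimate from the answer (a 12-byte header plus 2 bytes per char on 32-bit systems); it does not measure a real runtime. It is written in Python for illustration, and the function names and the `header_bytes` parameter are assumptions introduced here.

```python
def estimated_string_bytes(length: int, header_bytes: int = 12,
                           bytes_per_char: int = 2) -> int:
    """Estimated in-memory size of a string object, per the layout above.

    header_bytes defaults to the 32-bit figure from the answer:
    4 (type pointer) + 4 (lock) + 4 (length).
    """
    return header_bytes + bytes_per_char * length


def estimated_string_bits(length: int, header_bytes: int = 12) -> int:
    """Same estimate expressed in bits."""
    return estimated_string_bytes(length, header_bytes) * 8


print(estimated_string_bytes(10))  # 32
print(estimated_string_bits(10))   # 256

# The answer guesses 16 to 20 bytes of overhead on 64-bit systems.
print(estimated_string_bytes(10, header_bytes=16))  # 36
print(estimated_string_bytes(10, header_bytes=20))  # 40
```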

Oh, if you want to go that route, there's no lock, but a vtable instead.

Gonzalo 2009-11-11 05:55:20

Well, this is pedantic, but... isn't there string interning going on?

John Gietzen 2009-11-11 05:56:32

@Gonzalo, The vtable is the first "Type pointer" (a lot more than just a vtable). Are you sure there's no lock?

Frank Krueger 2009-11-11 05:57:24

I'm sorry, but most people who count bits **are** pedantic. :-)

Frank Krueger 2009-11-11 05:58:12

I was looking at the Mono implementation. Actually the first 8 bytes are the vtable *and* the lock. You were right.

Gonzalo 2009-11-11 05:59:23

Btw, it's (2 * length + 2). We all forgot about the two 0 bytes at the end.

Gonzalo 2009-11-11 06:52:00

@Gonzalo: Why would they store a null terminator? First of all, it's redundant when you have a length, and second .NET strings can have embedded nulls ('\0'), so even when present a null termination doesn't necessarily indicate the end of the string.

280Z28 2009-11-11 07:00:58

Looking at the Mono code, the unmanaged allocation takes the extra space for the NUL. But everything that touches the string uses the length stored in the string object.

Gonzalo 2009-11-11 07:21:06

Answer 3

+4 A:

Every char in the string is two bytes in size, so if you are just converting the chars directly and not using any particular encoding, the answer is string.Length * 2 * 8

Otherwise the result depends on the encoding. For a string str of 10 ASCII characters you can write:

int numbits = System.Text.Encoding.UTF8.GetByteCount(str) * 8; // returns 80

or

int numbits = System.Text.Encoding.Unicode.GetByteCount(str) * 8; // returns 160

RRUZ 2009-11-11 05:54:27
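The figures of 80 and 160 bits above hold for a 10-character ASCII string; with non-ASCII text the UTF-8 count changes. The following sketch shows the same calculation in Python, as an illustration only. The helper `bit_count` and the sample strings are assumptions, and `utf-16-le` stands in for .NET's `Encoding.Unicode` (little-endian UTF-16, no byte order mark counted).

```python
def bit_count(text: str, encoding: str) -> int:
    """Number of bits needed to store text in the given encoding."""
    return len(text.encode(encoding)) * 8


ascii_text = "0123456789"      # 10 ASCII characters
accented_text = "\u00e9" * 10  # 10 copies of U+00E9 (e with acute accent)

print(bit_count(ascii_text, "utf-8"))         # 80
print(bit_count(ascii_text, "utf-16-le"))     # 160

# U+00E9 needs 2 bytes in UTF-8, so both encodings give the same size here.
print(bit_count(accented_text, "utf-8"))      # 160
print(bit_count(accented_text, "utf-16-le"))  # 160
```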

Answer 4

A:

It all comes down to how you define a character and how you store the data.

For example, if you define a character as a single letter from the user's point of view, it can take more than 2 bytes. This character: Å can be written as two Unicode code points (U+0041 U+030A, Latin Capital Letter A + Combining Ring Above), in which case it requires two .NET chars, or 4 bytes in UTF-16.

Even if you are talking about 10 .NET Char elements, if the string is in memory you have some object overhead (already mentioned) and a bit of alignment overhead (on a 32-bit system everything has to be aligned to a 4-byte boundary; on 64-bit the rules are more complicated), so you may have some empty bytes at the end.

If you are talking about a database or files, then each database and file system has its own overhead.

Nir 2009-11-11 15:38:43
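Both points in the answer above, the combining sequence and the alignment padding, can be illustrated with a short sketch. It is written in Python for illustration; the helper names `utf16_bytes` and `align_up` are assumptions, and the 4-byte boundary is simply the 32-bit figure quoted in the answer, not a measured value.

```python
import unicodedata


def utf16_bytes(text: str) -> int:
    """Bytes needed for text in UTF-16, without a byte order mark."""
    return len(text.encode("utf-16-le"))


def align_up(size: int, boundary: int = 4) -> int:
    """Round size up to the next multiple of boundary."""
    return (size + boundary - 1) // boundary * boundary


# One user-visible letter written as two code points: A + combining ring above.
decomposed = "A\u030A"
# NFC normalization composes the pair into the single code point U+00C5.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), utf16_bytes(decomposed))  # 2 4
print(len(composed), utf16_bytes(composed))      # 1 2

# Alignment: a 34-byte object is padded to 36 bytes, a 32-byte one is unchanged.
print(align_up(34))  # 36
print(align_up(32))  # 32
```

The same visible letter therefore costs 2 or 4 bytes depending on how it was typed or normalized, before any object or alignment overhead is added.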
