I'm writing some fairly string-manipulation-intensive code in C#.NET and got curious about some Joel Spolsky articles I remember reading a while back:

http://www.joelonsoftware.com/articles/fog0000000319.html
http://www.joelonsoftware.com/articles/Unicode.html

So, how does .NET do it? Two bytes per char? There ARE some Unicode characters (code points, strictly speaking) that need more than that. And how is the length encoded?

+11  A: 

Before Jon Skeet turns up, here is a link to his excellent blog on strings in C#.

In the current implementation at least, strings take up 20 + (n/2)*4 bytes (rounding n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies.
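
As a rough illustration of that formula (a sketch only: the 20-byte overhead and the 4-byte rounding reflect the 32-bit CLR layout described in the linked article, and actual sizes vary by runtime version and platform), you can tabulate the estimate for a few lengths:

    using System;

    class StringSizeEstimate
    {
        // 20 + (n / 2) * 4 bytes, with n / 2 rounded down (integer division),
        // where n is the number of characters in the string.
        static int EstimateBytes(int charCount) => 20 + (charCount / 2) * 4;

        static void Main()
        {
            foreach (int n in new[] { 0, 1, 2, 10, 100 })
            {
                Console.WriteLine($"n = {n,3}: ~{EstimateBytes(n)} bytes");
            }
        }
    }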

John Nolan
Bah humbug. Not a lot more for me to say, really :)
Jon Skeet
That'll teach you to blog!
John Nolan
It's not actually on my blog - it's on my articles site :) I think I ought to negotiate some sort of rep-sharing scheme. Pity a poor blogger/article poster...
Jon Skeet
@Jon: rep-sharing for the poor would involve redistributing your points ;)
Jimmy
@Jon: Don't be sad, I think there ARE a few things left to say. ;) The "memory usage" section of that post is more or less the answer to my question (plus the interesting tidbit about over-allocation), but how is the length (or lengths) stored? Signed dword? Unsigned word? Endianness?
JCCyC
+4  A: 

.NET uses UTF-16.

From System.String on MSDN:

"Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object."

Reed Copsey