Hello. Another question about Unicode, terminals, and now C# and wc. If I write this simple piece of code

  int i=0;
  foreach(char c in Console.In.ReadToEnd())
  {
    if(c!='\n') i++;
  }
  Console.WriteLine("{0}", i);

and give it only the character "€" as input (3 bytes in UTF-8), wc reports 3 characters (maybe using wint_t, though I haven't checked), but counting over ReadToEnd() gives 1 (one character). What exactly is the behavior of ReadToEnd in this case? How do I know what ReadToEnd is doing behind the scenes?

I'm running xterm with a UTF-8 locale (en_US.UTF-8), on Ubuntu Linux with Mono.

Thank you.

+3  A: 

wc and most Unix-like commands deal with characters in terms of the C char data type, which is usually an 8-bit integer. By default, wc simply reads the bytes from standard input one by one with no conversion, so it counts 3 "characters", one per byte.

.NET deals with characters in terms of its own Char data type, which is a 16-bit unsigned integer and represents a UTF-16 code unit. The Console class has received the 3 bytes of input, determined that the console it is attached to uses UTF-8, and properly converted them to a single UTF-16 euro character.
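To see the conversion for yourself, here is a minimal sketch. The class and method names (DecodeDemo, CountChars) are made up for illustration. It decodes the three UTF-8 bytes of the euro sign by hand and prints the encoding the console reports for standard input, which, as far as I know, is the one Console.In uses when decoding.

```csharp
using System;
using System.Text;

static class DecodeDemo
{
    // Hypothetical helper: decodes raw bytes with the given encoding and
    // returns the number of UTF-16 code units (System.Char values) produced.
    public static int CountChars(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw).Length;
    }

    static void Main()
    {
        // The three bytes that encode U+20AC (EURO SIGN) in UTF-8.
        byte[] euro = { 0xE2, 0x82, 0xAC };

        Console.WriteLine(euro.Length);                      // 3: the byte count
        Console.WriteLine(CountChars(euro, Encoding.UTF8));  // 1: the char count after decoding

        // Name of the encoding the console uses for standard input;
        // the value depends on your terminal and locale settings.
        Console.WriteLine(Console.InputEncoding.WebName);
    }
}
```

If the last line does not print a UTF-8 name on your machine, the counts you get from ReadToEnd() will probably differ from the ones in the question.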

rpetrich
So, a quick follow-up question. If I were to write the same program in C using wchar_t or wint_t, I'd be wasting (twice the) space. In this case it's trivial, because it's just 16 bits, but in huge files the difference is noticeable. Is this correct?
Dervin Thunk
It depends. If you are dealing with English text, an 8-bit char type and a Latin-1 or UTF-8 encoding will probably take up the least amount of space. If you are dealing with Chinese or Japanese text, UTF-8 will be less efficient than other encodings, and Latin-1 won't be able to represent your text at all. For such text, UTF-16, UCS-2 or one of the language-specific encodings would be more compact. Also note that it's much more complicated to work with encodings where characters have a variable number of bytes. Choosing a more compact encoding may make your text processing slower.
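A rough way to check the trade-off for your own data is to ask each encoding how many bytes it needs. This is only a sketch: SizeDemo and Size are names I made up, and the two sample strings are arbitrary five-character examples.

```csharp
using System;
using System.Text;

static class SizeDemo
{
    // Hypothetical helper: bytes needed to store the text in the given encoding
    // (no byte order mark is included in the count).
    public static int Size(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    static void Main()
    {
        string english = "hello";
        // Five hiragana characters, written as escapes so the source file
        // encoding does not matter.
        string japanese = "\u3053\u3093\u306B\u3061\u306F";

        Console.WriteLine(Size(english, Encoding.UTF8));      // 5
        Console.WriteLine(Size(english, Encoding.Unicode));   // 10 (UTF-16)
        Console.WriteLine(Size(japanese, Encoding.UTF8));     // 15
        Console.WriteLine(Size(japanese, Encoding.Unicode));  // 10 (UTF-16)
    }
}
```

So UTF-8 halves the size of the English sample compared to UTF-16, while for the Japanese sample it is 50% larger.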
rpetrich
+2  A: 

ReadToEnd returns a string. All strings in .NET are Unicode. They're not just an array of bytes.

Apparently, wc is returning the number of bytes. The number of bytes and the number of characters used to be the same thing.

John Saunders
+3  A: 

You need to take the character encoding into consideration. Currently you are counting bytes in one case and chars in the other, and chars and bytes are not necessarily the same size.

  using System;
  using System.Text;

  Encoding encoding = Encoding.UTF8;
  string s = "€";

  // Number of bytes needed to encode the string as UTF-8.
  int byteCount = encoding.GetByteCount(s);
  Console.WriteLine(byteCount); // prints "3" on the console

  // Encode to bytes, then ask how many chars those bytes decode to.
  byte[] bytes = new byte[byteCount];
  encoding.GetBytes(s, 0, s.Length, bytes, 0);
  int charCount = encoding.GetCharCount(bytes);
  Console.WriteLine(charCount); // prints "1" on the console
Jason
+1  A: 

wc, by default, returns the number of lines, words and bytes in a file. If you want it to return the number of characters according to the active locale's encoding rather than just the number of bytes, look at the -m or --chars option, which modern versions of wc have.
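If you want to reproduce both numbers from the C# side, something like the following sketch should work. WcSketch and Count are made-up names, and I am assuming the input is UTF-8 rather than consulting the locale the way wc -m does. It reads standard input as raw bytes, so nothing is decoded before the bytes are counted.

```csharp
using System;
using System.IO;
using System.Text;

static class WcSketch
{
    // Hypothetical helper: reads the whole stream and reports its size in
    // bytes and in chars, assuming the data is UTF-8.
    public static void Count(Stream input, out long bytes, out long chars)
    {
        using (var buffer = new MemoryStream())
        {
            input.CopyTo(buffer);
            byte[] raw = buffer.ToArray();
            bytes = raw.Length;                       // what wc -c would count
            chars = Encoding.UTF8.GetCharCount(raw);  // UTF-16 code units after decoding
        }
    }

    static void Main()
    {
        long bytes, chars;
        Count(Console.OpenStandardInput(), out bytes, out chars);
        Console.WriteLine("{0} bytes, {1} chars", bytes, chars);
    }
}
```

For an input of "€" followed by a newline this reports 4 bytes and 2 chars, since the newline is counted too. Note that the char count is in UTF-16 code units, so for characters outside the Basic Multilingual Plane it can be higher than what wc -m reports.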

Charles Bailey