ansaurus

Question

TextReader.ReadToEnd vs. Unix wc Command

Answer 1

+3 A:

wc and most Unix commands deal with characters in terms of the C char data type, which is usually an 8-bit integer (whether it is signed depends on the implementation). wc simply reads the bytes from standard input one by one with no conversion, so it counts 3 bytes and reports them as 3 "characters".

.NET deals with characters in terms of its own Char data type, which is a 16-bit unsigned integer representing a UTF-16 code unit. The Console class received the 3 bytes of input, determined that the console it is attached to uses UTF-8, and correctly converted them to a single UTF-16 euro character.
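To make the difference concrete, here is a minimal sketch of the same conversion done by hand. It assumes the input really is UTF-8; the class name ReadToEndDemo and the helper Decode are made up for illustration, and a StreamReader over a MemoryStream stands in for the console's reader.

```csharp
using System;
using System.IO;
using System.Text;

class ReadToEndDemo
{
    // Hypothetical helper: decodes raw bytes the way a UTF-8 text reader would.
    public static string Decode(byte[] raw)
    {
        using (var stream = new MemoryStream(raw))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // The UTF-8 encoding of U+20AC (euro sign) is the three bytes E2 82 AC.
        byte[] raw = { 0xE2, 0x82, 0xAC };
        string text = Decode(raw);

        Console.WriteLine(raw.Length);        // prints "3": the byte count
        Console.WriteLine(text.Length);       // prints "1": UTF-16 code units in the string
        Console.WriteLine(text == "\u20AC");  // prints "True"
    }
}
```

The byte array has length 3, which matches what wc sees, while the decoded string has length 1, which matches what ReadToEnd().Length reports.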

rpetrich 2009-07-23 02:32:06

So, a quick follow-up question. If I were to write the same program in C using wchar_t or wint_t, I'd be using twice the space. In this case it's trivial, because it's just 16 bits, but in huge files the difference is noticeable. Is this correct?

Dervin Thunk 2009-07-23 03:08:06

It depends. If you are dealing with English text, an 8-bit char type and a Latin-1 or UTF-8 encoding will probably take up the least space. If you are dealing with Chinese or Japanese text, UTF-8 will be less efficient than other encodings, and Latin-1 won't be able to represent your text at all. For such text, UTF-16, UCS-2 or one of the language-specific encodings would be more compact. Also note that encodings in which characters take a variable number of bytes are much more complicated to work with, so choosing a more compact encoding may make your text processing slower.
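As a rough illustration of the size trade-off, the following sketch compares the encoded size of two short sample strings. The strings, the class name EncodingSizeDemo and the helper names are arbitrary choices for the example; Encoding.Unicode is .NET's name for UTF-16 little-endian.

```csharp
using System;
using System.Text;

class EncodingSizeDemo
{
    // Hypothetical helpers returning the encoded size of a string in bytes.
    public static int Utf8Size(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static int Utf16Size(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }

    static void Report(string label, string text)
    {
        Console.WriteLine("{0}: UTF-8 = {1} bytes, UTF-16 = {2} bytes",
            label, Utf8Size(text), Utf16Size(text));
    }

    static void Main()
    {
        // Five ASCII letters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
        Report("English", "hello");

        // Five hiragana characters (written as escapes so the source file
        // encoding does not matter): 3 bytes each in UTF-8, 2 in UTF-16.
        Report("Japanese", "\u3053\u3093\u306B\u3061\u306F");
    }
}
```

For these samples the English string takes 5 bytes in UTF-8 against 10 in UTF-16, while the Japanese string takes 15 bytes in UTF-8 against 10 in UTF-16. Real text mixes scripts, punctuation and markup, so actual ratios will vary.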

rpetrich 2009-07-23 03:39:48

Answer 2

+2 A:

ReadToEnd returns a string. All strings in .NET are Unicode. They're not just an array of bytes.

Apparently, wc is returning the number of bytes. The number of bytes and the number of characters used to be the same thing, back when single-byte encodings such as ASCII were the norm.

John Saunders 2009-07-23 02:32:22

Answer 3

+3 A:

You need to take the character encoding into consideration. Currently you are merely counting bytes, and chars and bytes are not necessarily the same size.

using System;
using System.Text;

class Program
{
    static void Main()
    {
        Encoding encoding = Encoding.UTF8;
        string s = "€";

        // Number of bytes needed to encode the string as UTF-8.
        int byteCount = encoding.GetByteCount(s);
        Console.WriteLine(byteCount); // prints "3" on the console

        // Encode to bytes, then count the characters those bytes decode to.
        byte[] bytes = encoding.GetBytes(s);
        int charCount = encoding.GetCharCount(bytes);
        Console.WriteLine(charCount); // prints "1" on the console
    }
}

Jason 2009-07-23 02:33:50

Answer 4

+1 A:

wc, by default, returns the number of lines, words and bytes in a file. If you want it to return the number of characters according to the active locale's encoding rather than just the number of bytes, look at the -m or --chars option, which modern versions of wc have.
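For comparison, here is a rough .NET sketch that reports both numbers for standard input. It is not a reimplementation of wc: it assumes the input is UTF-8 instead of consulting the locale, it counts Unicode code points as "characters", and the names WcSketch and CountCharacters are invented for the example.

```csharp
using System;
using System.IO;
using System.Text;

class WcSketch
{
    // Counts Unicode code points in UTF-8 data. A surrogate pair in the
    // decoded UTF-16 string is counted once by skipping low surrogates.
    public static int CountCharacters(byte[] data)
    {
        string text = Encoding.UTF8.GetString(data);
        int count = 0;
        foreach (char c in text)
        {
            if (!char.IsLowSurrogate(c))
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        // Read standard input as raw bytes, without any decoding.
        using (Stream stdin = Console.OpenStandardInput())
        using (var buffer = new MemoryStream())
        {
            stdin.CopyTo(buffer);
            byte[] data = buffer.ToArray();

            Console.WriteLine("bytes: {0}", data.Length);           // comparable to wc -c
            Console.WriteLine("chars: {0}", CountCharacters(data)); // comparable to wc -m in a UTF-8 locale
        }
    }
}
```

Fed a single euro sign encoded as UTF-8, this would be expected to report 3 bytes and 1 character.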

Charles Bailey 2009-07-23 05:11:56
