The string returned by string.Format seems to use some strange encoding. A space in the formatted output is represented by different byte values than a space in a string literal declared in source code.

The following test case demonstrates the problem:

[Test]
public void FormatSize_Regression() 
{
  string size1023 = FileHelper.FormatSize(1023);
  Assert.AreEqual("1 023 Bytes", size1023);
}

Fails:

    String lengths are both 11. Strings differ at index 1.
    Expected: "1 023 Bytes"
    But was:  "1 023 Bytes"
    ------------^

FormatSize method:

public static string FormatSize(long size) 
{
  if (size < 1024)
     return string.Format("{0:N0} Bytes", size);
  else if (size < 1024 * 1024)
     return string.Format("{0:N2} KB", (double)((double)size / 1024));
  else
     return string.Format("{0:N2} MB", (double)((double)size / (1024 * 1024)));
}

From VS immediate window when breakpoint is set on the Assert line:

size1023
"1 023 Bytes"

System.Text.Encoding.UTF8.GetBytes(size1023)
{byte[12]}
    [0]: 49
    [1]: 194 <--------- space is 194/160 here? Unicode bytes indicate that space should be the 160. What is the 194 then?
    [2]: 160
    [3]: 48
    [4]: 50
    [5]: 51
    [6]: 32
    [7]: 66
    [8]: 121
    [9]: 116
    [10]: 101
    [11]: 115
System.Text.Encoding.UTF8.GetBytes("1 023 Bytes")
{byte[11]}
    [0]: 49
    [1]: 32  <--------- space is 32 here
    [2]: 48
    [3]: 50
    [4]: 51
    [5]: 32
    [6]: 66
    [7]: 121
    [8]: 116
    [9]: 101
    [10]: 115

System.Text.Encoding.Unicode.GetBytes(size1023)
{byte[22]}
    [0]: 49
    [1]: 0
    [2]: 160 <----------- 160,0 here
    [3]: 0
    [4]: 48
    [5]: 0
    [6]: 50
    [7]: 0
    [8]: 51
    [9]: 0
    [10]: 32
    [11]: 0
    [12]: 66
    [13]: 0
    [14]: 121
    [15]: 0
    [16]: 116
    [17]: 0
    [18]: 101
    [19]: 0
    [20]: 115
    [21]: 0
System.Text.Encoding.Unicode.GetBytes("1 023 Bytes")
{byte[22]}
    [0]: 49
    [1]: 0
    [2]: 32 <----------- 32,0 here
    [3]: 0
    [4]: 48
    [5]: 0
    [6]: 50
    [7]: 0
    [8]: 51
    [9]: 0
    [10]: 32
    [11]: 0
    [12]: 66
    [13]: 0
    [14]: 121
    [15]: 0
    [16]: 116
    [17]: 0
    [18]: 101
    [19]: 0
    [20]: 115
    [21]: 0

Question: How is this possible?

+10  A: 

I suspect your current culture is using an interesting "thousands" separator - U+00A0, which is the non-breaking space character. That's not an entirely unreasonable thousands separator, to be honest... it means you shouldn't get text like this displayed:

The size of the file is 1
023 bytes.

Instead you'd get

The size of the file is
1 023 bytes.
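
To check which character a culture really uses, the separator's code points can be printed. This is only a minimal sketch: the class and method names (SeparatorProbe, Describe) are made up for illustration, and the output of the last line depends on the machine's culture settings, so none is shown for it.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SeparatorProbe
{
    // Hypothetical helper: renders each UTF-16 code unit of the text as U+XXXX.
    public static string Describe(string text)
    {
        var parts = new List<string>();
        foreach (char c in text)
        {
            parts.Add(string.Format("U+{0:X4}", (int)c));
        }
        return string.Join(" ", parts.ToArray());
    }

    static void Main()
    {
        Console.WriteLine(Describe("\u00A0")); // U+00A0 (non-breaking space)
        Console.WriteLine(Describe(" "));      // U+0020 (ordinary space)

        // The separator of whatever culture the current thread runs under.
        string separator = CultureInfo.CurrentCulture.NumberFormat.NumberGroupSeparator;
        Console.WriteLine(Describe(separator));
    }
}
```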

On my box, I get "1,023" instead. Do you want your FormatSize method to use the current culture, or a specific one? If it's the current culture, you should probably make your unit test specify the culture. I have a couple of wrapper methods I use for this:

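// Needs: using System; using System.Globalization; using System.Threading;

// Runs the action with the invariant culture as the thread's current culture.
internal static void WithInvariantCulture(Action action)
{
    WithCulture(CultureInfo.InvariantCulture, action);
}

// Temporarily switches the current thread's culture, restoring the original afterwards.
internal static void WithCulture(CultureInfo culture, Action action)
{
    CultureInfo original = Thread.CurrentThread.CurrentCulture;
    try
    {
        Thread.CurrentThread.CurrentCulture = culture;
        action();
    }
    finally
    {
        // Restore even if the action throws, so later tests are unaffected.
        Thread.CurrentThread.CurrentCulture = original;
    }
}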

so I can run:

WithInvariantCulture(() =>
{
    // Body of test
});

etc.

If you want to test for the exact string you're getting, you can use:

Assert.AreEqual("1\u00A0023 Bytes", size1023);
Jon Skeet
Thanks Jon, great explanation!
Marek
+4  A: 

Unicode code point 160 (U+00A0) is not represented in UTF-8 by the single byte 160, but by two bytes. Without checking, I'd wager those are 194 and 160.

In fact, any Unicode code point above 127 is represented by more than one byte in UTF-8.

And I guess that your CultureInfo uses a non-breaking space (160) as a thousands grouping separator, and not a simple space (32) like you type yourself.
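
The wager is easy to confirm. A small sketch (the names Utf8Demo and Utf8Bytes are just for illustration) that lists the UTF-8 bytes of a non-breaking space next to those of an ordinary space:

```csharp
using System;
using System.Globalization;
using System.Text;

static class Utf8Demo
{
    // Hypothetical helper: returns the UTF-8 bytes of the text as a comma-separated list.
    public static string Utf8Bytes(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        string[] parts = Array.ConvertAll(bytes, b => b.ToString(CultureInfo.InvariantCulture));
        return string.Join(", ", parts);
    }

    static void Main()
    {
        Console.WriteLine(Utf8Bytes("\u00A0")); // 194, 160
        Console.WriteLine(Utf8Bytes(" "));      // 32
    }
}
```

194 and 160 are 0xC2 and 0xA0, the two-byte UTF-8 form of U+00A0, which matches bytes [1] and [2] in the question's first dump.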

Ruben
+2  A: 

194, 160 is the UTF-8 encoding of code point 160: the non-breaking space (&nbsp; in HTML).

That makes sense, you don't want a single number to be considered several words.

In short, your test revealed a flawed assumption - great! However, in terms of a unit test, your test has issues; you should always include a CultureInfo object when converting to and from strings - otherwise your unit tests may fail depending on the logged-in user's culture settings. You expect a particular form of string formatting - make sure you explicitly state which CultureInfo you're expecting.
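
As a rough illustration of stating the culture explicitly, here is a sketch of the "Bytes" branch only. FormatBytes is a hypothetical variant for illustration, not the asker's actual FormatSize; it takes the culture as a parameter and passes it to string.Format:

```csharp
using System;
using System.Globalization;

static class ExplicitCultureDemo
{
    // Hypothetical variant of the "Bytes" branch that takes the culture explicitly.
    public static string FormatBytes(long size, CultureInfo culture)
    {
        return string.Format(culture, "{0:N0} Bytes", size);
    }

    static void Main()
    {
        // The invariant culture groups thousands with a comma.
        Console.WriteLine(FormatBytes(1023, CultureInfo.InvariantCulture)); // 1,023 Bytes
    }
}
```

With the culture passed in, the result no longer depends on the settings of whoever runs the test.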

Eamon Nerbonne
Thanks for your comment. The unit test is only part of regression testing before refactoring, and I included it here only to illustrate the problem; it is not an actual production unit test :)
Marek
+1  A: 

160 is a non-breaking space, which sort of makes sense, because you wouldn't want your number to be split across lines. But 194... Oh yeah. UTF-8 double bytes.

J. Steen
A: 

First of all, all strings in .NET are Unicode (UTF-16 internally), so the UTF-8 bytes tell you nothing about how the string is stored. Second, when comparing strings you should specify the culture info, and when using string.Format you should pass an IFormatProvider. This way you control which characters these functions use.
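
One possible way to control the separator through an IFormatProvider, sketched under the assumption that a plain space (U+0020) is the grouping character you want: clone the invariant NumberFormatInfo and override NumberGroupSeparator. The class and method names are made up for illustration.

```csharp
using System;
using System.Globalization;

static class PlainSpaceSeparatorDemo
{
    // Hypothetical helper: formats a byte count with an ordinary space between digit groups.
    public static string FormatBytes(long size)
    {
        // Clone() returns a writable copy; the invariant instance itself is read-only.
        var format = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        format.NumberGroupSeparator = " ";
        return string.Format(format, "{0:N0} Bytes", size);
    }

    static void Main()
    {
        Console.WriteLine(FormatBytes(1023)); // 1 023 Bytes
    }
}
```

NumberFormatInfo implements IFormatProvider, so it can be handed to string.Format directly.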

Jonathan van de Veen
+2  A: 

Maybe you could change the expected string in the Assert.AreEqual call to use CultureInfo.CurrentCulture.NumberFormat.NumberGroupSeparator instead of a literal space character?
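
Something along these lines, as a sketch. ExpectedFor is a hypothetical helper, and it assumes the culture groups digits in threes, so the separator falls between "1" and "023":

```csharp
using System;
using System.Globalization;

static class ExpectedStringDemo
{
    // Hypothetical helper: builds the expected text for 1023 from the culture's own separator.
    public static string ExpectedFor(CultureInfo culture)
    {
        string separator = culture.NumberFormat.NumberGroupSeparator;
        return "1" + separator + "023 Bytes";
    }

    static void Main()
    {
        CultureInfo culture = CultureInfo.CurrentCulture;
        string actual = string.Format(culture, "{0:N0} Bytes", 1023);

        // Ordinal comparison, so a non-breaking space is never treated as a plain one.
        Console.WriteLine(string.Equals(ExpectedFor(culture), actual, StringComparison.Ordinal));
    }
}
```

In the test itself this would read Assert.AreEqual(ExpectedFor(CultureInfo.CurrentCulture), size1023), assuming FormatSize keeps formatting with the current culture.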

Konamiman