I have a localization issue.

One of my industrious coworkers has replaced all the strings throughout our application with constants that are contained in a dictionary. That dictionary gets various strings placed in it once the user selects a language (English by default, but target languages are German, Spanish, French, Portuguese, Mandarin, and Thai).

For our test of this functionality, we wanted to change a button to include text which has a ñ character, which appears both in Spanish and in the Arial Unicode MS font (which we're using throughout the application).

Problem is, the ñ is appearing as a square block, as if the program did not know how to display it. When I debug into that particular string being read from disk, the debugger reports that character as a square block as well.

So where is the failure? I think it could be in a few places:

1) Notepad may not be Unicode-aware, so the ñ displayed there is not the same as what VS2008 expects, and so the program interprets the character as a square (EDIT: Notepad shows the same characters as VS; i.e., they both show the ñ, in the same place).

2) vs2008 can't handle ñ. I find that very, very hard to believe.

3) The text is read in properly, but the default font for vs2008 can't display it, which is why the debugger shows a square.

4) The text is not read in properly, and I should use something other than a regular StreamReader to get strings.

5) The text is read in properly, but the default String class in C# doesn't handle ñ well. I find that very, very hard to believe.

6) The version of Arial Unicode MS I have doesn't have ñ, despite it being listed as one of the 50k characters by http://www.fileinfo.info.

Anything else I could have left out?

Thanks for any help!

A: 

I was having a similar problem just the other day; see "Unicode characters not showing in System.Windows.Forms.TextBox". I was able to fix it by changing a TextBox to a RichTextBox.

Sean
This is on a button-- is there such a thing as a 'richbutton'?
mmr
+3  A: 

I would say that most certainly Notepad is the culprit. Notepad does not deal well with Unicode characters. If you want to hand-edit this file, use something like Notepad++, which can handle Unicode, and make sure you save the file as UTF-8. You can probably just use VS to edit the file and forget about Notepad or Notepad++ completely. .NET and Visual Studio are actually very good at handling accented characters. .NET strings are Unicode (UTF-16 in memory) and StreamReader assumes UTF-8 unless told otherwise, so the problem almost certainly lies with the encoding Notepad used when saving the file.
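One way to confirm that a file really was saved as UTF-8 is to decode its bytes with a strict decoder that throws on invalid sequences. The following is only a minimal sketch: the class name `Utf8Check` and the helper `IsValidUtf8` are made up for illustration, and the sample bytes stand in for whatever is in the actual strings file.

```csharp
using System;
using System.Text;

static class Utf8Check
{
    // Returns true when the bytes decode cleanly as UTF-8.
    // UTF8Encoding(false, true): no BOM on output, throw on invalid input bytes.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // "año" saved as UTF-8: ñ is the two bytes C3 B1.
        byte[] savedAsUtf8 = { 0x61, 0xC3, 0xB1, 0x6F };
        // "año" saved by a single-byte ANSI editor: ñ is the lone byte F1.
        byte[] savedAsAnsi = { 0x61, 0xF1, 0x6F };

        Console.WriteLine(IsValidUtf8(savedAsUtf8));
        Console.WriteLine(IsValidUtf8(savedAsAnsi));
    }
}
```

For a real file, the bytes could come from `File.ReadAllBytes(path)`. Note that a file containing only ASCII passes this check whatever the editor's setting was, so the test is only meaningful once an accented character is present.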

Kibbee
ok, I've just checked-- the tilde is visible in notepad, and in the visual studio editor, when opened. Or is that no guarantee?
mmr
ok, Jon Skeet had the answer, but I couldn't have made it work without notepad++. so, half credit? Thanks for the answer!
mmr
A: 

Have you tried using String.Format when assigning the button.Text property, and providing the proper IFormatProvider with a Spanish CultureInfo object?

I don't know if that would have an effect, but could help.

scottm
Of all that's holy, I really don't want to also have to go through and provide cultural contexts for all these strings. Maybe that's just laziness. It also doesn't work; I've tried to .ToString(new CultureInfo("es-ES")) with no success.
mmr
It also doesn't help if I switch to the Spanish language on the machine.
mmr
A: 

How are you reading the strings?

Have you tried reading the text file like this (with the encoding set to UTF-8)?

using (StreamReader sr = new StreamReader("file.txt", Encoding.UTF8))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // add your string to the dictionary
    }
}
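To make the idea concrete, here is a rough sketch of loading a string table from a UTF-8 file. The question does not say how the strings file is laid out, so the `key=value` line format, the class name `StringTableLoader`, and the key `ButtonNext` are all assumptions made purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class StringTableLoader
{
    // Reads "key=value" lines (a hypothetical format) from a UTF-8 text file.
    public static Dictionary<string, string> Load(string path)
    {
        var table = new Dictionary<string, string>();
        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int split = line.IndexOf('=');
                if (split <= 0) continue; // skip blank or malformed lines
                string key = line.Substring(0, split).Trim();
                table[key] = line.Substring(split + 1).Trim();
            }
        }
        return table;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Write a sample file as UTF-8 (no BOM); \u00F1 is ñ.
            File.WriteAllText(path, "ButtonNext=Ma\u00F1ana\n", new UTF8Encoding(false));
            Dictionary<string, string> table = Load(path);
            // Check the code point rather than trusting how the console renders it.
            Console.WriteLine(table["ButtonNext"] == "Ma\u00F1ana");
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

This only works if the file on disk really is UTF-8; if the editor saved it in another encoding, the reader will still produce replacement characters.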
scottm
+1  A: 

I have a very short guide to debugging Unicode problems. It's targeted at fetching text from databases, but the same principles apply in general.

The most important starting point IMO is to know what's actually in your string when it just shows a box. Dump the contents to the console, with code like this:

static void DumpString (string value)
{
    foreach (char c in value)
    {
        Console.Write ("{0:x4} ", (int)c);
    }
    Console.WriteLine();
}

Then look up the character in the code charts on unicode.org. I suspect you want U+00F1, but there may be another similar character with a different code point - I've been fooled by that before.
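As an illustration of what such a dump might reveal, the sketch below decodes the same three bytes two ways. It assumes the file was saved by an editor using a single-byte ANSI/Latin-1 encoding, which the question does not confirm; `EncodingMismatchDemo` and `ToHex` are names invented for this example, with `ToHex` mirroring DumpString above but returning the text instead of printing it.

```csharp
using System;
using System.Text;

static class EncodingMismatchDemo
{
    // Formats each UTF-16 code unit as four hex digits, separated by spaces.
    public static string ToHex(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            sb.AppendFormat("{0:x4} ", (int)c);
        }
        return sb.ToString().TrimEnd();
    }

    public static void Main()
    {
        // "año" as a single-byte editor would store it: ñ is the lone byte 0xF1.
        byte[] ansiBytes = { 0x61, 0xF1, 0x6F };

        // Decoding as UTF-8 cannot make sense of 0xF1 on its own,
        // so the decoder substitutes U+FFFD (the replacement character).
        string readAsUtf8 = Encoding.UTF8.GetString(ansiBytes);

        // Decoding with the encoding the bytes were written in recovers U+00F1.
        string readAsLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(ansiBytes);

        Console.WriteLine(ToHex(readAsUtf8));   // 0061 fffd 006f
        Console.WriteLine(ToHex(readAsLatin1)); // 0061 00f1 006f
    }
}
```

If the dump shows fffd where 00f1 was expected, the character was already lost at read time, and no font or culture setting on the button will bring it back.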

Jon Skeet
Yeah, I was just ignorant here. Thanks for the guide, I'll have to peruse it more fully, but I've gotten the display working. The notepad++ answer was key, however, because that gave me the tool; vs2008 wasn't helping during string editing.
mmr
A: 

Have you checked that your source file encoding is really UTF-8? This may not apply to a default VS2008 install, but the IDE may detect your OS's default locale (or filesystem encoding) and set a matching non-UTF-8 encoding for all your files. You might want to try the doubly encoded mess (which you often come across on the web) "ñ" without changing anything else in your setup, to test for encoding mismatches.

I have been bitten by this when I had to work on a coworker's code, written in god-knows-what editor in god-knows-what encoding.

I'm fairly sure all your API calls assume UTF-8, so all your text is interpreted as UTF-8 even when it is not.
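For reference, this is roughly how that two-character mess arises: the text is written as UTF-8 and then read back as a single-byte encoding. The sketch uses ISO-8859-1 for the misreading, which agrees with Windows-1252 for these two bytes; the names `MojibakeDemo` and `MisreadUtf8AsLatin1` are invented for the example.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    public static string MisreadUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    public static void Main()
    {
        string garbled = MisreadUtf8AsLatin1("\u00F1"); // input is ñ

        // Print code points so the result does not depend on the console's encoding.
        foreach (char c in garbled)
        {
            Console.Write("{0:x4} ", (int)c);
        }
        Console.WriteLine(); // 00c3 00b1, i.e. the characters Ã and ±
    }
}
```

Seeing two characters where one was expected points to UTF-8 bytes being read as a single-byte encoding; seeing a box or U+FFFD points to the opposite mismatch.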

artificialidiot
A: 

To read Spanish characters (ñ, á, é, etc.) correctly, you can try code page 1252 as the Encoding.

Rafa
Unicode is a much better approach, as it will be more internationalizable. Check out @JonSkeet's guide, it was very helpful to me.
mmr