views:

2928

answers:

7

Can somebody please provide me with some important aspects I should be aware of while handling Unicode strings in C#?

+6  A: 

C# (and .NET in general) handles Unicode strings transparently, and you won't have to do anything special unless your application needs to read/write files with specific encodings. In those cases, you can convert managed strings to byte arrays in the encoding of your choice by using the classes in the System.Text namespace (such as System.Text.Encoding).
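
A minimal sketch of that conversion (the sample string is just for illustration):

    using System;
    using System.Text;

    class EncodingDemo
    {
        static void Main()
        {
            string text = "héllo wörld";

            // Encode the managed (UTF-16) string to bytes in a specific encoding.
            byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

            // Decode the bytes back into a managed string.
            string roundTripped = Encoding.UTF8.GetString(utf8Bytes);

            Console.WriteLine(roundTripped == text); // True
        }
    }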

C. Lawrence Wenham
+1  A: 

Only think about encoding when reading and writing streams. Use TextReaders and TextWriters to read and write text in different encodings. Always use UTF-8 if you have a choice.
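
A small sketch of that, assuming a throwaway file name:

    using System;
    using System.IO;
    using System.Text;

    class ReadWriteDemo
    {
        static void Main()
        {
            string path = "example.txt"; // placeholder path for this sketch

            // Write text as UTF-8 (StreamWriter derives from TextWriter).
            using (TextWriter writer = new StreamWriter(path, false, Encoding.UTF8))
            {
                writer.WriteLine("Grüße, 世界");
            }

            // Read it back with the same encoding (StreamReader derives from TextReader).
            using (TextReader reader = new StreamReader(path, Encoding.UTF8))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }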

Don't get confused by languages and cultures - that's a completely separate issue from Unicode.

JacquesB
A: 

.NET has relatively good i18n support. You don't really need to think about Unicode that much, as all .NET strings and built-in string functions do the right thing with Unicode. The only thing to bear in mind is that most of the formatting functions, DateTime.ToString() for example, by default use the thread's culture, which in turn defaults to the Windows regional settings. You can specify a different culture for formatting, either on the current thread or on each method call.
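
For instance (a sketch; the German culture is just an example):

    using System;
    using System.Globalization;

    class CultureDemo
    {
        static void Main()
        {
            DateTime now = DateTime.Now;

            // Uses the current thread's culture (whatever Windows is set to).
            Console.WriteLine(now.ToString());

            // Explicit culture per call, independent of the thread's settings.
            Console.WriteLine(now.ToString(CultureInfo.GetCultureInfo("de-DE")));
            Console.WriteLine(now.ToString(CultureInfo.InvariantCulture));
        }
    }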

The only time unicode is an issue is when encoding/decoding strings to and from bytes.

Matt Howells
+4  A: 

Keep in mind that C# strings are sequences of Char, i.e. UTF-16 code units. They are not Unicode code points. Some Unicode code points require two Chars (a surrogate pair), and you should not split strings between those Chars.

In addition, Unicode code points may combine to form a single language 'character' -- for instance, a 'u' Char followed by a combining umlaut Char. So you can't split strings between arbitrary code points either.
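
A short sketch of iterating user-perceived characters instead of Chars, using System.Globalization.StringInfo (the sample string is contrived to include a combining mark and a surrogate pair):

    using System;
    using System.Globalization;

    class TextElementDemo
    {
        static void Main()
        {
            // "u" followed by a combining diaeresis, plus a character outside
            // the Basic Multilingual Plane that needs a surrogate pair.
            string s = "u\u0308 \U0001D11E";

            // Length counts Chars (UTF-16 code units), not user-perceived characters.
            Console.WriteLine(s.Length);

            // Enumerate text elements (user-perceived characters) instead of Chars.
            TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
            while (e.MoveNext())
            {
                Console.WriteLine((string)e.Current);
            }
        }
    }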

Basically, it's a mess of issues, where any given issue may in practice only affect languages you don't know.

Aaron
A: 

As mentioned, .NET strings handle Unicode transparently. Besides file I/O, the other consideration is the database layer. SQL Server, for instance, distinguishes between VARCHAR (non-Unicode) and NVARCHAR (which handles Unicode). You also need to pay attention to the types of stored procedure parameters.
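
A rough sketch with ADO.NET (the table and column names are made up for illustration):

    using System.Data;
    using System.Data.SqlClient;

    class NVarCharDemo
    {
        static void Save(string connectionString, string unicodeText)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(
                "INSERT INTO Notes (Body) VALUES (@body)", connection))
            {
                // NVarChar keeps the Unicode data intact; VarChar would force a
                // conversion to the database's non-Unicode code page.
                command.Parameters.Add("@body", SqlDbType.NVarChar, 4000).Value = unicodeText;

                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }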

Pat
+2  A: 

System.String already handles Unicode internally, so you are covered there. Best practice would be to use System.Text.Encoding.UTF8 when reading and writing files. It's more than just reading/writing files, however: anything that streams data out, including network connections, is going to depend on the encoding. If you're using WCF, it will default to UTF-8 for most of the bindings (in fact most don't allow ASCII at all).

UTF-8 is a good choice because, while it still supports the entire Unicode character set, it is byte-compatible with ASCII for the ASCII range. Thus naive applications that don't support Unicode have some chance of reading/writing your application's data; those applications will only begin to fail when you start using extended characters.

System.Text.Encoding.Unicode will write UTF-16, which is a minimum of two bytes per character, making it both larger and fully incompatible with ASCII. System.Text.Encoding.UTF32 is, as you can guess, larger still. I'm not sure of the real-world use case for UTF-16 and UTF-32, but perhaps they perform better when you have large numbers of extended characters. That's just a theory, but if it is true, then Japanese/Chinese developers making a product that will be used primarily in those languages might find UTF-16/32 a better choice.
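
A quick sketch comparing the encoded sizes (the sample strings are arbitrary):

    using System;
    using System.Text;

    class SizeDemo
    {
        static void Main()
        {
            string ascii = "hello";
            string japanese = "こんにちは";

            foreach (Encoding enc in new[] { Encoding.UTF8, Encoding.Unicode, Encoding.UTF32 })
            {
                Console.WriteLine("{0}: '{1}'={2} bytes, '{3}'={4} bytes",
                    enc.EncodingName,
                    ascii, enc.GetBytes(ascii).Length,
                    japanese, enc.GetBytes(japanese).Length);
            }
            // UTF-8: the ASCII string stays 1 byte per character, the Japanese one takes 3.
            // UTF-16 (Encoding.Unicode): 2 bytes per character for both strings here.
            // UTF-32: always 4 bytes per character.
        }
    }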

nedruod
A: 

More details can be found on this thread:

http://discuss.joelonsoftware.com/default.asp?dotnet.12.189999.12

pradeeptp