I'd like to have a canonical place to pool information about Unicode support in various languages. Is it a part of the core language? Is it provided in libraries? Is it not available at all? Is there a popular resource for Unicode information in a language? One language per answer, please. Also, if you could make the language a heading, that would make it easier to find.

+1  A: 

Python

The Truth about Unicode in Python

docgnome
A summary and a mention of the Python version would be good (the article is old-ish and probably doesn't handle Python 3).
Joachim Sauer
+2  A: 

.NET (C#, VB.NET, ...)

.NET stores strings internally as a sequence of System.Char objects. One System.Char represents a UTF-16 code unit.

From the MSDN documentation on System.Char:

The .NET Framework uses the Char structure to represent a Unicode character. The Unicode Standard identifies each Unicode character with a unique 21-bit scalar number called a code point, and defines the UTF-16 encoding form that specifies how a code point is encoded into a sequence of one or more 16-bit values. Each 16-bit value ranges from hexadecimal 0x0000 through 0xFFFF and is stored in a Char structure.
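To make the "one or more 16-bit values" part concrete, here is a minimal sketch in Python (not .NET code) of how a code point maps to UTF-16 code units. The function name `utf16_code_units` is made up for this illustration; the arithmetic is the standard surrogate-pair rule.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]            # BMP: fits in a single 16-bit unit
    offset = code_point - 0x10000      # 20 bits remain after subtracting
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in utf16_code_units(0x41)])      # one unit
print([hex(u) for u in utf16_code_units(0x1F600)])   # a surrogate pair
```

Each value in the returned list is what a single System.Char would hold.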

Additional resources:

MiffTheFox
+3  A: 

C/C++

C and C++ support Unicode via the wchar_t type (and STL's std::wstring). Technically, the width of wchar_t is compiler-dependent. With most compilers it's 16-bit (UTF-16) or 32-bit (UTF-32).
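A quick way to see which width a given platform uses, without writing any C, is to ask Python's ctypes module, which mirrors the C wchar_t of the toolchain CPython was built with. This is only a sketch for checking the claim above, and the variable name `wchar_bytes` is arbitrary.

```python
import ctypes

# Size in bytes of the platform's C wchar_t, as exposed through ctypes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

print(f"wchar_t is {wchar_bytes * 8} bits wide on this platform")
```

Typically this reports 16 bits on Windows and 32 bits on Linux and macOS.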

zpasternack
It should be noted that text stored in a wchar_t doesn't magically become Unicode - but any decent C programmer should know that nothing magically works in C. :)
Chris Lutz
It's equally accurate to say that C and C++ support Unicode via `char*` strings encoded in UTF-8.
dan04
UTF-8 is said to be a more useful encoding for use in C++. See http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
And it is natively supported via char*.
Pavel Radzivilovsky
+6  A: 

Java

Same as with .NET, Java uses UTF-16 internally: java.lang.String

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
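The "two positions" point can be checked without a JVM. The sketch below is Python, not Java: it counts UTF-16 code units by encoding the text, which should match what Java's String.length() reports for the same text. The helper name `utf16_length` is invented for this example.

```python
def utf16_length(text):
    """Count UTF-16 code units: each unit is two bytes in UTF-16LE."""
    return len(text.encode("utf-16-le")) // 2


s = "a\U0001F600"          # 'a' followed by one supplementary character

print(len(s))              # Python counts code points
print(utf16_length(s))     # UTF-16 counts code units
```

Here the code point count is 2 while the code unit count is 3, because the supplementary character needs a surrogate pair.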

Joey
A: 

Common Lisp (SBCL and CLisp)

According to this, SBCL and CLisp support Unicode.

docgnome
+3  A: 

Perl

Perl has built-in Unicode support, mostly. Sort of. From perldoc:

  • perlunitut - Tutorial on using Unicode in Perl. Largely teaches in absolute terms about what you should and should not do as far as Unicode. Covers basics.
  • perlunifaq - Frequently asked questions about Unicode in Perl.
  • perluniintro - Introduction to Unicode in Perl. Less "preachy" than perlunitut.
  • perlunicode - For when you absolutely have to know everything there is to know about Unicode and Perl.
Chris Lutz
Great answer! This is _exactly_ the sort of thing I was hoping to get.
docgnome
A: 

Arc

Arc doesn't have any unicode support. Yet.

docgnome
I'd "-1" the article (not the answer!) based on the fact that the author equates "Unicode support" to "the color of the bicycle shed".
Joachim Sauer
+2  A: 

Delphi

Delphi 2009 fully supports Unicode. They've changed the implementation of string to default to 16-bit Unicode encoding, and most libraries including the third party ones support Unicode. See Marco Cantù's Delphi and Unicode.

Prior to Delphi 2009, the support for Unicode was limited, but there were WideChar and WideString for storing 16-bit encoded strings. See Unicode in Delphi for more info.

Note that you can still develop bilingual CJKV applications without using Unicode. For example, a Shift JIS encoded string for Japanese can be stored in a plain AnsiString.
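What "stored in a byte string without Unicode" means can be illustrated outside Delphi. The following Python sketch (the sample text is arbitrary) encodes Japanese text as Shift JIS bytes, which is roughly the kind of payload an AnsiString would carry. The bytes are only meaningful if the reader knows which encoding was used.

```python
text = "ABC\u65e5\u672c\u8a9e"        # "ABC" followed by three kanji

sjis = text.encode("shift_jis")       # legacy encoding: 1 byte ASCII, 2 bytes kanji
utf16 = text.encode("utf-16-le")      # 2 bytes per BMP character

print(len(sjis), len(utf16))

# Decoding with the right codec recovers the original text.
roundtrip = sjis.decode("shift_jis")
print(roundtrip == text)
```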

eed3si9n
+1  A: 

Ruby

The only stuff I can find for Ruby is pretty old and, not being much of a Rubyist, I'm not sure how accurate it is.

For the record, Ruby does support utf8, but not multibyte. Internally, it usually assumes strings are byte vectors, though there are libraries and tricks you can usually use to make things work.

Found that here.

docgnome
Ruby has some bugs that make using unicode a pain for many use cases: http://redmine.ruby-lang.org/issues/show/2034
Eduardo
+1  A: 

JavaScript

Looks like before JS 1.3 there was no support for Unicode. As of 1.5, UTF-8, UTF-16 and UCS-2 are all supported. You can use Unicode escape sequences in strings, regexes and identifiers. Source

docgnome
A: 

PHP

There is already an entire thread on this on SO!

docgnome
+4  A: 

Python 3k

Python 3k (or 3.0 or 3000) has a new approach to handling text (Unicode) and data:
Text Vs. Data Instead Of Unicode Vs. 8-bit
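A minimal sketch of that split in Python 3: str holds text as Unicode code points, bytes holds raw data, and converting between them always goes through an explicit encoding. The sample string is arbitrary.

```python
text = "na\u00efve caf\u00e9"     # str: a sequence of Unicode code points
data = text.encode("utf-8")       # bytes: the same text as raw 8-bit data

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Going back requires naming the encoding again.
decoded = data.decode("utf-8")

# Mixing the two types is an error rather than an implicit conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print("str + bytes allowed:", mixed_ok)
```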

Shirkrin
+1  A: 

Objective-C

None built-in, aside from whatever happens to be available as part of the C string library.

However, once you add frameworks…

Foundation (Cocoa and Cocoa Touch) and Core Foundation

NSString and CFString each implement a fully Unicode-based string class (actually several classes, as an implementation detail). The two are “toll-free-bridged” so that the API for one can be used with instances of the other, and vice versa.

For data that doesn't necessarily represent text, there's NSData and CFData. NSString provides methods and CFString provides functions to encode text into data and decode text from data. Core Foundation supports more than a hundred different encodings, including all forms of the UTFs. The encodings are divided into two groups: built-in encodings, which are supported everywhere, and external encodings, which are at least supported on Mac OS X.

NSString provides methods for normalizing to forms D, KD, C, or KC. Each returns a new string.
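For readers without a Mac at hand, the four forms can be seen with Python's unicodedata module. This is a sketch of the Unicode normalization forms themselves, not of the NSString API. The sample string and the name `forms` are chosen for illustration.

```python
import unicodedata

s = "\u00e9\ufb01"    # precomposed e-acute followed by the "fi" ligature

# normalize() returns a new string each time; s itself is unchanged.
forms = {name: unicodedata.normalize(name, s)
         for name in ("NFD", "NFKD", "NFC", "NFKC")}

for name, value in forms.items():
    print(name, [f"U+{ord(c):04X}" for c in value])
```

The canonical forms (D and C) keep the ligature, while the compatibility forms (KD and KC) split it into plain "f" and "i".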

Both NSString and CFString provide a wide variety of comparison/collation options. Here are Foundation's comparison-option flags and Core Foundation's comparison-option flags. They are not all synonymous; for example, Core Foundation makes literal (strict code-point-based) comparison the default, whereas Foundation makes non-literal comparison (allowing characters with accents to compare equal) the default.
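The difference between the two defaults can be sketched in Python. This is an approximation, not the Foundation algorithm: it assumes "non-literal" means treating canonically equivalent sequences as equal, and the function names `literal_equal` and `canonical_equal` are invented here.

```python
import unicodedata


def literal_equal(a, b):
    """Strict code-point-by-code-point comparison."""
    return a == b


def canonical_equal(a, b):
    """Compare after normalizing both sides to the same canonical form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


composed = "caf\u00e9"        # e-acute as a single code point
decomposed = "cafe\u0301"     # "e" followed by a combining acute accent

print(literal_equal(composed, decomposed))
print(canonical_equal(composed, decomposed))
```

The two strings render identically but differ under literal comparison and match under the normalized one.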

Note that Core Foundation does not require Objective-C; indeed, it was created pretty much to provide most of the features of Foundation to Carbon programmers, who used straight C or C++. However, I suspect most modern usage of it is in Cocoa or Cocoa Touch programs, which are all written in Objective-C or Objective-C++.

Peter Hosey
+2  A: 

Tcl

Tcl strings have been sequences of Unicode characters since Tcl 8.1 (1999). Internally, they are morphed dynamically between UTF-8 (strictly the same Modified UTF-8 as Java due to the handling of U+0000 characters) and UCS-2 (in host endianness and BOM, of course). All external strings (with one exception), including those used to communicate with the OS, are internally Unicode before being transformed into whatever encoding is required for the host (or is manually configured on a communications channel). The exception is where data is copied between two communications channels with a common encoding (and a few other restrictions not germane here), in which case a direct copy-free binary transfer is used.
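The null-character handling mentioned above can be sketched in Python. This is not Tcl's implementation: it only shows the one difference relevant here, namely that Modified UTF-8 writes the null character as the two bytes C0 80 so that no zero byte appears in the encoded string. The function name `to_modified_utf8` is hypothetical, and the sketch deliberately rejects characters outside the BMP.

```python
def to_modified_utf8(text):
    """Encode BMP-only text, writing the null character as C0 80."""
    if any(ord(c) > 0xFFFF for c in text):
        raise ValueError("this sketch handles BMP characters only")
    # In standard UTF-8 a zero byte can only come from the null character,
    # so a plain byte replacement is safe.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = to_modified_utf8("a\x00b")
print(encoded.hex())
```

The result can be passed through C APIs that treat a zero byte as a string terminator.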

Characters outside the BMP are not currently handled either internally or externally. This is a known issue.

Donal Fellows
+1  A: 

R6RS Scheme

Requires the implementation of Unicode 5.1. All strings are in 'unicode format'.

leppie