I'd like to have a canonical place to pool information about Unicode support in various languages. Is it a part of the core language? Is it provided in libraries? Is it not available at all? Is there a popular resource for Unicode information in a language? One language per answer please. Also, if you could make the language a heading, that would make it easier to find.
.NET (C#, VB.NET, ...)
.NET stores strings internally as a sequence of System.Char objects. One System.Char represents a UTF-16 code unit.
From the MSDN documentation on System.Char:
The .NET Framework uses the Char structure to represent a Unicode character. The Unicode Standard identifies each Unicode character with a unique 21-bit scalar number called a code point, and defines the UTF-16 encoding form that specifies how a code point is encoded into a sequence of one or more 16-bit values. Each 16-bit value ranges from hexadecimal 0x0000 through 0xFFFF and is stored in a Char structure.
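To make the "one or more 16-bit values" part concrete, here is a rough sketch of the UTF-16 encoding rule itself. It is written in Python rather than C# just to keep the arithmetic visible, and the function name utf16_code_units is made up for this illustration; it is not a .NET API.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit UTF-16 code units for one Unicode scalar value.

    Hypothetical helper for illustration only; not part of any .NET API.
    """
    # Surrogate code points and values above U+10FFFF are not scalar values.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # BMP code points fit in a single 16-bit code unit.
        return [code_point]
    # Supplementary code points are split into a high/low surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


print([hex(u) for u in utf16_code_units(0x41)])     # one code unit
print([hex(u) for u in utf16_code_units(0x1F600)])  # a surrogate pair
```

Each value in the returned list is what would end up in one Char, so a character outside the BMP takes two of them.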
Additional resources:
- Strings in .NET and C# (by Jon Skeet).
C/C++
C and C++ support Unicode via the wchar_t type (and STL's std::wstring). Technically, the width of wchar_t is compiler-dependent. With most compilers it is 16 bits (UTF-16) or 32 bits (UTF-32).
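If you just want to see which width your platform's C toolchain picked, one quick way (a sketch, assuming a Python interpreter is at hand) is to ask ctypes, whose c_wchar type mirrors the wchar_t of the C compiler Python was built with:

```python
import ctypes

# ctypes.c_wchar corresponds to the platform's C wchar_t.
width_bits = ctypes.sizeof(ctypes.c_wchar) * 8
print(f"wchar_t is {width_bits} bits wide on this platform")
```

The result depends on the platform and compiler, which is exactly the portability caveat above.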
Java
As with .NET, Java uses UTF-16 internally. From the java.lang.String documentation:
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
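The practical consequence is that the length and indices of a String count code units, not characters. Here is a small sketch of that difference, done in Python (whose str counts code points) by encoding to UTF-16 and counting 16-bit units; utf16_length is a made-up helper name, not a Java method.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed for text (hypothetical helper)."""
    # Every UTF-16 code unit is two bytes; "utf-16-le" adds no BOM.
    return len(text.encode("utf-16-le")) // 2


s = "a\U0001F600b"          # 'a', one supplementary character, 'b'
code_points = len(s)        # Python counts code points
code_units = utf16_length(s)  # what a UTF-16 based string length would count

print(code_points, code_units)
```

Here the supplementary character takes two positions, so the UTF-16 count is one higher than the number of code points.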
Perl
Perl has built-in Unicode support, mostly. Sort of. From perldoc:
- perlunitut - Tutorial on using Unicode in Perl. Largely teaches in absolute terms about what you should and should not do as far as Unicode. Covers basics.
- perlunifaq - Frequently asked questions about Unicode in Perl.
- perluniintro - Introduction to Unicode in Perl. Less "preachy" than perlunitut.
- perlunicode - For when you absolutely have to know everything there is to know about Unicode and Perl.
Delphi
Delphi 2009 fully supports Unicode. They've changed the implementation of string to default to a 16-bit Unicode encoding (UTF-16), and most libraries, including the third-party ones, support Unicode. See Marco Cantù's Delphi and Unicode.
Prior to Delphi 2009, the support for Unicode was limited, but there were WideChar and WideString to store 16-bit encoded strings. See Unicode in Delphi for more info.
Note that you can still develop bilingual CJKV applications without using Unicode. For example, a Shift JIS-encoded string for Japanese can be stored in a plain AnsiString.
Ruby
The only stuff I can find for Ruby is pretty old, and not being much of a Rubyist, I'm not sure how accurate it is.
For the record, Ruby does support UTF-8, but not multibyte. Internally, it usually assumes strings are byte vectors, though there are libraries and tricks you can usually use to make things work.
Found that here.
Python 3k
Python 3k (or 3.0 or 3000) has a new approach for handling text (Unicode) and data:
Text Vs. Data Instead Of Unicode Vs. 8-bit
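A minimal sketch of what that split looks like in practice: str holds text, bytes holds data, and you move between them explicitly with encode and decode. The sample string is just an example.

```python
text = "naïve café"              # str: a sequence of Unicode code points
data = text.encode("utf-8")      # bytes: the encoded form, for files/sockets

# Decoding with the same encoding gives the original text back.
round_trip = data.decode("utf-8")

# Text and data no longer mix implicitly; this raises instead of guessing.
mixed_error = None
try:
    text + data
except TypeError as exc:
    mixed_error = type(exc).__name__

print(len(text), len(data), round_trip == text, mixed_error)
```

The two lengths differ because each accented letter is one code point in the str but two bytes in UTF-8.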
Objective-C
None built-in, aside from whatever happens to be available as part of the C string library.
However, once you add frameworks…
Foundation (Cocoa and Cocoa Touch) and Core Foundation
NSString and CFString each implement a fully Unicode-based string class (actually several classes, as an implementation detail). The two are “toll-free-bridged” so that the API for one can be used with instances of the other, and vice versa.
For data that doesn't necessarily represent text, there's NSData and CFData. NSString provides methods and CFString provides functions to encode text into data and decode text from data. Core Foundation supports more than a hundred different encodings, including all forms of the UTFs. The encodings are divided into two groups: built-in encodings, which are supported everywhere, and external encodings, which are at least supported on Mac OS X.
NSString provides methods for normalizing to forms D, KD, C, or KC. Each returns a new string.
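For anyone unsure what the four forms actually do, here is a small sketch using Python's unicodedata module rather than NSString (so none of this is Foundation API). It runs the same input through all four forms; the sample string is my own choice.

```python
import unicodedata

# U+00E9 (precomposed e-acute) followed by U+FB01 (the "fi" ligature).
s = "\u00e9\ufb01"

# D/KD decompose; C/KC decompose and then recompose.
# The K forms additionally replace compatibility characters such as ligatures.
forms = {
    name: unicodedata.normalize(name, s)
    for name in ("NFD", "NFKD", "NFC", "NFKC")
}

for name, value in forms.items():
    print(name, [hex(ord(ch)) for ch in value])
```

As with the NSString methods, each call returns a new string and leaves the original untouched.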
Both NSString and CFString provide a wide variety of comparison/collation options. Here are Foundation's comparison-option flags and Core Foundation's comparison-option flags. They are not all synonymous; for example, Core Foundation makes literal (strict code-point-based) comparison the default, whereas Foundation makes non-literal comparison (allowing canonically equivalent sequences, such as a precomposed accented character and its decomposed form, to compare equal) the default.
Note that Core Foundation does not require Objective-C; indeed, it was created pretty much to provide most of the features of Foundation to Carbon programmers, who used straight C or C++. However, I suspect most modern usage of it is in Cocoa or Cocoa Touch programs, which are all written in Objective-C or Objective-C++.
Tcl
Tcl strings have been sequences of Unicode characters since Tcl 8.1 (1999). Internally, they are morphed dynamically between UTF-8 (strictly the same Modified UTF-8 as Java due to the handling of U+0000 characters) and UCS-2 (in host endianness and BOM, of course). All external strings (with one exception), including those used to communicate with the OS, are internally Unicode before being transformed into whatever encoding is required for the host (or is manually configured on a communications channel). The exception is for where data is copied between two communications channels with a common encoding (and a few other restrictions not germane here) where a direct copy-free binary transfer is used.
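For the curious, the U+0000 quirk of Modified UTF-8 is that NUL is written as the two bytes C0 80 instead of a single zero byte, so the encoded string never contains an embedded zero. Below is a rough sketch in Python, limited to the BMP; the function name is made up and this is not Tcl's actual implementation.

```python
def encode_modified_utf8_bmp(text: str) -> bytes:
    """Encode BMP-only text as Modified UTF-8 (illustrative sketch only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 becomes the overlong two-byte form instead of 0x00.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Supplementary characters are deliberately left out of this sketch.
            raise ValueError("only BMP characters are handled here")
        else:
            # Everything else in the BMP is encoded as in standard UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


encoded = encode_modified_utf8_bmp("a\x00b")
print(encoded)
```

The payoff is that the result can be stored in a NUL-terminated C string without being cut short.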
Characters outside the BMP are not currently handled either internally or externally. This is a known issue.
R6RS Scheme
Requires the implementation of Unicode 5.1. All strings are in 'unicode format'.