views:

862

answers:

10

Which widely used programming languages were designed ground-up with Unicode support?

A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?

+26  A: 

Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.

Jay Bazuzi
Really? High five Microsoft! Any idea if IronRuby, IronPython and F# are in the same boat?
George Mauer
George, all .NET languages that use the System.String class have full Unicode support. I don't know of any .NET languages that don't use the System.String class, so that means IronRuby, IronPython and especially F# (which is a first class language starting with VS2010) have native Unicode support. I can't think of a good reason why someone would create a .NET language and make a special non-Unicode string class for it when a Unicode string class is already provided in the BCL.
Allon Guralnek
Strictly speaking, a System.String is a sequence of UTF-16 code units, not abstract Unicode code points or graphemes. If your app cares about the difference (most won't need to), you can use the System.Globalization.StringInfo class.
Christian Hayter
Can you make a CLS compliant language without System.String support?
Chris S
+31  A: 

Java was probably the first popular language to have ground-up Unicode support.

Ken Keenan
Apart from the fact that it "only" supports the Basic Multilingual Plane (which was all that Unicode had when Java was invented). The .NET Framework is the first platform I know of that is designed around "full" Unicode support (including correct Length for strings that contain surrogates...)
mihi
Java supports the full Unicode standard since forever, not just the BMP. Strings are stored in UTF-16 (not UCS-2, which would mean BMP only).
Joachim Sauer
When Java was designed, Unicode was only the BMP. According to MSDN's documentation on String in .NET: "The Length property returns the number of Char objects in this instance, not the number of Unicode characters." The java.lang.String.codePointCount() method returns the number of code points in the string, taking surrogates into account.
Pete Kirkham
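To make the difference between code units and code points concrete, here is a minimal Java sketch. The class and method names are made up for illustration; the only library calls are String.length() and String.codePointCount(int, int), which takes a begin and end index and has been available since Java 5.

```java
final class CodePointCountDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a well-formed surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
        // so it is written here as the surrogate pair D834 DD1E.
        String clef = "\uD834\uDD1E";
        System.out.println(codeUnits(clef));  // 2
        System.out.println(codePoints(clef)); // 1
    }
}
```

For text that stays inside the BMP the two counts are equal, which is why most code never notices the difference.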
@Joachim Sauer: UCS-2 supports the full Unicode standard (don't forget the surrogate pairs D800 to DBFF). Java was designed to use UTF-16, as was the .NET Framework, but Java was designed before UTF-32/UCS-4 while .NET was designed after. Both have access to the full range of code points.
Martin York
A: 

Java uses characters from the Unicode character set.

sdu
Most programming languages use characters from the Unicode character set (they just place restrictions on which characters they use).
Pete Kirkham
+12  A: 

There were many breaking changes in Python 3, among them the switch to Unicode for all text.

So Python wasn't designed ground-up for Unicode, but Python 3 was.

Mark Rushakoff
A: 

Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html

janneb
+5  A: 

It really is difficult to design future-proof Unicode support into a programming language right from the beginning.

Java is one of the languages that had this designed into the language specification. However, Unicode support in v1.0 of Java is different from that in v5 and v6 of the Java SDK. This is primarily due to the version of Unicode that the language specification catered to when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.

Early implementations of the JLS could claim Unicode support, primarily because Unicode itself supported 65536 characters (v1.0 of Java supported Unicode 1.1, and Java v1.4 supported Unicode 3.0), which was compatible with the 16-bit storage space taken up by characters. That changed with Unicode 3.1. It's an evolving standard, usually with more characters getting added in each release. The characters added in 3.1 and later that do not fit in 16 bits are called supplementary characters. Support for supplementary characters was added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
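As a rough illustration of what the Java 5 additions look like in practice, the sketch below walks a string by code point instead of by char. The class and method names are hypothetical; String.codePointAt, Character.isSupplementaryCodePoint, Character.charCount and StringBuilder.appendCodePoint are the Java 5 APIs it relies on.

```java
final class SupplementaryDemo {
    // Count code points above U+FFFF, stepping by code point rather than by char.
    static int countSupplementary(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
            i += Character.charCount(cp); // 1 for a BMP character, 2 for a supplementary one
        }
        return count;
    }

    public static void main(String[] args) {
        // Build a string from code points: 'A', then U+1F600, which is outside the BMP.
        String s = new StringBuilder()
                .appendCodePoint('A')
                .appendCodePoint(0x1F600)
                .toString();
        System.out.println(s.length());            // 3
        System.out.println(countSupplementary(s)); // 1
    }
}
```

A char in Java is still 16 bits wide, so a supplementary character occupies two char slots in the string; the code point methods exist to hide that detail when it matters.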

Therefore, don't be surprised if different programming languages implement Unicode support differently.

On the other hand, PHP(!!) and Ruby did not have Unicode support built into them during inception.

PS: Support for Unicode 5.1 is planned for Java 7.

Vineet Reynolds
+8  A: 

I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class etc.) Unicode-aware, but Unicode is also fully supported in source:

using משליט = System.Object;
using תוצאה = System.Int32;
public class שלום : משליט  {
    public תוצאה בית() {
        int אלף = 0;
        for (int λ = 0; λ < 20; λ++) אלף+=λ;
        return אלף;
    }
}
Marc Gravell
(note that there is possibly some odd right-to-left issue in the above in browser/editor; if you paste it into VS, it is "int {name} = 0")
Marc Gravell
Odd, I can use π and θ for identifiers but not √...
gw
@gw: Try running `"πθ√".Select(c=>CharUnicodeInfo.GetUnicodeCategory(c))` in LINQPad and you'll see why ;-)
Eamon Nerbonne
Java is as well.
Noon Silk
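For the Java side of this, here is a small sketch of the same π/θ/√ experiment. It assumes only the standard Character.isJavaIdentifierStart and Character.getType methods; the class and helper names are invented for the example, and the characters are written as \u escapes to avoid source-encoding trouble.

```java
final class IdentifierCheck {
    // True if the code point may start a Java identifier.
    static boolean canStartIdentifier(int codePoint) {
        return Character.isJavaIdentifierStart(codePoint);
    }

    // True if the code point's general category is Sm (math symbol).
    static boolean isMathSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.MATH_SYMBOL;
    }

    public static void main(String[] args) {
        System.out.println(canStartIdentifier('\u03C0')); // pi: true
        System.out.println(canStartIdentifier('\u03B8')); // theta: true
        System.out.println(canStartIdentifier('\u221A')); // square root sign: false
        System.out.println(isMathSymbol('\u221A'));       // true
    }
}
```

The two Greek letters are lowercase letters, while the square root sign is a math symbol, so it is excluded from identifiers.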
Same in Perl 5 and Perl 6. Perl 6 even has Unicode operators.
Alexandr Ciornii
A: 

Java and the .NET languages.

eglasius
+3  A: 

Java and the .NET languages, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.

The Plan 9 languages around 1992 were probably the first to do this: their dialect of C, rc, Alef, mk, ACID, and so on, were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
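A rule like that is short enough to sketch. The Java below is only an illustration of the idea, not the Plan 9 code: the ASCII part (letters, digits, underscore, no leading digit) is an assumed C-like rule, and all names are hypothetical.

```java
final class Plan9StyleIdentifiers {
    // Any non-ASCII code point counts as an identifier character;
    // within ASCII, only letters, digits and underscore do (assumed C-like rule).
    static boolean isIdentifierChar(int cp) {
        return cp >= 0x80
                || cp == '_'
                || (cp >= 'a' && cp <= 'z')
                || (cp >= 'A' && cp <= 'Z')
                || (cp >= '0' && cp <= '9');
    }

    // A non-empty string of identifier characters that does not start with a digit.
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (first >= '0' && first <= '9') {
            return false;
        }
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isIdentifierChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("\u03BBx1")); // true
        System.out.println(isIdentifier("1x"));       // false
        System.out.println(isIdentifier("\u221A"));   // true: no category check at all
    }
}
```

The appeal of this design is that the lexer needs no Unicode tables; the cost is that symbols such as √ are accepted in identifiers too.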

Other languages that support non-ASCII identifiers include current PHP.

Kragen Javier Sitaker
A: 

A feature that was included in a language when it was first designed is not always the best version of that feature.

Languages have changed over time and many have become bloated with extra features, while not necessarily keeping the features they first included up to date.

So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.

With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.

lkessler