I believe that any language supported on the .NET framework has correct Unicode (UTF-16) support.
Also, similar question here
In Python 3, strings are always Unicode (there is a separate bytes type for ASCII or similar encodings). I'm not aware of any built-ins that don't work correctly with them. There may be some, but considering Python 3 has been out for quite a while, I figure they got about everything needed for daily work right.
Of course, Unicode has higher memory consumption (with UTF-8 not really, as long as you stay within the ASCII range, but otherwise...), and I can imagine variable-length encodings are a pain to handle internally. I don't know anything about the implementation, though, except that it can't be a linked list, since strings have O(1) random access.
The Java implementation is correct in the sense that it does not violate the Unicode standard; there is no prescription that string indexing work on code points instead of code units, and the behavior is documented. The Unicode standard gives implementors great freedom concerning optimizations, as long as no invalid string is leaked.

Concerning “full support”, that's even harder to define. The Unicode standard generally doesn't require that certain features be implemented to be Unicode-compatible; only that the features that are implemented are implemented according to the standard. Huge parts concerning script processing belong to fonts or the operating system, which programming systems cannot control.

If you want to judge the Unicode support of certain technologies, you can start by testing the following (subjective and non-exhaustive) list of topics:
UpperCase("ß") = "SS"
?UpperCase("i") = "İ"
I think the Java and .NET answer to these questions is mostly “yes”, while the Python 3.x answer is almost always “no”.
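To make the first two tests and the code-unit versus code-point distinction concrete, here is a small Java sketch (the class name, sample string and locale tag are just illustrative choices):

```java
import java.util.Locale;

public class UnicodeTests {
    public static void main(String[] args) {
        // U+1D518 (MATHEMATICAL FRAKTUR CAPITAL U) lies outside the BMP,
        // so Java's UTF-16 strings store it as a surrogate pair.
        String s = "a\uD835\uDD18b";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // The first two tests from the list above:
        System.out.println("ß".toUpperCase(Locale.ROOT));                 // SS
        System.out.println("i".toUpperCase(Locale.forLanguageTag("tr"))); // İ
    }
}
```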
Go, the new language developed at Google by Ken Thompson and Rob Pike, and the C dialect in Plan 9 from Bell Labs were both built with Unicode in mind (UTF-8 was invented there at Bell Labs, by Ken Thompson).
The .NET Framework stores char and string data using the UTF-16 encoding. If you assume that all your text lies within the Basic Multilingual Plane, then everything will just work without any special code.
If you regard user-entered strings as blobs and don't try to manipulate them (e.g. most text fields in CRUD apps), then your code will appear to handle characters outside the BMP correctly, because UTF-16 stores them as surrogate pairs. As long as you don't fiddle with the surrogate pairs, all will be fine.
However, if you want to analyse and manipulate strings while also handling characters outside the BMP correctly, then you have to explicitly code for that possibility. See the StringInfo class for methods to help you process surrogate pairs.
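StringInfo is specific to .NET, but since Java stores strings as UTF-16 too, the pitfall and the fix can be sketched in Java, with BreakIterator playing roughly the role of StringInfo's text elements:

```java
import java.text.BreakIterator;

public class SurrogateSlicing {
    public static void main(String[] args) {
        // U+1F388 (BALLOON) is outside the BMP: one code point, two chars.
        String s = "x\uD83C\uDF88y";

        // Naive code-unit slicing cuts the surrogate pair in half and
        // leaves an invalid lone high surrogate at the end:
        System.out.println(s.substring(0, 2));

        // Walking grapheme boundaries keeps the pair intact:
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println(s.substring(start, end)); // "x", "🎈", "y"
        }
    }
}
```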
I would guess that Microsoft designed it this way to achieve a balance between performance and correctness. The alternatives would be:

- UTF-32, which makes every code point a fixed width at the cost of roughly doubling the memory used by mostly-BMP text; or
- UTF-8, which is compact but turns indexing by character into an O(n) scan.
.NET also contains full support for culture-aware case conversion, comparisons and sorting.
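To keep all the examples here in one language, the same kind of culture-aware comparison can be sketched in Java with java.text.Collator; the .NET types differ, but the contrast with plain ordinal comparison is the point:

```java
import java.text.Collator;
import java.util.Locale;

public class CultureAwareSort {
    public static void main(String[] args) {
        // Code-unit order puts "ä" (U+00E4) after "z", so "äb" > "az":
        System.out.println("äb".compareTo("az") < 0);   // false

        // German collation sorts "ä" next to "a", so "äb" < "az":
        Collator de = Collator.getInstance(Locale.GERMAN);
        System.out.println(de.compare("äb", "az") < 0); // true
    }
}
```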
It looks like Perl 6 gets good Unicode support:
perlgeek.de/en/article/5-to-6#post_17
For instance it provides you with three different length methods: bytes (the length in bytes), codes (the number of code points) and graphs (the number of graphemes).
This gets integrated into Perl's regular expressions as well.
Looks like a step in the right direction to me.
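For comparison with the other answers, those three lengths can be computed by hand in Java; this is only a sketch, with BreakIterator approximating the grapheme count that Perl 6's graphs gives you directly:

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

public class ThreeLengths {
    public static void main(String[] args) {
        // "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT:
        // one grapheme, two code points, three UTF-8 bytes.
        String s = "e\u0301";

        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 3 (bytes)
        System.out.println(s.codePointCount(0, s.length()));           // 2 (codes)

        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int graphs = 0;
        while (it.next() != BreakIterator.DONE) {
            graphs++;
        }
        System.out.println(graphs);                                    // 1 (graphs)
    }
}
```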