I believe that any language supported on the .NET framework has correct Unicode (UTF-16) support.
Also, similar question here
In Python 3, strings are always Unicode (there is a separate bytes type for ASCII or similar encodings). I'm not aware of any built-ins that don't work correctly with them. There may be some, but considering Python 3 has been out for quite a while, I figure they got about everything needed for daily work right.
Of course, Unicode has higher memory consumption (with UTF-8 not really, as long as you stay within the ASCII range, but otherwise...), and I can imagine variable-length encodings are a pain to handle internally. I don't know anything about the implementation, though, except that it can't be a linked list, since strings have O(1) random access.
The Java implementation is correct in the sense that it does not violate the Unicode standard; there is no prescription that string indexing work on code points instead of code units, and the behavior is documented. The Unicode standard gives implementors great freedom concerning optimizations, as long as no invalid string is leaked.

Concerning “full support”, that's even harder to define. The Unicode standard generally doesn't require that certain features be implemented to be Unicode-compatible; only that the features that are implemented are implemented according to the standard. Huge parts concerning script processing belong to fonts or the operating system, which programming systems cannot control.

If you want to judge the Unicode support of certain technologies, you can start by testing the following (subjective and non-exhaustive) list of topics:
UpperCase("ß") = "SS"
?UpperCase("i") = "İ"
I think the Java and .NET answer to these questions is mostly “yes”, while the Python 3.x answer is almost always “no”.
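To make the first two tests and the code-unit versus code-point distinction concrete, here is a small Java sketch (the class name, sample string and locale tag are just illustrative choices):

```java
import java.util.Locale;

public class UnicodeTests {
    public static void main(String[] args) {
        // U+1D518 (MATHEMATICAL FRAKTUR CAPITAL U) lies outside the BMP,
        // so Java's UTF-16 strings store it as a surrogate pair.
        String s = "a\uD835\uDD18b";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // The first two tests from the list above:
        System.out.println("ß".toUpperCase(Locale.ROOT));                 // SS
        System.out.println("i".toUpperCase(Locale.forLanguageTag("tr"))); // İ
    }
}
```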
Go, the new language developed at Google by Ken Thompson and Rob Pike, and the C dialect in Plan 9 from Bell Labs were both built with Unicode in mind (UTF-8 was invented there at Bell Labs, by Ken Thompson).
The .NET Framework stores char and string data using the UTF-16 encoding. If you assume that all your text lies within the Basic Multilingual Plane, then everything will just work without any special code.
If you regard user-entered strings as blobs and don't try to manipulate them (e.g. most text fields in CRUD apps), then your code will appear to handle characters outside the BMP correctly, because UTF-16 stores them as surrogate pairs. As long as you don't fiddle with the surrogate pairs, all will be fine.
However, if you want to analyse and manipulate strings while also handling characters outside the BMP correctly, then you have to explicitly code for that possibility. See the StringInfo class for methods to help you process surrogate pairs.
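StringInfo is specific to .NET, but since Java stores strings as UTF-16 too, the pitfall and the fix can be sketched in Java, with BreakIterator playing roughly the role of StringInfo's text elements:

```java
import java.text.BreakIterator;

public class SurrogateSlicing {
    public static void main(String[] args) {
        // U+1F388 (BALLOON) is outside the BMP: one code point, two chars.
        String s = "x\uD83C\uDF88y";

        // Naive code-unit slicing cuts the surrogate pair in half and
        // leaves an invalid lone high surrogate at the end:
        System.out.println(s.substring(0, 2));

        // Walking grapheme boundaries keeps the pair intact:
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println(s.substring(start, end)); // "x", "🎈", "y"
        }
    }
}
```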
I would guess that Microsoft designed it this way to achieve a balance between performance and correctness. The alternatives would be:

- UTF-32, which makes every code point a fixed width at the cost of roughly doubling the memory used by mostly-BMP text; or
- UTF-8, which is compact but turns indexing by character into an O(n) scan.
.NET also contains full support for culture-aware case conversion, comparisons and sorting.
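To keep all the examples here in one language, the same kind of culture-aware comparison can be sketched in Java with java.text.Collator; the .NET types differ, but the contrast with plain ordinal comparison is the point:

```java
import java.text.Collator;
import java.util.Locale;

public class CultureAwareSort {
    public static void main(String[] args) {
        // Code-unit order puts "ä" (U+00E4) after "z", so "äb" > "az":
        System.out.println("äb".compareTo("az") < 0);   // false

        // German collation sorts "ä" next to "a", so "äb" < "az":
        Collator de = Collator.getInstance(Locale.GERMAN);
        System.out.println(de.compare("äb", "az") < 0); // true
    }
}
```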
It looks like Perl 6 gets good Unicode support:
perlgeek.de/en/article/5-to-6#post_17
For instance it provides you with three different length methods: bytes (the length in bytes), codes (the number of code points) and graphs (the number of graphemes).
This gets integrated into Perl's regular expressions as well.
Looks like a step in the right direction to me.
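For comparison with the other answers, those three lengths can be computed by hand in Java; this is only a sketch, with BreakIterator approximating the grapheme count that Perl 6's graphs gives you directly:

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

public class ThreeLengths {
    public static void main(String[] args) {
        // "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT:
        // one grapheme, two code points, three UTF-8 bytes.
        String s = "e\u0301";

        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 3 (bytes)
        System.out.println(s.codePointCount(0, s.length()));           // 2 (codes)

        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int graphs = 0;
        while (it.next() != BreakIterator.DONE) {
            graphs++;
        }
        System.out.println(graphs);                                    // 1 (graphs)
    }
}
```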