tags:

views:

82

answers:

5

I found this in the wikipedia article on utf-8:

Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.

That would lead me to believe that for comparison purposes (sorting, binary search, etc) that comparing two byte arrays (i.e. byte-by-byte like memcmp) of utf-8 encoded strings would give the same results as comparing the actual unicode strings.

Is this true?

+1  A: 

No, it isn't.

For example, can be written as a single codepoint (U+00C0 LATIN CAPITAL LETTER A WITH GRAVE) or as two codepoints (U+0041 LATIN CAPITAL LETTER A U+0300 COMBINING GRAVE ACCENT).

Both representations should compare equal, but will have different byte encodings.

SLaks
Surely these two codepoints would encode to different sequences of UTF-8 bytes? Or are you saying that you can't round-trip Unicode -> UTF-8 -> Unicode?
Tim Robinson
No, he's saying that a string in Unicode NFC and the same string in NFD (to take to examples of possible states of normalisation) and the same string again that isn't normalised won't have the same sequence of code points.Also, it makes no sense to talk about round-tripping between UTF-8 and Unicode, as UTF-8 *is* Unicode, just stored in a particular order of bytes.
Jon Hanna
+5  A: 

Yes, given that there's a one-to-one mapping between sequences bytes in UTF-8 encoding and Unicode code points.

However, there are way to compare Unicode strings besides looking at the raw code points. If you just look at code points -- or UTF-8 bytes -- as numbers then you miss culture-specific comparison logic.

To implement comparison and sorting correctly for a specific culture, on .NET, you should use the standard string comparison functions.

Tim Robinson
+4  A: 

It depends on what you mean by "comparing the actual Unicode strings".

If you're just going to compare the code points (as 32-bit numbers) instead of the UTF-8 encoded code points, then the answer is yes: that will give the same results. The mapping from code points to UTF-8 encoded bytes is one-to-one.

If you're going to do a proper Unicode string comparison, instead of bytewise comparison of the UTF-8, the answer is no. In Unicode, there can be different ways to represent the same character. For example, é can be represented in (at least) two ways:

  • U+00e9 (LATIN SMALL LETTER E WITH ACUTE), or
  • U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).

A properly written Unicode comparison function will consider these two to be identical.

Thomas
As Jon Hanna points out in his answer, in .NET you compare code points as UTF-16 surrogate pairs, not 32-bit numbers, so you actually get different results. But I accepted your answer because you were the first to point out that meaningful unicode string comparison shouldn't be on the basis of code points.
Eloff
A: 

I found this in the wikipedia article on utf-8:

Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.

That would lead me to believe that for comparison purposes (sorting, binary search, etc) that comparing two byte arrays (i.e. byte-by-byte like memcmp) of utf-8 encoded strings would give the same results as comparing the actual unicode strings.

This all depends on what you mean by "actual Unicode strings" and what you mean by "comparing". In the .Net Framework strings are in the UTF-16 form of Unicode. A simple binary comparison between UTF-16 strings will yeild a different sort order than the same comparison between UTF-8 and UTF-32 (the code-point version referenced in the quote) strings.

But a binary comparison of any of those things is not very useful. You should be using the built-in culture-aware comparisons. This is because two string that are, for all intents and purposes, the same can be constructed from different sequences of code points. The built-in comparisons take those things into account.

Jeffrey L Whitledge
+2  A: 

It is the same as a code-point for code-point comparison, that is to say one that pays no attention to case-folding, cultural orderings, composition, or anything other than the Unicode value.

This is pretty useless when considering strings as a piece of human-readable text, but sometimes you just want to be able to put the strings into an ordering, as some algorithms (binary search as you say) need a consistent ordering, but the details of that consistent ordering is not significant.

It is important to note though, that the ordinal comparison on strings offered by .NET works on the UTF-16 used internally which does not maintain code-point ordering. If we compare a string with just the character U+FF61 and a string with just the character U+10002, then .NET will store the latter as surrogate pairs, of 0xD800 and 0XDC02.

Hence:

string.CompareOrdinal("\U0000ff61", "\U00010002");

and

string.Compare("\U0000ff61", "\U00010002", StringComparison.Ordinal);

both return values great than zero, even though the former is lower in code-point value than the latter (I used the \U form rather than the \u form to make that clearer).

If by "the actual unicode strings" you mean the .NET UTF-16 strings, then the answer to your question is no, for the opposite reason to that which led to your thinking it might work.

Jon Hanna
Fascinating, I would get an ordering with UTF-8, and it would be a code-point ordering, unlike that yielded by the culture invariant ordinal comparison in .NET, simply because .NET uses UTF-16 which does not give code-point ordering. That's subtle and sure to turn some poor programmer prematurely grey :P
Eloff
Note also that culture-invariant is different to ordinal. The invariant culture is pretty much a made-up culture (which looks more like the "mid-Atlantic" Anglo culture shared between the US and the Commonwealth than like any other culture) which is useful when you need to force consistent behaviour over culturally correct handling.Ordinal comparisons are *strictly* for when you either need an arbitrary ordering (it's culturally rubbish but its fast) or to naively compare strings to match a standard (e.g. parsing computer code).
Jon Hanna
Maybe we could call ordinal, "culturally ignorant" :)
Jon Hanna