I added an answer to this question here: Sorting List<String>
in C# which calls for a natural sort order, one that handles embedded numbers.
My implementation, however, is naive, and in light of all the posts out there about how applications don't handle Unicode correctly because they assume things (Turkey test, anyone?), I thought I'd ask for help writing a better implementation. Or, if there is a built-in method in .NET, please tell me :)
My implementation for the answer in that question just walks the strings, comparing character by character, until it encounters a digit in both. Then it extracts the consecutive digits from both strings (the runs may differ in length), pads the shorter run with leading zeroes, and compares the two runs.
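A minimal sketch of that approach might look like the following. This is my own illustration of the steps described above, not the exact code from the linked answer; the class and method names are made up:

```csharp
using System;

static class NaiveNaturalComparer
{
    // Sketch of the naive approach: walk both strings in lockstep;
    // when both sides are at a digit, extract the full digit runs,
    // left-pad the shorter run with zeroes, and compare the runs.
    public static int Compare(string x, string y)
    {
        int i = 0, j = 0;
        while (i < x.Length && j < y.Length)
        {
            if (char.IsDigit(x[i]) && char.IsDigit(y[j]))
            {
                string dx = ReadDigits(x, ref i);
                string dy = ReadDigits(y, ref j);
                int len = Math.Max(dx.Length, dy.Length);
                int c = string.CompareOrdinal(dx.PadLeft(len, '0'),
                                              dy.PadLeft(len, '0'));
                if (c != 0) return c;
            }
            else
            {
                // Per-char comparison is exactly where the Unicode
                // problems below come from.
                int c = x[i].CompareTo(y[j]);
                if (c != 0) return c;
                i++; j++;
            }
        }
        // The string with characters left over sorts after the other.
        return (x.Length - i).CompareTo(y.Length - j);
    }

    static string ReadDigits(string s, ref int pos)
    {
        int start = pos;
        while (pos < s.Length && char.IsDigit(s[pos])) pos++;
        return s.Substring(start, pos - start);
    }
}
```

With this, `Compare("a2", "a10")` correctly sorts "a2" before "a10", which is the whole point of the digit-run handling.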
However, there are problems with it.
For instance, what if string x contains two codepoints which together form the character È, while the other string contains the single precomposed codepoint for that same character?
My algorithm would fail on those, since it would treat the combining diacritic codepoint as a character in its own right and compare it against the È from the other string.
Can anyone guide me towards how to handle this properly? I want support for specifying a CultureInfo
object to handle language-specific issues, like comparing "ss" with "ß" in German, and similar things.
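For the non-digit segments at least, culture-aware comparison in .NET already handles canonical equivalence, so the precomposed-vs-decomposed È problem above goes away if the comparison goes through a CultureInfo rather than comparing chars directly. A small demonstration (my own example, not from the linked answer):

```csharp
using System;
using System.Globalization;

// "È" written two ways: one precomposed codepoint, or a base letter
// followed by a combining grave accent.
string precomposed = "\u00C8";   // È as a single codepoint
string decomposed  = "E\u0300";  // E + COMBINING GRAVE ACCENT

// Culture-sensitive comparison treats canonically equivalent
// sequences as equal...
int culturally = string.Compare(precomposed, decomposed,
                                CultureInfo.InvariantCulture,
                                CompareOptions.None);

// ...while ordinal comparison sees different codepoint sequences.
int ordinally = string.CompareOrdinal(precomposed, decomposed);

Console.WriteLine(culturally == 0); // True
Console.WriteLine(ordinally == 0);  // False
```

So a better natural comparer could delegate the non-digit chunks to `CompareInfo`/`string.Compare` with the caller's CultureInfo and only special-case the digit runs itself.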
I think I need to get my code to enumerate over "real characters" (I don't know the proper term here) instead of individual codepoints.
What's the right approach to this?
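For what it's worth, the .NET term for those "real characters" appears to be text elements (grapheme clusters), and `System.Globalization.StringInfo` can enumerate them, returning a base letter plus its combining marks as a single unit:

```csharp
using System;
using System.Globalization;

// "Ètude" with a decomposed È: six chars, but five text elements.
string s = "E\u0300tude";

var enumerator = StringInfo.GetTextElementEnumerator(s);
while (enumerator.MoveNext())
{
    string element = (string)enumerator.Current;
    Console.WriteLine($"'{element}' spans {element.Length} char(s)");
}
// The first element is "E\u0300": two chars, one text element.
```

That might be the right unit for the character-by-character walk, with the actual ordering of each element still delegated to a culture-aware comparison.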
Also, if "natural" means "the way humans expect it to work", I would add the following things to ponder:
- What about dates and times?
- What about floating point values?
- Are there other sequences which are considered "natural"?
- How far should this be stretched? (Eeny, meeny, miny, moe)