ansaurus

Question

Answer 1

+2 A:

Any accented letter may have its own code point, or may be encoded as a sequence of more than one code point (a base letter followed by combining marks). For example, Â, Ë, Õ, and Ý are uppercase characters with lowercase equivalents.

The key is to implement the standards faithfully with respect to your users' locale settings, or get the same effect by using system libraries that handle the general case of toupper()/tolower() correctly.
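
As a rough illustration of the "more than one code point" case, here is a minimal Python sketch. The `same_letter` helper is hypothetical (it is not from the answer); `unicodedata.normalize` is from the standard library.

```python
import unicodedata

# "Â" written two ways: one precomposed code point, or "A" plus a combining mark.
precomposed = "\u00C2"   # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
decomposed = "A\u0302"   # "A" followed by COMBINING CIRCUMFLEX ACCENT


def same_letter(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(precomposed), len(decomposed))      # number of code points in each form
print(precomposed == decomposed)              # raw comparison of the two encodings
print(same_letter(precomposed, decomposed))   # comparison after normalization
# Lowercasing works on either form; the results still need normalizing to compare.
print(same_letter(precomposed.lower(), decomposed.lower()))
```

The point is only that lowercasing and comparison are separate problems: the library handles the case mapping, but the two encodings of the same letter compare equal only after normalization.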

Novelocrat 2009-05-30 05:33:13

Answer 2

+3 A:

There are a lot of alphabets other than the usual Latin-derived western European alphabet most of us are used to seeing here. To start with, you'd need uppercase and lowercase versions of accented letters and ligatures, like Àà, Ĳĳ, and so on. There are also the fullwidth versions of Latin characters used when setting documents in Asian languages (which I'm too lazy to look up). Further, there are the other alphabets in use nowadays, like the Cyrillic (Бб) and Greek (Δδ) alphabets.

There's also Turkey, which is just kind of difficult according to Jeff Atwood. Using the uppercasing/lowercasing functions provided by your environment is (usually) the way to go with user-input data.
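
A small Python sketch of the pairs mentioned above, assuming a Unicode-aware `str.lower()`/`str.upper()` as in Python 3. The fullwidth code points (U+FF21 and U+FF41) are filled in here for illustration, and `is_case_pair` is a hypothetical helper name.

```python
# (uppercase, lowercase) pairs from several scripts.
pairs = [
    ("À", "à"),            # accented Latin
    ("Ĳ", "ĳ"),            # Latin ligature IJ
    ("\uFF21", "\uFF41"),  # fullwidth Latin A / a
    ("Б", "б"),            # Cyrillic
    ("Δ", "δ"),            # Greek
]


def is_case_pair(upper: str, lower: str) -> bool:
    """Hypothetical helper: True if the library maps each form to the other."""
    return upper.lower() == lower and lower.upper() == upper


for upper, lower in pairs:
    print(f"U+{ord(upper):04X} -> U+{ord(lower):04X}", is_case_pair(upper, lower))
```

Note that the distance between the upper and lower code points is not the same for every pair, which is why a table-driven library function is needed.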
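
For context on the Turkish case: Turkish has a dotted and a dotless i, each with its own uppercase form (İ/i and I/ı). Python's `str.upper()` and `str.lower()` do not take the locale into account, so the sketch below uses two hypothetical helpers, `turkish_upper` and `turkish_lower`, that special-case those letters. It is an illustration under those assumptions, not a complete locale-aware implementation.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase with Turkish i/ı rules applied first."""
    return text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase with Turkish I/İ rules applied first."""
    return text.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i").lower()


# Locale-independent mapping: fine for English, wrong for Turkish text.
print("istanbul".upper())
# Turkish-aware mapping keeps the dot on the capital letter.
print(turkish_upper("istanbul"))
# The default lowercase of İ is two code points: "i" plus a combining dot above.
print(len(DOTTED_CAPITAL_I.lower()))
```

Where the environment provides locale-aware case functions (ICU, for example), those are preferable to hand-written replacements like these.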

Paul Fisher 2009-05-30 05:37:39

Answer 3

+7 A:

The English language, and even that strange variant, American English :-) , is not the only language on the planet. There are some very strange looking ones (at least to those familiar with the Latin-based characters) but even Latin-based ones have minor variations.

Two of which I am acquainted with on more than a casual basis are Greek and German:

Αα Ββ Γγ Δδ Εε Ζζ  Ηη Θθ Ιι Κκ Λλ Μμ
Νν Ξξ Οο Ππ Ρρ Σσς Ττ Υυ Φφ Χχ Ψψ Ωω

Aa Ää Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn
Oo Öö Pp Qq Rr Ss ß  Tt Uu Üü Vv Ww Xx Yy Zz

That's why we're not allowed to use bits of code like:

char lower = upper - 'A' + 'a';

any more. Doing something like that in a company that takes i18n seriously is near grounds for dismissal. Using Unicode-aware toLower()/toUpper()-type functions is the better way to go.
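
To make the failure concrete, here is a minimal Python sketch that applies the same arithmetic to code points. `ascii_offset_lower` is a hypothetical name for the trick above; in C with a one-byte `char` the details differ, but the underlying assumption (lowercase is always a fixed offset from uppercase) is the same.

```python
def ascii_offset_lower(upper: str) -> str:
    """Python rendering of `upper - 'A' + 'a'`, applied to a single code point."""
    return chr(ord(upper) - ord("A") + ord("a"))


# Works for A-Z, and by coincidence for some other ranges with the same offset.
print(ascii_offset_lower("Q"), ascii_offset_lower("Ä"))

# Silently wrong for anything else: a digit turns into a letter...
print(repr(ascii_offset_lower("1")))
# ...and Ĳ (U+0132) lands on U+0152 instead of its lowercase ĳ (U+0133).
print(f"U+{ord(ascii_offset_lower('Ĳ')):04X}", f"U+{ord('Ĳ'.lower()):04X}")

# Case mapping can also depend on position: Greek capital sigma lowercases to
# the final form at the end of a word and to the ordinary form elsewhere.
word = "ΟΔΟΣ"
print(word.lower(), "Σ".lower())
```

The Unicode-aware `str.lower()` gets all of these right because it consults the Unicode case-mapping data rather than assuming a fixed offset.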

paxdiablo 2009-05-30 07:26:12

In German it's even more complicated than that: while uppercase ß has recently been given a Unicode code point, there is really no uppercase ß character. The only proper rendering of uppercase ß is SS. If you convert from uppercase to lowercase you are in trouble, because you don't know what SS was in lowercase: ss or ß. (Yes, there are fonts with an uppercase ß, but that doesn't mean it's a valid character in the German alphabet. I have never seen it used outside of demos for those fonts.)
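
A short Python sketch of that round-trip problem, assuming Python 3's built-in case mapping. The capital sharp s code point (U+1E9E) is filled in here for illustration.

```python
word = "straße"

upper = word.upper()   # ß has no single-character uppercase mapping here
back = upper.lower()   # lowercasing cannot know whether SS came from ss or ß

print(upper, back, back == word)

# The dedicated capital sharp s does lowercase to ß, but upper() never produces it.
capital_sharp_s = "\u1E9E"
print(capital_sharp_s.lower() == "ß", "ß".upper() == capital_sharp_s)

# For case-insensitive comparison, casefold() sidesteps the lossy round trip.
print(word.casefold() == upper.casefold())
```

So uppercasing changes the length of the string, lowercasing does not restore the original, and caseless matching is better done with `casefold()` than with `lower()`.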

Ludwig Weinzierl 2009-05-30 08:25:37

Is ß actually a ligature? It's an old-fashioned "long s" joined to a terminal s. You'll see something similar in English in an old printing of the Declaration of Independence ("pursuit of happiness" looking like "purfuit of happinefs"). Before computerized typesetting, ligatures of fl, fi, and ffl were common in books printed in English. Those were not considered single characters.

Mark Lutton 2009-06-05 01:38:30

It is a ligature, yes, the "eszett" name is a dead giveaway ("sz").
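
For what it's worth, Unicode does assign code points to the typographic ligatures and to the long s, as compatibility characters. The Python sketch below shows how they behave under normalization and uppercasing; `describe` is a hypothetical helper, and the code points listed are supplied here for illustration.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Hypothetical helper: (Unicode name, NFKC form, uppercase form) of one character."""
    return (unicodedata.name(ch), unicodedata.normalize("NFKC", ch), ch.upper())


# fi, fl and ffl ligatures, the long s, and ß.
for ch in ["\uFB01", "\uFB02", "\uFB04", "\u017F", "\u00DF"]:
    name, nfkc, upper = describe(ch)
    print(f"U+{ord(ch):04X}", name, repr(nfkc), repr(upper))
```

The typographic ligatures fold back to plain letters under NFKC, whereas ß has no decomposition at all: Unicode treats it as a letter in its own right, whatever its history as a ligature.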

paxdiablo 2009-06-05 01:59:39


Unicode lowercase characters?
