Hello All,

I have a string that contains UTF-8 characters, and a method that is supposed to convert every character to either upper or lower case. This is easily done with characters that overlap with ASCII, and obviously some characters cannot be converted, e.g. any Chinese character. However, is there a good way to detect and convert other characters that have upper and lower case forms, e.g. all the Greek characters? Also, please note that I need to be able to do this on both Windows and Linux.

Thank you,

+7  A: 

Have a look at ICU.

Note that lower case to upper case functions are locale-dependent. Think of Turkish, where the (ASCII) letter I lowercases to the dotless "ı" and the (ASCII) letter i uppercases to "İ", an uppercase I with a dot.
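
To make the locale point concrete, here is a minimal sketch of a per-code-point mapping with a Turkish tailoring switch. The names `lower_cp`, `upper_cp` and the `turkic` flag are hypothetical, and only ASCII letters plus the two Turkish pairs are handled; it is an illustration of the problem, not a replacement for ICU.

```cpp
// Sketch only: ASCII letters plus the Turkish dotted/dotless i pairs.
// lower_cp, upper_cp and the turkic flag are hypothetical names.

char32_t lower_cp(char32_t c, bool turkic) {
    if (turkic) {
        if (c == U'I') return 0x0131;     // I -> dotless i (U+0131)
        if (c == 0x0130) return U'i';     // I with dot (U+0130) -> i
    }
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c + 0x20);
    return c;                             // everything else is left unchanged
}

char32_t upper_cp(char32_t c, bool turkic) {
    if (turkic) {
        if (c == U'i') return 0x0130;     // i -> I with dot (U+0130)
        if (c == 0x0131) return U'I';     // dotless i (U+0131) -> I
    }
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - 0x20);
    return c;
}
```

The same input code point gives a different result depending on the flag, which is why a single locale-independent table cannot be "correct" for every user.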

Alexandre C.
Thank you for the tip, Alexandre. However, for this application I am precluded from linking against any 3rd-party libs, so I need to figure out how to do this without using that lib.
NSA
I suggest that you roll your own case mapping utility; check out http://www.unicode.org/faq/casemap_charprop.html. From there you can download all the special case mappings.
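
A rough sketch of what such a hand-rolled utility could look like: decode UTF-8 to code points, map each one, and re-encode. The function names are made up for illustration, the decoder assumes well-formed input, and the mapping covers only ASCII and the basic Greek letters; a real table would presumably be generated from the Unicode data files (UnicodeData.txt for simple mappings, SpecialCasing.txt for the special ones).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into code points. Assumes well-formed input (no validation).
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i++]);
        char32_t cp;
        int extra;                                  // continuation bytes to read
        if (b < 0x80)             { cp = b;        extra = 0; }
        else if ((b >> 5) == 0x6) { cp = b & 0x1F; extra = 1; }
        else if ((b >> 4) == 0xE) { cp = b & 0x0F; extra = 2; }
        else                      { cp = b & 0x07; extra = 3; }
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Encode code points back to UTF-8.
std::string encode_utf8(const std::u32string& cps) {
    std::string out;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Simple one-to-one uppercase mapping for ASCII and basic Greek only.
char32_t simple_upper(char32_t cp) {
    if (cp >= U'a' && cp <= U'z') return static_cast<char32_t>(cp - 0x20);
    if (cp == 0x03C2) return 0x03A3;                // final sigma -> capital sigma
    if (cp >= 0x03B1 && cp <= 0x03C9) return static_cast<char32_t>(cp - 0x20);
    return cp;                                      // no mapping known: unchanged
}

std::string to_upper_utf8(const std::string& utf8) {
    std::u32string cps = decode_utf8(utf8);
    for (char32_t& cp : cps) cp = simple_upper(cp);
    return encode_utf8(cps);
}
```

Characters without a case mapping, such as Chinese ones, pass through untouched. One-to-many mappings and locale tailorings are not covered by this sketch.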
jojaba
C++ has no Unicode support whatsoever. ICU is *the* way to go.
Alexandre C.
@NSA: Why can you not link against another library?
wilx
@NSA you can link statically against ICU. You could re-implement it, but why? Perhaps you can explain more about your preclusion.
Steven R. Loomis
+1  A: 

Assuming that you have access to wctype.h, convert your text to a wide-character (wchar_t) Unicode string and use towupper(). Then convert it back to UTF-8.
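
A minimal sketch of that round trip, under a few assumptions: `std::wstring_convert` with `std::codecvt_utf8<wchar_t>` is used for the conversion (deprecated since C++17, though still shipped by the common standard libraries), `wchar_t` is taken to be wide enough to hold a whole code point (true on Linux; on Windows it is 16 bits, so `std::codecvt_utf8_utf16` would be the closer fit), and the function name is invented for the example.

```cpp
#include <codecvt>
#include <cwctype>
#include <locale>
#include <string>

// UTF-8 -> wide string -> towupper on each element -> UTF-8.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::string upper_via_towupper(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring wide = conv.from_bytes(utf8);
    for (wchar_t& ch : wide)
        ch = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(ch)));
    return conv.to_bytes(wide);
}
```

What towupper() does with non-ASCII letters depends on the current C locale, so the caller would normally call something like `std::setlocale(LC_CTYPE, "")` first. Characters with no uppercase form come back unchanged.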

jojaba
or use ICU as Alexandre mentioned.
jojaba
You don't handle German ß and Greek final sigma this way.
Alexandre C.
@Alexandre C.: Whether or not characters like that get converted correctly depends entirely on the current locale.
caf
@Alexandre C: Even stronger, what _is_ correct depends on the locale. Your opinion on what is correct just isn't shared by the whole world; the most famous example being the Turkish i.
MSalters
@caf, @MSalters: In the German eszett case, the uppercase of ß is SS (i.e. two characters, obviously not handled by towupper), and for the Greek capital sigma, there are two different lowercase choices depending on whether it is at the end of a word or not (so not handled by towlower). Again, ICU solves these problems.
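
To illustrate why a per-character function cannot cover these two cases, here is a small sketch that works on whole strings of code points. All names are hypothetical, only ASCII and basic Greek letters are recognised, and the final-sigma test is a simplification of the contextual rule in the Unicode standard.

```cpp
#include <cstddef>
#include <string>

// Rough "is this a letter" test, limited to ASCII and basic Greek.
bool is_letter(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= 0x0391 && c <= 0x03C9 && c != 0x03A2);
}

// Uppercasing can grow the string: U+00DF (eszett) becomes "SS".
std::u32string upper_with_eszett(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x00DF) out += U"SS";
        else if (c >= U'a' && c <= U'z') out += static_cast<char32_t>(c - 0x20);
        else out += c;
    }
    return out;
}

// Lowercasing capital sigma depends on context: final form at the end of a word.
std::u32string lower_with_final_sigma(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (c == 0x03A3) {
            bool after_letter  = i > 0 && is_letter(in[i - 1]);
            bool before_letter = i + 1 < in.size() && is_letter(in[i + 1]);
            out += (after_letter && !before_letter) ? char32_t(0x03C2)   // final sigma
                                                    : char32_t(0x03C3);  // regular sigma
        } else if ((c >= U'A' && c <= U'Z') || (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2)) {
            out += static_cast<char32_t>(c + 0x20);
        } else {
            out += c;
        }
    }
    return out;
}
```

Both functions need to see more than one character at a time, either in the output (ß) or in the input (Σ), which is exactly what towupper() and towlower() cannot do.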
Alexandre C.