How do I convert a wchar_t string from upper case to lower case in C++?

The string contains a mixture of Japanese, Chinese, German and Greek characters.

I thought about using towlower...

http://msdn.microsoft.com/en-us/library/8h19t214%28VS.80%29.aspx

.. but the documentation says that:

The case conversion of towlower is locale-specific. Only the characters relevant to the current locale are changed in case.

Edit: Maybe I should describe what I'm doing. I receive a Unicode search query from a user. It's originally in UTF-8 encoding, but I'm converting it to a wide-character string (I may be wrong on the wording). My debugger (VS2008) correctly shows the Japanese, German, etc. characters in the "variable quick watch". I need to go through another set of data in Unicode and find matches of the search string. This is no problem when the search is case sensitive, but doing it case insensitively is harder. My (maybe naive) approach would be to convert both the search string and the data being searched to lower case and then compare them.

+3  A: 

You have a nasty problem on your hands. A Japanese locale will not help with converting German, and vice versa. There are also languages that have no concept of capitalization at all (toupper and friends would be a no-op there, I suppose). So, can you break your string up into chunks of words from the same language? If you can, then you can convert the pieces individually and join them back together.

dirkgently
Japanese and the other ideographic languages from East Asia are examples of languages mainly without upper-case.
Jonathan Leffler
Not only that, but individual languages can have _different_ opinions on how a particular letter should be upper/lowercased. There's simply no single algorithm to do it properly on any random Unicode string without knowing the language.
Pavel Minaev
Though I agree with that assessment, Unicode includes locale-independent uppercase/lowercase properties. Their usage is described under *3.13 "Default Case Operations"*, which *are to be used in the absence of tailoring for particular languages*, so the standard says.
Abel
It does. The problem is that it is right for, say, 99% of all cases, but you'll get 1% wrong. Which may or may not be a problem. In general, it's good enough when you use it for things like identifiers in code, and maybe even filenames.
Pavel Minaev
@Pavel: Which means that you can't do it right all the time, but you can do it consistently all the time. I know that lowercasing 'I' to 'i' is wrong in Turkish, but if you're just normalizing the string for comparison rather than printing out the result it may work just fine.
David Thornley
@David: it might not work fine. Say you have text "Diyarbakır" in the original document, and the user entered "DİYARBAKIR" search string. You use the default Unicode casing rules to lowercase both strings; the first one becomes "diyarbakır", the second one "diyarbakir". And now they don't match, and they really should have, if the text is Turkish.
Pavel Minaev
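
The Diyarbakır mismatch described in the comment above can be reproduced with a small sketch. Everything here is hypothetical and hand-written for illustration: `default_lower` is a stand-in that covers only ASCII A-Z plus the simple (one-to-one) Unicode lowercase mapping of U+0130, and `turkish_lower` adds the Turkish tailoring of `I` to `ı`. Neither is a library function, and a real implementation would use full Unicode tables (for example via ICU).

```cpp
#include <cstddef>
#include <string>

// Illustrative stand-in for the default (untailored) simple lowercase
// mapping. Covers only ASCII A-Z and U+0130; everything else is unchanged.
wchar_t default_lower(wchar_t ch) {
    if (ch >= L'A' && ch <= L'Z') {
        return static_cast<wchar_t>(ch - L'A' + L'a');
    }
    if (ch == 0x0130) {          // İ (capital I with dot above)
        return L'i';             // simple mapping: İ -> i
    }
    return ch;                   // e.g. ı (U+0131) is already lowercase
}

// Illustrative Turkish tailoring: I -> ı, otherwise same as the default.
wchar_t turkish_lower(wchar_t ch) {
    if (ch == L'I') {
        return static_cast<wchar_t>(0x0131);   // dotless ı
    }
    return default_lower(ch);
}

// Applies a per-character mapping to a copy of the string.
std::wstring lower_with(const std::wstring& s, wchar_t (*map)(wchar_t)) {
    std::wstring out = s;
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = map(out[i]);
    }
    return out;
}
```

With `L"Diyarbak\u0131r"` as the document text and `L"D\u0130YARBAKIR"` as the query, `lower_with(..., default_lower)` produces two different strings, while `lower_with(..., turkish_lower)` produces the same string for both, which is why the language of the text has to be known.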
+4  A: 

If your string contains all those characters, the codeset must be Unicode-based. Unicode (Chapter 4, 'Character Properties') defines character properties, including whether a character is upper case and what its lower case mapping is, so a properly implemented library has what it needs to do the conversion.

Given that preamble, the towlower() function from <wctype.h> is the correct tool to use. If it doesn't do the job, you have a QoI (Quality of Implementation) problem to discuss with your vendor. If you find the vendor unresponsive, then look at alternative libraries. In this case, you might consider ICU (International Components for Unicode).
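
A minimal sketch of what applying towlower() to a whole string could look like. The helper name `to_lower_copy` is hypothetical, not part of any library. It converts one `wchar_t` at a time using the current C locale, so it cannot express one-to-many mappings, and on Windows (where `wchar_t` is a 16-bit UTF-16 unit) it does not handle characters encoded as surrogate pairs.

```cpp
#include <cstddef>
#include <cwctype>   // std::towlower, std::wint_t
#include <string>

// Hypothetical helper: returns a lowercased copy of `input`, converting
// each wchar_t independently with std::towlower. The result depends on the
// current C locale (LC_CTYPE), which is "C" unless the program changes it.
std::wstring to_lower_copy(const std::wstring& input) {
    std::wstring result = input;
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = static_cast<wchar_t>(
            std::towlower(static_cast<std::wint_t>(result[i])));
    }
    return result;
}
```

To get anything beyond the default "C" locale behaviour, the program would first have to call `std::setlocale(LC_CTYPE, ...)` with a locale name, and those names are platform-specific. How many non-ASCII characters are actually converted depends on the implementation.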

Jonathan Leffler
Unicode case mappings, as specified in the document that you've linked to, are still partially locale-dependent. Quote: "SpecialCasing.txt - Contains additional case mappings that map to more than one character, such as “ß” to “SS”. Also contains context-dependent mappings, with flags to distinguish them from the normal mappings, as well as _some locale-dependent mappings_.". So `tolower` cannot avoid being locale specific.
Pavel Minaev
@Pavel This process is called "normalization of Unicode strings", which makes sure that `ß` and `ss` are treated equal (depending on chosen normalization form) and Unicode contains language-neutral algorithms for that, while not ignoring the wish for locale or application specific treatment.
Abel
@Abel: normalization is not a complete solution. For example, in some Latin languages diacritics disappear on uppercased letters, in other languages they do not. There's no way to tell unless you know which language the text is written in. Then, of course, there's the infamous Turkish dotless "i" problem - you want `İ` to lowercase to `i` and `I` to lowercase to `ı` for Turkish, but you want `I` to lowercase to `i` for any other Latin alphabet language.
Pavel Minaev
@Pavel: that's an excellent elaboration, I fully agree. No, normalization is not perfect, it's more of a simplistic brute-force method, but it helps in a fine bunch of situations. This is probably a good moment in the discussion to include a link to the Unicode Collation Algorithm, which discusses this in full (it goes much further than lowercase/uppercase): http://unicode.org/reports/tr10/ and the Unicode Case Mapping: http://unicode.org/reports/tr21/tr21-5.html
Abel
+1  A: 

Consider the (current) second answer of this Stack Overflow thread; it shows how to use facets to work with several locales. If this is on Windows, you can consider using Win32 API functions. If you can work with C++.NET (managed C++), you can use the char.ToLower and string.ToLower functions, which are Unicode compliant.
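
As a rough sketch of the facet approach mentioned above (not the code from the linked answer), the standard `std::ctype<wchar_t>` facet can lowercase a buffer for a locale that is passed in explicitly, instead of relying on the global one. The helper name `to_lower_with_locale` is hypothetical.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: returns a lowercased copy of `input` using the
// std::ctype<wchar_t> facet of the locale supplied by the caller.
std::wstring to_lower_with_locale(const std::wstring& input,
                                  const std::locale& loc) {
    std::wstring result = input;
    if (!result.empty()) {
        const std::ctype<wchar_t>& facet =
            std::use_facet<std::ctype<wchar_t> >(loc);
        // In-place range overload: tolower(begin, end).
        facet.tolower(&result[0], &result[0] + result.size());
    }
    return result;
}
```

A call such as `to_lower_with_locale(text, std::locale::classic())` works everywhere but only reflects the "C" locale. Named locales such as `std::locale("de_DE.UTF-8")` are an assumption about the platform: the accepted names differ between systems, and the constructor throws `std::runtime_error` if the name is not available.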

Abel
A: 

Have a look at _wcslwr_l in <wchar.h> (MSDN).

You should be able to run the function on the input for each of the locales.

Jon Seigel
"You should be able to run the function on the input for each of the locales." - what if two locales in the set map the same character differently?
Pavel Minaev
As mentioned in other comments, you have to know the language of each part of the string in order to avoid those cases. There's really no getting around that. I'm merely suggesting a different function to use to more easily manage the issue with running the operation on the current locale.
Jon Seigel