ansaurus

Question

How to find out whether a character belongs to a particular code page using C++ or the WinAPI?

Answer 1

A:

Using the Windows functions WideCharToMultiByte and MultiByteToWideChar, you can convert between UTF-8 and UTF-16 (16-bit Unicode) characters. Both functions take a code page argument and flags that control what happens when an invalid character is encountered.
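As a minimal sketch of the first half of that round trip, the helper below converts UTF-8 to UTF-16 with the usual two-call pattern (size query, then conversion). The name Utf8ToUtf16 is made up for illustration, and returning an empty string on failure is an assumption of this sketch, not something the answer prescribes.

```cpp
#ifdef _WIN32   // this sketch is Windows-only
#include <windows.h>
#include <string>

// Hypothetical helper: converts UTF-8 text to UTF-16.
// Returns an empty string if the input is empty or is not valid UTF-8.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call: ask how many UTF-16 code units are needed.
    // MB_ERR_INVALID_CHARS makes the call fail on malformed UTF-8
    // instead of silently substituting a replacement character.
    const int length = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), static_cast<int>(utf8.size()),
                                           NULL, 0);
    if (length == 0)
        return std::wstring();

    // Second call: convert into a buffer of exactly that size.
    std::wstring utf16(length, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &utf16[0], length);
    return utf16;
}
#endif // _WIN32
```

For example, Utf8ToUtf16("\xE0\xA4\xB0") yields the single UTF-16 code unit U+0930 (the Devanagari letter RA).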

Patrick 2010-03-10 13:36:27

Thanks, yes, you are right. I was using the LPBOOL lpUsedDefaultChar parameter of WideCharToMultiByte() to determine this. However, for the Hindi IME, which has code page 0, lpUsedDefaultChar always comes back TRUE. [Not sure how my previous comment got removed :( but I had described it in detail there.]

Prakash 2010-03-10 13:44:15

Answer 2

+1 A:

First, convert your UTF-8 string to UTF-16 using MultiByteToWideChar.
Then reverse the process using WideCharToMultiByte, passing the desired code page as the first parameter.

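To find out whether any character cannot be represented in the target code page, pass WC_NO_BEST_FIT_CHARS and check the lpUsedDefaultChar parameter; lpDefaultChar selects the replacement character that is written in its place. Note that the WC_ERR_INVALID_CHARS flag, which makes WideCharToMultiByte fail outright on invalid input, applies only when the target is CP_UTF8 (or code page 54936), so it does not help when testing an ANSI code page.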

LPCWSTR pszUtf16; // converted from the UTF-8 source string
UINT nTargetCP = CP_ACP;
BOOL fBadCharacter = FALSE;
// Size query only: -1 means pszUtf16 is null-terminated, and no output buffer is passed.
if(WideCharToMultiByte(nTargetCP,WC_NO_BEST_FIT_CHARS,pszUtf16,-1,NULL,0,NULL,&fBadCharacter))
{
  if(fBadCharacter)
  {
    // at least one character in the string was not represented in nTargetCP
  }
}
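
The check above can be wrapped in a small reusable function. This is only a sketch: the name IsRepresentableInCodePage is hypothetical, and treating an empty string as representable is a choice made here. It is not meant for CP_UTF7 or CP_UTF8, where lpUsedDefaultChar must be NULL and the call would otherwise fail.

```cpp
#ifdef _WIN32   // this sketch is Windows-only
#include <windows.h>
#include <string>

// Hypothetical helper: returns true if every character of `text` has an
// exact representation in `codePage`.
bool IsRepresentableInCodePage(const std::wstring& text, UINT codePage)
{
    if (text.empty())
        return true;    // nothing to convert (assumption of this sketch)

    BOOL usedDefaultChar = FALSE;

    // Size query only (no output buffer). WC_NO_BEST_FIT_CHARS stops
    // look-alike substitutions from being counted as a match.
    const int bytes = WideCharToMultiByte(codePage, WC_NO_BEST_FIT_CHARS,
                                          text.data(), static_cast<int>(text.size()),
                                          NULL, 0, NULL, &usedDefaultChar);
    if (bytes == 0)
        return false;   // the call itself failed, e.g. unknown code page

    return usedDefaultChar == FALSE;
}
#endif // _WIN32
```

With this, IsRepresentableInCodePage(L"A", 1252) is true, while the same call with L"\u0930" (a Devanagari letter) is false.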

Chris Becke 2010-03-10 14:02:24

Thanks Chris. I posted my comment as an answer so that I could add some code.

Prakash 2010-03-10 14:28:17

Answer 3

A:

Thanks Chris. I am running the following code:

#include <windows.h>
#include <stdio.h>

#define CP_HINDI 0
#define CP_JAPANESE 932
#define CP_ENGLISH 1252

/* Wide character literals need the L prefix. */
wchar_t wcsStringJapanese = L'あ';
wchar_t wcsStringHindi = L'र';
wchar_t wcsStringEnglish = L'A';

int main()
{
    BOOL usedDefaultCharacter = FALSE;

    /* Test for English. The length is 1 because a single wchar_t is
       passed, not a null-terminated string. */
    WideCharToMultiByte( CP_ENGLISH,
                        0,
                        &wcsStringEnglish,
                        1,
                        NULL,
                        0,
                        NULL,
                        &usedDefaultCharacter);
    printf("usedDefaultCharacters for English? %d \n",usedDefaultCharacter);

    /* Test for Japanese */
    usedDefaultCharacter = FALSE;

    WideCharToMultiByte( CP_JAPANESE,
                        0,
                        &wcsStringJapanese,
                        1,
                        NULL,
                        0,
                        NULL,
                        &usedDefaultCharacter);
    printf("usedDefaultCharacters for Japanese? %d \n",usedDefaultCharacter);

    /* Test for Hindi */
    usedDefaultCharacter = FALSE;

    WideCharToMultiByte( CP_HINDI,
                        0,
                        &wcsStringHindi,
                        1,
                        NULL,
                        0,
                        NULL,
                        &usedDefaultCharacter);
    printf("usedDefaultCharacters for Hindi? %d \n",usedDefaultCharacter);

    return 0;
}

The above code returns:

usedDefaultCharacters for English? 0

usedDefaultCharacters for Japanese? 0

usedDefaultCharacters for Hindi? 1

The third line looks incorrect to me: the code page for Hindi is 0 and the string passed consists of a Hindi character, yet usedDefaultCharacter is set to 1, which should not be the case.

Prakash 2010-03-10 14:26:47

The code page for Hindi is NOT zero. Hindi is one of the newer 'Unicode only' localizations: there is no Windows ANSI code page for representing Hindi characters. Refer to this page: http://msdn.microsoft.com/en-us/goglobal/bb688174.aspx
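
One way to see this from code, as a sketch for Windows Vista or later: ask GetLocaleInfoEx for the locale's default ANSI code page. For Unicode-only locales the documented result is 0, which is also the value of CP_ACP, and that may be where the "code page 0" for Hindi came from (this last point is a guess). The helper name DefaultAnsiCodePage and its -1 error value are made up for illustration.

```cpp
#ifdef _WIN32   // this sketch is Windows-only
#include <windows.h>

// Hypothetical helper: returns the default ANSI code page of a locale
// name such as L"hi-IN", or -1 if the query fails.
int DefaultAnsiCodePage(const wchar_t* localeName)
{
    DWORD codePage = 0;

    // LOCALE_RETURN_NUMBER asks for the value as a DWORD instead of text;
    // the buffer size is then expressed in WCHARs.
    const int result = GetLocaleInfoEx(localeName,
                                       LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                                       reinterpret_cast<LPWSTR>(&codePage),
                                       sizeof(codePage) / sizeof(WCHAR));

    return result == 0 ? -1 : static_cast<int>(codePage);
}
#endif // _WIN32
```

DefaultAnsiCodePage(L"ja-JP") gives 932 and DefaultAnsiCodePage(L"hi-IN") gives 0, meaning "no ANSI code page, Unicode only".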

Chris Becke 2010-03-10 15:18:58

So is there any value that I can pass as the code page parameter of WideCharToMultiByte to find out whether the current encoding supports the Hindi character? Or is there a way (in C++) to find out whether the current encoding of the page is Unicode? Thanks.

Prakash 2010-03-10 17:03:54

Answer 4

+1 A:

The two previous answers have correctly suggested using MultiByteToWideChar then WideCharToMultiByte to translate your UTF-8 character to UTF-16, then to the current Windows codepage (CP_ACP). Check the result of WideCharToMultiByte to see if the conversion was successful.

What wasn't clear from the original question is that you are having a particular issue with Hindi. For this language, the question is meaningless because there is no Windows ANSI code page for Hindi, as Chris Becke pointed out. Therefore, you can never convert a Hindi character to CP_ACP: WideCharToMultiByte will always substitute the default character and report that through lpUsedDefaultChar.

To use Hindi on Windows, as far as I understand it, your application must be a Unicode app that calls the Unicode APIs.
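
When Hindi text does have to leave the program as bytes, one option that avoids ANSI code pages entirely is UTF-8. The sketch below assumes Windows Vista or later (for WC_ERR_INVALID_CHARS); the helper name Utf16ToUtf8 is hypothetical, and returning an empty string on failure is a choice made for this example.

```cpp
#ifdef _WIN32   // this sketch is Windows-only
#include <windows.h>
#include <string>

// Hypothetical helper: converts UTF-16 text to UTF-8.
// Returns an empty string if the input is empty or contains an
// unpaired surrogate.
std::string Utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty())
        return std::string();

    // For CP_UTF8 the last two parameters (lpDefaultChar and
    // lpUsedDefaultChar) must be NULL, otherwise the call fails.
    const int bytes = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                          utf16.data(), static_cast<int>(utf16.size()),
                                          NULL, 0, NULL, NULL);
    if (bytes == 0)
        return std::string();

    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                        utf16.data(), static_cast<int>(utf16.size()),
                        &utf8[0], bytes, NULL, NULL);
    return utf8;
}
#endif // _WIN32
```

Utf16ToUtf8(L"\u0930") produces the three bytes E0 A4 B0, so the Hindi character survives without any default-character substitution.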

richb 2010-03-11 05:34:30
