I have to import some UTF-8 encoded text files into my C++Builder 5 program. Are there any components or code samples to accomplish that?

A: 

As no one is working on weekends, I have to answer it myself :)

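#include <string.h>  // strlen

// Converts the UTF-8 string aData to single-byte text in aValue.
// Only code points U+0000..U+00FF fit in one byte; this equals Windows-1252
// everywhere except bytes 0x80..0x9F. Anything above U+00FF becomes '?'.
// aValue must have room for strlen(aData)+1 bytes.
// Assumes well-formed UTF-8; continuation bytes are not validated.
String Utf8ToWinLatin1(const char* aData, char* aValue)
{
    // Work on unsigned bytes: with a signed char, bytes >= 128 are negative
    // and the arithmetic below would give wrong results.
    const unsigned char* src = (const unsigned char*)aData;
    int len = strlen(aData);
    int i = 0;
    for (int j = 0; j < len; )
    {
        int c = src[j];
        int extra;                     // number of continuation bytes
        long val;                      // decoded code point
        if (c <= 127)                  { extra = 0; val = c; }
        else if (c >= 192 && c <= 223) { extra = 1; val = c - 192; }
        else if (c >= 224 && c <= 239) { extra = 2; val = c - 224; }
        else if (c >= 240 && c <= 247) { extra = 3; val = c - 240; }
        else { j += 1; continue; }     // stray continuation or invalid byte: skip it
        if (j + extra >= len)
            break;                     // truncated sequence at end of input
        for (int k = 1; k <= extra; k++)
            val = val * 64 + (src[j + k] - 128);   // each continuation byte adds 6 bits
        aValue[i++] = (val <= 255) ? (char)val : '?';
        j += extra + 1;
    }
    aValue[i] = 0;                     // terminate before converting to String
    return aValue;
}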
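
To see why the two-byte branch gives back characters like Õ, here is a small worked sketch in plain C++. The helper name decodeTwoByte is made up for illustration and is not part of the function above; it just isolates the (lead-192)*64 + (next-128) arithmetic, written with bit masks.

```cpp
// Hypothetical helper (illustration only): combines the payload bits of a
// two-byte UTF-8 sequence into one code point.
// Expects lead in 0xC0..0xDF and cont in 0x80..0xBF; no validation is done.
unsigned decodeTwoByte(unsigned char lead, unsigned char cont)
{
    // lead & 0x1F is the same as lead - 192 for a valid lead byte,
    // cont & 0x3F is the same as cont - 128 for a valid continuation byte.
    return ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
}
```

For example, Õ (U+00D5) is stored in UTF-8 as the bytes 0xC3 0x95, so decodeTwoByte(0xC3, 0x95) gives 3*64 + 21 = 213 = 0xD5, which is also the byte for Õ in Latin-1 and Windows-1252.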
Riho
You should know that ASCII only has 128 characters compared to the 1,114,112 Unicode characters that can be encoded with UTF-8. So you will lose all characters that are not in the ASCII charset.
Gumbo
Your function would be better named something like `Utf8ToWinLatin1()` - `ConvertAnsi` doesn't specify what gets converted to what; also, 'ANSI' isn't the name of any encoding...
Christoph
I don't care about 1,000,000 characters - I only want my native ones back (ÕÖÄÜ). I called it Ansi, because that's what it is called in Notepad :) when you select SaveAs.
Riho
+2  A: 

You are best off reading all the other questions on SO that are tagged unicode and c++. For starters you should probably look at this one and see whether the library in the accepted answer (UTF8-CPP) works for you.

I would, however, first think about what you're trying to achieve, as there is no way you can just import UTF-8-encoded strings into "Ansi" (whatever you mean by that, maybe something like ISO8859_1 or WIN1252 encoding?).
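
The difference between those two candidates is small but real: they agree on bytes 0x00..0x7F and 0xA0..0xFF, and differ only in 0x80..0x9F, where Windows-1252 puts printable characters such as the euro sign. A rough sketch of a code point lookup follows. The function name toWin1252 is hypothetical, and treating the C1 control code points U+0080..U+009F as unmappable is a simplification of this sketch.

```cpp
// Hypothetical helper: maps a Unicode code point to its Windows-1252 byte,
// or returns -1 if the character has no Windows-1252 equivalent.
int toWin1252(unsigned long cp)
{
    // Code points of Windows-1252 bytes 0x80..0x9F; 0 marks an unassigned byte.
    static const unsigned short high[32] = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
    };
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
        return static_cast<int>(cp);   // same byte as in ISO 8859-1
    for (int k = 0; k < 32; ++k)
        if (high[k] != 0 && high[k] == cp)
            return 0x80 + k;           // character from the 0x80..0x9F block
    return -1;                         // not representable (simplified, see above)
}
```

With this, toWin1252(0x20AC) (the euro sign) yields 0x80, toWin1252(0xD5) (Õ) yields 0xD5 just as in ISO 8859-1, and a Chinese or Cyrillic code point yields -1.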

mghie
A: 

Your question doesn't say specifically which character set you want to convert to. If you only want the basic 7-bit ASCII charset, discarding every character with a higher value than 127 will work.

If you want to convert to an 8-bit character set, such as latin1, you'll have to do it the hard way.
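
The ASCII-only route can be sketched in a few lines of standard C++. This works because every byte of a multi-byte UTF-8 sequence, lead or continuation, has a value of 128 or more, so dropping those bytes never leaves half a character behind. The name stripNonAscii is made up for this sketch, and std::string is used instead of the VCL string types.

```cpp
#include <string>

// Hypothetical helper: keeps only the 7-bit ASCII bytes of a UTF-8 string.
// Every non-ASCII character is dropped entirely.
std::string stripNonAscii(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (std::string::size_type k = 0; k < utf8.size(); ++k)
    {
        // Compare as unsigned so bytes >= 0x80 are not seen as negative.
        unsigned char b = static_cast<unsigned char>(utf8[k]);
        if (b < 0x80)
            out += utf8[k];
    }
    return out;
}
```

Note that this removes characters rather than replacing them, so a word like "Tõll" comes out as "Tll".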

jalf
This way, you'll lose half the characters of WinLatin1 (aka 'ANSI')
Christoph
He didn't ask about conversion to Latin1 though, just to "ANSI", which can mean a lot of things. Of course, if he wants to convert to some specific 8-bit character set (such as latin1), you're right: this won't work.
jalf
@jalf: 'ANSI' is a common, incorrect label for Windows-1252 (aka WinLatin1); check Wikipedia for details...
Christoph
"The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community"
Christoph
Yep, not saying that isn't what he meant, just that if it isn't, and if he only wants the 128 ASCII chars, this is a much simpler solution than his own
jalf
In any case there will be data loss when there are characters that are not elements of the smaller charset.
Gumbo
Yes, I don't want any Cyrillic or Chinese characters, I just need the common Win-1252 symbols out (like öõäü). And it works.
Riho
+1  A: 

Here is a more VCL-centric approach for you:

UTF8String utf8 = "...";
WideString utf16;
AnsiString latin1;

// Step 1: UTF-8 -> UTF-16. The first call only returns the required length.
int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), utf8.Length(), NULL, 0);
utf16.SetLength(len);
::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), utf8.Length(), utf16.c_bstr(), len);

// Step 2: UTF-16 -> Windows-1252 (code page 1252), again sizing the buffer first.
len = ::WideCharToMultiByte(1252, 0, utf16.c_bstr(), utf16.Length(), NULL, 0, NULL, NULL);
latin1.SetLength(len);
::WideCharToMultiByte(1252, 0, utf16.c_bstr(), utf16.Length(), latin1.c_str(), len, NULL, NULL);

If you upgrade to C++Builder 2009, you can simplify it to this:

UTF8String utf8 = "...";
AnsiStringT<1252> latin1 = utf8;
Remy Lebeau - TeamB