Hi!

I want to encode strings the same way Python does.

Python code is this:

def EncodeToUTF(inputstr):
  uns = inputstr.decode('iso-8859-2')
  utfs = uns.encode('utf-8')
  return utfs

This is very simple.

But in Delphi I don't understand how to do the encoding so that the correct character set is forced first, no matter which computer the code runs on.

I tried this test code to see the conversion:

procedure TForm1.Button1Click(Sender: TObject);
var
    w : WideString;
    buf : array[0..2048] of WideChar;
    i : integer;
    lc : Cardinal;
begin
    lc := GetThreadLocale;
    Caption := IntToStr(lc);
    StringToWideChar(Edit1.Text, buf, Length(buf)); // DestSize is a character count, not a byte count
    w := buf;
    lc := MakeLCID(
        MakeLangID( LANG_ENGLISH, SUBLANG_ENGLISH_US),
        0);
    Win32Check(SetThreadLocale(lc));
    Edit2.Text := WideCharToString(PWideChar(w));
    Caption := IntToStr(AnsiCompareText(Edit1.Text, Edit2.Text));
end;

The input is "árvíztűrő tükörfúrógép", the Hungarian accent test phrase. The original thread locale is 1038 (Hungarian), the new one is 1033 (English, US).

But this always gives 0 as the result (the strings are the same), and the accents are unchanged: I don't lose ŐŰ, which do not exist in the English character set.

What am I doing wrong? How can I do the same thing that Python does?

Thanks for any help, links, etc.: dd

A: 

If you're using Delphi 2009 or newer, every input from the default VCL controls will be UTF-16, so there is no need to do any conversion on your input.

If you're using Delphi 2007 or older (as it seems), you are at the mercy of Windows, because the VCL is ANSI and Windows has a fixed code page that determines which characters can be used in, e.g., a TEdit.
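
To illustrate what a fixed ANSI code page means in practice, here is a small Python 3 sketch (not Delphi code). The helper name `fits_code_page` is made up for this example, and Python's `cp1250` and `cp1252` codecs are used as stand-ins for the Windows Central European and Western European ANSI code pages. Windows itself may substitute similar-looking characters instead of failing, so treat this only as a model of which characters a code page can hold.

```python
def fits_code_page(text, codec):
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


phrase = "árvíztűrő tükörfúrógép"

# cp1250 (Central European) contains ő and ű, cp1252 (Western European) does not.
print(fits_code_page(phrase, "cp1250"))
print(fits_code_page(phrase, "cp1252"))
```

On a Hungarian system the test phrase fits the ANSI code page, which would explain why no accents get lost in the question's test.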

You can change the system-wide default ANSI code page in the Control Panel, but that requires a reboot each time you do.

In Delphi 2007 you can use the TNT Unicode controls or a similar solution to get the text from the UI into your code.

In Delphi 2009 and newer there are also plenty of Unicode and character set handling routines in the RTL.

The conversion between character sets can be done with SysUtils.TEncoding:

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html

Jens Mühlenhoff
No reboot needed, happily switching between German and East European keyboards and code pages here (even in Windows 2000).
mjustin
A: 

There are encoding tools in the Open XML library. Its cUnicodeCodecsWin32 unit has functions like EncodingToUTF16().

My code that converts from ISO Latin-2 to UTF-8 looks like:

  s2 := EncodingToUTF16('ISO-8859-2', s);
  s2utf8 := UTF16ToEncoding('UTF-8', s2);
Michał Niklas
Note that newer versions of Delphi include OpenXML as an optional XML library.
mjustin
I tried it with Turbo Delphi (D2006)
Michał Niklas
A: 

The Python code in your question returns a string in UTF-8 encoding. To do this with pre-2009 Delphi versions you can use code similar to:

procedure TForm1.Button1Click(Sender: TObject);
var
  Src, Dest: string;
  Len: integer;
  buf : array[0..2048] of WideChar;
begin
  Src := Edit1.Text;
  Len := MultiByteToWideChar(CP_ACP, 0, PChar(Src), Length(Src), @buf[0], 2048);
  buf[Len] := #0;
  SetLength(Dest, 2048);
  SetLength(Dest, WideCharToMultiByte(CP_UTF8, 0, @buf[0], Len, PChar(Dest),
    2048, nil, nil));
  Edit2.Text := Dest;
end;

Note that this doesn't change the current thread locale, it simply passes the correct code page parameters to the API.
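
The point that the code page is a parameter of the conversion, not a property of the thread, can be sketched in Python 3 (an illustration only; the variable names are made up). The same byte decodes to different characters depending on which code page is passed:

```python
# One byte of ISO-8859-2 data: the letter "ő".
raw = "ő".encode("iso-8859-2")

# Decoding with the code page the data was written in gives the right letter.
as_latin2 = raw.decode("iso-8859-2")

# Decoding the same byte with a Western European code page gives another letter.
as_cp1252 = raw.decode("cp1252")

print(raw.hex(), as_latin2, as_cp1252)
```

This is what happens when `CP_ACP` is used on an English machine for ISO-8859-2 data: the bytes are interpreted with the wrong table, so the source code page has to be passed explicitly.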

mghie
Or simply `UTF8Encode(WideString(Edit1.Text))` for versions of Delphi that have UTF8Encode.
Jens Mühlenhoff
Sure. This answer however was meant to be close to the code in the question, hopefully illustrating where the problem in it is.
mghie
This simple example (casting to WideString) is not good, because we use ISO-8859-2 data but the machine where it runs is English, so we lose the accents... :-(
durumdara
And I forgot to write: Delphi 6 Professional.
durumdara
@durumdara: Yes, and the code in the answer works for any source encoding, one just has to pass the correct code page parameter to `MultiByteToWideChar()`. `CP_ACP` is for the current system code page, which is what the VCL conversion will always use.
mghie
+3  A: 

Windows uses codepage 28592 for ISO-8859-2. If you have a buffer containing ISO-8859-2 encoded bytes, then you have to decode the bytes to UTF-16 first, and then encode the result to UTF-8. Depending on which version of Delphi you are using, you can either:

1) on pre-D2009, use MultiByteToWideChar() and WideCharToMultiByte():

function EncodeToUTF(const inputstr: AnsiString): UTF8String;
var
  len: Integer;
  uns: WideString;
begin
  Result := '';
  // With an explicit source length, the returned count excludes the null terminator.
  len := MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), nil, 0);
  if len <= 0 then Exit; // empty input or conversion failure
  SetLength(uns, len);
  MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), PWideChar(uns), len);
  Result := UTF8Encode(uns);
end;

2a) on D2009+, use SysUtils.TEncoding.Convert():

function EncodeToUTF(const inputstr: RawByteString): UTF8String;
var
  enc: TEncoding;
  buf: TBytes;
begin
  enc := TEncoding.GetEncoding(28592);
  try
    buf := TEncoding.Convert(enc, TEncoding.UTF8, BytesOf(inputstr));
    SetString(Result, PAnsiChar(buf), Length(buf));
  finally
    enc.Free;
  end;
end;

2b) on D2009+, alternatively declare a new string type with its own code page, put your data into it, and assign it to a UTF8String variable. No manual encoding/decoding is needed; the RTL will handle everything for you:

type
  Latin2String = type AnsiString(28592);

var
  inputstr: Latin2String;
  outputstr: UTF8String;
begin
  // put the encoded bytes into inputstr, then...
  outputstr := inputstr;
end;
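
As a cross-check for any of the Delphi variants above, the decode-then-encode steps can be sketched in Python 3, where the input has to be a `bytes` object. The function name `encode_to_utf` and the sample bytes are chosen for this example only:

```python
def encode_to_utf(input_bytes):
    """Decode ISO-8859-2 bytes to Unicode text, then encode that text as UTF-8."""
    text = input_bytes.decode("iso-8859-2")
    return text.encode("utf-8")


# 0xF5 and 0xFB are "ő" and "ű" in ISO-8859-2.
latin2 = bytes([0xF5, 0xFB])
utf8 = encode_to_utf(latin2)
print(utf8.hex())  # c591c5b1
```

If the Delphi function returns the same bytes for the same input, it behaves like the Python code in the question.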
Remy Lebeau - TeamB