I am trying to parse RTF (via MSEDIT) in various languages, all in Delphi 2010, in order to produce Unicode HTML.

Taking Russian/Cyrillic as my starting point, I find that the document's overall codepage is 1252 (Western), but the Russian parts of the text are identified by the charset of the font (RUSSIAN_CHARSET, 204).

So far I am:

1) Using AnsiString (or RawByteString) when parsing the RTF

2) Determining the codepage by a lookup from the font charset (see http://msdn.microsoft.com/en-us/library/cc194829.aspx)

3) Translating using a lookup table in my code (this table was generated from http://msdn.microsoft.com/en-gb/goglobal/cc305144.aspx), so I'm going to need one table per supported codepage!

There MUST be a better way than this? Preferably something supplied by the OS and so less brittle than tables of constants.

+2  A: 

The Charset to codepage table is small enough, and static enough, that I doubt the system provides a function to do it.
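As a rough sketch of how small that table is, the lookup could be written as a single case statement. The name CharSetToCodePage matches the helper used in the snippet further down, but the body here is only an illustration: the pairs are the standard Windows charset/codepage pairings, and the fallback to 0 (CP_ACP, the system default ANSI codepage) for unlisted charsets is an assumption you may want to change.

```delphi
// Hypothetical helper: maps a font charset value (as found in the RTF
// font table) to the corresponding Windows ANSI codepage.
function CharSetToCodePage(CharSet: Byte): Word;
begin
  case CharSet of
    0:   Result := 1252; // ANSI_CHARSET
    128: Result := 932;  // SHIFTJIS_CHARSET
    129: Result := 949;  // HANGEUL_CHARSET
    134: Result := 936;  // GB2312_CHARSET
    136: Result := 950;  // CHINESEBIG5_CHARSET
    161: Result := 1253; // GREEK_CHARSET
    162: Result := 1254; // TURKISH_CHARSET
    163: Result := 1258; // VIETNAMESE_CHARSET
    177: Result := 1255; // HEBREW_CHARSET
    178: Result := 1256; // ARABIC_CHARSET
    186: Result := 1257; // BALTIC_CHARSET
    204: Result := 1251; // RUSSIAN_CHARSET
    222: Result := 874;  // THAI_CHARSET
    238: Result := 1250; // EASTEUROPE_CHARSET
  else
    Result := 0; // assumption: fall back to CP_ACP for unlisted charsets
  end;
end;
```

With this, CharSetToCodePage(204) gives 1251, the Cyrillic codepage needed for the Russian runs in the question.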

To do the actual character translations you can use the SysUtils.TEncoding class or the System.SetCodePage procedure. Both internally use MultiByteToWideChar, which uses OS-provided lookup tables, so you don't need to maintain them.
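The TEncoding route could be sketched roughly like this. DecodeWithCodePage is a made-up name for illustration, and the sketch assumes SysUtils is in the uses clause and that the codepage is installed on the machine (GetEncoding raises an exception otherwise).

```delphi
// Hypothetical helper: decodes raw bytes in the given codepage to a
// Unicode string via TEncoding. Requires SysUtils.
function DecodeWithCodePage(const Raw: RawByteString; CodePage: Integer): string;
var
  Enc: TEncoding;
begin
  Result := '';
  if Length(Raw) = 0 then
    Exit; // nothing to decode

  // Encodings obtained from GetEncoding are owned by the caller
  // and must be freed, unlike the shared TEncoding.UTF8 etc.
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result := Enc.GetString(BytesOf(Raw));
  finally
    Enc.Free;
  end;
end;
```

Compared with SetCodePage, this copies the bytes into a TBytes array first, so it is a little heavier for many short runs, but it never changes the codepage tag on the source string.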

Using SetCodePage would look something like this:

var
  iStart, iStop: Integer;
  RTF, RawText: AnsiString;
  Text: string;
  CodePage: Word;
begin
  ...
  CodePage := CharSetToCodePage(CharSet); // Look up the codepage for the font charset
  RawText := Copy(RTF, iStart, iStop - iStart); // Raw bytes of the text run
  SetCodePage(RawText, CodePage, False); // Set string codepage to Russian without converting it
  Text := string(RawText); // Automatic conversion from string codepage to Unicode
end;
Craig Peterson
Thanks! The only thing I hadn't tried was setting the Convert parameter of SetCodePage to False, and that proved to be the key.
blue painted