I am trying to parse RTF (via MSEDIT) in various languages, all in Delphi 2010, in order to produce HTML in unicode.
Taking Russian/Cyrillic as my starting point I find that the overall document codepage is 1252 (Western) but the Russian parts of the text are identified by the charset of the font (RUSSIAN_CHARSET 204).
So far I am:
1) Use AnsiString (or RawByteString) when parsing the RTF
2) Determine the CodePage by a lookup from the font charset (see http://msdn.microsoft.com/en-us/library/cc194829.aspx)
3) Translating using a lookup table in my code: (This table generated from http://msdn.microsoft.com/en-gb/goglobal/cc305144.aspx) - I'm going to need one table per supported codepage!
There MUST be a better way than this? Preferably something supplied by the OS and so less brittle than tables of constants.