I am trying to parse some RTF that I get back from a server. For most of the text this works fine (a RichTextBox control does the job), but some of the RTF seems to contain an additional "encoding", and some of the characters get corrupted.

The original string is as follows (and contains some of the characters used in Polish):

ąćęłńóśźż

The RTF string that is sent back, with hex-encoded characters, looks like this:

{\lang1045\langfe1045\f16383 {\'b9\'e6\'ea\'b3{\f7 \'a8\'bd\'a8\'ae}\'9c\'9f\'bf}}

I am having problems decoding the ńó characters in the returned string. They seem to be represented by two hex values each, whereas the rest of the string is represented (as expected) by single hex values.

Using a RichTextBox control to "parse" the RTF results in corrupted text (the two characters in question are displayed as four different unwanted characters).

If I encoded the plain string to hex myself using the expected codepage (1250, Latin 2, the ANSI codepage for LCID 1045), I would get the following:

\'B9\'E6\'EA\'B3\'F1\'F3\'9C\'9F\'BF
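As a sanity check, the expected escapes can be reproduced in a few lines. This is only a sketch in Python (chosen for illustration; it is not what the server or the RichTextBox uses), and the variable names are made up:

```python
# Encode the Polish sample text with Windows-1250 and print it
# using RTF-style \'hh hex escapes.
text = "ąćęłńóśźż"
encoded = text.encode("cp1250")  # one byte per character in this codepage

rtf_escapes = "".join(f"\\'{b:02X}" for b in encoded)
print(rtf_escapes)
```

Every character maps to exactly one byte here, which is why the two-byte sequences for ńó in the server's output stand out.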

I am lost as to how I can correctly decode the {\f7 \'a8\'bd\'a8\'ae} part of the returned string, which should correspond to ńó.

Note that there is no font definition for \f7 in the RTF header. The string looks fine when viewed directly on the server, meaning that the characters (if they are corrupted) are corrupted somewhere in the conversion before sending.

I am not sure whether the problem is on the server side (I have no control over that), but since the server is used for a lot of translation work, I assume that the returned string is OK.

I have been going through the RTF specs but cannot find any hint regarding this type of combination of encodings.

+1  A: 

I don't know why it's happening, but the encoding appears to be GBK (or something sufficiently similar).

Perhaps the server tries to do some "clever" matching to find the characters, or the server's default character encoding is GBK or similar, and those characters (and only those) also occur in GBK, so it prefers that.

I found out by adding the offending hex codes (A8 BD A8 AE) as bytes into a simple HTML file, so I could go through my browser's encodings and see if anything matched:

<html><body>¨½¨®</body></html>

To my surprise, my browser came up with "ńó" straight away.
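The same check can be done without a browser. A minimal sketch in Python (my choice of language, nothing the question depends on), using its built-in `cp1250` and `gbk` codecs; the variable names are just for illustration:

```python
# The four bytes from the {\f7 ...} group in the question.
raw = bytes.fromhex("a8bda8ae")

# Read as Windows-1250 (one byte per character) this gives four characters,
# which matches the corruption described in the question.
as_cp1250 = raw.decode("cp1250")

# Read as GBK (two bytes per character) it gives the two expected letters.
as_gbk = raw.decode("gbk")

print(repr(as_cp1250), repr(as_gbk))
```

Note that this only shows the bytes are consistent with GBK; it does not explain why the server emits them that way.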

mercator
Thanks, I will try to insert an additional font definition in the returned RTF before "parsing" it, using either GBK or GB2312 for the charset. If that won't do the job, I will convert the problematic bytes manually using the suggested encoding. Let's hope that the \f7 is standard behavior.
barry
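For the manual conversion mentioned in the comment, a rough sketch could look like the following. It is written in Python for illustration only, and it assumes (without confirmation from the server side) that every `{\f7 ...}` group holds GBK bytes and that all other `\'hh` escapes are Windows-1250. The function and pattern names are hypothetical, and the regexes only handle the simple shape of the sample string, not RTF in general:

```python
import re

# A run of one or more \'hh escapes.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")
# A group of the form {\f7 \'hh\'hh...} as seen in the sample (assumed shape).
F7_GROUP = re.compile(r"\{\\f7 ((?:\\'[0-9a-fA-F]{2})+)\}")


def run_to_bytes(run: str) -> bytes:
    """Turn a run like \\'a8\\'bd into the raw bytes b'\\xa8\\xbd'."""
    return bytes.fromhex(run.replace("\\'", ""))


def decode_escapes(rtf: str, default_codec: str = "cp1250",
                   f7_codec: str = "gbk") -> str:
    """Replace hex escapes with text; \\f7 groups use f7_codec (an assumption)."""
    # First pass: decode and unwrap the {\f7 ...} groups.
    rtf = F7_GROUP.sub(lambda m: run_to_bytes(m.group(1)).decode(f7_codec), rtf)
    # Second pass: decode whatever escapes remain with the default codepage.
    return HEX_RUN.sub(lambda m: run_to_bytes(m.group(0)).decode(default_codec), rtf)


sample = r"{\lang1045\langfe1045\f16383 {\'b9\'e6\'ea\'b3{\f7 \'a8\'bd\'a8\'ae}\'9c\'9f\'bf}}"
decoded = decode_escapes(sample)
print(decoded)
```

The result still contains the RTF control words and now holds raw Unicode characters, so it is meant for extracting the text rather than for feeding back into an RTF reader as-is.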