+1  A: 

My guess is that changing font names in the RTF has made things worse. If a font specified in the RTF is not a Unicode font, the characters to be rendered in that font will be encoded as Shift-JIS, not as Unicode, and so will the other characters in the text. Treating the whole thing as Unicode, or appending Unicode text, will then cause the corruption you see. You need to establish whether the RTF you import is encoded as Shift-JIS or as Unicode, and also whether the machine you are running on (and therefore the D2009 default input format) is Japanese or not. In Japan, a text file with no Unicode BOM would usually be Shift-JIS (but not always).
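One way to establish what the imported RTF actually contains is to look at the raw bytes before any control touches them. The following is a minimal Python sketch, not a full RTF parser: the function name `sniff_rtf_encoding` and the returned keys are made up for illustration, and it only collects hints (a Unicode BOM, the `\ansicpg` code page, the `\fcharset` values in the font table, and whether `\uN` or `\'xx` escapes occur).

```python
import codecs
import re


def sniff_rtf_encoding(data: bytes) -> dict:
    """Collect encoding hints from raw RTF bytes (hypothetical helper)."""
    boms = [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    # First BOM that matches the start of the data, or None.
    bom = next((name for mark, name in boms if data.startswith(mark)), None)

    # \ansicpgN names the default ANSI code page, e.g. 932 for Shift-JIS.
    codepage = re.search(rb"\\ansicpg(\d+)", data)

    # \fcharsetN appears per font; 0 is ANSI and 128 is Shift-JIS.
    charsets = sorted({int(n) for n in re.findall(rb"\\fcharset(\d+)", data)})

    return {
        "bom": bom,
        "ansicpg": int(codepage.group(1)) if codepage else None,
        "fcharsets": charsets,
        # \uN (optionally negative) marks Unicode escapes; \uc1 is not matched.
        "has_unicode_escapes": re.search(rb"\\u-?\d+", data) is not None,
        # \'xx marks bytes in the code page of the current font.
        "has_hex_escapes": re.search(rb"\\'[0-9a-fA-F]{2}", data) is not None,
    }
```

If the result shows `\fcharset128` fonts and `\'xx` escapes but no `\uN` escapes, the text is most likely Shift-JIS bytes rather than Unicode, which matches the situation described above.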

frogb
Further investigation did show that changing the font is not a good idea. Specifically, changing the specified charset is a no-no, since \fcharset0 is ANSI and \fcharset128 is Shift-JIS. On the surface at least, it looks like switching between fonts with different charsets is what allows the text the user entered to be encoded correctly. Unfortunately, that still does not quite explain why the RTF control can't figure out the correct display.
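To illustrate why the charset matters, here is a small Python sketch that decodes a run of `\'xx` escapes under a given `\fcharset` value. The helper name `decode_hex_escapes` and the two-entry mapping are assumptions for illustration (0 mapped to Windows-1252, 128 mapped to code page 932); it handles only `\'xx` escapes, not bytes written literally in the stream.

```python
import re

# Assumed mapping from \fcharsetN values to Python codec names.
FCHARSET_CODECS = {0: "cp1252", 128: "cp932"}


def decode_hex_escapes(run: str, fcharset: int) -> str:
    """Decode the \\'xx escapes in `run` using the code page for `fcharset`."""
    raw = bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", run))
    return raw.decode(FCHARSET_CODECS[fcharset], errors="replace")


run = r"\'93\'fa\'96\'7b"
print(decode_hex_escapes(run, 128))  # decoded as Shift-JIS
print(decode_hex_escapes(run, 0))    # same bytes decoded as ANSI
```

The same four bytes come out as two kanji under charset 128 and as four unrelated Western characters under charset 0, so rewriting the charset of a font changes the meaning of every escape that uses it.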
Ryan Bates
+1  A: 

I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.

What I did was run the entire RTF text through a conversion routine that swapped every character above ASCII 127 for its Unicode code point escape. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file with no problem. Ironically, it re-encoded the \uc1\uxxx? sequences back to their RTF escaped equivalents.

Private Function ConvertRtfToUnicode(ByVal value As String) As String

    Dim ch As Char() = value.ToCharArray()
    Dim c As Char
    Dim sb As New System.Text.StringBuilder()
    Dim code As Integer

    For i As Integer = 0 To ch.Length - 1
        c = ch(i)
        'AscW returns the UTF-16 code unit of the character (0 to 65535)
        code = Microsoft.VisualBasic.AscW(c)
        If code <= 127 Then
            'Plain ASCII needs no replacement
            sb.Append(c)
        Else
            'MR: Basic idea came from here http://www.eggheadcafe.com/conversation.aspx?messageid=33935981&threadid=33935972
            '  swaps the character for its Unicode decimal code point equivalent
            'The RTF \uN keyword takes a signed 16-bit value, so wrap values above 32767
            If code > 32767 Then code -= 65536
            '\uc1 says one fallback character follows; "?" is that fallback
            sb.Append(String.Format("\uc1\u{0:d}?", code))
        End If
    Next

    Return sb.ToString()

End Function

Not sure if that will help your problem, but it's working for me.
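For anyone not on VB.NET, the same idea can be sketched in Python. This is an illustrative port, not a drop-in replacement: the name `escape_non_ascii_for_rtf` is made up, it works on UTF-16 code units so that characters outside the Basic Multilingual Plane become two escapes, and it writes values above 32767 as negative numbers because the RTF `\uN` keyword takes a signed 16-bit value.

```python
import struct


def escape_non_ascii_for_rtf(text: str) -> str:
    """Replace every non-ASCII character with an RTF \\uc1\\uN? escape."""
    out = []
    # Iterate over UTF-16 code units, which is what \uN expects.
    for (unit,) in struct.iter_unpack("<H", text.encode("utf-16-le")):
        if unit <= 127:
            # Plain ASCII passes through unchanged.
            out.append(chr(unit))
        else:
            # \uN is a signed 16-bit decimal, so wrap values above 32767.
            signed = unit - 65536 if unit > 32767 else unit
            out.append(f"\\uc1\\u{signed}?")
    return "".join(out)


print(escape_non_ascii_for_rtf("10 µL"))  # prints: 10 \uc1\u181?L
```

Like the function above, this should be applied to text that still contains the literal characters; it does nothing useful if the stream already holds only `\'xx` escapes.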

Rake36
Thanks for the sample code! I tried something similar initially, but it made no difference because the RTF character stream itself did not contain any Unicode. It is still an extremely useful function to keep around, though.
Ryan Bates
A: 

Great job, Rake, worth a bump! The code is elegant and simple, and it worked for me while I was struggling with a mix of Korean and European languages. It inspired me to write a VB.NET app that takes a "broken" RTF file, fixes it, and saves it.

sham