Hi,

I'm trying to output a Unicode string in RTF format (using C# and WinForms).

From Wikipedia:

If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.

I don't know how to convert a Unicode character into the decimal code point form that RTF expects ("\u1576"). Converting to UTF-8, UTF-16 and similar encodings is easy, but I don't know how to get the code point.

The scenario in which I use this:

  • I read an existing RTF file into a string (the file is a template)
  • I use string.Replace to swap #TOKEN# for MyUnicodeString (the template is populated with data)
  • I write the result to another RTF file.

The problem arises when the data contains Unicode characters.

A: 

You will have to convert the string to a byte[] array (using Encoding.Unicode.GetBytes(string)), then loop through that array and prepend a \ and u character to all Unicode characters you find. When you then convert the array back to a string, you'd have to leave the Unicode characters as numbers.

For example, if your array looks like this:

byte[] unicodeData = new byte[] { 0x15, 0x76 };

it would become:

// 5c = \, 75 = u
byte[] unicodeData = new byte[] { 0x5c, 0x75, 0x15, 0x76 };
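
A side note on the byte-array approach: Encoding.Unicode.GetBytes returns UTF-16 in little-endian byte order, while RTF expects the number after \u as decimal text rather than raw bytes. The following is only a minimal sketch of how the two relate, assuming every character is in the Basic Multilingual Plane and has a value below 32768. The names Utf16BytesDemo and BytesToRtfEscapes are made up for illustration.

```csharp
using System;
using System.Text;

static class Utf16BytesDemo
{
    // Hypothetical helper (not from the answer above): walks UTF-16LE bytes
    // two at a time and writes each non-ASCII code unit as \uN? in decimal.
    // Assumption: code units stay below 32768, so no sign handling is needed.
    public static string BytesToRtfEscapes(byte[] utf16le)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < utf16le.Length; i += 2)
        {
            // Little-endian: low byte first, high byte second.
            int codeUnit = utf16le[i] | (utf16le[i + 1] << 8);
            if (codeUnit <= 0x7f)
                sb.Append((char)codeUnit);
            else
                sb.Append("\\u").Append(codeUnit).Append('?');
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Arabic letter beh, U+0628 (decimal 1576), is stored as bytes 0x28, 0x06.
        byte[] data = Encoding.Unicode.GetBytes("a\u0628");
        Console.WriteLine(BitConverter.ToString(data));   // 61-00-28-06
        Console.WriteLine(BytesToRtfEscapes(data));       // a\u1576?
    }
}
```

So the decimal 1576 in \u1576? corresponds to the hex value 0x0628, which appears in the byte array as 0x28 followed by 0x06.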
Ian Kemp
Hi, thank you for the response. I've tried to implement your solution, but unfortunately it's not working. I think that's because there is a difference between a code point and the UTF-16 encoding (Encoding.Unicode). You are suggesting that I output bytes from the UTF-16 encoding where a code point is expected. (This works for many characters, but not all.)
Emir
This answer also seems to work; I probably had a bug in my code when I was testing it. Thank you for your answer and your time.
Emir
+3  A: 

Provided that all the characters that you're catering for exist in the Basic Multilingual Plane (it's unlikely that you'll need anything more), then a simple UTF-16 encoding should suffice.

Wikipedia:

All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.

The following sample program illustrates doing something along the lines of what you want:

static void Main(string[] args)
{
    // ë (U+00EB) as UTF-16 little-endian bytes
    char[] ca = Encoding.Unicode.GetChars(new byte[] { 0xeb, 0x00 });
    // The using block closes the file even if an exception is thrown.
    using (var sw = new StreamWriter(@"c:/helloworld.rtf"))
    {
        sw.WriteLine(@"{\rtf
{\fonttbl {\f0 Times New Roman;}}
\f0\fs60 H" + GetRtfUnicodeEscapedString(new String(ca)) + @"llo, World!
}");
    }
}

static string GetRtfUnicodeEscapedString(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        if (c <= 0x7f)
            sb.Append(c);
        else
            sb.Append("\\u" + Convert.ToUInt32(c) + "?");
    }
    return sb.ToString();
}

The important bit is Convert.ToUInt32(c), which returns the numeric value of the character's UTF-16 code unit; for characters in the Basic Multilingual Plane this is the same as the code point. The RTF escape for Unicode requires this value in decimal. The System.Text.Encoding.Unicode encoding corresponds to UTF-16, as per the MSDN documentation.
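
One detail worth noting: the Wikipedia passage quoted in the question describes the \u argument as a 16-bit signed integer, so code units above 32767 would be written as negative numbers, and a character outside the Basic Multilingual Plane would come out as two escapes, one per surrogate. Backslashes and braces in the data are RTF syntax, so they would need escaping too. A minimal sketch under those assumptions follows. RtfEscape and Escape are hypothetical names, not part of any library.

```csharp
using System;
using System.Text;

static class RtfEscape
{
    // Hypothetical helper: escapes a .NET string for insertion into RTF text.
    public static string Escape(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\').Append(c);      // characters with RTF meaning
            else if (c <= 0x7f)
                sb.Append(c);                   // plain ASCII passes through
            else
                // Reinterpret the UTF-16 code unit as a signed 16-bit value.
                sb.Append("\\u").Append(unchecked((short)c)).Append('?');
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Escape("H\u00ebllo"));   // H\u235?llo
        Console.WriteLine(Escape("\u0628"));       // \u1576?
        Console.WriteLine(Escape("\uFFFD"));       // \u-3?

        // Template scenario from the question: fill a token with escaped data.
        string template = @"\f0\fs60 #TOKEN#";
        Console.WriteLine(template.Replace("#TOKEN#", Escape("{\u00eb}")));
    }
}
```

Escaping the data before calling Replace keeps the template's own control words intact while making sure braces or backslashes in the data cannot break the RTF structure.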

Eric Smith
Hmm, very interesting point. If that's true, then there is probably a mistake somewhere in my logic, and Ian Kemp's answer makes much more sense. I'll keep googling.
Emir
Thank you for the example, it works!
Emir