ansaurus

Question

How do I get the characters for context-shaped input in a complex script?

Answer 1

A:

So how are you creating the "wrong" string? If you're just putting it in a string literal, then it's quite possible it's just the input method that's wrong. If you copy the "right" string after displaying it, and then paste that into a string literal, what happens? You might also want to check which encoding Visual Studio is using for your source files. If you're not putting the string into your source code as a literal, how are you creating it?

Given the possibility of confusion, I think I'd want to either keep these strings in a resource, or hard-code them using Unicode escapes:

string text = "\ufb64\ufea0\ufe91\ufeea";

(Then possibly put a comment afterwards showing the non-escaped value; at least then if it looks about right, it won't be too misleading. Admittedly it's then easy for the two to get out of sync...)
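To check which characters a string actually contains, whatever its source, it can help to dump its code points and compare the "right" and "wrong" strings side by side. The following is a minimal sketch; the class and method names (`CodePointDump`, `Describe`) are made up for illustration.

```csharp
using System.Linq;

public static class CodePointDump
{
    // Formats every UTF-16 code unit of the string as U+XXXX, separated by spaces,
    // so that two strings that look alike on screen can be compared exactly.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")).ToArray());
    }
}
```

Calling `CodePointDump.Describe(textBox.Text)` on the user's input and on the string copied from Character Map should show whether the input holds the base letters (U+06xx range) or the presentation forms (U+FBxx/U+FExx range).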

Jon Skeet 2009-07-23 05:23:36

The input string comes from user input and is not static. It is, for example, the title of a page or menu, so it cannot be hard-coded. You can even try it with a TextBox control and you will get the same result.

Mostafa 2009-07-23 05:28:48

Right, in that case it's a limitation of the input method. You *may* find that changing the font of the TextBox helps... I'm not sure. I'll see whether I've got enough fonts etc installed to check it.

Jon Skeet 2009-07-23 06:06:24

I think this happens because when you enter text using the keyboard, it enters the default character, which is the isolated form, but Windows converts it to the proper form when displaying it in the text box.

Mostafa 2009-07-23 06:11:36

I don't know... I was copying and pasting your string directly into a textbox, and it still gave the "wrong" string. Hmm... tricky.

Jon Skeet 2009-07-23 06:23:31

But if you go to the Character Map and use the exact characters, then copy them and paste them into the text box, it will return the exact values.

Mostafa 2009-07-23 06:28:48

Answer 2

A:

This is a bit of a wild guess, but does String.Normalize() help here? It is unclear to me whether that just covers character composition or if it includes positional forms as well.
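As far as the Unicode normalization forms go, the Arabic presentation forms carry compatibility decompositions back to their base letters, so the compatibility forms (FormKC/FormKD) map in the opposite direction from the one wanted here, and the canonical forms (FormC/FormD) leave both kinds of character alone. A small sketch to check this; the class and method names are hypothetical.

```csharp
using System.Text;

public static class NormalizeDemo
{
    // Compatibility normalization folds presentation forms (e.g. U+FE91,
    // BEH INITIAL FORM) back to the base letter (U+0628). It never goes
    // from a base letter to a positional form.
    public static string ToBaseLetters(string text)
    {
        return text.Normalize(NormalizationForm.FormKC);
    }

    // Canonical normalization does not touch presentation forms at all.
    public static string Canonical(string text)
    {
        return text.Normalize(NormalizationForm.FormC);
    }
}
```

So `Normalize()` can be useful for turning shaped text back into plain letters, but it does not pick positional forms for unshaped input.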

DocMax 2009-07-23 06:27:57

Actually, I have tried that one as well, but with no result T_T

Mostafa 2009-07-23 06:31:42

Answer 3

+2 A:

Windows uses Uniscribe to perform contextual shaping for complex scripts (which can apply to l-to-r as well as r-to-l languages). The text displayed in a text box is based on the glyph info produced after the characters have been fed into Uniscribe. Although the Unicode standard defines code points for each of the isolated, initial, medial, and final forms of a character, not all fonts necessarily support them, yet they may have pre-shaped glyphs or use a combination of glyphs. Uniscribe uses a shaping engine from the Windows language pack to determine which glyph(s) to use, based on the font's cmap. Here are some relevant links:

More Uniscribe Mysteries (explains difference between glyphs and characters)
Microsoft Bhasha, Glyph Processing: Uniscribe
MSDN: Complex Scripts Awareness
Buried in the bowels of Mozilla code is code that handles complex script rendering using Uniscribe. There's also additional code that scans the list of fonts in the system and reads the cmap tables of each font. (From the comments at http://blogs.msdn.com/michkap/archive/2005/12/06/500485.aspx).
Sorting it all Out: Did he say shaping? It's not in the script!

The TextRenderer.DrawText() method uses Uniscribe via the Win32 DrawTextExW() function, using the following P/Invoke:

[DllImport("user32.dll", CharSet=CharSet.Unicode, SetLastError=true)]
public static extern int DrawTextExW( HandleRef hDC
                                     ,string lpszString
                                     ,int nCount
                                     ,ref RECT lpRect
                                     ,int nFormat
                                     ,[In, Out] DRAWTEXTPARAMS lpDTParams);

[StructLayout(LayoutKind.Sequential)]
public struct RECT
 {
   public int left;
   public int top;
   public int right;
   public int bottom;
 }

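[StructLayout(LayoutKind.Sequential)]
public class DRAWTEXTPARAMS
{
  // Win32 expects cbSize as the first member, set to the size of the structure.
  public int cbSize = Marshal.SizeOf(typeof(DRAWTEXTPARAMS));
  public int iTabLength;
  public int iLeftMargin;
  public int iRightMargin;
  public int uiLengthDrawn;
}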

Mark Cidade 2009-07-23 07:08:02

Thanks for your answer. But my question is how I can convert the entered text to the shaped text and get the result as a char array or string.

Mostafa 2009-07-23 07:52:44

I added more information about Uniscribe and why it's not trivial to get the characters (code points) that are shown in the text box. It seems that your only options are to use Uniscribe and look up indexes in font cmaps, or to roll your own shaping engine.
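To give an idea of what rolling your own might involve, here is a deliberately naive sketch that maps a handful of Arabic base letters to their code points in the Arabic Presentation Forms-B block, based on whether the neighbouring letters join. It is an illustration under heavy assumptions, not what Uniscribe does: the class name `NaiveArabicShaper` is made up, the table covers only nine letters, and diacritics, lam-alef ligatures, ZWJ/ZWNJ, and other scripts are ignored. Output stays in logical order.

```csharp
using System.Collections.Generic;
using System.Text;

public static class NaiveArabicShaper
{
    // Base letter -> presentation forms in block order:
    // { isolated, final } for right-joining letters,
    // { isolated, final, initial, medial } for dual-joining letters.
    private static readonly Dictionary<char, char[]> Forms = new Dictionary<char, char[]>
    {
        { '\u0627', new[] { '\uFE8D', '\uFE8E' } },                     // ALEF
        { '\u0628', new[] { '\uFE8F', '\uFE90', '\uFE91', '\uFE92' } }, // BEH
        { '\u062A', new[] { '\uFE95', '\uFE96', '\uFE97', '\uFE98' } }, // TEH
        { '\u062C', new[] { '\uFE9D', '\uFE9E', '\uFE9F', '\uFEA0' } }, // JEEM
        { '\u062F', new[] { '\uFEA9', '\uFEAA' } },                     // DAL
        { '\u0644', new[] { '\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0' } }, // LAM
        { '\u0645', new[] { '\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4' } }, // MEEM
        { '\u0646', new[] { '\uFEE5', '\uFEE6', '\uFEE7', '\uFEE8' } }, // NOON
        { '\u0647', new[] { '\uFEE9', '\uFEEA', '\uFEEB', '\uFEEC' } }, // HEH
    };

    // A dual-joining letter can connect to the letter that follows it.
    private static bool IsDualJoining(char c)
    {
        char[] forms;
        return Forms.TryGetValue(c, out forms) && forms.Length == 4;
    }

    // Replaces each known base letter with its contextual presentation form.
    // Characters that are not in the table are copied through unchanged.
    public static string Shape(string text)
    {
        var result = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char[] forms;
            if (!Forms.TryGetValue(text[i], out forms))
            {
                result.Append(text[i]);
                continue;
            }

            // Joins backwards if the previous letter can connect forwards.
            bool joinsPrevious = i > 0 && IsDualJoining(text[i - 1]);
            // Joins forwards if this letter is dual-joining and the next one is a known letter.
            bool joinsNext = forms.Length == 4
                             && i + 1 < text.Length
                             && Forms.ContainsKey(text[i + 1]);

            // Index into { isolated, final, initial, medial }.
            int index = joinsPrevious ? (joinsNext ? 3 : 1) : (joinsNext ? 2 : 0);
            result.Append(forms[index]);
        }
        return result.ToString();
    }
}
```

For example, `NaiveArabicShaper.Shape("\u0628\u0627\u0628")` gives BEH initial, ALEF final, BEH isolated. A real implementation would need the full joining-type data from the Unicode Character Database, and the result still would not necessarily match the glyphs a given font ends up displaying.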

Mark Cidade 2009-07-23 08:55:21
