Hello, everyone. I have a rather perverse question. Please forgive me :)

There's an official algorithm that describes how bidirectional Unicode text should be presented: http://www.unicode.org/reports/tr9/tr9-15.html

I receive a string (from some 3rd-party source) which contains Latin and Hebrew characters, as well as digits, whitespace, punctuation symbols, etc.

The problem is that the string I receive is already in display (visual) order. That is, the sequence of characters I receive should simply be drawn from left to right.

Now, my goal is to find the Unicode string whose rendering is exactly the same. That is, I need to pass that string to another entity, which will render it according to the official algorithm, and the result should look the same as the string I received.

Assuming the following:

  • The default text direction (of the rendering entity) is RTL.
  • I don't want to inject "special unicode characters" that explicitly override the text direction (such as RLO, RLE, etc.)
  • I suspect several solutions may exist. If so, I'd like to preserve the RTL look of the string as much as possible. The string usually consists mostly of Hebrew words. I'd like to preserve the correct order of those words and of the characters inside them, whereas other character sequences may (and should) be transposed.

One naive way to solve this is to reverse the whole string (this takes care of the Hebrew words), and then reverse each sequence of non-Hebrew characters inside it. However, this doesn't always produce correct results, because the actual rendering rules are rather complex.
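
For illustration, here is a minimal Python sketch of that naive approach. It assumes "Hebrew" can be approximated by the strong right-to-left bidi classes R and AL as reported by the standard `unicodedata` module; the function names are made up for this example. It only shows the two reversals and makes no claim that the output renders correctly in every case.

```python
import unicodedata
from itertools import groupby


def is_rtl(ch: str) -> bool:
    """True for strong right-to-left characters (Hebrew, Arabic letters)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def naive_visual_to_logical(visual: str) -> str:
    """Reverse the whole string, then flip each non-RTL run back."""
    reversed_all = visual[::-1]
    pieces = []
    for rtl, run in groupby(reversed_all, key=is_rtl):
        text = "".join(run)
        # RTL runs stay reversed; every other run is restored to its
        # original left-to-right character order.
        pieces.append(text if rtl else text[::-1])
    return "".join(pieces)


# Visual order "ABC eng 123 456 DEF", with Hebrew letters in place of ABC / DEF.
visual = "\u05d0\u05d1\u05d2 eng 123 456 \u05d3\u05d4\u05d5"
logical = naive_visual_to_logical(visual)
```

Note that digits, spaces and punctuation are all lumped into the same "non-Hebrew" runs here, which is exactly where the real rules get more subtle.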

The only comprehensive algorithm I see so far is a brute-force check. The string can be divided into sequences of same-class characters. Those sequences may be joined in any order, and any of them may be reversed. I can check all those combinations to find the correct result. This technique can also be optimized. For instance, the order of the Hebrew words is known, so we only have to check the different combinations of the sequences that join them.
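
To make the size of that search concrete, here is a rough Python sketch of the first step only: splitting the string into runs of the same bidi class and counting the unoptimized candidates. It assumes "class" means the per-character bidi class from the standard `unicodedata` module; `split_runs` and `candidate_count` are hypothetical helper names, and the actual check against a bidi renderer is not shown.

```python
import unicodedata
from itertools import groupby
from math import factorial


def split_runs(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing one bidi class."""
    return [
        (bidi_class, "".join(chars))
        for bidi_class, chars in groupby(text, key=unicodedata.bidirectional)
    ]


def candidate_count(run_count: int) -> int:
    """Orderings of the runs times the reversed-or-not choice for each run."""
    return factorial(run_count) * 2 ** run_count


runs = split_runs("\u05d0\u05d1 ab 12")
total = candidate_count(len(runs))
```

Even for the five runs above that is already 3840 candidates, so fixing the order of the Hebrew words up front matters a lot.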

Any better ideas? If you have an idea, not necessarily the whole solution - it's ok. I'll appreciate any idea. Thanks in advance.

A: 

If you want to check the directionality of a character, you have to use the UCD (Unicode Character Database), which is provided by Unicode.org and includes a lot of information about each character. One of the attributes in that database gives the bidirectional class of a character.

So you have to download the UCD, then write a class that looks up your character in the XML and returns the answer.
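
As a side note, in some languages the lookup is already built in, so no XML parsing is needed. For example, a small Python sketch using the standard `unicodedata` module (the helper name `bidi_class` is just for illustration; the data comes from whatever UCD version ships with the interpreter):

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the bidirectional class of a single character, e.g. 'L' or 'R'."""
    return unicodedata.bidirectional(ch)


# A few representative characters and their bidi classes.
samples = {
    "a": bidi_class("a"),            # Latin letter
    "\u05d0": bidi_class("\u05d0"),  # Hebrew letter alef
    "\u0627": bidi_class("\u0627"),  # Arabic letter alef
    "1": bidi_class("1"),            # European digit
    " ": bidi_class(" "),            # space
}
```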

I did this in an open-source C# application, and you can find it here: http://Unicode.Codeplex.com

Please let me know whether this resolves your issue.

Nasser Hadjloo
A: 

Nasser, thanks for the answer. Unfortunately it doesn't fully resolve my problem.

I can already find out the directionality of every character. What I still don't see is how to compute the whole string so that its rendering matches what I need.

Imagine you want the following text to be displayed from left to right, where Hebrew/Arabic characters are denoted by capital letters:

ABC eng 123 456 DEF

The correct string would be like this: FED 456 123 eng CBA or also: FED eng 456 123 CBA

Or, using explicit direction override codes, it can be written like this: FED eng 123 456 CBA

Currently I solve this problem by injecting explicit directional override codes into the string: I isolate the sequences of Hebrew/Arabic words, and for all the LTR/weak/neutral characters that join them I explicitly override the direction to LTR.
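
Roughly, the workaround looks like the following Python sketch. This is a simplified illustration, not my actual code: it assumes an RTL base direction, treats only bidi classes R and AL as Hebrew/Arabic, and wraps every other run in LRO (U+202D) ... PDF (U+202C). The function names are made up for the example.

```python
import unicodedata
from itertools import groupby

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def is_rtl(ch: str) -> bool:
    """True for strong right-to-left characters (Hebrew, Arabic letters)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def visual_to_logical_with_overrides(visual: str) -> str:
    """Build a logical-order string for an RTL paragraph from visual order."""
    runs = [(rtl, "".join(chars)) for rtl, chars in groupby(visual, key=is_rtl)]
    pieces = []
    # With an RTL base direction the rightmost visual run comes first logically.
    for rtl, text in reversed(runs):
        if rtl:
            # Hebrew/Arabic run: store its characters in logical (reversed) order.
            pieces.append(text[::-1])
        else:
            # Everything else keeps its visual order, forced to LTR.
            pieces.append(LRO + text + PDF)
    return "".join(pieces)


# Visual order "ABC eng 123 456 DEF", with Hebrew letters in place of ABC / DEF.
visual = "\u05d0\u05d1\u05d2 eng 123 456 \u05d3\u05d4\u05d5"
logical = visual_to_logical_with_overrides(visual)
```

Combining marks and other characters that are not classified as R or AL would end up inside the overridden runs, so a real implementation needs more care there.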

However I'd like to do this without injecting explicit override codes.

valdo