I was trying to convert a file from UTF-8 to the Arabic-1265 encoding using the Encoding APIs in C#, but I ran into a strange problem: some characters are not converted correctly. For example, the "لا" in the statement "ﻣﺣﻣد ﺻﻼ ح عادل" comes out as "ﻣﺣﻣد ﺻ? ح عادل". Some of my friends told me that this is because these characters are from Arabic Presentation Forms-B. I created the file using Notepad++ and saved it as UTF-8.

Here is the code I use:

    StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
    string str = sr.ReadLine();
    StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
    sw.Write(str);
    sw.Flush();
    sw.Close();

But I don't know how to convert the file correctly in C# when it contains these presentation forms.

A: 

First of all, the two characters you quoted are not from the Arabic Presentation Forms block. They are \x0644 and \x0627, which are from the standard Arabic block. However, just to be sure I tried the character \xFEFB, which is the “equivalent” (not equivalent, but you know) character for لا from the Presentation Forms block, and it works fine even for that.

Secondly, I will assume you mean the encoding Windows-1256, which is for legacy 8-bit Arabic text.

So I tried the following:

    var input = "لا";
    var encoding = Encoding.GetEncoding("windows-1256");
    var result = encoding.GetBytes(input);
    Console.WriteLine(string.Join(", ", result));

The output I get is 225, 199. So let’s try to turn it back:

    var bytes = new byte[] { 225, 199 };
    var result2 = encoding.GetString(bytes);
    Console.WriteLine(result2);

Fair enough, the Console does not display the result correctly — but the Watch window in the debugger tells me that the answer is correct (it says “لا”). I can also copy the output from the Console and it is correct in the clipboard.

Therefore, the Windows-1256 encoding is working just fine and it is not clear what your problem is.

My recommendation:

  • Write a short piece of code that exhibits the problem.

  • Post a new question with that piece of code.

  • In that question, describe exactly what result you get, and what result you expected instead.

Timwi
Thank you very much for your fast response. The problem happens only when I convert a UTF-8 file that contains Arabic characters to a Windows-1256 file. For example, the file content is ﻣﺣﻣد ﺻﻼ ح عادل; the file was created using Notepad++ and saved as UTF-8.
Maged
Here is the code snippet I used in the conversion:
Maged
Encoding unicode = Encoding.UTF8; StreamReader sr = new StreamReader(@"C:\Users\mfarag\Desktop\ConsoleTest\ConsoleTest\nn.txt", unicode); string str = sr.ReadLine(); StreamWriter st = new StreamWriter(@"C:\Users\mfarag\Desktop\ConsoleTest\ConsoleTest\windows-1256.txt", false, Encoding.GetEncoding("windows-1256")); st.Write(str); st.Flush(); st.Close();
Maged
@Maged: Right. Your text contains the character “ﻼ”, which is indeed from the Presentation Forms block. You cannot encode this in Windows-1256 because it has no equivalent there. Therefore, the encoder (unsurprisingly) turns it into a question mark. (All the other characters seem to work fine.)
Timwi
A: 

To give a more general answer:

  • The Windows-1256 encoding is an obsolete 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.

  • Unicode has a much wider range of characters. In particular, it contains:

    • the “normal” Arabic characters, U+0600 to U+06FF. These are supposed to be used for normal Arabic text, including text written in other languages that use the Arabic script, such as Farsi. For example, “لا” is U+0644 (ل) followed by U+0627 (ا).

    • the “Presentation Form” characters, U+FB50 to U+FDFF (“Presentation Forms-A”) and U+FE70 to U+FEFF (“Presentation Forms-B”). These are not intended to be used for representing Arabic text. They are primarily intended for compatibility, especially with font-file formats that require separate code points for every different ligated form of every character and ligated character combination. The “لا” ligature is represented by a single codepoint (U+FEFB) despite being two characters.

  • When encoding into Windows-1256, the .NET encoding for Windows-1256 will automatically convert characters from the Presentation Forms block to “normal text” because it has no other choice (except of course to turn it all into question marks). For obvious reasons, it can only do that with characters that actually have an “equivalent”.

  • When decoding from Windows-1256, the .NET encoding for Windows-1256 will always generate characters from the “normal text” block.

As we’ve discovered, your input file contains characters that are not representable in Windows-1256. Such characters will turn into question marks (?). Furthermore, those Presentation-Form characters that do have a normal-text equivalent will change their ligation behaviour, because that is what normal Arabic text does.
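
To check whether a given input contains such characters, the two block ranges listed above can be tested directly. This is only a minimal sketch: the class and method names are hypothetical, and it assumes the text has already been read into a string.

```csharp
using System;
using System.Collections.Generic;

static class PresentationFormScan
{
    // True if c lies in Arabic Presentation Forms-A (U+FB50..U+FDFF)
    // or Arabic Presentation Forms-B (U+FE70..U+FEFF).
    // Note: the second range also covers U+FEFF (the byte order mark).
    public static bool IsArabicPresentationForm(char c)
    {
        return (c >= '\uFB50' && c <= '\uFDFF') || (c >= '\uFE70' && c <= '\uFEFF');
    }

    // Returns one description per presentation-form character found in text.
    public static List<string> Find(string text)
    {
        var hits = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            if (IsArabicPresentationForm(text[i]))
                hits.Add(string.Format("index {0}: U+{1:X4}", i, (int)text[i]));
        }
        return hits;
    }

    static void Main()
    {
        // U+0635 (standard Arabic block) followed by U+FEFC (a lam-alef ligature
        // from Presentation Forms-B).
        foreach (string hit in Find("\u0635\uFEFC"))
            Console.WriteLine(hit); // prints "index 1: U+FEFC"
    }
}
```

An empty result means the text uses only code points outside the two presentation-form ranges; it does not by itself guarantee that every character is representable in Windows-1256.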

Timwi
A: 

Yes, your string contains lots of ligatures that cannot be represented in the 1256 code page. You'll have to decompose the string before writing it. Like this:

  str = str.Normalize(NormalizationForm.FormKD);
  st.Write(str);
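
To see what the decomposition does to a single ligature, here is a small sketch (the class and method names are hypothetical) that prints the code points produced for U+FEFC, the lam-alef ligature from the question:

```csharp
using System;
using System.Text;

static class DecomposeDemo
{
    // Applies compatibility decomposition (NFKD), as in the snippet above.
    public static string Decompose(string text)
    {
        return text.Normalize(NormalizationForm.FormKD);
    }

    static void Main()
    {
        string decomposed = Decompose("\uFEFC");
        foreach (char c in decomposed)
            Console.Write("U+{0:X4} ", (int)c); // prints "U+0644 U+0627 "
        Console.WriteLine();
    }
}
```

One caveat: FormKD also splits precomposed letters such as U+0622 (alef with madda) into a base letter plus a combining mark, and that mark may not exist in the target code page. If that turns out to matter for your data, NormalizationForm.FormKC, which recomposes after decomposing, may be worth trying instead.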
Hans Passant
Thank you very much. This is the nearest solution to the problem.
Maged
Yup. Please don't forget to close your thread by marking the answer. Check mark next to my post.
Hans Passant
This code converted the string characters but also added some extra ? characters for example the string "; ﻧﺎﻧﻳس​ﻣﺣﻣد ﺻﻼ ح اﻟدﻳن 1000900;" will be converted to "; نانيس?محمد صلا ح الدين 1000900;"
Maged
Can you tell me how I can convert this line without the extra characters?
Maged
There's an unusual Unicode character in that string, \u200b, "Zero width space". You'd have to get rid of it with str = str.Replace("\u200b", ""); You can use string.ToCharArray() to find these kinds of troublemakers by yourself.
Hans Passant
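
A more general variant of the Replace call is to drop every character in the Unicode "Format" category, which is where U+200B lives. The sketch below is one possible way to do that; the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

static class InvisibleCharacters
{
    // Removes every character whose Unicode category is Format (Cf),
    // which includes U+200B (zero width space).
    public static string StripFormatCharacters(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format)
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Prints each character's code point and category, to spot troublemakers.
    public static void Dump(string text)
    {
        foreach (char c in text.ToCharArray())
            Console.WriteLine("U+{0:X4} {1}", (int)c, CharUnicodeInfo.GetUnicodeCategory(c));
    }

    static void Main()
    {
        string cleaned = StripFormatCharacters("\u0633\u200B\u0645");
        Console.WriteLine(cleaned.Length); // prints 2
        Dump(cleaned);
    }
}
```

Be aware that the Format category also contains U+200C (zero width non-joiner) and U+200D (zero width joiner), which can be meaningful in Arabic-script text. If the input may rely on them, removing only U+200B as shown in the comment above is the safer choice.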
Thanks for pointing me to the right direction
Maged
Is there a general way to get rid of all the ? symbols from the string?
Maged
There comes a point where you may have to conclude that converting the text to a lossy code page like 1256 just isn't a good approach, since the result isn't a reasonably verbatim copy of the source anymore. This loss of information will screw you up somewhere else. If you want to make this maintainable, then you have to set some rules for the input text you accept, and reject it if it doesn't follow those rules, making it somebody else's problem.
Hans Passant
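
One way to enforce such a rule is to ask the encoder to throw instead of silently writing "?". The following is a rough sketch, not a definitive implementation: the class and method names are hypothetical, and the RegisterProvider line assumes .NET Core or .NET 5+ (on the classic .NET Framework it should not be needed and can be removed).

```csharp
using System;
using System.Text;

static class StrictConversion
{
    static Encoding CreateStrict1256()
    {
        // Assumption: on .NET Core / .NET 5+ the legacy code pages must be
        // registered before GetEncoding("windows-1256") can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The exception fallbacks replace the default behaviour of
        // substituting '?' for characters that cannot be encoded.
        return Encoding.GetEncoding("windows-1256",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // Returns true and the encoded bytes if every character is representable;
    // otherwise reports the first offending character and returns false.
    public static bool TryEncode(string text, out byte[] bytes)
    {
        try
        {
            bytes = CreateStrict1256().GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException ex)
        {
            Console.WriteLine("Cannot encode U+{0:X4} at index {1}",
                (int)ex.CharUnknown, ex.Index);
            bytes = null;
            return false;
        }
    }

    static void Main()
    {
        byte[] bytes;
        if (TryEncode("\u0644\u0627", out bytes))
            Console.WriteLine(string.Join(", ", bytes));

        // A presentation-form ligature; a false result means the input is rejected.
        Console.WriteLine(TryEncode("\uFEFC", out bytes));
    }
}
```

With this in place, input that would otherwise lose information is rejected up front, and the reported index tells you which character to normalize, strip, or send back to whoever produced the file.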
I'm afraid I have to convert UTF-8 to the Windows-1256 code page, or even ISO 8859-6, because it's the only encoding that can be transferred to the machine through some old software interface. So is there a way to do this conversion, or is it impossible?
Maged