Hi,

The application I'm developing with EVC++ 4 runs on Windows CE 5 and should support Unicode (AFAIK wchar_t uses UTF-16 on Windows, so I'm using that), so I want to be able to test it with "more exotic" characters, especially characters that take 4 bytes in UTF-16 rather than just 2. Therefore I'm trying to display such characters in a text editor (at the moment on my desktop PC with Windows XP, not on the embedded device).
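For reference, a character takes 4 bytes in UTF-16 when its code point lies above U+FFFF; it is then stored as a high and a low surrogate code unit. The following is a minimal sketch of that split. The function name `to_surrogates` is made up for illustration, and the input is assumed to be a valid code point in the range U+10000..U+10FFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a supplementary code point
// (assumed to be in U+10000..U+10FFFF) into its UTF-16
// high and low surrogate code units.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    const std::uint32_t v = cp - 0x10000;  // 20-bit offset into the supplementary range
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));   // top 10 bits
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // bottom 10 bits
    return std::make_pair(high, low);
}
```

With this sketch, `to_surrogates(0x10800)` yields the code units D802 and DC00, which are the two values used in the test file below.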

But I haven't managed to do so yet. As an example I've chosen this character. As mentioned here, "MPH 2B Damase" should support this character. So I downloaded the font and put it into Windows\Fonts. I created a text file using a hex editor (just to be sure) with the following content:

FFFE D802 DC00

When I open it with Notepad (which should be Unicode-capable, right?) and use the downloaded font, it doesn't display one character, as intended, but these two:

˘Ü

What am I doing wrong? :)

Thanks!

hrniels

Edit: Flipping the BOM, as suggested, doesn't work. Notepad (and all other editors I tried, too) displays two squares in this case. Interestingly, if I copy the two squares here (with Firefox), I see the right character.


I've also tried it with Komodo Edit with the same result.

Using UTF-8 doesn't help Notepad either.

A: 

Your text editor might not like UTF-16. It probably assumes ANSI or UTF-8.

Try typing in the UTF-8 equivalent instead:

0xF0 0x90 0xA0 0x80

This won't help your testing, but will make sure your font isn't at fault. A text editor that does support UTF-16 is Komodo Edit.
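The UTF-8 bytes above can be derived from the code point rather than typed by hand. This is a minimal sketch, not a validating encoder: `encode_utf8` is a hypothetical name, and the function assumes a valid code point (no surrogate values, nothing above U+10FFFF).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-8 bytes.
// Assumes cp is a valid Unicode scalar value; no error checking is done.
std::vector<std::uint8_t> encode_utf8(std::uint32_t cp) {
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For U+10800, the code point behind the surrogate pair D802 DC00, this produces the four bytes F0 90 A0 80.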

Skurmedel
The bytes FF FE are the byte-order mark, which indicates the use of UTF-16 (little endian). Notepad should be able to detect this.
Cory Walker
I am well aware of that. But he doesn't say if he is using Notepad or not. There are many text editors which don't handle UTF-16.
Skurmedel
Furthermore, not all editors can handle BOMs either.
Skurmedel
I've used Notepad for the test. But I'll try that tomorrow, too.
hrniels
I've tried it with Komodo Edit, but without success. Komodo displays the same as Notepad :/
hrniels
+1  A: 

What happens if you put the byte order mark the other way around?

FEFF D802 DC00

(At the moment the byte sequence is being interpreted as the two characters U+02D8 U+00DC, so hopefully flipping the BOM will cause the bytes to be read in the intended order.)
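To see why the BOM matters, here is a rough sketch of how a reader might decode the file's bytes: the first two bytes pick the byte order, each following byte pair becomes one 16-bit code unit, and a high/low surrogate pair is combined into one code point. The name `decode_utf16_with_bom` is hypothetical, and the sketch simply passes unpaired surrogates through instead of reporting an error.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode a byte buffer that starts with a UTF-16 BOM.
// Bytes FF FE select little endian, FE FF select big endian.
// Returns an empty vector if no BOM is present.
std::vector<std::uint32_t> decode_utf16_with_bom(const std::vector<std::uint8_t>& b) {
    std::vector<std::uint32_t> cps;
    if (b.size() < 2) return cps;
    bool little;
    if (b[0] == 0xFF && b[1] == 0xFE) little = true;
    else if (b[0] == 0xFE && b[1] == 0xFF) little = false;
    else return cps;

    // Assemble 16-bit code units in the byte order announced by the BOM.
    std::vector<std::uint16_t> units;
    for (std::size_t i = 2; i + 1 < b.size(); i += 2) {
        const unsigned lo = little ? b[i] : b[i + 1];
        const unsigned hi = little ? b[i + 1] : b[i];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }

    // Combine surrogate pairs into single code points.
    for (std::size_t i = 0; i < units.size(); ++i) {
        const std::uint32_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < units.size() &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            cps.push_back(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00));
            ++i;  // the low surrogate has been consumed
        } else {
            cps.push_back(u);  // BMP character, or an unpaired surrogate passed through
        }
    }
    return cps;
}
```

Fed the original bytes FF FE D8 02 DC 00, this gives the two code points U+02D8 and U+00DC. With the BOM flipped (FE FF D8 02 DC 00), or with the bytes of each unit swapped instead (FF FE 02 D8 00 DC), it gives the single code point U+10800.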

d__
+1. Seems like a solution.
Skurmedel
Ah, maybe you're right. I'll try that tomorrow and report here :)
hrniels
Unfortunately it doesn't work. If I change the BOM, Notepad (and all other editors I tried, too) displays two squares. Interestingly, if I copy the two squares here (with Firefox), I see the right character.
hrniels
If it was one square, I'd have guessed that Notepad didn't have access to the font, but the fact that it displays two squares is a bad sign. It is interesting that the characters are preserved through cut/paste.
d__
+1  A: 

You probably forgot to read the _wfopen() documentation, which describes the encoding parameter. BTW, I assumed you are already using Unicode (wchars).

I would recommend using UTF-8 in files, with or without a BOM, and forcing your fopen call to use the UTF-8 flag. It looks like _wfopen(L"newfile.txt", L"r, ccs=UTF-8"); will work with UTF-8 with or without a BOM and also with UTF-16. Do not make the mistake of using ccs=Unicode; it is common to have UTF-8 files without a BOM.

You should really read a little bit about Unicode before trying to work with it. Think of this as a very good investment: it will save you time if you understand how Unicode works.

Here is a start: http://blog.i18n.ro/newbie-guide-to-unicode/ and do not forget to read the links at the end of the article.

If you really need a simple text editor that allows you to play with Unicode encodings, use Notepad++ and forget about Notepad.

Sorin Sbarnea