ansaurus

Question

Displaying Unicode characters above U+FFFF on Windows

Answer 1

A:

Your text editor might not like UTF-16. It probably assumes ANSI or UTF-8.

Try typing in the UTF-8 equivalent instead:

0xF0 0x90 0xA0 0x80

This won't help your testing, but it will show whether your font is at fault. A text editor that does support UTF-16 is Komodo Edit.
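For anyone who wants to check those four bytes, here is a minimal sketch of a UTF-8 encoder. It assumes the character in question is U+10800, which is what the surrogate pair D802 DC00 quoted elsewhere in this thread decodes to. The function name utf8_encode is made up for illustration and is not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns the number of bytes written,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x10800, buf) fills buf with F0 90 A0 80, matching the bytes given above.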

Skurmedel 2009-04-23 15:27:19

FF FE is the byte-order mark (U+FEFF stored in little-endian order), which indicates UTF-16 little endian. Notepad should be able to detect this.

Cory Walker 2009-04-23 15:49:19

I am well aware of that. But he doesn't say if he is using Notepad or not. There are many text editors which don't handle UTF-16.

Skurmedel 2009-04-23 15:50:39

Furthermore, not all editors can handle BOMs either.

Skurmedel 2009-04-23 15:57:28

I used Notepad for the test. But I'll try that tomorrow, too.

hrniels 2009-04-23 18:47:36

I've tried it with Komodo Edit, but without success. Komodo displays the same as notepad :/

hrniels 2009-04-24 07:10:47

Answer 2

+1 A:

What happens if you put the byte order mark the other way around?

FEFF D802 DC00

(At the moment the byte sequence is being interpreted as the two characters U+02D8 U+00DC, so hopefully flipping the BOM will cause the bytes to be read in the intended order)
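To see why the byte order matters here, a small sketch in C. It assumes the file body is the four bytes D8 02 DC 00 in that order. The helper names read_u16 and combine_surrogates are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one 16-bit code unit from two bytes,
   in big-endian order if big_endian is nonzero, else little-endian. */
uint16_t read_u16(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Hypothetical helper: combine a UTF-16 surrogate pair into a code point.
   Returns 0 if hi/lo are not a valid high/low surrogate pair. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) |
                       (uint32_t)(lo - 0xDC00));
}
```

Read as little endian, the bytes give the units 0x02D8 and 0x00DC, which are two ordinary BMP characters and not a surrogate pair. Read as big endian, they give 0xD802 and 0xDC00, which combine to U+10800.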

d__ 2009-04-23 15:37:50

+1. Seems like a solution.

Skurmedel 2009-04-23 15:58:31

Ah, maybe you're right. I'll try that tomorrow and report here :)

hrniels 2009-04-23 18:45:21

Unfortunately it doesn't work. If I change the BOM, Notepad (and all the other editors I tried, too) displays two squares. Interestingly, if I copy the two squares here (with Firefox), I see the right character.

hrniels 2009-04-24 07:09:25

If it was one square, I'd have guessed that Notepad didn't have access to the font, but the fact that it displays two squares is a bad sign. It is interesting that the characters are preserved through cut/paste.

d__ 2009-04-24 15:33:56

Answer 3

+1 A:

You probably forgot to read the _wfopen() documentation, which describes the encoding parameter. BTW, I assume you are already using Unicode (wchars).

I would recommend using UTF-8 in files, with or without a BOM, and forcing the UTF-8 flag in your fopen call. It looks like _wfopen(L"newfile.txt", L"r, ccs=UTF-8"); will work with UTF-8 with or without a BOM, and also with UTF-16 (note that _wfopen takes wide-character strings for both arguments). Do not make the mistake of using ccs=UNICODE: UTF-8 files without a BOM are common.
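As far as I understand the documentation, a BOM found at the start of the file takes precedence over the ccs flag, so the flag only decides what happens when no BOM is present. The following is a portable sketch of that kind of BOM sniffing, not the CRT's actual implementation. The enum and the function detect_bom are names made up for this example.

```c
#include <stddef.h>

/* Hypothetical result type for the sketch below. */
enum text_encoding { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE };

/* Hypothetical helper: report which byte-order mark, if any, a buffer
   starts with. UTF-32 BOMs are deliberately ignored in this sketch, so
   FF FE 00 00 is reported as UTF-16LE. */
enum text_encoding detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;      /* EF BB BF: U+FEFF encoded as UTF-8 */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;   /* FF FE: U+FEFF in little-endian order */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16BE;   /* FE FF: U+FEFF in big-endian order */
    return ENC_UNKNOWN;       /* no BOM: the caller has to pick a default */
}
```

A caller would fall back to UTF-8 on ENC_UNKNOWN, which mirrors the advice above about BOM-less UTF-8 files.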

You should really read a little bit about Unicode before trying to work with it. Think of this as a very good investment: understanding how Unicode works will save you time.

Here is a start: http://blog.i18n.ro/newbie-guide-to-unicode/ and do not forget to read the links at the end of the article.

If you really need a simple text editor that allows you to play with Unicode encodings, use Notepad++ and forget about Notepad.

Sorin Sbarnea 2010-08-12 10:02:42
