While editing a file encoded as UTF-8 without a BOM, the content may end up containing no characters outside the ASCII range. The next time the file is opened, some text editors (Notepad++, for example) will interpret it as ASCII/ANSI and open it as such. Unaware of the change, the user continues editing and adds non-ANSI Unicode characters, which are then lost because the file is saved as ANSI. An editor may offer a menu option (as Notepad++ does) to open ANSI files as UTF-8 without BOM, but that leads to the reverse problem of inadvertently overwriting ANSI files with UTF-8 encoding.
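The failure described here can be reproduced in a few lines of Python. This is only a sketch: it assumes "ANSI" means the Windows-1252 code page (`cp1252`), and it imitates the editor's guess with a plain `decode` call instead of any real editor's detection logic.

```python
ANSI = "cp1252"  # assumption: "ANSI" stands for the Windows-1252 code page

original = "Hello"
on_disk = original.encode("utf-8")   # only ASCII is left, so nothing marks the file as UTF-8
reopened = on_disk.decode(ANSI)      # the editor guesses ANSI; the text looks identical
edited = reopened + " \u05d0"        # the user adds U+05D0 HEBREW LETTER ALEF

# The ANSI code page cannot represent the new character, so it is replaced.
saved = edited.encode(ANSI, errors="replace")
print(saved)  # b'Hello ?'
```

The reopened text is indistinguishable from the original, which is why the user has no warning before the save destroys the added character.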

+1  A: 

One workaround is to add a character outside the ANSI range to a comment in the file. Depending on the decoding algorithm, it might force the editor (Notepad++) to recognize the file as encoded in UTF-8 w/o BOM.

In an HTML document, for example, you could follow the charset declaration in the header with such a comment, here containing U+05D0 HEBREW LETTER ALEF: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <!-- א -->
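Why the comment helps can be sketched with a simple classifier. The function `classify_bytes` is hypothetical and only approximates what an editor's heuristic might do; Notepad++'s actual detection may differ.

```python
def classify_bytes(data: bytes) -> str:
    """Guess an encoding class from raw bytes (hypothetical heuristic)."""
    try:
        data.decode("ascii")
        return "ascii"  # ambiguous: equally valid as UTF-8 or as ANSI
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"  # non-ASCII bytes that form valid UTF-8 sequences
    except UnicodeDecodeError:
        return "other"  # e.g. a legacy 8-bit code page


meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
plain = meta.encode("utf-8")
marked = (meta + " <!-- \u05d0 -->").encode("utf-8")

print(classify_bytes(plain))   # ascii
print(classify_bytes(marked))  # utf-8
```

Without the comment the bytes carry no evidence of UTF-8, so the editor is free to fall back to its default; with the comment, the two-byte sequence for U+05D0 gives it something to detect.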

Vlad Atanasiu
+1  A: 

How would you suggest that an editor tell the difference between ASCII/ANSI and UTF-8 w/o BOM, when the files look the same?

If you want guaranteed recognition of UTF-8 as UTF-8, either add the BOM, or force the file to contain UTF-8 characters.
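As a rough illustration of the BOM option, Python's standard `utf-8-sig` codec writes the three-byte UTF-8 BOM on encoding and strips it on decoding. The helper name `has_utf8_bom` is made up for this sketch.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


text = "plain ASCII text"
without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # prepends EF BB BF

print(has_utf8_bom(without_bom))             # False
print(has_utf8_bom(with_bom))                # True
print(with_bom.decode("utf-8-sig") == text)  # True
```

With the BOM present, the file is identifiable as UTF-8 even when every character in it is ASCII.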

Anon.
Hi Anon.! Sorry, I didn't get the reply to my own question uploaded fast enough for you to see in time. The solution was what you suggested.
Vlad Atanasiu
+1  A: 

Configure your editor to always use UTF-8 if possible; if not, complain to the creators of your editor. Charsets that do not target Unicode are, IMO, deprecated and should be treated as such.

Files using only characters in the ASCII range (the 7-bit one) are byte-for-byte identical in UTF-8 anyway, so if you HAVE to deliver something in ASCII encoding, just don't type any non-ASCII characters.
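If a file really must be delivered as ASCII, a small check before saving can point out any stray characters. This is a minimal sketch; `non_ascii_positions` is a hypothetical helper, not part of any editor.

```python
def non_ascii_positions(text: str) -> list[tuple[int, str]]:
    """List (index, character) pairs for every character above U+007F."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0x7F]


clean = "plain text"
mixed = "na\u00efve text"

# Pure ASCII text produces the same bytes under either encoding.
print(clean.encode("ascii") == clean.encode("utf-8"))  # True
print(non_ascii_positions(clean))  # []
print(non_ascii_positions(mixed))  # [(2, 'ï')]
```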

Daniel Bruce
Thank you for your answer. What I find fascinating about this issue is that a file can change its physical status (its encoding) when the information it carries (the text) is modified: one edit uses words from the non-ASCII Unicode ranges, while another happens to use only words made of ASCII characters. It is a bit like a pen that suddenly changes color according to the words you write.
Vlad Atanasiu