These days, more languages support Unicode, which is a good thing. But it also presents a danger. In the past there was trouble distinguishing between 1 and l, and between 0 and O. Now we have a whole new range of similar-looking characters.

For example:

ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ

With these, it is easy to create bugs that are very hard to find.
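To make the danger concrete, here is a minimal Python sketch (Python is my choice for illustration, not something the question specifies). It prints the code point and official Unicode name of a few of the look-alikes listed above, and shows that two names that render almost identically compare as different. The names `LOOKALIKES`, `describe`, `ascii_name` and `dotless_name` are made up for this example.

```python
import unicodedata

# A few of the look-alikes from the list above, written as escapes so the
# example itself stays pure ASCII: i, dotless i, Greek iota, i-grave, i-hook.
LOOKALIKES = ["i", "\u0131", "\u03b9", "\u00ec", "\u1ec9"]


def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


for code, name in describe(LOOKALIKES):
    print(code, name)

# Two "identical looking" names that are different strings.
ascii_name = "file"
dotless_name = "f\u0131le"  # second letter is LATIN SMALL LETTER DOTLESS I
print(ascii_name == dotless_name)  # False
```

In many fonts the two names are nearly indistinguishable on screen, which is exactly the kind of bug described above.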

At my work, we have decided to stay with the ANSI characters for identifiers. Is anybody out there using Unicode identifiers, and what are your experiences?

+4  A: 

Besides the similar-character bugs you mention and the technical issues that can arise when using different editors (files with or without a BOM, or different encodings ending up in the same file through copy-pasting, which only matters when the file contains characters that cannot be encoded in ASCII), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development and you should stick to it while writing code.

This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).

Vinko Vrsalovic
+3  A: 

My experience with using Unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like Unicode, and when you find yourself manually fixing corrupted source files in Word you know something isn't right.

I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).

MusiGenesis
We are not English, but that's not a problem ;-).
Gamecat
Sorry. I've read your posts before, and I assumed you were American. You guys hiring? Holland is like the ancestral homeland I never had.
MusiGenesis
Lol, you are welcome to visit our beautiful country. But (just for the record) Holland is just the western part of the Netherlands.
Gamecat
Thanks, I have and I will again. I know about the Holland thing - I just don't like using country names that start with "the", which is why I call myself an American instead of a "TheUnitedStatesian".
MusiGenesis
Nice, if you like I can send you an email (to the email on your site). You can never have too many friends out here.
Gamecat
Unitedstatesian is good enough though. I hate how Unitedstatesians have appropriated the whole continent. I love The Netherlands as well, I'd really like to live there for a couple of years.
Vinko Vrsalovic
@Gamecat: that sounds cool, thanks.
MusiGenesis
@Vinko: it's hard work exterminating the locals and taking over the place (especially when they fight back). We earned our "The".
MusiGenesis
Could someone please upvote me to 5K so I can lay down and finally die?
MusiGenesis
Well then, use your "the", but give America back to every Canadian, Central and Latin American! There was a performance in New York about that in the 1970s: a projection of a map of the US saying "This is not America", followed by a projection of the whole continent saying "THIS is America".
Vinko Vrsalovic
@MusiGenesis: here you are :)
hayalci
Thank you, hayalci. I hope my life was not in vain.
MusiGenesis
@Vinko: I live in the one part of the US that would have to be given back to France, so I'm OK with your idea. I would at least be a theoretical neighbor of The Holland.
MusiGenesis
Eh, France and the Netherlands are not neighbors. They invented Belgium to keep them apart.
Gamecat
I could have sworn Germany kicked off WWII by invading France through the Netherlands (since I guess they didn't want to wear out Belgium again). Did you guys move everything around since then? :)
MusiGenesis
Yup, Europe is constantly moving its borders.
Gamecat
No we Belgians weren't really invaded, we simply invited them in.
Dave Van den Eynde
A: 

I would also recommend using ASCII for identifiers. Comments can stay in a non-English language if the editor, IDE, compiler, etc. are all locale-aware and set up to use the same encoding.

Additionally, some case-insensitive languages convert identifiers to lowercase before using them, and that causes problems if the active system locale is Turkish or Azerbaijani. See here for more info about the Turkish locale problem. I know that PHP does this, and it has a long-standing bug.

Just to point out, this problem is not limited to language implementations: it is present in any software that compares strings using Turkish locales. It causes many headaches.
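A rough sketch of why the Turkish rules bite, written in Python purely for illustration. Python's own `str.lower()` does not depend on the system locale, so the Turkish behaviour is imitated here by a hypothetical helper, `turkish_lower`, which is my own invention and not a standard library function. In Turkish, uppercase "I" lowercases to dotless "ı", and dotted "İ" lowercases to "i".

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def turkish_lower(text):
    """Hypothetical helper imitating Turkish lowercasing rules."""
    # Handle the two Turkish special cases first, then lowercase the rest.
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


keyword = "TITLE"

# Locale-independent lowercasing gives the spelling a parser expects.
print(keyword.lower() == "title")         # True

# Under Turkish rules the same identifier no longer matches.
print(turkish_lower(keyword) == "title")  # False
print(turkish_lower(keyword) == "t\u0131tle")  # True
```

A case-insensitive language that lowercases identifiers with the active locale would therefore fail to recognise any name containing an uppercase "I" when that locale is Turkish.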

hayalci
Lol, I know about the Turkish characters. A family member of mine speaks Turkish and I have set up her keyboard ;-).
Gamecat
A: 

I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
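If you want to enforce an "ASCII only, nothing above 127" rule mechanically, something like the following could work. This is only a sketch for Python source files, using the standard `tokenize` module; the function name `non_ascii_identifiers` and the sample source are made up for the example, and other languages would need their own tokenizer.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line number, name) for every identifier outside ASCII."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; strings and comments
        # are separate token types, so they are not flagged here.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


sample = "total = 1\nt\u00f6tal = 2\nname = 'caf\u00e9'\n"
print(non_ascii_identifiers(sample))
```

Only the identifier on line 2 is reported; the non-ASCII character inside the string literal on line 3 is left alone, which matches a policy that restricts identifiers but not string contents.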

In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.

Now on to the difference between ANSI code pages and Unicode.

In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.

Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
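To illustrate the code page problem and the escape-sequence fix, here is a small Python sketch (the answer does not name a language, so Python is an assumption on my part). The same byte means different characters under two ANSI code pages, while an escape sequence in an ASCII-only source file is unambiguous. The helper `to_ascii_literal` is a made-up name for this example.

```python
# One byte, two meanings: 0xE9 under two different ANSI code pages.
raw = b"caf\xe9"
as_western = raw.decode("cp1252")   # e with acute accent
as_cyrillic = raw.decode("cp1251")  # Cyrillic short i instead
print(as_western == as_cyrillic)    # False

# Written with an escape sequence, the source stays pure ASCII and the
# character no longer depends on anyone's code page.
escaped = "caf\u00e9"
print(escaped == as_western)        # True


def to_ascii_literal(text):
    """Rewrite text using only ASCII characters and escape sequences."""
    return text.encode("unicode_escape").decode("ascii")


print(to_ascii_literal(escaped))
```

A helper like this can be used to convert existing string literals before checking a file in, so that what is stored in the repository contains no bytes above 127.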

Windows programmer
A: 

It depends on the language you're using. In Python, for example, it is easier for me to stick to Unicode, as my applications need to work in several languages. So when I get a file from someone (or something) that I don't know, I assume Latin-1 and translate to Unicode.
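In Python 3 terms, "assume Latin-1 and translate to Unicode" could look roughly like this. It is a sketch of the approach as I read it, not necessarily the exact code used; `to_text` is a made-up name. Latin-1 maps every byte to a character, so the decode never fails, but it will silently produce the wrong text if the input was really in another encoding.

```python
def to_text(data):
    """Decode bytes of unknown origin, assuming they are Latin-1."""
    return data.decode("latin-1")


# Bytes that really are Latin-1 come out right.
print(to_text(b"ma\xf1ana"))

# Bytes that were actually UTF-8 decode without error, but incorrectly.
utf8_bytes = "ma\u00f1ana".encode("utf-8")
print(to_text(utf8_bytes))
```

The second call shows the trade-off: the assumption is safe only when the sources really do use Latin-1, as they apparently do in the situation described here.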

Works for me, as I'm in latin-america.

Actually, once everything is ironed out, the whole thing becomes a smooth ride.

Of course, this depends on the language of choice.

voyager
A: 

I have never used Unicode for identifier names. But what comes to mind is that Python allows Unicode identifiers as of version 3: PEP 3131.
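A quick sketch of what that looks like in Python 3. The source text is built with escape sequences and run through `exec` so that this example itself stays ASCII; the variable names are made up. Python converts identifiers to the NFKC normal form while parsing, so some characters that look alike end up as the same name, though by no means all of them.

```python
import unicodedata

# A non-ASCII identifier is accepted by Python 3.
namespace = {}
exec("gr\u00f6\u00dfe = 3", namespace)
print(namespace["gr\u00f6\u00dfe"])  # 3

# MICRO SIGN (U+00B5) is normalized to GREEK SMALL LETTER MU (U+03BC).
exec("\u00b5 = 1", namespace)
print("\u03bc" in namespace)  # True
print(unicodedata.normalize("NFKC", "\u00b5") == "\u03bc")  # True

# Normalization does not merge every look-alike: dotless i stays distinct.
print("f\u0131le".isidentifier())  # True
print(unicodedata.normalize("NFKC", "f\u0131le") == "file")  # False
```

So the normalization removes a few confusable pairs, but names such as "file" written with a dotless "ı" remain separate identifiers.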

Another language that makes extensive use of Unicode is Fortress.

Even if you decide not to use Unicode, the problem resurfaces when you use a library that does. So you have to live with it to a certain extent.

unbeknown