I have a snippet of code that looks like this:

double Δt = lastPollTime - pollTime;
double α = 1 - Math.exp(-Δt / τ);
average += α * (x - average);

Just how bad an idea is it to use Unicode characters in Java identifiers? Or is this perfectly acceptable?

+19  A: 

It's a bad idea, for various reasons.

  • Many people's keyboards do not support these characters. If I were to maintain that code on a qwerty keyboard (or any other without Greek letters), I'd have to copy and paste those characters all the time.

  • Some people's editors or terminals might not display these characters properly. For example, some editors (unfortunately) still default to some ISO-8859 (Latin) variant. The main reason why ASCII is still so prevalent is that it nearly always works.

  • Even if the characters can be rendered properly, they may cause confusion. Straight from Sun (emphasis mine):

    Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (Α, \u0391), CYRILLIC SMALL LETTER A (а, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (𝒂, \ud835\udc82) are all different.

    ...

    Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers.

    This is in no way an imaginary problem: α (U+03b1 GREEK SMALL LETTER ALPHA) and ⍺ (U+237a APL FUNCTIONAL SYMBOL ALPHA) are different characters!

  • It is hard to tell which characters are valid. The characters from your code work, but when I use the APL FUNCTIONAL SYMBOL ALPHA, my Java compiler complains about "illegal character: \9082", even though the functional symbol would arguably be more appropriate in this code. There seems to be no simple rule of thumb for which characters are acceptable, short of asking Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart().

  • Even though you may get it to compile, it seems doubtful that all Java virtual machine implementations have been rigorously tested with Unicode identifiers. If these characters are only used for variables in method scope, they should get compiled away, but if they are class members, they will end up in the .class file as well, possibly breaking your program on buggy JVM implementations.
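To make the points about lookalikes and validity concrete, here is a minimal sketch. The class name `IdentifierCheck` and the helper `isValidIdentifier` are made up for this example, and the helper does not check for reserved keywords. It asks `Character.isJavaIdentifierStart()` and `Character.isJavaIdentifierPart()` about each code point, and then compares a composed character with its decomposed form:

```java
import java.text.Normalizer;

class IdentifierCheck {
    // Hypothetical helper: true if every code point is allowed at its position
    // in a Java identifier. Reserved words are NOT checked here.
    static boolean isValidIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+03B1 GREEK SMALL LETTER ALPHA is a letter
        System.out.println(isValidIdentifier("\u03b1"));   // true
        // U+237A APL FUNCTIONAL SYMBOL ALPHA is a symbol, not a letter
        System.out.println(isValidIdentifier("\u237a"));   // false
        // U+0394 GREEK CAPITAL LETTER DELTA followed by 't'
        System.out.println(isValidIdentifier("\u0394t"));  // true

        // Composed A-acute versus 'A' plus a combining acute accent
        String composed = "\u00c1";
        String decomposed = "A\u0301";
        System.out.println(composed.equals(decomposed));   // false
        // Only after normalizing to NFC do the two compare equal
        String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(normalized));   // true
    }
}
```

The strings are written with \u escapes so that the example itself does not depend on the encoding of the source file.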

Thomas
To expand on the last point: you're dependent on the default file encoding of the underlying platform. Although this is controllable using `-Dfile.encoding` on Sun JVMs (yes, JVM implementation dependent...), you *really* don't want to be dependent on that. That's the major showstopper imo. Great answer btw, +1.
BalusC
@BalusC: Thanks, but I think you misunderstood. In the internals of `.class` files, only one encoding is used, and it's something similar to UTF-8. http://en.wikipedia.org/wiki/Class_%28file_format%29 As far as I could determine, `file.encoding` is only used to specify the default encoding for classes like `InputStreamReader`.
Thomas
A: 

Why not? If the people working on that code can type those easily, it's acceptable.

But god help those who can't display Unicode characters, or who can't type them.

LukeN
Anybody who can't display Unicode by this point needs to get out of the '80s and into the 21st century. I mean flipping RSTS/E had the beginnings of i18n in place!
JUST MY correct OPINION
@ttmrichter: You would be right if there weren't a huge number of misconfigured machines and outdated software around...
Thomas
Also, in the Unix and Linux world there are a lot of people using vim or emacs inside the console to do their stuff, and there's no guarantee they can see or write Unicode characters.
LukeN
If vim and emacs can't display characters from a standard that's been around for almost two decades, perhaps their reputation as a productive developer tool is drastically overrated. Or if it's the Unix systems' fault, perhaps Unix isn't the be-all/end-all system it's cracked up to be. Seriously. Get with the 21st century. It's lovely up here. (Thankfully my Linux box seems to cope with the 21st century just fine, given where I live and all that.)
JUST MY correct OPINION
+2  A: 

It looks good, as it uses the correct symbols, but how many of your team will know the keystrokes for those symbols?

I would use an English representation just to make it easier to type. Others might also not have a character set that supports those symbols set up on their PC.

Mauro
A: 

That code is fine to read, but horrible to maintain. I suggest using plain English identifiers, like so:

double deltaTime = lastPollTime - pollTime;
double alpha = 1 - Math.exp(-delta....
Crozin
+2  A: 

It is perfectly acceptable if it is acceptable in your working group. A lot of the answers here operate on the arrogant assumption that everybody programs in English. Non-English programmers are by no means rare these days and they're getting less rare at an accelerating rate. Why should they restrict themselves to English versions when they have a perfectly good language at their disposal?

Anglophone arrogance aside, there are other legitimate reasons for using non-English identifiers. If you're writing mathematics packages, for example, using Greek is fine if your target is fellow mathematicians. Why should people type out "delta" in your workgroup when everybody can understand "Δ" and likely type it more quickly? Almost any problem domain will have its own jargon and sometimes that jargon is expressed in something other than the Latin alphabet. Why on Earth would you want to try and jam everything into ASCII?

JUST MY correct OPINION
Absolutely agree; I think if the working group considers it acceptable, easy to type, and more clear, go for it. The only weird thing about doing this is that it is, in a way, a 'fluke' that a character like Δ is a valid Java identifier start, because it's a 'letter'. Other characters with similar uses don't happen to be 'letters', and hence are invalid.
Cowan
-1 for "you suck because you only know English". Until someone invents a spoken language like Python I will not have any reason to learn it. Although everyone in the world should only speak one language. Language is a basic need, not a game, like programming. It's okay to use algebraic symbols though _when you're in a specific domain_.
Longpoke
@Longpoke: Please point to where I said "you suck because you only know English". (Hint: This is not possible.) Hell, point to where I even *implied* this. (Hint: This, too, is not possible.) What I am pointing out, however, is that the people saying "don't use Unicode in identifiers because it makes things difficult to read" are taking the **very** arrogant attitude that only English-speaking programmers count. Hence "anglophone arrogance".
JUST MY correct OPINION
"Because it scares unilingual wannabes?"
Longpoke
The problem is that the _keywords_ in Java are English. `if`, `while`, `public`, `class` etc, as well as all methods in the runtime library. By using another language for identifiers and methods, you have a situation where the reader must mentally switch continuously between two languages when reading the code. That is simply harder than having only one language, even if the reader is proficient in both.
Thorbjørn Ravn Andersen
@ttmrichter (unrelated to this answer) could you undelete your answer here - http://stackoverflow.com/questions/2707516/is-javaee-really-portable :)
Bozho
@Thorbjørn: The keywords in Java are pseudo-English. The "if" of Java is not the "if" of English. It is the "if" of formal logic which bears only a passing resemblance to English. The same is true of "while", "public", "class", et al. These are not words. They are symbols. We do not process them as English words. We process them as symbols which have a specified meaning in Java only (and often a completely different meaning in another programming language!). So we're ALREADY switching continuously between two languages. By using identifiers in our native tongue this is explicit.
JUST MY correct OPINION
@Bozho: I don't even remember deleting that nor why I did. Mysterious. It's undone.
JUST MY correct OPINION
@Longpoke: Fair point. I'll delete that reference.
JUST MY correct OPINION
@Thorbjørn @ttmrichter: It would probably make more sense to encode keywords such as `if` and `while` in the source code as some symbol, or even just leave them as they are now, then let the IDE translate them to the user's language. Yes, they don't directly map onto spoken language, but they are very close. When I see `if (x == 2) { f() }` I think, "if `x` is equal to 2, call `f()`". Maybe it's not like this in other human languages, who knows.
Longpoke
@Longpoke: It is, in fact, not like this in several other human languages. The things most people think they know about grammar are completely wrong. SVO, for example, is not only not universal, the very notions of "subject" and "object" are not universal. (Linguists use the terms "agent", "experiencer" and "patient" and describe linguistic cases in terms of these.) Conditional structures are not the same across languages. Double-negatives are not positives in many languages, they're emphasizers. "Not not red" means "very not-red" instead of "red". That kind of stuff.
JUST MY correct OPINION
@ttmrichter, you may be somewhat right in terms of the keywords, but not in terms of the identifiers used in the runtime library. It is close to impossible to write any non-trivial Java program without referring to the runtime library, and that contains tons of camel-cased English words. And, yes, I speak from personal experience. The attempts we have made so far to write Danish words into Java programs did not go very well, and I've concluded that the language switching is the cause. The sole exception would be domain-specific concepts with no reasonable English translation.
Thorbjørn Ravn Andersen
@LongPoke, too many symbols also make programs unreadable. Case in point: APL. The COBOL language is old and looked down upon, but it is so English-like that you can frequently understand what it does by just reading the words. Readability is probably the most important aspect of programs besides doing what they are intended to do.
Thorbjørn Ravn Andersen
@Thorbjørn: First, readability is in the eye of the beholder. A Chinese user is going to have a different idea of what 'readable' means than is a German or Swedish or English user. Second, the (standard) runtime library is one of my complaints with Java, precisely because it's a huge, chatty mess of English.
JUST MY correct OPINION
@ttmrichter. I acknowledge you know a lot more about the Chinese mindset than me, and that it may work differently for very-non-English speakers. Will you, in turn, acknowledge that the two-language mindset, at least for Danish speakers, makes it less readable than just one? The runtime library is as it is. English. What would you suggest instead?
Thorbjørn Ravn Andersen
The problem is I'm also a near-native German and a semi-competent French speaker. I have no personal difficulty switching from German to English and back when reading code written by Germans. Indeed I find Germans writing English in code/comments more distracting than their writing German because their English is usually so non-idiomatic. So, from personal experience, I'm going to have to say I still disagree. Of course I disagree from the perspective of a native ENGLISH speaker dealing with foreign writers of code. I'm not sure how it would feel were I a German writing code.
JUST MY correct OPINION
A: 

It's an excellent idea. Honest. It's just not easily practicable at this time. Let's keep a reference to it for the future. I would love to see triangles, circles, squares, etc. as part of program code. But for now, please do try to rewrite it the way Crozin suggests.

Peter Perháč
A: 

In a perfect world, this would be the recommended way.

Unfortunately you run into character encodings when moving outside of plain 7-bit ASCII characters (UTF-8 is different from ISO-Latin-1 is different from UTF-16 etc.), meaning that you will eventually run into problems. This has happened to me when moving from Windows to Linux. Our national Scandinavian characters broke in the process, but fortunately they were only in strings. We then used \u escapes for all of them.

If you can be absolutely certain that you will never, ever run into such a thing - for instance if your files contain a proper BOM - then by all means, do this. It will make your code more readable. If there is even the smallest amount of doubt, then don't.

(Please note that "use non-English languages" is a different matter. I'm only thinking of using symbols instead of letters.)
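The kind of breakage described above can be reproduced with a small sketch. It assumes the text was written as UTF-8 and later read as ISO-8859-1; the class name `EncodingMismatch` and the helper `reinterpret` are made up for this example. The Scandinavian letters are written as \u escapes, which is also the workaround mentioned above, so the example itself survives any source-file encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    // Hypothetical helper: encode text with one charset, then decode the
    // resulting bytes with another, as happens when a file changes platforms.
    static String reinterpret(String text, Charset written, Charset read) {
        return new String(text.getBytes(written), read);
    }

    public static void main(String[] args) {
        // U+00E6, U+00F8, U+00E5: three Scandinavian letters, written as escapes
        String original = "\u00e6\u00f8\u00e5";

        // Written as UTF-8 but read as ISO-8859-1: each letter turns into two characters
        String garbled = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(original.length()); // 3
        System.out.println(garbled.length());  // 6
        System.out.println(garbled.equals(original)); // false

        // Same charset on both sides: the text survives unchanged
        String intact = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(intact.equals(original)); // true
    }
}
```

With string literals this only produces wrong text at runtime. With identifiers, the same mismatch would change the names themselves or stop the file from compiling.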

Thorbjørn Ravn Andersen
Those symbols *are* non-English languages. Delta and alpha are Greek. That's a language. That isn't English.
JUST MY correct OPINION
@ttmrichter, I was referring to using identifiers in your native language as opposed to using the English terms. (Like Cheval instead of Horse if French). This is different from using "Δ" in the _mathematical_ sense as asked.
Thorbjørn Ravn Andersen