I got into an interesting forum discussion about the naming of variables.

Conventions aside, I noticed that a variable name may consist of a Unicode character written as an escape. For example, the following is legal:

int \u1234;

However, if I name it #, for example, the compiler produces an error. According to Sun's tutorial, a name is valid if "beginning with a letter, the dollar sign "$", or the underscore character "_"."

But U+1234 is an Ethiopic character. So what exactly is defined as a "letter"?

+2  A: 

The Unicode standard defines what counts as a letter.

From the Java Language Specification, section 3.8:

Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.

A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
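
To see how this applies to the characters in the question, here is a minimal sketch that calls the two methods named above. The class name JavaLetterCheck and the helper isJavaLetter are made up for illustration; only the Character methods come from the JDK.

```java
public class JavaLetterCheck {

    // Hypothetical helper: a "Java letter" in the JLS sense is any code point
    // that Character.isJavaIdentifierStart accepts.
    static boolean isJavaLetter(int codePoint) {
        return Character.isJavaIdentifierStart(codePoint);
    }

    public static void main(String[] args) {
        // 0x1234 is the Ethiopic character from the question; the rest are ASCII.
        int[] samples = { 0x1234, '#', '$', '_', '7' };
        for (int cp : samples) {
            System.out.printf("U+%04X start=%b part=%b%n",
                    cp,
                    Character.isJavaIdentifierStart(cp),
                    Character.isJavaIdentifierPart(cp));
        }
    }
}
```

Running it should show that U+1234, '$' and '_' can start an identifier, that '7' can only appear after the first character, and that '#' is rejected in both positions.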

From the Character documentation for isJavaIdentifierPart:

Determines if the character (Unicode code point) may be part of a Java identifier as other than the first character. A character may be part of a Java identifier if any of the following are true:

  • it is a letter
  • it is a currency symbol (such as '$')
  • it is a connecting punctuation character (such as '_')
  • it is a digit
  • it is a numeric letter (such as a Roman numeral character)
  • it is a combining mark
  • it is a non-spacing mark
  • isIdentifierIgnorable(codePoint) returns true for the character
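
The rules above can be combined into a rough whole-name check. This is only a sketch: the class IdentifierCheck and the method looksLikeIdentifier are hypothetical names, and the check covers the character rules only, so reserved words such as "int" would still pass it.

```java
public class IdentifierCheck {

    // Hypothetical helper: true if the first code point may start an identifier
    // and every following code point may be part of one. Keywords are NOT excluded.
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Step by code point so supplementary characters are handled correctly.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String ethiopic = new String(Character.toChars(0x1234));
        String[] samples = { "x1", "1x", "_tmp", "$a", "a#b", ethiopic };
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeIdentifier(s));
        }
    }
}
```

The loop advances by Character.charCount so that characters outside the Basic Multilingual Plane are tested as single code points rather than as two surrogate halves.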
Jon Skeet
+1  A: 

Unicode characters fall into character classes. There's a set of Unicode characters which fall into the class "letter".

In Java, Character.isLetter(c) determines whether a character is in that class. For identifiers, however, Character.isJavaIdentifierStart(c) and Character.isJavaIdentifierPart(c) are more relevant.
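
The difference between the two checks is easiest to see side by side. The sketch below uses a made-up class LetterVsIdentifierStart with a hypothetical describe helper; the sample characters are chosen for illustration.

```java
public class LetterVsIdentifierStart {

    // Hypothetical helper: reports both classifications for one code point.
    static String describe(int cp) {
        return String.format("U+%04X letter=%b identifierStart=%b",
                cp,
                Character.isLetter(cp),
                Character.isJavaIdentifierStart(cp));
    }

    public static void main(String[] args) {
        // '$' and '_' are not Unicode letters, yet both may start an identifier.
        // 0x2160 is a Roman numeral (a "letter number"), 0x1234 an Ethiopic letter.
        int[] samples = { 'a', '$', '_', 0x2160, 0x1234, '#' };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

Anything that isLetter accepts can start an identifier, but the reverse does not hold: currency symbols, connecting punctuation and letter numbers are allowed too.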

For the relevant Unicode spec, see this.

Vinay Sajip