What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.

Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.

What is the correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?

A: 

Non-ASCII Unicode characters.

Amuck
This is incorrect. Unicode has nothing to do with ASCII, except for being backwards compatible for the first 128 code points (0-127).
Dervin Thunk
That's the point. All of the Unicode characters that don't have ASCII equivalents.
Amuck
@Dervin: just as values over 127 have nothing to do with ASCII.
Joachim Sauer
A character outside of the ASCII range is not necessarily a Unicode character. It's a character outside of the ASCII range. Depending on the character encoding you're using, it's one of: an invalid bit sequence, a Unicode character, an ISO-8859-x character, a Microsoft 1252 character, or a character in some other character encoding.
thomasrutter
+13  A: 

"Non-ASCII characters"

Aardvark
It seems definition by negation is the best we can do. As soon as we add "Unicode", the term won't be applicable in non-Unicode contexts, etc. I liked sgm's idea of "trans-ascii", but a fresh coinage won't cut it, especially when communicating across languages.
moodforaday
+1  A: 

"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".

Unicode is one possible set of Extended ASCII characters, and is quite, quite large.

UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.

Dean J
Actually, "Extended ASCII" would include 0-127; my error!
Dean J
My thought was "extended ascii" would only refer to 128-255. Anything that cannot be expressed in that range isn't really ascii any more :)
moodforaday
Note also (from wikipedia) that the use of the term 'extended ASCII' has been criticized, because it can be mistaken for an extension of the ASCII standard.
thomasrutter
@thomasrutter; if you're going to alter my answer that much in an edit, please just post a different answer, and/or leave a comment here at least?
Dean J
Gee, I was just trying to be helpful. I've rolled everything back.
thomasrutter
A: 

You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.

Cirno de Bergerac
I like "trans-ascii" and I think it correctly expresses the idea, but I am primarily looking for a good term to communicate the concept. Using a self-coined term may not do that :)
moodforaday
+2  A: 

ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, and others chose other characters.

Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages, both modern and ancient.

There are several implementations of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same for both ASCII and UTF-8.

That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or other Unicode) or they can be a custom implementation by a hardware or software supplier.
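As a rough illustration of that compatibility, here is a minimal Python sketch. The sample strings and variable names are my own and only for demonstration: pure ASCII text produces the same bytes under both encodings, while a character beyond 127 becomes a multi-byte UTF-8 sequence.

```python
# ASCII-only text yields identical bytes under ASCII and UTF-8.
text = "plain ASCII"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# A character beyond code point 127 becomes a multi-byte UTF-8 sequence,
# and every byte in that sequence has its high bit set (value >= 0x80).
e_acute = "\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
encoded = e_acute.encode("utf-8")
high_bits = all(b >= 0x80 for b in encoded)

print(same, encoded, high_bits)
```

Because no byte below 0x80 ever appears inside a multi-byte sequence, software that only understands ASCII can still find ASCII delimiters in UTF-8 data without misreading them.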

Jim C
The UTFs are not "implementations" of Unicode. They are encodings of Unicode text into bytestrings. Unicode text is represented as a sequence of numbers (*not* `int`s or `long`s, *numbers*), and the UTFs are ways of translating each number into a sequence of one or more bytes.
Justice
Jim, thank you, but I am more or less aware of what those are :) I was only looking for a precise name.
moodforaday
A: 

If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the Code Pages that defined particular characters represented by particular values. Any multibyte value (> 255 decimal) is not ASCII.
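To make the code page point concrete, here is a small Python sketch. The choice of byte 0x80 and of these three codecs is just mine for illustration. It decodes one "high" byte under several single-byte encodings and shows that each assigns it a different character.

```python
# The same high-bit byte decoded under three single-byte encodings.
raw = bytes([0x80])

decoded = {name: raw.decode(name) for name in ("cp437", "cp1252", "latin-1")}

# ASCII proper only covers values whose high bit is clear (0-127).
is_seven_bit = raw[0] < 0x80

for name, char in decoded.items():
    print(f"{name}: U+{ord(char):04X}")
print("fits in 7 bits:", is_seven_bit)
```

The byte is the same in every case; only the code page decides whether it means a C with cedilla, a euro sign, or a control character.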

DaveE
A: 

A bit sequence that doesn't represent an ASCII character is not necessarily a Unicode character.

Depending on the character encoding you're using, it could be either:

  • an invalid bit sequence
  • a Unicode character
  • an ISO-8859-x character
  • a Microsoft 1252 character
  • a character in some other character encoding
  • a bug, binary data, etc.

The one definition that would fit all of these situations is:

  • Not an ASCII character

To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
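A minimal Python sketch of that distinction follows. The `classify` helper is a hypothetical name of my own, not a standard function, and it assumes Python 3.7 or later for `str.isascii()`. It shows the same bytes landing in different categories depending on the encoding you assume.

```python
def classify(data: bytes, encoding: str) -> str:
    """Hypothetical helper: describe a byte string under an assumed encoding."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, so they are not a character at all.
        return "invalid byte sequence"
    return "ASCII" if text.isascii() else "non-ASCII character(s)"


results = {
    "A as utf-8": classify(b"A", "utf-8"),
    "E9 as latin-1": classify(b"\xe9", "latin-1"),
    "E9 as cp1252": classify(b"\xe9", "cp1252"),
    "E9 as utf-8": classify(b"\xe9", "utf-8"),
    "C3 A9 as utf-8": classify(b"\xc3\xa9", "utf-8"),
}

for label, verdict in results.items():
    print(f"{label}: {verdict}")
```

The lone byte 0xE9 is a perfectly good character in ISO-8859-1 and Windows-1252 but an incomplete sequence in UTF-8, which is why "not ASCII" is the only label that holds before the encoding is known.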

thomasrutter