views:

93

answers:

5

How would you design an 8-bit encoding of a set of 256 characters from western languages (say, with the same characters as ISO 8859-1) if it did not have to be backward-compatible with ASCII?

I'm thinking of rules of thumb like these: if ABC...XYZabc...xyz0123...89 were, in this order, the first characters of the set (codes from 0 to 61), then isalpha(c) would just need the comparison c < 52, isalnum(c) would be c < 62, and so on. If, on the other hand, 0123...89 were the first characters, maybe atoi() and the like would be easier to implement.
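A minimal sketch in C of what those classification functions could look like under this hypothetical layout (the hypo_ names are mine, not real library functions):

```c
#include <assert.h>

/* Hypothetical encoding: A..Z -> 0..25, a..z -> 26..51, 0..9 -> 52..61. */
static int hypo_isalpha(unsigned char c) { return c < 52; }
static int hypo_isalnum(unsigned char c) { return c < 62; }

/* If digits came first instead (0..9 -> codes 0..9), converting a digit
   character to its numeric value would not even need the usual c - '0': */
static int hypo_digit_value(unsigned char c) { assert(c < 10); return c; }
```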

Another idea: if the letters were sorted like AaBbCcDdEeFf... or aàáâãbcdeèéêëfgh..., I think that dictionary-like sorting of strings would be more efficient.

Finally: is there a rationale behind 0 being the terminator of C strings instead of, say, 255?

+1  A: 

I wouldn't design an 8-bit encoding. That's dumb. There are far more than 255 human characters.

However, if I could just have a remake of the ANSI character set, I'd remove all the now-defunct control characters that span from 1 to 31. The rest is pretty much okay in my opinion. You have to take into account how strings are sorted too (like how a string starting with an underscore should be sorted before a string starting with a numerical character).

That being said, the rationale for making 0 the string terminator is probably that 0 means false in a condition, so you can iterate through a string just by checking if the character is non-zero, like if(*string) instead of if(*string != 0xFF).
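A sketch of the difference, assuming a hypothetical world where 0xFF had been chosen as the terminator instead (both len_ helpers are written just for contrast, they are not standard functions):

```c
#include <stddef.h>

/* With the real 0 terminator, the loop condition is just the character
   itself: 0 is false, everything else is true. */
static size_t len_nul(const char *s) {
    size_t n = 0;
    while (*s++) n++;
    return n;
}

/* With a hypothetical 0xFF terminator, every iteration needs an explicit
   comparison against the terminator value. */
static size_t len_ff(const unsigned char *s) {
    size_t n = 0;
    while (*s != 0xFF) { s++; n++; }
    return n;
}
```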

Also, community wiki.

zneak
And regarding the last point I guess that, in terms of machine instructions, the first version would be shorter and faster... wouldn't it?
Federico Ramponi
Yup. But it's also shorter (which was, apparently, a concern so important to people back in the day that Unix has a function called `creat` instead of `create` to create files).
zneak
+1  A: 

What problems do you see with existing character sets that you hope to solve with a new one?

The efficiency savings of only needing c < 52 rather than c > M && c < N are marginal at best, given that this is rarely a bottleneck. Moreover, isalpha() and isalnum() are locale-specific and need to take care of accented characters, so in locales other than the one you design the charset for, you don't get any savings at all.

Your second idea of aàáâãbcdeèéêëfgh... is nice for ordering single characters according to a particular locale, but it doesn't help ordering multicharacter strings in languages where some characters are equivalent with respect to ordering. For example in German dictionaries umlauts are ignored for ordering purposes (abc < äbd < abe) so you still couldn't do a simple lexicographic order of char values.
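The point can be demonstrated with plain byte-wise comparison (a hedged sketch, using the raw ISO 8859-1 byte 0xE4 for 'ä'): strcmp, which the standard requires to compare bytes as unsigned char, cannot reproduce the German dictionary order abc < äbd < abe, because 0xE4 compares above every unaccented letter at the very first position.

```c
#include <string.h>

/* In ISO 8859-1, 'ä' is the byte 0xE4, which compares above every
   unaccented letter. German dictionary order wants abc < äbd < abe,
   but strcmp disagrees on the second relation, because 0xE4 > 'a'
   already at position 0. */
static int dictionary_order_holds(void) {
    return strcmp("abc", "\xe4" "bd") < 0   /* holds: 'a' < 0xE4 */
        && strcmp("\xe4" "bd", "abe") < 0;  /* fails: 0xE4 > 'a' */
}
```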

Philip Potter
Oh, I don't hope to solve anything, and I don't want to create a new encoding (there are already enough :). I'm just wondering whether there would have been advantages, had history gone otherwise...
Federico Ramponi
Maybe at one point, but not now. Given how multinational computing has become, there are too many languages and locales and local variations to be able to solve problems by carefully designing a charset. Incidentally, one thing every 8-bit encoding should have is digits 0-9 in consecutive order, because otherwise it's not a valid charset to write C code in.
Philip Potter
+1  A: 

255 is not a valid character value on a 7 bit system, or might be somewhere in the middle of the native character set on a 9 bit machine. Imagine 'e' being your string terminator.

So it's historic: "Can it run on a toaster chip?" was a fundamental (if retrofitted) design principle for C. Type widths are rather weakly defined in C, so implementations could use "native" elements - char being "the smallest individually addressable element", and that wasn't, and still isn't, 8 bits on all machines. 0 was widely unused anyway.

For the rest of your question: entirely subjective, depending on what you optimize for. It makes sense only in very strictly defined environments that are very low on resources. E.g. in German, there are different "phone book" and "dictionary" sort rules. Which do you pick?


In the light of your examples, I'd put digits first, followed by letters (easier for dec/hex strings). I'd keep uppercase and lowercase letters apart - but, as in ASCII, by a single bit. Instead of cramming it full of funny characters, I'd rather leave some chars undefined so some of these tricks work better. Optimizing for sort order is pointless unless you pre-define the sort algorithm.
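ASCII's single-bit case distinction looks like this (a sketch; these helpers only handle the unaccented A-Z/a-z range, and the ascii_ names are mine):

```c
/* In ASCII, 'A' = 0x41 and 'a' = 0x61: the two cases differ only in bit
   0x20, so case conversion is one bit operation once the range check passes. */
static char ascii_tolower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char)(c | 0x20) : c;
}
static char ascii_toupper(char c) {
    return (c >= 'a' && c <= 'z') ? (char)(c & ~0x20) : c;
}
```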

peterchen
+1  A: 

If I were doing this from scratch I would have the following scheme:

x00 -- x10  -- Control characters such as end of file, end of line, end of string.

x10 -- x30  -- Alphabetic characters using the following pattern:
    x10  -> A (upper case A).
    x11  -> a (lower case A).
    x12  -> a with first local accent, e.g. a acute.
    x13  -> a with second local accent, e.g. a grave.
    .....................
x40 -- x50  -- Local "extra" characters
    Things like the Scandinavian Æ or Danish Ø, which are regarded as separate
    characters with their own position in the collating scheme.

x50 -- x60 -- Punctuation .,:; etc.
x70 -- x80 -- Other special characters {}/\ etc.

xF0 -- xFF -- 0 to 9

There would be a number of advantages to this scheme (none of which are worth the pain of implementation and conversion!).

Firstly, isnumeric, isalpha, etc. could be implemented with a simple bit mask.

Secondly collating would automatically fall into a natural sequence.

Ale, alcohol, ácute, áccentgrave, Beer, Øl
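For the first advantage, a hedged sketch of what those bit-mask tests could look like under the layout above (digits at xF0-xFF, letters at x10-x2F; the hypo_ names are mine, not part of the proposal):

```c
/* Digits occupy 0xF0..0xFF: one mask comparison suffices. */
static int hypo_isnumeric(unsigned char c) { return (c & 0xF0) == 0xF0; }

/* Letters occupy 0x10..0x2F: two mask comparisons, one per 16-byte row. */
static int hypo_isalpha(unsigned char c) {
    return (c & 0xF0) == 0x10 || (c & 0xF0) == 0x20;
}
```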

However, fitting a complex multicultural world into an eight-bit scheme is just not possible, and any scheme proposed would be compromised somehow. The real solution is to listen to the good folks at the Unicode consortium, who have all the bases covered by simply using 16 bits (or more!).

James Anderson
+1  A: 
Roger Pate