views:

198

answers:

5

This question is language agnostic but is inspired by these c/c++ questions.

How to convert a single char into an int

Char to int conversion in C

Is it safe to assume that the characters for digits (0123456789) appear contiguously in all text encodings?

i.e. is it safe to assume that

'9'-'8' = 1
'9'-'7' = 2
...
'9'-'0' = 9

in all encodings?

I'm looking forward to a definitive answer to this one :)

Thanks,

Update: OK, let me limit "all encodings" to mean anything as old as ASCII and/or EBCDIC and anything after. Sanskrit I'm not so worried about...

+5  A: 

I don't know about all encodings, but at least in ASCII and <shudder> EBCDIC, the digits 0-9 all come consecutively and in increasing numeric order. Which means that all ASCII- and EBCDIC-based encodings should also have their digits in order. So for pretty much anything you'll encounter, barring Morse code or worse, I'm going to say yes.

Chris Lutz
Chris, thanks for this, but I'm looking for a definitive answer, hopefully from someone that works closely with encoding specifications. I "believe" as you do that no one would be insane enough to publish an encoding that didn't support 0-9, but as you know yourself, belief isn't enough in this job. Thanks mate
Binary Worrier
Who do you have in mind that "works closely with encoding specifications"? There isn't one central organization in charge of all encodings. Anyone can create their own. I could implement one right now that has the digits in reverse order, or where 0 comes after 9. The only sane answer you're going to get is "it doesn't matter about *every* encoding. Find out which encoding you're reading, and follow its conventions".
jalf
@Binary Worrier - I can assure you that, for 99.9999% of all text you will ever encounter, '9' - '0' will be 9. You are not going to get a more exact answer than that. Even the official encoding of the People's Republic of China is ASCII-compatible. Besides, I did name two cases where 0-9 would NOT be binary consecutive, but Morse code would be very difficult to represent in pure binary, so it may not really count.
Chris Lutz
Chris: Dude, I'm with you, I agree with you, but unfortunately I'm a pedantic git and will hold out a while longer for a "definitive" answer. Thanks
Binary Worrier
I guess your Baudot Code example is the counterproof we needed. The Wiki page even says that variants (ITA2) are still in use today, in a few areas. Problem solved then. The answer is "No".
jalf
This will do me. So the answer we're going with is: a definitive "Yes" for ASCII, EBCDIC, and everything based thereon; a "maybe" for other modern encodings; and a definitive "No" as well, because Baudot exists and is still in (very limited) use today. That's all the bases covered then. Gentlemen, thank you for your time, and a thousand apologies for stretching your patience. Fare thee well.
Binary Worrier
One area where Baudot (ITA2) is still in use is TDDs (teletypewriters for the deaf). There used to be a giant, metal, olive-drab one with a big roll of yellow paper in my home when I was a kid - it looked like something from WWII. Nowadays they usually look more like a small keyboard with a small LCD display.
Michael Burr
"Morse code would be very difficult to represent in pure binary" ?? Morse code **is** in binary. dot==0 dash==1
Stephen P
@Stephen P - Morse code has no spaces to separate words, or even signals to separate characters. It relies on the timing between dots and dashes to determine the difference between "..." (s) and ". . ." (eee). Computers would need to represent dots (00), dashes (01), character boundaries (10), and word boundaries (11). And then most letters are 3 or more dots or dashes, meaning that it would be less space efficient than ASCII by a long shot and an unlikely choice for any architecture.
Chris Lutz
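As an aside, the 2-bit scheme Chris describes can be sketched in a few lines of C. The names `morse_sym` and `pack4` are hypothetical helpers invented for illustration, not part of any real library:

```c
#include <assert.h>

/* Sketch of the 2-bit Morse scheme described above:
   00 = dot, 01 = dash, 10 = character boundary, 11 = word boundary.
   "s" (...) needs three symbols plus a boundary marker: already
   8 bits, so no more compact than ASCII for most letters. */
enum morse_sym { DOT = 0, DASH = 1, CHAR_END = 2, WORD_END = 3 };

/* Pack four 2-bit symbols into one byte, most significant first. */
unsigned char pack4(enum morse_sym a, enum morse_sym b,
                    enum morse_sym c, enum morse_sym d)
{
    return (unsigned char)((a << 6) | (b << 4) | (c << 2) | d);
}
```

So "s" followed by a character boundary packs into the single byte `pack4(DOT, DOT, DOT, CHAR_END)`, which illustrates why the scheme is workable but space-inefficient.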
+2  A: 

According to K&R (The C Programming Language, ANSI C edition), it is.

Excerpt:

..."This particular program relies on the properties of the character representation of the digits. For example, the test

if (c >= '0' && c <= '9') ...

determines whether the character in c is a digit. If it is, the numeric value of that digit is

c - '0'

This works only if '0', '1', ..., '9' have consecutive increasing values. Fortunately, this is true for all character sets...."
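The K&R idiom can be wrapped in a small helper that also validates its input. This is a sketch relying only on the consecutive-digits guarantee quoted above; the name `digit_value` is illustrative:

```c
/* Convert a digit character to its numeric value, relying on the
   guarantee that '0'..'9' have consecutive increasing values.
   Returns -1 for non-digit input. */
int digit_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}
```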

Leif Ericson
All of which predates Unicode.
Binary Worrier
This is not the best argument, but I assume that recent editions of that book are responsible for its correctness nowadays. My copy happens to be less than a year old, so I assume this information is still correct.
Leif Ericson
+3  A: 

You're going to find it hard to prove a negative. Nobody can possibly know every text encoding ever invented.

All encodings in common use today (except EBCDIC, is it still in common use?) are supersets of ASCII. I'd say you're more likely to win the lottery than you are to find a practical environment where the strict ordering of '0' to '9' doesn't hold.

Mark Ransom
"All encodings in common use today . . .are supersets of ASCII", can you cite a reference for this? Thanks.
Binary Worrier
+3  A: 

Both the C++ Standard and the C standard require that this be so, for C++ and C program text.

anon
Hm, really? Got a source for this?
jalf
That would limit the requirement to the encoding the source file is written in; you can't stretch that statement to cover the data the compiled program runs on.
Binary Worrier
It should probably also be noted that C (and probably C++, but I'm no C++ coder) requires that a `char`, when printed, be printed in ASCII (rather, in the "C" locale, which uses ASCII).
Chris Lutz
@jalf: From the C99 standard, 5.2.1 Character sets: "In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous" (I'm sure there's something similar in C90, but I don't have that at hand right now).
Michael Burr
Michael: What do they mean by "execution basic character sets"?
Binary Worrier
"Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters."
Michael Burr
@Binary - In layman's terms, the character set used during the program's execution. So the C code "int i = getchar() - '0';" will set i to 4 if the user entered the character '4' as input.
Chris Lutz
+2  A: 

All text encodings I know of typically order each representation of digits sequentially. However, your question becomes a lot broader when you include all of the other representations of digits in other encodings, such as the full-width digits used in Japanese text: １２３４５６７８９０. Notice how the characters for the numbers are different? They are actually different code points. So I really think the answer to your question is a hard maybe, since there are so many encodings out there, and many of them contain multiple representations of the digits.

A better question is to ask yourself, why do I need to count on digits to be in sequential code points in the first place?

Elijah
Elijah: Two things 1) I see 1 2 3 4 5 6 7 8 9 0 above, not Japanese characters. 2) the why is purely to answer, for once and for all, whether the c/c++ style shortcut for converting a character to an integer is valid i.e. '1' - '0' = 1. Thanks.
Binary Worrier
However, notice that those characters do not have the same code points as your normal ASCII digits - they are Unicode characters with special spacing requirements, used in Japanese text.
Elijah
But their code points *are* sequential. Written as 0 .. 9 instead of 1 .. 0, those Unicode characters encoded in UTF-8 are `ef bc 90` through `ef bc 99`, so even for them, '9'-'0' == 9.
Stephen P
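To illustrate Stephen's point: since U+FF10 through U+FF19 encode in UTF-8 as `ef bc 90` through `ef bc 99`, the numeric value can be recovered from the third byte alone. A sketch, with a hypothetical function name:

```c
/* Return the numeric value of a UTF-8 encoded full-width digit
   (U+FF10..U+FF19, bytes EF BC 90..EF BC 99), or -1 if the bytes
   at s do not start with a full-width digit. */
int fullwidth_digit_value(const unsigned char *s)
{
    if (s[0] == 0xEF && s[1] == 0xBC && s[2] >= 0x90 && s[2] <= 0x99)
        return s[2] - 0x90;
    return -1;
}
```

The same "subtract the code point of zero" trick works because this digit run is contiguous too; that property just isn't promised for every digit-like run in every encoding.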