I have just started reading through The C Programming Language and I am having trouble understanding one part. Here is an excerpt from page 24:

#include <stdio.h>

/* count digits, white space, others */

main()
{
    int c, i, nwhite, nother;
    int ndigit[10];

    nwhite = nother = 0;
    for (i = 0; i < 10; ++i)
        ndigit[i] = 0;

    while ((c = getchar()) != EOF)
        if (c >= '0' && c <= '9')
            ++ndigit[c-'0']; // THIS IS THE LINE I AM WONDERING ABOUT
        else if (c == ' ' || c == '\n' || c == '\t')
            ++nwhite;
        else
            ++nother;

    printf("digits =");
    for (i = 0; i < 10; ++i)
        printf(" %d", ndigit[i]);
    printf(", white space = %d, other = %d\n",
        nwhite, nother);
}

The output of this program run on itself is

digits = 9 3 0 0 0 0 0 0 0 1, white space = 123, other = 345

The declaration

int ndigit[10];

declares ndigit to be an array of 10 integers. Array subscripts always start at zero in C, so the elements are

ndigit[0], ndigit[1], ..., ndigit[9]

This is reflected in the for loops that initialize and print the array. A subscript can be any integer expression, which includes integer variables like i, and integer constants. This particular program relies on the properties of the character representation of the digits. For example, the test

if (c >= '0' && c <= '9')

determines whether the character in c is a digit. If it is, the numeric value of that digit is

c-'0'

This works only if '0', '1', ..., '9' have consecutive increasing values. Fortunately, this is true for all character sets. By definition, chars are just small integers, so char variables and constants are identical to ints in arithmetic expressions. This is natural and convenient; for example

c-'0'

is an integer expression with a value between 0 and 9 corresponding to the character '0' to '9' stored in c, and thus a valid subscript for the array ndigit.

The part I am having trouble understanding is why the -'0' part is necessary in the expression c-'0'. If a character is a small integer as the author says, and the digit characters correspond to their numeric values, then what is -'0' doing?

A: 

It converts from the ASCII code of the '0' key on your keyboard to the value zero.

If you did int x = '0' + '0', the result would not be zero (in ASCII it would be 48 + 48 = 96).

James
+3  A: 

The numeric value of a character is (on most systems) its ASCII value. The ASCII value of '0' is 48, '1' is 49, etc.

By subtracting 48 from the value of the character, '0' becomes 0, '1' becomes 1, etc. By writing it as c - '0' you don't actually need to know what the ASCII value of '0' is (or that the system is using ASCII; it could be using EBCDIC). The only thing that matters is that the values are consecutive increasing integers.

Mark Byers
+6  A: 

Digit characters don't correspond to their numeric values. They correspond to their encoding values (in this case, ASCII).

IIRC, ASCII '0' is the value 48. And, luckily for this example and most character sets, the values of '0' through '9' are stored in order in the character set.

So, subtracting the ASCII value for '0' from any ASCII digit returns its "true" value of 0-9.

Joe
Thanks, the author basically said the same thing but it just was not sinking in. It all makes sense now.
typoknig
If you knew '0' was 48, you could just use 48. The use of '0' is at least in part to aid portability to other (non-ASCII) character sets where the digits have different character values.
Carl Norum
The C standard guarantees that `'0'` through `'9'` are contiguous, whatever the encoding.
Alok
@Carl Norum: No you can't. Because 48 assumes ASCII encoding. You should ALWAYS use '0' because that will work whatever the encoding of the character set is because the language guarantees that numerical values are contiguous. It is also what everybody is expecting you to use. If you start sprinkling your code with magic numbers you are going to confuse people.
Martin York
I don't think there's anything wrong with modern code assuming the character encoding is ASCII-compatible, or even that it's UTF-8 or that `wchar_t` is UTF-32/UCS-4. C99 even has a way of indicating this, `__STDC_ISO10646__`, and if it's defined, then `char` must be ASCII in the range 0-127.
R..
@Alok, interesting, though I'm not sure how a language standard can influence a character encoding...
Joe
@R., you can't assume `wchar_t` to be UCS-4 because it isn't on Windows with Microsoft's compilers, which, like it or not, represent a huge percentage of all non-embedded machines you will come across. (`wchar_t` is UTF-16LE, IIRC, on MS C.)
RBerteig
A: 

In most character encodings, all of the digits are placed consecutively in the character set. In ASCII for example, they start with '0' at 0x30 ('1' is 0x31, '2' is 0x32, etc.). If you want the numeric value of a given digit, you can just subtract '0' from it and get the right value. The advantage of using '0' instead of the specific value is that your code can be portable to other character sets with much less effort.

Carl Norum
It's a requirement of the C standard that the encoding used gives the digits '0' through '9' contiguous values.
Martin York
A: 

If you access a character string by its characters, you'll get the ASCII values back, even if the characters happen to be digits.

Fortunately the guys who designed that character table made sure that the characters for 0 to 9 are sequential, so you can simply convert from ASCII to a number by subtracting the ASCII-value of '0'.

That's what the code does. I have to admit that it is confusing when you see it the first time, but it's not rocket science.

The ASCII-character value of '0' is 48, '1' is 49, '2' is 50 and so on.

For reference here is a nice ASCII-chart:

http://www.sciencelobby.com/ascii-table/images/ascii-table1.gif

Nils Pipenbrinck