
I'm a beginning programmer reading K&R, and I feel as if the book assumes a lot of previous knowledge. One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable? I'm not too sure of how to word this question... but I'll ask a few questions and perhaps someone can come up with a coherent answer for me.

When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values. Since we may need the variable to hold the EOF value, we will need more than 256 or the EOF value will overlap with one of the 256 characters. In my mind, I view this as a bunch of boxes with empty holes. Could someone give me a better representation? Do these "boxes" have index numbers? When EOF overlaps with a value in the 256 available values, can we predict which value it will overlap with?

Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?

Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times? Is it to save "memory" (I use quotes as I do not actually know how "memory" works)?

Lastly, how exactly are the 256 available values of type char obtained? I read something about modulo 2^n, where n = 8, but why does that work (something to do with binary?). What does the "modulo" portion of "modulo 2^n" mean (if it has any relevance to modular arithmetic, I can't see the relation...)?

+2  A: 

G'day,

To go deeper, I'd highly recommend Charles Petzold's excellent book "Code".

It covers more than what you ask, all of which leads to a better understanding of what's actually happening under the covers.

HTH

Rob Wells
A: 

Really, datatypes are an abstraction that allows your programming language to treat a few bytes at some address as some kind of numeric type. Consider the data type as a lens that lets you see a piece of memory as an int, or a float. In reality, it's all just bits to the computer.

dicroce
+4  A: 

One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable?

At the machine level, the difference between int and char is only the size, or number of bytes, of the memory allocated for it by the programming language. In C, a char is one byte while an int is typically 4 bytes (the exact size varies by platform). If you were to "look" at these inside the machine itself, you would see a sequence of bits for each. Being able to treat them as int or char depends on how the language decides to interpret them (this is also why it's possible to convert back and forth between the two types).

When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values.

This is because there are 2^8, or 256, combinations of 8 bits (because a bit can have two possible values), whereas there are 2^32 combinations of 32 bits. The EOF constant (as defined by C) is a negative value, not falling within the range 0 to 255. If you try to assign this negative value to a char (thus squeezing its 4 bytes into 1), the higher-order bits will be lost and you will end up with a valid char value that is NOT the same as EOF. This is why you need to store the result in an int and check it before casting to a char.

Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?

Yes, especially since in that case you are assigning a character literal.

Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times?

Most importantly, you would pick int or char at the language level depending on whether you wanted to treat the variable as a number or a letter (to switch, you would need to cast to the other type). If you wanted an integer value that took up less space, you could use a short int (which I believe is 2 bytes), or if you were REALLY concerned with memory usage you could use a char, though mostly this is not necessary.

Edit: here's a link describing the different data types in C and modifiers that can be applied to them. See the table at the end for sizes and value ranges.

danben
Nitpick: to handle characters you'd stay the hell away from char and use a higher-level abstraction from a library like GLib.
Tobu
Sure, but I still think it's important to understand what's actually going on at the lower levels.
danben
In C, an `int` can be 4 bytes, or more, or less. `int` must be able to represent values between `-32767` and `+32767`.
Alok
Also, in C, `char` may be signed, in which case it *can* store `EOF`, but of course `char` may be unsigned as well, and that's why we use `int` in this case.
Alok
Alok: However, if `char` is signed then it can't store all the valid return values of `getchar` (which are either `unsigned char` values, or `EOF`), so even then you're better off using `int`.
caf
+6  A: 

Great questions. K&R was written back in the days when there was a lot less to know about computers, and so programmers knew a lot more about the hardware. Every programmer ought to be familiar with this stuff, but (understandably) many beginning programmers aren't.

At Carnegie Mellon University they developed an entire course to fill in this gap in knowledge, which I was a TA for. I recommend the textbook for that class: "Computer Systems: A Programmer's Perspective" http://amzn.com/013034074X/

The answers to your questions are longer than can really be covered here, but I'll give you some brief pointers for your own research.

Basically, computers store all information--whether in memory (RAM) or on disk--in binary, a base-2 number system (as opposed to decimal, which is base 10). One binary digit is called a bit. Computers tend to work with memory in 8-bit chunks called bytes.

A char in C is one byte. An int is typically four bytes (although it can be different on different machines). So a char can hold only 256 possible values, 2^8. An int can hold 2^32 different values.

For more, definitely read the book, or look up a few Wikipedia pages (binary numeral system, byte, two's complement).

Best of luck!

UPDATE with info on modular arithmetic as requested:

First, read up on modular arithmetic: http://en.wikipedia.org/wiki/Modular_arithmetic

Basically, in a two's complement system, an n-bit number really represents an equivalence class of integers modulo 2^n.

If that seems to make it more complicated instead of less, then the key things to know are simply:

  • An unsigned n-bit number holds values from 0 to 2^n-1. The values "wrap around", so e.g., when you add two numbers and get 2^n, you really get zero. (This is called "overflow".)
  • A signed n-bit number holds values from -2^(n-1) to 2^(n-1)-1. Numbers still wrap around, but the highest number wraps around to the most negative, and it starts counting up towards zero from there.

So, an unsigned byte (8-bit number) can be 0 to 255. 255 + 1 wraps around to 0. 255 + 2 ends up as 1, and so forth. A signed byte can be -128 to 127. 127 + 1 ends up as -128. (!) 127 + 2 ends up as -127, etc.

jasoncrawford
Thanks! Could you explain the "modulo" portion of 2^n?
withchemicals
I would rather have said "back in the days when programming was a lot lower level, closer to the hardware, so learning programming quickly required (and resulted in) a good basic understanding of the underlying hardware".
Software Monkey
Software Monkey: well-said, I think that's more exact than what I wrote.
jasoncrawford
withchemicals: Updated answer with some info on modular arithmetic; hope this helps.
jasoncrawford
A: 

According to stdio.h, getchar()'s return value is int, and EOF is defined as -1. Depending on the actual encoding, all values between 0..255 can occur; therefore unsigned char is not enough to also represent the -1, and int is used. Here is a nice table with detailed information: http://en.wikipedia.org/wiki/ISO/IEC_8859

stacker
A: 
  • In C, EOF is a "small negative number".
  • In C, char type may be unsigned, meaning that it cannot represent negative values.
  • For unsigned types, when you try to assign a negative value to them, it is converted to an unsigned value. The conversion works by repeatedly adding MAX + 1 (where MAX is the maximum value the unsigned type can hold) until the result is in range; equivalently, assigning -n to such a type yields (MAX + 1) - (n % (MAX + 1)), reduced modulo MAX + 1. So, to answer your specific question about predicting, "yes you can". For example, let's say char is unsigned, and can hold values 0 to 255 inclusive. Then assigning -1 to a char is equivalent to assigning 256 - 1 = 255 to it.

Given the above, to be able to store EOF in c, c can't be of char type. Thus, we use int, because it can store "small negative values". In particular, in C, int is guaranteed to store values in the range -32767 to +32767. That is why getchar() returns int.

Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?

If you are assigning values directly, then the C standard guarantees that expressions like 'a' will fit in a char. Note that in C, 'a' is of type int, not char, but it's okay to do char c = 'a', because 'a' is able to fit in a char type.

About your question as to what type a variable should hold, the answer is: use whatever type that makes sense. For example, if you're counting, or looking at string lengths, the numbers can only be greater than or equal to zero. In such cases, you should use an unsigned type. size_t is such a type.

Note that it is sometimes hard to figure out the type of data, and even the "pros" may make mistakes. The gzip format, for example, stores the size of the uncompressed data in the last 4 bytes of a file. This breaks for huge files > 4GB in size, which are fairly common these days.

You should be careful about your terminology. In C, char c = 'a' assigns the integer value corresponding to 'a' to c, but it need not be ASCII. It depends upon whatever encoding you happen to use.

About the "modulo" portion, and 256 values of type char: if you have n binary bits in a data type, each bit can encode 2 values: 0 and 1. So, you have 2*2*2...*2 (n times) available values, or 2^n. For unsigned types, any overflow is well-defined: it is as if you divided the number by (the maximum possible value+1), and took the remainder. For example, let's say unsigned char can store values 0..255 (256 total values). Then, assigning 257 to an unsigned char will basically divide it by 256, take the remainder (1), and assign that value to the variable. This relation holds true for unsigned types only though. See my answer to another question for more.

Finally, you can use char arrays to read data from a file in C, even though you might end up hitting EOF, because C provides other ways of detecting EOF without having to read it in a variable explicitly, but you will learn about it later when you have read about arrays and pointers (see fgets() if you're curious for one example).

Alok
A: 

The beauty of K&R is its conciseness and readability; writers always have to make concessions for their goals, and rather than being a 2000-page reference manual it serves as a basic reference and an excellent way to learn the language in general. I recommend Harbison and Steele's "C: A Reference Manual" for an excellent C reference book for details, and the C standard of course.

You need to be willing to google this stuff. Variables are represented in memory at specific locations and are known to the program of which they are a part within a given scope. A char will typically be stored in 8 bits of memory (on some rare platforms this isn't necessarily true). 2^8 represents 256 distinct possibilities for its values. Different CPUs/compilers/etc. represent the basic types int and long with varying sizes. I think the C standard specifies minimum sizes for these, but not maximum sizes; for double it requires at least 64 bits of precision, but this doesn't preclude Intel from using 80 bits in a floating-point unit. Anyway, typical sizes in memory on 32-bit Intel platforms would be 32 bits (4 bytes) for unsigned/signed int and float, 64 bits (8 bytes) for double, and 8 bits for signed/unsigned char. You should also look up memory alignment if you are really interested in the topic. You can also look at the exact layout in your debugger by getting the address of your variable with the "&" operator and then peeking at that address. Intel platforms may confuse you a little when looking at values in memory, so please look up little endian/big endian as well. I am sure Stack Overflow has some good summaries of this too.

dudez
+2  A: 

Basically, system memory is one huge series of bits, each of which can be either "on" or "off". The rest is conventions and interpretation.

First of all, there is no way to access individual bits directly; instead they are grouped into bytes, usually in groups of 8 (there are a few exotic systems where this is not the case, but you can ignore that for now), and each byte gets a memory address. So the first byte in memory has address 0, the second has address 1, etc.

A byte of 8 bits has 2^8 possible different values, which can be interpreted as a number between 0 and 255 (unsigned byte), or as a number between -128 and +127 (signed byte), or as an ASCII character. A variable of type char per the C standard has a size of exactly 1 byte.

But bytes are too small for a lot of things, so other types have been defined that are larger (i.e. they consist of multiple bytes), and CPUs support these different types through special hardware constructs. An int is typically 4 bytes nowadays (though the C standard does not specify it and ints can be smaller or bigger on different systems) because 4 bytes are 32 bits, and until recently that was what mainstream CPUs supported as their "word size".

So a variable of type int is 4 bytes large. That means when its memory address is e.g. 1000, then it actually covers the bytes at addresses 1000, 1001, 1002, and 1003. In C, it is also possible to address those individual bytes separately (for example through a char pointer or a union), and that is how variables can overlap.

As a sidenote, most systems require larger types to be "word-aligned", i.e. their addresses have to be multiples of the word size, because that makes things easier for the hardware. So it is not possible to have an int variable start at address 999, or address 17 (but 1000 and 16 are OK).

Michael Borgwardt
Again, `int` may be 4 bytes or 2 or even 1, or anything. It must be able to represent the range +-32767.
Alok
wouldn't it be 2^7 and not 2^8?
0A0D
@Alok, yes that's what I say one paragraph higher. @Roboto: Nope. 8 bits means 2^8 different values. One bit has 2 values (2^1), each additional bit doubles this.
Michael Borgwardt
+1  A: 

I'm not going to completely answer your question, but I would like to help you understand variables, as I had the same problems understanding them when I began to program by myself.

For the moment, don't bother with the electronic representation of variables in memory. Think of memory as a continuous block of 1-byte cells, each storing a bit pattern (consisting of 0s and 1s).

By solely looking at the memory, you can't determine what the bits in it represent! They are just arbitrary sequences of 0s and 1s. It is YOU who specifies HOW to interpret those bit patterns! Take a look at this example:

int a, b, c;
...
c = a + b;

You could have written the following as well:

float a, b, c;
...
c = a + b;

In both cases, the variables a, b and c are stored somewhere in memory (and you can't tell their type just by looking at it). Now, when the compiler compiles your code (that is, translates your program into machine instructions), it makes sure to translate the "+" into integer_add in the first case and float_add in the second case, so the CPU will interpret the bit patterns correctly and perform what you desired.

Variable types are like glasses that let the CPU look at bit patterns from different perspectives.

Dave