views:

693

answers:

7

I used to think that each memory location contains 8, 16, 32 or 64 bits. So 0101 would be stored in an 8 bit machine as 00000101 (sign extended if it was negative). This was all fine and dandy until I wrote a program in java out of curiosity to find out some more inner workings of this system.

The method in question looks like this:

public void printBinaryRep(File f){
     try{
      FileInputStream inputStream = new FileInputStream(f);
      int next = 0;
      byte b = 0;
      while((next = inputStream.read()) != -1){
       b = (byte)next;
       System.out.println((char)next + " : "+Integer.toBinaryString(next));
      }
      inputStream.close();
     }
     catch(Exception e){System.out.println(e);}
 }

I got this output from a file that says Hello World

H : 1001000
e : 1100101
l : 1101100
l : 1101100
o : 1101111
  : 100000
W : 1010111
o : 1101111
r : 1110010
l : 1101100
d : 1100100

All of it looks fine except for the space. It has 6 bits instead of 8. I'm now wondering how all of that information is stored in memory. If all of it was stored in 8 bit chunks, like

Hello: 10010001100101110110011011001101111

Then you can simply look at each 8 bit chunk and figure out what number it's representing (and then what ASCII code it's referring to). How does it work when a different sized character (like the 6 bit space and the 4 bit /n ) is stored along with them?? Then wouldn't storing a small number in a large bit space waste a lot of bits?

I think I have some of the fundamental understanding wrong (or maybe the program's wrong somewhere...). Sorry if the question sounds strange or too un-necessarily in-depth. I just want to know. I've done some googling, but it didn't come up with anything relevent. If you can let me know where I've gone wrong or point me in the right direction, I'd greatly appreciate it. Thanks!

+2  A: 

According to the Java 4 API,

The unsigned integer value is the argument plus 232 if the argument is negative; otherwise it is equal to the argument. This value is converted to a string of ASCII digits in binary (base 2) with no extra leading 0s.

In reality, data storage is actually much more complicated. For efficiencies in processing, most data types are stored at word-boundaries, which means 4 bytes on 32-bit machines, or 8 bytes on 64-bit machines. Arrays may be packed more closely, so that char [4] may end up using the same amount of "actual space" as char.

Java is a virtual machine, and I'm not certain what memory architecture, if any, it uses.

dcrosta
+5  A: 

The space has 8 bits too. It's just that Integer.toBinaryString doesn't print leading 0 bits the way you used it.

With all the leading 0 bits, it actually looks like this in memory:

H : 01001000
e : 01100101
l : 01101100
l : 01101100
o : 01101111
  : 00100000
W : 01010111
o : 01101111
r : 01110010
l : 01101100
d : 01100100
Stephen Canon
+8  A: 

You'll be better off experimenting in C and/or assembly, rather than Java. Those languages are lower-level and expose the address space directly.

I used to think that each memory location contains 8, 16, 32 or 64 bits. So 0101 would be stored in an 8 bit machine as 00000101 (sign extended if it was negative). This was all fine and dandy until I wrote a program in java out of curiosity to find out some more inner workings of this system.

All memory locations in x86 systems contain 8 bits (1 byte). If a value contains more data than can fit into a single byte, it is stored using multiple bytes. For example, in C, the "float" type is stored using 4 bytes (32 bits).

All of it looks fine except for the space. It has 6 bits instead of 8. I'm now wondering how all of that information is stored in memory. If all of it was stored in 8 bit chunks, like

The space is also stored in a single byte. Your print code is forgetting to pad out to 8 spaces. 100000 == 00100000 == 0x20.

John Millikin
I completely agree with the C/ASM approach.
Paul Nathan
Thanks John, That makes more sense!I'm learning C, but I still tend to code for fun in Java since I'm more comfortable with it.
Cobalt
Most modern computers have 8 bits per memory location. In the old days, I worked on systems with 60 bits (CDC supercomputers) and there were systems with 1 (Burroughs).
David Thornley
+3  A: 

Your original intuition was (mostly) correct: all memory locations consist of the same number of bits. On all modern machines, there are eight bits in a "byte", where a byte is the smallest chunk of memory that the machine can access individually.

Look closely at your output. You have seven digits in all of them except the space. The space just happens to begin with two zeroes in its binary representation, while the other letters begin with one.

JSBangs
+2  A: 

Actually your approach is wrong. Encoding is very important here.

If you use ASCII then you can easily say that each character is stored in a byte (eight bits) but when encoding changes you cannot say that.

Eg: UTF-8 uses one to three bytes (8 to 24 bits) for each character on a string. That is why you will see an overload in which you can specify the encoding on inputstream object.

Choosing wrong input stream will absolutely cause a wrong string output. Thus you have to know the encoding of the file to understand which bit means what. Actually fileinputstream does this for you.

If you store a digit as string it will take a char length in hard drive. Just like another character.

However if you store 123456789 as string with ASCII encoding it will take 9*8 bits = 72 bits.

If you store this as integer, (note that integer's data width differs in different environments) it will only take 16 bits.

Also you cannot be sure that

H : 01001000
e : 01100101
l : 01101100
l : 01101100
o : 01101111
  : 00100000
W : 01010111
o : 01101111
r : 01110010
l : 01101100
d : 01100100
\n: 00001010

is stored in hard drive as H : 01001000 e : 01100101 l : 01101100 l : 01101100 o : 01101111 : 00100000 W : 01010111 o : 01101111 r : 01110010 l : 01101100 d : 01100100 \n: 00001010

You cannot be sure of that. File System is not that simple. Maybe Hello is successive but World string is at the end of drive. Thats why there is defrag command.

But if we talk about main memory (RAM) when you define a string i expect bits to be successive. At least in C it is. You define a string like that.

char[100] value; // c is a char array. (there is no string type in c)

here value[0] is the first character of our string. And value only addresses to the char arrays location in memory.

if value[0]'s address is 10 then value[1]'s address is 10+8 = 18.

JCasso
+2  A: 

The way computers store numbers can be compared to an odometer in a car. If the odometer has 4 digits, it stores the number 33 as "0033".

If someone asks you what your mileage is, you aren't going to say "zero thousand zero hundred and thirty three". By default, Java doesn't either. (Although you can tell it to.)

Then wouldn't storing a small number in a large bit space waste a lot of bits?

Well, not really. Suppose you had 11000100 in memory somewhere. How is the computer supposed to know whether this means 11000100, or 11000 followed by 100, or 1 followed by 1000 followed by 100, and so on?

Well, actually the computer is just following the program it is given (remember that a Java program is created partly by you and partly by the people who design Java). If you can create a viable system for saving bits, you can make the computer do it.

However, keep in mind that there's a tradeoff in terms of processor usage and programming difficulty. Since a typical computer can work with bytes much more quickly than it can with say, 7-bit or variable-bit numbers, storing ASCII codes in bytes is a very common choice for storing text.

But let me return to your question.

Then wouldn't storing a small number in a large bit space waste a lot of bits?

Mathematically speaking, no. A branch of mathematics called Information Theory tells us that the number of bits which are absolutely necessary depends on the possibilities you want to encode and how likely each of them is.

Let's suppose you have only a four letter alphabet (A, B, C, D), and use two-bit numbers (00, 01, 10, 11 respectively) to represent it. If each of these letters is equally likely, then the minimum number of bits required per letter (on average) is 2. In other words, there are no wasted bits even though A is 00 and B is 01.

On the other hand, if you use ASCII and encode A, B, C, D as the following 7-bit numbers:

A: 1000001
B: 1000010
C: 1000011
D: 1000100

then you are "wasting" 5 bits per letter (even though you're not "storing small numbers in a large bit space").

These sorts of considerations are important when designing compression algorithms, and not so important for everday applications. It's certainly important to understand bits and bytes if you wish to learn C.

Artelius
+1  A: 

That clears it up. My main problem was that I was overlooking the zeroes in the beginning. I was experimenting with this as I was reading more about compression algorithms (namely, gzip) I was assuming ASCII for all of this. Seeing the representation was not the goal of the program, but the different number of bits per word threw me off from the original goal of implementing a basic, index based compression for a file type I'm working on. I'll try to rewrite it in C once I have a proof of concept in Java.

Thanks!

Cobalt