In http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.6, it is written that
"Another valid approach would be to define a "byte" as 9 bits, and simulate a char* by two words of memory: the first could point to the 36-bit word, the second could be a bit-offset within that word. In that case, the C++ compiler would need to add extra instructions when compiling code using char* pointers."

I couldn't understand what is meant by "simulating a char* by two words", or the rest of the quote.
Could somebody please explain it with an example?

+1  A: 
data: [char1|char2|char3|char4]

To access char1:

ptrToChar = &data
index = 0

To access char2:

ptrToChar = &data
index = 9

To access char3:

ptrToChar = &data
index = 18

...

then to access a char, you would:

(*ptrToChar >> index) & 0x001ff

Here 0x1ff masks off the low 9 bits. ptrToChar and index would be stored together in some sort of structure that the compiler creates, so they stay associated with each other.
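
The shift-and-mask read above could look roughly like this in C++. This is only a sketch on an ordinary machine: the 36-bit word is modeled as the low 36 bits of a uint64_t, the names Word, BytePtr and load are made up for illustration, and offset 0 is assumed to mean the lowest-order 9 bits.

```cpp
#include <cassert>
#include <cstdint>

// One simulated 36-bit machine word, held in the low 36 bits of a uint64_t.
using Word = std::uint64_t;

// Hypothetical two-part char pointer: which word, and the bit offset inside it.
struct BytePtr {
    const Word* word;    // address of the 36-bit word
    unsigned bitOffset;  // 0, 9, 18 or 27
};

// Dereference: shift the wanted 9-bit byte down to bit 0, then mask it off.
inline unsigned load(BytePtr p) {
    assert(p.bitOffset % 9 == 0 && p.bitOffset < 36);
    return static_cast<unsigned>((*p.word >> p.bitOffset) & 0x1FFu);
}
```

With a word named data, load(BytePtr{&data, 9}) returns the 9 bits that start at bit 9, which corresponds to the index = 9 case above.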

Adam Shiemke
+2  A: 

Since the C++ spec says that a char* must point to individual bytes, and the PDP-6/10 does not allow addressing individual bytes in a word, you have a problem with char* (which is a byte pointer) on the PDP-6/10.

So one workaround is: define a byte as 9 bits, then you essentially have 4 bytes in a word (4 * 9 = 36 bits = 1 word).

You still can't have char* point to individual bytes on the PDP-6/10, so instead have char* be made up of two 36-bit words. The lower word would be the actual address, and the upper word would be some bytemask magic that the C++ compiler could use to point to the right 9 bits in the lower word.

In this case,

sizeof(int*) (36 bits) is different from sizeof(char*) (72 bits).

It's just a contrived example that shows how the spec doesn't constrain primitives to specific bit/byte sizes.
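
To make the size difference concrete, here is a rough model rather than real PDP-10 code. WordPtr and BytePtr are hypothetical stand-ins for int* and char*, and each 36-bit machine word is modeled as one uint64_t.

```cpp
#include <cstdint>

// Hypothetical stand-ins: one 36-bit machine word is modeled as one uint64_t.
struct WordPtr {            // plays the role of int*
    std::uint64_t address;  // which word
};

struct BytePtr {            // plays the role of char*
    std::uint64_t address;  // which word
    std::uint64_t offset;   // which 9 bits inside that word
};

// The byte pointer occupies two "words", the word pointer only one.
static_assert(sizeof(BytePtr) == 2 * sizeof(WordPtr),
              "char* stand-in is twice the size of the int* stand-in");
```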

Alan
A: 

Suppose a PDP-10 implementation wanted to get as close to having 8-bit bytes as possible. The most reasonable way to split up a 36-bit word (the smallest unit of memory that the machine's assembly language can address) is to divide it into four 9-bit bytes. To access a particular 9-bit byte, you need to know which word it's in (you'd use the machine's native addressing mode for that, using a pointer which takes up one word), and you need extra data to indicate which of the 4 bytes inside the word is the one you're interested in. This extra data would be stored in a second machine word. The compiler would generate extra instructions that use the second word to pull the right byte out of the word the first one points to.
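
Writes are where the extra instructions become most visible, because the other three bytes in the word have to be preserved. A minimal sketch of that read-modify-write follows; storeByte and Word are invented names, and the word is again modeled as the low 36 bits of a uint64_t.

```cpp
#include <cassert>
#include <cstdint>

using Word = std::uint64_t;  // low 36 bits model one machine word

// Hypothetical helper: the work behind `*p = value` for a 9-bit char.
// slot is 0..3 and selects which of the four bytes in the word to overwrite.
inline void storeByte(Word& word, unsigned slot, unsigned value) {
    assert(slot < 4 && value <= 0x1FFu);
    const unsigned shift = slot * 9;
    const Word mask = Word{0x1FF} << shift;
    word = (word & ~mask)                        // clear the old byte
         | (static_cast<Word>(value) << shift);  // merge in the new one
}
```

A single char assignment therefore turns into a load, a mask, a shift, an OR and a store in this model.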

Ken Bloom
+3  A: 

I think this is what they were describing:

The PDP-10 referenced in the second paragraph had 36-bit words and was unable to address anything inside of those words. The following text is a description of one way that this problem could have been solved while fitting within the restrictions of the C++ language spec (that are included in the first paragraph).

Let's assume that you want to make 9-bit-long bytes (for some reason). By the spec, a char* must be able to address individual bytes. The PDP-10 can't do this, because it can't address anything smaller than a 36-bit word.

One way around the PDP-10's limitations would be to simulate a char* using two words of memory. The first word would be a pointer to the 36-bit word containing the char (this is normally as precise as the PDP-10's pointers allow). The second word would indicate an offset (in bits) within that word. Now, the char* can access any byte in the system and complies with the C++ spec's limitations.

ASCII-art visual aid:

| Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Byte 7 | Byte 8 |
-------------------------------------------------------------------------
|               Word 1              |               Word 2              |
|              (Address)            |              (Offset)             |
-------------------------------------------------------------------------

Say you had a char* with word1 = 0x0100 and word2 = 0x12. This would point to bit offset 18 (the start of the third byte) of the word at address 256.

If this technique were really used to produce a conforming C++ implementation on the PDP-10, then the C++ compiler would have to do some extra work juggling the extra bits required by this rather funky internal format.
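
One piece of that extra work is pointer arithmetic. The sketch below shows what incrementing such a char* might involve, starting from the word1 = 0x0100, word2 = 0x12 example. FatCharPtr and next are hypothetical names, and the field widths are arbitrary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-word char pointer, mirroring word1/word2 above.
struct FatCharPtr {
    std::uint32_t word1;  // address of the 36-bit word
    std::uint32_t word2;  // bit offset within that word: 0, 9, 18 or 27
};

// What `++p` would have to do: step 9 bits, and roll over into the next
// word once the offset runs past the end of the current one.
inline FatCharPtr next(FatCharPtr p) {
    assert(p.word2 % 9 == 0 && p.word2 < 36);
    p.word2 += 9;
    if (p.word2 == 36) {
        p.word2 = 0;
        ++p.word1;
    }
    return p;
}
```

Starting at {0x0100, 0x12}, one step gives {0x0100, 27} and the next step moves on to {0x0101, 0}.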

The whole point of that article is to illustrate that a char isn't always 8 bits. It is at least 8 bits, but there is no defined maximum. The internal representation of data types is dependent on the platform architecture and may be different than what you expect.
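
Portable code can ask the implementation how wide a char is instead of assuming 8. CHAR_BIT from <climits> is standard; bitsIn is just an illustrative helper name. On the hypothetical 9-bit implementation described above, CHAR_BIT would presumably be 9.

```cpp
#include <climits>

// CHAR_BIT is the number of bits in a char on the current platform.
// The standard guarantees a lower bound of 8 but no upper bound.
static_assert(CHAR_BIT >= 8, "a char has at least 8 bits");

// Bits in any object type, expressed in terms of char.
template <typename T>
constexpr unsigned bitsIn() {
    return static_cast<unsigned>(sizeof(T) * CHAR_BIT);
}
```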

bta
Happy Mittal
+1 for ASCII-art! (Also, good explanation)
Bill
Nice diagram :D
Alan
`Word1` is a pointer to a 36-bit word. `Word2` is a bit offset within that word (thus, `Word2` is between 0 and 35). A pointer stored internally as `word1=0x100, word2=0x00` would be the same thing as a traditional C++ `char*` declared as `char* ptr = 0x100;`, a pointer stored as `word1=0x100, word2=0x09` would be the same as `char* ptr = 0x101;`, etc etc.
bta
@Happy Mittal- Setting `word2=0x00` will give you the first byte of that 36-bit word. Whether that corresponds to Byte1 or Byte4 on the diagram depends on the endian-ness of the architecture.
bta