The usual answers to why data must be aligned are that it allows more efficient access and simplifies the design of the CPU.

A relevant question and its answers are here. And another source is here. But neither resolves my question.

Suppose a CPU has an access granularity of 4 bytes. That means the CPU reads 4 bytes at a time. The material I listed above says that if I access misaligned data, say at address 0x1, then the CPU has to do 2 accesses (one from addresses 0x0, 0x1, 0x2 and 0x3, and one from addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I can't see why. Why can't the CPU just read data from 0x1, 0x2, 0x3 and 0x4 when I ask it to access address 0x1? It will not degrade the performance and incur much complexity in circuitry.

Thank you in advance!

A: 

In my opinion that's a very simplistic assumption. The circuitry may involve many layers of pipelining and caching optimisation to ensure that particular bits of memory are read. Also, the memory reads are delegated to memory subsystems that may be built from components with orders-of-magnitude differences in performance and design complexity from what you imagine.

However, I'll add the caveat that I'm not a CPU or memory designer, so I could be talking a crock.

Preet Sangha
A: 

The answer to your question is in the question itself.

The CPU has access granularity of 4 bytes. So it can only slurp up data in chunks of 4 bytes.

If you had accessed the address 0x0, the CPU would give you the 4 bytes from 0x0 to 0x3.

When you issue an instruction to access data from address 0x1, the CPU takes that as a request for 4 bytes of data starting at 0x1 (i.e. 0x1 to 0x4). This can't be interpreted in any other way, essentially because of the granularity of the CPU. Hence, the CPU slurps up data from 0x0 to 0x3 and from 0x4 to 0x7 (ergo, 2 accesses), then puts the data from 0x1 to 0x4 together as the final result.
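The combine step can be sketched in C. This is a hypothetical little-endian model of a memory with 4-byte granularity; the `ram` contents and the function names are made up for illustration, not taken from any real CPU:

```c
#include <assert.h>
#include <stdint.h>

/* 8 bytes of simulated RAM. */
static const uint8_t ram[8] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77};

/* The only primitive the "CPU" has: read 4 bytes at a 4-byte-aligned
   address (assembled little-endian, low byte first). */
static uint32_t aligned_read32(uint32_t addr)
{
    assert(addr % 4 == 0);   /* the CPU's access granularity */
    return (uint32_t)ram[addr]
         | (uint32_t)ram[addr + 1] << 8
         | (uint32_t)ram[addr + 2] << 16
         | (uint32_t)ram[addr + 3] << 24;
}

/* A misaligned load = two aligned loads, shifted and merged.
   Assumes addr really is misaligned (shift != 0), since a shift
   by 32 would be undefined behaviour in C. */
static uint32_t misaligned_read32(uint32_t addr)
{
    unsigned shift = (addr % 4) * 8;                    /* byte offset in bits */
    uint32_t lo = aligned_read32(addr - addr % 4);      /* e.g. 0x0 to 0x3 */
    uint32_t hi = aligned_read32(addr - addr % 4 + 4);  /* e.g. 0x4 to 0x7 */
    return (lo >> shift) | (hi << (32 - shift));
}
```

So a load at 0x1 costs two calls to `aligned_read32` plus some shifting, which mirrors the "2 accesses, then combine" description above.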

Kedar Soparkar
This doesn't even begin to address WHY the CPU can "slurp" bytes 0-3 at the same time but not 1-4.
Ben Voigt
+3  A: 

It will not degrade the performance and incur much complexity in circuitry.

My, but aren't you the expert?

Your comment in the other question used much more appropriate wording ("I don't think it would degrade"...)

Did you consider that the memory architecture uses many memory chips in parallel in order to maximize the bandwidth? A particular data item lives in only one chip; you can't just read whichever chip happens to be most convenient and expect it to have the data you want.

Right now, the CPU and memory can be wired together such that bits 0-7 are wired only to chip 0, bits 8-15 to chip 1, bits 16-23 to chip 2, and bits 24-31 to chip 3. And for all integers N, memory location 4N is stored in chip 0, 4N+1 in chip 1, and so on, each at offset N within its chip.

Let's look at the memory addresses stored at each offset of each memory chip:

memory chip       0       1       2       3
offset

    0             0       1       2       3
    1             4       5       6       7
    2             8       9      10      11
    N            4N    4N+1    4N+2    4N+3



So if you load from memory bytes 0-3, N=0, each chip reports its internal byte 0, the bits all end up in the right places, and everything is great.

Now, if you try to load a word starting at memory location 1, what happens?

Let's look at the way it is done. First, memory bytes 1-3, which are stored in memory chips 1-3 at offset 0, end up in bits 8-31, because that's where those memory chips are attached, even though you asked for them to be in bits 0-23. This isn't a big deal, because the CPU can swizzle them internally, using the same circuitry used for logical shift left. Then, on the next transaction, memory byte 4, which is stored in memory chip 0 at offset 1, gets read into bits 0-7 and swizzled into bits 24-31, where you wanted it to be.

Notice something here. The word you asked for is split across offsets: the first memory transaction read from offset 0 of three chips, and the second memory transaction read from offset 1 of the remaining chip. Here's where the problem lies. You have to tell the memory chips the offset so they can send you the right data back, and the offset is ~40 bits wide and the signals are VERY high speed. Right now there is only one set of offset signals (called the address bus, BTW) that connects to all the memory chips; to do a single transaction for an unaligned memory access, you would need an independent address bus running to each memory chip. For a 64-bit processor, you'd go from one address bus to eight, an increase of almost 300 pins. In a world where CPUs use between 700 and 1300 pins, this can hardly be called "not much increase in circuitry". Not to mention the huge increase in noise and crosstalk from that many extra high-speed signals.
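The chip/offset arithmetic from the table can be written down in a couple of lines. A small C sketch (the helper names are mine, not from the answer) showing that an aligned 4-byte load touches one chip offset, while a load at 0x1 touches two:

```c
/* For the 4-chip layout in the table: the byte at address A lives in
   chip A % 4, at offset A / 4 inside that chip. */
static unsigned chip_of(unsigned addr)   { return addr % 4; }
static unsigned offset_of(unsigned addr) { return addr / 4; }

/* Count the distinct chip offsets touched by a 4-byte load at addr.
   One distinct offset means one memory transaction; two distinct
   offsets force two transactions on a shared address bus. */
static unsigned transactions_for_load(unsigned addr)
{
    return offset_of(addr) == offset_of(addr + 3) ? 1u : 2u;
}
```

A load at 0x0 spans addresses 0-3, all at offset 0, so one transaction; a load at 0x1 spans addresses 1-4, and address 4 sits at offset 1 in chip 0, so two transactions are needed.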

OK, it isn't quite that bad, because there can only be a maximum of two different offsets on the address bus at once, and one is always the other plus one. So you could get away with one extra wire to each memory chip, saying in effect either "read the offset listed on the address bus" or "read the offset following it", which is two states. But now there's an extra adder in each memory chip, which means it has to calculate the offset before actually doing the memory access, and that slows down the maximum clock rate for memory. In other words, aligned access gets slower if you want unaligned access to be faster. Since 99.99% of accesses can be made aligned, this is a net loss.
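That "plus one" selection can be described in a few lines of C. This is a sketch of the idea only, not real hardware, and `chip_offset` is a name I made up:

```c
/* For a 4-byte load starting at addr, chip k must supply the byte at
   the unique address in [addr, addr+3] that is congruent to k mod 4.
   Chips whose index is below addr % 4 hold bytes from the NEXT group
   of four, so they need the broadcast offset plus one. */
static unsigned chip_offset(unsigned addr, unsigned chip)
{
    unsigned base = addr / 4;             /* offset broadcast on the shared bus */
    unsigned plus_one = chip < addr % 4;  /* the hypothetical extra wire, 0 or 1 */
    return base + plus_one;               /* the per-chip adder the answer describes */
}
```

For a load at 0x1, chips 1-3 read their offset 0 while chip 0 reads offset 1; for an aligned load at 0x0, every chip reads the same broadcast offset and no adder is needed.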

So that's why unaligned access gets split into two steps. Because the address bus is shared by all the bytes involved. And this is actually a simplification, because when you have different offsets, you also have different cache lines involved, so all the cache coherency logic would have to double to handle twice the communication between CPU cores.

Ben Voigt
Of course I'm not an expert. Sorry! I will watch my words from now on. Thank you for your answer; I think this is just what I was seeking.
wbb