Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align their members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.

And if this is only CPU-related, why is a double 4-byte aligned on Linux and 8-byte aligned on Windows?
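For concreteness, here is a minimal probe (a sketch in C; the struct name is illustrative and the output is compiler- and ABI-dependent) that reveals the alignment the compiler chose for double - 32-bit Linux/GCC typically prints 4, while MSVC on Windows typically prints 8:

    #include <stdio.h>
    #include <stddef.h>

    struct Probe {
        char   c;   /* forces padding before d if d must be aligned */
        double d;   /* its offset reveals the chosen alignment      */
    };

    int main(void) {
        /* offset 4 => 4-byte alignment; offset 8 => 8-byte alignment */
        printf("offsetof(struct Probe, d) = %zu\n",
               offsetof(struct Probe, d));
        return 0;
    }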

+4  A: 

Alignment helps the CPU fetch data from memory efficiently: fewer cache misses/flushes, fewer bus transactions, etc.

Some memory types (e.g. RDRAM, DRAM) need to be accessed in a structured manner (aligned "words", and in "burst transactions", i.e. many words at a time) in order to yield efficient results. This is due to many factors, among them:

  1. setup time: the time it takes for the memory device to access the memory locations
  2. bus arbitration overhead, i.e. many devices might want access to the memory device

"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.


In other words, accessing a misaligned structure yields lower overall performance. A good example of this pitfall: suppose a data structure is misaligned and requires the CPU/memory controller to perform 2 bus transactions (instead of 1) to fetch it; performance is consequently lower.
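A minimal sketch (in C, with illustrative names) of the padding a typical compiler inserts so each member lands on its natural boundary; on common 32/64-bit ABIs it prints a struct size larger than the sum of the member sizes:

    #include <stdio.h>
    #include <stddef.h>

    struct Example {
        char c;   /* offset 0; 3 padding bytes usually follow       */
        int  i;   /* offset 4 on common ABIs, its natural boundary  */
    };

    int main(void) {
        /* members total 5 bytes, yet sizeof is typically 8 */
        printf("sum of members = %zu, sizeof(struct Example) = %zu\n",
               sizeof(char) + sizeof(int), sizeof(struct Example));
        printf("offsetof(struct Example, i) = %zu\n",
               offsetof(struct Example, i));
        return 0;
    }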

jldupont
so what exactly happens if, let's say, a float is 1-byte aligned?
Mat
@Mat: then, depending on "where" the "float variable" ends up being allocated in memory, efficiency in accessing this "float variable" will vary.
jldupont
but do I understand correctly that the performance of accessing a badly aligned float will be no worse than accessing a correctly aligned double?
Mat
@Mat: no: accessing a mis-aligned structure will result in **lower** overall efficiency i.e. lower performance.
jldupont
+5  A: 

The CPU fetches data from memory in groups of 4 bytes (it actually depends on the hardware; it's 8 or other values for some types of hardware, but let's stick with 4 to keep it simple). All is well if the data begins at an address divisible by 4: the CPU goes to the memory address and loads the data.

Now suppose the data begins at an address not divisible by 4 - say, for the sake of simplicity, at address 1. The CPU must fetch the data starting from address 0 and then apply some algorithm to dump the byte at address 0 in order to get at the actual data at byte 1. This takes time and therefore lowers performance, so it is much more efficient to have all data addresses aligned.
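A rough micro-benchmark sketch of this effect (in C; the buffer and function names are illustrative, memcpy keeps the misaligned reads well-defined, and the penalty varies widely by CPU - often small on modern x86, larger or fatal elsewhere):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <time.h>

    #define N (1 << 24)
    static unsigned char buf[N + 8];

    /* sum 4-byte values starting at `start` (0 = aligned, 1 = misaligned) */
    static uint32_t sum_from(size_t start) {
        uint32_t s = 0, v;
        for (size_t i = start; i + 4 <= N; i += 4) {
            memcpy(&v, buf + i, 4);
            s += v;
        }
        return s;
    }

    int main(void) {
        for (size_t i = 0; i < sizeof buf; i++)   /* fill with dummy data */
            buf[i] = (unsigned char)i;
        clock_t t0 = clock();
        uint32_t a = sum_from(0);
        clock_t t1 = clock();
        uint32_t b = sum_from(1);
        clock_t t2 = clock();
        printf("aligned: %ld ticks  misaligned: %ld ticks  (sums %u %u)\n",
               (long)(t1 - t0), (long)(t2 - t1), a, b);
        return 0;
    }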

Alon
not necessarily in groups of 4 bytes: this is highly dependent on the CPU type.
jldupont
This is a bit simplified: it's fine to have a BYTE-sized value at a memory location not divisible by 4. It's also fine to have a WORD-sized value at a memory location divisible by 2.
nikie
I was going for simple ;-)
Alon
@Alon: it can be much worse than what you depict ("dumping bytes"): suppose the data structure spans a boundary that requires an additional bus transaction; then performance goes out the window.
jldupont
@jldupont: I agree. I disregarded paging and cache levels and sizes to illustrate the main point more clearly
Alon
+1  A: 

In addition to jldupont's answer, some architectures have load and store instructions (those used to read from and write to memory) that only operate on word-aligned boundaries - so loading a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!
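A sketch in C of what those instructions accomplish, emulating a misaligned 4-byte load out of an aligned word array (the names load4/words are illustrative, and little-endian byte order is assumed; real hardware or compilers do the equivalent in circuitry or emitted instructions):

    #include <stdio.h>
    #include <stdint.h>

    static const uint32_t words[4] = {
        0x03020100u, 0x07060504u, 0x0B0A0908u, 0x0F0E0D0Cu
    };

    /* emulate a 4-byte load from byte address `addr`
       (addr/4 + 1 must stay in bounds for the misaligned case) */
    static uint32_t load4(size_t addr) {
        unsigned off = (unsigned)(addr & 3);
        uint32_t lo  = words[addr / 4];        /* first aligned load  */
        if (off == 0)
            return lo;                         /* aligned: one load   */
        uint32_t hi  = words[addr / 4 + 1];    /* second aligned load */
        /* shift the two halves into place and OR them together */
        return (lo >> (8 * off)) | (hi << (8 * (4 - off)));
    }

    int main(void) {
        printf("load4(0) = 0x%08X (one load)\n",           load4(0));
        printf("load4(1) = 0x%08X (two loads + shifts)\n", load4(1));
        return 0;
    }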

Autopulated
does reading a type which is smaller than 4 bytes (bool, short, whatever) always include a masking operation, and, if it's not 4-byte aligned, also a shift instruction?
Mat
@Mat: not necessarily a "shift instruction": at the circuitry level, chip designers tend to refer to this type of operation as something along the lines of "byte swappers".
jldupont
+1  A: 

A cache line is the basic unit of caching, typically 16-64 bytes or more.

Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.

Consider this loop, which performs a 4-byte read from a pseudo-random address on each iteration:

myrandomreader:
  ; ...
  ; ten instructions to generate next pseudo-random
  ; address in ESI from previous address
  ; ...
  MOV EAX, DS:[ESI]   ; 4-byte read at the pseudo-random address
  LOOP myrandomreader

For a memory read straddling two cache lines:

(on an L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2 into L1 before it can request the second cache line, causing a short execution stall

(on an L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete, rather than one

Processor stalls

  • A random 4-byte read will straddle a cache-line boundary about 5% of the time for 64-byte cache lines, 10% for 32-byte ones and 20% for 16-byte ones.

  • There may be additional execution overheads for some instructions on misaligned data, even if it is within a cache line. This is discussed on Intel's website for some SSE instructions.

  • If you are defining the structures yourself, it may make sense to list all the <32-bit data fields together in a struct so that padding overhead is reduced (see the sketch after this list), or alternatively to review whether it is better to turn packing on or off for a particular structure.

  • On MIPS and many other platforms you don't get the choice and must align - kernel exception if you don't!!

  • Alignment may also matter especially to you if you are doing I/O on the bus, if you are using atomic operations such as atomic increment/decrement, or if you wish to be able to port your code to non-Intel platforms.

  • On Intel-only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory use, and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).
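As a sketch of the field-grouping point above (in C, with illustrative struct names; exact sizes depend on the ABI, but the relative saving is typical):

    #include <stdio.h>

    struct Scattered {   /* small fields interleaved with larger ones */
        char a;          /* 1 byte + 3 padding */
        int  x;          /* 4 bytes            */
        char b;          /* 1 byte + 3 padding */
        int  y;          /* 4 bytes            */
    };                   /* typically 16 bytes */

    struct Grouped {     /* small fields listed together */
        int  x;          /* 4 bytes                 */
        int  y;          /* 4 bytes                 */
        char a;          /* 1 byte                  */
        char b;          /* 1 byte + 2 tail padding */
    };                   /* typically 12 bytes */

    int main(void) {
        printf("Scattered: %zu bytes, Grouped: %zu bytes\n",
               sizeof(struct Scattered), sizeof(struct Grouped));
        return 0;
    }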

EDIT

Changed "overheads for instructions" to "overheads for some instructions".

Added "On Intel only (!) code, " to "a common practice is to define".

Corrected "1byte ones" to "16byte ones".

martinr