Why aren't there bank conflicts in global memory for Cuda/OpenCL?

views:

122

answers:

+3 Q:

Why aren't there bank conflicts in global memory for Cuda/OpenCL?

One thing I haven't figured out and google isn't helping me, is why is it possible to have bank conflicts with shared memory, but not in global memory? Can there be bank conflicts with registers?

UPDATE Wow I really appreciate the two answers from Tibbit and Grizzly. It seems that I can only give a green check mark to one answer though. I am newish to stack overflow. I guess I have to pick one answer as the best. Can I do something to say thank you to the answer I don't give a green check to?

+5 A:

Short Answer: There are no bank conflicts in either global memory or in registers.

Explanation:

The key to understanding why is to grasp the granularity of the operations. A single thread does not access the global memory. Global memory accesses are "coalesced". Since global memory is soo slow, any access by the threads within a block are grouped together to make as few requests to the global memory as possible.

Shared memory can be accessed by threads simultaneously. When two threads attempt to access an address within the same bank, this causes a bank conflict.

Registers cannot be accessed by any thread except the one to which it is allocated. Since you can't read or write to my registers, you can't block me from accessing them -- hence, there aren't any bank conflicts.

Who can read & write to global memory?

Only blocks. A single thread can make an access, but the transaction will be processed at the block level (actually the warp / half warp level, but I'm trying not be complicated). If two blocks access the same memory, I don't believe it will take longer and it may happen accelerated by the L1 cache in the newest devices -- though this isn't transparently evident.

Who can read & write to shared memory?

Any thread within a given block. If you only have 1 thread per block you can't have a bank conflict, but you won't have reasonable performance. Bank conflicts occur because a block is allocated with several, say 512 threads and they're all vying for different addresses within the same bank (not quite the same address). There are some excellent pictures of these conflicts at the end of the CUDA C Programming Guide -- Figure G2, on page 167 (actually page 177 of the pdf). Link to version 3.2

Who can read & write to registers?

Only the specific thread to which it is allocated. Hence only one thread is accessing it at one time.

M. Tibbits 2010-10-02 19:52:47

Note that my comment about the L1 cache is actually a question of my own -- do bank conflicts occur in the L1 cache. As this is handled entirely in the hardware, I don't believe we've been told in the latest documentation. (But the L1 is only in the newest 2.* hardware -- so if you don't have a Fermi GPU, this point it mute).

M. Tibbits 2010-10-02 19:54:43

+2 A:

Whether or not there can be bank conflicts on a given type of memory is obviously dependent on the structure of the memory and therefore of its purpose.

So why is shared memory designed in a way which allows for bank conflicts?

Thats relatively simple, its not easy to design a memory controller which can handle independent accesses to the same memory simultaneously (proven by the fact that most can't). So in order to allow each thread in a halfwarp to access a individualy addressed word the memory is banked, which an independent controller for each bank (at least thats how one can think about it, not sure about the actual hardware). These banks are interleaved to make sequential threads accessing sequential memory fast. So each of these banks can handle one request at a time ideally allowing for concurrent executions of all requests in the halfwarp (obviously this model can theoretically sustain higher bandwidth due to the independence of those banks, which is also a plus).

What about registers?

Registers are designed to be accessed as operands for ALU instructions, meaning they have to be accessed with very low latency. Therefore they get more transistors/bit to make that possible. I'm not sure how exactly registers are accessed in modern processors (not the kind of information you need often and not that easy to find out). However it would obviously be highly unpractical to organize registers in banks (for simpler architectures you typically see all registers hanging on one big multiplexer). So no, there won't be bank conflicts for registers.

Global memory

First of all global memory works on a different granuality then shared memory. Memory is accessed in 32, 64 or 128byte blocks (for GT200 atleast, for fermi it is 128B always, but cached, AMD is a bit different), where everytime you want something from a block the whole block is accessed/transferred. That is why you need coalesced accesses, since if every thread accesses memory from a different block you have to transfer all blocks.

But who says there aren't bank conflicts? I'm not completely sure about this, because I haven't found any actual sources to support this for NVIDIA hardware, but it seems logical: The global memory is typically distributed to several ram chips (which can be easily verified by looking on a graphicscard). It would make sense, if each of these chips is like a bank of local memory, so you would get bank conflicts if there are several simultaneous requests on the same bank. However the effects would be much less pronounced for one thing (since most of the time consumed by memory accesses is the latency to get the data from A to B anyways), and it won't be an effect noticible "inside" of one workgroup (since only one halfwarp executes at a time and if that halfwarp issues more then one request you have an uncoalesced memory access, so you are already taking a hit making it hard to measure the effects of this conflict. So you would only get conflicts if several workgroups try to access the same bank. In your typical situation for gpgpu you have a large dataset lying in sequential memory so the effects shouldn't really be noticible since there are enough other workgroups accessinng the other banks at the same time, but it should be possible to construct situations where the dataset is centered on just a few banks, which would make for a hit on bandwidth (since the maximal bandwidth would come from equaly distributing access on all banks, so each bank would only have a fraction of that bandwidth). Again I haven't read anything to prove this theory for nvidia hardware (mostly everything focusses on coalescing, which of course is more important as it makes this a nonproblem for natural datasets to). However according to the ATI Stream computing guide this is the situation for Radeon cards (for 5xxx: banks are 2kb apart and you want to make sure that you distribute your accesses (meaning from all worgroups simulateously active) equaly over all banks), so I would imagine this is the situation for Radeon cards to.

Of course for most scenarious the possibility of bank conflicts on global memory is a non issue, so in practice you can say:

Watch for coalescing when accessing global memory
Watch for bank conflicts when accessing local memory
No problems with accessing registers

Grizzly 2010-10-02 21:52:13

Watch out for partition camping in global memory too!

mch 2010-10-06 22:40:29

ansaurus

tags:

views:

answers:

Why aren't there bank conflicts in global memory for Cuda/OpenCL?

related questions