Okay, I have already been through most of the ATI and NVIDIA guides to OpenCL. There are some things I just want to be sure of, and some that need clarification. Nothing in the documentation gives a clear-cut answer.

I have a Radeon 4650, and on querying my device I got:

  CL_DEVICE_MAX_COMPUTE_UNITS:  8
  CL_DEVICE_ADDRESS_BITS:  32
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES: 128 / 128 / 128 
  CL_DEVICE_MAX_WORK_GROUP_SIZE: 128
  CL_DEVICE_MAX_MEM_ALLOC_SIZE:  256 MByte
  CL_DEVICE_GLOBAL_MEM_SIZE:  256 MByte

OK, first: my card has 1 GB of memory, so why am I allowed only 256 MB?

Second, I don't understand the work-item dimension part. Does that mean I can have up to 128*3 or 128^3 work-items?

When I calculated this before I ran the query, I got 8 cores * 16 stream processors * 4 work-items = 512. Why is this wrong?

Also, I got the same 3-dimension work-item values for my Intel Core 2 Duo CPU. Does the same calculation apply?

As for the command queues: when I tried accessing my Core 2 Duo CPU as a device using OpenCL, everything got processed on one core only. I tried creating multiple queues and queueing several entries, but it still got processed on one core only. I used a global_work_size of 128*128*128*8 for a simple write program where each work-item writes its own global ID to the buffer, and I got only zeros.

And what about NVIDIA cards? On an NVIDIA 9500 GT with 32 CUDA cores, are the work-items calculated similarly?

Thanks a lot, I've really been all over the place trying to find answers.

+2  A: 

OK, first: my card has 1 GB of memory, so why am I allowed only 256 MB?

This is an ATI driver bug/limitation, AFAIK. I'll check on my 5850 to see if I can reproduce it.

http://devforums.amd.com/devforum/messageview.cfm?catid=390&threadid=124142&messid=1069111&parentid=0&FTVAR_FORUMVIEWTMP=Branch

Second, I don't understand the work-item dimension part. Does that mean I can have up to 128*3 or 128^3 work-items?

No. It means you can have at most 128 work-items along any one dimension, since CL_DEVICE_MAX_WORK_ITEM_SIZES is 128 / 128 / 128. And since CL_DEVICE_MAX_WORK_GROUP_SIZE is 128, the work-group as a whole is also capped at 128 work-items. Both are limits on a single work-group, not on the global work size. So you can have, e.g., work_group_size(128, 1, 1), work_group_size(1, 128, 1), work_group_size(64, 1, 2), or work_group_size(8, 4, 4): as long as the product of the dimensions is <= 128, it will be fine.
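To make the rule concrete, here is a minimal sketch in plain C of the check described above. The helper name `local_size_is_valid` is my own (hypothetical, not part of the OpenCL API), and it assumes you have already read the two limits from the device query:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API call): checks a 3-D local work
 * size against the two limits reported by the device query, i.e.
 * CL_DEVICE_MAX_WORK_ITEM_SIZES and CL_DEVICE_MAX_WORK_GROUP_SIZE. */
bool local_size_is_valid(const size_t local[3],
                         const size_t max_item_sizes[3],
                         size_t max_group_size)
{
    size_t product = 1;
    for (int d = 0; d < 3; ++d) {
        /* Each dimension must be at least 1 and within its own cap. */
        if (local[d] == 0 || local[d] > max_item_sizes[d])
            return false;
        /* Overflow-safe test that product * local[d] <= max_group_size. */
        if (local[d] > max_group_size / product)
            return false;
        product *= local[d];
    }
    return true;
}
```

With the values from the question (128 / 128 / 128 and 128), this should accept (64, 1, 2) and (8, 4, 4) but reject (128, 128, 128), because the product is far above 128 even though each dimension is within its own cap.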

When I calculated this before I ran the query, I got 8 cores * 16 stream processors * 4 work-items = 512. Why is this wrong?

Also, I got the same 3-dimension work-item values for my Intel Core 2 Duo CPU. Does the same calculation apply?

I don't understand what you are trying to compute here.

Stringer Bell
First off, thanks a lot. Um, never mind the 512 part, I confused processing elements with work-items. As for the CPU, I was wondering if the same calculations apply to it as well. If so, my CPU showed a max work-group size of 1024; does that mean it can process 1024 work-items simultaneously?
OSaad
If the CPU shows 1024 for max work-group size, then the same rule applies: you can have, e.g., 128 * 8 * 1 as a work_group_size. As for whether work-items are processed simultaneously, this is abstracted by the runtime, so you don't really know.
Stringer Bell
You're probably using ATI's software OpenCL implementation (it reports 1024 as max work-group size). Of course a CPU cannot work on 1024 work-items at a time. As far as I know, ATI's software OpenCL executes the work-items of a work-group sequentially, as far as possible. If you access shared memory, the kernel is broken up into multiple parts. Curiously, a work-group size of 1 (which *should* do well on a CPU) performed badly with ATI's implementation and my code.
dietr
@dietr: maybe ATI is doing 1 OS thread per work-group? If you have lots of small work-groups (size=1), this could generate more overhead than a much bigger group (e.g. size=256). Only speculating here. Waiting for my Phenom II X6 :)...
Stringer Bell