I'm working on translating a CUDA application (this, if you must know) to OpenCL. The original application uses the C-style CUDA API, with a single stream just to avoid the automatic busy-wait when reading back the results.
Now, I notice that OpenCL command queues look a lot like CUDA streams. But the device read command (and likewise the write and kernel-execution commands) also takes parameters for events. So I'm wondering: what does it take to execute a device write, a number of kernel launches (e.g. one call to one kernel, then 100 calls to another kernel), and a device read, all sequentially?
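To make the question concrete, here is roughly the sequence I'm after. This is just a sketch: the queue, buffer, kernels, sizes and host pointers are placeholders for my actual objects, and I've left out all error checking.

```c
#include <CL/cl.h>

/* Sketch only: the write -> kernels -> read sequence I want,
   with placeholder objects and no error checking. */
static void run_sequence(cl_command_queue queue, cl_mem buf,
                         cl_kernel kernelA, cl_kernel kernelB,
                         size_t global, size_t size,
                         const void *host_in, void *host_out)
{
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_in, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, NULL, 0, NULL, NULL);
    for (int i = 0; i < 100; ++i)
        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, NULL, 0, NULL, NULL);
    /* Blocking read, so the host waits here for everything above to finish. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_out, 0, NULL, NULL);
}
```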
- If I just enqueue them into the same queue one after the other, as in the sketch above, will they execute sequentially like they do in CUDA?
- If that doesn't work, can/should I daisy-chain events, making each call's wait list the previous call's event? (See the second sketch after this list for what I mean.)
- Or should I add all the previous events to each call's wait list, as if there were an N^2 search for dependencies or something?
- Or do I just have to event.wait() on each call individually, as AMD's tutorial says to?
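Here's roughly what I mean by the daisy-chaining in the second bullet. Again just a sketch with placeholder names and no error checking, so treat it as the pattern I have in mind rather than working code:

```c
#include <CL/cl.h>

/* Sketch of the daisy-chaining idea: each command's wait list is just the
   previous command's event. Placeholder objects, no error checking. */
static void run_chained(cl_command_queue queue, cl_mem buf,
                        cl_kernel kernelA, cl_kernel kernelB,
                        size_t global, size_t size,
                        const void *host_in, void *host_out)
{
    cl_event prev, next, done;

    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_in,
                         0, NULL, &prev);
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, NULL,
                           1, &prev, &next);
    clReleaseEvent(prev);
    prev = next;

    for (int i = 0; i < 100; ++i) {
        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, NULL,
                               1, &prev, &next);
        clReleaseEvent(prev);
        prev = next;
    }

    /* Non-blocking read that waits on the last kernel's event, then a
       single host-side wait on the read's event instead of busy-waiting. */
    clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, size, host_out,
                        1, &prev, &done);
    clReleaseEvent(prev);
    clWaitForEvents(1, &done);
    clReleaseEvent(done);
}
```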
Thanks!