I'm working on a video codec for the OMAP3430. I already have the code written in C++, and I'm trying to modify/port certain parts of it to take advantage of the DSP (the SDK I have, the OMAP ZOOM3430 SDK, includes an additional DSP).
I tried to port a small for loop that runs over a very small amount of data (~250 bytes), but about 2M times on different data. However, the overhead of the communication between the CPU and the DSP is much greater than the gain (if there is any gain at all).
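For context, the loop I tried to offload has roughly this shape (a hypothetical stand-in only; `process_block` and the exact sizes are placeholders, not my real codec kernel):

```
#include <cstddef>

// Hypothetical stand-in for the real codec kernel.
void process_block(unsigned char *blk, std::size_t len);

void run_all_blocks(unsigned char *stream)
{
    const std::size_t kBlockSize = 250;      // ~250 bytes of data per call
    const std::size_t kNumBlocks = 2000000;  // ~2M calls, each on new data

    for (std::size_t i = 0; i < kNumBlocks; ++i) {
        // Offloading this one block at a time means one GPP<->DSP
        // round trip per ~250 bytes: the fixed messaging cost is paid
        // 2M times and dominates whatever speedup the DSP provides.
        process_block(stream + i * kBlockSize, kBlockSize);
    }
}
```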
I assume this task is much like optimizing code for GPUs in ordinary computers. My question is: what kinds of parts would be beneficial to port? How do GPU programmers deal with such tasks?
Edit: Here is the sequence of steps my application goes through (a rough code sketch of the whole sequence follows the list):
- GPP application allocates a buffer of size 0x1000 bytes.
- GPP application invokes DSPProcessor_ReserveMemory to reserve a DSP virtual address space for each allocated buffer, using a size that is 4K greater than the allocated buffer to account for automatic page alignment. The total reservation size must also be aligned on a 4K page boundary.
- GPP application invokes DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.
- GPP application prepares a message to notify the DSP execute phase of the base address of the virtual address space that has been mapped to the buffer allocated on the GPP. GPP application uses DSPNode_PutMessage to send the message to the DSP.
- GPP invokes memcpy to copy the data to be processed into the shared memory.
- GPP application invokes DSPProcessor_FlushMemory to ensure that the data cache has been flushed.
- GPP application prepares a message to notify the DSP execute phase that it has finished writing to the buffer and the DSP may now access it. The message also contains the amount of data written to the buffer, so that the DSP knows just how much data to copy. The GPP uses DSPNode_PutMessage to send the message to the DSP, and then invokes DSPNode_GetMessage to wait for a reply from the DSP.
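In code, the sequence above looks roughly like this (a sketch only: error handling is omitted, the command ids are hypothetical, and the exact Bridge signatures/attribute flags are from my reading of the TI dbapi docs, so treat them as assumptions):

```
#include <cstdlib>
#include <cstring>
#include <dbapi.h>   // TI DSP Bridge user-space API (assumed header name)

// Hypothetical command ids that my DSP node's execute phase understands.
enum { CMD_SET_BUFFER = 1, CMD_DATA_READY = 2 };

// hProc/hNode are a processor and node attached elsewhere.
void send_buffer_to_dsp(DSP_HPROCESSOR hProc, DSP_HNODE hNode,
                        const void *src, ULONG nBytes)
{
    const ULONG BUF_SIZE = 0x1000;

    // Step 1: allocate the GPP-side buffer.
    void *pBuf = malloc(BUF_SIZE);

    // Step 2: reserve a DSP virtual range, 4K larger for page alignment.
    void *pRsv = NULL;
    DSPProcessor_ReserveMemory(hProc, BUF_SIZE + 0x1000, &pRsv);

    // Step 3: map the GPP buffer into the reserved DSP virtual range.
    void *pDspVa = NULL;
    DSPProcessor_Map(hProc, pBuf, BUF_SIZE, pRsv, &pDspVa, 0);

    // Step 4: tell the execute phase the mapped base address.
    struct DSP_MSG msg;
    msg.dwCmd  = CMD_SET_BUFFER;
    msg.dwArg1 = (DWORD)pDspVa;
    msg.dwArg2 = BUF_SIZE;
    DSPNode_PutMessage(hNode, &msg, DSP_FOREVER);

    // Steps 5-6: copy the payload in and flush the MPU data cache.
    memcpy(pBuf, src, nBytes);
    DSPProcessor_FlushMemory(hProc, pBuf, nBytes, 0);

    // Step 7: signal "data ready" (with the byte count) and block
    // until the DSP sends its "processing finished" message back.
    msg.dwCmd  = CMD_DATA_READY;
    msg.dwArg1 = nBytes;
    DSPNode_PutMessage(hNode, &msg, DSP_FOREVER);
    DSPNode_GetMessage(hNode, &msg, DSP_FOREVER);
}
```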
After these steps the execution of the DSP program starts, and the DSP notifies the GPP with a message when it finishes processing. Just as a test, I don't put any processing inside the DSP program; I just send a "processing finished" message back to the GPP. And this still consumes a lot of time. Could that be because of the internal/external memory usage, or is it merely because of the communication overhead?
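For reference, this is roughly how the bare round trip could be timed in isolation, with no buffer work at all (a sketch, assuming the DSP side simply echoes every message back; clock_gettime is POSIX):

```
#include <ctime>
#include <dbapi.h>   // TI DSP Bridge user-space API (assumed header name)

// Average cost of one empty GPP->DSP->GPP message round trip, in
// microseconds. Assumes the DSP node replies to every message
// immediately, without touching any shared buffer.
double avg_roundtrip_us(DSP_HNODE hNode, int iterations)
{
    struct DSP_MSG msg = { 0, 0, 0 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; ++i) {
        DSPNode_PutMessage(hNode, &msg, DSP_FOREVER);
        DSPNode_GetMessage(hNode, &msg, DSP_FOREVER);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return total_us / iterations;
}
```

If this number alone is large, the cost is in the messaging path itself rather than in internal/external memory traffic.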