I have two processes, a producer and a consumer. IPC is done with OpenFileMapping/MapViewOfFile on Win32.

The producer receives video from another source, which it then passes over to the consumer and synchronization is done through two events.

For the producer:

Receive frame
Copy to shared memory using CopyMemory
Trigger DataProduced event
Wait for DataConsumed event

For the consumer

Indefinitely wait for DataProducedEvent
Copy frame to own memory and send for processing
Signal DataConsumed event

Without any of this, the video averages 5 fps. If I add the events on both sides but leave out the CopyMemory, it's still around 5 fps, though a tiny bit slower. When I add the CopyMemory operation, it drops to 2.5-2.8 fps. memcpy is even slower.

I find it hard to believe that a simple memory copy can cause this kind of slowdown. Any ideas on a remedy?

Here's my code to create the shared mem:

HANDLE fileMap = CreateFileMapping(INVALID_HANDLE_VALUE, 0, PAGE_READWRITE, 0, fileMapSize, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, fileMapSize);

The size is 1024 * 1024 * 3

Edit - added the actual code:

On the producer:

void OnFrameReceived(...)
{
    // get buffer
    BYTE *buffer = 0;
...

    // copy data to shared memory
    CopyMemory(((BYTE*)mapView) + 1, buffer, length);

    // signal data event
    SetEvent(dataProducedEvent);

    // wait for it to be signaled back!
    WaitForSingleObject(dataConsumedEvent, INFINITE);
}

On the consumer:

while (WAIT_OBJECT_0 == WaitForSingleObject(dataProducedEvent, INFINITE))
{
    SetEvent(dataConsumedEvent);
}
+1  A: 

Copying memory involves certain operations under the hood, and for video these can be significant.

I'd try another route: create a shared block for each frame, or for several frames. Name them consecutively, i.e. block1, block2, block3, etc., so that the recipient knows which block to read next. Now receive the frame directly into the allocated blockX, notify the consumer that the new block is available, then allocate and start using another block immediately. The consumer maps the block and doesn't copy it - the block belongs to the consumer now, and the consumer can use the original buffer for further processing. Once the consumer closes its mapping of the block, the mapping is destroyed. So you get a stream of blocks and avoid blocking.

If frame processing doesn't take much time but creating a shared block does, you can create a pool of shared blocks, large enough to ensure that the producer and consumer never attempt to use the same block (you can extend the scheme by guarding each block with a semaphore or mutex).

Hope my idea is clear - avoid copying by using the same block first in the producer, then in the consumer.

Eugene Mayevski 'EldoS Corp
What do you mean by "Copying of memory involves certain operations under the hood"?
torak
I really don't have details to give you, but my observations show that extra copying of memory blocks significantly slows down many operations, especially in processes that are mainly number-crunching or network-bound. Call it empirical knowledge.
Eugene Mayevski 'EldoS Corp
Well, at the moment I don't do any processing on the frame, while I'm trying to figure out how to fix this - do you still think it'll help? I mean, the copy operation will stay there - I just won't have to wait for data consumed event, I guess?
everwicked
No, there should be no data copy operation involved at all. You will collect the data right to the shared memory block in producer, then signal to consumer and release the block. Consumer will take the address of the block and will pass it to further processing without copying.
Eugene Mayevski 'EldoS Corp
Like I said in my comment on the other answer, I can't collect the data straight into the shared memory without considerable work, since the buffer is handled internally by DirectShow. However, I did what you suggested and used blocks rather than a single buffer. Strangely enough, only one block ever ends up being used, and the performance is still lacking. I'm starting to think the bottleneck is actually CopyMemory itself - is there any reason why it would be slower when called from a process I've injected myself into?
everwicked
Well, as I mentioned above, I've noticed that CopyMemory slows down processing, but this is empirical knowledge without a theoretical background. In other words, I don't know why :)
Eugene Mayevski 'EldoS Corp
A: 

The time it takes to copy 3 MB of memory really shouldn't be noticeable at all. A quick test on my old (and busted) laptop completed 10,000 memcpy(buf1, buf2, 1024 * 1024 * 3) operations in around 10 seconds. At roughly 1/1000th of a second per copy, it shouldn't be slowing down your frame rate by a noticeable amount.

Regardless, it would seem that there is probably some optimisation that could speed things up. Currently you seem to be either double- or triple-handling the data: double handling because you "receive the frame" and then "copy to shared memory"; triple handling if "copy frame to own memory and send for processing" means that you truly copy to a local buffer and then process, instead of just processing from the buffer.

The alternative is to receive the frame into the shared buffer directly and process it directly out of the buffer. If, as I suspect, you want to be able to receive one frame while processing another, you just increase the size of the memory mapping to accommodate more than one frame and use it as a circular array. On the consumer side it would look something like this.

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...)  // Consume data produced event
frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
processFrame(frame);
frameNumber++;
ReleaseSemaphore(...)     // Generate data consumed event

And the producer

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...)  // Consume data consumed event
frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
receiveFrame(frame);
frameNumber++;
ReleaseSemaphore(...)     // Generate data produced event

Just make sure that the data consumed semaphore is initialised to FRAMES_IN_ARRAY_COUNT and the data produced semaphore is initialised to 0.

torak
Yes, I did think of that too - the problem with the first part is that the original image buffer is managed by DirectShow, and I don't want to delve into its memory-management APIs just yet. The processing is done after the event has been signaled to send more data when available, so it shouldn't interfere like that. Does that make sense?
everwicked
Yes and no. You could still omit the last copy by processing directly from the shared buffer.
torak
A: 

Well, it seems that copying from the DirectShow buffer into shared memory was the bottleneck after all. I tried using a Named Pipe to transfer the data instead and, guess what - the performance is restored.

Does anyone know of any reasons why this may be?

To add a detail that I didn't think was relevant before: the producer is injected and hooks onto a DirectShow graph to retrieve the frames.

everwicked