I have two processes, a producer and a consumer. IPC is done with OpenFileMapping/MapViewOfFile on Win32.

The producer receives video from another source, which it then passes over to the consumer and synchronization is done through two events.

For the producer:

Receive frame
Copy to shared memory using CopyMemory
Trigger DataProduced event
Wait for DataConsumed event

For the consumer

Indefinitely wait for DataProducedEvent
Copy frame to own memory and send for processing
Signal DataConsumed event

Without any of this, the video averages 5 fps. If I add the events on both sides but leave out the CopyMemory, it's still around 5 fps, though a tiny bit slower. When I add the CopyMemory operation, it drops to 2.5-2.8 fps. memcpy is even slower.

I find it hard to believe that a simple memory copy can cause this kind of slowdown. Any ideas on a remedy?

Here's my code to create the shared mem:

HANDLE fileMap = CreateFileMapping(INVALID_HANDLE_VALUE, 0, PAGE_READWRITE, 0, fileMapSize, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, fileMapSize);

The size is 1024 * 1024 * 3

Edit - added the actual code:

On the producer:

void OnFrameReceived(...)
{
    // get buffer
    BYTE *buffer = 0;
...

    // copy data to shared memory
    CopyMemory(((BYTE*)mapView) + 1, buffer, length);

    // signal data event
    SetEvent(dataProducedEvent);

    // wait for it to be signaled back!
    WaitForSingleObject(dataConsumedEvent, INFINITE);
}

On the consumer:

while (WAIT_OBJECT_0 == WaitForSingleObject(dataProducedEvent, INFINITE))
{
    SetEvent(dataConsumedEvent);
}
+1  A: 

Copying memory involves certain operations under the hood, and for video these can be significant.

I'd try another route: create a shared block for each frame, or for several frames. Name them consecutively, i.e. block1, block2, block3, etc., so that the recipient knows which block to read next. Now receive the frame directly into the allocated blockX, notify the consumer that the new block is available, then allocate and start using another block immediately. The consumer maps the block and doesn't copy it - the block belongs to the consumer now, and the consumer can use the original buffer for further processing. Once the consumer closes its mapping of the block, the mapping is destroyed. So you get a stream of blocks and avoid blocking.

If frame processing doesn't take much time but creating a shared block does, you can create a pool of shared blocks, large enough to ensure that the producer and consumer never attempt to use the same block (you can extend the scheme by guarding each block with a semaphore or mutex).

Hope my idea is clear - avoid copying by using the same block first in the producer, then in the consumer.

Eugene Mayevski 'EldoS Corp
What do you mean by "Copying of memory involves certain operations under the hood"?
torak
I really don't have details to give you, but my observations show that extra copying of memory blocks significantly slows down many operations, especially in processes that are mainly number-crunching or network-bound. Call it empirical knowledge.
Eugene Mayevski 'EldoS Corp
Well, at the moment I don't do any processing on the frame, while I'm trying to figure out how to fix this - do you still think it'll help? I mean, the copy operation will stay there - I just won't have to wait for data consumed event, I guess?
everwicked
No, there should be no data copy operation involved at all. You will collect the data right to the shared memory block in producer, then signal to consumer and release the block. Consumer will take the address of the block and will pass it to further processing without copying.
Eugene Mayevski 'EldoS Corp
Like I said in my comment on the other answer, I can't collect the data straight into the shared memory without considerable work, since the buffer is handled internally by DirectShow. However, I did what you suggested and used blocks rather than a single buffer. Strangely enough, only one block ever ends up being used, and the performance is still lacking. I'm starting to think the bottleneck is actually CopyMemory itself - is there any reason why it would be slower when called from a process I've injected myself into?
everwicked
Well, as I mentioned above, I've noticed that CopyMemory slows down processing, but this is empirical knowledge without a theoretical background. In other words, I don't know why :)
Eugene Mayevski 'EldoS Corp
A: 

The time it takes to copy 3 MB of memory really shouldn't be noticeable at all. A quick test on my old (and busted) laptop completed 10,000 memcpy(buf1, buf2, 1024 * 1024 * 3) operations in around 10 seconds. At roughly 1/1000th of a second per copy, it shouldn't be slowing down your frame rate by a noticeable amount.

Regardless, it would seem that there is probably some optimisation that could speed things up. Currently you seem to be either double- or triple-handling the data: double handling because you "receive the frame" and then "copy to shared memory"; triple handling if "copy frame to own memory and send for processing" means that you truly copy to a local buffer and then process, instead of just processing from the buffer.

The alternative is to receive the frame into the shared buffer directly and process it directly out of the buffer. If, as I suspect, you want to be able to receive one frame while processing another, you just increase the size of the memory mapping to accommodate more than one frame and use it as a circular array. On the consumer side it would look something like this.

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...)  // Consume data produced event
frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
processFrame(frame);
frameNumber++;
ReleaseSemaphore(...)     // Generate data consumed event

And the producer

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...)  // Consume data consumed event
frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
receiveFrame(frame);
frameNumber++;
ReleaseSemaphore(...)     // Generate data produced event

Just make sure that the data consumed semaphore is initialised to FRAMES_IN_ARRAY_COUNT and the data produced semaphore is initialised to 0.

torak
Yes, I did think of that too - the problem with the first part is that the original image buffer is managed by DirectShow, and I don't want to delve into its memory-management APIs just yet. The processing is done after the event has been signaled to send more data when available, so it shouldn't interfere like that. Does that make sense?
everwicked
Yes and no. You could still omit the last copy by processing directly from the shared buffer.
torak
A: 

Well, it seems that copying from the DirectShow buffer into shared memory was the bottleneck after all. I tried using a Named Pipe to transfer the data instead and, guess what - the performance is restored.

Does anyone know of any reasons why this may be?

To add a detail that I didn't think was relevant before: the producer is injected and hooks onto a DirectShow graph to retrieve the frames.

everwicked