I have a simple program that breaks a dataset (a CSV file) into 4 chunks, reads each chunk in, does some calculations, and then appends the outputs together. Think of it as a simple map-reduce operation. Processing a single chunk uses about 1 GB of memory. I'm running the program on a quad-core PC with 4 GB of RAM, running Windows XP. I happened to code it up in R, but I don't think that's relevant.

I coded up two versions. One version processes each chunk in sequence. The other version processes chunks two at a time in parallel. Both versions take nearly the same amount of time to finish.
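
Roughly, the two versions look like the sketch below (simplified: process_chunk and the file names stand in for the real calculation, and the parallel version uses a 2-worker socket cluster via the parallel/snow API):

    library(parallel)

    chunk_files <- c("chunk1.csv", "chunk2.csv", "chunk3.csv", "chunk4.csv")

    process_chunk <- function(f) {
        d <- read.csv(f)    # each chunk is about 1 GB once in memory
        # ... roughly 5 minutes of calculation on d goes here ...
        d                   # placeholder for the computed result
    }

    # Version 1: chunks processed in sequence
    out_seq <- do.call(rbind, lapply(chunk_files, process_chunk))

    # Version 2: two chunks at a time on a 2-worker socket cluster (works on Windows)
    cl <- makeCluster(2)
    out_par <- do.call(rbind, parLapply(cl, chunk_files, process_chunk))
    stopCluster(cl)

    write.csv(out_par, "output.csv", row.names = FALSE)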

Under what circumstances would you expect to see this performance result?

My current hypothesis is that the processes are bound by memory performance, but I don't know the best way to investigate this further. Any suggestions or guesses?

Edit: The program is not I/O-bound on disk. The processing step reads a chunk of the CSV file into memory, churns on it for 5 minutes or so, and then writes the result back out to a file on disk. The file input and output take a few seconds at most.

A: 

If the processes are competing for resources, then the benefits of parallelization are reduced.

If the disk is running constantly (so the processes are I/O-bound), you won't notice any benefit. If they share the same instances of data structures (resulting in a lot of time wasted on synchronization), you'll see a greatly reduced performance gain. And if the "reduce" part of the operation is what takes most of the time, parallelizing the "map" won't produce a significant speedup.

You haven't given us enough data to say for sure what the cause is in your case.

Anon.
+2  A: 

There is one usual answer to questions about performance, and it applies whether you're doing serial or parallel programming: use a profiler. :-)
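
In R specifically, a minimal profiling pass might look like this (Rprof and summaryRprof ship with base R; process_chunk and the file name are stand-ins for the real code):

    Rprof("chunk_profile.out")               # start sampling the call stack
    result <- process_chunk("chunk1.csv")    # the step under investigation
    Rprof(NULL)                              # stop profiling
    summaryRprof("chunk_profile.out")        # time per function, by self and total

If most of the time lands in your own calculation functions rather than in the read/write routines, that points at computation (and possibly memory traffic) rather than the disk.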

Chris Jester-Young
+2  A: 

Your assumption about being memory-bound is correct. You need to get your working set down to the size of the cache or increase your memory bandwidth. One way to do that is to distribute your program onto several machines; then you need to make sure your chunks are coarse enough that the work outweighs the communication cost between the machines. GPUs also have very high memory bandwidth, and your problem is still small enough to fit within the memory of a graphics card.
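
For the several-machines route, a socket cluster in the snow/parallel style is one way to sketch it (the host names are purely illustrative, and each node needs R installed, remote-shell access, and a copy of or shared access to the chunk files):

    library(parallel)

    chunk_files <- c("chunk1.csv", "chunk2.csv", "chunk3.csv", "chunk4.csv")
    process_chunk <- function(f) { d <- read.csv(f); d }    # placeholder calculation

    cl <- makePSOCKcluster(c("node1", "node2"))    # one R worker per remote host
    out <- do.call(rbind, parLapply(cl, chunk_files, process_chunk))
    stopCluster(cl)

The chunks here are large relative to the per-task setup cost, so the main question is whether shipping a gigabyte of data (or its results) across the network costs more than the memory contention it avoids.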

balor123
A: 

Five minutes does sound like a long time even for R to read a gigabyte file, so I'll assume you're not I/O-bound. In that case, the answer is that you are most likely memory-bound. If so, parallelizing should start to help if you read in only half a chunk at a time, so that the two processes' combined working set fits more comfortably in memory. (But are you sure the computations are actually running in separate threads, rather than being time-sliced on a single thread? What happens if you start two separate instances of R, one processing one chunk and the other processing another?)
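
One way to run that experiment is a tiny script that processes a single chunk and reports its elapsed time, launched once on its own and then from two command prompts at the same time (Rscript, the file names, and process_chunk are placeholders for the real program):

    # process_one.R -- run as:  Rscript process_one.R chunk1.csv
    args <- commandArgs(trailingOnly = TRUE)

    process_chunk <- function(f) {
        d <- read.csv(f)
        # ... the real calculation ...
        d
    }

    print(system.time(result <- process_chunk(args[1])))

If each of the two concurrent runs takes noticeably longer than the solo run, the processes are contending for something shared (memory bandwidth, cache, or swap) rather than for CPU time.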

Rex Kerr
I had the same concern about time slicing. In fact, if I start two separate R processes, there's no difference in run time.
Greg