views:

278

answers:

4

I have a Java program for doing a set of scientific calculations across multiple processors by breaking it into pieces and running each piece in a different thread. The problem is trivially partitionable so there's no contention or communication between the threads. The only common data they access are some shared static caches that don't need to have their access synchronized, and some data files on the hard drive. The threads are also continuously writing to the disk, but to separate files.
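For readers following along, the setup described above might look roughly like the sketch below. It is only an illustration under stated assumptions: `ChunkedRun`, `work`, and `runInChunks` are hypothetical names, the real calculation is replaced by a trivial stand-in, and a fixed thread pool stands in for whatever threading the actual program uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: split an index range into independent chunks, one task per chunk.
final class ChunkedRun {
    /** Stand-in for the real per-point calculation (assumption, not the original code). */
    static long work(long point) {
        return point * point;
    }

    /** Runs work() over [0, points) split into `chunks` contiguous pieces and sums the results. */
    static long runInChunks(long points, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int c = 0; c < chunks; c++) {
                // Chunk boundaries; together the chunks cover every point exactly once.
                final long from = points * c / chunks;
                final long to = points * (c + 1) / chunks;
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (long p = from; p < to; p++) {
                        sum += work(p);
                    }
                    return sum;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get(); // blocks until that chunk has finished
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the tasks share nothing but the read-only `work` function, any slowdown in a layout like this would have to come from outside the task logic itself (shared caches, I/O, or the platform).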

My problem is that sometimes when I run the program I get very good speed, and sometimes when I run the exact same thing it runs very slowly. If I see it running slowly and ctrl-C and restart it, it will usually start running fast again. It seems to set itself into either slow mode or fast mode early on in the run and never switches between modes.

I have hooked it up to jconsole and it doesn't seem to be a memory problem. When I have caught it running slowly, I've tried connecting a profiler to it but the profiler won't connect. I've tried running with -Xprof but the dumps between a slow run and fast run don't seem to be much different. I have tried using different garbage collectors and different sizings of the various parts of the memory space, also.

My machine is a Mac Pro with a striped RAID partition. The CPU usage never drops off, whether it's running slowly or quickly. If threads were spending too much time blocking on reads from the disk I would expect the CPU usage to drop, so I don't think it could be a disk read problem.

My question is, what types of problems with my code could cause this? Or could this be an OS problem? I haven't been able to duplicate it on a Windows machine, but I don't have a Windows machine with a similar RAID setup.

+1  A: 

You might have a thread that has gone into an endless loop.

Try connecting with VisualVM and use the Thread monitor.

https://visualvm.dev.java.net

You may have to connect before the problem occurs.

Fedearne
I'm pretty sure it's not going into an infinite loop, because even when the program runs slowly it still finishes and gives the correct output.
javajustice
But thank you, I will try VisualVM and see if it shows anything.
javajustice
I have tried looking at the threads in VisualVM, and none of them are blocking. It says they are all running fine. If I do CPU profiling the results are odd: it only updates sporadically and gives nonsensical results whether the program is running fast or slow. Running the CPU profiler does always, without fail, knock the program out of "slow mode" though.
javajustice
Just to be clear, loading it in VisualVM shows all of the computation threads as 100% "green" with no time spent sleeping, waiting, or in monitor contention. Also, the garbage collector usage is ~0%.
javajustice
+1  A: 

I second that you should look at this with a profiler, using the threads view: how many threads there are, what states they are in, etc. It might be an odd race condition happening every now and then. It could also be that instrumenting the classes with profiler hooks (which causes a slowdown) sorts the race condition out, so you will see no slowdown with the profiler attached :/

Please have a look at this post, or rather its answer, where a cache contention problem is mentioned.

Are you spawning the same number of threads each time? Is that number less than or equal to the number of hardware threads available on your platform? That number can be checked or estimated with fair accuracy.
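As a rough sketch of that check: the JVM reports how many logical processors it can use through `Runtime.getRuntime().availableProcessors()`. The class and method names below (`WorkerCount`, `workersFor`) are hypothetical, and leaving one processor free is just one possible policy, not a recommendation from this thread.

```java
// Hypothetical helper, not part of the original program.
final class WorkerCount {
    /** Logical processors the JVM reports as available to it right now. */
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Leaves `reserved` processors free, but always returns at least one worker. */
    static int workersFor(int processors, int reserved) {
        return Math.max(1, processors - reserved);
    }

    public static void main(String[] args) {
        int processors = logicalProcessors();
        System.out.println("logical processors: " + processors);
        System.out.println("suggested workers:  " + workersFor(processors, 1));
    }
}
```

Logging both numbers at startup of each run would show whether the fast and slow runs were started with the same thread count on the same number of visible processors.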

Please post any findings!

DanTe
The program takes as input an n-dimensional parameter space and divides it into a constant, given number of chunks, one for each thread. In this case I'm using 15 chunks since I have 16 logical processors. The threads are almost completely independent. They read from the same set of data files, but each with their own channel, and write out to separate data files (one for each point in the parameter space). The only shared memory is some static arrays of constants that begin uninitialized. When one of the threads tries to look up the constant it first checks if it's been ...
javajustice
... calculated, and if not it calculates it and puts it in the array. So here multiple threads would be accessing the arrays simultaneously, but the access isn't synchronized and all modifications to the arrays are atomic. I'm essentially running what could be run in 15 separate processes in one process for convenience's sake.
javajustice
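A minimal sketch of the lazily filled constant cache described in the comment above, under several assumptions: the constants are `double` values, they come from a pure function (the hypothetical `compute` below), and recomputing a value twice is harmless. The original program uses plain static arrays; this sketch swaps in `AtomicLongArray` holding raw `double` bits, since the Java Language Specification does not guarantee that writes to non-volatile `double` or `long` values are atomic.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch, not the original code: a check-then-compute cache without locks.
final class LazyConstantCache {
    // Bit pattern meaning "not computed yet". Assumption: compute() never returns NaN.
    private static final long EMPTY = Double.doubleToRawLongBits(Double.NaN);

    private final AtomicLongArray slots;

    LazyConstantCache(int size) {
        slots = new AtomicLongArray(size);
        for (int i = 0; i < size; i++) {
            slots.set(i, EMPTY);
        }
    }

    /** Stand-in for the real constant calculation (assumption). */
    static double compute(int index) {
        return Math.sqrt(index);
    }

    /** Returns the cached value, computing and storing it on first use. */
    double get(int index) {
        long bits = slots.get(index);
        if (bits != EMPTY) {
            return Double.longBitsToDouble(bits);
        }
        // Two threads may both reach this point and compute the same value;
        // that is wasted work but not an error, because compute() is pure.
        double value = compute(index);
        slots.set(index, Double.doubleToRawLongBits(value));
        return value;
    }
}
```

A shared array like this is also a place where threads on different cores write to neighbouring memory during warm-up, so it may be worth checking when chasing cache contention effects.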
+1  A: 

Do you have a tool to measure CPU temperature? The OS might be throttling the CPU to deal with temperature issues.

Jon Bright
That is interesting. Could it have something to do with the TurboBoost stuff in the new Nehalem chips? It is very strange that it slows down while still showing the same level of cpu usage in top.
javajustice
Though if it were throttling because of temperature, I'd expect it to sometimes slow down a process that is running fast, or vice versa. This never happens. The process is always stuck either fast or slow from the beginning, and only connecting VisualVM and starting a CPU profile can knock it out of slow mode.
javajustice
+1  A: 

Is it possible that your program is being paged to disk sometimes? In this case, you will need to look at the memory usage of the operating system as a whole, rather than just your program. I know from experience that there is a huge difference in runtime performance when memory is being continually paged to the disk and back.

I don't know much about OS X, but on Linux the "free" command is useful for this purpose.

Another thing that might cause this slowdown is log files. I've known at least some logging code that slowed the system down incrementally as the log files grew. It's possible that your threads are synchronizing on a log file which is growing in size, and that when you restart your program another log file is used.

erg0sum
I have tried to be conscious of this issue. My machine has 32G of memory, I limit the process to a 16G max heap, and the process generally never gets above 8G. So it shouldn't be paging, but next time I reproduce the bug I will find something to monitor the swap usage.
javajustice
I reproduced the issue and it is definitely not hitting the page file, at least according to the Mac OS Activity Monitor.
javajustice