sporadic behavior by the machines in stress

Hi,

We are doing some Java stress runs (involving network I/O). Initially everything is fine and the system responds very fast (average latency in the test is 2 ms). But hours later, when I redo the same test, I observe that performance goes down (20 - 60 ms). It's the same JAR files, the same JVM, and the same LAN over which the stress is running. I don't understand the reason for this behavior.

The LAN is 1 Gbps, and for the stress requirements I'm sure we are not using all of it.

So my questions:

Can it be because of some switches in the LAN?
Do the machines slow down after some time? (The machines were last restarted about 6 months ago, well before the stress runs started. They are RHEL 5 on 64-bit quad-core Xeons.)
What is the general way to debug such issues?

Any help please?

-- ravi

A few questions...

How much of the environment is under your control, and are you putting any measures in place to ensure it's consistent for each run? For example, are you sharing the network with other systems, and is the machine you're using being used solely for your stress testing?

The way I'd look at this is to start gathering details on what your machine and code are up to. That means using perfmon (Windows) or sar (Unix) to find out what the OS and hardware are doing, and attaching a profiler to make sure your code is doing the same thing on each run and to help pinpoint where the bottleneck is occurring from a code perspective.
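As a rough illustration of gathering details from inside the JVM, here is a minimal sketch that reads a few counters through the standard `java.lang.management` API. The class name `JvmSnapshot` and the choice of counters are my own assumptions, not something from your setup, and this is a complement to sar and a profiler rather than a replacement. The idea is to log one snapshot with the fast run and another with the slow run, then compare them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: captures a few JVM and OS counters at one point in time.
class JvmSnapshot {
    final long gcCount;        // total collections across all collectors
    final long gcTimeMillis;   // approximate accumulated collection time
    final long heapUsedBytes;  // heap currently in use
    final double loadAverage;  // system load average, negative if unavailable

    JvmSnapshot(long gcCount, long gcTimeMillis, long heapUsedBytes, double loadAverage) {
        this.gcCount = gcCount;
        this.gcTimeMillis = gcTimeMillis;
        this.heapUsedBytes = heapUsedBytes;
        this.loadAverage = loadAverage;
    }

    static JvmSnapshot take() {
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return new JvmSnapshot(count, time, heap.getUsed(), load);
    }

    @Override
    public String toString() {
        return "gcCount=" + gcCount
                + " gcTimeMillis=" + gcTimeMillis
                + " heapUsedBytes=" + heapUsedBytes
                + " loadAverage=" + loadAverage;
    }

    public static void main(String[] args) {
        // Print one snapshot; in a stress run you would log this periodically.
        System.out.println(JvmSnapshot.take());
    }
}
```

If the GC time or the load average is noticeably higher in the slow run, that points at the JVM or the machine rather than at the network.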

Nothing terribly detailed, but I hope it helps get you started.

Tom Duckering 2010-02-22 05:22:52

The general way is "measure everything". In particular, this might mean:

Ensure the time on all servers is the same (use NTP or something similar);
Measure how long it took to generate the request (what if the request generator has a bug?);
Measure when the request left the client machine(s), or at least how long the I/O took. Sometimes it is enough to know the average time needed for many requests.
Measure when the request arrived.
Measure how long it took to generate a response.
Measure how long it took to send the response.

You can probably start from the 5th item (how long it took to generate a response), as this is (you believe) your critical chain. But it is best to log as much as you can, since by your own account it takes hours before the results change.
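To make the measurement steps above a bit more concrete, here is a minimal sketch of a per-stage timer built on `System.nanoTime()`. The class name `StageTimer` and the stage names are hypothetical, and it only keeps a count and a running total per stage. A real run would probably also want percentiles and timestamps, so you can see when the slowdown starts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper: accumulates elapsed time per named stage.
class StageTimer {
    // stage name -> {number of samples, total elapsed nanoseconds}
    private final Map<String, long[]> totals = new LinkedHashMap<>();

    // Record one sample for a stage.
    synchronized void record(String stage, long elapsedNanos) {
        long[] t = totals.computeIfAbsent(stage, k -> new long[2]);
        t[0]++;
        t[1] += elapsedNanos;
    }

    // Run the given work, timing it even if it throws.
    <T> T time(String stage, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            record(stage, System.nanoTime() - start);
        }
    }

    synchronized long count(String stage) {
        long[] t = totals.get(stage);
        return t == null ? 0 : t[0];
    }

    // Average elapsed time in milliseconds, or 0.0 if the stage has no samples.
    synchronized double averageMillis(String stage) {
        long[] t = totals.get(stage);
        if (t == null || t[0] == 0) {
            return 0.0;
        }
        return (t[1] / 1_000_000.0) / t[0];
    }
}
```

You would wrap each stage in a call such as `timer.time("generate-response", () -> handler.handle(request))`, where `handler` and `request` stand in for your own code, and log the averages per stage at intervals. Note that `System.nanoTime()` is only meaningful within one JVM. Comparing "left the client" with "arrived at the server" needs wall-clock timestamps, which is why the clocks have to be synchronized first.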

If you don't want to modify your code, look for places where you can capture data without touching it (e.g. define a servlet filter in your web.xml).

mindas 2010-02-22 14:34:56
