views:

185

answers:

2

I have pin pointed a bottle neck in my application, it seems to me that it boils down to a call to Thread::setContextClassLoader.

Basically I was forced to mess around with the thread's context class loader due to issues with 3rd party libraries (see this question to understand why).

The solution I picked was to my knowledge the common one and it works something like this:

Thread thread = Thread.currentThread();
ClassLoader old = thread.getContextClassLoader();
thread.setContextClassLoader(newClassLoader);

try {
    ... // problematic code that uses the thread context class loader
} finally {
    thread.setContextClassLoader(old);
}

It turned out that the call to setContextClassLoader wasn't a problem when only 1 thread was running, but when more than one thread is doing it, it slows down drastically.

I've made the following test app to isolate the problem:

ArrayList<Thread> threads = new ArrayList<Thread>();
int thread_count = 1;

long start = System.currentTimeMillis();

for (int i = 0; i < thread_count; i++) {
    Thread thread = new Thread(new MyRunnable(100000000));

    thread.start();
    threads.add(thread);
}

for (Thread thread : threads) {
    thread.join();
}

long total = System.currentTimeMillis() - start;
double seconds = (double)total / 1000;

System.out.println("time in seconds: " + seconds);

And this is MyRunnable class:

public class MyRunnable implements Runnable {
    int _iterations;

    public MyRunnable(int iterations) {
        _iterations = iterations;
    }

    public void run() {
        final Thread curr = Thread.currentThread();
        final ClassLoader loader = ClassLoader.getSystemClassLoader();

        for (int i = 0; i < _iterations; i++) {
            curr.setContextClassLoader(loader);
        }
    }
}

Basically it opens a couple of threads, and sets the current threads context class loader to the system class loader in a loop.

Updated results after code change on my machine: When thread_count is 1 it finishes in half a second. 2 threads take 1.5~, 3 threads 2.7~, 4 threads 4~ - well you get the picture.

I've tried looking in Thread's implementation of setContextClassLoader and it appears that all it does is set a member variable to the class loader passed to it. I've found no locking (or access to shared resources that would require) to explain such overhead when running with more than one thread.

What am I missing here?

P.S. I'm using JRE 1.5 but the same thing happens in 1.6.

EDIT: @Tom Hawtin - See the code changes I made to rule out the reason you mentioned. Even when the system class loader is fetched once, the results are slower when the thread count > 1.

+3  A: 

The only really obvious thing in the source is not related to Thread.setContextClassLoader. ClassLoader.getSystemClassLoader calls initSystemClassLoader that locks ClassLoader.class even if the system class loader is already initialised.

A potential issue is that reading of volatile variables may have performance impact on some multiprocessor machines.

Note that we are only looking at a few hundred cycles here.

Tom Hawtin - tackline
Only a few hundred cycles, but the bulk of the user code is just a few cycles itself, so this could easily cause the slowdown he's seeing. Also, the lock effectively rules out parallelism. Yes, the setContextClassLoader() call adds to the time, but in a real app you are not going to call it ten million times in a tight loop. The usage will be a microscopic fraction of your app's total usage. Don't worry about it.
Jim Garrison
Instead of calling getSystemClassLoader, create an empty URLClassLoader - you'll see the same results.
Idan K
Adding `final ClassLoader system = ClassLoader.getSystemClassLoader();` outside the loop (and using it) mostly clears up the problem for me. Also drastically improves performance in all cases.
Tom Hawtin - tackline
See the updated code - even when there's only one call to getSystemClassLoader() the results stay the same.
Idan K
On my machine (dual core Athlon, 32-bit Vista, Sun JRE 6u17), I get around the same time of 0.6 s for 1, 2, 3, 4 or 5 threads, despite the fact that the amount of work for this microbenchmark is proportional to the number of threads. Benchmark busted. Now 10 ns for each operation is TINY!! What are you complaining about? The difference in timings may be due to the way the HotSpot compiler (you don't mention what JAVA implementation you are using) compiles in a separate thread and then patches in. Try putting the body of the `main` in a different run it five times.
Tom Hawtin - tackline
+2  A: 

Recent JVMs use a techinque called "biased locking" which makes lock acquisition nearly free if only one thread ever accesses a given lock. The first time a second thread tries to access the lock, the "bias" towards the original accessor is revoked and the lock becomes a normal/lightweight lock which requires an atomic operation to acquire (and sometimes one to release).

The performance difference between a biased lock and a normal one can be an order of magnitude (say 5 cycles vs 50 cycles) which is consistent with your measurement. The lock in question here is probably the lock mentioned in the first reply to your question. Biased locking is described in more detail here if you are interested.

Even ignoring biased locking, two threads or more threads attempting to get the same lock will generally be considerably slower in aggregate throughput than a single thread (since they contend for the cache line containing the lock word).

BeeOnRope
See the updated code - even when there's only one call to getSystemClassLoader() the results stay the same.
Idan K
Sure, but why wouldn't you expect the results to be slower? As Tom points out, the work being done is proportional to the number of threads - with 4 threads you are looping 4 times as many times.How many CPUs do you have? Do you have hyperthreading? You may also experience jitter in that the test doesn't end until all CPUs have finished, although most threads may finish earlier.
BeeOnRope