views:

109

answers:

3

I was implementing a FIFO queue of Request instances (preallocated request objects for speed) and started by using the "synchronized" keyword on the add method. The method was quite short (check if there is room in the fixed-size buffer, then add the value to the array). Using VisualVM it appeared the thread was blocking more often than I liked (in the "monitor" state, to be precise). So I converted the code to use AtomicInteger values for things such as keeping track of the current size, and compareAndSet() in while loops (as AtomicInteger does internally for methods such as incrementAndGet()). The code now looks quite a bit longer.

What I am wondering is: what is the performance overhead of the shorter code using the synchronized keyword versus the longer code without it (which should never block on a lock)?

Here is the old get method with the synchronized keyword:

public synchronized Request get()
{
    if (head == tail) // empty: head has caught up with tail
    {
        return null;
    }
    Request r = requests[head];
    head = (head + 1) % requests.length; // advance head, wrapping around
    return r;
}

Here is the new get method without the synchronized keyword:

public Request get()
{
    // Claim one element by atomically decrementing size;
    // retry if another thread changed size in the meantime.
    while (true)
    {
        int current = size.get();
        if (current <= 0)
        {
            return null; // queue is empty
        }
        if (size.compareAndSet(current, current - 1))
        {
            break;
        }
    }

    // Atomically advance head (wrapping around) and return
    // the element at the old head position.
    while (true)
    {
        int current = head.get();
        int nextHead = (current + 1) % requests.length;
        if (head.compareAndSet(current, nextHead))
        {
            return requests[current];
        }
    }
}

My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc), even though the code is shorter.

Thanks!

+5  A: 

My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc)

Yes, in the common case you are right. Java Concurrency in Practice discusses this in section 15.3.2:

[...] at high contention levels locking tends to outperform atomic variables, but at more realistic contention levels atomic variables outperform locks. This is because a lock reacts to contention by suspending threads, reducing CPU usage and synchronization traffic on the shared memory bus. (This is similar to how blocking producers in a producer-consumer design reduces the load on consumers and thereby lets them catch up.) On the other hand, with atomic variables, contention management is pushed back to the calling class. Like most CAS-based algorithms, AtomicPseudoRandom reacts to contention by trying again immediately, which is usually the right approach but in a high-contention environment just creates more contention.

Before we condemn AtomicPseudoRandom as poorly written or atomic variables as a poor choice compared to locks, we should realize that the level of contention in Figure 15.1 is unrealistically high: no real program does nothing but contend for a lock or atomic variable. In practice, atomics tend to scale better than locks because atomics deal more effectively with typical contention levels.

The performance reversal between locks and atomics at differing levels of contention illustrates the strengths and weaknesses of each. With low to moderate contention, atomics offer better scalability; with high contention, locks offer better contention avoidance. (CAS-based algorithms also outperform lock-based ones on single-CPU systems, since a CAS always succeeds on a single-CPU system except in the unlikely case that a thread is preempted in the middle of the read-modify-write operation.)

(On the figures referred to by the text, Figure 15.1 shows that the performance of AtomicInteger and ReentrantLock is more or less equal when contention is high, while Figure 15.2 shows that under moderate contention the former outperforms the latter by a factor of 2-3.)

Update: on nonblocking algorithms

As others have noted, nonblocking algorithms, although potentially faster, are more complex and thus more difficult to get right. A hint from section 15.4 of JCiP:

Good nonblocking algorithms are known for many common data structures, including stacks, queues, priority queues, and hash tables, though designing new ones is a task best left to experts.

Nonblocking algorithms are considerably more complicated than their lock-based equivalents. The key to creating nonblocking algorithms is figuring out how to limit the scope of atomic changes to a single variable while maintaining data consistency. In linked collection classes such as queues, you can sometimes get away with expressing state transformations as changes to individual links and using an AtomicReference to represent each link that must be updated atomically.
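To make that last point concrete, here is a sketch of the classic Treiber stack (the well-known nonblocking stack, also discussed in JCiP): every state transformation is a single CAS on one AtomicReference, the top link. This is an illustration of the general technique, not code from the question:

```java
import java.util.concurrent.atomic.AtomicReference;

// Treiber stack: each push/pop is one CAS on the `top` reference,
// retried on contention.
class ConcurrentStack<E> {
    private final AtomicReference<Node<E>> top = new AtomicReference<>();

    private static class Node<E> {
        final E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    public void push(E item) {
        Node<E> newHead = new Node<>(item);
        Node<E> oldHead;
        do {
            oldHead = top.get();
            newHead.next = oldHead;
        } while (!top.compareAndSet(oldHead, newHead)); // retry on contention
    }

    public E pop() {
        Node<E> oldHead;
        Node<E> newHead;
        do {
            oldHead = top.get();
            if (oldHead == null) return null; // empty stack
            newHead = oldHead.next;
        } while (!top.compareAndSet(oldHead, newHead));
        return oldHead.item;
    }

    public static void main(String[] args) {
        ConcurrentStack<Integer> s = new ConcurrentStack<>();
        s.push(1); s.push(2); s.push(3);
        System.out.println(s.pop() + " " + s.pop() + " " + s.pop() + " " + s.pop());
        // prints "3 2 1 null"
    }
}
```

Note how consistency is maintained because the entire logical state lives behind a single atomically-updated variable — exactly what the two separate AtomicIntegers in the question do not give you.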

Péter Török
A: 

Before doing this kind of synchronization optimizations, you really need a profiler to tell you that it's absolutely necessary.

Yes, synchronized under some conditions may be slower than an atomic operation, but compare your original and replacement methods. The former is really clear and easy to maintain; the latter is definitely more complex. Because of this there may be very subtle concurrency bugs that you will not find during initial testing. I already see one problem: size and head can get out of sync, because, though each of these operations is atomic, the combination is not, and sometimes this may lead to an inconsistent state.
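The size/head hazard can be made concrete. The question does not show put(), so the steps below assume it mirrors get() (size CAS first, then a tail CAS and a write); with capacity 1, a single-threaded simulation of the interleaving shows a put() overwriting an element that a paused get() has claimed but not yet read:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Single-threaded simulation of the race between the question's get()
// and an assumed mirror-image put(). Capacity 1 keeps it easy to follow.
class RaceDemo {
    static final String[] requests = new String[1];
    static final AtomicInteger size = new AtomicInteger();
    static final AtomicInteger head = new AtomicInteger();
    static final AtomicInteger tail = new AtomicInteger();

    public static void main(String[] args) {
        // Initial state: queue holds "X".
        requests[0] = "X";
        size.set(1);

        // Thread A enters get(): it wins the size CAS (1 -> 0) ...
        boolean claimed = size.compareAndSet(1, 0);
        // ... and is preempted right here, before the head CAS.

        // Thread B runs put("Y") to completion: size now says there is
        // room, so it claims the slot and overwrites "X" -- which
        // thread A has claimed but not yet read.
        if (size.compareAndSet(0, 1)) {
            int t = tail.getAndSet((tail.get() + 1) % requests.length);
            requests[t] = "Y";
        }

        // Thread A resumes: it advances head and reads requests[0],
        // which is now "Y". The original "X" is lost.
        int h = head.getAndSet((head.get() + 1) % requests.length);
        System.out.println(claimed + " -> got " + requests[h]);
        // prints "true -> got Y"
    }
}
```

The synchronized version cannot exhibit this, because the emptiness check and the head advance happen under one lock.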

So, my advice:

  1. Start simple
  2. Profile
  3. If performance is good enough, leave the simple implementation as is
  4. If you need a performance improvement, then start to get clever (possibly using a more specialized lock at first), and TEST, TEST, TEST
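For step 4, one incremental option before going fully nonblocking is swapping synchronized for java.util.concurrent.locks.ReentrantLock, which keeps the simple single-lock structure but adds tryLock() and timed/fair variants. A minimal sketch (Request and the ring-buffer fields are assumed to mirror the question's code):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical ReentrantLock variant of the question's get().
class RequestQueue {
    static class Request {}

    private final Request[] requests = new Request[16];
    private int head, tail;
    private final ReentrantLock lock = new ReentrantLock();

    public Request get() {
        lock.lock();
        try {
            if (head == tail) return null; // empty
            Request r = requests[head];
            head = (head + 1) % requests.length;
            return r;
        } finally {
            lock.unlock(); // always release, even on early return
        }
    }

    public static void main(String[] args) {
        System.out.println(new RequestQueue().get()); // prints "null" (empty queue)
    }
}
```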
Alexander Pogrebnyak
Thanks. A challenge here is how to measure the performance of highly concurrent code and get meaningful results. I tried VisualVM, for example, but the results it reports are weird, to the extent that I don't trust them. E.g. a thread that was `while(true){sleep-1-sec();do-something-quick();}` was reported as 100% CPU and no sleeps on some, but not all, of the profile runs I tried. It certainly felt like VisualVM changed the performance characteristics of the application. I will be doing test runs of course, but it felt like others would have already done better comparisons (hence this question).
Alan Kent
@Alan. Looks like you are (re)implementing a concurrent queue. If this is not a homework assignment, I suggest using one of the implementations from the `java.util.concurrent` library. The implementation of your new method does not look thread-safe, and if it's not thread-safe, then performance really does not matter. Remember, first it has to be "right"; then, if required, it should be fast. From Péter's answer the performance is better by a factor of 2-3, but if you are only doing 10 operations a second, it will not matter.
Alexander Pogrebnyak
No, not a homework assignment. I have to process 3 million requests per second with real-time requirements, so I am pre-allocating everything to avoid GC pauses. Could you be more specific about what does not look thread-safe? It's modelled closely on how AtomicInteger works internally. Basically it relies on compareAndSet() being atomic: you propose the new value assuming the value has not been changed by anyone else yet. If the value has been changed by someone else, you loop and try again. Faster than locking in low-contention situations.
Alan Kent
+1  A: 

I wonder if the JVM already does a few spins before really suspending the thread. It would anticipate that well-written critical sections, like yours, are very short and complete almost immediately. Therefore it should optimistically busy-wait for, say, dozens of loops before giving up and suspending the thread. If that's the case, it should behave the same as your second version.

What a profiler shows might be very different from what's really happening in a JVM at full speed, with all kinds of crazy optimizations. It's better to measure and compare throughputs without a profiler.
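A rough sketch of that approach: run several threads hammering the operation for a fixed interval with no profiler attached, and count completed operations. (This is a toy harness with made-up names; for serious numbers a dedicated microbenchmark framework is preferable, since naive timing loops are vulnerable to JIT warm-up effects.)

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Toy throughput harness: N threads hammer a shared AtomicInteger
// (a stand-in for get()/put() on the queue) for a fixed interval,
// then we total the completed operations across all threads.
class ThroughputSketch {
    public static void main(String[] args) throws InterruptedException {
        final AtomicInteger sharedCounter = new AtomicInteger(); // stand-in workload
        final AtomicLong ops = new AtomicLong();                 // total completed ops
        final long endAt = System.nanoTime() + 200_000_000L;     // run ~200 ms

        Runnable worker = () -> {
            long local = 0;
            while (System.nanoTime() < endAt) {
                sharedCounter.incrementAndGet(); // contended CAS loop inside
                local++;
            }
            ops.addAndGet(local); // accumulate once, off the hot path
        };

        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("total ops: " + ops.get());
    }
}
```

Swapping the stand-in line for calls to each queue variant lets you compare throughputs head to head under the same thread count.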

irreputable