So I'm working on a speed contest in Java. I have (number of processors) threads doing work, and they all need to add to a B-Tree. Originally I just used a synchronized add method, but I wanted to make it so threads could follow each other through the tree (each thread only has the lock on the object it's accessing). Unfortunately, even for a very large file (48,000 lines), my new binary tree is slower than the old one. I assume this is because I'm getting and releasing a lock every time I move in the tree. Is this the best way to do this or is there a better way?

Each node has a ReentrantLock named lock, and getLock() and releaseLock() just call lock.lock() and lock.unlock();
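For reference, the helpers are just thin wrappers around that per-node lock, roughly:

import java.util.concurrent.locks.ReentrantLock;

// inside TreeNode
private final ReentrantLock lock = new ReentrantLock();

public void getLock() {
    lock.lock();
}

public void releaseLock() {
    lock.unlock();
}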

My code:

public void add(String sortedWord, String word) {

    synchronized (this) {
        if (head == null) {
            head = new TreeNode(sortedWord, word);
            return;
        }
        head.getLock();
    }

    TreeNode current = head, previous = null;
    while (current != null) {

        // If this is an anagram of another word in the list..
        if (current.getSortedWord().equals(sortedWord)) {
            current.add(word);
            current.releaseLock();
            return;
        }
        // New word is less than current word
        else if (current.compareTo(sortedWord) > 0) {
            previous = current;
            current = current.getLeft();
            if (current != null) {
                current.getLock();
                previous.releaseLock();
            }
        }
        // New word greater than current word
        else {
            previous = current;
            current = current.getRight();
            if (current != null) {
                current.getLock();
                previous.releaseLock();
            }
        }
    }

    if (previous.compareTo(sortedWord) > 0) {
        previous.setLeft(sortedWord, word);
    }
    else {
        previous.setRight(sortedWord, word);
    }
    previous.releaseLock();
}

EDIT: Just to clarify, my code is structured like this: the main thread reads input from a file and adds the words to a queue, and each worker thread pulls words from the queue and does some work (including sorting them and adding them to the binary tree).
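Roughly like this (the Worker and Tree names here are just an illustrative sketch, not my exact classes):

import java.util.Arrays;
import java.util.concurrent.BlockingQueue;

// worker thread: pull words off the shared queue, sort their letters,
// and add them to the shared tree under the sorted-letters key
class Worker implements Runnable {
    private static final String POISON_PILL = ""; // sentinel the main thread adds when the file is done

    private final BlockingQueue<String> queue;
    private final Tree tree; // the tree whose add() is shown above

    Worker(BlockingQueue<String> queue, Tree tree) {
        this.queue = queue;
        this.tree = tree;
    }

    public void run() {
        try {
            while (true) {
                String word = queue.take();        // blocks until the main thread adds a word
                if (word.equals(POISON_PILL)) {
                    return;
                }
                char[] letters = word.toCharArray();
                Arrays.sort(letters);
                tree.add(new String(letters), word);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}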

+1  A: 

Assuming one data set per line, 48k lines isn't all that much, and you can only make wild guesses as to how your operating system and the virtual machine are going to mangle your file I/O to make it as fast as possible.

Trying to use a producer/consumer paradigm can be problematic here, as you have to carefully balance the overhead of locks against the actual amount of I/O. You might get better performance if you just improve the way you do the file I/O (consider something like mmap()).

pmr
That's a good point. I tried it on /usr/share/dict/words (470,000 words) and the fancy locking was still twice as slow as just using a synchronized method. Thanks for the idea. File IO isn't really the problem because reading the entire file only takes ~1 second (I have another class for shuffling files and it's quite fast).
Brendan Long
+1  A: 

I would say that doing it this way is not the way to go, even without taking the synchronization performance issues into account.

The fact that this implementation is slower than the original fully synchronized version may be a problem, but a bigger problem is that the locking in this implementation is not at all robust.

Imagine, for example, that you pass null in for sortedWord; this will result in a NullPointerException being thrown, which means the current thread ends up holding onto the lock and leaving your data structure in an inconsistent state. On the other hand, if you just synchronize this method, you don't have to worry about such things. Considering the synchronized version is faster as well, it's an easy choice to make.

gab
It's a *speed contest,* not production code. That is a big factor in selecting the "right" approach.
erickson
The thing is, I can guarantee based on my other code that it will never ever be passed null. In the function that calls the b-tree's add() function, it actually checks if(sortedWord == null) right before calling this. It's not meant to be particularly safe, but my program always works.
Brendan Long
+2  A: 

Locking and unlocking is overhead, and the more you do it, the slower your program will be.

On the other hand, decomposing a task and running portions in parallel will make your program complete more quickly.

Where the "break-even" point lies is highly-dependent on the amount of contention for a particular lock in your program, and the system architecture on which the program is run. If there is little contention (as there appears to be in this program) and many processors, this might be a good approach. However, as the number of threads decreases, the overhead will dominate and a concurrent program will be slower. You have to profile your program on the target platform to determine this.

Another option to consider is a non-locking approach using immutable structures. Rather than modifying a list in place, for example, you could create a new node that points to the old (linked) list, then use a compareAndSet operation on an AtomicReference to ensure that you won the data race to set the words collection in the current tree node. If not, try again. You could use AtomicReferences for the left and right children in your tree nodes too. Whether this is faster or not, again, would have to be tested on your target platform.
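For example, a lock-free word list for a tree node might be sketched like this (the WordNode class and field names are only illustrative):

import java.util.concurrent.atomic.AtomicReference;

// immutable singly-linked list: prepending builds a new head, the old list is never touched
final class WordNode {
    final String word;
    final WordNode next;
    WordNode(String word, WordNode next) { this.word = word; this.next = next; }
}

// inside the tree node
private final AtomicReference<WordNode> words = new AtomicReference<WordNode>();

public void add(String word) {
    while (true) {
        WordNode old = words.get();
        WordNode updated = new WordNode(word, old);   // new head pointing at the old list
        if (words.compareAndSet(old, updated)) {
            return;                                   // we won the data race
        }
        // another thread changed the list first; re-read and try again
    }
}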

erickson
I like the AtomicReference idea, but I think I've gone enough over the top already.
Brendan Long
+2  A: 

Another thing: there definitely is no place for a binary tree in performance-critical code. The caching behaviour will kill all performance. It should have a much larger fan-out (one cache line). [edit] With a binary tree you access too much non-contiguous memory. Take a look at the material on Judy trees.

And you probably want to start with a radix of at least one character before starting the tree.

And do the compare on an int key instead of a string first.

And perhaps look at tries.

And get rid of all the threads and synchronization. Just try to make the problem memory-access bound.

[edit] I would do this a bit differently. I would use a thread for each first character of the string, and give each its own BTree (or perhaps a trie). I'd put a non-blocking work queue in front of each thread and fill the queues based on the first character of the string. You can get even more performance by presorting the add queue and doing a merge sort into the BTree. In the BTree, I'd use int keys representing the first 4 characters, only referring to the strings in the leaf pages.
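A sketch of the int-key part (pack the first four characters into one int so most comparisons never touch the strings; this assumes plain ASCII input):

// pack the first 4 characters into one int; the first character ends up
// in the most significant byte, so numeric order matches string order
static int prefixKey(String s) {
    int key = 0;
    for (int i = 0; i < 4; i++) {
        key <<= 8;
        if (i < s.length()) {
            key |= s.charAt(i) & 0xFF;   // ASCII assumed, so the key stays non-negative
        }
    }
    return key;
}

// compare by the packed prefix first; only fall back to the full string on a tie
static int compare(int keyA, String a, int keyB, String b) {
    if (keyA != keyB) {
        return keyA < keyB ? -1 : 1;
    }
    return a.compareTo(b);
}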

In a speed contest, you hope to be memory access bound, and therefore have no use for threads. If not, you're still doing too much processing per string.

Stephan Eggermont
My binary tree increased performance significantly..
Brendan Long
When compared to?
Stephan Eggermont
I thought we were talking about B-Trees which are a lot different from binary search trees (or general binary trees).
pmr
I looked it up and the only difference I saw was that binary trees are B-Trees with only two child nodes..
Brendan Long
And they don't perform on modern hardware (built in the last ten years). Look at the details; they matter in a speed contest.
Stephan Eggermont
http://softwareengineering.vazexqi.com/2009/11/23/lessons-from-parallelizing-matrix-multiplication
Stephan Eggermont
A: 

I've got a dumb question: since you're reading and modifying a file, you're going to be totally limited by how fast the read/write head can move around and the disk can rotate. So what good is it to use threads and processors? The disk can't do two things at once.

Or is this all in RAM?

ADDED: OK, it's not clear to me how much parallelism can help you here (some, maybe), but regardless, what I would suggest is squeezing every cycle out of each thread that you can. This is what I'm talking about. For example, I wonder if innocent-looking sleeper code like those calls to "get" and "compare" methods is taking more of a percentage of time than you might expect. If it is, you might be able to do each of them once rather than 2 or 3 times - that sort of thing.

Mike Dunlavey
The main thread is reading in input and adding it to a queue, and then there are (number of processors) threads pulling data from the queue. The B-Tree is only one thing the worker threads have to do, so we're not waiting on IO.
Brendan Long
OK, thanks for that clarification. Now, what I would do is see what each thread is typically waiting for, by taking stackshots. That way you can see how much time is spent in I/O, how much in synchronization, and if you're lucky, in other stuff that you don't really need to do.
Mike Dunlavey
Getting rid of all the threads sounds like the right thing. It should be memory-access bound.
Stephan Eggermont
I was actually just talking to someone about the threads, and apparently turning them off doesn't make a major difference, but it's still a noticeable improvement, especially with very large sets of data.
Brendan Long
+1  A: 

You may try using an upgradeable read/write lock (maybe it's called an upgradeable shared lock or the like; I do not know what Java provides): use a single RWLock for the whole tree. Before traversing the B-Tree, you acquire the read (shared) lock and release it when done (one acquire and one release in the add method, not more).

At the point where you have to modify the B-Tree, you acquire the write (exclusive) lock (or "upgrade" from read to write lock), insert the node and downgrade to read (shared) lock.

With this technique the synchronization for checking and inserting the head node can also be removed!

It should look something like this:

public void add(String sortedWord, String word) {

    lock.read();

    if (head == null) {
        lock.upgrade();
        head = new TreeNode(sortedWord, word);
        lock.downgrade();
        lock.unlock();
        return;
    }

    TreeNode current = head, previous = null;
    while (current != null) {

        if (current.getSortedWord().equals(sortedWord)) {
            lock.upgrade();
            current.add(word);
            lock.downgrade();
            lock.unlock();
            return;
        }

        .. more tree traversal, do not touch the lock here ..
        ...

    }

    if (previous.compareTo(sortedWord) > 0) {
        lock.upgrade();
        previous.setLeft(sortedWord, word);
        lock.downgrade();
    }
    else {
        lock.upgrade();
        previous.setRight(sortedWord, word);
        lock.downgrade();
    }

    lock.unlock();
}

Unfortunately, after some googling I could not find a suitable "upgradeable" rwlock for Java. ReentrantReadWriteLock is not upgradeable; however, instead of upgrading you can unlock the read lock, then lock the write lock, and (very important) re-check the condition that led to these lines (e.g. if( current.getSortedWord().equals(sortedWord) ) {...}). This is important because another thread may have changed things between the read unlock and the write lock.
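With ReentrantReadWriteLock, the re-check for the anagram case could be sketched like this (rwl is a ReentrantReadWriteLock field; TreeNode, getSortedWord() and add() are the ones from the question):

import java.util.concurrent.locks.ReentrantReadWriteLock;

private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

// called while holding the read lock, when 'current' looked like the matching node;
// returns true if the word was added, false if the caller has to re-traverse
private boolean addWithRecheck(TreeNode current, String sortedWord, String word) {
    rwl.readLock().unlock();     // must release the read lock before taking the write lock
    rwl.writeLock().lock();      // other threads may modify the tree in this gap!
    try {
        // re-check the condition that was observed under the read lock
        if (current.getSortedWord().equals(sortedWord)) {
            current.add(word);
            return true;
        }
        return false;            // things changed in the gap: the caller must re-traverse
    } finally {
        rwl.readLock().lock();   // downgrade: re-acquire the read lock before giving up the write lock
        rwl.writeLock().unlock();
    }
}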

For details, check this question and its answers.

In the end, the traversal of the B-tree will run in parallel. Only when a target node is found does the thread acquire an exclusive lock (and other threads will block only for the duration of the insertion).

frunsi
That's a good idea..
Brendan Long
Would it be safe to do `synchronized(OtherLock){ this.readLock().unlock(); this.writeLock().lock(); }` without rechecking my conditions?
Brendan Long
Thanks this sped it up a little.
Brendan Long
I switched it to having an Object named writeLock and doing: `synchronized(writeLock){ do_things(); this.readLock().unlock(); return; }` and it seems to work :)
Brendan Long
This is not correct. Even though the Java interface (class ReadWriteLock or similar) seems to have two independent locks, that's not the case. Read and write locks are not separate things! In your case, read locks may still be active in the synchronized section. But the concept is that when acquiring a write lock, NO readers can be active: acquiring the write lock waits until all read locks have been released.
frunsi
Looks like the only safe way is (whenever you need to write): `rwl.readLock().unlock(); /* short unlocked time here: other threads may modify the tree now */ rwl.writeLock().lock(); if( condition is still true ) { .. do write .. has_written = true; } rwl.writeLock().unlock(); if( has_written ) return; else ...`
frunsi
I realize my `synchronized(writeLock){ do_things(); this.readLock().unlock(); return; }` code doesn't work (the output was terrible), but wouldn't `synchronized(OtherLock){ this.readLock().unlock(); this.writeLock().lock(); do_things(); this.writeLock().unlock(); }` work? Assuming I ensure that every thread has to lock OtherLock before getting writeLock()?
Brendan Long
Not sure, but this seems to be dangerous. Maybe you should just simplify the add() (remove all the locks and make the method synchronized). Then profile your code, check where the real bottlenecks are. Insertion into a B-Tree is not well suited for parallelization.
frunsi
+2  A: 
Erik
I tried code similar to that one but it slowed it down significantly. The answer that said to read lock the whole tree and then write lock the whole tree when I need to add actually worked pretty well (it always gave the correct answer, even with 4 threads and a list of 479,000 words). Fixing the compareTo() method sounds like a good idea..
Brendan Long
+1  A: 

You seem to have implemented a Binary Search Tree, not a B-Tree.

Anyway, have you considered using a ConcurrentSkipListMap? This is an ordered data structure (introduced in Java 6), which should have good concurrency.
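For this particular problem (grouping anagrams under their sorted-letters key), a sketch could look like this:

import java.util.Arrays;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentSkipListMap;

// sorted word -> all words that are anagrams of it, kept in key order
private final ConcurrentSkipListMap<String, Queue<String>> anagrams =
        new ConcurrentSkipListMap<String, Queue<String>>();

public void add(String word) {
    char[] letters = word.toCharArray();
    Arrays.sort(letters);
    String key = new String(letters);

    Queue<String> list = anagrams.get(key);
    if (list == null) {
        Queue<String> fresh = new ConcurrentLinkedQueue<String>();
        list = anagrams.putIfAbsent(key, fresh);   // only one thread wins the race for a new key
        if (list == null) {
            list = fresh;
        }
    }
    list.add(word);
}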

Neil