I'm using Pthreads to create a new thread for each partition after the list is split into the right and left halves (less than and greater than the pivot). I do this recursively until I reach the maximum number of allowed threads.

When I use printfs to follow what goes on in the program, I can clearly see that each thread is doing its delegated work in parallel. However, using a single process is always the fastest. As soon as I use more threads, the time it takes to finish almost doubles, and it keeps increasing with the number of threads.

I am allowed to use up to 16 processors on the server I am running it on.

The algorithm goes like this: split the array into right and left halves by comparing the elements to the pivot. Start a new thread for each half and wait until the threads join back. If there are more available threads, they can create more recursively. Each thread waits for its children to join.

Everything makes sense to me, and sorting works perfectly well, but more threads makes it slow down immensely.

I tried setting a minimum number of elements per partition for a thread to be started (e.g. 50000).

I tried an approach where, when a thread is done, it allows another thread to be started. This led to hundreds of threads starting and finishing throughout the run, and I think the overhead was way too much. So I got rid of that: once a thread was done executing, no new thread was created. That got me a little more speedup, but it was still a lot slower than a single process.

The code I used is below.

http://pastebin.com/UaGsjcq2

Does anybody have any clue as to what I could be doing wrong?

+5  A: 

Starting a thread has a fair amount of overhead. You'd probably be better off creating a threadpool with some fixed number of threads, along with a thread-safe queue to queue up jobs for the threads to do. The threads wait for an item in the queue, process that item, then wait for another item. If you want to do things really correctly, this should be a priority queue, with the ordering based on the size of the partition (so you always sort the smallest partitions first, to help keep the queue size from getting excessive).

This at least reduces the overhead of starting the threads quite a bit -- but that still doesn't guarantee you'll get better performance than a single-threaded version. In particular, a quick-sort involves little enough work on the CPU itself that it's probably almost completely bound by the bandwidth to memory. Processing more than one partition at a time may hurt cache locality to the point that you lose speed in any case.

Jerry Coffin
+1 - though I have suspicions about the cache locality issue. Basically, so long as each thread is handling a substantial partition, most accesses are sequential, so there should be little scope for bad cache handling. That said, 50,000 integers is a bit on the small side (fits easily in a CPU cache, far too small to fully exploit outer layers of cache); the page size is probably the real issue, though I have little idea how big that is these days. Still, if your memory bandwidth is fully utilized, multithreading cannot help - and could mean more code wasting cache space.
Steve314
+1 - This sounds great --- as if it's borne from years of experience.
Jacob
@Jacob:well, I've been doing this stuff for a long time anyway -- I just hope it's really years of experience, not one year of experience repeated a whole bunch of times...
Jerry Coffin
+1  A: 

First guess would be that creating, destroying, and especially syncing your threads is going to eat up any possible gain you might receive, depending on just how many elements you are sorting. I'd actually guess that it would take quite a long while to make up the overhead, and it probably won't ever be made up.

Because of the way you have your sort, you have one thread waiting for another, which waits for another... you aren't really getting all that much parallelism to begin with. You'd be better off using a more linear sort, perhaps something like a radix sort, that splits the data further among more threads. That still has one thread waiting on others a lot, but at least the threads get to do more work in the meantime. Still, I don't think threads are going to help much even with this.

Michael Dorgan
+1  A: 

I just had a quick look at your code, and I have a remark: why are you using a lock? If I understand correctly, what you are doing is something like:

quickSort(array)
{
    left, right = partition(array);
    newThread(quickSort(left));
    newThread(quickSort(right));
}

You shouldn't need a lock. Normally each call to quicksort does not access the other part of the array, so no sharing is involved.

mathk
Maybe he's returning indices to an existing array instead of creating a new subarray.
Jacob
The lock is for incrementing the global variable which keeps track of how many threads are used.
Murat Ayfer
I bet it would be better to just add a recursion-level argument to the algorithm and create a new thread if and only if the recursion level is not excessively deep. This way you aren't accompanying every recursive call with a lock - which at a lower level means all your threads are in contention for access to your thread_count variable. I'd try 1 level of recursion initially, even though this only makes 2 threads rather than your maximum.
Brian
Why have a global variable for that? You can do:

    quickSort(array, maxThread)
    {
        ...
        newThread(quickSort(left, floor(maxThread / 2)));
        newThread(quickSort(right, maxThread - floor(maxThread / 2)));
        ...
    }
mathk
A: 

Unless each thread is running on a separate processor or core, they will not truly run concurrently, and the context-switch time will be significant. The number of threads should be restricted to the number of available execution units, and even then you have to trust that the OS will distribute them to separate processors/cores, which it may not do if those are also being used for other processes.

Also you should use a static thread pool rather than creating and destroying threads dynamically. Creating/destroying a thread includes allocating/releasing a stack from the heap, which is non-deterministic and potentially time-consuming.

Finally, are the 16 processors on the server real or VMs? And are they exclusively allocated to your process?

Clifford