views:

165

answers:

6

In reference to fastest sort of fixed length 6 int array, I do not fully understand how this sorting network beats an algorithm like insertion sort.

Form that question, here is a comparison of the number of CPU cycles taken to complete the sort :

Linux 32 bits, gcc 4.4.1, Intel Core 2 Quad Q8300, -O2

  • Insertion Sort (Daniel Stutzbach) : 1425
  • Sorting Networks (Daniel Stutzbach) : 1080

The code used is as follows :

Insertion Sort (Daniel Stutzbach)

static inline void sort6_insertion_sort_v2(int *d){
    int i, j;
    for (i = 1; i < 6; i++) {
            int tmp = d[i];
            for (j = i; j >= 1 && tmp < d[j-1]; j--)
                    d[j] = d[j-1];
            d[j] = tmp;
    }
}

Sorting Networks (Daniel Stutzbach)

static inline void sort6_sorting_network_v1(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}

I understand that sorting networks are really good for sorting in parallel, because some of the steps are independent of the other steps. But here we are not using the parallelization.

I expect it to be faster, as it has the advantage of knowing the exact number of elements beforehand. Where and why exactly does insertion sort make unnecessary comparisons?

EDIT1:

This is the input set these codes are compared against:

int d[6][6] = {\
    {1, 2, 3, 4, 5, 6},\
    {6, 5, 4, 3, 2, 1},\
    {100, 2, 300, 4, 500, 6},\
    {100, 2, 3, 4, 500, 6},\
    {1, 200, 3, 4, 5, 600},\
    {1, 1, 2, 1, 2, 1}\
};\
A: 

I think that loop unwinding is what causing the faster results on the sort network algorithm

Shay Erlichmen
A: 

Theoretically the code could be about the same if the compiler could fully unroll the loops in the Insertion Sort. The first loop can be easily unrolled, while the second can't be unrolled that easy.

It may also be the case that, because the code is not that simple as the network sorting code, the compiler can make less optimizations. I think there are more dependencies in the insertion sort than in the network sort, which may make a big difference when the compiler tries to optimize the code (correct me if I'm wrong).

George B.
+10  A: 

But here we are not using the parallelization.

Modern CPUs can figure out when instructions are independent and will execute them in parallel. Hence, even though there's only one thread, the sorting network's parallelism can be exploited.

Where exactly does insertion sort make unnecessary comparisons?

The easiest way to see the extra comparisons is to do an example by hand.

Insertion sort:
6 5 4 3 2 1
5 6 4 3 2 1
5 4 6 3 2 1
4 5 6 3 2 1
4 5 3 6 2 1
4 3 5 6 2 1
3 4 5 6 2 1
3 4 5 2 6 1
3 4 2 5 6 1
3 2 4 5 6 1
2 3 4 5 6 1
2 3 4 5 1 6
2 3 4 1 5 6
2 3 1 4 5 6
2 1 3 4 5 6
1 2 3 4 5 6

Sorting network:
6 5 4 3 2 1
6 4 5 3 2 1
5 4 6 3 2 1
4 5 6 3 2 1 # These three can execute in parallel with the first three
4 5 6 3 1 2 #
4 5 6 2 1 3 #
4 5 6 1 2 3
1 5 6 4 2 3
1 2 6 4 5 3
1 2 3 4 5 6
1 2 3 4 5 6
Daniel Stutzbach
@Daniel: Okay, since these paths are completely different, we cannot compare them directly. Certainly, Sorting network allows us to sort in lesser number of comparisons. To state my question in a different way, **what stops us from optimizing insertion sort to use this sequence of swaps for any number of inputs?**
Lazer
Lazer: I'm afraid I don't understand. Which sequence are you referring to when you say "this sequence of swaps"? Also, did you mean to say "optimizing insertion sort" or did you intend to refer to sorting networks?
Daniel Stutzbach
@Daniel: Sorry for lack of clarity. In yet other terms, why do we use insertion sort at all if sorting networks are more *efficient*?
Lazer
@Lazer: Ah, that makes more sense. :-) Thanks for the clarification! The trouble with sorting networks is that they only work for a fixed n. Furthermore, they're only practical when n is small since you have to write out all of the compare-and-swaps by hand and there will be O(n log n) of them. They're fast in part because the code is written out and there are no loops, so the speed is part and parcel with the limitation.
Daniel Stutzbach
@Daniel: So, you mean to say there is no good way to write a program to generate the set of swaps to be performed (for network sort) for *any n*? Why do sorting networks work with fixed `n`? Why can't they be generalized?
Lazer
@Lazer: Yes, that's what I mean. :-) If an algorithm works with variable n, it needs to have some kind of loop in it somewhere. A sorting network has no loops. You could write a program to generate the swaps and then execute them, but generating the swaps will use up more time than you will save by using a sorting network. The closest you can get is to use a recursive algorithm like MergeSort or QuickSort and use a sorting network as the base case.
Daniel Stutzbach
@Daniel: thanks! I get it now.
Lazer
A: 

I think all of you questions are answered in Daniel Stutzbach answer to the original post:

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

drewk
You can't make that generalization. If your data objects are large but extracting and comparing the key is fast, comparisons are a lot cheaper than swaps. I would guess the only time swaps are cheaper is when your data elements are a simple type.
R..
A: 

I believe the amount of 'work' done in a parallel algorithm and a serial algorithm is always almost same. Only that since work gets distributed you would get outputs faster. I think you would get output convincingly faster in case when the size of input is sufficient enough to justify using parallel algorithm.

In case of insertion sort division of array amongst processors is such that it forms a pipeline, and it would take some time to fill the pipeline and then it would produce benefits of parallel algorithm.

RIPUNJAY TRIPATHI
+1  A: 

The better question is why the sorting network only outperforms insertion sort (generally a very slow sort) by ~50%. The answer is that big-O is not so important when n is tiny. As for OP's question, Daniel has the best answer.

R..