If I have a sorted list (sorted with, say, quicksort) and I have a lot of values to add, is it better to suspend sorting, add them to the end, and then sort, or to use a binary chop to place each item in its correct position as it is added? Does it make a difference whether the items are random or already more or less in order?

+6  A: 

Usually it's far better to use a heap. In short, it splits the cost of maintaining order between the pusher and the picker: both operations are O(log n) per item, instead of the O(n log n) total of most other solutions.
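
A minimal sketch of the heap approach in C++ (using std::priority_queue with std::greater to get a min-heap; the values are illustrative):

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main()
{
    // min-heap: top() always yields the smallest remaining element
    std::priority_queue<int, std::vector<int>, std::greater<int> > heap;

    // pusher side: each push is O(log n)
    int values[] = { 42, 7, 19, 3, 25 };
    for (int i = 0; i < 5; ++i)
        heap.push(values[i]);

    // picker side: each pop is O(log n); items come out in sorted order
    while (!heap.empty())
    {
        std::cout << heap.top() << '\n';
        heap.pop();
    }
    return 0;
}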

Javier
That is especially good advice if the list is some kind of priority queue. Google for weak heaps in that case.
DR
A: 

Inserting an item into a sorted list is O(log n), while sorting a list is O(n log n), which would suggest that it's always better to sort first and then insert.

But remember, big 'O' only describes how the speed scales with the number of items; it might be that for your application an insert in the middle is expensive (e.g. if the list is a vector), and so appending and sorting afterwards might be better.
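
For illustration, a sketch of the binary-chop insert in C++ (std::upper_bound does the O(log n) chop; on a vector the insert itself still shifts elements, which is the expensive part mentioned above):

#include <algorithm>
#include <vector>

// Insert value into an already-sorted vector, keeping it sorted.
// The binary chop (upper_bound) is O(log n); vector::insert then
// shifts the tail of the array, which is O(n).
template<typename T>
void SortedInsert(std::vector<T> & v, const T & value)
{
    typename std::vector<T>::iterator pos =
        std::upper_bound(v.begin(), v.end(), value);
    v.insert(pos, value);
}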

Martin Beckett
Inserting into a sorted list is O(log n). Inserting into a hash is O(1).
bmdhacks
Ok, you fixed your notation, but now your first statement is incorrect. Sorting and inserting are the same speed. Sorting is O(N log N), and inserting is performing an O(log N) operation N times, thus being O(N log N).
bmdhacks
But it is a different N: if you only need to insert 10 items into a million, then 10 * (log 1M) beats 10 + (1M log 1M). P.S. Sorry, I left you a comment thanking you for spotting the typo, but it seems to have gone?
Martin Beckett
Fair enough. Technically Big-O doesn't care about the size of N, only Big-Omega does, but only computer-science professors probably care. Thanks for putting up with my scrutiny.
bmdhacks
And most people assume that O() tells you everything about the speed. Building pyramids is O(n) but is still a lot slower than sorting their heights!
Martin Beckett
If it was a vector, finding where to insert an item and inserting it is O(N). But appending and sorting afterwards for just one element is still worse - O(N log N).
Greg Rogers
@Martin Beckett: Unless you happen to build a lots of pyramids ;)
Maciej Piechotka
A: 

It's about the same. Inserting an item into a sorted list is O(log N), and doing this for every element in the list, N, (thus building the list) would be O(N log N) which is the speed of quicksort (or merge sort which is closer to this approach).

If you instead inserted them at the front, each insert would be O(1), but the quicksort afterwards would still be O(N log N).

I would go with the first approach, because it has the potential to be slightly faster. If the initial size of your list, N, is much greater than the number of elements to insert, X, then the insert approach is O(X log N). Sorting after inserting to the head of the list is O(N log N). If N=0 (i.e. your list is initially empty), inserting in sorted order and sorting afterwards take the same time.

bmdhacks
Not to be picky, but N is the number of elements to insert, so the last paragraph of your answer does not make too much sense to me! Did you mean, "if N is not too large"?
Remo.D
Edited to clarify after Remo.D's comment.
bmdhacks
Paragraph 2 is wrong in some cases. Doing a quicksort on an almost sorted list approaches O(n^2) rather than O(n log n).
Tony BenBrahim
@Tony that depends on how you pick your pivot....
Mike Stone
A: 

If the list is a) already sorted, and b) dynamic in nature, then inserting into a sorted list should always be faster (find the right place (O(n)) and insert (O(1))).

However, if the list is static, then a shuffle of the remainder of the list has to occur (O(n) to find the right place and O(n) to slide things down).

Either way, inserting into a sorted list (or something like a Binary Search Tree) should be faster.

O(n) + O(n) should always be faster than O(n log n).
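
A sketch of the dynamic case in C++, assuming a std::list (the linear scan for the position is the O(n) part; the insert itself is O(1)):

#include <list>

// Walk the list to find the insertion point (O(n)),
// then insert the new value there (O(1) for a linked list).
template<typename T>
void InsertSorted(std::list<T> & l, const T & value)
{
    typename std::list<T>::iterator it = l.begin();
    while (it != l.end() && *it < value)
        ++it;
    l.insert(it, value);
}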

warren
The insert in a dynamic construct such as a linked list is still O(1) *per insert*. So yes, overall that adds up to O(N) - but it's not multiplicative, it's additive (i.e. 2 times O(n), not O(n^2)).
warren
inserting should be O(log(N)) if you do it right and have relatively evenly distributed data
tloach
+3  A: 

In principle, it's faster to create a tree than to sort a list. The tree inserts are O(log(n)) for each insert, leading to overall O(n*log(n)). Sorting is O(n*log(n)).

That's why Java has TreeMap and TreeSet (in addition to the ArrayList and LinkedList implementations of List).

  • A TreeSet keeps things in object comparison order. The key is defined by the Comparable interface.

  • A LinkedList keeps things in the insertion order.

  • An ArrayList uses more memory but is faster for some operations.

  • A TreeMap, similarly, removes the need to sort by a key. The map is built in key order during the inserts and maintained in sorted order at all times.

However, for some reason, the Java implementation of TreeSet is quite a bit slower than using an ArrayList and a sort.

[It's hard to speculate as to why it would be dramatically slower, but it is. It should be slightly faster by one pass through the data. This kind of thing is often the cost of memory management trumping the algorithmic analysis.]
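
For comparison, a rough C++ analogue of the two approaches (std::set is a red-black tree like TreeSet; std::vector plus std::sort plays the ArrayList role; this is only a sketch for timing, not a rigorous benchmark):

#include <algorithm>
#include <cstdlib>
#include <set>
#include <vector>

int main()
{
    const int count = 500000;

    // Tree approach: order is maintained on every insert, O(log n) each.
    // (Like TreeSet, std::set silently drops duplicate values.)
    std::set<int> tree;
    for (int i = 0; i < count; ++i)
        tree.insert(std::rand());

    // Array approach: append everything (amortized O(1) each),
    // then one bulk O(n log n) sort.
    std::vector<int> array;
    array.reserve(count);
    for (int i = 0; i < count; ++i)
        array.push_back(std::rand());
    std::sort(array.begin(), array.end());

    return 0;
}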

S.Lott
I'd be careful saying a tree is faster than a list. It really depends on the size of the input and the tree implementation used.
hazzen
Run some speed tests and you will see that this is not the case. TreeSet vs. ArrayList: ArrayList was ~2x faster to add 500k random numbers, sort, and dump them into another list. If we don't dump them into another list, ArrayList wins by ~1.6x.
hazzen
TreeSet and TreeMap are essentially the same class; a TreeSet<E> is a TreeMap<E, Object> with the value set to a singleton object on insert. The times are almost identical, and still ~2x slower than an ArrayList solution.
hazzen
I said that insert all into an ArrayList + Collections.sort is about 2x faster than just insert all into a Tree[Set|Map]. This is for large numbers of values. The difference is still about 2x for small numbers of values, but 1ms vs. 2ms doesn't really matter.
hazzen
The reason for the speed difference is that ArrayList is implemented as a single array, while the tree map is a linked structure with a separate node object for each entry. Accessing arrays is much faster, and the JVM can optimize array access better than object access (register reuse, better cache locality).
ddimitrov
A: 

You should add them to the end first and then use a radix sort; this should be optimal.

http://en.wikipedia.org/wiki/Radix_sort#Efficiency
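
For the integer case, a hedged sketch of a least-significant-digit radix sort (one byte per pass, a counting sort within each pass; assumes 32-bit unsigned ints and is illustrative rather than tuned):

#include <cstddef>
#include <vector>

// LSD radix sort for 32-bit unsigned ints: four passes, one byte
// (256 buckets) per pass. Each pass is a stable counting sort, so
// the whole thing is O(n) with a constant factor of 4.
void RadixSort(std::vector<unsigned int> & v)
{
    std::vector<unsigned int> buffer(v.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        // histogram of the current byte
        std::size_t count[256] = { 0 };
        for (std::size_t i = 0; i < v.size(); ++i)
            ++count[(v[i] >> shift) & 0xFF];

        // exclusive prefix sums turn counts into bucket offsets
        std::size_t total = 0;
        for (int b = 0; b < 256; ++b)
        {
            std::size_t c = count[b];
            count[b] = total;
            total += c;
        }

        // stable scatter into the buffer, then swap back
        for (std::size_t i = 0; i < v.size(); ++i)
            buffer[count[(v[i] >> shift) & 0xFF]++] = v[i];
        v.swap(buffer);
    }
}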

Peter Parker
A: 

If this is .NET and the items are integers, it's quicker to add them to a Dictionary (or, if you're on .NET 3.0 or above, use the HashSet if you don't mind losing duplicates). This gives you automagic sorting.

I think that strings would work the same way as well. The beauty is you get O(1) insertion and sorting this way.

Mike Brown
Dictionary<T> is not a sorted collection. SortedDictionary<T> is.
Orlangur
A: 

(If the list you're talking about is something like C#'s List<T>.) Adding a few values at the right positions in a sorted list that already has many values will take fewer operations; but if the number of values being added becomes large, re-sorting will take fewer.

I would suggest using not a list but some more suitable data structure in your case, like a binary tree, for example: a sorted data structure with minimal insertion time.

Orlangur
+5  A: 

If you add enough items that you're effectively building the list from scratch, you should be able to get better performance by sorting the list afterwards.

If items are mostly in order, you can tweak both incremental update and regular sorting to take advantage of that, but frankly, it usually isn't worth the trouble. (You also need to be careful of things like making sure some unexpected ordering can't make your algorithm take much longer; q.v. naive quicksort.)

Both incremental update and regular list sort are O(N log N), but you can get a better constant factor sorting everything afterwards. (I'm assuming here that you've got some auxiliary data structure so your incremental update can access list items faster than O(N)...) Generally speaking, sorting all at once has a lot more design freedom than maintaining the ordering incrementally, since an incremental update has to maintain a complete order at all times, but an all-at-once bulk sort does not.

If nothing else, remember that there are lots of highly-optimized bulk sorts available.

comingstorm
A: 

Inserting an item into a sorted list takes O(n) time, not O(log n) time. You have to find the place to put it, taking O(log n) time. But then you have to shift over all the elements after it - taking O(n) time. So inserting while maintaining sorted-ness is O(n^2), whereas inserting them all and then sorting is O(n log n).

Depending on your sort implementation, you can get even better than O(n log n) if the number of inserts is much smaller than the list size. But if that is the case, it doesn't matter either way.

So do the insert all and sort solution if the number of inserts is large, otherwise it probably won't matter.

hazzen
I think you have a totally incorrect view of the O notation. Inserting an item into a list is not O(n); it is always O(1) in algorithm theory. Moving millions of bytes in memory may not be a constant operation, but O notation is not about the time it takes, but about the complexity, which is 1.
Mecki
If it is not a constant operation, it isn't O(1). Period. The code for an insert into an array-based list is: for (i = last; i > idx; --i) { list[i + 1] = list[i]; } list[idx] = item; I don't think you will debate that that is O(n). You can't just ignore part of your code in Big O.
hazzen
It is O(1) if it is bounded by some constant for any N. There are ways to organize an array so insertion is efficient, such as by making it out of blocks that have a certain amount of empty space.
Mike Dunlavey
+3  A: 

If you're adding in bunches, you can use a merge sort. Sort the list of items to be added, then copy from both lists, comparing items to determine which one gets copied next. You can even copy in place if you resize your destination array and work from the end backwards.

The efficiency of this solution is O(n+m) + O(m log m) where n is the size of the original list, and m is the number of items being inserted.

Edit: Since this answer isn't getting any love, I thought I'd flesh it out with some C++ sample code. I assume that the sorted list is kept in a linked list rather than an array. This changes the algorithm to look more like an insertion than a merge, but the principle is the same.

#include <algorithm>
#include <list>
#include <vector>

// Note that itemstoadd is modified (sorted) as a side effect of this function
template<typename T>
void AddToSortedList(std::list<T> & sortedlist, std::vector<T> & itemstoadd)
{
    std::sort(itemstoadd.begin(), itemstoadd.end());
    typename std::list<T>::iterator listposition = sortedlist.begin();
    typename std::vector<T>::iterator nextnewitem = itemstoadd.begin();
    // Once the new items run out, the rest of sortedlist is already in
    // place, so looping until itemstoadd is exhausted is sufficient (and
    // avoids dereferencing nextnewitem past the end).
    while (nextnewitem != itemstoadd.end())
    {
        if ((listposition == sortedlist.end()) || (*nextnewitem < *listposition))
            sortedlist.insert(listposition, *nextnewitem++);  // O(1) insert before listposition
        else
            ++listposition;
    }
}
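
A hypothetical call site (the container names are illustrative):

std::list<int> sorted;      // assumed to be sorted already
std::vector<int> incoming;  // new values, in any order
// ... fill both containers ...
AddToSortedList(sorted, incoming);  // incoming is sorted as a side effect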
Mark Ransom
+1  A: 

I'd say, let's test it! :)

I tried with quicksort, but sorting an almost sorted array with quicksort is... well, not really a good idea. I tried a modified one, cutting over to insertion sort below 7 elements. Still, horrible performance. I switched to merge sort. It might need quite a lot of memory for sorting (it's not in-place), but the performance is much better on sorted arrays and almost identical on random ones (the initial sort took almost the same time for both; quicksort was only slightly faster).

This already shows one thing: the answer to your question depends strongly on the sorting algorithm you use. If it has poor performance on almost sorted lists, inserting at the right position will be much faster than adding at the end and then re-sorting; and merge sort might be no option for you, as it might need way too much external memory if the list is huge. BTW, I used a custom merge sort implementation that only uses half the external storage of the naive implementation (which needs as much external storage as the array itself).
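
The half-memory trick can be sketched like this (this is not Mecki's actual code, just one way to get the 1/2 auxiliary storage: copy only the left half to the buffer and merge it back against the right half, which stays in place):

#include <cstddef>
#include <vector>

// Merge step that needs only n/2 auxiliary storage: the left half is
// copied out to the buffer, then merged back against the right half.
template<typename T>
void MergeSortHalf(T * a, std::size_t n, T * buffer)
{
    if (n < 2)
        return;
    std::size_t half = n / 2;
    MergeSortHalf(a, half, buffer);
    MergeSortHalf(a + half, n - half, buffer);

    for (std::size_t i = 0; i < half; ++i)   // copy the left half out
        buffer[i] = a[i];

    std::size_t i = 0, j = half, k = 0;
    while (i < half && j < n)                // merge back into a
        a[k++] = (buffer[i] <= a[j]) ? buffer[i++] : a[j++];
    while (i < half)                         // drain the buffer
        a[k++] = buffer[i++];
    // anything left in the right half is already in its final place
}

template<typename T>
void MergeSort(std::vector<T> & v)
{
    if (v.size() < 2)
        return;
    std::vector<T> buffer(v.size() / 2);     // half the naive auxiliary storage
    MergeSortHalf(&v[0], v.size(), &buffer[0]);
}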

If merge sort is no option and quicksort is no option for sure, the best alternative is probably heap sort.

My results are: adding the new elements simply at the end and then re-sorting the array was several orders of magnitude faster than inserting them in the right position. However, my initial array had 10 million elements (sorted) and I was adding another million (unsorted). So if you add 10 elements to an array of 10 million, inserting them correctly is much faster than re-sorting everything. The answer to your question therefore also depends on how big the initial (sorted) array is and how many new elements you want to add to it.

Mecki
A: 

At a high level, it's a pretty simple problem, because you can think of sorting as just iterated searching. When you want to insert an element into an ordered array, list, or tree, you have to search for the point at which to insert it. Then you put it in, at hopefully low cost. So you could think of a sort algorithm as just taking a bunch of things and, one by one, searching for the proper position and inserting them. Thus, an insertion sort (O(n*n)) is an iterated linear search (O(n)). Tree, heap, merge, radix, and quick sort (O(n*log(n))) can be thought of as iterated binary search (O(log(n))). It is possible to have an O(n) sort, if the underlying search is O(1) as in an ordered hash table. (An example of this is sorting 52 cards by flinging them into 52 bins.)
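
The card example as a sketch in C++ (an O(n) bin sort where the "search" for each card's bin is an O(1) array index; the deck encoding is made up for illustration):

#include <iostream>

int main()
{
    // 52 cards encoded as 0..51 (suit * 13 + rank)
    int deck[52];
    for (int i = 0; i < 52; ++i)
        deck[i] = (i * 37) % 52;   // a stand-in for a shuffled deck

    // fling each card into its bin: the "search" is an O(1) index
    int bins[52];
    for (int i = 0; i < 52; ++i)
        bins[deck[i]] = deck[i];

    // reading the bins in order yields the sorted deck: O(n) overall
    for (int i = 0; i < 52; ++i)
        std::cout << bins[i] << ' ';
    std::cout << '\n';
    return 0;
}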

So the answer to your question is, inserting things one at a time, versus saving them up and then sorting them should not make much difference, in a big-O sense. You could of course have constant factors to deal with, and those might be significant.

Of course, if n is small, like 10, the whole discussion is silly.

Mike Dunlavey