I was asked this interview question recently:

You're given an array that is almost sorted, in that each of the N elements may be misplaced by no more than k positions from the correct sorted order. Find a space-and-time efficient algorithm to sort the array.

I have an O(N log k) solution as follows.

Let's denote by arr[i..j) the elements of the array from index i (inclusive) to index j (exclusive).

  • Sort arr[0..2k)
    • Now we know that arr[0..k) are in their final sorted positions...
    • ...but arr[k..2k) may still be misplaced by k!
  • Sort arr[k..3k)
    • Now we know that arr[k..2k) are in their final sorted positions...
    • ...but arr[2k..3k) may still be misplaced by k
  • Sort arr[2k..4k)
  • ....
  • Until you sort arr[ik..N); then you're done!
    • This final step may be cheaper than the other steps, since fewer than 2k elements may be left

In each step, you sort at most 2k elements in O(k log k), putting at least k elements in their final sorted positions at the end of each step. There are O(N/k) steps, so the overall complexity is O(N log k).
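A minimal in-place sketch of the scheme above in Python (the function name and the early-exit handling of the last window are my own choices):

```python
def sort_almost_sorted(arr, k):
    """Sort arr in place, assuming every element is at most k
    positions away from its final sorted position."""
    n = len(arr)
    if k <= 0 or n <= 1:
        return
    i = 0
    while True:
        j = min(i + 2 * k, n)
        arr[i:j] = sorted(arr[i:j])  # sort a window of at most 2k: O(k log k)
        if j == n:                   # sorted through the end: done
            break
        i += k                       # the first k of the window are now final
```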

My questions are:

  • Is O(N log k) optimal? Can this be improved upon?
  • Can you do this without (partially) re-sorting the same elements?
+4  A: 

Since k is apparently supposed to be pretty small, an insertion sort is probably the most obvious and generally accepted algorithm.

In an insertion sort on random elements, you have to scan through N elements, and you have to move each one an average of N/2 positions, giving ~N*N/2 total operations. The "/2" constant is ignored in a big-O (or similar) characterization, giving O(N²) complexity.

In the case you're proposing, the expected number of operations is ~N*k/2; but since k is a constant, the whole k/2 term is ignored in a big-O characterization, so the overall complexity is O(N).
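For reference, this is the plain insertion sort being discussed; nothing in the code is k-specific, the O(N*k) bound comes entirely from the input:

```python
def insertion_sort(arr):
    """In-place insertion sort. On an input where every element is at
    most k slots from its sorted position, the inner while loop runs
    O(k) times per element, so the total work is O(N*k)."""
    for i in range(1, len(arr)):
        x = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > x:  # shift larger elements right
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = x
```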

Jerry Coffin
`k` is not guaranteed to be a constant, so this is really `O(Nk)`. If `k` is a constant, though, you're right that it's `O(N)`.
polygenelubricants
+11  A: 
Norman Ramsey
Can you explain more of what he said instead of just linking to it? References in answers are awesome, but substantive content on stackoverflow itself is even awesomer!
polygenelubricants
Also, I haven't read his definition of what "almost sorted" means, but I've seen variants that define it as "at most `k` elements are misplaced". Here, all `N` elements may be misplaced (bounded by `k`).
polygenelubricants
Well, even in the real world, when the input sizes grow large enough, the asymptotics matter more than the constants. :-) Insertion sort has a very good constant, but the fact that O(n log k) is asymptotically better than O(nk) *can* matter — for example, what if k ≈ √n as n grows large? (It also depends on what the interviewer was looking for. :p)
ShreevatsaR
I could not help but note that Bubble Sort would also have an O(nk) complexity. It's not that often that it can be cited :)
Matthieu M.
This answers neither of the questions asked! Plus the link is pretty useless. -1 till you edit it to make it more informative, professor.
Moron
@polygenelubriants, @Moron: I don't know much more than what I said already, but what little more I know, I've added.
Norman Ramsey
@Norman: Perhaps you can actually point to the paper/book chapter which has the claim about almost sorted arrays? Just a link to the homepage is practically useless. Also, just saying insertion sort will nail it is useless, if k = sqrt(n) for instance. I really don't understand why this answer has so many votes.
Moron
@Norman: There is a big range of values k can take, just saying if k is not 'small' sort the whole thing is not good enough. Take k = log(n) for instance.
Moron
@Moron: If k = log n then k is small. log base 2 of one million is just 20. @Everyone: SO is a *programming* site, not a *CS theory* site!
Norman Ramsey
@Norman: Then take k = (log(n))^20. 20^20 is huge compared to 20*log(20), even while programming and not just in theory.
Moron
While SO is a programming site, I think questions still deserve correct answers. For example, we ought not to say that all algorithms are O(1), even though in programming all running times encountered are bounded by a constant (like 10^1000). More to the point here, *whatever* the constant of insertion sort, there is some sufficiently large k after which insertion sort is no longer faster, and we cannot "may as well" sort the whole array. (I really doubt, even with a trillion elements (k=40) whether insertion sort is faster.)
ShreevatsaR
+2  A: 

Your solution is a good one if k is large enough. There is no better solution in terms of time complexity: each element might be out of place by k positions, which means you need to learn log₂ k bits of information to place it correctly, which means you need at least log₂ k comparisons per element, so the complexity has to be at least O(N log k).
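The bit-counting claim can be made concrete with a standard counting sketch (my addition, not from the answer): split the array into N/k disjoint blocks of k; permuting each block independently keeps every element within k of its place, so a correct comparison sort must distinguish at least (k!)^(N/k) inputs, which requires

```latex
\log_2\left((k!)^{N/k}\right) = \frac{N}{k}\,\log_2 k! = \Theta\!\left(\frac{N}{k}\cdot k\log k\right) = \Theta(N\log k)
```

comparisons in the worst case, matching the O(N log k) upper bound.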

However, as others have pointed out, if k is small, the constant terms are going to kill you. Use something that's very fast per operation, like insertion sort, in that case.

If you really wanted to be optimal, you'd implement both methods, and switch from one to the other depending on k.

Rex Kerr
+3  A: 

If using only the comparison model, O(n log k) is optimal. Consider the case when k = n.

To answer your other question: yes, it is possible to do this without re-sorting, by using heaps.

Use a min-heap of 2k elements. Insert the first 2k elements, then repeatedly remove the min and insert the next element of the array; when the array is exhausted, drain the heap.

This guarantees O(n log k) time and O(k) space and heaps usually have small enough hidden constants.
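A sketch of this with Python's heapq (the 2k window follows the answer as written; a smaller window also works, as a comment below suggests):

```python
import heapq

def sort_k_sorted(arr, k):
    """Return a sorted copy in O(n log k) time and O(k) extra space.
    The next element of the sorted output always lies within the first
    2k not-yet-emitted elements, so the min of a heap over that sliding
    window is always the correct next element."""
    if k <= 0:
        return list(arr)
    heap = list(arr[:2 * k])             # first window (or whole array, if shorter)
    heapq.heapify(heap)                  # O(k)
    out = []
    for x in arr[2 * k:]:
        out.append(heapq.heappop(heap))  # emit current minimum: O(log k)
        heapq.heappush(heap, x)          # slide the window forward
    while heap:                          # drain the final window
        out.append(heapq.heappop(heap))
    return out
```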

Moron
+1. I also came up with the min-heap approach (can't you just limit the size to `k` instead of `2k`?), and was told to improve it so that it doesn't use the extra space.
polygenelubricants
@polygenelubricants: You can do this in-place. Start from the far end, and use a max-heap instead of a min-heap. Heapify that final block of 2k elements in-place. Store the first extracted element in a variable; subsequent elements go in the positions vacated immediately before the final block of 2k (which contains the heap structure), similar to regular heapsort. When only 1 block remains, heapsort it in place. A final O(n) pass is needed to "rotate" the final block back to the initial block. The rotation is not trivial but can be done in O(n) and O(1) space.
j_random_hacker
@polygenelubricants: Strange I seem to have missed your comment to this answer! @j_random_hacker: Seems right.
Moron
BTW @Moron, I really like your argument that O(n log k) is optimal: "Consider k = n". Doesn't get much simpler than that!
j_random_hacker
@j_random_hacker: Yes :-) btw, your solution to the in-place problem was good!
Moron
A: 

I think you can use a quicksort-like method to get it down to K log N.

If you do a partition step but only consider elements within K of the pivot then you have the same guarantee as you do in quicksort.

  1. the largest N/2 elements are in the top half of the array
  2. the smallest N/2 elements are in the lower half of the array
  3. the pivot element is greater than the elements in the lower half of the array and less than the elements in the upper part of the array.

This leads to K steps for log N levels.

Cole
I don't think this will work. If you have `N` say a billion elements, and `k` is small like say 2, you're saying that you don't even have to look at all `N` elements at all? Not even once?
polygenelubricants
That does seem a bit odd... But I think that it still holds. If K is 2 then having a[i]'s two upper neighbors in place and a[i]'s two lower neighbors in place implies that a[i] is in place as well. You don't need to see what's in a[i].
Cole
Does the property that elements are within k of their sorted position hold at each step?
Moron
@Moron: if you're careful with the partitioning then the property should hold. @polygenelubricants: You're right. The time complexity is N*K, not K log N. At stage i you have 2^i * K steps for log N stages. So this would actually be slower than your method. And the partition step is not correct as I've described it.
Cole