views:

7103

answers:

10

I have two lists of objects. Each list is already sorted by a property of the object that is of the datetime type. I would like to combine the two lists into one sorted list. Is the best way just to do a bubble sort or is there a smarter way to do this in Python?

+20  A: 

This is simply merging. Treat each list as if it were a stack, and repeatedly pop the smaller of the two stack heads, appending it to the result list, until one of the stacks is empty. Then append all remaining items from the non-empty stack to the result list.
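A rough Python sketch of that idea (illustrative only; index-based rather than actually popping, so the inputs are left untouched):

def merge(a, b):
    # Sketch of the merge described above: repeatedly take the smaller
    # of the two heads of the already-sorted inputs.
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])   # at most one of these slices is non-empty
    result.extend(b[j:])
    return result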

Barry Kelly
A merge sort is indeed the optimal solution.
Ignacio Vazquez-Abrams
But is it faster than using Python's built-in sort?
akaihola
http://en.wikipedia.org/wiki/Merge_sort and http://en.wikipedia.org/wiki/Merge_algorithm
John Fouhy
This is simply a merge, not a merge sort.
Glenn Maynard
A: 

Well, the naive approach (combine the two lists into one large list and sort it) has O(N*log(N)) complexity. On the other hand, if you implement the merge manually (I do not know of any ready-made code in the Python libs for this, but I'm no expert) the complexity will be O(N), which is clearly faster. The idea is described very well in the post by Barry Kelly.

Drakosha
As a point of interest, the python sorting algorithm is very good, so the performance would likely be better than O(n log n), since the algorithm often takes advantage of regularities in the input data. I would never rely on this, however.
Daniel Nadasi
Then we both agree on it :)
Drakosha
Generalized sorting (i.e. setting aside radix sorts for limited value domains) cannot be done in less than O(n log n) on a serial machine.
Barry Kelly
A: 
def compareDate(obj1, obj2):
    # cmp-style comparator (Python 2): returns -1, 0 or 1
    if obj1.getDate() < obj2.getDate():
        return -1
    elif obj1.getDate() > obj2.getDate():
        return 1
    else:
        return 0

combined = list1 + list2   # avoid naming this "list", which shadows the built-in
combined.sort(compareDate)

This sorts the combined list in place. Define your own function for comparing two objects, and pass that function to the built-in sort method.

Do NOT use bubble sort; it has horrible performance.

Josh Smeaton
Merge sort will definitely be faster, but a little more complicated if you have to implement it yourself. I *think* Python uses quicksort.
Josh Smeaton
No, Python uses timsort.
Ignacio Vazquez-Abrams
+1  A: 

Use the 'merge' step of merge sort; it runs in O(n) time.

From Wikipedia (pseudo-code):

function merge(left, right)
    var list result
    while length(left) > 0 and length(right) > 0
        if first(left) ≤ first(right)
            append first(left) to result
            left = rest(left)
        else
            append first(right) to result
            right = rest(right)
    end while
    if length(left) > 0
        append left to result
    if length(right) > 0
        append right to result
    return result
Mongoose
Wish I could mark two answers as the answer. Thanks!
apphacker
+2  A: 

This is simple merging of two sorted lists. Take a look at the sample code below, which merges two sorted lists of integers.

#!/usr/bin/env python
## merge.py -- Merge two sorted lists -*- Python -*-
## Time-stamp: "2009-01-21 14:02:57 ghoseb"

l1 = [1, 3, 4, 7]
l2 = [0, 2, 5, 6, 8, 9]

def merge_sorted_lists(l1, l2):
    """Merge two sorted lists into a single sorted list.

    Arguments:
    - `l1`: First sorted list
    - `l2`: Second sorted list
    """
    sorted_list = []

    # Copy both the args to make sure the original lists are not
    # modified
    l1 = l1[:]
    l2 = l2[:]

    while l1 and l2:
        if l1[0] <= l2[0]:       # Compare both heads
            item = l1.pop(0)     # Pop from the head
            sorted_list.append(item)
        else:
            item = l2.pop(0)
            sorted_list.append(item)

    # Add whatever remains of the non-empty list
    sorted_list.extend(l1 if l1 else l2)

    return sorted_list

if __name__ == '__main__':
    print merge_sorted_lists(l1, l2)

This should work fine with datetime objects. Hope this helps.

Baishampayan Ghose
Unfortunately this is counterproductive - normally merge would be O(n), but because you're popping from the left of each list (an O(n) operation), you're actually making it an O(n**2) process - worse than the naive sorted(l1+l2)
Brian
+10  A: 

People seem to be overcomplicating this.. Just combine the two lists, then sort them:

>>> l1 = [1, 3, 4, 7]
>>> l2 = [0, 2, 5, 6, 8, 9]
>>> l1.extend(l2)
>>> sorted(l1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

..or shorter (and without modifying l1):

>>> sorted(l1 + l2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

..easy! Plus, it's using only two built-in functions, so assuming the lists are of a reasonable size, it should be quicker than implementing the sorting/merging in a loop. More importantly, the above is much less code, and very readable.

If your lists are large (over a few hundred thousand items, I would guess), it may be quicker to use an alternative/custom sorting method, but there are likely other optimisations to be made first (e.g. not storing millions of datetime objects).

Using timeit.Timer().repeat() (which repeats the functions 1,000,000 times), I loosely benchmarked it against ghoseb's solution, and sorted(l1+l2) is substantially quicker:

merge_sorted_lists took..

[9.7439379692077637, 9.8844599723815918, 9.552299976348877]

sorted(l1+l2) took..

[2.860386848449707, 2.7589840888977051, 2.7682540416717529]
dbr
This is mostly due to a flaw in ghoseb's solution - it's actually O(n**2), so will perform worse than O(n lg(n)) sort. An O(n) merge will probably be faster than sorting, at least for sufficiently large input list (sort may well beat it for short lists).
Brian
Finally a sane answer, taking actual *benchmarking* into account. :-) --- Also, 1 line to maintain instead of 15-20 is much to be preferred.
Deestan
Sorting a very short list created by appending two lists will indeed be very fast, as the constant overheads will dominate. Try doing this for lists with several million items, or files on disk with several billion items, and you'll soon find out why merging is preferable.
Barry Kelly
@Barry: If you have "several billion items" and a speed requisite, *anything* in Python is the wrong answer.
Deestan
Actually, I've done some timings (see my answer). sorted() looks like it is indeed faster for most common cases (under a million items). This will obviously change with a more efficient implementation (eg written in C), or where operations are more time consuming (eg. disk based IO)
Brian
@Deestan: I disagree - there are times when speed will be dominated by other factors. Eg. if you're sorting data on-disk (merge 2 files), IO times will likely dominate and python's speed won't matter much, just the number of operations you do (and hence the algorithm).
Brian
The original question states that "each list [contains items] of the datetime type".. I'd worry about that long before I reimplement the sorting algorithm!
dbr
Remember to either pass a function that can compare dates within an object or override the __cmp__ method in the object. This is the same solution I gave below (though your benchmarks are very very welcome).
Josh Smeaton
Seriously? Benchmarking a sort function with a 10 entry list??
Seun Osewa
Seun: The answer doesn't specify how large the list is - it's entirely plausible the questioner (or a random Google'r) would want to sort fairly small lists.. The benchmark isn't exactly the point of my answer; I included it because it's silly to argue "x is faster than y" without providing any benchmarks at all.
dbr
+5  A: 

There is a slight flaw in ghoseb's solution, making it O(n**2), rather than O(n).
The problem is that this is performing:

item = l1.pop(0)

With linked lists or deques this would be an O(1) operation, so it wouldn't affect complexity, but since Python lists are implemented as vectors, this copies the rest of the elements of l1 one space left, an O(n) operation. Since this is done on each pass through the list, it turns an O(n) algorithm into an O(n**2) one. This can be corrected by using a method that doesn't alter the source lists, but just keeps track of the current position.
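As a rough illustration of the deque point (a hedged sketch only, not part of the benchmarks below), collections.deque makes popping from the left an O(1) operation:

from collections import deque

def merge_deque(l1, l2):
    # Sketch only: the same head-popping merge, but deque.popleft() is O(1),
    # so the overall merge stays O(n).
    d1, d2 = deque(l1), deque(l2)
    result = []
    while d1 and d2:
        result.append(d1.popleft() if d1[0] <= d2[0] else d2.popleft())
    result.extend(d1 if d1 else d2)
    return result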

I've benchmarked a corrected algorithm against a simple sorted(l1+l2), as suggested by dbr:

def merge(l1,l2):
    if not l1:  return list(l2)
    if not l2:  return list(l1)

    # Swap so that l2 contains the overall last element; the iterator over
    # l2 below then cannot be exhausted before the for loop finishes.
    if l1[-1] > l2[-1]:
        l1, l2 = l2, l1

    it = iter(l2)
    y = it.next()
    result = []

    for x in l1:
        while y < x:
            result.append(y)
            y = it.next()
        result.append(x)
    result.append(y)
    result.extend(it)
    return result

I've tested these with lists generated with

l1 = sorted([random.random() for i in range(NITEMS)])
l2 = sorted([random.random() for i in range(NITEMS)])

For various sizes of list, I get the following timings (repeating 100 times):

# items:  1000   10000 100000 1000000
merge  :  0.079  0.798 9.763  109.044 
sort   :  0.020  0.217 5.948  106.882

So in fact, it looks like dbr is right: just using sorted() is preferable unless you're expecting very large lists, even though it has worse algorithmic complexity. The break-even point is at around a million items in each source list (2 million total).

One advantage of the merge approach, though, is that it is trivial to rewrite as a generator, which will use substantially less memory (no need to build an intermediate result list).
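For what it's worth, a minimal sketch of such a generator version (same Python 2 idioms as the code above; illustrative only, and not part of the timings):

def merge_gen(l1, l2):
    # Sketch only: yields merged items lazily instead of building a list.
    if not l1 or not l2:
        for item in (l1 or l2):
            yield item
        return
    if l1[-1] > l2[-1]:      # ensure l2 holds the overall last element
        l1, l2 = l2, l1
    it = iter(l2)
    y = it.next()
    for x in l1:
        while y < x:
            yield y
            y = it.next()
        yield x
    yield y
    for y in it:
        yield y

Consumed item by item in a for loop, it never materialises the full merged list in memory.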

[Edit] I've retried this with a situation closer to the question - using a list of objects containing a field "date" which is a datetime object. The above algorithm was changed to compare against .date instead, and the sort method was changed to:

return sorted(l1 + l2, key=operator.attrgetter('date'))

This does change things a bit. The comparison being more expensive means that the number of comparisons we perform becomes more important, relative to the constant-factor speed of the implementation. This means merge makes up lost ground, surpassing the sort() method at 100,000 items instead. Comparing based on an even more complex object (large strings or lists, for instance) would likely shift this balance even more.

# items:  1000   10000 100000  1000000[1]
merge  :  0.161  2.034 23.370  253.68
sort   :  0.111  1.523 25.223  313.20

[1]: Note: I actually only did 10 repeats for 1,000,000 items and scaled up accordingly as it was pretty slow.

Brian
Thanks for the fix. It would be great if you could point out exactly what the flaw was and how you fixed it :)
Baishampayan Ghose
@ghoseb: I gave a brief description as a comment on your post, but I've now updated the answer to give more details - essentially l.pop(0) is an O(n) operation for Python lists. It's fixable by tracking the position in some other manner (or alternatively by popping from the tail instead and reversing at the end).
Brian
Can you benchmark these same tests but comparing dates, as the question requires? I'm guessing this extra method call will take up a fair amount of time, relatively speaking.
Josh Smeaton
I'd say the difference is due to the fact that sort() is implemented in C and compiled, vs. our merge(), which is being interpreted. merge() should be faster on equal terms.
Drakosha
Good point Drakosha. Shows that benchmarking is indeed the only way to know for certain.
Josh Smeaton
@Josh: Good point - I've rerun the timings using code assuming an object with a datetime attribute. This does affect the point at which merge beats out sort().
Brian
+22  A: 

is there a smarter way to do this in Python

This hasn't been mentioned, so I'll go ahead - there is a merge function in the heapq module of the Python 2.6+ standard library. If all you want to do is get things done, this might be the better idea. Of course, if you want to implement your own, the merge step of merge sort is the way to go.

>>> list1 = [1, 5, 8, 10, 50]
>>> list2 = [3, 4, 29, 41, 45, 49]
>>> from heapq import merge
>>> list(merge(list1, list2))
[1, 3, 4, 5, 8, 10, 29, 41, 45, 49, 50]
sykora
I've added a link to heapq.py. `merge()` is implemented as a pure Python function, so it is easy to port it to older Python versions.
J.F. Sebastian
+4  A: 
from datetime import datetime
from itertools import chain
from operator import attrgetter

class DT:
    def __init__(self, dt):
        self.dt = dt

list1 = [DT(datetime(2008, 12, 5, 2)),
         DT(datetime(2009, 1, 1, 13)),
         DT(datetime(2009, 1, 3, 5))]

list2 = [DT(datetime(2008, 12, 31, 23)),
         DT(datetime(2009, 1, 2, 12)),
         DT(datetime(2009, 1, 4, 15))]

list3 = sorted(chain(list1, list2), key=attrgetter('dt'))
for item in list3:
    print item.dt

The output:

2008-12-05 02:00:00
2008-12-31 23:00:00
2009-01-01 13:00:00
2009-01-02 12:00:00
2009-01-03 05:00:00
2009-01-04 15:00:00

I bet this is faster than any of the fancy pure-Python merge algorithms, even for large data. Python 2.6's heapq.merge is another story altogether.

akaihola
+7  A: 
J.F. Sebastian