views:

1589

answers:

9

Hey,

I am looking for a linked list and related algorithms implementation for Python. Everyone I ask just recommends using built in Python lists, but performance measurements indicate that list insertion and removal is a bottleneck for our application. It's trivial to implement a simple linked list, but I wonder if there is a mature library which includes some operations like sort, merge, splice, search, lower/upper bound, etc...

I know this is a dupe, but searching for python list on any search engine gives predictably poor results, with most people just saying that linked lists are unneeded in python (pfft!).

PS: I need to insert and remove from anywhere in the list, not just the ends.

OK, you asked for it: I need to maintain an ordered list of several hundred thousand entries. I will iterate over the list forwards (one by one), using a visitor on each entry, starting from the beginning or a position found by a binary search. When an entry matching a predicate is found it is removed from the list, and then, another binary search is performed on a subset of the list beginning from the removed entry's previous position, until a position determined statistically beforehand. Ignoring the error condition, the modified entry may be used to create another linked list which is spliced into the new position found through the second binary search. Iteration is continued from the position where the entry was removed. On occasion, several thousand contiguous ordered entries may be added to/removed from any place in the list. Sometimes several thousand non-contiguous entries must be searched for and removed incrementally.

python's list is unnacceptable as the cost of insertion/removal is prohibitive, and the minor gains in speed for the binary search are totally irrelevant to the total cost. Our in house tests confirm this.

if I have neglected any detail perhaps I can e-mail you a copy of my company's non-disclosure agreement and I can privately correspond with you on the matter. sarcasm.end().

+6  A: 

Here's a blog post sharing your pain. It includes an implementation of a linked list and a performance comparison.


Perhaps blist would be better, though (from here)?

The use cases where the BList is slightly slower than Python's list are as follows (O(log n) vs. O(1)):

  1. A large list that never changes length.
  2. A large lists where inserts and deletes are only at the end of the list (LIFO).

With that disclaimer out of the way, here are some of the use cases where the BLists is dramatically faster than the built-in list:

  1. Insertion into or removal from a large list (O(log n) vs. O(n))
  2. Taking large slices of large lists (O(log n) vs O(n))
  3. Making shallow copies of large lists (O(1) vs. O(n))
  4. Changing large slices of large lists (O(log n + log k) vs. O(n + k))
  5. Multiplying a list to make a large, sparse list (O(log k) vs. O(kn))

Note that it's actually implemented as a B+ tree, allowing great performance for all those operations.

Eli Bendersky
Good start, though, that's not O(1) insertion/removal. :-P But maybe it'll suit the OP's purposes well enough.
Chris Jester-Young
Unfortunately no. I found it too, and it is a big improvement, but... As an aside, it's very bizzare how everyone I ask tells me what I really need is XYZ, when I am experienced enough to know that I just need a linked list! BTW that's not aimed at you, the help is appreciated, I am just venting.
@unknown: but just implementing a linked list in Python is easy. Really. Did you look at that blog post? The guy implements it. Do you need a pure Python solution, or as C extension?
Eli Bendersky
@Eli: I know it's simple, but I'd like to avoid writing algorithms from scratch. That being said, I am pretty resigned to doing just that, I don't think I will find what I'm looking for. BTW, Great pygame tutorials Eli, I'm using your code to teach my cousin some game programming!
+2  A: 

Python's deque class is 0(1) for insertion and deletion at the beginning and end of the list.

Daniel Roseman
+5  A: 

There's a single-linked list here (recipe 17.14 in the Python Cookbook 1st ed) but it's hardly "mature" or rich -- it's just doing a FIFO queue so it's pretty minimal.

This recipe is a very concise C implementation of (read-only) Lisp-like cons-boxes -- just car, cdr and cons; again, not a rich type, rather a minimal one (and to use it for mutable data, as opposed to pure functional approaches, you'd need to add setcar and setcdr at least). It may be a better starting point for you simply because cons-cells are so notoriously flexible and familiar.

Some of the operations you require will be likely best done by existing Python primitives. For example, for sorting, it's hard to see how rolling your own sort can beat the performance of Python's sorted(linkedlist) (as long of course as you make the linkedlist type a Python iterable so it plays well with the rest of the language and library;-), considering the power of the timsort algorithm implemented in the Python runtime.

More generally I suggest you carefully timeit things every step along the way to consider how much the C-coded approach is really buying you (when compared to the trivial C-coded one exemplified by the recipe in the printed Cookbook whose URL I give at the start of this answer) -- that will crucially depend on the size and nature of your application's lists, so you're the one best placed to organize these benchmarks of course.

Alex Martelli
Thank you very much for a thoughtful and informative answer.
+3  A: 

Python lists are O(1) for operations at the end of the list. If you'll be doing all your inserting in a semi-sequential fashion--by analogy to C, only keeping a single pointer into the middle of the list as a "cursor" of sorts--you could save yourself a lot of effort by just using two Python lists. One list for what's before the cursor, one for what's after; moving the cursor involves pulling the next item from one list and appending it to the other. That gives you arbitrary O(1) inserting at the cursor location with far less effort and wheel reinventing than making an entire new data structure, letting you reuse a lot of the existing list functions.

For the fully general case allowing multiple references into the list, though, you're probably stuck making a linked list of some sort.

Edit: You're not seriously thinking of actually doing a "binary search" on a linked list, are you? Binary search doesn't even make sense on an intrinsically sequential data structure...

Anyway, if you're okay with linear-time searching and your insertions will always preserve the list order without re-sorting, then a simple linked list might be all you need. If you do as much searching as iterating you should consider something with fast indexing, and if resorting might be required something like a tree would be better.

camccann
insert/removes at the beginning of the list are not O(1) but O(n).
MAK
...but they're O(1) at the *end* of the list, which is exactly what I said. Do note that the list representing what comes after the cursor would be in reversed order, if that wasn't obvious.
camccann
@camcann: Sorry, I misread 'end' as 'ends'.
MAK
Please read the edit. Yes I am doing a binary search on a linked list, it happens all the time and is perfectly acceptable when the costs are understood. I should mention that the search is on a subset of the list predetermined statistically and not the entire list.
+3  A: 

" I will iterate over the list forwards (one by one), using a visitor on each entry, starting from the beginning or a position found by a binary search. When an entry matching a predicate is found it is removed from the list, and then, another binary search is performed on a subset of the list beginning from the removed entry's previous position"

It sounds like a linked list is absolutely the wrong data structure for this - doing a binary search will require random access to the list, which will mean repeatedly iterating through the elements. This is likely to be slower on a linked list than inserting and deleting items in a python list.

It sounds like the data structure you want is a skip list. Google throws up several implementations, but I cannot comment on their completeness or quality.

edit:

Another data structure that may be suitable is a threaded binary tree. this is like a regular binary tree but each leaf node points to next/previous subtree, so it can be iterated through as efficiently as a linked list. Implementing it in Python is left as an exercise for the reader (or Google).

Dave Kirby
Actually it is not slower, the binary search is performed on a subset of the list, predetrmined statistically, and our tests have indicated that the search is not a bottleneck, but insertion and deleting is.
+2  A: 

Your requirements are still (after the edit) unclear. You say that you are doing a binary search on a linked list. Please pardon my ignorance, but I was under the impression that a binary search required direct access to list components via a calculated integer index. Perhaps your "linked list" is really a tree. Can you please explain?

John Machin
It just means that rather than doing *index += someval*, you have to do something like *while index != target: index.next()*. This part of the algorithm is slower on a linked list than a python list, but as I have said repeatedly, our tests indicate that this is NOT the performance bottleneck, abd insertion and removal is. It's right there at the end of the edit.
That's absolutely not a binary search in any sense. Please don't misuse terminology, it confuses things unnecessarily.
camccann
Yes it is a binary search, please do some basic research before throwing out accusations.
@unknown (google): Testing each element sequentially, which is what you are proposing, is not a binary search the way most of us understand it. If you believe it is, please provide sources. We have done our basic research: The top Google hit on "binary search" is http://en.wikipedia.org/wiki/Binary_search_algorithm (and there are many, many more sources that describe the same thing).
John Y
From wikipedia: _In computer science, a binary search is an algorithm for locating the position of an element in a sorted list.[1][2] It inspects the middle element of the sorted list: if equal to the sought value, then the position has been found; otherwise, the upper half or lower half is chosen for further searching based on whether the sought value is greater than or less than the middle element. The method reduces the number of elements needed to be checked by a factor of two each time [snip]_ You mentioned nothing about sortedness in your _while index!=_ statement.
cjrh
@Caleb Hattingh: He did mention that his list is sorted, actually. That's not the problem; the problem is that linked lists can only be traversed sequentially, while the efficiency of a binary search relies on jumping to the middle element being an O(1) operation. Trying to actually do a binary search using linear traversal would in fact be *worse* than a linear search on an unsorted list!
camccann
Well ...unless traversal is a whole lot faster than comparisons, right? That might actually be the case in Python, where traversal could be implemented as a very tight C loop, but comparisons are quite slow.
Jason Orendorff
@Jason Orendorff: The overall time complexity would still be linear, of course. But it is a valid point that discussions of algorithmic time complexity tend to ignore the complexity of "atomic" operations dependent on individual items--e.g., a hash table operation may be O(1) in collection size, but O(n) in key size.
camccann
+3  A: 

It's puzzling that everyone demands justification to need a linked list. Linked lists are one of the most elementary data structures for a reason: they have properties that the other major data structures lack, and if you need those properties, you need a linked list or one of its close relatives. If you don't understand why linked lists are an important data structure that can't always be replaced with a deque or a binary tree, you should never have passed your "intro to data structures" class.

Here's a quick implementation, supporting the usual stuff: constant-time insertion at any point given a reference to the node, splitting the list into two lists, and inserting a list into the middle of another list (splice). Generic Python interfaces are supported: push, pop, pushleft, popleft, extend, ordinary iteration, iteration over a slice (getiter).

I just wrote this, so it's doctested but not production tested; there are probably still bugs.

def _ref(obj):
    """
    weakref.ref has a bit of braindamage: you can't take a weakref to None.
    This is a major hassle and a needless limitation; work around it.
    """
    from weakref import ref
    if obj is None:
        class NullRef(object):
            def __call__(self): return None
        return NullRef()
    else:
        return ref(obj)

class _node(object):
    def __init__(self, obj):
        self.obj = obj
        self._next = None
        self._prev = _ref(None)
    def __repr__(self):
        return "node(%s)" % repr(self.obj)

    def __call__(self):
        return self.obj

    @property
    def next(self):
        return self._next

    @property
    def prev(self):
        return self._prev()

# Implementation note: all "_last" and "prev" links are weakrefs, to prevent circular references.
# This is important; if we don't do this, every list will be a big circular reference.  This would
# affect collection of the actual objects in the list, not just our node objects.
#
# This means that _node objects can't exist on their own; they must be part of a list, or nodes
# in the list will be collected.  We also have to pay attention to references when we move nodes
# from one list to another.
class llist(object):
    """
    Implements a doubly-linked list.
    """
    def __init__(self, init=None):
        self._first = None
        self._last = _ref(None)
        if init is not None:
            self.extend(init)

    def insert(self, item, node=None):
        """
        Insert item before node.  If node is None, insert at the end of the list.
        Return the node created for item.

        >>> l = llist()
        >>> a = l.insert(1)
        >>> b = l.insert(2)
        >>> d = l.insert(4)
        >>> l._check()
        [1, 2, 4]
        >>> c = l.insert(3, d)
        >>> l._check()
        [1, 2, 3, 4]
        """
        item = _node(item)

        if node is None:
            if self._last() is not None:
                self._last()._next = item
                item._prev = _ref(self._last())
            self._last = _ref(item)
            if self._first is None:
                self._first = item
        else:
            assert self._first is not None, "insertion node must be None when the list is empty"
            if node._prev() is not None:
                node._prev()._next = item
            item._prev = node._prev
            item._next = node
            node._prev = _ref(item)
            if node is self._first:
                self._first = item
        return item

    def remove(self, node):
        """
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l.remove(c) # Check removing from the middle
        3
        >>> l._check()
        [1, 2, 4, 5]
        >>> l.remove(a) # Check removing from the start
        1
        >>> l._check()
        [2, 4, 5]
        >>> l.remove(e) # Check removing from the end
        5
        >>> l._check()
        [2, 4]
        """
        if self._first is node:
            self._first = node._next
        if self._last() is node:
            self._last = node._prev
        if node._next is not None:
            node._next._prev = node._prev
        if node._prev() is not None:
            node._prev()._next = node._next
        node._next = None
        node._prev = _ref(None)
        return node.obj

    def __nonzero__(self):
        """
        A list is true if it has any elements.

        >>> l = llist()
        >>> bool(l)
        False
        >>> l = llist([1])
        >>> bool(l)
        True
        """
        return self._first is not None


    def __iter__(self):
        """
        >>> l = llist([1,2,3])
        >>> [i() for i in l]
        [1, 2, 3]
        """
        return self.getiter(self._first, self._last())

    def _check(self):
        if self._last() is None:
            assert self._last() is None
            return []
        node = self._first
        ret = []
        while node is not None:
            if node._next is None:
                assert node == self._last()
            if node._prev() is None:
                assert node == self._first
            if node._next is not None:
                assert node._next._prev() == node
            if node._prev() is not None:
                assert node._prev()._next == node
            ret.append(node.obj)
            node = node._next
        return ret

    def getiter(self, first, last):
        """
        Return an iterator over [first,last].

        >>> l = llist()
        >>> l.append(1)
        node(1)
        >>> start = l.append(2)
        >>> l.extend([3,4,5,6])
        >>> end = l.append(7)
        >>> l.extend([8,9])
        >>> [i() for i in l.getiter(start, end)]
        [2, 3, 4, 5, 6, 7]
        """
        class listiter(object):
            def __init__(self, first, last):
                self.node = first
                self.final_node = last

            def __iter__(self): return self
            def next(self):
                ret = self.node
                if ret is None:
                    raise StopIteration
                if ret is self.final_node:
                    self.node = None
                else:
                    self.node = self.node._next
                return ret
        return listiter(first, last)

    def append(self, item):
        """
        Add an item to the end of the list.

        >>> l = llist()
        >>> l.append(1)
        node(1)
        >>> l.append(2)
        node(2)
        >>> l._check()
        [1, 2]
        """
        return self.insert(item, None)

    def appendleft(self, item):
        """
        Add an item to the beginning of the list.

        >>> l = llist()
        >>> l.appendleft(1)
        node(1)
        >>> l.appendleft(2)
        node(2)
        >>> l._check()
        [2, 1]
        """
        return self.insert(item, self._first)

    def pop(self):
        """
        Remove an item from the end of the list and return it.

        >>> l = llist([1,2,3])
        >>> l.pop()
        3
        >>> l.pop()
        2
        >>> l.pop()
        1
        >>> l.pop()
        Traceback (most recent call last):
        ...
        IndexError: pop from empty llist
        """
        if self._last() is None:
            raise IndexError, "pop from empty llist"
        return self.remove(self._last())

    def popleft(self):
        """
        Remove an item from the beginning of the list and return it.

        >>> l = llist([1,2,3])
        >>> l.popleft()
        1
        >>> l.popleft()
        2
        >>> l.popleft()
        3
        >>> l.popleft()
        Traceback (most recent call last):
        ...
        IndexError: popleft from empty llist
        """
        if self._first is None:
            raise IndexError, "popleft from empty llist"
        return self.remove(self._first)

    def splice(self, source, node=None):
        """
        Splice the contents of source into this list before node; if node is None, insert at
        the end.  Empty source_list.  Return the first and last nodes that were moved.

        # Test inserting at the beginning.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4,5,6])
        >>> l.splice(l2, a)
        (node(4), node(6))
        >>> l._check()
        [4, 5, 6, 1, 2, 3]
        >>> l2._check()
        []

        # Test inserting in the middle.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4,5,6])
        >>> l.splice(l2, b)
        (node(4), node(6))
        >>> l._check()
        [1, 4, 5, 6, 2, 3]
        >>> l2._check()
        []

        # Test inserting at the end.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4,5,6])
        >>> l.splice(l2, None)
        (node(4), node(6))
        >>> l._check()
        [1, 2, 3, 4, 5, 6]
        >>> l2._check()
        []

        # Test inserting a list with a single item.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4])
        >>> l.splice(l2, b)
        (node(4), node(4))
        >>> l._check()
        [1, 4, 2, 3]
        >>> l2._check()
        []
        """
        if source._first is None:
            return
        first = source._first
        last = source._last()

        if node is None:
            if self._last() is not None:
                self._last()._next = source._first
            source._first._prev = self._last
            self._last = source._last
            if self._first is None:
                self._first = source._first
        else:
            source._first._prev = node._prev
            source._last()._next = node
            if node._prev() is not None:
                node._prev()._next = source._first
            node._prev = source._last
            if node is self._first:
                self._first = source._first
        source._first = None
        source._last = _ref(None)
        return first, last

    def split(self, start, end=None):
        """
        Remove all items between [node, end] and return them in a new list.  If end is None,
        remove until the end of the list.

        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l._check()
        [1, 2, 3, 4, 5]
        >>> l2 = l.split(c, e)
        >>> l._check()
        [1, 2]
        >>> l2._check()
        [3, 4, 5]

        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l2 = l.split(a, c)
        >>> l._check()
        [4, 5]
        >>> l2._check()
        [1, 2, 3]

        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l2 = l.split(b, d)
        >>> l._check()
        [1, 5]
        >>> l2._check()
        [2, 3, 4]
        """
        if end is None:
            end = self._last()

        ret = llist()

        # First, move the region into the new list.  It's important to do this first, or
        # once we remove the nodes from the old list, they'll be held only by weakrefs and
        # nodes could end up being collected before we put it into the new one.
        ret._first = start
        ret._last = _ref(end)

        # Hook our own nodes back together.
        if start is self._first:
            self._first = end._next
        if end is self._last():
            self._last = start._prev

        if start._prev() is not None:
            start._prev()._next = end._next
        if end._next is not None:
            end._next._prev = start._prev
        start._prev = _ref(None)
        end._next = None

        return ret

    def extend(self, items):
        """
        >>> l = llist()
        >>> l.extend([1,2,3,4,5])
        >>> l._check()
        [1, 2, 3, 4, 5]
        """
        for item in items:
            self.append(item)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
Glenn Maynard
Thanks a lot, looks similar to the one I wrote last night. Don't get me started on the whole Python religion, I am exasperbated by the constant calls for justification for an extremely simple question. Thanks again for the reasoned and thoughtful reply, unfortunately a rare thing these days.
And, of course, that's why this type being missing from the standard library is such a problem: everyone ends up having to roll their own, which is a waste of time and results in different, incompatible interfaces. A linked list isn't an obscure, specialty data type; it's data structures 101. Binary trees are an equally significant gap in Python's standard library.
Glenn Maynard
It's likely that people are questioning motivation because, after passing "intro to data structures", it's quite unusual to *actually* need a linked list specifically for its performance characteristics. The vast majority of the time, if the data is large enough to worry about time complexity, sorting, indexing, or searching will be needed, and linked lists are *spectacularly bad* at all of those. Without context, insisting on a linked list sounds suspiciously like a design mistake, especially when requests for clarification are met with unwarranted hostility.
camccann
Yeah, I get it, but in face to face conversation if I were to reply to a question with "why, I think what you really need is XYZ", it would come off as rude. I take responsibilty for my temper and hostile attitude, but I don't think that it is entirely unwarranted considering that many of the replies question my motivation instead of giving an honest attempt at an appropriate answer. I can handle a little querying, but some people prefer to interrogate aggresively rather than respond meaningfully.
Double-checking is one thing (lots of beginners ask questions at the level of "my Python script is slow, is there a program to shorten all of my variable names?"), but some people operate under the belief that nobody could ever possibly *really* want something they don't use themselves, and if they're asking for it, they must obviously be mistaken.
Glenn Maynard
A: 

Here's an idea, that would require a little coding, but may give you hugely better performance. It may or may not be suitable for your use case.

You can splice a new list into your list by replacing a single element. To insert the list [6, 7, 8] into [1, 2, 3, 4, 5] at index 2, you would end up with

[1, 2, [3, 6, 7, 8], 4, 5]

By not changing the length of the large (here 5 element) list, you'll not have the speed problems you're currently having.

You can 'delete' an element from the list in the same way, by replacing it with an empty list.

[1, 2, [], 4, 5]

To iterate over this mixed list is straightforward.

def IterateNestedList(xs):
    for x in xs:
        if isinstance(x, list):
            for y in IterateNestedList(x): yield y
        else: yield x
Paul Hankin
A: 

For large data, keep a sorted list is a trick. Don't insert but append new items at the end and then sort it. Don't delet item but replace with a special value, sort them to the end and then pop out. For finding, a sorted list also has very quick performance with bisection method. As for small data, iterate old list, filter and build a new one, like list comprehensions method, is always the fast way.

For me, what is large data? it should be over 1000000 items...

Angel