ansaurus

Question

Python: Memory usage and optimization when modifying lists

Answer 1

A:

A set (or even a dict) might be what you're looking for. It's the same underlying structure as a dictionary (without the associated values), but your objects do need to be hashable.

If order is important in your list/set, you could make an ordered set. There is a good recipe on activestate for an OrderedSet. There is another slick suggestion in this answer. Python 2.7 and 3.1 also have an an OrderedDict You would ave to test the implementation for your self to see how the overhead impacts you, but the speed gains from the hashtable may well be worth it.

Depending on what sort of comparisons you're making on the objects in the list, a heap (heapq module) might also fit your problem. The heap will minimize the number of operations required for inserting and removing items in the underlying list.

JimB 2010-04-13 16:04:51

As I understand, a set describes unordered data while somelist definitively needs to be ordered. I should have been more explicit.The heapq module is interesting but doesn't fit the problem unfortunately.As you suggest, I might have to abandon using classical lists, but anything that avoids such a refactoring would be great.

xApple 2010-04-13 16:13:16

@xApple - I added some more info about ordered sets and OrderedDict

JimB 2010-04-13 16:24:40

Answer 2

+4 A:

Without knowing the specifics of what you're doing with this list, it's hard to know exactly what would be best in this case. If your processing stage depends on the current index of the list element, this won't work, but if not, it appears you've left off the most Pythonic (and in many ways, easiest) approach: generators.

If all you're doing is iterating over each element, processing it in some way, then either including that element in the list or not, use a generator. Then you never need to store the entire iterable in memory.

def process_and_generate_data(source_iterable):
    for item in source_iterable:
        dosomestuff(item)
        if not somecondition(item):
            yield item

You would need to have a processing loop that dealt with persisting the processed iterable (writing it back to a file, or whatever), or if you have multiple processing stages you'd prefer to separate into different generators you could have your processing loop pass one generator to the next.

Jeffrey Harris 2010-04-13 16:15:56

Answer 3

A:

You do not provide enough information I can find to answer this question really well. I don't know your use case well enough to tell you what data structures will get you the time complexities you want if you have to optimize for time. The typical solution is to build a new list rather than repeated deletions, but obviously this doubles(ish) memory usage.

If you have memory usage issues, you might want to abandon using in-memory Python constructs and go with an on-disk database. Many databases are available and sqlite ships with Python. Depending on your usage and how tight your memory requirements are, array.array or numpy might help you, but this is highly dependent on what you need to do. array.array will have all the same time complexities as list and numpy arrays sort of will but work in some different ways. Using lazy iterators (like generators and the stuff in the itertools module) can often reduce memory usage by a factor of n.

Using a database will improve time to delete items from arbitrary locations (though order will be lost if this is important). Using a dict will do the same, but potentially at high memory usage.

You can also consider blist as a drop-in replacement for a list that might get some of the compromises you want. I don't believe it will drastically increase memory usage, but it will change item removal to O(log n). This comes at the cost of making other operations more expensive, of course.

I would have to see testing to believe that the constant factor for memory use for your doubly linked list implementation would be less than the 2 that you get by simply creating a new list. I really doubt it.

You will have to share more about your problem class for a more concrete answer, I think, but the general advice is

Iterate over a list building a new list as you go along (or using a generator to yield the items when you need them). If you actually need a list, this will have a memory factor of 2, which scales fine but doesn't help if you are short on memory period.
If you are running out of memory, rather than microoptimization you probably want an on-disk database or to store your data in a file.

Mike Graham 2010-04-13 16:40:58

Answer 4

+3 A:

From your description it sounds like a deque ("deck") would be exactly what you are looking for:

http://docs.python.org/library/collections.html#deque-objects

"Iterate" across it by repeatedly calling pop() and then, if you want to keep the popped item in the deque, returning that item to the front with appendleft(item). To keep up with when you're done iterating and have seen everything in the deque, either put in a marker object like None that you watch for, or just ask for the deque's len() when you start a particular loop and use range() to pop() exactly that many items.

I believe you will find all of the operations you need are then O(1).

Brandon Craig Rhodes 2010-04-13 17:02:04

A simpler solution would be to pop out of the first deque and append into a *new* deque, which requires keeping track of less.

Mike Graham 2010-04-13 17:23:59

Answer 5

+1 A:

Python stores only references to objects in the list - not the elements themselves. If you grow a list item by item, the list (that is the list of references to the objects) will grow one by one, eventually reaching the end of the excess memory that Python preallocated at the end of the list (of references!). It then copies the list (of references!) into a new larger place while your list elements stay at their old location. As your code visits all the elements in the old list anyway, copying the references to a new list by new_list[i]=old_list[i] will be nearly no burden at all. The only performance hint is to allocate all new elements at once instead of appending them (OTOH the Python docs say that amortized append is still O(1) as the number of excess elements grows with the list size). If you are lacking the place for the new list (of references) then I fear you are out of luck - any data structure that would evade the O(n) in-place insert/delete will likely be bigger than a simple array of 4- or 8-byte entries.

slartibartfast 2010-04-13 17:20:04

Answer 6

A:

Brandon Craig Rhodes suggests using a collections.deque, which can suit this problem: no additional memory is required for the operation and it is kept O(n). I do not know the total memory usage and how it compares to a list; it's worth noting that a deque has to store a lot more references and I would not be surprised if it isn't as memory intensive as using two lists. You would have to test or study it to know yourself.

If you were to use a deque, I would deploy it slightly differently than Rhodes suggests:

from collections import deque
d = deque(range(30))
n = deque()

print d

while True:
    try:
        item = d.popleft()
    except IndexError:
        break

    if item % 3 != 0:
        n.append(item)

print n

There is no significant memory difference doing it this way, but there's a lot less opportunity to flub up than mutating the same deque as you go.

Mike Graham 2010-04-13 17:26:49

I assume the "% 3" check is a placeholder to simulate the operation of "somecondition()" from the original question, right?

Oddthinking 2010-04-14 16:35:08

@Oddthinking, Right, I was just applying some silly condition (numbers divisible by three are filtered out).

Mike Graham 2010-04-14 16:36:33

Overall I still don't like this solution much though. At best it helps your memory by a factor of 2, probably a lot less. If you're that hard-up for RAM, you should be moving your data onto disk. (And because you need all the nodes to keep references to either side in the `collections.deque`, I wouldn't be shocked if the RAM usage was actual higher, albeit less contiguous, than using lists.)

Mike Graham 2010-04-14 16:38:03

Answer 7

+1 A:

A doubly linked list is worse than just reallocating the list. A Python list uses 5 words + one word per element. A doubly linked list will use 5 words per element. Even if you use a singly linked list, it's still going to be 4 words per element - a lot worse than the less than 2 words per element that rebuilding the list will take.

From memory usage perspective, moving items up the list and deleting the slack at the end is the best approach. Python will release the memory if the list gets less than half full. The question to ask yourself is, does it really matter. The list entries probably point to some data, unless you have lots of duplicate objects in the list, the memory used for the list is insignificant compared to the data. Given that, you might just as well just build a new list.

For building a new list, the approach you suggested is not that good. There's no apparent reason why you couldn't just go over the list once. Also, calling gc.collect() is unnecessary and actually harmful - the CPython reference counting will release the old list immediately anyway, and even the other garbage collectors are better off collecting when they hit memory pressure. So something like this will work:

while processingdata:
    retained = []
    for item in somelist:
        dosomething(item)
        if not somecondition(item):
            retained.append(item)
    somelist = retained

If you don't mind using side effects in list comprehensions, then the following is also an option:

def process_and_decide(item):
    dosomething(item)
    return not somecondition(item)

while processingdata:
    somelist = [item for item in somelist if process_and_decide(item)]

The inplace method can also be refactored so the mechanism and business logic are separated:

def inplace_filter(func, list_):
    pos = 0
    for item in list_:
        if func(item):
            list_[pos] = item
            pos += 1
    del list_[pos:]

while processingdata:
    inplace_filter(process_and_decide, somelist)

Ants Aasma 2010-04-13 22:15:38

ansaurus

tags:

views:

answers:

Python: Memory usage and optimization when modifying lists

The problem

Keeping a local index:

Walking the list backwards:

Making a new list:

Using list comprehensions:

Using the filter function:

Using the itertools' filter function:

Moving items up the list while walking

Abandoning Python lists:

related questions