I have a big file I'm reading from, and I convert every few lines to an instance of an object.

As I loop through the file, I stash each instance in a list using list.append(instance), and then continue looping.
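
Roughly, the pattern looks like this (the file name and class are made up; the real parsing is more involved):

class MyObject(object):                 # hypothetical stand-in for my real class
    def __init__(self, lines):
        self.lines = lines

objects = []
f = open('data.txt')                    # hypothetical file name
while True:
    lines = [f.readline() for _ in xrange(4)]   # convert every few lines at a time
    if not lines[0]:                            # end of file
        break
    objects.append(MyObject(lines))
f.close()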

The file is around 100 MB, so it isn't too large, but as the list grows, the looping slows down progressively. (I print the time for each lap of the loop.)

This is not intrinsic to the loop: when I print every new instance as I loop through the file, the program progresses at constant speed; it only gets slow when I append the instances to a list.

My friend suggested disabling garbage collection before the while loop, then enabling it afterward and making a garbage collection call.

Did anyone else observe a similar problem with list.append getting slower? Is there any other way to circumvent this?


I'll try the following two things suggested below.

(1) "pre-allocating" the memory ~ what's the best way to do this? (2) Try using deque

Multiple posts (see the comment by Alex Martelli) suggested memory fragmentation (he has a large amount of available memory, like I do), but offered no obvious fixes for the performance problem.

To replicate the phenomenon, please run the test code provided below in the answers and assume that the lists have useful data.


gc.disable() and gc.enable() help with the timing. I'll also do a careful analysis of where all the time is spent.

+1  A: 

Can you try a deque (http://docs.python.org/release/2.5.2/lib/deque-objects.html), or pre-allocating the expected number of required elements in your list? I would bet that a list is contiguous storage that has to be reallocated and copied every few iterations (similar to some popular implementations of std::vector in C++).

EDIT: Backed up by http://www.python.org/doc/faq/general/#how-are-lists-implemented
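
For example, a rough sketch of both ideas (the item count and the objects are placeholders for the real data):

from collections import deque

n = 1000000                      # assumed: number of items known in advance

# Pre-allocation: build the full-size list once and fill it by index,
# so the backing array never has to grow.
items = [None] * n
for i in xrange(n):
    items[i] = object()          # placeholder for the real instance

# Alternative: a deque, which appends in O(1) without keeping one
# contiguous, ever-growing array.
items = deque()
for i in xrange(n):
    items.append(object())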

Tomasz Zielinski
How would I go about the allocation? I know how many items there are beforehand, so I could try this very easily.
Deniz
Not every few operations, but every *n* operations. (The FAQ simplifies this to "few", which hides the "clever" part of the implementation.) This is what allows `append` to have an amortized complexity of O(1).
Mike Graham
I understand that it's grown by a factor of 2. So it's O(1) with "spikes", right?
Tomasz Zielinski
+9  A: 

There is nothing to circumvent: appending to a list is O(1) amortized.

A list (in CPython) is an array at least as long as the list and up to twice as long. If the array isn't full, appending to a list is just as simple as assigning one of the array members (O(1)). Every time the array is full, it is automatically doubled in size. This means that on occasion an O(n) operation is required, but it is only required every n operations, and it is increasingly seldom required as the list gets big. O(n) / n ==> O(1). (In other implementations the names and details could potentially change, but the same time properties are bound to be maintained.)

Appending to a list already scales.
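
If you want to watch the over-allocation happen, here is a quick sketch (Python 2.6 or later, since it relies on sys.getsizeof; the exact sizes are an implementation detail):

import sys

x = []
last = sys.getsizeof(x)
for i in xrange(10000):
    x.append(i)
    size = sys.getsizeof(x)
    if size != last:
        # the backing array was reallocated; note how rarely this happens
        # as the list gets longer
        print len(x), size
        last = size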

Is it possible that when the file gets to be big you are not able to hold everything in memory and you are facing problems with the OS paging to disk? Is it possible it's a different part of your algorithm that doesn't scale well?

Mike Graham
Thanks for clarifying: so once in a while I have an O(n) operation, but the next time around the loop it's back to the usual? I will do a detailed timing analysis of my code in the next few days and post again.
Deniz
Yes, the O(n) operations only occur occasionally. The number of appends between them grows at the same rate as n, so it averages out to having only a constant effect.
Mike Graham
256 GIGABYTES of RAM, or 128 or 64, but nothing lower than 64 GIGABYTES. For example:
top - 02:36:31 up 36 days, 11:21, 7 users, load average: 0.84, 0.31, 0.11
Tasks: 274 total, 2 running, 272 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.2%us, 0.1%sy, 0.0%ni, 93.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132370600k total, 7819100k used, 124551500k free, 481084k buffers
Swap: 2031608k total, 3780k used, 2027828k free, 5256144k cached
Deniz
Thanks Mike. Below you mention that turning the GC off helped you; are there any side effects to that? When should I turn it back on?
Deniz
I *completely* misread!
Mike Graham
@Deniz, Turning off the GC means that any reference cycles will not be discovered and collected. If what you are storing is lots of strings, it looks like this doesn't help you. (At least from quickly running a test along the lines of FogleBird's on my machine, storing strings does not suffer the penalty detected for creating millions of lists. Presumably Python knows it needn't search strings for reference cycles.)
Mike Graham
@Deniz, perhaps if you posted your entire program and explained and presented your measurements, there might be something you have not picked up on. There is no reason this should be the fault of `append`, and I wouldn't be too quick to blame the garbage collector.
Mike Graham
+4  A: 

A lot of these answers are just wild guesses. I like Mike Graham's the best because he's right about how lists are implemented. But I've written some code to reproduce your claim and look into it further. Here are some findings.

Here's what I started with.

import time
x = []
for i in range(100):
    start = time.clock()
    for j in range(100000):
        x.append([])
    end = time.clock()
    print end - start

I'm just appending empty lists to the list x. I print out a duration for every 100,000 appends, 100 times. It does slow down like you claimed. (0.03 seconds for the first iteration, and 0.84 seconds for the last... quite a difference.)

Obviously, if you instantiate a list but don't append it to x, it runs way faster and doesn't scale up over time.

But if you change x.append([]) to x.append('hello world'), there's no slowdown at all. The same object is getting added to the list 100 * 100,000 times.

What I make of this:

  • The speed decrease has nothing to do with the size of the list. It has to do with the number of live Python objects.
  • If you don't append the items to the list at all, they are freed right away (their reference count drops to zero) and are no longer being managed by Python.
  • If you append the same item over and over, the number of live Python objects isn't increasing. The list does have to resize itself every once in a while, but that isn't the source of the performance issue.
  • Since you're creating and adding lots of newly created objects to a list, they remain live and are not garbage collected. The slowdown probably has something to do with this.

As far as the internals of Python that could explain this, I'm not sure. But I'm pretty sure the list data structure is not the culprit.
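
One way to test whether the garbage collector is involved (see the comment discussion below) is to rerun the benchmark with the cyclic collector disabled; on the machines where this was tried, the per-batch times stay flat:

import time
import gc

gc.disable()                 # suppress cyclic GC passes during the benchmark
x = []
for i in range(100):
    start = time.clock()
    for j in range(100000):
        x.append([])
    end = time.clock()
    print end - start
gc.enable()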

FogleBird
If you are forming a list of 100000 * 100 empty lists at 36 bytes a piece, that's 360 MB of lists you are trying to store on top of the big list and other overhead. Are you sure you aren't running into memory fragmentation or disk swapping?
Mike Graham
For figuring out which of two similar pieces of code is faster, timing is key (in Python, you'd use the `timeit` module to help you measure.) For determining how things scale, you have to understand mathematically what is going on to characterize the properties of your algorithm; timing can be important but is secondary. One of the most important things when timing an operation is to be sure you control your test to catch exactly what you want.
Mike Graham
It's not disk swapping or anything. It scales up right away and quite linearly. Not sure what point you're trying to make in your second comment? Feel free to run the examples I posted for yourself to see how it behaves. Do you disagree with my conclusions?
FogleBird
@FogleBird, How are you determining that you aren't suffering from memory fragmentation or swapping on your computer? These are usually things your OS does without telling you much about it. If that is occurring, this example is measuring that more than anything else. The number of live Python objects isn't somehow inherently unscalable, and you haven't proven that it is. You've proven that something about this test slows down your computer.
Mike Graham
@Mike Graham, I encourage you to run your own tests in an effort to explain this behavior.
FogleBird
A variant of this benchmark measuring only the append time (not the time to create an empty list) shows it's very stable. One measuring just the time to create an empty list (and not the time to append) shows it growing... but only if the append is also there (present though not getting measured), otherwise the time to create an empty list (as the old ones get recycled), per se, is also stable (saving the lists to a preallocated list also makes list-creation slow down). Looks like memory fragmentation (costlier and costlier to allocate). (MacOSX 10.5, 4G RAM, tried Python 2.5 and 2.6).
Alex Martelli
@Alex Martelli, I ran all of those variants as well and saw the same results. How can you necessarily conclude "memory fragmentation" though? There can't be any other explanation for increasingly costlier allocations?
FogleBird
@FogleBird, if you have abundant RAM (and I do, as I explained), it's hard to see what else _would_ make memory "costlier and costlier to allocate" -- it's not as if the OS is reluctant to give you more, if there's plenty to go around;-).
Alex Martelli
So how can I destroy these extra lists? Now that people seem to be convinced I do indeed see a slowdown, what's the solution? Any way to fix this? It'd be neat to figure out the ultimate cause as well: I have 256 GB of RAM (hard to believe but true) on a Linux system.
Deniz
@Deniz, In this contrived example, the stuff consuming all the memory was extra lists; in your case the stuff taking up all the memory is *actual meaningful data*.
Mike Graham
Yes, I guess what I meant to ask was: when I do instance = Class(blah) and then list.append(instance), do I just have to live with the fact that this is going to get slower and slower, with nothing I can do about it? (Sure, I could not append at all, but is there a way to add instances to a list without the slowdown and memory fragmentation?)
Deniz
@FogleBird, I just now saw that you pointed out you see this ramp-up immediately, not just when things got big. I am sorry I ignored it. It appears to be the result of the garbage collector, which periodically searches through all of the objects. When I run this I see the same scaling, but when I turn off the GC, it goes away. Searching through all of the objects is obviously O(n). I apologize for missing your claim and not acknowledging what you found.
Mike Graham
@FogleBird, Incidentally, if I fill each list item with a *different string* (similar to the OP's example) instead of a *different list*, the time stays constant through all iterations.
Mike Graham
@Mike Graham, Thank you. Looks like the GC is to blame, which makes sense and does not conflict with my conclusions.
FogleBird
+12  A: 

The poor performance you observe is caused by a bug in the Python garbage collector. To resolve this issue, disable garbage collection as you build the list and turn it on after you finish. You will find that performance approximates the amortized O(1) behavior expected of list appending in Python.

(You can also tweak the garbage collector's triggers or selectively call collect as you progress, but I do not explore these options in this answer because they are more complex and I suspect your use case is amenable to the above solution.)
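
A minimal sketch of that approach (the per-line parsing here is just a stand-in for however you actually build your objects):

import gc

def build_list(filename):
    gc.disable()                  # stop cyclic GC passes while the list is built
    try:
        items = []
        for line in open(filename):
            # hypothetical parse step: replace with the real object construction
            items.append(line.split())
        return items
    finally:
        gc.enable()               # re-enable cyclic collection when done
        gc.collect()              # one explicit pass to reclaim any cycles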

Background:

See: http://bugs.python.org/issue4074 and also http://docs.python.org/release/2.5.2/lib/module-gc.html

The reporter observes that appending complex objects (objects that aren't numbers or strings) to a list slows linearly as the list grows in length.

The reason for this behavior is that the garbage collector checks and rechecks every object in the list to see whether it is eligible for garbage collection. This is what causes the linear increase in the time to add objects to a list. A fix is expected to land in py3k, so it will not be available in the interpreter you are using.
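
To get a feel for why this is expensive, gc.get_objects() returns everything the cyclic collector currently tracks; each appended instance adds to that set, so every collection pass has more objects to examine (a rough illustration, not part of the linked bug report):

import gc

class Obj(object):
    pass

x = []
for i in xrange(5):
    for j in xrange(100000):
        x.append(Obj())
    # every one of these instances is tracked by the cyclic collector,
    # so each full collection pass has more work to do
    print len(gc.get_objects())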

Test:

I ran a test to demonstrate this. For 1k iterations I append 10k objects to a list, and record the runtime for each iteration. The overall runtime difference is immediately obvious. With garbage collection disabled during the inner loop of the test, runtime on my system is 18.6s. With garbage collection enabled for the entire test, runtime is 899.4s.

This is the test:

import time
import gc

class A:
    def __init__(self):
        # a few attributes so each instance is a non-trivial object
        # that the cyclic collector has to track
        self.x = 1
        self.y = 2
        self.why = 'no reason'

def time_to_append(size, append_list, item_gen):
    # append `size` freshly created items and return the elapsed wall-clock time
    t0 = time.time()
    for i in xrange(0, size):
        append_list.append(item_gen())
    return time.time() - t0

def test():
    # garbage collection left on for the whole run
    x = []
    count = 10000
    for i in xrange(0, 1000):
        print len(x), time_to_append(count, x, lambda: A())

def test_nogc():
    # garbage collection switched off around each batch of appends
    x = []
    count = 10000
    for i in xrange(0, 1000):
        gc.disable()
        print len(x), time_to_append(count, x, lambda: A())
        gc.enable()

# run by calling test() or test_nogc(); see the full source linked below

Full source: http://hypervolu.me/~erik/programming/python_lists/listtest.py.txt

Graphical result: Red is with gc on, blue is with gc off. y-axis is seconds scaled logarithmically.

As the two plots differ by several orders of magnitude in the y component, here they are independently with the y-axis scaled linearly.

Interestingly, with garbage collection off, we see only small spikes in runtime per 10k appends, which suggests that Python's list reallocation costs are relatively low. In any case, they are many orders of magnitude lower than the garbage collection costs.

The density of the above plots makes it difficult to see that with the garbage collector on, most intervals actually have good performance; it's only when the garbage collector cycles that we encounter the pathological behavior. You can observe this in this histogram of 10k append times. Most of the datapoints fall around 0.02s per 10k appends.

The raw data used to produce these plots can be found at http://hypervolu.me/~erik/programming/python_lists/

Erik Garrison
I didn't mean to add to the community wiki yet... but somehow the post has been added. Does anyone know how to remove it?
Erik Garrison
You can't undo community wiki. And it gets triggered automatically when you make eight edits (http://meta.stackoverflow.com/questions/11740). (I feel your pain. Community wiki is so strange.)
Jon-Eric
Oh no! After more than a year of lurking on this site, I finally make a contribution and I get bitten by a hidden feature! I wouldn't have made so many small formatting edits had I known of this.
Erik Garrison
`lambda: A()` is a worse way of saying `A`.
Mike Graham
I am not sure your conclusion *"The poor performance you observe is caused by a bug in the Python garbage collector."* is right (perhaps the OP can tell us). You create a gazillion `A` objects, each of which can contain references to other objects and therefore must be looked at by the GC. If you test it with a gazillion strings, do you see the same performance problems?
Mike Graham
Dear Mike, I am appending instances of my classes to the list in question, so I believe that doing it with objects is a more accurate way of recreating the issue than doing it with strings.
Deniz