I have a web scraping script that fetches new data once every minute, but over the course of a couple of days the script ends up using 200 MB or more of memory. I found out it's because mechanize keeps an unbounded browser history for the .back() function to use.

I looked through the docstrings and found the clear_history() method of the Browser class, and I call it each time I refresh, but memory usage still grows by 2-3 MB on each page refresh.

Edit: Hmm, it seems it kept doing the same thing after I called clear_history(), up until memory usage reached about 30 MB; then it dropped back down to 10 MB or so (which is the base amount of memory my program starts up with). Is there any way to force this behavior on a more regular basis?

How do I keep mechanize from storing all of this info? I don't need to keep any of it. I'd like to keep my Python script below 15 MB of memory usage.

+5  A: 

You can pass an argument history=whatever when you instantiate the Browser; the default value is None, which means the browser instantiates the History class itself (to allow back and reload). The simplest approach (which will raise an AttributeError if you ever do call back or reload):

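import mechanize

class NoHistory(object):
  # Stand-in for mechanize's History that records nothing.
  def add(self, *a, **k): pass
  def clear(self): pass

# Inject the no-op history when creating the browser.
b = mechanize.Browser(history=NoHistory())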

A cleaner approach would implement other methods in NoHistory to give clearer exceptions on erroneous use of the browser's back or reload, but this simple one should suffice otherwise.
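A slightly more fleshed-out version might look like the sketch below. Which methods the browser actually calls on its history object is an assumption here (back and close are guesses based on reading the sources, and may differ between mechanize versions), as is the choice of RuntimeError, so check against the version you have installed.

```python
class NoHistory(object):
    """History stand-in that stores nothing.

    Sketch only: the method set is an assumption about what the
    browser calls on its history object, not a documented interface.
    """

    def add(self, *args, **kwargs):
        # Called on each navigation; discard the request and response.
        pass

    def clear(self):
        # Nothing is stored, so there is nothing to clear.
        pass

    def close(self):
        # Assumed to be called when the browser is closed.
        pass

    def back(self, *args, **kwargs):
        # Fail with a readable message instead of a bare AttributeError.
        raise RuntimeError("back() is unavailable: history is disabled")


# Usage, assuming mechanize is installed:
#   import mechanize
#   b = mechanize.Browser(history=NoHistory())
```

With this in place, an accidental call that needs the history fails with a message that says why, rather than an AttributeError about a missing method.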

Note that this is an elegant (though not well documented;-) use of the dependency injection design pattern: in a (bleah) "monkeypatching" world, the client code would be expected to overwrite b._history after the browser is instantiated, but with dependency injection you just pass in the "history" object you want to use. I've often maintained that Dependency Injection may be the most important DP that wasn't in the "gang of 4" book!-).
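To make the contrast concrete, here is a toy sketch that does not use mechanize at all. Client, History and NullHistory are hypothetical names invented for the illustration; only the shape (an optional constructor argument that defaults to None) mirrors what the Browser does.

```python
class History(object):
    """Default collaborator: remembers everything it is given."""

    def __init__(self):
        self.entries = []

    def add(self, item):
        self.entries.append(item)


class NullHistory(object):
    """Replacement collaborator: remembers nothing."""

    def add(self, item):
        pass


class Client(object):
    def __init__(self, history=None):
        # Dependency injection: use the caller's object if one was
        # passed in, otherwise build the default.
        self._history = history if history is not None else History()

    def visit(self, url):
        self._history.add(url)


# Dependency injection: the replacement goes in through the constructor.
injected = Client(history=NullHistory())
injected.visit("http://example.com/")

# Monkeypatching: the replacement overwrites a private attribute
# after the object already exists.
patched = Client()
patched._history = NullHistory()
patched.visit("http://example.com/")
```

Both clients end up with a history that stores nothing, but only the first relies on something the class offers in its constructor signature; the second depends on the private name _history staying the same.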

Alex Martelli
This should really be part of the default mechanize code imho. Thanks a ton ;) I had resorted to using clear_history, importing the gc module and forcing garbage collection in order to keep the memory bouncing between 10 and 18mb of memory usage, hopefully your method will allow things to stay relatively more stable ;)
ThantiK
I'm sure the mechanize maintainers will welcome a tiny patch adding `NoHistory` (in a slightly more fleshed-out version;-) to their `_mechanize.py` module. However, mechanize's real issue is the scarcity of docs -- whether a trivial 5-lines class is part of the code or not is really minor, compared to the fact that you can't learn about it (whether you have to write the trivial 5 lines yourself or not;-) except by carefully studying the sources!
Alex Martelli
wow, Alex, you just enlightened me. I had heard the term 'monkeypatching' before, and kind of tried figuring it out. This gives me a practical example but looking over my code, I was already doing it myself! My class passes in a default 'None' for the cookiejar, and initiates its own unless a different one is initialized and passed in already. Thanks a ton!
ThantiK
@ThantiK, you're welcome -- and yes, I'm not surprised that you were already using Dependency Injection (not monkeypatching: DI _obviates_ the need for MP!-) without connecting it to its name -- knowing the name still helps (easier to talk and think about, easier to look up writeups about the pattern, +s and -s, related ones...).
Alex Martelli