I have a tab-separated data file with a little over 2 million lines and 19 columns. You can find it in US.zip at http://download.geonames.org/export/dump/.

I originally ran the code below with for l in f.readlines(). I understand that iterating over the file object directly is supposed to be more memory-efficient, so that is the version posted here. Even with that small optimization, the process is using 30% of my memory after only about 6.5% of the records, and at this pace it looks like it will run out of memory just as it did before. The function is also very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects on each pass of the for loop?

def run():
    from geonames.models import POI
    f = open('data/US.txt')  # iterate over the file object so lines are read lazily
    for l in f:
        li = l.rstrip('\n').split('\t')  # strip the newline so the last column is clean
        try:
            p = POI()
            p.geonameid = li[0]
            p.name = li[1]
            p.asciiname = li[2]
            p.alternatenames = li[3]
            p.point = "POINT(%s %s)" % (li[5], li[4])
            p.feature_class = li[6]
            p.feature_code = li[7]
            p.country_code = li[8]
            p.ccs2 = li[9]
            p.admin1_code = li[10]
            p.admin2_code = li[11]
            p.admin3_code = li[12]
            p.admin4_code = li[13]
            p.population = li[14]
            p.elevation = li[15]
            p.gtopo30 = li[16]
            p.timezone = li[17]
            p.modification_date = li[18]
            p.save()
        except IndexError:
            pass

if __name__ == "__main__":
    run()

EDIT, More details (the apparently important ones):

Memory consumption goes up as the script runs and saves more lines. The .save() method is a modified Django model save() that uses a unique_slug snippet and writes to a PostgreSQL/PostGIS database.

SOLVED: DEBUG database logging in Django eats memory.

+2  A: 

This looks perfectly fine to me. Iterating over the file like that or using xreadlines() will read each line as needed (with sane buffering behind the scenes). Memory usage should not grow as you read in more and more data.

As for performance, you should profile your app. Most likely the bottleneck is somewhere in a deeper function, like POI.save().
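
For example, a quick way to find the bottleneck (a sketch only; run() and POI are the names from the question, and the sort key is just one sensible choice):

import cProfile
import pstats

# Profile one import run and show the 20 most expensive calls by
# cumulative time; if POI.save() dominates, the database layer is the
# bottleneck rather than the file parsing.  Assumes run() is defined in
# or imported into the module that executes this.
cProfile.run('run()', 'import_profile')
pstats.Stats('import_profile').sort_stats('cumulative').print_stats(20)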

Max Shawabkeh
+2  A: 

There's nothing to worry about in the data you've given us: is memory consumption going UP as you read more and more lines? That would be cause for worry, but there's no indication it would happen in the code you've shown, assuming p.save() saves the object to some database or file rather than holding it in memory. There's nothing real to be gained by adding del statements, as the memory gets recycled on each pass of the loop anyway.

This could be sped up if there's a faster way to populate a POI instance than binding its attributes one by one -- e.g., passing those attributes (maybe as keyword arguments? positional would be faster...) to the POI constructor. But whether that's the case depends on that geonames.models module, of which I know nothing, so I can only offer very generic advice -- e.g., if the module lets you save a bunch of POIs in a single gulp, then making them (say) 100 at a time and saving them in bunches should yield a speedup (at the cost of slightly higher memory consumption).
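
To make that concrete, here is a minimal sketch of the batching idea, assuming POI is an ordinary Django model and that your Django version provides QuerySet.bulk_create (Django 1.4+). Note that bulk_create bypasses the custom save()/unique_slug logic mentioned in the question, so any slug would have to be computed up front:

from geonames.models import POI

def run_batched(path='data/US.txt', batch_size=100):
    # Build POI instances via keyword arguments and save them in batches,
    # as suggested above.  Assumes POI is a plain Django model and that
    # QuerySet.bulk_create is available; bulk_create does NOT call the
    # custom save()/unique_slug logic from the question.
    batch = []
    with open(path) as f:
        for line in f:
            li = line.rstrip('\n').split('\t')
            if len(li) < 19:  # skip malformed lines instead of catching IndexError
                continue
            batch.append(POI(
                geonameid=li[0], name=li[1], asciiname=li[2],
                alternatenames=li[3],
                point="POINT(%s %s)" % (li[5], li[4]),
                feature_class=li[6], feature_code=li[7],
                country_code=li[8], ccs2=li[9],
                admin1_code=li[10], admin2_code=li[11],
                admin3_code=li[12], admin4_code=li[13],
                population=li[14], elevation=li[15], gtopo30=li[16],
                timezone=li[17], modification_date=li[18],
            ))
            if len(batch) >= batch_size:
                POI.objects.bulk_create(batch)
                batch = []
    if batch:
        POI.objects.bulk_create(batch)  # flush the final partial batch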

Alex Martelli
Thanks for the comment. The increased memory consumption was caused by Django's DEBUG db logging. I'll keep your advice in mind for future performance improvements. Simply setting DEBUG to False keeps memory usage steady at 1%, as we would expect.
skyl
+4  A: 

Make sure that Django's DEBUG setting is set to False.
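
For context: with DEBUG = True, Django appends every executed SQL statement to django.db.connection.queries, so a multi-million-row import grows without bound. Turning DEBUG off in settings.py is the real fix; if it has to stay on for some reason, periodically clearing that query log also keeps memory flat. A minimal sketch (save_poi is a hypothetical stand-in for the parse-and-save body from the question):

from django.db import reset_queries

def run_with_query_log_reset(path='data/US.txt'):
    with open(path) as f:
        for i, line in enumerate(f):
            save_poi(line)       # hypothetical helper: split the line and POI.save() as in the question
            if i % 1000 == 0:
                reset_queries()  # drop Django's accumulated query log (only populated when DEBUG is True)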

ericflo
Yep, looks like this actually turned out to be a Django question rather than a Python question. The accumulating memory was due to Django's DEBUG logging.
skyl