I need to read selected files, matching on the file name, from a remote zip archive using Python. I don't want to save the full zip to a temporary file (it's not that large, so I can handle everything in memory).

I've already written the code and it works, and I'm answering this myself so I can search for it later. But since evidence suggests that I'm one of the dumber participants on Stack Overflow, I'm sure there's room for improvement.

+6  A: 

Here's how I did it (grabbing all files ending in ".ranks"):

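import urllib2, cStringIO, zipfile

# url is assumed to hold the address of the remote zip archive
try:
    remotezip = urllib2.urlopen(url)
    # buffer the whole download in memory so ZipFile gets a seekable file object
    zipinmemory = cStringIO.StringIO(remotezip.read())
    archive = zipfile.ZipFile(zipinmemory)  # named "archive" to avoid shadowing the zip() builtin
    for fn in archive.namelist():
        if fn.endswith(".ranks"):
            ranks_data = archive.read(fn)
            for line in ranks_data.split("\n"):
                pass  # do something with each line
except urllib2.HTTPError:
    pass  # handle exception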
Marcel Levy
You want to replace the first line with: import urllib2, zipfile.
Jim
Why don't you use `ZipFile(urllib2.urlopen(url))`?
J.F. Sebastian
I tried that, but I couldn't get it to work: even though the response is a file-like object, it doesn't support a method that ZipFile needs (it has no seek()). That's why I buffered it with cStringIO.
Marcel Levy
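The behaviour described in the comment above can be reproduced without a network connection. This is a minimal sketch in Python 3, not the code from the answer: `OneShotStream` is a hypothetical stand-in for a URL response that can be read but not seeked, and `as_seekable`, `make_sample_zip`, `direct_open_fails` and `names_via_buffer` are illustrative names.

```python
import io
import zipfile


class OneShotStream(io.RawIOBase):
    """Hypothetical stand-in for a network response: readable, not seekable."""

    def __init__(self, data):
        super().__init__()
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


def as_seekable(fileobj):
    """Return fileobj if it can seek, else buffer its contents in memory."""
    seekable = getattr(fileobj, "seekable", None)
    if seekable is not None and seekable():
        return fileobj
    return io.BytesIO(fileobj.read())


def make_sample_zip():
    """Build a small archive in memory so the example needs no network."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.ranks", "1\n2\n")
        zf.writestr("notes.txt", "ignored\n")
    return buf.getvalue()


def direct_open_fails(data):
    """True if ZipFile rejects the non-seekable stream outright."""
    try:
        zipfile.ZipFile(OneShotStream(data))
    except (zipfile.BadZipFile, io.UnsupportedOperation, AttributeError):
        return True
    return False


def names_via_buffer(data):
    """Member names read after buffering the stream into BytesIO."""
    with zipfile.ZipFile(as_seekable(OneShotStream(data))) as zf:
        return zf.namelist()
```

Calling `direct_open_fails(make_sample_zip())` should report a failure, while `names_via_buffer` lists both members. The exact exception type may differ between Python versions, which is why the sketch catches several.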
The directory of a zip file is stored at the end, so the entire file must be downloaded before extraction, whether into memory or onto disk.
Ignacio Vazquez-Abrams
This is true, but the point wasn't network I/O efficiency.
Marcel Levy
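The point made in the comments about the directory living at the end of the archive can be checked on an in-memory zip. The following Python 3 sketch is only an illustration: it assumes a plain archive with no zip64 records and no archive comment, and `read_eocd` and `make_sample_zip` are made-up helper names. It finds the end-of-central-directory record and unpacks its fixed 22-byte part.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # marks the end-of-central-directory record
EOCD_FORMAT = "<4s4H2LH"        # fixed 22-byte part of that record


def make_sample_zip():
    """Build a two-member archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.ranks", "1\n2\n")
        zf.writestr("b.ranks", "3\n4\n")
    return buf.getvalue()


def read_eocd(data):
    """Locate the record and return (position, entries, dir size, dir offset)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    # fields: signature, disk no., directory disk, entries on this disk,
    #         total entries, directory size, directory offset, comment length
    total_entries, cd_size, cd_offset = fields[4], fields[5], fields[6]
    return pos, total_entries, cd_size, cd_offset
```

For the sample archive, the directory ends exactly where this record begins, and the record is the last 22 bytes of the file, so nothing can be listed until the tail of the archive is available.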
+1  A: 

Bear in mind that merely decompressing a ZIP file from an untrusted source can be a security risk; for example, a small archive can expand to a very large amount of data (a "zip bomb").

Jim
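The answer above does not say which vulnerability it has in mind. One common concern with in-memory extraction is an archive that expands to far more data than expected, and that case can be guarded against roughly as follows. This is a Python 3 sketch under that assumption; `MAX_TOTAL_BYTES`, `declared_size` and `read_if_small` are illustrative names, and the limit is arbitrary.

```python
import io
import zipfile

MAX_TOTAL_BYTES = 10 * 1024 * 1024  # illustrative limit, tune for your data


def declared_size(zf):
    """Sum of the uncompressed sizes recorded in the archive's directory."""
    return sum(info.file_size for info in zf.infolist())


def read_if_small(zip_bytes, name, limit=MAX_TOTAL_BYTES):
    """Read one member, refusing archives that claim to expand past limit."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        total = declared_size(zf)
        if total > limit:
            raise ValueError(
                "archive expands to %d bytes, limit is %d" % (total, limit)
            )
        return zf.read(name)
```

This only inspects the sizes the archive declares for itself, so it is best treated as a first check rather than a complete defence.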
+2  A: 

Thanks Marcel for your question and answer (I had the same problem in a different context and encountered the same difficulty with file-like objects not really being file-like)! Just as an update: For Python 3.0, your code needs to be modified slightly:

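import urllib.request, urllib.error, io, zipfile

# url is assumed to hold the address of the remote zip archive
try:
    remotezip = urllib.request.urlopen(url)
    zipinmemory = io.BytesIO(remotezip.read())
    archive = zipfile.ZipFile(zipinmemory)  # named "archive" to avoid shadowing the zip() builtin
    for fn in archive.namelist():
        if fn.endswith(".ranks"):
            # read() returns bytes in Python 3, so decode before splitting on "\n"
            # (UTF-8 is an assumption about the files' encoding)
            ranks_data = archive.read(fn).decode("utf-8")
            for line in ranks_data.split("\n"):
                pass  # do something with each line
except urllib.error.HTTPError:
    pass  # handle exception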
Tim Pietzcker
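In Python 3 the members come back as bytes, so text has to be decoded somewhere. As a possible variation on the code above, the sketch below wraps each matching member in `io.TextIOWrapper` and iterates over decoded lines instead of splitting one large string. `iter_member_lines` is a hypothetical helper name, and UTF-8 is an assumption about the files' encoding.

```python
import io
import zipfile


def iter_member_lines(zip_bytes, suffix, encoding="utf-8"):
    """Yield (member name, line) pairs for members whose name ends with suffix."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(suffix):
                continue
            # zf.open() returns a binary file object; wrap it to get text lines
            with zf.open(name) as raw:
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    yield name, line.rstrip("\n")
```

The `zip_bytes` argument would be the result of `remotezip.read()` in the code above, for example `for name, line in iter_member_lines(remotezip.read(), ".ranks"): ...`.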