I am trying to parse an HTML document for all of its img tags, download every image pointed to by a src attribute, and then add those files to a zip file. I would prefer to do all of this in memory, since I can guarantee there won't be that many images.

Assume the `images` variable is already populated from parsing the HTML. What I need help with is getting the images into the zip file.

from zipfile import ZipFile
from StringIO import StringIO
from urllib2 import urlopen

s = StringIO()
zip_file = ZipFile(s, 'w')
try:
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally:
    zip_file.close()
+1  A: 

I'm not quite sure what you're asking here, since you appear to have most of it sorted.

Have you investigated HTMLParser (in the standard library) to actually perform the HTML parsing? I wouldn't try hand-rolling a parser yourself; it's a major task with numerous edge cases. Don't even think about regexps for anything but the most trivial cases.
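
A minimal sketch of that approach, assuming Python 2's HTMLParser module and a `html` string already fetched (the `ImgSrcParser` name is just illustrative, not part of the library):

from HTMLParser import HTMLParser

class ImgSrcParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.srcs.append(value)

parser = ImgSrcParser()
parser.feed(html)   # html is assumed to hold the page source
print parser.srcs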

For each <img/> tag you can use httplib (or urllib2, as in your snippet) to actually fetch each image. It may be worth downloading the images in multiple threads to speed up the compilation of the zip file.
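
A rough sketch of the threaded-download idea, using the standard threading module with urllib2 for brevity (the `fetch_all` helper and `urls` list are made-up names, not anything from the question):

import threading
from urllib2 import urlopen

def fetch(url, results, index):
    # store each payload at its own slot so order is preserved
    results[index] = urlopen(url).read()

def fetch_all(urls):
    results = [None] * len(urls)
    threads = [threading.Thread(target=fetch, args=(u, results, i))
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results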

Brian Agnew
+1 for suggesting parsing the HTML!
Mongoose
Downvoted why?
Brian Agnew
+1  A: 

The easiest way I can think of to do this would be to use the BeautifulSoup library.

Something along the lines of:

from BeautifulSoup import BeautifulSoup
from collections import defaultdict

def getImgSrces(html):
    srcs = []
    soup = BeautifulSoup(html)

    for tag in soup('img'):
        # gather the tag's attributes into a dictionary
        attrs = defaultdict(str)
        for attr in tag.attrs:
            attrs[attr[0]] = attr[1]
        attrs = dict(attrs)

        if 'src' in attrs:
            srcs.append(attrs['src'])

    return srcs

That should give you a list of URLs derived from your img tags to loop through.
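
A quick usage sketch, assuming the page is fetched with urllib2 (the `page_html` name and the URL are illustrative):

from urllib2 import urlopen

page_html = urlopen('http://example.com/').read()
for src in getImgSrces(page_html):
    print src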

KingRadical
Why not just have: `for attr in tag.attrs: if attr[0] == 'src': srcs.append(attr[1])` instead? Why bother with your `attrs` dictionary?
Samir Talwar
I just took the example almost verbatim out of a routine I had written where I wanted a dictionary of all attributes, though you could do it that way. I'm not sure there's much to gain there performance-wise, though.
KingRadical
+1  A: 

To answer your specific question about how to create the ZIP archive (others here have discussed parsing the HTML for the image URLs), I tested out your code. You are really remarkably close to having a finished product already. As for why you have to use writestr() instead of write(): write() expects the name of a file on disk and copies that file into the archive, while writestr() takes a string of bytes you already have in memory, which is exactly your situation. (You can also call read() directly on the object urlopen() returns; there's no need to go through its fp attribute.)

Here's how I would augment what you have to create a zip archive (in this example, I'm writing the archive out to disk so that I can verify it was written properly).

from zipfile import ZipFile, ZIP_DEFLATED
from cStringIO import StringIO
from urllib2 import urlopen
from urlparse import urlparse
from os import path

images = ['http://sstatic.net/so/img/logo.png',
          'http://sstatic.net/so/Img/footer-cc-wiki-peak-internet.png']

buf = StringIO()
# By default, zip archives are not compressed; adding ZIP_DEFLATED
# achieves that. If you don't want compression, or don't have zlib
# on your system, drop the compression kwarg.
zip_file = ZipFile(buf, mode='w', compression=ZIP_DEFLATED)

for image in images:
    internet_image = urlopen(image)
    # name each archive member after the last path segment of its URL
    fname = path.basename(urlparse(image).path)
    zip_file.writestr(fname, internet_image.read())

zip_file.close()

output = open('images.zip', 'wb')
output.write(buf.getvalue())
output.close()
buf.close()
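
As a sanity check, you can read the archive back and list its contents (a small sketch, assuming it runs right after the code above):

verify = ZipFile('images.zip', 'r')
print verify.namelist()    # one entry per downloaded image
print verify.testzip()     # None means no corrupt members
verify.close()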
Jarret Hardie