I am trying to parse an HTML document for all of its img tags, download every image pointed to by a src attribute, and then add those files to a zip file. I would prefer to do all of this in memory, since I can guarantee there won't be that many images.

Assume the `images` variable is already populated from parsing the HTML. What I need help with is getting the images into the zip file.

from zipfile import ZipFile
from StringIO import StringIO
from urllib2 import urlopen

s = StringIO()
zip_file = ZipFile(s, 'w')
try:
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally:
    zip_file.close()
+1  A: 

I'm not quite sure what you're asking here, since you appear to have most of it sorted.

Have you investigated HTMLParser (in the standard library) to actually perform the HTML parsing? I wouldn't try hand-rolling a parser yourself; it's a major task with numerous edge cases. Don't even think about regexps for anything but the most trivial cases.
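
A minimal sketch of that approach, assuming Python 2's HTMLParser module and a `html` string already fetched (the `ImgSrcParser` name is just illustrative, not part of the library):

from HTMLParser import HTMLParser

class ImgSrcParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.srcs.append(value)

parser = ImgSrcParser()
parser.feed(html)   # html is assumed to hold the page source
print parser.srcs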

For each <img/> tag you can use httplib (or urllib2, as in your snippet) to actually fetch each image. It may be worth downloading the images in multiple threads to speed up the compilation of the zip file.
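
A rough sketch of the threaded-download idea, using the standard threading module with urllib2 for brevity (the `fetch_all` helper and `urls` list are made-up names, not anything from the question):

import threading
from urllib2 import urlopen

def fetch(url, results, index):
    # store each payload at its own slot so order is preserved
    results[index] = urlopen(url).read()

def fetch_all(urls):
    results = [None] * len(urls)
    threads = [threading.Thread(target=fetch, args=(u, results, i))
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results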

Brian Agnew
+1 for suggesting parsing the HTML!
Mongoose
Downvoted why?
Brian Agnew
+1  A: 

The easiest way I can think of to do this would be to use the BeautifulSoup library.

Something along the lines of:

from BeautifulSoup import BeautifulSoup
from collections import defaultdict

def getImgSrces(html):
    srcs = []
    soup = BeautifulSoup(html)

    for tag in soup('img'):
        # gather the tag's attributes into a dictionary
        attrs = defaultdict(str)
        for attr in tag.attrs:
            attrs[attr[0]] = attr[1]
        attrs = dict(attrs)

        if 'src' in attrs:
            srcs.append(attrs['src'])

    return srcs

That should give you a list of URLs derived from your img tags to loop through.
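
A quick usage sketch, assuming the page is fetched with urllib2 (the `page_html` name and the URL are illustrative):

from urllib2 import urlopen

page_html = urlopen('http://example.com/').read()
for src in getImgSrces(page_html):
    print src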

KingRadical
Why not just have: `for attr in tag.attrs: if attr[0] == 'src': srcs.append(attr[1])` instead? Why bother with your `attrs` dictionary?
Samir Talwar
I just took the example almost verbatim out of a routine I had written where I wanted a dictionary of all attributes, though you could do it that way. I'm not sure there's much to gain there performance-wise, though.
KingRadical
+1  A: 

To answer your specific question about how to create the ZIP archive (others here have discussed parsing the HTML for the image URLs), I tested out your code. You are really remarkably close to having a finished product already. As for why you have to use writestr() instead of write(): write() expects the name of a file on disk and copies that file into the archive, while writestr() takes a string of bytes you already have in memory, which is exactly your situation. (You can also call read() directly on the object urlopen() returns; there's no need to go through its fp attribute.)

Here's how I would augment what you have to create a zip archive (in this example, I'm writing the archive out to disk so that I can verify it was written properly).

from zipfile import ZipFile, ZIP_DEFLATED
from cStringIO import StringIO
from urllib2 import urlopen
from urlparse import urlparse
from os import path

images = ['http://sstatic.net/so/img/logo.png',
          'http://sstatic.net/so/Img/footer-cc-wiki-peak-internet.png']

buf = StringIO()
# By default, zip archives are not compressed; adding ZIP_DEFLATED
# achieves that. If you don't want compression, or don't have zlib
# on your system, drop the compression kwarg.
zip_file = ZipFile(buf, mode='w', compression=ZIP_DEFLATED)

for image in images:
    internet_image = urlopen(image)
    # name each archive member after the last path segment of its URL
    fname = path.basename(urlparse(image).path)
    zip_file.writestr(fname, internet_image.read())

zip_file.close()

output = open('images.zip', 'wb')
output.write(buf.getvalue())
output.close()
buf.close()
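
As a sanity check, you can read the archive back and list its contents (a small sketch, assuming it runs right after the code above):

verify = ZipFile('images.zip', 'r')
print verify.namelist()    # one entry per downloaded image
print verify.testzip()     # None means no corrupt members
verify.close()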
Jarret Hardie