I want to get many pages from a website, like

curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"

but get the pages' data in python, not disk files. Can someone please post pycurl code to do this,
or fast urllib2 (not one-at-a-time) if that's possible,
or else say "forget it, curl is faster and more robust" ? Thanks

A: 

You can just put that into a bash script inside a for loop.

However, you may have better success parsing each page using Python. http://www.securitytube.net/Crawling-the-Web-for-Fun-and-Profit-video.aspx You will be able to get at the exact data and save it into a db at the same time. http://www.securitytube.net/Storing-Mined-Data-from-the-Web-for-Fun-and-Profit-video.aspx
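For example, a minimal one-at-a-time sketch of that fetch-and-store idea (the sqlite file, table name and schema here are just illustrative assumptions, not anything from the question):

import sqlite3
import urllib2

conn = sqlite3.connect('pages.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (page INTEGER PRIMARY KEY, html TEXT)')

for page in range(0, 3604):                      # pages 0000-3603 inclusive
    url = 'http://farmsubsidy.org/DE/browse?page=%04d' % page
    html = urllib2.urlopen(url).read()
    # real parsing (e.g. pulling out the table rows you care about) would go here
    conn.execute('INSERT OR REPLACE INTO pages VALUES (?, ?)',
                 (page, html.decode('utf-8', 'replace')))

conn.commit()
conn.close()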

Dmitry
curl holds a persistent connection during the entire transfer; doing a shell loop with 3600 fresh TCP connections WILL be a lot slower...
Daniel Stenberg
it would still run serially. see my answer for a version that can download many streams in parallel.
Corey Goldberg
yes, and quite possibly then using pycurl in several threads would be even faster! ;-)
Daniel Stenberg
+1  A: 

Here is a solution based on urllib2 and threads.

import urllib2
from threading import Thread

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
NUM_RANGE = range(0, 3604)   # pages 0000-3603 inclusive (range's upper bound is exclusive)
THREADS = 2

def main():
    # hand each spider its own slice of the page numbers and set it going
    for nums in split_seq(NUM_RANGE, THREADS):
        t = Spider(BASE_URL, nums)
        t.start()

def split_seq(seq, num_pieces):
    # yield num_pieces contiguous, roughly equal-sized slices of seq
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

class Spider(Thread):
    def __init__(self, base_url, nums):
        Thread.__init__(self)
        self.base_url = base_url
        self.nums = nums
    def run(self):
        # fetch each assigned page and dump its contents to stdout
        for num in self.nums:
            url = '%s%s' % (self.base_url, num)
            data = urllib2.urlopen(url).read()
            print data

if __name__ == '__main__':
    main()
Corey Goldberg
`class Spider` - very nice!
Jarrod Dixon
Thanks Corey; how do I (thread newbie) wait til they're all done ?
Denis
Denis, you can call join() on each thread in main(). this will block until the threads are done.
Corey Goldberg
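A minimal sketch of the join() approach Corey describes, reusing the names from the answer above:

def main():
    spiders = []
    for nums in split_seq(NUM_RANGE, THREADS):
        t = Spider(BASE_URL, nums)
        t.start()
        spiders.append(t)
    # join() blocks until that thread has finished, so main() only returns
    # once every page has been fetched
    for t in spiders:
        t.join()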
+1  A: 

So you have two problems; let me show you both in one example. Notice that pycurl already does the concurrent, not-one-at-a-time fetching for you, without any hard work on your part.

#! /usr/bin/env python

import pycurl
import StringIO

# one easy handle per URL; each response body is captured in a StringIO
# buffer instead of being written to a disk file
c1 = pycurl.Curl()
c2 = pycurl.Curl()
c3 = pycurl.Curl()
c1.setopt(c1.URL, "http://www.python.org")
c2.setopt(c2.URL, "http://curl.haxx.se")
c3.setopt(c3.URL, "http://slashdot.org")
s1 = StringIO.StringIO()
s2 = StringIO.StringIO()
s3 = StringIO.StringIO()
c1.setopt(c1.WRITEFUNCTION, s1.write)
c2.setopt(c2.WRITEFUNCTION, s2.write)
c3.setopt(c3.WRITEFUNCTION, s3.write)

# a single multi handle drives all three transfers concurrently
m = pycurl.CurlMulti()
m.add_handle(c1)
m.add_handle(c2)
m.add_handle(c3)

# Number of seconds to wait for a timeout to happen
SELECT_TIMEOUT = 1.0

# Stir the state machine into action
while 1:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break

# Keep going until all the connections have terminated
while num_handles:
    # The select method uses fdset internally to determine which file descriptors
    # to check.
    m.select(SELECT_TIMEOUT)
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

# Cleanup
m.remove_handle(c3)
m.remove_handle(c2)
m.remove_handle(c1)
m.close()
c1.close()
c2.close()
c3.close()
print "http://www.python.org is ",s1.getvalue()
print "http://curl.haxx.se is ",s2.getvalue()
print "http://slashdot.org is ",s3.getvalue()

Finally, this code is mainly based on an example from the pycurl site.

Maybe you should really read the docs; people have spent a huge amount of time on them.

If you try a multithreaded approach at the Python level, it will not really fetch pages in parallel the way you might think, because of Python's GIL. Since libcurl is written in C, its multi interface (multi_perform) does the concurrent transfers for real. This is the fastest approach I can think of.
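If you want to apply this to all ~3600 pages from the question rather than three fixed URLs, here is a rough, untested sketch based on the retriever-multi example that ships with pycurl; NUM_HANDLES and the results dict are my own illustrative choices, not part of the original answer:

import pycurl
import StringIO

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
urls = ['%s%04d' % (BASE_URL, n) for n in range(0, 3604)]
NUM_HANDLES = 10                      # size of the reusable handle pool (assumption)

m = pycurl.CurlMulti()
freelist = [pycurl.Curl() for i in range(NUM_HANDLES)]
queue = urls[:]
results = {}                          # url -> page body (None on error)
num_processed = 0

while num_processed < len(urls):
    # start as many new transfers as we have free handles for
    while queue and freelist:
        url = queue.pop(0)
        c = freelist.pop()
        buf = StringIO.StringIO()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.buf = buf                   # stash the buffer and url on the handle
        c.url = url
        m.add_handle(c)
    # run curl's internal state machine
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # harvest finished transfers and recycle their handles
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            results[c.url] = c.buf.getvalue()
            m.remove_handle(c)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            results[c.url] = None
            m.remove_handle(c)
            freelist.append(c)
        num_processed += len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # wait for activity on the sockets before looping again
    m.select(1.0)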