I want to download a few HTML pages from http://abc.com/view_page.aspx?ID= where the ID comes from an array of different numbers.

I want to visit each instance of this URL and save the file as [ID].HTML, using different proxy IPs/ports.

I also want to use different user-agents and randomize the wait time before each download.

What is the best way of doing this? urllib2? pycURL? cURL? What do you prefer for the task at hand?

Please advise. Thanks guys!

+1  A: 

Use the Unix tool wget. It has options to specify a custom user-agent and a delay between retrievals.

See the wget(1) man page for more information.
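
For example, a minimal sketch (the IDs, the user-agent string, and the pause length are placeholders; since each wget call here fetches a single page, the randomized pause uses a shell sleep rather than wget's --wait/--random-wait, which apply between retrievals within one run):

for id in 1001 1002 1003; do
    # fetch one page by ID and save it as [ID].html
    wget --user-agent="Mozilla/5.0 (example)" \
         -O "${id}.html" \
         "http://abc.com/view_page.aspx?ID=${id}"
    # wait a random 1-5 seconds before the next download
    sleep $((RANDOM % 5 + 1))
done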

pajton
That is a good start, thank you! `--random-wait` can be used. I'm not so sure about the proxy implementation, though. Any ideas?
ThinkCode
I have only used `wget` for basic scraping, so sorry, I cannot give you more info about proxying with it.
pajton
Using a proxy with wget: export http_proxy=http://proxy.example.com:8080; wget --proxy-user=foo --proxy-password=bar --user-agent="Frobzilla/1.1" [url]
wump
Thanks a lot for the tip!
ThinkCode
+1  A: 

Use something like:

import urllib2
import time
import random

MAX_WAIT = 5    # maximum delay between downloads, in seconds
ids = ...       # list of page IDs to fetch
agents = ...    # list of User-Agent strings
proxies = ...   # list of proxies, e.g. 'http://host:port'

for page_id in ids:
    url = 'http://abc.com/view_page.aspx?ID=%d' % page_id
    # send the request through the current proxy with the current user-agent
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxies[0]}))
    html = opener.open(urllib2.Request(url, None, {'User-agent': agents[0]})).read()
    with open('%d.html' % page_id, 'w') as f:
        f.write(html)
    # cycle: move the agent/proxy just used to the back of its list
    agents.append(agents.pop(0))
    proxies.append(proxies.pop(0))
    # sleep a random fraction of MAX_WAIT before the next download
    time.sleep(MAX_WAIT * random.random())
Plumo