views:

234

answers:

3

Hello all, these days I'm writing a web crawler script, but one problem is that my internet connection is very slow. So I was wondering whether it's possible to write a multithreaded web crawler using mechanize or urllib or the like. If anyone has experience with this, sharing some info would be much appreciated. I searched Google but didn't find much useful information. Thanks in advance.

+2  A: 

There's a good, simple example on this Stack Overflow thread.

Alex Martelli
+1 That is a good piece of sample code. I think I'll use that myself!
hughdbrown
Thanks! That's very useful info for me.
paul
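The linked example is along these lines: a pool of worker threads pulling URLs from a shared queue. Here is a minimal sketch (the `fetch_url` helper and the `crawl` signature are my own names, not from the linked thread; swap in mechanize for `urllib` if you prefer):

```python
import queue
import threading
import urllib.request

def fetch_url(url):
    """Download one page. Hypothetical helper; replace with mechanize if desired."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def crawl(urls, fetch=fetch_url, num_workers=4):
    """Fetch all URLs concurrently with a fixed pool of worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()  # protects the shared results dict

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            data = fetch(url)
            with lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Usage would be something like `pages = crawl(["http://example.com/", "http://example.org/"])`; the `fetch` parameter is there so you can plug in a different download function.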
+1  A: 

Making multiple requests to many websites at the same time will certainly improve your throughput, since you don't have to wait for one response to arrive before sending new requests.

However, threading is just one way to do that (and a poor one, I might add). Don't use threading for this; simply don't wait for the response before sending the next request. You don't need threads for that.

A good idea is to use Scrapy. It is a fast, high-level screen-scraping and web-crawling framework, used to crawl websites and extract structured data from their pages. It is written in Python and can make many concurrent connections to fetch data at the same time (without using threads to do so). It is really fast. You can also study it to see how it is implemented.

nosklo
Thanks! How does it compare with mechanize? I mean, in terms of speed. Thanks in advance.
paul
@paul: It will certainly be faster than mechanize, and it is easier to do the right thing with it.
nosklo
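The "concurrency without threads" idea above (which Scrapy implements on top of an event loop) can be illustrated with a toy `asyncio` sketch. This is my own illustration, not Scrapy code, and `fake_fetch` stands in for a real network request so the example stays self-contained; a real crawler would use an async HTTP client instead:

```python
import asyncio
import time

async def fake_fetch(url, delay=0.1):
    # Stand-in for a network request: waits without blocking the event loop.
    await asyncio.sleep(delay)
    return f"<html for {url}>"

async def crawl(urls):
    # All requests are in flight at once in a single thread; total time is
    # roughly the slowest single request, not the sum of all of them.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

start = time.perf_counter()
pages = asyncio.run(crawl([f"http://site{i}.example" for i in range(10)]))
elapsed = time.perf_counter() - start
```

Run sequentially, ten 0.1-second "requests" would take about a second; run concurrently they finish in roughly 0.1 seconds, which is the whole point the answer is making.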
+1  A: 

Practical threaded programming with Python is worth reading.

sunqiang
It's a great resource! :) In addition, is there a small example script? Something like a function that saves the results from a crawled web page. Thanks.
paul
@paul, I don't know; what I needed for saving fetched pages was just for demo purposes. pickle, sqlite, or writing directly to a directory of files is enough for me.
sunqiang
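Along the lines of the sqlite option mentioned above, here is a minimal sketch of saving fetched pages with the standard-library `sqlite3` module (the `save_pages`/`load_page` names and schema are my own, purely for illustration):

```python
import sqlite3

def save_pages(db_path, pages):
    """Store fetched pages; `pages` is a dict mapping URL -> page body."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    # INSERT OR REPLACE so re-crawling a URL overwrites the old copy.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
        pages.items(),
    )
    conn.commit()
    conn.close()

def load_page(db_path, url):
    """Return the stored body for a URL, or None if it was never saved."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ?", (url,)
    ).fetchone()
    conn.close()
    return row[0] if row else None
```

You would call `save_pages("crawl.db", results)` with the dict returned by your crawl loop, then read pages back later with `load_page`.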