views:

234

answers:

3

Hello all, these days I'm writing a web crawler script, but one problem is that my internet connection is very slow. So I was wondering whether it's possible to write a multithreaded web crawler using mechanize or urllib or the like. If anyone has experience with this, sharing some info would be much appreciated. I searched Google but didn't find much useful information. Thanks in advance.

+2  A: 

There's a good, simple example on this Stack Overflow thread.

Alex Martelli
+1 That is a good piece of sample code. I think I'll use that myself!
hughdbrown
Thanks! That's very useful info for me.
paul
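The linked example is along these lines: a pool of worker threads pulling URLs from a shared queue. Here is a minimal sketch (the `fetch_url` helper and the `crawl` signature are my own names, not from the linked thread; swap in mechanize for `urllib` if you prefer):

```python
import queue
import threading
import urllib.request

def fetch_url(url):
    """Download one page. Hypothetical helper; replace with mechanize if desired."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def crawl(urls, fetch=fetch_url, num_workers=4):
    """Fetch all URLs concurrently with a fixed pool of worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()  # protects the shared results dict

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            data = fetch(url)
            with lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Usage would be something like `pages = crawl(["http://example.com/", "http://example.org/"])`; the `fetch` parameter is there so you can plug in a different download function.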
+1  A: 

Making multiple requests to many websites at the same time will certainly improve your throughput, since you don't have to wait for one response to arrive before sending new requests.

However, threading is just one way to do that (and a poor one, I might add). Don't use threading for this; simply don't wait for the response before sending the next request. You don't need threads for that.

A good idea is to use Scrapy. It is a fast, high-level screen-scraping and web-crawling framework, used to crawl websites and extract structured data from their pages. It is written in Python and can make many concurrent connections to fetch data at the same time (without using threads to do so). It is really fast. You can also study it to see how it is implemented.

nosklo
Thanks! How does it compare with mechanize? I mean, in terms of speed. Thanks in advance.
paul
@paul: It will certainly be faster than mechanize, and it is easier to do the right thing with it.
nosklo
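The "concurrency without threads" idea above (which Scrapy implements on top of an event loop) can be illustrated with a toy `asyncio` sketch. This is my own illustration, not Scrapy code, and `fake_fetch` stands in for a real network request so the example stays self-contained; a real crawler would use an async HTTP client instead:

```python
import asyncio
import time

async def fake_fetch(url, delay=0.1):
    # Stand-in for a network request: waits without blocking the event loop.
    await asyncio.sleep(delay)
    return f"<html for {url}>"

async def crawl(urls):
    # All requests are in flight at once in a single thread; total time is
    # roughly the slowest single request, not the sum of all of them.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

start = time.perf_counter()
pages = asyncio.run(crawl([f"http://site{i}.example" for i in range(10)]))
elapsed = time.perf_counter() - start
```

Run sequentially, ten 0.1-second "requests" would take about a second; run concurrently they finish in roughly 0.1 seconds, which is the whole point the answer is making.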
+1  A: 

Practical threaded programming with Python is worth reading.

sunqiang
It's a great resource! :) In addition, is there a small example script? Something like a function that saves the results from a crawled web page. Thanks.
paul
@paul, I don't know; what I needed for saving fetched pages was just for demo purposes. pickle, sqlite, or writing directly to a directory of files is enough for me.
sunqiang
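Along the lines of the sqlite option mentioned above, here is a minimal sketch of saving fetched pages with the standard-library `sqlite3` module (the `save_pages`/`load_page` names and schema are my own, purely for illustration):

```python
import sqlite3

def save_pages(db_path, pages):
    """Store fetched pages; `pages` is a dict mapping URL -> page body."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    # INSERT OR REPLACE so re-crawling a URL overwrites the old copy.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
        pages.items(),
    )
    conn.commit()
    conn.close()

def load_page(db_path, url):
    """Return the stored body for a URL, or None if it was never saved."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ?", (url,)
    ).fetchone()
    conn.close()
    return row[0] if row else None
```

You would call `save_pages("crawl.db", results)` with the dict returned by your crawl loop, then read pages back later with `load_page`.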