A web crawler script spawns at most 500 threads, and each thread requests data from a remote server; each server's reply differs from the others in content and size.

I'm setting the stack size to 756 KB per thread:

    import threading
    threading.stack_size(756 * 1024)

This lets me run the number of threads I need and complete most of the jobs and requests. But some servers' responses are larger than others, and when a thread gets one of those responses, the script dies with SIGSEGV.

Stack sizes larger than 756 KB make it impossible to run the required number of threads at the same time.

Any suggestions on how I can continue with the given stack size without crashes? And how can I get the current stack usage of any given thread?

+5  A: 

Why on earth are you spawning 500 threads? That seems like a terrible idea!

Remove threading completely and use an event loop to do the crawling. Your program will be faster, simpler, and easier to maintain.

Lots of threads waiting on the network won't make your program wait any faster. Instead, collect all the open sockets in a list and run a loop that checks whether any of them has data available, as sketched below.
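A minimal sketch of that loop using the standard select module (not the answerer's code; the hosts and the raw HTTP request are placeholders):

    import select
    import socket

    hosts = ['example.com', 'example.org']  # placeholder targets
    socks = []
    for host in hosts:
        s = socket.create_connection((host, 80))
        # HTTP/1.0 so the server closes the connection when it is done
        s.sendall(b'GET / HTTP/1.0\r\nHost: ' + host.encode() + b'\r\n\r\n')
        socks.append(s)

    responses = {s: b'' for s in socks}
    while socks:
        readable, _, _ = select.select(socks, [], [])  # block until data is ready
        for s in readable:
            chunk = s.recv(4096)
            if chunk:
                responses[s] += chunk
            else:  # connection closed: this response is complete
                s.close()
                socks.remove(s)

One thread handles every connection here, so there is no per-thread stack to size at all.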

I recommend using Twisted - it is an event-driven networking engine. It is very flexible, secure, scalable and very stable (no segfaults).
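For illustration, a rough sketch of fetching several pages concurrently on Twisted's event loop, assuming a reasonably modern Twisted API (Agent and readBody); the URLs are placeholders:

    from twisted.internet import reactor
    from twisted.internet.defer import DeferredList
    from twisted.web.client import Agent, readBody

    urls = [b'http://example.com/a', b'http://example.com/b']  # placeholders
    agent = Agent(reactor)

    def fetch(url):
        # One outstanding request per URL, all multiplexed on one thread.
        d = agent.request(b'GET', url)
        d.addCallback(readBody)  # collect the full response body
        d.addCallback(lambda body: print(url, len(body)))
        return d

    done = DeferredList([fetch(u) for u in urls], consumeErrors=True)
    done.addBoth(lambda _: reactor.stop())  # stop once every fetch finishes
    reactor.run()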

You could also take a look at Scrapy - it is a web crawling and screen scraping framework written in Python/Twisted. It is still under heavy development, but maybe you can take some ideas from it.
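As a sketch of what that looks like, a minimal spider using Scrapy's current Spider API (the spider name and URL are hypothetical):

    import scrapy

    class CrawlerSpider(scrapy.Spider):
        name = 'crawler'                      # hypothetical name
        start_urls = ['http://example.com/']  # placeholder

        def parse(self, response):
            # Scrapy schedules all requests on Twisted's event loop,
            # so there is no per-request thread stack to tune.
            yield {'url': response.url, 'size': len(response.body)}

This can be run with scrapy runspider on the file containing the spider.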

nosklo
+1 for mentioning Scrapy.
JV
+1 for good answer. Also, OP should make sure he's downloading gzipped responses.
Triptych
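To illustrate Triptych's comment, a rough standard-library sketch of requesting and decoding a gzipped response (the URL is a placeholder; the era's equivalent of urllib.request was urllib2):

    import gzip
    import io
    import urllib.request

    req = urllib.request.Request('http://example.com/',  # placeholder URL
                                 headers={'Accept-Encoding': 'gzip'})
    resp = urllib.request.urlopen(req)
    body = resp.read()
    if resp.headers.get('Content-Encoding') == 'gzip':
        # Decompress only if the server actually honoured the header.
        body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()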