I have a simple Python web crawler that uses SQLite both to store its output and to keep its queue of URLs. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would just run several instances of the crawler class at once in separate threads, so they all run concurrently. But how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit on the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.

+3  A: 

It's usually simpler to run multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
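
A minimal sketch of what that looks like, assuming the crawler is packaged as a hypothetical stand-alone script, crawler.py, that pops URLs off the shared SQLite queue until it's empty:

```python
# Launch several independent crawler processes and wait for them all.
# "crawler.py" is a hypothetical stand-alone version of the crawler class.
import subprocess

NUM_PROCESSES = 4  # pick a number, then measure (see below)

procs = [subprocess.Popen(['python', 'crawler.py'])
         for _ in range(NUM_PROCESSES)]

for p in procs:
    p.wait()  # block until every crawler process has finished
```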

There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.

When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.

When you run some larger number, you'll see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.

You can create a graph that shows how the process scales. Based on this you can balance the number of processes against your desired elapsed time.

Think of it this way.

One crawler does its job in one minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take one hour. Let's say 25 crawlers finish the job in 50 minutes.

You don't know what's optimal until you run various combinations and compare the results.
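
One way to run those combinations, sketched here in Python 2 (matching the question's era) with the same hypothetical crawler.py; its --limit flag is made up and just stands in for any fixed-size job:

```python
# Time a fixed workload at several concurrency levels, then plot the results.
import subprocess
import time

for n in (1, 2, 5, 10, 25, 50):
    start = time.time()
    procs = [subprocess.Popen(['python', 'crawler.py', '--limit', '100'])
             for _ in range(n)]
    for p in procs:
        p.wait()
    print n, time.time() - start  # plot elapsed time against n
```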

S.Lott
If I recall correctly, there is a limit to the number of processes you can Popen, as you'll run into the open-file-descriptor limit (although it's around 200 or so).
dbr
Actually, I find that for networking stuff, especially HTTP, threads are great: since they spend 90% of their time waiting for the remote end to do something, there's a great efficiency factor, and they're really quite easy to program, especially if you use work queues to feed data to them.
Kurt
+8  A: 

You will probably find your application is bandwidth limited, not CPU or I/O limited.

As such, add as many as you like until performance begins to degrade.

You may come up against other limits depending on your network setup. For example, if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may constrain how many HTTP requests you can make at once. Make too many and your provider may treat you as being infected by a virus or the like.

There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.

I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted, at about one page every two seconds. I could've done it faster, but I figured it was less of a burden on the server.

So really there's no hard-and-fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
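
A sketch of that bounded worker-pool pattern, with the thread count as the knob to turn (Python 2 stdlib, matching the era of the question; the URL list and parsing step are placeholders):

```python
# N threads share one Queue of URLs and exit when it's drained.
# Note: each thread should open its own SQLite connection -- sqlite3
# connections can't be shared across threads.
import threading
import urllib2
import Queue

NUM_THREADS = 20  # vary this while you measure throughput
url_queue = Queue.Queue()

def worker():
    while True:
        try:
            url = url_queue.get_nowait()
        except Queue.Empty:
            return  # queue drained, thread exits
        html = urllib2.urlopen(url).read()
        # ... run the regex searches and write results to SQLite here ...

for i in range(100):
    url_queue.put('http://example.com/page%d' % i)  # illustrative URLs

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```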

cletus
Considering networking is an I/O operation (downloading is input), the application is indeed I/O bound.
sebnow
I wanted to make the distinction between bandwidth and disk I/O (i.e. the database).
cletus
+1  A: 

You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.

Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.

Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
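
A bare-bones sketch of that multiprocessing suggestion (Python 2.6+; fetch_and_parse and the URL list are stand-ins for the real crawl step):

```python
# A pool of worker processes fetches and parses pages in parallel.
# Have the parent process do the SQLite writes so database connections
# aren't shared across processes.
import multiprocessing
import urllib2

def fetch_and_parse(url):
    html = urllib2.urlopen(url).read()
    # ... regex searches would go here ...
    return url, len(html)

if __name__ == '__main__':
    urls = ['http://example.com/page%d' % i for i in range(100)]
    pool = multiprocessing.Pool(processes=4)  # roughly one per core
    for url, size in pool.map(fetch_and_parse, urls):
        print url, size  # store results in SQLite here instead
```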

Chris B.
A: 

Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound, so the networking part would probably be better done using select() on the sockets; this avoids the overhead of creating and maintaining threads. I haven't used Twisted, but I hear it has really good support for asynchronous networking. It would allow you to specify the URLs you wish to download and register a callback for each; when a page has been downloaded, the callback is called and the page can be processed.

To allow multiple sites to be downloaded without waiting for each one to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue, and the "worker" thread would do the actual processing.

As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.

I'd use one or two threads: one for the actual crawling and another (with a queue) for processing.
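
A rough sketch of that design, assuming Twisted's old-style twisted.web.client.getPage is available: the reactor downloads pages asynchronously while a single worker thread drains a queue and does the parsing and SQLite work. All names here are illustrative, not a tested implementation.

```python
import Queue
import threading
from twisted.internet import defer, reactor
from twisted.web.client import getPage

URLS = ['http://example.com/page%d' % i for i in range(10)]
page_queue = Queue.Queue()

def worker():
    while True:
        item = page_queue.get()
        if item is None:
            return  # sentinel: no more pages coming
        url, html = item
        # ... regex searches and SQLite writes go here ...

def on_page(html, url):
    page_queue.put((url, html))  # hand the page off to the worker

def on_done(_):
    page_queue.put(None)  # tell the worker to exit
    reactor.stop()

deferreds = [getPage(url).addCallback(on_page, url) for url in URLS]
defer.DeferredList(deferreds, consumeErrors=True).addBoth(on_done)

t = threading.Thread(target=worker)
t.start()
reactor.run()
t.join()
```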

sebnow
This is almost certainly incorrect. If the program is I/O bound, then having multiple threads WILL speed it up. If it's CPU bound, then multiple threads WON'T. And even then, a dual-core processor can benefit from multiple processes.
Chris B.
Threads would obviously speed it up. I stated that threads are not _necessary_ in this particular case, since asynchronous I/O, using select(), and other methods would achieve a similar thing. At the end I did state that I would use a second thread for processing - utilising multiple cores.
sebnow
But that makes no sense--one thread will be performing network reads (and will therefore be I/O bound and block frequently) while the other will just be waiting for the first to fetch a page all the time. And in Python, multiple threads don't take advantage of multiple cores. Processes do.
Chris B.
Depending on the number of simultaneous downloads, perhaps it would. However, using threads would make little difference. Downloading using select() or async I/O is effectively the same as using threads (without the overhead). That is the point I'm making. The queue would not necessarily be empty.
sebnow
As already stated, I have not actually used Twisted; however, this document describes how to use it for asynchronous networking: http://twistedmatrix.com/projects/core/documentation/howto/async.html. The example shows that the OP's application could be done with one thread, as I stated.
sebnow
+4  A: 

I would use one thread and Twisted with either a DeferredSemaphore or a task Cooperator, if you already have an easy way to feed in an arbitrarily long list of URLs.

It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a Twisted-based crawler.
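
A minimal sketch of the DeferredSemaphore variant (illustrative names; getPage is Twisted's old-style page fetcher): at most MAX_CONCURRENT downloads are in flight at once, however long the URL list gets.

```python
from twisted.internet import defer, reactor
from twisted.web.client import getPage

MAX_CONCURRENT = 10
URLS = ['http://example.com/page%d' % i for i in range(100)]

def parse(html, url):
    pass  # regex searches and SQLite writes go here

def crawl(url):
    return getPage(url).addCallback(parse, url)

# sem.run() acquires a token, calls crawl(), and releases the token
# when the returned Deferred fires, capping concurrency automatically.
sem = defer.DeferredSemaphore(MAX_CONCURRENT)
done = defer.DeferredList([sem.run(crawl, url) for url in URLS],
                          consumeErrors=True)
done.addBoth(lambda _: reactor.stop())
reactor.run()
```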

Dustin
+3  A: 

cletus's answer is the one you want.

A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different option is pycurl, a thin wrapper around libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example showing how to fetch multiple pages in parallel in about 120 lines of code.
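
A much-simplified sketch in the spirit of that retriever-multi.py example (Python 2; see the real example for the robust select()-based loop, and treat the URLs here as illustrative):

```python
import pycurl
from StringIO import StringIO

urls = ['http://example.com/page%d' % i for i in range(5)]
multi = pycurl.CurlMulti()
handles = []
for url in urls:
    c = pycurl.Curl()
    c.buf = StringIO()  # stash the output buffer on the handle itself
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
    multi.add_handle(c)
    handles.append(c)

num_active = len(handles)
while num_active:
    ret, num_active = multi.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        multi.select(1.0)  # wait for network activity

for c in handles:
    print c.getinfo(pycurl.EFFECTIVE_URL), len(c.buf.getvalue())
    multi.remove_handle(c)
```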

Andrew Dalke
+1  A: 

One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.

So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
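
One illustrative way to enforce such a cap with the Python 2 stdlib: keep a threading.Semaphore per hostname (all names below are my own, not an established API):

```python
import threading
import urllib2
import urlparse

PER_HOST_LIMIT = 5
_lock = threading.Lock()
_host_sems = {}

def _sem_for(host):
    with _lock:  # create each host's semaphore exactly once
        if host not in _host_sems:
            _host_sems[host] = threading.Semaphore(PER_HOST_LIMIT)
        return _host_sems[host]

def polite_fetch(url):
    # Blocks if PER_HOST_LIMIT requests to this host are already in flight.
    sem = _sem_for(urlparse.urlparse(url).netloc)
    sem.acquire()
    try:
        return urllib2.urlopen(url).read()
    finally:
        sem.release()
```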

Michael Borgwardt
A: 

Seems a duplicate of "How many threads is too many?".

bortzmeyer