Hi all.

I have implemented a multithreaded crawler in C#. It uses a custom thread pool with a job queue: every page to be downloaded is queued, and each thread takes one job at a time and downloads it.

With 15 threads, crawling a single site is smooth as silk and finishes fast. When crawling several sites on different servers at the same time, I get TONS of timeouts.

Might this have anything to do with DNS resolution? What do you think would cause this?

Thanks. Roey

A: 

HttpWebRequest has a connection limit. See HttpWebRequest.ServicePoint.ConnectionLimit on MSDN.
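To make that concrete, here is a minimal sketch of raising the limit and checking what a given host actually gets. It assumes the classic .NET Framework HttpWebRequest stack; the class name `ConnectionLimitSketch` and its two helper methods are made up for illustration and are not part of the framework. The limit is tracked per ServicePoint (roughly, per scheme, host and port), so each server the crawler talks to gets its own allowance.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the framework.
public static class ConnectionLimitSketch
{
    // Call once at startup, before the first request is created.
    // The default only applies to ServicePoint objects created afterwards.
    public static void Configure(int perHostLimit)
    {
        ServicePointManager.DefaultConnectionLimit = perHostLimit;
    }

    // Returns the limit currently in effect for the host in 'url'.
    public static int LimitFor(string url)
    {
        // FindServicePoint returns the ServicePoint that manages
        // connections to this scheme/host/port combination.
        ServicePoint sp = ServicePointManager.FindServicePoint(new Uri(url));
        return sp.ConnectionLimit;
    }
}
```

Calling `ConnectionLimitSketch.Configure(15)` at startup and then logging `LimitFor(...)` for a couple of the crawled hosts is a quick way to confirm the setting is really being applied.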

JP Alioto
I have this set to 100000.
Roey
Roey, did you solve it? I have a similar problem.
fravelgue
+1  A: 

Presumably, you are running this on Windows. Configuring the number of connections allowed by HttpWebRequest does not change the limits Windows itself imposes. For example, it is my understanding that XP SP2 caps the number of incomplete (half-open) outbound TCP connection attempts at 10 at a time and queues the rest. If you have a large backlog of connections waiting to be opened, they could be timing out before they are let through.

Admittedly, I don't have much insight into the issue since I've never run into it myself. Try throttling back the number of connections you are trying to make and see if that reduces the timeouts.
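One way to try the throttling idea is to put a semaphore in front of the downloads. This is only a sketch, not a drop-in for your custom thread pool: `ThrottleSketch`, `RunAsync` and the `download` delegate are hypothetical names, and the delegate stands in for whatever your crawler does to fetch a page. It returns the peak number of downloads in flight so you can verify the cap is respected.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the framework.
public static class ThrottleSketch
{
    // Runs 'download' for every URL, never allowing more than
    // 'maxConcurrent' downloads at once. Returns the highest number
    // of downloads that were in flight at the same time.
    public static async Task<int> RunAsync(
        IEnumerable<string> urls, int maxConcurrent, Func<string, Task> download)
    {
        object sync = new object();
        int inFlight = 0, peak = 0;
        var tasks = new List<Task>();

        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            foreach (string url in urls)
            {
                // Wait here until one of the slots is free.
                await gate.WaitAsync();

                tasks.Add(Task.Run(async () =>
                {
                    lock (sync)
                    {
                        inFlight++;
                        if (inFlight > peak) peak = inFlight;
                    }
                    try
                    {
                        await download(url);
                    }
                    finally
                    {
                        lock (sync) { inFlight--; }
                        gate.Release(); // free the slot for the next URL
                    }
                }));
            }

            // Wait for the remaining downloads before disposing the gate.
            await Task.WhenAll(tasks);
        }
        return peak;
    }
}
```

Start with a low `maxConcurrent` (say 5) and raise it until the timeouts come back; that should tell you whether the number of simultaneous connection attempts is the real culprit.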

Stuart Childs
I think the connection limit is per server: no more than 10 live connections to any one server. His crawler is talking to more than one server.
Sesh