views: 1070
answers: 3

I once wrote a crawler in .NET. To improve its scalability, I tried to take advantage of the asynchronous APIs of .NET.

System.Net.HttpWebRequest has the asynchronous API pair BeginGetResponse/EndGetResponse. However, this pair only gets the HTTP response headers and a Stream instance from which the HTTP response content can be extracted. So my strategy was to use BeginGetResponse/EndGetResponse to asynchronously get the response Stream, and then use BeginRead/EndRead to asynchronously read bytes from that Stream instance.
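To make the pattern concrete, here is a minimal sketch of what I mean (the ProcessPage callback and the 8 KB buffer size are just placeholders for illustration; error handling is omitted):

```csharp
using System.IO;
using System.Net;

class AsyncFetch
{
    // State carried between the nested Begin/End calls.
    class FetchState
    {
        public HttpWebRequest Request;
        public Stream ResponseStream;
        public byte[] Buffer = new byte[8192];
        public MemoryStream Content = new MemoryStream();
    }

    public static void StartFetch(string url)
    {
        var state = new FetchState { Request = (HttpWebRequest)WebRequest.Create(url) };
        // Asynchronously get the response headers plus the response Stream.
        state.Request.BeginGetResponse(OnGetResponse, state);
    }

    static void OnGetResponse(System.IAsyncResult ar)
    {
        var state = (FetchState)ar.AsyncState;
        var response = state.Request.EndGetResponse(ar);
        state.ResponseStream = response.GetResponseStream();
        // Asynchronously read the first chunk of the body.
        state.ResponseStream.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
    }

    static void OnRead(System.IAsyncResult ar)
    {
        var state = (FetchState)ar.AsyncState;
        int read = state.ResponseStream.EndRead(ar);
        if (read > 0)
        {
            state.Content.Write(state.Buffer, 0, read);
            // Keep reading until the stream is drained.
            state.ResponseStream.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
        }
        else
        {
            state.ResponseStream.Close();
            ProcessPage(state.Content.ToArray());   // placeholder for whatever the crawler does next
        }
    }

    static void ProcessPage(byte[] content) { /* placeholder */ }
}
```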

Everything seemed perfect until the crawler went through a stress test. Under stress, the crawler suffered from high memory usage. I checked the memory with WinDbg+SoS and found out that lots of byte arrays were pinned by System.Threading.OverlappedData instances. After some searching on the internet, I found this KB from Microsoft: http://support.microsoft.com/kb/947862

According to the KB, the number of outstanding asynchronous I/O operations should have an "upper bound", but it doesn't give a "suggested" bound value. So, in my eyes, this KB helps nothing. This is obviously a .NET bug. In the end, I had to drop the idea of asynchronously extracting bytes from the response Stream and just do it synchronously.

The .NET library that allows asynchronous I/O with .NET sockets (Socket.BeginSend / Socket.BeginReceive / NetworkStream.BeginRead / NetworkStream.BeginWrite) must have an upper bound on the number of buffers outstanding (either send or receive) with their asynchronous I/O.

The network application should have an upper bound on the number of outstanding asynchronous I/O operations that it posts.

Edit: Added some questions below.

Does anybody have experience doing asynchronous I/O on Socket & NetworkStream? Generally speaking, do production crawlers do I/O with the internet synchronously or asynchronously?

+3  A: 

You obviously want to limit the number of concurrent requests, no matter whether your crawler is sync or async. That limit is not fixed; it depends on your hardware, network, ...

I'm not sure what your question is here, as the .NET implementation of HTTP/sockets is "ok". There are some holes (see my post about controlling timeouts properly), but it gets the job done (we have a production crawler that fetches hundreds of pages per second).

BTW, we use synchronous IO, just for convenience's sake. Every task has a thread, and we limit the number of concurrent threads. For thread management, we used Microsoft CCR.
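For illustration, a rough sketch of that approach (not our actual CCR-based code; the worker count, seed URL, and BlockingCollection queue are just assumptions for the example):

```csharp
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

class SyncCrawler
{
    const int WorkerCount = 20;   // illustrative limit on concurrent downloads

    // URL queue shared by all worker threads.
    static readonly BlockingCollection<string> Urls = new BlockingCollection<string>();

    static void Main()
    {
        for (int i = 0; i < WorkerCount; i++)
        {
            new Thread(Worker).Start();
        }

        Urls.Add("http://example.com/");   // placeholder seed URL
        // ... enqueue more URLs, then call Urls.CompleteAdding() when the crawl is done.
    }

    static void Worker()
    {
        // Each worker does plain synchronous downloads; concurrency is bounded by the thread count.
        foreach (var url in Urls.GetConsumingEnumerable())
        {
            using (var client = new WebClient())
            {
                byte[] page = client.DownloadData(url);
                // ... parse the page, enqueue discovered links, etc.
            }
        }
    }
}
```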

ripper234
I have no doubt that synchronous I/O on Socket works fine in .NET. I just don't trust its asynchronous I/O API.
Morgan Cheng
The problem is aborting/canceling ops; it never works well in .NET. You should always prefer the synchronous API (with timeouts); that way you don't need to cancel the op yourself.
ripper234
+3  A: 

Hmya, this is not a .NET framework problem. The linked KB article could have been a bit more explicit: "you're using a loaded gun, this is what happens when you aim it at your foot". The bullets in that gun are .NET giving you the ability to start as many asynchronous I/O requests as you dare. It will do what you ask it to do, until you hit some kind of resource limit. In this case, probably, having too many pinned receive buffers in the generation 0 heap.

Resource management is still very much our job, not .NET's. It is no different from allocating memory without bound. Solving this particular problem requires you to put a limit on the number of uncompleted BeginGetResponse() requests. Having hundreds of them makes little sense; every one of them has to squeeze through the Intertube one at a time. Adding another request will just cause it to take longer to complete, or crash your program.
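As an illustration only (the cap of 20 and the Fetch helper are made up for the example, not prescribed by the KB), a counting semaphore is one simple way to enforce such a limit:

```csharp
using System.Net;
using System.Threading;

class BoundedFetcher
{
    // Illustrative cap: at most 20 uncompleted BeginGetResponse() requests at once.
    static readonly Semaphore Outstanding = new Semaphore(20, 20);

    public static void Fetch(string url)
    {
        Outstanding.WaitOne();                          // block while 20 requests are in flight
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.BeginGetResponse(ar =>
        {
            try { request.EndGetResponse(ar).Close(); } // real code would read the response stream here
            finally { Outstanding.Release(); }          // free a slot for the next request
        }, null);
    }
}
```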

Hans Passant
But how can I tell the "upper bound" in my program? The fact is that .NET doesn't release the pinned byte array even if the application has aborted the BeginXXX operation after a timeout. I still believe this is a .NET bug.
Morgan Cheng
A: 

No KB article can give you an upper bound. Upper bounds can vary depending on the hardware available - what is an upper bound for a machine with 2 GB of memory will be different for a machine with 16 GB of RAM. It will also depend on the size of the GC heap, how fragmented it is, etc.

What you should do is come up with a metric of your own using back-of-the-envelope calculations. Figure out how many pages you want to download per minute. That should determine how many async requests you want outstanding (N).
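For example (numbers purely illustrative): if the target is 600 pages per minute (10 pages per second) and a typical download takes about 2 seconds, then by Little's law you need roughly N = 10 pages/s × 2 s = 20 requests outstanding at any time.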

Once you know N, create a piece of code (like the consumer end of a producer-consumer pipeline) that can keep N async download requests outstanding. As soon as a request finishes (either due to timeout or due to success), kick off another async request by pulling a work item from the queue.

You also need to make sure that the queue does not grow beyond bounds if, for example, the download becomes slow for whatever reason.
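A rough sketch of that consumer end (N, the queue capacity, and the response handling are illustrative assumptions, not a definitive implementation):

```csharp
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

class DownloadPipeline
{
    const int N = 20;                 // outstanding requests, from the back-of-envelope estimate
    const int MaxQueuedUrls = 10000;  // cap so the URL queue cannot grow without bound

    // Bounded queue: producers block once it is full.
    static readonly BlockingCollection<string> Queue =
        new BlockingCollection<string>(MaxQueuedUrls);

    static readonly SemaphoreSlim Slots = new SemaphoreSlim(N, N);

    // Consumer end: keeps at most N async downloads in flight at any time.
    public static void Consume()
    {
        foreach (var url in Queue.GetConsumingEnumerable())
        {
            Slots.Wait();   // wait until one of the N slots is free
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.BeginGetResponse(ar =>
            {
                try
                {
                    using (var response = request.EndGetResponse(ar))
                    using (var stream = response.GetResponseStream())
                    {
                        // ... read the page content here ...
                    }
                }
                catch (WebException)
                {
                    // timeout or failure: record it and move on
                }
                finally
                {
                    Slots.Release();   // completed (success or failure), so the next request can start
                }
            }, null);
        }
    }
}
```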

feroze