tags:
views: 293
answers: 4

Hi there,

I've just seen a web crawler in action on my computer, and it downloaded meta tag info from what looked like thousands of pages in only a few minutes.

Yet when I use WebClient to download pages and then parse them locally, it takes about 40 seconds just to download a single webpage. Why is that? Is there an alternative to WebClient for downloading webpages?

thanks:)

+1  A: 

A few things to consider:

  • How many pages are you downloading at once? Web crawlers tend to work in a highly parallel way.
  • By default the .NET framework restricts the number of parallel requests to a single site. That's generally a nice thing to do - you may want to raise the limit a bit, but ideally target different sites in parallel. The <connectionManagement> element is the one you need to look at.
  • Have you used WireShark to see what's going on at the network level? If the web site is taking 40 seconds to serve the page, it's hard to see how changing from using WebClient would help.
  • Could you post some code to show exactly what you're doing?
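For reference, the connection limit Jon mentions is raised in app.config/web.config. A minimal sketch (the `address` wildcard and the limit of 20 are example values, not recommendations):

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- allow up to 20 parallel connections per host (the default is 2) -->
      <add address="*" maxconnection="20" />
    </connectionManagement>
  </system.net>
</configuration>
```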

It's possible that using a different API (possibly even just WebRequest) will speed things up, but you really need to find the current bottleneck first.
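To find that bottleneck, it helps to time a single request in isolation. A rough sketch using HttpWebRequest (the URL is a placeholder); if this also takes ~40 seconds, the problem is the network or the server, not WebClient:

```csharp
// Sketch: timing one download with HttpWebRequest to locate the bottleneck.
using System;
using System.Diagnostics;
using System.IO;
using System.Net;

class TimingSketch
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        request.Proxy = null; // skip automatic proxy detection, a common cause of delays
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine("{0} chars in {1} ms", html.Length, sw.ElapsedMilliseconds);
        }
    }
}
```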

Jon Skeet
Unless we know the cause of the slowness, it is too early to say WebClient is problematic.
Lex Li
A: 

There are a couple of reasons why you might be getting poor performance:

  • Not using asynchronous methods / threads
  • A poor HTML parsing algorithm
  • The page you are downloading with WebClient is simply slow to serve

More information/source code will be needed to find a definitive answer.
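On the first point, overlapping several requests instead of downloading one page at a time usually makes the biggest difference. A sketch using WebClient's async API (URLs are placeholders; CountdownEvent is .NET 4, so on 3.5 you'd use a counter plus a ManualResetEvent):

```csharp
// Sketch: overlapping several downloads with WebClient's async API.
using System;
using System.Net;
using System.Threading;

class ParallelDownloads
{
    static void Main()
    {
        string[] urls = { "http://example.com/a", "http://example.com/b" };
        var done = new CountdownEvent(urls.Length);
        foreach (string url in urls)
        {
            var client = new WebClient();
            client.DownloadStringCompleted += (s, e) =>
            {
                if (e.Error == null)
                    Console.WriteLine("Got {0} chars", e.Result.Length);
                done.Signal();
            };
            client.DownloadStringAsync(new Uri(url)); // returns immediately
        }
        done.Wait(); // block until every download has completed
    }
}
```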

Darin Dimitrov
+1  A: 

There have been a couple of posts about WebClient being slow when a default proxy instance is configured; MSDN Social has some details on this. There are several things you can do to make it faster, including using asynchronous connections, threads, and, if you really need the performance, writing the socket code yourself. There are also commercial libraries which claim to outperform the default framework classes; they may be of benefit if you are willing to pay for them.
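The proxy issue in particular has a one-line workaround that is frequently reported to eliminate a long delay on the first request. A sketch (the URL is a placeholder):

```csharp
// Sketch: disabling automatic proxy detection on WebClient, a commonly
// reported cause of slow first requests.
using System;
using System.Net;

class ProxyFix
{
    static void Main()
    {
        var client = new WebClient();
        client.Proxy = null; // skip the default proxy auto-detection step
        string html = client.DownloadString("http://example.com/");
        Console.WriteLine(html.Length);
    }
}
```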

I have a few programs which use WebRequest (not WebClient) and I see throughput in the near-MB/s range with resources in the 10-20MB range coming from halfway around the world. So it is definitely possible with the framework natively.

GrayWizardx
A: 

Almost certainly there is another issue in your code that isn't discoverable from the information you have posted.

On the other hand, while making a C# crawler, we found the WebRequest/WebClient API to be very heavy on CPU usage, and ultimately unsuitable for crawling. In the end we wrote our own HTTP stack using the Socket.XxxxAsync methods which reduced CPU load by about 20 times. Be warned that there's quite a steep learning curve involved in pursuing this path.
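For a flavour of what that path looks like, here is a very rough sketch of the Socket.XxxxAsync pattern (host and request are placeholders; a real crawler would loop on ReceiveAsync until the server closes the connection, and would parse the HTTP headers itself):

```csharp
// Very rough sketch of the Socket.XxxxAsync pattern; no HTTP parsing.
using System;
using System.Net.Sockets;
using System.Text;

class AsyncSocketSketch
{
    static void Main()
    {
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        socket.Connect("example.com", 80); // blocking connect for brevity

        byte[] request = Encoding.ASCII.GetBytes(
            "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
        socket.Send(request);

        var args = new SocketAsyncEventArgs();
        args.SetBuffer(new byte[8192], 0, 8192);
        args.Completed += (s, e) =>
        {
            Console.WriteLine("Received {0} bytes", e.BytesTransferred);
            // A real crawler would issue another ReceiveAsync here until 0 bytes.
        };
        if (!socket.ReceiveAsync(args)) // false means it completed synchronously
            Console.WriteLine("Received {0} bytes (sync)", args.BytesTransferred);
        Console.ReadLine(); // keep the process alive for the callback
    }
}
```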

spender