I have a file with millions of URLs/IPs and have to write a program that downloads the pages really fast. The connection rate should be at least 6000/s and the download rate at least 2000 files/s, with an average file size of 15 KB. The network bandwidth is 1 Gbps.

My approach so far: create 600 socket threads, each handling 60 sockets, and use WSAEventSelect to wait for data to read. As soon as a file download completes, add the memory address of the downloaded file to a pipeline (a simple vector) and fire another request. When the total downloaded across all socket threads exceeds 50 MB, write all the downloaded files to disk and free the memory. So far this approach has not been very successful: the connection rate does not go beyond 2900 connections/s, and the download rate is even lower.
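The buffering step described above could be sketched roughly as follows. This is a minimal illustration in portable C++, not the asker's actual code: `DownloadedFile`, `DownloadBatch`, and `add` are hypothetical names, and the threshold is passed in rather than fixed at 50 MB. The one design point it tries to show is that the batch is swapped out while the lock is held, so the caller can write it to disk after the lock has been released.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one completed download (names are illustrative).
struct DownloadedFile {
    std::string url;
    std::vector<char> body;
};

// Shared "pipeline" that socket threads push completed downloads into.
class DownloadBatch {
public:
    explicit DownloadBatch(std::size_t flush_threshold_bytes)
        : threshold_(flush_threshold_bytes) {}

    // Called by a socket thread when a download completes.
    // Returns the whole batch once the buffered total exceeds the threshold;
    // otherwise returns an empty vector. The caller writes the returned
    // files to disk, which happens outside the lock.
    std::vector<DownloadedFile> add(DownloadedFile file) {
        std::vector<DownloadedFile> ready;
        std::lock_guard<std::mutex> lock(mutex_);
        bytes_ += file.body.size();
        files_.push_back(std::move(file));
        if (bytes_ > threshold_) {
            ready.swap(files_);  // hand the batch over, leave files_ empty
            bytes_ = 0;
        }
        return ready;
    }

    // Number of bytes currently held in memory.
    std::size_t buffered_bytes() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return bytes_;
    }

private:
    std::size_t threshold_;
    mutable std::mutex mutex_;
    std::vector<DownloadedFile> files_;
    std::size_t bytes_ = 0;
};
```

A socket thread would call `add(...)` after each completed download and write the returned files to disk only when the result is non-empty. With hundreds of threads sharing one mutex, this single vector may itself become a point of contention, so it is worth measuring.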

Can somebody suggest an alternative approach that could give me better stats? I am working on a Windows Server 2008 machine with 8 GB of memory. Do we need to hack the kernel so that we can use more threads and memory? Currently I can create a maximum of 1500 threads, and memory usage does not go beyond 2 GB [which technically should be much more, as this is a 64-bit machine]. IOCP is out of the question, as I have no experience with it so far and have to fix this application today.
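One possible explanation for the 1500-thread and 2 GB ceilings, offered as an assumption rather than a diagnosis: if the program is built as a 32-bit executable, it normally gets only 2 GB of user address space even on 64-bit Windows, and each thread reserves 1 MB of stack by default. The sketch below uses hypothetical helper names (`process_bits`, `stack_reserve_mb`) to check the pointer width of the running build and to estimate the address space taken by thread stacks alone.

```cpp
#include <cstddef>
#include <cstdint>

// Pointer width of the running process. A result of 32 means a 32-bit
// build, regardless of whether the operating system is 64-bit.
constexpr int process_bits() {
    return static_cast<int>(sizeof(void*) * 8);
}

// Address space reserved for thread stacks alone, in MB.
// stack_mb_per_thread is an input; 1 MB is the usual default reserve
// for executables built with the Microsoft linker.
constexpr std::uint64_t stack_reserve_mb(std::uint64_t threads,
                                         std::uint64_t stack_mb_per_thread) {
    return threads * stack_mb_per_thread;
}
```

Under these assumptions, `stack_reserve_mb(1500, 1)` gives 1500 MB, which is already most of a 2 GB address space. If `process_bits()` reports 32, rebuilding as 64-bit or passing a smaller stack size when creating threads would be the first things to try, with no kernel changes involved.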

Thanks Guys!

+1  A: 

First and foremost you need to figure out what is limiting your application. Are you CPU-bound, IO-bound, memory-bound, network-bound, ...? Is there locking contention between your threads? etc...

It's impossible to say from your description. You will need to run your app in a profiler to get an idea of where the bottlenecks are.
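Before profiling, a quick back-of-envelope check can at least show whether the targets fit on the link. The sketch below is only an estimate under stated assumptions: it reads the question's target as 2000 files/s, takes 1 KB as 1000 bytes, and ignores TCP/IP and HTTP header overhead. The function names are made up for illustration.

```cpp
#include <cassert>

// Payload bandwidth in Mbps needed to sustain a given file rate.
// Assumes 1 KB = 1000 bytes and ignores protocol overhead.
double required_mbps(double files_per_sec, double avg_file_kb) {
    assert(files_per_sec >= 0 && avg_file_kb >= 0);
    return files_per_sec * avg_file_kb * 1000.0 * 8.0 / 1e6;
}

// Fraction of the link consumed by that payload (1.0 = saturated).
double link_utilization(double files_per_sec, double avg_file_kb,
                        double link_mbps) {
    assert(link_mbps > 0);
    return required_mbps(files_per_sec, avg_file_kb) / link_mbps;
}
```

For the numbers in the question, `required_mbps(2000, 15)` comes to 240 Mbps, about a quarter of a 1 Gbps link. That suggests raw bandwidth is probably not the first limit, which leaves CPU, lock contention, and per-connection overhead as the things to look at in the profiler.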

Frank Meerkötter
A: 

I don't see any performance gain from using extra sockets. A single-CPU machine has to "share" code execution between the various sockets, dividing the performance. The same is true with too many threads.

For serious performance handling, you will need extra hardware support. You will need to convert the incoming (serial) data into multiple buffers of data (parallel). This will not necessarily boost your performance. However, if you could download one page per physical connection, that may boost your performance.

Most of the bottleneck (IMHO) is receiving data packets and analyzing their destinations. The more of these analysts, the faster your performance; although you may have performance hits when one or more directors want to use the same memory area (for example, two directors downloading the same page).

If you can have the hardware support download an entire page, uninterrupted by a CPU, that is the fastest performance you will see.

"That's just my opinion, I could be wrong." -- Dennis Miller.

Thomas Matthews