views: 395
answers: 4
I'm writing a multi-threaded Java web crawler. From what I understand of the web, when a user loads a web page the browser requests the first document (e.g., index.html), and as it receives the HTML it finds other resources that need to be included (images, CSS, JS) and requests those resources concurrently.

My crawler is only requesting the original document. For some reason, I can't get it to scrape more than 2 to 5 pages every 5 seconds. I'm spinning up a new thread for every HttpURLConnection I make. It seems like I should be able to scrape at least 20-40 pages per second. If I try to spin up 100 threads I get I/O exceptions like crazy. Any ideas what's going on?

+1  A: 

It would be a good idea to look at your code, as you might have done something slightly wrong that breaks your crawler. But as a general rule of thumb, asynchronous IO is far superior to the blocking IO that HttpURLConnection offers. Asynchronous IO allows you to handle all of the processing in a single thread, while all the actual IO is done by the operating system on its own time.

For a good implementation of the HTTP protocol over asynchronous IO, look at Apache HttpCore. See an example of such a client here.
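As a rough illustration of the idea (not Apache HttpCore itself), here is a minimal sketch using the `java.net.http.HttpClient` added in Java 11, which supports asynchronous requests out of the box. The class name `AsyncFetcher` is made up for the example:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

public class AsyncFetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    // Starts the request without blocking the calling thread; the actual IO
    // is multiplexed by the runtime, so one thread can drive many fetches.
    public CompletableFuture<String> fetch(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                     .thenApply(HttpResponse::body);
    }
}
```

You would kick off many `fetch` calls, collect the futures, and process bodies as they complete, all from a handful of threads instead of one thread per connection.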

Guss
A: 

Details on what -kind- of IOExceptions you're receiving might be handy. There are a few possibilities to consider.

  • Going over open file descriptor limits (too many sockets).
  • Refused connections due to opening too many connections to a given server.
  • Fetching too much data before being able to process any of it. Assuming blocking IO: if you make 100 requests to 100 different servers, you're suddenly going to get a flood of data back. HTTP GET requests are small; the responses may not be. You can effectively DDoS yourself.
  • You made a silly mistake in your code :)
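To guard against the first two failure modes above (descriptor exhaustion and flooding yourself), one simple pattern is to gate connection attempts behind a `Semaphore`. This is only a sketch; the class name, the cap of 20, and the empty fetch body are all illustrative:

```java
import java.util.concurrent.Semaphore;

public class ConnectionGate {
    // Cap concurrent open sockets well below the file-descriptor limit.
    private final Semaphore permits;

    public ConnectionGate(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public void fetch(String url) throws InterruptedException {
        permits.acquire();       // block until a connection slot frees up
        try {
            // open the HttpURLConnection and read the response here
        } finally {
            permits.release();   // always free the slot, even on IOException
        }
    }

    public int availableSlots() {
        return permits.availablePermits();
    }
}
```

With `new ConnectionGate(20)`, no more than 20 sockets are ever open at once, no matter how many crawler threads you start.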
James
A: 

The best number of threads or HttpURLConnections depends on many factors.

  • If you crawl an external website that you do not own, you should use only one thread per site and add delays between requests; otherwise the website may mistake your crawler for a DoS attack. It can make sense to crawl different websites in parallel during the delays.
  • If it is your own website, without DoS detection, then it depends on the network latency. If the web server is on your LAN, it can be helpful to use twice the number of CPU cores you have. If the web server is on the Internet, somewhat more threads can help. But I think 100 threads is too many; that can knock out your web server. How many worker threads does the web server have?
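The "twice the CPU cores" rule of thumb from the answer above can be turned into a fixed-size pool, so thread count is bounded up front instead of growing with each request. The class name is made up for this sketch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CrawlerPool {
    // Rule of thumb for a server on the LAN: twice the number of CPU cores.
    public static int suggestedThreads() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    // A fixed-size pool never spins up more threads than the cap,
    // unlike one-thread-per-connection, which grows without bound.
    public static ExecutorService newPool() {
        return Executors.newFixedThreadPool(suggestedThreads());
    }
}
```

You would submit each page fetch as a task to this pool rather than calling `new Thread(...)` per connection.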
Horcrux7
A: 

Oh, and I hope you're close()ing the InputStreams that you get from the connections. They get closed in the connection's finalizer anyway, but that may easily be seconds later. I ran into that issue myself, so maybe that helps you.
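A sketch of this pattern, assuming a hypothetical `PageFetcher` class: try-with-resources closes the stream (and releases the socket) as soon as reading finishes, instead of waiting for the finalizer:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageFetcher {
    // try-with-resources guarantees the stream is closed promptly,
    // even if readLine throws, instead of waiting for a finalizer.
    static String readAll(InputStream raw) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(raw))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect(); // drop the connection once done with this host
        }
    }
}
```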

derBiggi