I'm writing a multi-threaded Java web crawler. From what I understand of the web, when a user loads a web page the browser requests the first document (e.g., index.html), and as it receives the HTML it finds other resources that need to be included (images, CSS, JS) and requests those resources concurrently.
My crawler only requests the original document, so I'd expect it to be lighter per page than a browser. For some reason, though, I can't get it to scrape more than 2 to 5 pages every 5 seconds, even though I'm spinning up a new thread for every HttpURLConnection I make. It seems like I should be able to scrape at least 20-40 pages per second. If I try to spin up 100 threads, I get I/O exceptions like crazy. Any ideas what's going on?
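Here's a stripped-down sketch of the thread-per-URL pattern I'm using (simplified; my real code also parses links, and the local stand-in server is just there so the sketch runs without hitting a real site):

```java
import com.sun.net.httpserver.HttpServer;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class CrawlerSketch {

    // Fetch one page -- roughly what each of my threads does.
    static String fetchPage(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) page.append(line);
            return page.toString();
        } finally {
            conn.disconnect();
        }
    }

    // One new Thread per URL -- the pattern I suspect is the problem.
    static Map<String, String> crawlAll(List<String> urls) throws InterruptedException {
        Map<String, String> results = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(urls.size());
        for (String u : urls) {
            new Thread(() -> {
                try {
                    results.put(u, fetchPage(u));
                } catch (Exception e) {
                    e.printStackTrace(); // this is where my I/O exceptions show up
                } finally {
                    done.countDown();
                }
            }).start();
        }
        done.await();
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Local stand-in server on an ephemeral port (not my real target site).
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", ex -> {
            byte[] body = "<html>hello</html>".getBytes();
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.start();
        int port = server.getAddress().getPort();

        Map<String, String> pages = crawlAll(List.of(
                "http://localhost:" + port + "/a",
                "http://localhost:" + port + "/b"));
        System.out.println("fetched " + pages.size() + " pages");
        server.stop(0);
    }
}
```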