Hello,

My application needs some web scraping functionality. I have a URL object that downloads all the data, but I need to scrape many pages, so I create many URL objects and open many connections. How can I optimize this so that I have one connection and just navigate to other pages with it?

Cheers

A: 

As far as I can tell, you must have a different URLConnection for each URL (which makes sense as the underlying network connection must change as well). I seriously doubt that creating this object is your bottleneck; I suspect it is the network time, but without profiling it is hard to know for certain.

For a moderate number of pages, I would consider a worker queue (say, using an ExecutorService). For a large number of pages, I might even look into a Java version of Map/Reduce.
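A minimal sketch of the worker-queue idea, assuming Java 9 or later. The names ScrapeQueue, fetchAll and download are made up for illustration, and the pool size of 4 is an arbitrary guess rather than a tuned value. Each URL still gets its own connection; the pool only limits how many downloads run at once.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class; not part of any library.
class ScrapeQueue {

    // Downloads one page with a plain URL object, as in the question.
    // Assumes the page is UTF-8 text.
    static String download(String address) {
        try (InputStream in = new URL(address).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Submits one task per address to a fixed-size thread pool and
    // returns the downloaded bodies keyed by address, in input order.
    static Map<String, String> fetchAll(List<String> addresses,
                                        Function<String, String> downloader,
                                        int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String address : addresses) {
                Callable<String> task = () -> downloader.apply(address);
                pending.add(pool.submit(task));
            }
            Map<String, String> pages = new LinkedHashMap<>();
            for (int i = 0; i < addresses.size(); i++) {
                // get() blocks until that download has finished (or failed).
                pages.put(addresses.get(i), pending.get(i).get());
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    // Fetches the URLs given on the command line, four at a time.
    public static void main(String[] args) throws Exception {
        Map<String, String> pages = fetchAll(List.of(args), ScrapeQueue::download, 4);
        pages.forEach((address, body) ->
                System.out.println(address + ": " + body.length() + " chars"));
    }
}
```

The downloader is passed in as a function so the queue logic can be tried without touching the network. A failed download surfaces as an ExecutionException from get(), which a real scraper would probably want to catch per page instead of aborting the whole batch.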

Edit: For Map/Reduce to be better than a simple worker queue, you need to have multiple computers available to do the scraping.

Kathy Van Stone
A: 

You could use Apache HttpComponents; it has a lot of features, including a connection manager that supports concurrent access.

Guillaume