I am writing a Java program that downloads and then processes many webpages. What is the best practice for testing a component of the program that downloads a page without hitting the remote servers?
One thought would be to use "InputStream" as the object you pass to your processing code. I believe the HttpClient (or equivalent) class for reading data via HTTP gives you some sort of stream to read the response from. For testing, you could just substitute a different type of stream to read from, such as a local FileInputStream.
So the component that does the download and the component that processes the page should be separate. Any time you are having trouble unit testing a piece of code, that's a sign that you may be trying to do too much in one component.
Once you've done that, you test the processing part however makes the most sense. Have the processor component take an InputStream or even just a String as input.
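To make that concrete, here is a minimal sketch of a processor that only sees an InputStream. The class name PageProcessor and the extractTitle method are made up for illustration (your real processing will differ), and readAllBytes assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical processor: it only knows about streams, not about HTTP.
class PageProcessor {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first <title> element, or null if there is none.
    static String extractTitle(InputStream in) throws IOException {
        String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) throws IOException {
        // In a test, canned HTML stands in for the network response.
        String html = "<html><head><title>Hello</title></head><body></body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractTitle(in));
    }
}
```

In production the same method would receive the stream from the HTTP response; in a unit test it receives a ByteArrayInputStream or a FileInputStream over a saved page.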
As for the download part, you probably need an integration test. Integration tests are often a lot more involved and would require setting up a local web server (Maven can do this), or at the very least using a file: URL.
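If you don't want to involve the build tool, one possible sketch is to start a throwaway server inside the test itself using the com.sun.net.httpserver.HttpServer class that ships with the JDK. The class name LocalServerDemo, the download method, and the /page path are assumptions for illustration, not part of any particular framework.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class LocalServerDemo {
    // Hypothetical download component under test; works for http: and file: URLs.
    static String download(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Starts a server on a free loopback port that serves one fixed page at /page.
    static HttpServer startServer(String body) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/page", exchange -> {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer("<html>canned page</html>");
        try {
            int port = server.getAddress().getPort();
            URL url = URI.create("http://127.0.0.1:" + port + "/page").toURL();
            System.out.println(download(url));
        } finally {
            server.stop(0); // always release the port, even if the download fails
        }
    }
}
```

Because download only relies on URL.openStream, the same method can also be pointed at a file: URL for a saved page when a server is more than you need.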
If your code supports an HTTP proxy, you could use an off-network cache that acts as the proxy. Run the code once with the proxy in caching mode so that it records the responses (and, if you like, the network delays, etc.). After that, run the code with the proxy simply replaying the recorded data. Switching between the two is just a matter of configuring the HTTP proxy.
The advantage of this approach is that you can unit test against an arbitrary number of sites. Your network cache/HTTP proxy would also be reusable later.
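On the Java side, making the proxy configurable could look roughly like the sketch below, which uses the standard java.net.Proxy class. The ProxyAwareDownloader class, the "host:port" setting format, and the crawler.proxy property name are all invented for this example; the caching proxy itself is a separate piece of software that is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

// Hypothetical downloader whose only network knob is which proxy to use.
class ProxyAwareDownloader {
    private final Proxy proxy;

    ProxyAwareDownloader(Proxy proxy) {
        this.proxy = proxy;
    }

    // Builds a downloader from a "host:port" setting; null or empty means a direct connection.
    static ProxyAwareDownloader fromSetting(String hostAndPort) {
        if (hostAndPort == null || hostAndPort.isEmpty()) {
            return new ProxyAwareDownloader(Proxy.NO_PROXY);
        }
        int colon = hostAndPort.lastIndexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("expected host:port but got " + hostAndPort);
        }
        String host = hostAndPort.substring(0, colon);
        int port = Integer.parseInt(hostAndPort.substring(colon + 1));
        return new ProxyAwareDownloader(
                new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port)));
    }

    Proxy proxy() {
        return proxy;
    }

    // Opens the page through whichever proxy was configured.
    InputStream open(URL url) throws IOException {
        return url.openConnection(proxy).getInputStream();
    }

    public static void main(String[] args) {
        // "crawler.proxy" is an assumed property name, e.g. -Dcrawler.proxy=localhost:3128
        ProxyAwareDownloader d = fromSetting(System.getProperty("crawler.proxy"));
        System.out.println("Using proxy: " + d.proxy());
    }
}
```

Tests would then run with the setting pointing at the caching proxy, and production would leave it empty or point it at a real proxy.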
Check out Dependency Injection.
It's a technique where you "inject" the different "dependencies" into your functions instead of creating them inside the function to begin with (simplified explanation).
Read Martin Fowler's article about DI:
http://martinfowler.com/articles/injection.html
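Applied to the question, a rough sketch of the idea might look like this. The PageFetcher interface and the LinkCounter class are made-up names, and the link counting is deliberately naive; the point is only that the class under test receives its fetcher from outside.

```java
import java.io.IOException;

// Hypothetical dependency: anything that can turn a URL into page content.
interface PageFetcher {
    String fetch(String url) throws IOException;
}

// The class under test receives its fetcher instead of creating one itself.
class LinkCounter {
    private final PageFetcher fetcher;

    LinkCounter(PageFetcher fetcher) {
        this.fetcher = fetcher;
    }

    // Counts occurrences of "<a " in the page (a naive stand-in for real processing).
    int countLinks(String url) throws IOException {
        String html = fetcher.fetch(url);
        int count = 0;
        for (int i = html.indexOf("<a "); i >= 0; i = html.indexOf("<a ", i + 1)) {
            count++;
        }
        return count;
    }
}

class DiDemo {
    public static void main(String[] args) throws IOException {
        // The test injects a stub that never touches the network.
        PageFetcher stub = url -> "<a href='x'>x</a> <a href='y'>y</a>";
        System.out.println(new LinkCounter(stub).countLinks("http://example.com/"));
    }
}
```

In production you would pass in a PageFetcher that really does the HTTP request; the LinkCounter code stays the same either way.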
hope it helps
/Jonas