Hello,

I need to mirror some websites from my Java application. I looked for an open-source Java library to do this job, but didn't find anything suitable.

Does anybody know of a Java-friendly tool for retrieving entire websites, or do I have to fall back on executing wget from my program?

Thanks a lot.

A: 

I would recommend a crawler/spider. Aspider and Sperowider both use the Apache HttpClient library (my favourite HTTP library) and crawl through a site by following its links. Since they are open source, you should be able to integrate them into your software. Both are currently unmaintained, though, so Apache HttpClient itself would be a good place to start if you want to write your own mirroring tool in Java.
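To give an idea of what "following links" involves, here is a minimal sketch of the crawl loop. It is not how Aspider or Sperowider work internally. The class name `MirrorCrawler`, the pluggable `fetch` function and the `maxPages` limit are all made up for illustration, and the regex-based link extraction is a deliberate simplification (a real tool should use an HTML parser). The HTTP part is left out on purpose: `fetch` is where a call to Apache HttpClient, or any other HTTP library, would go.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MirrorCrawler {
    // Naive href matcher (assumption: quoted attribute values); fragments after '#' are dropped.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Visits pages breadth-first starting at 'start', staying on the same host.
     * 'fetch' is a hypothetical hook that returns the HTML of a page, or null on failure.
     */
    public static Set<URI> crawl(URI start, Function<URI, String> fetch, int maxPages) {
        Set<URI> visited = new LinkedHashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI page = queue.poll();
            if (!visited.add(page)) {
                continue; // already processed
            }
            String html = fetch.apply(page);
            if (html == null) {
                continue; // download failed, nothing to follow
            }
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                URI link;
                try {
                    // Resolve relative links against the page they were found on.
                    link = page.resolve(m.group(1).trim()).normalize();
                } catch (IllegalArgumentException e) {
                    continue; // skip malformed links
                }
                boolean sameHost = start.getHost() != null
                        && start.getHost().equalsIgnoreCase(link.getHost());
                if (sameHost && !visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

A mirroring tool would additionally write each fetched page to disk and rewrite its links to point at the local copies; that part is omitted here.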

whatnick
A: 

The biggest problem I found with this kind of library was the lack of support for CSS parsing, which is needed so that imported stylesheets, background images and so on get downloaded as well when mirroring a website.
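If you do end up writing your own tool, the CSS part could look roughly like the sketch below: scan each stylesheet for `url(...)` and `@import "..."` references and resolve them against the stylesheet's own URL, so they can be queued for download. The class name `CssReferences` is made up for illustration, and the regex is only an approximation of CSS syntax (it ignores comments and escaped characters), so treat it as a starting point rather than a real CSS parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssReferences {
    // Group 1: target of url(...), quotes optional. Group 2: target of @import "..." without url().
    private static final Pattern REF = Pattern.compile(
            "url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)|@import\\s+['\"]([^'\"]+)['\"]",
            Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URIs referenced by 'css', resolved against the stylesheet's location. */
    public static List<URI> extract(URI stylesheet, String css) {
        List<URI> refs = new ArrayList<>();
        Matcher m = REF.matcher(css);
        while (m.find()) {
            String raw = (m.group(1) != null ? m.group(1) : m.group(2)).trim();
            if (raw.startsWith("data:")) {
                continue; // inline data, nothing to download
            }
            try {
                refs.add(stylesheet.resolve(raw));
            } catch (IllegalArgumentException e) {
                // skip references that are not valid URIs
            }
        }
        return refs;
    }
}
```

Imported stylesheets found this way have to be scanned again in turn, since they can contain further imports and images.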

wget has built-in support for this (at least in recent versions). Although running an external program from Java is not a very clean solution, I'd try it first and see if it fits your needs.
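Calling wget from Java could look something like this. It is only a sketch: the class name `WgetMirror` and the choice of options are mine, and it assumes that wget is installed and on the PATH. The options used are standard wget ones: `--mirror` for recursive download, `--page-requisites` for images and stylesheets, `--convert-links` to make the local copy browsable, and `--no-parent` to stay below the start URL.

```java
import java.io.IOException;
import java.util.List;

public class WgetMirror {
    /** Builds the wget command line; kept separate so it can be inspected or logged. */
    public static List<String> buildCommand(String url, String targetDir) {
        return List.of(
                "wget",
                "--mirror",
                "--page-requisites",
                "--convert-links",
                "--no-parent",
                "--directory-prefix=" + targetDir,
                url);
    }

    /** Runs wget and returns its exit code (0 means success). */
    public static int mirror(String url, String targetDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(url, targetDir))
                .inheritIO() // show wget's progress output on our own console
                .start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list to `ProcessBuilder`, rather than building one command string, avoids quoting problems with URLs that contain spaces or shell metacharacters.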

Grieih