I am using the URLConnection class, and I want to be able to grab a stream to a given URL even if that URL is unavailable (i.e. cache the last known copy of the content at a URL to some local file system directory). I have written this code a few times (and was never happy with it), and I was wondering if there is something better out there that can do this.
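The pattern I keep rewriting is roughly this (just a sketch; the method name and the local-copy handling are placeholders):

InputStream openWithFallback(URL url, File localCopy) throws IOException {
    try {
        InputStream fresh = url.openConnection().getInputStream();
        // ...tee the fresh bytes into localCopy before handing them back...
        return fresh;
    } catch (IOException e) {
        // URL unavailable: fall back to the last copy written to disk
        if (localCopy.exists()) return new FileInputStream(localCopy);
        throw e;
    }
}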
A:
I just ran into this same problem and threw together my own WebCache class. I haven't tested it yet, but you can give it a try if you want. Just construct it with a directory where you want to cache the pages, then call getPage(String url) to grab a page. getPage checks the cache directory first; if the page isn't there, it downloads it to the cache and returns the result. The cache filenames are url.hashCode() + ".cache".
This is just for getting the source of a page. I'm not sure what else you want to do with your URLConnection, but this may be of some help.
import java.io.*;
import java.net.*;

/**
 * A tool for downloading and reading the source code of HTML pages.
 * Prevents repeated downloading of pages by storing each page in a cache.
 * When it receives a page request, it first looks in its cache.
 * If it does not have the page cached, it will download it.
 *
 * Pages are stored as <cachedir>/<hashcode>.cache
 *
 * @author Mike Turley
 */
public class WebCache {
    private final File cachedir;
    private boolean enabled;

    /**
     * Create a web cache in the given directory.
     */
    public WebCache(File cachedir, boolean enabled) {
        this.cachedir = cachedir;
        this.enabled = enabled;
    }

    public WebCache(String cachedir, boolean enabled) {
        this(new File(cachedir), enabled);
    }

    public WebCache(File cachedir) {
        this(cachedir, true);
    }

    public WebCache(String cachedir) {
        this(new File(cachedir), true);
    }
    /**
     * Get the content for the given URL.
     * First check the cache, then fall back to the network.
     */
    public String getPage(String url) {
        try {
            if (enabled && cacheFile(url).exists()) return loadCachedPage(url);
            return downloadPage(url);
        } catch (Exception e) {
            System.err.println("Problem getting page at " + url);
            e.printStackTrace();
            return null;
        }
    }

    /**
     * Resolve the cache file for a URL. The File(parent, child)
     * constructor supplies the separator that the old string
     * concatenation was missing.
     */
    private File cacheFile(String url) {
        return new File(cachedir, url.hashCode() + ".cache");
    }
    /**
     * Delete every cached page, then the cache directory itself.
     */
    public void clear() {
        try {
            File[] cachefiles = cachedir.listFiles();  // null if the directory doesn't exist
            if (cachefiles != null) {
                for (File cachefile : cachefiles) cachefile.delete();
            }
            cachedir.delete();
        } catch (Exception e) {
            System.err.println("Problem clearing the cache!");
            e.printStackTrace();
        }
    }
    /**
     * Download the page at the given URL, write it to the cache,
     * and return its content.
     */
    public String downloadPage(String url) {
        try {
            URLConnection urlc = new URL(url).openConnection();
            urlc.setDoInput(true);
            urlc.setDoOutput(false);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlc.getInputStream()));
            if (!cachedir.exists()) cachedir.mkdirs();
            PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(cacheFile(url))));
            StringBuilder sb = new StringBuilder();
            String inputline;
            while ((inputline = in.readLine()) != null) {
                out.println(inputline);
                sb.append(inputline).append('\n');  // keep the line breaks in the returned string
            }
            in.close();
            out.close();
            return sb.toString();
        } catch (Exception e) {
            System.err.println("Problem connecting to URL " + url);
            e.printStackTrace();
            return null;
        }
    }
    /**
     * Read a previously cached page from disk.
     */
    public String loadCachedPage(String url) {
        try {
            BufferedReader in = new BufferedReader(new FileReader(cacheFile(url)));
            StringBuilder sb = new StringBuilder();
            String inputline;
            // readLine() returning null is the reliable end-of-stream test;
            // ready() only reports whether the next read would block.
            while ((inputline = in.readLine()) != null) {
                sb.append(inputline).append('\n');
            }
            in.close();
            return sb.toString();
        } catch (Exception e) {
            System.err.println("Problem loading cached page " + url);
            e.printStackTrace();
            return null;
        }
    }
    public void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }
}
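Usage would look something like this (untested, like the rest; the directory name and URL are just placeholders):

WebCache cache = new WebCache("cache");               // pages land in ./cache
String html = cache.getPage("http://example.com/");   // downloads and caches
String again = cache.getPage("http://example.com/");  // served from the cache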
Mike Turley
2010-04-11 22:27:16
I just want to point out that this implementation does not take into account that the HTTP spec explains in much detail when and how a resource may be cached and when it must not be cached. You're missing the handling of status codes (e.g. 3xx redirection, 5xx server errors), validation (ETag, Last-Modified), non-idempotent methods (POST/PUT/DELETE), and so on. You will run into a big pile of problems... Before using a crippled caching layer like this, I'd suggest a standalone proxy setup with Squid, nginx, or the like.
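For instance, even just honoring the validators means sending a conditional request. A minimal sketch with HttpURLConnection (the stored etag and lastModified values are assumed to come from your cache; everything else is illustrative):

HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
// validators saved from the previous response (assumed to live in your cache)
if (etag != null) conn.setRequestProperty("If-None-Match", etag);
if (lastModified != 0L) conn.setIfModifiedSince(lastModified);
if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
    // 304: the cached copy is still valid, serve it from disk
} else {
    // 200 (or a 3xx/5xx you still have to handle): read the body and
    // remember the new validators for next time
    String newEtag = conn.getHeaderField("ETag");
    long newLastModified = conn.getLastModified();
}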
ordnungswidrig
2010-05-17 08:22:58