tags:

views: 50

answers: 2

I am using the URLConnection class. I want to be able to grab a stream to a given URL even if that URL is unavailable (i.e. cache the last known copy of the content at a URL to some local file-system directory). I have written this code a few times myself, never happy with it, and was wondering if there is something better out there that can do this.
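
The pattern I keep rewriting is roughly this (a simplified sketch; the class and file names here are placeholders):

import java.io.*;
import java.net.*;

public class CachingFetcher {
    // Try the live URL first and refresh the local copy; if the
    // connection fails, fall back to the last cached copy instead.
    public static InputStream openWithFallback(String url, File cachefile) throws IOException {
        try {
            URLConnection conn = new URL(url).openConnection();
            InputStream in = conn.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) buf.write(chunk, 0, n);
            in.close();
            OutputStream out = new FileOutputStream(cachefile);
            out.write(buf.toByteArray());          // refresh the cached copy
            out.close();
            return new ByteArrayInputStream(buf.toByteArray());
        } catch (IOException e) {
            if (cachefile.exists()) return new FileInputStream(cachefile); // serve the stale copy
            throw e;                               // nothing cached to fall back on
        }
    }
}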

A: 

I just ran into this same problem and threw together my own WebCache class. I haven't tested it yet, but you can give it a try if you want. Construct it with the directory where you want to cache the pages, then call getPage(String url) to grab a page. getPage checks the cache directory first; if the page isn't there, it downloads it into the cache and returns the result. The cache filenames are url.hashCode() + ".cache".

This is just for getting the source of a page. I'm not sure what else you want to do with your URLConnection, but this may be of some help.

import java.io.*;
import java.net.*;

/**
 * A tool for downloading and reading the source code of HTML pages.
 * Prevents repeated downloading of pages by storing each page in a cache.
 * When it receives a page request, it first looks in its cache.
 * If it does not have the page cached, it will download it.
 *
 * Pages are stored as <cachedir>/<hashcode>.cache
 *
 * @author Mike Turley
 */
public class WebCache {
    File cachedir;
    boolean enabled;

    /**
     * Create a web cache in the given directory.
     */
    public WebCache(File cachedir, boolean enabled) {
        this.cachedir = cachedir;
        this.enabled = enabled; 
    }
    public WebCache(String cachedir, boolean enabled) {
        this.cachedir = new File(cachedir);
        this.enabled = enabled;
    }
    public WebCache(File cachedir) {
        this.cachedir = cachedir;
        this.enabled = true;
    }
    public WebCache(String cachedir) {
        this.cachedir = new File(cachedir);
        this.enabled = true;
    }

    /**
     * Get the content for the given URL.
     * First check the cache, then check the internet.
     */
    public String getPage(String url) {
        try {
            if(enabled) {
                File cachefile = new File(cachedir, url.hashCode() + ".cache");
                if(cachefile.exists()) return loadCachedPage(url);
            }
            return downloadPage(url);
        } catch(Exception e) {
            System.err.println("Problem getting page at " + url);
            e.printStackTrace();
            return null;
        }
    }

    public void clear() {
        try {
            File[] cachefiles = cachedir.listFiles();
            if(cachefiles != null) {
                for(File f : cachefiles) f.delete();
            }
            cachedir.delete();
        } catch(Exception e) {
            System.err.println("Problem clearing the cache!");
            e.printStackTrace();
        }
    }

    public String downloadPage(String url) {
        try {
            URL weburl = new URL(url);
            URLConnection urlc = weburl.openConnection();
            urlc.setDoInput(true);
            urlc.setDoOutput(false);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlc.getInputStream()));
            if(!cachedir.exists()) cachedir.mkdirs();
            File outfile = new File(cachedir, url.hashCode() + ".cache");
            PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(outfile)));
            StringBuilder sb = new StringBuilder();
            String inputline;
            while ((inputline = in.readLine()) != null) {
                out.println(inputline);
                sb.append(inputline).append('\n');   // keep line breaks in the returned source
            }
            in.close();
            out.close();
            return sb.toString();
        } catch(Exception e) {
            System.err.println("Problem connecting to URL " + url);
            e.printStackTrace();
            return null;
        }
    }

    public String loadCachedPage(String url) {
        try {
            File infile = new File(cachedir, url.hashCode() + ".cache");
            BufferedReader in = new BufferedReader(new FileReader(infile));
            StringBuilder sb = new StringBuilder();
            String inputline;
            while ((inputline = in.readLine()) != null) sb.append(inputline).append('\n');
            in.close();
            return sb.toString();
        } catch(Exception e) {
            System.err.println("Problem loading cached page " + url);
            e.printStackTrace();
            return null;
        }
    }

    public void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }
}
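
A minimal usage sketch (the cache directory and URL are just examples):

public class WebCacheDemo {
    public static void main(String[] args) {
        WebCache cache = new WebCache("/tmp/webcache");      // example cache directory
        String html = cache.getPage("http://example.com/");  // fetched and cached on first call
        if (html != null) System.out.println(html);
    }
}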
Mike Turley
I just want to point out that this implementation does not take into account that the HTTP spec explains in much detail when and how a resource may be cached and when it must not be. You're missing the handling of status codes (e.g. 3xx redirection, 5xx server errors), validation (ETag, Last-Modified), non-idempotent methods (POST/PUT/DELETE), and so on. You will run into a big pile of problems. Before using a crippled caching layer like this, I'd suggest a standalone proxy setup with Squid, nginx, or the like.
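
For reference, revalidation against those headers looks roughly like this (a sketch; the stored ETag and Last-Modified values are assumed to come from an earlier response):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {
    // Send the cached response's validators; a 304 reply means the
    // cached copy may still be served without re-downloading the body.
    public static boolean cacheStillValid(String url, String cachedEtag,
                                          long cachedLastModified) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (cachedEtag != null) conn.setRequestProperty("If-None-Match", cachedEtag);
        if (cachedLastModified > 0) conn.setIfModifiedSince(cachedLastModified);
        return conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}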
ordnungswidrig
A: 

Don't do that. Deploy a caching HTTP proxy, such as Squid.
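
With a proxy in place, the standard http.proxyHost/http.proxyPort system properties route existing URLConnection code through it unchanged (host and port below are placeholders; 3128 is Squid's default port):

import java.net.URL;

public class ProxyConfig {
    public static void main(String[] args) throws Exception {
        System.setProperty("http.proxyHost", "localhost"); // placeholder proxy host
        System.setProperty("http.proxyPort", "3128");      // Squid's default port
        // Subsequent URLConnection traffic goes through the caching proxy.
        new URL("http://example.com/").openConnection().getInputStream().close();
    }
}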

EJP