I wrote this method to download a webpage from a given URL. It is designed to download HTML only. If I want to add error checking and restrict it to HTML only, how should I do this?

public static String download(URL url) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
    StringBuilder page = new StringBuilder();
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            page.append(line).append('\n'); // readLine() strips the line terminator
        }
    } finally {
        reader.close(); // don't leak the underlying stream
    }
    return page.toString();
}

Originally I was planning on doing this:

String file = url.getFile();
// compare the text after the last '.' against "HTML"
if (file.substring(file.lastIndexOf('.') + 1).equalsIgnoreCase("HTML")) {
    // do method
}
However, the URL http://www.smu.com returns "" for url.getFile(). Does anyone have any suggestions?

+2  A: 

To test whether you're getting HTML, you can use URL.openConnection() to get a URLConnection, then call getContentType(), which should return "text/html" for an HTML page. You can then use the getInputStream() method on the URLConnection as a drop-in replacement for url.openStream().
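A minimal sketch of that approach (the class and method names are mine, and the startsWith check is a simplification, since the header often carries a charset parameter such as "text/html; charset=UTF-8"):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class HtmlDownloader {

    // Downloads the page only if the server reports an HTML content type.
    public static String downloadHtml(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        String contentType = conn.getContentType();
        if (contentType == null || !contentType.startsWith("text/html")) {
            throw new IOException("Not an HTML page: " + contentType);
        }
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder page = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return page.toString();
    }
}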

If you actually want to validate that the content the server is sending you is HTML, you'd need to find an HTML validation library. I don't know of one off-hand, sorry.

Something to consider, which may be why www.smu.com returns no data, is that a number of websites will serve different data depending on the User-Agent string sent on the HTTP connection. You may need to set that on your URLConnection with addRequestProperty("User-Agent", ...); see more info here: http://stackoverflow.com/questions/2529682/setting-user-agent-of-a-java-urlconnection
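For example (the User-Agent value below is just an illustrative browser string, not one the answer prescribes):

URLConnection conn = url.openConnection();
// Some servers vary their response by User-Agent; identify as a browser.
conn.addRequestProperty("User-Agent", "Mozilla/5.0");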

kkress
smu.com does return data; the URL just has no file part (after the slash)
Bart van Heukelom
@Bart The request tries to fetch "/" which the web-server happily returns "/index.html" for (this is dependent upon the web-server and configuration of such -- it might just have happily returned a 404, but that isn't friendly for web-users). It doesn't explain "no data", but it does explain why you don't need "the full path". See my answer for more (although accept this answer already).
pst
@pst: I know, it's what I said but you said it much better
Bart van Heukelom
+2  A: 

"http://www.smu.com" sends you the data in "http://www.smu.com/index.html". This is the (common) behavior of web-servers when "/" is requested (a web-server could also theoretically redirect one with a 302 or whatnot). Checking to see if the URL ends in ".html" is thus entirely silly (not to mention that it could be a ".php", ".asp" or whatever).

However, a nice web-server serving up HTML should return a Content-Type header of "text/html". (This is of course assuming it's returning HTML and not XHTML or XML or whatnot, and that the web-server is not broken.)

You will likely want to use URLConnection; an example of reading response headers with it is below.
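A minimal sketch of dumping the response headers that way (the class name and URL choice are mine):

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class HeaderDump {
    public static void main(String[] args) throws IOException {
        URLConnection conn = new URL("http://www.smu.com").openConnection();
        // getHeaderFields() returns every response header, including the
        // Content-Type line discussed above (the status line has a null key).
        for (Map.Entry<String, List<String>> e : conn.getHeaderFields().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}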

How did I determine the top bit?

I ran curl -I http://www.smu.com (and with ../index.html) and compared the results. They look like:

HTTP/1.1 200 OK
Date: Tue, 19 Oct 2010 18:01:39 GMT
Server: Apache
Last-Modified: Wed, 27 Jan 2010 20:27:52 GMT
Accept-Ranges: bytes
Content-Length: 2993
Content-Type: text/html
pst
+3  A: 

If you want to validate the content beyond checking the Content-Type header, you can use an HTML parser such as (the misleadingly named!) JTidy.
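A rough sketch of how that check might look (the helper name is mine, and I'm assuming JTidy's usual parseDOM(InputStream, OutputStream) idiom with a null output stream):

import java.io.InputStream;
import java.net.URL;
import org.w3c.tidy.Tidy;

public class HtmlValidator {
    // Returns true if JTidy parses the page without reporting errors.
    public static boolean looksLikeHtml(URL url) throws Exception {
        InputStream in = url.openStream();
        try {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);          // suppress progress messages
            tidy.setShowWarnings(false);  // count only real parse errors
            tidy.parseDOM(in, null);      // we only need the error count
            return tidy.getParseErrors() == 0;
        } finally {
            in.close();
        }
    }
}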

Brian Agnew
+1 For pointing out a secondary validation method.
pst