I wrote this method to download a webpage from a given URL. It is designed to download HTML only. If I want to add error checking and restrict it to HTML only, how should I do this?

public static String download(URL url) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
    StringBuilder page = new StringBuilder();
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            page.append(line).append('\n'); // readLine() strips the line terminator
        }
    } finally {
        reader.close(); // don't leak the underlying stream
    }
    return page.toString();
}

Originally I was planning on doing this:

String file = url.getFile();
// compare the text after the last '.' against "HTML"
if (file.substring(file.lastIndexOf('.') + 1).equalsIgnoreCase("HTML")) {
    // do method
}
However, the URL http://www.smu.com returns "" for url.getFile(). Does anyone have any suggestions?

+2  A: 

To test whether you're getting HTML, you can use URL.openConnection() to get a URLConnection, then call getContentType(), which should return "text/html" for an HTML page. You can then use the getInputStream() method on the URLConnection as a drop-in replacement for url.openStream().
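A minimal sketch of that approach (the class and method names are mine, and the startsWith check is a simplification, since the header often carries a charset parameter such as "text/html; charset=UTF-8"):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class HtmlDownloader {

    // Downloads the page only if the server reports an HTML content type.
    public static String downloadHtml(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        String contentType = conn.getContentType();
        if (contentType == null || !contentType.startsWith("text/html")) {
            throw new IOException("Not an HTML page: " + contentType);
        }
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder page = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return page.toString();
    }
}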

If you actually want to validate that the content the server is sending you is HTML, you'd need to find an HTML validation library. I don't know of one off-hand, sorry.

Something to consider, which may be why www.smu.com returns no data, is that a number of websites will serve different data depending on the User-Agent string sent on the HTTP connection. You may need to set that on your URLConnection with addRequestProperty("User-Agent", ...); see more info here: http://stackoverflow.com/questions/2529682/setting-user-agent-of-a-java-urlconnection
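For example (the User-Agent value below is just an illustrative browser string, not one the answer prescribes):

URLConnection conn = url.openConnection();
// Some servers vary their response by User-Agent; identify as a browser.
conn.addRequestProperty("User-Agent", "Mozilla/5.0");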

kkress
smu.com does return data; the URL just has no file part (after the slash)
Bart van Heukelom
@Bart The request tries to fetch "/" which the web-server happily returns "/index.html" for (this is dependent upon the web-server and configuration of such -- it might just have happily returned a 404, but that isn't friendly for web-users). It doesn't explain "no data", but it does explain why you don't need "the full path". See my answer for more (although accept this answer already).
pst
@pst: I know, it's what I said but you said it much better
Bart van Heukelom
+2  A: 

"http://www.smu.com" sends you the data in "http://www.smu.com/index.html". This is the (common) behavior of web-servers when "/" is requested (a web-server could also theoretically redirect one with a 302 or whatnot). Checking to see if the URL ends in ".html" is thus entirely silly (not to mention that it could be a ".php", ".asp" or whatever).

However, a nice web-server serving up HTML should return a Content-Type header of "text/html". (This is of course assuming it's returning HTML and not XHTML or XML or whatnot, and that the web-server is not broken.)

You will likely want to use URLConnection; an example of reading response headers with it is below.
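A minimal sketch of dumping the response headers that way (the class name and URL choice are mine):

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class HeaderDump {
    public static void main(String[] args) throws IOException {
        URLConnection conn = new URL("http://www.smu.com").openConnection();
        // getHeaderFields() returns every response header, including the
        // Content-Type line discussed above (the status line has a null key).
        for (Map.Entry<String, List<String>> e : conn.getHeaderFields().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}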

How did I determine the top bit?

I ran curl -I http://www.smu.com (and with ../index.html) and compared the results. They look like:

HTTP/1.1 200 OK
Date: Tue, 19 Oct 2010 18:01:39 GMT
Server: Apache
Last-Modified: Wed, 27 Jan 2010 20:27:52 GMT
Accept-Ranges: bytes
Content-Length: 2993
Content-Type: text/html
pst
+3  A: 

If you want to validate the content beyond checking the Content-Type header, you can use an HTML parser such as (the misleadingly named!) JTidy.
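A rough sketch of how that check might look (the helper name is mine, and I'm assuming JTidy's usual parseDOM(InputStream, OutputStream) idiom with a null output stream):

import java.io.InputStream;
import java.net.URL;
import org.w3c.tidy.Tidy;

public class HtmlValidator {
    // Returns true if JTidy parses the page without reporting errors.
    public static boolean looksLikeHtml(URL url) throws Exception {
        InputStream in = url.openStream();
        try {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);          // suppress progress messages
            tidy.setShowWarnings(false);  // count only real parse errors
            tidy.parseDOM(in, null);      // we only need the error count
            return tidy.getParseErrors() == 0;
        } finally {
            in.close();
        }
    }
}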

Brian Agnew
+1 For pointing out a secondary validation method.
pst