Given a URL, how can you tell if the referenced file is an HTML file?

Obviously, it's an HTML file if it ends in .html or /, but then there are .jsp files, too, so I'm wondering what other extensions may be out there for HTML.

Alternatively, if this information can be easily gained from a URL object in Java, that would be sufficient for my purposes.

+19  A: 

You can't. But you can ask the server for the headers and check whether the Content-Type is text/html.

Ben Hughes
+7  A: 

You cannot. There is nothing wrong with serving HTML files from URLs that end in .jpeg, .gif, or even .mp3. The only way to know is to fetch the URL and check whether the Content-Type header is text/html (and even that isn't 100% accurate, because of poorly configured web servers).
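
For the misconfigured-server case, one possible fallback is to peek at the first bytes of the body once you have fetched it. This is only a rough heuristic sketch, not a reliable detector: the class and method names (`HtmlSniffer`, `looksLikeHtml`) are made up for illustration, it ignores byte-order marks and leading comments, and plenty of real HTML pages start with neither a doctype nor an `<html>` tag.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class HtmlSniffer {
    // Heuristic only: checks whether the start of the body looks like an
    // HTML doctype or an opening <html> tag. Does not handle BOMs or comments.
    static boolean looksLikeHtml(byte[] bodyPrefix) {
        // ISO-8859-1 maps every byte to a char, so decoding never fails.
        String start = new String(bodyPrefix, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        return start.startsWith("<!doctype html") || start.startsWith("<html");
    }
}
```

A caller would pass in the first few hundred bytes read from the response stream and treat a `false` result as "unknown" rather than "definitely not HTML".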

caskey
+2  A: 

Put simply: you can't.

There are REST-style URLs like

http://yourserver.com/service/givemehtml/

which serve you HTML.

jitter
+1  A: 

HTML stands for HyperText Markup Language; it is a standard. A URL ending in *.html usually means a static HTML page, while *.jsp, *.php, *.asp and so on generate HTML dynamically. So you cannot tell from the URL. You can try looking at the Content-Type, but that way you will still miss some pages.

Artem Barger
+4  A: 

Fundamentally, a URL is merely an address. There are plenty of useful, meaningful conventions that you can use to guess what it might point to, but when it comes down to it, a web server is free to return any type of thing it wants for a given URL. Not even querying the server, asking for what comes back, and examining it is a 100% surefire way of knowing what sort of file it is. The server could easily change what sort of file it returns based on the request, or the time of day, or the whims of its owner.

There are some good basic guidelines that will work most of the time, but I hesitate to even mention them because they're absolutely not reliable.

There is some good news, though. If you actually request the data from the server, it will, just as some other answers point out, tell you precisely what sort of thing it is providing you with (for this particular exchange). It'll give you a MIME-Type in the field named "Content-Type". If it's text/html, then you have yourself an html document (not an image, not an xhtml document, HTML).
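
As a rough sketch of that check in Java: the header value can carry parameters such as a charset (for example `text/html; charset=UTF-8`), so it helps to strip those before comparing. The class and method names here (`ContentTypes`, `mediaType`, `isHtml`) are made up for illustration.

```java
import java.util.Locale;

class ContentTypes {
    // Returns the bare media type in lowercase, without parameters such as
    // "; charset=UTF-8". Returns an empty string when the header is missing.
    static String mediaType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return "";
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String type = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // True only for text/html; XHTML (application/xhtml+xml) is not counted.
    static boolean isHtml(String contentTypeHeader) {
        return "text/html".equals(mediaType(contentTypeHeader));
    }
}
```

The string passed in would be whatever the server sent in its Content-Type field, which may be absent altogether, hence the null check.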

CaptainAwesomePants
It's still pretty easy for web servers to mess with the Content-Type header as well. I'm sure there are many PHP scripts that generate a CSV file but send a Content-Type of "text/html".
too much php
This comment + your user name makes me laugh.
CaptainAwesomePants
+4  A: 

You cannot tell from the URL alone. Think of the following URLs:

All of them return HTML content. The only sure way is to ask the server for the resource and check the Content-Type header. It is better to send a HEAD request to the server instead of GET or POST: it gives you just the headers, without the content.

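  import java.net.HttpURLConnection;
  import java.net.URL;

  URL url = ...
  HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
  urlc.setAllowUserInteraction(false);
  urlc.setDoInput(true);
  urlc.setDoOutput(false);
  urlc.setUseCaches(true);
  urlc.setRequestMethod("HEAD"); // ask for the headers only, not the body
  urlc.connect();
  // May be null if the server sent no Content-Type, and may carry
  // parameters, e.g. "text/html; charset=UTF-8".
  String mime = urlc.getContentType();
  if (mime != null && mime.toLowerCase().startsWith("text/html")) {
    // do your stuff
  }
  urlc.disconnect();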
David Rabinowitz
It may be that the content type is set incorrectly. It happens. And even if it is correct, the content may not be valid HTML. Depending on what you want to do with it, you should validate it against its schema definition.
nojevive
You are correct, but for that you need the content itself, which was not part of the question. Also, a wrong content type means the page would not be displayed correctly in a browser, so it is certainly not the common case.
David Rabinowitz
A: 

You can't. Sometimes a URL ends with the .html extension but does not actually point to an HTML file. For example, I normally map Spring actions to the .html extension, so the URL looks like an HTML file but isn't one. So in practice you can't determine it.

Silent Warrior