ansaurus

Question

How do you determine if a file is html from the URL?

Answer 1

+19 A:

You can't. But you can ask the server for the headers and check whether the Content-Type is text/html.

Ben Hughes 2009-06-30 19:22:48

Answer 2

+7 A:

You cannot. There is nothing wrong with serving up HTML files from URLs that end in .jpeg, .gif, or even .mp3. The only way to know is to fetch the URL and check whether the Content-Type header is text/html (and even that isn't 100% accurate, because of poorly configured web servers).

caskey 2009-06-30 19:23:18

Answer 3

+2 A:

Put simply: you can't.

There are REST-style URLs like

http://yourserver.com/service/givemehtml/

which serve you html.

jitter 2009-06-30 19:23:32

Answer 4

+1 A:

HTML stands for HyperText Markup Language; it is a standard. A URL ending in *.html usually refers to a static HTML page, while others such as *.jsp, *.php, and *.asp generate HTML dynamically. So you cannot find out from the URL. You can try looking at the Content-Type, but that way you will still miss some pages.

Artem Barger 2009-06-30 19:28:00

Answer 5

+4 A:

Fundamentally, a URL is merely an address. There are plenty of useful, meaningful conventions that you can use to guess what it might point to, but when it comes down to it, a web server is free to return any type of thing it wants for a given URL. Not even querying the server, asking for what comes back, and examining it is a 100% surefire way of knowing what sort of file it is. The server could easily change what sort of file it returns based on the request, or the time of day, or the whims of its owner.

There are some good basic guidelines that will work most of the time, but I hesitate to even mention them because they're absolutely not reliable.

There is some good news, though. If you actually request the data from the server, it will, just as some other answers point out, tell you precisely what sort of thing it is providing you with (for this particular exchange). It'll give you a MIME-Type in the field named "Content-Type". If it's text/html, then you have yourself an html document (not an image, not an xhtml document, HTML).
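In practice the Content-Type value often carries parameters, for example "text/html; charset=UTF-8", and media type names are case-insensitive, so a plain string comparison can miss real HTML responses. Here is a minimal sketch of a more tolerant check; the class and method names (ContentTypeCheck, isHtml) are made up for illustration, and the header value is assumed to have been read from the response already.

```java
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ContentTypeCheck {
    private ContentTypeCheck() {}

    /** Returns true when the media type of a Content-Type header value is text/html. */
    static boolean isHtml(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8".
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        // Media type names are case-insensitive.
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }
}
```

With this sketch, "Text/HTML; charset=UTF-8" counts as HTML, while "application/xhtml+xml", "image/jpeg", and a missing header do not.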

CaptainAwesomePants 2009-06-30 19:31:00

It's still pretty easy for web servers to mess with the Content-Type header as well. I'm sure there are many PHP scripts that generate a CSV file but have a Content-Type of "text/html".
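When the header cannot be trusted, one rough cross-check is to look at the first bytes of the body itself. The sketch below is only a heuristic, under the assumption that the caller has already downloaded the start of the response; the names HtmlSniffer and looksLikeHtml are hypothetical, and a document that opens with a comment or omits the html tag will be missed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical heuristic, not a validator: it only inspects how the body begins.
final class HtmlSniffer {
    private HtmlSniffer() {}

    /** Returns true when the body prefix starts with an HTML doctype or an html tag. */
    static boolean looksLikeHtml(byte[] bodyPrefix) {
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String start = new String(bodyPrefix, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        // Skip a UTF-8 byte order mark (bytes EF BB BF) if present.
        if (start.startsWith("\u00ef\u00bb\u00bf")) {
            start = start.substring(3);
        }
        start = start.trim(); // ignore leading whitespace and newlines
        return start.startsWith("<!doctype html") || start.startsWith("<html");
    }
}
```

A CSV body such as "id,name" fails this check even if the server labels it text/html, which is the mismatch described above.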

too much php 2009-07-01 06:31:09

This comment + your user name makes me laugh.

CaptainAwesomePants 2009-07-01 18:03:41

Answer 6

+4 A:

You cannot tell from the URL alone. Consider the following URLs:

http://host1/index.html
http://host2/index.php
http://host3/index.asp
http://host4/index.jsp
http://host5/index.aspx
Or the URL of this question - http://stackoverflow.com/questions/1065503/how-do-you-determine-if-a-file-is-html-from-the-url

All of them return HTML content. The only sure way is to ask the server for the resource and check the Content-Type header. It is better to send a HEAD request to the server instead of GET or POST: it returns just the headers, without the content.

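  import java.net.HttpURLConnection;
  import java.net.URL;

  URL url = ...
  HttpURLConnection urlc = (HttpURLConnection)url.openConnection();
  urlc.setAllowUserInteraction( false );
  urlc.setDoInput( true );
  urlc.setDoOutput( false );
  urlc.setUseCaches( true );
  urlc.setRequestMethod("HEAD"); // ask for the headers only, no body
  urlc.connect();
  // May be null if the header is missing, and may carry
  // parameters, e.g. "text/html; charset=UTF-8".
  String mime = urlc.getContentType();
  if(mime != null && mime.toLowerCase().startsWith("text/html")) {
    // do your stuff
  }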

David Rabinowitz 2009-06-30 19:37:27

The content type may be set incorrectly; it happens. And even if it is correct, the content may not be valid HTML. Depending on what you want to do with it, you should validate it against its schema definition.

nojevive 2009-06-30 19:57:08

You are correct, but for that you need the content itself, which was not part of the question. Also, an incorrect content type means the page would not be displayed correctly in a browser, so it is certainly not the common case.

David Rabinowitz 2009-07-01 14:03:43

Answer 7

A:

You can't. Some URLs end with an .html extension but do not actually point to HTML files. For example, in Spring actions I normally use the .html extension, so the URL looks like an HTML file, but it is not. So in practice you can't determine it.

Silent Warrior 2009-07-01 06:00:31
