I would like to check whether a remote website contains certain files, e.g. robots.txt or favicon.ico. Of course, the files should be accessible (readable).

So if the website is http://www.example.com/, I would like to check whether http://www.example.com/robots.txt exists.

I tried fetching a URL like http://www.example.com/robots.txt. Sometimes you can tell whether the file is there, because you get a page-not-found error (404) in the headers when it is missing.

But some websites handle this error themselves, and all you get is some HTML saying that the page cannot be found.

You get headers with status code 200.

So does anybody have an idea how to check whether the file really exists?

Thanks, Granit

+1  A: 

If they serve an error page with HTTP 200, I doubt you have a reliable way of detecting this. Needless to say, it's extremely stupid to serve error pages that way ...

You could try:

  1. Issuing a HEAD request, which returns only the headers for the requested resource. Maybe you get more reliable status codes that way.
  2. Checking the Content-Type header. If it's text/html, you can assume that it's a custom error page instead of a robots.txt (which should be served as text/plain). The same goes for favicons, which should be served as an image type. But I think simply checking for text/html would be the most reliable way here.
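The two suggestions above can be combined into one check. Here is a minimal sketch in Python using only the standard library; the helper names `media_type`, `plausibly_exists`, and `head` are made up for illustration, and the result is only a heuristic, since a server is free to send any status and Content-Type it likes.

```python
import urllib.error
import urllib.request


def media_type(content_type_header):
    """Return the bare media type, e.g. 'text/plain' from 'text/plain; charset=utf-8'."""
    return (content_type_header or "").split(";")[0].strip().lower()


def plausibly_exists(status, content_type_header, expected_types):
    """Heuristic: the status is 200 and the media type is one we expect for this file."""
    return status == 200 and media_type(content_type_header) in expected_types


def head(url, timeout=10):
    """Send a HEAD request and return (status, Content-Type header or None)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as error:
        # urllib raises 4xx/5xx responses as exceptions; they still carry headers.
        return error.code, error.headers.get("Content-Type")
```

For the question's example you would call something like `plausibly_exists(*head("http://www.example.com/robots.txt"), {"text/plain"})`. For favicon.ico the expected set would be image types such as `image/x-icon` or `image/vnd.microsoft.icon`.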
Joey
I agree. But is there a way to detect whether the file is really there?
Granit
Content type is a very good idea! Thanks.
Granit
+1  A: 

Well, if the website gives you an error page with a success status code, there is not much you can do about it.

Naturally, if you're just after robots.txt, favicon.ico, or something else very specific, you can simply check whether the response is in the correct format: robots.txt should be text/plain and contain only what a robots.txt file is allowed to contain, and favicon.ico should be an image file.
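For robots.txt, a rough format check could look like the Python sketch below. It assumes that every line of a real robots.txt is blank, a `#` comment, or a `field: value` pair, and that at least one such pair is present; the function name `looks_like_robots_txt` is hypothetical. Note that this treats an empty file as "not a robots.txt", which you may want to relax.

```python
import re

# A line is acceptable if it is blank, a comment, or "field: value".
_LINE = re.compile(r"^\s*(#.*|[A-Za-z][A-Za-z-]*\s*:.*)?$")


def looks_like_robots_txt(body):
    """Heuristic check that a response body is shaped like a robots.txt file."""
    lines = body.splitlines()
    # Require at least one non-comment line containing a "field: value" separator.
    has_directive = any(
        ":" in line and not line.lstrip().startswith("#") for line in lines
    )
    return has_directive and all(_LINE.match(line) for line in lines)
```

A custom error page fails this check on its first HTML tag, even when it was served with status 200.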

kkyy
A: 

The Content-Type header for a .txt file should be text/plain, so if you receive text/html, it's not a plain text file.

To check that an image really is an image, look at the Content-Type, which will usually be image/png or image/gif (or image/x-icon for a .ico favicon). There is also the possibility of using PHP's GD library to check whether it is in fact an image.
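If you would rather not rely on the header alone, you can also look at the first bytes of the downloaded file. The sketch below is in Python rather than PHP and does not use GD; it only compares the well-known file signatures of PNG, GIF, and ICO, so it is a much weaker test than actually decoding the image. The names `SIGNATURES` and `sniff_image_type` are invented for this example.

```python
# Leading "magic bytes" of a few common image formats.
SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
    "ico": (b"\x00\x00\x01\x00",),
}


def sniff_image_type(data):
    """Return 'png', 'gif' or 'ico' if the bytes start with a known signature, else None."""
    for name, prefixes in SIGNATURES.items():
        if data.startswith(prefixes):
            return name
    return None
```

An HTML error page served at /favicon.ico starts with text such as `<!DOCTYPE`, so it matches none of the signatures and the function returns `None`.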

Blekk