



I need to be able to tell whether a link (URL) points to an XML file (RSS feed) or a regular HTML file just by looking at the headers, or something similar (without downloading it).

Any good advice for me there ? :)

Thanks! Roey


Just read it in a "text" reader, then decide which it is by, for example, looking for some tags ;) Then chuck it in your actual reader.

or is that too simple?

He specifically said he wanted to know before downloading the whole file.
Matthew Flaschen
+2  A: 

You can use the Content-Type header, and to save bandwidth you can ask the web server to serve you only a specified part of a document. If the server includes an Accept-Ranges: bytes header in its response, you can use Range: bytes=0-10 to download only the first eleven bytes (the range is inclusive), or even try not to download anything.

Also look into the HEAD verb instead of GET.
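A rough sketch of the Range idea in C#, assuming the classic System.Net.HttpWebRequest API. The class name RangePeek, the helper CreatePeekRequest and the 512-byte default are made up for illustration; the helper only builds the request and does not send it.

```csharp
using System;
using System.Net;

public static class RangePeek
{
    // Hypothetical helper: builds (but does not send) a GET request that asks
    // the server for only the first `byteCount` bytes of the document.
    public static HttpWebRequest CreatePeekRequest(string url, int byteCount = 512)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        // Byte ranges are inclusive, so 0..byteCount-1 is exactly byteCount bytes.
        request.AddRange(0, byteCount - 1);
        return request;
    }
}
```

A server that honours the range should answer 206 Partial Content; one that ignores it will typically answer 200 with the full body, so the caller should still stop reading early.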

Anton Gogolev
+1 I was about to suggest content type ;-)

You cannot find out what file type it is just from looking at the URL.

I suggest you try to check the MIME-type of the document you request, or read the first line and hope the author has put in a Doctype.
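If you do end up reading the first line, the check might look something like this sketch. MarkupSniffer and Sniff are hypothetical names, and the list of prefixes is an assumption about what typical feeds and pages start with, not an exhaustive rule.

```csharp
using System;

public static class MarkupSniffer
{
    // Hypothetical helper: guesses the document kind from its first few
    // characters. Returns "xml", "html", or "unknown".
    public static string Sniff(string prefix)
    {
        // Drop a byte-order mark and leading whitespace before looking at the markup.
        string head = prefix.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');

        if (head.StartsWith("<?xml", StringComparison.OrdinalIgnoreCase) ||
            head.StartsWith("<rss", StringComparison.OrdinalIgnoreCase) ||
            head.StartsWith("<feed", StringComparison.OrdinalIgnoreCase))
        {
            return "xml";
        }

        if (head.StartsWith("<!DOCTYPE html", StringComparison.OrdinalIgnoreCase) ||
            head.StartsWith("<html", StringComparison.OrdinalIgnoreCase))
        {
            return "html";
        }

        return "unknown";
    }
}
```

Note that an XHTML page may also begin with an XML declaration, so an "xml" result here does not guarantee a feed.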

Arve Systad
+11  A: 

You could just do a HEAD request instead of a full POST/GET

That will get you the headers for that page, which should include the content type. From that you should be able to distinguish whether it's text/html or XML.

There's a good example here on SO

Eoin Campbell
+1 the perfect answer and the exact reason for the existence of the HEAD request
Nick Allen - Tungle139
Just a minor reminder some servers do not support HEAD, so do not forget to fall back to GET/POST when it fails.
dr. evil
I count one "could" and two "should". ;]
Eoin Campbell
+5  A: 

Following up on Eoin Campbell's response, here's a code snippet that should do exactly that using the System.Net functionality:

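// url is a placeholder for the address to check (the argument is missing from the original post)
var request = System.Net.WebRequest.Create(url);
request.Method = "HEAD";

using (var response = request.GetResponse())
{
    // Drop parameters such as "; charset=utf-8" before comparing
    switch (response.ContentType.Split(';')[0].Trim())
    {
        case "text/xml":
            // ...
            break;
        case "text/html":
            // ...
            break;
    }
}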

Of course, this assumes that the web server publishes the content (MIME) type and does so correctly. But since you stated that you want a bandwidth-efficient way of doing this, I assume you don't want to download all the markup and analyse that! To be honest, the content type is usually set correctly in any case.

You can just use response.ContentType;
Matthew Flaschen
@Matthew: Good observation. Post edited.
This answers exactly half the question. There are some tricky content-types out there, like:
@bzlm: Yeah, but do they really get used? We're only talking about HTML and XML types here.
+1  A: 

Check the headers in your HttpWebResponse object. The Content-Type header should read text/xml for an XML/RSS document and text/html for a standard web page.
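In practice the header value often carries parameters (for example a charset), and feeds are not always served as text/xml, so a plain string comparison can miss. A minimal sketch of a more forgiving check, where ContentTypeClassifier and Classify are hypothetical names and the set of accepted types is an assumption:

```csharp
using System;

public static class ContentTypeClassifier
{
    // Hypothetical helper: maps a raw Content-Type header value to
    // "xml", "html", or "unknown".
    public static string Classify(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return "unknown";
        }

        // Strip parameters such as "; charset=utf-8" and normalise case.
        string mediaType = contentType.Split(';')[0].Trim().ToLowerInvariant();

        // XHTML is checked first because it would otherwise match the "+xml" rule.
        if (mediaType == "text/html" || mediaType == "application/xhtml+xml")
        {
            return "html";
        }

        // Covers text/xml, application/xml, and "+xml" types such as application/rss+xml.
        if (mediaType == "text/xml" ||
            mediaType == "application/xml" ||
            mediaType.EndsWith("+xml", StringComparison.Ordinal))
        {
            return "xml";
        }

        return "unknown";
    }
}
```

You would pass it response.ContentType from the HttpWebResponse; an "unknown" result (for example application/octet-stream) means the header alone cannot settle the question.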


Generally speaking, this is impossible. This is because it is possible (though unhelpful) to serve either HTML or XML files as application/octet-stream. Also, as noted by others, there are multiple valid XML MIME types. However, a HEAD request followed by a content type check could work sometimes:

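WebRequest req = WebRequest.Create(url);
req.Method = "HEAD"; // must be set before the request is sent
WebResponse resp = req.GetResponse();
String contentType = resp.ContentType;
resp.Close();

// StartsWith tolerates parameters such as "; charset=utf-8"
if (contentType.StartsWith("text/xml")) { /* XML */ }
else if (contentType.StartsWith("text/html")) { /* HTML */ }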

But if you're going to process it somehow either way, you can do:

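WebRequest req = WebRequest.Create(url);
WebResponse resp = req.GetResponse();
String contentType = resp.ContentType;

// StartsWith tolerates parameters such as "; charset=utf-8"
if (contentType.StartsWith("text/xml")) {
  // process XML
} else if (contentType.StartsWith("text/html")) {
  // process HTML
} else {
  // process error condition
}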

Keep in mind, files are downloaded on an as-needed basis. So just asking for the response object does not cause the whole file to be downloaded.

Matthew Flaschen