I need to be able to tell whether a link (URL) points to an XML file (an RSS feed) or a regular HTML file just by looking at the headers, or something similar, without downloading it.

Any good advice for me there? :)

Thanks! Roey

A: 

Just read it in a "text" reader, then decide which it is by, for example, looking for some tags ;) Then hand it to your actual reader.

Or is that too simple?

Mafti
He specifically said he wanted to know before downloading the whole file.
Matthew Flaschen
+2  A: 

You can use the Content-Type header, and to save bandwidth you can ask the web server to serve only a specified part of the document. If the server includes an Accept-Ranges: bytes header in its response, you can send Range: bytes=0-9 to download only the first ten bytes (or even try not to download anything).

Also look into using the HEAD verb instead of GET.
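As a rough sketch of the Range idea, assuming the server honours range requests: the class and method names below (RangeProbe, CreateRequest, ReadPrefix) are made up for illustration, while HttpWebRequest.AddRange is the standard System.Net call. Byte ranges are inclusive, so the first ten bytes are bytes 0 through 9.

```csharp
using System;
using System.IO;
using System.Net;

static class RangeProbe
{
    // Builds a GET request that asks for only the first `count` bytes.
    // (Hypothetical helper; not from the original answer.)
    public static HttpWebRequest CreateRequest(string url, int count)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AddRange(0, count - 1);   // sends "Range: bytes=0-(count-1)"
        return request;
    }

    // Sends the request and returns at most `count` bytes of the body.
    public static byte[] ReadPrefix(string url, int count)
    {
        var request = CreateRequest(url, count);

        using (var response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            var buffer = new byte[count];
            int total = 0, read;
            // A server that ignores Range may reply 200 with the full body,
            // so stop after `count` bytes regardless of what it sends.
            while (total < count &&
                   (read = stream.Read(buffer, total, count - total)) > 0)
            {
                total += read;
            }
            Array.Resize(ref buffer, total);
            return buffer;
        }
    }
}
```

The prefix could then be inspected for something like an XML declaration, though that is only a heuristic and depends on what the author put at the top of the file.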

Anton Gogolev
+1 I was about to suggest content type ;-)
Shoban
A: 

You cannot find out what file type it is just from looking at the URL.

I suggest you try to check the MIME-type of the document you request, or read the first line and hope the author has put in a Doctype.

Arve Systad
+11  A: 

You could just do a HEAD request instead of a full POST/GET.

That will get you the headers for that page, which should include the content type. From that you should be able to tell whether it's text/html or XML.

There's a good example here on SO.

Eoin Campbell
+1 the perfect answer and the exact reason for the existence of the HEAD request
Nick Allen - Tungle139
Just a minor reminder some servers do not support HEAD, so do not forget to fall back to GET/POST when it fails.
dr. evil
I count one "could" and two "should". ;]
bzlm
+5  A: 

Following up on Eoin Campbell's response, here's a code snippet that should do exactly that using the System.Net functionality:

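// WebRequest is not IDisposable, so only the response goes in a using block.
var request = System.Net.WebRequest.Create(
    "http://tempuri.org/pathToFile");
request.Method = "HEAD";

using (var response = request.GetResponse())
{
    // ContentType may carry parameters, e.g. "text/html; charset=utf-8",
    // so compare only the media type before the first ';'.
    var mediaType = response.ContentType.Split(';')[0].Trim().ToLowerInvariant();
    switch (mediaType)
    {
        case "text/xml":
            // ...
            break;
        case "text/html":
            // ...
            break;
    }
}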

Of course, this assumes that the web server publishes the content (MIME) type and does so correctly. But since you stated that you want a bandwidth-efficient way of doing this, I assume you don't want to download all the markup and analyse that! To be honest, the content type is usually set correctly in any case.
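In practice the Content-Type value often carries parameters such as a charset, and feeds are commonly served as application/rss+xml or application/atom+xml rather than text/xml. A minimal sketch of a more tolerant check follows; DocKind and ContentTypeClassifier are hypothetical names, and treating application/xhtml+xml as a regular web page is a judgment call rather than a rule.

```csharp
using System;

enum DocKind { Html, Xml, Unknown }

static class ContentTypeClassifier
{
    // Maps a raw Content-Type header value to a coarse document kind.
    public static DocKind Classify(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return DocKind.Unknown;

        // Keep only the media type, dropping parameters like "; charset=utf-8".
        string mediaType = contentType.Split(';')[0].Trim().ToLowerInvariant();

        // Assumption: XHTML pages count as "regular web pages" here.
        if (mediaType == "text/html" || mediaType == "application/xhtml+xml")
            return DocKind.Html;

        // text/xml, application/xml, and any "+xml" suffix type
        // (application/rss+xml, application/atom+xml, ...).
        if (mediaType == "text/xml" || mediaType == "application/xml" ||
            mediaType.EndsWith("+xml", StringComparison.Ordinal))
            return DocKind.Xml;

        return DocKind.Unknown;
    }
}
```

Anything that comes back as Unknown (for example application/octet-stream) would still need a fallback such as inspecting the start of the body.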

Noldorin
You can just use response.ContentType;
Matthew Flaschen
@Matthew: Good observation. Post edited.
Noldorin
This answers exactly half the question. There are some tricky content-types out there, like: http://www.w3.org/TR/xhtml-media-types/#application-xhtml-xml
bzlm
@bzlm: Yeah, but do they really get used? We're only talking about HTML and XML types here.
Noldorin
+1  A: 

Check the headers in your HttpWebResponse object. The Content-Type header should typically read text/xml (or another XML type such as application/rss+xml) for an XML/RSS document and text/html for a standard web page.

Nick
A: 

Generally speaking, this is impossible. This is because it is possible (though unhelpful) to serve either HTML or XML files as application/octet-stream. Also, as noted by others, there are multiple valid XML MIME types. However, a HEAD request followed by a content type check could work sometimes:

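WebRequest req = WebRequest.Create(url);
req.Method = "HEAD";  // must be set before the request is sent
String contentType;
using (WebResponse resp = req.GetResponse())
{
    // Drop parameters such as "; charset=utf-8" before comparing
    contentType = resp.ContentType.Split(';')[0].Trim();
}

if(contentType == "text/xml")
  getXML(url);
else if(contentType == "text/html")
  getHTML(url);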

But if you're going to process it somehow either way, you can do:

WebRequest req = WebRequest.Create(url);
using (WebResponse resp = req.GetResponse())
{
  // Drop parameters such as "; charset=utf-8" before comparing
  String contentType = resp.ContentType.Split(';')[0].Trim();

  if(contentType == "text/xml")
    processXML(resp.GetResponseStream());
  else if(contentType == "text/html")
    processHTML(resp.GetResponseStream());
  else
  {
    // process error condition
  }
}

Keep in mind that the response body is read on an as-needed basis, so just asking for the response object does not cause the whole file to be downloaded.
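Since some servers reject HEAD, one option is to try HEAD first and fall back to GET, disposing the response before the body is read. The following is only a sketch under that assumption; HeaderProbe and GetContentType are illustrative names, and how much of the body has already been buffered when the GET response is disposed depends on the server and the runtime.

```csharp
using System;
using System.Net;

static class HeaderProbe
{
    // Returns the raw Content-Type header value, or null if both the
    // HEAD attempt and the GET fallback fail with a WebException.
    public static string GetContentType(string url)
    {
        foreach (string method in new[] { "HEAD", "GET" })
        {
            try
            {
                WebRequest request = WebRequest.Create(url);
                request.Method = method;

                // Only the headers are needed; the response is disposed
                // without ever opening the body stream.
                using (WebResponse response = request.GetResponse())
                {
                    return response.ContentType;
                }
            }
            catch (WebException)
            {
                // HEAD not supported, or the request failed: try the next method.
            }
        }
        return null;
    }
}
```

A caller would still need to strip parameters such as "; charset=utf-8" from the returned value before comparing it against known types.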

Matthew Flaschen