I need to be able to tell whether a link (URL) points to an XML file (RSS feed) or a regular HTML file just by looking at the headers, or something similar (without downloading it).
Any good advice for me there? :)
Thanks! Roey
Just read it in a "text" reader first, then decide which it is by, for example, looking for some telltale tags ;) and then chuck it in your actual reader.
Or is that too simple?
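A rough sketch of that "look for some tags" idea, assuming you already have the first few hundred characters of the body as a string. The names SniffedKind and MarkupSniffer are made up for illustration, and the marker list is only a heuristic, not a real parser:

```csharp
using System;

public enum SniffedKind { Xml, Html, Unknown }

public static class MarkupSniffer
{
    // Guesses the document kind from the start of the body.
    // Heuristic only: it looks for a few telltale markers.
    public static SniffedKind Sniff(string head)
    {
        if (string.IsNullOrWhiteSpace(head))
            return SniffedKind.Unknown;

        // Check HTML markers first, because XHTML pages may also
        // begin with an XML declaration.
        if (Has(head, "<!doctype html") || Has(head, "<html"))
            return SniffedKind.Html;

        // An XML declaration, or an RSS/Atom root element.
        if (Has(head, "<?xml") || Has(head, "<rss") || Has(head, "<feed"))
            return SniffedKind.Xml;

        return SniffedKind.Unknown;
    }

    // Case-insensitive substring check.
    private static bool Has(string text, string marker)
    {
        return text.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

You would call MarkupSniffer.Sniff(firstChunk) and only then pick the XML or HTML reader. It still costs you part of the download, so it works best as a fallback when the headers are no help.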
You can use the Content-Type header, and to save bandwidth you can ask the web server to serve only a specified part of the document. If the server includes Accept-Ranges: bytes in its response, you can send Range: bytes=0-9 to download the first ten bytes only (or even try not to download anything).
Also research the HEAD verb instead of GET.
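Here is a minimal sketch of the Range approach using HttpWebRequest.AddRange. Byte ranges are inclusive, so the first ten bytes are bytes=0-9. The class name RangeProbe and its method names are invented for this example, and the URL you pass in is up to you. A server that ignores Range will simply start sending the whole body, so the code stops reading after the requested number of bytes either way:

```csharp
using System;
using System.Net;

public static class RangeProbe
{
    // Builds a GET request that asks for the first `count` bytes only.
    public static HttpWebRequest CreateRangeRequest(string url, int count)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.AddRange(0, count - 1); // range end is inclusive
        return request;
    }

    // Sends the request and returns the bytes that arrive, at most `count`.
    public static byte[] ReadHead(string url, int count)
    {
        var request = CreateRangeRequest(url, count);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            var buffer = new byte[count];
            int total = 0, read;
            // Read may return fewer bytes than asked for, so loop.
            while (total < count &&
                   (read = stream.Read(buffer, total, count - total)) > 0)
            {
                total += read;
            }
            Array.Resize(ref buffer, total);
            return buffer;
        }
    }
}
```

For example, RangeProbe.ReadHead(url, 10) should give you enough bytes to spot an XML declaration. Note that on recent .NET versions WebRequest is marked obsolete in favour of HttpClient, so treat this as matching the older API used elsewhere in this thread.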
You cannot find out what file type it is just from looking at the URL.
I suggest you try to check the MIME type of the document you request, or read the first line and hope the author has put in a doctype or an XML declaration.
You could just do a HEAD request instead of a full POST/GET.
That will get you the headers for that page, which should include the content type. From that you should be able to tell whether it's text/html or XML.
There's a good example here on SO.
Following up on Eoin Campbell's response, here's a code snippet that should do exactly that using the System.Net functionality:
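    // WebRequest is not IDisposable, so only the response goes in a using block.
    var request = System.Net.WebRequest.Create("http://tempuri.org/pathToFile");
    request.Method = "HEAD";
    using (var response = request.GetResponse())
    {
        // ContentType may carry parameters such as "; charset=utf-8",
        // so compare only the media type part.
        var mediaType = response.ContentType.Split(';')[0].Trim().ToLowerInvariant();
        switch (mediaType)
        {
            case "text/xml":
                // ...
                break;
            case "text/html":
                // ...
                break;
        }
    }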
Of course, this assumes that the web server publishes the content (MIME) type and does so correctly. But since you stated that you want a bandwidth-efficient way of doing this, I assume you don't want to download all the markup and analyse that! To be honest, the content type is usually set correctly in any case.
Check the headers in your HttpWebResponse object. The Content-Type header should read text/xml (or another XML type such as application/xml or application/rss+xml) for an XML/RSS document and text/html for a standard web page.
Generally speaking, this is impossible. That is because it is possible (though unhelpful) to serve either HTML or XML files as application/octet-stream. Also, as noted by others, there are multiple valid XML MIME types. However, a HEAD request followed by a content type check could work sometimes:
    WebRequest req = WebRequest.Create(url);
    req.Method = "HEAD"; // must be set before the request is sent
    using (WebResponse resp = req.GetResponse())
    {
        String contentType = resp.ContentType;
        if (contentType == "text/xml")
            getXML(url);
        else if (contentType == "text/html")
            getHTML(url);
    }
But if you're going to process it somehow either way, you can do:
    WebRequest req = WebRequest.Create(url);
    WebResponse resp = req.GetResponse();
    String contentType = resp.ContentType;
    if (contentType == "text/xml")
        processXML(resp.GetResponseStream());
    else if (contentType == "text/html")
        processHTML(resp.GetResponseStream());
    else
    {
        // process error condition
        resp.Close(); // abandon the body without reading it
    }
Keep in mind, files are downloaded on an as-needed basis. So just asking for the response object does not cause the whole file to be downloaded.
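To cope with the multiple XML MIME types mentioned above, and with parameters like "; charset=utf-8" that often trail the media type, a looser check than a plain string comparison may help. This is only a sketch: MediaKind and ContentTypeClassifier are names invented here, and treating application/xhtml+xml as HTML is my assumption about what you would want:

```csharp
using System;

public enum MediaKind { Xml, Html, Unknown }

public static class ContentTypeClassifier
{
    // Maps a raw Content-Type header value to a coarse kind.
    public static MediaKind Classify(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return MediaKind.Unknown;

        // Drop parameters such as "; charset=utf-8" and normalise case.
        string mediaType = contentType.Split(';')[0].Trim().ToLowerInvariant();

        // Assumption: XHTML pages are handled by the HTML path.
        if (mediaType == "text/html" || mediaType == "application/xhtml+xml")
            return MediaKind.Html;

        // text/xml, application/xml, and any "+xml" type such as
        // application/rss+xml or application/atom+xml.
        if (mediaType == "text/xml" ||
            mediaType == "application/xml" ||
            mediaType.EndsWith("+xml", StringComparison.Ordinal))
            return MediaKind.Xml;

        // Everything else, including application/octet-stream.
        return MediaKind.Unknown;
    }
}
```

You would then branch on ContentTypeClassifier.Classify(resp.ContentType) instead of comparing against "text/xml" directly, and fall back to inspecting the body when the result is Unknown.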