ansaurus

Question

C# HttpWebRequest - How to distinguish between HTML and XML pages without downloading?

Answer 1

A:

Just read it in a "text" reader, then decide which kind it is by, for example, looking for some tags ;) Then chuck it into your actual reader.

Or is that too simple?

Mafti 2009-05-25 10:00:06

He specifically said he wanted to know before downloading the whole file.

Matthew Flaschen 2009-05-25 10:14:03

Answer 2

+2 A:

You can use the Content-Type header, and to save bandwidth you can ask the web server to serve only a specified part of the document. If the server includes an Accept-Ranges: bytes header in its response, you can send Range: bytes=0-9 to download only the first ten bytes (or even try not to download anything).

Also look into the HEAD verb instead of GET.
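As a rough sketch of the Range idea, assuming the classic System.Net API used elsewhere in this thread: HttpWebRequest.AddRange sets the Range header for you. The class and method names below (RangeProbe, CreatePrefixRequest, GetContentType) are made up for illustration, and a server is free to ignore the range and send the full document with a 200 status.

```csharp
using System;
using System.Net;

static class RangeProbe
{
    // Builds (but does not send) a GET request for the first `count` bytes.
    public static HttpWebRequest CreatePrefixRequest(string url, int count)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        // Range offsets are inclusive, so 0..count-1 covers `count` bytes.
        request.AddRange(0, count - 1);
        return request;
    }

    // Sends the request and returns the Content-Type header of the response.
    public static string GetContentType(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // 206 Partial Content means the range was honoured; 200 means the
            // server ignored it and is sending the whole document.
            return response.ContentType;
        }
    }
}
```

Usage would be something like RangeProbe.GetContentType(RangeProbe.CreatePrefixRequest(url, 10)), where url is whatever address you are probing.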

Anton Gogolev 2009-05-25 10:01:00

+1 I was about to suggest content type ;-)

Shoban 2009-05-25 10:03:27

Answer 3

A:

You cannot find out what file type it is just from looking at the URL.

I suggest you try to check the MIME-type of the document you request, or read the first line and hope the author has put in a Doctype.
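If you do end up peeking at the start of the document, a minimal sketch of the check could look like the following. MarkupKind and MarkupSniffer are hypothetical names, and the rules are only a heuristic: an XHTML page that starts with an XML declaration is reported as XML here, and a document with neither a declaration nor a doctype comes back as Unknown.

```csharp
using System;

enum MarkupKind { Unknown, Xml, Html }

static class MarkupSniffer
{
    // Guesses the markup kind from the first characters of a document.
    public static MarkupKind Sniff(string prefix)
    {
        // Skip a byte-order mark and leading whitespace.
        string text = prefix.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');

        if (text.StartsWith("<?xml", StringComparison.OrdinalIgnoreCase))
            return MarkupKind.Xml;
        if (text.StartsWith("<!DOCTYPE html", StringComparison.OrdinalIgnoreCase) ||
            text.StartsWith("<html", StringComparison.OrdinalIgnoreCase))
            return MarkupKind.Html;
        return MarkupKind.Unknown;
    }
}
```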

Arve Systad 2009-05-25 10:01:40

Answer 4

+11 A:

You could just do a HEAD request instead of a full POST/GET.

That will get you the headers for that page, which should include the content type. From that you should be able to distinguish whether it's text/html or XML.

There's a good example here on SO.

Eoin Campbell 2009-05-25 10:01:47

+1 the perfect answer and the exact reason for the existence of the HEAD request

Nick Allen - Tungle139 2009-05-25 10:02:55

Just a minor reminder: some servers do not support HEAD, so do not forget to fall back to GET/POST when it fails.

dr. evil 2009-05-25 10:07:51
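A possible shape for that fallback, sketched under the assumption that a failed HEAD surfaces as a WebException (which is how HttpWebRequest reports 4xx/5xx responses). ContentTypeProbe, Probe and Fetch are illustrative names, and the fetch delegate is passed in only so the fallback logic can be tried without a network.

```csharp
using System;
using System.Net;

static class ContentTypeProbe
{
    // Tries HEAD first, then GET, and returns the first Content-Type obtained.
    public static string Probe(string url, Func<string, string, string> fetch)
    {
        foreach (var method in new[] { "HEAD", "GET" })
        {
            try
            {
                return fetch(url, method);
            }
            catch (WebException)
            {
                // e.g. 405 or 501 in response to HEAD; try the next method.
            }
        }
        return null; // neither method produced a response
    }

    // Default fetcher built on WebRequest.
    public static string Fetch(string url, string method)
    {
        var request = WebRequest.Create(url);
        request.Method = method;
        using (var response = request.GetResponse())
        {
            // The body stream is never read; only the headers are used.
            return response.ContentType;
        }
    }
}
```

A real call would be ContentTypeProbe.Probe(url, ContentTypeProbe.Fetch).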

I count one "could" and two "should". ;]

bzlm 2009-05-25 10:09:30

Eoin Campbell 2009-05-25 10:12:18

Answer 5

+5 A:

Following up on Eoin Campbell's response, here's a code snippet that should do exactly that using the System.Net functionality:

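// WebRequest is not IDisposable, so only the response goes in a using block.
var request = System.Net.WebRequest.Create(
    "http://tempuri.org/pathToFile");
request.Method = "HEAD";

using (var response = request.GetResponse())
{
    // ContentType may carry parameters, e.g. "text/html; charset=utf-8",
    // so compare only the media type itself.
    var mediaType = response.ContentType.Split(';')[0].Trim().ToLowerInvariant();

    switch (mediaType)
    {
        case "text/xml":
            // ...
            break;
        case "text/html":
            // ...
            break;
    }
}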

Of course, this assumes that the web server publishes the content (MIME) type and does so correctly. But since you stated that you want a bandwidth-efficient way of doing this, I assume you don't want to download all the markup and analyse that! To be honest, the content type is usually set correctly in any case.
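In practice the header often carries parameters such as a charset, and XML shows up under more than one media type (application/xml and the various "+xml" types, for instance). A slightly more tolerant check might look like the sketch below. DocumentKind and ContentTypeClassifier are hypothetical names, and treating every "+xml" type (including application/xhtml+xml) as XML is an assumption you may want to change.

```csharp
using System;

enum DocumentKind { Unknown, Html, Xml }

static class ContentTypeClassifier
{
    // Maps a raw Content-Type header value to a coarse document kind.
    public static DocumentKind Classify(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return DocumentKind.Unknown;

        // Drop parameters such as "; charset=utf-8" and normalise case.
        string mediaType = contentType.Split(';')[0].Trim().ToLowerInvariant();

        if (mediaType == "text/html")
            return DocumentKind.Html;
        if (mediaType == "text/xml" ||
            mediaType == "application/xml" ||
            mediaType.EndsWith("+xml", StringComparison.Ordinal))
            return DocumentKind.Xml;
        return DocumentKind.Unknown;
    }
}
```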

Noldorin 2009-05-25 10:05:00

You can just use response.ContentType;

Matthew Flaschen 2009-05-25 10:07:32

@Matthew: Good observation. Post edited.

Noldorin 2009-05-25 10:09:14

This answers exactly half the question. There are some tricky content-types out there, like: http://www.w3.org/TR/xhtml-media-types/#application-xhtml-xml

bzlm 2009-05-25 10:12:32

@bzlm: Yeah, but do they really get used? We're only talking about HTML and XML types here.

Noldorin 2009-05-25 18:44:15

Answer 6

+1 A:

Check the headers in your HttpWebResponse object. The Content-Type header should read text/xml for an XML/RSS document and text/html for a standard web page.

Nick 2009-05-25 10:06:33

Answer 7

A:

Generally speaking, this is impossible, because it is possible (though unhelpful) to serve either HTML or XML files as application/octet-stream. Also, as noted by others, there are multiple valid XML MIME types. However, a HEAD request followed by a content-type check could work sometimes:

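WebRequest req = WebRequest.Create(url);
req.Method = "HEAD"; // must be set before calling GetResponse()
String contentType;
using (WebResponse resp = req.GetResponse())
  contentType = resp.ContentType;

// StartsWith tolerates parameters such as "; charset=utf-8"
if(contentType.StartsWith("text/xml", StringComparison.OrdinalIgnoreCase))
  getXML(url);
else if(contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase))
  getHTML(url);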

But if you're going to process it somehow either way, you can do:

WebRequest req = WebRequest.Create(url);
using (WebResponse resp = req.GetResponse())
{
  String contentType = resp.ContentType;

  if(contentType.StartsWith("text/xml", StringComparison.OrdinalIgnoreCase))
    processXML(resp.GetResponseStream());
  else if(contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase))
    processHTML(resp.GetResponseStream());
  else
  {
    // process error condition
  }
}

Keep in mind that the response body is read on an as-needed basis as you consume the response stream. So just asking for the response object does not cause the whole file to be downloaded.
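To make use of that, you could read only a small prefix of the stream and stop. The helper below is a sketch with made-up names (StreamPrefix, ReadPrefix); it assumes the content is UTF-8, which will not hold for every page, and a multi-byte character cut off at the end of the prefix will decode as a replacement character.

```csharp
using System;
using System.IO;
using System.Text;

static class StreamPrefix
{
    // Reads at most maxBytes from the stream and decodes them as UTF-8.
    public static string ReadPrefix(Stream stream, int maxBytes)
    {
        var buffer = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes)
        {
            // Read may return fewer bytes than requested, so loop until
            // the buffer is full or the stream ends.
            int read = stream.Read(buffer, total, maxBytes - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        return Encoding.UTF8.GetString(buffer, 0, total);
    }
}
```

With a live response this would be called as StreamPrefix.ReadPrefix(resp.GetResponseStream(), 512), where 512 is an arbitrary choice, before disposing the response.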

Matthew Flaschen 2009-05-25 10:11:50
