+2  A: 

Firstly, the easier way of writing that code is to use a StreamReader and ReadToEnd:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
using (Stream resStream = response.GetResponseStream())
// Encoding.??? is a placeholder: substitute the encoding the server actually used.
using (StreamReader reader = new StreamReader(resStream, Encoding.???))
{
    return reader.ReadToEnd();
}

Then it's "just" a matter of finding the right encoding. How did you create the file? If it's with Notepad then you probably want Encoding.Default - but that's obviously not portable, as it's the default encoding for your PC.

In a well-run web server, the response will indicate the encoding in its headers. Having said that, the response headers sometimes claim one thing while the HTML claims another.
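
As a rough sketch of that idea (not a definitive implementation): the helper below looks for a charset=... parameter in the Content-Type header value first, then in the first bytes of the HTML, and only then falls back to an encoding chosen by the caller. The names EncodingSniffer, FromCharsetParameter and Decode, and the 1024-byte search window, are my own assumptions for illustration; nothing here is a framework API beyond Encoding and Regex.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class EncodingSniffer
{
    // Matches charset=... in a Content-Type header value or an HTML <meta> tag.
    private static readonly Regex CharsetPattern =
        new Regex(@"charset\s*=\s*[""']?([A-Za-z0-9_\-:.]+)", RegexOptions.IgnoreCase);

    // Returns the encoding named by a charset=... parameter in the text,
    // or null if there is no such parameter or the name is not recognised.
    public static Encoding FromCharsetParameter(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;

        Match match = CharsetPattern.Match(text);
        if (!match.Success) return null;

        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name on this platform.
            return null;
        }
    }

    // Decodes the body using the header's charset, then the HTML's own
    // declaration, then the caller-supplied fallback.
    public static string Decode(byte[] body, string contentTypeHeader, Encoding fallback)
    {
        Encoding encoding = FromCharsetParameter(contentTypeHeader);
        if (encoding == null)
        {
            // A charset declaration is plain ASCII, so decoding the start of
            // the body as ASCII is enough to search for it (1024 is arbitrary).
            int prefixLength = Math.Min(body.Length, 1024);
            string head = Encoding.ASCII.GetString(body, 0, prefixLength);
            encoding = FromCharsetParameter(head);
        }
        return (encoding ?? fallback).GetString(body);
    }
}
```

With an HttpWebResponse, the header value would come from response.ContentType and the bytes from reading response.GetResponseStream() to the end. This sketch trusts the header over the HTML when both are present; when they disagree you may still have to guess.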

Jon Skeet
In fact, I am trying to fetch files from all over the world, and I got some bad output: the PNG file wasn't properly formed, and the text came out garbled (every character like "é").
Daok
If you're trying to read arbitrary HTML, you'll need to examine the headers and sometimes the start of the HTML (which can advertise the encoding just like XML does). Sometimes you then have to detect that it's probably not right and guess by heuristics anyway!
Jon Skeet
OK, I'll take a look at the header. I have been playing with your code, and StreamReader(resStream, true) doesn't work (it is supposed to find the encoding from the bytes...). I'll try to get it from the header. I'll post later.
Daok
A: 

Jon Skeet's answer got me 50% of the way there. I would like to thank him a lot!

I fixed the reading problem (I posted the new code in the question) by using the Charset of the HTML file (Encoding was always empty), and I had to write the file with the user's Default encoding for it to display correctly.

Now it works well :)

Daok
A: 

Hi there, Daok.

I'm having a very similar problem to you.

I just can't figure out the solution by reading this post.

I'm parsing google.com.ar, and whenever I find a 'strange' character, my program can't handle it properly.

I guess it's just an Encoding problem.

The thing is, I can determine the proper encoding by using your code, and I'm passing the right Encoding to the StreamReader constructor, but still, when I write the results to a file, all the weird characters are missing.

How did you solve it exactly?

This is my code:

        HttpWebRequest oRequest;
        HttpWebResponse oResponse;
        Encoding encoding;
        string googleUrl = "http://www.google.com.ar/search?q=";
        string pagedata;
        string charSet;

        oRequest = (HttpWebRequest)WebRequest.Create(googleUrl);

        oResponse = (HttpWebResponse)oRequest.GetResponse();

        // This runs after the response has already been received, so it has
        // no influence on how the response is decoded.
        oRequest.ContentType = "Text/HTML";

        // Charset reported for the response; fall back to the machine default if empty.
        charSet = oResponse.CharacterSet;

        if (String.IsNullOrEmpty(charSet))
            encoding = Encoding.Default;
        else
            encoding = Encoding.GetEncoding(charSet);

        StreamReader sr = new StreamReader(oResponse.GetResponseStream(), encoding);

        pagedata = sr.ReadToEnd();
        sr.Close();
        oResponse.Close();
        return pagedata;

Any pointer is appreciated!

Thanks a lot!

It is weird that you have a problem, because you took the good part of my code (Encoding.Default). Try checking the value of the request's encoding (from oRequest) and applying the same one to oResponse.
Daok
A: 

oResponse.CharacterSet is always "ISO-8859-1" for every web page. For example:

[http://www.google.com.vn/] has [charset=UTF-8], but oResponse.CharacterSet = "ISO-8859-1"
[https://bazar.biglobe.ne.jp/cgi-bin/index.cgi] has [charset=euc-jp], but oResponse.CharacterSet = "ISO-8859-1"

Is there another way to resolve this?

Trrung