Here's what I've got so far (it doesn't work). At this point I think my target is ANSI-encoded, but I really don't want to have to know that in advance. My browser seems to be able to determine which encoding to use. How can I?

static void GetUrl(Uri uri, string localFileName)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse response;

    response = (HttpWebResponse)request.GetResponse();

    // Save the stream to file
    Stream responseStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(responseStream, Encoding.Default);
    Stream fileStream = File.OpenWrite(localFileName);
    using (StreamWriter sw = new StreamWriter(fileStream, Encoding.Default))
    {
        sw.Write(reader.ReadToEnd());
        sw.Flush();
        sw.Close();
    }
}


After answers (currently only tested on a UTF-8 site):

static void GetUrl(Uri uri, string localFileName)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    try
    {
        // GetEncoding() may throw if CharacterSet is missing or names a charset this runtime doesn't know
        Encoding encoding = Encoding.GetEncoding(response.CharacterSet);
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
        using (StreamWriter sw = new StreamWriter(localFileName, false, encoding))
        {
            // Disposing the writer flushes and closes it
            sw.Write(reader.ReadToEnd());
        }
    }
    finally
    {
        response.Close();
    }
}
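
The "hope" in that comment can be made a little safer. Below is a minimal sketch, assuming a hypothetical helper named EncodingResolver (not part of the original code), that falls back to a caller-supplied encoding when the charset name is missing or not recognised:

```csharp
using System;
using System.Text;

static class EncodingResolver
{
    // Hypothetical helper: returns the named encoding, or the fallback
    // when the name is null, empty, or unknown to this runtime.
    public static Encoding Resolve(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charsetName))
            return fallback;
        try
        {
            // Strip whitespace and quotes in case the value arrives quoted
            return Encoding.GetEncoding(charsetName.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // GetEncoding() rejects names it does not recognise
            return fallback;
        }
    }
}
```

In GetUrl it could be called as EncodingResolver.Resolve(response.CharacterSet, Encoding.UTF8). UTF-8 is only an assumed default here; pick whatever suits the sites you download from.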
+2  A: 

Web browsers try to detect the character encoding in three ways.

Look for a meta tag (if it's HTML):

<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">

or an XML declaration (for XHTML):

<?xml version="1.0" encoding="ISO-8859-1"?>

or it may be specified in the HTTP response header:

Content-Type: text/html; charset=ISO-8859-1
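
The three places above can be checked in code roughly as follows. This is a crude sketch, not how any browser actually does it: the class name CharsetSniffer, the regular expressions, and the order of the checks (header first, then XML declaration, then meta tag) are all assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Matches charset=... in a Content-Type value or in an HTML <meta> tag
    static readonly Regex CharsetParam = new Regex(
        @"charset\s*=\s*[""']?([A-Za-z0-9._:\-]+)", RegexOptions.IgnoreCase);

    // Matches encoding="..." inside an XML declaration
    static readonly Regex XmlDecl = new Regex(
        @"<\?xml[^>]*\bencoding\s*=\s*[""']([A-Za-z0-9._:\-]+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the declared charset name, or null if none was found.
    // documentStart is assumed to hold the first part of the body,
    // decoded as ASCII-compatible text.
    public static string FindCharset(string contentTypeHeader, string documentStart)
    {
        Match m = CharsetParam.Match(contentTypeHeader ?? "");
        if (m.Success) return m.Groups[1].Value;

        m = XmlDecl.Match(documentStart ?? "");
        if (m.Success) return m.Groups[1].Value;

        // Crude: accepts any charset=... near the top of the document
        m = CharsetParam.Match(documentStart ?? "");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The returned name can then be handed to Encoding.GetEncoding(), which may still reject names it does not know.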
lubos hasko
I'm using the header information for the immediate situation: Encoding.GetEncoding(response.CharacterSet); This seems to do the trick for now.
CrashCodes
+2  A: 

You should be looking for the encoding the server sends the response in. Encoding.Default does not cut the mustard here. :-)

Stream responseStream = response.GetResponseStream();
Encoding enc = Encoding.GetEncoding(response.CharacterSet);
StreamReader reader = new StreamReader(responseStream, enc);
Stream fileStream = File.Create(localFileName); // unlike OpenWrite, this truncates an existing file
using (StreamWriter sw = new StreamWriter(fileStream, enc))
{  /* ... */ }

To be sure, you could convert everything to UTF-8 and store your file as UTF-8 always. That way you are never left with the need to guess the encoding when reading the file.
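
A minimal sketch of that idea, assuming a hypothetical helper named Utf8Saver: decode the response with whatever encoding the server declared, but always write the file as UTF-8 (here without a byte order mark, which is a choice made for the sketch).

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8Saver
{
    // Decodes the source stream with sourceEncoding and writes the text
    // to localFileName as UTF-8, overwriting any existing file.
    public static void SaveAsUtf8(Stream source, Encoding sourceEncoding, string localFileName)
    {
        using (var reader = new StreamReader(source, sourceEncoding))
        using (var writer = new StreamWriter(localFileName, false, new UTF8Encoding(false)))
        {
            writer.Write(reader.ReadToEnd());
        }
    }
}
```

With the code above it would be called as Utf8Saver.SaveAsUtf8(responseStream, enc, localFileName). Anything that reads the file later can then assume UTF-8.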

Tomalak