ansaurus

Question

HttpWebRequest: Receiving response with the right encoding

Answer 1

A:

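I believe HttpWebResponse has a CharacterSet property (ContentEncoding, by contrast, reports how the body was compressed, e.g. gzip). Turn it into an Encoding with Encoding.GetEncoding and use that in the constructor of your StreamReader.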
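
Something along these lines might work as a starting point. It is only a sketch: the ResponseReader class and ReadBody method are made-up names, and falling back to UTF-8 when the charset is missing or unrecognised is an assumption, not something the server guarantees.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class ResponseReader
{
    // Reads the response body using the charset declared in the Content-Type header.
    // Falls back to UTF-8 (an assumption) when the charset is empty or unknown.
    public static string ReadBody(HttpWebResponse response)
    {
        Encoding encoding = Encoding.UTF8;
        string charset = response.CharacterSet;

        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                // Some servers wrap the value in quotes, so strip them first.
                encoding = Encoding.GetEncoding(charset.Trim().Trim('"'));
            }
            catch (ArgumentException)
            {
                // Unrecognised charset name: keep the fallback encoding.
            }
        }

        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

You would call it as ResponseReader.ReadBody((HttpWebResponse)request.GetResponse()), where request is your existing HttpWebRequest.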

John Saunders 2009-03-12 14:10:10

Answer 2

A:

Daniel, some pages do not even return a value in CharacterSet, so this approach is not very reliable. Sometimes not even the browsers are able to "guess" which encoding to use, so I don't think 100% accurate encoding recognition is possible.

In my particular case, as I deal with Spanish or Portuguese pages, I use the UTF7 encoding and it works fine for me (áéíóúñÑêã, etc.).

Maybe you can first load a table of CharacterSet codes and their corresponding encodings. Then, in case the CharacterSet is empty or not in the table, you can fall back to a default encoding.
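
A rough sketch of that table idea could look like the following. The CharsetTable class, its Resolve method, and the three entries are illustrative only; the default encoding is whatever suits the pages you expect.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CharsetTable
{
    // Hypothetical lookup table: charset label -> Encoding. Extend it as needed.
    private static readonly Dictionary<string, Encoding> Known =
        new Dictionary<string, Encoding>(StringComparer.OrdinalIgnoreCase)
        {
            { "utf-8", Encoding.UTF8 },
            { "iso-8859-1", Encoding.GetEncoding("iso-8859-1") },
            { "us-ascii", Encoding.ASCII },
        };

    // Returns the mapped encoding, or defaultEncoding when the charset
    // is null, empty, or not present in the table.
    public static Encoding Resolve(string charset, Encoding defaultEncoding)
    {
        if (string.IsNullOrEmpty(charset))
        {
            return defaultEncoding;
        }

        Encoding encoding;
        return Known.TryGetValue(charset.Trim(), out encoding)
            ? encoding
            : defaultEncoding;
    }
}
```

For example, CharsetTable.Resolve(response.CharacterSet, Encoding.UTF8) would give you UTF-8 whenever the server sends no charset at all.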

The detectEncodingFromByteOrderMarks parameter of the StreamReader constructor may help a little, as it automatically detects some encodings from the very first bytes of the stream.
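
To see what that parameter does, here is a small self-contained sketch (the BomDemo class and the sample bytes are made up for illustration). The reader is told to use UTF-8, but the stream starts with a UTF-16 little-endian byte order mark, so the reader switches encoding on its own.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    static void Main()
    {
        // FF FE is the UTF-16 little-endian byte order mark; E9 00 is "é" in UTF-16 LE.
        byte[] bytes = { 0xFF, 0xFE, 0xE9, 0x00 };

        using (MemoryStream stream = new MemoryStream(bytes))
        // Third argument: detectEncodingFromByteOrderMarks = true.
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            string text = reader.ReadToEnd();

            // The BOM overrides the UTF-8 hint, so the text decodes correctly.
            Console.WriteLine(text == "\u00E9"); // prints True
        }
    }
}
```

Most web pages are served without a byte order mark, which is why this only helps a little: without one, the reader simply uses the encoding you passed in.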

Romias 2009-05-19 04:50:46

Answer 3

+1 A:

Hi Daniel - Gap's site is wrong. The specific problem is that their page claims an encoding of Latin1 (ISO-8859-1), while the page uses character #146 which is not valid in ISO-8859-1.

That character is, however, valid in the Windows CP-1252 encoding (which is a superset of ISO 8859-1). In CP-1252, character code #146 is used for the right single quotation mark. You'll see this as the apostrophe in "You’ll find Petites and small sizes" in today's text on the Gap.com home page.

You can read http://en.wikipedia.org/wiki/Windows-1252 for more details. Turns out this kind of thing is a common problem on web pages where the content was originally saved in the CP-1252 encoding (e.g. copy/pasted from Word).
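
You can check the difference yourself with a few lines like these. This is just a sketch (the QuoteDemo class is a made-up name) that decodes the same bytes, including byte 146, with both encodings.

```csharp
using System;
using System.Text;

static class QuoteDemo
{
    static void Main()
    {
        // Note: on .NET Core / .NET 5+, code page 1252 is only available after
        // calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        // On the .NET Framework it works out of the box.

        // "You?ll" where the fourth byte is 146 (0x92).
        byte[] bytes = { (byte)'Y', (byte)'o', (byte)'u', 146, (byte)'l', (byte)'l' };

        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        string asCp1252 = Encoding.GetEncoding(1252).GetString(bytes);

        // ISO-8859-1 maps byte 146 straight to U+0092, a control code.
        Console.WriteLine((int)asLatin1[3]); // prints 146

        // Windows-1252 maps byte 146 to U+2019, the right single quotation mark.
        Console.WriteLine((int)asCp1252[3]); // prints 8217
    }
}
```

One pragmatic workaround, which as far as I know is also what browsers do, is to decode any page that claims ISO-8859-1 as Windows-1252 instead.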

Moral of the story here: always store internationalized text as Unicode in your database, and always emit HTML as UTF-8 on your web server!

Justin Grant 2009-08-14 21:07:32
