ansaurus

Question

How do I convert from a possibly Windows 1252 'ANSI' encoded uploaded file to UTF8 in .NET?

Answer 1

+2 A:

If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:

StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);

The third argument, "true" (detectEncodingFromByteOrderMarks), tells the reader to pick the encoding from a byte order mark at the start of the file, if one is present. Otherwise it reads the file as Windows-1252.

You can then read the file and, if you want to convert it to UTF-8, write the text to a file that you've opened with that encoding.
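As a minimal sketch of that read-then-write step: the class name AnsiToUtf8, the method ConvertFile, and both path parameters are made up for illustration. It assumes .NET Core 3.0 or later (or .NET 5+), where code page 1252 has to be registered first. On the .NET Framework the RegisterProvider line can be dropped.

```csharp
using System.IO;
using System.Text;

public static class AnsiToUtf8
{
    // Hypothetical helper: re-encodes a text file to UTF-8.
    // The source is read as Windows-1252 unless it starts with a BOM.
    public static void ConvertFile(string sourcePath, string targetPath)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding source = Encoding.GetEncoding("Windows-1252");

        // "true" lets a BOM, if present, override the Windows-1252 default.
        using (var reader = new StreamReader(sourcePath, source, true))
        // UTF8Encoding(true) writes a UTF-8 BOM at the start of the output.
        using (var writer = new StreamWriter(targetPath, false, new UTF8Encoding(true)))
        {
            char[] buffer = new char[4096];
            int read;
            // Copy decoded characters in chunks so large files are not
            // loaded into memory all at once.
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }
}
```

For an uploaded file you would probably pass the upload's Stream to the StreamReader constructor instead of a path; the rest stays the same.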

The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
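To make the BOM part concrete, here is a rough sketch of checking the first bytes of a file by hand. BomSniffer and DetectFromBom are hypothetical names. StreamReader already does something equivalent when you pass "true", so this is only meant to show what is being looked for.

```csharp
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding indicated by a leading
    // byte order mark, or null when the data does not start with one.
    public static Encoding DetectFromBom(byte[] head)
    {
        // UTF-32 LE is tested before UTF-16 LE because its BOM (FF FE 00 00)
        // begins with the UTF-16 LE BOM (FF FE).
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return new UTF32Encoding(false, true);  // UTF-32 little endian
        if (head.Length >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return new UTF32Encoding(true, true);   // UTF-32 big endian
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return new UTF8Encoding(true);          // UTF-8
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return new UnicodeEncoding(false, true); // UTF-16 little endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return new UnicodeEncoding(true, true);  // UTF-16 big endian

        // No BOM: the encoding cannot be determined from the header alone.
        return null;
    }
}
```

A null result is the case where you are left with heuristics or prior knowledge about where the file came from.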

I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
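One heuristic that is commonly used when the only candidates are UTF-8 and Windows-1252 is to try a strict UTF-8 decode first and fall back to Windows-1252 if it fails. This is not a description of what Notepad does, just a sketch of that one idea. EncodingGuess and DecodeUtf8OrWindows1252 are hypothetical names, and the RegisterProvider call again assumes .NET Core 3.0+ or .NET 5+.

```csharp
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: decodes bytes as UTF-8 when they form valid
    // UTF-8, otherwise as Windows-1252.
    public static string DecodeUtf8OrWindows1252(byte[] data)
    {
        try
        {
            // The second argument (throwOnInvalidBytes) makes the decoder
            // throw on malformed input instead of substituting U+FFFD.
            var strictUtf8 = new UTF8Encoding(false, true);
            // Note: GetString does not strip a leading BOM; if the data
            // starts with EF BB BF the result begins with U+FEFF.
            return strictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: running on .NET Core / .NET 5+, where code page
            // 1252 must be registered before use.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(data);
        }
    }
}
```

The guess can still be wrong: pure ASCII decodes identically either way, and a file in some other single-byte code page would be read as Windows-1252 without any error.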

In practice, I've found the following to work for most of what I do:

StreamReader reader = new StreamReader("filename", Encoding.Default, true);

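Most of the files I read are ones that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. The other files I get are typically written by some tool that doesn't understand Unicode or code pages, so I just treat them as single-byte text in the system's ANSI code page, which is what Encoding.Default gives you on the .NET Framework. (On .NET Core and .NET 5 and later, Encoding.Default is always UTF-8, so name the code page explicitly there.)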

Jim Mischel 2009-01-22 16:54:47
