ansaurus

Question

Answer 1

+1 A:

Where are you writing ASCII.txt? You're writing ANSI.txt in the first line, but that's certainly not ASCII as ASCII doesn't contain any accented characters. The ANSI file won't contain any preamble indicating that it's ANSI rather than ASCII or UTF-8.

You seem to have changed your mind between ASCII and ANSI halfway through writing the example, basically.

I'd expect any ASCII file to be "detected" as UTF-8, because the encoding detection only reports something other than UTF-8 when the file starts with a byte order mark. It's not like it reads the whole file and then guesses at what the encoding is.

From the docs for StreamReader:

This constructor initializes the encoding to UTF8Encoding, the BaseStream property using the stream parameter, and the internal buffer to the default size.

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
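To make the byte-order-mark behaviour concrete, here is a minimal sketch (not code from the question). The helper name `DetectedEncodingName` and the sample bytes are assumptions made for illustration; it uses C# top-level statements, so it needs C# 9 or later.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: reports which encoding StreamReader settles on
// for a given byte sequence when BOM detection is switched on.
static string DetectedEncodingName(byte[] bytes)
{
    using var stream = new MemoryStream(bytes);
    using var reader = new StreamReader(stream, Encoding.UTF8,
                                        detectEncodingFromByteOrderMarks: true);
    reader.ReadToEnd(); // CurrentEncoding is only meaningful after the first read
    return reader.CurrentEncoding.WebName;
}

// "café" with é stored as the single byte 0xE9 (an ANSI-style file), no BOM.
byte[] ansiBytes = { 0x63, 0x61, 0x66, 0xE9 };

// The same text as little-endian UTF-16, preceded by its BOM (FF FE).
byte[] utf16Bytes = { 0xFF, 0xFE, 0x63, 0x00, 0x61, 0x00, 0x66, 0x00, 0xE9, 0x00 };

Console.WriteLine(DetectedEncodingName(ansiBytes));  // utf-8 (no BOM, so the default stays)
Console.WriteLine(DetectedEncodingName(utf16Bytes)); // utf-16 (BOM recognised)
```

The ANSI bytes are still reported as UTF-8 even though 0xE9 on its own is not valid UTF-8, which is the point: the reader looks for a preamble and does not inspect the content.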

Now File.Copy is just copying the raw bytes from place to place - it shouldn't change anything in terms of character encodings, because it doesn't try to interpret the file as a text file in the first place.
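One way to check this yourself is to compare the two files byte for byte. The sketch below is an illustration only: the temporary directory and the sample bytes are assumptions, and the file names simply mirror the ones mentioned in the question.

```csharp
using System;
using System.IO;
using System.Linq;

// Work in a throwaway directory so nothing existing is overwritten.
string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(dir);
string source = Path.Combine(dir, "ANSI.txt");
string copy = Path.Combine(dir, "copy.txt");

// "für" with ü stored as the single byte 0xFC, and no byte order mark.
File.WriteAllBytes(source, new byte[] { 0x66, 0xFC, 0x72 });

File.Copy(source, copy);

// Compare the raw bytes; no text decoding is involved at any point.
bool identical = File.ReadAllBytes(source).SequenceEqual(File.ReadAllBytes(copy));
Console.WriteLine(identical); // True

Directory.Delete(dir, recursive: true);
```

If this prints False on a real pair of files, something other than File.Copy wrote to the destination.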

It's not quite clear to me where you see a problem (partly due to the ANSI/ASCII part). I suggest you separate out the issues of "does File.Copy change things?" and "what character encoding is detected by StreamReader?" in both your mind and your question. The answers should be:

File.Copy shouldn't change the contents of the file at all
StreamReader can only detect UTF-8 and UTF-16; if you need to read a file encoded with any other encoding, you should state that explicitly in the constructor. (I would personally recommend using Encoding.Default instead of Encoding.GetEncoding(0) by the way. I think it's clearer.)
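A short sketch of passing the encoding explicitly. Two assumptions to flag: the sample uses ISO-8859-1 as a stand-in for "ANSI", because on .NET Core and .NET 5+ `Encoding.Default` always returns UTF-8 rather than the system ANSI code page (it was the ANSI code page on .NET Framework, which this answer was written against), and the `ReadAll` helper is a hypothetical name.

```csharp
using System;
using System.IO;
using System.Text;

// "für" with ü stored as the single byte 0xFC, no BOM.
byte[] ansiBytes = { 0x66, 0xFC, 0x72 };

// Hypothetical helper: decode a byte sequence with the given encoding.
static string ReadAll(byte[] bytes, Encoding encoding)
{
    using var reader = new StreamReader(new MemoryStream(bytes), encoding);
    return reader.ReadToEnd();
}

// ISO-8859-1 is assumed here purely for illustration; use whatever
// encoding the file was actually written with.
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

string asLatin1 = ReadAll(ansiBytes, latin1);
string asUtf8 = ReadAll(ansiBytes, Encoding.UTF8);

Console.WriteLine(asLatin1 == "f\u00FCr");          // True: ü decoded correctly
Console.WriteLine(asUtf8.IndexOf('\uFFFD') >= 0);   // True: 0xFC became U+FFFD
```

Decoding the same bytes as UTF-8 does not fail; the invalid byte is silently replaced with U+FFFD, which is easy to mistake for file corruption.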

Jon Skeet 2009-06-16 08:54:46

The problem is not StreamReader. I only used it to create a short piece of code that can reproduce the problem. (And I screwed up, since I confused ASCII and ANSI while playing around with it.) I noticed it first in a hex editor, and to my understanding the resulting file is incorrect, since it has the UTF-8 byte order mark (3 bytes at the beginning) and a wrong character code for the accented character.

chris166 2009-06-16 09:20:42

Something is weird. I'm not able to reproduce it anymore. So something was outdated (my hex editor, the code in VS or whatever). Anyway, thanks for looking into the problem and spending so much time on it!

chris166 2009-06-16 09:26:56

My pleasure - although really this didn't take much more time than it took to just type the answer. Other questions have occasionally soaked up *much* more effort :)

Jon Skeet 2009-06-16 09:29:03

Answer 2

A:

I doubt this has anything to do with File.Copy. I think what you're seeing is that StreamReader uses UTF-8 by default to decode, and since UTF-8 is backwards compatible with ASCII, StreamReader never has any reason to stop using UTF-8 to read the ANSI-encoded file.

If you open ASCII.txt and copy.txt in a hex editor, are they identical?
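If a hex editor isn't handy, the same check can be done in a few lines. This is a sketch with made-up sample bytes; `StartsWithUtf8Bom` is a hypothetical helper, and for real files you would feed it the result of `File.ReadAllBytes`.

```csharp
using System;
using System.Text;

// Hypothetical helper: does the byte sequence begin with the UTF-8 preamble?
static bool StartsWithUtf8Bom(byte[] bytes)
{
    byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
    if (bytes.Length < bom.Length) return false;
    for (int i = 0; i < bom.Length; i++)
    {
        if (bytes[i] != bom[i]) return false;
    }
    return true;
}

// "für" as UTF-8 with a BOM: ü is the two bytes C3 BC.
byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x66, 0xC3, 0xBC, 0x72 };
// "für" as a single-byte ANSI-style file: ü is FC, no BOM.
byte[] withoutBom = { 0x66, 0xFC, 0x72 };

Console.WriteLine(BitConverter.ToString(withBom));  // EF-BB-BF-66-C3-BC-72
Console.WriteLine(StartsWithUtf8Bom(withBom));      // True
Console.WriteLine(StartsWithUtf8Bom(withoutBom));   // False
```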

Josh Einstein 2009-06-16 08:55:51

No, the encoding detection of StreamReader works fine. The copy.txt has the UTF-8 byte order mark at the beginning and the wrong character for the umlaut character.

chris166 2009-06-16 09:16:48


File.Copy and character encoding
