How do you read a text file without losing odd characters?

+1 Q:

I would like to read a text file into an array of strings using System.IO.File.ReadAllLines. However, ReadAllLines strips out some odd characters in the file that I would like to keep, such as chr(187). I've tried some different encoding options, but that doesn't help and I don't see an option for "no encoding."

I can use FileOpen and LineInput to read the file without modification, but this is quite a bit slower. Using FileSystemObject also works properly, but I would rather not use that.

What is the best way to read a text file into an array of strings without modification in .NET?

+3 A:

There's no such concept as "no encoding". You must find out the right encoding, otherwise you can't possibly interpret the data correctly.

When you say "chr(187)" what Unicode character do you mean?

Some encodings you might want to try:

Encoding.Default - the system default encoding
Encoding.GetEncoding(28591) - ISO-Latin-1
Encoding.UTF8 - very common in modern files
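
To see how much the choice of encoding matters for a byte such as 187, here is a minimal sketch that decodes the same three bytes under each of the encodings listed above and prints the resulting code points. The class name `EncodingProbe`, the helper `Describe`, and the sample bytes are made up for illustration; what `Encoding.Default` resolves to depends on the runtime and system settings.

```csharp
using System;
using System.Text;

static class EncodingProbe
{
    // Decodes the bytes with the given encoding and lists the resulting
    // UTF-16 code units, e.g. "U+0041 U+00BB U+0042".
    public static string Describe(byte[] data, Encoding encoding)
    {
        var sb = new StringBuilder();
        foreach (char c in encoding.GetString(data))
        {
            sb.AppendFormat("U+{0:X4} ", (int)c);
        }
        return sb.ToString().TrimEnd();
    }

    static void Main()
    {
        // Hypothetical sample: 'A', the byte 187, 'B'.
        byte[] sample = { 65, 187, 66 };

        // ISO-8859-1 maps every byte to the code point with the same value.
        Console.WriteLine("Latin-1: " + Describe(sample, Encoding.GetEncoding(28591)));

        // A lone byte 187 is not valid UTF-8, so it decodes to U+FFFD.
        Console.WriteLine("UTF-8:   " + Describe(sample, Encoding.UTF8));

        // Depends on the platform and runtime.
        Console.WriteLine("Default: " + Describe(sample, Encoding.Default));
    }
}
```

Under ISO-8859-1 the byte 187 comes back as the character with code point 187, which is the behaviour the question asks for; under UTF-8 the same byte is replaced because it is not a valid sequence on its own.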

Jon Skeet 2009-11-26 17:32:58

When I say chr(187), I mean that the value of that byte in the file is 187. I realize that it has to get converted to some character in Windows, and I don't care which character that is. But I would like to be able to see that character in my string as a character equal to chr(187). Now, that character is missing when I use ReadAllLines and any of the three encoding options above.

xpda 2009-11-26 17:43:45

I am guessing the code page you want is 1252 Western European (`Encoding.GetEncoding(1252)`). Are you sure you are ‘missing’ characters completely? `ReadAllLines(..., Encoding.GetEncoding(28591))`, and most locales' values of `Encoding.Default`, will convert every byte to *some* character or other (although in 28591's case, bytes 128 to 159 become control characters), so if they're not making it through, you have a problem elsewhere.

bobince 2009-11-26 17:57:06

GetEncoding(1252) doesn't do it. Yes, the characters are stripped out of the file. If I do a ReadAllLines immediately followed by WriteAllLines, the output file is smaller than the input file.

xpda 2009-11-26 18:00:52

+2 A:

It sounds like you want to read the raw bytes.

Use File.ReadAllBytes to read them into an array (don't do this for large files), or use a FileStream to read chunks of bytes at a time.
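
As a rough sketch of the two approaches just mentioned, the code below counts occurrences of one byte value, once via `File.ReadAllBytes` and once by reading fixed-size chunks from a `FileStream`. The class name `RawBytes`, its method names, and the 4096-byte buffer size are arbitrary choices for illustration.

```csharp
using System;
using System.IO;

static class RawBytes
{
    // Loads the whole file into memory; fine for small files only.
    public static long CountAllAtOnce(string path, byte target)
    {
        long count = 0;
        foreach (byte b in File.ReadAllBytes(path))
        {
            if (b == target) count++;
        }
        return count;
    }

    // Reads the file in chunks, so memory use stays bounded by the buffer.
    public static long CountInChunks(string path, byte target)
    {
        long count = 0;
        var buffer = new byte[4096];
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            // Read returns 0 at end of file and may return fewer bytes than requested.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] == target) count++;
                }
            }
        }
        return count;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: RawBytes <path>");
            return;
        }
        Console.WriteLine(CountAllAtOnce(args[0], 187));
        Console.WriteLine(CountInChunks(args[0], 187));
    }
}
```

Either method shows whether byte 187 is actually present in the file and how often, independent of any text decoding.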

SLaks 2009-11-26 17:33:52

I don't want to use raw bytes because I am processing string data. It is too slow and cumbersome to use bytes for this. I would like to be able to read a text file and be confident that I am getting the entire file with no characters missing.

xpda 2009-11-26 17:50:09

The characters that were stripped out were at the beginning of the file. It turns out they were the UTF-8 byte order mark. File.ReadAllLines and File.ReadAllText strip out the byte order mark, while the LineInput and FileSystemObject functions do not.
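
For anyone who wants the string-based read to keep those leading bytes, here is one possible sketch, not necessarily the best approach: open a `StreamReader` with ISO-8859-1 (code page 28591, which maps each byte to the code point of the same value) and turn off byte order mark detection. The class name `BomDemo`, the helper `ReadLinesKeepingBom`, and the temp-file test data are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class BomDemo
{
    // Reads lines with a byte-to-char mapping and with BOM detection disabled,
    // so leading bytes 239, 187, 191 stay in the first line as three characters.
    public static string[] ReadLinesKeepingBom(string path)
    {
        var lines = new List<string>();
        using (var reader = new StreamReader(path, Encoding.GetEncoding(28591), false))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines.ToArray();
    }

    static void Main()
    {
        // Write a small UTF-8 file that starts with a byte order mark.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "hello\r\nworld\r\n", new UTF8Encoding(true));

        string[] stripped = File.ReadAllLines(path);   // BOM detected and dropped
        string[] kept = ReadLinesKeepingBom(path);     // BOM bytes preserved

        Console.WriteLine(stripped[0].Length);  // 5
        Console.WriteLine(kept[0].Length);      // 8
        Console.WriteLine((int)kept[0][1]);     // 187

        File.Delete(path);
    }
}
```

The trade-off is that non-ASCII text in a genuine UTF-8 file will come back as individual bytes rather than decoded characters, so this only suits the case where a one-to-one view of the bytes is what is wanted.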

If I had explained in the question that the odd characters were at the file beginning, I imagine I would have gotten a quick answer. I'll give Jon Skeet credit for the best answer to the question I posed.

xpda 2009-11-26 18:43:26
