I'm working on an application in C# and need to read and write a particular datafile format. The only issue at the moment is that the format uses strictly single-byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles the file size, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into tree view and datagrid controls, and it involves conversions and whatnot.

I've spent a little time Googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.

Is there a simple way to force a C# .Net program to use ASCII-only and not touch Unicode?

Edit: Alright. Thanks guys, got this almost working. Using ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra char being prepended to strings occurred, but I fixed that up). I'm having one last issue, which is very small but could be big: in the file, a particular char (prints as the euro sign) gets converted to a ? when I load/save the files. That's not much of an issue in text, but if it occurred in a record length, it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?

Edit2: The precise problem/results are as follows:

Original file: 0x80 (euro)
ASCII encoding: 0x3F (?)
UTF8 encoding: 0xC2 0x80 (A-hat, euro)

Neither of those results will work, since the byte can occur anywhere in the file (if an 0x80 changed to 0x3F inside a record-length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF8 encoding, figuring that would fix the issue pretty well, but it's now adding that second byte, which is even worse.
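For reference, here's a minimal sketch of the round trip I'm describing, assuming the 0x80 byte comes into the program as the char U+0080 (illustrative only, not my actual code):

using System.Text;

// How each encoding writes the problem character back out (sketch):
byte[] viaAscii = Encoding.ASCII.GetBytes("\u0080");   // { 0x3F } -- becomes '?', data lost
byte[] viaUtf8  = Encoding.UTF8.GetBytes("\u0080");    // { 0xC2, 0x80 } -- an extra byte appears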

+5  A: 

Internally, strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use the System.Text.Encoding.ASCII class to do your conversions from string->byte[] and byte[]->string.
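A minimal sketch of those two conversions (illustrative; the sample string is made up):

using System.Text;

byte[] raw  = Encoding.ASCII.GetBytes("RECORD001");   // string -> byte[]
string text = Encoding.ASCII.GetString(raw);          // byte[] -> string
// Note: any char above 127 is written out as '?' (0x3F) by this encoding.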

Adam Robinson
Alright. I'll try that. One question (I haven't messed with Encodings before, haven't needed to yet): will there be any issues converting single-byte characters (read as bytes) into a double-byte char string, displaying and letting the user edit the value, then converting back and writing it as single-byte chars again? I know any special/Unicode characters would lose their upper byte, but would/could any damage be done to characters that came from the ASCII file? I can't think of how it could go wrong, but with M$, who knows. ;P
peachykeen
@peachykeen: If you just read and write using a StreamReader and StreamWriter, the .NET program will never know that the file is in ASCII. .NET makes dealing with this very, very easy and robust.
Reed Copsey
>> "...then converting back and writing it as single-byte chars again?" Depends on what happens while the string is in the program. If the actions taken insert characters not representable in the final code page, you may see garbage. The so-called high-ASCII characters (>127 decimal) change depending on the active code page, but won't necessarily be invalid in the stream handler.
DaveE
Would using a stream set to ASCII encoding cut off the extra (first) byte of a 2-byte character, or make it into two? One brief experiment with converting to a byte array ended up giving doubled bytes, every other one usually unprintable. As for the issue at hand, the program doesn't need, and the format doesn't support, the UTF-16 that VS is trying to use, but an on-display/after-display conversion might work better in this case, because certain fields contain an 8-byte flag section before the data. That should work with an ASCII reader/writer, but having a converter in the code might help there...
peachykeen
Direct conversion to a byte array is wrong because UTF-16 uses 2 bytes per character. As you've seen, that'll give you an every-other-byte-is-nothing structure. Using a StreamWriter as Reed describes should write 'normal' one-byte-per-character output. The 8-byte flag section in the ASCII file will just be characters to the stream handlers. Unless you *absolutely must* handle things on a per-byte basis, deal with the characters. Your "bytes o' flags" will be a set of "half-characters o' flags" in a plain read/write. Stepping into that might be tricky but should be doable.
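A quick sketch of that doubling, for reference (illustrative only, using System.Text.Encoding):

byte[] utf16 = Encoding.Unicode.GetBytes("AB");   // { 0x41, 0x00, 0x42, 0x00 } -- every other byte is zero
byte[] ascii = Encoding.ASCII.GetBytes("AB");     // { 0x41, 0x42 } -- one byte per character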
DaveE
+21  A: 

C# (.NET) will always use Unicode for strings. This is by design.

When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:

StreamReader reader = new StreamReader(fileStream, new ASCIIEncoding());

Then just read using StreamReader.

Writing is the same, just use StreamWriter.
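Putting the two halves together, a minimal sketch (the file names are placeholders, not from the question):

using System.IO;
using System.Text;

string contents;
// Read the file as single-byte ASCII text.
using (StreamReader reader = new StreamReader("data.dat", new ASCIIEncoding()))
{
    contents = reader.ReadToEnd();
}

// ... work with contents as a normal .NET string ...

// Write it back out with the same single-byte encoding.
using (StreamWriter writer = new StreamWriter("data.dat", false, new ASCIIEncoding()))
{
    writer.Write(contents);
}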

Reed Copsey
Or StreamReader reader = new StreamReader(fileStream, Encoding.GetEncoding(1252)); if you want characters above 127 (which plain ASCII doesn't cover), like the euro sign.
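A sketch of that variant (note: on .NET Core/.NET 5+, code page 1252 additionally requires registering the System.Text.Encoding.CodePages provider):

Encoding cp1252 = Encoding.GetEncoding(1252);
string s = cp1252.GetString(new byte[] { 0x80 });   // "€" -- Windows-1252 maps 0x80 to the euro sign
byte[] back = cp1252.GetBytes(s);                   // { 0x80 } -- and back again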
Lars Truijens
A: 

If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type:

msdn

let asciiString = "This is a string"B

JJoos
You're suggesting I use another language to write a library just to use ASCII? C# supports ASCII and byte arrays, plus I don't use any ASCII strings. That seems like a lot of trouble...
peachykeen
You're absolutely right.
JJoos
+3  A: 

If you have a file format that mixes single-byte text with binary values such as lengths and control characters, a good encoding to use is code page 28591, a.k.a. Latin1, a.k.a. ISO-8859-1.

You can get this encoding by using whichever of the following is the most readable:

Encoding.GetEncoding(28591) 
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")

This encoding has the useful characteristic that byte values up to 255 are converted unchanged to the Unicode character with the same value (e.g. the byte 0x80 becomes the character U+0080).

In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.
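A quick sketch of the round trip with this encoding (illustrative only):

using System.Text;

Encoding latin1 = Encoding.GetEncoding(28591);
string s = latin1.GetString(new byte[] { 0x80 });   // "\u0080"
byte[] back = latin1.GetBytes(s);                   // { 0x80 } -- the original byte is preserved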

Joe