I need to generate a CSV file. Maybe I am 'doing it wrong' because I am writing the file with my own code instead of using a library, but anyway.

It looks like I have everything right. Quotes, commas and everything else seem to be escaped correctly. It was rather easy. The problem is that I am testing with Unicode strings and they come out as ????. When I use MS Excel to save a file containing my test string as CSV, reopening the file gives me the same problem (Unicode letters become ?????). Is Unicode not supported?

I just tried dumping the string to a file like this instead of outputting it to a web page:

var f = new System.IO.StreamWriter(filename, false, System.Text.Encoding.Unicode);

and now I see the Unicode text, but everything is in one column. What's weird is that everything looks normal in my text editor of choice, and if I copy a few columns out, paste them into a new file and save it as .csv, I see the columns fine, although that probably strips the Unicode out.

How do I save this properly?

+3  A: 

System.Text.Encoding.Unicode uses UTF-16 (little-endian) encoding. Try telling your text editor to decode the file as UTF-16; I'm guessing the editor you are using to display the output file is defaulting to UTF-8 or ASCII. If so, an alternative might be to encode the output with System.Text.Encoding.UTF8 instead.
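
A minimal sketch of that UTF-8 alternative, built on the StreamWriter call from the question. The class name CsvDemo, the Escape helper and the file name test.csv are hypothetical, made up for illustration. It assumes the program reading the file uses the UTF-8 byte order mark to recognize the encoding, which is commonly reported for Excel but worth checking against your own version.

```csharp
using System;
using System.IO;
using System.Text;

static class CsvDemo
{
    // Quote a field only when it contains a comma, a quote or a line break;
    // embedded quotes are doubled, following the usual CSV convention.
    static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    public static void WriteCsv(string path, string[][] rows)
    {
        // UTF8Encoding(true) makes StreamWriter emit the EF BB BF preamble
        // at the start of a new file.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(true)))
        {
            foreach (var row in rows)
                writer.WriteLine(string.Join(",", Array.ConvertAll(row, Escape)));
        }
    }

    static void Main()
    {
        WriteCsv("test.csv", new[]
        {
            new[] { "name", "city" },
            new[] { "Zo\u00EB", "Krak\u00F3w, PL" }  // non-ASCII test data
        });
    }
}
```

If the columns still collapse into one with UTF-16 output, one possible explanation (an assumption, not something confirmed in this thread) is that the reader treats UTF-16 text as tab-delimited rather than comma-delimited.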

Ani
I'm surprised. I used UTF8 as you suggested and the columns came out as columns instead of one large column. Accepted! This is still weird, but I'm happy.
acidzombie24
This just means that you're using a text editor that doesn't support UTF-16. In UTF-8, the 7-bit ASCII (basic Latin) chars are one byte each, just like in ASCII. Things don't become multibyte in UTF-8 until you get out of the 7-bit ASCII range. In UTF-16, everything is at least two bytes long, so ASCII chars will look like there is a null byte after each char byte. A text reader that doesn't support UTF-16 could easily fall into a pattern of displaying one character per line as you describe.
dthorpe
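
To make the byte layouts described in the comment above concrete, here is a small sketch that dumps the same strings under both encodings. The class name EncodingBytes and the Hex helper are hypothetical; GetBytes does not include a BOM, so only the character bytes are shown.

```csharp
using System;
using System.Text;

static class EncodingBytes
{
    // Hex dump of a string's bytes under the given encoding (no BOM included).
    public static string Hex(Encoding enc, string s) =>
        BitConverter.ToString(enc.GetBytes(s));

    static void Main()
    {
        Console.WriteLine(Hex(Encoding.UTF8, "a,b"));        // 61-2C-62
        Console.WriteLine(Hex(Encoding.Unicode, "a,b"));     // 61-00-2C-00-62-00
        Console.WriteLine(Hex(Encoding.UTF8, "\u00E9"));     // C3-A9 (e with acute accent)
        Console.WriteLine(Hex(Encoding.Unicode, "\u00E9"));  // E9-00
    }
}
```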
The question was how to get the CSV to work in Excel, and that's exactly what this answer told me. It doesn't matter if Excel does something wrong. If it doesn't work, then it isn't the solution to my question.
acidzombie24
The hazard is that if switching to UTF-8 really only works because the CSV reader doesn't support Unicode at all, and your current data is ASCII-7 and therefore passes through UTF-8 unchanged, then the problem remains: the CSV reader will fail to handle "non-ASCII" chars when they appear in the data at some point in the future. Things that magically go away have a bad habit of magically returning.
dthorpe
+1 for "Things that magically go away have a bad habit of magically returning." - simply brilliant!
Jeroen Pluimers
A: 

It could also just be that the font Word is using is missing the characters you are trying to display. If I open Word, hold ALT and mash my numpad, it changes the font to a math font, but still displays the missing-character glyph from the font in question.

Kogitsune
+1  A: 

You need to do two things: mark the text file (or HTML page) as containing Unicode chars (either UTF-8 or UTF-16), and make sure that you are using a text editor that supports Unicode text. Notepad is a good choice on Windows.

To mark a text file (such as .csv) as containing Unicode text, you need to write a Byte Order Mark (BOM) as the first character in the text file. For UTF-16 little-endian (Intel), the BOM is the bytes 0xFF, 0xFE; for UTF-8 it is 0xEF, 0xBB, 0xBF. The Byte Order Mark tells the document reader whether the characters in the document are ordered as big-endian or little-endian. The BOM character is a reserved non-printing character in the Unicode character tables. The BOM can also be used to distinguish ASCII text from UTF-8 and other Unicode encodings (because the UTF-8 BOM byte sequence is different from the UTF-16 one, etc).

Some document writers will write the BOM for you, or have an option to include or exclude the BOM. Use a binary hex dump to view the text file bytes to determine whether you have a BOM or not. Do not use a text editor - the BOM is a non-display char.
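
As an alternative to a hex dump, the check can be sketched in code by comparing the first bytes of the file against the preambles that .NET reports for each encoding. The class name BomCheck and its helpers are hypothetical, and the demo file is a temporary file created only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BomCheck
{
    // Returns a label for the BOM found at the start of the data, or "none".
    // Caveat: the UTF-32 LE BOM (FF FE 00 00) also starts with FF FE,
    // so this simple check would report it as UTF-16 LE.
    public static string Detect(byte[] data)
    {
        if (StartsWith(data, Encoding.UTF8.GetPreamble())) return "UTF-8";
        if (StartsWith(data, Encoding.Unicode.GetPreamble())) return "UTF-16 LE";
        if (StartsWith(data, Encoding.BigEndianUnicode.GetPreamble())) return "UTF-16 BE";
        return "none";
    }

    static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        // Writing with an explicit encoding emits that encoding's preamble.
        File.WriteAllText(path, "a,b", Encoding.Unicode);
        Console.WriteLine(Detect(File.ReadAllBytes(path)));  // UTF-16 LE
        File.Delete(path);
    }
}
```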

To indicate that an HTML page you are generating contains Unicode characters, you need to set the Content-Type header to indicate a Unicode charset: Content-Type: text/html; charset=utf-8 indicates UTF-8 encoded Unicode text, for example.

dthorpe
What's confusing is that when I open the file in a hex editor I see the BOM (FF FE). I suspect it is written automatically when the file is opened as Unicode (since it's a StreamWriter and I specified the encoding). So the answer doesn't help me unless I apply the HTML solution, which I may not do. It depends on the specs, but Ani's answer solved it, which is really weird.
acidzombie24
Ok, so you're outputting a good BOM. Then the issue is at the other end - your text editor doesn't support UTF-16.
dthorpe
@dthorpe: The text editor was just to test the rows. I first opened it in the actual program, Microsoft Excel.
acidzombie24