I'm reading a CSV file with Fast CSV Reader (on codeproject). When I print the content of the fields, the console shows the character '?' in some words. How can I fix it?
A:
The short version is that you have to know the encoding of any text file you're going to read up front. You could use things like byte order marks and other heuristics if you really aren't going to know, but you should always allow for the value to be tweaked (in the same way that Excel does if you're importing CSV).
It's also worth double-checking the values in the debugger, as it may be the output that is wrong rather than the reading. Bear in mind that all strings are Unicode internally; a '?' suggests that the conversion from Unicode to the console's code page is what is failing.
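As a rough illustration of both points (a sketch, not the asker's actual code): the snippet below reads a file with an explicitly chosen encoding, prints the code points so the in-memory string can be checked independently of the console, and then shows how a code page that lacks a character turns it into '?'. The class name, the two helpers and the temp file are made up for the example, and ISO-8859-1 is only assumed as the file's encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: read a whole file using an encoding chosen by the caller.
    public static string ReadAll(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: round-trip text through a code page, the way
    // console output is converted before it is displayed.
    public static string ThroughCodePage(string text, Encoding codePage)
    {
        return codePage.GetString(codePage.GetBytes(text));
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // "café" as ISO-8859-1 bytes (assumed encoding for this example).
            File.WriteAllBytes(path, new byte[] { 0x63, 0x61, 0x66, 0xE9 });

            string text = ReadAll(path, Encoding.GetEncoding("iso-8859-1"));

            // Print code points so the check does not depend on the console font or code page.
            foreach (char c in text)
            {
                Console.Write("U+{0:X4} ", (int)c);
            }
            Console.WriteLine();

            // ASCII has no 'é', so the default fallback substitutes '?'.
            Console.WriteLine(ThroughCodePage(text, Encoding.ASCII)); // caf?
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If the code points are right but the console still shows '?', the read is fine and the output side is the problem; setting Console.OutputEncoding (for example to Encoding.UTF8) may help, depending on the console and its font.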
Rowland Shaw
2009-10-14 14:41:20
I tried forcing the stream reader to use ISO-8859-1 (the code page of the CSV file), and it works perfectly. But the idea is to read in any encoding and then reconvert to the right encoding. Anyway, thanks for the answer ;).
diegocaro
2009-10-14 15:15:49
Well, you *could* write some heuristics that try to guess the character encoding and then push it back through. The point is that there is no magic bullet for "knowing" the character encoding when reading from disk (things like HTTP provide somewhere for metadata, so it is less of an issue there). One basic heuristic: alternating null bytes = UTF-16; no bytes >= 0x80 = ASCII; no pairs of adjacent bytes both >= 0x80 = ANSI code page; otherwise maybe UTF-8.
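A minimal sketch of that kind of guesswork, reading the thresholds as "bytes at or above 0x80" (the non-ASCII range). The class and method names are invented for the example, the "alternating nulls" test is one arbitrary way to interpret the rule, and the result is only a guess that the user should be able to override.

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: returns a label, not an Encoding, because the
    // heuristic cannot say which ANSI code page is in use.
    public static string Guess(byte[] data)
    {
        int evenNulls = 0, oddNulls = 0;
        bool anyHigh = false, adjacentHigh = false;

        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0)
            {
                if (i % 2 == 0) evenNulls++; else oddNulls++;
            }
            if (data[i] >= 0x80)
            {
                anyHigh = true;
                if (i > 0 && data[i - 1] >= 0x80) adjacentHigh = true;
            }
        }

        // Assumption: "alternating" means at least half of the two-byte units
        // have a null in one position and none have a null in the other.
        int pairs = data.Length / 2;
        bool lowHalfNull = oddNulls > 0 && oddNulls * 2 >= pairs && evenNulls == 0;
        bool highHalfNull = evenNulls > 0 && evenNulls * 2 >= pairs && oddNulls == 0;

        if (lowHalfNull || highHalfNull) return "UTF-16";
        if (!anyHigh) return "ASCII";
        if (!adjacentHigh) return "ANSI code page";
        return "maybe UTF-8";
    }

    static void Main()
    {
        Console.WriteLine(Guess(Encoding.Unicode.GetBytes("hi")));            // UTF-16
        Console.WriteLine(Guess(Encoding.ASCII.GetBytes("hi")));              // ASCII
        Console.WriteLine(Guess(new byte[] { 0x63, 0x61, 0x66, 0xE9 }));      // ANSI code page
        Console.WriteLine(Guess(Encoding.UTF8.GetBytes("caf\u00E9")));        // maybe UTF-8
    }
}
```

This is easy to fool: ANSI text with two accented letters in a row lands in the "maybe UTF-8" bucket, which is why a manual override is still needed.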
Rowland Shaw
2009-10-15 08:05:34
StreamReader has a constructor overload that has a boolean flag to "detectEncodingFromByteOrderMarks: Indicates whether to look for byte order marks at the beginning of the file." [just a general observation]...
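For reference, a small sketch of that overload in use. The helper name and the temp files are made up for the example, and ISO-8859-1 is just an assumed fallback for files that carry no byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: read a file, letting a BOM (if present) override the fallback.
    public static string ReadDetectingBom(string path, Encoding fallback, out Encoding used)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only reliable after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string withBom = Path.GetTempFileName();
        string withoutBom = Path.GetTempFileName();
        try
        {
            // Encoding.UTF8 emits a BOM when used with File.WriteAllText.
            File.WriteAllText(withBom, "caf\u00E9", Encoding.UTF8);
            // Raw ISO-8859-1 bytes, no BOM.
            File.WriteAllBytes(withoutBom, new byte[] { 0x63, 0x61, 0x66, 0xE9 });

            Encoding used;
            string a = ReadDetectingBom(withBom, latin1, out used);
            Console.WriteLine(used.WebName);      // utf-8
            string b = ReadDetectingBom(withoutBom, latin1, out used);
            Console.WriteLine(used.WebName);      // iso-8859-1
            Console.WriteLine(a == b);            // True
        }
        finally
        {
            File.Delete(withBom);
            File.Delete(withoutBom);
        }
    }
}
```

Note that this only helps when the file actually starts with a BOM; a plain ISO-8859-1 CSV has none, so the fallback encoding still has to be right.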
rohancragg
2009-10-27 13:03:03