Which encoding should I use to read æ,Ø,å,ä,ö,ü etc?
Encoding.UTF8 or Encoding.Unicode.
The StreamReader class has a bool parameter in its constructor that allows it to auto-detect the encoding.
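For example, a minimal sketch of that constructor overload (the file name here is hypothetical; the sample writes the file itself so detection has a byte order mark to find):

```csharp
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // Hypothetical path -- replace with your own file.
        string path = "example.txt";

        // Write a sample file with a UTF-8 BOM so auto-detection has something to find.
        File.WriteAllText(path, "æØåäöü", new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));

        // The bool argument asks StreamReader to detect the encoding from the
        // byte order mark; it falls back to UTF-8 if no BOM is present.
        using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
        {
            string text = reader.ReadToEnd();
            Console.WriteLine(text);                   // æØåäöü
            Console.WriteLine(reader.CurrentEncoding); // the detected encoding
        }
    }
}
```

Note that `CurrentEncoding` is only reliable after the first read, since detection happens lazily.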
You should use whatever the encoding of the original data is. Where are you getting the data from, and do you have information as to which encoding it's in? If you try to read it with the wrong encoding, you'll get the wrong answer: even if your encoding can handle the characters, it's going to misinterpret the binary data.
If you get to pick the encoding, then UTF-8 is usually a good bet. It's bad in terms of size if you've got a lot of far eastern characters, but otherwise good. In particular, ASCII still comes out at one byte per character.
Encodings all boil down to the fact that if you use 8 bits for a character, you can only handle 256 distinct characters. Seeing as the UK and US set up the conventions, standard ASCII (and the 256-character extended sets built on it) consists mostly of unaccented western characters.
That's where UTF8 and UTF16 come into play. UTF8 is a lot like ASCII - it uses one byte for most western characters. However, byte values outside the normal ASCII range mark the start of a multi-byte sequence - the lead byte tells you how many continuation bytes follow (one to three), and together they encode the true character.
UTF16 (which .NET calls Encoding.Unicode) does away with the lead-byte scheme and uses 16 bits for most characters. As we all know, 16 bits gives you 65536 distinct values, which isn't quite enough to cover all the world's written characters, so anything beyond that range is encoded as a pair of 16-bit units (a surrogate pair) - but for most text, one 16-bit unit per character does the job.
So to answer your question: if most of your characters are unaccented western characters, UTF8 will be the most compact representation for you (and most readable in many editors). If the bulk of your characters are non-western (say, Chinese), you'll probably want to use Unicode (aka UTF16).
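A quick sketch illustrating the size trade-off, using GetByteCount (the sample strings are just illustrative):

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string western = "Hello, Øl!"; // mostly ASCII plus one accented character
        string chinese = "你好世界";    // four CJK characters

        // UTF-8: 1 byte per ASCII char, 2 for most accented Latin, 3 for CJK.
        Console.WriteLine(Encoding.UTF8.GetByteCount(western));    // 11
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 12

        // UTF-16 (Encoding.Unicode in .NET): 2 bytes for each of these characters.
        Console.WriteLine(Encoding.Unicode.GetByteCount(western)); // 20
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 8
    }
}
```

So western text is roughly half the size in UTF-8, while CJK-heavy text is smaller in UTF-16.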
Good luck!
You need to use the proper encoding, as all the other answers mentioned.
The problem is how to discover the encoding. That depends on the source of your file:
- If it is an XML file, there should be an `<?xml>` processing instruction at the beginning of the file that specifies the encoding. If there isn't one, you should assume it's UTF8.
- If it is a plain text file, you can try UTF8 encoding; if that fails, try the system locale of the machine you're running on. If that fails too, you are pretty much on your own, unless you know someone who can tell you the system locale of the machine the file was created on.
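The plain-text branch above can be sketched as "try strict UTF-8, fall back to the system code page". Note that `Encoding.Default` is the ANSI code page on .NET Framework but is UTF-8 on .NET Core and later, so this fallback mainly makes sense on Framework:

```csharp
using System;
using System.IO;
using System.Text;

class GuessEncoding
{
    // A rough sketch: try strict UTF-8 first; if the bytes are not valid
    // UTF-8, fall back to the machine's default (locale) code page.
    static string ReadWithFallback(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        try
        {
            // throwOnInvalidBytes: true makes decoding fail instead of
            // silently substituting U+FFFD for bad sequences.
            var strictUtf8 = new UTF8Encoding(false, throwOnInvalidBytes: true);
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.Default.GetString(bytes);
        }
    }

    static void Main()
    {
        File.WriteAllBytes("sample.txt", Encoding.UTF8.GetBytes("äöü"));
        Console.WriteLine(ReadWithFallback("sample.txt")); // äöü
    }
}
```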
In any case, you should be able to cover about 90% of all files by using UTF8 with a fallback to UTF16. Almost every program or language from the last five years supports Unicode. However, if you are going to consume a lot of files from China, you might try UTF16 first; the GB18030 encoding is also quite prevalent there.
There is no completely reliable method, but you can use some heuristics to guess the encoding.
- Look for a byte order mark.
- If you don't find a BOM, assume the file is UTF-8 and try to parse it. If it's an XML file, the declaration may contain an encoding. Similarly, an HTML file may contain a meta encoding tag.
- Failing all the above, assume it's UTF-8 (or ANSI -- your choice).
Rick Strahl has a handy article on detecting encodings via the BOM. It's a bit dated -- System.Text.Encoding now has a GetPreamble method and StreamReader has an overload that will try to detect the encoding for you.
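The GetPreamble approach can be sketched like this: compare the first bytes of the file against the BOMs of a few candidate encodings (the file name is hypothetical; the sample creates it):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class BomSniffer
{
    // Compare the start of the file against the preambles (BOMs) of a few
    // candidate encodings, as returned by Encoding.GetPreamble.
    static Encoding DetectByBom(string path)
    {
        byte[] head = new byte[4];
        using (var fs = File.OpenRead(path))
            fs.Read(head, 0, head.Length);

        Encoding[] candidates =
        {
            Encoding.UTF8,             // EF BB BF
            Encoding.Unicode,          // FF FE (UTF-16 little-endian)
            Encoding.BigEndianUnicode, // FE FF (UTF-16 big-endian)
        };

        foreach (var enc in candidates)
        {
            byte[] bom = enc.GetPreamble();
            if (bom.Length > 0 && head.Take(bom.Length).SequenceEqual(bom))
                return enc;
        }
        return null; // no BOM found; the caller has to guess
    }

    static void Main()
    {
        File.WriteAllText("bom.txt", "hi", new UTF8Encoding(true));
        Console.WriteLine(DetectByBom("bom.txt")?.WebName); // utf-8
    }
}
```

Remember that a missing BOM proves nothing: UTF-8 files frequently omit it.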
Also, you can use the culture's ANSI code page to read characters like ç, á, é, etc.:

    CultureInfo pt = CultureInfo.GetCultureInfo("pt-BR");
    StreamReader fileReader = new StreamReader(@"C:\temp\test.txt", Encoding.GetEncoding(pt.TextInfo.ANSICodePage), true);
Cheers, Vagner