tags:

views:

2285

answers:

7

Which encoding should I use to read æ,Ø,å,ä,ö,ü etc?

+1  A: 

Encoding.UTF8 or Encoding.Unicode.

The StreamReader class has a bool parameter in its constructor that allows it to auto-detect the encoding.
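A minimal sketch of that constructor in use (the file name here is hypothetical). The bool tells StreamReader to look for a byte order mark and switch encodings if it finds one; the encoding you pass is only the fallback used when no BOM is present:

```csharp
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // Third argument: detect the encoding from a BOM if one is present.
        // Encoding.UTF8 is only the fallback when no BOM is found.
        using (var reader = new StreamReader("input.txt", Encoding.UTF8,
                   detectEncodingFromByteOrderMarks: true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding reflects what was actually detected,
            // but only after the first read.
            Console.WriteLine(reader.CurrentEncoding.WebName);
        }
    }
}
```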

leppie
Not necessarily. It depends on the input encoding.
0xA3
If you wanna save a Unicode file without a BOM, then that is your problem :)
leppie
The question is about reading, not writing to a stream ;)
0xA3
So what do you think the StreamReader does? You are the one that started talking about input...
leppie
I'm not sure what you are talking about ;) The OP wants to *read* data from a stream, and as Jon and others said you will need to *know* the encoding of the input string. Using UTF8 would be just a good guess but might be wrong.
0xA3
A: 

Unicode => UTF-8/UTF-16 ? :)

cwap
You missed it with 8 seconds :)
leppie
I need to work on my fast typing skills ^^
cwap
+7  A: 

You should use whatever the encoding of the original data is. Where are you getting the data from, and do you have information as to which encoding it's in? If you try to read it with the wrong encoding, you'll get the wrong answer: even if your encoding can handle the characters, it's going to misinterpret the binary data.

If you get to pick the encoding, then UTF-8 is usually a good bet. It's bad in terms of size if you've got a lot of far eastern characters, but otherwise good. In particular, ASCII still comes out at one byte per character.
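To see the misinterpretation concretely, here is a small sketch: the same bytes decoded with two different encodings give two different strings. (On .NET Core, Windows-1252 additionally requires registering `CodePagesEncodingProvider`.)

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // "æØå" encoded as UTF-8 takes two bytes per character.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("æØå");

        // Decoded with the right encoding, the text round-trips.
        string right = Encoding.UTF8.GetString(utf8Bytes);

        // Decoded as Windows-1252, every byte is treated as its own
        // character and the text turns into mojibake.
        string wrong = Encoding.GetEncoding(1252).GetString(utf8Bytes);

        Console.WriteLine(right); // æØå
        Console.WriteLine(wrong); // Ã¦Ã˜Ã¥
    }
}
```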

Jon Skeet
How can I read what encoding the file has? The program will use many files from many places. Thanks
@Scott: You can't, reliably. A file doesn't contain its encoding. You need to know it. For instance, *every* file is a valid Windows-1252 file, but if it's "really" UTF-8 then the results will be very different.
Jon Skeet
@Jon: Unicode files are supposed to contain a BOM (byte order mark), which one (and StreamReader) can use to detect the encoding.
leppie
@leppie: There's no "supposed" to - they *might* contain a BOM, but they certainly don't have to. And that can still get the encoding wrong - it could still be a Windows-1252 file which happens to start with the bytes for a UTF-16 or UTF-8 BOM. In other words, you can't do it reliably.
Jon Skeet
The BOM is required for all but UTF-8.
Ishmael
@Ishmael: Please point to a specification which requires that. Not just for XML, but a universal specification for *all* text files. I don't believe there *is* such a specification.
Jon Skeet
@Jon: You are correct. Of course the BOM is only relevant to Unicode files. It is unlikely that you will find any Windows Unicode text files without it, so why not look for it? I couldn't find that it was required -- it probably isn't.
Ishmael
@Ishmael: It's not required because there's no *standard* for text files. Yes, looking for it will give you some heuristics - but you can't *reliably* detect every encoding. There are files which are valid in multiple encodings.
Jon Skeet
+2  A: 

Encodings all boil down to the fact that if you use 8 bits per character, you can only handle 256 distinct characters. Seeing as the UK and US set up the conventions, the 128 standard ASCII characters (and the 8-bit extensions built on them) are mostly unaccented western characters.

That's where UTF8 and UTF16 come into play. UTF8 is a lot like ASCII - it uses one byte for most western characters. However, a byte outside the normal ASCII range acts as a lead byte: it indicates that one to three continuation bytes follow, and together those bytes encode the true character.

UTF16 (what .NET calls Encoding.Unicode) does away with the lead-byte scheme and just uses 16 bits for most characters. As we all know, 16 bits gives you 65536 distinct characters, which isn't quite enough to cover all the world's written characters, so UTF16 combines two 16-bit units (a surrogate pair) for the rest.

So to answer your question: if most of your characters are unaccented western characters, UTF8 will be the most compact representation for you (and most readable in many editors). If the bulk of your characters are non-western (say, Chinese), you'll probably want to use Unicode (aka UTF16).
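The size trade-off above is easy to measure directly with GetByteCount (the sample strings here are just illustrative):

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        string western = "Résumé";   // é takes two bytes in UTF-8, the rest one
        string chinese = "汉字文本"; // CJK characters take three bytes each in UTF-8

        Console.WriteLine(Encoding.UTF8.GetByteCount(western));    // 8
        Console.WriteLine(Encoding.Unicode.GetByteCount(western)); // 12
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 12
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 8
    }
}
```

For mostly-western text UTF-8 wins; for CJK-heavy text UTF-16 is the more compact of the two.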

Good luck!

Mike
+4  A: 

You need to use the proper encoding, as all the other answers mentioned.

The problem is how to discover the encoding. That depends on the source of your file:

  1. If it is an XML file, there should be an <?xml> processing instruction at the beginning of the file that specifies the encoding. If there isn't one, you should assume it's UTF8.
  2. If it is a text file, you can try UTF8 encoding, or if that fails, you should try the system locale of the machine you're running on. If that fails too, you are pretty much on your own, unless you know someone who can tell you the system locale of the machine the file was created on.

In any case, you should be able to cover about 90% of all files by using UTF8 with a fallback to UTF16. Almost every program or language from the last five years supports Unicode. However, if you are going to consume a lot of files from China, where GB18030 is common, you might try UTF16 first.
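Step 2 above can be sketched like this, assuming a hypothetical helper name. A strict UTF8Encoding (throwOnInvalidBytes: true) rejects byte sequences that aren't valid UTF-8, which is what lets the fallback kick in:

```csharp
using System;
using System.IO;
using System.Text;

class EncodingGuesser
{
    // Try strict UTF-8 first; fall back to the machine's ANSI code page
    // if the bytes are not valid UTF-8.
    public static string ReadTextGuessingEncoding(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; assume the system locale's ANSI code page.
            return Encoding.Default.GetString(bytes);
        }
    }
}
```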

Franci Penov
From what I hear from people working in business-to-business messaging systems, unicode encodings are not yet as ubiquitous as you state. At all. Hacks like detecting and fixing wrong decoding done by other systems are common in the industry.
Wim Coenen
+1  A: 

There is no completely reliable method, but you can use some heuristics to guess the encoding.

  1. Look for a byte order mark.
  2. If you don't find a BOM, assume the file is UTF-8 and try to parse it. If it's an XML file, the declaration may contain an encoding. Similarly, an HTML file may contain a meta encoding tag.
  3. Failing all the above, assume it's UTF-8 (or ANSI -- your choice).

Rick Strahl has a handy article on detecting encodings via the BOM. It's a bit dated -- System.Text.Encoding now has a GetPreamble method and StreamReader has an overload that will try to detect the encoding for you.
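Step 1 of the heuristic above can be sketched with GetPreamble, assuming a hypothetical helper name. Note the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE), so longer preambles must be checked first:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class BomSniffer
{
    // Compare the file's first bytes against the preambles (BOMs)
    // of the common Unicode encodings.
    public static Encoding DetectByBom(string path)
    {
        byte[] head = new byte[4];
        using (var fs = File.OpenRead(path))
            fs.Read(head, 0, head.Length);

        var candidates = new Encoding[]
        {
            // UTF-32 first: its 4-byte BOM would otherwise match UTF-16's.
            new UTF32Encoding(bigEndian: false, byteOrderMark: true),
            Encoding.UTF8,
            Encoding.Unicode,          // UTF-16 little-endian
            Encoding.BigEndianUnicode, // UTF-16 big-endian
        };

        foreach (var enc in candidates)
        {
            byte[] bom = enc.GetPreamble();
            if (head.Take(bom.Length).SequenceEqual(bom))
                return enc;
        }
        return null; // no BOM; fall back to the other heuristics
    }
}
```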

Ishmael
A: 

You can also use the culture to read accented characters like ç, á, etc.:

    CultureInfo pt = CultureInfo.GetCultureInfo("pt-BR");
    StreamReader fileReader = new StreamReader(@"C:\temp\test.txt",
        Encoding.GetEncoding(pt.TextInfo.ANSICodePage), true);

Note the @ verbatim string: without it, the backslashes in the path would have to be escaped.

Cheers, Vagner

Vagner