tags:

views:

2285

answers:

7

Which encoding should I use to read æ,Ø,å,ä,ö,ü etc?

+1  A: 

Encoding.UTF8 or Encoding.Unicode.

The StreamReader class has a bool parameter in its constructor that allows it to auto-detect the encoding.
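A minimal sketch of that constructor in use (the file name here is hypothetical). The bool tells StreamReader to look for a byte order mark and switch encodings if it finds one; the encoding you pass is only the fallback used when no BOM is present:

```csharp
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // Third argument: detect the encoding from a BOM if one is present.
        // Encoding.UTF8 is only the fallback when no BOM is found.
        using (var reader = new StreamReader("input.txt", Encoding.UTF8,
                   detectEncodingFromByteOrderMarks: true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding reflects what was actually detected,
            // but only after the first read.
            Console.WriteLine(reader.CurrentEncoding.WebName);
        }
    }
}
```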

leppie
Not necessarily. It depends on the input encoding.
0xA3
If you wanna save a Unicode file without a BOM, then that is your problem :)
leppie
The question is about reading, not writing to a stream ;)
0xA3
So what do you think the StreamReader does? You are the one that started talking about input...
leppie
I'm not sure what you are talking about ;) The OP wants to *read* data from a stream, and as Jon and others said you will need to *know* the encoding of the input string. Using UTF8 would be just a good guess but might be wrong.
0xA3
A: 

Unicode => UTF-8/UTF-16 ? :)

cwap
You missed it with 8 seconds :)
leppie
I need to work on my fast typing skills ^^
cwap
+7  A: 

You should use whatever the encoding of the original data is. Where are you getting the data from, and do you have information as to which encoding it's in? If you try to read it with the wrong encoding, you'll get the wrong answer: even if your encoding can handle the characters, it's going to misinterpret the binary data.

If you get to pick the encoding, then UTF-8 is usually a good bet. It's bad in terms of size if you've got a lot of far eastern characters, but otherwise good. In particular, ASCII still comes out at one byte per character.
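To see the misinterpretation concretely, here is a small sketch: the same bytes decoded with two different encodings give two different strings. (On .NET Core, Windows-1252 additionally requires registering `CodePagesEncodingProvider`.)

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // "æØå" encoded as UTF-8 takes two bytes per character.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("æØå");

        // Decoded with the right encoding, the text round-trips.
        string right = Encoding.UTF8.GetString(utf8Bytes);

        // Decoded as Windows-1252, every byte is treated as its own
        // character and the text turns into mojibake.
        string wrong = Encoding.GetEncoding(1252).GetString(utf8Bytes);

        Console.WriteLine(right); // æØå
        Console.WriteLine(wrong); // Ã¦Ã˜Ã¥
    }
}
```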

Jon Skeet
How can I read what encoding the file has? The program will use many files from many places. Thanks
@Scott: You can't, reliably. A file doesn't contain its encoding. You need to know it. For instance, *every* file is a valid Windows-1252 file, but if it's "really" UTF-8 then the results will be very different.
Jon Skeet
@Jon: Unicode files are supposed to contain a BOM (byte order mark), which one (and StreamReader) can use to detect the encoding.
leppie
@leppie: There's no "supposed" to - they *might* contain a BOM, but they certainly don't have to. And that can still get the encoding wrong - it could still be a Windows-1252 file which happens to start with the bytes for a UTF-16 or UTF-8 BOM. In other words, you can't do it reliably.
Jon Skeet
The BOM is required for all but UTF-8.
Ishmael
@Ishmael: Please point to a specification which requires that. Not just for XML, but a universal specification for *all* text files. I don't believe there *is* such a specification.
Jon Skeet
@Jon: You are correct. Of course the BOM is only relevant to Unicode files. It is unlikely that you will find any Windows Unicode text files without it, so why not look for it? I couldn't find that it was required -- it probably isn't.
Ishmael
@Ishmael: It's not required because there's no *standard* for text files. Yes, looking for it will give you some heuristics - but you can't *reliably* detect every encoding. There are files which are valid in multiple encodings.
Jon Skeet
+2  A: 

Encodings all boil down to the fact that if you use 8 bits per character, you can only handle 256 distinct characters. Seeing as the UK and US set up the conventions, the 128 standard ASCII characters (and the 8-bit extensions built on them) are mostly unaccented western characters.

That's where UTF8 and UTF16 come into play. UTF8 is a lot like ASCII - it uses one byte for most western characters. However, a byte outside the normal ASCII range acts as a lead byte: it indicates that one to three continuation bytes follow, and together those bytes encode the true character.

UTF16 (what .NET calls Encoding.Unicode) does away with the lead-byte scheme and just uses 16 bits for most characters. As we all know, 16 bits gives you 65536 distinct characters, which isn't quite enough to cover all the world's written characters, so UTF16 combines two 16-bit units (a surrogate pair) for the rest.

So to answer your question: if most of your characters are unaccented western characters, UTF8 will be the most compact representation for you (and most readable in many editors). If the bulk of your characters are non-western (say, Chinese), you'll probably want to use Unicode (aka UTF16).
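The size trade-off above is easy to measure directly with GetByteCount (the sample strings here are just illustrative):

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        string western = "Résumé";   // é takes two bytes in UTF-8, the rest one
        string chinese = "汉字文本"; // CJK characters take three bytes each in UTF-8

        Console.WriteLine(Encoding.UTF8.GetByteCount(western));    // 8
        Console.WriteLine(Encoding.Unicode.GetByteCount(western)); // 12
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 12
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 8
    }
}
```

For mostly-western text UTF-8 wins; for CJK-heavy text UTF-16 is the more compact of the two.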

Good luck!

Mike
+4  A: 

You need to use the proper encoding, as all the other answers mentioned.

The problem is how to discover the encoding. That depends on the source of your file:

  1. If it is an XML file, there should be an <?xml> processing instruction at the beginning of the file that specifies the encoding. If there isn't one, you should assume it's UTF8.
  2. If it is a text file, you can try UTF8 encoding, or if that fails, you should try the system locale of the machine you're running on. If that fails too, you are pretty much on your own, unless you know someone who can tell you the system locale of the machine the file was created on.

In any case, you should be able to cover about 90% of all files by using UTF8 with a fallback to UTF16. Almost every program or language from the last five years supports Unicode. However, if you are going to consume a lot of files from China, where GB18030 is common, you might try UTF16 first.
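Step 2 above can be sketched like this, assuming a hypothetical helper name. A strict UTF8Encoding (throwOnInvalidBytes: true) rejects byte sequences that aren't valid UTF-8, which is what lets the fallback kick in:

```csharp
using System;
using System.IO;
using System.Text;

class EncodingGuesser
{
    // Try strict UTF-8 first; fall back to the machine's ANSI code page
    // if the bytes are not valid UTF-8.
    public static string ReadTextGuessingEncoding(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; assume the system locale's ANSI code page.
            return Encoding.Default.GetString(bytes);
        }
    }
}
```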

Franci Penov
From what I hear from people working in business-to-business messaging systems, unicode encodings are not yet as ubiquitous as you state. At all. Hacks like detecting and fixing wrong decoding done by other systems are common in the industry.
Wim Coenen
+1  A: 

There is no completely reliable method, but you can use some heuristics to guess the encoding.

  1. Look for a byte order mark.
  2. If you don't find a BOM, assume the file is UTF-8 and try to parse it. If it's an XML file, the declaration may contain an encoding. Similarly, an HTML file may contain a meta encoding tag.
  3. Failing all the above, assume it's UTF-8 (or ANSI -- your choice).

Rick Strahl has a handy article on detecting encodings via the BOM. It's a bit dated -- System.Text.Encoding now has a GetPreamble method and StreamReader has an overload that will try to detect the encoding for you.
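Step 1 of the heuristic above can be sketched with GetPreamble, assuming a hypothetical helper name. Note the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE), so longer preambles must be checked first:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class BomSniffer
{
    // Compare the file's first bytes against the preambles (BOMs)
    // of the common Unicode encodings.
    public static Encoding DetectByBom(string path)
    {
        byte[] head = new byte[4];
        using (var fs = File.OpenRead(path))
            fs.Read(head, 0, head.Length);

        var candidates = new Encoding[]
        {
            // UTF-32 first: its 4-byte BOM would otherwise match UTF-16's.
            new UTF32Encoding(bigEndian: false, byteOrderMark: true),
            Encoding.UTF8,
            Encoding.Unicode,          // UTF-16 little-endian
            Encoding.BigEndianUnicode, // UTF-16 big-endian
        };

        foreach (var enc in candidates)
        {
            byte[] bom = enc.GetPreamble();
            if (head.Take(bom.Length).SequenceEqual(bom))
                return enc;
        }
        return null; // no BOM; fall back to the other heuristics
    }
}
```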

Ishmael
A: 

You can also use the culture to read accented characters like ç, á, etc.:

    CultureInfo pt = CultureInfo.GetCultureInfo("pt-BR");
    StreamReader fileReader = new StreamReader(@"C:\temp\test.txt",
        Encoding.GetEncoding(pt.TextInfo.ANSICodePage), true);

Note the @ verbatim string: without it, the backslashes in the path would have to be escaped.

Cheers, Vagner

Vagner