ansaurus

Question

How to guess the encoding of a file with no BOM in .NET?

Answer 1

A:

UTF-8 is designed so that text written in an arbitrary 8-bit encoding such as latin1 is very unlikely to decode as valid UTF-8.

So the minimal approach is this (pseudocode, I don't speak .NET):

    try:
        u = some_text.decode("UTF-8")
    except UnicodeDecodeError:
        u = some_text.decode("most-likely-encoding")

For the most-likely-encoding one usually picks latin1, cp1252, or whatever is common for the data at hand. More sophisticated approaches might try to find language-specific character pairings, but I'm not aware of a library that does that.
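As a runnable version of the pseudocode above, here is a minimal Python sketch. The function name decode_with_fallback and the choice of cp1252 as the default fallback are assumptions made for illustration, not part of the original answer. The same try-strict-UTF-8-then-fall-back idea should carry over to .NET, but that port is not shown here.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for a byte string that has no BOM.

    Hypothetical helper: tries strict UTF-8 first, then a fallback code page.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" keeps the fallback from raising on the few byte
        # values that the fallback code page may leave undefined.
        return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    print(decode_with_fallback("caf\u00e9".encode("utf-8")))
    print(decode_with_fallback("caf\u00e9".encode("cp1252")))
```

Pure ASCII input decodes as UTF-8, which is harmless because ASCII is a subset of both UTF-8 and the usual fallback code pages.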

deets 2009-03-29 16:47:37

Answer 2

+2 A:

Library: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

And perhaps a useful thread on Stack Overflow

michl86 2009-03-29 16:51:58

The CodeProject library looks pretty good. It wraps the Microsoft "MLang" API, which is maybe gross, but it appears to be the best solution.

Brian519 2009-03-29 17:26:36

Answer 3

A:

I used this to do something similar a while back:

http://www.conceptdevelopment.net/Localization/NCharDet/

dommer 2009-03-29 16:54:51

Answer 4

A:

Use Win32's IsTextUnicode.

In the general case, it is a difficult problem. See: http://blogs.msdn.com/oldnewthing/archive/2007/04/17/2158334.aspx.

codekaizen 2009-03-29 16:57:02

Answer 5

+1 A:

You should read this article by Raymond Chen. He goes into detail on how programs can guess what an encoding is (and some of the fun that comes from guessing):

http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx

JaredPar 2009-03-29 17:08:52

Answer 6

A:

A hacky technique might be to take an MD5 of the text, then decode the text and re-encode it in various encodings, MD5'ing each result. If one matches, you guess it's that encoding.

That's obviously too slow for something that handles a lot of files but for something like a text editor I could see it working.
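One possible reading of that round-trip idea, sketched in Python. The function name round_trip_candidates and the candidate list are assumptions made for illustration. Note the limitation: a single-byte encoding that defines every byte value (latin-1, for example) round-trips any input, so a match only means the encoding was not ruled out.

```python
import hashlib


def round_trip_candidates(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return the candidate encodings under which data survives decode + re-encode.

    Hypothetical helper illustrating the MD5 round-trip idea.
    """
    original = hashlib.md5(data).hexdigest()
    survivors = []
    for name in encodings:
        try:
            again = data.decode(name).encode(name)
        except (UnicodeDecodeError, UnicodeEncodeError):
            # The bytes are not valid in this encoding, so rule it out.
            continue
        # Comparing the byte strings directly would be equivalent; MD5 is
        # used here only to mirror the description above.
        if hashlib.md5(again).hexdigest() == original:
            survivors.append(name)
    return survivors


if __name__ == "__main__":
    print(round_trip_candidates(b"caf\xe9"))
    print(round_trip_candidates("caf\u00e9".encode("utf-8")))
```

In this sketch, valid UTF-8 input also survives under cp1252 and latin-1, so the result works better as a filter than as a final guess.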

Other than that, it means getting your hands dirty porting the Java libraries from this post (which came from the Delphi SO question), or using the IE MLang feature.

Chris S 2009-03-29 17:10:40
