views:

10010

answers:

13

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When read, these files sometimes contain garbage, because they were created in a different/unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

(I use .Net / C#).

The detectEncodingFromByteOrderMarks option on the StreamReader constructor works for UTF-8 and other Unicode files that carry a byte order mark, but I'm looking for a way to detect non-Unicode code pages, like ibm850 or windows1252.
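For reference, BOM-based detection with StreamReader looks like this (a minimal sketch; CurrentEncoding is only meaningful after the first read, and this only helps for files that actually start with a byte order mark):

```csharp
using System.IO;
using System.Text;

static class BomDetectionExample
{
    // Reads a file, letting StreamReader sniff the byte order mark.
    // The encoding passed in is only a fallback for files without a BOM.
    public static Encoding ReadWithBomDetection(string path, out string text)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8,
                   detectEncodingFromByteOrderMarks: true))
        {
            text = reader.ReadToEnd();
            return reader.CurrentEncoding; // valid only after reading
        }
    }
}
```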


Thanks for your answers, this is what I've done.

The files we receive come from end users, who do not have a clue about codepages. The recipients are also end users; by now, this is all they know about codepages: codepages exist, and they are annoying.

Solution:

  • Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something similar, your human intelligence can guess the codepage.
  • I've created a small app that the user can use to open the file, entering a piece of text the user knows will appear in the file once the correct codepage is used.
  • Loop through all codepages, and display the ones that produce the user-provided text.
  • If more than one codepage matches, ask the user to specify more text.
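The loop in the steps above can be sketched like this (a sketch, not the app's actual code; note that on modern .NET you must call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first for legacy code pages such as ibm850 to appear):

```csharp
using System.Collections.Generic;
using System.Text;

static class CodepageGuesser
{
    // Returns the code pages whose decoding of the raw bytes contains the
    // text the user expects to see (e.g. "François").
    public static List<EncodingInfo> FindCandidates(byte[] raw, string knownText)
    {
        var candidates = new List<EncodingInfo>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            string decoded = info.GetEncoding().GetString(raw);
            if (decoded.Contains(knownText))
                candidates.Add(info);
        }
        return candidates;
    }
}
```

If the list comes back with more than one entry, ask the user for a longer or more distinctive piece of text and intersect the results.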
A: 

The StreamReader class's constructor takes a 'detect encoding' parameter.

leppie
Not really what I was looking for; I've edited my post to make it clearer. Thank you for the answer.
GvS
+9  A: 

If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection here.
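As a toy illustration of the statistical idea (far cruder than Mozilla's detector): valid multi-byte UTF-8 sequences are statistically very unlikely to occur by accident, so a strict UTF-8 decode that falls back to a single-byte code page already catches many cases. The fallback code page here is ISO-8859-1 purely for illustration; swap in windows-1252 or ibm850 as appropriate:

```csharp
using System;
using System.Text;

static class SimpleHeuristic
{
    // Try a strict UTF-8 decode first; if the bytes are not valid UTF-8,
    // fall back to a single-byte code page. On modern .NET, legacy code
    // pages require Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    public static string DecodeBestEffort(byte[] raw)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding(28591).GetString(raw); // ISO-8859-1
        }
    }
}
```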

Tomer Gabel
Funnily enough my Firefox 3.05 installation detects that page as UTF-8, showing a number of question-mark-in-a-diamond glyphs, although the source has a meta tag for Windows-1252. Manually changing the character encoding shows the document correctly.
devstuff
+1  A: 

I had the same problem, but haven't found a good solution for detecting it automatically yet. Now I'm using PsPad (www.pspad.com) for that ;) It works fine.

DeeCee
+18  A: 

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

JV
Found it: http://en.wikipedia.org/wiki/Bush_hid_the_facts
JV
+8  A: 

Here's Raymond Chen's explanation of how Notepad does it.

therefromhere
+2  A: 

I've done something similar in Python. Basically, you need lots of sample data from various encodings, which are broken down by a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs providing values of lists of encodings.

Given that dictionary (hash), you take your input text and:

  • if it starts with a BOM ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8, etc.), treat the text as that BOM suggests
  • if not, take a large enough sample of the text, take all byte pairs of the sample, and choose the encoding that is least commonly suggested by the dictionary.

If you've also sampled UTF-encoded texts that do not start with a BOM, the second step will cover those that slipped through the first step.

So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.
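The answer above describes a Python implementation, but the idea translates directly to C#. Here is a bare-bones sketch of the second (no-BOM) step, with the scoring simplified to a plain vote count over byte pairs; the training data and encoding names are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BytePairModel
{
    // byte pair -> set of encodings in which that pair was seen during training
    static readonly Dictionary<(byte, byte), HashSet<string>> Table = new();

    // Feed in sample bytes known to be in a given encoding.
    public static void Train(byte[] sample, string encodingName)
    {
        for (int i = 0; i + 1 < sample.Length; i++)
        {
            var key = (sample[i], sample[i + 1]);
            if (!Table.TryGetValue(key, out var set))
                Table[key] = set = new HashSet<string>();
            set.Add(encodingName);
        }
    }

    // Each byte pair of the input votes for every encoding it was seen with;
    // the encoding with the most votes wins (null if nothing matched).
    public static string Guess(byte[] input)
    {
        var votes = new Dictionary<string, int>();
        for (int i = 0; i + 1 < input.Length; i++)
            if (Table.TryGetValue((input[i], input[i + 1]), out var set))
                foreach (var enc in set)
                    votes[enc] = votes.GetValueOrDefault(enc) + 1;
        return votes.Count == 0
            ? null
            : votes.OrderByDescending(kv => kv.Value).First().Key;
    }
}
```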

ΤΖΩΤΖΙΟΥ
+6  A: 

You can't detect the codepage

This is clearly false. Every web browser has some kind of universal charset detector to deal with pages that have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it; see some documentation here. It is basically a heuristic, but one that works really well.

Given a reasonable amount of text, it is even possible to detect the language.

Here's another one I just found using google:

shoosh
"heuristics" - so the browser isn't quite detecting it, it's making an educated guess. "works really well" - so it doesn't work all the time then? Sounds to me like we're in agreement.
JV
The standard for HTML dictates that, if the character set is not defined by the document, then it should be considered to be encoded as UTF-8.
Jon Trauntvein
A: 

Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.

Most people (or applications) do things in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary, it will always use Windows-1252 or whatever his machine defaults to.

Where possible a bit of customer training never hurts either :-)

devstuff
A: 

Check if this helps! http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

Sarath
-1: Dupe of answer by Waz http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file/264988#264988
Alex Angas
I'm sorry, I didn't see the previous reply.
Sarath
+1  A: 

Notepad++ [http://sourceforge.net/projects/notepad-plus/] has this feature out-of-the-box. It also supports changing it.

hegearon
A: 

I was actually looking for a generic, non-programming way of detecting the file encoding, but I haven't found one yet. What I did find by testing with different encodings was that my text was UTF-7.

So where I first was doing: StreamReader file = File.OpenText(fullfilename);

I had to change it to: StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);

OpenText assumes it's UTF-8.

You can also create the StreamReader like this: new StreamReader(fullfilename, true). The second parameter means it should try to detect the encoding from the byte order mark of the file, but that didn't work in my case.

Intraday Tips

@Intraday Tips: Yikes! Who is writing files in UTF-7???
John Machin