Given a text file on Ubuntu (or Debian, or Unix in general), how do I find out the file's encoding? Can I run od or hexdump on it to fingerprint its encoding? What should I be looking out for?
There are many tools to do this. Try a web search for "detect encoding". Here are some of the tools I found:
- The International Components for Unicode (ICU) are a great place to start. See especially their page on Character Set Detection.
- Chardet is a Python module to guess the encoding of a file. See chardet.feedparser.org 
- The *nix command-line tool file detects file types and also guesses the character encoding of text files; on Linux, file -i (or file --mime-encoding) prints the guess. It is only a guess based on the bytes it sees. See man file.
- Perl modules Encode::Detect and Encode::Guess . 
- Someone asked a similar question on Stack Overflow. Search for the question "PHP: Detect encoding and make everything UTF-8". That's in the context of fetching files from the net and using PHP, but you could write a command-line PHP script.
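As for what to look for in od or hexdump output: the cheapest fingerprints are a byte order mark at the start of the file (EF BB BF for UTF-8, FF FE or FE FF for UTF-16, FF FE 00 00 or 00 00 FE FF for UTF-32) and whether the bytes are valid ASCII or UTF-8 at all. Here is a minimal sketch of those checks in Python, using only the standard library. The name sniff_encoding is my own, not part of any of the tools above, and a None result just means "some legacy 8-bit encoding, go use a real detector".

```python
import codecs

# Longer marks come first: the UTF-32 LE mark begins with the UTF-16 LE mark.
# The labels are Python codec names that consume the mark when decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data):
    """Return a best-effort codec name for the bytes in data, or None.

    Hypothetical helper: it only checks BOMs, pure ASCII, and strict UTF-8.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)  # strict decoding raises on any invalid byte
            return name
        except UnicodeDecodeError:
            pass
    return None  # not decidable from these checks alone
```

Pass it the whole file's bytes, e.g. sniff_encoding(open(path, "rb").read()). Cutting the read short can split a multi-byte UTF-8 sequence and turn a valid file into a None.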
Note well what the ICU page says about character set detection: "Character set detection is ..., at best, an imprecise operation using statistics and heuristics...." In my experience the problem domain makes a big difference in how easy or difficult the job is. Don't forget that it's possible that the octets in a file can be of ambiguous encoding, i.e. sensibly interpreted using multiple different encodings. They can also be of mixed encoding, i.e. different subsets of the octets make sense interpreted in different encodings. This is why there's not a single command-line tool I can recommend which always does the job.
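To make the ambiguity concrete, here is a small sketch (the helper name decodings and the candidate list are just my choices for illustration) that decodes the same octets under several encodings and keeps every one that succeeds without error:

```python
def decodings(data, candidates=("utf-8", "latin-1", "cp1252")):
    """Map each candidate encoding that decodes data cleanly to the resulting text."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


# The two bytes C3 A9 are one character in UTF-8 but two in the 8-bit encodings.
for enc, text in decodings(b"caf\xc3\xa9").items():
    print(f"{enc:8} {text!r}")
```

All three candidates decode those bytes without complaint; only a human (or a statistical model of the language) can say that "café" is more plausible than "cafÃ©". Note that latin-1 accepts any byte sequence at all, so a successful decode proves nothing by itself.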
If you have a single file and you just want to get it into a known encoding, my trick is to open the file with a text editor which can import using a bunch of different encodings, such as TextWrangler or OpenOffice.org. First, open the file and let the editor guess the encoding. Take a look at the result. If you aren't satisfied with it, guess an encoding, open the file with the editor specifying that encoding, and take a look at the result. Then save as a known encoding, e.g. UTF-16.
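If you'd rather do the same guess, look, and re-save loop from a script, something like the following would work. It is only a sketch: preview and transcode are names I made up, and the encodings you pass in are your own guesses, exactly as with the editor.

```python
from pathlib import Path


def preview(path, encoding, limit=200):
    """Decode the first `limit` bytes with a guessed encoding for eyeballing.

    Undecodable bytes (including a character cut off at the limit) show up
    as U+FFFD instead of raising.
    """
    raw = Path(path).read_bytes()[:limit]
    return raw.decode(encoding, errors="replace")


def transcode(src, dst, from_encoding, to_encoding):
    """Rewrite src as dst in to_encoding.

    Decoding is strict, so a wrong from_encoding raises UnicodeDecodeError
    if it hits invalid bytes, rather than silently writing garbage.
    """
    text = Path(src).read_bytes().decode(from_encoding)
    Path(dst).write_bytes(text.encode(to_encoding))
```

Call preview with each candidate until the text looks right, then call transcode with that candidate and the target encoding you want. A strict decode that succeeds still isn't proof the guess was right, so keep the original file until you've read the result.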