ansaurus

Question

How to find encoding of a file in Unix via script(s)

Answer 1

+2 A:

Sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.

Or, failing that, use file -i. That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
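
If you need this from a script, here is a minimal sketch in Python that wraps file -i and pulls out the charset. The helper names (parse_charset, detect_charset) are made up for illustration, and it assumes a file implementation whose -i option prints MIME output such as "notes.txt: text/plain; charset=iso-8859-1". Other platforms may use a different flag or not support this at all.

```python
import subprocess
from typing import Optional


def parse_charset(file_output: str) -> Optional[str]:
    """Extract the charset from a line such as
    'notes.txt: text/plain; charset=iso-8859-1'."""
    marker = "charset="
    pos = file_output.rfind(marker)
    if pos == -1:
        return None
    # Keep what follows 'charset=' up to the next separator.
    value = file_output[pos + len(marker):].strip()
    for sep in (";", " ", "\n"):
        value = value.split(sep, 1)[0]
    return value or None


def detect_charset(path: str) -> Optional[str]:
    """Run `file -i` on path (assumes a `file` that supports -i)."""
    result = subprocess.run(
        ["file", "-i", path], capture_output=True, text=True, check=True
    )
    return parse_charset(result.stdout)
```

Treat whatever comes back as a guess, not as a guarantee.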

scraimer 2009-04-30 05:41:58

That doesn't appear to support 8859-1 (just from a cursory glance at that man page).

paxdiablo 2009-04-30 05:47:19

According to the man page, it knows about the ISO 8859 set. Perhaps read a little less cursorily :-)

bignose 2009-04-30 06:12:18

8859-2,4,5,13 and 16, no mention of 8859-1. The glyphs above 0x7f are very different between the -1 and -2 variants.

paxdiablo 2009-04-30 06:21:19

Hi, I work in an AIX environment, and enca does not seem to exist there. Thanks, Manglu

Manglu 2009-05-01 00:42:20

Enca sounds interesting. Unfortunately, detection seems to be very language-dependent, and the set of supported languages is not very big. Mine (de) is missing :-( Anyway, cool tool.

er4z0r 2010-04-05 12:22:46

Answer 2

+2 A:

This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that none falls in the ranges 0x00 - 0x1f or 0x7f - 0x9f but, as I said, that check will also pass for any number of files in other encodings, including at least one other variant of ISO 8859.
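
A rough sketch of that range check in Python follows. The function name could_be_iso8859 is hypothetical, and allowing tab, line feed and carriage return is my assumption, since ordinary text files contain them even though they sit in 0x00 - 0x1f.

```python
# Control-code ranges named above: 0x00-0x1f and 0x7f-0x9f.
# Assumption: tab, line feed and carriage return are tolerated,
# because normal text files contain them.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def could_be_iso8859(data: bytes) -> bool:
    """Return False if any byte falls in the control-code ranges."""
    for byte in data:
        in_control_range = byte <= 0x1F or 0x7F <= byte <= 0x9F
        if in_control_range and byte not in ALLOWED_CONTROLS:
            return False
    return True
```

A True result only means the file is not ruled out. For instance, the UTF-8 bytes for "é" (0xc3 0xa9) pass this check as well.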

Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.

So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of 8859-1 and see if they have a large number of occurrences within the file.

I'm not talking about literal translation such as:

English   French
-------   ------
of        de, du
and       et
the       le, la, les

although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical, I didn't mean any offense, just illustrating a point]).
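
To make the word-counting idea concrete, here is a small sketch in Python. The word lists are tiny and purely illustrative (taken from the examples above), and the name common_word_hits is hypothetical. A real check would need much larger lists for every language 8859-1 covers.

```python
import re

# Tiny, illustrative word lists; real ones would be far larger.
COMMON_WORDS = {
    "english": {"and", "but", "to", "of", "the"},
    "french": {"de", "du", "et", "le", "la", "les"},
}


def common_word_hits(data: bytes) -> dict:
    """Decode as ISO 8859-1 and count common words per language."""
    # Decoding as ISO 8859-1 never fails: every byte maps to a character.
    text = data.decode("iso-8859-1").lower()
    # Runs of letters only (no digits, underscores or punctuation).
    words = re.findall(r"[^\W\d_]+", text)
    return {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
```

If a file decoded this way produces plenty of hits for some language, that is a hint (nothing more) that 8859-1 is a plausible guess.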

paxdiablo 2009-04-30 05:45:24

Answer 3

+2 A:

It is really hard to determine whether a file is iso-8859-1. If you have a text with only 7-bit characters, it could be iso-8859-1, but you can't know. If you have 8-bit characters, the upper-region characters exist in other encodings as well. Therefore you would have to use a dictionary to get a better guess at which word it is, and determine from there which letter it must be. Finally, if you detect that it might be utf-8, then it is almost certainly not iso-8859-1.

Detecting an encoding is one of the hardest things to do, because you can never know for sure unless something tells you.
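
The 7-bit / 8-bit / utf-8 distinction can be sketched in Python like this. The function name classify and the labels it returns are made up for illustration, and the result is still only a guess.

```python
def classify(data: bytes) -> str:
    """Sort a byte string into one of three rough buckets."""
    if all(byte < 0x80 for byte in data):
        # 7-bit only: consistent with iso-8859-1, utf-8 and many others.
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Could be iso-8859-1 or any other 8-bit encoding.
        return "8-bit, not utf-8"
    # Valid multi-byte utf-8 is very unlikely to be iso-8859-1 text.
    return "utf-8"
```

Only the middle bucket is a candidate for the dictionary approach.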

Norbert Hartl 2009-04-30 07:13:47
