I have a bunch of files with a mixture of encodings, mainly ISO-8859-1 and UTF-8.

I would like to make all the files UTF-8, but when I try to batch-encode these files using iconv, some problems arise (files cut in half, etc.).

I suppose the reason is that iconv needs to know the 'from' encoding, so if the command looks like this

iconv -f ISO-8859-1 -t UTF-8 in.php -o out.php

but 'in.php' is already UTF-8 encoded, that causes problems (correct me if I'm wrong).

Is there a way that I can list all the files whose encoding is not UTF-8?

A: 

What kind of content? XML? Then yes, if it's properly tagged at the top. Generic text files? I don't know of any a-priori way to know what encoding is used, although it might be possible, sometimes, with clever code. "Tagged" UTF-8 text files, by which I mean UTF-8 text files with a byte-order mark? (For UTF-8, that's the three-byte sequence EF BB BF.) Probably. The byte-order-mark bytes will not commonly appear as the first three bytes of an ISO-8859-1 encoded file. (Which bobince pointed out in a comment to this post, so I'm correcting my post.)
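For what it's worth, a minimal Python sketch of that BOM check (just an illustration; the function name is mine, and the path argument is whatever file you want to test):

import codecs

def has_utf8_bom(path):
    # True if the file starts with the UTF-8 byte-order mark (EF BB BF)
    return open(path, 'rb').read(3) == codecs.BOM_UTF8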

For your purposes, tools exist that can probably solve most of your problem; Logan Capaldo pointed out one in his answer.

But after all, if it were always possible to figure out, unambiguously, what character encoding was used in a file, then the iconv utility wouldn't need you to provide the "from" encoding. :)

Eddie
UTF-8 files should not, ‘properly’, have a BOM (although in practice they often do). And a UTF-8-encoded BOM can perfectly well exist at the beginning of an ISO-8859-1 file (it would mean “ï»¿”)... it's just very unlikely, of course.
bobince
+3  A: 

You can't find files that are definitely ISO-8859-1, but you can find files that are valid UTF-8 (which, unlike with most multibyte encodings, gives you reasonable assurance that they are in fact UTF-8). moreutils has a tool, isutf8, which can do this for you. Or you can write your own; it would be fairly simple.
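As a rough sketch, you could drive isutf8 over a directory from Python like this (illustration only: it assumes isutf8 is on your PATH and supports the -q/quiet flag, and TARGETDIR is a placeholder for the directory to scan):

import os
import subprocess

for name in os.listdir(TARGETDIR):
    path = os.path.join(TARGETDIR, name)
    if os.path.isfile(path):
        # isutf8 exits with a non-zero status for files that are not valid UTF-8
        if subprocess.call(['isutf8', '-q', path]) != 0:
            print path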

Logan Capaldo
+2  A: 

It's often hard to tell just by reading a text file whether it's in UTF-8 encoding or not. You could scan the file for certain indicator bytes which can never occur in UTF-8, and if you find them, you know the file is in ISO-8859-1. If you find a byte with its high-order bit set, where the bytes both immediately before and immediately after it don't have their high-order bit set, you know it's ISO-8859-1 encoded (because bytes over 127 always occur in runs of two or more in UTF-8). Beyond that, it's basically guesswork: you'll have to look at the sequences of bytes with the high bit set and see whether it would make sense for them to occur in ISO-8859-1 or not.
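Here's a rough Python sketch of that lone-high-byte heuristic (just an illustration of the idea, not a complete detector; the function name is mine, and data is the file's raw bytes):

def has_lone_high_byte(data):
    # A byte over 127 whose neighbours are both plain ASCII can't be part
    # of a valid UTF-8 multibyte sequence, so the data isn't UTF-8.
    for i in range(1, len(data) - 1):
        if ord(data[i]) > 127 and ord(data[i-1]) < 128 and ord(data[i+1]) < 128:
            return True
    return False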

The file program will make an attempt to guess the encoding of a text file it's processing; you could try that.

David Zaslavsky
+1  A: 

Is there a way that I can list all the files whose encoding is not UTF-8?

Perhaps not so easily in bash alone, but it's a trivial task from, e.g., Python:

import os
import os.path

# TARGETDIR is a placeholder for the directory containing the files to convert.
for child in os.listdir(TARGETDIR):
    child = os.path.join(TARGETDIR, child)
    if os.path.isfile(child):
        content = open(child, 'rb').read()

        try:
            # If this succeeds, the file is already valid UTF-8: leave it alone.
            unicode(content, 'utf-8')
        except UnicodeDecodeError:
            # Otherwise assume ISO-8859-1 and rewrite the content as UTF-8.
            open(child, 'wb').write(unicode(content, 'iso-8859-1').encode('utf-8'))

This assumes that any file that can be interpreted as a valid UTF-8 sequence is one (and so can be left alone), whilst anything that isn't must be ISO-8859-1.

This is a reasonable assumption if those two are the only possible encodings, because valid UTF-8 sequences (of at least two top-bit-set bytes in a particular order) are relatively rare in real Latin text, where we tend only to use the odd single accented character here and there.
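As a quick illustration of that point (Python 2, matching the snippet above): in ISO-8859-1, 'é' is the single byte 0xE9, which on its own is not a valid UTF-8 sequence.

unicode('caf\xe9', 'utf-8')        # raises UnicodeDecodeError
unicode('caf\xe9', 'iso-8859-1')   # u'caf\xe9', i.e. u'café'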

bobince