views:

2309

answers:

4

I need to convert some files to UTF-8 because they're being output on an otherwise UTF-8 site, and the content looks a little fugly at times.

I can either do this now or I can do it as they're read in (through PHP, just using fopen, nothing fancy). Any suggestions welcome.

+3  A: 

Doing it only once would improve performance and reduce the potential for future errors, but if you don't know the encoding, you cannot do a correct conversion at all.

Michael Borgwardt
+2  A: 

My first attempt at this would be:

  1. If it is syntactically valid UTF-8, assume it's UTF-8.
  2. If there are only bytes corresponding to valid characters in ISO 8859-1 (Latin-1), assume that.
  3. Otherwise, fail.
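
This heuristic can be sketched in Python (the function name is illustrative). One wrinkle: Python's `latin-1` codec maps all 256 byte values, so step 2 has to check explicitly for the C1 control range 0x80–0x9F, which ISO 8859-1 does not assign printable characters to:

def guess_encoding(data):
    """Guess heuristically: valid UTF-8 wins, then Latin-1, then give up."""
    # Step 1: syntactically valid UTF-8 is almost certainly UTF-8.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    # Step 2: treat it as Latin-1 only if no bytes fall in the
    # unassigned C1 control range 0x80-0x9F.
    if not any(0x80 <= b <= 0x9F for b in data):
        return 'iso-8859-1'
    # Step 3: fail.
    raise ValueError('encoding could not be determined')

Bytes in that C1 range usually indicate Windows-1252 rather than true Latin-1, which is why failing there is safer than guessing.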
Lars Wirzenius
+1  A: 

Can a file contain data from different codepages?

If yes, then you can't batch-convert at all. You would have to know the codepage of every single substring in the file.

If no, it's possible to convert one file at a time, but only if you know which codepage that file uses. So we're more or less back in the same situation as above; we've just moved the abstraction from substring scope to file scope.

So the question you need to ask yourself is: do you have information about which codepage some data belongs to? If not, it will still look fugly.

You can always run some analysis on your data and guess the codepage, and although this might make it a little less fugly, you are still guessing, and therefore it will still be fugly :)
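
When the codepage of a given file is known, converting that one file is straightforward. A minimal Python sketch (the function name is illustrative, and it overwrites the file in place, so keep a backup):

def convert_file(path, src_encoding):
    """Re-encode one file to UTF-8, assuming src_encoding is known."""
    # Decode with the known source codepage...
    with open(path, 'r', encoding=src_encoding) as f:
        text = f.read()
    # ...and write back as UTF-8 (overwrites in place -- back up first).
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
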

Magnus Skog
+5  A: 

I don't have a clear solution for PHP, but for Python I personally used the Universal Encoding Detector library (chardet), which does a pretty good job of guessing which encoding a file is written in.

Just to get you started, here's the Python script I used to do the conversion (originally I wanted to convert a Japanese code base from a mixture of UTF-16 and Shift-JIS, falling back to a default guess whenever chardet was not confident about the encoding):

import sys
import codecs
from chardet.universaldetector import UniversalDetector

def DetectEncoding(fileHdl):
    """Detect the encoding of an open file.

    Returns the chardet result dict."""
    detector = UniversalDetector()
    for line in fileHdl:
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    return detector.result


def ReencodeFileToUtf8(fileName, encoding):
    """Re-encode a file to UTF-8 in place."""
    #TODO: This is dangerous ^^||, would need a backup option :)
    #NOTE: The 'replace' option tolerates erroneous characters
    data = codecs.open(fileName, 'rb', encoding, 'replace').read()
    open(fileName, 'wb').write(data.encode('utf-8', 'replace'))


if __name__ == '__main__':
    # Check for arguments first
    if len(sys.argv) != 2:
        sys.exit("Invalid arguments supplied")

    fileName = sys.argv[1]
    try:
        # Open file and detect encoding
        fileHdl = open(fileName, 'rb')
        encResult = DetectEncoding(fileHdl)
        fileHdl.close()

        # Was it an empty file?
        if encResult['confidence'] == 0 and encResult['encoding'] is None:
            sys.exit("Possible empty file")

        # Only attempt to re-encode the file if we are confident about
        # the encoding and it's not already UTF-8
        encoding = encResult['encoding'].lower()
        if encResult['confidence'] >= 0.7:
            if encoding != 'utf-8':
                ReencodeFileToUtf8(fileName, encoding)
        else:
            # TODO: You could make a default guess and try to encode,
            #       or simply make it fail, as here
            sys.exit("Low confidence in detected encoding")
    except IOError:
        sys.exit('An IOError occurred')
Seh Hui 'Felix' Leong
Thanks. Based on this, I wrote a little script to run in the background and keep the files (that PHP reads) in UTF-8.
Oli