views:

3747

answers:

3

I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part.

I'd like to do the equivalent of:

iconv -t utf-8 $file > converted/$file # this is shell code

Thanks!

+7  A: 

You can use the codecs module, like this:

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.

DzinX
read() will always read the whole file - you probably want .read(BLOCKSIZE), where BLOCKSIZE is some suitable amount to read/write at once.
Brian
That's true, thank you. I'll modify my example.
DzinX
+8  A: 

This worked for me in a small test:

sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
source = open("source")
target = open("target", "w")

target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
Staale
+4  A: 

Thanks for the replies, it works!

And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:

from __future__ import with_statement

import os
import sys
import codecs

sourceFormats = ['ascii', 'iso-8859-1']
targetFormat = 'utf-8'
outputDir = 'converted'

def convertFile(fileName):
    print("Converting '" + fileName + "'...")
    for format in sourceFormats:
        try:
            with codecs.open(fileName, 'rU', format) as sourceFile:
                writeConversion(sourceFile)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

    print("Error: failed to convert '" + fileName + "'.")


def writeConversion(file):
    with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
        for line in file:
            targetFile.write(line)

# Off topic: get the file list and call convertFile on each file
# ...

But maybe there is a better way to guess each source file's format?

Cheers!

Sébastien RoccaSerra
For tough cases you can try to detect encoding with the chardet module from feedparser.org, but in your case it's an overkill.
itsadok