The csv module in Python doesn't work properly when UTF-8/Unicode is involved. In the Python documentation (http://docs.python.org/library/csv.html) and on other web pages I have found snippets that work for specific cases, but you have to understand exactly which encoding you are handling and pick the appropriate snippet.

Is there a universal library or snippet for Python (2.6) that reads and writes byte strings or unicode strings from and to .csv files and just works? Or is this a limitation of Python (2.6) with no simple solution?

+7  A: 

The documentation you linked already includes a Unicode usage example, so why look for another one or reinvent the wheel?

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
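
A minimal usage sketch of the same encode/decode idea, assuming the input is UTF-8 text. The name read_unicode_csv and the sample data are made up for illustration and are not from the docs. The version check is only there so the snippet runs on either major version: in Python 3 the csv module accepts text directly, so the workaround is needed on Python 2 only.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2  # the workaround is only needed on Python 2


def read_unicode_csv(text_lines, dialect=csv.excel, **kwargs):
    """Yield rows of Unicode strings from an iterable of Unicode lines."""
    if PY2:
        # Python 2: csv wants byte strings, so encode each line on the
        # way in and decode each cell on the way out.
        encoded = (line.encode('utf-8') for line in text_lines)
        for row in csv.reader(encoded, dialect=dialect, **kwargs):
            yield [cell.decode('utf-8') for cell in row]
    else:
        # Python 3: csv.reader handles text natively.
        for row in csv.reader(text_lines, dialect=dialect, **kwargs):
            yield row


# Hypothetical sample data. io.StringIO stands in for a real file opened
# with io.open(path, encoding='utf-8', newline='').
sample = io.StringIO(u'name,city\nJos\xe9,M\xe1laga\n"Li, Wei",\u5317\u4eac\n')
rows = list(read_unicode_csv(sample))
# rows now holds three lists of Unicode strings; the quoted comma in
# "Li, Wei" stays inside a single cell.
```

Opening the file with io.open (or codecs.open) and an explicit encoding is what turns the bytes on disk into the Unicode lines this helper expects.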
S.Mark
A: 

If you want a class that is used the same way as csv.reader (you iterate over it row by row), then create a module wrapping S.Mark's code like this:

import csv

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

class reader(object):
    def __init__(self, data_iter, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        self.csv_reader = csv.reader(utf_8_encoder(data_iter), dialect=dialect, **kwargs)

    def next(self):
        # decode UTF-8 back to Unicode, cell by cell:
        row = self.csv_reader.next()
        return [unicode(cell, 'utf-8') for cell in row]

    def __iter__(self):
        return self
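
For the writing half of the question, the same trick can be applied in reverse. This is only a sketch, assuming every cell is already a unicode string and the file should be UTF-8. The name write_unicode_csv and the sample rows are hypothetical, and the Python 3 branch is included only so the snippet runs on either major version.

```python
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2  # the workaround is only needed on Python 2


def write_unicode_csv(path, rows, dialect=csv.excel, **kwargs):
    """Write rows of Unicode strings to path as UTF-8 encoded CSV."""
    if PY2:
        # Python 2: csv.writer emits byte strings, so open the file in
        # binary mode and encode every cell before handing it over.
        with open(path, 'wb') as f:
            writer = csv.writer(f, dialect=dialect, **kwargs)
            for row in rows:
                writer.writerow([cell.encode('utf-8') for cell in row])
    else:
        # Python 3: open in text mode with an explicit encoding;
        # newline='' lets the csv module control line endings.
        with io.open(path, 'w', encoding='utf-8', newline='') as f:
            csv.writer(f, dialect=dialect, **kwargs).writerows(rows)


# Round trip through a temporary file (hypothetical sample rows).
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
write_unicode_csv(path, [[u'name', u'city'], [u'Jos\xe9', u'M\xe1laga']])
with io.open(path, encoding='utf-8', newline='') as f:
    written = f.read()
os.remove(path)
```

Cells that are not unicode strings (numbers, None) would need converting first; that is left out here to keep the sketch short.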
innohead