Basically I have been having real fun with this today. I have this data file called test.csv which is encoded as UTF-8:

"Nguyễn", 0.500
"Trần", 0.250
"Lê", 0.250

Now I am attempting to read it with this code, and the names display garbled, like this: Trần

I have gone through all the Python docs for 2.6, which is the version I use, and I can't get the wrapper to work, nor any of the ideas on the internet, which I assume are all correct and just not being applied properly by yours truly. On the plus side, I have learnt that not all fonts display those characters correctly anyway, something I hadn't even thought of previously, and I have learned a lot about Unicode, so it certainly was not wasted time.

If anyone could point out where I went wrong I would be most grateful.

Here is the code, updated as requested below, which returns this error:

Traceback (most recent call last):
  File "surname_generator.py", line 39, in
    probfamilynames = [(familyname,float(prob)) for familyname,prob in unicode_csv_reader(open(familynamelist))]
  File "surname_generator.py", line 27, in unicode_csv_reader
    for row in csv_reader:
  File "surname_generator.py", line 33, in utf_8_encoder
    yield line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

from random import random
import csv

class ChooseFamilyName(object):
    def __init__(self, probs):
        self._total_prob = 0.
        self._familyname_levels = []
        for familyname, prob in probs:
            self._total_prob += prob
            self._familyname_levels.append((self._total_prob, familyname))
        return

    def pickfamilyname(self):
        pickfamilyname = self._total_prob * random()
        for level, familyname in self._familyname_levels:
            if level >= pickfamilyname:
                return familyname
        print "pickfamilyname error"
        return

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

familynamelist = 'familyname_vietnam.csv'
a = 0
while a < 10:
    a = a + 1
probfamilynames = [(familyname,float(prob)) for familyname,prob in unicode_csv_reader(open(familynamelist))]
familynamepicker = ChooseFamilyName(probfamilynames)
print(familynamepicker.pickfamilyname())
A: 

There's the unicode_csv_reader demo in the python docs: http://docs.python.org/library/csv.html

Prody
That reads "Unicode strings". He has str strings encoded in UTF-8. If they were encoded in cp1252, would you suggest the "unicode_csv_reader"??
John Machin
+1  A: 

Your current problem is that you have been given a bum steer with the unicode_csv_reader thingy. As the name suggests, and as the documentation states explicitly:

"""unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings)."""

You don't have unicode strings, you have str strings encoded in UTF-8.
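To make the difference concrete, here is a minimal sketch of what the traceback is reporting. It assumes the first line of the file starts with a UTF-8 byte order mark, which is a guess based on the error naming byte 0xef at position 0; the names raw_line, error_message and text_line are made up for illustration. In Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, so decoding explicitly as ASCII fails in the same way:

```python
# UTF-8 bytes as they might come out of the file. The leading \xef\xbb\xbf
# is a byte order mark, assumed here because the traceback reports 0xef
# at position 0.
raw_line = b'\xef\xbb\xbf"Nguy\xe1\xbb\x85n", 0.500\n'

# Python 2 decodes a byte string as ASCII before it can encode it;
# doing that decode explicitly raises the same kind of error.
try:
    raw_line.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the file really uses succeeds.
text_line = raw_line.decode('utf-8')

print(error_message)
print(repr(text_line))
```

The repr() of text_line should show the name as escaped code points, with the byte order mark (if there was one) as the first character.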

Suggestion: blow away the unicode_csv_reader stuff. Get each row plainly and simply, as though it were encoded in ASCII. Then convert each row to unicode:

unicode_row = [field.decode('utf8') for field in str_row]

Getting back to your original problem:

(1) To get help with fonts etc, you need to say what platform you are running on and what software you are using to display the unicode strings.

(2) If you want platform-independent ways of inspecting your data, look at the repr() built-in function, and the name function in the unicodedata module.
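As a rough sketch of point (2), the helper below (describe is a made-up name, not a library function) lists the code point and Unicode name of every character in a string. Decoding the same UTF-8 bytes as cp1252 is an assumption about what the display software did, but it reproduces the "Trần" you are seeing:

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [('U+%04X' % ord(ch), unicodedata.name(ch, 'UNNAMED'))
            for ch in text]


# The UTF-8 bytes for the second surname in the sample data.
raw = b'Tr\xe1\xba\xa7n'

as_utf8 = describe(raw.decode('utf-8'))      # decoded correctly
as_cp1252 = describe(raw.decode('cp1252'))   # decoded with the wrong codec

for label, pairs in (('utf-8', as_utf8), ('cp1252', as_cp1252)):
    for code_point, name in pairs:
        print('%-7s %s %s' % (label, code_point, name))
```

Decoded as UTF-8 the name has four characters; decoded as cp1252 it has six, because the three bytes of the accented letter each turn into a separate character. This kind of check does not depend on fonts or on the console.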

John Machin
Thank you for taking the time to answer my question. I have gone back to the drawing board and am now a lot further along with simpler code. One thing I did not realise before searching on this was that Notepad was causing some of my initial problems through the way it was encoding the file.
MDA1973
+2  A: 

unicode_csv_reader(open(familynamelist)) is trying to pass non-unicode data (byte strings with utf-8 encoding) to a function you wrote expecting unicode data. You could solve the problem with codecs.open (from standard library module codecs), but that's too roundabout: the codecs would be doing utf8->unicode for you, then your code would be doing unicode->utf8. What's the point?

Instead, define a function more like this one...:

def encoded_csv_reader_to_unicode(encoded_csv_data,
                                  coding='utf-8',
                                  dialect=csv.excel,
                                  **kwargs):
    csv_reader = csv.reader(encoded_csv_data,
                            dialect=dialect,
                            **kwargs)
    for row in csv_reader:
        yield [unicode(cell, coding) for cell in row]

and use encoded_csv_reader_to_unicode(open(familynamelist)).
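One caveat, offered as a sketch rather than as part of the function above: if the file was saved by an editor that writes a UTF-8 byte order mark (the 0xef at position 0 in the traceback is consistent with that), the mark ends up inside the first cell. The variant below strips it first. read_utf8_csv and _strip_bom are made-up names, and the Python 3 branch is an addition of mine so that the same code runs on either version:

```python
import codecs
import csv
import io
import sys


def _strip_bom(byte_lines):
    """Yield byte lines unchanged, minus a leading UTF-8 BOM if present."""
    for index, line in enumerate(byte_lines):
        if index == 0 and line.startswith(codecs.BOM_UTF8):
            line = line[len(codecs.BOM_UTF8):]
        yield line


def read_utf8_csv(path, **kwargs):
    """Yield each CSV row as a list of text (unicode) cells."""
    if sys.version_info[0] == 2:
        # Python 2: csv parses byte strings, so decode cell by cell afterwards.
        with open(path, 'rb') as handle:
            for row in csv.reader(_strip_bom(handle), **kwargs):
                yield [cell.decode('utf-8') for cell in row]
    else:
        # Python 3: csv parses text, so the file object does the decoding;
        # the 'utf-8-sig' codec drops a leading BOM if there is one.
        with io.open(path, encoding='utf-8-sig', newline='') as handle:
            for row in csv.reader(handle, **kwargs):
                yield row
```

It would be called along the lines of read_utf8_csv('familyname_vietnam.csv', skipinitialspace=True); skipinitialspace is a standard csv option that drops the space after each comma in the sample data.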

Alex Martelli
This works perfectly. However, I realise I can improve on what I have done and make it a lot cleaner.
MDA1973