ansaurus

Question

Retrieving and displaying UTF-8 from a .CSV in Python

Answer 1

A:

There's the unicode_csv_reader demo in the Python docs: http://docs.python.org/library/csv.html

Prody 2009-10-13 17:51:41

That reads "Unicode strings". He has str strings encoded in UTF-8. If they were encoded in cp1252, would you suggest the "unicode_csv_reader"??

John Machin 2009-10-13 23:51:42

Answer 2

+1 A:

Your current problem is that you have been given a bum steer with the unicode_csv_reader thingy. As the name suggests, and as the documentation states explicitly:

"""unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings)."""

You don't have unicode strings, you have str strings encoded in UTF-8.

Suggestion: blow away the unicode_csv_reader stuff. Get each row plainly and simply with csv.reader, as though it were encoded in ASCII. Then convert each row to unicode:

unicode_row = [field.decode('utf8') for field in str_row]
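
As a rough illustration of that one-liner, here is a minimal sketch. The contents of str_row are made up for the example (they stand in for what csv.reader would hand back from a UTF-8 file under Python 2), and the byte literals are written with escapes so the snippet does not depend on the source file's own encoding:

```python
# Hypothetical row as returned by csv.reader: byte strings holding UTF-8 data.
str_row = [b'Mu\xc3\xb1oz', b'Z\xc3\xbcrich']

# Decode every field from UTF-8 bytes to a unicode string.
unicode_row = [field.decode('utf8') for field in str_row]

# Each two-byte UTF-8 sequence has become a single character.
print(len(str_row[0]), len(unicode_row[0]))
```

The byte string is one item longer than the decoded string because the accented letter takes two bytes in UTF-8 but is a single code point once decoded.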

Getting back to your original problem:

(1) To get help with fonts etc, you need to say what platform you are running on and what software you are using to display the unicode strings.

(2) If you want platform-independent ways of inspecting your data, look at the repr() built-in function, and the name function in the unicodedata module.
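
A small sketch of point (2), assuming the data has already been decoded to unicode. The helper name describe and the sample value are invented for illustration; repr() and unicodedata.name() are the standard-library pieces mentioned above:

```python
import unicodedata

def describe(text):
    """Return a (repr, Unicode name) pair for each character in text."""
    # The second argument to unicodedata.name() is a fallback for
    # characters that have no name (control characters, for instance).
    return [(repr(ch), unicodedata.name(ch, 'UNNAMED')) for ch in text]

# Hypothetical sample: the UTF-8 bytes for one accented letter, decoded.
sample = b'\xc3\xb1'.decode('utf8')

for shown, name in describe(sample):
    print("%s  %s" % (shown, name))
```

Neither function depends on fonts or on the terminal's encoding, so they show what is really in the string even when the display shows garbage.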

John Machin 2009-10-13 23:48:54

Thank you for taking the time to answer my question. I have gone back to the drawing board and am now a lot further along with simpler code. One thing I did not realise before searching on this was that Notepad was causing some of my initial problems through the way it was encoding the file.

MDA1973 2009-10-14 11:42:55

Answer 3

+2 A:

unicode_csv_reader(open(familynamelist)) is trying to pass non-unicode data (byte strings with utf-8 encoding) to a function you wrote expecting unicode data. You could solve the problem with codecs.open (from standard library module codecs), but that's too roundabout: the codecs would be doing utf8->unicode for you, then your code would be doing unicode->utf8, so what's the point?

Instead, define a function more like this one...:

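import csv

def encoded_csv_reader_to_unicode(encoded_csv_data,
                                  coding='utf-8',
                                  dialect=csv.excel,
                                  **kwargs):
    # Let csv.reader parse the encoded byte strings as they are...
    csv_reader = csv.reader(encoded_csv_data,
                            dialect=dialect,
                            **kwargs)
    for row in csv_reader:
        # ...then decode each cell to unicode (Python 2 built-in).
        yield [unicode(cell, coding) for cell in row]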

and use encoded_csv_reader_to_unicode(open(familynamelist)).
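
For readers on Python 3, where the unicode built-in no longer exists and csv.reader works on text rather than bytes, a roughly equivalent sketch follows. It is not part of the answer above: the name read_utf8_csv and the temporary sample file are assumptions made for the example, and the decoding is done by opening the file with an explicit encoding rather than cell by cell:

```python
import csv
import io
import os
import tempfile

def read_utf8_csv(path, dialect=csv.excel, **kwargs):
    """Yield each row of a UTF-8 CSV file as a list of text strings (Python 3)."""
    # newline='' leaves line-ending handling to the csv module.
    with io.open(path, encoding='utf-8', newline='') as handle:
        for row in csv.reader(handle, dialect=dialect, **kwargs):
            yield row

# Write a small hypothetical UTF-8 file, then read it back.
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as out:
    out.write(u'Mu\xf1oz,Z\xfcrich\r\n'.encode('utf-8'))

rows = list(read_utf8_csv(path))
os.remove(path)
```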

Alex Martelli 2009-10-14 02:07:28

This works perfectly. However, I realise I can improve on what I have done and make it a lot cleaner.

MDA1973 2009-10-14 12:13:34
