Hi Everyone,

Really new to Python and getting data from the web, so here it goes.

I have been able to pull data from the NYT API and parse the JSON output into a CSV file. However, depending on my search, I may get the following error when I attempt to write a row to the CSV.

UnicodeEncodeError: 'charmap' codec can't encode characters in position 20-21: character maps to <undefined>

This URL has the data that I am trying to parse into a CSV. (I de-selected "Print pretty results")

I am pretty sure the error is occurring near title:"Spitzer......."

I have tried to search the web, but I can't seem to get an answer. I don't know a lot about encoding, but I am guessing the data I retrieve from the JSON records is encoded in some way.

Any help you can provide will be greatly appreciated.

Many thanks in advance,

Brock

A: 

Every piece of textual data is encoded. It's hard to tell what the problem is without any code, so the only advice I can give now is: Try decoding the response before parsing it ...

resp = do_request()
# Check whether the NYT site documents the encoding it uses, and use that instead of 'utf-8'.
decoded = resp.decode('utf-8')
parsed = parse( decoded )
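
To make the placeholders above a bit more concrete, here is a minimal sketch, assuming Python 3 and assuming the body really is UTF-8. The helper name parse_response and the sample bytes are made up for illustration; they are not taken from the NYT API.

```python
import json


def parse_response(body, encoding="utf-8"):
    """Decode raw response bytes to text, then parse the text as JSON."""
    # bytes -> str; raises UnicodeDecodeError if the bytes do not match the encoding
    text = body.decode(encoding)
    return json.loads(text)


# Hypothetical response body: the bytes e2 80 99 are a UTF-8 encoded U+2019
# (right single quotation mark).
sample_body = b'{"title": "Example\xe2\x80\x99s headline"}'
record = parse_response(sample_body)
```

After this, record["title"] is an ordinary text string, and nothing needs to be encoded again until it is written out.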
THC4k
A: 

It appears to be trying to decode '\/', which shows up wherever the data contains a slash. This can be avoided by using the str function.

str('http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html')
'http:\\/\\/www.nytimes.com\\/2010\\/02\\/17\\/business\\/global\\/17barclays.html'

From there you can use replace.

str('http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html').replace('\\', "")
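
For what it's worth, '\/' is a legal escape for '/' in JSON, so a JSON parser should already hand back plain slashes without any manual replace. A small sketch, assuming Python 3 and the standard json module; the URL is the one from the example above, wrapped in double quotes so that it is a valid JSON string:

```python
import json

# Raw string so Python leaves the backslashes alone; the inner double quotes
# make the value a JSON string literal.
escaped = r'"http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html"'

# json.loads un-escapes \/ to / while parsing.
url = json.loads(escaped)
print(url)
```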
what
A: 

You need to check your HTTP headers to see what char encoding they are using when returning the results. My bet is that everything is encoded as utf-8 and when you try to write to CSV, you are implicitly encoding output as ascii.
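
Checking the header can be sketched like this, assuming Python 3. The function name charset_from_content_type and the header values are hypothetical (they are not copied from a real NYT response), and falling back to utf-8 when no charset is given is an assumption too.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header, or `default` if none."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter lowercased, or None.
    return msg.get_content_charset() or default


# Hypothetical header values for illustration only.
print(charset_from_content_type("application/json; charset=ISO-8859-1"))
print(charset_from_content_type("application/json"))
```

If you fetch the data with urllib.request, the response's headers object should offer the same get_content_charset() method, so you may not need to parse the header string yourself.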

The ’ (a typographic apostrophe, not the plain ' key) they are using is not in the ASCII character set. You can catch the UnicodeError exception.

Follow the golden rules of encodings.

  1. Decode early into unicode (data.decode('utf-8', 'ignore'))

  2. Use unicode internally.

  3. Encode late, during output (data.encode('ascii', 'ignore'))
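
Here is a rough sketch of what the "encode late" step does with a character ASCII cannot represent, assuming Python 3 (where text is already unicode). The title string is made up; only the U+2019 apostrophe in it matters.

```python
title = "Example\u2019s headline"  # hypothetical text containing U+2019

# Strict encoding (the default) raises, and the error can be caught.
try:
    title.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

dropped = title.encode("ascii", "ignore")    # silently drops the apostrophe
replaced = title.encode("ascii", "replace")  # substitutes a question mark
as_utf8 = title.encode("utf-8")              # keeps the character as three bytes
```

Note that 'ignore' loses data without telling you, so encoding the output as UTF-8 is usually the safer choice if whatever reads the file can handle it.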

You can probably set your CSV writer to use UTF-8 encoding when writing.
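
In Python 3 the csv module writes text, so the encoding is chosen when the file is opened rather than on the writer itself. A minimal sketch under that assumption; write_rows, the file name, and the sample rows are all made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to a CSV file encoded as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


# Hypothetical rows; the second one contains U+2019, which ASCII cannot represent.
rows = [["title", "url"], ["Example\u2019s headline", "http://example.com/story"]]

out_path = os.path.join(tempfile.mkdtemp(), "articles.csv")
write_rows(out_path, rows)
```

Without the explicit encoding argument, open() falls back to a platform-dependent default, which is the likely source of a 'charmap' error on Windows.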

Note: You should really see what encoding they are giving you before blindly using utf-8 for everything.

rox0r