Using Python 2.5.2 on Debian Linux, I'm trying to fetch the content of a Spanish URL that contains a Spanish character ('í'):

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url).read()

I'm getting this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

Before passing the URL to urllib, I've tried this:

url = urllib.quote(url)

and this:

url = url.encode('UTF-8')

but neither works.

Can you tell me what I am doing wrong?

Thanks!

A: 

You need to convert the hostname to ASCII using the ToASCII operation described in RFC 3490: http://tools.ietf.org/html/rfc3490.

You'll get a domain that looks like gibberish, but it'll be an ASCII representation of the Unicode IDN.

http://mail.python.org/pipermail/python-checkins/2003-April/035302.html

That link above has an implemented algorithm - give it a try!

(You'll be converting into punycode, the encoding of Unicode into ASCII: http://en.wikipedia.org/wiki/Punycode)

EDIT: Apparently Python can generate punycode itself.
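
A minimal sketch of that built-in route, assuming Python's standard `idna` codec. The helper name `to_ascii_host` and the hostname `españa.es` are made up for illustration and do not come from the question. Note that this conversion only applies to the host part of a URL, not to the path.

```python
def to_ascii_host(hostname):
    """Return the ASCII (Punycode) form of a Unicode hostname."""
    # The built-in 'idna' codec applies the RFC 3490 ToASCII operation
    # to each dot-separated label of the hostname.
    return hostname.encode('idna').decode('ascii')

# Hypothetical hostname, written with an escape (\u00f1 is 'ñ')
# so the source file itself stays pure ASCII.
print(to_ascii_host(u'espa\u00f1a.es'))  # xn--espaa-rta.es
```

A hostname that is already ASCII, such as mydomain.es, should come back unchanged.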

Isaac Hodes
There are no Unicode characters in the domain given as an example, so IDN and Punycode are not relevant to this question.
Jehiah
Yes, in fact there is one. Look at the "i" in the URL: it does, in fact, have an accent.
Isaac Hodes
A: 

Encoding the URL as UTF-8 should have worked. I wonder whether your source file is properly encoded, and whether the interpreter knows its encoding. If your Python source file is saved as UTF-8, for example, then you should have

# coding=UTF-8

as the first or second line.

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()

works for me.

Edit: also, be aware that Unicode text in an interactive Python session (whether through IDLE or a console) is fraught with encoding-related difficulty. In those cases, you should use Unicode escape sequences (like \u00ED in your case).
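
For example, the URL from the question could be written with an escape instead of a literal 'í', which sidesteps the source-file or console encoding entirely. This is only a small sketch; the variable name `encoded_url` is made up for illustration.

```python
# \u00ed is 'í' (LATIN SMALL LETTER I WITH ACUTE), so no non-ASCII
# characters appear in the source or at the interactive prompt.
url = u'http://mydomain.es/\u00edndice.html'

# Encode explicitly to UTF-8 bytes before handing the URL to urllib.
encoded_url = url.encode('utf-8')
```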

Jonathan Feinberg
A: 

This works for me:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The line above declares the source file encoding (it must be the first or
# second line), see: http://www.python.org/dev/peps/pep-0263/

import urllib
url = u'http://example.com/índice.html'
content = urllib.urlopen(url.encode("UTF-8")).read()
The MYYN
+2  A: 

Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. Good explanation here, and I quote:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with `%ED'.
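
A rough sketch of that replacement, percent-encoding only the path so that the scheme and host are left alone. The helper name `percent_encode_path` is made up for illustration. Which byte encoding the server expects is an assumption you would need to check: Latin-1 gives %ED, while UTF-8 gives %C3%AD. The imports fall back to the Python 2 module names.

```python
try:
    from urllib.parse import quote, urlsplit, urlunsplit  # Python 3
except ImportError:
    from urllib import quote                              # Python 2
    from urlparse import urlsplit, urlunsplit

def percent_encode_path(url, encoding='utf-8'):
    """Percent-encode the non-ASCII characters in the path of a Unicode URL."""
    parts = urlsplit(url)
    # Encode the path to bytes first, then quote the bytes; '/' is left as-is.
    path = quote(parts.path.encode(encoding))
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))

url = u'http://mydomain.es/\u00edndice.html'  # \u00ed is 'í'
print(percent_encode_path(url, 'latin-1'))    # http://mydomain.es/%EDndice.html
print(percent_encode_path(url, 'utf-8'))      # http://mydomain.es/%C3%ADndice.html
```

Quoting the whole URL in one go would also escape the ':' after the scheme, which is why the sketch splits the URL first.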

Alex Martelli
A: 

It works for me. Make sure you're using a fairly recent version of Python, and your file encoding is correct. Here's my code:

# -*- coding: utf-8 -*-
import urllib
url = u'http://mydomain.es/índice.html'
url = url.encode('utf-8')
content = urllib.urlopen(url).read()

(mydomain.es does not exist, so the DNS lookup fails, but there are no Unicode issues up to that point.)

Eddie Sullivan