ansaurus

Question

can't open unicode url with python

Answer 1

A:

You need to convert to ASCII, using ToASCII, talked about here: http://tools.ietf.org/html/rfc3490.

You'll get a domain that looks like gibberish, but it'll be an ASCII representation of the Unicode IDN.

http://mail.python.org/pipermail/python-checkins/2003-April/035302.html

That link above has an implemented algorithm - give it a try!

(You'll be converting into punycode, the encoding of Unicode into ASCII: http://en.wikipedia.org/wiki/Punycode)

EDIT: Apparently Python can generate punycode itself.

Isaac Hodes 2009-12-16 18:40:16

there are no unicode characters in the domain given as an example, so IDN and punycode are not relevant to this question.

Jehiah 2010-03-21 22:57:07

Yes, in fact there is. Look at the "i" in the url – it does, in fact, have an accent.

Isaac Hodes 2010-03-22 04:46:56

Answer 2

A:

Encoding the URL as utf-8, should have worked. I wonder if your source file is properly encoded, and whether the interpreter knows it. If your python source file is saved as UTF-8, for example, then you should have

# coding=UTF-8

as the first or second line.

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()

works for me.

Edit: also, be aware that Unicode text in an interactive Python session (whether through IDLE, or a console) is fraught with encoding-related difficulty. In those cases, you should use Unicode literals (like \u00ED in your case).

Jonathan Feinberg 2009-12-16 18:40:45

Answer 3

A:

This works for me:

#!/usr/bin/env python
# define source file encoding, see: http://www.python.org/dev/peps/pep-0263/
# -*- coding: utf-8 -*-

import urllib
url = u'http://example.com/índice.html'
content = urllib.urlopen(url.encode("UTF-8")).read()

The MYYN 2009-12-16 18:41:14

Answer 4

+2 A:

Per the applicable standard, RFC 1378, URLs can only contain ASCII characters. Good explanation here, and I quote:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-.+!'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with `%ED'.

Alex Martelli 2009-12-16 18:42:02

Answer 5

A:

It works for me. Make sure you're using a fairly recent version of Python, and your file encoding is correct. Here's my code:

# -*- coding: utf-8 -*-
import urllib
url = u'http://mydomain.es/índice.html'
url = url.encode('utf-8')
content = urllib.urlopen(url).read()

(mydomain.es does not exist, so the DNS lookup fails, but there are no unicode issues to that point.)

Eddie Sullivan 2009-12-16 18:43:20

ansaurus

tags:

views:

answers:

can't open unicode url with python

related questions