ansaurus

Question

Answer 1

+1 A:

You can try the urllib.unquote function (Python 2):

>>> import urllib
>>> string = urllib.unquote("http://wincode.org/%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5/")
>>> print string.decode("utf-8")
http://wincode.org/программирование/
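
For readers on Python 3, here is a rough equivalent of the session above. It is only a sketch: it assumes Python 3, where the function lives in urllib.parse and decodes the percent-escapes as UTF-8 by default, so no separate decode step is needed. The variable names are illustrative.

```python
from urllib.parse import unquote

# The percent-encoded URL from the example above.
encoded = ("http://wincode.org/%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC"
           "%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5/")

# unquote() returns a str; the escapes are interpreted as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)
```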

aruseni 2010-05-14 15:38:57

I get an error at decode('utf-8'): UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-50: ordinal not in range(128)

Ockonal 2010-05-14 15:43:25

Answer 2

+1 A:

"I have a list of URLs which contain Cyrillic."

OK, if it contains raw (not %-encoded) Cyrillic characters, that's not like the example, and in fact it isn't a URL at all.

An address with non-ASCII characters in it is known as an IRI. IRIs shouldn't be used in an HTML link, but browsers tend to fix up these mistakes.

To convert an IRI to a URI which you can then open with urllib, you have to:

Encode non-ASCII characters in the hostname part using Punycode (IDNA).
Encode non-ASCII characters in the rest of the IRI to UTF-8 bytes and URL-encode them (resulting in %D0%BF... like in the example URL).

an example implementation.
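
The two steps above could be sketched roughly as follows. This is not the linked implementation; it is a minimal illustration that assumes Python 3 and its standard library only. The name iri_to_uri is hypothetical, userinfo (user:password@) in the address is ignored, and Python's built-in "idna" codec follows the older IDNA 2003 rules, which may differ from what current browsers do for some hostnames.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (hypothetical helper, a sketch)."""
    parts = urlsplit(iri)

    # Step 1: encode the hostname with IDNA (Punycode labels, "xn--...").
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Step 2: UTF-8 encode and percent-encode everything else.
    # "%" is kept safe so already-encoded sequences are not encoded twice.
    safe = "/%:@!$&'()*+,;=~"
    path = quote(parts.path, safe=safe)
    query = quote(parts.query, safe=safe + "?")
    fragment = quote(parts.fragment, safe=safe + "?")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://wincode.org/программирование/"))
```

Passing the raw Cyrillic address through this function should give back the %-encoded form used in the example URL, which can then be opened with urllib.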

bobince 2010-05-14 17:35:10

I've found another implementation: http://www.koders.com/python/fid50A5ABE4BE396F5BFA66E8F65188607FE4F722DD.aspx?s=iri#L2 But it doesn't work for me either. Same 404.

Ockonal 2010-05-14 17:53:21

I copied all the URLs that my script gets into a list object by hand. It works.

Ockonal 2010-05-14 18:03:20

Loading url with cyrillic symbols