Let's assume a user enters the address of some resource and we need to translate it to:

<a href="valid URI here">human readable form</a>

The HTML4 specification refers to RFC 3986, which allows only ASCII alphanumeric characters and dashes in the host part; all non-ASCII characters in other parts must be percent-encoded. That's what I want to put in the href attribute so the link works properly in all browsers. An IDN should be encoded with Punycode.
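
A minimal sketch of that encoding step, assuming Python 3 and only the standard library. The name to_href and the chosen sets of "safe" characters are assumptions made for illustration; the built-in idna codec implements the older IDNA 2003 rules and does no extra normalization, and any user:password part of the address is dropped for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in the path; "%" is kept so that input which is
# already percent-encoded is not encoded a second time (an assumption).
PATH_SAFE = "/%:@!$&'()*+,;="
# The query and fragment may additionally contain a literal "?".
QUERY_SAFE = "/?%:@!$&'()*+,;="


def to_href(address):
    """Return an ASCII-only form of `address` suitable for an href value."""
    parts = urlsplit(address)
    # Host: every label is converted to Punycode by the "idna" codec.
    netloc = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port
    # Other parts: non-ASCII characters become percent-encoded UTF-8 bytes.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_href("http://сайт.рф/путь?запрос"))
```

An address that is already in ASCII form should pass through this function unchanged.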

The HTML5 draft refers to RFC 3987, which also allows percent-encoded Unicode characters in the host part and a large subset of Unicode in both the host and other parts without encoding. The user may enter the address in any of these forms. To provide a human-readable form, I need to decode all printable characters. Note that some parts of the address might not correspond to valid UTF-8 sequences, usually when the target site uses some other character encoding.
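
A rough sketch of the decoding step for a single path or query component, again assuming Python 3 and the standard library. The name decode_for_display is hypothetical. It decodes only bytes that form valid UTF-8 and printable characters; bytes that are not valid UTF-8 stay percent-encoded, and so do ASCII characters outside the unreserved set, since decoding something like %2F would change the meaning of the address.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# One or more consecutive percent-escapes, e.g. "%D0%BF".
_ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    raw = unquote_to_bytes(match.group(0))
    # Invalid bytes survive decoding as lone surrogates U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Byte that was not valid UTF-8: restore its escape.
            out.append("%%%02X" % (code - 0xDC00))
        elif code < 0x80 or not ch.isprintable():
            # ASCII and non-printable characters are quoted again;
            # quote() leaves letters, digits and "_.-~" as they are.
            out.append(quote(ch, safe=""))
        else:
            out.append(ch)
    return "".join(out)


def decode_for_display(component):
    """Decode printable non-ASCII characters in one URI component."""
    return _ESCAPED_RUN.sub(_decode_run, component)


print(decode_for_display("/%D0%BF%D1%83%D1%82%D1%8C"))
```

For example, a path that was percent-encoded from Windows-1251 bytes, such as /%EF%F3%F2%FC, is left exactly as entered instead of raising an error or producing replacement characters.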

An example of what I'd like to get:

<a href="http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81">
http://сайт.рф/путь?запрос</a>

Are there any tools to solve these tasks? I'm especially interested in libraries for Python and JavaScript.

Update: I know there is a way to do percent and Punycode encoding/decoding in Python and JavaScript (without proper normalization, but I can live with that). The whole task needs much more work, and there are some pitfalls (some characters should always or never be encoded, depending on context). I wonder if there are ready-to-use libraries for the whole problem, since it seems quite common and modern browsers already do such conversions (try typing http://%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84/ in Google Chrome: it will be replaced with http://сайт.рф/, but the HTTP request will use Host: xn--80aswg.xn--p1ai).

Update2: Vinay Sajip pointed out that Werkzeug has iri_to_uri and uri_to_iri functions that handle most cases correctly. So far I've found only two cases where they fail: a percent-encoded host (quite easy to fix) and invalid UTF-8 sequences (a bit tricky to handle nicely, but shouldn't be a problem).
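
The percent-encoded host case can be handled by unquoting the host before applying IDNA. A minimal Python 3 sketch with the standard library only (this is not Werkzeug's code, and normalize_host is a hypothetical name):

```python
from urllib.parse import unquote, urlsplit


def normalize_host(address):
    """Return (ascii_host, unicode_host) for the host part of `address`."""
    host = urlsplit(address).hostname or ""
    # A host such as "%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84" is unquoted
    # first; errors="strict" raises UnicodeDecodeError on invalid UTF-8.
    unicode_host = unquote(host, encoding="utf-8", errors="strict")
    # Punycode form, as sent in the Host header.
    ascii_host = unicode_host.encode("idna").decode("ascii")
    # Round-trip back to Unicode for display.
    return ascii_host, ascii_host.encode("ascii").decode("idna")


print(normalize_host("http://%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84/"))
```

The same function accepts a host that is already in Unicode or Punycode form, so it can be applied to whatever the user typed.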

I'm still looking for a JavaScript library. It's not hard to write one, but I'd prefer to avoid reinventing the wheel.

A: 

Duplicate: http://stackoverflow.com/questions/183485/can-anyone-recommend-a-good-free-javascript-for-punycode-to-unicode-conversion

Sean Kinsey
Thanks for the link. The question is not about just Punycode encoding implementation (it's easy), but a much bigger problem.
Denis Otkidach
Then you should probably clarify as I still don't get what you are asking for.
Sean Kinsey
I've updated the question
Denis Otkidach
+2  A: 

If I understand you correctly, then you can use the batteries included in Python:

# -*- coding: utf-8 -*-

# Python 2 only: in Python 3 these functions live in urllib.parse.
import urllib
import urlparse

URL1 = u'http://сайт.рф/путь?запрос'
URL2 = 'http://%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84/'

def to_idn(url):
    parts = list(urlparse.urlparse(url))
    parts[1] = parts[1].encode('idna')
    parts[2:] = [urllib.quote(s.encode('utf-8')) for s in parts[2:]]
    return urlparse.urlunparse(parts)

def from_idn(url):
    return urllib.unquote(url)

print to_idn(URL1)
print from_idn(URL2)
print to_idn(from_idn(URL2).decode('utf-8'))

which prints

http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81
http://сайт.рф/
http://xn--80aswg.xn--p1ai/

which looks like what you want. I'm not sure what special cases you mean - perhaps you could give some examples of the pitfalls you're referring to?

Update: I just remembered, Werkzeug has iri_to_uri and uri_to_iri functions in versions 0.6 and later (links are to the relevant part of the docs).

Further update: Sorry, I hadn't noticed that you're looking for a JavaScript implementation as well as a Python one. An existing public domain JavaScript implementation of Punycode is here. I can't vouch for it, though. And of course you can use the built-in JavaScript encodeURI/decodeURI APIs.

Vinay Sajip
Your functions work for this example only. The Werkzeug functions work for most cases and can be easily fixed for the rest, thanks!
Denis Otkidach
Your coding line has -*' instead of -*-. To avoid this mess you can just write # coding: utf-8
temoto