My original string was a unicode string anyways (i.e. prefixed by a u)
...which is the problem. It wasn't a "string" as such, but a "Unicode object". It contains a sequence of Unicode code points. Those code points must, of course, have some internal representation that Python knows about, but whatever it is, it's abstracted away; they're shown as those \uXXXX entities when you print repr(my_u_str).
To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on an encoding, because there are plenty to choose from. UTF-8 and UTF-16 are common choices; ASCII can work too, if everything in the string fits. u"abc".encode('ascii') works just fine.
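As a minimal sketch of encoding and decoding (the u"\u2119ython" string is just an example; \u2119 is the double-struck capital P, which has no ASCII equivalent):

```python
u = u"\u2119ython"  # a Unicode object: a sequence of code points

# Encoding turns code points into bytes; the byte sequence depends on the codec.
utf8_bytes = u.encode('utf-8')    # \u2119 becomes a 3-byte sequence here
utf16_bytes = u.encode('utf-16')  # a different byte sequence, same code points

# Decoding with the same codec gets the original Unicode object back.
assert utf8_bytes.decode('utf-8') == u

# ASCII has no representation for \u2119, so this raises UnicodeEncodeError.
try:
    u.encode('ascii')
except UnicodeEncodeError:
    print("cannot encode \\u2119 as ASCII")
```

The round trip (encode then decode with the same codec) is what makes the choice of encoding matter: both ends have to agree on it.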
Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: the first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway.)
Incidentally, I don't know that 'utf8' is (always) right for URLs. This w3schools page claims that ISO-8859-1 is the character set used for URLs, but honestly it must depend on the client and the server (and probably wind speed and ambient temperature too). Comments on this are welcome, because I've never been able to find a good source on it. But if it works, it works...
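In practice, what usually ends up in a URL is a percent-encoded form of some byte encoding, and UTF-8 is a common choice. A hedged sketch, using Python 3's urllib.parse.quote (in Python 2 the rough equivalent is urllib.quote applied to the UTF-8-encoded byte string):

```python
from urllib.parse import quote, unquote

u = u"\u2119ython"

# quote() percent-encodes the string; by default it encodes it as UTF-8
# first, so \u2119 becomes its three UTF-8 bytes, each written as %XX.
encoded = quote(u)
print(encoded)  # %E2%84%99ython

# unquote() reverses this, decoding the percent-escaped bytes as UTF-8.
assert unquote(encoded) == u
```

Note that this only covers getting the characters into a URL-safe form; whether the receiving server then decodes those bytes as UTF-8 or ISO-8859-1 is exactly the ambiguity described above.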
Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
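That said, the one thing I'm fairly sure of is the renaming: Python 2's unicode type became Python 3's str, and Python 2's str (a byte string) roughly corresponds to Python 3's bytes. A minimal sanity check, under Python 3:

```python
s = "\u2119ython"        # in Python 3, plain string literals are Unicode text
b = s.encode('utf-8')    # encoding still produces a distinct byte type

print(type(s))   # <class 'str'>
print(type(b))   # <class 'bytes'>
print(repr(b))   # b'\xe2\x84\x99ython'
```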