I wish to store URLs in a database (MySQL in this case) and process them in Python, though the database and programming language are probably not that relevant to my question.

In my setup I receive unicode strings when querying a text field in the database. But is a URL actually text? Is encoding from and decoding to unicode an operation that should be done to a URL? Or is it better to make the column in the database a binary blob?

So, how do you handle this problem?

Clarification: This question is not about URL-encoding non-ASCII characters with percent notation. It's about the distinction between unicode, which represents text, and byte strings, which represent one way of encoding that text as a sequence of bytes. In Python (prior to 3.0) this is the distinction between the unicode and str types. In MySQL it is TEXT versus BLOB. So the concepts seem to correspond between programming language and database. But what is the best way to handle URLs in this scheme?

+1  A: 

On the question: "But is a URL actually text?"

It depends on the context. In some languages or libraries (for example Java; I'm not sure about Python), a URL may be represented internally as an object. However, a URL always has a well-defined text representation, so storing the text representation is much more portable than storing the internal representation used by whatever language is currently in use.
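As a rough illustration of the point about text representations, here is a minimal Python 3 sketch. It assumes the standard-library `urllib.parse` module; the URL and the variable names are made up for the example. The structured object exists only in memory, while the string is what would be stored in the database.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical URL, as it might come back from a TEXT column.
stored = "http://example.com/path/to/page?q=1#top"

# Text -> in-memory object (a named tuple of URL components).
parts = urlsplit(stored)
scheme = parts.scheme   # "http"
host = parts.netloc     # "example.com"

# Object -> text again; this string is what gets written back to storage.
round_tripped = urlunsplit(parts)
```

For a plain URL like this one the round trip gives back the original string, so nothing is lost by keeping only the text form.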

URL syntax and semantics are covered by quite a few standards, recommendations and implementations, but I think the most authoritative source for parsing and constructing correct URLs would be RFC 2396.

On the question about unicode, section 2.1 deals with non-ASCII characters.

(Edit: changed RFC reference to the newest edition, thank you S.Lott)

Rolf Rander
When you construct a URL-object in Java, do you pass it a string? My Java is a little rusty but I think you don't have many options there?
unbeknown
RFC 2396 updates RFC 1808 and RFC 1738. This information is based on the old definition of URI, not the current definition.
S.Lott
+4  A: 

The relevant answer is found in RFC 2396, section 2.1, "URI and non-ASCII characters":


The relationship between URI and characters has been a source of confusion for characters that are not part of US-ASCII. To describe the relationship, it is useful to distinguish between a "character" (as a distinguishable semantic entity) and an "octet" (an 8-bit byte). There are two mappings, one from URI characters to octets, and a second from octets to original characters:

URI character sequence->octet sequence->original character sequence

A URI is represented as a sequence of characters, not as a sequence of octets. That is because URI might be "transported" by means that are not through a computer network, e.g., printed on paper, read over the radio, etc.
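The two mappings in the quoted passage can be sketched in Python 3 roughly as follows. This is only an illustration: the choice of UTF-8 for the octet-to-character step is an assumption made for the example, not something the quoted text prescribes, and the sample path is made up.

```python
from urllib.parse import quote, unquote_to_bytes

# Original character sequence (hypothetical path containing U+00E9, "e" with acute).
original = "/caf\u00e9"

# Characters -> octets. UTF-8 is an assumption chosen for this example.
octets = original.encode("utf-8")

# Octets -> URI character sequence (pure ASCII, percent-escaped).
uri_chars = quote(octets)
uri_chars.encode("ascii")  # would raise UnicodeEncodeError if non-ASCII remained

# URI character sequence -> octet sequence -> original character sequence.
recovered_octets = unquote_to_bytes(uri_chars)
recovered = recovered_octets.decode("utf-8")
```

The URI itself stays a sequence of characters throughout; bytes only appear as an intermediate step when percent-escaping or unescaping.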


MSalters
I think that makes a good point for having URIs in unicode/TEXT since that represents human readable text best and only convert it to byte sequences when actually necessary.
unbeknown
RFC 2396 was replaced by RFC 3986 four years ago.
bortzmeyer
I spotted that one, too. I couldn't find better wording in 3986, though.
MSalters
+1  A: 

Do note there is also a standard for Unicode Web addresses: IRI (Internationalized Resource Identifiers), defined in RFC 3987.
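To give an idea of how an IRI relates to a plain ASCII URI, here is a simplified Python 3 sketch. The function name `iri_to_uri` is hypothetical, the sample address is made up, and this is not a complete implementation of the conversion described in RFC 3987: it ignores userinfo and uses Python's built-in `idna` codec for the host name.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Simplified IRI -> URI conversion (hypothetical helper, not a full RFC 3987 mapping)."""
    parts = urlsplit(iri)

    # Host name: convert to its ASCII form with the built-in "idna" codec.
    # Any userinfo ("user:password@") is dropped in this sketch.
    netloc = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port is not None:
        netloc += ":" + str(parts.port)

    # Other components: percent-escape the UTF-8 bytes of non-ASCII characters,
    # leaving existing "%" escapes and common delimiters untouched.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))

# Hypothetical IRI with a non-ASCII host, path and query.
uri = iri_to_uri("http://b\u00fccher.example/stra\u00dfe?q=caf\u00e9")
```

With this approach the database can hold the readable Unicode form, and the ASCII-only form is produced only when a request actually has to be made.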

bortzmeyer