How should I sanitize URLs so people don't put 漢字 or other things in them?

EDIT: I'm using Java. The URL will be generated from a question the user asks on a form. It seems StackOverflow just removes the offending characters, but it also turns an á into an a.

Is there a standard convention for doing this? Or does each developer just write their own version?

+1  A: 

Yes, I would sanitize by removing the characters. Otherwise the URL will either be handled inconsistently or look ugly once encoded.

In Java, see the URLEncoder API docs.
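To show what "ugly encoded" looks like, here is a minimal sketch of calling URLEncoder. The class name `EncodeExample` and the helper `encode` are made up for illustration. Note that URLEncoder targets HTML form encoding (application/x-www-form-urlencoded), so it is only an approximation of URL path escaping.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of any library.
final class EncodeExample {

    // Percent-encodes the text as UTF-8 bytes using form-encoding rules.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `EncodeExample.encode("漢字")` gives `%E6%BC%A2%E5%AD%97`, and a space comes out as `+` rather than `%20`.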

Be careful! If you strip out odd characters, two distinct inputs can end up with the same stripped URL even though they are meant to be different.

The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set

This means anything outside that subset will get encoded. URLs should be readable. Standards tend to be English-biased (what's that? Langist? Languagist?).

Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...

Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?

StackOverflow seems to just remove those chars from the URL altogether :)

StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
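A rough sketch of that ID-plus-slug idea, assuming a path layout of `/questions/{id}/{slug}`. The class `QuestionUrls` and its methods are hypothetical and say nothing about how StackOverflow actually implements it.

```java
// Hypothetical helper: the numeric ID identifies the question,
// the slug is decoration and is ignored when parsing.
final class QuestionUrls {

    // Builds "/questions/{id}/{slug}", or "/questions/{id}" if the slug is empty.
    static String build(long id, String slug) {
        String base = "/questions/" + id;
        return slug.isEmpty() ? base : base + "/" + slug;
    }

    // Extracts the ID and ignores whatever slug follows it.
    static long parseId(String path) {
        // "/questions/123/some-slug" splits into ["", "questions", "123", "some-slug"].
        String[] parts = path.split("/");
        if (parts.length < 3 || !parts[1].equals("questions")) {
            throw new IllegalArgumentException("Not a question path: " + path);
        }
        // Throws NumberFormatException if the ID segment is not a number.
        return Long.parseLong(parts[2]);
    }
}
```

With this layout, two titles that collapse to the same slug still get different URLs, because only the ID is looked up.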

Thanks Mike Spross

Aiden Bell
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to http://stackoverflow.com/questions/2106942/look-ma-this-slug-is-ignored and it will just take you back to this page.
Mike Spross
@Mike, Yup ... I should have mentioned this in my post. Quoting you :)
Aiden Bell
@Aiden: Ha, nice touch on the link to my profile page ;-)
Mike Spross
@Mike, thought it was fitting
Aiden Bell
A: 

Which language are you talking about? In PHP I think this is the easiest and would take care of everything:

http://us2.php.net/manual/en/function.urlencode.php

Priyank Bolia
+1  A: 

The process you're describing is called slugifying. There's no fixed mechanism for doing it; every framework handles it in its own way.
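Since the question is about Java, here is one possible hand-rolled version, as a sketch only. The class `Slugifier` and method `slugify` are made-up names; the approach (decompose with `java.text.Normalizer`, drop the accent marks, then replace everything that isn't an ASCII letter or digit) is just one common way to get the á to a behaviour described in the question.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class, not taken from any framework.
final class Slugifier {

    static String slugify(String title) {
        // NFD splits accented letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(title, Normalizer.Form.NFD);
        // Drop the combining marks so only the base letters remain.
        String unaccented = decomposed.replaceAll("\\p{M}+", "");
        String lower = unaccented.toLowerCase(Locale.ROOT);
        // Collapse every run of other characters into a single hyphen.
        String hyphenated = lower.replaceAll("[^a-z0-9]+", "-");
        // Trim hyphens left over at either end.
        return hyphenated.replaceAll("^-+|-+$", "");
    }
}
```

`Slugifier.slugify("á la carte 漢字")` gives `a-la-carte`, while a title made up only of 漢字 gives an empty string, so the slug alone cannot be relied on to identify a page.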

Ignacio Vazquez-Abrams
Exactly what I was looking for, thanks!
Doug