I'm storing URLs in a database, and I want to be able to tell whether two URLs are identical. Generally, a trailing slash at the end doesn't change the response you'd get from a server (i.e., http://www.google.com/ is the same as http://www.google.com).

Can I always blindly remove the trailing slash from any URL, without looking at anything?
Is that safe?

What I mean by "without looking at anything" is that I'd remove the slash from:
http://www.google.com/q?xxx=something&yyy=something/

I know the web server could theoretically return completely different things if it wanted, and I know sometimes going to a URL without the slash will redirect to one with the slash. My only intention here is determining if both URLs are the same.

Is this method safe?

Thank you.

+14  A: 

No, it is not always safe. A web server can interpret the path part of the URL any way it likes. You cannot tell what it will do (how it resolves the URI) without issuing a GET or HEAD request for the URL.

dajobe
Thank you. I've been looking through the DB, and fortunately, I confirmed that this is not a problem. All submitted URLs are grabbed by a bookmarklet we have, and as I suspected, there is no case of 2 users having the same URL except for a trailing slash. Or at least, it hasn't happened yet :-).
Daniel Magliola
And, IIRC, the URL specification specifically states that a URL ending with a slash denotes a directory, and one without denotes a document. Many web servers will redirect to or return a default document for the former, and return 404 for the latter (I know mine does).
Software Monkey
The URL spec talks about hierarchical URL schemes: the ones like FOO:// rather than the ones like BAR:blah. Some hierarchical ones, such as http and ftp, are well known, but you still can't tell whether the / at the end is meaningful. It's for the server to interpret, and that may depend on the OS, the server software implementation, and other things.
dajobe
+4  A: 

No. I've encountered situations where, depending on the settings in a .htaccess file, some directories or "clean URLs" (such as those generated by a CMS) could not be accessed without a trailing slash. It's rare and it might be a mistake on the part of the webmaster, but it can happen.

kpozin
+4  A: 

It may be safe in the sense that you'll get the same response with or without a trailing slash (and I can't guarantee that's true), but they can definitely mean different things. Consider a URL that references a directory, or something presented by the site as a directory. Using the URL

http://www.somesite.com/directory/

...makes it clear you're asking for a directory. If you hack off the trailing slash:

http://www.somesite.com/directory

...the site's going to take this as a request for a file called "directory", and get all confused for a moment. It'll likely interpret this as a request for a directory, but the meanings are not the same, and you might not get what you expect.
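One concrete place where the two forms diverge is relative-link resolution. A minimal sketch using Python's standard `urllib.parse.urljoin`; the relative link `page.html` is a made-up example, not something from the site above:

```python
from urllib.parse import urljoin

with_slash = "http://www.somesite.com/directory/"
without_slash = "http://www.somesite.com/directory"

# "page.html" is a hypothetical relative link found on the page.
# It resolves against everything up to the last "/" in the base URL,
# so the trailing slash decides whether "directory" is kept or replaced.
resolved_with = urljoin(with_slash, "page.html")
resolved_without = urljoin(without_slash, "page.html")

print(resolved_with)     # http://www.somesite.com/directory/page.html
print(resolved_without)  # http://www.somesite.com/page.html
```

So even if the server returns the same page for both URLs, a client following relative links from them can end up in different places.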

See this article for more detail.

Michael Petrotta
A: 

As others have noted, it's not always safe. If it will work for you, my recommendation is to store the URLs with the slashes, and strip them off when you do your comparison. You'll take a performance hit, but I'd think that's better than sending someone to the wrong web page.
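A rough sketch of what comparison-time stripping could look like, assuming Python and its standard `urllib.parse` module. The helper names `comparison_key` and `same_url` are made up for illustration, and the sketch assumes you only want to drop a single trailing slash from the path, leaving the query string alone:

```python
from urllib.parse import urlsplit, urlunsplit


def comparison_key(url: str) -> str:
    """Hypothetical helper: drop one trailing slash from the path only.

    The query string and fragment are left untouched, so a slash at the
    end of a query value is still treated as significant.
    """
    parts = urlsplit(url)
    path = parts.path[:-1] if parts.path.endswith("/") else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def same_url(a: str, b: str) -> bool:
    """Treat two URLs as equal if they differ only by a trailing path slash."""
    return comparison_key(a) == comparison_key(b)


print(same_url("http://www.google.com/", "http://www.google.com"))
print(same_url("http://www.google.com/q?xxx=something&yyy=something/",
               "http://www.google.com/q?xxx=something&yyy=something"))
```

The stored URL stays exactly as submitted; only the key used for comparison is altered. The first call treats the pair as equal, while the second does not, because the slash there belongs to the query string rather than the path.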

PTBNL
Or store both the actual URL and the URL in canonical form if you don't want to do the processing when you compare. Time-space tradeoff.
Chuck
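For the store-both variant, here is a minimal sketch assuming Python with an in-memory SQLite database. The table name `bookmarks`, the column names, and the helpers `canonical`, `add_bookmark`, and `find_matches` are all hypothetical, and "canonical form" here means nothing more than "path without one trailing slash":

```python
import sqlite3
from urllib.parse import urlsplit, urlunsplit


def canonical(url: str) -> str:
    """Hypothetical canonical form: the URL minus one trailing path slash."""
    parts = urlsplit(url)
    path = parts.path[:-1] if parts.path.endswith("/") else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


conn = sqlite3.connect(":memory:")
# Keep the URL as submitted, plus a precomputed canonical form to compare on.
conn.execute("CREATE TABLE bookmarks (url TEXT NOT NULL, canonical_url TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_bookmarks_canonical ON bookmarks (canonical_url)")


def add_bookmark(url: str) -> None:
    conn.execute(
        "INSERT INTO bookmarks (url, canonical_url) VALUES (?, ?)",
        (url, canonical(url)),
    )


def find_matches(url: str) -> list:
    """Return the original URLs whose canonical form matches the given URL."""
    rows = conn.execute(
        "SELECT url FROM bookmarks WHERE canonical_url = ? ORDER BY rowid",
        (canonical(url),),
    )
    return [row[0] for row in rows]


add_bookmark("http://www.somesite.com/directory/")
add_bookmark("http://www.somesite.com/other")
matches = find_matches("http://www.somesite.com/directory")
print(matches)
```

The canonical column costs extra space but lets the database do the comparison with an indexed lookup, while links shown to users can still use the original URL.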