Hello,

I'm creating an application in PHP5 that needs to store a unique URL for a website. For example, if a user enters "http://www.google.com/", I'll store it in the DB as "google.com" after parsing it. Basically, I need a consistent convention for URLs so that there aren't multiple entries for the same site.

Is it OK to take the "www" off of all URLs entered? Or should I keep it?

I'd love to hear your opinions.

Thanks in advance!

+1  A: 

Are you really so hard up for storage that 4 bytes will break the bank? Probably not. It might seem like a space-saving feature, but once you run into URLs whose site has no non-www alias, you'll have to code around it. Then when you have URLs with subdomains other than www, you'll have to code around those too.

In the long run it's probably best to leave the URL as it is and accept the raw storage cost, unless you're going to be storing hundreds of millions of them.

Joel Etherton
+11  A: 

Although the distinction won't always matter, www.google.com refers to a different DNS entry than google.com, and might conceivably return different data. You see www2.foo.com in the wild occasionally, for instance, and foo.com might redirect to it for strange load-balancing reasons.

So, bad idea.

Michael Petrotta
+1  A: 

If you care about the fact that some websites/applications serve the same content at www.domain.com and domain.com while others do not, then truncating the www is probably not a safe assumption.

AJ
+7  A: 

Keep it. Those are two different URLs which most sites treat as the same. The only harm in it is 4 bytes of data, so there's no good technical reason not to store it. While you're there, it's probably a good idea not to throw away the protocol ("http://") either.

When you display the URL to end users, that's when I'd strip out the "http://www."
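
A minimal sketch of that split between storage and display, written in Python for brevity (the same logic should port to PHP's `parse_url`). The function names `storage_form` and `display_form` are made up for illustration, and the code assumes the URL has no `user:password@` part:

```python
from urllib.parse import urlsplit, urlunsplit

def storage_form(url):
    """Normalize only what is safe to normalize; keep the protocol and 'www.'."""
    parts = urlsplit(url.strip())
    # Scheme and host are case-insensitive, so lowercase them.
    # Assumption: no userinfo in the netloc (that part would be case-sensitive).
    netloc = parts.netloc.lower()
    # An empty path and "/" refer to the same resource, so store "/".
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, parts.fragment))

def display_form(stored_url):
    """Shorten a stored URL for display only; the stored value is untouched."""
    parts = urlsplit(stored_url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    path = "" if parts.path == "/" else parts.path
    return host + path
```

With this, "HTTP://WWW.Google.com" and "http://www.google.com/" end up as the same stored row, while "http://google.com/" stays a separate entry.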

nickf
+1  A: 

I'd suggest following the standards: http://en.wikipedia.org/wiki/URI_scheme

Jonathan
+2  A: 

It depends on what you mean by "same site". anything.sitename.com is owned by the same people as www.sitename.com. In fact, anything.sitename.com and www.sitename.com are both subdomains of sitename.com.

Most of the time,

www.sitename.com == sitename.com

But,

blog.sitename.com != sitename.com

Wallacoloo
+3  A: 

One approach might be to perform a GET operation to the site, and store the result of any redirects that you receive. So, if you've got a site such as this where www.stackoverflow.com redirects to stackoverflow.com, you'll store the canonical URL only once.

Don't make assumptions about what the site owners want; let them tell you.
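
A rough sketch of that redirect-following idea, in Python. The names `resolve_canonical` and `head_fetch` are invented for this example, and the `fetch` parameter is an assumption that keeps the redirect logic separate from the network call:

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)

def resolve_canonical(url, fetch, max_hops=5):
    """Follow redirects and return the last URL reached.

    `fetch` is a callable: fetch(url) -> (status_code, location_or_None).
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            break  # redirect loop, give up and keep the current URL
        seen.add(url)
        status, location = fetch(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return url
    return url

def head_fetch(url):
    """One possible `fetch`: a HEAD request using the standard library."""
    parts = urlsplit(url)
    secure = parts.scheme == "https"
    conn_cls = http.client.HTTPSConnection if secure else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling `resolve_canonical(user_url, head_fetch)` before the insert would store whatever the site itself redirects to. Some servers reject HEAD, so falling back to GET may be needed in practice.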

Greg Hewgill
+1. A HEAD would do it too. Or, using GET, you could also look for a `<link rel="canonical">` element in the page, if you're feeling ambitious.
bobince
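
The `<link rel="canonical">` lookup mentioned in the comment above could look something like this. It is only a sketch using Python's standard `html.parser`; `CanonicalLinkFinder` and `find_canonical` are hypothetical names, and the page HTML is assumed to have been fetched already:

```python
from html.parser import HTMLParser

class CanonicalLinkFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated values and is case-insensitive.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]

def find_canonical(html):
    """Return the canonical URL declared in the page, or None if absent."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical
```

If the page declares no canonical link, `find_canonical` returns `None` and the redirect result (or the URL as entered) can be stored instead.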
+1  A: 

You'd better keep it if you want to have a few subdomains on the same domain. This may help lower your request bandwidth.

For example:

http://www.google.com
http://maps.google.com

In this case, when you go to Google Maps, the cookies from the www site aren't sent needlessly with each request.

Mendy
A: 

Unlike http://, www cannot be assumed in 100% of cases. In fact, www is in most cases a subdomain of foo.com. There are a variety of ways to deploy a website so that the bare domain is the same as www. The two most popular are DNS aliasing, where www resolves to the same host as the bare domain, and HTTP redirect, where foo.com is a separate virtual host from www.foo.com and has one page that redirects all traffic to www.foo.com.

So no, it wouldn't be safe to strip www, as it is a different address from the bare version of the same URL. Plus, you might also incur a redirection cost for websites that use the HTTP redirect method, e.g. http://google.com.

Igor Zevaka