views:

54

answers:

1

Having a look at how Google perceives our site at the moment and coming up short... Basically, we use a bog-standard URL-rewriting setup to make our URLs look SEO-friendly.

For instance, a product URL takes the shape any_string_([0-9]+).html, and so forth. Of course, this allows us to put whatever we want before the product ID... which we have done. In the past, a product page was Product_Name_79.html; it then became Brand_Name_Product_Name_79.html. Apache does not really care, and ID 79 gets passed on in either case. However, Google now has two versions of this product cached under different URLs - and that's not a good thing, as it keeps coming back to the first URL and spidering it.
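A minimal sketch of the rewrite described above (the script name product.php and the exact regex are assumptions, not taken from the site):

```apache
# Anything before the trailing _ID.html is ignored; only the captured ID matters,
# which is why Product_Name_79.html and Brand_Name_Product_Name_79.html both
# resolve to the same product.
RewriteEngine On
RewriteRule ^.*_([0-9]+)\.html$ /product.php?id=$1 [L,QSA]
```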

The same applies to our rewrite rules for brands and categories, some of which have been dropped and some of which have been modified.

There are over 11k URLs in site:domain, whereas our sitemap contains only some 5.8k. How would you prevent spiders from fetching older versions of URLs that you no longer link to (considering it's not a manual process, and such URLs are often very dynamic)?

E.g., Mens_Merrell_Trail_Running_Shoes__50-100__10____024/ is a dynamic URL for the Merrell brand, narrowed down to items in trail running shoes that cost between 50 and 100, in size 10, with gender set to men's.

If we decide to nofollow any size and price filter URLs, that still leaves Google able to reach them through its old cache...

What is the best practice for disallowing a particular type of URL? As the combinations above are nearly infinite, I cannot produce a list, and it certainly cannot be backdated against whatever brands and categories Google may hold for us historically.

Shall we add noindex when such filters are applied? Shall we export them to robots.txt? Or do nothing, in the hope that Google stops returning them?
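For what it's worth, Googlebot does understand * wildcards in robots.txt (a Google extension, not part of the original standard), so if the filtered URLs share a recognisable marker - in the example above, the double-underscore filter separator - one rule could cover the whole class. The pattern below is an assumption based on that single example URL:

```
# Assumed convention: filter segments are introduced by a double underscore.
User-agent: Googlebot
Disallow: /*__
```

Note that robots.txt only stops crawling; URLs Google already knows about can remain in the index. A `<meta name="robots" content="noindex">` on pages rendered with filters applied actually gets them removed - but Google must be able to crawl the page to see the tag, so don't combine noindex with a robots.txt block on the same URLs.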

To put it into perspective, we have 2,600 product page URLs that are now redundant/disabled. What would you do with them? Redirect to the homepage, redirect to the brand page, return a 404, do nothing?
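For retired products, one low-maintenance option is serving 410 Gone driven by a lookup file, so the list can be regenerated from the database rather than hand-edited. A sketch, assuming Apache with mod_rewrite at vhost level (RewriteMap is not available in .htaccess) and an invented map file path:

```apache
# retired-products.txt holds one "ID gone" line per dead product, e.g.:
#   79 gone
RewriteMap retired "txt:/etc/apache2/retired-products.txt"
# Capture the numeric ID from the requested URL...
RewriteCond %{REQUEST_URI} _([0-9]+)\.html$
# ...and if the map marks that ID as gone, answer 410.
RewriteCond ${retired:%1|live} =gone
RewriteRule .* - [G,L]
```

A 410 tells Google the page is gone on purpose, and it tends to drop such URLs faster than plain 404s; bulk-redirecting 2,600 dead products to the homepage risks being treated as a soft 404.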

Thanks for any advice.

+1  A: 

I think you're looking for rel="canonical". Google should start ignoring your old links if nothing really links to them any more. You can check for incoming links with a tool like this: http://www.seomoz.org/linkscape.
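The canonical link goes in the head of every variant of the page, all pointing at the one URL you want indexed (the domain and path below are made up):

```html
<!-- Served on Product_Name_79.html, Brand_Name_Product_Name_79.html, etc. -->
<link rel="canonical" href="http://www.example.com/Brand_Name_Product_Name_79.html" />
```

Google treats this as a strong hint to consolidate the duplicates onto one URL, and the same trick works for the filtered listing URLs by pointing them at the unfiltered category page.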

Also, if your old URLs match (or don't match) a consistent pattern, you could set up a 301 redirect in Apache, either for pages matching the old pattern or not matching the new pattern...
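When old and new URLs can't be told apart by regex alone (as with the renamed product pages), a lookup file works; a sketch, with the map file path and format invented, again assuming vhost-level config:

```apache
# old-to-new.txt maps each legacy path to its current one, e.g.:
#   /Product_Name_79.html /Brand_Name_Product_Name_79.html
RewriteMap oldnew "txt:/etc/apache2/old-to-new.txt"
# Only redirect if the requested path appears as a key in the map,
# so current URLs (which are not keys) can never loop.
RewriteCond ${oldnew:%{REQUEST_URI}} !=""
RewriteRule .* ${oldnew:%{REQUEST_URI}} [R=301,L]
```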

Hope this helps!

Haroldo
This looks very helpful indeed; I had not given rel="canonical" much thought until now - I always tended to nofollow and leave the other uses of rel unattended. Nice one - it will certainly help a lot here!
Dimitar Christoff