views: 113
answers: 3

Hi,

I have a huge site with more than 5 million URLs.

We already have a PageRank of 7/10. The problem is that with 5 million URLs, and because we add and remove URLs daily (we add ±900 and remove ±300), Google is not fast enough to index all of them. We have a large, heavyweight Perl module that generates this sitemap, which normally consists of 6 sitemap files. Google is certainly not fast enough to pick up all of the URLs, especially because we normally recreate all of those sitemaps daily and resubmit them to Google.

My question is: what would be a better approach? Should I really bother sending 5 million URLs to Google daily, even though I know Google won't be able to process them? Or should I send only the permalinks that won't change and let the Google crawler find the rest, so that at least I have a concise index at Google (today fewer than 200 of the 5,000,000 URLs are indexed)?
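For reference, the daily "submit to Google" step can be done by hitting Google's sitemap ping URL after the rebuild. A minimal sketch (not the actual Perl module described above; the sitemap location is an assumption):

    #!/usr/bin/perl
    # Sketch only -- not the module described above. After the daily rebuild,
    # ping Google so it knows the sitemap index has changed.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);

    my $sitemap  = 'http://www.example.com/sitemap.xml';   # assumed sitemap index location
    my $ping_url = 'http://www.google.com/ping?sitemap=' . uri_escape($sitemap);

    defined get($ping_url)
        or warn "sitemap ping failed: $ping_url\n";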

A: 

Why don't you just compare your sitemap to the previous one each time, and only send Google the URLs that have changed?
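A minimal sketch of such a diff in Perl, assuming the daily URL lists are dumped to plain-text files (the file names here are made up):

    #!/usr/bin/perl
    # Sketch: compare today's URL list with yesterday's and report only the
    # URLs that were added or removed. File names are assumptions.
    use strict;
    use warnings;

    sub read_urls {
        my ($file) = @_;
        open my $fh, '<', $file or die "cannot open $file: $!";
        my %urls;
        while (my $line = <$fh>) {
            chomp $line;
            $urls{$line} = 1 if length $line;
        }
        return \%urls;
    }

    my $old = read_urls('urls-yesterday.txt');
    my $new = read_urls('urls-today.txt');

    my @added   = grep { !$old->{$_} } keys %$new;   # submit these in the sitemap
    my @removed = grep { !$new->{$_} } keys %$old;   # make sure these now 301/404/410

    print "added: $_\n"   for @added;
    print "removed: $_\n" for @removed;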

Autopulated
I do that already. The problem is that we must remove URLs as well.
VP
I would have thought Google would be smart enough to remove URLs if you send it ones that don't exist any more.
Autopulated
Every sitemap item should include a lastmod timestamp, so I don't see why Google should have a problem filtering out the ones that haven't changed since the last time the sitemap was indexed.
Lars Haugseth
The original objection seemed to me to be about the size of the list of URLs submitted to Google, no? Maybe there isn't actually a problem at all.
Autopulated
+1  A: 

What is the point of having a lot of indexed pages that are removed right away? Temporary pages are worthless to search engines and their users once they are gone. So I would let the search engine crawlers decide whether a page is worth indexing. Just tell them the URLs that will persist... and implement some list pages (if there aren't any yet) so that your pages can be crawled more easily.

Note below: 6 sitemap files for 5M URLs? AFAIK, a sitemap file may not contain more than 50k URLs.

peter p
You divide it into a sitemap index pointing to N files, each with up to 50k URLs (see the sketch below).
VP
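For illustration, a minimal sketch of such a sitemap index generated in Perl; the domain, file names, and file count are assumptions:

    #!/usr/bin/perl
    # Sketch: write a sitemap index pointing at N sitemap files, each of which
    # may hold at most 50,000 URLs.
    use strict;
    use warnings;
    use POSIX qw(strftime);

    my $base    = 'http://www.example.com';   # assumed domain
    my $n_files = 100;                        # 100 x 50k covers 5M URLs
    my $lastmod = strftime('%Y-%m-%d', localtime);

    open my $out, '>', 'sitemap.xml' or die "cannot write sitemap.xml: $!";
    print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
    print {$out} qq{<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
    for my $i (1 .. $n_files) {
        print {$out} "  <sitemap>\n",
                     "    <loc>$base/sitemap-$i.xml.gz</loc>\n",
                     "    <lastmod>$lastmod</lastmod>\n",
                     "  </sitemap>\n";
    }
    print {$out} "</sitemapindex>\n";
    close $out;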
Who said the pages are removed right away? Consider eBay auction items with a 7-day lifespan -- is it a bad idea to make those indexable by search engines?
Lars Haugseth
@VP I know, I just wondered because you wrote 6 files.
peter p
@Lars that's right, but AFAIK eBay does not remove auction pages right after the auction is over (at least they are kept for a longer timespan). I have never run into a 404 when clicking an eBay result in a search engine.
peter p
In my situation it's a job search site. A "job offer" has a TTL of 1-9 weeks. I think my problem is comparable to sites like eBay or oDesk. Do they add their "short-lived offers" to their sitemaps? As far as I can see, oDesk does: http://www.odesk.com/sitemap.xml
VP
Ah okay, 9 weeks is a different story, but I'd never include a 1-week page. If you know exactly how long a page is going to exist, you could probably include only the ones that will last longer than 4 weeks (see the sketch below). For the rest, crawlable distributor pages should be enough. With PR 7 the crawler will definitely be visiting your site at short intervals.
peter p
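A sketch of that filter, assuming each job offer carries an expiry timestamp (the data structure and field names are made up):

    #!/usr/bin/perl
    # Sketch: keep only offers that will live longer than 4 weeks; the rest are
    # left to the crawler via the distributor/list pages.
    use strict;
    use warnings;

    use constant FOUR_WEEKS => 4 * 7 * 24 * 60 * 60;   # seconds

    # In reality @offers would come from the job database; 'expires' is an
    # epoch timestamp. Both entries below are made up.
    my @offers = (
        { url => 'http://www.example.com/job/1', expires => time + 10 * 7 * 24 * 60 * 60 },
        { url => 'http://www.example.com/job/2', expires => time + 1 * 7 * 24 * 60 * 60 },
    );

    my @long_lived = grep { $_->{expires} - time > FOUR_WEEKS } @offers;

    print "$_->{url}\n" for @long_lived;   # only these go into the sitemap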
+1  A: 

When URLs change, you should make sure you handle them properly with a 301 status (permanent redirect).

Edit (refinement): You should still try to keep your URL patterns stable. You can use 301 redirects, but maintaining a lot of redirect rules is cumbersome.
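For example, a minimal sketch of issuing such a redirect from a Perl CGI handler; the old and new URL patterns are hypothetical:

    #!/usr/bin/perl
    # Sketch: permanently redirect a hypothetical old job-offer URL pattern to
    # its new location with a 301 status.
    use strict;
    use warnings;
    use CGI;

    my $q    = CGI->new;
    my $path = $ENV{REQUEST_URI} || '';

    if ($path =~ m{^/jobs/old/(\d+)}) {                  # assumed old pattern
        print $q->redirect(
            -uri    => "http://www.example.com/job/$1",  # assumed new pattern
            -status => 301,
        );
        exit;
    }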

manuel aldana