For context, I am highlighting the recent paper on sitemaps, http://www2009.org/proceedings/pdf/p991.pdf, which includes a write-up of an Amazon sitemap case study:
"Amazon publishes 20M URLs listed in Sitemaps using the "Sitemaps" directive in amazon.com/robots.txt. They use a SitemapIndex file that lists about 10K Sitemaps files, each file listing between 20K and 50K URLs. The SitemapIndex file contains the last modification time for each underlying Sitemaps file, based on when the Sitemaps file was created. When we examined the last-modification dates of each Sitemaps file over the past year, we noticed that new Sitemaps files are added every day and do not change after they are added. Each new Sitemaps file is composed of recently added or recently modified URLs."
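To check that I'm reading the scheme correctly, here is a minimal Python sketch of how I imagine such a pipeline appends to the index (the file names and the `add_daily_sitemap` helper are hypothetical, not anything from the paper): each day a new sitemap file is linked from the index with a lastmod equal to its creation date, and existing entries are never touched.

```python
import datetime
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def add_daily_sitemap(index_path, sitemap_url, created_date):
    """Append one <sitemap> entry to an existing sitemap index file.

    The entry points at a daily sitemap file and records the date the
    file was created as its <lastmod>; earlier entries are left alone.
    """
    ET.register_namespace("", SITEMAP_NS)
    tree = ET.parse(index_path)
    root = tree.getroot()  # the <sitemapindex> element

    entry = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
    ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = sitemap_url
    ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = created_date.isoformat()

    tree.write(index_path, encoding="utf-8", xml_declaration=True)

# Example: the file for Jan 1, 2009 is added once and never modified again.
add_daily_sitemap(
    "sitemap_index.xml",
    "http://www.example.com/sitemaps/sitemap-2009-01-01.xml.gz",
    datetime.date(2009, 1, 1),
)
```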
Say I have a URL, http://www.example.com/blah/this-is-a-post-with-comments/, that was created on Jan 1, 2009. The sitemap for that day/period would list this URL as a new entry.
Say on Feb 5, 2009 a new comment was added. The sitemap for this new period would need to include this URL again so the new content gets indexed.
Now, unless one goes back and edits the earlier sitemap to delete this URL, the same URL will appear in two sitemap files.
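Concretely, the situation looks something like the following sketch (again with hypothetical file names and a made-up `write_daily_sitemap` helper): the same URL ends up in two daily sitemap files with different lastmod values, and the Jan 1 file is never edited.

```python
import datetime
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def write_daily_sitemap(path, entries):
    """Write a <urlset> sitemap file; entries is a list of (loc, lastmod) pairs."""
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(root, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod.isoformat()
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

post = "http://www.example.com/blah/this-is-a-post-with-comments/"

# Jan 1, 2009: the post is created, so it goes into that day's sitemap file.
write_daily_sitemap("sitemap-2009-01-01.xml",
                    [(post, datetime.date(2009, 1, 1))])

# Feb 5, 2009: a comment is added, so the same URL appears again in the new
# day's file; the Jan 1 file is never touched, leaving a duplicate entry.
write_daily_sitemap("sitemap-2009-02-05.xml",
                    [(post, datetime.date(2009, 2, 5))])
```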
My questions are:
- Are duplicate URLs across sitemap files OK?
- If one were to edit the older sitemaps to dedupe, wouldn't the lastmod date of those sitemap files change, causing crawlers to re-crawl all of their URLs?
- What is the best practice in such a case?