Hello --


I work for an online media company that hosts a news site with over 75K pages. We currently use Google Sitemap Generator (installed on our server) to build dynamic XML sitemaps for our site. In fact, since we have a ton of content, we use a sitemap of sitemaps. (Google allows a maximum of 50K URLs per sitemap file.)


The sitemaps are generated every 12 hours and are driven by user behavior. That is, the generator parses the server log file, sees which pages are being fetched the most, and builds the sitemaps based on that.

Since we cannot guarantee that NEW pages are being added to the sitemap, is it better to submit a sitemap as an RSS feed? That way, every time one of our editors creates a new page (or article), it is added to the feed and submitted to Google. This brings up the issue of pushing duplicate content to Google, as the sitemap and the RSS feed might contain the same URLs. Will Google penalize us for duplicate content? How do other content-rich or media sites notify Google that they are posting new content?

I understand that Googlebot only indexes pages that it deems important and relevant, but it would be great if it at least crawled any new article that we post.

Any help would be greatly appreciated.

Thanks in advance!


Why not simply have every page in your sitemap? 75K pages isn't a huge number; plenty of sites have several sitemaps totalling millions of pages, and Google will digest them all (although, as you pointed out, Google will only index those it deems important).
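
If it helps, here is a rough sketch of listing every page by cutting the full URL list into files of at most 50K entries and pointing a sitemap index at them. The function names and the example.com URLs are made up for illustration, and it assumes you can pull the complete list of page URLs from your CMS rather than from the server logs.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit mentioned in the question


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_urlset(urls):
    """Return one <urlset> sitemap document (as a string) for the given URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Return a <sitemapindex> document pointing at each sitemap file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical data: 75,000 article URLs on a placeholder domain.
    all_urls = [f"https://example.com/article/{i}" for i in range(75_000)]
    groups = chunk(all_urls)
    files = [build_urlset(group) for group in groups]
    index = build_index(
        f"https://example.com/sitemap-{n}.xml" for n in range(1, len(files) + 1)
    )
    print(len(files), "sitemap files referenced by one index")
```

With 75K pages this gives two sitemap files plus the index, so nothing depends on which pages happen to be popular in the logs.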

One technique would be to split the sitemaps into New and Archived content based on publication date: for example, a single sitemap for all content from the previous 7 days, with the rest of the content split into other sitemap files as appropriate. This may help to get your freshest content indexed quickly.
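
A minimal sketch of that New/Archived split, assuming each page is available as a (URL, publication datetime) pair. The `split_by_age` name, the sample URLs, and the dates are placeholders, not anything Google Sitemap Generator provides.

```python
from datetime import datetime, timedelta, timezone


def split_by_age(pages, now, fresh_days=7):
    """Split (url, published) pairs into fresh and archived URL lists.

    A page counts as fresh if it was published within `fresh_days` of `now`.
    """
    cutoff = now - timedelta(days=fresh_days)
    fresh = [url for url, published in pages if published >= cutoff]
    archived = [url for url, published in pages if published < cutoff]
    return fresh, archived


if __name__ == "__main__":
    now = datetime(2024, 1, 15, tzinfo=timezone.utc)  # fixed date for the example
    pages = [
        ("https://example.com/news/today", now - timedelta(hours=3)),
        ("https://example.com/news/last-week", now - timedelta(days=6)),
        ("https://example.com/news/old", now - timedelta(days=40)),
    ]
    fresh, archived = split_by_age(pages, now)
    print("new sitemap:", fresh)
    print("archive sitemaps:", archived)
```

The fresh list would go into its own small sitemap file that is regenerated often, while the archived list can be written out in larger files that change rarely.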

Back to your question about an RSS feed sitemap: don't worry about duplicate content, as this is not an issue when it comes to sitemaps. Duplicate content is only a problem if you publish the same article several times on the site. Sitemaps and RSS feeds are only links to the content, not the content itself, so if an RSS feed is the easiest way of reporting your fresh content to Google, go for it.
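
For what it's worth, a bare-bones RSS 2.0 feed of new articles could be produced along these lines. This is only a sketch: `build_rss`, the channel details, and the sample article are invented for illustration, and your CMS may well have a feed generator built in already.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(site_title, site_url, articles):
    """Return an RSS 2.0 document (as a string) linking to the given articles.

    `articles` is a list of (title, url, published) tuples, where `published`
    is a timezone-aware datetime.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Latest articles"
    for title, url, published in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(published)
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    feed = build_rss(
        "Example News",  # placeholder site name
        "https://example.com/",
        [("New article", "https://example.com/news/1",
          datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc))],
    )
    print(feed)
```

The feed only carries titles, links, and dates, which is why listing the same URLs in both the feed and the XML sitemaps is not duplicate content.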