Hello --
Background
I work for an online media company that hosts a news site with over 75K pages. We currently use Google Sitemap Generator (installed on our server) to build dynamic XML sitemaps for our site. In fact, since we have a ton of content, we use a sitemap of sitemaps (a sitemap index), because Google only allows a maximum of 50K URLs per sitemap.
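For reference, here is a rough sketch (in Python, not our actual generator code) of how a sitemap index wraps the child sitemaps; the domain and file names are just placeholders:

    # Minimal sketch of building a sitemap index (sitemap of sitemaps).
    # The URLs below are hypothetical placeholders, not our real files.
    from xml.etree import ElementTree as ET

    def build_sitemap_index(sitemap_urls, out_path="sitemap_index.xml"):
        """Wrap a list of child sitemap URLs in a <sitemapindex> document."""
        root = ET.Element("sitemapindex",
                          xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
        for url in sitemap_urls:
            # Each child sitemap holds up to 50K URLs of its own.
            ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
        ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

    build_sitemap_index([
        "http://www.example.com/sitemap-00.xml",
        "http://www.example.com/sitemap-01.xml",
    ])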
Problem
The sitemaps are generated every 12 hours, and the process is driven by user behavior. That is, the generator parses the server log files, sees which pages are being fetched the most, and builds the sitemaps based on that.
Since we cannot guarantee that NEW pages are being added to the sitemap, would it be better to submit a sitemap as an RSS feed? That way, every time one of our editors creates a new page (or article), it is added to the feed and submitted to Google. This also raises the issue of pushing duplicate content to Google, since the sitemap and the RSS feed might contain the same URLs. Will Google penalize us for duplicate content? How do other content-rich or media sites notify Google that they are posting new content?
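To make the RSS idea concrete, here is a rough sketch (Python, illustrative only) of what I have in mind: when an editor publishes an article, we would append an item to an RSS 2.0 feed and then ping Google with the feed URL. The feed URL, article fields, and ping endpoint below are assumptions for the example, not our production setup.

    # Sketch only: append a new article to an RSS 2.0 feed and ping Google.
    import urllib.parse
    import urllib.request
    from xml.etree import ElementTree as ET

    FEED_URL = "http://www.example.com/new-articles.xml"  # hypothetical feed location

    def add_article_to_feed(feed_path, title, link, pub_date):
        """Append a new <item> to an existing RSS 2.0 feed on disk."""
        tree = ET.parse(feed_path)
        channel = tree.getroot().find("channel")
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "pubDate").text = pub_date
        tree.write(feed_path, encoding="utf-8", xml_declaration=True)

    def ping_google(feed_url=FEED_URL):
        """Notify Google that the feed/sitemap changed (ping endpoint assumed)."""
        urllib.request.urlopen(
            "http://www.google.com/ping?sitemap="
            + urllib.parse.quote(feed_url, safe=""))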
I understand that Googlebot only indexes pages that it deems important and relevant, but it would be great if it at least crawled any new article that we post.
Any help would be greatly appreciated.
Thanks in advance!