I am writing a set of functions to generate a sitemap for a website. Let's assume that the website is a blog.

A sitemap, by definition, lists the pages that are available on a website. For a dynamic website, those pages change quite regularly.

Using the example of a blog, the 'pages' will be the blog posts (I'm guessing). Since there is a finite limit on the number of links in a sitemap (ignore sitemap indexes for now), I can't just keep adding the latest blog posts, because at some point in the future the limit will be exceeded.

I have made two (quite fundamental) assumptions in the above paragraph. They are:

Assumption 1:

A sitemap contains a list of the pages on a website. For a dynamic website like a blog, the pages will be the blog posts. Therefore, I can create a sitemap that simply lists the blog posts on the website. (This sounds like a feed to me.)

Assumption 2:

Since there is a hard limit on the number of links in the sitemap file, I can impose some arbitrary limit N and simply regenerate the file periodically to list the latest N blog posts (at this stage, this is indistinguishable from a feed).
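
For concreteness, here is a rough sketch of what I mean in assumption 2 (the post list and its fields are made up for the example, not real code from my site):

    import xml.etree.ElementTree as ET

    N = 500  # some arbitrary limit, well under any hard cap

    def build_latest_posts_sitemap(posts):
        """posts: list of dicts with 'url' and 'updated' (ISO date), newest first."""
        urlset = ET.Element('urlset',
                            xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
        for post in posts[:N]:
            entry = ET.SubElement(urlset, 'url')
            ET.SubElement(entry, 'loc').text = post['url']
            ET.SubElement(entry, 'lastmod').text = post['updated']
        return ET.tostring(urlset, encoding='unicode')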

My questions then are:

  • Are the assumptions (i.e. my understanding of what goes inside a sitemap file) valid/correct?
  • What I described above sounds very much like a feed. Can bots not simply use a feed to index a website (i.e. is a sitemap even necessary)?
  • If I am already generating a file that lists the latest changes, I don't see the point of also producing a sitemap protocol file - can someone explain this?
+1  A: 

Assumption 1 is correct: the sitemap should indeed be a list of the pages on the site. In your case, that would be the blog posts, plus any other pages you have, such as a contact page, home page, about page, and so on.

Yes, it is a bit like a feed, but a feed generally only contains the latest items, while the sitemap should have everything.

From Google's docs:

Sitemaps are particularly helpful if:

  • Your site has dynamic content.
  • Your site has pages that aren't easily discovered by Googlebot during the crawl process—for example, pages featuring rich AJAX or images.
  • Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
  • Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
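
To make the difference from a feed concrete, here's a minimal sketch of a generator that emits every page you know about rather than just the latest few (the URLs and the helper's shape are assumptions for illustration, not part of any particular framework):

    from xml.sax.saxutils import escape

    STATIC_PAGES = ['http://example.com/',
                    'http://example.com/about',
                    'http://example.com/contact']

    def build_full_sitemap(post_urls):
        """post_urls: the URLs of *all* published posts, not just recent ones."""
        out = ['<?xml version="1.0" encoding="UTF-8"?>',
               '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for url in STATIC_PAGES + list(post_urls):
            out.append('  <url><loc>%s</loc></url>' % escape(url))
        out.append('</urlset>')
        return '\n'.join(out)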

Assumption 2 is a little incorrect: the limit for a sitemap file is 50,000 links / 10 MB uncompressed. If you think you are likely to hit that limit, start by creating a sitemap index file that links to just one sitemap, and then add to it as you go.
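
As a rough sketch of that approach (the 50,000 figure is the protocol's per-file limit; the file names and URL layout are placeholders):

    from xml.sax.saxutils import escape

    MAX_LINKS = 50000  # per-sitemap limit from the protocol

    def write_sitemaps_and_index(all_urls, base='http://example.com/'):
        """Split all_urls into sitemap files and write an index pointing at them."""
        urls = list(all_urls)
        index = ['<?xml version="1.0" encoding="UTF-8"?>',
                 '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for n in range(0, len(urls), MAX_LINKS):
            name = 'sitemap-%d.xml' % (n // MAX_LINKS + 1)
            body = ['<?xml version="1.0" encoding="UTF-8"?>',
                    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
            body += ['  <url><loc>%s</loc></url>' % escape(u)
                     for u in urls[n:n + MAX_LINKS]]
            body.append('</urlset>')
            with open(name, 'w') as f:
                f.write('\n'.join(body))
            index.append('  <sitemap><loc>%s%s</loc></sitemap>' % (base, name))
        index.append('</sitemapindex>')
        with open('sitemap-index.xml', 'w') as f:
            f.write('\n'.join(index))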

Google will accept an RSS feed as a sitemap if that's all you have, but points out that feeds usually only contain the most recent links. The value of a sitemap is that it covers everything on the site, not just the latest items, which are probably the most discoverable anyway.

Zhaph - Ben Duguid