Hello everyone,

I have some doubts regarding sitemap.xml generation, and Django's sitemap framework in particular.

Let's say I have a blog application which has post_detail pages with each post's content and a bunch of 'helper' pages like 'view by tag', 'view by author', etc.

  1. Is it mandatory to include each and every page in sitemap.xml, including the 'helper' pages? I want all of the 'helper' pages indexed, as they contain many keywords and a lot of text. I know that sitemaps are designed to help index pages and to give web crawlers some direction, not to limit crawling. What is the best practice here? Include everything, or include only the important pages?
  2. If it's okay to have all of the pages in sitemap.xml, what is the best way to submit plain pages that are not stored in the db to the sitemaps framework? One possible way is to have a sitemap class which returns reversed URLs by URL name. But that doesn't seem DRY at all, because I would need to register those URL names a second time (once in the url() function and again in the Sitemap class).

I could probably have a custom django.conf.urls.defaults.url function to register url-mapping for the sitemap... What do you think?
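
The custom url function idea could look roughly like the sketch below. It is framework-free on purpose: make_sitemap_url, sitemap_names, and the in_sitemap flag are hypothetical names, and fake_url merely stands in for Django's own url() so the snippet runs without a project. In a real project, the collected names would presumably be returned from a Sitemap subclass's items() and passed to reverse() in its location() method.

```python
def make_sitemap_url(url_func):
    """Wrap a URL-registration function so named routes are also recorded.

    Returns the wrapped function and the list it appends route names to.
    """
    sitemap_names = []

    def sitemap_url(pattern, view, name=None, in_sitemap=True, **kwargs):
        # Record the name once, at registration time, so it never has to be
        # repeated in the Sitemap class.
        if in_sitemap and name is not None:
            sitemap_names.append(name)
        return url_func(pattern, view, name=name, **kwargs)

    return sitemap_url, sitemap_names


def fake_url(pattern, view, name=None):
    # Stand-in for Django's url(); returns a plain tuple for illustration.
    return (pattern, view, name)


url, sitemap_names = make_sitemap_url(fake_url)

urlpatterns = [
    url(r'^about/$', 'views.about', name='about'),
    url(r'^contact/$', 'views.contact', name='contact'),
    url(r'^login/$', 'views.login', name='login', in_sitemap=False),
]

print(sitemap_names)  # ['about', 'contact']
```

Each name is written only once, in the URL configuration; the in_sitemap flag is one possible way to keep pages such as a login form out of the sitemap.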

Thank you.

A: 

How a sitemap is used is dictated by the search engine. Some will only index what you have in the sitemap, while others will use it as a starting point and crawl the entire site based on cross-linking.

As for including non-generated pages, we just created a subclass of django.contrib.sitemaps.Sitemap and had it read a plain-text file with one URL per line. Something like:

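import datetime
import re

from django.contrib.sitemaps import Sitemap


class StaticSitemap(Sitemap):
    priority = 0.8
    # Note: evaluated once, when the class is defined (i.e. at import time).
    lastmod = datetime.datetime.now()

    def __init__(self, filename):
        self._urls = []
        try:
            f = open(filename)  # text mode, so each line is a str
        except IOError:
            return  # missing or unreadable file: the sitemap stays empty

        with f:
            for x in f:
                x = re.sub(r"\s*#.*$", '', x)  # strip comments
                x = x.strip()  # clean leading/trailing whitespace
                if not x:
                    continue  # ignore blank lines
                x = x.replace(' ', '%20')  # convert spaces
                if not x.startswith('/'):
                    x = '/' + x  # all paths start from the root
                self._urls.append(x)

    def items(self):
        # Each item is simply a URL path string.
        return self._urls

    def location(self, obj):
        # The item already is the location.
        return obj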

You can invoke it with something like this in your main sitemap routine:

sitemap['static'] = StaticSitemap(settings.DIR_ROOT + '/sitemap.txt')

And our sitemap.txt file looks something like this:

# One URL per line.
# All paths start from root - i.e., with a leading /
# Blank lines are OK.

/tour/
/podcast_archive/
/related_sites/
/survey/
/youtube_videos/

/teachers/
/workshops/
/workshop_listing_info/

/aboutus/
/history/
/investment/
/business/
/contact/
/privacy_policy/
/graphic_specs/
/help_desk/
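
To see how the parsing rules in StaticSitemap.__init__ treat a file like this, here is a small standalone sketch of the same cleanup steps applied to a list of lines. The function name parse_static_urls and the sample lines are made up for illustration; it needs only the standard library.

```python
import re


def parse_static_urls(lines):
    """Apply the same cleanup steps as StaticSitemap.__init__ to each line."""
    urls = []
    for line in lines:
        line = re.sub(r"\s*#.*$", "", line)  # strip comments
        line = line.strip()                  # drop surrounding whitespace
        if not line:
            continue                         # ignore blank lines
        line = line.replace(" ", "%20")      # convert spaces
        if not line.startswith("/"):
            line = "/" + line                # paths are root-relative
        urls.append(line)
    return urls


sample = [
    "# One URL per line.\n",
    "\n",
    "/tour/\n",
    "podcast_archive/   # leading slash is added\n",
    "/about us/\n",
]
print(parse_static_urls(sample))
# ['/tour/', '/podcast_archive/', '/about%20us/']
```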
Peter Rowell
I'm really sorry for taking so long to respond; I totally forgot about it. I don't really like this solution, but it's acceptable. I used urlresolver myself, but it's quite messy as well. So I'm still in doubt.
Vladimir Shulyak
Not to worry. I wasn't in love with it either, but when we did it (summer of 2007) it seemed like a quick way to get it working.
Peter Rowell