I have a site with around 100,000 unique pages.

(1) How do I create a Sitemap for all these links? Should I just list them flat in one large file that follows the Sitemap protocol?

(2) I need to implement this on Google App Engine, where there is a 1000-item query limit and each of my site URLs is stored as a separate entity. How do I work around that limit?

A: 

You can use query cursors to get around the 1000-item query limit. Even with cursors, though, you probably won't entirely solve your problem: generating a sitemap with 100,000 items in it could easily exceed the amount of time that a single request is allowed to run. Generating the sitemap dynamically could also use up all or a large part of your resource quota.
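As a rough illustration of the cursor loop, here is a minimal sketch in plain Python. The `fetch_page` callable is an assumption: it stands in for whatever datastore call you use and is expected to return `(results, next_cursor, more)`, the same shape that ndb's `Query.fetch_page` returns. `fake_fetch_page` and its URL list are made up so the example runs without App Engine.

```python
def iter_all(fetch_page, page_size=1000):
    """Yield every result by following cursors, one page at a time.

    fetch_page(page_size, start_cursor) must return
    (results, next_cursor, more).
    """
    cursor = None
    more = True
    while more:
        results, cursor, more = fetch_page(page_size, cursor)
        for item in results:
            yield item


# Hypothetical stand-in for the datastore: a list paged by integer offset.
_FAKE_URLS = ["/page/%d" % i for i in range(2500)]


def fake_fetch_page(page_size, start_cursor):
    start = start_cursor or 0
    end = start + page_size
    # "more" is True while there are still items past this page.
    return _FAKE_URLS[start:end], end, end < len(_FAKE_URLS)


all_urls = list(iter_all(fake_fetch_page))
```

With ndb, the stand-in could be swapped for something like `lambda size, cur: query.fetch_page(size, start_cursor=cur)`. This only removes the 1000-item cap; the request deadline and quota concerns still apply.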

If your data is not very dynamic, I would consider generating a static sitemap file and including it as part of your deployment package. Even if your data is very dynamic, you probably want to regenerate the file only once per day and do a deployment to put it up on the server.

Adam Crossland
+3  A: 

Sitemaps must be no larger than 10MB (the current protocol has raised this to 50MB uncompressed) and list no more than 50,000 URLs, so you're going to need to break it up somehow.

You're going to need some kind of sharding strategy. I don't know what your data looks like, so for now let's say every time you create a page entity, you assign it a random integer between 1 and 500.

Next, create a Sitemap index, and spit out a sitemap link for each of your index values:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://example.appspot.com/sitemap?random=1</loc>
   </sitemap>
   <sitemap>
      <loc>http://example.appspot.com/sitemap?random=2</loc>
   </sitemap>
   ...
   <sitemap>
      <loc>http://example.appspot.com/sitemap?random=500</loc>
   </sitemap>
</sitemapindex>
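Rather than writing 500 entries by hand, the index can be generated. The following is only a sketch using Python's standard `xml.etree.ElementTree`; the function name `build_sitemap_index` and its parameters are made up for illustration, and the URL pattern simply mirrors the example above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, shard_count):
    """Return a sitemap index with one <sitemap> entry per shard value."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for shard in range(1, shard_count + 1):
        sitemap = ET.SubElement(root, "sitemap")
        loc = ET.SubElement(sitemap, "loc")
        loc.text = "%s?random=%d" % (base_url, shard)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


index_xml = build_sitemap_index("http://example.appspot.com/sitemap", 500)
```

The index depends only on the shard count, not on your data, so it could just as well be produced once and served as a static file.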

Finally, on your sitemap page, query for pages and filter for your random index. If you have 100,000 pages this will give you about 200 URLs per sitemap.
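To make the sharding concrete, here is a small sketch of both halves: assigning the random value when a page is created, and rendering the sitemap for one value. It uses plain dicts in place of datastore entities, so `new_page`, `render_sitemap`, and the field names are assumptions for illustration. On App Engine the `if` inside the loop would instead be an equality filter on the stored shard property.

```python
import random
from xml.sax.saxutils import escape

SHARD_COUNT = 500


def new_page(url, rng=random):
    """Create a page record tagged with a random shard from 1 to SHARD_COUNT."""
    return {"url": url, "shard": rng.randint(1, SHARD_COUNT)}


def render_sitemap(pages, shard):
    """Render the sitemap XML for the pages that belong to one shard."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for page in pages:
        if page["shard"] == shard:
            # escape() protects characters such as & inside URLs.
            lines.append("  <url><loc>%s</loc></url>" % escape(page["url"]))
    lines.append("</urlset>")
    return "\n".join(lines)


pages = [new_page("http://example.appspot.com/p/%d" % i) for i in range(1000)]
shard_7_xml = render_sitemap(pages, 7)
```

Because the assignment is random, shard sizes are only roughly equal. With about 200 URLs per shard on average, each one stays far below both the 1000-item query limit and the 50,000-URL sitemap limit.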

A slightly different strategy here would be to give each page an auto-incrementing numeric ID. To do so, you need a counter object that is transactionally locked and incremented each time a new page is created. The downside of this is that you can't parallelize creation of new page entities. The upside is that you would have a bit more control over how your pages are laid out, as your first sitemap could be pages 1-1000, and so on.
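The mapping between sequential IDs and sitemaps is simple arithmetic. A sketch, assuming IDs start at 1 and each sitemap holds 1000 pages (both the constant and the function names are made up here); the transactional counter itself is not shown.

```python
PAGES_PER_SITEMAP = 1000


def sitemap_number(page_id):
    """Return the 1-based sitemap that a 1-based page ID falls into."""
    return (page_id - 1) // PAGES_PER_SITEMAP + 1


def id_range(sitemap_no):
    """Return the inclusive (first_id, last_id) range covered by a sitemap."""
    first = (sitemap_no - 1) * PAGES_PER_SITEMAP + 1
    return first, first + PAGES_PER_SITEMAP - 1
```

For 100,000 pages this gives 100 sitemaps, and each sitemap request becomes a range query between the two IDs returned by `id_range`.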

Drew Sears
awesome! thanks for making my life simpler :) I am gonna code this in the next 30 mins now :)
demos
Nice strategy! Using an incrementing counter in App Engine is generally a bad idea, though.
Nick Johnson