Hi all.

I am creating a social tool and want to allow search engines to pick up "public" user profiles, like Twitter and Facebook do.

I have seen all the protocol info at http://www.sitemaps.org and I understand it, and how to build such a file, along with an index if I exceed the 50K URL limit.

Where I am struggling is with how to actually make this run.

The sitemap for my general site pages is simple: I can use a tool or a script to create the file, host it, submit it, and be done.

What I then need is a script that will create the sitemaps of user profiles. I assume this would be something like:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
       <url>
          <loc>http://www.socialsite.com/profile/spidee</loc>
          <lastmod>2010-05-12</lastmod>
          <changefreq>???</changefreq>
          <priority>???</priority>
       </url>
       <url>
          <loc>http://www.socialsite.com/profile/webbsterisback</loc>
          <lastmod>2010-05-12</lastmod>
          <changefreq>???</changefreq>
          <priority>???</priority>
       </url>
    </urlset>

I've added some ??? because I don't know how I should set these values for my profiles, based on the following:

When a new profile is created it must be added to a sitemap. If the profile is changed, or if "certain" properties are changed, I don't know whether I update the entry in the map or do something else (updating would be a nightmare!).

Some users may change their profile. In terms of relevance to the search engine, the only way a Google or Yahoo search will find a user's profile (for my requirement) is by [user name] and [location]. So once the entry for the profile has been added to the map file, the only reasons to have the search bot re-index the profile would be if the user changed their user name (which they can't), changed their location, or set their settings so that their profile is "hidden" from search engines.

I assume my map creation will need to be dynamic. From what I have said above, I would imagine that creating a new profile, and possibly editing certain properties, could mark it as needing adding/updating in the sitemap.

Assuming I will have millions of profiles being added and edited, how can I manage this in a sensible manner?

I know I need a script that can append URLs as each profile is created, and I know the script will probably be a TASK running at a set frequency; perhaps the profiles have a property like "indexed" and the TASK sets it to "true" when a profile is added to the map. What I don't see is the best way to store the map. Do I store it in the datastore, i.e.:

    model = sitemaps
    properties:
        key_name = sitemap_xml_1 (and for my index, sitemap_index_xml)
        mapxml = blobstore (the raw XML map or ROR map)
        full = boolean (set true when the URL count reaches 50K) # a shard counter could tell us this instead
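In App Engine terms, I imagine the models looking something like this. This is just a rough sketch of what I mean; all the names are placeholders, and I've used a TextProperty for the XML for simplicity, though it could equally live in the blobstore:

    from google.appengine.ext import db

    class Profile(db.Model):
        # key_name = the profile's URL slug, e.g. "spidee"
        username = db.StringProperty(required=True)
        location = db.StringProperty()
        hidden = db.BooleanProperty(default=False)   # user opted out of search engines
        indexed = db.BooleanProperty(default=False)  # False = not yet in any sitemap
        updated = db.DateTimeProperty(auto_now=True)

    class SitemapChunk(db.Model):
        # key_name = "sitemap_xml_1", "sitemap_xml_2", ... plus "sitemap_index_xml"
        mapxml = db.TextProperty(default=u'')        # the raw XML entries for this chunk
        url_count = db.IntegerProperty(default=0)    # full when this reaches 50K
        full = db.BooleanProperty(default=False)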

To make this work, my thoughts are:

Memcache the current sitemap structure as "sitemap_xml" and keep a sharded counter of the URL count. When my task executes:

1. Build the XML structure for, say, the first 100 URLs marked "indexed == false" (how many could you process at a time?).
2. Test whether the current memcached sitemap is full (shard counter + 100 > 50K).
3. (a) If the map is near full, create a new map entity in the model ("sitemap_xml_2"), update the index file (also stored in my model as "sitemap_index"), and start a new shard counter (or reset it). (b) If the map is not full, grab it from memcache.
4. Append the 100-URL XML structure.
5. Save and re-memcache the map.
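Roughly, I picture the task looking something like this, building on the sketch models above. The batch size, entity names, rebuild_sitemap_index helper, and the changefreq/priority values are all just my guesses:

    from google.appengine.api import memcache
    from google.appengine.ext import db

    BATCH = 100      # profiles folded in per task run
    LIMIT = 50000    # sitemaps.org cap on URLs per file

    URL_TMPL = ('<url><loc>http://www.socialsite.com/profile/%s</loc>'
                '<lastmod>%s</lastmod><changefreq>weekly</changefreq>'
                '<priority>0.5</priority></url>')

    def run_sitemap_task():
        # 1. grab a batch of visible profiles not yet in any sitemap
        profiles = (Profile.all()
                    .filter('indexed =', False)
                    .filter('hidden =', False)
                    .fetch(BATCH))
        if not profiles:
            return

        # 2./3. find the current chunk, rolling over to a new one when full
        chunk = SitemapChunk.all().filter('full =', False).get()
        if chunk is None or chunk.url_count + len(profiles) > LIMIT:
            if chunk:
                chunk.full = True
                chunk.put()
            n = SitemapChunk.all().count() + 1
            chunk = SitemapChunk(key_name='sitemap_xml_%d' % n,
                                 mapxml='', url_count=0)
            rebuild_sitemap_index()  # hypothetical: rewrites sitemap_index_xml

        # 4. append <url> entries (the <urlset> wrapper is added when serving)
        chunk.mapxml += ''.join(
            URL_TMPL % (p.key().name(), p.updated.date().isoformat())
            for p in profiles)
        chunk.url_count += len(profiles)
        chunk.put()

        # 5. flag the profiles as done and refresh the cached copy
        for p in profiles:
            p.indexed = True
        db.put(profiles)
        memcache.set('sitemap:' + chunk.key().name(), chunk.mapxml)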

I can now add a handler using a URL map/route like /sitemaps/*

It grabs the * as the map name and serves the maps from the blobstore/memcache on the fly.
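Something like this is what I have in mind for the handler; again just a sketch, and the wrapper string and route are my assumptions:

    from google.appengine.api import memcache
    from google.appengine.ext import webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    WRAPPER = ('<?xml version="1.0" encoding="UTF-8"?>'
               '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
               '%s</urlset>')  # the index file would need a <sitemapindex> wrapper instead

    class SitemapHandler(webapp.RequestHandler):
        def get(self, name):
            # try memcache first, falling back to the datastore copy
            body = memcache.get('sitemap:' + name)
            if body is None:
                chunk = SitemapChunk.get_by_key_name(name)
                if chunk is None:
                    self.error(404)
                    return
                body = chunk.mapxml
                memcache.set('sitemap:' + name, body)
            self.response.headers['Content-Type'] = 'application/xml'
            self.response.out.write(WRAPPER % body)

    application = webapp.WSGIApplication([(r'/sitemaps/(.+)', SitemapHandler)])

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()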

Now my question is: does this work? Is this the right way, or at least a good way, to start? Will it handle making sure the search bots update when a user changes their profile, possibly by setting the change frequency correctly? Do I need a more advanced system :( or have I re-invented the wheel?

I hope this is all clear and makes some form of sense :-)

A: 

What you describe is very similar to how Django implements its sitemap framework (http://docs.djangoproject.com/en/dev/ref/contrib/sitemaps/), specifically the section on creating index files: http://docs.djangoproject.com/en/dev/ref/contrib/sitemaps/#creating-a-sitemap-index

If you want to see it on App Engine with a patched version of the helper, you can look here: http://code.google.com/p/dherbst-app-engine-django/wiki/Sitemaps

These are the changes applied to the helper: http://code.google.com/p/dherbst-app-engine-django/source/detail?r=509403105ec97fb1f3dfeadfada808f2cf1ff9a7
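For a sense of how little code it takes, a profile sitemap in Django's framework boils down to a class like this (the Profile model and its fields are placeholders for whatever you have):

    from django.contrib.sitemaps import Sitemap
    from myapp.models import Profile  # placeholder model

    class ProfileSitemap(Sitemap):
        changefreq = 'weekly'
        priority = 0.5
        limit = 50000  # Django paginates into an index plus per-page files at this size

        def items(self):
            # only profiles that haven't opted out of search engines
            return Profile.objects.filter(hidden=False)

        def lastmod(self, obj):
            return obj.updated

Django builds each <loc> from the item's get_absolute_url() and regenerates the XML on request, so there is no append/update bookkeeping on your side.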

dar
+1  A: 

Update frequency

Cache invalidation is a hard problem; see Cache Invalidation - Is there a General Solution?

As far as I can see, you need to decide how often you want search bots to recrawl your site, rather than how often things actually change; if a user's page may contain information they want removed at short notice, then you want the search bot to recrawl within a couple of days, even though profiles change rarely on average.
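In sitemap terms, that decision is exactly what goes into the ??? placeholders from your example. One possible policy, as a suggestion only (crawlers treat these as hints, not commands):

    # Profiles rarely change on average, but removals and "hide me"
    # settings should be honoured within days, so hint at a faster
    # recrawl than the actual average change rate.
    PROFILE_CHANGEFREQ = 'daily'  # valid values: always, hourly, daily, weekly, monthly, yearly, never
    PROFILE_PRIORITY = '0.5'      # 0.5 is the protocol default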

Keeping an up-to-date map

Since the speed of your website now figures in its Google PageRank, it's worth keeping a static file ready to serve to the spiders. Perhaps have one script that continually updates a db table of sitemap entries, and another that periodically regenerates the static file(s) from that table. That way there is always a static version available for the spiders, and it can all happen asynchronously.

Static pages on App Engine

I forgot that you can't write static files at runtime on App Engine. According to this SO question, the best way is to generate your file and push it to memcache. Also see the documentation on using memcache with App Engine.
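A minimal sketch of that push, assuming a hypothetical SitemapEntry model that stores one ready-made <url> element per row:

    from google.appengine.api import memcache

    def regenerate_sitemap():
        # Rebuild the spider-facing XML from the entries table and cache it,
        # so the serving handler never does this work inline. For millions
        # of rows you would page through with query cursors instead.
        rows = SitemapEntry.all().fetch(1000)
        xml = ('<?xml version="1.0" encoding="UTF-8"?>'
               '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
               '%s</urlset>') % ''.join(row.xml for row in rows)
        memcache.set('sitemap:static', xml)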

Phil H
Am I correct in saying that on App Engine the only way to serve a so-called "static file" that is dynamically constructed is to store the generated page in the datastore and serve that content via a handler? On App Engine it is not possible to create and save a static file at runtime?
spidee