views:

666

answers:

3

I have hierarchical data stored in the datastore using a model which looks like this:

class ToolCategories(db.Model):
name = db.StringProperty()
parentKey = db.SelfReferenceProperty(collection_name="parent_category")
...
...

I want to print all the category names preserving the hierarchy, say in some form like this :

--Information Gathering
----OS Fingerprinting
----DNS
------dnstool
----Port Scanning
------windows
--------nmap
----DNS3
----wireless sniffers
------Windows
--------Kismet

To do the above I have used simple recursion using the back referencing capability:

class GetAllCategories (webapp.RequestHandler) :

    def RecurseList(self, object, breaks) :
            output = breaks + object.name + "</br>"
            for cat in object.parent_category:
                    output = output + self.RecurseList(cat, breaks + "--")

            return output



    def get (self) :
            output = ""
            allCategories = ToolCategories.all().filter(' parentKey = ', None)
            for category in allCategories :
                    output = output + self.RecurseList(category, "--")

            self.response.out.write(output)

As I am very new to App engine programming (hardly 3 days since I started writing code), I am not sure if this the most optimized way from the Datastore access standpoint to do the desired job.

Is this the best way? if not what is?

+2  A: 

You have a very reasonable approach! My main caveat would be one having little to do with GAE and a lot with Python: don't build a string from pieces with + or +=. Rather, you make a list of string pieces (with append or extend or list comprehensions &c) and when you're all done you join it up for the final string result with ''.join(thelist) or the like. Even though recent Python versions strive hard to optimize the intrinsically O(N squared) performance of the + or += loops, in the end you're always better off building up lists of strings along the way and ''.joining them up at the very end!

Alex Martelli
@Jake, thanks for the fast accept! Funny to get an accept without an upvote, though, I think it's the first time it happened to me in 2 months on SO;-).
Alex Martelli
Ah, I knew that upvote-less accept couldn't last...!-)
Alex Martelli
Thanks for the suggestion Alex! I will make changes and use the join() on the final list. Just needed a quick clarification: From a Datastore standpoint using the Reference Property to access related data is the fastest way to do it - am I right?
Jake
My browser crashed before I could do the upvote :( ... Both are done now :) Thanks for the speedy replies!
Jake
Yes @Jake, you're correct -- reference property is just about the best you can do for this purpose.
Alex Martelli
@Alex Thanks a ton!
Jake
Good commentary, but the inefficiency of string concatenation here pales before the inefficiency of doing N datastore queries. :)
Nick Johnson
+2  A: 

The main disadvantage of your approach is that because you're using the "adjacency list" way of representing trees, you have to do one datastore query for each branch of the tree. Datastore queries are fairly expensive (around 160ms each), so constructing the tree, particularly if it's large, could be rather expensive).

There's another approach, which is essentially the one taken by the datastore for representing entity groups: Instead of just storing the parent key, store the entire list of ancestors using a ListProperty:

class ToolCategories(db.Model):
  name = db.StringProperty()
  parents = db.ListProperty(db.Key)

Then, to construct the tree, you can retrieve the entire thing in one single query:

q = ToolCategories.all().filter('parents =', root_key)
Nick Johnson
Nick, Thanks for the pointer! Being a Appengine newbie, the problem I am facing is that I am not able to visualize the number of actual Datastore queries for any Datastore access statement I write. With SQL this was easy to do and thus one could estimate the "cost" of a query. I could not find a good documentation about "Datastore query costs" anywhere and thus am getting stuck with optimization issues. Is there a detailed document somewhere on this?
Jake
All datastore queries have cost proportional to the number of entries returned, with a (large-ish) constant-factor for the round trip to the datastore. Therefore, you mostly just need to add up number of round trips and number of entities returned, and try and optimize both. In the case of your original example, the problem is exacerbated by the implicit datastore operations that are performed when you retrieve a ReferenceProperty's collection.
Nick Johnson
Incidentally, I view this as one of the _advantages_ of the datastore: While an SQL query's cost is not obvious from the SELECT statement, and depends on the nature of the data and even the individual DB, a datstore query always has the same cost, regardless of those variables.
Nick Johnson
A: 

Nick wrote: Datastore queries are fairly expensive (around 160ms each)

If true, a webapp w/ 1 second response time can only do a few queries.

kellerapps