Does anyone out there have any great ideas for building a massively scalable hierarchical datastore? It needs rapid inserts and must let many users of the site request reports on the number of nodes below a certain node in the hierarchy.

This is the scenario.

I will have a very large number of nodes added per hour. Let's say I want to add 1 million nodes per hour. They will likely appear all over the hierarchy. Ideally the scale will reach billions of nodes, but 50 million is a target to aim for. I need to be able to calculate, at any time, the number of nodes below any given point, and there will likely be many people doing this at the same time. Think of it as a report that many users (perhaps 100,000 concurrent) will be calling at any one time. They might request all nodes below a certain node.

The database could either be built by a single process reading from a flat table formatted as an adjacency list (rapid inserts, slow reporting), or it could be a standard design where users of the web site update the hierarchy directly, provided a datastore exists that can cope with the massive number of nodes being created.
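To make the first option concrete, here is a minimal sketch of what the single batch process might do: read `(node_id, parent_id)` rows from the flat adjacency list and compute the number of nodes below every node in one pass. The row layout and the function name `descendant_counts` are assumptions for illustration, not part of any existing system.

```python
from collections import defaultdict


def descendant_counts(rows):
    """Return {node_id: number of nodes below it}.

    rows is an iterable of (node_id, parent_id) pairs, with parent_id
    set to None for root nodes (an assumed layout for the flat table).
    """
    children = defaultdict(list)
    roots = []
    for node_id, parent_id in rows:
        if parent_id is None:
            roots.append(node_id)
        else:
            children[parent_id].append(node_id)

    counts = {}
    # Iterative post-order walk, so deep trees do not hit the recursion limit.
    stack = [(root, False) for root in roots]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, ())
        if expanded:
            # All children are already counted at this point.
            counts[node] = sum(counts[kid] + 1 for kid in kids)
        else:
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
    return counts


rows = [(1, None), (2, 1), (3, 1), (4, 2), (5, 4)]
print(descendant_counts(rows))
```

The whole pass is linear in the number of rows, but it recomputes everything, so reports are only as fresh as the last run.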

I already have this implemented in Django using Treebeard and MySQL. I am using a Materialised Path method and it is fairly good, but I want lightning speed in comparison. With a datastore of 30,000 nodes I am achieving 120 inserts per minute at the bottom of the tree, running on a 2-year-old laptop. I obviously want a lot more than this and think that maybe there is a better datastore to use. Maybe PyTables, BigTable, MongoDB or Cassandra?
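For readers unfamiliar with the materialised path idea, the sketch below shows the general technique: each node stores the full path from the root, so counting the nodes below a node becomes a prefix query. This is not Treebeard's actual schema or API. The table layout, the 4-digit segment width, the virtual root with an empty path, and the helper names are all assumptions, and SQLite stands in for MySQL so the example is self-contained.

```python
import sqlite3

STEP = 4  # assumed fixed width of each path segment


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node ("
        " path TEXT PRIMARY KEY,"
        " numchild INTEGER NOT NULL DEFAULT 0)"
    )
    # Virtual root with an empty path, so every real node has a parent row.
    conn.execute("INSERT INTO node (path) VALUES ('')")
    return conn


def add_child(conn, parent_path):
    """Append a child under parent_path and return the new node's path."""
    (numchild,) = conn.execute(
        "SELECT numchild FROM node WHERE path = ?", (parent_path,)
    ).fetchone()
    child_path = parent_path + format(numchild + 1, "0%dd" % STEP)
    conn.execute("INSERT INTO node (path) VALUES (?)", (child_path,))
    conn.execute(
        "UPDATE node SET numchild = numchild + 1 WHERE path = ?",
        (parent_path,),
    )
    return child_path


def count_descendants(conn, path):
    """Count nodes strictly below path using a prefix match."""
    (total,) = conn.execute(
        "SELECT COUNT(*) FROM node WHERE path LIKE ? AND path <> ?",
        (path + "%", path),
    ).fetchone()
    return total


conn = make_db()
a = add_child(conn, "")
b = add_child(conn, a)
c = add_child(conn, a)
d = add_child(conn, b)
print(a, b, d, count_descendants(conn, a))
```

Paths here contain only digits, so the `LIKE` pattern has no wildcard characters to escape. Each insert costs two writes, and the count still scans every matching row in the index range, which is the part that gets slower as subtrees grow.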

Easy integration into Python/Django would be good, but I can always write this part of the system in another language if I have to. If we used a single process that reads from the flat datastore and writes into a really efficient hierarchical datastore that is ideal for reporting, I guess I would have no concurrency issues, which would remove the need for transactions.
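One way the single-writer idea could look, as a rough sketch only: the lone process keeps a parent pointer and a running "nodes below" counter per node, and bumps the counter of every ancestor on each insert, so a report is a single lookup. The class `CountingTree` and its methods are hypothetical names, and the in-memory dicts stand in for whatever datastore is eventually chosen.

```python
class CountingTree:
    """Single-writer store that keeps a running descendant count per node."""

    def __init__(self):
        self.parent = {}  # node_id -> parent_id (None for roots)
        self.below = {}   # node_id -> number of nodes below it

    def add(self, node_id, parent_id=None):
        if node_id in self.parent:
            raise ValueError("duplicate node: %r" % (node_id,))
        if parent_id is not None and parent_id not in self.parent:
            raise KeyError("unknown parent: %r" % (parent_id,))
        self.parent[node_id] = parent_id
        self.below[node_id] = 0
        # Walk up the ancestor chain; cost is proportional to node depth.
        ancestor = parent_id
        while ancestor is not None:
            self.below[ancestor] += 1
            ancestor = self.parent[ancestor]

    def count_below(self, node_id):
        # Constant-time read, which is what the reporting side needs.
        return self.below[node_id]


tree = CountingTree()
tree.add(1)
tree.add(2, 1)
tree.add(3, 2)
tree.add(4, 1)
print(tree.count_below(1), tree.count_below(2), tree.count_below(4))
```

The trade-off is that writes cost one counter update per ancestor, which is acceptable only if the tree stays reasonably shallow. With a single writer, no locking is needed around those updates.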

Anyway, that's enough info to get us started. Is this easy using the right technology?

+1  A: 

Have you looked at the Neo4j graph database? It seems pretty darn capable, and has a Python wrapper and some support (in development) for Django. Neo4j runs on Java, and you can use it either with Jython or with JPype and CPython.

stevejalim
Thanks Steve, I will look at it.
Rich
Hi Steve, I looked at Neo4j. It appears amazing for graph data but has a little way to go in terms of horizontal scalability. I think that V2 solves all that so I will be keeping a close eye on it to see how it develops. I am just looking at PyTables. It might be another contender as it is specifically targeted at hierarchical data.
Rich
Ah that's interesting. Have to admit that I've not looked at PyTables, so will check it out in return. Thanks!
stevejalim
Hi Steve, have you seen Redis? That seems to be perfect (so far!)
Rich
Yeah, Redis is great - am using it in just one proj at the mo. MongoDB is great too. Not tried Couch yet, though :o)
stevejalim