views:

983

answers:

3

I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).

One property I dislike about the standard approach is that to insert a node you have to touch (on average) N/2 of the nodes (everything with left or right higher than the insert point).

The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.

This seems bad for concurrency and scaling. Imagine you have a tree rooted at the world containing user groups for every account in a large system, it's extremely large, to the point you must store subsets of the tree on different servers. Touching half of all the nodes to add a node to the bottom of the tree is bad.

Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.

Here's an example with Nmax = 64 (this would normally be the MAX_INT of your DB)

                     0:64
              ________|________
             /                 \
         1:31                   32:63
        /    \                 /     \
    2:14    15-30         33:47       48:62

Here, a node is added to the left half of the tree.

                     0:64  
              ________|________
             /                 \
         1:31                  32:63
      /   |   \               /     \
  2:11  11:20  21:30     33:47       48:62

The alogorithm must be extended for the insert and removal process to recursively renumber to the left/right indexes for the subtree. Since querying for immediate children of a node is complicated, I think it makes sense to also store the parent id in the table. The algorithm can then select the sub tree (using left > p.left && right < p.right), then use node.id and node.parent to work through the list, subdividing the indexes.

This is more complex than just incrementing all the indexes to make room for the insert (or decrementing for removal), but it has the potential to affect far fewer nodes (only decendenants of the parent of the inserted/removed node).

My question(s) are basically:

  1. Has this idea been formalized or implemented?

  2. Is this the same as nested intervals?

A: 

You can split your table into two: the first is (node ID, node value), the second (node ID, child ID), which stores all the edges of the tree. Insertion and deletion then become O(tree depth) (you have to navigate to the element and fix what is below it).

The solution you propose looks like a B-tree. If you can estimate the total number of nodes in your tree, then you can choose the depth of the tree beforehand.

Dmitry Risenberg
This is the adjacency list model. The problem with it is that to read the entire tree also takes O(tree depth) queries. Nested sets improve on that significantly (single query).
Mark Renouf
+1  A: 

I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.

Nick Johnson
FYI, this approach is referred to as 'materialized paths'. I believe this might be the safest approach for is for the time being. It's certainly easier to code correctly (less complex). I'm still researching though :-)
Mark Renouf
+1  A: 

I have heard of people doing this before, for the same reasons, yes.

Note that you do lose at a couple of small advantages of the algorithm by doing this

  • normally, you can tell the number of descendants of a node by ((right - left + 1) div 2). This can occasionally be useful, if e.g. you'd displaying a count in a treeview which should include the number of children to be found further down in the tree
  • Flowing from the above, it's easy to select out all leaf nodes -- WHERE (right = left + 1).

These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.

That said, it does sound like materialized paths may be more useful to you, as suggested above.

Cowan