views:

2807

answers:

11

How would you get tree-structured data from a database with the best performance? For example, say you have a folder-hierarchy in a database. Where the folder-database-row has ID, Name and ParentID columns.

Would you use a special algorithm to get all the data at once, minimizing the amount of database-calls and process it in code?

Or would you use do many calls to the database and sort of get the structure done from the database directly?

Maybe there are different answers based on x amount of database-rows, hierarchy-depth or whatever?

Edit: I use Microsoft SQL Server, but answers out of other perspectives are interesting too.

+1  A: 

If you have many trees in the database, and you will only ever get the whole tree out, I would store a tree ID (or root node ID) and a parent node ID for each node in the database, get all the nodes for a particular tree ID, and process in memory.

However if you will be getting subtrees out, you can only get a subtree of a particular parent node ID, so you either need to store all parent nodes of each node to use the above method, or perform multiple SQL queries as you descend into the tree (hope there are no cycles in your tree!), although you can reuse the same Prepared Statement (assuming that nodes are of the same type and are all stored in a single table) to prevent re-compiling the SQL, so it might not be slower, indeed with database optimisations applied to the query it could be preferable. Might want to run some tests to find out.

If you are only storing one tree, your question becomes one of querying subtrees only, and the second answer applied.

JeeBee
A: 

I am a fan of the simple method of storing an ID associated with its parentID:

ID     ParentID
1      null
2      null
3      1
4      2
...    ...

It is easy to maintain, and very scalable.

Galwegian
-1: repeats stuff in the question.
S.Lott
It wasn't there when I answered originally.
Galwegian
Actually, it's not scalable. If you frequently work with a whole tree of depth n, you will need n queries to fetch all the data. For tall busy trees (e.g. a forum), this can be a performance killer.
staticsan
A: 

Google for "Materialized Path" or "Genetic Trees"...

Thomas Hansen
+1  A: 

In Oracle there is SELECT ... CONNECT BY statement to retrieve trees.

Dmitry Khalatov
+5  A: 

It really depends on how you are going to access the tree.

One clever technique is to give every node a string id, where the parent's id is a predictable substring of the child. For example, the parent could be '01', and the children would be '0100', '0101', '0102', etc. This way you can select an entire subtree from the database at once with:

SELECT * FROM treedata WHERE id LIKE '0101%';

Because the criterion is an initial substring, an index on the ID column would speed the query.

Ned Batchelder
You just have to be sure that the number of digits per level (2 in this case) * the number of levels is allowed for in that CHAR column. This imposes some artificial (but manageable) limitations.
S.Lott
+1  A: 

There are several common kinds of queries against a hierarchy. Most other kinds of queries are variations on these.

  1. From a parent, find all children.

    a. To a specific depth. For example, given my immediate parent, all children to a depth of 1 will be my siblings.

    b. To the bottom of the tree.

  2. From a child, find all parents.

    a. To a specific depth. For example, my immediate parent is parents to a depth of 1.

    b. To an unlimited depth.

The (a) cases (a specific depth) are easier in SQL. The special case (depth=1) is trivial in SQL. The non-zero depth is harder. A finite, but non-zero depth, can be done via a finite number of joins. The (b) cases, with indefinite depth (to the top, to the bottom), are really hard.

If you tree is HUGE (millions of nodes) then you're in a world of hurt no matter what you try to do.

If your tree is under a million nodes, just fetch it all into memory and work on it there. Life is much simpler in an OO world. Simply fetch the rows and build the tree as the rows are returned.

If you have a Huge tree, you have two choices.

  • Recursive cursors to handle the unlimited fetching. This means the maintenance of the structure is O(1) -- just update a few nodes and you're done. However fetching is O(n*log(n)) because you have to open a cursor for each node with children.

  • Clever "heap numbering" algorithms can encode the parentage of each node. Once each node is properly numbered, a trivial SQL SELECT can be used for all four types of queries. Changes to the tree structure, however, require renumbering the nodes, making the cost of a change fairly high compared to the cost of retrieval.

S.Lott
SQL CTE eliminates the need for recursive cursors and has some optimization for join folding - but still it is an expensive call to enumerate large hierarchies.
stephbu
Same the Oracles CONNECT-BY. Works, but it's S... L... O... W...
S.Lott
If you have millions of nodes, you could make tree of trees. Each tree contained in the DB as BLOB. You read top tree (with millions of leafs) where each leaf will have ID to it's sub-tree with millions of leafs. That way you'll have billions of leafs and fast reading if queries have locality no more than a few sub-trees.
The_Ghost
A: 

This article is interesting as it shows some retrieval methods as well as a way to store the lineage as a derived column. The lineage provides a shortcut method to retrieve the hierarchy without too many joins.

Turnkey
+5  A: 

look into the nested sets hierarchy model. it's pretty cool and useful.

Mladen Prajdic
+3  A: 

In the product I work on we have some tree structures stored in SQL Server and use the technique mentioned above to store a node's hierarchy in the record. i.e.

tblTreeNode
TreeID = 1
TreeNodeID = 100
ParentTreeNodeID = 99
Hierarchy = ".33.59.99.100."
[...] (actual data payload for node)

Maintaining the the hierarchy is the tricky bit of course and makes use of triggers. But generating it on an insert/delete/move is never recursive, because the parent or child's hierarchy has all the information you need.

you can get all of node's descendants thusly:

SELECT * FROM tblNode WHERE Hierarchy LIKE '%.100.%'

Here's the insert trigger:

--Setup the top level if there is any
UPDATE T 
SET T.TreeNodeHierarchy = '.' + CONVERT(nvarchar(10), T.TreeNodeID) + '.'
FROM tblTreeNode AS T
 INNER JOIN inserted i ON T.TreeNodeID = i.TreeNodeID
WHERE (i.ParentTreeNodeID IS NULL) AND (i.TreeNodeHierarchy IS NULL)

WHILE EXISTS (SELECT * FROM tblTreeNode WHERE TreeNodeHierarchy IS NULL)
 BEGIN
  --Update those items that we have enough information to update - parent has text in Hierarchy
  UPDATE CHILD 
  SET CHILD.TreeNodeHierarchy = PARENT.TreeNodeHierarchy + CONVERT(nvarchar(10),CHILD.TreeNodeID) + '.'
  FROM tblTreeNode AS CHILD 
   INNER JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID
  WHERE (CHILD.TreeNodeHierarchy IS NULL) AND (PARENT.TreeNodeHierarchy IS NOT NULL)
 END

and here's the update trigger:

--Only want to do something if Parent IDs were changed
IF UPDATE(ParentTreeNodeID)
 BEGIN
  --Update the changed items to reflect their new parents
  UPDATE CHILD
  SET CHILD.TreeNodeHierarchy = CASE WHEN PARENT.TreeNodeID IS NULL THEN '.' + CONVERT(nvarchar,CHILD.TreeNodeID) + '.' ELSE PARENT.TreeNodeHierarchy + CONVERT(nvarchar, CHILD.TreeNodeID) + '.' END
  FROM tblTreeNode AS CHILD 
   INNER JOIN inserted AS I ON CHILD.TreeNodeID = I.TreeNodeID
   LEFT JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID

  --Now update any sub items of the changed rows if any exist
  IF EXISTS (
    SELECT * 
    FROM tblTreeNode 
     INNER JOIN deleted ON tblTreeNode.ParentTreeNodeID = deleted.TreeNodeID
   )
   UPDATE CHILD 
   SET CHILD.TreeNodeHierarchy = NEWPARENT.TreeNodeHierarchy + RIGHT(CHILD.TreeNodeHierarchy, LEN(CHILD.TreeNodeHierarchy) - LEN(OLDPARENT.TreeNodeHierarchy))
   FROM tblTreeNode AS CHILD 
    INNER JOIN deleted AS OLDPARENT ON CHILD.TreeNodeHierarchy LIKE (OLDPARENT.TreeNodeHierarchy + '%')
     INNER JOIN tblTreeNode AS NEWPARENT ON OLDPARENT.TreeNodeID = NEWPARENT.TreeNodeID

 END

one more bit, a check constraint to prevent a circular reference in tree nodes:

ALTER TABLE [dbo].[tblTreeNode]  WITH NOCHECK ADD  CONSTRAINT [CK_tblTreeNode_TreeNodeHierarchy] CHECK  
((charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy],(charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy]) + 1)) = 0))

I would also recommend triggers to prevent more than one root node (null parent) per tree, and to keep related nodes from belonging to different TreeIDs (but those are a little more trivial than the above.)

You'll want to check for your particular case to see if this solution performs acceptably. Hope this helps!

+4  A: 

Out of all the ways to store a tree in a RDMS the most common are adjacency lists and nested sets. Nested sets are optimized for reads and can retrieve an entire tree in a single query. Adjacency lists are optimized for writes and can added to with in a simple query.

With adjacency lists each node a has column that refers to the parent node or the child node (other links are possible). Using that you can build the hierarchy based on parent child relationships. Unfortunately unless you restrict your tree's depth you cannot pull the whole thing in one query and reading it is usually slower than updating it.

With the nested set model the inverse is true, reading is fast and easy but updates get complex because you must maintain the numbering system. The nested set model encodes both parentage and sort order by enumerating all of the nodes using a preorder based numbering system.

I've used the nested set model and while it is complex for read optimizing a large hierarchy it is worth it. Once you do a few exercises in drawing out the tree and numbering the nodes you should get the hang of it.

My research on this method started here: http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Bernard