views:

375

answers:

5

This is a followup to:
http://stackoverflow.com/questions/3073614/mysql-is-it-possible-to-get-all-sub-items-in-a-hierarchy

I have an arbitrary-depth adjacency list model table (I am at the point that I can convert it into a nested set model.

I read the MySQL data on how to use a nested set model, though it seemed to get increasingly complex and very complex to do basic functions such as inserting, updating and deleting.

Another blog showing how to use a trigger system with the adjacency list model to keep a table of ancestors that relates each object to its ancestors.


Right now I need to be able to return a list of all children of a given node, to change or delete them. This hierarchical structure won't be changing all the time once created, but there will be a mass amount of the hierarchical structures.

The three methods I see are:

  1. Created a Stored Procedure which would do a recursive query that returns all children.

  2. Convert to Nested Set Model which would require to get into the complexities and possibly create a stored procedure to add, edit and delete in that.

  3. Create the Ancestor Table described above on insert/delete triggers to handle all of the data.

If there are other methods I'm not exploring, please let me know and I'll update this list.

+1  A: 

I would always go with the Nested Set for shear simplicity and convienience. I always suggest this article. It shows excelent the queries that are needed for the work with such hierachrchical data. The only disadvantage I see here is that it can get slower with inserting/updateing new records when the hierachry reached a certain level of complexity, but the reading is faster than many other solutions I hae seen.

Just to give you an example from the articel above:

SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as lev4
FROM category AS t1
LEFT JOIN category AS t2 ON t2.parent = t1.category_id
LEFT JOIN category AS t3 ON t3.parent = t2.category_id
LEFT JOIN category AS t4 ON t4.parent = t3.category_id
WHERE t1.name = 'ELECTRONICS';

+-------------+----------------------+--------------+-------+
| lev1        | lev2                 | lev3         | lev4  |
+-------------+----------------------+--------------+-------+
| ELECTRONICS | TELEVISIONS          | TUBE         | NULL  |
| ELECTRONICS | TELEVISIONS          | LCD          | NULL  |
| ELECTRONICS | TELEVISIONS          | PLASMA       | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | MP3 PLAYERS  | FLASH |
| ELECTRONICS | PORTABLE ELECTRONICS | CD PLAYERS   | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | 2 WAY RADIOS | NULL  |
+-------------+----------------------+--------------+-------+
6 rows in set (0.00 sec)

SQL wise, I don't think it can get any prettier and simpler ;)

I have no idea to the Stored Procedure way. But since it involces recursion (in your case), I don't know if it will be fast with many levels in the hierarchy. I assume you can give it a try.

DrColossos
That was the article I was used for the Nested Set Model. The problem I have with that is it talks about locking the entire table whenever you are doing an insert, update or delete. I can't do that. The other method you showed for the Adjacency List Model works for known depths. I have arbitrary depths.
Kerry
I don't think that locking should be necessary. If you use InnoDB as engine, you should be on the safe side.
DrColossos
It's MyISAM right now -- do you know of a good reference that talks ab out the differences/advantages/disadvantages?
Kerry
http://stackoverflow.com/search?q=myisam+innodb ;)
DrColossos
Ah, I'm still unconvinced its the best. What's the advantage of the nested over the ancestor table? (My current favorite). I'm unconcerned about data usage and have triggers automatically managing that table.
Kerry
I never worked with ancestor tables for hierachries so it's a bit hard for me to tell. I always prefered nested sets because of their good select performance and powerfull queries (but maybe I'm wrong and there are way better solutions). When you feel ancestor tables are the way to go than I would say "go with it". Sorry that I'm not able to give a proper answer, maybe someone here on SO has worked with both and can give better ideas.
DrColossos
Your answer has been very helpful, so thank you for that :)
Kerry
+2  A: 

Quassnoi has run some performance tests on the nested sets model and the adjacency list model and documented the results and recommendations in his blog post Adjacency list vs. nested sets: MySQL. The executive summary is:

  • Nested sets is faster for fetching all child nodes or all parent nodes.
  • Nested sets is a bad idea if you frequently need to update the table.

Here is the conclusion from his article:

In MySQL, the nested sets model should be preferred if the updates to the hierarhical structure are infrequent and it is affordable to lock the table for the duration of an update (which can take minutes on a long table).

This implies creating the table using MyISAM storage engine, creating the bounding box of a GEOMETRY type as described above, indexing it with a SPATIAL index and persisting the level in the table.

If the updates to the table are frequent or it is inaffordable to lock the table for a long period of time implied by an update, then the adjacency list model should be used to store the hierarchical data.

This requires creating a function to query the table.

The rest of the article shows how to define the table, implement the queries and gives performance measurements. The use of the spatial index is a clever idea to improve the performance of the nested set model that might be new to you.


If you're also considering approaches without MySQL then you might want to look at PostgreSQL which is another free and open-source database. PostgreSQL supports recursive queries in the form of recursive common table expressions which make querying heirarchical data easier than in MySQL and also give better performance. Quassnoi has also written an article Adjacency list vs. nested sets: PostgreSQL that shows the details.

While we are talking about looking at other approaches, Oracle's database is also worth a mention. Oracle also have a custom extension CONNECT BY which make querying heirarchical data very easy and fast. Quassnoi's article Adjacency list vs. nested sets: Oracle again covers the performance details. The query you need to get all children is extremely simple in this case:

SELECT *
FROM yourtable
START WITH id = 42
CONNECT BY parent = PRIOR id
Mark Byers
+1  A: 

Maybe you should consider using document-oriented database like MongoDB. It could make your life a lot easier.

quantumSoup
I didn't dare to suggest this, but I totally agree. Also consider object-oriented databases like Tamino (http://www.softwareag.com/Corporate/products/wm/tamino/default.asp)
A: 

I once had to store a complex hierarchical arbitrary-depth bill-of-material system in a SQL-like database manager that wasn't really up to the task, and it ended up forcing messy and tricky indicies, data definitions, queries, etc. After restarting from scratch, using the db manager to provide only an API for record reads and writes on simple indexed keys, and doing all of the actual input/manipulation/reporting in external code, the final result was quicker to implement, easier to understand, and simpler to maintain and enhance. The most complex query needed was essentially SELECT A FROM B.

So, instead of embedding logic and operations inside the restrictions of MySQL, consider banging out code to do what you want, and relying on MySQL only for the lowest-level gets/puts.

joe snyder
+1  A: 

When dealing with hierarchical data sets I find it best to approach it with caching in mind. One of the main benefits to this way of dealing with this issue this way is it doesn't require de-normalizing you database into something that might be harder to mutate.

Since memory heaps' (memcache,redis,etc) lookups are much faster than SQL for simple id -> data resolutions, I would use them to cache a list of the ids of direct children for each node. This way you can get decent performance via a recursive algorithm to build a complete list for any node.

To add/delete a new node, you will only need to invalidate its' direct parent cache O(1).

If that's not fast enough, you can add another layer of cache to a list of all child of a node at each node. In order for this to work with a decently mutable dataset, you should record the cache performance (ratio of fresh/cached hits) of each node and set a tolerance level for when to store the cache. This also can be stored in a memory heap since it's non-vital data.

If you use this more advanced caching model, you will need to note these complete children node lists will need to be invalidated when any of it's children are changed O(log n).

Once you have your list of children id's you can use SQL's WHERE id IN( id1, id2, .... ) syntax to query for what you want.

Kendall Hopkins