I'm working on a database design for a group hierarchy used as the foundation of a larger system. Each group can contain other groups, and also 'devices' as leaf objects (nothing goes below a device).

The database being used is MS SQL 2005. (Though working in MS SQL 2000 would be a bonus; a solution requiring MS SQL 2008 is unfortunately not feasible at this time).

There are different types of groups, and these need to be dynamic and definable at run-time by users. For example, group types might be "customer", "account", "city", "building", or "floor", and each type will have a different set of attributes, definable by the user. There will also be business rules applied - e.g., a "floor" can only be contained underneath a "building" group - and again, these are definable at runtime.

A lot of the application functionality comes from running reports based on these groups, so there needs to be a relatively fast way to get a list of all devices contained within a certain group (and all its sub-groups).

Storing groups using the modified pre-order tree traversal technique has the upside that it is fast, but the downside that it is fairly complex and fragile - if external users/applications modify the database, there is the potential for complete breakage. We're also implementing an ORM layer, and this method seems to complicate using relations in most ORM libraries.

Using common table expressions and a "standard" id/parentid groups relation seems to be a powerful way to avoid running multiple recursive queries. Is there any downside to this method?
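
To make that concrete, this is the sort of query I have in mind (a rough sketch; table and column names like Groups, Devices, and ParentId are just placeholders):

    -- Adjacency list: each groups row stores its ParentId.
    -- A recursive CTE walks from the root group down to every descendant,
    -- then joins to the devices hanging off those groups.
    DECLARE @RootGroupId int
    SET @RootGroupId = 1

    ;WITH GroupTree (GroupId) AS
    (
        SELECT g.GroupId
        FROM   Groups g
        WHERE  g.GroupId = @RootGroupId

        UNION ALL

        SELECT child.GroupId
        FROM   Groups child
        JOIN   GroupTree parent ON child.ParentId = parent.GroupId
    )
    SELECT d.*
    FROM   Devices d
    JOIN   GroupTree t ON d.GroupId = t.GroupId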

As far as attributes go, what is the best way to store them? A long, narrow table that relates back to the groups table? Should a common attribute, like "name", be stored in the groups table itself rather than in the attributes table (a lot of the time, the name will be all that is required for display)?
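
By "long, narrow table" I mean something along these lines (all names hypothetical):

    -- One row per (group, attribute) pair - an entity-attribute-value layout.
    CREATE TABLE GroupAttributes
    (
        GroupId        int            NOT NULL,  -- FK to the groups table
        AttributeName  nvarchar(100)  NOT NULL,  -- e.g. 'name', 'address'
        AttributeValue nvarchar(400)  NULL,
        CONSTRAINT PK_GroupAttributes PRIMARY KEY (GroupId, AttributeName)
    )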

Are there going to be performance issues using this method? (Let's assume a high average of 2000 groups with an average of 6 attributes each, and an average of 10 concurrent users, on a reasonable piece of hardware - e.g., a quad-core 2 GHz Xeon with 4GB RAM - discounting any other processes.)

Feel free to suggest a completely different schema than what I've outlined here. I was just trying to illustrate the issues I'm concerned about.

+1  A: 

What transaction rate do you have to sustain?

Tim Williscroft
This isn't an answer. It should be a comment to the question.
+1  A: 

I'd recommend you actually construct the easiest-to-maintain way (the "standard" parent/child setup) and run at least some basic benchmarks on it.

You'd be surprised what a database engine can do with the proper indexing, especially if your dataset can fit into memory.
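
The "standard" setup here is nothing more exotic than this (a sketch; all names are illustrative):

    -- Plain adjacency-list schema: one self-referencing groups table
    -- plus a devices table hanging off it.
    CREATE TABLE Groups
    (
        GroupId  int           NOT NULL PRIMARY KEY,
        ParentId int           NULL REFERENCES Groups (GroupId),  -- NULL = root
        Name     nvarchar(100) NOT NULL
    )

    CREATE TABLE Devices
    (
        DeviceId int           NOT NULL PRIMARY KEY,
        GroupId  int           NOT NULL REFERENCES Groups (GroupId),
        Name     nvarchar(100) NOT NULL
    )

    -- Index the foreign keys that hierarchy queries will walk:
    CREATE INDEX IX_Groups_ParentId ON Groups (ParentId)
    CREATE INDEX IX_Devices_GroupId ON Devices (GroupId)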

Assuming 6 attributes per group, 2000 groups, and 30 bytes per attribute, you're talking 360KB times the expected items per group -- figure 400KB. If you expect to have 1000 items per group, you're only looking at 400MB of data -- that'll fit in memory without a problem, and databases are fast at joins when all the data is in memory.

Jonathan
+1  A: 

Common table expressions will let you get out a list of groups with the parent-child relations; a sketch of a sproc built around a CTE follows the caveats below. It's reasonably efficient, but beware the following caveats:

  1. If a node (group or device) occurs more than once in the hierarchy, it will be reported at each location. You may need to post-process the results.
  2. CTEs are somewhat obtuse and offer limited scope to filter results within the query - for instance, the recursive member may reference the CTE only once.
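
A minimal sketch of such a sproc (all object names are illustrative, not from your schema):

    -- Hypothetical sproc: returns every device under a given group by
    -- walking the id/parentid hierarchy with a recursive CTE.
    CREATE PROCEDURE GetDevicesUnderGroup
        @RootGroupId int
    AS
    BEGIN
        SET NOCOUNT ON

        ;WITH GroupTree (GroupId) AS
        (
            SELECT GroupId
            FROM   Groups
            WHERE  GroupId = @RootGroupId

            UNION ALL

            SELECT g.GroupId
            FROM   Groups g
            JOIN   GroupTree t ON g.ParentId = t.GroupId
        )
        SELECT DISTINCT d.*   -- DISTINCT guards against caveat 1:
        FROM   Devices d      -- nodes that occur at more than one location
        JOIN   GroupTree t ON d.GroupId = t.GroupId
    END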

Oracle's CONNECT BY is somewhat more flexible, as it doesn't impose nearly as many limitations on the query structure as CTEs do, but if you're using SQL Server this won't be an option.

If you need to do anything clever with the intermediate results, then write a sproc that uses the CTE to get the raw query results into a temporary table and work on it from there. SELECT INTO will minimise the overhead incurred in this (it is minimally logged). The resulting table will be in cache, so operations on it will be reasonably quick.
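
The pattern looks like this (again, names are illustrative):

    -- Materialise the CTE output into a temp table, then refine it.
    DECLARE @RootGroupId int
    SET @RootGroupId = 1

    ;WITH GroupTree (GroupId) AS
    (
        SELECT GroupId FROM Groups WHERE GroupId = @RootGroupId
        UNION ALL
        SELECT g.GroupId
        FROM   Groups g
        JOIN   GroupTree t ON g.ParentId = t.GroupId
    )
    SELECT GroupId
    INTO   #GroupTree        -- SELECT INTO is minimally logged
    FROM   GroupTree

    -- Subsequent filtering and joining works on #GroupTree from here.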

Some possible physical optimisations that could help:

  • A clustered index on the parent id, so that getting out the child nodes for a parent uses less I/O (see the sketch after this list).
  • Lots of RAM and (depending on the size of your BOM table) a 64-bit server with even more RAM, so that the main BOM table can be cached in core. On a 32-bit OS the /3GB boot switch is your friend, and it has no real downside for a database server.
  • DBCC PINTABLE can help force the database manager to hold the table in cache. (Note that this only helps on SQL 2000; the command is deprecated in SQL Server 2005 and has no effect there.)
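
As a sketch of the first bullet (object names are illustrative):

    -- Cluster the groups table on the parent id so that all children of a
    -- group are physically adjacent on disk. The primary key on GroupId
    -- would then need to be declared NONCLUSTERED.
    CREATE CLUSTERED INDEX CIX_Groups_ParentId ON Groups (ParentId)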

Parent-Attribute Type-Attribute coding tables will not play nicely with CTEs, as you will wind up with a combinatorial explosion in your row counts if you include the attribute table in the recursive query. This would preclude any business logic in the query that filtered on attributes. You would be much better off storing the attributes directly on the BOM table entry.

ConcernedOfTunbridgeWells
A: 

Pre-order Tree Traversal is very handy. You can make it robust by keeping the traversal numbers up to date with triggers.
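
The read side of pre-order traversal, assuming hypothetical lft/rgt traversal-number columns on the groups table, is a single range join:

    -- Every descendant of a group falls inside its (lft, rgt) interval,
    -- so one range join fetches the whole subtree.
    DECLARE @RootGroupId int
    SET @RootGroupId = 1

    SELECT child.*
    FROM   Groups parent
    JOIN   Groups child ON child.lft BETWEEN parent.lft AND parent.rgt
    WHERE  parent.GroupId = @RootGroupId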

A similar technique, which I have used, is to keep a separate table of (ancestor_id, descendant_id) pairs listing every ancestor-descendant relationship. This is nearly as good as pre-order traversal numbers.

Using a separate table is handy because, even though it introduces an extra join, it isolates the complexity in a table of its own (see the sketch below).
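
A sketch of that table and its read query (the names besides ancestor_id and descendant_id are made up):

    -- One row per (ancestor, descendant) pair, at every depth.
    CREATE TABLE GroupClosure
    (
        ancestor_id   int NOT NULL,   -- FK to the groups table
        descendant_id int NOT NULL,   -- FK to the groups table
        CONSTRAINT PK_GroupClosure PRIMARY KEY (ancestor_id, descendant_id)
    )

    -- All devices under a group, with no recursion at read time:
    DECLARE @RootGroupId int
    SET @RootGroupId = 1

    SELECT d.*
    FROM   GroupClosure c
    JOIN   Devices d ON d.GroupId = c.descendant_id
    WHERE  c.ancestor_id = @RootGroupId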

AJ
+1  A: 

The modified pre-order technique is, essentially, Joe Celko's Nested Sets method. His book, "Trees and Hierarchies...", covers both adjacency lists and nested sets, with descriptions of the advantages and disadvantages of each. With proper indexing, a CTE over an adjacency list gives the most balanced performance. If your workload is read-mostly, then nested sets will be faster.

What you seem to be describing is a Bill of Materials processor. While it's not M$, Graeme Birchall has a free DB2 book with a chapter on hierarchy processing using CTEs (the syntax is virtually identical, IIRC; the ANSI syntax adopted DB2's, which M$ then adopted): http://mysite.verizon.net/Graeme_Birchall/cookbook/DB2V95CK.PDF