Database patterns for partitioning large hierarchical datasets

It seems you have three problems:

Storing data about an employee (excluding relationships/hierarchy)
Employer to Employee hierarchy (which can change over time)
Employee to Employee work history (again, changing over time)

To tackle each in turn:

Employee data: This could be partitioned, with a unique id, with alternate key for surname+given names+date of birth. Either partition by spreading evenly by id, or other info such as area/region (though that will mean some partitions will be hotter than others)
Employer/employee hierarchy: Needs a secondary table to define this, allowing changes over time. eg. Employee id, Employer id, start date, end date and keyed by employee id + employer id and back the other way employer id + employee id. I recommend reading the following: http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back , it might have ideas that work well for the size of your data.
Employee/employee work history: Needs another secondary table, very similar to #2, cross referencing employees and the time they've worked together. eg. employee1 id, employee2 id, start date, end date, which would be indexed by each of the id's at a minimum.

The key here is that don't attempt to place the relationships/hierarchy within the employee data table - it will be slow and limit the linking you need (especially as links change over time).

Thanks for the response Will. I'm thinking along the same lines of separating the employee data from the hierarchy, however since both datasets will be excessively large to keep in a single database its the partitioning aspect that I'm stuck on. Partitioning the employee data is simple enough, but the hierarchical data would be referencing rows across multiple partitions. It's this final point that I'm concerned about. Any ideas?

tree 2009-11-27 04:45:10

What databases are you considering using? I'm aware that most enterprise-level databases have partitioning capabilities that partition (split) very large tables in a way that is invisible to sqls.I'm not familiar with the exact syntax of each database, but once we know where you're heading, others might be able to help with the details.

Will 2009-11-27 08:33:55

I'm on SQL Server, though I'm thinking of this from the perspective of performance impact for queries crossing remote partitions. Even though SQL Server has mechanisms to hide this from me and let me write partition-agnostic queries, I think I'll have perf issues due to cross-partition queries.

tree 2009-11-27 23:11:40

Perhaps my question shouldn't be assuming the partitioning is the answer, but should be more like "What strategies work well for handling large volumes of data that don't fit on a single database server nor is easily divisible into partitions where queries are have affinity with a single partition". When I've been speaking of partitioning above, I didn't so much mean the mechanisms provided by SQL Server for partitioning tables since I could also distribute across servers without using these mechanisms.

tree 2009-11-27 23:50:59

ansaurus

tags:

views:

answers:

Database patterns for partitioning large hierarchical datasets

related questions