views:

202

answers:

1

Are there any best-pratices/patterns or general advice for partitioning large amounts of hierarchical data?

Think of, say, a database of all the people in a given country and tracking who has worked with who. Thinking of the "person" entities in isolation, if a lot of data were to be kept about each person then a natural approach seems to be to divide the population across multiple horizontal partitions. However, the relations (who worked with who) could (and will) cross partitionsa. Clustering on these relations (ie. using employer for example as the partition key in order to minimize cross-partition-references) won't be viable over time as the data becomes more and more cross-linked. Such clustering would also result in unbalanced partitions which would hamper scalability.

I'm rather stuck right now, so would be very greatful for any help offered.

Thanks.

A: 

It seems you have three problems:

  1. Storing data about an employee (excluding relationships/hierarchy)
  2. Employer to Employee hierarchy (which can change over time)
  3. Employee to Employee work history (again, changing over time)

To tackle each in turn:

  1. Employee data: This could be partitioned, with a unique id, with alternate key for surname+given names+date of birth. Either partition by spreading evenly by id, or other info such as area/region (though that will mean some partitions will be hotter than others)

  2. Employer/employee hierarchy: Needs a secondary table to define this, allowing changes over time. eg. Employee id, Employer id, start date, end date and keyed by employee id + employer id and back the other way employer id + employee id. I recommend reading the following: http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back , it might have ideas that work well for the size of your data.

  3. Employee/employee work history: Needs another secondary table, very similar to #2, cross referencing employees and the time they've worked together. eg. employee1 id, employee2 id, start date, end date, which would be indexed by each of the id's at a minimum.

The key here is that don't attempt to place the relationships/hierarchy within the employee data table - it will be slow and limit the linking you need (especially as links change over time).

Will
Thanks for the response Will. I'm thinking along the same lines of separating the employee data from the hierarchy, however since both datasets will be excessively large to keep in a single database its the partitioning aspect that I'm stuck on. Partitioning the employee data is simple enough, but the hierarchical data would be referencing rows across multiple partitions. It's this final point that I'm concerned about. Any ideas?
tree
What databases are you considering using? I'm aware that most enterprise-level databases have partitioning capabilities that partition (split) very large tables in a way that is invisible to sqls.I'm not familiar with the exact syntax of each database, but once we know where you're heading, others might be able to help with the details.
Will
I'm on SQL Server, though I'm thinking of this from the perspective of performance impact for queries crossing remote partitions. Even though SQL Server has mechanisms to hide this from me and let me write partition-agnostic queries, I think I'll have perf issues due to cross-partition queries.
tree
Perhaps my question shouldn't be assuming the partitioning is the answer, but should be more like "What strategies work well for handling large volumes of data that don't fit on a single database server nor is easily divisible into partitions where queries are have affinity with a single partition". When I've been speaking of partitioning above, I didn't so much mean the mechanisms provided by SQL Server for partitioning tables since I could also distribute across servers without using these mechanisms.
tree