In the past (pre-SQL Server 2005 and pre-LINQ), when dealing with this sort of structure (or the more general case of a directed acyclic graph, implemented with a junction table so that items can have more than one "parent"), I've either loaded the entire graph into memory, or created a trigger-updated lookup table in the database that cached the relationship of ancestor to descendant.
There are advantages to either, and which wins out depends on update frequency and on the complexity of the objects beyond the parent-child relationship itself. In general, loading into memory allows faster individual look-ups, but with a large graph it doesn't scale as well, because of the amount of memory used on each webserver ("each" here, because web-farm situations are exactly the ones where in-memory caching brings extra issues), and you have to be very careful about how the cached copies are kept in sync to counteract that.
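For concreteness, the trigger-maintained lookup table approach might look something like this minimal sketch. The table and trigger names here are illustrative, a Categories table with id and parentID columns is assumed, and only the insert case is handled; reparenting and deletes need similar (and more involved) triggers:
CREATE TABLE dbo.CategoryAncestry
(
    ancestorID int NOT NULL,
    descendantID int NOT NULL,
    PRIMARY KEY (ancestorID, descendantID)
)
GO
CREATE TRIGGER dbo.trgCategoryAncestryInsert ON dbo.Categories
AFTER INSERT
AS
BEGIN
    INSERT dbo.CategoryAncestry (ancestorID, descendantID)
    -- each new category is a descendant of its direct parent...
    SELECT i.parentID, i.id
    FROM inserted i
    WHERE i.parentID IS NOT NULL
    UNION ALL
    -- ...and of everything that parent is itself a descendant of
    SELECT a.ancestorID, i.id
    FROM inserted i
    INNER JOIN dbo.CategoryAncestry a ON a.descendantID = i.parentID
END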
A third option now available is to do ancestor lookup with a recursive CTE:
CREATE VIEW [dbo].[vwCategoryAncestry]
AS
WITH recurseCategoryParentage (ancestorID, descendantID)
AS
(
    -- anchor: every direct parent-child pair
    SELECT parentID, id
    FROM Categories
    WHERE parentID IS NOT NULL
    UNION ALL
    -- recursive step: extend each pair down to the descendant's children
    SELECT ancestorID, id
    FROM recurseCategoryParentage
    INNER JOIN Categories ON parentID = descendantID
)
SELECT DISTINCT ancestorID, descendantID
FROM recurseCategoryParentage
Assuming that root categories are indicated by having a null parentID.
(We use UNION ALL since we're going to SELECT DISTINCT afterwards anyway, and this way we have a single DISTINCT operation rather than repeating it).
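Looking things up is then just a select against the view; for example (treating @categoryID as a placeholder for whatever parameter your query takes):
-- all descendants (children, grandchildren, ...) of a given category
SELECT descendantID
FROM dbo.vwCategoryAncestry
WHERE ancestorID = @categoryID
-- or, going the other way, all ancestors of a given category
SELECT ancestorID
FROM dbo.vwCategoryAncestry
WHERE descendantID = @categoryID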
This allows us to do the look-up table approach without the redundancy of that denormalised table. The efficiency trade-off is different, and generally slightly poorer than with a real table (a slight hit on select, a slight gain on insert and delete, a negligible space gain), but the guarantee of correctness is greater.
I've ignored the question of where LINQ fits into this, as the trade-offs are much the same whichever way this is queried. LINQ can play nicer with "tables" that have single-column primary keys, so we can change the select clause to
SELECT DISTINCT (cast(ancestorID as bigint) * 0x100000000 + descendantID) as id, ancestorID, descendantID
and define that id as the primary key in its [Column] attribute. Of course, all columns should be marked as DB-generated.
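Putting that together, the amended view would look something like this sketch (same CTE as above; only the final select changes):
CREATE VIEW [dbo].[vwCategoryAncestry]
AS
WITH recurseCategoryParentage (ancestorID, descendantID)
AS
(
    SELECT parentID, id
    FROM Categories
    WHERE parentID IS NOT NULL
    UNION ALL
    SELECT ancestorID, id
    FROM recurseCategoryParentage
    INNER JOIN Categories ON parentID = descendantID
)
-- the synthetic id puts ancestorID in the high 32 bits and descendantID in the low 32
SELECT DISTINCT
    (cast(ancestorID as bigint) * 0x100000000 + descendantID) as id,
    ancestorID,
    descendantID
FROM recurseCategoryParentage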
Edit. Some more on the trade-offs involved.
Comparing the CTE approach with a look-up table maintained in the database:
Pro CTE:
- The CTE code is simple; the above view is all the extra DB code you need, and the C# needed is identical either way.
- The DB code is all in one place, rather than being split between a lookup table and a trigger on a different table.
- Inserts and deletes are faster; the view doesn't affect them at all, while the trigger does.
- While semantically recursive, it is so in a way the query planner understands and can deal with, so it's typically (for any depth) implemented in just two index scans (likely clustered), two light-weight spools, a concatenation and a distinct sort, rather than the many, many scans you might imagine. So while certainly heavier than a simple table lookup, it's nowhere near as bad as one might imagine at first. Indeed, even the nature of those two index scans (same table, different rows) makes it less expensive than you might think when reading that.
- It is very very easy to replace this with the table look-up if later experience proves that to be the way to go.
- A lookup table will, by its very nature, denormalise the database. Purity issues aside, the "bad smell" involved means that this will have to be explained and justified to any new dev, as until then it may simply "look wrong" and their instincts will send them on a wild-goose chase trying to remove it.
Pro Lookup-Table:
- While the CTE is faster to select from than one might imagine, the lookup is still faster, especially when used as part of a more complicated query.
- While CTEs (and the WITH keyword used to create them) are part of the SQL 99 standard, they are relatively new and some devs don't know them (though I think this particular CTE is so straightforward to read that it counts as a good learning example anyway, so maybe this is actually pro CTE!)
- While CTEs are part of the SQL 99 standard, they aren't implemented by some SQL databases, including older versions of SQL Server (which are still in live use), which may affect any porting efforts. (They are, though, supported by Oracle and Postgres, among others, so at this point this may not really be an issue).
- It's reasonably easy to replace this with the CTE version later, if later experience suggests you should.
Comparing (both of) the DB-heavy options with in-memory caching:
Pro In-Memory:
- Unless your implementation really sucks, it is going to be much faster than DB lookups.
- It makes some secondary optimisations possible on the back of this change.
- It is reasonably difficult to change from DB to in-memory if later profiling shows that in-memory is the way to go.
Pro Querying DB:
- Start-up time can be very slow with in-memory.
- Changes to the data are much, much simpler; most of the following points are aspects of this. Really, if you go the in-memory route, then the question of how to handle changes invalidating the cached information becomes a whole new ongoing concern for the lifetime of the project, and not a trivial one at all.
- If you use in-memory, you are probably going to have to use this in-memory store even for operations where it is not relevant, which may complicate where it fits with the rest of your data-access code.
- It is not necessary to track changes and cache freshness.
- It is not necessary to ensure that every webserver in a web-farm and/or web-garden solution (a certain level of success will necessitate this) has precisely the same degree of freshness.
- Similarly, the degree of scalability across machines (how close to 100% extra performance you get by doubling the number of webservers and DB slaves) is higher.
- With in-memory, memory use can become very high if either (a) the number of objects is high or (b) the objects are large (many fields, esp. strings, collections, and objects which themselves have a string or collection). Possibly "we need a bigger webserver" amounts of memory, and that goes for every machine in the farm.
- That heavy memory use is particularly likely to continue to grow as the project evolves.
- Unless changes cause an immediate refresh of the in-memory store, the in-memory solution will mean that the view seen by the people administering these categories will differ from what is seen by customers until they are re-synchronised.
- In-memory re-synching can be very expensive. Unless you're very clever with it, it can cause random (to the user) massive performance spikes. If you are clever with it, it can exacerbate the other issues (esp. in terms of keeping different machines at an equivalent level of freshness).
- Unless you're clever with in-memory, those spikes can accumulate, putting the machine into a long-term hang. If you are clever in avoiding this, you may exacerbate other issues.
- It is very difficult to move from in-memory to hitting the db should that prove the way to go.
None of this leans with 100% certainty towards one solution or the other, and I'm certainly not going to give a clear answer, as doing so would be premature optimisation. What you can do a priori is make a reasonable decision about which is likely to be the optimal solution. Whichever you go for, you should profile afterwards, esp. if the code does turn out to be a bottleneck, and possibly change approach. You should also do so over the lifetime of the product, as both changes to the code (fixes and new features) and changes to the dataset can certainly change which option is optimal (indeed, it can change from one to another and then back to the previous one over the course of the lifetime). This is why I included considerations of the ease of moving from one approach to another in the above list of pros and cons.