Hi,

I'm working on a design for a hierarchical database structure which models a catalogue containing products (this is similar to this question). The database platform is SQL Server 2005 and the catalogue is quite large (750,000 products, 8,500 catalogue sections over 4 levels) but relatively static (it is reloaded once a day), so we are only concerned with READ performance.

The general structure of the catalogue hierarchy is:

  • Level 1 Section
    • Level 2 Section
      • Level 3 Section
        • Level 4 Section (products are linked to here)

We are using the Nested Sets pattern for storing the hierarchy levels and storing the products which exist at that level in a separate linked table. So the simplified database structure would be

CREATE TABLE CatalogueSection
(
    SectionID INTEGER,
    ParentID INTEGER,
    LeftExtent INTEGER,   -- Nested Sets left bound
    RightExtent INTEGER   -- Nested Sets right bound
)

CREATE TABLE CatalogueProduct
(
    ProductID INTEGER,
    SectionID INTEGER   -- always a level 4 (leaf) section
)

We do have an added complication in that we have about 1000 separate customer groups which may or may not see all products in the catalogue. Because of this we need to maintain a separate "copy" of the catalogue hierarchy for each customer group so that when they browse the catalogue, they only see their products and they also don't see any sections which are empty.

To facilitate this we maintain a table of the number of products at each level of the hierarchy "rolled up" from the section below. So, even though products are only directly linked to the lowest level of the hierarchy, they are counted all the way up the tree. The structure of this table is

CREATE TABLE CatalogueSectionCount
(
    SectionID INTEGER,
    CustomerGroupID INTEGER,
    SubSectionCount INTEGER,   -- sections rolled up for this group
    ProductCount INTEGER       -- products rolled up for this group
)
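For illustration, a daily rebuild of these counts could look something like the following sketch. It assumes a hypothetical CustomerGroupProduct(CustomerGroupID, ProductID) table that maps each group to its visible products, and one plausible reading of SubSectionCount (distinct non-empty descendant sections):

TRUNCATE TABLE CatalogueSectionCount

INSERT INTO CatalogueSectionCount
    (SectionID, CustomerGroupID, SubSectionCount, ProductCount)
SELECT
    anc.SectionID,
    cgp.CustomerGroupID,
    -- Distinct non-empty sections below this one.
    COUNT(DISTINCT CASE WHEN leaf.SectionID <> anc.SectionID
                        THEN leaf.SectionID END),
    -- Products rolled up from all descendant sections.
    COUNT(DISTINCT cp.ProductID)
FROM CatalogueSection AS anc
INNER JOIN CatalogueSection AS leaf
    ON leaf.LeftExtent BETWEEN anc.LeftExtent AND anc.RightExtent
INNER JOIN CatalogueProduct AS cp
    ON cp.SectionID = leaf.SectionID
INNER JOIN CustomerGroupProduct AS cgp   -- hypothetical visibility table
    ON cgp.ProductID = cp.ProductID
GROUP BY anc.SectionID, cgp.CustomerGroupID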

So, onto the problem. Performance is very poor at the top levels of the hierarchy. The general query to show the "top 10" products in the selected catalogue section (and all of its child sections) takes somewhere in the region of a minute to complete. At lower sections of the hierarchy it is faster, but still not good enough.

I've put indexes (including covering indexes where applicable) on all key tables, and run it through the Query Analyzer, the Index Tuning Wizard, etc., but I still cannot get it to perform fast enough.

I'm wondering whether the design is fundamentally flawed or whether it's just that the dataset is so large. We have a reasonable development server (3.8GHz Xeon, 4GB RAM) but it's just not working :)

Thanks for any help

James

A: 

You might be able to solve the customer groups problem with roles and tree IDs, but you'll have to provide us with the query.

Mladen Prajdic
A: 

Might it be possible to calculate the ProductCount and SubSectionCount after the load each day?
If the data changes only once a day, it's surely worthwhile to calculate these figures then, even if some denormalization is required.

Bravax
Yes, we already pre-calculate these daily. It's not so much the counting of the products that is the issue; it's showing the actual list of products in the selected section that is slow.
James
Do you update statistics after reloading your data? If your indexes are fine (tuned for read-only use), then could it be that you're returning too much data? That's an area I might look at next. TBH, to help more is going to be quite difficult without seeing the schema and/or stored procedures.
Bravax
+2  A: 

Use a closure table. If your basic structure is a parent-child table with the fields ID and ParentID, then the structure of a closure table is ID and DescendantID. In other words, a closure table is an ancestor-descendant table, where each possible ancestor is associated with all of its descendants. You may include a LevelsBetween field if you need it. Closure table implementations usually include self-referencing records, i.e. ID 1 is an ancestor of descendant ID 1 with a LevelsBetween of zero.

Example: Parent/Child
ParentID - ID
1 - 2
1 - 3
3 - 4
3 - 5
4 - 6

Ancestor/Descendant
ID - DescendantID - LevelsBetween
1 - 1 - 0
1 - 2 - 1
1 - 3 - 1
1 - 4 - 2
1 - 5 - 2
1 - 6 - 3
2 - 2 - 0
3 - 3 - 0
3 - 4 - 1
3 - 5 - 1
3 - 6 - 2
4 - 4 - 0
4 - 6 - 1
5 - 5 - 0
6 - 6 - 0

The table is intended to eliminate recursive joins. You push the load of the recursive join into the ETL cycle you run when you load the data once a day, which shifts it away from query time.

Also, it allows variable-level hierarchies. You won't be stuck at 4.

Finally, it allows you to slot products into non-leaf nodes. A lot of catalogs create "Miscellaneous" buckets at higher levels of the hierarchy just to have a leaf node to attach products to. You don't need to do that, since intermediate nodes are included in the closure.

As far as indexing goes, I would do a clustered index on ID/DescendantID.
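To make that concrete, here is a minimal sketch of the ETL step: build the closure from the ParentID column with a recursive CTE (supported in SQL Server 2005), with the clustered index baked in. SectionClosure is a hypothetical name; adjust to taste.

CREATE TABLE SectionClosure
(
    ID INTEGER NOT NULL,             -- ancestor section
    DescendantID INTEGER NOT NULL,   -- descendant section (includes self)
    LevelsBetween INTEGER NOT NULL,  -- 0 for the self-referencing row
    PRIMARY KEY CLUSTERED (ID, DescendantID)
)

;WITH Closure (ID, DescendantID, LevelsBetween) AS
(
    -- Anchor: every section is a descendant of itself.
    SELECT SectionID, SectionID, 0
    FROM CatalogueSection
    UNION ALL
    -- Recursive step: walk down one level at a time.
    SELECT c.ID, s.SectionID, c.LevelsBetween + 1
    FROM Closure AS c
    INNER JOIN CatalogueSection AS s
        ON s.ParentID = c.DescendantID
)
INSERT INTO SectionClosure (ID, DescendantID, LevelsBetween)
SELECT ID, DescendantID, LevelsBetween
FROM Closure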

Now for your query performance. This takes a chunk out, but not all of it. You mentioned a "top 10". That implies ranking over a set of facts that you haven't mentioned, and we need details to help tune those. Plus, this only gets the leaf-level sections, not the products. At the very least, you should have an index on CatalogueProduct that orders by SectionID/ProductID. Given the cardinality you provided, I would force Section-to-Product joins to be loop joins. A report on a catalog section would go to the closure table to get descendants (using a clustered index seek). That list of descendants would then be used to get products from CatalogueProduct via looped index seeks. Then, with those products, you would get the facts necessary to do the ranking.
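As a sketch of that access path (SalesRank is a hypothetical stand-in for whatever measure actually drives your "top 10", and your customer-group filter would join in here as well):

DECLARE @SectionID INTEGER
SET @SectionID = 42  -- placeholder: the selected catalog section

SELECT TOP 10 p.ProductID
FROM SectionClosure AS sc                -- clustered index seek on the closure
INNER LOOP JOIN CatalogueProduct AS p    -- force looped index seeks
    ON p.SectionID = sc.DescendantID
WHERE sc.ID = @SectionID
ORDER BY p.SalesRank                     -- hypothetical ranking measure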

entaroadun
Excellent, that's exactly what I needed, and it has really improved performance. Thanks
James