views: 69
answers: 4
Over the years I have read a lot of people's opinions on how to get better performance out of their SQL (Microsoft SQL Server, just so we are all on the same page...) queries. However, they all seem to be tightly tied to either a high-performance OLTP setup or a data warehouse OLAP setup (cubes galore...). My situation today, though, is kind of in the middle of the two, hence my indecision.

I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating to just the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating to just the sites. Basically it’s a pretty big flattened table or view… Most of the columns are int, bit, char(3), or short varchar(s). My problem is that a good portion of these columns are available to be used in ad-hoc queries by the user, and those queries need to run as quickly as possible because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast… This data is read-only; the data doesn’t change at all during the day, and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation as an OLAP database with the read requirements of an OLTP database.
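For reference, here is a rough sketch of that structure; the column names and types are made up for illustration and are not the actual definitions.

    -- Hypothetical sketch of the schema described above; columns are illustrative.
    CREATE TABLE dbo.Contacts (
        ContactId  int         NOT NULL PRIMARY KEY,
        LastName   varchar(50) NOT NULL,
        FirstName  varchar(50) NOT NULL,
        StateCode  char(3)     NULL
        -- ... the rest of the ~50 contact attributes
    );

    -- One-to-one extension table holding additional contact attributes.
    CREATE TABLE dbo.ContactTraits (
        ContactId  int     NOT NULL PRIMARY KEY REFERENCES dbo.Contacts (ContactId),
        DoNotCall  bit     NOT NULL,
        SourceCode char(3) NULL
        -- ... more trait columns
    );

    CREATE TABLE dbo.Sites (
        SiteId     int          NOT NULL PRIMARY KEY,
        SiteName   varchar(100) NOT NULL,
        RegionCode char(3)      NULL
        -- ... the rest of the ~150 site attributes (SiteTraits as another extension table)
    );

    -- Junction table resolving the many-to-many between Sites and Contacts.
    CREATE TABLE dbo.SiteContacts (
        SiteId    int NOT NULL REFERENCES dbo.Sites (SiteId),
        ContactId int NOT NULL REFERENCES dbo.Contacts (ContactId),
        CONSTRAINT PK_SiteContacts PRIMARY KEY (SiteId, ContactId)
    );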

I see three options:

1. Break the table into smaller divisible units and sub-query everything.
2. Make one flat table and really go to town on the indexing.
3. Create an OLAP cube and sub-query the rest based on whatever filter values I don’t put in as cube dimensions.

I have not done much with OLAP cubes, so I frankly don’t even know if that last one is an option, but from what I’ve done with them in the past I think it might be. Also, just to clarify what I mean when I say “sub-query everything”: instead of having a WHERE clause on the outer select, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed to eliminate a really large Cartesian product. As for the second option of the one large table, I have heard and seen conflicting results with that approach: it saves on joins, but at the same time a table scan takes much longer.
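A minimal sketch of the “sub-query everything” shape described above, pushing each filter into a derived table before the INNER JOIN; the column names are made up for illustration.

    -- Illustrative only: filter each table first, then join the filtered sets.
    SELECT c.ContactId, c.LastName, s.SiteId, s.SiteName
    FROM (
            SELECT ContactId, LastName
            FROM   dbo.Contacts
            WHERE  StateCode = 'TX'          -- filter Contacts first
         ) AS c
    INNER JOIN dbo.SiteContacts AS sc
            ON sc.ContactId = c.ContactId
    INNER JOIN (
            SELECT SiteId, SiteName
            FROM   dbo.Sites
            WHERE  RegionCode = 'SW'         -- filter Sites first
         ) AS s
            ON s.SiteId = sc.SiteId;

Worth noting that the optimizer will usually push outer WHERE predicates down into the joins on its own, so this form and a single outer WHERE often end up with the same plan.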

Ideas, anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea; I’m new to that stuff too.

Thanks in advance for any and all opinions and help with this dilemma I’ve found myself in.

+2  A: 

You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or, a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.

In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though. The fact may simply be the association of sites to contacts, in which case [SiteContacts] would be the fact table.

The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
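To make that concrete, here is a hedged sketch of what the star schema might look like, assuming the site/contact association is the fact; the Dim/Fact names and columns are hypothetical.

    -- Hypothetical star schema: two dimensions and a fact table of associations.
    CREATE TABLE dbo.DimContact (
        ContactKey  int          NOT NULL PRIMARY KEY,
        ContactName varchar(100) NOT NULL,
        StateCode   char(3)      NULL
        -- ... the remaining contact attributes
    );

    CREATE TABLE dbo.DimSite (
        SiteKey    int          NOT NULL PRIMARY KEY,
        SiteName   varchar(100) NOT NULL,
        RegionCode char(3)      NULL
        -- ... the remaining site attributes
    );

    -- Fact table: one row per site/contact association (no numeric measures needed).
    CREATE TABLE dbo.FactSiteContact (
        SiteKey    int NOT NULL REFERENCES dbo.DimSite (SiteKey),
        ContactKey int NOT NULL REFERENCES dbo.DimContact (ContactKey),
        CONSTRAINT PK_FactSiteContact PRIMARY KEY (SiteKey, ContactKey)
    );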

It wouldn't be a very big problem to add numerous indexes to this architecture. The database is mostly read-only, except at refresh time, so you won't have to worry about reads competing with index maintenance during the day. The architecture can accommodate all the indexes that are needed (as long as you can dedicate enough downtime to refreshing the data).
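As one illustration of that point (the index and column names are made up): with a read-only daytime workload, covering indexes on the common filters cost nothing during the day, and index maintenance can be pushed entirely into the refresh window.

    -- Covering index on a common filter; it is only rebuilt during the refresh window.
    CREATE NONCLUSTERED INDEX IX_Contacts_StateCode
        ON dbo.Contacts (StateCode)
        INCLUDE (LastName, FirstName);

    -- During the scheduled refresh: disable, reload, rebuild.
    ALTER INDEX IX_Contacts_StateCode ON dbo.Contacts DISABLE;
    -- ... bulk load / refresh dbo.Contacts here ...
    ALTER INDEX IX_Contacts_StateCode ON dbo.Contacts REBUILD;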

bobs
I'm with bobs on this - we model dimensionally, which gives good query performance for ad hoc queries involving an unknown number of dimensions simultaneously. If you build a cube on top of an already dimensional model, that's fine - we're just using a (denormalized star schema) dimensional model in an RDBMS engine without any fancy OLAP.
Cade Roux
Also, with this relatively modest amount of data (millions vs. billions), you could do a complete refresh several times a day, I would think, thereby avoiding change data capture or any of those kinds of issues.
Cade Roux
My hesitation with the cube approach is that sometimes I just want a count (the distinct number of ContactIds, let's say), and other times I actually need to return data (name, address, position title, phone #, etc.), which could potentially be all the fields in the DB except for internal ids and keys... Would I just use the results, say the distinct list of ContactIds, in a traditional T-SQL select - using an IN or a join on/to a denormalized version of the DB - for when actual data needs to be returned and not just a count?
+1  A: 

I agree with bobs' answer: throw an OLAP front end on it and query through the cube. The reason this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates by multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.

The relational data underneath the cube will be great for detail drill-ins to find the individual facts behind a certain aggregate value. But querying the relational data directly will always be slow, because the aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.

Remus Rusanu
A: 

OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.

What are the most common queries? For single pieces of data or aggregates?

Shlomo
Well, some are counts, so that's my aggregate scenario, but when actual data needs to be returned to the screen or sent to the client as a file, it's pulling data like a normal SQL query. Would the OLAP/SSAS approach be best for the count scenario, but something I should avoid for the data-pull scenario? How about getting the IDs from the cube and using them in my data-pulling T-SQL? Is the best approach to use both methods, each for their specific strengths? Thoughts?
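A minimal sketch of that hybrid, assuming the cube query (or drillthrough) hands back a set of ContactIds that then drives a normal T-SQL detail pull; the temp table and column names are hypothetical.

    -- Stage the ContactIds returned by the cube, then join back for the detail rows.
    CREATE TABLE #SelectedContacts (ContactId int NOT NULL PRIMARY KEY);

    INSERT INTO #SelectedContacts (ContactId)
    VALUES (101), (202), (303);   -- in practice, loaded from the cube results

    SELECT  c.ContactId, c.LastName, c.FirstName, s.SiteName
    FROM    #SelectedContacts AS sel
    JOIN    dbo.Contacts      AS c  ON c.ContactId  = sel.ContactId
    JOIN    dbo.SiteContacts  AS sc ON sc.ContactId = c.ContactId
    JOIN    dbo.Sites         AS s  ON s.SiteId     = sc.SiteId;

    DROP TABLE #SelectedContacts;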
A: 

If the granularity of SiteContacts is pretty close to that of Contacts (i.e. circa 3 million records - most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).

On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.

OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.

Star schemas consist of fact tables with dimensions hanging off them - depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions with a factless fact table (sounds like an oxymoron, but is covered in Kimball's methodology) linking them.
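For the counting half of the workload, a factless fact table keeps the distinct counts cheap even without a cube; a hedged sketch using the hypothetical DimContact / DimSite / FactSiteContact names sketched in the earlier answer:

    -- Distinct contact count for an ad-hoc slice against the factless fact table.
    SELECT  COUNT(DISTINCT f.ContactKey) AS ContactCount
    FROM    dbo.FactSiteContact AS f
    JOIN    dbo.DimSite         AS ds ON ds.SiteKey    = f.SiteKey
    JOIN    dbo.DimContact      AS dc ON dc.ContactKey = f.ContactKey
    WHERE   ds.RegionCode = 'SW'
      AND   dc.StateCode  = 'TX';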

Mark Bannister
"it sounds as though there will be relatively little aggregation carried out on this data"Well when counts (number of contacts, sites or both) are run which is half the product, that's an aggragate. When the application is looking for actually returning data, no, there isn't much aggregation there.