views:

197

answers:

4

We have a table with 17Mil rows containing product attributes, let's say they're:

brandID, sizeID, colorID, price, shapeID

And we need to query for aggregates by brand and size. Currently we query and filter this data by doing something like this:

select brandID, sizeID, count(*) from table where colorID in (1,2,3) and price=10 and shapeID=17 --"additional complex where clause here" group by brandID, sizeID order by brandID, sizeID

And we report this data. The problem is, it takes 10 seconds or so to run this query (and this is a very simple example) in spite of the fact that the actual data returned will be just a few hundred rows.

I think we've reached our capacity for indexing this table so I don't think any amount of indexes will get us to near-instant results.

I know very little about OLAP or other analysis services, but what's out there for SQL Server that can pre-filter or pre-aggregate this table so that queries like the above (or similar returning equivalent data) can be performed? OR What's the best way to handle arbitrary where clauses on a very large table?

A: 

Without specifics on your table structures and physical environment and (non)clustered indexes etc. the first place I would look for bottlenecks is "Show Execution plan" for the query, also, Database Engine Tuning Advisor and SQL profiler. Hope this helps.

Ash Machine
+2  A: 

I think this is a perfect candidate for an olap cube. I have fact data with 100s of millions of rows. I was doing the kind of queries you described above and queries were coming back in minutes. I moved this into an OLAP cube and queries are now almost instantaneous. There is a bit of a learning curve for olap. I'd strongly suggest you find a tutorial on some simple cube building just to get your head around it. DBA colleagues had been telling me about cubes for years and I never quite got it. Now I don't know why I went so long without it.

In addition to OLAP, you may also want to research indexed views but if you are slicing the data in several ways, that may not be feasible.

Matt Wrock
Correct me if I'm wrong, but OLAP cubes needs to be refreshed (or recompiled, I don't remember the correct word) for the data to be updated. So if Jody Powlette needs realtime data it might not be the best solution.
MaxiWheat
Loading the data in a star schema could be thought of as the "poor man's OLAP".
Philip Kelley
I agree with Matt on this one. Yes they do need to be reprocessed and redeployed with updates but if this table Jody speaks of is being used for reports they could schedule these specific updates as often as needed without interrupting service and access.
ajdams
Also, for data this large, you want to make sure and partition the cube (possibly by month). That way when you do add data every day, you only have to reprocess the current month.
Matt Wrock
A: 

Depends on your index and schema

Anyway, your index for this query should be one of

CREATE INDEX IX_foo ON table (shapeID, price, colorID) INCLUDE (brandID, sizeID)
CREATE INDEX IX_foo ON table (shapeID, price, colorID, brandID, sizeID)

However, you added "additional complex where clause here" which mitigates against a good answer

My thinking:

  • The WHERE clause is important because this reduces the row counts
  • The ORDER BY is less important than the aggregate/WHERE
  • Cover the query to remove key lookups

Extra things:

  • The colorID in (1,2,3) is bad because it's an OR
  • Ensure parameter datatypes match column datatypes exactly to avoid implicit conversions
  • You can swap shapeID, Price and colorID around a wee bit to see what's best (or create a few indexes and see which one it uses)
  • DO you have a server bottleneck (eg running on SQL Express etc)?
gbn
A: 

If you're using SQL 2008 and have some specific often-used filtering, consider using filtered indexes (possibly in combination with INCLUDE indexes as gbn suggested).

Say you have only five sizeID values. You could break your current indexes into multiple filtered indexes (e.g., "WHERE sizeID = 1").

Using filtering in combination with INCLUDE can make your queries return much faster.

Reference: Exploring SQL Server 2008’s Filtered Indexes

Rob Garrison