Our company is developing an internal project to parse text files. Those text files contain metadata, which is extracted using regular expressions. Ten computers parse the text files 24/7 and feed the extracted metadata into a SQL Server 2005 database running on a high-end Intel Xeon server.

The simplified database schema looks like this:

Items

| Id | Name   |
|----|--------|
| 1  | Sample |

Items_Attributes

| ItemId | AttributeId |
|--------|-------------|
| 1      | 1           |
| 1      | 2           |

Attributes

| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1  | 1               | 500mB |
| 2  | 2               | 1.0.0 |

AttributeTypes

| Id | Name    |
|----|---------|
| 1  | Size    |
| 2  | Version |

There are many distinct text file types, each with distinct metadata inside. For every text file we have an Item, and for every extracted metadata value we have an Attribute.

Items_Attributes allows us to avoid duplicate Attribute values, which keeps duplicates from blowing up the database size.

This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter what internal structure they have.

Additionally, this allows us to filter the data and to produce dynamic reports based on user criteria. We filter by Attribute and then pivot the result set (http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query

SELECT FROM Items WHERE Size = @A AND Version = @B

would return a pivoted table like this

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |
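
For reference, against the schema above that pseudo-query effectively has to be written as a pivot over self-joined attribute rows. A minimal sketch of one way to express it in T-SQL, assuming the table and column names shown (@A and @B stand for the report parameters from the pseudo-query):

    -- Sketch only: pivots attribute rows into columns via conditional aggregation.
    SELECT  i.Name AS ItemName,
            MAX(CASE WHEN att.Name = 'Size'    THEN a.Value END) AS Size,
            MAX(CASE WHEN att.Name = 'Version' THEN a.Value END) AS Version
    FROM    Items i
    JOIN    Items_Attributes ia ON ia.ItemId = i.Id
    JOIN    Attributes a        ON a.Id = ia.AttributeId
    JOIN    AttributeTypes att  ON att.Id = a.AttributeTypeId
    GROUP BY i.Id, i.Name
    HAVING  MAX(CASE WHEN att.Name = 'Size'    THEN a.Value END) = @A
        AND MAX(CASE WHEN att.Name = 'Version' THEN a.Value END) = @B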

The application has been running for months, and performance has degraded terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the Items_Attributes table grows by an average of 10,000,000 rows per week. Everything is properly indexed, and we have spent considerable time analyzing and optimizing query execution plans.

So my question is, how would you scale this in order to decrease report execution times?

We came up with these possible solutions:

  • Buy more hardware and set up a SQL Server cluster (we need advice on the proper "clustering" strategy).
  • Use a key/value database like HBase (we don't really know whether it would solve our problem).
  • Use an ODBMS rather than an RDBMS (we have been considering db4o).
  • Move our software to the cloud (we have zero experience with this).
  • Statically generate reports at runtime (we don't really want to).
  • Static indexed views for common reports (performance is almost the same).
  • De-normalize the schema (some of our reports involve up to 50 tables in a single query).
+2  A: 

I'd start by posting the exact table metadata (along with indexing details), the exact query text and the execution plan.

With your current table layout, a query similar to this:

SELECT FROM Items WHERE Size = @A AND Version = @B

cannot benefit from using a composite index on (Size, Version), since it's impossible to build such an index.

You cannot even build an indexed view, since it would contain a self-join on attributes.

Probably the best decision would be to denormalize the table like this:

id  name  size  version

and create an index on (size, version)
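
A minimal sketch of what that denormalized table and composite index might look like (the table and column names here are illustrative assumptions, not the poster's actual schema):

    -- Illustrative only: flattened reporting table with a composite index.
    CREATE TABLE ItemsFlat
    (
        Id      INT           NOT NULL PRIMARY KEY,
        Name    NVARCHAR(255) NOT NULL,
        Size    NVARCHAR(50)  NULL,
        Version NVARCHAR(50)  NULL
    );

    CREATE INDEX IX_ItemsFlat_Size_Version ON ItemsFlat (Size, Version);

    -- The report then becomes a simple index seek:
    -- SELECT Name, Size, Version FROM ItemsFlat WHERE Size = @A AND Version = @B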

Quassnoi
It's probably best to take the parameters, find out which ids you need from the lookup tables, and then use those ids in the query.
SQLMenace
@SQLMenace: if 1,000,000 rows satisfy Size, 1,000,000 rows satisfy Version and 1,000 rows satisfy both, the composite index would require 1,000 scans, while scanning the lookup tables twice would require 2,000,000 scans.
Quassnoi
That's right; the more filtering criteria you add, the longer it takes to execute.
knoopx
@knoopx: the impossibility of creating composite indexes is the main drawback of such a design. If you need multiple search criteria, you should keep your attributes as table columns.
Quassnoi
If the values table is pretty narrow (2 * 4 bytes for the ids and another 20 bytes for the value) and you have an index on the value, it might be doable, since you could store a million rows in a little under 3,500 pages; but then again, I don't know what the data looks like.
SQLMenace
@SQLMenace: The schema and data look exactly like the ones in this question. Everything is properly indexed. There are a few more non-participating columns and hundreds of surrounding tables with no relationships. I know that rebuilding the tables would obviously gain immeasurable benefits from the indexes, but dynamic report generation is a functional requirement.
knoopx
We are evaluating the possibilities; if none of them works for us, we will be forced to denormalize or rebuild the tables in a column-based fashion.
knoopx
+2  A: 

Perhaps this white paper by the SQL Server CAT team on the pitfalls of the Entity-Attribute-Value database model can help: http://sqlcat.com/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx

Remus Rusanu
This white paper almost directly answers the poster's question with best practice advice.
Philip Rieck
A: 

A short-term fix may be to use horizontal partitioning. I am assuming your largest table is Items_Attributes. You could horizontally partition this table, putting each partition on a separate filegroup on a separate disk controller.

That's assuming you are not trying to report across all ItemIds at once.
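
A rough sketch of how that could be set up in SQL Server 2005 (the partitioning column, boundary values and filegroup names are assumptions for illustration):

    -- Illustrative only: partition Items_Attributes by ItemId ranges across filegroups.
    CREATE PARTITION FUNCTION pf_ItemsAttributes (INT)
        AS RANGE RIGHT FOR VALUES (10000000, 20000000, 30000000);

    CREATE PARTITION SCHEME ps_ItemsAttributes
        AS PARTITION pf_ItemsAttributes TO (FG1, FG2, FG3, FG4);

    -- Then rebuild the clustered index of Items_Attributes on the partition scheme:
    -- CREATE CLUSTERED INDEX IX_Items_Attributes
    --     ON Items_Attributes (ItemId, AttributeId)
    --     ON ps_ItemsAttributes (ItemId);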

RedFilter
"That's assuming you are not trying to report across all ItemIds at once.", in fact we group by attribute value and sort by count of repeated values, so I guess this won't work?
knoopx
+1  A: 

It looks to me like you are issuing OLAP queries against a database optimized for OLTP transactions. Without knowing the details, I'd recommend building a separate "data warehouse" optimized for the kind of queries you are doing. That would involve aggregating data (if possible), denormalizing, and working against a database that is a day old or so. You would incrementally update the data each day, or at whatever interval you wish.
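
A minimal sketch of such a periodic refresh into a flattened reporting table (the table and column names are assumptions; a real warehouse load would more likely be incremental):

    -- Illustrative only: scheduled full rebuild of a denormalized reporting table.
    TRUNCATE TABLE Report_Items;

    INSERT INTO Report_Items (ItemId, ItemName, Size, Version)
    SELECT  i.Id,
            i.Name,
            MAX(CASE WHEN att.Name = 'Size'    THEN a.Value END),
            MAX(CASE WHEN att.Name = 'Version' THEN a.Value END)
    FROM    Items i
    JOIN    Items_Attributes ia ON ia.ItemId = i.Id
    JOIN    Attributes a        ON a.Id = ia.AttributeId
    JOIN    AttributeTypes att  ON att.Id = a.AttributeTypeId
    GROUP BY i.Id, i.Name;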

MicSim
+1  A: 

Please post the exact DDL and indexes; if you have indexes on the ID columns then your query will result in a scan.

Instead of something like this:

SELECT FROM Items WHERE Size = @A AND Version = @B

you need to do this:

SELECT FROM Items WHERE ID = 1

In other words, you need to grab the text values, find the ids that you are indexing on, and then use those ids in your query to return the results.
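
A sketch of that two-step approach against the schema in the question (variable names are illustrative; @A and @B are the text values being searched for):

    -- Illustrative only: resolve the attribute ids first, then filter on the ids.
    DECLARE @SizeAttrId INT, @VersionAttrId INT;

    SELECT @SizeAttrId = a.Id
    FROM   Attributes a
    JOIN   AttributeTypes att ON att.Id = a.AttributeTypeId
    WHERE  att.Name = 'Size' AND a.Value = @A;

    SELECT @VersionAttrId = a.Id
    FROM   Attributes a
    JOIN   AttributeTypes att ON att.Id = a.AttributeTypeId
    WHERE  att.Name = 'Version' AND a.Value = @B;

    SELECT i.Id, i.Name
    FROM   Items i
    JOIN   Items_Attributes ia1 ON ia1.ItemId = i.Id AND ia1.AttributeId = @SizeAttrId
    JOIN   Items_Attributes ia2 ON ia2.ItemId = i.Id AND ia2.AttributeId = @VersionAttrId;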

It is probably also a good idea to look at a partition function to distribute your data.

Clustering is done for availability, not performance: if one node (the active node) dies, the other node (the passive node) becomes active. Of course, there is also active/active clustering, but that is another story.

SQLMenace
+2  A: 

I have worked with such schemas many times. They never perform well. The best thing is to just store the data as you need it, in the form:

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |

Then you don't need to pivot. And BTW, please do not call your original EAV schema "normalized"; it is not normalized.

AlexKuznetsov
A: 

You mention 50 tables in a single query. Whilst SQL Server supports up to 256 tables in a single, monolithic query, taking this approach reduces the chances of the optimiser producing an efficient plan.

If you are wedded to the schema as it stands, consider breaking your reporting queries down into a series of steps which materialise their results into temporary (#) tables. This approach enables you to carry out the most selective parts of the query in isolation, and can, in my experience, offer big performance gains. The queries are generally more maintainable too.
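
For example, the most selective part of a report could be materialised first along these lines (a sketch only; names are illustrative):

    -- Illustrative only: stage the most selective filter into a temp table first.
    SELECT  ia.ItemId
    INTO    #MatchingItems
    FROM    Items_Attributes ia
    JOIN    Attributes a       ON a.Id = ia.AttributeId
    JOIN    AttributeTypes att ON att.Id = a.AttributeTypeId
    WHERE   att.Name = 'Size' AND a.Value = @A;

    CREATE CLUSTERED INDEX IX_MatchingItems ON #MatchingItems (ItemId);

    -- Subsequent steps join the remaining reporting tables to this reduced set:
    SELECT  i.Id, i.Name
    FROM    #MatchingItems m
    JOIN    Items i ON i.Id = m.ItemId;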

Also (a bit of a long shot, this), you don't say which SQL Server version you're on; but if you're on SQL 2005, given the number of tables involved in your reports and the volume of data, it's worth checking that your SQL Server is patched to at least SP2.

I worked on an ETL project using tables with rowcounts in the hundreds of millions, where we found that the query optimiser in SQL 2005 RTM/SP1 could not consistently produce efficient plans for queries joining more than 5 tables where one or more of the tables was of this scale. This issue was resolved in SP2.

Ed Harper