views:

89

answers:

6

Hi,

Say I have a very long table (~35 million rows) called TimeCard with only 5 columns (tableID, CompanyID, UserID, ProjectID, DailyHoursWorked, entryDate). This is a pretty straight forward table that records employees' worked hours per day per project per company.

I now need to generate a report to find out the employees' total worked hours per month per project for any given company. Instead of performing the aggregation needed when the report runs, I want to build a table-like data structure that already have all the Company/Project/User data aggregated by month, so when the report runs, I can just query that data structure directly without performing any run-time aggregation since ~35million records can take a few mins.

So I have 2 different ways. One create an extra physical table with (CompanyID, UserID, ProjectID, MonthlyHoursWorked, Month) as my columns and just use trigger at the TimeCard table to modify the values at the extra table. Or I can create an Indexed View. So I tried both. I first tried the indexed view with the following code:

CREATE VIEW [dbo].[vw_myView] WITH SCHEMABINDING AS
SELECT 
 JobID,
 ProjectID,
 Sum(DailyHoursWorked) AS MonthTotal,
 DATEADD( Month, DATEDIFF( Month, 0, entryDate), 0 ) AS entryMonth,
 CompanyID,
 COUNT_BIG(*) AS Counter
FROM
 dbo.TimeCard 
Group By DATEADD( Month, DATEDIFF( Month, 0, entryDate ), 0 ), JobID, ProjectID, CompanyID

Go
CREATE UNIQUE CLUSTERED INDEX [IX_someIndex] ON [dbo].[vw_myView] 
(
 [CompanyID] ASC,
 [entryMonth] ASC,
 [UserID] ASC,
 [ProjectID] ASC
)

The indexed view created correctly and totaling with ~5 million rows total.

However, every time if I clear the SQL cache, and run the following query: *select * from vw_myView where companyID = 1*, it takes almost 3 minutes. If I go with the extra table route as I mentioned above, with my cache cleared, it takes around 4 seconds.

My questions are, is Indexed View a bad choice for this particular scenario? In particular I am interested to know if the entire indexed view gets re-calculated/re-aggregated every time when the underlying table (TimeCard) is changed or when a query is run against it?

Thanks!

A: 

I would not used a view for this. I think the table populated by the trigger is the way to go. But don't forget to adjust the totals for updates and deletions as well as inserts.

HLGEM
Triggers are not very fast.
RandomBen
Views are often slower than triggers (especially if you pile them on top of each other) and a trigger can be properly written to be fast.
HLGEM
A: 

If you are not using either the Enterprise or Developer edition, then you need to use the with (noexpand) hint:

select * 
from vw_myView with (noexpand)
where companyID = 1

When the underlying data changes, the view will only update rows related to the changed data, not the entire table. This can have an adverse impact on an OLTP database with a high degree of inserts, but if usage is only moderate, should not pose a performance problem.

A tip from Microsoft:

As a general recommendation, any modifications or updates to the view or the base tables underlying the view should be performed in batches if possible, rather than singleton operations. This may reduce some overhead in the view maintenance.

RedFilter
What I don't get is why does it take 3 mins for a simple query if I clear the cache and none of the base table value had changed. Does it actually doing the entire aggregation again?
TheYouth
@TheYouth: did you try the `with (noexpand)` hint?
RedFilter
A: 

I don't think, you need indexed view (I don't say, the indexed view is bad/good idea). I think, you need the index on column "CompanyID" and "EntryDate". After then you should use where condition "WHERE CompanyID = @CompanyID AND EntryDate >= @StartDate AND EntryDate <= @EndDate".

If the table is processed primairly by the "EntryDate", you can use a cluster index on "EntryDate" column.

After this, I think the select statement will much faster then now.

TcKs
the clustered index on the view already uses CompanyID as its first column - that should do the trick, really. I don't see any benefit in an index on (CompanyID, EntryDate) - there doesn't seem to be any queries using EntryDate, so what's the point of indexing it?
marc_s
The TimeCard is indexed with entryDate, companyID, userID, projectID, all in separate non-clustered indexes. What I don't get is why does it take 3 mins for a simple query if I clear the cache. Does it actually doing the entire aggregation again?
TheYouth
You shold filter it by the EntryDate. If you know, you already have a aggregated informations about Januar 2010, you don't need select ti from the primary table again. If you use cluster index on the "EntryDate" column (without the CompanyID) - you should get better performance for selects based on this column.
TcKs
+1  A: 

I think you are on the right path with using an index View. However, have you put indexes on table you are querying from, TimeCard for your aggregate columns. You need to make an Index of JobID, ProjectID, entryDate, CompanyID (1 index). If you use 1 index for each column it will NOT solve your problems because the Query will have to use all 4 indexes together.

I do think using the trigger will be slow but in a different way. It will make your query faster but it will slow down every insert you do into TimeCard. If you do decide to go with the Trigger then I would make sure I index that table as well or might also be slow, not 3 minutes slow, but still slow to sort and return data.

RandomBen
I don't see how more indexes will help, the query is selecting by `CompanyID`, which is already indexed.
RedFilter
The TimeCard is indexed with entryDate, companyID, userID, projectID, all in separate non-clustered indexes. What I don't get is why does it take 3 mins for a simple query if I clear the cache. Does it actually doing the entire aggregation again?
TheYouth
I think it is. However, I think it would be better if you had 1 non-cluster index for entryDate, companyID, userID, projectID. Having 4 indexes, one for each of those columns is not really going to help you because your Group By is using all 4 together. So you want 1 index that has all 4 columns in it.
RandomBen
I think there is some confusion here - the `group by` in the DDL above is only used when creating the indexed view, not when selecting data from the view, so putting indexes on the view based on what is in the `group by` clause makes no sense.
RedFilter
A: 

Did you consider partitioning the table. You can think of combination of list and hash partitioning table.

A: 

Well, the idea of an indexed view is definetly good, and if you can create a clustered index on it - perfect. It should be fast - much better than 3 minutes for a query!

On the other hand: if those chunks of information are only ever updated e.g. once a month or once weekly (or even every night), it might be better to just put those into a separate DailyTimeCard table which gets filled/updated by e.g. an SSIS package regularly.

I wouldn't recommend using triggers to constantly update such a fact table either - if you really really need to have the most up to date data at every given second in the day, then stay with the indexed view.

But, your indexed view does quite a bit of heavy lifting - it sums, it groups by and so on. Keeping that up to date at all times, while your underlying TimeCard table changes and gets updated, will cause some load on your system - hard to say how much - but it could be quite noticeable.

If you find a way to extract the information you need - group and sum once and then store that aggregated data into a separate fact table - you should have both: fast and quick queries on the DailyTimeCard table, and the rest of your system should be less burdened with keeping the indexed view up to date all the time.

Maybe it's not the solution you're looking for - but just think about it for a bit. It might - or might not - work out for you!

marc_s