tags:

views:

648

answers:

6

I'm working on some upgrades to an internal web analytics system we provide for our clients (in the absence of a preferred vendor or Google Analytics), and I'm working on the following query:

select 
    path as EntryPage, 
    count(Path) as [Count] 
from 
    (
     /* Sub-query 1 */
     select 
      pv2.path
     from 
      pageviews pv2 
       inner join
        (
         /* Sub-query 2 */
         select
          pv1.sessionid,
          min(pv1.created) as created
         from
          pageviews pv1 
           inner join Sessions s1 on pv1.SessionID = s1.SessionID
           inner join Visitors v1 on s1.VisitorID = v1.VisitorID
         where
          pv1.Domain = isnull(@Domain, pv1.Domain) and
          v1.Campaign = @Campaign
         group by
          pv1.sessionid
        ) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
    ) t2
group by 
    Path;

I've tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to run. I'm noticing a clustered index scan twice in the execution plan, both times it hits the PageViews table. There is a clustered index on the Created column in that table.

The problem is that in both cases it appears to iterate over all 2 million rows, which I believe is the performance bottleneck. Is there anything I can do to prevent this, or am I pretty much maxed out as far as optimization goes?

For reference, the purpose of the query is to find the first page view for each session.

EDIT: After much frustration, despite the help received here, I could not make this query work. Therefore, I decided to simply store a reference to the entry page (and now exit page) in the sessions table, which allows me to do the following:

select
    pv.Path,
    count(*)
from
    PageViews pv
     inner join Sessions s on pv.SessionID = s.SessionID
      and pv.PageViewID = s.ExitPage
     inner join Visitors v on s.VisitorID = v.VisitorID
where
    (
        @Domain is null or 
        pv.Domain = @Domain
    ) and
    v.Campaign = @Campaign
group by pv.Path;

This query runs in 3 seconds or less. Now I either have to update the entry/exit page in real time as the page views are recorded (the optimal solution) or run a batch update at some interval. Either way, it solves the problem, but not like I'd intended.

Edit Edit: Adding a missing index (after cleaning up from last night) reduced the query to mere milliseconds). Woo hoo!

+1  A: 

For starters,

    where pv1.Domain = isnull(@Domain, pv1.Domain)

won't SARG. You can't optimize a match on a function, as I remember.

le dorfier
Any suggestions to replace that line? I want to allow a domain specific match, or match all if no domain is specified. Is there a better way besides duplicating the entire query inside an if/else statement?
Chris
FWIW, I tried commenting out that line and the execution time went down to 8 seconds, so that appears to be the culprit. Any suggestions would be appreciated.
Chris
you can use a case statement :where pv1.Domain = case when @Domain is null then pv1.Domain else @Domain end
Learning
Marked your answer as accepted, because you helped me get closest to a solution. Hopefully one pops up in my sleep.
Chris
+1  A: 

To continue from doofledorf.

Try this:

where
   (@Domain is null or pv1.Domain = @Domain) and
   v1.Campaign = @Campaign

Ok, I have a couple of suggestions

  1. Create this covered index:

     create index idx2 on [PageViews]([SessionID], Domain, Created, Path)
    
  2. If you can amend the Sessions table so that it stores the entry page, eg. EntryPageViewID you will be able to heavily optimise this.

Sam Saffron
What's the best way to post the EP here?
Chris
Also, the suggested modification allows the conditional specification of a domain, like I wanted. So thanks for that. Still hovering around 9 seconds though. The final query still runs a clustered index scan at 37% cost iterating over all 2 million rows.
Chris
SET SHOWPLAN_TEXT ON
Sam Saffron
How many results is this thing returning when it takes 9 seconds?
Sam Saffron
It's a summary - returns less than 5 rows atm. Also, added the execution plan to the original question for reference.
Chris
Ok, if I were you I would not rest until this takes less than 100millisecs to return, Ill see what I can do, please post table definitions and indexes for [Visitors] and [PageViews]
Sam Saffron
Just posted all the table defs and indexes.
Chris
@Sambo - I added your suggested index and reran. The latest EP is at the bottom of the original question
Chris
@Chris how long is it taking after this change? Also how long does "Subquery 2" take to run if it is ran by itself?
Sam Saffron
8 seconds still, sambo. Subquery takes 4 seconds to return 200,000 rows (executed from server to eliminate data transfer lag)
Chris
@Chris, if you add Path to your index then you will be able to get all the data out of the covering index, Do you have the ability to add the first page view id into the session table?
Sam Saffron
@Sambo - I ended up doing just that - adding the page view ID to the sessions table. The query now runs in just a couple of seconds. It wasn't the design I wanted, but in the end, probably the most optimal for the desired effect.
Chris
+1  A: 

Your inner query (pv1) will require a nonclustered index on (Domain).

The second query (pv2) can already find the rows it needs due to the clustered index on Created, but pv1 might be returning so many rows that SQL Server decides that a table scan is quicker than all the locks it would need to take. As pv1 groups on SessionID (and hence has to order by SessionID), a nonclustered index on SessionID, Created, and including path should permit a MERGE join to occur. If not, you can force a merge join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."

The two indexes listed above will be:

CREATE NONCLUSTERED INDEX ncixcampaigndomain ON PageViews (Domain)

CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews(SessionID, Created) INCLUDE (path)

Jim McLeod
Put Campaign first in the index - that will cover the case where Domain is null; the other way won't.
le dorfier
That index won't work anyway - campaign isn't in the page views table. Also, I could put an index on campaign on the visitors table, but in the test case, all rows have the same value for this column. My understanding is that for lack of uniqueness, an index on that column won't help much.
Chris
An index on SessionID + Created won't benefit from adding path - it's already unique on the first two fields.
le dorfier
OK - Now I'm going to have to really look at it - so far I've been optimizing locally. :)
le dorfier
Ah, I missed that Campaign was in Visitors. Just an index on pv1.Domain then. I'll update my answer.
Jim McLeod
I could provide a download of the DB itself if you were so inclined to try it. It's 450MB uncompressed with the test data, though. :-(
Chris
@doofledorfer: The benefit of including path is to create a "covering index" so that the clustered index is not looked up to find the value for path - the nonclustered index will hold all information from pv2 that the query needs.
Jim McLeod
@Jim - I added an index on Domain and it didn't budge. I suppose at some point I'll have to post an updated query plan, but I'd like to see any further suggestions before I do.
Chris
@Chris - How many rows are returned from the pv1 query before the without the GROUP BY Sessionid? It's likely that there are so many rows returned that SQL Server decides to scan instead of using the non-clustered index on Domain. Overcome this by a covering index on (Domain, SessionID, Created).
Jim McLeod
ok on covering index - let's see if we can't improve a couple other things.Here's the core bare query... (shortly) ...
le dorfier
SELECT campaignid, sessionid, MIN(created) AS created FROM pageviews pv JOIN visitors v ON pv.visitorid = v.visitorid WHERE v.campaign = @Campaign GROUP BY sessionid
le dorfier
@Jim - before the sessionID grouping, it would be returning all applicable page views, which is ~2mil. I'm relying on the session id grouping to contract that number. The number sessions in the test case is 200,000, with 100,000 visitors
Chris
@Doofle - I ran your query (-campaignid since I don't need it and inserted a join on sessions since visitorid only exists in Visitors and Sessions table) and got (as expected) 200,000 rows back. The query took 35 seconds, but some of that was data transfer back to me.
Chris
I've had success with the following four indexes:
Jim McLeod
CREATE NONCLUSTERED INDEX ncix ON PageViews (Domain, SessionID, Created)CREATE NONCLUSTERED INDEX ncix ON PageViews(SessionID, Created) INCLUDE (domain, path)CREATE NONCLUSTERED INDEX ncix ON Sessions (SessionID, VisitorID)CREATE NONCLUSTERED INDEX ncix ON Visitors (VisitorID, Campaign)
Jim McLeod
+1  A: 

I'm back. To answer your first question, you could probably just do a union on the two conditions, since they are obviously disjoint.

Actually, you're trying to cover both the case where you provide a domain, and where you don't. You want two queries. They may optimize entirely differently.

le dorfier
+1  A: 

What's the nature of the data in these tables? Do you find most of the data is inserted/deleted regularly?

Is that the full schema for the tables? The query plan shows different indexing.. Edit: Sorry, just read the last line of text. I'd suggest if the tables are routinely cleared/insertsed, you could think about ditching the clustered index and using the tables as heap tables.. just a thought

Definately should put non-clustered index(es) on Campaign, Domain as John suggested

RobS
No data is deleted. Visitors and Sessions are updated infrequently, while PageViews has an insert each time someone visits a page. That is the full schema for the tables involved.
Chris
+1  A: 
SELECT  
    sessionid,  
    MIN(created) AS created  
FROM  
    pageviews pv  
JOIN  
    visitors v ON pv.visitorid = v.visitorid  
WHERE  
    v.campaign = @Campaign  
GROUP BY  
    sessionid

so that gives you the sessions for a campaign. Now let's see what you're doing with that.

OK, this gets rid of your grouping:

SELECT  
    campaignid,  
    sessionid,   
    pv.path  
FROM  
    pageviews pv  
JOIN  
    visitors v ON pv.visitorid = v.visitorid  
WHERE  
    v.campaign = @Campaign  
    AND NOT EXISTS (  
        SELECT 1 FROM pageviews  
        WHERE sessionid = pv.sessionid  
        AND created < pv.created  
    )
le dorfier
So the next thing is you want the first page they hit for the session, right?
le dorfier
Yes. Tell you want, do you have MSN messenger? It's getting a bit convoluted on here.
Chris
right - can you reach me at dakretz at gmail.com
le dorfier
Heh - got to throw in a correlated subquery there ...
le dorfier