views:

81

answers:

3

Hi everyone

I have a SQL 2005 table consisting of around 10million records (dbo.Logs).

I have another table, dbo.Rollup that matches distinct dbo.Logs.URL to a FileId column in a third table, dbo.Files. The dbo.Rollup table forms the basis of various aggregate reports we run at a later stage.

Suffice to say for now, the problem I am having is in populating dbo.Rollup efficiently.

By definition, dbo.Logs has potentially tens of thousands of rows which all share the same URL field value. In our application, one URL can be matched to one dbo.Files.FileId. I.E. There is a many-to-one relationship between dbo.Logs.URL and dbo.Files.FileId (we parse the values of dbo.Logs to determine what the appropriate FileId is for a given URL).

My goal is to significantly reduce the amount of time it takes the first of three stored procedures that run in order to create meaningful statistics from our raw log data.

What I need is a specific example of how to refactor this SQL query to be much more efficient:

sp-Rollup-Step1:

INSERT INTO dbo.Rollup ([FileURL], [FileId])

SELECT 
 logs.RequestedFile As [URL],
 FileId = dbo.fn_GetFileIdFromURL(l.RequestedFile, l.CleanFileName)

FROM
 dbo.Logs l (readuncommitted) 

WHERE    

NOT EXISTS (
    SELECT
     FileURL
    FROM
     dbo.Rollup
    WHERE
     FileUrl = RequestedFile
)

fn_GetFileIdFromURL():

CREATE FUNCTION [dbo].[fn_GetFileIdFromURL] 
(       
    @URL nvarchar(500),
    @CleanFileName nvarchar(255)
)
RETURNS uniqueidentifier
AS
BEGIN

     DECLARE @id uniqueidentifier

     if (exists(select FileURL from dbo.[Rollup] where [FileUrl] = @URL))
     begin
     -- This URL has been seen before in dbo.Rollup.
            -- Retrieve the FileId from the dbo.Rollup table.
     set @id = (select top 1 FileId from dbo.[Rollup] where [FileUrl] = @URL)  
     end
     else
     begin
     -- This is a new URL. Hunt for a matching URL in our list of files,
            -- and return a FileId if a match is found.
     Set @id = (

      SELECT TOP 1
      f.FileId

      FROM
      dbo.[Files] f

      INNER JOIN
      dbo.[Servers] s on s.[ServerId] = f.[ServerId]

      INNER JOIN
      dbo.[URLs] u on 
          u.[ServerId] = f.[ServerId]

      WHERE
       Left(u.[PrependURLProtocol],4) = left(@URL, 4)
      AND @CleanFileName = f.FileName  
     )

     end

     return @id

END

Key considerations:

  • dbo.Rollup should contain only one entry for each DISTINCT/unique URL found in dbo.tLogs.
  • I would like to omit records from being inserted into dbo.[Rollup] where the FileId is NULL.

In my own observations, it seems the slowest part of the query by far is in the stored procedure: the "NOT EXISTS" clause (I am not sure at this point whether that continually refreshes the table or not).

I'm looking for a specific solution (with examples using either pseudo-code or by modifying my procedures shown here) - answer will be awarded to those who provide it!

Thanks in advance for any assistance you can provide.

/Richard.

+2  A: 

Short answer is you have a CURSOR here. The scalar UDF is run per row of output.

The udf could be 2 LEFT JOINs onto derived tables. A rough outline:

...
COALESCE (F.xxx, L.xxx) --etc
...
FROM
 dbo.Logs l (readuncommitted)
 LEFT JOIN
 (select DISTINCT --added after comment
FileId, FileUrl from dbo.[Rollup]) R ON L.FileUrl = R.FileUrl
 LEFT JOIN
 (SELECT DISTINCT --added after comment
                f.FileId,
FileName ,
left(@PrependURLProtocol, 4) + '%' AS Left4
                FROM
                dbo.[Files] f

                INNER JOIN
                dbo.[Servers] s on s.[ServerId] = f.[ServerId]

                INNER JOIN
                dbo.[URLs] u on 
                           u.[ServerId] = f.[ServerId]
) F ON L.CleanFileName = R.FileName AND L.FileURL LIKE F.Left4
...

I'm also not sure if you need the NOT EXISTS because of how the udf works. If you do, make sure the columns are indexed.

gbn
Hi gbn, haven't had much luck implementing this ... it turns up duplicates in the SELECT query which it must not do, i.e. it must select distinct URLs in dbo.Logs and retrieve corresponding FileIds for each. Perhaps I've misinterpreted your instructions?
Richard
@Richard: You probably need DISTINCT in the derived tables
gbn
@KM: np, I got a cleanup badge out of it ;-)
gbn
@gbn - right, so I'll give that a go to see what that turns up; just wondering what DISTINCT will do for performance ;) Thanks
Richard
@Richard: DISTINCT replaces the udf though, so it should still be quicker
gbn
@gbn: thanks for the help; your advice has put me on the right track. Still got plenty of tweaking to do elsewhere, though! All the best...
Richard
+1  A: 

Hello,

I think your hotspot is located here:

Left(u.[PrependURLProtocol],4) = left(@URL, 4)

This will cause the server to do a scan on the url table. You should not use a function on a field in a join clause. try to rewrite that to something like

... where PrependURLProtocol like left(@URL, 4) +"%"

And make sure you have an index on the field.

Heiko Hatzfeld
A: 
INSERT INTO dbo.Rollup ([FileURL], [FileId])
SELECT  
 logs.RequestedFile As [URL], 
 FileId = dbo.fn_GetFileIdFromURL(l.RequestedFile, l.CleanFileName)
FROM dbo.Logs l (readuncommitted) LEFT OUTER JOIN dbo.Rollup
 on FileUrl = RequestedFile
WHERE FileUrl IS NULL

The logic here is that if dbo.Rollup does not exist for the given FileUrl, then the left outer join will turn up null. The NOT EXISTS now becomes an IS NULL, which is faster.

Ryan Michela
Hi Ryan, this looks good - I'll try it out now.
Richard