Hi everyone
I have a SQL 2005 table consisting of around 10million records (dbo.Logs).
I have another table, dbo.Rollup that matches distinct dbo.Logs.URL to a FileId column in a third table, dbo.Files. The dbo.Rollup table forms the basis of various aggregate reports we run at a later stage.
Suffice to say for now, the problem I am having is in populating dbo.Rollup efficiently.
By definition, dbo.Logs has potentially tens of thousands of rows which all share the same URL field value. In our application, one URL can be matched to one dbo.Files.FileId. I.E. There is a many-to-one relationship between dbo.Logs.URL and dbo.Files.FileId (we parse the values of dbo.Logs to determine what the appropriate FileId is for a given URL).
My goal is to significantly reduce the amount of time it takes the first of three stored procedures that run in order to create meaningful statistics from our raw log data.
What I need is a specific example of how to refactor this SQL query to be much more efficient:
sp-Rollup-Step1:
INSERT INTO dbo.Rollup ([FileURL], [FileId])
SELECT
logs.RequestedFile As [URL],
FileId = dbo.fn_GetFileIdFromURL(l.RequestedFile, l.CleanFileName)
FROM
dbo.Logs l (readuncommitted)
WHERE
NOT EXISTS (
SELECT
FileURL
FROM
dbo.Rollup
WHERE
FileUrl = RequestedFile
)
fn_GetFileIdFromURL():
CREATE FUNCTION [dbo].[fn_GetFileIdFromURL]
(
@URL nvarchar(500),
@CleanFileName nvarchar(255)
)
RETURNS uniqueidentifier
AS
BEGIN
DECLARE @id uniqueidentifier
if (exists(select FileURL from dbo.[Rollup] where [FileUrl] = @URL))
begin
-- This URL has been seen before in dbo.Rollup.
-- Retrieve the FileId from the dbo.Rollup table.
set @id = (select top 1 FileId from dbo.[Rollup] where [FileUrl] = @URL)
end
else
begin
-- This is a new URL. Hunt for a matching URL in our list of files,
-- and return a FileId if a match is found.
Set @id = (
SELECT TOP 1
f.FileId
FROM
dbo.[Files] f
INNER JOIN
dbo.[Servers] s on s.[ServerId] = f.[ServerId]
INNER JOIN
dbo.[URLs] u on
u.[ServerId] = f.[ServerId]
WHERE
Left(u.[PrependURLProtocol],4) = left(@URL, 4)
AND @CleanFileName = f.FileName
)
end
return @id
END
Key considerations:
- dbo.Rollup should contain only one entry for each DISTINCT/unique URL found in dbo.tLogs.
- I would like to omit records from being inserted into dbo.[Rollup] where the FileId is NULL.
In my own observations, it seems the slowest part of the query by far is in the stored procedure: the "NOT EXISTS" clause (I am not sure at this point whether that continually refreshes the table or not).
I'm looking for a specific solution (with examples using either pseudo-code or by modifying my procedures shown here) - answer will be awarded to those who provide it!
Thanks in advance for any assistance you can provide.
/Richard.