I have 1,000,000 rows per month per PC generated by some monitoring software. The DataToImport (temporary) table looks like this:

EventID    int           NOT NULL   -- Primary key of the denormalized table
EventType  int           NOT NULL   -- An enumerated value
Computer   nvarchar(50)  NOT NULL   -- Usually the computer name
When       DateTime      NOT NULL   
FileRef    int           NOT NULL   -- File generator's reference 
FileDesc   nvarchar(100) NOT NULL   -- Human-readable description
FilePath   nvarchar(100) NOT NULL   -- Relative path on disk

I am trying to normalize this data into several tables:

Computer (UniqueID, Name)
File     (UniqueID, FileRef, FileDesc, FilePath)
Event    (ID, Type, ComputerUniqueID, When, FileUniqueID)

...so that 'Event' would have a very large number of rows, but each row is small, so the database size stays manageable and the tables can be indexed for query performance. The import looks like this:

-- Grab new computers (DISTINCT, since each name appears on many import rows)
INSERT INTO [Computer] ([Name])
SELECT DISTINCT [Computer]
FROM [DataToImport]
WHERE [DataToImport].[Computer] NOT IN (SELECT [Name] FROM [Computer])

-- Grab new files
INSERT INTO [File] ([FileRef], [FileDesc], [FilePath])
SELECT [FileRef], [FileDesc], [FilePath] 
FROM [DataToImport]
WHERE [FileRef] NOT IN (SELECT [FileRef] FROM [File])

-- Normalize rows
INSERT INTO [Event] ([ID], [Type], [ComputerUniqueID], [When], [FileUniqueID])
SELECT [EventID], [EventType], [Computer].[UniqueID], [DataToImport].[When], [File].[UniqueID]
FROM [DataToImport]
  INNER JOIN [Computer] ON [DataToImport].[Computer] = [Computer].[Name]
  INNER JOIN [File] ON [DataToImport].[FileRef] = [File].[FileRef]

This all looks great, except that the triplet (FileRef, FileDesc, FilePath) is really a compound key: any one of the three items can vary, and each distinct combination represents a unique entry. I need to extract the distinct triplets and insert only the ones that are not already in the File table:

-- Grab new distinct files
INSERT INTO [File] ([FileRef], [FileDesc], [FilePath])
SELECT DISTINCT [FileRef], [FileDesc], [FilePath] 
FROM [DataToImport]
WHERE [FileRef] NOT IN (errrrr....help!)

How can I ensure that the unique File rows are normalised?

+3  A: 
INSERT
INTO    [File] ([FileRef], [FileDesc], [FilePath])
SELECT  DISTINCT [FileRef], [FileDesc], [FilePath] 
FROM    [DataToImport] di
WHERE   NOT EXISTS
        (
        SELECT  di.FileRef, di.FileDesc, di.FilePath
        INTERSECT
        SELECT  FileRef, FileDesc, FilePath
        FROM    [File]
        )

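It's not important in your case, since all of your columns are NOT NULL, but this would also handle NULL values correctly if the columns were nullable: INTERSECT treats two NULLs as equal, the same way DISTINCT does, whereas a plain = comparison would not match them.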
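
To illustrate the NULL point, here is a minimal sketch, assuming SQL Server 2008 or later and the default ANSI_NULLS ON setting. The variables @a and @b are hypothetical stand-ins for a nullable column value on each side and are not part of the import itself:

```sql
-- Hypothetical variables standing in for a nullable column on each side.
DECLARE @a int = NULL, @b int = NULL;

-- '=' evaluates to UNKNOWN when either side is NULL, so the ELSE branch is taken.
SELECT CASE WHEN @a = @b THEN 'match' ELSE 'no match' END AS EqualsResult;

-- INTERSECT treats two NULLs as the same value, so EXISTS finds a row.
SELECT CASE WHEN EXISTS (SELECT @a INTERSECT SELECT @b)
            THEN 'match' ELSE 'no match' END AS IntersectResult;
```

The first query should report 'no match' and the second 'match', which is why the INTERSECT form stays correct if a column is later made nullable.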

@Philip Kelley's solution is more elegant, though.

Quassnoi
+1 for good sportsmanship!
JBRWilkinson
+3  A: 

I'd use

INSERT INTO [File] ([FileRef], [FileDesc], [FilePath])
SELECT DISTINCT [FileRef], [FileDesc], [FilePath]
FROM   [DataToImport]
EXCEPT
SELECT [FileRef], [FileDesc], [FilePath]
FROM   [File]

...but I'd compare its performance with Quassnoi's SELECT...INTERSECT solution first.

Philip Kelley
+1, wish I had posted it first :) You can remove `DISTINCT`: `EXCEPT` implies it.
Quassnoi
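
For reference, a sketch of the variant this comment describes, assuming the same table and column names as in the question. EXCEPT returns only distinct rows from its left-hand query, so the explicit DISTINCT is redundant:

```sql
-- Insert every (FileRef, FileDesc, FilePath) triplet from the import
-- that is not already present in [File]. EXCEPT de-duplicates the result.
INSERT INTO [File] ([FileRef], [FileDesc], [FilePath])
SELECT [FileRef], [FileDesc], [FilePath]
FROM   [DataToImport]
EXCEPT
SELECT [FileRef], [FileDesc], [FilePath]
FROM   [File];
```

Whether dropping DISTINCT changes the execution plan is worth checking against your own data rather than assuming.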
Your answer is 40% faster. Would it also handle NULLs?
JBRWilkinson
@JBRW: yes, it will.
Quassnoi
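
One follow-on detail, offered as a sketch rather than as part of either answer: once [File] holds one row per distinct triplet, the "Normalize rows" step from the question presumably needs to join on all three columns as well. Joining on FileRef alone would match an event to every [File] row that shares its FileRef. The Event column names below are taken from the table outline in the question, and the aliases di, c and f are illustrative only:

```sql
-- Assumes the three File columns are NOT NULL, as in the question,
-- so plain '=' comparisons are sufficient for the join.
INSERT INTO [Event] ([ID], [Type], [ComputerUniqueID], [When], [FileUniqueID])
SELECT di.[EventID], di.[EventType], c.[UniqueID], di.[When], f.[UniqueID]
FROM   [DataToImport] AS di
  INNER JOIN [Computer] AS c ON  di.[Computer] = c.[Name]
  INNER JOIN [File]     AS f ON  di.[FileRef]  = f.[FileRef]
                             AND di.[FileDesc] = f.[FileDesc]
                             AND di.[FilePath] = f.[FilePath];
```

A unique index on ([FileRef], [FileDesc], [FilePath]) in [File] would be one way to guarantee that each import row matches exactly one file.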