views:

1534

answers:

4

I have two tables. The first is a big table (millions of rows), with the most interesting column being an integer I'll just call "key". I believe the solution would be identical for date or datetime ranges as well, though.

The second table is much smaller (thousands of rows) with a bunch of attributes that are interesting to me which are defined over a range of keys. It has the following structure:

key_lower_bound : int
key_upper_bound : int
interesting_value1 : float
interesting_value2 : int
interesting_value3 : varchar(50)
...

I want to look up all of the values in the first table and "join" them with the second table based on whether the key in the first table falls inside the interval [key_lower_bound, key_upper_bound).

This is sort of like a sparse inner product or sparse dot product mathematically, but it's a little weird since there are these ranges involved in the second table. Still, if I were to write this up in code it would be an O(|first table| + |second table|) algorithm. I would keep a pointer into both (sorted) lists and walk through them each in order to determine if each key in the first table belonged in the range of the second table. The trick is that I am not iterating through the second list each time I examine a key in the first table because both lists are sorted.

When I construct the most obvious SQL query (checking that key is > key_lower_bound and < key_upper_bound) it takes WAY too long.

There is some kind of quadratic behavior going on with that naive query: I think the query engine is comparing each key against each row in the second table, when in reality, if the second table is sorted by key_lower_bound, this shouldn't be necessary. So I'm getting O(|first table| x |second table|) behavior instead of the desired O(|first table| + |second table|) behavior.

Is it possible to get a linear SQL query to do this?

A: 

For the first table I would put a clustered index on "key". For the second table I would put a clustered index on "key_lower_bound". Then I would try:

select *
from FirstTable f
inner join SecondTable s 
    on f.key between s.key_lower_bound and s.key_upper_bound

I would then add a second non-clustered index on "key_upper_bound" to see if that improved the performance.
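
For concreteness, a minimal DDL sketch of the indexes described above, using the hypothetical table and column names from the question (note that [key] needs brackets because KEY is a reserved word in T-SQL, and a table can have only one clustered index, so the clustered ones apply only if the tables don't already have clustered primary keys):

-- Sketch only: FirstTable/SecondTable and the column names are placeholders.
CREATE CLUSTERED INDEX IX_FirstTable_key
  ON FirstTable([key]);
CREATE CLUSTERED INDEX IX_SecondTable_lower
  ON SecondTable(key_lower_bound);
-- Optional second index, to test whether it improves the plan:
CREATE NONCLUSTERED INDEX IX_SecondTable_upper
  ON SecondTable(key_upper_bound);
GO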

RedFilter
I still think this would yield quadratic behavior - the SQL server would not "know" that after a certain number of rows in FirstTable, it can stop examining the first row in SecondTable. Thus, partway through the query, it's re-evaluating the BETWEEN predicate for each row in SecondTable up to the "current" SecondTable range. Do you see what I mean?
Chris Harris
@Chris, you're better off not thinking of the DBMS working iteratively; it might be, sometimes, but it makes more sense to think of it as executing atomically, operating on entire sets. I do see what you mean, but - I don't think it applies to real database operations.
Carl Manaster
It should use a binary search or equivalent to efficiently locate the start of the key_lower_bound range it needs to search in SecondTable. It will then sequentially search through to the end of the table checking key_upper_bound on each row. For very low values of key (e.g., 1) this means checking every row in SecondTable. But for very high values of key, it should check very few rows in SecondTable. I am unsure whether it is smart enough to make use of an index on key_upper_bound to help constrain which rows it checks key_upper_bound on, which is why I would try it both ways.
RedFilter
I think it would actually use both indexes, and create a set that matches key_lower_bound or greater, and another set that matches key_upper_bound or less, and then find the intersection of those two sets.
RedFilter
@OrbMan, fair enough. In the case of binary searches on both indexes to find the match, it'll be O(n*log(m)). That's still not linear. :)
Chris Harris
@Carl, I'm totally with you under most circumstances. I'm just trying to convey why I think the "quadratic" behavior is happening when a linear solution exists. I may have organized the problem poorly for an RDBMS; if so, I'd love suggestions on how to alter it. But currently I don't see any good solution, which I find very odd given that this is a pretty "set oriented" problem and we're using a "set oriented" technology. I have a hard time believing that one cannot construct an algorithm which is linear!
Chris Harris
A: 

In my experience there is no easy and robust solution. I have successfully used denormalization in many similar cases, copying key_lower_bound and key_upper_bound to the big table, and having a foreign key refer from the big table to the one with intervals. You also create a check constraint to make sure that (key > key_lower_bound and key < key_upper_bound), but this check only involves columns in one table, so it works OK. This is definitely denormalization, but the data never gets out of sync, because the FK constraint ensures that (key_lower_bound, key_upper_bound) in the big table matches the interval in the parent one. Because you don't need a join, your select performs very fast.

Similar problem solved by denormalization:

http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/03/08/storing-intervals-of-time-with-no-overlaps.aspx

Let me know if you need full DDL, it is very easy to write up.
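
For illustration, here is a rough sketch of what that DDL might look like (the table and column names, types, and the half-open interval check are placeholders based on the question, not the actual schema):

-- Hypothetical sketch of the denormalization described above.
CREATE TABLE dbo.Intervals(
  key_lower_bound INT NOT NULL,
  key_upper_bound INT NOT NULL,
  interesting_value1 FLOAT NOT NULL, -- ... other interesting_value columns
  CONSTRAINT PK_Intervals PRIMARY KEY(key_lower_bound, key_upper_bound),
  CONSTRAINT CK_Intervals_ValidRange CHECK(key_lower_bound < key_upper_bound));
GO
CREATE TABLE dbo.BigTable(
  [key] INT NOT NULL PRIMARY KEY,
  -- denormalized copies of the parent interval's bounds
  key_lower_bound INT NOT NULL,
  key_upper_bound INT NOT NULL,
  CONSTRAINT FK_BigTable_Intervals
    FOREIGN KEY(key_lower_bound, key_upper_bound)
    REFERENCES dbo.Intervals(key_lower_bound, key_upper_bound),
  -- single-table check: the key must fall inside its own copied interval
  CONSTRAINT CK_BigTable_KeyInRange
    CHECK([key] >= key_lower_bound AND [key] < key_upper_bound));
GO

The FK keeps the copied bounds in sync with the interval table, and the check constraint only touches columns of BigTable, so lookups need no join.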

AlexKuznetsov
Interesting solution Alex - thanks! Unfortunately, I think the denormalization is a very big price to pay. That means the additional range values are added to each row in the bigger table.
Chris Harris
+2  A: 

Well, I have played with the problem and have a couple of suggestions. But first let's populate a helper table:

CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
-- Fill the helper table with sequential integers 1..2^20 by doubling:
-- each pass inserts a copy of all existing numbers shifted by @i.
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
  INSERT INTO dbo.Numbers(n)
    SELECT n + @i FROM dbo.Numbers;
  SET @i = @i * 2;
END;
GO

and test data: one-minute commercials every minute for one year, and one customer call per minute for the same year:

CREATE TABLE dbo.Commercials(
  StartedAt DATETIME NOT NULL 
    CONSTRAINT PK_Commercials PRIMARY KEY,
  EndedAt DATETIME NOT NULL,
  CommercialName VARCHAR(30) NOT NULL);
GO
INSERT INTO dbo.Commercials(StartedAt, EndedAt, CommercialName)
SELECT DATEADD(minute, n - 1, '20080101')
    ,DATEADD(minute, n, '20080101')
    ,'Show #'+CAST(n AS VARCHAR(6))
  FROM dbo.Numbers
  WHERE n<=24*365*60;
GO
CREATE TABLE dbo.Calls(CallID INT 
  CONSTRAINT PK_Calls NOT NULL PRIMARY KEY,
  AirTime DATETIME NOT NULL,
  SomeInfo CHAR(300));
GO
INSERT INTO dbo.Calls(CallID,
  AirTime,
  SomeInfo)
SELECT n 
    ,DATEADD(minute, n - 1, '20080101')
    ,'Call during Commercial #'+CAST(n AS VARCHAR(6))
  FROM dbo.Numbers
  WHERE n<=24*365*60;
GO
CREATE UNIQUE INDEX Calls_AirTime
  ON dbo.Calls(AirTime) INCLUDE(SomeInfo);
GO

The original attempt to select all the calls made during commercials for three hours in the middle of the year is terribly slow:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time: 
   CPU time = 15 ms, elapsed time = 30 ms.

(1 row(s) affected)
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 2, logical reads 7166, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 71704 ms,  elapsed time = 36316 ms.

The reason is simple: we know that commercials do not overlap, so one call fits into at most one commercial, but the optimizer does not know it. We know that commercials are short, but the optimizer does not know that either. Both assumptions can be enforced as constraints, but the optimizer still will not know it.

Assuming that commercials are no longer than 15 minutes, we can tell that to the optimizer, and the query is very fast:

SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time: 
   CPU time = 15 ms, elapsed time = 15 ms.

(1 row(s) affected)
Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 24 ms.

Assuming that commercials do not overlap, so that one call fits into at most one commercial, we can tell that to the optimizer, and the query is again very fast:

SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Calls c CROSS APPLY(
  SELECT TOP 1 s.StartedAt, s.EndedAt FROM dbo.Commercials s 
  WHERE c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
  ORDER BY s.StartedAt DESC) AS s
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 7 ms.

(1 row(s) affected)
Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 31 ms.
AlexKuznetsov
@Alex, this looks quite promising! I'm curious, is the CROSS APPLY efficient (e.g. O(1) for each row in Calls)? In other words, if you doubled the size of the DB, would it only take 2x as long?
Chris Harris
Chris, it is almost O(N) but not exactly: because you do index seeks and index depth increases, it is O(N + ln(N)), since index depth increases logarithmically.
AlexKuznetsov
@Alex, do you mean O(N*lg(N))? O(N + log(N)) == O(N).
Chris Harris
Chris, I agree - it is O(N)
AlexKuznetsov
A: 

To perform the linear algorithm that you describe would require two things the database doesn't have:

  1. The understanding that each row in your small table maps a distinct (disjoint) range of LargeTable.Key values to a single SmallTable row, and
  2. Tables stored as linked lists or arrays (which they are not).

I believe the closest you will get to the behavior you are describing is a merge join:

select t1.key
from largeTable t1
inner merge join smallTable t2
  on t1.key >= t2.key_lower_bound
  and t1.key < t2.key_upper_bound

You should understand that a table is stored as a B-tree or heap - so it is optimized to look for particular nodes - not for scanning. Scanning means you must keep up to log_B(N) pointers (e.g. in a stack) to remember your place in the tree without having to traverse back in. And this isn't even talking about disk access patterns.

As a secondary performance idea, you should try defining a single value that represents the range and using it as the primary key of the smallTable, which can be referenced from the largeTable as a foreign key. This is more efficient than a compound key (which is essentially what your lower_bound and upper_bound columns represent). Perhaps a hashed value such as PK = lower_bound & upper_bound << certain number of bits.
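
As a rough, hypothetical sketch of that idea (the packing scheme, column name, and table name here are assumptions, not part of the answer), a persisted computed column could combine the two bounds into one surrogate key that the large table could then reference:

-- Hypothetical: pack both bounds into one BIGINT surrogate key.
-- Assumes both bounds fit in 32 bits; 4294967296 = 2^32.
ALTER TABLE smallTable ADD range_key AS
  (CAST(key_lower_bound AS BIGINT) * 4294967296 + key_upper_bound) PERSISTED;
CREATE UNIQUE INDEX IX_smallTable_range_key ON smallTable(range_key);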


Just another reference that should illustrate why it's difficult for SQL to put this algorithm together. If you can use Matlab to process your stuff - that's probably a better bet :)

Jeff Meatball Yang
The following contradicts what Kalen Delaney wrote in her books on SQL Server internals: "You should understand that a table is stored as a B-tree or heap - so it is optimized to look for particular nodes - not for scanning. Scanning means you must keep up to log_B(N) pointers (e.g. in a stack) to remember your place in the tree without having to traverse back in". The opposite is true - scanning an index is very performant. This is why index covering is so useful.
AlexKuznetsov