views:

300

answers:

3

I have a dataset that contains:

Table { date itemName }

The date for the most part is sequential. There are no duplicates of the date [as it is the primary key].

The question is split up into multiple parts (all with respect to using SQL):

  1. Is it possible to find gaps in the date series listed in the table? For example: Dates 1/2/09-1/3/09 are missing
  2. Is it possible to find sections of dates that are missing from the table, that has a range greater than n (this is a number determined at run time)? For example: For n = 2 Dates 1/2/09-1/3/09 are not returned but Dates 5/6/09-6/1/09 are.
+1  A: 

Just create a function in plsql or in a client which will be checking all dates. Like this pseudocode:

date checked_date = 2000-01-01;
int unchecked_section = 0;
while ( checked_date <= today() ) {
  if (! sql(select itemName from Table where itemName=checked_date)) {
    unchecked_section++;
  } else {
    if ( unchecked_section>=n ) {
      print checked_date-unchecked_section, checked_date
    }
    unchecked_section = 0;
  }
  checked_date++;
}
if ( unchecked_section ) {
  print checked_date-unchecked_section, checked_date
}

It does not have to be very fast as it is maintenance only. There aren't many dates to check - only 365 a year.

Tometzky
If you don't have SQL window functions available, this approach is actually the fastest one possible on a large data set, as it only makes one pass over the table. One thing you need to watch is that the SELECT gets ORDER BY so the rows appear in sorted order. And you should use "SELECT min(date),max(date) from table" to get the loop limits--presuming things end at 'today' is not the best idea. PostgreSQL has lots of programming languages available you can run inside the database, PL/pgSQL is the standard one.
Greg Smith
+1  A: 

After some testing I came up with the following SQL statement:

SELECT date, itemName
  FROM "Table" as t1
  WHERE NOT EXISTS (
     SELECT date 
     FROM "Table" as t2 
     WHERE t2.date = (t1.date - INTERVAL '1 day')
  )
  ORDER BY date
  OFFSET 1  -- this will skip the first element

This will get you all rows that have no direct successor.

If you modify the statement to:

SELECT date, itemName
  FROM "Table" as t1
  WHERE NOT EXISTS (
    SELECT date 
    FROM "Table" as t2 
    WHERE (t2.date >= (t1.date - INTERVAL '2 day'))
    AND (t2.date < t1.date)
  )
  ORDER BY date
  OFFSET 1

you can use the INTERVAL length in the subselect's WHERE clause to filter by gaps of at least that size.

Hope that helps.

Frank Bollack
The runtime of this is proportional to the square of the table size, because both the outside SELECT and the inside EXISTS subquery are both doing things whose runtime is proportional to the size of the table. That might seem reasonable at first, but eventually it's going to get really expensive. Unfortunately, any other solution you do in straight SQL is going to suffer from the same issue, because SQL has no row memory. For an n-row table, you have to do a n X n join some way to solve this type of problem. When available, window functions are the best around this.
Greg Smith
@Greg: Thanks for the analysis. you are right, this isn't the fastest solution, if windows functios are at hand. But PostgreSQL 8.4 is a quite fresh release so there are chances, the OP is using an older version.Also look at the OPs comment on his question for his runtime performance requirements.
Frank Bollack
+2  A: 

If you can use PostgreSQL 8.4 then window functions will help:

SELECT *
    FROM (SELECT itemName, date, date - lag(date) OVER w AS gap
              FROM someTable WINDOW w AS (ORDER BY date)
         ) AS pairs
    WHERE pairs.gap > '1 day'::interval;
Ants Aasma
This is exactly the sort of problems window functions intend to solve efficiently, and there's no other SQL-only solution that can run as quickly as this.
Greg Smith
Unfortunately, I don't have version 8.4. I have 8.1. I wish I could get credit to this answer. I really like it.
monksy