tags:

views:

2687

answers:

8

I have a table of time-series data of which I need to find all columns that contain at least one non-null value within a given time period. So far I am using the following query:

select max(field1),max(field2),max(field3),... 
   from series where t_stamp between x and y

Afterwards I check each field of the result if it contains a non-null value.

The table has around 70 columns and a time period can contain >100k entries.

I wonder if there if there is a faster way to do this (using only standard sql).

EDIT: Unfortunately, refactoring the table design is not an option for me.

A: 

Edit: I think I misread the question... this will give you all the rows with a non-null value. I'll leave it here in case it helps someone but it's not the answer to your question. Thanks @Pax

I think you want to use COALESCE:

SELECT ... WHERE COALESCE(fild1, field2, field3) IS NOT NULL

Greg
He's looking to find the columns with at least one NULL, your solution gives NULL for the row if they're all NULL. I don't think that helps with detecting individual columns.
paxdiablo
@Pax: Re-read the problem statement. He wants to find NON-null columns, so filtering the rows will give a start at least.
GalacticCowboy
+1  A: 

It would be faster with a different table design:

create table series (fieldno integer, t_stamp date);

select distinct fieldno from series where t_stamp between x and y;

Having a table with 70 "similar" fields is not generally a good idea.

Tony Andrews
A: 

For a start, this is a very bad idea with standard SQL since not all DBMSs sort with NULLs last.

There are all sorts of tricky ways you could do this and most would be interminably slow.

I'd suggest you (sort-of) normalize the database some more so that each of the columns is in a separate table which would make a select easier but that's probably not what you want.

After edit of question: if refactoring table design is not an option, your given solution is probably the best, especially if you have indexes on all the 70 columns.

Although that's likely to slow down inserts quite a bit, you may want to use a non-indexed table for maximum insert speed and transfer the data periodically (overnight?) to an indexed table which would run your selects at best speed (by avoiding a full table scan).

paxdiablo
A: 
select count(field1),count(field2),count(field3),... 
   from series where t_stamp between x and y

will tell you how many non-null values are in each column. Unfortunately, it's not much better than the way you're doing it now.

R. Bemrose
A: 

Try this:

SELECT CASE WHEN field1 IS NOT NULL THEN '' ELSE 'contains null' END AS field1_stat,
       CASE WHEN field2 IS NOT NULL THEN '' ELSE 'contains null' END AS field2_stat,
... for every field to be checked
FROM   series
WHERE  foo IN bar
GROUP BY CASE WHEN field1 IS NOT NULL THEN '' ELSE 'contains null' END,
         CASE WHEN field2 IS NOT NULL THEN '' ELSE 'contains null' END 
... etc

This will give you a summary on the combination of 'nulled' fields in the table

devio
+1  A: 

When you say "a faster way to do this", if you mean a faster way for the query to run, then yes, here's how to do it: break it out into one query per column:

select top 1 field1 from series where t_stamp between x and y and field1 is not null

select top 1 field2 from series where t_stamp between x and y and field2 is not null

select top 1 field3 from series where t_stamp between x and y and field3 is not null

This way, you won't be doing a table scan across the entire table to find the maximum value. Instead, the database engine will stop as soon as it finds a non-null value. Assuming your data isn't 99% nulls, this should give you faster execution - but at the expense of more programming time to set this up.

Brent Ozar
A: 

How about this... You query for a list of field names that you can iterate through.

select 'field1' as fieldname from series 
   where field1 is not null and t_stamp between x and y
UNION
select 'field2' from series where field2 is not null 
... etc

Then you have a recordset that will only contain the string name of the fields that are not null. Then you can loop over this recordset to build your real query as dynamic SQL and ignore fields that don't have any data. The "select 'field2'" will not return a string when there is no crieteria matching the where clause.

Kieveli
+4  A: 

The EXISTS operation may be faster since it can stop searching as soon as it finds any row that matches the criteria (vs. the MAX which you are using). It depends on your data and how smart your SQL server is. If most of your columns have a high rate of non-null data then this method will find rows quickly and it should run quickly. If your columns are mostly NULL values then your method may be faster. I would give them both a shot and see how they are each optimized and how they run. Also keep in mind that performance may change over time if the distribution of your data changes significantly.

Also, I've only tested this on MS SQL Server. I haven't had to code strict ANSI compatible SQL in over a year, so I'm not sure that this is completely generic.

SELECT
     CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field1 IS NOT NULL) THEN 1 ELSE 0 END AS field1,
     CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field2 IS NOT NULL) THEN 1 ELSE 0 END AS field2,
...

EDIT: Just to clarify, the MAX method might be faster since it could determine those values with a single pass through the data. Theoretically, the method here could as well, and potentially with less than a full pass, but your optimizer may not recognize that all of the subqueries are related, so it might do separate passes for each. That still might be faster, but as I said it depends on your data.

Tom H.
I just benchmarked your query. You are correct, if most of the columns are not null, then EXISTS totally beats MAX. However, there are periods where only 30 of the 70 columns are filled, and for those EXISTS executes about 10 to 15 times slower. Guess the optimizer is not as smart as I hoped :(
Wolf