views:

362

answers:

5

I need to average some values in a row-wise fashion, rather than a column-wise fashion. (If I were doing a column-wise average, I could just use avg()). My specific application of this requires me ignore NULLs in averaging. It's pretty straightforward logic, but seems awfully difficult to do in SQL. Is there an elegant way of doing my calculation?

I'm using SQLite3, for what it's worth.

Details

If you need more details, here's an illustration:

I have a a table with a survey:

| q1 | q2    | q3    | ... | q144 |
|----|-------|-------|-----|------|
| 1  | 3     | 7     | ... | 2    |
| 4  | 2     | NULL  | ... | 1    |
| 5  | NULL  | 2     | ... | 3    |

(Those are just some example values and simple column names. The valid values are 1 through 7 and NULL.)

I need to calculate some averages like so:

q7 + q33 + q38 + q40 + ... + q119 / 11 as domain_score_1
q10 + q11 + q34 + q35 + ... + q140 / 13 as domain_score_2
...
q2 + q5 + q13 + q25 + ... + q122 / 12 as domain_score_14

...but i need to pull out the nulls and average based on the non-nulls. So, for domain_score_1 (which has 11 items), I would need to do:

Input:  3, 5, NULL, 7, 2, NULL, 3, 1, 5, NULL, 1

(3 + 5 + 7 + 2 + 3 + 1 + 5 + 1) / (11 - 3)
27 / 8
3.375

A simple algorithm I'm considering is:

Input:

3, 5, NULL, 7, 2, NULL, 3, 1, 5, NULL, 1 

Coalesce each value to 0 if NULL:

3, 5, 0, 7, 2, 0, 3, 1, 5, 0, 1

Sum:

27

Get the number of non-zeros by converting values > 0 to 1 and sum:

3, 5, 0, 7, 2, 0, 3, 1, 5, 0, 1
1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1
8

Divide those two numbers

27 / 8
3.375

But that seems like a lot more programming than this should take. Is there an elegant way of doing this that I'm not aware of?

Update:

Unless I'm misunderstanding something, avg() won't work for this. Example of what I would want to do:

select avg(q7, q33, q38, ..., q119) from survey;

Output:

SQL error near line 3: wrong number of arguments to function avg()
+3  A: 

AVG already ignores nulls and does what you want:

The avg() function returns the average value of all non-NULL X within a group. String and BLOB values that do not look like numbers are interpreted as 0. The result of avg() is always a floating point value as long as at there is at least one non-NULL input even if all inputs are integers. The result of avg() is NULL if and only if there are no non-NULL inputs.

From http://www.sqlite.org/lang_aggfunc.html

So you can probably take the values you have per domain and load them into another table and then just run averages on that table. Or you could just unpivot your wide table and run averages on that, too.


AVG works on columns, not rows. So if you unpivoted your table you could use AVG and not have the problem you're facing. Let's look at a small example:

You have a table and it looks like this:

ID  | q1  | q2  | q3
----------------------
1   | 1   | 2   | NULL
2   | NULL| 2   | 56

You want to average q1 and q2 together because they're in the same domain, but they're separate columns so you can't. But if you changed your table to look like this:

ID  | question | value
-----------------------
1   | 1        | 1
1   | 2        | 2
1   | 3        | NULL
2   | 1        | NULL
2   | 2        | 2
2   | 3        | 56

Then you could take the average of the two questions easily:

SELECT AVG(value)
FROM Table
WHERE question IN (1,2)

And you can group by ID if you want an average per ID rather than a global average:

SELECT ID, AVG(value)
FROM Table
WHERE question IN (1,2)
GROUP BY ID
Welbog
Unless I missunderstand, `avg()` solves a different problem. I'm updating the question.
Benjamin Oakes
It does solve a different problem, so I'm proposing you convert your problem into the one that `AVG` solves by unpivoting your table.
Welbog
Ah, I missed that. Thanks for the update -- it looks like what I was looking for.
Benjamin Oakes
@Marcus: It ignores it. See the thing I quoted at the very top of my answer. That's the way SQL aggregate functions defined, not just specifically SQLite.
Welbog
So, wait... is there an easy way to pivot the table like you mentioned? I'm not finding anything simple for that.
Benjamin Oakes
@Benjamin: There's not likely an easy way to do it in SQLite. (There is in SQL Server 2005 - I know that much - not that that helps you here.) But depending on how this table is being populated it should be a matter of changing the model and changing the way it's populated.
Welbog
Yeah, I saw that when searching around... I was trying to do something with information_schema, but it's not really working out.
Benjamin Oakes
+1  A: 

Use a separate table to store survey scores for different questions (assuming that q's are because of question). Something like following

SurveyTable(SurveyId, ...)
SurveyRatings(SurveyId, QuestionId, Rating)

After that you can run query like

SELECT avg(Rating) WHERE SurveyId=?
Babar
+2  A: 

This is going to be a monstrous query, but you could do this:

SELECT AVG(q) FROM
((SELECT q7 AS q FROM survey) UNION ALL
(SELECT q33 FROM survey) UNION ALL
(SELECT q38 FROM survey) UNION ALL
...
(SELECT q119 FROM survey))

This converts your columns to rows and uses the AVG() function.

Of course, you probably want this for just a particular survey record, so don't forget the WHERE clause:

SELECT AVG(q) FROM
((SELECT q7 AS q FROM survey WHERE survey_id = 1) UNION ALL
(SELECT q33 FROM survey WHERE survey_id = 1) UNION ALL
(SELECT q38 FROM survey WHERE survey_id = 1) UNION ALL
...
(SELECT q119 FROM survey WHERE survey_id = 1))

You'd have a lot easier time if you normalized the q columns into their own table, with one question per row, and references back to survey. You'd have a 1 to many relationship between survey and question.

Marcus Adams
`AVG` is defined by the SQL standard to ignore NULLs while calculating the average of a column. Moreover, the documentation for SQLite clearly indicates that `AVG` ignores NULLs, so even if it weren't the standard it would still apply here. So please stop spreading misinformation like this.
Welbog
Fixed answer, as you said, AVG() works.
Marcus Adams
+3  A: 

In standard SQL

SELECT 
(SUM(q7)+SUM(q33)+SUM(q38)+SUM(q40)+..+SUM(q119))/
(COUNT(q7)+COUNT(q33)+COUNT(q38)+COUNT(q40)+..+COUNT(q119)) AS domain_score1 
FROM survey

would give you what you want SUM will coalesce to 0 if null and COUNT will not count NULLs. (hope SQLite3 conforms).

EDIT: Checked the http://www.sqlite.org/lang_aggfunc.html and SQLite conforms; if sum() is going to overflow you could use total() instead.

Also I second the opinions re normalization, if you don't normalize your table design (and whenever you see columns with numbers in their names raise a red flag) you are not going to have elegant SQL.

Unreason
I ended up going with something similar to what you suggested. (For others, it was `(coalesce(#{question}, 0) + ...) / ((case when #{question} > 0 then 1 else 0 end) + ...)`.) I'd like to normalize, but it's not worthwhile at this point -- this will probably be the last time we ever look at this data. (I went back and forth for a while and decided the hack was a better choice in this instance.)
Benjamin Oakes
A: 

Use:

SELECT AVG(x.answer)
  FROM (SELECT s.q7 AS answer
          FROM SURVEY s
        UNION ALL
        SELECT s.q33
          FROM SURVEY s
        UNION ALL    
       SELECT s.q38
         FROM SURVEY s
       ...
       UNION ALL
       SELECT s.q119
         FROM SURVEY s) x

Don't use UNION - you want duplicates if they exist.

OMG Ponies