views:

61

answers:

4

Dear all,

I wonder if there is a way to select the standard deviation from several integer fields in MySQL within the same row. Obviously, if I use

SELECT STDDEV(col1) FROM mytable

I just get the standard deviation of that particular column. Let´s assume I have a table like: id,somefield1,somefield2, integerfield1,integerfield2,integerfield3, ... ,integerfield30 . Now I´d like to select the standard deviation of integerfield 1-30 within a row and save it AS sdfield . Of course I could use statistical software for this, but I just wonder if there is a way to do it directly in MySQL.

thx in advance for any suggestions.

A: 

for simplicity, assume you have n columns, named A, B, C .... :

SELECT SQRT(  
  (A*A + B*B + C*C + ...)/n  - (A+B+C+...)*(A+B+C+...)/n/n) AS sd
  FROM table;
J-16 SDiZ
Or make a function out of it.
Alin Purcaru
Basically I like the idea of doing it with a formula... but I am not so sure if this transformation is correct. Aren't you supposed to SQRT() the added up differences from the mean? With your formula I get different results.
ran2
Yes, it should use SQRT(), see http://en.wikipedia.org/wiki/Standard_deviation under "Identities and mathematical properties" section
J-16 SDiZ
+1  A: 

Have you tried using UNION to effectively put all your column values into separate rows? Something like this, maybe:

SELECT STDDEV(allcols)
FROM (
    SELECT col1 FROM table WHERE id=requiredID
    UNION
    SELECT col2 FROM table WHERE id=requiredID
    UNION
    SELECT col3 FROM table WHERE id=requiredID
    UNION
    SELECT col4 FROM table WHERE id=requiredID
    UNION
    SELECT col5 FROM table WHERE id=requiredID
)
Spudley
A: 

I found two solutions on my own:

1) Normalize the database. I end up with two tables:

table one uid | information1 | metainformation2

table two uid | col | result_of_col

Then I can easily use the standard STDDEV function.

2) Use R. The data is a de-normalized format because it should be used in statistical analysis. Thus it´s easy to get into R and use the following code.

sd(t(dataset[1:4,3:8]))

Note that, I just take the numeric part of this data.frame by leaving selecting the columns 3-8. And dont get hit by too much data (that´s why I only use the first couple of rows this time). t() transposes the data which is necessary because sd() only works with columns.

There´s a function rowSds around in the vsn package, that is supposed to work analogously to rowMean and rowSum, but somehow this might be deprecated. At least this packages was not available on the Swiss CRAN mirror ;) .

HTH someone else.

ran2
normalizing the database would be much better.
J-16 SDiZ
+2  A: 

With R:

df <- your.pull
sd(t(df[sapply(df, is.numeric)]))

Pull data with RMySQL or RODBC, remove non numeric columns, transpose and use sd.

Brandon Bertelsen
+1 for the shortcut. Though this won't work in my case because I have some numeric which is not relevant for the sd (such as ids, or other variables). Hence I need to do this manually anyway. I just wondered if there is a better way than doing this in R. Now, I use dbWriteTable to write back to the db and wonder if I can add columns to the tables or if it is the best to fully remove them and the write with R.
ran2
What are the column names? You might be able to pull using df[grep()] if you can match the colnames with a consistent pattern. For writing back to the table, I wouldn't use dbWriteTable as it can overwrite or append data that's not meant to be there, I would use dbGetQuery() and use and UPDATE call.
Brandon Bertelsen