I have production (q) values from 4 different methods, stored in 4 matrices. Each matrix contains the q values from one method:

Matrix_1 = 1 row x 20 columns 

Matrix_2 = 100 rows x 20 columns 

Matrix_3 = 100 rows x 20 columns 

Matrix_4 = 100 rows x 20 columns 

The number of columns indicates the number of years: one row contains the production values for the 20 years. The other 99 rows of Matrix_2, Matrix_3 and Matrix_4 are just different realizations (simulation runs), i.e. repeat cases of the base case, but not with exactly the same values because of the random numbers involved.

Consider Matrix_1 as the reference truth (or base case). Now I want to compare the other three matrices with Matrix_1 to see which of them (each with 100 realizations) compares best with, or most closely imitates, Matrix_1.

How can this be done in MATLAB?

I know that, manually, we would use confidence intervals (CIs): plot the mean of Matrix_1, and draw the distribution of the means of Matrix_2, Matrix_3 and Matrix_4. The largest CI among matrices 2, 3 and 4 that contains the reference truth (the mean of Matrix_1) will be the answer.

mean of Matrix_1 = (1 row x 1 column)

mean of Matrix_2 = (100 rows x 1 column)

mean of Matrix_3 = (100 rows x 1 column)

mean of Matrix_4 = (100 rows x 1 column)
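
A minimal MATLAB sketch of that manual approach (assuming the matrices are stored as Matrix_1 … Matrix_4, and taking the usual 95% interval for the mean of each set of realization means):

% per-realization means: average over the 20 years (dimension 2)
m1 = mean(Matrix_1, 2);   % scalar (reference truth)
m2 = mean(Matrix_2, 2);   % 100-by-1
m3 = mean(Matrix_3, 2);   % 100-by-1
m4 = mean(Matrix_4, 2);   % 100-by-1

% 95% confidence interval for the mean of each set of realization means
ci95 = @(x) mean(x) + tinv([0.025 0.975], numel(x)-1) * std(x)/sqrt(numel(x));
ci_m2 = ci95(m2);   % does ci_m2(1) <= m1 && m1 <= ci_m2(2) hold?
ci_m3 = ci95(m3);
ci_m4 = ci95(m4);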

I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in the question. Thanks!

EDIT: The three methods I talked about are sgsim, sisim and snesim, respectively. Here are my results:

ci_sgsim =

  1.0e+008 *

   4.084733001497999
   4.097677503988565

ci_sisim =

  1.0e+008 *

   5.424396063219890
   5.586301025525149

ci_snesim =

  1.0e+008 *

   2.429145282593182
   2.838897116739112

p_sgsim =

    8.094614835195452e-130

p_sisim =

    2.824626709966993e-072

p_snesim =

    3.054667629953656e-012

h_sgsim = 1; h_sisim = 1;  h_snesim = 1

None of the CIs from the three methods contains the reference mean (= 3.454992884900722e+008). So do we still consider the p-value to choose the best result?

+2  A: 

If I understand correctly, the calculation in MATLAB is pretty straightforward.

Steps 1-2 (mean calculation):

% mean() works along the first non-singleton dimension, so the 1-by-20
% reference gives a scalar and each 100-by-20 matrix gives a 1-by-20 row
% of per-year means (use mean(X,2) instead for per-realization means)
k1_mean = mean(k_5wells_reference_truth);
k2_mean = mean(k_5wells_sgsim);
k3_mean = mean(k_5wells_sisim);
k4_mean = mean(k_5wells_snesim);

Step 3, use HIST to plot distribution histograms:

hist([k2_mean; k3_mean; k4_mean]')   % one histogram per method (one column each)

Step 4. You can do a t-test comparing your vectors 2, 3 and 4 against a normal distribution with mean k1_mean and unknown variance. See TTEST for details.

[h,p,ci] = ttest(k2_mean,k1_mean);   % h: test decision at 5%, p: p-value, ci: 95% CI for the mean of k2_mean
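
For example (just a sketch, reusing the variable names above), you could run the test for all three methods and compare the p-values:

[h2, p2, ci2] = ttest(k2_mean, k1_mean);   % sgsim  vs. reference mean
[h3, p3, ci3] = ttest(k3_mean, k1_mean);   % sisim  vs. reference mean
[h4, p4, ci4] = ttest(k4_mean, k1_mean);   % snesim vs. reference mean
% the method with the highest p-value is the one least in conflict
% with the reference mean
[p_best, best] = max([p2 p3 p4]);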
yuk
You do NOT use a t-test for comparing distributions; a t-test only tests the means.
Joris Meys
You are right, but the OP wants to compare distributions with a single value, which is impossible. I believe he has confused comparing distributions with comparing means. See his item 4). He wants to find which matrix is closer to the first one. If his vectors' means are not too close to the `k1_mean` value, he can solve it with the t-test. Of course it assumes a normal distribution of the vectors' values, which can be tested.
yuk
@Yuk: Can you edit your answer, in order for me to upvote it, as by mistake the upvoted answer was reverted back. Also, how do I interpret my result, as I could not clearly understand the parameters **h, p and ci** in the function's help. Is the **ci** with the highest value the one that most closely resembles the reference truth?
Harpreet
@Yuk : You're right, I misinterpreted the question.
Joris Meys
@Harpreet, that's going to be the one with the highest p-value, I assume. The ci is the confidence interval, which is the interval that covers the true mean in 95% of the cases. The p-value tells you how big the chance is of getting the mean value of matrix 2, 3 or 4 if the true mean is really the value in matrix 1, given how the means are distributed (that's where the central limit theorem comes into play).
Joris Meys
@Joris: Thanks for keeping up. Please see my result above in the question. None of the CIs from the three methods contains the mean I compared against. So do we still consider the p-value to choose the best result? Thanks again.
Harpreet
@Harpreet: I edited my answer to explain a bit.
Joris Meys
@Harpreet: I changed nothing in the answer, but you should be able to change your vote now.
Joris Meys
@Joris: I did upvote.
Harpreet
@Harpreet, sorry for the late follow-up. Thanks to Joris, you already got a good answer on the confidence interval and p-value. I'd use the p-value as the criterion (that's how the h return value is calculated, by thresholding the p-value). Since you are doing 3 comparisons at the same time you can apply a Bonferroni correction (simply multiplying the p-values by 3), but it doesn't change the results much. It looks like all your matrices are quite far from the 1st one, so interpret the data accordingly. Don't ask which matrix is closest, but which one is least different. However, that's outside the scope of statistics.
yuk
+2  A: 

EDIT: I misinterpreted your question. See the answer of Yuk and the following comments. My answer is what you need if you want to compare the distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.

Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. Now the standard error of your mean is the standard deviation of your results divided by the square root of the number of observations, and the confidence interval is obtained by multiplying that standard error by approximately 2.
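
In MATLAB terms, that back-of-the-envelope calculation looks roughly like this (a sketch, using k2_mean from the other answer as an example):

n  = numel(k2_mean);                 % number of means in the vector
se = std(k2_mean) / sqrt(n);         % standard error of the mean
ci_manual = mean(k2_mean) + [-1 1] * tinv(0.975, n-1) * se;
% ci_manual reproduces (up to rounding) the ci returned by ttest above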

This confidence interval contains the true mean in 95% of the cases. So if the true mean lies exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. This can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with the mean of matrix 1. Given your p-values, these chances can be said to be practically non-existent.

So you see that when the number of values gets high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise, the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com.


Your question and your method aren't really clear:

  • Is the distribution equal in all columns? This is important, as two distributions can have the same mean but differ significantly:

[figure: two distributions with the same mean but different shapes]

  • Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution with sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work, if the distributions are alike!

Now if the question really is the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a two-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
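
In MATLAB those would be qqplot and kstest2 (Statistics Toolbox). A minimal sketch, assuming the variable names from the other answer and comparing the reference row against the column means of one method:

a = k_5wells_reference_truth;   % the single reference row (20 values)
b = mean(k_5wells_sgsim);       % column means of one method (20 values)
qqplot(a, b)                    % points close to the line = similar distributions
[h, p] = kstest2(a, b)          % two-sample Kolmogorov-Smirnov test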

On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.
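
For the three comparisons here, those corrections amount to something like this (a sketch; p2, p3 and p4 stand for the three raw p-values):

p_raw = [p2 p3 p4];                   % raw p-values of the three tests
p_bonferroni = min(p_raw * 3, 1);     % Bonferroni: multiply by the number of tests
p_dunn_sidak = 1 - (1 - p_raw).^3;    % Dunn-Sidak: slightly less conservative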

Joris Meys
@Joris: Thanks. Your explanation was really nice. I have a few questions based on your answer: **1)** _"Otherwise the one with the closest mean seems a good guess."_ - My 3 distributions are already made up of 100 means each. So if I were to choose the closest mean, which mean would that be - the mean of the 100 means, the one with the highest p-value, or the one with the largest CI? **2)** _"If you have to choose one, I'd take a look at the distributions anyway."_ - I tried _qqplot_ in Matlab, but it only compares distributions one-to-one. How would you have compared them? Thanks!!
Harpreet
@Harpreet: If you take the means of the columns, you have 4 vectors of 20 values. That way you can compare distributions 2, 3 and 4 with the distribution in matrix 1. It's not 100% statistically sound, but otherwise we have to go into more advanced modelling. If you're interested in which matrix is the closest, then you take the mean of the 100 means (which is what your confidence interval is built around), and you see which one approaches 3.45 the closest. It's also a guess, but not the worst one. -continued
Joris Meys
The thing is that you have to decide what "the closest" means: is it the one with the biggest chance of containing the true value in the CI? Or the one that is on average the closest? It can be that the average is further away but the spread larger, and thus the CI larger. You can even look at the distributions: the ones with the most similar shape are closest. This is (very simplified) what the Kolmogorov-Smirnov test is testing. But what "closest" means is something I can't decide for you.
Joris Meys
Thanks Joris. All your answers and comments were worthy!!
Harpreet
