views:

237

answers:

3

For 1,000,000 observations, I observed a discrete event, X, 3 times for the control group and 10 times for the test group.

I need to preform a Chi square test of independence in Matlab. This is how you would do it in r:

m <- rbind(c(3, 1000000-3), c(10, 1000000-10))
#      [,1]   [,2] 
# [1,]    3 999997
# [2,]   10 999990
chisq.test(m)

The r function returns chi-squared = 2.7692, df = 1, p-value = 0.0961.

What Matlab function should I use or create to do this?

A: 

The Statistics Toolbox will contain the functions you want.

EDIT: Or it doesn't

Here is someone who did it on their own. Chi^2 independence test

James
@James...which function in the statistic toolbox? There are multiple chi square functions in there
Elpezmuerto
Gah, doesn't look like they have that function in there. They only have goodness of fit, maybe i was thinking of maple...
James
@James, Ya I looked in there before I posted and I am kinda surprised Matlab doesn't have anything, hence why I am here :)
Elpezmuerto
For reference, the statistics package in Maple is probably easier to work with.
James
@James, I went to the posted webpage too but the link where he posts the code is down. I have to use Matlab :p
Elpezmuerto
@Elpezmuerto If i had the time i'd crack out my probability text and write you one, unfortunately i'm rather tied up at the moment. I'd say look around on the net, someones gotta have one.
James
@James I spent a good while looking online and got nothing useful. I really hate to use stack overflow like this, but I am stuck.
Elpezmuerto
@James I grabbed a stat book and wrote the code, thanks
Elpezmuerto
+7  A: 

Here is my own implementation that I use:

function [hNull pValue X2] = ChiSquareTest(o, alpha)
    %#  CHISQUARETEST  Pearson's Chi-Square test of independence
    %#
    %#    @param o          The Contignecy Table of the joint frequencies
    %#                      of the two events (attributes)
    %#    @param alpha      Significance level for the test
    %#    @return hNull     hNull = 1: null hypothesis accepted (independent)
    %#                      hNull = 0: null hypothesis rejected (dependent)
    %#    @return pValue    The p-value of the test (the prob of obtaining
    %#                      the observed frequencies under hNull)
    %#    @return X2        The value for the chi square statistic
    %#

    %# o:     observed frequency
    %# e:     expected frequency
    %# dof:   degree of freedom

    [r c] = size(o);
    dof = (r-1)*(c-1);

    %# e = (count(A=ai)*count(B=bi)) / N
    e = sum(o,2)*sum(o,1) / sum(o(:));

    %# [ sum_r [ sum_c ((o_ij-e_ij)^2/e_ij) ] ]
    X2 = sum(sum( (o-e).^2 ./ e ));

    %# p-value needed to reject hNull at the significance level with dof
    pValue = 1 - chi2cdf(X2, dof);
    hNull = (pValue > alpha);

    %# X2 value needed to reject hNull at the significance level with dof
    %#X2table = chi2inv(1-alpha, dof);
    %#hNull = (X2table > X2);

end

And an example to illustrate:

t = [3 999997 ; 10 999990]
[hNull pVal X2] = ChiSquareTest(t, 0.05)

hNull =
     1
pVal =
     0.052203
X2 =
       3.7693

Note that the results are different from yours because chisq.test performs a correction by default, according to ?chisq.test

correct: a logical indicating whether to apply continuity correction when computing the test statistic for 2x2 tables: one half is subtracted from all |O - E| differences.


Alternatively if you have the actual observations of the two events in question, you can use the CROSSTAB function which computes the contingency table and return the Chi2 and p-value measures:

X = randi([1 2],[1000 1]);
Y = randi([1 2],[1000 1]);
[t X2 pVal] = crosstab(X,Y)

t =
   229   247
   257   267
X2 =
     0.087581
pVal =
      0.76728

the equivalent in R would be:

chisq.test(X, Y, correct = FALSE)

Note: Both (MATLAB) approaches above require the Statistics Toolbox

Amro
Ah, ninja'd. +1 for the code!
Jonas
@Amro, How would you implement the `correct = true` for matlab?
Elpezmuerto
well according to the R documentation just subtract half from |O-E|, so use the following instead: `X2 = sum(sum( (abs(o-e)-0.5).^2 ./ e ));` but you will have to manually check that this correction is only applied for 2x2 tables: http://en.wikipedia.org/wiki/Yates%27_correction_for_continuity
Amro
@Amro...High Five! The correct works!
Elpezmuerto
To calculate the alternative likelihood-ratio G2 statistic, use G2 = 2*sum(sum(o .* log(o./e)));
Elpezmuerto
@Elpezmuerto: thanks for the tip, for those interested here's the Wikipedia entry on G-test: http://en.wikipedia.org/wiki/G-test
Amro
A: 

This function will test for independence using the Pearson chi-squared statistic and the Likelihood-Ratio statistic, along with calculating residuals. I know this can be vectorized further, but I am trying to show the math for each step.

function independenceTest(data)
df = (size(data,1)-1)*(size(data,2)-1); % Mean Degrees of Freedom
sd = sqrt(2*df);                        % Standard Deviation

u         = nan(size(data)); % Estimated expected frequencies 
p         = nan(size(data)); % Values used to calculate chi-square
lr        = nan(size(data)); % Values used to calculate likelihood-ratio
residuals = nan(size(data)); % Residuals

rowTotals    = sum(data,1);
colTotals    = sum(data,2);
overallTotal = sum(rowTotals);

%% Calculate estimated expected frequencies
for r=1:1:size(data,1)
    for c=1:1:size(data,2)
        u(r,c) = (rowTotals(c) * colTotals(r)) / overallTotal;
    end
end

%% Calculate chi-squared statistic
for r=1:1:size(data,1)
    for c=1:1:size(data,2)
        p(r,c) = (data(r,c) - u(r,c))^2 / u(r,c);
    end
end
chi = sum(sum(p)); % Chi-square statistic

%% Calculate likelihood-ratio statistic
for r=1:1:size(data,1)
    for c=1:1:size(data,2)
        lr(r,c) = data(r,c) * log(data(r,c) / u(r,c));
    end
end
G = 2 * sum(sum(lr)); % Likelihood-Ratio statisitc

%% Calculate residuals
for r=1:1:size(data,1)
    for c=1:1:size(data,2)
        numerator   = data(r,c) - u(r,c);
        denominator = sqrt(u(r,c) * (1 - colTotals(r)/overallTotal) * (1 - rowTotals(c)/overallTotal));
        residuals(r,c) = numerator / denominator;
    end
end
Elpezmuerto
Check out @Amro's code. He does the same calculations without looping, and thus more concisely.
Jonas