ansaurus

Question

Python, dictionaries, and chi-square contingency table

Answer 1

+2 A:

Your 4 numbers for apple/1 add up to 12, more than the total number of observations (11)! There are only 5 documents outside time '1' that don't contain the word 'apple'.

You need to partition the observations into 4 disjoint subsets:
a: apple and 1 => 3
b: not-apple and 1 => 2
c: apple and not-1 => 1
d: not-apple and not-1 => 5
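
As a quick check of the partition above, here is a minimal sketch that counts the four cells for one (word, time) pair directly from the records. The records are the sample input used in the code further down; `partition_2x2` is a hypothetical helper name, not something from the question.

```python
# (word, time, frequency) records, copied from the sample input in this answer
records = [("apple", 1, 3), ("banana", 1, 2), ("apple", 2, 1),
           ("banana", 2, 4), ("orange", 3, 1)]

def partition_2x2(records, word, time):
    """Count the four disjoint cells (a, b, c, d) for one (word, time) pair.

    Hypothetical helper for illustration only.
    """
    a = b = c = d = 0
    for w, t, n in records:
        if w == word and t == time:
            a += n          # word and time
        elif t == time:
            b += n          # not-word and time
        elif w == word:
            c += n          # word and not-time
        else:
            d += n          # not-word and not-time
    return a, b, c, d

a, b, c, d = partition_2x2(records, "apple", 1)
total = sum(n for _, _, n in records)
```

Because every record falls into exactly one branch, the four counts always add up to the grand total (11 here), which is the property the original numbers violated.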

Here is some code that shows one way of doing it:

from collections import defaultdict

class Crosstab(object):

    def __init__(self):
        self.count = defaultdict(lambda: defaultdict(int))
        self.row_tot = defaultdict(int)
        self.col_tot = defaultdict(int)
        self.grand_tot = 0

    def add(self, r, c, n):
        self.count[r][c] += n
        self.row_tot[r] += n
        self.col_tot[c] += n
        self.grand_tot += n

def load_data(line_iterator, conv_funcs):
    ct = Crosstab()
    for line in line_iterator:
        r, c, n = [func(s) for func, s in zip(conv_funcs, line.split(','))]
        ct.add(r, c, n)
    return ct

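def display_all_2x2_tables(crosstab):
    # Print one line "row,col,a,b,c,d" for every (row, column) pair.
    for rx in crosstab.row_tot:
        for cx in crosstab.col_tot:
            a = crosstab.count[rx][cx]          # rx and cx
            b = crosstab.col_tot[cx] - a        # not-rx and cx
            c = crosstab.row_tot[rx] - a        # rx and not-cx
            d = crosstab.grand_tot - a - b - c  # not-rx and not-cx
            assert all(x >= 0 for x in (a, b, c, d))
            # print(...) with a single argument works on both Python 2 and 3
            print(",".join(str(x) for x in (rx, cx, a, b, c, d)))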

if __name__ == "__main__":

    # inputfile
    # <word, time, frequency>
    lines = """\
    apple, 1, 3
    banana, 1, 2
    apple, 2, 1
    banana, 2, 4
    orange, 3, 1""".splitlines()

    ct = load_data(lines, (str.strip, int, int))
    display_all_2x2_tables(ct)

and here is the output (row order follows dictionary iteration order, so it may differ between Python versions):

orange,1,0,5,1,5
orange,2,0,5,1,5
orange,3,1,0,0,10
apple,1,3,2,1,5
apple,2,1,4,3,3
apple,3,0,1,4,6
banana,1,2,3,4,2
banana,2,4,1,2,4
banana,3,0,1,6,4
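
Once you have a, b, c, d for a pair, the chi-square statistic follows directly. The sketch below is an addition, not part of the code above: it assumes you want the plain Pearson statistic for a 2x2 table without continuity correction, and `chi_square_2x2` is a hypothetical name.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table, no continuity correction.

    Uses the shortcut formula N * (a*d - b*c)**2 divided by the product
    of the four marginal totals. Returns None when a marginal total is
    zero, because the statistic is undefined in that case.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return None
    # float() keeps the division exact-ish on Python 2 as well as Python 3
    return n * (a * d - b * c) ** 2 / float(denom)

# Cells for apple/1 taken from the output above: a=3, b=2, c=1, d=5
apple_1 = chi_square_2x2(3, 2, 1, 5)
```

For apple/1 this comes to roughly 2.21. With counts this small the chi-square approximation is unreliable, so treat the value as illustrative rather than as a test result.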

John Machin 2010-06-13 00:49:12

This is a good approach. I especially like the techniques in `load_data` -- the use of `line_iterator` and `conv_funcs`.

FM 2010-06-13 01:38:43
