I have a set of data (CSV files) in the following 3-column format:
A, B, C
3277,4733,54.1
3278,4741,51.0
3278,4750,28.4
3278,4768,36.0
3278,4776,50.1
3278,4784,51.4
3279,4792,82.6
3279,4806,78.2
3279,4814,36.4
And I need to get a three-way contingency table like this (rows are A, columns are B, cells hold C; sorry, this doesn't look completely good):
A \ B   4733  4741  4750  4768  4776  4784  4792  4806  4814
3277    54.1
3278          51    28.4  36    50.1  51.4
3279                                        82.6  78.2  36.4
This is similar to an Excel "pivot table", an OpenOffice DataPilot, or R's "table(x,y,z)".
The problem is that my dataset is HUGE (more than 500,000 total rows, with about 400 different factors in A and B). The limits of OOo, MSO and R prevent me from achieving this.
I am sure a Python script could be used to create such a table. Both A and B are numbers (but can be treated as strings).
Has anyone dealt with this? (Pseudocode or code in C or Java is also welcome ... but I prefer Python as it is faster to implement :)
Edit: Almost have it, thanks to John Machin. The following Python script nearly provides what I am looking for. However, when writing the output file I can see that the values in the "headers" I am writing (taken from the first row) do not correspond to the other rows.
from collections import defaultdict as dd

# d[A][B] = C
d = dd(lambda: dd(float))

input = open("input.txt")
output = open("output.txt", "w")

while 1:
    line = input.readline()
    if not line:
        break
    line = line.strip('\n').strip('\r')
    splitLine = line.split(',')
    if (len(splitLine) < 3):
        break
    d[splitLine[0]][splitLine[1]] = splitLine[2]

# header row: the B keys of the first A entry only
output.write("\t")
for k, v in list(d.items())[0][1].items():
    output.write(str(k) + "\t")
output.write("\n")

# one row per A value, writing only the B values that row actually has
for k, v in d.items():
    output.write(k + "\t")
    for k2, v2 in v.items():
        output.write(str(v2) + "\t")
    output.write("\n")

input.close()
output.close()
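One way the header mismatch might be addressed, as a minimal sketch rather than a definitive fix: collect every distinct B value seen in the whole file, sort that list once, and use the same column order for the header and for every row, leaving a cell empty where an A/B combination has no C value. The names build_pivot and write_pivot and the in-memory sample are assumptions made for illustration; the sketch also assumes that A and B always parse as integers and that any row whose third field is not numeric (such as the "A, B, C" header) can be skipped.

```python
import csv
import io
from collections import defaultdict


def build_pivot(lines):
    """Return (sorted A keys, sorted B keys, nested dict table[a][b] = c)."""
    table = defaultdict(dict)
    columns = set()
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        a, b, c = row[0], row[1], row[2]
        try:
            float(c)
        except ValueError:
            continue  # skip the "A, B, C" header row (assumption)
        table[a][b] = c
        columns.add(b)
    # Assumption: A and B are integer-like, so sort them numerically.
    return sorted(table, key=int), sorted(columns, key=int), table


def write_pivot(lines, out):
    """Write a tab-separated table with one shared column order."""
    rows, cols, table = build_pivot(lines)
    out.write("\t".join([""] + cols) + "\n")
    for a in rows:
        # Look up every column for every row; missing cells stay empty.
        cells = [table[a].get(b, "") for b in cols]
        out.write("\t".join([a] + cells) + "\n")


# Hypothetical in-memory sample taken from the first rows of the data above.
sample = io.StringIO(
    "A, B, C\n"
    "3277,4733,54.1\n"
    "3278,4741,51.0\n"
    "3278,4750,28.4\n"
    "3279,4792,82.6\n"
)
result = io.StringIO()
write_pivot(sample, result)
print(result.getvalue())
```

For the real files, the same write_pivot call should work with open("input.txt") and open("output.txt", "w") in place of the two StringIO objects. With roughly 400 distinct values in A and in B, the nested dict holds at most about 160,000 cells, which should fit comfortably in memory.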