tags:

views:

70

answers:

3

Hi I have a file that consists of too many columns to open in excel. Each column has 10 rows of numerical values 0-2 and has a row saying the title of the column. I would like the output to be the name of the column and the average value of the 10 rows. The file is too large to open in excel 2000 so I have to try using python. Any tips on an easy way to do this.

Here is a sample of the first 3 columns:

Trial1 Trial2 Trial3

1 0 1

0 0 0

0 2 0

2 2 2

1 1 1

1 0 1

0 0 0

0 2 0

2 2 2

1 1 1

I want python to output as a test file

Trial 1 Trial 2 Trial 3 1 2 1 (whatever the averages are)

+2  A: 

You can use Numpy:

import numpy as np
from StringIO import StringIO

s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')

data = np.loadtxt(s, skiprows=1)  # skip header row
print data.mean(axis=0)  # column means

# OUTPUT: array([ 0.8,  1. ,  0.8])

Note that the first argument to loadtxt could be the name of your file instead of a file like object.

ars
+1  A: 

You can use the builtin csv module:

import csv
csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]

def average(l):
    return float(sum(l)) / len(l)

averages = [int(round(average(trial))) for trial in zip(*values)]

print ' '.join(headers)
print ' '.join(str(x) for x in averages)

Result:

Trial1 Trial2 Trial3
1 1 1
Mark Byers
Nice solution. But this might consume a lot of memory (the `values` list) if the file is really as large so that Excel can't open it.
AndiDog
This didn't print the results?
Robert A. Fettikowski
+2  A: 

A memory-friendly solution without using any modules:

with open("filename", "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)

    for line in f:
        # Skip empty lines
        if not line.strip():
            continue

        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1

    for index, summedRowValue in enumerate(sums):
        print columns[index], 1.0 * summedRowValue / numRows
AndiDog
No need to use `f.xreadlines()`. `for line in f:` is exactly equivalent, and works in both python 2.x and 3.x.
Joe Kington
@Joe Kington: Thanks, corrected that.
AndiDog
Ok I'm having issues with this because the lines are sepearted by a TAB and not a space. So I made the spacinjg between the ""s longer to look like a tab and it didn't work. I got this error instead
Robert A. Fettikowski
raceback (most recent call last): File "C:/avy5.py", line 13, in <module> sums[i] += int(values[i])ValueError: invalid literal for int() with base 10: '001\t001.0037\t001.1070\t001.1000\t2\t2\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t0\t0\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t1\t1\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0'
Robert A. Fettikowski
@Robert A. Fettikowski: A tab is *not* 4 spaces - a tab is a single character. Tabs are usually escaped as "\t" as you can clearly see. Just change the split string to "\t".
AndiDog