ansaurus

Question

How to find the average of multiple columns in a file using Python

Answer 1

+2 A:

You can use NumPy:

import numpy as np
from StringIO import StringIO

s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')

data = np.loadtxt(s, skiprows=1)  # skip header row
print data.mean(axis=0)  # column means

# OUTPUT: [ 0.8  1.   0.8]

Note that the first argument to loadtxt can be the name of your file instead of a file-like object.

ars 2010-09-11 23:04:58
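
The answer above is written for Python 2 (`print` statement, `StringIO` module). As a rough sketch of the same approach on Python 3, assuming NumPy is installed, the only changes needed should be `io.StringIO` and the `print()` function. The variable names `sample` and `means` are just illustrative:

```python
import io

import numpy as np

# Same sample data as in the answer, wrapped in a file-like object.
sample = io.StringIO("""\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
""")

data = np.loadtxt(sample, skiprows=1)  # skip the header row
means = data.mean(axis=0)              # axis=0 averages down each column
print(means)
```

For this sample the three column means are 0.8, 1.0 and 0.8; the exact way the array is printed depends on the NumPy version.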

Answer 2

+1 A:

You can use the built-in csv module:

import csv
csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]

def average(l):
    return float(sum(l)) / len(l)

averages = [int(round(average(trial))) for trial in zip(*values)]

print ' '.join(headers)
print ' '.join(str(x) for x in averages)

Result:

Trial1 Trial2 Trial3
1 1 1

Mark Byers 2010-09-11 23:08:12

Nice solution. But this might consume a lot of memory (the `values` list) if the file really is so large that Excel can't open it.

AndiDog 2010-09-11 23:13:18
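
One way to address the memory concern while still using the csv module is to keep only a running sum per column instead of the whole `values` list. The following is a minimal Python 3 sketch, not the answerer's code; the function name `column_averages` is hypothetical, and it assumes every data row has the same number of cells as the header and no trailing delimiter:

```python
import csv
import io


def column_averages(lines, delimiter=" "):
    """Return (headers, averages) while holding only one row in memory."""
    reader = csv.reader(lines, delimiter=delimiter)
    headers = next(reader)
    sums = [0.0] * len(headers)
    count = 0
    for row in reader:
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        for i, cell in enumerate(row):
            sums[i] += float(cell)
        count += 1
    if count == 0:
        raise ValueError("no data rows found")
    return headers, [total / count for total in sums]


# Small in-memory sample standing in for the real file.
sample = io.StringIO("Trial1 Trial2 Trial3\n1 0 1\n0 0 0\n0 2 0\n2 2 2\n1 1 1\n")
headers, averages = column_averages(sample)
print(" ".join(headers))
print(" ".join(str(a) for a in averages))
```

For a real file, the same function could be called as `column_averages(f)` inside `with open('input.txt', newline='') as f:`. Unlike the answer above, this sketch does not round the averages to integers.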

This didn't print the results?

Robert A. Fettikowski 2010-09-12 15:02:30

Answer 3

+2 A:

A memory-friendly solution without using any modules:

with open("filename", "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)

    for line in f:
        # Skip empty lines
        if not line.strip():
            continue

        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1

    # Each entry in sums is the total of one column
    for index, columnSum in enumerate(sums):
        print columns[index], 1.0 * columnSum / numRows

AndiDog 2010-09-11 23:10:21

No need to use `f.xreadlines()`. `for line in f:` is exactly equivalent, and works in both Python 2.x and 3.x.

Joe Kington 2010-09-12 01:46:28

@Joe Kington: Thanks, corrected that.

AndiDog 2010-09-12 09:03:25

OK, I'm having issues with this because the lines are separated by a TAB and not a space. So I made the spacing between the ""s longer to look like a tab and it didn't work. I got this error instead:

Robert A. Fettikowski 2010-09-12 15:20:07

Traceback (most recent call last):
  File "C:/avy5.py", line 13, in <module>
    sums[i] += int(values[i])
ValueError: invalid literal for int() with base 10: '001\t001.0037\t001.1070\t001.1000\t2\t2\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t0\t0\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t1\t1\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0'

Robert A. Fettikowski 2010-09-12 15:20:29

@Robert A. Fettikowski: A tab is *not* 4 spaces - a tab is a single character. Tabs are usually escaped as "\t" as you can clearly see. Just change the split string to "\t".

AndiDog 2010-09-12 15:57:33
