I need to find the average and standard deviation of a large amount of data in the format below. I tried using Excel, but there doesn't appear to be an easy way to transpose the columns. What am I missing in Excel, or should I just use Perl?

Input file format is:

0 123

0 234

0 456

1 657

1 234

1 543

I want the result to group the averages and standard deviations by the values in the first column:

0 AvgOfAllZeros StdDevOfAllZeros

1 AvgOfAllOnes StdDevOfAllOnes

+2  A: 

*cracks knuckles*

Using the Statistics::Descriptive CPAN module, you can get it with this:

use strict;
use warnings;
use Statistics::Descriptive;

my ($file) = @ARGV;

my @zeroes;
my @ones;

# Reading it in
open my $fh, '<', $file or die "unable to open '$file', $!";

while (my $line = <$fh>)
{
   chomp $line;
   my ($value, $number) = split(/\s+/, $line);   # regex literal; "\s+" in double quotes loses the backslash
   if ($value)
   {
      push @ones, $number;
   }
   else
   {
      push @zeroes, $number;
   }
}
close $fh or warn "Can't close fh! $!";

# Stat processing
my $stat_zeroes   = Statistics::Descriptive::Full->new();
my $stat_ones     = Statistics::Descriptive::Full->new();

$stat_zeroes->add_data(@zeroes);
$stat_ones->add_data(@ones);

print "0: ", $stat_zeroes->mean(), " ", $stat_zeroes->standard_deviation(), "\n",
      "1: ", $stat_ones->mean(), " ", $stat_ones->standard_deviation(), "\n";
Robert P
A: 

Have you tried using the AVERAGEIF function of Excel?

PezHead
+3  A: 

This is easy to do in R. If your data is in a file called foo, then this code will do the trick:

> data <- read.table("foo")
> cbind(avg=with(data, tapply(V2, V1, mean)),
+       stddev=with(data, tapply(V2, V1, sd)))
  avg   stddev
0 271 169.5553
1 478 218.8630
Jonathan Chang
A: 

If you are dealing with a large set of data, then you should consider PDL, the Perl Data Language.

See this related SO answer.

/I3az/

draegtun
+2  A: 

If you do this manually in Excel, you can copy the data and then paste it using the Paste Special menu option. There is a Transpose check box there.

If you need to do this more often, here is a Perl script. Its memory use is linear in the size of the output (one set of running totals per distinct key), so it is constant when there are only two groups:

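#!/usr/bin/perl
use strict;
use warnings;

# Running totals per key: sum, count, and sum of squares.
my (%sum, %count, %sumSq);

while (<>) {
 my ($x, $y) = split;
 next unless defined $y;   # skip blank or incomplete lines
 $sum{$x} += $y;
 $count{$x}++;
 $sumSq{$x} += $y * $y;
}

for my $i (sort keys %sum) {
 my $n = $count{$i};
 # Sample standard deviation (n - 1 divisor); undefined for a single value.
 my $stdev = $n > 1 ? sqrt(($sumSq{$i} - $sum{$i} * $sum{$i} / $n) / ($n - 1)) : 'NaN';
 print $i, " ", $sum{$i}/$n, " ", $stdev, "\n";
}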
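
For anyone checking the arithmetic: the standard deviation line in the script appears to be the usual one-pass form of the sample standard deviation (n - 1 divisor), computed from the running totals in %sum, %sumSq and %count. Below is a worked check on group 0 from the question. The symbols y_i, n_k and s_k are my own notation for illustration, not something the script defines.

```latex
\begin{align*}
s_k &= \sqrt{\frac{\sum_i y_i^2 - \frac{1}{n_k}\bigl(\sum_i y_i\bigr)^2}{n_k - 1}} \\[6pt]
% Group 0 from the question: y = 123, 234, 456, so n_0 = 3
\sum_i y_i &= 123 + 234 + 456 = 813, \qquad
\sum_i y_i^2 = 15129 + 54756 + 207936 = 277821 \\[6pt]
s_0 &= \sqrt{\frac{277821 - 813^2/3}{3 - 1}}
     = \sqrt{\frac{277821 - 220323}{2}}
     = \sqrt{28749} \approx 169.5553
\end{align*}
```

The mean for the same group is 813 / 3 = 271, so both numbers agree with the R output quoted in the other answer. One caveat: this sum-of-squares form can lose precision when the values are very large relative to their spread, so treat it as a sketch for data like the sample shown rather than a numerically robust method.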
agsamek
+1  A: 
Robert Mearns
A: 
af