EDIT: Link should work now, sorry for the trouble

I have a text file that looks like this:

Name, Test 1, Test 2, Test 3, Test 4, Test 5
Bob, 86, 83, 86, 80, 23
Alice, 38, 90, 100, 53, 32
Jill, 49, 53, 63, 43, 23

I am writing a program that, given this text file, will generate a Pearson's correlation coefficient table that looks like this, where the entry (x, y) is the correlation between person x and person y:

Name,Bob,Alice,Jill
Bob, 1, 0.567088412588577, 0.899798494392584
Alice, 0.567088412588577, 1, 0.812425393004088
Jill, 0.899798494392584, 0.812425393004088, 1

My program works, except that the data set I am feeding it has 82 columns and, more importantly, 54000 rows. When I run my program right now it is incredibly slow, and I get an out of memory error. Is there a way I can, first of all, remove any possibility of an out of memory error, and maybe make the program run a little more efficiently? The code is here: code.

Thanks for your help,
Jack

Edit: In case anyone else is trying to do large-scale computation, convert your data into HDF5 format. This is what I ended up doing to solve this issue.

+4  A: 

You're going to have to do at least 54000^2*82 calculations and comparisons. Of course it's going to take a lot of time. Are you holding everything in memory? That's going to be pretty large too. It will be slower, but it might use less memory if you keep the users in a database and calculate one user against all the others, then go on to the next and do it against all the others, instead of using one massive array or hash.
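A rough sketch of that "one row against all the others" idea follows; the pearson_r helper and the surrounding loop are illustrative assumptions, not the asker's actual code. Only two rows are ever held in memory, and each output row can be appended to the result file as soon as it is finished.

use strict;
use warnings;

# Pearson's r between two equal-length array references, using running sums
# so nothing beyond the two rows themselves needs to be kept in memory.
sub pearson_r {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }
    my $den = sqrt($n * $sxx - $sx ** 2) * sqrt($n * $syy - $sy ** 2);
    return $den ? ($n * $sxy - $sx * $sy) / $den : 0;
}

# Outer loop idea: for each person, fetch (or re-read) every other person's
# row, call pearson_r on the pair, and append the finished line to the
# output file immediately instead of accumulating the whole table.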

Paul Tomblin
What kind of resources would help me if I took the database approach?
Jack L.
@Jack, personally I'd just do it with ODBC and PostgreSQL or MySQL, because I've done a lot of that in Perl, but @singingfish's suggestion of Tie::File might be easier. Just tie it to an NDBM file.
Paul Tomblin
Actually, DBD::SQLite would be the quickest and easiest database solution, but for your purpose Tie::File will be the least complicated.
singingfish
Keep in mind that Perl also has a way to "tie" (that's the keyword, you can google it) a hash to a file or database backend. It feels like a normal hash, but only part of the data set is in memory at one time.
jhs
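A minimal illustration of that tie mechanism, using the core SDBM_File backend rather than NDBM; the file name and the stored value are made up:

use Fcntl;
use SDBM_File;

# %scores now lives in scores.dbm on disk; only the entries you touch are
# pulled into memory.
tie my %scores, 'SDBM_File', 'scores.dbm', O_RDWR | O_CREAT, 0666
    or die "Cannot tie scores.dbm: $!";
$scores{Bob} = '86,83,86,80,23';    # DBM values must be plain strings
untie %scores;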
I have run into a problem. So I was able to tie @correlations to a file, but the individual row arrays of @correlations become too big. Is there a way to do this, to tie changing variables to different files? tie @correlations{i}, 'Tie::File', "temp//tiefile$i";
Jack L.
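The syntax being reached for above would look roughly like the sketch below; @correlations and the row count are placeholders, and the temp/ path is taken from the comment itself. Each tied row then lives in its own backing file rather than in RAM:

use Tie::File;

my @correlations;          # the matrix rows, one tied array each
my $row_count = 54000;     # placeholder for the number of people

# One backing file per row of the correlation matrix; each tied array then
# behaves like a normal in-memory row while its contents stay on disk.
for my $i (0 .. $row_count - 1) {
    tie @{ $correlations[$i] }, 'Tie::File', "temp/tiefile$i"
        or die "Cannot tie temp/tiefile$i: $!";
}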
Okay, I have tied each individual row to a separate file, with a master file that lists the arrays to call. However, I have found that each file only holds a maximum of about 65 KB (somewhere around 3000 lines), but it is supposed to have 54000 lines. How do I increase the memory per file?
Jack L.
There isn't supposed to be any limit on the number of records in a tied file - that's the whole point of it.
Paul Tomblin
+4  A: 

Have a look at Tie::File to avoid the high memory usage of holding your entire input and output files in memory.
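A minimal Tie::File sketch, with a placeholder file name: each array element maps to one line of the file, and lines are only read from disk when they are accessed.

use Tie::File;

tie my @input, 'Tie::File', 'scores.txt'
    or die "Cannot tie scores.txt: $!";

# $input[1] is the second line of the file; splitting it gives the fields
# for one person without slurping the whole file.
my ($name, @scores) = split /,\s*/, $input[1];

untie @input;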

singingfish
+1  A: 

I don't know enough about what you are trying to do to give good advice about implementation, but you might look at Statistics::LSNoHistory; it claims to have a pearson_r method that returns Pearson's correlation coefficient r.

Chas. Owens
+2  A: 

Essentially, Paul Tomblin has given you the answer: it's a lot of calculation, so it will take a long time. It's a lot of data, so it will take a lot of memory.

However, there may be one gotcha: if you use Perl 5.10.0, the list assignments at the start of each method may be victims of a subtle performance bug in that version of Perl (cf. the perlmonks thread).

A couple of minor points:

The printout may actually slow down the program somewhat, depending on where it goes.

There is no need to reopen the output file for each line! Just do something like this:

# Open the output file once, before the loop, with a lexical filehandle.
open my $fh, ">", "file.txt" or die $!;

# Header row: one label per column of the correlation table.
print $fh "Name, ", join(", ", 1 .. scalar @{ $correlations[0] }), "\n";

my $rowno = 1;
foreach my $row (@correlations) {
  print $fh "$rowno, ", join(", ", @$row), "\n";
  $rowno++;
}
close $fh;

Finally, while I do use Perl whenever I can, with a program and data set such as you describe, it might be simplest to use C++ with its iostreams (which make parsing easy enough) for this task.

Note that all of this is just minor optimization. There's no algorithmic gain.

tsee
+4  A: 

Have you searched CPAN? My own search yielded another method, gsl_stats_correlation, for computing Pearson's correlation. This one is in Math::GSL::Statistics, which binds to the GNU Scientific Library.

gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n:

r = \frac{\mathrm{cov}(x, y)}{\hat\sigma_x \hat\sigma_y}
  = \frac{\frac{1}{n-1} \sum (x_i - \hat{x})(y_i - \hat{y})}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat{x})^2}\,\sqrt{\frac{1}{n-1} \sum (y_i - \hat{y})^2}}
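A hedged usage sketch based only on the signature quoted above, assuming the function can be imported by name; the scores are taken from the question's example data:

use Math::GSL::Statistics qw(gsl_stats_correlation);

# Two people's scores, copied from the question's sample input.
my @bob   = (86, 83, 86, 80, 23);
my @alice = (38, 90, 100, 53, 32);

# A stride of 1 means "use every element"; the last argument is the length.
my $r = gsl_stats_correlation(\@bob, 1, \@alice, 1, scalar @bob);
print "r(Bob, Alice) = $r\n";   # about 0.5671, matching the question's table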

snoopy
+3  A: 

You may want to look at PDL:

PDL ("Perl Data Language") gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing

.

daotoad
A: 

Further to the comment above about PDL, here is how to calculate the correlation table quite efficiently, even for very big datasets:

use PDL;          # provides random(); loaded explicitly here for clarity
use PDL::Stats;   # this useful module can be downloaded from CPAN
my $data = random(82, 5400);     # replace this with your own data
my $table = $data->corr_table(); # that's all, really

You might need to set $PDL::BIGPDL = 1; in the header of your script and make sure you run this on a machine with A LOT of memory. The computation itself is reasonably fast; an 82 x 5400 table took only a few seconds on my laptop.

DaGaMs