EDIT: Link should work now, sorry for the trouble

I have a text file that looks like this:

Name, Test 1, Test 2, Test 3, Test 4, Test 5
Bob, 86, 83, 86, 80, 23
Alice, 38, 90, 100, 53, 32
Jill, 49, 53, 63, 43, 23

I am writing a program that, given this text file, will generate a Pearson's correlation coefficient table that looks like this, where the entry (x, y) is the correlation between person x and person y:

Name,Bob,Alice,Jill
Bob, 1, 0.567088412588577, 0.899798494392584
Alice, 0.567088412588577, 1, 0.812425393004088
Jill, 0.899798494392584, 0.812425393004088, 1

My program works, except that the data set I am feeding it has 82 columns and, more importantly, 54000 rows. When I run my program right now it is incredibly slow, and I get an out of memory error. Is there a way I can, first of all, remove any possibility of an out of memory error, and maybe make the program run a little more efficiently? The code is here: code.

Thanks for your help,
Jack

Edit: In case anyone else is trying to do large-scale computation, convert your data into HDF5 format. This is what I ended up doing to solve this issue.

+4  A: 

You're going to have to do at least 54000^2*82 calculations and comparisons. Of course it's going to take a lot of time. Are you holding everything in memory? That's going to be pretty large too. It will be slower, but it might use less memory if you keep the users in a database and calculate one user against all the others, then go on to the next and do it against all the others, instead of using one massive array or hash.
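A rough sketch of that "one row against all the others" idea follows; the pearson_r helper and the surrounding loop are illustrative assumptions, not the asker's actual code. Only two rows are ever held in memory, and each output row can be appended to the result file as soon as it is finished.

use strict;
use warnings;

# Pearson's r between two equal-length array references, using running sums
# so nothing beyond the two rows themselves needs to be kept in memory.
sub pearson_r {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }
    my $den = sqrt($n * $sxx - $sx ** 2) * sqrt($n * $syy - $sy ** 2);
    return $den ? ($n * $sxy - $sx * $sy) / $den : 0;
}

# Outer loop idea: for each person, fetch (or re-read) every other person's
# row, call pearson_r on the pair, and append the finished line to the
# output file immediately instead of accumulating the whole table.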

Paul Tomblin
What kind of resources would help me if I took the database approach?
Jack L.
@Jack, personally I'd just do it with ODBC and PostgreSQL or MySQL, because I've done a lot of that in Perl, but @singingfish's suggestion of Tie::File might be easier. Just tie it to an NDBM file.
Paul Tomblin
Actually, DBD::SQLite would be the quickest and easiest database solution, but for your purpose Tie::File will be the least complicated.
singingfish
Keep in mind that Perl also has a way to "tie" (that's the keyword, you can google it) a hash to a file or database backend. It feels like a normal hash, but only part of the data set is in memory at one time.
jhs
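A minimal illustration of that tie mechanism, using the core SDBM_File backend rather than NDBM; the file name and the stored value are made up:

use Fcntl;
use SDBM_File;

# %scores now lives in scores.dbm on disk; only the entries you touch are
# pulled into memory.
tie my %scores, 'SDBM_File', 'scores.dbm', O_RDWR | O_CREAT, 0666
    or die "Cannot tie scores.dbm: $!";
$scores{Bob} = '86,83,86,80,23';    # DBM values must be plain strings
untie %scores;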
I have run into a problem. So I was able to tie @correlations to a file, but the individual row arrays of @correlations become too big. Is there a way to do this, to tie changing variables to different files? tie @correlations{i}, 'Tie::File', "temp//tiefile$i";
Jack L.
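The syntax being reached for above would look roughly like the sketch below; @correlations and the row count are placeholders, and the temp/ path is taken from the comment itself. Each tied row then lives in its own backing file rather than in RAM:

use Tie::File;

my @correlations;          # the matrix rows, one tied array each
my $row_count = 54000;     # placeholder for the number of people

# One backing file per row of the correlation matrix; each tied array then
# behaves like a normal in-memory row while its contents stay on disk.
for my $i (0 .. $row_count - 1) {
    tie @{ $correlations[$i] }, 'Tie::File', "temp/tiefile$i"
        or die "Cannot tie temp/tiefile$i: $!";
}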
Okay, I have tied each individual row to a separate file, with a master file that lists the arrays to call. However, I have found that each file only holds a maximum of about 65 KB (somewhere around 3000 lines), but it is supposed to have 54000 lines. How do I increase the memory per file?
Jack L.
There isn't supposed to be any limit on the number of records in a tied file - that's the whole point of it.
Paul Tomblin
+4  A: 

Have a look at Tie::File to avoid the high memory usage of holding your entire input and output files in memory.
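A minimal Tie::File sketch, with a placeholder file name: each array element maps to one line of the file, and lines are only read from disk when they are accessed.

use Tie::File;

tie my @input, 'Tie::File', 'scores.txt'
    or die "Cannot tie scores.txt: $!";

# $input[1] is the second line of the file; splitting it gives the fields
# for one person without slurping the whole file.
my ($name, @scores) = split /,\s*/, $input[1];

untie @input;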

singingfish
+1  A: 

I don't know enough about what you are trying to do to give good advice about implementation, but you might look at Statistics::LSNoHistory; it claims to have a pearson_r method that returns Pearson's correlation coefficient r.

Chas. Owens
+2  A: 

Essentially, Paul Tomblin has given you the answer: it's a lot of calculation, so it will take a long time. It's a lot of data, so it will take a lot of memory.

However, there may be one gotcha: if you use Perl 5.10.0, the list assignments at the start of each method may be victims of a subtle performance bug in that version of Perl (cf. the perlmonks thread).

A couple of minor points:

The printout may actually slow down the program somewhat, depending on where it goes.

There is no need to reopen the output file for each line! Just do something like this:

# Open the output file once, before the loop, with a lexical filehandle.
open my $fh, ">", "file.txt" or die $!;

# Header row: one label per column of the correlation table.
print $fh "Name, ", join(", ", 1 .. scalar @{ $correlations[0] }), "\n";

my $rowno = 1;
foreach my $row (@correlations) {
  print $fh "$rowno, ", join(", ", @$row), "\n";
  $rowno++;
}
close $fh;

Finally, while I do use Perl whenever I can, with a program and data set such as you describe, it might be simplest to use C++ with its iostreams (which make parsing easy enough) for this task.

Note that all of this is just minor optimization. There's no algorithmic gain.

tsee
+4  A: 

Have you searched CPAN? My own search yielded another method, gsl_stats_correlation, for computing Pearson's correlation. This one is in Math::GSL::Statistics, which binds to the GNU Scientific Library.

gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n:

r = \frac{\mathrm{cov}(x, y)}{\hat\sigma_x \hat\sigma_y}
  = \frac{\frac{1}{n-1} \sum (x_i - \hat{x})(y_i - \hat{y})}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat{x})^2}\,\sqrt{\frac{1}{n-1} \sum (y_i - \hat{y})^2}}
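A hedged usage sketch based only on the signature quoted above, assuming the function can be imported by name; the scores are taken from the question's example data:

use Math::GSL::Statistics qw(gsl_stats_correlation);

# Two people's scores, copied from the question's sample input.
my @bob   = (86, 83, 86, 80, 23);
my @alice = (38, 90, 100, 53, 32);

# A stride of 1 means "use every element"; the last argument is the length.
my $r = gsl_stats_correlation(\@bob, 1, \@alice, 1, scalar @bob);
print "r(Bob, Alice) = $r\n";   # about 0.5671, matching the question's table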

snoopy
+3  A: 

You may want to look at PDL:

PDL ("Perl Data Language") gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing

.

daotoad
A: 

Further to the comment above about PDL, here is how to calculate the correlation table quite efficiently, even for very big datasets:

use PDL;          # provides random(); loaded explicitly here for clarity
use PDL::Stats;   # this useful module can be downloaded from CPAN
my $data = random(82, 5400);     # replace this with your own data
my $table = $data->corr_table(); # that's all, really

You might need to set $PDL::BIGPDL = 1; in the header of your script and make sure you run this on a machine with A LOT of memory. The computation itself is reasonably fast; an 82 x 5400 table took only a few seconds on my laptop.

DaGaMs