tags:

views:

180

answers:

7

Hi perl-warriors,

I have a hash with about 130,000 elements, and I am trying to check all combinations within that hash for something (130,000 x 130,000 combinations). My code looks like this:

 foreach $key1 (keys %CNV)
 {

  foreach $key2 (keys %CNV)
  {
         if (blablabla){do something that doesn't take as long}
  }

 }

As you might expect, this takes ages to run. Does anyone know a quicker way to do this? Many thanks in advance!!

-Abdel

Edit: Update on the blablabla.

Hey guys, thanks for all the feedback! Really appreciate it. I changed the foreach statement to:

for ($j=1;$j<=24;++$j)
 {
  foreach $key1 (keys %{$CNV{$j}})
  {

   foreach $key2 (keys %{$CNV{$j}})
   {
                        if (blablabla){do something}
                        }
                }
        }

The hash is now multidimensional:

$CNV{chromosome}{$start,$end}

I'll elaborate on what I'm exactly trying to do, as requested.

The blablabla is the following:

if  ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
   (($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
   (($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
   (($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1})) 
  )    

In short: The hash elements represent a specific part of the DNA (a so called "CNV", think of it like a gene for now), with a start and an end (which are integers representing their position on that particular chromosome, stored in hashes with the same keys: %CNVstart & %CNVend). I'm trying to check for every combination of CNVs whether they overlap. If there are two elements that overlap within a family (I mean a family of persons whose DNA I have and read in; there is also a for-statement inside the foreach-statement that let's the program check this for every family, which makes it last even longer), I check whether they also have the same "copy number" (which is stored in another hash with the same keys) and print out the result.

Thank you guys for your time!

A: 

Now I'm not a perl warrior, but based on the information given it is the same in any programming language; unless you sort the "hash" on the property you want to check and do a binary lookup you won't improve any performance in a lookup.

You can also if it is possible calculate which indexes in your hash would have the properties you are interested in, but as you have no information regarding such a possibility, this would perhaps not be a solution.

Patrick
The first paragraph makes no sense. A hash lookup should be faster than binary search. Also you can't sort a hash. You seem to confuse hashes and arrays.
musiKk
Hashes already have `O(1)` lookup. The problem is the OP is searching through all of them, and there are a lot.
Ether
@musikk, @Ether: right, so Patrick is saying figure out how to look up which keys might be of interest and then loop over those, instead of looping over all keys and checking each one to see if it is of interest
ysth
@musikk: you can sort hash keys. example: `foreach my $key (sort keys %hash){...}`, however the sort may be another performance impact
vol7ron
@musikk: Sorry for the confusment, I am quite aware that hashes have `O(1)` lookup and that you cannot sort a hashmap datatype, but I never stated otherwise. The OP wanted a way of checking each element in an arrayof values (even if it is a hash map) and unless the values are sorted, he needs to do a linear lookup, i.e. the keys or properties must be sorted to be able to do a binary search. @ysth got it right...
Patrick
+1  A: 

Maybe by using concurrency? But you would have to be carefull with what you do with a possitive match as to not get problems.

E.g. take $key1, split it in $key1A and §key1B. The create two separate threads, each containing "half of the loop".

I am not sure exactly how expensive it is to start new threads in Perl but if your positive action doesn't have to be synchronized I imagine that on matching hardware you would be faster.

Worth a try imho.

BernardMarx
This is a good idea! Thanks! Because the hash actually exists of multiple keys: CNV{$key1,$key2,$key3}But I don't know how to split up the key in my foreach statement into the three keys.. If you could help me with that, my problem is solved!
Abdel
So you're using multidimensional array emulation instead of real multidimensional support? Perhaps start there, use nested hashes ($CNV{$key1}L$key2}{$key}) instead of keys joined by \034 ($CNV{$key1,$key2,$key3}, read perldoc perldsc for more information.
MkV
Look into Parallel::ForkManager to split the task into separate processes (I wouldn't recommend threads) and some form of IPC (look into perldoc perlipc for possibilities) to manage getting the results back. Once you have proper multidimensional hashes, you should be able to split your foreach across the forks based on ranges.
MkV
+1  A: 

It sounds like Algorithm::Combinatorics may help you here. It's intended to provide "efficient generation of combinatorial sequences." From its docs:

Algorithm::Combinatorics is an efficient generator of combinatorial sequences. ... Iterators do not use recursion, nor stacks, and are written in C.

You could use its combinations sub-routine to provide all possible 2 key combos from your full set of keys.

On the other hand, Perl itself is written in C. So I honestly have no idea whether or not this would help at all.

Telemachus
A: 

define blah blah.

You could write it like this:

foreach $key1 (keys %CNV) {

if (blah1)
{
    foreach $key2 (keys %CNV)
    {
        if (blah2){do something that doesn't take as long}
    }
}

}

This pass should be O(2N) instead of O(N^2)

Jav
blabla is defined now in the openingpost.
Abdel
-1 for O(2N). It will still be O(N^2) if there is no dependency between N and the probability of blah1 being true. O(2N) = O(N) by the way.
Rotsor
A: 

The data structure in the question is not a good fit to the problem. Let's try it this way.

use Set::IntSpan::Fast::XS;
my @CNV;
for ([3, 7], [4, 8], [9, 11]) {
    my $set = Set::IntSpan::Fast::XS->new;
    $set->add_range(@{$_});
    push @CNV, $set;
}

# The comparison is commutative, so we can cut the total number in half.
for my $index1 (0 .. [email protected]) {
    for my $index2 (0 .. $index1) {
        next if $index1 == $index2; # skip if it's the same CNV
        say sprintf(
            'overlap of CNV %s, %s at indices %d, %d',
            $CNV[$index1]->as_string, $CNV[$index2]->as_string, $index1, $index2
        ) unless $CNV[$index1]->intersection($CNV[$index2])->is_empty;
    }
}

Output:

overlap of CNV 4-8, 3-7 at indices 1, 0

We will not get the overlap of 3-7, 4-8 because it is a duplicate.

There's also Bio::Range, but it doesn't look so efficient to me. You should definitely get in touch with the bio.perl.org/open-bio people; chances are what you're doing has been done already a million times before they already have the optimal algorithm all figured out.

daxim
Thanks for the answer! It looks like your way lets me skip comparing the same CNVs with eachother, which saves me 130,000 comparisons (relatively not a lot, but I incorporated it in my script). It also lets me skip comparing the same combination twice (which would save half of the comparisons). But does it really save time the way you do that, since the for-statement still has to go through all comparisons before he skips with the "next" and the "unless" statement?
Abdel
I am pretty sure this XS stuff is faster than a multi-dim hash, but you can [benchmark](http://p3rl.org/Benchmark) / [profile](http://p3rl.org/Devel::NYTProf) it with your real data.
daxim
Benchmark it and find out.
vol7ron
+1  A: 

I think I found the answer :-) Couldn't have done it without you guys though. I found a way to skip most of the comparisons I make:

for ($j=1;$j<=24;++$j)
 {
            foreach $key1 (sort keys %{$CNV{$j}})
            {


                foreach $key2 (sort keys %{$CNV{$j}})
                {

                    if (($CNVstart{$j}{$key2} < $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} < $CNVstart{$j}{$key1}))
                    {
                    next;
                    }


                    if (($CNVstart{$j}{$key2} > $CNVend{$j}{$key1}) && ($CNVend{$j}{$key2} > $CNVend{$j}{$key1}))
                    {
                    last;
                    }

        if  ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
           (($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
           (($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
           (($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1})) 
          )    {print some stuff out}

    }
    }
}

What I did is:

  • sort the keys of the hash for each foreach loop
  • do "next" if the CNVs with $key2 still haven't reached the CNV with $key1 (i.e. start2 and end2 are both smaller than start1)
  • and probably the most time-saving: end the foreach loop if the CNV with $key2 has overtaken the CNV with $key1 (i.e. start2 and end2 are both larger than end1)

Thanks a lot for your time and feedback guys!

Abdel
If I understand correctly, the keys in %{$CNV{$j}} just CNVs without any start/end information. How do you sort them then?
Rotsor
The CNV hash looks like this: $CNV{$chromosome}{$start,$end}. The end and start hashes have the same keys ($CNVstart{$chromosome}{$start,$end} and $CNVend{$chromosome}{$start,$end}) and contain the start integer ($start) and end integer ($end) respectively.
Abdel
A: 

Your optimisation with taking out the j into the outer loop was good, but the solution is still far from optimal.

Your problem does have a simple O(N+M) solution where N is the total number of CNVs and M is the number of overlaps.

The idea is: you walk through the length of DNA while keeping track of all the "current" CNVs. If you see a new CNV start, you add it to the list and you know that it overlaps with all the other CNVs currently in the list. If you see a CNV end, you just remove it from the list.

I am not a very good perl programmer, so treat the following as a pseudo-code (it's more like a mix of Java and C# :)):

// input:
Map<CNV, int> starts;
Map<CNV, int> ends;

// temporary:
List<Tuple<int, bool, CNV>> boundaries;
foreach(CNV cnv in starts)
    boundaries.add(starts[cnv], false, cnv);
foreach(CNV cnv in ends)
    boundaries.add(ends[cnv], true, cnv);

// Sort first by position, 
// then where position is equal we put "starts" first, "ends" last
boundaries = boundaries.OrderBy(t => t.first*2 + (t.second?1:0));

HashSet<CNV> current;

// main loop:
foreach((int position, bool isEnd, CNV cnv) in boundaries)
{
    if(isEnd)
        current.remove(cnv);
    else
    {
        foreach(CNV otherCnv in current)
            OVERLAP(cnv, otherCnv); // output of the algorithm
        current.add(cnv);
    }
}
Rotsor