ansaurus

Question

Wanted: a quicker way to check all combinations within a very large hash

Answer 1

A:

Now I'm not a perl warrior, but based on the information given it is the same in any programming language; unless you sort the "hash" on the property you want to check and do a binary lookup you won't improve any performance in a lookup.

You can also if it is possible calculate which indexes in your hash would have the properties you are interested in, but as you have no information regarding such a possibility, this would perhaps not be a solution.

Patrick 2010-07-21 10:58:07

The first paragraph makes no sense. A hash lookup should be faster than binary search. Also you can't sort a hash. You seem to confuse hashes and arrays.

musiKk 2010-07-21 11:12:34

Hashes already have `O(1)` lookup. The problem is the OP is searching through all of them, and there are a lot.

Ether 2010-07-21 14:50:21

@musikk, @Ether: right, so Patrick is saying figure out how to look up which keys might be of interest and then loop over those, instead of looping over all keys and checking each one to see if it is of interest

ysth 2010-07-21 14:56:09

@musikk: you can sort hash keys. example: `foreach my $key (sort keys %hash){...}`, however the sort may be another performance impact

vol7ron 2010-07-22 14:59:56

@musikk: Sorry for the confusment, I am quite aware that hashes have `O(1)` lookup and that you cannot sort a hashmap datatype, but I never stated otherwise. The OP wanted a way of checking each element in an arrayof values (even if it is a hash map) and unless the values are sorted, he needs to do a linear lookup, i.e. the keys or properties must be sorted to be able to do a binary search. @ysth got it right...

Patrick 2010-07-23 12:39:00

Answer 2

+1 A:

Maybe by using concurrency? But you would have to be carefull with what you do with a possitive match as to not get problems.

E.g. take $key1, split it in $key1A and §key1B. The create two separate threads, each containing "half of the loop".

I am not sure exactly how expensive it is to start new threads in Perl but if your positive action doesn't have to be synchronized I imagine that on matching hardware you would be faster.

Worth a try imho.

BernardMarx 2010-07-21 11:11:24

This is a good idea! Thanks! Because the hash actually exists of multiple keys: CNV{$key1,$key2,$key3}But I don't know how to split up the key in my foreach statement into the three keys.. If you could help me with that, my problem is solved!

Abdel 2010-07-21 11:41:13

So you're using multidimensional array emulation instead of real multidimensional support? Perhaps start there, use nested hashes ($CNV{$key1}L$key2}{$key}) instead of keys joined by \034 ($CNV{$key1,$key2,$key3}, read perldoc perldsc for more information.

MkV 2010-07-21 13:46:31

Look into Parallel::ForkManager to split the task into separate processes (I wouldn't recommend threads) and some form of IPC (look into perldoc perlipc for possibilities) to manage getting the results back. Once you have proper multidimensional hashes, you should be able to split your foreach across the forks based on ranges.

MkV 2010-07-21 14:11:49

Answer 3

+1 A:

It sounds like Algorithm::Combinatorics may help you here. It's intended to provide "efficient generation of combinatorial sequences." From its docs:

Algorithm::Combinatorics is an efficient generator of combinatorial sequences. ... Iterators do not use recursion, nor stacks, and are written in C.

You could use its combinations sub-routine to provide all possible 2 key combos from your full set of keys.

On the other hand, Perl itself is written in C. So I honestly have no idea whether or not this would help at all.

Telemachus 2010-07-21 11:18:38

Answer 4

A:

define blah blah.

You could write it like this:

foreach $key1 (keys %CNV) {

if (blah1)
{
    foreach $key2 (keys %CNV)
    {
        if (blah2){do something that doesn't take as long}
    }
}

}

This pass should be O(2N) instead of O(N^2)

Jav 2010-07-21 14:55:50

blabla is defined now in the openingpost.

Abdel 2010-07-22 08:32:08

-1 for O(2N). It will still be O(N^2) if there is no dependency between N and the probability of blah1 being true. O(2N) = O(N) by the way.

Rotsor 2010-07-23 13:02:50

Answer 5

A:

The data structure in the question is not a good fit to the problem. Let's try it this way.

use Set::IntSpan::Fast::XS;
my @CNV;
for ([3, 7], [4, 8], [9, 11]) {
    my $set = Set::IntSpan::Fast::XS->new;
    $set->add_range(@{$_});
    push @CNV, $set;
}

# The comparison is commutative, so we can cut the total number in half.
for my $index1 (0 .. -1+@CNV) {
    for my $index2 (0 .. $index1) {
        next if $index1 == $index2; # skip if it's the same CNV
        say sprintf(
            'overlap of CNV %s, %s at indices %d, %d',
            $CNV[$index1]->as_string, $CNV[$index2]->as_string, $index1, $index2
        ) unless $CNV[$index1]->intersection($CNV[$index2])->is_empty;
    }
}

Output:

overlap of CNV 4-8, 3-7 at indices 1, 0

We will not get the overlap of 3-7, 4-8 because it is a duplicate.

There's also Bio::Range, but it doesn't look so efficient to me. You should definitely get in touch with the bio.perl.org/open-bio people; chances are what you're doing has been done already a million times before they already have the optimal algorithm all figured out.

daxim 2010-07-22 12:42:45

Thanks for the answer! It looks like your way lets me skip comparing the same CNVs with eachother, which saves me 130,000 comparisons (relatively not a lot, but I incorporated it in my script). It also lets me skip comparing the same combination twice (which would save half of the comparisons). But does it really save time the way you do that, since the for-statement still has to go through all comparisons before he skips with the "next" and the "unless" statement?

Abdel 2010-07-22 13:49:10

I am pretty sure this XS stuff is faster than a multi-dim hash, but you can [benchmark](http://p3rl.org/Benchmark) / [profile](http://p3rl.org/Devel::NYTProf) it with your real data.

daxim 2010-07-22 14:50:52

Benchmark it and find out.

vol7ron 2010-07-22 14:57:30

Answer 6

+1 A:

I think I found the answer :-) Couldn't have done it without you guys though. I found a way to skip most of the comparisons I make:

for ($j=1;$j<=24;++$j)
 {
            foreach $key1 (sort keys %{$CNV{$j}})
            {


                foreach $key2 (sort keys %{$CNV{$j}})
                {

                    if (($CNVstart{$j}{$key2} < $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} < $CNVstart{$j}{$key1}))
                    {
                    next;
                    }


                    if (($CNVstart{$j}{$key2} > $CNVend{$j}{$key1}) && ($CNVend{$j}{$key2} > $CNVend{$j}{$key1}))
                    {
                    last;
                    }

        if  ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
           (($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
           (($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
           (($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1})) 
          )    {print some stuff out}

    }
    }
}

What I did is:

sort the keys of the hash for each foreach loop
do "next" if the CNVs with $key2 still haven't reached the CNV with $key1 (i.e. start2 and end2 are both smaller than start1)
and probably the most time-saving: end the foreach loop if the CNV with $key2 has overtaken the CNV with $key1 (i.e. start2 and end2 are both larger than end1)

Thanks a lot for your time and feedback guys!

Abdel 2010-07-23 08:43:53

If I understand correctly, the keys in %{$CNV{$j}} just CNVs without any start/end information. How do you sort them then?

Rotsor 2010-07-23 12:39:47

The CNV hash looks like this: $CNV{$chromosome}{$start,$end}. The end and start hashes have the same keys ($CNVstart{$chromosome}{$start,$end} and $CNVend{$chromosome}{$start,$end}) and contain the start integer ($start) and end integer ($end) respectively.

Abdel 2010-07-23 13:08:41

Answer 7

A:

Your optimisation with taking out the j into the outer loop was good, but the solution is still far from optimal.

Your problem does have a simple O(N+M) solution where N is the total number of CNVs and M is the number of overlaps.

The idea is: you walk through the length of DNA while keeping track of all the "current" CNVs. If you see a new CNV start, you add it to the list and you know that it overlaps with all the other CNVs currently in the list. If you see a CNV end, you just remove it from the list.

I am not a very good perl programmer, so treat the following as a pseudo-code (it's more like a mix of Java and C# :)):

// input:
Map<CNV, int> starts;
Map<CNV, int> ends;

// temporary:
List<Tuple<int, bool, CNV>> boundaries;
foreach(CNV cnv in starts)
    boundaries.add(starts[cnv], false, cnv);
foreach(CNV cnv in ends)
    boundaries.add(ends[cnv], true, cnv);

// Sort first by position, 
// then where position is equal we put "starts" first, "ends" last
boundaries = boundaries.OrderBy(t => t.first*2 + (t.second?1:0));

HashSet<CNV> current;

// main loop:
foreach((int position, bool isEnd, CNV cnv) in boundaries)
{
    if(isEnd)
        current.remove(cnv);
    else
    {
        foreach(CNV otherCnv in current)
            OVERLAP(cnv, otherCnv); // output of the algorithm
        current.add(cnv);
    }
}

Rotsor 2010-07-23 12:35:42

ansaurus

tags:

views:

answers:

Wanted: a quicker way to check all combinations within a very large hash

related questions