ansaurus

Question

How can I efficiently count ranges that cover a given range in Perl?

Answer 1

+1 A:

Which part are you having problems with? What have you tried so far? It's a fairly simple task:

  * Iterate through the ranges
  * Foreach range, check if the test range is in it.
  * Profile and benchmark

It's fairly simple Perl:

 my $test = [ $n, $m ];
 my @contains = map { 
      $test->[0] >= $_->[0] 
         and 
      $test->[1] <= $_->[1]
      } @ranges

For the wrap-around ranges, the trick is to decompose those into separate ranges before you look at them. It's brute force work.

And, just as a social note, the rate of your question asking is pretty high: higher than I would expect of someone who is genuinely trying to solve their own problems. I think you're running to Stackoverflow too quickly and instead of getting help, you're really outsourcing your job. That's not really that nice. We don't get paid at all, and especially not paid to do the work assigned to you. This might be much different if you at least tried an implementation of your problem, but a lot of your questions seem to indicate that you didn't even try.

brian d foy 2010-09-24 19:26:04

Thanks for the reply brian. Re. your note, I'm sorry you feel that way. You can rest assure I'm not outsourcing anything. I'm in a setting where no programmer colleagues are available, so perhaps I use this forum more often than one would expect for things you might simply ask the guy next to you in the office. Anyway, I sure enjoy SO, learn a lot from all of you guys, and appreciate all your help.

David B 2010-09-24 20:31:19

As I said, show us what you've tried so far and I'd feel different.

brian d foy 2010-09-24 20:38:43

Out of respect for you, I updated the original post.

David B 2010-09-24 21:02:46

Well, out of respect for *all of us*.

brian d foy 2010-09-25 00:34:00

Answer 2

+1 A:

Pretty sure there's a better way to do this, but here's a starting point:

Preprocessing:

Create two lists, one sorted by the start value of the ranges, one by the end.

Once you get your range:

Use a binary search to match it's start in the start-sorted list
Use another binary search to match it's end in the end-sorted list
Find the ranges that appear in both lists (@start[0..$start_index] and @end[$end_index..$#end]).

zigdon 2010-09-24 19:26:32

Answer 3

+2 A:

Here's one approach to the brute force solution:

use strict;
use warnings;

my @ranges = ([12,80],[34,60],[34,9000],[76,743]);

# Split ranges between normal & wrapped:
my (@normal, @wrapped);

foreach my $r (@ranges) {
  if ($r->[0] <= $r->[1]) {
    push @normal, $r;
  } else {
    push @wrapped, $r;
  }
}

sub count_matches
{
  my ($start, $end, $max_length, $normal, $wrapped) = @_;

  if ($start <= $end) {
    # This is a normal range
    return (grep { $_->[0] <= $start and $_->[1] >= $end } @$normal)
        +  (grep { $end <= $_->[1] or $_->[0] <= $start } @$wrapped);
  } else {
    # This is a wrapped range
    return (grep { $_->[0] <= $start and $_->[1] >= $end } @$wrapped)
        # This part should probably be calculated only once:
        +  (grep { $_->[0] == 1 and $_->[1] == $max_length } @$normal);
  }
} # end count_matches

print count_matches(38,70, 9999, \@normal, \@wrapped)."\n";

cjm 2010-09-24 19:40:49

Please don't downvote without explanation.

cjm 2010-09-24 20:21:08

+1 Thanks cjm, this is very similar to what I did, and sure works, but I wonder if perhaps some pre-processing can make things faster and avoid iterating over all ranges each query (I used a loop, you us `grep` but I guess it's the same... or is it?). On second thought, perhaps I put the carriage before the horse - perhaps this is good enough. I should probably test the performance on real large scale data.

David B 2010-09-24 20:25:43

@cjm: The question was how to do better than the most naïve possible algorithm, not how to write it (which would amount to a “do my coding for me” request given that it’s a straight transliteration from prose to code).

Aristotle Pagaltzis 2010-09-24 20:35:11

@cjm your'e welcome to look at the update on the original post.

David B 2010-09-25 11:44:47

Answer 4

+2 A:

Do you have a lot of memory available?

my $max_length = 9999;
my @range = ( [12,80],[34,60],[34,9000] );

my @lookup;

for ( @range ) {
    my ( $start, $end ) = @$_;
    my @idx = $end >= $start ? $start .. $end : ( $start .. $max_length, 0 .. $end );
    for my $i ( @idx ) { $lookup[$i] .= pack "L", $end }
}

Now you have an array of packed number lists in @lookup where the packed list at each index contains the ends of all ranges which include that point. So to check how many ranges contain another range, you look up its starting index in the array, and then count the number of entries from the packed list at that index, which are smaller or equal that the ending index. This algorithm is O(n) with respect to the maximum number of ranges covering any one point (with the limit being the total number of ranges), with a very small overhead per iteration.

sub num_ranges_containing {
    my ( $start, $end ) = @_;

    return 0 unless defined $lookup[$start];

    # simple ranges can be contained in inverted ranges,
    # but inverted ranges can only be contained in inverted ranges
    my $counter = ( $start <= $end )
        ? sub { 0 + grep { $_ < $start or  $end <= $_ } }
        : sub { 0 + grep { $_ < $start and $end <= $_ } };

    return $counter->( unpack 'L*', $lookup[$start] );
}

Untested.

For extra neatness,

package RangeMap;

sub new {
    my $class = shift;
    my $max_length = shift;
    my @lookup;
    for ( @_ ) {
        my ( $start, $end ) = @$_;
        my @idx = $end >= $start ? $start .. $end : ( $start .. $max_length, 0 .. $end );
        for my $i ( @idx ) { $lookup[$i] .= pack 'L', $end }
    }
    bless \@lookup, $class;
}

sub num_ranges_containing {
    my $self = shift;
    my ( $start, $end ) = @_;

    return 0 unless defined $self->[$start];

    # simple ranges can be contained in inverted ranges,
    # but inverted ranges can only be contained in inverted ranges
    my $counter = ( $start <= $end )
        ? sub { 0 + grep { $_ < $start or  $end <= $_ } }
        : sub { 0 + grep { $_ < $start and $end <= $_ } };

    return $counter->( unpack 'L*', $self->[$start] );
}

package main;
my $rm = RangeMap->new( 9999, [12,80],[34,60],[34,9000] );

That way you can have any number of ranges.

Also untested.

Aristotle Pagaltzis 2010-09-24 19:45:25

+1 Thanks Aristotle Pagaltzis. I like the idea, but it might be a bit problematic, since `max_length` might be as high as 10 million, but the number of ranges is relatively low (some 30k), meaning we actually have sparse data. Also, I'm not sure this works correctly for wrap-around ranges.

David B 2010-09-24 20:28:09

Then worst case would be 10 million sub-arrays, which at 110 byte each should be just about 1GB. An empty 10-million-element array comes in at just 40MB. Add overhead for the contents of the sub-arrays… you’ll probably break the 32-bit memory limit but at least on a 64-bit machine you’re easily OK. :-) Esp. if most points are only covered by a few ranges. And you need correspondingly fewer sub-arrays if large parts of your space are not covered by any range.

Aristotle Pagaltzis 2010-09-24 20:47:59

Also – the `$end >= $start` bit implements wrap-around ranges.

Aristotle Pagaltzis 2010-09-24 20:51:17

I’ve switched it to a `pack`/`unpack`-based implementation (which was trivial – check the revision log). Devel::Size says that brings it down to about half a gigabyte for `$max_length = 10_000_000`, assuming that every single point is covered by two ranges.

Aristotle Pagaltzis 2010-09-24 21:07:22

I'm getting syntax error for `my @idx = $end >= $start ? $start .. $end ? $end .. $max_length, 0 .. $start;`. Did you mean `my @idx = $end >= $start ? $start .. $end : ($end .. $max_length, 0 .. $start);` If so, try `my $rm = RangeMap->new( 100, [ 80, 20 ] ); print $rm->num_ranges_containing( 80, 20 );`

David B 2010-09-24 21:09:43

Ah yes. The wrap-around case needed to be `($start .. $max_length, 0 .. $end)` of course. Fixed the answer.

Aristotle Pagaltzis 2010-09-24 21:20:04

One other correction: `return 0 unless defined $self->[$start];` should be added before the `return` statement in `num_ranges_containing()`.

David B 2010-09-25 11:27:40

@Aristotle Pagaltzis also see update to the original post.

David B 2010-09-25 11:44:16

NOTE: as described here (http://www.perlmonks.org/?node_id=862001), there's a bug in this code. A simple example would be to add a single range [5,1] with max_length of 10, then query for [7,8] -- returns 0 instead of 1.

David B 2010-09-25 21:50:13

I'm accepting this answer since the basic idea is great. HOWEVER, if anyone's going to use it, please note the update (4) in the OP. The code here has a bug and will not work correctly.

David B 2010-09-28 15:13:51

I’ve updated the code again – hopefully for the last time – to account for the problems you found.

Aristotle Pagaltzis 2010-10-04 16:55:11

Answer 5

+2 A:

There's an easier way than rolling your own ranges: use Number::Interval:

my @ranges     = (
    { START => 10,   END => 100 },
    { START => 30,   END => 90 },
    { START => 50, END => 80 },
    { START => 180,  END => 30 }
);
my @intervals;
for my $range ( @ranges ) {
  my $int = new Number::Interval( Min => $range->{START},
                                  Max => $range->{END} );
  push @intervals, $int;
}

Then you can use the intersection() method to find out if two ranges overlap:

my $num_overlap = 0;
my $checkinterval = new Number::Interval( Min => $min, Max => $max );
for my $int ( @intervals ) {
  $num_overlap++ if $checkinterval->intersection( $int );
}

I'm not quite sure what it will do with your "circular" ranges (they'd be classified as "inverted" intervals by Number::Interval) so you'd have to do a little experimentation.

But using a module really beats rolling your own range comparison methods.

Edit: Actually, looking at the documentation a little more closely, intersection() won't do what you want (in fact, it modifies one of the interval objects). You probably want to use contains() on the start and end values, and if both of those are contained within another interval, then that first interval is contained by the second.

Of course, you could update Number::Interval to add this functionality... :-)

CanSpice 2010-09-24 21:43:06

+1 thanks, that's good to know. There a Perl module for just about anything, isn't there?

David B 2010-09-24 21:54:46

There is, it's just a matter of finding it. :-)

CanSpice 2010-09-24 21:56:27

Answer 6

+1 A:

I think problems like this illustrate the maintainability benefits of breaking a job down into to small, easily grasped pieces (admittedly, with one cost being more lines of code).

The simplest idea is that of an ordinary, non-wrapping range.

package SimpleRange;

sub new {
    my $class = shift;
    my ($m, $n) = @_;
    bless { start => $m, end => $n }, $class;
}

sub start { shift->{start} }
sub end   { shift->{end}   }

sub covers {
    # Returns true if the range covers some other range.
    my ($self, $other) = @_;
    return 1 if $self->start <= $other->start
            and $self->end   >= $other->end;
    return;
}

Using that building block, we can create wrapping range class, which consists of either 1 or 2 simple ranges (2 if the range wraps around the edge of the universe). Like the class for simple ranges, this class defines a covers method. The logic in that method is fairly intuitive because we can use the covers method provided by our SimpleRange objects.

package WrappingRange;

sub new {
    my $class = shift;
    my ($raw_range, $MIN, $MAX) = @_;
    my ($m, $n) = @$raw_range;

    # Handle special case: a range that wraps all the way around.
    ($m, $n) = ($MIN, $MAX) if $m == $n + 1;

    my $self = {min => $MIN, max => $MAX};
    if ($m <= $n){
        $self->{top}  = SimpleRange->new($m, $n);
        $self->{wrap} = undef;
    }
    else {
        $self->{top}  = SimpleRange->new($m, $MAX);
        $self->{wrap} = SimpleRange->new($MIN, $n);    
    }
    bless $self, $class;
}

sub top  { shift->{top}  }
sub wrap { shift->{wrap} }
sub is_simple { ! shift->{wrap} }

sub simple_ranges {
    my $self = shift;
    return $self->is_simple ? $self->top : ($self->top, $self->wrap);
}

sub covers {
    my @selfR  = shift->simple_ranges;
    my @otherR = shift->simple_ranges;
    while (@selfR and @otherR){
        if ( $selfR[0]->covers($otherR[0]) ){
            shift @otherR;
        }
        else {
            shift @selfR;
        }
    }
    return if @otherR;
    return 1;
}

Run some tests:

package main;
main();

sub main {
    my ($MIN, $MAX) = (0, 200);

    my @raw_ranges = (
        [10, 100], [30, 90], [50, 80], [$MIN, $MAX],
        [180, 30], 
        [$MAX, $MAX - 1], [$MAX, $MAX - 2],
        [50, 49], [50, 48],
    );
    my @wrapping_ranges = map WrappingRange->new($_, $MIN, $MAX), @raw_ranges;

    my @tests = ( [1, 10], [30, 70], [160, 10], [190, 5] );
    for my $t (@tests){
        $t = WrappingRange->new($t, $MIN, $MAX);

        my @covers = map $_->covers($t) ? 1 : 0, @wrapping_ranges;

        my $n;
        $n += $_ for @covers;
        print "@covers  N=$n\n";
    }
}

Output:

0 0 0 1 1 1 1 1 1  N=6
1 1 0 1 0 1 1 1 0  N=6
0 0 0 1 0 1 0 1 1  N=4
0 0 0 1 1 1 0 1 1  N=5

FM 2010-09-25 15:35:25

+1 Thanks FM, this is a nice decomposition that makes things clearer. I actually wrote something similiar on some occasion in Java (`class SimpleRange` and `class Range` which implement the same interface, where `Range` is composed of up to two `SimpleRange`s. However, I'm not sure how it addresses the efficient querying issue.

David B 2010-09-25 19:19:49

ansaurus

tags:

views:

answers:

How can I efficiently count ranges that cover a given range in Perl?

related questions