I have a database of some 30k ranges, each is given as a pair of start and end points:
[12,80],[34,60],[34,9000],[76,743],...
I would like to write a Perl subroutine that takes a range (not from the database) and returns the number of ranges in the database which fully 'include' the given range.
For example, if we had only those 4 ranges in the database and the query range is [38,70], the subroutine should return 2, since the first and third ranges both fully contain the query range.
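As a baseline, the example above can be checked with a plain linear scan. This is only a sketch for ordinary (non-wrapping) ranges, and `count_containing` is a name made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: linear scan over non-wrapping [start, end] pairs.
sub count_containing {
    my ( $q_start, $q_end, @ranges ) = @_;

    # grep in scalar context returns the number of matching ranges
    return scalar grep { $_->[0] <= $q_start and $q_end <= $_->[1] } @ranges;
}

my @db = ( [ 12, 80 ], [ 34, 60 ], [ 34, 9000 ], [ 76, 743 ] );
print count_containing( 38, 70, @db ), "\n";    # prints 2
```

Every query walks the whole list, which is exactly the cost I would like to avoid.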
The problem: I wish to make the queries as "cheap" as possible. I don't mind doing lots of pre-processing if it helps.
A couple of notes:
I used the word "database" loosely; I don't mean an actual database (e.g. SQL), it's just a long list of ranges.
My world is circular... There is a given max_length (e.g. 9999) and ranges like [8541,6] are legal (you can think of it as a single range that is the union of [8541,9999] and [1,6]).
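One way to picture a wrapped range is to split it into its two linear pieces. The helper below is only a sketch: `split_range` is a hypothetical name, and it assumes positions run from 1 to max_length, as in the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the linear segments that make up a range
# on a circle whose positions run from 1 to $max_length (an assumption).
sub split_range {
    my ( $start, $end, $max_length ) = @_;
    return ( [ $start, $end ] ) if $start <= $end;    # ordinary range
    return ( [ $start, $max_length ], [ 1, $end ] );  # wraps past max_length
}

my @segments = split_range( 8541, 6, 9999 );
print join( ' and ', map { sprintf '[%d,%d]', @$_ } @segments ), "\n";
# prints [8541,9999] and [1,6]
```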
Thanks, Dave
UPDATE This was my original code:
    use strict;
    use warnings;

    my $max_length = 200;
    my @ranges = (
        { START => 10,  END => 100 },
        { START => 30,  END => 90 },
        { START => 50,  END => 80 },
        { START => 180, END => 30 },
    );

    sub n_covering_ranges($) {
        my ($query_h) = shift;
        my $start = $query_h->{START};
        my $end   = $query_h->{END};
        my $count = 0;
        if ( $end >= $start ) {
            # query range is normal
            foreach my $range_h (@ranges) {
                # a wrapped range (END < START) contains a normal query if the
                # query starts at or after its START, or ends at or before its END
                if (   ( $start >= $range_h->{START} and $end <= $range_h->{END} )
                    or ( $range_h->{END} < $range_h->{START} and $range_h->{START} <= $start )
                    or ( $range_h->{END} < $range_h->{START} and $range_h->{END} >= $end )
                  )
                {
                    $count++;
                }
            }
        }
        else {
            # query range is hanging over edge
            # only other hanging over edges can contain it
            foreach my $range_h (@ranges) {
                if (    $range_h->{END} < $range_h->{START}
                    and $start >= $range_h->{START}
                    and $end <= $range_h->{END} )
                {
                    $count++;
                }
            }
        }
        return $count;
    }

    print n_covering_ranges( { START => 1,  END => 10 } ), "\n";    # prints 1
    print n_covering_ranges( { START => 30, END => 70 } ), "\n";    # prints 2
and, yes, I know the ifs are ugly and can be made much nicer and more efficient.
UPDATE 2 - BENCHMARKING SUGGESTED SOLUTIONS
I've done some benchmarking of the two solutions proposed so far: the naive one, suggested by cjm, which is similar to my original solution, and the memory-demanding one, suggested by Aristotle Pagaltzis. Thanks again to both of you!
To compare the two, I created the following packages which use the same interface:
    use strict;
    use warnings;

    package RangeMap;

    sub new {
        my $class      = shift;
        my $max_length = shift;
        my @lookup;
        for (@_) {
            my ( $start, $end ) = @$_;
            my @idx
                = $end >= $start
                ? $start .. $end
                : ( $start .. $max_length, 0 .. $end );
            for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
        }
        bless \@lookup, $class;
    }

    sub num_ranges_containing {
        my $self = shift;
        my ( $start, $end ) = @_;
        return 0 unless defined $self->[$start];
        return 0 + grep { $end <= $_ } unpack 'L*', $self->[$start];
    }

    1;
and:
    use strict;
    use warnings;

    package cjm;

    sub new {
        my $class      = shift;
        my $max_length = shift;
        my $self       = {};
        bless $self, $class;
        $self->{MAX_LENGTH} = $max_length;
        my @normal  = ();
        my @wrapped = ();
        foreach my $r (@_) {
            if ( $r->[0] <= $r->[1] ) {
                push @normal, $r;
            }
            else {
                push @wrapped, $r;
            }
        }
        $self->{NORMAL}  = \@normal;
        $self->{WRAPPED} = \@wrapped;
        return $self;
    }

    sub num_ranges_containing {
        my $self = shift;
        my ( $start, $end ) = @_;
        if ( $start <= $end ) {
            # This is a normal range
            return ( grep { $_->[0] <= $start and $_->[1] >= $end }
                    @{ $self->{NORMAL} } )
                + ( grep { $end <= $_->[1] or $_->[0] <= $start }
                    @{ $self->{WRAPPED} } );
        }
        else {
            # This is a wrapped range
            return ( grep { $_->[0] <= $start and $_->[1] >= $end }
                    @{ $self->{WRAPPED} } )
                # This part should probably be calculated only once:
                + ( grep { $_->[0] == 1 and $_->[1] == $self->{MAX_LENGTH} }
                    @{ $self->{NORMAL} } );
        }
    }

    1;
I then used some real data: $max_length=3150000, about 17000 ranges with an average size of a few thousand, and finally queried the objects with some 10000 queries. I timed the creation of the object (adding all the ranges) and the querying. The results:
cjm creation done in 0.0082 seconds
cjm querying done in 21.209857 seconds
RangeMap creation done in 45.840982 seconds
RangeMap querying done in 0.04941 seconds
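For reference, timings like these can be collected with Time::HiRes. The sketch below is not the actual benchmark script; the `timed` helper and the dummy workload are placeholders made up for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code block, report the wall-clock time it took,
# and pass its return values through.
sub timed {
    my ( $label, $code ) = @_;
    my $t0     = time();
    my @result = $code->();
    printf "%s done in %f seconds\n", $label, time() - $t0;
    return @result;
}

# Placeholder workload standing in for object creation or querying.
my ($sum) = timed( 'dummy work', sub { my $s = 0; $s += $_ for 1 .. 100_000; $s } );
```

In the real benchmark the code block would wrap the call to `new` or the loop over all the queries.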
Congratulations Aristotle Pagaltzis! Your implementation is super-fast!
To use this solution, however, I would obviously like to do the pre-processing (creation) of the object only once. Can I store (nstore) this object after its creation? I've never done this before. And how should I retrieve it? Anything special? Hopefully the retrieval will be fast so it won't affect the overall performance of this great data structure.
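A minimal sketch of the round trip I have in mind, assuming Storable's `nstore` and `retrieve`. The tiny blessed array is only a stand-in for a real RangeMap object, and the temporary file comes from File::Temp:

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Stand-in for a real lookup object: a small blessed array of packed range ends.
my $object = bless [ undef, pack( 'L', 5 ), pack( 'L', 5 ) ], 'RangeMap';

my ( $fh, $file ) = tempfile( UNLINK => 1 );
close $fh;

nstore( $object, $file );      # nstore writes in network byte order
my $copy = retrieve($file);    # comes back blessed into the same class

print ref($copy), " with ", scalar(@$copy), " slots\n";    # prints RangeMap with 3 slots
```

Whether this stays fast enough for a lookup array with millions of slots is something I would still have to measure.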
UPDATE 3
I tried a simple nstore and retrieve for the RangeMap object. This seems to work fine. The only problem is that the resulting file is around 1GB, and I will have some 1000 such files. I could live with a TB of storage for this, but I wonder if there's any way to store it more efficiently without significantly affecting retrieval performance. Also see here: http://www.perlmonks.org/?node_id=861961.
UPDATE 4 - RangeMap bug
Unfortunately, RangeMap has a bug. Thanks to BrowserUK from PerlMonks for pointing that out. For example, create an object with $max_length=10 and a single range [6,2]. Then query for [7,8]. The answer should be 1, not 0. RangeMap returns 0 because it only tests the query end against the stored range end, and 8 <= 2 is false.
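The expected answer can be confirmed by brute force, expanding both ranges into the positions they cover. The `positions` helper is hypothetical and assumes positions run from 1 to max_length:

```perl
use strict;
use warnings;

# Hypothetical helper: list every position covered by [start, end]
# on a circle with positions 1 .. $max_length (an assumption).
sub positions {
    my ( $start, $end, $max_length ) = @_;
    return $start <= $end
        ? ( $start .. $end )
        : ( $start .. $max_length, 1 .. $end );
}

my $max_length = 10;
my %covered    = map { $_ => 1 } positions( 6, 2, $max_length );

# The query is contained if none of its positions falls outside the range.
my @missing   = grep { !$covered{$_} } positions( 7, 8, $max_length );
my $contained = @missing ? 0 : 1;
print "$contained\n";    # prints 1
```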
I think this updated package should do the work:
    use strict;
    use warnings;

    package FastRanges;

    sub new($$$) {
        my $class      = shift;
        my $max_length = shift;
        my $ranges_a   = shift;
        my @lookup;
        for ( @{$ranges_a} ) {
            my ( $start, $end ) = @$_;
            my @idx
                = $end >= $start
                ? $start .. $end
                : ( $start .. $max_length, 1 .. $end );
            for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
        }
        bless \@lookup, $class;
    }

    sub num_ranges_containing($$$) {
        my $self = shift;
        my ( $start, $end ) = @_;    # query range coordinates

        # no ranges overlap the start position of the query
        return 0 unless defined $self->[$start];

        if ( $end >= $start ) {
            # query range is simple
            # any inverted range in $self->[$start] whose end is before $start
            # must contain it, and so does any range which ends at or after $end
            return 0 + grep { $_ < $start or $end <= $_ } unpack 'L*',
                $self->[$start];
        }
        else {
            # query range is inverted
            # only inverted ranges in $self->[$start] which also end
            # at or after $end contain it. simple ranges can't contain
            # the query range
            return 0 + grep { $_ < $start and $end <= $_ } unpack 'L*',
                $self->[$start];
        }
    }

    1;
Your comments will be welcomed.