answers:

6

I have to sort a number of integers with values between 30,000,000 and 350,000,000. There will be between 0 and 65,535 integers, with the average count being around 20,000. RAM usage is irrelevant; only speed matters.

Later on I will also have to split them into groups, with a divide placed wherever the gap between two consecutive values is >65,535, which is what I need the algorithm for.

If it makes any difference, the algorithm will be used in a Perl script.

Edit: After thinking it over and reading the answers, I've come to realize something: I don't actually care about the data itself. Since I really only want to find the start and end values of groups with small gaps, the sorting only needs to create buckets and can discard the data.

Edit2: After some testing and trying out the answers provided, the fastest way I found was this:

use feature 'say';          # for say()

my $gap = 65_535;           # split wherever the gap exceeds this

my @sort = sort { $a <=> $b } @item_offsets;
my @buckets;
my $start = shift @sort;
push @buckets, [ $start, $start ];
for my $item (@sort) {
    if ( $item < $buckets[-1][1] + $gap ) {
        $buckets[-1][1] = $item;            # extend the current bucket
    }
    else {
        push @buckets, [ $item, $item ];    # start a new bucket
    }
}
say $#buckets;              # last bucket index, i.e. bucket count - 1
+12  A: 

I'd use a radix sort, since you need to group the output.
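A minimal LSD (least-significant-byte) radix sort sketch in Perl, assuming plain 32-bit non-negative integers (the function name is mine, not from any answer here):

```perl
use strict;
use warnings;

# Sketch: LSD radix sort, one byte per pass, four passes for 32-bit values.
sub radix_sort {
    my @nums = @_;
    for my $shift (0, 8, 16, 24) {
        my @buckets = map { [] } 0 .. 255;
        push @{ $buckets[ ($_ >> $shift) & 0xFF ] }, $_ for @nums;
        @nums = map { @$_ } @buckets;    # concatenate buckets in order
    }
    return @nums;
}
```

Four passes over at most 65,535 integers, regardless of how wide the value range is.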

FlySwat
A radix sort module can be found on CPAN @ http://search.cpan.org/dist/Sort-Radix/
draegtun
+5  A: 

I was just going to say radix sort (http://en.wikipedia.org/wiki/Radix_sort), but that could be a bit above what you were looking to implement. Introsort (http://en.wikipedia.org/wiki/Introsort) is generally the accepted general-purpose sorting solution; it's a variation of quicksort that switches to heapsort when it reaches smaller sets, as heapsort is faster than quicksort on smaller sets.

Chris Marisic
A: 

If you use the number as an index to an array, and then increment the count of that position, you've "grouped" them, and done it in one pass.

in pseudocode:

while (more numbers)
  sorted[unsorted[number]]++
  number++

If the range is known ahead of time, you can offset the values to shrink the index range (for example, subtract 30,000,000 from each value to bring it down to a zero-based range).
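The idea above, sketched in Perl with the question's minimum of 30,000,000 as the offset (the input values are made up):

```perl
use strict;
use warnings;

# Sketch: use (value - minimum) as an array index and count occurrences.
# Note the counting array spans the whole value range, which is where the
# memory objection in the comments comes from.
my $min   = 30_000_000;
my @input = (30_000_005, 30_000_001, 30_000_005);

my @count;
$count[ $_ - $min ]++ for @input;

# Walking the counts in index order yields the values in sorted order.
my @sorted;
for my $i (0 .. $#count) {
    push @sorted, ($i + $min) x ($count[$i] // 0);
}
```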

warren
Bad idea, since the range is much greater than the number of integers (320 million possible values vs 65 thousand integers), so this "one pass" will be very slow.
Rafał Dowgird
You can't get better than one pass, as you must hit each element in the unsorted list at least once in any sorting algorithm that exists. Perl code would look more like: my @sorted_values; foreach my $element (@unsorted_values) { $sorted_values[$element]++; }
Max Lybbert
Aargh! I put in line breaks to avoid getting Perl code looking that bad as a one-liner!
Max Lybbert
I tried doing something like this at first, but I can't really afford 300 MB of RAM use; that would be unacceptable. However, I realized that I can make a number of buckets like that and thus keep the RAM use sensible.
Mithaldu
"You can't get better than one pass" - one pass over a 50 million element array is much worse than 4 passes over 65536 integers. And you have to pass over the counting array at least once.
Rafał Dowgird
@Rafal - regardless of how many items you have, you must make at least one pass. You never "pass" over the counting array. I would love to see a way to sort a group without hitting every item at least once - if you have one, please share it. Until then, you have to hit each item at least once.
warren
@warren: I agree that you have to hit every item in the input at least once. But how can you do a countsort without passing over the counting array? How do you initialize the array? How do you output the sorted integers (or detect the gaps, in this particular case)?
Rafał Dowgird
> How do you initialize the array?

The array is initialized automatically by Perl. The original question asked how to sort the array, not how to output it. The original question also said "RAM usage is irrelevant and speed only is important," but apparently there was a limit to RAM as well.
Max Lybbert
> The original question asked how to sort the array, not how to output it.

Sort it for the particular purpose of finding gaps. You have to iterate over the whole range to find gaps. Also, zeroing 200 MB isn't free, even if done automatically by Perl.
Rafał Dowgird
@Rafal - this is still an O(n) sort - since the list of items will be longer than the counted list; technically it's O(n+m), but that still reduces to O(n)
warren
The only solution would be to test this.
Max Lybbert
+16  A: 

It's unlikely that you'll be able to write a sort algorithm in Perl that will perform better than Perl's builtin sort function:

@numbers = sort {$a <=> $b} @numbers;

You can experiment with the sort pragma to see if a particular algorithm is better:

use sort '_quicksort';
use sort '_mergesort';

Since your cutpoints will vary depending on the data distribution, I think you need to sort the entire list first then loop over it to do the cutting.

my $prev  = shift @numbers;  # already sorted
my @group = [$prev];
my $i     = 0;

foreach my $n (@numbers) {
    $i++ if ($n - $prev > 65535);
    push @{$group[$i]}, $n;
    $prev = $n;
}
Michael Carman
Thanks for reminding me of the Perl sort function. I'd completely forgotten about it.
Mithaldu
+1  A: 

I would try this:

my @sorted = map { unpack "N", $_ } sort map { pack "N", $_ } @unsorted;
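Spelled out as a complete, runnable sketch (the sample values are mine): this is the Guttman-Rosler Transform - pack each number into a fixed-width big-endian string, let sort use its fast default string comparison, then unpack:

```perl
use strict;
use warnings;

my @unsorted = (350_000_000, 30_000_000, 65_535_000);

# pack "N" produces a 4-byte big-endian string, so lexicographic string
# order equals numeric order for non-negative 32-bit integers.
my @sorted = map  { unpack "N", $_ }
             sort
             map  { pack "N", $_ } @unsorted;
```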
Leon Timmermans
I'm afraid map is a bit of black magic to me. What does that piece of code do? oO
Mithaldu
I assume the map{} is to eliminate the need for a sortsub to get a numeric comparison. The {$a <=> $b} case has been optimized since 5.6.1 so the trickery shouldn't be necessary.
Michael Carman
You have to read this from right-to-left. The map { pack "N" } @unsorted applies pack "N" to each element--turning each element into a big-endian binary number--the output gets passed to sort which has O(n log n) behavior, and each element of the sorted result gets unpacked and assigned to @sorted.
Max Lybbert
@michael: didn't know that. Interesting.
Leon Timmermans
+13  A: 

I'd just make an array of buckets before running the algorithm, one for each group of 65536 consecutive values. The buckets will contain a min and max value of their contents, but won't store the contents themselves. After running the algorithm, do a single pass over the buckets. If there are two consecutive non-empty buckets with min(bucket2)-max(bucket1) < 65536, combine them. Combining will not happen until the algorithm finishes running. Discard any empty buckets. This algorithm is linear time.

Take note of Bucket Sort.
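A sketch of that scheme in Perl (the names and the exact merge condition are my interpretation of the description above):

```perl
use strict;
use warnings;

# Each bucket covers 65,536 consecutive values and stores only the min
# and max seen; bucket index = int(value / 65536).
sub gap_groups {
    my ($gap, @values) = @_;
    my %bucket;                                  # index => [min, max]
    for my $v (@values) {
        my $b = $bucket{ int($v / 65536) } //= [$v, $v];
        $b->[0] = $v if $v < $b->[0];
        $b->[1] = $v if $v > $b->[1];
    }

    # Single pass over the non-empty buckets in order, merging neighbours
    # whose gap is small enough; empty buckets are simply never created.
    my @groups;
    for my $i (sort { $a <=> $b } keys %bucket) {
        my ($min, $max) = @{ $bucket{$i} };
        if (@groups && $min - $groups[-1][1] < $gap) {
            $groups[-1][1] = $max;               # merge into previous group
        }
        else {
            push @groups, [$min, $max];          # start a new group
        }
    }
    return @groups;
}
```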

Brian
You managed to summarize the problems really well. I've actually, while reading the replies here, pondered doing something like that, but wasn't too sure on it yet. Thanks. :)
Mithaldu
I just edited my answer and scrapped a bit of unrelated text, based on your edits. The resulting answer should be much faster, though both were linear-time algorithms.
Brian