I have a few million integers between 0 and 64K. I'd like to split them up into N buckets, where each bucket contains about the same number of items from a contiguous range. So for example, if I had exactly one datapoint with each possible value and 64 buckets, ideally I'd end up with a bucket for 0-1023, one for 1024-2047, and so on.

What is an algorithm for calculating the bucket ranges that most evenly distributes the number of items?

A: 

If you are focusing on even distribution, the easiest way to go would probably be to sort the list and then place the first (list_length / N) items into the first bucket, then the next (list_length / N) items into the next bucket, etc. Since you have a rather large list to sort, this probably isn't the most efficient solution.
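A minimal sketch of that sort-and-slice approach (the function name and return format are illustrative, not from the answer):

```python
def bucket_ranges(values, n_buckets):
    """Sort values, then cut the sorted list into n_buckets equal-size
    slices; return a (low, high) value range per bucket."""
    s = sorted(values)
    size = len(s) // n_buckets
    ranges = []
    for i in range(n_buckets):
        lo = i * size
        # the last bucket absorbs any leftover items
        hi = len(s) - 1 if i == n_buckets - 1 else (i + 1) * size - 1
        ranges.append((s[lo], s[hi]))
    return ranges

print(bucket_ranges([5, 1, 9, 3, 7, 2, 8, 4], 4))  # → [(1, 2), (3, 4), (5, 7), (8, 9)]
```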

bta
A: 

Sorting your numbers and filling buckets that contain the desired number of elements as you go through the sorted list is one possibility.

You can do something similar, and possibly faster, with a heap: building a heap from your elements is O(n), and you can then extract the smallest list_length/N elements (i.e. the contents of one bucket) in O((list_length/N) log list_length), without fully sorting the rest.
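A sketch of the heap idea using Python's `heapq` (the helper name is mine): heapify the data, then pop off the first bucket's worth of elements.

```python
import heapq

def first_bucket(values, n_buckets):
    """Build a heap in O(n), then pop the smallest len/N elements,
    which form the first (lowest-valued) bucket."""
    heap = list(values)
    heapq.heapify(heap)                 # O(n)
    k = len(heap) // n_buckets
    return [heapq.heappop(heap) for _ in range(k)]  # O(k log n)

print(first_bucket([5, 1, 9, 3, 7, 2, 8, 4], 4))  # → [1, 2]
```

Repeating the pops N times yields every bucket, at which point the total work is the same as a heapsort; the heap only wins if you need fewer than all N boundaries.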

If speed is not too much of a concern, however, sorting 1 million numbers is both simple and fast (a fraction of a second in Python with Numpy).
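To illustrate that last point, here is a hedged sketch with NumPy: sort the whole array once, then read the bucket upper boundaries at every (len // N)-th position of the sorted array (the variable names and the synthetic data are mine).

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the question's data: a few million ints in [0, 64K)
data = rng.integers(0, 65536, size=2_000_000)

n_buckets = 64
s = np.sort(data)                       # a fraction of a second on data this size
step = len(s) // n_buckets
# upper edge of each bucket: every step-th element of the sorted array
boundaries = s[step - 1 :: step][:n_buckets]
print(len(boundaries))  # → 64
```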
