ansaurus

Question

Python: create a distribution from a list based on number of items that fall within certain ranges

Answer 1

+2 A:

If data is always sorted, a compact approach might be:

import itertools as it

d = [k+1 for k, L in
         ((k, len(list(g))) for k, g in it.groupby(data,key=lambda x:x//10))
     if L>=3]

If data isn't sorted, or if you don't know, use sorted(data) as the first argument to itertools.groupby, instead of just data.

If you prefer a less dense/compact approach, you can of course expand this, e.g. to:

def divby10(x): return x//10

distribution = []
for k, g in it.groupby(data, key=divby10):
    L = len(list(g))
    if L < 3: continue
    distribution.append(k+1)

In either case, the mechanism is that groupby first applies the callable passed as key= to each item in the iterable passed as its first argument, to obtain each item's"key"; for each consecutive group of items which have the same "key", groupby yields a tuple with two items: the value of the key, and an iterable over all items in said group.

Here, the key is obtained by dividing an item by 10 (with truncation); len(list(g)) is the number of consecutive items with that "key". Since the items must be consecutive, you need the data to be sorted (and, it's simpler to just sort it, than sort it "by value divided by 10 with truncation";-).

Alex Martelli 2010-08-23 18:18:15

Answer 2

A:

This sounds like a job for some form of histogram. Presorting should not be necessary in order to accomplish this. I discuss using a variant of bucket sort to group nearby elements here, though you'll need to adjust this algorithm to suit your purposes. Note that you do not need to store the numbers themselves in the buckets in order to form a histogram

Brian 2010-08-23 18:35:08

Answer 3

+1 A:

Since data might be very lengthy, you may want to look into using numpy. It provides many useful functions for numerical work, it requires less memory to store data in a numpy array than a Python list[*], and, since many of the numpy functions call C functions under the hood, you may be able to obtain some speed gains:

import numpy as np

data = np.array([1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59])

hist,bins=np.histogram(data,bins=np.linspace(0,60,7))
print(hist)
# [11  3  1  3  1  3]

distribution=np.where(hist>=3)[0]+1
print(distribution)
# [1 2 4 6]

[*] -- Note: In the code above, a Python list was formed in the process of defining data. So the maximum memory requirement here is actually greater than if you had just used a Python list. The memory should get freed however, if there are no other references to the Python list. Alternatively, if the data is stored on disk, numpy.loadtxt can be used to read it directly into a numpy array.

unutbu 2010-08-23 20:23:02

ansaurus

tags:

views:

answers:

Python: create a distribution from a list based on number of items that fall within certain ranges

related questions