In a pylab program (which could probably be a matlab program as well) I have a numpy array of numbers representing distances: d[t] is the distance at time t (and the timespan of my data is len(d) time units).

The events I'm interested in are when the distance is below a certain threshold, and I want to compute the duration of these events. It's easy to get an array of booleans with b = d<threshold, and the problem comes down to computing the sequence of the lengths of the True-only words in b. But I do not know how to do that efficiently (i.e. using numpy primitives), so I resorted to walking the array and doing manual change detection (i.e. initialize a counter when the value goes from False to True, increase the counter as long as the value is True, and output the counter to the sequence when the value goes back to False). But this is tremendously slow.

How can I efficiently detect that sort of sequence in numpy arrays?

Below is some python code that illustrates my problem: the fourth dot takes a very long time to appear (if not, increase the size of the array)

from pylab import *

threshold = 7

print '.'
d = 10*rand(10000000)

print '.'

b = d<threshold

print '.'

durations=[]
for i in xrange(len(b)):
    if b[i] and (i==0 or not b[i-1]):
        counter=1
    if i>0 and b[i-1] and b[i]:
        counter+=1
    # a run ends when the value drops back to False, or at the end of the array
    if (i>0 and b[i-1] and not b[i]) or (b[i] and i==len(b)-1):
        durations.append(counter)

print '.'
A: 
durations = []
counter   = 0

for flag in b:  # avoid shadowing the built-in name bool
    if flag:
        counter += 1
    elif counter > 0:
        durations.append(counter)
        counter = 0

if counter > 0:
    durations.append(counter)
John Kugelman
sure, this is more concise, but just as inefficient; what I want to do is move the loop down to the C layer, by means of some clever combination of numpy calls...
Gyom
check my edited answer, I now offer one such "clever combination" (always trying hard not to be TOO clever though;-) -- but, do measure the speed of that one AND the itertools.groupby-based solution, and let us know which one is faster (and by how much) in examples realistic-for-you!
Alex Martelli
+2  A: 

While not numpy primitives, itertools functions are often very fast, so do give this one a try (and measure times for various solutions including this one, of course):

import itertools

def runs_of_ones(bits):
  for bit, group in itertools.groupby(bits):
    if bit: yield sum(group)

If you do need the values in a list, you can just use list(runs_of_ones(bits)), of course; but a list comprehension might be marginally faster still:

def runs_of_ones_list(bits):
  return [sum(g) for b, g in itertools.groupby(bits) if b]
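
To make the grouping step concrete, here is a small sketch of what itertools.groupby produces on a toy input. The sample list is made up for illustration and is not the question's data.

```python
import itertools

# Hypothetical sample input, not taken from the question's data.
bits = [False, True, True, False, True, True, True, False]

# groupby yields one (key, group) pair per run of equal consecutive values.
runs = [(key, len(list(group))) for key, group in itertools.groupby(bits)]
# runs == [(False, 1), (True, 2), (False, 1), (True, 3), (False, 1)]

# Keeping only the True runs gives the durations.
durations = [length for key, length in runs if key]
# durations == [2, 3]
```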

Moving to "numpy-native" possibilities, what about:

import numpy

def runs_of_ones_array(bits):
  # make sure all runs of ones are well-bounded
  bounded = numpy.hstack(([0], bits, [0]))
  # get 1 at run starts and -1 at run ends
  difs = numpy.diff(bounded)
  run_starts, = numpy.where(difs > 0)
  run_ends, = numpy.where(difs < 0)
  return run_ends - run_starts

Again: be sure to benchmark solutions against each other in realistic-for-you examples!
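
One possible way to run such a comparison is sketched below with the standard timeit module. The function names, the array size (200000, smaller than the question's 10 million) and the fixed seed are assumptions chosen for illustration, and the timings will depend on your machine and data.

```python
import itertools
import timeit

import numpy as np


def runs_groupby(bits):
    # Pure-Python run lengths via itertools.groupby.
    return [sum(1 for _ in g) for key, g in itertools.groupby(bits) if key]


def runs_diff(bits):
    # Pad with zeros so every run has a start and an end, then diff.
    bounded = np.concatenate(([0], np.asarray(bits, dtype=np.int8), [0]))
    difs = np.diff(bounded)
    return np.flatnonzero(difs < 0) - np.flatnonzero(difs > 0)


# Assumed test data: same recipe as the question, but a smaller array.
rng = np.random.RandomState(0)
b = 10 * rng.rand(200000) < 7

# Sanity check: both approaches should agree before being timed.
assert runs_groupby(b) == runs_diff(b).tolist()

for name, func in [("groupby", runs_groupby), ("diff/where", runs_diff)]:
    # Best of three single runs, to reduce noise.
    seconds = min(timeit.repeat(lambda: func(b), number=1, repeat=3))
    print("%-10s %.4f s" % (name, seconds))
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes; increase the array size to get closer to your real workload.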

Alex Martelli
Hmmmmm... that last one looks familiar. ;)
gnovice
Thanks a lot! The diff/where solution is exactly what I had in mind (not to mention it is about 10 times faster than the other solutions). Call that "not too clever" if you like, but I wish I was clever enough to come up with it :-)
Gyom
@gnovice, I don't do matlab (funny enough my daughter, now a PhD candidate in advanced radio engineering, does;-), but now looking at your answer I do see the analogies -- get the end-of-runs minus the start-of-runs, get those by locating the <0 and >0 spots in the differences, and pad the bits with zeros to make sure all runs-of-ones do end. Guess there aren't that many ways to skin this "run lengths" problem!-)
Alex Martelli
@Gyom, you're welcome -- as @gnovice hints, the matlab solution is also similar, or so I guess it would be if one knew matlab -- so it must be that neither is very clever;-)... it's more a question of having had to do run-length coding stuff before (most of the time in my edit was about translating from Numeric, which is what I still tend instinctively to turn to, to much-better numpy -- but where I actually first learned such things was with APL, 30 years ago, when I was still a hardware designer...!-).
Alex Martelli
+2  A: 

Just in case anyone is curious (and since you mentioned MATLAB in passing), here's one way to solve it in MATLAB:

threshold = 7;
d = 10*rand(1,100000);  % Sample data
b = diff([false (d < threshold) false]);
durations = find(b == -1)-find(b == 1);

I'm not too familiar with Python, but maybe this could help give you some ideas. =)

gnovice
thanks for this answer as well, this is exactly the kind of stuff I was looking for
Gyom
diff() exists in numpy too, so this is more or less what you want, though you should replace find(foo) with where(foo)[0].
dwf
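
A line-by-line NumPy transliteration of the MATLAB snippet, following the find/where hint above, might look like the sketch below. The sample values in d are made up so the result can be checked by hand, and the cast to int is there because np.diff on a boolean array does not return signed +1/-1 differences.

```python
import numpy as np

threshold = 7
# Made-up sample data, small enough to verify by eye.
d = np.array([9.0, 3.0, 2.0, 8.0, 1.0, 1.0, 1.0, 9.0, 4.0])

# Equivalent of diff([false (d < threshold) false]);
# cast to int so that the differences are +1 (run start) and -1 (run end).
below = (d < threshold).astype(int)
b = np.diff(np.concatenate(([0], below, [0])))

# Equivalent of find(b == -1) - find(b == 1).
durations = np.where(b == -1)[0] - np.where(b == 1)[0]
# durations is array([2, 3, 1])
```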
A: 

Here is a solution using only arrays: it takes an array containing a sequence of bools and counts the lengths of the runs between transitions.

>>> from numpy import array, arange
>>> b = array([0,0,0,1,1,1,0,0,0,1,1,1,1,0,0], dtype=bool)
>>> sw = (b[:-1] ^ b[1:]); print sw
[False False  True False False  True False False  True False False False
  True False]
>>> isw = arange(len(sw))[sw]; print isw
[ 2  5  8 12]
>>> lens = isw[1::2] - isw[::2]; print lens
[3 4]

sw contains a True wherever there is a switch, and isw converts those to indexes. The items of isw are then subtracted pairwise in lens.

Notice that if the sequence started with a 1 it would count the lengths of the 0s sequences: this can be fixed in the indexing used to compute lens. Also, I have not tested corner cases such as sequences of length 1.
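
One way to sidestep both caveats is to pad the array with False on each side before looking for switches, so that every run of True has exactly one start and one end. This is a sketch under that assumption, with a hypothetical function name, and it has only been checked on the small cases shown.

```python
import numpy as np


def true_run_lengths(b):
    # Hypothetical helper name. Padding with False guarantees that the
    # switches come in (start, end) pairs, whatever the first/last values are.
    b = np.asarray(b, dtype=bool)
    padded = np.concatenate(([False], b, [False]))
    sw = padded[:-1] ^ padded[1:]   # True wherever the value changes
    isw = np.flatnonzero(sw)        # indexes of the switches
    return isw[1::2] - isw[::2]     # end index minus start index, per run


lens = true_run_lengths([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0])
# lens is array([3, 4]), same as in the session above

edge = true_run_lengths([1, 1, 0, 1])
# edge is array([2, 1]): runs touching either end of the array are counted
```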

piro