views:

472

answers:

7

I have a numpy array [1,2,3,4,5,6,7,8,9,10,11,12,13,14] and want to have an array structured like [[1,2,3,4], [2,3,4,5], [3,4,5,6], ..., [11,12,13,14]].

Sure this is possible by looping over the large array and adding arrays of length four to the new array, but I'm curious if there is some secret 'magic' Python method doing just this :)

A: 

I know of no Python stdlib function that quite does that. It's easy enough to do. Here is a generator that basically does it:

def chunks(sequence, length):
    for index in xrange(0, len(sequence) - length + 1):
        yield sequence[index:index + length]

You could use it like so

>>> import numpy
>>> a = numpy.arange(1, 15)
>>> a
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> numpy.array(list(chunks(a, 4)))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

The only weird thing about this code is that I called list on the result of chunks(a, 4). This is since numpy.array does not accept an arbitrary iterable, such as the generator chunks returns. If you just want to iterate over these chunks, you don't need to bother. If you really need to put the result into an array you can do it this way or some more efficient ways.

Mike Graham
+3  A: 

Quick&dirty solution:

>>> a = numpy.arange(1,15)
>>> numpy.array([ a[i:i+4] for i in range(len(a)-3) ])
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])
Federico Ramponi
Nothing really dirty about that.
Mike Graham
Certainly quick - see the timings on my answer. Also easily to generalise to different size chunks.
gnibbler
+1  A: 

Using itertools, and assuming Python 2.6:

import itertools

def byN(iterable, N):
    itrs = itertools.tee(iter(iterable), N)
    for n in range(N):
        for i in range(n):
            next(itrs[n], None)
    return zip(*itrs)

aby4 = numpy.array(byN(thearray, 4))
Alex Martelli
This code doesn't work. numpy would interpret the `izip` object as a scalar and make a rank-1, length-1 array of dtype `object` out of it.
Mike Graham
Also, you need to `from itertools import tee, izip` or use the fully-qualified names.
Mike Graham
I thought numpy.array accepted any iterable -- editing to fix.
Alex Martelli
Also, you have backwards logic in traversing the iterators you teed. The columns in the resulting array would be backwards if the code did work.
Mike Graham
@Alex Martelli, `numpy.array` doesn't. It can be frustrating, but I think the reason is that it needs to preallocate the array so it wants to know the size of the input ahead of time.
Mike Graham
(In the edit you still advance the earlier iterators more. OS's output example would have you advance the later iterators more.)
Mike Graham
Right -- edited again to fix (`next(itrs[n], None)`, not `next(itrs[i], None)` as I had it before).
Alex Martelli
+8  A: 

The fastest way seems to be to preallocate the array, given as option 7 right at the bottom of this answer

>>> import numpy as np
>>> A=np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14])
>>> A
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> np.array(zip(A,A[1:],A[2:],A[3:]))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])
>>> 

You can easily adapt this to do it for variable chunk size

>>> n=5
>>> np.array(zip(*(A[i:] for i in range(n))))
array([[ 1,  2,  3,  4,  5],
       [ 2,  3,  4,  5,  6],
       [ 3,  4,  5,  6,  7],
       [ 4,  5,  6,  7,  8],
       [ 5,  6,  7,  8,  9],
       [ 6,  7,  8,  9, 10],
       [ 7,  8,  9, 10, 11],
       [ 8,  9, 10, 11, 12],
       [ 9, 10, 11, 12, 13],
       [10, 11, 12, 13, 14]])

You may wish to compare performance between this and using itertools.islice

>>> from itertools import islice
>>> n=4
>>> np.array(zip(*[islice(A,i,None) for i in range(n)]))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

My timing results:

1. timeit np.array(zip(A,A[1:],A[2:],A[3:]))
10000 loops, best of 3: 92.9 us per loop

2. timeit np.array(zip(*(A[i:] for i in range(4))))
10000 loops, best of 3: 101 us per loop

3. timeit np.array(zip(*[islice(A,i,None) for i in range(4)]))
10000 loops, best of 3: 101 us per loop

4. timeit numpy.array([ A[i:i+4] for i in range(len(A)-3) ])
10000 loops, best of 3: 37.8 us per loop

5. timeit numpy.array(list(chunks(A, 4)))
10000 loops, best of 3: 43.2 us per loop

6. timeit numpy.array(byN(A, 4))
10000 loops, best of 3: 100 us per loop

# Does preallocation of the array help? (11 is from len(A)+1-4)
7. timeit B=np.zeros(shape=(11, 4),dtype=np.int32)
100000 loops, best of 3: 2.19 us per loop
   timeit for i in range(4):B[:,i]=A[i:11+i]
10000 loops, best of 3: 20.9 us per loop
total 23.1us per loop

As len(A) increases (20000) 4 and 5 converge to be equivalent speed (44ms). 1,2,3 and 6 all remain about 3 times slower(135ms). 7 is much faster (1.36ms)

gnibbler
The generalization is wonderful, +1
Federico Ramponi
Note that the `islice` version is probably silly in this particular use case. It is slower (islicing is O(n); slicing is O(1)). It may be somewhat more memory efficient (maybe and maybe not—numpy slices don't copy), but if we wanted to optimize for memory we would never form any lists at all. Of course, tests talk and guesses walk.
Mike Graham
If I was really into benchmarking this (a silly but fun enough task), I'd probably try to shave off a couple more nanoseconds by using `numpy.empty` instead of `numpy.zeros` and using a column-major array so that I deal with contiguous memory when setting columns.
Mike Graham
A: 

the efficient numpy way to do this is given here, which is somewhat too long to reproduce here. it boils down to using some stride tricks, and is much faster than itertools for largish window sizes. for example, using a method essentially the same as that of of alex martelli's:

In [16]: def windowed(sequence, length):
    seqs = tee(sequence, length)
    [ seq.next() for i, seq in enumerate(seqs) for j in xrange(i) ]
    return zip(*seqs)

we get:

In [19]: data = numpy.random.randint(0, 2, 1000000)

In [20]: %timeit windowed(data, 2)
100000 loops, best of 3: 6.62 us per loop
In [21]: %timeit windowed(data, 10)
10000 loops, best of 3: 29.3 us per loop
In [22]: %timeit windowed(data, 100)
1000 loops, best of 3: 1.41 ms per loop
In [23]: %timeit segment_axis(data, 2, 1)
10000 loops, best of 3: 30.1 us per loop
In [24]: %timeit segment_axis(data, 10, 9)
10000 loops, best of 3: 30.2 us per loop
In [25]: %timeit segment_axis(data, 100, 99)
10000 loops, best of 3: 30.5 us per loop
Autoplectic
+3  A: 

You should use stride_tricks. When I first saw this, the word 'magic' did spring to mind. It's simple and is by far the fastest method.

>>> as_strided = numpy.lib.stride_tricks.as_strided
>>> a = numpy.arange(1,15)
>>> a
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> b = as_strided(a,(11,4),(4,4))
>>> b
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

Be aware that the values in array b are those in a, just viewed differently. Do a .copy() on b if you plan to modify it.

I saw this at a SciPy conference. Here are the slides for more explanation.

bpowah
+1  A: 

Broadcast!

from numpy import ogrid
def stretch(N=5,M=15):
    x, y = ogrid[0:M,0:N]
    return x+y+1

Note that ogrid does give things like:

>> ogrid[0:5,0:5]
>> 
[array([[0],
       [1],
       [2],
       [3],
       [4]]),
 array([[0, 1, 2, 3, 4]])]

Let's compare with another solution given here:

def zipping(N=5,M=15):
    A = numpy.arange(1, M+1)
    return numpy.array(zip(*(A[i:] for i in range(N))))

comparison (python 2.6, 32 bit, 1Go RAM) gives

>>> %timeit stretch(5,15)
10000 loops, best of 3: 61.2 us per loop

>>> %timeit zipping(5,15)
10000 loops, best of 3: 72.5 us per loop

>>> %timeit stretch(5,1e3)
10000 loops, best of 3: 128 us per loop

>>> %timeit zipping(5,1e3)
100 loops, best of 3: 4.25 ms per loop

The 40-fold speed-up is kind of consistant to scaling.

meduz