views:

94

answers:

4

I have a very large numpy array (containing up to a million elements) like the one below:

[ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and a small dictionary map for replacing some of the elements in the above array

{4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?

A: 

Well, you need to make one pass through theArray, and for each element replace it if it is in the dictionary.

for i in xrange( len( theArray ) ):
    if foo[ i ] in dict:
        foo[ i ] = dict[ foo[ i ] ]
katrielalex
It would be better to put len(theArray) into variable, and use xrange.
fuwaneko
@fuw: Yes xrange, but putting `len(theArray)` into a variable won't help because the iterator is evaluated once only.
KennyTM
Py3k's `range` is a generator.
katrielalex
@kat: There's no NumPy for Python 3.x yet.
KennyTM
Oopsie. Thanks.
katrielalex
+2  A: 

I believe there's even more efficient method, but for now, try

from numpy import copy

newArray = copy(theArray)
for k, v in d.iteritems(): newArray[theArray==k] = v

Microbenchmark and test for correctness:

#!/usr/bin/env python2.7

from numpy import copy, random, arange

random.seed(0)
data = random.randint(30, size=10**5)

d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()

def f1(a, d):
    b = copy(a)
    for k, v in d.iteritems():
        b[a==k] = v
    return b

def f2(a, d):
    for i in xrange(len(a)):
        a[i] = d.get(a[i], a[i])
    return a

def f3(a, dk, dv):
    mp = arange(0, max(a)+1)
    mp[dk] = dv
    return mp[a]


a = copy(data)
res = f2(a, d)

assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()

Result:

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop
KennyTM
`numpy.place` I think...
katrielalex
A: 
for i in xrange(len(the_array)):
    the_array[i] = the_dict.get(the_array[i], the_array[i])
gnibbler
+1  A: 

Assuming the values are between 0 and some maximum integer, one could implement a fast replace by using the numpy-array as int->int dict, like below

mp = numpy.arange(0,max(data)+1)
mp[replace.keys()] = replace.values()
data = mp[data]

where first

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  0  5  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
  1  0 16 17  2  1 17 18  3  2 18 15  0  3]
celil