views:

43

answers:

2

I have a master array of length n of id numbers that apply to other analogous arrays with corresponding data for elements in my simulation that belong to those id numbers (e.g. data[id]). Were I to generate a list of id numbers of length m separately and need the information in the data array for those ids, what is the best method of getting a list of indices idx of the original array of ids in order to extract data[idx]? That is, given:

a=numpy.array([1,3,4,5,6])      # master array
b=numpy.array([3,4,3,6,4,1,5])  # secondary array

I would like to generate

idx=numpy.array([1,2,1,4,2,0,3])

The array a is typically in sequential order but it's not a requirement. Also, array b will most definitely have repeats and will not be in any order.

My current method of doing this is:

idx=numpy.array([numpy.where(a==bi)[0][0] for bi in b])

I timed it using the following test:

a=(numpy.random.uniform(100,size=100)).astype('int')
b=numpy.repeat(a,100)
timeit method1(a,b)

10 loops, best of 3: 53.1 ms per loop

Is there a better way of doing this?

A: 

I'm not sure if there is a way to do this automatically in python, but you're probably best off sorting the two arrays and then generating your output in one pass through b. The complexity of that operation should be O(|a|*log|a|)+O(|b|*log|b|)+O(|b|) = O(|b|*log|b|) (assuming |b| > |a|). I believe your original try has complexity O(|a|*|b|), so this should provide a noticeable improvement for a sufficiently large b.

VeeArr
+1  A: 

The current way you are doing it with where searching through the whole array of a each time. You can make this look-up O(1) instead of O(N) using a dict. For instance, I used the following method:

def method2(a,b):
    tmpdict = dict(zip(a,range(len(a))))
    idx = numpy.array([tmpdict[bi] for bi in b])

and got a very large speed-up which will be even better for larger arrays. For the sizes that you had in your example code, I got a speed-up of 15x. The only problem with my code is that if there are repeated elements in a, then the dict will currently point to the last instance of the element while with your method it will point to the first instance. However, that can remedied if there are to be repeated elements in the actual usage of the code.

Justin Peel
Very nice. Thank you. This works well since there are no repeated elements in `a` as it is a list of unique id numbers.
fideli
@fideli, That's what I'd guessed, but your example with random numbers didn't rule out repeats. Glad that I could help.
Justin Peel