Let's say I have a two-dimensional array like this:

numpy.array([[0,1,1.2,3],[1,5,3.2,4],[3,4,2.8,4], [2,6,2.3,5]])

I want to build an array by eliminating whole rows so that the values in the last column are unique, choosing which row to keep based on the value in the third column. E.g., in this case I would like to keep only one of the rows with 4 in the last column, choosing the one with the smallest value in the third column, giving this result:

numpy.array([[0,1,1.2,3],[3,4,2.8,4],[2,6,2.3,5]])

thus eliminating the row [1,5,3.2,4].

What would be the best way to do it?

+1  A: 

My numpy is way out of practice, but this should work:

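import numpy

# Example array from the question
n = numpy.array([[0,1,1.2,3],[1,5,3.2,4],[3,4,2.8,4],[2,6,2.3,5]])

# keepers maps each last-column value to a tuple (row index, row[2])
# describing the row currently kept for that value
keepers = {}
# deletions collects the indexes of the rows to drop
deletions = []
for i, row in enumerate(n):
    key = row[3]
    if key not in keepers:
        # first row seen with this last-column value: keep it for now
        keepers[key] = (i, row[2])
    else:
        if row[2] > keepers[key][1]:
            # the kept row has the smaller third-column value: drop this one
            deletions.append(i)
        else:
            # this row wins: drop the previously kept row and keep this one
            deletions.append(keepers[key][0])
            keepers[key] = (i, row[2])
o = numpy.delete(n, deletions, axis=0)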

I've greatly simplified it from my declarative solution, which was getting quite unwieldy. Hopefully this is easier to follow; all we do is maintain a dictionary of values that we want to keep and a list of indexes we want to delete.

llimllib
Add at the end your version with `itertools.groupby()`. It is interesting.
J.F. Sebastian
but it's also wrong...
llimllib
I'll be a bit more precise: it's algorithmically wrong. To make it work I would need to sort the array, which I really want to avoid in order to keep the runtime at O(n), which this solution should be.
llimllib
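For comparison, here is a minimal sketch of the sort-based route mentioned in the comment above. It is not the answer's method: it trades the O(n) loop for an O(n log n) sort, using `numpy.lexsort` and `numpy.unique`. The function name `keep_min_per_key` and its `key_col`/`value_col` parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def keep_min_per_key(a, key_col=3, value_col=2):
    """Keep, for each distinct value in key_col, the row with the
    smallest value in value_col (hypothetical helper, sort-based)."""
    # lexsort uses the LAST key as the primary one:
    # sort by key_col first, then by value_col within each key
    order = np.lexsort((a[:, value_col], a[:, key_col]))
    sorted_keys = a[order, key_col]
    # the first occurrence of each key in sorted order is its minimum row
    _, first = np.unique(sorted_keys, return_index=True)
    # map back to original row indexes and restore the original row order
    keep = np.sort(order[first])
    return a[keep]

n = np.array([[0,1,1.2,3],[1,5,3.2,4],[3,4,2.8,4],[2,6,2.3,5]])
print(keep_min_per_key(n))
```

On the array from the question this keeps rows 0, 2 and 3, which matches the result asked for. Ties in the third column are resolved in favor of whichever row the sort places first, so the choice between tied rows should not be relied on.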
A: 

Thank you very much indeed! I think this was far more complex than my current ability could handle; it's a very smart snippet of code. I think I've understood the logic, and it seems to be right with the example array I had provided, but it fails with other arrays, for example:

n = numpy.array([[1,1,1.2,3],[1,5,3.2,3],[3,4,2.8,3],[2,6,2.3,5]])

or even

n = numpy.array([[1.,1.2,3.,2.],[5.,3.2,4.,3.],[4.,2.8,7.,6.],[6.,2.3,5.,3.]])

I really can't see why...

It fails because groupby() only groups consecutive elements. To debug it, start from the inside, just as my explanation did. I'm fixing my answer right now.
llimllib
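A small sketch of the behaviour described in that comment, using the last column of the second failing array (2, 3, 6, 3). The variable names are made up for this illustration.

```python
from itertools import groupby

# last column of the second failing array from the post above
keys = [2.0, 3.0, 6.0, 3.0]

# groupby starts a new group every time the key changes,
# so the two 3.0 entries end up in separate groups
unsorted_groups = [(k, len(list(g))) for k, g in groupby(keys)]
print(unsorted_groups)  # [(2.0, 1), (3.0, 1), (6.0, 1), (3.0, 1)]

# after sorting, equal keys are adjacent and are grouped together
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(keys))]
print(sorted_groups)    # [(2.0, 1), (3.0, 2), (6.0, 1)]
```

This is why a groupby-based version only works on input that is already sorted by the last column.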
(Also, you're not supposed to use the "answers" section to post a question; you should either comment on my answer or edit your question.)
llimllib
Sorry, it's the first time I've come to Stack Overflow, and it didn't allow me to comment (frankly, I think the "credits" requirement for writing comments is rather absurd). Thank you again for your very clever and kind answer.
No worries! I thought the no-commenting thing was pretty silly too. Glad to help.
llimllib