views:

719

answers:

3

I'm interested in converting a numpy array into a sparse dictionary as quickly as possible. Let me elaborate:

Given the array:

numpy.array([12,0,0,0,3,0,0,1])

I wish to produce the dictionary:

{0:12, 4:3, 7:1}

As you can see, we are simply converting the sequence type into an explicit mapping from indices that are nonzero to their values.

In order to make this a bit more interesting, I offer the following test harness to try out alternatives:

from timeit import Timer

if __name__ == "__main__":
  s = "import numpy; from itertools import izip; from numpy import nonzero, flatnonzero; vector =         numpy.random.poisson(0.1, size=10000);"

  ms = [ "f = flatnonzero(vector); dict( zip( f, vector[f] ) )"
             , "f = flatnonzero(vector); dict( izip( f, vector[f] ) )"
             , "f = nonzero(vector); dict( izip( f[0], vector[f] ) )"
             , "n = vector > 0; i = numpy.arange(len(vector))[n]; v = vector[n]; dict(izip(i,v))"
             , "i = flatnonzero(vector); v = vector[vector > 0]; dict(izip(i,v))"
             , "dict( zip( flatnonzero(vector), vector[flatnonzero(vector)] ) )"
             , "dict( zip( flatnonzero(vector), vector[nonzero(vector)] ) )"
             , "dict( (i, x) for i,x in enumerate(vector) if x > 0);"
             ]
  for m in ms:
    print "  %.2fs" % Timer(m, s).timeit(1000), m

I'm using a poisson distribution to simulate the sort of arrays I am interested in converting.

Here are my results so far:

   0.78s f = flatnonzero(vector); dict( zip( f, vector[f] ) )
   0.73s f = flatnonzero(vector); dict( izip( f, vector[f] ) )
   0.71s f = nonzero(vector); dict( izip( f[0], vector[f] ) )
   0.67s n = vector > 0; i = numpy.arange(len(vector))[n]; v = vector[n]; dict(izip(i,v))
   0.81s i = flatnonzero(vector); v = vector[vector > 0]; dict(izip(i,v))
   1.01s dict( zip( flatnonzero(vector), vector[flatnonzero(vector)] ) )
   1.03s dict( zip( flatnonzero(vector), vector[nonzero(vector)] ) )
   4.90s dict( (i, x) for i,x in enumerate(vector) if x > 0);

As you can see, the fastest solution I have found is

n = vector > 0;
i = numpy.arange(len(vector))[n]
v = vector[n]
dict(izip(i,v))

Any faster way?

Edit: The step

i = numpy.arange(len(vector))[n]

Seems particularly clumsy- generating an entire array before selecting only some elements, particularly when we know it might only be around 1/10 of the elements getting selected. I think this might still be improved.

A: 

I'd say that is fast enough for Python -- nice job. Any faster and it stops looking like a Python script ;-)
If you want speed, I'd recommend a different language.

jcoon
+1  A: 

I wonder if chief time cost is resizing the dict as it grows.

Would be nice if dict had a method (or instantiation option) to specify a start size; so if we knew it was going to be big, python could save time and just do one big mem alloc up front, rather than what I assume are additional allocs as we grow.

John Pirie
A: 

Tried this?

from numpy import where

i = where(vector > 0)[0]

Is this not the same as flatnonzero? And also, why didn't you try it (he even posted his test harness)? You would get +1 from me if you actually showed it was faster.
Kiv