Hello, I am using sparse matrices as a means of compressing data (lossily, of course). I build a sparse dictionary from all the values whose magnitude is greater than a specified threshold. I'd like the compressed data size to be something my user can choose.

My problem is that I have a sparse matrix with a lot of near-zero values, and I must choose a threshold so that my sparse dictionary has a specific size (or, eventually, so that the reconstruction error hits a specific rate). Here's how I create my dictionary (taken from Stack Overflow, I think >.<):

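n = abs(smat) > threshold  # smat is flattened (1-D); n is a boolean mask
i = mega_range[n]  # mega_range is numpy.arange(smat.shape[0])
v = smat[n]  # the values that survive the threshold
sparse_dict = dict(zip(i, v))  # zip works on Python 2 and 3 (originally izip)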

How can I find the threshold so that it is equal to the nth greatest value of my array (smat)?

+1  A: 

scipy.stats.scoreatpercentile(arr,per) returns the value at a given percentile:

import scipy.stats as ss
print(ss.scoreatpercentile([1, 4, 2, 3], 75))
# 3.25

The value is interpolated if the desired percentile lies between two points in arr.
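To make the interpolation concrete, here is a rough NumPy-only sketch of the linear rule that reproduces the 3.25 above. The helper name interpolated_percentile is made up for illustration, and it assumes the default "fraction" interpolation, not any of the other modes:

```python
import numpy as np

def interpolated_percentile(values, per):
    """Linearly interpolated percentile; per is on a 0-100 scale (hypothetical helper)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    pos = per / 100.0 * (len(ordered) - 1)  # fractional index into the sorted data
    lo = int(np.floor(pos))                 # nearest sorted element at or below pos
    hi = min(lo + 1, len(ordered) - 1)      # its right-hand neighbour, clamped to the end
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

result = interpolated_percentile([1, 4, 2, 3], 75)
print(result)  # 3.25: a quarter of the way from 3 to 4
```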

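So if you set per = 100.0 * (len(smat) - n) / len(smat) (scoreatpercentile expects per on a 0-100 scale, and the 100.0 avoids integer division) then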

threshold = ss.scoreatpercentile(abs(smat), per)

should give you (close to) the nth greatest value of the array smat.
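If you need exactly the nth greatest magnitude and not an interpolated value, one possible alternative (not what scoreatpercentile does) is numpy.partition, which is available in NumPy 1.8 and later. This is only a sketch: the helper name nth_greatest_magnitude and the sample array are made up for illustration.

```python
import numpy as np

def nth_greatest_magnitude(values, n):
    """Return the n-th largest absolute value, n=1 being the largest (hypothetical helper)."""
    mags = np.abs(np.asarray(values))
    if not 1 <= n <= mags.size:
        raise ValueError("n must be between 1 and the array length")
    k = mags.size - n  # position of the n-th greatest in ascending order
    # np.partition puts the k-th smallest element at index k without a full sort
    return np.partition(mags, k)[k]

smat = np.array([0.01, -3.0, 0.2, 5.0, -0.5, 0.03])  # made-up sample data
threshold = nth_greatest_magnitude(smat, 3)
kept = smat[np.abs(smat) >= threshold]  # >= keeps exactly n entries when there are no ties
```

Note that the question's mask uses a strict >, which would drop the nth value itself and keep n - 1 entries; use >= if you want exactly n (ties aside).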

unutbu
Exactly what I needed thanks!
Manux
You're welcome!
unutbu
FWIW, scipy/stats.py calls np.sort() and then interpolates. There are std::nth_element and std::partial_sort in C++, but sort() is really fast.
Denis
@Denis: Thanks. You're absolutely right.
unutbu