views: 344

answers: 4
Hi,

I need an efficient Java data structure to manipulate very sparse vectors of doubles: basic read/write operations. I implemented it with a HashMap, but access is too slow. Should I use another data structure? Do you recommend any free library?

Looking for some peaceful advice :)

Thanks a lot,

Marie

+1  A: 

HashMap is the way to go. It shouldn't be slow. Run your code through a profiler to see where all the time goes and then optimize accordingly. If you need tips to optimize the code, post an example here so we can help with a specific issue.

[EDIT] Depending on the size of the indexes, you can use a technique like the one in Integer.valueOf(int) to cache the objects used for boxing. But this will only work when you create lots of maps and the indexes are in a somewhat limited range.

Or you can try IntHashMap from commons-lang. It's a bit hard to use (it's package private) but you can copy the code.

Lastly, you could use your own implementation of an int-based HashMap with optimized value lookup for your case.
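
For illustration, such an int-based map could look roughly like this. This is a minimal sketch only: the class name IntDoubleMap and every detail here are made up (open addressing with linear probing, no removal, and Integer.MIN_VALUE reserved as the empty-slot marker):

import java.util.Arrays;

/** Minimal sketch of a map from int keys to double values (open addressing, no removal). */
public class IntDoubleMap {

    private static final int FREE = Integer.MIN_VALUE; // reserved marker for empty slots

    private int[] keys;
    private double[] values;
    private int used;

    public IntDoubleMap(int capacity) {
        // round up to a power of two so we can mask instead of taking a modulo
        int cap = Integer.highestOneBit(Math.max(capacity, 8) * 2);
        keys = new int[cap];
        values = new double[cap];
        Arrays.fill(keys, FREE);
    }

    public double get(int key) {
        int mask = keys.length - 1;
        for (int i = key & mask; keys[i] != FREE; i = (i + 1) & mask) {
            if (keys[i] == key) return values[i];
        }
        return 0.0; // unset indices read as zero, as befits a sparse vector
    }

    public void put(int key, double value) {
        if (used * 2 >= keys.length) grow(); // keep the load factor below 0.5
        int mask = keys.length - 1;
        int i = key & mask;
        while (keys[i] != FREE && keys[i] != key) i = (i + 1) & mask;
        if (keys[i] == FREE) used++;
        keys[i] = key;
        values[i] = value;
    }

    private void grow() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[keys.length];
        Arrays.fill(keys, FREE);
        used = 0; // put() recounts the entries as they are reinserted
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != FREE) put(oldKeys[i], oldValues[i]);
        }
    }
}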

Aaron Digulla
OK, thanks all for your quick replies. From your answers, I understand that a HashMap is the best strategy to use. Actually, I am using ***very*** big data sets which I unfortunately cannot optimize further, and I was wondering how to make my code more efficient. :) Thanks again!
Marie
OP says map is too slow, sorry but reading the Q should be the minimum before answering.
Adrian
@Adrian: And reading my answer should be the minimum before downvoting it.
Aaron Digulla
@Marie: As I said: Run it through a profiler to determine the bottlenecks and then come back.
Aaron Digulla
@aaron OP has integer indices, and for that a map is just not the way to go. A sparse vector with binary search on ordered indices is better than relying on the hash values of integers (which are, well, the integers themselves), not to mention integer boxing in Java.
Adrian
@Adrian: Integer boxing can be optimized. See `Integer.valueOf(int)`. Q is pretty scarce on useful information to give tips. Or he could use IntHashMap from commons lang.
Aaron Digulla
@aaron downvote removed (and true enough, question lacks hint re usage pattern)
Adrian
A: 

You can copy-paste the sparse vector from my Hapax project: ch.akuhn.matrix.SparseVector

PS: to all those other answers and comments that don't grok why using a map is too slow: it is slow because a map boxes all indices into Integer objects!

The sparse vector presented here is fast for read access and for appending values, but not for putting at random indices. It is optimal for a scenario where you first populate the sparse vector by putting values in order of increasing indices, and later use it mostly for reading.

The important methods in the sparse vector class are:

import java.util.Arrays;

// ...

public class SparseVector {

    /*default*/ int[] keys;
    /*default*/ int size, used;
    /*default*/ double[] values;

    public SparseVector(int size, int capacity) {
        assert size >= 0;
        assert capacity >= 0;
        this.size = size;
        this.keys = new int[capacity];
        this.values = new double[capacity];
    }

    public double get(int key) {
        if (key < 0 || key >= size) throw new IndexOutOfBoundsException(Integer.toString(key));
        int spot = Arrays.binarySearch(keys, 0, used, key);
        // unset indices read as zero
        return spot < 0 ? 0 : values[spot];
    }

    public boolean isUsed(int key) {
        return 0 <= Arrays.binarySearch(keys, 0, used, key);
    }

    public double put(int key, double value) {
        if (key < 0 || key >= size) throw new IndexOutOfBoundsException(Integer.toString(key));
        int spot = Arrays.binarySearch(keys, 0, used, key);
        if (spot >= 0) return values[spot] = value;
        // binarySearch returns (-(insertion point) - 1) when the key is absent
        else return update(-1 - spot, key, value);
    }

    public void resizeTo(int newSize) {
        if (newSize < this.size) throw new UnsupportedOperationException();
        this.size = newSize;
    }

    public int size() {
        return size;
    }

    private double update(int spot, int key, double value) {
        // grow if reaching end of capacity
        if (used == keys.length) {
            int capacity = (keys.length * 3) / 2 + 1;
            keys = Arrays.copyOf(keys, capacity);
            values = Arrays.copyOf(values, capacity);
        }
        // shift values if not appending
        if (spot < used) {
            System.arraycopy(keys, spot, keys, spot + 1, used - spot);
            System.arraycopy(values, spot, values, spot + 1, used - spot);
        }
        used++;
        keys[spot] = key;
        return values[spot] = value;
    }

    public int used() {
        return used;
    }

    public void trim() {
        keys = Arrays.copyOf(keys, used);
        values = Arrays.copyOf(values, used);
    }

}
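
For example, a quick usage sketch (putting values in increasing index order, as described above):

// logical size of one million, initial capacity for a few entries
SparseVector v = new SparseVector(1000000, 4);

v.put(17, 0.5);       // appending in increasing index order is the fast path
v.put(4096, -2.25);
v.put(999999, 1.0);

System.out.println(v.get(4096));   // -2.25
System.out.println(v.get(42));     // 0.0, unset indices read as zero
System.out.println(v.isUsed(42));  // false
v.trim();                          // shrink the backing arrays to the used size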
Adrian
OK, thanks, this seems very nice. The question here would become: is this implementation more time-efficient than a HashMap? I am really not an expert... let's vote (but please don't fight)
Marie
Putting might be slower, getting will be faster. I'll add that to the answer.
Adrian
Vote?? Measure!
Svante
That's because voting might be more time-efficient for a bad programmer like me :) I already started changing the code to test it.
Marie
We cannot vote on what's faster in *your* use case. Different data structures have different properties; which one is faster ultimately depends on your particular use case.
Adrian
This is basically the same as a TreeMap because it uses binary search. It may have a slight advantage over a map because it uses primitive arrays. When you have enough entries, a hash map is definitely faster. Like everyone else said, you need to MEASURE it yourself.
ZZ Coder
OK, so it IS faster in MY use case. Thanks! But now I have become too curious to stop the conversation. How about re-implementing the original Java HashMap class and "replacing" the Integer things with int things? Would that just be stupid?
Marie
Mine might still be faster: I use binary search on an ordered list of indices to find an index; a HashMap for ints would use a hash of the integer (maybe even the integer itself) modulo some number to find a bucket, and then a linear search within that bucket.
Adrian
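
To actually measure it, a rough harness along these lines could work (all sizes here are made-up examples, and a serious comparison needs JVM warm-up or a benchmarking tool; treat this only as a template, using the SparseVector class posted above):

import java.util.HashMap;
import java.util.Map;

public class SparseBench {
    public static void main(String[] args) {
        int n = 100000;         // number of non-zero entries (made up)
        int size = 10000000;    // logical vector size (made up)
        int step = size / n;

        Map<Integer, Double> map = new HashMap<Integer, Double>();
        SparseVector vec = new SparseVector(size, n);
        for (int i = 0; i < n; i++) {
            map.put(i * step, (double) i);
            vec.put(i * step, (double) i); // increasing index order: the fast path
        }

        // read back every stored index and time both structures
        long t0 = System.nanoTime();
        double sum1 = 0;
        for (int i = 0; i < n; i++) sum1 += map.get(i * step);
        long t1 = System.nanoTime();
        double sum2 = 0;
        for (int i = 0; i < n; i++) sum2 += vec.get(i * step);
        long t2 = System.nanoTime();

        System.out.printf("HashMap: %d ms, SparseVector: %d ms (sums %.1f %.1f)%n",
                (t1 - t0) / 1000000, (t2 - t1) / 1000000, sum1, sum2);
    }
}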
A: 

For a 1D sparse array, a map is normally the way to go. You only need to use a library if it's multi-dimensional.

If you compare the access time between a map and an array,

   map.get(99);
   array[99];

the map is going to be much slower. Any library would have the same issue.

Isn't that what sparse arrays are all about? You trade time for space.

ZZ Coder
OP says map is too slow, sorry but reading the Q should be the minimum before answering.
Adrian
Reading the A should be the minimum before downvoting :(
John Mill
Your answer suggests neither another data structure nor a library. The OP asked for that. If you fix this, I'll remove the downvote.
Adrian
So you won't take NO for an answer :) Slow is a relative term. It's always going to be slower compared with a regular array. My point is that any other approach will be slower. You won't find a library that compresses the array and also provides a speed gain. For speed, you use a raw array. For space, you compress it. You can't get both.
ZZ Coder
If your answer is "nothing is faster than a map", then it's false. There *are* data structures that are faster than a hash map for some use cases, and sparse vectors are one of those use cases.
Adrian
+1  A: 

How big is your dataset? Much larger than Integer.MAX_VALUE? The problem is that a HashMap is backed by an array. Collisions will slow performance. Perhaps it's not the mechanism of the hashmap that is too slow, but the fact that you have many collisions. Perhaps if you partitioned your data first (e.g., using another hash function) and then stored each partition in its own hashmap, you'd have more luck.
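
A sketch of that partitioning idea (the class name and all details are hypothetical; here the "other hash function" is simply the index modulo the shard count, and whether this actually helps depends on your data, so measure it):

import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: spread entries over several maps to shorten collision chains. */
public class PartitionedSparseVector {

    private final Map<Integer, Double>[] shards;

    @SuppressWarnings("unchecked")
    public PartitionedSparseVector(int shardCount) {
        shards = new HashMap[shardCount];
        for (int i = 0; i < shardCount; i++) shards[i] = new HashMap<Integer, Double>();
    }

    private Map<Integer, Double> shardFor(int index) {
        return shards[Math.floorMod(index, shards.length)]; // the secondary "hash"
    }

    public void put(int index, double value) {
        shardFor(index).put(index, value);
    }

    public double get(int index) {
        Double v = shardFor(index).get(index);
        return v == null ? 0.0 : v; // unset indices read as zero
    }
}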

z5h