views:

3378

answers:

7

I'm working on a project, written in Java, which requires that I build a very large 2-D sparse array. Very sparse, if that makes a difference. Anyway: the most crucial aspect for this application is efficiency in terms of time (assume loads of memory, though not nearly so unlimited as to allow me to use a standard 2-D array -- the key range is in the billions in both dimensions).

Out of the kajillion cells in the array, there will be several hundred thousand cells which contain an object. I need to be able to modify cell contents VERY quickly.

Anyway: Does anyone know a particularly good library for this purpose? It would have to be Berkeley, LGPL or similar license (no GPL, as the product can't be entirely open-sourced). Or if there's just a very simple way to make a homebrew sparse array object, that'd be fine too.

I'm considering MTJ, but haven't heard any opinions on its quality.

Thanks!! -Dan

+4  A: 

Here is a paper you may be interested in which talks about data structures for matrix computations, including sparse arrays:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.7544

You can download the paper as PDF or PS. It includes source code, too.

Marc Novakowski
nice reference, thanks.
Steve B.
+2  A: 

This seems simple.

You could use a binary tree of the data, using row*maxColumns+column as an index.

To find an item, you simply calculate row*maxColumns+column and search the tree for it; if it's not there, you can return null. Lookup is O(log n), where n is the number of cells which contain an object.
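A minimal sketch of this flattened-key approach, using the JDK's TreeMap (a red-black tree) as the binary tree. The class name and the maxColumns bound are illustrative; note the key must be computed in long arithmetic, and with both dimensions in the billions the product can approach the limits of long:

```java
import java.util.TreeMap;

public class FlatKeySparseArray {
    private final long maxColumns;                          // assumed column bound
    private final TreeMap<Long, Object> tree = new TreeMap<Long, Object>();

    public FlatKeySparseArray(long maxColumns) {
        this.maxColumns = maxColumns;
    }

    private long key(long row, long column) {
        // long arithmetic throughout; row * maxColumns must not overflow
        return row * maxColumns + column;
    }

    public void put(long row, long column, Object value) {
        tree.put(key(row, column), value);
    }

    public Object get(long row, long column) {
        return tree.get(key(row, column));                  // null when the cell is empty
    }
}
```

One caveat: every access boxes the key into a Long, which a primitive-keyed structure would avoid.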

Osama ALASSIRY
+1  A: 

Not the fastest runtime solution probably, but the fastest I could come up with that seems to work. Create an Index class and use it as a key for a SortedMap, like:

 SortedMap<Index, Object> entries = new TreeMap<Index, Object>();
 entries.put(new Index(1, 4), "1-4");
 entries.put(new Index(5555555555L, 767777777777L), "5555555555L-767777777777L");
 System.out.println(entries.size());
 System.out.println(entries.get(new Index(1, 4)));
 System.out.println(entries.get(new Index(5555555555L, 767777777777L)));

My Index class looks like this (with some help from the Eclipse code generator).

public static class Index implements Comparable<Index>
{
 private long x;
 private long y;

 public Index(long x, long y)
 {
  super();
  this.x = x;
  this.y = y;
 }

 public int compareTo(Index index)
 {
  // Order by x first, then y; the original version compared in the
  // opposite direction, giving a descending order. Comparing with <
  // rather than subtraction avoids long overflow.
  if (x != index.x)
  {
   return x < index.x ? -1 : 1;
  }
  if (y != index.y)
  {
   return y < index.y ? -1 : 1;
  }
  return 0;
 }

 public int hashCode()
 {
  final int PRIME = 31;
  int result = 1;
  result = PRIME * result + (int) (x ^ (x >>> 32));
  result = PRIME * result + (int) (y ^ (y >>> 32));
  return result;
 }

 public boolean equals(Object obj)
 {
  if (this == obj)
   return true;
  if (obj == null)
   return false;
  if (getClass() != obj.getClass())
   return false;
  final Index other = (Index) obj;
  if (x != other.x)
   return false;
  if (y != other.y)
   return false;
  return true;
 }

 public long getX()
 {
  return x;
 }

 public long getY()
 {
  return y;
 }
}
eljenso
A: 

You could just use a nested map, although if you need to do matrix computations on it that might not be the best option:

 Map<Integer, Map<Integer, Object>> matrix;

Maybe instead of Object, use some tuple for the actual data so you can work with it more easily after extraction, something like:

class Tuple<T extends YourDataObject> {
  public final int x;
  public final int y;
  public final T object;

  Tuple(int x, int y, T object) {
    this.x = x;
    this.y = y;
    this.object = object;
  }
}

class Matrix<T extends YourDataObject> {
  private final Map<Integer, Map<Integer, Tuple<T>>> data = new HashMap<Integer, Map<Integer, Tuple<T>>>();

  void add(int x, int y, T object) {
    data.get(x).put(y, new Tuple<T>(x, y, object));
  }
}


//etc

null check etc omitted for brevity

pvgoddijn
Prefer long instead of int for a range "in the billions". Nesting Maps when "there will be several hundred thousand cells which contain an object" is a lot of overhead. Storing the coordinates in Tuple is redundant and increases maintenance for operations. No abstraction for the index: not flexible/extensible.
eljenso
+2  A: 

Maybe Colt is of help. It provides a sparse matrix implementation.

+2  A: 

I've just used Trove, which provided much better performance than Colt when using its int->int map (used to implement a sparse matrix).
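For comparison, here is a plain-JDK sketch of the single-map idea behind this: pack the two 32-bit coordinates into one long key so each access is a single lookup. A primitive-keyed map such as Trove's TLongObjectHashMap would additionally avoid the Long boxing shown here. The class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class PackedKeySparseArray {
    private final Map<Long, Object> cells = new HashMap<Long, Object>();

    // Pack row into the high 32 bits and col into the low 32 bits.
    // This only works while both coordinates fit in an int.
    private static long key(int row, int col) {
        return ((long) row << 32) | (col & 0xFFFFFFFFL);
    }

    public void put(int row, int col, Object value) {
        cells.put(key(row, col), value);
    }

    public Object get(int row, int col) {
        return cells.get(key(row, col));  // null when the cell is empty
    }
}
```

Note that int coordinates top out around 2.1 billion, so a range "in the billions" may already exceed this packing.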

Federico
+1  A: 

Sparse arrays built with hashmaps are very inefficient for frequently read data. The most efficient implementations use a trie, which allows access to a single vector where segments are distributed.

A trie can determine whether an element is present in the table by performing only TWO read-only array indexings to get the effective position where an element is stored, or to know if it is absent from the underlying store.

It can also provide a default position in the backing store for the default value of the sparse array, so that you don't need ANY test on the returned index, because the trie guarantees that every possible source index will map at least to the default position in the backing store (where you'll frequently store a zero, an empty string, or a null object).

There exist implementations that support fast-updatable tries, with an optional "compact()" operation to optimize the size of the backing store at the end of multiple operations. Tries are MUCH faster than hashmaps, because they don't need any complex hashing function, and don't need to handle collisions for reads. (With hashmaps, you have collisions BOTH for reading and for writing; this requires a loop to skip to the next candidate position, and a test on each of them to compare the effective source index.)

In addition, Java HashMaps can only index on Objects, and creating an Integer object for each hashed source index (this object creation is needed for every read, not just writes) is costly in terms of memory operations, as it stresses the garbage collector.

I really hoped that the JRE included an IntegerTrieMap<Object> as the default implementation for the slow HashMap<Integer, Object>, or a LongTrieMap<Object> as the default implementation for the even slower HashMap<Long, Object>... But this is still not the case.
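The two-indexing idea can be sketched in plain Java as a small page table; PagedSparseArray and its constants are illustrative, not from any library. All absent pages share one default page filled with the default value, so a read needs no null test:

```java
public class PagedSparseArray {
    private static final int PAGE_BITS = 8;              // 256 entries per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final Object[] defaultPage = new Object[PAGE_SIZE]; // all nulls
    private final Object[][] pages;

    public PagedSparseArray(int capacity) {
        pages = new Object[(capacity + PAGE_SIZE - 1) / PAGE_SIZE][];
        for (int i = 0; i < pages.length; i++)
            pages[i] = defaultPage;                      // shared default page
    }

    public Object get(int index) {
        // exactly two array indexings, no branches, no boxing
        return pages[index >>> PAGE_BITS][index & PAGE_MASK];
    }

    public void put(int index, Object value) {
        int p = index >>> PAGE_BITS;
        if (pages[p] == defaultPage)
            pages[p] = new Object[PAGE_SIZE];            // materialize page on first write
        pages[p][index & PAGE_MASK] = value;
    }
}
```

This is only a one-level version of the idea; for key ranges in the billions you would nest another level of page table rather than allocate the top array densely.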

verdy_p