views:

2555

answers:

10

Hi everyone, I am working on a project that requires the manipulation of enourmous matrices, particularly pyramidal summation for a copula calculation. In short, I need to keep track of a relatively small number of values (usually a value of 1, and in rare cases more than 1) in a sea of zeros in the matrix (multidimensional array). A sparse array allows the user to store a small number of values, and assume all undefined records to be a preset value. Since is is not physically possibly to store all values in memory (greater in number than the number of particles in the universe :p ), I need to store only the few non-zero elements. This could be several million entries. I am currently working on a system that uses a binary search tree (b-tree) to store entries. Does anyone know of a better system? Thanks in advance.

EDIT: Speed is a huge priority.

EDIT2 : I like that solution. How would I go about dynamically choosing the number of variables in the class at runtime? [edit by MH: good question, updated in the answer]

+8  A: 

For C++, a map works well. Several million objects won't be a problem. 10 million items took about 4.4 seconds and about 57 meg on my computer.

#include <stdio.h>
#include <stdlib.h>
#include <map>

class triple {
public:
int x;
int y;
int z;
bool operator<(const triple &other) const {
return (x < other.x && y < other.y && z < other.z);
}
};

int main(int, char**)
{
std::map<triple,int> data;
triple point;
int i;

for (i = 0; i < 10000000; ++i) {
point.x = rand();
point.y = rand();
point.z = rand();
//printf("%d %d %d %d\n", i, point.x, point.y, point.z);
data[point] = i;
}
return 0;
}

For multiple variables:

The easiest way is to make the index a string, and then make the index strings look like "23,55" (2 vars) or "34,45,56" (3 vars):

std::map data<string,int> data;
char ix[100];

sprintf(ix, "%d,%d", x, y); // 2 vars
data[ix] = i;

sprintf(ix, "%d,%d,%d", x, y, z); // 3 vars
data[ix] = i;
Mark Harrison
+1  A: 

Hash tables have a fast insertion and look up. You could write a simple hash function since you know you'd be dealing with only integer pairs as the keys.

nlucaroni
+1  A: 

Thanks. For multiple variables:

The easiest way is to make the index a string, and then make the index strings look like "23,55" (2 vars) or "34,45,56" (3 vars):

std::map data<string,int> data;
char ix[100];

sprintf(ix, "%d,%d", x, y); // 2 vars
data[ix] = i;

sprintf(ix, "%d,%d,%d", x, y, z); // 3 vars
data[ix] = i;
Mark Harrison
A: 

Thanks a bunch Mark, thats great

Ed
+1  A: 

The best way to implement sparse matrices is to not to implement them - atleast not on your own. I would suggest to BLAS (which I think is a part of LAPACK) which can handle really huge matrices.

JSN
LAPACK is a library for dense matrices. The standard BLAS is also for dense matrices. There is a Sparse BLAS package (through NIST) but this is different then the standard BLAS.
codehippo
+1  A: 

Small detail in the index comparison. You need to do a lexicographical compare, otherwise:

a= (1, 2, 1); b= (2, 1, 2); (a<b) == (b<a) is true, but b!=a

Edit: So the comparison should probably be:

return lhs.x<rhs.x ? true : lhs.x==rhs.x ? lhs.y<rhs.y ? true : lhs.y==rhs.y ? lhs.z<rhs.z : false : false

PS apologies for the formatting.

MSN

Mat Noguchi
+4  A: 

Boost has a templated implementation of BLAS called uBLAS that contains a sparse matrix.

http://www.boost.org/doc/libs/1_36_0/libs/numeric/ublas/doc/index.htm

Nic Strong
+5  A: 

Just as an advise: the method using strings as indices is actually very slow. A much more efficient but otherwise equivalent solution would be to use vectors/arrays. There's absolutely no need to write the indices in a string.

typedef vector<size_t> index_t;

struct index_cmp_t : binary_function<index_t, index_t, bool> {
    bool operator ()(index_t const& a, index_t const& b) const {
        for (index_t::size_type i = 0; i < a.size(); ++i)
            if (a[i] != b[i])
                return a[i] < b[i];
        return false;
    }
};

map<index_t, int, index_cmp_t> data;
index_t i(dims);
i[0] = 1;
i[1] = 2;
// … etc.
data[i] = 42;

However, using a map isn't actually very efficient because of the implementation in terms of a balanced binary search tree. Much better performing data structures in this case would be a (randomized) hash table.

Konrad Rudolph
+1  A: 

You want SparseLib++ (http://math.nist.gov/sparselib++/) or Pardiso(http://www.pardiso-project.org/manual/index.html). Pardiso is C so you'll have to create some wrapper C++ code. High performance matrix libraries are one of those things that seem simple to implement but aren't, so it's best to go with something created by teams of scientists. Also, look into "column compressed storage"(using arrays) for one of the most efficient ways to store large matrices in memory.

korch
A: 

Since only values with [a][b][c]...[w][x][y][z] are of consequence, we only store the indice themselves, not the value 1 which is just about everywhere - always the same + no way to hash it. Noting that the curse of dimensionality is present, suggest go with some established tool NIST or Boost, at least read the sources for that to circumvent needless blunder.

If the work needs to capture the temporal dependence distributions and parametric tendencies of unknown data sets, then a Map or B-Tree with uni-valued root is probably not practical. We can store only the indice themselves, hashed if ordering ( sensibility for presentation ) can subordinate to reduction of time domain at run-time, for all 1 values. Since non-zero values other than one are few, an obvious candidate for those is whatever data-structure you can find readily and understand. If the data set is truly vast-universe sized I suggest some sort of sliding window that manages file / disk / persistent-io yourself, moving portions of the data into scope as need be. ( writing code that you can understand ) If you are under commitment to provide actual solution to a working group, failure to do so leaves you at the mercy of consumer grade operating systems that have the sole goal of taking your lunch away from you.

Nicholas Jordan