tags:

views:

539

answers:

11

I try to calculate a problem with 20000 points, so there is a distance matrix with 20000*20000 elements, how can I store this matrix in C++? I use Visual Studio 2008, on a computer with 4 GB of RAM. Any suggestion will be appreciated.

+7  A: 

Avoid the brute force approach you're contemplating and try to envision a solution that involves populating a single 20000 element list, rather than an array that covers every possible permutation.

For starters, consider the following simplistic approach which you may be able to improve upon, given the specifics of your problem:

int bestResult = -1;  // some invalid value
int bestInner;
int bestOuter;

for ( int outer = 0; outer < MAX; outer++ )
{
    for ( int inner = 0; inner < MAX; inner++ )
    {
        int candidateResult = SomeFunction( list[ inner ], list[ outer ] );

        if ( candidateResult > bestResult )
        {
            bestResult = candidateResult;
            bestInner = inner;
            bestOuter = outer;
        }
    }
}
Bob Kaufman
+5  A: 

You can represent your matrix as a single large array. Whether it's a good idea to do so is for you to determine.

If you need four bytes per cell, your matrix is only 4*20000*20000, that is, 1.6GB. Any platform should give you that much memory for a single process. Windows gives you 2GiB by default for 32-bit processes -- and you can play with the linker options if you need more. All 32-bit unices I tried gave you more than 2.5GiB.

Pascal Cuoq
But remember that 1.6 GB should be *continuous*, on Windows there are so many other things loaded into virtual address space and finding this much of continuous memory will be nearly impossible.
Naveen
@Naveen Works for me (I just tried, but I am not using Windows, writing to each page to make sure the memory is not over-committed). I think the idea of the 2Gb/2Gb split is that you leave a lot of address space to the system, but in exchange what you get is indeed continuous.
Pascal Cuoq
@Pascal: It might work on some systems, but I'll be surprised if it works everywhere. I was checking my process's VM and found that one of the Windows dll (don't remember which one) was smack in the middle of the virtual address space nicely dividing it two halves.. I heard MS had rebased it but still everybody may not be having that patch.In that case new will fail.
Naveen
@Naveen That's good to know, thanks...
Pascal Cuoq
1.6 GB is NOT continuous. It appears continuous to your process, but there in nothing stopping windows from allocation those memory pages all over the physical ram. This is why we have virtual memory instruction hard coded into our processors.
caspin
@Caspin Thanks for commenting, but there was absolutely no ambiguity on the fact that Naveen and I were discussing the *virtual address space*, and continuity in the *virtual address space*. There was a hint in the fact that Naveen used the words "virtual address space" twice. Virtual addresses can be allocated anywhere in physical RAM, but there *are* constraints on the virtual address space too. DLLs have to be present in virtual space, for starters, and several systems like to pre-allocate these in a fixed place. Virtual -> physical mappings do not help with fragmentation, either.
Pascal Cuoq
+1  A: 

Your computer should be able to handle 1.6 GB of data (assuming 32bit)

size_t n = 20000;
typedef long dist_type; // 32 bit
std::vector <dist_type> matrix(n*n);

And then use:

dist_type value = matrix[n * y + x];
Viktor Sehr
But it's very unlikely to have 1.6Gb in one piece.
Martin Beckett
+1  A: 

You can (by using small datatypes), but you probably don't want to.

You are better off using a quad tree (if you need to find the nearest N matches), or a grid of lists (if you want to find all points within R).

In physics, you can just approximate distant points with a field, or a representative amalgamation of points.

There's always a solution. What's your problem?

wisty
+4  A: 

Is there a reason you need the matrix in memory?

Depending on the complexity of calculations you need to perform you could simply use a function that calculates your distances on the fly. This could even be faster than precalculating ever single distance value if you would only use some of them.

Frank Bollack
+1  A: 

Man you should avoid the n² problem...

Put your 20 000 points into a voxel grid.

Finding closest pair of points should then be something like n log n.

Julio
+7  A: 

A sparse matrix may be what you looking for. Many problems don't have values in every cell of a matrix. SparseLib++ is a library which allows for effecient matrix operations.

brianegge
+1 for the sparse matrix.Boost::ublas can also be a possibility.
Tristram Gräbener
+1  A: 

Math libraries can be useful:
Armadillo C++ Library
libmat

lsalamon
+3  A: 

Without more references to the problem at hand (and the use of the matrix), you are going to get a lot of answers... so indulge me.

The classic approach here would be to go with a sparse matrix, however the default value would probably be something like 'not computed', which would require special handling.

Perhaps that you could use a caching approach instead.

Apparently I would say that you would like to avoid recomputing the distances on and on and so you'd like to keep them in this huge matrix. However note that you can always recompute them. In general, I would say that trying to store values that can be recomputed for a speed-off is really what caching is about.

So i would suggest using a distance class that abstract the caching for you.

The basic idea is simple:

  • When you request a distance, either you already computed it, or not
  • If computed, return it immediately
  • If not computed, compute it and store it
  • If the cache is full, delete some elements to make room

The practice is a bit more complicated, of course, especially for efficiency and because of the limited size which requires an algorithm for the selection of those elements etc...

So before we delve in the technical implementation, just tell me if that's what you're looking for.

Matthieu M.
A: 

As stated by other answers, you should try hard to either use sparse matrix or come up with a different algorithm that doesn't need to have all the data at once in the matrix.

If you really need it, maybe a library like stxxl might be useful, since it's specially designed for huge datasets. It handles the swapping for you almost transparently.

Steve Schnepp
A: 

Thanks a lot for your answers. What I am doing is to solve a vehicle routing problem with about 20000 nodes. I need one matrix for distance, one matrix for a neighbor list (for each node, list all other nodes according to the distance). This list will be used very often to find who can be some candidates. I guess sometimes distances matrix can be ommited if we can calculate when we need. But the neighbor list is not convenient to create every time. the list data type could be int.

To mgb:

how much can a 64 bit windows system help this situation?

Jackie
Are you trying to solve the Travelling salesman problem ?(http://en.wikipedia.org/wiki/Travelling_salesman_problem)This is NP-Complete so good luck !! :)
Julio