views:

197

answers:

3

Hello,

I have a sparse matrix that is not symmetric I.E. the sparsity is somewhat random, and I can't count on all the values being a set distance away from the diagonal.

However, it is still sparse, and I want to reduce the storage requirement on the matrix. Therefore, I am trying to figure out how to store each row starting at the first non-zero, in order, until I get to the last non-zero.

That is, is the first non-zero of row m occurs at column 2, and the last non-zero is at column 89, i want to store in A[m] rows 2-> 89.

Since each row does not have the same number of non-zeros, I will make all the rows of A have the same number of elements, and pad zeros to the end of the row for rows that have a smaller number of non-zero elements.

How do I do this translation in C? I do not actually have an original, full matrix to just copy the values from (the original matrix is coming to me in CSR form). If I was doing this in fortran, I could just define my array to be two dimensional and just have each row be variable length by tracking the start/stop values of non-zero columns and store it like that.

I will try to demonstrate below:

This is a matrix representation of the values I know - and for each value, I know the row and column location

  [1    2    3    4                   ]
  [   5    6    7    8                ]
  [       10    11    12    13        ]
 m[   14    15    16    17       18   ]
  [         19    20    21         22 ]

Now for this one row m has the largest "span" between the first non-zero and last non-zero so my new matrix is going to be 5x[span of row m]

  [1     2     3     4          ]
  [5     6     7     8          ]
  [10    11    12    13         ]
 m[14    15    16    17       18]
  [19    20    21       22      ] 

As you can see, row m needs no zero padding since it was the longest "span" anyway

The other rows now all have row zero as the first non-zero, and maintain the spacing of zeros columns between each non-zero.

+2  A: 

An alternate approach is to use a linked structure (very efficient if the matrix very sparse, not so good as it gets more filled). I hinted at the implementation in a earlier answer.

I you are going to go with the continuous run implementation, I'm not sure that you really want/need to use equal lengths rows. Why not use a ragged array?

dmckee
I think I dont want to use a ragged array is because I want to create A with a single malloc. I will eventually be doing a sparse vector/matrix multiply and I will need it to run as efficiently as possible
Derek
In that case, your problem would appear to be one of determining the maximum row width before you can allocate the whole structure. I think that means you're going to have to make two passes over the data file.
dmckee
I updated with a bit of a graphical picture to help understand what I am trying to accomplish
Derek
What you wanted was quite clear from the beginning. But you need some info you don't have (or at least haven't told us about) in order to make the single malloc that you want. To wit: you need to know the maximum length-of-the-represented row (and the number of rows, though I've been assuming that you knew that). This means two passes over your data.
dmckee
+3  A: 

I would implement this as a ragged array, with A[n][0] always returning the element on the diagonal. A[n][1] will return the item just to the right of the diagonal, A[n][2] will return the item to the left of the diagonal, and so. Then, you just need a function that maps matrix index [i,j] to ragged array index[r][s].

This has the advantage of sparsity, and if your values stay close to the diagonal the arrays are not very long.


Alternatively, you could have this definition:

struct Row
{
    int InitialOffset;
    int NumElements;
    int[] Values;
}

Then you would have a Row[]. Retrieving a value based on matrix index would look like this:

//matrix is merely an array of rows...
int GetValue(*matrix this, int i, int j)
{
    Row CurrentRow = (*this)[i];
    if (CurrentRow.InitialOffset > j)
        return 0;
    else if (CurrentRow.InitialOffset + CurrentRow.NumElements < j)
        return 0; 
    return CurrentRow.Values[j - CurrentRow.InitialOffset]
}

My C syntax is a little hazy, but you should get the idea.


Based on your demonstration, I would recommend this:

struct Matrix
{
    int[,] Data
    int[] StartOffset;
    int[] NumberElements;
}

Used as follows...

int GetValue(*Matrix this, int i, int j)
{
    if (this.StartOffset[i] > j)
        return 0;
    else if (this.StartOffset[i] + this.NumberElements[i] < j)
        return 0; 
    return this.Data[i, j-this.StartOffset[i]];
}

Your initialization procedure would look something like this

//Data is a struct that holds row index, col index, and value
Matrix* InitMatrix (*Data values, int numVals)
{
    //loop through values to find longest row and number of rows
    //create new matrix, malloc matrix for longrow * numRows
    //malloc numrows elements for StartOffset and NumItems
    //foreach row, find min() and max()-min() of col indexs and 
    //store in StartOffset and NumItems
}

You need to do some processing, but data compression isn't cheap.

DonaldRay
Cute indexing idea. Should be very helpful if the values hew close to the diagonal.
dmckee
I had previously implemented something similar to this. However, since the element at A[n][0] is actually the same value as A[n+1][1] in my physical structure I am modelling, any update to A (as in parallel programming) was a pain, because I had to re-pack the entire matrix basically, or just re-build it completely.
Derek
Eep. Should I clear out my previous suggestions or keep them for posterity? (see latest and greatest in #3, should do what your picture says)
DonaldRay
@DonaldRay - but in the worst case you still need an array of size mxn even for very sparse matrices. Think of a matrix where only the first and last columns are populated. It is very empty, but you need to address all the elements to the left and right of the diagonal to get to the edges. Or did I misunderstand your suggestion?
ysap
@Donald: I generally leave mine around. Mostly people won't down vote a good suggestion that has been rendered obsolete by new information.
dmckee
So the problem with the suggestion you have is that I dont actually have a matrix to start from here. I am starting with a list of i,j coordinates, and the values corresponding to that i, j point. I want to end up essentially with a matrix that is fully-packed with all those zeroes, minus the leading and trailing zeros from the row.
Derek
@ysap: that would be a worst case scenario for my first suggestion. I misread the original question; I interpreted "not a set distance from diagonal" as "close to the diagonal, but varyingly so". That case is still a weakness in the second suggestion, but is not present in the third.
DonaldRay
Or to put it another way, I want to do a storage compression where I dont care about about any zeros except for the ones that are in between my non-zero values. I know those non-zero values will be relatively close to the diagonal, so there is no worst case close to what @ysap was suggesting
Derek
@DonaldRay: he was right about worst case, but dont get hung up in that detail much..*most* of my cases are going to be somewhat near the diagonal, just doesnt have to be
Derek
@Derek: updated answer to show how to initialize given a list of (i,j, data). You'll have to write the code yourself, but that should be a cakewalk :)
DonaldRay
@DonaldRay I actually arlready have that exact code written. Your idea about suing the low values for the offset is the trick. It just hit me when re-reading the answer
Derek
@DonaldRay - I still don't agree. Thinking of the MxN matrix I described above - your GetValue() implementation suggests that given a `StartOffset[i]` of 1, and assuming `j` is the required column (which for my example will be N), then your `.Data[]` array is still N elements long (`j-StartOffset[i]`).
ysap
@Ysap: Read Derek's comments to this question. You're right, but the worst case just won't appear in the target domain
DonaldRay
@DonaldRay - yes, I have read it. I just wanted to point out that your latest offer is no better than the previous one in terms of memory preservation. Having that said - it there is a priori information about a problem space, shortcuts and assumptions can always be made to ease a solution.
ysap
@DonaldRay - While at it - I'm curious, how do you add the horizontal separator lines to your answer?
ysap
@ysap: from http://stackoverflow.com/editing-help: Horizontal Rules Insert a horizontal rule <hr/> by putting three or more hyphens, asterisks, or underscores on a line by themselves.
DonaldRay
+1  A: 

Derek, you mentioned in one of the comments that you want to use a single malloc. That means that you know how many nonempty elements you have. Given this, tt is possible to store the sparse matrix in an array which holds, per element, the value of the matrix element and the "location delta" to the next element. Something like:

struct melem {
    int value; // value of data
    int offset; // offset to next element
}

struct melem matrix[num_nonempty_elements];

...

// Note: this is pseudocode!
matrix[row*COLS + col].value = a[row][col];
matrix[row*COLS + col].offset = (row*COLS + col)_[i] - (row*COLS + col)_[i-1];

EDIT: Thinking about it, this is pretty similar to the linked list approach, but requires 1 allocation. OTOH, it may require more calculation to access the required cell.

ysap