tags: 
views: 33
answers: 2

Hi All,

I have a DataTable which stores very sparse data, something like:

   P1 P2 P3 P4 P5 ...
J1 1  1
J2    1  1
J3             1
.
.
.

The number of rows and columns might reach over 10^8.

How can I store this data in a more efficient way?

Thanks.

A: 

First, get rid of the DataTable for data at this scale. Its memory overhead is far too large here.

If your data are always 0/1, the most memory-efficient representation is probably a bit-mask.
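As a minimal sketch of that idea (the BitRow class and its members are illustrative, not part of the original answer), one row can be backed by a System.Collections.BitArray:

using System.Collections;

class BitRow {
    // One bit per column; BitArray packs 32 cells into each underlying int.
    private readonly BitArray _bits;

    public BitRow(int columnCount) {
        _bits = new BitArray(columnCount);   // all bits start as 0 (false)
    }

    // true represents a 1 in the table, false a 0.
    public bool this[int column] {
        get { return _bits[column]; }
        set { _bits[column] = value; }
    }
}

Note that this still costs one bit per column per row (roughly 12 MB for 10^8 columns), which is the O(n) limitation pst points out in the comment below.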

If your data are not only 0/1, create a structure that abstracts your columns and stores only the populated cells.

Here's a conceptual prototype of such a data structure.

class MyData {
    // Sparse row: only the populated columns and their values are stored,
    // in two parallel arrays.
    private readonly int[] _columns;
    private readonly object[] _data;

    public MyData(int[] columns, object[] data) {
        _columns = columns;
        _data = data;
    }

    // Returns the value for the given column, or null for an empty cell.
    public object this[int column] {
        get {
            int index = IndexOf(column);
            return index != -1 ? _data[index] : null;
        }
    }

    // Linear scan; if _columns is kept sorted this could use a binary search.
    private int IndexOf(int column) {
        for (int i = 0; i < _columns.Length; i++)
            if (_columns[i] == column)
                return i;
        return -1;
    }
}

You could additionally save the memory used by _columns by applying the flyweight pattern.
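For illustration (a sketch only, assuming many rows share the same layout of populated columns), the flyweight here simply means that rows with an identical layout reference one shared int[] instead of each allocating its own copy:

// One shared array describes the layout "P1 and P2 are populated".
int[] sharedColumns = new int[] { 0, 1 };

MyData row1 = new MyData(sharedColumns, new object[] { 1, 1 });
MyData row2 = new MyData(sharedColumns, new object[] { 1, 1 });
// Both rows reference the same column array; only the value arrays differ.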

Hope this helps

Florian Reischl
Bit-masks only remain the better choice down to roughly 12% density (ignoring the book-keeping overhead of other approaches and dealing only with booleans). That is, they are good for compacting but are, by themselves, still O(n) in storage, where n is the number of columns.
pst
A: 

If your disk file system supports sparse files, you can create an empty file, mark it as sparse, and then resize it to rows * columns * datasize.

Then it's a matter of accessing the data by [row][column], where the offset can be calculated with:

offset = ((row - 1) * columns.length + (column - 1)) * datasize
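A rough sketch of that access pattern (using zero-based row/column indices so the -1 adjustments disappear, and a fixed 4-byte cell; the class name is made up, and the platform-specific step of actually marking the file sparse, e.g. FSCTL_SET_SPARSE on NTFS, is not shown):

using System.IO;

class SparseMatrixFile {
    private readonly FileStream _file;
    private readonly long _columnCount;
    private const int DataSize = sizeof(int);   // one 32-bit cell per matrix entry

    public SparseMatrixFile(string path, long rowCount, long columnCount) {
        _columnCount = columnCount;
        _file = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        // Reserve the full logical size; a sparse file only allocates pages that are written.
        _file.SetLength(rowCount * columnCount * DataSize);
    }

    private long Offset(long row, long column) {
        return (row * _columnCount + column) * DataSize;
    }

    public void Write(long row, long column, int value) {
        _file.Seek(Offset(row, column), SeekOrigin.Begin);
        byte[] bytes = System.BitConverter.GetBytes(value);
        _file.Write(bytes, 0, bytes.Length);
    }

    public int Read(long row, long column) {
        _file.Seek(Offset(row, column), SeekOrigin.Begin);
        byte[] bytes = new byte[DataSize];
        _file.Read(bytes, 0, bytes.Length);
        return System.BitConverter.ToInt32(bytes, 0);
    }
}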

There is some allocation overhead with sparse files as well, since the file system typically allocates in blocks of 16-64 KB, but depending on how your data clusters it might work very well.

Mikael Svenson