I have a database table with a large number of rows and one numeric column, and I want to represent this data in memory. I could just use one big integer array and this would be very fast, but the number of rows could be too large for this.

Most of the rows (more than 99%) have a value of zero. Is there an effective data structure I could use that would only allocate memory for rows with non-zero values and would be nearly as fast as an array?

Update: as an example, one thing I tried was a Hashtable, reading the original table and adding any non-zero values, keyed by the row number in the original table. I got the value with a function that returned 0 if the requested index wasn't found, or else the value in the Hashtable. This works but is slow as dirt compared to a regular array - I might not be doing it right.

Update 2: here is sample code.

private Hashtable _rowStates = new Hashtable(); // System.Collections
private void SetRowState(int rowIndex, int state)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        if (state == 0)
        {
            _rowStates.Remove(rowIndex);
        }
        else
        {
            _rowStates[rowIndex] = state;
        }
    }
    else
    {
        if (state != 0) 
        {
            _rowStates.Add(rowIndex, state);
        }
    }
}
private int GetRowState(int rowIndex)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        return (int)_rowStates[rowIndex]; // unboxing cast - Hashtable stores values as object
    }
    else
    {
        return 0; 
    }
}
+3  A: 

This is an example of a sparse data structure and there are multiple ways to implement such sparse arrays (or matrices) - it all depends on how you intend to use it. Two possible strategies are:

  • Store only the non-zero values. For each element different from zero, store an (index, value) pair; all other values are known to be zero by default. You would also need to store the total number of elements.
  • Compress consecutive zero values. Store (count, value) pairs. For example, if you have 12 zeros in a row followed by a 200 and another 22 zeros, store (12, 0), (1, 200), (22, 0) - see the sketch after this list.
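A minimal sketch of the second strategy (the class and member names are illustrative, not from the answer):

using System;
using System.Collections.Generic;

public class RunLengthColumn
{
    // Each pair is (count, value); long runs of zeros collapse to one entry.
    private readonly List<KeyValuePair<int, int>> _runs =
        new List<KeyValuePair<int, int>>();

    public void AddRun(int count, int value)
    {
        _runs.Add(new KeyValuePair<int, int>(count, value));
    }

    // Walk the runs until the cumulative count passes the requested index.
    public int GetValue(int rowIndex)
    {
        int position = 0;
        foreach (KeyValuePair<int, int> run in _runs)
        {
            position += run.Key;   // run.Key is the count
            if (rowIndex < position)
                return run.Value;
        }
        throw new ArgumentOutOfRangeException("rowIndex");
    }
}

The example above would be loaded with AddRun(12, 0), AddRun(1, 200), AddRun(22, 0). Lookups here are linear in the number of runs; storing cumulative counts and binary-searching them would make lookups logarithmic.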
Adam Byrtek
+1  A: 

How exactly you want to implement it depends on what your requirements are - it's a tradeoff between memory and speed. A plain integer array is the fastest, with constant-time lookups.

Using a hash-based collection such as Hashtable or Dictionary (Hashtable seems to be slower but thread-safe - as others have pointed out) will give you very low memory usage for a sparse data structure like yours, but lookups can be somewhat more expensive. You store a key-value pair for each index with a non-zero value.

You can use ContainsKey to find out whether the key exists, but it is significantly faster to use TryGetValue, which performs the check and fetches the data in one go. For dense data it can even be worth catching the exception for missing elements, as that only incurs a cost in the exceptional case rather than on every lookup.
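To illustrate the difference, a sketch assuming a Dictionary<int, int> (the class and method names are made up for the comparison):

using System.Collections.Generic;

public class LookupSketch
{
    private readonly Dictionary<int, int> _values = new Dictionary<int, int>();

    // Two hash lookups: one for ContainsKey, another for the indexer.
    public int GetTwoLookups(int rowIndex)
    {
        return _values.ContainsKey(rowIndex) ? _values[rowIndex] : 0;
    }

    // One hash lookup: TryGetValue checks and fetches together. On a
    // miss it leaves value at default(int), which is the 0 we want.
    public int GetOneLookup(int rowIndex)
    {
        int value;
        _values.TryGetValue(rowIndex, out value);
        return value;
    }
}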

Edited again as I got myself confused - that'll teach me to post when I ought to be sleeping.

Morten Christiansen
The key could be that Dictionary is not thread-safe. Try swapping it in for your Hashtable.
Brabster
ContainsKey traverses the entire collection? That doesn't sound right.
MusiGenesis
Heh, I got myself a bit confused... but I did find out that you should actually use TryGetValue instead as it is a great deal faster.
Morten Christiansen
A: 

Create an integer array holding the non-zero values, and a bit array indicating whether each row contains a non-zero value.

You can then find the position of the necessary element in the first array by summing the bits in the second array from 0 up to the row index position.
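A sketch of this two-array idea (names are illustrative; the linear bit count makes lookups O(n), so a real implementation would keep per-block bit counts as an index):

using System.Collections;

public class BitmaskColumn
{
    private readonly BitArray _hasValue;  // one bit per row
    private readonly int[] _values;       // non-zero values in row order

    public BitmaskColumn(BitArray hasValue, int[] values)
    {
        _hasValue = hasValue;
        _values = values;
    }

    public int GetValue(int rowIndex)
    {
        if (!_hasValue[rowIndex])
            return 0;

        int rank = 0;  // how many set bits precede rowIndex
        for (int i = 0; i < rowIndex; i++)
            if (_hasValue[i])
                rank++;

        return _values[rank];  // the rank-th non-zero value
    }
}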

Din
+2  A: 

I would expect that the map/dictionary/hashtable of the non-zero values should be a fast and economical solution.

In Java, using the Hashtable class would introduce locking because it is supposed to be thread-safe. Perhaps something similar has slowed down your implementation.

--- update: a little Google-fu suggests that the C# Hashtable does incur an overhead for thread safety. Try a Dictionary instead.

Brabster
Thanks for the Google fu. Ironically, I changed my collection from Dictionary to Hashtable because I thought that would speed things up. Good thing I was doing something where the speed problem was obvious.
MusiGenesis
Glad I helped you out :)
Brabster
I had to junk the dictionary and go back to the simple array, however. I couldn't get it to work anywhere near fast enough.
MusiGenesis
It's hard to see how using a Hashtable would incur overhead that a Dictionary does not, since the Dictionary's underlying collection *is* a Hashtable.
Robert Rossney
@Robert: if you benchmark the above sample and then change it to use a Dictionary, you'll see that it speeds up significantly (unless I screwed up the benchmarking).
MusiGenesis
I think they both use an underlying hashtable, but both Hashtable and Dictionary wrap it, the former in a way that makes it thread-safe but slower.
MusiGenesis
A: 

I am not sure about the efficiency of this solution, but you can try it. It depends on the scenario you will use it in, so I will describe two that I have in mind. The first: if you have just one field of integers, you can simply use a generic list of integers:

List<int> myList = new List<int>();

The second is almost the same, but you create a list of your own type. For example, if you have two fields, a count and a non-zero value, you can create a class with those two properties and then store your information in a list of that class. You could also try a generic linked list. The code for the second solution could look like this:

public class MyDbFields
{
    public MyDbFields(int count, int nonzero)
    {
        Count = count;
        NonZero = nonzero;
    }

    public int Count { get; set; }
    public int NonZero { get; set; }
}

Then you can create a list like this:

List<MyDbFields> fields_list = new List<MyDbFields>();

and then fill it with data:

fields_list.Add(new MyDbFields(100, 11));

I am not sure if this will fully help you solve your problem, but it's just my suggestion.

milot
A: 

If I understand correctly, you cannot just select the non-zero rows, because for each row index (aka PK value) your data structure has to be able to report not only the value, but also whether or not it is there at all. So assuming 0 when an index is not found in your data structure might not be a good idea.

Just to make sure - exactly how many rows are we talking about here? Millions? A million integers would take up only 4MB of RAM as an array. Not much, really. I guess it must be at least 100'000'000 rows.

Basically I would suggest a sorted array of integer-pairs for storing non-zero values. The first element in each pair would be the PK value, and this is what the array would be sorted by. The second element would be the value. You can make a DB select that returns only these non-zero values, of course. Since the array will be sorted, you'll be able to use binary search to find your values.
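A minimal sketch of that layout, using parallel sorted arrays (the field names are made up):

private int[] _keys;    // sorted PK values that have non-zero data
private int[] _values;  // _values[i] is the value for _keys[i]

private int GetValue(int pk)
{
    // Array.BinarySearch returns a non-negative index on a hit and a
    // negative number (the complement of the insertion point) on a miss.
    int pos = Array.BinarySearch(_keys, pk);
    return pos >= 0 ? _values[pos] : 0;
}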

If there are no "holes" in the PK values, then the only thing you would need besides this would be the minimum and maximum PK values so that you can determine whether a given index belongs to your data set.

If there are unused PK values between the used ones, then you need some other mechanism to determine which PK values are valid. Perhaps a bitmask or another array of valid (or invalid, whichever are fewer) PK values.

If you choose the bitmask way, there is another idea. Use two bits for every PK value. First bit will show if the PK value is valid or not. Second bit will show if it is zero or not. Store all non-zero values in another array. This however will have the drawback that you won't know which array item corresponds to which bitmask entry. You'd have to count all the way from the start to find out. This can be mitigated with some indexes. Say, for every 1000 entries in the value array you store another integer which tells you where this entry is in the bitmask.

Vilx-
A couple million is what I'm targeting. 4-8 MB isn't much on a PC, but it is significant in Windows Mobile (I didn't mention WM in the question because it's not really relevant here) which has a 32MB per process limit.
MusiGenesis
+1  A: 

You're paying a boxing penalty by using Hashtable. Try switching to a Dictionary<int, int>. Also, how many rows are we talking about - and how fast do you need it?
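A sketch of the question's snippet switched over to Dictionary<int, int> (from System.Collections.Generic), with TryGetValue replacing the ContainsKey branches:

private Dictionary<int, int> _rowStates = new Dictionary<int, int>();

private void SetRowState(int rowIndex, int state)
{
    if (state == 0)
        _rowStates.Remove(rowIndex);    // no-op if the key is absent
    else
        _rowStates[rowIndex] = state;   // indexer adds or overwrites
}

private int GetRowState(int rowIndex)
{
    int state;
    _rowStates.TryGetValue(rowIndex, out state);
    return state;                       // 0 when the key is absent
}

Because the values are stored as ints rather than boxed objects, this avoids an allocation on every write and an unboxing cast on every read.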

Mark Brackett
A: 

Perhaps you are looking in the wrong area - all you are storing for each value is the row number of the database row, which suggests that perhaps you are just using this to retrieve the row? Why not try indexing your table on the numeric column? This will provide lightning-fast access to the table rows for any given numeric value, which appears to be the ultimate objective here. If it is still too slow, you can move the index itself into memory, etc. My point is that your database may solve this problem more elegantly than you can.

Ewan Makepeace
Sorry, there isn't really a database here - I put that in trying to explain the problem. This is a pure in-memory data structure problem.
MusiGenesis