I am working on an algorithm in Python that uses arrays of int64s heavily. The arrays are typically sparse and are read from and written to constantly. I am currently using relatively large native arrays and the performance is good but the memory usage is high (as expected).

I would like to be able to have the array implementation not waste space for values that are not used and allow an index offset other than zero. As an example, if my numbers start at 1,000,000 I would like to be able to index my array starting at 1,000,000 and not be required to waste memory with a million unused values.

Array reads and writes need to be fast. Expanding into new territory can incur a small delay, but reads and writes should be O(1) if possible.

Does anybody know of a library that can do it?

Thanks!

Updated to mention int64 as the data type.

+1  A: 

You could remap a scipy sparse matrix into a sparse array - or consider using a hash table (a Python dict). As far as the offset is concerned, just wrap whatever storage class you are using and make your own insert/lookup/remove methods.
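
A minimal sketch of that wrapping idea, assuming a plain dict as the backing store (the class name, the default value, and the dict itself are illustrative choices, not a specific library API):

# Sketch only: wraps any backing store behind an index offset.
# The dict here is a stand-in for whatever storage class is used.
class OffsetStore(object):
  def __init__(self, offset, default=0):
    self.offset = offset      # smallest index you expect to use
    self.default = default    # value reported for unset slots
    self.data = {}            # backing store keyed from zero

  def __getitem__(self, idx):
    return self.data.get(idx - self.offset, self.default)

  def __setitem__(self, idx, val):
    self.data[idx - self.offset] = val

  def __delitem__(self, idx):
    self.data.pop(idx - self.offset, None)

s = OffsetStore(offset=1000000)
s[1000000] = 42          # stored internally at key 0
print(s[1000000])        # 42
print(s[1500000])        # 0 (default for an unset slot)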

zdav
I was using dictionaries but the entry manipulations were eating up tons of CPU. Now I am using tons of memory and 0 CPU. I am looking for a potentially more interesting compromise. I have considered sparse matrix usage, but that feels like it would have a lot of unneeded overhead to make it general. Thanks!
TheJacobTaylor
@TheJacobTaylor I think your problem is a little too specific; I doubt that there is a tailor-made solution out there. Remember, more code doesn't necessarily mean more mem/CPU usage.
zdav
Frankly, I am not actually sure it is even possible to get what I want. The floating initial index should be possible. The rest might not be. Currently I am looking up the value with a single addition. It is hard to create a sparse data structure that can beat that. I am willing to pay far more for reads/writes but it would have to provide significant value.
TheJacobTaylor
+4  A: 

It sounds like the blist type (documentation, download) might be just what you're looking for (disclaimer: I'm the author). It has exactly the same interface as Python's list, so there's no learning curve, but it has different performance characteristics. In particular, it can efficiently handle sparse lists in many cases. Below is an example that creates a list with 2**29 elements. It's pretty much instantaneous. Sparse lists created in this manner use O(log n) space.

>>> from blist import *
>>> x = blist([0])             # x is a blist with one element
>>> x *= 2**29                 # x is a blist with > 500 million elements
>>> x.append(5)                # append to x
>>> y = x[4:-234234]           # Take a 500 million element slice from x
>>> del x[3:1024]              # Delete a few thousand elements from x

Operations that iterate over the whole list still take O(n) time (remove, reverse, count, etc.). The documentation spells out the time-complexity for each operation, so you should be able to assess if it will meet your needs.

Daniel Stutzbach
This looks like a pretty amazing package. The one thing that I do not see is a linking of an index to each value. That was how I was storing data currently. It might actually be possible for me to flip my algorithm over on its little back and shoehorn it in. A challenge. Thanks.
TheJacobTaylor
@TheJacobTaylor: Can you elaborate on what you mean by "linking of an index to each value"?
Daniel Stutzbach
@Daniel Of course. Currently, I have been using an actual 0-based array. The index is meaningful, not just the data in the array. If you swapped the values A[x] <=> A[y], for instance, you would break my code (where x != y). If you delete a few entries before x, the index of x decreases, right? Say del A[0..2]; now to find the value that used to be located at A[x] I would have to call A[x-3], right? That is what I am referring to when I say the linking of an index to each value.
TheJacobTaylor
@TheJacobTaylor: instead of deleting entries, you could replace them with an "empty" value, i.e., instead of `del A[0:3]` do `A[0:3] = blist([None])*3`.
Daniel Stutzbach
@Daniel What does replacing them with an empty value provide? Given the nature of my problem, a tree that has a way to build missing nodes on demand and clean them up when they are no longer needed would actually work really well. Particularly if I could perform easy walks of the numbers and it allocated in batches. Thanks for your great suggestions so far.
TheJacobTaylor
@TheJacobTaylor: It lets you remove references to items without changing the indexes of later items. I can't comment on what is or isn't best for your particular problem, because your question doesn't really explain the problem you are trying to solve.
Daniel Stutzbach
+1  A: 

I don't know any Python so this is probably an un-answer:

In some languages you can simulate a sparse array by defining a function from your index space into your data space. For example (pseudo-code):

f[1000000] = 32;
f[2000000] = 51;

In some languages (the best ones) the form of an array reference (e.g. f[3]) looks just like the form of a function call (e.g. f[3]). This is, of course, because an array defines a function from an index space into a data space. It's very easy to implement higher-dimensional sparse arrays this way too.
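
A rough Python translation of the "array as a function" idea (my own sketch, not from the answer; the names and the dict-backed storage are illustrative assumptions):

# Reads go through a callable; only indexes that were ever defined
# take up memory, everything else falls back to a default.
class SparseFunction(object):
  def __init__(self, default=0):
    self.default = default
    self.data = {}

  def __call__(self, idx):        # read: f(1000000)
    return self.data.get(idx, self.default)

  def define(self, idx, val):     # write: f.define(1000000, 32)
    self.data[idx] = val

f = SparseFunction()
f.define(1000000, 32)
f.define(2000000, 51)
print(f(1000000))   # 32
print(f(3000000))   # 0 (default for indexes never defined)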

High Performance Mark
Wow... I have to think about the implications of this more. I had not considered so cleanly remapping the index function. As long as I can efficiently calculate the offset that should work. Currently I have removed all function calls from this section of the code for the sake of speed. This would require adding a function call or two back in but would allow me to drop my memory requirements by a lot without adding much complexity.
TheJacobTaylor
+1  A: 

Why not just use a dict?

sparse = dict()
sparse[100000] = 1234
sparse[123456] = 2345
Nick Johnson
The biggest issue that I have is speed. For what I am doing, the dictionary uses a disproportionate amount of CPU. I also need to walk through the data set occasionally, which would require sorting the dictionary's keys.
TheJacobTaylor
Let me be specific about "ton of CPU": right now I am using static int64 arrays, which can be read or written for the cost of an addition and a memory assignment. Dictionaries do a lot more work. I have removed function calls from much of this code to make it faster. I think the index-mapping suggestion below might be more effective than dictionaries for a while.
TheJacobTaylor
+1  A: 

Another option - at least if you're willing to implement one yourself - is a Page table. This is commonly used in virtual memory systems to map virtual addresses to physical addresses, and it works best if your address space is sparsely populated, and used addresses are clustered. If used addresses are randomly distributed, this will be less effective.

The basic approach of a page table is the same as a Trie - recursive subdivision. A page table has some fixed number of levels, and each node is an array of a fixed size. If the entry for a given subnode is null, all the leaves covered by that node are null. The main advantage of the page table is that lookups are fast, requiring only a few bit-shifts and dereferences.

Here's a straightforward Python implementation of a 2-level pagetable:

class Pagetable(object):
  def __init__(self, num_bits=8):
    """Creates a new Pagetable with num_bits bits per level.

    Args:
      num_bits: The number of bits per pagetable level.
        A 2-level pagetable can store indexes from 0 to 2^(num_bits*2) - 1.
    """
    self.num_bits = num_bits
    self.mask = (1 << num_bits) - 1
    self.root = [None] * (2 ** num_bits)

  def __getitem__(self, idx):
    page = self.root[idx >> self.num_bits]
    return page and page[idx & self.mask]

  def __setitem__(self, idx, val):
    page = self.root[idx >> self.num_bits]
    if not page:
      page = self.root[idx >> self.num_bits] = [None] * (2 ** self.num_bits)
    page[idx & self.mask] = val
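
A quick usage sketch (the indexes and values are mine, not from the answer). Note that the default num_bits=8 only covers indexes below 2**16, so a wider setting is used here to reach the million-range indexes from the question:

pt = Pagetable(num_bits=10)  # 2 levels of 10 bits covers indexes 0 .. 2**20 - 1
pt[1000000] = 42             # first write in this region allocates one 1024-entry page
pt[1000001] = 43
print(pt[1000000])           # 42
print(pt[2000])              # None: that page was never written, so it was never allocated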
Nick Johnson