views:

57

answers:

3

lets say i have numbers from 1-10Million (customer ids). each single number is associated with 1 of 3 possible values - A,B,C.

i know that very large adjacent regions of about 1000 elements are in the same category.

what is a data structure that allows me to save the link between number range and category in a memory-efficient way?

also, is there a java implementation of an interval-tree, which was suggested in an answer.

A: 

You can try LinkedListMultimap from Google Collections with some tricky logic.

What is tricky logic: each odd value represent begin of the interval and each even value represent end of the interval.

For example, you have 1001-1100 ids in A, 1101-1300 in B and 1301-1400 again in A

multimap.put (A, 1001); 
multimap.put (A, 1100);

multimap.put (B, 1101);
multimap.put (B, 1300);

multimap.put (A, 1301);
multimap.put (A, 1400);
Roman
+1  A: 

Create 3 interval trees, or sorted map of (begin, end) pairs, each representing the categories A, B and C.

KennyTM
But the intervals are not overlapping. A binary tree would work as well.
Thomas Jung
+1  A: 

Start by transposing your data structure, t.i. instead of storing a customer -> category (A/B/C) mapping, store a category -> customers mapping. I've found transposition to be a common and cool method for designing very efficient datastructures.

Now, use 3 bitmaps (bitmasks, bitsets like java.util.BitSet) for each of the 3 A,B,C tables. The i-th bit in A table will tell whether customer number 'i' is in category A.

Each of these tables will take up only N/8 bytes of memory, which is just 3.75mb given your 10mln customers.

(note that this will work only if your customer id's are sequential integers)

jkff
while this does not take into account the sequetial alignment it is very efficient. i'll try this out. 3.7mb is not too much memory in my case. the actual memory structure of each of the bitsets look very redundant though. 0000000000000000000111111111111111111110000000000 might as well be compressed unto 20x0.20x1,10x0 somehow.or is there maybe a huffman-encoded bitset implementation?
Andreas Petersson
You may compress the bitset somehow, perhaps even with gzip, but then you lose the extremely efficient random access read/write. All in all, the best solution depends on how you're going to access the datastructure.
jkff
Oh, by the way, you don't need 3 bit masks for A,B and C. You only need to store A and B, because C comes out automatically as ~(A||B).
jkff