
I need a fast container with only two operations: inserting keys from a very sparse domain (all 32-bit integers, with approx. 100 set at any given time), and iterating over the inserted keys. It should cope with a lot of insertions that hit the same entries (around 500k insertions, but only about 100 distinct values).

Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, and the same goes for the Google hash maps. What data structure is optimized for this case?

A: 

Maybe a set with a B-tree (instead of a binary tree) as its internal data structure. I found an article on CodeProject which implements this.

Wimmel
+1  A: 

For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 whenever it gets more than about 70% full. Cuckoo hashing has been discussed on Stack Overflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this is very simple, so the coding and maintenance effort of an assembler implementation shouldn't be unduly high.
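
For illustration, a minimal sketch of such a non-bucketed (open-addressing) set for 32-bit keys, with linear probing and power-of-two growth once it passes roughly 70% load; the class name and constants are made up for the example, and a real implementation would tune the hash and probe sequence.

    #include <cstdint>
    #include <vector>

    // Non-bucketed (open-addressing) set of 32-bit keys with linear probing.
    // Grows to the next power of two once it is about 70% full.
    class OpenAddressSet {
    public:
        explicit OpenAddressSet(std::size_t initial_pow2 = 256)
            : slots_(initial_pow2, kEmpty), used_(0) {}

        void insert(std::uint32_t key) {
            if ((used_ + 1) * 10 > slots_.size() * 7)  // keep load factor under ~0.7
                grow();
            if (place(slots_, key))
                ++used_;
        }

        template <typename F>
        void for_each(F f) const {                     // visit every inserted key
            for (std::uint64_t s : slots_)
                if (s != kEmpty)
                    f(static_cast<std::uint32_t>(s));
        }

    private:
        static constexpr std::uint64_t kEmpty = ~0ull; // no 32-bit key equals this

        static std::size_t hash(std::uint32_t k) {     // cheap integer mix
            k ^= k >> 16;
            k *= 0x45d9f3bu;
            k ^= k >> 16;
            return k;
        }

        static bool place(std::vector<std::uint64_t>& slots, std::uint32_t key) {
            std::size_t mask = slots.size() - 1;       // size is a power of two
            std::size_t i = hash(key) & mask;
            while (slots[i] != kEmpty) {
                if (slots[i] == key) return false;     // already present
                i = (i + 1) & mask;                    // linear probe
            }
            slots[i] = key;
            return true;
        }

        void grow() {
            std::vector<std::uint64_t> bigger(slots_.size() * 2, kEmpty);
            for (std::uint64_t s : slots_)
                if (s != kEmpty)
                    place(bigger, static_cast<std::uint32_t>(s));
            slots_.swap(bigger);
        }

        std::vector<std::uint64_t> slots_;
        std::size_t used_;
    };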

ConcernedOfTunbridgeWells
+1  A: 

You might want to consider implementing a HashTree using a base-10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (at most log10 of the key range, i.e. 10 levels for a 32-bit key), or adjust your bucket size based on your expected distribution so that you only have a couple of keys per bucket.
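
One possible reading of this, sketched below, is a trie keyed on the decimal digits of the value, so each level fans out 10 ways and a 32-bit key is found after at most 10 steps; the structure and names are illustrative only.

    #include <cstdint>
    #include <memory>

    // Trie keyed on the decimal digits of a 32-bit value: each level consumes one
    // digit (least significant first), so insert/lookup touch at most 10 nodes.
    struct DigitTrie {
        struct Node {
            std::unique_ptr<Node> child[10];
            bool present = false;        // some key terminates at this node
        };

        Node root;

        void insert(std::uint32_t key) {
            Node* n = &root;
            do {
                int digit = key % 10;
                if (!n->child[digit])
                    n->child[digit] = std::make_unique<Node>();
                n = n->child[digit].get();
                key /= 10;
            } while (key != 0);
            n->present = true;
        }

        bool contains(std::uint32_t key) const {
            const Node* n = &root;
            do {
                const Node* c = n->child[key % 10].get();
                if (!c) return false;
                n = c;
                key /= 10;
            } while (key != 0);
            return n->present;
        }
        // Enumeration would be a depth-first walk that rebuilds each key from
        // the digits along the path.
    };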

tvanfosson
+3  A: 

I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?

If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).

I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).

So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array and, for each set flag at index n, report inverse_hash(n) as a member of your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-present values and check each in turn.
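
As a concrete (and hedged) sketch of that scheme, assuming the ~100 possible values really are known up front: a real minimal perfect hash would replace the unordered_map below, which only stands in for the forward mapping from value to dense index.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Flag-array set over a fixed universe of ~100 known values. The unordered_map
    // is only a stand-in for a real collision-free hash; the vector inverse_ is
    // the inverse-hash lookup table mentioned above.
    class FlagSet {
    public:
        explicit FlagSet(const std::vector<std::uint32_t>& universe)
            : flags_(universe.size(), 0), inverse_(universe) {
            for (std::size_t i = 0; i < universe.size(); ++i)
                index_[universe[i]] = i;           // forward map, built once
        }

        void insert(std::uint32_t value) { flags_[index_.at(value)] = 1; }
        void erase(std::uint32_t value)  { flags_[index_.at(value)] = 0; }

        template <typename F>
        void for_each(F f) const {
            for (std::size_t i = 0; i < flags_.size(); ++i)
                if (flags_[i])
                    f(inverse_[i]);                // map dense index back to value
        }

    private:
        std::vector<char> flags_;                  // one flag per known value
        std::vector<std::uint32_t> inverse_;
        std::unordered_map<std::uint32_t, std::size_t> index_;
    };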

Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not much faster than a regular hash table, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.

Steve Jessop
It might be (probably is, at a guess) that we don't know a priori which 100 or so values these might be at any given point, so we don't really have the option of generating a perfect hash function for them. You'd still need a non-bucketed hash table to do this.
ConcernedOfTunbridgeWells
You're right, these 100 values are not known up front; moreover, they vary for each run. The only thing that does not change is the number of unique elements - it's always around 100-1000.
Anteru
OK, I'll move my edit out into a different answer.
Steve Jessop
Damn, just caching the last ID was enough. Your guess was right, the values tended to repeat for quite some time, and by just caching the last number I got 100% better performance. Thanks!
Anteru
Hurray for guessing, second time lucky :-)
Steve Jessop
A: 

Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.

Which operation is slow for you? Do you do more insertions or more iteration?

David Norman
The problem is that the insertions often hit the same value, so that an O(1) test whether a value is contained would be needed (read: hash map). For some reason or another, the MSVC std::set is still faster than the std::tr1::unordered_map, although the lookup is O(log n) for the std::set.
Anteru
+1  A: 

A randomized data structure might be perfect for your job. Take a look at the skip list – though I don't know of any decent C++ implementation of it. I intended to submit one to Boost but never got around to doing it.

Konrad Rudolph
A: 

How much memory do you have? A bitmap with one bit per possible 32-bit value takes "only" 2^32 bits / 8 = 512 MB - not much for a high-end server. That would make your insertions O(1). Iteration could be slow, but skipping over words that are all zero would optimize away most of it. If your 100 numbers fall in a relatively small range, you can optimize even further by keeping the minimum and maximum around.

I know this is just brute force, but sometimes brute force is good enough.
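
A minimal sketch of that brute-force bitmap, skipping 64-bit words that are all zero during iteration (the bit-scan intrinsic is GCC/Clang-specific and used here purely for illustration):

    #include <cstdint>
    #include <vector>

    // One bit per possible 32-bit key: 2^32 bits = 512 MB. Insertion is O(1);
    // iteration scans the bitmap but only digs into words that contain set bits.
    class BruteForceBitmap {
    public:
        BruteForceBitmap() : words_(kWords) {}     // zero-initialized, 512 MB

        void insert(std::uint32_t key) {
            words_[key >> 6] |= 1ull << (key & 63);
        }

        template <typename F>
        void for_each(F f) const {
            for (std::size_t w = 0; w < kWords; ++w) {
                std::uint64_t bits = words_[w];    // all-zero words cost one compare
                while (bits) {
                    unsigned bit = __builtin_ctzll(bits);  // index of lowest set bit
                    f(static_cast<std::uint32_t>((w << 6) | bit));
                    bits &= bits - 1;              // clear that bit
                }
            }
        }

    private:
        static constexpr std::size_t kWords = (1ull << 32) / 64;
        std::vector<std::uint64_t> words_;
    };

Tracking the minimum and maximum inserted key, as suggested above, would let the scan cover only that slice of the word array.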

coryan
If only insertion is interesting and not removal, the iteration problem can be solved by maintaining an additional unsorted vector/linked list of all items, yielding O(1) for insertion and O(n) for iteration.
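
A tiny sketch of that combination, assuming insertion-only use: any membership structure (the bitmap above, or a hash set as used here) plus an append-only vector of the distinct keys.

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // Insertion-only set: a membership test plus an append-only list of distinct
    // keys, so iteration touches only the ~100 values actually inserted.
    struct InsertOnlySet {
        std::unordered_set<std::uint32_t> seen;  // or the bitmap from the answer above
        std::vector<std::uint32_t> keys;         // distinct keys, in insertion order

        void insert(std::uint32_t key) {
            if (seen.insert(key).second)         // .second is true only for a new key
                keys.push_back(key);
        }
        // Iterate over keys directly: O(number of distinct keys).
    };
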
Konrad Rudolph
+5  A: 

Depending on the distribution of the input, you might be able to get some improvement without changing the structure.

If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted and not bothering to do the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
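
For instance, a minimal sketch of that last-value cache wrapped around the existing std::set (purely illustrative):

    #include <cstdint>
    #include <set>

    // Wraps a std::set with a one-element cache of the last inserted value, so
    // long runs of the same value skip the set lookup entirely.
    struct CachedSet {
        std::set<std::uint32_t> values;
        std::uint32_t last = 0;
        bool has_last = false;

        void insert(std::uint32_t v) {
            if (has_last && v == last)   // repeat of the previous insertion
                return;
            values.insert(v);
            last = v;
            has_last = true;
        }
    };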

If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.

Steve Jessop
A: 

Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure whose insertion algorithm causes a page fault will do you no good. In fact, a data structure whose insert merely causes a cache miss is likely to be really bad for performance.

Have you made sure that a naive unordered set of elements packed into a fixed array, with a simple swap-to-front when an insert collides with an existing entry, is too slow? It's a simple experiment that might show you have memory-locality issues rather than algorithmic issues.
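
A sketch of that experiment: a small contiguous array, a linear scan, and a swap toward the front whenever an existing entry is hit again (all names illustrative):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // ~100 keys packed in one contiguous array: the linear scan stays in cache,
    // and nudging a re-inserted key toward the front keeps hot keys cheap to find.
    struct PackedSet {
        std::vector<std::uint32_t> keys;   // inserted keys, hot ones near the front

        void insert(std::uint32_t key) {
            for (std::size_t i = 0; i < keys.size(); ++i) {
                if (keys[i] == key) {
                    if (i > 0)
                        std::swap(keys[i], keys[i - 1]);  // move-to-front heuristic
                    return;
                }
            }
            keys.push_back(key);           // genuinely new key
        }
        // Iterate over keys directly.
    };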

Steve Steiner