views:

38

answers:

1

I'm looking for a database with good support for set operations (more specifically: unions).

What I want is something that can store sets of short strings and calculate the union of such sets. For example, I want to add A, B, and C to a set, then D, and A to another and then get the cardinality of the union of those sets (4), but scaled up a million times or so.

The values are 12 character strings and the set sizes range from single elements to millions.

I have experimented with Redis, and it's fantastic in every respect except that for the amount of data I have it's tricky with something that is memory-based. I've tried using the VM feature, but that makes it use even more memory, it something more geared towards large values and I have small values (so say the helpful people on the Redis mailing list). The jury is still out, though, I might get it to work.

I've also sketched on implementing it on top of a relational database, which would probably work, but what I'm asking for is something that I wouldn't have to hack to work. Redis would be a good answer, but as I mentioned above, I've tried it.

My current, Redis-based, implementation works more or less like this: I parse log files and for each line I extract an API key, a user ID, and the values of a number of properties like site domain, time of day, etc. I then formulate a keys that looks somewhat like this (each line results in many keys, one for each property):

APIKEY:20101001:site_domain:stackoverflow.com

the key points to a set, and to this set I add the user ID. When I've parsed all the log files I want to know the total number of unique user IDs for a property over all time and so I ask Redis for the cardinality of the union of all keys that match

APIKEY:*:site_domain:stackoverflow.com

Is there a database, besides Redis, that has good support for this use case?

+1  A: 

it sounds like you need something like boost::disjoint_set which is a datastructure specifically optimized for taking unions or intersections of large sets.

TokenMacGuy