I have a table of items with [ID, ATTR1, ATTR2, ATTR3]. I'd like to select about half of the items, but try to get a random result set that is NOT clustered. In other words, there's a fairly even spread of ATTR1 values, ATTR2 values, and ATTR3 values. This does NOT necessarily represent the data as a whole, in other words, the total table may be generally concentrated on certain attribute values, but I'd like to select a subset with more variety. The attributes are not inter-related, so there's not really a correlation between ATTR1 and ATTR2.

As an example, imagine ATTR1 = "State". I'd like each line item in my subset to be from a different state, even if in the whole set, most of my data is concentrated on a few states. And for this to simultaneously be true of the other 2 attributes, too. (I realize that some tables might not make this possible, but there's enough data that it's unlikely to have no solution)

Any ideas for an efficient algorithm? Thanks! I don't really even know how to search for this :)

(by the way, it's OK if this requires pre-calculation or -indexing on the whole set, so long as I can draw out random varied subsets quickly)

A: 

I don't know (and I hope someone who does will answer). Here's what comes to mind: make up a target distribution for MCMC that puts the most weight on the subsets with 'variety'.
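For concreteness, here is a minimal sketch of what such a target distribution could look like in Python. Everything here is an illustrative assumption, not part of the answer: items are taken to be (attr1, attr2, attr3) tuples, "variety" is scored as the summed entropy of the attribute histograms, and beta is a made-up tuning knob.

import math
from collections import Counter

def variety(subset):
    # Summed Shannon entropy of the three attribute histograms; higher = more varied.
    n = len(subset)
    return sum(
        -sum((c / n) * math.log(c / n)
             for c in Counter(item[a] for item in subset).values())
        for a in range(3))

def weight(subset, beta=5.0):
    # Unnormalized MCMC target: high-variety subsets get most of the mass.
    return math.exp(beta * variety(subset))

A Metropolis chain would then propose swapping one chosen item for an unchosen one and accept with probability min(1, weight(new)/weight(old)); the simulated-annealing sketch under Hightechrider's first answer below shows one concrete way to run such a loop.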

Darius Bacon
Thanks, seems like a good way to do it with pre-calculated subsets. I suppose I could just run a generator in the background creating a bunch of subset trials, and then when I need a "real-time" random subset, pick from one of the sets that scored high. It seems pretty inefficient in the case where an attribute column is dominated by a particular value - I'd have to run a ton of tests to find variety. But it would work!
Steve Eisner
OK, the more I read about these, it seems like I could pre-weight the data (effectively centering the origin of the chain's walk on the data that is most distant from the norm) to compensate for the difficulty of finding variety with a random walk. Or maybe there's some better way to walk less randomly.
Steve Eisner
I wasn't thinking of precalculated subsets, but of a weight *function* to bias the walk towards more variety, something vaguely like Hightechrider's suggestions. Anyway, algorithmist's idea sounds the most promising to me.
Darius Bacon
A: 

Naive solution.

Suppose you have M items and want to select K of them: divide the items into K chunks of roughly M/K items each, then use a standard random number generator to select one item from each chunk.

Now you have a not-quite-random sample spread out over all M items, if you get what I mean.
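A quick sketch of this in Python, assuming the items are already in a list in whatever order makes "spread out" meaningful (e.g. sorted by id); the remainder items after the last full chunk are simply ignored here:

import random

def chunked_sample(items, k):
    # Pick one item uniformly at random from each of k equal-sized chunks.
    size = len(items) // k
    return [random.choice(items[i * size:(i + 1) * size]) for i in range(k)]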

Nifle
Thanks. It's good as a way of getting a "gist" of the table, but I don't think it will give me what I want. For instance: if in the entire State Data table there were only one record for Wyoming, I'd want that one to have a much higher chance of appearing in my random subsets (i.e. present at *some* position in the subset).
Steve Eisner
A: 

Assuming the items in your table are indexed by some form of id, I would loop for half the number of items in your table and use a random number generator to pick each id.

John
Thanks, but I don't think this does what I need, as there's no weighting for variety?
Steve Eisner
What do you mean by variety? If it is a truly random number generator, then each item would have an equal chance of being picked.
John
A: 

IMHO

Finding variety is difficult but generating it is easy.

So we can generate a variety of combinations and then search the table for records with those combinations.

If the table is sorted then searching also becomes easy.

Sample Python code:

# Dummy table: keys are (ATTR1, ATTR2, ATTR3) combinations, values are ids.
# Note that repeated keys overwrite earlier ones: ('a',0,'A') ends up as 7.
d = {}
d[('a',0,'A')]=0
d[('a',1,'A')]=1
d[('a',0,'A')]=2
d[('b',1,'B')]=3
d[('b',0,'C')]=4
d[('c',1,'C')]=5
d[('c',0,'D')]=6
d[('a',0,'A')]=7

print(d)

attr1 = ['a','b','c']
attr2 = [0,1]
attr3 = ['A','B','C','D']

# number of distinct values: attr2 (2) < attr1 (3) < attr3 (4)

# ;) hence the strange nesting of the loops, fewest values innermost

# Generate every combination in order and look each one up in the table.
for z in attr3:
    for x in attr1:
        for y in attr2:
            k = (x,y,z)
            if k in d:
                print('%s->%s' % (k, d[k]))
            else:
                print(k)

Output:

('a', 0, 'A')->7
('a', 1, 'A')->1
('b', 0, 'A')
('b', 1, 'A')
('c', 0, 'A')
('c', 1, 'A')
('a', 0, 'B')
('a', 1, 'B')
('b', 0, 'B')
('b', 1, 'B')->3
('c', 0, 'B')
('c', 1, 'B')
('a', 0, 'C')
('a', 1, 'C')
('b', 0, 'C')->4
('b', 1, 'C')
('c', 0, 'C')
('c', 1, 'C')->5
('a', 0, 'D')
('a', 1, 'D')
('b', 0, 'D')
('b', 1, 'D')
('c', 0, 'D')->6
('c', 1, 'D')

But assuming your table is very big (otherwise why would you need an algorithm ;) and the data is fairly uniformly distributed, there will be more hits in an actual scenario. In this dummy case there are too many misses, which makes the algorithm look inefficient.

TheMachineCharmer
Interesting idea, clever approach. I think you're right that there are some cases where it works better (uniform data, fewer possible attribute values, etc.) But I think it might totally fall apart with a larger number of attribute values and an uneven distribution. I could be wrong, but it seems like you'd have to generate a LOT of cases to ensure a match, and then the (generated <-> actual) comparisons would become too expensive? Let me know if you disagree!
Steve Eisner
Yes, it might fail. But in the case of too many attributes we won't add a `for` loop for each of them; we would just pick them randomly. E.g. if we have 3 more attributes which are not that important, then instead of adding 3 more nested loops we can change the statement `(x,y,z)` to `(x,y,z,randomly_picked_attr4,randomly_picked_attr5,...)`. :)
TheMachineCharmer
Also, googling `unique sampling algorithm` should help you solve your problem.
TheMachineCharmer
Also note that you could prehash all the entries in the table (adding a column), and thus the comparison could be a lookup in a hashtable, which is quite efficient. However, there is a bias in the selection if you only select a few items from the table.
Matthieu M.
A: 

Let's assume that ATTR1, ATTR2, and ATTR3 are independent random variables (over a uniform random item). (If ATTR1, ATTR2, and ATTR3 are only approximately independent, then this sample should be approximately uniform in each attribute.) To sample one item (VAL1, VAL2, VAL3) whose attributes are uniformly distributed, choose VAL1 uniformly at random from the set of values for ATTR1, choose VAL2 uniformly at random from the set of values for ATTR2 over items with ATTR1 = VAL1, choose VAL3 uniformly at random from the set of values for ATTR3 over items with ATTR1 = VAL1 and ATTR2 = VAL2.

To get a sample of distinct items, apply the above procedure repeatedly, deleting each item after it is chosen. Probably the best way to implement this would be a tree. For example, if we have

ID    ATTR1 ATTR2 ATTR3
1     a     c     e
2     a     c     f
3     a     d     e
4     a     d     f
5     b     c     e
6     b     c     f
7     b     d     e
8     b     d     f
9     a     c     e

then the tree is, in JavaScript object notation,

{"a": {"c": {"e": [1, 9], "f": [2]},
       "d": {"e": [3], "f": [4]}},
 "b": {"c": {"e": [5], "f": [6]},
       "d": {"e": [7], "f": [8]}}}

Deletion is accomplished recursively. If we sample id 4, then we delete it from its list at the leaf level. This list empties, so we delete the entry "f": [] from tree["a"]["d"]. If we now delete 3, then we delete 3 from its list, which empties, so we delete the entry "e": [] from tree["a"]["d"], which empties tree["a"]["d"], so we delete it in turn. In a good implementation, each item should take time O(# of attributes).

EDIT: For repeated use, reinsert the items into the tree after the whole sample is collected. This doesn't affect the asymptotic running time.
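A rough Python sketch of the tree and the sampling-with-deletion procedure described above (build_tree and sample_one are illustrative names, and handling of an empty tree is omitted):

import random

def build_tree(rows):
    # rows: iterable of (id, attr1, attr2, attr3)
    tree = {}
    for id_, a1, a2, a3 in rows:
        tree.setdefault(a1, {}).setdefault(a2, {}).setdefault(a3, []).append(id_)
    return tree

def sample_one(tree):
    a1 = random.choice(list(tree))             # uniform over ATTR1 values
    a2 = random.choice(list(tree[a1]))         # uniform over ATTR2 given a1
    a3 = random.choice(list(tree[a1][a2]))     # uniform over ATTR3 given a1, a2
    ids = tree[a1][a2][a3]
    id_ = ids.pop(random.randrange(len(ids)))  # remove the sampled item
    if not ids: del tree[a1][a2][a3]           # prune empty branches upward,
    if not tree[a1][a2]: del tree[a1][a2]      # as in the id-4/id-3 example
    if not tree[a1]: del tree[a1]
    return id_

Drawing the whole sample is then [sample_one(tree) for _ in range(k)], after which the items can be reinserted as the EDIT suggests.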

Steve Eisner
If there's a singleton ATTR3, say, then there's actually a lot of correlation between the attributes: fixing ATTR3 to the singleton value, the others go from uniform to deterministic! This is sort of a hard problem; it might be mitigated somewhat by choosing randomly between 6 trees, one with each permutation of the attributes.
Hmm, actually that's not what I meant - rather, that if there were 10 possible ATTR3 values, but 1 of them only occurred once in the entire table of 10,000 items, it would be very unlikely to show in the list. But your idea of selecting from 6 combinatorial trees would address that (with the added complexity of deleting from all 6 somehow). Thanks!
Steve Eisner
+1  A: 

Interesting problem. Since you want about half of the list, how about this:

Create a list of half the values, chosen entirely at random. Compute histograms of the ATTR1, ATTR2, and ATTR3 values over the chosen items.

:loop

Now randomly pick an item that's in the current list and an item that isn't.

If the swap increases the 'entropy' (the spread of unique attribute values) in the histograms, keep it and update the histograms to reflect the change you just made.

Repeat N/2 times, or more, depending on how much you want to force it to move towards covering every value rather than being random. You could also use 'simulated annealing' and gradually change the probability of accepting the swap - starting with 'sometimes allow a swap even if it makes it worse' down to 'only swap if it increases variety'.
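A rough sketch of that loop, again assuming (attr1, attr2, attr3) tuples and using the summed entropy of the histograms as the variety score; the cooling schedule and step count are made-up knobs, and a production version would update the histograms incrementally instead of rescoring from scratch:

import math, random
from collections import Counter

def score(subset):
    # Summed entropy of the three attribute histograms ("variety").
    n = len(subset)
    return sum(-sum((c / n) * math.log(c / n)
                    for c in Counter(it[a] for it in subset).values())
               for a in range(3))

def anneal(chosen, rest, steps=5000, t0=1.0):
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9     # temperature cools towards zero
        i, j = random.randrange(len(chosen)), random.randrange(len(rest))
        trial = chosen[:]
        trial[i] = rest[j]                     # propose swapping one item
        delta = score(trial) - score(chosen)
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature drops.
        if delta > 0 or random.random() < math.exp(delta / t):
            rest[j], chosen = chosen[i], trial
    return chosen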

Hightechrider
Thanks, this seems very similar to Darius's answer in spirit: a "random walk" through a "variety map". Your histogram is a good measure of variety, though I wonder if it would be expensive to maintain as a running value. I suspect there must be some multi-attribute solution similar to http://stackoverflow.com/questions/872563/efficient-algorithm-to-randomly-select-items-with-frequency that lets me randomly move in the right direction more often than a simple random selection.
Steve Eisner
Steve, you want sampling without replacement, so I think a more relevant link is the second half of http://stackoverflow.com/questions/2140787/select-random-k-elements-from-a-list-whose-elements-have-weights/2149533#2149533
Darius Bacon
A: 

Idea #2.

Compute histograms for each attribute on the original table.

For each item compute its uniqueness score = p(ATTR1) x p(ATTR2) x p(ATTR3) (multiply the probabilities for each attribute value it has).

Sort by uniqueness.

Choose a probability distribution curve for your random numbers, ranging from picking only values in the first half of the set (a step function) to picking values evenly over the entire set (a flat line). Maybe a 1/x curve would work well for you in this case.

Pick values from the sorted list using your chosen probability curve.

This allows you to bias it towards more unique values or towards more evenness just by adjusting the probability curve you use to generate the random numbers.
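A sketch of this in Python; squaring the random number is one made-up way to get a front-biased (roughly 1/x-flavored) pick from the sorted list, and the exponent is the adjustable knob the answer describes:

import random
from collections import Counter

def biased_sample(items, k):
    # items: list of (attr1, attr2, attr3) tuples
    n = len(items)
    hist = [Counter(it[a] for it in items) for a in range(3)]
    def uniqueness(it):
        p = 1.0
        for a in range(3):
            p *= hist[a][it[a]] / n            # p(ATTR_a = value) over the table
        return p
    pool = sorted(items, key=uniqueness)       # rarest combinations first
    chosen = []
    while len(chosen) < k and pool:
        idx = int((random.random() ** 2) * len(pool))  # front-biased index
        chosen.append(pool.pop(idx))
    return chosen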

Hightechrider
Yep, interesting idea and similar to one I was leaning towards.... biasing randomness for unlikeliness is good. The only problem is that it doesn't necessarily correlate to variety across the attribute value set. In other words, it might be possible that all of the unlikeliest entries are actually extremely similar, each just being relatively unlikely within the context of the entire table....
Steve Eisner
A: 

Taking your example, assign every possible 'State' a numeric value (say, between 1 and 9). Do the same for the other attributes.

Now, assuming you don't have more than 10 possible values for each attribute, multiply the value for ATTR3 by 100, ATTR2 by 1,000, and ATTR1 by 10,000. Add up the results and you will end up with something resembling a vague hash of the item. Something like

10,000 * |ATTR1| + 1000 * |ATTR2| + 100 * |ATTR3|

The advantage here is that you know that values between 10,000 and 19,900 have the same 'State' value; in other words, the first digit represents ATTR1. The same holds for ATTR2 and the other attributes.

You can sort all the values and, using something like bucket sort, pick one for each type, checking that the digit you're considering hasn't been picked already.

An example: if your end values are

A: 15,700 = 10,000 * 1 + 1,000 * 5 + 100 * 7
B: 13,400 = 10,000 * 1 + 1,000 * 3 + 100 * 4
C: 13,200 = ...
D: 12,300
E: 11,400
F: 10,900

you know that all your values have the same ATTR1; 2 have the same ATTR2 (that being B and C); and 2 have the same ATTR3 (B, E).

This, of course, is assuming I understood correctly what you want to do. It's Saturday night, after all.

ps: yes, I could have used '10' as the first multiplier but the example would have been messier; and yes, it's clearly a naive example and there are lots of possible optimizations here, which are left as an exercise to the reader
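A tiny sketch of the encoding, assuming each attribute value has first been mapped to a digit 0-9; the mappings below are made up for illustration:

def composite_key(v1, v2, v3, code1, code2, code3):
    # code1/code2/code3 map attribute values to digits 0..9
    return 10000 * code1[v1] + 1000 * code2[v2] + 100 * code3[v3]

code1 = {v: i for i, v in enumerate(['WY', 'CA', 'NY'])}  # ATTR1 -> digit
code2 = {v: i for i, v in enumerate([0, 1])}              # ATTR2 -> digit
code3 = {v: i for i, v in enumerate(['A', 'B', 'C'])}     # ATTR3 -> digit

key = composite_key('CA', 1, 'B', code1, code2, code3)    # 11,100
attr2_digit = (key // 1000) % 10   # two keys agree here iff their ATTR2 agrees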

lorenzog
A: 

It's a very interesting problem, for which I can see a number of applications. Notably for testing software: you get many 'main-flow' transactions, but only one is necessary to test that it works, and when selecting you would prefer an extremely varied sample.

I don't think you really need a histogram structure, or at least only a binary one (absent/present).

{ ATTR1: [val1, val2], ATTR2: [i,j,k], ATTR3: [1,2,3] }

This is in fact used to generate a list of predicates:

Predicates = [ lambda x: x.attr1 == val1, lambda x: x.attr1 == val2,
               lambda x: x.attr2 == i, ...]

This list will contain, say, N elements.

Now you wish to select K elements from this list. If K is less than N that's fine; otherwise we will duplicate the list i times, so that K <= N*i with i minimal, of course: i = ceil(K/N). (Note that this also works when K <= N, with i == 1.)

from math import ceil

i = ceil(K / N)
Predz = Predicates * i # python's wonderful

And finally, pick a predicate from there and look for an element that satisfies it... that's where the randomness actually comes in, and I am less than adequate here.
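Filling in that last step naively, a sketch of the whole procedure might look like this; predicate_sample, the index bookkeeping, and the shuffle are my assumptions, not the answer's:

import random
from math import ceil

def predicate_sample(rows, k):
    # rows: list of (attr1, attr2, attr3) tuples
    values = [sorted({r[a] for r in rows}) for a in range(3)]
    # One predicate per (attribute, value) pair; defaults freeze a and v.
    predicates = [lambda r, a=a, v=v: r[a] == v
                  for a in range(3) for v in values[a]]
    preds = predicates * ceil(k / len(predicates))  # at least k predicates
    random.shuffle(preds)
    chosen = set()                                  # indices into rows
    for pred in preds:
        if len(chosen) == k:
            break
        matches = [i for i, r in enumerate(rows)
                   if pred(r) and i not in chosen]
        if matches:                # a predicate may have no matches left
            chosen.add(random.choice(matches))
    return [rows[i] for i in chosen]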

A few remarks:

  • if K > N you may want to select each predicate i-1 times and then select randomly from the list of predicates only to top off your selection, thus ensuring the over-representation of even the least common elements.
  • the attributes are treated as completely uncorrelated this way; you may want to select for patterns, since you could never target the tuple (1,2,3) just by selecting on the third element being 3. So perhaps a refinement would be to group some related attributes together, though it would probably increase the number of predicates generated.
  • for efficiency reasons, you should have the table indexed by the predicate's attribute if you wish to have an efficient select.
Matthieu M.