Let's say we have 5000 users in a database. Each user row has a sex column, a column for the place where he/she was born, and a status column (married or not married).

How to generate a random subset (let's say 100 users) that would satisfy these conditions:

  • 40% should be males and 60% females
  • 50% should be born in USA, 20% born in UK, 20% born in Canada, 10% in Australia
  • 70% should be married and 30% not.

These conditions are independent, that is, we cannot simply do this:

  • (0.4 * 0.5 * 0.7) * 100 = 14 users that are males, born in USA and married
  • (0.4 * 0.5 * 0.3) * 100 = 6 users that are males, born in USA and not married.

Is there an algorithm for this kind of generation?

+1  A: 

You could try something like this:

  • Pick a random initial set of 100
  • Until you have the right distribution (or give up):
    • Pick a random record not in the set, and a random one that is
    • If swapping in the other record gets you closer to the set you want, exchange them. Otherwise, don't.

I'd probably use the sum of squares of the distance to the desired distribution as the metric for deciding whether to swap.

That's what comes to mind that keeps the set random. Keep in mind that there may be no subset which matches the distribution you're after.
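A minimal sketch of this swap-until-it-fits idea, assuming users are dicts with `sex`, `country`, and `married` keys (names chosen here for illustration) and using the sum-of-squares metric suggested above:

```python
import random

# Target marginal proportions from the question. The attribute and
# value names ("sex", "country", ...) are illustrative assumptions.
TARGETS = {
    "sex":     {"M": 0.40, "F": 0.60},
    "country": {"USA": 0.50, "UK": 0.20, "Canada": 0.20, "Australia": 0.10},
    "married": {True: 0.70, False: 0.30},
}

def distance(subset, targets=TARGETS):
    """Sum of squared differences between the subset's marginal
    proportions and the targets; 0.0 means an exact match."""
    n = len(subset)
    total = 0.0
    for attr, wanted in targets.items():
        for value, want in wanted.items():
            have = sum(1 for u in subset if u[attr] == value) / n
            total += (have - want) ** 2
    return total

def random_swap_sample(users, k, max_iters=100_000):
    """Start from a random subset of size k, then repeatedly propose a
    random swap with a record outside the subset, keeping it only when
    it reduces the distance to the target distribution."""
    indices = list(range(len(users)))
    random.shuffle(indices)
    inside, outside = indices[:k], indices[k:]
    best = distance([users[i] for i in inside])
    for _ in range(max_iters):
        if best == 0.0:
            break  # exact distribution reached
        a = random.randrange(k)
        b = random.randrange(len(outside))
        inside[a], outside[b] = outside[b], inside[a]
        d = distance([users[i] for i in inside])
        if d < best:
            best = d                                       # keep the swap
        else:
            inside[a], outside[b] = outside[b], inside[a]  # undo it
    return [users[i] for i in inside], best
```

Note this can stall at a local minimum (or run forever if no matching subset exists), hence the iteration cap and returning the final distance so the caller can decide whether the result is close enough.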

Grumdrig
Well, my initial thoughts are similar - I should generate a set that satisfies the first condition. Then I should count how many of those users satisfy the second condition - those that don't should be replaced by other users. And so on... The question is how to do this more intelligently, and how to know when to stop (when such a set cannot be found).
Sazug
A: 

Hi

I think you are wrong to state that you cannot select 14% married US men, 6% unmarried US men, and so on. Your criteria identify 16 (i.e. 2x4x2) subsets, and the numbers you give specify a relative size for each of those subsets. If you sum the sizes of those subsets every which way, you'll find that they do give you the distribution of attributes you desire.

Your first job, therefore, is to build an index on each of the attributes and calculate the size of each subset you need to consider.

If it is important that 14/100 selections are married US men (and so on) then select 14 (at random) from your subset of all married US men, then 6 (at random) unmarried US men, etc.
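A sketch of this exact-proportions approach, assuming users are dicts keyed by illustrative column names, with largest-remainder rounding to settle cells whose quota isn't a whole number (e.g. 0.4 x 0.2 x 0.7 x 100 = 5.6):

```python
import itertools
import math
import random
from collections import defaultdict

# Marginal targets from the question; column names are illustrative.
MARGINALS = {
    "sex":     {"M": 0.40, "F": 0.60},
    "country": {"USA": 0.50, "UK": 0.20, "Canada": 0.20, "Australia": 0.10},
    "married": {True: 0.70, False: 0.30},
}

def cell_quotas(k, marginals=MARGINALS):
    """Quota for each of the 2x4x2 = 16 cells, treating the attributes
    as independent: quota = k * product of marginal proportions.
    Fractional quotas are settled by largest-remainder rounding so the
    quotas sum to exactly k."""
    cells = []
    for combo in itertools.product(*(m.items() for m in marginals.values())):
        key = tuple((attr, value) for attr, (value, _) in zip(marginals, combo))
        cells.append((key, k * math.prod(p for _, p in combo)))
    quotas = {key: math.floor(q) for key, q in cells}
    leftover = k - sum(quotas.values())
    by_remainder = sorted(cells, key=lambda c: c[1] - math.floor(c[1]), reverse=True)
    for key, _ in by_remainder[:leftover]:
        quotas[key] += 1
    return quotas

def stratified_sample(users, k):
    """Group users by cell, then draw each cell's quota at random."""
    groups = defaultdict(list)
    for u in users:
        groups[tuple((a, u[a]) for a in MARGINALS)].append(u)
    sample = []
    for key, quota in cell_quotas(k).items():
        # random.sample raises ValueError if a cell is too small
        sample.extend(random.sample(groups[key], quota))
    return sample
```

Because each cell's members are drawn uniformly at random, the sample stays random within the constraints; the rounding means marginals can be off by at most the rounding slack.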

Another approach would be to generate 3 random numbers in the range (0,1) in turn. On the first select male/female, on the second select country, on the third select marital status. This is, in the long run, going to give your sets of size 100 the right distribution of attributes, but it does not guarantee that any one set of 100 will have exactly the right distribution.

I think either of these approaches will get you a result far quicker than generating subsets of 100 and testing them. By my calculations (see the big number for 5000-choose-100 above) you have a vanishingly small chance of ever generating a random set of 100 users with the right statistical distribution of attributes.

Now I sit back and wait for the statisticians to jump all over my faulty reasoning ....

Regards

Mark

High Performance Mark
A random subset with 40 married men from the USA can be generated. What I've done is choose the correct number of random users that satisfy the first condition, then check whether they satisfy the second condition - those that do not are removed and replaced by new random users who satisfy both the first and second conditions. The same process is repeated for the other conditions.
Sazug
The point is, Sazug wants these conditions to be satisfied independently, so if the requirement is that 40% be men, 50% born in USA, and 70% married, you cannot just multiply these and say that 14% have to be married US men because there are valid alternatives that satisfy each condition on its own but have either more or less than 14% married US men. The 14% of married US men is only the expected value.
jk
It may not be possible to generate a subset. Rock-bottom simplest case: suppose the database contains only males - you can't generate any subset with 60% females.
Larry Watanabe
You're right Larry, I just assumed that OP had some reason to expect to be able to create a satisfactory subset from his set of users. If there are no males in the database then no algorithm is going to find a satisfactory answer.
High Performance Mark
+2  A: 

Does the breakdown need to be exact, or approximate? Typically if you are generating a sample like this then you are doing some statistical study, so it is sufficient to generate an approximate sample.

Here's how to do this:

Have a function genRandomIndividual().

Each time you generate an individual, use the random function to choose the sex - male with probability 40%

Choose the birth location using the random function again (just generate a real in the interval 0-1; if it falls in 0-.5, choose USA, if .5-.7 then UK, if .7-.9 then Canada, otherwise Australia).

Choose married status using random function (again generate in 0-1, if 0-.7 then married, otherwise not).

Once you have a set of characteristics, search the database for the first individual who satisfies these characteristics, add them to your sample, and tag them as already added in the database. Keep doing this until you have fulfilled your sample size.

There may be no individual that satisfies the characteristics. Then just generate a new random individual instead. Since the generations are independent and generate the characteristics according to the required probabilities, in the end you will have a sample of the correct size with the individuals generated randomly according to the probabilities specified.
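The steps above might be sketched like this, assuming users are dicts with illustrative column names and the probability thresholds given in the answer:

```python
import random

def gen_random_characteristics():
    """Draw one (sex, country, married) combination, with each
    attribute chosen independently at its target probability."""
    sex = "M" if random.random() < 0.40 else "F"
    r = random.random()
    if r < 0.5:
        country = "USA"
    elif r < 0.7:
        country = "UK"
    elif r < 0.9:
        country = "Canada"
    else:
        country = "Australia"
    married = random.random() < 0.70
    return sex, country, married

def sample_by_generation(users, k):
    """Generate target characteristics at random, then pull the first
    untagged matching user from the database; if no such user remains,
    simply generate a fresh combination."""
    tagged = set()   # indices already added to the sample
    sample = []
    attempts = 0
    while len(sample) < k:
        attempts += 1
        if attempts > 100 * k:
            raise RuntimeError("could not fill the sample; "
                               "population may be unsatisfiable")
        wanted = gen_random_characteristics()
        for i, u in enumerate(users):
            if i not in tagged and (u["sex"], u["country"], u["married"]) == wanted:
                tagged.add(i)
                sample.append(u)
                break
        # no match found: the while loop draws a new combination
    return sample
```

The attempt cap is an addition to the answer's description: without it, an unsatisfiable population would loop forever. Note the marginals of the result are only correct in expectation, not exactly.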

Larry Watanabe
+1  A: 

It is important to note that you may not be able to find a subset that satisfies these conditions. To take an example, suppose your database contained only American males, and only Australian females. Clearly you could not generate any subset that satisfies your distribution constraints.

Larry Watanabe
or even simpler case, suppose there are only males in the database. Can't generate a subset with 40% males unless you have a sharp knife :)
Larry Watanabe
A: 

(Rewrote my post completely (actually, wrote a new one and deleted the old) because I thought of a much simpler and more efficient way to do the same thing.)

I'm assuming you actually want the exact proportions and not just to satisfy them on average. This is a pretty simple way to accomplish that, but depending on your data it might take a while to run.

First, arrange your original data so that you can access each combination of types easily, that is, group married US men in one pile, unmarried US men in another, and so on. Then, assuming that you have p conditions and you want to select k elements, make p arrays of size k each; one array will represent one condition. Make the elements of each array be the types of that condition, in the proportions that you require. So, in your example, the gender array would have 40 males and 60 females.

Now, shuffle each of the p arrays independently (actually, you can leave one array unshuffled if you like). Then, for each index i, take the type of the picked element to be the combination formed by the shuffled p arrays at index i, and pick one element of that type at random from the remaining ones in your original groups, removing it. If there are no elements of that type left, the algorithm has failed, so reshuffle the arrays and start picking elements again.

To use this, you need to first make sure that the conditions are satisfiable at all because otherwise it will just loop infinitely. To be honest, I don't see a simple way to verify that the conditions are satisfiable, but if the number of elements in your original data is large compared to k and their distribution isn't too skewed, there should be solutions. Also, if there are only a few ways in which the conditions can be satisfied, it might take a long time to find one; though the method will terminate with probability 1, there is no upper bound that you can place on the running time.
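A sketch of this shuffled-arrays method, assuming users are dicts with illustrative column names and that k makes every per-condition count a whole number (as it does for k = 100):

```python
import random
from collections import defaultdict

# Required proportions per condition, from the question; the column
# names are illustrative assumptions.
PROPORTIONS = {
    "sex":     {"M": 0.40, "F": 0.60},
    "country": {"USA": 0.50, "UK": 0.20, "Canada": 0.20, "Australia": 0.10},
    "married": {True: 0.70, False: 0.30},
}

def shuffled_column_sample(users, k, max_restarts=1000):
    """One array of length k per condition, holding that condition's
    values in the required proportions. Shuffle the arrays independently
    and read a full combination off each index; match each combination
    against the remaining users, restarting from scratch on failure."""
    columns = {attr: [v for v, p in dist.items() for _ in range(round(p * k))]
               for attr, dist in PROPORTIONS.items()}
    assert all(len(col) == k for col in columns.values())
    for _ in range(max_restarts):
        for col in columns.values():
            random.shuffle(col)
        pools = defaultdict(list)              # users grouped by full type
        for u in users:
            pools[tuple(u[a] for a in PROPORTIONS)].append(u)
        sample = []
        for i in range(k):
            combo = tuple(columns[a][i] for a in PROPORTIONS)
            if not pools[combo]:
                break                          # no user of this type left: retry
            pick = random.randrange(len(pools[combo]))
            sample.append(pools[combo].pop(pick))
        else:
            return sample                      # every index was matched
    return None                                # gave up after max_restarts
```

When it succeeds, the marginals are exact by construction, since each column array contains exactly the required counts.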

jk
A: 

Algorithm may be too strong a word, since to me that implies formalism and publication, but there is a method to select subsets with exact proportions (assuming your percentages yield whole numbers of subjects from the sample universe), and it's much simpler than the other proposed solutions. I've built one and tested it.

Incidentally, I'm sorry to be a slow responder here, but my time is constrained these days. I wrote a hard-coded solution fairly quickly, and since then I've been refactoring it into a decent general-purpose implementation. Because I've been busy, that's still not complete yet, but I didn't want to delay answering any longer.

The method:

Basically, you're going to consider each row separately, and decide whether it's selectable based on whether your criteria give you room to select each of its column values.

In order to do that, you'll consider each of your column rules (e.g., 40% males, 60% females) as an individual target (e.g., given a desired subset size of 100, you're looking for 40 males, 60 females). Make a counter for each.

Then you loop, until you've either created your subset, or you've examined all the rows in the sample universe without finding a match (see below for what happens then). This is the loop in pseudocode:

- Randomly select a row.  
- Mark the row examined.
- For each column constraint:
    * Get the value for the relevant column from the row
    * Test for selectability:
        If there's a value target for the value, 
        and if we haven't already selected our target number of incidences of this value, 
        then the row is selectable with respect to this column
    * Else: the row fails.
- If the row didn't fail, select it: add it to the subset

That's the core of it. It will provide a subset which matches your rules, or it will fail to do so... which brings me to what happens when we can't find a match.
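A sketch of that loop, wrapped in the retry described further down, assuming users are dicts with illustrative column names and per-value targets taken from the question:

```python
import random
from collections import Counter

def counter_select(users, k, max_retries=100):
    """Walk the rows in random order and select a row only if every one
    of its column values still has remaining quota; retry from scratch
    if a full pass over the universe can't fill the subset."""
    # Per-value targets for a subset of size k = 100 (from the question).
    targets = {
        ("sex", "M"): 40, ("sex", "F"): 60,
        ("country", "USA"): 50, ("country", "UK"): 20,
        ("country", "Canada"): 20, ("country", "Australia"): 10,
        ("married", True): 70, ("married", False): 30,
    }
    columns = ("sex", "country", "married")
    for _ in range(max_retries):
        counts = Counter()   # selections so far, per (column, value)
        subset = []
        for row in random.sample(users, len(users)):  # random examination order
            if all(counts[(c, row[c])] < targets.get((c, row[c]), 0)
                   for c in columns):
                subset.append(row)                    # selectable: take it
                for c in columns:
                    counts[(c, row[c])] += 1
                if len(subset) == k:
                    return subset
            # else: the row fails; move on to the next one
        # examined every row without filling the subset: retry
    return None   # patience exhausted: constraints may be unsatisfiable
```

When it returns a full subset, the marginals are exact, since each attribute's counters are capped at their targets and the caps sum to k.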

Unsatisfiability:

As others have pointed out, it's not always possible to satisfy any arbitrary set of rules for any arbitrary sample universe. Even assuming that the rules are valid (percentages for each value sum to 100), the subset size is less than the universe size, and the universe does contain enough individuals with each selected value to hit the targets, it's still possible to fail if the values are actually not distributed independently.

Consider the case where all the males in the sample universe are Australian: in this case, you can only select as many males as you can select Australians, and vice-versa. So a set of constraints (subset size: 100; male: 40%; Australian 10%) cannot be satisfied at all from such a universe, even if all the Australians we select are male.

If we change the constraints (subset size: 100; male: 40%; Australian 40%), now we can possibly make a matching subset, but all of the Australians we select must be male. And if we change the constraints again (subset size: 100; male: 20%; Australian 40%), now we can possibly make a matching subset, but only if we don't pick too many Australian women (no more than half in this case).

In this latter case, selection order is going to matter. Depending on our random seed, sometimes we might succeed, and sometimes we might fail.

For this reason, the algorithm must be prepared to retry (and my implementation is). I think of this as a patience test: the question is how many times we are willing to let it fail before we decide that the constraints are not compatible with the sample population.

Suitability:

This method is well suited to the OP's task as described: selecting a random subset which matches given criteria. It is not suitable for answering a slightly different question: "is it possible to form a subset with the given criteria?"

My reasoning for this is simple: the situations in which the algorithm fails to find a subset are those in which the data contains unknown linkages, or where the criteria allow only a very limited number of subsets from the sample universe. In these cases, the use of any such subset would be questionable for statistical analysis, at least without further thought.

But for the purpose of answering the question of whether it's possible to form a subset, this method is non-deterministic and inefficient. It would be better to use one of the more complex shuffle-and-sort algorithms proposed by others.

Pre-Validation:

The immediate thought upon discovering that not all subsets can be satisfied is to perform some initial validation, and perhaps to analyze the data to see whether it's answerable or only conditionally answerable.

My position is that other than initially validating that each of the column rules is valid (i.e., the column percentages sum to 100, or near enough) and that the subset size is less than the universe size, there's no other prior validation which is worth doing. An argument can be made that you might want to check that the universe contains enough individuals with each selected value (e.g., that there actually are 40 males and 60 females in the universe), but I haven't implemented that.

Other than those, any analysis to identify linkages in the population is itself time-consuming enough that you might be better served just running the thing with more retries. Maybe that's just my lack of statistics background talking.

Not quite the subset sum problem:

It has been suggested that this problem is like the subset sum problem. I contend that this is subtly and yet significantly different. My reasoning is as follows: for the subset sum problem, you must form and test a subset in order to answer the question of whether it meets the rules: it is not possible (except in certain edge conditions) to test an individual element before adding it to the subset.

For the OP's question, though, it is possible. As I'll explain, we can randomly select rows and test them individually, because each has a weight of one.

CPerkins