Algorithm may be too strong a word, since to me that implies formalism and publication, but there is a method to select subsets with exact proportions (assuming your percentages yield whole numbers of subjects from the sample universe), and it's much simpler than the other proposed solutions. I've built one and tested it.
Incidentally, I'm sorry to be a slow responder here, but my time is constrained these days. I wrote a hard-coded solution fairly quickly, and since then I've been refactoring it into a decent general-purpose implementation. That's still not complete, but I didn't want to delay answering any longer.
The method:
Basically, you're going to consider each row separately, and decide whether it's selectable based on whether your criteria give you room to select each of its column values.
In order to do that, you'll consider each of your column rules (e.g., 40% males, 60% females) as an individual target (e.g., given a desired subset size of 100, you're looking for 40 males and 60 females). Make a counter for each target.
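For concreteness, here's one way to turn those percentage rules into absolute targets, in Python (the function name and the rule format are my own, not from any library):

```python
def make_targets(rules, subset_size):
    """Convert percentage rules into absolute per-value targets.

    rules: {"sex": {"male": 40, "female": 60}, ...}  (percentages per column)
    Assumes each percentage yields a whole number of subjects, as stated above.
    """
    return {
        col: {value: subset_size * pct // 100 for value, pct in pcts.items()}
        for col, pcts in rules.items()
    }
```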
Then you loop until you've either filled your subset or examined every row in the sample universe without finding a match (see below for what happens then). Here's the loop in pseudocode (a runnable sketch follows the list):
- Randomly select a not-yet-examined row.
- Mark the row examined.
- For each column constraint:
  * Get the value of the relevant column from the row.
  * Test for selectability:
    If there's a target for that value,
    and if we haven't already selected our target number of incidences of that value,
    then the row is selectable with respect to this column.
  * Else: the row fails.
- If the row didn't fail on any column, select it: add it to the subset and increment the counter for each of its column values.
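Here's a minimal Python sketch of that loop. The names (`select_subset`, `targets`, and so on) are my own; it assumes `targets` maps each column to its per-value counts as built above, and it treats "mark examined" as popping from a shuffled pool:

```python
import random

def select_subset(rows, targets, subset_size, rng=random):
    """Try to draw a random subset of `rows` that hits every per-column target.

    Returns the subset on success, or None if the pool is exhausted first.
    """
    # Remaining quota for each (column, value) pair.
    remaining = {col: dict(counts) for col, counts in targets.items()}
    pool = list(rows)
    rng.shuffle(pool)  # popping from a shuffled pool == random selection without repeats
    subset = []

    while pool and len(subset) < subset_size:
        row = pool.pop()  # randomly select a row and mark it examined
        # Selectable only if every constrained column still has room for this
        # row's value; a value with no target (or an exhausted one) fails the row.
        if all(remaining[col].get(row[col], 0) > 0 for col in remaining):
            subset.append(row)
            for col in remaining:
                remaining[col][row[col]] -= 1  # count this incidence against its target

    return subset if len(subset) == subset_size else None
```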
That's the core of it. It will provide a subset which matches your rules, or it will fail to do so... which brings me to what happens when we can't find a match.
Unsatisfiability:
As others have pointed out, it's not always possible to satisfy an arbitrary set of rules against an arbitrary sample universe. Even assuming that the rules are valid (the percentages for each column sum to 100), that the subset size is less than the universe size, and that the universe contains enough individuals with each selected value to hit the targets, selection can still fail if the values are not distributed independently.
Consider the case where all the males in the sample universe are Australian: in this case, you can only select as many males as you can select Australians, and vice versa. So a set of constraints (subset size: 100; male: 40%; Australian: 10%) cannot be satisfied at all from such a universe, even if all the Australians we select are male.
If we change the constraints (subset size: 100; male: 40%; Australian: 40%), now we can possibly make a matching subset, but all of the Australians we select must be male. And if we change them again (subset size: 100; male: 20%; Australian: 40%), we can still possibly make a matching subset, but only if we don't pick too many Australian women (no more than half of the selected Australians, in this case).
In this last case, selection order matters: depending on our random seed, sometimes we'll succeed, and sometimes we'll fail.
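To see that order-dependence in action, here's a toy run against the linked universe above, using the `make_targets` and `select_subset` sketches from earlier (the row counts are fabricated purely for illustration):

```python
# Every male in this toy universe is Australian.
rows = (
    [{"sex": "male", "nationality": "Australian"}] * 50
    + [{"sex": "female", "nationality": "Australian"}] * 50
    + [{"sex": "female", "nationality": "other"}] * 100
)
targets = make_targets(
    {"sex": {"male": 20, "female": 80},
     "nationality": {"Australian": 40, "other": 60}},
    subset_size=100,
)
successes = sum(select_subset(rows, targets, 100) is not None for _ in range(100))
print(f"{successes}/100 attempts produced a matching subset")  # varies run to run
```

A run fails whenever the shuffle hands out more than 20 Australian women before the male quota is filled, which is exactly the "too many Australian women" trap described above.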
For this reason, the algorithm must be prepared to retry (and my implementation is). I think of this as a patience test: the question is how many times we're willing to let it fail before we decide that the constraints are not compatible with the sample population.
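The retry wrapper itself is trivial; `max_attempts` is the patience knob (the name is mine):

```python
def select_subset_with_retries(rows, targets, subset_size, max_attempts=100):
    """Reshuffle and retry until a subset is found or patience runs out."""
    for _ in range(max_attempts):
        subset = select_subset(rows, targets, subset_size)
        if subset is not None:
            return subset
    return None  # most likely the constraints don't fit this universe
```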
Suitability:
This method is well suited to the OP's task as described: selecting a random subset which matches the given criteria. It is not suitable for answering a slightly different question: "is it possible to form a subset with the given criteria?"
My reasoning for this is simple: the situations in which the algorithm fails to find a subset are those in which the data contains unknown linkages, or in which the criteria permit only a very limited number of subsets from the sample universe. In those cases, using any such subset for statistical analysis would be questionable, at least without further thought.
But for the purpose of answering the question of whether it's possible to form a subset, this method is non-deterministic and inefficient. It would be better to use one of the more complex shuffle-and-sort algorithms proposed by others.
Pre-Validation:
The immediate thought upon discovering that not every set of constraints can be satisfied is to perform some initial validation, and perhaps to analyze the data to see whether the constraints are satisfiable, or only conditionally satisfiable.
My position is that, other than initially validating that each column rule is well formed (i.e., its percentages sum to 100, or near enough) and that the subset size is less than the universe size, there's no prior validation worth doing. An argument can be made for also checking that the universe contains enough individuals with each selected value (e.g., that there actually are at least 40 males and 60 females in the universe), but I haven't implemented that.
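For what it's worth, here's a sketch of just those two checks (the tolerance is an arbitrary stand-in for "near enough"):

```python
def validate(rules, subset_size, universe_size, tolerance=0.5):
    """The only up-front validation I consider worth doing."""
    if subset_size >= universe_size:
        raise ValueError("subset size must be less than the universe size")
    for col, pcts in rules.items():
        total = sum(pcts.values())
        if abs(total - 100) > tolerance:
            raise ValueError(f"percentages for column {col!r} sum to {total}, not 100")
```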
Beyond those, any analysis to identify linkages in the population is itself time-consuming enough that you might be better served just running the selection with more retries. Maybe that's just my lack of statistics background talking.
Not quite the subset sum problem:
It has been suggested that this problem is like the subset sum problem. I contend that it is subtly yet significantly different. My reasoning is as follows: in the subset sum problem, you must form and test an entire subset in order to decide whether it meets the rules; it is not possible (except in certain edge cases) to test an individual element before adding it to the subset.
For the OP's question, though, it is possible: as described above, we can randomly select rows and test them individually, because each row counts exactly once against each of its value targets.