How do I distribute a small amount of data in a random order in a much larger volume of data?

For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.

To be clear, I am not trying to ask how to use random number generators; I am asking a statistical question. I know how to generate random numbers, but my question is how do I ensure that the data is inserted in a random order while at the same time being fairly evenly scattered through the file.

If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?

To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.


Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent, as it is not a coding question but a statistical one).

  • I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
  • I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).

I now want to insert these 20 'control' rows roughly after every 150 rows of 'real' data (3000/20 = 150). However, I do not want it to be exactly that regular, as I do not want the control rows to be identifiable simply from their location in the output data.

Therefore I do not mind if some of the 'control' rows are clumped together, or if some sections have very few or no 'control' rows at all, but in general I want the 'control' rows fairly evenly distributed throughout the data.

+3  A: 

There's always a possibility that they end up close to each other if you do it truly at random :)

But what I would do is:

  1. You have N rows of real data and x rows of control data.
  2. For the index at which to insert the i-th control row, I'd use: N/(x+1) * i + r, where r is a random number, different for each control row and small compared to N/x. Choose any way of determining r; it can be a gaussian or even a flat (uniform) distribution. i is the index of the control row, so 1 <= i <= x (see the sketch after this list).
  3. This way you can be sure to avoid all your control rows condensing in one single place. You can also be sure that they won't be at regular distances from each other.
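
A minimal sketch of that formula in Python (the function name, the uniform choice of r, and the jitter size are illustrative assumptions, not part of the answer):

import random

def insertion_indices(n_real, n_control, jitter):
    # Evenly spaced positions N/(x+1) * i, each nudged by a random offset r.
    # jitter should stay small compared to n_real / n_control.
    step = n_real / (n_control + 1)
    return [int(step * i + random.uniform(-jitter, jitter))
            for i in range(1, n_control + 1)]

# Example: 3000 real rows, 20 control rows, offsets well below 150
print(insertion_indices(3000, 20, 40))
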
kender
Surely that would mean that if someone could identify just one control row, they would be able to identify all the other control rows as well? When should i and r change: every row, every time a new control row is inserted, or never?
AnturCynhyrfus
r is a random number, different for every control row
kender
Oh, I see! Thanks for the comment.
AnturCynhyrfus
A: 

Here's my thought. Why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert control data there?

for (int i = 0; i < numberOfExistingRows; i++)
{
    // assume random() returns a uniform value in [0, 1)
    double r = random();
    if (r > 0.5)   // "flip a coin" for this row
    {
        InsertRandomData();
    }
}

This should give you a nice random distribution throughout the data.

amdfan
Thanks for the comment. The problem is that I have much less control data than real data, so if I use a coin toss, all of my control data will be inserted near the beginning of the real data.
AnturCynhyrfus
A: 

Going with the 3000 real data rows and 20 control rows for the following example (I'm better with examples than with English):

If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows, you'd insert one at every 150th real data row. So pick that number, 150, as the initial insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index.
b) Insert the control row there.
c) Increase the insertion index by 150.
d) Repeat from step a).

Of course this is a very crude algorithm and it needs a few improvements :)
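
A rough sketch of those steps in Python (the variable names and the uniform choice of the offset are illustrative assumptions):

import random

n_real, n_control = 3000, 20
spacing = n_real // n_control                  # one control row per ~150 real rows

positions = []
insertion_index = spacing
for _ in range(n_control):
    # a) random number between 0 and 150, subtracted from the insertion index
    positions.append(insertion_index - random.randint(0, spacing))
    # c) advance the insertion index by 150 for the next control row
    insertion_index += spacing

print(positions)   # row numbers at which the control rows would be inserted
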

pmg
Hmmm, interesting, I like where you are going with it. I will work on some ideas based on your suggestions. Thanks. ;-)
AnturCynhyrfus
A: 

If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.

So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?

I'd recommend using a gaussian deviate with the mean set to the real data size divided by the control data size (the former could be estimated if necessary, rather than measured or assumed known). Set the standard deviation of this gaussian based on how much "spread" you're willing to tolerate: a smaller standard deviation means tighter adherence to uniform spacing, while a larger one means looser adherence.

Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit. A symmetric trick at the end of the data should also work quite well (simply: keep generating deviates until you reach an "index" at least half the gaussian mean beyond the end of the real data; if the index just before this one was off the end, insert the control data at the end).
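
A minimal sketch of the interarrival idea in Python, assuming list-of-rows inputs and gaussian gaps; the start/end trick above is omitted for brevity, and the names and the choice of standard deviation are illustrative assumptions:

import random

def interleave(real_rows, control_rows, spread=0.25):
    mean = len(real_rows) / len(control_rows)   # average gap between control rows
    stddev = spread * mean                      # how much "spread" you tolerate
    output, cursor = [], 0
    for ctrl in control_rows:
        gap = max(0, int(random.gauss(mean, stddev)))
        output.extend(real_rows[cursor:cursor + gap])   # copy out that many real rows
        output.append(ctrl)                             # then insert one control row
        cursor += gap
    # any remaining real rows; if the gaps overshot, leftover control rows
    # cluster at the end, which is the edge case the trick above addresses
    output.extend(real_rows[cursor:])
    return output
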

You want to look at more than just statistics: in developing an algorithm for this sort of thing it's helpful to look at rudimentary queueing theory. See Wikipedia or The Turing Omnibus, which has a nice, short chapter on the subject titled "Simulation".

Also: in some circumstances non-gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outlined above still applies, using half the mean of whatever distribution seems right.
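
If you swap in Poisson-distributed gaps as suggested, only the deviate generation in the sketch above changes (assuming NumPy is available; this line is an illustration, not part of the answer):

import numpy as np

# draw the gap from a Poisson distribution instead of a gaussian
gap = int(np.random.poisson(lam=mean))
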

Thomas Kammeyer