ansaurus

Question

Answer 1

+3 A:

There's always a possibility that they get close to each other if you do it really random :)

But What I would do is:

You have N rows of real data and x of control data
To get an index of a row you should insert i-th control row, I'd use: N/(x+1) * i + r, where r is some random number, diffrent for each of the control rows, small compared to N/x. Choose any way of determining r, it can be either gaussian or even flat distribution. i is an index of the control row, so it's 1<=i<x
This way you can be sure that you avoid condensation of your control rows in one single place. Also you can be sure that they won't be in regular distances from each other.

kender 2008-10-09 13:53:59

Surely that would mean that if someone could identify just one control row they would be able to identify all other control rows as well? When should i and r change, every row, every time a new control is inserted, or never?

AnturCynhyrfus 2008-10-09 14:47:09

r is random number, diffrent for every control row

kender 2008-10-09 15:19:21

Oh, I see! Thanks for the comment.

AnturCynhyrfus 2008-10-09 15:37:44

Answer 2

A:

Here's my thought. Why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert random data there.

for (int i=0; i<numberOfExistingRows; i++)
{    
    int r = random();
    if (r > 0.5)
    {
        InsertRandomData();
    }    
}

This should give you a nice random distribution throughout the data.

amdfan 2008-10-09 13:54:36

Thanks for the comment, the problem is that I have much less control data than real data, therefore if I use a coin toss I will insert all my data at the beginning of the real data.

AnturCynhyrfus 2008-10-09 14:32:09

Answer 3

A:

Going with the 3000 real data rows and 20 control rows for the following example (I'm better with example than with english)

If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows you'd insert one at each 150th real data row. So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)

Of course this is a very crude algorithm and it needs a few improvements :)

pmg 2008-10-09 15:28:03

Hmmm, interesting, I like where you are going with it. I will work on some ideas based on your suggestions. Thanks. ;-)

AnturCynhyrfus 2008-10-09 15:36:29

Answer 4

A:

If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.

So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?

I'd recommend using a gaussian deviate with mean set to the real data size divided by the control data size, the former of which could be estimated if necessary, rather than measured or assumed known. Set the standard deviation of this gaussian based on how much "spread" you're willing to tolerate. Smaller stddev means a more leptokurtic distribution means tighter adherence to uniform spacing. Larger stdev means a more platykurtic distribution and looser adherence to uniform spacing.

Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit. A symmetric trick at the end of the data should also work quite well (simply: keep generating deviates until you reach an "index" at least half the gaussian mean beyond the end of the real data. If the index just before this was off the end, generate data at the end.

You want to look at more than just statistics: it's helpful in developing an algorithm for this sort of thing to look at rudimentary queueing theory. See wikipedia or the Turing Omnibus, which has a nice, short chapter on the subject whose title is "Simulation".

Also: in some circumstance non-gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outline above still applies using half the mean of whatever distribution seems right.

Thomas Kammeyer 2008-10-09 15:42:56

ansaurus

tags:

views:

answers:

Random distribution of data

related questions