views:

620

answers:

11

What is the best algorithm to take a long sequence of integers (say 100,000 of them) and return a measurement of how random the sequence is?

The function should return a single result, say 0 if the sequence is not all all random, up to, say 1 if perfectly random. It can give something in-between if the sequence is somewhat random, e.g. 0.95 might be a reasonably random sequence, whereas 0.50 might have some non-random parts and some random parts.

If I were to pass the first 100,000 digits of Pi to the function, it should give a number very close to 1. If I passed the sequence 1, 2, ... 100,000 to it, it should return 0.

This way I can easily take 30 sequences of numbers, identify how random each one is, and return information about their relative randomness.

Is there such an animal?

+1  A: 

There are many such animals. See this article in Wikipedia for some examples.

mobrule
A: 

As per Knuth, make sure you test the low-order bits for randomness, since many algorithms exhibit terrible randomness in the lowest bits.

Loadmaster
+7  A: 

Your question answers itself. "If I were to pass the first 100,000 digits of Pi to the function, it should give a number very close to 1", except the digits of Pi are not random numbers so if your algorithm does not recognise a very specific sequence as being non random then its not very good.

The problem here is there are many types of non random-ness:- eg. "121,351,991,7898651,12398469018461" or "33,27,99,3000,63,231" or even "14297141600464,14344872783104,819534228736,3490442496" are definitely not random.

I think what you need to do is identify the aspects of randomness that are important to you- distribution, distribution of digits, lack of common factors, expected number of primes, fibonacci and other "special" numbers etc. etc.

PS. The Quick and Dirty (and very effective) test of randomness is does the file end up roughly the same size after you gzip it.

James Anderson
I am puzzled how you can say that the digits of Pi or not random. It may be true that Pi's randomness within its first hundred million digits might not be as effective for certain applications like data encryption as some other random generators (See: http://www.sciencedaily.com/releases/2005/04/050427094258.htm ), but I've never seen anything that has ever declared the digits of Pi to be non-random.
lkessler
+1 for "identify the aspects of randomness that are important to you". If it's random then it will pass tests for randomness; but the converse doesn't hold -- there's no test that can verify randomness, for example, one could have very strong correlations between elements far apart and one would generally have to test explicitly for this. In fact, I like this so much I'll write it as my own answer...
tom10
pi is not a random sequence of digits, its a very spcific sequence of digits - its long and does not contain any significant repititions - but it is always the same sequence.
James Anderson
re pi: The link on statistical randomness that JohnFx points out clears up the issue about pi. http://stackoverflow.com/questions/1474382/a-good-and-simple-measure-of-randomness
tom10
@lkessler: What are you using the random numbers for? If it's for encryption, as you suggest, then you've lost as soon as the cryptologist realizes you're using sequences from pi.
David Thornley
I don't plan to use the digits from Pi. I just included that as an example. The goal is that we want to be able to classify series of numbers by the degree of randomness for various uses.
lkessler
Would like to point out that the "gzip" method is actually pretty good. The dictionary building algorithm used is very good at picking up patterns you didnt know were there.
James Anderson
+2  A: 

What you seek doesn't exist, at least not how you're describing it now.

The basic issue is this:
If it's random then it will pass tests for randomness; but the converse doesn't hold -- there's no test that can verify randomness.

For example, one could have very strong correlations between elements far apart and one would generally have to test explicitly for this. Or one could have a flat distribution but generated in a very non-random way. Etc, etc.

In the end, you need to decide on what aspects of randomness are important to you, and test for these (as James Anderson describes in his answer). I'm sure if you think of any that aren't obvious how to test for, people here will help.

Btw, I usually approach this problem from the other side: I'm given some set of data that looks for all I can see to be completely random, but I need to determine whether there's a pattern somewhere. Very non-obvious, in general.

tom10
A: 

It can be done this way:

CAcert Research Lab does a Random Number Generator Analysis.

Their results page evaluates each random sequence using 7 tests (Entropy, Birthday Spacing, Matrix Ranks, 6x8 Matrix Ranks, Minimum Distance, Random Spheres, and the Squeeze). Each test result is then color coded as one of "No Problems", "Potentially deterministic" and "Not Random".

So an function can be written that accepts a random sequence and does the 7 tests. If any of the 7 tests are "Not Random" then the function returns a 0. If all of the 7 tests are "No Problems", then it returns a 1. Otherwise, it can return some number in-between based on how many tests come in as "Potentially Deterministic".

The only thing missing from this solution is the code for the 7 tests.

lkessler
+2  A: 

As others have pointed out, you can't directly calculate how random a sequence is but there are several statistical tests that you could use to increase your confidence that a sequence is or isn't random.

The DIEHARD suite is the de facto standard for this kind of testing but it neither returns a single value nor is it simple.

ENT - A Pseudorandom Number Sequence Test Program, is a simpler alternative that combines 5 different tests. The website explains how each of these tests works.

If you really need just a single value, you could pick one of the 5 ENT tests and use that. The Chi-Squared test would probably be the best to use, but that might not meet the definition of simple.

Bear in mind that a single test is not as good as running several different tests on the same sequence. Depending on which test you choose, it should be good enough to flag up obviously suspicious sequences as being non-random, but might not fail for sequences that superficially appear random but actually exhibit some pattern.

Dan Dyer
+1  A: 

In Computer Vision when analysing textures, the problem of trying to gauge the randomness of a texture comes up, in order to segment it. This is exactly the same as your question, because you are trying to determine the randomness of a sequence of bytes/integers/floats. The best discussion I could find of image entropy is http://www.physicsforums.com/showthread.php?t=274518 .

Basically, its the statistical measure of randomness for a sequence of values.

I would also try autocorrelation of the sequence with itself. In the autocorrelation result, if there is no peaks other than the first value that means there is no periodicity to your input.

ldog
+1  A: 

You could try to zip-compress the sequence. The better you succeed the less random the sequence is.

Thus, heuristic randomness = length of zip-code/length of original sequence

ragnarius
That's an interesting idea.
lkessler
Thanks, I was inspired by Kolmogorov complexity. According to Kolmogorov a sequence is random if it cannot be produced by an algorithm which is shorter than the sequence. For instance, PI is not random because it can be produced by a short algorithm.
ragnarius
A: 

"How random is this sequence?" is a tough question because fundamentally you're interested in how the sequence was generated. As others have said it's entirely possible to generate sequences that appear random, but don't come from sources that we'd consider random (e.g. digits of pi).

Most randomness tests seek to answer a slightly different questions, which is: "Is this sequence anomalous with respect to a given model?". If you're model is rolling ten sided dice, then it's pretty easy to quantify how likely a sequence is generated from that model, and the digits of pi would not look anomalous. But if your model is "Can this sequence be easily generated from an algorithm?" it becomes much more difficult.

job
No, I'm really asking this: I've got a series of numbers. How random is the series? I may not know or don't really care how it was generated. I just want to know if it is random or not.
lkessler
My point was you have to define random with some sort of model.
job
A: 
Szere Dyeri
This is definitely the right idea! +1
lkessler
Hmmmm.... so what's the entropy of 111111111122222222223333333333444444444455555555556666666666777777777788888888889999999999 ?
tom10
Good point, Tom. Entropy itself won't work.
lkessler
To be clear, I think entropy could be useful, but one can't just slap the formula onto the data. In the repeated digits example, although the distribution has high entropy (for the digits 1-9), the difference between adjacent digits has very low entropy (i.e. not random, as one can clearly see). But here again, you're back to defining what you mean by randomness, and testing directly for that feature. (Also, entropy is one of the trickier measures and takes some study to get right, so it's better avoided unless one really needs it.)
tom10
Tom you are right, entropy itself will not work. Actually this a stochastic process and you need to entropy rate of it (http://en.wikipedia.org/wiki/Entropy_rate). I do not know how to calculate it. But somehow it needs to have a high entropy as well as some other properties. Your example clear has a high entropy but still the next number in the chain is predictable.
Szere Dyeri
I disagree. The sequence given is very "random" given the test that we are putting it through. If you instead did an autocorrelation test the sequence would have many peaks and would be deemed not very random. Don't confuse statistical randomness with the idea of a random variable. Statistical randomness is just that, its statistically checking a sequence for randomness based on a chosen test.
ldog
Also, do not assume that the intuitive concept of randomness and what a random variable is one and the same.
ldog
+1  A: 

@JohnFx "... mathematically impossible."

poster states: take a long sequence of integers ...

Thus, just as limits are used in The Calculus, we can take the value as being the value - the study of Chaotics shows us finite limits may 'turn on themselves' producing tensor fields that provide the illusion of absolute(s), and which can be run as long as there is time and energy. Due to the curvature of space-time, there is no perfection - hence the op's "... say 1 if perfectly random." is a misnomer.

{ noted: ample observations on that have been provided - spare me }

According to your position, given two byte[] of a few k, each randomized independently - op could not obtain "a measurement of how random the sequence is" The article at Wiki is informative, and makes definite strides dis-entagling the matter, but

In comparison to classical physics, quantum physics predicts that the properties of a quantum mechanical system depend on the measurement context, i.e. whether or not other system measurements are carried out.

A team of physicists from Innsbruck, Austria, led by Christian Roos and Rainer Blatt, have for the first time proven in a comprehensive experiment that it is not possible to explain quantum phenomena in non-contextual terms.

Source: Science Daily

Let us consider non-random lizard movements. The source of the stimulus that initiates complex movements in the shed tails of leopard geckos, under your original, corrected hyper-thesis, can never be known. We, the experienced computer scientists, suffer the innocent challenge posed by newbies knowing too well that there - in the context of an un-tainted and pristine mind - are them gems and germinators of feed-forward thinking.

If the thought-field of the original lizard produces a tensor-field ( deal with it folks, this is front-line research in sub-linear physics ) then we could have "the best algorithm to take a long sequence" of civilizations spanning from the Toba Event to present through a Chaotic Inversion". Consider the question whether such a thought-field produced by the lizard, taken independently, is a spooky or knowable.

"Direct observation of Hardy's paradox by joint weak measurement with an entangled photon pair," authored by Kazuhiro Yokota, Takashi Yamamoto, Masato Koashi and Nobuyuki Imoto from the Graduate School of Engineering Science at Osaka University and the CREST Photonic Quantum Information Project in Kawaguchi City

Source: Science Daily

( considering the spooky / knowable dichotomy )

I know from my own experiments that direct observation weakens the absoluteness of perceptible tensors, distinguishing between thought and perceptible tensors is impossible using only single focus techniques because the perceptible tensor is not the original thought. A fundamental consequence of quantaeus is that only weak states of perceptible tensors can be reliably distinguished from one another without causing a collapse into a unified perceptible tensor. Try it sometime - work on the mainifestation of some desired eventuality, using pure thought. Because an idea has no time or space, it is therefore in-finite. ( not-finite ) and therefore can attain "perfection" - i.e. absoluteness. Just for a hint, start with the weather as that is the easiest thing to influence ( as least as far as is currently known ) then move as soon as can be done to doing a join from the sleep-state to the waking-state with virtually no interruption of sequential chaining.

There is an almost unavoidable blip there when the body wakes up but it is just like when the doorbell rings, speaking of which brings an interesting area of statistical research to funding availability: How many thoughts can one maintain synchronously? I find that duality is the practical working limit, at triune it either breaks on the next thought or doesn't last very long.

Perhaps the work of Yokota et al could reveal the source of spurious net traffic...maybe it's ghosts.

Nicholas Jordan