For those who process data, there is a saying: "If you torture data sufficiently, it will confess to almost anything". This is mathematically supported by Bonferroni's theorem, which states that "as one performs an increasing number of statistical tests, the likelihood of getting an erroneous significant finding (Type I error) also increases". A well-known example is given in Principles of Data Mining: "One particularly humorous example of this type of prediction was provided by Leinweber (personal communication) who achieved almost perfect prediction of annual values of the well-known Standard and Poor 500 financial index as a function of annual values from previous years for butter production, cheese production, and sheep populations in Bangladesh and the United States."
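
To make the quoted statement concrete, here is a minimal sketch of how the chance of at least one false positive grows with the number of tests. It assumes the tests are independent and each uses the same significance level; the function names `familywise_error_rate` and `bonferroni_threshold` are made up for this illustration.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one Type I error among independent tests.

    Assumes every test is independent and run at the same level alpha.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test level that keeps the overall error rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 10, 50, 100):
        print(
            f"{m:>3} tests: P(at least one false positive) = "
            f"{familywise_error_rate(m):.3f}, "
            f"Bonferroni per-test level = {bonferroni_threshold(m):.4f}"
        )
```

With ten independent tests at the 5% level, the chance of at least one spurious "significant" result is already about 40%.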

Have you encountered a practical situation where using an overly complex model led to erroneous results? Can you describe such a situation, together with the approach you used?

+2  A: 

In my experience, the main problem is using statistical methods the wrong way. One common mistake is failing to decide in advance which data will be tested. A professor I once heard compared this to a horse race where you take the finish photo not at a predetermined place, but whenever your horse is in front. This is quite common in medical studies.

Another example I know of is someone running a statistical test that assumed the data was normally distributed - and it wasn't.

Never assume that a statistical dependency is a causal one (e.g. in Frankfurt, there's a dependency between HIV and aircraft noise - that does not mean the noise causes HIV).

Basically, it's not the complexity of the model. You have to use the right methods with the correct data. That's difficult enough. You have to determine your data before you test. If you want to check this out, just do a fairness test on die rolls or coin flips, and rerun it on all the data collected so far after each roll/flip. You'll see that every now and then it shows that your die is not fair. Of course, if you run a large number of independent tests on the fairness of the die, some of them will show it's unfair - but that is the expected error rate of statistical tests.
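
The coin-flip experiment can be sketched roughly as follows. This is only an illustration under my own assumptions: a fair simulated coin, a two-sided z-test using the normal approximation at the 5% level, and made-up names and parameters (`run_experiment`, `max_flips=500`, `min_flips=20`).

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% critical value of the standard normal


def looks_unfair(heads: int, flips: int) -> bool:
    """Normal-approximation z-test of the hypothesis P(heads) = 0.5."""
    z = (heads - flips / 2) / math.sqrt(flips / 4)
    return abs(z) > Z_CRIT


def run_experiment(rng: random.Random, max_flips: int = 500, min_flips: int = 20):
    """Flip a fair coin max_flips times.

    Returns (peeked, fixed):
      peeked - the test rejected fairness at ANY point when rerun after each flip
      fixed  - the test rejected fairness at the single, predetermined end point
    """
    heads = 0
    peeked = False
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        if n >= min_flips and looks_unfair(heads, n):
            peeked = True
    return peeked, looks_unfair(heads, max_flips)


def simulate(trials: int = 1000, seed: int = 0):
    """Fraction of fair coins declared unfair under each procedure."""
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(trials):
        peeked, fixed = run_experiment(rng)
        peek_hits += peeked
        fixed_hits += fixed
    return peek_hits / trials, fixed_hits / trials


peek_rate, fixed_rate = simulate()
print(f"declared unfair when testing after every flip: {peek_rate:.1%}")
print(f"declared unfair when testing once at the end:  {fixed_rate:.1%}")
```

The coin is fair in every trial, so every rejection is a false alarm. Testing once at the predetermined end point should stay near the nominal 5%, while testing after every flip and stopping at the first "significant" result rejects far more often.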

Another very basic point in statistical analysis: be sure what your hypothesis actually says. Sometimes a test cannot prove what you want to show - it can only fail to reject it.

In short - don't do data mining/statistical analysis without some thought and education. The way statistics works is counterintuitive for humans, and you can easily cheat (yourself and others).

Tobias Langner
Excellent answer... thank you.
lmsasu