Is knowledge of statistical analysis required to become a better programmer? How deep does that knowledge need to be?
It all depends on the software you're developing. I've never needed any statistical knowledge myself, but no doubt many people require it for their work.
Basically, if your job has anything to do with sifting through data, then yes, the deeper the better.
So if you're interested in these domains, the better your understanding of stats, the deeper you can go:
- Business Intelligence (previously known as data mining)
- Scientific modelling
- Infographics (visual representation of data)
In almost every domain of activity you'll find some use for it.
Even if you're not really interested in these topics, having a basic but solid understanding of stats will help you a lot when you need to think about performance issues in your own code:
- being able to draft a test scenario that is meaningful (as opposed to testing something that's not relevant)
- being able to interpret your own tests to make informed decisions about, for instance, where to optimise your code (see the sketch after this list)
- being able to assess marketing speak for yourself, especially when buying hardware
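To make that concrete, here's a minimal sketch in Python, assuming a stand-in slow_function for whatever you actually want to measure: run it many times and summarise the spread, not just a single number.

```python
# Minimal benchmarking sketch.  slow_function is a made-up stand-in for the
# code you actually care about.
import statistics
import time

def slow_function(n=10_000):
    # Stand-in for the code under test.
    return sum(i * i for i in range(n))

samples = []
for _ in range(200):                        # repeat to get a distribution, not a point
    start = time.perf_counter()
    slow_function()
    samples.append(time.perf_counter() - start)

print(f"mean   : {statistics.mean(samples) * 1e3:.3f} ms")
print(f"median : {statistics.median(samples) * 1e3:.3f} ms")
print(f"stdev  : {statistics.stdev(samples) * 1e3:.3f} ms")
# A stdev that is large relative to the mean means the "average" alone is misleading.
```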
I wish I was better at it ;-)
I haven't used it in the last 21 years, but then again, I don't develop systems for insurance companies.
I think very basic knowledge of statistics is useful for almost every programmer, such as the definitions of "average", "median" and "percentile". More than that is only required by those who work in a domain that actually does statistics.
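For what those terms buy you in practice, here's a toy example using Python's standard library; the latency numbers are made up.

```python
import statistics

# Made-up response times in milliseconds; one request went badly wrong.
latencies = [12, 14, 15, 15, 16, 18, 21, 25, 40, 950]

print("average   :", statistics.mean(latencies))    # 112.6 -- dragged up by the outlier
print("median    :", statistics.median(latencies))  # 17.0  -- the "typical" request
# quantiles(n=100) returns the 1st..99th percentile cut points; index 94 is the 95th.
print("95th pct. :", statistics.quantiles(latencies, n=100)[94])
```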
Hehe, I always like questions like these. What is a 'better programmer' exactly? It all depends on the context...
I don't think it is useful to study these things only to become a better programmer. You should study them because you are interested in them and/or because someone asks it of you.
Yes.
It's hard to understand program behavior; programs are mysterious beasts. Even if you wrote a program yourself, you might not understand what kind of performance characteristics it has, and you can forget about really understanding the performance of a program written by twenty developers, fifteen of whom no longer work for the company, including Joe, whose name is widely cursed by those who do maintenance.
Look, you say, the web page responds to a request in an average of 0.5 seconds. That's good enough! But if you look at the data, you find that the standard deviation of the request delay is something like a minute. Something is severely wrong.
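Here's a rough illustration of that failure mode with invented numbers: almost all requests are quick, a handful hang for a minute, and the average still looks perfectly respectable.

```python
import statistics

# Invented request times in seconds: 995 quick responses, 5 that hang for a minute.
requests = [0.2] * 995 + [60.0] * 5

print(f"mean  : {statistics.mean(requests):.2f} s")   # about 0.5 s -- looks fine
print(f"stdev : {statistics.stdev(requests):.2f} s")  # several seconds -- something is wrong
print(f"worst : {max(requests):.1f} s")               # the users who hit this are not happy
```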
The boss says "Our client wants 99.99% uptime, I'm going to write it into the contract." You think about that... how many bugs do you produce per year? What percentage get into the development build? What percentage get into the QA build? How many get deployed? How many crash the server? How long does it take to fix, on average?
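The back-of-envelope arithmetic behind that conversation might look like this; the crash and repair figures are invented, so plug in your own team's numbers.

```python
# How much downtime does each number of nines actually allow?
HOURS_PER_YEAR = 365 * 24

for nines in (0.999, 0.9999, 0.99999):
    downtime_minutes = HOURS_PER_YEAR * (1 - nines) * 60
    print(f"{nines:.3%} uptime allows {downtime_minutes:.1f} minutes of downtime per year")

# Invented estimates: 4 server-crashing bugs reach production per year,
# and each takes about 2 hours to notice and fix.
crashes_per_year = 4
hours_to_fix = 2.0
availability = 1 - (crashes_per_year * hours_to_fix) / HOURS_PER_YEAR
print(f"expected availability: {availability:.4%}")   # about 99.91%, nowhere near 99.99%
```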
That's the next gotcha... there's a standard deviation to how long bug fixing takes. That's statistics knowledge right there. Then, using statistics, you can guess that if the main server is down, there's a chance the hot backup is down too, because the failures are correlated.
Incidentally, that's why a RAID mirror of two drives, each with a 100,000-hour MTBF, really, really doesn't have the 10,000,000,000-hour MTBF you'd theoretically get by assuming the failures are independent... hard drive failures are correlated.
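A rough Monte Carlo sketch of why that correlation matters, with invented failure probabilities: when the drives fail independently the mirror almost never loses data, but add even a small common-cause term (heat, power, the same bad batch) and the "multiply the probabilities" estimate falls apart.

```python
import random

random.seed(1)

# Invented numbers, for illustration only.
P_INDEPENDENT = 0.05   # chance a single drive fails within some time window
P_COMMON_CAUSE = 0.01  # chance a shared event (heat, power, bad batch) kills both
TRIALS = 200_000

def data_lost(correlated: bool) -> bool:
    """The mirror loses data only if both drives fail in the same window."""
    if correlated and random.random() < P_COMMON_CAUSE:
        return True                              # one shared event, both drives gone
    drive_a = random.random() < P_INDEPENDENT
    drive_b = random.random() < P_INDEPENDENT
    return drive_a and drive_b

for correlated in (False, True):
    losses = sum(data_lost(correlated) for _ in range(TRIALS))
    label = "correlated " if correlated else "independent"
    print(f"{label}: data loss in {losses / TRIALS:.4%} of trials")

# Independent drives: roughly 0.05 * 0.05 = 0.25% of windows lose data.
# Even a 1% common-cause term pushes that to roughly 1.25% -- about five times worse.
```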
These aren't deep bits of statistics knowledge, but the stuff is useful.
I don't think that knowing particular facts from statistics will help in programming much at all. But statistical thinking is extremely valuable.
Thinking of quality in absolute terms -- this code is correct, this code is buggy -- doesn't take you very far. At some point you have to think in terms of the probability of a user running into a bug and the consequences if they do. All programs of any significant size have bugs.
"How can we get rid of all the bugs in our software?" You can't. That's not a useful question. "What can we do to lower the probability of a user running into a bug?" Now that's something you can work with.
As the saying goes, testing can only show that there are bugs, but it cannot show that there are no bugs. But testing can give you confidence that the probability of running into a bug is low. Saying "We want to find all the bugs in our code" is noble, but naive. Saying "We want the probability of a user having a problem with X to be less than Y" is something you can work with. You can collect data and estimate the effort required to achieve that target.
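One hedged example of that kind of estimate is the "rule of three": if n independent, representative trials all pass, a rough 95% upper bound on the failure probability is 3/n. The trial counts and target below are invented.

```python
# "Rule of three": after n independent trials with zero failures, an approximate
# 95% upper confidence bound on the failure probability is 3 / n.

def failure_probability_bound(clean_trials: int) -> float:
    return 3.0 / clean_trials

target = 0.001   # invented target: a user hits a problem with X less than 0.1% of the time

for n in (100, 1_000, 10_000):
    bound = failure_probability_bound(n)
    verdict = "meets target" if bound <= target else "not enough evidence yet"
    print(f"{n:>6} clean trials -> failure probability < {bound:.4f} ({verdict})")

# To claim the 0.1% target with ~95% confidence you need on the order of
# 3 / 0.001 = 3000 clean, representative trials -- which tells you how much
# testing effort the target actually costs.
```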
I disagree with several of the other answers. The general consensus seems to be that you only need statistics if you work in a statistics-heavy domain. However, if you work in such a domain, I would argue that the statistics is the domain expert's job, not yours. You don't need to know statistics to translate a Greek-letter formula that the domain expert hands you into an ASCII formula that your compiler understands.
Don't get me wrong: I'm not saying you shouldn't learn your domain. But if you are surrounded by statisticians, it is not that important to be able to solve statistical problems by yourself, because you can always just ask for help.
However, as a programmer, you are constantly dealing with statistics! Programmers are constantly measuring things like performance, scalability, throughput, latency or defect rates, estimating things like required effort to implement a feature, predicting things like release dates and scheduling things like iterations, features or bug fixes. Guess what: statistics is the science of measuring, estimating, predicting and scheduling. (Well, scheduling bleeds over into Operations Research, but statistics is important there, too.)
Basically, if you use words like "performance", "speed", "latency", "reliability", "scalability", "fast", "slow" or something like that, and you have not studied statistics, you are pretty much guaranteed to be wrong.
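To illustrate (with simulated timing data, not a real benchmark), here is the kind of check that separates "version B is faster" from "version B happened to look faster once": compare the distributions, for example with a simple bootstrap of the difference in means.

```python
import random
import statistics

random.seed(42)

# Simulated benchmark timings in seconds for two versions of the same code.
# In real life these come from repeated measurements, not random.gauss.
version_a = [random.gauss(1.00, 0.15) for _ in range(50)]
version_b = [random.gauss(0.95, 0.15) for _ in range(50)]

observed = statistics.mean(version_a) - statistics.mean(version_b)

# Bootstrap a 95% confidence interval for the difference in means.
diffs = []
for _ in range(10_000):
    resample_a = random.choices(version_a, k=len(version_a))
    resample_b = random.choices(version_b, k=len(version_b))
    diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
diffs.sort()
low, high = diffs[249], diffs[9749]   # 2.5th and 97.5th percentiles

print(f"observed difference : {observed:.3f} s")
print(f"95% CI              : [{low:.3f}, {high:.3f}] s")
# If the interval includes 0, this data does not support "B is faster".
```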
Zed Shaw wrote a great rant about this. It is in typical Zed Shaw style, i.e. not for the faint of heart, but it's pretty good: Programmers Need To Learn Statistics Or I Will Kill Them All.
As a programmer working with clinical trials, some stats knowledge is very useful. I don't think it's essential for a normal programmer (if there is such a thing), but it is useful.
Without a knowledge of the likelihood of various events/conditions, it is easy to spend way too much time "optimising" sections of code that are rarely used.
This isn't really statistics, but an appreciation of numbers and probability.
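As a toy sketch of that reasoning, with invented call counts and timings: weight each candidate by how often it actually runs before deciding where the optimisation effort goes.

```python
# Invented profile data: (function name, calls per day, seconds per call).
profile = [
    ("parse_config",          10, 0.5000),   # slow, but runs ten times a day
    ("render_row",     2_000_000, 0.0004),   # fast, but runs millions of times
    ("export_report",         50, 2.0000),
]

# Expected cost per day = how often it runs * how long each call takes.
for name, calls, seconds in sorted(profile, key=lambda p: p[1] * p[2], reverse=True):
    print(f"{name:<15} {calls * seconds:8.1f} s/day")

# render_row dominates (800 s/day versus 100 and 5), even though a single call
# looks negligible -- that's where the optimisation effort pays off.
```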