Just as a note, not as criticism: your question should really be formulated differently, as "what statistics should any person know?".
The fact is, like it or not, we all deal with statistics. It's a fact of life: polls, weather forecasts, drug effectiveness, insurance, and of course some parts of computer science. Being able to critically analyze the data you are presented with is the line between picking up the right understanding and getting scammed, whatever that means.
That said, I think the following points are important to understand:
- mean, median, and standard deviation of a sample, and the difference between a sample and a population (this is very important; see the sketch after this list)
- distributions, and why the Gaussian distribution is so important (the central limit theorem)
- what is meant by null-hypothesis testing
- what variable transformation, correlation, regression, and multivariate analysis are
- what Bayesian statistics is
- plotting methods
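To make the sample-versus-population point concrete, here is a minimal sketch in Python with NumPy (my choice for illustration, not something your question asked for; the numbers are made up). The only mechanical difference is the `ddof` parameter, but conceptually it is the difference between describing exactly the data you have and estimating a property of the larger population the data came from.

```python
import numpy as np

# Hypothetical measurements: response times (ms) of 10 requests sampled
# from a much larger population of requests we never observe in full.
sample = np.array([102.0, 98.5, 110.2, 95.3, 101.7, 99.9, 105.4, 97.1, 103.8, 100.6])

mean = sample.mean()
median = np.median(sample)

# Population standard deviation: treats the data as the whole population (ddof=0).
std_population = sample.std(ddof=0)

# Sample standard deviation: estimates the spread of the underlying population
# the sample was drawn from (ddof=1, Bessel's correction).
std_sample = sample.std(ddof=1)

print(f"mean={mean:.2f}  median={median:.2f}")
print(f"population std={std_population:.2f}  sample std={std_sample:.2f}")
```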
All these points are critical not only to you as a computer scientist, but also as a human being. I will give you some examples.
Evaluation of the null hypothesis is critical for testing the effectiveness of a method: for example, whether a drug works, or whether a fix to your hardware had a concrete result or it's just a matter of chance. Say you want to improve the speed of a machine, and you change the hard drive. Does this change matter? You could sample performance with the old and the new hard disk and check for differences. Even if you find that the average with the new disk is lower, that does not mean the hard disk has any effect at all. Here enters null-hypothesis testing, and it will give you a confidence level, not a definitive answer, like: there's a 90% probability that changing the hard drive has a concrete effect on the performance of your machine.
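As a minimal sketch of how you might run such a test in Python with SciPy (the timings are invented, and a real benchmark would need many more samples and care about warm-up effects, see further down):

```python
import numpy as np
from scipy import stats

# Hypothetical boot times in seconds measured with the old and the new hard disk.
old_disk = np.array([41.2, 39.8, 42.5, 40.1, 41.9, 40.7, 43.0, 39.5])
new_disk = np.array([38.9, 40.2, 37.8, 39.5, 38.1, 40.0, 37.5, 38.7])

# Null hypothesis: the disk change has no effect (both samples come from the
# same distribution). Welch's t-test does not assume equal variances.
t_stat, p_value = stats.ttest_ind(old_disk, new_disk, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference is unlikely to be due to chance alone (at the 5% level).")
else:
    print("We cannot reject the null hypothesis: maybe the disk makes no difference.")
```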
Correlation is important for finding out whether two entities "change alike". As the internet mantra "correlation is not causation" teaches, it should be taken with care. The fact that two random variables show correlation does not mean that one causes the other, nor even that they are related through a third variable (which you are not measuring); they could just happen to behave in the same way. Look up pirates and global warming to understand the point. A correlation reports a possible signal; it does not report a finding.
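As an illustration (a sketch with invented data, not a real analysis), computing a Pearson correlation coefficient in Python is one line; deciding what it means is the hard part:

```python
import numpy as np
from scipy import stats

# Two made-up yearly series: number of pirates and global average temperature.
# They move together, but neither causes the other.
pirates = np.array([45000, 20000, 15000, 5000, 400, 17])
temperature = np.array([14.2, 14.4, 14.5, 14.8, 15.1, 15.9])

r, p_value = stats.pearsonr(pirates, temperature)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# A strong negative correlation -- and still no causation whatsoever.
```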
Bayesian statistics. We all know the spam filter, but there's more. Suppose you go for a medical checkup and the result says you have cancer (I seriously hope not, but it's to illustrate a point). The fact is: most people at this point would think "I have cancer". That's not true. A positive test for cancer moves your probability of having cancer from the baseline for the population (say, 8 per thousand people, a number picked out of thin air) to a higher value, which is not 100%. How high this number is depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the test, the higher the skew, but still not 100%.
Of course, if multiple independent tests all confirm that you have cancer, then it's very probable you actually have it, but it's still not 100%. Maybe it's 99.999%. This is a point many people don't understand about Bayesian statistics.
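Here is the arithmetic spelled out as a small sketch (the base rate and the test accuracies are invented, like the numbers above):

```python
# Bayes' theorem for the cancer-test example, with made-up numbers.
prior = 0.008            # 8 in 1000 people have cancer (base rate)
sensitivity = 0.95       # P(test positive | cancer)
false_positive = 0.05    # P(test positive | no cancer)

def posterior(prior, sensitivity, false_positive):
    """P(cancer | positive test) via Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive * (1 - prior)
    return sensitivity * prior / p_positive

p = prior
for test in range(1, 4):
    p = posterior(p, sensitivity, false_positive)
    print(f"after {test} independent positive test(s): P(cancer) = {p:.3%}")

# Even after several positive tests the probability approaches,
# but never reaches, 100%.
```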
Plotting methods. That's another thing that is always neglected. Analysis of data does not mean anything if you cannot effectively convey what it means through a simple plot. Depending on what information you want to bring into focus, or the kind of data you have, you will prefer an x-y plot, a histogram, a violin plot, or a pie chart.
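A tiny sketch with matplotlib (my choice here; any plotting library will do, and the latency data is generated on the spot): the same data shown both as raw points and as a histogram highlights different things.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=3.0, sigma=0.4, size=500)   # made-up latency data (ms)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Raw measurements in order: good for spotting trends and outliers.
ax1.plot(latencies, ".", alpha=0.5)
ax1.set_xlabel("request #")
ax1.set_ylabel("latency (ms)")
ax1.set_title("raw measurements")

# Histogram: good for showing the shape of the distribution (skew, spread, modes).
ax2.hist(latencies, bins=40)
ax2.set_xlabel("latency (ms)")
ax2.set_title("distribution")

plt.tight_layout()
plt.show()
```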
Now, let's get to your questions. I think I overindulged in what started as just a quick note, but since my answer was voted up quite a lot, I feel it's better if I answer your questions properly, as far as my knowledge allows (and it's vacation here, so I can indulge over it as much as I want).
> What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?
Normally, everything that has to do with data comparison involving numerical (or reducible to numerical) input from unreliable sources: a signal from an instrument, a bunch of pages and the number of words they contain. When you get these data and have to distill an answer out of the bunch, then you need statistics. Think, for example, of the algorithm that performs click detection on the iPhone. You are using a trembling, fat stylus to point at an icon which is much smaller than the stylus itself. Clearly, the hardware (the capacitive screen) will send you a bunch of data about the finger, plus a bunch of data about random noise (air? I don't know how it works). The driver must make sense of this mess and give you an (x, y) coordinate on the screen. That needs (a lot of) statistics.
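I obviously don't know how the real driver works, but as a toy sketch of the idea: given a cloud of noisy readings, a robust estimate (here, discarding points far from the median before averaging) already gives a usable (x, y). Everything here, including the rejection rule and its threshold, is an invented illustration.

```python
import numpy as np

def estimate_touch(points, k=2.5):
    """Toy estimator: reject readings far from the median, then average.

    points: (n, 2) array of noisy (x, y) readings from the screen.
    k: how many median absolute deviations away a reading may be.
    """
    points = np.asarray(points, dtype=float)
    median = np.median(points, axis=0)
    dist = np.linalg.norm(points - median, axis=1)
    mad = np.median(dist) + 1e-9              # avoid division by zero
    kept = points[dist < k * mad]
    return kept.mean(axis=0)

# Made-up readings: a cluster around the icon plus a noise spike.
readings = [(100, 201), (103, 198), (99, 202), (101, 200), (250, 40), (102, 199)]
print(estimate_touch(readings))   # roughly (101, 200)
```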
> What kind of statistical methods should I spend my time learning?
The ones I listed above are more than enough, also because to understand them you have to walk through other material anyway.
> What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.
I learned statistics mostly from standard university courses. My first book was the "train wreck book", and it's very good. I also tried this one, which focuses on R, but it did not satisfy me particularly: you have to already know the material (and R) to get through it.
> Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?
That depends on the question you need to answer using your dataset.
> Programmers are frequently asked to produce high-performance systems, that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?
There are a lot of issues with measuring. Measuring is a fine and delicate art. Proper measuring is almost beyond human. The fact is that sampling introduces bias, either from the sampler, or from the method, or from the nature of the sample, or from the nature of nature. A good sampler knows these things and tries to reduce unwanted bias as much as possible, and to push what remains toward a random distribution.
The examples from the blog you posted are relevant. Say you have a startup time for a database. If you take performance measurements within that time, all your measurements will be biased. There's no statistical method that can tell you this; only your knowledge of the system can.
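A small sketch of the kind of thing only domain knowledge can tell you to do: drop the warm-up measurements before computing any statistics. Both the timings and the cutoff below are made up; the point is that the cutoff is a judgment call about the system, not something the data alone can decide.

```python
import numpy as np

# Made-up query times (ms): the first requests hit database startup / a cold cache.
timings = np.array([950, 870, 400, 120, 95, 102, 98, 101, 97, 103, 99, 100])

WARMUP = 4                      # how many initial samples to discard:
steady = timings[WARMUP:]       # chosen from knowledge of the system,
                                # not from any statistical test.

print(f"naive  mean = {timings.mean():.1f} ms  (biased by startup)")
print(f"steady mean = {steady.mean():.1f} ms  +/- {steady.std(ddof=1):.1f}")
```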
> Are there other problems commonly encountered by programmers that would benefit from a statistical approach?
Every time you have an ensemble of data producers, you have statistics, so scientific computing and data analysis are obvious places. Folksonomy and social networking are pretty much all statistics. Even stackoverflow is, in some sense, statistical. The fact that an answer is highly voted does not mean that it's the right one. It means that there's a high probability that it is right, according to the evaluation of a statistical ensemble of independent evaluators. How these evaluators behave makes the difference between stackoverflow, reddit, and digg.