I'm a programmer with a decent background in math and computer science. I've studied computability, graph theory, linear algebra, abstract algebra, algorithms, and a little probability and statistics (through a few CS classes) at an undergraduate level.

I feel, however, that I don't know enough about statistics. Statistics is increasingly useful in computing: statistical natural language processing helps fuel some of Google's algorithms for search and machine translation; performance analysis of hardware, software, and networks needs proper statistical grounding to be at all believable; and fields like bioinformatics are becoming more prevalent every day.

I've read about how "Google uses Bayesian filtering the way Microsoft uses the if statement", and I know the power of even fairly naïve, simple statistical approaches to problems from Paul Graham's A Plan for Spam and Better Bayesian Filtering, but I'd like to go beyond that.

I've tried to look into learning more statistics, but I've gotten a bit lost. The Wikipedia article has a long list of related topics, but I'm not sure which I should look into. I feel like, from what I've seen, a lot of statistics assumes that everything is a linear combination of factors plus some Gaussian noise; I'm wondering what I should learn beyond linear regression, or whether I should spend the time to really understand that before moving on to other techniques. I've found a few long lists of books to look at; where should I start?

So I'm wondering where to go from here; what to learn, and where to learn it. In particular, I'd like to know:

  1. What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?
  2. What kind of statistical methods should I spend my time learning?
  3. What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

To clarify what I am looking for, I am interested in what problems that programmers typically need to deal with can benefit from a statistical approach, and what kind of statistical tools can be useful. For instance:

  • Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?
  • More generally, artificial intelligence has been moving away from discrete, symbolic approaches and towards statistical techniques. What statistical AI approaches have the most to offer now, to the working programmer (as opposed to ongoing research that may or may not provide concrete results)?
  • Programmers are frequently asked to produce high-performance systems that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?
  • Simulation of physical systems, such as in computer graphics, frequently involves a stochastic approach.
  • Are there other problems commonly encountered by programmers that would benefit from a statistical approach?
+7  A: 

More probability than statistics, but Bayesian probability can be very useful (it underpins spam filters), and IMO more software should use it to infer a user's habits.
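
To make that concrete, here is a minimal sketch in R of the Bayes-rule arithmetic behind such a filter; the word probabilities and the spam prior are invented purely for illustration:

    # Toy naive-Bayes spam scoring; all numbers are made up.
    p_spam   <- 0.4                                  # prior: fraction of mail that is spam
    p_ham    <- 1 - p_spam
    p_w_spam <- c(viagra = 0.30, meeting = 0.02)     # P(word | spam), invented
    p_w_ham  <- c(viagra = 0.001, meeting = 0.20)    # P(word | ham), invented

    spam_posterior <- function(words) {
      # naive Bayes: treat word occurrences as independent given the class
      like_spam <- p_spam * prod(p_w_spam[words])
      like_ham  <- p_ham  * prod(p_w_ham[words])
      like_spam / (like_spam + like_ham)
    }

    spam_posterior(c("viagra"))             # very likely spam
    spam_posterior(c("viagra", "meeting"))  # the evidence pulls in both directions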

Head First Statistics is an excellent book for learning statistics (a mathematician/statistician informs me that it has not so much a few errors as a few simplifications of the theoretical stuff).

I almost forgot to mention: How to Lie with Statistics

Mitch Wheat
"more software should use it to infer a user's habits." no no no no no no no no no no no no no no no no no. no.
Breton
@Breton: so you like dumb software?
Mitch Wheat
I like software that doesn't randomly shift and change its interface because it thinks it's cleverer than me. I do not like software that creepily targets ads at me and broadcasts demographic information to its creator. I like smart software, but what you suggest is dumb yet hideously smug software. Ever notice how in newer versions of Windows (since XP SP2?) the icons in the start menu shift over time? Newer versions of Office hide most of the menu items except the ones it thinks you need? That annoys the hell out of me because it's horribly disruptive to habit forming. It doesn't work.
Breton
@Breton: and where exactly did I suggest "dumb, but hideously smug software"? I wasn't suggesting something like Office...
Mitch Wheat
@Mitch Can you suggest more domains in which you think Bayesian Probability could help by inferring the user's habits? I have the same sort of reaction as Breton to so-called "smart" UI features like the rearranging of the Start menu, but I'd be interested in domains in which you think Bayesian inference could actually be helpful. Netflix recommendations may be one; I like having good movies suggested to me, based on my ratings of previous movies. Anything else?
Brian Campbell
The way MS moves UI elements around is a terrible paradigm, IMO. A much better approach would be to use logic to determine where users 'expect' certain UI elements to be. This could be somewhat Bayesian in that one could ask, "where does the user expect the X widget, conditional upon the user having just done task Y?" Knowing those expectations does not mean that a UI has to be inconsistent, but maybe that elements are added. I think an example of this done properly is when MS Word 2007+ provides a floating text feature window upon highlighting of text.
JD Long
@JD Long: I think that such moments cannot properly be determined algorithmically. It must be done with a keen understanding of the user's psychology, and with a bit of personality and art to it.
Breton
@You guys: Bayesian is beautiful, but so is nuclear fusion. If only the people using it were half as intelligent as the people who discovered it.
Mike Dunlavey
+2  A: 

You can do quite a bit with mean and standard deviation.

It depends entirely on what problems you're going to be working on.
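
As a rough illustration of how far those two numbers can take you, here is a tiny base-R sketch with made-up request latencies:

    # Made-up request latencies in milliseconds
    latency <- c(12, 14, 13, 15, 11, 13, 120, 14, 12, 13)

    m <- mean(latency)
    s <- sd(latency)
    c(mean = m, sd = s)

    # Flag observations more than 2 standard deviations from the mean
    latency[abs(latency - m) > 2 * s]

    summary(latency)   # the median tells a rather different story than the mean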

John at CashCommons
+1 Understanding the basics is key. After that it really depends on the problems you're going to be solving, as John said. For example, public health statistics and quantum mechanics will use the same basic principles but the advanced stuff will differ greatly. And you'll be working on one or the other. If both, I salute you, mad genius!
Paul Sasik
I sort of disagree, or would at least repeat the warning that mean and std dev are useful ... when you have approximately normal data, i.e. symmetry around the mean and tails that are neither too fat nor too skinny. I would always recommend _visualizing_ the data first to check this. Maybe you need to transform, or maybe you need different approaches.
Dirk Eddelbuettel
I want a bumper sticker that says "leptokurtosis... I'm in it for the fat tails"
JD Long
+5  A: 

One good resource about programming is "Artificial Intelligence: A Modern Approach" by Russell and Norvig. It can be a really useful resource to understand statistics-based machine learning techniques.

Amit Kumar
+5  A: 

Great question! I actually think it is worthwhile to step back for a minute and get to the broader picture. E.g. what I liked in Zed's rant was near the beginning:

I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can’t blame them since they were probably told in college that logic and reason are superior to evidence and observation.

which to me stresses the need for empiricism. Of course, I hear you say, you knew that and that is why you profile. Well, yes, but there really is more to it than that. Zed comes back to this in the rant about averages, and I think it rings true: show distributions, plot the data, look at tail behaviour.

So what I am trying to get at is that the answer is not so much in a single book, but more in a way of thinking about problems, of seeing the world as probabilistic.

And I too find that R helps a ton in thinking and programming with and about data.

Dirk Eddelbuettel
Yes, I downloaded R, and started to play with it, and then I realized I didn't know what problems I should try solving with it, and what approaches I should use. So, some pointers on how to get started using R to solve real problems would be appreciated!
Brian Campbell
That is a really tough one. You could try at CRAN (e.g. at the Task Views) with a topic you like -- say machine learning, or cluster analysis -- and just try to drill into a topic. But it is hard, as this often assumes grad-school familiarity with the particular discipline's jargon.
Dirk Eddelbuettel
Or alternatively, if you have a problem in mind, maybe we can guide to some R packages or tutorials...
Dirk Eddelbuettel
Two areas of interest to me would be statistical natural language processing, and performance testing of complex systems under heavy load. So, for NLP, let's say I have a corpus of documents, and would like to extract from each document the most interesting words and phrases. For performance testing, I would like to try out a few different programming languages and frameworks on a simple message queueing and processing task, and see how they handle a high load, so I would like to design a test with appropriate controls, perform the test, and analyze the results for statistical significance.
Brian Campbell
Can't help with NLP -- but the performance testing strikes me as a great way to get started. Collect some data on performance as well as some variables you vary and control and you try to get your feet wet with some exploratory data analysis, simple models ... and then go from there. Or does that sound too trivial?
Dirk Eddelbuettel
+2  A: 

It just depends on the area you are working in. As an example, if you are working on applications that involve sampling and data analysis, areas like distributions (normal, t, and chi-square) will be useful. And if your application is something like prediction software, you may need knowledge of distributions like the Poisson as well.

If your tool is going to make decisions based on previous data, the ideas of mean, variance, and standard deviation might be useful (along with hypothesis testing).
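
For example, a small base-R sketch of the Poisson case, with an invented request rate:

    # Suppose a server averages 4 requests per second (an invented figure).
    # If arrivals are roughly Poisson, how unusual is a second with 10+ requests?
    lambda <- 4
    ppois(9, lambda, lower.tail = FALSE)   # P(X >= 10)

    # Simulate a minute of traffic and compare sample mean and variance
    set.seed(1)
    x <- rpois(60, lambda)
    c(mean = mean(x), var = var(x))        # for a Poisson these should be close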

Update: Most universities provide courses on statistics. I've seen some lecture notes that are short but still good. Example

Chathuranga Chandrasekara
Can you provide some references for learning more about these techniques, and when they would be useful?
Brian Campbell
+3  A: 

I hope it's ok with Mr. Shaw and everyone else if most of us programmers never need to know anything about statistics, or probability, or much mathematics at all.

That's been my experience in the last 30 years, despite excellent grades in math.

So, maybe the title of this question should be, "What statistics should a programmer know if he needs to know statistics?"

John Saunders
Computer science and computer programming are very large fields, and it's certainly possible to go through one or the other without needing certain tools. So yes, not every programmer must know these. What I'm wondering is what kinds of statistical analyses are frequently useful in programming; what should a programmer learn, if he wants to broaden his knowledge base, and learn new skills that can come in handy in the future. There are a few fields that obviously need some sort of statistics; performance and benchmarking for one, natural language processing for another. Is there anything else?
Brian Campbell
Perhaps I shouldn't have linked to Zed Shaw's rant, as it seems to be distracting from my question. I'm not trying to say, like he seems to be, that all programmers must learn statistics now; I was just using that as an example of one way in which statistics can be useful to a programmer.
Brian Campbell
It's always useful to know what you need to know in order to do your job well. It's just that, for me, statistics has never been one of the things I've needed.
John Saunders
+27  A: 

Just as a point, not as a criticism: your question should be formulated in a different way: "what statistics should any person know?".

Fact is, unfortunately, we all deal with statistics. It's a fact of life. Polls, weather forecasts, drug effectiveness, insurance, and of course some parts of computer science. Being able to critically analyze the presented data draws the line between reaching the right understanding and being scammed, whatever that means.

That said, I think the following points are important to understand:

  • mean, median, standard deviation of a sample, and the difference between sample and population (this is very important)
  • the distributions, and why the Gaussian distribution is so important (the central limit theorem)
  • What is meant by null hypothesis testing.
  • What variable transformation, correlation, regression, and multivariate analysis are.
  • What Bayesian statistics is.
  • Plotting methods.

All these points are critical not only to you as a computer scientist, but also as a human being. I will give you some examples.

  • The evaluation of the null hypothesis is critical for testing the effectiveness of a method. For example, whether a drug works, or whether a fix to your hardware had a concrete result or it's just a matter of chance. Say you want to improve the speed of a machine, and change the hard drive. Does this change matter? You could do sampling of performance with the old and new hard disk, and check for differences. Even if you find that the average with the new disk is lower, that does not mean the hard disk has an effect at all. Here enters null hypothesis testing: it will give you a measure of confidence, not a definitive answer, something like: a difference this large would be very unlikely if the hard drive made no difference at all (see the sketch after this list).

  • Correlation is important to find out if two entities "change alike". As the internet mantra "correlation is not causation" teaches, it should be taken with care. The fact that two random variables show correlation does not mean that one causes the other; they could both be driven by a third variable (which you are not measuring), or they could just happen to behave in the same way. Look up pirates and global warming to understand the point. A correlation reports a possible signal, it does not report a finding.

  • Bayesian. We all know the spam filter, but there's more. Suppose you go to a medical checkup and the result tells you that you have cancer (I seriously hope not, but it's to illustrate a point). Fact is: most people at this point would think "I have cancer". That's not true. A positive test for cancer moves your probability of having cancer from the baseline for the population (say, 8 per thousand people have cancer, a number picked out of thin air) to a higher value, which is not 100%. How high this number is depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the method, the higher the shift, but still not 100%. Of course, if multiple independent tests all confirm that you have cancer, then it's very probable you actually have it, but it's still not 100%. Maybe it's 99.999%. This is a point many people don't understand about Bayesian statistics (the sketch after this list works through the arithmetic).

  • Plotting methods. That's another thing that is always left unattended. Analysis of data does not mean anything if you cannot effectively convey what it means via a simple plot. Depending on what information you want to put into focus, or the kind of data you have, you will prefer an xy plot, a histogram, a violin plot, or a pie chart.
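
Here is a rough R sketch of the hard-drive and cancer examples above; the timings are simulated and the test accuracies are invented, so take it as an illustration of the mechanics only:

    ## Null hypothesis example: did the new disk change anything?
    set.seed(42)
    old_disk <- rnorm(30, mean = 105, sd = 10)   # seconds per job, old drive (simulated)
    new_disk <- rnorm(30, mean = 100, sd = 10)   # seconds per job, new drive (simulated)
    t.test(old_disk, new_disk)   # the p-value says how surprising this difference
                                 # would be if the drives were actually equivalent

    ## Bayes example: a positive test for a rare disease
    prevalence  <- 0.008         # 8 per thousand, the made-up baseline
    sensitivity <- 0.95          # P(positive | disease), invented
    specificity <- 0.95          # P(negative | no disease), invented
    p_positive  <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    sensitivity * prevalence / p_positive   # P(disease | positive), nowhere near 100%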

Now, let's go to your questions. I think I overindulged in what was meant to be just a quick note, but since my answer was voted up quite a lot, I feel it's better if I answer your questions properly, as much as my knowledge allows (and it's vacation here, so I can indulge as much as I want).

What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?

Normally, everything that has to do with data comparison which involves numerical (or reduced to numerical) input from unreliable sources. A signal from an instrument, a bunch of pages and the number of words they contain. When you get these data and have to find a distilled answer out of the bunch, then you need statistics. Think for example of the algorithm to perform click detection on the iPhone. You are using a trembling, fat stylus to point at an icon which is much smaller than the stylus itself. Clearly, the hardware (capacitive screen) will send you a bunch of data about the finger, plus a bunch of data about random noise (air? don't know how it works). The driver must make sense out of this mess and give you an x,y coordinate on the screen. That needs (a lot of) statistics.
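
A toy version of that idea (real drivers are far more sophisticated): the screen reports many noisy samples around the true touch point, and the driver has to distill one coordinate out of them.

    set.seed(7)
    true_xy <- c(x = 120, y = 340)                     # made-up icon position
    samples <- cbind(x = rnorm(50, true_xy["x"], 15),  # 50 noisy readings around it
                     y = rnorm(50, true_xy["y"], 15))

    colMeans(samples)           # simple estimate: the centre of the cloud
    apply(samples, 2, median)   # more robust if some readings are wild outliers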

What kind of statistical methods should I spend my time learning?

The ones I told you about are more than enough, also because to understand them, you have to walk through other stuff.

What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

I learned statistics mostly from standard university courses. My first book was the "train wreck book", and it's very good. I also tried this one, which focuses on R, but it did not particularly satisfy me. You have to already know the subject and R to get through it.

Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?

That depends on the question you need to answer using your dataset.

Programmers are frequently asked to produce high-performance systems, that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?

There are a lot of issues with measuring. Measuring is a fine and delicate art. Proper measuring is almost beyond human. The fact is that sampling introduces bias, either from the sampler, or from the method, or from the nature of the sample, or from the nature of nature. A good sampler knows these things and tries to reduce unwanted bias as much as possible, pushing what remains toward a random distribution.

The examples from the blog you posted are relevant. Say you have a startup time for a database. If you take performance measures within that time, all your measures will be biased. There's no statistical method that can tell you this. Only your knowledge of the system can.
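
A toy simulation of that warm-up effect, just to show how much it can distort a summary statistic (all numbers invented):

    set.seed(3)
    cold <- rnorm(10, mean = 400, sd = 40)   # ms per query while the database warms up
    warm <- rnorm(90, mean = 50,  sd = 5)    # ms per query at steady state
    times <- c(cold, warm)

    mean(times)            # biased upward by the startup period
    mean(times[-(1:10)])   # much closer to the steady-state figure
    # Knowing that the first 10 samples should be discarded comes from
    # understanding the system, not from any statistical test.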

Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

Every time you have an ensemble of data producers, you have statistics, so scientific computing and data analysis are obvious places. Folksonomy and social networking are pretty much all statistics. Even stackoverflow is, in some sense, statistical. The fact that an answer is highly voted does not mean that it's the right one. It means that there's a high probability that it is right, according to the evaluation of a statistical ensemble of independent evaluators. How these evaluators behave makes the difference between stackoverflow, reddit and digg.

Stefano Borini
That's a good point. It is also useful to ask what statistics anyone should know; and everyone certainly should know Bayes' rule (especially anyone who has to perform some kind of test that picks out one item in a thousand, but which has an error rate of .1). Part of the reason I ask about programmers specifically, beyond this being a programming forum, is that I want to know what problems that programmers frequently encounter would be easier to solve, or better solved, with appropriate statistical techniques. I've updated my question to that effect.
Brian Campbell
Even with the updated question, the answer, for me, in my 30+ years, has been, "I have not needed any statistics".
John Saunders
@John: I think you needed it, but just ignored it. Nothing wrong with that, we only have 24 hours a day.
Stefano Borini
Even more important, to everyone, are the statistical/logical fallacies common amongst many published articles involving statistics, *especially* correlation-vs-causation, sampling bias, invalid implications, etc.
BlueRaja - Danny Pflughoeft
+29  A: 

Interesting question. As a statistician whose interest is more and more aligned with computer science perhaps I could provide a few thoughts...

  1. Don't learn frequentist hypothesis testing. While the bulk of my work is done in this paradigm, it doesn't match the needs of business or data mining. Scientists generally have specific hypotheses in mind, and might wish to gauge the probability that, given their hypothesis isn't true, the data would be as extreme as it is. This is rarely the type of answer a computer scientist wants.

  2. Bayesian is useful, even if you don't know why you are assuming the priors that you are using. A Bayesian analysis can give you a precise probability estimate for various contingencies, but it is important to realize that the only reason you have this precise estimate is that you made a fuzzy decision regarding the prior probability. (For those not in the know: with Bayesian inference, you can specify an arbitrary prior probability and update it based on the data collected to get a better estimate.)
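
A minimal sketch of that prior-plus-data update, using a conjugate beta-binomial model in R (the prior and the data are invented):

    # Prior belief about a conversion rate: Beta(2, 8), i.e. "probably around 20%"
    a <- 2; b <- 8

    # Invented data: 25 conversions out of 100 trials
    successes <- 25; failures <- 75

    # Conjugate update: the posterior is again a Beta distribution
    post_a <- a + successes
    post_b <- b + failures

    post_a / (post_a + post_b)               # posterior mean estimate
    qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval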

Machine learning and classification might be a good place to get started. The machine learning literature is more focused on computer science problems, though its mission is almost identical to that of statistics (see: http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/ ).

Since you spoke of large databases with large numbers of variables, here are a few algorithms that come in handy in this domain.

  • adaboost: If you have a large number of crappy classifiers, and want to make one good classifier. (see also logit boost)
  • Support Vector Machines: A powerful and flexible classifier. Can learn non-linear patterns (okay linear in the non-linear kernel space if you want to be picky about it).
  • k-nearest neighbor: A simple but powerful algorithm. It does not scale well, but there are approximate nearest neighbor alternatives that are not quite so pathological (a minimal from-scratch sketch follows after this list).
  • CART: This algorithm partitions the data based on a number of predictor variables. It is particularly good if there are variable interactions, or there exists a very good predictor that only works on a subset of the data.
  • Least angle regression: if the value that you are trying to predict is continuous and you have a lot of data and a lot of predictors.
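
As a concrete illustration of the k-nearest neighbor entry above, here is a deliberately naive from-scratch version in R on the built-in iris data; it scans every training point for every query, which is exactly the scaling problem mentioned in the list:

    knn_predict <- function(train_x, train_y, query, k = 5) {
      d <- sqrt(rowSums(sweep(train_x, 2, query)^2))   # Euclidean distances to the query
      names(sort(table(train_y[order(d)[1:k]]), decreasing = TRUE))[1]   # majority vote
    }

    set.seed(1)
    idx     <- sample(nrow(iris), 100)
    train_x <- as.matrix(iris[idx, 1:4]);  train_y <- iris$Species[idx]
    test_x  <- as.matrix(iris[-idx, 1:4]); test_y  <- iris$Species[-idx]

    pred <- apply(test_x, 1, function(q) knn_predict(train_x, train_y, q))
    mean(pred == test_y)   # fraction of held-out flowers classified correctly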

This is by no means complete, but should give you a good jumping off point. A very good and accessible book on the subject is Duda, Hart, Stork: Pattern Classification

Also, a big part of statistics is descriptive visualization and analysis. These are of particular interest to the programmer because they allow him/her to convey information back to the user. In R, ggplot2 is my package of choice for creating visualizations. On the descriptive analysis side (and useful in text analysis) is multi-dimensional scaling, which can give a spatial interpretation of non-spatial data (for example the ideologies of senators http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1223908041).
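
For instance, a quick multi-dimensional scaling sketch on a built-in data set: it places the cars of mtcars in 2-D so that cars that are similar across all measured variables end up close together.

    d   <- dist(scale(mtcars))     # distances on standardized variables
    fit <- cmdscale(d, k = 2)      # classical MDS: a 2-D embedding of non-spatial data
    plot(fit, type = "n", xlab = "Dim 1", ylab = "Dim 2")
    text(fit, labels = rownames(mtcars), cex = 0.7)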

Ian Fellows
Great answer, thanks! Can you provide a reference about frequentist hypothesis testing? You say not to learn it, but a quick Google search doesn't seem to lead me to a very good explanation of what it is. I'd like to learn about the possible techniques, even if some of them don't apply as well to my field.
Brian Campbell
I added a reference. Frequentist hypothesis testing is also known as null hypothesis testing, or simply statistical hypothesis testing. I would recommend learning about it if you want to understand medical literature, but not if you want to do something like predict netflix ratings.
Ian Fellows
I think one should at least understand null hypothesis testing and normal distribution assumptions. That doesn't mean that they have to assume everything is normal.
Brandon Bertelsen
+4  A: 

Boy, some of these answers are good. I came from much the same background and have had to get into biostatistics largely by books and by osmosis from colleagues. Here are my recommendations:

  • Start with a solid grounding in probability, including conditional probability, Bayes' theorem, Markov models, and some of the basic statistical distributions.

  • If you don't have it, get some linear algebra, so you don't get scared off by matrices. If you are faced with tricky algebra and calculus, knuckle down and work through it. It's worth it.

  • Statistics theory falls into two camps, frequentist and Bayesian. Frequentist is older and solid. Bayesian is newer, more flexible, and more exciting. In particular, there are exciting things that can be done with Markov Chain Monte Carlo and related techniques.
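
To show the mechanics of MCMC, here is a bare-bones Metropolis sampler in R; the target is the posterior of a coin's heads-probability after seeing 7 heads in 10 flips with a flat prior, so treat it as an illustration rather than anything you would use in production:

    log_target <- function(p) {
      if (p <= 0 || p >= 1) return(-Inf)   # outside (0,1) has zero density
      7 * log(p) + 3 * log(1 - p)          # binomial log-likelihood, flat prior
    }

    set.seed(11)
    n_iter <- 10000
    draws  <- numeric(n_iter)
    p      <- 0.5                               # starting value
    for (i in 1:n_iter) {
      proposal <- p + rnorm(1, sd = 0.1)        # symmetric random-walk proposal
      if (log(runif(1)) < log_target(proposal) - log_target(p)) p <- proposal
      draws[i] <- p
    }

    mean(draws[-(1:1000)])   # about 0.67 after burn-in, matching the Beta(8, 4) posterior
    hist(draws[-(1:1000)], breaks = 40, main = "Posterior of p")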

In my area, pharmacometrics, there is high payoff in being able to extract meaningful results from sparse and expensive data, so an ability in statistics is very important.

Added: Here are some favorite books (not a complete list):

Mike Dunlavey
Thanks! I feel like I have a decent grounding in probability, conditional probability, Bayes' theorem, and linear algebra (though the linear algebra is a bit rusty at this point). Do you have any particular books or other sources that you think would be a good introduction to the rest?
Brian Campbell
+1 for good answer, and to push you over 10k rep!
Neil N
+5  A: 

I have not much to add, but it happens that I just started to read this book: D. S. Sivia with J. Skilling, ‘Data Analysis—a Bayesian tutorial’, 2nd Edition, 2006, Oxford University Press.

What caught my attention is the preface, where the author refers to a common dissatisfaction among those who approach the study of statistics:

Preface

As an undergraduate, I always found the subject of statistics to be rather mysterious. This topic wasn’t entirely new to me, as we had been taught a little bit about probability earlier at high school; for example, I was already familiar with the binomial, Poisson and normal distributions. Most of this made sense, but only seemed to relate to things like rolling dice, flipping coins, shuffling cards and so on. However, having aspirations of becoming a scientist, what I really wanted to know was how to analyse experimental data. Thus, I eagerly looked forward to the lectures on statistics. Sadly, they were a great disappointment. Although many of the tests and procedures expounded were intuitively reasonable, there was something deeply unsatisfactory about the whole affair: there didn’t seem to be any underlying basic principles! Hence, the course on ‘probability and statistics’ had led to an unfortunate dichotomy: probability made sense, but was just a game; statistics was important, but it was a bewildering collection of tests with little obvious rhyme or reason. While not happy with this situation, I decided to put aside the subject and concentrate on real science. After all, the predicament was just a reflection of my own inadequacies and I’d just have to work at it when the time came to really analyse my data.

The story above is not just my own, but is the all too common experience of many scientists. Fortunately, it doesn’t have to be like this. What we were not told in our undergraduate lectures is that there is an alternative approach to the whole subject of data analysis which uses only probability theory. In one sense, it makes the topic of statistics entirely superfluous. In another, it provides the logical justification for many of the prevalent statistical tests and procedures, making explicit the conditions and approximations implicitly assumed in their use.

This book is intended to be a tutorial guide to this alternative Bayesian approach, including modern developments such as maximum entropy.

...

I hope this book will keep its promises.

There are a couple of preview chapters from the first edition here, from a course in Cognitive Psychology/AI where this book was adopted, and other materials from the same course here. Related software by the second author is here. Also a more extended preview from Google Books here.

MaD70
Yes, this describes some of the reason that I haven't really been able to get into statistics in the past; I feel like it's a whole bunch of tools without much justification. Thanks for the reference!
Brian Campbell
++ That preface says it all.
Mike Dunlavey
+4  A: 

What a great thread. There's plenty of good information in the question itself and in the answers, but I am really surprised nobody has mentioned the book Programming Collective Intelligence yet.

It's the best book I know if you are a novice in this subject (like me) and want to put machine learning and statistics theory into practice.

This book explains:

  • Collaborative filtering techniques that enable online retailers to recommend products or media
  • Methods of clustering to detect groups of similar items in a large dataset
  • Search engine features--crawlers, indexers, query engines, and the PageRank algorithm
  • Optimization algorithms that search millions of possible solutions to a problem and choose the best one
  • Bayesian filtering, used in spam filters for classifying documents based on word types and other features
  • Using decision trees not only to make predictions, but to model the way decisions are made
  • Predicting numerical values rather than classifications to build price models
  • Support vector machines to match people in online dating sites
  • Non-negative matrix factorization to find the independent features in a dataset
  • Evolving intelligence for problem solving--how a computer develops its skill by improving its own code the more it plays a game

Apart from that, there's a great talk on TED on why everybody should learn statistics.

jbochi
Looks like an interesting book, thanks!
Brian Campbell
+3  A: 

Here's an excellent book, available free on the web: 'The Elements of Statistical Learning', by Hastie, Tibshirani and Friedman.

It covers a range of useful topics, and should be a good introduction to the machine learning field. Its explanation of overfitting models is the best that I've seen in the ~20-30 stats books I've read.

Barry DeCicco
+1  A: 

It's amazing that no one has mentioned the Bootstrap Method, Principal Component Analysis or the LASSO algorithm. They cover data reduction, simulation, and exploratory data analysis, to name a few.
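
For example, a small base-R sketch of the first two on the built-in mtcars data (the LASSO needs an add-on package such as glmnet, so it is left out here):

    ## Bootstrap: how uncertain is the median fuel economy?
    set.seed(5)
    boot_medians <- replicate(2000, median(sample(mtcars$mpg, replace = TRUE)))
    quantile(boot_medians, c(0.025, 0.975))   # rough 95% interval for the median

    ## Principal component analysis: how much variation do the first components capture?
    pca <- prcomp(mtcars, scale. = TRUE)
    summary(pca)$importance[, 1:3]            # standard deviation / proportion / cumulative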

pslice
Instead of LASSO one could use LARS http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
Marek