views: 145
answers: 7

I've been meaning to do a little bit of brushing up on my knowledge of statistics. One area where it seems like statistics would be helpful is in profiling code. I say this because it seems like profiling almost always involves me trying to pull some information from a large amount of data.

Are there any subjects in statistics that I could brush up on to get a better understanding of profiler output? Bonus points if you can point me to a book or other resource that will help me understand these subjects better.

+3  A: 

I'm not sure books on statistics are that useful when it comes to profiling. Running a profiler should give you a list of functions and the percentage of time spent in each. You then look at the one with the largest percentage and see if you can optimise it in any way. Repeat until your code is fast enough. Not much scope for standard deviations or chi-squared tests there, I feel.
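For what it's worth, here is a minimal Python sketch of that loop using the standard library's cProfile (the example functions are mine, purely illustrative, not from this answer):

    import cProfile
    import pstats

    def slow():
        # Deliberately heavy: this should dominate the profile.
        return sum(i * i for i in range(1_000_000))

    def fast():
        return 42

    def main():
        slow()
        fast()

    # Profile main() and print the top entries by cumulative time;
    # optimise the biggest one, then re-profile and repeat.
    cProfile.run("main()", "main.prof")
    pstats.Stats("main.prof").sort_stats("cumulative").print_stats(5)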

anon
That's ultimately how I do it, and you're probably right. Still, I'd like to learn more about statistics, and this seems like a good place to make it practical.
Jason Baker
++ Neil's right, although mostly I look at the line level, not the function level. (And, Neil, I hope by "percentage of time" you meant "total time", not "self time", and I hope you meant "wall clock time", not CPU-only, because as software gets big, time wastage becomes more and more due to unnecessary function calls, and some of those do I/O.)
Mike Dunlavey
+1  A: 

I think that the most important statistical concept to understand in this context is Amdahl's law. Although it is commonly cited in the context of parallelization, Amdahl's law has a more general interpretation. Here's an excerpt from the Wikipedia page:

More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation where the improvement has a speedup of S. (For example, if an improvement can speed up 30% of the computation, P will be 0.3; if the improvement makes the portion affected twice as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the improvement will be

S_overall = 1 / ((1 - P) + P / S)
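Plugging in the numbers from the excerpt makes this concrete; here is a minimal Python sketch (the function name is my own):

    def amdahl_speedup(p, s):
        # Overall speedup when a proportion p of the computation
        # is improved by a factor of s (Amdahl's law).
        return 1.0 / ((1.0 - p) + p / s)

    # The excerpt's example: speeding up 30% of the computation
    # by a factor of 2 gives only about a 1.18x overall speedup.
    print(amdahl_speedup(0.3, 2))  # ~1.176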

Eli Bendersky
Not really statistics, is it?
anon
@Neil: I don't know really :-) Wikipedia says "Statistics is the science of making effective use of numerical data relating to groups of individuals or experiments" - isn't this applicable, then?
Eli Bendersky
@Eli Precisely - I don't see any "group" data in that equation.
anon
@Neil, with respect, if this isn't nitpicking, I don't know what is.
Eli Bendersky
@Eli Well, is a = pi * r * r statistics? I suggest not. And disagreement != nitpicking.
anon
Regardless of whether or not it's relevant, this is still something I didn't know about and was glad to learn. Thanks for posting it.
Jason Baker
A: 

If you apply the MVC programming pattern with PHP, this is what you would need to profile:

Application:
   Controller Setup time
   Model Setup time
   View Setup time
Database
   Query - Time
Cookies
   Name - Value
Sessions
   Name - Value
RJD22
Sorry, I don't know PHP. And I'm a bit unclear about how this relates to my question. Where does statistics fit in here?
Jason Baker
Well, PHP is yet another programming language. (Sorry about that - I figured, wrongly, that you meant PHP.) Btw, does this really deserve a downvote? It's not like I'm writing spam or a completely useless answer.
RJD22
+1  A: 

Statistics is fun and interesting, but for performance tuning, you don't need it. Here's an explanation why, but a simple analogy might give the idea.

A performance problem is like an object (which may actually be multiple connected objects) buried under an acre of snow, and you are trying to find it by probing randomly with a stick. If your stick hits it a couple of times, you've found it - its exact size is not so important. (If you really want a better estimate of how big it is, take more probes, but that won't change its size.) The number of times you have to probe the snow before you find it depends on how much of the snow's area it lies beneath.

Once you find it, you can pull it out. Now there is less snow, but there might be more objects under the snow that remains. So with more probing, you can find and remove those as well. In this way, you can keep going until you can't find anything more that you can remove.

In software, the snow is time, and probing is taking random-time samples of the call stack. In this way, it is possible to find and remove multiple problems, resulting in large speedup factors.

And statistics has nothing to do with it.
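For the curious, here is a minimal sketch of the binomial arithmetic behind the analogy (my illustration, not part of the answer; Mike mentions this distribution in a comment further down):

    from math import comb

    def p_seen(k, n, f):
        # Probability that a problem responsible for a fraction f of
        # wall-clock time shows up on at least k of n random stack
        # samples (binomial model; samples assumed independent).
        return sum(comb(n, i) * f**i * (1 - f)**(n - i) for i in range(k, n + 1))

    # A problem costing 30% of the time is hit by two or more of
    # ten random samples about 85% of the time.
    print(p_seen(2, 10, 0.30))  # ~0.851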

Mike Dunlavey
+1  A: 

I think one concept related to both statistics and profiling (your original question) that is very useful, and that you see advised from time to time, concerns "micro-profiling": many programmers will rally and yell "you can't micro-profile; it simply doesn't work, too many things can influence your measurement".

Yet simply run your measurement n times, and keep only the x% of observations around the median, because the median is a "robust statistic" (contrary to the mean) that is not influenced by outliers - and outliers are precisely the values you want to exclude when doing this kind of profiling.

This is definitely a very useful statistical technique for programmers who want to micro-profile their code.
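A minimal sketch of that technique in Python (the helper name and the 31-trial/50% choices are my assumptions, not the answer's):

    import timeit

    def robust_time(stmt, setup="pass", trials=31, keep_frac=0.5):
        # Run the snippet `trials` times, then average only the middle
        # `keep_frac` of observations around the median, discarding the
        # outliers caused by GC pauses, caching, scheduling, and so on.
        times = sorted(timeit.repeat(stmt, setup=setup, repeat=trials, number=1))
        keep = max(1, int(trials * keep_frac))
        lo = (trials - keep) // 2
        return sum(times[lo:lo + keep]) / keep

    print(robust_time("sum(range(10000))"))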

Webinator
Don't measure. *Catch it in the act.* http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024
Mike Dunlavey
I do, and I see these two methods as complementary... I love to send a SIGQUIT to my multi-threaded Java programs, which dumps all the threads exactly where they're at. I do it all the time. That said, it doesn't answer the question *at all* - the question was about statistics concepts useful for profiling.
Webinator
@WizardOfOdds: Well I guess there is one statistical concept I rely on, the binomial distribution, but I don't *use* it, except as an explanation of why "catching" works. I also use Bayes theorem in explaining why it works. But the bumblebee doesn't need to know *how* it flies, it just flies.
Mike Dunlavey
+1  A: 

All I know about profiling is what I just read in Wikipedia :-) but I do know a fair bit about statistics. The profiling article mentioned sampling and statistical analysis of sampled data. Clearly, statistical analysis can use those samples to make statements about performance. Let's say you have some measure of performance, m, and you sample that measure 1000 times. Let's also say you know something about the underlying processes that created that value of m. For instance, if m is the SUM of a bunch of random variates, the distribution of m is probably normal. If m is the PRODUCT of a bunch of random variates, the distribution is probably lognormal. And so on...
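A quick simulation makes that concrete (my illustration of the claim, not from the answer):

    import math
    import random
    import statistics

    N, TERMS = 10_000, 50
    # Sums of many random variates come out roughly normal (central limit
    # theorem); products of positive variates come out roughly lognormal,
    # since the log of a product is a sum of logs.
    sums = [sum(random.uniform(0.5, 1.5) for _ in range(TERMS)) for _ in range(N)]
    prods = [math.prod(random.uniform(0.5, 1.5) for _ in range(TERMS)) for _ in range(N)]

    # For the roughly normal sums, mean and median nearly agree; the
    # skewed products only look symmetric after taking logs.
    print(statistics.mean(sums), statistics.median(sums))
    print(statistics.mean(prods), statistics.median(prods))
    logs = [math.log(p) for p in prods]
    print(statistics.mean(logs), statistics.median(logs))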

If you don't know the underlying distribution and you want to make some statement about comparing performance, you may need what are called non-parametric statistics.
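For example, SciPy's Mann-Whitney U test is one common non-parametric way to compare two timing samples without assuming any underlying distribution (the numbers below are invented for illustration):

    from scipy.stats import mannwhitneyu

    # Hypothetical wall-clock timings (ms) for the same benchmark before
    # and after an optimisation; note the outlier in `before`.
    before = [12.1, 11.8, 12.4, 13.0, 11.9, 12.2, 45.0]
    after = [11.2, 11.0, 11.5, 11.3, 11.1, 11.4, 11.6]

    # Tests whether one sample tends to yield larger values than the
    # other, with no normality assumption.
    stat, p = mannwhitneyu(before, after, alternative="two-sided")
    print(stat, p)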

Overall, I'd suggest any standard text on statistical inference (DeGroot), a text that covers different probability distributions and where they're applicable (Hastings & Peacock), and a book on non-parametric statistics (Conover). Hope this helps.

Grembo
+1  A: 

Zed Shaw, as usual, has some thoughts on the subject of statistics and programming, but he puts them much more eloquently than I could.

mipadi