views:

1271

answers:

16

So I have this function used to calculate statistics (min/max/std/mean). The thing is, it generally runs on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes incredibly slow.

E.g. reading all the pixel values of the geotiff one pixel at a time (which involves a lot of complex math to properly georeference each pixel value to a corresponding point) takes around 30 seconds, while calculating the statistics of the entire matrix takes around 6 minutes.

void CalculateStats()
{
 //OHGOD
 double new_mean = 0;
 double new_standard_dev = 0;

 int new_min = 256;
 int new_max = 0;

 size_t cnt = 0;
 for(size_t row = 0; row < vals.size(); row++)
 {
  for(size_t col = 0; col < vals.at(row).size(); col++)
  {
   double mean_prev = new_mean;
   T value = get(row, col);
   new_mean += (value - new_mean) / (cnt + 1);
   new_standard_dev += (value - new_mean) * (value - mean_prev);

   // find new max/min's
   new_min = value < new_min ? value : new_min;
   new_max = value > new_max ? value : new_max;
   cnt++;
  }
 }

 stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
 std::cout << stats_standard_dev << std::endl;
}

Am I doing something horrible here?

EDIT

To respond to the comments, T would be an int.

EDIT 2

I fixed my std algorithm, and here is the final product:

void CalculateStats(const std::vector<double>& ignore_values)
{
 //OHGOD
 double new_mean = 0;

 int new_min = 256;
 int new_max = 0;

 int n = 0;
 double delta = 0.0;
 double mean2 = 0.0;

 std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
 std::vector<double>::const_iterator ignore_end = ignore_values.end();

 for(typename std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
 {
  for(typename std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
  {
   // This method of calculation is based on Knuth's algorithm.
   T value = *col;
   if(std::find(ignore_begin, ignore_end, value) != ignore_end) // needs <algorithm>
    continue;
   n++;
   delta = value - new_mean;
   new_mean = new_mean + (delta / n);
   mean2 = mean2 + (delta * (value - new_mean));

   // Find new max/min's.
   new_min = value < new_min ? value : new_min;
   new_max = value > new_max ? value : new_max;
  }
 }
 stats_standard_dev = std::sqrt(mean2 / (n - 1)); // mean2 / (n - 1) is the variance; needs <cmath>
 stats_min = new_min;
 stats_max = new_max;
 stats_mean = new_mean;
}

This still takes ~120-130 seconds, but it's a huge improvement :)!

+5  A: 

The first thing I spotted is that you evaluate vals.at(row).size() in the loop, which obviously doesn't help performance. The same applies to vals.size(), but of course the inner loop is worse. If vals is a vector of vectors, you'd better use iterators, or at least keep a reference to the row vector (because get() with index parameters surely eats up quite some time as well).

This code sample is supposed to illustrate my intentions ;-)

// TVO = std::vector<std::vector<T> >, TVI = std::vector<T> (names shortened for brevity)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
    for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
        T value = *ii;
        // the rest
    }
}
Michael Krelin - hacker
That looks horrible, and until you've got a profile, making that change would be a bad idea.
Andrew Bainbridge
Andrew Bainbridge, I'm sorry, but I disagree with you here. I don't find it horrible at all.
Michael Krelin - hacker
This runs very marginally slower than the OP's code on my machine, though personally I'd rewrite most of the OP's code to pull max, min, mean and SD out as algorithms which take iterators, just for code cleanliness.
Pete Kirkham
@Hacker: OK, horrible was a bit harsh. My point was, without a profile, you shouldn't really be suggesting fixes. If the change made it look nicer, then maybe.
Andrew Bainbridge
Andrew, but I do find iterating through arrays nicer. Honestly.
Michael Krelin - hacker
err. vectors. Well, collections ;-)
Michael Krelin - hacker
Pete, you got me started here ;-) I've tested an *empty* loop over some 10,000 by 10,000 vector of vectors and of course the iterator-based code is about twice as fast as indexed. And of course the bottleneck was `*ii` access — when it's accessed as vals[r][c] (which I took liberty of counting when iterating) it's tons slower. Maybe that's why `get()` was the slowest... ;-)
Michael Krelin - hacker
+4  A: 
  • First, change your row++ to ++row. A minor thing, but you want speed, so that will help
  • Second, make your row < vals.size into some const comparison instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size.
  • what is the 'get' method in the middle there? What does that do? That might be your real problem.
  • I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
mmr
Although you're certainly right about preincrementing in general, in this particular case I do not think it matters at all, since compiler should be smart enough to optimize it for `size_t` with no side effects here.
Michael Krelin - hacker
row++ -> ++row is one of the first optimizations any decent compiler will make.
ryeguy
ryeguy, yup. if it can ;-) (not that easy with overloaded for, say, iterators).
Michael Krelin - hacker
A: 

I'm not sure of what type vals is, but vals.at(row).size() could take a long time if it itself iterates through the collection. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²).

Tobias
std::vector::size() has complexity O(1), as well as `std::vector::at()`.
Kirill V. Lyadvinsky
Thanks, wasn't sure if he used std::vector.
Tobias
+7  A: 

You should calculate the sum of values, min, max and count in a first loop, then calculate the mean in one operation by dividing sum/count, then in a second loop calculate the sum needed for std_dev.

That would probably be a bit faster.
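
For illustration, a rough sketch of that two-pass approach (assuming vals is the vector<vector<int> > member from the question; the variable names are made up):

double sum = 0.0;
int new_min = 256, new_max = 0;
size_t cnt = 0;

// first pass: sum, min, max and count
for(size_t row = 0; row < vals.size(); ++row)
    for(size_t col = 0; col < vals[row].size(); ++col)
    {
        const int value = vals[row][col];
        sum += value;
        if(value < new_min) new_min = value;
        if(value > new_max) new_max = value;
        ++cnt;
    }

const double mean = sum / cnt;   // one division, outside the loops

// second pass: sum of squared deviations for the standard deviation
double sq_sum = 0.0;
for(size_t row = 0; row < vals.size(); ++row)
    for(size_t col = 0; col < vals[row].size(); ++col)
    {
        const double d = vals[row][col] - mean;
        sq_sum += d * d;
    }

const double std_dev = std::sqrt(sq_sum / cnt);   // needs <cmath>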

DVK
@DVK-- std dev can be calculated in a single loop, see my comment for the link.
mmr
+1 because I was about to suggest the same thing :P .
Alterlife
With as many data points as there are, summing the entire row could overflow. Better select the next larger data type to hold the running total.
Les
Or, if none is available, improvise one, like I did.
Robert L
+29  A: 

Have you tried to profile your code?

You don't even need a fancy profiler. Just stick some debug timing statements in there.

Anything I tell you would just be an educated guess (and probably wrong)

You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache the result of size(), but I don't know if that's the issue.
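
For illustration, a minimal sketch of such timing statements, using clock() from <ctime> since this is pre-C++11 code (the wrapper function name is made up):

#include <ctime>
#include <iostream>

void TimedCalculateStats()
{
    std::clock_t start = std::clock();
    CalculateStats();                                  // the code being measured
    std::clock_t stop = std::clock();

    double seconds = double(stop - start) / CLOCKS_PER_SEC;
    std::cout << "CalculateStats: " << seconds << " s" << std::endl;
}

The same pattern can be dropped around the inner loop, the get() call, or any other suspect to narrow things down.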

Glen
+1 for suggesting profiling
JeffH
Yeah, most of the answers here seem to be of the form "change this bit here, or maybe that bit over there and pray". Why not just profile the thing and find out what the REAL problem is and fix that? It's got to be easier than randomly re-writing parts of the code, especially when we don't have it all. I mean, that call to get(row,col) could be doing anything (including sleeping for 30 seconds or something).
Glen
Glen, he asked whether he's doing something horrible. There are quite a few horrible things, which we all pointed out. Profiling is what one should try when thinking fails, and it's a bad sign on a screenful-scale project.
Michael Krelin - hacker
@hacker, there are lots of things wrong with the code. However the question title is "Why is this code so slow?" Given that the posted code is incomplete, there's no way of knowing the answer without profiling. I could state that it's slow because of the "get" function. Without knowing what it does there's no way to prove I'm right. making micro-optimisations without a profiler pointing out the ACTUAL bottleneck is just guessing, and it's the wrong thing to do
Glen
Noooooo! I wish I could downvote comments! Time and time again while I've been looking at performance issues I've looked at the code and gone "ooooh, that's slow", only to "fix it" and find it makes absolutely no difference - if there's anything I've learnt it's that you never guess with performance.
Kragen
this is the best advice. PROFILE it! Then fix the problems.
Tim
Glen, there's no way to know what get does, but there is a certain indication of what it is supposed to do. I think you're misunderstanding what people are suggesting here. They're *not* guessing, they *see* problems and share their vision.
Michael Krelin - hacker
@hacker, so you're just guessing? that's not the way to improve performance. let's say he takes your advice and gets a 10-20% performance boost. should he be happy? profiling it and fixing what a profiler suggests MAY improve performance by orders of magnitude. (or not. there's no way to be SURE without profiling)
Glen
The suggestion itself is quite valid; the overestimation of the approach by the community is really sad. A profiler is no substitute for thinking. Glen, didn't I just say that I'm *not* guessing? And if you refer to my guess about the `get()` machinery, then you know yourself it's inappropriate. Using a profiler for a screenful of straightforward lines indicates nothing but an aversion to thinking.
Michael Krelin - hacker
Thanks, I figured it out, it was the get() :)!
UberJumper
;-))) what the hell was in your `get()` ?
Michael Krelin - hacker
@hacker: When finding performance bottlenecks, thinking is no substitute for profiling. There's a lot of anecdotal evidence out there; everybody who tries profiling seems to realize that it's far better at picking out the slow spots than they are.
David Thornley
David, although you think you counter my arguments, I upvote your comment. Thinking is no substitute for profiling either.
Michael Krelin - hacker
"I'm using bubblesort to sort 100 million items, and it's taking a long time. What should I do? Should I just guess what the problem is, and switch to quicksort, or should I profile my code, identify the bottleneck, and work from there?". Sometimes, programmers do actually know something about programming, and can use that knowledge in their work. Crazy, I know, but true ;-)
Steve Jessop
Obviously having made a guess and switched to quicksort (or introsort, or timsort), you time it to see whether it's made it faster or not. But as a professional, you have sufficient understanding to know why it's faster (for most inputs), that there's no point profiling to see whether more time is spent (a) comparing items or (b) swapping items. It really doesn't matter, all that matters is whether quicksort is faster or not. You don't need a profiler to test that, just a stopwatch.
Steve Jessop
Yeah, that's cool, but look, Andrew Bainbridge has already profiled it for you :)
avp
Algorithmic improvements > profiled optimizations > random people guessing on the internet
rpetrich
If it ran in 5s rather than 1s then it's worth profiling. A factor of 100 is more likely to be a systematic failure. He's using the debug build, which means a lot of extra checks in the STL. If he was using the release build, it would be fast enough that the question would never have been asked.
Pete Kirkham
A: 

I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.

Zan Lynx
+1  A: 

Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
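
A minimal sketch of hoisting those calls, assuming vals is the vector<vector<T> > member:

const size_t n_rows = vals.size();
for(size_t row = 0; row < n_rows; ++row)
{
    const size_t n_cols = vals[row].size();
    for(size_t col = 0; col < n_cols; ++col)
    {
        // ... statistics update as before ...
    }
}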

Jim Buck
I wish I could vote this one up more times.
Tim
tim, you can, because almost every answer suggests that ;-)
Michael Krelin - hacker
Well, I actually think the big win would be just optimizations. No one mentioned that, but that has made a HUGE difference in my own code, and the side effect is that both size() and at() will probably be optimized out of the loop automatically anyway.
Jim Buck
not likely, Jim. Those two can't really be optimized out by compiler.
Michael Krelin - hacker
+1  A: 

If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
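
A minimal sketch of that change, assuming vals is the vector<vector<T> > member (the local names are made up):

for(size_t row = 0; row < vals.size(); ++row)
{
    const std::vector<T>& row_vec = vals[row];   // fetch the i-th vector once
    for(size_t col = 0; col < row_vec.size(); ++col)
    {
        T value = row_vec[col];                  // no repeated lookups through the outer vector
        // ... update mean / std dev / min / max ...
    }
}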

Larry Watanabe
A: 

As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.

unwind
+2  A: 

I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:

vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
    vector<T>::const_iterator value;
    vector<T>::const_iterator value_end = row->end();
    for(value = row->begin(); value < value_end; ++value)
    {
        double mean_prev = new_mean;
        new_mean += (*value - new_mean) / (cnt + 1);
        new_standard_dev += (*value - new_mean) * (*value - mean_prev);

        // find new max/min's
        new_min = min(*value, new_min);
        new_max = max(*value, new_max);
        cnt++;
    }
}

The advantage of this is that in your inner loop you aren't consulting the outer vector, just the inner one.

If your container type is a list, this will be significantly faster, because the lookup time of get/operator[] is linear for a list and constant for a vector.

Edit, I moved the call to end() out of the loop.

Matt Price
Bad thing here is calling `end()` functions in the loop.
Michael Krelin - hacker
took my downvote back ;-) Now your answer looks more or less like mine ;-)
Michael Krelin - hacker
+7  A: 

I just profiled it. 90% of the execution time was in this line:

new_mean += (value - new_mean) / (cnt + 1);

Andrew Bainbridge
That's the way to do it! @uberjumper: community would even profile it for you :)
avp
+1: Everyone points at things like at() and .size() and stuff. This just highlights that you cannot know where the performance bottleneck is without profiling.
Kaz Dragon
How could you profile it? You don't have the implementation of the get function (which, as the OP noted in a comment, was the real bottleneck). If you just omitted the get then your results can't be correct.
Glen
@Glen: That's a good point. I implemented get as vals[y][x]. Basically I fiddled about to find out what part of the code I _did_ have was slowest.
Andrew Bainbridge
@Andrew, that makes sense. It's just unlucky that the real bottleneck was in the one piece of code the OP didn't post.
Glen
Glen, I'm really curious to see it;-)
Michael Krelin - hacker
@hacker, yeah, I'd like to see it too, but I don't think we will, unfortunately.
Glen
Glen, I've just played a bit with the empty loop, using the code in my answer and counter-based code (like in OP). Iterating over 50000x50000 vector of vectors. Using iterators *and* counting rows and columns. It is twice as fast (no repeated size() calls of course). And, unsurprisingly, replacing accessor from dereferencing iterator to indexed - [r][c] - slows it down to about 0.8 of original execution time. What I'm trying to say is that it may be `get()` that was the bottleneck, but not because *get itself* was misimplemented. But well ;-)
Michael Krelin - hacker
+3  A: 

It's slow because you're benchmarking debug code.

Building and running the code on Windows XP using VS2008:

  • In a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
  • In a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.

In comments below you say you're not using optimisation, and this appears to make a big difference in this case - normally it's less than a factor of ten, but in this case it's over a hundred times slower.

I'm using VS2008 rather than 2005, but it's probably similar:

In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch. There is overhead involved both with function calls and with branches.

In the Release build, the compiler optimizes away the range checks (I don't know whether it just drops them, or does flow analysis based on the limits of the loop), and the vector access becomes a small amount of inline pointer arithmetic with no branches.

No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't get all the information you want when you try to step through the code.


The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running 230 times slower than that. Are you on a 10 MHz processor?

There are other suggestions for making it faster (such as moving it to use SSE, if all the values are representable using 8-bit types), but there's clearly something else which is making it slow.

On my machine, neither a version which hoisted a reference to the vector for the row and hoisted the size of the row, nor a version which used iterators, had any measurable benefit (with g++ -O3, using iterators takes 1511 ms repeatably; the hoisted and original versions both take 1485 ms). Not optimising means it runs in 7487 ms (original), 3496 ms (hoisted) or 5331 ms (iterators).

But unless you're running on a very low power device, or are paging, or running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.

(As a side note, if you test it with values with a deviation of zero, your SD comes out as 1.)

Pete Kirkham
P4 3 Gigs of Ram. I am so confused. :|
UberJumper
What compiler? Are you running with optimisations turned on?
Pete Kirkham
Visual Studio 2005, no optimization. But optimization should not make that big of a difference for that section of code?
UberJumper
@Pete, the OP mentions in a comment to my answer that the issue was in the 'get' method. The one piece of code that wasn't available to benchmark......
Glen
True, but there are only a finite number of ways to write a get method for a nested vector, and the point was to get a data point for comparison and to see whether the suggested means of optimising the posted code were worthwhile.
Pete Kirkham
@uberjumper. Turning off optimisations makes a big difference in my test. About a factor of five.
Andrew Bainbridge
@uberjumper Running with the debugger attached will matter a lot. Running without optimisations usually matters by a factor of less than ten, but is highly variable. 120 seconds for this is very slow.
Pete Kirkham
A: 

I have modified the algorithm to get rid of almost all of the floating-point division.

WARNING: UNTESTED CODE!!!

void CalculateStats()
{
    //OHGOD

    double accum_f;
    double accum_sq_f;
    double new_mean = 0;
    double new_standard_dev = 0;

    int new_min = 256;
    int new_max = 0;

    const int oku = 100000000;
    int accum_ichi = 0;
    int accum_oku = 0;
    int accum_sq_ichi = 0;
    int accum_sq_oku = 0;

    size_t cnt = 0;
    size_t v1 = 0;
    size_t v2 = 0;

    v1 = vals.size();

    for(size_t row = 0; row < v1; row++)
    {

            v2 = vals.at(row).size();
            for(size_t col = 0; col < v2; col++)
            {
                    T value = get(row, col);
                    accum_ichi += value;
                    accum_sq_ichi += (value * value);

                    // perform carries
                    accum_oku += (accum_ichi / oku);
                    accum_ichi %= oku;
                    accum_sq_oku += (accum_sq_ichi / oku);
                    accum_sq_ichi %= oku;

                    // find new max/min's
                    new_min = value < new_min ? value : new_min;
                    new_max = value > new_max ? value : new_max;
                    cnt++;
            }
    }

    // now, and only now, do we use floating-point arithmetic
    accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
    accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);

    new_mean = accum_f / (double)(cnt);

    // standard deviation formula from Wikipedia
    stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);        

    std::cout << stats_standard_dev << std::endl;
}
Robert L
@Robert: don’t use HTML, use Markdown for formatting and save yourself a lot of trouble.
Konrad Rudolph
@Konrad: I tried that and it didn't work. I wish there *were* no formatting, since all the formatting characters are ones that I need for my code.
Robert L
A: 

Coming a bit late to the party here, but a couple of points:

  1. You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.

  2. If you have the opportunity to redesign your matrix, you might want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective. (See the sketch below.)
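
For illustration, a minimal sketch of such a flat valarray layout with slice-based row access (the class and member names here are made up, not taken from the original code):

#include <valarray>
#include <cstddef>

struct FlatMatrix
{
    std::size_t rows, cols;
    std::valarray<int> data;                    // one contiguous, row-major block

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(0, r * c) {}

    // element access without nested vectors
    int& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }

    // a whole row as a slice: start = r * cols, length = cols, stride = 1
    std::slice_array<int> row(std::size_t r) { return data[std::slice(r * cols, cols, 1)]; }
};

Whole-matrix operations like data.sum(), data.min() and data.max() then work on the flat block directly.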

Sam
+3  A: 

There are far too many calculations in the inner loop:

  1. For the descriptive statistics (mean, standard deviation) the only thing required is to compute the sum of value and the sum of squared value. From these two sums the mean and standard deviation can be computed after the outer loop (together with a third value, the number of samples - n in your new/updated code). The equations can be derived from the definitions or found on the web, e.g. Wikipedia. For instance the mean is just the sum of value divided by n. For the n version (in contrast to the n-1 version - however, n is large in this case so it doesn't matter which one is used) the standard deviation is:
    sqrt( n * sumOfSquaredValue - sumOfValue * sumOfValue ) / n.

    Thus only two floating point additions and one multiplication are needed in the inner loop. Overflow is not a problem with these sums, as the range for doubles is about 10^308. In particular you will get rid of the expensive floating point division that the profiling reported in another answer revealed.

  2. A lesser problem is that the minimum and maximum are rewritten every time (the compiler may or may not prevent this). As the minimum quickly becomes small and the maximum quickly becomes large, only the two comparisons (and not the assignments) should happen for the majority of loop iterations: use if statements instead to be sure (see the sketch after this list). It can be argued, but on the other hand it is trivial to do.
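
A compact sketch of the loop this describes (assuming T is an int pixel value in 0..255, as in the question; local names are illustrative):

double sum_of_value = 0.0;
double sum_of_squared_value = 0.0;
int new_min = 256;
int new_max = 0;
size_t n = 0;

for(size_t row = 0; row < vals.size(); ++row)
{
    const std::vector<int>& r = vals[row];
    for(size_t col = 0; col < r.size(); ++col)
    {
        const int value = r[col];
        sum_of_value += value;                              // one addition
        sum_of_squared_value += double(value) * value;      // one multiplication, one addition
        if(value < new_min) new_min = value;                // assignment only when actually needed
        if(value > new_max) new_max = value;
        ++n;
    }
}

const double mean = sum_of_value / n;
// the "n version" of the standard deviation, as given above (needs <cmath>)
const double std_dev = std::sqrt(n * sum_of_squared_value - sum_of_value * sum_of_value) / n;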

Peter Mortensen
special +1 for spotting unconditional update of min and max. The rest is valid, but has already been pointed to this way or another.
Michael Krelin - hacker
_Overflow is not a problem with these sums as the range for doubles is about 10^308._ Only because it sacrifices less significant bits to make room for more significant bits.
Robert L
@Robert L: it is correct that the current code has less round-off error than using the two sums proposed in my answer. The question is whether this increased round-off error is significant or not - given the number of samples, typical values of those samples, and the accuracy requirements for the final result (mean and standard deviation).
Peter Mortensen
A: 

In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there. And, of course, you should pay attention to cache locality.

If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
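
For illustration, a sketch of the separate-pass idea using standard library algorithms rather than hand-unrolled loops (assuming vals is the vector<vector<int> > member; whether this beats a fused loop would need measuring):

#include <numeric>
#include <algorithm>
#include <cmath>

double sum = 0.0, sq_sum = 0.0;
int lo = 256, hi = 0;
size_t n = 0;

for(size_t r = 0; r < vals.size(); ++r)
{
    const std::vector<int>& row = vals[r];
    if(row.empty()) continue;

    // pass 1: sum of the row
    sum += std::accumulate(row.begin(), row.end(), 0.0);

    // pass 2: min / max of the row
    lo = std::min(lo, *std::min_element(row.begin(), row.end()));
    hi = std::max(hi, *std::max_element(row.begin(), row.end()));

    // pass 3: sum of squares (inner product of the row with itself)
    sq_sum += std::inner_product(row.begin(), row.end(), row.begin(), 0.0);

    n += row.size();
}

const double mean = sum / n;
const double std_dev = std::sqrt(sq_sum / n - mean * mean);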

I couldn't get the code to compile, so I couldn't pinpoint issues for sure.

Mike Dunlavey