ansaurus

Question

Answer 1

+4 A:

Check out this site and the bubbleSorts implementations.

Nick D 2009-06-23 13:41:16

Answer 2

+9 A:

A good compiler should do unrolling for you. For instance, in gcc compiling with the -O2 option turns on loop unrolling.

If you try to do it yourself manually, unless you measure things carefully and really know what you are doing, you are liable to end up with slower code. For example, in your case with manual unrolling you are liable to prevent the compiler from being able to do a loop interchange or stripmine optimization (look for --floop-interchange and -floop-strip-mine in the gcc docs)

T.E.D. 2009-06-23 13:42:22

[OP] It's not the overhead of the loops I am trying to optimize away, it's more the evaluation of row(), column() and f(). They are currently regular, free functions. I want to guarantee they are evaluated at compile time instead of run time.

RAC 2009-06-23 13:51:20

If you make those routines "inline", that might give the optimizer enough extra info to do that itself.

T.E.D. 2009-06-23 13:59:38

There's no reason why it shouldn't be able to inline those, as long as the full function definitions are visible at the call site.

jalf 2009-06-23 14:04:09

I will try, but row() and column() are actually kinda complicated (and recursive) functions. If the compiler resolves it all the way down to a constant, I'd be very happy - OTOH if it just drops in the function body I don't think I'll see much performance increase.The legacy code that I am replacing with these loops literally is a snippet that reads like "A(12,37) = 0.5;A(15,23) = 0.25;A(14,45) = 0.25;", where someone had done all the row(), column() mapping by hand. I'd like the replacement automate the process but still hit that performance level.

RAC 2009-06-23 14:07:18

why are the functions so complicated? That seems like a better place to try to optimize.

jalf 2009-06-23 14:09:43

The row() and column() functions are "nomenclature"/"numbering" routines for finite element basis functions over simplices (tetrahedra/triangles) of arbitrary order. There's a lot of recursion, both w.r.t. order and w.r.t. dimension. I couldn't figure out a smarter way to do it (it was tough enough just to get it correct), but maybe there is.

RAC 2009-06-23 14:19:40

Ick. Better you than me, man. :-)

T.E.D. 2009-06-23 14:24:35

Not to double post too much, but it's really the recursive nature of the row(), column() functions that is making me seek this solution to begin with! TMP is so good at recursion, that I know I can do TMP versions of row<>::value and column<>::value and eliminate their runtime calculation altogether. But once you have these TMP row<i,j> things inside the loop, your i,j must be compile time constants too, so I'd have to unroll the for loops using a TMP mechanism too.

RAC 2009-06-23 14:25:17

Is there any chance you'll be able to cache/reuse the indices computed by the row/column functions? If the same indices are calculated multiple times, that could give you a big saving too. But yeah, sounds like it gets a bit more complex than I'd thought initially. :)

jalf 2009-06-23 15:01:04

IF the row and column functions are that complicated you're going to struggle to implement them in templates. You can't just run C++ code at compile time, you have to write it all again using cryptic compiler error messages as a debugger.You'd be better of running another program as part of your build that generates all the values and spits out a .h file with an array of them in.

Matthew Monaghan 2009-06-23 15:29:18

Answer 3

+4 A:

Good question, I had to do some research to answer this.

You could use Boost MPL. An example of loop unrolling is on this mpl::for_each page.

for_each< range_c<int,0,10> >( value_printer() );

Edit: On reflection it doesn't seem that it's all evaluated at compile time but it may be a good starting point.

CiscoIPPhone 2009-06-23 14:01:55

Answer 4

+3 A:

I would say it is a false good-idea.

In C++ this : row<i,j>::value
means you will have as many differents row<>() functions than you have i * j. You don't want this because it will increase the size of the code and do a lot of instruction cache misses.

I observed this when I was doing template functions to avoid a single boolean check.

If is a short function just inline it.

Ben 2009-06-23 14:07:01

The idea was to let the compiler do the math, not the processor. For a 10 by 10 matrix, that's 200 functions being evaluated at compile time, and replaced by adresses in the final binary. That should still fit the cache, wouldn't it?

xtofl 2009-06-23 14:16:45

Yes it will, but you'll have to load all thoses functions inside the cache for a single use instead of using 200 times the same one. And after the run the i-cache will be filled with useless code. @RAC However I may be wrong it cache issues are always tricky let us know if you can get an improvment.

Ben 2009-06-23 14:23:51

with row<i,j> being a class type, and value being a static const int, i don't see how it would bloat the executable. If debug is stripped, there should be no code generated for row<i,j>::value, i suspect

Johannes Schaub - litb 2009-06-23 14:36:22

If you unroll the loop, ie if you compute the addresses at compile time the code WILL be bigger not matter how you do it.

Ben 2009-06-23 14:51:25

True, Ben, which is why you usually don't unroll the entire loop, but only a few iterations. but if you want to compute all these indices at compile-time, I think the entire loop has to be unrolled

jalf 2009-06-23 15:02:31

Answer 5

+3 A:

You could use Boost.Mpl to implement the whole thing at compile-time, but I'm not sure it'll be faster. (Mpl essentially re-implements all the STL algorithms as compile-time metaprogramming templates)

The problem with that approach is that you end up unrolling and inlining a lot of code, which may thrash the instruction cache and eat up memory bandwidth that could have been saved. That may produce huge, bloated and slow code.

I would probably probably rather trust the compiler to inline the functions that make sense. As long as the row and column function definitions are visible from the loop, the compiler can trivially inline the calls and unroll as many iterations as it deems beneficial.

jalf 2009-06-23 14:08:19

Answer 6

A:

I'm not a fan of template meta-programming, so you may want to take this answer with a pinch of salt. But, before I invested ANY time on this problem, I'd ask myself the following:-

a: Is my for loop a bottleneck?

b: Is unrolling it going to make any difference? Easily tested by hand-unrolling and measuring.

In many compilers/cpus, the "looped" version can give better performance due to cache effects.

Remember: Measure first, optimise later - if at all.

Roddy 2009-06-23 14:12:01

hehe I knew you will be somewhere around this Q.

Ben 2009-06-23 14:26:19

There's some relevant discussion under T.E.D.'s response - the unrolling is not the real goal, the real goal is eliminating the row() and column() functions with equivalent TMP's. Loop unrolling becomes a secondary requirement.The looped, arbitrary order code has been profiled and is definitely slower than the legacy (unlooped, fixed order) code.

RAC 2009-06-23 14:33:14

Answer 7

+1 A:

f would need to return a double - that can't be done at compile time.

James Hopkin 2009-06-23 14:29:42

Admittedly this is a part that is still a little murky to me. However, f() is always a fraction and I can do TMP's for it's numerator and denominator independently, then static_cast to doubles and do the ratio in floating pt. (I hope). But very good point.

RAC 2009-06-23 14:37:24

Answer 8

A:

If you are willing to modify the syntax a bit you can do something like this:

template <int i, int ubound>
struct OuterFor {
    void operator()() {
        InnerFor<i, 0, J>()();
        OuterFor<i + 1, ubound>()();
    }
};

template <int ubound>
struct OuterFor <ubound, ubound> {
    void operator()() {
    }
};

In InnerFor, i is the outer loops counter (compile time constant), j is the inner loops counter (initially 0 - also compile time constant), so you can evaluate row as a compile time template.

Its a bit more complicated, but as you say, row(), col(), and f() are your complicated parts anyways. At least try it and see if the performance is worth it. It may be worth it to investigate other options to simplify your row(), etc functions.

Greg Rogers 2009-06-23 14:32:35

I really like where this solution could be headed - could you provide an example that actually calls/invokes/instantiates the loops?Also, how easily can this be extended to deeper nested loops? A four-deep nested for loop is not out of the question. (loops over polarization state, x, y and z).

RAC 2009-06-23 14:40:31

Answer 9

+1 A:

I've never tried to do this so take this idea with a grain of salt...

It seems like you could use Boost.Preprocessor to do the loop unrolling (specifically the BOOST_PP_FOR and BOOST_PP_FOR_r macros) and then use templates to generate the actual constant expression.

Ferruccio 2009-06-23 14:43:54

After some thought, I don't think this will work. The catch is that the I and J used to terminate the loop are not really constants, instead they are deduced from (not shown) template arguments. The snippet I posted earlier is really wrapped up inside a template<int ORDER> { ... } block, and I and J are computed from ORDER using other TMP's. IIRC the preprocessor is invoked before template expansion takes place, which is going to mess things up I think.I could be wrong though.

RAC 2009-06-23 17:13:52

Answer 10

+5 A:

This is the way to do it directly:

template <int i, int j>
struct inner
{
  static void value()
  {
    A(row<i,j>::value, column<i,j>::value) = f<i,j>::value;
    inner<i, j+1>::value();
  }
};

template <int i> struct inner<i, J> { static void value() {} };

template <int i>
struct outer
{
  static void value()
  {
    inner<i, 0>::value();
    outer<i+1>::value();
  }
};

template <> struct outer<I> { static void value() {} };

void test()
{
  outer<0>::value();
}

You can pass A through as a parameter to each of the values if necessary.

James Hopkin 2009-06-23 15:25:52

ansaurus

tags:

views:

answers:

template-ing a for loop in C++?

related questions