views:

1073

answers:

10

First time poster, be gentle!

I have a C++ snippet below with a run-time for loop,

for(int i = 0; i < I; i++) 
  for (int j = 0; j < J; j++) 
    A( row(i,j), column(i,j) ) = f(i,j);

The snippet is called repeatedly. The loop bounds 'I' and 'J' are known at compile time (I/J are the order of 2 to 10). I would like to unroll the loops somehow using templates. The main bottleneck is the row() and column() and f() functions. I would like to replace them with equivalent metaprograms that are evaluated at compile-time,using row<i,j>::enum tricks.

What I'd really love is something that eventually resolves the loop into a sequence of statements like:

A(12,37) = 0.5;
A(15,23) = 0.25;
A(14,45) = 0.25;

But I'd like to do so without wrecking the for-for structure too much. Something in the spirit of:

TEMPLATE_FOR<i,0,I>
  TEMPLATE_FOR<j,0,J>
     A( row<i,j>::value, column<i,j>::value ) = f<i,j>::value

Can boost::lambda (or something else) help me create this?

Thanks for the help! This site is great.

+4  A: 

Check out this site and the bubbleSorts implementations.

Nick D
+9  A: 

A good compiler should do unrolling for you. For instance, in gcc compiling with the -O2 option turns on loop unrolling.

If you try to do it yourself manually, unless you measure things carefully and really know what you are doing, you are liable to end up with slower code. For example, in your case with manual unrolling you are liable to prevent the compiler from being able to do a loop interchange or stripmine optimization (look for --floop-interchange and -floop-strip-mine in the gcc docs)

T.E.D.
[OP] It's not the overhead of the loops I am trying to optimize away, it's more the evaluation of row(), column() and f(). They are currently regular, free functions. I want to guarantee they are evaluated at compile time instead of run time.
RAC
If you make those routines "inline", that might give the optimizer enough extra info to do that itself.
T.E.D.
There's no reason why it shouldn't be able to inline those, as long as the full function definitions are visible at the call site.
jalf
I will try, but row() and column() are actually kinda complicated (and recursive) functions. If the compiler resolves it all the way down to a constant, I'd be very happy - OTOH if it just drops in the function body I don't think I'll see much performance increase.The legacy code that I am replacing with these loops literally is a snippet that reads like "A(12,37) = 0.5;A(15,23) = 0.25;A(14,45) = 0.25;", where someone had done all the row(), column() mapping by hand. I'd like the replacement automate the process but still hit that performance level.
RAC
why are the functions so complicated? That seems like a better place to try to optimize.
jalf
The row() and column() functions are "nomenclature"/"numbering" routines for finite element basis functions over simplices (tetrahedra/triangles) of arbitrary order. There's a lot of recursion, both w.r.t. order and w.r.t. dimension. I couldn't figure out a smarter way to do it (it was tough enough just to get it correct), but maybe there is.
RAC
Ick. Better you than me, man. :-)
T.E.D.
Not to double post too much, but it's really the recursive nature of the row(), column() functions that is making me seek this solution to begin with! TMP is so good at recursion, that I know I can do TMP versions of row<>::value and column<>::value and eliminate their runtime calculation altogether. But once you have these TMP row<i,j> things inside the loop, your i,j must be compile time constants too, so I'd have to unroll the for loops using a TMP mechanism too.
RAC
Is there any chance you'll be able to cache/reuse the indices computed by the row/column functions? If the same indices are calculated multiple times, that could give you a big saving too. But yeah, sounds like it gets a bit more complex than I'd thought initially. :)
jalf
IF the row and column functions are that complicated you're going to struggle to implement them in templates. You can't just run C++ code at compile time, you have to write it all again using cryptic compiler error messages as a debugger.You'd be better of running another program as part of your build that generates all the values and spits out a .h file with an array of them in.
Matthew Monaghan
+4  A: 

Good question, I had to do some research to answer this.

You could use Boost MPL. An example of loop unrolling is on this mpl::for_each page.

for_each< range_c<int,0,10> >( value_printer() );

Edit: On reflection it doesn't seem that it's all evaluated at compile time but it may be a good starting point.

CiscoIPPhone
+3  A: 

I would say it is a false good-idea.

In C++ this : row<i,j>::value
means you will have as many differents row<>() functions than you have i * j. You don't want this because it will increase the size of the code and do a lot of instruction cache misses.

I observed this when I was doing template functions to avoid a single boolean check.

If is a short function just inline it.

Ben
The idea was to let the compiler do the math, not the processor. For a 10 by 10 matrix, that's 200 functions being evaluated at compile time, and replaced by adresses in the final binary. That should still fit the cache, wouldn't it?
xtofl
Yes it will, but you'll have to load all thoses functions inside the cache for a single use instead of using 200 times the same one. And after the run the i-cache will be filled with useless code. @RAC However I may be wrong it cache issues are always tricky let us know if you can get an improvment.
Ben
with row<i,j> being a class type, and value being a static const int, i don't see how it would bloat the executable. If debug is stripped, there should be no code generated for row<i,j>::value, i suspect
Johannes Schaub - litb
If you unroll the loop, ie if you compute the addresses at compile time the code WILL be bigger not matter how you do it.
Ben
True, Ben, which is why you usually don't unroll the entire loop, but only a few iterations. but if you want to compute all these indices at compile-time, I think the entire loop has to be unrolled
jalf
+3  A: 

You could use Boost.Mpl to implement the whole thing at compile-time, but I'm not sure it'll be faster. (Mpl essentially re-implements all the STL algorithms as compile-time metaprogramming templates)

The problem with that approach is that you end up unrolling and inlining a lot of code, which may thrash the instruction cache and eat up memory bandwidth that could have been saved. That may produce huge, bloated and slow code.

I would probably probably rather trust the compiler to inline the functions that make sense. As long as the row and column function definitions are visible from the loop, the compiler can trivially inline the calls and unroll as many iterations as it deems beneficial.

jalf
A: 

I'm not a fan of template meta-programming, so you may want to take this answer with a pinch of salt. But, before I invested ANY time on this problem, I'd ask myself the following:-

a: Is my for loop a bottleneck?

b: Is unrolling it going to make any difference? Easily tested by hand-unrolling and measuring.

In many compilers/cpus, the "looped" version can give better performance due to cache effects.

Remember: Measure first, optimise later - if at all.

Roddy
hehe I knew you will be somewhere around this Q.
Ben
There's some relevant discussion under T.E.D.'s response - the unrolling is not the real goal, the real goal is eliminating the row() and column() functions with equivalent TMP's. Loop unrolling becomes a secondary requirement.The looped, arbitrary order code has been profiled and is definitely slower than the legacy (unlooped, fixed order) code.
RAC
+1  A: 

f would need to return a double - that can't be done at compile time.

James Hopkin
Admittedly this is a part that is still a little murky to me. However, f() is always a fraction and I can do TMP's for it's numerator and denominator independently, then static_cast to doubles and do the ratio in floating pt. (I hope). But very good point.
RAC
A: 

If you are willing to modify the syntax a bit you can do something like this:

template <int i, int ubound>
struct OuterFor {
    void operator()() {
        InnerFor<i, 0, J>()();
        OuterFor<i + 1, ubound>()();
    }
};

template <int ubound>
struct OuterFor <ubound, ubound> {
    void operator()() {
    }
};

In InnerFor, i is the outer loops counter (compile time constant), j is the inner loops counter (initially 0 - also compile time constant), so you can evaluate row as a compile time template.

Its a bit more complicated, but as you say, row(), col(), and f() are your complicated parts anyways. At least try it and see if the performance is worth it. It may be worth it to investigate other options to simplify your row(), etc functions.

Greg Rogers
I really like where this solution could be headed - could you provide an example that actually calls/invokes/instantiates the loops?Also, how easily can this be extended to deeper nested loops? A four-deep nested for loop is not out of the question. (loops over polarization state, x, y and z).
RAC
+1  A: 

I've never tried to do this so take this idea with a grain of salt...

It seems like you could use Boost.Preprocessor to do the loop unrolling (specifically the BOOST_PP_FOR and BOOST_PP_FOR_r macros) and then use templates to generate the actual constant expression.

Ferruccio
After some thought, I don't think this will work. The catch is that the I and J used to terminate the loop are not really constants, instead they are deduced from (not shown) template arguments. The snippet I posted earlier is really wrapped up inside a template<int ORDER> { ... } block, and I and J are computed from ORDER using other TMP's. IIRC the preprocessor is invoked before template expansion takes place, which is going to mess things up I think.I could be wrong though.
RAC
+5  A: 

This is the way to do it directly:

template <int i, int j>
struct inner
{
  static void value()
  {
    A(row<i,j>::value, column<i,j>::value) = f<i,j>::value;
    inner<i, j+1>::value();
  }
};

template <int i> struct inner<i, J> { static void value() {} };

template <int i>
struct outer
{
  static void value()
  {
    inner<i, 0>::value();
    outer<i+1>::value();
  }
};

template <> struct outer<I> { static void value() {} };

void test()
{
  outer<0>::value();
}

You can pass A through as a parameter to each of the values if necessary.

James Hopkin