I am working on a piece of code where I need to deal with uvs (2D texture coordinates) that are not necessarily in the 0 to 1 range. As an example, sometimes I will get a uv with a u component that is 1.2. In order to handle this I am implementing a wrapping which causes tiling by doing the following:

u -= floor(u)
v -= floor(v)

Doing this causes 1.2 to become 0.2 which is the desired result. It also handles negative cases, such as -0.4 becoming 0.6.

However, these calls to floor are rather slow. I have profiled my application using Intel VTune and I am spending a huge amount of cycles just doing this floor operation.

Having done some background reading on the issue, I have come up with the following function which is a bit faster but still leaves a lot to be desired (I am still incurring type conversion penalties, etc).

int inline fasterfloor( const float x ) { return x > 0 ? (int) x : (int) x - 1; }

I have seen a few tricks that are accomplished with inline assembly but nothing that seems to work exactly correct or have any significant speed improvement.

Does anyone know any tricks for handling this kind of scenario?

A: 

What is the maximum input range of your u, v values? If it's a fairly small range, e.g. -5.0 to +5.0, then it will be quicker to repeatedly add/subtract 1.0 until you get within range, rather than calling expensive functions such as floor.
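A sketch of that suggestion (assuming inputs stay within a few units of the target range, so each loop runs at most a handful of times; `wrap01_loop` is an illustrative name):

```c
/* Pull u into [0, 1) by repeated add/subtract; cheap only when
   inputs are already close to the target range. */
static inline float wrap01_loop(float u)
{
    while (u >= 1.0f) u -= 1.0f;  /* e.g. 1.2 -> 0.2 */
    while (u <  0.0f) u += 1.0f;  /* e.g. -0.4 -> 0.6 */
    return u;
}
```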

Paul R
Will probably be slower than his current "fasterfloor" function in many cases.
Wallacoloo
Probably not - int<->float conversion is quite expensive on most CPUs - adding/subtracting 1.0 is just one clock cycle.
Paul R
Yes, but along with the conditionals it might not be as efficient. `if (u > 1) u -= 1` is at least two instructions - the comparison and the subtraction - plus possibly an extra instruction depending on how the architecture handles conditionals.
Wallacoloo
A couple of comments: Loops of this sort (where the end condition depends on something computed within the loop) typically can't be pipelined, which is a big performance hit on many systems. By comparison, int<->float conversion is easily pipelineable if there is surrounding code to pipeline it with. However, in general you can't get reliable answers to these sorts of things with theory; to get real answers, the original poster needs to run benchmarks on the various versions with typical data.
Brooks Moses
Which method is faster will depend on a lot of factors - the distribution of input values, the relative cost of the different instructions on a given CPU, how much other surrounding code can be interleaved to absorb instruction latencies, etc. As has already been pointed out, in such a case as this the only thing to do is to try the various solutions offered and benchmark them, so that you can make an evidence-based optimisation decision rather than a speculative one.
Paul R
+1  A: 

If the range of values that may occur is sufficiently small, perhaps you can binary-search the floor value. For example, if values -2 <= x < 2 can occur...

if (u < 0.0)
{
  if (u < -1.0)
  {
    //  floor is -2
  }
  else
  {
    //  floor is -1
  }
}
else
{
  if (u < 1.0)
  {
    //  floor is 0
  }
  else
  {
    //  floor is 1
  }
}

I make no guarantees about this - I don't know how the cost of these comparisons compares with that of floor - but it may be worth trying.

Steve314
+2  A: 

Another silly idea that might just work if the range is small...

Extract the exponent from the float using bitwise operations, then use a lookup table to find a mask that wipes unwanted bits from the mantissa. Use this to find the floor (wipe bits below the point) to avoid renormalising issues.
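A sketch of that idea for positive floats only, computing the mask directly rather than via a lookup table (`floor_bits_positive` is an illustrative name; negative inputs would need an extra "subtract 1 if any bits were cleared" fix-up, since clearing mantissa bits truncates toward zero):

```c
#include <stdint.h>
#include <string.h>

/* Floor a positive IEEE-754 float by wiping mantissa bits below
   the binary point, selected by the (unbiased) exponent. */
static float floor_bits_positive(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* safe type-punning */
    int e = (int)((bits >> 23) & 0xFF) - 127;  /* unbiased exponent */
    if (e < 0)   return 0.0f;                  /* 0 <= x < 1: floor is 0 */
    if (e >= 23) return x;                     /* no fractional bits left */
    bits &= ~((1u << (23 - e)) - 1u);          /* wipe bits below the point */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```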

EDIT: I deleted this as "too silly, plus it has a +ve vs. -ve issue". Since it got upvoted anyway, it's undeleted, and I'll leave it to others to decide how silly it is.

Steve314
Not that silly; one of the fmod implementations in Newlib (from Sun) does this, so apparently it was considered a reasonable thing to do at least at one point. And that was with arbitrary modulus rather than 1.0! Nasty complicated code, though.
Brooks Moses
+2  A: 

If you are using Visual C++, check the "Enable Intrinsic Functions" compiler setting. If enabled it should make most math functions faster (including floor). Downside is that handling of edge cases (like NaN) could be incorrect, but for a game, you might not care.

jdv
+5  A: 

The operation you want can be expressed using the fmod function (fmodf for floats rather than doubles). One caveat: fmod truncates toward zero, so for negative inputs the result is negative and needs a small fix-up to land in [0, 1):

#include <math.h>
u = fmodf(u, 1.0f);
if (u < 0.0f) u += 1.0f;  /* fmodf(-0.4f, 1.0f) is -0.4f, not 0.6f */

Chances are reasonably good that your compiler will do this in the most efficient way that works.

Alternately, how concerned are you about last-bit precision? Can you put a lower bound on your negative values, such as knowing that they're never below -16.0? If so, something like this will save you a conditional, which is quite likely to be useful if it's not something that can be reliably branch-predicted with your data:

u = (u + 16.0);  // Does not affect fractional part aside from roundoff errors.
u -= (int)u;     // Recovers fractional part if positive.

(For that matter, depending on what your data looks like and the processor you're using, if a large fraction of them are negative but a very small fraction are below -16.0, you might find that adding 16.0f before doing your conditional int-casting gives you a speedup because it makes your conditional predictable. Or your compiler may be doing that with something other than a conditional branch in which case it's not useful; it's hard to say without testing and looking at generated assembly.)
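Put together as a helper, the offset trick looks like this (a sketch; `wrap01_offset` is an illustrative name, and the assumption u >= -16.0 must hold):

```c
/* Offset trick: shift u into the non-negative range so that
   truncation toward zero coincides with floor. Assumes u >= -16.0f. */
static inline float wrap01_offset(float u)
{
    u += 16.0f;          /* fractional part unchanged, aside from roundoff */
    return u - (int)u;   /* (int) truncation now equals floor */
}
```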

Brooks Moses
+2  A: 

So you want a really fast float->int conversion? AFAIK int->float conversion is fast, but on MSVC++ at least, a float->int conversion invokes a small helper function, ftol(), which does some complicated stuff to ensure a standards-compliant conversion is done. If you don't need such strict conversion, you can do some assembly hackery, assuming you're on an x86-compatible CPU.

Here's a function for a fast float-to-int conversion, using MSVC++ inline assembly syntax (note that fistp uses the FPU's current rounding mode, which is round-to-nearest by default, rather than truncating like a C cast - it should give you the right idea anyway):

inline int ftoi_fast(float f)
{
    int i;

    __asm
    {
        fld f
        fistp i
    }

    return i;
}

On MSVC++ 64-bit you'll need an external .asm file, since the 64-bit compiler rejects inline assembly. That function basically uses the raw x87 FPU instructions for load float (fld) then store float as integer (fistp). (Note of warning: you can change the rounding mode used here by directly tweaking registers on the CPU, but don't do that - you'll break a lot of stuff, including MSVC's implementation of sin and cos!)

If you can assume SSE support on the CPU (or there's an easy way to make an SSE-supporting codepath) you can also try:

#include <emmintrin.h>

inline int ftoi_sse1(float f)
{
    return _mm_cvtt_ss2si(_mm_load_ss(&f));     // SSE1 instructions for float->int
}

...which is basically the same (load float then store as integer) but using SSE instructions, which are a bit faster; note the extra 't' in cvtt means this one truncates toward zero, like a standard C cast.

One of those should cover the expensive float-to-int case, and any int-to-float conversions should still be cheap. Sorry to be Microsoft-specific here but this is where I've done similar performance work and I got big gains this way. If portability/other compilers are an issue you'll have to look at something else, but these functions compile to maybe two instructions taking <5 clocks, as opposed to a helper function that takes 100+ clocks.
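For the original wrapping problem, either fast conversion slots in like this (a sketch using a plain cast, which compilers targeting SSE typically compile to a single cvttss2si; since truncation rounds toward zero rather than down, negatives need a fix-up - `wrap01_trunc` is an illustrative name):

```c
/* Wrap u into [0, 1) using a truncating float->int conversion. */
static inline float wrap01_trunc(float u)
{
    int i = (int)u;                    /* truncates toward zero */
    float f = u - (float)i;
    return f < 0.0f ? f + 1.0f : f;    /* fix up negatives: -0.4 -> 0.6 */
}
```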

AshleysBrain