Hi,
I am working on a GPU device with very high integer division latency, several hundred cycles, and I am looking to optimize divisions.
All divisions are by a denominator in the set { 1, 3, 6, 10 }; the numerator, however, is a positive runtime value, roughly 32000 or less. Due to memory constraints, a lookup table may not be a good option.
Can you think of alternatives? I have thought of computing floating-point inverses and multiplying the numerator by them.
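For reference, the floating-point inverse idea could look roughly like the sketch below. The names div_by_recip and inv_n are made up for illustration, and it assumes float multiply and float-to-int conversion are cheap on the device, which I have not verified. The reciprocal is rounded, so the truncated product can be off by one (for example when m is an exact multiple of n); the two comparisons fix that without any division.

```c
#include <stdint.h>

/* Hypothetical helper (not from any library): divide m by n using a
 * precomputed float reciprocal inv_n = 1.0f / n.
 * Assumes 0 <= m <= 32767 and n in {1, 3, 6, 10}, so m is exactly
 * representable in a float and the estimate is off by at most 1. */
static inline uint32_t div_by_recip(uint32_t m, uint32_t n, float inv_n)
{
    /* Truncating estimate of m / n. */
    uint32_t q = (uint32_t)((float)m * inv_n);

    /* Estimate too high: step down by one. */
    q -= (q * n > m);
    /* Estimate too low: step up by one. */
    q += ((q + 1u) * n <= m);

    return q;
}
```

The four reciprocals (1.0f, 1.0f/3, 1.0f/6, 1.0f/10) would be constants, so no lookup table in memory is needed.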
Thanks
PS. Thank you, people. The bit shift hack is really cool. To recover from roundoff, I use the following C segment:
// q ~= m/n (estimate, possibly one too small)
q += (n * (q + 1) - 1) < m;
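Putting it together, here is a minimal sketch of how I understand the bit shift hack, assuming it means multiplying by a precomputed fixed-point reciprocal and shifting right. The helper names magic_for and div_mulshift, and the choice of a 16-bit shift, are my own assumptions. The constants are floor(2^16 / n), so the estimate is never too high and is at most one too low for m up to 32767; the correction step then adds the missing one.

```c
#include <stdint.h>

/* Hypothetical helper: fixed-point reciprocal floor(2^16 / n)
 * for the only denominators used here, n in {1, 3, 6, 10}. */
static inline uint32_t magic_for(uint32_t n)
{
    switch (n) {
    case 1:  return 65536u;
    case 3:  return 21845u;
    case 6:  return 10922u;
    default: return 6553u;   /* n == 10 */
    }
}

/* Hypothetical helper: m / n without a divide instruction.
 * Assumes 0 <= m <= 32767, so m * magic fits in 32 bits. */
static inline uint32_t div_mulshift(uint32_t m, uint32_t n)
{
    /* Rounded-down reciprocal: q is exact or one too small. */
    uint32_t q = (m * magic_for(n)) >> 16;

    /* Roundoff recovery: add 1 when n * (q + 1) <= m. */
    q += (n * (q + 1u) - 1u) < m;

    return q;
}
```

For example, div_mulshift(9, 3) first gives the estimate 2 and the correction raises it to 3. If the denominator is known per kernel, the constant can be passed in directly instead of going through the switch.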