Fast 1/X division (reciprocal)




Is there some way to speed up computing the reciprocal (1 over X), if precision is not crucial?



So, I need to calculate 1/X. Is there a workaround that trades precision for speed?





This is heavily dependent on the hardware platform you're working on. Also, it depends on how much precision you're prepared to lose. Obviously, float recip(float x) { return 1; } is very fast, but not very accurate...
– Oliver Charlesworth
Mar 30 '12 at 8:25








Single-precision reciprocals run in 5 cycles on the latest processors. A floating-point multiplication is also 5 cycles. So I seriously doubt you're gonna get any faster than something like (float)1/(float)x.
– Mysticial
Mar 30 '12 at 8:25








For starters, what is your platform and compiler? And what kind of data are you operating on?
– Johan Kotlinski
Mar 30 '12 at 13:01






@Mysticial Be careful: 5 cycles was the absolute best-case (lowest-latency) number, but the worst-case number is around 37 cycles. Remember the hardware implements an iterative root-seeking approximation algorithm, like Newton's method, until the accuracy is sufficient.
– nimig18
Apr 6 '17 at 3:47





5 Answers



First, make sure this isn't a case of premature optimization. Do you know that this is your bottleneck?



As Mysticial says, 1/x can be calculated very quickly. Make sure you're not using the double datatype for either the 1 or the divisor. Floats are much faster.





That said: benchmark, benchmark, benchmark. Don't waste hours on numerical theory just to discover that the source of the poor performance is IO access.
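For instance, a minimal sketch of the float-vs-double point (the function names here are made up for illustration): a double literal such as 1.0 promotes the whole expression to double precision, while 1.0f keeps it in single precision.

```c
/* Hypothetical helpers illustrating the float-vs-double distinction. */
float recip_float(float x)
{
    return 1.0f / x;   /* stays entirely in single precision */
}

double recip_mixed(float x)
{
    return 1.0 / x;    /* the double literal 1.0 promotes the division to double */
}
```

Whether the float version is actually faster depends on the target and compiler flags, which is exactly why benchmarking matters.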





"Floats are much faster" - really? It's dangerous to make such sweeping statements. There are a lot of things you can do to change the code the compiler generates. It also depends on the hardware the compiler is aiming for. For example, on the IA32, the code generated by gcc when not using SSE (the -mfpmath=387 option I think) will be the same speed for double and float since the FPU only deals with 80bit values, any speed difference will be down to memory bandwidth.
– Skizz
Mar 30 '12 at 8:41





Yes, obviously it's a blanket statement. But the question was equally generic. Have the OP give specifics, and I'd be in a position to make a more "eloquent" response.
– Mahmoud Al-Qudsi
Mar 30 '12 at 8:42






1/x can be calculated quickly .. but how do you make a compiler actually emit RCPSS?
– harold
Mar 30 '12 at 9:06





@harold inline ASM?
– Mahmoud Al-Qudsi
Mar 30 '12 at 9:07





Mysticial was being 'mystical' in that statement: he quoted the lower bound of 5 clocks, which was actually the inverse throughput. The best case should have been 6 clock cycles, and at the other end the worst case was around 38 clocks. Measured latency ranges (clocks): 6-21, 7-21, 10-24, 10-15, 14-16, 11-25, 12-26, 10-24, 9-38, 9-37, 19, 22, 19, 38, 6-38. In these cases the average Clocks Per Instruction (CPI) must be used to compare apples and oranges.
– nimig18
Apr 6 '17 at 4:09




I believe that what he was looking for is a more efficient way to approximate 1.0/x, rather than some technical definition of approximation under which you could return 1 as a very imprecise answer. I also believe that this satisfies that.


__inline__ double __attribute__((const)) reciprocal( unsigned long long x )
{
    // The type is unsigned long long, but you are restricted to a max value of
    // 2^32-1, not the 2^64-1 the unsigned long long is capable of storing,
    // because x*x must not overflow 64 bits.
    union {
        double dbl;
        unsigned long long ull;
    } u = { .dbl = (double)(x *= x) }; // x*x = pow( x, 2 )
    u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> 1;
    // pow( pow(x,2), -0.5 ) = pow( x, -1 ) = 1.0 / x
    // This is done via the 'fast' inverse square root trick
    return u.dbl;
}





__inline__ double __attribute__((const)) reciprocal( double x )
{
    union {
        double dbl;
        unsigned long long ull;
    } u;
    u.dbl = x;
    u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> 1;
    // pow( x, -0.5 )
    u.dbl *= u.dbl; // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.dbl;
}





__inline__ float __attribute__((const)) reciprocal( float x )
{
    union {
        float flt;
        unsigned int uint;
    } u;
    u.flt = x;
    u.uint = ( 0xbe6eb3beU - u.uint ) >> 1;
    // pow( x, -0.5 )
    u.flt *= u.flt; // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.flt;
}





Hmm....... I wonder if the CPU manufacturers knew you could get the reciprocal with only a single multiply, subtraction, and bit-shift when they designed the CPU.... hmm.........



As for benchmarking, the hardware multiply (x*x) instructions combined with the hardware subtraction instructions are just as fast as the hardware 1.0/x instruction on modern-day computers (my benchmarks were on an Intel i7, but I would assume similar results for other processors). However, if this algorithm were implemented in hardware as a new assembly instruction, then the increase in speed would probably be good enough for the instruction to be quite practical.



For more information about this method: the implementation is based on the wonderful "fast" inverse square root algorithm.
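As a rough sanity check of the float variant (a sketch I added, not part of the original answer; reciprocal_f is a hypothetical standalone name for the float overload above):

```c
#include <math.h>

/* Standalone copy of the float listing above, under a hypothetical name. */
static float reciprocal_f(float x)
{
    union {
        float flt;
        unsigned int uint;
    } u;
    u.flt = x;
    u.uint = (0xbe6eb3beU - u.uint) >> 1;  /* approx pow(x, -0.5) */
    u.flt *= u.flt;                        /* squared: approx 1/x  */
    return u.flt;
}
```

For inputs like 1.0f, 2.0f, and 10.0f the relative error stays within roughly 7%, consistent with the "fast inverse square root" ancestry of the trick.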



Lastly, please note that I am more of a novice in C++. As such, I welcome with wide-open arms any best-practices, proper-formatting, or clarity edits toward improving the quality of this answer for all who read it.





You could explain the magic number, and what floating point representation it assumes.
– Cheers and hth. - Alf
Sep 27 '16 at 2:49





that is very interesting. Thank you! Do you have any results for comparison tests of precision and speed?
– klm123
Sep 27 '16 at 5:11





Did you test this against x86's approximate-reciprocal instruction, RCPSS on your i7? It's as fast as an integer multiply, and doesn't require moving data from XMM registers to integer. You can use it from C++ with _mm_rcp_ss( _mm_set_ss(x) ). gcc and clang will convert 1.0/x into RCPSS + a Newton-Raphson iteration, if you use -ffast-math, but I think you have to use intrinsics manually if you want the value without an approximation step.
– Peter Cordes
Nov 8 '16 at 14:40







Note that the _mm_set_ss isn't always free even when x is already live in an XMM register, because Intel's intrinsics suck for scalars. See stackoverflow.com/questions/39318496/…, and also stackoverflow.com/questions/40416570/… for silly compilers wasting instructions zeroing high elements when only the low element is going to be used.
– Peter Cordes
Nov 8 '16 at 14:46







@PeterCordes, how do you get the compilers to use RCPSS + Newton-Raphson? A quick test did not achieve this godbolt.org/g/3yAJGb. BTW, you stopped answering questions after 1.1.2017. What happened?
– Z boson
Feb 5 '17 at 13:44




First of all, if you turn on compiler optimizations, the compiler is likely to optimize the calculation if possible (to pull it out of a loop, for example). To see this optimization, you need to build and run in Release mode.



Division may be heavier than multiplication (though a commenter pointed out that reciprocals are just as fast as multiplication on modern CPUs, in which case this doesn't apply), so if 1/X appears more than once inside a loop, you can help by hoisting the result out of the loop (float Y = 1.0f/X;) and then using Y. (The compiler optimization might do this in any case.)





Also, certain formulas can be redesigned to remove division or other inefficient computations. For that, you could post the larger computation being performed. Even there, the program or algorithm itself can sometimes be restructured so that time-consuming loops are not hit as frequently.
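A sketch of the hoisting idea (the function name and array are made up for illustration):

```c
/* Divide every element by X using one division and n multiplications. */
void scale_by_reciprocal(float *a, int n, float X)
{
    float Y = 1.0f / X;          /* hoisted: one division total */
    for (int i = 0; i < n; ++i)
        a[i] *= Y;               /* multiply instead of divide inside the loop */
}
```

Note that the result can differ in the last bit from dividing each element directly, which is one reason compilers only do this transformation themselves under flags like -ffast-math.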



How much accuracy can be sacrificed? If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations.



However, in general, there's no way to speed up division. If there were, compilers would already be doing it.





If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations. How?
– klm123
Mar 30 '12 at 8:39





I did not mean to imply "trivial". Also, I should have added the caveat that X >> 1 (see end of comment). In that case, you can take advantage of X^-1 = 2^(-log_2(X)), and use en.wikipedia.org/wiki/Find_first_set#Algorithms to get the order of magnitude of log_2(X), obtaining the order of magnitude in the form 2^-n. If upper and lower bounds on X are known, this could be used to improve speed. If the other quantities in the calculation (not shown in the question) have known bounds and are somewhat commensurate in order of magnitude, they can be scaled and cast to integers.
– Dan Nissenbaum
Mar 30 '12 at 9:20





Compilers can only hoist Y=1.0f/X out of a loop if you use -ffast-math, so it is a good idea to do that in the source if you don't plan to enable -ffast-math to tell the compiler that you don't care where/when/how rounding happens, or in what order.
– Peter Cordes
Nov 8 '16 at 15:00





The fastest way that I know of is to use SIMD operations. http://msdn.microsoft.com/en-us/library/796k1tty(v=vs.90).aspx
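A sketch of what that can look like with SSE intrinsics on x86 (my own example, not from the answer; it pairs the approximate-reciprocal instruction RCPSS with one Newton-Raphson refinement step):

```c
#include <immintrin.h>   /* x86 SSE intrinsics; x86-only */

/* Approximate 1/x: RCPSS gives roughly 12 bits of precision; one
   Newton-Raphson step, r = r*(2 - x*r), roughly doubles that. */
static inline float fast_recip(float x)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
    r = r * (2.0f - x * r);
    return r;
}
```

Skip the refinement step if ~12 bits of precision is already enough for your use case.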





Or to buy a faster CPU? :) The question is algorithmic.
– klm123
Jan 16 '14 at 12:16






Or maybe take advantage of the hardly-used full capacity of your current CPU? Branch-prediction exploits? Maybe even taking advantage of streamlining? Just a thought.........
– Jack Giffin
Oct 14 '16 at 14:28






RCPSS/ RCPPS is a good suggestion. Fast approximate inverse (and inverse sqrt) are available in hardware on x86, for scalar or a SIMD vector of floats. You don't have to be using SIMD for the rest of your loop to take advantage. If this answer had explained any of that, it wouldn't have gotten such confused comments.
– Peter Cordes
Nov 8 '16 at 15:03




This should do it, with a number of pre-unrolled Newton iterations evaluated as a Horner polynomial, which uses fused multiply-accumulate operations that most modern-day CPUs execute in a single clock cycle (every time):


#include <stdint.h>

float inv_fast(float x)
{
    union { float f; int32_t i; } v;
    float w, sx;

    sx = (x < 0) ? -1.0f : 1.0f;
    x = sx * x;

    v.i = (int32_t)(0x7EF127EA - *(uint32_t *)&x);
    w = x * v.f;

    // Efficient iterative approximation improvement in Horner polynomial form.
    v.f = v.f * (2 - w); // Single iteration, Err = -3.36e-3 * 2^(-flr(log2(x)))
    // v.f = v.f * ( 4 + w * (-6 + w * (4 - w))); // Second iteration, Err = -1.13e-5 * 2^(-flr(log2(x)))
    // v.f = v.f * (8 + w * (-28 + w * (56 + w * (-70 + w *(56 + w * (-28 + w * (8 - w))))))); // Third iteration, Err = +-6.8e-8 * 2^(-flr(log2(x)))

    return v.f * sx;
}



Fine print: closer to 0, the approximation does not do so well, so either you, the programmer, need to test the performance or restrict the input from getting too low before resorting to hardware division.
i.e. be responsible!
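As a quick accuracy check of the single-iteration version (my own harness; the 1% threshold is a loose bound I chose, not a figure from the answer):

```c
#include <stdint.h>
#include <math.h>

/* Copy of the single-iteration inv_fast above. */
static float inv_fast(float x)
{
    union { float f; int32_t i; } v;
    float w, sx;

    sx = (x < 0) ? -1.0f : 1.0f;
    x = sx * x;

    v.i = (int32_t)(0x7EF127EA - *(uint32_t *)&x);
    w = x * v.f;
    v.f = v.f * (2 - w);   /* one Newton iteration */

    return v.f * sx;
}
```

For moderate inputs, including negative ones, the absolute error stays well under 1% of the true reciprocal, matching the Err = -3.36e-3 figure quoted in the listing.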






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
