To be accurate, this is originally from Hastings 1955, Princeton "APPROXIMATIONS FOR DIGITAL COMPUTERS BY CECIL HASTINGS", page 159-163, there are actually multiple versions of the approximation with different constants used. So the original work was done with the goal of being performant for computers of the 1950's. Then the famous Abramowitz and Stegun guys put that in formula 4.4.45 with permission, then the nvidia CG library wrote some code that was based upon the formula, likely with some optimizations.
This line:
> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.
was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."
In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)
I'm pretty sure it's not faster, but it was fun to write:
float asin(float x) {
float x2 = 1.0f-fabs(x);
u32 i = bitcast(x2);
i = 0x5f3759df - (i>>1);
float inv = bitcast(i);
return copysign(pi/2-pi/2*(x2*inv),x);
}
Courtesy of evil floating point bithacks.> After all of the above work and that talk in mind, I decided to ask an LLM.
Impressive that an LLM managed to produce the answer from a 7 year old stack overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.
> In any graphics application trigonometric functions are frequently used.
Counterpoint from the man himself, "avoiding trigonometry":
Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!
They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.
I searched for asin on Intel's intrinsics guide. They have a AVX-512 instrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.
https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
I had to do an atan() on an slow embedded device once for an autonomous robot competition.
Fastest impl I came up with was rounding and big switch statement.
> Nobody likes throwing away work they've done
I like throwing away work I've done. Frees up my mental capacity for other work to throw away.
To the blog poster:
Robin Green is an excellent resource
Faster Math Functions:
https://basesandframes.wordpress.com/wp-content/uploads/2016...
https://basesandframes.wordpress.com/wp-content/uploads/2016...
Even faster math functions GDC 2020:
https://www.gdcvault.com/play/1027337/Math-in-Game-Developme...
It appears that the real lesson here was to lean quite a bit more on theory than a programmer's usual roll-your-own heuristic would suggest.
A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.
Wouldn't it also be much better to evaluate the Taylor polynomials using Horner's method, instead? (Maybe C++ can do this automatically, but given that there might be rounding differences, it probably won't.)
In DSP math, it is common to use Chebyshev polynomial approximation. You can get incredibly precise results within your required bounds.
Here's my fast acos, which I think can be converted to an asin: https://forwardscattering.org/post/66
Interesting article. A few years back I implemented a bunch of maths primitives, including trig functions, using Taylor sequences etc, to see how it was done. An interesting challenge, even at the elementary level I was working at.
So this article got me wondering how much accuracy is needed before computing a series beats pre-computed lookup tables and interpolation. Anyone got any relevant experience to share?
How much accuracy does ray tracing require?
I am curious, did you check how much your benchmarks moved (time and errors) if at all if you told the compiler to use —-use_fast_math or -ffast-math?
There’s generally not a faster version of inverse trig functions to inline, but it might optimize some other stuff out.
Unrelated to that, I’ve seen implementations (ie julia/base/special/trig) that use a “rational approximation” to asin, did you go down that road at any point?
The 4% improvement doesn't seem like it's worth the effort.
On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced from the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.
Just a point that the constexpr/const use in that C++ code makes no difference to output, and is just noise really.
Ideal HN content, thanks!
Does anyone knows the resources for the algos used in the HW implementations of math functions? I mean the algos inside the CPUs and GPUs. How they make a tradeoff between transistor number, power consumption, cycles, which algos allow this.
Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.
My favorite tool to experiment with math approximation is lolremez. And you can easily ask your llm to do it for you.
Chebyshev approximation for asin is sum(2T_n(x) / (pi*n*n),n), the even terms are 0.
We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c
If you are interested in such "tricks", you should check out the classic Hacker's Delight by Henry Warren
While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting easy opportunity for improvement.
Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.
I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.