# How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?

• A+
Category：Languages

Just curiosity about the standard `sqrt()` from math.h on GCC works. I coded my own `sqrt()` using Newton-Raphson to do it!

yeah, I know fsqrt. But how the CPU does it? I can't debug hardware

Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.

But however the hardware works, IEEE requires `sqrt` (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like `log` and `sin` do not have this requirement, and real library implementations usually aren't that accurate. (And x87 `fsin` is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)
See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD `sqrtsd` / `sqrtss` and their wider versions. I collected up the results in Floating point division vs floating point multiplication
Unlike most instructions, `sqrt` performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).