- A+

Just curiosity about the standard `sqrt()`

from math.h on GCC works. I coded my own `sqrt()`

using Newton-Raphson to do it!

yeah, I know fsqrt. But how the CPU does it? I can't debug hardware

Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.

Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.

Maybe also see

- https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
- https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations

But however the hardware works, **IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result**, i.e. error <= 0.5 ulp, so

**you don't need to know how it works, just the performance**. These operations are special, other functions like

`log`

and `sin`

do *not*have this requirement, and real library implementations usually aren't that accurate. (And x87

`fsin`

is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially *huge*relative errors.)

See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD `sqrtsd`

/ `sqrtss`

and their wider versions. I collected up the results in Floating point division vs floating point multiplication

For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.

Unlike most instructions, `sqrt`

performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).