# Fast way to get a close power-of-2 number (floating-point)

• A+
Category：Languages

In numerical computation, it is often needed to scale numbers to be in safe range.

For example, computing Euclidean distance: `sqrt(a^2+b^2)`. Here, if magnitude of `a` or `b` is too small/large, then underflow/overflow can happen.

A common approach to solve this is to divide numbers by the largest magnitude number. However, this solution is:

• slow (division is slow)
• causes a little extra inaccuracy

So I thought that instead of dividing by the largest magnitude number, let's multiply it with a close power-of-2 reciprocal number. This seems a better solution, as:

• multiplication is much faster than division
• better accuracy, as multiplying with a power-of-2 number is exact

So, I'd like to create a small utility function, which has a logic like this (by `^`, I mean exponentiation):

``void getScaler(double value, double &scaler, double &scalerReciprocal) {     int e = <exponent of value>;     if (e<-1022) { scaler=2^-1022; scalerReciprocal = 2^1022; }     } else if (e>1022) { scaler=2^1022; scalerReciprocal = 2^-1022; }     } else { scaler=2^e; scalerReciprocal = 2^(2046-e); } } ``

This function should return a normalized `scaler` & `scalerReciprocal`, both are power-of-2 numbers, where `scaler` is near to `value`, and `scalerReciprocal` is the reciprocal of `scaler`.

The maximum allowed exponents for `scaler`/`scaleReciprocal` are `-1022..1022` (I don't want to work with subnormal `scaler`, as subnormal numbers can be slow).

What would be a fast way to do this? Can this be done with pure floating-point operations? Or should I extract the exponent from `value`, and use simple `if`s to do the logic? Is there some kind of trick to do the comparison with (-)1022 fast (as the range is symmetric)?

Note: `scaler` doesn't need to be the closest possible power-of-2. If some logic needs it, `scaler` can be some small power-of-2 away from the closest value.

Function `s = get_scale(z)` computes the "close power of 2". Since the fraction bits of `s` are zero, the inverse of `s` is just an (inexpensive) integer subtraction: see function `inv_of_scale`.

On x86 `get_scale` and `inv_of_scale` compile to quite efficient assembly with clang. Compiler clang translates the ternary operators to `minsd` and `maxsd`, see also Peter Cordes' comment. With gcc, it is slightly more efficient to translate these functions to x86 intrinsics code (`get_scale_x86` and `inv_of_scale_x86`), see Godbolt.

Note that C explicitly permits type-punning through a union, whereas C++ (c++11) has no such permission Although gcc 8.2 and clang 7.0 do not complain about the union, you can improve the C++ portabily by using the `memcpy` trick instead of the union trick. Such a modification of the code should be trivial. The code should handle subnormals correctly.

``#include<stdio.h> #include<stdint.h> #include<immintrin.h> /* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */  union dbl_int64{     double d;     uint64_t i; };  double get_scale(double t){     union dbl_int64 x;     union dbl_int64 x_min;     union dbl_int64 x_max;     uint64_t mask_i;            /* 0xFEDCBA9876543210 */     x_min.i = 0x0010000000000000ull;     x_max.i = 0x7FD0000000000000ull;     mask_i =  0x7FF0000000000000ull;     x.d = t;     x.i = x.i & mask_i;                    /* Set fraction bits to zero, take absolute value */     x.d = (x.d < x_min.d) ? x_min.d : x.d; /* If subnormal: set exponent to 1                */     x.d = (x.d > x_max.d) ? x_max.d : x.d; /* If exponent is very large: set exponent to 7FD, otherwise the inverse is a subnormal */     return x.d; }  double get_scale_x86(double t){     __m128d x = _mm_set_sd(t);     __m128d x_min = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));     __m128d x_max = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));     __m128d mask  = _mm_castsi128_pd(_mm_set1_epi64x(0x7FF0000000000000ull));             x     = _mm_and_pd(x, mask);             x     = _mm_max_sd(x, x_min);             x     = _mm_min_sd(x, x_max);     return _mm_cvtsd_f64(x); }  /* Compute the inverse 1/t of a double t with all zero fraction bits     */ /* and exponent between the limits of function get_scale                 */ /* A single integer subtraction is much less expensive than a            */ /* floating point division.                                               */ double inv_of_scale(double t){     union dbl_int64 x;                      /* 0xFEDCBA9876543210 */     uint64_t inv_mask = 0x7FE0000000000000ull;     x.d = t;     x.i = inv_mask - x.i;     return x.d; }  double inv_of_scale_x86(double t){     __m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);     __m128d x        = _mm_set_sd(t);     __m128i x_i      = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));     return _mm_cvtsd_f64(_mm_castsi128_pd(x_i)); }   int main(){     int n = 14;     int i;     /* Several example values, 4.94e-324 is the smallest subnormal */     double y = { 4.94e-324, 1.1e-320,  1.1e-300,  1.1e-5,  0.7,  1.7,  123.1, 1.1e300,                        1.79e308, -1.1e-320,    -0.7, -1.7, -123.1,  -1.1e307};     double z, s, u;      printf("Portable code:/n");     printf("             x       pow_of_2        inverse       pow2*inv      x*inverse /n");     for (i = 0; i < n; i++){           z = y[i];         s = get_scale(z);         u = inv_of_scale(s);         printf("%14e %14e %14e %14e %14e/n", z, s, u, s*u, z*u);     }      printf("/nx86 specific SSE code:/n");     printf("             x       pow_of_2        inverse       pow2*inv      x*inverse /n");     for (i = 0; i < n; i++){           z = y[i];         s = get_scale_x86(z);         u = inv_of_scale_x86(s);         printf("%14e %14e %14e %14e %14e/n", z, s, u, s*u, z*u);     }      return 0; } ``

The output looks fine:

``Portable code:              x       pow_of_2        inverse       pow2*inv      x*inverse   4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16  1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13  1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00   1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00   7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00   1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00   1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00  1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00  1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00 -1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13  -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00  -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00  -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00 -1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00  x86 specific SSE code:              x       pow_of_2        inverse       pow2*inv      x*inverse   4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16  1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13  1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00   1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00   7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00   1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00   1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00  1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00  1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00 -1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13  -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00  -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00  -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00 -1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00 ``

Vectorization

Function `get_scale` should vectorize with compilers that support auto-vectorization. The following piece of code vectorizes very well with clang (no need to write SSE/AVX intrinsics code).

``/* Test how well get_scale vectorizes: */ void get_scale_vec(double * __restrict__ t, double * __restrict__ x){     int n = 1024;     int i;     for (i = 0; i < n; i++){         x[i] = get_scale(t[i]);     } } ``

Unfortunately gcc doesn't find the `vmaxpd` and `vminpd` instructions.