Fast way to get a close power-of-2 number (floating-point)

  • A+
Category:Languages

In numerical computation, it is often needed to scale numbers to be in safe range.

For example, computing Euclidean distance: sqrt(a^2+b^2). Here, if magnitude of a or b is too small/large, then underflow/overflow can happen.

A common approach to solve this is to divide numbers by the largest magnitude number. However, this solution is:

  • slow (division is slow)
  • causes a little extra inaccuracy

So I thought that instead of dividing by the largest magnitude number, let's multiply it with a close power-of-2 reciprocal number. This seems a better solution, as:

  • multiplication is much faster than division
  • better accuracy, as multiplying with a power-of-2 number is exact

So, I'd like to create a small utility function, which has a logic like this (by ^, I mean exponentiation):

void getScaler(double value, double &scaler, double &scalerReciprocal) {     int e = <exponent of value>;     if (e<-1022) { scaler=2^-1022; scalerReciprocal = 2^1022; }     } else if (e>1022) { scaler=2^1022; scalerReciprocal = 2^-1022; }     } else { scaler=2^e; scalerReciprocal = 2^(2046-e); } } 

This function should return a normalized scaler & scalerReciprocal, both are power-of-2 numbers, where scaler is near to value, and scalerReciprocal is the reciprocal of scaler.

The maximum allowed exponents for scaler/scaleReciprocal are -1022..1022 (I don't want to work with subnormal scaler, as subnormal numbers can be slow).

What would be a fast way to do this? Can this be done with pure floating-point operations? Or should I extract the exponent from value, and use simple ifs to do the logic? Is there some kind of trick to do the comparison with (-)1022 fast (as the range is symmetric)?

Note: scaler doesn't need to be the closest possible power-of-2. If some logic needs it, scaler can be some small power-of-2 away from the closest value.

 


Function s = get_scale(z) computes the "close power of 2". Since the fraction bits of s are zero, the inverse of s is just an (inexpensive) integer subtraction: see function inv_of_scale.

On x86 get_scale and inv_of_scale compile to quite efficient assembly with clang. Compiler clang translates the ternary operators to minsd and maxsd, see also Peter Cordes' comment. With gcc, it is slightly more efficient to translate these functions to x86 intrinsics code (get_scale_x86 and inv_of_scale_x86), see Godbolt.

Note that C explicitly permits type-punning through a union, whereas C++ (c++11) has no such permission Although gcc 8.2 and clang 7.0 do not complain about the union, you can improve the C++ portabily by using the memcpy trick instead of the union trick. Such a modification of the code should be trivial. The code should handle subnormals correctly.

#include<stdio.h> #include<stdint.h> #include<immintrin.h> /* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */  union dbl_int64{     double d;     uint64_t i; };  double get_scale(double t){     union dbl_int64 x;     union dbl_int64 x_min;     union dbl_int64 x_max;     uint64_t mask_i;            /* 0xFEDCBA9876543210 */     x_min.i = 0x0010000000000000ull;     x_max.i = 0x7FD0000000000000ull;     mask_i =  0x7FF0000000000000ull;     x.d = t;     x.i = x.i & mask_i;                    /* Set fraction bits to zero, take absolute value */     x.d = (x.d < x_min.d) ? x_min.d : x.d; /* If subnormal: set exponent to 1                */     x.d = (x.d > x_max.d) ? x_max.d : x.d; /* If exponent is very large: set exponent to 7FD, otherwise the inverse is a subnormal */     return x.d; }  double get_scale_x86(double t){     __m128d x = _mm_set_sd(t);     __m128d x_min = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));     __m128d x_max = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));     __m128d mask  = _mm_castsi128_pd(_mm_set1_epi64x(0x7FF0000000000000ull));             x     = _mm_and_pd(x, mask);             x     = _mm_max_sd(x, x_min);             x     = _mm_min_sd(x, x_max);     return _mm_cvtsd_f64(x); }  /* Compute the inverse 1/t of a double t with all zero fraction bits     */ /* and exponent between the limits of function get_scale                 */ /* A single integer subtraction is much less expensive than a            */ /* floating point division.                                               */ double inv_of_scale(double t){     union dbl_int64 x;                      /* 0xFEDCBA9876543210 */     uint64_t inv_mask = 0x7FE0000000000000ull;     x.d = t;     x.i = inv_mask - x.i;     return x.d; }  double inv_of_scale_x86(double t){     __m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);     __m128d x        = _mm_set_sd(t);     __m128i x_i      = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));     return _mm_cvtsd_f64(_mm_castsi128_pd(x_i)); }   int main(){     int n = 14;     int i;     /* Several example values, 4.94e-324 is the smallest subnormal */     double y[14] = { 4.94e-324, 1.1e-320,  1.1e-300,  1.1e-5,  0.7,  1.7,  123.1, 1.1e300,                        1.79e308, -1.1e-320,    -0.7, -1.7, -123.1,  -1.1e307};     double z, s, u;      printf("Portable code:/n");     printf("             x       pow_of_2        inverse       pow2*inv      x*inverse /n");     for (i = 0; i < n; i++){           z = y[i];         s = get_scale(z);         u = inv_of_scale(s);         printf("%14e %14e %14e %14e %14e/n", z, s, u, s*u, z*u);     }      printf("/nx86 specific SSE code:/n");     printf("             x       pow_of_2        inverse       pow2*inv      x*inverse /n");     for (i = 0; i < n; i++){           z = y[i];         s = get_scale_x86(z);         u = inv_of_scale_x86(s);         printf("%14e %14e %14e %14e %14e/n", z, s, u, s*u, z*u);     }      return 0; } 

The output looks fine:

Portable code:              x       pow_of_2        inverse       pow2*inv      x*inverse   4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16  1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13  1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00   1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00   7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00   1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00   1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00  1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00  1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00 -1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13  -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00  -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00  -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00 -1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00  x86 specific SSE code:              x       pow_of_2        inverse       pow2*inv      x*inverse   4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16  1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13  1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00   1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00   7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00   1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00   1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00  1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00  1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00 -1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13  -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00  -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00  -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00 -1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00 

Vectorization

Function get_scale should vectorize with compilers that support auto-vectorization. The following piece of code vectorizes very well with clang (no need to write SSE/AVX intrinsics code).

/* Test how well get_scale vectorizes: */ void get_scale_vec(double * __restrict__ t, double * __restrict__ x){     int n = 1024;     int i;     for (i = 0; i < n; i++){         x[i] = get_scale(t[i]);     } } 

Unfortunately gcc doesn't find the vmaxpd and vminpd instructions.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: