Fast modulo 12 algorithm for 4 uint16_t's packed in a uint64_t

  • A+
Category:Languages

Consider the following union:

union Uint16Vect {   uint16_t _comps[4];   uint64_t _all; }; 

Is there a fast algorithm for determining whether each component equals 1 modulo 12 or not?

A naïve sequence of code is:

Uint16Vect F(const Uint16Vect a) {     Uint16Vect r;     for(int8_t k=0; k<4; k++) {         r._comps[k] = (a._comps[k] % 12 == 1) ? 1 : 0;     }     return r; } 

 


Divisions by constants like this should be done with a multiplication by the multiplicative inverse. As you can see, compilers optimize x/12 to x*43691 >> 19

bool h(uint16_t x) {     return x % 12 == 1; } h(unsigned short):         movzx   eax, di         imul    eax, eax, 43691 ; = 0xFFFF*8/12 + 1         shr     eax, 19         lea     eax, [rax+rax*2]         sal     eax, 2         sub     edi, eax         cmp     di, 1         sete    al         ret 

Because there are multiplication instructions in SSE/AVX, this can easily be vectorized. Besides, x = (x % 12 == 1) ? 1 : 0; can be simplified to x = (x % 12 == 1) and then transformed to x = (x - 1) % 12 == 0 which avoids a constant table for the value 1 to compare. You can use the vector extension so that gcc automatically generates code for you

typedef uint16_t ymm __attribute__((vector_size(32))); ymm mod12(ymm x) {     return !!((x - 1) % 12); } 

Below is the output from gcc

mod12(unsigned short __vector(16)):         vpcmpeqd    ymm3, ymm3, ymm3  ; ymm3 = -1         vpaddw      ymm0, ymm0, ymm3         vpmulhuw    ymm1, ymm0, YMMWORD PTR .LC0[rip] ; multiply with 43691         vpsrlw      ymm2, ymm1, 3         vpsllw      ymm1, ymm2, 1         vpaddw      ymm1, ymm1, ymm2         vpsllw      ymm1, ymm1, 2         vpcmpeqw    ymm0, ymm0, ymm1         vpandn      ymm0, ymm0, ymm3         ret 

Clang and ICC don't support !! on vector types so you need to change to (x - 1) % 12 == 0. Unfortunately it seems that compilers don't support __attribute__((vector_size(8)) to emit MMX instructions. But nowadays you should use SSE or AVX anyway

The output for x % 12 == 1 is shorter as you can see in the same Godbolt link above, but you need a table containing 1s to compare, which may be better or not. Check which one works faster in your case


Alternatively for a small input range like this you can use a lookup table. The basic version needs a 65536-element array

#define S1(x) ((x) + 0) % 12 == 1, ((x) + 1) % 12 == 1, ((x) + 2) % 12 == 1, ((x) + 3) % 12 == 1, /               ((x) + 4) % 12 == 1, ((x) + 4) % 12 == 1, ((x) + 6) % 12 == 1, ((x) + 7) % 12 == 1 #define S2(x) S1((x + 0)*8), S1((x + 1)*8), S1((x + 2)*8), S1((x + 3)*8), /               S1((x + 4)*8), S1((x + 4)*8), S1((x + 6)*8), S1((x + 7)*8) #define S3(x) S2((x + 0)*8), S2((x + 1)*8), S2((x + 2)*8), S2((x + 3)*8), /               S2((x + 4)*8), S2((x + 4)*8), S2((x + 6)*8), S2((x + 7)*8) #define S4(x) S3((x + 0)*8), S3((x + 1)*8), S3((x + 2)*8), S3((x + 3)*8), /               S3((x + 4)*8), S3((x + 4)*8), S3((x + 6)*8), S3((x + 7)*8)  bool mod12e1[65536] = {     S4(0U), S4(8U), S4(16U), S4(24U), S4(32U), S4(40U), S4(48U), S4(56U) } 

To use just replace x % 12 == 1 with mod12e1[x]. This can of course be vectorized

But since the result is only 1 or 0, you can also use a 65536-bit array to reduce the size to only 8KB


You can also check divisibility by 12 by divisibility by 4 and 3. Divisibility by 4 is obviously trivial. Divisibility by 3 can be calculated by multiple ways

  • One is calculating the difference between the sum of the odd digits and the sum of the even digits like in גלעד ברקן's answer and check if it's divisible by 3 or not
  • Or you can check whether the sum of the digits in base 22k (like base 4, 16, 64...) to see if it's divisible by 3 or not.

    That works because in base b to check divisibility of any divisors n of b - 1, just check if the sum of the digits is divisible by n or not. Here's an implementation of it

    void modulo12equals1(uint16_t d[], uint32_t size) {     for (uint32_t i = 0; i < size; i++)     {         uint16_t x = d[i] - 1;         bool divisibleBy4 = x % 4 == 0;         x = (x >> 8) + (x & 0x00ff); // max 1FE         x = (x >> 4) + (x & 0x000f); // max 2D         bool divisibleBy3 = !!((01111111111111111111111ULL >> x) & 1);         d[i] = divisibleBy3 && divisibleBy4;     } } 

Credits for the divisibility by 3 to Roland Illig

Since the autovectorized assembly output is too long, you can check it on the Godbolt link

See also

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: