How to write a maintainable, fast, compile-time bit-mask in C++?

  • A+
Category:Languages

I have some code that is morally like this:

#include <bitset>  enum Flags { A = 1, B = 2, C = 3, D = 5,              E = 8, F = 13, G = 21, H,              I, J, K, L, M, N, O };  void apply_known_mask(std::bitset<64> &bits) {     const Flags important_bits[] = { B, D, E, H, K, M, L, O };     std::remove_reference<decltype(bits)>::type mask{};     for (const auto& bit : important_bits) {         mask.set(bit);     }      bits &= mask; } 

Clang >= 3.6 does the smart thing and compiles this to a single and instruction (which then gets inlined everywhere else):

apply_known_mask(std::bitset<64ul>&):  # @apply_known_mask(std::bitset<64ul>&)         and     qword ptr [rdi], 775946532         ret 

But every version of GCC I've tried compiles this to an enormous mess that includes error handling that should be statically DCE'd. In other code, it will even place the important_bits equivalent as data in line with the code!

.LC0:         .string "bitset::set" .LC1:         .string "%s: __position (which is %zu) >= _Nb (which is %zu)" apply_known_mask(std::bitset<64ul>&):         sub     rsp, 40         xor     esi, esi         mov     ecx, 2         movabs  rax, 21474836482         mov     QWORD PTR [rsp], rax         mov     r8d, 1         movabs  rax, 94489280520         mov     QWORD PTR [rsp+8], rax         movabs  rax, 115964117017         mov     QWORD PTR [rsp+16], rax         movabs  rax, 124554051610         mov     QWORD PTR [rsp+24], rax         mov     rax, rsp         jmp     .L2 .L3:         mov     edx, DWORD PTR [rax]         mov     rcx, rdx         cmp     edx, 63         ja      .L7 .L2:         mov     rdx, r8         add     rax, 4         sal     rdx, cl         lea     rcx, [rsp+32]         or      rsi, rdx         cmp     rax, rcx         jne     .L3         and     QWORD PTR [rdi], rsi         add     rsp, 40         ret .L7:         mov     ecx, 64         mov     esi, OFFSET FLAT:.LC0         mov     edi, OFFSET FLAT:.LC1         xor     eax, eax         call    std::__throw_out_of_range_fmt(char const*, ...) 

My question is: how should I write this code so that both compilers can do the right thing? Failing that, how should I write this so that it remains clear, fast, and maintainable?

 


template< unsigned char... indexes > constexpr unsigned long long mask(){   return ((1ull<<indexes)|...|0ull); } 

Then

void apply_known_mask(std::bitset<64> &bits) {   constexpr auto m = mask<B,D,E,H,K,M,L,O>();   bits &= m; } 

or in :

template< unsigned char... indexes > constexpr unsigned long long mask(){   auto r = 0ull;   using discard = int[];   (void)discard{0,((     r |= (1ull << indexes)   ),0)...};   return r; } 

or in :

constexpr unsigned long long mask(){   return 0; } template<class...Tail> constexpr unsigned long long mask(unsigned char b0, Tail...tail){   return (1ull<<b0) | mask(tail...); } template< unsigned char... indexes > constexpr unsigned long long mask(){   return mask(indexes...); } 

Godbolt with all 3 -- you can switch CPP_VERSION define, and get identical assembly.

In practice I'd use the most modern I could. 14 beats 11 because we don't have recursion and hence O(n^2) symbol length (which can explode compile time and compiler memory usage); 17 beats 14 because the compiler doesn't have to dead-code-eliminate that array, and that array trick is just ugly.

Of these 14 is the most confusing. Here we create an anonymous array of all 0s, meanwhile as a side effect construct our result, then discard the array. It has a number of 0s in it equal to the size of our pack, plus 1 (which we add so we can handle empty packs).

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: