Copying a bool from a parameter to a global – comparing compilers output

  • A+
Category:Languages

Fully knowing that these completely artificial benchmarks don't mean much, I am nonetheless a bit surprised by the several ways the "big 4" compilers chose to compile a trivial snippet.

struct In {     bool in1;     bool in2; };  void foo(In &in) {     extern bool out1;     extern bool out2;     out1 = (in.in1 == true);     out2 = in.in2; } 

Notice: all compilers are set in x64 mode with highest "general purpose" (= no specific processor architecture specified) optimization setting; you can see the results by yourself/play with them over at https://gcc.godbolt.org/z/K_i8h9)


Clang 6 with -O3 seems to produce the most straightforward output:

foo(In&):                             # @foo(In&)         mov     al, byte ptr [rdi]         mov     byte ptr [rip + out1], al         mov     al, byte ptr [rdi + 1]         mov     byte ptr [rip + out2], al         ret 

in a standard-conformant C++ program, the == true is redundant, so both become straight copies from one memory location to the other, passing through al as there's no memory to memory mov.

However, as there's no register pressure here, I'd have expected it to use two different registers (to completely avoid false dependency chains), possibly starting all the reads first, and doing all the writes after, to help instruction-level parallelism; is this kind of optimization completely obsolete with recent CPUs due to register renaming and aggressively out-of-order CPUs? (more on this later)


GCC 8.2 with -O3 does almost the same thing, but with a twist:

foo(In&):         movzx   eax, BYTE PTR [rdi]         mov     BYTE PTR out1[rip], al         movzx   eax, BYTE PTR [rdi+1]         mov     BYTE PTR out2[rip], al         ret 

Instead of a plain mov to the "small" register, it does a movzx to full eax. Why? Is this to reset completely the state of eax and sub-registers in the register renamer to avoid partial register stalls?


MSVC 19 with /O2 adds one more quirk:

in$ = 8 void foo(In & __ptr64) PROC                ; foo, COMDAT         cmp     BYTE PTR [rcx], 1         sete    BYTE PTR bool out1         ; out1         movzx   eax, BYTE PTR [rcx+1]         mov     BYTE PTR bool out2, al     ; out2         ret     0 void foo(In & __ptr64) ENDP                ; foo 

besides the different calling convention, here the second assignment is pretty much the same.

However, the comparison in the first assignment is actually performed (interestingly, using both a cmp and a sete with memory operands, so you could say that the intermediate register is FLAGS).

  • Is this VC++ explicitly playing it safe, or is due to some known inherent limitation - e.g. bool being treated as a plain byte by the codegen?
  • As it's not a "real" branch (the code path is not altered by the result of the cmp) I'd expect this not to cost that much, especially compared to accessing memory. How costly is this missed optimization?

Finally, ICC 18 with -O3 is the strangest of all:

foo(In&):         xor       eax, eax                                      #9.5         cmp       BYTE PTR [rdi], 1                             #9.5         mov       dl, BYTE PTR [1+rdi]                          #10.12         sete      al                                            #9.5         mov       BYTE PTR out1[rip], al                        #9.5         mov       BYTE PTR out2[rip], dl                        #10.5         ret                                                     #11.1 
  • the first assignment does the comparison, exactly as in VC++ code, but the sete goes through al instead of straight to memory; is there any reason to prefer this?
  • all reads "started" before doing anything with the results - so this kind of interleaving still actually matters?
  • why is eax zeroed out at the start of the function? Partial register stalls again? But then dl doesn't get this treatment...

Just for fun, I tried removing the == true, and ICC now does

foo(In&):         mov       al, BYTE PTR [rdi]                            #9.13         mov       dl, BYTE PTR [1+rdi]                          #10.12         mov       BYTE PTR out1[rip], al                        #9.5         mov       BYTE PTR out2[rip], dl                        #10.5         ret                                                     #11.1 

so, no zeroing out of eax, but still using two registers and "start reading in parallel first, use all the results later".

  • What's so special about sete that makes ICC think it's worth zeroing out eax before?
  • Is ICC right after all to reorder reads/writes like this, or the apparently more sloppy approach of the other compilers currently performs the same?

 


TL:DR: gcc's version is the most robust across all x86 uarches, avoiding false dependencies or extra uops. None of them are optimal; loading both bytes with one load should be even better.

The 2 key points here are:

  • The mainstream compilers only care about out-of-order x86 uarches for their default tuning for instruction selection and scheduling. All x86 uarches that are currently sold do out-of-order execution with register renaming (for full registers like RAX at least).

    No in-order uarches are still relevant for tune=generic. (Older Xeon Phi, Knight's Corner, used modified Pentium P54C-based in-order cores, and in-order Atom system might still be around, but that's obsolete now, too. In that case it would be important to do the stores after both loads, to allow memory-parallelism in the loads.)

  • 8 and 16-bit Partial registers are problematic, and can lead to false dependencies. Why doesn't GCC use partial registers? explains the different behaviours for a variety of x86 uarches.


  1. partial-register renaming to avoid false dependencies:

Intel before IvyBridge renames AL separately from RAX (P6 family and SnB itself, but not later SnB-family). On all other uarches (including Haswell/Skylake, all AMD, and Silvermont / KNL), writing AL merges into RAX. For more about modern Intel (HSW and later) vs. P6-family and first-gen Sandybridge, see this Q&A: How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent.

On Haswell/Skylake, mov al, [rdi] decodes to a micro-fused ALU + load uop that merges the load result into RAX. (This is nice for bitfield merging, instead of having extra cost for the front-end to insert a later merging uop when reading the full register).

It performs identically to how add al, [rdi] or add rax, [rdi]. (It's only an 8-bit load, but it has a dependency on the full width of the old value in RAX. Write-only instructions to low-8/low-16 regs like al or ax are not write-only as far as the microarchitecture is concerned.)

On P6 family (PPro to Nehalem) and Sandybridge (first generation of Sandybridge-family), clang's code is perfectly fine. Register-renaming makes the load/store pairs totally independent from each other, as if they'd used different architectural registers.

On all other uarches, Clang's code is potentially dangerous. If RAX was the target of some earlier cache-miss load in the caller, or some other long dependency chain, this asm would make the stores dependent on that other dep-chain, coupling them together and removing the opportunity for the CPU to find ILP.

The loads are still independent, because the loads are separate from the merging and can happen as soon as the load address rdi is known in the out-of-order core. The store-address is also known, so the store-address uops can execute (so later loads/stores can check for overlap), but the store-data uops are stuck waiting for the merge uops. (Stores on Intel are always 2 separate uops, but they can micro-fuse together in the front-end.)

Clang doesn't seem to understand partial registers very well, and creates false deps and partial-reg penalties for no reason sometimes, even when it doesn't save any code-size by using a narrow or al,dl instead of or eax,edx, for example.

In this case it saves a byte of code size per load (movzx has a 2-byte opcode).

  1. Why does gcc use movzx eax, byte ptr [mem]?

Writing EAX zero-extends to the full RAX, so it's always write-only with no false dependency on the old value of RAX on any CPU. Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?.

movzx eax, m8/m16 is handled purely in the load ports, not as a load + ALU-zero-extend, on Intel, and on AMD since Zen. The only extra cost is 1 byte of code-size. (AMD before Zen has 1 cycle of extra latency for movzx loads, and apparently they have to run on an ALU as well as a load port. Doing sign/zero-extension or broadcast as part of a load with no extra latency is the modern way, though.)

gcc is pretty fanatical about breaking false dependencies, e.g. pxor xmm0,xmm0 before cvtsi2ss/sd xmm0, eax, because Intel's poorly-designed instruction set merges into the low qword of the destination XMM register. (Short-sighted design for PIII which stores 128-bit registers as 2 64-bit halves, so int->FP conversion instructions would have taken an extra uop on PIII to also zero the high half if Intel had designed it with future CPUs in mind.)

The problem isn't usually within a single function, it's when these false dependencies end up creating a loop-carried dependency chain across call/ret in different functions that you can unexpectedly get a big slowdown.

For example, store-data throughput is only 1 per clock (on all current x86 uarches), so 2 loads + 2 stores already takes at least 2 clocks.

If the struct is split across a cache line boundary, though, and the first load misses but the 2nd hits, avoiding a false dep would let the 2nd store write data to the store buffer before the first cache miss is finished. This would let loads on this core read from out2 via store-forwarding. (x86's strong memory ordering rules prevent the later store from becoming globally visible by committing to the store buffer ahead of the store to out1, but store-forwarding within a core/thread still works.)


  1. cmp/setcc: MSVC / ICC are just being dumb

The one advantage here is that putting the value into ZF avoids any partial-register shenanigans, but movzx is a better way to avoid it.

I'm pretty sure MS's x64 ABI agrees with the x86-64 System V ABI that a bool in memory is guaranteed to be 0 or 1, not 0 / non-zero.

In the C++ abstract machine, x == true has to be the same as x for a bool x, so (unless an implementation used different object-representation rules in structs vs. extern bool), it can always just copy the object representation (i.e. the byte).

If an implementation was going to use a one-byte 0 / non-0 (instead of 0 / 1) object representation for bool, it would need to cmp byte ptr [rcx], 0 to implement the booleanization in (int)(x == true), but here you're assigning to another bool so it could just copy. And we know it's not booleanizing 0 / non-zero because it compared against 1. I don't think it's intentionally being defensive against invalid bool values, otherwise why wouldn't it do that for out2 = in.in2?

This just looks like a missed-optimization. Compilers are not generally awesome at bool in general. Boolean values as 8 bit in compilers. Are operations on them inefficient?. Some are better than others.

MSVC's setcc directly to memory is not bad, but cmp + setcc is 2 extra unnecessary ALU uops that didn't need to happen. Apparently on Ryzen, setcc m8 is 1 uop but one per 2 clocks throughput. So that's weird. Maybe even a typo by Agner? (https://agner.org/optimize/). On Steamroller, it's 1 uop / 1 per clock.

On Intel, setcc m8 is 2 fused-domain uops and 1 per clock throughput, like you'd expect.

  1. ICC's xor-zeroing before setz

I'm not sure if there's an implicit conversion to int anywhere in here in ISO C++'s abstract machine, or if == is defined for bool operands.

But anyway, if you are going to setcc into a register, it's not a bad idea to xor-zero it first for the same reason movzx eax,mem is better than mov al,mem. Even if you don't need the result zero-extended to 32-bit.

That's probably ICC's canned sequence for creating a boolean integer from a compare result.

It makes little sense to use xor-zero / cmp / setcc for the compare, but mov al, [m8] for the non-compare. The xor-zero is the direct equivalent of using a movzx load to break the false dependency here.

ICC is great at auto-vectorizing (e.g. it can auto-vectorize a search-loop like while(*ptr++ != 0){} while gcc/clang can only auto-vec loops with a trip count that's known ahead of the first iteration). But ICC is not great at little micro-optimizations like this; it often has asm output that looks more like the source (to its detriment) than gcc or clang.

  1. all reads "started" before doing anything with the results - so this kind of interleaving still actually matters?

It's not a bad thing. Memory disambiguation usually allows loads after stores to run early anyway. Modern x86 CPUs even dynamically predict when a load won't overlap with earlier unknown-address stores.

If the load and store address are exactly 4k apart, they alias on Intel CPUs, and the load is falsely detected as dependent on the store.

Moving loads ahead of stores definitely makes things easier for the CPU; do this when possible.

Also, the front-end issues uops in-order into the out-of-order part of the core, so putting the loads first can let the 2nd one start maybe a cycle earlier. There's no benefit to having the first store done right away; it will have to wait for the load result before it can execute.

Reusing the same register does reduce register pressure. GCC likes to avoid register pressure all the time, even when there isn't any. i.e. this particular function stand-alone has lots of call-clobbered registers it could use, but when inlined it might not. GCC often used to use setcc al / movzx eax,al, but recent changes have let it see xor eax,eax / set-flags / setcc al to take the zero-extension off the critical path when there's a free register that can be zeroed ahead of whatever sets flags. (xor-zeroing also writes flags).


passing through al as there's no memory to memory mov.

None worth using for single-byte copies, anyway. One possible (but sub-optimal) implementation is:

foo(In &):     mov   rsi, rdi     lea   rdi, [rip+out1]     movsb               # read in1     lea   rdi, [rip+out2]     movsb               # read in2 

An implementation that's probably better than any the compilers spotted is:

foo(In &):     movzx  eax, word ptr [rdi]      # AH:AL = in2:in1     mov    [rip+out1], al     mov    [rip+out2], ah     ret 

Reading AH may have an extra cycle of latency, but this is great for throughput and code-size. If you care about latency, avoid the store/reload in the first place and use registers. (By inlining this function).

The only microarchitectureal danger with this is a cache-line split on the load (if in.in2 is the first byte of a new cache lien). That could take an extra 10 cycles. Or on pre-Skylake, if it's also split across a 4k boundary the penalty can be 100 cycles extra latency. But other than that, x86 has efficient unaligned loads, and it's normally a win to combine narrow loads / stores to save uops. (gcc7 and later typically do this when initializing multiple struct members even in cases where it can't know that it won't cross a cache-line boundary.)

The compiler should be able to prove that In &in can't alias extern bool out1, out2, because they have static storage and different types.

If you just had 2 pointers to bool, you wouldn't know (without bool *__restrict out1) that they don't point to members of the In object. But static bool out2 can't alias members of a static In object. Then it wouldn't be safe to read in2 before writing out1, unless you checked for overlap first.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: