Why does g++ pull computations into a hot loop?

  • A+

I have a very weird compiler behavior where G++ pulls computations into a hot loop, severly reducing the performance of the resulting code. What is going on here?

Consider this function:

#include <cstdint>  constexpr bool noLambda = true;  void funnyEval(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int32_t iter, int32_t limit, int32_t* writer,const int32_t* dictPtr2){    // Computation X1    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);    // Computation X2    const uint16_t* data = (const uint16_t*)(columnData + dataOffset);    // 1. The less broken solution without lambda    if (noLambda) {         for (;iter != limit;++iter){             int32_t t=dictPtr[data[iter]];             *writer = t;             writer++;         }    }    // 2. The totally broken solution with lambda    else {         auto loop = [=](auto body) mutable { for (;iter != limit;++iter){ body(iter); } };         loop([=](unsigned index) mutable {             int32_t t=dictPtr[data[index]];             *writer = t;             writer++;         });    } } 

The problem here is that G++ somehow loves to pull computations X1 and X2 into the hot main loop, reducing the performance. Here are the details:

The function simply iterates over an array data, looks up a value in a dictionary dictPtr and writes it to a target memory location writer. data and dictPtr are computed at the beginning of the function. It has two flavors for doing so: one with a lambda, one without.

(Note that this function is just a minimal working example of much more complicated code. So please refrain from commenting that the lambda is unnecessary here. I am aware of this fact and in the original code it is necessary, unfortunately.)

The problem when compiling with the latest g++ (tried 8.1 and 7.2, same problem with older g++s as you can see in the godbolt links provided) with high optimization level (-O3 -std=c++14) is the following:

Solution 2. (noLambda=false) generates very bad code for the loop, even worse than the "naive" solution, because it assumes that it is a good idea to pull Computations X1 and X2, which are outside of the super hot main loop, into the super hot main loop, making it around 25% slower on my CPU.


.L3:         movl    %ecx, %eax          # unnecessary extra work         addl    $1, %ecx         addq    $4, %r9             # separate loop counter (pointer increment)         leaq    (%rdi,%rax,2), %rax # array indexing with an LEA         movzwl  (%rax,%rsi), %eax   # rax+rsi is Computation X2, pulled into the loop!         leaq    (%rdi,%rax,4), %rax         movl    (%rax,%rdx), %eax   # rax+rdx is Computation X1, pulled into the loop!         movl    %eax, -4(%r9)         cmpl    %ecx, %r8d         jne     .L3 

When using a usual for loop (noLambda=true), then the code is better as X2 is no longer pulled into the loop, but X1 still is!:


.L3:         movzwl  (%rsi,%rax,2), %ecx         leaq    (%rdi,%rcx,4), %rcx         movl    (%rcx,%rdx), %ecx # This is Computation X1, pulled into the loop!         movl    %ecx, (%r9,%rax,4)         addq    $1, %rax         cmpq    %rax, %r8         jne     .L3 

You can try out that this is really X1 in the loop by replacing dictPtr (the computation X1) in the loop with dictPtr2 (a parameter), the instruction will vanish:


.L3:         movzwl  (%rdi,%rax,2), %ecx         movl    (%r10,%rcx,4), %ecx         movl    %ecx, (%r9,%rax,4)         addq    $1, %rax         cmpq    %rax, %rdx         jne     .L3 

This is finally the loop as I want to have it. A simple loop that loads the values and stores the result without pulling random computations into it.

So, what is going on here? It is seldom a good idea to pull a computation into a hot loop, but G++ seems to think so here. This is costing me real performance. The lambda exacerbates the whole situation; it leads G++ to pull even more computations into the loop.

What makes this problem so severe is that this is really trivial C++ code without fancy features. If I cannot rely on my compiler producing perfect code for such a trivial example, I will need to check the assembly of all hot loops in my code to make sure everything is as fast as it could be. This also means that there are probably a huge number of programs affected by this.


You are using an unsigned 32-bits type for the array index (on line 21). This forces the compiler to consider at each step through the loop whether you might have overflowed its available range, in which case it needs to go back to the beginning of the array. The extra code you see is related to this check! There are at least three ways to avoid this over-cautious approach by the compiler:

  1. Use a 64-bit type for index on line 21. Now the compiler knows you will never wrap around the array, and generate the same code as without the lambda.
  2. Use a signed 32-bit type for index on line 21. Now the compiler no longer cares about overflow: signed overflow is considered UB, and therefore ignored. I consider this to be a bug in the interpretation of the standard, but opinions differ on this.
  3. Make it clear to the compiler that an overflow will never occur, by adding a line 'int32_t iter = 0;' at the beginning of the function, and removing iter from the declaration. Clearly this does not solve your problem, but it illustrates how it is the overflow analysis that causes the extra code to be generated.

You aren't complaining about the code before the loop starts, but here you have the same problem. Just make iter and limit int64_t, and you'll see it gets considerably shorter as the compiler no longer considers the possibility of array overflow.

So to recap: it is not the calculation of X1 and X2 that gets moved into the loop that causes the size to balloon, but the use of an incorrectly-typed array index variable.


:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: