Performing math operations directly in memory instead of CPU registers

  • A+

I wrote this simple assembly code, ran it and looked at the memory location using GDB:

    .text  .global _main  _main:     pushq   %rbp     movl    $5, -4(%rbp)     addl    $6, -4(%rbp)     popq    %rbp     ret 

It's adding 5 to 6 directly in memory and according to GDB it worked. Now writing the same thing in C and compiling it to assembly turns out like this:

...     xorl    %eax, %eax     movl    $0, -4(%rbp)     movl    $5, -8(%rbp)     movl    -8(%rbp), %ecx     addl    $6, %ecx     movl    %ecx, -8(%rbp) .... 

It's moving them to a register before adding them together.

So why don't we add directly in memory? is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?

Edit: Here is the C code for the second assembly block, I have disabled optimization when compiling.

#include <iostream>  int main(){  int a = 5;  a+=6;   return 0; } 


You disabled optimization, and you're surprised the asm looks inefficient? Well don't be. You've asked the compiler to compile quickly: short compile times instead of short run-times for the generated binary. And with debug-mode consistency.

Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed-optimization, though. movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.

(Update: just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc like I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax.)

Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.

You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.

A memory-destination add doesn't perform the calculation "in memory", the CPU interally has to load/add/store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to the DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory destination ADD, with/without a lock prefix to make it appear atomic.

There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (Requiring more computational intensity (amount of ALU work per load/store) for the CPU to not stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking for. Fast caches do mitigate the problem most of the time.)

But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.

Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state. So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.

We know these exist for use by micro-code / multi-uop instructions: calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation-table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16 or 32-byte loads/stores if possible) but also for multi-uop memory+ALU instructions.

is it slower? if so then why is adding directly is memory even allowed, why didn't the assembler complain about my assembly code in the beginning?

In this case add immediate to memory is the optimal choice, if we pretend that the value was already in memory. (Instead of just being stored from another immediate constant.)

Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about 486 I think, maybe 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)

x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It's evolved very far from the hardware the ISA was designed for. But in general it's not true on any most ISAs. e.g. some implementations of PowerPC (notably the Cell processor in PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA so not supporting the instruction at all would be very painful, and not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.

You could maybe write an assembler that refused to use, or warned about, known-slow instruction like enter or loop, but sometimes you're optimizing for size, not speed, and then slow but small instructions like loop are useful. (, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code using lots of small but slow instructions like 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to 4-byte test ecx,ecx/jnz). Optimizing for code-size is useful in real-life for boot-sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only tiny amounts of executables. Or for code that executes only once during startup, smaller file size is better. Or executes rarely during the lifetime of a program, smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded. Especially if the difference there is small compared to the code-size saving.

Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), allowing you to create whatever byte sequence you want for whatever purpose you think might be useful.

Avoiding slowdowns requires looking at more than 1 instruction at once

Most of the ways you can make your code slow involve instructions that aren't obviously bad, just the overall combination is slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time.

e.g. this code will cause a partial-register stall on Intel P6-family CPUs:

mov  ah, 1 add  eax, 123 

Either of these instructions on their own could potentially be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. Although writing AH at all is pretty questionable; normally a bad idea. Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap. Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help to show you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code whether it's a loop body or not will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.

But really you have to understand a bit more about the pipeline you're optimizing for to understand that the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is). Related:

GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (For consistent debugging, including modifying C variables while stopped at any breakpoint).

But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things in registers.

GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax/ret, because a is unused.)

main:     pushq   %rbp     movq    %rsp, %rbp     movl    $5, -4(%rbp)     addl    $6, -4(%rbp)     movl    $0, %eax     popq    %rbp     ret 

How to see clang/LLVM using memory-destination add

I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Because modern x86 CPUs decode memory-destination add efficiently, to at most as many internal uops as separate load/add/store instructions, and sometimes fewer with micro-fusion of the load+add part.)

void add_reg_to_mem(int *p, int b) {     *p += b; }   # I used AT&T syntax because that's what you were using.  Intel-syntax is nicer IMO     addl    %esi, (%rdi)     ret  void add_imm_to_mem(int *p) {     *p += 3; }    # gcc and clang -O3 both emit the same asm here, where there's only one good choice     addl    $3, (%rdi)     ret 

The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away. -O0 for that would probably be a lot less terrible.

    # gcc8.2 -O0 output     ... after making a stack frame and spilling `p` from RDI to -8(%rbp)     movq    -8(%rbp), %rax        # load p     movl    (%rax), %eax          # load *p, clobbering p     leal    3(%rax), %edx         # edx = *p + 3     movq    -8(%rbp), %rax        # reload p     movl    %edx, (%rax)          # store *p + 3 

GCC is literally not even trying to not suck, just to compile quickly, and respect the constraint of keeping everything in memory between statements.

The clang -O0 output happens to be less horrible for this:

 # clang -O0    ... after making a stack frame and spilling `p` from RDI to -8(%rbp)     movq    -8(%rbp), %rdi    # reload p     movl    (%rdi), %eax      # eax = *p     addl    $3, %eax          # eax += 3     movl    %eax, (%rdi)      # *p = eax 

See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.

If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid memory-dst add:

The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue-superscalar pipeline. So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.

# gcc8.2 -O3 -m32 -mtune=pentium add_imm_to_mem(int*):     movl    4(%esp), %eax    # load p from the stack, because of the 32-bit calling convention      movl    (%eax), %edx     # *p += 3 implemented as 3 separate instructions     addl    $3, %edx     movl    %edx, (%eax)     ret 

You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.

Not sure it's actually much of a win here, because they're back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, though, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait).

In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. This is not what we're talking about. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.

GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.

The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 that decode memory-destination instructions to multiple uops, and can execute them out-of-order.


:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: