I was playing with the code in this answer, slightly modifying it:
```
BITS 64
GLOBAL _start
SECTION .text

_start:
 mov ecx, 1000000

.loop:
 ;T is a symbol defined with the CLI (-DT=...)
 TIMES T imul eax, eax
 lfence
 TIMES T imul edx, edx
 dec ecx
 jnz .loop

 mov eax, 60        ;sys_exit
 xor edi, edi
 syscall
```
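For reference, each binary can be built with something along these lines (the source file name here is arbitrary; the `-DT=` define and the `./main-$T` naming match what I use below):

```
for T in $(seq 10 5 100); do
    nasm -f elf64 -DT=$T main.asm -o main-$T.o   # T is picked up by the TIMES directives
    ld main-$T.o -o main-$T
done
```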
Without the `lfence`, the results I get are consistent with the static analysis in that answer.
When I introduce a single `lfence` I'd expect the CPU to execute the `imul edx, edx` sequence of the k-th iteration in parallel with the `imul eax, eax` sequence of the next (k+1-th) iteration.
Something like this (calling A the `imul eax, eax` sequence and D the `imul edx, edx` one):

```
|
| A
| D A
| D A
| D A
| ...
| D A
| D
|
V time
```
This would take more or less the same number of cycles, except for one unpaired parallel execution.
When I measure the number of cycles, for the original and modified version, with `taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T` for `T` in the range below, I get:

```
T     Cycles:u     Cycles:u     Delta
      lfence       no lfence
10     42047564     30039060    12008504
15     58561018     45058832    13502186
20     75096403     60078056    15018347
25     91397069     75116661    16280408
30    108032041     90103844    17928197
35    124663013    105155678    19507335
40    140145764    120146110    19999654
45    156721111    135158434    21562677
50    172001996    150181473    21820523
55    191229173    165196260    26032913
60    221881438    180170249    41711189
65    250983063    195306576    55676487
70    281102683    210255704    70846979
75    312319626    225314892    87004734
80    339836648    240320162    99516486
85    372344426    255358484   116985942
90    401630332    270320076   131310256
95    431465386    285955731   145509655
100   460786274    305050719   155735555
```
How can the values of Cycles:u for the lfence version be explained?
I would have expected them to be similar to those of Cycles:u for the no-lfence version, since a single `lfence` should prevent only the first iteration from being executed in parallel for the two blocks.
I don't think it's due to the `lfence` overhead, as I believe that should be constant for all Ts.
I'd like to fix what's wrong with my forma mentis when dealing with the static analysis of code.
I think you're measuring accurately, and the explanation is microarchitectural, not any kind of measurement error.
I think your results for mid-to-low T support the conclusion that `lfence` stops the front-end from even issuing past the `lfence` until all earlier instructions retire, rather than having all the uops from both chains already issued and just waiting for `lfence` to flip a switch and let multiplies from each chain start to dispatch on alternating cycles.
(port1 would get edx,eax,empty,edx,eax,empty,... for Skylake's 3c latency / 1c throughput multiplier right away, if `lfence` didn't block the front-end, and overhead wouldn't scale with T.)
You're losing `imul` throughput when only uops from the first chain are in the scheduler, because the front-end hasn't chewed through the `imul edx,edx` and loop branch yet. And for the same number of cycles at the end of the window, when the pipeline is mostly drained and only uops from the 2nd chain are left.
The overhead delta looks linear up to about T=60. I didn't run the numbers, but the slope up to there looks reasonable for `T * 0.25` clocks to issue the first chain vs. the 3c-latency execution bottleneck, i.e. the delta growing maybe 1/12th as fast as the total no-lfence cycles.
So (given the `lfence` overhead I measured below), with T<60:

```
no_lfence cycles/iter ~= 3T                 # OoO exec finds all the parallelism
lfence    cycles/iter ~= 3T + T/4 + 9.3     # lfence constant + front-end delay
delta                 ~= T/4 + 9.3
```
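As a quick sanity check of that model against the table in the question (my back-of-the-envelope arithmetic, using the 1M-iteration loop count), for T=40:

```
no_lfence: 3*40              = 120.0 c/iter -> 120.0M cycles  (measured 120146110)
lfence:    3*40 + 40/4 + 9.3 = 139.3 c/iter -> 139.3M cycles  (measured 140145764)
delta:     40/4 + 9.3        =  19.3 c/iter ->  19.3M cycles  (measured  19999654)
```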
@Margaret reports that T/4 is a better fit than 2*T/4, but I would have expected T/4 at both the start and end, for a total of 2T/4 slope for the delta.
After about T=60, delta grows much more quickly (but still linearly), with a slope about equal to the total no-lfence cycles, thus about 3c per T. I think at that point, the scheduler (Reservation Station) size is limiting the out-of-order window. You probably tested on a Haswell or Sandybridge/IvyBridge, which have a 60-entry or 54-entry scheduler respectively; Skylake's is 97-entry.
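Putting rough numbers on that slope from the question's table (again my arithmetic, over the 1M iterations):

```
delta slope for T in [60, 100]:
(155735555 - 41711189) / (100 - 60) ~= 2.85M cycles per unit of T
                                     = ~2.85 c/iter per extra imul in each chain   # close to the 3c imul latency
```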
The RS tracks un-executed uops. Each RS entry holds 1 unfused-domain uop that's waiting for its inputs to be ready (and for its execution port to be available) before it can dispatch and leave the RS (see footnote 1).
After an `lfence`, the front-end issues at 4 per clock while the back-end executes at 1 per 3 clocks, issuing 60 uops in ~15 cycles, during which time only 5 `imul` instructions from the `edx` chain have executed. (There's no load or store micro-fusion here, so every fused-domain uop from the front-end is still only 1 unfused-domain uop in the RS. Also, @Hadi suggests that 1 RS entry can track both halves of a micro-fused uop. Again, not relevant for this question.)
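Spelled out, that issue-vs.-execute arithmetic is:

```
60 uops   / (4 issued per clock)   ~= 15 cycles to issue them after the lfence
15 cycles / (3 cycles per imul)     =  5 imuls of the edx chain executed in that time
```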
For large T the RS quickly fills up, at which point the front-end can only make progress at the speed of the back-end. (For small T, we hit the next iteration's `lfence` before that happens, and that's what stalls the front-end.) When T > RS_size, the back-end can't see any of the uops from the `eax` imul chain until enough back-end progress through the `edx` chain has made room in the RS. At that point, one `imul` from each chain can dispatch every 3 cycles, instead of just the 1st or 2nd chain.
Remember from the first section that the time spent just after `lfence` only executing the first chain = the time spent just before `lfence` executing only the second chain. That applies here, too.
We get some of this effect even with no `lfence`, for T > RS_size, but there's opportunity for overlap on both sides of a long chain. The ROB is at least twice the size of the RS, so the out-of-order window when not stalled by `lfence` should be able to keep both chains in flight constantly even when T is somewhat larger than the scheduler capacity. (Remember that uops leave the RS as soon as they've executed. I'm not sure if that means they have to finish executing and forward their result, or merely start executing, but that's a minor difference here for short ALU instructions. Once they're done, only the ROB is holding onto them until they retire, in program order.)
The ROB and register-file shouldn't be limiting the out-of-order window size (http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/) in this hypothetical situation, or in your real situation. They should both be plenty big.
Blocking the front-end is an implementation detail of `lfence` on Intel's uarches. The manual only says that later instructions can't execute. That wording would allow the front-end to issue/rename them all into the scheduler (Reservation Station) and ROB while `lfence` is still waiting, as long as none are dispatched to an execution unit.
So a weaker `lfence` would maybe have flat overhead up to T=RS_size, then the same slope as you see now for T>60. (And the constant part of the overhead might be lower.)
Note that guarantees about speculative execution of conditional/indirect branches after `lfence` apply to execution, not (as far as I know) to code-fetch. Merely triggering code-fetch is not (AFAIK) useful to a Spectre or Meltdown attack. Possibly a timing side-channel to detect how it decodes could tell you something about the fetched code...
I think AMD's LFENCE is at least as strong on actual AMD CPUs, when the relevant MSR is enabled. (Is LFENCE serializing on AMD processors?).
Your results are interesting, but it doesn't surprise me at all that there's significant constant overhead from `lfence` itself (for small T), as well as the component that scales with T.
`lfence` doesn't allow later instructions to start until earlier instructions have retired. This is probably at least a couple cycles / pipeline-stages later than when their results are ready for bypass-forwarding to other execution units (i.e. the normal latency).
So for small T, it's definitely significant that you add extra latency into the chain by requiring the result to not only be ready, but also written back to the register file.
It probably takes an extra cycle or so for `lfence` to allow the issue/rename stage to start operating again after it detects retirement of the last instruction before it. The issue/rename process takes multiple stages (cycles), and maybe `lfence` blocks at the start of this, instead of in the very last step before uops are added into the OoO part of the core.
`lfence` itself has 4-cycle throughput on SnB-family, according to Agner Fog's testing. Agner Fog reports 2 fused-domain uops (no unfused), but on Skylake I measure it at 6 fused-domain uops (still no unfused) if I only have 1 `lfence`. But with more `lfence`s back-to-back, it's fewer uops! Down to ~2 uops per `lfence` with many back-to-back, which is how Agner measures.
An `lfence` + `dec`/`jnz` loop (a tight loop with no work) runs at 1 iteration per ~10 cycles on SKL, so that might give us an idea of the real extra latency that `lfence` adds to the dep chains even without the front-end and RS-full bottlenecks.
Measuring `lfence` overhead with only one dep chain, OoO exec being irrelevant:
```
.loop:
    ;mfence                  ; mfence here:  ~62.3c (with no lfence)
    lfence                   ; lfence here:  ~39.3c
    times 10 imul eax,eax    ; with no lfence: 30.0c
    ;lfence                  ; lfence here:  ~39.6c
    dec ecx
    jnz .loop
```
With no `lfence`, it runs at the expected 30.0c per iter. With `lfence`, it runs at ~39.3c per iter, so `lfence` effectively added ~9.3c of "extra latency" to the critical path dep chain. (And 6 extra fused-domain uops.)
With `lfence` after the imul chain, right before the loop-branch, it's slightly slower. But not a whole cycle slower, so that would indicate that the front-end is issuing the loop-branch + an imul in a single issue-group after `lfence` allows execution to resume. That being the case, IDK why it's slower. It's not from branch misses.
Getting the behaviour you were expecting:
Interleaving the chains in program order, as @BeeOnRope suggests in comments, doesn't require out-of-order execution to exploit the ILP, so it's pretty trivial:
```
.loop:
    lfence      ; at the top of the loop is the lowest-overhead place.
%rep T
    imul eax,eax
    imul edx,edx
%endrep
    dec ecx
    jnz .loop
```
You could put pairs of short `times 8 imul` chains inside a `%rep` to let OoO exec have an easy time.
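A minimal sketch of that idea (my illustration, not code from the question; it assumes T is a multiple of 8):

```
.loop:
    lfence
%rep T/8
    times 8 imul eax,eax    ; short chunk of the eax chain
    times 8 imul edx,edx    ; short chunk of the edx chain
%endrep
    dec ecx
    jnz .loop
```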
Footnote 1: How the front-end / RS / ROB interact
My mental model is that the issue/rename/allocate stages in the front-end add new uops to both the RS and the ROB at the same time.
Uops leave the RS after executing, but stay in the ROB until in-order retirement. The ROB can be large because it's never scanned out-of-order to find the first-ready uop, only scanned in-order to check if the oldest uop(s) have finished executing and thus are ready to retire.
(I assume the ROB is physically a circular buffer with start/end indices, not a queue which actually copies uops to the right every cycle. But just think of it as a queue / list with a fixed max size, where the front-end adds uops at the front, and the retirement logic retires/commits uops from the end as long as they're fully executed, up to some per-cycle retirement limit which is not usually a bottleneck, although Skylake did increase it to 8 per clock for better Hyperthreading, I think.)
Uops like `xor eax,eax` or `lfence`, which are handled in the front-end (they don't need any execution units on any ports), are added only to the ROB, in an already-executed state. (A ROB entry presumably has a bit that marks it as ready to retire vs. still waiting for execution to complete. This is the state I'm talking about. For uops that did need an execution port, I assume the ROB bit is set via a completion port from the execution unit.)
Uops stay in the ROB from issue to retirement.
Uops stay in the RS from issue to dispatch.