Avoiding the overhead of C# virtual calls


I have a few heavily optimized math functions that take 1-2 nanoseconds to complete. These functions are called hundreds of millions of times per second, so call overhead is a concern, despite the already-excellent performance.

To keep the program maintainable, the classes that provide these methods implement an IMathFunction interface, so that other objects can store a specific math function and use it when needed.

    public interface IMathFunction
    {
        double Calculate(double input);
        double Derivate(double input);
    }

    public class SomeObject
    {
        // Note: There are cases where this is mutable
        private readonly IMathFunction mathFunction_;

        // The field is set once here; callers pass in the function to use.
        public SomeObject(IMathFunction mathFunction)
        {
            mathFunction_ = mathFunction;
        }

        public double SomeWork(double input, double step)
        {
            var f = mathFunction_.Calculate(input);
            var dv = mathFunction_.Derivate(input);
            return f - (dv * step);
        }
    }

This interface causes an enormous overhead compared to a direct call because of how the consuming code uses it. A direct call takes 1-2ns, whereas the virtual interface call takes 8-9ns. Evidently, dispatching each call through the interface is the bottleneck in this scenario.

I would like to retain both maintainability and performance if possible. Is there a way I can resolve the virtual function to a direct call when the object is instantiated so that all subsequent calls are able to avoid the overhead? I assume this would involve creating delegates with IL, but I wouldn't know where to start with that.

 


This has obvious limitations and should not be applied everywhere you have an interface, but in a place where performance really needs to be maximized you can use generics:

    public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction
    {
        private readonly TMathFunction mathFunction_;

        public double SomeWork(double input, double step)
        {
            var f = mathFunction_.Calculate(input);
            var dv = mathFunction_.Derivate(input);
            return f - (dv * step);
        }
    }

And instead of passing an interface, pass your implementation as TMathFunction. This will avoid vtable lookups due to an interface and also allow inlining.

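Note that the struct constraint is important here. The JIT compiles a separate, specialized copy of the generic code for each value type, so the calls resolve to the concrete methods and can be inlined. For reference types it shares one compiled copy across all type arguments, so the calls would still be dispatched through the interface.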
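As a rough sketch of what "pass your implementation as TMathFunction" can look like in practice: SquareFunction, the constructor on SomeObject, and the Demo class are hypothetical names made up for illustration, and the readonly struct modifier assumes C# 7.2 or later.

```csharp
using System;

public interface IMathFunction
{
    double Calculate(double input);
    double Derivate(double input);
}

// Hypothetical value-type implementation: f(x) = x^2, f'(x) = 2x.
// Marking the struct readonly lets the compiler call its methods through
// a readonly field without making a defensive copy first.
public readonly struct SquareFunction : IMathFunction
{
    public double Calculate(double input) => input * input;
    public double Derivate(double input) => 2.0 * input;
}

public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction
{
    private readonly TMathFunction mathFunction_;

    // Assumed constructor (not in the original snippet) so callers can
    // supply a struct that carries state.
    public SomeObject(TMathFunction mathFunction)
    {
        mathFunction_ = mathFunction;
    }

    public double SomeWork(double input, double step)
    {
        var f = mathFunction_.Calculate(input);
        var dv = mathFunction_.Derivate(input);
        return f - (dv * step);
    }
}

public static class Demo
{
    public static double Run()
    {
        // The concrete struct type is the type argument; no interface
        // reference is stored anywhere.
        var obj = new SomeObject<SquareFunction>(new SquareFunction());
        return obj.SomeWork(3.0, 0.5); // 9 - (6 * 0.5) = 6
    }
}
```

The trade-off is that the function type becomes part of the consumer's type, so swapping the function at runtime means constructing a different SomeObject<T>.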

Some implementation:

I made a simple implementation of IMathFunction for testing:

    class SomeImplementationByRef : IMathFunction
    {
        public double Calculate(double input)
        {
            return input + input;
        }

        public double Derivate(double input)
        {
            return input * input;
        }
    }

... as well as a struct version and an abstract version.
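The struct and abstract versions are not shown in the answer. A plausible reconstruction, assuming they mirror the class above, might look like the following; SomeImplementationByVal, MathFunctionBase, SomeImplementationAbstract and SomeObjectAbstract are hypothetical names.

```csharp
using System;

public interface IMathFunction
{
    double Calculate(double input);
    double Derivate(double input);
}

// Assumed struct version: same bodies as the class, declared as a value type.
public struct SomeImplementationByVal : IMathFunction
{
    public double Calculate(double input) { return input + input; }
    public double Derivate(double input) { return input * input; }
}

// Assumed abstract version: the contract lives in a base class instead of
// an interface, so calls are ordinary virtual calls.
public abstract class MathFunctionBase
{
    public abstract double Calculate(double input);
    public abstract double Derivate(double input);
}

public class SomeImplementationAbstract : MathFunctionBase
{
    public override double Calculate(double input) { return input + input; }
    public override double Derivate(double input) { return input * input; }
}

// Consumer that stores the abstract base type rather than the interface.
public class SomeObjectAbstract
{
    private readonly MathFunctionBase mathFunction_;

    public SomeObjectAbstract(MathFunctionBase mathFunction)
    {
        mathFunction_ = mathFunction;
    }

    public double SomeWork(double input, double step)
    {
        var f = mathFunction_.Calculate(input);
        var dv = mathFunction_.Derivate(input);
        return f - (dv * step);
    }
}
```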

So, here's what happens with the interface version. You can see it is relatively inefficient because it performs two levels of indirection:

    return obj.SomeWork(input, step);
    sub         esp,40h
    vzeroupper
    vmovaps     xmmword ptr [rsp+30h],xmm6
    vmovaps     xmmword ptr [rsp+20h],xmm7
    mov         rsi,rcx
    vmovsd      qword ptr [rsp+60h],xmm2
    vmovaps     xmm6,xmm1
    mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx.
    vmovaps     xmm1,xmm6
    mov         r11,7FFED7980020h              ; load vtable address of the IMathFunction.Calculate function.
    cmp         dword ptr [rcx],ecx
    call        qword ptr [r11]                ; call IMathFunction.Calculate function which will call the actual Calculate via vtable.
    vmovaps     xmm7,xmm0
    mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx.
    vmovaps     xmm1,xmm6
    mov         r11,7FFED7980028h              ; load vtable address of the IMathFunction.Derivate function.
    cmp         dword ptr [rcx],ecx
    call        qword ptr [r11]                ; call IMathFunction.Derivate function which will call the actual Derivate via vtable.
    vmulsd      xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step
    vsubsd      xmm7,xmm7,xmm0                 ; f - (dv * step)
    vmovaps     xmm0,xmm7
    vmovaps     xmm6,xmmword ptr [rsp+30h]
    vmovaps     xmm7,xmmword ptr [rsp+20h]
    add         rsp,40h
    pop         rsi
    ret

Here's the abstract class version. It's slightly more efficient, but only negligibly so:

    return obj.SomeWork(input, step);
    sub         esp,40h
    vzeroupper
    vmovaps     xmmword ptr [rsp+30h],xmm6
    vmovaps     xmmword ptr [rsp+20h],xmm7
    mov         rsi,rcx
    vmovsd      qword ptr [rsp+60h],xmm2
    vmovaps     xmm6,xmm1
    mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.
    vmovaps     xmm1,xmm6
    mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.
    mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.
    call        qword ptr [rax+20h]             ; call Calculate via offset 0x20 of vtable.
    vmovaps     xmm7,xmm0
    mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.
    vmovaps     xmm1,xmm6
    mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.
    mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.
    call        qword ptr [rax+28h]             ; call Derivate via offset 0x28 of vtable.
    vmulsd      xmm0,xmm0,mmword ptr [rsp+60h]  ; dv * step
    vsubsd      xmm7,xmm7,xmm0                  ; f - (dv * step)
    vmovaps     xmm0,xmm7
    vmovaps     xmm6,xmmword ptr [rsp+30h]
    vmovaps     xmm7,xmmword ptr [rsp+20h]
    add         rsp,40h
    pop         rsi
    ret

So both an interface and an abstract class rely heavily on branch target prediction to have acceptable performance. Even then, you can see there's quite a lot more going into it, so the best-case is still relatively slow while the worst-case is a stalled pipeline due to a mispredict.

And finally here's the generic version with a struct. You can see it's massively more efficient because everything has been fully inlined so there's no branch prediction involved. It also has the nice side effect of removing most of the stack/parameter management that was in there too, so the code becomes very compact:

    return obj.SomeWork(input, step);
    push        rax
    vzeroupper
    movsx       rax,byte ptr [rcx+8]
    vmovaps     xmm0,xmm1
    vaddsd      xmm0,xmm0,xmm1  ; Calculate - got inlined
    vmulsd      xmm1,xmm1,xmm1  ; Derivate - got inlined
    vmulsd      xmm1,xmm1,xmm2  ; dv * step
    vsubsd      xmm0,xmm0,xmm1  ; f - (dv * step)
    add         rsp,8
    ret
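If you want to check the difference on your own machine, a crude Stopwatch loop along these lines is one way to start. This is only a sketch: the type names (ByRefFunction, ByValFunction, InterfaceConsumer, GenericConsumer, TimingSketch) are hypothetical, the iteration counts are arbitrary, and the numbers will vary with runtime version and JIT settings, so a dedicated benchmarking library would give more trustworthy results.

```csharp
using System;
using System.Diagnostics;

public interface IMathFunction
{
    double Calculate(double input);
    double Derivate(double input);
}

// Reference-type implementation, called through the interface.
public class ByRefFunction : IMathFunction
{
    public double Calculate(double input) { return input + input; }
    public double Derivate(double input) { return input * input; }
}

// Value-type implementation, used as a generic type argument.
public struct ByValFunction : IMathFunction
{
    public double Calculate(double input) { return input + input; }
    public double Derivate(double input) { return input * input; }
}

public class InterfaceConsumer
{
    private readonly IMathFunction mathFunction_;

    public InterfaceConsumer(IMathFunction mathFunction) { mathFunction_ = mathFunction; }

    public double SomeWork(double input, double step)
    {
        return mathFunction_.Calculate(input) - (mathFunction_.Derivate(input) * step);
    }
}

public class GenericConsumer<TMathFunction> where TMathFunction : struct, IMathFunction
{
    // The default value of a struct is already a usable instance.
    private TMathFunction mathFunction_ = default(TMathFunction);

    public double SomeWork(double input, double step)
    {
        return mathFunction_.Calculate(input) - (mathFunction_.Derivate(input) * step);
    }
}

public static class TimingSketch
{
    // Returns the accumulated sum so the loop body cannot be discarded.
    public static double TimeInterface(int iterations, out double nsPerCall)
    {
        var obj = new InterfaceConsumer(new ByRefFunction());
        double acc = 0.0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) acc += obj.SomeWork(i, 0.5);
        sw.Stop();
        nsPerCall = sw.Elapsed.TotalMilliseconds * 1e6 / iterations;
        return acc;
    }

    public static double TimeGeneric(int iterations, out double nsPerCall)
    {
        var obj = new GenericConsumer<ByValFunction>();
        double acc = 0.0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) acc += obj.SomeWork(i, 0.5);
        sw.Stop();
        nsPerCall = sw.Elapsed.TotalMilliseconds * 1e6 / iterations;
        return acc;
    }

    public static void Run()
    {
        double ns;
        // Warm-up passes so both paths are compiled before timing.
        TimeInterface(1000000, out ns);
        TimeGeneric(1000000, out ns);

        TimeInterface(100000000, out ns);
        Console.WriteLine("interface: " + ns + " ns/call");
        TimeGeneric(100000000, out ns);
        Console.WriteLine("generic struct: " + ns + " ns/call");
    }
}
```

Calling TimingSketch.Run() from a Release build, outside the debugger, prints one line per variant. The figures include loop overhead, so treat them as relative rather than absolute.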
