Why is summing an array of value types slower than summing an array of reference types?


I'm trying to understand better how memory works in .NET, so I'm playing with BenchmarkDotNet and its diagnosers. I've created a benchmark comparing class and struct performance by summing array items. I expected that summing value types would always be quicker, but for short arrays it's not. Can anyone explain that?

The code:

    internal class ReferenceType
    {
        public int Value;
    }

    internal struct ValueType
    {
        public int Value;
    }

    internal struct ExtendedValueType
    {
        public int Value;
        private double _otherData; // this field is here just to make the struct bigger
    }
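As a side note on element sizes (a quick check you can run separately; it assumes the `System.Runtime.CompilerServices.Unsafe` package for `Unsafe.SizeOf`), the three array kinds differ considerably in per-slot footprint:

```csharp
using System;
using System.Runtime.CompilerServices;

internal struct ValueType { public int Value; }
internal struct ExtendedValueType { public int Value; private double _otherData; }

internal static class SizeCheck
{
    public static void Main()
    {
        // Size of each element stored directly in the array:
        Console.WriteLine(Unsafe.SizeOf<ValueType>());         // 4: one int
        Console.WriteLine(Unsafe.SizeOf<ExtendedValueType>()); // 16: int + double, padded to 8-byte alignment
        // A reference type array stores pointers; the objects live elsewhere:
        Console.WriteLine(IntPtr.Size);                        // 8 on x64: one reference per slot
    }
}
```

So on x64 the reference type array actually touches more memory per element than the plain `ValueType` array (8-byte reference plus the object itself), which makes the small-array results all the more surprising.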

I have three arrays:

    private ReferenceType[] _referenceTypeData;
    private ValueType[] _valueTypeData;
    private ExtendedValueType[] _extendedValueTypeData;

Which I initialize with the same set of random values.
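For reference, the initialization might look roughly like this (a sketch, not the exact code from the question: the fixed seed and object-initializer syntax are assumptions, while the field names match the arrays above):

```csharp
[GlobalSetup]
public void Setup()
{
    var random = new Random(42); // hypothetical fixed seed so all three arrays get the same values
    _referenceTypeData = new ReferenceType[Size];
    _valueTypeData = new ValueType[Size];
    _extendedValueTypeData = new ExtendedValueType[Size];

    for (var i = 0; i < Size; i++)
    {
        var value = random.Next();
        _referenceTypeData[i] = new ReferenceType { Value = value };
        _valueTypeData[i] = new ValueType { Value = value };
        _extendedValueTypeData[i] = new ExtendedValueType { Value = value };
    }
}
```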

Then a benchmarked method:

    [Benchmark]
    public int ReferenceTypeSum()
    {
        var sum = 0;

        for (var i = 0; i < Size; i++)
        {
            sum += _referenceTypeData[i].Value;
        }

        return sum;
    }

Size is a benchmark parameter. The two other benchmark methods (ValueTypeSum and ExtendedValueTypeSum) are identical, except that they sum over _valueTypeData or _extendedValueTypeData. Full code for the benchmark.
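For completeness, one of the two sibling methods described above would look like this (same loop, different array):

```csharp
[Benchmark]
public int ValueTypeSum()
{
    var sum = 0;

    for (var i = 0; i < Size; i++)
    {
        sum += _valueTypeData[i].Value;
    }

    return sum;
}
```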

Benchmark results:

DefaultJob : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3190.0

                   Method | Size |      Mean |     Error |    StdDev | Ratio | RatioSD |
    --------------------- |----- |----------:|----------:|----------:|------:|--------:|
         ReferenceTypeSum |  100 |  75.76 ns | 1.2682 ns | 1.1863 ns |  1.00 |    0.00 |
             ValueTypeSum |  100 |  79.83 ns | 0.3866 ns | 0.3616 ns |  1.05 |    0.02 |
     ExtendedValueTypeSum |  100 |  78.70 ns | 0.8791 ns | 0.8223 ns |  1.04 |    0.01 |
         ReferenceTypeSum |  500 | 354.78 ns | 3.9368 ns | 3.6825 ns |  1.00 |    0.00 |
             ValueTypeSum |  500 | 367.08 ns | 5.2446 ns | 4.9058 ns |  1.03 |    0.01 |
     ExtendedValueTypeSum |  500 | 346.18 ns | 2.1114 ns | 1.9750 ns |  0.98 |    0.01 |
         ReferenceTypeSum | 1000 | 697.81 ns | 6.8859 ns | 6.1042 ns |  1.00 |    0.00 |
             ValueTypeSum | 1000 | 720.64 ns | 5.5592 ns | 5.2001 ns |  1.03 |    0.01 |
     ExtendedValueTypeSum | 1000 | 699.12 ns | 9.6796 ns | 9.0543 ns |  1.00 |    0.02 |

Core : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT

                   Method | Size |      Mean |     Error |    StdDev | Ratio | RatioSD |
    --------------------- |----- |----------:|----------:|----------:|------:|--------:|
         ReferenceTypeSum |  100 |  76.22 ns | 0.5232 ns | 0.4894 ns |  1.00 |    0.00 |
             ValueTypeSum |  100 |  80.69 ns | 0.9277 ns | 0.8678 ns |  1.06 |    0.01 |
     ExtendedValueTypeSum |  100 |  78.88 ns | 1.5693 ns | 1.4679 ns |  1.03 |    0.02 |
         ReferenceTypeSum |  500 | 354.30 ns | 2.8682 ns | 2.5426 ns |  1.00 |    0.00 |
             ValueTypeSum |  500 | 372.72 ns | 4.2829 ns | 4.0063 ns |  1.05 |    0.01 |
     ExtendedValueTypeSum |  500 | 357.50 ns | 7.0070 ns | 6.5543 ns |  1.01 |    0.02 |
         ReferenceTypeSum | 1000 | 696.75 ns | 4.7454 ns | 4.4388 ns |  1.00 |    0.00 |
             ValueTypeSum | 1000 | 697.95 ns | 2.2462 ns | 2.1011 ns |  1.00 |    0.01 |
     ExtendedValueTypeSum | 1000 | 687.75 ns | 2.3861 ns | 1.9925 ns |  0.99 |    0.01 |

I've run the benchmark with the BranchMispredictions and CacheMisses hardware counters, but there are no cache misses or branch mispredictions. I've also checked the release IL code, and the benchmark methods differ only in the instructions that load the reference or value type variables.

For bigger array sizes, summing the value type array is always quicker (e.g. because value types occupy less memory), but I don't understand why it's slower for shorter arrays. What am I missing here? And why does making the struct bigger (see ExtendedValueType) make summing a bit faster?

---- UPDATE ----

Inspired by a comment made by @usr, I've re-run the benchmark with LegacyJIT. I've also added a memory diagnoser, as suggested by @Silver Shroud (and yes, there are no heap allocations).

Job=LegacyJitX64 Jit=LegacyJit Platform=X64 Runtime=Clr

                   Method | Size |       Mean |      Error |     StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
    --------------------- |----- |-----------:|-----------:|-----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
         ReferenceTypeSum |  100 |   110.1 ns |  0.6836 ns |  0.6060 ns |  1.00 |    0.00 |           - |           - |           - |                   - |
             ValueTypeSum |  100 |   109.5 ns |  0.4320 ns |  0.4041 ns |  0.99 |    0.00 |           - |           - |           - |                   - |
     ExtendedValueTypeSum |  100 |   109.5 ns |  0.5438 ns |  0.4820 ns |  0.99 |    0.00 |           - |           - |           - |                   - |
         ReferenceTypeSum |  500 |   517.8 ns | 10.1271 ns | 10.8359 ns |  1.00 |    0.00 |           - |           - |           - |                   - |
             ValueTypeSum |  500 |   511.9 ns |  7.8204 ns |  7.3152 ns |  0.99 |    0.03 |           - |           - |           - |                   - |
     ExtendedValueTypeSum |  500 |   534.7 ns |  3.0168 ns |  2.8219 ns |  1.03 |    0.02 |           - |           - |           - |                   - |
         ReferenceTypeSum | 1000 | 1,058.3 ns |  8.8829 ns |  8.3091 ns |  1.00 |    0.00 |           - |           - |           - |                   - |
             ValueTypeSum | 1000 | 1,048.4 ns |  8.6803 ns |  8.1196 ns |  0.99 |    0.01 |           - |           - |           - |                   - |
     ExtendedValueTypeSum | 1000 | 1,057.5 ns |  5.9456 ns |  5.5615 ns |  1.00 |    0.01 |           - |           - |           - |                   - |

With the legacy JIT the results are as expected, but slower than the earlier results! This suggests RyuJIT performs some optimizations that happen to do better on reference types.

 


I think the reason the results are this close is that the Size is so small and that nothing is allocated on the heap (inside your array initialization loop) to fragment the object array elements.

In your benchmark code, only the object array elements are allocated from the heap, so the memory allocator can place each element sequentially(**) on the heap.(*) When the code starts to execute, memory is read from RAM into the CPU caches, and since your object array elements were written to RAM in sequential order (in a contiguous block), they are cached together. That's why you don't get any cache misses.

To see this better, you could allocate other objects on the heap during initialization to fragment your benchmarked array elements. That might cause cache misses to occur earlier than in your current setup.
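A minimal sketch of that idea (the padding size and fixed seed are assumptions, not from the question): interleave throw-away allocations while filling the reference type array, so consecutive elements no longer sit next to each other on the heap.

```csharp
using System;
using System.Collections.Generic;

internal class ReferenceType { public int Value; }

internal static class FragmentedSetup
{
    // Fill the array while interleaving dummy allocations so that
    // consecutive elements end up scattered across the heap instead
    // of in one contiguous block.
    public static ReferenceType[] Create(int size)
    {
        var random = new Random(42);          // hypothetical fixed seed
        var data = new ReferenceType[size];
        var padding = new List<byte[]>(size); // keep the spacers alive so the GC can't compact them away

        for (var i = 0; i < size; i++)
        {
            data[i] = new ReferenceType { Value = random.Next() };
            padding.Add(new byte[256]);       // hypothetical spacer between elements
        }

        GC.KeepAlive(padding);
        return data;
    }
}
```

Summing an array built this way should start showing cache misses at smaller sizes, since each element dereference can land on a different cache line.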

Also, you don't have any branches in your benchmark code, so you can't get branch mispredictions.

(*) A ValueType array allocates all the space required for its elements up front, when you initialize it with new ValueType[Size];

(**) objectArr[i] and objectArr[i+1] (and so on) will sit side by side on the heap. When that RAM block is cached, probably all of the object array elements will be read into the CPU cache, so no RAM access will be required as you iterate over the array.
