race automatically divides operations on an iterable among several threads. For instance,

(Bool.roll xx 2000).race.sum

would automatically divide the sum of the 2000-element array across 4 threads. However, benchmarks show that this is much slower than if race were not employed. This happens even if you make the array bigger, and even as the non-autothreaded version gets faster with each release. (The auto-threaded version also gets faster, but is still twice as slow as not using it.)

So the question is: what is the minimum size of the atomic operation for which race is worthwhile? And is the overhead it adds to the sequential operation fixed, or can it be decreased somehow?
Update: in fact, the performance of hyper (similar to race, but with guaranteed ordered results) seems to be getting worse over time, at least for small sizes that are nonetheless integer multiples of the default batch size (64). The same happens with race.
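A minimal sketch of the kind of benchmark being described; the timing loop and now-based measurement are my own scaffolding, and the absolute numbers depend entirely on machine and Rakudo version:

```raku
# Hypothetical micro-benchmark: time plain .sum against .race.sum on the
# same 2000-element array of Bools. Only the relative difference between
# the two timings is of interest.
my @a = Bool.roll xx 2000;

for ^3 {                                # a few repetitions to smooth out noise
    my $t0 = now; @a.sum;      my $seq = now - $t0;
    my $t1 = now; @a.race.sum; my $par = now - $t1;
    say "sequential: {$seq}s   race: {$par}s";
}
```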
The short answer: .sum isn't smart enough to calculate sums in batches. So what you're effectively doing in this benchmark is setting up a RaceSeq and then not doing any parallel processing with it:

dd (Bool.roll xx 2000).race; # RaceSeq.new(configuration => HyperConfiguration.new(batch => 64, degree => 4))

So you've been measuring .race overhead. You see, at the moment, only .map and .grep have been implemented on RaceSeq. If you give it something to do, like:
# find the 1000th prime number in a single thread
$ time perl6 -e 'say (^Inf).grep( *.is-prime ).skip(999).head'
real    0m1.731s
user    0m1.780s
sys     0m0.043s

# find the 1000th prime number concurrently
$ time perl6 -e 'say (^Inf).hyper.grep( *.is-prime ).skip(999).head'
real    0m0.809s
user    0m2.048s
sys     0m0.060s
As you can see, in this (small) example the concurrent version is more than twice as fast as the non-concurrent one, although it uses more CPU.
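If the per-item work is cheap, the default configuration (batch size 64) may not amortize the coordination overhead; both .hyper and .race accept batch and degree arguments, so one lever is to hand out larger batches. A sketch, with the particular values being illustrative rather than tuned:

```raku
# Same 1000th-prime search, but with an explicit, larger batch size so each
# worker gets a bigger chunk of candidates per hand-off. degree => 4 caps the
# number of worker threads. Both values are assumptions for illustration.
say (^Inf).hyper(batch => 2048, degree => 4)
          .grep( *.is-prime )
          .skip(999)
          .head;    # prints 7919, the 1000th prime
```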
Since .race got to work correctly, performance has slightly improved, as you can see in this graph.
Other functions, such as .sum, could be implemented for .race. However, I would hold off on that at the moment, as we will need a small refactor of the way we do .race: at the moment, a batch cannot communicate back to the "supervisor" how quickly it finished its job. The supervisor needs that information if we want it to be able to adjust, for example, the batch size, should it find that the default batch size is too small and causes too much overhead.
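Until a batch-aware .sum lands on RaceSeq, one workaround along these lines is to do the batching yourself: split the list with .batch, compute per-batch sums in parallel through the already-implemented .map, and then add up the handful of partial sums sequentially. A sketch, where the batch size of 1024 is an arbitrary choice:

```raku
# Manual batched sum: .batch splits the list into chunks, .race.map sums each
# chunk on a worker thread, and the final .sum only has to add a few partial
# results. Addition is commutative, so the racing order is harmless.
my @a = Bool.roll xx 100_000;
my $total = @a.batch(1024).race.map( *.sum ).sum;
say $total == @a.sum;    # True: same result as the sequential sum
```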