I'm currently trying to get a better understanding of memory related performance issues. I read somewhere that memory locality is more important for reading than for writing, because in the former case the CPU has to actually wait for the data whereas in the latter case it can just ship them out and forget about them.
With that in mind, I did the following quick-and-dirty test:
Update 3 (explain for non python/numpy users):
I wrote a script in python/numpy that creates an array of N random floats and a permutation, i.e. an array containing the numbers 0 to N-1 in random order. Then it repeatedly either (1) reads the data array linearly and writes it back to a new array in the random access pattern given by the permutation or (2) reads the data array in the permuted order and linearly writes it to a new array.
To my surprise (2) seemed consistently faster than (1). There were, however, problems with my script
- python/numpy being quite high-level it is not clear how pecisely the read/write are implemented.
- I probably did not balance the two cases properly.
Also, some of the answers/comments below suggest that my original expectation isn't always correct and that depending on details of the cpu cache either case might be faster.
The bounty I have now placed applies to the following:
Given the comments and answers so far I would appreciate clarification as to
- is @B.M.'s numba experiment conclusive?
- or could it go either way depending on cache details?
Please explain in a beginner friendly way the various cache concepts that may be relevant here (cf. comments by @Trilarion, @Yann Vernier's answer). Supporting code may be in C / cython / numpy / numba or python.
Optionally, explain the behavior of my clearly inadequate python experiments.
import numpy as np from timeit import timeit def setup(): global a, b, c a = np.random.permutation(N) b = np.random.random(N) c = np.empty_like(b) def fwd(): c = b[a] def inv(): c[a] = b N = 10_000 setup() timeit(fwd, number=100_000) # 1.4942631321027875 timeit(inv, number=100_000) # 2.531870319042355 N = 100_000 setup() timeit(fwd, number=10_000) # 2.4054739447310567 timeit(inv, number=10_000) # 3.2365565397776663 N = 1_000_000 setup() timeit(fwd, number=1_000) # 11.131387163884938 timeit(inv, number=1_000) # 14.19817715883255
Now, the result is the opposite of what I would have expected:
fwd which does random read and contiguous write is consistently faster than
inv which does contiguous read and random write.
I suspect I'm missing something fairly obvious, but anyway, here goes:
Why is this so?
And while we are at it, why is the whole thing so nonlinear? My numbers are not insanely big are they?
UPDATE: As pointed out by @Trilarion and @Yann Vernier my snippets aren't properly balanced, so I replaced them with
def fwd(): c[d] = b[a] b[d] = c[a] def inv(): c[a] = b[d] b[a] = c[d]
d = np.arange(N) (I shuffle everything both ways to hopefully reduce across trial caching effects). I also replaced
repeat and reduced the numbers of repeats by a factor of 10.
Then I get
[0.6757169323973358, 0.6705542299896479, 0.6702114241197705] #fwd [0.8183442652225494, 0.8382121799513698, 0.8173762648366392] #inv [1.0969422250054777, 1.0725746559910476, 1.0892365919426084] #fwd [1.0284497970715165, 1.025063106790185, 1.0247828317806125] #inv [3.073981977067888, 3.077839042060077, 3.072118630632758] #fwd [3.2967213969677687, 3.2996009718626738, 3.2817375687882304] #inv
So there still seems to be a difference, but it is much more subtle and can now go either way depending on the problem size.
- First a confirmation of your intuition :
It is the case for this numba version:
import numba @numba.njit def fwd_numba(a,b,c): for i in range(N): c[a[i]]=b[i] @numba.njit def inv_numba(a,b,c): for i in range(N): c[i]=b[a[i]]
Timings for N= 10 000:
%timeit fwd() %timeit inv() %timeit fwd_numba(a,b,c) %timeit inv_numba(a,b,c) 62.6 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 144 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 34.6 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 16.9 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- Second, Numpy has to deal with fearsome problems of alignement and (cache-) locality.
It's essentially a wrapper on low level procedures from BLAS/ATLAS/MKL tuned for that. Fancy indexing is a nice high-level tool but heretic for these problems; there is no direct traduction of this concept at low level.
- Third, numpy dev docs : details fancy indexing. In particular:
Unless there is only a single indexing array during item getting, the validity of the indices is checked beforehand. Otherwise it is handled in the inner loop itself for optimization.
We are in this case here. I think this can explain the difference, and why set is slower then get.
It explains also why hand made numba is often faster : it doesn't check anything and crashes on inconsistent index.