I'm currently trying to get a better understanding of memory-related performance issues. I read somewhere that memory locality is more important for reading than for writing, because in the former case the CPU has to actually wait for the data, whereas in the latter case it can just ship them out and forget about them.
With that in mind, I did the following quick-and-dirty test:
```python
import numpy as np
from timeit import timeit

def setup():
    global a, b, c
    a = np.random.permutation(N)
    b = np.random.random(N)
    c = np.empty_like(b)

def fwd():
    c = b[a]

def inv():
    c[a] = b

N = 10_000
setup()
timeit(fwd, number=100_000)  # 1.4942631321027875
timeit(inv, number=100_000)  # 2.531870319042355

N = 100_000
setup()
timeit(fwd, number=10_000)   # 2.4054739447310567
timeit(inv, number=10_000)   # 3.2365565397776663

N = 1_000_000
setup()
timeit(fwd, number=1_000)    # 11.131387163884938
timeit(inv, number=1_000)    # 14.19817715883255
```
Now, the result is the opposite of what I would have expected:
fwd, which does a random read and a contiguous write, is consistently faster than
inv, which does a contiguous read and a random write.
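Spelled out as explicit loops (a sketch just to make the access patterns concrete, not how numpy implements it), the two operations look like this:

```python
import numpy as np

N = 8
a = np.random.permutation(N)
b = np.random.random(N)

# fwd, c = b[a]: a gather -- random reads from b, contiguous writes to c
c_fwd = np.empty_like(b)
for i in range(N):
    c_fwd[i] = b[a[i]]
assert np.array_equal(c_fwd, b[a])

# inv, c[a] = b: a scatter -- contiguous reads from b, random writes to c
c_inv = np.empty_like(b)
for i in range(N):
    c_inv[a[i]] = b[i]
c_ref = np.empty_like(b)
c_ref[a] = b
assert np.array_equal(c_inv, c_ref)
```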
I suspect I'm missing something fairly obvious, but anyway, here goes:
Why is this so?
And while we are at it, why is the whole thing so nonlinear? My numbers are not insanely big, are they?
UPDATE: As pointed out by @Trilarion and @Yann Vernier, my snippets aren't properly balanced, so I replaced them with

```python
def fwd():
    c[d] = b[a]
    b[d] = c[a]

def inv():
    c[a] = b[d]
    b[a] = c[d]
```

where d = np.arange(N) (I shuffle everything both ways to hopefully reduce caching effects across trials). I also replaced timeit with repeat and reduced the number of repeats by a factor of 10.
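For reference, the timing call I switched to looks roughly like this (a sketch; timeit.repeat returns one total per repetition, which is where the lists of three numbers below come from):

```python
from timeit import repeat

import numpy as np

N = 10_000
a = np.random.permutation(N)
d = np.arange(N)
b = np.random.random(N)
c = np.empty_like(b)

def fwd():
    c[d] = b[a]
    b[d] = c[a]

def inv():
    c[a] = b[d]
    b[a] = c[d]

# repeat returns a list of timings, one float per repetition
print(repeat(fwd, number=10_000, repeat=3))
print(repeat(inv, number=10_000, repeat=3))
```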
Then I get
```
[0.6757169323973358, 0.6705542299896479, 0.6702114241197705]  # fwd
[0.8183442652225494, 0.8382121799513698, 0.8173762648366392]  # inv
[1.0969422250054777, 1.0725746559910476, 1.0892365919426084]  # fwd
[1.0284497970715165, 1.025063106790185, 1.0247828317806125]   # inv
[3.073981977067888, 3.077839042060077, 3.072118630632758]     # fwd
[3.2967213969677687, 3.2996009718626738, 3.2817375687882304]  # inv
```
So there still seems to be a difference, but it is much more subtle and can now go either way depending on the problem size.
Given the comments and answers so far I would appreciate clarification as to
- Is @B.M.'s numba experiment conclusive?
- Or could it go either way depending on cache details?
- Please explain in a beginner-friendly way the various cache concepts that may be relevant here (cf. comments by @Trilarion, @Yann Vernier's answer). Supporting code may be in C / cython / numpy / numba or python.
- Optionally, explain the behavior of my clearly inadequate python experiments.
- First, a confirmation of your intuition:
It is the case for this numba version (note that, relative to the question's definitions, the names are swapped: fwd_numba scatters like the question's inv, and inv_numba gathers like the question's fwd):

```python
import numba

@numba.njit
def fwd_numba(a, b, c):
    for i in range(N):
        c[a[i]] = b[i]

@numba.njit
def inv_numba(a, b, c):
    for i in range(N):
        c[i] = b[a[i]]
```
Timings for N = 10_000:

```
%timeit fwd()
62.6 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit inv()
144 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit fwd_numba(a, b, c)
34.6 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit inv_numba(a, b, c)
16.9 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
- Second, Numpy has to deal with fearsome problems of alignment and (cache) locality.

It's essentially a wrapper around low-level procedures from BLAS/ATLAS/MKL tuned for that. Fancy indexing is a nice high-level tool, but heretical for these problems; there is no direct translation of this concept at the low level.
- Third, the numpy dev docs detail fancy indexing. In particular:
> Unless there is only a single indexing array during item getting, the validity of the indices is checked beforehand. Otherwise it is handled in the inner loop itself for optimization.
We are in this case here. I think this can explain the difference, and why set is slower than get.
It also explains why hand-made numba is often faster: it doesn't check anything, and it crashes on an inconsistent index.
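The bounds-check difference is easy to observe from Python. NumPy's fancy indexing validates the indices and raises IndexError on both the get and the set path, whereas an @njit loop like the numba version above performs no such check (an out-of-range index there simply reads or writes past the array). A minimal NumPy-only sketch:

```python
import numpy as np

b = np.arange(3.0)
c = np.empty(3)
bad = np.array([0, 1, 99])  # 99 is out of range for arrays of length 3

try:
    c[bad] = b       # set path: indices validated, IndexError raised
except IndexError as e:
    print("set raised:", e)

try:
    _ = b[bad]       # get path: also raises (checked in the inner loop)
except IndexError as e:
    print("get raised:", e)
```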