I often use the Pandas mask and where methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant performance drop relative to numpy.where.
While I'm happy to accept this for specific cases, I'm interested to know:
1. Do the Pandas mask/where methods offer any additional functionality, apart from the inplace/errors/try_cast parameters? I understand those 3 parameters but rarely use them. For example, I have no idea what the level parameter refers to. (A small sketch of what I mean follows this list.)
2. Is there any non-trivial counterexample where mask/where outperforms numpy.where? If such an example exists, it could influence how I choose appropriate methods going forwards.
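To make the first question concrete: as far as I can tell, mask/where also accept callables for cond and other and can modify a series in place, neither of which numpy.where offers directly. A small illustrative sketch (not part of the benchmarks below):

import numpy as np
import pandas as pd

s = pd.Series(np.random.random(10))

# cond and other may both be callables that receive the Series
doubled_above_half = s.mask(lambda x: x > 0.5, lambda x: x * 2)

# in-place variant; there is no direct np.where equivalent
s.mask(s > 0.5, 1, inplace=True)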
For reference, here's some benchmarking on Pandas 0.19.2 / Python 3.6.0:
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))

assert (df[0].mask(df[0] > 0.5, 1).values == np.where(df[0] > 0.5, 1, df[0])).all()

%timeit df[0].mask(df[0] > 0.5, 1)        # 145 ms per loop
%timeit np.where(df[0] > 0.5, 1, df[0])   # 113 ms per loop
The performance appears to diverge further for non-scalar values:
%timeit df[0].mask(df[0] > 0.5, df[0]*2)        # 338 ms per loop
%timeit np.where(df[0] > 0.5, df[0]*2, df[0])   # 153 ms per loop
I'm using pandas 0.23.3 and Python 3.6, so I can see a real difference in running time only for your second example.
But let's investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:
twice = df[0]*2
mask = df[0] > 0.5

%timeit np.where(mask, twice, df[0])
# 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df[0].mask(mask, twice)
# 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy's version is about 2.3 times faster than pandas.
So let's profile both functions to see the difference. Profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.
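(For a quick, purely Python-level overview, the standard library's cProfile would also work, e.g. as sketched below; but it cannot attribute time to the C-level symbols that matter here, which is why a native profiler is used instead.)

import cProfile
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random(10000000))
mask = df[0] > 0.5
twice = df[0] * 2

# Python-level profile; the C internals only show up as opaque built-in calls.
cProfile.runctx("df[0].mask(mask, twice)", globals(), locals(), sort="cumtime")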
I'm on Linux and use perf. For the numpy version we get (for the listing see Appendix A):
>>> perf record python np_where.py
>>> perf report

Overhead  Command  Shared Object                                Symbol
  68,50%  python   multiarray.cpython-36m-x86_64-linux-gnu.so   [.] PyArray_Where
   8,96%  python   [unknown]                                    [k] 0xffffffff8140290c
   1,57%  python   mtrand.cpython-36m-x86_64-linux-gnu.so       [.] rk_random
As we can see, the lion's share of the time is spent in PyArray_Where, about 69%. The unknown symbol is a kernel function (as a matter of fact, clear_page); I run without root privileges, so the symbol is not resolved.
And for pandas we get (see Appendix B for code):
>>> perf record python pd_mask.py
>>> perf report

Overhead  Command  Shared Object                                 Symbol
  37,12%  python   interpreter.cpython-36m-x86_64-linux-gnu.so   [.] vm_engine_iter_task
  23,36%  python   libc-2.23.so                                  [.] __memmove_ssse3_back
  19,78%  python   [unknown]                                     [k] 0xffffffff8140290c
   3,32%  python   umath.cpython-36m-x86_64-linux-gnu.so         [.] DOUBLE_isnan
   1,48%  python   umath.cpython-36m-x86_64-linux-gnu.so         [.] BOOL_logical_not
Quite a different situation:

- pandas doesn't use PyArray_Where under the hood; the most prominent time-consumer is vm_engine_iter_task, which is numexpr functionality.
- there is some heavy memory copying going on: __memmove_ssse3_back uses about 25% of the time! Probably some of the kernel's functions are also connected to memory accesses.
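To illustrate what that numexpr machinery does (this is not literally how pandas drives it internally, just a sketch of the building block that vm_engine_iter_task belongs to):

import numexpr as ne
import numpy as np

cond = np.random.random(10**7) > 0.5
a = np.random.random(10**7)
b = np.random.random(10**7)

# numexpr compiles the string expression and evaluates it chunk-wise
# in its own virtual machine (that's where vm_engine_iter_task lives).
result = ne.evaluate("where(cond, a, b)")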
Actually, pandas 0.19 used PyArray_Where under the hood; for the older version the perf report would look like:
Overhead  Command  Shared Object        Symbol
  32,42%  python   multiarray.so        [.] PyArray_Where
  30,25%  python   libc-2.23.so         [.] __memmove_ssse3_back
  21,31%  python   [kernel.kallsyms]    [k] clear_page
   1,72%  python   [kernel.kallsyms]    [k] __schedule
So back then pandas basically used np.where under the hood plus some overhead (above all data copying, see __memmove_ssse3_back).
I see no scenario in which pandas 0.19 could become faster than numpy: it just adds overhead on top of numpy's functionality. Pandas 0.23.3 is an entirely different story: here the numexpr module is used, and it is very possible that there are scenarios for which pandas' version is (at least slightly) faster.
I'm not sure this memory copying is really called for/necessary; maybe one could even call it a performance bug, but I just don't know enough to be certain.
We could help pandas avoid copying by peeling away some indirections (passing np.array instead of pd.Series). For example:
%timeit df[0].mask(mask.values > 0.5, twice.values)
# 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, pandas is only 25% slower. The perf report says:
Overhead  Command  Shared Object                                 Symbol
  50,81%  python   interpreter.cpython-36m-x86_64-linux-gnu.so   [.] vm_engine_iter_task
  14,12%  python   [unknown]                                     [k] 0xffffffff8140290c
   9,93%  python   libc-2.23.so                                  [.] __memmove_ssse3_back
   4,61%  python   umath.cpython-36m-x86_64-linux-gnu.so         [.] DOUBLE_isnan
   2,01%  python   umath.cpython-36m-x86_64-linux-gnu.so         [.] BOOL_logical_not
Much less data copying, but still more than in numpy's version; this copying is mostly responsible for the remaining overhead.
My key takeaways from this:

- pandas has the potential to be at least slightly faster than numpy (because it is possible to be faster). However, pandas' somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) data copying.
- when the performance of where/mask is the bottleneck, I would use numba/cython to improve the performance; see my rather naive attempts to use numba and cython further below.
The idea is to take the np.where(df[0] > 0.5, df[0]*2, df[0]) version and to eliminate the need to create a temporary array, i.e. df[0]*2.
As proposed by @max9111, using numba:
import numba as nb

@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()

%timeit np.where(df[0] > 0.5, df[0]*2, df[0])
# 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nb_where(df[0].values)
# 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is about a factor of 5 faster than numpy's version!
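If the arrays are large enough, numba can also parallelize the loop with parallel=True and nb.prange (standard numba features; I haven't benchmarked this variant, so take it only as a sketch):

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def nb_where_parallel(arr):
    n = len(arr)
    output = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):  # prange distributes the iterations over threads
        if arr[i] > 0.5:
            output[i] = 2.0*arr[i]
        else:
            output[i] = arr[i]
    return output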
And here is my far less successful attempt to improve the performance with the help of Cython:
%%cython -a
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()

%timeit cy_where(df[0].values)
# 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This gives a 25% speedup. I'm not sure why cython is so much slower than numba, though.
Listings:
Appendix A: np_where.py:
import pandas as pd
import numpy as np

np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2

for _ in range(50):
    np.where(df[0] > 0.5, twice, df[0])
Appendix B: pd_mask.py:
import pandas as pd
import numpy as np

np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
mask = df[0] > 0.5

for _ in range(50):
    df[0].mask(mask, twice)