Why is this loop faster than a dictionary comprehension for creating a dictionary?

  • A+
Category:Languages

I don't come from a software/computer science background but I love to code in Python and can generally understand why things are faster. I am really curious to know why this for loop runs faster than the dictionary comprehension. Any insights?

Problem : Given a dictionary a with these keys and values, return a dictionary with the values as keys and the keys as values. (challenge: do this in one line)

and the code

a = {'a':'hi','b':'hey','c':'yo'}  b = {} for i,j in a.items():     b[j]=i  %% timeit 932 ns ± 37.2 ns per loop  b = {v: k for k, v in a.items()}  %% timeit 1.08 µs ± 16.4 ns per loop 

 


Two problems: you are testing with way too small an input, and a dictionary comprehension doesn't really have that much of an advantage over a plain for loop, not when the target name is a local variable.

Your input consists of just 3 key-value pairs. Testing with 1000 elements instead, we see that the timings are very close instead:

>>> import timeit >>> from random import choice, randint; from string import ascii_lowercase as letters >>> looped = '''/ ... b = {} ... for i,j in a.items(): ...     b[j]=i ... ''' >>> dictcomp = '''b = {v: k for k, v in a.items()}''' >>> def rs(): return ''.join([choice(letters) for _ in range(randint(3, 15))]) ... >>> a = {rs(): rs() for _ in range(1000)} >>> len(a) 1000 >>> count, total = timeit.Timer(looped, 'from __main__ import a').autorange() >>> (total / count) * 1000000   # microseconds per run 66.62004760000855 >>> count, total = timeit.Timer(dictcomp, 'from __main__ import a').autorange() >>> (total / count) * 1000000   # microseconds per run 64.5464928005822 

The difference is there, the dict comp is faster but only just at this scale. With 100 times as many key-value pairs the difference is a bit bigger:

>>> a = {rs(): rs() for _ in range(100000)} >>> len(a) 98476 >>> count, total = timeit.Timer(looped, 'from __main__ import a').autorange() >>> total / count * 1000  # milliseconds, different scale! 15.48140200029593 >>> count, total = timeit.Timer(dictcomp, 'from __main__ import a').autorange() >>> total / count * 1000  # milliseconds, different scale! 13.674790799996117 

but that's not that big a difference when you consider both processed nearly 100k key-value pairs.

So why the speed difference with 3 elements? Because a comprehension (dictionary, set, list comprehensions or a generator expression) is under the hood implemented as a new function, and calling that function has a base cost the plain loop doesn't have to pay.

Here's the disassembly for the bytecode for both alternatives; note the MAKE_FUNCTION and CALL_FUNCTION opcodes in the top-level bytecode for the dict comprehension, there is a separate section for what that function then does, and there are actually very few differences in between the two approaches here:

>>> import dis >>> dis.dis(looped)   1           0 BUILD_MAP                0               2 STORE_NAME               0 (b)    2           4 SETUP_LOOP              28 (to 34)               6 LOAD_NAME                1 (a)               8 LOAD_METHOD              2 (items)              10 CALL_METHOD              0              12 GET_ITER         >>   14 FOR_ITER                16 (to 32)              16 UNPACK_SEQUENCE          2              18 STORE_NAME               3 (i)              20 STORE_NAME               4 (j)    3          22 LOAD_NAME                3 (i)              24 LOAD_NAME                0 (b)              26 LOAD_NAME                4 (j)              28 STORE_SUBSCR              30 JUMP_ABSOLUTE           14         >>   32 POP_BLOCK         >>   34 LOAD_CONST               0 (None)              36 RETURN_VALUE >>> dis.dis(dictcomp)   1           0 LOAD_CONST               0 (<code object <dictcomp> at 0x11d6ade40, file "<dis>", line 1>)               2 LOAD_CONST               1 ('<dictcomp>')               4 MAKE_FUNCTION            0               6 LOAD_NAME                0 (a)               8 LOAD_METHOD              1 (items)              10 CALL_METHOD              0              12 GET_ITER              14 CALL_FUNCTION            1              16 STORE_NAME               2 (b)              18 LOAD_CONST               2 (None)              20 RETURN_VALUE  Disassembly of <code object <dictcomp> at 0x11d6ade40, file "<dis>", line 1>:   1           0 BUILD_MAP                0               2 LOAD_FAST                0 (.0)         >>    4 FOR_ITER                14 (to 20)               6 UNPACK_SEQUENCE          2               8 STORE_FAST               1 (k)              10 STORE_FAST               2 (v)              12 LOAD_FAST                1 (k)              14 LOAD_FAST                2 (v)              16 MAP_ADD                  2              18 JUMP_ABSOLUTE            4         >>   20 RETURN_VALUE 

The material differences: the looped code uses LOAD_NAME for b each iteration, and STORE_SUBSCR to store the key-value pair in dict loaded. The dictionary comprehension uses MAP_ADD to achieve the same thing as STORE_SUBSCR but doesn't have to load that b name each time. But with 3 iterations only, the MAKE_FUNCTION / CALL_FUNCTION combo the dict comprehension has to execute is the real drag on the performance:

>>> make_and_call = '(lambda i: None)(None)' >>> dis.dis(make_and_call)   1           0 LOAD_CONST               0 (<code object <lambda> at 0x11d6ab270, file "<dis>", line 1>)               2 LOAD_CONST               1 ('<lambda>')               4 MAKE_FUNCTION            0               6 LOAD_CONST               2 (None)               8 CALL_FUNCTION            1              10 RETURN_VALUE  Disassembly of <code object <lambda> at 0x11d6ab270, file "<dis>", line 1>:   1           0 LOAD_CONST               0 (None)               2 RETURN_VALUE >>> count, total = timeit.Timer(make_and_call).autorange() >>> total / count * 1000000 0.12945385499915574 

More than 0.1 ms to create a function object with one argument, and call it (with an extra LOAD_CONST for the None value we pass in)! And that's just about the difference between the looped and comprehension timings for 3 key-value pairs.

But beyond a few key-value pairs, that cost fades away into nothingness. At this point the dict comprehension and the explicit loop basically do the same thing:

  • take the next key-value pair, pop those on the stack
  • call the dict.__setitem__ hook via a bytecode operation with the top two items on the stack (either STORE_SUBSCR or MAP_ADD. This doesn't count as a 'function call' as it's all internally handled in the interpreter loop.

This is different from a list comprehension, where the plain loop version would have to use list.append(), involving an attribute lookup, and a function call each loop iteration. The list comprehension speed advantage comes from this difference; see Python list comprehension expensive

What a dict comprehension does add, is that the target dictionary name only needs to be looked up once, when binding b to the the final dictionary object. If the target dictionary is a global instead of a local variable, the comprehension wins, hands down:

>>> namespace = {} >>> count, total = timeit.Timer(looped, 'from __main__ import a; global b', globals=namespace).autorange() >>> (total / count) * 1000000 76.72348440100905 >>> count, total = timeit.Timer(dictcomp, 'from __main__ import a; global b', globals=namespace).autorange() >>> (total / count) * 1000000 64.72114819916897 >>> len(namespace['b']) 1000 

So just use a dict comprehension. The difference with < 30 elements to process is too small to care about, and the moment you are generating a global or have more items, the dict comprehension wins out anyway.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: