Python multiprocessing performance only improves with the square root of the number of cores used


I am attempting to implement multiprocessing in Python (Windows Server 2012) and am having trouble achieving the degree of performance improvement that I expect. In particular, for a set of tasks which are almost entirely independent, I would expect a linear improvement with additional cores.


I understand that--especially on Windows--there is overhead involved in opening new processes [1], and that many quirks of the underlying code can get in the way of a clean trend. But in theory the trend should ultimately still be close to linear for a fully parallelized task [2]; or perhaps logistic if I were dealing with a partially serial task [3].

However, when I run multiprocessing.Pool on a prime-checking test function (code below), I get a nearly perfect square-root relationship up to N_cores=36 (the number of physical cores on my server) before the expected performance hit when I get into the additional logical cores.


Here is a plot of my performance test results: [ plot: Normalized Performance vs. number of CPU-cores ]
( "Normalized Performance" is [ a run time with 1 CPU-core ] divided by [ a run time with N CPU-cores ] ).

Is it normal to have this dramatic diminishing of returns with multiprocessing? Or am I missing something with my implementation?


import numpy as np
from multiprocessing import Pool, cpu_count, Manager
import math as m
from functools import partial
from time import time

def check_prime(num):

    #Assert positive integer value
    if num!=m.floor(num) or num<1:
        print("Input must be a positive integer")
        return None

    #Check divisibility for all possible factors
    prime = True
    for i in range(2,num):
        if num%i==0: prime=False
    return prime

def cp_worker(num, L):
    prime = check_prime(num)
    L.append((num, prime))


def mp_primes(omag, mp=cpu_count()):
    with Manager() as manager:
        np.random.seed(0)
        numlist = np.random.randint(10**omag, 10**(omag+1), 100)

        L = manager.list()
        cp_worker_ptl = partial(cp_worker, L=L)

        try:
            pool = Pool(processes=mp)
            list(pool.imap(cp_worker_ptl, numlist))
        except Exception as e:
            print(e)
        finally:
            pool.close() # no more tasks
            pool.join()

        return L


if __name__ == '__main__':
    rt = []
    for i in range(cpu_count()):
        t0 = time()
        mp_result = mp_primes(6, mp=i+1)
        t1 = time()
        rt.append(t1-t0)
        print("Using %i core(s), run time is %.2fs" % (i+1, rt[-1]))
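For completeness, a minimal sketch ( not part of the original script; it assumes the rt list collected by the code above ) of how the plotted "Normalized Performance" is derived from the recorded run times:

# a minimal sketch, assuming the rt list collected by the script above:
# Normalized Performance = ( run time with 1 CPU-core ) / ( run time with N CPU-cores )
norm_perf = [rt[0] / t_N for t_N in rt]
for n_cores, perf in enumerate(norm_perf, start=1):
    print("N_cores = %2i  ->  Normalized Performance = %.2fx" % (n_cores, perf))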

Note: I am aware that for this task it would likely be more efficient to implement multithreading, but the actual script for which this one is a simplified analog is incompatible with Python multithreading due to the GIL.


@KellanM deserved [+1] for quantitative performance monitoring

am I missing something with my implementation?

Yes, you abstract away all the add-on costs of process management.

While you have expressed an expectation of "a linear improvement with additional cores", this will hardly ever appear in practice, for several reasons ( even the hype of communism failed to deliver anything for free ).

Gene AMDAHL formulated the initial law of diminishing returns.
A more recent, re-formulated version also took into account the effects of process-management {setup|terminate} add-on overhead costs and tried to cope with the atomicity of processing ( given that large work-package payloads cannot easily be re-located / re-distributed over the available pool of free CPU-cores in most common programming systems, except for some indeed specific micro-scheduling art, like the one demonstrated in Semantic Designs' PARLANSE or LLNL's SISAL, which have shown it so colourfully in the past ).
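As a rough illustration ( a minimal sketch, not the exact re-formulation linked above; the parallel fraction p and the per-process overhead fractions are illustrative assumptions ), one may compare the classic, overhead-agnostic Amdahl speedup with an overhead-aware variant and watch how quickly the returns flatten:

# a minimal sketch: classic Amdahl's Law vs. an overhead-aware variant
# ( the overhead model and the numbers below are assumptions for illustration only )

def amdahl_speedup(p, N):
    """Classic, overhead-agnostic Amdahl's Law; p = parallelizable fraction."""
    return 1.0 / ((1.0 - p) + p / N)

def overhead_aware_speedup(p, N, o_setup=0.002, o_terminate=0.001):
    """Adds per-process setup / terminate costs, expressed as fractions
       of the single-core run time ( illustrative values )."""
    return 1.0 / ((1.0 - p) + p / N + N * (o_setup + o_terminate))

if __name__ == '__main__':
    p = 0.95                                   # assume 95 % of the work is parallelizable
    for N in (1, 2, 4, 8, 16, 36, 72):
        print("N = %2i   Amdahl: %6.2fx   overhead-aware: %6.2fx"
              % (N, amdahl_speedup(p, N), overhead_aware_speedup(p, N)))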


The best next step?

If one is indeed interested in this domain, one may always experimentally measure and compare the real costs of process management ( plus the data-flow costs, plus the memory-allocation costs, ... up until the process-termination and results re-assembly in the main process ), so as to quantitatively and fairly record and evaluate the add-on costs / benefit ratio of using more CPU-cores ( each of which will, in Python, have to re-instate the whole Python-interpreter state, including all of its memory state, before a first useful operation gets carried out in a first spawned and set-up process ).
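A minimal sketch of such a measurement ( the trivial payload() and the task count are illustrative only ) could separate the Pool setup / teardown time from the time spent on the useful work itself:

from multiprocessing import Pool, cpu_count
from time import time

def payload(x):
    # a trivial stand-in for the real work-package
    return x * x

if __name__ == '__main__':
    for n in (1, 2, 4, cpu_count()):
        t0 = time()
        pool = Pool(processes=n)                    # process spawn + interpreter re-instantiation costs
        t1 = time()
        results = pool.map(payload, range(10000))   # the "useful" work + data-flow costs
        t2 = time()
        pool.close(); pool.join()                   # termination costs
        t3 = time()
        print("N = %2i   setup %.3fs   work %.3fs   teardown %.3fs"
              % (n, t1 - t0, t2 - t1, t3 - t2))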

Underperformance ( in the former case below ), if not disastrous effects ( in the latter case below ), of either kind of ill-engineered resources-mapping policy, be it
an "under-booking" of resources from a pool of CPU-cores
or
an "over-booking" of resources from a pool of RAM-space,
are discussed also here.

The link to the re-formulated Amdahl's Law above will help you evaluate the point of diminishing returns, so as not to pay more than you will ever receive.
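A minimal, self-contained sketch ( re-using the same illustrative overhead model as in the sketch above; p and o are assumptions, not measured values ) of how one could numerically locate such a point:

# a minimal sketch: locating the point of diminishing returns for an
# assumed parallel fraction p and an illustrative per-process overhead o
def speedup(p, N, o=0.003):
    return 1.0 / ((1.0 - p) + p / N + N * o)

if __name__ == '__main__':
    p = 0.95
    best_N = max(range(1, 129), key=lambda N: speedup(p, N))
    # for this simple model the optimum sits near sqrt( p / o )
    print("Speedup peaks at N = %i CPU-cores ( ~%.2fx );"
          " beyond that point you pay more than you receive."
          % (best_N, speedup(p, best_N)))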

The experiments of Hoefinger and Haunschmid may serve as good practical evidence of how a growing number of processing nodes ( be it a local O/S-managed CPU-core or a node of a NUMA-distributed architecture ) will start to decrease the resulting performance, where the point of diminishing returns ( demonstrated in the overhead-agnostic Amdahl's Law ) actually becomes the point after which you pay more than you receive.

Good luck on this interesting field!


Last, but not least,

NUMA / non-locality issues get their voice heard in any discussion of scaling for HPC-grade tuned ( in-Cache / in-RAM computing ) strategies, and may, as a side-effect, help detect flaws ( as reported by @eryksun above ). One may feel free to review one's platform's actual NUMA-topology by using the lstopo tool, so as to see the abstraction that one's operating system is trying to work with when scheduling the "just"-[CONCURRENT] task execution over such a NUMA-resources-topology.
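A minimal sketch of such a review ( assuming the hwloc package, which provides the lstopo / lstopo-no-graphics tools, is installed on the platform ):

# a minimal sketch ( assumes hwloc's lstopo-no-graphics is installed ):
# print a text rendering of the NUMA / cache / CPU-core topology
import subprocess

if __name__ == '__main__':
    subprocess.run(["lstopo-no-graphics"])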
